Prevent duplicates during import

Can there be a function to prevent duplicates when importing items? I have 10k items in my library and removing duplicates manually is a hassle. There used to be an add-on (https://github.com/corajr/zotero-prevent-duplicates) that did just that, but it doesn't work with the current version of Zotero.

  • This is a great idea, because it would be clearer and more convenient than the list of mostly old false positives in the "duplicates" screen.
  • There's a longstanding ticket for something along these lines (https://github.com/zotero/zotero/issues/1007), so it's planned, but it turns out to be trickier than it seems (the prevent-duplicates add-ons only ever worked for the version of Zotero that ran entirely inside Firefox, which made this easier).
  • Yes, yes, yes! This would be very beneficial. Even if Zotero could have a little pop-up box when it finds an item with a similar title and would ask, "Is this a duplicate or not?"
  • +1
    Even a simple warning of a possible duplicate (same logic as in the 'Duplicate Items' menu) would be of much help!
  • +1 please at least a warning!

    A popup warning of possible duplication when importing PDFs would save me both time and storage space.

    Alternately, the ability to choose which attachments to keep when merging dupes would help.
  • edited December 2, 2021
    YES please.
    And related: it would be nice if the merge duplicate function would recognize two identical attachments.
    At present, merging will simply merge all attachments into the single entry. More often than not, this creates duplicate attachments within the same entry, which saves no disk space.
  • This shouldn't be particularly difficult to implement if one compares the hashes of the files presently in one's Zotero library with the hash of the file being added.

    Yes, it's possible that "hash collision" can occur wherein completely different files *could* have the same hash, but it's exceedingly rare given the size of hash values. Comparing the hashes of the files AND other metadata that can be pulled from both really decreases the chances of collision.

    Just me thinking out loud...
  • @tpcarter: 1) This would be based on data, not files, which aren't necessarily present. 2) Gated files are often watermarked, so hashes wouldn't match.

    In any case, see the GitHub issue linked above for a discussion of the problems. Your suggestion is more relevant to duplicate detection after the item is in the library, which is a separate feature that already exists.
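
The hash comparison proposed above can be sketched as follows. This is a minimal illustration in Python, not how Zotero works; the function names `file_sha256` and `find_identical_files` are hypothetical. As noted in the reply, it only catches byte-identical files, so a watermarked copy of the same article would not match.

```python
import hashlib
from pathlib import Path


def file_sha256(path, chunk_size=1 << 20):
    """Return the SHA-256 hex digest of a file, reading it in chunks."""
    digest = hashlib.sha256()
    with open(path, "rb") as handle:
        for chunk in iter(lambda: handle.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def find_identical_files(new_file, existing_files):
    """Return the existing files whose bytes are identical to new_file.

    Hypothetical helper: the caller supplies the paths of the files
    already stored; nothing here talks to Zotero itself.
    """
    new_path = Path(new_file)
    new_size = new_path.stat().st_size
    new_hash = None
    matches = []
    for candidate in map(Path, existing_files):
        # Cheap pre-filter: files of different sizes cannot be identical.
        if candidate.stat().st_size != new_size:
            continue
        if new_hash is None:
            new_hash = file_sha256(new_path)
        if file_sha256(candidate) == new_hash:
            matches.append(candidate)
    return matches
```

Calling `find_identical_files("new.pdf", stored_paths)` returns the stored paths with exactly the same content. Because this compares files rather than item metadata, it fits duplicate detection for attachments already on disk better than a pre-import check.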
  • Duplicating references on import is frustrating, uses storage capacity unnecessarily, denormalizes the Zotero database, and probably costs users' organizations a ton of money. Even a partial solution, such as checking the PMID of the reference to be imported against PMIDs already in the database, would go a long way toward solving what I see as the weakest feature of Zotero.

    I love Zotero. I use Zotero every day. +1 for this feature request.
  • edited January 3, 2022
    @rphair
    Re: Rejecting a 'duplicate' because the PMID is already in your Zotero database.

    PubMed will update an ahead-of-print record when the publisher later provides volume, issue, and pagination metadata -- PubMed doesn't change the PMID. If, as you suggest, downloads are pre-screened for duplicate PMIDs then the updated metadata might not make it to your Zotero record.

    Also-
    It is not common but certainly not very rare for the same journal article to have different DOIs or PMIDs. Recent examples:

    10.1016/j.rgmx.2021.01.012 and 10.1016/j.rgmxen.2021.01.002

    10.12688/f1000research.51209.1 and 10.12688/f1000research.51209.2

    I can't immediately provide examples of duplicate articles in PubMed with different PMIDs, but these too are rather common, especially within the first days to weeks of the record's appearance in the database. These duplicates with different PMIDs can linger for several months. Then one of the duplicate articles is kept and the other(s) vanish (a PMID search on the deleted records will not be forwarded to the kept record).

  • @DWL-SDCA Thanks for this insight. I have seen, just as you say, cases of multiple DOIs for the same journal article, but I had no idea that PMIDs were similarly fragile. Thanks for the heads-up.

    Does PubMed maintain a table of the many-to-one relationships of PMIDs to journal articles?
  • edited January 4, 2022
    > Does PubMed maintain a table of the many-to-one relationships of PMIDs to journal articles?

    Alas, no. But PMIDs are not reused. In my experience, PMIDs are substantially more likely to be duplicated than DOIs are. This is because of the way that even well-established publishers provide metadata to the NLM. Some publishers are not very good at "following the rules" for providing unduplicated article metadata to indexers, whether via FTP or their API. The PubMed system is able to detect many of the duplicates, but not all, especially with ahead-of-print articles by authors who have names such as Nicholas Smith, Nick Smith, N. Smith, etc. Many publishers' author metadata does not fully agree with the authors' full names as "printed" on the final printed page or PDF.

    edit: I mentioned the name problem because I found that it is one of the main reasons that new PMIDs are assigned to duplicate articles. Also, essentially every article with multiple DOIs is at least briefly assigned multiple PMIDs but this is generally fixed right away.
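
The PMID/DOI check discussed in the posts above might look roughly like this. It is a sketch under stated assumptions: items are plain dictionaries with hypothetical `doi` and `pmid` keys (not Zotero's actual schema), and all function names are made up for illustration. Following the objections raised, it warns and lists the fields that differ instead of rejecting the incoming record, so updated ahead-of-print metadata is not silently lost.

```python
from collections import defaultdict

DOI_PREFIXES = (
    "https://doi.org/",
    "http://doi.org/",
    "https://dx.doi.org/",
    "http://dx.doi.org/",
    "doi:",
)


def normalize_doi(doi):
    """Lower-case a DOI (DOIs are case-insensitive) and strip resolver prefixes."""
    value = doi.strip().lower()
    for prefix in DOI_PREFIXES:
        if value.startswith(prefix):
            value = value[len(prefix):]
            break
    return value.strip()


def identifier_keys(item):
    """Return the (kind, value) identifier keys present on an item."""
    keys = []
    if item.get("doi"):
        keys.append(("doi", normalize_doi(item["doi"])))
    if item.get("pmid"):
        keys.append(("pmid", str(item["pmid"]).strip()))
    return keys


def build_identifier_index(items):
    """Map each identifier key to the library items that carry it."""
    index = defaultdict(list)
    for item in items:
        for key in identifier_keys(item):
            index[key].append(item)
    return index


def check_incoming(incoming, index):
    """Return (existing_item, differing_fields) pairs for identifier matches."""
    warnings = []
    seen = set()
    for key in identifier_keys(incoming):
        for existing in index.get(key, []):
            if id(existing) in seen:
                continue  # already reported via another identifier
            seen.add(id(existing))
            differing = sorted(
                field for field in incoming
                if incoming[field] != existing.get(field)
            )
            warnings.append((existing, differing))
    return warnings
```

An incoming record whose PMID is already indexed is reported together with the fields that changed (for example a newly supplied volume), which could feed a side-by-side comparison. The same article under two different DOIs, such as 10.1016/j.rgmx.2021.01.012 and 10.1016/j.rgmxen.2021.01.002, produces no warning at all, which is the limitation described above.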
  • +1 for preventing duplicates during import.
    Any update?
    When there are thousands of duplicates created during import, is it possible to merge them automatically?
  • No update I'm aware of.
    > When there are thousands of duplicates created during import, is it possible to merge them automatically?
    Not in Zotero itself, no. If these are duplicates because of repeated import, sorting by date added and deleting might be an option. Otherwise, this tool was just introduced in another thread and looks like it'd do this: https://forums.zotero.org/discussion/95058/zoterotide-cleanup-and-reporting-app-for-zotero#latest -- it relies on the web API, so you need to be synced. There's also some JavaScript code in a thread that basically just clicks "Merge" in the duplicate view 100 times, which you might find helpful.
  • edited October 7, 2022
    +1

    I would love to have the ability to remove duplicates on import.
  • +1
    At least a warning would be really helpful.
  • +1
    Very much needed.
  • @sfdsfsdf, @davidsf, @janakact, @antanij, @jazow

    I wrote a plugin that warns the user when a duplicate item is imported.
    Please see https://github.com/northword/zotero-format-metadata#duplication-check
  • @northword Thanks a lot. "zotero-format-metadata.xpi"? I will download it now.
  • I see that v0.4.4 of that plugin is the version that should be used for Zotero 6. Later versions are for Zotero 7 (currently still in beta).
  • @tim820, yes, and version 0.4.4 does not yet support duplicate warnings.

    ---

    Also, a reminder: the plugin warns only after the item has already been saved to the library, and the next step is then to merge. Part of this thread discusses alerting the user before items are saved to the library, which the plugin does not yet implement.
  • Thanks, northword, for the plugin. I could easily turn off the metadata editing in the Preferences (which I did not need), and the duplicate item detection worked very well. This really helps using Zotero.

    It is not clear to me what the best way is for Zotero to behave when it detects a duplicate item being added. But the choice to go to the Merge window is not necessarily a bad thing. It helped me quickly evaluate which collections the items existed in. I could manually delete the new item there, too. Merge is also useful if you want to add new attachments to the old item, or update data.
  • @DWL-SDCA Re: Rejecting a 'duplicate' because the PMID is already in your Zotero database.

    Wouldn't it be possible to warn if a duplicate PMID or DOI is detected (before adding via the Connector) and then show the metadata of the existing entry and the one about to be added side by side?

    In general, what I am really missing is a visual, immediate indication on the Zotero Connector icon in the browser toolbar for "something with the same identifier is already in your library".

    That, combined with the PMID/DOI-based warning and side-by-side metadata, would be ideal, I guess.
  • > In general, what I am really missing is a visual, immediate indication on the icon of the Zotero connector icon in the browser toolbar for "something with the same identifier is already in your library".
    That's not possible, as discussed in the linked GitHub ticket. The Connector often doesn't know what you're saving until it performs additional requests after you click save. And doing this just for URLs (which itself would be hit-or-miss due to query parameters) would be misleading and confusing.
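
To illustrate why URL matching alone would be hit-or-miss, here is a small Python sketch. It assumes a hypothetical `normalize_url` helper and a made-up list of "tracking" parameter names; it has no connection to how the Connector actually works. Dropping known tracking parameters makes some variants of a URL compare equal, but any parameter the list does not know about still defeats the comparison.

```python
from urllib.parse import parse_qsl, urlencode, urlsplit, urlunsplit

# Assumed examples of parameters that do not change the page content.
TRACKING_PREFIXES = ("utm_",)
TRACKING_NAMES = {"fbclid", "gclid"}


def normalize_url(url):
    """Return a comparison key for a URL (hypothetical helper).

    Lower-cases scheme and host, drops the fragment and a trailing
    slash, removes assumed tracking parameters, and sorts the rest.
    """
    parts = urlsplit(url)
    kept = [
        (name, value)
        for name, value in parse_qsl(parts.query, keep_blank_values=True)
        if not name.lower().startswith(TRACKING_PREFIXES)
        and name.lower() not in TRACKING_NAMES
    ]
    kept.sort()
    path = parts.path.rstrip("/") or "/"
    return urlunsplit(
        (parts.scheme.lower(), parts.netloc.lower(), path, urlencode(kept), "")
    )
```

With this, `https://example.org/view?id=1&utm_medium=email` and `https://example.org/view?id=1` compare equal, but `https://example.org/view?id=1&session=abc` does not, even if it shows the same article. A warning based on such keys would miss some duplicates and could not be relied on.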
  • +1
    I would find this feature very useful as well; it would be a very significant time saver for some of the work I do using Zotero.
  • For those using Zotero 6, it seems like there is a way to get this with reasonable effort: https://github.com/zotero/zotero/issues/1007#issuecomment-1432300453
    I'm not the author of that post, but its author states that they are actively using it (and the post is from Feb 2023).

    Thought it'd be a good idea to post this here, as there seems to have been a lot of interest in this feature for years and the linked post has not gotten much attention yet.