Prevent duplicates during import

Can there be a function to prevent duplicates when importing items? I have 10k items in my library and removing duplicates manually is a hassle. There used to be an add-on (https://github.com/corajr/zotero-prevent-duplicates) that did just that, but it doesn't work with the current version of Zotero.

  • This is a great idea, because it would be clearer and more convenient than the list of mostly old false positives in the "duplicates" screen.
  • There's a longstanding ticket for something along these lines (https://github.com/zotero/zotero/issues/1007), so it's planned, but it turns out to be trickier than it seems (the prevent-duplicates add-ons only ever worked for the version of Zotero that ran entirely inside Firefox, which made this easier).
  • Yes, yes, yes! This would be very beneficial. Even a little pop-up box that appears when Zotero finds an item with a similar title and asks, "Is this a duplicate or not?" would help.
  • +1
    Even a simple warning of a possible duplicate (same logic as in the 'Duplicate Items' menu) would be of much help!
  • +1 please at least a warning!

    A popup warning of possible duplication when importing PDFs would save me both time and storage space.

    Alternately, the ability to choose which attachments to keep when merging dupes would help.
  • edited December 2, 2021
    YES please.
    And related: it would be nice if the merge duplicates function recognized two identical attachments.
    At present, merging simply moves all attachments into the single entry. More often than not, this creates duplicate attachments within the same entry and saves no disk space.
  • This shouldn't be particularly difficult to implement if one compares the hashes of the files already in one's Zotero library with the hash of the file being added.

    Yes, it's possible that "hash collision" can occur wherein completely different files *could* have the same hash, but it's exceedingly rare given the size of hash values. Comparing the hashes of the files AND other metadata that can be pulled from both really decreases the chances of collision.

    Just me thinking out loud...
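
A minimal sketch of the hash comparison suggested above, assuming both the incoming file and the library files are present on local disk. The function names `file_sha256` and `find_existing_copy` are hypothetical and are not part of Zotero. This only detects byte-identical files, so it would not catch copies that differ in any way.

```python
import hashlib
from pathlib import Path
from typing import Iterable, Optional


def file_sha256(path: Path, chunk_size: int = 1 << 20) -> str:
    """Return the SHA-256 hex digest of a file, reading it in chunks."""
    digest = hashlib.sha256()
    with path.open("rb") as handle:
        for chunk in iter(lambda: handle.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def find_existing_copy(candidate: Path, library_files: Iterable[Path]) -> Optional[Path]:
    """Return the first library file with the same bytes as candidate, else None."""
    candidate_size = candidate.stat().st_size
    candidate_hash = None  # computed lazily, only if some file has the same size
    for existing in library_files:
        # Cheap check first: files of different sizes cannot be identical.
        if existing.stat().st_size != candidate_size:
            continue
        if candidate_hash is None:
            candidate_hash = file_sha256(candidate)
        if file_sha256(existing) == candidate_hash:
            return existing
    return None
```

Comparing sizes before hashing avoids reading most library files at all. A real implementation would probably cache the library hashes rather than recompute them on every import.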
  • @tpcarter: 1) This would be based on data, not files, which aren't necessarily present. 2) Gated files are often watermarked, so hashes wouldn't match.

    In any case, see the GitHub issue linked above for a discussion of the problems. Your suggestion is more relevant to duplicate detection after the item is in the library, which is a separate feature that already exists.
  • Duplicating references on import is frustrating, uses storage capacity unnecessarily, denormalizes the Zotero database, and probably costs users' organizations a ton of money. Even a partial solution, such as checking the PMID of the reference to be imported against PMIDs already in the database, would go a long way toward solving what I see as the weakest feature of Zotero.

    I love Zotero. I use Zotero every day. +1 for this feature request.
  • edited January 3, 2022
    @rphair
    Re: Rejecting a 'duplicate' because the PMID is already in your Zotero database.

    PubMed will update an ahead-of-print record when the publisher later provides volume, issue, and pagination metadata -- PubMed doesn't change the PMID. If, as you suggest, downloads are pre-screened for duplicate PMIDs then the updated metadata might not make it to your Zotero record.

    Also:
    It is uncommon, but certainly not rare, for the same journal article to have different DOIs or PMIDs. Recent examples:

    10.1016/j.rgmx.2021.01.012 and 10.1016/j.rgmxen.2021.01.002

    10.12688/f1000research.51209.1 and 10.12688/f1000research.51209.2

    I can't immediately provide examples of duplicate articles in PubMed with different PMIDs, but these too are rather common, especially within the first days to weeks of the record's appearance in the database. These duplicates with different PMIDs can linger for several months. Then one of the duplicate articles is kept and the other(s) vanish (a PMID search on the deleted records will not be forwarded to the kept record).

  • @DWL-SDCA Thanks for this insight. I have seen, just as you say, cases of multiple DOIs for the same journal article, but I had no idea that PMIDs were similarly fragile. Thanks for the heads-up.

    Does PubMed maintain a table of the many-to-one relationships of PMIDs to journal articles?
  • edited January 4, 2022
    "Does PubMed maintain a table of the many-to-one relationships of PMIDs to journal articles?"

    Alas, no. But PMIDs are not reused. In my experience, PMIDs are substantially more likely to be duplicated than DOIs are. This is because of the way that even well-established publishers provide metadata to the NLM. Some publishers are not very good at "following the rules" for providing unduplicated article metadata to indexers, whether via FTP or their API. The PubMed system is able to detect many of the duplicates, but not all, especially with ahead-of-print articles by authors who have names such as Nicholas Smith, Nick Smith, N. Smith, etc. Many publishers' author metadata does not fully agree with the authors' full names as "printed" on the final printed page or PDF.

    edit: I mentioned the name problem because I found that it is one of the main reasons that new PMIDs are assigned to duplicate articles. Also, essentially every article with multiple DOIs is at least briefly assigned multiple PMIDs but this is generally fixed right away.
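
The identifier check discussed in the posts above could be sketched as a warning rather than a rejection. This is a sketch under assumptions: records are plain dictionaries with optional `doi` and `pmid` keys, and `normalize_doi` and `possible_duplicates` are hypothetical names, not Zotero functions. Because PubMed can update a record without changing its PMID, a match here should prompt the user instead of discarding the incoming record.

```python
import re
from typing import Dict, List, Optional


def normalize_doi(raw: Optional[str]) -> Optional[str]:
    """Lower-case a DOI and strip a leading resolver URL or 'doi:' prefix."""
    if not raw:
        return None
    doi = raw.strip().lower()
    doi = re.sub(r"^(https?://(dx\.)?doi\.org/|doi:\s*)", "", doi)
    return doi or None


def possible_duplicates(
    incoming: Dict[str, str], library: List[Dict[str, str]]
) -> List[Dict[str, str]]:
    """Return library records that share a DOI or PMID with the incoming record."""
    doi = normalize_doi(incoming.get("doi"))
    pmid = (incoming.get("pmid") or "").strip()
    matches = []
    for record in library:
        same_doi = doi is not None and normalize_doi(record.get("doi")) == doi
        same_pmid = pmid != "" and (record.get("pmid") or "").strip() == pmid
        if same_doi or same_pmid:
            matches.append(record)  # candidate only: the user decides what to do
    return matches
```

As the examples in this thread show, the same article can carry two different DOIs (10.1016/j.rgmx.2021.01.012 and 10.1016/j.rgmxen.2021.01.002), so a check like this misses some true duplicates. It is a partial solution at best.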
  • +1 for preventing duplicates during import.
    Any update?
    When there are thousands of duplicates created during import, is it possible to merge them automatically?
  • No update I'm aware of.
    "When there are thousands of duplicates created during import, is it possible to merge them automatically?"
    Not in Zotero itself, no. If these are duplicates because of repeated import, sorting by date added and deleting the repeated batch might be an option. Otherwise, this tool was just introduced in another thread and looks like it'd do this: https://forums.zotero.org/discussion/95058/zoterotide-cleanup-and-reporting-app-for-zotero#latest -- it relies on the web API, so you need to be synced. There's also some JavaScript code in a thread that basically just clicks "Merge" in the duplicate view 100 times that you might find helpful.
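
For the repeated-import case, the "sort by date added" approach can be sketched on exported item data. This assumes each item is a dictionary with a `dateAdded` field holding a UTC timestamp like `2022-01-04T09:30:00Z`. That field name and format are assumptions about the exported data, and `added_between` is a hypothetical helper. The sketch only selects candidates for review and deletes nothing.

```python
from datetime import datetime, timezone
from typing import Dict, List


def parse_timestamp(value: str) -> datetime:
    """Parse a UTC timestamp such as 2022-01-04T09:30:00Z."""
    return datetime.strptime(value, "%Y-%m-%dT%H:%M:%SZ").replace(tzinfo=timezone.utc)


def added_between(items: List[Dict[str, str]], start: str, end: str) -> List[Dict[str, str]]:
    """Return items whose dateAdded falls within [start, end], oldest first."""
    lo, hi = parse_timestamp(start), parse_timestamp(end)
    selected = [
        item for item in items
        if lo <= parse_timestamp(item["dateAdded"]) <= hi
    ]
    return sorted(selected, key=lambda item: parse_timestamp(item["dateAdded"]))
```

Passing the start and end time of the second import returns the batch it created, which can then be reviewed before anything is removed.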