Prevent duplicates during import
Can there be a function to prevent duplicates when importing items? I have 10k items in my library and removing duplicates manually is a hassle. There used to be an add-on (https://github.com/corajr/zotero-prevent-duplicates) that did just that, but it doesn't work with the current version of Zotero.
Even a simple warning of a possible duplicate (same logic as in the 'Duplicate Items' menu) would be of much help!
A popup warning of possible duplication when importing PDFs would save me both time and storage space.
Alternately, the ability to choose which attachments to keep when merging dupes would help.
And related: it would be nice if the merge duplicate function would recognize two identical attachments.
At present, merging simply moves all attachments into the single remaining entry. More often than not, this creates duplicate attachments within the same entry and saves no disk space.
Yes, a "hash collision" can occur, wherein completely different files *could* have the same hash, but it's exceedingly rare given the size of hash values. Comparing the hashes of the files AND other metadata that can be pulled from both further decreases the chance of a false match.
Just me thinking out loud...
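To make the hash idea above concrete, here is a minimal sketch in Python. It is not how Zotero handles attachments; the function names `file_digest` and `find_identical_files` are made up for illustration, and it only assumes you have a list of attachment file paths. It groups files whose size AND SHA-256 digest both match, which is the "hash plus other metadata" comparison described above.

```python
import hashlib
from collections import defaultdict
from pathlib import Path


def file_digest(path, chunk_size=1 << 20):
    """Return the SHA-256 hex digest of a file, reading it in chunks."""
    h = hashlib.sha256()
    with open(path, "rb") as f:
        for chunk in iter(lambda: f.read(chunk_size), b""):
            h.update(chunk)
    return h.hexdigest()


def find_identical_files(paths):
    """Group paths whose file size and SHA-256 digest both match.

    Hypothetical helper: returns a list of groups, each holding
    two or more paths that point to byte-identical files.
    """
    groups = defaultdict(list)
    for p in map(Path, paths):
        key = (p.stat().st_size, file_digest(p))
        groups[key].append(p)
    # Only groups with more than one member are duplicates.
    return [g for g in groups.values() if len(g) > 1]
```

Each returned group could then be offered to the user as "keep one, drop the rest" when merging.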
In any case, see the GitHub issue linked above for a discussion of the problems. Your suggestion is more relevant to duplicate detection after the item is in the library, which is a separate feature that already exists.
I love Zotero. I use Zotero every day. +1 for this feature request.
Re: Rejecting a 'duplicate' because the PMID is already in your Zotero database.
PubMed will update an ahead-of-print record when the publisher later provides volume, issue, and pagination metadata -- PubMed doesn't change the PMID. If, as you suggest, downloads are pre-screened for duplicate PMIDs then the updated metadata might not make it to your Zotero record.
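One way around this, sketched in Python under stated assumptions: instead of rejecting a record whose PMID is already present, compare it field by field with the stored one and surface what changed. The function `changed_fields` and the dictionary layout are hypothetical, not Zotero's data model, and the PMID and metadata values below are placeholders.

```python
def changed_fields(existing, incoming):
    """Return {field: (old, new)} for fields where the incoming record
    has a non-empty value that differs from the existing record."""
    changes = {}
    for field, new_value in incoming.items():
        if new_value in (None, ""):
            continue  # the incoming record has nothing to offer here
        old_value = existing.get(field)
        if old_value != new_value:
            changes[field] = (old_value, new_value)
    return changes


# Placeholder records: same PMID, but the later download carries
# volume and page metadata that the ahead-of-print record lacked.
existing = {"PMID": "00000000", "title": "An example article",
            "volume": "", "pages": ""}
incoming = {"PMID": "00000000", "title": "An example article",
            "volume": "12", "pages": "34-56"}

for field, (old, new) in changed_fields(existing, incoming).items():
    print(f"{field}: {old!r} -> {new!r}")
```

With something like this, a matching PMID would trigger an "update existing record?" prompt rather than a silent rejection.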
Also-
It is not common but certainly not very rare for the same journal article to have different DOIs or PMIDs. Recent examples:
10.1016/j.rgmx.2021.01.012 and 10.1016/j.rgmxen.2021.01.002
10.12688/f1000research.51209.1 and 10.12688/f1000research.51209.2
I can't immediately provide examples of duplicate articles in PubMed with different PMIDs, but these too are rather common, especially within the first days to weeks of the record's appearance in the database. These duplicates with different PMIDs can linger for several months. Then one of the duplicate articles is kept and the other(s) vanish (a PMID search on the deleted records will not be forwarded to the kept record).
Does PubMed maintain a table of the many-to-one relationships of PMIDs to journal articles?
Alas, no. But PMIDs are not reused. In my experience, PMIDs are substantially more likely to be duplicated than DOIs are. This is because of the way that even well-established publishers provide metadata to the NLM. Some publishers are not very good at "following the rules" for providing unduplicated article metadata to indexers via FTP or their API. The PubMed system is able to detect many of the duplicates, but not all, especially with ahead-of-print articles by authors who have names such as Nicholas Smith, Nick Smith, N. Smith, etc. Many publishers' author metadata does not fully agree with the authors' full names as "printed" on the final printed page or PDF.
edit: I mentioned the name problem because I found that it is one of the main reasons that new PMIDs are assigned to duplicate articles. Also, essentially every article with multiple DOIs is at least briefly assigned multiple PMIDs but this is generally fixed right away.
Any update?
When there are thousands of duplicates created during import, is it possible to merge them automatically?
I would love to have the ability to remove duplicates on import.
At least a warning would be really helpful.
Very much needed.
I wrote a plugin that warns the user when a duplicate item is imported.
Please see https://github.com/northword/zotero-format-metadata#duplication-check
---
Also, a reminder: the plugin warns after the item has already been saved to the library, and the next step is to go to merge. Part of this thread discusses alerting the user before items are saved to the library, which the plugin does not yet implement.
It is not clear to me what the best way is for Zotero to behave when it detects a duplicate item being added. But the choice to go to the Merge window is not necessarily a bad thing. It helped me quickly evaluate which collections the items existed in. I could manually delete the new item there, too. Merge is also useful if you want to add new attachments to the old item, or update data.
Wouldn't it be possible to warn when a duplicate PMID or DOI is detected (before adding via the Connector), and then show the metadata of the existing entry and of the one about to be added side by side?
In general, what I am really missing is an immediate visual indication on the Zotero Connector icon in the browser toolbar that "something with the same identifier is already in your library".
That with the PMID/DOI based warning + side-by-side metadata would be ideal, I guess.
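A rough sketch of what that identifier check could look like, in Python. This is not the Connector's code or any Zotero API; every function name here (`normalize_doi`, `identifier_keys`, `build_index`, `find_existing`, `side_by_side`) is hypothetical, and items are assumed to be plain dictionaries with optional `DOI` and `PMID` keys.

```python
import re


def normalize_doi(doi):
    """Lowercase a DOI and strip a leading resolver URL or 'doi:' prefix."""
    doi = doi.strip().lower()
    return re.sub(r"^(https?://(dx\.)?doi\.org/|doi:\s*)", "", doi)


def identifier_keys(item):
    """Return the set of (kind, value) identifier keys an item carries."""
    keys = set()
    if item.get("DOI"):
        keys.add(("doi", normalize_doi(item["DOI"])))
    if item.get("PMID"):
        keys.add(("pmid", str(item["PMID"]).strip()))
    return keys


def build_index(library):
    """Map each identifier key to the library items that carry it."""
    index = {}
    for item in library:
        for key in identifier_keys(item):
            index.setdefault(key, []).append(item)
    return index


def find_existing(index, incoming):
    """Return library items sharing a DOI or PMID with the incoming item."""
    matches = []
    for key in identifier_keys(incoming):
        for item in index.get(key, []):
            if item not in matches:
                matches.append(item)
    return matches


def side_by_side(existing, incoming):
    """Return (field, existing value, incoming value) rows for display."""
    fields = sorted(set(existing) | set(incoming))
    return [(f, existing.get(f, ""), incoming.get(f, "")) for f in fields]


# Titles are placeholders; the DOI is one quoted earlier in this thread.
library = [{"DOI": "10.1016/j.rgmx.2021.01.012", "title": "Placeholder title"}]
index = build_index(library)
incoming = {"DOI": "https://doi.org/10.1016/J.RGMX.2021.01.012",
            "title": "Placeholder title", "volume": "1"}

for match in find_existing(index, incoming):
    for row in side_by_side(match, incoming):
        print(row)
```

Note that an exact identifier match like this would miss the cases raised earlier in the thread, where the same article carries two different DOIs (10.1016/j.rgmx.2021.01.012 and 10.1016/j.rgmxen.2021.01.002), so it could only ever be a warning and not a hard block.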
I would find this feature very useful as well, it would be a very significant time saver for some of the work I do using Zotero.
I'm not the author of that post, but he says he is actively using it (and the post is from Feb 2023).
Thought it'd be a good idea to post this here, as there seems to have been great interest in this feature for years and the linked post has not gotten much attention yet.