marking non-duplicates
I have just started using Zotero seriously (even though I have had it installed since its beginnings and followed its development). One problem that I seem to have with my data is that of "non-duplicates", i.e. items which Zotero thinks are the same when in fact they are not. The two examples I have in my data are: a journal article and a book by the same author with the same title, and two anonymous reviews of the same book in different journals.
Is there a way to tell Zotero these are not duplicates?
Direct-use Values of Non-Timber Forest Products...
Direct-use Values of Secondary Resources...
They also have one author in common. Is this the same issue as voynich?
- title
- author(s), editor(s), etc.
- volume
are the same.
If you want to have different volumes of the same encyclopedia in your library, at the moment, they will all show up as duplicates because the title is obviously the same.
Being able to define the criteria for qualifying as a duplicate myself would be perfect, but the above-mentioned change would already make the feature much more useful for me, and it is much easier to implement, I guess.
I agree the method Zotero uses is too limited (described here: https://www.zotero.org/support/duplicate_detection )
If you import http://ftp.math.utah.edu/pub/tex/bib/tugboat.html
you will find lots of duplicates, which are actually different installments of recurring columns.
This doesn't seem to have been fixed yet. Does anyone know a fix or a workaround?
Paul
“Alice in Wonderland.” Life, vol. 30, June 1951, pp. 85–87.
“Alice in Wonderland.” Library Journal, vol. 76, Aug. 1951, p. 1239.
Zotero marks these, and 10 more like them, as duplicates.
Is there any reason it would be a bad strategy for Zotero to require matching publication titles before marking articles as duplicates?
@paulvdh, here's the duplicate criteria:
"Zotero currently uses the the title, DOI, and ISBN fields to determine duplicates. If these fields match (or are absent), Zotero also compares the years of publication (if they are within a year of each other) and author/creator lists (if at least one author last name plus first initial matches) to determine duplicates. The algorithm will be improved in the future to incorporate other fields."
So only info in those fields would affect the duplicate evaluation.
Motivation: In a big library like mine, the Duplicate Items collection is perennially populated by at least 150 items that are non-duplicates, with the effect that true duplicates are really hard to find (especially since sorting by date makes pairs harder to spot and compare).
An option to mark a "false duplicate" as a "non-duplicate" would be a great relief for me as well.
So I end up having a lot of (wrong) items in the "Duplicate Items" section, to which I don't pay attention any more. This is a pity, and somewhat defeats the purpose of this good tool, whose aim is to "automatically find duplicates".
I'm not sure that "mark as non-duplicates" is a straightforward option, because I can't see how it would be implemented: would Zotero include a flag somewhere in each of the items saying that X is not a duplicate of Y for each case? Then it might have to store thousands of these flags... I think the most elegant solution is to give the user the ability to define which fields (and in which order) Zotero should look at to decide whether two items are duplicates. Then the rules are fixed, and Zotero can rebuild the Duplicate Items collection as many times as necessary.
I hope this makes sense. Thanks for this great software!
Perhaps an interface similar to the conditions for saved searches that would allow adjusting the criteria for the Duplicate Items collection could be a better solution. If all criteria were listed with checkboxes, you could tighten or relax the conditions for the duplicate items detection. It wouldn't be a problem to add new conditions that could be switched off by default, e.g., "Not duplicate if different item type", "Not duplicate if different year".
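A minimal sketch of that checkbox idea, continuing the Python illustration from above: each toggleable condition is a predicate that can veto a duplicate match. All names here (`CONDITIONS`, `passes_conditions`, the field names) are invented for this sketch, not part of Zotero.

```python
# Hypothetical sketch of user-configurable duplicate conditions -- names invented.
from typing import Callable, Dict, Set

# Each condition returns True if the pair is still allowed to count as a duplicate.
CONDITIONS: Dict[str, Callable[[dict, dict], bool]] = {
    "same_item_type": lambda a, b: a.get("itemType") == b.get("itemType"),
    "same_year": lambda a, b: a.get("year") == b.get("year"),
    "same_publication": lambda a, b: a.get("publicationTitle") == b.get("publicationTitle"),
}

def passes_conditions(a: dict, b: dict, enabled: Set[str]) -> bool:
    """Return False if any enabled (checked) condition rules the pair out."""
    return all(CONDITIONS[name](a, b) for name in enabled)
```

With "same_publication" checked, the two book reviews in different journals discussed above would no longer be flagged, while a checkbox left unchecked keeps today's looser behaviour.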
I'm not sure why the detection mechanism is designed this way. I raised this issue in 2019, and it still persists, even though changing the detection mechanism to handle it seems trivial to me.
The reason I come back to this issue is that I found a plugin that can merge the items in the duplicates folder in batch, but the existence of false duplicates, especially conference papers with different titles, makes the job quite cumbersome.
In my case, I occasionally do literature reviews on different topics, and hence I need to import the relevant papers at different times. Often, some duplicates get imported. If the false-duplicates issue could be solved, that would save me a lot of time manually cleaning up the library, or spare me from having different versions of the same paper.
I regard this as a fundamental function of database-management software, because importing data is the beginning of everything. If possible, I hope the team would consider giving this issue higher priority.
The first seems to hit the #papercut sweet spot of being both easy to implement and solving a fairly big, recurring annoyance, at least for power users!
I've not seen anything on improving the detection algorithm (or allowing some level of customization), but that doesn't mean no one is thinking about it; it just means there isn't any public code.