False duplicates in "Duplicate Items", especially for the item type "book"
Hello, I've noticed that in "Duplicate Items", plenty of journal articles that are clearly not duplicates (with different titles and authors) are marked as duplicates. From some previous threads, it seems that this is more likely to happen when the items have the type "book". I checked their item type, and indeed most of these duplicates are "book".
This is probably because some of them are published as chapters or articles in books, such as a collection of conference papers. Most of the references were imported from Web of Science (a small portion from Google Scholar). I don't know why they are regarded as books, but they are clearly different items according to their titles or authors. I wonder if this is due to wrong identification of the item types or to the way items are identified as duplicates? This does not seem to be a very difficult issue to fix. BTW, the latest version, 5.0.76, has the same problem.
Is there any solution for this? This is an important function, allowing users to import records without having to manually identify potential duplicates.
Finally, I also support the suggestion by some others to allow users to mark some records as non-duplicates.
Thanks a lot, and my best wishes to Zotero, the best reference management software I have used so far.
There's an open ticket for marking items as non-duplicates, so there's general agreement that it's a good idea.
It seems that this problem was brought up several years ago by several different users and still hasn't been solved. A fix would be very helpful as the database grows bigger and bigger and one has to identify duplicates from multiple imports from different sources.
If the option of marking items as non-duplicates becomes available, it at least offers a way to manually tackle this issue. Thanks for the information.
Actually, I regard dealing with duplicates as a very fundamental requirement for many researchers, and it comes up very often: each time one works on a different research topic, one may look up references in, e.g., Web of Science and pick around 40 references to import in batch. It's then highly likely that some of the references are already in Zotero, imported earlier when dealing with related topics. That's when duplicates occur.
Besides, it doesn't seem a difficult technical issue to rewrite the conditions for identifying duplicates. I don't understand why this issue has lasted for years. It has caused me a big headache since I started to use Zotero as my primary reference management software. I hope this simple yet very annoying bug can be fixed in the future. Finally, many thanks to the contributors to this great software.
Hi, I wonder if there is any progress on this issue? I've seen replies in some other posts describing the difficulty of implementing "marking items as non-duplicates". So, is it possible to handle the issue mentioned above by improving the detection mechanism so that it produces far fewer "false duplicates"?
The reason I'm coming back to this issue is that I found a plugin that can merge the items in the duplicates folder in batch, but the existence of false duplicates, especially conference papers with different titles, makes the job quite cumbersome. In my case, I occasionally do literature reviews on different topics, and hence I need to import the relevant papers at different times. Often, some duplicates get imported.
If the false duplicates issue can be solved, that would save me a lot of time manually cleaning up the library, or spare me from having different versions of the same paper. I believe this function is also needed by other students or researchers who regularly do literature surveys.
I'm not sure why some of these references (especially conference papers that have been published as book collections) are recognized as books; I'm not sure if there is anything wrong with the Zotero translator.
Correcting the data manually is definitely one potential solution, but wouldn't it make more sense to let the duplicate detector take care of this issue? After all, items with clearly different titles and authors are definitely different items, in my opinion.
Simply ignoring the problem leads to other issues down the line, such as inaccuracies in search reporting and lack of retrievability that harm the validity of the systematic review process.
As a best practice, it should be a routine part of your data import process to check items after import and clean up erroneous data. This is a core part of my systematic review process in Zotero.
It doesn't seem to me a difficult task to change the duplicate detector code so that it also compares titles, authors, etc. when the item type is "book". I also can't imagine that this change would affect other functions; however, since I'm not sure exactly how this mechanism works, I agree that such things could happen. If this is indeed the case, is it possible to offer users an option to choose how the software behaves in terms of duplicate detection, or to allow users to add customized rules?
"It should be a routine part of your data import process to check items after import and clean up erroneous data": this is exactly what I want to accomplish, but as I said, manually changing the item types one by one doesn't seem a promising option for me. For example, the next time I import the same data, those false duplicates are again recognized as "books", meaning that I have to change their item types one by one manually all over again.
Maybe the problem comes from the Zotero translator, but I'm really not familiar with it, so I'm not sure if this is the case.
While I fully agree with @bwiernik's and @adamsmith's comments above that, with proper information in a Zotero record, the false duplicate problem should go away, there are exceptions. These are problems caused by editors and publishers, even those with a stellar reputation. For example, JAMA will often publish a set of letters responding to an article, with replies by the original author(s) to each responding letter. It is not uncommon for two or more different reply letters by the original authors to appear on the same page. JAMA is not alone in its practice of assigning a single title and a single DOI to a series of multiple letters. It may be necessary to cite only one of the letters, so it makes sense to have a different Zotero record for each published letter. Items on the same page with the same title, author(s), and DOI will look like duplicates to an automated duplicate test. However, this test identifies only probable duplicates. There is currently no way to designate in the Zotero database that record number 12345 and record number 23456 are not duplicates.
I may be very wrong about this, but I suspect from Zotero developer comments on other threads that there are fundamental issues with the structure of the SQLite database that make marking non-duplicates impossible. [However, I might be mistaken, and this could be a value / philosophical judgment concerning the risk of falsely marking true duplicates as not duplicates. Citing two different Zotero records of the same item causes ugly problems with citation styles.]
I curate an online bibliographic database (currently 690,000 records). Of those, we have marked fewer than 1,000 as not-duplicate. We use a MySQL database and have tables for articles, reports, presentations, books and chapters, etc. Each of the tables has a field for "not a duplicate with record number". [In my case this is important only on the administrative side (not the public face) so that we can cope with true duplicates (see below).]
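To make the "not a duplicate with record number" idea concrete, here is a minimal sketch. It is not how Zotero or my own system actually stores this: it uses Python's built-in sqlite3 module instead of MySQL, a separate pair table instead of a per-table field, and the table, column, and function names are all hypothetical.

```python
import sqlite3


def open_db():
    """Create an in-memory database with a hypothetical not_duplicates table."""
    conn = sqlite3.connect(":memory:")
    conn.execute(
        "CREATE TABLE not_duplicates ("
        " id_low INTEGER NOT NULL,"
        " id_high INTEGER NOT NULL,"
        " PRIMARY KEY (id_low, id_high))"
    )
    return conn


def mark_not_duplicate(conn, a, b):
    """Record that records a and b are NOT duplicates (order does not matter)."""
    low, high = sorted((a, b))
    conn.execute(
        "INSERT OR IGNORE INTO not_duplicates (id_low, id_high) VALUES (?, ?)",
        (low, high),
    )


def is_marked_not_duplicate(conn, a, b):
    """Return True if the pair was previously marked as not duplicates."""
    low, high = sorted((a, b))
    row = conn.execute(
        "SELECT 1 FROM not_duplicates WHERE id_low = ? AND id_high = ?",
        (low, high),
    ).fetchone()
    return row is not None


def filter_candidates(conn, candidate_pairs):
    """Drop probable-duplicate pairs that a user already marked as distinct."""
    return [p for p in candidate_pairs if not is_marked_not_duplicate(conn, *p)]
```

With this layout, marking records 12345 and 23456 once keeps that pair out of every later list of probable duplicates, while the automated test itself stays unchanged.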
Setting up a process for potential-duplicate detection and marking non-duplicates was not onerous in my system. I suspect, given years of requests for the marking of false duplicates, that this is very difficult with the current Zotero database structure. Non-duplicate marking for "my" single online database is probably easier than for Zotero, which has to handle such marking for each of the hundreds of thousands of individuals who use their own Zotero database, each with its own record IDs. Yet, because Zotero records can be merged and citations to either pre-merge record then point to the single post-merge record, maybe it is not so difficult.
There are other examples of databases with duplicate-handling problems. PubMed handles the problem of true exact duplicates by arbitrarily deleting one or more of the records (even if the records have different publisher article numbers and different DOIs, a very common occurrence with some "predatory" publishers such as Frontiers and MDPI). This means that a PMID search for the deleted record will find nothing. (In the case of my database, a search for the deleted/merged old record ID will point the seeker to the good single record.) With Zotero one should merge true duplicates to avoid citation ugliness.
My apologies if this post is too long and too annoying.
1. Does Zotero sometimes erroneously identify properly input items as duplicates when they aren't (yes), and should it allow users to mark those as non-duplicates (yes, and there's been some work on that, but it's not an easy issue)? I don't think there's any need for further discussion of this, though. That said, it wouldn't solve the main problem posed by the OP here, because it'd still be a one-item-at-a-time operation, so not less work than fixing item types.
2. Could Zotero duplicate detection be improved, especially with an eye towards avoiding false positives? (probably, yes, but not clear how quickly we'll enter the land of false positive vs. false negative trade-offs)
3. Should Zotero duplicate detection be designed to avoid false positives for *incorrectly* input data (such as chapters with a book item type)? That's the main concern of OP here, and this would indeed be technologically fairly simple (tweaking the algorithm for duplicate detection isn't hard), but it will almost certainly have significant costs in terms of false negatives. Given how Zotero thinks about duplicates (since they're screened before merging, false positives are preferable to false negatives), I don't think this is super likely.
4. Should Zotero allow some customization for duplicate detection? That seems attractive on a number of levels, but significantly more complex technologically and causes a range of documentation & support-related issues, so no idea where Zotero devs fall on this.
Thanks for your comment. From my point of view, we don't have to make duplicate detection perfect in one shot; let us at least try to solve the issue I raised in this post, which researchers encounter very commonly. Making the duplicate detection algorithm for books compare titles and authors in addition to the ISBN would be a fairly safe approach, in my opinion.
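To illustrate what I mean, here is a rough sketch of such a rule. It is only an assumption about how a stricter check could look, not Zotero's actual detection code; the dictionary keys and function names are hypothetical, and it only covers the ISBN-based path (items without an ISBN would need other rules).

```python
import re
import unicodedata


def normalize(text):
    """Lowercase, strip accents and punctuation, and collapse whitespace."""
    text = unicodedata.normalize("NFKD", text or "")
    text = "".join(c for c in text if not unicodedata.combining(c)).lower()
    text = re.sub(r"[^a-z0-9 ]+", " ", text)
    return " ".join(text.split())


def clean_isbn(isbn):
    """Keep only digits and X so differently hyphenated ISBNs compare equal."""
    return re.sub(r"[^0-9Xx]", "", isbn or "").upper()


def likely_duplicate_books(a, b):
    """Treat two 'book' items as duplicates only if ISBN, title and authors agree."""
    isbn_a, isbn_b = clean_isbn(a.get("ISBN")), clean_isbn(b.get("ISBN"))
    if not isbn_a or isbn_a != isbn_b:
        return False
    # A shared ISBN alone is not enough: papers in one proceedings volume share it.
    if normalize(a.get("title")) != normalize(b.get("title")):
        return False
    authors_a = {normalize(n) for n in a.get("creators", [])}
    authors_b = {normalize(n) for n in b.get("creators", [])}
    # If both items list authors, require at least one author in common.
    if authors_a and authors_b and not (authors_a & authors_b):
        return False
    return True
```

Under this rule, two conference papers from the same proceedings that were imported as "book" with the same ISBN but different titles would no longer be paired, while a re-import of the same paper still would be.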
Once this issue is fixed, I will be able to resolve all the duplicates in batch using a plugin (of course, I will manually inspect all the items to make sure that there are no false positives left). In that case, Zotero will be able to offer performance similar to EndNote in terms of removing duplicates. I have been using Zotero as my main reference management software for years, and this is the issue that gives me the most trouble.
I just checked several false duplicates of conference papers (or book collections) recognized as books. I noticed that many of these "books" have something like "DOI, Pages, Publication Title" in their "Extra" field, and the "Publication Title" shows the conference name. So I suspect that this issue is more related to the Zotero translator, which fails to recognize the file format properly.
I believe that improving the translator would probably solve the issue to a large extent. For example, if an entry has a DOI field, it cannot be a book. Since the RIS/CIW files have many fields one can access, I believe a more sophisticated way can be figured out to handle this issue.
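As a rough sketch of the kind of check I have in mind (not the actual translator code; the function names and the rule itself are my assumptions), one could flag "book" items whose Extra field carries a DOI or a Publication Title. Some real books do have DOIs, so this should only flag items for review, not change them automatically.

```python
def parse_extra(extra):
    """Parse 'Key: Value' lines from an Extra field into a dict with lowercase keys."""
    fields = {}
    for line in (extra or "").splitlines():
        key, sep, value = line.partition(":")
        if sep and key.strip() and value.strip():
            fields[key.strip().lower()] = value.strip()
    return fields


def looks_like_misclassified_book(item):
    """Flag 'book' items whose Extra suggests a paper inside a larger volume."""
    if item.get("itemType") != "book":
        return False
    extra = parse_extra(item.get("extra"))
    # A Publication Title or DOI left over in Extra hints that the record
    # describes a conference paper or chapter rather than a whole book.
    return "publication title" in extra or "doi" in extra


# Example with one of the Extra fields from my library
item = {"itemType": "book", "extra": "Pages: 9\nDOI: 10.1109/ICINFA.2009.5205136"}
flagged = looks_like_misclassified_book(item)
```

Items flagged this way could then be reviewed and changed to a more fitting type such as "conferencePaper" or "bookSection" in one pass after import.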
BTW, I'm talking about importing RIS (and CIW) files exported from Web of Knowledge.
Kinematics analysis and simulation of a new parallel mechanism with two translational and one rotational outputs; Computer vision based calibration of the purely translational Orthopod manipular.
Their extra fields are:
Pages: 9
DOI: 10.1109/ICINFA.2009.5205136
Pages: 73
DOI: 10.1109/ICINFA.2009.5205167
For one of them, could you export as "Plain Text" --> Full Record, open the output with a text editor and paste the output here?
Also, could you find the accession number? That should look something like
WOS:000277076800282 and be at the bottom of the full record or at the end of the URL in the address bar (the WOS: would be another acronym if using a different database)
I just created an open Zotero group library, where I imported an RIS file created in Web of Knowledge using "All Databases" and exported as RIS with all 11 available fields selected.
There you can find the items, the RIS file, a snapshot of the WoS page, and a snapshot of the duplicate items folder.
Note that the duplicates in this library are of two types: distinct conference papers recognized as "books", and the same conference paper appearing twice, as either a conference paper or a journal article. The latter issue is caused by the database itself.
https://www.zotero.org/groups/4707971/sharefileswww/library
BTW, I don't know if one can download the items from the library's web page, because I can't figure out a way to download the files I uploaded to the library above. If not, I will have to figure out another way to share the file.