False duplicates in "Duplicate Items", especially for the item type "book"

livey_liwei · October 16, 2019

Hello, I notice that in "Duplicate Items", plenty of "journal articles" that are clearly not duplicates (for example, with different titles and authors) are marked as duplicates. From some previous threads, it seems that this is more likely to happen when the items have the type "book". I checked their item types, and indeed most of these false duplicates are "book".

This is probably because some of them are published as chapters or articles in books, such as a collection of conference papers. Most of the references were imported from Web of Science (a small portion from Google Scholar). I don't know why they are regarded as books, but they are clearly different items according to their titles or authors. I wonder whether this is due to wrong identification of the item types or to the way items are identified as duplicates. This does not seem to be a very difficult issue to fix. BTW, the latest version, 5.0.76, has the same problem.

Is there any solution for this? Duplicate detection is an important function, since it lets users import records without having to identify potential duplicates manually.

Finally, I also support the suggestion by some others to allow users to mark some records as non-duplicates.

Thanks a lot and my best wishes to Zotero, the best reference management software I have used so far.

adamsmith · October 16, 2019

If Zotero sees two books with the same ISBN it treats them as duplicates without looking further (because the ISBN is a unique identifier). I assume that's the case here.
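To make the rule concrete, here is a minimal Python sketch of ISBN-first matching as described above. It is not Zotero's actual code: the item dictionaries, the function names, and the placeholder ISBN are all assumptions for illustration.

```python
from collections import defaultdict


def normalize_isbn(raw):
    """Keep only digits and 'X' so differently formatted ISBNs compare equal."""
    return "".join(ch for ch in raw.upper() if ch.isdigit() or ch == "X")


def group_books_by_isbn(items):
    """Group items of type 'book' that share a normalized ISBN (hypothetical rule)."""
    groups = defaultdict(list)
    for item in items:
        if item.get("itemType") == "book" and item.get("ISBN"):
            groups[normalize_isbn(item["ISBN"])].append(item["title"])
    # Only groups with more than one member are duplicate candidates.
    return [titles for titles in groups.values() if len(titles) > 1]


# Made-up records: two different papers that were both imported as "book"
# with the ISBN of the proceedings volume (placeholder ISBN, not a real one).
library = [
    {"itemType": "book", "ISBN": "000-0-00-000000-1", "title": "Paper A"},
    {"itemType": "book", "ISBN": "0000000000001", "title": "Paper B"},
    {"itemType": "conferencePaper", "ISBN": "0000000000001", "title": "Paper C"},
]

print(group_books_by_isbn(library))  # [['Paper A', 'Paper B']]
```

Under this rule, "Paper A" and "Paper B" are flagged even though their titles differ, which is the behavior reported in this thread.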

There's an open ticket for marking items as non-duplicates, so there's general agreement that it's a good idea.

livey_liwei · October 16, 2019

Hi adamsmith, thanks for your explanation. It's true that items of type "book" with the same ISBN should theoretically be the same book, but in practice some articles are classified as books (I don't know why). In that case, which is not rare for researchers who read book chapters or conference papers from a collected volume, this problem arises even if the items clearly have distinct titles and authors.

It seems that this problem was brought up several years ago by several different users and still hasn't been solved. Solving it would be very helpful as the database grows bigger and one has to identify duplicates across multiple imports from different sources.

If the option of marking items as non-duplicates becomes available, it would at least offer a way to tackle this issue manually. Thanks for the information.

DWL-SDCA · October 16, 2019

One option for you is to change the item type to "book section". However, for many of these, the publication, whether a conference proceeding or an annual book in a series, will also have an ISSN. For book sections within publications that have an ISSN, you may consider making these journal articles. Several bibliographic databases (PubMed & WoS, for example) do this.

livey_liwei · October 17, 2019

DWL-SDCA, thanks for your advice, but this doesn't seem very practical to me. My major concern is this: while the problem persists, each time I import new references (possibly in batch), I can't safely discard the duplicates at the import stage (this is possible in EndNote, but doesn't seem possible in Zotero yet). Besides, if I follow your approach, the next time I import some potential duplicates, they will again have the item type "book". These true duplicates may then fail to be identified, because the copies already in the database have been manually changed to other types. This means that each time we import new references, we need to deal with the potential duplicates manually by setting them to other types, which is quite annoying, and manual operations like that tend to introduce mistakes. But thanks for your kind response.

Actually, I regard dealing with duplicates as a very fundamental requirement for many researchers, and the situation comes up very often: each time one works on a different research topic, one may look up references in, e.g., Web of Science and pick around 40 references to import in batch. It's then highly likely that some of the references are already in Zotero, imported earlier when dealing with related topics. This is when duplicates occur.

Besides, rewriting the conditions for identifying duplicates doesn't seem to be a difficult technical issue. I don't understand why this issue has lasted for years. It has caused me a big headache since I started to use Zotero as my primary reference management software. I hope this simple yet very annoying bug will be considered for fixing in the future. Finally, many thanks to the contributors to this great software.

livey_liwei · May 21, 2022

@adamsmith

Hi, I wonder if there is any progress on this issue? I see replies in some other posts describing the difficulty of implementing "marking items as non-duplicates". So, is it possible to handle the issue mentioned above by improving the detection mechanism so that it produces far fewer "false duplicates"?

The reason I come back to this issue is that I found a plugin that can merge the items in the duplicates folder in batch, but the existence of false duplicates, especially conference papers with different titles, makes the job quite cumbersome. In my case, I occasionally do literature reviews on different topics, and hence need to import the relevant papers at different times. Often, some duplicates get imported.

If the false duplicates issue can be solved, it would save me a lot of time manually cleaning up the library, or spare me from having different versions of the same paper. I believe this function is also needed by other students and researchers who regularly do literature surveys.

bwiernik · May 21, 2022

If two book items have the same ISBN but are different (e.g., because they are actually chapters), that’s a case of bad data in the library. The solution is to correct the data, not to try to work around the incorrectly imported data.

livey_liwei · May 21, 2022

@bwiernik, thanks for your comment. In my case, the data are mostly imported from Web of Science, which, as far as I know, generally offers a high standard of reference formats. Even so, I have encountered this problem, and I have seen several different posts on this issue. Since most researchers need to read conference papers, this issue could be troublesome for many people.

I'm not sure why some of these references (especially conference papers that have been published as book collections) are recognized as books, or whether there is anything wrong with the Zotero translator.

Correcting the data manually is definitely one potential solution, but wouldn't it make more sense to let the duplicate detector take care of this issue? After all, items with clearly different titles and authors are definitely different items, in my opinion.

bwiernik · May 21, 2022

My point is that marking them as non-duplicates doesn’t fix the core problem that the item data is very wrong. An ISBN is a unique identifier for a book. If two books have the same identifier, a database should and will treat them as the same book. The fact that some items have bad data and were incorrectly entered as books doesn’t change that. The solution is to clean up the data, either in the Zotero translator or after import.

Simply ignoring the problem leads to other issues down the line, such as inaccuracies in search reporting and lack of retrievability that harm the validity of the systematic review process.

As a best practice, it should be a routine part of your data import process to check items after import and clean up erroneous data. This is a core part of my systematic review process in Zotero.

livey_liwei · May 21, 2022

I agree that marking things as non-duplicates could be troublesome, but this is not what I'm requesting. All I'm thinking of is a quick way to handle the issue of "false duplicates", especially when the items are regarded as "books", because this issue occurs very often for users who need to import conference papers.

It doesn't seem to me a difficult task to change the duplicate detector's code so that it also compares titles, authors, etc. when the item type is "book". I also can't imagine that this change would affect other functions; however, since I'm not sure exactly how this mechanism works, I accept that it could. If that is indeed the case, would it be possible to offer users an option to choose how duplicate detection behaves, or to let users add customized rules?

"It should be a routine part of your data import process to check items after import and clean up erroneous data": this is exactly what I want to accomplish, but as I said, manually changing the item types one by one doesn't seem a promising option for me. For example, the next time I import the same data, those false duplicates are again recognized as "books", meaning that I have to change their item types one by one manually all over again.

Maybe the problem comes from the Zotero translator, but I'm really not familiar with it, so I'm not sure if this is the case.

DWL-SDCA · May 21, 2022

It is far from unusual to have duplicate / non-duplicate issues with databases even with important ones such as PubMed. Database curators handle these problems in accordance with the rules and conventions of the database and within the constraints imposed by the database structure. Each of us is curator of our own Zotero library and that requires effort to properly manage each record.

I fully agree with the comments by @bwiernik and @adamsmith above that, with proper information in a Zotero record, the false duplicate problem should go away. Still, there are exceptions. These are problems caused by editors and publishers, even those with a stellar reputation. For example, JAMA will often publish a set of letters responding to an article with replies by the original author(s) to the responding letter. It is not uncommon for 2 or more different reply letters by the original authors to appear on the same page. JAMA is not alone in its practice of assigning a single title to a series of multiple letters and a single DOI to the series. It may be necessary to cite only one of the letters, so it makes sense to have a different Zotero record for each published letter. Items on the same page with the same title, author(s), and DOI will look like duplicates to an automated duplicate test. However, this test only identifies probable duplicates. There is currently no way to designate in the Zotero database that record number 12345 and record number 23456 are not duplicates.

I may be very wrong about this but suspect from Zotero developer comments on other threads that there are fundamental issues with the structure of the SQLite database that make marking duplicates impossible. [However, I might be mistaken and this could be a value / philosophical judgment concerning the risk of falsely marking true duplicates as not. Citing two different Zotero records of the same item causes ugly problems with citation styles.]

I curate an online bibliographic database (currently 690,000 records). Of those we have marked fewer than 1000 as not-duplicate. We use a MySQL database and have tables for articles, reports, presentations, books and chapters, etc. Each of the tables has a field for "not a duplicate with record number". [In my case this is important only on the administrative side (not the public face) so that we can cope with true duplicates (see below).]
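As a rough illustration of the idea, here is a minimal Python sketch of recording "not a duplicate" pairs. It swaps in an in-memory SQLite database for MySQL and uses a separate pair table rather than a per-table field; the table and function names are hypothetical, and this is neither my production schema nor Zotero's.

```python
import sqlite3

# In-memory database standing in for the real one (assumption for the sketch).
conn = sqlite3.connect(":memory:")
conn.execute(
    """
    CREATE TABLE not_duplicates (
        id_low  INTEGER NOT NULL,
        id_high INTEGER NOT NULL,
        PRIMARY KEY (id_low, id_high)
    )
    """
)


def mark_not_duplicate(a, b):
    """Store the pair in sorted order so (a, b) and (b, a) are the same row."""
    low, high = sorted((a, b))
    conn.execute("INSERT OR IGNORE INTO not_duplicates VALUES (?, ?)", (low, high))


def is_marked_not_duplicate(a, b):
    """Return True if this pair was explicitly marked as distinct records."""
    low, high = sorted((a, b))
    row = conn.execute(
        "SELECT 1 FROM not_duplicates WHERE id_low = ? AND id_high = ?",
        (low, high),
    ).fetchone()
    return row is not None


mark_not_duplicate(23456, 12345)
print(is_marked_not_duplicate(12345, 23456))  # True
print(is_marked_not_duplicate(12345, 99999))  # False
```

A duplicate detector could consult such a table and skip any candidate pair found in it.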

Setting up a process for potential-duplicate detection and marking non-duplicates was not onerous in my system. Given years of requests for the marking of false duplicates, I suspect that this is very difficult with the current Zotero database structure. Non-duplicate marking for "my" single online database is probably easier than with Zotero, which needs to handle such marking for each of the hundreds of thousands of individuals who use their own Zotero database, each with its own record IDs. Yet, because Zotero records can be merged, with the result that citations to either pre-merge record will cite the single post-merge record, maybe it is not so difficult.

There are other examples of databases with duplicate-handling problems. PubMed handles the problem of true exact duplicates by arbitrarily deleting one or more of the records (even if the records have different publisher article numbers and different DOIs, a very common occurrence with some "predatory" publishers such as Frontiers and MDPI). This means that a PMID search for the deleted record will find nothing. (In the case of my database, a search for the deleted/merged old record ID will point the seeker to the good single record.) With Zotero one should merge true duplicates to avoid citation ugliness.

My apologies if this post is too long and too annoying.

adamsmith · May 22, 2022

I think there are three or four separate issues here:

1. Does Zotero sometimes erroneously identify properly input items as duplicates that aren't (yes) and should it allow users to mark those as non duplicates (yes, and there's been some work on that, but it's not an easy issue.) I don't think there's any need for further discussion of this, though. That said, it wouldn't solve the main problem posed by the OP here, because it'd still be a one-item-at-a-time operation, so not less work than fixing item types.

2. Could Zotero duplicate detection be improved, especially with an eye towards avoiding false positives? (probably, yes, but not clear how quickly we'll enter the land of false positive vs. false negative trade-offs)

3. Should Zotero duplicate detection be designed to not produce false positives for *incorrectly* input data (such as chapters with a book item type)? That's the main concern of OP here, and this would indeed be technologically fairly simple (tweaking the algorithm for duplicate detection isn't hard), but it will almost certainly have significant costs in terms of false negatives. Given how Zotero thinks about duplicates (since they're screened before merging, false positives are preferable to false negatives), I don't think this is super likely.

4. Should Zotero allow some customization for duplicate detection? That seems attractive on a number of levels, but significantly more complex technologically and causes a range of documentation & support-related issues, so no idea where Zotero devs fall on this.
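
To illustrate the trade-off in points 2 and 3, here is a minimal Python sketch comparing an ISBN-only rule with a stricter ISBN-plus-title rule. Both rules, the records, and the placeholder ISBNs are made up for illustration; this is not Zotero's algorithm.

```python
import re


def norm_title(title):
    """Lowercase and collapse punctuation so trivial differences are ignored."""
    return re.sub(r"[^a-z0-9]+", " ", title.lower()).strip()


def isbn_only(a, b):
    """Lenient rule: same ISBN means duplicate candidate."""
    return a["ISBN"] == b["ISBN"]


def isbn_and_title(a, b):
    """Stricter rule: ISBN and normalized title must both match."""
    return isbn_only(a, b) and norm_title(a["title"]) == norm_title(b["title"])


# Two distinct chapters wrongly imported as books with the volume's ISBN.
chapter_a = {"ISBN": "0000000000001", "title": "Chapter on kinematics"}
chapter_b = {"ISBN": "0000000000001", "title": "Chapter on calibration"}

# The same book entered twice, once with and once without its subtitle.
book_v1 = {"ISBN": "0000000000002", "title": "Robot Kinematics"}
book_v2 = {"ISBN": "0000000000002", "title": "Robot kinematics: an introduction"}

print(isbn_only(chapter_a, chapter_b))       # True  (false positive)
print(isbn_and_title(chapter_a, chapter_b))  # False (false positive avoided)
print(isbn_only(book_v1, book_v2))           # True  (true duplicate found)
print(isbn_and_title(book_v1, book_v2))      # False (false negative introduced)
```

The stricter rule removes the false positive for the mis-typed chapters but misses the genuine duplicate whose title was entered differently.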

dstillman · May 22, 2022

"I may be very wrong about this but suspect from Zotero developer comments on other threads that there are fundamental issues with the structure of the SQLite database that make marking duplicates impossible."

There's no problem with the database structure. We've developed a feature to mark items as different, and it should roll out soon.

livey_liwei · May 22, 2022

@adamsmith
Thanks for your comment. From my point of view, we don't have to make the "duplicate detection" function perfect in one shot; let us at least try to solve the issue I mentioned in this post, which researchers encounter very commonly. Making the duplicate detection algorithm for books compare titles and authors in addition to the ISBN would be a fairly safe approach, in my opinion.

Once this issue is fixed, I will be able to resolve all the duplicates in batch using a plugin (of course I will manually inspect all the items to make sure that there are no false positives left). In that case, Zotero would be able to offer performance similar to EndNote's in terms of removing duplicates. I have been using Zotero as my main reference management software for years, and this issue is the one that troubles me most.

livey_liwei · June 5, 2022

@adamsmith, @dstillman
I just checked several false duplicates of conference papers (or book collections) recognized as books. I notice that many of these "books" have something like "DOI, Pages, Publication Title" in their "Extra" field, and the "Publication Title" shows the conference name. So I suspect that this issue is more related to Zotero's translator, which fails to interpret the file formats properly.

I believe that improving the translator would probably solve the issue to a large extent. For example, if an entry has a DOI field, it cannot be a book. Since the RIS/CIW files have many fields one can access, I believe that a more sophisticated way to handle this issue can be worked out.

BTW, I'm talking about importing RIS (and CIW) files exported from Web of Knowledge.
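
To show what I mean, here is a rough Python sketch of such a check. It is not the Zotero translator: the parser handles only simple one-line RIS tags, the sample records and DOI are made up, and a DOI on a BOOK record is treated only as a hint worth reviewing, since I can't rule out exceptions.

```python
import re

# RIS lines look like "TY  - BOOK": a two-character tag, two spaces, a hyphen.
RIS_LINE = re.compile(r"^([A-Z][A-Z0-9])  - ?(.*)$")


def parse_ris(text):
    """Parse RIS text into a list of {tag: [values]} dicts, one per record."""
    records, current = [], {}
    for line in text.splitlines():
        match = RIS_LINE.match(line.rstrip())
        if not match:
            continue  # skip blank or continuation lines in this simple sketch
        tag, value = match.groups()
        if tag == "ER":  # end-of-record marker
            records.append(current)
            current = {}
        else:
            current.setdefault(tag, []).append(value.strip())
    return records


def suspicious_books(records):
    """Titles of BOOK records carrying a DOI, flagged for manual review."""
    return [
        rec.get("TI", ["(untitled)"])[0]
        for rec in records
        if rec.get("TY") == ["BOOK"] and "DO" in rec
    ]


sample = """TY  - BOOK
TI  - Hypothetical paper one
DO  - 10.0000/example.1
SP  - 9
ER  -
TY  - BOOK
TI  - Hypothetical monograph
ER  -
"""

print(suspicious_books(parse_ris(sample)))  # ['Hypothetical paper one']
```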

adamsmith · June 5, 2022

Could we get a couple of examples from WoK where that's the case? E.g., a DOI or their unique WoK identifier?

livey_liwei · June 5, 2022

Not sure how to upload files. You could try searching for the following items:
Kinematics analysis and simulation of a new parallel mechanism with two translational and one rotational outputs; Computer vision based calibration of the purely translational Orthopod manipular.
Their extra fields are:
Pages: 9
DOI: 10.1109/ICINFA.2009.5205136
Pages: 73
DOI: 10.1109/ICINFA.2009.5205167

livey_liwei · June 5, 2022

BTW, it seems that downloading the same item from WoK using the core database or "all databases" can yield different formats, because sometimes I get the same item as either a conference paper or a book section (or book).

adamsmith · June 6, 2022

Hmm, I'm not getting the second one at all, and the first one comes in as a conference paper without DOI. I'm guessing it's from a database we're not subscribing to. Can you tell which database these bad records are coming from?
For one of them, could you export as "Plain Text" --> Full Record, open the output with a text editor and paste the output here?
Also, could you find the accession number? That should look something like
WOS:000277076800282 and be at the bottom of the full record or at the end of the URL in the address bar (the WOS: would be another acronym if using a different database)

livey_liwei · June 6, 2022

@adamsmith, @dstillman
I just created an open Zotero group library, where I imported an RIS file created in Web of Knowledge using "all databases" and exported as RIS with all 11 available fields selected.

There you can find the items, the RIS file, a snapshot of the WoS page, and a snapshot of the duplicate items folder.

Notice that the duplicates in this library are of two types: distinct conference papers recognized as "books", and the same conference paper appearing twice, once as a conference paper and once as a journal article. The latter issue is caused by the database itself.

https://www.zotero.org/groups/4707971/sharefileswww/library

BTW, I don't know if one can download the items from the library's web page, because I can't figure out a way to download the files I uploaded to the library above. If it isn't possible, I will have to figure out another way to share a file.