Checking for duplicates

mjthoraval · July 24, 2022

I am coming from Mendeley, where I was importing most of my references manually from pdf files downloaded to a local folder. I could import the whole folder every time, without having to worry about duplicates, because Mendeley would simply identify that the PDF file was already in my library. I understand that this is done by a simple check of some kind of unique identifier of each file imported.

This does not seem to be implemented in Zotero. Importing the same PDF file three times in a row creates three separate entries, with exactly the same file stored three times on my computer.
Zotero seems to check for duplicates based only on the metadata, without any check on the imported file?
https://www.zotero.org/support/duplicate_detection

Is there a design or technical reason preventing Zotero from checking that the file is not already in the library, as done by Mendeley?

I have also explored the "Duplicate Items" folder. When I expand the items to see the attached files identified as duplicates, I see that the large majority of them only show one attached pdf file. This is very suspicious, because I know that nearly all entries in my library do have a PDF attachment. It means that the "Duplicate Items" folder does not show me all the files that are attached to the entries to merge.
I have checked this manually by searching for the item in my library. I wanted to copy the DOI from the metadata panel... to realize that I cannot copy any of the metadata from the duplicate folder view. So I had to open the PDF file (selected by Zotero) and copy the title in order to search for it in the library. I indeed found 2 copies of that entry, with 2 different files, not just the 1 shown in the duplicates folder.

If there are different files, what is going to happen to my files if I merge the items? Is Zotero going to decide for me without any warning and delete the file that it has decided is not good?
In Mendeley, I could simply see the two files in the right panel, have the ability to look at them, and decide whether or not to delete some of them.

The argument that "there should be only one valid copy of the paper published" is not valid to me. The user should be free to decide what to keep under an item, be it Supplementary Materials, Journal Pre-proofs, paper retraction notice, commented versions of a pdf file, ...

I feel that Zotero's perspective is organized around the metadata rather than the attached files. Therefore, duplicate detection only considers the metadata. My perspective from Mendeley is different, with the core data being the files (mostly PDF files). I view my reference manager as a files organizer primarily.

This limitation makes it much more demanding to grow my library in Zotero than in Mendeley. I must manually check every entry before importing it into Zotero. Otherwise, I pay the price of a messy, heavily duplicated library with only limited tools to identify the duplicates. I understand that some improvements are planned for the future. But for my workflow, this limitation still puts Zotero behind Mendeley Desktop for library management.
I guess there are other workflows more "Zotero friendly". But it would make a huge difference for me not to have to worry anymore about file duplication in Zotero.

Please let me know if I have missed something in Zotero.

dstillman · July 24, 2022

I view my reference manager as a files organizer primarily.

Then Zotero may not be for you. That's not how Zotero is designed.

I am coming from Mendeley, where I was importing most of my references manually from pdf files downloaded to a local folder.

This just isn't how Zotero is designed to be used. Zotero is a web-first tool. You add items from the web when you want to save them, and Zotero automatically downloads the PDF if available:

https://www.zotero.org/support/adding_items_to_zotero

Zotero also allows you to add the same file to your library more than once — e.g., perhaps you want to keep one pristine copy of the file and another that you later annotate in an external PDF reader.

If there are different files, what is going to happen to my files if I merge the items? Is Zotero going to decide for me without any warning and delete the file that it has decided is not good?

If the files are identical or essentially identical, it will remove one. If they're different, it won't.

The argument that "there should be only one valid copy of the paper published" is not valid to me. The user should be free to decide what to keep under an item, be it Supplementary Materials, Journal Pre-proofs, paper retraction notice, commented versions of a pdf file, ...

Zotero has always been designed — much more than Mendeley — to support multiple files under the same parent item. I'm not sure why you think otherwise.

mjthoraval · July 24, 2022

I had the impression that Zotero was moving away from its original design as a "web-first tool", through the standalone app and the very nice recent pdf reader with annotations and notes. I am trying to point out some of the limitations I experience in my workflow with the current tools available in Zotero, in order to understand the reasons behind these choices and therefore whether Zotero is suitable for me or not.

I use Zotero for reference management in scientific research, where the core information is the scientific publication itself (mostly pdf files), while its metadata is the information needed to organize them and cite them.

I really do not see the point of having two strictly identical files stored twice on my computer. In the example you provide, making annotations in an external PDF reader, the annotation process actually creates a different file, with a different identifier. I completely agree that it is perfectly fine in that case to have another copy of the file, after edits, added to a separate item. But until these changes are actually made, you are just storing the file twice without any added value that I can see.
With the new PDF reader, Zotero now lets you always "keep one pristine copy of the file" and still add annotations in its viewer. If you prefer making annotations outside of Zotero, you can always go back to that original copy whenever you want. No need to store it twice.

Even if for some reason you still want to duplicate the exact same attachment file in your library, that could still be allowed through an option. But my guess is that this is not necessary for the large majority of cases and users. At the very least, I would find it extremely valuable to be able to prevent adding strictly identical files twice, and to correct for it later if it still happened.

I understand that Zotero probably has very good reasons to select one of the files to keep in the item when dealing with duplicates. But however good these arguments are, that choice may not be what the user actually wants. The main issue I am pointing at is that in the current duplicates folder, the user does not have any way of judging the decision made by Zotero. I think that making an informed decision is important when deleting files.
The only scenario that should be treated differently, in my view, is where the files are actually strictly identical, based on some unique identifier. But even in that case, it would still be useful to clearly receive this information from Zotero before the duplicate file is deleted.

In your example of keeping one clean copy and one annotated copy of a pdf file in your library, I guess that Zotero will consider these two files "identical" when searching for duplicates. It is critical at that point to keep control over the merging process, and to keep the ability to disagree with the decision made by Zotero.

I have listed a few cases in which Zotero has considered that the "identical" files in my library should be deleted, while I actually wanted to keep the two files within the same item. I can therefore say from what I have experienced that I disagree with what Zotero considers "identical" files. I guess other users may have similar worries when pressing the merge button.

I have a lot of respect for the work achieved in the development of Zotero. Hopefully, the feedback I have provided above can still be useful to consider future development decisions.
I am perfectly fine if the conclusion of the discussion is still that Zotero is not for me.
Having some feedback from other people on their workflow may also help me reconsider my approach and redesign it in a way that takes advantage of the current design choices made in Zotero.

To reply to your last question:
If an item has multiple attachments, Mendeley will show a different icon, so that I have clear visual information about it directly from the main panel of my library.
In Zotero, items with one or multiple attachments will be displayed in exactly the same way.

dstillman · July 25, 2022

I had the impression that Zotero was moving away from its original design as a "web-first tool", through the standalone app and the very nice recent pdf reader with annotations and notes.

I'm not sure what you mean by that — these things aren't in conflict. The Zotero Connector is a core part of Zotero, and the documentation I linked to makes very clear how we recommend saving to Zotero. Zotero isn't designed around a folder of files on your computer that you manage manually — we think, in the context of a metadata database, that's a job for a computer.

I really do not see the point of having two strictly identical files stored twice on my computer. In the example you provide, making annotations in an external PDF reader, the annotation process actually creates a different file, with a different identifier. I completely agree that it is perfectly fine in that case to have another copy of the file, after edits, added to a separate item.

I don't know what you mean by "different identifier" here. The file hash would be different after editing it, but the point is that you can open a PDF from Zotero in an external reader and save it directly, and strictly preventing identical files, even among child items as Mendeley Desktop does, would prevent that and require a more tedious workflow. Keeping two copies of a file isn't something we particularly encourage, and obviously the Zotero PDF reader is designed to allow keeping a pristine file even after annotations, but the point is just that people have expectations that this is the sort of thing they can do in Zotero.
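
To illustrate the point about the file hash, here is a minimal Python sketch. It assumes SHA-256 as the hash and uses made-up byte strings in place of real PDF files; nothing in this thread says which hash, if any, Zotero or Mendeley computes.

```python
import hashlib


def sha256_hex(data: bytes) -> str:
    """Return the SHA-256 hex digest of some file content."""
    return hashlib.sha256(data).hexdigest()


# Stand-ins for file contents (assumption: real PDFs would be read from disk).
original = b"%PDF-1.7 example bytes"
untouched_copy = bytes(original)
annotated = original + b" plus an annotation"

# A byte-for-byte copy has the same hash, whatever its file name.
same = sha256_hex(original) == sha256_hex(untouched_copy)

# Any saved edit changes the bytes, and therefore the hash.
changed = sha256_hex(original) != sha256_hex(annotated)
```

The hash says nothing about file names or metadata: it only answers whether two files have exactly the same content.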

More importantly, there are a huge number of ways to get things into Zotero, and we take both metadata quality and work processes seriously, so the sort of file-first forced deduplication that Mendeley Desktop does would break user expectations. If someone adds an item and file from the web, and then they import a BibTeX file with file paths, there should be two separate items with different metadata that could be merged. Zotero shouldn't just ignore the second item, or swap in the first item with different metadata and possibly a different file into the imported collection.

Now, maybe there's an argument that, if you have a file in a library, and you add the exact same PDF as a top-level item, Zotero should simply keep the existing item (and add it to the current collection if necessary) rather than running metadata retrieval on the new PDF. That would break current user expectations — someone might be used to testing whether they get better metadata via PDF metadata retrieval — but it might be desirable enough of the time to justify the change. And this would presumably address your main use case.
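
The behavior floated in the paragraph above (keep the existing item and add it to the current collection) could be sketched roughly as follows. This is a hypothetical in-memory model, not Zotero's code: the `Library` class, its fields, and `add_file` are all invented for illustration, and SHA-256 is an assumed choice of hash.

```python
import hashlib
from dataclasses import dataclass, field


@dataclass
class Library:
    """Hypothetical library that indexes items by attachment content hash."""
    items_by_hash: dict[str, int] = field(default_factory=dict)
    collections: dict[str, set[int]] = field(default_factory=dict)
    next_id: int = 1

    def add_file(self, data: bytes, collection: str) -> tuple[int, bool]:
        """Add a file to a collection; return (item_id, created)."""
        digest = hashlib.sha256(data).hexdigest()
        item_id = self.items_by_hash.get(digest)
        created = item_id is None
        if created:
            # New content: create a new item (metadata retrieval would go here).
            item_id = self.next_id
            self.next_id += 1
            self.items_by_hash[digest] = item_id
        # Existing or new, make sure the item is in the target collection.
        self.collections.setdefault(collection, set()).add(item_id)
        return item_id, created


lib = Library()
first = lib.add_file(b"paper", "Inbox")     # new item
second = lib.add_file(b"paper", "Reading")  # same bytes: existing item reused
```

The `created` flag is what would let an interface tell the user "this file is already in your library" instead of silently reusing or silently duplicating it.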

(Also, just to note, Mendeley Reference Manager doesn't do this deduplication, though that's presumably due to its being a bare-bones web app wrapper rather than an intentional change from Mendeley Desktop.)

In your example of keeping one clean copy and one annotated copy of a pdf file in your library, I guess that Zotero will consider these two files "identical" when searching for duplicates.

Zotero doesn't "search" for identical files. In my example, the file would be attached to the same item, so duplicate detection isn't relevant.

I'm confused here, though — you seem to be simultaneously arguing 1) that Zotero should automatically deduplicate files when added and 2) that Zotero should never automatically deduplicate files.

But also, this whole thing seems like a misunderstanding:

I have also explored the "Duplicate Items" folder. When I expand the items to see the attached files identified as duplicates, I see that the large majority of them only show one attached pdf file. This is very suspicious, because I know that nearly all entries in my library do have a PDF attachment. It means that the "Duplicate Items" folder does not show me all the files that are attached to the entries to merge.

Duplicate Items shows you exactly what's in your library. Removing identical files happens after you merge, not in Duplicate Items. So I think you're just confused about what you're looking at here. If you think you're seeing something wrong, take a screenshot, upload it somewhere (e.g., Dropbox or Google Drive), and provide a link here.

I wanted to copy the DOI from the metadata panel... to realize that I cannot copy any of the metadata from the duplicate folder view.

You can use the arrow keys to select individual items or standard modifier keys to deselect specific items in the set.

mjthoraval · July 25, 2022

Here are two simple examples of the problems I describe:

1) I usually download a set of PDF files to my computer to look at them first, possibly edit them, and remove the irrelevant ones before adding the selected ones to my library. This initial scan through the files lets me control what goes into my library and keep it cleaner.
After that, I import the files directly into Zotero, following the instructions in "Adding PDFs and Other Files", either through drag and drop into the main library, or “Store Copy of File…”.
Here is the result of adding the exact same file multiple times, after manually creating the correct item from the DOI the first time it was added: Zotero_Duplicates_01.png
It is easy to spot the duplicates here, because they are added successively, with the same file name containing clear information. But I could also add the same file again after several weeks or months, with another file name. In that case, I do not see any way in Zotero to identify these duplicates.
What I would like is:
- If I add the exact same file to my library again, I would like Zotero to offer the option of taking me to the existing entry, rather than copying the exact same file multiple times.
- For cases in my library where this has already happened, I would like Zotero to be able to tell me that I have multiple copies of the same file in my library, from file hash, and give me the option to merge them.
I understand that strictly preventing identical files is not necessarily suitable for everyone. But being able to learn from Zotero that files are duplicated (same file hash) is definitely something that should be useful to everyone, however they decide to deal with these duplicates.
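
As a rough illustration of the second request, here is a minimal Python sketch that finds byte-identical PDFs under a folder regardless of file name. It is not a Zotero feature or API: the function names are made up, SHA-256 is an assumed choice of hash, and the script only reads files from disk.

```python
import hashlib
from collections import defaultdict
from pathlib import Path


def file_sha256(path: Path, chunk_size: int = 1 << 20) -> str:
    """Return the SHA-256 hex digest of a file, read in chunks."""
    digest = hashlib.sha256()
    with path.open("rb") as handle:
        for chunk in iter(lambda: handle.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def find_duplicate_files(root: Path) -> list[list[Path]]:
    """Group PDFs under root by content hash; keep groups with 2+ files."""
    by_hash: dict[str, list[Path]] = defaultdict(list)
    for path in sorted(root.rglob("*.pdf")):
        if path.is_file():
            by_hash[file_sha256(path)].append(path)
    return [paths for paths in by_hash.values() if len(paths) > 1]
```

Calling `find_duplicate_files(Path("some/folder"))` on a download folder returns one list per set of identical files. It only reports the groups; deciding what to keep or merge stays with the user.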

2) The second example is from the folder "Duplicate Items": Zotero_Duplicates_02.png
As you can see, the merging tool shows that there are 2 different items, but I can see only one PDF file in the main panel. This suggests that Zotero has decided to keep only that PDF file in the merged item, even though the two PDF files are clearly different. I can navigate between the two items in the right panel, but it does not show me anything about the attachments. So I have no idea of what will happen to the attachments after merging.
In Mendeley Desktop (the excellent software that was "replaced" by the useless Mendeley Reference Manager) the Check for Duplicates tool would bring up the two items identified as duplicates, give me the ability to inspect the content of each of them easily, show the content of the proposed merged item, and finally give me the option to remove some of the attachments if I think they are not needed in the merged item: Zotero_Duplicates_03.png.

My argument here is that file duplicates (from file hash) are fundamentally different from metadata duplicates (items in Zotero). I want to be able to keep different files under the same item. But I do not want to create a new item when a file already in my library is added another time as a top-level item.
Having the ability to identify file duplicates and decide how to deal with them should be useful to everyone. As I understand, Zotero does not do anything at the moment about file duplicates, and does not even search for file hash duplicates.
Deciding on the right interface for dealing with file duplicates in Zotero may need some thought to accommodate everyone's workflow, but identifying file duplicates from the file hash should be fairly easy, I guess?

dstillman · July 25, 2022

As you can see, the merging tool shows that there are 2 different items, but I can see only one PDF file in the main panel.

No, you're misunderstanding that. When you click an item in Duplicate Items, Zotero automatically selects all the items in that set. You're just sorting by Date Added, so the other item is out of view, and you didn't scroll down to see it. If you sort by almost any other field they'll be sorted together. Again, Duplicate Items shows you the exact same items in the exact same way as the other views.

mjthoraval · July 25, 2022

Ok, I see it now.

If you sort by almost any other field they'll be sorted together.

This statement is based on the assumption that the metadata is already "almost" identical, probably related to the way that these duplicates are identified. But if one of them has different content in the field selected to sort the entries, the only way I would have to see it is to scroll down the hundreds of potential duplicate items. Grouping together the potential duplicates would work better for me.
In the case where identical files (with different names) are also suggested as potential duplicates, files without metadata could not be grouped by sorting.

Again, Duplicate Items shows you the exact same items in the exact same way as the other views.

I would prefer to see the proposed content of the merged item directly as a new item in the items list, followed by the original items that will be merged. This would still fulfil your requirements while making it easier to navigate the duplicates.
This will likely require more work to implement, but I do not feel confident in merging duplicates with the current interface.

And I still do not know what is going to happen to the PDF files (or other child items) if I decide to click on merge.

dstillman · July 25, 2022

But if one of them has different content in the field selected to sort the entries, the only way I would have to see it is to scroll down the hundreds of potential duplicate items.

I mean, it just doesn't really matter. The relevant info is the metadata, and that's all in the right-hand pane. If you like Mendeley's approach better, OK, sorry. Our approach adds the flexibility of sorting by any column — including Date Added — and seeing results from all versions, which is incompatible with always grouping together.

I would prefer to see the proposed content of the merged item directly as a new item in the items list

You can see the exact metadata that will be kept in the right-hand pane, down to the field level, which you can override using the field version selectors if you want to.

And I still do not know what is going to happen to the PDF files (or other child items) if I decide to click on merge.

You're making too big a deal of the file thing. For most of Zotero's existence, we didn't deduplicate files at all when merging, and we got constant requests to deduplicate files as well, so a few months ago we added automatic merging of identical files as well as files that are extremely likely to be the same based on content (to account for watermarks, basically). The "deleted" files are still kept in the trash if you really want to review them. If you find an actual example where near-identical files that shouldn't be merged are merged, let us know and we'll see if we agree and can avoid it.

All other child items are kept.
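
For readers wondering what "identical" versus "extremely likely to be the same based on content" could look like in practice, here is a loose Python sketch. It is explicitly not Zotero's algorithm, which this thread does not describe: `classify_pair`, the 0.95 threshold, and the word-level comparison are all assumptions, and the text of each PDF is assumed to have been extracted already by some other tool.

```python
import hashlib
from difflib import SequenceMatcher


def classify_pair(
    bytes_a: bytes,
    text_a: str,
    bytes_b: bytes,
    text_b: str,
    threshold: float = 0.95,
) -> str:
    """Classify two files as 'identical', 'near-identical', or 'different'."""
    # Byte-for-byte equality is the cheap, unambiguous case.
    if hashlib.sha256(bytes_a).digest() == hashlib.sha256(bytes_b).digest():
        return "identical"
    # Otherwise compare the extracted text word by word, so that a short
    # added watermark line barely moves the similarity ratio.
    ratio = SequenceMatcher(
        None, text_a.split(), text_b.split(), autojunk=False
    ).ratio()
    return "near-identical" if ratio >= threshold else "different"
```

A copy that differs only by a short download stamp would land in the "near-identical" bucket, while an annotated or supplementary file with substantially different text would be classified as "different" and, under this sketch, left alone.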

mjthoraval · July 25, 2022

I tested merging the items in this example: Zotero_Duplicates_02.png.
The merging process did not take me to the resulting merged item. So I had to search for it manually in the main library, for example by sorting by "Date Modified". The result was one item with two attached PDF files. Depending on the user, one could argue that this is good... or not... Arguing about what should be done misses the point.

What would really benefit me is not for Zotero to make any kind of smart decision about whether or not to remove the duplicate file, but simply to give me control over which files end up in the merged item, in exactly the same way as you give the ability to decide which metadata to keep in the merged item.
I guess that my feedback is very similar to people asking for the ability to deduplicate files. The underlying feature request is exactly the same, which is to be able to control the merging process. You can never produce a satisfactory automatic deduplication process, because every user will have a different expectation.

It is easy to choose in Mendeley Desktop which file is kept in the merged item, as you can see all attached files at the bottom in the right side panel. It may be more difficult to do in Zotero with its approach to show child items only inside the central panel. But hopefully there is a nice way to do it also in Zotero, giving the user the control over the merging process.

Our approach adds the flexibility of sorting by any column — including Date Added — and seeing results from all versions, which is incompatible with always grouping together.

I don't think grouping by items to merge is incompatible with sorting. I appreciate the ability to sort by any field, but you are sorting on the fields of the items to be merged, which is probably not what you really want to do. You could instead keep the ability to sort by any field, applied to the proposed merged item.

dstillman · July 25, 2022

The merging process did not take me to the resulting merged item.

Yes? The point of this pane is to merge duplicate items, and items are no longer duplicates once they've been merged. Again, you can view the final data in the right-hand pane before merging.

I don't think grouping by items to merge is incompatible with sorting.

As I said, Zotero lets you sort on any field from all items. I understand Mendeley sorts only on the proposed item — I think that's nonsensical. There's no point in sorting the list by decisions you haven't made yet, and there's no point in offering to sort by Date Added (or any field, really) if you can't actually sort all items.

mjthoraval · July 25, 2022

Thank you for your replies. I now understand better the reasoning behind the design decisions, although they do not completely fit my workflow. Hopefully this discussion can shed a different light on related comments from other users.