Duplicate PDF files produced when importing from Mendeley

I imported my library from Mendeley on the 13th and 14th of July, following the instructions received here: Mendeley import for a large library.

I had noticed that the storage size in Zotero was larger than the original library in Mendeley, but I did not find the origin of the discrepancy at the time:
The "storage" folder holds 32 GB of data, more than the estimated Mendeley library size of 31.3 GB. Another computer with the synced Mendeley library only has 30.5 GB. It is surprising that the "storage" folder is already larger than my whole Mendeley library. My Zotero library was completely empty before starting the import.
I recently sorted my library by "Notes" and realized that many PDF files have been duplicated as child items of the same parent item: Zotero_Duplicates_04.png.
[Side note: I cannot see any arrow on the column header. Is it related to some bug in my library as reported here, or can the problem be reproduced by others?]

I checked a few cases in Mendeley, and they were not duplicated there. The "Date Added" information of the PDF files shows that the duplicates were added at fairly large time intervals, which could correspond to repeated import attempts from Mendeley, due to the crashes reported earlier.

I understand that these duplicates were not supposed to occur. But they still did somehow. I don't know if it is related to the crashes, or to continuing the import from another computer on a different OS.
The duplicates that can be identified here all appear to have a table of contents, which was already present in Mendeley. But there are also many other items that have a table of contents in Mendeley and did not lead to duplicate PDF files in Zotero. So that aspect is probably just helping me see the problem rather than being related to its cause.

I could identify this problem on around 200 parent items. This is still ok compared to the 15k total number of items in my library. But still quite painful to deduplicate manually.
My questions would be:
1) Is there any way to remove the duplicates other than manually? I cannot simply remove all PDF files for items with multiple PDF files, because some cases are valid multiple attachments. I have tried to do it manually for a few, but the two PDF files are stored in different folders. So it is quite painful to check the size of the file to see if they are different.
2) Beyond the first point, I also do not know how I can find these duplicates. It seems that the number of notes worked nicely to identify the problem, but I don't know if other duplicates are missed by this ordering. It seems that I cannot order on the number of attachments?

I could purge my Zotero library and import all over again from Mendeley. But that would take another day or two to complete, and I would lose all the work I have done in my Zotero library since transferring from Mendeley.

This problem could be a good case to support having file hash deduplication, but I will be happy to consider other options available.

Using Zotero 6.0.12-beta.2+fc0f6157d on Windows 10.
  • The "PDF" ones from the 14th are from the import — they have internal Mendeley identifiers, as do the parent items. The attachments from the 13th do not.

    You had ZotFile installed as of a previous error report. If you also had that installed at the time of the import, I'd guess that the other attachments were created/recreated by ZotFile in a way that caused the Mendeley identifiers to be lost, so then when you repeated the import it dutifully imported files that it had no record of previously importing. The files also aren't named in a way that Zotero itself would've done.

    Best I can suggest would be to press + in the items list to expand all items and quickly go through and select the duplicates for the ones with "PDF". (I would think the supplementary files would be named something different?)

    Or you can write a script for the Run JavaScript window — var attachmentItems = Zotero.Items.get(item.getAttachments(true)), attachmentItem.attachmentHash, attachmentItem.deleted = true; await attachmentItem.saveTx(), etc. — but I'm afraid you'd be on your own for that.
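
    A minimal sketch of what such a script could look like, built around the calls named in the reply above. The helper `findDuplicateIDs` and its record shape `{id, hash, dateAdded}` are hypothetical names chosen for this example. The helper only decides which attachments share a file hash with an earlier-added attachment of the same parent; the Zotero-specific glue is left as an untested comment, and it assumes that keeping the earliest-added copy is what you want.

```javascript
// Pure helper: given attachment records for ONE parent item, return the IDs
// of attachments whose file hash matches an earlier-added attachment.
// The record shape {id, hash, dateAdded} is an assumption made for this sketch.
function findDuplicateIDs(attachments) {
  const seen = new Set(); // hashes already kept
  const duplicates = [];
  // Sort oldest first so the earliest-added copy is the one that is kept.
  const sorted = [...attachments].sort((a, b) =>
    a.dateAdded < b.dateAdded ? -1 : a.dateAdded > b.dateAdded ? 1 : 0
  );
  for (const att of sorted) {
    if (!att.hash) continue; // no hash (e.g. missing file): never flag it
    if (seen.has(att.hash)) {
      duplicates.push(att.id);
    } else {
      seen.add(att.hash);
    }
  }
  return duplicates;
}

// Hypothetical glue for the Run JavaScript window (untested sketch).
// It uses getAttachments() without the "include trashed" flag so that
// attachments already in the trash are ignored.
//
// async function trashDuplicates(item) {
//   const atts = Zotero.Items.get(item.getAttachments());
//   const records = [];
//   for (const a of atts) {
//     records.push({ id: a.id, hash: await a.attachmentHash, dateAdded: a.dateAdded });
//   }
//   for (const id of findDuplicateIDs(records)) {
//     const dup = Zotero.Items.get(id);
//     dup.deleted = true; // moves the attachment to the trash
//     await dup.saveTx();
//   }
// }
```

    Because valid supplementary files have different contents, they have different hashes and are left alone. Anything flagged goes to the trash rather than being erased, so the result can be reviewed before the trash is emptied.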
  • Thank you for your reply. I guess the "PDF" naming identification is probably the easiest way to go for me here.
    Where is this naming coming from?
    It seems indeed that the name of supplementary files is still the same as what was in Mendeley. But then I don't understand why these "PDF" files did not also keep the same name as in Mendeley?

    What is an internal Mendeley identifier and why would ZotFile modify that information?
    Is the naming of the files synced between different computers? I have seen that the other computer also has the same name as on my main computer (different from what I had in Mendeley), although the ZotFile settings are different.

    Finally, is there any way to fix ZotFile so that it does not change the "internal Mendeley identifiers", or change Zotero to store this information in a more permanent way?
    Or maybe add a warning when someone imports from Mendeley with the ZotFile plugin activated?
  • edited July 27, 2022
    Attachment title != filename. When you save to Zotero via a translator (i.e., from the web) and a file is attached, the attachment title will be set to something appropriate for the translator (e.g., "Full-Text PDF"). Mendeley doesn't have a concept of an attachment title, so the main file is just called "PDF" for the primary imported file. Other files use their filenames for the title, since there's nothing else to call them.
    What is an internal Mendeley identifier and why would ZotFile modify that information?
    The identifier comes from Mendeley and is how Zotero avoids duplicating items on reimport. ZotFile doesn't know anything about that and could easily wipe those out if run on a file, but we have nothing to do with that.
    Is the naming of the files synced between different computers?
    Yes, of course. Zotero manages this for you.
    Finally, is there any way to fix ZotFile so that it does not change the "internal Mendeley identifiers"
    ZotFile is unmaintained and has been for years. But also, who cares? Unless you're planning to run ZotFile on all your files and then reimport from Mendeley, this isn't an issue. Most people import from Mendeley once, and people who need to do it multiple times generally don't run ZotFile on their files before doing so.
  • I probably do not yet understand how to make proper use of the distinction between attachment title and filename in Zotero.

    I like having the ability to choose the name of the PDF files. The reason is that I find it more practical when sharing them with others by email. This is why I am using ZotFile.
    Considering this, I have indeed run ZotFile on all my files. If this is not something encouraged by Zotero, what would you suggest that could achieve nicely formatted file sharing by email?
    What is the default filename in Zotero before using ZotFile, and where can I read about it, including why this choice is preferred by Zotero? Is there any disadvantage of choosing the naming of the files using ZotFile?

    I see that the ZotFile plugin is not moved to the "Unmaintained" plugins, so it may still be used by many users. So I guess other users probably also like this ability to control the naming of the files.
    If adding a warning in the software is too cumbersome, may I add a warning on the plugins page saying something like: "Using this plugin may interfere with importing from Mendeley."?
    https://www.zotero.org/support/plugins
    Yes, of course. Zotero manages this for you.
    To clarify this point, can you confirm that Zotero will sync both the attachment title and filename between computers?
  • If this is not something encouraged by Zotero, what would you suggest that could achieve nicely formatted file sharing by email?
    Zotero already names files automatically — files from the web in all cases and files you add manually if you haven't disabled "Automatically rename attachment files using parent metadata" in the General pane of the preferences. The default format is "Creator - Year - First 50 characters of title.pdf", which you can see with right-click → Show File with ZotFile disabled. ZotFile provides additional customization options, but we'll be adding similar options to Zotero in an upcoming version.
    To clarify this point, can you confirm that Zotero will sync both the attachment title and filename between computers?
    Yes. The point of using stored files is that Zotero takes care of them for you.
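
    As an illustration of the default pattern quoted above ("Creator - Year - First 50 characters of title.pdf"), here is a small sketch. The function name `defaultFilename` and its parameters are hypothetical, and the sketch only mirrors the pattern as described in this reply; how Zotero itself handles multiple creators, empty fields, or characters that are invalid in filenames is not covered here.

```javascript
// Build a filename following the pattern described in the reply:
// "Creator - Year - First 50 characters of title.pdf".
// Empty parts are skipped; that choice is an assumption of this sketch.
function defaultFilename(creator, year, title, ext = "pdf") {
  const shortTitle = String(title || "").slice(0, 50);
  const parts = [creator, year, shortTitle]
    .map((p) => (p === undefined || p === null ? "" : String(p)))
    .filter((p) => p !== "");
  return parts.join(" - ") + "." + ext;
}

// Example call with placeholder metadata (hypothetical values):
const example = defaultFilename("Author", "2020", "A title that is short");
console.log(example);
```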
  • I have finally made a new clean import from Mendeley without ZotFile. I do not see any duplicates anymore, so this is consistent with pointing to ZotFile as the culprit.

    Some other remarks on the naming of the files. From what I can see:
    1) The naming of the files appears to follow the pattern you describe only for some of the files: Zotero_FileName_01.png. I don't really understand what happens for the others. Could it be related to the crash during import, or is it the expected behaviour? Is it something broken that can be repaired manually?

    2) The "Attachment title" seems to be "PDF" only when there is a single pdf file attached: Zotero_FileName_01.png. Whenever there are multiple attached files, none of them is called "PDF".

    3) I have seen that I can change manually the "Attachment title" by right click and "Rename files from parent metadata". But I can only do it if I only select the attachment files. Just selecting one parent item and the option disappears. Is there any way to rename the "Attachment titles" systematically in the library?

    4) The "Attachment title" does not seem to be consistent with anything, neither the filename nor the file opened by default on double-click. I could understand that some people want to have a different "Attachment title" from the "filename". But I do not really see the value of the default settings implemented in Zotero.

    5) I thought that the "Attachment title" called "PDF" could help me identify which file is opened when I double click on the parent item. But that does not seem to work.
    Then I thought that the order of the attachments could give this information, with the first pdf file being the one opening by double click. Again, this is not correct: double clicking on the item "von Kármán Vortex Street within an Impacting Drop" opens the last pdf file listed.

    It seems that the "Attachment title" and the "filename" can both be set manually in Zotero, but Zotero does not provide many tools to manage them systematically. So I guess ZotFile is still the only option at the moment to change the filename?
  • I don't remember exactly what we do with filenames for Mendeley imports. It's possible we just keep the filename as is. You can rename existing files based on parent metadata from the attachment context menu (which will also set the attachment title to the filename, though whether it should is debatable).
    The "Attachment title" seems to be "PDF" only when there is a single pdf file attached: Zotero_FileName_01.png. Whenever there are multiple attached files, none of them is called "PDF".
    Yes, that's right — if there's a single file in Mendeley with a filename ending in .pdf, the attachment created is called "PDF".
    The "Attachment title" does not seem to be consistent with anything, neither the filename nor the file opened by default on double-click.
    I don't know what you mean by that. As I say, it's set by the translator based on what exactly is being saved. Usually that's something like "Full Text PDF". Mendeley import is a totally different situation, and there's not much we can do other than "PDF" or the existing filename.
    I thought that the "Attachment title" called "PDF" could help me identify which file is opened when I double click on the parent item. But that does not seem to work.
    No, the tab uses the parent item title. Most people don't have multiple files per item, so this generally isn't an issue, but it would probably make sense to show either the attachment title or the filename when opening a secondary file. I've created a ticket for that. There are also requests for an option to just use the filename in all cases.
    double clicking on the item "von Kármán Vortex Street within an Impacting Drop" opens the last pdf file listed.
    Double-clicking the parent item opens the first-added PDF that matches the URL of the parent item, followed by the first-added PDF that doesn't match the parent, followed by other things. The Mendeley data model doesn't associate URLs with files, so it's a bit of a special case.
    Zotero does not provide many tools to manage them systematically. So I guess ZotFile is still the only option at the moment to change the filename?
    No. Again, Zotero automatically renames files based on parent metadata when you save via translators or drag local files onto existing parent items, and you can rename existing files from the context menu. You don't need ZotFile to rename files if you're happy with the default filename format.

    One thing Zotero doesn't do is keep the primary attachment filename updated as you change metadata, but we'll likely add that in a future version. For now you would just have to run Rename File from Parent Metadata again if you changed the title, year, or creator.
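
    The double-click order described in this reply can be sketched as a ranking function. This is only an illustration of the order as stated here, not Zotero's actual implementation: the function name `pickAttachmentToOpen`, the record shape `{id, isPDF, url, dateAdded}`, and the use of exact string equality for "matches the URL of the parent item" are all assumptions.

```javascript
// Pick the attachment a double-click on the parent would open, following the
// order described in the reply:
//   rank 0: first-added PDF whose URL matches the parent item's URL
//   rank 1: first-added PDF that does not match
//   rank 2: anything else
// Exact URL equality is an assumption of this sketch.
function pickAttachmentToOpen(parentURL, attachments) {
  const rank = (att) => {
    if (att.isPDF && parentURL && att.url === parentURL) return 0;
    if (att.isPDF) return 1;
    return 2;
  };
  // Oldest first, so ties within a rank go to the first-added attachment.
  const byDate = [...attachments].sort((a, b) =>
    a.dateAdded < b.dateAdded ? -1 : a.dateAdded > b.dateAdded ? 1 : 0
  );
  let best = null;
  for (const att of byDate) {
    if (best === null || rank(att) < rank(best)) best = att;
  }
  return best; // null when there are no attachments
}
```

    Under this reading, attachments imported from Mendeley carry no URL, so every PDF lands in the second rank and the choice falls to Date Added rather than to the order shown in the items list. That would be consistent with a double-click opening a PDF that is listed last.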
  • edited July 31, 2022
    Thank you for the explanations. I understand that Zotero has some standard behaviour for the "Attachment title" and the "filename" that are set during import through Zotero tools. But, the current limitations are:
    1) The user does not have any systematic control over the "Attachment title" and the "filename". Manual edits and "Rename File from Parent Metadata" are not practical to update these names over a large existing library.
    2) The standard behaviour of Zotero is fixed at the time of import, and only applies to entries added through Zotero's own tools. This means that, due to point 1, the user does not have any way to address the naming issues produced by a Mendeley import or by metadata updates.
    3) Zotero makes some kind of automatic decision on which attachment is the primary attachment at the time of import. Again, the user does not have any control over it.

    Your main answer to these points is that you do not think that file management issues are important features for most users, considering that Zotero is mostly focused on reference management, organised around metadata rather than files. You do agree that some minor improvements may be useful and will probably be implemented in the future.
    I would still argue that Zotero is de facto a file management tool, even if it was not its primary goal originally. The current naming of the "Attachment title" and the "filename" are part of this file management tool. I think it would make sense to have a global strategy on how the file management issues should be handled by Zotero, and eventually give control to the user on what they prefer.
    These improvements would be very useful for me, to be able to share files outside Zotero, and organise better the many entries with supplementary files.

    Making the default behaviour of Zotero more consistent would be the first step. But it will probably not fit the expectations of all users. So providing systematic tools to control these issues would probably be very helpful for adapting to the wide range of users' workflows.

    Deciding on what should be used in the tab title for secondary file attachments can only have a very limited impact if you do not have much control over the information you want to use for that, either filename, attachment title, or any other field that you cannot control systematically. Most importantly, changing the behaviour of the tab title for "secondary file attachments" will probably make the problem even worse if the user cannot even decide in the first place what is the "primary" file and what are the "secondary" files.
  • Your main answer to these points is that you do not think that file management issues are important features for most users
    No, that's not my answer. My answer is that we get literally thousands of feature requests, and we have hundreds of features that have been planned for years, and the ones that are important to you after importing a huge library from Mendeley with lots of secondary child attachments are not necessarily going to be the ones prioritized over other things.

    Zotero has always renamed new files and set attachment titles automatically in regular usage. It just works for most people. An import is not regular usage.

    Various things will likely happen in the future. These include:

    - Additional filename customization options, similar to ZotFile
    - Automatic renaming of the primary file when parent metadata is changed
    - The ability to change the primary file (for the rare case where the primary file isn't added first, or weird import edge cases)
    - The ability to use things other than the parent item title for tab titles
    - Fixing "Rename File from Parent Metadata" to not change the attachment title
    - Perhaps the ability to use the filename as the attachment title, if only because a lot of people don't even realize that Zotero has always renamed files itself

    These have all been planned for a long time. You don't need to make the case for them or make sweeping statements about our not caring about file management, which I feel like you're getting mostly from a specific philosophical disagreement over a watch-folder feature.

    But for now, if you have a huge library with some unusual features imported from a different tool with a totally different data model, you're going to have to either do some manual work to clean it up or care a bit less about the specifics.
  • Thank you very much for your reply.
    I appreciate that these issues are known and will likely be fixed in the future.
    And I understand that development priorities are better decided from the larger perspective of all users' feedback, with the issues I describe having a lower priority at the moment. Hopefully my feedback can play a small role in that process.