Is there a built-in function to "find broken & duplicate attachment links"?

I had to merge some duplicate bibliographies in Zotero, all of which had correct PDF and note attachments for the paper.

I found that after the duplicate bibliographies were merged, the PDF and note attachments were all gathered under the remaining bibliography. Now the PDFs and notes themselves are duplicated... And the Zotero storage folder is now about 1 GB with 300 bibliographies...

So I used third-party software to remove the duplicate PDFs directly from the Zotero storage folder. Matching on file size and CRC32, it deleted most of the duplicate PDFs. (For the same paper, PDFs with different annotations inside were kept because their size and CRC32 differ, but I don't know which ones they are.) Unfortunately, I now have a lot of broken attachment links...
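The size-and-CRC32 matching described above can be sketched roughly as follows. This is only an illustration of the technique, not the third-party tool that was actually used: the function names and the `storage_dir` parameter are hypothetical, and the sketch only reports candidate duplicates without deleting anything, since removing files from the storage folder by hand is what produced the broken links.

```python
import zlib
from collections import defaultdict
from pathlib import Path


def crc32_of(path: Path, chunk_size: int = 1 << 20) -> int:
    """Return the CRC32 of a file, reading it in chunks to limit memory use."""
    crc = 0
    with path.open("rb") as fh:
        while chunk := fh.read(chunk_size):
            crc = zlib.crc32(chunk, crc)
    return crc


def find_duplicate_pdfs(storage_dir: Path) -> list[list[Path]]:
    """Group PDFs under storage_dir that share both size and CRC32.

    storage_dir is an assumed path to a copy of the storage folder.
    Nothing is deleted; the groups are only returned for review.
    """
    # First pass: bucket by size, so checksums are computed only when needed.
    by_size: dict[int, list[Path]] = defaultdict(list)
    for pdf in storage_dir.rglob("*.pdf"):
        if pdf.is_file():
            by_size[pdf.stat().st_size].append(pdf)

    # Second pass: within each size bucket, bucket again by CRC32.
    groups: dict[tuple[int, int], list[Path]] = defaultdict(list)
    for size, paths in by_size.items():
        if len(paths) < 2:
            continue
        for pdf in paths:
            groups[(size, crc32_of(pdf))].append(pdf)

    return [sorted(group) for group in groups.values() if len(group) > 1]
```

As noted in the post, two copies of the same paper with different annotations will not match this way, because both their size and their checksum differ.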

I hope there could be a built-in function that would:
1. Scan your storage for missing attachments and possible duplicates.
2. Directly delete the broken attachment links.
3. Let users compare and choose which version of a PDF or note should be kept, then delete the discarded PDFs and notes.

In fact, the broken attachment links are useless. We can download the PDF again only if we know which bibliography has no attachment. Meanwhile, the duplicate PDFs are a real pain: they increase the size of the Zotero storage folder and make it unclear which PDF or note attachment is the latest version. Please just let users choose the right version of the PDFs and notes. This would save Zotero development time and allow users to review their notes and results.
  • I'm currently resurrecting the storage scanner plugin.
  • I believe the function is necessary for Zotero, as most users migrate data from other software to Zotero and end up with a lot of duplicates and broken links.

    @emilianoheyns Are the 3 listed functions critical and necessary? I think this kind of interaction logic is more urgent, but is it easy to implement at the code level?
  • Offtopic: Ooh @emilianoheyns I'm really looking forward to that. Especially if it can also scan for stray files in the storage directory, which is the problem that's most frequent in my >10k items library.
  • (the expectation & recommendation is to not muck about with the data directory at all, which also means there wouldn't be a need for this, so this almost certainly won't become part of standard Zotero)
  • Thanks to the effort of @emilianoheyns !! The new version could solve the issue.
    https://github.com/retorquere/zotero-storage-scanner

    I believe that this function is necessary, since the way Zotero handles broken file links and duplicates is somewhat unfriendly to researchers who hold lots of references.

    P.S. My colleagues in my research team refuse to use Zotero. After I introduced Zotero to them, we all agreed that reference management software should be a tool with the essential functions already included once the standalone version is downloaded. Newcomers want to just download the software (such as the cracked version of Endnote X8), transfer their library to it, and continue reading their references in the new software within 10 minutes.

    Now the software began to show its ambitions, trying to contain all in the field of the knowledge management, including personal blog and site, the data interface to the APIs, different programming languages. However, for researchers like students and professors, they just need to find the literature, read the literature, write the notes, and then insert literature into the MS Word as required by most of the publisher and conferences. The necessary functions and plugins in this work flow should be refined and integrated in the standalone version.

    Development of the software should focus on deeper, more efficient functions for its specific users. Look at Endnote X8. There are no plugins to add, but you can leave all the problems of references and PDFs to it. It prevents attachment failures. Your job is to search for keywords and write notes about them, which is enough for authors.

    A version that tries to meet all requirements is destined to lose both casual and core users, leaving only a few hobbyists interested in programming. Because the quality of the various plugins is uneven, a more focused software will take most of the non-professional users away once the enthusiasm is gone. I sincerely hope that Zotero can do better on the standalone version itself. Thanks for your selfless efforts!
  • edited January 10, 2018
    @adamsmith — based on my experience, stray files do tend to accumulate in the /storage/ folder, most commonly because of attach or rename operations gone wrong, and because the 'missing attachment' popup provides no information about the folder or file it expected to find.

    For instance, doing an advanced search for all PDF attachments in my library, I find 10047 items. Outside Zotero, if I locate all PDFs in the /storage/ folder, I find 11925 files. So over ~10 years, I have accumulated about 1900 stray PDFs. Not too big a worry in terms of file size etc. (especially since unlinked stuff doesn't sync) but it would be great to be able to clean that up.
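The comparison above (11925 files on disk minus 10047 attachment items, so 1878 strays) could be turned into a per-file report along these lines. This is a sketch under stated assumptions: it assumes each stored attachment lives in its own subfolder of /storage/ named after the attachment key, and that a set of known attachment keys has already been obtained from Zotero by some other means. The `compare_storage` function and the `known_keys` parameter are hypothetical, and the sketch only reads the folder.

```python
from pathlib import Path


def compare_storage(storage_dir: Path, known_keys: set[str]) -> tuple[list[Path], list[str]]:
    """Compare PDFs on disk against attachment keys known to the library.

    Returns (stray, missing):
      stray   - PDFs whose parent folder name is not in known_keys
      missing - known keys that have no PDF on disk (broken links)
    Assumes a layout of storage_dir/<key>/<file>.pdf; nothing is modified.
    """
    on_disk: dict[str, list[Path]] = {}
    for pdf in storage_dir.glob("*/*.pdf"):
        on_disk.setdefault(pdf.parent.name, []).append(pdf)

    stray = sorted(
        pdf
        for key, pdfs in on_disk.items()
        if key not in known_keys
        for pdf in pdfs
    )
    missing = sorted(known_keys - set(on_disk))
    return stray, missing
```

Any actual cleanup based on such a report would still be better done from within Zotero, for example through a plugin, so that the database stays consistent with the files.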
  • @kld123509945 While your requirements are real and adding them to Zotero will improve the software, based on my experience I can say Zotero does the job as well as any other reference management software, and often better. Specifically, the part where you say:
    "
    However, for researchers like students and professors, they just need to find the literature, read the literature, write the notes, and then insert literature into the MS Word as required by most of the publisher and conferences. The necessary functions and plugins in this work flow should be refined and integrated in the standalone version.
    "

    Zotero is VERY skilled at this. I often hear complaints about EndNote at my research organization. The few people who use both EndNote and Zotero (in the way it should be used) almost never go back to EndNote.

    As for transferring libraries, I have heard EndNote is worse than Zotero.
  • @pjweiss I believe so. The page at https://www.zotero.org/support/plugins should be updated to reflect the plugin update on GitHub.
  • @mark the stray PDFs become a serious problem because we usually synchronize the /storage/ folder via OneDrive/Dropbox/iCloud, etc. The growing size of the folder makes synchronization between devices harder. We also have to buy more storage space for online backup.
  • edited January 13, 2018
    @gurdas Well, the experience with Zotero is better than with Endnote X8 for some specific functions, such as the three major functions of the ZotFile plugin. It's really amazing and unique. I really NEED to extract the annotations from the PDFs automatically, which increases my reading efficiency. That's the reason I stick with Zotero, and I am also VERY willing to persuade my colleagues to switch to Zotero.

    However, let's focus on Endnote X8, the version released in 2016. Here are some of its basic functions.

    1. Delete the attachment when deleting the bibliography
    In Zotero, we have to use ZotFile, as it is really a core plugin for the standalone version of Zotero. PDF attachments added by ZotFile are shown as attachments which are linked to a PDF in the /storage/ folder. When I delete a bibliography from the Trash collection, the PDF attachment is not deleted.
    In Endnote X8, all the attachments are deleted when you delete the bibliography.

    2. Auto-index the PDF without failure
    In zotero, some PDFs which are downloaded in the ACS publications could not be indexed. See my discussion https://forums.zotero.org/discussion/69771/unindexed-pdf-attachment-cannot-be-indexed-by-click-the-indexing-button-the-green-one#latest
    In Endnote X8, the bibliographies listed in my discussion are correctly indexed and searchable.

    3. Find Duplicates and Broken attachment links
    In zotero, the function of “Find Duplicates” is deficient and “Find Broken attachment links” is missing, as discussed above. The crude merging of the fields of the duplicated bibliographies creates a lot of duplicate attachments.
    In Endnote X8:
    3.1. Go to Menu - References - Find Duplicates. A popup window lists the duplicated bibliographies one by one. You can keep and modify one version of the duplicated bibliography in the window, and then remove the other one.
    3.2. Go to Menu - Tools - Find Broken attachment links. Endnote X8 will search for the stray links and remove them directly.

    4. Highlight the search terms in the results
    In zotero, when you search some phrases such as “mass transfer coefficient” and “CO2 dissolution”, the words are broken up so that I cannot judge whether the search results are suitable. The search terms are not highlighted in the results list, so I have to open the items one by one to check.
    In Endnote X8, when I search for phrases such as “mass transfer coefficient” and “CO2 dissolution”, the software highlights the phrases in yellow wherever they appear in the fields of the listed bibliographies (even in the Notes field). Then I can judge the references at a glance from the search results.

    5. Insert citation in the standalone version.
    In zotero, when I want to insert a citation into the MS Word, I have to copy the name of the reference and turn to the MS Word to search the name in the Quick Format bar. See my discussion https://forums.zotero.org/discussion/69318/add-an-icon-of-insert-citation-and-search-paper-more-efficiently#latest
    In Endnote X8, I just click the Insert Citation button in the menu, and the bibliography is inserted into MS Word. Endnote X8 simply uses the front-most document.

    In general, we identify research hotspots with Web of Science, search for papers in Google Scholar, download PDFs with Sci-Hub, read PDFs in Foxit Reader and write papers with MS Word, as templates are provided by most of the journals. These tools all have a low learning curve and a high degree of integration. Now it is 2018. For zotero, as a reference management software, it lacks of some essential function comparing with its major competitors. Though the plugins are colorful, the developers' enthusiasm for plugin updates is ephemeral. In addition, too many users are fickle. A highly integrated standalone version is appropriate for most researchers. If they decide to stick with Zotero after a trial, they will no longer need to transfer their libraries between programs.


  • From https://www.zotero.org/support/forum_guidelines#etiquette:
    Different people use Zotero in different ways. Not everybody shares your priorities, sometimes you're the only person experiencing a particular problem, or maybe you just didn't discover the best way to perform a task. Avoid generalizing statements like "it's obvious that without this feature, Zotero is useless for anybody".
    It's fine to explain what features you'd personally find useful — of which some will surely be implemented — but please don't assume that they'll be higher priorities than the thousands of features people have requested in Zotero over the last decade. If you find that EndNote's current workflow or feature set works better for you, you can of course use that instead.
    "1. […] When I delete a bibliography from the Trash collection, the PDF attachment is not deleted."
    Because you're using linked attachments. Zotero never modifies files outside the data directory, by design, because it has no idea what you're doing with them. Stored files are deleted when you delete the attachments from Zotero.
    "2. […] some PDFs which are downloaded in the ACS publications could not be indexed"
    I only tested one of the PDFs you provided, but it worked fine for me. We can follow up in the other thread.
    "3. […] In zotero, the function of “Find Duplicates” is deficient and “Find Broken attachment links” is missing, as discussed above."
    As adamsmith says, you're not meant to modify the 'storage' data directory directly. Anything that changes 'storage' should be done from within Zotero (e.g., as a plugin) so that it can change the database as well.

    Better handling of identical files when merging is planned, but it's far from simple, because of the possibility of different metadata or different notes. And many downloaded PDFs are watermarked, so they wouldn't match anyway.
    "4. […] In zotero, when you search some phrases such as “mass transfer coefficient” and “CO2 dissolution”, the words are broken up so that I cannot judge whether the search results are suitable"
    If you put them "in quotes" then they'll only show up when they match as phrases.

    Richer search results with snippets are planned but for technical reasons can't happen for a while.
    "5. […] In zotero, when I want to insert a citation into the MS Word, I have to copy the name of the reference and turn to the MS Word to search the name in the Quick Format bar."
    You certainly don't need to go to Zotero first and copy names. Just type the name of the thing you're trying to cite. If you prefer to browse by collection, you can use the classic view, or make that the default. As we say in the linked thread, a collection browser will likely be integrated into the Quick Format bar in a future version.
    "For zotero, as a reference management software, it lacks of some essential function comparing with its major competitors."
    Just to set your expectations appropriately, with the exception of a collection browser in the Quick Format bar, nothing you've mentioned in this thread is anywhere close to a high priority relative to other things.
  • edited January 17, 2018
    @dstillman I apologize for my expectations. I find it difficult to stay neutral in the discussion, as I am always enthusiastic about Zotero.

    Based on the current version, is there any way to check which file attachments are broken?

    Or is there any way to convert the PDF links created by ZotFile back into PDF attachments (where the attachment file type is PDF)?