So many duplicate attachments created by merging duplicate references

I tried Zotero on Mac and loved it, so I decided to abandon EndNote and continue with Zotero on CentOS. Following the instructions (https://www.zotero.org/support/kb/importing_records_from_endnote), I spent a whole day selecting and merging the duplicate references imported from EndNote, clicking away like a robot... After that I downloaded ZotFile and planned to rename all the attachments. Then I found that merging had created a lot of duplicate attachments. Why doesn't Zotero keep only the one attachment I actually need? I was planning to sync these references to a net disk tonight, but I'm so tired. What should I do next? Any hero?
  • But I looked into the storage directory (/home/[USER]/Zotero/storage) and found no duplicate PDF files within any sub-directory. So I can't delete the duplicate attached PDFs with a script. What next?
  • There's no automated way to do this currently. Merging duplicates doesn't currently try to deduplicate attachment files, and even if it did, it often wouldn't work because the files would be slightly different due to watermarks, etc. We'll likely implement deduplication of exact file matches at some point, though, so if the files are in fact identical (e.g., exact same file size) you may want to just ignore this until then.
  • @dstillman Thanks for the explanation. I was wondering where those duplicate PDFs or links are stored, and whether we could delete them directly.
  • Attachments are all stored in separate directories, which you can see with Show File.
  • edited July 2, 2022
    Well . . . I guess that's one way to "encourage" people to pay for storage: Don't design simple data redundancy mitigation methods into the software.

    ~/Zotero> jdupes --summarize --recurse storage
    Scanning: 43308 files, 1099 items (in 1 specified)
    34096 duplicate files (in 2467 sets), occupying 401 MB

    Most of those duplicates are probably not PDF files, but the ones taking up the bulk of the wasted space are.

    jdupes is fast, partly because it has smart heuristics that take advantage of the file system. But just a simple (file size check then) hash would be quick and easy. Then maybe offer options for replacing duplicates with symlinks, with a prompt for which copy to keep or a default policy.

    I've frequently donated both money and developer time to OSS projects. But it's practices like this that give me a little pause. It should be pretty trivial for Zotero to know when it has needless redundancy in its artifacts.
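
For readers who want to see what the "file size check then hash" idea looks like in practice, here is a minimal Python sketch. It is not Zotero's implementation and not how jdupes works internally; the function names `file_digest` and `find_duplicates` are made up for illustration, and the choice of SHA-256 is an assumption. It only reports sets of byte-identical files and does not delete or link anything.

```python
import hashlib
import os
from collections import defaultdict


def file_digest(path, chunk_size=1 << 20):
    """Return the SHA-256 hex digest of a file, reading it in chunks."""
    h = hashlib.sha256()
    with open(path, "rb") as f:
        for chunk in iter(lambda: f.read(chunk_size), b""):
            h.update(chunk)
    return h.hexdigest()


def find_duplicates(root):
    """Return a list of groups (lists of paths) of byte-identical files under root.

    Hypothetical helper: files are first bucketed by size, and only buckets
    with more than one file are hashed, so most files are never read.
    """
    by_size = defaultdict(list)
    for dirpath, _dirnames, filenames in os.walk(root):
        for name in filenames:
            path = os.path.join(dirpath, name)
            # Skip symlinks and anything that is not a regular file.
            if os.path.islink(path) or not os.path.isfile(path):
                continue
            by_size[os.path.getsize(path)].append(path)

    duplicate_sets = []
    for paths in by_size.values():
        if len(paths) < 2:
            continue  # a unique size cannot have an identical twin
        by_hash = defaultdict(list)
        for path in paths:
            by_hash[file_digest(path)].append(path)
        # Keep only hash buckets that actually contain more than one file.
        duplicate_sets.extend(sorted(g) for g in by_hash.values() if len(g) > 1)
    return duplicate_sets
```

Calling `find_duplicates("storage")` from the Zotero data directory would list candidate sets for review. Note that empty files all share the same size and hash, so they show up as one set.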
  • "But just a simple (file size check then) hash would be quick and easy."
    Zotero implemented that on merge several months ago.
  • Does it take "several months" for things to make it into releases?
    I don't see this working in the latest version (6.0.9).

    Can you point me to the merge commit?

    Thanks
  • (To be clear, this was included in Zotero 6.0 in March.)