So many duplicate attachments created by merging duplicate references

hulalalb · December 17, 2019

I tried Zotero on Mac and loved it, so I decided to abandon EndNote and continue with Zotero on CentOS. Following the instructions (https://www.zotero.org/support/kb/importing_records_from_endnote), I spent a whole day selecting and merging the duplicate references from EndNote, clicking again and again like a robot... After that I downloaded ZotFile and planned to rename them all. Then I found that merging had created a lot of duplicate attachments. Why doesn't Zotero keep only the one attachment I actually need? I was planning to sync these references to a network disk tonight, but I'm so tired. What should I do next? Any hero?

hulalalb · December 18, 2019

But I looked into the storage directory (/home/[USER]/Zotero/storage) and found no duplicate PDF files within each subdirectory, so I cannot delete the duplicate attached PDFs with a script. What next?

dstillman · December 18, 2019

There's no automated way to do this currently. Merging duplicates doesn't currently try to deduplicate attachment files, and even if it did, it often wouldn't work because the files would be slightly different due to watermarks, etc. We'll likely implement deduplication of exact file matches at some point, though, so if the files are in fact identical (e.g., exact same file size) you may want to just ignore this until then.

hulalalb · December 18, 2019

@dstillman Thanks for the explanation. I was wondering where those duplicated PDFs or links are stored, and whether we could delete them directly.

dstillman · December 18, 2019

Attachments are all stored in separate directories, which you can see with Show File.

brianklahn · July 2, 2022

Well . . . I guess that's one way to "encourage" people to pay for storage: Don't design simple data redundancy mitigation methods into the software.

~/Zotero> jdupes --summarize --recurse storage
Scanning: 43308 files, 1099 items (in 1 specified)
34096 duplicate files (in 2467 sets), occupying 401 MB

Most of those duplicates are probably not PDF files. But the ones taking up the bulk of the wasted space are.

jdupes is fast, partly because it has smart heuristics that take advantage of the file system. But just a simple (file size check then) hash would be quick and easy. Then maybe have options for creating symlinks, prompting for which copy to keep, or applying a default policy.
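
To illustrate the "file size check, then hash" idea, here is a minimal Python sketch. It is not how jdupes or Zotero does it; the function names `file_digest` and `find_duplicates` are made up for this example, and the default `storage` path simply assumes the script is run from the Zotero data directory. It only reports sets of identical files and does not delete or symlink anything.

```python
import hashlib
import os
import sys
from collections import defaultdict
from pathlib import Path


def file_digest(path, chunk_size=1 << 20):
    """Return the SHA-256 hex digest of a file, reading it in chunks."""
    h = hashlib.sha256()
    with open(path, "rb") as f:
        for chunk in iter(lambda: f.read(chunk_size), b""):
            h.update(chunk)
    return h.hexdigest()


def find_duplicates(root):
    """Return lists of paths under root whose contents are byte-identical."""
    # Step 1: group by size, which is cheap and rules out most files.
    by_size = defaultdict(list)
    for dirpath, _dirnames, filenames in os.walk(root):
        for name in filenames:
            path = Path(dirpath) / name
            if path.is_file() and not path.is_symlink():
                by_size[path.stat().st_size].append(path)

    # Step 2: hash only the files that share a size with another file.
    duplicate_sets = []
    for paths in by_size.values():
        if len(paths) < 2:
            continue  # a unique size cannot have a duplicate
        by_hash = defaultdict(list)
        for path in paths:
            by_hash[file_digest(path)].append(path)
        duplicate_sets.extend(sorted(g) for g in by_hash.values() if len(g) > 1)
    return duplicate_sets


if __name__ == "__main__":
    # Hypothetical usage: python find_dupes.py ~/Zotero/storage
    root = sys.argv[1] if len(sys.argv) > 1 else "storage"
    for group in find_duplicates(root):
        print(f"{len(group)} identical files:")
        for path in group:
            print(f"  {path}")
```

Grouping by size first means most files are never read at all; only files that share a size get hashed.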

I've frequently donated both money and developer time to OSS projects. But it's practices like this that give me a little pause. It should be pretty trivial for Zotero to know when it has needless redundancy in its artifacts.

adamsmith · July 2, 2022

> But just a simple (file size check then) hash would be quick and easy.

Zotero implemented that on merge several months ago.

brianklahn · July 3, 2022

Does it take "several months" for things to make it into releases?
I don't see this working in the latest version (6.0.9).

Can you point me to the merge commit?

Thanks

adamsmith · July 3, 2022

https://github.com/zotero/zotero/commit/ef82becf004e8192a29ca04a0d63432aaf9b7e85

dstillman · July 4, 2022

(To be clear, this was included in Zotero 6.0 in March.)