I don't know how I managed to produce this, but I found out that many of my attached PDF files are stored several times in the storage folder (blowing up its size). I'm not talking about duplicated database entries: the attached PDF for a single database entry is stored up to four times in the storage folder (each copy in a different subfolder). When I use "Show File" from within Zotero, I'm directed to one of the four files. I don't know if the other three are still linked to Zotero in any way.
Does this sound familiar to anyone? Is there a way to get rid of the extra files?

Sorry, I first posted this in General, but I think the 1.5 Beta forum is the right place.
  • Am I really the only one with this problem? Has anyone else checked their storage folder for duplicates?
  • I don't really understand what you mean. Some example folders and file names would be helpful. If you're talking about different randomly named folders directly under 'storage', then they're not linked to the same item.
  • Sorry for not making myself clear ... I'll try with an example.

    Let's say I have this one paper added to Zotero:
    Riedl, J. et al., 2008. Lifeact: a versatile marker to visualize F-actin. Nat Meth, 5(7), 605-607.
    As an attachment I have saved the PDF file. If I click on "Show File" for the PDF, an Explorer window opens and shows me the PDF stored in this place:
    \zotero\storage\RFE3HA5A\Riedl et al.pdf

    However, if I search in \zotero\storage\ for Riedl et al.pdf, I get the following result:
    \zotero\storage\M7F745MP\Riedl et al.pdf
    \zotero\storage\RFE3HA5A\Riedl et al.pdf
    \zotero\storage\RGTBXTE6\Riedl et al.pdf
    \zotero\storage\US3T6TRK\Riedl et al.pdf

    That means the PDF is stored four times (this is true for many, but not all, of my attachments). In Zotero, however, this paper is only entered once.
    I hope the problem is clear now. Is there a way to delete all the redundant data (automatically)?

    Thanks a lot. Martin
  • edited March 24, 2009
    I ran 'fdupes' on my somewhat smaller library (ca. 500 items, most of which have PDF attachments) that I use every day. This database existed in some form before the sync plugin & has been used on multiple platforms, both with rsync and with the sync plugin. I've sometimes added dups to it (either by overlooking what was already local or by not syncing early/often enough). So, I'd expect it to be "dirtier" than even my other, larger libraries.

    I found only five duplicated PDFs. I haven't been following this closely, but most seem to be my fault rather than Zotero defects. The repeated PDFs are:

    from a reference retrieved via unAPI: three files with the same creation date and name. I see that two are currently attached to the same Zotero item & have access dates that are 30 seconds apart (from September of last year). The third shows up as being unattached to any item.

    two references downloaded via ScienceDirect where I do have duplicate items (these were both most likely my fault & I will be deleting one of each of them). One of these references is the only example I have where the same file has a different name.

    one reference from Wiley InterScience, added just this month, that has the same filename & date. I only see one item with one attachment. This was likely a dup that was my fault & I deleted the other one. EDIT: yes, it is still in the trash, with the second attachment.

    A 'test.pdf' from Oct of last year that I used for testing. This was an item that I manually added twice & did not delete.

    In summary: I don't personally have that many dups (1%) & most of my problems seem to exist between the keyboard and chair. This is certainly an area where the upcoming duplicate detection could help. An explicit filename search could also be useful (as has been called for before); searching by title didn't always work, as the filename can differ from the Zotero title for the item.

    You might check your trash to see if the items are there. Also try searching for titles that match your filenames, to see whether multiple items turn up.
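
For readers without 'fdupes', the check described above can be approximated with a short script. This is only a sketch of the general idea (group files by a hash of their contents), not how fdupes itself is implemented; the function name `find_duplicate_files` is made up for this illustration.

```python
import hashlib
import os
from collections import defaultdict


def find_duplicate_files(root):
    """Return groups of paths under `root` whose file contents are identical.

    Files are grouped by the SHA-256 digest of their contents, so copies are
    found even when they sit in different subfolders or have different names.
    """
    by_hash = defaultdict(list)
    for dirpath, _dirnames, filenames in os.walk(root):
        for name in filenames:
            path = os.path.join(dirpath, name)
            digest = hashlib.sha256()
            with open(path, "rb") as fh:
                # Read in 1 MiB chunks so large PDFs are not loaded whole.
                for chunk in iter(lambda: fh.read(1 << 20), b""):
                    digest.update(chunk)
            by_hash[digest.hexdigest()].append(path)
    # Keep only digests that occur more than once.
    return [sorted(paths) for paths in by_hash.values() if len(paths) > 1]
```

Calling `find_duplicate_files` on the storage folder returns one list per set of identical files. It only reports duplicates; it does not tell you which copy Zotero actually links to, so nothing should be deleted on the basis of this output alone.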
  • I didn't want to blame Zotero .... I think the most likely reason for the mess is some interrupted syncs or something like that. Or it's my fault. Also not unlikely.

    I just wanted to know if there is a way to get rid of the duplicates without losing the real data, and without manually checking, for every duplicate, which of the files is the one used by Zotero.

    I did check the trash. Empty.
    And the items have not been added to Zotero under different names or anything like that.
  • There's currently no way within Zotero to purge orphaned files from the local storage folder, but we can probably add a feature to do that (not accessible from the UI, since this shouldn't actually happen).

    A technical solution would be to open the Zotero SQLite database in an SQLite client, generate a list of the keys in the 'items' table, generate a list of the directories in the storage folder, compare the two (e.g., do a diff), and then remove any directories in the latter list that don't appear in the former.
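
The comparison described above could be sketched roughly as follows. This is an illustration under assumptions, not a supported Zotero tool: it assumes the 'items' table has a column named `key` whose values match the storage subfolder names, and the function name `find_orphan_dirs` is made up here. It only lists candidate folders and deletes nothing; work on a backup copy of the database with Zotero closed.

```python
import os
import sqlite3
from pathlib import Path


def find_orphan_dirs(db_path, storage_dir):
    """List storage subfolders whose names match no key in the 'items' table.

    Assumption: the 'items' table has a column called "key" holding the
    same identifiers that are used as folder names under 'storage'.
    """
    # Open the database read-only so it cannot be modified by accident.
    uri = Path(db_path).resolve().as_uri() + "?mode=ro"
    conn = sqlite3.connect(uri, uri=True)
    try:
        keys = {row[0] for row in conn.execute('SELECT "key" FROM items')}
    finally:
        conn.close()

    # Collect the names of the directories directly under the storage folder.
    dirs = {
        name
        for name in os.listdir(storage_dir)
        if os.path.isdir(os.path.join(storage_dir, name))
    }
    # Folders with no matching item key are the candidates for removal.
    return sorted(dirs - keys)
```

The returned list is the "diff" mentioned above: folder names present on disk but absent from the database. Reviewing a few of them by hand before removing anything is a sensible precaution.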
  • I've had plenty of interrupted syncs & don't have the duplicates that you have. I believe that attachment IDs should be unique to each attachment referred to in the database & should also be preserved on sync.

    Did you perform a title search for some of those filenames? For me, many of the duplicated files were hiding a bit...
  • I am also experiencing this problem.
  • @noksagt: I did perform a title search within Zotero. No duplicates found. Just the above-mentioned ones in the storage folder.

    @asplundj: I still don't know how I managed to produce the duplicates, but I got rid of them by exporting the library and reimporting it into a fresh Firefox profile. The size of the storage folder shrank to approx. 1/3 (I can't say exactly, because I used the move to clean some entries I did not need out of the library).
  • edited June 8, 2009
