Pruning database, but cannot find pdf file size

I need to prune my database to bring it under 6GB, but the listing contains no column for pdf size. I should start by deleting duplicates, but it is not clear whether I can simply delete all the duplicates and leave the original intact. Next I can probably delete any really large files like books, but I need file details. I'd like to keep any pdf on which I have annotations. Help would be appreciated. Thanks.
  • I have had no responses to my query. To be more specific, I recently completed a day-long attempt to delete (merge) duplicate files. I had scores of duplicates because I must have doubled up on a bulk transfer. Is there a way to bulk-delete duplicates if they were added on the same day? I thought the deletions looked successful, but I see no reduction in my usage below my 6GB quota, and I cannot sync files with my iPad, so I suspect this is related. Advice would be appreciated. Thanks.
  • A bit hard to say what you did.
    Deleting and merging duplicates are not the same thing, and merging will not remove duplicate attachments (items will keep both attached files to prevent you from losing, e.g., a PDF file with annotations).

    If you've already merged the items, it's going to be tricky to now remove individual files other than by going into the items and removing them individually.

    There's no way to show PDF file sizes in Zotero itself, but you can just run a storage analyzer app on your Zotero data directory and that will show you the largest files.
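
    If no storage analyzer app is at hand, the same listing can be approximated with a short script. This is only a sketch: the function name `largest_files` is made up for illustration, and the path `~/Zotero/storage` is an assumption about where the data directory lives, so adjust it to your own setup.

```python
import heapq
import os
from pathlib import Path


def largest_files(root, limit=20):
    """Return up to `limit` (size_in_bytes, path) pairs, largest first."""
    sizes = []
    for dirpath, _dirnames, filenames in os.walk(root):
        for name in filenames:
            path = Path(dirpath) / name
            try:
                sizes.append((path.stat().st_size, str(path)))
            except OSError:
                continue  # skip files that vanish or cannot be read
    return heapq.nlargest(limit, sizes)


if __name__ == "__main__":
    # Assumed location of the data directory; change it if yours differs.
    storage = Path.home() / "Zotero" / "storage"
    for size, path in largest_files(storage):
        print(f"{size / 1_000_000:10.1f} MB  {path}")
```

    The script only reads file sizes and deletes nothing; any removal would still be done from within Zotero.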
  • Thanks Adam. I much appreciate your response.
    To the new user the help available is a steep learning curve. I didn't pick up the existence of the storage analyser. Even as I Google "Zotero Storage Analyser" now it scores no hit. Maybe its existence could be added to the "Tools" menu? If it depends upon Cloud connection, that topic is not clear.
    It might now be easier for me to go back to my original pdf folders and start again from scratch and to find a way to import them all as a batch process which can be done while I'm asleep.
    I guess my first wishlist item is for the user manual to be clearer on the issue of pruning, specifically to say whether the duplicate list does, or does not, include all the copies, to distinguish from my expectation that the "original" copy is not a duplicate. I searched on pruning and didn't find anything directly relevant, but what I came up with is that in Zotero duplicate management is achieved by merging. So I merged 6000 pdfs+duplicates laboriously, one by one, and that wiped out a day.
    Secondly, I would wish that duplicate management is handled at the importing stage (which I mistakenly thought/hoped it was).
    Thirdly, I would request that the duplicate list show which pdfs have annotations, and that it could be set to begin by deleting duplicates whose metadata is clearly incomplete and/or duplicates without annotations. I have no knowledge of SQL, but in MATLAB I could scan my 50 libraries of pdfs and try to look for duplicates that way. But that sounds like reinventing the wheel?
    Fourthly, I would like a means of handling references (metadata no pdfs = 'vanillas') so that I can routinely record the existence of a paper, even if I cannot currently obtain a copy, and thus make a "to get list" from that.
    Fifthly, I wish for a database which can handle unpublished material for which no pubmed metadata is available; I simply make my own.
    If any of these wishes are already handled and possess links e.g. to the storage analyser which I didn't know existed, I would much appreciate links to the help file or to YouTube tutorials?
    Thanks 1e+06 for your further comments.
  • My central confusion is that I formed the impression that Zotero only keeps one copy of every pdf (implying that it must be doing its housekeeping on duplicates at the time of import) so that you can make multiple libraries with overlapping topics, but still only keep one copy of each pdf. If this is not the case, then that property needs to be made clear at the time a user gets interested in Zotero.
  • edited March 10, 2019
    I didn't pick up the existence of the storage analyser. Even as I Google "Zotero Storage Analyser" now it scores no hit.
    You misunderstood this. A storage analyzer has nothing to do with Zotero — it's just a type of program you can run on your computer, or a specific directory (such as the 'storage' folder within the Zotero data directory), to figure out what folders and files are taking up space. You can then delete those files from Zotero. (You can search for the eight-character 'storage' folder name in Zotero in "All Fields & Tags" mode to find the associated attachment, and then delete the attachment item and empty the trash to delete the folder.)
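
    As a rough illustration of the folder-name lookup, the sketch below totals the bytes under each first-level subfolder of a 'storage' directory, so the heaviest folder names can then be pasted into Zotero's search. The function name `folder_sizes` is hypothetical and the path in the usage comment is an assumption.

```python
import os
from collections import defaultdict
from pathlib import Path


def folder_sizes(storage_root):
    """Map each first-level subfolder name to the total bytes it contains."""
    storage_root = Path(storage_root)
    totals = defaultdict(int)
    for dirpath, _dirnames, filenames in os.walk(storage_root):
        rel = Path(dirpath).relative_to(storage_root)
        if not rel.parts:
            continue  # files directly in the root belong to no subfolder
        key = rel.parts[0]  # the subfolder name to search for in Zotero
        for name in filenames:
            try:
                totals[key] += (Path(dirpath) / name).stat().st_size
            except OSError:
                continue  # skip unreadable files
    return dict(totals)


# Example use (assumed path):
# sizes = folder_sizes(Path.home() / "Zotero" / "storage")
# for key, size in sorted(sizes.items(), key=lambda kv: kv[1], reverse=True)[:20]:
#     print(key, size)
```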
    My central confusion is that I formed the impression that Zotero only keeps one copy of every pdf […] If this is not the case, then that property needs to be made clear at the time a user gets interested in Zotero.
    Nothing in Zotero documentation should give you the impression that that's the case. Some other programs — which also store annotations outside of the PDF in a non-standard manner — will store only one copy of PDFs, but that really has no bearing on what Zotero does. You can easily Show File on two PDFs in Zotero to see that they're separate files, and annotations you make to one — in your PDF reader of choice — have no effect on the other.

    As for clearing your duplicates, if you really just repeated bulk imports, you can sort by Date Added in the middle pane and delete entire ranges of items that way, though if you've already gone through and merged duplicates that wouldn't be as safe an option.
  • Sorry for the multiple takes. I would appreciate a quick response. I get an error message from Papership on my iPad, "Document not found (error 404). Please make sure your library is properly synchronised and that your files are uploaded to the server". Zotero doesn't give me enough feedback to solve the problem. Is the reason that I cannot sync that going through the motions of syncing is not in fact completing? An error message from Zotero on my Mac confirming this would be great.
  • edited March 10, 2019
    We don't make PaperShip, but you're at your storage quota, so if you're trying to open files that haven't uploaded because of that, those wouldn't be accessible in your web library or on other devices. When you're at your storage quota, Zotero displays a warning on every sync telling you that additional files won't be uploaded.
  • It all boils down to economy. I reckon I can work within the 6GB limit, but trying to figure out how to drive Zotero is proving much more expensive in time than paying the extra for unlimited.... if the quota is the basic problem.
  • Thanks. But please forgive me. I didn't understand merging files, and I still do not. What exactly is being merged? It seems a very odd term to use if the object is to remove duplicates. What does merging achieve if it is not to do that?
    Why does the Zotero online help instruct that the manner to remove duplicates is to merge files, if merging them does not reclaim all the storage being used? I laboriously selected one file of often three or four, not being able to look at the file during that selection process, not being able to tell if they had annotations.
    Please recommend the most efficient means of dealing with this? Wiping my database and account and starting again from scratch? If so, what is the standard means of importing an existing database and checking each pdf to see if it already exists amongst those already imported?
    Papership seemed to be the only way I could look at my pdfs stored in Zotero on my iPad. Is this not correct?
    Appreciated.
  • It's not merging files, it's merging top-level entries, the main unit with which Zotero operates. That prevents, e.g., having the same item appear twice when you search or having it appear as two different items when you cite.

    Depending on how easy it is to re-import your existing items, wiping your database may well be the easiest way to go, yes. You'd do so by: 1) Disabling sync, 2) Closing Zotero, 3) Moving your Zotero data folder to a different location, 4) Restarting Zotero, and 5) Using "Restore to Online Library" from the "Reset" tab in the Zotero Sync preferences.
    (There are other ways to do this; this is the safest and most painless one).

    I don't quite understand the question about standard ways of checking for duplicate PDFs. Is there any reason to believe you should have duplicate PDFs after importing?

    (Also, some of your confusion seems to stem from a misunderstanding of what Zotero _is_. It's a reference manager, _not_ a PDF manager, though it allows you to work with PDFs attached to references. But Zotero's model always thinks about references, never about PDFs (or any other file), as the principal unit.)
  • Thanks for these clarifications. The history is only of incidental interest, i.e. my migrating from Qiqqa - which was both a reference manager (brilliant with the exception that it only cited in MSWord) and file manager (poor).

    However, Qiqqa became orphanware and other users chose to migrate to Zotero. In Qiqqa I had close to 10K pdfs organised in 50 folders devoted to separate topics. Many publications belonged to more than one topic, but the system had no way of keeping just one copy of each, so there were hundreds of duplicates. The attraction for me to Zotero was my impression that it only needed one copy of any file, but each could be referred to from multiple topic folders. (That would be my ideal but with Zotero I have evidently dipped out again).

    Fortunately, over the last 35 or so years I have (re)named all my pdf files systematically and uniquely, but without indicating the existence of annotations.

    Last night I merged all 50 folders into a single folder and largely pruned them down to a minimum set (alas losing their folder associations and many annotations).

    Today I will save the original Zotero data folder to HD, and bring the merged data into the MacBook and Restore to Online.

    Zotero seems to have a steeper learning curve than Qiqqa did. I've yet to try citing with it and I'm interested in what other word processors can use it, since I dislike MSWord with a passion. Thanks.
  • The attraction for me to Zotero was my impression that it only needed one copy of any file, but each could be referred to from multiple topic folders. (That would be my ideal but with Zotero I have evidently dipped out again).
    Yes, that's exactly how Zotero collections work, see https://www.zotero.org/support/collections_and_tags#the_zotero_collections_model
    Not sure why you think they don't. I think you're still conflating references and PDFs (e.g., a single reference can have multiple PDFs attached in Zotero. That's what you were seeing after merging duplicates).
    Last night I merged all 50 folders into a single folder and largely pruned them down to a minimum set (alas losing their folder associations and many annotations).
    I don't quite understand why you did that, but hopefully not for working with Zotero, because that obviously wouldn't have been necessary.
  • Time is precious. Smaller database sounds like a basic aim. Certainly not obvious to this user. The Zotero user manual is a hard read particularly when one's mindset is about 39 light-years away from managing a database. Your help is the best I've enjoyed since starting with it. The idea of allowing for multiple pdfs for one reference sounds like bells and whistles for the basic user so the concept of pdf merging to allow for them was beyond this new user's need. Thanks.
  • The idea of allowing for multiple pdfs for one reference sounds like bells and whistles for the basic user
    Really depends on the discipline. Where article supplements are common, e.g., this is quite useful.
  • Of course. For me, switching to Zotero is like changing horses near the end of the race.
  • I'm trying to import my large folder of my pdfs. If I point the data directory to this folder, does Zotero reconfigure them all into its multifolder format? Does it clear out all the files in the previous format?

    Secondly, I'm still confused about duplicates. I'm always going to have the problems of duplicates, because I cannot always be sure whether I've previously loaded in a pdf. Is there a way I can configure Zotero to prune the database of duplicates of the main pdfs relieving the user of that worry?
    Thanks.
  • Short answer to all these questions is no.

    Don't point your data directory to anything other than an empty folder.

    Before you do anything -- what is your preferred outcome in terms of where files are etc.? As I said, Zotero is not a PDF manager. It gives you a fair bit of flexibility about how to store your PDFs and you should decide that first and then we can look at how Zotero can (or maybe cannot) help you with that.

    Second -- the typical way to move from one reference manager to another isn't to just dump your PDFs in. That's not going to give you great results overall, especially if some of them are older. Qiqqa should export to common formats like BibTeX and RIS that Zotero can import, typically including the files.

    Third, duplicates: The only thing Zotero does in terms of duplicates is to show you which _references_ (again, not PDFs) are duplicate and allow you to merge the _references_. If you're concerned about duplicate files, you'll have to individually delete them (how depends on the set-up chosen above).
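
    For the file side of this, one possible way to spot duplicate PDFs outside Zotero is to compare file contents by hash. The sketch below is an assumption-laden illustration, not anything Zotero provides: `file_digest` and `duplicate_pdfs` are hypothetical helper names, and the script only reports groups of identical files without deleting anything.

```python
import hashlib
import os
from collections import defaultdict
from pathlib import Path


def file_digest(path, chunk_size=1 << 20):
    """Return the SHA-256 hex digest of a file, read in chunks."""
    h = hashlib.sha256()
    with open(path, "rb") as f:
        for chunk in iter(lambda: f.read(chunk_size), b""):
            h.update(chunk)
    return h.hexdigest()


def duplicate_pdfs(root):
    """Return lists of PDF paths whose contents are byte-for-byte identical."""
    by_digest = defaultdict(list)
    for dirpath, _dirnames, filenames in os.walk(root):
        for name in filenames:
            if name.lower().endswith(".pdf"):
                path = Path(dirpath) / name
                by_digest[file_digest(path)].append(str(path))
    return [sorted(paths) for paths in by_digest.values() if len(paths) > 1]
```

    Because the comparison is byte-for-byte, a copy whose annotations were saved into the PDF will not match its unannotated twin, so annotated files are not reported as duplicates of clean ones.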
  • Thanks.
    Re-exporting these files is no longer an option. Qiqqa is defunct and no longer runs under the latest version of Windows. I had previously exported them as pdfs.
    I now have a folder "Zotero" on my MacBook (running Parallels) under User:Me, largely cleaned of duplicates, size 9.81GB, for 7779 pdfs.
    I have increased my quota to "unlimited".

    Until I started today's reconfiguration Zotero and Papership were doing a reasonable job of allowing me to see and annotate these pdfs. I had not yet tried to use it to cite.

    Since then I've tried to follow:

    "Depending on how easy it is to re-import your existing items, wiping your database may well be the easiest way to go, yes. You'd do so by: 1) Disabling sync, 2.) Closing Zotero 3.) Moving your zotero data folder to a different location 4.) Restarting Zotero and 5.) Using "Restore to Online Library" from the "Reset" tab in the Zotero Sync preferences."

    It seemed to me that your step 3) had to be run before step 2). But for me 3) is ambiguous. Do you mean literally moving the folder, or just changing the link?

    Restarting Zotero I am confronted with how I disable Sync? There is no "disable Sync button". But I tried unchecking Sync Automatically. When I go to Reset tab I see "Library" with a blank box beside it without any suggestion of how it is to be filled in.
    If it is my login OAERICLE, why is it not already filled in (since I have not unlinked my account)?

    Next "Restore to On-line library". Restore doesn't mean much to me, since I don't know whether Zotero will try to match my file list with existing entries and folder topics. This carries many implicit concepts which are obviously clear to the designers but alas not to me, just wanting to re-import all the files in my folder. There is a little help box suggesting that this will overwrite its content if there is any. I don't care what it does. I'd just like to be able to search and read my files on my iPad.

    I have now copied the original contents of Zotero folder to external HD and created a new folder "Zotero" in its place. Nothing sensible happens.

    My difficulty is that, skimming the help manual, I do not see any general discussion of concepts, like why it places some files in folders and others not.

    Please describe what I need to do from scratch, including if necessary deleting my account, but keeping my credit. I'm sure I will be delighted once it's working. Thanks very much.
  • Thanks for your previous responses.

    You say that Zotero is a reference manager, but not a file-manager, but it will do some file management, but not including removal of duplicates.

    Because of the explosion of scientific literature, this is going to be an ongoing problem for most users like myself. Any Google Scholar search these days throws up dozens of relevant articles, not just one or two, many of which may already be in the user's database. Over years one just cannot remember, so one downloads just in case. Preferably at the time of the search any manager needs to 1) check if an article already exists in the user's database and ideally, inhibit downloading the same file and 2) check if there exist annotations and flag that. So I guess this is a wishlist.

    Meanwhile, Zotero has gone ToZero. To try to become operational again, I have acted to remove the duplicates in my list of files and need to start again, dragging my files into a new database. But it is not clear to me how I do that and Zotero storage is relatively expensive. Some more details about the options available in the sync dialog box would be helpful. Thanks much.

  • What does "Zotero has gone ToZero" mean? You have an empty database? And have you checked that the version online is empty too (click on "My Library" at the top of this page to check)?
    You asked about starting from scratch, so that seems like we're where we'd want to be.
  • I'm sorry for the delay. I've been getting Apple Support for another problem.
    I mean I have no functionality.
    I need to understand the dialog boxes.
  • Can you do screen share?
  • No, but you can upload screenshots somewhere (e.g., Dropbox) and post a link here.
  • I have cleaned out my original Zotero data folder (moved it all to HD) and repopulated it with my duplicate free (largely) set of 7779 files. I told it to sync, it gave me an error message that it was empty, but it seems to be going ahead okay, but I guess will be a while before it completes. The original folders seem to have been preserved. I will have success when they sort themselves out on my iPad via Papership. Is there an alternative to Papership? I couldn't find one. Thanks much.
  • With Zotero seeming to perform so well, and Papership doing a fair job, I upgraded to the top level and completed the upload of my complete library of pdfs. Since then Papership went through the motions of syncing my whole library, taking a couple of hours, and the listing on the iPad is in fact complete after automatic syncing. However, when I try to access any of the later additions it throws an error 404: "File not available to download, please make sure your library is properly uploaded and synced." (I thought it was supposed to have downloaded after all that time and bandwidth.) My problem is that neither Zotero nor Papership provides any decent diagnostics.

    Zotero issue: How do I tell if my Zotero Cloud holding is in fact complete? How do I tell how many pdfs are up there to compare with what shows on my macbook? I've submitted support queries, and received a deafening silence. Can I get a summary comparison of what Zotero thinks is in the cloud and on my MacBook?

    I even uninstalled and reinstalled PaperShip and downloaded the lot a second time. It seems from the comments online that Papership has been orphaned. I wish I'd known that before paying so much for Zotero.

    Zotero MUST have a large number of loyal users using iPads (in my case with top memory). So what does Zotero recommend as a Papership replacement for reading, annotating and resyncing with MacBook? Thanks much.
  • For iPad use, I use Zotfile’s Send to Tablet feature, along with Apple Books, Adobe Reader, or GoodReader on the iPad.
  • Zotero is also developing its own iOS app, though I have no idea if that's going to be months or a year + out.

    Papership isn't actively developed or even supported anymore, so I'd be wary of it.

    You can't easily compare local vs. cloud Zotero storage. The assumption should be that all your files are synced, though, and they normally are if you're not receiving sync errors.
    You can test whether individual files are online by checking if you can open them through your online library.