Disk space efficiency
I recently started using Zotero and like it quite a bit so far. However, Zotero is highly inefficient in its use of disk space. Most of my records come from a handful of websites, so each time I take a snapshot of a paper from site X, all the page decoration elements get saved to the hard drive. I end up with literally hundreds of copies of the same journal logo in my zotero/storage folder, not to mention all the buttons and other fluff. I propose that Zotero stop wasting disk space this way. More importantly, this would make for a smaller (and hence easier) backup.
The solution is simple: for each file, maintain a hash value (for example, md5) and a usage counter. If a newly downloaded file has the same hash value as an existing one, then instead of storing another copy, increment the usage counter and either create a hardlink or change the link to point to the existing copy.
Bottom line: eliminate redundancy in downloaded files to reduce the size of the backup.
Boris
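The hash-and-counter idea proposed above could be sketched roughly as follows. This is only an illustration, not how Zotero stores attachments: the `DedupStore` class, its `pool_dir` layout and the in-memory `use_count` table are all hypothetical names, and SHA-256 is used here in place of the md5 mentioned in the post (any digest would serve). It also assumes the pool and the snapshot folders live on the same filesystem, since hard links cannot cross filesystems.

```python
import hashlib
import os
import shutil


class DedupStore:
    """Hypothetical store keeping one physical copy per distinct file content."""

    def __init__(self, pool_dir):
        self.pool_dir = pool_dir
        # digest -> number of snapshot files sharing that content.
        # Kept in memory only; a real tool would have to persist this.
        self.use_count = {}
        os.makedirs(pool_dir, exist_ok=True)

    @staticmethod
    def digest(path):
        """Return the SHA-256 hex digest of the file at path."""
        h = hashlib.sha256()
        with open(path, "rb") as f:
            for chunk in iter(lambda: f.read(65536), b""):
                h.update(chunk)
        return h.hexdigest()

    def add(self, downloaded_path, target_path):
        """Place a downloaded file at target_path, reusing a known copy if any."""
        key = self.digest(downloaded_path)
        pooled = os.path.join(self.pool_dir, key)
        if key not in self.use_count:
            # First time this content is seen: move it into the pool.
            shutil.move(downloaded_path, pooled)
            self.use_count[key] = 0
        else:
            # Content already stored: discard the redundant download.
            os.remove(downloaded_path)
        # The snapshot gets a hard link instead of its own copy.
        os.link(pooled, target_path)
        self.use_count[key] += 1
        return key
```

Calling `add()` twice with two downloads of the same logo leaves a single file in the pool and a count of 2 for its digest, while each snapshot folder still shows its own file name.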
I understand your problem. Whenever possible I use 'printer friendly versions' to avoid it. They come with all the information but nearly without the (what do you call it?) fluff.
If this is not an option for you (e.g. because your favoured site doesn't offer printer-friendly views), I can offer you a workaround until the Zotero team possibly implements your request, which I think is reasonable. (But, my guess: I wouldn't expect a change in Zotero's behaviour any time soon.)
You could use AdBlock+ (https://addons.mozilla.org/en-US/firefox/addon/1865). It's intended for blocking advertisements, but should do the job for your needs. How?
Install it without downloading filter lists. Then go to your usual site and teach AdBlock (it's not difficult) that every element on the page you don't want to see in your Zotero database is unwanted. Now turn AdBlock off (it's just a click). Do your search on the site with AdBlock still inactive, and when you have the results, turn AdBlock back on and reload the page (Ctrl/F5 or the button in the NavBar; maybe it's not necessary to reload, I haven't tried). AdBlock will now filter all the unwanted elements and you should have a nice slim page for Zotero. Save it in Zotero and turn AdBlock off again to continue your work.
There are more possibilities: you can also instruct AdBlock to do certain operations only on a specified site or path. I think it may be helpful for you to have a look at it.
Greets, Jan.
However, I found a workaround. It works on *nix systems, and people who have not yet upgraded to *nix will not be able to use it. There is a script called hardlink.py, which can be downloaded from http://www.sodarock.com/hardlink/, that traverses a specified part of the directory tree and replaces duplicate files with hardlinks. Running the script reduced the size of my Zotero storage by 30 percent without any side effects.
Boris
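For readers curious what such a pass over the storage folder involves, here is a minimal sketch of the general technique. It is not the code of hardlink.py, which may well compare more than content (permissions or timestamps, for example); the function names below are made up for illustration. It assumes a filesystem that supports hard links and that everything under `root` sits on one filesystem.

```python
import hashlib
import os


def file_digest(path):
    """Return the SHA-256 hex digest of the file at path."""
    h = hashlib.sha256()
    with open(path, "rb") as f:
        for chunk in iter(lambda: f.read(65536), b""):
            h.update(chunk)
    return h.hexdigest()


def hardlink_duplicates(root):
    """Replace byte-identical files under root with hard links.

    Returns the number of bytes freed. Illustrative only.
    """
    first_seen = {}  # (size, digest) -> path of the copy that is kept
    saved = 0
    for dirpath, _dirnames, filenames in os.walk(root):
        for name in filenames:
            path = os.path.join(dirpath, name)
            # Skip symlinks and anything that is not a regular file.
            if os.path.islink(path) or not os.path.isfile(path):
                continue
            size = os.path.getsize(path)
            key = (size, file_digest(path))
            keeper = first_seen.setdefault(key, path)
            # Nothing to do for the kept copy or an already linked file.
            if keeper == path or os.path.samefile(keeper, path):
                continue
            # Link under a temporary name, then swap it into place,
            # so the original is never missing if something fails.
            tmp = path + ".hardlink-tmp"
            os.link(keeper, tmp)
            os.replace(tmp, path)
            saved += size
    return saved
```

One design consequence worth keeping in mind: hard-linked files share their content and metadata, so editing one snapshot's logo in place would change it in every snapshot that links to it.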
Links save on space somewhat, but moving those thousands of copies of the same file takes time in itself, even if their size is negligible. I'm backing up my bibliography right now, and I know that there are hundreds of copies of the same files that take up most of the time.
I would like to see Zotero only save the files that it needs. Then I could still use hardlink.py. Assuming FF will tell an extension what media a page uses, this should be relatively easy to fix.
The tool you mention (hardlink.py) seems to be at http://code.google.com/p/hardlinkpy/ now. I mention that here in case the redirect is removed in the future.
It just occurred to me that I could delete all of the files outside of Zotero. Does anyone know if Zotero tracks files or only the main file and directory? Would deleting these unused media files get Zotero out of sync or cause any other problems?
If people wanted to do some tests on difficult web pages and compare current Zotero snapshots and File->Save Page As ("Web Page, complete"), we could see if switching back made sense.
However, Scot is correct that Zotero tracks only the main file, so deleting unwanted ancillary files isn't a problem (though this may become more complicated if we start offering uncompressed file syncing).
While Zotero could track digests of the files it has saved in order to avoid duplicates, HTML attachments would then only be viewable via Zotero and not via the filesystem. I suspect more people would complain about that than complain about disk space efficiency, but perhaps not. It's even possible that Zotero could dynamically serve the proxied page even if the HTML file was opened via the filesystem.
@bdarcus - hmm... image... isn't PDF closer to PostScript? To my mind, PDF is far less evil than BMP. OK, links may stop working after some time (if the information itself isn't already outdated by then), but that is not because of the format. Actually, many "modern" HTML pages feel more like image graphics, no? That's why people complain about the space requirements: just too many small files for stupid things.
@adam.smith: I don't know this extension, but I like PDFs for archiving, sharing and printing (with Foxit PDF reader opening much faster than Acrobat, and independently of FF).
Of course, displaying PDFs inside a web browser is not what the format was made for, but for archiving I don't think it's wrong at all.
---
Zotero is really great, thanks a lot!
https://addons.mozilla.org/en-US/firefox/addon/636
it lets you convert a webpage to .pdf with the click of one button - you could then put it into Zotero with a snapshot.
The reason I was asking, though, is that I'm frequently not very convinced by the quality of web-page to .pdf conversion and that alone would, for me, be a strong argument against.
Webpagedump was designed to capture a web page as accurately as possible (in a visual sense) and tries to save everything which could be relevant for the visual rendering.
Therefore webpagedump is not targeted at efficient saving.
I wonder if either of the following would make sense or could be incorporated easily: a snapshot option or setting that makes Zotero
-only save files from the same domain as the page, i.e. not from ad servers (this might miss a few legit files), or
-save as HTML only, i.e. no gifs, jpgs etc.
Just an idea. Thanks again for a great product.
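The first of the two options above (keep only files from the page's own domain) comes down to a simple filter over the resource URLs a page references. A rough sketch, assuming the list of resource URLs is already available; `same_host_resources` is a hypothetical helper, not a Zotero or Firefox API, and the example hosts are placeholders:

```python
from urllib.parse import urljoin, urlparse


def same_host_resources(page_url, resource_urls):
    """Keep only resources served from the same host as the page itself."""
    page_host = urlparse(page_url).hostname
    kept = []
    for raw in resource_urls:
        # Resolve relative references against the page URL first.
        absolute = urljoin(page_url, raw)
        if urlparse(absolute).hostname == page_host:
            kept.append(absolute)
    return kept


if __name__ == "__main__":
    page = "http://journal.example.org/article/1"
    resources = [
        "/img/logo.gif",
        "http://ads.example.net/banner.gif",
        "fig1.png",
    ]
    for url in same_host_resources(page, resources):
        print(url)
```

Because the comparison is an exact host match, files served from a sibling host such as a separate static or image server would be dropped as well, which is the "might miss a few legit files" caveat in practice.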
https://www.readability.com/
(it's free and open source)