Disk space efficiency

bbukh · October 23, 2007

I recently started using Zotero and like it quite a bit so far. However, Zotero is highly inefficient in terms of disk space usage. Most of my records come from a handful of websites, so each time I take a snapshot of a paper from site X, all the page decoration elements get saved to the hard drive. I end up with literally hundreds of copies of the same journal logo in my zotero/storage folder, not to mention all the buttons and other fluff. I propose to stop wasting hard drive space like this. More importantly, it will make for a smaller (and hence easier) backup.

The solution is simple: for each file, maintain a hash value (for example, md5) and a usage counter. If a newly downloaded file has the same hash value as an existing one, then instead of storing a copy of the file, increment the usage counter and either create a hardlink or change the link to point to the existing copy.
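A minimal sketch of that idea in Python, not anything Zotero itself does: the names `file_digest`, `store_deduplicated`, and the in-memory `index` dictionary are made up for illustration, and the hard link step assumes a filesystem that supports hard links. Because md5 digests can collide, the sketch also compares the file contents before replacing anything.

```python
import filecmp
import hashlib
import os


def file_digest(path, chunk_size=65536):
    """Return the md5 hex digest of a file, read in chunks."""
    h = hashlib.md5()
    with open(path, "rb") as f:
        for chunk in iter(lambda: f.read(chunk_size), b""):
            h.update(chunk)
    return h.hexdigest()


def store_deduplicated(path, index):
    """Register a newly saved file in `index` (hypothetical structure:
    digest -> [canonical_path, usage_count]).

    Returns True if the file was replaced by a hard link to an existing
    copy, False if it was kept as a new canonical copy.
    """
    digest = file_digest(path)
    entry = index.get(digest)
    if entry is None or not filecmp.cmp(entry[0], path, shallow=False):
        # First time this content is seen (or a digest collision):
        # keep the file as it is. A collision leaves the file untracked.
        if entry is None:
            index[digest] = [path, 1]
        return False
    # Same content already stored: link to it instead of keeping a copy.
    tmp = path + ".dedup-tmp"
    os.link(entry[0], tmp)   # create a second name for the existing copy
    os.replace(tmp, path)    # swap it in place of the duplicate
    entry[1] += 1            # increment the usage counter
    return True
```

The file keeps its original name and location, so a snapshot would still open from the filesystem; only the underlying storage is shared.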

Bottom line: eliminate redundancy in downloaded files to reduce the size of the backup.

Boris

Jnic · October 23, 2007

Hi Boris!
I understand your problem. Whenever possible I use 'printer friendly versions' to avoid it. They come with all the information but nearly without the - what do you call it? - fluff.

If this is not an option for you (e.g. because your favoured site doesn't offer printer-friendly views) I can offer you a work-around 'til the Zotero team possibly implements your request, which I think is reasonable. (But - my guess - I wouldn't expect a change in Zotero's behaviour any time soon.)

You could use AdBlock+ (https://addons.mozilla.org/en-US/firefox/addon/1865). It's intended for blocking advertisements, but should do the job for your needs. How?
Install it w/o downloading filter lists. Then go to your usual site and teach AdBlock (it's not difficult) that every element on the page you don't want to see in your Zotero database is unwanted. Now turn AdBlock off (it's just a click). Do your search on the site with AdBlock still inactive, and when you have the results, turn AdBlock back on and reload the page (Ctrl/F5 or the button in the NavBar; maybe it's not necessary to reload, I haven't tried). AdBlock will now filter all the unwanted elements and you should have a nice slim page for Zotero. Save it in Zotero and turn AdBlock off again to continue your work.

There are more possibilities: you can also instruct AdBlock to do certain operations only on a specified site or path. I think it may be helpful for you to have a look at it.

Greets, Jan.

bbukh · October 23, 2007

Thanks Jan, but I already use AdBlockPlus with Flash disabled. Thus I do not complain about the ads (well, I do complain about them on a deeper philosophical level, but that is irrelevant). What I do complain about is that most sites nowadays use graphical elements and scripts for navigation. Blocking them with AdBlock results in non-functional websites.

However, I found a way around it. This works on *nix systems, and people who have not yet upgraded to *nix will not be able to use it. There is a script called hardlink.py, which can be downloaded from http://www.sodarock.com/hardlink/, that traverses a specified part of the directory tree and replaces duplicate files with hardlinks. Running the script reduced the size of my Zotero storage by 30 percent without any side effects.

Boris

roccurve · January 17, 2008

I'd like to support the original poster in this request. Perhaps if there was a way of specifying "favorite" websites/domains, you could save all the associated files for that site in zotero once, and let the snapshot refer to a single local copy of that "fluff" when displaying pages from that domain.

Links save on space somewhat, but moving those thousands of copies of the same file takes time in itself, even if their size is negligible. I'm backing up my bibliography right now, and I know that there are hundreds of copies of the same files that take up most of the time.

yen27028 · October 12, 2008

The problem is a little worse than described above for some web sites. In some cases, you get lots of graphic files that are not even used. This includes when you use AdBlock and/or "print" versions of web pages. I think that Zotero is saving every graphic mentioned in the CSS file, even if it is not used for the specific page. FF knows what is being used (see Tools -> Page Info -> Media tab). On one site, the "print" view only needed 2 media items (1 gif and 1 png) while Zotero saved 101 graphic files (85 gif, 2 jpg, and 14 png). If I save 10 articles I will have almost 1000 extra files. If I use the tool you suggested (hardlink.py), I will still have 1000 extra hard links, which will affect backups and processing time.

I would like to see Zotero only save the files that it needs. Then I could still use hardlink.py. Assuming FF will tell an extension what media a page uses, this should be relatively easy to fix.
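To put numbers like these on your own snapshots, a small Python sketch along the following lines could tally the files in one snapshot folder by extension. The function name `count_by_extension` is made up for this example, and the folder path you pass in is whichever snapshot directory you want to inspect; nothing here is part of Zotero.

```python
from collections import Counter
from pathlib import Path


def count_by_extension(folder):
    """Count the files under `folder` (recursively), grouped by
    lower-cased file extension. Files without an extension are
    counted under "(none)"."""
    counts = Counter()
    for p in Path(folder).rglob("*"):
        if p.is_file():
            counts[p.suffix.lower() or "(none)"] += 1
    return counts
```

Comparing the result with what Tools -> Page Info -> Media reports for the same page gives a rough idea of how many saved files the page does not actually use.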

The tool you mention (hardlink.py) seems to be at http://code.google.com/p/hardlinkpy/ now. I mention that here in case the redirect is removed in the future.

It just occurred to me that I could delete all of the files outside of Zotero. Does anyone know if Zotero tracks files or only the main file and directory? Would deleting these unused media files get Zotero out of sync or cause any other problems?

scot · October 12, 2008

I have freely deleted files (*.png, *.gif, and JavaScript) from the Zotero data directory and not noticed any consequences except what I intended (a simpler page with 'just the data'). I don't have this on any authority, but I don't think Zotero knows or cares if the files get changed or removed.

dstillman · March 7, 2009

Linked here from another thread, so I'll mention that Zotero uses code from WebPageDump to save web pages because, two years ago, WPD was far more accurate than Firefox 2's built-in saving. We haven't compared it to the built-in saving in Firefox 3.0 or 3.1, and it's possible that those versions are more accurate without needlessly saving many additional files. Using the built-in methods might also avoid unwanted file saving when using page-stripping extensions, though I haven't tested that.

If people wanted to do some tests on difficult web pages and compare current Zotero snapshots and File->Save Page As ("Web Page, complete"), we could see if switching back made sense.

However, Scot is correct that Zotero tracks only the main file, so deleting unwanted ancillary files isn't a problem (though this may become more complicated if we start offering uncompressed file syncing).

While Zotero could track digests of the files it has saved in order to avoid duplicates, HTML attachments would then only be viewable via Zotero and not via the filesystem. I suspect more people would complain about that than complain about disk space efficiency, but perhaps not. It's even possible that Zotero could dynamically serve the proxied page even if the HTML file was opened via the filesystem.

zorro · March 7, 2009

What about the option of saving snapshots as PDF files (by Zotero), with active links?

adamsmith · March 8, 2009

zorro - I don't know - do you like how the "convert web-page to .pdf" (e.g. from the PDF Download extension) works?

bdarcus · March 8, 2009

I really hate the idea of saving web pages as PDFs; taking a structured text format and reducing it to effectively a picture.

dstillman · March 8, 2009

zorro: If you really want PDFs, there are other ways to produce them (e.g., in OS X you can generate PDFs with embedded text from any application), but it's unlikely we'd add PDF saving to Zotero itself (for technical reasons among others).

zorro · March 11, 2009

Well, I really _like_ PDFs for a good reason - efficiency: Just one file, taking little space, the desired information is usually there, and it comes with quite some "snapshot feeling" :-), for example if I want to send it to somebody (portable, compatible, easily printable). Of course there are ways (free dummy printers for both Win and Linux) to generate PDFs and to put it then into Zotero, but I was feeling such a thing might be easy to integrate when there is already some external program used for HTML snapshots. Just got that idea reading the discussion about wasted disk space.

@bdarcus - hmm... image... isn't pdf more close to postscript? For my feeling PDF is way less evil than BMP. OK, links may not work after some time (if not the information itself is already outdated) - but this is not because of the format. Actually, many "modern" HTML pages rather feel like image graphics, no?.. that's why people complain about the space requirements.. just too many small files for stupid things.
@adam.smith: This extension I don't know, but PDFs I like for archiving, sharing and printing (with Foxit PDF reader opening much faster than Acrobat, and independently of FF)

Of course, displaying PDFs inside a webbrowser is not what it has been made for - but for archiving I don't feel wrong at all with this format.

---
Zotero is really great, thanks a lot!

adamsmith · March 11, 2009

@zorro - you may want to check out the extension then:
https://addons.mozilla.org/en-US/firefox/addon/636
it lets you convert a webpage to .pdf with the click of one button - you could then put it into Zotero with a snapshot.

The reason I was asking, though, is that I'm frequently not very convinced by the quality of web-page to .pdf conversion and that alone would, for me, be a strong argument against.

bdarcus · March 11, 2009

"@bdarcus - hmm... image... isn't pdf more close to postscript? For my feeling PDF is way less evil than BMP."

Sure, but I think much worse than a structured format like (X)HTML, where you can do funky things like this.

bernhard · March 16, 2009

@yen27028
Webpagedump was designed to capture a web page as accurately as possible (in a visual sense) and tries to save everything which could be relevant for the visual rendering.

Therefore webpagedump is not targeted at efficient saving.

jmac62 · March 8, 2011

Actively using Zotero for several months now (and it's great!). I am glad to see this topic is being talked about. I use print-friendly pages whenever possible, but sometimes they aren't available and some pages can have hundreds of gifs, jpgs etc.

I wonder if either of the following would make sense / could be incorporated easily. Have a snapshot option or setting that makes Zotero:

-only save files from the same domain as the page - i.e. not from ad servers (this might miss a few legit files) or

-save as html - i.e. no gifs, jpgs etc.
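The first option could be sketched roughly as follows in Python. This is only an illustration of the filtering rule, not how Zotero or WebPageDump decide what to save; `same_host_resources` and the example.org URLs are made up, and the comparison is on the exact host name, so files served from a sibling subdomain would be dropped (the "might miss a few legit files" case).

```python
from urllib.parse import urljoin, urlparse


def same_host_resources(page_url, resource_urls):
    """Return the resource URLs (made absolute) whose host name is
    exactly the same as the host name of the page itself."""
    page_host = urlparse(page_url).hostname
    kept = []
    for url in resource_urls:
        absolute = urljoin(page_url, url)  # resolve relative references
        if urlparse(absolute).hostname == page_host:
            kept.append(absolute)
    return kept


# Example with hypothetical URLs: the ad server image is filtered out.
page = "http://journal.example.org/articles/1"
resources = [
    "/img/logo.gif",
    "http://ads.example.com/banner.gif",
    "figure1.png",
]
kept = same_host_resources(page, resources)
```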

Just an idea. Thanks again for a great product.

adamsmith · March 8, 2011

You can look at the Readability bookmarklet if this is something that concerns you - it's a bookmarklet that displays (almost) any page as nice, clean, customizable text. Zotero can then take a snapshot of that.
https://www.readability.com/
(it's free and open source)