tar-ball-esque snapshot file storage?

I was recently messing around with remote storage options for my library and found that an enormous amount of time was spent copying all the tiny image/script/CSS files that come with each snapshot (the average seemed to be around 150-200 files per snapshot).

Is there a way to mash all these files together into one single file for storage purposes, so that only one transfer needs to be made per entry, like the Unix tar kind of thing? No need for compression... just aggregation. I would gladly accept slightly slower unpacking/viewing if it meant I could synchronize faster.
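
Something like this is all I'm picturing (just a sketch, not anything Zotero itself would have to do, and the paths are made up):

    import tarfile
    from pathlib import Path

    # Bundle each snapshot folder into one uncompressed .tar so the sync
    # tool only has to move a single file per entry.
    storage = Path("C:/zotero/storage")     # wherever your storage directory lives
    outbox = Path("C:/zotero-sync-outbox")  # hypothetical staging folder for upload
    outbox.mkdir(exist_ok=True)

    for folder in storage.iterdir():
        if folder.is_dir():
            tar_path = outbox / (folder.name + ".tar")
            with tarfile.open(str(tar_path), "w") as tar:  # "w" = plain tar, no compression
                tar.add(str(folder), arcname=folder.name)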

I don't want to have to do away with my snapshots, but I guess I will if I have to.

-Keith
  • While I'm thinking about it, is it possible to add tar-like functionality to the entire storage procedure? The online storage thing would go MUCH faster if it were only storing one single, large file.
  • I don't know how much of a difference the number of files really makes. There's obviously some overhead in the separate request headers, but total file size probably has much more of an effect.

    What exactly do you mean by "remote storage options"?

    For the current file sync implementation in the Zotero 1.5 Sync Preview, we actually use one compressed ZIP file per directory to speed things up, but we'll likely be adding an option to upload individual files, since there would be some big advantages to being able to pull files directly off the server without needing to uncompress them first (e.g., from mobile devices).

    If you're talking about something other than file sync in 1.5, though, it doesn't really have anything to do with Zotero—it's the job of whatever other tool you're using to upload things efficiently, and there are, for example, clever online backup tools that use hashing to upload only a single instance of files that exist in multiple locations on disk.
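
    Purely as an illustration of the hashing idea (nothing Zotero-specific, and the storage path is just an example): hash every file, and only ever upload one copy per unique hash.

        import hashlib
        from pathlib import Path

        def sha1_of(path):
            # Hash in chunks so large PDFs don't have to fit in memory.
            h = hashlib.sha1()
            with open(path, "rb") as f:
                for chunk in iter(lambda: f.read(65536), b""):
                    h.update(chunk)
            return h.hexdigest()

        storage = Path("storage")  # example local storage directory
        seen = {}                  # hash -> first path seen with that content
        duplicates = 0

        for path in storage.rglob("*"):
            if path.is_file():
                digest = sha1_of(path)
                if digest in seen:
                    duplicates += 1      # a dedup-aware backup tool would skip uploading this
                else:
                    seen[digest] = path  # only this copy would actually go over the wire

        print(len(seen), "unique files;", duplicates, "duplicates that wouldn't need uploading")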

    And even if you're not using a clever tool, there's a rather simple solution, suggested by Scot on another thread: set something up to periodically delete all the unnecessary files in the storage directory, since 75% or so of the files in snapshots from advertising-supported sites are probably unnecessary and undesired.
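
    The idea would be something along these lines, run over the storage directory every so often (the patterns here are pure guesses, and this version only reports what it would remove rather than deleting anything):

        import fnmatch
        from pathlib import Path

        storage = Path("storage")  # example local storage directory
        # Rough guesses at the kind of ad/tracker clutter snapshots pick up; adjust to taste.
        junk_patterns = ["*doubleclick*", "*adserver*", "*analytics*", "ad_*.gif"]

        for path in storage.rglob("*"):
            if path.is_file():
                name = path.name.lower()
                if any(fnmatch.fnmatch(name, pattern) for pattern in junk_patterns):
                    # Dry run: print instead of path.unlink() so nothing breaks unexpectedly.
                    print("would delete:", path)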

    Storing files on the local disk efficiently, as discussed in that other thread, is another, more complex matter, but it's not really related to the question of online storage. But as for storing all Zotero attachments in a single, huge, corruptible file, well, you don't really want that.
  • I don't know how much of a difference the number of files really makes. There's obviously some overhead in the separate request headers, but total file size probably has much more of an effect.
    In my experience, the number of files makes a huge difference when writing to WebDAV. I've tried the same thing as kajeling, namely syncing my storage folder to my WebDAV storage to see whether I could bypass Zotero's storage syncing while it's still a little buggy, but I found that it was agonizingly slow (WinXP).
  • In my experience, the number of files makes a huge difference when writing to WebDAV. [...] I found that it was agonizingly slow (WinXP).
    That may have more to do with the quality of the WebDAV implementation (or the latency of the connection) than the number of files.

    At least, in a test using Transmit on OS X with 107 storage folders totaling 18MB (not counting filesystem block overhead), transferring the folders took 6:30. Transferring an uncompressed ZIP of all the files took 5:15. That's a moderate difference, but not huge, and it's between uploading 2989 files and uploading a single file, which isn't going to happen.

    This part is a discussion for the other thread, but Zotero could, of course, use a more efficient storage mechanism, not duplicating redundant files—but then the files would be accessible only through Zotero, wouldn't be indexed by system search tools, etc.
  • Thanks for the replies. It could certainly be the implementation, but I can definitely say that the number of files increased the transfer time by orders of magnitude. I had 2000 files totaling about 70 MB; it took 5 minutes to upload the ZIP (still about 70 MB), and after nearly two hours I just gave up on the individual-file transfer.

    It might not help that I'm also going through some software to mount the WebDAV share as a drive, but transferring anything with snapshots would still go faster if it were all done together.

    My current solution is to run Zotero off a local copy of my library and do some manual, external syncing with the WebDAV server, so at least it doesn't have to go through this every time I retrieve or store a file. I've also removed all the snapshots, since even with this setup they're just too time-consuming to transfer.

    Just to reiterate, the option I was referring to wouldn't have to involve any compression, which takes time to compress and decompress. Just lumping the files together with tar would pretty much solve the transmission overhead problem without any need for EXTRA compression overhead. The types of files that take up most of the space in the attachments (images, PDFs with lots of images) aren't going to compress much anyway.

    -Keith
  • Just lumping the files together with tar would pretty much solve the transmission overhead problem without any need for EXTRA compression overhead
    Right, but again, this is a problem for you and your remote storage process, not for Zotero, which isn't going to store all attachments on disk in a single huge file (as that would be crazy).