How to reduce the number of files stored in web page attachments?

pwhallinan · February 13, 2008

I may be a little confused about attachments vs snapshots.
Let's say I've found a webpage I want to store as a new item, but it's full of ads and has some extraneous divs. I delete the unwanted divs using the Aardvark Firefox extension. Now I save the current page as a new item.
Then I click "view snapshot". The result is just what I expected. Then I go to "show file" and find that there are a lot of files downloaded from the site that are not used at all.

As a second test, I stored a webpage without doing any manipulation, verified that the snapshot was fine, and then went and deleted most of the files in "show file", and got the same webpage back.
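
The manual test above can be approximated with a script. What follows is only a rough sketch, not anything Zotero provides: it assumes a snapshot's files all sit in one folder (the one "show file" opens) and treats a file as unused when its name never appears in the text of any other HTML, CSS, or JS file there. The function name `unreferenced_files` is made up for illustration.

```python
from pathlib import Path

# File types that can refer to other files by name (an assumption, not exhaustive).
TEXT_SUFFIXES = {".html", ".htm", ".css", ".js"}

def unreferenced_files(snapshot_dir):
    """List files whose names never appear in the folder's other HTML/CSS/JS text."""
    folder = Path(snapshot_dir)
    files = [p for p in folder.iterdir() if p.is_file()]
    # Read every text-like file once so each name check is a plain substring search.
    texts = {
        p: p.read_text(encoding="utf-8", errors="replace")
        for p in files
        if p.suffix.lower() in TEXT_SUFFIXES
    }
    unused = []
    for p in files:
        # A file counts as referenced if any *other* text file mentions its name.
        mentioned = any(p.name in text for other, text in texts.items() if other != p)
        if not mentioned:
            unused.append(p.name)
    return sorted(unused)
```

The top-level page itself will show up in the result, since nothing else refers to it, and names that are URL-encoded in the page (spaces as %20, for example) will be missed. Treat the output as a list of candidates to inspect, not as a list of files that are safe to delete.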

As a result, it looks to me like Zotero is storing a lot of files that it shouldn't have to. Since I'm going to end up with thousands of references, this could really add up.
It also makes syncing between computers with FolderShare impossible, since there's a 10K file limit.

Am I misunderstanding something? How can I cut down the # of files being stored in attachments?
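
To see how close a library is to a file-count limit like the one mentioned above, the files can simply be counted per attachment folder. This is a minimal sketch under one assumption: that attachments live in subfolders of a single storage directory, whose path you supply yourself. The names `count_files_per_folder` and `report` are illustrative only.

```python
from collections import Counter
from pathlib import Path

def count_files_per_folder(storage_root):
    """Return a Counter mapping each top-level subfolder to its file count."""
    root = Path(storage_root)
    counts = Counter()
    for path in root.rglob("*"):
        if path.is_file():
            rel = path.relative_to(root)
            # Files sitting directly in the root are grouped under ".".
            top = rel.parts[0] if len(rel.parts) > 1 else "."
            counts[top] += 1
    return counts

def report(storage_root, limit=10):
    """Print the total file count and the folders holding the most files."""
    counts = count_files_per_folder(storage_root)
    print(f"total files: {sum(counts.values())}")
    for folder, n in counts.most_common(limit):
        print(f"{n:6d}  {folder}")
```

Calling `report(...)` with the directory that contains your attachment folders prints the total and the ten largest folders, which shows whether a handful of ad-heavy snapshots account for most of the count.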

scot · February 13, 2008

I also wonder if there might be an easy way to just 'get the goods' from the web pages I want. I'm currently syncing my Zotero directory via USB. I have a large-ish database, and I really notice the file *count* add up. I have gone to using ScrapBook for non-academic web pages, to reduce the file count in my Zotero (and thus my sync) directory.

It is great to be able to capture web pages so accurately when you need it, but it would be nice to have a "simple snapshot" option which, at a minimum, didn't download pictures...

dstillman · February 13, 2008

> I may be a little confused about attachments vs snapshots.

Snapshots are just a type of attachment.

> As a second test, I stored a webpage without doing any manipulation, verified that the snapshot was fine, and then went and deleted most of the files in "show file", and got the same webpage back.

I can't really comment on this without an example. Our snapshot code comes from WebPageDump, which comes from ScrapBook, so I'm not that familiar with it, but I'm guessing that most of the files you're noticing are simply script and image files used for ad banners. Saving a web page to disk and having it resemble the original is actually quite a difficult task, and I'm guessing the WebPageDump code grabs anything that might under any circumstance be displayed on the page.

Linux/Mac users might want to see this thread, which suggests a tool you can use to automatically create hard links (though you'd want to then be using a sync tool that was smart enough to copy the links themselves and not full copies of the linked files).
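
For readers who have not used hard links, here is a minimal sketch of the idea. It is not the tool from that thread, and the function names are invented for illustration: it finds byte-identical files under a directory and replaces the duplicates with hard links to a single copy. Both copies must be on the same filesystem, and it defaults to a dry run that changes nothing.

```python
import hashlib
import os
from pathlib import Path

def file_digest(path, chunk_size=1 << 16):
    """Return the SHA-256 hex digest of a file, read in chunks."""
    h = hashlib.sha256()
    with open(path, "rb") as f:
        for chunk in iter(lambda: f.read(chunk_size), b""):
            h.update(chunk)
    return h.hexdigest()

def hardlink_duplicates(root, dry_run=True):
    """Replace byte-identical files under root with hard links to one copy.

    Returns a list of (duplicate, original) pairs. With dry_run=True the
    pairs are only reported and nothing on disk is changed.
    """
    seen = {}    # (size, digest) -> first path found with that content
    linked = []
    for path in sorted(Path(root).rglob("*")):
        if not path.is_file() or path.is_symlink():
            continue
        key = (path.stat().st_size, file_digest(path))
        original = seen.get(key)
        if original is None:
            seen[key] = path
            continue
        if os.path.samefile(path, original):
            continue  # already a hard link to the same data
        linked.append((path, original))
        if not dry_run:
            # Create the link under a temporary name, then swap it into place,
            # so the duplicate is never missing if something fails midway.
            tmp = path.with_name(path.name + ".linktmp")
            os.link(original, tmp)
            os.replace(tmp, path)
    return linked
```

Hard links save disk space only. Each link is still its own directory entry, so the number of files in the tree stays the same, and a sync tool that does not understand links will copy the data once per link.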

We usually recommend saving the print-friendly versions of pages when they're available.

dstillman · February 13, 2008

> It is great to be able to capture web pages so accurately when you need it, but it would be nice to have a "simple snapshot" option which, at a minimum, didn't download pictures...

That we can do. If somebody wants to take that ticket, by all means do so.

pwhallinan · February 14, 2008

Here's a proposal for a new function to help anyone seeking to clean up their stored webpages manually. It's not quite the "simple snapshot" proposed above, because there are often important images that contain charts, etc. It's really hard to automate a cleanup.

My request is for a function that does step 5 of the following use case:
1) go to the webpage
2) make a new item from the web page
3) show the snapshot attached to the new item
4) edit the displayed snapshot/webpage with Aardvark
5) take a new snapshot of the edited webpage using the Firefox "save as complete webpage" functionality (except dump the files into the Zotero tree structure, not the "web page complete" tree structure) and replace the original snapshot with the new one

The button for the new function for step 5 could be placed next to the add new item button if a checkmark in the preferences section was selected.

BTW, the reason the use case works this way (versus aardvarking a desired webpage and then adding a new item) is that the site translators must be applied to the original webpage, not the aardvarked webpage.

jgr · February 14, 2008

Wouldn't importing from Scrapbook be a decent workaround? Import from Scrapbook has been discussed (http://forums.zotero.org/discussion/145/) but AFAIK nothing has happened so far.
Scrapbook can effectively delete unwanted items on a page.

cfeuerherdt · February 17, 2008

Another alternative could be to capture the webpage as an image (perhaps something like FireShot?). That way only an image of the web page is stored rather than all the associated files.

dstillman · February 17, 2008

> I delete the unwanted divs using the Aardvark Firefox extension. Now I save the current page as a new item. Then I click "view snapshot". The result is just what I expected. Then I go to "show file" and find that there are a lot of files downloaded from the site that are not used at all.

I looked at the Aardvark extension, and I think the issue is that Aardvark doesn't actually remove the page elements—it just hides them using custom CSS rules. So when you save snapshots, all the elements are still there.

Note that there's no difference between the WebPageDump code and the built-in Save As Complete Webpage functionality in this regard. If you save an Aardvarked page from File->Save and look at the source, you'll see all the original elements. WebPageDump is just much better at saving pages accurately, so it grabs many more elements that would only be displayed under certain circumstances (which, unfortunately, often means ad content).
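
A small illustration of the point about hidden elements, as a sketch only: this is neither the WebPageDump code nor how Aardvark works internally. It uses Python's standard `html.parser` to list the resources a page's markup refers to. The sample page and the inline `display: none` style are made up for the example; the hidden ad block's files are listed just like the visible ones, because hiding an element does not remove it from the source.

```python
from html.parser import HTMLParser

# Tag -> attribute that points at an external file (a simplified, assumed list).
RESOURCE_ATTRS = {
    "img": "src",
    "script": "src",
    "iframe": "src",
    "embed": "src",
    "link": "href",
}

class ResourceCollector(HTMLParser):
    """Collect resource URLs from the markup, ignoring CSS visibility entirely."""

    def __init__(self):
        super().__init__()
        self.resources = []

    def handle_starttag(self, tag, attrs):
        wanted = RESOURCE_ATTRS.get(tag)
        if wanted is None:
            return
        for name, value in attrs:
            if name == wanted and value:
                self.resources.append(value)

def referenced_resources(html_text):
    """Return every resource URL found in the HTML, in document order."""
    collector = ResourceCollector()
    collector.feed(html_text)
    collector.close()
    return collector.resources

# Hypothetical page: the ad block is hidden with CSS but still in the source.
SAMPLE = """
<html><body>
  <div id="article"><img src="chart.png"></div>
  <div id="ad" style="display: none">
    <img src="banner.gif"><script src="ads.js"></script>
  </div>
</body></html>
"""

print(referenced_resources(SAMPLE))  # ['chart.png', 'banner.gif', 'ads.js']
```

Any saver that works from the markup sees all three files. Only actually deleting the `ad` element from the document would keep `banner.gif` and `ads.js` out of the saved copy.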

You can always save a snapshot of a snapshot if you want to edit pages saved via translators. You lose the original URL (replaced with a file:/// URL), but 1) it's often the URL of the parent item anyway and 2) as an archival tool, Zotero by design doesn't modify existing pages when saving highlighting and annotations, instead saving the data for those elements in the database and overlaying them on the page. Directly supporting modification of saved snapshots tied to an original URL seems a little problematic (even if you can in practice do this anyway by using Aardvark before saving a page)...

Oleg_Gordeyev · November 11, 2009

I have used Zotero (Windows) for only one week and have run into the opposite problem: I cannot save any page with its pictures. When taking a snapshot, Zotero saves only text, scripts, and background images. Saving the same page with the ScrapBook add-on works fine. I have gone through all the available settings in Firefox and Zotero and found no option that affects how pages are saved.

Does anybody have an idea what I am doing wrong?

arggem · November 11, 2009

Hi Oleg!

Welcome to the forums!!

Sorry, I don't have an answer for you, but since your problem is different from the one indicated in the thread title, I suggest you start a new thread for it. Otherwise people won't know that you have a different problem, and may not read your post.

Oleg_Gordeyev · November 12, 2009

Hello, arggem!

Thanks for your attention and the suggestion. I will follow your advice.