Available for beta testing: Improved webpage snapshots

dstillman · September 25, 2020

The latest Zotero beta and Zotero Connector beta can now take greatly improved webpage snapshots, even on websites that make heavy use of dynamic rendering, deferred images, or advertising. Pages should generally be saved exactly as you see them in your browser, after any JavaScript has run and reflecting any modifications made to the page with other browser extensions. All JavaScript is stripped, so any interactive functionality won't be preserved, but you also won't end up with broken redirects, doubled ad images, or other problems that are common when saving dynamic webpages. (Zotero's previous snapshots were based on Firefox's "Save Page As…" functionality, which, like Chrome's, exhibits many of those problems.) Once this has rolled out, we'll be able to turn snapshots back on on sites where they were previously disabled (e.g., NYT, Twitter), and they'll work again on sites where they had previously stopped working (e.g., Medium).

The new functionality is based on the great SingleFile browser extension (technically, SingleFileZ). We're not currently saving single files — either with encoded embedded resources (SingleFile) or as self-extracting ZIP files (SingleFileZ) — but rather using the SingleFile logic to extract just the cleaned HTML, CSS, image, and font files necessary to display the page as shown. All single-file options involve trade-offs, but for a future version we're considering switching to combining snapshot resources into non-self-extracting ZIP files that would normally be viewed within Zotero but that would still be extractable as standard ZIP files for data portability.
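
To illustrate the non-self-extracting ZIP idea, here is a minimal Python sketch. It is not Zotero's or SingleFileZ's code: the function names `bundle_snapshot` and `read_resource` and the sample file names are hypothetical. It only shows that resources packed this way stay readable by any standard ZIP tool.

```python
import io
import zipfile


def bundle_snapshot(resources: dict[str, bytes]) -> bytes:
    """Pack snapshot resources (hypothetical name -> bytes mapping) into a plain ZIP."""
    buffer = io.BytesIO()
    with zipfile.ZipFile(buffer, "w", compression=zipfile.ZIP_DEFLATED) as archive:
        for name, data in resources.items():
            archive.writestr(name, data)
    return buffer.getvalue()


def read_resource(bundle: bytes, name: str) -> bytes:
    """Read one resource back out, as a viewer or any ZIP tool could."""
    with zipfile.ZipFile(io.BytesIO(bundle)) as archive:
        return archive.read(name)


# Made-up snapshot contents, for illustration only.
snapshot = {
    "index.html": b"<html><body><img src='images/logo.png'></body></html>",
    "images/logo.png": b"\x89PNG placeholder bytes",
    "style.css": b"body { margin: 0; }",
}
bundle = bundle_snapshot(snapshot)
```

Because the archive has no self-extracting HTML wrapper, a browser cannot open it directly, which is the trade-off described above.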

Since the new snapshot functionality does more processing of the page, and particularly since it tries to load deferred images (e.g., images that only appear when you scroll down the page), it can be a little bit slower on complex pages or pages with very large deferred images. When saving from the Zotero Connector, it also runs in the browser itself instead of in Zotero, meaning that if you leave the page before the snapshot finishes it won't be saved to Zotero. We've talked before about reducing the number of pages on which Zotero saves snapshots, and I think we'll want to revisit that to make sure snapshots are only being saved where they provide real value.

If you want to try out the new snapshots when saving from your browser, you'll need both the Zotero beta and the Zotero Connector beta for Firefox. (The Zotero beta alone will use the new snapshot functionality when saving via Add Item by Identifier — e.g., for an arXiv ID.)

Let us know if you run into any problems.

Gurdas_Sandhu · October 2, 2020

This is a welcome improvement. Thank you. Could you speak more on why SingleFileZ and not SingleFile, particularly interoperability and sharing? Other than compression, it seems SingleFile is better because it will open in any browser without needing to be decompressed and thus can be shared with people who do not have Firefox or this add-on.

dstillman · October 2, 2020

As I say, we're not actually using either the SingleFile or SingleFileZ approach to save the files. We're just using the logic that extracts the resources from the page and then saving them as separate files on disk — the same as previous attachments, just cleaner. We have to use SingleFileZ for that because it's the one that extracts individual files instead of embedding them in a, well, single file. Basically we skip the part of SingleFileZ where it ZIPs the files and adds a self-extracting HTML container.

dstillman · October 9, 2020

The new snapshot functionality is available now in Zotero 5.0.91.

emilianoeheyns · October 12, 2020

Why not as a single file though? What's the benefit of having separate files for a snapshot?

dstillman · October 12, 2020

@emilianoeheyns: See the chart I linked to. It explains the problems with the default SingleFile/SingleFileZ approaches — specifically, huge Base64-encoded embedded resources or hard-coded embedded JavaScript code in each file to extract a bizarre embedded ZIP file. They're both bad options.

As I say:

"for a future version we're considering switching to combining snapshot resources into non-self-extracting ZIP files that would normally be viewed within Zotero but that would still be extractable as standard ZIP files for data portability"

emilianoeheyns · October 12, 2020

Right, I'd forgotten about those.

dmnkvd · October 13, 2020

Great feature, thank you for improving! I am curious, is there presently an option to save a web-article in a reader-friendly PDF with Zotero? (if not, is it on the horizon -- would make annotating web-texts a lot easier!)

Arjan · October 17, 2020

Great to see this change! If it's feasible to have multiple options in the future, I'd like to see the actual SingleFile output as well. I have for the past year or so actually been attaching SingleFiles to Zotero items manually. Primarily to have functional snapshots, but having everything bundled in the html has two additional benefits: 1) not having literally dozens of .js files per snapshot, which slows down syncing over WebDAV; 2) the ease of sending someone a snapshot over email or whatever in just one file that's immediately viewable in any browser. I haven't really noticed any drawback to the base64-encoded images -- I don't normally go looking for the separate images meant to be viewed in context, and if I wanted to I could just save them from the html page.

emilianoeheyns · October 17, 2020

They're much larger when b64 encoded.

Arjan · October 17, 2020

I suppose it would add up for really image-heavy pages, but that's why I asked whether multiple options might be feasible. For the types of pages I'm saving, it seems to be up to about 20% larger in total, which is worth it to me (just checked on a page with 11 main images, 29 including background, sidebar, etc. images).

Gurdas_Sandhu · October 17, 2020

I would like to advocate for one file per snapshot for many of the reasons mentioned by @Arjan. I have avoided Zotero's snapshot feature because it creates many dozens of files per snapshot and that's hard to share. My current approach is to use a Firefox add-on to save the page and then attach it to the main item. That's many extra clicks.

dstillman · October 17, 2020

"1) not having literally dozens of .js files per snapshot, which slows down syncing over WebDAV"

No it doesn't, if you're referring to Zotero file syncing. Zotero has always zipped snapshot contents before syncing.

Yes, the drawback to Base64 encoding is the increased size: encoded resources will generally be ~33% larger.

But it's true that Base64 encoding would be better for sharing — in terms of ease if not file size — than the regular ZIP of individual files that I mention as a possibility above. We'll take this into consideration when planning additional options.
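
The ~33% figure follows from how Base64 works: every 3 bytes of input become 4 ASCII characters of output. A small Python sketch makes the arithmetic concrete. The helper name `base64_overhead` and the random sample data are assumptions for illustration, not anything from Zotero or SingleFile.

```python
import base64
import os


def base64_overhead(raw: bytes) -> float:
    """Return the fractional size increase from Base64-encoding raw bytes."""
    encoded = base64.b64encode(raw)
    return len(encoded) / len(raw) - 1.0


# 30,000 random bytes stand in for an image file (illustrative data only).
sample = os.urandom(30_000)
overhead = base64_overhead(sample)
print(f"{overhead:.1%}")  # prints 33.3%
```

Only the embedded binary resources grow by this ratio. The page's own HTML text is not re-encoded, which may be why a whole snapshot can come out less than 33% larger, as with the ~20% reported earlier in the thread.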

dstillman · October 24, 2020

OK, in the Zotero Connector beta, we've switched to the standard SingleFile approach with Base64-encoded resources. We decided it was worth it. Since we're discarding JS and a lot of ad junk, even with the bulkier encoding it's possible for these to end up using significantly less space than the older snapshots, and at the very least it shouldn't be that much bigger. And unlike a ZIP file — either the standard approach or the self-extracting SingleFileZ approach — this keeps snapshots viewable anywhere without any special handling.

We're planning to automatically convert existing multi-file snapshots — both old and new — into single files in a future client version.
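
For readers unfamiliar with Base64-encoded resources, the general idea is that each external file reference is replaced by a `data:` URI carrying the file's bytes inline. The Python sketch below shows that idea only. It is not SingleFile's or the Connector's implementation, and `to_data_uri`, `inline_image`, and the sample page are hypothetical.

```python
import base64


def to_data_uri(data: bytes, mime_type: str) -> str:
    """Encode raw bytes as a Base64 data: URI."""
    payload = base64.b64encode(data).decode("ascii")
    return f"data:{mime_type};base64,{payload}"


def inline_image(html: str, src: str, data: bytes, mime_type: str) -> str:
    """Replace one src="..." reference with the resource embedded inline."""
    return html.replace(f'src="{src}"', f'src="{to_data_uri(data, mime_type)}"')


# Made-up page; b"GIF89a" is just the 6-byte GIF header, not a full image.
page = '<html><body><img src="logo.png"></body></html>'
single_file = inline_image(page, "logo.png", b"GIF89a", "image/gif")
print(single_file)
```

The result is one HTML file with no external dependencies, so it opens in any browser without special handling, at the cost of the larger encoded payload.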

dstillman · October 24, 2020

Zotero Connector 5.0.76, available now for Firefox and within a few days for Chrome, will use the single-file approach for snapshots of the current page. Snapshots that have to happen in Zotero itself — such as for search results, or for features such as Add Item by Identifier — will still use the multi-file approach until the next version.

Gurdas_Sandhu · October 26, 2020

This is great, thank you. I was using the "Save Page WE" add-on to save my webpages as a single file, and for the past month I have been saving some pages with both the Save Page WE and "SingleFile" add-ons. In most cases the SingleFile version is smaller and a more faithful representation of the actual web page in Firefox.

dstillman · November 4, 2020

Zotero 5.0.93, available now, completes the switch to single files for all snapshots.

We have a few more bug fixes for the SingleFile snapshots coming up, and we still may try to convert existing snapshots to single files in a future version.

Closing this thread. Let us know in new threads if you find any other issues with the new snapshots.