snapshot network traffic?

When I view an HTML snapshot, it creates network traffic, but when I view a PDF, it does not. I don't use sync or have any settings that obviously explain this. Could you tell me what the network traffic is doing, please, and how I can configure it?
  • It's likely just JavaScript from the page trying to access remote page resources. You could use NoScript or similar to stop JavaScript from running on file:// URLs, but this is just a local HTML file in your browser, so Zotero isn't involved at that point.
  • Sorry not to have responded sooner. I can't seem to stop it through any NoScript setting. Does anyone know of a setting for this?
  • This question is better directed at your browser vendor; as Dan notes, Zotero is just handing off a file:// URL. Any traffic will depend on your browser, your extensions, and the particular snapshot you're looking at.

    I'm not aware of a generic way to cut network traffic for all file:// URLs.

    You can use "offline mode" if your browser supports it, or set a temporary proxy, but either may affect other uses of your browser and/or need to be switched manually.
  • It seems that while NoScript stops the scripts from running, it does not stop them from loading. Perhaps that is a privacy consideration for the browser. A workaround of sorts: if I have no network connection, there is no network traffic. Apparently Firefox used to have don't-load capabilities but has discontinued them, although some add-ons might be configurable as substitutes:
    https://stackoverflow.com/questions/6291916/disabling-loading-specific-javascript-files-with-firefox

    Which brings up a related point: some webpages don't work without JavaScript. The Typekit scripts, for instance, blank all the page text if you block them. This is obviously a problem for the web designer, not Zotero. However, it means that when I don't have a network connection, even digging into the HTML source may not retrieve the content I have saved to Zotero. I'm guessing most web designers prefer to work in areas with cheap, reliable internet connections.

    Thanks for your help. These are somebody else's problems.
  • Apparently Firefox used to have don't-load capabilities but has discontinued them
    I'm not sure what you're referring to, but Firefox still has a "Work Offline" option.
    Which brings up a related point: some webpages don't work without JavaScript.
    We're considering a couple of alternative saving mechanisms: 1) simplified saving, similar to the reader mode functionality in various browsers, and 2) saving the document as rendered by JavaScript and then removing all JavaScript from the page. It remains to be seen whether we can do the second one reliably, but it would produce more faithful snapshots of what you were actually looking at when you saved (even if any interactive functionality didn't work in the snapshot).
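
    For illustration only, here is a rough sketch of what option 2 could look like outside of Zotero, assuming Selenium and a local Firefox install; the URL, output file name, and the crude regex-based script removal are placeholders, not Zotero's actual implementation:

    ```python
    # Rough sketch: load a page in a real browser so its JavaScript runs,
    # then save the rendered DOM with the scripts removed.
    # Assumes Selenium and Firefox/geckodriver are installed; the URL is a placeholder.
    import re

    from selenium import webdriver

    def save_rendered_snapshot(url, outfile):
        driver = webdriver.Firefox()
        try:
            driver.get(url)
            html = driver.page_source  # DOM as rendered by JavaScript
        finally:
            driver.quit()
        # Crude JavaScript removal: drop <script>...</script> blocks and inline handlers.
        html = re.sub(r'<script\b[^>]*>.*?</script>', '', html, flags=re.S | re.I)
        html = re.sub(r'\son\w+="[^"]*"', '', html)
        with open(outfile, 'w', encoding='utf-8') as f:
            f.write(html)

    save_rendered_snapshot('https://example.com/article', 'snapshot.html')
    ```

    A real implementation would also need to capture images and stylesheets so the rendered snapshot stays readable offline.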
  • Sorry, I wasn't clear. If a browser is configured not to run certain scripts, using NoScript or some such, it still downloads the scripts. This seems like a bit of a waste of bandwidth. Since many pages use third-party scripts, it also means that, for instance, Google knows what pages you read on PubMed, which might be a privacy concern for some people.

    So when I set NoScript to run no scripts at all, opening a saved page still loaded the scripts from assorted domains.

    Firefox used to have a feature that let you tell it never to load scripts requested from certain domains (http://kb.mozillazine.org/Security_Policies). This would make browsing with a script blocker much less antisocial, costly, and slow on a limited-bandwidth connection. Apparently this functionality is gone (https://support.mozilla.org/en-US/questions/1000843).

    Working offline stops the scripts from downloading, but it also stops the HTML and everything else from downloading.

    Your saving mechanisms sound interesting. Some web designers do pay attention to how their pages work with Zotero, so just giving a warning when saving a page with problematic scripts might give some of them an incentive to make their sites render script-free. Reader-mode-like functionality would be nice, but I suspect the problematic websites are precisely the ones that would not work in reader mode.

    I have resorted to a manual version of option 2, copy-pasting text. I would be very happy with a version of option 2 that I could manually preview and OK. I would also appreciate a feature that let me tell Zotero to strip all of the JavaScript out of all the saved pages (ideally with the ability to make individual exceptions, but useful interactive content is quite rare); a rough sketch of the kind of stripping I mean follows at the end of this comment.

    Saving video is often awkward because someone has decided to use HTML5 video but split the stream into dozens of second-long clips, so that saving the video does not work. On a slow connection, this can mean that the video freezes repeatedly, often enough to make speech incomprehensible. Audio sometimes has similar problems.
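
    A hypothetical sketch of the kind of JavaScript stripping I mean, run over snapshot HTML files that are already on disk; the directory path is a placeholder and the regex approach is deliberately crude, not how Zotero actually stores or processes snapshots:

    ```python
    # Rough sketch: strip <script> blocks (inline and external) from saved
    # snapshot files so that opening them triggers no script downloads.
    # The snapshot directory below is a placeholder.
    import re
    from pathlib import Path

    SCRIPT_RE = re.compile(r'<script\b[^>]*>.*?</script>|<script\b[^>]*/>',
                           flags=re.S | re.I)

    def strip_scripts(snapshot_dir):
        for path in Path(snapshot_dir).rglob('*.html'):
            text = path.read_text(encoding='utf-8', errors='replace')
            cleaned = SCRIPT_RE.sub('', text)
            if cleaned != text:
                path.write_text(cleaned, encoding='utf-8')
                print(f'stripped scripts from {path}')

    strip_scripts('/path/to/snapshots')
    ```

    This only removes scripts; other remote references (images, fonts, stylesheets) would still generate traffic when the file is opened.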
  • Any updates on the plans to add alternative saving mechanisms?

    My primary interest is in having an accurate, readable copy of the page content that I can quickly access.

    My biggest problem with the current full snapshot is that it can take forever to load (and may not load at all).

    I understand that this is because of JavaScript execution and that there may be ways of working around it, but those seem complicated, browser-dependent, and not guaranteed to work. What happens if the page has been moved or was originally (legally) viewed behind a paywall?

    Reducing the number (and total size) of the files saved in the library is also a plus (as noted by several users).

    Of the two approaches Dan outlined in this thread, the "as rendered" version is appealing because it guarantees that everything I read on the page is present in the snapshot.

    On the flip side, an "as rendered" version does capture a lot of irrelevant material (e.g. "You may also be interested in..."). Does an easier-to-implement "simplified" version in fact capture all of the "relevant" page content? Or is "relevant" page content in the eye of the beholder?
  • One standards-based solution is Scholarly HTML:
    https://w3c.github.io/scholarly-html/

    Some websites, such as PMC, already have scrapeable fulltexts (a minimal example of retrieving one is sketched at the end of this comment). Zotero may have to take a site-by-site approach on scrapers for fulltexts, too.

    Plan S also has machine-readability requirements. I hope the Zotero team has discussed or will discuss them with the requirement-writers (whom one can write to at coalition-sscienceeurope.org, with a circle-at sign between the two successive esses).

    With reference to this discussion:
    https://forums.zotero.org/discussion/36151/wikified-copyleft-bibliographic-database
    There is also a Wikimedia fulltext database, human-edited into a machine-readable form. It can only cover public-domain and openly licensed materials (CC-BY-SA, etc.). It currently contains lots of books and few articles, but this should change.

    TiddlyWiki, which has BibTeX integration, is sometimes handy for storing fulltexts, as are assorted static website tools.
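
    As an illustration of the scrapeable-fulltext point above, a minimal sketch that pulls an article's XML from PMC via NCBI's E-utilities efetch endpoint; the PMCID is a placeholder, and whether the body text comes back depends on the article's access status:

    ```python
    # Rough sketch: fetch an article's JATS XML from PMC via NCBI E-utilities.
    # The id below is the numeric part of a PMCID and is only a placeholder.
    from urllib.parse import urlencode
    from urllib.request import urlopen

    def fetch_pmc_xml(pmcid):
        params = urlencode({'db': 'pmc', 'id': pmcid, 'retmode': 'xml'})
        url = f'https://eutils.ncbi.nlm.nih.gov/entrez/eutils/efetch.fcgi?{params}'
        with urlopen(url) as resp:
            return resp.read().decode('utf-8')

    print(fetch_pmc_xml('1234567')[:500])  # placeholder id; print the start of the XML
    ```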
  • Now successfully using SingleFile to do this. Updating this thread with a link to the details:

    https://forums.zotero.org/discussion/comment/363634