Feature request: Optionally save web pages as PDF (and a list of tools that might help)

There has been some discussion on this before, most notably:

https://forums.zotero.org/discussion/comment/313592

https://forums.zotero.org/discussion/31876/web-page-annotation-a-working-solution/

https://forums.zotero.org/discussion/9183/htmltopdf-workflow/

https://forums.zotero.org/discussion/23704/create-and-attach-pdf-from-snapshot/

The latest comment on limitations was in 2012 from @adamsmith, who mentioned that this might be more suitable for a plugin and that there might not be any cross-platform library to use. I wonder if it would be possible to receive a comment on how the Zotero team views this feature today. Is it something that would be interesting if the right tools were available and there were enough resources to implement it? Or is it seen as a less useful feature that is unlikely to be implemented?




My reason for considering this a highly useful feature is that, while there are other solutions for annotating live webpages (such as hypothes.is), I think PDF is an ideal format for webpage snapshots. It allows for using the same workflow and tools as when annotating research article PDFs, and would work well with the extract annotation feature from Zotfile. It is like printing out a physical copy of a page and taking notes, great for learning and for referring to later (of course keeping in mind when it was retrieved, but many research posts such as blogs are rarely updated; rather, new posts are made).

Given my interest in this feature, I want to try to help by sharing a few tools that might be able to provide it in Zotero. I am not able to say which (if any) would be suitable choices for incorporating this feature into Zotero. All these tools are cross-platform, open source, and currently maintained.


  • jsPDF is a client-side JS library for pdf generation. Source code is 5 MB, not sure about size of dependencies.


  • wkhtmltopdf has precompiled binaries that are around 80 MB in size (for all three OSes together). If that would be too big to bundle, maybe Zotero users could be instructed to download the binary themselves if they want to print to PDF, and then there could be an option in Zotero to set the location of the wkhtmltopdf binary to use.


  • Google Chrome/Chromium has a headless mode that allows for printing (and can. While not everyone

    chrome --headless --disable-gpu --print-to-pdf file:///path/to/myfile.html


  • PDFShift is an online API. This might require the fewest resources from the Zotero team. Users would have to be comfortable with their HTML files being uploaded to a third party for conversion to PDF, but since these are web snapshots they are already public anyway (and could maybe be uploaded without being linked to a user ID?).
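To make the two local options in the list above more concrete, here is a minimal sketch of how a converter with a user-configurable binary location could build its command line. The function names and the idea of passing the binary path as a parameter are assumptions for illustration only; the flags are the ones from the headless Chrome command above, with the output path given to `--print-to-pdf`, and wkhtmltopdf's plain `input output` form.

```python
import subprocess
from pathlib import Path


def chrome_pdf_command(chrome_binary, html_path, pdf_path):
    """Build the argument list for headless Chrome/Chromium."""
    html_uri = Path(html_path).resolve().as_uri()  # file:///... URI
    return [
        str(chrome_binary),
        "--headless",
        "--disable-gpu",
        f"--print-to-pdf={Path(pdf_path).resolve()}",
        html_uri,
    ]


def wkhtmltopdf_command(wkhtmltopdf_binary, html_path, pdf_path):
    """Build the argument list for wkhtmltopdf: <binary> <input> <output>."""
    return [str(wkhtmltopdf_binary), str(html_path), str(pdf_path)]


def convert(command):
    """Run a converter command; raises CalledProcessError on failure."""
    subprocess.run(command, check=True)
```

A call such as `convert(wkhtmltopdf_command("/usr/local/bin/wkhtmltopdf", "page.html", "page.pdf"))` would then depend only on where the user says the binary lives (the path here is a placeholder).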


As for the user facing side, there would ideally be an option to save webpage snapshots as HTML, PDF, or both. Saving as only PDF would involve clicking the browser extension button, which would trigger a download of the HTML page, a conversion to PDF, and finally a deletion of the HTML file.
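The HTML/PDF/both choice described above could be sketched roughly as follows. This is not how Zotero handles snapshots; the mode names and the `convert` callback are hypothetical, and any of the converters listed earlier could sit behind that callback. The one design point it tries to show is that the HTML file is deleted only after a PDF actually exists.

```python
from pathlib import Path


def store_snapshot(html_path, mode, convert):
    """Return the list of files to keep for a snapshot.

    mode: "html", "pdf", or "both" (hypothetical option values).
    convert: callable taking (html_path, pdf_path) that writes the PDF.
    """
    html_path = Path(html_path)
    if mode not in ("html", "pdf", "both"):
        raise ValueError(f"unknown snapshot mode: {mode!r}")
    if mode == "html":
        return [html_path]

    pdf_path = html_path.with_suffix(".pdf")
    convert(html_path, pdf_path)
    if not pdf_path.is_file():
        # Keep the HTML if the conversion silently produced nothing.
        raise RuntimeError(f"conversion did not produce {pdf_path}")

    if mode == "pdf":
        html_path.unlink()  # delete the HTML only after the PDF exists
        return [pdf_path]
    return [html_path, pdf_path]
```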

A couple of Node-based solutions; I don’t know if they are relevant:

https://github.com/foliojs/pdfkit

https://github.com/westmonroe/pdf-puppeteer

Resources:

https://blog.risingstack.com/pdf-from-html-node-js-puppeteer/

https://stackoverflow.com/questions/18191893/generate-pdf-from-html-in-div-using-javascript

https://stackoverflow.com/questions/176476/how-can-i-automate-html-to-pdf-conversions

  • If anyone is looking to implement this for personal use, the pyzotero library might help https://github.com/urschrei/pyzotero
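For personal use, attaching an already converted PDF to an existing item with pyzotero might look roughly like the sketch below. It relies only on pyzotero's `attachment_simple(files, parentid)` method; the helper name, the library ID, the API key, and the item key are placeholders, and pyzotero itself has to be installed separately.

```python
def attach_pdf(zot, parent_key, pdf_path):
    """Upload pdf_path as a child attachment of the item parent_key.

    zot is expected to be a pyzotero.zotero.Zotero client; only its
    attachment_simple(files, parentid) method is used here.
    """
    return zot.attachment_simple([str(pdf_path)], parent_key)


# Usage (placeholders, requires `pip install pyzotero` and network access):
#
#   from pyzotero import zotero
#   zot = zotero.Zotero("LIBRARY_ID", "user", "API_KEY")
#   attach_pdf(zot, "ITEMKEY", "page.pdf")
```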
  • For saving a website to Zotero as a pdf, you could follow these three steps, which can mostly be executed through keyboard commands.

    1. Save the website to Zotero using the browser connector. Zotero's default shortcut is "Ctrl+Shift+S". For adjusting the Firefox shortcut see here.

    2. Save the website as a pdf to the Downloads folder using a web browser, e.g., Chrome. The shortcut for printing is usually "Ctrl+P".

    3. Attach the pdf to the Zotero item using the Zotfile add-on, which adds an "Attach New File" option to the item context menu. The Zutilo add-on allows triggering this function through its keyboard shortcuts for Zotfile.

    For the best pdf output, step 2 might require some adjustments. E.g., if the website doesn't offer a printer-friendly version, you could remove irrelevant content using the "only selection" option in the printing dialog. A more automatic way might be possible for capturing the full page, but you might get large documents with unnecessary clutter.

    Some automatic html-to-pdf functionality might be useful for Zotero notes and reports. Zotero's reports and the Report Customizer add-on already provide printing options. For a more automatic export function, it could help to check the code in the ZoteroQuickLook add-on, which saves notes to a temporary html file that can be shown in an external viewer.
  • Thanks for the workflow tip @qqbb ! It sounds quite efficient and there would be advantages to manually printing the page to PDF since, as you said, one can customize the output.

    I'm still interested in solutions that would be entirely automated. After looking at the QuickLook plugin you linked, I'm thinking that maybe a good solution would be to add the option to run a custom command after a web snapshot is saved. Then the user could write a script using wkhtmltopdf or whatever to convert the snapshot to PDF and delete the original HTML. I believe the only parts needed from Zotero (or a plugin) would be to pass the file path to the script and assign the output of the script as an attachment to the currently selected item. Not sure if this is possible since I have not made any plugins, but conceptually it sounds reasonable.
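A user-side script for such a custom command might look like the sketch below. No such hook exists in Zotero as far as this thread establishes; the contract assumed here (snapshot path as the only argument, PDF path printed on standard output, non-zero exit code on failure) and the script name are purely hypothetical. It assumes `wkhtmltopdf` is on the PATH.

```python
import subprocess
import sys
from pathlib import Path


def main(argv, run=subprocess.run):
    """Convert the snapshot given in argv[1]; print the PDF path on success."""
    if len(argv) != 2:
        print("usage: convert_snapshot.py SNAPSHOT.html", file=sys.stderr)
        return 2
    html_path = Path(argv[1])
    pdf_path = html_path.with_suffix(".pdf")
    # wkhtmltopdf takes the input file and the output file as arguments.
    result = run(["wkhtmltopdf", str(html_path), str(pdf_path)])
    if result.returncode != 0:
        return result.returncode
    print(pdf_path)  # the caller would attach this file to the item
    return 0
```

Ending the file with `sys.exit(main(sys.argv))` would turn it into a standalone script. Deleting the original snapshot is left to the caller here, so that a failed conversion never loses the HTML.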
  • I use the ZotFile trick @qqbb suggests along with PrintFriendly (instead of just printing) to produce a more acceptable-looking PDF.

    I wonder if a PDF resolver might be one way of implementing this. There are some (paid) options that take the address of an HTML page and give you a PDF URL in return, and these could be hooked up through the PDF resolver API: https://www.zotero.org/support/kb/custom_pdf_resolvers

    Still, one click is better than many...
  • edited April 18, 2020
    I'm as well interested in saving a webpage directly as a PDF (instead of the snapshot, or along with the snapshot). Of course it can be done manually, but having it as part of the default connector would be really handy.

    I have no idea how that could be implemented, but starting to search on the topic, I found that the Chrome DevTools Protocol has a Page.printToPDF method:

    https://chromedevtools.github.io/devtools-protocol/tot/Page/#method-printToPDF

    A PDF file of a page can also be obtained from a headless instance of Chrome:

    https://developers.google.com/web/updates/2017/04/headless-chrome#create_a_pdf

    or using Puppeteer with Node:

    https://developers.google.com/web/updates/2017/04/headless-chrome#puppeteer

    None of that is a real solution, but it's reasonable to think that there's a built in method "somewhere", at least in Chrome.

    Installing a Zotero printer driver component is also an option, just as Adobe Acrobat or Microsoft Onenote do.
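For the Page.printToPDF method mentioned above, the message exchange itself is small. The sketch below only builds the JSON command and decodes the reply; opening the DevTools WebSocket to a running browser is left out, and the helper names are made up for illustration. It assumes the protocol's default behaviour of returning the PDF base64-encoded in `result.data`.

```python
import base64
import json


def print_to_pdf_request(message_id, print_background=True):
    """Serialize a Page.printToPDF command for the DevTools WebSocket."""
    return json.dumps({
        "id": message_id,
        "method": "Page.printToPDF",
        "params": {"printBackground": print_background},
    })


def pdf_bytes_from_response(raw_response):
    """Extract the PDF bytes from the JSON reply to Page.printToPDF."""
    reply = json.loads(raw_response)
    if "error" in reply:
        raise RuntimeError(reply["error"].get("message", "printToPDF failed"))
    # By default the protocol returns the PDF base64-encoded in result.data.
    return base64.b64decode(reply["result"]["data"])
```

Writing the returned bytes to a `.pdf` file would complete the conversion; a browser extension would go through its own debugging API rather than a raw WebSocket, which is not shown here.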
  • Has there been some thought to using the SingleFile browser add-on to snapshot webpages? It also has a highlight/annotate option.

    https://github.com/gildas-lormeau/SingleFile
  • We're going to be redoing how snapshots work soon to fix problems on some JS-heavy sites. SingleFile is one of the possibilities for what we'd base the new snapshots on, but we're going to be reviewing a bunch of options, since much of it depends on how well they can be integrated with Zotero on a technical level.
  • @qqbb , @ahmontgo , thanks a lot, guys, for sharing, these tricks are really appreciated!

    @dstillman , you're saying that the snapshot feature would still use the HTML output, not PDF output, aren't you?
  • I vote for this feature; it would be really useful. IMHO saving as PDF is much better because it keeps the same workflow, as pointed out by @cheflo .

    In the meantime, the @qqbb trick saved me precious time saving the webpages.
  • +1 for this feature
  • + 1 for save as pdf integration. I've been printing separately and attaching to a saved website item (per @qqbb -- and you can also just drag the pdf file into zotero on top of the item). But doing this automatically would save a lot of time.
  • This gets a strong vote from me too. The HTML text snapshot is ok for proving text was present, but it doesn't store any graphs, charts or illustrations. As more content is moving online, it would be super useful to have a one-click workflow for this!
  • "The HTML text snapshot is ok for proving text was present, but it doesn't store any graphs, charts or illustrations."
    Sure it does. What makes you say that?

    Since the above discussion, we've implemented SingleFile-based snapshots, which save nearly perfect static versions of even interactive pages like this one (with graphs, charts, and illustrations). A PDF of that page from Firefox leaves out lots of content. A PDF from Chrome does a little better but still cuts off significant amounts of content at page breaks. A PDF of the SingleFile snapshot from Firefox just makes it worse, cutting off content and messing up the layout.

    I understand that the previous snapshot functionality — like the HTML saving in Firefox and Chrome — didn't handle all pages well, but the idea that an HTML webpage can be reliably turned into a PDF is unfortunately just mistaken. I'd encourage people to try the new SingleFile-based snapshots, which will be available in Zotero 5.0.91 within the next day.
  • +1 for this feature. I'm using PDFs to mark text and make annotations for later use, and can't do that with Snapshots.
  • I don't think that's going to happen for the reasons given above, sorry.
    There _are_ annotation tools for websites you can use.