Feature request: Optionally save web pages as PDF (and a list of tools that might help)

There has been some discussion on this before, most notably:

https://forums.zotero.org/discussion/comment/313592

https://forums.zotero.org/discussion/31876/web-page-annotation-a-working-solution/

https://forums.zotero.org/discussion/9183/htmltopdf-workflow/

https://forums.zotero.org/discussion/23704/create-and-attach-pdf-from-snapshot/

The latest comment on limitations was in 2012 from @adamsmith, who mentioned that this might be more suitable for a plugin and that there might not exist any cross-platform library to use. I wonder if it would be possible to receive a comment on how this feature is viewed by the Zotero team today. Is it something that would be interesting if the right tools were available and there were enough resources to implement them? Or is it seen as a less useful feature that will likely not be implemented?




My reason for considering this a highly useful feature is that, while there are other solutions for annotating live webpages (such as hypothes.is), I think PDF is an ideal format for webpage snapshots. It allows for using the same workflow and tools as when annotating research article PDFs, and would work well with the extract annotation feature from Zotfile. It is like printing out a physical copy of a page and taking notes, great for learning and for referring to later (of course keeping in mind when it was retrieved, but many research posts such as blogs are rarely updated; rather, new posts are made).

Given my interest in this feature, I want to try to help by sharing a few tools that might be able to provide it in Zotero. I am not able to say which (if any) would be suitable choices for incorporating this feature into Zotero. All these tools are cross-platform, open source, and currently maintained.


  • jsPDF is a client-side JS library for pdf generation. Source code is 5 MB, not sure about size of dependencies.


  • wkhtmltopdf has precompiled binaries that are around 80 MB in size (for all three OSes together). If that would be too big to bundle, maybe Zotero users could be instructed to download the binary themselves if they want to print to PDF, and then there could be an option in Zotero to set the location of the wkhtmltopdf binary to use.


  • Google Chrome/Chromium has a headless mode that allows for printing to PDF:

    chrome --headless --disable-gpu --print-to-pdf file:///path/to/myfile.html


  • PDFShift is an online API. This might require the least resources from the Zotero team. Users would have to be comfortable with their HTML files being uploaded to a third party for conversion to PDF, but since these are web snapshots they are already public anyway (and could maybe be uploaded without being linked to a user ID?).
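
As a rough illustration of how the two locally run tools in the list above might be invoked with a user-configurable binary location, here is a minimal Python sketch. It is not part of Zotero: `build_command` and its parameters are hypothetical names, the default executable names are assumptions (Chrome's differs per platform), and the code only assembles the argument list without making any claim about output quality.

```python
from pathlib import Path
from typing import List, Optional


def build_command(tool: str, html: Path, pdf: Path,
                  binary: Optional[str] = None) -> List[str]:
    """Return the argument list for converting `html` to `pdf` with `tool`.

    `binary` stands in for a user-set location of the executable
    (a hypothetical preference; nothing like it exists in Zotero).
    """
    if tool == "wkhtmltopdf":
        # wkhtmltopdf takes the input document followed by the output file.
        return [binary or "wkhtmltopdf", str(html), str(pdf)]
    if tool == "chrome":
        # Headless Chrome/Chromium, using the flags from the list above;
        # the input is passed as a file:// URL.
        return [
            binary or "chrome",
            "--headless",
            "--disable-gpu",
            f"--print-to-pdf={pdf}",
            html.resolve().as_uri(),
        ]
    raise ValueError(f"unknown tool: {tool}")
```

The returned list could then be handed to `subprocess.run(cmd, check=True)`. Keeping command construction separate from execution makes it easy to swap tools or point at a manually downloaded binary.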


As for the user facing side, there would ideally be an option to save webpage snapshots as HTML, PDF, or both. Saving as only PDF would involve clicking the browser extension button, which would trigger a download of the HTML page, a conversion to PDF, and finally a deletion of the HTML file.

A couple of Node-based solutions; I don’t know whether they are relevant:

https://github.com/foliojs/pdfkit

https://github.com/westmonroe/pdf-puppeteer

Resources:

https://blog.risingstack.com/pdf-from-html-node-js-puppeteer/

https://stackoverflow.com/questions/18191893/generate-pdf-from-html-in-div-using-javascript

https://stackoverflow.com/questions/176476/how-can-i-automate-html-to-pdf-conversions

  • If anyone is looking to implement this for personal use, the pyzotero library might help https://github.com/urschrei/pyzotero
  • For saving a website to Zotero as a pdf, you could follow these three steps, which can mostly be executed through keyboard commands.

    1. Save the website to Zotero using the browser connector. Zotero's default shortcut is "Ctrl+Shift+S". For adjusting the Firefox shortcut see here.

    2. Save the website as a pdf to the Downloads folder using a web browser, e.g., Chrome. The shortcut for printing is usually "Ctrl+P".

    3. Attach the pdf to the Zotero item using the Zotfile add-on, which adds an "Attach New File" option to the item context menu. The Zutilo add-on allows triggering this function through its keyboard shortcuts for Zotfile.

    For the best pdf output, step 2 might require some adjustments. E.g., if the website doesn't offer a printer-friendly version, you could remove irrelevant content using the "only selection" option in the printing dialog. A more automatic way might be possible for capturing the full page, but you might get large documents with unnecessary clutter.

    Some automatic html-to-pdf functionality might be useful for Zotero notes and reports. Zotero's reports and the Report Customizer add-on already provide printing options. For a more automatic export function, it could help to check the code in the ZoteroQuickLook add-on, which saves notes to a temporary html file that can be shown in an external viewer.
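
Step 3 of the manual workflow above (attaching the freshly printed PDF from the Downloads folder) could in principle be scripted with the pyzotero library mentioned earlier in the thread. The following is only a sketch under assumptions: `newest_pdf` and `attach_newest_pdf` are made-up helper names, `zot` is assumed to be a pyzotero `Zotero` client created by the caller, the item key is a placeholder, and uploading relies on pyzotero's `attachment_simple` call with Zotero file storage being available for the library.

```python
from pathlib import Path


def newest_pdf(folder: Path) -> Path:
    """Return the most recently modified PDF in `folder`."""
    pdfs = [p for p in folder.iterdir() if p.suffix.lower() == ".pdf"]
    if not pdfs:
        raise FileNotFoundError(f"no PDF found in {folder}")
    return max(pdfs, key=lambda p: p.stat().st_mtime)


def attach_newest_pdf(zot, parent_key: str,
                      folder: Path = Path.home() / "Downloads") -> Path:
    """Upload the newest PDF in `folder` as a child of item `parent_key`.

    `zot` is assumed to be a pyzotero client, e.g.
    zotero.Zotero(library_id, "user", api_key).
    """
    pdf = newest_pdf(folder)
    # pyzotero: attachment_simple(files, parentid) uploads the files
    # as child attachments of the given parent item.
    zot.attachment_simple([str(pdf)], parent_key)
    return pdf
```

This goes through the Zotero web API rather than the desktop client, so the attachment would appear locally only after the next sync.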
  • Thanks for the workflow tip @qqbb ! It sounds quite efficient and there would be advantages to manually printing the page to PDF since, as you said, one can customize the output.

    I'm still interested in solutions that would be entirely automated. After looking at the QuickLook plugin you linked, I'm thinking that maybe a good solution would be to add the option to run a custom command after a web snapshot is saved. Then the user could write a script using wkhtmltopdf or whatever to convert the snapshot to PDF and delete the original. I believe the only parts needed from Zotero (or a plugin) would be to pass the file path to the script and to assign the output of the script as an attachment to the currently selected item. Not sure if this is possible since I have not made any plugins, but conceptually it sounds reasonable.
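
The custom-command idea above could look something like the following user script. This is purely a sketch: Zotero has no such post-save hook, so the calling convention (snapshot path as the only argument, resulting PDF path printed on stdout) is an assumption, as are the names `convert`, `main`, and `DEFAULT_TEMPLATE`. The converter is a command template so that wkhtmltopdf can be replaced by any other tool.

```python
import shlex
import subprocess
import sys
from pathlib import Path

# Command template; {src} and {dst} are filled in for each snapshot.
DEFAULT_TEMPLATE = "wkhtmltopdf {src} {dst}"


def convert(snapshot: Path, template: str = DEFAULT_TEMPLATE,
            run=subprocess.run) -> Path:
    """Convert an HTML snapshot to PDF, then delete the HTML original."""
    pdf = snapshot.with_suffix(".pdf")
    # Split first, substitute afterwards, so paths with spaces stay intact.
    args = [part.format(src=snapshot, dst=pdf) for part in shlex.split(template)]
    run(args, check=True)
    if not pdf.exists():
        raise RuntimeError(f"converter produced no output for {snapshot}")
    # Only remove the original once the PDF is known to exist.
    snapshot.unlink()
    return pdf


def main(argv) -> int:
    """Hypothetical hook entry point: argv[1] is the snapshot path."""
    if len(argv) != 2:
        print("usage: snapshot_hook.py SNAPSHOT.html", file=sys.stderr)
        return 2
    # Print the PDF path so the caller could attach it to the item.
    print(convert(Path(argv[1])))
    return 0
```

A real script would end with `sys.exit(main(sys.argv))`. Deleting the HTML only after the PDF exists means a failed conversion never loses the snapshot.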
  • I use the ZotFile trick @qqbb suggests along with PrintFriendly (instead of just printing) to produce a more acceptable-looking PDF.

    I wonder if a PDF resolver might be one way of implementing this. There are some (paid) services that take a web page's address and return a PDF URL, and these could be hooked up through the PDF resolver API: https://www.zotero.org/support/kb/custom_pdf_resolvers

    Still, one click is better than many...
  • edited April 18, 2020
    I'm also interested in saving a webpage directly as a PDF (instead of the snapshot, or along with the snapshot). Of course it can be done manually, but having it as part of the default connector would be really handy.

    I have no idea how that could be implemented, but starting to search on the topic, I found that the Chrome DevTools Protocol has a Page.printToPDF method:

    https://chromedevtools.github.io/devtools-protocol/tot/Page/#method-printToPDF

    A PDF file of a page can also be obtained from a headless instance of Chrome:

    https://developers.google.com/web/updates/2017/04/headless-chrome#create_a_pdf

    or using Puppeteer with Node:

    https://developers.google.com/web/updates/2017/04/headless-chrome#puppeteer

    None of that is a real solution, but it's reasonable to think that there's a built in method "somewhere", at least in Chrome.

    Installing a Zotero printer driver component is also an option, just as Adobe Acrobat and Microsoft OneNote do.
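
To make the `Page.printToPDF` suggestion above a little more concrete, here is a small Python sketch of the message exchange. It covers only the JSON side: the protocol command is named as in the linked documentation and returns the PDF as a base64 string in `data`, but the transport (for example a DevTools WebSocket session) is left out, and the helper names `print_to_pdf_request` and `pdf_bytes_from_response` are invented for illustration.

```python
import base64
import json


def print_to_pdf_request(message_id: int = 1,
                         print_background: bool = True) -> str:
    """Serialize a Page.printToPDF command for a DevTools session."""
    return json.dumps({
        "id": message_id,
        "method": "Page.printToPDF",
        "params": {"printBackground": print_background},
    })


def pdf_bytes_from_response(raw: str) -> bytes:
    """Extract the PDF bytes from the reply to Page.printToPDF."""
    reply = json.loads(raw)
    if "error" in reply:
        raise RuntimeError(reply["error"].get("message", "printToPDF failed"))
    # The protocol returns the document base64-encoded in `data`.
    return base64.b64decode(reply["result"]["data"])
```

The decoded bytes can be written straight to a `.pdf` file. Whether a browser extension such as the Zotero connector could call this method at all is a separate question that this sketch does not answer.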
  • Has there been some thought to using the SingleFile browser add-on to snapshot webpages? It also has a highlight/annotate option.

    https://github.com/gildas-lormeau/SingleFile
  • We're going to be redoing how snapshots work soon to fix problems on some JS-heavy sites. SingleFile is one of the possibilities for what we'd base the new snapshots on, but we're going to be reviewing a bunch of options, since much of it depends on how well they can be integrated with Zotero on a technical level.
  • @qqbb , @ahmontgo , thanks a lot, guys, for sharing, these tricks are really appreciated!

    @dstillman , you're saying that the snapshot feature would still use the HTML output, not PDF output, aren't you?
  • I vote for this feature; it would be really useful. IMHO saving as PDF is much better because it keeps the same workflow, as pointed out by @cheflo.

    In the meantime, the @qqbb trick saved me precious time saving the webpages.
  • +1 for this feature
  • + 1 for save as pdf integration. I've been printing separately and attaching to a saved website item (per @qqbb -- and you can also just drag the pdf file into zotero on top of the item). But doing this automatically would save a lot of time.
  • This gets a strong vote from me too. The HTML text snapshot is ok for proving text was present, but it doesn't store any graphs, charts or illustrations. As more content is moving online, it would be super useful to have a one-click workflow for this!
  • The HTML text snapshot is ok for proving text was present, but it doesn't store any graphs, charts or illustrations.
    Sure it does. What makes you say that?

    Since the above discussion, we've implemented SingleFile-based snapshots, which save nearly perfect static versions of even interactive pages like this one (with graphs, charts, and illustrations). A PDF of that page from Firefox leaves out lots of content. A PDF from Chrome does a little better but still cuts off significant amounts of content at page breaks. A PDF of the SingleFile snapshot from Firefox just makes it worse, cutting off content and messing up the layout.

    I understand that the previous snapshot functionality — like the HTML saving in Firefox and Chrome — didn't handle all pages well, but the idea that an HTML webpage can be reliably turned into a PDF is unfortunately just mistaken. I'd encourage people to try the new SingleFile-based snapshots, which will be available in Zotero 5.0.91 within the next day.
  • +1 for this feature. I'm using PDFs to mark text and make annotations for later use, and can't do that with Snapshots.
  • I don't think that's going to happen for the reasons given above, sorry.
    There _are_ annotation tools for websites you can use.
  • @adamsmith, I haven't yet found a website annotation tool that fully works for Zotero saved .html. The problems are one or more of:

    * they save the annotations separately, usually on their own servers.
    * they save annotations only in ~/Downloads
    * you can manually move the file they save to ~/Downloads to wherever Zotero wants it, but then they cannot re-edit and save new annotations.

    But if you're aware of an annotation tool that does the job, I'd love to hear about it!
  • I don't have the knowledge or skill of those posting here. I have spent the last three days searching for a solution that would permit me to take a non-PDF web article, place it in Zotero, annotate it, and then extract the annotations using something like Zotfile. The above message (December 8, 2020) is the most recent comment on this issue. The following separate discussions end with the same comment dated January 2020:
    https://forums.zotero.org/discussion/9183/htmltopdf-workflow/
    https://forums.zotero.org/discussion/23704/create-and-attach-pdf-from-snapshot/
    https://forums.zotero.org/discussion/31876/web-page-annotation-a-working-solution

    Use of web pages as resources is common. Zotero is a logical platform for organizing, annotating and extracting from webpage information. Has there been any development on this since December 2020? Can anyone direct me to workflows, extensions, etc. that will accomplish this?


  • You should save webpages as HTML files using Zotero’s snapshot feature. That is the page’s native format and the one that will preserve correct formatting. PDF is a legacy format for print publications—it doesn’t make sense to convert a web page to a lesser-performing format. Zotero has recently overhauled its snapshot feature since the previous post in this thread to have much improved and more accurate HTML snapshot saving.
  • Yes, Zotero's new snapshots really are very good. But as far as I can tell, there's still no way to annotate them once they're saved, like you can, for example, in Evernote.
  • I want to make a humble suggestion on this subject. Sometimes I use Opera to record the website. Opera has a built-in feature for converting a web page into a PDF. I take a snapshot of the web page in the usual way and save it to Zotero. Then I save the web page as a PDF with Opera and add it to Zotero. It is a long way around, but it is easy once you get used to it.
  • Thank you for your responses. In order to have a consistent workflow, I would like to have PDFs from academic journals and articles from websites amenable to annotation and extraction with Zotfile. I have read a number of articles suggesting converting websites to PDF, then saving the PDF to Zotero, and then annotating and extracting with Zotfile. The problem is that not all PDFs are amenable to annotation and extraction. As noted by Scotto above, there is no way to annotate the HTML once saved. It would be so helpful if either Zotero found a way to handle this issue OR there was a functioning workaround. So far none of the suggestions I have found has worked (and there are many suggestions).
  • Re the comment by zmetedinler: does this approach allow you to annotate the PDF (highlight and then save) and then extract with Zotfile?
  • edited April 11, 2021
    @dstillman is there a way to reproduce what @zmetedinler does with Opera within Zotero?
    I.e. could Zotero save a website as a snapshot, but then, if an option "Save website as PDF" is checked, save the snapshot as a PDF and then keep only the PDF as an attachment?

    If it's not too difficult, then I also support adding this feature. I've been just saving websites as PDFs and adding them to Zotero manually, but in iOS this takes too much time and too many clicks. I was about to request automatic website-to-PDF saving for iOS, but if this can be done everywhere that would be fantastic.

    A uniform format for adding/annotating texts in Z really helps both research and writing. For example, good academic blogs (which currently can only be automatically saved as web snapshots in Z) are becoming as important as JStor articles for academic debates, so having two different systems for saving/annotating them makes no sense.
  • @Joncthomas, yes it does. However, I use PDF-XChange Viewer as my PDF viewer. With this program, every comment and markup I make on the PDF is transferred to Zotero.
  • But as far as I can tell, there's still no way to annotate them once they're saved
    @scotto: You could install SingleFile in your browser, right-click the SingleFile button and select "Annotate the page...". This works with Firefox. There's also a setting to automatically open snapshots in the annotation mode.

    If you then save the annotated snapshot to your Downloads folder, you could use ZotFile's "Attach New File" function as explained above.

    I think automatically saving the annotated snapshot back to the original file location isn't currently possible. See here for details: https://github.com/gildas-lormeau/SingleFile/issues/453.
  • could Zotero save a website as a snapshot, but then, if an option "Save website as PDF" is checked, save the snapshot as a PDF and then keep only the PDF as an attachment?
    @erazlogo: No, we won't be doing that, for the reasons explained above. PDF is not an appropriate format for saving webpages.

    But Zotero had webpage annotation in the past and may again in the future.
  • edited April 18, 2021
    While I understand the reasoning behind not supporting this feature, I'd like to also express my support and present some arguments for it. Apologies in advance for the length, but I want to try to make a compelling case for this :P

    I appreciate the improvements made to the webpage snapshot feature and I understand that it is superior to PDF when the goal is to accurately capture the whole webpage.

    However, I think there is something to be said for capturing webpages where the relevant content is an article or text.

    Most of the time, when I capture a webpage to Zotero, it's because it has text with relevant information for a particular project. Maybe it's a blogpost, a news article, a WHO/EU/UN page with information on a particular topic, a Wikipedia article, etc.

    In this case, I'm not interested in having an accurate representation of the whole webpage at the moment of capture, I just want the main body of the text. Having the text saved as a nicely formatted PDF allows me:
    - to just focus on the text without the clutter
    - to have the text formatted in a way that facilitates reading and annotating
    - to read and annotate that PDF like I do with journal articles (either using Zotero Beta/Zotero iOS or any other PDF annotator of my choice)
    - to work with the PDF like I do with journal articles (maintaining the workflow with Zotfile's Send to Tablet, which I still find useful to read files with particular devices or apps)
    - to open, read and/or annotate the file on any device that supports opening PDF (including PCs, iPad, iPhones, Android phones and tablets, ereaders, windows tablets...)
    - to share/send the file (annotated or not) to any colleague, even those who are less technically inclined ("everyone" understands and knows how to work with PDF)
    - to more easily use those same files in the future in other reference/knowledge management apps, which might not support all kinds of attachments or filetypes (I doubt I'll ever leave Zotero, but I like to keep future possibilities as open as possible - that's one of the main reasons I use Zotero, to avoid being locked-in as much as possible)

    Besides saving articles/text, the second most frequent scenario where I store a webpage is just to keep the link, like a bookmark, where I also don't find it necessary to save a snapshot. That's why in my particular workflow I have the snapshot auto-save disabled: most of the time it would just clutter my Zotero file list and waste storage space (however little). And in the few cases where I do want to save a snapshot, I still have that option available through right-click.

    In fact, the "Saving Webpage as a PDF" should be just that: an option. Just like we can globally enable or disable the automatic saving of snapshots in the settings, and regardless still save a particular item with or without snapshot, saving as a PDF should be just another option (in addition to, and not instead of, the snapshot).

    You might argue the traditional Zotfile workflow (send to tablet, extract annotations, etc.) will be irrelevant when the new Zotero features reach maturity. But I don't think that will be true for every user. I'm very excited about the new features and I've been happily beta-testing them and I'm delighted to see them improve week after week. However, I don't think I'll be able to completely abandon my previous workflow for all cases any time soon, if ever. No matter how good the Zotero iOS app becomes on the iPad (and even if it supports inking/handwriting in the future), there will always be scenarios where other apps will be preferred (I'm thinking of LiquidText, Marginnote, Notability, and others, with features that Zotero likely won't - and shouldn't - be able to replicate). This is not to mention those that use Android tablets, e-readers, their phone... Besides, it is said in the announcement of the new PDF reader that "It will always remain possible to set Zotero to open PDFs in an external reader if you prefer one to the internal reader." So the Zotfile workflow will remain a viable alternative for those who prefer it (provided it remains actively developed despite the new Zotero features, which I hope it does).

    You might also reiterate that webpage annotation might be supported again in the future. However, that will not answer all the points outlined above. The same can be said of using external tools to annotate webpages (like Weava, Liner, or Hypothesis), with even more problems regarding cross-device compatibility, vendor lock-in, long term access and use, and dispersion of information.

    Finally, it's true that saving a webpage as a PDF can be achieved without Zotero intervention, using the print dialog from the browser or a tool like "Print Friendly and PDF", and then attaching the resulting PDF to the relevant item in Zotero. Nevertheless, being able to do this automatically and through Zotero would streamline and simplify this workflow, making it significantly quicker and easier.

    I completely understand if Zotero developers still feel this is not something worth investing their (limited) time or it does not align with their vision for the program. In that case, hopefully someone will be able to make a plugin?
  • edited April 18, 2021
    @dstillman I tried to use snapshots today and they look great but they are not set up for actual research/analysis at present.

    1. I can't highlight/annotate them
    2. They open in a browser rather than within Zotero so I can't see a snapshot and take notes on it at the same time
    3. I can't share them with others via email [Edit: looks like I can--they open in a browser. Still, PDF is called "Portable" Document Format for a reason--that's the format people expect for sharing]
    4. On iOS snapshots don't behave in the same way as PDFs--there is no icon in the list view, clicking on the list item does not trigger their download, and, like on the desktop, they open in a separate window making notetaking impossible.

    For me the iOS behavior is most baffling--it takes time, clicks, and scrolling to navigate to the snapshot in order to view it even if it is already downloaded! But a user's reasons for consulting a snapshot would be the same as viewing a PDF which is immediately available from the list view.

    Note: upon reflection, I created a new thread for the iOS behavior https://forums.zotero.org/discussion/89151/ios-feature-request-snapshots-should-behave-in-the-same-way-as-pdfs
  • edited May 4, 2021
    I agree that PDF is not a suitable format to save complex web pages, but I also think the main feature requested here is getting the text and images into an annotation-capable format rather than seeking a faithful representation of the web page layout. I read in another thread that HTML annotations are likely not on the table for the near future, so PDF remains the only format that is possible to annotate. Particularly with the new excellent PDF annotation capabilities in the preview, this would be such a great addition.

    I have tested wkhtmltopdf with some pages and most of the time it does either a great or good enough job. Firefox also has a built in "reader mode" that can simplify webpages which maybe could be leveraged before doing the conversion?

    I don't really have a preference for the format that is annotated so if it would be easier to make HTML files annotatable and integrated into the new annotation panel, then that would be great, but it seems like a more complex undertaking than saving a pdf.

    EDIT: I started a discussion here about HTML annotations https://forums.zotero.org/discussion/89301/feature-request-add-annotations-for-web-pages
  • The solution to the annotation issue is adding an HTML viewer and annotation tool in Zotero, not converting web pages to PDF.