Feature request: Optionally save web pages as PDF (and a list of tools that might help)
There has been some discussion on this before, most notably:
https://forums.zotero.org/discussion/comment/313592
https://forums.zotero.org/discussion/31876/web-page-annotation-a-working-solution/
https://forums.zotero.org/discussion/9183/htmltopdf-workflow/
https://forums.zotero.org/discussion/23704/create-and-attach-pdf-from-snapshot/
The latest comment on limitations was in 2012 from @adamsmith, who mentioned that this might be more suitable for a plugin and that there might not be any cross-platform library to use. I wonder if it would be possible to receive a comment on how this feature is viewed by the Zotero team today. Is it something that would be of interest if the right tools were available and there were enough resources to implement it? Or is it seen as a less useful feature that will likely not be implemented?
My reason for considering this a highly useful feature is that, while there are other solutions for annotating live webpages (such as hypothes.is), I think PDF is an ideal format for webpage snapshots. It allows for using the same workflow and tools as when annotating research article PDFs, and would work well with the extract annotation feature from Zotfile. It is like printing out a physical copy of a page and taking notes, great for learning and for referring to later (of course keeping in mind when it was retrieved, but many research posts, such as blog posts, are rarely updated; rather, new posts are made).
Given my interest in this feature, I want to try to help by sharing a few tools, which could be able to provide this feature in Zotero. I am not able to say which (if any) would be suitable choices for incorporating this feature into Zotero. All these tools are cross-platform, open source, and currently maintained.
jsPDF is a client-side JS library for PDF generation. The source code is 5 MB; I am not sure about the size of its dependencies.
wkhtmltopdf has precompiled binaries that are around 80 MB in size (for all three OSes together). If that would be too big to bundle, maybe Zotero users could be instructed to download the binary themselves if they want to print to PDF, and then there could be an option in Zotero to set the location of the wkhtmltopdf binary to use.
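To make the "set the location of the binary" idea a bit more concrete, here is a minimal Python sketch of how such a lookup could work. The `configured_path` argument stands in for a hypothetical user preference; it is not an existing Zotero setting, and the function name is made up for illustration.

```python
import os
import shutil
from typing import Optional


def locate_wkhtmltopdf(configured_path: Optional[str] = None) -> Optional[str]:
    """Return a usable wkhtmltopdf binary path, or None if none is found.

    `configured_path` is a hypothetical user preference, not a real
    Zotero option.
    """
    # Prefer the user-configured location if it is an executable file.
    if (
        configured_path
        and os.path.isfile(configured_path)
        and os.access(configured_path, os.X_OK)
    ):
        return configured_path
    # Otherwise fall back to whatever is available on the PATH.
    return shutil.which("wkhtmltopdf")
```

If this returns `None`, the "print to PDF" option could simply be disabled with a hint to download the binary.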
Google Chrome/Chromium has a headless mode that allows for printing to PDF:
chrome --headless --disable-gpu --print-to-pdf file:///path/to/myfile.html
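For anyone who wants to script this, a rough Python wrapper around the same command could look like the following. The function names are my own, and the name of the Chrome binary differs between systems (e.g. `chrome`, `google-chrome`, `chromium`), so it is passed in rather than assumed.

```python
import subprocess
from pathlib import Path
from typing import List


def chrome_pdf_command(chrome_binary: str, html_path: str, pdf_path: str) -> List[str]:
    """Build the argument list for a headless Chrome print-to-PDF run."""
    return [
        chrome_binary,
        "--headless",
        "--disable-gpu",
        # Giving --print-to-pdf a value sets the output file.
        f"--print-to-pdf={pdf_path}",
        # Chrome expects a URL, so turn the local path into file:///...
        Path(html_path).resolve().as_uri(),
    ]


def print_to_pdf(chrome_binary: str, html_path: str, pdf_path: str) -> None:
    """Run headless Chrome; raises CalledProcessError if Chrome fails."""
    subprocess.run(chrome_pdf_command(chrome_binary, html_path, pdf_path), check=True)
```

Something like `print_to_pdf("chromium", "myfile.html", "myfile.pdf")` would then do the conversion, assuming the binary is on the PATH.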
PDFShift is an online API. This might require the fewest resources from the Zotero team. Users would have to be comfortable with their HTML files being uploaded to a third party for conversion to PDF, but since these are web snapshots they are already public anyway (and could maybe be uploaded without being linked to a user ID?).
As for the user facing side, there would ideally be an option to save webpage snapshots as HTML, PDF, or both. Saving as only PDF would involve clicking the browser extension button, which would trigger a download of the HTML page, a conversion to PDF, and finally a deletion of the HTML file.
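The HTML / PDF / both option described above can be sketched in a few lines of Python. This is only an illustration of the flow, not how Zotero works: `SnapshotFormat` and `finalize_snapshot` are invented names, and the actual converter (wkhtmltopdf, headless Chrome, ...) is passed in as a function so the sketch stays tool-agnostic.

```python
import os
from enum import Enum
from typing import Callable, List


class SnapshotFormat(Enum):
    HTML = "html"
    PDF = "pdf"
    BOTH = "both"


def finalize_snapshot(
    html_path: str,
    fmt: SnapshotFormat,
    convert: Callable[[str, str], None],
) -> List[str]:
    """Return the files to keep for an already-downloaded HTML snapshot.

    `convert(html_path, pdf_path)` is whatever HTML-to-PDF tool is used.
    """
    if fmt is SnapshotFormat.HTML:
        return [html_path]
    pdf_path = os.path.splitext(html_path)[0] + ".pdf"
    convert(html_path, pdf_path)
    if fmt is SnapshotFormat.PDF:
        # PDF only: the HTML was just an intermediate file.
        os.remove(html_path)
        return [pdf_path]
    return [html_path, pdf_path]
```

Because the HTML is only deleted after `convert` returns, a failed conversion that raises an exception leaves the original snapshot in place.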
A couple of Node-based solutions that may or may not be relevant:
https://github.com/foliojs/pdfkit
https://github.com/westmonroe/pdf-puppeteer
Resources:
https://blog.risingstack.com/pdf-from-html-node-js-puppeteer/
https://stackoverflow.com/questions/18191893/generate-pdf-from-html-in-div-using-javascript
https://stackoverflow.com/questions/176476/how-can-i-automate-html-to-pdf-conversions
1. Save the website to Zotero using the browser connector. Zotero's default shortcut is "Ctrl+Shift+S". For adjusting the Firefox shortcut see here.
2. Save the website as a pdf to the Downloads folder using a web browser, e.g., Chrome. The shortcut for printing is usually "Ctrl+P".
3. Attach the pdf to the Zotero item using the Zotfile add-on, which adds an "Attach New File" option to the item context menu. The Zutilo add-on allows triggering this function through its keyboard shortcuts for Zotfile.
For the best pdf output, step 2 might require some adjustments. E.g., if the website doesn't offer a printer-friendly version, you could remove irrelevant content using the "only selection" option in the printing dialog. A more automatic way might be possible for capturing the full page, but you might get large documents with unnecessary clutter.
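Step 3 above depends on picking up the file that was just saved to the Downloads folder. As a rough illustration of that lookup (my assumption being that the most recently modified PDF is the one wanted; this is not Zotfile's actual code), a Python sketch could be:

```python
from pathlib import Path
from typing import Optional


def newest_pdf(folder: str) -> Optional[Path]:
    """Return the most recently modified PDF in `folder`, or None."""
    pdfs = [
        p for p in Path(folder).iterdir()
        if p.is_file() and p.suffix.lower() == ".pdf"
    ]
    # max() with default=None handles an empty folder gracefully.
    return max(pdfs, key=lambda p: p.stat().st_mtime, default=None)
```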
Some automatic html-to-pdf functionality might be useful for Zotero notes and reports. Zotero's reports and the Report Customizer add-on already provide printing options. For a more automatic export function, it could help to check the code in the ZoteroQuickLook add-on, which saves notes to a temporary html file that can be shown in an external viewer.
I'm still interested in solutions that would be entirely automated. After looking at the QuickLook plugin you linked, I'm thinking that maybe a good solution would be to add the option to run a custom command after a web snapshot is saved. Then the user could write a script using wkhtmltopdf or whatever to convert the snapshot to PDF and delete the original. I believe the only parts needed from Zotero (or a plugin) would be to pass the file path to the script and attach the script's output to the currently selected item. I'm not sure if this is possible since I have not made any plugins, but conceptually it sounds reasonable.
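A user script for such a hook might look roughly like this in Python. Everything about the hook itself is hypothetical: I'm assuming the snapshot path arrives as the single command-line argument and that the script reports the resulting PDF path on stdout. The only real external piece is the basic `wkhtmltopdf INPUT OUTPUT` invocation.

```python
import os
import subprocess
import sys
from typing import Callable, Sequence


def convert_snapshot(html_path: str, run: Callable[..., object] = subprocess.run) -> str:
    """Convert an HTML snapshot to PDF, delete the HTML, return the PDF path."""
    pdf_path = os.path.splitext(html_path)[0] + ".pdf"
    # check=True raises if wkhtmltopdf fails, so the HTML below is only
    # removed after a successful conversion.
    run(["wkhtmltopdf", html_path, pdf_path], check=True)
    os.remove(html_path)
    return pdf_path


def main(argv: Sequence[str]) -> int:
    """Hypothetical hook entry point: argv[1] is the saved snapshot path."""
    if len(argv) != 2:
        print("usage: snapshot_to_pdf.py SNAPSHOT.html", file=sys.stderr)
        return 2
    # Print the new file's path so the caller could attach it to the item.
    print(convert_snapshot(argv[1]))
    return 0
```

Calling `sys.exit(main(sys.argv))` from a `__main__` guard would turn this into a standalone script.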
I wonder if a PDF resolver might be one way of implementing this. There are some (paid) services that take a web page address and return a PDF URL, and these could be used through the custom PDF resolver API: https://www.zotero.org/support/kb/custom_pdf_resolvers
Still, one click is better than many...
I have no idea how that could be implemented, but starting to search on the topic, I found that the Chrome DevTools Protocol has a Page.printToPDF method:
https://chromedevtools.github.io/devtools-protocol/tot/Page/#method-printToPDF
A PDF file of a page can also be obtained from a headless instance of Chrome:
https://developers.google.com/web/updates/2017/04/headless-chrome#create_a_pdf
or using Puppeteer with Node:
https://developers.google.com/web/updates/2017/04/headless-chrome#puppeteer
None of that is a real solution, but it's reasonable to think that there's a built-in method "somewhere", at least in Chrome.
Installing a Zotero printer driver component is also an option, just as Adobe Acrobat or Microsoft OneNote do.
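To give an idea of what the Page.printToPDF route involves, here is a small Python sketch of just the message handling. It builds the command that would be sent over the DevTools WebSocket and decodes the base64 `data` field that the protocol returns. The WebSocket connection itself is left out, and the function names are my own.

```python
import base64
import json


def print_to_pdf_request(message_id: int) -> str:
    """Serialize a Page.printToPDF command for the DevTools WebSocket."""
    return json.dumps({
        "id": message_id,
        "method": "Page.printToPDF",
        # printBackground is an optional parameter of the method.
        "params": {"printBackground": True},
    })


def pdf_bytes_from_response(response_text: str) -> bytes:
    """Decode the base64-encoded PDF from a Page.printToPDF response."""
    response = json.loads(response_text)
    if "error" in response:
        raise RuntimeError(response["error"].get("message", "Page.printToPDF failed"))
    return base64.b64decode(response["result"]["data"])
```

The returned bytes could be written straight to a `.pdf` file.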
https://github.com/gildas-lormeau/SingleFile
@dstillman, you're saying that the snapshot feature would still use the HTML output, not PDF output, aren't you?
In the meantime, the @qqbb trick saved me precious time saving the webpages.
Since the above discussion, we've implemented SingleFile-based snapshots, which save nearly perfect static versions of even interactive pages like this one (with graphs, charts, and illustrations). A PDF of that page from Firefox leaves out lots of content. A PDF from Chrome does a little better but still cuts off significant amounts of content at page breaks. A PDF of the SingleFile snapshot from Firefox just makes it worse, cutting off content and messing up the layout.
I understand that the previous snapshot functionality — like the HTML saving in Firefox and Chrome — didn't handle all pages well, but the idea that an HTML webpage can be reliably turned into a PDF is unfortunately just mistaken. I'd encourage people to try the new SingleFile-based snapshots, which will be available in Zotero 5.0.91 within the next day.
There _are_ annotation tools for websites you can use.
* they save the annotations separately, usually on their own servers.
* they save annotations only in ~/Downloads
* you can manually move the file they save to ~/Downloads to wherever Zotero wants it, but then they cannot re-edit and save new annotations.
But if you're aware of an annotation tool that does the job, I'd love to hear about it!
https://forums.zotero.org/discussion/9183/htmltopdf-workflow/
https://forums.zotero.org/discussion/23704/create-and-attach-pdf-from-snapshot/
https://forums.zotero.org/discussion/31876/web-page-annotation-a-working-solution
Use of web pages as resources is common. Zotero is a logical platform for organizing, annotating, and extracting information from web pages. Has there been any development on this since December 2020? Can anyone direct me to workflows, extensions, etc. that will accomplish this?
I.e. could Zotero save a website as a snapshot, but then, if an option "Save website as PDF" is checked, save the snapshot as a PDF and then keep only the PDF as an attachment?
If it's not too difficult, then I also support adding this feature. I've been just saving websites as PDFs and adding them to Zotero manually, but on iOS this takes too much time and too many clicks. I was about to request automatic website-to-PDF saving for iOS, but if this can be done everywhere that would be fantastic.
A uniform format for adding/annotating texts in Z really helps both research and writing. For example, good academic blogs (which currently can only be automatically saved as web snapshots in Z) are becoming as important as JSTOR articles for academic debates, so having two different systems for saving/annotating them makes no sense.
If you then save the annotated snapshot to your Downloads folder, you could use ZotFile's "Attach New File" function as explained above.
I think automatically saving the annotated snapshot back to the original file location isn't currently possible. See here for details: https://github.com/gildas-lormeau/SingleFile/issues/453.
But Zotero had webpage annotation in the past and may again in the future.
I appreciate the improvements made to the webpage snapshot feature and I understand that it is superior to PDF when the goal is to accurately capture the whole webpage.
However, I think there is something to be said for capturing webpages where the relevant content is an article or text.
Most of the time, when I capture a webpage to Zotero, it's because it has text with relevant information for a particular project. Maybe it's a blogpost, a news article, a WHO/EU/UN page with information on a particular topic, a Wikipedia article, etc.
In this case, I'm not interested in having an accurate representation of the whole webpage at the moment of capture, I just want the main body of the text. Having the text saved as a nicely formatted PDF allows me:
- to just focus on the text without the clutter
- to have the text formatted in a way that facilitates reading and annotating
- to read and annotate that PDF like I do with journal articles (either using Zotero Beta/Zotero iOS or any other PDF annotator of my choice)
- to work with the PDF like I do with journal articles (maintaining the workflow with Zotfile's Send to Tablet, which I still find useful to read files with particular devices or apps)
- to open, read and/or annotate the file on any device that supports opening PDF (including PCs, iPad, iPhones, Android phones and tablets, ereaders, windows tablets...)
- to share/send the file (annotated or not) to any colleague, even those who are less technically inclined ("everyone" understands and knows how to work with PDF)
- to more easily use those same files in the future in other reference/knowledge management apps, which might not support all kinds of attachments or filetypes (I doubt I'll ever leave Zotero, but I like to keep future possibilities as open as possible - that's one of the main reasons I use Zotero, to avoid being locked-in as much as possible)
Besides saving articles/text, the second most frequent scenario where I store a webpage is just to keep the link, like a bookmark, where I also don't find it necessary to save a snapshot. That's why in my particular workflow I have the snapshot auto-save disabled: most of the time it would just clutter my Zotero file list and waste storage space (however little). And in the few cases where I do want to save a snapshot, I still have that option available through right-click.
In fact, the "Saving Webpage as a PDF" should be just that: an option. Just like we can globally enable or disable the automatic saving of snapshots in the settings, and regardless still save a particular item with or without snapshot, saving as a PDF should be just another option (in addition to, and not instead of, the snapshot).
You might argue the Zotfile traditional workflow (send to tablet, extract annotations, etc) will be irrelevant when the new Zotero features reach maturity. But I don't think that will be true for every user. I'm very excited about the new features and I've been happily beta-testing them and I'm delighted to see them improve week after week. However, I don't think I'll be able to completely abandon my previous workflow for all cases any time soon, if ever. No matter how good the Zotero iOS app becomes on the iPad (and even if it supports inking/handwriting in the future), there will always be scenarios where other apps will be preferred (I'm thinking of LiquidText, Marginnote, Notability, and others, with features that Zotero likely won't - and shouldn't - be able to replicate). This is not to mention those that use Android tablets, e-readers, their phone... Besides, it is said in the announcement of the new PDF reader that "It will always remain possible to set Zotero to open PDFs in an external reader if you prefer one to the internal reader." So the Zotfile workflow will remain a viable alternative for those who prefer it (provided it remains actively developed despite the new Zotero features, which I hope it does).
You might also reiterate that webpage annotation might be supported again in the future. However, that will not answer all the points outlined above. The same can be said of using external tools to annotate webpages (like Weava, Liner, or Hypothesis), with even more problems regarding cross-device compatibility, vendor lock-in, long term access and use, and dispersion of information.
Finally, it's true that saving a webpage as a PDF can be achieved without Zotero intervention, using the print dialog from the browser or a tool like "Print Friendly and PDF", and then attaching the resulting PDF to the relevant item in Zotero. Nevertheless, being able to do this automatically and through Zotero would streamline and simplify this workflow, making it significantly quicker and easier.
I completely understand if Zotero developers still feel this is not something worth investing their (limited) time or it does not align with their vision for the program. In that case, hopefully someone will be able to make a plugin?
1. I can't highlight/annotate them
2. They open in a browser rather than within Zotero so I can't see a snapshot and take notes on it at the same time
3. I can't share them with others via email [Edit: looks like I can--they open in a browser. Still, PDF is called "Portable" Document Format for a reason--that's the format people expect for sharing]
4. On iOS snapshots don't behave in the same way as PDFs--there is no icon in the list view, clicking on the list item does not trigger their download, and, like on the desktop, they open in a separate window making notetaking impossible.
For me the iOS behavior is most baffling--it takes time, clicks, and scrolling to navigate to the snapshot in order to view it even if it is already downloaded! But a user's reasons for consulting a snapshot would be the same as viewing a PDF which is immediately available from the list view.
Note: upon reflection, I created a new thread for the iOS behavior https://forums.zotero.org/discussion/89151/ios-feature-request-snapshots-should-behave-in-the-same-way-as-pdfs
I have tested wkhtmltopdf with some pages, and most of the time it does either a great or a good enough job. Firefox also has a built-in "reader mode" that can simplify webpages, which maybe could be leveraged before doing the conversion?
I don't really have a preference for the format that is annotated so if it would be easier to make HTML files annotatable and integrated into the new annotation panel, then that would be great, but it seems like a more complex undertaking than saving a pdf.
EDIT: I started a discussion here about HTML annotations https://forums.zotero.org/discussion/89301/feature-request-add-annotations-for-web-pages