Feature request: Optionally save web pages as PDF (and a list of tools that might help)

constancappcarvalho · April 25, 2021

@bwiernik I do disagree, for all the reasons outlined above. Adding an HTML viewer and annotation tool to Zotero won't solve all the issues/situations mentioned.

It would still be welcome, though, for those who need to preserve the full page layout and/or prefer to save the whole webpage to PDF instead of just the article text, and who right now don't have many alternatives for annotations.

overossm · June 22, 2021

I found a pretty easy partial workaround for this that works 99% of the time.

I use the Chrome extension linked here: https://www.printfriendly.com/

I was leery at first but now it is one of the tools on my “starting lineup” for writing. I use it in conjunction with the Zotero Chrome extension. When I’m on a webpage I want to save, I just click the Zotero extension first to get the parent item saved along with the snapshot. Then I immediately click the Print Friendly PDF extension, which I put directly next to the Zotero one. It immediately pops up a screen where I can very quickly and easily adjust the PDF. Then you just click the download PDF button and boom, it instantly downloads in your browser. Then you just drag that file over to Zotero, attaching it to the website’s parent item. Then I can use that file just as you said, for searching and tagging purposes.

I only make adjustments to remove large pictures or unnecessary text on the page. It already turns it into a simplified version ready for printing, but I like having the cleanest copy possible without giant pictures. Sometimes it automatically removes them, sometimes not. EVEN if I have to delete some things, I would guess the whole process of going from just a website to a citation in Zotero with snapshot and readable PDF can’t be more than 5 seconds total. SUPER EASY.

I know it’s not automatic, but it’s at least worked for me so I can search and tag more easily. Once I figured this out, I also went back and saved a PDF for all my snapshots. All it takes is clicking on the snapshot, clicking on the Print Friendly PDF extension, clicking download, and dragging the file over to the citation in Zotero, just skipping the step of saving the website to Zotero. “Back tracking” like that is a little annoying and time consuming compared to an automatic feature, but easily something that can be done while watching TV, etc. Or if you’re like me, you train your 9 year old nerdy nephew (or whoever) to use Zotero and delegate out.

alflamingo · August 17, 2021

Another small contribution from the same annotate-the-web preoccupation.
Zotero Connector with beta version 5.+something, Chrome/Edge.

From the Zotero Connector you can select any text on a webpage, then right-click and choose to create a Zotero item and note from the selection. The notes are nice, attached to the item in your standalone interface, but the remaining issues are, in my view, the following:
- you have to repeat the process for every annotation, which results in as many duplicates as annotations in your library, which in turn can be fixed by merging duplicates later on. That still leaves clicks to delete the duplicate snapshots. Also, you have as many child notes as highlighted passages.
- There is no corresponding coloured highlight, not even in the snapshot file later on. You cannot see any yellow, nor can you click on the note to jump to its location from your usual library interface (a thing you can do with the super nice new annotation features in the beta, for annotated PDF files).
- It is quite a repetitive process to right-click and save as child item on everything you want to highlight

Possible paths:
- enable the Zotero Connector to save only a child item when the parent item is already saved; this still needs a couple of clicks though
- somehow enable multi-selection, the kind you use in a text editor with ctrl+click, then add everything at once. (I don't think it is possible, even with caret mode)
- make the corresponding selection highlighted in the HTML snapshot later on
- use Memex or Hypothesis to have the coloured highlight in the snapshot file, combined with the right-click combo from the Zotero Connector mentioned above. This implies processing what you want to highlight twice. A sensible corresponding workflow would be to make two readings, the first doing most of your highlighting, the second adding every note to Zotero. A positive side effect is that it restrains you from annotating like crazy, keeping only the stronger, second-pass quotations as items...
- Still difficult to make other edits though: you have to re-save the snapshot file if you make a new highlight on the snapshot HTML file, and for new annotations you want to keep as child notes you have to go back to the website, hoping the page has not changed, to do your new annotations.
- and all other already mentioned solutions, including using pdfs...

yyyi · August 23, 2021

I prefer EPUB, so being able to choose the format would be great. I use rdrview (https://github.com/eafer/rdrview), which in turn is based on Firefox's Readability (https://github.com/mozilla/readability), which Zotero could implement, or an extension could be made.

happysadhu · September 28, 2021

Zotero could automatically access and download a pdf version of Wikipedia articles through the "download as pdf" link on the lower left of every article. In the meantime this can be done manually by first downloading the PDF and then adding it to the relevant Wikipedia citation/snapshot entry in Zotero using "Add Attachment" > "Attach Stored Copy of File". The pdf can then be easily viewed and annotated in Zotero using the new PDF preview feature available in the current Zotero beta.

Formal Wiki pdfs are better presented than the results of using the built-in browser print to pdf feature, at least in Firefox.

happysadhu · October 2, 2021

Enabling Reader View in Firefox, by selecting the page icon to the right of the address, displays a minimalistic webpage that works well for printing to PDF. For some reason, Firefox shows the Reader View icon only on some websites. Toggling “reader.parse-on-load.force-enabled” to “true” using about:config displays the Reader View even on websites where the icon is otherwise missing.

kazzy · October 6, 2021

I recently ran into the same hiccup - wanted to save some pages as PDF (mainly because the current beta version of Zotero allows for in-app PDF reading).

After trying a few extensions, I settled for "PDF Mage" - because it actually opens the converted page as a PDF in the browser itself (without downloading), and Zotero even recognizes it as PDF and so I can save it directly without having to clutter the downloads folder and manually drag-drop/use Zutilo.

In Microsoft Edge, the "Download PDFs" option should be disabled, else the file is auto-downloaded instead of loading in the new tab.

I have tested it in Vivaldi too.

For the following link, Zotero could even extract metadata from the saved PDF!
https://owasp.org/www-community/attacks/csrf
--

And what if the site is overloaded with ads and fluff and you are simply interested in the text?

Well, for that I tried the "Read Pro" Chrome extension (in Vivaldi), so the workflow is:
1. Visit the webpage
2. Click the Read Pro icon to sanitize it
3. Click the PDF Mage icon to load it as a PDF in an adjacent tab
4. Save to Zotero

Read Pro also has an option to download the sanitized page as a PDF (but that again is the longer route).

Both these extensions are free to use and do not require you to create any accounts.
--

Links:
1. PDF Mage
Edge: https://microsoftedge.microsoft.com/addons/detail/pdf-mage/jncoibmpdjfaccecklaooocaenaaibni
Vivaldi: https://chrome.google.com/webstore/detail/pdf-mage/gknphemhpcknkhegndlihchfonpdcben

2. Read Pro
Vivaldi: https://chrome.google.com/webstore/detail/read-pro/ckjogkiieodbdmkeabpnhdaagilainco
Edge: Unavailable

kazzy · October 6, 2021

Other Options for extracting text:

You can try other extensions for sanitization or the Firefox “Reader View” mentioned above, but you will have to test whether your tool of choice works well with PDF Mage, i.e. whether the webpage loads as an actual PDF (URL ends with .pdf) so that the Zotero Connector recognizes it as one.
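
To make that test concrete, here is a minimal Python sketch of the "URL ends with .pdf" check. The function name looks_like_pdf_url is hypothetical, and this only mirrors the rule of thumb above; it is not a description of how the Zotero Connector actually detects PDFs.

```python
from urllib.parse import urlparse


def looks_like_pdf_url(url: str) -> bool:
    """Return True if the path part of the URL ends with .pdf (case-insensitive).

    Hypothetical helper: it only applies the rule of thumb from this thread
    and ignores query strings and fragments.
    """
    path = urlparse(url).path
    return path.lower().endswith(".pdf")


if __name__ == "__main__":
    # A converted page opened in a tab versus the original article URL.
    print(looks_like_pdf_url("https://example.org/converted/page.pdf"))
    print(looks_like_pdf_url("https://owasp.org/www-community/attacks/csrf"))
```

A page that passes this check may still not be served as a real PDF, so it is worth confirming once that the Connector offers to save it as one.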

When in a hurry, the bookmarklet I posted in this thread might help. It extracts and auto-copies text to the clipboard, saving time and clicks. Also preserves the line-breaks and indentations (https://forums.zotero.org/discussion/85569/indexing-web-pages-without-snapshots).

Or you can use the bookmarklet from Textise: https://www.textise.net/Bookmarklet.aspx

Disclaimer: I am not affiliated with any of these services/extensions in any way. I simply chanced upon them during my hunt.

kzssc · October 13, 2021

Not posing as a solution, but I do keep PDFs as snapshots of webpages for accessibility (other devices) and economic reasons (they take up less space). PrintFriendly is good but it's so slow (and sometimes it fetches inadequately from a webpage). Printing from the browser is more likely to get what you see on screen into the PDF.

I do support a built-in function w/ Zotero to save snapshots as PDF.

My current solution is to:
1) NOT use the snapshot function; just save an entry into Zotero (via the Connector)
2) Print the webpage as a PDF and then attach it in Zotero.

Caveats for speeding-up:
a) I'm on a Mac, so I can speed up the process by setting up an App Shortcut for Export as PDF (or whatever it's called in your browser; I'm on Safari). My way: cmd+shift+P, then Enter (after choosing the download directory).
b) I usually use this when I'm reading from RSS, so I work in batches without having to jump back and forth between different Zotero collections and the download directory from (a). I just save loads of entries into Zotero and print the pages to PDF into one directory.
c) Return to Zotero and use Attach New File (with Zutilo for hotkey access) for every new entry. Because you'd be working in order (newest to oldest), you can just go through the list attaching the files from your download directory.
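
For anyone who wants to double-check the ordering in step (c) before attaching, here is a small Python sketch that pairs a list of entry titles (newest first) with the PDFs in a directory, also newest first by modification time. The function name pair_entries_with_pdfs and the idea of typing in the titles by hand are assumptions for illustration; it does not talk to Zotero at all.

```python
from pathlib import Path


def pair_entries_with_pdfs(entry_titles, pdf_dir):
    """Pair entries (newest first) with PDFs sorted newest first by mtime.

    Hypothetical helper: entry_titles is a plain list of strings that you
    supply yourself, in the same newest-to-oldest order Zotero shows them.
    """
    pdfs = sorted(
        Path(pdf_dir).glob("*.pdf"),
        key=lambda p: p.stat().st_mtime,
        reverse=True,
    )
    if len(pdfs) != len(entry_titles):
        # A mismatch means a page was saved without a PDF, or the reverse.
        raise ValueError(
            f"{len(entry_titles)} entries but {len(pdfs)} PDFs in {pdf_dir}"
        )
    return list(zip(entry_titles, pdfs))


if __name__ == "__main__":
    # Example: list what would be attached where, without changing anything.
    for title, pdf in pair_entries_with_pdfs([], "."):
        print(f"{title} <- {pdf.name}")
```

The count check is the useful part: if the numbers differ, the in-order attaching would silently shift every file by one entry.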

pekka_co · November 23, 2021

I need to archive a lot of social media posts for my research. The snapshot feature doesn't appear to archive images e.g. in Tweets on twitter.com correctly.

Such posts are often routinely deleted later, so I need a static representation of what the post looked like the moment I snapshotted it. The best way to reliably do this is indeed a PDF or a screenshot.

I've also ended up at a workaround like those above - use the connector to create the item in Zotero, then jump through various hoops to attach a browser-generated PDF or screenshot to the item. This isn't very convenient, though, and arguably defeats the main purpose of the Connector feature which is convenience.

dstillman · November 23, 2021

@pekka_co: There's nothing inherent about the PDF format that would make this more reliable. The Safari connector you're using just doesn't have the improved snapshot functionality mentioned in this thread. The Chrome connector has no problem saving snapshots of images on Twitter (right-click on save button → Save to Zotero → "Web Page with Snapshot" while viewing a tweet with an image). Follow up in your other thread with Twitter-specific questions — the Twitter site is pretty hostile to automated extraction of any sort, so it's a bit of a special case.

whuber · November 23, 2021

This article presents an interesting use-case: the HTML article cannot be captured for some reason.

https://journal.transformativeworks.org/index.php/twc/article/view/436

dstillman · November 24, 2021

@whuber: You're talking about clicking the HTML link on that page and manually selecting "Web Page with Snapshot" from the menu? They're loading the content in a frame, and we currently skip frames in snapshots, since they're usually just ads. We might provide an option to change or override that at some point. In Firefox you can load just the frame (right-click → This Frame → Show Only This Frame) and save that.
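
To check whether a page loads its content in a frame before saving, a rough Python sketch like the following lists the frame sources found in a page's HTML. It is not how the Connector or its snapshot code works; the names FrameFinder and find_frame_sources are made up for illustration, and it only sees frames present in the static HTML, not ones added later by scripts.

```python
from html.parser import HTMLParser


class FrameFinder(HTMLParser):
    """Collect the src attribute of every <iframe> and <frame> tag."""

    def __init__(self):
        super().__init__()
        self.frame_sources = []

    def handle_starttag(self, tag, attrs):
        # HTMLParser passes tag names in lowercase.
        if tag in ("iframe", "frame"):
            src = dict(attrs).get("src")
            if src:
                self.frame_sources.append(src)


def find_frame_sources(html_text):
    """Return the list of frame URLs (possibly relative) in html_text."""
    finder = FrameFinder()
    finder.feed(html_text)
    return finder.frame_sources


if __name__ == "__main__":
    sample = '<html><body><iframe src="article.html"></iframe></body></html>'
    print(find_frame_sources(sample))
```

If the list is not empty and the article text is missing from the snapshot, opening the frame URL on its own and saving that is the manual equivalent of the Firefox "Show Only This Frame" route.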

FH2 · November 24, 2021

@pekka_co I have faced your problem as well, see this thread: https://twitter.com/databaseculture/status/1310544024386842624 and I agree with @dstillman that Twitter is outright hostile towards being printed or scraped.
Personally I'd prefer to have a PDF representation, since PDFs can be annotated and are searchable and interchangeable(!) with colleagues for reviews and further comments beyond Zotero.

Hanna.Marina · November 26, 2021

I think it would be great to let users decide whether they want to save a webpage as a webpage or as a PDF. Of course, PDF is missing some elements of HTML pages. But it is great for archiving a snapshot of a webpage that can still be searched.

I would appreciate such a feature very much.

alflamingo · November 29, 2021

Trying to list all possibilities, you could also...take screenshots. Then save it in a dedicated folder that you share on some cloud service, link it to a zotero collection in a shared library. Note that you could just share the link to your shared folder in the shared library so you do not use zotero storage. Printscreen key is quite fast after all and tweets are small rectangles...

Also, the collaboration process to analyse such material might be questioned: do you (we) really need to be able to comment on the very image on the spot, or can we compile annotations in a document in a more traditional fashion? (Seems like you work with a lot of data, so you may answer no.) Or you can share your comment notes as an item in Zotero too. If I am not mistaken, Google Drive enables comments on PNG files (and PDFs) for anyone who can access the shared link, even without a Gmail account. And do you need the text part of every tweet? Here you may find OCR applications that could fit into a workflow where the text is an image.

Then let me mention that some qualitative analysis software, such as ATLAS.ti, includes integration with Twitter. I do not know more, nor do I have any interest in those services, but you might check it out. In keeping with an open-access philosophy, text analysis could be done with some R Markdown tools; check IRaMuTeQ for example, though I don't know about tweet processing in that case either.

Just to recall, in the end, that the magnitude of a project might determine which tool and workflow is most suitable to realize it.

Thanks for reading!

Edit: oops, screenshots are exactly what you were doing at first. Sorry for being redundant :/

mirceglavni · December 7, 2021

This thread was very interesting.

Annotation of web-sites is indeed very much needed, so I will share my workflow: I don't use the web-connectors for Zotero. Instead, I use the Chrome SingleFile addon directly, annotate and "clean-up" the page, then save it.

Afterwards, this clean, annotated HTML file is imported into Zotero. It would be better if this workflow could be done inside Zotero (i.e. annotating after the HTML has already been imported), but for the time being, this approach works for me.

wangweinoo1 · December 31, 2021

+1

For now, I can't find a good enough tool to annotate the saved web page (snapshot):

- annotation extensions for Chrome, like "note anywhere" or other Chrome extensions:
they are not able to take notes on local HTML files, since Chrome restricts extensions from reading local files

- editing the local webpage with some HTML WYSIWYG editor:
the result is not searchable in Zotero

I find the best way for now is to save as PDF and then send it to Zotero....

michael riordan · July 19, 2022

I find most browsers produce ghastly results when printing from web pages.

If you want to annotate a PDF of a website, I suggest using a good PDF plugin - I've shopped around and found 'Print Friendly' (https://www.printfriendly.com/) is the best extension. Then use Zotfile to import the resulting file as an attachment to your Zotero library.

To set Zotfile up, download and install the Zotfile extension (http://zotfile.com/). Then, in Zotero, navigate to Tools>Zotfile Preferences..., and set the source folder to your browser's download folder.

When you want to download a PDF of a web page:

1. Save Zotero metadata in the normal way
2. Click Green 'Print Friendly & PDF' icon
3. Click the 'PDF' icon, then 'Download your PDF'
4. Right click on the item in your library > Attach new PDF > OK

atakanokan · July 27, 2022

> I use the Chrome SingleFile addon directly, annotate and "clean-up" the page, then save it

@mirceglavni How do you annotate the html output from SingleFile?

Yehudabe · January 6, 2023

Hey everyone! Any update about this?

timpaul · November 1, 2023

I'm in support of a request to simplify the workflow that involves clipping from a webpage > importing into Zotero > marking up the clipped page with highlights and notes > importing those markups and notes into Obsidian

Edit: I do think the PDF format is the most universally available mechanism to achieve these workflow goals

egoipse · November 29, 2023

We don't use Zotero as a browser or as an HTML viewer/editor. It is enough for us to be able to get a PDF from the HTML body so that we can add annotations and notes.

Please do not keep repeating the non-sequitur that "pdf is not a suitable format to save html". Sure, it is not. Nobody intends to use Zotero to embrace the full-and-rich experience of full HTML visualization. We just need a PDF with the text body, the title, the authors, and the date of publication from the HTML webpages. All the other rich and great features of HTML that cannot be saved in a PDF are completely irrelevant.

Please consider developing the option of web-clipping a PDF version of HTML pages.

alflamingo · November 30, 2023

Interesting comment @egoipse ! I would love to see a PDF option for the web clipper. As far as I'm concerned, I work with PDFs when it comes to peer-reviewed articles, so it would be consistent to work with PDFs only. Note though that I also started to use ebook formats for books. ...Overall that's a minor issue, as I can still use external plugins to save HTML pages as PDF.

agoldenvein · November 30, 2023

I was talking to the developer of SingleFile, who explained that the Zotero connector actually uses SingleFile to perform its snapshot function. Doesn't SingleFile allow saving as PDF? It seems to me enabling the option to save as PDF would be as easy as flicking a switch on the back end. What am I missing?

adamsmith · November 30, 2023

No, SingleFile doesn't save as PDF -- it's HTML, self-extracting, and regular zip.
Zotero has HTML (and epub) annotations in version 7. Especially given that, I think the chance of it developing a save HTML as PDF is basically zero -- if you urgently need it, you'd want to write an add-on or find (&presumably pay) someone else to.

agoldenvein · December 19, 2023

Looks like I was mistaken regarding SingleFile specifically. But I'm not sure I understand your claim that the chance of saving HTML as PDF is zero... what's wrong with just printing the page as PDF from the browser or using something like PrintFriendly?...

adamsmith · December 19, 2023

I'm not saying it _can't_ be done, but that Zotero won't implement it