Feature request: Optionally save web pages as PDF (and a list of tools that might help)

  • @bwiernik I do disagree, for all the reasons outlined above. Adding an HTML viewer and annotation tool to Zotero won't solve all the issues/situations mentioned.

    It would still be welcome, though, for those who need to preserve the full page layout and/or prefer to save the whole webpage rather than just the article text to PDF, and who currently have few alternatives for annotation.
  • I found a pretty easy partial workaround for this that works 99% of the time.

    I use the chrome extension linked here. https://www.printfriendly.com/

    I was leery at first, but now it is one of the tools in my “starting lineup” for writing. I use it in conjunction with the Zotero Chrome extension. When I’m on a webpage I want to save, I just click the Zotero extension first to get the parent item saved along with the snapshot. Then I immediately click the Print Friendly PDF extension, which I put directly next to the Zotero one. It immediately pops up a screen where I can very quickly and easily adjust the PDF. Then you just click the download PDF button and boom, it instantly downloads in your browser. Then you just drag that file over to Zotero, attaching it to the website's parent item. Then I can use that file just as you said, for searching and tagging purposes.

    I only make adjustments to remove large pictures or unnecessary text on the page. It already turns the page into a simplified version ready for printing, but I like having the cleanest copy possible without giant pictures. Sometimes it automatically removes them, sometimes not. EVEN if I have to delete some things, I would guess the whole process of going from just a website to a citation in Zotero with snapshot and readable PDF can't be more than 5 seconds total. SUPER EASY.

    I know it’s not automatic, but it’s at least worked for me so I can search and tag more easily. Once I figured this out, I also went back and saved a pdf for all my snapshots. All it takes is clicking on snapshot, clicking on print friendly pdf extension, click download, and drag over to citation in zotero, just skipping the step of saving the website to zotero. “Back tracking” like that is a little annoying and time consuming compared to an automatic feature, but easily something that can be done while watching TV, etc. Or if you’re like me, you train your 9 year old nerdy nephew (or whoever) to use zotero and delegate out.
  • edited August 17, 2021
    Another small contribution from the same "annotate the web" preoccupation.
    Zotero Connector with beta version 5.+something, Chrome/Edge.

    From the Zotero Connector you can select any text on a webpage, then right-click and choose to create a Zotero item and note from the selection. The notes are nice, attached to the item in your standalone interface, but in my view the remaining issues are the following:
    - You have to repeat the process for every annotation, which results in as many duplicate items as annotations in your library. You can tidy that up by merging duplicates later on, but it still takes clicks to delete the duplicate snapshots. Also, you get as many child notes as highlighted passages.
    - There is no corresponding coloured highlight, not even in the snapshot file later on. You cannot see any yellow, nor click on the note to jump to its location from your usual library interface (something you can do with the super nice new annotation features in the beta for annotated PDF files).
    - It is quite a repetitive process to right-click and save as a child item for everything you want to highlight.

    Possible paths:
    - enable the Zotero Connector to save only a child item when the parent item is already saved; this still needs a couple of clicks, though
    - somehow enable multi-selection, the kind you use in a text editor with ctrl+click, then add everything at once (I don't know whether this is possible, even with caret mode)
    - make the corresponding selection highlighted in the HTML snapshot later on
    - use Memex or Hypothesis to get the coloured highlight in the snapshot file, combined with the right-click combo from the Zotero Connector mentioned above. This means processing everything you want to highlight twice. A sensible corresponding workflow would be to make two readings: the first for most of your highlighting, the second for adding each note to Zotero. A positive side effect is that it restrains you from annotating like crazy, since only the stronger, second-pass quotations end up as items...
    - Other edits are still difficult, though: you have to re-save the snapshot file if you make a new highlight on the snapshot HTML file, and for new annotations you want to keep as child notes you have to go back to the website, hoping the page has not changed.
    - and all the other solutions already mentioned, including using PDFs...
  • I prefer EPUB, so being able to choose the format would be great. I use rdrview https://github.com/eafer/rdrview which in turn is based on Firefox's Readability library https://github.com/mozilla/readability, which Zotero could implement, or an extension could be made.
  • edited September 28, 2021
    Zotero could automatically access and download a PDF version of Wikipedia articles through the "Download as PDF" link on the lower left of every article. In the meantime this can be done manually by first downloading the PDF and then adding it to the relevant Wikipedia citation/snapshot entry in Zotero using "Add Attachment" > "Attach Stored Copy of File". The PDF can then be easily viewed and annotated in Zotero using the new PDF preview feature available in the current Zotero beta.

    The official Wikipedia PDFs are better presented than the results of the built-in browser print-to-PDF feature, at least in Firefox.
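
    As a rough illustration of how such a download could be automated, here is a minimal Python sketch that only builds the PDF URL for an article. It assumes the Wikimedia REST API's `page/pdf` endpoint is available for the wiki in question; the function name and the `lang` parameter are made up for this example and are not part of Zotero.

```python
from urllib.parse import quote


def wikipedia_pdf_url(title: str, lang: str = "en") -> str:
    """Build the REST API URL that renders a Wikipedia article as a PDF.

    Assumption: the wiki exposes /api/rest_v1/page/pdf/{title}.
    """
    # Wikipedia page titles use underscores instead of spaces.
    slug = quote(title.strip().replace(" ", "_"), safe="")
    return f"https://{lang}.wikipedia.org/api/rest_v1/page/pdf/{slug}"


if __name__ == "__main__":
    print(wikipedia_pdf_url("Cross-site request forgery"))
```

    The resulting file would still have to be attached to the Zotero item by hand, as described above.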
  • edited October 2, 2021
    Enabling Reader View in Firefox, by selecting the page icon at the right of the address bar, displays a minimalistic webpage that works well for printing to PDF. Firefox shows the Reader View icon only on some websites, for some reason. Setting “reader.parse-on-load.force-enabled” to “true” in about:config displays the Reader View icon even on those websites.
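
    If you would rather not toggle the preference by hand in every profile, the same setting can be written as a line in the profile's user.js file. The small Python sketch below only formats that line; the helper name is made up for this example, and where your profile folder lives is left to you.

```python
def user_pref_line(name: str, value: bool) -> str:
    """Format a boolean Firefox preference as a user.js line."""
    # user.js uses JavaScript literals, so Python's True/False must be lowercased.
    js_value = "true" if value else "false"
    return f'user_pref("{name}", {js_value});'


if __name__ == "__main__":
    # Paste the printed line into user.js inside your Firefox profile folder.
    print(user_pref_line("reader.parse-on-load.force-enabled", True))
```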
  • edited October 6, 2021
    I recently ran into the same hiccup - wanted to save some pages as PDF (mainly because the current beta version of Zotero allows for in-app PDF reading).

    After trying a few extensions, I settled for "PDF Mage" - because it actually opens the converted page as a PDF in the browser itself (without downloading), and Zotero even recognizes it as PDF and so I can save it directly without having to clutter the downloads folder and manually drag-drop/use Zutilo.

    In Microsoft Edge, the "Download PDFs" option should be disabled, else the file is auto-downloaded instead of loading in the new tab.

    I have tested it in Vivaldi too.

    For the following link, Zotero could even extract metadata from the saved PDF!
    https://owasp.org/www-community/attacks/csrf
    --

    And what if the site is overloaded with ads and fluff and you are simply interested in the text?

    Well, for that I tried the "Read Pro" chrome extension (in Vivaldi), so the workflow is:
    1. Visit the webpage
    2. Click the Read Pro icon to sanitize it
    3. Click the PDF Mage icon to load it as a PDF in an adjacent tab
    4. Save to Zotero

    Read Pro also has an option to download the sanitized page as a PDF (but that again is the longer route).

    Both these extensions are free to use and do not require you to create any accounts.
    --

    Links:
    1. PDF Mage
    Edge: https://microsoftedge.microsoft.com/addons/detail/pdf-mage/jncoibmpdjfaccecklaooocaenaaibni
    Vivaldi: https://chrome.google.com/webstore/detail/pdf-mage/gknphemhpcknkhegndlihchfonpdcben

    2. Read Pro
    Vivaldi: https://chrome.google.com/webstore/detail/read-pro/ckjogkiieodbdmkeabpnhdaagilainco
    Edge: Unavailable
  • edited October 6, 2021
    Other Options for extracting text:

    You can try other extensions for sanitization or the Firefox “Reader View“ mentioned above, but you will have to test if your tool of choice works well with PDF Mage i.e. loads a webpage as an actual PDF (URL ends with .pdf) so the Zotero connector recognizes it as one.
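
    To make that test concrete, here is a minimal Python sketch of the ".pdf" check. It only looks at the URL path; how the Zotero Connector actually decides that a tab is a PDF is an assumption on my part, and the function name is made up for this example.

```python
from urllib.parse import urlparse


def looks_like_pdf_url(url: str) -> bool:
    """Return True when the path part of the URL ends with .pdf."""
    # Parse first so query strings and fragments do not hide the extension.
    path = urlparse(url).path
    return path.lower().endswith(".pdf")


if __name__ == "__main__":
    for candidate in (
        "https://example.org/files/page.pdf?download=1",
        "https://owasp.org/www-community/attacks/csrf",
    ):
        print(candidate, looks_like_pdf_url(candidate))
```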

    When in a hurry, the bookmarklet I posted in this thread might help. It extracts and auto-copies text to the clipboard, saving time and clicks. Also preserves the line-breaks and indentations (https://forums.zotero.org/discussion/85569/indexing-web-pages-without-snapshots).

    Or you can use the bookmarklet from Textise: https://www.textise.net/Bookmarklet.aspx


    Disclaimer: I am not affiliated with any of these services/extensions in any way. I simply chanced upon them during my hunt.
  • edited October 13, 2021
    I'm not posing this as a solution, but I do keep PDFs as snapshots of webpages for accessibility (other devices) and economy (they take up less space). PrintFriendly is good but it's so slow (and sometimes it fetches a webpage inadequately). Printing from the browser is more likely to get what you actually see into the PDF.

    I do support a built-in function w/ Zotero to save snapshots as PDF.


    My current solution is to:
    1) NOT use the snapshot function; just save an entry into Zotero (via the Connector)
    2) Print the webpage to a PDF and then attach it in Zotero.

    Tips for speeding this up:
    a) I'm on a Mac, so I can speed up the process by setting up an App Shortcut for the browser's menu item (Export as PDF, or whatever it's called in your browser; I'm on Safari). My way: cmd+shift+P, then Enter (choose the download directory).
    b) I usually use this when I'm reading from RSS, so I work in batches without having to jump back and forth between different Zotero collections and the download directory from (a). I just save loads of entries into Zotero and print the pages to PDF into a directory.
    c) Return to Zotero. Use Attach New File (with Zutilo for hotkey access) for every new entry. Because you're working in order (newest to oldest), you can just go through the list attaching the files from your download directory.
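
    The matching in step (c) relies on both lists being in the same order. A minimal Python sketch of that idea, assuming you have the entry titles in newest-first order as plain strings (the function name is made up, and nothing here talks to Zotero itself):

```python
from pathlib import Path


def pair_pdfs_with_entries(download_dir, entry_titles):
    """Pair entry titles (newest first) with PDFs sorted newest first."""
    # Sort by modification time, newest first, to mirror the order in Zotero.
    pdfs = sorted(
        Path(download_dir).glob("*.pdf"),
        key=lambda p: p.stat().st_mtime,
        reverse=True,
    )
    # zip stops at the shorter list, so unmatched leftovers are simply skipped.
    return list(zip(entry_titles, pdfs))
```

    This only produces the pairing for you to check; the attaching is still done by hand in Zotero.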

  • I need to archive a lot of social media posts for my research. The snapshot feature doesn't appear to archive images e.g. in Tweets on twitter.com correctly.

    Such posts are often routinely deleted later, so I need a static representation of what the post looked like the moment I snapshotted it. The best way to reliably do this is indeed a PDF or a screenshot.

    I've also ended up at a workaround like those above - use the connector to create the item in Zotero, then jump through various hoops to attach a browser-generated PDF or screenshot to the item. This isn't very convenient, though, and arguably defeats the main purpose of the Connector feature which is convenience.
  • edited November 23, 2021
    @pekka_co: There's nothing inherent about the PDF format that would make this more reliable. The Safari connector you're using just doesn't have the improved snapshot functionality mentioned in this thread. The Chrome connector has no problem saving snapshots of images on Twitter (right-click on save button → Save to Zotero → "Web Page with Snapshot" while viewing a tweet with an image). Follow up in your other thread with Twitter-specific questions — the Twitter site is pretty hostile to automated extraction of any sort, so it's a bit of a special case.
  • This article presents an interesting use case: the HTML article cannot be captured for some reason.

    https://journal.transformativeworks.org/index.php/twc/article/view/436
  • edited November 24, 2021
    @whuber: You're talking about clicking the HTML link on that page and manually selecting "Web Page with Snapshot" from the menu? They're loading the content in a frame, and we currently skip frames in snapshots, since they're usually just ads. We might provide an option to change or override that at some point. In Firefox you can load just the frame (right-click → This Frame → Show Only This Frame) and save that.
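
    For anyone who wants to find the frame address outside Firefox, here is a minimal Python sketch that lists the frame URLs in a page's HTML so the frame can be opened and saved on its own. It uses only the standard library; the class and function names are made up for this example, and it has nothing to do with how the Connector handles frames internally.

```python
from html.parser import HTMLParser
from urllib.parse import urljoin


class FrameCollector(HTMLParser):
    """Collect the src URLs of <iframe> and <frame> elements."""

    def __init__(self, base_url):
        super().__init__()
        self.base_url = base_url
        self.frame_urls = []

    def handle_starttag(self, tag, attrs):
        if tag in ("iframe", "frame"):
            src = dict(attrs).get("src")
            if src:
                # Resolve relative addresses against the page URL.
                self.frame_urls.append(urljoin(self.base_url, src))


def find_frame_urls(html, base_url):
    """Return absolute URLs of all frames found in the given HTML."""
    collector = FrameCollector(base_url)
    collector.feed(html)
    return collector.frame_urls
```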
  • FH2
    edited November 24, 2021
    @pekka_co I have faced your problem as well, see this thread: https://twitter.com/databaseculture/status/1310544024386842624 and I agree with @dstillman that Twitter is outright hostile towards being printed or scraped.
    Personally I'd prefer to have a PDF representation, since PDFs can be annotated and are searchable and interchangeable(!) with colleagues for reviews and further comments beyond Zotero.

  • I think it would be great to let users decide whether they want to save a webpage as a web page or as a PDF. Of course, a PDF is missing some elements of HTML pages. But it is great for archiving a snapshot of a webpage that can still be searched.

    I would appreciate such a feature very much.
  • edited November 29, 2021
    Trying to list all possibilities, you could also... take screenshots. Then save them in a dedicated folder that you share on some cloud service, and link it to a Zotero collection in a shared library. Note that you could just share the link to your shared folder in the shared library, so you do not use Zotero storage. The Print Screen key is quite fast after all, and tweets are small rectangles...

    Also, the collaboration process for analysing such material might be questioned: do you (we) really need to be able to comment on the image itself on the spot, or can we compile annotations in a document in a more traditional fashion? (It seems you work with a lot of data, so you may answer no.) Or you can share your comments as a note item in Zotero too. If I'm not mistaken, Google Drive enables comments on PNG files (and PDFs) for anyone who can access the shared link, even without a Gmail account. And do you need the text part of every tweet? Here you may find OCR applications that could fit into a workflow where the text is an image.

    Then let me mention that some qualitative analysis software, such as ATLAS.ti, includes integration with Twitter. I do not know more, nor do I have any interest in those services, but you might check it out. In the open-source spirit, text analysis could be done with some R Markdown tools (check IRaMuTeQ, for example), but I don't know about tweet processing in that case either.

    Just to recall, in the end, that the magnitude of a project might determine which tool and workflow are most suitable to carry it out.

    Thanks for reading!

    Edit: oops, screenshots are exactly what you were doing at first. Sorry for being redundant :/

  • This thread was very interesting.

    Annotation of web-sites is indeed very much needed, so I will share my workflow: I don't use the web-connectors for Zotero. Instead, I use the Chrome SingleFile addon directly, annotate and "clean-up" the page, then save it.

    Afterwards, this clean, annotated HTML file is imported into Zotero. It would be better if this workflow could be done inside Zotero (i.e. annotating after the HTML has already been imported), but for the time being, this approach works for me.
  • +1

    For now, I can't find a good enough tool for annotating the saved web page (snapshot):

    - annotation extensions for Chrome, like "note anywhere" or others:
    they cannot take notes on local HTML files, since Chrome restricts extensions from reading local files

    - editing the local webpage with some HTML WYSIWYG editor:
    the result is not searchable in Zotero

    I find the best way for now is to save as PDF and then send it to Zotero....