Option to Save screenshots as .WARC archive files.

I love the screenshot option. It does a really good job, better than most website screenshot software.

However, I strongly recommend that Zotero offer the .WARC format as an option in addition to, or instead of, the screenshot option.

https://archive-it.org/blog/post/the-stack-warc-file/

What is .WARC?
-A WARC (Web ARChive) is a container file standard for storing web content in its original context, maintained by the International Internet Preservation Consortium (IIPC).

What does this mean?

Let’s unpack what this means. A WARC is…

1) a digital file that you can store on your own local or networked storage, like a PDF document or an MP3 audio file, complete with its own .warc file extension and application/warc mimetype.

2) a container file that houses other files. It concatenates several files into one digital object, like you’ve seen elsewhere from container formats like ZIP, GZIP, TAR, or RAR. A WARC wraps around other files like the PDF and MP3 above, along with some additional information and formatting that we’ll cover below.

3) a container for files that are native to the web. WARCs are produced by crawlers, proxies, and other utilities that retrieve files from a live web server. They can contain the PDF and MP3 files described above, for instance, but also the HTML, JS, CSS, and other structural elements that web browsers need to read in order to represent site contents to human computer users.

4) a container that can also contextualize those contents. WARCs contain technical and provenance metadata about the collection and arrangement of their media so sites can be read and represented in live web browsing experiences like they were at the time of their collection.

5) a standard container format. The WARC file format standard was published by the International Organization for Standardization (ISO) committee on technical interoperability as ISO 28500. You might get other outputs from web scraping tools, but WARC is the generally agreed-upon way to contain web archives such that people and their software know how to interpret and read the contents today and into the future.

6) a standard maintained by web archivists. Keeping up the WARC file format standard is the responsibility of the International Internet Preservation Consortium (IIPC). This coalition of practitioners does the ‘agreeing upon’ above, that keeps the WARC relevant and vital to how we collect and preserve web archives.
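To make the "container" idea in the list above concrete, here is a minimal sketch of how one file can be wrapped in a WARC/1.0 `resource` record and how records are concatenated into one archive. It uses only the Python standard library. The function names and the example URLs are my own illustrations, not part of any tool, and real crawlers write more than this (a `warcinfo` record, HTTP request/response records, digests, and usually per-record gzip compression).

```python
import uuid
from datetime import datetime, timezone


def build_warc_record(target_uri: str, content_type: str, payload: bytes) -> bytes:
    """Serialize one WARC/1.0 'resource' record: named headers, blank line, payload."""
    headers = [
        "WARC/1.0",
        "WARC-Type: resource",
        f"WARC-Target-URI: {target_uri}",
        # Provenance metadata: when the content was captured (UTC).
        f"WARC-Date: {datetime.now(timezone.utc).strftime('%Y-%m-%dT%H:%M:%SZ')}",
        f"WARC-Record-ID: <urn:uuid:{uuid.uuid4()}>",
        f"Content-Type: {content_type}",
        f"Content-Length: {len(payload)}",
    ]
    head = "\r\n".join(headers).encode("utf-8") + b"\r\n\r\n"
    # Every record ends with two CRLFs, so records can simply be concatenated.
    return head + payload + b"\r\n\r\n"


def build_warc(files) -> bytes:
    """Concatenate (uri, content_type, payload) tuples into a single WARC byte string."""
    return b"".join(build_warc_record(uri, ctype, data) for uri, ctype, data in files)


if __name__ == "__main__":
    # Hypothetical page and stylesheet, used only for illustration.
    archive = build_warc([
        ("http://example.com/", "text/html", b"<p>hi</p>"),
        ("http://example.com/style.css", "text/css", b"p { color: red; }"),
    ])
    with open("example.warc", "wb") as fh:
        fh.write(archive)
```

The point of the sketch is that each stored file keeps its original URL, capture date, and media type next to its bytes, which is what lets replay software reconstruct the page later.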

.WARC files are designed to be used by archives, museums, universities, law firms, digital research institutes, and similar organizations. Essentially, they are designed for the very same institutions and individuals that Zotero has as users.

For my own use I have turned off screenshots. When I save a website to Zotero using the connector, I save the website or article to create the entry, use a different browser extension to save the page as a .WARC file, and then manually add that file to my library. This is obviously much more labour-intensive than it would be if Zotero could save a .WARC archive of the page natively.

Thank you to anyone who reads and comments on this discussion post. I am very interested in any feedback, comments, or opinions.
  • Zotero considered WARC briefly and it's a great archival format but it's not a good idea for a tool like Zotero:
    1. WARC files, being archival, are *huge*
    2. WARC files require a dedicated player. You can't just open them in a browser.

    If you really want to use WARC files (why?), go ahead, but I think this is going to be very niche. If anything, people who want real archival copies should just push the files to the Internet Archive or a comparable site using WARC. The memento add-on used to do this automatically, though it's not currently compatible with Zotero.
  • > If anything, people who want real archival copies should just push the files to the Internet Archive or a comparable site using WARC.

    Okay, is there a way for this functionality to be added? I'm assuming some sort of check or lookup to see if there is already an archival copy uploaded, then if there is for that to be an option included. For example:
    "View Online","View Snapshot","View File","View Archive".
    If there is no archival copy, the best-case scenario would be for Zotero to create a .WARC file and upload it. The second-best scenario would be for Zotero to show the user a dialogue explaining that no archive was found and that they should upload one themselves.

    I don't have enough technical skills yet to contribute via a fork and pull request. WARC files do require a dedicated player, but I think that is mostly because they are relatively new, and that may change in the future.

    Development and acceptance of .WARC is still young. It's possible that WARC files will be more efficiently created in the future, players will become more plentiful, and support will be more broad.

    I have to admit you are correct that WARC files are currently very niche, and Zotero as an organization has to make trade-offs about where development resources are deployed. So my final request is this: while support and integration for WARC are currently extremely low, could you please keep it in the back of your mind in case it becomes more relevant to the Zotero community in the future?
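The "check or lookup" step proposed in the comment above could be sketched roughly as follows, using the Wayback Machine availability endpoint. This is not Zotero code and not a description of any existing plugin. The function names are hypothetical, and the response shape (`archived_snapshots` → `closest` → `available`/`url`) is my understanding of that endpoint, so treat it as an assumption to verify.

```python
import json
from typing import Optional
from urllib.parse import urlencode
from urllib.request import urlopen

# Wayback Machine availability endpoint (response shape assumed, see lead-in).
AVAILABILITY_ENDPOINT = "https://archive.org/wayback/available"


def availability_url(page_url: str) -> str:
    """Build the lookup URL for a given page."""
    return AVAILABILITY_ENDPOINT + "?" + urlencode({"url": page_url})


def closest_snapshot(response: dict) -> Optional[str]:
    """Return the archived copy's URL from a parsed response, or None if there is none."""
    closest = response.get("archived_snapshots", {}).get("closest")
    if closest and closest.get("available"):
        return closest.get("url")
    return None


def find_archive(page_url: str, timeout: float = 10.0) -> Optional[str]:
    """Query the endpoint over the network; None means no archived copy was reported."""
    with urlopen(availability_url(page_url), timeout=timeout) as resp:
        return closest_snapshot(json.load(resp))


if __name__ == "__main__":
    snapshot = find_archive("example.com")
    # A client could offer "View Archive" when a snapshot exists,
    # and otherwise prompt the user to submit the page for archiving.
    print(snapshot or "No archived copy found")
```

Keeping the parsing in `closest_snapshot` separate from the network call means the decision logic can be checked without contacting the Internet Archive.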
  • WARC files are hardly new — they've been used by the Internet Archive for ages. They're not something we're planning to ever support natively. As adamsmith said, there was a plugin that triggered IA archiving, and someone could update that if they wanted.

    But I'd echo adamsmith in questioning what problem you're actually trying to solve by using WARC. Since switching to use the SingleFile extension, Zotero now saves the final rendered version of the page as a single HTML file with embedded images as data URIs. It's hard to get more standard and stable than that.
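For readers unfamiliar with data URIs, the idea of embedding images into a single HTML file can be illustrated with a few lines of Python. This is only a sketch of the general technique, not how SingleFile is implemented. The function names and the `resources` mapping are hypothetical, and the regular expression only handles simple double-quoted `src` attributes.

```python
import base64
import re


def to_data_uri(data: bytes, mime_type: str) -> str:
    """Encode raw bytes as a base64 data URI."""
    return f"data:{mime_type};base64,{base64.b64encode(data).decode('ascii')}"


def inline_images(html: str, resources: dict) -> str:
    """Replace <img src="..."> values with data URIs.

    `resources` maps each src string to a (mime_type, bytes) tuple;
    images that are not in the mapping are left untouched.
    """
    def replace(match):
        src = match.group(2)
        if src not in resources:
            return match.group(0)
        mime_type, data = resources[src]
        return match.group(1) + to_data_uri(data, mime_type) + match.group(3)

    return re.sub(r'(<img\b[^>]*?\bsrc=")([^"]*)(")', replace, html)


if __name__ == "__main__":
    page = '<p>Logo:</p><img src="logo.png">'
    # Placeholder bytes stand in for a real image file.
    print(inline_images(page, {"logo.png": ("image/png", b"hi")}))
```

Because the image bytes travel inside the HTML itself, the saved page opens in any browser without a separate player or extra files.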
  • Thank you for mentioning the switch to the SingleFile extension. I read the forum posts discussing it, and it turns out that I was not waiting for the snapshot icon to switch from transparent to full, which I now understand indicates that the snapshot process is complete. I had just been clicking "done", and my snapshots were not rendering well.

    There are a few websites where the built-in SingleFile extension isn't working perfectly (www.tabletmag.com is one example), but for the most part I'm finding it to be pretty awesome now that I know just to be patient!

    Thank you for your responses.
  • If there's a specific URL that's not working well, let us know.