Compatibility with DevonThink or webarchive format

David Auerbach · June 16, 2007

At the moment using both Zotero and Devonthink is awkward. They serve very different purposes, but it would be good to have them mesh better. Devonthink can import the Zotero files, but they remain merely a collection of files making up a website. Devonthink can import the bibtex export, but that loses the website itself. There *is* a standard that would be useful since both the Mac generally and Devonthink in particular understand it: webarchive. That's a format to encapsulate a website that is treated as a single file but is searchable, etc. by Devonthink.
I'm not asking that Zotero move to it as its format for websites, but simply that you consider adding it as an export option.

bdarcus · June 16, 2007

Not really knowing anything about webarchive, I googled. According to this post, it is a private (i.e. Apple/Safari-only) format. Rather typical of Apple really. So in that sense, it's hardly a "standard."

I really don't recommend that Zotero support a closed format of this sort, even if they could (which given the file I looked at seems doubtful).

It would be nice, BTW, if Zotero (or Firefox) would compress the saved archives, at least optionally. They take up a fair bit of space; much more than my database.

dstillman · June 16, 2007

It would be nice, BTW, if Zotero (or Firefox) would compress the saved archives, at least optionally.

https://bugzilla.mozilla.org/show_bug.cgi?id=379633

David Auerbach · June 16, 2007

Bdarcus misses some of my point.
1. There are a lot of Devonthink users. ANY readable format would be nice. But Zotero saves its web archives in its own format that, for one, has arbitrary folder names. It would be nice if Zotero's web archive were either a) saved in a format that presented the relevant information (page title, home page, etc.) in a transparent way or b) offered an export option for a widely used format.
2. Note that Zotero offers export options in other "private" formats.
3. Devonthink is, of course, not the only text-handling database program suitable for academics, writers, researchers, etc. And, all of them could use an easy way to integrate with zotero.
4. I'm sure that someone good at scripting could write a script to convert zotero's RDF format to another one (like webarchive).

bdarcus · June 17, 2007

The point about the private format is to suggest not only that they shouldn't try to create these files, but that they can't. In the information I could find about webarchive, there's no information about writing or reading these files outside of Apple APIs. This is a format designed by and for Apple. Just open one of the files in a text editor and look at the information they put in the file.

Based on these criteria (not an open format, and not accessible outside of proprietary and platform-specific APIs), how does it hold that Zotero exports to other private formats?

But that aside, I agree that being able to have more meaningful folder names would be useful. Indeed, this is what Firefox itself does when you save a web page.

David Auerbach · June 17, 2007

I'm not sure what you mean by proprietary and platform specific.
Other programs deal with webarchives (Devonthink for one). People have already written very small scripts that turn a site one is looking at (in Firefox, e.g.) with one click into a webarchive and push it into Devonthink.
It may be, of course, that I'll get more joy from the Devonthink crowd and maybe someone there will write a script.

pierfranco · June 17, 2007

I agree with the request made by David Auerbach.
The import of web pages is a very convenient feature but should be improved, because at present all the components of the web page (images etc.) are downloaded into a single folder. That is, Zotero now just uses the "save all files" method. Undoubtedly this enables you to retrieve the stored web page and view it as if it were just retrieved from the Internet, but at the same time after a while your disk is full of useless files.
I am unable to discuss the opportunity of adopting the webarchive format proposed by Auerbach, but I can see that Firefox has a compressed format for saving web pages as a single file. On a Mac this is possible not by using the Save page... command from File but by pressing the ALT (or CTRL) key on a hyperlink: in this way a new window opens and you can save the webpage which the chosen link points at in a format called HyperText which produces a single file.
If Firefox can do it, probably even Zotero could do the same and this would be a great improvement on the current method (at least for our disk space!)

Another problem that I put to your attention and that if it were solved could improve the compatibility between Zotero and DevonThink is the fact that Zotero gives to every folder which contains a saved web page a number, which obviously is the ID which connects the web page to the data stored in the database. But in my opinion this method could also be improved. I explain this with an example.
At the moment if you import the content of the Zotero storage folder into DevonThink you get a list of numbered folders which make no sense (only their content is meaningful and you have to open each of them to see it). If Zotero could adopt another method for assigning a unique ID to the downloaded web pages, this problem could be solved. For example: if Zotero could use a compressed system for saving web pages in a single file as I proposed above, then it could create the unique ID from the title of the web page plus something else, e.g. a number or a date (a similar method for producing a unique record key is used by standard BibTeX). That would greatly improve the Zotero storage folder, making the view of the saved files in the file system more comprehensible.
Any comments?
pierfranco

bdarcus · June 17, 2007

I'm not sure what you mean by proprietary and platform specific.

Simple: they only work on OS X, and can only be created (from my understanding) by hooking into OS X APIs.

Firefox and Zotero also work on Windows, Linux, and a number of other operating systems.

Other programs deal with webarchives (Devonthink for one).

Right, but only Mac applications.

People have already written very small scripts that turn a site one is looking at (in Firefox, e.g.) with one click into a webarchive and push it into Devonthink.

As above; I have a feeling those scripts actually access Apple APIs to read and/or write the webarchive files.

I think the more promising short-term solution is for someone to write a little script that renames the folders that Zotero does create and maybe load them from there directly into DT.
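
A rough sketch of what such a renaming script could look like, in Python. It assumes the snapshot folders have purely numeric names and each contains at least one .html file with a <title> element; the helper names (page_title, plan_renames) are made up for illustration. It only builds a rename plan and keeps the numeric ID as a suffix. Since Zotero's database refers to the original folders, it would be safer to run this on a copy of the storage folder rather than the live one.

```python
import re
from html.parser import HTMLParser
from pathlib import Path


class TitleParser(HTMLParser):
    """Collects the text inside the first <title> element."""

    def __init__(self):
        super().__init__()
        self.in_title = False
        self.done = False
        self.parts = []

    def handle_starttag(self, tag, attrs):
        if tag == "title" and not self.done:
            self.in_title = True

    def handle_endtag(self, tag):
        if tag == "title" and self.in_title:
            self.in_title = False
            self.done = True

    def handle_data(self, data):
        if self.in_title:
            self.parts.append(data)


def page_title(html_text):
    """Return the page title with runs of whitespace collapsed."""
    parser = TitleParser()
    parser.feed(html_text)
    return " ".join("".join(parser.parts).split())


def plan_renames(storage_dir):
    """Return {old_path: new_path} for numbered folders that hold an HTML file."""
    plan = {}
    for folder in sorted(Path(storage_dir).iterdir()):
        if not (folder.is_dir() and folder.name.isdigit()):
            continue
        html_files = sorted(folder.glob("*.htm*"))
        if not html_files:
            continue
        text = html_files[0].read_text(encoding="utf-8", errors="replace")
        title = page_title(text)
        if not title:
            continue
        # Replace characters that are awkward in file names and cap the length.
        safe = re.sub(r"[^\w\- ]+", "_", title).strip()[:60]
        # Keep the numeric ID as a suffix so names stay unique and traceable.
        plan[folder] = folder.with_name(f"{safe} ({folder.name})")
    return plan
```

Applying the plan would then be a loop of old.rename(new) over the returned pairs, after which the renamed copy could be imported into DT.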

bdarcus · June 17, 2007

Another problem that I put to your attention and that if it were solved could improve the compatibility between Zotero and DevonThink is the fact that Zotero gives to every folder which contains a saved web page a number, which obviously is the ID which connects the web page to the data stored in the database. But in my opinion this method could also be improved.

This relates to discussions elsewhere about citation IDs. The safest way to do this is in fact to use the URI for the document.

But you can't use that for a directory name. Maybe use a more human readable label and include some index file that associates the directory with the original URI? E.g.:


-
  uri: http://ex.net/1
  directory: some_nice_name

Just an idea; not sure how good it is ...
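
To make the idea concrete, here is a minimal Python sketch that writes and reads an index file in the layout shown above. The function names and the index file itself are hypothetical, not anything Zotero provides, and the parser only understands this exact two-key layout rather than general YAML.

```python
def write_index(entries, path):
    """Write (uri, directory) pairs in the '-' / uri / directory layout."""
    with open(path, "w", encoding="utf-8") as fh:
        for uri, directory in entries:
            fh.write("-\n")
            fh.write(f"  uri: {uri}\n")
            fh.write(f"  directory: {directory}\n")


def read_index(path):
    """Return {directory: uri} parsed from a file written by write_index."""
    mapping = {}
    uri = None
    with open(path, encoding="utf-8") as fh:
        for line in fh:
            key, sep, value = line.strip().partition(": ")
            if not sep:
                continue  # the '-' separator lines and blank lines
            if key == "uri":
                uri = value
            elif key == "directory" and uri is not None:
                mapping[value] = uri
                uri = None
    return mapping
```

An external program could then call read_index once and look up the original URI for any human-readable directory name.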

dstillman · June 18, 2007

I can see that Firefox has a compressed format for saving web pages as a single file. On a Mac this is possible not by using the Save page... command from File but by pressing the ALT (or CTRL) key on a hyperlink: in this way a new window opens and you can save the webpage which the chosen link points at in a format called HyperText which produces a single file.

That's not a compressed file—it's just saving the raw HTML file to disk, changing relative URLs to absolute ones so that images and other files will load off the original server when the file is opened.
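
For illustration only, the relative-to-absolute rewriting described here can be approximated in a few lines of Python. This is not Firefox's actual code; it is a simplified sketch that only touches quoted src and href attributes, and the function name absolutize is made up.

```python
import re
from urllib.parse import urljoin

# Matches src="..." or href='...' and captures the prefix, quote and URL.
ATTR = re.compile(r'''(\b(?:src|href)\s*=\s*)(["'])(.*?)\2''', re.IGNORECASE)


def absolutize(html_text, base_url):
    """Rewrite relative src/href values so they resolve against base_url."""

    def fix(match):
        prefix, quote, url = match.groups()
        # urljoin leaves already-absolute URLs unchanged.
        return f"{prefix}{quote}{urljoin(base_url, url)}{quote}"

    return ATTR.sub(fix, html_text)
```

The resulting file is a single HTML document, but the images still live on the original server, which is why it is not an archive.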

Mozilla has no native archive format. There's an extension that allows saving of sites into a compressed/single-file format, but it's not under active development anymore and was never released for OS X. As I noted above, ZIP writing is planned for Mozilla, and we might offer the ability to use that when it's available, though it wouldn't be ideal for a number of reasons (harder to search, would only load and display in a browser with Zotero installed, etc.). Ideally Mozilla will implement a native solution that has some chance of interoperability with other [open-source] browsers. I wouldn't be surprised if this happens fairly soon after ZIP writing is added. There's a tracking bug on Bugzilla for all the various archive requests over the years.

dstillman · June 18, 2007

We could probably improve the folder-naming scheme to base the folder name on the name used for the attachment file itself. That name is configurable with the extensions.zotero.attachmentRenameFormatString pref, though currently that pref (a) only supports the default fields, (b) isn't used for all attachment types, and (c) is pretty badly named and might change. Using non-ID folder names would just require a good bit more logic in terms of handling name conflicts, renaming the folder when renaming the attachment, etc. And note that the folder name would not be authoritative, as it might be missing some characters that are invalid on the filesystem, be truncated, or have an integer appended to make the name unique.
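
As a sketch of the extra logic involved (not Zotero code; the function name, the invalid-character set, and the 50-character limit are all assumptions chosen for illustration), the three transformations mentioned above could look like this in Python:

```python
import re

# Characters that are invalid in file names on at least one common filesystem.
INVALID = re.compile(r'[<>:"/\\|?*\x00-\x1f]')


def safe_folder_name(name, taken, max_len=50):
    """Return a filesystem-safe folder name not already present in `taken`.

    `taken` is a set of lowercased names already in use; it is updated in place.
    """
    # Drop invalid characters, then truncate and trim trailing dots/spaces.
    cleaned = INVALID.sub("", name).strip(" .")[:max_len].rstrip(" .") or "untitled"
    candidate = cleaned
    n = 1
    # Compare case-insensitively, since some filesystems ignore case.
    while candidate.lower() in taken:
        n += 1
        candidate = f"{cleaned} {n}"
    taken.add(candidate.lower())
    return candidate
```

Because characters are dropped, long names are cut, and a counter may be appended, two different attachment names can map to similar folder names, which is why the folder name could not be treated as authoritative.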

Zotero itself doesn't need the index file that Bruce suggests, since it stores the path to the main attachment file in the database. However, providing some way for external consumers to map directories back to their ids/metadata might be helpful, though to do anything meaningful they'd probably want to access the Zotero database anyway (using either SQLite itself or some future local socket-based API that Zotero provided), in which case they could just look it up based on the filename. But perhaps Zotero could create a .zotero-info file in the attachment directory that provided the id/uri. Maintaining a single index file would be trickier and slower to update.

Re: DevonThink, I'd agree that, for now, getting someone in the DT community to write a script to parse Zotero RDF and convert the exported folders of HTML files to .webarchive using the Apple API might be the best approach. Note that, other than the folder names, we don't save snapshots in our "own format." They're just HTML and related files.

awowwed · November 19, 2007

I was directed here from a posting to discussion 1616, where I expressed the need to store Zotero snapshots in a single file. However, my suggestion is not to use a proprietary format. I am suggesting that Zotero save to MHT or MATF format (HTML with MIME encoded images attached). This is a completely open format and still accomplishes the goal of storing the entire snapshot to a single file.

My reason for this is that I use Groove 2007 to synchronize multiple computers. It has a limit on the raw number of files it can synchronize. Most of the time this is not a problem, but when you take a snapshot of a page that has 10 or 15 associated images, stylesheets, etc. the folder quickly grows to exceed this limit.
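
To show what "HTML with MIME encoded images attached" means in practice, here is a minimal Python sketch that packs a page and its images into one multipart/related message using the standard email package. The function name build_mht and its argument layout are made up for this example, and real MHT writers may add further headers; treat it as an outline of the structure rather than a complete implementation.

```python
from email.mime.image import MIMEImage
from email.mime.multipart import MIMEMultipart
from email.mime.text import MIMEText


def build_mht(html_text, page_url, images):
    """Return the bytes of a single-file archive for one page.

    `images` is an iterable of (url, data, subtype) tuples,
    for example ("http://ex.net/a.png", png_bytes, "png").
    """
    archive = MIMEMultipart("related", type="text/html")

    # The first part is the page itself, labelled with its original URL.
    page = MIMEText(html_text, "html", "utf-8")
    page["Content-Location"] = page_url
    archive.attach(page)

    # Each image becomes a base64-encoded part keyed by its own URL,
    # so references in the HTML can be resolved inside the archive.
    for url, data, subtype in images:
        part = MIMEImage(data, _subtype=subtype)
        part["Content-Location"] = url
        archive.attach(part)

    return archive.as_bytes()
```

Writing the returned bytes to a .mht file gives one file per snapshot instead of a folder of 10 or 15 files.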

noksagt · November 19, 2007

I directed you here, just as Dan directed someone from discussion 1161 here. See post 10 (by Dan). Mozilla has no MHTML or MAF support at this time. A MAF extension had a bit of a hiatus, but it may be developed for Firefox 3.

nickdep · May 14, 2008

After spending a couple of days test driving this add-on, I've found that Zotero is quite impressive. That said, there are only two features that will prevent me from using this application.

1. Compatibility with Firefox 3.0 - this of course, will be an inevitability.

2. Saving a web page as single file, ideally MHTML - there is an open source plugin for Firefox that saves files in MHT format. If Mozilla doesn't integrate this within Firefox 3.0, Zotero should.

Check out UnMHT for Firefox
http://www.unmht.org/unmht/en_index.html

bdarcus · May 14, 2008

Saving a web page as single file.

I don't speak for Zotero, but in proposing a feature request like this, you might step back and specify more broadly what problem you believe your proposed solution will solve (in ways different from the ideas that emerged out of this thread).

While I can see the value of being able to easily move around a document, I don't think a "single file" is necessary for that, nor that MHTML is a particularly good solution.

I've not looked at Mozilla's alternative closely, but it seems they took a better tack: a compressed archive.

Moreover, I don't think you should be complaining to Zotero; this is really a Mozilla issue. If it's such a critical feature, with such an obvious solution, it seems to me you'll see it in Firefox.

weiyg · June 13, 2009

I have installed the MAF extension, which lets Firefox save webpages in MHT and MAFF formats. Is there a way to configure Zotero to use this format instead of the native HTML format when archiving webpages?

Thanks!

kmlawson · September 16, 2010

I have created a script that will import (sync) entries from Zotero into DEVONthink. It recreates the hierarchy of Zotero in a chosen group in DEVONthink, then gives each source a group, and inside that group a main RTF note file. It also copies over all tags from Zotero.

I posted info on this in this forum posting:

http://forums.zotero.org/discussion/14268/devonthink-and-zotero/

Unfortunately I haven't touched the issue of attachments such as PDFs etc. which are associated with entries. It would be great if someone modified the script to do that.