epub extraction

dgwakeman · February 15, 2012

Hello, I own and use an e-reader, and I use software called calibre (http://calibre-ebook.com/) to put material on it. I also dislike the current PDF model for journal articles: it makes articles very difficult to use in any way other than printing (an environmentally costly technique). I know that calibre has some automated techniques for extracting an ebook from HTML, and HTML versions of articles are available for most if not all journals (including PubMed Central). I would love to see Zotero (either on its own or using calibre) be able to extract an ebook from the HTML version (note that this may also be something the publishers can help with, e.g. by providing documentation of their HTML generation). FYI, I am also contacting the calibre group about having them access the Zotero database.
Thank You, I love zotero, and have converted many users.
Dan Wakeman

adamsmith · February 15, 2012

If I understand you correctly, you could just use calibre to convert the Zotero snapshots to e-pubs.
That can be done already; it could presumably be sped up by a plugin. All tablet functionality for Zotero has been coded by third-party developers, and so would any integration/auto-conversion for epubs.
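
A rough sketch of what scripting that conversion could look like, assuming calibre is installed and its `ebook-convert` command-line tool is on the PATH. The helper names and the example snapshot path are hypothetical, and this is not something Zotero itself provides:

```python
import shutil
import subprocess
from pathlib import Path


def build_convert_command(snapshot_html, output_dir):
    """Return the ebook-convert argument list for one snapshot HTML file."""
    src = Path(snapshot_html)
    dest = Path(output_dir) / (src.stem + ".epub")
    # ebook-convert takes the input file first, then the output file;
    # the output format is chosen from the output file's extension.
    return ["ebook-convert", str(src), str(dest)]


def convert_snapshot(snapshot_html, output_dir):
    """Run calibre's ebook-convert on a snapshot (requires calibre)."""
    if shutil.which("ebook-convert") is None:
        raise RuntimeError("ebook-convert not found; is calibre installed?")
    cmd = build_convert_command(snapshot_html, output_dir)
    subprocess.run(cmd, check=True)
    return Path(cmd[2])
```

Calling `convert_snapshot("article.html", "ebooks")` on a saved snapshot (hypothetical file name) would write `ebooks/article.epub`, provided the `ebooks` folder exists. A plugin would mostly need to find the snapshot file for an item and hand it to something like this.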

Also, and without wanting to get into a big discussion about the relative merits of various file formats: With the advent of tablets, the notion that the only way to use/read pdfs is to print them is pretty much past, no?

dgwakeman · February 15, 2012

Well, maybe I don't understand the features properly, but as I understand it the snapshots are just links, so when you are offline they are useless. For example, when I double-click on an article (one that I had snapshots turned on for when adding it to Zotero 3.0.1) while Firefox is in offline mode, nothing happens, whereas in online mode it takes me to the article. I want it to do the full ebook extraction when I add the article to Zotero.

Ideally it would also do a bit deeper digging: automatically check PubMed Central for the full article even if I am on Google Scholar, and check any library subscriptions I have as well. Then convert the HTML to epub (mobi or whatever) right then and there, so there is no need to worry about a connection later. I realize this is pretty fancy, but it doesn't seem totally wild, and I thought the publishers (at least friendly ones like PLoS and Frontiers) might help with the data extraction/conversion. This would have the added benefit of being much smaller than the entirety of a journal's web page, which is often filled with lots of extraneous material (the publisher's search engine, links to related articles from the publisher, and other nonsense).

On the tablets comment: while technically accurate, PDFs are a poor way of dealing with this information. PDFs are best for a fixed-size display medium (personally I think PDFs are terrible on LCD displays, painful on the eyes, and difficult to maneuver on e-ink displays). Therefore flowable text is preferable.

adamsmith · February 15, 2012

Well, maybe I don't understand the features properly, but as I understand it the snapshots are just links, so when you are offline they are useless. For example, when I double-click on an article (one that I had snapshots turned on for when adding it to Zotero 3.0.1) while Firefox is in offline mode, nothing happens, whereas in online mode it takes me to the article.

Yes, you are misunderstanding that. Snapshots are full offline copies of the webpage in HTML format. I can't tell you why they're not opening for you in offline mode - try again. They look similar to links in Zotero, though, so you'd have to make sure that Zotero actually took a snapshot.

The other issue would be interesting; it's just not going to happen from Zotero's side. What you describe - searching and querying other databases etc. - is actually quite a challenge. The conversion part could be automated more easily, but as I say, that's just not something that the Zotero core team is going to touch; there is simply no capacity.

What I do think may be desirable in the long term is for Zotero to improve its handling of .epub files - especially indexing and potentially annotating (along with annotating PDFs, which might or might not happen). But that, too, depends on the availability of third-party libraries.

dgwakeman · February 15, 2012

Sorry about the first bit above. It seems there are many nuances to getting snapshots (it seems to only work after restarting Firefox, and third-party cookies must be on for some publishers, Elsevier for example).

Now that I have them working, I can say that snapshots do not include the images (at least from ScienceDirect/Elsevier), which are essential for most articles (at least in science). Zotero also seems to provide only links, not snapshots, for PubMed Central (this may be a licensing thing, but I don't know). Both of these would be helpful.

Again, I don't know how difficult any of these things are, but since you reliably extract the DOI, I would think that would help to automate the search (obviously the mechanisms for accessing these databases are too obtuse).
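
To illustrate the DOI idea, here is a minimal sketch of pulling a DOI out of some text and building a link through the doi.org resolver. The regular expression is a simplified pattern of my own (real DOIs can be messier), the function names are hypothetical, and the example DOI in the usage note is made up. The resolver only leads to the publisher's landing page, so whether the full text is reachable still depends on subscriptions:

```python
import re
from urllib.parse import quote

# Simplified DOI pattern: "10.", a 4-9 digit registrant code, "/", then a suffix.
DOI_PATTERN = re.compile(r"\b10\.\d{4,9}/[^\s\"<>]+")


def find_doi(text):
    """Return the first DOI-like string in text, or None if there is none."""
    match = DOI_PATTERN.search(text)
    if match is None:
        return None
    # Trailing punctuation usually belongs to the sentence, not the DOI.
    return match.group(0).rstrip(".,;)")


def doi_resolver_url(doi):
    """Build a resolver link for a DOI, percent-encoding unusual characters."""
    return "https://doi.org/" + quote(doi, safe="/")
```

For example, `find_doi("see doi:10.1234/example.5678.")` picks out `10.1234/example.5678`, which `doi_resolver_url` turns into a link that redirects to wherever the publisher hosts the article.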

adamsmith · February 15, 2012

We can't do anything about the images in snapshots, no. It's third party code (also, the images aren't actually on the webpage, right? They're just linked to?).
AFAIK we never get article full text from Pubmed - be it as html or as pdf.
I fear Zotero is just the wrong vantage point to go at this. The entire academic publishing industry is set up to work with PDFs, like it or not. Zotero can't change that.
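
On the image question, one way to check whether a saved snapshot's images are stored with it or only linked to is to list the `img` sources in the HTML and separate remote URLs from local paths. This is a minimal sketch using only the Python standard library; the class and function names are hypothetical, and it assumes images are referenced through plain `<img src=...>` tags rather than loaded by scripts:

```python
from html.parser import HTMLParser
from urllib.parse import urlparse


class ImageSourceCollector(HTMLParser):
    """Collect the src attribute of every <img> tag in a document."""

    def __init__(self):
        super().__init__()
        self.sources = []

    def handle_starttag(self, tag, attrs):
        if tag == "img":
            src = dict(attrs).get("src")
            if src:
                self.sources.append(src)


def split_image_sources(html_text):
    """Return (local, remote) lists of image sources found in html_text."""
    collector = ImageSourceCollector()
    collector.feed(html_text)
    local, remote = [], []
    for src in collector.sources:
        scheme = urlparse(src).scheme
        # Absolute http(s) URLs and protocol-relative URLs need a connection.
        is_remote = scheme in ("http", "https") or src.startswith("//")
        (remote if is_remote else local).append(src)
    return local, remote
```

If most of the figures end up in the `remote` list, they were never saved with the snapshot and will not display offline.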

dgwakeman · February 15, 2012

But this seems weird, as you can see the full HTML article, e.g. here:
http://www.ncbi.nlm.nih.gov/pmc/articles/PMC2720756/
Why not pull a snapshot of this? Thanks

dgwakeman · February 15, 2012

And the images are on the webpage (at least in the view), e.g.
http://www.sciencedirect.com/science/article/pii/S0920121109001089
I'm not trying to get zotero to change it. I was trying to get it to take advantage of the html nature of it at least.

cjb · March 5, 2012

I'm no developer, but perhaps the dotEPUB bookmarklet might come in handy? It seems to work quite well for academic articles - I can manually create epub files in the browser, a little like how web services like Instapaper/Read It Later would be able to if they could get behind the paywall.
It's a manual process (browse, click, save, drag into Zotero) using entirely in-browser features of open-source projects, so I'm guessing it's the sort of 'hack' that might be able to be automated in Zotero or via a plugin without too much trouble.

http://dotepub.com/

the source is available at

http://code.google.com/p/dotepub/