Is there a way to "Retrieve Metadata" for EPUB files within Zotero?

Bibliomania · April 6, 2016

I have over 5,000 articles, books, and other bits in my library right now. Nearly a sixth of them are EPUB-format books. Are there any internal or external tools to batch-retrieve the metadata for these files?

adamsmith · April 6, 2016

Definitely not internal, and no external ones that I'm aware of, sorry. As preferable as EPUB is as a format in many ways, there are far fewer well-developed tools that would let us work with it (within Zotero) easily.

Bibliomania · April 6, 2016

I'm curious. How does the metadata retrieval system work in Zotero compared to Calibre? Would it be possible to use Calibre's ebook-meta.exe/fetch-ebook-metadata to jury-rig something that could work well enough?

adamsmith · April 6, 2016

I don't use Calibre, but that doesn't work automatically, right? You'd have to fill out some fields and then use those to query? If I understand that correctly, it doesn't even read anything from the epub when using that option.

That could be done in Zotero, probably more easily than for Calibre, actually, but it's a completely separate workflow that'd have to be implemented. Something like that is generally planned, but will take time.

noksagt · April 6, 2016

This issue has been brought up before for this and various other formats. A couple of points:

Many formats have tools that allow the plain text to be dumped, much as we use the xpdf/poppler pdftotext program, so our current method of getting PDF data could conceivably be extended to other formats.
Many of these formats (including PDF) allow structured metadata to be embedded, and there are various tools that could read the metadata included with the file.

I believe that Calibre ships with tools that read/write the embedded metadata.
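
To illustrate the embedded-metadata point for EPUB specifically: an EPUB is a ZIP archive whose META-INF/container.xml points at an OPF package file, and that file carries Dublin Core elements (title, creator, and so on). A minimal sketch of reading them with only the Python standard library follows. The function name read_epub_metadata is made up for illustration, this is not how Zotero or Calibre do it, and real-world files may be malformed or DRM-protected, which the sketch does not handle.

```python
import zipfile
import xml.etree.ElementTree as ET

# Namespaces defined by the EPUB container format and Dublin Core.
CONTAINER_NS = {"c": "urn:oasis:names:tc:opendocument:xmlns:container"}
DC_NS = "http://purl.org/dc/elements/1.1/"


def read_epub_metadata(path):
    """Return the Dublin Core fields of an EPUB as {name: [values]}.

    Hypothetical helper for illustration; `path` may be a file name
    or a file-like object.
    """
    with zipfile.ZipFile(path) as zf:
        # container.xml tells us where the OPF package file lives.
        container = ET.fromstring(zf.read("META-INF/container.xml"))
        rootfile = container.find("c:rootfiles/c:rootfile", CONTAINER_NS)
        opf = ET.fromstring(zf.read(rootfile.attrib["full-path"]))

    metadata = {}
    prefix = "{" + DC_NS + "}"
    for element in opf.iter():
        # Keep only non-empty Dublin Core elements such as dc:title.
        if element.tag.startswith(prefix) and element.text and element.text.strip():
            name = element.tag[len(prefix):]
            metadata.setdefault(name, []).append(element.text.strip())
    return metadata
```

Calling read_epub_metadata("book.epub") on a well-formed file should give something like a dict with "title" and "creator" lists, which could then be used to query a catalogue for a full record. How trustworthy those embedded values are is a separate question.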

adamsmith · April 6, 2016

Yes, right, Calibre also reads and writes the actual embedded metadata.
For PDFs, specifically, I think the current state of XMP is so bad that it's entirely useless (i.e. you get so many false positives that importing it is worse than nothing).

I'd imagine that for epub that'd be better, so that could be an approach.

Is there an equivalent of pdftotext for epub?

The last question is how many individual formats Zotero can reasonably have custom approaches for (and which ones).

noksagt · April 6, 2016

FWIW: other software trusts the XMP data & there are well-curated PDFs out there.
Calibre ships with ebook-convert, which could convert EPUB to a text file. There are likely others.
No idea which attachment formats we should be supporting (djvu? rtf? doc(x)?). I do think a more generalized approach, where different tools can be plugged in to either extract metadata or generate a text file from the file, would be better than what we have now.
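
A rough sketch of what such a pluggable approach might look like, in Python rather than anything Zotero actually uses: a table maps file extensions to text extractors, and supporting a new format means adding one entry. All names here (EXTRACTORS, extract_text, epub_to_text) are hypothetical. The EPUB extractor is a deliberately crude stand-in for a pdftotext equivalent: it reads the (X)HTML files in archive name order rather than the book's spine order, and assumes UTF-8.

```python
import os
import zipfile
from html.parser import HTMLParser


class _TextCollector(HTMLParser):
    """Collects visible text, skipping <script> and <style> content."""

    def __init__(self):
        super().__init__()
        self.parts = []
        self._skip_depth = 0

    def handle_starttag(self, tag, attrs):
        if tag in ("script", "style"):
            self._skip_depth += 1

    def handle_endtag(self, tag):
        if tag in ("script", "style") and self._skip_depth:
            self._skip_depth -= 1

    def handle_data(self, data):
        if not self._skip_depth and data.strip():
            self.parts.append(data.strip())


def epub_to_text(path):
    """Dump the text of every (X)HTML file in the EPUB archive."""
    parts = []
    with zipfile.ZipFile(path) as zf:
        # Assumption: name order is close enough to reading order.
        for name in sorted(zf.namelist()):
            if name.lower().endswith((".xhtml", ".html", ".htm")):
                collector = _TextCollector()
                collector.feed(zf.read(name).decode("utf-8", errors="replace"))
                collector.close()
                parts.extend(collector.parts)
    return "\n".join(parts)


# Registry of extractors; other formats would be added here.
EXTRACTORS = {".epub": epub_to_text}


def extract_text(path):
    """Pick an extractor by file extension and run it."""
    extension = os.path.splitext(path)[1].lower()
    extractor = EXTRACTORS.get(extension)
    if extractor is None:
        raise ValueError("no text extractor registered for " + repr(extension))
    return extractor(path)
```

An entry could just as well wrap an external program such as pdftotext or ebook-convert via a subprocess call; the point is only that the lookup and the per-format tool are kept separate.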