Retrieve metadata from epub

Perhaps the post title is self-explanatory, but in case it's not: what I would love to be able to do is use the "Retrieve Metadata for PDF" option for EPUB (and perhaps other types of) files. I believe this functionality is related to the pdfinfo program, so perhaps someone would have to create an "epubinfo" program that could then be integrated into Zotero for this to work. In any case, it seems like a feasible goal. EPUBs, after all, contain text just like PDF files -- we just need a program that can take a peek inside and pull out the relevant metadata in order to pull the entry from Google Books or wherever.

Any thoughts? Is it possible? Is this the right place to ask? Is the pdfinfo tool open source? If so, perhaps it could be retooled to include EPUB capabilities.

Thanks all for your consideration!
  • Broadly speaking, it's possible, yes, but not trivial.

    The way Retrieve Metadata works is that Zotero reads the full text of the PDF and first looks for an ISBN or DOI in the first couple of pages. Absent that, it picks out a phrase from somewhere in the middle of the document and searches Google Scholar for it, importing the first search result.
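
    To make the first step concrete, here is a minimal Python sketch of looking for a DOI or ISBN in text already extracted from the first couple of pages. It is an illustration only, not Zotero's actual code: the function name find_identifier, both regular expressions, and the choice to check for a DOI before an ISBN are assumptions.

```python
import re

# Assumed patterns for illustration; Zotero's real matching rules may differ.
DOI_RE = re.compile(r"\b10\.\d{4,9}/\S+")
ISBN_RE = re.compile(r"ISBN(?:-1[03])?:?\s*([0-9Xx][0-9Xx\- ]{8,16})")


def valid_isbn10(d):
    """Check the ISBN-10 checksum (weights 10..1, total divisible by 11)."""
    if len(d) != 10 or not d[:9].isdigit() or d[9] not in "0123456789X":
        return False
    total = sum((10 - i) * (10 if c == "X" else int(c)) for i, c in enumerate(d))
    return total % 11 == 0


def valid_isbn13(d):
    """Check the ISBN-13 checksum (weights 1,3,1,3..., total divisible by 10)."""
    if len(d) != 13 or not d.isdigit():
        return False
    total = sum(int(c) * (1 if i % 2 == 0 else 3) for i, c in enumerate(d))
    return total % 10 == 0


def find_identifier(first_pages_text):
    """Return ("DOI", value) or ("ISBN", value), or None if neither is found."""
    m = DOI_RE.search(first_pages_text)
    if m:
        # Drop punctuation that often trails a DOI at the end of a sentence.
        return ("DOI", m.group(0).rstrip(".,;)"))
    for m in ISBN_RE.finditer(first_pages_text):
        digits = re.sub(r"[\s-]", "", m.group(1)).upper()
        # The capture may run into following text, so try both lengths.
        if valid_isbn13(digits[:13]):
            return ("ISBN", digits[:13])
        if valid_isbn10(digits[:10]):
            return ("ISBN", digits[:10])
    return None
```

    If this returns None, the fallback described above (searching for a phrase from the middle of the document) would take over.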

    So to answer your question: no, it doesn't rely on pdfinfo, but it does rely on its companion, pdftotext. What Zotero needs is the ability to read and index the whole file. While pdftotext is open source, I don't think there's much of a chance of just extending it to other formats. There are other tools, however, for indexing other types of files, and I think it'd be cool for Zotero to support a few more (PPT, LibreOffice, and EPUB come to mind).
    This isn't trivial -- one would need to identify the right tool(s), they'd have to be open source under a compatible license, and they can't be too big; then someone has to integrate them into Zotero. I think all of this is doable, but it's not an insignificant commitment. Obviously it would be great for a third-party developer to take up; it's a project where that would work well. I can't say whether this is anywhere on the agenda of Dan & co.
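
    For EPUB specifically, the embedded metadata can be reached without a full-text indexer, because an EPUB is a ZIP archive whose META-INF/container.xml points to a package (OPF) file carrying Dublin Core fields. Below is a minimal Python sketch using only the standard library. It is not anything Zotero ships, and the function name read_epub_metadata is made up for illustration.

```python
import zipfile
import xml.etree.ElementTree as ET

CONTAINER_NS = {"c": "urn:oasis:names:tc:opendocument:xmlns:container"}
DC = "{http://purl.org/dc/elements/1.1/}"  # Dublin Core namespace, ElementTree form


def read_epub_metadata(path_or_file):
    """Return the Dublin Core fields of an EPUB as {field name: [values]}."""
    with zipfile.ZipFile(path_or_file) as zf:
        # container.xml names the package (OPF) file inside the archive.
        container = ET.fromstring(zf.read("META-INF/container.xml"))
        rootfile = container.find("c:rootfiles/c:rootfile", CONTAINER_NS)
        if rootfile is None:
            raise ValueError("container.xml lists no package file")
        package = ET.fromstring(zf.read(rootfile.get("full-path")))

    metadata = {}
    for el in package.iter():
        # Keep only non-empty elements such as dc:title or dc:identifier.
        if el.tag.startswith(DC) and el.text and el.text.strip():
            metadata.setdefault(el.tag[len(DC):], []).append(el.text.strip())
    return metadata
```

    A title, creator, or ISBN found this way could then feed the same kind of lookup used for PDFs, though how complete these fields are depends on whoever produced the file.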
  • Hi adamsmith,

    Thanks for the response. As I think you know from previous inquiries, I don't have the skills to do this myself -- but I hope that just putting it out there in the community might help the idea receive some attention. I think this functionality will only grow in importance as ebooks become more popular over time. Anyway, I understand it's a big ask. Perhaps someone could take a look at what's on GitHub in terms of open source EPUB indexers. For example, I found https://github.com/vdloo/booksbooksbooks, and there are others that you can find by searching GitHub for "epub index" or "ebook index", which might be better; I'm not really sure.

    Anyways, thanks to anyone who can help contribute to developing this feature!
  • Something like this is in the planning stages for my SafetyLit service. What I hope to do is have a series of scripts that:
    1) identify records with a DOI but no volume or pagination metadata; 2) call Zotero to re-fetch the article metadata; 3) check the new metadata against what had been downloaded previously (we keep the original metadata but, on the public-side display, fill the fields with 'ePub'); 4) test whether the new metadata differs from the original and is similar to the pattern of metadata of other articles in the same journal; and 5) if so, overwrite the ePub placeholders in the record. This will be imperfect because some publishers provide awful metadata, but in the worst case these problems can be hand-edited.
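
    As a rough illustration of those five steps, here is a Python sketch. It is not SafetyLit code: the record layout (a "doi" plus "original" and "display" dictionaries), the refetch callable that stands in for the call to Zotero, and the shape() heuristic for "similar to the pattern of other articles" are all assumptions made for the example.

```python
import re

PLACEHOLDER = "ePub"          # value shown on the public side for missing fields
FIELDS = ("volume", "pages")  # assumed field names


def needs_refetch(record):
    """Step 1: the record has a DOI but lacks volume or pagination."""
    return bool(record.get("doi")) and any(
        not record["original"].get(f) for f in FIELDS
    )


def shape(value):
    """Reduce a value to a rough pattern, e.g. '101-110' becomes '9-9'."""
    return re.sub(r"[A-Za-z]+", "a", re.sub(r"\d+", "9", value))


def update_record(record, refetch, journal_examples):
    """Run steps 2-5 on one record; return True if any field was overwritten."""
    if not needs_refetch(record):
        return False
    new = refetch(record["doi"])  # step 2: hypothetical stand-in for Zotero
    changed = False
    for f in FIELDS:
        value = new.get(f)
        # Step 3: skip fields that are still missing or unchanged.
        if not value or value == record["original"].get(f):
            continue
        # Only placeholders are overwritten; other display values are kept.
        if record["display"].get(f) != PLACEHOLDER:
            continue
        # Step 4: accept the value only if it looks like this journal's others.
        known = {shape(ex[f]) for ex in journal_examples if ex.get(f)}
        if shape(value) in known:
            record["display"][f] = value  # step 5
            changed = True
    return changed
```

    Values that fail the pattern check stay as placeholders, which leaves them for the hand-editing mentioned above.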

    Once we finalize our plans and find the funds to support the work, I'll be pleased to share the code. Someone may find it useful as a starting point for a plug-in. With little effort this could be useful as a sideline utility. My plans include using Zotero and its translators to capture the metadata from the variety of journals. Worst case, this should be ready by the end of 2016.

    On a similar topic:
    We already do this with PubMed articles, but that requires a license for special access to PubMed records.

    Separately, for articles from journals that are indexed in PubMed but whose metadata were captured from the publisher rather than from PubMed, we use the DOI and title to search PubMed so that we can capture the PMID.