issues with PDF Metadata retrieval options

seraphinatarrant · November 15, 2019

Note: I am aware of a few ongoing discussions related to this, but they do not solve my problem.
(Notably https://forums.zotero.org/discussion/comment/340128#Comment_340128 and https://forums.zotero.org/discussion/78638/unable-to-bulk-import-a-list-of-urls are the adjacent ones).

Issue:
I am using the API to manage libraries programmatically (I ported a number of libraries from competitors in order to do this, since everyone else has rubbish APIs).
However, it is a really serious problem that I cannot trigger metadata extraction programmatically. I know that before now you were rate limited, but now that you are not, surely this is possible to do? Consider this a feature request, but also a request for a workaround in the interim.
As it is right now, it is pointless for me to write objects to the API, since once I upload files I have to manually click on them to extract data. This is a legitimate API use case - to upload files received from a digest and automatically extract information about them.

As a workaround I tried using https://github.com/zotero/recognizer-server, which I would be happy to do, but I'm uncertain if it was intended for external consumption or not. If so, maybe the README could be fleshed out a bit?
I can't even get past "npm start" - I think there are some dependencies or version incompatibilities that might need to be documented (though I don't speak Node.js, so my ability to understand its error messages is limited and a product of much googling).
Note that the javascript option in the discussion won't work since this has to all exist as part of offline jobs that run to manage the libraries. That said, if it was for some reason preferable to use urls instead of pdf files, that would be fine, I could start with a URL instead.

Thanks very much! If API support does exist I'll port it to pyzotero so it can reach a wider audience. In the interim though some other solution would be really helpful.

seraphinatarrant · November 20, 2019

Ping! Just a quick response please - on how to use recognizer-server to get metadata myself, or on possibilities for adding API support to trigger extraction!

adamsmith · November 20, 2019

I assume this isn't possible (and unlikely to be at least in the server API) because there is no server-side PDF extraction. Obviously the server API can only expose functionality that the server offers.
There are good reasons not to have that: pragmatic ones, privacy, existing rate limits, e.g. on the CrossRef API (with the tool running locally, API requests come from users individually), and server load.
I think there is some chance of having more local functionality exposed to a CLI API, but I doubt this is high on the agenda. I can't speak about setting up recognizer-server. That'd be better to ask on zotero-dev, though.

seraphinatarrant · November 20, 2019

Ah, that's a Google Group! I added an issue on GitHub but was unaware there was a Google Group; I'll look there. Thanks, really appreciate it. Also very much appreciate the explanation.

dstillman · November 20, 2019

"That said, if it was for some reason preferable to use urls instead of pdf files, that would be fine, I could start with a URL instead."

Are you referring to PDF URLs specifically, or URLs of article pages? Because if the latter would be sufficient, you can use your own translation-server instance (as done by Wikipedia and others). Just to be clear, that's the standard Zotero approach to retrieving metadata, and it works on a much wider variety of sources. PDF recognition is a backup.

"As a workaround I tried using https://github.com/zotero/recognizer-server, which I would be happy to do, but I'm uncertain if it was intended for external consumption or not."

Yeah, sorry, it's not really intended for external consumption — there are various internal server-side parts that are difficult to expose, and it involves logic on both the client and the server. Sorry for the confusion.

seraphinatarrant · November 21, 2019

PDF URLs specifically, unfortunately - all of the information we're storing is in conference papers and academic journals.
Thanks for the reply, I really appreciate it, and it's much clearer now. Shame it wasn't intended for external consumption.

Since we will then have to build our own metadata extraction in order to upload to Zotero, do you have a recommendation for how to do that so it is as close as possible to what you do yourself (for academic PDFs only)?
Based on reading your blog and snooping I gather it is: 1) Query CrossRef 2) ?? 3) Fall back to actually parsing the PDF?
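For anyone following along, here is a minimal sketch of what step 1 might look like if done by hand. It is not Zotero's actual method (that was never confirmed in this thread). It assumes you already have a title guess for the PDF and uses the public Crossref REST API's `query.bibliographic` search; the function names are hypothetical.

```python
import json
from urllib.parse import urlencode
from urllib.request import urlopen

CROSSREF_WORKS = "https://api.crossref.org/works"


def crossref_query_url(title_guess: str, rows: int = 1) -> str:
    """Build a Crossref bibliographic-search URL for a guessed title."""
    params = {"query.bibliographic": title_guess, "rows": rows}
    return f"{CROSSREF_WORKS}?{urlencode(params)}"


def fetch_best_match(title_guess: str):
    """Return the top Crossref hit as a dict, or None (does a network call)."""
    with urlopen(crossref_query_url(title_guess), timeout=30) as resp:
        items = json.load(resp)["message"]["items"]
    return items[0] if items else None


# Only builds the URL; nothing is fetched here.
print(crossref_query_url("Prevalence and Risk Factors for Donkey Babesiosis"))
```

The top hit is only a candidate, so in practice you would probably want to compare its title against the guess before trusting it.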

If it's hard to explain no worries, just thought I'd ask in case you have time. Thanks again!

adamsmith · November 21, 2019

"all of the information we're storing is in conference papers and academic journals"

right, but those have landing pages that Zotero can import -- including the attached PDF, so this might still work. Do you have a sample URL?

seraphinatarrant · November 21, 2019

Sometimes we have a landing page and sometimes we just have a link to wherever the PDF is stored (it depends on whether the crawl was via a news site, an aggregator, or Google Scholar, plus some miscellaneous variance).
Here are 20 examples, of which it seems about 25% go direct to PDF and the others hit a landing page first.

https://www.ncbi.nlm.nih.gov/pmc/articles/PMC4029076/
http://193.190.239.98/bitstream/handle/10390/8017/2014jpvb0016.pdf?sequence=1
https://www.sciencedirect.com/science/article/pii/S0304401713006390
https://link.springer.com/article/10.1007/s11250-013-0442-z
https://www.researchgate.net/profile/Tadele_Tolosa/publication/208740343_Prevalence_and_Risk_Factors_for_Donkey_Babesiosis_in_and_around_Debre_Zeit_Central_Ethiopia/links/073a0d8f31df230da95c1c4c.pdf
https://www.ajol.info/index.php/sinet/article/viewFile/18293/3755
http://www.academia.edu/download/39321512/a-crosssectional-study-of-bovine-babesiosis-in-teltele-district-borena-zone-southern-ethiopia-2157-7579-10002301.pdf
https://pdfs.semanticscholar.org/70e0/ce1795bd832c33ce28c3489b16f90611681b.pdf
https://www.cambridge.org/core/journals/international-journal-of-tropical-insect-science/article/ticks-and-tickborne-parasites-associated-with-indigenous-cattle-in-didtuyura-ranch-southern-ethiopia/35DA0ACFED3E13EA7D7C2C6A69260664
https://www.researchgate.net/profile/Berhanu_Mekibib2/publication/282737045_Prevalence_of_Haemoparasites_and_Associated_Risk_Factors_in_Working_Donkeys_in_Adigudem_and_Kwiha_Districts_of_Tigray_Region_Northern_Ethiopia/links/56d02c2b08ae059e375c211d/Prevalence-of-Haemoparasites-and-Associated-Risk-Factors-in-Working-Donkeys-in-Adigudem-and-Kwiha-Districts-of-Tigray-Region-Northern-Ethiopia.pdf
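
A rough way to split a list like the one above into direct-PDF links and probable landing pages is to look at the URL path. This is only a heuristic sketch with a hypothetical helper name; it looks at nothing but the `.pdf` suffix.

```python
from urllib.parse import urlparse


def looks_like_pdf_url(url: str) -> bool:
    """Heuristic: True when the URL path ends in .pdf (query string ignored)."""
    return urlparse(url).path.lower().endswith(".pdf")


urls = [
    "https://www.ncbi.nlm.nih.gov/pmc/articles/PMC4029076/",
    "http://193.190.239.98/bitstream/handle/10390/8017/2014jpvb0016.pdf?sequence=1",
    "https://www.ajol.info/index.php/sinet/article/viewFile/18293/3755",
]
for u in urls:
    print("pdf" if looks_like_pdf_url(u) else "landing page?", u)
```

Links that serve a PDF without a `.pdf` suffix (the AJOL `viewFile` one, for instance) get misclassified, so checking the response's Content-Type header would be a more reliable test when a network request is acceptable.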

adamsmith · November 21, 2019

FWIW, a lot of these will work with Zotero's translation server or the existing public Citoid API (https://en.wikipedia.org/api/rest_v1/#/Citation/getCitation )
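
A minimal sketch of calling Citoid from a script, for reference. The path layout (`/data/citation/{format}/{query}`) and the `zotero` format name are how I understand the REST documentation linked above, so treat them as assumptions and check that page; the function names are hypothetical.

```python
import json
from urllib.parse import quote
from urllib.request import Request, urlopen

# Assumed base path; verify against the REST documentation.
CITOID_BASE = "https://en.wikipedia.org/api/rest_v1/data/citation"


def citoid_url(target: str, fmt: str = "zotero") -> str:
    """Build the Citoid request URL; the target URL is fully percent-encoded."""
    return f"{CITOID_BASE}/{fmt}/{quote(target, safe='')}"


def fetch_citation(target: str, user_agent: str):
    """Fetch citation metadata as parsed JSON (does a network call)."""
    req = Request(citoid_url(target), headers={"User-Agent": user_agent})
    with urlopen(req, timeout=30) as resp:
        return json.load(resp)


# Only builds the URL; nothing is fetched here.
print(citoid_url("https://link.springer.com/article/10.1007/s11250-013-0442-z"))
```

Passing a descriptive User-Agent with contact details is a sensible courtesy when running this in bulk.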

I had a quick look and these ones look like they won't work:
https://www.ajol.info/index.php/sinet/article/viewFile/18293/3755
--> if you can reformat these as https://www.ajol.info/index.php/sinet/article/view/18293 they will import. Note that the PDF actually doesn't have OCR'd text, so PDF extraction would definitely fail.

http://www.academia.edu/download/39321512/a-crosssectional-study-of-bovine-babesiosis-in-teltele-district-borena-zone-southern-ethiopia-2157-7579-10002301.pdf --> Looks like a broken link

https://pdfs.semanticscholar.org/70e0/ce1795bd832c33ce28c3489b16f90611681b.pdf
--> Semantic Scholar links I don't have a great idea for
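
The AJOL rewrite suggested above could be scripted roughly like this. It is a sketch that assumes every AJOL file link has the same `article/viewFile/<article id>/<file id>` shape as the one example; the helper name is hypothetical.

```python
import re

# Matches ".../article/viewFile/<article id>/<file id>" at the end of the URL.
_AJOL_VIEWFILE = re.compile(r"(/article)/viewFile/(\d+)/\d+/?$")


def ajol_landing_url(url: str) -> str:
    """Turn an AJOL viewFile link into its article landing page.

    URLs that do not match the pattern are returned unchanged.
    """
    return _AJOL_VIEWFILE.sub(r"\1/view/\2", url)


print(ajol_landing_url("https://www.ajol.info/index.php/sinet/article/viewFile/18293/3755"))
```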

seraphinatarrant · November 21, 2019

Thank you so much, that's amazingly helpful! I just tried it with a larger sample and it seems to work quite well.

If I work out what to do about Semantic Scholar I'll post here in case anyone is ever interested in future - their API is friendly but unfortunately very lightweight, and I can't query it without their ID, a DOI, or an arXiv ID.
And even with other semantic scholar pages, the Citoid API doesn't seem to like them.

That said, everything on semantic scholar should in theory also exist somewhere else, so perhaps I can get away with just taking everything from semantic scholar and trying to find it elsewhere.

Thanks again, I definitely have a clear path forward now!

seraphinatarrant · December 3, 2019

One further request for advice:
Just as an FYI, based on the Citoid API, I have a 50% success rate at retrieving data (over a very large sample, though this seems pretty consistent). Only 12% of failures (so 6% of total) cannot be retrieved by Zotero when I trigger a manual metadata retrieval on the PDFs.
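
To spell out the arithmetic behind those numbers (variable names are just for illustration): if half the items fail in Citoid and 12% of those failures also fail manual retrieval, the rest of the failures, 44% of the total, would be recoverable if PDF retrieval could be triggered.

```python
from fractions import Fraction

citoid_success = Fraction(50, 100)          # share of items Citoid resolves
citoid_failure = 1 - citoid_success         # share of items Citoid misses
unrecoverable_of_failures = Fraction(12, 100)  # failures Zotero also cannot retrieve

# Share of ALL items that nothing retrieves.
unrecoverable_total = citoid_failure * unrecoverable_of_failures
# Share of ALL items that fail Citoid but succeed via manual PDF retrieval.
recoverable_by_pdf = citoid_failure - unrecoverable_total

print(f"unrecoverable: {float(unrecoverable_total):.0%}")   # 6%
print(f"recoverable via PDF: {float(recoverable_by_pdf):.0%}")  # 44%
```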

Can you instruct me on how best to implement your translation server with the URLs of PDFs?
For instance, your README (https://github.com/zotero/translation-server) includes the ability to query either a webpage or search a DOI/arXiv ID/etc. via http://127.0.0.1:1969/search or http://127.0.0.1:1969/web.
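
For reference, this is roughly how I'm calling those two endpoints from Python. It is a sketch that assumes a local translation-server instance on the default port and that both endpoints take the URL or identifier as a plain-text POST body, which is how I read the README's curl examples; the helper names are mine.

```python
import json
from urllib.request import Request, urlopen

TRANSLATION_SERVER = "http://127.0.0.1:1969"  # local instance, default port


def build_request(endpoint: str, payload: str) -> Request:
    """Build a POST for /web (page URL) or /search (DOI, arXiv ID, etc.)."""
    if endpoint not in ("web", "search"):
        raise ValueError(f"unsupported endpoint: {endpoint!r}")
    return Request(
        f"{TRANSLATION_SERVER}/{endpoint}",
        data=payload.encode("utf-8"),
        headers={"Content-Type": "text/plain"},
        method="POST",
    )


def translate(endpoint: str, payload: str):
    """Send the request and return the parsed JSON (needs a running server)."""
    with urlopen(build_request(endpoint, payload), timeout=60) as resp:
        return json.load(resp)


# Only builds the requests; nothing is sent here.
web_req = build_request("web", "https://link.springer.com/article/10.1007/s11250-013-0442-z")
search_req = build_request("search", "10.1007/s11250-013-0442-z")
print(web_req.get_method(), web_req.full_url)
print(search_req.get_method(), search_req.full_url)
```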

But neither of these accepts a PDF URL. Is there an enumeration of the API endpoint options available to me? It isn't super clear to me from searching through the code (and I'm not entirely sure what to search for, since grepping for something like PDF is too broad).

For some URLs I can just truncate to the path before the PDF, but this fails 3/4 of the time.

Thanks!