Issues with PDF metadata retrieval options
Note: I am aware of a few ongoing discussions related to this, but they do not solve my problem.
(Notably https://forums.zotero.org/discussion/comment/340128#Comment_340128 and https://forums.zotero.org/discussion/78638/unable-to-bulk-import-a-list-of-urls are the adjacent ones).
Issue:
I am using the API to manage libraries programmatically (I ported a number of libraries from competitors in order to do this, since everyone else has rubbish APIs).
However, it is a really serious problem that I cannot trigger metadata extraction programmatically. I know that before now you were rate limited, but now that you are not, surely this is possible to do? Consider this a feature request, but also a request for a workaround in the interim.
As it is right now, it is pointless for me to write objects to the API, since once I upload files I have to manually click on them to extract data. This is a legitimate API use case - to upload files received from a digest and automatically extract information about them.
As a workaround I tried using https://github.com/zotero/recognizer-server, which I would be happy to do, but I'm uncertain if it was intended for external consumption or not. If so, maybe the README could be fleshed out a bit?
I can't even get past "npm start". I think there are some dependencies or version incompatibilities that might need to be documented (though I don't speak Node.js, so my ability to understand its error messages is limited and a product of much googling).
Note that the JavaScript option in the discussion won't work, since this all has to exist as part of offline jobs that run to manage the libraries. That said, if it were for some reason preferable to use URLs instead of PDF files, that would be fine; I could start with a URL instead.
Thanks very much! If API support does exist I'll port it to pyzotero so it can reach a wider audience. In the interim though some other solution would be really helpful.
There are good reasons not to have that: pragmatic ones, privacy-related ones, ones related to existing rate limits, e.g. on the CrossRef API (with the tool running locally, API requests come from users individually), and server load.
I think there is some chance of having more local functionality exposed to a CLI API, but I doubt this is high on the agenda. I can't speak about setting up recognizer-server. That'd be better to ask on zotero-dev, though.
Thanks for the reply, I really appreciate it, and it's much clearer now. Shame it wasn't intended for external consumption.
Since we will then have to build our own metadata extraction in order to upload to Zotero, do you have a recommendation for how to do that so it is as close as possible to what you do yourselves (for academic PDFs only)?
Based on reading your blog and snooping I gather it is: 1) Query CrossRef 2) ?? 3) Fall back to actually parsing the PDF?
If it's hard to explain no worries, just thought I'd ask in case you have time. Thanks again!
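For anyone following along, here is a minimal sketch of the first guessed step (querying CrossRef), assuming a DOI can be spotted in the URL or in text pulled from the PDF. It is not a description of what Zotero does internally. The names `find_doi` and `crossref_works_url` and the DOI regex are illustrative assumptions; the only external piece relied on is the public CrossRef REST endpoint `https://api.crossref.org/works/<doi>`.

```python
import re
from typing import Optional
from urllib.parse import quote, unquote

# Assumption: a common heuristic DOI pattern, not an official grammar.
# It stops at whitespace, quotes, angle brackets, "?" and "#", so the
# rare DOIs that contain those characters will be truncated.
DOI_PATTERN = re.compile(r"\b10\.\d{4,9}/[^\s\"<>?#]+")


def find_doi(text: str) -> Optional[str]:
    """Return the first DOI-looking substring in a URL or text, or None."""
    match = DOI_PATTERN.search(unquote(text))
    if match is None:
        return None
    # Trailing punctuation usually belongs to the surrounding sentence.
    return match.group(0).rstrip(".,;)")


def crossref_works_url(doi: str) -> str:
    """Build the CrossRef REST API URL for a single work."""
    return "https://api.crossref.org/works/" + quote(doi, safe="/")


if __name__ == "__main__":
    url = "https://link.springer.com/article/10.1007/s11250-013-0442-z"
    doi = find_doi(url)
    if doi is not None:
        print(crossref_works_url(doi))
```

URLs with no embedded DOI (the ScienceDirect and PMC links, for example) return None here, so they would need a different route, such as the landing page or the PDF text.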
Here are 20 examples, of which it seems about 25% go direct to PDF and the others hit a landing page first.
https://www.ncbi.nlm.nih.gov/pmc/articles/PMC4029076/
http://193.190.239.98/bitstream/handle/10390/8017/2014jpvb0016.pdf?sequence=1
https://www.sciencedirect.com/science/article/pii/S0304401713006390
https://link.springer.com/article/10.1007/s11250-013-0442-z
https://www.researchgate.net/profile/Tadele_Tolosa/publication/208740343_Prevalence_and_Risk_Factors_for_Donkey_Babesiosis_in_and_around_Debre_Zeit_Central_Ethiopia/links/073a0d8f31df230da95c1c4c.pdf
https://www.ajol.info/index.php/sinet/article/viewFile/18293/3755
http://www.academia.edu/download/39321512/a-crosssectional-study-of-bovine-babesiosis-in-teltele-district-borena-zone-southern-ethiopia-2157-7579-10002301.pdf
https://pdfs.semanticscholar.org/70e0/ce1795bd832c33ce28c3489b16f90611681b.pdf
https://www.cambridge.org/core/journals/international-journal-of-tropical-insect-science/article/ticks-and-tickborne-parasites-associated-with-indigenous-cattle-in-didtuyura-ranch-southern-ethiopia/35DA0ACFED3E13EA7D7C2C6A69260664
https://www.researchgate.net/profile/Berhanu_Mekibib2/publication/282737045_Prevalence_of_Haemoparasites_and_Associated_Risk_Factors_in_Working_Donkeys_in_Adigudem_and_Kwiha_Districts_of_Tigray_Region_Northern_Ethiopia/links/56d02c2b08ae059e375c211d/Prevalence-of-Haemoparasites-and-Associated-Risk-Factors-in-Working-Donkeys-in-Adigudem-and-Kwiha-Districts-of-Tigray-Region-Northern-Ethiopia.pdf
I had a quick look and these ones look like they won't work:
https://www.ajol.info/index.php/sinet/article/viewFile/18293/3755
--> if you can reformat these as https://www.ajol.info/index.php/sinet/article/view/18293 they will import. Note that the PDF actually doesn't have OCR'd text, so PDF extraction would definitely fail.
http://www.academia.edu/download/39321512/a-crosssectional-study-of-bovine-babesiosis-in-teltele-district-borena-zone-southern-ethiopia-2157-7579-10002301.pdf --> Looks like a broken link
https://pdfs.semanticscholar.org/70e0/ce1795bd832c33ce28c3489b16f90611681b.pdf
--> Semantic Scholar links I don't have a great idea for
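The AJOL rewrite suggested above can be sketched as a small helper. This assumes every AJOL PDF link has the shape `.../article/viewFile/<article id>/<file id>`, which is inferred from the single example in this thread; `ajol_landing_url` is a hypothetical name.

```python
import re
from urllib.parse import urlsplit, urlunsplit

# Assumption: path shape inferred from one example URL in this thread.
_AJOL_VIEWFILE = re.compile(
    r"^(?P<prefix>.*/article)/viewFile/(?P<article_id>\d+)/\d+/?$"
)


def ajol_landing_url(url: str) -> str:
    """Map an AJOL viewFile (PDF) link to its article landing page.

    URLs from other hosts, or AJOL URLs with a different shape,
    are returned unchanged.
    """
    parts = urlsplit(url)
    if not parts.netloc.endswith("ajol.info"):
        return url
    match = _AJOL_VIEWFILE.match(parts.path)
    if match is None:
        return url
    new_path = f"{match.group('prefix')}/view/{match.group('article_id')}"
    # Drop any query string or fragment; the landing page does not need them.
    return urlunsplit((parts.scheme, parts.netloc, new_path, "", ""))


if __name__ == "__main__":
    print(ajol_landing_url(
        "https://www.ajol.info/index.php/sinet/article/viewFile/18293/3755"
    ))
```

Returning unmatched URLs unchanged means the helper can be run over a mixed list of links without special-casing the non-AJOL ones.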
If I work out what to do about Semantic Scholar I'll post here in case anyone is ever interested in future. Their API is friendly but unfortunately very lightweight, and I can't query it without their ID, a DOI, or an arXiv ID.
And even with other Semantic Scholar pages, the Citoid API doesn't seem to like them.
That said, everything on Semantic Scholar should in theory also exist somewhere else, so perhaps I can get away with just taking everything from Semantic Scholar and trying to find it elsewhere.
Thanks again, I definitely have a clear path forward now!
Just as an FYI, based on the Citoid API, I have a 50% success rate at retrieving data (over a very large sample, though this seems pretty consistent). Only 12% of the failures (so 6% of the total) cannot be retrieved by Zotero when I manually trigger metadata retrieval on the PDFs.
Can you instruct me on how best to implement your translation server with the URLs of PDFs?
For instance, your README (https://github.com/zotero/translation-server) includes the ability to query either a webpage or search a DOI/arXiv ID/etc. via http://127.0.0.1:1969/search or http://127.0.0.1:1969/web.
But neither of these accepts a PDF URL. Is there an enumeration of the API endpoint options available to me? It isn't super clear to me from searching through the code (and I'm not entirely sure what to search for, since grepping for something like PDF is too broad).
For some URLs I can just truncate to the path before the PDF, but this fails 3/4 of the time.
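For context, this is roughly how I am calling the two documented endpoints: a minimal sketch assuming a translation-server instance on the default local port, with the URL or identifier sent as a plain-text POST body as in the README's curl examples. `build_request` and `fetch_items` are my own hypothetical helper names, and treating the response as a JSON list of items is an assumption to check against the server's actual output.

```python
import json
import urllib.request

# Assumption: translation-server is running locally on its default port.
BASE_URL = "http://127.0.0.1:1969"


def build_request(endpoint: str, payload: str,
                  base_url: str = BASE_URL) -> urllib.request.Request:
    """Build a POST request for the /web or /search endpoint.

    /web takes a page URL; /search takes an identifier such as a DOI.
    """
    if endpoint not in ("web", "search"):
        raise ValueError(f"unsupported endpoint: {endpoint!r}")
    return urllib.request.Request(
        f"{base_url}/{endpoint}",
        data=payload.encode("utf-8"),
        headers={"Content-Type": "text/plain"},
        method="POST",
    )


def fetch_items(endpoint: str, payload: str, timeout: float = 30.0):
    """Send the request and decode the JSON body (needs a running server)."""
    request = build_request(endpoint, payload)
    with urllib.request.urlopen(request, timeout=timeout) as response:
        return json.loads(response.read().decode("utf-8"))


if __name__ == "__main__":
    # Only builds the request; nothing is sent until fetch_items is called.
    req = build_request(
        "web", "https://link.springer.com/article/10.1007/s11250-013-0442-z"
    )
    print(req.get_method(), req.full_url)
```

Passing a direct PDF URL to `/web` this way is exactly the case that fails for me, hence the question.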
Thanks!