Problem in retrieving metadata
I found what I believe to be a bug in Zotero. I downloaded a file from a database, added it to Zotero and tried to retrieve metadata, but I get the error "PDF does not contain OCRd text". However, the file contains a DOI on the first page, and it can be read with the pdftotext tool that Zotero installs when I launch the tool from the command line:
mronkko$ ~/Documents/Research/Zotero/pdftotext-MacIntel 173.full.pdf /dev/stdout
Error: No paper information available - using defaults
Applied Psychological Measurement http://apm.sagepub.com/
Estimation of Composite Reliability for Congeneric Measures
Tenko Raykov Applied Psychological Measurement 1997 21: 173 DOI: 10.1177/01466216970212006 The online version of this article can be found at: http://apm.sagepub.com/content/21/2/173
After the first page, the PDF contains no text, only images.
I uploaded the paper to my dropbox public folder in case the developers want to take a look:
http://dl.dropbox.com/u/694399/173.full.pdf
When Zotero does not find an OCR'd document, the system does respond with an "X" and moves on to index the next file. However, for most OCR'd files, the system hangs and keeps searching. I've left it this way for over an hour to see if it would eventually be able to index the file, but to no avail. I used to be able to highlight many documents, select "retrieve metadata" and it would find some and not find others, but now the system doesn't move to the next item and I'm having to index the files manually.
There is no pattern in whether the files are from a particular source. I've also tried it via the standalone Zotero and via Firefox with the same result. I am running Zotero 3.0b.3.1.
Thanks.
Paras.
You should index (i.e. have Zotero read and store part of) your files if you're going to use the retrieve metadata feature.
If it's possible to provide a sample document that's not working that would help - it's rather hard to tell what's going on without anything more specific. Also, have you made sure that you're not locked out of google scholar?
https://docs.google.com/open?id=0Bwd4c-BLEe52MDhmOWVkNWUtODc4Mi00YjhmLTgxOWItNmRlZDFiMzc3NTc4
https://docs.google.com/open?id=0Bwd4c-BLEe52ZGQyYWY5YWMtOTQzMy00MTgwLWIzNDktNTc2YzU3ODU5OGMx
I don't know if I'm being locked out of Google Scholar, but I don't think so. I'm able to retrieve metadata via Google Scholar for some files but not others. The same files are problematic regardless of the number of times I try, yet others work fine.
Thanks for your help!
Paras.
But then, just like for you, the process doesn't fail nicely but instead just keeps going - that's not right and someone (Simon?) should take a look.
The five simple suggestions I contributed two years ago on the basis of several users' reports would go a long way towards solving some of the basic problems with the pdf retrieve metadata UI.
best regards,
Paras.
http://dl.dropbox.com/u/38804134/science-4.pdf
http://dl.dropbox.com/u/38804134/science-10.pdf
http://dl.dropbox.com/u/38804134/science-62.pdf
> Also, many files don't have a DOI and do have a PII... if the system does not find and index a DOI, can it be made to attempt to index via the PII? Here are a few examples...
http://en.wikipedia.org/wiki/Publisher_Item_Identifier
http://dl.dropbox.com/u/38804134/science-5.pdf
http://dl.dropbox.com/u/38804134/science-41.pdf
Thank you.
Paras.
- I don't know if PII would help - it's harder to identify than DOIs (which all start with 10. and so are easy for Zotero to spot) and I don't know of a central database like CrossRef that we could query
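To illustrate why DOIs are easy to spot, here is a minimal Python sketch of pulling DOI-like strings out of pdftotext output. The pattern is an assumption made for illustration ("10.", a 4-9 digit registrant code, a slash, then everything up to the next whitespace); it is not what recognizePDF.js actually does, and `find_dois` is a hypothetical helper name.

```python
import re

# Assumed pattern: "10.", a 4-9 digit registrant code, "/", then a suffix
# that runs until whitespace or a quote/angle bracket. Real DOIs can be messier.
DOI_PATTERN = re.compile(r'\b10\.\d{4,9}/[^\s"<>]+')


def find_dois(text):
    """Return DOI-like strings found in text, in order, without duplicates."""
    found = []
    for match in DOI_PATTERN.finditer(text):
        # Drop punctuation that usually belongs to the sentence, not the DOI.
        doi = match.group(0).rstrip(".,;:)")
        if doi not in found:
            found.append(doi)
    return found


if __name__ == "__main__":
    # Lines taken from the pdftotext output posted in this thread.
    sample = (
        "Vol. 181, 861-866, February 2009 Printed in U.S.A. "
        "DOI:10.1016/j.juro.2008.10.066"
    )
    print(find_dois(sample))
```

A check this simple only helps when pdftotext emits the DOI as one unbroken string; a DOI split across lines would be missed.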
Can the search string be put into PubMed, for example? This is what I do and most often find a DOI, PII, or PMID within the results. Can that be captured and plugged into the engine as an alternate way to index an item?
Thanks,
Paras.
It's less trivial than it sounds, depending on where the DOI is given - usually when it's somewhere within the actual text, Zotero does well, but in the header or footer of a page pdftotext may not catch it.
Also, any thoughts on the PII and PubMed questions above?
Thank you.
Paras.
t3080-la0003:Downloads mronkko$ ~/Documents/Research/Zotero/pdftotext-MacIntel science-10.pdf /dev/stdout | grep -i doi
Error: No paper information available - using defaults
doi:10.1016/j.ultrasmedbio.2008.05.006
t3080-la0003:Downloads mronkko$ ~/Documents/Research/Zotero/pdftotext-MacIntel science-4.pdf /dev/stdout | grep -i doi
Error: No paper information available - using defaults
Vol. 181, 861-866, February 2009 Printed in U.S.A. DOI:10.1016/j.juro.2008.10.066
t3080-la0003:Downloads mronkko$ ~/Documents/Research/Zotero/pdftotext-MacIntel science-62.pdf /dev/stdout | grep -i doi
Error: No paper information available - using defaults
0041-624X/$ - see front matter ? 2006 Elsevier B.V. All rights reserved. doi:10.1016/j.ultras.2006.06.036
I think that PII is a good feature request.
The PMID could be doable - for PII we still have the question of where to look them up effectively, for PMID we already have a lookup in place.
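As a rough idea of what a PMID route could look like, here is a Python sketch that looks for a "PMID: nnn" label in extracted text and builds a lookup URL for the NCBI E-utilities `esummary` endpoint. The label format, the 1-8 digit limit and both function names are assumptions for illustration; this is not the lookup Zotero already has in place.

```python
import re
from urllib.parse import urlencode

# Assumption: the PDF text carries a label such as "PMID: 12345678".
PMID_PATTERN = re.compile(r'\bPMID:?\s*(\d{1,8})\b', re.IGNORECASE)

# NCBI E-utilities summary endpoint.
EUTILS_SUMMARY = "https://eutils.ncbi.nlm.nih.gov/entrez/eutils/esummary.fcgi"


def find_pmid(text):
    """Return the first PMID-like number in text, or None if there is none."""
    match = PMID_PATTERN.search(text)
    return match.group(1) if match else None


def pubmed_summary_url(pmid):
    """Build an esummary URL that asks PubMed for one record as JSON."""
    query = urlencode({"db": "pubmed", "id": pmid, "retmode": "json"})
    return EUTILS_SUMMARY + "?" + query


if __name__ == "__main__":
    # The number below is made up for the example, not a real citation.
    pmid = find_pmid("Some footer text. PMID: 12345678")
    if pmid is not None:
        print(pubmed_summary_url(pmid))
```

A PII has no equivalent step here, since there is still no known central place to send it.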
Generally I think Zotero devs don't prioritize improving the retrieve metadata function - ideally, people should get their data - including pdfs - into Zotero from publisher websites or the like. There is, of course, always the option to submit patches.
recognizePDF.js is a single javascript file, so it's relatively easy to work with.
Adam, the retrieve metadata is very useful... I'd say essential, considering the time it saves looking up each DOI. I get articles through my university's system, which is connected to PubMed. When I do a PubMed search, all results from journals that Cornell subscribes to come up, regardless of publisher; so we don't / can't even log-in to the publisher's website.
Thanks,
Paras.
http://musingsaboutlibrarianship.blogspot.com/2010/07/extracting-metadata-from-pdfs-comparing.html
I'm still not sure I follow why you can't import the data from pubmed - which you apparently search? - and then attach the pdfs.
I'm still not sure I follow your comment of "...why you can't import the data from pubmed" ....perhaps I'm misunderstanding, so let me clarify a bit....
The university system has a PubMed search window on the intranet. It uses PubMed to find the article and then downloads from the publisher through some sort of access gateway through the university's system. There is no way to be able to login directly via a publisher's site. After I download the file, I use the "Store a Copy of File" feature to bring it into Zotero; then right-click and retrieve metadata.... this is the process I've been using, not sure if I'm doing it incorrectly.
Via this process, I hope it's clear how essential retrieving metadata is. I don't see any other way of getting automated citation info.
Thanks,
Paras.
And that's not on a browser? You couldn't import the search results into Zotero?
also, the doi of some of the articles would help so we can have a look.
I'm using the standalone Zotero.
Here are some files for which Zotero couldn't retrieve metadata...
http://dl.dropbox.com/u/38804134/Blom.pdf
http://dl.dropbox.com/u/38804134/end.2010.0131.pdf
http://dl.dropbox.com/u/38804134/science%20%2851%29.pdf
http://dl.dropbox.com/u/38804134/science%20%2863%29.pdf
http://dl.dropbox.com/u/38804134/science-46.pdf
Thanks,
Paras.
You could also consider using pubmed IDs - that will be a little slower, but the data quality will be much higher than anything you'll ever get using Retrieve Metadata.
I'm looking at the pdfs.
All of these except the Acta Biomaterialia paper work for me. Four out of five is much more in line with the usual success rate.
However, they all get metadata from Google Scholar; none of them catches the DOI. That's probably worth taking a look at.
My initial suspicion would be that you used retrieve metadata on a lot of files at once and got locked out of google scholar (because they thought you were a bot). I don't think Zotero tells you when that's the case, although that might be desirable.
ok... what can I do about it?
There is not much you can do about that when it happens, except take a break and continue the next day.
Because GS does this, it would really be helpful if Zotero could do a better job with DOIs from pdf files (I don't think CrossRef locks anyone out). Also, of course, data from DOIs is much better.
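For anyone who wants to experiment, a DOI found in a PDF can be looked up directly against the public CrossRef REST API. The Python sketch below only shows the idea; `crossref_url` and `fetch_title` are hypothetical helper names, and none of this reflects how Zotero itself queries CrossRef.

```python
import json
from urllib.parse import quote
from urllib.request import urlopen

# CrossRef REST API endpoint for a single work.
CROSSREF_WORKS = "https://api.crossref.org/works/"


def crossref_url(doi):
    """Build the CrossRef REST API URL for one DOI."""
    # Keep the slash between prefix and suffix, escape everything unsafe.
    return CROSSREF_WORKS + quote(doi, safe="/")


def fetch_title(doi, timeout=10):
    """Fetch the record and return its first title, or None.

    Needs network access; errors such as an unknown DOI are left to the caller.
    """
    with urlopen(crossref_url(doi), timeout=timeout) as response:
        record = json.load(response)
    titles = record["message"].get("title", [])
    return titles[0] if titles else None


if __name__ == "__main__":
    # DOI taken from the pdftotext output posted earlier in this thread.
    print(crossref_url("10.1016/j.juro.2008.10.066"))
```

Calling `fetch_title` with the same DOI would perform the actual request.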
Right now the core devs are likely busy fixing up 3.0 for final release, but maybe after that's done (end of this month) and potential initial bumps are removed, Simon could hopefully take a look at this.
Also, I think I see what you are saying about the connectors... I installed the Zotero Connectors in Safari & Chrome. I was able to search a title through the PubMed window on the intranet, and get to the article on the publisher page. From here, I was able to click on the Add to Zotero button and it did indeed add the citation to the library, but not the article.
When I double click the citation (without the pdf) the university's system blocks me from accessing the site in order to get the pdf.
The only way to do it was to download the article separately and go thru the "Store a Copy of File" routine. After getting it into Zotero, then it looks like I have to match the file to the citation, which is pretty cumbersome when you have 20+ files.
Please let me know if I'm missing anything.. I think the university's firewall is going to be problematic.
Any way to draw some attention to the Retrieve Metadata feature? I think it seems to be the best option considering the hoops I'm running thru here.
Thanks,
Paras.
Devs read all threads, no need to draw additional attention.
As for the google scholar issue - could be something else - someone would need to read through your debug output. http://www.zotero.org/support/debug_output . Unfortunately, I can't do it (I don't work for Zotero and obviously don't have access to the debug logs) so you'll have to wait until a dev takes some time to troubleshoot this.
I don't have access to debug lo
Paras.