Problem in retrieving metadata

mronkko · December 28, 2011

I found what I believe to be a bug in Zotero. I downloaded a file from a database, added it to Zotero, and tried to retrieve metadata, but I get the error "PDF does not contain OCRd text". However, the file contains a DOI on the first page, and this can be read with the pdftotext tool that is installed by Zotero by launching the tool from the command line:

mronkko$ ~/Documents/Research/Zotero/pdftotext-MacIntel 173.full.pdf /dev/stdout
Error: No paper information available - using defaults
Applied Psychological Measurement http://apm.sagepub.com/

Estimation of Composite Reliability for Congeneric Measures
Tenko Raykov Applied Psychological Measurement 1997 21: 173 DOI: 10.1177/01466216970212006 The online version of this article can be found at: http://apm.sagepub.com/content/21/2/173

After the first page, the PDF contains no text, only images.

I uploaded the paper to my dropbox public folder in case the developers want to take a look:
http://dl.dropbox.com/u/694399/173.full.pdf

dstillman · December 28, 2011

It looks like the recognize code checks to see if there are at least 20 lines of text in the first three pages. If there aren't, it throws that error. This makes sense for Google Scholar lookups, but maybe not for DOI lookups, though this would only be a problem for image-based PDFs. I'll let Simon comment further.
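
A rough sketch of the check described above, reordered so that a DOI found in the extracted text is tried before the line-count test. This is not Zotero's actual recognizePDF.js code: the function name, the return values, and the DOI pattern are assumptions made for illustration, and only the 20-line threshold and the error message come from this thread.

```python
import re

# Threshold taken from the description above; everything else is hypothetical.
MIN_TEXT_LINES = 20

# Loose DOI pattern (assumption): "10.", a registrant code, "/", then a suffix.
DOI_PATTERN = re.compile(r"10\.\d{4,9}/\S+")


def choose_lookup(first_pages_text):
    """Decide how to look up a PDF from the text of its first pages.

    Returns ("doi", value), ("scholar", None), or ("error", message).
    """
    match = DOI_PATTERN.search(first_pages_text)
    if match:
        # A DOI is enough on its own, even if the rest of the PDF is images.
        return ("doi", match.group(0))

    # Count only non-empty lines of extracted text.
    lines = [line for line in first_pages_text.splitlines() if line.strip()]
    if len(lines) < MIN_TEXT_LINES:
        # Too little text to build a useful search query.
        return ("error", "PDF does not contain OCRd text")

    return ("scholar", None)
```

With this ordering, an image-based PDF whose first page carries a DOI would go to a DOI lookup instead of failing the text-length test.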

parasbuy · December 29, 2011

Hello.. I'm having a great deal of trouble using the retrieve metadata feature. Some pdf articles are able to be found, but most are not, approximately 70% of the time. The pdf files that are not indexed are not images and do have OCR'd text, as I am able to open the file, copy/paste the DOI into the magic wand "Add item by identifier," and Zotero is able to index the file perfectly... it just does not do this via the "retrieve metadata" selection for most files, however.

When Zotero does not find an OCR'd document, the system does respond with an "X" and moves to index the next file. However, for most OCR'd files, the system will continue to hang and continue to search. I've left it this way for over an hour to see if it would eventually be able to index it, but to no avail. As such, I used to be able to highlight many documents, select "retrieve metadata" and it would find some and not find others, but now the system doesn't move to the next item and I'm having to index the files manually.

There is no pattern in whether the files are from a particular source. I've also tried it via the standalone Zotero and via Firefox with the same result. I am running Zotero 3.0b.3.1.

Thanks.
Paras.

adamsmith · December 29, 2011

when you say "index" it's not clear to me what you mean.
You should index (i.e. have Zotero read and store part of) your files if you're going to use the retrieve metadata feature.

If it's possible to provide a sample document that's not working that would help - it's rather hard to tell what's going on without anything more specific. Also, have you made sure that you're not locked out of google scholar?

parasbuy · December 29, 2011

Hi Adam... sorry, I'll try to be more clear. Here are two files that have been problematic. As you can see, they are pdfs with DOI and not image files. To have Zotero retrieve metadata, I added them into Zotero, right-clicked, and selected "Retrieve Metadata for PDF" ...this was all I meant as far as the term "index."

https://docs.google.com/open?id=0Bwd4c-BLEe52MDhmOWVkNWUtODc4Mi00YjhmLTgxOWItNmRlZDFiMzc3NTc4

https://docs.google.com/open?id=0Bwd4c-BLEe52ZGQyYWY5YWMtOTQzMy00MTgwLWIzNDktNTc2YzU3ODU5OGMx

I don't know if I'm being locked out of Google Scholar, but I don't think so. I'm able to retrieve metadata via Google Scholar for some files but not others. The same files are problematic regardless of the number of times I try, yet others work fine.

Thanks for your help!
Paras.

adamsmith · December 29, 2011

OK, I had a look at the first one - Zotero doesn't pick up the doi, it then checks google scholar for some search terms and doesn't get any results (because it picks bad search terms) - so far that's too bad, but expected to happen from time to time.
But then, just like for you, the process doesn't fail nicely but instead just keeps going - that's not right and someone (Simon?) should take a look

dstillman · December 29, 2011

Yeah, the second one throws an error:

Error: this._handlers[type][i] is undefined
Source file: chrome://zotero/content/xpcom/translation/translate.js
Line: 855

Simon will have to take a look.

Simon · December 30, 2011

I'll take a look, but my first thought is that tightening the detect code for Google Scholar may have broken something.

Simon · December 31, 2011

That was indeed the case. There was another issue causing that error to be thrown, which wasn't causing the hangs, but which I've fixed anyway. I've reverted the changes to the Google Scholar translator and also fixed recognizePDF.js so that this won't be a problem in the future.

mark · January 1, 2012

Glad to see the interaction with Google Scholar is being improved. I note, though, that parasbuy's comments make this once again a thread (along with another recent example) in which problems surface that have to do not just with the code but also with the UI of the retrieve metadata feature.

The five simple suggestions I contributed two years ago on the basis of several users' reports would go a long way towards solving some of the basic problems with the pdf retrieve metadata UI.

parasbuy · January 6, 2012

I've noticed a significant improvement in this... thanks a lot for your attention to the issue! It helps a lot!

best regards,
Paras.

parasbuy · January 6, 2012

btw, I do still notice that a number of PDFs with OCR'd text and with DOIs do not get retrieved. Here are some examples....

http://dl.dropbox.com/u/38804134/science-4.pdf

http://dl.dropbox.com/u/38804134/science-10.pdf

http://dl.dropbox.com/u/38804134/science-62.pdf

Also, many files don't have a DOI and do have a PII... if the system does not find and index a DOI, can it be made to attempt to index via the PII? Here are a few examples...

http://en.wikipedia.org/wiki/Publisher_Item_Identifier

http://dl.dropbox.com/u/38804134/science-5.pdf

http://dl.dropbox.com/u/38804134/science-41.pdf

Thank you.

Paras.

adamsmith · January 6, 2012

where Zotero can't find a doi it uses google scholar using a search string taken from the document
- I don't know if PII would help - it's harder to identify than DOIs (which all start with 10. and so are easy for Zotero to spot) and I don't know of a central database like CrossRef that we could query
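
To illustrate why DOIs are comparatively easy to spot, here is a minimal sketch of pulling DOI-like strings out of extracted text. The regular expression is an assumption about what a typical DOI looks like (it is not taken from Zotero's code, and it will miss unusual DOIs), and the function name is hypothetical.

```python
import re

# Assumed pattern: "10.", a 4-9 digit registrant code, "/", then non-space characters.
DOI_RE = re.compile(r"\b10\.\d{4,9}/[^\s\"<>]+")


def find_dois(text):
    """Return DOI-like strings in order of appearance, without duplicates."""
    found = []
    for raw in DOI_RE.findall(text):
        # Drop punctuation that trails the DOI in running text.
        doi = raw.rstrip(".,;)")
        if doi not in found:
            found.append(doi)
    return found
```

Run against the first-page text quoted at the top of this thread, this should pick out 10.1177/01466216970212006.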

parasbuy · January 6, 2012

It seems like all PIIs start with "PII:"... can that be used? Same thing with PMID: can it check PMID if it doesn't find a DOI via Google Scholar?

Can the search string be put into PubMed, for example? This is what I do and most often find a DOI, PII, or PMID within the results. Can that be captured and plugged into the engine as an alternate way to index an item?

Thanks,
Paras.

parasbuy · January 6, 2012

Adam, on the files above where the DOI's are not being retrieved by Zotero even though they are OCR'd text... if I manually copy and paste the DOI from the file and input them into the "Add Item by Identifier" within Zotero, the correct bibliographic information is readily indexed. I am assuming Zotero is essentially doing the same thing through its metadata retrieval system, so why shouldn't Zotero be able to find and index the item itself?

Thanks,
Paras.

adamsmith · January 6, 2012

because it doesn't find the DOI in the document.
It's less trivial than it sounds, depending on where the DOI is given - usually when it's somewhere within the actual text, Zotero does well, but in the header or footer of a page pdftotext may not catch it.

parasbuy · January 6, 2012

OK, I understand... can this be added as a feature request? The reason is that the vast majority of DOI's on files I use are in the footer or header. I can copy/resubmit via the appropriate forum.

Also, any thoughts on the PII and PubMed questions above?

Thank you.
Paras.

mronkko · January 6, 2012

The version of pdftotext that is included in Zotero 3 can find the DOIs on the example items. So the problem is not there but in Zotero code if these items are not recognized.

t3080-la0003:Downloads mronkko$ ~/Documents/Research/Zotero/pdftotext-MacIntel science-10.pdf /dev/stdout | grep -i doi
Error: No paper information available - using defaults
doi:10.1016/j.ultrasmedbio.2008.05.006
t3080-la0003:Downloads mronkko$ ~/Documents/Research/Zotero/pdftotext-MacIntel science-4.pdf /dev/stdout | grep -i doi
Error: No paper information available - using defaults
Vol. 181, 861-866, February 2009 Printed in U.S.A. DOI:10.1016/j.juro.2008.10.066
t3080-la0003:Downloads mronkko$ ~/Documents/Research/Zotero/pdftotext-MacIntel science-62.pdf /dev/stdout | grep -i doi
Error: No paper information available - using defaults
0041-624X/$ - see front matter ? 2006 Elsevier B.V. All rights reserved. doi:10.1016/j.ultras.2006.06.036

I think that PII is a good feature request.

adamsmith · January 6, 2012

I don't know how much a feature request would help here - Zotero relies on pdftotext, i.e. a third party tool, to extract the text from the pdfs and if that text doesn't include the DOIs (as it apparently doesn't) there's not much to be done.

The PMID could be doable - for PII we still have the question of where to look them up effectively, for PMID we already have a lookup in place.
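
For what it's worth, spotting labeled identifiers in the extracted text could look roughly like the sketch below. It assumes the PDF prints a literal "PMID:" or "PII:" label before the value, as suggested earlier in this thread; that assumption, the patterns, and the function name are all hypothetical, and the sketch says nothing about where a PII could be looked up.

```python
import re

# Assumption: identifiers are printed with a literal "PMID:" or "PII:" label.
PMID_RE = re.compile(r"\bPMID:\s*(\d{1,8})\b")
PII_RE = re.compile(r"\bPII:\s*(\S+)")


def find_labeled_ids(text):
    """Return a dict with any labeled PMID or PII found in the text."""
    ids = {}
    pmid = PMID_RE.search(text)
    if pmid:
        ids["pmid"] = pmid.group(1)
    pii = PII_RE.search(text)
    if pii:
        # Strip punctuation that follows the identifier in running text.
        ids["pii"] = pii.group(1).rstrip(".,;")
    return ids
```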

Generally I think Zotero devs don't prioritize improving the retrieve metadata function - ideally, people should get their data - including pdfs - into Zotero from publisher websites or the like. There is, of course, always the option to submit patches.
recognizePDF.js is a single javascript file, so it's relatively easy to work with.

parasbuy · January 17, 2012

Checking in to see if there have been some changes recently? Zotero isn't retrieving much... about 10% of articles. Confirmed the articles are OCR'd and have DOI's at the bottom in the regular location.

Adam, the retrieve metadata is very useful... I'd say essential, considering the time it saves looking up each DOI. I get articles through my university's system, which is connected to PubMed. When I do a PubMed search, all results from journals that Cornell subscribes to come up, regardless of publisher; so we don't / can't even log-in to the publisher's website.

Thanks,
Paras.

adamsmith · January 17, 2012

Nothing new, no. 10% seems extremely low, though, nowhere close to my experience or that of people testing this:
http://musingsaboutlibrarianship.blogspot.com/2010/07/extracting-metadata-from-pdfs-comparing.html

I'm still not sure I follow why you can't import the data from pubmed - which you apparently search? - and then attach the pdfs.

parasbuy · January 17, 2012

It was actually 3 out of 54 pdf's that Zotero was able to retrieve metadata for.

I'm still not sure I follow your comment of "...why you can't import the data from pubmed" ....perhaps I'm misunderstanding, so let me clarify a bit....

The university system has a PubMed search window on the intranet. It uses PubMed to find the article and then downloads from the publisher through some sort of access gateway through the university's system. There is no way to be able to login directly via a publisher's site. After I download the file, I use the "Store a Copy of File" feature to bring it into Zotero; then right-click and retrieve metadata.... this is the process I've been using, not sure if I'm doing it incorrectly.

Via this process, I hope it's clear how essential retrieving metadata is. I don't see any other way of getting automated citation info.

Thanks,
Paras.

adamsmith · January 17, 2012

"The university system has a PubMed search window on the intranet."
And that's not on a browser? You couldn't import the search results into Zotero?

also, the doi of some of the articles would help so we can have a look.

parasbuy · January 17, 2012

Yes, the PubMed search window is in browser... how can I import the search results into Zotero from there? Don't I need to download the pdf and then "Store a Copy of File" to bring it into Zotero?

I'm using the standalone Zotero.

Here are some files for which Zotero couldn't retrieve metadata...

http://dl.dropbox.com/u/38804134/Blom.pdf

http://dl.dropbox.com/u/38804134/end.2010.0131.pdf

http://dl.dropbox.com/u/38804134/science%20%2851%29.pdf

http://dl.dropbox.com/u/38804134/science%20%2863%29.pdf

http://dl.dropbox.com/u/38804134/science-46.pdf

Thanks,
Paras.

adamsmith · January 17, 2012

which browser are you using? Safari and Chrome have connectors that should allow easy import from Pubmed.
You could also consider using pubmed IDs - that will be a little slower, but the data quality will be much higher than anything you'll ever get using Retrieve Metadata.
I'm looking at the pdfs.

adamsmith · January 17, 2012

As for the PDFs - as I suspected something isn't right:
All of these except the Acta Biomaterialia paper work for me. Four out of five is much more in line with the usual success rate.

However, they all get metadata from google scholar, none of them catches the DOI. That's probably worth taking a look at.

My initial suspicion would be that you used retrieve metadata on a lot of files at once and got locked out of google scholar (because they thought you were a bot). I don't think Zotero tells you when that's the case, although that might be desirable.

parasbuy · January 17, 2012

> "My initial suspicion would be that you used retrieve metadata on a lot of files at once and got locked out of google scholar (because they thought you were a bot). I don't think Zotero tells you when that's the case, although that might be desirable."

ok... what can I do about it?

adamsmith · January 17, 2012

If I'm right about this you would get a message from google scholar when you go to their page that tells you that they think you're a bot and potentially have you fill out a captcha. You'll only get locked out for some time, so you wouldn't necessarily see that now, just immediately after retrieving metadata.

There is not much you can do about that when it happens, except take a break and continue the next day.

Because GS does this, it would really be helpful if Zotero could do a better job with DOIs from pdf files (I don't think CrossRef locks anyone out). Also, of course, data from DOIs is much better.
Right now the core devs are likely busy fixing up 3.0 for final release, but maybe after that's done (end of this month) and potential initial bumps are removed, Simon could hopefully take a look at this.

parasbuy · January 17, 2012

OK.. thanks... I'm not getting any error back from Google Scholar. I'm familiar with the captcha and haven't been asked to submit it. Any other thoughts?

Also, I think I see what you are saying about the connectors... I installed the Zotero Connectors in Safari & Chrome. I was able to search a title through the PubMed window on the intranet, and get to the article on the publisher page. From here, I was able to click on the Add to Zotero button and it did indeed add the citation to the library, but not the article.

When I double click the citation (without the pdf) the university's system blocks me from accessing the site in order to get the pdf.

The only way to do it was to download the article separately and go thru the "Store a Copy of File" routine. After getting it into Zotero, then it looks like I have to match the file to the citation, which is pretty cumbersome when you have 20+ files.

Please let me know if I'm missing anything.. I think the university's firewall is going to be problematic.

Any way to draw some attention to the Retrieve Metadata feature? I think it seems to be the best option considering the hoops I'm running thru here.

Thanks,
Paras.

adamsmith · January 17, 2012

> The only way to do it was to download the article separately and go thru the "Store a Copy of File" routine. After getting it into Zotero, then it looks like I have to match the file to the citation, which is pretty cumbersome when you have 20+ files.

yeah, that's what I meant. As above, though - pubmed data is much better than anything you'll get from retrieve metadata (certainly GS, also DOIs). What you save on import you may lose on fixing data issues.

Devs read all threads, no need to draw additional attention.

As for the google scholar issue - could be something else - someone would need to read through your debug output. http://www.zotero.org/support/debug_output . Unfortunately, I can't do it (I don't work for Zotero and obviously don't have access to the debug logs) so you'll have to wait until a dev takes some time to troubleshoot this.

parasbuy · January 17, 2012

Thanks for your help, Adam.

Paras.