Error in metadata extraction

freddie2310 · August 28, 2014

Hello,

When I retrieved the metadata for the PDF file available at http://ec.europa.eu/environment/marine/pdf/9-Task-Group-10.pdf I got the metadata of a different file, available at http://ec.europa.eu/environment/marine/pdf/8-Task-Group-9.pdf.

I would like to understand which information in the PDF file led Zotero (and Google Scholar) to the wrong file.

Thank you for your help.

F

adamsmith · August 28, 2014

Zotero extracts part of the full text of the document, puts it in quotation marks, and queries Google Scholar with it.
If you want to see exactly what it sends, you can record debug output during the retrieve metadata process:
https://www.zotero.org/support/debug_output
There will be a fair amount of irrelevant output in it, but the Google Scholar query is easy to spot.

(Zotero first looks for a DOI (CrossRef) or an ISBN (WorldCat), but I think those are unlikely here.)
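
For readers who want to picture the lookup order described above, here is a rough Python sketch. It is not Zotero's code: the function name `lookup_candidates`, the two regular expressions, and the service labels are all assumptions made for illustration. It only shows the idea of preferring an identifier (DOI, then ISBN) and falling back to a Google Scholar search in which each extracted phrase is wrapped in quotation marks.

```python
import re

# Hypothetical patterns, simplified for illustration; real-world DOI and
# ISBN detection needs more care than this.
DOI_RE = re.compile(r"\b10\.\d{4,9}/[^\s\"<>]+")
ISBN_RE = re.compile(
    r"\bISBN(?:-1[03])?:?\s*((?:97[89][- ]?)?(?:\d[- ]?){9}[\dXx])\b"
)


def lookup_candidates(text, phrases):
    """Return (service, query) pairs in the order they would be tried."""
    candidates = []
    doi = DOI_RE.search(text)
    if doi:
        # Trailing punctuation is usually sentence debris, not part of the DOI.
        candidates.append(("crossref", doi.group(0).rstrip(".,;)")))
    isbn = ISBN_RE.search(text)
    if isbn:
        # Normalize by dropping hyphens and spaces.
        candidates.append(("worldcat", re.sub(r"[- ]", "", isbn.group(1))))
    # Last resort: an exact-phrase search, each phrase in quotation marks.
    candidates.append(("google_scholar", " ".join(f'"{p}"' for p in phrases)))
    return candidates
```

A caller would try each pair in turn and stop at the first service that returns a match. Whether a later candidate is actually attempted after an earlier one fails is up to the caller and is not something this sketch settles.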

freddie2310 · August 29, 2014

OK. I looked at the debug output and understood that:
- Zotero couldn't get a result from CrossRef
- Zotero didn't try anything with the ISBN (it may not have found it, even though it seems obvious in the document text)
- Zotero then generated a query string for Google Scholar, and the result led it to the wrong document.
My question is: is the text for the query string selected randomly?
In my example, it is two text strings selected from the document preface.

Thanks for your explanation.
F

adamsmith · August 29, 2014

We're working on the DOI issue, so that would fix this particular case.
As for the phrases used in the Google Scholar query: they're relatively random (though not in the technical sense: you'd get the same phrases every time you try this). To be precise, they're lines within 6 characters of the median line length that appear in the first column of the text (if the text is multi-column).
There's been talk of excluding the first x% of the document to avoid picking up prefaces and the like, but I believe that has never been implemented.
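
To make the line-selection rule concrete, here is a minimal Python sketch of "lines within 6 characters of the median line length". It is an illustration only, not Zotero's implementation: the function name `candidate_lines` and the `tolerance` parameter are made up, blank lines are ignored as an assumption, and the first-column handling for multi-column text is left out entirely.

```python
from statistics import median


def candidate_lines(lines, tolerance=6):
    """Keep non-blank lines whose length is within `tolerance` characters
    of the median length of all non-blank lines."""
    non_blank = [line for line in lines if line.strip()]
    if not non_blank:
        return []
    mid = median(len(line) for line in non_blank)
    return [line for line in non_blank if abs(len(line) - mid) <= tolerance]
```

Because the rule depends only on line lengths, the same document always yields the same candidates, which is why the selection looks arbitrary but is repeatable. It also shows why a preface can win: any body-text line of typical length qualifies, wherever it sits in the document.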

aurimas · August 29, 2014

Also, we've never considered items with both a DOI and an ISBN, but obviously they exist. The ISBN would actually be able to retrieve metadata for this item. I'm working on a fix for this as well.