Yet another "retrieve Metadata provides wrong result" case

i3v · May 26, 2016

Hello,

I've just noticed that "retrieve metadata" for this document (http://www.merl.com/publications/docs/TR2005-057.pdf) gives a completely wrong result:

oualline_practical_1995,
location = {Sebastopol, {CA}},
edition = {1st ed},
title = {Practical C++ programming},
isbn = {978-1-56592-139-9},
series = {A Nutshell handbook},
pagetotal = {557},
publisher = {Reilly \& Associates},
author = {Oualline, Steve},
date = {1995},
keywords = {C++ (Computer program language)

It's probably the same issue as discussed here:
https://forums.zotero.org/discussion/57418/retrieve-pdfs-metadata-wrong-metadata-source-/
or here:
https://forums.zotero.org/discussion/26927/pdf-with-incorrect-metadata/

So I just wish to submit one more example where the current mechanism fails.

noksagt · May 26, 2016

That seems likely to be the correct diagnosis: the ISBN is given in the report. This particular item is a report, for which I would not expect retrieval to succeed.

I don't know how much worse a "false positive" like this is than simply not finding data, so I don't know which efforts to improve the heuristic would be useful.

But I will say that in both this case and the one described in discussion 57418, the document submitted for metadata retrieval is a relatively short (10 page) PDF. There are short books. But WorldCat and other databases often have a page count (admittedly, in WorldCat, this is just an inconsistently-formatted "description" blob that we don't bother trying to parse). I wonder if a gross comparison of page count would be useful to lower false positives?
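A rough sketch of what such a gross page-count comparison could look like, in Python. The function name, the 50% tolerance, and the idea that a catalog page count is already available as an integer are all assumptions for illustration; nothing like this exists in Zotero as far as this thread shows.

```python
def page_count_plausible(pdf_pages, catalog_pages, tolerance=0.5):
    """Return False only when the PDF is grossly shorter or longer than the catalog record.

    pdf_pages      -- number of pages in the PDF being recognized
    catalog_pages  -- page count from the catalog record, or None if unknown
    tolerance      -- allowed relative difference (hypothetical default)
    """
    if catalog_pages is None or catalog_pages <= 0:
        # No usable page count in the record: do not reject the match.
        return True
    return abs(pdf_pages - catalog_pages) <= tolerance * catalog_pages


# The case from this thread: a 10-page PDF matched to a 557-page book.
print(page_count_plausible(10, 557))
```

With the numbers from this thread (10 pages vs. pagetotal = 557) the check rejects the match, while a record with no page count is let through unchanged.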

i3v · May 27, 2016

Hi noksagt,

Thanks for your quick reply!

Indeed, I'm not sure about the details of the implementation, and I can't tell how much work is needed to make the current algorithm work better. I just wanted to add a "test case" which currently fails, even though (to me) it looks even simpler than the ones I've mentioned (and way simpler than another one).

In particular:

Googling the title, "Integral Histogram: A Fast Way to Extract Histograms in Cartesian Spaces" (printed in a large bold font on the first page), immediately gives a much more relevant result.
A similar result might be obtained by googling the first sentence after the "Abstract" keyword (still on the first page).
The ISBN of “Practical C++ programming” appears only on the last page, after the "references" keyword (not on the first page, as was noted in one of the two cases I've mentioned).
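To illustrate the last point, here is a minimal Python sketch that separates ISBNs found in the body of a document from ISBNs found after a "References" heading. The regular expressions and the function name are assumptions made up for this example (the ISBN pattern is simplified and requires a literal "ISBN" prefix); this is not how Zotero currently works.

```python
import re

# Simplified pattern: the word "ISBN" followed by digits and hyphens (assumption).
ISBN_RE = re.compile(r"ISBN(?:-1[03])?:?\s*([0-9][0-9\-]{8,15}[0-9Xx])")
# A line consisting only of "References" or "Bibliography" (assumption).
REFS_RE = re.compile(r"^\s*(?:references|bibliography)\s*$",
                     re.IGNORECASE | re.MULTILINE)


def split_isbns(text):
    """Return (isbns_in_body, isbns_in_references) with hyphens removed."""
    heading = REFS_RE.search(text)
    cut = heading.start() if heading else len(text)

    def find(chunk):
        return [m.replace("-", "") for m in ISBN_RE.findall(chunk)]

    return find(text[:cut]), find(text[cut:])
```

An ISBN that only shows up in the second list probably belongs to a cited work rather than to the document itself, so it could be treated as a weaker signal.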

PS.
It looks like Mendeley is able to retrieve the metadata correctly. This page says they use some heuristics as well.

noksagt · May 27, 2016

Thanks.

The relevant code in Zotero is:
https://github.com/zotero/zotero/blob/master/chrome/content/zotero/recognizePDF.js

Zotero looks in the first 15 pages for a DOI. Failing that, it looks in the first 15 pages for an ISBN. Failing that, it queries Google Scholar with median-length lines. I haven't tried lowering the page limit to force the Google Scholar query on this text. If that worked, it might be one bit of support for also checking Scholar in cases like this (multiple ISBNs found late in the document).
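The order of the fallbacks described above can be sketched roughly as follows. This is an illustrative Python approximation, not the code in recognizePDF.js: the regular expressions, the function name, the `n_lines` parameter, and the way "median-length lines" are picked are all assumptions.

```python
import re
from statistics import median

MAX_PAGES = 15  # page limit quoted above

# Simplified patterns, assumed for this sketch only.
DOI_RE = re.compile(r"\b10\.\d{4,9}/\S+")
ISBN_RE = re.compile(r"ISBN(?:-1[03])?:?\s*([0-9][0-9\-]{8,15}[0-9Xx])")


def choose_lookup(pages, max_pages=MAX_PAGES, n_lines=3):
    """Pick a lookup strategy from per-page text: DOI, then ISBN, then Scholar."""
    text = "\n".join(pages[:max_pages])

    doi = DOI_RE.search(text)
    if doi:
        return "doi", doi.group(0).rstrip(".,;)")

    isbn = ISBN_RE.search(text)
    if isbn:
        return "isbn", isbn.group(1).replace("-", "")

    lines = [ln.strip() for ln in text.splitlines() if ln.strip()]
    if not lines:
        return "none", None

    # Use the lines whose length is closest to the median as search phrases.
    mid = median(len(ln) for ln in lines)
    ranked = sorted(lines, key=lambda ln: abs(len(ln) - mid))
    return "scholar", ranked[:n_lines]
```

In this sketch, calling `choose_lookup(pages, max_pages=1)` on a document whose only ISBN sits on the last page skips the ISBN branch and falls through to the Scholar query, which is the experiment suggested above.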

[I don't think we'd have an easy way to identify the format/styling of text, and first pages can be misleading (they're often copyright or interlibrary loan boilerplate).]