PDF metadata mismatches

ugomatic · August 21, 2009

Occasionally the metadata retrieved by Zotero for PDF files is not accurate.
I'll post examples of this as I come across them.

A paper titled "Information and Communication Technologies, Poverty and Development" is recognized as "Non-surgical retrieval of a broken segment of steel spring guide from the right atrium and inferior vena cava".
The original file is for some reason not indexed in Google Scholar, which might indeed be the problem, but it can be found in the following locations:

http://www.sed.manchester.ac.uk/idpm/research/publications/wp/di/di_wp05.htm
http://unpan1.un.org/intradoc/groups/public/documents/NISPAcee/UNPAN015539.pdf

ugomatic · August 22, 2009

Another file not properly recognised is this:

http://www.aspeninstitute.org/sites/default/files/content/docs/pubs/The_Rise_of_Collective_Intelligence.pdf

It is imported with the metadata of another report from the same series: http://www.aspeninstitute.org/sites/default/files/content/docs/pubs/A_Framework_for_a_National_Broadband_Policy_0.pdf

(However, this doesn't happen vice versa.)

(I am reporting on these issues not as a complaint, as I love Zotero - just to help with bugs)

dvs0826 · August 23, 2009

Numerous articles from ACM Transactions on Graphics result in mismatched metadata.

One quick example is:

Modelling and rendering of realistic feathers

dstillman · August 23, 2009

Currently the recognizer only looks at the first two pages of the PDF, so if there's no DOI information and the full-text content doesn't start until the third page or later (e.g., if there's a table of contents), it likely either won't find anything or will return mismatched metadata.

So the first thing we need to do is to bump up the page limit.
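To illustrate why the page limit matters, here is a minimal sketch in Python. It is not Zotero's actual recognizer code: the `find_doi` function, the `max_pages` parameter, and the loose DOI pattern are all assumptions made for illustration, and it assumes the text of each page has already been extracted into a list of strings.

```python
import re

# Loose DOI pattern (an assumption, not Zotero's): "10.", a 4-9 digit
# registrant code, "/", then any run of non-whitespace characters.
DOI_PATTERN = re.compile(r"\b10\.\d{4,9}/\S+")


def find_doi(pages, max_pages=2):
    """Return the first DOI-like string in the first max_pages pages, or None."""
    for text in pages[:max_pages]:
        match = DOI_PATTERN.search(text)
        if match:
            # Drop punctuation that commonly trails a DOI in running text.
            return match.group(0).rstrip(".,;)")
    return None


# Hypothetical PDF whose real content only starts on the third page.
pages = ["Table of contents", "Preface", "Some title. doi:10.1234/abc.def. Abstract"]
print(find_doi(pages, max_pages=2))  # None
print(find_doi(pages, max_pages=5))  # 10.1234/abc.def
```

With a limit of two pages the DOI on page three is never seen, so a recognizer would have to fall back on whatever text it did find, which is where a mismatch can come from.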

komrade · September 3, 2009

Could it also possibly look for JSTOR URLs on the first page and get info from there?

This article from 2001 came up way wrong:
http://www.jstor.org/stable/3061243
(Identified as a 1993 paper in the same journal)
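
Roughly what I have in mind, as a sketch in Python rather than anything Zotero does today. The `find_jstor_id` name is made up, it assumes the first page's text is already available as a string, and it only handles purely numeric stable IDs like the one above. Looking up the metadata from the ID is left out.

```python
import re

# Matches JSTOR stable URLs with a numeric ID, e.g. .../stable/3061243
JSTOR_STABLE = re.compile(r"https?://(?:www\.)?jstor\.org/stable/(\d+)")


def find_jstor_id(first_page_text):
    """Return the numeric JSTOR stable ID found in the text, or None."""
    match = JSTOR_STABLE.search(first_page_text)
    return match.group(1) if match else None


cover_text = "Stable URL: http://www.jstor.org/stable/3061243"
print(find_jstor_id(cover_text))  # 3061243
```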