Available for beta testing: new PDF recognizer

amc · March 13, 2018

Fantastic updates! What kind of "inaccurate metadata" are you looking for? Obviously if it fetches the wrong citation that's something that should be corrected, but what about mistakes? For example, one publication I often get puts "n/a-n/a" in the pages field. Some reports list the wrong publisher or put the place in the wrong field. In other words, do you want field-specific inaccuracies for specific publications, or just those that are completely wrong citations?

adamsmith · March 13, 2018

I'd say collect a bunch and report them in a new thread (so this one doesn't get overburdened with details). Not everything is going to be fixable, but e.g. the n/a-n/a probably is something we can look into.

dstillman · March 13, 2018

I think @amc is referring to the "Report Inaccurate Metadata" context-menu option, so it generally wouldn't involve posting here. But something like the pages field might be worth starting a new thread about, since that could be a problem with the data from CrossRef rather than a problem with the PDF recognizer specifically.

LiborA · March 18, 2018

I just imported a master's thesis from https://www.utwente.nl/en/et/wem/education/msc-thesis/2018/herrebrugh.pdf. The PDF recognizer creates a parent item and fills in the name, title, and pages, and sets the type to "journal article".

mguod · March 20, 2018

Just tried the new PDF recognizer on an IETF RFC

https://www.rfc-editor.org/info/rfc4291

Oddly, the recognizer finds the metadata for the previous (1998) version of this spec when given RFC 4291 (published in 2006).

Heckscher · March 21, 2018

Indeed! This is a major step forward and very encouraging for the future of Zotero!

s7jackson · March 23, 2018

How can I disable this feature?

dstillman · March 23, 2018

@s7jackson: The new PDF recognizer is just an updated version of the functionality that's always been in Zotero. If you're referring to automatic retrieval when you add standalone PDFs, you can disable that in the General pane of the Zotero preferences.

calebward · April 1, 2018

It looks like books with a lot of front matter before the text really gets started might be going unrecognized...or maybe it's the several pages of images? This happened when I imported two books by Judith Halberstam, The Queer Art of Failure and Female Masculinity, both of which are OCR'd PDFs.

Thanks for all your work on this!!

panyuz · April 5, 2018

Can I use a DOI to recognize PDFs by hand?

dstillman · April 5, 2018

@panyuz: You can just enter the DOI in Add Item by Identifier and then drag the PDF to the new item, if that's what you mean. But if the PDF has a DOI there's a good chance it will be recognized automatically.

If you're asking whether you can get a PDF from a DOI with Add Item by Identifier, no, not currently, though that should be more possible in the future.
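To illustrate why a DOI inside the PDF helps: DOIs follow a predictable pattern, so they can be picked out of a PDF's extracted text with a regular expression and then looked up. Below is a minimal Python sketch of that idea. It is not Zotero's actual recognizer code; the pattern (loosely based on the commonly cited Crossref recommendation for modern DOIs) and the `find_dois` helper are assumptions for illustration only.

```python
import re

# Assumption: a loose pattern for modern DOIs ("10.", a 4-9 digit
# registrant prefix, a slash, then a suffix). It will miss some
# unusual or very old DOIs.
DOI_PATTERN = re.compile(r'\b10\.\d{4,9}/[-._;()/:A-Za-z0-9]+')


def find_dois(text):
    """Return DOI-like strings found in text, in order, without duplicates."""
    found = []
    for match in DOI_PATTERN.finditer(text):
        # Strip punctuation that usually belongs to the sentence, not the DOI.
        # (A DOI that really ends in ")" would be cut short by this.)
        doi = match.group(0).rstrip('.,;)')
        if doi not in found:
            found.append(doi)
    return found


# Hypothetical first-page text; 10.1000/xyz123 is a made-up example DOI.
sample = "Published online. https://doi.org/10.1000/xyz123. See also doi:10.1000/xyz123"
print(find_dois(sample))
```

If a list like this comes back with a single DOI, that DOI is what you would paste into Add Item by Identifier.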

dstillman · April 6, 2018

It looks like books with a lot of front matter before the text really gets started might be going unrecognized...or maybe it's the several pages of images?

@calebward: Where the text starts shouldn't matter, but several pages of images certainly might do it. The beginnings of books are a bit tougher in general, but it's something we're trying to improve.

buehlerd · April 10, 2018

First, thanks for Zotero. By itself the file rename feature is incredibly useful; competing options don’t even come close.

If you are considering improvements I have some suggestions, things I’d find quite helpful:

1. A setting for title maximum character count.
2. A string setting for connectors, or at least options (e.g., “ - ”, “_”, etc.)
3. Options to set an overall format, FIRST_AUTHOR - YEAR - TITLE, etc.
4. Options for authors, first author only, first plus second, etc
5. An option to either truncate final word at max character count, or omit partial final word
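
To make the five options above concrete, here is a rough Python sketch of how they could fit together. It is not how Zotero's rename feature works; `build_filename`, `truncate_title`, and all parameter names are hypothetical, chosen only to mirror the list.

```python
def truncate_title(title, max_len, keep_partial_word=False):
    """Shorten title to max_len characters (options 1 and 5)."""
    if len(title) <= max_len:
        return title
    cut = title[:max_len]
    if keep_partial_word:
        # Truncate exactly at the limit, even in the middle of a word.
        return cut.rstrip()
    # Otherwise omit a partial final word, if the cut fell inside one.
    if title[max_len] != " " and " " in cut:
        cut = cut[:cut.rfind(" ")]
    return cut.rstrip()


def build_filename(authors, year, title, connector=" - ", max_authors=1,
                   max_title_len=40, keep_partial_word=False,
                   order=("author", "year", "title"), ext=".pdf"):
    """Assemble a file name from metadata (options 2, 3, and 4)."""
    fields = {
        "author": " and ".join(authors[:max_authors]),
        "year": str(year),
        "title": truncate_title(title, max_title_len, keep_partial_word),
    }
    # Skip empty fields so the connector is never doubled up.
    return connector.join(fields[name] for name in order if fields[name]) + ext


title = "A new approach to PDF metadata recognition"
print(build_filename(["Smith", "Jones"], 2018, title, max_title_len=20))
print(build_filename(["Smith", "Jones"], 2018, title, connector="_",
                     max_authors=2, max_title_len=20, keep_partial_word=True))
```

The first call uses only the first author and drops the partial word at the 20-character limit; the second uses two authors, an underscore connector, and a hard cut at the limit.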

Also, some PDFs don’t turn up metadata, usually older ones in my experience. Could you render the first page to an offscreen buffer and use OCR to at least get the author and title? Or pull it out of the PDF if it contains text (as opposed to a bitmap).

If the rename feature also had a selectable mode that lets the user approve or cancel each proposed rename, that would be totally awesome. You could consider allowing selection between multiple file name possibilities when there isn’t a clear winner, including one where you could edit the proposed new name.

It sounds like you may have something like 1 - 5 in the works already.

Thanks again for the useful tool.

galicarnax · April 12, 2018

Are there APIs to work with this PDF recognizer, so that developers could use it in third-party apps? I've looked at the Zotero API pages but couldn't find anything related to this functionality. If it's really not available, are you planning to open this functionality via APIs?

treeswaterpeople · April 27, 2018

For a number of the PDFs in my library, it says 'no matching references found'. This is the case for around half of my files; links are below.
http://spate-irrigation.org/wp-content/uploads/2017/06/Technical-Sheet_Geomembrane_bag.pdf
http://www.itnphil.org.ph/docs/How to construct a rainwater harvesting tank.pdf
https://cleancookstoves.org/binary-data/CMP_CATALOG/file/000/000/102-1.pdf
Any advice on how I can retrieve said data?

dstillman · April 27, 2018

@treeswaterpeople: https://forums.zotero.org/discussion/comment/307479/#Comment_307479