Fantastic updates! What kind of "inaccurate metadata" are you looking for? Obviously if it fetches the wrong citation that's something that should be corrected, but what about mistakes? For example, one publication I often get puts "n/a-n/a" in the pages field. Some reports list the wrong publisher or put the place in the wrong field. In other words, do you want field-specific inaccuracies for specific publications, or just those that are completely wrong citations?
I'd say collect a bunch and report them in a new thread (so this one doesn't get overburdened with details). Not everything is going to be fixable, but the n/a-n/a pages, for example, are probably something we can look into.
I think @amc is referring to the "Report Inaccurate Metadata" context-menu option, so it generally wouldn't involve posting here. But something like the pages field might be worth starting a new thread about, since that could be a problem with the data from CrossRef rather than a problem with the PDF recognizer specifically.
@s7jackson: The new PDF recognizer is just an updated version of the functionality that's always been in Zotero. If you're referring to automatic retrieval when you add standalone PDFs, you can disable that in the General pane of the Zotero preferences.
It looks like books with a lot of front matter before the text really gets started might be going unrecognized...or maybe it's the several pages of images? This happened when I imported two books by Judith Halberstam, The Queer Art of Failure and Female Masculinity, both of which are OCRed PDFs.
@panyuz: You can just enter the DOI in Add Item by Identifier and then drag the PDF to the new item, if that's what you mean. But if the PDF has a DOI there's a good chance it will be recognized automatically.
If you're asking whether you can get a PDF from a DOI with Add Item by Identifier, no, not currently, though that should be more possible in the future.
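For anyone scripting this outside Zotero, here is a rough Python sketch of spotting a DOI in text already extracted from a PDF and building the CrossRef REST API URL for it. The regex is a commonly used approximation and is not necessarily what Zotero's recognizer does; `find_doi` and `crossref_url` are hypothetical helper names.

```python
import re
from urllib.parse import quote

# Approximate pattern for modern DOIs: "10.", a 4-9 digit prefix, "/", then a suffix.
# This is an assumption for illustration; unusual DOIs may not match.
DOI_PATTERN = re.compile(r"\b10\.\d{4,9}/[-._;()/:A-Za-z0-9]+")


def find_doi(text):
    """Return the first DOI-like string in text, or None if nothing matches."""
    match = DOI_PATTERN.search(text)
    if match is None:
        return None
    # Trailing punctuation usually belongs to the sentence, not the DOI.
    return match.group(0).rstrip(".,;)")


def crossref_url(doi):
    """Build the CrossRef REST API URL that returns metadata for a DOI."""
    return "https://api.crossref.org/works/" + quote(doi, safe="/")


if __name__ == "__main__":
    doi = find_doi("Available at https://doi.org/10.1000/xyz123. Accessed 2018.")
    print(doi)
    print(crossref_url(doi))
```

The resulting DOI is also what you would paste into Add Item by Identifier.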
It looks like books with a lot of front matter before the text really gets started might be going unrecognized...or maybe it's the several pages of images?
@calebward: Where the text starts shouldn't matter, but several pages of images certainly might do it. The beginnings of books are a bit tougher in general, but it's something we're trying to improve.
First, thanks for Zotero. By itself the file rename feature is incredibly useful; competing options don’t even come close.
If you are considering improvements I have some suggestions, things I’d find quite helpful:
1. A setting for title maximum character count.
2. A string setting for connectors, or at least options (e.g., “ - ”, “_”, etc.)
3. Options to set an overall format, FIRST_AUTHOR - YEAR - TITLE, etc.
4. Options for authors: first author only, first plus second, etc.
5. An option to either truncate the final word at the max character count, or omit the partial final word
Also, some PDFs don’t turn up any metadata, usually older ones in my experience. Could you render the first page to an offscreen buffer and use OCR to at least get the author and title? Or pull it out of the PDF if it contains text (as opposed to a bitmap).
If the rename feature also had an optional mode that lets the user approve or cancel each proposed rename, that would be totally awesome. You could consider allowing selection between multiple file name possibilities when there isn’t a clear winner, including one where you could edit the proposed new name.
It sounds like you may have something like suggestions 1-5 in the works already.
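To make suggestions 1-5 concrete, here is a rough Python sketch of the behavior I have in mind. It is not how Zotero implements renaming; the function names, parameters, and defaults are all made up for illustration.

```python
import re


def truncate_title(title, max_chars, keep_partial_word=False):
    """Shorten title to max_chars (suggestions 1 and 5)."""
    if len(title) <= max_chars:
        return title
    cut = title[:max_chars]
    if keep_partial_word:
        # Hard cut, even if it lands in the middle of a word.
        return cut.rstrip()
    # If the cut landed mid-word, drop the partial final word.
    if title[max_chars] != " " and " " in cut:
        cut = cut[:cut.rfind(" ")]
    return cut.rstrip()


def build_filename(authors, year, title, connector=" - ",
                   order=("author", "year", "title"),
                   max_title_chars=40, max_authors=1,
                   keep_partial_word=False):
    """Assemble a PDF file name from item metadata (hypothetical helper)."""
    parts = {
        # Suggestion 4: first author only, first plus second, etc.
        "author": " ".join(authors[:max_authors]),
        "year": str(year),
        "title": truncate_title(title, max_title_chars, keep_partial_word),
    }
    # Suggestions 2 and 3: user-chosen connector and field order.
    # Empty fields are skipped so they do not leave a dangling connector.
    name = connector.join(parts[key] for key in order if parts[key])
    # Remove characters that are not allowed in file names on common systems.
    name = re.sub(r'[\\/:*?"<>|]', "", name)
    return name + ".pdf"


if __name__ == "__main__":
    print(build_filename(["Halberstam"], 2011, "The Queer Art of Failure",
                         max_title_chars=12))
    print(build_filename(["Halberstam"], 2011, "The Queer Art of Failure",
                         connector="_", max_title_chars=12,
                         keep_partial_word=True))
```

With a 12-character title limit, the first call gives "Halberstam - 2011 - The Queer.pdf" (partial word omitted) and the second gives "Halberstam_2011_The Queer Ar.pdf" (hard cut).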
Are there APIs for working with this PDF recognizer, so that developers could use it in third-party apps? I've looked through the Zotero API pages but couldn't find anything related to this functionality. If it really isn't available, are you planning to expose it via an API?
https://www.rfc-editor.org/info/rfc4291
Oddly, the recognizer finds the metadata for an earlier (1998) version of this spec when given RFC 4291 (published in 2006).
Thanks for all your work on this!!
Thanks again for the useful tool.
http://spate-irrigation.org/wp-content/uploads/2017/06/Technical-Sheet_Geomembrane_bag.pdf
http://www.itnphil.org.ph/docs/How to construct a rainwater harvesting tank.pdf
https://cleancookstoves.org/binary-data/CMP_CATALOG/file/000/000/102-1.pdf
Any advice on how I can retrieve said data?