Available for beta testing: new PDF recognizer

  • Fantastic updates! What kind of "inaccurate metadata" are you looking for? Obviously if it fetches the wrong citation that's something that should be corrected, but what about mistakes? For example, one publication I often get puts "n/a-n/a" in the pages field. Some reports list the wrong publisher or put the place in the wrong field. In other words, do you want field-specific inaccuracies for specific publications, or just those that are completely wrong citations?
  • I'd say collect a bunch and report them in a new thread (so this one doesn't get overburdened with details). Not everything is going to be fixable, but the n/a-n/a, for example, is probably something we can look into.
  • I think @amc is referring to the "Report Inaccurate Metadata" context-menu option, so it generally wouldn't involve posting here. But something like the pages field might be worth starting a new thread about, since that could be a problem with the data from CrossRef rather than a problem with the PDF recognizer specifically.
  • edited March 18, 2018
    I just imported the master's thesis from https://www.utwente.nl/en/et/wem/education/msc-thesis/2018/herrebrugh.pdf. The PDF recognizer creates a parent item and fills in the name, title, page, and type as "journal article".
  • Just tried the new PDF recognizer on an IETF RFC


    Oddly, the recognizer finds the metadata for the previous (1998) version of this spec when given RFC 4291 (published in 2006).
  • Indeed! This is a major step forward and very encouraging for the future of Zotero!
  • How can I disable this feature?
  • @s7jackson: The new PDF recognizer is just an updated version of the functionality that's always been in Zotero. If you're referring to automatic retrieval when you add standalone PDFs, you can disable that in the General pane of the Zotero preferences.
  • It looks like books with a lot of front matter before the text really gets started might be going unrecognized... or maybe it's the several pages of images? This happened when I imported two books by Judith Halberstam, The Queer Art of Failure and Female Masculinity, both of which are OCR'd PDFs.

    Thanks for all your work on this!!
  • Can I use a DOI to recognize PDFs by hand?
  • @panyuz: You can just enter the DOI in Add Item by Identifier and then drag the PDF to the new item, if that's what you mean. But if the PDF has a DOI there's a good chance it will be recognized automatically.

    If you're asking whether you can get a PDF from a DOI with Add Item by Identifier, no, not currently, though that should be more possible in the future.
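
For anyone who wants to find the DOI to paste into Add Item by Identifier, here is a minimal sketch of pulling one out of text copied from a PDF. It is not how Zotero's recognizer works. The `find_doi` helper is hypothetical, and the pattern is a commonly used approximation of modern DOIs, so it can miss unusual ones.

```python
import re

# Approximate pattern for modern DOIs (assumption: prefix "10.", 4-9 digits, "/", suffix).
DOI_PATTERN = re.compile(r"\b10\.\d{4,9}/[-._;()/:A-Za-z0-9]+")


def find_doi(text):
    """Return the first DOI-like string in text, or None if there is none."""
    match = DOI_PATTERN.search(text)
    if match is None:
        return None
    # Trailing punctuation usually belongs to the sentence, not the DOI.
    # This may clip a DOI that legitimately ends in one of these characters.
    return match.group(0).rstrip(".,;)")


if __name__ == "__main__":
    sample = "Available at https://doi.org/10.1000/xyz123."
    print(find_doi(sample))
```

The returned string can then be entered in Add Item by Identifier as described above.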
  • It looks like books with a lot of front matter before the text really gets started might be going unrecognized...or maybe it's the several pages of images?
    @calebward: Where the text starts shouldn't matter, but several pages of images certainly might do it. The beginnings of books are a bit tougher in general, but it's something we're trying to improve.
  • First, thanks for Zotero. By itself the file rename feature is incredibly useful; competing options don’t even come close.

    If you are considering improvements I have some suggestions, things I’d find quite helpful:

    1. A setting for title maximum character count.
    2. A string setting for connectors, or at least options (e.g., “ - ”, “_”, etc.)
    3. Options to set an overall format, FIRST_AUTHOR - YEAR - TITLE, etc.
    4. Options for authors, first author only, first plus second, etc
    5. An option to either truncate final word at max character count, or omit partial final word
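
To make suggestions 1 to 5 concrete, here is a rough sketch of how such settings could combine into a file name. This is only an illustration of the request, not Zotero's rename code; `build_filename` and all of its parameter names are made up for the example.

```python
def truncate_title(title, max_chars, keep_partial_word=False):
    """Shorten title to max_chars, optionally dropping a cut-off final word."""
    if len(title) <= max_chars:
        return title
    cut = title[:max_chars]
    # If the cut landed mid-word and partial words are not wanted,
    # fall back to the last complete word (when there is one).
    cut_mid_word = title[max_chars] != " " and not cut.endswith(" ")
    if cut_mid_word and not keep_partial_word and " " in cut:
        cut = cut.rsplit(" ", 1)[0]
    return cut.strip()


def build_filename(authors, year, title, *, connector=" - ", max_title_chars=40,
                   max_authors=1, keep_partial_word=False,
                   order=("authors", "year", "title"), extension=".pdf"):
    """Assemble a file name from the chosen parts in the chosen order."""
    parts = {
        "authors": " and ".join(authors[:max_authors]),
        "year": str(year),
        "title": truncate_title(title, max_title_chars, keep_partial_word),
    }
    return connector.join(parts[name] for name in order) + extension


if __name__ == "__main__":
    print(build_filename(["Halberstam"], 2011, "The Queer Art of Failure",
                         max_title_chars=12))
    print(build_filename(["Smith", "Jones"], 2018, "A Title", connector="_",
                         max_authors=2, order=("year", "authors", "title")))
```

Here `order` covers the overall format (suggestion 3), `max_authors` covers the author options (4), and `keep_partial_word` switches between the two truncation behaviours (5).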

    Also, some PDFs don’t turn up metadata, usually older ones in my experience. Could you render the first page to an offscreen buffer and use OCR to at least get the author and title? Or pull it out of the PDF if it contains text (as opposed to a bitmap)?

    If the rename feature also had a selectable mode that lets the user approve or cancel each proposed rename, that would be totally awesome. You could consider offering a choice between multiple file name possibilities when there isn’t a clear winner, including one where you could edit the proposed new name.

    It sounds like you may have something like 1 - 5 in the works already.

    Thanks again for the useful tool.
  • Are there APIs to work with this PDF recognizer, so that developers could use it in 3rd-party apps? I've looked at Zotero API pages, couldn't find anything related to this functionality. If it's really not available, are you planning to open this functionality via APIs?
  • For a number of the PDFs in my library, it says 'no matching references found'. This is the case for around half of my files; links below.
    http://www.itnphil.org.ph/docs/How to construct a rainwater harvesting tank.pdf
    Any advice on how I can retrieve said data?