Various observations regarding PDF recognition and a few code solutions

I posted in a different thread about how much better Zotero's PDF metadata recognition is than anything else I've tried, and the developers should be commended for making Zotero such a capable tool. For ease of use while conducting new research, it's fantastic and effortless. But over the last few weeks I've been trying to add my collection of 3,000+ journal articles, OCRed page scans, and e-books, which has not been easy without some code modification. It may be a rare case that someone starts with such a large mess of PDFs already on disk, but since Zotero's catalog display features - for lack of a better term - seem to be on a par with a number of other packages, more users might be tempted, as I was, to use it for more than a reference generator. There are issues with Zotero's data framework that crop up when you push it beyond its intended purpose, but those are for another time. Here I thought I'd share a few of the hiccups directly related to PDF metadata recognition - for book material especially - and the workarounds I've used or attempted. Maybe the developers will see some value in my suggestions, or perhaps someone searching the forums can use them. For the latter, note that I've used and modified the Zotero standalone program as described here:

http://www.zotero.org/support/dev/client_coding/building_the_standalone_client
http://www.zotero.org/support/dev/modifying_zotero_files

I also ended up finding and using a newer recognizePDF.js file that, as of 8/17/12, is not in Zotero 3.0.8 but is currently on GitHub here:

https://github.com/zotero/zotero/blob/master/chrome/content/zotero/recognizePDF.js

My comments below reflect this much-improved version of the file.

Please be careful and always make backups of your Zotero and Firefox profiles!

ISBN recognition

Unfortunately, many OCR engines do not reliably distinguish the numeral '0' from the letter 'O' (upper- or lowercase), and the same goes for the numeral '1' and a lowercase 'l'. That is, Acrobat or whatever engine processes an image of a copyright page set in some unfortunate typeface, and the text in the PDF becomes not 0-12-345678-9 but o-l-2345678-9. While cataloguing my collection I've come across many scans whose ISBNs Zotero chokes on because of this. The ISBN parser in recognizePDF.js needs to handle these cases better, first by expanding the ISBN recognition regular expression from [0-9X] to something like [0-9XxOol], and then by running some cleanup code to turn any errant letters into the appropriate numerals. The same goes for whatever handles the magic wand, i.e. the "Add Item by Identifier" input.
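
To make the idea concrete, here is a minimal sketch of the sort of cleanup I mean. The function and variable names are my own illustration, not what's actually in recognizePDF.js:

    // Widen the character class from [0-9X] to also accept common OCR
    // confusions, then map the errant letters back to digits before
    // validating or searching.
    var isbnRegexp = /ISBN(?:-1[03])?:?\s*((?:[0-9XxOol][- ]*){9,16})/g;

    function fixOCRISBN(raw) {
        return raw.replace(/[Oo]/g, "0")   // letter O misread for zero
                  .replace(/l/g, "1")      // lowercase l misread for one
                  .replace(/[- ]/g, "")    // strip separators
                  .toUpperCase();          // normalize a trailing 'x'
    }

    // fixOCRISBN("o-l-2345678-9") -> "0123456789"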

Next, the branching in Zotero_RecognizePDF.Recognizer.prototype._queryGoogle = function() needs to be improved. The IF statement with the condition if(this._DOI || this._ISBNs) somehow prevents the Google Scholar query from taking place if the ISBN search fails, as it does with ISBNs containing errant letters as above. I tried to rectify this myself, but I couldn't guarantee the expected behavior. So I ended up creating a separate recognizePDF.js with the DOI/ISBN branch eliminated, swapped it in for the files that had failed DOI/ISBN recognition, and it did what I needed.
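
For what it's worth, the shape of the fix I was after looks something like this - a conceptual sketch only, with hypothetical helper names (_lookupIdentifier, _queryGoogleScholar) standing in for the real code paths:

    Zotero_RecognizePDF.Recognizer.prototype._queryGoogle = function() {
        var me = this;
        if(this._DOI || this._ISBNs) {
            // try the identifier lookup first, but don't stop if it fails
            this._lookupIdentifier(function(success) {
                if(!success) me._queryGoogleScholar();  // fall through
            });
            return;
        }
        this._queryGoogleScholar();
    };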

Lastly, for those items where Zotero does recognize the ISBN, or where the ISBN is entered manually into the magic wand box, the Open WorldCat lookup produces a subpar result in the ISBN field of the created item: at least two, and perhaps many more, ISBNs. This is not really Zotero's fault, because the WorldCat record includes all of those ISBNs - the paired ISBN-10 and ISBN-13 at minimum, and more pairs for each edition WorldCat collates under that one result: hardcover, paperback, e-book, multi-volume sets… For the recognizer, little can be done, since there is no way of knowing which of the many ISBNs from WorldCat is the appropriate one; the recognizer itself doesn't currently care which of the ISBNs on a copyright page (paperback, hardcover, e-book, etc.) it picks up, it simply chooses one. For the 'Add Item by Identifier' magic wand box, on the other hand, this is distinctly suboptimal behavior. If I've gone to the trouble of turning to the copyright page, picking the appropriate ISBN, and inputting it manually, that ISBN should be what appears in the item's ISBN field, not ten ISBNs reflecting a long publishing history.
(Because I wanted to export each book item from Zotero to LibraryThing through its ISBN, and having multiple ISBNs makes the LibraryThing importer think I have multiple copies of each book, I modified the Open WorldCat translator to take the first ISBN record only - unfortunately WorldCat gives no guarantee whether that is an ISBN-10 or -13 or which edition it references - and then modified recognizePDF.js to call the new translator. This only works with the recognizer, since I couldn't find the .js file that governs the magic wand box. It's not something I recommend Zotero do, but I mention it in case others want to try it. I also had to define a custom RIS translator to get only the ISBN field exported from Zotero, which was more work than it should have been.)
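
The translator change itself was tiny - roughly this, although the exact field handling in my copy differs a bit:

    // Keep only the first ISBN WorldCat returns; there is no telling
    // whether it's an ISBN-10 or -13 or which edition it belongs to.
    if(item.ISBN) {
        item.ISBN = item.ISBN.split(/[,;\s]+/)[0];
    }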

Google

Now, should the ISBN search fail, searching Google Scholar the way Zotero is set up doesn't actually accomplish much anyway, for a couple of reasons. First, Google doesn't have many scholarly books tied to the Scholar database (at least in my trials; your mileage may vary). Even if the quotes chosen by recognizePDF.js are exact and not mangled by OCR errors, Zotero will still come up empty if Scholar does not have that text. For book searches in general, Google Scholar is surely a subpar choice compared to Google Books. I tested a routine that searched Google Books using the same framework as the Scholar query in recognizePDF.js - calling a hidden browser with a Books URL constructed from the search string and invoking the Google Books translator instead of the Scholar one - and managed a 50% positive match rate on a large sample that Zotero had otherwise failed to match by ISBN or Google Scholar. There was about a 10% error rate, but I thought that was an acceptable ratio for my purposes.
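
The substitution looked roughly like this - the URL pattern is my own guess at a workable query, and quotedLines stands in for the lines recognizePDF.js samples:

    // Build a Google Books query from the sampled lines, load it in the
    // hidden browser, and hand the page to the Google Books translator
    // instead of the Scholar one.
    var queryString = quotedLines.map(function(line) {
        return '"' + line + '"';
    }).join(" ");
    var url = "http://books.google.com/books?q=" +
              encodeURIComponent(queryString);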

Second, as my usage has unfortunately shown, for book scans even the 7-page limit passed to the pdftotext command by recognizePDF.js is not enough for quality text query generation. For most books, 7 pages will get you the ISBN, but between the cover, title, half-title, series title, series description and list, blank pages, copyright page, etc., a median-length line sampled from that front matter may not be unique - for example, if it's pulled from a series title page shared by many volumes. More importantly, for whatever reason, the columns and closely spaced characters of a copyright page are more likely to induce OCR errors, so the PDF's rendered text tends to contain non-UTF characters and gobbledygook at the beginning of the file, but not towards the middle, in the meat of the book's content. To get quality line samples, I set the pdftotext call in recognizePDF.js to use 40 pages, and that produced good samples almost every time. This is a bit problematic for general use, since for most journal articles 40 pages will very likely capture end-matter reference lines, which are common across many, many articles. But since I had already gotten most of my articles processed, it was fine for my use.
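
Concretely, the change is just to the page limit handed to pdftotext (-l is pdftotext's real last-page option; the argument list here is my paraphrase of what recognizePDF.js builds, not the literal code):

    // was "7"; 40 pages reaches past the front matter into the body text
    var args = ["-enc", "UTF-8", "-nopgbrk", "-layout",
                "-l", "40", pdfFile.path, textFile.path];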

continued in next post...
  • old JSTOR articles

    There are OCR challenges for JSTOR journal articles as well, like this one:

    http://www.jstor.org/stable/2024634

    Sadly, JSTOR used a bizarre OCR choice while scanning many old humanities journals, so their text has the strange property that it can be copied out of Acrobat just fine, exactly as if it had been a born-digital file, but the text rendered by OS X Preview and pdftotext lacks a lot of whitespace. Since the Google Scholar query uses quoted phrases, the run-together wording is a killer. As a JSTOR PDF has a cover page with properly rendered text for the title, author, journal, and JSTOR URL, this should be an easy problem to surmount. But the JSTOR translator doesn't seem to accept a URL directly, so I couldn't get a simple pattern recognizer to work: you can easily pull the jstor.org/stable/ URL from the pdftotext-generated text with a little regex, but I had nowhere to send it without the JSTOR translator being set up, like the Open WorldCat translator, to accept a URL (and I've read that JSTOR blocks the translator if you aren't on a university network with access). So instead I made the Google Scholar query generator in recognizePDF.js use, rather than a random line, only the first two lines from the file, which invariably contain either the publication name and then the article title, or the title and author. This worked very well, with an 80% success rate, and most of the failures were not complete failures but cases of Google Scholar putting a different item at the top of the search results. That is fine for my purposes, but others may have differing thoughts on acceptable ratios.
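
    To illustrate both halves of that - pulling the stable URL (which currently has nowhere to go) and building the query from the first two lines - here's a rough sketch, with the variable names my own:

        // the stable URL is easy to grab from the cover page text...
        var m = renderedText.match(/jstor\.org\/stable\/(\d+)/);
        var stableURL = m ? "http://www.jstor.org/stable/" + m[1] : null;

        // ...but lacking a translator that accepts it, quote the first two
        // non-empty lines (journal/title or title/author) for Scholar instead
        var lines = renderedText.split("\n").filter(function(l) {
            return l.trim().length > 0;
        });
        var query = '"' + lines[0].trim() + '" "' + lines[1].trim() + '"';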

    Thanks again to the developers, especially for making things under the hood relatively easy to understand and modify.
  • Thanks, this is some good stuff. Simon is the main dev dealing with recognizePDF - I'm not at Zotero, but I'm generally also quite interested in improving this.

    Looking over your changes, my sense is that the ones most promising for inclusion are the improved ISBN clean-up, preventing Zotero from giving up after a failed ISBN look-up (which I realize you didn't manage to get working), and the inclusion of Google Books in the routine.

    Some of the other stuff seems more idiosyncratic to your purpose - e.g. I don't think the core devs would consider the 20% false positives in your JSTOR example acceptable.

    Generally speaking, I (and possibly others) would certainly be interested in your code if you don't mind putting it on GitHub or somewhere similar.
  • With regard to the number of pages to use to get a high-quality search string for document content, it seems to me that an approach like the following would be useful:

    - determine the total number of pages in the document
    - divide by two, rounding down to an integer
    - use the page with the resulting number, plus the one before it and the one after it

    For example, for an 18-page paper, the search string would include text from pp. 8-10, and for a 267-page book, from pp. 132-134. This approach should work well with most types of documents, no matter how long the actual content is, how many boilerplate pages there are at the beginning, or how many index or appendix pages there are at the end.
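
    A minimal sketch of that calculation (the function name is just for illustration; the result would feed pdftotext's -f and -l options):

        function middlePageRange(totalPages) {
            var mid = Math.floor(totalPages / 2);
            return {
                first: Math.max(1, mid - 1),   // page before the midpoint
                last: Math.min(totalPages, mid + 1)  // page after it
            };
        }
        // middlePageRange(18)  -> pages 8-10
        // middlePageRange(267) -> pages 132-134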