Need to OCR Some PDFs before Import?

ln108 · March 7, 2010

Prior to using Zotero, I was in the habit of running newly acquired PDFs through the "Recognize Text Using OCR . . ." function in Adobe Acrobat, just to make sure they'd be searchable. I notice that PDFs from some sources (like EBSCO) seem to be already OCR-ed, but Acrobat needs to do a bit of OCR work on files from other sources, especially JSTOR.

My question is: Is this step necessary for the PDF indexing function in Zotero, or is it redundant?

Thanks!

ajlyon · March 7, 2010

This step is necessary for PDF indexing to work. Zotero can't yet do OCR on its own, so users who have PDFs without a text layer have to do OCR themselves, then click the "reindex" button for each PDF.
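For anyone who wants to check whether a given PDF already has a text layer before bothering with OCR, here is a minimal sketch. It assumes the `pdftotext` command-line tool from Poppler is installed; the function names and the 50-character threshold are made up for illustration and are not part of Zotero.

```python
import subprocess


def extract_text(pdf_path: str) -> str:
    """Return the text pdftotext finds in the PDF (assumes Poppler is installed)."""
    result = subprocess.run(
        ["pdftotext", pdf_path, "-"],  # "-" writes the extracted text to stdout
        capture_output=True,
        text=True,
        check=True,
    )
    return result.stdout


def has_text_layer(extracted: str, min_chars: int = 50) -> bool:
    """Heuristic: call the PDF searchable if it yields enough visible characters.

    min_chars is an arbitrary illustrative threshold, not a tested value.
    """
    visible = "".join(extracted.split())  # drop spaces, newlines, form feeds
    return len(visible) >= min_chars
```

Something like `has_text_layer(extract_text("article.pdf"))` returning `False` would suggest the file is an image-only scan and needs OCR before reindexing. A scan with a poor-quality text layer would still pass this check, so treat it as a rough filter only.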

adamsmith · March 8, 2010

Well, I would say yes and no. An increasing number of PDFs from databases are already OCR'd. I'd say the percentage is above 80% (and that includes JSTOR; I don't know what your Acrobat did there). For those, OCR is redundant, but not because of anything Zotero does.

brandtb · March 17, 2010

I won't install Acrobat due to the serious ongoing security problems it has had. It would be a great addition for Zotero to automatically OCR when there is not already embedded text.

ajlyon · March 17, 2010

OCR is a very complex proposition, and really outside the realm of what Zotero needs to concern itself with. There are plenty of OCR products on the market; take a look. I've been using gscan2pdf on Linux, and there are lots of commercial solutions as well.

brandtb · March 17, 2010

Thanks for the quick reply. Although I might easily agree that Zotero has more pressing needs, I suggest it is not out of the realm of what Zotero should be concerned with. What I love about Zotero is that it simplifies the workflow in my research. When Zotero automatically downloads a PDF for me, it saves about 10 clicks, time, and the opportunity for error. It makes the tedious invisible. Proxy detection and redirect is one of the most brilliant things I've seen in years. This would similarly make indexing a PDF an invisible step. So, please keep it in mind.

In the meantime, I’ll look at gscan2pdf.

Thanks.

scot · March 18, 2010

Older JSTOR articles already include a "text layer" (which is what you add when you OCR them), so they are searchable. Presumably Adobe re-does this, which is why it seems to do some work on JSTOR articles. But it should be redundant (and may even be inferior to the one JSTOR provides). [Edit: I now see that adam has already mentioned this.]

Note that any PDFs which were not produced by scanning in the first place (the vast majority of newer PDFs) never need to be OCR'd, since they are produced directly from the article text and are searchable out of the box.

So the only time you should need to OCR an article is when (1) it is an older scanned article (you can tell this by the quality: it will look more like a scan than a computer-produced document) AND (2) it is not from a database that includes text layers in its articles. As adamsmith says, many databases already do this. Of course, if you scan articles yourself, you'll need to OCR them.

And ajlyon is correct when s/he says that OCR is (still) a very complicated (read: expensive) software undertaking. Having said that (and having just now read about the now-free Tesseract engine), it might not be too much for someone to put together a plugin for Zotero that uses Tesseract on scanned PDFs which don't contain a text layer. I have a lot of self-scanned PDFs, so I'd be quite happy to have such a plugin.
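To give a rough idea of what such a plugin would have to drive under the hood, here is a sketch that only builds the two command lines involved. It is not an existing Zotero feature. It assumes Poppler's `pdftoppm` to turn PDF pages into images (Tesseract reads images, not PDFs) and a Tesseract build that supports the `pdf` output config; the helper names are hypothetical.

```python
from __future__ import annotations

from pathlib import Path


def rasterize_command(pdf: Path, prefix: Path, dpi: int = 300) -> list[str]:
    """Command that renders each PDF page to a PNG named after `prefix`."""
    return ["pdftoppm", "-r", str(dpi), "-png", str(pdf), str(prefix)]


def ocr_command(image: Path, out_base: Path, lang: str = "eng") -> list[str]:
    """Command that OCRs one page image into a searchable PDF at out_base.pdf."""
    return ["tesseract", str(image), str(out_base), "-l", lang, "pdf"]


if __name__ == "__main__":
    # Print the commands instead of running them, so nothing is modified.
    print(rasterize_command(Path("scan.pdf"), Path("page")))
    print(ocr_command(Path("page-1.png"), Path("page-1")))
```

Each command list could be passed to `subprocess.run`. This produces one searchable PDF per page, so merging the pages back into a single file and attaching it to the Zotero item would still be left to the plugin.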