Need to OCR Some PDFs before Import?
Prior to using Zotero, I was in the habit of running newly acquired PDFs through the "Recognize Text Using OCR . . ." function in Adobe Acrobat, just to make sure they'd be searchable. I notice that PDFs from some sources (like EBSCO) seem to be already OCR-ed, but Acrobat needs to do a bit of OCR work on files from other sources, especially JSTOR.
My question is: Is this step necessary for the PDF indexing function in Zotero, or is it redundant?
Thanks!
In the meantime, I’ll look at gscan2pdf.
Thanks.
Note that any PDFs which were not produced by scanning in the first place (the vast majority of newer PDFs) never need to be OCR'd, since they are produced directly from the article text and are searchable out of the box.
So the only time you should need to OCR an article is when (1) it is an older scanned article (you can tell by the quality: it will look more like a scan than a computer-produced document) AND (2) it is not from a database that includes 'text layers' in its articles. As adamsmith says, many databases already do this. Of course, if you scan them yourself, you'll need to OCR them.
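If you'd rather not judge by eye, a rough way to test condition (2) is to see whether any text can be pulled out of the file at all. Below is a minimal sketch in Python, not anything Zotero itself does. The function names and the 20-character threshold are my own assumptions, and the second helper assumes the third-party pypdf package is installed.

```python
from typing import Iterable, List


def needs_ocr(page_texts: Iterable[str], min_chars_per_page: int = 20) -> bool:
    """Guess whether a PDF lacks a text layer.

    page_texts holds the text extracted from each page. The threshold is an
    arbitrary assumption: a page with fewer characters than this is treated
    as having no usable text layer.
    """
    pages = list(page_texts)
    if not pages:
        return False  # nothing to OCR in an empty document
    empty_pages = sum(
        1 for text in pages if len((text or "").strip()) < min_chars_per_page
    )
    # Flag the file if more than half of its pages have no real text.
    return empty_pages > len(pages) / 2


def extract_page_texts(pdf_path: str) -> List[str]:
    """Extract per-page text. Assumes the third-party pypdf package."""
    from pypdf import PdfReader  # imported here so needs_ocr works without it

    reader = PdfReader(pdf_path)
    return [page.extract_text() or "" for page in reader.pages]
```

Something like `needs_ocr(extract_page_texts("article.pdf"))` would then give a yes/no answer for one file. It is only a heuristic: a scanned PDF with a poor-quality text layer would still pass as searchable.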
And ajlyon is correct when s/he says that OCR is (still) a very complicated (read: expensive) software undertaking. Having said that (and having just now read about the now-free Tesseract engine), it might not be too much for someone to put together a plugin for Zotero that uses Tesseract on scanned PDFs which don't contain a text layer. I have a lot of self-scanned PDFs, so I'd be quite happy to have such a plugin.
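Until someone writes such a plugin, the same thing can be approximated outside Zotero. Here is a rough sketch, assuming the separate OCRmyPDF command-line tool (which drives Tesseract) is installed. Its `--skip-text` option leaves pages that already have text alone, and `-l` sets the OCR language. The function names are just for illustration.

```python
import shutil
import subprocess
from typing import List


def build_ocr_command(src: str, dst: str, language: str = "eng") -> List[str]:
    """Build an OCRmyPDF command that only OCRs pages lacking text."""
    return ["ocrmypdf", "--skip-text", "-l", language, src, dst]


def run_ocr(src: str, dst: str, language: str = "eng") -> bool:
    """Run OCRmyPDF if it is on the PATH; return True on success."""
    if shutil.which("ocrmypdf") is None:
        return False  # tool not installed, nothing was done
    result = subprocess.run(build_ocr_command(src, dst, language), check=False)
    return result.returncode == 0
```

Calling `run_ocr("scan.pdf", "scan_ocr.pdf")` before attaching the file would give Zotero's indexer a text layer to work with.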