PDF OCR Search

mikeiz · April 4, 2008

I often tear out pages that interest me from magazines I'm reading. Can I scan these and then search them?

Tjowens · April 4, 2008

Yes, you can index PDFs for search with Zotero (see http://www.zotero.org/documentation/pdf_fulltext_indexing?s=pdf).

Note that Zotero does not OCR these texts itself. For Zotero to index the PDF, the OCR text must already have been embedded by the program you used to create the PDF.

mark · April 11, 2008

Adobe Acrobat (the full or professional versions, not Adobe Reader) offers a quick-n-dirty OCR engine which works fine most of the time. I often use it to OCR scanned papers.

sybille · April 11, 2008

If you (or someone who finds this thread with search) happen to use Linux or another unix-y operating system, it's possible to make PDF files that can be indexed with gscan2pdf:
http://gscan2pdf.sourceforge.net/

For best results, use tesseract-ocr (rather than gocr), scan at 300 or 600 dpi, and clean up the scans with unpaper (that's integrated into gscan2pdf if unpaper is installed). In my experience, gscan2pdf is a nice, easy-to-use interface.

borchers@nmt.edu · April 17, 2011

I've tried indexing some scanned PDFs of papers after OCR'ing them with gscan2pdf and have been disappointed with the results. Although I can search for text within such a document using (e.g.) Acrobat Reader and typically find it, Zotero doesn't seem to actually index these PDFs, and "Retrieve Metadata for PDF" results in a "Could not read text from PDF" error message.

Are there any packages that have come out in the last couple of years that do a better job than gscan2pdf+tesseract?

ajlyon · April 18, 2011

Can you try running pdftotext on the OCR'ed PDFs manually and see what you get? It's possible that gscan2pdf is embedding text in a way that Zotero just can't understand.
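For anyone finding this thread later, a minimal sketch of that manual check. The function name `text_layer_words` and the file name in the usage line are made up for illustration; the only thing assumed about pdftotext is that `-` as the output file sends the text to standard output.

```shell
# Count the words pdftotext can extract from a PDF's embedded text layer.
# "-" as the output file makes pdftotext write to standard output.
text_layer_words() {
    echo "$(pdftotext -enc UTF-8 "$1" - | wc -w | tr -d '[:space:]')"
}

# Usage (hypothetical file name):
#   text_layer_words scanned-paper.pdf
```

A count of 0 would suggest the PDF has no usable text layer at all; a plausible count means the text is there and the problem lies elsewhere.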

borchers@nmt.edu · April 18, 2011

Running pdftotext on the OCR'd version of the PDF file produces a .txt file containing the text of the paper (not quite perfect, but adequate for indexing). I assume that there's something wrong with the way gscan2pdf has embedded the text, but it's not clear exactly what.

ajlyon · April 20, 2011

Can you take a look at the Zotero debug log for an attempt to index the PDF? See http://www.zotero.org/support/debug_output

Just enable debug output, select the item and click the "Index / Reindex" button in the right panel of the Zotero pane/tab, and disable debug output. Search for "Running pdftotext" and post the relevant portions here.
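If the debug output is long, something like the following can pull out the relevant lines. This is only a sketch: it assumes the debug output has been copied into a plain-text file, and both the function name `relevant_debug_lines` and the file name are hypothetical. Besides "Running pdftotext" it also matches the pdfinfo call and any "was not indexed" message, which may or may not appear in a given log.

```shell
# Print only the debug lines that concern full-text indexing.
# Assumes the debug output was saved to a plain-text file.
relevant_debug_lines() {
    grep -E 'Running pdf(info|totext)|was not indexed' "$1"
}

# Usage (hypothetical file name):
#   relevant_debug_lines zotero-debug.txt
```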

borchers@nmt.edu · April 20, 2011

The debug ID is: D1659682574

The seemingly relevant part of the output is:

(3)(+0000001): Running pdfinfo "/home/brian/.mozilla/firefox/e6xak270.default/zotero/storage/NKESJDPQ/bai2.pdf" "/home/brian/.mozilla/firefox/e6xak270.default/zotero/storage/NKESJDPQ/.zotero-ft-info"

(3)(+0000039): Running pdftotext -enc UTF-8 -nopgbrk -l 500 "/home/brian/.mozilla/firefox/e6xak270.default/zotero/storage/NKESJDPQ/bai2.pdf" "/home/brian/.mozilla/firefox/e6xak270.default/zotero/storage/NKESJDPQ/.zotero-ft-cache"

(2)(+0000033): bai2.pdf was not indexed

ajlyon · April 20, 2011

Can you check if /home/brian/.mozilla/firefox/e6xak270.default/zotero/storage/NKESJDPQ/.zotero-ft-cache is in fact created? Does it contain a word list?

If not, can you run the pdftotext command manually? What happens?
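A rough sketch of both checks, for reference. The flags and the `.zotero-ft-cache` file name are copied from the debug output quoted above; the function names `check_ft_cache` and `run_pdftotext_like_zotero` are made up here. This is purely a diagnostic: writing the cache file by hand is not assumed to make Zotero treat the item as indexed.

```shell
# Report whether the full-text cache exists in an attachment's storage folder.
check_ft_cache() {
    cache="$1/.zotero-ft-cache"
    if [ ! -e "$cache" ]; then
        echo "missing"
    elif [ ! -s "$cache" ]; then
        echo "empty"
    else
        # wc -w counts whitespace-separated words in the cache file
        echo "ok: $(wc -w < "$cache" | tr -d '[:space:]') words"
    fi
}

# Run pdftotext with the same flags as in the debug output,
# writing to .zotero-ft-cache next to the PDF.
run_pdftotext_like_zotero() {
    pdf="$1"
    dir=$(dirname "$pdf")
    pdftotext -enc UTF-8 -nopgbrk -l 500 "$pdf" "$dir/.zotero-ft-cache"
    check_ft_cache "$dir"
}

# Usage (path elided; substitute the storage folder from your own debug output):
#   run_pdftotext_like_zotero /path/to/zotero/storage/NKESJDPQ/bai2.pdf
```

Note that the explicit output path matters: without a second file argument, pdftotext writes a .txt file next to the PDF rather than the cache file.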

ajlyon · April 20, 2011

And if someone on the core dev team can take a look at the debug report for other clues, that'd be great.

borchers@nmt.edu · April 20, 2011

The .zotero-ft-cache file was not created.

When I run the pdftotext command from the command line, it produces a bai2.txt with lots of output that looks like the text of the paper.

I also ran pdfinfo on the bai2.pdf file, and got:

Title: NONE
Subject: NONE
Keywords: NONE
Author: NONE
Creator: gscan2pdf v0.9.29
Producer: PDF::API2
CreationDate: Wed Apr 20 00:00:00 2011
ModDate: Wed Apr 20 00:00:00 2011
Tagged: no
Pages: 46
Encrypted: no
Page size: 2280 x 3120 pts
File size: 1846949 bytes
Optimized: no
PDF version: 1.4

ajlyon · April 20, 2011

Can you run pdftotext with the arguments / options exactly as in the debug output?

And other PDFs index fine, right?

borchers@nmt.edu · April 20, 2011

I ran pdftotext with

pdftotext -enc UTF-8 -nopgbrk -l 500 bai2.pdf

which I believe is exactly how it was run according to the debug output...

Yes, conventional PDFs with embedded text work just fine.

borchers@nmt.edu · April 20, 2011

I went back and reran it with

pdftotext -enc UTF-8 -nopgbrk -l 500 bai2.pdf .zotero-ft-cache

and a .zotero-ft-cache file was created. When I then attempted to retrieve metadata, I still got "could not read text from item".

ajlyon · April 21, 2011

I'm out of ideas. It looks like everything should be working.

Can someone from the core dev team take a look at the debug output and offer some advice?

Simon · April 21, 2011

There's nothing useful in the debug output beyond what's above. However, if you can send a failing PDF (preferably a small one) to support@zot...org, we'll take a look.

borchers@nmt.edu · April 22, 2011

I've sent the .pdf file to support@zot...org.