It is important to note that Zotero does not OCR these texts. For Zotero to index the PDF, the OCR text must already have been embedded by the program you used to create the PDF.
Adobe Acrobat (the full or professional versions, not Adobe Reader) offers a quick-n-dirty OCR engine which works fine most of the time. I often use it to OCR scanned papers.
If you (or someone who finds this thread with search) happen to use Linux or another unix-y operating system, it's possible to make PDF files that can be indexed with gscan2pdf: http://gscan2pdf.sourceforge.net/
For best results, use tesseract-ocr (rather than gocr), scan at 300 or 600 dpi, and clean up the scans with unpaper (that's integrated into gscan2pdf if unpaper is installed). In my experience, gscan2pdf is a nice, easy-to-use interface.
I've tried indexing some scanned PDFs of papers after OCR'ing them with gscan2pdf and have been disappointed with the results. Although I can search for text within such a document using (e.g.) Acrobat Reader and typically find it, Zotero doesn't seem to actually index these PDFs, and "Retrieve Metadata for PDF" results in a "Could not read text from PDF" error message.
Are there any packages that have come out in the last couple of years that do a better job than gscan2pdf+tesseract?
Can you try running pdftotext on the OCR'ed PDFs manually and see what you get? It's possible that gscan2pdf is embedding text in a way that Zotero just can't understand.
Running pdftotext on the OCR'd version of the PDF file produces a .txt file containing the text of the paper (not quite perfect, but adequate for indexing). I assume that there's something wrong with the way in which gscan2pdf has embedded the text, but it's not clear exactly what's wrong.
Just enable debug output, select the item and click the "Index / Reindex" button in the right panel of the Zotero pane/tab, and disable debug output. Search for "Running pdftotext" and post the relevant portions here.
Can you check if /home/brian/.mozilla/firefox/e6xak270.default/zotero/storage/NKESJDPQ/.zotero-ft-cache is in fact created? Does it contain a word list?
If not, can you run the pdftotext command manually? What happens?
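For anyone following along, here is a rough Python sketch of that check. It only looks at whether the cache file exists and counts whitespace-separated words in it. The function name `summarize_cache` is made up for illustration, and treating the cache as plain UTF-8 text is an assumption on my part, not something confirmed in this thread.

```python
from pathlib import Path


def summarize_cache(cache_path):
    """Report whether a full-text cache file exists and how much text it holds.

    Assumes (hypothetically) that the cache is plain UTF-8 text.
    """
    path = Path(cache_path)
    if not path.is_file():
        return {"exists": False, "bytes": 0, "words": 0}
    text = path.read_text(encoding="utf-8", errors="replace")
    return {
        "exists": True,
        "bytes": path.stat().st_size,
        "words": len(text.split()),  # whitespace-separated tokens
    }


if __name__ == "__main__":
    import sys

    # Pass the path to the .zotero-ft-cache file in the item's storage folder.
    for arg in sys.argv[1:]:
        print(arg, summarize_cache(arg))
```

A missing file, or a file with zero words, would both point to the text extraction step rather than to the search itself.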
There's nothing useful in the debug output beyond what's above. However, if you can send a failing PDF (preferably a small one) to support@zot...org, we'll take a look.
The seemingly relevant part of the debug output is:
(3)(+0000001): Running pdfinfo "/home/brian/.mozilla/firefox/e6xak270.default/zotero/storage/NKESJDPQ/bai2.pdf" "/home/brian/.mozilla/firefox/e6xak270.default/zotero/storage/NKESJDPQ/.zotero-ft-info"
(3)(+0000039): Running pdftotext -enc UTF-8 -nopgbrk -l 500 "/home/brian/.mozilla/firefox/e6xak270.default/zotero/storage/NKESJDPQ/bai2.pdf" "/home/brian/.mozilla/firefox/e6xak270.default/zotero/storage/NKESJDPQ/.zotero-ft-cache"
(2)(+0000033): bai2.pdf was not indexed
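To rule out differences between a manual run and what the log shows, the logged pdftotext invocation can be rebuilt argument for argument. Below is a minimal Python sketch of that idea. The flags are copied from the debug output above; the function names are hypothetical, and using whichever `pdftotext` is on the PATH is an assumption, since Zotero may call a different copy of the binary.

```python
import shutil
import subprocess
from pathlib import Path


def pdftotext_command(pdf_path, out_path):
    """Build the argument list shown in the debug output (flags taken from the log)."""
    return ["pdftotext", "-enc", "UTF-8", "-nopgbrk", "-l", "500",
            str(pdf_path), str(out_path)]


def run_like_log(pdf_path, out_path):
    """Run the command; return (exit_code, output_size_in_bytes).

    Returns None when no pdftotext binary is found on the PATH.
    """
    if shutil.which("pdftotext") is None:
        return None
    result = subprocess.run(pdftotext_command(pdf_path, out_path),
                            capture_output=True, text=True)
    out = Path(out_path)
    size = out.stat().st_size if out.is_file() else 0
    return result.returncode, size


if __name__ == "__main__":
    print(run_like_log("bai2.pdf", ".zotero-ft-cache"))
```

Note that the logged command passes an explicit output file as the last argument, so a manual test should do the same to be comparable.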
When I run the pdftotext command from the command line, it produces a bai2.txt with lots of output that looks like the text of the paper.
I also ran pdfinfo on the bai2.pdf file, and got:
Title: NONE
Subject: NONE
Keywords: NONE
Author: NONE
Creator: gscan2pdf v0.9.29
Producer: PDF::API2
CreationDate: Wed Apr 20 00:00:00 2011
ModDate: Wed Apr 20 00:00:00 2011
Tagged: no
Pages: 46
Encrypted: no
Page size: 2280 x 3120 pts
File size: 1846949 bytes
Optimized: no
PDF version: 1.4
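If it's useful for comparing a failing PDF against one that indexes fine, the pdfinfo output can be turned into a dictionary. This is only a sketch: `parse_pdfinfo` and `page_size_inches` are names I made up, and the sample text is a subset of the output pasted above.

```python
def parse_pdfinfo(text):
    """Turn 'Key: value' lines from pdfinfo into a dict of strings."""
    info = {}
    for line in text.splitlines():
        # Split on the first colon only, so values like 'PDF::API2' stay intact.
        key, sep, value = line.partition(":")
        if sep:
            info[key.strip()] = value.strip()
    return info


def page_size_inches(info):
    """Convert a 'W x H pts' page size to inches (72 points per inch)."""
    width, _, height = info["Page size"].split()[:3]
    return float(width) / 72, float(height) / 72


# Subset of the pdfinfo output from this thread.
SAMPLE = """\
Creator: gscan2pdf v0.9.29
Producer: PDF::API2
Pages: 46
Page size: 2280 x 3120 pts
PDF version: 1.4
"""

if __name__ == "__main__":
    info = parse_pdfinfo(SAMPLE)
    print(info["Creator"], "/", info["Producer"])
    print(page_size_inches(info))
```

At 72 points per inch, 2280 x 3120 pts works out to roughly 31.7 x 43.3 inches, which is much larger than a journal page. That may mean the scan's pixel dimensions were written as points, but nothing in this thread establishes whether it has anything to do with the indexing failure.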
And other PDFs index fine, right?
The command I ran was:
pdftotext -enc UTF-8 -nopgbrk -l 500 bai2.pdf
which I believe is exactly how it was run according to the debug output...
Yes, conventional PDFs with embedded text work just fine.
I then ran:
pdftotext -enc UTF-8 -nopgbrk -l 500 bai2.pdf .zotero-ft-cache
and a .zotero-ft-cache file was created. When I then attempted to retrieve metadata, I still got "could not read text from item".
Can someone from the core dev team take a look at the debug output and offer some advice?