PDF OCR Search
Often I'll tear out pages from magazines I'm reading that interest me. Can I scan these and make them searchable?
It is important to note that Zotero does not OCR these texts. For Zotero to index the PDF, the OCR text must already be embedded by the program you used to create the PDF.
http://gscan2pdf.sourceforge.net/
For best results, use tesseract-ocr (rather than gocr), scan at 300 or 600 dpi, and clean up the scans with unpaper (that's integrated into gscan2pdf if unpaper is installed). In my experience, gscan2pdf is a nice, easy-to-use interface.
Are there any packages that have come out in the last couple of years that do a better job than gscan2pdf+tesseract?
Can you run pdftotext on the OCR'ed PDFs manually and see what you get? It's possible that gscan2pdf is embedding text in a way that Zotero just can't understand.
Just enable debug output, select the item, click the "Index / Reindex" button in the right panel of the Zotero pane/tab, and then disable debug output. Search for "Running pdftotext" and post the relevant portions here.
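The manual pdftotext check suggested here could be sketched roughly as follows. This is only an illustration: the function names `pdf_text_chars` and `has_text_layer` are made up for this example, and the options are simply the ones that show up in the debug output quoted in this thread, not a statement about how Zotero decides whether a file is indexable.

```shell
# Hypothetical helpers (names are not part of Zotero or poppler).

# Print the number of non-whitespace bytes pdftotext extracts from a PDF.
# "-" as the output file tells pdftotext to write the text to stdout.
pdf_text_chars() {
    pdftotext -enc UTF-8 -nopgbrk -l 500 "$1" - 2>/dev/null \
        | tr -d '[:space:]' \
        | wc -c \
        | tr -d ' '
}

# Succeed if at least one non-whitespace byte of text was extracted.
has_text_layer() {
    [ "$(pdf_text_chars "$1")" -gt 0 ]
}
```

For example, `has_text_layer scan.pdf && echo "text found"` would report whether pdftotext can pull anything out of a scanned file (`scan.pdf` is a placeholder name). A PDF that passes this check can still fail to index for other reasons.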
the seemingly relevant part of the output is:
(3)(+0000001): Running pdfinfo "/home/brian/.mozilla/firefox/e6xak270.default/zotero/storage/NKESJDPQ/bai2.pdf" "/home/brian/.mozilla/firefox/e6xak270.default/zotero/storage/NKESJDPQ/.zotero-ft-info"
(3)(+0000039): Running pdftotext -enc UTF-8 -nopgbrk -l 500 "/home/brian/.mozilla/firefox/e6xak270.default/zotero/storage/NKESJDPQ/bai2.pdf" "/home/brian/.mozilla/firefox/e6xak270.default/zotero/storage/NKESJDPQ/.zotero-ft-cache"
(2)(+0000033): bai2.pdf was not indexed
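If the debug output has been saved to a text file, the lines of interest can be pulled out with grep instead of searching by hand. This is a small sketch; the function name `indexing_lines` and the idea of saving the output to a file are assumptions for illustration, and the patterns are taken only from the lines quoted above.

```shell
# Hypothetical helper: print the indexing-related lines from a saved
# debug log. $1 is the path to the saved log (chosen by the caller).
indexing_lines() {
    grep -E 'Running (pdfinfo|pdftotext)|was not indexed' "$1"
}
```

Calling `indexing_lines debug.txt` (with `debug.txt` as a placeholder file name) would print just the pdfinfo and pdftotext invocations and any "was not indexed" messages.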
If not, can you run the pdftotext command manually? What happens?
When I run the pdftotext command from the command line, it produces a bai2.txt with lots of output that looks like the text of the paper.
I also ran pdfinfo on the bai2.pdf file, and got:
Title: NONE
Subject: NONE
Keywords: NONE
Author: NONE
Creator: gscan2pdf v0.9.29
Producer: PDF::API2
CreationDate: Wed Apr 20 00:00:00 2011
ModDate: Wed Apr 20 00:00:00 2011
Tagged: no
Pages: 46
Encrypted: no
Page size: 2280 x 3120 pts
File size: 1846949 bytes
Optimized: no
PDF version: 1.4
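To compare a PDF that indexes with one that does not, it can help to pull single fields out of the pdfinfo output. The sketch below is one way to do that; `pdfinfo_field` is a made-up name, and it assumes the "Key: value" layout shown in the output above.

```shell
# Hypothetical helper: read pdfinfo output on stdin and print the value
# of one field. $1 is the field name exactly as pdfinfo prints it,
# for example "Creator" or "Page size".
pdfinfo_field() {
    awk -v key="$1" -F': *' '
        $1 == key {
            sub(/^[^:]*: */, "")   # drop the "Key: " prefix, keep the rest
            print
            exit
        }'
}
```

With the file from this thread, `pdfinfo bai2.pdf | pdfinfo_field Creator` should print the gscan2pdf version string shown above, and the same call on a PDF that indexes correctly gives something to compare against.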
And other PDFs index fine, right?
pdftotext -enc UTF-8 -nopgbrk -l 500 bai2.pdf
which I believe is exactly how it was run according to the debug output...
Yes, conventional PDFs with embedded text work just fine.
pdftotext -enc UTF-8 -nopgbrk -l 500 bai2.pdf .zotero-ft-cache
and a .zotero-ft-cache file was created. When I then attempted to retrieve metadata, I still got "could not read text from item".
Can someone from the core dev team take a look at the debug output and offer some advice?