PDF OCR Search

I often tear out pages that interest me from magazines I'm reading. Can I scan these and then search them?
  • Yes, you can index PDFs for search with Zotero (see http://www.zotero.org/documentation/pdf_fulltext_indexing?s=pdf).

    It is important to note that Zotero does not OCR these texts. For Zotero to index the PDF, the OCR text must already be embedded by the program you used to create the PDF.
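
One way to check whether a scanned PDF already carries an embedded text layer before handing it to Zotero is to run pdftotext on it and see whether any words come out. The following is a minimal sketch, assuming pdftotext is installed and on the PATH. The function names and the 20-word threshold are illustrative choices, not anything Zotero itself uses.

```python
import re
import subprocess


def extract_text(pdf_path, max_pages=500):
    """Return the text pdftotext finds in the PDF ('-' sends output to stdout)."""
    result = subprocess.run(
        ["pdftotext", "-enc", "UTF-8", "-nopgbrk", "-l", str(max_pages), pdf_path, "-"],
        capture_output=True,
        check=True,
    )
    return result.stdout.decode("utf-8", errors="replace")


def looks_searchable(text, min_words=20):
    """Heuristic (an assumption): enough word-like tokens means a text layer exists."""
    words = re.findall(r"\w{2,}", text)
    return len(words) >= min_words
```

Calling `looks_searchable(extract_text("scan.pdf"))` on a hypothetical file `scan.pdf` would return False for an image-only scan. A True result only shows that pdftotext can read the text, not that Zotero will index it.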
  • Adobe Acrobat (the full or professional versions, not Adobe Reader) offers a quick-n-dirty OCR engine which works fine most of the time. I often use it to OCR scanned papers.
  • If you (or someone who finds this thread with search) happen to use Linux or another unix-y operating system, it's possible to make PDF files that can be indexed with gscan2pdf:
    http://gscan2pdf.sourceforge.net/

    For best results, use tesseract-ocr (rather than gocr), scan at 300 or 600 dpi, and clean up the scans with unpaper (that's integrated into gscan2pdf if unpaper is installed). In my experience, gscan2pdf is a nice, easy-to-use interface.
  • I've tried indexing some scanned .pdf's of papers after OCR'ing with gscan2pdf and been disappointed with the results. Although I can search for text within such a document using (e.g.) Acrobat reader and typically find it, Zotero doesn't seem to actually index these .pdf's, and "Retrieve Metadata for PDF" results in a "Could not read text from PDF" error message.

    Are there any packages that have come out in the last couple of years that do a better job than gscan2pdf+tesseract?
  • Can you try running pdftotext on the OCR'ed PDFs manually and see what you get? It's possible that gscan2pdf is embedding text in a way that Zotero just can't understand.
  • pdftotext on the OCR'd version of the PDF file produces a .txt file containing the text of the paper (not quite perfect, but adequate for indexing). I assume that there's something wrong with the way gscan2pdf has embedded the text, but it's not clear exactly what.
  • Can you take a look at the Zotero debug log for an attempt to index the PDF? See http://www.zotero.org/support/debug_output

    Just enable debug output, select the item and click the "Index / Reindex" button in the right panel of the Zotero pane/tab, and disable debug output. Search for "Running pdftotext" and post the relevant portions here.
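
If the debug output is long, the search step can be scripted once the log has been saved as text. This is a rough sketch: only the "Running pdftotext" marker comes from the instructions above, and the other two markers are assumptions about what neighbouring log lines may look like.

```python
# "Running pdftotext" is the phrase to search for; the other markers are guesses.
MARKERS = ("Running pdfinfo", "Running pdftotext", "was not indexed")


def relevant_lines(debug_output):
    """Return the debug-log lines that mention the full-text indexing steps."""
    return [
        line.strip()
        for line in debug_output.splitlines()
        if any(marker in line for marker in MARKERS)
    ]
```

Paste the copied debug output into a string and pass it to `relevant_lines` to get just the lines worth posting.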
  • The debug ID is: D1659682574

    The seemingly relevant part of the output is:

    (3)(+0000001): Running pdfinfo "/home/brian/.mozilla/firefox/e6xak270.default/zotero/storage/NKESJDPQ/bai2.pdf" "/home/brian/.mozilla/firefox/e6xak270.default/zotero/storage/NKESJDPQ/.zotero-ft-info"

    (3)(+0000039): Running pdftotext -enc UTF-8 -nopgbrk -l 500 "/home/brian/.mozilla/firefox/e6xak270.default/zotero/storage/NKESJDPQ/bai2.pdf" "/home/brian/.mozilla/firefox/e6xak270.default/zotero/storage/NKESJDPQ/.zotero-ft-cache"

    (2)(+0000033): bai2.pdf was not indexed
  • Can you check if /home/brian/.mozilla/firefox/e6xak270.default/zotero/storage/NKESJDPQ/.zotero-ft-cache is in fact created? Does it contain a word list?

    If not, can you run the pdftotext command manually? What happens?
  • And if someone on the core dev team can take a look at the debug report for other clues, that'd be great.
  • The .zotero-ft-cache file was not created.

    When I run the pdftotext command from the command line, it produces a bai2.txt with lots of output that looks like the text of the paper.

    I also ran pdfinfo on the bai2.pdf file, and got:

    Title: NONE
    Subject: NONE
    Keywords: NONE
    Author: NONE
    Creator: gscan2pdf v0.9.29
    Producer: PDF::API2
    CreationDate: Wed Apr 20 00:00:00 2011
    ModDate: Wed Apr 20 00:00:00 2011
    Tagged: no
    Pages: 46
    Encrypted: no
    Page size: 2280 x 3120 pts
    File size: 1846949 bytes
    Optimized: no
    PDF version: 1.4
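
To compare pdfinfo output between a failing PDF and one that indexes fine, it can help to parse the "Key: value" lines into a dictionary. A small sketch follows; the helper names are hypothetical. It also converts the page size to inches at 72 points per inch, which for the output above gives roughly 31.7 x 43.3 inches. Whether that unusually large page size has anything to do with the indexing failure is not established in this thread.

```python
import re


def parse_pdfinfo(output):
    """Turn 'Key: value' lines from pdfinfo into a dict of strings."""
    info = {}
    for line in output.splitlines():
        # Split on the first colon only, so timestamps in values stay intact.
        key, sep, value = line.partition(":")
        if sep:
            info[key.strip()] = value.strip()
    return info


def page_size_inches(info):
    """Convert 'Page size: W x H pts' to inches (1 inch = 72 PDF points)."""
    match = re.match(r"([\d.]+) x ([\d.]+) pts", info.get("Page size", ""))
    if match is None:
        return None
    width_pts, height_pts = float(match.group(1)), float(match.group(2))
    return width_pts / 72, height_pts / 72
```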
  • Can you run pdftotext with the arguments / options exactly as in the debug output?

    And other PDFs index fine, right?
  • I ran the pdftotext with

    pdftotext -enc UTF-8 -nopgbrk -l 500 bai2.pdf

    which I believe is exactly how it was run according to the debug output...

    Yes, conventional PDF's with embedded text work just fine.
  • I went back and reran it with

    pdftotext -enc UTF-8 -nopgbrk -l 500 bai2.pdf .zotero-ft-cache

    and a .zotero-ft-cache file was created. When I then attempted to retrieve metadata, I still got "could not read text from item"
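
The difference between the two manual runs comes down to the last argument: without an output file, pdftotext writes next to the PDF with a .txt extension (hence bai2.txt), and with one it writes to the given path. A minimal sketch of building both variants is below. The function names are illustrative, and the flags are only those shown in the debug output.

```python
from pathlib import Path


def pdftotext_command(pdf_path, output_path=None, max_pages=500):
    """Build the pdftotext argument list with the flags from the debug output."""
    cmd = ["pdftotext", "-enc", "UTF-8", "-nopgbrk", "-l", str(max_pages), str(pdf_path)]
    if output_path is not None:
        # An explicit output file, e.g. the .zotero-ft-cache path Zotero passes.
        cmd.append(str(output_path))
    return cmd


def default_output(pdf_path):
    """Where pdftotext writes when no output file is given: same name, .txt suffix."""
    return Path(pdf_path).with_suffix(".txt")
```

Passing the resulting list to `subprocess.run` from inside the item's storage directory should reproduce either manual run.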
  • I'm out of ideas. It looks like everything should be working.

    Can someone from the core dev team take a look at the debug output and offer some advice?
  • There's nothing useful in the debug output beyond what's above. However, if you can send a failing PDF (preferably a small one) to support@zot...org, we'll take a look.
  • edited April 22, 2011
    I've sent the .pdf file to support@zot...org.