Can't index a pdf

Jon Rubin · August 22, 2007

Hi there, I tried indexing my PDFs in one go and got a "script is not responding" error, so I went through and indexed them manually. However, a few files still will not index (I assume they were the cause of the initial script problem). Has anybody else had this problem? And is there a specific property that makes a PDF indexable or not that I could look for in the files that won't index?
Thanks,
Jon.

dstillman · August 22, 2007

The "Script is not responding" message isn't really an error: if you clicked Continue enough times, indexing would eventually complete. You could temporarily set dom.max_chrome_script_run_time to 0 in about:config to get around it. We'll fix the indexing code to turn that check off during the indexing process.

But yes, a PDF could have restrictions that prevent text from being copied out of it, and pdftotext, the tool we use to extract text, complies with those restrictions.
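
One way to look for such a restriction is to read the report printed by pdfinfo, the companion tool to pdftotext. The sketch below is only an illustration: it assumes the report contains an "Encrypted:" line of the form "Encrypted: yes (print:yes copy:no ...)", which is what recent xpdf and Poppler builds print, and the function names are made up for this example.

```python
import re
import subprocess


def run_pdfinfo(path):
    """Return the text report printed by the pdfinfo command-line tool."""
    result = subprocess.run(
        ["pdfinfo", path], capture_output=True, text=True, check=True
    )
    return result.stdout


def copy_allowed(pdfinfo_report):
    """Guess from a pdfinfo report whether text copying is permitted.

    Assumes an "Encrypted:" line such as
    "Encrypted: yes (print:yes copy:no change:no addNotes:no)".
    Returns True, False, or None when the report gives no answer.
    """
    for line in pdfinfo_report.splitlines():
        if not line.startswith("Encrypted:"):
            continue
        value = line.split(":", 1)[1].strip()
        if value.startswith("no"):
            return True  # not encrypted, so no copy restriction applies
        match = re.search(r"copy:(yes|no)", value)
        if match:
            return match.group(1) == "yes"
        return None  # encrypted, but no copy permission is listed
    return None  # the report has no "Encrypted:" line
```

With pdfinfo installed, copy_allowed(run_pdfinfo("paper.pdf")) returning False would suggest the file is one that pdftotext will refuse to extract ("paper.pdf" is a placeholder name).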

sybille · August 22, 2007

I believe that, along with encrypted PDFs, another sort of PDF file cannot be indexed: one that contains only images of text, such as scanned pages. At least, I know that Beagle (an indexing and search tool that runs on Linux, etc., and that also uses pdftotext and pdfinfo to extract text and metadata, just like Zotero) does not index the contents of these PDF files.
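
A rough way to spot this kind of file is to run pdftotext on it and see whether any text comes out at all. The following is a heuristic sketch, not what Zotero or Beagle actually do: the function names and the 20-character threshold are arbitrary choices for illustration, and it assumes pdftotext is on the PATH.

```python
import subprocess


def extract_text(path):
    """Run pdftotext and return the extracted text ("-" sends it to stdout)."""
    result = subprocess.run(
        ["pdftotext", path, "-"], capture_output=True, text=True, check=True
    )
    return result.stdout


def looks_image_only(text, min_chars=20):
    """Heuristic: treat a PDF as scanned if almost no characters came out.

    min_chars is an arbitrary threshold chosen for illustration.
    """
    # Page breaks (form feeds) and newlines count as whitespace here.
    visible = sum(1 for ch in text if not ch.isspace())
    return visible < min_chars
```

For example, looks_image_only(extract_text("scan.pdf")) would be expected to return True for a file made purely of scanned page images ("scan.pdf" is a placeholder name).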

If you happen to have this kind of PDF, you can use Optical Character Recognition (OCR) software to extract the text from the image and then attach the resulting text to the source PDF as an annotation. I only know how to do this on Linux, where I use gscan2pdf with tesseract-ocr, either when scanning new documents or when importing existing PDF files. It works pretty well. I'm sure that similar software exists for Windows and Mac users, but I don't know anything about it myself.
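
For anyone who prefers the command line, the same idea can be sketched without gscan2pdf by rendering each page to an image with pdftoppm and passing the images to tesseract. This is a different route from the gscan2pdf workflow described above, offered only as a sketch: it assumes both tools are installed, the function names are invented for the example, and it returns plain text rather than attaching anything to the PDF.

```python
import subprocess
import tempfile
from pathlib import Path


def render_command(pdf_path, prefix, dpi=300):
    """pdftoppm command that writes one PNG per page as <prefix>-<page>.png."""
    return ["pdftoppm", "-r", str(dpi), "-png", str(pdf_path), str(prefix)]


def tesseract_command(image_path):
    """tesseract command that prints recognized text to standard output."""
    return ["tesseract", str(image_path), "stdout"]


def ocr_pdf(pdf_path):
    """Return OCR text for every page of a scanned PDF, in page order."""
    with tempfile.TemporaryDirectory() as tmp:
        prefix = Path(tmp) / "page"
        subprocess.run(render_command(pdf_path, prefix), check=True)
        pages = []
        # pdftoppm zero-pads page numbers, so a plain sort keeps page order.
        for image in sorted(Path(tmp).glob("page-*.png")):
            result = subprocess.run(
                tesseract_command(image),
                capture_output=True,
                text=True,
                check=True,
            )
            pages.append(result.stdout)
    return "\n".join(pages)
```

The resulting text could then be saved as a note or attachment next to the original file so that it can be searched.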