Can't index a pdf
Hi there, I tried indexing my PDFs in one go and got a "script is not responding" error, so I went through and indexed them manually. However, there are still a few files that will not index (and I assume they were the cause of the initial script problem). Has anybody else had this problem? And is there a specific property that makes a PDF indexable or not, which I could look for in the files that won't index?
Thanks,
Jon.
But yes, a PDF could have restrictions that prevent text from being copied out of it, and pdftotext, the tool we use to extract text, complies with those restrictions.
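To look for that property in the files that won't index, here is a rough sketch in Python. It assumes the `pdfinfo` tool from the same poppler/xpdf family as pdftotext is installed, and that its "Encrypted:" line reports permissions in the form `copy:yes` or `copy:no` (that is what recent poppler builds print, but check the output on your own system). The function names are made up for illustration.

```python
import re
import subprocess


def copy_allowed(pdfinfo_output: str) -> bool:
    """Return True if the pdfinfo output indicates that text copying is permitted."""
    for line in pdfinfo_output.splitlines():
        if line.startswith("Encrypted:"):
            value = line.split(":", 1)[1].strip()
            if value.startswith("no"):
                # Not encrypted, so there are no copy restrictions to honour.
                return True
            # Encrypted: look for the copy permission flag (assumed format).
            match = re.search(r"copy:(yes|no)", value)
            # If the flag is missing, treat the file as restricted.
            return match is not None and match.group(1) == "yes"
    # No "Encrypted:" line at all: assume no restrictions were reported.
    return True


def check_file(path: str) -> bool:
    """Run pdfinfo on a file and report whether copying text is allowed."""
    result = subprocess.run(
        ["pdfinfo", path], capture_output=True, text=True, check=True
    )
    return copy_allowed(result.stdout)
```

Calling `check_file("some.pdf")` on each file that refuses to index should show whether a copy restriction is the likely cause. A file that passes this check but still yields no text is probably a scanned image with no text layer at all.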
If you happen to have this kind of PDF, you can use Optical Character Recognition (OCR) software to extract text from the image and then attach the resulting text to the source PDF as an annotation. I only know how to do this in Linux, where I use gscan2pdf with tesseract-ocr, either when scanning new documents or when importing existing PDF files. It works pretty well. I'm sure that similar software exists for Windows and Mac users; I just don't know anything about it myself.
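For anyone who prefers the command line to the gscan2pdf GUI, something like the following might work as a starting point. This is not what gscan2pdf does internally; it is a sketch that calls `pdftoppm` and `tesseract` directly, assuming both are installed and that your tesseract build supports the `pdf` output config. It produces one searchable PDF per page rather than attaching text to the original file, and the function names, the `page` prefix, and the 300 dpi default are my own choices.

```python
import subprocess
from pathlib import Path


def rasterize_command(pdf_path, image_prefix, dpi=300):
    """Build the pdftoppm call that renders each PDF page to a PNG image."""
    return ["pdftoppm", "-r", str(dpi), "-png", pdf_path, image_prefix]


def ocr_command(image_path, output_base, language="eng"):
    """Build the tesseract call that writes output_base.pdf with a text layer."""
    return ["tesseract", image_path, output_base, "-l", language, "pdf"]


def ocr_pdf(pdf_path, work_dir):
    """Rasterize a PDF and OCR every page; return the per-page PDF paths."""
    work = Path(work_dir)
    work.mkdir(parents=True, exist_ok=True)
    subprocess.run(rasterize_command(pdf_path, str(work / "page")), check=True)
    outputs = []
    # pdftoppm zero-pads page numbers, so a plain sort keeps page order.
    for image in sorted(work.glob("page-*.png")):
        base = str(image.with_suffix(""))
        subprocess.run(ocr_command(str(image), base), check=True)
        outputs.append(base + ".pdf")
    return outputs
```

Running `ocr_pdf("scan.pdf", "ocr_out")` should leave files such as `ocr_out/page-1.pdf`, each containing the page image plus recognised text that pdftotext can read.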