Searchable PDFs Not Searchable In Main Interface

tdhumad · January 29, 2014

We are evaluating Zotero and love the product. We uploaded a number of PDFs to Zotero, selected them, and ran the metadata retrieval process.

The error that comes back is: PDF Name > Could not read text from PDF or no matching reference found or PDF does not contain OCRed text.

When we select the PDF and open it, we can confirm it is searchable: entering terms in the search field at the bottom of the window finds them normally.

We tried to apply the fix described in report ID 66. Replacing the PDFInfo and PDFtoText files and reinstalling did not work.

Is there anything that someone can advise to help solve this?

adamsmith · January 29, 2014

"Could not read text from PDF or no matching reference found or PDF does not contain OCRed text."

These are three different error messages that have different causes. You mean you're getting them all at once for a single document?
In the tab on the right, does it show up as indexed?

We tried to apply the fix described in report ID 66.

Sorry, where did you find that suggested fix, and what do you mean by report ID?

tdhumad · January 29, 2014

Thanks for the quick reply. We were following some basic steps based on searching the forums and the fix may not have been applicable.

https://forums.zotero.org/discussion/28820/retrieve-metadata-for-pdf-broken-report-id-660357467/

As for the errors, selecting five PDFs and running the metadata process yields the various errors: each PDF gets its own error, but they all appear in the same results window.

adamsmith · January 29, 2014

OK, so
- no OCRd text means what it says: Zotero can't find any OCRd text in the PDF
- can't read text means the pdftools fail, usually for unknown reasons. I believe encrypted PDFs or the like can cause this, but I'm not 100% sure
- no matching references found means that Zotero looked for a DOI, an ISBN, and finally a text string via Google Scholar and didn't come up with anything.

So the reasons these may happen are quite distinct. The searchable PDF - which error message did you get for that?
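
To illustrate the DOI and ISBN part of that lookup order, here is a rough Python sketch. It is not Zotero's actual code: the regular expressions and the function name `find_identifier` are assumptions made up for this example, and the Google Scholar fallback is left out entirely.

```python
import re

# Hypothetical patterns for illustration only; Zotero's real matching may differ.
DOI_RE = re.compile(r"\b10\.\d{4,9}/[^\s\"<>]+")
ISBN_RE = re.compile(r"\bISBN(?:-1[03])?:?\s*([0-9][0-9\-]{8,15}[0-9Xx])")


def find_identifier(text):
    """Return ("doi", value) or ("isbn", value), trying the DOI first, else None."""
    match = DOI_RE.search(text)
    if match:
        # Trailing punctuation is usually sentence punctuation, not part of the DOI.
        return ("doi", match.group(0).rstrip(".,;)"))
    match = ISBN_RE.search(text)
    if match:
        digits = re.sub(r"[^0-9Xx]", "", match.group(1))
        if len(digits) in (10, 13):
            return ("isbn", digits.upper())
    # Nothing usable in the text: this is the "no matching references" situation.
    return None


print(find_identifier("Available at doi:10.1000/xyz123."))
print(find_identifier("ISBN 978-0-306-40615-7"))
print(find_identifier("A scanned page with no identifiers"))
```

If neither pattern matches and the text search also finds nothing, there is simply no metadata to retrieve, even though the PDF itself is perfectly searchable.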

dstillman · January 29, 2014

Yeah, the version of the PDF tools that we use fails on encrypted PDFs. A newer version fixes this, but we haven't upgraded to that yet (and we may just switch to Mozilla's built-in PDF tool first).

tdhumad · January 29, 2014

Tried the same process via the standalone app with the same results:

Metadata retrieval process (right click on PDF)

Indexed - no matching reference found

Not indexed - Could not read text from PDF

We cannot index the non-indexed PDF. We checked the security settings on the PDF and there are none. If we open the PDF, we can search it without any issues.

aurimas · January 29, 2014

At least on Windows, the PDF tools have issues with non-Latin characters in PDF file names. Try renaming the PDF (through Zotero) if that is the case.
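
If you have many attachments, a quick way to spot candidates for renaming is to check the file names for anything outside plain ASCII. This is only a sketch: the helper names are made up for the example, and testing for non-ASCII is a deliberately conservative stand-in for "non-Latin" (it also flags accented Latin letters, which may or may not cause trouble).

```python
from pathlib import PureWindowsPath


def has_non_ascii_name(path):
    """True if the file name itself (not its folders) has non-ASCII characters."""
    # PureWindowsPath splits on both "\\" and "/", and works on any platform.
    return not PureWindowsPath(path).name.isascii()


def flag_suspect_pdfs(paths):
    """Return the PDF paths whose file names might trip up the PDF tools."""
    return [
        p for p in paths
        if str(p).lower().endswith(".pdf") and has_non_ascii_name(p)
    ]


print(flag_suspect_pdfs(["report.pdf", "C:/docs/статья.pdf", "notes.txt"]))
```

Only the file name is checked here, since that is the part the post above mentions; whether folder names matter too is not something this thread establishes.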

adamsmith · January 29, 2014

Yeah, so this is mostly working as expected, though I'm not sure why pdftotext fails on that PDF. It doesn't just fail on encrypted ones:

@Dan - you can e.g. test this on:
http://catdir.loc.gov/catdir/samples/cam033/2002073770.pdf
which has no security and works fine with pdftotext 3.0.3, but Zotero's pdftotext version run from the command line returns: "Error: Copying of text from this document is not allowed."
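
For anyone who wants to repeat that command-line test on their own files, a small Python wrapper along these lines could do it. It is a sketch under assumptions: it expects a `pdftotext` executable on the PATH, relies on `-` as the output argument to write text to stdout, and the function names and result labels are invented for this example.

```python
import shutil
import subprocess

# Error text quoted from the command-line run described above.
COPY_BLOCKED = "Copying of text from this document is not allowed"


def classify_pdftotext(returncode, stdout, stderr):
    """Map a pdftotext result to one of four hypothetical labels."""
    if COPY_BLOCKED in stderr or COPY_BLOCKED in stdout:
        return "copy-restricted"
    if returncode != 0:
        return "failed"
    if not stdout.strip():
        return "no-text"
    return "ok"


def run_pdftotext(pdf_path, exe="pdftotext"):
    """Run pdftotext on one file and classify what happened."""
    if shutil.which(exe) is None:
        raise FileNotFoundError(f"{exe} not found on PATH")
    proc = subprocess.run(
        [exe, str(pdf_path), "-"],  # "-" sends the extracted text to stdout
        capture_output=True,
        text=True,
        errors="replace",  # do not crash on bytes that fail to decode
    )
    return classify_pdftotext(proc.returncode, proc.stdout, proc.stderr)
```

Calling `run_pdftotext("2002073770.pdf")` against a downloaded copy of the sample above would show which of the cases a given pdftotext build falls into.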

@tdhumad - in my experience this is pretty rare - maybe 1 in 50 or so. It is possible to replace Zotero's pdftools with the updated version manually if that's of interest to you.

tdhumad · January 29, 2014

Thanks. It would be. We are running 3.0.2. Where can we find 3.0.3?

dstillman · January 29, 2014

Well, it's more than just encryption: PDFs can have various restrictions, and what matters here is the no-copying flag, which 3.0.2 obeys and (I guess) 3.0.3 doesn't.

You can't replace pdftotext with 3.0.3 on Windows without altering the binary; otherwise you'll get a black console window every time it runs. (And pdfinfo needs a custom build to output to a text file, though the version probably doesn't matter for that anyway.) I don't even remember off-hand how we did that (it's some Windows flag on the executable), but that's why we distribute custom versions.
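
To check whether a given PDF carries the no-copying flag mentioned above, one option is to look at the `Encrypted:` line that pdfinfo prints. The sketch below assumes that line looks like `Encrypted: yes (print:yes copy:no change:no addNotes:no)`, which is how I understand pdfinfo reports it; treat that format, and the helper name `parse_permissions`, as assumptions rather than a documented interface.

```python
import re


def parse_permissions(pdfinfo_output):
    """Read the Encrypted line from pdfinfo text output, or return None."""
    for line in pdfinfo_output.splitlines():
        if line.startswith("Encrypted:"):
            value = line.split(":", 1)[1].strip()
            # Collect pairs such as ("copy", "no") from the parenthesised list.
            perms = dict(re.findall(r"(\w+):(yes|no)", value))
            return {
                "encrypted": value.startswith("yes"),
                # No explicit copy entry is treated as "copying allowed".
                "copy_allowed": perms.get("copy", "yes") == "yes",
            }
    return None


sample = "Pages:          12\nEncrypted:      yes (print:yes copy:no change:no addNotes:no)\n"
print(parse_permissions(sample))
```

A result with `copy_allowed` set to false would match the "Copying of text from this document is not allowed" case, even when the PDF opens and searches fine in a viewer.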

tdhumad · January 29, 2014

Based on the suggestions, removing the non-Latin characters from the file names solved the issue for the PDFs that were unindexed and reporting "Could not read text from PDF".

The remaining issue is related to indexed PDFs. Running the metadata retrieval process shows error: no matching references found.

If we open the PDF it is searchable. There is no encryption or other restrictions on the PDF.

Not sure if the 3.0.3 executables would help fix this. If so, would you mind letting me know how to download them?

adamsmith · January 29, 2014

no, 3.0.3 has no effect on that. As I said above:

no matching references found means that Zotero looked for a DOI, an ISBN, and finally a text string via Google Scholar and didn't come up with anything.

you won't get metadata for every PDF.

tdhumad · January 29, 2014

Okay - we did find the article via Google Scholar and did get the DOI number. But that is okay. I appreciate the help in trying to figure some of this out.