Searchable PDFs Not Searchable In Main Interface

We are evaluating Zotero and love the product. We uploaded a number of PDFs to Zotero, selected them, and ran the metadata retrieval process.

The error returned is: PDF Name > Could not read text from PDF or no matching reference found or PDF does not contain OCRed text.

When we select the PDF and open it, we can confirm it is searchable: we can enter terms in the search field at the bottom of the window and they come up normally.

We tried to apply the fix described in report ID 66. Replacing the PDFInfo and PDFtoText files and reinstalling did not work.

Is there anything that someone can advise to help solve this?
  • "Could not read text from PDF or no matching reference found or PDF does not contain OCRed text."

    These are three different error messages that have different causes. You mean you're getting them all at once for a single document?
    In the tab on the right, does it show up as indexed?
    "We tried to apply the fix described in report ID 66."

    Sorry, where did you find that suggested fix, and what do you mean by report ID?
  • Thanks for the quick reply. We were following some basic steps based on searching the forums and the fix may not have been applicable.

    https://forums.zotero.org/discussion/28820/retrieve-metadata-for-pdf-broken-report-id-660357467/

    As for the errors, selecting five PDFs and running the metadata process yields the various errors. That is, each PDF shows its own error, but all within the same results window.
  • OK, so
    - no OCRed text means what it says: Zotero can't find any OCRed text in the PDF
    - can't read text means the pdftools fail, usually for unknown reasons. I believe encrypted PDFs or the like can cause this, but I'm not 100% sure
    - no matching references found means that Zotero looked for a DOI, an ISBN, and finally a text string via Google Scholar and didn't come up with anything.

    So the reasons these may happen are quite distinct. The searchable PDF - which error message did you get for that?
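
The lookup order described above (DOI, then ISBN, then a text string) can be sketched roughly as follows. This only illustrates the fallback logic and is not Zotero's actual code; the function name `pick_lookup_key` and both regular expressions are assumptions made for the example.

```python
import re

# Simplified patterns, assumed for illustration only.
# Real DOIs and ISBNs come in more variants than these cover.
DOI_RE = re.compile(r"\b10\.\d{4,9}/[^\s\"<>]+")
ISBN_RE = re.compile(
    r"\bISBN(?:-1[03])?:?\s*((?:97[89][- ]?)?\d[\d -]{8,14}[\dXx])\b"
)


def pick_lookup_key(text):
    """Return (kind, value) for the first usable key, trying DOI, ISBN, then text."""
    if not text.strip():
        # Nothing was extracted, e.g. a scan without OCRed text.
        return ("none", "")
    doi = DOI_RE.search(text)
    if doi:
        # Drop trailing punctuation that often follows a DOI in running text.
        return ("doi", doi.group(0).rstrip(".,;)"))
    isbn = ISBN_RE.search(text)
    if isbn:
        # Normalize by removing spaces and hyphens.
        return ("isbn", re.sub(r"[ -]", "", isbn.group(1)))
    # Last resort: a short text snippet to use as a search string.
    return ("text", " ".join(text.split()[:10]))
```

If none of these keys produces a match in the external lookup, the result would be the "no matching references found" case, even though the PDF text itself is perfectly readable.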
  • Yeah, the version of the PDF tools that we use fails on encrypted PDFs. A newer version fixes this, but we haven't upgraded to that yet (and we may just switch to Mozilla's built-in PDF tool first).
  • Tried the same process via the standalone app with the same results:

    Metadata retrieval process (right click on PDF)

    Indexed - no matching reference found

    Not indexed - Could not read text from PDF

    Cannot index non-indexed PDF. Checked security on PDF and there is none. If I open the PDF we can search it without any issues.
  • At least on Windows, the PDF tools have issues with non-Latin characters in PDF file names. Try renaming the PDF (through Zotero) if that is the case.
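
A quick way to spot affected files is to check the names for characters outside plain ASCII. This is a minimal sketch that treats "non-ASCII" as a stand-in for "non-Latin", which is an assumption; the helper names `has_non_ascii` and `ascii_safe_name` are made up for the example.

```python
def has_non_ascii(file_name):
    """Return True if the name contains any character outside 7-bit ASCII."""
    return any(ord(ch) > 127 for ch in file_name)


def ascii_safe_name(file_name, replacement="_"):
    """Suggest a name with every non-ASCII character replaced by a placeholder."""
    return "".join(ch if ord(ch) < 128 else replacement for ch in file_name)
```

The sketch only flags names and suggests a replacement; the rename itself would still be done through Zotero, as suggested above.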
  • Yeah, so this is mostly working as expected, though I'm not sure why pdftotext fails on that PDF. It doesn't just fail on encrypted ones:

    @Dan - you can e.g. test this on:
    http://catdir.loc.gov/catdir/samples/cam033/2002073770.pdf
    which has no security and works fine with pdftotext 3.0.3, but Zotero's pdftotext version run from the command line returns: "Error: Copying of text from this document is not allowed."

    @tdhuman - in my experience this is pretty rare - maybe 1 in 50 or so. It is possible to replace Zotero's pdftools with the updated version manually if that's of interest to you.
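
To check a given PDF outside Zotero, pdftotext can be run directly and its error output inspected. The sketch below assumes a `pdftotext` binary is available on the PATH (the location of Zotero's bundled copy is not given here), and the names `pdftotext_command`, `diagnose`, and `check_pdf` are hypothetical.

```python
import subprocess

COPY_ERROR = "Copying of text from this document is not allowed"


def pdftotext_command(pdf_path):
    """Build the argument list; '-' asks pdftotext to write the text to stdout."""
    return ["pdftotext", pdf_path, "-"]


def diagnose(returncode, stderr):
    """Classify a pdftotext run from its exit code and error output."""
    if COPY_ERROR in stderr:
        return "copy-restricted"
    if returncode != 0:
        return "failed"
    return "ok"


def check_pdf(pdf_path):
    """Run pdftotext on one file (assumes the binary is on the PATH)."""
    result = subprocess.run(
        pdftotext_command(pdf_path),
        capture_output=True,
        text=True,
        errors="replace",  # tolerate output that is not valid UTF-8
    )
    return diagnose(result.returncode, result.stderr)
```

Calling `check_pdf` on a downloaded copy of the sample file would show whether the installed pdftotext version reports the copy restriction quoted above.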
  • Thanks. It would be. We are running 3.0.2. Where can we find 3.0.3?
  • Well, it's more than just encryption: PDFs can have various restrictions, and what matters here is the no-copying flag, which 3.0.2 obeys and (I guess) 3.0.3 doesn't.

    You can't replace pdftotext with 3.0.3 on Windows without altering the binary; otherwise you'll get a black console window every time it runs. (And pdfinfo needs a custom build to output to a text file, though the version probably doesn't matter for that anyway.) I don't even remember off-hand how we did that (it's some Windows flag on the executable), but that's why we distribute custom versions.
  • Based on the suggestions, removing the non-Latin characters from the file names solved the issue for the PDFs that were unindexed and reported "Could not read text from PDF."

    The remaining issue is related to indexed PDFs. Running the metadata retrieval process shows error: no matching references found.

    If we open the PDF it is searchable. There is no encryption or other restrictions on the PDF.

    Not sure if the 3.0.3 executables would help fix this. If so, would you mind letting me know how to download them?
  • No, 3.0.3 has no effect on that. As I said above:
    "no matching references found means that Zotero looked for a DOI, an ISBN, and finally a text string via Google Scholar and didn't come up with anything."
    You won't get metadata for every PDF.
  • Okay - we did find the article via Google Scholar and did get the DOI number. But that is okay. I appreciate the help in trying to figure some of this out.