Help with Installing Zotero OCR

Hello, I would like some help installing Zotero OCR (https://github.com/UB-Mannheim/zotero-ocr), an extension that adds OCR for selected PDFs in Zotero and is designed for use with Tesseract OCR.

I've installed the extension file in Zotero and have downloaded Tesseract. However, I keep getting an error message when I try to OCR a document. It reads: "[Javascript Application] No /Applications/Zotero.app/Contents/MacOS/pdftoppm executable found."

I think I am missing the prerequisite step where "pdftoppm from poppler library is downloaded and copied to the other scripts in the Zotero directory". However, as someone with no background in programming, I have no idea what that means.

Thanks in advance for any help!
  • You first need to install poppler. That's probably easiest by using Homebrew -- see
    http://macappstore.org/poppler/

    Once you have poppler installed, you should be able to find the file pdftoppm on your machine. (I think it'll be in /usr/bin or /usr/local/bin, but I don't use Mac so not sure)
    Copy (don't move) that file to /Applications/Zotero.app/Contents/MacOS/pdftoppm

    If you know how to, you can also use a symlink instead, but don't worry about it if you don't.
  • I followed steps 1-2 of the Homebrew instructions and am unsure how to execute step 3. Do I enter "brew install poppler" after the command from step 2 is completed? A bit more guidance for step 3 would be helpful. I'm a Mac user.
  • Do I enter "brew install poppler" after the command from step 2 is completed?
    Yes, exactly.
  • Followed all these instructions.

    I installed both Tesseract and Poppler. I'm still getting that error: "No /Applications/Zotero.app/Contents/MacOS/pdftoppm executable found"

    Any advice for troubleshooting this?

  • Did you do this?
    Once you have poppler installed, you should be able to find the file pdftoppm on your machine. (I think it'll be in /usr/bin or /usr/local/bin, but I don't use Mac so not sure)
    Copy (don't move) that file to /Applications/Zotero.app/Contents/MacOS/pdftoppm
  • Thank you AdamSmith for responding. Prompted by your message I looked into this a bit more carefully and discovered the problem. This might be helpful for others.

    The pdftoppm file that I had initially copied turned out to be an alias (a symlink) located in /usr/local/bin. Do not use that one. You need the original executable file. In my case, I'm using Catalina, and the path to the original file (the version number will depend on your poppler install) is

    /usr/local/Cellar/poppler/20.09.0/bin/pdftoppm

    Now it works! Thanks again.
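
The fix described above (copy the original executable rather than the alias in /usr/local/bin) could be sketched in Python roughly as follows. This is only an illustration under assumptions: the helper name copy_real_executable is made up here and is not part of Zotero or Zotero OCR, and copying into /Applications may require admin rights.

```python
import os
import shutil


def copy_real_executable(link_path, dest_path):
    """Copy the file that link_path ultimately points to, not the link itself.

    Hypothetical helper for illustration; not part of Zotero or Zotero OCR.
    """
    # Follow any chain of symlinks to the original file.
    real_path = os.path.realpath(link_path)
    if not os.path.isfile(real_path):
        raise FileNotFoundError(f"No executable found at {real_path}")
    # copy2 copies the contents and the permission bits, so the copy stays executable.
    shutil.copy2(real_path, dest_path)
    return real_path
```

Called as copy_real_executable("/usr/local/bin/pdftoppm", "/Applications/Zotero.app/Contents/MacOS/pdftoppm"), it would return the resolved path, which in the case reported above was the file under /usr/local/Cellar.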
  • Does somebody have a Windows equivalent? I'm also struggling here, have Tesseract installed but don't know where to get Poppler or where to put it once I have it...

    Thanks, you guys are fantastic.
  • This discusses various options. Which one to pick depends on what you're comfortable with: conda and chocolatey are probably the best options, but if you're not used to the command line and a package manager, they might be a bit scary.
  • @s3333969

    First you need to install Tesseract: https://github.com/UB-Mannheim/tesseract/wiki. It doesn't seem to matter whether you install the latest 5.0 alpha or the 4.1 version (though I found version 5 seems to be faster), nor does it seem to matter whether you use the 32-bit or 64-bit version. Just click Next through the installer and keep all the default options. You then need to set the path for Tesseract in the Tools > Zotero OCR preferences. In my case, I installed the 64-bit version, and the Tesseract path was "C:\Program Files\Tesseract-OCR\tesseract.exe". If you have the 32-bit version for whatever reason, it's probably in "C:\Program Files (x86)\Tesseract-OCR\tesseract.exe".

    You then need to install Poppler. This is the best place to download Poppler for Windows, because it comes with all the DLLs: https://github.com/oschwartz10612/poppler-windows. Click the little "Release xxxxx" link with the tag on the right side of the page. Then unzip all the contents, go into the unzipped poppler- > lib folder, and drag and drop both pdftoppm.exe and ALL of the .dll files into the C:\Program Files (x86)\Zotero folder. The first time you run the plug-in, Windows will ask you if you're sure you want to run the executable. (On this point, I think it would be super nice if the plug-in allowed you to specify a path for the Poppler files, just like it does for the Tesseract executable, but for whatever reason that's not an option. You just have to dump all that junk in your Zotero folder. Not entirely elegant, but it works, and it's free! So I'm very thankful for it no matter how it works.)

    When you right-click and OCR the file, you'll see a pop-up terminal window appear, and eventually that will pass you to a second window where you can watch it start iterating through the documents.

    By default, the plug-in creates a bunch of other attachments for the item, like a Note file and some other stuff. You can disable the extra stuff under the extension options. If you right-click the PDF attachment in Zotero and then click "Show file", you can delete the other files you don't need apart from the linked OCR version that it adds.

    Also note that the OCR file sizes may be significantly larger than the nice, compact version you downloaded from your library's index. E.g., tonight I OCRed a file, and the original was 1 MB while the OCRed version was 17 MB.
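
The drag-and-drop step in the Windows instructions above (pdftoppm.exe plus all the .dll files into the Zotero folder) could also be scripted. The following is a minimal sketch under assumptions: the helper name copy_poppler_files is made up for illustration, both folder paths are supplied by the caller, and writing into a Program Files folder will probably need an elevated (administrator) session.

```python
import shutil
from pathlib import Path


def copy_poppler_files(poppler_dir, zotero_dir):
    """Copy pdftoppm.exe and every .dll from poppler_dir into zotero_dir.

    Hypothetical helper for illustration; not part of Zotero or Zotero OCR.
    """
    src = Path(poppler_dir)
    dst = Path(zotero_dir)
    exe = src / "pdftoppm.exe"
    if not exe.is_file():
        raise FileNotFoundError(f"pdftoppm.exe not found in {src}")
    copied = []
    # Copy the executable first, then all DLLs next to it, leaving other files alone.
    for path in [exe, *sorted(src.glob("*.dll"))]:
        shutil.copy2(path, dst / path.name)
        copied.append(path.name)
    return copied
```

The function returns the names of the files it copied, which makes it easy to check that no DLL was missed.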