Help with Installing Zotero OCR

SamtheBox · August 19, 2020

Hello, I would like some help installing Zotero OCR (https://github.com/UB-Mannheim/zotero-ocr), an add-on that runs OCR on selected PDFs in Zotero and is designed for use with Tesseract OCR.

I've installed the extension file in Zotero and have downloaded Tesseract. However, I keep getting an error message when I try to OCR a document, which reads "[Javascript Application] No /Applications/Zotero.app/Contents/MacOS/pdftoppm executable found."

I think I am missing the prerequisite step where "pdftoppm from poppler library is downloaded and copied to the other scripts in the Zotero directory". However, as someone with no background in programming, I have no idea what that means.

Thanks in advance for any help!

adamsmith · August 19, 2020

You first need to install poppler. That's probably easiest by using Homebrew -- see
http://macappstore.org/poppler/

Once you have poppler installed, you should be able to find the file pdftoppm on your machine. (I think it'll be in /usr/bin or /usr/local/bin, but I don't use Mac so not sure)
Copy (don't move) that file to /Applications/Zotero.app/Contents/MacOS/pdftoppm

If you know how to, you can also use a symlink instead, but don't worry about it if you don't.
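
For anyone comfortable with Terminal, the copy step above can be sketched as a small shell function. This is only a sketch, not part of the plugin: the name copy_pdftoppm is made up for illustration, and it assumes pdftoppm is already on your PATH (for example after brew install poppler).

```shell
# Hypothetical helper: copy the pdftoppm found on PATH into a target folder.
# cp -L copies the file a symlink points to, not the link itself.
copy_pdftoppm() (
    dest_dir=$1
    src=$(command -v pdftoppm) || { echo "pdftoppm not found on PATH" >&2; exit 1; }
    cp -L "$src" "$dest_dir/pdftoppm" && chmod +x "$dest_dir/pdftoppm"
)

# Example (assumes Zotero is in the default location; may need sudo):
# copy_pdftoppm /Applications/Zotero.app/Contents/MacOS
```

The function body runs in a subshell, so its variables do not leak into your Terminal session.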

shoshan · September 1, 2020

I followed steps 1-2 of the Homebrew instructions and am unsure how to execute step 3. Do I enter "brew install poppler" after the command from step 2 is completed? A bit more guidance for step 3 would be helpful. I'm a Mac user.

adamsmith · September 1, 2020

Do I enter "brew install poppler" after the command from step 2 is completed?

Yes, exactly.

PostmodernReligion · September 23, 2020

Followed all these instructions.

I installed both Tesseract and Poppler. I'm still getting that error: "No /Applications/Zotero.app/Contents/MacOS/pdftoppm executable found"

Any advice for troubleshooting this?

adamsmith · September 23, 2020

Did you do this?

Once you have poppler installed, you should be able to find the file pdftoppm on your machine. (I think it'll be in /usr/bin or /usr/local/bin, but I don't use Mac so not sure)
Copy (don't move) that file to /Applications/Zotero.app/Contents/MacOS/pdftoppm

PostmodernReligion · September 24, 2020

Thank you AdamSmith for responding. Prompted by your message I looked into this a bit more carefully and discovered the problem. This might be helpful for others.

The pdftoppm file that I had initially copied turned out to be an alias (a symbolic link) located in /usr/local/bin. Do not use that one; you need the original executable file. In my case, on macOS Catalina, the correct path to the original file is

/usr/local/Cellar/poppler/20.09.0/bin/pdftoppm

Now it works! Thanks again.
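
What Finder shows as an alias in /usr/local/bin is a symbolic link created by Homebrew. A minimal sketch for finding the file it points to, written without readlink -f because, as far as I know, older macOS versions such as Catalina do not support that option. The name resolve_symlink is made up for illustration.

```shell
# Hypothetical helper: follow a chain of symlinks and print the real file's path.
resolve_symlink() (
    target=$1
    while [ -L "$target" ]; do
        link=$(readlink "$target")
        case $link in
            /*) target=$link ;;                       # absolute link target
            *)  target=$(dirname "$target")/$link ;;  # relative to the link's folder
        esac
    done
    # Normalize the folder part (removes ".." and symlinked directories).
    dir=$(cd "$(dirname "$target")" && pwd -P) || exit 1
    printf '%s/%s\n' "$dir" "$(basename "$target")"
)

# Example: resolve_symlink /usr/local/bin/pdftoppm
```

The printed path is the one to copy from, or to enter wherever a path to pdftoppm is asked for.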

s3333969 · September 28, 2020

Does somebody have a Windows equivalent? I'm also struggling here, have Tesseract installed but don't know where to get Poppler or where to put it once I have it...

Thanks, you guys are fantastic.

adamsmith · September 29, 2020

This discusses various options -- kind of depends on what you're comfortable with: conda and chocolatey are probably the best options, but if you're not used to the commandline & a package manager, might be a bit scary.

internationaled · October 18, 2020

@s3333969

First you need to install Tesseract: https://github.com/UB-Mannheim/tesseract/wiki. (It doesn't seem to matter whether you install the latest 5.0 alpha or the 4.1 version--though I found version 5 seems to be faster--nor does it seem to matter whether you use the 32-bit or 64-bit version. Just install with all the default options and click Next through the installer.) You then need to set the path for Tesseract in the Tools > Zotero OCR preferences. In my case, I installed the 64-bit version, and the Tesseract path was "C:\Program Files\Tesseract-OCR\tesseract.exe". If you have the 32-bit version for whatever reason, it's probably in "C:\Program Files (x86)\Tesseract-OCR\tesseract.exe".

You then need to install Poppler. This is the best place to download Poppler for Windows, because it comes with all the DLLs: https://github.com/oschwartz10612/poppler-windows. You need to click the little "Release xxxxx" link with the tag on the right side of the screen. Then you have to unzip all the contents, go into the unzipped poppler- > lib folder, and then drag and drop both pdftoppm.exe and ALL of the .dll files into the C:\Program Files (x86)\Zotero folder. The first time you run the plug-in, Windows will ask you if you're sure you want to run the executable. (On this point, I must say that I think it would be super nice if the plug-in would allow you to specify a path for the Poppler files, just like it does for the Tesseract executable, but for whatever reason that's not an option. You just have to dump all that junk in your Zotero folder. Not entirely elegant, but it works, and it's free! So I'm very thankful for it no matter how it works.)

When you right-click and OCR the file, you'll see a pop-up terminal window appear, and eventually that will pass you to a second window where you can watch it start iterating through the documents.

By default, the plug-in creates a bunch of other attachments for the item, like a Note file and some other stuff. You can disable the extra stuff under the extension options. If you right-click the PDF attachment in Zotero and then click "Show file", you can delete the other files you don't need apart from the linked OCR version that it adds.

Also note that the OCR file sizes may be significantly larger than the nice, compact version you downloaded from your library's index. E.g., tonight I OCRed a file, and the original was 1 MB while the OCRed version was 17 MB.

dlstanton · October 26, 2020

I think you mean go into the /bin folder not the /lib folder.

s3333969 · November 27, 2020

First up, I just want to say a gigantic thanks to @internationaled: those instructions are comprehensive and clear, and I can't say how much I appreciate you taking the time to write them out--you're a superstar.

Sadly I still haven't been able to get this plugin working--I think the issue is tesseract, which appears to crash on startup. When I run the plugin I get a command line screen for an instant which closes before I can actually read what comes up.
When I run 'tesseract.exe' through explorer I see the same thing.

I have tried installing the 32-bit version of Tesseract 5.0, but it does not help. Legacy versions don't seem to have Windows installers, and I'm not savvy enough to figure out how to use them to check. I'm running Windows 10 Home.

If anybody has experienced anything similar, or has any ideas, I'd love to hear about it.

mydjtl · December 18, 2020

Do you know why Tesseract has malware?

internationaled · December 18, 2020

Interesting. A few questions: (1) Where are you downloading Tesseract? (2) What antivirus are you using? (3) What files are giving you the malware flag?

I also just tried scanning the Tesseract-OCR folder and found no malware, and Virustotal doesn't find malware in the installers in the two topmost links on https://github.com/UB-Mannheim/tesseract/wiki:

Virustotal scan of 32-bit installer: https://www.virustotal.com/gui/url/1fbe02d95d5a192026f1fcbb5283b3eb2d41fb64e98b288de6d3bdd6f81ae6ce/detection

Virustotal scan of 64-bit installer: https://www.virustotal.com/gui/url/1fbe02d95d5a192026f1fcbb5283b3eb2d41fb64e98b288de6d3bdd6f81ae6ce/detection

adurs2002 · December 18, 2020

Works perfectly for me, @s3333969. However, my fans ramp up to max when I start the OCR... I have 16 GB of RAM, a seventh-gen Core i7 processor, and a dedicated GPU (I don't know the model, and it doesn't really matter because the GPU doesn't help with a CPU-intensive task). Check your specs and try freeing up RAM and closing other processes.

adurs2002 · December 18, 2020

@internationaled you're a lifesaver! Thank you for your comprehensive and completely working instructions. Can't say how much time you've saved for me.

thornier · January 1, 2021

Thanks to all who contributed to this thread I was able to install the plugin successfully. It is remarkable how much tesseract taxes the CPU! But in my tests it works wonderfully.

zuphilip · January 23, 2021

A new version of the plugin, Zotero OCR 0.4.0, is now out. It lets you specify the path to pdftoppm in the preferences as well. Enjoy!

melashri · January 27, 2021

To install the OCR engine, do the following.

Use Homebrew to install Tesseract first; this installs English and two other languages that I don't remember.

brew install tesseract-lang

The resulting path, which of course depends on the version, is

/opt/homebrew/Cellar/tesseract/4.1.1

The second requirement is installing pdftoppm, which comes with the poppler package and can be done in two steps:

ruby -e "$(curl -fsSL https://raw.githubusercontent.com/Homebrew/install/master/install)" < /dev/null 2> /dev/null

then run

brew install poppler

The path on my Mac is

/opt/homebrew/Cellar/poppler/21.01.0/bin

which also depends on poppler version
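
Because the Cellar path changes with every version, it can help to look the executable up instead of typing the version by hand. A minimal sketch, assuming the Homebrew layout shown above (Cellar/<formula>/<version>/bin); find_in_cellar is a made-up name. Homebrew's own "brew --prefix poppler" should also print a version-independent location, though whether the plugin accepts that path is untested here.

```shell
# Hypothetical helper: print the path of a tool inside any installed version
# of a formula, e.g. <cellar>/poppler/<version>/bin/pdftoppm.
find_in_cellar() (
    cellar=$1 formula=$2 tool=$3
    found=
    for candidate in "$cellar/$formula"/*/bin/"$tool"; do
        # Keep the last executable match in glob (alphabetical) order,
        # which is not a true version sort.
        [ -x "$candidate" ] && found=$candidate
    done
    [ -n "$found" ] && printf '%s\n' "$found"
)

# Example: find_in_cellar /opt/homebrew/Cellar poppler pdftoppm
```

If nothing is found, the function prints nothing and returns a non-zero status.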

internationaled · February 19, 2021

Everyone, the new version of this plugin now features an option to set the path to your Poppler's pdftoppm.exe file. So, you don't need to copy all those Poppler files into the Zotero folder anymore. Just enter the full path to the pdftoppm.exe file (e.g., C:\Program Files\poppler\tools\pdftoppm.exe). Much better!

zuphilip · February 20, 2021

There is now also a new wiki page with information about how to install pdftoppm: https://github.com/UB-Mannheim/zotero-ocr/wiki/Install-pdftoppm

AndrewRRM · March 12, 2021

I need a bit of help with this one too. I'm on a new M1 MacBook running Big Sur 11.2.2.

I've gone through the installation process using Homebrew, installed Tesseract and Poppler, installed the Zotero plugin, and set the paths in the Zotero plugin:
/opt/homebrew/Cellar/tesseract/4.1.1
and
/opt/homebrew/Cellar/poppler/21.03.0_1/bin

I've copied the pdftoppm into /Applications/Zotero.app/Contents/MacOS/pdftoppm

When I run the plugin, an OCR file appears (far too quickly!), but when I try to open it I get the following error:
Format Error: Not a PDF or corrupted.

What am I missing?

I'm not attached to using OCR within Zotero and would be happy with a recommendation for a free, easy-to-use app ...

dimnc · May 6, 2021

@AndrewRRM, wait a bit for your text to be OCRed, then open it again.
To check the progress, right-click on the PDF being scanned and click on "locate". You'll see your PDF in its folder, with a bunch of .png and .txt files appearing and disappearing during the scanning process. The PDF seems to be rewritten completely, so if you try to open it in Zotero during the process, you'll get an error.
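
A rough way to tell from Terminal whether the output is a readable PDF yet is to check that the file starts with the PDF signature %PDF-. A minimal sketch; looks_like_pdf is a made-up name, and passing this check only means the header is present, not that the whole file is valid.

```shell
# Hypothetical helper: succeed if the file begins with the PDF signature "%PDF-".
looks_like_pdf() {
    [ "$(dd if="$1" bs=5 count=1 2>/dev/null)" = "%PDF-" ]
}

# Example:
# looks_like_pdf "document.pdf" && echo "header OK" || echo "not a PDF (yet)"
```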

CedricJ · December 6, 2021

Hey, I'm running Windows and have looked through all the advice.

I can't seem to find the pdftoppm exe. I can find a file with that name, but it isn't recognized as an exe. Can someone help?

katelynmcw · February 3, 2022

@AndrewRRM were you ever able to resolve your issue? I have the same issue; tesseract and pdftoppm are installed successfully, but I get a blank note and a corrupted PDF.

mydjtl · February 7, 2022

https://jdhao.github.io/2019/11/14/convert_pdf_to_images_pdftoppm/
https://pdf2image.readthedocs.io/en/latest/installation.html

alflamingo · August 9, 2022

Hello,
following instructions from zuphilip (https://github.com/UB-Mannheim/zotero-ocr/wiki/Install-pdftoppm) I installed both poppler and tesseract but couldn't get any of the .exe files...

So I couldn't enter any file path for the Zotero add-on.

Any clue?

Edit: Partially solved for documents in English, except that the created PDF seems corrupt, which is not much of a problem. My concern now is OCR for scans in Spanish. What is the setting to enter: spanish, span, or español? None of them seems to work. How do I install it?

macne028 · November 5, 2022

@AndrewRRM, @katelynmcw, and @internationaled, I'm running into the same issue. The plug-in will run and produce a PDF, but when I open it, Zotero gives me an "Invalid or corrupt PDF" message. Have y'all had any luck solving this? Thank you to everyone so far in this thread! It's been very helpful.

interferometer · December 7, 2022

@alflamingo

Please refer to the Tesseract documentation at the following link: https://github.com/tesseract-ocr/tesseract/blob/main/doc/tesseract.1.asc#languages-and-scripts

Generally, you set the language parameter in the Zotero OCR preferences like this: spa+eng+spa_old
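
The codes joined with + have to match language data that Tesseract actually has installed; "tesseract --list-langs" prints the installed ones. A minimal sketch that reports which codes in a setting are missing; missing_langs is a made-up name, and it reads the installed list from standard input.

```shell
# Hypothetical helper: print every code from a "+"-joined language setting
# that does not appear in the list of installed languages read from stdin.
missing_langs() (
    spec=$1
    installed=$(cat)
    for code in $(printf '%s' "$spec" | tr '+' ' '); do
        printf '%s\n' "$installed" | grep -qxF -- "$code" || printf '%s\n' "$code"
    done
)

# Example: tesseract --list-langs | missing_langs spa+eng+spa_old
```

Any code it prints still needs its language data installed before the setting will work.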

alflamingo · December 7, 2022

Thank you, I got what I needed

egreenbergPSD · January 13, 2026

Posting this here because I found the instructions the most helpful:
https://publish.obsidian.md/history-notes/04+OCR+in+Zotero