Help with Installing Zotero OCR
Hello, I would like some help installing Zotero OCR (https://github.com/UB-Mannheim/zotero-ocr), an extension that adds OCR for selected PDFs in Zotero and is tailored for use with Tesseract OCR.
I've installed the extension file in Zotero, and have downloaded Tesseract. However, I keep getting an error message when I try to OCR a document, which reads "[Javascript Application] No /Applications/Zotero.app/Contents/MacOS/pdftoppm executable found."
I think I am missing the prerequisite step where "pdftoppm from poppler library is downloaded and copied to the other scripts in the Zotero directory". However, as someone with no background in programming, I have no idea what that means.
Thanks in advance for any help!
http://macappstore.org/poppler/
Once you have poppler installed, you should be able to find the file pdftoppm on your machine. (I think it'll be in /usr/bin or /usr/local/bin, but I don't use Mac so not sure)
Copy (don't move) that file to /Applications/Zotero.app/Contents/MacOS/pdftoppm
If you know how to, you can also use a symlink instead, but don't worry about it if you don't.
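For anyone comfortable with the Terminal, the copy step could look roughly like the sketch below. The function name install_pdftoppm is made up for illustration, and both paths are assumptions that depend on where poppler and Zotero ended up on your machine.

```shell
# Sketch: copy pdftoppm into the Zotero application folder.
# install_pdftoppm is a hypothetical helper name; both arguments are
# assumptions to adjust for your own machine.
install_pdftoppm() {
  src="$1"        # e.g. /usr/local/bin/pdftoppm
  dest_dir="$2"   # e.g. /Applications/Zotero.app/Contents/MacOS
  if [ ! -f "$src" ]; then
    echo "source not found: $src" >&2
    return 1
  fi
  # When the source is a symlink, cp copies the file it points to,
  # so the destination is a regular file rather than another link.
  cp -L "$src" "$dest_dir/pdftoppm" && chmod +x "$dest_dir/pdftoppm"
}
```

You would then run something like install_pdftoppm /usr/local/bin/pdftoppm /Applications/Zotero.app/Contents/MacOS, possibly with sudo if the Zotero folder is not writable for your user.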
I installed both Tesseract and Poppler. I'm still getting that error: "No /Applications/Zotero.app/Contents/MacOS/pdftoppm executable found"
Any advice for troubleshooting this?
The pdftoppm file that I had initially copied turned out to be an alias file located in /usr/local/bin. Do not use that one. You need the original executable file. In my case, I'm using Catalina, and the correct path for the original file is
/usr/local/Cellar/poppler/20.09.0/bin/pdftoppm
Now it works! Thanks again.
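If you are not sure where the original file behind such an alias lives, a small shell function can follow the link for you. This is only a sketch: resolve_link is a made-up name, and it assumes the alias is an ordinary symbolic link (which is how Homebrew normally sets up /usr/local/bin).

```shell
# Sketch: print the real file that a symlink finally points to.
# resolve_link is a hypothetical helper name. It uses plain readlink
# (without -f), which is available on older macOS versions as well.
resolve_link() {
  target="$1"
  while [ -L "$target" ]; do
    dir=$(cd "$(dirname "$target")" && pwd -P)
    link=$(readlink "$target")
    case "$link" in
      /*) target="$link" ;;        # absolute link target
      *)  target="$dir/$link" ;;   # relative to the link's own folder
    esac
  done
  # Normalise the directory part so "../" segments disappear.
  dir=$(cd "$(dirname "$target")" && pwd -P)
  echo "$dir/$(basename "$target")"
}
```

Running resolve_link /usr/local/bin/pdftoppm should print the path inside the Cellar folder, which is the file to copy into the Zotero folder.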
Thanks, you guys are fantastic.
First you need to install Tesseract: https://github.com/UB-Mannheim/tesseract/wiki. (It doesn't seem to matter whether you install the latest 5.0 alpha or the 4.1 version, though I found version 5 seems to be faster, nor does it seem to matter whether you use the 32-bit or 64-bit version. Just install with all the default options and click Next through the installer.) You then need to set the path for Tesseract in the Tools > Zotero OCR preferences. In my case, I installed the 64-bit version, and the Tesseract path was "C:\Program Files\Tesseract-OCR\tesseract.exe". If you have the 32-bit version for whatever reason, it's probably in "C:\Program Files (x86)\Tesseract-OCR\tesseract.exe".
You then need to install Poppler. This is the best place to download Poppler for Windows, because it comes with all the DLLs: https://github.com/oschwartz10612/poppler-windows. You need to click the little "Release xxxxx" link with the tag on the right side of the screen. Then you have to unzip all the contents, go into the unzipped poppler- > lib folder, and then drag and drop both pdftoppm.exe and ALL of the .dll files into the C:\Program Files (x86)\Zotero folder. The first time you run the plug-in, Windows will ask you if you're sure you want to run the executable. (On this point, I must say that I think it would be super nice if the plug-in would allow you to specify a path for the Poppler files, just like it does for the Tesseract executable, but for whatever reason that's not an option. You just have to dump all that junk in your Zotero folder. Not entirely elegant, but it works, and it's free! So I'm very thankful for it no matter how it works.)
When you right-click and OCR the file, you'll see a pop-up terminal window appear, and eventually that will pass you to a second window where you can watch it start iterating through the documents.
By default, the plug-in creates a bunch of other attachments for the item, like a Note file and some other stuff. You can disable the extra stuff under the extension options. If you right-click the PDF attachment in Zotero and then click "Show file", you can delete the other files you don't need apart from the linked OCR version that it adds.
Also note that the OCR file sizes may be significantly larger than the nice, compact version you downloaded from your library's index. E.g., tonight I OCRed a file, and the original was 1 MB while the OCRed version was 17 MB.
Sadly I still haven't been able to get this plugin working--I think the issue is tesseract, which appears to crash on startup. When I run the plugin I get a command line screen for an instant which closes before I can actually read what comes up.
When I run 'tesseract.exe' through explorer I see the same thing.
I have tried installing the 32bit version of Tesseract 5.0, but it does not help. Legacy versions don't seem to have Windows installers and I'm not savvy enough to figure out how to use them to see. I'm running win10 Home.
If anybody has experienced anything similar, or has any ideas, I'd love to hear about it.
I also just tried scanning the Tesseract-OCR folder and found no malware, and Virustotal doesn't find malware in the installers in the two topmost links on https://github.com/UB-Mannheim/tesseract/wiki:
Virustotal scan of 32-bit installer: https://www.virustotal.com/gui/url/1fbe02d95d5a192026f1fcbb5283b3eb2d41fb64e98b288de6d3bdd6f81ae6ce/detection
Virustotal scan of 64-bit installer: https://www.virustotal.com/gui/url/1fbe02d95d5a192026f1fcbb5283b3eb2d41fb64e98b288de6d3bdd6f81ae6ce/detection
Use Homebrew to install Tesseract first; this installs English and two other languages that I don't remember.
brew install tesseract-lang
The path is then the following, which of course depends on the version:
/opt/homebrew/Cellar/tesseract/4.1.1
The second requirement is pdftoppm, which comes with the poppler package and can be installed in two steps:
ruby -e "$(curl -fsSL https://raw.githubusercontent.com/Homebrew/install/master/install)" < /dev/null 2> /dev/null
then run
brew install poppler
The path on my Mac is
/opt/homebrew/Cellar/poppler/21.01.0/bin
which also depends on poppler version
I've gone through the installation process using Homebrew: installed tesseract and poppler, installed the Zotero plugin, and set the paths in the Zotero plugin to
/opt/homebrew/Cellar/tesseract/4.1.1
and
/opt/homebrew/Cellar/poppler/21.03.0_1/bin
I've copied the pdftoppm into /Applications/Zotero.app/Contents/MacOS/pdftoppm
When I run the plugin, an ocr file appears (far too quickly!) but when I try to open it I get the following error:
Format Error: Not a PDF or corrupted.
What am I missing?
I'm not attached to using an ocr within Zotero and would be happy with a recommendation for a free easy to use app ...
To check the progression, right-click on the PDF being scanned, and click on "locate". You'll see your pdf in its folder, with a bunch of png and .txt files appearing and disappearing during the scanning process. The PDF seems to be re-written completely, so if you try to open it in Zotero during the process, you'll have an error.
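To tell whether the output is really broken or just not finished yet, one rough check is to look at the first bytes of the file, since a valid PDF starts with "%PDF-". The sketch below assumes that a file still being rewritten, or an empty one, fails this check; is_pdf is a made-up name, and passing the check does not prove the rest of the file is intact.

```shell
# Sketch: succeed only if the file begins with the PDF signature "%PDF-".
# is_pdf is a hypothetical helper name.
is_pdf() {
  [ "$(head -c 5 "$1" 2>/dev/null)" = "%PDF-" ]
}
```

For example, is_pdf /path/to/your.pdf && echo "looks like a PDF" (the path is a placeholder for the file shown by "locate").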
I can't seem to find the pdftoppm exe. I can find a file with that name, but it isn't recognized as an exe. Can someone help?
https://pdf2image.readthedocs.io/en/latest/installation.html
Following the instructions from zuphilip (https://github.com/UB-Mannheim/zotero-ocr/wiki/Install-pdftoppm), I installed both poppler and tesseract but couldn't get any of the .exe files...
So I couldn't enter any file path in the Zotero add-on.
Any clue?
Edit: Partially solved for documents in English, except that the created PDF seems corrupt, which is not much of a problem. My concern now is scanning in Spanish. What should I enter in the config: spanish, span or español? None of them seems to work. How do I install it?
Please refer to the Tesseract documentation at: https://github.com/tesseract-ocr/tesseract/blob/main/doc/tesseract.1.asc#languages-and-scripts
Generally you need to set the parameter for Zotero OCR Preference like this: spa+eng+spa_old
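If a language setting like this still does not work, the language data may simply not be installed. Tesseract can list what it has with tesseract --list-langs, and the sketch below compares a setting against such a list. The function name missing_langs is made up for illustration, and whether your Tesseract install includes spa or spa_old is something only your own machine can tell you.

```shell
# Sketch: print every code from a "+"-joined language setting that is
# absent from a newline-separated list of installed languages.
# missing_langs is a hypothetical helper name.
missing_langs() {
  wanted="$1"      # e.g. spa+eng+spa_old
  installed="$2"   # e.g. the output of: tesseract --list-langs
  for code in $(printf '%s' "$wanted" | tr '+' ' '); do
    # -x matches whole lines only, -F treats the code as plain text
    printf '%s\n' "$installed" | grep -qxF "$code" || echo "$code"
  done
}
```

A possible call is missing_langs "spa+eng+spa_old" "$(tesseract --list-langs 2>&1)". Any code it prints is missing and would need its language data installed before Zotero OCR can use it.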