Need for OCR plug-in

DonnaCoxBaker · December 7, 2023

It appears that nothing is happening with Zotero OCR in preparation for version 7. The ability to OCR from within Zotero is a great feature, especially now that the Zotero Reader is making it possible for people to let go of external PDF tools. I hope another developer will take up the challenge of creating a way to OCR PDFs in Zotero 7 if Zotero OCR is going defunct.

erazlogo · December 7, 2023

FYI: directions on how to set up OCR in Zotero: https://publish.obsidian.md/history-notes/04+OCR+in+Zotero

DonnaCoxBaker · December 7, 2023

Thanks, @erazlogo. The instructions appear to require Zotero OCR, or is there a way without it?

erazlogo · December 7, 2023

Ah, that's right. So this will not work with Zotero 7. That is a pity.

erazlogo · December 7, 2023

OCR really should be sherlocked into Zotero.

DonnaCoxBaker · December 7, 2023

I would love it to be standard.

documut · January 21, 2024

We don't need an OCR option in Zotero. @adamsmith @dstillman

Zotero needs to handle non-OCRed PDF files with ease.

Even if I use the most advanced tools to create an OCRed PDF (and believe me, I tried on Windows and Linux), handwritten or similar documents (old, damaged real-world papers like newspapers, PDFs created with old fonts, etc.) won't give the wanted results.

I literally created a trained Tesseract language file for this purpose.

Zotero shouldn't let me do that. Zotero shouldn't allow me to waste my time with this.

I shouldn't have to spend so much time creating a trained Tesseract language file to scan a handwritten PDF document instead of reading it in Zotero. So I gave up trying.

In the end, if Zotero or any plugin developer created a plugin for OCR, it would be limited and wouldn't give the best results without lots of options like deskewing, unpapering, cleaning, rotating, etc. It would mostly use Tesseract, and Tesseract doesn't give me good enough results, even when I use a custom language file.

So my humble opinion and suggestion for the Zotero developers: let us draw straight lines that carry notes and tags. If I underline a word with the draw tool, I can write what it says in the note section and find it easily in the left panel. The line needs to be straight, so we need a straight-line drawing option, and it shouldn't create an image file.

LordMoff · June 7, 2024

+1
@documut I agree that there are many cases where OCR doesn't work well enough to be worth using. My wild guess is that you are from the History Department as well. But even there we have typewritten documents that would profit immensely from OCR, and numerous papers and scanned documents don't have OCR yet either.
And historians aren't the only ones using Zotero. OCR is a useful feature for a lot of people out there.
No one wants forced OCR, but just because it was a waste of time for you doesn't mean that the feature is useless.

aborel · June 7, 2024

For those who care: the Zotero-OCR plugin is now compatible with Zotero 7 https://github.com/UB-Mannheim/zotero-ocr

erikmarsh · June 8, 2024

Thank you! Just installed it. I find OCR really useful for older books, and all of those go directly into my Zotero library. It also helps with newer publications, unprocessed scans people post on Academia, etc.

There's no need to re-invent the wheel with this plugin (or type anything by hand!).

The best open-source OCR program I've used is OCRmyPDF: https://github.com/ocrmypdf. It also uses the same engine (Tesseract) but has a lot of other features (including compression) and excellent documentation. https://ocrmypdf.readthedocs.io/en/latest/index.html#

But it only runs on the command line. A really amazing Zotero plugin could just be a front end GUI for OCRmyPDF, and would be a lot less work than writing a new OCR program.
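
As a rough illustration of how thin such a front end could be, here is a minimal Python sketch (not an existing plugin) that only assembles and runs an OCRmyPDF command line. The function names are hypothetical; `-l`, `--deskew` and `--skip-text` are OCRmyPDF command-line options, and the sketch assumes the `ocrmypdf` executable is installed and on the PATH.

```python
import shutil
import subprocess


def build_ocrmypdf_command(src, dst, language="eng", deskew=True, skip_text=True):
    """Build the argument list for one OCRmyPDF run (hypothetical helper)."""
    cmd = ["ocrmypdf", "-l", language]
    if deskew:
        # Straighten crooked scans before recognition.
        cmd.append("--deskew")
    if skip_text:
        # Leave pages that already contain a text layer untouched.
        cmd.append("--skip-text")
    cmd += [str(src), str(dst)]
    return cmd


def run_ocr(src, dst, **options):
    """Run OCRmyPDF; raises if the tool is missing or exits with an error."""
    if shutil.which("ocrmypdf") is None:
        raise FileNotFoundError("ocrmypdf was not found on the PATH")
    subprocess.run(build_ocrmypdf_command(src, dst, **options), check=True)
```

A GUI would then only need to map its checkboxes and language field onto these arguments, for example `run_ocr("scan.pdf", "scan_ocr.pdf", language="deu")`.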

ForkNTalk · September 16, 2024

I tried installing the Zotero-OCR plugin and after a couple of tries got it to a stage where error messages no longer appeared when I attempted to OCR a PDF, but every time I tried to run it nothing happened. I even let it run overnight in case I had been too impatient, but no luck.

I had been using ABBYY FineReader 11 but found it time-consuming and at times tedious. Recently I downloaded the freeware PDF 24 (https://tools.pdf24.org/en/), whose suite of tools includes an OCR engine based on Tesseract. It preserves the original page image better than ABBYY and takes a fraction of the time, and the OCR results are respectable, though, like all OCR, dependent on the original image quality. If the original image is poor, I enhance it with Photoshop and a sharpening plugin to maximise the final OCR accuracy.

1. Select "open the file location"
2. Copy the PDF into the PDF 24 OCR module
3. Change the save path to the original PDF folder
4. Tick the "Remove Background" and "Deskew Pages" options
5. Click on Run
6. Check the new file in Acrobat
7. Rename the original file by putting an X in front of the file name, and rename the recognised file by removing the PDF 24 suffix
8. Check the PDF again in the Zotero Reader
9. Delete the old file
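
The renaming in step 7 can be scripted. This is a minimal Python sketch, assuming the recognised copy sits next to the original and differs only by a filename suffix. The `_ocr` suffix and the function name are placeholders, since the actual PDF 24 suffix is not given here. The old file is kept under its X name rather than deleted, so step 9 stays a manual decision.

```python
from pathlib import Path


def swap_in_ocr_copy(original: Path, suffix: str = "_ocr") -> Path:
    """Keep the original as X<name> and move the OCRed copy to the original name.

    `suffix` is a placeholder for whatever the OCR tool appends to the file name.
    Returns the path of the renamed (backup) original.
    """
    ocr_copy = original.with_name(original.stem + suffix + original.suffix)
    if not ocr_copy.is_file():
        raise FileNotFoundError(ocr_copy)
    backup = original.with_name("X" + original.name)
    original.rename(backup)    # book.pdf     -> Xbook.pdf
    ocr_copy.rename(original)  # book_ocr.pdf -> book.pdf
    return backup
```

Because the recognised file ends up under the original file name, Zotero keeps pointing at the same attachment path.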

Actual images, photos and graphics remain untouched in the recognised file, and tables seem to be well recognised. I've used this for a book of more than 500 pages with fuzzy text and yellowed pages, and while it didn't give me pristine white pages, it became a usable, searchable text, which is exactly what I needed.

aborel · September 16, 2024

I'm always interested to hear more about what went wrong with the plugin: complete error messages, screenshots, maybe the PDF itself.
(no worries if you prefer the workflow that works for you, of course)

ForkNTalk · September 16, 2024

Thanks. There was no error message; it's just that nothing happened. I checked my PC's Windows logs (nada) and also checked Task Manager at the time, but nothing showed up there. As I wrote in my original post, I left Zotero open overnight after starting the OCR, just in case I had been too impatient with other attempts. When I ran the same 260-page PDF, which had a page image format that was almost as good as you can get, through PDF 24, it took about 8 minutes to run. I'm happy to send you the file if you like.

aborel · September 16, 2024

You said there were messages in your first couple of attempts; they could have been useful (but maybe not). I guess Zotero-OCR crashed rather early in the process, so let's see if we can figure out why.
1. Was the PDF in your own library or a group library?
2. Was the PDF a regular attachment or a linked file?
3. Did the PDF have a parent item, or was it a standalone item?

ForkNTalk · September 16, 2024

The first error messages were because I had incorrectly pointed to the required files in settings; I sorted that out pretty quickly.

1. The PDF is in my library and not shared with other users
2. It was an attached file
3. It was a single child item attached to the usual bibliographic entry

Next time I have an unrecognised standalone PDF, I'll try it again in case it being an attachment is an issue.

aborel · September 16, 2024

Regular attachments should work just fine; linked files could fail in some cases.
Thanks for your response! I don't see any obvious problem in your situation.
I could take a closer look at your PDF if you are willing to share it. If it's available online anyway, you can post the link here, otherwise I'd recommend a one-time file sharing solution.

ForkNTalk · September 16, 2024

It's the PDF version of the item at https://archive.org/details/dli.ministry.11365

I agree that linked files would be a problem, because they are not actually in Zotero. Not to worry though, because my workaround does a pretty good job, and although there are a few steps it doesn't take very long. I mainly posted to let other users know that if they have difficulty with Zotero-OCR there is a low-cost (zero) alternative out there.

aborel · September 16, 2024

Thanks for the link, I'll see if anything weird happens with this one. And there's absolutely nothing wrong about presenting alternatives!

Note that linked files are not always a problem, they're fine more often than not. There are just a few use cases where they are less well supported in Zotero (and thus, by plugins).

aborel · September 16, 2024

Well, the OCR was successful on my side (Zotero 7, Zotero-OCR 0.8.1).
Are you running Zotero 6, by any chance? The plugin was broken for Z6 in the previous release.

ForkNTalk · September 17, 2024

Hi, I'm running the latest version of Zotero, 7.0.5 (64 bit), and Zotero-OCR 0.8.1.
I was using Tesseract 5.3.1.20230401, but just in case I downloaded and installed Tesseract 5.4.0, but still no joy. I've included a screenshot of my Zotero-OCR settings page and a screenshot of the Zotero version.

https://s3.amazonaws.com/zotero.org/images/forums/u5782013/m0ab4pjooed0e2l9u612.png

It's probably not relevant, but when I upgraded to Zotero v.7.0.5 (64 bit) I ran into considerable problems, which I outlined at https://forums.zotero.org/discussion/117221/upgrade-to-64-bit-problems#latest

ForkNTalk · September 17, 2024

I uploaded an incorrect settings image, which I deleted. This is the correct image:

https://s3.amazonaws.com/zotero.org/images/forums/u5782013/8lceluambbwfxuopsh8s.png

aborel · September 17, 2024

It's working for me with the exact same settings... we probably need to work on our error handling, the plugin should really make it easier for the user to see that something went wrong (and what).
If you have some more time to spend on this, could you check the error console (in the Zotero menu: Tools / Developer / Error console) and debug log (Help / Debug Output Logging / View Output) on a new attempt to OCR the PDF?

poettli · September 17, 2024

Having "undefined" as language doesn't seem to work for me. (While "eng" does.)

aborel · September 17, 2024

Ooooh, right, good catch! I have been blind to the fact that there is a difference between an empty field and a string with the "undefined" value. We should set an explicit "eng" default, instead of an implicit one...
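
To illustrate the difference, here is a minimal Python sketch of such a fallback. The plugin itself is written in JavaScript, so this is not its code, and the function name is hypothetical. An unset field, an empty string, and the literal text "undefined" all resolve to an explicit `eng`.

```python
def normalize_language(value, default="eng"):
    """Return a usable Tesseract language code, falling back to `default`.

    Treats None, empty or whitespace-only strings, and the literal strings
    "undefined" / "null" as "no language configured".
    """
    if value is None:
        return default
    text = str(value).strip()
    if text == "" or text.lower() in {"undefined", "null"}:
        return default
    return text
```

With this, `normalize_language("undefined")` and `normalize_language("")` both yield `eng`, while a real setting such as `deu+eng` passes through unchanged.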

ForkNTalk · September 17, 2024

I changed the language setting to "eng", then restarted my PC, closed all startup programs, and ran Zotero with Debug on. I saw the following two error messages:

17:02:02.659 does not support changing `store` on the fly. It is most likely that you see this error because you updated to Redux 2.x and React Redux 2.x which no longer hot reload reducers automatically. See https://github.com/reactjs/react-redux/releases/tag/v2.0.0 for the migration instructions. react-redux.js:881:13

17:02:20.777 NS_ERROR_FILE_UNRECOGNIZED_PATH: Component returned failure code: 0x80520001 (NS_ERROR_FILE_UNRECOGNIZED_PATH) [nsIFile.initWithPath] zotero-ocr.js:85

I checked the reactjs link on GitHub, but it totally mystified me. I suspect it is secret developers' business, so I stayed well away! Although I was an IT manager before I retired, my expertise lies in systems (raising them back from the dead), applications, networking, and some very basic VB and batch file programming, but nothing specialised.
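
Of the two messages, only the second names zotero-ocr.js. `nsIFile.initWithPath` typically rejects a path that is empty or not a full absolute path, so one plausible (unconfirmed) cause is one of the path fields in the plugin settings. This minimal Python sketch shows the kind of check that would surface it; the function name is hypothetical and this is not plugin code.

```python
from pathlib import Path


def check_configured_path(raw):
    """Return a list of problems with a configured executable path (empty if none)."""
    text = (raw or "").strip()
    if not text:
        return ["path is empty"]
    problems = []
    path = Path(text)
    if not path.is_absolute():
        # A bare name such as "tesseract" is not a full path.
        problems.append("path is not absolute")
    if not path.is_file():
        problems.append("no file exists at this path")
    return problems
```

Running it on each path field from the settings page would show whether any of them is blank, relative, or pointing at a file that is not there.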

aborel · September 17, 2024

There's plenty in the logs that is not related to Zotero-OCR, that's why I didn't go there right away :-)
Can you see a difference in the plugin behavior with the eng value (no quotes in the field)?

ForkNTalk · September 17, 2024

Sorry, when I typed it into the settings I did so just as eng, no quotes. I only added them to the post to set the value apart, but I probably should have made clear that the value in the settings was just eng.

aborel · September 17, 2024

No problem, I was just checking.

sdf81 · June 14, 2025

I would really second @erikmarsh’s idea of an OCRmyPDF plugin! Tesseract on its own does a good job, but when it comes to bad-quality PDFs, OCRmyPDF has much better pre-processing and irons out most of the errors. For example, it has the option to “--deskew” pages, which helps Tesseract do its job best.

Who could make a plugin for that command line tool or show me how to do it?

In the end, it only needs a GUI for that already great command line tool.