Need for OCR plug-in
It appears that nothing is happening with Zotero OCR in preparation for version 7. The ability to OCR from within Zotero is a great feature, especially now that the Zotero Reader is making it possible for people to let go of external PDF tools. I hope another developer will take up the challenge of creating a way to OCR PDFs in Zotero 7 if Zotero OCR is going defunct.
Zotero needs to handle non-OCRed PDF files with ease.
Even with the most advanced tools for creating an OCRed PDF (and believe me, I have tried on Windows and Linux), handwritten or similar documents (old, damaged real-world papers such as newspapers, PDFs set in old fonts, etc.) don't give the wanted results.
I literally created a trained Tesseract language file for this purpose.
Zotero shouldn't let me do that. Zotero shouldn't allow me to waste my time on this.
I shouldn't have to spend so much time creating a trained Tesseract language file to scan a handwritten PDF document instead of reading it in Zotero. So I gave up trying.
In the end, even if Zotero or a plugin developer created an OCR plugin, it would be limited and wouldn't give the best results without lots of options such as deskewing, unpapering, cleaning, rotating, etc. It would mostly use Tesseract, and Tesseract doesn't give me good enough results even with a custom language file.
So my humble suggestion for the Zotero developers: let us draw straight lines that carry notes and tags. If I underline a word with the draw tool, I can write what it says in the note section and find it easily in the left panel. The line needs to be straight, so we need a draw-straight-line option, and it shouldn't create an image file.
@documut I agree that there are many cases where OCR doesn't work well enough to be worth the bother. My wild guess is that you are from the History Department as well. But even there we have typewritten documents that would profit immensely from OCR, and numerous papers and scanned documents don't have OCR yet either.
And Historians aren't the only ones using Zotero. OCR is a useful feature for a lot of people out there.
No one wants OCR forced on them, but just because it was a waste of time for you doesn't mean that the feature is useless.
There's no need to re-invent the wheel with this plug-in (or type anything by hand!).
The best open-source OCR program I've used is OCRmyPDF: https://github.com/ocrmypdf. It uses the same engine (Tesseract) but has a lot of other features (including compression) and excellent documentation. https://ocrmypdf.readthedocs.io/en/latest/index.html#
But it only runs on the command line. A really amazing Zotero plugin could just be a front-end GUI for OCRmyPDF, and would be a lot less work than writing a new OCR program.
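To give a rough idea of how thin such a front end could be, here is a minimal Python sketch that only assembles and runs an OCRmyPDF command line. It is not taken from any existing plugin. The function names and the choice of options are my own assumptions; `-l`, `--deskew` and `--rotate-pages` are OCRmyPDF command-line options, and the file names in the usage note are placeholders.

```python
import shutil
import subprocess


def build_ocrmypdf_command(src, dst, language="eng", deskew=True, rotate=True):
    """Return the argument list for one OCRmyPDF run (nothing is executed here)."""
    cmd = ["ocrmypdf", "-l", language]
    if deskew:
        cmd.append("--deskew")        # straighten crooked scans
    if rotate:
        cmd.append("--rotate-pages")  # fix pages scanned sideways or upside down
    cmd += [str(src), str(dst)]
    return cmd


def run_ocr(src, dst, **options):
    """Run OCRmyPDF if it is installed; return True if it exited successfully."""
    if shutil.which("ocrmypdf") is None:
        return False  # OCRmyPDF is not on the PATH
    result = subprocess.run(build_ocrmypdf_command(src, dst, **options))
    return result.returncode == 0
```

Calling `run_ocr("scan.pdf", "scan_ocr.pdf", language="deu")` would then behave like typing the equivalent command in a terminal; a GUI would only need to collect the file and the options.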
I had been using Abbyy FineReader 11 but found it time-consuming and at times tedious. Recently I downloaded the freeware PDF 24 (https://tools.pdf24.org/en/), whose suite of tools includes an OCR engine, based on Tesseract, that gives reasonable results. It preserves the original page image better than Abbyy and takes a fraction of the time. The OCR results are respectable and, like all OCR, dependent on the original image quality. If the original image is poor, I enhance it with Photoshop and a sharpening plugin, which helps me maximise the final OCR accuracy.
1. I select "open the file location"
2. Copy the PDF into the PDF 24 OCR module
3. Change the save path to the original pdf folder
4. I tick the "Remove Background" and "Deskew Pages" options
5. Then click on Run
6. I check the new file in Acrobat
7. I rename the original file by putting an X in front of the file name and rename the recognised file by removing the PDF 24 suffix.
8. Check the pdf again in the Zotero Reader
9. Delete the old file
Actual images, photos and graphics remain untouched in the recognised file, and tables seemed to be well recognised. I've used this for a book of more than 500 pages with fuzzy text and yellowed pages, and while it didn't give me pristine white pages, it became a usable, searchable text, which is exactly what I needed.
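The renaming in steps 7 and 9 of the list above could be scripted along these lines. This is only a sketch: the function name is made up, and the `_ocr` suffix is a placeholder, since the post does not say what suffix PDF 24 actually appends to the recognised file.

```python
from pathlib import Path


def swap_in_ocr_copy(original, suffix="_ocr", delete_old=False):
    """Give the OCRed sibling the original file name, keeping an X-prefixed backup.

    `suffix` is a placeholder for whatever the OCR tool appends to the file name.
    """
    original = Path(original)
    recognised = original.with_name(original.stem + suffix + original.suffix)
    if not recognised.exists():
        raise FileNotFoundError(recognised)
    backup = original.with_name("X" + original.name)
    original.rename(backup)      # step 7: park the old file under an X prefix
    recognised.rename(original)  # step 7: the OCRed file takes the original name
    if delete_old:
        backup.unlink()          # step 9: only after checking the result
    return backup
```

Leaving `delete_old` off by default keeps the manual checks in steps 6 and 8 in the loop before the old file is removed.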
(no worries if you prefer the workflow that works for you, of course)
1. Was the PDF in your own library or a group library?
2. Was the PDF a regular attachment or a linked file?
3. Did the PDF have a parent item, or was it a standalone item?
1. The PDF is in my library and not shared with other users
2. It was an attached file
3. It was a single child item attached to the usual bibliographic entry
Next time I have an unrecognised standalone PDF, I'll try it again in case it being an attachment is an issue.
Thanks for your response! I don't see any obvious problem in your situation.
I could take a closer look at your PDF if you are willing to share it. If it's available online anyway, you can post the link here; otherwise I'd recommend a one-time file sharing solution.
I agree that linked files would be a problem, because they are not actually in Zotero. Not to worry though, because my workaround does a pretty good job, and although there are a few steps, it doesn't take very long. I mainly posted to let other users know that if they have difficulty with Zotero-OCR, there is a zero-cost alternative out there.
Note that linked files are not always a problem; they're fine more often than not. There are just a few use cases where they are less well supported in Zotero (and thus, by plugins).
Are you running Zotero 6, by any chance? The plugin was broken for Z6 in the previous release.
I was using Tesseract 5.3.1.20230401, but just in case I downloaded and installed Tesseract 5.4.0. Still no joy. I've included a screenshot of my Zotero-OCR settings page and a screenshot of the Zotero version.
https://s3.amazonaws.com/zotero.org/images/forums/u5782013/m0ab4pjooed0e2l9u612.png
It's probably not relevant, but when I upgraded to Zotero v.7.0.5 (64-bit) I ran into considerable problems, which I outlined at https://forums.zotero.org/discussion/117221/upgrade-to-64-bit-problems#latest
https://s3.amazonaws.com/zotero.org/images/forums/u5782013/8lceluambbwfxuopsh8s.png
If you have some more time to spend on this, could you check the error console (in the Zotero menu: Tools / Developer / Error console) and debug log (Help / Debug Output Logging / View Output) on a new attempt to OCR the PDF?
17:02:02.659 does not support changing `store` on the fly. It is most likely that you see this error because you updated to Redux 2.x and React Redux 2.x which no longer hot reload reducers automatically. See https://github.com/reactjs/react-redux/releases/tag/v2.0.0 for the migration instructions. react-redux.js:881:13
17:02:20.777 NS_ERROR_FILE_UNRECOGNIZED_PATH: Component returned failure code: 0x80520001 (NS_ERROR_FILE_UNRECOGNIZED_PATH) [nsIFile.initWithPath] zotero-ocr.js:85
I checked the reactjs link on GitHub, but it totally mystified me. I suspect it is secret developers' business, so I stayed well away. Although I was an IT manager before I retired, my expertise lies in systems (raising them back from the dead), applications, networking, and some very basic VB and batch file programming, but nothing specialised.
Can you see a difference in the plugin behavior with the eng value (no quotes in the field)?
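For anyone who wants to sanity-check their settings values outside Zotero, here is a small Python sketch of the idea. It is not the plugin's code, and it is not a confirmed diagnosis of the NS_ERROR_FILE_UNRECOGNIZED_PATH error above. It assumes a Windows machine, and it assumes (as far as I understand nsIFile.initWithPath) that a path setting has to be a full absolute path without stray quotes. Both function names and the example path are hypothetical.

```python
import ntpath


def clean_setting(value):
    """Trim whitespace and one pair of matching surrounding quotes."""
    value = value.strip()
    if len(value) >= 2 and value[0] == value[-1] and value[0] in "\"'":
        value = value[1:-1].strip()
    return value


def looks_like_absolute_windows_path(value):
    """True if the cleaned value is an absolute Windows path such as C:\\Tools\\x.exe."""
    return ntpath.isabs(clean_setting(value))


# Hypothetical values, as they might be typed into a settings field.
language = clean_setting('"eng"')  # quotes removed, leaving eng
path_ok = looks_like_absolute_windows_path("C:\\Tools\\tesseract.exe")
```

If a value only passes the check after cleaning, entering the cleaned value in the settings field (for example eng without quotes) is a cheap thing to try.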