Need for OCR plug-in

It appears that nothing is happening with Zotero OCR in preparation for version 7. The ability to OCR from within Zotero is a great feature, especially now that the Zotero Reader is making it possible for people to let go of external PDF tools. I hope another developer will take up the challenge of creating a way to OCR PDFs in Zotero 7 if Zotero OCR is going defunct.
  • Thanks, @erazlogo. The instructions appear to require Zotero OCR, or is there a way without it?
  • Ah, that's right. So this will not work with Zotero 7. That is a pity.
  • OCR really should be sherlocked into Zotero
  • I would love it to be standard.
  • We don't need an OCR option in Zotero. @adamsmith @dstillman

    Zotero needs to handle non-OCRed PDF files with ease.

    Even if I use the most advanced tools to create an OCRed PDF (and believe me, I tried on Windows and Linux), handwritten or similar documents (old, damaged real-world papers such as newspapers, PDFs created with old fonts, etc.) wouldn't give the wanted results.

    I literally created a trained Tesseract language file for this purpose.

    Zotero shouldn't let me do that. Zotero shouldn't allow me to waste my time with this.

    I shouldn't have to spend so much time creating a trained Tesseract language file to scan a handwritten PDF document instead of reading it in Zotero. So I gave up trying that.

    In the end, if Zotero or any plugin developer created a plugin for OCR, it would be limited and wouldn't give the best results without lots of options like deskewing, unpapering, cleaning, rotating, etc. It would mostly use Tesseract, and Tesseract doesn't give me good enough results, even when I used a custom language file.

    So my humble opinion and suggestion for the Zotero developers: let us draw straight lines that carry notes and tags. If I underline a word with the draw tool, I can write what it says in the note section and find it easily in the left panel. By the way, that line needs to be straight, so we need a draw-straight-line option, and it shouldn't create an image file.
  • +1
    @documut I agree that there are many cases where OCR doesn't work well enough to be worth using. My wild guess is that you are from a history department as well. But even there we have typewritten documents that would profit immensely from OCR, and numerous papers and scanned documents don't have OCR yet either.
    And historians aren't the only ones using Zotero. OCR is a useful feature for a lot of people out there.
    No one wants OCR to be forced on them, but just because it was a waste of time for you doesn't mean that the feature is useless.
  • For those who care: the Zotero-OCR plugin is now compatible with Zotero 7 https://github.com/UB-Mannheim/zotero-ocr
  • Thank you! Just installed it. I find OCR really useful for older books and all of those go directly into my Zotero library. And for newer publications, unprocessed scans people post on academia, etc.

    There's no need to reinvent the wheel with this plug-in (or type anything by hand!).

    The best open-source OCR program I've used is OCRmyPDF: https://github.com/ocrmypdf. It uses the same engine (Tesseract) but has a lot of other features (including compression) and excellent documentation: https://ocrmypdf.readthedocs.io/en/latest/index.html#

    But it only runs on the command line. A really amazing Zotero plugin could just be a front end GUI for OCRmyPDF, and would be a lot less work than writing a new OCR program.
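
To illustrate the front-end idea, here is a minimal Python sketch of how a wrapper might assemble an OCRmyPDF command line. The function name `build_ocrmypdf_command` and the `_ocr.pdf` output naming are assumptions made for this example, not part of any existing plugin. The flags `--language`, `--deskew` and `--clean` are OCRmyPDF options; `--clean` additionally requires the external unpaper tool to be installed.

```python
import shlex
from pathlib import Path


def build_ocrmypdf_command(pdf_path, language="eng", deskew=True, clean=False):
    """Return the argument list a front end could hand to subprocess.run()."""
    source = Path(pdf_path)
    # Hypothetical naming scheme: write "<name>_ocr.pdf" next to the original.
    target = source.with_name(source.stem + "_ocr.pdf")
    command = ["ocrmypdf", "--language", language]
    if deskew:
        command.append("--deskew")  # straighten crooked scans
    if clean:
        command.append("--clean")  # relies on the external "unpaper" tool
    command += [str(source), str(target)]
    return command


if __name__ == "__main__":
    # Only prints the command; nothing is executed here.
    print(shlex.join(build_ocrmypdf_command("old book.pdf", language="deu")))
```

A graphical front end would mostly be a form that fills in these arguments and then runs the resulting command, which is why it should be less work than writing a new OCR program.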
  • I tried installing the Zotero-OCR plug-in and after a couple of tries got it to a stage where error messages no longer appeared when I attempted to OCR a PDF, but every time I tried to run it nothing happened. I even let it run overnight in case I had been too impatient. No luck.

    I had been using Abbyy FineReader 11 but found it time-consuming and at times tedious. Recently I downloaded the freeware PDF 24 (https://tools.pdf24.org/en/), whose suite of tools includes an OCR engine based on Tesseract. It preserves the original page image better than Abbyy and takes a fraction of the time. The OCR results are respectable and, as with all OCR, dependent on the original image quality. If the original image is poor, I enhance it with Photoshop and a sharpening plugin to maximise the final OCR accuracy.

    1. I select to "open the file location"
    2. Copy the pdf into the PDF 24 OCR module
    3. Change the save path to the original pdf folder
    4. I tick the "Remove Background" and "Deskew Pages" options
    5. Then click on Run
    6. I check the new file in Acrobat
    7. I rename the original file by putting an X in front of the file name and rename the recognised file by removing the PDF 24 suffix.
    8. Check the pdf again in the Zotero Reader
    9. Delete the old file

    Actual images, photos and graphics remain untouched in the recognised file, and tables seemed to be well recognised. I've used this for a book of more than 500 pages with fuzzy text and yellowed pages, and while it didn't give me pristine white pages, it produced a usable, searchable text, which is exactly what I needed.
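
The renaming in step 7 of the workflow above could be scripted. The following Python sketch is one way to do it, under stated assumptions: the function name `swap_in_ocr_copy` is hypothetical, and the `-ocr` suffix is a placeholder, since the thread does not say which suffix PDF 24 actually appends.

```python
from pathlib import Path


def swap_in_ocr_copy(original, ocr_suffix="-ocr"):
    """Keep the original as 'X<name>.pdf' and give the OCR copy its name.

    ocr_suffix is a placeholder; adjust it to whatever the OCR tool appends.
    """
    original = Path(original)
    ocr_copy = original.with_name(original.stem + ocr_suffix + original.suffix)
    if not ocr_copy.exists():
        raise FileNotFoundError(f"no OCR output found at {ocr_copy}")
    backup = original.with_name("X" + original.name)
    original.rename(backup)    # put an X in front of the original file name
    ocr_copy.rename(original)  # the recognised file takes over the old name
    return backup
```

The function returns the path of the X-prefixed backup, so it can be deleted by hand (step 9) once the new file has been checked in the Zotero Reader.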
  • I'm always interested to hear more about what went wrong with the plugin: complete error messages, screenshots, maybe the PDF itself.
    (no worries if you prefer the workflow that works for you, of course)
  • Thanks. There was no error message; it's just that nothing happened. I checked my PC's Windows logs (nada) and also checked Task Manager at the time, but nothing showed up there. As I wrote in my original post, I left Zotero open overnight after starting the OCR, just in case I had been too impatient with other attempts. When I ran the same 260-page PDF, which had a page image format that was almost as good as you can get, through PDF 24, it took about 8 minutes to run. I'm happy to send you the file if you like.
  • You said there were messages in your first couple of attempts; they could have been useful (but maybe not). I guess Zotero-OCR crashed rather early in the process, so let's see if we can figure out why.
    1. Was the PDF in your own library or a group library?
    2. Was the PDF a regular attachment or a linked file?
    3. Did the PDF have a parent item, or was it a standalone item?
  • The first error messages appeared because I had incorrectly pointed to the required files in the settings. I sorted that out pretty quickly.

    1. The PDF is in my library and not shared with other users
    2. It was an attached file
    3. It was a single child item attached to the usual bibliographic entry

    Next time I have an unrecognised standalone PDF, I'll try it again in case it being an attachment is an issue.
  • edited September 16, 2024
    Regular attachments should work just fine; linked files could fail in some cases.
    Thanks for your response! I don't see any obvious problem in your situation.
    I could take a closer look at your PDF if you are willing to share it. If it's available online anyway, you can post the link here, otherwise I'd recommend a one-time file sharing solution.
  • It's the PDF version of the item at https://archive.org/details/dli.ministry.11365

    I agree that linked files would be a problem, because they are not actually in Zotero. Not to worry though, because my workaround does a pretty good job, and although there are a few steps it doesn't take very long. I mainly posted to let other users know that if they have difficulty with Zotero-OCR, there is a low-cost (zero) alternative out there.
  • Thanks for the link, I'll see if anything weird happens with this one. And there's absolutely nothing wrong about presenting alternatives!

    Note that linked files are not always a problem, they're fine more often than not. There are just a few use cases where they are less well supported in Zotero (and thus, by plugins).
  • Well, the OCR was successful on my side (Zotero 7, Zotero-OCR 0.8.1).
    Are you running Zotero 6, by any chance? The plugin was broken for Z6 in the previous release.
  • edited September 17, 2024
    Hi, I'm running the latest version of Zotero, 7.0.5 (64 bit), and Zotero-OCR 0.8.1.
    I was using Tesseract 5.3.1.20230401, but just in case I downloaded and installed Tesseract 5.4.0. Still no joy. I've included a screenshot of my Zotero-OCR settings page and a screenshot of the Zotero version.

    https://s3.amazonaws.com/zotero.org/images/forums/u5782013/m0ab4pjooed0e2l9u612.png

    It's probably not relevant, but when I upgraded to Zotero v.7.0.5 (64 bit) I ran into considerable problems, which I outlined at https://forums.zotero.org/discussion/117221/upgrade-to-64-bit-problems#latest
  • I uploaded an incorrect settings image, which I have deleted. This is the correct image:

    https://s3.amazonaws.com/zotero.org/images/forums/u5782013/8lceluambbwfxuopsh8s.png
  • edited September 17, 2024
    It's working for me with the exact same settings... we probably need to work on our error handling, the plugin should really make it easier for the user to see that something went wrong (and what).
    If you have some more time to spend on this, could you check the error console (in the Zotero menu: Tools / Developer / Error console) and debug log (Help / Debug Output Logging / View Output) on a new attempt to OCR the PDF?
  • edited September 17, 2024
    Having "undefined" as language doesn't seem to work for me. (While "eng" does.)
  • Ooooh, right, good catch! I have been blind to the fact that there will be a difference between an empty field and a string with the "undefined" value. We should set an explicit "eng" default, instead of an implicit one...
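
The difference between an empty field and the literal string "undefined" can be handled by normalising the stored value before it is used. The sketch below is in Python purely for illustration (the plugin itself is not written in Python), and the function name `resolve_ocr_language` is hypothetical. It assumes that an unset preference may arrive as a missing value, an empty string, or the text "undefined", and falls back to an explicit "eng" in each case.

```python
DEFAULT_LANGUAGE = "eng"


def resolve_ocr_language(raw_value, default=DEFAULT_LANGUAGE):
    """Return a usable Tesseract language code for a stored preference."""
    if raw_value is None:
        return default
    value = str(raw_value).strip()
    # Assumption: an unset preference can be stored as the literal text
    # "undefined" (or "null"), which is not a valid language code.
    if value == "" or value.lower() in {"undefined", "null"}:
        return default
    return value
```

With this kind of check, a user who never touches the language field would get "eng", while an explicit value such as "deu" or "eng+fra" passes through unchanged.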
  • I changed the language setting to "eng", then restarted my PC, closed all startup programs, and ran Zotero with Debug on. I saw the following two error messages:

    17:02:02.659 <Provider> does not support changing `store` on the fly. It is most likely that you see this error because you updated to Redux 2.x and React Redux 2.x which no longer hot reload reducers automatically. See https://github.com/reactjs/react-redux/releases/tag/v2.0.0 for the migration instructions. react-redux.js:881:13

    17:02:20.777 NS_ERROR_FILE_UNRECOGNIZED_PATH: Component returned failure code: 0x80520001 (NS_ERROR_FILE_UNRECOGNIZED_PATH) [nsIFile.initWithPath] zotero-ocr.js:85

    I checked the reactjs link on GitHub, but it totally mystified me. I suspect it is secret developers' business, so I stayed well away. Although I was an IT manager before I retired, my expertise lies in systems (raising them back from the dead), applications, networking, and some very basic VB and batch file programming, but nothing specialised.
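
The second message, NS_ERROR_FILE_UNRECOGNIZED_PATH from nsIFile.initWithPath, generally indicates that the string passed in was not accepted as an absolute file path. Which setting triggered it here is not known from the thread. As a hedged illustration only, the Python sketch below shows the kind of sanity checks one could run on a configured executable path before using it; the function name `check_tool_path` and the specific checks are assumptions for this example, not the plugin's behaviour.

```python
from pathlib import Path


def check_tool_path(raw_path):
    """Return a list of problems found with a configured executable path."""
    text = (raw_path or "").strip()
    if not text:
        return ["path is empty"]
    problems = []
    if text != raw_path:
        problems.append("path has leading or trailing whitespace")
    if text[0] in "\"'" or text[-1] in "\"'":
        problems.append("path is wrapped in quotes")
        text = text.strip("\"'")
    path = Path(text)
    if not path.is_absolute():
        problems.append("path is not absolute")
    elif not path.is_file():
        problems.append("no file exists at this path")
    return problems
```

An empty list means none of these particular problems was found; it does not prove that the path is the right executable.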
  • edited September 17, 2024
    There's plenty in the logs that is not related to Zotero-OCR, that's why I didn't go there right away :-)
    Can you see a difference in the plugin behavior with the eng value (no quotes in the field)?
  • Sorry, when I typed it into the settings I did so just as eng, no quotes. I only added them in the post to set the value apart, but I probably should have made clear that the value in the settings was just eng.
  • No problem, I was just checking.