Unable to use OCR function.

I'm a new Zotero user, interested in being better able to read and mark up pdf documents. I have tried hard to set up OCR functions, on the Zotero 7.0.13 version I've downloaded, but without success. It seems most of the help items I've found are for older releases, making them difficult to follow. Also, there are multiple steps needed, and I'm not finding ways to verify each step.
I'd love to get some coaching help. Or pointers to resources I should be finding.
Tom Goldsmith TTGsmith@TGandA.com
  • There is not much to do, and everything is explained on the plugin's homepage: https://github.com/UB-Mannheim/zotero-ocr
  • Thanks for this suggestion.
    The instruction (at https://github.com/UB-Mannheim/zotero-ocr) looks like it covers the version 7 I'm working with. But I'm still in the fog.
    I'm not sure what a .xpi is or does, but the one I have (and point to) for pdftoppm is finding something that is apparently still un-zipped. (I have a faint recollection of downloading something that didn't un-zip.) Also, I see other instructions that look to me like un-familiar territory. This suggests I could be spending a LOT of time guessing, and learning what to do.
    .
    I would be very pleased to find a source that could walk me through this process. And even answer some other questions I have about using Zotero.
    ....Do you have ideas of where I might look?
    Tom Goldsmith
  • I'm a member of the zotero-ocr team, I'll be happy to help.

    Can you follow the instructions as exactly as possible, in sequence, and report the exact point where things become unclear for you?
  • I'm not aware of anything more detailed for Zotero OCR than the instructions they provide.
    If you have specific questions, just asking here is fine.

    A .xpi file is the format for all Zotero add-ons. You just download it and then install it into Zotero from the Tools menu. That's for the Zotero OCR add-on. While .xpi's are technically a form of .zip files, you should never unzip them as part of normal installation and usage.

    pdftoppm is _not_ an xpi format. If you're on Windows, it actually does come as a .zip and you do need to install it as described on the separate page:
    https://github.com/UB-Mannheim/zotero-ocr/wiki/Install-pdftoppm
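For anyone who wants a way to verify this step, here is a minimal sketch that checks whether the external programs can be found on the system PATH. It assumes the two executables are named `pdftoppm` and `tesseract` (the OCR engine the add-on relies on); the helper names `find_tool` and `report` are made up for this example. On Windows, a "not found" result is not necessarily a problem, because the tool may simply live in a folder that is not on PATH and is pointed to from the add-on's settings instead.

```python
import shutil


def find_tool(name):
    """Return the full path of an executable on PATH, or None if it is not found."""
    return shutil.which(name)


def report(tools=("pdftoppm", "tesseract")):
    """Map each tool name to its resolved path (or None when it is not on PATH)."""
    return {name: find_tool(name) for name in tools}


if __name__ == "__main__":
    for name, path in report().items():
        print(f"{name}: {path or 'NOT FOUND on PATH'}")
```

If both lines print a path, the programs are installed where the operating system can see them.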
  • The Zotero OCR documentation is intended to be sufficient. If it isn't (which is always possible), feedback is of course welcome.
  • Poettli, Aborel, AdamSmith;
    Thanks for these suggestions, but I'm still in trouble.
    Nina, my long-time friend and supporter, has also been working to help me, and we've run into a different bug. ….She has installed Zotero 7, but when she tries to add Zotero OCR, using the .xpi, she gets the following message:
    “Adobe Acrobat Reader could not open Zotero OCR 0.8.1.xpi.
    It is either not a supported file type or because the file has been damaged.”
    We suspect this message has something to do with her "free" Adobe Acrobat. But I also note that I too have that free Adobe Acrobat, and have NOT seen such a message.
    .
    I hope you can help us with Nina's issue. Then too, I can later tackle the problems I'm still having with my installation.
    Tom Goldsmith Mon 17Mar2025
  • She's trying to open the .xpi file with Acrobat.
    Again, you shouldn't open that file with any software. You should download it and then select it from the Tools --> Plugins menu in Zotero (that's how all Zotero plugins work; this isn't specific to Zotero OCR).
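If there is any doubt about whether the download itself is damaged (as the Acrobat message suggests), the file can be checked without unzipping it. The following is only a sketch: `check_xpi` is a hypothetical helper, and the check for a `manifest.json` entry rests on the assumption that Zotero 7 add-ons carry such a file at the top level of the archive.

```python
import zipfile


def check_xpi(path):
    """Return (ok, message) saying whether the file at `path` looks like an intact .xpi."""
    # An .xpi is a zip archive; a truncated or mislabeled download fails this test.
    if not zipfile.is_zipfile(path):
        return False, "not a zip archive; the download may be incomplete"
    with zipfile.ZipFile(path) as archive:
        bad = archive.testzip()  # name of the first corrupt entry, or None
        if bad is not None:
            return False, f"corrupt entry: {bad}"
        names = archive.namelist()
    # Assumption: a Zotero 7 add-on has a manifest.json at the archive root.
    if "manifest.json" not in names:
        return False, "no manifest.json found"
    return True, f"looks intact ({len(names)} entries)"
```

Calling `check_xpi` with the path of the downloaded file only reads it; nothing is extracted or changed.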
  • Poettli, Aborel, AdamSmith;
    I see what you mean about Acrobat trying to open the .xpi !!
    ….Thank you for being polite about that. : )
    .
    I’ve gotten further with the installation, and noted down a number of points I’ve gotten caught on. So would be glad to go through them with you. But I’m not sure my installation has been successful.
    When I’ve tried to use the OCR function the Help>>Debug Output Logging>>View Output shows a lot of activity (which I’m not prepared to interpret). And since I don’t find time-stamps in it, I wonder what might be current.
    ….Can you give me ideas of what may be happening?
    .
    Again, I’m still looking for instructions, as I get started. While I suspect some sort of coaching might be efficient for me.
    Tom Goldsmith Tue 18Mar2025
  • Debug output is meant for developers only. Normal users don't need to pay attention to what is written there. If the process was successful, you should see a .ocr file under the parent item. Is that the case?
    https://s3.amazonaws.com/zotero.org/images/forums/u2119014/rwbzjioaip0u8v85esph.png
  • When I look in the sub-collection where my .pdf shows, I don't see a .ocr file. Is this where I should be looking?
    When I was exploring the Debug output, I thought I saw indications of text recognition, but when I looked again for that log, it wasn't there.
  • I see our documentation might be missing a few sentences for users before it jumps into development information... thanks for the report, such feedback is always useful.

    In order to actually perform the recognition using the plugin, you need to right-click on a PDF attachment in Zotero and select "OCR selected PDF(s)" in the contextual menu. If the settings haven't been changed after the installation, the .ocr copy (essentially the same PDF as the original, with an underlying text layer that you can search, annotate, etc.) should appear in Zotero after a while (one second to a couple of minutes depending on the length of the document).
    The plugin doesn't show its progress while running (except by adding some page-1, page-2... attachments depending on the settings, as in poettli's screenshot). We are aware that progress feedback would be helpful for many users, and we are considering ways to improve that.
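To give an idea of why this can take a while, here is a rough sketch of the kind of two-step pipeline involved: pages are first rendered to images with pdftoppm, then handed to Tesseract to produce a searchable PDF. This is an illustration under assumptions, not the plugin's actual code. The function name `ocr_commands`, the file names `page` and `image-list.txt`, the 300 dpi resolution and the `eng` language are all chosen for the example. The function only builds the command lines and runs nothing.

```python
from pathlib import Path


def ocr_commands(pdf_path, lang="eng", dpi=300):
    """Build (but do not run) the two command lines of a pdftoppm + Tesseract pipeline."""
    pdf = Path(pdf_path)
    page_prefix = pdf.with_name("page")           # pdftoppm writes page-1.png, page-2.png, ...
    image_list = pdf.with_name("image-list.txt")  # text file listing the page images, one per line
    ocr_base = pdf.with_name(pdf.stem + ".ocr")   # Tesseract appends ".pdf" to this base name

    # Step 1: render every page of the PDF to a PNG image.
    render = ["pdftoppm", "-png", "-r", str(dpi), str(pdf), str(page_prefix)]
    # Step 2: OCR the listed images and write a PDF with a text layer.
    # (The image list has to be written between the two steps.)
    recognize = ["tesseract", str(image_list), str(ocr_base), "-l", lang, "pdf"]
    return render, recognize
```

For a file named `paper.pdf`, the second command would produce `paper.ocr.pdf`, which matches the ".ocr copy" described above.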

    I hope this helps, don't hesitate to ask again if you need further guidance. If so, it would be great if you could explain exactly what you have done, step by step.
  • Poettli, Aborel, AdamSmith;
    I know documentation is an “evolving” product, and a deep-consuming effort, where there’s no good substitute for a step-by-step “re-test” with a new user. ....So I’m glad if I can help in that effort.
    .
    I don’t believe I’ve changed the OCR settings as I’ve installed.
    But so far, I haven’t yet seen the “Note on regression screen” item, with the “… page-1, page-2, ...” list poettli cited yesterday. (Nor do I grasp how “two parents” might fit in.)
    But I did find, in the Zotero far-right panel, a “3 attachments” piece.
    With my install, per this clip:
    >>>>>Ooops I apparently can't show "clipped" picture in this forum.
    Can I e-mail the screen clip to you??
    .
    The top part (of my clip) shows the second page of the .pdf I’m working with. Then there are three lines below that connect to .pdf's. The first connects to the raw .pdf version, and the second and third lines connect to the OCR’d versions I’ve been seeking. (Two lines, because I must have tried twice to do the OCR process, not having realized Zotero was already busy at work.)
    But is this where I should be looking? I’ve expected to find, in the larger Zotero center panel, either my beginning .pdf but with OCR detail now present, or a second line, with the same file name, but supplemented with an OCR-identifying suffix.
    Perhaps you can clarify this for me, or let me know where I can find version=7 documentation, with details I’ve missed?
    .
    The good news (!!!) is that I now have gotten Zotero OCR to function for me, and thus have an OCR-detailed .pdf to work with. So thank you VERY much for your helping me get this far.
    Tom Goldsmith Wed 19Mar2025
  • You can share your screenshot on a platform such as https://postimages.org/ and post the link here.
  • Aborel;
    Here's the PostImage link, to my screen shot.
    https://postimg.cc/pmmgFzxV
    Tom Goldsmith
  • Can you post a screenshot that shows the item and its attachments in the central pane? I'd like to see the full name of the attachments, here the ending is unfortunately not visible.
  • Aborel;
    Here's the PostImage link, for the center screen shot, with the full file name.
    https://postimg.cc/pmTB6kxN
    Tom Goldsmith
  • We're getting closer... please click on the arrow on the right-hand side of the item so that we can actually see the attachments.
  • Aborel;
    Clicking the arrow at the LEFT of the item name in the center panel gave me expanded information. : )
    See: https://postimg.cc/1gLrCYpW
    ....I don't find anything about "two parents" or pages.
    So I remain curious on that aspect.
    Tom Goldsmith
  • Great! Just a quick note because I don't have much time right now: the plugin has worked!
    You have the .ocr copy of the original PDF (two copies, so you can delete one), as well as a note that also contains the recognized text (also two identical copies).
  • I have more time for a detailed complement now.

    "Note on regression screen in the case of two parents" is the title of an item in poettli's library, which means that you won't find it in your own library. The point was to show what the output could look like in the main Zotero pane, but of course the parent item will be different on your system. Now perhaps you didn't have a parent item originally: in such a case, the plugin creates one automatically, with the title based on the PDF file name (because we can't guess otherwise). In Zotero, you can directly add a PDF to your library, but the application will only create a parent if it identifies some crucial metadata in the file. That metadata will not be present in a PDF without any text layer. In the Zotero OCR plugin, we have decided that running the plugin on a PDF without a parent item would create too much mess, so we create that parent. It's also the right thing to do if you want to cite the document using Zotero's normal functionality.

    Using the default settings of the plugin, the first pages of the PDF (up to 5 pages) should have been added as image attachments. According to your screenshot, this hasn't happened; perhaps this is a bug that we need to look into. But the .ocr copies of your original PDF and the associated notes indicate that the plugin has worked successfully, so it feels like a minor issue.

    The .ocr attachments are copies of your original PDF; in the default settings we prefer not to overwrite the original. When you are satisfied with the OCRed copies, you can delete the original.
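As a way to double-check outside of Zotero that the copies exist (and to spot duplicates from running the OCR twice), here is a small sketch that scans the Zotero storage folder. It makes two assumptions: that the OCRed copies have names ending in `.ocr.pdf`, and that each attachment sits in its own subfolder of the storage directory. `find_ocr_copies` is a name invented for this example, and the script only reads the folder; deleting attachments is better done from within Zotero.

```python
from pathlib import Path


def find_ocr_copies(storage_dir):
    """Group files ending in .ocr.pdf by file name, searching one folder level deep."""
    found = {}
    for pdf in sorted(Path(storage_dir).glob("*/*.ocr.pdf")):
        found.setdefault(pdf.name, []).append(pdf)
    return found


if __name__ == "__main__":
    # Assumption: the Zotero data directory is in its default place, ~/Zotero.
    storage = Path.home() / "Zotero" / "storage"
    for name, paths in find_ocr_copies(storage).items():
        note = " (duplicate copies)" if len(paths) > 1 else ""
        print(f"{name}: {len(paths)} file(s){note}")
```

A file name listed with more than one file corresponds to the situation in the screenshot, where the OCR was run twice on the same PDF.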

    I hope this helps. I would perhaps recommend that you familiarize yourself with Zotero's standard features before you add more plugins, but at least I can reassure you that most of your system seems to be working as designed :-) This has also given me a few ideas to improve the Zotero OCR plugin documentation, so I'm grateful for your taking the time to answer my questions. My thanks to all other participants for keeping the discussion alive while I was busy with other things :-)

    If something is still unclear, or if you have further questions, don't hesitate to ask.