PDF OCR: text is searchable but not selectable

TomMon · July 21, 2022

I'm struggling with a huge number of important scanned PDFs. I'm on Linux.
- I noticed that some OCR tools (ocrmypdf --force-ocr) give better results than the Zotero one. Is that normal? Is there no solution to that?
- I finally managed to get a well-OCRed PDF into Zotero, but there is a problem. Only some paragraphs are fully usable: the others are searchable (if I search for a word within the PDF, the word appears highlighted in yellow in the paragraph), but I cannot select the word (highlighted in blue) or even the entire paragraph. Do you have any solution for that?
Thanks

adamsmith · July 21, 2022

1. Zotero doesn't have an OCR tool. If you mean the zotero-ocr add-on, that does use the same OCR engine (Tesseract) as ocrmypdf, so it likely comes down to the exact settings used.
2. Have you tried in a different PDF reader? I'd guess that the text embedding isn't quite right for the PDF -- in that case, there's not much Zotero can do.

TomMon · July 21, 2022

Yep, the text is not perfectly right because some scans went wrong. Anyway, I don't see the point of using a different PDF reader. In Okular (and other PDF readers) everything is fine. But when I add the document to Zotero and open it within Zotero, I can search for words, but I can select barely half of the paragraphs. And my problem is that I need to use them within Zotero.
And no: both use Tesseract, but the results are not equal. It is better with the ocrmypdf --force-ocr command.

dstillman · July 21, 2022

If the same PDF is behaving worse in Zotero than in another PDF reader, send it to support@zotero.org with a link to this thread and we can take a look.

dstillman · July 21, 2022

@TomMon: To be clear, the email address is just for providing private files. If you have other questions or comments, you should post them here. And note that we can't provide any help with the OCR plugin — we don't have anything to do with that.

We'll take a look at the file you provided.

TomMon · July 21, 2022

Thanks for being so available!
To make myself clearer: forget about the OCR plugin. My real problem is somewhere else: how to make a PDF readable/searchable/selectable/highlightable exactly as it is in external PDF viewers? It is not just one file; it is like that for hundreds of them. It happens systematically when I import an OCRed PDF into Zotero. Even more: when I run OCR in Zotero, the extracted note contains everything, but when I open the OCRed version, I cannot select even half of the paragraphs.

dstillman · July 21, 2022

how to make a PDF readable/searchable/selectable/highlightable exactly as it is in external PDF viewers?

I mean, the premise is wrong. Acrobat does OK, but both macOS Preview and PDF Expert have major trouble with the text layer in this PDF. We might be able to do better than we're doing now, though — @martynas_b will have to take a look. But if you're having these problems across files you use regularly, I'd recommend looking into a different OCR tool that creates cleaner text layers.

martynas_b · July 22, 2022

It's unlikely that Zotero will get better at highlighting OCRed PDF text. There are too many ways a text layer can be messed up, so there is no point in trying to handle them all.

The only rational solution is to re-OCR those PDFs.

Just making a screenshot of the PDF, that was sent as an example, and using the native macOS image OCRing, gives me a good quality text. So this is just a matter of OCRing software.

I would even say that all old OCRed papers should be re-OCRed before adding to Zotero.
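
A minimal sketch of batch re-OCRing with ocrmypdf, the tool mentioned earlier in this thread. It assumes ocrmypdf is installed and on the PATH; the helper names and the folder layout are hypothetical, and the right mode depends on how the existing text layer was produced.

```python
import subprocess
from pathlib import Path


def build_ocrmypdf_command(src, dst, language="eng", mode="force"):
    """Return the argument list for one ocrmypdf run (hypothetical helper)."""
    # --force-ocr rasterizes every page and discards any existing text layer;
    # --redo-ocr keeps visible page content and replaces an earlier OCR layer;
    # --skip-text leaves pages that already contain text untouched.
    flags = {"force": "--force-ocr", "redo": "--redo-ocr", "skip": "--skip-text"}
    if mode not in flags:
        raise ValueError(f"unknown mode: {mode!r}")
    return ["ocrmypdf", flags[mode], "-l", language, str(src), str(dst)]


def reocr_folder(src_dir, dst_dir, **options):
    """Re-OCR every PDF in src_dir into dst_dir, leaving the originals alone."""
    dst_dir = Path(dst_dir)
    dst_dir.mkdir(parents=True, exist_ok=True)
    failed = []
    for src in sorted(Path(src_dir).glob("*.pdf")):
        cmd = build_ocrmypdf_command(src, dst_dir / src.name, **options)
        # A non-zero exit code means ocrmypdf could not process the file.
        if subprocess.run(cmd).returncode != 0:
            failed.append(src.name)
    return failed
```

Calling reocr_folder("scans", "scans_reocr", language="eng", mode="force") would write new copies next to the originals, so the results can be compared in Zotero before anything is replaced.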

TomMon · July 22, 2022

Well, thanks a lot for trying and for your time. Still, I've tested your suggestion, and the result is exactly the same. And I'm not sure I've been able to make clear what my problem is.
I can handle having "a good quality text" (outside Zotero). The problem is that when I open the document WITHIN Zotero, the text layer is only partially selectable. However, the layer IS there, since I can search for words within the document, and since when I re-OCR it with the Zotero plugin, the note it creates is of very good quality. But the problem remains: the pdf so created in zotero is perfect outside zotero, but wrong within it.

So it is a Zotero problem (I'd even say a problem with the internal PDF reader). This is why I'm asking the question here.

martynas_b · July 22, 2022

It's definitely not perfect, and the issues vary between different PDF viewers.

I probably know why this specific case performs better in some other PDF viewers: they regroup characters into words, lines and paragraphs and regenerate a new text flow. But that only helps for a fraction of OCRed PDFs, and doing so is the OCR software's job.
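
The regrouping idea can be sketched roughly as follows. This is an illustration only, not how Zotero or any particular viewer implements it; the Glyph type, its coordinate convention and the two thresholds are assumptions made for the example.

```python
from dataclasses import dataclass


@dataclass
class Glyph:
    """One character from a text layer, with a hypothetical bounding box."""
    char: str
    x0: float  # left edge
    x1: float  # right edge
    y: float   # baseline, growing downwards


def regroup(glyphs, line_tol=3.0, word_gap=2.0):
    """Rebuild reading order: glyphs -> lines -> words, ignoring stored order."""
    lines = []  # each entry: [reference baseline, list of glyphs]
    for g in sorted(glyphs, key=lambda g: g.y):
        # Glyphs whose baseline is within line_tol of the line's first glyph
        # are treated as belonging to that line.
        if lines and abs(g.y - lines[-1][0]) <= line_tol:
            lines[-1][1].append(g)
        else:
            lines.append([g.y, [g]])

    result = []
    for _, members in lines:
        members.sort(key=lambda g: g.x0)
        words = [members[0].char]
        for prev, cur in zip(members, members[1:]):
            # A horizontal gap wider than word_gap starts a new word.
            if cur.x0 - prev.x1 > word_gap:
                words.append(cur.char)
            else:
                words[-1] += cur.char
        result.append(" ".join(words))
    return result
```

A viewer that does something like this can select text sensibly even when the OCR tool stored the characters in a scrambled order, but fixed thresholds fail on skewed scans, multi-column layouts and mixed font sizes, which is part of why it only helps for some files.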

dstillman · July 22, 2022

But the problem remains: the pdf so created in zotero is perfect outside zotero, but wrong within it.

@TomMon: You keep saying this, but it's just not true. As I said, both macOS Preview and PDF Expert — two extremely widely used PDF readers — have trouble with text selection on this PDF. The problems are different, but the idea that this is just a normal PDF and Zotero alone has trouble on it is absurd.

erazlogo · July 22, 2022

Just making a screenshot of the PDF, that was sent as an example, and using the native macOS image OCRing, gives me a good quality text.

@martynas_b What is "the native macOS image OCRing"? Is it an app that comes with the Apple system? How does one run it?

dstillman · July 22, 2022

Live Text in macOS 12:

https://support.apple.com/guide/preview/interact-with-text-in-a-photo-prvw625a5b2c/mac

(This was just an example for extracting text — not something that you use to actually create text layers in PDFs.)

erazlogo · July 22, 2022

@dstillman Cool, thanks!

clousley · August 3, 2022

I am having this exact same problem. I work largely with print books (humanities) and I do a lot of scanning and OCR. I just tried three different scans of the same short excerpt and in each one there are pages and paragraphs that I cannot select once the file is uploaded to zotero, but I can select all of the text in that file in Adobe Reader. (If it is helpful to know, I am now largely using Adobe Scan on my phone to scan with OCR. I can use Acrobat Professional and a scanner, but only by logging into an old computer whose operating system I never updated.)

clousley · August 3, 2022

Additional note: my Zotero workflow has lost functionality because of this problem. Before, I could just scan, OCR, and annotate in Adobe Reader or Preview, then upload the file to Zotero and have ZotFile extract the annotations. Now, even annotations I make outside Zotero are mostly not picked up and not extracted. ("Add extracted text" message appears for each one, but I cannot get Zotero to add the text as an annotation.) I otherwise LOVE the PDF viewer in Zotero and its excellent capabilities for note-taking and annotations.

dstillman · August 3, 2022

I just tried three different scans of the same short excerpt and in each one there are pages and paragraphs that I cannot select once the file is uploaded to zotero, but I can select all of the text in that file in Adobe Reader.

Send the file to support@zotero.org with a link to this thread.

"Add extracted text"

Not sure what you're referring to here.

Zotero's annotation extraction should be equal to or better than ZotFile's was. If you think you're getting something worse, email us an example PDF.

martynas_b · August 8, 2022

@clousley We'll fix the issue of highlights not appearing, although the text extraction issue is complicated and depends on scan quality.

Even though the scan quality is really poor, I can confirm that the text layer of this PDF is more usable in Acrobat Reader. We'll try to improve that in future releases.

BrilliantStarfish · September 26, 2022

Encountering this same issue. I have a PDF of a book where the OCR text is selectable using my standard PDF viewer, but within Zotero it's not. I'm guessing that I'll have to figure out how to remove the OCR and add my own, but it's odd that other programs can read it, and that Zotero can search it fine, but that selecting it is not working.

dstillman · September 26, 2022

@BrilliantStarfish — same as above:

If the same PDF is behaving worse in Zotero than in another PDF reader, send it to support@zotero.org with a link to this thread and we can take a look.

lorenzo_avellino · May 8, 2023

Hello, I've got the same issue here. I sent you the file too. It would be really nice if you could fix it. Thank you for your work.

martynas_b · May 10, 2023

@lorenzo_avellino It seems the PDF file isn't OCRed at all. You need to process the file with OCRing software.
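
One rough way to check this before importing a file is to see whether any text can be extracted from each page. The sketch below assumes the third-party pypdf package for extraction; the function names and the 20-character threshold are arbitrary choices for illustration, and a page that passes this check can still have a badly structured text layer.

```python
def pages_without_text(page_texts, min_chars=20):
    """Return 1-based numbers of pages whose extracted text is (nearly) empty."""
    missing = []
    for number, text in enumerate(page_texts, start=1):
        # Count only non-whitespace characters; a scanned page with no
        # text layer yields an empty string (or nothing at all).
        if len("".join((text or "").split())) < min_chars:
            missing.append(number)
    return missing


def extract_page_texts(pdf_path):
    """Read the text of each page with pypdf (assumed to be installed)."""
    from pypdf import PdfReader  # third-party; imported lazily on purpose

    return [page.extract_text() for page in PdfReader(pdf_path).pages]
```

For example, pages_without_text(extract_page_texts("book.pdf")) lists the pages that still need OCR; "book.pdf" is a placeholder file name.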