PDF OCR: searchable but non-selectable text
I'm struggling with a huge number of important scanned PDFs. I'm on Linux.
- I noticed that some OCR tools (ocrmypdf --force-ocr) give better results than the Zotero one. Is that normal? Is there no fix for that?
- I finally managed to get a well-OCRed PDF into Zotero, but there is a problem. In some paragraphs the text is searchable (if I search for a word within the PDF, the word appears highlighted in yellow in the paragraph), but I cannot select the word (highlighted in blue) or even the entire paragraph. Do you have any solution for that?
Thanks
2. Have you tried in a different PDF reader? I'd guess that the text embedding isn't quite right for the PDF -- in that case, there's not much Zotero can do.
And no: both use Tesseract, but the results are not the same. They are better with the ocrmypdf --force-ocr command.
We'll take a look at the file you provided.
Well, to make myself clearer: forget about the OCR plugin. My real problem is elsewhere: how do I make a PDF readable, searchable, selectable and highlightable in Zotero exactly as it is in external PDF viewers? It is not just one file; it is the same for hundreds of them. It happens systematically when I import an OCRed PDF into Zotero. Even when I run the OCR within Zotero, the extracted note contains all the text, but when I open the OCRed version, I cannot select even half of the paragraphs.
The only rational solution is to re-OCR those PDFs.
Just taking a screenshot of the PDF that was sent as an example and running the native macOS image OCR on it gives me good-quality text. So this is just a matter of the OCR software.
I would even say that all old OCRed papers should be re-OCRed before adding to Zotero.
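For anyone who wants to re-OCR a whole folder before importing, here is a minimal Python sketch. It assumes ocrmypdf is installed and on PATH, and it uses only the --force-ocr flag mentioned earlier in this thread. The function names and the folder layout are hypothetical, so adapt them to your own setup and test on copies first.

```python
from pathlib import Path
import shutil
import subprocess


def build_command(src: Path, dst: Path) -> list[str]:
    """Return the ocrmypdf invocation that rebuilds the text layer of src."""
    # --force-ocr rasterizes each page and runs OCR again,
    # replacing whatever text layer the file already had.
    return ["ocrmypdf", "--force-ocr", str(src), str(dst)]


def plan_batch(in_dir: Path, out_dir: Path) -> list[list[str]]:
    """One command per *.pdf in in_dir, writing same-named files to out_dir."""
    return [build_command(p, out_dir / p.name) for p in sorted(in_dir.glob("*.pdf"))]


def run_batch(in_dir: Path, out_dir: Path) -> None:
    """Run the planned commands one by one (hypothetical helper)."""
    if shutil.which("ocrmypdf") is None:
        raise RuntimeError("ocrmypdf is not on PATH")
    out_dir.mkdir(parents=True, exist_ok=True)
    for cmd in plan_batch(in_dir, out_dir):
        # check=False so that one failing file does not stop the whole batch.
        result = subprocess.run(cmd, check=False)
        if result.returncode != 0:
            print(f"failed: {cmd[2]}")
```

Calling run_batch(Path("scans"), Path("scans_ocr")) would write the re-OCRed copies to a separate folder, leaving the originals untouched. Both folder names are placeholders.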
I can get "good quality text" outside Zotero. The problem is that when I open the document WITHIN Zotero, the text layer is only partially selectable. However, the layer IS there, since I can search for words within the document, and since the note created when I re-OCR it with the Zotero plugin is of very good quality. But the problem remains: the PDF created this way is perfect outside Zotero, but wrong within it.
So it is a Zotero problem (I'd even say a problem with the internal PDF reader). This is why I'm asking the question here.
I probably know why this specific case performs better in some other PDF viewers: they regroup characters into words, lines and paragraphs and regenerate a new text flow. But that only helps for a fraction of OCRed PDFs, and doing so is the OCR software's job.
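To illustrate what "regrouping characters into words and lines" can mean, here is a rough Python sketch. It is not the algorithm used by Zotero or by any particular viewer. The Glyph type, the tolerances and the coordinate convention (y measured downward from the top of the page) are all assumptions made for the example.

```python
from dataclasses import dataclass


@dataclass
class Glyph:
    char: str
    x0: float  # left edge
    x1: float  # right edge
    y: float   # baseline, measured downward from the top of the page (assumption)


def group_lines(glyphs, y_tol=2.0):
    """Cluster glyphs whose baseline is within y_tol of the line's first glyph."""
    lines = []
    for g in sorted(glyphs, key=lambda g: (g.y, g.x0)):
        if lines and abs(lines[-1][0].y - g.y) <= y_tol:
            lines[-1].append(g)
        else:
            lines.append([g])
    # Within each line, restore left-to-right reading order.
    return [sorted(line, key=lambda g: g.x0) for line in lines]


def line_text(line, gap=1.5):
    """Join glyphs, inserting a space where the horizontal gap exceeds `gap`."""
    out = []
    prev = None
    for g in line:
        if prev is not None and g.x0 - prev.x1 > gap:
            out.append(" ")
        out.append(g.char)
        prev = g
    return "".join(out)


def rebuild_text(glyphs):
    """Rebuild a plain text flow, one output line per detected text line."""
    return "\n".join(line_text(line) for line in group_lines(glyphs))
```

A real viewer would also have to deal with columns, rotated text and paragraph breaks, which this sketch ignores.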
https://support.apple.com/guide/preview/interact-with-text-in-a-photo-prvw625a5b2c/mac
(This was just an example for extracting text — not something that you use to actually create text layers in PDFs.)
Zotero's annotation extraction should be equal to or better than ZotFile's was. If you think you're getting something worse, email us an example PDF.
Even though the scan quality is really poor, I can confirm that the text layer of this PDF is more usable in Acrobat Reader. We'll try to improve that in future releases.