PDF text recognised in Preview but not Zotero
I've scanned some documents onto my computer using DeskScan. The saved PDF files contain recognised text that I can select and highlight in Preview on my Mac. However, Zotero does not seem to recognise the text in these files, on either Mac or iOS. Any idea what the issue might be?
We suggest using the “Area” annotation tool to draw a rectangle around the desired lines.
For example, a Japanese file OCRed in Acrobat is often highlightable in readers other than Zotero.
OCRed first line from https://www.maff.go.jp/primaff/kanko/nosoken/attach/pdf/195904_nsk13_2_09.pdf:
Acrobat: 愛知用水事業は農業をふくむ総合開発として最大のもの
Skim: 愛知用水事業は農業をふくむ総合開発として最大のも
Preview: 愛知用水事業は農業をふくむ総合開発として最大のもの
Zotero: 日以上に達し、気候上からは多毛作地帯に属する。また
Zotero is grabbing text from a completely different part of the page, and doesn't highlight the text to show what's being selected.
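To put a number on the mismatch, the four strings above can be compared directly. This is only a minimal sketch using Python's standard `difflib`; the names `extracted`, `reference` and `similarity` are made up for illustration, and using Acrobat as the baseline is an arbitrary choice, not a claim about how any of these readers work internally.

```python
from difflib import SequenceMatcher

# First-line text copied from each reader, exactly as listed above.
extracted = {
    "Acrobat": "愛知用水事業は農業をふくむ総合開発として最大のもの",
    "Skim": "愛知用水事業は農業をふくむ総合開発として最大のも",
    "Preview": "愛知用水事業は農業をふくむ総合開発として最大のもの",
    "Zotero": "日以上に達し、気候上からは多毛作地帯に属する。また",
}

# Arbitrary baseline for the comparison (an assumption of this sketch).
reference = extracted["Acrobat"]


def similarity(a, b):
    """Return a similarity ratio between 0 and 1 for two strings."""
    return SequenceMatcher(None, a, b).ratio()


# Score every reader's output against the baseline.
scores = {reader: similarity(reference, text) for reader, text in extracted.items()}

for reader, score in scores.items():
    print(f"{reader}: {score:.2f}")
```

Preview's text is identical to Acrobat's, Skim's is the same line minus the final character, and Zotero's shares almost nothing with it, which is what you would expect if the selection is coming from a different column of the page.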
This has always been the case with the PDF viewer in Zotero 6 and later, and the behavior is consistent across files.
This is a major issue for those of us who work with vertical text, of course, and it would be wonderful if a fix could be found.
I created a short screen recording (under 20sec) showing a typical example of how, in many vertical-text documents:
1. Highlights don't show for either text selection or annotation
2. Selected and annotated text don't match
https://youtu.be/OuY46DgKqU4
As always, highlighting this file works fine in standalone PDF readers such as Acrobat, Skim, Preview, etc.
Here is what you see:
* 00:00-00:10 I select and highlight the text
* 00:11-00:17 I click on the annotation in the sidebar, illustrating two things: 1) the highlight is still invisible, and 2) the annotation does not match the selected text
Don't know if it's helpful, but in Japanese, articles tend to be horizontal, while books are more often vertical.
Zotero OCR with Tesseract is also a total disappointment -- unless I'm missing something vital?
The documentation claims to have support for Japanese, both horizontal and vertical.
I set the language to "jpn" as directed in the docs.
The results are suboptimal:
https://youtu.be/tAit9eN67-A
In other words, the OCRed PDF is 100% useless. Highlighting selects different text, and even that is incorrect.
What am I doing wrong? Or is it just that the tool is really this bad?
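One thing that may be worth ruling out: Tesseract ships separate trained data for vertical Japanese (`jpn_vert`) alongside `jpn`, and it has a page segmentation mode (`--psm 5`) meant for a single block of vertically aligned text. Whether the Zotero OCR add-on can pass these through is not something confirmed here, so the sketch below just builds a standalone `tesseract` command line for testing a page image outside Zotero. The function name `build_tesseract_command` and the file names are hypothetical, and it assumes the `jpn_vert` data is installed.

```python
import shlex


def build_tesseract_command(image_path, output_base, vertical=True):
    """Build a tesseract command that writes a searchable PDF.

    Assumes the jpn / jpn_vert trained data files are installed.
    """
    # jpn_vert is Tesseract's trained data for vertical Japanese text.
    lang = "jpn_vert" if vertical else "jpn"
    # --psm 5: single uniform block of vertically aligned text.
    # --psm 3: fully automatic page segmentation (the default).
    psm = "5" if vertical else "3"
    # The trailing "pdf" config asks tesseract for <output_base>.pdf.
    return ["tesseract", image_path, output_base, "-l", lang, "--psm", psm, "pdf"]


# Hypothetical file names, for illustration only.
cmd = build_tesseract_command("page.png", "page_ocr")

# Print the command so it can be pasted into a terminal
# (or passed to subprocess.run as a list).
print(shlex.join(cmd))
```

If the resulting PDF selects correctly in Preview or Acrobat, the recognition step is probably fine and any remaining problem is on the viewer side; if not, the language or segmentation settings are the more likely culprit.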
A note of hope to end on: Japan's national library (NDL) has recently made its Japanese OCR tools open source (h/t Lani Alden at Berkeley):
https://github.com/ndl-lab
I will test these tools in the future, but the setup looks too involved for right now.
@martynas_b, has there been any progress on this issue since June? Thank you for any help you can provide here - it would be so wonderful to have this fixed.
I have upgraded to the latest Zotero 7 beta (7.0.0-beta.85+c0c00a00e).
The experience is improved, but problems remain.
I tested the same file as with Z6:
https://www.maff.go.jp/primaff/kanko/nosoken/attach/pdf/195904_nsk13_2_09.pdf
What's fixed:
* The correct text is selected by Zotero.
What's not:
* The selected text is not shown as selected or highlighted.
Once a highlight color is selected, a thin vertical line (not a highlight) *does* appear, and the text shows up in the left-hand annotation pane. It can be exported as an annotation.
However, because the selected text is not highlighted, it remains difficult and inconvenient to select the correct text.
Note:
I also tested several other PDFs with vertical text. The result was the same.
In contrast, horizontal text presents no problem. In other words, it looks like it's the verticality itself that is giving the Zotero PDF reader headaches.
Yes, that's because the OP was about (scanned and) OCRed documents.
In this case, I downloaded and OCRed the document in question, so it's not technically "scanned." If that's a problem, I can provide data from a scanned document instead (though I can assure you from many years of experience that the results are the same).