PDF text recognised in Preview but not Zotero

michael.amherst · February 27, 2023

I've scanned some documents onto my computer using DeskScan. The text in the saved PDF files is recognised, and I can highlight it in Preview on my Mac. However, Zotero does not seem to recognise the text in these files, on either Mac or iOS. Any idea what the issue might be?

dstillman · February 27, 2023

If you email the PDF to support@zotero.org with a link to this thread, we can take a look.

michael.amherst · February 27, 2023

Thanks. Have done so.

dstillman · February 27, 2023

We didn't receive anything from you. If the file is large, you should upload it somewhere and send us a link.

michael.amherst · February 27, 2023

Ok just done via WeTransfer.

michael.amherst · March 1, 2023

Any luck with finding the issue?

martynas_b · March 1, 2023

@michael.amherst Zotero PDF reader currently doesn't support highlighting rotated text.

michael.amherst · March 1, 2023

I don't believe it is rotated. It is just as it was scanned in.

martynas_b · March 1, 2023

The problem is that the text is not aligned to horizontal or vertical axis.

michael.amherst · March 1, 2023

So is this likely to be a problem with any scanned documents?

martynas_b · March 1, 2023

OCR software can normally fix the rotation, but in your case I guess it's complicated because two physical pages are in a single image.

michael.amherst · March 1, 2023

That's what I don't understand - it has already been through OCR. Is this issue with text that is not aligned to the horizontal or vertical axis going to be changed in a future update? I now store all my PDFs on Zotero servers so I can access them all in the app. But if not, then I'll have to revert to cloud storage and use Adobe or similar as a PDF reader.

martynas_b · March 2, 2023

The scan quality of this OCR-ed PDF is very poor. Other PDF viewers may let you highlight the text, but the result will not be accurate either. The Zotero PDF reader does more than other PDF readers: for example, it extracts the annotation text. Text lines that are not aligned with the axes of the coordinate system are problematic and incompatible with the current Zotero annotation architecture. There might be some workarounds to make these PDF files more usable, but we are not sure when or if we will implement them. These kinds of PDF files are usually a very small fraction of all PDFs.

We suggest using the “Area” annotation tool to draw a rectangle around the desired lines.
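
To illustrate what "not aligned with the coordinate system axes" can mean in practice, here is a minimal sketch in Python. It is not Zotero's implementation. It assumes a text run is positioned by a PDF-style matrix `[a b c d e f]`, where `a` and `b` give the direction of the baseline, and it treats a run as axis-aligned when its angle is within a small tolerance of a multiple of 90 degrees. The function names and the 0.5 degree tolerance are hypothetical.

```python
import math


def rotation_degrees(a: float, b: float) -> float:
    """Baseline angle in degrees, from the first two entries of a
    PDF-style matrix [a b c d e f]."""
    return math.degrees(math.atan2(b, a))


def is_axis_aligned(a: float, b: float, tolerance_deg: float = 0.5) -> bool:
    """True if the baseline is within tolerance_deg of 0, 90, 180 or 270 degrees.

    The tolerance is an illustrative assumption, not a value taken from Zotero.
    """
    angle = rotation_degrees(a, b) % 90.0
    # Distance to the nearest multiple of 90 degrees.
    return min(angle, 90.0 - angle) <= tolerance_deg


# A slightly skewed scan: baseline tilted by 2 degrees.
skew = math.radians(2.0)
print(is_axis_aligned(1.0, 0.0))                        # horizontal text
print(is_axis_aligned(0.0, 1.0))                        # text rotated by 90 degrees
print(is_axis_aligned(math.cos(skew), math.sin(skew)))  # skewed text
```

Under these assumptions, horizontal text and text rotated by exactly 90 degrees both count as aligned, while a scan tilted by a couple of degrees does not.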

nathan.hopson · March 15, 2023

With all due respect to @martynas_b, Zotero struggles more than other readers with vertical OCRed text, even when it is perfectly aligned on the vertical axis.

For example, a Japanese file OCRed in Acrobat is often highlightable in readers other than Zotero.

OCRed first line from https://www.maff.go.jp/primaff/kanko/nosoken/attach/pdf/195904_nsk13_2_09.pdf:
Acrobat: 愛知用水事業は農業をふくむ総合開発として最大のもの
Skim: 愛知用水事業は農業をふくむ総合開発として最大のも
Preview: 愛知用水事業は農業をふくむ総合開発として最大のもの
Zotero: 日以上に達し、気候上からは多毛作地帯に属する。また

Zotero is grabbing text from a completely different part of the page, and doesn't highlight the text to show what's being selected.

This has always been the case with the v. 6+ PDF viewer, and is consistent behavior across files.

This is a major issue for those of us who work with vertical text, of course, and it would be wonderful if a fix could be found.

dstillman · March 15, 2023

@nathan.hopson: Can you send the OCRed version of that file to support@zotero.org with a link to this thread?

nathan.hopson · March 15, 2023

Sent

martynas_b · March 16, 2023

@nathan.hopson We should be able to improve that. Thanks for the PDF file.

nathan.hopson · March 16, 2023

Thanks as always to the Zotero team for being so responsive to user needs and feedback!

nathan.hopson · March 22, 2023

Follow up:

I created a short screen recording (under 20sec) showing a typical example of how, in many vertical-text documents:
1. Highlights don't show for either text selection or annotation
2. The selected text and the annotated text don't match

https://youtu.be/OuY46DgKqU4

As always, highlighting this file works fine in standalone PDF readers such as Acrobat, Skim, Preview, etc.

Here is what you see:
* 00:00-00:10 I select and highlight the text
* 00:11-00:17 I click on the annotation in the sidebar, illustrating two things: 1) the highlight is still invisible, and 2) the annotation does not match the selected text

martynas_b · March 23, 2023

@nathan.hopson How common are documents that mix horizontal and vertical text on a single page?

nathan.hopson · March 23, 2023

@martynas_b: That is quite uncommon in Japanese academic documents. I believe that's true for all the variants of Chinese and Korean as well, but I'm not expert enough to comment on that with confidence.

Don't know if it's helpful, but in Japanese, articles tend to be horizontal, while books are more often vertical.

nathan.hopson · June 1, 2023

Follow up on this:

Zotero OCR with Tesseract is also a total disappointment -- unless I'm missing something vital?

The documentation claims to have support for Japanese, both horizontal and vertical.
I set the language to "jpn" as directed in the docs.

The results are suboptimal:
https://youtu.be/tAit9eN67-A

In other words, the OCRed PDF is 100% useless. Highlighting selects different text, and even that is incorrect.

What am I doing wrong? Or is it just that the tool is really this bad?

A note of hope to end on: Japan's national library (NDL) has recently made their Japanese OCR tools open source (h/t Lani Alden at Berkeley)
https://github.com/ndl-lab

I will test these tools in the future, but the setup looks too involved for right now.
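
On the language setting mentioned above: Tesseract ships separate trained data for horizontal Japanese (`jpn`) and vertical Japanese (`jpn_vert`), and its page segmentation mode 5 assumes a single uniform block of vertically aligned text. The sketch below only builds a command line for the standalone `tesseract` CLI. Whether the Zotero OCR add-on passes these options through is not confirmed anywhere in this thread, and the file names are hypothetical.

```python
import shlex


def build_tesseract_command(image_path: str, output_base: str,
                            vertical: bool = True) -> list[str]:
    """Build an argv list for the tesseract CLI that writes a searchable PDF.

    Assumes the jpn and jpn_vert trained data files are installed.
    """
    lang = "jpn_vert" if vertical else "jpn"
    command = ["tesseract", image_path, output_base, "-l", lang]
    if vertical:
        # PSM 5: assume a single uniform block of vertically aligned text.
        command += ["--psm", "5"]
    command.append("pdf")  # output format: PDF with a text layer
    return command


# Hypothetical input and output names, for illustration only.
cmd = build_tesseract_command("page-001.png", "page-001-ocr")
print(shlex.join(cmd))
# To actually run it: subprocess.run(cmd, check=True)
```

This is a starting point for testing outside Zotero, not a claim that it fixes the results shown in the recording.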

andrewkahn · January 22, 2024

Thank you to @nathan.hopson for documenting this problem. I also use Zotero to process Japanese-language texts and am not really able to use the native PDF viewer for highlights and annotations.

@martynas_b, has there been any progress on this issue since June? Thank you for any help you can provide here - it would be so wonderful to have this fixed.

martynas_b · January 22, 2024

@andrewkahn Could you also send an example PDF file to support@zotero.org with a link to this thread?

andrewkahn · January 23, 2024

@martynas_b, thank you very much for your response. I have just sent an example of a Japanese-language PDF with OCR'ed text. The problems that @nathan.hopson described above are universally true for all Japanese-language PDFs. Please let me know if you need any other information from me to work on these issues.

nathan.hopson · June 16, 2024

Update:
I have upgraded to the latest Zotero 7 beta (7.0.0-beta.85+c0c00a00e).
The experience is improved, but problems remain.

I tested the same file as with Z6:
https://www.maff.go.jp/primaff/kanko/nosoken/attach/pdf/195904_nsk13_2_09.pdf

What's fixed:
* The correct text is selected by Zotero.

What's not:
* The selected text is not shown as selected or highlighted.

Once a highlight color is selected, a thin vertical line (not a highlight) *does* appear, and the text shows up in the left-hand annotation pane. It can be exported as an annotation.

However, because the selected text is not highlighted, it remains difficult and inconvenient to select the correct text.

Note:

I also tested several other PDFs with vertical text. The result was the same.

In contrast, horizontal text presents no problem. In other words, it looks like it's the verticality itself that is giving the Zotero PDF reader headaches.

martynas_b · June 17, 2024

@nathan.hopson The file you provided does not appear to have any text layer, making the text unselectable in any PDF viewer I tested.
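
For anyone who wants to check a file for a text layer before sending it, here is a minimal sketch using the third-party `pypdf` package (an assumption: it is not mentioned in this thread and must be installed separately). It reports which pages yield any extractable text. A page with no extractable text is most likely an image-only scan, although text extraction can also fail for other reasons. The helper names are hypothetical.

```python
def extract_page_texts(pdf_path: str) -> list[str]:
    """Return the extractable text of each page, using pypdf."""
    from pypdf import PdfReader  # third-party; imported lazily

    reader = PdfReader(pdf_path)
    return [(page.extract_text() or "") for page in reader.pages]


def pages_with_text(page_texts: list[str]) -> list[int]:
    """Zero-based indexes of pages that contain non-whitespace text."""
    return [i for i, text in enumerate(page_texts) if text.strip()]


def has_text_layer(page_texts: list[str]) -> bool:
    """True if at least one page has extractable text."""
    return bool(pages_with_text(page_texts))


# Usage with a hypothetical file name:
# texts = extract_page_texts("scan.pdf")
# print(has_text_layer(texts), pages_with_text(texts))
```

Running this on both the downloaded original and the locally OCRed copy would show whether the two files actually differ in having a text layer.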

nathan.hopson · June 17, 2024

@martynas_b, thanks for the quick response as always!

Yes, that's because the OP was about (scanned and) OCRed documents.

In this case, I downloaded and OCRed the document in question, so it's not technically "scanned." If that's a problem, I can provide data from a scanned document instead (though I can assure you from many years of experience that the results are the same).