PDF text recognised in Preview but not Zotero

I've scanned some documents onto my computer using DeskScan. The saved PDF files contain recognised text that I can select and highlight in Preview on my Mac. However, Zotero does not seem to recognise the text in these files, on either Mac or iOS. Any idea what the issue might be?
  • If you email the PDF to support@zotero.org with a link to this thread, we can take a look.
  • Thanks. Have done so.
  • We didn't receive anything from you. If the file is large, you should upload it somewhere and send us a link.
  • Ok just done via WeTransfer.
  • Any luck with finding the issue?
  • @michael.amherst Zotero PDF reader currently doesn't support highlighting rotated text.
  • I don't believe it is rotated. It is just as it was scanned in.
  • The problem is that the text is not aligned to the horizontal or vertical axis.
  • So is this likely to be a problem with any scanned documents?
  • OCR software can normally fix the rotation, but in your case I guess it's complicated because two physical pages are in a single image.
  • That's what I don't understand: it has already been through OCR. Is this issue with text not aligned to the horizontal or vertical axis going to be changed in a future update? I now store all my PDFs on Zotero servers so I can access them all in the app. But if not, then I'll have to revert to cloud storage and using Adobe or similar as a PDF reader.
  • The scan quality of this OCR-ed PDF is very poor. Other PDF viewers may let you highlight the text, but the result will not be accurate either. The Zotero PDF reader does more than other PDF readers: for example, it extracts the annotation text. Text lines that are not aligned with the axes of the coordinate system are problematic and incompatible with the current Zotero annotation architecture. There might be some workarounds to make these PDF files more usable, but we are not sure when or if we will implement them. These kinds of PDF files are usually a very small fraction of all PDFs.

    We suggest using the “Area” annotation tool to draw a rectangle around the desired lines.
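
To illustrate what "not aligned with the axes" means, here is a minimal sketch in Python. It is not Zotero's implementation. It assumes a text line's orientation is available as the `a` and `b` entries of a PDF-style transformation matrix `[a b c d e f]`, where a pure rotation by θ gives `a = cos θ` and `b = sin θ`. The function names and the 0.5 degree tolerance are hypothetical, chosen only for this example.

```python
import math


def baseline_angle_deg(a: float, b: float) -> float:
    """Return the baseline angle in degrees from the (a, b) matrix entries.

    Assumption: the matrix is [a b c d e f] with a = cos(theta), b = sin(theta)
    for a pure rotation, as in the PDF coordinate model.
    """
    return math.degrees(math.atan2(b, a))


def is_axis_aligned(angle_deg: float, tolerance_deg: float = 0.5) -> bool:
    """Return True if the angle is within tolerance of a multiple of 90 degrees.

    The tolerance is a hypothetical value, not taken from any real reader.
    """
    remainder = angle_deg % 90.0
    # Distance to the nearest multiple of 90 degrees.
    distance = min(remainder, 90.0 - remainder)
    return distance <= tolerance_deg


# Horizontal text (0 degrees) and vertical text (90 degrees) are axis-aligned.
horizontal = baseline_angle_deg(1.0, 0.0)
vertical = baseline_angle_deg(0.0, 1.0)

# A page scanned slightly crooked, here by 3 degrees, is not.
skew = math.radians(3.0)
skewed = baseline_angle_deg(math.cos(skew), math.sin(skew))

print(is_axis_aligned(horizontal), is_axis_aligned(vertical), is_axis_aligned(skewed))
```

A crooked scan of a two-page spread can leave every OCRed line a few degrees off, which is the case described above.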
  • With all due respect to @martynas_b, Zotero struggles more than other readers with vertical OCRed text, even when it is perfectly aligned on the vertical axis.

    For example, a Japanese file OCRed in Acrobat is often highlightable in readers other than Zotero.

    OCRed first line from https://www.maff.go.jp/primaff/kanko/nosoken/attach/pdf/195904_nsk13_2_09.pdf:
    Acrobat: 愛知用水事業は農業をふくむ総合開発として最大のもの
    Skim: 愛知用水事業は農業をふくむ総合開発として最大のも
    Preview: 愛知用水事業は農業をふくむ総合開発として最大のもの
    Zotero: 日以上に達し、気候上からは多毛作地帯に属する。また

    Zotero is grabbing text from a completely different part of the page, and doesn't highlight the text to show what's being selected.

    This has always been the case with the v. 6+ PDF viewer, and is consistent behavior across files.

    This is a major issue for those of us who work with vertical text, of course, and it would be wonderful if a fix could be found.
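
The mismatch in the comparison above can be put into numbers. The following is a rough sketch using `difflib` from the Python standard library, with the Acrobat result taken as the reference. The four strings are copied from this thread; treating Acrobat as ground truth is an assumption made only for this example.

```python
from difflib import SequenceMatcher

# First line of the OCRed file as returned by each reader (from this thread).
extracted = {
    "Acrobat": "愛知用水事業は農業をふくむ総合開発として最大のもの",
    "Skim": "愛知用水事業は農業をふくむ総合開発として最大のも",
    "Preview": "愛知用水事業は農業をふくむ総合開発として最大のもの",
    "Zotero": "日以上に達し、気候上からは多毛作地帯に属する。また",
}

# Assumption: Acrobat's output is used as the reference text.
reference = extracted["Acrobat"]


def similarity(candidate: str, expected: str) -> float:
    """Return a 0..1 similarity ratio between two strings."""
    return SequenceMatcher(None, expected, candidate).ratio()


scores = {reader: similarity(text, reference) for reader, text in extracted.items()}

for reader, score in scores.items():
    print(f"{reader}: {score:.2f}")
```

A score near 1 means the reader returned essentially the same line as the reference. A score near 0 means it returned unrelated text, as Zotero does here.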
  • @nathan.hopson: Can you send the OCRed version of that file to support@zotero.org with a link to this thread?
  • @nathan.hopson We should be able to improve that. Thanks for the PDF file.
  • Thanks as always to the Zotero team for being so responsive to user needs and feedback!
  • Follow up:

    I created a short screen recording (under 20sec) showing a typical example of how, in many vertical-text documents:
    1. Highlights don't show for either text selection or annotation
    2. Selected and annotated text don't match

    https://youtu.be/OuY46DgKqU4

    As always, highlighting this file works fine in standalone PDF readers such as Acrobat, Skim, Preview, etc.

    Here is what you see:
    * 00:00-00:10 I select and highlight the text
    * 00:11-00:17 I click on the annotation in the sidebar, illustrating two things: 1) the highlight is still invisible, and 2) the annotation does not match the selected text
  • @nathan.hopson How common are documents that are mixing horizontal and vertical text in a single page?
  • @martynas_b: That is quite uncommon in Japanese academic documents. I believe that's true for all the variants of Chinese and Korean as well, but I'm not expert enough to comment on that with confidence.

    Don't know if it's helpful, but in Japanese, articles tend to be horizontal, while books are more often vertical.
  • Follow up on this:

    Zotero OCR with Tesseract is also a total disappointment -- unless I'm missing something vital?

    The documentation claims support for Japanese, both horizontal and vertical.
    I set the language to "jpn" as directed in the docs.

    The results are suboptimal:
    https://youtu.be/tAit9eN67-A

    In other words, the OCRed PDF is 100% useless. Highlighting selects different text, and even that is incorrect.

    What am I doing wrong? Or is it just that the tool is really this bad?

    A note of hope to end on: Japan's national library (NDL) has recently made their Japanese OCR tools open source (h/t Lani Alden at Berkeley)
    https://github.com/ndl-lab

    I will test these tools in the future, but the setup looks too involved for right now.
  • Thank you to @nathan.hopson for documenting this problem. I also use Zotero to process Japanese-language texts and am not really able to use the native PDF viewer for highlights and annotations.

    @martynas_b, has there been any progress on this issue since June? Thank you for any help you can provide here - it would be so wonderful to have this fixed.
  • @andrewkahn Could you also send an example PDF file to support@zotero.org with a link to this thread?
  • @martynas_b, thank you very much for your response. I have just sent an example of a Japanese-language PDF with OCR'ed text. The problems that @nathan.hopson described above are universally true for all Japanese-language PDFs. Please let me know if you need any other information from me to work on these issues.
  • Update:
    I have upgraded to the latest Zotero 7 beta (7.0.0-beta.85+c0c00a00e).
    The experience is improved, but problems remain.

    I tested the same file as with Z6:
    https://www.maff.go.jp/primaff/kanko/nosoken/attach/pdf/195904_nsk13_2_09.pdf

    What's fixed:
    * The correct text is selected by Zotero.

    What's not:
    * The selected text is not shown as selected or highlighted.

    Once a highlight color is selected, a thin vertical line (not a highlight) *does* appear, and the text shows up in the left-hand annotation pane. It can be exported as an annotation.

    However, because the selected text is not highlighted, it remains difficult and inconvenient to select the correct text.

    Note:

    I also tested several other PDFs with vertical text. The result was the same.

    In contrast, horizontal text presents no problem. In other words, it looks like it's the verticality itself that is giving the Zotero PDF reader headaches.
  • @nathan.hopson The file you provided does not appear to have any text layer, making the text unselectable in any PDF viewer I tested.
  • @martynas_b, thanks for the quick response as always!

    Yes, that's because the OP was about (scanned and) OCRed documents.

    In this case, I downloaded and OCRed the document in question, so it's not technically "scanned." If that's a problem, I can provide data from a scanned document instead (though I can assure you from many years of experience that the results are the same).