Annotating PDFs that are not OCRed

Therese-Moeller · April 28, 2022

In Zotero, highlighting text and adding notes works well when you open publishers' PDFs. If you open a PDF that has not been through “optical character recognition” (OCR), you can still highlight the text, but the note that Zotero creates contains a combination of letters and numbers, which creates “noise”. You can delete this in “Edit highlighted text”, but that is tiresome work if you highlight a lot. Would it be possible to develop a solution that lets you switch off the automatic note function for PDFs that are not OCRed? Thank you for all your work.

dstillman · April 28, 2022

If you can select anything it means there's a text layer — otherwise it wouldn't let you select.

Can you provide a link to a PDF where you're seeing this, or email it to support@zotero.org with a link to this thread? (We'll respond here.)

burgarth · April 28, 2022

Some PDFs have a broken text layer: it only indicates the position of the words, but the encoding is gibberish or not meant to be real (perhaps for copyright reasons). In such a case, the best option, in my opinion, is to re-OCR the whole file yourself.

Therese-Moeller · April 29, 2022

Thank you, burgarth, for describing the problem. At the university library, we would like to teach students to compile their curriculum literature in Zotero and in this way build digital compendiums. When the curriculum literature is not available online, the teachers scan or build the PDFs themselves, typically chapters from books. They don’t have the option to make OCRed PDFs. Highlighting text in these PDFs works well in Zotero, but it would be a more pleasant experience without the symbols that pop up in the margin as notes. To show what is happening, I am going to send a JPG to support@zotero.org in a second.

dstillman · April 29, 2022

We'd need an actual PDF that's affected, not a JPEG.

Therese-Moeller · April 29, 2022

I have just sent you the PDF.

dstillman · April 29, 2022

OK, yeah, that just has a text layer of placeholder symbols. You can just open it in any other PDF reader (right-click on item → Show File) and you'll see the same thing.

I don't know if that's a bug or intentional, but you should report it to them. There's no reason for a PDF to have a text layer if it's not going to have text. There are plenty of scanned PDFs that are just images.

I don't think it's really the job of Zotero or any other PDF reader to try to detect this.

dstillman · April 29, 2022

Well, I guess the idea is that this allows the PDF to be highlighted, without the text being extracted (e.g., for copyright reasons, as @burgarth suggests).

We could look into detecting this and just creating an annotation without highlight text, but I'm not sure there's any way to predict the different kinds of gibberish different publishers might put in. Maybe if it's all just a repeated character plus spaces, as may be the case here…
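
A rough sketch of that "repeated character plus spaces" check, in Python for illustration only. This is not Zotero code: the function name `looks_like_placeholder` and the `max_distinct` and `min_length` parameters are hypothetical, and it would miss any gibberish that mixes several different symbols.

```python
def looks_like_placeholder(text: str, max_distinct: int = 1, min_length: int = 3) -> bool:
    """Guess whether selected text is placeholder filler rather than real text.

    Hypothetical heuristic: ignoring whitespace, the selection uses at most
    `max_distinct` distinct characters.
    """
    # Drop spaces, tabs and line breaks so only the visible symbols remain.
    chars = [c for c in text if not c.isspace()]
    # Very short selections (a single letter or symbol) are too small to judge.
    if len(chars) < min_length:
        return False
    return len(set(chars)) <= max_distinct


if __name__ == "__main__":
    print(looks_like_placeholder("□□□□ □□ □□□□□"))   # one symbol repeated
    print(looks_like_placeholder("A scanned book chapter"))
```

If the check returned true, the reader could create the highlight with empty text instead of copying the symbols into the annotation. Raising `max_distinct` would catch filler that alternates between a few symbols, at the cost of more false positives on short real words.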

Therese-Moeller · April 29, 2022

It is not a publisher's PDF. It is a PDF made using an ordinary copy machine. In Adobe's PDF reader, you don't get the gibberish in the margin, and you can add your own notes. My wish is that the Zotero reader worked the same way.

martynas_b · April 29, 2022

Adobe Acrobat Reader doesn't do anything better. The text is still gibberish if you try to copy it. The only difference might be that it doesn't extract text at all.

Therese-Moeller · April 29, 2022

I am not trying to copy it, but in Adobe you can highlight the text even though the PDF is not OCRed, and use the note function without any problems. From the student's perspective, it works fine.

dbrear · April 29, 2022

Hi — there's a free online PDF OCR site here (up to 100MB) — https://www.ocr2edit.com/create-searchable-pdf