PDF Reader Reports "No Extracted Text" on PDF

poettli · August 25, 2023

While working on this PDF, the Zotero PDF reader is unable to extract the text from a highlighted area. This is with the Zotero 7 beta, but I suspect it is the same in Zotero 6.

Taking a closer look at the PDF, it appears the text is not correctly encoded or decoded, as running less on the PDF file shows:

^A
^Z^H^R^Y^H^X^P^T^S^A^T ^A^O^Z^\^A^V^H^P^S^R^R^A^V^P^W^Q^A^A
^T ^A^X^\^U^O^T^T^S^A^O^H^N^P ^P^W^A^B^F^D^E^G^C^A^A

While pdf2txt shows:

(cid:2)

(cid:2)(cid:3)(cid:4)(cid:5)(cid:6)(cid:4)(cid:7)(cid:8)(cid:9)(cid:10)(cid:11)(cid:9)(cid:12)(cid:11)(cid:13)(cid:2)(cid:4)(cid:3)(cid:14)(cid:11)(cid:15)(cid:4)(cid:8)(cid:10)(cid:12)(cid:4)(cid:5)(cid:5)(cid:11)(cid:15)(cid:8)(cid:16)(cid:17)(cid:11)(cid:11)

Those are character identifiers (CIDs), which are mapped to PostScript names. I suspect it might be a Japanese CMap. I therefore wonder whether something can be done at the level of the Zotero PDF reader, or whether it is beyond the reader's abilities.
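
To make the symptom concrete, here is a minimal sketch that measures how much of an extracted string consists of `(cid:N)` placeholders. It assumes the pdf2txt output above comes from pdfminer, which writes such a placeholder when it cannot map a glyph to Unicode. The function name `unmapped_ratio` is hypothetical and not part of any library.

```python
import re

# Placeholder written for a glyph that has no Unicode mapping
# (assumption: pdfminer-style "(cid:N)" output, as in the excerpt above).
CID_TOKEN = re.compile(r"\(cid:(\d+)\)")


def unmapped_ratio(text: str) -> float:
    """Return the fraction of glyphs in `text` that are (cid:N) placeholders."""
    placeholders = CID_TOKEN.findall(text)
    # Everything left after removing the placeholders is ordinary text;
    # whitespace is ignored so that line breaks do not skew the ratio.
    remaining = CID_TOKEN.sub("", text)
    readable = sum(1 for ch in remaining if not ch.isspace())
    total = len(placeholders) + readable
    return len(placeholders) / total if total else 0.0


sample = "(cid:2)(cid:3)(cid:4)(cid:5)(cid:6)(cid:4)(cid:7)"
print(unmapped_ratio(sample))
```

A ratio close to 1, as for the sample taken from the output above, suggests the font carries no usable Unicode mapping, so any extractor would return either nothing or garbage for that text.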

martynas_b · August 25, 2023

The PDF file has its text obscured on purpose. Preview.app asks for the owner password when trying to copy text, while other viewers just copy garbled text.
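
For context, a copy restriction like the one Preview.app enforces is usually recorded as a bit in the integer `/P` entry of the PDF's encryption dictionary, and it is up to each viewer to honor it. The sketch below decodes a few of those bits. The two example values are purely illustrative and were not read from the PDF discussed here, and `decode_permissions` is a hypothetical helper.

```python
# Bit values of the /P entry used by the PDF standard security handler.
# The specification numbers bits from 1, so "bit 5" has the value 1 << 4.
PRINT = 1 << 2     # bit 3: print the document
MODIFY = 1 << 3    # bit 4: modify the contents
COPY = 1 << 4      # bit 5: copy or extract text and graphics
ANNOTATE = 1 << 5  # bit 6: add or modify annotations


def decode_permissions(p: int) -> dict[str, bool]:
    """Decode a signed 32-bit /P value into named permission flags."""
    p &= 0xFFFFFFFF  # /P is stored as a signed integer; view it as unsigned
    return {
        "print": bool(p & PRINT),
        "modify": bool(p & MODIFY),
        "copy": bool(p & COPY),
        "annotate": bool(p & ANNOTATE),
    }


# Illustrative values only (assumption, not taken from the file in question).
print(decode_permissions(-3904))  # every flag shown here is cleared
print(decode_permissions(-44))    # print and copy are set
```

Note that these flags only say what the author permits. They are separate from the garbled text itself, which comes from how the font's characters are encoded.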

poettli · August 25, 2023

Yes, I know, but it works with other PDFs of the same issue. I should have mentioned that in my first post. For example, this article: Preview.app also asks for a password to copy text, but the Zotero PDF reader can bypass it. Or maybe there is a flaw in the reader, in that it does not follow the restrictions of the PDF itself.

poettli · August 26, 2023

Beyond this problem, could there be a legal issue for pdf.js in bypassing the PDF restrictions and allowing text to be copied, while other readers comply and ask for a password?

poettli · October 11, 2023

Any thoughts on this matter?

martynas_b · October 12, 2023

It's likely that this feature hasn't been implemented in pdf.js because it's a rare use case. But I'll try to report the issue to them.

martynas_b · October 12, 2023

Well, but it only makes sense to do so if you know the master password of the PDF file.