PDF Reader Reports "No Extracted Text" on PDF

edited August 25, 2023
While working on this PDF, I found that the Zotero PDF reader is not able to extract the text from a highlighted area. This is with the Zotero 7 beta, but I suspect it is the same in Zotero 6.

Taking a closer look at the PDF, it appears the text is not correctly encoded or decoded, as running less on the PDF file shows:
^A
^Z^H^R^Y^H^X^P^T^S^A^T ^A^O^Z^\^A^V^H^P^S^R^R^A^V^P^W^Q^A^A
^T ^A^X^\^U^O^T^T^S^A^O^H^N^P ^P^W^A^B^F^D^E^G^C^A^A
While pdf2txt shows:
(cid:2)

(cid:2)(cid:3)(cid:4)(cid:5)(cid:6)(cid:4)(cid:7)(cid:8)(cid:9)(cid:10)(cid:11)(cid:9)(cid:12)(cid:11)(cid:13)(cid:2)(cid:4)(cid:3)(cid:14)(cid:11)(cid:15)(cid:4)(cid:8)(cid:10)(cid:12)(cid:4)(cid:5)(cid:5)(cid:11)(cid:15)(cid:8)(cid:16)(cid:17)(cid:11)(cid:11)
Those are character identifiers (CIDs), which are mapped to PostScript names. I suspect it might be a Japanese CMap. I therefore wonder whether something can be done at the Zotero PDF reader level or whether this is beyond the abilities of the reader.
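To gauge how much of an extracted page is affected, the `(cid:N)` placeholders can be counted. This is only a minimal sketch, assuming the output above comes from pdfminer's pdf2txt, which writes `(cid:N)` when it cannot map a glyph to Unicode. The function names are hypothetical, not part of any tool.

```python
import re

# Placeholder format assumed from the pdf2txt output quoted above.
CID_TOKEN = re.compile(r"\(cid:(\d+)\)")


def unmapped_cids(text):
    """Return the CID numbers of all (cid:N) placeholders, in order."""
    return [int(n) for n in CID_TOKEN.findall(text)]


def unmapped_ratio(text):
    """Fraction of non-whitespace glyphs that came out as (cid:N)."""
    cids = len(CID_TOKEN.findall(text))
    remaining = CID_TOKEN.sub("", text)
    readable = sum(1 for ch in remaining if not ch.isspace())
    total = cids + readable
    return cids / total if total else 0.0


sample = "(cid:2)(cid:3)(cid:4)(cid:5)(cid:6)(cid:4)(cid:7)"
print(unmapped_cids(sample))
print(unmapped_ratio(sample))
```

A ratio close to 1 suggests the font carries no usable Unicode mapping, so any reader that relies on the embedded mapping would probably return garbled or empty text.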
  • The PDF file has its text obscured on purpose. Preview.app asks for the owner password when trying to copy text, while other viewers just copy garbled text.
  • edited August 25, 2023
    Yes, I know, but it works with other PDFs that have the same issue. I should have mentioned it in my first post. For example, this article: Preview.app also asks for a password to copy text, but the Zotero PDF reader can bypass it. Or maybe there is a flaw in the reader, in that it does not follow the restrictions of the PDF itself.
  • edited September 6, 2023
    Beyond this problem, could there be a legal issue for pdf.js in bypassing the PDF restrictions and allowing text to be copied, while other readers comply and ask for a password?
  • Any thoughts on this matter?
  • It's likely that this feature hasn't been implemented in pdf.js because it's a rare use case. But I'll try to report the issue to them.
  • Well, but it only makes sense to do so if you know the master password of the PDF file.
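For reference, the copy restriction discussed in this thread is stored as a bit field in the `/P` entry of the PDF's encryption dictionary. The sketch below decodes such a value, going by the permission table in the PDF 1.7 specification, where bit 5 (value 16) covers copying or extracting text. The function and constant names are hypothetical, and reading `/P` out of an actual file is not shown. It says nothing about how pdf.js or Zotero treat these flags.

```python
# Permission bits as numbered in the PDF 1.7 specification (bit 1 = lowest).
PERMISSION_BITS = {
    "print": 1 << 2,     # bit 3
    "modify": 1 << 3,    # bit 4
    "copy": 1 << 4,      # bit 5: copy or extract text and graphics
    "annotate": 1 << 5,  # bit 6
}


def decode_permissions(p):
    """Map a /P integer (signed 32-bit in the file) to allowed actions."""
    return {name: bool(p & mask) for name, mask in PERMISSION_BITS.items()}


def can_copy(p):
    """True if the copy/extract bit is set in the /P value."""
    return decode_permissions(p)["copy"]


# Hypothetical /P value: everything allowed except copying.
no_copy = -1 & ~PERMISSION_BITS["copy"]
print(decode_permissions(no_copy))
```

A viewer that honors these flags would refuse to copy text when the copy bit is cleared unless the owner password is supplied, which matches what Preview.app is reported to do above.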