Extracting annotations from PDF breaks Korean characters

oco9oco · April 13, 2022

Problem statement
- When I added a Korean patent PDF from Google Patents and opened it with the built-in reader, the text was not extracted properly. Some characters are replaced with different ones (it doesn't seem to be an encoding issue, though).

[right] Copied from an external PDF reader (SumatraPDF): 본 고안의 관절 도어스토퍼는 원하는 각도로 개방된 도어를 열린 상태대로 걸림 고정되게 하는 것으로

[wrong] Extracted annotations and text copied from the built-in PDF reader:
본 고안의 관윈 도어스토퍼는 원하는 각도띜 개묩댜 도어를 열린 상태대띜 걸림 고윕댘게 하는 것으띜

절:윈
로:띜
정:윕
되:댘
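
The substitutions listed above can be checked mechanically. A minimal Python sketch, assuming the two sample strings are pasted exactly as they appear in this post (the function name `diff_chars` and the variable names are made up for illustration, not part of Zotero or PDF.js):

```python
import unicodedata

# Sample strings from this post: `right` was copied from SumatraPDF,
# `wrong` from the built-in reader.
right = "본 고안의 관절 도어스토퍼는 원하는 각도로 개방된 도어를 열린 상태대로 걸림 고정되게 하는 것으로"
wrong = "본 고안의 관윈 도어스토퍼는 원하는 각도띜 개묩댜 도어를 열린 상태대띜 걸림 고윕댘게 하는 것으띜"


def diff_chars(expected: str, actual: str) -> dict[str, str]:
    """Map each expected character to the character that replaced it."""
    # Normalize so that composed and decomposed Hangul compare equal.
    expected = unicodedata.normalize("NFC", expected)
    actual = unicodedata.normalize("NFC", actual)
    if len(expected) != len(actual):
        raise ValueError("strings differ in length; cannot align them")

    mapping: dict[str, str] = {}
    for e, a in zip(expected, actual):
        if e == a:
            continue
        # setdefault keeps the first replacement seen for `e`; a different
        # replacement later means the substitution is not consistent.
        if mapping.setdefault(e, a) != a:
            raise ValueError(f"inconsistent substitution for {e!r}")
    return mapping


mapping = diff_chars(right, wrong)
for e, a in mapping.items():
    # Print code points so the substitutions can be compared numerically.
    print(f"{e} U+{ord(e):04X} -> {a} U+{ord(a):04X}")
```

If this runs without raising, each wrong character stands in for exactly one correct character everywhere it occurs. That would suggest a systematic character-mapping problem rather than random corruption.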

I have been trying to solve this, but I have no idea where to start looking.
Any help would be greatly appreciated. Thank you.

martynas_b · April 14, 2022

We need at least one PDF page to diagnose the problem. Could you send it to support@zotero.org with a link to this thread?

oco9oco · April 20, 2022

Sent you the PDF file.

The embedded fonts in the PDF use "Identity-H" encoding, so I tried copying the Identity-H CMap into the poppler-data folder, but it didn't work.

The same problem occurs in the VS Code PDF viewer extension (https://marketplace.visualstudio.com/items?itemName=tomoki1207.pdf).
My guess is that the problem comes from the PDF.js module (about 90% sure).

martynas_b · April 20, 2022

This seems fixed in newer PDF.js versions. We'll upgrade to a newer version later this year.