Extracting annotations from PDF breaks Korean characters
Problem statement
- When I add a Korean patent PDF from Google Patents and open it with the built-in reader, the text is not extracted properly. Some characters are replaced with different ones (this does not seem to be related to encoding, though).
[right] Copied from an external PDF reader (SumatraPDF): 본 고안의 관절 도어스토퍼는 원하는 각도로 개방된 도어를 열린 상태대로 걸림 고정되게 하는 것으로
[wrong] Extracted annotations and text copied from the built-in PDF reader:
본 고안의 관윈 도어스토퍼는 원하는 각도띜 개묩댜 도어를 열린 상태대띜 걸림 고윕댘게 하는 것으띜
Examples of substituted characters (expected:actual):
절:윈
로:띜
정:윕
되:댘
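To check whether the substitutions follow a pattern, the two strings above can be compared position by position. The following is only a minimal sketch: the names `EXPECTED`, `ACTUAL`, and `find_substitutions` are made up for illustration, and it assumes both strings have the same length once normalized to NFC.

```python
import unicodedata

# Text copied from SumatraPDF (correct) and from the built-in reader (wrong).
EXPECTED = "본 고안의 관절 도어스토퍼는 원하는 각도로 개방된 도어를 열린 상태대로 걸림 고정되게 하는 것으로"
ACTUAL = "본 고안의 관윈 도어스토퍼는 원하는 각도띜 개묩댜 도어를 열린 상태대띜 걸림 고윕댘게 하는 것으띜"


def find_substitutions(expected: str, actual: str) -> dict[str, set[str]]:
    """Map each expected character to the set of characters that replaced it."""
    # Normalize so that decomposed Hangul jamo do not skew the comparison.
    expected = unicodedata.normalize("NFC", expected)
    actual = unicodedata.normalize("NFC", actual)
    if len(expected) != len(actual):
        raise ValueError("strings must have the same length to compare by position")

    subs: dict[str, set[str]] = {}
    for good, bad in zip(expected, actual):
        if good != bad:
            subs.setdefault(good, set()).add(bad)
    return subs


if __name__ == "__main__":
    for good, bads in find_substitutions(EXPECTED, ACTUAL).items():
        for bad in sorted(bads):
            # Print both code points and the signed difference between them.
            offset = ord(bad) - ord(good)
            print(f"{good} (U+{ord(good):04X}) -> {bad} (U+{ord(bad):04X}), offset {offset:+#x}")
```

If every pair turns out to have the same code point offset, that would suggest a systematic mapping problem rather than random corruption, which might help narrow down where to look.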
I have been struggling to solve this, but I have no idea where to start looking.
Any help would be greatly appreciated. Thank you.
The PDF uses "Identity-H" encoding on its embedded fonts, so I tried copying the Identity-H CMap into the poppler-data folder, but it didn't work.
The same problem occurs with the VS Code PDF viewer extension (https://marketplace.visualstudio.com/items?itemName=tomoki1207.pdf).
My guess is that the problem occurs in the PDF.js module (about 90% sure).