End-of-line hyphens in OCRed PDFs

Hi all,

I've found Zotero's handling of end-of-line hyphens somewhat unsatisfactory. When I copy or annotate text containing end-of-line hyphens from OCRed PDFs (especially, I feel, files that were not created with up-to-date OCR software), Zotero deletes the line breaks but also replaces the hyphens with a space, thus splitting up words. The hyphens seem to be part of the OCR text layer, as they can be copied in other PDF applications (see image: https://imgur.com/a/PlMFOWK).

I think it would make sense to automatically "re-assemble" hyphenated words. There might be a few false positives like "self-awareness", but I assume fixing those would still be less manual work than going through the unnecessary spaces, as is necessary now.
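
To illustrate what I mean by "re-assembling", here is a rough Python sketch. It is not Zotero code and I don't know how Zotero processes the text layer internally; the function name `dehyphenate` and the regular expression are my own assumptions. It assumes the extracted text still contains the original line breaks and hyphens.

```python
import re

# Hypothetical helper, not part of Zotero.
# Matches a word character, a hyphen at the end of a line, and the
# first word character on the following line.
_EOL_HYPHEN = re.compile(r"(\w)-[ \t]*\r?\n[ \t]*(\w)")


def dehyphenate(text: str) -> str:
    """Join words split by an end-of-line hyphen, then unwrap the lines."""
    # "hyphen-\nated" becomes "hyphenated": drop the hyphen and the break.
    joined = _EOL_HYPHEN.sub(r"\1\2", text)
    # Every remaining line break becomes a single space.
    return re.sub(r"[ \t]*\r?\n[ \t]*", " ", joined)


print(dehyphenate("These hyphen-\nated words are\nsplit."))
print(dehyphenate("a lack of self-\nawareness"))
```

The second call shows the false positive mentioned above: a genuine hyphen that happens to fall at a line end is removed as well, so "self-awareness" comes out as "selfawareness". Hyphens inside a line are left alone.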