[Z7 Beta] Remove ¬ in annotations

samvimes · April 19, 2024

Hi all,
Some OCRed texts use the ¬ symbol to mark words that are hyphenated across a line break. So a text that looks like this:

This is sub-
ordination

becomes "This is sub¬ ordination" in a Zotero annotation or note. It would be great if Zotero could automatically fix this by removing the ¬ and the following space, so that the annotation text would read "This is subordination".
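
To make the request concrete, here is a minimal sketch of the substitution in Python. It is only an illustration of the rule described above, not Zotero's code; the function name `join_not_sign_breaks` is hypothetical, and requiring a word character on both sides of the ¬ is an assumption meant to leave a free-standing ¬ (for example in a logic formula) untouched.

```python
import re

# Hypothetical helper, not part of Zotero: matches a ¬ that sits directly
# after a word character and is followed by whitespace and another word
# character, i.e. the "sub¬ ordination" pattern from the example above.
NOT_SIGN_BREAK = re.compile(r"(?<=\w)¬\s+(?=\w)")


def join_not_sign_breaks(text: str) -> str:
    """Remove the ¬ and the whitespace after it, joining the two word halves."""
    return NOT_SIGN_BREAK.sub("", text)


print(join_not_sign_breaks("This is sub¬ ordination"))
# prints: This is subordination
```

A ¬ that is preceded by a space, as in "a ¬ b", does not match and is left alone.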

martynas_b · April 19, 2024

Could you send an example PDF file to support@zotero.org with a link to this thread?

samvimes · April 19, 2024

Thank you for the quick reply, I just emailed you a file.

martynas_b · April 22, 2024

The problem is due to poor OCR. It would be better to re-OCR the file.

samvimes · April 22, 2024

I don't think it is poor OCR in this case, but rather a (possibly Germany-specific) guideline for OCRed texts (see, for example: https://www.deutschestextarchiv.de/doku/basisformat/trSilbentrennung.html).

This means that, for example, all scanned PDFs on pedocs.de, a central database for educational science literature, contain the special character. Re-OCRing them all would be a bit of a hassle.

DWL-SDCA · April 22, 2024

Without seeing the file, I can't comment on its OCR quality. But this is a larger issue.

Often a database (say, PubMed) or the metadata from a publisher's website will not contain an abstract, but there is an article summary on the publisher's webpage. I copy this text (rendered as HTML or from a PDF) and paste it into my text editor.

I have my text editor set to show selected non-printing characters such as the ¬ to indicate a line break. Sometimes, although there isn't the ¬ character on the screen image of the PDF, when the text is pasted into my text editor there are two ¬ characters (one printing and one non-printing) at the end of each line. There aren't double-spaced lines on the PDF document.

If I copy the PDF text from the screen (of the publisher's site) and paste it directly into Zotero there are no ¬ characters but there are unwanted line breaks where the text breaks on the PDF.

Over the years, I have 'trained' my text editor to remove line breaks and the occasional extraneous printing ¬ characters. I've mostly 'taught' it to handle words hyphenated across line breaks. (But I've not been able to automate keeping wanted hyphens, like the one in the next sentence, when the phrase crosses a line.) It is labor-intensive and isn't perfect, but document by document I can get a suitable abstract into Zotero. If Zotero or a plug-in could (mostly) automate what I have been doing by hand, it would be great. I have long accepted that I need to copy to a text editor and do some manipulation if I want a pretty abstract. If regex masters within the Zotero community can help, I would be very happy. I can handle the fundamentals of regular expressions, but writing if-then logic with any accuracy is bewildering.
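
As a starting point for the cleanup steps described above, here is a rough Python sketch. The function name `unwrap_abstract` and the specific rules are assumptions for illustration, not an existing Zotero or plug-in feature: it strips a printing ¬ at the end of a line, joins a word split by a hyphen when the next line starts with a lowercase ASCII letter, and turns single line breaks into spaces while keeping blank lines as paragraph breaks.

```python
import re


def unwrap_abstract(text: str) -> str:
    """Hypothetical cleanup for text pasted from a PDF (a sketch, not a tested tool)."""
    # Normalize Windows line endings so the rules below only deal with "\n".
    text = text.replace("\r\n", "\n")
    # Drop a printing ¬ (plus trailing spaces or tabs) right before a line break.
    text = re.sub(r"¬[ \t]*(?=\n)", "", text)
    # Join a word hyphenated across a line break, but only when the next line
    # starts with a lowercase ASCII letter (assumption: adjust for other scripts).
    text = re.sub(r"(?<=\w)-\n(?=[a-z])", "", text)
    # Replace single line breaks with a space; blank lines (paragraphs) are kept.
    text = re.sub(r"(?<!\n)\n(?!\n)", " ", text)
    # Collapse any runs of spaces or tabs left over from the steps above.
    return re.sub(r"[ \t]{2,}", " ", text)


print(unwrap_abstract("This is sub-\nordination.¬\nNext line.\n\nNew paragraph."))
```

The known weak spot is the one already mentioned: a wanted hyphen that happens to fall at a line end (for example "labor-" / "intensive") is joined into "laborintensive", because the pattern alone cannot tell the two cases apart. A dictionary lookup would be one way to decide, but that goes beyond a regex.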

samvimes · April 23, 2024

I've done a bit more research and found that ABBYY FineReader automatically and deliberately inserts the ¬ as an optional hyphen (see https://pdf.abbyy.com/media/1676/users_guide.pdf, p. 317). For me, this is another argument for automatically removing the character in annotations, just as already happens with some end-of-line hyphens.

martynas_b · April 23, 2024

All right, we will consider adding a special case to handle the '¬' character.

samvimes · April 24, 2024

Great, thank you!