problem with accented characters

After highlighting text in a PDF file in Portuguese, I realized that it was not being imported correctly into annotations. It's not a general problem, since with some PDFs I don't have any issues, but I wonder if there's a way to fix it.

ex: arbitr ́ario (arbitrário) , n ̃ao (não)
  • edited August 27, 2024
    I have been having the same issue for some time: when I copy and paste extracts into another system, I get concentrac~ao or vegetec~ao. It is really strange, because when we open the same file in an external PDF reader, it copies and pastes the text without any issue.

    In Word/Writer the text appears to be pasted correctly, but the grammar checker always flags it as an error.

    I tried to find a PDF reader setting that controls how characters are handled, such as the charset used, but I could not find one to change. Advanced Preferences has intl.fallbackCharsetList.ISO-8859-1 = windows-1252, but windows-1252 is a Portuguese-compatible charset, so I didn't change anything.
  • edited August 27, 2024
    Here is how the text behaves in Word, in the other system where I usually paste it (a library system software), and originally in Zotero.
    https://s3.amazonaws.com/zotero.org/images/forums/u4229194/9use0yp4hj2pwprj61ca.png
  • Is there a way (maybe a plugin) to override the text encoding when copying from a PDF in Zotero? Perhaps it could help in this situation. I tested the same PDF in Abbyy Fine Reader, Adobe, Edge, and Mendeley, and in every other program it works fine.

    It seems a minor problem, but we use the tilde a lot in Portuguese.
  • edited September 27, 2024
    At least concerning the Aleph input, this should come out fine in a browser if using a modern font. There, IMO it's Aleph's fault.

    The background here is most likely that accented characters can be encoded either in a composed form (base character and accent together in one code point) or in a decomposed form, that is, the base character in one code point followed by a combining accent character in a second code point, in this order. Depending on the system and software you use, text is often normalised to one or the other. Many modern fonts come with built-in information on how to compose the accented form from its components (i.e. where the accent has to go above or below the base glyph). Aleph, as old as it is, can't handle that correctly, so you see the underlined tilde, which shows you that it is a combining tilde, encoded differently from a non-combining tilde (~). When you publish this record to Primo or another OPAC, you should see it correctly composed because the browser knows how to handle it.

    EDIT: I vaguely remember that Aleph can in fact handle it and that only the font doesn't provide the a+tilde composition. IIRC, setting Aleph to use a different font can fix it, and I did that once. But we don't have Aleph any more, so I can't tell you exactly what to do.
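
To make the composed/decomposed distinction concrete, here is a minimal Python sketch using only the standard `unicodedata` module. The helper name `describe` and the sample word are just illustrations, not anything Zotero or Aleph does internally; the point is only that the two strings look the same on screen but consist of different code points.

```python
import unicodedata

# The same word, "não", in the two encodings described above.
composed = "n\u00e3o"     # 'ã' as a single code point (U+00E3)
decomposed = "na\u0303o"  # 'a' followed by a combining tilde (U+0303)


def describe(text):
    """Return one 'U+XXXX NAME' entry per code point (illustrative helper)."""
    return [f"U+{ord(ch):04X} {unicodedata.name(ch)}" for ch in text]


if __name__ == "__main__":
    print(describe(composed))
    print(describe(decomposed))
    # The strings are not equal until they are normalised to the same form.
    print(composed == decomposed)
    print(unicodedata.normalize("NFC", decomposed) == composed)
```

Software that compares or looks up words code point by code point will treat the two strings as different unless it normalises them first.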
  • Thank you for the explanation.

    I've found a workaround: I highlight the needed text and create a note. By copying and pasting from the annotations, the characters display correctly in Aleph.

    Do you know why Word also recognizes the characters as incorrect? I'm using Office 365, which is newer or at least more up-to-date software.

    I usually bring citations from annotations, which resolves the issue in Word, but not everyone does this, especially since annotations often disrupt the custom styles used in the document.
  • Yes, applications that expect a certain format often do normalize to that on import, save or even export.

    It seems that the Word spell checker doesn't handle decomposed forms in any language. Just tried it with concentração, as well as with English fiancée and German möglich.
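
If someone wants to clean up copied text before pasting it into Word or a catalogue system, one possible workaround is to normalise it to the composed form first. The following is only a sketch under that assumption: `to_composed` and `has_combining_marks` are hypothetical helper names, and the script is not a Zotero feature or plugin. It relies solely on Python's standard `unicodedata` module.

```python
import unicodedata


def has_combining_marks(text):
    """True if the text contains any combining character (e.g. U+0303)."""
    return any(unicodedata.combining(ch) for ch in text)


def to_composed(text):
    """Normalise text to NFC, merging base letters with combining accents."""
    return unicodedata.normalize("NFC", text)


if __name__ == "__main__":
    # Decomposed versions of the three words tried in the spell checker.
    samples = ["concentrac\u0327a\u0303o", "fiance\u0301e", "mo\u0308glich"]
    for word in samples:
        fixed = to_composed(word)
        print(has_combining_marks(word), has_combining_marks(fixed), fixed)
```

Note that NFC only merges an accent that directly follows its base letter. If the PDF extraction puts the accent before the letter or after a space, as in the first example in this thread, normalisation alone will not repair it.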