RTF scan straightens apostrophes

marnenlk · March 19, 2013

Hi all!

I'm using Zotero Standalone 3.0.14.1 on Mac OS X 10.8.3. I just tried to use the RTF Scan feature. While it mostly worked as expected, I noticed one problem: properly curled apostrophes in the input RTF file were replaced with straight apostrophes in the output file. This only affected apostrophes (’ U+2019); double quotes (“ U+201C and ” U+201D) were unaffected.

In other words:
John’s wife said “Hello” (input)
became
John's wife said “Hello” (output)

I'm inclined to treat this as a bug, and a reasonably severe one at that. Or is there a Zotero setting that I should change?

adamsmith · March 19, 2013

I can't replicate that. Which word processor are you using to create the RTF? Could you upload a minimal sample RTF file somewhere?

Generally speaking RTF-scan isn't currently treated as much of a priority by anyone - not least because there are a whole number of issues with it that really require a solution at a more fundamental level. If you're serious about writing with non-Word/LO/LaTeX/Pandoc software and Zotero, you may want to consider the ODF scan solution discussed by fbennett and paultroops here: http://forums.zotero.org/discussion/18064/2/please-add-better-integration-with-scrivener/

marnenlk · March 19, 2013

You wrote:

> I can't replicate that.

Meaning that apostrophes don't get straightened when you try? If so, could I see your before and after RTF files?

> Which word processor are you using to create the RTF?

Jer's Novel Writer, but it shouldn't matter: I have confirmed that the RTF is correct until Zotero processes it.

> Could you upload a minimal sample RTF file somewhere?

Sure. See before and after examples at https://docs.google.com/folder/d/0B-9S1Y1t4sLoazdjRGFSSFFxSkk/edit?usp=sharing . In this case, I created the RTF files with TextEdit, just to confirm that it was not some quirk of Novel Writer.

> Generally speaking RTF-scan isn't currently treated as much of a priority by anyone - not least because there are a whole number of issues with it that really require a solution at a more fundamental level.

That's too bad. It works pretty well aside from annoying little bugs like this. What's the big problem that prevents more attention being spent on it?

> If you're serious about writing with non-Word/LO/LaTeX/Pandoc software and Zotero, you may want to consider the ODF scan solution discussed by fbennett and paultroops here: http://forums.zotero.org/discussion/18064/2/please-add-better-integration-with-scrivener/

That's not really a workable solution for me. Novel Writer only exports to RTF, MS Word format, and XHTML. I don't want to have to do *two* conversion steps.

Additionally, it looks like this depends on MLZ; is that correct? If so, that's a problem for me too. I understand from the thread you pointed me to that MLZ only runs in Firefox, which I rather dislike (which is why I'm using Standalone and Chrome...).

The simple answer to fixing this problem is to fix what, to all appearances, is an extremely minor bug in the RTF scan feature. I'd even be happy to take a stab at it myself; I'm pretty good with JavaScript. But the ODF scan "solution" will make things even fiddlier than they already are.

adamsmith · March 19, 2013

I can replicate this with your file, thanks.

Using the same file you shared, saved in Libre Office:
https://docs.google.com/file/d/0B9yc_yMcWirjMUpHMk1tenVmTUk/edit?usp=sharing
And then running RTF scan:
https://docs.google.com/file/d/0B9yc_yMcWirjRWUxNlFfeTlLMlE/edit?usp=sharing
the apostrophe stays as it is.

Patches are most certainly welcome; this is all done in one .js file:
https://github.com/zotero/zotero/blob/4.0/chrome/content/zotero/rtfScan.js

Let me know if you want to look into this.

Some other issues include:
- dealing with non-ASCII characters in names (accents, umlauts, etc.); I'm working on a fix for that,
- dealing with disambiguation,
- adding prefix and suffix fields to citations,
- issues with creating footnotes.

I think any real fix needs to include some type of unique identifier.
That's implemented in the MLZ-ODF-scan solution, but that doesn't mean it couldn't also exist in RTF scan, though that would require an identifier field (which is also needed for other reasons, e.g. BibTeX).

marnenlk · March 19, 2013

So let me get this straight. If you take the RTF file I provided from TextEdit, then open it in LibreOffice and save it as RTF again, the curly apostrophes come out intact after Zotero processing?

If that's so, then I assume there are multiple ways to represent the ’ character in RTF, and Zotero doesn't understand them all properly. I'll take a look at that...
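
As background for that guess: RTF does allow more than one encoding of ’ (U+2019). Three common ones are the hex escape `\'92` (byte 0x92 in the Windows-1252 code page), the Unicode escape `\u8217` followed by a fallback character, and the control word `\rquote`. A minimal sketch for checking which forms a given file uses might look like this. The helper name `findApostropheEncodings` is hypothetical and is not part of rtfScan.js.

```javascript
// Hypothetical helper (not part of Zotero): report which RTF encodings of
// the right single quotation mark (U+2019) appear in a string of raw RTF.
function findApostropheEncodings(rtf) {
  const forms = {
    // \'92 : hex escape, byte 0x92 in Windows-1252
    hexEscape: /\\'92/,
    // \u8217 : decimal Unicode escape (a fallback character follows it)
    unicodeEscape: /\\u8217(?!\d)/,
    // \rquote : dedicated control word for a right single quote
    controlWord: /\\rquote(?![a-zA-Z])/,
  };
  // Keep only the names whose pattern matches somewhere in the input.
  return Object.keys(forms).filter((name) => forms[name].test(rtf));
}
```

Running this on the "before" files from two different word processors would show whether they really encode the apostrophe differently.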

adamsmith · March 19, 2013

> So let me get this straight. If you take the RTF file I provided from TextEdit, then open it in LibreOffice and save it as RTF again, the curly apostrophes come out intact after Zotero processing?

Yes, exactly.

My suspicion would be that it's connected to the replacing of \\'92 here:
https://github.com/zotero/zotero/blob/4.0/chrome/content/zotero/rtfScan.js#L170
(which is due to a difference between UTF-8 and some Windows key mapping used by RTF IIRC).
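
To illustrate the mismatch: in Windows-1252, byte 0x92 is ’ (U+2019), whereas U+0092 in Unicode is an unprintable control character, so a `\'hh` escape cannot be turned into a character by using the byte value directly. The following is only a sketch of how such escapes could be decoded rather than replaced. It assumes the file declares code page 1252, covers only a handful of punctuation bytes, and does not describe what rtfScan.js currently does. The names `CP1252_PUNCTUATION` and `decodeHexEscapes` are made up for this example.

```javascript
// Partial Windows-1252 table: punctuation bytes in 0x80-0x9F, the range
// where the code page differs from Latin-1 / the first 256 Unicode points.
const CP1252_PUNCTUATION = {
  0x85: "\u2026", // horizontal ellipsis
  0x91: "\u2018", // left single quotation mark
  0x92: "\u2019", // right single quotation mark (curly apostrophe)
  0x93: "\u201C", // left double quotation mark
  0x94: "\u201D", // right double quotation mark
  0x96: "\u2013", // en dash
  0x97: "\u2014", // em dash
};

// Hypothetical helper: decode \'hh escapes for bytes >= 0x80 into text.
// Limitations: assumes \ansicpg1252, ignores escaped backslashes (\\'92),
// and is meant for comparing text, not for writing RTF back out.
function decodeHexEscapes(rtf) {
  return rtf.replace(/\\'([0-9a-fA-F]{2})/g, (match, hex) => {
    const byte = parseInt(hex, 16);
    if (byte < 0x80) return match; // leave ASCII escapes untouched
    if (byte in CP1252_PUNCTUATION) return CP1252_PUNCTUATION[byte];
    if (byte <= 0x9f) return match; // unmapped 0x80-0x9F: leave as-is
    return String.fromCharCode(byte); // 0xA0-0xFF match Latin-1
  });
}
```

With this, `John\'92s wife said \'93Hello\'94` decodes to text with all three curly quotes intact, instead of only the double quotes surviving.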

marnenlk · March 19, 2013

That's exactly the line I had noticed as a spot to investigate, when I'm not lying in bed typing on my iPhone. :)

marnenlk · March 19, 2013

So yeah, I'll give this a shot. I'm busy being a first-year grad student right now, but if I can fix this quickly, I will.

adamsmith · March 20, 2013

see here: https://github.com/zotero/zotero/pull/255#discussion_r3218786 for why it's there.

marnenlk · March 20, 2013

Well, that looks like kind of a pain. I guess I'll have to take a good look at the RTF spec now.

marnenlk · March 20, 2013

Alternatively, I was about to suggest that we consider punting to an existing JavaScript RTF parsing library that's already solved this problem for us...except that such a thing doesn't seem to exist. There are RTF-to-HTML converters in JavaScript, but apparently nothing that does what we're doing here.

Hmm...that suggests that some of this RTF parsing stuff might be worth extracting into a library that others could reuse. (Yes, I know, that has no direct bearing on solving this. Just brainstorming.)

adamsmith · March 20, 2013

yes. The way to _really_ do this right would be to implement a proper RTF parser. I had a quick look when issues kept piling up and came up with the same result you did.

See Simon's comment here: https://groups.google.com/forum/?fromgroups=#!topic/zotero-dev/1cXTLoHVXAc[1-25-false] which also has a link to the RTF specs.

marnenlk · March 25, 2013

Update: as you suggested, I tried saving an RTF file in LibreOffice before scanning it in Zotero. No luck; apostrophes still wound up straightened. If necessary, I'll send the file. I haven't had time to fix the JavaScript code, but hopefully I will soon.