RTF scan straightens apostrophes
Hi all!
I'm using Zotero Standalone 3.0.14.1 on Mac OS X 10.8.3. I just tried to use the RTF Scan feature. While it mostly worked as expected, I noticed one problem: properly curled apostrophes in the input RTF file were replaced with straight apostrophes in the output file. This only affected apostrophes (’ U+2019); double quotes (“ U+201C and ” U+201D) were unaffected.
In other words:
John’s wife said “Hello” (input)
became
John's wife said “Hello” (output)
I'm inclined to treat this as a bug, and a reasonably severe one at that. Or is there a Zotero setting that I should change?
Generally speaking RTF-scan isn't currently treated as much of a priority by anyone - not least because there are a whole number of issues with it that really require a solution at a more fundamental level. If you're serious about writing with non-Word/LO/LaTeX/Pandoc software and Zotero, you may want to consider the ODF scan solution discussed by fbennett and paultroops here: http://forums.zotero.org/discussion/18064/2/please-add-better-integration-with-scrivener/
> I can't replicate that.
Meaning that apostrophes don't get straightened when you try? If so, could I see your before and after RTF files?
> Which word processor are you using to create the RTF?
Jer's Novel Writer, but it shouldn't matter: I have confirmed that the RTF is correct until Zotero processes it.
> Could you upload a minimal sample RTF file somewhere?
Sure. See before and after examples at https://docs.google.com/folder/d/0B-9S1Y1t4sLoazdjRGFSSFFxSkk/edit?usp=sharing . In this case, I created the RTF files with TextEdit, just to confirm that it was not some quirk of Novel Writer.
> Generally speaking RTF-scan isn't currently treated as much of a priority by anyone - not least because there are a whole number of issues with it that really require a solution at a more fundamental level.
That's too bad. It works pretty well aside from annoying little bugs like this. What's the big problem that prevents more attention being spent on it?
> If you're serious about writing with non-Word/LO/LaTeX/Pandoc software and Zotero, you may want to consider the ODF scan solution discussed by fbennett and paultroops here: http://forums.zotero.org/discussion/18064/2/please-add-better-integration-with-scrivener/
That's not really a workable solution for me. Novel Writer only exports to RTF, MS Word format, and XHTML. I don't want to have to do *two* conversion steps.
Additionally, it looks like this depends on MLZ; is that correct? If so, that's a problem for me too. I understand from the thread you pointed me to that MLZ only runs in Firefox, which I rather dislike (which is why I'm using Standalone and Chrome...).
The simple answer to fixing this problem is to fix what, to all appearances, is an extremely minor bug in the RTF scan feature. I'd even be happy to take a stab at it myself; I'm pretty good with JavaScript. But the ODF scan "solution" will make things even fiddlier than they already are.
Using the same file you shared, saved in Libre Office:
https://docs.google.com/file/d/0B9yc_yMcWirjMUpHMk1tenVmTUk/edit?usp=sharing
And then running RTF scan:
https://docs.google.com/file/d/0B9yc_yMcWirjRWUxNlFfeTlLMlE/edit?usp=sharing
the apostrophe stays as it is.
Patches are most certainly welcome. This is all done in one .js file:
https://github.com/zotero/zotero/blob/4.0/chrome/content/zotero/rtfScan.js
Let me know if you want to look into this.
Some other issues include
- dealing with non-ascii characters in names (accents, umlauts, etc.) - I'm working on a fix on that,
- dealing with disambiguation,
- adding prefix and suffix fields to citations,
- issues with creating footnotes.
I think any real fix needs to include some type of unique identifier.
That's implemented in the MLZ-ODF-scan solution, but that doesn't mean it couldn't also exist in RTF scan, though that would require an identifier field (which is also needed for other reasons, e.g. BibTeX).
If that's so, then I assume there are multiple ways to represent the ’ character in RTF, and Zotero doesn't understand them all properly. I'll take a look at that...
My suspicion would be that it's connected to the replacing of \\'92 here:
https://github.com/zotero/zotero/blob/4.0/chrome/content/zotero/rtfScan.js#L170
(which is due to a difference between UTF-8 and the Windows code page used by RTF, IIRC).
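To make the "multiple ways to represent ’" point concrete, here is a minimal sketch of a normalizer for three RTF spellings of U+2019 that, as far as I know, writers commonly emit: the hex escape \'92 (byte 0x92 is ’ in Windows-1252), the decimal Unicode escape \u8217 followed by a fallback character, and the \rquote control word. The function name normalizeRtfApostrophes is hypothetical and this is not the code in rtfScan.js. It also assumes the default of one fallback character after \uN (no \uc override) and a Windows-1252 document.

```javascript
// Hypothetical helper, NOT part of Zotero's rtfScan.js.
// Maps three RTF spellings of the right single quotation mark to U+2019
// instead of to a straight apostrophe.
// Assumptions: the document uses Windows-1252 for \'xx escapes, and each
// \uN escape is followed by at most one fallback character (RTF default).
function normalizeRtfApostrophes(rtf) {
  return rtf
    // \u8217 plus optional delimiter space and optional fallback
    // (either a \'xx hex escape or a single literal character such as "?")
    .replace(/\\u8217(?![0-9]) ?(?:\\'[0-9a-fA-F]{2}|[^\\{}])?/g, "\u2019")
    // \'92 is byte 0x92, which is U+2019 in Windows-1252
    .replace(/\\'92/g, "\u2019")
    // \rquote control word, with its optional delimiter space
    .replace(/\\rquote(?![a-zA-Z]) ?/g, "\u2019");
}

// Each of these inputs is a different RTF spelling of: John’s
console.log(normalizeRtfApostrophes("John\\'92s"));
console.log(normalizeRtfApostrophes("John\\u8217?s"));
console.log(normalizeRtfApostrophes("John\\u8217\\'92s"));
console.log(normalizeRtfApostrophes("John\\rquote s"));
```

If the scan step maps only one of these spellings to a straight ', that would explain why a file saved by one program is affected while the same text saved by another is not. That is still a guess until the two "before" files are compared byte for byte.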
Hmm...that suggests that some of this RTF parsing stuff might be worth extracting into a library that others could reuse. (Yes, I know, that has no direct bearing on solving this. Just brainstorming.)
See Simon's comment here: https://groups.google.com/forum/?fromgroups=#!topic/zotero-dev/1cXTLoHVXAc[1-25-false] which also has a link to the RTF specs.