[citeproc-bug] punctuation in quote failing

adamsmith · November 23, 2013

I promise this is no attempt to subvert
https://twitter.com/fgbjr/status/404052546781585408

but:
Sample data and style:
https://gist.github.com/adam3smith/7621211
Expected citation:
Daniel W. Drezner, “An Open Letter to the New York Times Concerning Thomas Friedman,” January 23, 2013, accessed August 24, 2013, http://drezner.foreignpolicy.com/posts/2013/01/22/an_open_letter_to_the_new_york_times_concerning_thomas_friedman.

Actual citation:
Daniel W. Drezner, “An Open Letter to the New York Times Concerning Thomas Friedman”, January 23, 2013, accessed August 24, 2013, http://drezner.foreignpolicy.com/posts/2013/01/22/an_open_letter_to_the_new_york_times_concerning_thomas_friedman.

(Note the position of the comma: inside vs. outside the closing quotation mark.)

I realize that the coding could be cleaner, but there aren't that many nested features here, so this really should work. I'm about to fix the Turabian style otherwise, so this won't affect that style anymore, but it should still work.

fbennett · November 24, 2013

No problem, those are being posted as humour (and will eventually have a good run, making for a happy ending as required of comedy ...).

The style is setting the comma as a prefix buried in a macro. To the processor, the internal structure looks something like this (very roughly):

{
  "wrapper": {
    "delimiter": null,
    "content": [
      {
        "node": {
          "quotes": true,
          "content": "An Open Letter to ..."
        }
      },
      {
        "node": {
          "content": [
            {
              "node": {
                "prefix": ", ",
                "content": "January 23, 2013"
              }
            }
          ]
        }
      }
    ]
  }
}

In order to respect punctuation-in-quote against that sort of structure (to arbitrary depth), the processor would need to do the following for every affix in the output object:

1. Confirm that content exists on the local node;
2. For each successive layer to the top of the object:
   a. Check whether a blocking partner exists on the higher-level partner (i.e. a sibling suffix against a prefix, or a sibling prefix against a suffix);
   b. Check whether the higher-level partner requests quotes;
   c. Check whether the higher-level partner has content;
3. Remove the punctuation from below and add it to the higher-level partner's affix or field content as appropriate.

While the processor does do some of this tree-trawling to control for duplicate spaces and punctuation, this would be pushing things pretty hard. It can legitimately be called a bug, I suppose, since the specification prescription is general, but I think I'll pass on trying to get the processor to handle it.

fbennett · November 24, 2013

(Actually, looking at my little example, it does seem that if the extreme affixes were preserved at each stage of a depth-first rendering traverse of the object, both controlling for duplicates and punctuation-in-quote handling could be greatly simplified. That will (or, more conservatively, would) mean rewriting a good portion of the output code, though, so for the present I'll leave this as it stands. If someone wants to tackle it in the meantime, of course, that would be great!)
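
To make the depth-first idea concrete, here is a minimal Python sketch; it is not citeproc-js code. The `Node` class, the `flatten` function and the `MOVABLE` set are hypothetical names invented for illustration, and the sketch works on plain strings only, ignoring markup, delimiters and duplicate-punctuation handling. Each node is flattened only after its children, so a comma buried as a prefix at any depth surfaces at the start of the child's string, where it can be compared with the closing quote that precedes it.

```python
from dataclasses import dataclass
from typing import List, Union

OPEN_QUOTE = "\u201c"
CLOSE_QUOTE = "\u201d"
MOVABLE = {",", "."}  # punctuation that may move inside a closing quote


@dataclass
class Node:
    """Hypothetical stand-in for one node of the processor's output tree."""
    content: Union[str, List["Node"]]
    prefix: str = ""
    suffix: str = ""
    quotes: bool = False


def flatten(node: Node, punctuation_in_quote: bool = True) -> str:
    """Render a node to a string, children first (depth-first)."""
    if isinstance(node.content, str):
        body = node.content
    else:
        body = ""
        for child in node.content:
            piece = flatten(child, punctuation_in_quote)
            # The child's outermost prefix is now the start of `piece`,
            # however deeply it was nested, so one local check suffices.
            if (punctuation_in_quote and body.endswith(CLOSE_QUOTE)
                    and piece[:1] in MOVABLE):
                body = body[:-1] + piece[0] + CLOSE_QUOTE
                piece = piece[1:]
            body += piece
    suffix = node.suffix
    if node.quotes:
        # A suffix on the quoted node itself is handled the same way.
        if punctuation_in_quote and suffix[:1] in MOVABLE:
            body, suffix = body + suffix[0], suffix[1:]
        body = OPEN_QUOTE + body + CLOSE_QUOTE
    return node.prefix + body + suffix


# The structure from the example above: a quoted title followed by a
# date whose ", " prefix sits one level further down.
tree = Node(content=[
    Node(content="An Open Letter to ...", quotes=True),
    Node(content=[Node(content="January 23, 2013", prefix=", ")]),
])
print(flatten(tree))
print(flatten(tree, punctuation_in_quote=False))
```

On that tree the first call puts the comma inside the closing quote and the second leaves it outside. Real output carries markup and delimiters as well, which is where most of the actual work would lie.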

DWL-SDCA · November 24, 2013

Is there not an absolute rule that commas and periods are always inside the closing quotation mark in American English? I believe that this may be less rigid with other flavors of English writing.

fbennett · November 24, 2013

(The rule seems to be clear. I just won't have the programming hours to put into making it happen for styles written like this for the next couple of months or so.)

adamsmith · November 24, 2013

There is, and the specs do include that (which is why I reported it as a bug). fbennett's point is that the way the style is coded makes this incredibly hard to actually do.

DWL-SDCA · November 24, 2013

I think that it is even harder than it may seem. It appears that in British English commas are placed outside the closing quote. In en-US, periods are always inside, but not so in other English conventions where, as with exclamation marks or question marks, a subjective judgment is required.

Should the style be based upon the user's computer language setting, the language tag of the article or, perhaps, the journal's place of publication? How might this affect the many users of non-English languages?

adamsmith · November 24, 2013

This is handled like all other style language settings: First by the default-locale set in the style, then by the person's Zotero install language - never by the language tag of the article.
'punctuation-in-quote' is a locale setting. It's false for most languages, true for US English.
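
As a rough illustration of that lookup order, here is a Python sketch. The function names and the two-entry table are assumptions made for the example; the real values come from the CSL locale files, and only the en-US and en-GB behaviour discussed in this thread is encoded.

```python
from typing import Optional

# Illustrative values only; the real settings live in the CSL locale files.
PUNCTUATION_IN_QUOTE = {"en-US": True, "en-GB": False}


def effective_locale(style_default_locale: Optional[str],
                     install_language: str) -> str:
    """The style's default-locale wins; otherwise use the install language."""
    return style_default_locale or install_language


def punctuation_in_quote(style_default_locale: Optional[str],
                         install_language: str,
                         item_language: Optional[str] = None) -> bool:
    """Look up the setting; item_language is accepted but deliberately ignored."""
    locale = effective_locale(style_default_locale, install_language)
    # Locales not listed fall back to False, matching "false for most languages".
    return PUNCTUATION_IN_QUOTE.get(locale, False)
```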

I really don't think there is a conceptual problem here - it has certainly never come up as one. This is solely a coding issue.

adamsmith · November 24, 2013

Hmm - I'm not happy about this.
Basically this means that we cannot use any punctuation marks in affixes within macros for styles that include quotes.

That seems like a pretty major restriction on coding to me, which will make any such style hugely more complex - adding about twice the number of macros. We'd have to do for every single style with quotes something along the lines of your re-write for Chicago (the group nesting is a separate issue, but we would need all the different macros even without it).

I'm quite concerned that this may mean many, many hundreds, if not thousands of lines of non-automatable style-rewrites to get around this reliably.

Is there really no way we can get this working in the processor with less effort? Post-processing maybe?

fbennett · November 24, 2013

For the time being, we'll just have to suffer with it, but I do think that handling can be improved. As I wrote above, a smarter approach to flattening the output should yield better results (and reduce the amount and complexity of the processor code to boot). But it will be a while before I can work on it.

DWL-SDCA · November 25, 2013

What I was getting at (and using too many words) is that the comma issue seems style-specific. U.S. journals expect the comma inside the quotation mark, but English-language journals published elsewhere seem to want references with the comma outside.

adamsmith · November 25, 2013

Right, and we have separate en-GB and en-US locales that we use for journal styles as they apply. See e.g. Theory, Culture & Society or MHRA for styles that put punctuation outside of the quotation marks.

aurimas · November 25, 2013

So, for en-US locales, couldn't this be a simple search-and-replace post-processing step, like we already do for double punctuation (at least that's my basic understanding)?

adamsmith · November 25, 2013

I think what Frank is saying is that that's not actually how current post-processing works, including for double periods. But citeproc-js code is, uhm, rather daunting, so I don't actually have any code-based insight here.

fbennett · November 25, 2013

Yes, post-processing of the final string isn't viable, since markup may intervene between the characters to be compared. The check needs to be done immediately before each node of the internal representation is flattened to a string, in a depth-first traverse. Currently, the code tries to clean up the entire node tree before flattening, which was not a very smart design choice on my part. We'll get it fixed up in due course.
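
A small Python sketch of why a search-and-replace on the final string falls short. The `naive_fix` function and the `<span class="title">` markup are hypothetical, chosen only to show a tag sitting between the closing quote and the comma.

```python
import re

CLOSE_QUOTE = "\u201d"


def naive_fix(rendered: str) -> str:
    """Move a comma or period that directly follows a closing quote inside it."""
    return re.sub(CLOSE_QUOTE + r"([,.])", r"\1" + CLOSE_QUOTE, rendered)


plain = "\u201cAn Open Letter\u201d, January 23, 2013"
marked_up = ('<span class="title">\u201cAn Open Letter\u201d</span>'
             ", January 23, 2013")

print(naive_fix(plain))      # the comma moves inside the quote
print(naive_fix(marked_up))  # unchanged: "</span>" separates quote and comma
```

Teaching the pattern about every possible tag quickly becomes fragile, which is one reason to make the comparison on the node tree before markup is applied.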