Names reform: request for comments

Gracile · March 19, 2015

The "De" in this name should never drop, even when dropping of "non-dropping" particles is forced.

Forced? You mean with "demote-non-dropping-particle" set to “display-and-sort”, right?

fbennett · March 20, 2015

Yes. When those options are set, I have been told that this particular name (for one of the people to whom it applies, at least) should be printed, in sort order, as "De Quincy, John."

We don't have a mechanism in CSL for declaring that an apparent particle should be treated as a fixed part of the last name.
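
To make the behaviour under discussion concrete, here is a minimal sketch of sort-order rendering, assuming the name-part ordering described in the CSL 1.0.1 specification. The function name `sort_order_name` is hypothetical and this is not citeproc-js code. It only covers display order, not the sort key. It shows that a name stored entirely in the family field keeps its "De", while a name with a separate non-dropping particle is demoted.

```python
def sort_order_name(name, demote="never"):
    """Render a name in inverted (sort) order.

    `name` is a dict using CSL JSON keys; `demote` mirrors the values of
    the demote-non-dropping-particle option. Hypothetical helper.
    """
    family = name.get("family", "")
    ndp = name.get("non-dropping-particle", "")
    given_parts = [name.get("given", ""), name.get("dropping-particle", "")]
    if ndp and demote == "display-and-sort":
        # Demoted: the non-dropping particle trails the given name.
        given_parts.append(ndp)
    else:
        # "never" and "sort-only" display the particle before the family name.
        family = f"{ndp} {family}".strip()
    given = " ".join(part for part in given_parts if part)
    pieces = [family, given, name.get("suffix", "")]
    return ", ".join(piece for piece in pieces if piece)


de_gaulle = {"family": "Gaulle", "given": "Charles", "non-dropping-particle": "de"}
de_quincy = {"family": "De Quincy", "given": "John"}

print(sort_order_name(de_gaulle, "display-and-sort"))  # Gaulle, Charles de
print(sort_order_name(de_quincy, "display-and-sort"))  # De Quincy, John
```

The second result is only possible when the input already says that "De" belongs to the family name, which is exactly the declaration that two-field input cannot express.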

nickbart · March 20, 2015

Didn't get around to testing, nor to posting on xbiblio yet, sorry - but does citeproc-js still implement the "dirty trick" of protecting multi-part last names when they are enclosed in double quotes? (see http://gsl-nagoya-u.net/http/pub/citeproc-doc.html#particles-as-part-of-the-last-name)
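
For readers who have not seen the quoting workaround, the idea can be sketched roughly as follows. This is not the citeproc-js algorithm: the particle list is a tiny illustrative sample, the case-insensitive matching is an assumption, and `parse_family` is a hypothetical name. A family field wrapped in straight double quotes is taken verbatim; anything else has a leading particle split off.

```python
# Illustrative sample only, NOT the list used by citeproc-js.
PARTICLES = ("van der", "von", "van", "de")


def parse_family(raw):
    """Split a leading particle off a family-name field (hypothetical helper)."""
    raw = raw.strip()
    # Quoted field: strip the quotes and protect the whole name from parsing.
    if len(raw) >= 2 and raw[0] == '"' and raw[-1] == '"':
        return {"family": raw[1:-1]}
    lowered = raw.lower()
    # Try longer particles first so "van der" wins over "van".
    for particle in sorted(PARTICLES, key=len, reverse=True):
        if lowered.startswith(particle + " "):
            return {
                "non-dropping-particle": raw[: len(particle)],
                "family": raw[len(particle) + 1 :].strip(),
            }
    return {"family": raw}


print(parse_family("De Quincy"))    # {'non-dropping-particle': 'De', 'family': 'Quincy'}
print(parse_family('"De Quincy"'))  # {'family': 'De Quincy'}
```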

And what does citeproc-js do when the space in "De Quincy" is replaced by a non-breaking space?

It's not just "De Quincy", by the way: CMoS 16e, 16.71, also recommends always using “de Gaulle”, and sorting him under "D".

fbennett · March 20, 2015

I think that workaround is still supported in citeproc-js, but I'm not sure whether Zotero strips surrounding quotes.

I'm not sure what would happen with a non-breaking space, but it's likely that would do the trick.

There are plenty of exceptions that would require an override. The key thing is to get agreement on the override syntax in the CSL group - otherwise we don't have a basis for consistent markup across implementations.

aurimas · March 20, 2015

As I said before, I don't think citeproc should be applying this parsing automatically. It's great that an algorithm was developed, and citeproc could expose it as a utility function. It would then be up to the citation manager to either use automatic parsing or have the user indicate correct name parts. Most importantly, it would not require any special markup on top of what is already available in citeproc JSON.

nickbart · April 12, 2015

I'd like to return to this issue since, just like aurimas, I feel the current implementation is markedly problematic:

- It does not conform to the CSL specs (the CSL method of dealing with name parts is very well designed and should not be given up needlessly).
- It makes debugging much more difficult (you can't just look at a CSL JSON file exported from Zotero to check how a name has been parsed).
- Unparsed name fields do not work with other citeprocs, e.g. pandoc-citeproc when using zotxt (or else all other citeprocs would have to implement the same algorithm).
- Overriding the algorithm’s decisions is difficult or impossible.

Hence I’d like to repeat aurimas’ and my suggestion:

- citeproc-js should not apply this parsing automatically (or only if explicitly asked to do so by '"parse-names" : "true"').
- Zotero should either
  - implement this automatic parsing algorithm when creating output, or
  - have the user indicate correct name parts by
    - introducing (optional) additional fields for dropping-particle, non-dropping-particle and suffix, or
    - introducing a suitable in-field markup syntax,
  - or a combination of these.

My favourite is what I see as the cleanest solution: introducing additional fields for dropping-particle, non-dropping-particle and suffix, by adding a "five-field" state to the name field (in addition to the existing "single field" and "two-field" states).
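
As an illustration of what "five-field" data might look like once it reaches the processor, here is a small sketch using the name-part keys that CSL JSON already defines. The helper `to_five_field` is hypothetical; the point is only that every part is explicit, so nothing has to be inferred from the strings.

```python
import json

# The five name-part keys as they appear in CSL JSON.
FIVE_FIELDS = ("family", "given", "dropping-particle", "non-dropping-particle", "suffix")


def to_five_field(name):
    """Return a copy of `name` with all five keys present (hypothetical helper)."""
    return {key: name.get(key, "") for key in FIVE_FIELDS}


la_fontaine = {
    "family": "Fontaine",
    "given": "Jean",
    "dropping-particle": "de",
    "non-dropping-particle": "La",
    "suffix": "III",
}
# No particle fields: "De" is simply part of the family name.
de_quincy = {"family": "De Quincy", "given": "John"}

print(json.dumps([to_five_field(la_fontaine), to_five_field(de_quincy)], indent=2))
```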

An in-field markup syntax would also be possible, e.g., by using something like the pipe character ("|") to separate subfields inside lastname and firstname fields.
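
A rough sketch of how such pipe markup could be read, purely as an assumption: no syntax has been agreed, and the field layout here ("particle|family" in the last-name field, "given|particle" in the first-name field) and the function name `split_piped` are invented for illustration. A field without a pipe is taken as-is, so "De Quincy" needs no special handling.

```python
def split_piped(last, first):
    """Read hypothetical pipe markup from the two name fields.

    Assumed layout (not an agreed syntax):
      last  = "non-dropping-particle|family"
      first = "given|dropping-particle"
    """
    name = {}
    # Text before the last pipe is the particle; with no pipe it is empty.
    particle, _, family = last.rpartition("|")
    name["family"] = family.strip()
    if particle.strip():
        name["non-dropping-particle"] = particle.strip()
    # Text after the first pipe is the dropping particle; with no pipe it is empty.
    given, _, dropping = first.partition("|")
    name["given"] = given.strip()
    if dropping.strip():
        name["dropping-particle"] = dropping.strip()
    return name


print(split_piped("La|Fontaine", "Jean|de"))
print(split_piped("De Quincy", "John"))
```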

Unfortunately, I cannot code any of these but I’d certainly be willing to contribute to testing.

fbennett · April 12, 2015

Well, I'm not standing in the way of reform. The processor can of course easily digest five-field output along the lines you suggest, since that's what it deals with internally.

All someone needs to do is convert user data into that form, and prepare code to interface with the processor on that basis. The coding on the processor side will be trivial.

fbennett · April 12, 2015

If you want to move things in this direction, the place to start would probably be the xbiblio-devel mailing list. The CSL specification applies to CSL code only; there is no formal specification for input formats yet. The closest thing is the CSL Test Suite, where names are parsed out of two-field input. If CSL were to move to require a five-field input object for names, the tests would need to be revised to reflect that requirement. There are over a thousand fixtures, so someone would need to put in some work on that.

nickbart · May 11, 2015

Whether or not two-field names should be seen as the preferred format for communication between front-end and processor, and whether or not users should be given complete control over the specification of name components in the various front-ends, a processor’s ability to parse two-field names as correctly as possible is certainly very important.

So let me ask a few more questions on that:

(1) Could we document the available information on name particles in general, and the algorithm used by citeproc-js, a little more clearly – also with a view to how other citeprocs could implement this?

Most importantly, could we put together a list of particles we can unambiguously identify as either dropping or non-dropping? And if there are criteria other than simple membership in one of these lists, what are these?

Or, amounting to the same thing essentially, how should the list in https://bitbucket.org/fbennett/citeproc-js/src/tip/src/util_name_particles.js?at=default be understood, especially, which particles on this list are parsed as dropping or non-dropping, what exactly do the numbers mean, and are there any other criteria applied by the algorithm?

(I’ll note that while “van der” seems to be parsed correctly as a non-dropping particle by citeproc-js, “’s-” always seems to behave like part of the family name, and both “al” and “al-” [the latter not on the list] always like a dropping particle.)

(2) Could we also clarify what post-processing is required, and what citeproc-js actually applies? For example, name parts are usually separated by spaces, but the space following particles ending with an apostrophe (“d’”) or a hyphen (“al-”) should be removed when followed by another name part – with one apparent exception, “de’”, as in “Lorenzo de’ Medici”. Is anyone aware of other such exceptions?
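
The joining rule described in (2) can be sketched as below. This is a minimal illustration of the rule as stated in the question, not a description of what citeproc-js does; `join_particle` and the one-entry exception set are hypothetical, and further exceptions may exist.

```python
APOSTROPHES = ("'", "\u2019")        # straight and typographic apostrophe
KEEP_SPACE = {"de'", "de\u2019"}     # the one exception named above


def join_particle(particle, following):
    """Join a particle to the name part that follows it (hypothetical helper)."""
    if not particle:
        return following
    ends_tight = particle.endswith(APOSTROPHES + ("-",))
    if ends_tight and particle.lower() not in KEEP_SPACE:
        # Particles ending in an apostrophe or hyphen attach directly.
        return particle + following
    return f"{particle} {following}"


print(join_particle("d'", "Artagnan"))    # d'Artagnan
print(join_particle("al-", "Hakim"))      # al-Hakim
print(join_particle("de'", "Medici"))     # de' Medici
print(join_particle("van der", "Waals"))  # van der Waals
```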

(3) Is there any simple way to find out about citeproc-js’s parsing decisions, other than inferring them from the output in a Word/LO file?

fbennett · May 11, 2015

Good stuff. I have some travel coming up, so it will be a week or two before I can produce something, but I'll try to get on it soon. The tests will be one channel for clarifying things - at the moment, testing of the various particle types is banged into a single monolithic test. If that's broken out into separate files, the specifics can be noted in each fixture.

Thanks for flagging the anomalies. I'll look into those.