Names reform: request for comments

fbennett · February 28, 2015

Following up on a ping to another thread, I have built a parsing module for the CSL processor that will give us better control over name particles. Before putting the modified processor up for trials, I have a few questions about some of the things it needs to handle.

de: Is this a "dropping-particle" or a "non-dropping-particle" when it appears alone (i.e. not as "de la")?
al-: Same question - is this a "dropping-particle" or a "non-dropping-particle?" The other thread shows that it is to be treated the same as "de," which I'm not sure about.
d': This is dropping-particle in France, but it is a non-particle in the name of Bruce D'Arcus, who originated the CSL citation formatting language. I'm open to suggestions on how to handle the latter case.
des: According to the list I am working from (HT Charles Parnot), this is non-dropping-particle in Italy, and dropping-particle in Germany. Can anyone confirm that? We'll need a reliable and intuitive way of discriminating between the two.

On the third item, I think we have been discriminating on whether the "D" is capitalized. I don't much like that solution, but unless there is a better suggestion, that is what will go into the next processor version.
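For illustration, the capitalization test for "d'" can be sketched as below. This is a minimal sketch and not the citeproc-js code; the function name and return shape are hypothetical.

```python
def split_d_apostrophe(family):
    """Hypothetical helper: a lowercase "d'" is read as a particle,
    while an uppercase "D'" is left as part of the family name."""
    if family[:2] in ("d'", "d\u2019"):  # straight or curly apostrophe
        return {"particle": family[:2], "family": family[2:]}
    return {"particle": "", "family": family}
```

Under this rule "d'Alembert" yields a particle, while "D'Arcus" is left intact.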

On the fourth item, we have been discriminating between dropping- and non-dropping- particles by the field in which they are placed (i.e. non-dropping-particle at the front of the family name field, and dropping-particle at the end of the given name field). The new parsing engine doesn't care where the particles are entered: it extracts them from either location or from both, and then just classifies them correctly. The fourth item in the list is the one edge case that poses a problem.

Once I've had some feedback on the above, I can release a revised version of the processor patch plugin for trials, and if that works for people, the new parser can be offered up for the next Zotero release.

joehill · February 28, 2015

Whatever the particulars that end up getting decided upon, will there be any way to override? Whether through specifying in the CSL file, since not all styles have the same rules, or through the format of the name itself in the database? I'm thinking of BibTeX, which allows you to put {Curly Brackets} around things you don't want changed, for example. I'd like to see a system that allows for maximum flexibility so that we can tailor the citations and bibliography to what a publishing venue asks for.

fbennett · February 28, 2015

On style parameters, see the CSL Specification.

On item-level override methods, the CSL processor (which is my end of things) recognizes a "static-ordering" flag if it is set on names in its input. Whether and how it is set would be an issue for the Zotero developers - I won't be implementing any in-field markup workaround method for that.

(Correction: I misspoke there: the "static-ordering" flag also freezes the order as family name + given name. I will add a "static-particles" flag to the processor, but the rest is up to Zotero.)

bwiernik · February 28, 2015

Pardon my ignorance, but what is the benefit of the new system over the old "dropping with the given name, non-dropping with the family name" system?

fbennett · February 28, 2015

We still had to do parsing, to figure out where the particle stops and the proper name begins. Early on, I assumed particles were lowercase, which made that easy. Then came several requests for uppercase particles. That caused some false positives, which were controlled with more tweaking ... it pretty much works, but as requests trickled in over time, the parsing was set to become more and more fragile (read: to consume more and more of my time). If we base parsing on a controlled list of matching particles, everything is clear.

The ability to parse well regardless of which field the particles land in is just gravy.
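As a rough illustration of list-based matching (a sketch only: the particle list below is a tiny hypothetical sample, and the function is not taken from citeproc-js), the longest matching entry is tried first and a separator is required after it, so that a name like "Dent" is not mistaken for "de" + "nt":

```python
# Hypothetical, deliberately tiny particle list; a real list is much longer.
PARTICLES = ["de la", "van der", "von", "van", "de", "al-", "d'"]

def split_leading_particle(family):
    """Return (particle, rest) using the longest case-insensitive match."""
    lowered = family.lower()
    for particle in sorted(PARTICLES, key=len, reverse=True):
        # Particles ending in "-" or "'" attach directly; others need a space.
        sep = "" if particle[-1] in "-'" else " "
        prefix = particle + sep
        if lowered.startswith(prefix) and len(family) > len(prefix):
            return family[:len(particle)], family[len(prefix):]
    return "", family
```

The original capitalization of the particle is preserved in the result, so a caller can still look at case if it wants to.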

What would be really nice (from the standpoint of the processor, not necessarily the user) is a means of toggling fixed-particle treatment on a name. That would address all of the edge cases above, and others besides.

zuphilip · February 28, 2015

Librarians have very specific rules for normalizing a name. Other variants may exist, and nowadays search engines can find those as well, but in the days of card catalogues that was not the case.

My knowledge here is limited, but I can give you some insight about the German system RAK-WB:

Main rule: the normalization depends on the nationality of the author. E.g.

the French mathematician Alembert, Jean Le Rond d'

the Italian D'Annunzio, Gabriellino

the (old) Italian Afflitto, Matteo d'

the American D'Arcus, Bruce

There are a lot of other examples with different prefixes and more languages, see § 314a.

Prefixes/suffixes that represent relations (e.g. Abu, Ibn, Bar, Neto, Uly) always belong to the family name, see § 316.

Question: Can you really deal with all those cases in one list? Or how would we like to define the standard here?

Rintze · February 28, 2015

A standalone "de" is an extremely common particle in Dutch (it means "the"), where it's always non-dropping. E.g. "Hugo de Vries" (Hugo the Frisian).

aurimas · February 28, 2015

I don't think this should be part of citeproc (in an ideal scenario). It's up to the front end (Zotero) to provide the means for the user to indicate dropping/non-dropping parts or do the parsing automatically and pass the separate parts to citeproc.

Citeproc-js could provide a convenience function that could help automate the parsing and help unify behavior across different front ends. Though in the meantime, while all the details are figured out, I don't see a problem with just incorporating this in the main pipeline.

nickbart · February 28, 2015

Pardon my possible ignorance, but I have a few concerns here:

I do not feel just parsing a two-part name field using a list of common dropping and non-dropping-particles will be able to deal with all possible cases in a satisfactory manner:

e.g., 'von' in a German name is almost always a dropping-particle, but CMoS 16e, 8.5 treats the 'von' in 'Wernher von Braun' as non-dropping.
Also, 'de' in French names is usually dropping, but 'de' in 'de Gaulle' is not (CMoS 16e, 8.7).
"al-", as far as I understand, can be treated as dropping or non-dropping, too.

My understanding so far was that the data Zotero exports to CSL JSON and the data it forwards to citeproc-js are exactly the same. But what would 'passing a "static-particles" flag to the processor' look like then, and what would this mean for other processors able to work with Zotero's CSL JSON export, such as pandoc-citeproc?

Thus, wouldn't the following be preferable?

Let's reassess whether the existing CSL name schema (with family, given, suffix, non-dropping-particle, dropping-particle) is sufficiently powerful to cope with the entire set of possible use cases. – I would tend to think so, but can we be sure, e.g., that the name form without dropping and non-dropping particles is always the one that should also be used for sorting/collation? And what about capitalisation? E.g., https://en.wikipedia.org/wiki/Van_%28Dutch%29 mentions 'Vincent van Gogh' vs. 'de schilder Van Gogh' ('the painter Van Gogh').
If this is found to be sufficient, or expanded, let Zotero (and other frontends) output/export/forward all names in this CSL name schema (I fully agree with aurimas here) – which of course means that Zotero would have to provide additional fields, other GUI elements, and/or in-field markup (I wouldn't mind either of these; especially since Zotero already uses markup like "," and even "!," for suffixes).

zuphilip · February 28, 2015

(I think "Wernher von Braun" is seen as an American and not German and therefore it is Von Braun, Wernher.)

nickbart · February 28, 2015

Well, CMoS 16e, 8.5 quite clearly lists ‘von Braun’ with a lower-case ‘v’, so, unless CMoS is completely mistaken, ‘von’ can be either dropping or non-dropping, and purely list-based parsing cannot distinguish the two.

adamsmith · February 28, 2015

(there's a long and unresolved tussenvoegsel discussion here. Rather intractable:
https://forums.zotero.org/discussion/27822/2/von-van-de-in-authors-name-appear-as-von-van-de/ )
I think aurimas and nickbart are right on principle -- it'd be much better to handle this properly in the reference manager and not require parsing from citeproc that, it appears, may just not always be possible. But we also have to deal with the current situation, so we might as well get this as right as possible in citeproc-js.

For the original list, "des" as part of a German name is incredibly rare. You can just disregard that case.

bwiernik · February 28, 2015

It seems in general that while a list of particles that exist would work (to avoid some of the parsing problems), there needs to remain a method for the user to specify what type the particle is, as in the current family vs. given placement scheme.

fbennett · February 28, 2015

So now you see why I opened this thread - this is what my mail queue on this subject has looked like.

Keep the comments coming, this is all very useful. I won't respond on policy issues, but when I have code that threads the needle as best may be, I'll post again.

(Re first/last allotment: Yep.)

zuphilip · February 28, 2015

@fbennett: Okay, I understand, you are only looking for the best heuristics to deal with them. You are aware that this will also give false automatic decisions, but maybe we can try to minimize them.

"d'": The only idea I have is to test the language of the work. A French author might write in French mainly, an Italian author in Italian and Bruce D'Arcus in English.

I will try to summarize some information from RAK-WB:

"de" part of last name, non-sorting
* nationality in English-speaking country, e.g. De Quincey, Thomas
* Belgium, e.g. De Lomenie, Edouard
* Luxembourgian, e.g. De Sterio, Alexandre Marius
* Dutch?, Flemish?
* Italian (after 19th century), e.g. De Rossi, Giuseppe Maria (not anymore?)

"de" part of first name, non-sorting
* French, e.g. Broglie, Louis de
* Rumanian, e.g. Puscariu, Emil de
* Spanish, e.g. Pereda, Jose Maria de

I don't see any heuristic here for you, even document language might not really help...

des: Actually, I only find French (or, more generally, Romance) names here, e.g. "Forêts, Jean des", "Des Rochers, Jacques", "Des Courtils, Jacques", "Des Lauriers, Matthew Richard", "Des Clers, Sophie".

@nickbart: I don't know CMOS but see LoC.

fbennett · March 1, 2015

Okay, trial versions of citeproc-js with the new particle parser are up now in the processor patch plugin and the uppercase gadget.

Not much detail to report. You can see the list of particles (with lots of numeric parameter clutter, sorry about that) here. At the user end, it's enough to know that particles are "more sticky" if placed at the start of the last-name field, and "less sticky" if placed at the end of the first-name field.

On parsing behaviour, particles at the end of the first-name field must either be all lowercase, or must be separated by a comma. It's done that way to avoid false positives on initialized names.

Particles at the front of the last-name field can be uppercase or lowercase, it doesn't matter to the parser.

Particles that the parser thinks are always dropping or always non-dropping are always set that way, no matter how they are entered. So "de la" will always set "de" as dropping, and "la" as non-dropping.

Particles that the parser thinks might be dropping or non-dropping are set according to the field in which they occur (i.e. last-name = non-dropping, first-name = dropping), if there is an exact match (ignoring case as described above). If there is not an exact match in both positions (possible with two-element particles), the parser will choose what it thinks is the most likely assignment.

The one case that this will not handle is names that have a particle-like element that is actually part of the last name ("De Quincey"). There isn't much I can do about that, unfortunately.
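To make the classification rules above concrete, here is a minimal sketch. The table entries are hypothetical apart from the "de la" split mentioned in the post, and the "most likely assignment" fallback for inexact two-element matches is deliberately left out:

```python
# Hypothetical table: particle -> fixed (dropping, non-dropping) split,
# or None when the particle may be either and the field decides.
RULES = {
    "de la": ("de", "la"),  # fixed split, as described in the post
    "von": None,            # assumed ambiguous, for illustration only
    "van": None,            # assumed ambiguous, for illustration only
}

def classify(particle, field):
    """field is "family" (front of last name) or "given" (end of first name)."""
    key = particle.lower()
    if key not in RULES:
        raise KeyError(f"unknown particle: {particle!r}")
    fixed = RULES[key]
    if fixed is not None:
        dropping, non_dropping = fixed      # entry field is ignored
    elif field == "family":
        dropping, non_dropping = "", particle
    else:
        dropping, non_dropping = particle, ""
    return {"dropping-particle": dropping, "non-dropping-particle": non_dropping}
```

With this table, "de la" gives the same result from either field, while "von" follows the field it was typed in.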

fbennett · March 1, 2015

(Oh - and this should fix the issue with the al- prefix reported by @joehill.)

zuphilip · March 1, 2015

Some more prefixes for your list:
* "af", e.g. Geijerstam, Gustaf af
* "aus der", e.g. Au, Otto aus der
* "in der", e.g. Gand, Hanns in der
* "auf der", e.g. Maur, Paul auf der
* "von und zu", e.g. Urff, Georg Ludwig von und zu
* "vom und zum", e.g. Stein, Karl vom und zum
* "aus'm", e.g. Aus'm Weerth, Ernst
* "dall", e.g. Dall'Ongaro, Francesco
* "de'", e.g. Medici, Lorenzo de'
* "degli", e.g. Uberti, Fazio degli
* "dei" and "de li"
* "'s-", e.g. Gravesande, Goverdus 's-
* "'t", e.g. Hoen, Pieter 't
* "z" and "ze", e.g. Zerotina, Karel ze

fbennett · March 1, 2015

@zuphilip: Thanks! Up in a release soon.

nickbart · March 1, 2015

I started testing the new processor gadget. A big thank you, fbennett, but (of course) also a few questions and suggestions:

First, is there any reason why Zotero could not copy the new parsing mechanisms, and output parsed names by default itself?

Second, unexpected behaviour of the new gadget:

Particles as part of the family name

From http://gsl-nagoya-u.net/http/pub/citeproc-doc.html#particles-as-part-of-the-last-name:

“To suppress parsing and treat such particles as part of the family name field, enclose the family name field content in double-quotes:”

{ "author" : [
    { "family" : "\"van der Vlist\"",
      "given" : "Eric",
      "parse-names" : "true"
    }
  ]
}

Is this still expected to work? (Currently, with the new processor gadget, it does not seem to.)

My suggestion would be to (1) keep this double quotes convention, and (2) introduce a new option to use a non-breaking space between family names parts to keep them together if needed, e.g., between ‘De’ and ‘Quincey’ (highly unobtrusive, but also not very obvious; still, I'd prefer this very much …).
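A minimal sketch of how both protection conventions could be detected before any particle parsing is attempted (the function name and return shape are hypothetical, and whether a processor would strip the quotes or keep the non-breaking space is an assumption made here):

```python
NBSP = "\u00a0"  # non-breaking space

def prepare_family(family):
    """Return (family_text, should_parse) for a raw family-name field."""
    if len(family) >= 2 and family[0] == '"' and family[-1] == '"':
        # Double-quoted: strip the quotes and skip particle parsing.
        return family[1:-1], False
    if NBSP in family:
        # Non-breaking space between name parts: keep as-is, skip parsing.
        return family, False
    return family, True
```

So `"van der Vlist"` (quoted) and `De\u00a0Quincey` would both bypass the parser, while a plain `van Gogh` would still be parsed.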

al-

‘al-Hakim’/‘Tawiq’ in Zotero’s surname/firstname fields is rendered as ‘Hakim, Tawiq al-’ (using chicago-author-date.csl); ‘Hakim’/‘Tawiq al-’ yields the same result. With ‘al-’ at the start of the surname field, I would have expected ‘al-Hakim, Tawiq’ (which is also what CMoS 16, 8.14, wants).

Assimilated forms of ‘al-’, such as ‘at-’, ‘an-’, ‘ash-’ (which were reported to be working before; https://forums.zotero.org/discussion/28457/arabic-names-with-the-particle-al/) do not seem to be parsed properly either; neither do forms with diacritics, such as ‘aṭ-Ṭūsī’.
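One possible way to recognise these forms, sketched under the assumption that the article is always followed by a hyphen and that diacritics can be ignored for matching. The set of romanized forms below is my own illustrative list, not taken from the processor:

```python
import unicodedata

# Assumed romanizations of the article "al-" and its assimilated forms.
ARTICLE_FORMS = {"al", "at", "ath", "ad", "adh", "ar", "az", "as", "ash", "an"}

def strip_diacritics(text):
    """Drop combining marks after canonical decomposition."""
    decomposed = unicodedata.normalize("NFD", text)
    return "".join(ch for ch in decomposed if not unicodedata.combining(ch))

def split_arabic_article(family):
    """Return (article, rest); the article keeps its original spelling."""
    head, sep, rest = family.partition("-")
    if sep and rest and strip_diacritics(head).lower() in ARTICLE_FORMS:
        return head + "-", rest
    return "", family
```

Matching is done on the diacritic-free form, but the original text is returned, so "aṭ-Ṭūsī" splits into "aṭ-" and "Ṭūsī".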

van / Van

‘van Gogh’/‘Vincent’ (using chicago-author-date.csl): ‘Van’ is capitalised in the reference list (‘Van Gogh, Vincent. 1983. …’) but not in the in-text citation (‘(van Gogh 1983)’). CMoS 16, 8.10, seems to suggest that ‘van’ should always be capitalised unless preceded by a first name.
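The rule as I read it could be sketched like this (a hypothetical helper, only meant to pin down the expected output, not a description of what the processor or the style currently does):

```python
def render_family(particle, family, given_precedes):
    """Join particle and family name; capitalize the particle's first
    letter when no given name comes before it."""
    if not particle:
        return family
    if not given_precedes:
        particle = particle[0].upper() + particle[1:]
    return f"{particle} {family}"
```

That would give "Vincent van Gogh" in running form, but "Van Gogh" both at the head of a reference-list entry and in an in-text citation.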

Third, a suggestion for Zotero:

parse-names flag

http://gsl-nagoya-u.net/http/pub/citeproc-doc.html#id28 says that simple two-field entries should be parsed (to identify particles and suffixes) only when a ‘parse-names’ flag is present, as in:

{ "author" : [
    { "family" : "van Gogh",
      "given" : "Vincent",
      "parse-names" : "true"
    }
  ]
}

I would strongly suggest that Zotero should add this ‘"parse-names" : "true"’ flag to all unparsed two-field name elements (but not to ‘literal’ names, of course) when exporting to CSL JSON. (As soon as Zotero starts exporting parsed names, this could of course be removed again.)

In particular, this is essential for pandoc-citeproc, since pandoc-citeproc needs to be able to distinguish unparsed CSL JSON (e.g., obtained from Zotero) from already parsed CSL JSON (e.g., created from bibtex/biblatex via ‘pandoc-citeproc --bib2json’).
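What I have in mind for the export step, as a minimal sketch: the function and the choice of name variables are hypothetical, and treating any existing particle or suffix key as "already parsed" is an assumption.

```python
PARSED_KEYS = ("dropping-particle", "non-dropping-particle", "suffix")

def flag_unparsed_names(item, name_fields=("author", "editor")):
    """Return a copy of a CSL JSON item with "parse-names" set on
    two-field names that carry no parsed particle/suffix keys."""
    out = dict(item)
    for field in name_fields:
        if field not in item:
            continue
        names = []
        for name in item[field]:
            name = dict(name)  # copy, so the input is not mutated
            is_literal = "literal" in name
            is_parsed = any(key in name for key in PARSED_KEYS)
            if not is_literal and not is_parsed:
                name["parse-names"] = "true"
            names.append(name)
        out[field] = names
    return out
```

Literal names and names that already have particle fields pass through unchanged, which is what lets a consumer tell the two kinds of input apart.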

Fourth, documentation. Do we have anything already? (‘Names’ in https://www.zotero.org/support/getting_stuff_into_your_library doesn’t have anything on particles.) What's most important? (The first item on my list would be pointing out that names such as ‘De Quincey’ and ‘Van Rompuy’ (which do not contain particles, and should always be sorted under ‘D’ and ‘V’) must be protected, at least in the current setup.)

aurimas · March 1, 2015

First, is there any reason why Zotero could not copy the new parsing mechanisms, and output parsed names by default itself?

No reason. Hence my suggestion above that this should not be part of citeproc pipeline.

Accordingly...

I would strongly suggest that Zotero should add this ‘"parse-names" : "true"’ flag to all unparsed two-field name elements (but not to ‘literal’ names, of course) when exporting to CSL JSON. (As soon as Zotero starts exporting parsed names, this could of course be removed again.)

If citeproc implemented the parser, we could just jump on that instead of including non-standard CSL JSON fields.

fbennett · March 1, 2015

@nickbart: Thanks very much for this feedback. On the input and field markup details, could you repost to the xbiblio-devel list? I'm trying to get out of the ad hoc hackery game at this point - if the CSL group arrives at a set of preferences for how name formatting hints should be delivered in the input, I'll be happy to implement the recommendations in the processor.

(On the quotes issue, I'm not sure; I may have removed it after complaints about the design [or lack thereof].)

fbennett · March 1, 2015

(Following on from the above, I'll note that the "parse-names" flag described in the processor manual isn't part of CSL. There was a lot of that exploratory coding in the early scramble to get the processor working. There are traces of it in the docs and tests that I should clean up sometime. Basically, now that the waters are calm, all this kind of detail should find its way into the CSL Specification, or at least into a supplementary doc published to the CSL site.)

Gracile · March 2, 2015

In France, main rule: the normalization depends on the nationality of the author. As for French names, only « d' » and « de » are dropping-particles.

http://www.culture.gouv.fr/culture/inventai/extranetIGPC/normes/constit_normesbiblio.pdf => p. 41 (Ministry of Culture, France, extracted from an AFNOR standard)

I'm a bit sceptical, but I will test the gadget, and look at the results with the different values of demote-non-dropping-particle…

Edit: just "for fun", an old (1998) article on this subject.

fbennett · March 2, 2015

Gracile: Thanks! The results should at least not be worse than before. There is some scope for further tuning within this approach, although there will be limits.

Gracile · March 3, 2015

Frank: I had posted it before but this might be of interest (not OCR'ed unfortunately):
Names of Persons. National Usages for Entry in Catalogues. 4th rev. and enl. edition. München : K. G. Saur, 1996. ISBN 3-598-11342-0

adamsmith · March 3, 2015

Once we get our CSL sponsorship money for this year (which should hopefully be soon), I'd be happy to buy this for Frank with that if he's interested.

fbennett · March 3, 2015

Thanks, that might be handy to have, so long as I'm not the only one with a copy. As I wrote in response to @nickbart's post above, I'm trying to get out of the ad hoc hackery game at this point. Names handling is important, but it really needs to be driven by parameters set in CSL. The new processor code is a step toward allowing that to happen, so that processor implementers (myself included) can take comfort in sticking to the requirements.

fbennett · March 17, 2015

There haven't been any negative reports on the new name particle parsing kit described above. In test fixtures, it performs a little better than the previous code. As it will be easier to control things with the new method, I've rolled it out in the latest processor release (citeproc-js v1.1.5).

When Zotero updates the CSL processor, we'll be running on the new parsing mechanism.

Gracile · March 19, 2015

You can see the list of particles (with lots of numeric parameter clutter, sorry about that) here.

Particles that the parser thinks are always dropping or always non-dropping are always set that way, no matter how they are entered.

I'd like to understand the logic here and it's difficult with these numeric parameters indeed :-) Could you provide a list of the particles which should have a consistent behaviour (i.e. which are always dropping or non-dropping for the parser) please?

On parsing behaviour, particles at the end of the first-name field must either be all lowercase, or must be separated by a comma.

e.g. : "Jean, De" ? (it is just an example)

The one case that this will not handle is names that have a particle-like element that is actually part of the last name ("De Quincey").

I don't follow. What's the problem with "De Quincey"? Sorting?

fbennett · March 19, 2015

Could you provide a list of the particles which should have a consistent behaviour (i.e. which are always dropping or non-dropping for the parser) please?

Will do. I'll be on the road for the next week - if you don't see anything by the first week in April, give me a shout (in case I forget).

I don't follow. What's the problem with "De Quincey"? Sorting?

The "De" in this name should never drop, even when dropping of "non-dropping" particles is forced.
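A small sketch of why that matters for sorting, using the CSL demote-non-dropping-particle option. This is a simplified illustration of the sort key only, not the processor's implementation, and the function name is hypothetical:

```python
def sort_key(name, demote="display-and-sort"):
    """Simplified sort key for a parsed CSL name.

    With demote == "never" a non-dropping particle stays in front of the
    family name for sorting; with "sort-only" or "display-and-sort" it is
    demoted, so the name sorts under the bare family name.
    """
    family = name["family"]
    particle = name.get("non-dropping-particle", "")
    if particle and demote == "never":
        return f"{particle} {family}".lower()
    return family.lower()
```

A parsed "de" + "Vries" sorts under "V" once demotion is on, whereas "De Quincey" only stays under "D" if the whole string remains in the family field, which is exactly what automatic particle parsing would undo.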