Names reform: request for comments
Following up on a ping to another thread, I have built a parsing module for the CSL processor that will give us better control over name particles. Before putting the modified processor up for trials, I have a few questions about some of the things it needs to handle.
- de
  - Is this a "dropping-particle" or a "non-dropping-particle" when it appears alone (i.e. not as "de la")?
- al-
  - Same question: is this a "dropping-particle" or a "non-dropping-particle"? The other thread shows that it is to be treated the same as "de", which I'm not sure about.
- d'
  - This is a dropping-particle in France, but it is a non-particle in the name of Bruce D'Arcus, who originated the CSL citation formatting language. I'm open to suggestions on how to handle the latter case.
- des
  - According to the list I am working from (HT Charles Parnot), this is a non-dropping-particle in Italy, and a dropping-particle in Germany. Can anyone confirm that? We'll need a reliable and intuitive way of discriminating between the two.
On the fourth item, we have been discriminating between dropping- and non-dropping- particles by the field in which they are placed (i.e. non-dropping-particle at the front of the family name field, and dropping-particle at the end of the given name field). The new parsing engine doesn't care where the particles are entered: it extracts them from either location or from both, and then just classifies them correctly. The fourth item in the list is the one edge case that poses a problem.
Once I've had some feedback on the above, I can release a revised version of the processor patch plugin for trials, and if that works for people, the new parser can be offered up for the next Zotero release.
On item-level override methods, the CSL processor (which is my end of things) recognizes a "static-ordering" flag if it is set on names in its input. Whether and how it is set would be an issue for the Zotero developers - I won't be implementing any in-field markup workaround method for that.
(Correction: I misspoke there: the "static-ordering" flag also freezes the order as family name + given name. I will add a "static-particles" flag to the processor, but the rest is up to Zotero.)
The ability to parse well regardless of which field the particles land in is just gravy.
What would be really nice (from the standpoint of the processor, not necessarily the user) is a means of toggling fixed-particle treatment on a name. That would address all of the edge cases above, and others besides.
My knowledge here is limited, but I can give you some insight about the German system RAK-WB:
Main rule: the normalization depends on the nationality of the author. E.g.
- the French mathematician Alembert, Jean Le Rond d'
- the Italian D'Annunzio, Gabriellino
- the (old) Italian Afflitto, Matteo d'
- the American D'Arcus, Bruce
There are a lot of other examples with different prefixes and more languages, see § 314a. Prefixes/suffixes that represent relations (e.g. Abu, Ibn, Bar, Neto, Uly) always belong to the family name, see § 316.
Question: Can you really deal with all those cases in one list? Or how would we like to define the standard here?
Citeproc-js could provide a convenience function that could help automate the parsing and help unify behavior across different front ends. Though in the meantime, while all the details are figured out, I don't see a problem with just incorporating this in the main pipeline.
Pardon my possible ignorance, but I have a few concerns here:
I do not feel just parsing a two-part name field using a list of common dropping and non-dropping-particles will be able to deal with all possible cases in a satisfactory manner:
My understanding so far was that the data Zotero exports to CSL JSON and the data it forwards to citeproc-js are exactly the same. But what would 'passing a "static-particles" flag to the processor' look like then, and what would this mean for other processors able to work with Zotero's CSL JSON export, such as pandoc-citeproc?
Thus, wouldn't the following be preferable?
https://forums.zotero.org/discussion/27822/2/von-van-de-in-authors-name-appear-as-von-van-de/
I think aurimas and nickbart are right in principle: it'd be much better to handle this properly in the reference manager and not require parsing by citeproc, which, it appears, may just not always be possible. But we also have to deal with the current situation, so we might as well get this as right as possible in citeproc-js.
For the original list, "des" as part of a German name is incredibly rare. You can just disregard that case.
Keep the comments coming, this is all very useful. I won't respond on policy issues, but when I have code that threads the needle as best may be, I'll post again.
(Re first/last allotment: Yep.)
"d'": The only idea I have is to test the language of the work. A French author might write in French mainly, an Italian author in Italian and Bruce D'Arcus in English.
I will try to summarize some information from RAK-WB:
"de" part of last name, non-sorting
* nationality in English-speaking country, e.g. De Quincey, Thomas
* Belgium, e.g. De Lomenie, Edouard
* Luxembourgian, e.g. De Sterio, Alexandre Marius
* Dutch?, Flemish?
* Italian (after 19th century), e.g. De Rossi, Giuseppe Maria (not anymore?)
"de" part of first name, non-sorting
* French, e.g. Broglie, Louis de
* Rumanian, e.g. Puscariu, Emil de
* Spanish, e.g. Pereda, Jose Maria de
I don't see any heuristic here for you, even document language might not really help...
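To make the difficulty concrete, the RAK-WB summary for "de" above can be written as a lookup table keyed on nationality. This is only a sketch: the table name `DE_PLACEMENT`, the function `place_de`, and the lowercase nationality keys are made up for illustration, and ordinary CSL name data carries no nationality at all, which is exactly why no heuristic falls out of it.

```python
# Transcribed from the RAK-WB summary above. The entry marked "?" there
# (Dutch/Flemish) is left out rather than guessed.
DE_PLACEMENT = {
    "english-speaking": "family",
    "belgian": "family",
    "luxembourgian": "family",
    "italian": "family",  # per the summary: after the 19th century, possibly no longer
    "french": "given",
    "romanian": "given",
    "spanish": "given",
}


def place_de(nationality):
    """Return 'family', 'given', or None when the table has no answer."""
    return DE_PLACEMENT.get(nationality.lower())
```

The `None` case is the interesting one: any name whose nationality is unknown or unlisted still needs a human decision.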
des: Actually, I only find French (or, more generally, Romance-language) names here, e.g. "Forêts, Jean des", "Des Rochers, Jacques", "Des Courtils, Jacques", "Des Lauriers, Matthew Richard", "Des Clers, Sophie".
@nickbart: I don't know CMOS but see LoC.
Not much detail to report. You can see the list of particles (with lots of numeric parameter clutter, sorry about that) here. At the user end, it's enough to know that particles are "more sticky" if placed at the start of the last-name field, and "less sticky" if placed at the end of the first-name field.
On parsing behaviour, particles at the end of the first-name field must either be all lowercase, or must be separated by a comma. It's done that way to avoid false positives on initialized names.
Particles at the front of the last-name field can be uppercase or lowercase, it doesn't matter to the parser.
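The two field rules above can be sketched roughly as follows. This is not the processor's code: the `PARTICLES` list is a tiny hypothetical stand-in for the real one, and `split_given` and `split_family` are illustrative names.

```python
# Hypothetical, deliberately tiny particle list; the real list is much longer.
PARTICLES = ["de la", "van der", "von", "van", "de", "d'", "al-"]


def split_given(given):
    """Return (given, particle) using the rules for the first-name field."""
    # An explicit comma separates the particle, whatever its case.
    if "," in given:
        head, tail = (part.strip() for part in given.rsplit(",", 1))
        if tail.lower() in PARTICLES:
            return head, tail
        return given, ""
    # Without a comma, only an all-lowercase trailing particle is accepted
    # (the comparison is case-sensitive and the list is lowercase).
    for particle in sorted(PARTICLES, key=len, reverse=True):
        if given.endswith(" " + particle):
            return given[: -len(particle)].rstrip(), particle
    return given, ""


def split_family(family):
    """Return (particle, family); case does not matter in the last-name field."""
    lowered = family.lower()
    for particle in sorted(PARTICLES, key=len, reverse=True):
        # Particles ending in an apostrophe or hyphen attach directly to the name.
        sep = "" if particle[-1] in "'-" else " "
        prefix = particle + sep
        if lowered.startswith(prefix) and len(family) > len(prefix):
            return family[: len(particle)], family[len(prefix):]
    return "", family
```

Longer particles are tried first so that "van der" is not cut short at "van".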
Particles that the parser thinks are always dropping or always non-dropping are always set that way, no matter how they are entered. So "de la" will always set "de" as dropping, and "la" as non-dropping.
Particles that the parser thinks might be dropping or non-dropping are set according to the field in which they occur (i.e. last-name = non-dropping, first-name = dropping), if there is an exact match (ignoring case as described above). If there is not an exact match in both positions (possible with two-element particles), the parser will choose what it thinks is the most likely assignment.
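As a rough illustration of the classification step, assuming the particle has already been extracted: the tables `ALWAYS` and `AMBIGUOUS` and their contents are hypothetical (only the "de la" split is taken from the description above), and the "most likely assignment" fallback for inexact two-element matches is not modelled.

```python
# Particles with a fixed classification: (dropping part, non-dropping part).
# Only "de la" comes from the description above.
ALWAYS = {"de la": ("de", "la")}

# Particles whose classification depends on the field they were found in.
# Hypothetical membership, for illustration only.
AMBIGUOUS = {"de", "des", "d'"}


def classify(particle, field):
    """Map an extracted particle to CSL name keys; field is 'family' or 'given'."""
    key = particle.lower()
    if key in ALWAYS:
        dropping, non_dropping = ALWAYS[key]
        return {"dropping-particle": dropping, "non-dropping-particle": non_dropping}
    if key in AMBIGUOUS:
        slot = "non-dropping-particle" if field == "family" else "dropping-particle"
        return {slot: particle}
    return {}
```

For example, `classify("de", "given")` yields a dropping particle, while the same "de" found in the last-name field comes out as non-dropping.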
The one case that this will not handle is names that have a particle-like element that is actually part of the last name ("De Quincey"). There isn't much I can do about that, unfortunately.
* "af", e.g. Geijerstam, Gustaf af
* "aus der", e.g. Au, Otto aus der
* "in der", e.g. Gand, Hanns in der
* "auf der", e.g. Maur, Paul auf der
* "von und zu", e.g. Urff, Georg Ludwig von und zu
* "vom und zum", e.g. Stein, Karl vom und zum
* "aus'm", e.g. Aus'm Weerth, Ernst
* "dall", e.g. Dall'Ongaro, Francesco
* "de'", e.g. Medici, Lorenzo de'
* "degli", e.g. Uberti, Fazio degli
* "dei" and "de li"
* "'s-", e.g. Gravesande, Goverdus 's-
* "'t", e.g. Hoen, Pieter 't
* "z" and "ze", e.g. Zerotina, Karel ze
I started testing the new processor gadget. A big thank you, fbennett, but (of course) also a few questions and suggestions:
First, is there any reason why Zotero could not copy the new parsing mechanisms, and output parsed names by default itself?
Second, unexpected behaviour of the new gadget:
Particles as part of the family name
From http://gsl-nagoya-u.net/http/pub/citeproc-doc.html#particles-as-part-of-the-last-name:
“To suppress parsing and treat such particles as part of the family name field, enclose the family name field content in double-quotes:”
Is this still expected to work? (Currently, with the new processor gadget, it does not seem to.)
My suggestion would be to (1) keep this double quotes convention, and (2) introduce a new option to use a non-breaking space between family names parts to keep them together if needed, e.g., between ‘De’ and ‘Quincey’ (highly unobtrusive, but also not very obvious; still, I'd prefer this very much …).
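To show what I mean, here is a minimal sketch of both protection conventions as a pre-check before any particle parsing. The function name `protected_family` is made up, and converting the non-breaking space back to a plain space on output is just one possible choice.

```python
NBSP = "\u00a0"  # non-breaking space


def protected_family(family):
    """Return the literal family name if it is protected, else None."""
    # Convention 1: the whole field is enclosed in double quotes.
    if len(family) >= 2 and family[0] == '"' and family[-1] == '"':
        return family[1:-1]
    # Convention 2 (suggested): a non-breaking space glues the parts together.
    if NBSP in family:
        return family.replace(NBSP, " ")
    return None
```

A `None` result would mean "not protected, go ahead and parse for particles".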
al-
‘al-Hakim’/‘Tawiq’ in Zotero’s surname/firstname fields is rendered as ‘Hakim, Tawiq al-’ (using chicago-author-date.csl); ‘Hakim’/‘Tawiq al-’ yields the same result. With ‘al-’ at the start of the surname field, I would have expected ‘al-Hakim, Tawiq’ (which is also what CMoS 16, 8.14, wants).
Assimilated forms of ‘al-’, such as ‘at-’, ‘an-’, ‘ash-’ (which were reported to be working before; https://forums.zotero.org/discussion/28457/arabic-names-with-the-particle-al/) do not seem to be parsed properly either; neither do forms with diacritics, such as ‘aṭ-Ṭūsī’.
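One way the assimilated and diacritic forms might be matched is to fold away combining marks before comparing against a list of article forms. This is a sketch only: `ARTICLE_FORMS` is my own illustrative list of transliterated forms, and `fold` and `split_article` are hypothetical names, not anything in the processor.

```python
import unicodedata

# Illustrative transliterations of the article after diacritics are removed.
ARTICLE_FORMS = {"al-", "at-", "ath-", "ad-", "adh-", "ar-", "az-", "as-", "ash-", "an-"}


def fold(text):
    """Lowercase the text and strip combining marks (so 'ṭ' compares as 't')."""
    decomposed = unicodedata.normalize("NFD", text)
    return "".join(ch for ch in decomposed if not unicodedata.combining(ch)).lower()


def split_article(family):
    """Return (article, rest), keeping the original spelling of the article."""
    head, hyphen, rest = family.partition("-")
    if hyphen and rest and fold(head + hyphen) in ARTICLE_FORMS:
        return head + hyphen, rest
    return "", family
```

With this, ‘aṭ-Ṭūsī’ splits into ‘aṭ-’ and ‘Ṭūsī’, while an ordinary hyphenated surname is left alone.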
van / Van
‘van Gogh’/‘Vincent’ (using chicago-author-date.csl): ‘Van’ is capitalised in the reference list (‘Van Gogh, Vincent. 1983. …’) but not in the in-text citation (‘(van Gogh 1983)’). CMoS 16, 8.10, seems to suggest that ‘van’ should always be capitalised unless preceded by a first name.
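In code, the rule as I read it would look something like the sketch below. The function `display` and its parameters are hypothetical, and particles that attach without a space (such as "d'") are ignored to keep it short.

```python
def display(family, particle, given="", given_first=False):
    """Capitalise the particle unless a given name directly precedes it."""
    if given and given_first:
        # "Vincent van Gogh": the particle stays as entered.
        return f"{given} {particle} {family}"
    capped = particle[:1].upper() + particle[1:]
    name = f"{capped} {family}"
    # Sort-order form ("Van Gogh, Vincent") or bare in-text form ("Van Gogh").
    return f"{name}, {given}" if given else name
```

Under this reading, the in-text citation would come out as ‘(Van Gogh 1983)’ rather than ‘(van Gogh 1983)’.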
Third, a suggestion for Zotero:
parse-names flag
http://gsl-nagoya-u.net/http/pub/citeproc-doc.html#id28 says that simple two-field entries should be parsed (to identify particles and suffixes) only when a ‘parse-names’ flag is present, as in:
I would strongly suggest that Zotero should add this ‘"parse-names" : "true"’ flag to all unparsed two-field name elements (but not to ‘literal’ names, of course) when exporting to CSL JSON. (As soon as Zotero starts exporting parsed names, this could of course be removed again.)
In particular, this is essential for pandoc-citeproc, since pandoc-citeproc needs to be able to distinguish unparsed CSL JSON (e.g., obtained from Zotero) from already parsed CSL JSON (e.g., created from bibtex/biblatex via ‘pandoc-citeproc --bib2json’).
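A minimal sketch of what the export step might do, with names held as plain dictionaries. The function name `flag_for_parsing` is made up, and treating a name as "already parsed" when it carries a particle key is my assumption, not documented behaviour.

```python
def flag_for_parsing(names):
    """Return copies of CSL JSON names with "parse-names" set where appropriate."""
    flagged = []
    for name in names:
        name = dict(name)  # do not modify the caller's data
        # Assumption: a particle key means the name has been parsed already.
        already_parsed = "dropping-particle" in name or "non-dropping-particle" in name
        if "literal" not in name and not already_parsed:
            name["parse-names"] = "true"
        flagged.append(name)
    return flagged
```

So `{"family": "Humboldt", "given": "Alexander von"}` would gain the flag, while a ‘literal’ name or a name that already has its particles split out would pass through unchanged.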
Fourth, documentation. Do we have anything already? (‘Names’ in https://www.zotero.org/support/getting_stuff_into_your_library doesn’t have anything on particles.) What's most important? (The first item on my list would be pointing out that names such as ‘De Quincey’ and ‘Van Rompuy’ (which do not contain particles, and should always be sorted under ‘D’ and ‘V’) must be protected, at least in the current setup.)
Accordingly... If citeproc implemented the parser, we could just jump on that instead of including non-standard CSL JSON fields.
(On the quotes issue, I'm not sure; I may have removed it after complaints about the design [or lack thereof].)
http://www.culture.gouv.fr/culture/inventai/extranetIGPC/normes/constit_normesbiblio.pdf => p. 41 (Ministry of Culture, France, extracted from an AFNOR standard)
I'm a bit sceptical, but I will test the gadget, and look at the results with the different values of demote-non-dropping-particle…
Edit: just "for fun", an old (1998) article on this subject.
Names of Persons. National Usages for Entry in Catalogues. 4th rev. and enl. edition. München : K. G. Saur, 1996. ISBN 3-598-11342-0
When Zotero updates the CSL processor, we'll be running on the new parsing mechanism.