To respond to a couple of fault reports, I have taken another dive into the parsing code. I've made two changes.
First, I've made parsing case-insensitive throughout. This is not meant to oppose nickbart's proposal to make parsing case-sensitive; it's only meant to produce consistent behaviour in the existing model. Previously, particles were identified with case-insensitive regexps, and then classified with a case-sensitive string comparison. In theory, a case-sensitive classification could be made to work, but the way the parser is currently coded, a given case-normalized particle set can only be defined once (the last case-normalized line in PARTICLES wins). If CSL opts for case-sensitive matching, adjustments can be made to the parser, but with this change users will at least get predictable results.
Second, specifically to address one of the fault reports, I have set up the particle sets "de las" and "de la" with a dropping/non-dropping option, defaulting to all-non-dropping.
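For readers who want to see why "the last case-normalized line wins", here is a minimal sketch. It assumes a lookup keyed by the lower-cased particle string; the names PARTICLE_LINES, buildParticleTable and classify, and the "all-dropping" label, are hypothetical and are not taken from the citeproc-js source.

```javascript
// Hypothetical particle definitions: [particle string, classification].
// This is NOT the real PARTICLES table from citeproc-js, only an illustration.
const PARTICLE_LINES = [
  ["de la", "all-non-dropping"],
  ["De La", "all-dropping"], // differs from the first line only by case
  ["van", "non-dropping"],
];

// Build a lookup keyed by the lower-cased particle string.
// Because keys are case-normalized, a later line silently replaces an
// earlier one that differs only in capitalization.
function buildParticleTable(lines) {
  const table = new Map();
  for (const [particle, classification] of lines) {
    table.set(particle.toLowerCase(), classification);
  }
  return table;
}

// Case-insensitive classification: normalize the candidate before lookup.
function classify(table, candidate) {
  const key = candidate.toLowerCase();
  return table.has(key) ? table.get(key) : null;
}

const table = buildParticleTable(PARTICLE_LINES);
console.log(classify(table, "de la"));
console.log(classify(table, "Van"));
```

With a table like this, both spellings of "de la" resolve to whichever line was read last, which is the predictable (if limited) behaviour described above.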
Where there are multiple entries, the classification of the particle depends on its location (front of the family field, end of the given field). I don't actually know if that is ever desirable, but that's what it does.
That’s highly desirable since the convention of entering non-dropping particles in the family field and dropping particles in the given field is the only way to distinguish particles that could be either, for example “de” or “van”.
(This would need an extension to the categories. We currently have "dropping" and "non-dropping". To that we might want to add "part-of-name" and possibly [but not certainly] "never-dropping" [for Arabic name particles]).
I really don’t think so. I am not aware of any style guide or other authority that would call for listing “Jean de La Fontaine” under “F” and “Van Rompuy” under “R”, so unless someone presents evidence to the contrary, I am pretty sure there’s nothing wrong with treating “La Fontaine” or “Van Rompuy”, or indeed all multi-part family names listed in CMS 8.5 (D’Amato, de la Mare, de Man, De Quincey, du Maurier, La Follette, Le Carré, L’Enfant, Ten Broeck, van Gulik*, Van Rensselaer, von Braun*) or 16.71 (da Cunha, D’Amato, de Gaulle, di Leonardo, La Fontaine, Van Rensselaer) as, well, multi-part family names, or family names that happen to contain spaces; no particles (in the sense of the CSL specs) involved here at all.
* If we assume CMS interprets these as Americanised name forms – but their inclusion in 8.5 seems to suggest this.
This may be a rather ignorant question, but are there any cases where an uppercased family name element is actually a particle? (e.g. we just concluded that "La" in "Jean de La Fontaine" is never demoted, so we don't need to treat it as a particle, right? I assume the same goes for the Americanized "Van")
On the contrary, it’s an excellent question indeed. I’m not aware of any upper-cased family name elements that are actually particles either (with the possible exception of Arabic “Al-”, see below).
In particular, the only genuine non-dropping particles I’m aware of now are either from the Dutch (could someone confirm these are all lower-case, without exception?) or the Arabic (where the vast majority of “al-” and friends are lower-case; to err on the safe side, I previously suggested including the upper-case variants, but I’d have to check whether that’s really ever required).
I'm actually starting to wonder whether it wouldn’t be much easier to build name parsing around one simple rule (or convention): lower-case strings at the front of the family field are parsed as non-dropping particles, and lower-case strings at the end of the given field are parsed as dropping particles.
This would still, just like in the current setup, require protecting some family names by wrapping them in double quotes (examples from CMS: de la Mare, de Man, du Maurier, da Cunha, de Gaulle, di Leonardo).
Still, I’d expect this to be the least confusing setup for most users while we wait for a five-part name field in the Zotero UI to appear.
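A rough sketch of that one-rule convention follows. It is only an illustration of the proposal, not the current parser: the function names parseName and isLowerCaseWord are made up, words are split on whitespace only, and the protection of family names such as "de Man" is deliberately left out.

```javascript
// Sketch of the proposed convention (an assumption, not citeproc-js code):
//  - lower-case words at the FRONT of the family field -> non-dropping particle
//  - lower-case words at the END of the given field    -> dropping particle

// True if the word starts with a lower-case letter (Unicode-aware).
function isLowerCaseWord(word) {
  return /^\p{Ll}/u.test(word);
}

function parseName(familyField, givenField) {
  // Peel lower-case words off the front of the family field,
  // always leaving at least one word as the family name proper.
  const familyWords = familyField.trim().split(/\s+/);
  const leading = [];
  while (familyWords.length > 1 && isLowerCaseWord(familyWords[0])) {
    leading.push(familyWords.shift());
  }

  // Peel lower-case words off the end of the given field,
  // always leaving at least one word as the given name.
  const givenWords = givenField.trim().split(/\s+/);
  const trailing = [];
  while (
    givenWords.length > 1 &&
    isLowerCaseWord(givenWords[givenWords.length - 1])
  ) {
    trailing.unshift(givenWords.pop());
  }

  return {
    family: familyWords.join(" "),
    given: givenWords.join(" "),
    nonDropping: leading.join(" "),
    dropping: trailing.join(" "),
  };
}

console.log(parseName("de la Fontaine", "Jean"));
console.log(parseName("La Fontaine", "Jean de"));
console.log(parseName("Van der Meeren", "Patsy"));
```

Because only the leading words of the family field are inspected, a lower-case element inside the field (as in "Van der Meeren") is left alone.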
Second, specifically to address one of the fault reports, I have set up the particle sets "de las" and "de la" with a dropping/non-dropping option, defaulting to all-non-dropping.
But wouldn’t this give you “De la Fontaine, Jean” now?
As noted above, I've coded the default of "de la" to be all-non-dropping; but the "de" can be made dropping, as shown below.
As Rintze points out below, this example may be entirely wrong, insofar as real-world use goes, since in this name "La" should apparently be capitalized and treated as part of the surname. So the treatment of "de la" below may be short-lived, but it at least serves to illustrate the operation of the parser.
I will continue to tweak the particle settings in response to specific user complaints. Meanwhile, if consensus emerges about particular changes, pull requests will be welcome.
Data entered: [de la Fontaine] [Jean]
- demote-non-dropping-particle="never"
  Citation (form="short"): de la Fontaine
  Bibliography (name-as-sort-order="first"): de la Fontaine, Jean; Einstein, Albert; Kafka, Franz
- demote-non-dropping-particle="sort-only"
  Citation (form="short"): de la Fontaine
  Bibliography (name-as-sort-order="first"): Einstein, Albert; de la Fontaine, Jean; Kafka, Franz
- demote-non-dropping-particle="display-and-sort"
  Citation (form="short"): de la Fontaine
  Bibliography (name-as-sort-order="first"): Einstein, Albert; Fontaine, Jean de la; Kafka, Franz

Data entered: [la Fontaine] [Jean de]
- demote-non-dropping-particle="never"
  Citation (form="short"): la Fontaine
  Bibliography (name-as-sort-order="first"): Einstein, Albert; Kafka, Franz; la Fontaine, Jean de
- demote-non-dropping-particle="sort-only"
  Citation (form="short"): la Fontaine
  Bibliography (name-as-sort-order="first"): Einstein, Albert; la Fontaine, Jean de; Kafka, Franz
- demote-non-dropping-particle="display-and-sort"
  Citation (form="short"): la Fontaine
  Bibliography (name-as-sort-order="first"): Einstein, Albert; Fontaine, Jean de la; Kafka, Franz

Data entered: [Fontaine] [Jean de la] (reverting to default, so…)
- demote-non-dropping-particle="never"
  Citation (form="short"): de la Fontaine
  Bibliography (name-as-sort-order="first"): de la Fontaine, Jean; Einstein, Albert; Kafka, Franz
- demote-non-dropping-particle="sort-only"
  Citation (form="short"): de la Fontaine
  Bibliography (name-as-sort-order="first"): Einstein, Albert; de la Fontaine, Jean; Kafka, Franz
- demote-non-dropping-particle="display-and-sort"
  Citation (form="short"): de la Fontaine
  Bibliography (name-as-sort-order="first"): Einstein, Albert; Fontaine, Jean de la; Kafka, Franz
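The effect of the three demote-non-dropping-particle settings on already-parsed names can be sketched as follows. This is a simplified model for illustration, not citeproc-js code: the names are assumed to be split into parts beforehand, the helper names (shortForm, invertedForm, sortKey, bibliography) are invented, and sorting is a plain lower-case string comparison.

```javascript
// Names are given already split into parts; the split itself is assumed.
const names = [
  { family: "Fontaine", given: "Jean", nonDropping: "de la", dropping: "" },
  { family: "Einstein", given: "Albert", nonDropping: "", dropping: "" },
  { family: "Kafka", given: "Franz", nonDropping: "", dropping: "" },
];

// Join the non-empty parts with single spaces.
const join = (...parts) => parts.filter(Boolean).join(" ");

// Short citation form: the non-dropping particle stays with the family name.
function shortForm(n) {
  return join(n.nonDropping, n.family);
}

// Inverted ("family, given") form, as used with name-as-sort-order.
function invertedForm(n, demote) {
  if (demote === "display-and-sort") {
    // The non-dropping particle is demoted to the end of the name.
    return `${n.family}, ${join(n.given, n.dropping, n.nonDropping)}`;
  }
  return `${join(n.nonDropping, n.family)}, ${join(n.given, n.dropping)}`;
}

// Sort key: the particle is part of the key only when demote is "never".
function sortKey(n, demote) {
  const head = demote === "never" ? join(n.nonDropping, n.family) : n.family;
  return join(head, n.given).toLowerCase();
}

// Sort a copy of the list by sort key, then render each name inverted.
function bibliography(list, demote) {
  return [...list]
    .sort((a, b) => {
      const ka = sortKey(a, demote);
      const kb = sortKey(b, demote);
      return ka < kb ? -1 : ka > kb ? 1 : 0;
    })
    .map((n) => invertedForm(n, demote));
}

for (const demote of ["never", "sort-only", "display-and-sort"]) {
  console.log(demote, shortForm(names[0]), bibliography(names, demote));
}
```

For the first data entry above, this model reproduces the three bibliography orderings listed: the name files under "d" with "never", and under "F" with the other two settings.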
I cannot imagine a way for a translator to parse these examples in a way that can indicate they are the same author.
It's well off the topic of this thread, but a friend demonstrated to me (five years ago) how reputation chains could be used to find probable identity matches in a large data pool, without reliance on name parsing. Maybe someone might pick that up as a project when Zotero makes anonymized data available…
This may be a rather ignorant question, but are there any cases where an uppercased family name element is actually a particle? (e.g. we just concluded that "La" in "Jean de La Fontaine" is never demoted, so we don't need to treat it as a particle, right? I assume the same goes for the Americanized "Van")
On the contrary, it’s an excellent question indeed. I’m not aware of any upper-cased family name elements that are actually particles either (with the possible exception of Arabic “Al-”, see below).
That was the first thing that came to my mind as well, but it's a bit harder relying on proper name capitalization in Zotero, since the metadata we retrieve is very often incorrectly capitalized. It may still be ok and we would be shifting more burden on the users to make sure that particles are in lower case.
@fbennett, since "La" in "La Fontaine" is apparently never demoted, the correct entry should be [La Fontaine] [Jean de], where "La" is not considered a particle and "de" is a dropping particle. This should produce (with either demote-non-dropping-particle="sort-only" or "display-and-sort"):
Citation (form="short"): La Fontaine
Bibliography (name-as-sort-order="first"): Einstein, Albert; Kafka, Franz; La Fontaine, Jean de
(if you wanted to demonstrate the behavior of dropping and non-dropping particles, we probably should use different names)
It may still be ok and we would be shifting more burden on the users to make sure that particles are in lower case.
While retrieved metadata is often inconsistent when it comes to particle capitalization, making particle-identification case-sensitive might make things behave more understandably for the average user. Even if it requires more curation by the user. Seems certainly more logical and discoverable than having to put names in quotes.
if you wanted to demonstrate the behavior of dropping and non-dropping particles, we probably should use different names
I was just illustrating the effect of the code change, in response to nickbart's comment. Happy to revise with another name, but the selection isn't really my department. Candidates?
I'm not sure there are good names that contain both a dropping and a non-dropping particle. But "Vincent van Gogh" (non-dropping) and "Alexander von Humboldt" (dropping) seem safe choices.
While retrieved metadata is often inconsistent when it comes to particle capitalization, making particle-identification case-sensitive might make things behave more understandably for the average user. Even if it requires more curation by the user.
I totally agree.
Seems certainly more logical and discoverable than having to put names in quotes.
However, even with field- and case-sensitive particle identification there are still a few names that need to be wrapped in quotes to be parsed correctly:
A French “Paul de Man” (de = dropping particle) is entered as [Man] [Paul de];
a Dutch (de = non-dropping particle) as [de Man] [Paul];
but for an American “Paul de Man” (CMS 8.5, de = part of family name), the family name will still have to be wrapped in quotes: ["de Man"] / [Paul] in order to be parsed correctly.
This might not be pretty, or overly obvious, but it has been working well in citeproc-js for a long time (see http://gsl-nagoya-u.net/http/pub/citeproc-doc.html#particles-as-part-of-the-last-name). Short of introducing five-part name fields in the UI, I don’t have any better ideas, except maybe allowing the use of non-breaking spaces for protecting multi-part family names from being parsed (prettier, but even less obvious).
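The three "Paul de Man" entries, and the non-breaking-space idea, can be sketched like this. Everything here is an assumption made for illustration: readFamilyField is a made-up helper, the quote handling only mirrors the convention described in this thread, and the non-breaking-space protection is merely the idea floated above, not an existing feature.

```javascript
// Assumptions (not citeproc-js code):
//  - a family field wrapped in one pair of double quotes is taken verbatim;
//  - otherwise leading lower-case words become a non-dropping particle;
//  - words joined by U+00A0 (non-breaking space) count as a single word.
function readFamilyField(field) {
  const trimmed = field.trim();
  const quoted = /^"(.+)"$/.exec(trimmed);
  if (quoted) {
    // Protected family name: strip the quotes and skip particle parsing.
    return { family: quoted[1], nonDropping: "", parsed: false };
  }
  // Split on ordinary spaces only: in JavaScript, \s also matches U+00A0,
  // which would defeat the non-breaking-space protection.
  const words = trimmed.split(/ +/);
  const particle = [];
  while (words.length > 1 && /^\p{Ll}/u.test(words[0])) {
    particle.push(words.shift());
  }
  return {
    family: words.join(" "),
    nonDropping: particle.join(" "),
    parsed: true,
  };
}

const french = readFamilyField("Man"); // given field would be "Paul de"
const dutch = readFamilyField("de Man");
const american = readFamilyField('"de Man"');
const nbsp = readFamilyField("de\u00A0Man"); // hypothetical alternative
console.log(french, dutch, american, nbsp);
```

One detail worth noting if the non-breaking-space route were ever taken: a tokenizer that splits on \s would treat the protected space like any other, so it would have to split on the ordinary space character explicitly.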
“Isa de Jong-van der Zijl”: is “de” the non-dropping particle here, “de Jong-van der Zijl” the in-text (short) form, and “Jong-van der Zijl, Isa de” the correct form for the bibliography (assuming demote-non-dropping-particle="display-and-sort")? If so, this shouldn’t be a problem.
Belgian names: an uppercased family name element followed by a lowercased one, e.g. in “Patsy Van der Meeren” – shouldn’t be a problem if we identify non-dropping particles as lower-case strings at the beginning of a family field; lower-case strings within the field should not count. Similar examples: Ludwig Mies van der Rohe (CMS 8.6), Philippe du Puy de Clinchamps (8.7).
I came across an example in my own work of a name that is not being parsed properly by the current version of the Propachi plugin:
family: D'Mello
given: Susan
She is an American author, and the D' is never treated as a particle.
Would this be another case where parsing based on capitalization would be beneficial? Otherwise, I'm not sure that the current rules for de and d' always being treated as dropping work, as any Americanized versions of the name would place them in the family field and expect them not to drop.
Edit: Also, with the current version of the Propachi plugin, the suppress-parsing workaround using double quotes in the family name field is not working.
Entering family: "D'Mello" given: Susan is still resulting in a citation with "Mello".
Edit again:
Okay, so the syntax that forces names not to be parsed is to place two sets of double quotes around the name: family: ""D'Mello"" given: Susan
This isn't what I expected after reading the citeproc-js manual, where the wording is simply "double quotes":
The particles preceding some names should be treated as part of the last name, depending on the cultural heritage and personal preferences of the individual. To suppress parsing and treat such particles as part of the family name field, enclose the family name field content in double-quotes:
and the displayed syntax is:
{ "author" : [
    { "family" : "\"van der Vlist\"",
      "given" : "Eric",
      "parse-names" : "true"
    }
  ]
}
Would this be another case where parsing based on capitalization would be beneficial?
Absolutely.
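As a small illustration of how capitalization could settle the D'Mello case, here is a sketch that splits an elided element such as "d'" off the front of a family field only when it is written in lower case. The helper name splitElidedParticle is hypothetical, "d'Alembert" is just an illustrative input that does not come from this thread, and whether the split-off element is then dropping or non-dropping is left open.

```javascript
// Hypothetical helper: split an elided particle such as "d'" off the front
// of a family field, but only when it is written in lower case.
// Both the straight (') and the typographic (’) apostrophe are accepted.
function splitElidedParticle(familyField) {
  const trimmed = familyField.trim();
  const match = /^(\p{Ll}+['’])(.+)$/u.exec(trimmed);
  if (!match) {
    // Capitalized (or absent) prefix: the field is left untouched.
    return { particle: "", family: trimmed };
  }
  return { particle: match[1], family: match[2] };
}

console.log(splitElidedParticle("D'Mello"));
console.log(splitElidedParticle("d'Alembert"));
```

Under this rule the capitalized "D'" stays attached to the family name, so no quoting is needed for "D'Mello".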
… the suppress-parsing workaround …
Strictly speaking, I wouldn’t call it a “workaround” – for parsing two-field names with ambiguous particles, it is an essential feature. We’d need it less often with case-sensitive parsing, though.
two pairs of double quotes
A single pair of double quotes works nicely for me (Zotero 4.0.27.5, Propachi 1.1.13, LibreOffice).
Sounds like there is consensus on omitting capitalized particles from the parse.
I agree.
@Frank: I've seen that you removed "abbé d'" from the list. I agree that's not a particle at all, but you added it after this request (I don't need it personally, but is it still working?)
For Belgian names, it's not uncommon to have an uppercased family name element followed by a lowercased one, e.g. in "Patsy Van der Meeren". And in Dutch, a wife would traditionally incorporate her husband's name into her last name, resulting in names like "Isa de Jong-van der Zijl" (see https://onzetaal.nl/taaladvies/advies/hoofdletters-in-namen-patsy-van-der-meeren-patsy-van-der-meeren and https://onzetaal.nl/taaladvies/advies/hoofdletters-in-namen-nynke-van-der-sluis-nynke-van-der-sluis for both cases).
“Jong-van der Zijl, Isa de” looks natural to me. Probably right. Agreed on the Belgian names.
Perhaps the citeproc-js documentation on suppressing name parsing with double quotes could be clarified?
If it's okay with everyone here, I'll contact other projects relying on citeproc-js separately, then make the change.
Ah, the auto-updating of my Propachi plugin installation was not working. It's working for me now. Thanks.
It's still a static list, but I've updated nickbart's list based on the latest changes made by Frank to util_name_particles.js.
=>It can be read by a human :-) and it's here: http://gist.github.com/gracile-fr/ed650f9ca31faf8f671c