Any idea why an "A" author comes last in the bibliography?

fbennett · July 26, 2015

To respond to a couple of fault reports, I have taken another dive into the parsing code. I've made two changes.

First, I've made parsing case-insensitive throughout. This is not meant to oppose nickbart's proposal to make parsing case-sensitive; it's only meant to produce consistent behaviour in the existing model. Previously, particles were identified with case-insensitive regexps, and then classified with a case-sensitive string comparison. In theory, this could be done, but the way the parser is coded currently, a given case-normalized particle set can only be defined once (the last case-normalized line in PARTICLES wins). If CSL opts for case-sensitive matching, adjustments can be made to the parser, but with this change users will at least get predictable results.
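To make the "last case-normalized line wins" point concrete, here is a minimal sketch. It is not the citeproc-js code: the PARTICLES entries and the helper names buildTable and classify are invented for illustration, and it only assumes that definitions are stored under a lower-cased key.

```javascript
// Hypothetical particle definitions: [particle string, classification].
// These entries are illustrative and are not copied from citeproc-js.
const PARTICLES = [
  ["van der", "non-dropping"],
  ["Van der", "dropping"], // differs from the line above only by case
  ["von", "dropping"],
];

// Build a lookup keyed on the case-normalized particle. "van der" and
// "Van der" normalize to the same key, so the later line overwrites the
// earlier one: the last case-normalized definition wins.
function buildTable(particles) {
  const table = new Map();
  for (const [particle, kind] of particles) {
    table.set(particle.toLowerCase(), kind);
  }
  return table;
}

// Case-insensitive classification: the input is normalized like the keys.
function classify(table, particle) {
  return table.get(particle.toLowerCase()) || null;
}

const table = buildTable(PARTICLES);
console.log(classify(table, "van der")); // same answer as for "Van der"
console.log(classify(table, "Van der"));
```

With this kind of table, two lines that differ only in case cannot carry different classifications, which is why identification and classification need to agree on case handling to give predictable results.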

Second, specifically to address one of the fault reports, I have set up the particle sets "de las" and "de la" with a dropping/non-dropping option, defaulting to all-non-dropping.

nickbart · July 26, 2015

@fbennett:

Where there are multiple entries, the classification of the particle depends on its location (front of the family field, end of the given field). I don't actually know if that is ever desirable, but that's what it does.

That’s highly desirable since the convention of entering non-dropping particles in the family field and dropping particles in the given field is the only way to distinguish particles that could be either, for example “de” or “van”.

(This would need an extension to the categories. We currently have "dropping" and "non-dropping". To that we might want to add "part-of-name" and possibly [but not certainly] "never-dropping" [for Arabic name particles]).

I really don’t think so. I am not aware of any style guide or other authority that would call for listing “Jean de La Fontaine” under “F” and “Van Rompuy” under “R”, so unless someone presents evidence to the contrary, I am pretty sure there’s nothing wrong with treating “La Fontaine” or “Van Rompuy”, or indeed all multi-part family names listed in CMS 8.5 (D’Amato, de la Mare, de Man, De Quincey, du Maurier, La Follette, Le Carré, L’Enfant, Ten Broeck, van Gulik*, Van Rensselaer, von Braun*) or 16.71 (da Cunha, D’Amato, de Gaulle, di Leonardo, La Fontaine, Van Rensselaer) as, well, multi-part family names, or family names that happen to contain spaces; no particles (in the sense of the CSL specs) involved here at all.

* If we assume CMS interprets these as Americanised name forms – but their inclusion in 8.5 seems to suggest this.

nickbart · July 26, 2015

@Rintze:

This may be a rather ignorant question, but are there any cases where an uppercased family name element is actually a particle? (e.g. we just concluded that "La" in "Jean de La Fontaine" is never demoted, so we don't need to treat it as a particle, right? I assume the same goes for the Americanized "Van")

On the contrary, it’s an excellent question indeed. I’m not aware of any upper-cased family name elements that are actually particles either (with the possible exception of Arabic “Al-”, see below).

In particular, the only genuine non-dropping particles I’m aware of now are either from the Dutch (could someone confirm these are all lower-case, without exception?) or the Arabic (where the vast majority of “al-” and friends are lower-case; to err on the safe side, I previously suggested including the upper-case variants, but I’d have to check whether that’s really ever required).

I actually start asking myself whether it wouldn’t be much easier to build name parsing around one simple rule (or convention): lower-case strings at the front of the family field are parsed as non-dropping particles, and lower-case strings at the end of the given field are parsed as dropping particles.

This would still, just like in the current setup, require protecting some family names by wrapping them in braces (examples from CMS: de la Mare, de Man, du Maurier, da Cunha, de Gaulle, di Leonardo).

Still, I’d expect this to be the least confusing setup for most users while we wait for a five-part name field in the Zotero UI to appear.
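The one-rule convention proposed above could be sketched roughly as follows. This is only an illustration of the idea, not how citeproc-js works: parseName and isLowerCaseWord are hypothetical helpers, words are split on whitespace only (so elided forms such as d' attached to a name are not handled), and protection of family names is left out.

```javascript
// A word counts as lower-case when it contains cased letters and none of
// them is upper-case ("de" yes, "La" no, "1" no).
function isLowerCaseWord(word) {
  return word === word.toLowerCase() && word !== word.toUpperCase();
}

// Hypothetical helper: split the two UI fields into CSL-style name parts.
function parseName(familyField, givenField) {
  const family = familyField.trim().split(/\s+/);
  const given = givenField.trim().split(/\s+/);

  // Leading lower-case words of the family field -> non-dropping particle.
  // At least one word is always kept as the family name itself.
  const nonDropping = [];
  while (family.length > 1 && isLowerCaseWord(family[0])) {
    nonDropping.push(family.shift());
  }

  // Trailing lower-case words of the given field -> dropping particle.
  const dropping = [];
  while (given.length > 1 && isLowerCaseWord(given[given.length - 1])) {
    dropping.unshift(given.pop());
  }

  return {
    family: family.join(" "),
    given: given.join(" "),
    "non-dropping-particle": nonDropping.join(" "),
    "dropping-particle": dropping.join(" "),
  };
}

// "La" is upper-case, so it stays in the family name; "de" drops.
console.log(parseName("La Fontaine", "Jean de"));
// An unprotected "de la Mare" is split (given name added for illustration),
// which is why such family names would still need protecting.
console.log(parseName("de la Mare", "Walter"));
```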

nickbart · July 26, 2015

Second, specifically to address one of the fault reports, I have set up the particle sets "de las" and "de la" with a dropping/non-dropping option, defaulting to all-non-dropping.

But wouldn’t this give you “De la Fontaine, Jean” now?

fbennett · July 26, 2015

As noted above, I've coded the default of "de la" to be all-non-dropping; but the "de" can be made dropping, as shown below.

As Rintze points out below, this example may be entirely wrong, insofar as real-world use goes, since in this name "La" should apparently be capitalized and treated as part of the surname. So the treatment of "de la" below may be short-lived, but it at least serves to illustrate the operation of the parser.

I will continue to tweak the particle settings in response to specific user complaints. Meanwhile, if consensus emerges about particular changes, pull requests will be welcome.

Data entered: [de la Fontaine] [Jean]

demote-non-dropping-particle="never":
Citation (form="short"): de la Fontaine
Bibliography (name-as-sort-order="first"):
de la Fontaine, Jean
Einstein, Albert
Kafka, Franz

demote-non-dropping-particle="sort-only":
Citation (form="short"): de la Fontaine
Bibliography (name-as-sort-order="first"):
Einstein, Albert
de la Fontaine, Jean
Kafka, Franz

demote-non-dropping-particle="display-and-sort":
Citation (form="short"): de la Fontaine
Bibliography (name-as-sort-order="first"):
Einstein, Albert
Fontaine, Jean de la
Kafka, Franz

Data entered: [la Fontaine] [Jean de]

demote-non-dropping-particle="never":
Citation (form="short"): la Fontaine
Bibliography (name-as-sort-order="first"):
Einstein, Albert
Kafka, Franz
la Fontaine, Jean de

demote-non-dropping-particle="sort-only":
Citation (form="short"): la Fontaine
Bibliography (name-as-sort-order="first"):
Einstein, Albert
la Fontaine, Jean de
Kafka, Franz

demote-non-dropping-particle="display-and-sort":
Citation (form="short"): la Fontaine
Bibliography (name-as-sort-order="first"):
Einstein, Albert
Fontaine, Jean de la
Kafka, Franz

Data entered: [Fontaine] [Jean de la] (reverting to default, so…)

demote-non-dropping-particle="never":
Citation (form="short"): de la Fontaine
Bibliography (name-as-sort-order="first"):
de la Fontaine, Jean
Einstein, Albert
Kafka, Franz

demote-non-dropping-particle="sort-only":
Citation (form="short"): de la Fontaine
Bibliography (name-as-sort-order="first"):
Einstein, Albert
de la Fontaine, Jean
Kafka, Franz

demote-non-dropping-particle="display-and-sort":
Citation (form="short"): de la Fontaine
Bibliography (name-as-sort-order="first"):
Einstein, Albert
Fontaine, Jean de la
Kafka, Franz
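For readers who want to see how the three demote-non-dropping-particle settings produce the orderings above, here is a rough sketch that works on an already-parsed name. It is an illustration under simplifying assumptions, not the citeproc-js implementation: formatInverted, sortKey and bibliography are hypothetical helpers, and sorting is a plain lower-cased string comparison.

```javascript
// Join the parts that are present with single spaces.
function join(...parts) {
  return parts.filter(Boolean).join(" ");
}

// Display form for name-as-sort-order="first" (family name first).
function formatInverted(name, demote) {
  const ndp = name["non-dropping-particle"];
  const dp = name["dropping-particle"];
  if (demote === "display-and-sort") {
    // The non-dropping particle is moved behind the given name.
    return `${name.family}, ${join(name.given, dp, ndp)}`;
  }
  // "never" and "sort-only" keep the non-dropping particle in front.
  return `${join(ndp, name.family)}, ${join(name.given, dp)}`;
}

// Only "never" lets the non-dropping particle take part in sorting.
function sortKey(name, demote) {
  const lead = demote === "never" ? name["non-dropping-particle"] : "";
  return join(lead, name.family, name.given).toLowerCase();
}

function bibliography(names, demote) {
  return [...names]
    .sort((a, b) => {
      const ka = sortKey(a, demote);
      const kb = sortKey(b, demote);
      return ka < kb ? -1 : ka > kb ? 1 : 0;
    })
    .map((n) => formatInverted(n, demote));
}

// The second data set above: [la Fontaine] [Jean de], after parsing.
const names = [
  { family: "Einstein", given: "Albert" },
  { family: "Kafka", given: "Franz" },
  { family: "Fontaine", given: "Jean",
    "dropping-particle": "de", "non-dropping-particle": "la" },
];

for (const demote of ["never", "sort-only", "display-and-sort"]) {
  console.log(demote, bibliography(names, demote));
}
```

The split between sortKey and formatInverted mirrors the difference between "sort-only" (particle ignored for sorting but still displayed in front) and "display-and-sort" (particle demoted in both).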

fbennett · July 26, 2015

I cannot imagine a way for a translator to parse these examples in a way that can indicate they are the same author.

It's well off the topic of this thread, but a friend demonstrated to me (five years ago) how reputation chains could be used to find probable identity matches in a large data pool, without reliance on name parsing. Maybe someone might pick that up as a project when Zotero makes anonymized data available…

aurimas · July 26, 2015

nickbart wrote:

@Rintze:
This may be a rather ignorant question, but are there any cases where an uppercased family name element is actually a particle? (e.g. we just concluded that "La" in "Jean de La Fontaine" is never demoted, so we don't need to treat it as a particle, right? I assume the same goes for the Americanized "Van")
On the contrary, it’s an excellent question indeed. I’m not aware of any upper-cased family name elements that are actually particles either (with the possible exception of Arabic “Al-”, see below).

That was the first thing that came to my mind as well, but it's a bit harder relying on proper name capitalization in Zotero, since the metadata we retrieve is very often incorrectly capitalized. It may still be ok and we would be shifting more burden on the users to make sure that particles are in lower case.

Rintze · July 26, 2015

@fbennett, since "La" in "La Fontaine" is apparently never demoted, the correct entry should be
[La Fontaine] [Jean de]
where "La" is not considered a particle and "de" is a dropping particle, and which should produce (with either demote-non-dropping-particle="sort-only" or "display-and-sort"):

Citation (form="short"):
La Fontaine

Bibliography (name-as-sort-order="first"):

Einstein, Albert
Kafka, Franz
La Fontaine, Jean de

(if you wanted to demonstrate the behavior of dropping and non-dropping particles, we probably should use different names)

Rintze · July 26, 2015

It may still be ok and we would be shifting more burden on the users to make sure that particles are in lower case.

While retrieved metadata is often inconsistent when it comes to particle capitalization, making particle-identification case-sensitive might make things behave more understandably for the average user. Even if it requires more curation by the user. Seems certainly more logical and discoverable than having to put names in quotes.

fbennett · July 26, 2015

if you wanted to demonstrate the behavior of dropping and non-dropping particles, we probably should use different names

I was just illustrating the effect of the code change, in response to nickbart's comment. Happy to revise with another name, but the selection isn't really my department. Candidates?

Rintze · July 26, 2015

I'm not sure there are good names that contain both a dropping and a non-dropping particle. But "Vincent van Gogh" (non-dropping) and "Alexander von Humboldt" (dropping) seem safe choices.

fbennett · July 26, 2015

Okie-doke. I'll just put a note above to flag the fact that the sample only demonstrates how the parser works.

nickbart · July 27, 2015

While retrieved metadata is often inconsistent when it comes to particle capitalization, making particle-identification case-sensitive might make things behave more understandably for the average user. Even if it requires more curation by the user.

I totally agree.

Seems certainly more logical and discoverable than having to put names in quotes.

However, even with field- and case-sensitive particle identification there are still a few names that need to be wrapped in quotes to be parsed correctly:

A French “Paul de Man” (de = dropping particle) is entered as [Man] [Paul de];
a Dutch (de = non-dropping particle) as [de Man] [Paul];
but for an American “Paul de Man” (CMS 8.5, de = part of family name), the family name will still have to be wrapped in quotes: ["de Man"] / [Paul] in order to be parsed correctly.

This might not be pretty, or overly obvious, but it has been working well in citeproc-js for a long time (see http://gsl-nagoya-u.net/http/pub/citeproc-doc.html#particles-as-part-of-the-last-name). Short of introducing five-part name fields in the UI, I don’t have any better ideas, except maybe allowing the use of non-breaking spaces for protecting multi-part family names from being parsed (prettier, but even less obvious).
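The three "Paul de Man" entries can be told apart with a check along these lines. This is a sketch of the convention only, assuming field- and case-sensitive particle identification; parseFamilyField is a hypothetical helper, not the citeproc-js function.

```javascript
// Hypothetical helper: interpret the content of the family field.
function parseFamilyField(familyField) {
  const text = familyField.trim();

  // A family name enclosed in double quotes is taken verbatim:
  // particle parsing is suppressed.
  const quoted = text.match(/^"(.*)"$/);
  if (quoted) {
    return { family: quoted[1], "non-dropping-particle": "" };
  }

  // Otherwise, leading all-lower-case words are read as a non-dropping
  // particle; the final word is always kept as the family name.
  const m = text.match(/^((?:\p{Ll}+\s+)*)(.+)$/u);
  if (!m) {
    return { family: text, "non-dropping-particle": "" };
  }
  return { family: m[2], "non-dropping-particle": m[1].trim() };
}

console.log(parseFamilyField("Man"));        // French: "de" sits in the given field
console.log(parseFamilyField("de Man"));     // Dutch: "de" is a non-dropping particle
console.log(parseFamilyField('"de Man"'));   // American: "de Man" is the family name
```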

Rintze · July 27, 2015

@nickbart, yeah, but hopefully there wouldn't be too many cases like that where quotes would be required.

Rintze · July 27, 2015

I found a list of 333 (!) particles at http://www.vernoeming.nl/alle-333-voorvoegsels-tussenvoegsels-in-nederlandse-achternamen as compiled by the Dutch civil registry. (there is also a database of 320,000 Dutch family names at http://www.meertens.knaw.nl/nfb/ )

Also, e.g. for Belgian names it's not uncommon to have an uppercased family name element followed by a lowercased one, e.g. in "Patsy Van der Meeren". And in Dutch, traditionally the wife would incorporate her husband's name into her last name, resulting in names like "Isa de Jong-van der Zijl". (see https://onzetaal.nl/taaladvies/advies/hoofdletters-in-namen-patsy-van-der-meeren-patsy-van-der-meeren and https://onzetaal.nl/taaladvies/advies/hoofdletters-in-namen-nynke-van-der-sluis-nynke-van-der-sluis for both cases)

nickbart · July 27, 2015

Dutch names: the list at http://www.vernoeming.nl/alle-333-voorvoegsels-tussenvoegsels-in-nederlandse-achternamen contains upper- and lower-case particles – so can Dutch non-dropping particles be upper-case after all, or are these just the capitalised forms used in the absence of a preceding given name or initial?

“Isa de Jong-van der Zijl”: is “de” the non-dropping particle here, “de Jong-van der Zijl” the in-text (short) form, and “Jong-van der Zijl, Isa de” the correct form for the bibliography (assuming demote-non-dropping-particle="display-and-sort")? If so, this shouldn’t be a problem.

Belgian names: an uppercased family name element followed by a lowercased one, e.g. in “Patsy Van der Meeren” – shouldn’t be a problem if we identify non-dropping particles as lower-case strings at the beginning of a family field; lower-case strings within the field should not count. Similar examples: Ludwig Mies van der Rohe (CMS 8.6), Philippe du Puy de Clinchamps (8.7).

Rintze · July 27, 2015

so can Dutch non-dropping particles be upper-case after all

I don't think so. Note that the list also seems to include particles from other languages (e.g. from people who immigrated to the Netherlands).

“Jong-van der Zijl, Isa de” looks natural to me. Probably right. Agreed on the Belgian names.

bwiernik · July 28, 2015

I came across an example in my own work of a name that is not being parsed properly by the current version of the Propachi plugin:

family: D'Mello
given: Susan

She is an American author, and the D' is never treated as a particle.

Would this be another case where parsing based on capitalization would be beneficial? Otherwise, I'm not sure that the current rules for de and d' always being treated as dropping work, as any Americanized versions of the name would place them in the family field and expect them not to drop.

Edit: Also, with the current version of the Propachi plugin, the suppress-parsing workaround using double quotes in the family name field is not working.

Entering

family: "D'Mello"
given: Susan

is still resulting in a citation with "Mello".

Edit again:

Okay, so the syntax that forces names to not be parsed is to place two sets of double quotes around the name:

family: ""D'Mello""
given: Susan

This isn't what I expected after reading the citeproc-js manual, where the wording is simply "double quotes":

The particles preceding some names should be treated as part of the last name, depending on the cultural heritage and personal preferences of the individual. To suppress parsing and treat such particles as part of the family name field, enclose the family name field content in double-quotes:

and the displayed syntax is:

{ "author" : [
    { "family" : "\"van der Vlist\"",
      "given" : "Eric",
      "parse-names" : "true"
    }
  ]
}

Perhaps the documentation could be clarified?

nickbart · July 28, 2015

Would this be another case where parsing based on capitalization would be beneficial?

Absolutely.

… the suppress-parsing workaround …

Strictly speaking, I wouldn’t call it a “workaround” – for parsing two-field names with ambiguous particles, it is an essential feature. We’d need it less often with case-sensitive parsing, though.

two pairs of double quotes

A single pair of double quotes works nicely for me (Zotero 4.0.27.5, Propachi 1.1.13, LibreOffice).

fbennett · July 28, 2015

Sounds like there is consensus on omitting capitalized particles from the parse. Sebastian, Rintze, Aurimas?

If it's okay with everyone here, I'll contact other projects relying on citeproc-js separately, then make the change.

aurimas · July 28, 2015

Fine by me

Rintze · July 28, 2015

If it's okay with everyone here, I'll contact other projects relying on citeproc-js separately, then make the change.

We should float it by xbiblio-devel, but yeah, I'm in support.

bwiernik · July 28, 2015

A single pair of double quotes works nicely for me (Zotero 4.0.27.5, Propachi 1.1.13, LibreOffice).

Ah, the auto-updating of my installation of Propachi plugin was not working. Working for me now. Thanks.

Gracile · August 19, 2015

I've followed this silently, but a lot of progress has been made!

It's still a static list, but I've updated nickbart's list based on the latest changes made by Frank to util_name_particles.js.

=>It can be read by a human :-) and it's here: http://gist.github.com/gracile-fr/ed650f9ca31faf8f671c

Sounds like there is consensus on omitting capitalized particles from the parse.

I agree.

@Frank: I've seen that you removed "abbé d'" from the list. I agree that's not a particle at all, but you added it after this request (I don't need it personally, but is it still working?)