Parsing problem on Italian names

NMonteix · September 6, 2015

Hi to all
After the problem in Dutch, it seems that the new parsing module generated problems with Italian names:

« d' » is dropping if found in the first-name field ; non-dropping if found in the last-name field.
« da » is always non-dropping.
« de » is dropping if found in the first-name field ; non-dropping if found in the last-name field.
« de' » is always dropping.
« degli » is always dropping.
« dei » is always dropping.
« della » is always dropping.
« dello » is always dropping.

In fact all of them are always non-dropping particles in Italian.
[in Gracile's list dell' is lacking; dall' is nicely non dropping]
How could it be fixed?
Thanks in advance

fbennett · September 6, 2015

Thanks for reporting.

Is it right that dell' and dall' should both be non-dropping also?

fbennett · September 6, 2015

I've made adjustments in the processor. I'll try to make a release of the processor patch plugin soon, but the GitHub API that I use for releases does not seem to be functioning well at the moment.

As a side-note, the processor patch plugin will need to be signed by Mozilla for every release from September 22nd. I'll try to get that in place, but it will slow down response times on processor issues.

NMonteix · September 7, 2015

Indeed,
Dell' and Dall' are always non dropping
Thanks

nickbart · September 7, 2015

In fact all of them are always non-dropping particles in Italian.

Not quite, in fact none of them are (they are either fixed parts of the family name or dropping particles): CMS, 16e, 8.9 “Italian names”:

Particles in Italian names are most often uppercased and retained when the last name is used alone.
Gabriele D’Annunzio; D’Annunzio
Lorenzo Da Ponte; Da Ponte
Luca Della Robbia; Della Robbia

In many older aristocratic names, however, the particle is traditionally lowercased and dropped when the last name is used alone.
Beatrice d’Este; Este
Lorenzo de’ Medici; Medici

Since the first group always needs to be alphabetised under “D” (CMS 16.71: “D’Amato, Alfonse”), none of these are particles in the CSL sense; for CSL purposes they are fixed parts of the family name.

“d’” and “de’” on the other hand are dropping particles. (Note that a space needs to be inserted after “de’” when rendering; Zotero currently does not do this.)

Note that these examples, too, confirm the rule that only lowercase strings are CSL particles.
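
To make the distinction above concrete, here is a minimal sketch of how the CMS examples might be stored and rendered. The field names (family, given, dropping-particle, non-dropping-particle) are the CSL-JSON ones; the helpers shortForm and longForm are hypothetical illustrations, not citeproc-js code, and the spacing rule for apostrophes is an assumption covering only the cases mentioned in this thread.

```javascript
// Hypothetical illustration; only the field names come from CSL-JSON.
const names = [
  // Fixed parts of the family name: sorted under "D", never detached.
  { given: "Gabriele", family: "D’Annunzio" },
  { given: "Lorenzo", family: "Da Ponte" },
  // Lowercase dropping particles: omitted when the family name stands alone.
  { given: "Lorenzo", "dropping-particle": "de’", family: "Medici" },
  { given: "Beatrice", "dropping-particle": "d’", family: "Este" },
  // A non-dropping particle, for contrast (Dutch example from this thread).
  { given: "Wikus", "non-dropping-particle": "van der", family: "Merwe" },
];

// Name used alone: non-dropping particle (if any) plus family name.
function shortForm(name) {
  return [name["non-dropping-particle"], name.family].filter(Boolean).join(" ");
}

// Full name in display order. Assumption: a particle ending in an
// apostrophe attaches directly to the next word, except "de’",
// which is followed by a space.
function longForm(name) {
  const parts = [
    name.given,
    name["dropping-particle"],
    name["non-dropping-particle"],
    name.family,
  ].filter(Boolean);
  let out = "";
  for (const part of parts) {
    const prevElided = /[’']$/.test(out) && !/(^|\s)de[’']$/.test(out);
    out += (out === "" || prevElided ? "" : " ") + part;
  }
  return out;
}

for (const name of names) {
  console.log(`${longForm(name)}; ${shortForm(name)}`);
}
```

With this representation, “D’”, “Da” and “Della” need no particle field at all, which is the point being made: only the lowercase “d’” and “de’” are stored separately.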

fbennett · September 7, 2015

NMonteix?

nickbart · September 7, 2015

@NMonteix: Specifically, the question is, for persons named “D’Annunzio”, “Da Ponte”, or “Della Robbia”, do “Annunzio”, “Ponte”, or “Robbia” ever appear _without_ “D’”, “Da”, or “Della”?

If not, “D’”, “Da”, “Della” are not particles in the CSL sense, but just fixed parts of family names.

Gracile · September 8, 2015

Since the first group always needs to be alphabetised under “D” (CMS 16.71: “D’Amato, Alfonse”), none of these are particles in the CSL sense; for CSL purposes they are fixed parts of the family name.

While I agree with that presentation (which seems to me influenced by BibLaTeX), the CSL concept of "particle" is not as clear in the CSL specs.

Progress has been made since they were written. To avoid more confusion, the examples chosen should be adjusted to remove the "particles" which are *always* fixed parts of the family name ("la"/"La" in "Jean de la Fontaine") even if they're particles in the common sense. At some point, the (future) documentation on the Zotero side should make that clear.

(Please correct me if I'm wrong)

fbennett · September 8, 2015

I have some draft code running for the UI that looks promising. It allows the user to toggle through the permutations of a name with particles by pressing Alt-p in a surname field opened for editing. It needs some more work—it currently only recognizes lowercase terms, but it should recognize known-possible particles in uppercase as well, and include capitalization in the cycle. The cycle of transforms for a complex name would be like this:

van der Merwe, Wikus
Van der Merwe, Wikus
Van Der Merwe, Wikus
"van der Merwe", Wikus
der Merwe, Wikus van
Der Merwe, Wikus van
"der Merwe", Wikus van
Merwe, Wikus van der
van der Merwe, Wikus

A tweaked version of the particles parser can be used to identify uppercase particle candidates, and to apply a color-coded highlight to the field to show whether it conforms to common conventions.

Comments welcome.

(Looking at this, it might be better to generate the full range of permutations [including some not listed above] and present them in a [color-coded] menu list called by Alt-p.)
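
The cycle listed above can be generated mechanically. The following is only a sketch of one way to do it, not the draft UI code: the function name particleCycle and its arguments are hypothetical, and it assumes the particles have already been identified and lowercased.

```javascript
// Hypothetical sketch; not the draft UI code described in this post.
const capitalize = (w) => w.charAt(0).toUpperCase() + w.slice(1);

// particles: lowercase particles in order, e.g. ["van", "der"].
// Returns the family/given permutations in the order of the cycle above.
function particleCycle(particles, family, given) {
  const forms = [];
  for (let moved = 0; moved <= particles.length; moved++) {
    // Particles still in the family field, and the resulting given field.
    const kept = particles.slice(moved);
    const givenField = [given, ...particles.slice(0, moved)].join(" ");
    // Capitalize the first 0, 1, ..., kept.length particles in turn.
    for (let caps = 0; caps <= kept.length; caps++) {
      const cased = kept.map((p, i) => (i < caps ? capitalize(p) : p));
      forms.push(`${[...cased, family].join(" ")}, ${givenField}`);
    }
    // Quoted variant: lowercase particles protected as part of the surname.
    if (kept.length > 0) {
      forms.push(`"${[...kept, family].join(" ")}", ${givenField}`);
    }
  }
  return forms;
}

particleCycle(["van", "der"], "Merwe", "Wikus").forEach((f) => console.log(f));
```

For the example name this yields the eight distinct forms of the cycle; the ninth line above is simply the wrap-around to the first. The same list could feed a menu instead of a key-press cycle.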

nickbart · September 8, 2015

@Gracile: “(Please correct me if I'm wrong)” – Not at all, I totally agree.

@fbennett: Ok, now I see a little clearer what you meant by “We'll need support for particle adjustments in the Zotero UI before changing the processor.” Looks interesting though I probably would rarely use it myself since it’s nothing that couldn’t be entered/adjusted manually, but it might help other users, and if you see it as a prerequisite for changing Zotero’s and citeproc-js’s parsing rules, then by all means go ahead.

My bigger concern, as you won’t be surprised to hear, is to get the parsing itself right. I take it that there is some kind of consensus to ultimately adopt the simple rule “unless protected, lowercase words at beginning of family are non-dropping, lowercase words at end of given are dropping” – correct me if I’m wrong here.

Now, depending on how soon this is going to happen, the question is, does it still make sense to fix the flaws and omissions in the current parsing list(s) of both Zotero and citeproc-js?

For citeproc-js, https://bitbucket.org/fbennett/citeproc-js/src/553e934fdbf001ed6f70f47699dbf56151459e84/src/util_name_particles.js?at=default still lists “te”, “ten”, “ter” as dropping though they are clearly Dutch and non-dropping; “mac” quite clearly isn’t a particle at all, and so on and so forth. Also, parsing is still case-insensitive, making, e.g., the “Van/van” distinction impossible. And of course around 150 out of the 166 unique entries from the Dutch “333” list are missing.

As to Zotero, I’m not sure where to look for the source – pointers welcome! Actual behaviour shows that “te” etc. are treated as dropping, and parsing is still case-insensitive, too.

So, again, should we bother with fixing this, or rather adopt the simple rule as soon as possible?
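
For comparison with the list-based approach, the “simple rule” quoted above could look roughly like this. It is a sketch under assumptions, not the citeproc-js or Zotero parser: parseSimple and isLower are hypothetical names, “protected” is taken to mean a family field wrapped in double quotes, and particles attached by an apostrophe or hyphen (as in “d’Este”) are not split off.

```javascript
// Hypothetical sketch of the "simple rule"; not the citeproc-js parser.
// A word counts as lowercase if it has letters and none are uppercase.
const isLower = (w) => w === w.toLowerCase() && w !== w.toUpperCase();

function parseSimple(familyField, givenField) {
  const name = {};
  // Assumption: double quotes protect the whole family field.
  const quoted = /^"(.*)"$/.exec(familyField.trim());
  if (quoted) {
    name.family = quoted[1];
  } else {
    // Lowercase words at the beginning of family are non-dropping.
    const words = familyField.trim().split(/\s+/);
    let n = 0;
    while (n < words.length - 1 && isLower(words[n])) n++;
    if (n > 0) name["non-dropping-particle"] = words.slice(0, n).join(" ");
    name.family = words.slice(n).join(" ");
  }
  // Lowercase words at the end of given are dropping.
  const g = givenField.trim().split(/\s+/);
  let m = g.length;
  while (m > 1 && isLower(g[m - 1])) m--;
  if (m < g.length) name["dropping-particle"] = g.slice(m).join(" ");
  name.given = g.slice(0, m).join(" ");
  return name;
}

console.log(parseSimple("van der Merwe", "Wikus"));
console.log(parseSimple("Medici", "Lorenzo de’"));
console.log(parseSimple('"de Gaulle"', "Charles"));
```

No particle list is consulted anywhere, which is the attraction of the rule: an uppercase “Van” or “Della” simply stays in the family name.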

Gracile · September 8, 2015

Bitbucket is strange, I don't know myself how to create a link to the latest commit of a file but click on the commit name at the top-left or look there directly : https://bitbucket.org/fbennett/citeproc-js/src/12f6abc392c65d93cf5abde4d56538302054b474/src/util_name_particles.js?at=default

As to Zotero, I’m not sure where to look for the source – pointers welcome!

https://github.com/zotero/zotero/pull/806

[edit: thanks Frank]

fbennett · September 8, 2015

For reference, the tip. (Zotero is currently loading the citeproc-js module to pre-parse names, and turning it off in the processor.)

nickbart · September 9, 2015

Zotero is currently loading the citeproc-js module to pre-parse names […]

Thank you, that was essential for understanding what’s going on. I see that in the latest “Propachi: monkey-patch for Zotero CSL processor (standard version), 1.1.16” parsing is case-sensitive, and “te” etc. and a few others have been fixed. Great.

Now, would you accept pull requests for https://bitbucket.org/fbennett/citeproc-js/src/tip/src/util_name_particles.js? If so:

Would you mind if I did the following, apart from adding and fixing stuff?
- sort the list alphabetically by particle again
- add comments like
["van", dropping_alt_non_dropping_1], // Dutch non-dropping, German dropping, also non-particle
- possibly even indicate sources, like

["van", dropping_alt_non_dropping_1], // Dutch non-dropping [1], German dropping, also non-particle
...
// [1]: http://www.vernoeming.nl/alle-333-voorvoegsels-tussenvoegsels-in-nederlandse-achternamen

- comment out the highly dubious “mac”, “pietro”, “saint”, “sainte”, “sen”, “st.”, “ste.” (which wouldn’t usually appear in lower case anyway)

Rintze · September 9, 2015

My bigger concern, as you won’t be surprised to hear, is to get the parsing itself right. I take it that there is some kind of consensus to ultimately adopt the simple rule “unless protected, lowercase words at beginning of family are non-dropping, lowercase words at end of given are dropping” – correct me if I’m wrong here.

I'm with @nickbart, here. In my personal experience, I have to correct the casing of particles (i.e. lowercase them) by hand anyway to achieve proper capitalization in rendered references, so the recognition of uppercased particles via a list-based approach is of little value to me.

Adopting the simple rule also makes a lot of sense to me with regards to discoverability. I think it's much more intuitive to have particle recognition be controlled by case than by the presence or absence of quotes (see e.g. the use case at https://forums.zotero.org/discussion/20926/double-surnames-alphabetical-order-in-bibliography-solved/?Focus=120531#Comment_120531).

fbennett · September 9, 2015

Pull requests are always welcome, of course!

I can fix the sorting tomorrow if it's a headache; I have a script for it that just needs a tweak to the keys.

If there are issues about specific changes, we can discuss in the comments.

I agree that mac etc. should go, and that simplifying the parser, shifting the explicit particle identification and classification into the UI, will be an improvement. I talked with European colleagues about particles this afternoon. Some issues came up, but it's late here, so I'll sign off and pick up tomorrow.

fbennett · September 9, 2015

@Rintze,

The parser in citeproc-js is still using list-based parsing, but we're all agreed concerning capitalization. The adjustments for listed-particle case-sensitivity (to limit recognition to lowercase particles in most cases) were completed in a commit that went in last week. The change is not yet reflected in Zotero, but it's in the pipeline, and will be joined by the changes flagged above.

We are also agreed that list-based parsing should be moved out of the processor. As nickbart has indicated, the double-quotes hack will still be required in that case, to express names such as "de Gaulle," in which the apparent (lowercase) particle is actually a fixed, sortable element of the surname.

The only point on which there seems to be a difference of opinion is the importance of UI support for manipulating particles. I feel it is important for a couple of reasons. First, it will reduce the risk of RSI among users who curate large volumes of data (and some do). Second, a UI based on our distilled knowledge of how name-particles work can help users unfamiliar with the arcana to repair names that come through translation badly.

The heuristics for UI support will be based on the explicit list, which is why I'm keen to get it into shape, so that the switch to simplified parsing and UI support can be cast in one go.

As I wrote in the other thread on this, it would indeed be possible to jump to simplified parsing directly. That's not how I would do things myself, so I won't, but since Zotero now pre-parses names before sending them to the processor, the change can be introduced by Zotero if it's desired.

(Edit: underlined text added, struck-out text removed.)

fbennett · September 10, 2015

I've made those changes now. The following are gone:

"mac"
"pietro"
"saint"
"sainte"
"st."
"ste."

The Dutch particles linked by Rintze have been added, and duplicates resolved.

The list is alphabetized by particle.

nickbart · September 10, 2015

Great. – Just a few things that don't seem to be correct yet:
- duplicates: "'t", "ben" and "bin"
- "van" should be "either" - since "van" in "Ludwig van Beethoven" is dropping
- "de'" should be "either" - since "de'" in "Lorenzo de' Medici" is dropping ("degli", "dei", "del", "dell'", "della", "delle" [to be added], "dello", "di", too; all for older Italian names)
EDIT:
- "either", too: "da", "de li" (both Italian)
- "in der" should be "either": in "Hanns in der Gand" it is dropping.

DWL-SDCA · September 10, 2015

The case-based strategy works only if the publisher-supplied metadata is cased correctly, or if the user recognizes that a particle is cased incorrectly and can reliably edit the entry. This presumes that the user can know the "correct" name form for what is often an unfamiliar author. While there are clear language and nationality conventions for name particles, it is equally clear that authors' preferred form of their names often does not follow the conventional format.

See my comments in:

https://forums.zotero.org/discussion/51491/double-surnames-starting-with-te-end-up-quoted-incorrect/#Item_30

and in:

https://forums.zotero.org/discussion/30974/2/any-idea-why-an-a-author-comes-last-in-the-bibliography/#Item_35

Rintze · September 10, 2015

@fbennett, thanks for the update. For the UI, wouldn't adding a second option to the right-click activated menu be more discoverable?

Also, the Dutch particle list seems to have a lot of foreign particles in it (e.g. all the "Auf*" and "Aus*" ones are purely German). I'm not sure if those should all be treated as potentially non-dropping.

nickbart · September 10, 2015

@Rintze: I’d say all the particles from the Dutch list should be treated as potentially non-dropping – but: the "auf *" (except "auf ter"), "aus *", "unter", "von *" (except "von 't/von t"), "vor *", "zu" ones may just as well be German particles, so they should all be "either".

fbennett · September 10, 2015

Thanks everyone for the careful reading. Here's the commit so far. (I missed adding the definition for either_2_dropping_best etc, that's in a later commit - hope the meaning is clear, though.)

@Rintze, I don't think partnering it with swap-names will work, since it should be accessible from the keyboard, but it should be a menu, for sure.

(Edit: But the field can be opened for editing from a partner to swap-names, of course, and that does make it more discoverable. That suggestion is a keeper.)

aurimas · September 10, 2015

So just to recap and make sure I'm getting all of this right, there are three features that are being considered here:

1. Zotero parsing two-field authors into given name, dropping particle, non-dropping particle, family name.
- The data should be entered in Zotero as "non-dropping particle Family Name", "Given Name dropping particle".
- Given correct data entry, the dropping particle has no significance for CSL or Zotero, so we can ignore parsing that part altogether.
- The non-dropping particle is always lower-cased, can be composed of multiple words, may be joined with family name by punctuation, and always precedes the family name.

2. Allowing users to "easily" edit incorrectly imported names
- Current suggestion is for some right-click, button, or keyboard shortcut solution that cycles through possible permutations
- User must know what he's looking for anyway, so I don't really see why that is easier than just correcting the case manually. Does this feature even need to exist?
- Edit: I guess the issue is in large part for discoverability purposes, which I agree is lacking. I think that maybe we can just italicize the non-dropping particle.

3. Correctly parsing/splitting names on import
- We cannot rely on proper capitalization
- Even with lower-cased particles, we may need to know whether the particle is dropping or non-dropping
- With upper-cased particles, we may be able to guess correctly which particles are particles and whether they are dropping or non-dropping
- Use case first. If all parts of name are in title case, apply heuristic parsing based on most likely dropping/non-dropping particle in the name. I think we need to be 99% confident before converting a title-cased particle to lower case
- Need a list of high-confidence dropping and non-dropping particles.

We can implement the first point quite easily. I don't think we need #2, and #3 should go into ZU.cleanAuthor, but we need a very good list of particles.
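
The "use case first" import heuristic could be sketched along these lines. This is an illustration only, not ZU.cleanAuthor: the function name normalizeImportedFamily is hypothetical, and the three-entry set is a placeholder built from particles called clearly Dutch and non-dropping earlier in this thread, not a vetted high-confidence list.

```javascript
// Hypothetical sketch of the import heuristic; not ZU.cleanAuthor.
// Placeholder list: a real one would need careful vetting.
const HIGH_CONFIDENCE_NON_DROPPING = new Set(["te", "ten", "ter"]);

const startsUpper = (w) => w.charAt(0) !== w.charAt(0).toLowerCase();

function normalizeImportedFamily(familyField) {
  const words = familyField.trim().split(/\s+/);
  // Use case first: if any word is already lowercase-initial,
  // trust the casing supplied by the source and change nothing.
  if (!words.every(startsUpper)) return familyField;
  // Otherwise lowercase leading words only when they are on the list.
  let n = 0;
  while (
    n < words.length - 1 &&
    HIGH_CONFIDENCE_NON_DROPPING.has(words[n].toLowerCase())
  ) {
    n++;
  }
  const lowered = words.slice(0, n).map((w) => w.toLowerCase());
  return [...lowered, ...words.slice(n)].join(" ");
}

console.log(normalizeImportedFamily("Ter Haar"));    // listed particle
console.log(normalizeImportedFamily("Della Corte")); // not listed
console.log(normalizeImportedFamily("Van Beethoven")); // ambiguous, not listed
```

Anything not on the list, including ambiguous entries such as "van", is left exactly as imported, so a wrong guess can only come from the list itself.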

fbennett · September 10, 2015

David,

Yes, where publisher metadata is awry, entries will need to be fixed manually after download, and the user will need domain-specific knowledge when choosing among alternatives - we all agree that the processor can't automate the choices, and that it shouldn't try to do so.

To support manual editing, the aim will be to present a limited set of alternatives in a menu, with the more-likely choices highlighted or preferenced in some way. That way novice users will have some guidance on the possibilities, and much mousing and typing will be saved in aggregate. It's probably the best we can do.

fbennett · September 10, 2015

aurimas, that all sounds right (apart from my feeling that #2 is useful, but at 3-1, I'll call it quits).

(Edit: by "discoverable," Rintze and I were talking about the discoverability of the particle-editing support menu, not of the particles themselves.)

If anyone has further comments on the particles, do post.

DWL-SDCA · September 10, 2015

@fbennett
...to support manual editing, the aim will be to present a limited set of alternatives...novice users will have some guidance...

Now I understand. This is likely to be a boon to everyone -- novice and experienced alike. Thank you for the explanation.

fbennett · September 10, 2015

David,
I'll carry forward with implementing the UI-support idea, although it may be unlikely to make it into Zotero in the short term.

nickbart · September 11, 2015

@aurimas: Not sure what you meant by “Given correct data entry, the dropping particle has no significance for CSL or Zotero, so we can ignore parsing that part altogether.” – In fact, dropping particles need to be parsed just like non-dropping ones: they are always lower-cased, can be composed of multiple words, may contain and be followed by "’" or "-", and always follow the given name, or rather what remains in the given field after what follows a "," or ",!" has been parsed as suffix.
Zotero’s export seems to work ok for these elements, just as for the non-dropping particles, with the latest list-based parsing – provided that a particle is on the list, of course.
(My worry is that the list will never be complete and will need constant adjustment, that’s why I’m so much in favour of the “simple” rule …)

NMonteix · September 11, 2015

[Sorry, I was off-line the last three days]

@nickbart: yes, a test made this morning makes Matteo Della Corte appear as Corte, Matteo Della in (e.g.) Chicago 16th

Following your quote, "particles in Italian names are most often uppercased and retained when the last name is used alone". Hence, the major use would be of non dropping, at least for 19th-20th century people

I have never faced Lorenzo de' Medici's writings in bibliography, thus I have no idea of the customary use when quoting those (or those from any aristocratic Italian).

@fbennett: [possibly a stupid question...] how exactly will "var either_1 = [[null, [0,1]],[[0,1],null]" work? What will trigger the shift from non-dropping to dropping?

nickbart · September 11, 2015

@NMonteix:
- I trust “Della Corte, Matteo” is what you expected?
- If you get “Corte, Matteo Della” if the family/given fields contain [Della Corte][Matteo], I would guess you are not using the latest processor version, so try https://github.com/Juris-M/propachi-vanilla/releases/latest.
- Since Italian “Della” and friends, and in fact all uppercase elements, as far as I can tell, but correct me if I’m wrong, are never detached from the root family name (“Corte”), uppercase “Della” and friends aren’t non-dropping particles but fixed parts of the family name (unlike genuine so-called “non-dropping” particles, e.g., Dutch ones, which are detached in some circumstances).

NMonteix · September 11, 2015

@nickbart
- yes
- indeed, processor installed, it works
- you're right, they are not proper particles but part of the family name (only exception in my first list being de' Medici which is a lowercase particle)

Thanks!