Parsing problem on Italian names
Al-Pitkin, Lemuel (dos)
Al-Pitkin, Lemuel dos
Dos Al-Pitkin, Lemuel
The parsing logic currently runs as follows:
- Tokenize surname-leading and givenname-trailing lowercase elements as a (possibly empty) "core particles" array.
- Tokenize remaining surname-leading and givenname-trailing elements as "extended last" and "extended first" arrays.
- Find the longest and frontmost match to the list that includes the core particles.
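The steps above could be sketched roughly as follows. This is only an illustration, not the actual parser: the function names and the KNOWN_PARTICLES list are hypothetical, "extended last" elements are left out for brevity, and a token counts as lowercase simply when Python's str.islower() says so.

```python
# Hypothetical particle list, for illustration only; the real list is much longer.
KNOWN_PARTICLES = {"van", "de", "van de", "dos", "von", "zu", "von und zu"}


def split_core_particles(family, given):
    """Return (given_rest, core, family_rest) as lists of tokens.

    core holds the lowercase tokens that trail the given name followed by
    the lowercase tokens that lead the family name. It may be empty.
    """
    given_tokens = given.split()
    family_tokens = family.split()

    # Lowercase tokens trailing the given name.
    trailing = []
    while given_tokens and given_tokens[-1].islower():
        trailing.insert(0, given_tokens.pop())

    # Lowercase tokens leading the family name; always keep one surname token.
    leading = []
    while len(family_tokens) > 1 and family_tokens[0].islower():
        leading.append(family_tokens.pop(0))

    return given_tokens, trailing + leading, family_tokens


def match_particle_set(extended, core, known=KNOWN_PARTICLES):
    """Try the core preceded by progressively fewer extended tokens.

    The first candidate found in the known list is the longest and
    frontmost one. Returns None when nothing matches.
    """
    for start in range(len(extended) + 1):
        candidate = [t.lower() for t in extended[start:] + core]
        if candidate and " ".join(candidate) in known:
            return candidate
    return None
```

With this sketch, split_core_particles("Kamp", "John Van ferch de") leaves "ferch de" as the core, and match_particle_set(["Van"], ["ferch", "de"]) tries "van ferch de" and then "ferch de" before giving up, so the set stays unspecified.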
No match will be found for "dos al-", but "dos" will match, as a particle known to be either dropping or non-dropping, so both options are presented.
For known particles, I think the uppercase variant should be retained, since Americanized names often capitalize the European particles; otherwise users might take the UI to suggest that lower-casing is the only correct form.
For particles not in the list (like the archaic Welsh "ferch"), I agree that the uppercase variants should be omitted from the options, since the only basis for treating them as particles at all is the fact that they have been intentionally lowercased in the data.
On capitalization, with the logic I described above, the menu would be idempotent, which would give the UI support a solid feel. Specified particle sets would always present their upper-case variants, and the upper-case variants would in turn be recognized. Unspecified particle sets would show only lower-case options, so that they would reparse identically after any selection is made from the menu.
IIRC, the "capitalized particles are not particles" logic is for cases where the name has been "Americanized". Would it be uncommon to "Americanize" only a subset of particles (e.g. Lemuel dos Al-Pitkin vs Lemuel Dos Al-Pitkin)? In that case, I agree that we should only offer all-or-none capitalization of the particles.
My initial thought was that we tend to capitalize only the first of multiple particles, but I found a nice list of Dutch Americans (thank you, Wikipedia) and I was shocked (shocked, I tell you) to discover that the only name that conformed to my expectation was that of Robert J. Van de Graaff.
So I think you're right: all-or-nothing for capitalization of known particles will cover most cases, and we can reduce clutter by making it the only capitalizing option shown in the list.
(Edit: In the above paragraph, "recognize" -> "offer")
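To make the all-or-nothing rule concrete, here is a minimal sketch. The function name and the known argument are made up for illustration, and it covers only the capitalization choice discussed above, not the rest of the menu.

```python
def capitalization_variants(particles, known):
    """Return the capitalization variants to offer for a particle set.

    A set found in the known list is offered both in lowercase and with
    every particle capitalized (all-or-nothing). A set that is not in the
    list is offered in lowercase only, so it reparses the same way.
    """
    lower = [p.lower() for p in particles]
    variants = [lower]
    if " ".join(lower) in known:
        variants.append([p.capitalize() for p in lower])
    return variants
```

For example, with known = {"van de"}, the set ["van", "de"] yields both "van de" and "Van De", while ["ferch", "de"] yields only "ferch de".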
I missed the Van de Kamp's (distracted by my little joke). Americanized Van de Kamp would sort under "V". Our working assumption is that any leading uppercase element is part of the surname proper, and should be included in the sort.
Kamp, John van de
Van de Kamp, John // or maybe Van De Kamp, or both
van de Kamp, John
If the user selects the first option, and we forgo capitalizing particles that are in lower-case, then reopening the menu will show these options only:
Kamp, John van de
van de Kamp, John
I think that would be confusing: the user's initial selection should not alter the options presented when the menu is reopened.
Kamp, John (van ferch de) // All dropping
Kamp, John (van ferch) de //Shifting into non-dropping
Kamp, John (van) ferch de //...
Kamp, John van ferch de //...
De Kamp, John (van ferch) //Converting to non-particles. Forcing upper case
De Kamp, John (van) ferch //...
De Kamp, John van ferch //...
// Now we skip ferch as possible capitalized form, since we don't know about it
Van ferch de Kamp, John // Only capitalizing first particle and leaving others as they were
I think that list covers all possibilities. Now if the user selects option 4, for instance ("De Kamp, John van ferch"), then the new list would be
Kamp, John (van ferch de) // All dropping
Kamp, John (van ferch) de //Shifting into non-dropping
Kamp, John (van) ferch de //...
Kamp, John van ferch de //...
De Kamp, John (van ferch) //Converting to non-particles
De Kamp, John (van) ferch //...
De Kamp, John van ferch //...
Van ferch De Kamp, John // <-- This one is different
The list is still not idempotent, but the change only affects display and not sorting behavior. Is this what we were after?
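The dropping/non-dropping part of the lists above (the first four "Kamp, John ..." entries) can be generated mechanically by moving the cut point one particle at a time. A rough sketch follows; the function name is made up, and the capitalized "De Kamp" and "Van ..." variants are deliberately not covered.

```python
def dropping_splits(family, given, particles):
    """List every division of particles into dropping and non-dropping.

    The dropping prefix is shown in parentheses and the non-dropping
    suffix follows it, starting from "all dropping" and ending with
    "all non-dropping".
    """
    options = []
    for cut in range(len(particles), -1, -1):
        dropping = " ".join(particles[:cut])
        non_dropping = " ".join(particles[cut:])
        parts = [given]
        if dropping:
            parts.append(f"({dropping})")
        if non_dropping:
            parts.append(non_dropping)
        options.append(f"{family}, {' '.join(parts)}")
    return options
```

Calling dropping_splits("Kamp", "John", ["van", "ferch", "de"]) gives the four entries from "Kamp, John (van ferch de)" through "Kamp, John van ferch de", in that order.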
van ferch de // then ...
ferch de
Both match attempts would fail, so you would be left with the unspecified particle set "ferch de" only, and the options would be:
Kamp, John Van (ferch de)
Kamp, John Van (ferch) de
Kamp, John Van ferch de
de Kamp, John Van (ferch)
de Kamp, John Van ferch
ferch de Kamp, John Van
In this case, the menu would be idempotent. For "Franckenstein, Georg Freiherr Von Und Zu", we would have these:
Franckenstein, Georg Freiherr (von und zu)
Franckenstein, Georg Freiherr von und zu
von und zu Franckenstein, Georg Freiherr // lowercase-but-quoted
Von Und Zu Franckenstein, Georg Freiherr
Franckenstein, Georg Freiherr (von und zu)
Franckenstein, Georg Freiherr (von und) zu
Franckenstein, Georg Freiherr (von) und zu
Franckenstein, Georg Freiherr von und zu
Zu Franckenstein, Georg Freiherr (von und)
Zu Franckenstein, Georg Freiherr (von) und
Zu Franckenstein, Georg Freiherr von und
Von Und Zu Franckenstein, Georg Freiherr // Because "und" is not a "recognized" particle
Edit: actually, if "und" were not recognized as a particle on its own, the "Franckenstein, Georg Freiherr Von Und Zu" starting entry would only recognize "Zu" as a particle. If "von und zu" were a separate entry in the particle list (I haven't checked, maybe it already is), then this would work, and then fewer options would be offered.
Franckenstein, Georg Freiherr (von und zu)
Franckenstein, Georg Freiherr von und zu
Von Und Zu Franckenstein, Georg Freiherr
Can we exhaustively list all such joint particles?
When adding a known particle to an unspecific set of core elements (like "Van" in the John Van ferch de Kamp example), I don't see how you can make use of that information. It's such a remote edge case that I think you can stop the parse with the core elements on that one.
On lowercase-but-quoted, that's a real thing. "Charles de Gaulle" sorts under "d". Here's one source from the Web, but IIRC the Chicago Manual says the same thing.
CMoS, 16e, 8.5 and 16.71, lists the following:
- Walter de la Mare; de la Mare
- Paul de Man; de Man
- Daphne du Maurier; du Maurier
- Robert van Gulik; van Gulik
- Wernher von Braun; von Braun
- da Cunha, Euclides
- de Gaulle, Charles
- di Leonardo, Micaela
This proposed tool is groundbreaking. Not only does it assist with proper alphabetization of author names, it helps keep the entered names uniform so that the styles can better implement name disambiguation. Most people don't take the time to think about name particle practices, and fewer still worry about the conventions of other languages or places. That Zotero will assist with the entry of names containing particles is truly a giant step in bibliography management, but it is only part of why this is wonderful. Equally if not more important is the ability of the tool and its documentation to enlighten the manuscript author about the complexity of the name particle issue.
I've intentionally used what may seem like hyperbole in this post. Upon implementation of the tool I believe that few will think I've exaggerated the tool's impact. The tool's documentation will be key to success. I see a need for both a straightforward how-to covering the mechanics of the procedure and an optional deeper explanation of the bibliographic conventions of particle usage across the world.
Points to note are:
- The dingbat "komejirushi"-like separator mark, which I hope will render on all systems;
- Idempotence of unspecified particle sets, and near-idempotence of specified sets
- The constraint imposed on "no-particle" capitalized forms
- Disabling of rollover on the headings
(As the documentation draft says and the screencast shows, it's still not extending from the "core particles" unless there is a good match against the list for the entire extended particle set, so the "Van" in "Kamp, John Van ferch de" is not picked up as a particle. I've considered alternatives, but I think this is the best compromise between orderly and complete.)
(Edit: in the original post, the idempotence description was backwards.)
Given the number of changes in the offing, we can assume that the team are pretty busy with other issues, but we'll see where we land on the priorities list (and meanwhile, thanks to Aurimas for taking time to review here).
As far as user entry goes, I think we're approaching a point where we need to implement rich data fields, with support for italics, superscript, subscript, etc., that does not require users to type out HTML tags. The rich fields would also handle inserting a non-breaking space (disguised as a "non-dropping particle" button, or something of the sort) and formatting the non-dropping particle in a way that is distinct from the given name (say, italics).
Anyway, haven't explored the thought too much, but it could work out.
The plan is to redo the item pane in HTML anyway, at which point there won't be much of a cost to using HTML in fields.
Within Zotero, these fields will be HTML anyway, so anything sorted or searched upon would need to be converted to (or stored separately as) plain-text. So might as well make it explicit where we can. HTML gives us the ability to mark up text meaningfully.
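As a rough illustration of the plain-text conversion for sorting and searching, here is a sketch using Python's standard html.parser module. The names to_plain_text and _TextExtractor are made up, and this says nothing about how Zotero itself would store or convert the fields.

```python
from html.parser import HTMLParser


class _TextExtractor(HTMLParser):
    """Collect only the text content of a field, ignoring all tags."""

    def __init__(self):
        super().__init__()  # character references are converted by default
        self.chunks = []

    def handle_data(self, data):
        self.chunks.append(data)


def to_plain_text(markup):
    """Strip markup and normalize whitespace for sorting or searching.

    Non-breaking spaces count as whitespace here, so they come out as
    ordinary single spaces.
    """
    parser = _TextExtractor()
    parser.feed(markup)
    parser.close()  # flush any text still buffered by the parser
    return " ".join("".join(parser.chunks).split())
```

A field holding "<i>van</i>&nbsp;Gulik, Robert" would then be sorted and searched as "van Gulik, Robert", while the markup itself stays available for display.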