Parsing problem on Italian names

aurimas · September 17, 2015

The problem here though is that particles are often miscapitalized.

I think the solution above (where all recognizable particles are treated as lower-cased) would solve the issue you're describing. Note that the name is entered with capitalized particles.

fbennett · September 17, 2015

The parse shown in that sample is more aggressive than the current code, which would show only these options for Al-Pitkin, Lemuel Dos:

Al-Pitkin, Lemuel (dos)
Al-Pitkin, Lemuel dos
Dos Al-Pitkin, Lemuel

The parsing logic currently runs as follows:

Tokenize surname-leading and givenname-trailing lowercase elements as a (possibly empty) "core particles" array.
Tokenize remaining surname-leading and givenname-trailing elements as "extended last" and "extended first" arrays.
Find the longest and frontmost match to the list that includes the core particles.

No match will be found for "dos al-", but "dos" will match, as a particle known to be either dropping or non-dropping, so both options are presented.
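The matching step described above can be sketched roughly as follows. This is only an illustration of "longest and frontmost match that includes the core particles", not the actual parser code: the function name, the pre-tokenized inputs, and the tiny `KNOWN` set are all assumptions made for the example.

```python
# Hypothetical stand-in for the real particle list.
KNOWN = {"dos", "van de", "van den", "von und zu"}


def find_particle_match(ext_first, core, ext_last, known=KNOWN):
    """Return the longest, frontmost known particle set, or None.

    ext_first: givenname-trailing elements outside the core (e.g. ["Dos"])
    core:      lowercase "core particles" (possibly empty)
    ext_last:  surname-leading elements outside the core (e.g. ["Al-"])
    """
    tokens = [t.lower() for t in ext_first + core + ext_last]
    core_start = len(ext_first)
    core_end = core_start + len(core)
    # Longest windows first; within one length, leftmost start first.
    for length in range(len(tokens), 0, -1):
        for start in range(len(tokens) - length + 1):
            end = start + length
            # Skip windows that do not cover the whole core.
            if start > core_start or end < core_end:
                continue
            candidate = " ".join(tokens[start:end])
            if candidate in known:
                return candidate
    return None


# "dos al-" is tried first and fails; "dos" then matches.
print(find_particle_match(["Dos"], [], ["Al-"]))
```

When nothing matches, the caller would fall back to treating the core elements as an unspecified particle set.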

For known particles, I think the uppercase variant should be retained, since Americanized names often capitalize the European particles—otherwise users might take the UI to suggest that lower-casing is the only correct form.

For particles not in the list (like the archaic Welsh "ferch"), I agree that the uppercase variants should be omitted from the options, since the only basis for treating them as particles at all is the fact that they have been intentionally lowercased in the data.

aurimas · September 17, 2015

The parse shown in that sample is more aggressive than the current code

Is it too aggressive though? I don't think so.

For known particles, I think the uppercase variant should be retained, since Americanized names often capitalize the European particles—otherwise users might take the UI to suggest that lower-casing is the only correct form.

They are in the sample above. "Al-Pitkin, Lemuel dos/(dos)" and "Dos Al-Pitkin, Lemuel" re-capitalize them and treat them as non-particles, which is what this thread concluded some time ago.

For particles not in the list (like the archaic Welsh "ferch"), I agree that the uppercase variants should be omitted from the options, since the only basis for treating them as particles at all is the fact that they have been intentionally lowercased in the data.

I guess we can forgo capitalizing particles that were initially lower-cased.

fbennett · September 17, 2015

Is it too aggressive though? I don't think so.

It depends on what information you want to derive from the match. The current code limits the number of options presented to the known characteristics of the matched particles. For example "van den" is presented only as a non-dropping particle, because that is its only correct form. If multiple matches to the list are treated as particle candidates, making use of those constraints is harder, and if the constraints are ignored, you end up with a larger list of options.
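To illustrate how known characteristics can trim the menu, here is a small sketch. The `KNOWN_ROLES` table and the function name are hypothetical; only the two facts it encodes ("dos" may be dropping or non-dropping, "van den" is non-dropping only) come from this thread. Parentheses mark a dropping particle, as in the display syntax used above. Capitalized variants are left out of the sketch.

```python
# Hypothetical table: roles each known particle set is allowed to take.
KNOWN_ROLES = {
    "dos": {"dropping", "non-dropping"},
    "van den": {"non-dropping"},
}


def role_options(family, given, particle):
    """List only the particle placements permitted for this particle."""
    key = particle.lower()
    # Unknown particles get no constraint in this sketch.
    roles = KNOWN_ROLES.get(key, {"dropping", "non-dropping"})
    options = []
    if "dropping" in roles:
        options.append(f"{family}, {given} ({key})")
    if "non-dropping" in roles:
        options.append(f"{family}, {given} {key}")
    return options


print(role_options("Al-Pitkin", "Lemuel", "Dos"))
print(role_options("Broeck", "William", "van den"))
```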

On capitalization, with the logic I described above, the menu would be idempotent, which would give the UI support a solid feel. Specified particle sets would always present their upper-case variants, and the upper-case variants would in turn be recognized. Unspecified particle sets would show only lower-case options, so that they would reparse identically after any selection is made from the menu.

aurimas · September 17, 2015

I think we agree on capitalization then, given my last remark "we can forgo capitalizing particles that were initially lower-cased", no? (Edit: maybe no. Are you suggesting that we recognize particles as such, but keep them in upper case if that's how they were entered?)

IIRC, the "capitalized particles are not particles" logic is for cases where the name has been "Americanized". Would it be uncommon to "Americanize" only a subset of particles (e.g. Lemuel dos Al-Pitkin vs Lemuel Dos Al-Pitkin)? In that case, I agree that we should only offer all-or-none capitalization of the particles.

fbennett · September 17, 2015

I think we agree on capitalization then, given my last remark "we can forgo capitalizing particles that were initially lower-cased", no?

Yes, but only for the unrecognized variety. For recognized particles, capitalized options are needed to make the menu idempotent.

My initial thought was that we tend to capitalize only the first of multiple particles, but I found a nice list of Dutch Americans (thank you, Wikipedia) and I was shocked (shocked, I tell you) to discover that the only name that conformed to my expectation was that of Robert J. Van de Graaff.

So I think you're right: all-or-nothing for capitalization of known particles will cover most cases, and we can reduce clutter by making it the only capitalizing option shown in the list.

aurimas · September 17, 2015

I think we agree on capitalization then, given my last remark "we can forgo capitalizing particles that were initially lower-cased", no?

Yes, but only for the unrecognized variety. For recognized particles, capitalized options are needed to make the menu idempotent.

But that doesn't make sense with what we agreed upon earlier. All capitalized particles are treated as being part of the last name, so if we suggest that "Van" may be a non-dropping particle, we should also lower-case it, since the assumption then is that it was improperly capitalized.

My initial thought was that we tend to capitalize only the first of multiple particles, but I found a nice list of Dutch Americans (thank you, Wikipedia) and I was shocked (shocked, I tell you) to discover that the only name that conformed to my expectation was that of Robert J. Van de Graaff.

Looking through the list, I see "James Van Der Beek", "John Van de Kamp", "Robert J. Van de Graaff", "Rex Van de Kamp", "William Van Den Broeck", so it seems a bit random. My question, though, is whether "John Van de Kamp" would be sorted under V or K.

fbennett · September 17, 2015

But that doesn't make sense with what we agreed upon earlier.

It doesn't conflict. The point is only that if the user is given the option to lowercase a particle from the menu, they should also be given the option to capitalize it again, since either form might be correct. [Edit: subject to the constraints on which capitalized combinations to offer that we're working on now, of course]

(Edit: In the above paragraph, "recognize" -> "offer")

I missed the Van de Kamps (distracted by my little joke). Americanized Van de Kamp would sort under "V". Our working assumption is that any leading uppercase element is part of the surname proper, and should be included in the sort.
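The sorting assumption can be sketched like this. The function name is made up for the example, and the sketch deliberately ignores the quoted lowercase case: it simply skips leading lowercase elements and sorts from the first capitalized one.

```python
def sort_key(family):
    """Sort from the first capitalized element of the family-name field."""
    tokens = family.split()
    i = 0
    # Skip leading lowercase elements, but always keep the last token.
    while i < len(tokens) - 1 and tokens[i][:1].islower():
        i += 1
    return " ".join(tokens[i:]).casefold()


names = ["Van de Kamp", "van de Kamp", "Lincoln"]
# "van de Kamp" sorts under K, "Van de Kamp" under V.
print(sorted(names, key=sort_key))
```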

aurimas · September 17, 2015

I feel like I'm missing what you're trying to suggest. Could you provide an example?

fbennett · September 17, 2015

Sure thing. Suppose we have the input "de Kamp, John Van", and our correct field content is "Van de Kamp, John". The parser finds "van de" as a matching particle that is always non-dropping, so we offer these options (using your display syntax):

Kamp, John van de
Van de Kamp, John // or maybe Van De Kamp, or both
van de Kamp, John

If the user selects the first option, and we forgo capitalizing particles that are in lower-case, then reopening the menu will show these options only:

Kamp, John van de
van de Kamp, John

I think that would be confusing: the user's initial selection should not alter the options presented when the menu is reopened.
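The property being asked for here can be written as a simple check. The two menu builders below are toy stand-ins that just return the option lists from this post; they are not the parser, and their names are invented for the example.

```python
FULL = ["Kamp, John van de", "Van de Kamp, John", "van de Kamp, John"]
REDUCED = ["Kamp, John van de", "van de Kamp, John"]


def menu_forgoing_caps(entry):
    # Toy model: the capitalized option is offered only while the
    # entry itself still contains a capitalized particle.
    return FULL if "Van" in entry else REDUCED


def menu_keeping_caps(entry):
    # Toy model: a known particle set always offers its capitalized form.
    return FULL


def is_idempotent(build_menu, entry):
    """True if choosing any option and reopening shows the same menu."""
    menu = build_menu(entry)
    return all(build_menu(option) == menu for option in menu)


print(is_idempotent(menu_forgoing_caps, "de Kamp, John Van"))  # False
print(is_idempotent(menu_keeping_caps, "de Kamp, John Van"))   # True
```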

aurimas · September 17, 2015

OK, so then we treat particles we know about as potentially upper or lower case variants (as you suggested initially, I believe). Starting with "ferch de Kamp, John Van" (for illustration purposes), the list would be

Kamp, John (van ferch de) // All dropping
Kamp, John (van ferch) de // Shifting into non-dropping
Kamp, John (van) ferch de // ...
Kamp, John van ferch de // ...
De Kamp, John (van ferch) // Converting to non-particles. Forcing upper case
De Kamp, John (van) ferch // ...
De Kamp, John van ferch // ...
// Now we skip ferch as possible capitalized form, since we don't know about it
Van ferch de Kamp, John // Only capitalizing first particle and leaving others as they were

I think that list covers all possibilities. Now if the user selects option 4, for instance ("De Kamp, John van ferch"), then the new list would be

Kamp, John (van ferch de) // All dropping
Kamp, John (van ferch) de // Shifting into non-dropping
Kamp, John (van) ferch de // ...
Kamp, John van ferch de // ...
De Kamp, John (van ferch) // Converting to non-particles
De Kamp, John (van) ferch // ...
De Kamp, John van ferch // ...
Van ferch De Kamp, John // <-- This one is different

The list is still not idempotent, but the change only affects display and not sorting behavior. Is this what we were after?

fbennett · September 17, 2015

The lowercase elements "ferch de" are mandatory, so the matches attempted with that input would be:

van ferch de // then ...
ferch de

Both match attempts would fail, so you would be left with the unspecified particle set "ferch de" only, and the options would be:

Kamp, John Van (ferch de)
Kamp, John Van (ferch) de
Kamp, John Van ferch de
de Kamp, John Van (ferch)
de Kamp, John Van ferch
ferch de Kamp, John Van

In this case, the menu would be idempotent. For "Franckenstein, Georg Freiherr Von Und Zu", we would have these:

Franckenstein, Georg Freiherr (von und zu)
Franckenstein, Georg Freiherr von und zu
von und zu Franckenstein, Georg Freiherr // lowercase-but-quoted
Von Und Zu Franckenstein, Georg Freiherr
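The six options listed above for the unspecified set "ferch de" look like every way of cutting the ordered particle list into three consecutive groups: dropping (in parentheses), non-dropping, and part of the family name. A sketch that reproduces that list under this reading follows. The reading is inferred from the listed options rather than taken from the parser, and the function name is hypothetical.

```python
def unspecified_options(family, given, particles):
    """Enumerate dropping / non-dropping / family-name splits, in order."""
    n = len(particles)
    options = []
    # family_start: index where the family-name part of the particles begins.
    for family_start in range(n, -1, -1):
        # drop_end: index where the dropping part ends.
        for drop_end in range(family_start, -1, -1):
            dropping = particles[:drop_end]
            non_dropping = particles[drop_end:family_start]
            prefix = particles[family_start:]
            given_part = given
            if dropping:
                given_part += " (" + " ".join(dropping) + ")"
            if non_dropping:
                given_part += " " + " ".join(non_dropping)
            family_part = " ".join(prefix + [family])
            options.append(f"{family_part}, {given_part}")
    return options


for option in unspecified_options("Kamp", "John Van", ["ferch", "de"]):
    print(option)
```

No capitalized variants are produced, which is what keeps a menu like this stable when it is reopened after a selection.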

aurimas · September 17, 2015

The lowercase elements "ferch de" are mandatory

Well, in the current parser. The discussion above is not really taking into account any current implementations, it's trying to figure out what _should_ be the case.

For "Franckenstein, Georg Freiherr Von Und Zu", we would have these:
Franckenstein, Georg Freiherr (von und zu)
Franckenstein, Georg Freiherr von und zu
von und zu Franckenstein, Georg Freiherr // lowercase-but-quoted
Von Und Zu Franckenstein, Georg Freiherr

"lowercase-but-quoted" option should never be offered, because AFAICT, this is either an edge case or never actually exists. We could implement some sort of special behavior for "joining" words like "und" that would force treatment of adjoining particles as a single particle, but otherwise, the suggested options would be

Franckenstein, Georg Freiherr (von und zu)
Franckenstein, Georg Freiherr (von und) zu
Franckenstein, Georg Freiherr (von) und zu
Franckenstein, Georg Freiherr von und zu
Zu Franckenstein, Georg Freiherr (von und)
Zu Franckenstein, Georg Freiherr (von) und
Zu Franckenstein, Georg Freiherr von und
Von Und Zu Franckenstein, Georg Freiherr // Because "und" is not a "recognized" particle

Edit: actually, if "und" were not recognized as a particle on its own, the "Franckenstein, Georg Freiherr Von Und Zu" starting entry would only recognize "Zu" as a particle. If "von und zu" were a separate entry in the particle list (I haven't checked, maybe it already is), then this would work, and then fewer options would be offered.

Franckenstein, Georg Freiherr (von und zu)
Franckenstein, Georg Freiherr von und zu
Von Und Zu Franckenstein, Georg Freiherr

Can we exhaustively list all such joint particles?

fbennett · September 17, 2015

Well, in the current parser. The discussion above is not really taking into account any current implementations, it's trying to figure out what _should_ be the case.

Yes, the implementation is just a draft. But if you want to reduce the number of options presented to the user, the known characteristics of a particle set (grouping, and dropping/non-dropping allocation) are a good way of doing that.

When adding a known particle to an unspecific set of core elements (like "Van" in the John Van ferch de Kamp example), I don't see how you can make use of that information. It's such a remote edge case that I think you can stop the parse with the core elements on that one.

On lowercase-but-quoted, that's a real thing. "Charles de Gaulle" sorts under "d". Here's one source from the Web, but IIRC the Chicago Manual says the same thing.

fbennett · September 17, 2015

(Yes, "von und zu" is in the list. We have very good coverage for Dutch, and pretty-good coverage of the other European domains.)

aurimas · September 17, 2015

When adding a known particle to an unspecific set of core elements (like "Van" in the John Van ferch de Kamp example), I don't see how you can make use of that information

I don't see how that changes anything, though I don't really understand what you meant by mandatory in "The lowercase elements 'ferch de' are mandatory..." above.

if you want to reduce the number of options presented to the user, the known characteristics of a particle set (grouping, and dropping/non-dropping allocation) are a good way of doing that

Yes, but the parser should work ok with the assumption that we know nothing about these properties for individual particles. We can use the properties to trim down the suggestions.

On lowercase-but-quoted, that's a real thing. "Charles de Gaulle" sorts under "d". Here's one source from the Web, but IIRC the Chicago Manual says the same thing

That's the exception, rather than the rule though (at least from what we concluded before), so it shouldn't be in the suggestions.

fbennett · September 17, 2015

I don't see how that changes anything, though I don't really understand what you meant by mandatory in "The lowercase elements 'ferch de' are mandatory..." above.

By that I meant that those lowercase elements in the example are (in the current draft) always included when attempting matches against the list. Adding further elements that do not combine with the core to form a match will yield more false positives. If that's desired, we can do that; but I don't think it's a good idea, for the reasons you and Rintze have raised (excessive number of options, user confusion).

[lowercase-but-quoted is] the exception, rather than the rule though (at least from what we concluded before), so it shouldn't be in the suggestions

I disagree; I would include it in the list, but place it at the bottom. The quoted syntax covers a well-defined if small category of cases (French particles on names of one syllable), and having it in the list makes it discoverable to users without recourse to documentation or the forums. But if there is a desire to mask it, we can do that.

aurimas · September 17, 2015

If that's desired, we can do that; but I don't think it's a good idea, for the reasons you and Rintze have raised (excessive number of options, user confusion).

Like you say, that's probably an edge case, so I don't think it makes a huge difference, but I think not combining these for a match would be less confusing.

I disagree; I would include it in the list, but place it at the bottom.

Only if this happens with very few particles. "de" only if not followed by other particles, for example. We shouldn't display that option for all particles.

fbennett · September 18, 2015

Like you say, that's probably an edge case, so I don't think it makes a huge difference, but I think not combining these for a match would be less confusing.

While writing out a set of questions, I realized that I don't understand how you want the parsing and list creation to work. Maybe we're just talking past each other.

Only if this happens with very few particles. "de" only if not followed by other particles, for example. We shouldn't display that option for all particles.

Yeah, if the impact can be minimized, that would be good. If it is only ever an issue with "de", limiting it to that lone particle would be a good move.

Gracile · September 18, 2015

Please add a detailed explanation when you adopt a new display syntax here; these threads are already hard to follow :-)

Based on the previous discussions, capitalized particles are treated as non-particles and lower-case as non-dropping

Am I missing something? The italicized part is not true for me.

"lowercase-but-quoted" option should never be offered, because AFAICT, this is either an edge case or never actually exists.

I disagree for the reasons Frank gave. But the option could be limited to "de" as you suggest.

nickbart · September 18, 2015

Lowercase non-particles do not seem to be that rare.
CMoS, 16e, 8.5 and 16.71, lists the following:

- Walter de la Mare; de la Mare
- Paul de Man; de Man
- Daphne du Maurier; du Maurier
- Robert van Gulik; van Gulik
- Wernher von Braun; von Braun

- da Cunha, Euclides
- de Gaulle, Charles
- di Leonardo, Micaela

DWL-SDCA · September 18, 2015

As this potentially extremely useful particle tool comes closer to implementation, I echo the concern raised by Gracile and recommend that development of a draft of the documentation start now. The usefulness of the tool will be directly related to how well its purpose and operation can be explained. In my personal experience, when I try to provide written instructions on a procedure, I sometimes find that the procedure itself needs to be adjusted a bit.

This proposed tool is groundbreaking. Not only does it assist with proper alphabetization of author names, it helps keep the entered names uniform so that the styles can better implement name disambiguation. Most people don't take the time to think about name particle practices and fewer still worry about the conventions of other languages or places. That Zotero will help with entering names that contain particles is truly a giant step in bibliography management, but that is only part of why this is wonderful. Equally if not more important is the ability of the tool and its documentation to enlighten the manuscript author about the complexity of the name particle issue.

I've intentionally used what may seem like hyperbole in this post. Upon implementation of the tool I believe that few will think I've exaggerated the tool's impact. The tool's documentation will be key to success. I see a need for both a straightforward how-to mechanics of the procedure and an optional deeper explanation of the bibliographic conventions of particle usage across the world.

fbennett · September 19, 2015

Thanks to everyone for the latest round of comments. Running the use cases posted by aurimas revealed bugs in the code, so double-thanks there. The screencast has been refreshed based on the latest iteration, and I've added a draft of documentation explaining what it's for and how it works to the Juris-M wiki.

Points to note are:

The dingbat "komejirushi"-like separator mark, which I hope will render on all systems;
Idempotence of unspecified particle sets, and near-idempotence of specified sets;
The constraint imposed on "no-particle" capitalized forms;
Disabling of rollover on the headings.

(As the documentation draft says and the screencast shows, it's still not extending from the "core particles" unless there is a good match against the list for the entire extended particle set—so the "Van" in "Kamp, John Van ferch de" is not picked up as a particle. I've considered alternatives, but I think this is the best compromise between orderly and complete.)

(Edit: in the original post, the idempotence description was backwards.)

fbennett · September 21, 2015

I've filed pull requests for simplified parsing, and for UI support.

Given the number of changes in the offing, we can assume that the team are pretty busy with other issues, but we'll see where we land on the priorities list (and meanwhile, thanks to Aurimas for taking time to review here).

dstillman · September 21, 2015

Thanks to everyone who's been working on this. Unfortunately for Zotero itself we'll need a solution that doesn't corrupt the fields, as the double-quotes do. If that means adding additional creator fields, we can plan to do that along with the other schema/API changes, once API syncing is in place. But I don't want a proliferation of hacks on top of the plain-text data.

aurimas · September 21, 2015

we'll need a solution that doesn't corrupt the fields, as the double-quotes do. If that means adding additional creator fields, we can plan to do that

I'm not a fan of double quotes either, TBH, but I don't think we need separate DB fields for this. What if we use a non-breaking space to separate lower-case particles that are supposed to be part of the given name? This would be a solution for the data layer, not user entry.

As far as user entry goes, I think we're approaching a point where we need to implement rich data fields, with support for italics, superscript, subscript, etc. that do not require the user to type out HTML tags. The rich fields would also handle inserting a non-breaking space (disguised as a "non-dropping particle" button, or something of the sort) and formatting the non-dropping particle in a way that is distinct from the given name (say, italics).

Anyway, haven't explored the thought too much, but it could work out.
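One possible reading of the non-breaking-space idea, sketched for illustration only: the stored given-name field holds the given name, then U+00A0, then the lower-case particle, and the data layer splits on that character. The field layout and the function name are assumptions, not anything Zotero does.

```python
NBSP = "\u00a0"


def split_given_field(field):
    """Split a stored given-name field into (given name, particle).

    Assumes a single NBSP separates the two; a field without one
    is treated as having no particle.
    """
    given, _, particle = field.partition(NBSP)
    return given, particle


print(split_given_field("John" + NBSP + "van de"))  # ('John', 'van de')
print(split_given_field("John"))                    # ('John', '')
```

An ordinary space inside the particle ("van de") is left alone, so only the NBSP acts as the signifier.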

dstillman · September 21, 2015

Well, if we think we're going to need HTML anyway, we could just use spans and classes, which would be much clearer. The non-breaking space still feels like a hack if we're really just using it as a signifier that's parsed out.

The plan is to redo the item pane in HTML anyway, at which point there won't be much of a cost to using HTML in fields.

aurimas · September 21, 2015

The non-breaking space still feels like a hack if we're really just using it as a signifier that's parsed out

I'm also thinking that it could work nicely when exported in various formats (RIS, maybe BibTeX, etc.). I'm not sure how SQLite collation algorithms treat NBSP, but it might work nicely in those cases as well (I'm guessing we're not that lucky).

fbennett · September 21, 2015

You might also consider how you want the sorting of names containing particles to work in the UI.

dstillman · September 21, 2015

I think RIS and BibTeX are on their own — if they don't have a mechanism for dealing with this, there's not much point in worrying about it. We can export in richer formats as HTML with semantic classes. (Actually, we might even export to RIS as HTML now, for better or worse.)

Within Zotero, these fields will be HTML anyway, so anything sorted or searched upon would need to be converted to (or stored separately as) plain-text. So might as well make it explicit where we can. HTML gives us the ability to mark up text meaningfully.
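A rough sketch of what "HTML in the field, plain text for sorting and searching" could look like, using Python's standard `html.parser`. The class name `non-dropping-particle` and the helper names are invented for the example; nothing here reflects an actual Zotero schema.

```python
from html.parser import HTMLParser


class CreatorFieldParser(HTMLParser):
    """Collect the plain text and any class-marked segments of a field."""

    def __init__(self):
        super().__init__()
        self.plain = []    # all text, in document order
        self.marked = {}   # class name -> list of text chunks
        self._classes = [] # class (or None) of each open element

    def handle_starttag(self, tag, attrs):
        self._classes.append(dict(attrs).get("class"))

    def handle_endtag(self, tag):
        if self._classes:
            self._classes.pop()

    def handle_data(self, data):
        self.plain.append(data)
        # Attribute the text to the innermost enclosing element with a class.
        cls = next((c for c in reversed(self._classes) if c), None)
        if cls:
            self.marked.setdefault(cls, []).append(data)


def parse_field(html):
    parser = CreatorFieldParser()
    parser.feed(html)
    parser.close()
    return "".join(parser.plain), parser.marked


text, marked = parse_field('<span class="non-dropping-particle">van de</span> Kamp')
print(text)    # van de Kamp
print(marked)  # {'non-dropping-particle': ['van de']}
```

The plain-text value is what would be sorted or searched on, while the marked segments carry the meaning that quotes or special characters would otherwise have to encode.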