Parsing problem on Italian names

Rintze · September 11, 2015

@nickbart, a second (and maybe better) criterion to distinguish non-dropping particles from non-particles is probably whether or not family names always include these name elements in alphabetical sorting. If they are sometimes ignored (e.g. with "de Koning" needing to be sorted under "K"), they're particles.

Rintze · September 11, 2015

Coming back to @aurimas's summary:

Like @nickbart, I disagree with "Given correct data entry, the dropping particle has no significance for CSL or Zotero, so we can ignore parsing that part altogether.". Dropping particles are treated differently from initials or non-dropping particles, so you need to treat them as their own class of name element.

As for "The non-dropping particle ... may be joined with family name by punctuation", I would just like to add that the punctuation options are probably rather limited in (Western) names. I think spaces, apostrophes/quote marks, and hyphens cover most cases.

Then with regard to "Current suggestion is for some ... solution that cycles through possible permutations" and "User must know what he's looking for anyway, so I don't really see why that is easier than just correcting the case manually.": the main issue I have with the current setup is that users need to be explicitly aware of the existence of dropping and non-dropping particles, as well as of the precise formatting requirements in Zotero, to get correct output. All this without the Zotero UI giving any guidance or feedback when editing the name field.

Instead of a UI option that cycles through the different two-field name element storage options, like Frank's example:


van der Merwe, Wikus
Van der Merwe, Wikus
Van Der Merwe, Wikus
"van der Merwe", Wikus
der Merwe, Wikus van
Der Merwe, Wikus van
"der Merwe", Wikus van
Merwe, Wikus van der
van der Merwe, Wikus

I would like to propose two possible changes. First, I think the user should be confronted with all options simultaneously. I think cycling through this many options is just confusing. Even looking at the entire list right now it takes me a long time to figure out what's what. Second, I think the UI should focus on the desired output instead of the way the name needs to be stored in Zotero. E.g. the list could also be presented as a menu with the following options:


"(van der Merwe)" - "Merwe, W. van der"
"(Van der Merwe)" - "Van der Merwe, W."
"(Van Der Merwe)" - "Van Der Merwe, W."
"(van der Merwe)" - "van der Merwe, W."
"(der Merwe)" - "Merwe, W. van der"
"(Der Merwe)" - "Der Merwe, W. van"
"(der Merwe)" - "der Merwe, W. van"
"(Merwe)" - "Merwe, W. van der"

That seems much more intuitive to me. Zotero can then easily rearrange the name elements as required.

Finally, regarding "Correctly parsing/splitting names on import", I generally agree. In my own experience there are two annoying things. First, capitalization on import is often wrong. I deal with a fair number of Dutch names, and often particles come into Zotero uppercased. Second, sometimes non-dropping particles come in as dropping particles (that is, they reside in the given name field). It would be nice if Zotero had an easier way to move particles between the two name fields. Currently it's quite an ordeal: activate name field A, select particle, cut particle, activate name field B, select insertion point, paste particle.
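A minimal sketch of what a one-step "move particles to the family field" action might do, instead of the manual cut-and-paste routine described above. The function name, the `{family, given}` shape, and the tiny particle set are all assumptions for illustration, not Zotero's actual API or citeproc-js's particle list. Lowercasing the moved particles is also an assumption, based on the miscapitalized imports mentioned above.

```javascript
// Hypothetical, deliberately tiny particle set (not the citeproc-js list).
const KNOWN_PARTICLES = new Set(["van", "den", "der", "de", "von"]);

// Move trailing particle words from the given-name field to the front of
// the family-name field, lowercasing them on the way.
function promoteParticles(name, known = KNOWN_PARTICLES) {
  const words = name.given.trim().split(/\s+/);
  const moved = [];
  // Always keep at least one word in the given-name field.
  while (words.length > 1 && known.has(words[words.length - 1].toLowerCase())) {
    moved.unshift(words.pop().toLowerCase());
  }
  return {
    given: words.join(" "),
    family: [...moved, name.family].join(" "),
  };
}

console.log(promoteParticles({ family: "Berg", given: "Jan Van den" }));
// → { given: 'Jan', family: 'van den Berg' }
```

The reverse move (family field to given field) would mirror this loop from the front of the family field.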

Gracile · September 11, 2015

Like nickbart and Rintze, I disagree with "Given correct data entry, the dropping particle has no significance for CSL or Zotero, so we can ignore parsing that part altogether."
To their arguments, I'd add that it would be nice if CSL could, *in the future*, format the dropping particle, at least to add parentheses. To take a (now) well-known example: "La Fontaine, Jean (de)" or even "La Fontaine (de), Jean" is sometimes the desired output.

I didn't understand the purpose of the keyboard permutations of a name with particles at first, but I'm now convinced: it will make the mechanism of the particles parsing discoverable and clear to the user, especially with Rintze's proposals above which I second!

nickbart · September 12, 2015

Four apparent processor bugs (observed with 1.1.19):

(1) When there’s a suffix, a preceding (dropping) particle is not parsed at all: [Author] [Ann de, III] is parsed as "family": "Author", "given": "Ann de", "suffix": "III"
(2) Parsing is not exhaustive: [van Author] [Ann von] is parsed as "family": "Author", "given": "Ann von", "non-dropping-particle": "van"
(3) EDIT: And [de l’Author] [Ann] is parsed as "family": "l’Author", "given": "Ann", "non-dropping-particle": "de" (though de l’ is on the list),
(4) EDIT 2: And [vom und zum Author] [Ann] is parsed as "family": "und zum Author", "given": "Ann", "non-dropping-particle": "vom" (though vom und zum is on the list).
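
The last two cases look like a shorter particle winning over a longer listed one. A minimal sketch of longest-match splitting on the family field follows; the function name and the short particle list are hypothetical, and this is not the citeproc-js parser. Matching is case-sensitive on purpose, so a capitalized "De" stays part of the family name.

```javascript
// Hypothetical subset of listed particles, for illustration only.
const PARTICLES = ["de", "de l’", "van", "van der", "vom", "vom und zum"];

// Split a leading non-dropping particle off the family field,
// trying longer particles before shorter ones.
function splitFamilyParticle(family, particles = PARTICLES) {
  const byLength = [...particles].sort((a, b) => b.length - a.length);
  for (const particle of byLength) {
    // Particles ending in an apostrophe or hyphen attach without a space.
    const joined = /[’'-]$/.test(particle);
    const prefix = joined ? particle : particle + " ";
    if (family.length > prefix.length && family.startsWith(prefix)) {
      return {
        "non-dropping-particle": particle,
        family: family.slice(prefix.length),
      };
    }
  }
  return { family };
}

console.log(splitFamilyParticle("de l’Author"));
// → { 'non-dropping-particle': 'de l’', family: 'Author' }
console.log(splitFamilyParticle("vom und zum Author"));
// → { 'non-dropping-particle': 'vom und zum', family: 'Author' }
```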

fbennett · September 12, 2015

Thanks.

(1) is a bug.
(2) "von van" is not a listed particle, so "van" as NDP is expected.
(3) is a bug.
(4) is a bug.

I'll take a look, although the current parser will have a limited life expectancy. In the parser for the UI code I'm building with the same particles data, (1) fails, and (2)-(3) pass.

(Edit: Actually, I think I can just adapt the UI parsing code to do the classification. The new code is much more transparent, and since there may be a role for the classifier in translators, refactoring it will be a good use of time.)

fbennett · September 14, 2015

I have proof-of-concept code for name particle UI support running in a trial build of Zotero. To save the trouble of installing the client build for testing, I've made a short screencast that takes it through its paces.

Please click and view. I'm curious to know how it will be received.

Rintze · September 14, 2015

Great work! But wouldn't it make more sense to show formatted names with "name-as-sort-order" active? That would show how particles are demoted.

fbennett · September 15, 2015

Sort-order for the full name would be very clear with aligned columns, but with jagged layout it seems a little cluttered, at least for someone encountering the issue for the first time.

Would a decoration hint with normal ordering work to distinguish the parts? Something like highlights, or mild boldface, or italics?

Rintze · September 15, 2015

@fbennett, in my example above I chose to not show the actual two-field content in the menu. I think users can quite easily observe how the name formatting has changed after selecting the desired display format, and adding the information to the menu is IMHO just confusing, since the user now has to figure out what represents the data entry options and what represents the corresponding rendering options.

Instead, I would show just the family name by itself (`form="short"`), and separately, the full name (`form="long"` and `demote-non-dropping-particle="never"`) with `name-as-sort-order` active. Those two examples would very clearly show which name elements are recognized as particles, which particles are recognized as non-dropping and dropping, and how particles are capitalized.

Gracile · September 15, 2015

(I think the Particler option should not appear when there's no particle at all in the fields.)

fbennett · September 15, 2015

@Rintze: Removing the quotes, in other words. Got it, and that helps the clutter. Formatting apparently isn't possible in a simple XUL menu, so highlighting and whatnot (which would probably have made things worse anyway) is out. I'll take another shot and refresh the screencast.

@Gracile: I think we'll show the option disabled, but you're right that it shouldn't do anything if there is nothing to do.

Maybe "Particler" is a little too casual?

fbennett · September 15, 2015

@Rintze: Coming soon. (I think you meant `demote-non-dropping-particle="display-and-sort"` above.)

fbennett · September 15, 2015

I've revised, adopting the comments from Rintze and Gracile, and refreshed the screencast (same URL as above).

adamsmith · September 15, 2015

Really liking this new version (and sorry for not responding earlier to e-mail & proof of concept. Busy time).

Would love to have this in Zotero.

Rintze · September 15, 2015

I like the new version, and would love to see something like it in Zotero. The number of options increases rather quickly with two or more particles (made worse because of all the capitalization variants), but I guess we'd need them all. I noticed that "uit den" cannot be changed into dropping particles. I take it that this is because these are always non-dropping in your list (https://bitbucket.org/fbennett/citeproc-js/src/bb1ae92730be079210adc0e6b47b0bc50a06d7db/src/util_name_particles.js?at=default&fileviewer=file-view-default#util_name_particles.js-221)?

Do you think the menu would be easier to read if the options are alphabetically sorted?

> (I think you meant `demote-non-dropping-particle="display-and-sort"` above.)

I was on the fence (not demoting looks more natural to my Dutch eyes), but yeah, demoting the non-dropping particles provides more information since it shows the distinction between non-particles and non-dropping particles.

> Maybe "Particler" is a little too casual?

Similar to the "Transform Text" option in the title menu, I would go for an action description, e.g. "Adjust Particles".

fbennett · September 16, 2015

> I noticed that "uit den" cannot be changed into dropping particles. I take it that this is because these are always non-dropping in your list.

Yes, that's right; the idea is to restrict the options to those most likely to be meaningful. Is that a correct spec for "uit den" and the other pure Dutch particles?

We may be able to slim down the number of options presented with unspec'd particles - we should only present things likely to be possible, the user can edit manually for rare combinations. We should also put a ceiling on the number of options or the number of particles somehow, to prevent mischievous people from DoS'ing the UI.

fbennett · September 16, 2015

Alphabetizing the list could have unexpected effects, but we can classify it, with headings. I'll try that this evening and refresh the screencast.

Rintze · September 16, 2015

> Is that a correct spec for "uit den" and the other pure Dutch particles?

Yes, although that means that for ambiguous particles, the menu would have even more options, right?

fbennett · September 16, 2015

Ambiguous particles have more options, yes. I've refreshed the screencast with a view that divides the particles by type, with headings.

Rintze · September 16, 2015

I don't really understand those headers. E.g. "(Al-Pitkin) <> Al-Pitkin, Lemuel dos" is listed under "Fixed surname", but "dos" is a dropping particle here, right?

fbennett · September 16, 2015

If the option is selected, the field content becomes:

Al-Pitkin, Lemuel dos

So the headings describe what is happening with the surname (not sure if that's best, but that's what it's doing). It's actually reporting processor semantics, isn't it? I suppose this one should be dropping-particle.

Headings (maybe with some further refinement) seem like progress, but what do you think?

Rintze · September 16, 2015

Ah, okay. I get it now. Not sure it's clear enough, though.

How about forgoing headers, and only alphabetizing the list (is that really problematic?). That would change

(dos al-Pitkin) <> Pitkin, Lemuel dos al-
(dos Al-Pitkin) <> Al-Pitkin, Lemuel dos
(al-Pitkin) <> Pitkin, Lemuel dos al-
(Pitkin) <> Pitkin, Lemuel dos al-
(Dos al-Pitkin) <> Dos al-Pitkin, Lemuel
(Dos Al-Pitkin) <> Dos Al-Pitkin, Lemuel
(Al-Pitkin) <> Al-Pitkin, Lemuel dos
(dos al-Pitkin) <> dos al-Pitkin, Lemuel
(dos Al-Pitkin) <> dos Al-Pitkin, Lemuel
(al-Pitkin) <> al-Pitkin, Lemuel dos

to

(Al-Pitkin) <> Al-Pitkin, Lemuel dos
(al-Pitkin) <> al-Pitkin, Lemuel dos
(al-Pitkin) <> Pitkin, Lemuel dos al-
(dos Al-Pitkin) <> Al-Pitkin, Lemuel dos
(Dos al-Pitkin) <> Dos al-Pitkin, Lemuel
(Dos Al-Pitkin) <> Dos Al-Pitkin, Lemuel
(dos al-Pitkin) <> dos al-Pitkin, Lemuel
(dos Al-Pitkin) <> dos Al-Pitkin, Lemuel
(dos al-Pitkin) <> Pitkin, Lemuel dos al-
(Pitkin) <> Pitkin, Lemuel dos al-

The latter seems much more readable to me.
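
For what it's worth, that ordering falls out of a stable, case-insensitive sort on the left column with the right column as tie-breaker. A minimal sketch, assuming each menu entry is a hypothetical `[short, sorted]` pair and an engine with a stable `Array.prototype.sort` (ES2019 or later):

```javascript
// Menu entries in the original (unsorted) order, as [short, sorted] pairs.
const entries = [
  ["dos al-Pitkin", "Pitkin, Lemuel dos al-"],
  ["dos Al-Pitkin", "Al-Pitkin, Lemuel dos"],
  ["al-Pitkin", "Pitkin, Lemuel dos al-"],
  ["Pitkin", "Pitkin, Lemuel dos al-"],
  ["Dos al-Pitkin", "Dos al-Pitkin, Lemuel"],
  ["Dos Al-Pitkin", "Dos Al-Pitkin, Lemuel"],
  ["Al-Pitkin", "Al-Pitkin, Lemuel dos"],
  ["dos al-Pitkin", "dos al-Pitkin, Lemuel"],
  ["dos Al-Pitkin", "dos Al-Pitkin, Lemuel"],
  ["al-Pitkin", "al-Pitkin, Lemuel dos"],
];

function compareKeys(a, b) {
  return a < b ? -1 : a > b ? 1 : 0;
}

// Sort by lowercased short form, then by lowercased sorted form.
// Entries that tie on both keys keep their original relative order.
function sortMenu(list) {
  return [...list].sort(
    (x, y) =>
      compareKeys(x[0].toLowerCase(), y[0].toLowerCase()) ||
      compareKeys(x[1].toLowerCase(), y[1].toLowerCase())
  );
}

for (const [short, sorted] of sortMenu(entries)) {
  console.log(`(${short}) <> ${sorted}`);
}
```

Run on the ten entries above, this prints them in the same order as the second list.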

Alternatively, maybe you could offer dedicated particle capitalizing and particle type menus? That would reduce the options of the particle type menu to:

(al-Pitkin) <> al-Pitkin, Lemuel dos
(al-Pitkin) <> Pitkin, Lemuel dos al-
(dos al-Pitkin) <> dos al-Pitkin, Lemuel
(dos al-Pitkin) <> Pitkin, Lemuel dos al-
(Pitkin) <> Pitkin, Lemuel dos al-

(I see that the menu currently excludes the option that is identical to how the name is already formatted?)

The separate particle capitalizing menu would then also have very few options.

fbennett · September 16, 2015

I've fixed up the header logic, here's a refreshed screencast.

An undecorated list would be more readable in alphabetical order, but I think there may be value in the headers. They ease the user into the terminology that we use for the different forms, and help to tie the CSL documentation to what the user sees in the UI. When they are removed, the user is on their own to figure out what all those options mean.

(Editing for completeness)

The current form is not excluded from the list - it's just shown with the non-dropping particle demoted.

The list only gets big when the name contains unspecified particles (multiple terms in lower-case in a particle position). It seems like that would be uncommon (apart from typos). I'm not sure the added complexity in the UI would be worth it.

Gracile · September 16, 2015

Headers are useful in my opinion.

> I was on the fence (not demoting looks more natural to my Dutch eyes), but yeah, demoting the non-dropping particles provides more information since it shows the distinction between non-particles and non-dropping particles.

Same here.

Edit: Just to clarify the "columns" represent: in-text citation ⬄ bibliography , right?

fbennett · September 16, 2015

Yes, it's as Rintze described above: set as if in a citation with `form="short"` on the left, and as if in a bibliography with `form="long"`, `demote-non-dropping-particle="display-and-sort"`, and `name-as-sort-order="true"` on the right.
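
To make the two columns concrete, here is a rough sketch of how one menu line could be derived from already-parsed name parts. It is not the trial build's code: the function names are hypothetical, and it hard-codes the demoted part order (family, given, dropping particle, non-dropping particle) rather than calling a CSL processor.

```javascript
// Attach a particle to a family name; particles ending in an
// apostrophe or hyphen (e.g. "al-") join without a space.
function joinParticleToFamily(particle, family) {
  if (!particle) return family;
  return /[’'-]$/.test(particle) ? particle + family : particle + " " + family;
}

// Build one menu line: "(short form) <> demoted sort-order form".
function previewName(name) {
  const short = joinParticleToFamily(name["non-dropping-particle"], name.family);
  const tail = [name.given, name["dropping-particle"], name["non-dropping-particle"]]
    .filter(Boolean)
    .join(" ");
  const sorted = tail ? `${name.family}, ${tail}` : name.family;
  return `(${short}) <> ${sorted}`;
}

console.log(previewName({
  family: "Pitkin",
  given: "Lemuel",
  "dropping-particle": "dos",
  "non-dropping-particle": "al-",
}));
// → (al-Pitkin) <> Pitkin, Lemuel dos al-
```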

Gracile · September 17, 2015

Ok. The double arrow (⬄) is a little bit confusing, no?

fbennett · September 17, 2015

Could use a single right-arrow, or something else? Open to suggestions!

Rintze · September 17, 2015

Semicolon?

aurimas · September 17, 2015

I haven't seen Dan's take on this, but in my opinion, the list is too complicated to figure out and the discoverability of the list is too difficult as well. IMO the latter part should be addressed by displaying dropping and non-dropping particles in the Zotero pane with some special decoration and we can address that later (baby steps).

As for the list, we need to simplify it more. I think it's ok to be missing some obscure cases and have the users ask us how to enter those in, rather than make everyone confused with all the options. Since we're also concerned about improper capitalization, we can assume that any words in the name that match known particles should be lower-cased and proceed from there. Going with parsing a name entered as "Al-Pitkin, Lemuel Dos", I think we should offer the following transform options:


Pitkin, Lemuel (dos al-) // Treat all particles as dropping
Pitkin, Lemuel (dos) al- // Shift each of the particles (left-to-right) into non-dropping mode
Pitkin, Lemuel dos al-  //...
Al-Pitkin, Lemuel (dos) // transform each particle (left-to-right) into non-particle and repeat above
Al-Pitkin, Lemuel dos  //...
Dos Al-Pitkin, Lemuel  //...

Now, the obscure thing that remains is that (...) is a dropping particle. Maybe we can figure out a better way to display this.

Based on the previous discussions, capitalized particles are treated as non-particles and lower-case as non-dropping, so I don't think there's much sense in displaying the transforms that break this rule.

Thoughts?

Edit: maybe add an additional comma between given name/dropping particle and the non-dropping particle (and underline/bold/something it?). We can then drop the parentheses.

Pitkin, Lemuel dos al- // Treat all particles as dropping
Pitkin, Lemuel dos, _al-_ // Shift each of the particles (left-to-right) into non-dropping mode
Pitkin, Lemuel, _dos al-_  //...
Al-Pitkin, Lemuel dos // transform each particle (left-to-right) into non-particle and repeat above
Al-Pitkin, Lemuel, _dos_  //...
Dos Al-Pitkin, Lemuel  //...
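
A sketch of how that reduced list could be enumerated. It assumes the particles have already been recognized, lower-cased, and split off in order (`["dos", "al-"]`), which is the hard part and is not shown. All function names are hypothetical.

```javascript
function capitalize(word) {
  return word.charAt(0).toUpperCase() + word.slice(1);
}

// Fold particles into the family name, right to left; particles ending
// in an apostrophe or hyphen attach without a space.
function absorbIntoFamily(particles, family) {
  return particles.reduceRight(
    (acc, p) => (/[’'-]$/.test(p) ? p + acc : p + " " + acc),
    family
  );
}

// Enumerate the options: first all particles dropping, then the
// particles nearest the family name become non-dropping one by one;
// then the nearest particle is capitalized into the family name and
// the process repeats with the remaining particles.
function transformOptions(given, particles, family) {
  const options = [];
  const n = particles.length;
  for (let absorbed = 0; absorbed <= n; absorbed++) {
    const fullFamily = absorbIntoFamily(
      particles.slice(n - absorbed).map(capitalize),
      family
    );
    const rest = particles.slice(0, n - absorbed);
    for (let nonDropping = 0; nonDropping <= rest.length; nonDropping++) {
      const split = rest.length - nonDropping;
      options.push({
        family: fullFamily,
        given,
        "dropping-particle": rest.slice(0, split).join(" "),
        "non-dropping-particle": rest.slice(split).join(" "),
      });
    }
  }
  return options;
}

// Render in the comma/underscore notation used above.
function renderOption(o) {
  const dp = o["dropping-particle"] ? " " + o["dropping-particle"] : "";
  const ndp = o["non-dropping-particle"] ? `, _${o["non-dropping-particle"]}_` : "";
  return `${o.family}, ${o.given}${dp}${ndp}`;
}

transformOptions("Lemuel", ["dos", "al-"], "Pitkin")
  .forEach((o) => console.log(renderOption(o)));
```

For two particles this yields six options, in the same order as the list above; the count grows as (n + 1)(n + 2) / 2 for n particles.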

Rintze · September 17, 2015

Yeah, I agree that there are too many options currently, and that the options are difficult to understand.

> Based on the previous discussions, capitalized particles are treated as non-particles and lower-case as non-dropping, so I don't think there's much sense in displaying the transforms that break this rule.

The problem here though is that particles are often miscapitalized. The two most common changes I have to make (since I'm mostly dealing with Dutch non-dropping particles) are lowercasing particles (e.g. http://www.ncbi.nlm.nih.gov/pmc/articles/PMC1144764/ and https://patentscope.wipo.int/search/en/detail.jsf?docId=WO2013081456 incorrectly import with uppercased particles "Van" and "De"), and changing dropping particles into non-dropping ones (e.g. http://femsre.oxfordjournals.org/content/29/3/477 and http://www.jbc.org/content/271/46/28953 incorrectly save with "van den", "de", and "van" as dropping particles).