Parsing problem on Italian names
Hi to all
After the problem in Dutch, it seems that the new parsing module generated problems with Italian names:
« d' » is dropping if found in the first-name field ; non-dropping if found in the last-name field.
« da » is always non-dropping.
« de » is dropping if found in the first-name field ; non-dropping if found in the last-name field.
« de' » is always dropping.
« degli » is always dropping.
« dei » is always dropping.
« della » is always dropping.
« dello » is always dropping.
In fact all of them are always non-dropping particles in Italian.
[dell' is missing from Gracile's list; dall' is correctly non-dropping]
How could it be fixed?
Thanks in advance
Is it right that dell' and dall' should both be non-dropping also?
As a side-note, the processor patch plugin will need to be signed by Mozilla for every release from September 22nd. I'll try to get that in place, but it will slow down response times on processor issues.
Dell' and Dall' are always non-dropping
Thanks
“d’” and “de’” on the other hand are dropping particles. (Note that a space needs to be inserted after “de’” when rendering; Zotero currently does not do this.)
Note that these examples, too, confirm the rule that only lowercase strings are CSL particles.
If not, “D’”, “Da”, “Della” are not particles in the CSL sense, but just fixed parts of family names.
Progress has been made since they were written. To avoid more confusion, the examples chosen should be adjusted to remove the "particles" which are *always* fixed parts of the family name ("la"/"La" in "Jean de la Fontaine") even if they're particles in the common sense. At some point, the (future) documentation on the Zotero side should make that clear.
(Please correct me if I'm wrong)
van der Merwe, Wikus
Van der Merwe, Wikus
Van Der Merwe, Wikus
"van der Merwe", Wikus
der Merwe, Wikus van
Der Merwe, Wikus van
"der Merwe", Wikus van
Merwe, Wikus van der
van der Merwe, Wikus
A tweaked version of the particles parser can be used to identify uppercase particle candidates, and to apply a color-coded highlight to the field to show whether it conforms to common conventions.
Comments welcome.
(Looking at this, it might be better to generate the full range of permutations [including some not listed above] and present them in a [color-coded] menu list called by Alt-p.)
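As a rough illustration of how such a menu could be populated, the permutations listed above can be generated mechanically. This is only a sketch: `particlePermutations` and `capitalize` are hypothetical names, not functions from Zotero or citeproc-js, and the sketch assumes the candidate particle words have already been separated from the root family name.

```javascript
// Hypothetical helper, for illustration only.
function capitalize(word) {
  return word.charAt(0).toUpperCase() + word.slice(1);
}

// particleWords: lowercase particle candidates, e.g. ["van", "der"]
// root: the root family name, e.g. "Merwe"
// given: the given name, e.g. "Wikus"
function particlePermutations(particleWords, root, given) {
  const entries = [];
  // k = number of particle words moved to the end of the given-name field
  for (let k = 0; k <= particleWords.length; k++) {
    const toGiven = particleWords.slice(0, k);
    const toFamily = particleWords.slice(k);
    const givenField = [given, ...toGiven].join(" ");
    if (toFamily.length === 0) {
      entries.push(`${root}, ${givenField}`);
      continue;
    }
    // All-lowercase form: leading words are read as a non-dropping particle
    entries.push(`${[...toFamily, root].join(" ")}, ${givenField}`);
    // Capitalize the first j words: these become fixed parts of the family name
    for (let j = 1; j <= toFamily.length; j++) {
      const mixed = toFamily.map((w, i) => (i < j ? capitalize(w) : w));
      entries.push(`${[...mixed, root].join(" ")}, ${givenField}`);
    }
    // Quoted form: lowercase words protected from particle parsing
    entries.push(`"${[...toFamily, root].join(" ")}", ${givenField}`);
  }
  return entries;
}

for (const entry of particlePermutations(["van", "der"], "Merwe", "Wikus")) {
  console.log(entry);
}
```

For this input the sketch yields the eight distinct forms from the list above, in the same order; a real menu would presumably add the color coding and ranking on top.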
@fbennett: Ok, now I see a little clearer what you meant by “We'll need support for particle adjustments in the Zotero UI before changing the processor.” Looks interesting though I probably would rarely use it myself since it’s nothing that couldn’t be entered/adjusted manually, but it might help other users, and if you see it as a prerequisite for changing Zotero’s and citeproc-js’s parsing rules, then by all means go ahead.
My bigger concern, as you won’t be surprised to hear, is to get the parsing itself right. I take it that there is some kind of consensus to ultimately adopt the simple rule “unless protected, lowercase words at beginning of family are non-dropping, lowercase words at end of given are dropping” – correct me if I’m wrong here.
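To make sure we mean the same thing by the “simple” rule, here is a minimal sketch of it. It is not the citeproc-js or Zotero implementation: `splitParticles` and `isLowercaseWord` are hypothetical names, "protected" is assumed to mean a family name wrapped in double quotes, and only space-separated words are handled (particles joined to the name by an apostrophe or hyphen are left alone).

```javascript
// Hypothetical helper: true for words that contain letters and are entirely lowercase.
function isLowercaseWord(word) {
  return word === word.toLowerCase() && word !== word.toUpperCase();
}

// Illustration of the "simple" rule only; not citeproc-js code.
function splitParticles(family, given) {
  const result = {
    family: family.trim(),
    given: given.trim(),
    "non-dropping-particle": "",
    "dropping-particle": "",
  };

  // Assumption: a family name in double quotes is protected from parsing.
  const quoted = result.family.match(/^"(.*)"$/);
  if (quoted) {
    result.family = quoted[1];
  } else {
    // Lowercase words at the beginning of the family field are non-dropping.
    const words = result.family.split(/\s+/);
    const particle = [];
    while (words.length > 1 && isLowercaseWord(words[0])) {
      particle.push(words.shift());
    }
    result["non-dropping-particle"] = particle.join(" ");
    result.family = words.join(" ");
  }

  // Lowercase words at the end of the given field are dropping.
  const givenWords = result.given.split(/\s+/);
  const dropping = [];
  while (givenWords.length > 1 && isLowercaseWord(givenWords[givenWords.length - 1])) {
    dropping.unshift(givenWords.pop());
  }
  result["dropping-particle"] = dropping.join(" ");
  result.given = givenWords.join(" ");

  return result;
}

console.log(splitParticles("van der Merwe", "Wikus"));
console.log(splitParticles("Beethoven", "Ludwig van"));
console.log(splitParticles('"de Gaulle"', "Charles"));
console.log(splitParticles("Della Corte", "Matteo"));
```

Note that no particle list is involved: capitalization alone decides, so “Van der Merwe” and “Della Corte” stay whole as family names.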
Now, depending on how soon this is going to happen, the question is, does it still make sense to fix the flaws and omissions in the current parsing list(s) of both Zotero and citeproc-js?
For citeproc-js, https://bitbucket.org/fbennett/citeproc-js/src/553e934fdbf001ed6f70f47699dbf56151459e84/src/util_name_particles.js?at=default still lists “te”, “ten”, ”ter” as dropping though they are clearly Dutch and non-dropping; “mac” quite clearly isn’t a particle at all, and so on and so forth. Also, parsing is still case-insensitive, making, e.g., the “Van/van” distinction impossible. And of course around 150 out of the 166 unique entries from the Dutch “333” list are missing.
As to Zotero, I’m not sure where to look for the source – pointers welcome! Actual behaviour shows that “te” etc. are treated as dropping, and parsing is still case-insensitive, too.
So, again, should we bother with fixing this, or rather adopt the simple rule as soon as possible?
[edit: thanks Frank]
Now, would you accept pull requests for https://bitbucket.org/fbennett/citeproc-js/src/tip/src/util_name_particles.js? If so:
Would you mind if I did the following, apart from adding and fixing stuff?
- sort the list alphabetically by particle again
- add comments like
["van", dropping_alt_non_dropping_1], // Dutch non-dropping, German dropping, also non-particle
- possibly even indicate sources, like
["van", dropping_alt_non_dropping_1], // Dutch non-dropping [1], German dropping, also non-particle
- comment out the highly dubious “mac”, “pietro”, “saint”, “sainte”, “sen”, “st.”, “ste.” (which wouldn’t usually appear in lower case anyway)...
// [1]: http://www.vernoeming.nl/alle-333-voorvoegsels-tussenvoegsels-in-nederlandse-achternamen
Adopting the simple rule also makes a lot of sense to me with regards to discoverability. I think it's much more intuitive to have particle recognition be controlled by case than by the presence or absence of quotes (see e.g. the use case at https://forums.zotero.org/discussion/20926/double-surnames-alphabetical-order-in-bibliography-solved/?Focus=120531#Comment_120531).
I can fix the sorting tomorrow if it's a headache; I have a script for it that just needs a tweak to the keys.
If there are issues about specific changes, we can discuss in the comments.
I agree that mac etc. should go, and that simplifying the parser, shifting the explicit particle identification and classification into the UI, will be an improvement. I talked with European colleagues about particles this afternoon. Some issues came up, but it's late here, so I'll sign off and pick up tomorrow.
The parser in citeproc-js is still using list-based parsing, but we're all agreed concerning capitalization. The adjustments for listed-particle case-sensitivity (to limit recognition to lowercase particles in most cases) were completed in a commit that went in last week. The change is not yet reflected in Zotero, but it's in the pipeline, and will be joined by the changes flagged above.
We are also agreed that list-based parsing should be moved out of the processor. As nickbart has indicated, the double-quotes hack will still be required in that case, to express names such as "de Gaulle," in which the apparent (lowercase) particle is actually a fixed, sortable element of the surname.
The only point on which there seems to be a difference of opinion is the importance of UI support for manipulating particles. I feel it is important for a couple of reasons. First, it will reduce the risk of RSI among users who curate large volumes of data (and some do). Second, a UI based on our distilled knowledge of how name-particles work can help users unfamiliar with the arcana to repair names that come through translation badly.
The heuristics for UI support will be based on the explicit list, which is why I'm keen to get it into shape, so that the switch to simplified parsing and UI support can be cast in one go.
As I wrote in the other thread on this, it would indeed be possible to jump to simplified parsing directly. That's not how I would do things myself, so I won't, but since Zotero now pre-parses names before sending them to the processor, the change can be introduced by Zotero if it's desired.
(Edit: underlined text added, struck-out text removed.)
"mac"
"pietro"
"saint"
"sainte"
"st."
"ste."
The Dutch particles linked by Rintze have been added, and duplicates resolved.
The list is alphabetized by particle.
- duplicates: "'t", "ben" and "bin"
- "van" should be "either" - since "van" in "Ludwig van Beethoven" is dropping
- "de'" should be "either" - since "de'" in "Lorenzo de' Medici" is dropping ("degli", "dei", "del", "dell'", "della", "delle" [to be added], "dello", "di", too; all for older Italian names)
EDIT:
- "either", too: "da", "de li" (both Italian)
- "in der" should be "either": in "Hanns in der Gand" it is dropping.
See my comments in:
https://forums.zotero.org/discussion/51491/double-surnames-starting-with-te-end-up-quoted-incorrect/#Item_30
and in:
https://forums.zotero.org/discussion/30974/2/any-idea-why-an-a-author-comes-last-in-the-bibliography/#Item_35
Also, the Dutch particle list seems to have a lot of foreign particles in it (e.g. all the "Auf*" and "Aus*" ones are purely German). I'm not sure if those should all be treated as potentially non-dropping.
@Rintze, I don't think partnering it with swap-names will work, since it should be accessible from the keyboard, but it should be a menu, for sure.
(Edit: But the field can be opened for editing from a partner to swap-names, of course, and that does make it more discoverable. That suggestion is a keeper.)
- Zotero parsing two-field authors into given name, dropping particle, non-dropping particle, family name.
  - The data should be entered in Zotero as "non-dropping particle Family Name", "Given Name dropping particle".
  - Given correct data entry, the dropping particle has no significance for CSL or Zotero, so we can ignore parsing that part altogether.
  - The non-dropping particle is always lower-cased, can be composed of multiple words, may be joined with family name by punctuation, and always precedes the family name.
- Allowing users to "easily" edit incorrectly imported names
  - Current suggestion is for some right-click, button, or keyboard shortcut solution that cycles through possible permutations
  - User must know what he's looking for anyway, so I don't really see why that is easier than just correcting the case manually. Does this feature even need to exist?
  - Edit: I guess the issue is in large part for discoverability purposes, which I agree is lacking. I think that maybe we can just italicize the non-dropping particle.
- Correctly parsing/splitting names on import
  - We cannot rely on proper capitalization
  - Even with lower-cased particles, we may need to know whether the particle is dropping or non-dropping
  - With upper-cased particles, we may be able to guess correctly which particles are particles and whether they are dropping or non-dropping
  - Use case first. If all parts of name are in title case, apply heuristic parsing based on most likely dropping/non-dropping particle in the name. I think we need to be 99% confident before converting a title-cased particle to lower case
  - Need a list of high-confidence dropping and non-dropping particles.
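A rough sketch of what the import-time heuristic in the third point might look like. Everything here is hypothetical: `normalizeImportedName` and `isTitleCase` are made-up names, the 0.99 threshold just mirrors the "99% confident" figure above, and the table entries and their confidence values are placeholders, not real data.

```javascript
// Hypothetical helper: first character uppercase, remainder lowercase.
function isTitleCase(word) {
  const head = word.charAt(0);
  const tail = word.slice(1);
  return head !== head.toLowerCase() && tail === tail.toLowerCase();
}

// table: Map from lowercase particle to { type, confidence } (assumed shape).
function normalizeImportedName(family, given, table, threshold = 0.99) {
  const familyWords = family.trim().split(/\s+/);
  const givenWords = given.trim().split(/\s+/).filter(Boolean);

  // Use case first: only guess when every part of the name is title-cased.
  if (![...familyWords, ...givenWords].every(isTitleCase)) {
    return { family, given };
  }

  // Try the longest candidate first, always leaving one word as the root name.
  for (let n = familyWords.length - 1; n >= 1; n--) {
    const candidate = familyWords.slice(0, n).join(" ").toLowerCase();
    const entry = table.get(candidate);
    if (!entry || entry.confidence < threshold) continue;
    const root = familyWords.slice(n).join(" ");
    if (entry.type === "non-dropping") {
      // Lowercase the particle and keep it in front of the family name.
      return { family: `${candidate} ${root}`, given };
    }
    // Dropping: move the lowercased particle to the end of the given field.
    return { family: root, given: `${given} ${candidate}`.trim() };
  }
  return { family, given };
}

// Placeholder table; the confidence values are invented for the example.
const exampleTable = new Map([
  ["van der", { type: "non-dropping", confidence: 0.995 }],
  ["della", { type: "non-dropping", confidence: 0.5 }],
]);

console.log(normalizeImportedName("Van Der Merwe", "Wikus", exampleTable));
console.log(normalizeImportedName("Della Corte", "Matteo", exampleTable));
```

With these placeholder values the first name has its particle lowercased, while the second is left untouched because its entry falls below the threshold.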
We can implement the first point quite easily. I don't think we need #2, and #3 should go into ZU.cleanAuthor, but we need a very good list of particles.
Yes, where publisher metadata is awry, entries will need to be fixed manually after download, and the user will need domain-specific knowledge when choosing among alternatives - we all agree that the processor can't automate the choices, and that it shouldn't try to do so.
To support manual editing, the aim will be to present a limited set of alternatives in a menu, with the more-likely choices highlighted or preferenced in some way. That way novice users will have some guidance on the possibilities, and much mousing and typing will be saved in aggregate. It's probably the best we can do.
(Edit: by "discoverable," Rintze and I were talking about the discoverability of the particle-editing support menu, not of the particles themselves.)
If anyone has further comments on the particles, do post.
...to support manual editing, the aim will be to present a limited set of alternatives...novice users will have some guidance...
Now I understand. This is likely to be a boon to everyone -- novice and experienced alike. Thank you for the explanation.
I'll carry forward with implementing the UI-support idea, although it may be unlikely to make it into Zotero in the short term.
Zotero’s export seems to work ok for these elements, just as for the non-dropping particles, with the latest list-based parsing – provided that a particle is on the list, of course.
(My worry is that the list will never be complete and will need constant adjustment, that’s why I’m so much in favour of the “simple” rule …)
@nickbart: yes, a test made this morning makes Matteo Della Corte appear as Corte, Matteo Della in (e.g.) Chicago 16th
Following your quote, "particles in Italian names are most often uppercased and retained when the last name is used alone". Hence the predominant usage would be non-dropping, at least for 19th- and 20th-century people.
I have never come across Lorenzo de' Medici's writings in a bibliography, so I have no idea of the customary usage when citing them (or works by any aristocratic Italian).
@fbennett: [possibly a stupid question...] how exactly will "var either_1 = [[null, [0,1]],[[0,1],null]" work? What will trigger the shift from non-dropping to dropping?
- I trust “Della Corte, Matteo” is what you expected?
- If you get “Corte, Matteo Della” if the family/given fields contain [Della Corte][Matteo], I would guess you are not using the latest processor version, so try https://github.com/Juris-M/propachi-vanilla/releases/latest.
- Since Italian “Della” and friends, and in fact all uppercase elements, as far as I can tell, but correct me if I’m wrong, are never detached from the root family name (“Corte”), uppercase “Della” and friends aren’t non-dropping particles but fixed parts of the family name (unlike genuine so-called “non-dropping” particles, e.g., Dutch ones, which are detached in some circumstances).
- yes
- indeed, processor installed, it works
- you're right, they are not proper particles but part of the family name (only exception in my first list being de' Medici which is a lowercase particle)
Thanks!