Parsing problem on Italian names

  • The problem here though is that particles are often miscapitalized.
    I think the solution above (where all recognizable particles are treated as lower-cased) would solve the issue you're describing. Note that the name is entered with capitalized particles.
  • The parse shown in that sample is more aggressive than the current code, which would show only these options for Al-Pitkin, Lemuel Dos:
    Al-Pitkin, Lemuel (dos)
    Al-Pitkin, Lemuel dos
    Dos Al-Pitkin, Lemuel

    The parsing logic currently runs as follows:
    1. Tokenize surname-leading and givenname-trailing lowercase elements as a (possibly empty) "core particles" array.
    2. Tokenize remaining surname-leading and givenname-trailing elements as "extended last" and "extended first" arrays.
    3. Find the longest and frontmost match to the list that includes the core particles.
    No match will be found for "dos al-", but "dos" will match, as a particle known to be either dropping or non-dropping, so both options are presented.

    For known particles, I think the uppercase variant should be retained, since Americanized names often capitalize the European particles—otherwise users might take the UI to suggest that lower-casing is the only correct form.

    For particles not in the list (like the archaic Welsh "ferch"), I agree that the uppercase variants should be omitted from the options, since the only basis for treating them as particles at all is the fact that they have been intentionally lowercased in the data.
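The three parsing steps listed above can be sketched roughly as follows. This is a minimal illustration and not the actual draft code: the `PARTICLES` table, the `tokenize` rule (splitting after hyphens) and the reading of "frontmost" as "starting earliest in the given name" are all assumptions made for the example.

```python
import re

# Hypothetical excerpt of a particle list; the real list and its format
# are not shown in this thread.
PARTICLES = {
    "dos": {"dropping", "non-dropping"},
    "van de": {"non-dropping"},
    "van den": {"non-dropping"},
}

def tokenize(field):
    # Split on whitespace and after hyphens: "Al-Pitkin" -> ["Al-", "Pitkin"]
    return re.findall(r"[^\s\-]+-?", field)

def find_particle_match(family, given):
    fam, giv = tokenize(family), tokenize(given)
    # Step 1: surname-leading and givenname-trailing lowercase elements
    # form the (possibly empty) core particles array.
    core_last = []
    while len(fam) > 1 and fam[0][:1].islower():
        core_last.append(fam.pop(0))
    core_first = []
    while giv and giv[-1][:1].islower():
        core_first.insert(0, giv.pop())
    core = core_first + core_last
    # Step 2: what remains is "extended first" (giv) and "extended last" (fam).
    # Step 3: try every run that contains the whole core, longest first,
    # and among equal lengths the one that starts earliest.
    candidates = []
    for i in range(len(giv) + 1):        # elements taken from the end of giv
        for j in range(len(fam)):        # elements taken from the start of fam
            run = giv[len(giv) - i:] + core + fam[:j]
            if run:
                candidates.append((-len(run), -i, run))
    for _, _, run in sorted(candidates, key=lambda c: c[:2]):
        key = " ".join(run).lower()
        if key in PARTICLES:
            return key, PARTICLES[key]
    return None

print(find_particle_match("Al-Pitkin", "Lemuel Dos"))
```

With this toy list, "dos al-" finds no match but "dos" does, and because "dos" is marked as both dropping and non-dropping, a menu built on the result would offer both placements.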
  • edited September 17, 2015
    The parse shown in that sample is more aggressive than the current code
    Is it too aggressive though? I don't think so.
    For known particles, I think the uppercase variant should be retained, since Americanized names often capitalize the European particles—otherwise users might take the UI to suggest that lower-casing is the only correct form.
    They are in the sample above. "Al-Pitkin, Lemuel dos/(dos)" and "Dos Al-Pitkin, Lemuel" re-capitalize them and treat them as non-particles, which is what this thread concluded some time ago.
    For particles not in the list (like the archaic Welsh "ferch"), I agree that the uppercase variants should be omitted from the options, since the only basis for treating them as particles at all is the fact that they have been intentionally lowercased in the data.
    I guess we can forgo capitalizing particles that were initially lower-cased.
  • edited September 17, 2015
    Is it too aggressive though? I don't think so.
    It depends on what information you want to derive from the match. The current code limits the number of options presented to the known characteristics of the matched particles. For example "van den" is presented only as a non-dropping particle, because that is its only correct form. If multiple matches to the list are treated as particle candidates, making use of those constraints is harder, and if the constraints are ignored, you end up with a larger list of options.

    On capitalization, with the logic I described above, the menu would be idempotent, which would give the UI support a solid feel. Specified particle sets would always present their upper-case variants, and the upper-case variants would in turn be recognized. Unspecified particle sets would show only lower-case options, so that they would reparse identically after any selection is made from the menu.
  • edited September 17, 2015
    I think we agree on capitalization then, given my last remark "we can forgo capitalizing particles that were initially lower-cased", no? (Edit: maybe no. Are you suggesting that we recognize particles as such, but keep them in upper case if that's how they were entered?)

    IIRC, the "capitalized particles are not particles" logic is for cases where the name has been "Americanized". Would it be uncommon to "Americanize" only a subset of particles (e.g. Lemuel dos Al-Pitkin vs Lemuel Dos Al-Pitkin)? In that case, I agree that we should only offer all-or-none capitalization of the particles.
  • I think we agree on capitalization then, given my last remark "we can forgo capitalizing particles that were initially lower-cased", no?
    Yes, but only for the unrecognized variety. For recognized particles, capitalized options are needed to make the menu idempotent.

    My initial thought was that we tend to capitalize only the first of multiple particles, but I found a nice list of Dutch Americans (thank you, Wikipedia) and I was shocked (shocked, I tell you) to discover that the only name that conformed to my expectation was that of Robert J. Van de Graaff.

    So I think you're right: all-or-nothing for capitalization of known particles will cover most cases, and we can reduce clutter by making it the only capitalizing option shown in the list.
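The capitalization rule the thread converges on here (recognized sets get an all-or-nothing capitalized option, unrecognized sets stay lowercase only) can be checked for idempotence with a small sketch. `KNOWN_SETS`, `case_variants` and `is_idempotent` are hypothetical names invented for this illustration, not part of any existing code.

```python
# Hypothetical excerpt of the list of known ("specified") particle sets.
KNOWN_SETS = {"van de", "von und zu", "dos"}

def case_variants(particle_set):
    """Return the capitalization options a menu would offer for a particle set."""
    lower = particle_set.lower()
    if lower in KNOWN_SETS:
        # Recognized set: offer lowercase plus all-or-nothing capitalization.
        capitalized = " ".join(word.capitalize() for word in lower.split())
        return [lower, capitalized]
    if particle_set == lower:
        # Unrecognized set: a particle only because it was entered in
        # lowercase, so no capitalized variant is offered.
        return [lower]
    # Unrecognized and capitalized: not treated as a particle at all.
    return []

def is_idempotent(particle_set):
    """True if picking any offered variant leads to the same menu again."""
    first = case_variants(particle_set)
    return all(case_variants(choice) == first for choice in first)

print(case_variants("Van De"), is_idempotent("van de"), is_idempotent("ferch"))
```

Under these assumptions, selecting either variant of a recognized set reproduces the same two options, and an unrecognized set always reparses to its single lowercase form.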
  • I think we agree on capitalization then, given my last remark "we can forgo capitalizing particles that were initially lower-cased", no?

    Yes, but only for the unrecognized variety. For recognized particles, capitalized options are needed to make the menu idempotent.
    But that doesn't make sense with what we agreed upon earlier. All capitalized particles are treated as being part of the last name, so if we suggest that "Van" may be a non-dropping particle, we should also lower-case it, since the assumption then is that it was improperly capitalized.
    My initial thought was that we tend to capitalize only the first of multiple particles, but I found a nice list of Dutch Americans (thank you, Wikipedia) and I was shocked (shocked, I tell you) to discover that the only name that conformed to my expectation was that of Robert J. Van de Graaff.
    Looking through the list, I see "James Van Der Beek", "John Van de Kamp", "Robert J. Van de Graaff", "Rex Van de Kamp", "William Van Den Broeck", so it seems a bit random. My question, though, is whether "John Van de Kamp" would be sorted under V or K.
  • edited September 18, 2015
    But that doesn't make sense with what we agreed upon earlier.
    It doesn't conflict. The point is only that if the user is given the option to lowercase a particle from the menu, they should also be given the option to capitalize it again, since either form might be correct. [Edit: subject to the constraints on which capitalized combinations to offer that we're working on now, of course]

    (Edit: In the above paragraph, "recognize" -> "offer")

    I missed the Van de Kamp's (distracted by my little joke). Americanized Van de Kamp would sort under "V". Our working assumption is that any leading uppercase element is part of the surname proper, and should be included in the sort.
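The working assumption stated above (a leading uppercase element belongs to the surname and takes part in the sort) could be expressed as a sort key along these lines. The function name is hypothetical, and skipping lowercase leading particles when sorting is itself an assumption here, since in practice that depends on the citation style.

```python
def sort_key(family):
    """Sort key that starts at the first capitalized element of the family name."""
    tokens = family.split()
    # Leading lowercase elements are treated as particles and skipped;
    # everything from the first uppercase element onward is kept.
    while len(tokens) > 1 and tokens[0][:1].islower():
        tokens.pop(0)
    return " ".join(tokens).casefold()

names = ["van de Kamp", "Van de Kamp", "Miller"]
print(sorted(names, key=sort_key))
```

In this sketch the Americanized "Van de Kamp" files under "V", while the lowercase "van de Kamp" files under "K".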
  • I feel like I'm missing what you're trying to suggest. Could you provide an example?
  • Sure thing. Suppose we have the input "de Kamp, John Van", and our correct field content is "Van de Kamp, John". The parser finds "van de" as a matching particle that is always non-dropping, so we offer these options (using your display syntax):
    Kamp, John van de
    Van de Kamp, John // or maybe Van De Kamp, or both
    van de Kamp, John

    If the user selects the first option, and we forgo capitalizing particles that are in lower-case, then reopening the menu will show these options only:
    Kamp, John van de
    van de Kamp, John
    I think that would be confusing: the user's initial selection should not alter the options presented when the menu is reopened.
  • OK, so then we treat particles we know about as potentially upper or lower case variants (as you suggested initially, I believe). Starting with "ferch de Kamp, John Van" (for illustration purposes), the list would be
    Kamp, John (van ferch de) // All dropping
    Kamp, John (van ferch) de //Shifting into non-dropping
    Kamp, John (van) ferch de //...
    Kamp, John van ferch de //...
    De Kamp, John (van ferch) //Converting to non-particles. Forcing upper case
    De Kamp, John (van) ferch //...
    De Kamp, John van ferch //...
    // Now we skip ferch as possible capitalized form, since we don't know about it
    Van ferch de Kamp, John // Only capitalizing first particle and leaving others as they were

    I think that list covers all possibilities. Now if the user selects option 7, for instance ("De Kamp, John van ferch"), then the new list would be
    Kamp, John (van ferch de) // All dropping
    Kamp, John (van ferch) de //Shifting into non-dropping
    Kamp, John (van) ferch de //...
    Kamp, John van ferch de //...
    De Kamp, John (van ferch) //Converting to non-particles
    De Kamp, John (van) ferch //...
    De Kamp, John van ferch //...
    Van ferch De Kamp, John // <-- This one is different

    The list is still not idempotent, but the change only affects display and not sorting behavior. Is this what we were after?
  • The lowercase elements "ferch de" are mandatory, so the matches attempted with that input would be:
    van ferch de // then ...
    ferch de
    Both match attempts would fail, so you would be left with the unspecified particle set "ferch de" only, and the options would be:
    Kamp, John Van (ferch de)
    Kamp, John Van (ferch) de
    Kamp, John Van ferch de
    de Kamp, John Van (ferch)
    de Kamp, John Van ferch
    ferch de Kamp, John Van
    In this case, the menu would be idempotent. For "Franckenstein, Georg Freiherr Von Und Zu", we would have these:
    Franckenstein, Georg Freiherr (von und zu)
    Franckenstein, Georg Freiherr von und zu
    von und zu Franckenstein, Georg Freiherr // lowercase-but-quoted
    Von Und Zu Franckenstein, Georg Freiherr
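One way to read the six options listed for the unspecified set "ferch de" is as every split of the particle run into three contiguous groups: dropping (shown in parentheses), non-dropping (shown bare after the given name), and attached to the surname. The sketch below enumerates those splits. The function name and the ordering rule are assumptions inferred from the list in the post, not taken from the draft code.

```python
def particle_options(family, given, particles):
    """Enumerate menu strings for an unspecified (all-lowercase) particle run."""
    n = len(particles)
    options = []
    for attached in range(n + 1):               # particles folded into the surname
        free = n - attached
        for dropping in range(free, -1, -1):    # most dropping particles first
            drop = particles[:dropping]
            non_drop = particles[dropping:free]
            lead = particles[free:]
            surname = " ".join(lead + [family])
            tail = [given]
            if drop:
                tail.append("(" + " ".join(drop) + ")")
            tail.extend(non_drop)
            options.append(surname + ", " + " ".join(tail))
    return options

for line in particle_options("Kamp", "John Van", ["ferch", "de"]):
    print(line)
```

Since every option keeps the same lowercase particles in the same order, reparsing any of them would yield the same core set and therefore the same six lines, which is the idempotence described in the post.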
  • edited September 17, 2015
    The lowercase elements "ferch de" are mandatory
    Well, in the current parser. The discussion above is not really taking into account any current implementations, it's trying to figure out what _should_ be the case.
    For "Franckenstein, Georg Freiherr Von Und Zu", we would have these:
    Franckenstein, Georg Freiherr (von und zu)
    Franckenstein, Georg Freiherr von und zu
    von und zu Franckenstein, Georg Freiherr // lowercase-but-quoted
    Von Und Zu Franckenstein, Georg Freiherr
    "lowercase-but-quoted" option should never be offered, because AFAICT, this is either an edge case or never actually exists. We could implement some sort of special behavior for "joining" words like "und" that would force treatment of adjoining particles as a single particle, but otherwise, the suggested options would be
    Franckenstein, Georg Freiherr (von und zu)
    Franckenstein, Georg Freiherr (von und) zu
    Franckenstein, Georg Freiherr (von) und zu
    Franckenstein, Georg Freiherr von und zu
    Zu Franckenstein, Georg Freiherr (von und)
    Zu Franckenstein, Georg Freiherr (von) und
    Zu Franckenstein, Georg Freiherr von und
    Von Und Zu Franckenstein, Georg Freiherr // Because "und" is not a "recognized" particle


    Edit: actually, if "und" were not recognized as a particle on its own, the "Franckenstein, Georg Freiherr Von Und Zu" starting entry would only recognize "Zu" as a particle. If "von und zu" were a separate entry in the particle list (I haven't checked, maybe it already is), then this would work, and then fewer options would be offered.
    Franckenstein, Georg Freiherr (von und zu)
    Franckenstein, Georg Freiherr von und zu
    Von Und Zu Franckenstein, Georg Freiherr

    Can we exhaustively list all such joint particles?
  • Well, in the current parser. The discussion above is not really taking into account any current implementations, it's trying to figure out what _should_ be the case.
    Yes, the implementation is just a draft. But if you want to reduce the number of options presented to the user, the known characteristics of a particle set (grouping, and dropping/non-dropping allocation) are a good way of doing that.

    When adding a known particle to an unspecific set of core elements (like "Van" in the John Van ferch de Kamp example), I don't see how you can make use of that information. It's such a remote edge case that I think you can stop the parse with the core elements on that one.

    On lowercase-but-quoted, that's a real thing. "Charles de Gaulle" sorts under "d". Here's one source from the Web, but IIRC the Chicago Manual says the same thing.
  • (Yes, "von und zu" is in the list. We have very good coverage for Dutch, and pretty-good coverage of the other European domains.)
  • When adding a known particle to an unspecific set of core elements (like "Van" in the John Van ferch de Kamp example), I don't see how you can make use of that information
    I don't see how that changes anything, though I don't really understand what you meant by mandatory in "The lowercase elements 'ferch de' are mandatory..." above.
    if you want to reduce the number of options presented to the user, the known characteristics of a particle set (grouping, and dropping/non-dropping allocation) are a good way of doing that
    Yes, but the parser should work ok with the assumption that we know nothing about these properties for individual particles. We can use the properties to trim down the suggestions.
    On lowercase-but-quoted, that's a real thing. "Charles de Gaulle" sorts under "d". Here's one source from the Web, but IIRC the Chicago Manual says the same thing
    That's the exception, rather than the rule though (at least from what we concluded before), so it shouldn't be in the suggestions.
  • I don't see how that changes anything, though I don't really understand what you meant by mandatory in "The lowercase elements 'ferch de' are mandatory..." above.
    By that I meant that those lowercase elements in the example are (in the current draft) always included when attempting matches against the list. Adding further elements that do not combine with the core to form a match will yield more false positives. If that's desired, we can do that; but I don't think it's a good idea, for the reasons you and Rintze have raised (excessive number of options, user confusion).
    [lowercase-but-quoted is] the exception, rather than the rule though (at least from what we concluded before), so it shouldn't be in the suggestions
    I disagree; I would include it in the list, but place it at the bottom. The quoted syntax covers a well-defined if small category of cases (French particles on names of one syllable), and having it in the list makes it discoverable to users without recourse to documentation or the forums. But if there is a desire to mask it, we can do that.
  • If that's desired, we can do that; but I don't think it's a good idea, for the reasons you and Rintze have raised (excessive number of options, user confusion).
    Like you say, that's probably an edge case, so I don't think it makes a huge difference, but I think not combining these for a match would be less confusing.
    I disagree; I would include it in the list, but place it at the bottom.
    Only if this happens with very few particles. "de" only if not followed by other particles, for example. We shouldn't display that option for all particles.
  • edited September 18, 2015
    Like you say, that's probably an edge case, so I don't think it makes a huge difference, but I think not combining these for a match would be less confusing.
    While writing out a set of questions, I realized that I don't understand how you want the parsing and list creation to work. Maybe we're just talking past each other.
    Only if this happens with very few particles. "de" only if not followed by other particles, for example. We shouldn't display that option for all particles.
    Yeah, if the impact can be minimized, that would be good. If it is only ever an issue with "de", limiting it to that lone particle would be a good move.
  • edited September 18, 2015
    Please add a detailed explanation when you adopt a new display syntax here; these threads are already hard to follow :-)
    Based on the previous discussions, capitalized particles are treated as non-particles and lower-case as non-dropping
    Am I missing something? The italicized part is not true for me.
    "lowercase-but-quoted" option should never be offered, because AFAICT, this is either an edge case or never actually exists.
    I disagree for the reasons Frank gave. But the option could be limited to "de" as you suggest.
  • Lowercase non-particles do not seem to be that rare.
    CMoS, 16e, 8.5 and 16.71, lists the following:

    - Walter de la Mare; de la Mare
    - Paul de Man; de Man
    - Daphne du Maurier; du Maurier
    - Robert van Gulik; van Gulik
    - Wernher von Braun; von Braun

    - da Cunha, Euclides
    - de Gaulle, Charles
    - di Leonardo, Micaela
  • edited September 18, 2015
    As this potentially extremely useful particle tool moves closer to implementation, I echo the concern raised by Gracile and recommend that work on a draft of the documentation start now. The usefulness of the tool will be directly related to how well its purpose and operation can be explained. In my experience, trying to write instructions for a procedure sometimes reveals that the procedure itself needs to be adjusted a bit.

    This proposed tool is groundbreaking. Not only does it assist with proper alphabetization of author names, it helps keep the entered names uniform so that the styles can better implement name disambiguation. Most people don't take the time to think about name particle practices and fewer still worry about the conventions of other languages or places. That Zotero will offer assistance with the entry of names containing particles is truly a giant step in bibliography management, but that is only part of why this is wonderful. Equally if not more important is the ability of the tool and its documentation to enlighten the manuscript author about the complexity of the name particle issue.

    I've intentionally used what may seem like hyperbole in this post. Upon implementation of the tool I believe that few will think I've exaggerated the tool's impact. The tool's documentation will be key to success. I see a need for both a straightforward how-to covering the mechanics of the procedure and an optional deeper explanation of the bibliographic conventions of particle usage across the world.
  • edited September 19, 2015
    Thanks to everyone for the latest round of comments. Running the use cases posted by aurimas revealed bugs in the code, so double-thanks there. The screencast has been refreshed based on the latest iteration, and I've added a draft of documentation explaining what it's for and how it works to the Juris-M wiki.

    Points to note are:
    • The dingbat "komejirushi"-like separator mark, which I hope will render on all systems;
    • Idempotence of unspecified particle sets, and near-idempotence of specified sets
    • The constraint imposed on "no-particle" capitalized forms
    • Disabling of rollover on the headings
    (As the documentation draft says and the screencast shows, it's still not extending from the "core particles" unless there is a good match against the list for the entire extended particle set—so the "Van" in "Kamp, John Van ferch de" is not picked up as a particle. I've considered alternatives, but I think this is the best compromise between orderly and complete.)

    (Edit: in the original post, the idempotence description was backwards.)
  • edited September 21, 2015
    I've filed pull requests for simplified parsing, and for UI support.

    Given the number of changes in the offing, we can assume that the team are pretty busy with other issues, but we'll see where we land on the priorities list (and meanwhile, thanks to Aurimas for taking time to review here).
  • Thanks to everyone who's been working on this. Unfortunately, for Zotero itself we'll need a solution that doesn't corrupt the fields, as the double-quotes do. If that means adding additional creator fields, we can plan to do that along with the other schema/API changes, once API syncing is in place. But I don't want a proliferation of hacks on top of the plain-text data.
  • we'll need a solution that doesn't corrupt the fields, as the double-quotes do. If that means adding additional creator fields, we can plan to do that
    I'm not a fan of double quotes either, TBH, but I don't think we need separate DB fields for this. What if we use a non-breaking space to separate lower-case particles that are supposed to be part of the given name? This would be a solution for the data layer, not user entry.

    As far as user entry goes, I think we're approaching a point where we need to implement rich data fields, with support for italics, superscript, subscript, etc. that do not require the user to type out HTML tags. The rich fields would also handle inserting a non-breaking space (disguised as a "non-dropping particle" button, or something of the sort) and formatting the non-dropping particle in a way that is distinct from the given name (say, italics).

    Anyway, haven't explored the thought too much, but it could work out.
  • Well, if we think we're going to need HTML anyway, we could just use spans and classes, which would be much clearer. The non-breaking space still feels like a hack if we're really just using it as a signifier that's parsed out.

    The plan is to redo the item pane in HTML anyway, at which point there won't be much of a cost to using HTML in fields.
  • edited September 21, 2015
    The non-breaking space still feels like a hack if we're really just using it as a signifier that's parsed out
    I'm also thinking that it could work nicely when exported in various formats (RIS, maybe BibTeX, etc.). I'm not sure how SQLite collation algorithms treat NBSP, but it might work nicely in those cases as well (I'm guessing we're not that lucky).
  • You might also consider how you want the sorting of names containing particles to work in the UI.
  • edited September 21, 2015
    I think RIS and BibTeX are on their own — if they don't have a mechanism for dealing with this, there's not much point in worrying about it. We can export in richer formats as HTML with semantic classes. (Actually, we might even export to RIS as HTML now, for better or worse.)

    Within Zotero, these fields will be HTML anyway, so anything sorted or searched upon would need to be converted to (or stored separately as) plain-text. So might as well make it explicit where we can. HTML gives us the ability to mark up text meaningfully.
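To make the last point concrete, here is a minimal sketch of deriving a plain-text value for sorting or searching from a field that carries HTML markup. The `non-dropping-particle` class name is purely hypothetical, since the thread does not settle on any markup, and Python's standard `html.parser` stands in for whatever Zotero would actually use.

```python
from html.parser import HTMLParser

class _TextOnly(HTMLParser):
    """Collects text content and discards all tags."""
    def __init__(self):
        super().__init__()
        self.parts = []

    def handle_data(self, data):
        self.parts.append(data)

def to_plain_text(markup):
    """Return the text of an HTML fragment with the markup removed."""
    parser = _TextOnly()
    parser.feed(markup)
    parser.close()
    return "".join(parser.parts)

# Hypothetical field content with a semantic class on the particle.
field = '<span class="non-dropping-particle">van de</span> Kamp'
print(to_plain_text(field))
```

The semantic class stays available to anything that understands the markup, while the derived plain-text string is what would be stored or computed for sorting and search.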