Double surnames starting with "te" end up quoted incorrectly

stroom · August 31, 2015

Authors with double surnames (e.g. J. aan de Valk) might end up as: "Valk, J aan de" or "aan de Valk J".

The last option is best for me, and it all seemed to work fine for years. Now I found out that in a new document I get mixed results:

"aan de Valk J" turns out ok, but another
"Heer G ter" goes wrong. In the same document.

But it seems to be something related to certain Author Names. Creating a new doc with fake new entries (with different types: journal, report, document) went fine, adding an old Author Name did not:
"
aan de Valk J (1900a) prut JA.
aan de Valk J (1900b) prut Rep.
aan de Valk J (1900c) prut Doc.
Heer GNJ ter, Schut A, Bakker JP (1999) The effect of…
Voet x. te (1900) prut.

"

I tested an older doc, with all correct quotations. After Zotero Refresh, the ones that were correct "ter Heer G" switched to "Heer G ter"...

In fact, it may be that only double surnames starting with “te” are affected. I tried several: ALL G. ter Heer (and some others starting with “te”) Author Names seem to fail, and the other double surnames are doing fine.

What is happening here? How can I resolve this?

regards,
jasper

Gracile · August 31, 2015

A new particle parser was released recently (original discussion here, there now).

The citation processor always treats "te", "ten", "ter" as dropping particles (here's a list of parsed particles). Always, i.e. no matter how they are entered (in the first-name or last-name field).

However,

The particles preceding some names should be treated as part of the last name, depending on the cultural heritage and personal preferences of the individual. To suppress parsing and treat such particles as part of the family name field, enclose the family name field content in double-quotes:

i.e., in your case: "ter Heer" in the last-name field (don't forget the pair of double quotes).
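
A minimal sketch of the behaviour described above, written in Python rather than the processor's own JavaScript. The particle lists are a small hypothetical subset taken from this thread, and `parse_family` is an illustrative name, not a citeproc-js function; the real parser may differ in detail.

```python
# Hypothetical subset of the particle list (status as reported in this thread).
DROPPING = ("te", "ten", "ter")
NON_DROPPING = ("van der", "van de", "van")


def parse_family(family):
    """Split a last-name field into CSL-JSON-style name parts."""
    # A family name wrapped in double quotes is taken literally: no parsing.
    if len(family) >= 2 and family.startswith('"') and family.endswith('"'):
        return {"family": family[1:-1]}
    for key, particles in (("dropping-particle", DROPPING),
                           ("non-dropping-particle", NON_DROPPING)):
        # Try longer particles first so "van der" wins over "van".
        for particle in sorted(particles, key=len, reverse=True):
            if family.startswith(particle + " "):
                return {key: particle, "family": family[len(particle) + 1:]}
    # Nothing on the list matched: the field is left untouched.
    return {"family": family}


print(parse_family("ter Heer"))     # "ter" is split off as a dropping particle
print(parse_family('"ter Heer"'))   # quoted: kept whole as the family name
print(parse_family("aan de Valk"))  # "aan de" is not on the list: kept whole
```

With "ter" split off as a dropping particle, a style is free to move it behind the initials ("Heer G ter"); the quoted form gives the formatter nothing to move.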

stroom · August 31, 2015

Thank you for your quick response. I see the big picture. And the workaround.

Is there any option to edit the list of parsed particles? Some seem a bit odd. Here the list of what I would consider typical Dutch:

4 « 's- » is always dropping.
5 « 't » is always dropping.
72 « in 't » is always dropping.
73 « in de » is always dropping.
74 « in der » is always dropping.
75 « in het » is always dropping.
90 « te » is always dropping.
91 « ten » is always dropping.
92 « ter » is always dropping.
93 « uit de » is always dropping.
94 « uit den » is always dropping.
96 « v. » is always dropping.
97 « v.d. » is always non-dropping.
98 « van de » is always non-dropping.
99 « van den » is always non-dropping.
100 « van der » is always non-dropping.
101 « van het » is always non-dropping.
102 « van » is always non-dropping.
103 « vander » is always non-dropping.
104 « vd » is always non-dropping.

These are typical Dutch preceding words, I cannot grasp why they should not all be treated non-dropping.
Of course I do not know if they are also used in other languages, which would bias this opinion.

But chances are that e.g. « uit den » and « van der » are really very typical Dutch, but they still have a different status.

But thanks again!

nickbart · August 31, 2015

@Gracile: No, you shouldn’t have to use double quotes for “ter” (except as a temporary workaround) – “ter” is clearly a non-dropping-particle (CMOS 16e 8.10, and https://forums.zotero.org/discussion/30974/2/any-idea-why-an-a-author-comes-last-in-the-bibliography/2/#Item_6).

It’s the particle parser that needs to be updated. (Not only “ter” but also “La” and a number of others are still wrong.)

Again, I’d like to suggest trying a simple case- and position- (rather than list-/string-) based parsing: to the best of my knowledge, there are no particles (in the CSL sense) that are uppercase, so the rules for parsing can be as simple as: lower-case strings at the front of the family field are parsed as non-dropping particles, and lower-case strings at the end of the given field are parsed as dropping particles.

Also, a string-based particle parser would have to use a much more comprehensive list, including the 333 Dutch particles reported by Rintze (https://forums.zotero.org/discussion/30974/2/any-idea-why-an-a-author-comes-last-in-the-bibliography/2/#Item_26) and many more.
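
The case- and position-based rule suggested above could be sketched roughly as follows. This is only an illustration in Python: `parse_name` is a hypothetical helper, not part of Zotero or citeproc-js, and it assumes that a "lower-case string" means a whitespace-separated word with no upper-case letters.

```python
def parse_name(given, family):
    """Case- and position-based particle parsing (no particle list)."""
    # Leading lower-case words of the family field -> non-dropping particle.
    # At least one word is always left as the family name.
    family_words = family.split()
    i = 0
    while i < len(family_words) - 1 and family_words[i].islower():
        i += 1
    # Trailing lower-case words of the given field -> dropping particle.
    # At least one word is always left as the given name.
    given_words = given.split()
    j = len(given_words)
    while j > 1 and given_words[j - 1].islower():
        j -= 1
    parsed = {"family": " ".join(family_words[i:]),
              "given": " ".join(given_words[:j])}
    if i:
        parsed["non-dropping-particle"] = " ".join(family_words[:i])
    if j < len(given_words):
        parsed["dropping-particle"] = " ".join(given_words[j:])
    return parsed


print(parse_name("G.", "ter Heer"))
print(parse_name("Ludwig van", "Beethoven"))
print(parse_name("Jean", "de La Fontaine"))
print(parse_name("Dick", "Van Dyke"))
```

Under these assumptions "ter" and "de" come out as non-dropping particles, "van" in the given field as a dropping particle, and the capitalised "La" and "Van" stay in the family name.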

Gracile · August 31, 2015

You can't edit the list, but it can be discussed here…

I have no opinion on those "always dropping" names, I don't know if this was a conscious decision (indeed, because used in different languages), but FYI @Rintze – one of the developers of the Citation Style Language (CSL) – speaks (and is, I think) Dutch:
https://forums.zotero.org/discussion/30974/any-idea-why-an-a-author-comes-last-in-the-bibliography/?Focus=229616#Comment_229616

https://forums.zotero.org/discussion/30974/any-idea-why-an-a-author-comes-last-in-the-bibliography/?Focus=229772#Comment_229772

There's also this old thread: https://forums.zotero.org/discussion/27822/1/von-van-de-in-authors-name-appear-as-von-van-de/

Gracile · August 31, 2015

@nickbart: I did not say that "ter" *was* a dropping-particle, but that Zotero/CSL *treated* it as dropping… I have no opinion on the question and I'm sure that everyone is open to amending and improving the particle parser and, hence, the list. Thanks.
(We cross-posted the link to Rintze's comment, which does not include "ter", but that could be an omission.)

fbennett · August 31, 2015

@stroom: Thanks for your comment, I'll update the code.

(Apart from that, I'll just point out that the names parser is now called from Zotero code, so a simple position-and-case parsing module can be introduced by Zotero if it is preferred.)

fbennett · August 31, 2015

I have posted a revision to the processor following @stroom's feedback, with case-sensitive parsing to address @nickbart's reminder on the "de La Fontaine" issue. The changes affect seven tests:

citeproc-js local tests (one fixture)

CSL processor tests (six fixtures)

I'll hold off on the release for a couple of days to allow time for comments on the changes.

stroom · September 1, 2015

Thanks a lot for this quick response, really great.

I cannot fully grasp the codes in the links you provided. But as I understand it you changed my list of dropping to non-dropping? That would really (!) save me a bunch of work, putting all (co-)authors in quotes is pretty undoable.

And I do not get the de La Fontaine issue, but how will "G ter Heer" and others be displayed: "ter Heer, G" or "Ter Heer, G"? Will the citation look the same as the bibliography entry?

And in the Bibliography sorted under "T" or "H"?

fbennett · September 1, 2015

That would really (!) save me a bunch of work

That's what we're here for. :-)

On La Fontaine, there seemed to be agreement that the "La" is not a non-dropping particle, but should be treated as a fixed part of the name itself - and so sort under "L," always. At least that's the latest story on that one.

For the (newly) non-dropping particles, the treatment will vary according to the settings on a citation style, but you can get results like this with a non-dropping particle:

Citation (with form="short")

ter Heer

Bibliography (with form="short" and demote-non-dropping-particle="sort-only")

ter Heer
Stuyvesant
Vermeer
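
A rough sketch of how the `demote-non-dropping-particle` setting changes the filing order, based on my reading of the CSL specification. `sort_key` and `display` are hypothetical helpers, not processor functions, and dropping particles and suffixes are left out for brevity.

```python
def sort_key(name, demote="sort-only"):
    """Build a case-insensitive sort key for a parsed CSL-style name."""
    particle = name.get("non-dropping-particle", "")
    if demote == "never":
        # The particle stays in front of the family name, so it affects filing.
        parts = [particle, name["family"], name.get("given", "")]
    else:
        # "sort-only" and "display-and-sort": file under the family name.
        parts = [name["family"], particle, name.get("given", "")]
    return " ".join(p for p in parts if p).lower()


def display(name):
    """Short form: particle (if any) followed by the family name."""
    parts = (name.get("non-dropping-particle"), name["family"])
    return " ".join(p for p in parts if p)


names = [
    {"family": "Vermeer"},
    {"family": "Heer", "non-dropping-particle": "ter"},
    {"family": "Stuyvesant"},
]

for mode in ("never", "sort-only"):
    ordered = sorted(names, key=lambda n: sort_key(n, mode))
    print(mode, [display(n) for n in ordered])
# never ['Stuyvesant', 'ter Heer', 'Vermeer']
# sort-only ['ter Heer', 'Stuyvesant', 'Vermeer']
```

With "sort-only" the name is still printed as "ter Heer" but filed under "H", which gives the order shown in the bibliography example above.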

stroom · September 1, 2015

Looks good to me! Tnx again.

nickbart · September 1, 2015

Just noticed that for "J. aan de Valk" @stroom is only seeing the "expected" behaviour because "aan de" is not parsed at all, i.e., treated as a fixed part of the family name – but of course "aan de" is yet another non-dropping particle that’s not in the parser’s list yet.

I think this only goes to show that – unless we skip the idea of list-based parsing and just look for case and position – we will be needing a much more comprehensive list.

If the general feeling is we should stick with the list, maybe those who speak Dutch could have a look at http://www.vernoeming.nl/alle-333-voorvoegsels-tussenvoegsels-in-nederlandse-achternamen and identify those entries they feel should be included.

Note that this list includes upper-case forms, too, but if I understood @Rintze correctly, these are just the forms that are used when no given name(s) or initial(s) appear in front of the family name; in other words, the canonical form of a non-dropping particle in a complete (i.e., given and family) Dutch name is always lower-case. This in turn suggests that it is the lower-case forms only that should be listed in databases such as Zotero’s and that should be included in the parsing list. It’d be great if those who speak Dutch could confirm this once again.

fbennett · September 1, 2015

Also added "aan de", thanks.

DWL-SDCA · September 1, 2015

The proper positioning and casing of name particles is indeed relevant for Zotero to list author names correctly when output in a user-selected style.

I see problems with a user getting author names correctly input into the Zotero record. How is a Zotero user to know what name form is correct? It is obvious that the publisher metadata does not always contain the "correct" name format and casing. I frequently see reference lists with prefix particles of all names in upper case -- even when some should be in lower case. If Zotero output is in part based on the casing of the particle, then the name casing must be correct in the Zotero record.

Even highly literate and experienced Zotero users need to have a back-and-forth discussion here to determine correct forms. Can less experienced users be expected to know how to edit names into the correct format so that Zotero can work magic when outputting a styled reference?

Clearly, there will also need to be work on the translators or within the Zotero name parser to automatically edit publisher-provided names with errors (as well as user-entered name errors) into the standard name format. Name authorities such as VIAF and ORCID do not necessarily present names that include particles in a consistent way. The VIAF depends on transcription of publisher data so there can be inconsistencies there. Names in ORCID, however, are author controlled. While authors' works can be imported into the ORCID database, the author has full control of the way(s) her or his name appears.

If Zotero is able to convert improperly formatted names into correctly formatted names (without astounding and frustrating users who see a name that is different from the one they entered) the developers will have accomplished a task that has vexed catalogers and indexers for decades.

stroom · September 1, 2015

I agree with DWL: being a dedicated Zotero user, I know next to nothing of what is under the hood of Zotero. Backed up by feedback from our editor, I would suggest the following, which might basically be nickbart's line of thought:

G. ter Heer:
Since 'ter' is low case, it is not a fixed part of the surname.
- In text: "According to Ter Heer (2015), etc", so t becomes T
- In quotation: "...has been shown (ter Heer 2015).", so within brackets, t stays t
- In bibliogr.: "ter Heer, G", file Always under H, not T. "Heer, G. ter", I would not use it. But if a style demands that a bibliogr. entry filed under H should actually start with an H, it is no problem of course. All before the first high case character is transported to the back.

G. Ter Heer:
Since 'Ter' is high case, it is a fixed part of the surname.
- Treat "Ter Heer" as any one-worded-surname like "Terheer", filing under T of course. "Ter Heer, G." should never be changed.

So in Dutch a simple check on the first high case of the surname would suffice I guess. That's where the fixed surname starts, whatever number of dashes, spaces and other high cases follow. I checked the 333-list, as far as I know they all can be treated like this. One option seems to be to add the complete 333 in your list, or otherwise get rid of the list and use a set of rules.

Also the earlier mentioned "de La Fontaine" would not be a problem: "de" might drop, or printed as high case in a sentence, using these rules. But if you want to drop only "de" of "de la Fontaine", then you need a rather detailed list or playing with some sort of quoting rules.

Like stated, I might miss some absolutely relevant technical knowledge here, but so far this seems workable to me.
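
The "first high case" check described above could look something like this in Python. It is only a sketch of the rules as stated in this post; the function names are made up for illustration and nothing here reflects actual Zotero code.

```python
def split_surname(surname):
    """Split at the first upper-case character: (prefix, fixed surname)."""
    for i, ch in enumerate(surname):
        if ch.isupper():
            return surname[:i].strip(), surname[i:]
    return "", surname  # no upper-case character at all: leave untouched


def in_running_text(surname):
    """In text, without initials: the first letter becomes upper case."""
    return surname[:1].upper() + surname[1:]


def filing_letter(surname):
    """Bibliography: file under the first letter of the fixed surname."""
    return split_surname(surname)[1][:1].upper()


print(split_surname("ter Heer"))        # ('ter', 'Heer')
print(split_surname("Ter Heer"))        # ('', 'Ter Heer')
print(split_surname("de La Fontaine"))  # ('de', 'La Fontaine')
print(in_running_text("ter Heer"))      # Ter Heer
print(filing_letter("ter Heer"))        # H
print(filing_letter("Ter Heer"))        # T
```

Everything before the first upper-case character is the part that may be moved to the back or capitalised in a sentence; the rest is never changed.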

aurimas · September 1, 2015

Regarding improper capitalization messing with proper parsing, I think we should still go with "position and case"-based parsing, but add a list of words that are known to always indicate particles. This way, even if those words are capitalized, we can parse them correctly. Additionally, having the non-list based parsing would give the user some control to "force" particle parsing.
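
As a rough illustration of this combined approach (Python, with a made-up word list and a hypothetical `leading_particle` helper): a leading word counts as a particle if it is lower case, or if its lower-case form is on the list.

```python
# Hypothetical list of words assumed to always indicate particles.
KNOWN_PARTICLE_WORDS = {"van", "der", "den", "te", "ten", "ter"}


def leading_particle(family):
    """Return (particle, family name) using case, position, and the list."""
    words = family.split()
    i = 0
    # Always leave at least one word as the family name.
    while i < len(words) - 1 and (
        words[i].islower() or words[i].lower() in KNOWN_PARTICLE_WORDS
    ):
        i += 1
    return " ".join(words[:i]).lower(), " ".join(words[i:])


print(leading_particle("Ter Heer"))     # ('ter', 'Heer')  via the list
print(leading_particle("aan de Valk"))  # ('aan de', 'Valk')  via case
print(leading_particle("Van Dyke"))     # ('van', 'Dyke')  via the list
```

Note that the list-based branch also rewrites capitalised names such as "Van Dyke", which may or may not be what the user wants.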

fbennett · September 2, 2015

List-based parsing with an override is an interesting idea. Depending on the state of names in the wild (which I still think we don't know for certain), it might be possible to give users a hot-key or easily accessible menu for repairing names recognized in the list (possibly in bulk), and then rely exclusively on position and case, as nickbart has been pressing.

If it would work for everyone everywhere, that would yield clean (and easily parseable) data, which would be good all around. The less magic the citation formatter performs, the better - I just want to be sure we don't kick up yet another round of uncertainty and doubt around names when the current straightjacket approach is eased.

My only worry is that our own concern with smooth operation across multiple languages and citation styles, on the one hand, and the views of area specialists with strong opinions on specifics, on the other, are completely separate domains [thinking most immediately of feedback we've had from Arabic specialists]. Building this thing raises novel problems, and we need to learn from one another, but (a) the conversation is really time-consuming when it happens, so it is hard to get people to engage, and (b) discovery of new issues is completely hit-or-miss.

So … it's a hard problem. If there is consensus for a particular solution on Zotero-side, though, go for it—always happy to follow.

stroom · September 2, 2015

Sounds fine. Using a list will have an advantage when manually copy-pasting names into the Author field. Doing this, G. ter Heer ends up as surname "Heer", first name "G. ter", which is difficult to settle with rules. Entries like "Rob Ter Heer": is Ter a 2nd first name or part of a surname? Using the 333-list would settle that for most Dutch cases.

Rintze · September 2, 2015

I think we should still go with "position and case"-based parsing, but add a list of words that are known to always indicate particles. This way, even if those words are capitalized, we can parse them correctly.

Can you give an example where such a list would work? The problem with any Dutch non-dropping particle (like "van") is that they are often capitalized and no longer treated as a particle in Americanized names, like in the name "Dick Van Dyke" (https://en.wikipedia.org/wiki/Dick_Van_Dyke).

fbennett · September 2, 2015

There are two potential roles for lists. One would be in the translation and UI layers of Zotero, to grab names more or less correctly, and make it easier to fix them up when translation guesses wrongly.

The other role would be in the processor, to identify particles that begin with a capital letter.

If all particles are all lowercase, and if all leading lowercase words are particles, the second type of list is unnecessary: the only role of list-based parsing in that case would be to lighten the burden of creating clean data for input.

Rintze · September 2, 2015

Ah, right. That all makes sense, then.

nickbart · September 3, 2015

There are two potential roles for lists. One would be in the translation and UI layers of Zotero, to grab names more or less correctly, and make it easier to fix them up when translation guesses wrongly. The other role would be in the processor, to identify particles that begin with a capital letter.

Precisely.

If all particles are all lowercase, …

To the best of my current knowledge this ‘hypothesis’ about particles is correct. If we cannot come up with any counterexamples ourselves, the best additional test, I guess, is to implement this and see whether users at large can spot any problems.

… and if all leading lowercase words are particles, …

A few are not, but a list typically won’t help either: Strings like ‘de’ or ‘van’ could be either of

1. a dropping particle (Alfred de Musset [CMoS 16e 8.7], Ludwig van Beethoven [8.8])
2. a non-dropping particle (Hugo de Vries, Vincent van Gogh [8.10])
3. a fixed part of the family name (Charles de Gaulle [8.7, 16.71], Robert van Gulik [8.5])

We can only distinguish 1. from 2. by the string’s position: [Beethoven] [Ludwig van] vs. [van Gogh] [Vincent]; and 2. from 3. by the use of protecting double quotes: ["van Gulik"] [Robert].

… the second type of list is unnecessary.

Yes. The only case in which we’d need a list (though probably a much shorter one) would be if we could identify any uppercase strings that are particles after all. Again, currently I don’t really think so.

fbennett · September 3, 2015

Yes. The only case in which we’d need a list (though probably a much shorter one) would be if we could identify any uppercase strings that are particles after all.

That all lines up for me.

Again, currently I don't really think so.

The feedback on Arabic names worries me slightly.

nickbart · September 6, 2015

Do you have any specific worries? Re-reading previous discussions on Arabic names on this forum (https://forums.zotero.org/discussion/30974/2/any-idea-why-an-a-author-comes-last-in-the-bibliography/, https://forums.zotero.org/discussion/28457/arabic-names-with-the-particle-al/ plus http://www.hedden-information.com/Indexing-Arabic-names.pdf), I still think we’re close to a satisfactory solution now.

As to “al-” and friends, lower-case seems to be “the most common practice by far” (Christian Moe, https://forums.zotero.org/discussion/28457/arabic-names-with-the-particle-al/), so I guess we could try the “lower-case = particle” rule, and wait and see whether anyone comes forward with an actual need for introducing an exception for upper-case “Al-” etc.

Of course, most CSL styles will have to switch to `demote-non-dropping-particle="sort-only"` or `"display-and-sort"`. The Chicago Manual of Style rules, e.g., clearly call for `demote-non-dropping-particle="display-and-sort"`.

The only potential difficulty I see is that within the current CSL schema you can’t have "sort-only" for one type of names (say, Arabic) but "display-and-sort" for others (say, Dutch) – but this has nothing to do with the parsing itself.

fbennett · September 6, 2015

Nothing specific, and going forward with a change to test the waters probably makes sense. We'll need support for particle adjustments in the Zotero UI before changing the processor. Both changes can now be done without touching the processor directly, so I'll leave it in Zotero's court for now.

nickbart · September 6, 2015

I’m not sure I understand: Changes in the Zotero UI are, in all likelihood, still a long way off.

What we can do _now_ is to improve the algorithms for parsing the existing two-field names.

Zotero has begun to parse names itself when exporting CSL JSON – which is great –, but it also makes sense for citeproc-js to retain its own parsing capabilities for cases when it’s explicitly asked to use them.

My point is simply: We should update the parsing algorithms, we should do it now, and we should do it _both_ in Zotero and in citeproc-js to avoid any differences in behaviour between these two.

fbennett · September 6, 2015

I disagree, but it's not really my call at this point. If Zotero chooses to simplify the parsing behaviour without providing support for the use case flagged by stroom above (mis-parsing of names by the translators), that can happen; but it's a decision for the core team, and not for me to make.

nickbart · September 7, 2015

But the use case flagged by stroom is about data import, not about parsing.

A list might help when importing or manually copy-pasting names, but even if stroom, quite rightly, observes, “Using the 333-list would settle that for most _Dutch_ cases” [my emph.], this list won’t handle Americanised names correctly. Take “ten/Ten”, which is on the 333-list, too: If uppercase, this could be a fixed part of an Americanised name (“Abraham Ten Broeck; Ten Broeck”, CMS 8.5), or (contra stroom) a second given name, or a wrongly capitalised Dutch non-dropping “ten”; so the list is almost useless here. This definitely needs more discussion.

Parsing, however, can and should be improved independent of this.

At the very least, those strings where there is consensus they are _not_ CSL particles, i.e., all those that start with an uppercase letter, should be removed from the parsing lists of both Zotero and citeproc-js.

fbennett · September 7, 2015

But the use case flagged by stroom is about data import, not about parsing.

They're related. Lists would have a role, not at the translation stage, but in UI to make data cleanup easier. At least that's what I suggested above.

My view is that the two problems should be solved together, because that would yield both cleaner data and a better user experience. As I have said repeatedly above, Zotero can make a different choice if they see things differently. It really is their call, so I'll be slipping away here.

stroom · September 10, 2015

Thank you for all your quick responses. This will take time to implement.

My question is this: fbennett posted on 31/8 and 1/9 that the list was edited, but: "I'll hold off on the release for a couple of days to allow time for comments on the changes."

When will this quick-fix/list update be released? That would help me out for now.

fbennett · September 10, 2015

To get the latest, you can install the Propachi Vanilla processor patch plugin. It should work with Zotero or Standalone (I just tested it in Zotero for Firefox to be sure it's working there).

If you hit any other anomalies, let us know.