Double surnames starting with "te" end up quoted incorrectly
Authors with double surnames (e.g. J. aan de Valk) might end up as: "Valk, J aan de" or "aan de Valk J".
The last option is best for me, and it all seemed to work fine for years. Now I found out that in a new document I get mixed results:
"aan de Valk J" turns out ok, but another
"Heer G ter" goes wrong. In the same document.
But it seems to be something related to certain Author Names. Creating a new doc with fake new entries (with different types: journal, report, document) went fine, adding an old Author Name did not:
"
aan de Valk J (1900a) prut JA.
aan de Valk J (1900b) prut Rep.
aan de Valk J (1900c) prut Doc.
Heer GNJ ter, Schut A, Bakker JP (1999) The effect of…
Voet x. te (1900) prut.
"
I tested an older doc, with all correct quotations. After Zotero Refresh, the ones that were correct "ter Heer G" switched to "Heer G ter"...
In fact, it may be that only double surnames starting with “te” are affected. I tried several: ALL G. ter Heer (and some others starting with “te”) Author Names seem to fail, while the other double surnames are doing fine.
What is happening here? How can I resolve this?
regards,
jasper
The citation processor always treats "te", "ten", "ter" as dropping particles (here's a list of parsed particles). Always, i.e. no matter how they are entered (in the first-name or last-name field).
However, you can override this by wrapping the name in double quotes, i.e., in your case: "ter Heer" in the last-name field (don't forget the pair of double quotes).
Is there any option to edit the list of parsed particles? Some seem a bit odd. Here the list of what I would consider typical Dutch:
4 « 's- » is always dropping.
5 « 't » is always dropping.
72 « in 't » is always dropping.
73 « in de » is always dropping.
74 « in der » is always dropping.
75 « in het » is always dropping.
90 « te » is always dropping.
91 « ten » is always dropping.
92 « ter » is always dropping.
93 « uit de » is always dropping.
94 « uit den » is always dropping.
96 « v. » is always dropping.
97 « v.d. » is always non-dropping.
98 « van de » is always non-dropping.
99 « van den » is always non-dropping.
100 « van der » is always non-dropping.
101 « van het » is always non-dropping.
102 « van » is always non-dropping.
103 « vander » is always non-dropping.
104 « vd » is always non-dropping.
These are typical Dutch preceding words, I cannot grasp why they should not all be treated non-dropping.
Of course I do not know if they are also used in other languages, which would bias this opinion.
But chances are that e.g. « uit den » and « van der » are both really very typical Dutch, yet they still have a different status.
But thanks again!
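For readers following along, the practical effect of the two statuses can be sketched in a few lines of Python. This is only an illustration of list-based parsing, not the citation processor's actual code: the table holds just a handful of the entries quoted above, `split_particle` and `sort_order` are hypothetical names, and "van der Berg" is a made-up author.

```python
# Hypothetical, simplified particle table built from entries quoted above.
# True = dropping, False = non-dropping.
PARTICLES = {
    "te": True,
    "ten": True,
    "ter": True,
    "uit den": True,
    "van": False,
    "van der": False,
}


def split_particle(family):
    """Return (particle, rest, is_dropping); particle is '' if none matches."""
    # Try longer particles first so that "van der" wins over "van".
    for particle in sorted(PARTICLES, key=len, reverse=True):
        prefix = particle + " "
        if family.startswith(prefix):
            return particle, family[len(prefix):], PARTICLES[particle]
    return "", family, False


def sort_order(family, initials):
    """Render 'Family Initials'; a dropping particle moves behind the initials."""
    particle, rest, dropping = split_particle(family)
    if not particle:
        return f"{family} {initials}"
    if dropping:
        return f"{rest} {initials} {particle}"
    return f"{particle} {rest} {initials}"


print(sort_order("ter Heer", "G"))      # Heer G ter
print(sort_order("van der Berg", "J"))  # van der Berg J
```

With "ter" flagged as dropping, the particle is pushed behind the initials, which is the "Heer G ter" result reported at the top of this thread; a non-dropping particle stays in front of the family name.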
It’s the particle parser that needs to be updated. (Not only “ter” but also “La” and a number of others are still wrong.)
Again, I’d like to suggest trying a simple case- and position- (rather than list-/string-) based parsing: to the best of my knowledge, there are no particles (in the CSL sense) that are uppercase, so the rules for parsing can be as simple as: lower-case strings at the front of the family field are parsed as non-dropping particles, and lower-case strings at the end of the given field are parsed as dropping particles.
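A minimal Python sketch of that case-and-position rule, assuming two input fields (family and given) and no particle list at all. The function names are hypothetical; the output keys follow the CSL JSON field names. Keeping at least one word in each field as the name itself is my own assumption, not something settled in this thread.

```python
def is_lowercase_word(word):
    """True if the first letter of the word (skipping apostrophes etc.) is lowercase."""
    for ch in word:
        if ch.isalpha():
            return ch.islower()
    return False


def leading_lowercase(words):
    """Count the lowercase words at the start of a list of words."""
    count = 0
    for word in words:
        if not is_lowercase_word(word):
            break
        count += 1
    return count


def parse_name(family, given):
    """Split a two-field name by case and position only (no particle list)."""
    fam = family.split()
    giv = given.split()

    # Lowercase words at the FRONT of the family field: non-dropping particle.
    # Assumption: always keep at least one word as the family name itself.
    n = min(leading_lowercase(fam), max(len(fam) - 1, 0))

    # Lowercase words at the END of the given field: dropping particle.
    # Assumption: always keep at least one word as the given name itself.
    m = min(leading_lowercase(giv[::-1]), max(len(giv) - 1, 0))

    return {
        "non-dropping-particle": " ".join(fam[:n]),
        "family": " ".join(fam[n:]),
        "given": " ".join(giv[:len(giv) - m]),
        "dropping-particle": " ".join(giv[len(giv) - m:]),
    }


print(parse_name("ter Heer", "G"))
print(parse_name("Beethoven", "Ludwig van"))
print(parse_name("Ter Heer", "G"))
```

Under this sketch "ter Heer" yields a non-dropping "ter", "Ludwig van" yields a dropping "van", and an upper-case "Ter Heer" is left alone as a plain family name.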
Also, a string-based particle parser would have to use a much more comprehensive list, including the 333 Dutch particles reported by Rintze (https://forums.zotero.org/discussion/30974/2/any-idea-why-an-a-author-comes-last-in-the-bibliography/2/#Item_26) and many more.
I have no opinion on those "always dropping" names, and I don't know if this was a conscious decision (indeed, because they are used in different languages), but FYI @Rintze – one of the developers of the Citation Style Language (CSL) – speaks (and is, I think) Dutch:
https://forums.zotero.org/discussion/30974/any-idea-why-an-a-author-comes-last-in-the-bibliography/?Focus=229616#Comment_229616
https://forums.zotero.org/discussion/30974/any-idea-why-an-a-author-comes-last-in-the-bibliography/?Focus=229772#Comment_229772
There's also this old thread: https://forums.zotero.org/discussion/27822/1/von-van-de-in-authors-name-appear-as-von-van-de/
(We cross-posted the link to Rintze's comment, which does not include "ter", but that may be an omission.)
(Apart from that, I'll just point out that the names parser is now called from Zotero code, so a simple position-and-case parsing module can be introduced by Zotero if it is preferred.)
citeproc-js local tests (one fixture)
CSL processor tests (six fixtures)
I'll hold off on the release for a couple of days to allow time for comments on the changes.
I cannot fully grasp the codes in the links you provided. But as I understand it you changed my list of dropping to non-dropping? That would really (!) save me a bunch of work, putting all (co-)authors in quotes is pretty undoable.
And I do not get the de La Fontaine issue, but how will "G ter Heer" and others be displayed: "ter Heer, G" or "Ter Heer, G"? Will the quotation be the same as in the bibliography?
And in the Bibliography sorted under "T" or "H"?
On La Fontaine, there seemed to be agreement that the "La" is not a non-dropping particle, but should be treated as a fixed part of the name itself - and so sort under "L," always. At least that's the latest story on that one.
For the (newly) non-dropping particles, the treatment will vary according to the settings on a citation style, but you can get results like this with a non-dropping particle:
Citation (with form="short")
ter Heer
Bibliography (with form="short" and demote-non-dropping-particle="sort-only")
ter Heer
Stuyvesant
Vermeer
I think this only goes to show that – unless we skip the idea of list-based parsing and just look for case and position – we will be needing a much more comprehensive list.
If the general feeling is we should stick with the list, maybe those who speak Dutch could have a look at http://www.vernoeming.nl/alle-333-voorvoegsels-tussenvoegsels-in-nederlandse-achternamen and identify those entries they feel should be included.
Note that this list includes upper-case forms, too, but if I understood @Rintze correctly, these are just the forms that are used when no given name(s) or initial(s) appear in front of the family name; in other words, the canonical form of a non-dropping particle in a complete (i.e., given and family) Dutch name is always lower-case. This in turn suggests that it is the lower-case forms only that should be listed in databases such as Zotero’s and that should be included in the parsing list. It’d be great if those who speak Dutch could confirm this once again.
I see problems with a user getting author names correctly input into the Zotero record. How is a Zotero user to know what name form is correct? It is obvious that the publisher metadata does not always contain the "correct" name format and casing. I frequently see reference lists with prefix particles of all names in upper case -- even when some should be in lower case. If Zotero output is in part based on the casing of the particle, then the name casing must be correct in the Zotero record.
Even highly literate and experienced Zotero users need to have a back-and-forth discussion here to determine correct forms. Can less experienced users be expected to know how to edit names into the correct format so that Zotero can work magic when outputting a styled reference?
Clearly, there will also need to be work on the translators or within the Zotero name parser to automatically edit publisher-provided names with errors (as well as user-entered name errors) into the standard name format. Name authorities such as VIAF and ORCID do not necessarily present names that include particles in a consistent way. The VIAF depends on transcription of publisher data so there can be inconsistencies there. Names in ORCID, however, are author controlled. While authors' works can be imported into the ORCID database, the author has full control of the way(s) her or his name appears.
If Zotero is able to convert improperly formatted names into correctly formatted names (without astounding and frustrating users who see a name that is different from the one they entered) the developers will have accomplished a task that has vexed catalogers and indexers for decades.
G. ter Heer:
Since 'ter' is low case, it is not a fixed part of the surname.
- In text: "According to Ter Heer (2015), etc", so t becomes T
- In quotation: "...has been shown (ter Heer 2015).", so within brackets, t stays t
- In bibliogr.: "ter Heer, G", filed Always under H, not T. "Heer, G. ter", I would not use it. But if a style demands that a bibliogr. entry filed under H should actually start with an H, it is no problem of course. Everything before the first high case character is moved to the back.
G. Ter Heer:
Since 'Ter' is high case, it is a fixed part of the surname.
- Treat "Ter Heer" as any one-worded-surname like "Terheer", filing under T of course. "Ter Heer, G." should never be changed.
So in Dutch a simple check on the first high case character of the surname would suffice, I guess. That's where the fixed surname starts, whatever number of dashes, spaces and other high case characters follow. I checked the 333-list; as far as I know they can all be treated like this. One option seems to be to add the complete 333 to your list, or otherwise get rid of the list and use a set of rules.
Also the earlier mentioned "de La Fontaine" would not be a problem: "de" might drop, or printed as high case in a sentence, using these rules. But if you want to drop only "de" of "de la Fontaine", then you need a rather detailed list or playing with some sort of quoting rules.
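The rules in this post can be written down as a small Python sketch. The function names are hypothetical, and this only mirrors the Dutch conventions described here (split at the first high case character, capitalise at the start of running text, file under the fixed part); it is not a proposal for how the processor should do it.

```python
def split_at_first_capital(surname):
    """Split a surname into (lowercase prefix, fixed part from the first capital on)."""
    for i, ch in enumerate(surname):
        if ch.isupper():
            return surname[:i].strip(), surname[i:]
    return "", surname  # no capital found: treat the whole string as fixed


def in_running_text(surname):
    """Form used in running text: the first character is capitalised."""
    return surname[:1].upper() + surname[1:]


def filing_key(surname):
    """Key for filing in the bibliography: the fixed part only, lowercased."""
    return split_at_first_capital(surname)[1].lower()


print(split_at_first_capital("ter Heer"))        # ('ter', 'Heer')
print(split_at_first_capital("de La Fontaine"))  # ('de', 'La Fontaine')
print(in_running_text("ter Heer"))               # Ter Heer

names = ["Vermeer", "Ter Heer", "Stuyvesant", "ter Heer"]
print(sorted(names, key=filing_key))
# ['ter Heer', 'Stuyvesant', 'Ter Heer', 'Vermeer']
```

Note that "de la Fontaine" (lower-case "la") would split into "de la" and "Fontaine" under this rule, which is exactly the case where a list or quoting would still be needed to drop only the "de".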
Like stated, I might miss some absolutely relevant technical knowledge here, but so far this seems workable to me.
If it would work for everyone everywhere, that would yield clean (and easily parseable) data, which would be good all around. The less magic the citation formatter performs, the better - I just want to be sure we don't kick up yet another round of uncertainty and doubt around names when the current straightjacket approach is eased.
My only worry is that our own concern with smooth operation across multiple languages and citation styles, on the one hand, and the views of area specialists with strong opinions on specifics, on the other, are completely separate domains [thinking most immediately of feedback we've had from Arabic specialists]. Building this thing raises novel problems, and we need to learn from one another, but (a) the conversation is really time-consuming when it happens, so it is hard to get people to engage, and (b) discovery of new issues is completely hit-or-miss.
So … it's a hard problem. If there is consensus for a particular solution on Zotero-side, though, go for it—always happy to follow.
The other role would be in the processor, to identify particles that begin with a capital letter.
If all particles are all lowercase, and if all leading lowercase words are particles, the second type of list is unnecessary: the only role of list-based parsing in that case would be to lighten the burden of creating clean data for input.
Yes. The only case in which we’d need a list (though probably a much shorter one) would be if we could identify any uppercase strings that are particles after all. Again, currently I don’t really think so.
We can only distinguish 1. from 2. by the string’s position:
[Beethoven] [Ludwig van]
vs.
[van Gogh] [Vincent]
and 2. from 3. by the use of protecting double quotes:
["van Gulik"] [Robert]
As to “al-” and friends, lower-case seems to be “the most common practice by far” (Christian Moe, https://forums.zotero.org/discussion/28457/arabic-names-with-the-particle-al/), so I guess we could try the “lower-case = particle” rule, and wait and see whether anyone comes forward with an actual need for introducing an exception for upper-case “Al-” etc.
Of course, most CSL styles will have to switch to `demote-non-dropping-particle="sort-only"` or `"display-and-sort"`. The Chicago Manual of Style rules, e.g., clearly call for `demote-non-dropping-particle="display-and-sort"`.
The only potential difficulty I see is that within the current CSL schema you can’t have "sort-only" for one type of names (say, Arabic) but "display-and-sort" for others (say, Dutch) – but this has nothing to do with the parsing itself.
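To illustrate what the three values of `demote-non-dropping-particle` mean for an already-parsed name, here is a rough Python sketch. It reflects my reading of the CSL specification in simplified form (inverted names only, no dropping particles or suffixes); `display_inverted` and `sort_key` are hypothetical helper names, not processor functions.

```python
def join_parts(*parts):
    """Join the non-empty parts with single spaces."""
    return " ".join(p for p in parts if p)


def display_inverted(name, demote):
    """Inverted 'Family, Given' form under a demote-non-dropping-particle value."""
    particle = name.get("non-dropping-particle", "")
    if demote == "display-and-sort":
        # The particle is demoted behind the given name.
        return f"{name['family']}, {join_parts(name['given'], particle)}"
    # "never" and "sort-only" keep the particle in front of the family name.
    return f"{join_parts(particle, name['family'])}, {name['given']}"


def sort_key(name, demote):
    """Simplified sort key under a demote-non-dropping-particle value."""
    particle = name.get("non-dropping-particle", "")
    if demote == "never":
        # The particle counts as the start of the family name.
        return join_parts(particle, name["family"], name["given"]).lower()
    # "sort-only" and "display-and-sort" sort on the bare family name first.
    return join_parts(name["family"], particle, name["given"]).lower()


heer = {"non-dropping-particle": "ter", "family": "Heer", "given": "G"}
for setting in ("never", "sort-only", "display-and-sort"):
    print(setting, "|", display_inverted(heer, setting), "|", sort_key(heer, setting))
# never | ter Heer, G | ter heer g
# sort-only | ter Heer, G | heer ter g
# display-and-sort | Heer, G ter | heer ter g
```

So "sort-only" gives the "ter Heer, G" filed under H that was asked for above, while "display-and-sort" gives "Heer, G ter".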
What we can do _now_ is to improve the algorithms for parsing the existing two-field names.
Zotero has begun to parse names itself when exporting CSL JSON – which is great –, but it also makes sense for citeproc-js to retain its own parsing capabilities for cases when it’s explicitly asked to use them.
My point is simply: We should update the parsing algorithms, we should do it now, and we should do it _both_ in Zotero and in citeproc-js to avoid any differences in behaviour between these two.
A list might help when importing or manually copy-pasting names, but even if stroom, quite rightly, observes, “Using the 333-list would settle that for most _Dutch_ cases” [my emph.], this list won’t handle Americanised names correctly. Take “ten/Ten”, which is on the 333-list, too: If uppercase, this could be a fixed part of an Americanised name (“Abraham Ten Broeck; Ten Broeck”, CMS 8.5), or (contra stroom) a second given name, or a wrongly capitalised Dutch non-dropping “ten”; so the list is almost useless here. This definitely needs more discussion.
Parsing, however, can and should be improved independent of this.
At the very least those strings where there is consensus they are _not_ CSL particles, i.e., all those that start with an uppercase letter should be removed from the parsing lists of both Zotero and citeproc-js.
My view is that the two problems should be solved together, because that would yield both cleaner data and a better user experience. As I have said repeatedly above, Zotero can make a different choice if they see things differently. It really is their call, so I'll be slipping away here.
My question is: fbennett posted on 31/8 and 1/9 that the list was edited, but: "I'll hold off on the release for a couple of days to allow time for comments on the changes."
When will this quick-fix/list update be released? That would help me out for now.
If you hit any other anomalies, let us know.