[citeproc-js] 't name particle results in extra space

aurimas · November 13, 2013

There's an extra space added before the 't particle when citing the following article: http://www.nature.com/nbt/journal/v31/n11/full/nbt.2702.html (see first author "Peter A C 't Hoen")

citeproc is also not picking up the particle as a non-dropping particle when sorting the bibliography (in the Cell style, for example). Because of the apostrophe, it sorts the entry to the very beginning.

I'm also not sure if the apostrophe is getting replaced and whether it should be doing that.

fbennett · November 14, 2013

I'll confess that this possibility had never occurred to me. I'll see what can be done.

fbennett · November 14, 2013

I have a fix for this, which I'll release tomorrow.

The CSL archives have a list of name particles contributed by Charles Parnot of Papers. It is not a complete list, but does show "in 't" as a dropping particle. Out of curiosity, is 't a dropping or a non-dropping particle?

Rintze · November 14, 2013

It seems to be dropping. See e.g. http://jpet.aspetjournals.org/content/299/3/921.full : "as described in detail earlier (Hoen et al., 2000)."

It's the abbreviated Dutch neuter singular form of "the" (unabbreviated form is "het"). See http://www.dutchgrammar.com/en/?n=NounsAndArticles.03

Edit: changed my mind.

fbennett · November 14, 2013

Hmz. I'm getting this sort order in a test with demote-non-dropping-particle="never":

Frinkle, B
Horvath, P A B in ’t
Horvath, P A A ’t
In ’t Horvath, P A D
Klabdaggit, M
’t Horvath, P A C
Vooz, B

This follows the CSL 1.0.1 specification. We may have discussed it already, but looking at this on the page, I wonder whether maybe either the second and third entries should be reversed, or the display position of the dropping-particle should be adjusted ... ?

aurimas · November 14, 2013

Sorry for my ignorance, but why does the dropping or non-dropping part have to be hard-coded into the processor? I thought if a particle was dropping it's supposed to be entered as part of the First Name and if it is non-dropping, it's part of the Last Name. Am I missing something?

fbennett · November 14, 2013

I wasn't completely clear, sorry. The question about dropping vs non-dropping was just a point of curiosity. It's not hard-coded in the processor, and both cases will now be handled for particles that contain apostrophes.

In the sort question, the name entries behind the test set the particle in the family field (non-dropping) for P.A.C. Horvath and P.A.D. Horvath. It's set in the given name field (dropping) for P.A.A. Horvath and P.A.B. Horvath.
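
To make the sort order above concrete, here is a minimal JavaScript sketch. It is not citeproc-js code. It assumes the names are already parsed into separate fields (the property names nonDropping and dropping are hypothetical), that apostrophes are ignored and case is folded when comparing, and that the key order for "never" is non-dropping particle plus family name, then dropping particle, then given name:

```javascript
// Illustrative sketch only; this is not citeproc-js code.
// Names are shown already parsed. In the test, the particle was entered
// in the family field (non-dropping) or the given field (dropping).
const names = [
  { family: "Vooz", given: "B" },
  { family: "Horvath", given: "P A A", dropping: "\u2019t" },
  { family: "Horvath", given: "P A C", nonDropping: "\u2019t" },
  { family: "Klabdaggit", given: "M" },
  { family: "Horvath", given: "P A B", dropping: "in \u2019t" },
  { family: "Horvath", given: "P A D", nonDropping: "In \u2019t" },
  { family: "Frinkle", given: "B" },
];

// Assumption: apostrophes are ignored and case is folded for sorting.
function normalize(s) {
  return (s || "").replace(/['\u2019]/g, "").toLowerCase().trim();
}

// Assumed key order for demote-non-dropping-particle="never":
// non-dropping particle + family, then dropping particle, then given.
function sortKey(n) {
  return [
    normalize([n.nonDropping, n.family].filter(Boolean).join(" ")),
    normalize(n.dropping),
    normalize(n.given),
  ];
}

// Compare two names key by key, using plain code-unit ordering.
function compareNames(a, b) {
  const ka = sortKey(a);
  const kb = sortKey(b);
  for (let i = 0; i < ka.length; i++) {
    if (ka[i] < kb[i]) return -1;
    if (ka[i] > kb[i]) return 1;
  }
  return 0;
}

// Render each sorted name as "Family, Given dropping-particle".
const order = names
  .slice()
  .sort(compareNames)
  .map((n) =>
    [n.nonDropping, n.family].filter(Boolean).join(" ") +
    ", " +
    [n.given, n.dropping].filter(Boolean).join(" ")
  );

console.log(order.join("\n"));
```

Under these assumptions the output has the same sequence as the listing in the earlier post. The "in ’t" entry lands ahead of the "’t" entry because, once the apostrophe is ignored, "in t" compares lower than "t".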

(Rintze and I have had an off-list exchange about this one. We'll keep an eye on the use case, and wait for further evidence on sort vs display conventions.)

aurimas · November 14, 2013

Could you point me to the list of recognized particles in citeproc-js? I found a couple more unusual looking ones, and I want to check if they are being treated right (don't feel like testing all of the possibilities in Word).

fbennett · November 14, 2013

Sure thing. It's here.

aurimas · November 14, 2013

I meant in the citeproc-js code. I assume that list will not be up to date with what's currently in the processor.

fbennett · November 14, 2013

(Oh wait, sorry -- clarity again. The link above is to the list of particles provided by Charles Parnot. At present, citeproc-js uses regular expressions to identify particles.)

aurimas · November 14, 2013

That's ok. Where is the RegExp stored?

fbennett · November 14, 2013

Here's the code:

non-dropping particle
dropping particle

Read it and weep. :-)

aurimas · November 14, 2013

Sorry, one last question. Could you link me to the test suite for non-dropping (and perhaps dropping) particles? I'm not very familiar with citeproc-js code layout.

From a quick look, I think the regexps are a bit too relaxed and there doesn't seem to be any additional checking of what they match, so I would like to see the list of known particles that we are trying to match here. (I assume they are in the test suite. I remember you were adding some Arabic particles recently, which are not in Charles' list)

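Another comment I have is that some of the regexps are not entirely correct. Or, rather, they are not matching what I'm sure you intended. E.g. [-|\s+|\'\u2019] is a character class, so the | and + inside it are matched as literal characters rather than acting as alternation and repetition. I think you probably wanted something like (?:[-'\u2019]|\s+) instead, because (/[-|\s+|\'\u2019]/).test("+") == true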
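
The difference is easy to check directly. A minimal sketch comparing the two patterns on a few single-character inputs (the variable names are just for illustration):

```javascript
// Inside [...], "|" and "+" are literal characters, not operators.
const charClass = /[-|\s+|\'\u2019]/;
// Probable intent: one hyphen or apostrophe, or a run of whitespace.
const grouped = /(?:[-'\u2019]|\s+)/;

const samples = ["+", "|", "-", "'", "\u2019", " ", "a"];

// Record which pattern accepts which input.
const results = samples.map((s) => ({
  input: s,
  charClass: charClass.test(s),
  grouped: grouped.test(s),
}));

console.log(results);
```

The character class accepts "+" and "|" in addition to the intended separators, while the grouped version rejects both and still accepts the hyphen, both apostrophes, and whitespace.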

fbennett · November 14, 2013

There is a scattering of tests related to particle handling, but nothing yet that checks a list of candidates. It's been rather ad hoc.

Thanks for catching that regexp bug. It looks like I was using group matching, and then changed it to a character match without thinking it through.

aurimas · November 14, 2013

That seems to have happened a couple times. Give me a day or so and I'll try to come up with a less relaxed mechanism for particle matching.
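
As a rough idea of what "less relaxed" could look like, here is a hypothetical sketch that only accepts particles from an explicit list. It is not what citeproc-js does, and KNOWN_PARTICLES is a tiny made-up sample rather than a vetted list; the function names are also invented for illustration:

```javascript
// Hypothetical sketch; not the mechanism that citeproc-js uses.
// KNOWN_PARTICLES is a small illustrative sample, not a vetted list.
const KNOWN_PARTICLES = ["in 't", "'t", "van der", "van", "de", "la"];

// Treat the typographic apostrophe (U+2019) like a plain one.
function fold(s) {
  return s.replace(/\u2019/g, "'").toLowerCase();
}

// Split a leading particle off a family-name string, but only when it
// is on the known list and is followed by whitespace and more text.
function splitLeadingParticle(family) {
  const folded = fold(family);
  // Try longer particles first so "van der" wins over "van".
  const candidates = KNOWN_PARTICLES.slice().sort((a, b) => b.length - a.length);
  for (const p of candidates) {
    if (folded.startsWith(p) && /^\s+\S/.test(folded.slice(p.length))) {
      return {
        particle: family.slice(0, p.length),
        family: family.slice(p.length).trim(),
      };
    }
  }
  // Nothing on the list matched: leave the family name untouched.
  return { particle: "", family: family };
}

console.log(splitLeadingParticle("\u2019t Hoen"));
console.log(splitLeadingParticle("Vanderbilt"));
```

With a list like this, "’t Hoen" splits into a particle and a family name, while "Vanderbilt" is left alone because "van" is not followed by whitespace. Particles that attach without a space would need separate handling.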

fbennett · November 15, 2013

That would be great.

For reference, I've prepared a test that covers the particles in Charles' list, plus the 't case: name_ParticlesDemoteNonDroppingNever. The result segment of the test shows what the processor actually produces in the current release, which is not entirely correct. I've included a note of the particles that it gets wrong by the requirements of Charles' list -- some of the items (Saint/Sainte) may be open to discussion.

aurimas · November 15, 2013

I'm a bit confused by the purpose of this test. I thought the point would be to test what citeproc-js treats as a non-dropping particle (whether it recognizes them all, and whether it avoids recognizing things that are not particles). In that case, wouldn't it make more sense to set demote-non-dropping-particle to display-and-sort? Otherwise it's not clear, for example, that La is recognized as a non-dropping particle.

fbennett · November 15, 2013

In the test, recognized particles receive separate markup (boldface for non-dropping particles, italics for dropping particles). "La" is recognized, so it has separate markup. "Pietro" is not recognized, so it is in the same boldface span as the family name part.

With display-and-sort, we would still need the markup to identify failed parses of the dropping-particle part, but if you find that sequence easier to read, we can use that for initial testing instead. Just say the word and I'll replace the test.

fbennett · November 15, 2013

(I've amended the test slightly, to remove an intervening space from "L'Familyname" -- the link above now points to the revised version. The "L'" particle then fails to parse, so I've added it to the list of suspects in the comment.)

aurimas · November 15, 2013

I see. Didn't notice the difference with the extra bold tags before. Thanks. This should be just fine.