zotero / citeproc-rs

CSL processor in Rust.

Home Page:https://cormacrelf.github.io/citeproc-wasm-demo/

Geek Repo:Geek Repo

Github PK Tool:Github PK Tool

Demo disambiguation favors disambiguate-add-names

fbennett opened this issue · comments

In the current disambiguation demo, citeproc-rs is adding names where disambiguation can be done by adding names or initials to existing partners. Steps to reproduce:

  1. Align the names, changing "Amadeus" to "Ariadne" to yield:

    K. Camembert et al. / K. Camembert et al.

  2. Add a char to the first given name in the first item to yield:

    Kurtz Camembert et al. / Kurt Camembert et al.

  3. Add a char to the last family name in the first item.

Expected
Further disambiguation is unnecessary, and with the style set to etal-use-first="1" and etal-min="1" or etal-min="2", a single name should be preferred. The result should remain:

Kurtz Camembert et al. / Kurt Camembert et al.

Åctual Result
The first given name is reverted to an initial, and two names are added:

K. Camembert, A. Rossi, I. Irrelevantz / K. Camembert, A. Rossi, I. Irrelevant

It favours disambiguate-add-names because the spec demands adding names first, and stopping if that works. In this case, it does. “Adding names” here does not start with the previous attempt at disambiguation (ie Kurt/Kurtz), it starts from zero every time, which is the only deterministic approach. So it looks like, for the first reference:

K. Camembert et al (ambiguous, of degree 2 i.e. could have been produced by either citekey/citekey2)

Perform step 1 of disambiguation, add names until no further names yield a reduction in degree. It expands out to

K. Camembert, A. Rossi, I. Irrelevantz (unambiguous, degree 1)

And stops as successful. The same happens to the second cite, but it resolves to the second reference/citekey2 without the z. If adding names does not reduce the degree, then the name count bumper is reset to the last value that reduced it, or 0 if it was never reduced. Hence disambiguate-add-givenname can start with any number of additional names, as long as adding them reduced the degree.

(You’ll note we use different language to describe the process, and citeproc-rs doesn’t have “sets of ambiguous cites”, only sets of references that could have produced a cite. If you want me to expound a little, then I can, but I think it’s clear what’s going on in this example.)

Sticking with Kurt/Kurtz is the expected behaviour if you have givenname-disambiguation-rule="all-names", and this works. This is because those given names have been globally expanded before add-names comes into play. But the demo and the default are by-cite, which means the givenname disambiguation only occurs in step 2 of the process described in the spec, and not globally or at any point before add-names.

You're right that it follows the specification exactly, but it's not what a copy editor would expect (and not what citeproc-js currently produces), and I'm pretty sure that users will complain.

Are you saying citeproc-js reverses the first two disambiguation passes, because it’s simply better?

Yes, with by-cite it attempts given name addition/expansion before adding a name, to keep the cite as compact as possible.

Only with by-cite? I don’t immediately see why the others wouldn’t also benefit, e.g. primary-name getting the second name Kurtzed. Right?

By second name, do you mean the A.Rossi names in the example? My understanding has been that styles that apply primary-name disambiguation do that and only that. The ones that I have seen mostly have year-suffix as a fallback, in case the limited disambiguation by name fails to resolve everything.

(Following up on the speculation above, I checked the CSL repo. Out of 217 styles that use primary-name, just one (swedish-legal.csl) uses it without year-suffix as a disambiguation fallback. There were previously two such styles, but the other international-journal-of-spatial-data-infrastructures-research) was amended on August 22 last year specifically to add the attribute.)

Ah yes, my mistake, I think you’re right only by-cite needs it. In all other GNDRs, names that would have initials and/or given names added during cite disambiguation have already been added globally. I think this also means that the description of all-names as being for both name and cite disambiguation is redundant. If all names are as unique as possible already, you can’t disambiguate a cite by expanding them.

The upshot is you can always choose exactly one never have to do more than one of add-names and add-givennames, and you choose which based only on the GNDR: by-cite needs add-givennames, which includes an add-names pass; for the rest, add-givennames is guaranteed to do nothing, so you can skip it. This will make the program faster!

I think that's right, yes. Impressive and handy demo, by the way.

Thanks for this discussion folks! Did either of you happen to write tests for the disambiguation behavior?

No but I'll take it

Cool. Did you see the clarified spec linked above?

https://gist.github.com/cormacrelf/84bc9592cd10602d05a52bed938adece

While I was at it I wrote a better tool to convert between the two test case formats in case anyone feels like going full YAML anytime. I'll publish that tomorrow, it does have a caveat where it writes out fully parsed names and also won't handle ALL the weird sections/modes, but at least it does multiline strings for the csl: | field. I think you guys have some new sections (VERSION? I thought CSL had a version field for exactly that! 😄 ) that you can figure out.

One note is that (it may be my setup that's wrong, but) citeproc-js fails that test, by not writing out the et al.. Source is in https://github.com/cormacrelf/citeproc-rs/tree/master/citeproc-js-runner, I already updated my citeproc/citeproc-test-runner packages.

There is nothing wrong with your setup, and the test fails here as well. At first I thought, "WTF?" but on a closer look I realized it's caused by et-al-min="1"/et-al-use-first="1". With et-al-min="2" the logical outcome is the same, but "et al." is included. It's by intention, and me being off-spec bad boy again. Suppressing "et al." in the former case was a hack to support bibliographies of individual members of staff. It should not be default behavior, and the test should pass. Thanks for flagging it.