Juris-M / citeproc-js

A JavaScript implementation of the Citation Style Language (CSL) https://citeproc-js.readthedocs.io

Infinite loop during disambiguation

dstillman opened this issue:

Reported here.

This is how I'm reproducing it in Zotero:

var item1 = new Zotero.Item;
item1.fromJSON({"key":"WB338HGS","version":0,"itemType":"journalArticle","creators":[{"firstName":"Carl G.","lastName":"de Boer","creatorType":"author"},{"firstName":"John P.","lastName":"Ray","creatorType":"author"},{"firstName":"Nir","lastName":"Hacohen","creatorType":"author"},{"firstName":"Aviv","lastName":"Regev","creatorType":"author"}],"tags":[{"tag":"CRISPR/Cas9","type":1},{"tag":"Enhancers","type":1},{"tag":"Gene regulation","type":1},{"tag":"Transcriptional regulation","type":1},{"tag":"Gene expression","type":1},{"tag":"Pooled screen","type":1},{"tag":"R","type":1}],"date":"June 3, 2020","title":"MAUDE: inferring expression changes in sorting-based CRISPR screens","journalAbbreviation":"Genome Biology","pages":"134","volume":"21","issue":"1","abstractNote":"","ISSN":"1474-760X","url":"https://doi.org/10.1186/s13059-020-02046-8","DOI":"10.1186/s13059-020-02046-8","publicationTitle":"Genome Biology","libraryCatalog":"BioMed Central","accessDate":"2021-02-17T02:40:40Z","shortTitle":"MAUDE"});
await item1.saveTx();
var item2 = new Zotero.Item;
item2.fromJSON({"key":"U2L8PVTW","version":0,"itemType":"journalArticle","creators":[{"firstName":"Carl G.","lastName":"de Boer","creatorType":"author"},{"firstName":"Eeshit Dhaval","lastName":"Vaishnav","creatorType":"author"},{"firstName":"Ronen","lastName":"Sadeh","creatorType":"author"},{"firstName":"Esteban Luis","lastName":"Abeyta","creatorType":"author"},{"firstName":"Nir","lastName":"Friedman","creatorType":"author"},{"firstName":"Aviv","lastName":"Regev","creatorType":"author"}],"tags":[],"title":"Deciphering eukaryotic gene-regulatory logic with 100 million random promoters","publicationTitle":"Nature Biotechnology","rights":"2019 The Author(s), under exclusive licence to Springer Nature America, Inc.","volume":"38","issue":"1","pages":"56-65","date":"2020-01","DOI":"10.1038/s41587-019-0315-8","ISSN":"1546-1696","url":"https://www.nature.com/articles/s41587-019-0315-8","abstractNote":"","language":"en","libraryCatalog":"www.nature.com","accessDate":"2021-02-17T02:40:52Z"});
await item2.saveTx();
var items = [item1, item2];
var style = Zotero.Styles.get('http://www.zotero.org/styles/elsevier-harvard');
var cslEngine = style.getCiteProc('en-US');
var output = Zotero.Cite.makeFormattedBibliographyOrCitationList(cslEngine, items, "html");

CSL JSON:

{"id":1,"type":"article-journal","container-title":"Genome Biology","DOI":"10.1186/s13059-020-02046-8","ISSN":"1474-760X","issue":"1","journalAbbreviation":"Genome Biology","page":"134","source":"BioMed Central","title":"MAUDE: inferring expression changes in sorting-based CRISPR screens","title-short":"MAUDE","volume":"21","author":[{"family":"Boer","given":"Carl G.","non-dropping-particle":"de"},{"family":"Ray","given":"John P."},{"family":"Hacohen","given":"Nir"},{"family":"Regev","given":"Aviv"}],"issued":{"date-parts":[["2020",6,3]]}}
{"id":2,"type":"article-journal","container-title":"Nature Biotechnology","DOI":"10.1038/s41587-019-0315-8","ISSN":"1546-1696","issue":"1","language":"en","page":"56-65","source":"www.nature.com","title":"Deciphering eukaryotic gene-regulatory logic with 100 million random promoters","volume":"38","author":[{"family":"Boer","given":"Carl G.","non-dropping-particle":"de"},{"family":"Vaishnav","given":"Eeshit Dhaval"},{"family":"Sadeh","given":"Ronen"},{"family":"Abeyta","given":"Esteban Luis"},{"family":"Friedman","given":"Nir"},{"family":"Regev","given":"Aviv"}],"issued":{"date-parts":[["2020",1]]}}

citeproc-js debug output (after changing some print() lines to CSL.debug()):

CSL: [A] === RUN ===
CSL: [B] === initVars() ===
CSL: [C] === runDisambig() ===
CSL:
[1] === incrementDisambig() ===
CSL: [2] === scanItems() ===
CSL:   [CLASH]--> 1: de Boer et al., 2020
CSL:              2: de Boer et al., 2020
CSL: [3] == disNames() ==
CSL:   ** RESOLUTION [e]: no improvement, and clashes remain
CSL:
[1] === incrementDisambig() ===
CSL:     ------------------
CSL:     incremented values
CSL:     ------------------
CSL:     | gnameset: 0
CSL:     | gname: 0
CSL:     | names value: 1
CSL:     | givens value: 1
CSL:     | namesetsMax: 0
CSL:     | namesMax: 4
CSL:     | givensMax: 2
CSL: [2] === scanItems() ===
CSL:   [CLASH]--> 1: de Boer et al., 2020
CSL:              2: de Boer et al., 2020
CSL: [3] == disNames() ==
CSL:   ** RESOLUTION [e]: no improvement, and clashes remain
CSL:
[1] === incrementDisambig() ===
CSL:     ------------------
CSL:     incremented values
CSL:     ------------------
CSL:     | gnameset: 0
CSL:     | gname: 0
CSL:     | names value: 1
CSL:     | givens value: 1
CSL:     | namesetsMax: 0
CSL:     | namesMax: 4
CSL:     | givensMax: 2
CSL: [2] === scanItems() ===
CSL:   [CLASH]--> 1: de Boer et al., 2020
CSL:              2: de Boer et al., 2020
CSL: [3] == disNames() ==
CSL:   ** RESOLUTION [e]: no improvement, and clashes remain

…and so on.

This was caused by cb3ef75, which fixed #171. Reverting that commit avoids the infinite loop.

@retorquere @larsgw Could one of you potentially take a look at this?

(You meant to tag @retorquere here.)

I'll take a look.

Does anyone here know how the tests are put together? I'd prefer to add a failing test first.

@retorquere
You mean these tests?

>>== MODE ==>>
citation
<<== MODE ==<<

The "classic" abbreviation is applied to items of the "classic" type
based on the author and title of the item, separated by a comma. The
rendered form in the CSL layout is used as a default, but is not
relevant to the match.


>>== ABBREVIATIONS ==>>
{
    "default": {
        "classic": {
            "Bankton, Institute II": "Bankton <sc>Institute</sc> II", 
            "Blackstone, Commentaries": "Bl Comm"
        }
    }
}
<<== ABBREVIATIONS ==<<


>>== RESULT ==>>
Bl Comm; Bankton <span style="font-variant:small-caps;">Institute</span> II.
<<== RESULT ==<<

>>===== CSL =====>>
<style 
      xmlns="http://purl.org/net/xbiblio/csl"
      class="note"
      version="1.1mlz1">
  <info>
    <id />
    <title />
    <updated>2009-08-10T04:49:00+09:00</updated>
  </info>
  <citation>
    <layout suffix="." delimiter="; ">
      <group delimiter=" ">
        <names variable="author"/>
        <text variable="title"/>
      </group>
    </layout>
  </citation>
</style>
<<===== CSL =====<<


>>===== INPUT =====>>
[
    {
        "author": [
           {
              "family": "Blackstone"
           }
        ],
        "type": "classic",
        "id": "ITEM-1", 
        "title": "Commentaries"
    },
    {
        "author": [
           {
              "family": "Bankton"
           }
        ],
        "type": "classic",
        "id": "ITEM-2", 
        "title": "Institute II"
    }
]
<<===== INPUT =====<<
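
To make the layout of the fixture above concrete, here is a minimal sketch of how such a file could be split into its named sections. The helper name `parseFixture` and the regular expression are assumptions for illustration only; this is not the parser that the citeproc-js test runner actually uses, and it only relies on the `>>== NAME ==>>` / `<<== NAME ==<<` markers visible in the example.

```javascript
// Hypothetical helper (not part of citeproc-js or its test runner):
// split fixture text into { SECTION_NAME: sectionBody }.
function parseFixture(text) {
  const sections = {};
  // Opening marker: ">>", one or more "=", NAME, one or more "=", ">>".
  // Closing marker: "<<", one or more "=", the same NAME, "=", "<<".
  // Free text between sections (like the description) is ignored.
  const re = />>=+\s*([A-Z][A-Z0-9_-]*)\s*=+>>\r?\n([\s\S]*?)\r?\n?<<=+\s*\1\s*=+<</g;
  let match;
  while ((match = re.exec(text)) !== null) {
    sections[match[1]] = match[2];
  }
  return sections;
}

// A cut-down fixture with two of the sections shown above.
const example = [
  ">>== MODE ==>>",
  "citation",
  "<<== MODE ==<<",
  "",
  "Free-text description, ignored by this sketch.",
  "",
  ">>===== INPUT =====>>",
  '[{"id": "ITEM-1", "type": "classic", "title": "Commentaries"}]',
  "<<===== INPUT =====<<",
].join("\n");

const fixture = parseFixture(example);
// The INPUT section is plain CSL JSON, so it can be parsed directly.
const input = JSON.parse(fixture.INPUT);
console.log(fixture.MODE, input[0].id);
```

Read this way, a fixture is a set of labeled blocks: the mode to run, the style (CSL), the items (INPUT), and the expected output (RESULT).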

I think so, but I don't know how I'd use that to recreate the problem from this issue. For starters, I don't know what the various modes mean.

Or how to run a single test.

https://citeproc-js.readthedocs.io/en/latest/setting-up.html

-s testName, --single=testName    Run a single local or standard test fixture.
-g groupName, --group=groupName   Run a group of tests with the specified prefix.
-a, --all                         Run all tests.

So cslrun -s yourtest

Tried that:

$ cslrun -s gh-179.txt
Rebundling processor

Error: Single test fixture must be specified as [group]_[name]

Documentation of the test layout is here.

Then, there's also an extended test format: https://github.com/Juris-M/jm-style-tests/blob/master/chicago-fullnote-bibliography/style_test001.txt

This tests all variants of a single item. (That's useful for style development, not so much for processor testing.)

Oh, the underscore

> Tried that:
>
> $ cslrun -s gh-179.txt
> Rebundling processor
>
> Error: Single test fixture must be specified as [group]_[name]

Ok, then try bugs_testX.txt

> Then, there's also an extended test format: https://github.com/Juris-M/jm-style-tests/blob/master/chicago-fullnote-bibliography/style_test001.txt

What does QANIJJS2 refer to in that test?

OK, I have a test working, I'll try to add the issue here as a test case

Don't know. Maybe it's the item identifier that links the test to a certain item in the public Zotero group; see https://github.com/juris-m/citeproc-test-runner

CTR can build tests for individual styles, using items from a shared public library. As a first step, visit the "Jurism Test Submissions" library (below), join it, and then sync Jurism or Zotero to add the library to your local client:
https://www.zotero.org/groups/2339078/jurism_test_submissions

Does this look OK? Because that does seem like it gets stuck.

> Because that does seem like it gets stuck.

What do you mean? Is that good or bad?

But yes, that looks ok. I haven't checked the details, but the overall structure looks good.

Well... good in the sense that it might be a successfully failing test that captures the error condition. But I could also have made a mistake in setting it up, such that a lock-up would be expected anyway.

I'll take it as a successfully failing test for now, and try to get it to work.

Can we revisit trying to fix this bug?

The weird thing is that the problem appears when the triggering name is the first name and is followed by at least two other names; it doesn't matter what the other names are. If the triggering name is not the first in the list, the problem does not appear. But if I force a stack trace, I don't see anything indicative of deep recursion.

Oh wait, fewer names triggers a different part of the style of course.

But it's stranger than I thought nonetheless; the first item has the triggering name but doesn't trigger the hang. If I have the triggering name also in the 2nd item as the first name, then it triggers the hang...

...except if the two supposedly triggering names are different. If I change either of the two to "Boeren", or change one of the particles to "van", there is no hang. So the problem appears when at least two items are cited, both with a non-dropping-particle on the first author, and the names have to be identical for the hang to occur.

No idea what could trigger this behavior. Is anything cached about names, or something else the engine does for names that occur multiple times?

Why is that strange? The bug is in disambiguation code, so it would only happen with two matching names.

That is a good point of course.

Still, I don't see anything that would indicate an infinite loop or recursion when I force a stack trace.

Even when I comment out the code that triggers the bug, and I remove all names except "de Boer", addname is called 17 times. I just don't understand the code flow well enough to get the bigger picture.

Also, if I have this in both items

    "author": [
      {
        "family": "Boer",
        "given": "Carl G.",
        "non-dropping-particle": "de"
      },
      {
        "family": "Boer1",
        "given": "Carl G."
      }
    ],

disambiguation occurs, and the bug does not trigger. However, this

    "author": [
      {
        "family": "Boer",
        "given": "Carl G.",
        "non-dropping-particle": "de"
      },
      {
        "family": "Boer1",
        "given": "Carl G."
      },
      {
        "family": "Boer2",
        "given": "Carl G."
      }
    ],

in both does trigger the bug.
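
For anyone building a fixture from this, the following sketch assembles a two-item INPUT section around the three-author list above. Only the author list comes from this thread; the helper `makeItem`, the ids, and the titles are placeholders, and the type and year are borrowed from the CSL JSON in the original report. Whether this exact input reproduces the hang will also depend on the style and the processor version used.

```javascript
// Author list copied from the comment above.
const authors = [
  { family: "Boer", given: "Carl G.", "non-dropping-particle": "de" },
  { family: "Boer1", given: "Carl G." },
  { family: "Boer2", given: "Carl G." },
];

// Hypothetical helper: id and title are placeholders, not values
// from the real test case.
function makeItem(id, title) {
  return {
    id,
    type: "article-journal",
    title,
    // Copy each name so the two items do not share objects.
    author: authors.map((name) => ({ ...name })),
    issued: { "date-parts": [["2020"]] },
  };
}

// Both items share the same author list and year, which is what
// makes their short citations clash.
const input = [
  makeItem("ITEM-1", "Placeholder title one"),
  makeItem("ITEM-2", "Placeholder title two"),
];

// Text suitable for pasting between the INPUT markers of a fixture.
const inputSection = JSON.stringify(input, null, 4);
console.log(inputSection);
```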

You can see the relevant part of the code from the debug output I posted originally. This loop is running forever. Adding some debug output shows that this.lists[0][1].length remains at 2 and doesn't change. I haven't looked into what about the change in cb3ef75 causes that.
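
As a generic illustration of that failure mode, here is a sketch of a loop that keeps going while clashes remain, plus a guard that stops as soon as one pass leaves the state unchanged. This is not citeproc-js code: `runWithStallGuard`, the `clashes`/`names`/`givens` state shape, and both step functions are hypothetical, loosely modeled on the "incremented values" that repeat in the debug output.

```javascript
// Hypothetical sketch: repeat `step` while clashes remain, but bail
// out when a pass produces exactly the same state as the one before.
function runWithStallGuard(initialState, step) {
  let state = initialState;
  let previousKey = JSON.stringify(state);
  let iterations = 0;
  while (state.clashes.length > 0) {
    state = step(state);
    iterations += 1;
    const key = JSON.stringify(state);
    if (key === previousKey) {
      // No counter moved and the clash list is unchanged:
      // without this check the loop would never end.
      return { resolved: false, iterations, state };
    }
    previousKey = key;
  }
  return { resolved: true, iterations, state };
}

// A step that changes nothing, like the repeating debug output.
const stuckStep = (state) => ({ ...state });

// A step that adds one more name per pass and clears the clash
// once three names are shown (an arbitrary threshold for the demo).
const progressingStep = (state) => ({
  ...state,
  names: state.names + 1,
  clashes: state.names + 1 >= 3 ? [] : state.clashes,
});

const start = { clashes: [1, 2], names: 1, givens: 1 };
const stuck = runWithStallGuard(start, stuckStep);
const done = runWithStallGuard(start, progressingStep);
console.log(stuck.resolved, stuck.iterations); // false 1
console.log(done.resolved, done.iterations);   // true 2
```

A guard like this only turns a hang into a detectable failure; it does not explain why the state stops advancing in the first place.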

Do you know what this.lists contains, conceptually?

This has been fixed by 5c64c35.

Thanks!