Infinite loop during disambiguation

dstillman opened this issue 4 years ago · comments

Reported here.

This is how I'm reproducing it in Zotero:

var item1 = new Zotero.Item;
item1.fromJSON({"key":"WB338HGS","version":0,"itemType":"journalArticle","creators":[{"firstName":"Carl G.","lastName":"de Boer","creatorType":"author"},{"firstName":"John P.","lastName":"Ray","creatorType":"author"},{"firstName":"Nir","lastName":"Hacohen","creatorType":"author"},{"firstName":"Aviv","lastName":"Regev","creatorType":"author"}],"tags":[{"tag":"CRISPR/Cas9","type":1},{"tag":"Enhancers","type":1},{"tag":"Gene regulation","type":1},{"tag":"Transcriptional regulation","type":1},{"tag":"Gene expression","type":1},{"tag":"Pooled screen","type":1},{"tag":"R","type":1}],"date":"June 3, 2020","title":"MAUDE: inferring expression changes in sorting-based CRISPR screens","journalAbbreviation":"Genome Biology","pages":"134","volume":"21","issue":"1","abstractNote":"","ISSN":"1474-760X","url":"https://doi.org/10.1186/s13059-020-02046-8","DOI":"10.1186/s13059-020-02046-8","publicationTitle":"Genome Biology","libraryCatalog":"BioMed Central","accessDate":"2021-02-17T02:40:40Z","shortTitle":"MAUDE"});
await item1.saveTx();
var item2 = new Zotero.Item;
item2.fromJSON({"key":"U2L8PVTW","version":0,"itemType":"journalArticle","creators":[{"firstName":"Carl G.","lastName":"de Boer","creatorType":"author"},{"firstName":"Eeshit Dhaval","lastName":"Vaishnav","creatorType":"author"},{"firstName":"Ronen","lastName":"Sadeh","creatorType":"author"},{"firstName":"Esteban Luis","lastName":"Abeyta","creatorType":"author"},{"firstName":"Nir","lastName":"Friedman","creatorType":"author"},{"firstName":"Aviv","lastName":"Regev","creatorType":"author"}],"tags":[],"title":"Deciphering eukaryotic gene-regulatory logic with 100 million random promoters","publicationTitle":"Nature Biotechnology","rights":"2019 The Author(s), under exclusive licence to Springer Nature America, Inc.","volume":"38","issue":"1","pages":"56-65","date":"2020-01","DOI":"10.1038/s41587-019-0315-8","ISSN":"1546-1696","url":"https://www.nature.com/articles/s41587-019-0315-8","abstractNote":"","language":"en","libraryCatalog":"www.nature.com","accessDate":"2021-02-17T02:40:52Z"});
await item2.saveTx();
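// Format both items with the Elsevier Harvard style
var items = [item1, item2];
var style = Zotero.Styles.get('http://www.zotero.org/styles/elsevier-harvard');
var cslEngine = style.getCiteProc('en-US');
// The processor gets stuck in disambiguation here (see the debug output below)
var output = Zotero.Cite.makeFormattedBibliographyOrCitationList(cslEngine, items, "html");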

CSL JSON:

{"id":1,"type":"article-journal","container-title":"Genome Biology","DOI":"10.1186/s13059-020-02046-8","ISSN":"1474-760X","issue":"1","journalAbbreviation":"Genome Biology","page":"134","source":"BioMed Central","title":"MAUDE: inferring expression changes in sorting-based CRISPR screens","title-short":"MAUDE","volume":"21","author":[{"family":"Boer","given":"Carl G.","non-dropping-particle":"de"},{"family":"Ray","given":"John P."},{"family":"Hacohen","given":"Nir"},{"family":"Regev","given":"Aviv"}],"issued":{"date-parts":[["2020",6,3]]}}

{"id":2,"type":"article-journal","container-title":"Nature Biotechnology","DOI":"10.1038/s41587-019-0315-8","ISSN":"1546-1696","issue":"1","language":"en","page":"56-65","source":"www.nature.com","title":"Deciphering eukaryotic gene-regulatory logic with 100 million random promoters","volume":"38","author":[{"family":"Boer","given":"Carl G.","non-dropping-particle":"de"},{"family":"Vaishnav","given":"Eeshit Dhaval"},{"family":"Sadeh","given":"Ronen"},{"family":"Abeyta","given":"Esteban Luis"},{"family":"Friedman","given":"Nir"},{"family":"Regev","given":"Aviv"}],"issued":{"date-parts":[["2020",1]]}}

citeproc-js debug output (after changing some print() lines to CSL.debug()):

CSL: [A] === RUN ===
CSL: [B] === initVars() ===
CSL: [C] === runDisambig() ===
CSL:
[1] === incrementDisambig() ===
CSL: [2] === scanItems() ===
CSL:   [CLASH]--> 1: de Boer et al., 2020
CSL:              2: de Boer et al., 2020
CSL: [3] == disNames() ==
CSL:   ** RESOLUTION [e]: no improvement, and clashes remain
CSL:
[1] === incrementDisambig() ===
CSL:     ------------------
CSL:     incremented values
CSL:     ------------------
CSL:     | gnameset: 0
CSL:     | gname: 0
CSL:     | names value: 1
CSL:     | givens value: 1
CSL:     | namesetsMax: 0
CSL:     | namesMax: 4
CSL:     | givensMax: 2
CSL: [2] === scanItems() ===
CSL:   [CLASH]--> 1: de Boer et al., 2020
CSL:              2: de Boer et al., 2020
CSL: [3] == disNames() ==
CSL:   ** RESOLUTION [e]: no improvement, and clashes remain
CSL:
[1] === incrementDisambig() ===
CSL:     ------------------
CSL:     incremented values
CSL:     ------------------
CSL:     | gnameset: 0
CSL:     | gname: 0
CSL:     | names value: 1
CSL:     | givens value: 1
CSL:     | namesetsMax: 0
CSL:     | namesMax: 4
CSL:     | givensMax: 2
CSL: [2] === scanItems() ===
CSL:   [CLASH]--> 1: de Boer et al., 2020
CSL:              2: de Boer et al., 2020
CSL: [3] == disNames() ==
CSL:   ** RESOLUTION [e]: no improvement, and clashes remain

…and so on.

Brenton M. Wiernik commented 4 years ago

thanks

Dan Stillman commented a year ago

Thanks!

Dan Stillman · Answer 1 · Tue Mar 23 2021 15:33:53 GMT+0800 (China Standard Time)

This was caused by cb3ef75, which fixed #171. Reverting that commit avoids the infinite loop.

Brenton M. Wiernik · Answer 2 · Tue Mar 30 2021 10:34:42 GMT+0800 (China Standard Time)

@retorquere @larsgw Could one of you potentially take a look at this?

Dan Stillman · Answer 3 · Tue Mar 30 2021 11:23:13 GMT+0800 (China Standard Time)

(You meant to tag @retorquere here.)

Emiliano Heyns · Answer 4 · Tue Mar 30 2021 14:56:21 GMT+0800 (China Standard Time)

I'll take a look.

Emiliano Heyns · Answer 5 · Tue Mar 30 2021 16:26:48 GMT+0800 (China Standard Time)

Does anyone here know how the tests are put together? I'd prefer to add a failing test first.

Denis Maier · Answer 6 · Tue Mar 30 2021 16:35:55 GMT+0800 (China Standard Time)

@retorquere
You mean these tests?

>>== MODE ==>>
citation
<<== MODE ==<<

The "classic" abbreviation is applied to items of the "classic" type
based on the author and title of the item, separated by a comma. The
rendered form in the CSL layout is used as a default, but is not
relevant to the match.


>>== ABBREVIATIONS ==>>
{
    "default": {
        "classic": {
            "Bankton, Institute II": "Bankton <sc>Institute</sc> II", 
            "Blackstone, Commentaries": "Bl Comm"
        }
    }
}
<<== ABBREVIATIONS ==<<


>>== RESULT ==>>
Bl Comm; Bankton <span style="font-variant:small-caps;">Institute</span> II.
<<== RESULT ==<<

>>===== CSL =====>>
<style 
      xmlns="http://purl.org/net/xbiblio/csl"
      class="note"
      version="1.1mlz1">
  <info>
    <id />
    <title />
    <updated>2009-08-10T04:49:00+09:00</updated>
  </info>
  <citation>
    <layout suffix="." delimiter="; ">
      <group delimiter=" ">
        <names variable="author"/>
        <text variable="title"/>
      </group>
    </layout>
  </citation>
</style>
<<===== CSL =====<<


>>===== INPUT =====>>
[
    {
        "author": [
           {
              "family": "Blackstone"
           }
        ],
        "type": "classic",
        "id": "ITEM-1", 
        "title": "Commentaries"
    },
    {
        "author": [
           {
              "family": "Bankton"
           }
        ],
        "type": "classic",
        "id": "ITEM-2", 
        "title": "Institute II"
    }
]
<<===== INPUT =====<<
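
For anyone unfamiliar with the layout, a fixture like the one above can be read as a set of named sections delimited by `>>== NAME ==>>` and `<<== NAME ==<<` lines, with free-form commentary in between. Below is a minimal sketch of splitting such a file. `parseFixture` is a hypothetical helper written for illustration, and the short fixture string is made up; this is not how the citeproc-js test runner itself is implemented.

```javascript
// Hypothetical helper (not part of citeproc-js): split a fixture file into
// named sections such as MODE, RESULT, CSL and INPUT.
function parseFixture(text) {
  const sections = {};
  let current = null; // name of the section being read, or null
  let buffer = [];
  for (const line of text.split(/\r?\n/)) {
    const open = line.match(/^>>=+ ([A-Z0-9_-]+) =+>>\s*$/);
    const close = line.match(/^<<=+ ([A-Z0-9_-]+) =+<<\s*$/);
    if (current === null && open) {
      current = open[1];
      buffer = [];
    } else if (current !== null && close && close[1] === current) {
      sections[current] = buffer.join("\n").trim();
      current = null;
    } else if (current !== null) {
      buffer.push(line);
    }
    // Lines outside any section are free-form commentary and are ignored.
  }
  return sections;
}

// Made-up two-section fixture, for illustration only.
const fixture = [
  ">>== MODE ==>>",
  "citation",
  "<<== MODE ==<<",
  "",
  "Free-form description of the test.",
  "",
  ">>===== INPUT =====>>",
  '[{"id": "ITEM-1", "type": "book", "title": "Example"}]',
  "<<===== INPUT =====<<",
].join("\n");

const parsed = parseFixture(fixture);
const input = JSON.parse(parsed.INPUT);
console.log(parsed.MODE, input.length);
```

The number of `=` characters in a delimiter varies between sections in the fixture above, so the sketch accepts any count.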

Emiliano Heyns · Answer 7 · Tue Mar 30 2021 16:37:50 GMT+0800 (China Standard Time)

I think so, but I don't know how I'd use that to recreate the problem of this issue -- for starters, I don't know what the various modes mean.

Emiliano Heyns · Answer 8 · Tue Mar 30 2021 16:41:25 GMT+0800 (China Standard Time)

Or how to run a single test.

Denis Maier · Answer 9 · Tue Mar 30 2021 16:47:29 GMT+0800 (China Standard Time)

https://citeproc-js.readthedocs.io/en/latest/setting-up.html

-s testName, --single=testName    Run a single local or standard test fixture.
-g groupName, --group=groupName   Run a group of tests with the specified prefix.
-a, --all                         Run all tests.

So cslrun -s yourtest

Emiliano Heyns · Answer 10 · Tue Mar 30 2021 16:57:05 GMT+0800 (China Standard Time)

Tried that:

$ cslrun -s gh-179.txt
Rebundling processor

Error: Single test fixture must be specified as [group]_[name]

Denis Maier · Answer 11 · Tue Mar 30 2021 17:00:12 GMT+0800 (China Standard Time)

Documentation of the test layout is here.

Then, there's also an extended test format: https://github.com/Juris-M/jm-style-tests/blob/master/chicago-fullnote-bibliography/style_test001.txt

This tests all variants of a single item. (That's useful for style development, not so much for processor testing.)

Emiliano Heyns · Answer 12 · Tue Mar 30 2021 17:01:38 GMT+0800 (China Standard Time)

Oh, the underscore

Denis Maier · Answer 13 · Tue Mar 30 2021 17:02:32 GMT+0800 (China Standard Time)

> Tried that:
>
> $ cslrun -s gh-179.txt
> Rebundling processor
>
> Error: Single test fixture must be specified as [group]_[name]

Ok, then try bugs_testX.txt

Frank Bennett · Answer 14 · Tue Mar 30 2021 17:25:56 GMT+0800 (China Standard Time)

Looking at the comment on the change, it may be missing a constraint. Will take a look in a few hours, and propose a solution if any lights go on.


Emiliano Heyns · Answer 15 · Tue Mar 30 2021 18:23:12 GMT+0800 (China Standard Time)

> Then, there's also an extended test format: https://github.com/Juris-M/jm-style-tests/blob/master/chicago-fullnote-bibliography/style_test001.txt

What does QANIJJS2 refer to in that test?

Emiliano Heyns · Answer 16 · Tue Mar 30 2021 18:26:49 GMT+0800 (China Standard Time)

OK, I have a test working, I'll try to add the issue here as a test case

Denis Maier · Answer 17 · Tue Mar 30 2021 18:28:39 GMT+0800 (China Standard Time)

Don't know. Maybe the item identifier to link the test to a certain item in the public zotero group, see https://github.com/juris-m/citeproc-test-runner

CTR can build tests for individual styles, using items from a shared public library. As a first step, visit the "Jurism Test Submissions" library (below), join it, and then sync Jurism or Zotero to add the library to your local client:
https://www.zotero.org/groups/2339078/jurism_test_submissions

Emiliano Heyns · Answer 18 · Tue Mar 30 2021 18:37:32 GMT+0800 (China Standard Time)

Does this look OK? Because that does seem like it gets stuck.

Denis Maier · Answer 19 · Wed Mar 31 2021 04:07:22 GMT+0800 (China Standard Time)

> Because that does seem like it gets stuck.

What do you mean? Is that good or bad?

But yes, that looks ok. I haven't checked the details, but the overall structure looks good.

Emiliano Heyns · Answer 20 · Wed Mar 31 2021 04:29:40 GMT+0800 (China Standard Time)

Well... good in the sense that it might be a successfully failing test, capturing the error condition. But I could also have made a mistake in setting it up, so that a lock-up would be expected anyway.

I'll take it as a successfully failing test for now, and try to get it to work.

Brenton M. Wiernik · Answer 21 · Sat Jul 23 2022 20:40:48 GMT+0800 (China Standard Time)

Can we revisit trying to fix this bug?

Emiliano Heyns · Answer 22 · Sun Jul 24 2022 04:21:59 GMT+0800 (China Standard Time)

The weird thing is that the problem appears when the triggering name is the first name and is followed by at least two other names; it doesn't matter what the other names are. But if I force a stack trace, I don't see anything indicative of deep recursion. If the triggering name is not the first in the list, the problem does not appear.

Emiliano Heyns · Answer 23 · Sun Jul 24 2022 04:25:34 GMT+0800 (China Standard Time)

Oh wait, fewer names triggers a different part of the style of course.

Emiliano Heyns · Answer 24 · Sun Jul 24 2022 05:08:18 GMT+0800 (China Standard Time)

But it's stranger than I thought nonetheless; the first item has the triggering name but doesn't trigger the hang. If I have the triggering name also in the 2nd item as the first name, then it triggers the hang...

except if the two supposedly triggering names are different. If I change any one of the two to "Boeren", or I change one of the particles to "van", no hang. So the problem appears if there are at least two items cited, both with a non-dropping-particle for the 1st author, but the names have to be the same for the hang to occur.

No idea what could trigger this behavior. Is anything cached about names, or something else the engine does for names that occur multiple times?

Dan Stillman · Answer 25 · Sun Jul 24 2022 05:16:18 GMT+0800 (China Standard Time)

Why is that strange? The bug is in disambiguation code, so it would only happen with two matching names.

Emiliano Heyns · Answer 26 · Sun Jul 24 2022 05:46:53 GMT+0800 (China Standard Time)

That is a good point of course.

Still, I don't see anything that would indicate an infinite loop or recursion when I force a stack trace.

Emiliano Heyns · Answer 27 · Sun Jul 24 2022 06:03:39 GMT+0800 (China Standard Time)

Even when I comment out the code that triggers the bug, and I remove all names except "de Boer", addname is called 17 times. I just don't understand the code flow well enough to get the bigger picture.

Also, if I have this in both items

    "author": [
      {
        "family": "Boer",
        "given": "Carl G.",
        "non-dropping-particle": "de"
      },
      {
        "family": "Boer1",
        "given": "Carl G."
      }
    ],

disambiguation occurs, and the bug does not trigger.

    "author": [
      {
        "family": "Boer",
        "given": "Carl G.",
        "non-dropping-particle": "de"
      },
      {
        "family": "Boer1",
        "given": "Carl G."
      },
      {
        "family": "Boer2",
        "given": "Carl G."
      }
    ],

in both does trigger the bug.
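
The two cases above can be generated rather than hand-edited. This is a rough sketch only: `makeAuthors` and `makeItems` are hypothetical helpers, and the item ids, titles, type and year are placeholder assumptions, not the values from the actual test fixture.

```javascript
// Build the author list described above: the name with the
// non-dropping-particle first, followed by `extra` filler co-authors
// (Boer1, Boer2, ...).
function makeAuthors(extra) {
  const authors = [
    { family: "Boer", given: "Carl G.", "non-dropping-particle": "de" },
  ];
  for (let i = 1; i <= extra; i++) {
    authors.push({ family: "Boer" + i, given: "Carl G." });
  }
  return authors;
}

// Two CSL JSON items that share the same author list.
// The ids, titles, type and year are placeholders (assumptions).
function makeItems(extra) {
  return ["ITEM-1", "ITEM-2"].map((id, idx) => ({
    id,
    type: "article-journal",
    title: "Placeholder title " + (idx + 1),
    author: makeAuthors(extra),
    issued: { "date-parts": [["2020"]] },
  }));
}

const twoNames = makeItems(1);   // reported above: disambiguates, no hang
const threeNames = makeItems(2); // reported above: triggers the hang
console.log(JSON.stringify(threeNames, null, 2));
```

The printed array can be pasted into the INPUT section of a fixture to switch between the hanging and non-hanging variants.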

Dan Stillman · Answer 28 · Sun Jul 24 2022 06:10:13 GMT+0800 (China Standard Time)

You can see the relevant part of the code from the debug output I posted originally. This loop is running forever. Adding some debug output shows that this.lists[0][1].length remains at 2 and doesn't change. I haven't looked into what about the change in cb3ef75 causes that.

Emiliano Heyns · Answer 29 · Mon Jul 25 2022 00:54:11 GMT+0800 (China Standard Time)

Do you know what this.lists contains, conceptually?

Frank Bennett · Answer 30 · Mon Jul 25 2022 04:43:28 GMT+0800 (China Standard Time)

It's an array of (if memory serves) data or strings for matched and unmatched pairs. The matched pairs element (at index 0? I'd have to check) is meant to grow smaller with each trip through the disambig loop. That's not happening, so here we are, and here we are again, and here ... I'll be recovering from an all-nighter (research project support for another time zone, bread baking, it's a long story), but I've been poking around on a reported disambig failure for a while, and I'll fold this test into the mix. No promises on speed, but I'll keep typing at it.
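
To illustrate the invariant described here (the list of clashing items must shrink on each pass, otherwise the loop never ends), below is a generic sketch of a loop that bails out when a pass makes no progress. It is not citeproc-js code and not the eventual fix; `resolveClashes`, `step` and `maxStalls` are hypothetical names.

```javascript
// Hypothetical sketch: call `step` until no clashes remain, but stop once
// `maxStalls` consecutive passes fail to shrink the clash list.
function resolveClashes(clashes, step, maxStalls = 1) {
  let stalls = 0;
  let passes = 0;
  while (clashes.length > 0) {
    const next = step(clashes);
    passes += 1;
    if (next.length < clashes.length) {
      stalls = 0; // progress was made, reset the stall counter
    } else {
      stalls += 1;
      if (stalls >= maxStalls) {
        return { resolved: false, remaining: next, passes };
      }
    }
    clashes = next;
  }
  return { resolved: true, remaining: [], passes };
}

// A step that never makes progress, like the repeated passes in the debug output.
const stuck = resolveClashes([1, 2], (c) => c);
// A step that resolves one clash per pass.
const ok = resolveClashes([1, 2], (c) => c.slice(1));
console.log(stuck.resolved, ok.resolved);
```

A guard like this only turns a hang into a detectable failure; the underlying question of why the list stops shrinking still has to be answered in the disambiguation code itself.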


Frank Bennett · Answer 31 · Tue Apr 18 2023 07:38:29 GMT+0800 (China Standard Time)

This has been fixed by 5c64c35.