Various index out of bound errors

Question

Various index out of bound errors

adomasven opened this issue 4 years ago · comments

I've encountered index out of bound errors when running Driver.free(), Driver.includeUncitedItems({Specific: []}) and Driver.insertReference() at various points while testing with Zotero. Note that Zotero is freeing the driver too often and I will have to debug why it does that, but it still shouldn't run into memory access issues when interacting with the driver.

I've put up a Zotero log, you can filter for CiteprocRs: lines to see CiteprocRs calls. If you have Zotero builds set up then the citeproc-rs branch is available here: https://github.com/adomasven/zotero/tree/citeproc-rs

Cormac Relf · Answer 1 · Mon Oct 05 2020 14:55:01 GMT+0800 (China Standard Time)

When you call free(), the JS object's ptr property is set to zero, so you can use this to be sure you don't double free if you really want to be sure. But yeah, your code is not doing that, it would give a different error in any case ('null pointer passed to rust'). I would advise wrapping the Driver.new in a try/catch for style parsing errors, and setting this._driver = null unconditionally after freeing it to make that if (this._driver) more solid in the face of this possibility.

In this case, it could be a stack size issue, because it's the wasm runtime itself throwing a RuntimeError. That 'out of bounds' does accurately describe a stack overflow, and I don't think the runtime can tell the difference between that and the heap. The other possibility is a real-life memory corruption/unsoundness bug I found in the nightly rust compiler last week, which I know to have been triggered by the citeproc-rs codebase, which caused an equivalent error (segfault), and which indeed occurred in code and structures that are common to those three methods or the recomputations that flow from them. The compiler is fixed now, but I think it's obvious I should change the CI publisher to use stable rust releases only, to avoid that kind of experimental work.

Upshot: I'll publish a new build with a larger default stack size than the default 1MB (which is pretty small tbh), and update the CI. From that log I can't tell which it is for certain, but I'm quite sure that one of them will fix it.

Adomas Ven · Answer 2 · Mon Oct 05 2020 15:32:57 GMT+0800 (China Standard Time)

I would advise wrapping the Driver.new in a try/catch for style parsing errors

Citeproc instantiation is wrapped a bit higher up the call tree for style parsing errors

Upshot: I'll publish a new build with a larger default stack size than the default 1MB (which is pretty small tbh), and update the CI. From that log I can't tell which it is for certain, but I'm quite sure that one of them will fix it.

For the record I've been only testing with a few citations at most, but Zotero users are not unheard of to reach a number of citations in the thousands. The size of citation data would not affect this, since it is stored on the heap, right?

Cormac Relf · Answer 3 · Tue Dec 01 2020 19:06:48 GMT+0800 (China Standard Time)

Alright, I've solved this. It was tricky, I finally nailed down that it was only happening if you used it right at Zotero startup, but not if you waited. I could then trigger it reliably. The problem was that the init: async function() { ... } was being called more than once, which I saw in the debug logs. Driver code refers to a variable that gets overwritten with a new WebAssembly.Instance in the init() function of citeproc_rs_wasm.js. So the memory gets wiped when you call that function again. This makes sense, as the driver had a correct prototype and even a real ptr field that was clearly validly constructed, it was just referring to another instance.

Phew, that's something we can fix! I also think it's something that better explains the pattern of errors in your log -- the error never triggered just once in a document, it would keep happening on each use of the driver, and on any method of your choosing.

The solution was to, inside init, store the whole initialisation routine as a Promise, and simply await the stored promise if it were called again. That way the caller still waits even if it's in-flight. A completed promise returns either immediately or on the next tick (IDK which but it doesn't matter), so if it's already done then it does nothing. With this patch, I could no longer reproduce the problem by racing at startup. Here's the compare view: adomasven/zotero@citeproc-rs...cormacrelf:citeproc-rs, but specifically adomasven/zotero@6d0d6f5 is the solution.

I did end up hardening the stack, incl complete avoidance of wasm-bindgen's error throwing thing (which leaks stack space) using my own type WasmResult<T, E> = { Ok: T } | { Err: E } + convenience methods. I think that was something that could have popped up somewhere down the line, so it was good hardening.

While I was at it, I have also improved the packaged output so you don't need the replace(...) in the build script any more, just use the files in node_modules/@citeproc-rs/wasm/_zotero instead.

PR incoming extremely shortly for the latest stuff, with which my zotero-client commits will make more sense.

Adomas Ven · Answer 4 · Tue Dec 01 2020 19:25:17 GMT+0800 (China Standard Time)

Great debugging! I didn't realize you'd be going into Zotero's codebase to fix this, glad you have found your way around it. The WASM bindings should throw if you attempt to initialize them multiple times then to prevent other clients from the same issues.

For the record I have most of the code changes in your commits in my own uncommitted branch too, and we have a pattern for doing things such as forcing functions to only run once in the Zotero codebase, so you don't have to bother with a zotero-client PR.

Cormac Relf · Answer 5 · Tue Dec 01 2020 19:34:53 GMT+0800 (China Standard Time)

Yeah, it's the generated stuff so not really code I control and upstream wouldn't want it because sometimes multiple wasm instances is what you want, but I will add a note in the docs. 👍