stop() in table.scan does not seem to stop the scan

Question

mationai opened this issue 3 years ago

The following code did not stop the scan. It printed out:

{ row: 10571 } has NaN
{ row: -1 } has NaN
... (seems ~10.5k times)
{ row: -1 } has NaN

When I changed the conditional to if (row > -1 && (isNaN(d.vol) || isNaN(d.ma50))), it printed:

{ row: 10571 } has NaN 
... (seems ~10.5k times)

Please let me know if I am not using it correctly.

    table.scan((row, d, stop) => {
      if (isNaN(d.vol) || isNaN(d.ma50)) {
        console.log({row}, 'has NaN')
        stop()
      }
    })

John Leung · Answer 1 · Tue Sep 21 2021 06:44:39 GMT+0800 (China Standard Time)

Related question: I can use .filter and check whether the number of filtered rows is > 0, but I want the check to stop early. If scan isn't meant for this, is there something like .any or .findOne that I can use?

Jeffrey Heer · Answer 2 · Tue Sep 21 2021 07:56:56 GMT+0800 (China Standard Time)

Invoking stop() should prevent future iterations of the scan. If not, there may be a bug. (Though I did a quick spot check of the source code and the logic looks correct.) However, when using scan() directly you need to use columnar (not object) data access: d.vol.get(row) would be the correct access. Scan is intended primarily for internal use only, so use at your own peril!
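To illustrate the difference between object and columnar access, here is a minimal sketch. The makeTable and makeColumn helpers are hypothetical stand-ins written for this example, not the library's implementation; they only mimic the (row, data, stop) callback shape and the .get(row) accessor mentioned above. It also suggests why the original condition matched every row: isNaN(d.vol) tests the column object itself rather than a cell value.

```javascript
// Hypothetical stand-in for a column: exposes values through get(row).
function makeColumn(values) {
  return { get: (row) => values[row] };
}

// Hypothetical stand-in for a table with columnar storage and a scan method.
// This is an assumption for illustration, not the library's own code.
function makeTable(columns) {
  const names = Object.keys(columns);
  const numRows = columns[names[0]].length;
  const data = {};
  for (const name of names) data[name] = makeColumn(columns[name]);
  return {
    scan(fn) {
      let stopped = false;
      const stop = () => { stopped = true; };
      // Visit rows in order until the callback requests a stop.
      for (let row = 0; row < numRows && !stopped; ++row) {
        fn(row, data, stop);
      }
    }
  };
}

const table = makeTable({
  vol: [10, 20, NaN, 40, NaN],
  ma50: [1, 2, 3, 4, 5]
});

// A column object coerces to NaN, so this is true regardless of the data.
const columnObjectLooksNaN = isNaN(table && makeColumn([1, 2, 3]));

let firstNaNRow = -1;
let visited = 0;
table.scan((row, d, stop) => {
  visited++;
  // Read the cell value at this row before testing it.
  if (isNaN(d.vol.get(row)) || isNaN(d.ma50.get(row))) {
    firstNaNRow = row;
    stop();
  }
});

console.log({ firstNaNRow, visited, columnObjectLooksNaN });
```

In this sketch the scan visits rows 0 through 2 and then halts, since row 2 holds the first NaN.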

Meanwhile, you could use an aggregate function like sum in conjunction with a conditional (something like op.sum(op.is_nan(d.value) ? 1 : 0)) to perform checks relatively efficiently. It will perform a full table scan, but won't allocate as much memory as a filter does.

John Leung · Answer 3 · Tue Sep 21 2021 08:25:28 GMT+0800 (China Standard Time)

Thank you for the .get(row)!

I confirmed that stop() works after the table is ungrouped. Shouldn't stop() halt the scan without needing to ungroup? If not, how efficient is .slice(0).ungroup(), which I needed to make it work?

If scan is for internal use, wouldn't something like .findOne (and its boolean counterpart .any) that stops at the first occurrence be useful? No matter how efficient op.sum is, it's hard to imagine it beating a stop at the first occurrence when the dataset is big.

Jeffrey Heer · Answer 4 · Tue Sep 21 2021 08:47:03 GMT+0800 (China Standard Time)

Grouping should have no effect. Is your data filtered? That might explain why stop is failing (and would indicate a bug for the filtered case).

John Leung · Answer 5 · Tue Sep 21 2021 08:57:25 GMT+0800 (China Standard Time)

Yes, the data is filtered.

Another case for .findOne/.any: checking more than one column would require multiple op.sum calls. I think scan works best for this use case. Thank you for suggesting op.sum, though.
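As a rough sketch of what the requested helpers might look like, the following plain JavaScript stops at the first matching row and checks several columns in one pass. The names findOne and any are hypothetical (taken from the request in this thread, not from the library), and the input is assumed to be a plain object of equal-length arrays rather than a real table, so filtering and grouping are not handled.

```javascript
// Hypothetical helper: returns the index of the first row for which the
// predicate is true, or -1 if no row matches. Stops at the first match.
function findOne(columns, predicate) {
  const names = Object.keys(columns);
  const numRows = names.length ? columns[names[0]].length : 0;
  const get = (name, row) => columns[name][row];
  for (let row = 0; row < numRows; ++row) {
    if (predicate(row, get)) return row;
  }
  return -1;
}

// Hypothetical boolean counterpart: true if any row matches.
function any(columns, predicate) {
  return findOne(columns, predicate) !== -1;
}

// Example data: assumed column names vol and ma50, as in the question.
const cols = {
  vol: [10, 20, 30, NaN],
  ma50: [1, NaN, 3, 4]
};

// One predicate covers both columns, so only a single pass is needed.
const rowHasNaN = (row, get) =>
  Number.isNaN(get('vol', row)) || Number.isNaN(get('ma50', row));

const firstBadRow = findOne(cols, rowHasNaN);
const anyBadRow = any(cols, rowHasNaN);

console.log({ firstBadRow, anyBadRow });
```

Here the search ends at row 1, where ma50 is NaN, without looking at the remaining rows.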