uwdata / arquero

Query processing and transformation of array-backed data tables.

Home Page: https://idl.uw.edu/arquero

stop() in table.scan does not seem to stop the scan

mationai opened this issue

The following code did not stop the scan. It printed out:

{ row: 10571 } has NaN
{ row: -1 } has NaN
... (seems ~10.5k times)
{ row: -1 } has NaN

When I change the conditional to if (row > -1 && (isNaN(d.vol) || isNaN(d.ma50))) {, it printed out:

{ row: 10571 } has NaN 
... (seems ~10.5k times)

Please let me know if I am not using it correctly.

    table.scan((row, d, stop) => {
      if (isNaN(d.vol) || isNaN(d.ma50)) {
        console.log({row}, 'has NaN')
        stop()
      }
    })

A related question: I could use .filter and check whether the number of filtered rows is > 0, but I want to stop early. Is there something like .any or .findOne that I can use if scan isn't meant for this?

Invoking stop() should prevent future iterations of the scan. If not, there may be a bug. (Though I did a quick spot check of the source code and the logic looks correct.) However, when using scan() directly you need to use columnar (not object) data access: d.vol.get(row) would be the correct access. Scan is intended primarily for internal use only, so use at your own peril!
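To illustrate the columnar access pattern described above, here is a minimal, self-contained sketch. It does not use Arquero: `column` and `scan` are hypothetical stand-ins that only model a callback receiving `(row, data, stop)`, where each entry of `data` is a column object with a `get(row)` method. It also shows that `isNaN` applied to a column object itself is always `true`, which may be why the original condition matched on every row.

```javascript
// Hypothetical stand-in for a column: wraps an array and exposes get(row).
const column = values => ({ get: row => values[row] });

// Simplified model of a scan with early exit. This is NOT Arquero's
// implementation, only an illustration of the callback contract.
function scan(nrows, data, fn) {
  let stopped = false;
  const stop = () => { stopped = true; };
  for (let row = 0; row < nrows && !stopped; ++row) {
    fn(row, data, stop);
  }
}

const data = {
  vol: column([10, 20, NaN, 40]),
  ma50: column([1, 2, 3, NaN])
};

let firstNaNRow = -1;
scan(4, data, (row, d, stop) => {
  // Read the cell value through the column, not the column object itself.
  if (Number.isNaN(d.vol.get(row)) || Number.isNaN(d.ma50.get(row))) {
    firstNaNRow = row;
    stop();
  }
});

console.log(firstNaNRow);     // 2
console.log(isNaN(data.vol)); // true: an object coerces to NaN
```

With object-style access (`isNaN(d.vol)`), the test is applied to the column object rather than to a cell value, so it cannot detect which rows actually contain NaN.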

Meanwhile, you could use an aggregate function like sum in conjunction with a conditional (something like op.sum(op.is_nan(d.value) ? 1 : 0)) to perform checks relatively efficiently. It will perform a full table scan, but won't allocate as much memory as a filter does.

Thank you for the .get(row)!

I confirmed that stop works after the table is ungrouped. Shouldn't stop() just stop the scan without needing to ungroup? If not, how efficient is .slice(0).ungroup(), which is needed to make it work?

If scan is for internal use, wouldn't something like .findOne (and a boolean counterpart, .any) that stops at the first occurrence be useful? No matter how efficient op.sum is, it's hard to imagine it beating a stop at the first occurrence if the dataset is big.
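The difference between the two approaches can be sketched on a plain array. The helpers below are hypothetical and not part of Arquero: `countNaN` models a full-pass aggregate (in the spirit of the op.sum suggestion), while `findFirstNaN` models a .findOne-style early exit. Each reports how many values it visited.

```javascript
// Hypothetical helpers over a plain array; not Arquero APIs.

// Full pass: counts every NaN, always visiting all values.
function countNaN(values) {
  let count = 0;
  let visited = 0;
  for (const v of values) {
    visited++;
    if (Number.isNaN(v)) count++;
  }
  return { count, visited };
}

// Early exit: returns the first row holding NaN, or -1 if there is none.
function findFirstNaN(values) {
  let visited = 0;
  for (let row = 0; row < values.length; ++row) {
    visited++;
    if (Number.isNaN(values[row])) return { row, visited };
  }
  return { row: -1, visited };
}

const values = [1, NaN, 3, 4, 5, NaN];
const full = countNaN(values);      // { count: 2, visited: 6 }
const early = findFirstNaN(values); // { row: 1, visited: 2 }
console.log(full, early);
```

The early-exit version does less work only when a match occurs before the end; if no value is NaN, both visit every value.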

Grouping should have no effect. Is your data filtered? That might explain why stop is failing (and would indicate a bug for the filtered case).

Yes, the data is filtered.

Another case for .findOne/.any: checking more than one column would require multiple op.sum calls. I think scan works best for this use case. Thank you for suggesting op.sum, though.