Better handle range-valued columns by integrating table_statistics into DIFF
marcua opened this issue
Use this: https://github.com/marcua/datools/blob/main/datools/table_statistics.py
That functionality is wrapped by the currently unused code at `datools/datools/explanation/algorithms.py` (line 32 in 18cdbc4, `diff`) to generate range-based candidates.
- Might need to modify the statistics code to work on arbitrary queries, not just tables.
- Thinking about this harder, we're likely going to want more explicit direction: should each column be treated as set-valued or range-valued? Otherwise, range-valued columns will probably cause unhelpful explanations when also treated as set-valued.
- Rewrite `_range_valued_statistics` and `_set_valued_statistics` to be public APIs and take a query instead of a table name.
  - engine instead of connection as argument
  - query instead of table as argument
- Create a peer to `on_columns` called `on_range_columns` and return explanations for both types of column after transforming ranges to sets by bucketing.
  - test_diffs.py works with new API, but doesn't return any of the buckets as explanations. Is that by design given the data?
- Update the Intel Sensor example created in #20 to transform range-valued attributes.
  - Make sure the example has the same results before you add range-valued columns.
  - Make sure that after bucketing range-valued columns, the results are more sensible than treating everything as a set-valued attribute. (I did, and it's not more sensible. I wrote up some hypotheses in the notebook.)
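For anyone following along, here is a minimal sketch of what "transforming ranges to sets by bucketing" over an arbitrary query could look like. It is not the datools implementation: `bucket_expression` is a hypothetical helper, the equi-width bucketing is just one possible choice, and the `readings` table is made-up data. The caller explicitly names the column to treat as range-valued, and the query is wrapped as a subquery so the statistics don't depend on a table name.

```python
import sqlite3


def bucket_expression(conn, query, column, num_buckets=4):
    """Build a SQL CASE expression that maps `column` to equi-width buckets.

    Hypothetical helper (not part of datools). `query` and `column` are
    interpolated into SQL, so they must come from a trusted caller.
    """
    low, high = conn.execute(
        f"SELECT MIN({column}), MAX({column}) FROM ({query}) AS q"
    ).fetchone()
    if low is None:
        raise ValueError(f"{column} has no non-NULL values to bucket")
    width = (high - low) / num_buckets
    edges = [low + i * width for i in range(num_buckets)] + [high]
    clauses = []
    for i in range(num_buckets):
        # Only the last bucket is closed on the right, so MAX is included.
        last = i == num_buckets - 1
        upper_op, close = ("<=", "]") if last else ("<", ")")
        clauses.append(
            f"WHEN {column} >= {edges[i]} AND {column} {upper_op} {edges[i + 1]} "
            f"THEN '[{edges[i]}, {edges[i + 1]}{close}'"
        )
    return "CASE " + " ".join(clauses) + " END"


# Made-up data standing in for a range-valued sensor attribute.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE readings (sensor TEXT, temperature REAL)")
conn.executemany(
    "INSERT INTO readings VALUES (?, ?)",
    [("a", 0.0), ("a", 10.0), ("b", 25.0), ("b", 40.0)],
)
query = "SELECT * FROM readings"
expr = bucket_expression(conn, query, "temperature", num_buckets=4)
rows = conn.execute(
    f"SELECT temperature, {expr} AS temperature_bucket "
    f"FROM ({query}) AS q ORDER BY temperature"
).fetchall()
for row in rows:
    print(row)
```

The resulting `temperature_bucket` column is set-valued, so it could in principle be passed to the same code path as any other set-valued column. Quantile-based boundaries might behave better on skewed data; that trade-off is not explored here.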
Not-so-fun fact of the day: the `sqlite3` Python package won't tell you the type of the columns for a cursor. This ends up being subtly explained in the docs, which say it returns `None` for items 2-7 of each column's `description` tuple.
And here's evidence, in the hopes that someone searching for this issue can save an hour or two of their lives:
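```python
from sqlalchemy import create_engine, text

# The duckdb URI requires the duckdb-engine package.
for uri in ('sqlite://', 'duckdb:///:memory:'):
    engine = create_engine(uri)
    with engine.connect() as conn:
        # text() is required by SQLAlchemy 2.x and also accepted by 1.4.
        conn.execute(text('CREATE TABLE xyzzzzzz (id INT, blah float, blah2 text)'))
        result = conn.execute(text('SELECT * FROM xyzzzzzz'))
        print(uri, result.cursor.description)
```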
sqlite:// (('id', None, None, None, None, None, None), ('blah', None, None, None, None, None, None), ('blah2', None, None, None, None, None, None))
duckdb:///:memory: [('id', 'NUMBER', None, None, None, None, None), ('blah', 'NUMBER', None, None, None, None, None), ('blah2', 'STRING', None, None, None, None, None)]
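One possible workaround, sketched under assumptions rather than taken from datools: since `description` carries no type information on SQLite, sample some rows from the query and look at the Python types of the returned values. `infer_column_kinds` is a hypothetical helper, and the `'NUMBER'`/`'STRING'` labels simply mirror what DuckDB reports above. Note that it needs at least one non-NULL value per column, so it would report `None` for the empty table in the example above.

```python
import sqlite3


def infer_column_kinds(conn, query, sample_size=100):
    """Guess 'NUMBER' or 'STRING' per column by sampling rows of `query`.

    Hypothetical workaround (not part of sqlite3 or datools). Columns with
    no non-NULL sampled values map to None. `query` is interpolated into
    SQL, so it must come from a trusted caller.
    """
    cursor = conn.execute(
        f"SELECT * FROM ({query}) AS q LIMIT {int(sample_size)}"
    )
    names = [d[0] for d in cursor.description]
    kinds = {name: None for name in names}
    for row in cursor:
        for name, value in zip(names, row):
            # Skip NULLs; once a column has shown a non-number it stays STRING.
            if value is None or kinds[name] == "STRING":
                continue
            kinds[name] = "NUMBER" if isinstance(value, (int, float)) else "STRING"
    return kinds


conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE xyzzzzzz (id INT, blah float, blah2 text)")
empty_kinds = infer_column_kinds(conn, "SELECT * FROM xyzzzzzz")
conn.executemany(
    "INSERT INTO xyzzzzzz VALUES (?, ?, ?)",
    [(1, 2.5, "x"), (2, None, "y")],
)
kinds = infer_column_kinds(conn, "SELECT * FROM xyzzzzzz")
print(kinds)
```

Sampling can misjudge a column whose first `sample_size` rows are unrepresentative, which is another reason an explicit set-valued versus range-valued argument may be preferable to inference.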