marcua / datools

Better handle range-valued columns by integrating table_statistics into DIFF

marcua opened this issue

Use this: https://github.com/marcua/datools/blob/main/datools/table_statistics.py

That functionality is wrapped by the currently unused _single_column_candidate_predicates() function, which diff could use to generate range-based candidates.

  • Might need to modify the statistics code to work on arbitrary queries, not just tables.
  • Thinking about this harder, we're likely going to want more explicit direction: should each column be treated as set-valued or range-valued? Otherwise, range-valued columns will probably cause unhelpful explanations when also treated as set-valued.
  • Rewrite _range_valued_statistics and _set_valued_statistics to be public APIs and take a query instead of a table name.
    • engine instead of connection as argument
    • query instead of table as argument
  • Create a peer to on_columns called on_range_columns and return explanations for both types of column after transforming ranges to sets by bucketing.
  • test_diffs.py works with new API, but doesn't return any of the buckets as explanations. Is that by design given the data?
  • Update the Intel Sensor example created in #20 to transform range-valued attributes.
    • Make sure the example has the same results before you add range-valued columns
    • Make sure that after bucketing range-valued columns, the results are more sensible than treating everything as a set-valued attribute (I did, and it's not more sensible. I wrote up some hypotheses in the notebook.)
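
To make the "query instead of table" and "transforming ranges to sets by bucketing" items concrete, here is a minimal sketch. It is not the datools API: range_bounds and bucket_index are hypothetical names, equal-width buckets are an assumption (the issue doesn't say which bucketing scheme is used), and it takes a plain sqlite3 connection rather than a SQLAlchemy engine to stay self-contained.

```python
import sqlite3


def range_bounds(conn, query, column):
    """Return (min, max) of `column` over an arbitrary query.

    Hypothetical helper: the query is wrapped as a subquery, so it works
    for any SELECT, not just a bare table name.
    """
    return conn.execute(
        f'SELECT MIN({column}), MAX({column}) FROM ({query}) AS q'
    ).fetchone()


def bucket_index(value, low, high, num_buckets):
    """Map a numeric value to an equal-width bucket index in [0, num_buckets).

    The index can then be treated like any other set-valued attribute.
    """
    if value is None or low is None or high is None:
        return None
    if high == low:
        # Degenerate range: every value lands in the single bucket 0.
        return 0
    width = (high - low) / num_buckets
    # Clamp so that value == high falls into the last bucket.
    return min(int((value - low) / width), num_buckets - 1)
```

With bounds of (0, 20) and four buckets, a value of 5 lands in bucket 1 and the maximum value 20 is clamped into bucket 3. Values that sit exactly on a bucket boundary can be affected by floating-point rounding, so a real implementation may want explicit boundary lists instead.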

Not-so-fun fact of the day: the sqlite3 Python package won't tell you the type of the columns for a cursor.

This is explained only subtly in the docs, which state that each column's entry in description is a 7-tuple whose last six items are None.

And here's evidence, in the hopes that someone searching for this issue can save an hour or two of their lives:

    from sqlalchemy import create_engine, text

    # The duckdb URI requires the duckdb-engine package.
    for uri in ('sqlite://', 'duckdb:///:memory:'):
        engine = create_engine(uri)
        with engine.connect() as conn:
            # text() is required for raw SQL strings in SQLAlchemy 2.0.
            conn.execute(text('CREATE TABLE xyzzzzzz (id INT, blah float, blah2 text)'))
            result = conn.execute(text('SELECT * FROM xyzzzzzz'))
            print(uri, result.cursor.description)

    sqlite:// (('id', None, None, None, None, None, None), ('blah', None, None, None, None, None, None), ('blah2', None, None, None, None, None, None))
    duckdb:///:memory: [('id', 'NUMBER', None, None, None, None, None), ('blah', 'NUMBER', None, None, None, None, None), ('blah2', 'STRING', None, None, None, None, None)]
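
One possible workaround on the SQLite side, sketched under assumptions: ask SQLite itself via its built-in typeof() function. sqlite_column_types is a hypothetical helper, not part of datools or sqlite3. It reports the storage class of the values in the first row of the query (not the declared column type), and returns None for every column when the query has no rows.

```python
import sqlite3


def sqlite_column_types(conn, query):
    """Return {column_name: storage class} based on the first row of `query`.

    Hypothetical helper. Column names containing double quotes are not
    handled in this sketch.
    """
    # Column names are available from description even though types are not.
    cursor = conn.execute(f'SELECT * FROM ({query}) LIMIT 0')
    names = [column[0] for column in cursor.description]
    # typeof() returns 'integer', 'real', 'text', 'blob', or 'null' per value.
    typeof_exprs = ', '.join(f'typeof("{name}")' for name in names)
    row = conn.execute(
        f'SELECT {typeof_exprs} FROM ({query}) LIMIT 1'
    ).fetchone()
    if row is None:
        # No rows to inspect, so no storage classes can be observed.
        return {name: None for name in names}
    return dict(zip(names, row))
```

Because SQLite is dynamically typed, a single row is only a hint: other rows in the same column may hold a different storage class, so sampling more rows may be needed before deciding whether a column is range-valued.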