marcua / datools

Candidate column generation for enums (char/int) and ranges (%tiles for int/float)

marcua opened this issue

    # TODO(marcua): column_values[column] seems to have # rows as
    # output, rather than # buckets.

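See the range values below: each `bucket_maximums` list has 9 entries, one per row of the table, and most lists repeat the same maximum several times.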

    [(Column('id', INTEGER(), table=<sensor_readings>, primary_key=True, nullable=False),
      [SetValuedStatistics(distinct_values: 9),
       RangeValuedStatistics(bucket_maximums: [1, 2, 3, 4, 5, 6, 7, 8, 9])]),
     (Column('sensor_id', VARCHAR(), table=<sensor_readings>, nullable=False),
      [SetValuedStatistics(distinct_values: 3)]),
     (Column('created_at', DATETIME(), table=<sensor_readings>, nullable=False),
      [RangeValuedStatistics(bucket_maximums: ['2021-05-05 11:00:00.000000', '2021-05-05 11:00:00.000000', '2021-05-05 11:00:00.000000', '2021-05-05 12:00:00.000000', '2021-05-05 12:00:00.000000', '2021-05-05 12:00:00.000000', '2021-05-05 13:00:00.000000', '2021-05-05 13:00:00.000000', '2021-05-05 13:00:00.000000'])]),
     (Column('voltage', FLOAT(), table=<sensor_readings>, nullable=False),
      [RangeValuedStatistics(bucket_maximums: [2.3, 2.3, 2.63, 2.64, 2.65, 2.7, 2.7, 2.7, 2.7])]),
     (Column('humidity', FLOAT(), table=<sensor_readings>, nullable=False),
      [RangeValuedStatistics(bucket_maximums: [0.3, 0.3, 0.4, 0.4, 0.4, 0.5, 0.5, 0.5, 0.5])]),
     (Column('temperature', FLOAT(), table=<sensor_readings>, nullable=False),
      [RangeValuedStatistics(bucket_maximums: [34.0, 35.0, 35.0, 35.0, 35.0, 35.0, 35.0, 80.0, 100.0])])]
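
For illustration, here is a minimal sketch of the intended behavior: one maximum per bucket, with repeated maximums collapsed. It is not the datools implementation. The function name `bucket_maximums` and the `num_buckets` parameter are hypothetical. The sample temperature values are inferred from the output above, assuming each of the 9 listed maximums corresponds to one row.

```python
from typing import List, Sequence, TypeVar

T = TypeVar("T")


def bucket_maximums(values: Sequence[T], num_buckets: int) -> List[T]:
    """Return at most num_buckets distinct bucket maximums.

    Hypothetical helper: sorts the values, splits them into num_buckets
    groups of roughly equal size, and keeps the largest value of each
    group, skipping a maximum that repeats the previous one.
    """
    if num_buckets < 1:
        raise ValueError("num_buckets must be at least 1")
    ordered = sorted(values)
    n = len(ordered)
    if n == 0:
        return []
    # Never create more buckets than there are rows.
    buckets = min(num_buckets, n)
    maximums: List[T] = []
    for b in range(1, buckets + 1):
        # Index of the last row that falls into bucket b.
        last = (b * n) // buckets - 1
        value = ordered[last]
        # Collapse adjacent duplicates so each maximum appears once.
        if not maximums or maximums[-1] != value:
            maximums.append(value)
    return maximums


# Values assumed from the issue's output (one maximum per row).
ids = [1, 2, 3, 4, 5, 6, 7, 8, 9]
temperatures = [34.0, 35.0, 35.0, 35.0, 35.0, 35.0, 35.0, 80.0, 100.0]

print(bucket_maximums(ids, 3))           # [3, 6, 9]
print(bucket_maximums(temperatures, 3))  # [35.0, 100.0]
```

With three buckets, `id` yields three maximums instead of nine, and `temperature` yields two because the first two buckets share the maximum 35.0.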