marcua / datools

Talk to Eugene Wu once candidate generation works

marcua opened this issue

Notes for conversation

  • No need to read code --- I can explain it all
  • So far, I've implemented two things
    • column statistics --- given a table, identifies range- and set-valued columns. For range-valued columns, identifies percentile bucket boundaries (e.g., 3 values representing [start_of_first, start_of_second, start_of_third] bucket values). For set-valued columns, identifies the (e.g., 100) most popular values.
    • Based on the column statistics, a candidate predicate generator that, for each column statistic, generates predicates. For range-valued columns, this means predicates of the form column >= start_of_first AND column < start_of_second. For set-valued columns, this means predicates of the form column = popular_value.
  • With those primitives, we can implement the fun part! Some thoughts
    • I liked your suggestion to implement https://homes.cs.washington.edu/~suciu/main_explanation.pdf ("the UW paper") instead of https://dspace.mit.edu/bitstream/handle/1721.1/89076/scorpion-vldb13.pdf?sequence=1&isAllowed=y ("the Scorpion paper"), because I can then push as much of the heavy lifting as possible into the DB without relying on external libraries for decision trees, etc. One hitch in implementing the UW paper is that it relies heavily on cubes, which several DBs don't natively implement (especially SQLite and DuckDB, which I'm targeting for tests before supporting a broader set of databases). This isn't a huge problem: while I await a conversation with you, I'll write a wrapper that emulates grouping sets/cubes as a UNION ALL of one GROUP BY query per column combination.
    • One thing to align on early: what's the API?
      • The UW paper says the user provides a list of queries with individual aggregates that can be arithmetically combined, along with a high/low direction indicator:
        image.
      • The Scorpion paper says the user provides an annotation over the aggregate query that separates the query into a hold-out set and a set of outliers that are annotated with high/low direction indicators:
        image.
    • Once we agree on the API, I also need help deciphering the UW algorithm for a single table. Namely
      • What's the difference between the two metrics \mu_{aggr} and \mu_{interv}?
      • Can we work through implementing the cube from the UW paper on the sensor example in the Scorpion paper since I've got that "sample dataset" implemented in the tests?
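
The UNION ALL wrapper mentioned in the notes above could look roughly like the following. This is a minimal sketch, not the datools implementation: `cube_query`, the `readings` table, and its rows are hypothetical stand-ins (the rows are made up, not the Scorpion sensor data), and table and column names are interpolated directly into the SQL, so they are assumed to be trusted identifiers.

```python
import sqlite3
from itertools import combinations


def cube_query(table, group_columns, aggregate_sql):
    """Build a UNION ALL query that emulates GROUP BY CUBE(group_columns).

    Columns left out of a given grouping are emitted as NULL, which is
    how a native CUBE reports rolled-up dimensions.
    """
    selects = []
    # Start with the full grouping so the result's column names come from it.
    for size in range(len(group_columns), -1, -1):
        for subset in combinations(group_columns, size):
            projected = ", ".join(
                col if col in subset else f"NULL AS {col}"
                for col in group_columns
            )
            group_by = f" GROUP BY {', '.join(subset)}" if subset else ""
            selects.append(
                f"SELECT {projected}, {aggregate_sql} FROM {table}{group_by}"
            )
    return " UNION ALL ".join(selects)


# Hypothetical toy data: three sensors, two hours, one suspicious reading.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE readings (sensor TEXT, hour INTEGER, temp REAL)")
conn.executemany(
    "INSERT INTO readings VALUES (?, ?, ?)",
    [
        ("s1", 11, 34.0), ("s2", 11, 35.0), ("s3", 11, 35.0),
        ("s1", 12, 35.0), ("s2", 12, 35.0), ("s3", 12, 100.0),
    ],
)

sql = cube_query("readings", ["sensor", "hour"], "COUNT(*) AS n")
cube_rows = conn.execute(sql).fetchall()
for row in cube_rows:
    print(row)
```

One caveat with this approach: there is no equivalent of a native `GROUPING()` indicator, so a genuine NULL in a grouping column cannot be told apart from a rolled-up one without adding an extra marker column.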

I'll answer the rest of the Qs soon enough. I skimmed the UW paper again and am sad to say that they are limited to COUNT and COUNT(distinct) queries (Section 4.1). AFAIK, it's akin to the "incrementally removable" property from Scorpion, but executed using data cubes.

Bailis' DIFF work also sneaks in a min-support (COUNT) threshold for pruning.

In terms of the API, it depends on how we anticipate usage.

  • Scorpion assumes use through a visualization. In that case, annotating individual aggregate values (e.g., [ (val1, too high), (val2, should be 5), ... ]) is appropriate.
  • UW paper is more general and allows arbitrary expressions over a set of aggregate values. It can express the above, since stating "val2 should be 5" is the same as saying "val2 - 5" is too high, assuming val2 is positive.
  • DIFF proposes a SQL DIFF operator that takes two sub queries and a metric.
      bad_subquery DIFF good_subquery 
      ON attrs           -- conjunction of equality predicates over subset of attrs
      COMPARE BY [metric(...) > threshold]*        -- functions over aggregation results, and a min support
      MAX ORDER K        -- up to K equality predicates in conjunction
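
To make the shape of a DIFF-style call concrete, here is a rough in-memory sketch. Everything in it is an assumption for illustration: `diff_candidates` and its parameters are hypothetical names, the rows are made up, and the ratio it computes (fraction of bad rows matching divided by fraction of good rows matching) is a simple stand-in rather than any metric defined in the DIFF paper. It maps `ON attrs` to `attrs`, the support and metric thresholds to `min_support` and `min_ratio`, and `MAX ORDER K` to `max_order`.

```python
from collections import Counter
from itertools import combinations


def diff_candidates(bad_rows, good_rows, attrs, min_support, min_ratio, max_order):
    """Return (predicate, support, ratio) tuples, highest ratio first.

    A predicate is a tuple of (attribute, value) equality pairs with at
    most max_order pairs, interpreted as a conjunction.
    """
    def count_predicates(rows):
        counts = Counter()
        for row in rows:
            for order in range(1, max_order + 1):
                for subset in combinations(attrs, order):
                    counts[tuple((a, row[a]) for a in subset)] += 1
        return counts

    bad_counts = count_predicates(bad_rows)
    good_counts = count_predicates(good_rows)
    results = []
    for predicate, bad_n in bad_counts.items():
        support = bad_n / len(bad_rows)
        if support < min_support:
            continue  # prune predicates that cover too few bad rows
        good_frac = good_counts.get(predicate, 0) / len(good_rows)
        ratio = float("inf") if good_frac == 0 else support / good_frac
        if ratio > min_ratio:
            results.append((predicate, support, ratio))
    return sorted(results, key=lambda r: r[2], reverse=True)


# Hypothetical rows: "bad" readings versus "good" readings.
bad_rows = [
    {"sensor": "s3", "voltage": "low"},
    {"sensor": "s3", "voltage": "low"},
    {"sensor": "s1", "voltage": "high"},
]
good_rows = [
    {"sensor": "s1", "voltage": "high"},
    {"sensor": "s2", "voltage": "high"},
    {"sensor": "s3", "voltage": "high"},
    {"sensor": "s2", "voltage": "high"},
]

explanations = diff_candidates(
    bad_rows, good_rows, attrs=["sensor", "voltage"],
    min_support=0.5, min_ratio=1.5, max_order=2,
)
for predicate, support, ratio in explanations:
    print(predicate, round(support, 3), ratio)
```

The min-support check runs before the ratio is computed, which is where a threshold like this would save work in a real implementation that enumerates predicates lazily instead of counting them all up front.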

The API explanations were really helpful. I think something like the UW or DIFF APIs makes sense, with a slight bias toward adding min support to help speed things up.

Does only supporting COUNT seem limiting to you? The examples I have in my mind include not just COUNT but also SUM (on the path to averages), which the Scorpion and DIFF papers reference. I was initially heartened by the definition of Q in the UW paper because it suggested any agg, but you're right that by the time you hit Section 4.1, agg = COUNT :/.
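
For reference, the arithmetic behind wanting SUM alongside COUNT can be sketched as follows. This only illustrates that COUNT and SUM (and therefore AVG) over the rows that remain after removing a predicate's matches can be obtained by subtracting from the totals; it says nothing about whether the UW algorithm itself extends beyond COUNT. The function name `aggregates_without`, the `readings` table, and its rows are hypothetical, and the predicate is assumed to be trusted SQL.

```python
import sqlite3


def aggregates_without(conn, table, value_column, predicate_sql):
    """COUNT, SUM, and AVG of value_column over rows NOT matching
    predicate_sql, computed by subtraction from the table-wide totals."""
    total_n, total_sum = conn.execute(
        f"SELECT COUNT({value_column}), COALESCE(SUM({value_column}), 0) "
        f"FROM {table}"
    ).fetchone()
    match_n, match_sum = conn.execute(
        f"SELECT COUNT({value_column}), COALESCE(SUM({value_column}), 0) "
        f"FROM {table} WHERE {predicate_sql}"
    ).fetchone()
    rest_n = total_n - match_n
    rest_sum = total_sum - match_sum
    rest_avg = rest_sum / rest_n if rest_n else None
    return rest_n, rest_sum, rest_avg


# Hypothetical toy data with one suspicious reading from sensor s3.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE readings (sensor TEXT, hour INTEGER, temp REAL)")
conn.executemany(
    "INSERT INTO readings VALUES (?, ?, ?)",
    [
        ("s1", 11, 34.0), ("s2", 11, 35.0), ("s3", 11, 35.0),
        ("s1", 12, 35.0), ("s2", 12, 35.0), ("s3", 12, 100.0),
    ],
)

result = aggregates_without(conn, "readings", "temp", "sensor = 's3'")

# Cross-check against recomputing the aggregates over the remaining rows.
direct = conn.execute(
    "SELECT COUNT(temp), SUM(temp), AVG(temp) "
    "FROM readings WHERE NOT (sensor = 's3')"
).fetchone()
print(result, direct)
```

The same trick does not carry over to aggregates such as MIN or MAX, where the value over the remaining rows cannot be recovered from the total and the removed part alone.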