cosmir / openmic-2018

Tools and tutorials for the OpenMIC-2018 dataset.


Script to perform GroupedShuffleSplit

ejhumphrey opened this issue

Inputs

  • metadata csv file
  • sparse label csv file
  • random state
  • prefix

Outputs

  • text file like {prefix}-train.txt containing newline-separated sample_keys
  • text file like {prefix}-test.txt containing newline-separated sample_keys
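For concreteness, here's a minimal sketch of what such a script could look like, assuming the metadata CSV has a `sample_key` column plus some grouping column (I'm using a hypothetical `artist_id` here) to keep related tracks on one side of the split; all names and defaults are placeholders until we pin down the actual formats.

```python
"""Rough sketch of the proposed split script -- not a final implementation."""
import argparse

import pandas as pd
from sklearn.model_selection import GroupShuffleSplit


def main(metadata_csv, labels_csv, random_state, prefix, test_size=0.25):
    meta = pd.read_csv(metadata_csv)   # assumed to contain `sample_key`
    groups = meta['artist_id']         # hypothetical grouping column

    # Keep all tracks from one group entirely in train or entirely in test.
    splitter = GroupShuffleSplit(n_splits=1, test_size=test_size,
                                 random_state=random_state)
    train_idx, test_idx = next(splitter.split(meta, groups=groups))

    # labels_csv would come into play once we settle how to stratify;
    # it's unused in this bare sketch.
    for name, idx in [('train', train_idx), ('test', test_idx)]:
        keys = meta['sample_key'].iloc[idx].astype(str)
        with open('{}-{}.txt'.format(prefix, name), 'w') as fh:
            fh.write('\n'.join(keys) + '\n')


if __name__ == '__main__':
    parser = argparse.ArgumentParser(description=__doc__)
    parser.add_argument('metadata_csv')
    parser.add_argument('labels_csv')
    parser.add_argument('--random-state', type=int, default=20180226)
    parser.add_argument('--prefix', default='split')
    args = parser.parse_args()
    main(args.metadata_csv, args.labels_csv, args.random_state, args.prefix)
```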

Working on this now, finding that it's a bit more subtle than anticipated.

Question for now: how do we want to define split sizes? minimum number of test positives and negatives?

Also, what are the values in the sparse_labels matrix? They appear to be scaled to ±1 (with NaNs). How should we set the threshold for what constitutes a positive or negative example?

Converted the sparse relevance scores into a dense dataframe. We shouldn't have any values in the [-0.5, 0.5] range, but we do.
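For context, the conversion is roughly this (the column names here are guesses at the sparse CSV's format):

```python
import pandas as pd

# Hypothetical column names -- adjust to the actual sparse label CSV.
sparse_labels = pd.read_csv('sparse_labels.csv')

# One row per sample_key, one column per instrument; NaN wherever a track
# was never annotated for that instrument.
label_matrix = sparse_labels.pivot_table(index='sample_key',
                                         columns='instrument',
                                         values='relevance',
                                         aggfunc='mean')
```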

>>> np.nanmin(label_matrix.abs().values, axis=0)
array([0.27133333, 0.6008    , 0.6       , 0.0028    , 0.1047    ,
       0.6012    , 0.6203    , 0.5021    , 0.0955    , 0.05155   ,
       0.07035   , 0.0555    , 0.6008    , 0.0101    , 0.625     ,
       0.01895   , 0.05555   , 0.0468    , 0.05505   , 0.6289    ])

In other news, my brain may be totally fried today, but I'm not seeing a nice way to partition the data into train-test using the standard machinery. I wonder if it would be easiest to use entrofy for this, since it directly addresses the problem we're dealing with (subset selection with a target distribution across multiple dimensions).

Had a quick chat with @justinsalamon about this -- he suggested treating the data as multi-class over positives, doing stratified sampling to get train-test, and then checking the distribution of negatives in the result. This seems pretty reasonable given our constraints, so I'll test it out.

Okay, the multi-class stratified split seems to do the trick nicely. I had to invent a new class to catch examples that are negative for all of our instrument categories, but otherwise it works as described above.
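Roughly what that looks like (the ±0.5 threshold and the way positives are collapsed to a single class are placeholders; the real version will live in the script):

```python
import numpy as np
from sklearn.model_selection import train_test_split

POSITIVE_THRESHOLD = 0.5   # assumed cutoff, pending the thresholding question above

# label_matrix: dense (n_samples, n_instruments) dataframe from earlier,
# with NaN where a track was not annotated for an instrument.
is_positive = (label_matrix > POSITIVE_THRESHOLD).values

# Collapse each track to a single pseudo-class: here just the first positive
# instrument, with a sentinel class for tracks that have no positives at all.
NO_POSITIVES = label_matrix.shape[1]
strat_class = np.where(is_positive.any(axis=1),
                       is_positive.argmax(axis=1),
                       NO_POSITIVES)

train_keys, test_keys = train_test_split(label_matrix.index,
                                         test_size=0.25,
                                         random_state=20180226,
                                         stratify=strat_class)
```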

Here's a plot of the distribution of positives and negatives for each class in the full dataset (_ALL), train, and test splits. ('value' here is the conditional distribution of positive or negative labels for each category, given that the track was annotated for that category.)

As you can see, all instrument annotation distributions are well preserved across the split with this hack, so I feel pretty good about it.
[image: per-category positive/negative label distributions for the _ALL, train, and test splits]

This split was done with a 75/25 train-test partition, i.e. 15000 training examples and 5000 test examples. It's in a notebook for now, but I'll shove it into a proper script in the near future.
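For reference, a rough sketch of that per-category check (reusing the `label_matrix`, threshold, and key lists from the sketches above; values between the two thresholds just don't count toward either side):

```python
import pandas as pd

def label_distribution(keys):
    """Fraction of positive / negative labels per instrument, conditioned on
    the track having been annotated for that instrument."""
    sub = label_matrix.loc[keys]
    annotated = sub.notnull()
    positives = sub > POSITIVE_THRESHOLD
    negatives = sub < -POSITIVE_THRESHOLD
    return pd.DataFrame({'positive': positives.sum() / annotated.sum(),
                         'negative': negatives.sum() / annotated.sum()})

# Compare the conditional distributions across the full set and both splits.
summary = pd.concat({'_ALL': label_distribution(label_matrix.index),
                     'train': label_distribution(train_keys),
                     'test': label_distribution(test_keys)}, axis=1)
```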

@bmcfee out of curiosity, because I've been facing similar problems: does entrofy (or something similar) also balance pairs (or triplets, etc.) of classes?

@julian-urbano what entrofy tries to do is find a subset of a given size that meets (or exceeds) a target proportion along each (binary) attribute. For something like tags, consider two attributes (positive and negative) for each tag; for continuous values, think of bucketing the feature into quantiles, each of which is an attribute. The reason we need two attributes for tags is that the optimization can't penalize for over-shooting, so having a separate, opposing attribute does this implicitly, and provides balance. (You can't over-shoot on +X without under-shooting on its complement -X.)
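To make that encoding concrete (this is just the input representation being described, not entrofy's API; the column names are made up):

```python
import pandas as pd

def encode_attributes(df, tag_columns, continuous_column, n_buckets=4):
    """Binary attribute matrix: two opposing columns per tag, plus one
    column per quantile bucket of a continuous field."""
    attrs = {}
    for tag in tag_columns:
        # Over-shooting '+tag' necessarily under-shoots '-tag', which is
        # what keeps the selection balanced despite a one-sided objective.
        attrs['+' + tag] = df[tag].astype(bool)
        attrs['-' + tag] = ~df[tag].astype(bool)

    # One binary attribute per quantile bucket of the continuous field.
    buckets = pd.qcut(df[continuous_column], q=n_buckets, labels=False)
    for b in range(n_buckets):
        attrs['{}_q{}'.format(continuous_column, b)] = (buckets == b)

    return pd.DataFrame(attrs, index=df.index)
```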

I think @lostanlen has used it for stratified sampling, but I forget the details of exactly how it was done.

Hi everyone. Yes, I used entrofy to select a subset of acoustic sensor recordings for annotation. The strata are

  • sensors: 1-10
  • week IDs: 1-16
  • hour IDs: 8pm-8am
  • soundscape type IDs: 1-10 (after k-means clustering with k=10)

I'm happy to answer questions.

Could it work in the following way? Say we already have a fixed subset A, and we have to add n new instances from a set B, while preserving some of the balance constraints it seems to handle. That is, does it work as a one-off thing, or could it work incrementally?

@julian-urbano that's essentially how entrofy works (incremental/greedy optimization of a submodular objective function), followed up by some Monte Carlo sampling to smooth out bad local optima.
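For intuition only (this is a toy, not entrofy's code): the greedy step repeatedly adds whichever candidate most improves an objective that rewards each attribute only up to its target count; as I understand it, the Monte Carlo part then re-runs this from different random starting points and keeps the best subset.

```python
import numpy as np

def greedy_subset(attributes, k, targets):
    """Toy greedy selection.  `attributes` is an (n_items, n_attrs) 0/1 array;
    `targets` gives the desired proportion of each attribute in a size-k subset."""
    selected, remaining = [], set(range(len(attributes)))

    def score(rows):
        # Credit each attribute only up to its target count; capping the
        # reward like this is what makes the objective (roughly) submodular.
        counts = attributes[rows].sum(axis=0)
        return np.minimum(counts, np.asarray(targets) * k).sum()

    for _ in range(k):
        best = max(remaining, key=lambda i: score(selected + [i]))
        selected.append(best)
        remaining.remove(best)
    return selected
```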

That said, I've already written up PR #25 to address this problem, and I'd prefer to stop thinking about it at this point. :)

@bmcfee by all means! just curious for my own stuff ;-)