cosmir / openmic-2018

Tools and tutorials for the OpenMIC-2018 dataset.


Script to perform GroupedShuffleSplit

ejhumphrey opened this issue

Inputs

  • metadata csv file
  • sparse label csv file
  • random state
  • prefix

Outputs

  • text file like {prefix}-train.txt containing newline-separated sample_keys
  • text file like {prefix}-test.txt containing newline-separated sample_keys
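For concreteness, here's a minimal sketch of what such a script could look like, assuming the metadata CSV has a `sample_key` column plus some grouping column (I'm using a hypothetical `artist_id` here) to keep related tracks on one side of the split; all names and defaults are placeholders until we pin down the actual formats.

```python
"""Rough sketch of the proposed split script -- not a final implementation."""
import argparse

import pandas as pd
from sklearn.model_selection import GroupShuffleSplit


def main(metadata_csv, labels_csv, random_state, prefix, test_size=0.25):
    meta = pd.read_csv(metadata_csv)   # assumed to contain `sample_key`
    groups = meta['artist_id']         # hypothetical grouping column

    # Keep all tracks from one group entirely in train or entirely in test.
    splitter = GroupShuffleSplit(n_splits=1, test_size=test_size,
                                 random_state=random_state)
    train_idx, test_idx = next(splitter.split(meta, groups=groups))

    # labels_csv would come into play once we settle how to stratify;
    # it's unused in this bare sketch.
    for name, idx in [('train', train_idx), ('test', test_idx)]:
        keys = meta['sample_key'].iloc[idx].astype(str)
        with open('{}-{}.txt'.format(prefix, name), 'w') as fh:
            fh.write('\n'.join(keys) + '\n')


if __name__ == '__main__':
    parser = argparse.ArgumentParser(description=__doc__)
    parser.add_argument('metadata_csv')
    parser.add_argument('labels_csv')
    parser.add_argument('--random-state', type=int, default=20180226)
    parser.add_argument('--prefix', default='split')
    args = parser.parse_args()
    main(args.metadata_csv, args.labels_csv, args.random_state, args.prefix)
```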

Working on this now, finding that it's a bit more subtle than anticipated.

Question for now: how do we want to define split sizes? minimum number of test positives and negatives?

Also, what are the values in the sparse_labels matrix? They appear to be scaled to ±1 (with NaNs). How should we set the threshold for what constitutes a positive or negative example?

Converted the sparse relevance scores into a dense dataframe. We shouldn't have any values in the [-0.5, 0.5] range, but we do.
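For context, the conversion is roughly this (the column names here are guesses at the sparse CSV's format):

```python
import pandas as pd

# Hypothetical column names -- adjust to the actual sparse label CSV.
sparse_labels = pd.read_csv('sparse_labels.csv')

# One row per sample_key, one column per instrument; NaN wherever a track
# was never annotated for that instrument.
label_matrix = sparse_labels.pivot_table(index='sample_key',
                                         columns='instrument',
                                         values='relevance',
                                         aggfunc='mean')
```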

>>> np.nanmin(label_matrix.abs().values, axis=0)
array([0.27133333, 0.6008    , 0.6       , 0.0028    , 0.1047    ,
       0.6012    , 0.6203    , 0.5021    , 0.0955    , 0.05155   ,
       0.07035   , 0.0555    , 0.6008    , 0.0101    , 0.625     ,
       0.01895   , 0.05555   , 0.0468    , 0.05505   , 0.6289    ])

In other news, my brain may be totally fried today, but I'm not seeing a nice way to partition the data into train-test using the standard machinery. I wonder if it would be easiest to use entrofy for this, since it directly addresses the problem we're dealing with (subset selection with a target distribution across multiple dimensions).

Had a quick chat with @justinsalamon about this -- he suggested treating the data as multi-class over positives, doing stratified sampling to get train-test, and then checking the distribution of negatives in the result. This seems pretty reasonable given our constraints, so I'll test it out.

Okay, the multi-class stratified split seems to do the trick nicely. I had to invent a new class to catch examples that are negative for all of our instrument categories, but otherwise it works as described above.
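Roughly what that looks like (the ±0.5 threshold and the way positives are collapsed to a single class are placeholders; the real version will live in the script):

```python
import numpy as np
from sklearn.model_selection import train_test_split

POSITIVE_THRESHOLD = 0.5   # assumed cutoff, pending the thresholding question above

# label_matrix: dense (n_samples, n_instruments) dataframe from earlier,
# with NaN where a track was not annotated for an instrument.
is_positive = (label_matrix > POSITIVE_THRESHOLD).values

# Collapse each track to a single pseudo-class: here just the first positive
# instrument, with a sentinel class for tracks that have no positives at all.
NO_POSITIVES = label_matrix.shape[1]
strat_class = np.where(is_positive.any(axis=1),
                       is_positive.argmax(axis=1),
                       NO_POSITIVES)

train_keys, test_keys = train_test_split(label_matrix.index,
                                         test_size=0.25,
                                         random_state=20180226,
                                         stratify=strat_class)
```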

Here's a plot of the distribution of positives and negatives for each class in the full dataset (_ALL), train, and test splits. ('value' here is the conditional distribution of positive or negative labels for each category, given that the track was annotated for that category.)

As you can see, all instrument annotation distributions are well preserved across the split with this hack, so I feel pretty good about it.
[image: per-category positive/negative label distributions for the _ALL, train, and test splits]

This split was done with a 75/25 train-test partition, i.e. 15000 training examples and 5000 test examples. It's in a notebook for now, but I'll shove it into a proper script in the near future.
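For reference, a rough sketch of that per-category check (reusing the `label_matrix`, threshold, and key lists from the sketches above; values between the two thresholds just don't count toward either side):

```python
import pandas as pd

def label_distribution(keys):
    """Fraction of positive / negative labels per instrument, conditioned on
    the track having been annotated for that instrument."""
    sub = label_matrix.loc[keys]
    annotated = sub.notnull()
    positives = sub > POSITIVE_THRESHOLD
    negatives = sub < -POSITIVE_THRESHOLD
    return pd.DataFrame({'positive': positives.sum() / annotated.sum(),
                         'negative': negatives.sum() / annotated.sum()})

# Compare the conditional distributions across the full set and both splits.
summary = pd.concat({'_ALL': label_distribution(label_matrix.index),
                     'train': label_distribution(train_keys),
                     'test': label_distribution(test_keys)}, axis=1)
```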

@bmcfee out of curiosity, because I've been facing similar problems: does entrofy (or something similar) also balance pairs (or triplets, etc.) of classes?

@julian-urbano what entrofy tries to do is find a subset of a given size that meets (or exceeds) a target proportion along each (binary) attribute. For something like tags, consider two attributes (positive and negative) for each tag; for continuous values, think of bucketing the feature into quantiles, each of which is an attribute. The reason we need two attributes for tags is that the optimization can't penalize for over-shooting, so having a separate, opposing attribute does this implicitly, and provides balance. (You can't over-shoot on +X without under-shooting on its complement -X.)
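To make that encoding concrete (this is just the input representation being described, not entrofy's API; the column names are made up):

```python
import pandas as pd

def encode_attributes(df, tag_columns, continuous_column, n_buckets=4):
    """Binary attribute matrix: two opposing columns per tag, plus one
    column per quantile bucket of a continuous field."""
    attrs = {}
    for tag in tag_columns:
        # Over-shooting '+tag' necessarily under-shoots '-tag', which is
        # what keeps the selection balanced despite a one-sided objective.
        attrs['+' + tag] = df[tag].astype(bool)
        attrs['-' + tag] = ~df[tag].astype(bool)

    # One binary attribute per quantile bucket of the continuous field.
    buckets = pd.qcut(df[continuous_column], q=n_buckets, labels=False)
    for b in range(n_buckets):
        attrs['{}_q{}'.format(continuous_column, b)] = (buckets == b)

    return pd.DataFrame(attrs, index=df.index)
```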

I think @lostanlen has used it for stratified sampling, but I forget the details of exactly how it was done.

Hi everyone. Yes, I used entrofy to select a subset of acoustic sensor recordings for annotation. The strata are

  • sensors: 1-10
  • week IDs: 1-16
  • hour IDs: 8pm-8am
  • soundscape type IDs: 1-10 (after k-means clustering with k=10)

I'm happy to answer questions.

Could it work in the following way? Say we already have a fixed subset A, and we have to add n new instances from a set B, while preserving some of the balance constraints it seems to handle. That is, does it work as a one-off thing, or could it work incrementally?

@julian-urbano that's essentially how entrofy works (incremental/greedy optimization of a submodular objective function), followed up by some Monte Carlo sampling to smooth out bad local optima.
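For intuition only (this is a toy, not entrofy's code): the greedy step repeatedly adds whichever candidate most improves an objective that rewards each attribute only up to its target count; as I understand it, the Monte Carlo part then re-runs this from different random starting points and keeps the best subset.

```python
import numpy as np

def greedy_subset(attributes, k, targets):
    """Toy greedy selection.  `attributes` is an (n_items, n_attrs) 0/1 array;
    `targets` gives the desired proportion of each attribute in a size-k subset."""
    selected, remaining = [], set(range(len(attributes)))

    def score(rows):
        # Credit each attribute only up to its target count; capping the
        # reward like this is what makes the objective (roughly) submodular.
        counts = attributes[rows].sum(axis=0)
        return np.minimum(counts, np.asarray(targets) * k).sum()

    for _ in range(k):
        best = max(remaining, key=lambda i: score(selected + [i]))
        selected.append(best)
        remaining.remove(best)
    return selected
```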

That said, I've already written up PR #25 to address this problem, and I'd prefer to stop thinking about it at this point. :)

@bmcfee by all means! just curious for my own stuff ;-)