cosmir / openmic-2018

Tools and tutorials for the OpenMIC-2018 dataset.


De-duping

bmcfee opened this issue · comments

Inspired by a recent tweet on de-duplication in CIFAR, I went and performed a similar analysis on our proposed train-test split. We do indeed have some dupes in our dataset, though they seem to be relatively few. Some of them do cross our (current) train-test split, though, so we'll need to do some cleanup to prevent that.

Procedure

  1. Map all tracks to VGGish quantized codewords and flatten the time dimension. This gives a 15000x1280 matrix for train and a 5000x1280 matrix for test.
  2. Perform a nearest-neighbor search from test to train using L1 distance (though any distance would work; a rough sketch follows this list).
  3. Look for any suspiciously small distances to nearest neighbors.
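Roughly, the check looks something like this. This is just a sketch: X_train / X_test and the cutoff value are illustrative, not part of the actual scripts.

import numpy as np
from sklearn.neighbors import NearestNeighbors

# X_train: (15000, 1280), X_test: (5000, 1280) flattened codeword matrices
nn = NearestNeighbors(n_neighbors=5, metric='manhattan')  # L1 distance
nn.fit(X_train)
distances, _ = nn.kneighbors(X_test)

# Column 0 holds each test track's distance to its nearest training track;
# anything far below the typical value is a duplicate candidate.
suspects = np.where(distances[:, 0] < 10000)[0]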

Results

[Figure: boxplots of distances from each test track to its 1st through 5th nearest training neighbors]

This plot shows the distribution of distances from test to 1st, 2nd, ..., 5th nearest neighbor from the training set. As you can see from the fliers on column 1, there are a few suspiciously low values. In fact, two of them are identically 0. The offending culprits are

007954_184320 -> 120195_184320
074658_372480 -> 074328_372480

Checking the metadata fields, these appear to be exact duplicates with dirty metadata.

There's one more soft near-match, which has distance 9057 (median is about 54000):

021334_11520 -> 020898_11520

which, again judging by metadata, looks like a likely match. (The artist ids are identical, but the other metadata is garbage.)

Since in all of these cases, an artist-id filter using metadata would have caught this, I'm wondering if that's sufficient to fix it? I didn't include this originally because I was under the impression that no artist had more than one track in the collection. Apparently that assumption was incorrect.
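For what it's worth, the artist-id check is cheap. A sketch (assuming meta holds the track metadata and train_ids / test_ids are the current split; the names and path are illustrative):

import pandas as pd

meta = pd.read_csv('openmic-2018-metadata.csv')  # illustrative path

train_artists = set(meta.loc[meta.track_id.isin(train_ids), 'artist_id'])
test_artists = set(meta.loc[meta.track_id.isin(test_ids), 'artist_id'])

# Any artist appearing on both sides of the split is a potential leak
print(sorted(train_artists & test_artists))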

I haven't done a similar analysis on the train-train data, but I suspect there are many more dupes than are caught here.

What do you recommend @ejhumphrey @simondurand ? I could try to resolve this in the splitter, but I'm not sure how artist conditioning will play with stratified sampling -- in sklearn splitters, we can do either stratification or grouping, but not both simultaneously. I'll take a look and see if we just got unlucky with this random split, but it'd be nice to have a proper solution in place.

all grumbling aside, yea, I thought we did an artist_id split as well ... guess that's on me.

first, scope of the problem: why not do a full 20k x 20k distance matrix, all splits aside? I'm curious to know what the full duplicate count is. same goes for artist IDs...

second, fixing it: what's the right corrective action here? I see three possibilities, in increasing correctness:

  1. leave it, flag it as a known issue, and split the data to avoid train/test contamination
  2. drop the dupes and let it be openmic-2018, containing 19980+ examples
  3. figure out which positive labels we'd lose for each example, and try to manually backfill from the disjoint set of artists not already covered in the collection.

I think you're suggesting (1) @bmcfee? I don't know which I prefer. (3) sounds like a lot of work, and I think making a point of this at ISMIR is a fine lesson to share.

first, scope of the problem: why not do a full 20k x 20k distance matrix, all splits aside? I'm curious to know what the full duplicate count is. same goes for artist IDs...

My laptop only has 8GB ram 😢

The number of unique artist ids in the metadata csv is around 7K though, so interpret that however you like. Since the number of train/test dupes occurs at a rate of around 3 in 5000 (0.06%), I'm not too concerned about it as a general problem, but it'd be nice to have it clean from the get go.

I think you're suggesting (1) @bmcfee? I don't know which I prefer. (3) sounds like a lot of work, and I think making a point of this at ISMIR is a fine lesson to share.

Indeed, I'd opt for 1. It'll take a little finessing in the splitter script, but I can try to hack that out tomorrow afternoon/evening.

Opt 2 would also be okay, but aesthetically unpleasing.

As for opt 3, I'm not sure. The dupes I'm seeing tend to have the same label patterns (though the confidences differ slightly).

okay, well, turns out #6 / #22 can be helpful for this, which is as good as it is bad.

That branch introduces two CSV files mapping sample_key to md5 hash.

  • if you look at unique md5 hashes for the audio files, you end up with 20k values.
  • if you look at unique md5 hashes for the vggish features, you end up with 19994 values.

Which means, at the VGGish level, you have 6 pairs / 12 samples that are in some sense affected.

This, of course, only catches exact matches ... so the story could be worse for near-misses. I might be able to dig into this a little bit more later, but it's back to real work now with hackdays ending.

From that branch / when it comes back to master...

import pandas as pd

# sample_key -> md5 checksum tables from that branch
df1 = pd.read_csv('tests/data/checksums/openmic-2018-audio.csv', index_col=0)
df2 = pd.read_csv('tests/data/checksums/openmic-2018-vggish.csv', index_col=0)
print(len(df1.md5.unique()), len(df2.md5.unique()))
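And pulling out the colliding sample keys from those tables is a one-liner; a sketch, assuming df2 is the VGGish checksum table loaded above:

# Any md5 shared by more than one sample_key is an exact feature duplicate
dupes = df2[df2.duplicated('md5', keep=False)].sort_values('md5')
print(dupes)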

I'd vote for 1. And if that is not too difficult to do, also adding an artist conditioning in the stratified sampling seems quite important to me. Even at the expense of a bit less even annotation distribution.

I agree, artist conditional splits with label stratification would be ideal. (Duplicate tracks within artist are a nuisance, but they're extremely rare so I can live with it.) The problem is that the current tools don't support both simultaneously, and it's not (to me) obvious how to achieve that effect. I'll have some time to think about it this evening.

Alternately, we could brute-force it by rejection-sampling stratified splits until we get one with no cross-contamination. This would be slow and painful, but in principle should work as long as there's some feasible split out there.
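A rough sketch of that rejection loop; draw_stratified_split is a hypothetical stand-in for whatever the current split script does, and the trial count is arbitrary:

for trial in range(1000):
    # hypothetical helper standing in for the existing stratified splitter
    train_ids, test_ids = draw_stratified_split(meta, random_state=trial)
    train_artists = set(meta.loc[meta.track_id.isin(train_ids), 'artist_id'])
    test_artists = set(meta.loc[meta.track_id.isin(test_ids), 'artist_id'])
    if not train_artists & test_artists:
        break  # found an artist-clean split
else:
    print('no clean split found in 1000 trials')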

When I was faced with the same problem, I just treated each group of tracks from the same artist as a single entity, weighted by its number of tracks, and used the same method. But it may not work in your label stratification framework. Also, the split might not come out at exactly 15000/5000 with this method; more like ~15000/~5000.

If brute force works, go for it. If only a handful of artists end up in both the train and test sets, we can swap them in post-processing, as it won't significantly change the overall distribution.

Brute force seems like it's not going to work. As usual, the problem appears to be this fellow:
[Image: portrait of J. S. Bach]

In [12]: meta.groupby('artist_id')['track_id'].count().sort_values().tail(40)
Out[12]: 
artist_id
10117     31
18561     32
11083     32
13061     32
8449      33
19219     33
10341     35
3931      35
10990     36
14774     37
2474      37
22014     38
1640      38
19030     39
16852     39
11099     39
6060      41
11687     42
7079      42
19282     42
3228      45
6443      45
4352      49
8568      49
22472     49
19472     50
77        50
11892     51
13262     53
11431     54
18686     59
22107     59
12351     62
2008      92
7168     112
10012    119
19461    152
15891    153
129      182
12750    227

Our man JSB up there (artist 12750) has 227 tracks. Following that is Ergo Phizmiz, Kosta T, and so on.

The odds of a randomized split getting any one of these artists entirely on one side are deep into Rosencrantz and Guildenstern territory.
Incidentally, I'm worried that artist-conditioning here will significantly skew our instrument / genre distributions between train and test. Maybe that's unavoidable.

Some more thoughts:

  1. We can't do independent, instrument-wise grouped splits (i.e., reimplement stratified sampling but with grouping in the inner loop). This fails because the group conditioning needs to be consistent across all instruments.
  2. We could do a big group split, and then let the tracks fall wherever, as @simondurand suggests. I worry that this will drift really far from our target distributions.
  3. We could propagate track labels up to the artist, do a stratified split sample as we currently have (using multiclass as a proxy for multi-label), and then drop the tracks back in. This seems like it has the best chance of working, but could fail hard for artists that have a lot of tracks with diversely annotated instruments.
  4. Use GroupShuffleSplit, ignore the label stratification, and hope the law of large numbers saves us.
  5. Something else?

I indeed also had a similar problem with this guy. I think he is linked to more than 30k songs on Spotify. Far above anyone else.

I am not entirely sure I understand the difference between 2. and 3.
What I had in mind is:
Let's say you want to assign train/test with this input:

(track_id:'001_001234', piano_yes=1, piano_no=0, violin_yes=0, ..., weight=1)

To obtain

(track_id:'001_001234', split:'train', ...) such that sum_{weight \in train} = 15000 and sum_{piano_yes \in train}/15000 ~= sum_{piano_yes \in test}/5000, ...

Then you do the same with this input:

(artist_id:'12750', piano_yes=A, ..., weight=B)
with A = sum_{track_id \in artist_id:'12750'}(piano_yes) and B = sum_{track_id \in artist_id:'12750'}(weight)

To obtain

(artist_id:'12750', split:'train', ...) such that sum_{weight \in train} ~= 15000 and sum_{piano_yes \in train}/15000 ~= sum_{piano_yes \in test}/5000, ...

I feel like this is not very clear, so do as you think is best ;)

I am not entirely sure I understand the difference between 2. and 3.

Option 2 as I understood it would ignore the labels.

Option 3 would apply the same multi-label -> majority-class reduction logic that we currently use at the track level, but at the artist level. The idea is as follows (a rough sketch follows the list):

  1. Aggregate the tracks (and annotations) within each artist. Say the confidence vectors are sum-aggregated.
  2. Map each artist id to its maximally confident label. (Or _negative if all associations are negative)
  3. Do a stratified split of artists
  4. Expand artists back into tracks
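Sketched out in code, under illustrative assumptions: labels is a sparse label table with track_id, instrument, and a signed relevance column, meta maps track_id to artist_id, and the column names and seed are placeholders rather than the real schema.

import pandas as pd
from sklearn.model_selection import train_test_split

# 1. Sum-aggregate track-level confidences up to the artist
artist_conf = (labels.merge(meta[['track_id', 'artist_id']], on='track_id')
                     .groupby(['artist_id', 'instrument'])['relevance'].sum()
                     .unstack(fill_value=0))

# 2. Reduce each artist to its maximally confident label,
#    or _negative if nothing is positive
artist_label = artist_conf.idxmax(axis=1)
artist_label[artist_conf.max(axis=1) <= 0] = '_negative'

# 3. Stratified split of artists on that proxy label
train_artists, test_artists = train_test_split(
    artist_label.index, stratify=artist_label, test_size=0.25,
    random_state=20180501)

# 4. Expand artists back into their tracks
train_ids = meta.loc[meta.artist_id.isin(train_artists), 'track_id']
test_ids = meta.loc[meta.artist_id.isin(test_artists), 'track_id']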

There are two main problems that I see here, but I think it's probably our best option:

  1. Label diversity within artist could hurt a lot at step 2, especially on rare instruments.
  2. Stratification at the level of artists, given the high skew of artist->track associations, will imbalance the results at step 3. We may not be better off than doing a blind split.

Updates:

I implemented the method described above. I tried a few random splits, and compared the resulting annotation distributions to the overall population. Here's an example of resulting probability ratios:

test_neg test_pos train_neg train_pos
instrument
accordion 1.0 1.0 1.0 1.0
banjo 1.0 0.9 1.0 1.0
bass 1.0 1.1 1.0 1.0
cello 0.9 1.1 1.0 1.0
clarinet 1.0 1.0 1.0 1.0
cymbals 1.0 1.0 1.0 1.0
drums 1.0 1.0 1.0 1.0
flute 1.0 1.0 1.0 1.0
guitar 1.0 1.0 1.0 1.0
mallet_percussion 1.0 1.0 1.0 1.0
mandolin 1.0 1.1 1.0 1.0
organ 1.1 0.8 1.0 1.0
piano 0.9 1.1 1.0 1.0
saxophone 0.9 1.1 1.0 1.0
synthesizer 1.0 1.0 1.0 1.0
trombone 1.0 1.0 1.0 1.0
trumpet 1.0 1.0 1.0 1.0
ukulele 1.0 1.1 1.0 1.0
violin 1.0 1.0 1.0 1.0
voice 1.0 1.0 1.0 1.0
_negative 1.0 1.0 1.0 1.0

In this case, the biggest deviation was +organ (test), with a probability ratio of ~0.8 relative to the full population. I've seen numbers as low as 0.7, but this gives me an idea: we can over-sample candidate splits and specify a minimum allowed probability ratio for accepting one. If we set that threshold to 0.95, I'm pretty confident we can get at least one split to come out. Of course, the sample counts don't pop out at exactly the target 15000/5000, but they're generally pretty close.
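The acceptance check itself is simple; a sketch, where pop_probs and split_probs are illustrative per-instrument label probabilities (full population vs. one side of a candidate split):

def split_ok(pop_probs, split_probs, tol=0.95):
    # Accept only if every per-instrument probability ratio stays within
    # [tol, 1/tol] of the population value
    ratio = split_probs / pop_probs
    return bool(((ratio >= tol) & (ratio <= 1.0 / tol)).all())

Candidate splits would just be drawn repeatedly and rejected until one passes for both train and test.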

I'll do a bit more testing on this, and then push up a new script today.

After a bit of feature comparison testing, I'm now running into some genuine metadata errors that break artist-conditional filtering.

track_id album_id album_title album_url artist_id artist_name
9352 69465 12352 Classwar Karaoke - 0019 Survey http://freemusicarchive.org/music/07_Elements/... 14230 07_Elements
10771 82089 13967 Elements 001-012 http://freemusicarchive.org/music/Anthony_Dono... 15896 Anthony Donovan
track_id album_id album_title album_url artist_id artist_name
10030 74328 13126 slavebation http://freemusicarchive.org/music/christian_cu... 13491 Artist Name
10063 74658 13178 Slavebation http://freemusicarchive.org/music/Christian_Cu... 15155 Christian Cummings
track_id album_id album_title album_url artist_id artist_name
13072 104838 16424 Sectioned v4.0 http://freemusicarchive.org/music/AL355I0/Sect... 18306 AL355I0
12972 103892 16322 Sectioned v4.0 http://freemusicarchive.org/music/Section_27_N... 8563 Section 27 Netlabel

It never seems to be more than 3 collisions (out of ~5000 test points), so it's not the end of the world. Still, we could avoid this by building a full feature hash, and applying some light manual correction to the artist ids prior to the artist-conditional splits. It still breaks independence assumptions for training/evaluation (which were never really legit to begin with, but c'est la vie), but at least it would rule out obvious dupes.

Ok, here's a complete list of sample-key feature collisions in the database: (EDIT: corrections)

{'007954_184320': ['120195_184320'],
 '069465_549120': ['082089_549120'],
 '074328_372480': ['074658_372480'],
 '074658_372480': ['074328_372480'],
 '082089_549120': ['069465_549120'],
 '103892_130560': ['104838_130560'],
 '104838_130560': ['103892_130560'],
 '116011_341760': ['116322_341760'],
 '116322_341760': ['116011_341760'],
 '116585_46080': ['116586_46080'],
 '116586_46080': ['116585_46080'],
 '120195_184320': ['007954_184320']}

The good news is that A) the list is small, and B) the collisions happen only in pairs, so there are no complicated higher-order cliques of colliding tracks to deal with.

At this point, what I'll do is map each colliding sample to the artist id of the lexicographically lower sample_key, and use that proxy artist id when generating the train-test split. This should fix us up going forward.
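A minimal sketch of that remapping, assuming collisions is the dict above and artist_of is an illustrative sample_key -> artist_id lookup built from the metadata:

proxy_artist = {}
for key, matches in collisions.items():
    canon = min([key] + matches)          # lexicographically lower sample_key
    proxy_artist[key] = artist_of[canon]  # artist_of: illustrative lookup

# Fall back to the real artist id for everything that doesn't collide
meta['proxy_artist_id'] = meta['sample_key'].map(
    lambda k: proxy_artist.get(k, artist_of[k]))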

Aaaaand the metadata-corrected, pseudo-artist-conditional split seems to do the trick!

[Figure: boxplots of test-to-train nearest-neighbor distances for the corrected split]
(Note: no fliers down near the origin on the 1-nn distance distribution.)

Now to script this up, throw in the rejection threshold parameter, and we should be good to go.

And here's a link to the notebook used to generate the sample deduplication index. I don't think this one needs to be converted into a script (use-once process), but it should be archived along with other development notebooks.

@ejhumphrey @simondurand where would be a good place to store the derived dedupe index? Alongside metadata / sparse labels?

New script is implemented and humming along. I can reliably get a pseudo-artist-conditional split with a tolerance of 0.85 (meaning all subsample instrument distributions satisfy 0.85 * p(Y) <= p(Y | train/test) <= p(Y) / 0.85), but getting above 0.90 seems to be challenging. Probably due to JSB.