MS-Celeb-1M WashList

This repository provides a slightly cleaner wash list for MS-Celeb-1M. The dataset is known to contain a lot of noise: some images of one celebrity are filed under other celebrities, some images are very blurry, and some are clearly not human faces.

We provide a wash list for cleaning the dataset. You can download it from ~~OneDrive~~(coming soon) or Baidu Yun.

Some Details

This list is based on XiangWu's repo, so it may be slightly cleaner than his list. The ID mapping is the same as Wu's, and each entry in the list follows this format:

ID/\$(foldername)_\$(filename)

We extracted a feature vector for every image with a CNN, then simply applied a hierarchical clustering algorithm to find the cluster containing the most images in each celebrity folder. The images in this largest cluster are regarded as the noise-free data for that celebrity. ~~If the largest cluster contains 5 or fewer images, the whole folder is dropped.~~
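The largest-cluster selection described above can be sketched roughly as follows. This is only an illustration, not the code used to build the list: the function name `largest_cluster_indices`, the average linkage, the cosine distance, and the `distance_threshold` value are all assumptions, since the exact clustering settings are not stated here.

```python
import numpy as np
from scipy.cluster.hierarchy import fcluster, linkage


def largest_cluster_indices(features, distance_threshold=0.5):
    """Return the sorted indices of the images in the largest cluster of one folder.

    `features` holds one CNN feature vector per image (rows must be non-zero).
    """
    features = np.asarray(features, dtype=float)
    if len(features) < 2:
        # linkage() needs at least two observations.
        return np.arange(len(features))
    # Assumed settings: average linkage on cosine distance.
    tree = linkage(features, method="average", metric="cosine")
    # Cut the tree so that clusters are at most `distance_threshold` apart.
    labels = fcluster(tree, t=distance_threshold, criterion="distance")
    # Cluster labels start at 1; pick the label with the most members.
    counts = np.bincount(labels)
    return np.flatnonzero(labels == counts.argmax())
```

The returned indices would be the images kept for that celebrity; everything else in the folder would be treated as noise.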

Dataset	Celebrities	Images
Original Dataset	99,892	8,456,240
XiangWu's Cleaned Dataset	79,099	5,049,824
Our Cleaned Dataset	78,579	4,621,640

Some Results

We spot-checked some folders manually and found a few typical cases:

Success Case

Before:

After:

Failure Case

Because the variance of the images in this folder is very large, the biggest cluster contains only 1/5 of the images.

Before:

After:

Overlap with LFW

We searched MS-Celeb-1M for every individual in LFW and listed each one's nearest neighbor, together with the cosine similarity, in msceleb1m_lfw_mapping_probability.txt. Each line has the following format:

ID lfw_celeb_name cosine_similarity
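A nearest-neighbor search by cosine similarity can be sketched as below. It assumes one feature vector per identity on each side, which is a simplification made for illustration; the function name `nearest_by_cosine` and both argument names are hypothetical.

```python
import numpy as np


def nearest_by_cosine(lfw_features, msceleb_features):
    """For each LFW row, return (index of the best MS-Celeb row, cosine similarity)."""
    a = np.asarray(lfw_features, dtype=float)
    b = np.asarray(msceleb_features, dtype=float)
    # L2-normalise the rows so that a dot product equals cosine similarity.
    a = a / np.linalg.norm(a, axis=1, keepdims=True)
    b = b / np.linalg.norm(b, axis=1, keepdims=True)
    similarity = a @ b.T  # shape: (num_lfw, num_msceleb)
    best = similarity.argmax(axis=1)
    return best, similarity[np.arange(len(a)), best]
```

Each returned pair would correspond to one line of the mapping file: the matched MS-Celeb-1M ID, the LFW name, and the similarity score.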

In our tests, a pair with a cosine similarity above 0.5 can be considered the same person with high probability. We therefore list the 1266 pairs whose similarity exceeds 0.5 in msceleb1m_lfw_overlaplist.txt as the overlap with LFW. You can of course choose a different threshold yourself.
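If you want to apply your own threshold to the mapping file, a small filter like the one below may help. It assumes that each line holds exactly three whitespace-separated fields in the order shown above and that LFW names contain no spaces; `filter_overlap` is a hypothetical helper, not part of this repository.

```python
def filter_overlap(lines, threshold=0.5):
    """Keep (msceleb_id, lfw_name, similarity) rows whose similarity exceeds `threshold`."""
    kept = []
    for line in lines:
        parts = line.split()
        if len(parts) != 3:
            continue  # skip blank or malformed lines
        msceleb_id, lfw_name, score = parts
        similarity = float(score)
        if similarity > threshold:
            kept.append((msceleb_id, lfw_name, similarity))
    return kept
```

An open file handle for msceleb1m_lfw_mapping_probability.txt can be passed directly as `lines`, and raising `threshold` trades fewer false matches for a smaller overlap list.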

Accuracy on LFW

We trained the same model on each of the 3 datasets; the LFW accuracies are listed below:

Dataset	Accuracy
Original Dataset	98.21%
XiangWu's Cleaned Dataset	99.42%
Our Cleaned Dataset	99.55%

Given the limitations of our work, this result may not prove anything.

Future Work

Since our CNN model is not good enough, this clean list certainly still contains some noise, and some images that are not noise were deleted. We will update the list if we get a better CNN model.
~~We will try to find the more than 1000 overlapping identities between MS-Celeb-1M and LFW.~~
Unfortunately, the mapping probability list was also generated by our CNN model, so it must contain some false negatives and false positives. We will manually check suspicious pairs.

The released list is only allowed for non-commercial use.

inlmouse / MS-Celeb-1M_WashList