alex000kim / nsfw_data_scraper

Collection of scripts to aggregate image data for the purposes of training an NSFW Image Classifier

VolleyballGirls and false positives?

wingman-jr-addon opened this issue

Hello! Thanks for putting this repo together. It's been excellent to see its widespread adoption when facing the daunting task of data collection for an NSFW model.
To that end though, I wanted to bring up the subreddit VolleyballGirls as a potential source of error. While reviewing the existing sexy category, this particular subcategory of imagery seemed to have much more mixed quality than any of the others. While there are definitely many qualifying images in it, the posts on the subreddit tend to be of four types:

  1. Legitimately NSFW content with poses by one or more volleyball players.
  2. Content that may be NSFW, but may be legitimate sportswear in the correct context.
  3. Content that is NSFW, but primarily due to the precise moment of capture or the focus of the image rather than the subject matter itself.
  4. Content that is unlikely to be considered NSFW, posted simply because the viewer likes the overall appearance of one or more players.

Types 1 and likely 2 are probably excellent to capture. Type 3 is good to capture, though it should be noted that random-cropping a legitimate volleyball picture could yield something of type 3.
Unfortunately, type 4 is not a small part of the data. Many of these are not dissimilar to local sports photos of group huddles or individual stars.

This leads to issues like the following photo being classified with high confidence by NSFW JS as sexy. (You might be interested too, @GantMan ?)

I understand that this repo is noisy by its very nature, but this particular category was enough of an outlier that I wanted to report what I was seeing. It's tough, though, because I think an important category of sports-related NSFW would be lost by removing this.

Thoughts?

Excellent info. And it's a complicated problem for sure. I've been experimenting with ways to make this even more accurate but my experiments have failed so far. I might use this subreddit as a good way to evaluate progress.
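As a rough sketch of what such an evaluation could look like: hand-label a sample of posts with the four types listed above, run the classifier on them, and compare how often each type is flagged as sexy. A high rate on type 4 would indicate false positives. Everything here is an assumption for illustration: the function name, the 0.5 threshold, and the scores are hypothetical and do not come from this repo or from NSFW JS.

```python
from collections import Counter


def sexy_rate_by_type(samples, threshold=0.5):
    """Return the fraction of images per post type whose 'sexy' score
    meets the threshold.

    samples: iterable of (post_type, sexy_score) pairs, where post_type
    is 1-4 as in the list above and sexy_score is the classifier's
    probability for the sexy class (hypothetical values here).
    """
    totals = Counter()
    flagged = Counter()
    for post_type, sexy_score in samples:
        totals[post_type] += 1
        if sexy_score >= threshold:
            flagged[post_type] += 1
    return {t: flagged[t] / totals[t] for t in sorted(totals)}


# Hypothetical hand-labelled sample; the scores are made up for illustration.
samples = [
    (1, 0.97),
    (2, 0.81),
    (3, 0.66),
    (4, 0.91),  # benign photo flagged with high confidence
    (4, 0.12),
    (4, 0.30),
    (4, 0.75),
]

rates = sexy_rate_by_type(samples)
for post_type, rate in rates.items():
    print(f"type {post_type}: {rate:.0%} flagged as sexy")
```

Tracking the type 4 rate across model versions would give one number to watch, while the rates for types 1 to 3 show whether genuine sports-related NSFW is still being caught.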

Thanks for bringing up the issue.
I removed the subreddit: b70bc0e