mlfoundations / datacomp

DataComp: In search of the next generation of multimodal datasets

Home Page: http://datacomp.ai/

Remove CSAM, if present

ahundt opened this issue

A recent report definitively found CSAM in LAION-5B, and that dataset has been taken down until the problem can be solved. The DataComp dataset is much larger. Please let us know what steps you have taken and/or plan to take to address this issue responsibly. Thanks!

https://www.404media.co/laion-datasets-removed-stanford-csam-child-abuse/

Edit: Ali Alkhatib also makes a good point that, if dataset changes are needed, removals may need to be mixed in with other simultaneous data changes so that an old version cannot simply be diffed against a new one to locate the harmful material, among other best practices (a minimal illustration follows the link below).

https://x.com/_alialkhatib/status/1737484384914092156?s=46
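To make that point concrete, here is a minimal sketch, assuming each release ships a plain list of sample UIDs; the file names are hypothetical placeholders, not actual DataComp artifacts.

```python
# Minimal sketch: if a new release only removes samples, the removed set
# falls out of a plain set difference between UID lists. File names are
# hypothetical placeholders for two dataset releases.

def load_uids(path: str) -> set[str]:
    """Load one sample UID per line."""
    with open(path) as f:
        return {line.strip() for line in f if line.strip()}

old_release = load_uids("uids_v1.txt")  # hypothetical pre-removal release
new_release = load_uids("uids_v2.txt")  # hypothetical post-removal release

# Every UID present before but absent now points directly at what was
# removed -- which is why removals may need to be mixed with other changes.
removed = old_release - new_release
print(f"{len(removed)} removed samples identifiable by diff")
```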

Thank you for the suggestion for improving DataComp. The cited study uses one of LAION’s NSFW classifiers to find CSAM content in LAION-5B. Unlike LAION-5B, we removed NSFW content when assembling DataComp, so to the best of our knowledge, the CSAM images in question are not in DataComp. We will review this issue in more depth and welcome specific suggestions for removing content from DataComp. For additional information, please see Section 3.2, Appendix E, and Appendix G of the DataComp paper, which describe our safety measures in more detail.
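For reference, the kind of assembly-time filtering described above can be sketched as follows; the metadata column names and the threshold are illustrative assumptions, not DataComp's actual schema, pipeline, or cutoff values.

```python
# Rough sketch of assembly-time NSFW filtering against per-sample classifier
# scores. The column names ("uid", "nsfw_score") and the 0.1 threshold are
# illustrative assumptions, not DataComp's actual schema or cutoff.
import pandas as pd

def drop_nsfw(metadata: pd.DataFrame, threshold: float = 0.1) -> pd.DataFrame:
    """Keep only samples whose NSFW score falls below the threshold."""
    return metadata[metadata["nsfw_score"] < threshold]

# Toy metadata standing in for one shard of the candidate pool.
metadata = pd.DataFrame({
    "uid": ["a1", "b2", "c3"],
    "nsfw_score": [0.02, 0.85, 0.07],
})
kept = drop_nsfw(metadata)
print(f"kept {len(kept)} of {len(metadata)} samples")  # kept 2 of 3 samples
```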

Thank you for your reply; I appreciate your attention to my concerns. However, note that my name already appears in the acknowledgements on page 10 of your paper: I have previously read the paper and shared several concerns about the dataset's design, construction, collection, and publication approach with another member of your team. While those concerns were noted, to the best of my knowledge they have not been addressed in practice, which would require actions like those described in the papers I reference below.

Regarding CSAM, the 404 Media article makes the very high risk explicit. I would appreciate a substantive response to the items in this issue, since I am asking what you have done, or plan to do, beyond what is already outlined in the paper.

Simply multiplying your own reported error rates by the scale of the dataset yields very large absolute numbers of potentially problematic images. Multiple papers by Birhane et al., as well as the Stanford group's work that verified the CSAM in LAION, include substantially more comprehensive evaluation steps that, according to your paper, have not been carried out.
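To make the arithmetic explicit, here is a back-of-the-envelope sketch. The pool size reflects DataComp's largest (~12.8B-sample) pool; the residual miss rate is a placeholder, not a figure taken from the DataComp paper.

```python
# Back-of-the-envelope: even a tiny residual error rate implies a large
# absolute count at DataComp's scale. The miss rate below is a placeholder,
# not a figure reported in the DataComp paper.
pool_size = 12_800_000_000   # DataComp's largest pool (~12.8B samples)
residual_miss_rate = 1e-4    # hypothetical rate of problematic images missed

expected_missed = pool_size * residual_miss_rate
print(f"~{expected_missed:,.0f} potentially problematic images")  # ~1,280,000
```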

Here is Dr. Birhane’s Google Scholar page with the relevant papers and methods:

  1. Multimodal Datasets
  2. Data-swamps
  3. LAION’s den
  4. Large image datasets

Here is the page with the Stanford group’s work detecting CSAM.

The paper Stable Bias is also likely to be relevant:
https://arxiv.org/abs/2303.11408

I would appreciate it if this matter were taken seriously and acted upon with care and attention equal to or greater than what the authors of the papers I've provided have shown. The reasons detailed in the 404 Media article make the risks, the motivation for addressing them, and the impacts crystal clear.

Thank you for your time and consideration.