Multimodal datasets: misogyny, pornography, and malignant stereotypes
We examine the recently released LAION-400M dataset, a CLIP-filtered dataset of image-alt-text pairs parsed from the Common Crawl dataset.
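As a rough illustration of the filtering step, the sketch below shows how a CLIP-similarity filter of this kind might look. The model checkpoint, the keep_pair helper, and the exact call structure are illustrative assumptions, not the dataset's actual pipeline code; the 0.3 similarity threshold is the cutoff reported in the LAION-400M release.

```python
import torch
from PIL import Image
from transformers import CLIPModel, CLIPProcessor

# Hypothetical model choice for illustration; LAION-400M used OpenAI's CLIP.
model = CLIPModel.from_pretrained("openai/clip-vit-base-patch32")
processor = CLIPProcessor.from_pretrained("openai/clip-vit-base-patch32")
model.eval()

def clip_similarity(image: Image.Image, alt_text: str) -> float:
    """Cosine similarity between CLIP embeddings of an image and its alt-text."""
    inputs = processor(text=[alt_text], images=image,
                       return_tensors="pt", padding=True, truncation=True)
    with torch.no_grad():
        out = model(**inputs)
    # Normalize the projected embeddings before taking the dot product.
    img = out.image_embeds / out.image_embeds.norm(dim=-1, keepdim=True)
    txt = out.text_embeds / out.text_embeds.norm(dim=-1, keepdim=True)
    return float((img * txt).sum())

def keep_pair(image: Image.Image, alt_text: str, threshold: float = 0.3) -> bool:
    # keep_pair is a hypothetical helper; the release notes report that pairs
    # with CLIP similarity below 0.3 were discarded during dataset construction.
    return clip_similarity(image, alt_text) >= threshold
```

Note that a purely similarity-based filter of this kind only measures image-text agreement; it does not, by itself, screen out explicit or harmful content.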
We found that the dataset contains troublesome and explicit image-text pairs depicting rape, pornography, malign stereotypes, racist and ethnic slurs, and other extremely problematic content.
We outline numerous implications, concerns, and downstream harms regarding the current state of large-scale datasets, while raising open questions for various stakeholders including the AI community, regulators, policy makers, and data subjects.