yiminglin-ai / LAION-Face

The human face subset of LAION-400M for large-scale face pretraining.

LAION-Face

Introduction

LAION-Face is the human face subset of LAION-400M and consists of 50 million image-text pairs. Face detection is used to find the images that contain faces. Besides the full 50 million set (LAION-Face 50M), we also provide a 20 million subset (LAION-Face 20M) for fast evaluation.

LAION-Face was first used as the training set of FaRL, which provides powerful pre-trained transformer backbones for face analysis tasks.

We provide only the list of IDs of the images that contain human faces; you need to download the images yourself by following the instructions below. The face detection metadata is provided separately, as described in the section on downloading it.

Setup

pip install -r requirements.txt

We need pyarrow to read and write parquet files, and img2dataset to download the images.

Download the metadata

We provide the list of sample_id values on Hugging Face.

Download and convert the metadata with the following commands.

wget -l1 -r --no-parent https://the-eye.eu/public/AI/cah/laion400m-met-release/laion400m-meta/
mv the-eye.eu/public/AI/cah/laion400m-met-release/laion400m-meta/ .
wget https://huggingface.co/datasets/FacePerceiver/laion-face/resolve/main/laion_face_ids.pth
python convert_parquet.py ./laion_face_ids.pth ./laion400m-meta ./laion_face_meta

Download the images with img2dataset

When the metadata is ready, you can start downloading the images.

bash download.sh ./laion_face_meta ./laion_face_data

Please be patient: this command may run for several days and needs about 2 TB of disk space. It downloads the 50 million image-text pairs as 32 parts.

  • To use LAION-Face 50M, use all 32 parts.
  • To use LAION-Face 20M, use only the following parts:
    0,2,5,8,13,15,17,18,21,22,24,25,28
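
The part selection above can be written down as a small helper. This is a minimal sketch: the `PARTS_20M` list is copied from the list above, while the names `select_parts`, `PARTS_20M`, `NUM_PARTS` and the `subset` argument are hypothetical and not part of the repository. How a part index maps to folders or files on disk depends on download.sh, so the sketch only returns the indices.

```python
from typing import List

# Part indices of the LAION-Face 20M subset, copied from the list above.
PARTS_20M = [0, 2, 5, 8, 13, 15, 17, 18, 21, 22, 24, 25, 28]
NUM_PARTS = 32  # the full LAION-Face 50M set is downloaded as 32 parts


def select_parts(subset: str) -> List[int]:
    """Return the part indices for "50M" (all parts) or "20M" (the listed subset)."""
    if subset == "50M":
        return list(range(NUM_PARTS))
    if subset == "20M":
        return list(PARTS_20M)
    raise ValueError(f"unknown subset: {subset!r}")


if __name__ == "__main__":
    print(len(select_parts("50M")), "parts for LAION-Face 50M")
    print(len(select_parts("20M")), "parts for LAION-Face 20M")
```

Returning a copy of the list keeps callers from accidentally modifying the shared constant.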
    

See download.sh and img2dataset for more details and parameter settings.

Download the Face Detection Metadata

We use batch-face to detect faces in the images. Here we provide the face detection result for each sample.

To download the detection result, use the following command.

bash download_detection.sh ./detection_metadata

It downloads 32 sample2detect .pth files into detection_metadata, which takes about 30 GB of disk space. Each file corresponds to one of the parts described in the previous section.
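
Since the download is large, it may help to check that all 32 files arrived before using them. The following is a minimal sketch, not part of the repository: the helper name `missing_detection_files` is hypothetical, and the file name pattern `sample2detect_{part_index}.pth` with indices 0 to 31 is assumed from the loading snippet later in this section.

```python
from pathlib import Path


def missing_detection_files(root, num_parts=32):
    """Return the expected sample2detect_{i}.pth paths that do not exist under root.

    Assumes files are named sample2detect_0.pth ... sample2detect_31.pth.
    """
    root = Path(root)
    expected = [root / f"sample2detect_{i}.pth" for i in range(num_parts)]
    return [path for path in expected if not path.is_file()]


if __name__ == "__main__":
    missing = missing_detection_files("./detection_metadata")
    print(f"{len(missing)} of 32 detection files are missing")
    for path in missing:
        print("  missing:", path)
```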

Each .pth file is a dict object: its key is int(SAMPLE_ID), and its value is the face detection result.

To get the face detection result of a single image, refer to the code snippet below.

import torch

part_index = 0  # index of the part that contains the sample
SAMPLE_ID = int(SAMPLE_ID)  # placeholder: take the ID from the parquet file generated by img2dataset
# Each part has its own sample2detect .pth file; it is a dict keyed by int(SAMPLE_ID).
sample2detect = torch.load(f"detection_metadata/sample2detect_{part_index}.pth")
faces = sample2detect[SAMPLE_ID]
box, landmarks, score = faces[0]  # face rectangle, the standard five points, confidence
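
The snippet above reads only the first entry of `faces`. The indexing suggests that `faces` holds one entry per detected face, so a sketch along the following lines may help when only confident detections are wanted. It assumes that every entry unpacks into `(box, landmarks, score)` as in the snippet above and that `score` can be converted to a float. The helper name `confident_faces` and the default threshold of 0.9 are hypothetical choices for illustration, not values from the repository.

```python
def confident_faces(faces, min_score=0.9):
    """Keep detections whose confidence is at least min_score, highest score first.

    Assumes each detection unpacks into (box, landmarks, score).
    The 0.9 default is an arbitrary illustrative threshold.
    """
    kept = [face for face in faces if float(face[2]) >= min_score]
    return sorted(kept, key=lambda face: float(face[2]), reverse=True)


if __name__ == "__main__":
    # Toy data standing in for sample2detect[SAMPLE_ID]; the values are made up.
    toy_faces = [
        ([10, 20, 110, 140], [[30, 50]] * 5, 0.55),
        ([200, 40, 280, 150], [[220, 70]] * 5, 0.98),
    ]
    for box, landmarks, score in confident_faces(toy_faces):
        print(box, score)
```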

License

LAION-Face is the face subset of LAION-400M. We distribute the image ID list (the .pth files) under the Creative Commons CC-BY 4.0 license, which requires attribution but otherwise poses no particular restriction. The metadata of the dataset comes from LAION-400M; please check LAION-400M for more details.

Contact

For help or issues concerning the data, feel free to submit a GitHub issue or contact Yinglin Zheng.

Citation

If you find our work helpful, please consider citing

@article{zheng2021farl,
  title={General Facial Representation Learning in a Visual-Linguistic Manner},
  author={Zheng, Yinglin and Yang, Hao and Zhang, Ting and Bao, Jianmin and Chen, Dongdong and Huang, Yangyu and Yuan, Lu and Chen, Dong and Zeng, Ming and Wen, Fang},
  journal={arXiv preprint arXiv:2112.03109},
  year={2021}
}
