facebookresearch / MetaCLIP

ICLR2024 Spotlight: curation/training code, metadata, distribution and pre-trained models for MetaCLIP; CVPR 2024: MoDE: CLIP Data Experts via Clustering


Data files are missing during metadata generation.

gmfirefox1 opened this issue

I am trying to reproduce the metadata build process. The script build_metadata.py needs some data files:
data/wiki/enwiki-unigram.txt
data/wiki/1gram.txt.gz
data/wiki/2gram.txt.gz

Could you please share them, or describe how you generated them? Thanks!

Thanks for your interest. Unigram and bigram counts can be computed yourself from a Wikipedia corpus, or you can find pre-computed versions online. For example:
https://github.com/IlyaSemenov/wikipedia-word-frequency/raw/master/results/enwiki-2022-08-29.txt
or
https://nlp.cs.nyu.edu/wikipedia-data/ngram/wp_1gram.txt.gz
https://nlp.cs.nyu.edu/wikipedia-data/ngram/wp_2gram.txt.gz
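
If you would rather compute the counts yourself, a minimal sketch might look like the following. It assumes the Wikipedia dump has already been extracted to plain text with one sentence or paragraph per line. The tokenizer (lowercased alphanumeric runs) and the function name `count_ngrams` are illustrative assumptions, not what build_metadata.py or the files above necessarily use, so check the expected file format before writing the counts out.

```python
import re
from collections import Counter
from typing import Iterable, Tuple

# Assumption: tokens are lowercased runs of ASCII letters and digits.
# The pre-computed files linked above may use a different tokenization.
TOKEN_RE = re.compile(r"[a-z0-9]+")


def count_ngrams(lines: Iterable[str]) -> Tuple[Counter, Counter]:
    """Count unigrams and bigrams over an iterable of text lines.

    Bigrams are formed only within a line, never across line boundaries.
    """
    unigrams: Counter = Counter()
    bigrams: Counter = Counter()
    for line in lines:
        tokens = TOKEN_RE.findall(line.lower())
        unigrams.update(tokens)
        # Pair each token with its successor on the same line.
        bigrams.update(zip(tokens, tokens[1:]))
    return unigrams, bigrams


if __name__ == "__main__":
    # Tiny in-memory example; replace with an open file handle over
    # your extracted Wikipedia text.
    sample = ["The cat sat on the mat.", "The cat slept."]
    uni, bi = count_ngrams(sample)
    for word, n in uni.most_common(3):
        print(word, n)
    for (first, second), n in bi.most_common(3):
        print(first, second, n)
```

Because `count_ngrams` accepts any iterable of strings, you can pass an open file object directly and the corpus is streamed line by line rather than loaded into memory.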

Hope this helps.

Thank you, Howard! Your assistance was greatly appreciated.
It seems enwiki-2022-08-29.txt is no longer available, so I'll try enwiki-2023-04-13.txt instead.