Giters
facebookresearch/cc_net
Tools to download and cleanup Common Crawl data
Stargazers: 916
Watchers: 24
Issues: 44
Forks: 135
facebookresearch/cc_net Issues
how to only compute the perplexity of each paragraph using your language model with local data?
Updated
10 months ago
Comments count
1
Running on local files
Closed
3 years ago
Comments count
4
503 Server Error: Service Unavailable for url
Updated
a year ago
Comments count
1
Extracting text from the WET format
Updated
a year ago
Comments count
2
Whether CC_Net provides an existing monolingual corpus
Updated
a year ago
requests.exceptions.HTTPError: 503 Server Error: Service Unavailable for url
Updated
a year ago
Comments count
1
Can reproduce still run normally?
Closed
a year ago
Numerous Errors
Updated
a year ago
Comments count
2
win10 use cc_net
Updated
a year ago
Error: Job not requeued because: timed-out and not checkpointable.
Updated
a year ago
Comments count
12
CC-100 in statmt version is different from paper
Updated
a year ago
Annotation statistics
Closed
a year ago
The final json files are not as expected
Updated
a year ago
The questions about the stats json configuration file
Updated
a year ago
EOFError: Compressed file ended before the end-of-stream marker was reached
Closed
4 years ago
Comments count
4
Inquiries about utilizing 2022 collected Common Crawl snapshots
Updated
a year ago
Inquiries about Korean datasets utilized in the CCNet pipeline
Updated
a year ago
Comments count
1
when use odoo 16.0 in pycharm show this Error
Updated
a year ago
ModuleNotFoundError: No module named 'typing_extensions'
Closed
4 years ago
Comments count
5
Batch job submission failed: Invalid job array specification
Updated
a year ago
Comments count
3
403 forbidden while downloading
Updated
2 years ago
Comments count
2
Error when Running 2020-34 dumps
Updated
2 years ago
Comments count
4
Error: Mining phase failure
Closed
2 years ago
Comments count
1
Variance of hash files sizes in newer crawls
Updated
2 years ago
Comments count
1
requests.exceptions.HTTPError: 429 Client Error: Too Many Requests for url
Updated
2 years ago
Comments count
5
sbatch: error: Batch job submission failed: Invalid job array specification
Closed
2 years ago
I want to copy the output data of CC_net directly, what should I do?
Closed
2 years ago
Comments count
1
make dl_all_lm failing
Updated
2 years ago
Comments count
2
cc_net/tools/dl_cc_100.py fails to extract complete dataset
Updated
3 years ago
Comments count
6
"Reproducing our work" does not specify set of languages and snapshots
Updated
3 years ago
Comments count
2
Question about the size of Roberta-small
Closed
3 years ago
getpy version specified in setup.py no longer available
Closed
3 years ago
Comments count
1
support of Hausa
Closed
4 years ago
Comments count
4
Model finding
Updated
3 years ago
Are not all languages in the paper supported?
Closed
4 years ago
Comments count
1
Decrease RAM usage, investigate miss documents
Closed
4 years ago
Comments count
3
Cannot download the precomputed files
Closed
4 years ago
Comments count
7
Doing hashing, mining and regroup from each bin order
Closed
4 years ago
Comments count
1
Any plans to release the cleaned datasets ?
Closed
5 years ago
Comments count
6
ERROR: Package u'cc-net' requires a different Python: 2.7.12 not in '>=3.7'
Closed
4 years ago
Comments count
2
Failing to use mp execution
Updated
4 years ago
Comments count
4
Early exit when desired number of documents is reached?
Closed
4 years ago
Comments count
3
Dedup all paragraphs if they appear more than once?
Closed
4 years ago
Comments count
2
ChunkedEncodingError & ConnectionResetError
Closed
5 years ago
Comments count
13