Text Corpus Collection (tcc)
This is a work in progress!
What is it?
This project provides simple tools to obtain (popular) text corpora that are used for benchmarks and tests.
What is it not?
We do not host any of the corpora. We just provide an easy way to get and/or compute them. Please visit the websites of the corpora for further information.
What is contained?
- The Pizza & Chili Corpus
- Lightweight Corpus
- Random number generation
- Word based alphabet computation
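As a rough illustration of what a word based alphabet computation can look like, the sketch below maps every distinct word of a text to an integer ID and rewrites the text as a sequence of those IDs. This is only an assumption about the general idea, not the tool shipped with this project: the function name word_alphabet and the whitespace tokenization are hypothetical choices made for the example.

```python
def word_alphabet(text):
    """Map each distinct whitespace-separated word to a dense integer ID.

    Hypothetical helper for illustration; the project's own tool may
    tokenize differently (e.g. handle punctuation or case).
    """
    ids = {}        # word -> integer ID, assigned in order of first occurrence
    encoded = []    # the text rewritten over the integer alphabet
    for word in text.split():
        if word not in ids:
            ids[word] = len(ids)
        encoded.append(ids[word])
    return ids, encoded


if __name__ == "__main__":
    alphabet, sequence = word_alphabet("to be or not to be")
    print(len(alphabet))  # 4 distinct words
    print(sequence)       # [0, 1, 2, 3, 0, 1]
```

Assigning IDs in order of first occurrence keeps the alphabet dense (0 to number of distinct words minus 1), which is usually what index structures over integer alphabets expect.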
How to use it?
Run one of the following make targets:
- make download: downloads all files listed in the download configs
- make random: generates random strings as defined in the config
- make processing: builds all preprocessing tools
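For readers who want a feel for the kind of data a random string generator produces, here is a minimal sketch of generating a reproducible random string over a small alphabet. It does not reflect how the project's config or generator actually works: the function random_string and its parameters length, sigma, and seed are hypothetical names introduced for this example.

```python
import random
import string


def random_string(length, sigma, seed=0):
    """Return a pseudo-random string of `length` characters drawn uniformly
    from the first `sigma` lowercase ASCII letters.

    Hypothetical helper for illustration only; parameter names are assumptions.
    """
    if not 1 <= sigma <= len(string.ascii_lowercase):
        raise ValueError("sigma must be between 1 and 26")
    rng = random.Random(seed)  # fixed seed makes the output reproducible
    alphabet = string.ascii_lowercase[:sigma]
    return "".join(rng.choice(alphabet) for _ in range(length))


if __name__ == "__main__":
    # Two calls with the same seed yield the same string.
    print(random_string(20, sigma=4, seed=42) == random_string(20, sigma=4, seed=42))
```

A fixed seed matters for benchmarks, since it lets different runs and different machines work on identical inputs.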