kurpicz / tcc

Text Corpus Collection

Text Corpus Collection (tcc)

This is a work in progress!

What is it?

This project provides simple tools to obtain (popular) text corpora that are used for benchmarks and tests.

What is it not?

We do not host any of the corpora. We just provide an easy way to get and/or compute them. Please visit the websites of the corpora for further information.

What is contained?

How to use it?

Run make download to download all files listed in the download configs, make random to generate random strings as defined in the config, and make processing to build all preprocessing tools.
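To give a rough idea of what generating a random string from configured parameters can look like, here is a minimal C++ sketch. It is not the code behind the random target: the function name random_text and its parameters (length, alphabet, seed) are hypothetical and stand in for whatever the project's config actually defines.

```cpp
#include <cstddef>
#include <cstdint>
#include <random>
#include <stdexcept>
#include <string>

// Hypothetical helper, NOT part of tcc: returns `length` characters drawn
// uniformly at random from `alphabet`. The same seed always yields the same
// text on a given standard library implementation, which keeps benchmark
// inputs reproducible.
std::string random_text(std::size_t length, const std::string& alphabet,
                        std::uint64_t seed) {
  if (alphabet.empty()) {
    throw std::invalid_argument("alphabet must not be empty");
  }
  std::mt19937_64 gen(seed);
  // Both bounds are inclusive, so the upper bound is the last valid index.
  std::uniform_int_distribution<std::size_t> pick(0, alphabet.size() - 1);
  std::string text;
  text.reserve(length);
  for (std::size_t i = 0; i < length; ++i) {
    text.push_back(alphabet[pick(gen)]);
  }
  return text;
}
```

For example, random_text(1000000, "ACGT", 42) would produce a one-million-character text over a four-letter alphabet. A fixed seed is used here so that repeated runs give the same input.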

About

License: BSD 2-Clause "Simplified" License


Languages

C++ 64.2%, Makefile 32.6%, Shell 3.2%