smc / corpus

Malayalam Corpus by Swathanthra Malayalam Computing

This is a collection of Malayalam text gathered from various sources, then curated and processed for general-purpose use.

Contents (as of March 4, 2019)

The text corpus contains running text from various freely licensed sources.

  • The whole content of Malayalam Wikipedia, extracted on January 1, 2019
  • News and articles from various sources, with the source mentioned in the respective files:
  • 251 MB
  • 8,60,159 lines
  • 98,15,533 words
  • 10,11,11,885 characters
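
The README does not say how these figures were produced. As a minimal sketch, assuming "words" means whitespace-separated tokens and "characters" means Unicode code points (newlines included), counts of this kind could be recomputed as below. The function names and the sample text are illustrative and not part of the repository; `indian_grouping` reproduces the digit grouping used in the list above.

```python
import io
from collections import namedtuple

# Hypothetical container for the four figures listed in the README.
CorpusStats = namedtuple("CorpusStats", "lines words characters size_bytes")


def corpus_stats(stream):
    """Count lines, whitespace-separated words, code points and UTF-8 bytes.

    `stream` is any iterable of text lines, e.g. a file opened with
    open(path, encoding="utf-8").
    """
    lines = words = characters = size_bytes = 0
    for line in stream:
        lines += 1
        words += len(line.split())               # assumption: words are split on whitespace
        characters += len(line)                  # code points, including the newline
        size_bytes += len(line.encode("utf-8"))  # size when stored as UTF-8
    return CorpusStats(lines, words, characters, size_bytes)


def indian_grouping(n):
    """Format a non-negative integer with Indian digit grouping (860159 -> '8,60,159')."""
    digits = str(n)
    if len(digits) <= 3:
        return digits
    head, tail = digits[:-3], digits[-3:]
    groups = []
    while len(head) > 2:
        groups.insert(0, head[-2:])  # peel off two digits at a time
        head = head[:-2]
    groups.insert(0, head)
    return ",".join(groups + [tail])


if __name__ == "__main__":
    # Tiny in-memory sample standing in for a corpus file.
    sample = io.StringIO("മലയാളം ഭാഷ\nകേരളം\n")
    stats = corpus_stats(sample)
    print(indian_grouping(stats.lines), "lines")
    print(indian_grouping(stats.words), "words")
    print(indian_grouping(stats.characters), "characters")
```

Counting code points rather than grapheme clusters matters for Malayalam, where a single visible letter is often made up of several code points; a different definition of "character" would give different totals.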

The word corpus contains

Contributing

  1. If you know of or have a text collection with a compatible license (CC BY-SA), we can add it to this collection. Just create an issue to let us know about it, and we will help. We are looking for content on diverse topics.
  2. We are also collecting person names, place names, etc. in Malayalam. You can see the existing words by browsing the words folder. If you would like to expand that collection, create an issue with details or create a merge request.

Make sure to respect the copyright of the content. We are trying to provide a corpus of freely licensed content.

Other sources

  1. Malayalam content from the Common Crawl dataset: https://github.com/qburst/common-crawl-malayalam

License

Creative Commons Attribution-ShareAlike https://creativecommons.org/licenses/by-sa/3.0/
