tm4roon / jawikinews-headline-dataset

A parallel corpus of article-headline pairs obtained from Japanese Wikinews.


Japanese-Wikinews Headline Dataset

The datasets contain article-headline pairs obtained from Japanese Wikinews. Both articles and headlines are segmented into words using MeCab with the mecab-ipadic dictionary.

This repository provides three versions of the dataset, which differ in article length:

  • full-articles: headlines paired with full articles of more than 10 tokens;
  • long version: headlines paired with articles truncated to the first five sentences or 256 tokens;
  • short version: headlines paired with articles truncated to the first three sentences or 128 tokens.
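The truncation behind the long and short versions can be illustrated with a minimal sketch. It is not the repository's actual preprocessing code. It assumes the article is already a list of MeCab tokens, that a sentence ends at the token "。", and that both limits apply together (stop after the sentence limit or the token limit, whichever is reached first). The function name `truncate_article` and its parameters are hypothetical.

```python
def truncate_article(tokens, max_sentences=5, max_tokens=256, sentence_end="。"):
    """Keep the leading part of a tokenized article.

    Assumption: stop once either max_sentences sentences or max_tokens
    tokens have been kept, whichever happens first.
    """
    kept = []
    sentences_seen = 0
    for token in tokens:
        if sentences_seen >= max_sentences or len(kept) >= max_tokens:
            break
        kept.append(token)
        # Assumption: the full stop "。" marks the end of a sentence.
        if token == sentence_end:
            sentences_seen += 1
    return kept


# Toy article with four sentences of four tokens each (whitespace-segmented).
article = "今日 は 晴れ 。 明日 は 雨 。 明後日 は 曇り 。 その後 は 不明 。".split()

# Settings corresponding to the short version: three sentences or 128 tokens.
short = truncate_article(article, max_sentences=3, max_tokens=128)
print(" ".join(short))
```

With the default arguments the same function would correspond to the long version's limits (five sentences or 256 tokens), under the same assumptions.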

Data Statistics

Table 1: Number of documents

Table 2: N-gram overlaps in headlines
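One plausible way to measure n-gram overlap is the fraction of headline n-grams that also occur in the paired article. The sketch below illustrates that definition only; the exact metric used for the table is not stated here, so this formula is an assumption, and the names `ngrams` and `headline_ngram_overlap` are hypothetical.

```python
def ngrams(tokens, n):
    """Return all contiguous n-grams of a token list as tuples."""
    return [tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]


def headline_ngram_overlap(article_tokens, headline_tokens, n):
    """Fraction of headline n-grams that also appear in the article.

    Assumption: overlap is counted per headline n-gram occurrence;
    the dataset's own statistic may be defined differently.
    """
    headline_ngrams = ngrams(headline_tokens, n)
    if not headline_ngrams:
        return 0.0  # headline shorter than n tokens
    article_ngrams = set(ngrams(article_tokens, n))
    matched = sum(1 for gram in headline_ngrams if gram in article_ngrams)
    return matched / len(headline_ngrams)


# Toy pair, already segmented into whitespace-separated tokens.
article = "政府 は 新しい 法案 を 発表 し た 。".split()
headline = "政府 、 新しい 法案 を 発表".split()

for n in (1, 2, 3):
    print(n, headline_ngram_overlap(article, headline, n))
```

A high overlap suggests headlines that largely copy words from the article, while a low overlap suggests more abstractive headlines.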


