paithiov909 / ldccr

Utilities for using various Japanese corpora

Home Page: https://paithiov909.github.io/ldccr/

ldccr

Overview

ldccr is a collection of utilities for working with various Japanese corpora.

The goal of the ldccr package is to make Japanese language resources easy to use.

This package provides:

  1. parsers for several Japanese corpora that are free or openly licensed (non-proprietary).
  2. a downloader of zipped text files published on Aozora Bunko.

Installation

install.packages("ldccr", repos = c("https://paithiov909.r-universe.dev", "https://cloud.r-project.org"))

Supported Corpora

Monolingual

Name | License | Link
✔️ Live Door News Corpus | CC BY-ND 2.1 JP | #
✔️ Japanese Realistic Textual Entailment Corpus | CC BY-NC-SA 4.0 | #
✔️ ja.text8 corpus | CC BY-SA | #

Multilingual

Currently not supported.

Download text file from Aozora Bunko

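# Create a local cache directory for downloaded files
if (!dir.exists("cache")) dir.create("cache")

# Pick one work at random from the snapshot, download its zipped text file
# into the cache directory, and read the text line by line
text <- ldccr::AozoraBunkoSnapshot |>
  dplyr::sample_n(1L) |>
  dplyr::pull("テキストファイルURL") |> # column holding the text file URL
  ldccr::read_aozora(directory = "cache") |>
  readr::read_lines()

# The output varies between runs because the work is sampled at random
dplyr::glimpse(text)
#>  chr [1:16] "雪子さんの泥棒よけ" "夢野久作" ...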
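
Aozora Bunko text files commonly carry inline markup such as ruby readings in 《》, a ｜ marker in front of the ruby base, and editor notes in ［＃…］. If the lines you read still contain this markup, a small cleanup step along the following lines may help. This is only a sketch: `strip_aozora_markup()` is a hypothetical helper, not part of ldccr, and it assumes UTF-8 input and handles only these three patterns.

```r
# Hypothetical helper (not part of ldccr): remove common Aozora Bunko markup
# from a character vector of lines. Assumes the input is UTF-8.
strip_aozora_markup <- function(lines) {
  lines <- gsub("《[^》]*》", "", lines)      # ruby readings, e.g. 《わがはい》
  lines <- gsub("［＃[^］]*］", "", lines)    # editor notes, e.g. ［＃３字下げ］
  lines <- gsub("｜", "", lines, fixed = TRUE) # marker placed before a ruby base
  lines
}

# Made-up sample lines for illustration, not taken from any downloaded file
sample_lines <- c(
  "吾輩《わがはい》は猫である",
  "［＃３字下げ］一",
  "｜青空文庫《あおぞらぶんこ》"
)

cleaned <- strip_aozora_markup(sample_lines)
print(cleaned)
```

The same function could be applied to the `text` vector from the example above, for instance `strip_aozora_markup(text)`, before passing it to a tokenizer.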

License

MIT license.
