awesome-japanese-nlp-resources
A curated list of resources dedicated to Python libraries, LLMs, dictionaries, and corpora of NLP for Japanese
This list includes 549 Japanese NLP repositories.
A tool for searching these repositories is available on Hugging Face Spaces.
For information on the models available on Hugging Face, please see here.
We have released a Japanese NLP classification dataset called awesome-japanese-nlp-classification-dataset.
English | 日本語 (Japanese) | 繁體中文 (Chinese) | 简体中文 (Chinese)
Tutorial
Updated on Mar 25, 2024
sudachi.rs - SudachiPy 0.6* and above are developed as Sudachi.rs.
Janome - Japanese morphological analysis engine written in pure Python
mecab-python3 - mecab-python. You can find the original version here: http://taku910.github.io/mecab/
mecab - This repository is for building Windows 64-bit MeCab binary and improving MeCab Python binding.
fugashi - A Cython MeCab wrapper for fast, pythonic Japanese tokenization and morphological analysis.
nagisa - A Japanese tokenizer based on recurrent neural networks
pyknp - A Python Module for JUMAN++/KNP
Mykytea-python - Python wrapper for KyTea
konoha - Konoha: Simple wrapper of Japanese Tokenizers
natto-py - natto-py combines the Python programming language with MeCab, the part-of-speech and morphological analyzer for the Japanese language.
rakutenma-python - Rakuten MA (Python version)
python-vaporetto - Vaporetto is a fast and lightweight pointwise prediction based tokenizer. This is a Python wrapper for Vaporetto.
dango - An easy to use tokenizer for Japanese text, aimed at language learners and non-linguists
rhoknp - Yet another Python binding for Juman++/KNP
python-vibrato - Viterbi-based accelerated tokenizer (Python wrapper)
jagger-python - Python binding for Jagger (C++ implementation of Pattern-based Japanese Morphological Analyzer)
ginza - A Japanese NLP Library using spaCy as framework based on Universal Dependencies
cabocha - Yet Another Japanese Dependency Structure Analyzer
UniDic2UD - Tokenizer POS-tagger Lemmatizer and Dependency-parser for modern and contemporary Japanese
camphr - Camphr - NLP library for creating pipeline components
SuPar-UniDic - Tokenizer POS-tagger Lemmatizer and Dependency-parser for modern and contemporary Japanese with BERT models
depccg - A* CCG Parser with a Supertag and Dependency Factored Model
bertknp - A Japanese dependency parser based on BERT
esupar - Tokenizer POS-Tagger and Dependency-parser with BERT/RoBERTa/DeBERTa models for Japanese and other languages
yomikata - Heteronym disambiguation library using a fine-tuned BERT model.
jdepp-python - Python binding for J.DepP (C++ implementation of Japanese Dependency Parsers)
pykakasi - Lightweight converter from Japanese Kana-kanji sentences into Kana-Roman.
cutlet - Japanese to romaji converter in Python
alphabet2kana - Convert English alphabet to Katakana
Convert-Numbers-to-Japanese - Converts Arabic numerals, or 'western' style numbers, to a Japanese context.
mozcpy - Mozc for Python: Kana-Kanji converter
jamorasep - Japanese text parser to separate Hiragana/Katakana string into morae (syllables).
text2phoneme - A script that converts Japanese sentences into phoneme sequences
jntajis-python - A fast character conversion and transliteration library based on the scheme defined for Japan National Tax Agency (国税庁) 's
wiredify - Convert japanese kana from ba-bi-bu-be-bo into va-vi-vu-ve-vo
mecab-text-cleaner - Simple Python package (CLI/Python API) for getting japanese readings (yomigana) and accents using MeCab.
neologdn - Japanese text normalizer for mecab-neologd
jaconv - Pure-Python Japanese character interconverter for Hiragana, Katakana, Hankaku, and Zenkaku
mojimoji - A fast converter between Japanese hankaku and zenkaku characters
text-cleaning - A powerful text cleaner for Japanese web texts
HojiChar - A text preprocessing tool for composing and managing multiple preprocessing steps
utsuho - Utsuho is a Python module that facilitates bidirectional conversion between half-width katakana and full-width katakana in Japanese.
python-habachen - Yet Another Fast Japanese String Converter
Bunkai - Sentence boundary disambiguation tool for Japanese texts (日本語文境界判定器)
japanese-sentence-breaker - Japanese Sentence Breaker
sengiri - Yet another sentence-level tokenizer for the Japanese text
budoux - Standalone. Small. Language-neutral. BudouX is the successor to Budou, the machine learning powered line break organizer tool.
ja_sentence_segmenter - japanese sentence segmentation library for python
hasami - A tool to perform sentence segmentation on Japanese text
kuzukiri - Japanese Text Segmenter for Python written in Rust
ja-senter-benchmark - Comparison of Japanese Sentence Segmentation Tools
oseti - Dictionary based Sentiment Analysis for Japanese
negapoji - Negative/positive classification for Japanese documents.
pymlask - Emotion analyzer for Japanese text
asari - Japanese sentiment analyzer implemented in Python.
jparacrawl-finetune - An example usage of JParaCrawl pre-trained Neural Machine Translation (NMT) models.
JASS - JASS: Japanese-specific Sequence to Sequence Pre-training for Neural Machine Translation (LREC2020) & Linguistically Driven Multi-Task Pre-Training for Low-Resource Neural Machine Translation (ACM TALLIP)
PheMT - A phenomenon-wise evaluation dataset for Japanese-English machine translation robustness. The dataset is based on the MTNT dataset, with additional annotations of four linguistic phenomena; Proper Noun, Abbreviated Noun, Colloquial Expression, and Variant. COLING 2020.
VISA - An ambiguous subtitles dataset for visual scene-aware machine translation
namaco - Character Based Named Entity Recognition.
entitypedia - Entitypedia is an Extended Named Entity Dictionary from Wikipedia.
noyaki - Converts character span label information to tokenized text-based label information.
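bert-japanese-ner-finetuning - Code to perform finetuning of the BERT model. A sample that creates and uses a named entity recognition model by fine-tuning BERT.
joint-information-extraction-hs - Code for running inference to measure the accuracy of named entity and relation extraction from a case report corpus based on detailed annotation criteria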
pygeonlp - pygeonlp, A python module for geotagging Japanese texts.
Manga OCR - Optical character recognition for Japanese text, with the main focus being Japanese manga
mokuro - Read Japanese manga inside browser with selectable text.
handwritten-japanese-ocr - Handwritten Japanese OCR demo using touch panel to draw the input text using Intel OpenVINO toolkit
OCR_Japanease - Japanese OCR
ndlocr_cli - The NDLOCR application
donut - Official Implementation of OCR-free Document Understanding Transformer (Donut) and Synthetic Document Generator (SynthDoG), ECCV 2022
JMTrans - manga translator - get japanese manga from url to translate manga image
Kindai-OCR - OCR system for recognizing modern Japanese magazines
text_recognition - Text recognition module for NDLOCR
Poricom - Optical character recognition in manga images. Manga OCR desktop application
Tool for pretrained models
namedivider-python - A tool for dividing the Japanese full name into a family name and a given name.
asa-python - A curated list of resources dedicated to Python libraries of NLP for Japanese
python_asa - Python version of the Japanese semantic role labeling system (ASA)
toiro - A comparison tool of Japanese tokenizers
ja-timex - A rule-based analyzer that extracts and normalizes temporal expressions written in natural language
JapaneseTokenizers - A set of metrics for feature selection from text data
daaja - This repository has implementations of data augmentation for NLP for Japanese.
accel-brain-code - The purpose of this repository is to make prototypes as case study in the context of proof of concept(PoC) and research and development(R&D) that I have written in my website. The main research topics are Auto-Encoders in relation to the representation learning, the statistical machine learning for energy-based models, adversarial generation net…
kyoto-reader - A processor for KyotoCorpus, KWDLC, and AnnotatedFKCCorpus
nlplot - Visualization Module for Natural Language Processing
rake-ja - Rapid Automatic Keyword Extraction algorithm for Japanese
jel - Japanese Entity Linker.
MedNER-J - Latest version of MedEX/J (Japanese disease name extractor)
zunda-python - Zunda: Japanese Enhanced Modality Analyzer client for Python.
AIO2_DPR_baseline - https://www.nlp.ecei.tohoku.ac.jp/projects/aio/
showcase - A PyTorch implementation of the Japanese Predicate-Argument Structure (PAS) analyser presented in the paper of Matsubayashi & Inui (2018) with some improvements.
darts-clone-python - Darts-clone python binding
jrte-corpus_example - Example codes for Japanese Realistic Textual Entailment Corpus
desuwa - Feature annotator to morphemes and phrases based on KNP rule files (pure-Python)
HotPepperGourmetDialogue - Restaurant Search System through Dialogue in Japanese.
nlp-recipes-ja - Samples codes for natural language processing in Japanese
Japanese_nlp_scripts - Small example scripts for working with Japanese texts in Python
DNorm-J - Japanese version of DNorm
pyknp-eventgraph - EventGraph is a development platform for high-level NLP applications in Japanese.
ishi - Ishi: A volition classifier for Japanese
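python-npylm - Unsupervised morphological analysis using a Bayesian hierarchical language model
python-npycrf - Semi-supervised morphological analysis that integrates conditional random fields with a Bayesian hierarchical language model
unsupervised-pos-tagging - Unsupervised part-of-speech tagging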
negima - Negima is a Python package to extract phrases in Japanese text by using part-of-speech-based rules that you define.
YouyakuMan - Extractive summarizer using BertSum as summarization model
japanese-numbers-python - A parser for Japanese number (Kanji, arabic) in the natural language.
kantan - Lookup japanese words by radical patterns
make-meidai-dialogue - Get Japanese dialogue corpus
japanese_summarizer - A summarizer for Japanese articles.
chirptext - ChirpText is a collection of text processing tools for Python.
yubin - Japanese Address Munger
jawiki-cleaner - Japanese Wikipedia Cleaner
japanese2phoneme - A python library to convert Japanese to phoneme.
anlp_nlp2021_d3-1 - This repository contains codes related to the experiments in "An Experimental Evaluation of Japanese Tokenizers for Sentiment-Based Text Classification"
aozora_classification - This project aims to classify Japanese sentences by how similar they are to the writing of classical Japanese authors such as Soseki Natsume, Ogai Mori, and Ryunosuke Akutagawa.
aozora-corpus-generator - Generates plain or tokenized text files from the Aozora Bunko
JLM - A fast LSTM Language Model for large vocabulary language like Japanese and Chinese
NTM - Testing of Neural Topic Modeling for Japanese articles
EN-JP-ML-Lexicon - This is an English-Japanese lexicon for Machine Learning and Deep Learning terminology.
text-generation - Easy-to-use scripts to fine-tune GPT-2-JA with your own texts, to generate sentences, and to tweet them automatically.
chainer_nic - Neural Image Caption (NIC) on chainer, its pretrained models on English and Japanese image caption datasets.
unihan-lm - The official repository for "UnihanLM: Coarse-to-Fine Chinese-Japanese Language Model Pretraining with the Unihan Database", AACL-IJCNLP 2020
mbart-finetuning - Code to perform finetuning of the mBART model.
xvector_jtubespeech - xvector model on jtubespeech
TinySegmenterMaker - A tool for training your own models for TinySegmenter.
Grongish - A script for converting between Japanese and the Grongi language
WordCloud-Japanese - A script that renders Japanese text in WordCloud with morphological-analysis-like output, without using MeCab (a morphological analysis engine)
snark - A DB access library that uses Japanese WordNet
toEmoji - Something that converts Japanese sentences into emoji-only sentences
termextract - A practice implementation of a technical term extraction algorithm
JDT-with-KenLM-scoring - Scores the response candidates of Japanese-Dialog-Transformer with a KenLM n-gram language model, then filters or reranks them.
mixture-of-unigram-model - Mixture of Unigram Model and Infinite Mixture of Unigram Model in Python.
hidden-markov-model - Hidden Markov Model (HMM) and Infinite Hidden Markov Model (iHMM) in Python.
Ngram-language-model - Ngram language model in Python.
ASRDeepSpeech - Automatic Speech Recognition with deepspeech2 model in pytorch with support from Zakuro AI.
neural_ime - Neural IME: Neural Input Method Engine
neural_japanese_transliterator - Can neural networks transliterate Romaji into Japanese correctly?
tinysegmenter - tokenizer specified for Japanese
AugLy-jp - Data Augmentation for Japanese Text on AugLy
furigana4epub - A Python script for adding furigana to Japanese epub books using Mecab and Unidic.
PyKatsuyou - Japanese verb/adjective inflections tool
jageocoder - Pure Python Japanese address geocoder
nksnd - New kana-kanji conversion engine
JaMIE - A Japanese Medical Information Extraction Toolkit
fasttext-vs-word2vec-on-twitter-data - A comparison of fastText and word2vec, with execution and training scripts
minimal-search-engine - A minimal search engine/PageRank/tf-idf
5ch-analysis - Scrapes 5ch's past logs to track words that were once popular (e.g., 香具師, orz)
tweet_extructor - A tweet downloader for the Twitter Japanese reputation analysis dataset
japanese-word-aggregation - Aggregating Japanese words based on Juman++ and ConceptNet5.5
jinf - A Japanese inflection converter
kwja - A unified language analyzer for Japanese
mlm-scoring-transformers - Reproduced package based on Masked Language Model Scoring (ACL2020).
ClipCap-for-Japanese - [PyTorch] ClipCap for Japanese
SAT-for-Japanese - [PyTorch] Show, Attend and Tell for Japanese
cihai - Python library for CJK (Chinese, Japanese, and Korean) language dictionary
marine - MARINE : Multi-task leaRnIng-based JapaNese accent Estimation
whisper-asr-finetune - Finetuning Whisper ASR model
japanese_chatbot - A PyTorch Implementation of japanese chatbot using BERT and Transformer's decoder
radicalchar - A library for normalizing radical characters
akaza - Yet another Japanese IME for IBus/Linux
posuto - Japanese postal code data.
tacotron2-japanese - Tacotron2 implementation of Japanese
ibus-hiragana - Hiragana IME for IBus
furiganapad - Furigana Pad
chikkarpy - Japanese synonym library
ja-tokenizer-docker-py - Mecab + NEologd + Docker + Python3
JapaneseEmbeddingEval - JapaneseEmbeddingEval
gptuber-by-langchain - GPT acts as a YouTuber
shuwa - Extend GNOME On-Screen Keyboard for Input Methods
japanese-nli-model - This repository provides the code for Japanese NLI model, a fine-tuned masked language model.
tra-fugu - A tool for Japanese-English translation and English-Japanese translation by using FuguMT
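fugumt - A translation environment that uses the machine translation engine published on ぷるーふおぶこんせぷと (Proof of Concept). It can translate strings entered in a form as well as PDFs.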
JaSPICE - JaSPICE: Automatic Evaluation Metric Using Predicate-Argument Structures for Image Captioning Models
Retrieval-based-Voice-Conversion-WebUI-JP-localization - jp-localization
pyopenjtalk - Python wrapper for OpenJTalk
yomigana-ebook - Make learning Japanese easier by adding readings for every kanji in the eBook
N46Whisper - Whisper based Japanese subtitle generator
japanese_llm_simple_webui - A simple web interface for Japanese-capable LLMs (large language models) such as Rinna-3.6B and OpenCALM
pdf-translator - pdf-translator translates English PDF files into Japanese, preserving the original layout.
japanese_qa_demo_with_haystack_and_es - A sample Japanese question answering system using Haystack + Elasticsearch + Wikipedia (ja)
mozc-devices - Automatically exported from code.google.com/p/mozc-morse
natsume - A Japanese text frontend processing toolkit
vits-japros-webui - A Gradio WebUI for training Japanese TTS (VITS) models and synthesizing speech
ja-law-parser - A Japanese law parser
dictation-kit - Japanese dictation kit using Julius
julius4seg - A segmentation support tool using Julius
voicevox_engine - The speech synthesis engine of VOICEVOX, a free, medium-quality text-to-speech software
LLaVA-JP - LLaVA-JP is a Japanese VLM trained by LLaVA method
RAG-Japanese - Open source RAG with Llama Index for Japanese LLM in low resource setting
bertjsc - Japanese Spelling Error Corrector using BERT (Masked-Language Model).
llm-leaderboard - Project of llm evaluation to Japanese tasks
jglue-evaluation-scripts - Training and evaluation scripts for JGLUE, a Japanese language understanding benchmark
BLIP2-Japanese - Modifying LAVIS' BLIP2 Q-former with models pretrained on Japanese datasets.
wikipedia-passages-jawiki-embeddings-utils - Scripts and utilities for converting Japanese Wikipedia text into various Japanese embeddings and faiss indexes.
mecab - Yet another Japanese morphological analyzer
jumanpp - Juman++ (a Morphological Analyzer Toolkit)
kytea - The Kyoto Text Analysis Toolkit for word segmentation and pronunciation estimation, etc.
cabocha - Yet Another Japanese Dependency Structure Analyzer
knp - A Japanese Parser
jsc - Joint source channel model for Japanese Kana Kanji conversion, Chinese pinyin input and CJE mixed input.
aquaskk - An input method without morphological analysis.
mozc - Mozc - a Japanese Input Method Editor designed for multi-platform
trimatch - Trimatch: An (Exact|Prefix|Approximate) String Matching Library
resembla - Resembla: Word-based Japanese similar sentence search library
corvusskk - ▽▼ SKK-like Japanese Input Method Editor for Windows
lindera - A morphological analysis library.
vaporetto - Vaporetto: Very Accelerated POintwise pREdicTion based TOkenizer
goya - Japanese Morphological Analysis written in Rust
vibrato - vibrato: Viterbi-based accelerated tokenizer
yoin - A Japanese Morphological Analyzer written in pure Rust
mecab-rs - Safe Rust bindings for mecab a part-of-speech and morphological analyzer library
awabi - A morphological analyzer using mecab dictionary
wana_kana_rust - Utility library for checking and converting between Japanese characters - Hiragana, Katakana - and Romaji
unicode-jp-rs - A Rust library to convert Japanese Half-width-kana[半角カナ] and Wide-alphanumeric[全角英数] into normal ones
kana - [Mirror] CLI program for transliterating romaji text to either hiragana or katakana
daachorse - A fast implementation of the Aho-Corasick algorithm using the compact double-array data structure in Rust.
find-simdoc - Finding all pairs of similar documents time- and memory-efficiently
crawdad - Rust library of natural language dictionaries using character-wise double-array tries.
tokenizer-speed-bench - Comparison code of various tokenizers
stringmatch-bench - Benchmark tools to compare the performance of data structures for string matching.
vime - Using Vim as an input method for X11 apps
voicevox_core - The core of VOICEVOX, a free, medium-quality text-to-speech software
akaza - Yet another Japanese IME for IBus/Linux
Jotoba - A free online, self-hostable, multilang Japanese dictionary.
dvorakjp-romantable - DvorakJP Roman Table for Google Japanese Input
niinii - Japanese glossator for assisted reading of text using Ichiran
cskk - SKK (Simple Kana Kanji henkan) library
japanki - Learn Japanese vocabs 🇯🇵 by doing quizzes on CLI!
jpreprocess - Japanese text preprocessor for Text-to-Speech applications (OpenJTalk rewrite in rust language)
kuromoji.js - JavaScript implementation of Japanese morphological analyzer
rakutenma - Rakuten MA - morphological analyzer (word segmentor + PoS Tagger) for Chinese and Japanese written purely in JavaScript.
Resources
node-mecab-ya - Yet another mecab wrapper for nodejs
juman-bin - A User-Extensible Morphological Analyzer for Japanese.
node-mecab-async - Asynchronous japanese morphological analyser using MeCab.
kuroshiro - Japanese language library for converting Japanese sentence to Hiragana, Katakana or Romaji with furigana and okurigana modes supported.
kuroshiro-analyzer-kuromoji - Kuromoji morphological analyzer for kuroshiro.
hepburn - Node.js module for converting Japanese Hiragana and Katakana script to, and from, Romaji using Hepburn romanisation
japanese-numerals-to-number - Converts Japanese Numerals into number
jslingua - Javascript libraries to process text: Arabic, Japanese, etc.
WanaKana - Javascript library for detecting and transliterating Hiragana <--> Katakana <--> Romaji
node-romaji-name - Normalize and fix common issues with Romaji-based Japanese names.
kyujitai.js - Utility collections for making Japanese text old-fashioned
normalize-japanese-addresses - An open-source address normalization library.
kagome - Self-contained Japanese Morphological Analyzer written in pure Go
ojosama - Converts text into the ojou-sama (refined young lady) speaking style of Hyakumantenbara Salome (壱百満天原サロメ)
nihongo - Japanese Dictionary
yomichan-import - External dictionary importer for Yomichan.
imas-ime-dic - THE IDOLM@STER words dictionary for Japanese IME (by imas-db.jp)
go-kakasi - Kanji transliteration to hiragana/katakana/romaji, in Go
go-moji - A Go library for Zenkaku/Hankaku conversion
ojichat - Generates messages like those a middle-aged man (ojisan) would send via LINE or email
kuromoji - Kuromoji is a self-contained and very easy to use Japanese morphological analyzer designed for search
Sudachi - A Japanese Tokenizer for Business
SudachiDict - A lexicon for Sudachi
kanjitomo-ocr - Java library for identifying Japanese characters from images
jakaroma - Java library and command-line tool to transliterate Japanese kanji to romaji (Latin alphabet)
kakasi-java - Kanji transliteration to hiragana/katakana/romaji, in Java
Kamite - A desktop language immersion companion for learners of Japanese
react-native-japanese-tokenizer - Async Japanese Tokenizer Native Plugin for React Native for iOS and Android
elasticsearch-analysis-japanese - Japanese analyzer uses kuromoji japanese tokenizer for ElasticSearch
moji4j - A Java library to convert between Japanese Hiragana, Katakana, and Romaji scripts.
neologdn-java - Japanese text normalizer for mecab-neologd
elasticsearch-sudachi - The Japanese analysis plugin for elasticsearch
bert-japanese - BERT models for Japanese text.
japanese-pretrained-models - Code for producing Japanese pretrained models provided by rinna Co., Ltd.
bert-japanese - BERT with SentencePiece for Japanese text.
SudachiTra - Japanese tokenizer for Transformers
japanese-dialog-transformers - Code for evaluating Japanese pretrained models provided by NTT Ltd.
shiba - Pytorch implementation and pre-trained Japanese model for CANINE, the efficient character-level transformer.
Dialog - A PyTorch Implementation of japanese chatbot using BERT and Transformer's decoder
language-pretraining - BERT and ELECTRA models of PyTorch implementations for Japanese text.
medbertjp - Trials of pre-trained BERT models for the medical domain in Japanese.
ILYS-aoba-chatbot - ILYS-aoba-chatbot
t5-japanese - Codes to pre-train Japanese T5 models
pytorch_bert_japanese - Use Japanese pre-trained BERT models in PyTorch
Laboro-BERT-Japanese - Laboro BERT Japanese: Japanese BERT Pre-Trained With Web-Corpus
RoBERTa-japanese - Japanese BERT Pretrained Model
aMLP-japanese - aMLP Transformer Model for Japanese
bert-japanese-aozora - Japanese BERT trained on Aozora Bunko and Wikipedia, pre-tokenized by MeCab with UniDic & SudachiPy
sbert-ja - Code to train Sentence BERT Japanese model for Hugging Face Model Hub
BERT-Japan-vaccination - Official fine-tuning code for "Emotion Analysis of Japanese Tweets and Comparison to Vaccinations in Japan"
gpt2-japanese - Japanese GPT2 Generation Model
text2text-japanese - gpt-2 based text2text conversion model
gpt-ja - GPT-2 Japanese model for HuggingFace's transformers
friendly_JA-Model - MT model trained using the friendly_JA Corpus attempting to make Japanese easier/more accessible to occidental people by using the Latin/English derived katakana lexicon instead of the standard Sino-Japanese lexicon
albert-japanese - BERT with SentencePiece for Japanese text.
ja_text_bert - A repository for building a pre-trained BERT model from the Japanese Wikipedia corpus
DistilBERT-base-jp - A Japanese DistilBERT pretrained model, which was trained on Wikipedia.
bert - This repository provides snippets to use RoBERTa pre-trained on Japanese corpus. Our dataset consists of Japanese Wikipedia and web-crawled articles, 25GB in total. The released model is built based on that from HuggingFace.
Laboro-DistilBERT-Japanese - Laboro DistilBERT Japanese
luke - LUKE -- Language Understanding with Knowledge-based Embeddings
GPTSAN - General-purpose Switch Transformer-based Japanese language model
japanese-clip - Japanese CLIP by rinna Co., Ltd.
AcademicBART - We pretrained a BART-based Japanese masked language model on paper abstracts from the academic database CiNii Articles
AcademicRoBERTa - We pretrained a RoBERTa-based Japanese masked language model on paper abstracts from the academic database CiNii Articles.
LINE-DistilBERT-Japanese - DistilBERT model pre-trained on 131 GB of Japanese web text. The teacher model is BERT-base that built in-house at LINE.
Japanese-Alpaca-LoRA - Links to Low-Rank Adapters created by fine-tuning LLaMA on the Stanford Alpaca dataset translated into Japanese, plus sample generation code
albert-japanese-tinysegmenter - Pretrained models, codes and guidances to pretrain official ALBERT(https://github.com/google-research/albert ) on Japanese Wikipedia Resources
japanese-llama-experiment - Japanese LLaMa experiment
mecab-ipadic-neologd - Neologism dictionary based on the language resources on the Web for mecab-ipadic
tdmelodic - A Japanese accent dictionary generator
jamdict - Python 3 library for manipulating Jim Breen's JMdict, KanjiDic2, JMnedict and kanji-radical mappings
unidic-py - Unidic packaged for installation via pip.
Japanese-Company-Lexicon - Japanese Company Lexicon (JCLdic)
manbyo-sudachi - Manbyo Dictionary (万病辞書) for Sudachi
jawiki-kana-kanji-dict - Generate SKK/MeCab dictionary from Wikipedia(Japanese edition)
JIWC-Dictionary - dictionary to find emotion related to text
JumanDIC - This repository contains source dictionary files to build dictionaries for JUMAN and Juman++.
ipadic-py - IPAdic packaged for easy use from Python.
unidic-lite - A small version of UniDic for easy pip installs.
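emoji-ime-dictionary - An add-on IME dictionary for typing emoji in Japanese. It enables Japanese-to-emoji conversion in Google Japanese Input and other IMEs.
google-ime-dictionary - An add-on IME dictionary for Japanese-to-English conversion and English abbreviation expansion. It enables both in Google Japanese Input, ATOK, and other IMEs.
dic-nico-intersection-pixiv - An IME dictionary of the entries common to Niconico Pedia and Pixiv Encyclopedia
google-ime-user-dictionary-ja-en - Project archive of a Google IME user dictionary from katakana words (Japanese loanwords) to English.
emoticon - A kaomoji (emoticon) dictionary for Google Japanese Input ∩(,,Ò‿Ó,,)∩
mecab-mozcdic - The open source mozc dictionary converted into the MeCab dictionary format.
denonbu-ime-dic - 電音IME: A dictionary of terms related to Denonbu (電音部), intended for use with Microsoft IME and similar IMEs
nijisanji-ime-dic - A dictionary of terms related to Nijisanji (にじさんじ), intended for use with Microsoft IME and similar IMEs.
pokemon-ime-dic - A dictionary covering the names of all currently known Pokémon, intended for use with Microsoft IME and similar IMEs.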
EJDict - English-Japanese Dictionary data (Public Domain) EJDict-hand
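Ayashiy-Nipongo-Dic - "With this dictionary, you can use correct Japanese." (The original description is written in deliberately garbled Japanese.)
genshin-dict - A word dictionary for Genshin Impact (原神) usable on Windows/macOS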
jmdict-simplified - JMdict and JMnedict in JSON format
mozcdict-ext - Convert external words into Mozc system dictionary
mh-dict-jp - I want to make a user dictionary for Monster Hunter…
jitenbot - Convert data from Japanese dictionary websites and applications into portable file formats
mecab-unidic-neologd - Neologism dictionary based on the language resources on the Web for mecab-unidic
hololive-dictionary - Dictionary files for hololive (hololive production). You can add words to your IME using the text files in the ./dictionary folder. See README.md for details.
jmdict-yomitan - JMdict, JMnedict, KANJIDIC for Yomitan/Yomichan.
yomichan-jlpt-vocab - JLPT level tags for words in Yomichan
Jitendex - A free and openly licensed Japanese-to-English dictionary compatible with multiple dictionary clients
jiten - Japanese Android/CLI/web dictionary based on JMdict/KANJIDIC (Japanese-English, kanji-English, Japanese-German, and Japanese-Dutch)
pixiv-yomitan - Pixiv Encyclopedia Dictionary for Yomitan
Part-of-speech tagging / Named entity recognition
JMRD - Japanese Movie Recommendation Dialogue dataset
open2ch-dialogue-corpus - A dialogue corpus built by crawling Open2ch (おーぷん2ちゃんねる)
BSD - The Business Scene Dialogue corpus
asdc - Accommodation Search Dialog Corpus (宿泊施設探索対話コーパス)
japanese-corpus - Japanese dialogue data for seq2seq etc.
BPersona-chat - This repository contains the Japanese–English bilingual chat corpus BPersona-chat published in the paper Chat Translation Error Detection for Assisting Cross-lingual Communications at AACL-IJCNLP 2022's Workshop Eval4NLP 2022.
japanese-daily-dialogue - Japanese Daily Dialogue, or 日本語日常対話コーパス in Japanese, is a high-quality multi-turn dialogue dataset containing daily conversations on five topics: dailylife, school, travel, health, and entertainment.
llm-japanese-dataset - A Japanese chat dataset for building LLMs
jrte-corpus - Japanese Realistic Textual Entailment Corpus (NLP 2020, LREC 2020)
kanji-data - A JSON kanji dataset with updated JLPT levels and WaniKani information
JapaneseWordSimilarityDataset - Japanese Word Similarity Dataset
simple-jppdb - A paraphrase database for Japanese text simplification
chABSA-dataset - chakki's Aspect-Based Sentiment Analysis dataset
JaQuAD - JaQuAD: Japanese Question Answering Dataset for Machine Reading Comprehension (2022, Skelter Labs)
JaNLI - Japanese Adversarial Natural Language Inference Dataset
ebe-dataset - Evidence-based Explanation Dataset (AACL-IJCNLP 2020)
emoji-ja - A dictionary of Japanese readings, keywords, and categories for Unicode emoji
nayose-wikipedia-ja - A Japanese name-matching (名寄せ) dataset created from Wikipedia
ja.text8 - Japanese text8 corpus for word embedding.
ThreeLineSummaryDataset - Three-line summary dataset
japanese - This repo contains a list of the 44,998 most common Japanese words in order of frequency, as determined by the University of Leeds Corpus.
kanji-frequency - Kanji usage frequency data collected from various sources
TEDxJP-10K - TEDxJP-10K ASR Evaluation Dataset
CoARiJ - Corpus of Annual Reports in Japan
technological-book-corpus-ja - A raw corpus of technical books written in Japanese, with accompanying tools
ita-corpus-chuwa - Chunked word annotation for ITA corpus
wikipedia-utils - Utility scripts for preprocessing Wikipedia texts for NLP
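inappropriate-words-ja - A collection of inappropriate expressions in Japanese. It may be useful for data cleaning in natural language processing, among other uses.
house-of-councillors - Organized data on parliamentary groups, members, bills, and written questions from the official website of the House of Councillors.
house-of-representatives - Diet bill database: House of Representatives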
STAIR-captions - STAIR captions: large-scale Japanese image caption dataset
Winograd-Schema-Challenge-Ja - Japanese Translation of Winograd Schema Challenge
speechBSD - An extension of the BSD corpus with audio and speaker attribute information
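ita-corpus - List of sentences in the ITA corpus
rohan4600 - A mora-balanced Japanese corpus
anlp-jp-history - Complete list of presentations at the annual meetings of the Association for Natural Language Processing (言語処理学会), including machine-readable versions
keigo_transfer_task - An evaluation dataset for the honorific (keigo) conversion task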
loanwords_gairaigo - English loanwords in Japanese
jawikicorpus - Japanese-Wikipedia Wikification Corpus
GeneralPolicySpeechOfPrimeMinisterOfJapan - A corpus of Japanese text from the general policy speeches of the Prime Ministers of Japan
wrime - WRIME: A dataset for subjective and objective emotion analysis
jtubespeech - JTubeSpeech: Corpus of Japanese speech collected from YouTube
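WikipediaWordFrequencyList - A list of frequently used words in Japanese Wikipedia
kokkosho_data - A dataset of vehicle defect information
pdmocrdataset-part1 - An OCR training dataset created in the project to convert digitized materials into OCR text
huriganacorpus-ndlbib - A furigana dataset created from the Japanese national bibliography data
jvs_hiho - Self-made labels for the JVS (Japanese versatile speech) corpus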
hirakanadic - Allows Sudachi to normalize from hiragana to katakana from any compound word list
animedb - A database listing anime works spanning roughly 100 years
security_words - Japanese-English correspondences for the names of public organizations related to cybersecurity
Data-on-Japanese-Diet-Members - Data on members of the Japanese Diet
honkoku-data - Transcription texts created on Minna de Honkoku (https://honkoku.org), a crowdsourced transcription platform for historical Japanese documents.
wikihow_japanese - wikiHow dataset (Japanese version)
engineer-vocabulary-list - Engineer Vocabulary List in Japanese/English
JSICK - Japanese Sentences Involving Compositional Knowledge (JSICK) Dataset/JSICK-stress Test Set
phishurl-list - Phishing URL dataset from JPCERT/CC
jcms - A Japanese Corpus of Many Specialized Domains (JCMS)
aozorabunko_text - text-only archives of www.aozora.gr.jp
friendly_JA-Corpus - friendly_JA is a parallel Japanese-to-Japanese corpus aimed at making Japanese easier by using the Latin/English derived katakana lexicon instead of the standard Sino-Japanese lexicon
topokanji - Topologically ordered lists of kanji for effective learning
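isbn4groups - Data on Japanese-language publications in ISBN-13 (978-4-XXXXXXXXX)
NMeCab - NMeCab: Japanese morphological analyzer on .NET
ndlngramdata - A dataset of n-gram frequency statistics for OCR text data created from digitized materials
ndlngramviewer_v2 - The full source code and related files for the NDL Ngram Viewer, renewed in January 2023
data_set - Datasets related to laws and court precedents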
huggingface-datasets_wrime - WRIME for huggingface datasets
ndl-minhon-ocrdataset - NDL OCR training dataset for classical Japanese books (processed Minna de Honkoku data)
PAX_SAPIENTICA - GIS & Archaeological Simulator. 2023 in development.
j-liwc2015 - Japanese version of LIWC2015
huggingface-datasets_livedoor-news-corpus - Japanese Livedoor news corpus for huggingface datasets
huggingface-datasets_JGLUE - JGLUE: Japanese General Language Understanding Evaluation for huggingface datasets
commonsense-moral-ja - JCommonsenseMorality is a dataset created through crowdsourcing that reflects the commonsense morality of Japanese annotators.
comet-atomic-ja - COMET-ATOMIC ja
dcsg-ja - Dialogue Commonsense Graph in Japanese
japanese-toxic-dataset - "Proposal and Evaluation of Japanese Toxicity Schema" provides a schema and dataset for toxicity in the Japanese language.
camera - CAMERA (CyberAgent Multimodal Evaluation for Ad Text GeneRAtion) is the Japanese ad text generation dataset.
Japanese-Fakenews-Dataset - Japanese fake news dataset
jpn_explainable_qa_dataset - jpn_explainable_qa_dataset
copa-japanese - COPA Dataset in Japanese
WLSP-familiarity - Word Familiarity Rate for 'Word List by Semantic Principles (WLSP)'
ProSub - A cross-linguistic study of pronoun substitutes and address terms
ramendb - A scraping tool for the なんとかデータベース sites (https://supleks.jp/) and the data collected from them
huggingface-datasets_CAMERA - CAMERA (CyberAgent Multimodal Evaluation for Ad Text GeneRAtion) for huggingface datasets
FactCheckSentenceNLI-FCSNLI- - FactCheckSentenceNLI dataset
databricks-dolly-15k-ja - A Japanese translation of databricks-dolly-15k.jsonl, the dataset used to train databricks/dolly-v2-12b.
EaST-MELD - EaST-MELD is an English-Japanese dataset for emotion-aware speech translation based on MELD.
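meconaudio - Mecon Audio (Medical Conference Audio) is a read-speech dataset of the minutes of the Advanced Medical Care Council (先進医療会議) hosted by the Ministry of Health, Labour and Welfare.
japanese-addresses - Open data of nationwide address data at the town/chome level (277,191 entries)
aozorasearch - The full-text search system for Aozora Bunko by Groonga, provided as a library and a web app.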
llm-jp-corpus - This repository contains scripts to reproduce the LLM-jp corpus.
alpaca_ja - A Japanese version of the alpaca dataset
instruction_ja - Japanese instruction data
japanese-family-names - Top 5000 Japanese family names, with readings, ordered by frequency.
kanji-data-media - Japanese language data on kanji, radicals, media files, fonts and related resources from Kanji alive
reazonspeech - Construct large-scale Japanese audio corpus at home
huriganacorpus-aozora - A furigana dataset created from Aozora Bunko and Sapie braille data
koniwa - An open collection of annotated voices in Japanese language
JMMLU - Japanese Massive Multitask Language Understanding Benchmark
hurigana-speech-corpus-aozora - A speech corpus dataset with furigana annotations from Aozora Bunko