rinnakk / japanese-pretrained-models

Code for producing Japanese pretrained models provided by rinna Co., Ltd.

Home Page:https://huggingface.co/rinna

Geek Repo:Geek Repo

Github PK Tool:Github PK Tool

Please update data’s url.

spider-man-tm opened this issue · comments

I noticed that the wikipedia dataset has been updated in all languages.

as is (src/coupus/jp_wiki/config.py)

class Config(object):
    def __init__(self):
        self.corpus_name = "jp_wiki"

        # Management
        self.download_link = "https://dumps.wikimedia.org/other/cirrussearch/current/jawiki-20211025-cirrussearch-content.json.gz"
        self.raw_data_dir = "../data/jp_wiki/raw_data"
        self.raw_data_path = f"{self.raw_data_dir}/wiki.json.gz"
        self.extracted_data_path = f"{self.raw_data_dir}/wiki.extracted.txt"
        self.doc_data_dir = "../data/jp_wiki/doc_data"

to be (src/coupus/jp_wiki/config.py)

class Config(object):
    def __init__(self):
        self.corpus_name = "jp_wiki"

        # Management
        self.download_link = "https://dumps.wikimedia.org/other/cirrussearch/current/jawiki-20220228-cirrussearch-content.json.gz"
        self.raw_data_dir = "../data/jp_wiki/raw_data"
        self.raw_data_path = f"{self.raw_data_dir}/wiki.json.gz"
        self.extracted_data_path = f"{self.raw_data_dir}/wiki.extracted.txt"
        self.doc_data_dir = "../data/jp_wiki/doc_data"

Thank you. It is now solved in this commit.

json文件里存的的是unicode编码 "text":"\u30a2\u30d5\u30ea\u30ab \u30a2\u30d5\u30ea\u30ab\uff08\u82f1\u00a0:

    lines1 = f1.read()
    lines1  = lines1 .encode('utf-8').decode("unicode_escape")

print(path1+':'+line)

UnicodeEncodeError: 'utf-8' codec can't encode characters in position 118-119: surrogates not allowed

这个错误怎么解决?