Please update the data URL.
spider-man-tm opened this issue · comments
Takayoshi Makabe commented
I noticed that the Wikipedia dataset has been updated for all languages.
as is (src/coupus/jp_wiki/config.py)
class Config(object):
    def __init__(self):
        self.corpus_name = "jp_wiki"

        # Management
        self.download_link = "https://dumps.wikimedia.org/other/cirrussearch/current/jawiki-20211025-cirrussearch-content.json.gz"
        self.raw_data_dir = "../data/jp_wiki/raw_data"
        self.raw_data_path = f"{self.raw_data_dir}/wiki.json.gz"
        self.extracted_data_path = f"{self.raw_data_dir}/wiki.extracted.txt"
        self.doc_data_dir = "../data/jp_wiki/doc_data"
to be (src/coupus/jp_wiki/config.py)
class Config(object):
    def __init__(self):
        self.corpus_name = "jp_wiki"

        # Management
        self.download_link = "https://dumps.wikimedia.org/other/cirrussearch/current/jawiki-20220228-cirrussearch-content.json.gz"
        self.raw_data_dir = "../data/jp_wiki/raw_data"
        self.raw_data_path = f"{self.raw_data_dir}/wiki.json.gz"
        self.extracted_data_path = f"{self.raw_data_dir}/wiki.extracted.txt"
        self.doc_data_dir = "../data/jp_wiki/doc_data"
Zhao Tianyu commented
Thank you. It is now solved in this commit.
nissansz commented
The JSON file stores its text as Unicode escape sequences, e.g. "text":"\u30a2\u30d5\u30ea\u30ab \u30a2\u30d5\u30ea\u30ab\uff08\u82f1\u00a0:
lines1 = f1.read()
lines1 = lines1.encode('utf-8').decode("unicode_escape")
print(path1+':'+line)
UnicodeEncodeError: 'utf-8' codec can't encode characters in position 118-119: surrogates not allowed
How can I fix this error?
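A likely cause of the error above: `unicode_escape` decodes each `\uXXXX` escape on its own, so a character outside the Basic Multilingual Plane, which JSON writes as a surrogate pair such as `\ud83d\ude00`, ends up as two lone surrogates that UTF-8 cannot encode. It also treats any raw non-ASCII bytes as Latin-1. A minimal sketch of one alternative is to let the `json` module decode the escapes, since it joins surrogate pairs correctly. The helper name `iter_texts`, the sample string, and the assumption that the dump is gzip-compressed with one JSON object per line (some lines lacking a `"text"` field) are illustrative and not taken from the repository.

```python
import gzip
import json


def iter_texts(path):
    """Yield the "text" field of each JSON line in a gzip-compressed file.

    Hypothetical helper: assumes one JSON object per line and skips
    blank lines and records that have no "text" field.
    """
    with gzip.open(path, "rt", encoding="utf-8") as f:
        for line in f:
            if not line.strip():
                continue
            record = json.loads(line)  # decodes \uXXXX escapes, pairing surrogates
            if "text" in record:
                yield record["text"]


# Small in-memory demonstration with an escaped surrogate pair (illustrative data).
sample = r'{"text": "\u30a2\u30d5\u30ea\u30ab \ud83d\ude00"}'
decoded = json.loads(sample)["text"]

# The decoded string contains no lone surrogates, so this encodes without error.
encoded = decoded.encode("utf-8")
```

With the file from the config above, this would be used as `for text in iter_texts("../data/jp_wiki/raw_data/wiki.json.gz"): ...`, assuming the file has already been downloaded to that path.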