rinnakk / japanese-pretrained-models

Code for producing Japanese pretrained models provided by rinna Co., Ltd.

Home Page: https://huggingface.co/rinna

The load_docs_from_filepath function in src/task/pretrain_roberta/train.py just returns an empty list.

HiroshigeAoki opened this issue

The load_docs_from_filepath function in src/task/pretrain_roberta/train.py only returns an empty list.
Is this the intended behavior?
Thank you.

def load_docs_from_filepath(filepath, tokenizer):
    docs = []
    with open(filepath, encoding="utf-8") as f:
        doc = []
        for line in f:
            line = line.strip()
            if line == "":
                if len(doc) > 0:
                    docs.append(doc)
                doc = []
            else:
                sent = line
                tokens = tokenizer.tokenize(sent)
                token_ids = tokenizer.convert_tokens_to_ids(tokens)
                if len(token_ids) > 0:
                    doc.append(token_ids)
    return docs

Hi, this is not supposed to happen.
Please check the content of the file at filepath. If it is not empty, please paste some lines of it here so we can better understand what is happening.
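One way to check the file is to run the function on small sample inputs. The sketch below copies load_docs_from_filepath from the issue and feeds it two temporary files. As written, the function appends a document only when it reaches a blank line, so a file with no blank lines yields an empty list. Whether that is what happened in this thread is not stated. WhitespaceTokenizer, write_temp and load_docs_with_final_flush are hypothetical helpers made up for this illustration; they are not part of the repository, and the real tokenizer is replaced by a whitespace splitter.

```python
import os
import tempfile


class WhitespaceTokenizer:
    """Hypothetical stand-in for the real tokenizer (assumption for this sketch)."""

    def __init__(self):
        self.vocab = {}

    def tokenize(self, text):
        return text.split()

    def convert_tokens_to_ids(self, tokens):
        # Assign a new id the first time a token is seen.
        return [self.vocab.setdefault(tok, len(self.vocab)) for tok in tokens]


def load_docs_from_filepath(filepath, tokenizer):
    # Copied from the issue: a document is appended only on a blank line.
    docs = []
    with open(filepath, encoding="utf-8") as f:
        doc = []
        for line in f:
            line = line.strip()
            if line == "":
                if len(doc) > 0:
                    docs.append(doc)
                doc = []
            else:
                token_ids = tokenizer.convert_tokens_to_ids(tokenizer.tokenize(line))
                if len(token_ids) > 0:
                    doc.append(token_ids)
    return docs


def load_docs_with_final_flush(filepath, tokenizer):
    # Hypothetical variant (not the repository's code): also keeps a trailing
    # document that is not followed by a blank line.
    docs = []
    doc = []
    with open(filepath, encoding="utf-8") as f:
        for line in f:
            line = line.strip()
            if line == "":
                if doc:
                    docs.append(doc)
                doc = []
            else:
                token_ids = tokenizer.convert_tokens_to_ids(tokenizer.tokenize(line))
                if token_ids:
                    doc.append(token_ids)
    if doc:
        docs.append(doc)
    return docs


def write_temp(text):
    # Write text to a temporary file and return its path.
    fd, path = tempfile.mkstemp(suffix=".txt", text=True)
    with os.fdopen(fd, "w", encoding="utf-8") as f:
        f.write(text)
    return path


if __name__ == "__main__":
    tok = WhitespaceTokenizer()
    separated = write_temp("a b\nc d\n\ne f\n\n")   # blank line after each document
    unseparated = write_temp("a b\nc d\ne f\n")     # no blank lines at all
    print(len(load_docs_from_filepath(separated, tok)))       # 2
    print(load_docs_from_filepath(unseparated, tok))          # []
    print(len(load_docs_with_final_flush(unseparated, tok)))  # 1
    os.remove(separated)
    os.remove(unseparated)
```

If the real input file is non-empty but the function still returns an empty list, checking whether the file contains any blank lines is a reasonable first step.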

This was my fault. I'm sorry to have bothered you.