facebookresearch / LASER

Language-Agnostic SEntence Representations

error occurs when using my own trained LASER model

weitaizhang opened this issue

Hi,
I followed the instructions in the LASER training README (https://github.com/facebookresearch/fairseq/blob/nllb/examples/laser/README.md) and trained a LASER model on my own data (Chinese-English bitext).
But when I embed sentences with the trained model, the following error occurs:
[screenshot of the error traceback]

From the logs and the code, I found that the checkpoint file of my trained model is structured differently from laser2.pt: the state_dict of my checkpoint has no "params" key.
[screenshot of the checkpoint contents]
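For reference, the structural difference can be seen by loading both files on CPU and comparing their top-level keys; the paths below are placeholders, not my actual paths:

import torch

# Compare the top-level layout of the two checkpoint files (paths are placeholders).
my_ckpt = torch.load("path/to/your/checkpoint.pt", map_location="cpu")
laser2 = torch.load("path/to/laser2.pt", map_location="cpu")

print(sorted(my_ckpt.keys()))  # fairseq training checkpoint: contains "args", "model", ...
print(sorted(laser2.keys()))   # LASER encoder file: expected to include "params", "model", "dictionary"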

So did I do something wrong when training the LASER model (I actually want to train a teacher model)?
Thanks for your reply.

Hi @weitaizhang, did you train this model using --arch laser_lstm?

@heffernankevin yes, I used --arch laser_lstm.
FYI, here's my training script:
[screenshot of the training script]
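(Since the screenshot isn't reproduced here: whether the checkpoint really was trained with --arch laser_lstm can also be read back from the checkpoint itself. A minimal sketch, with a placeholder path:)

import torch

# Read the stored training arguments back from the fairseq checkpoint (path is a placeholder).
ckpt = torch.load("path/to/your/checkpoint.pt", map_location="cpu")
print(ckpt["args"].arch)  # should print "laser_lstm"
print(ckpt["args"].encoder_hidden_size,
      ckpt["args"].encoder_layers,
      ckpt["args"].encoder_bidirectional)  # encoder hyper-parameters used by the conversion script below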
Also, did you train the laser2 model with the fairseq code, or do I need to convert the trained checkpoint file into a particular structure after training?

Hi @weitaizhang, yes, you'll need to convert the checkpoint file. Can you try running the following temporary conversion script (from the $LASER/source directory) and then re-run the embed step?

import torch
from fairseq.data.dictionary import Dictionary
from embed import LaserLstmEncoder

# Load the fairseq training checkpoint (replace the placeholder with your checkpoint path).
ckpt = torch.load([path/to/your/checkpoint], map_location="cpu")

# Keep only the encoder weights and strip the "encoder." prefix from their names.
encoder_state = {
    key.replace("encoder.", ""): val
    for key, val in ckpt["model"].items()
    if key.startswith("encoder.")
}

# Reconstruct the encoder hyper-parameters from the embedding shape and the stored training args.
params = {
    "num_embeddings": encoder_state["embed_tokens.weight"].shape[0],
    "padding_idx": 1,
    "embed_dim": encoder_state["embed_tokens.weight"].shape[1],
    "hidden_size": ckpt["args"].encoder_hidden_size,
    "num_layers": ckpt["args"].encoder_layers,
    "bidirectional": ckpt["args"].encoder_bidirectional,
}

# Rebuild the encoder and load the trained weights into it.
encoder = LaserLstmEncoder(**params)
encoder.embed_tokens.weight = torch.nn.Parameter(encoder_state["embed_tokens.weight"])
encoder.lstm.load_state_dict(
    {
        key.replace("lstm.", ""): val
        for key, val in encoder_state.items()
        if key.startswith("lstm.")
    }
)

# Save the checkpoint in the layout the embed code expects: "params", "model", and "dictionary".
state_dict = {
    "params": params,
    "model": encoder.state_dict(),
    "dictionary": Dictionary.load([path/to/your/src_vocab/file/from/config.json]).indices,
}
torch.save(state_dict, "encoder.pt")

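As a quick sanity check (assuming the script above ran through), the converted encoder.pt should now have the same three top-level keys that the embed step looks for:

import torch

# Verify the converted file has the layout the embed step expects.
converted = torch.load("encoder.pt", map_location="cpu")
print(sorted(converted.keys()))  # expected: ['dictionary', 'model', 'params']
print(converted["params"])       # the reconstructed encoder hyper-parameters

Then point the embed step at encoder.pt instead of the original fairseq checkpoint.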
Many thanks, it works now!