microsoft / BlingFire

A lightning fast Finite State machine and REgular expression manipulation library.

Cannot find "no_padding" option in C#?

myl1ne opened this issue · comments

Hey @SergeiAlonichau and @ankane, I'm trying to get parity with the Hugging Face tokeniser using the BlingFire C# bindings:

    public int[] TestTokenise(string input_str)
    {
        string tokeniserModelPath = "D:/Models/tokenizers/gpt2.bin";
        tokenizerHandle = BlingFireUtils.LoadModel(tokeniserModelPath);
        BlingFireUtils.SetNoDummyPrefix(tokenizerHandle, false);
        Debug.Log($"About to tokenize {input_str}");
        byte[] inBytes = System.Text.Encoding.UTF8.GetBytes(input_str);
        int[] ids = new int[128];
        int outputCount = BlingFireUtils.TextToIds(tokenizerHandle, inBytes, inBytes.Length, ids, ids.Length, 0);
        Debug.Log($"Found {outputCount} tokens [{string.Join(",",ids)}]");
        return ids.Take(outputCount).ToArray();
    }

I'm getting different tokens than what @ankane had earlier:
[image]

From your discussion, I think I need to set no_padding to true, but I can't find this option in the C# interface. Any clue where I should look?

Originally posted by @stephane-lallee in #82 (comment)

You need to resize the ids array down to outputCount elements; see the C# example: https://github.com/microsoft/BlingFire/blob/master/nuget/test/Program.cs

        var outputCount = BlingFireUtils.TextToIds(h1, inBytes, inBytes.Length, Ids, Ids.Length, 0);
        Console.WriteLine(String.Format("return length: {0}", outputCount));
        if (outputCount >= 0)
        {
            Array.Resize(ref Ids, outputCount);
            Console.WriteLine(String.Format("return array: [{0}]", string.Join(", ", Ids)));

...

The Python API has this parameter, but the C# API does not.
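Since the C# bindings return the token count instead of taking a `no_padding` flag, trimming the buffer to that count appears to be the equivalent step. Below is a minimal sketch of that trimming on its own. `TokenBufferHelper` and `TrimToCount` are hypothetical names for illustration, not part of BlingFire. The sketch assumes the returned count can be negative on failure and can exceed the buffer length when the buffer is too small; check both against the library version you use.

```csharp
using System;

public static class TokenBufferHelper
{
    // Hypothetical helper (not part of BlingFire): returns only the ids
    // that were actually written, dropping the unused tail of the buffer.
    public static int[] TrimToCount(int[] buffer, int count)
    {
        if (buffer == null) throw new ArgumentNullException(nameof(buffer));

        // Assumption: a negative count signals a tokenization error.
        if (count < 0) return Array.Empty<int>();

        // Assumption: the count may be larger than the buffer if it was too small,
        // so never copy more than the buffer holds.
        int kept = Math.Min(count, buffer.Length);

        int[] result = new int[kept];
        Array.Copy(buffer, result, kept);
        return result;
    }
}
```

With this in place, the snippet from the question could end with `return TokenBufferHelper.TrimToCount(ids, outputCount);` and log the trimmed array rather than the full 128-element buffer.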

Use this API after the model is loaded:

    public static extern int SetNoDummyPrefix(UInt64 model, bool fNoDummyPrefix);

It controls whether a special space is added in front of the input. This corresponds to add_prefix_space=True/False in the Hugging Face tokenizers:

    AutoTokenizer.from_pretrained("roberta-base", add_prefix_space=True/False)
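Putting the two answers together, a rough end-to-end sketch might look like the following. It uses only the calls shown in this thread plus `BlingFireUtils.FreeModel`, and it assumes the `BlingFire` namespace from the NuGet package. `Gpt2TokenizeSketch`, `Tokenize`, and the 128-element buffer are illustrative choices, not library API. Going by the parameter name, passing `true` presumably suppresses the prefix space (the `add_prefix_space=False` case), but that mapping is an assumption worth verifying against known Hugging Face output.

```csharp
using System;
using System.Text;
using BlingFire; // assumed namespace of the BlingFire NuGet bindings

public static class Gpt2TokenizeSketch
{
    // Illustrative wrapper: loads a model, configures the prefix space,
    // tokenizes one string, and returns only the ids that were produced.
    public static int[] Tokenize(string modelPath, string text, bool noDummyPrefix)
    {
        ulong handle = BlingFireUtils.LoadModel(modelPath);
        try
        {
            // Must be called after the model is loaded.
            // Assumption: true means "do not add the leading space".
            BlingFireUtils.SetNoDummyPrefix(handle, noDummyPrefix);

            byte[] inBytes = Encoding.UTF8.GetBytes(text);
            int[] ids = new int[128]; // illustrative buffer size

            int count = BlingFireUtils.TextToIds(
                handle, inBytes, inBytes.Length, ids, ids.Length, 0);

            // Assumption: a negative count signals an error.
            if (count < 0) return Array.Empty<int>();

            // Trim the unused tail, as in the nuget/test/Program.cs example.
            Array.Resize(ref ids, Math.Min(count, ids.Length));
            return ids;
        }
        finally
        {
            // Release the native model handle.
            BlingFireUtils.FreeModel(handle);
        }
    }
}
```

In a real application the model would normally be loaded once and the handle reused across calls; loading and freeing per call here only keeps the sketch self-contained.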