Cannot find "no_padding" option in C#?
myl1ne opened this issue
Hey @SergeiAlonichau and @ankane, I'm trying to get parity with the HuggingFace tokeniser using the BlingFire C# bindings:
public int[] TestTokenise(string input_str)
{
    string tokeniserModelPath = "D:/Models/tokenizers/gpt2.bin";
    tokenizerHandle = BlingFireUtils.LoadModel(tokeniserModelPath);
    BlingFireUtils.SetNoDummyPrefix(tokenizerHandle, false);
    Debug.Log($"About to tokenize {input_str}");
    byte[] inBytes = System.Text.Encoding.UTF8.GetBytes(input_str);
    int[] ids = new int[128];
    int outputCount = BlingFireUtils.TextToIds(tokenizerHandle, inBytes, inBytes.Length, ids, ids.Length, 0);
    Debug.Log($"Found {outputCount} tokens [{string.Join(",",ids)}]");
    return ids.Take(outputCount).ToArray();
}
I'm getting different tokens than what @ankane had earlier:
From your discussion, I think I'd need to set no_padding to true, but I do not find this option in the C# interface. Any clue where I should look?
Originally posted by @stephane-lallee in #82 (comment)
You need to resize the ids array down to outputCount elements before printing it; otherwise the unused slots of the buffer show up as extra ids. See the C# example: https://github.com/microsoft/BlingFire/blob/master/nuget/test/Program.cs
var outputCount = BlingFireUtils.TextToIds(h1, inBytes, inBytes.Length, Ids, Ids.Length, 0);
Console.WriteLine(String.Format("return length: {0}", outputCount));
if (outputCount >= 0)
{
    Array.Resize(ref Ids, outputCount);
    Console.WriteLine(String.Format("return array: [{0}]", string.Join(", ", Ids)));
    ...
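To make the resize point concrete, here is a minimal sketch that does not call BlingFire at all: it simulates a fixed-size id buffer in which only the first few slots were written, then trims it. The helper name TokenBuffer.Trim and the id values are made up for illustration, and clamping the count to the buffer length is a defensive assumption rather than documented BlingFire behaviour.

```csharp
using System;

public static class TokenBuffer
{
    // Hypothetical helper (not part of BlingFire): returns a copy of the
    // first `count` ids from a fixed-size buffer. A negative count is
    // treated as a failed call and yields an empty array. A count larger
    // than the buffer is clamped to the buffer length (an assumption).
    public static int[] Trim(int[] buffer, int count)
    {
        if (count < 0)
        {
            return Array.Empty<int>();
        }

        int n = Math.Min(count, buffer.Length);
        int[] result = new int[n];
        Array.Copy(buffer, result, n);
        return result;
    }
}

public static class Program
{
    public static void Main()
    {
        // Simulated tokenizer output: three ids written, the remaining
        // slots keep their default value of 0. The ids are invented.
        int[] ids = new int[8];
        ids[0] = 101;
        ids[1] = 202;
        ids[2] = 303;
        int outputCount = 3;

        // Joining the whole buffer includes the untouched zero slots.
        Console.WriteLine("raw:     [" + string.Join(",", ids) + "]");

        // Joining the trimmed copy shows only the ids that were written.
        int[] trimmed = TokenBuffer.Trim(ids, outputCount);
        Console.WriteLine("trimmed: [" + string.Join(",", trimmed) + "]");
    }
}
```

In the snippet from the question, the return value is already trimmed with Take(outputCount); only the log line joins the full 128-element buffer, which is why it can look as if padding ids were produced.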
The Python APIs have this parameter, but the C# API does not.
Use this API after the model is loaded: BlingFire/nuget/lib/BlingFireUtils.cs, line 221 in d9d5cea
AutoTokenizer.from_pretrained("roberta-base", add_prefix_space=True/False)