Unknown PostProcessor type: Sequence
zcbenz opened this issue · comments
System Info
Using Node.js 20 with transformers.js 2.17.1.
Environment/Platform
- Website/web-app
- Browser extension
- Server-side (e.g., Node.js, Deno, Bun)
- Desktop app (e.g., Electron)
- Other (e.g., VSCode extension)
Description
It seems that the following post-processor in tokenizer.json is not supported:
"post_processor": {
"type": "Sequence",
"processors": [
{
"type": "ByteLevel",
"add_prefix_space": true,
"trim_offsets": false,
"use_regex": true
},
{
"type": "TemplateProcessing",
"single": [
{
"SpecialToken": {
"id": "<|begin_of_text|>",
"type_id": 0
}
},
{
"Sequence": {
"id": "A",
"type_id": 0
}
}
],
"pair": [
{
"SpecialToken": {
"id": "<|begin_of_text|>",
"type_id": 0
}
},
{
"Sequence": {
"id": "A",
"type_id": 0
}
},
{
"SpecialToken": {
"id": "<|begin_of_text|>",
"type_id": 1
}
},
{
"Sequence": {
"id": "B",
"type_id": 1
}
}
],
"special_tokens": {
"<|begin_of_text|>": {
"id": "<|begin_of_text|>",
"ids": [
128000
],
"tokens": [
"<|begin_of_text|>"
]
}
}
}
]
},
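For context, the TemplateProcessing step in this config prepends the `<|begin_of_text|>` special token (id 128000) to a single encoded sequence. A minimal sketch of that behavior, with an illustrative function name and placeholder token ids (not the transformers.js API):

```javascript
// Sketch of what the "single" template above does: prepend the
// <|begin_of_text|> token id to the encoded sequence ("Sequence A").
function applySingleTemplate(ids) {
  const BOS_ID = 128000; // "<|begin_of_text|>" per the special_tokens entry
  return [BOS_ID, ...ids];
}

// Placeholder token ids, purely for illustration:
console.log(applySingleTemplate([101, 102]));
// → [ 128000, 101, 102 ]
```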
Reproduction
import {AutoTokenizer} from '@xenova/transformers'

// from_pretrained is async; await it (top-level await works in an ESM module on Node 20)
const tokenizer = await AutoTokenizer.from_pretrained('yujiepan/llama-3-tiny-random')
Throws error:
throw new Error(`Unknown PostProcessor type: ${config.type}`);
^
Error: Unknown PostProcessor type: Sequence
at PostProcessor.fromConfig (file:///private/tmp/tinyllama/node_modules/@xenova/transformers/src/tokenizers.js:1596:23)
at new PreTrainedTokenizer (file:///private/tmp/tinyllama/node_modules/@xenova/transformers/src/tokenizers.js:2465:45)
at AutoTokenizer.from_pretrained (file:///private/tmp/tinyllama/node_modules/@xenova/transformers/src/tokenizers.js:4424:16)
Hi there 👋 Thanks for the report!
Luckily, we already support the ByteLevel and TemplateProcessing post-processors, so the only thing needed is to implement the Sequence post-processor.
Similarly, we already support sequences of normalizers, decoders, and pre-tokenizers, and a similar pattern can be adapted for post-processors. Is this something you'd be interested in adding? If so, I'd be happy to review a PR.
Sorry, I don't plan to work on this issue myself; I was just reporting one I ran into.
No worries! It's super simple, so I'll add it soon. Thanks again for reporting!
@xenova Any thoughts on this? It's preventing Llama 3 8B from loading, which is a bummer.
Here's the Rust code for it: https://github.com/huggingface/tokenizers/blob/25aee8b88c8de3c5a52e2f9cb6281d6df00ad516/tokenizers/src/processors/sequence.rs#L18-L36 — it should be straightforward to translate into JS.
@xenova 💯 Thank you, that's awesome. What cadence are you doing releases on?
Will most likely do one tomorrow. Just finalizing a few other things. :)