xenova / transformers.js

State-of-the-art Machine Learning for the web. Run 🤗 Transformers directly in your browser, with no need for a server!

Home Page: https://huggingface.co/docs/transformers.js


Unknown PostProcessor type: Sequence

zcbenz opened this issue · comments

commented

System Info

Using Node.js 20 with transformers.js 2.17.1.

Environment/Platform

  • Website/web-app
  • Browser extension
  • Server-side (e.g., Node.js, Deno, Bun)
  • Desktop app (e.g., Electron)
  • Other (e.g., VSCode extension)

Description

It seems that the following post-processor in tokenizer.json is not supported:

  "post_processor": {
    "type": "Sequence",
    "processors": [
      {
        "type": "ByteLevel",
        "add_prefix_space": true,
        "trim_offsets": false,
        "use_regex": true
      },
      {
        "type": "TemplateProcessing",
        "single": [
          {
            "SpecialToken": {
              "id": "<|begin_of_text|>",
              "type_id": 0
            }
          },
          {
            "Sequence": {
              "id": "A",
              "type_id": 0
            }
          }
        ],
        "pair": [
          {
            "SpecialToken": {
              "id": "<|begin_of_text|>",
              "type_id": 0
            }
          },
          {
            "Sequence": {
              "id": "A",
              "type_id": 0
            }
          },
          {
            "SpecialToken": {
              "id": "<|begin_of_text|>",
              "type_id": 1
            }
          },
          {
            "Sequence": {
              "id": "B",
              "type_id": 1
            }
          }
        ],
        "special_tokens": {
          "<|begin_of_text|>": {
            "id": "<|begin_of_text|>",
            "ids": [
              128000
            ],
            "tokens": [
              "<|begin_of_text|>"
            ]
          }
        }
      }
    ]
  },

Reproduction

import { AutoTokenizer } from '@xenova/transformers';

await AutoTokenizer.from_pretrained('yujiepan/llama-3-tiny-random');

Throws error:

                throw new Error(`Unknown PostProcessor type: ${config.type}`);
                      ^

Error: Unknown PostProcessor type: Sequence
    at PostProcessor.fromConfig (file:///private/tmp/tinyllama/node_modules/@xenova/transformers/src/tokenizers.js:1596:23)
    at new PreTrainedTokenizer (file:///private/tmp/tinyllama/node_modules/@xenova/transformers/src/tokenizers.js:2465:45)
    at AutoTokenizer.from_pretrained (file:///private/tmp/tinyllama/node_modules/@xenova/transformers/src/tokenizers.js:4424:16)

Hi there 👋 Thanks for the report!

Luckily, we already support the ByteLevel and TemplateProcessing post-processors, so the only thing needed is to implement the Sequence post-processor.

Similarly, we already support sequences of normalizers, decoders, and pre-tokenizers, and a similar pattern can be adapted for post-processors. Is this something you'd be interested in adding? If so, I'd be happy to review a PR.
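The sequence pattern in those other components amounts to running each child in order and threading the result through. A minimal sketch of what that could look like for post-processors (class and method names here are assumptions for illustration, not the real transformers.js internals):

```javascript
// Illustrative sketch of a Sequence post-processor: run each child
// post-processor in order, threading the token list through.
// Names are assumptions, not the actual transformers.js API.
class PostProcessorSequence {
  constructor(processors) {
    this.processors = processors;
  }
  post_process(tokens) {
    let token_type_ids = tokens.map(() => 0);
    for (const processor of this.processors) {
      const output = processor.post_process(tokens);
      tokens = output.tokens;
      // Keep the most recent type ids a child produced (how to merge
      // these across children is exactly the open question below).
      token_type_ids = output.token_type_ids ?? token_type_ids;
    }
    return { tokens, token_type_ids };
  }
}

// Toy stand-in for the TemplateProcessing step above: prepend BOS.
const prependBos = {
  post_process: (tokens) => ({
    tokens: ['<|begin_of_text|>', ...tokens],
    token_type_ids: new Array(tokens.length + 1).fill(0),
  }),
};

const seq = new PostProcessorSequence([prependBos]);
console.log(seq.post_process(['Hello', 'Ġworld']).tokens);
// → [ '<|begin_of_text|>', 'Hello', 'Ġworld' ]
```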

commented

Sorry, I don't plan to work on this issue; I was just reporting one I ran into.

No worries! It's super simple, so I'll add it soon. Thanks again for reporting!

@xenova I am also encountering this issue. I was going to take a pass at it, but I don't understand the internals well enough to see how to meaningfully accumulate the token_type_ids generated by the post-processors (ref).

@xenova Any thoughts on this? This is preventing loading llama 3 8b, which is a bummer.

I added support for it in #771. See here for example usage.

@xenova 💯 thank you, that is awesome. What cadence are you doing releases on?

Will most likely do one tomorrow. Just finalizing a few other things. :)