xenova / transformers.js

State-of-the-art Machine Learning for the web. Run 🤗 Transformers directly in your browser, with no need for a server!

Home Page: https://huggingface.co/docs/transformers.js


Unknown PostProcessor type: Sequence

zcbenz opened this issue · comments

commented

System Info

Using Node.js 20 with transformers.js 2.17.1.

Environment/Platform

  • Website/web-app
  • Browser extension
  • Server-side (e.g., Node.js, Deno, Bun)
  • Desktop app (e.g., Electron)
  • Other (e.g., VSCode extension)

Description

It seems that the following post-processor in tokenizer.json is not supported:

  "post_processor": {
    "type": "Sequence",
    "processors": [
      {
        "type": "ByteLevel",
        "add_prefix_space": true,
        "trim_offsets": false,
        "use_regex": true
      },
      {
        "type": "TemplateProcessing",
        "single": [
          {
            "SpecialToken": {
              "id": "<|begin_of_text|>",
              "type_id": 0
            }
          },
          {
            "Sequence": {
              "id": "A",
              "type_id": 0
            }
          }
        ],
        "pair": [
          {
            "SpecialToken": {
              "id": "<|begin_of_text|>",
              "type_id": 0
            }
          },
          {
            "Sequence": {
              "id": "A",
              "type_id": 0
            }
          },
          {
            "SpecialToken": {
              "id": "<|begin_of_text|>",
              "type_id": 1
            }
          },
          {
            "Sequence": {
              "id": "B",
              "type_id": 1
            }
          }
        ],
        "special_tokens": {
          "<|begin_of_text|>": {
            "id": "<|begin_of_text|>",
            "ids": [
              128000
            ],
            "tokens": [
              "<|begin_of_text|>"
            ]
          }
        }
      }
    ]
  },

Reproduction

import { AutoTokenizer } from '@xenova/transformers';

await AutoTokenizer.from_pretrained('yujiepan/llama-3-tiny-random');

Throws error:

                throw new Error(`Unknown PostProcessor type: ${config.type}`);
                      ^

Error: Unknown PostProcessor type: Sequence
    at PostProcessor.fromConfig (file:///private/tmp/tinyllama/node_modules/@xenova/transformers/src/tokenizers.js:1596:23)
    at new PreTrainedTokenizer (file:///private/tmp/tinyllama/node_modules/@xenova/transformers/src/tokenizers.js:2465:45)
    at AutoTokenizer.from_pretrained (file:///private/tmp/tinyllama/node_modules/@xenova/transformers/src/tokenizers.js:4424:16)

Hi there 👋 Thanks for the report!

Luckily, we already support the ByteLevel and TemplateProcessing post-processors, so the only thing needed is to implement the Sequence post-processor.

Similarly, we already support sequences of normalizers, decoders, and pre-tokenizers, and a similar pattern can be adapted for post-processors. Is this something you'd be interested in adding? If so, I'd be happy to review a PR.
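The sequence pattern in those other components amounts to running each child in order and threading the result through. A minimal sketch of what that could look like for post-processors (class and method names here are assumptions for illustration, not the real transformers.js internals):

```javascript
// Illustrative sketch of a Sequence post-processor: run each child
// post-processor in order, threading the token list through.
// Names are assumptions, not the actual transformers.js API.
class PostProcessorSequence {
  constructor(processors) {
    this.processors = processors;
  }
  post_process(tokens) {
    let token_type_ids = tokens.map(() => 0);
    for (const processor of this.processors) {
      const output = processor.post_process(tokens);
      tokens = output.tokens;
      // Keep the most recent type ids a child produced (how to merge
      // these across children is exactly the open question below).
      token_type_ids = output.token_type_ids ?? token_type_ids;
    }
    return { tokens, token_type_ids };
  }
}

// Toy stand-in for the TemplateProcessing step above: prepend BOS.
const prependBos = {
  post_process: (tokens) => ({
    tokens: ['<|begin_of_text|>', ...tokens],
    token_type_ids: new Array(tokens.length + 1).fill(0),
  }),
};

const seq = new PostProcessorSequence([prependBos]);
console.log(seq.post_process(['Hello', 'Ġworld']).tokens);
// → [ '<|begin_of_text|>', 'Hello', 'Ġworld' ]
```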

commented

Sorry, I don't plan to work on this issue; I was just reporting one I ran into.

No worries! It's super simple, so I'll add it soon. Thanks again for reporting!

@xenova I am also encountering this issue. I was going to take a pass at it, but I don't understand the internals well enough to see how to meaningfully accumulate the token_type_ids generated by the post-processors (ref).

@xenova Any thoughts on this? This is preventing loading llama 3 8b, which is a bummer.

I added support for it in #771. See here for example usage.

@xenova 💯 thank you, that is awesome. What cadence are you doing releases on?

Will most likely do one tomorrow. Just finalizing a few other things. :)