Investigate tokenizer behaviour to understand why it differs from the Hugging Face tokenizer
thakkarparth007 opened this issue · comments
If we consider the following prompt, Hugging Face's tokenizer reports 1144 tokens, whereas the 2B model's logs show 1473 tokens and the 6B model's logs show 1222. I downloaded the models from the Google Drive and have not quantized them myself, so I'm not sure what causes this discrepancy.
"# Here are some relevant code fragments from other files of the repo:\n# --------------------------------------------------\n# the below code fragment can be found in:\n# mindflow/core/git/add.py\n# --------------------------------------------------\n# import subprocess\n# from typing import Tuple\n# \n# from mindflow.utils.execute import execute_no_trace\n# \n# \n# def run_add(args: Tuple[str]):\n# \"\"\"\n# Add command.\n# \"\"\"\n# command = [\"git\", \"add\"] + list(args)\n# \n# # Execute the git diff command and retrieve the output as a string\n# execute_no_trace(command)\n# --------------------------------------------------\n# the below code fragment can be found in:\n# mindflow/core/git/pr.py\n# --------------------------------------------------\n# return\n# \n# title, body = title_body_tuple\n# \n# command: List[str] = [\"gh\", \"pr\", \"create\"] + list(args) + [\"--title\", title, \"--body\", body] # type: ignore\n# print(execute_no_trace(command))\n# \n# \n# def create_title_and_body(\n# base_branch, title: Optional[str], body: Optional[str]\n# ) -> Optional[Tuple[str, str]]:\n# settings = Settings()\n# \n# diff_output = run_diff((base_branch,))\n# if not diff_output:\n# diff_output = \"\"\n# \n# title_response: Union[ModelError, str]\n# body_response: Union[ModelError, str]\n# if title is None and body is None:\n# --------------------------------------------------\n# the below code fragment can be found in:\n# mindflow/core/git/pr.py\n# --------------------------------------------------\n# from mindflow.utils.prompts import PR_BODY_PREFIX\n# from mindflow.utils.prompts import PR_TITLE_PREFIX\n# \n# \n# def run_pr(args: Tuple[str], title: Optional[str] = None, body: Optional[str] = None):\n# base_branch = get_flag_value(args, [\"--base\", \"-B\"])\n# \n# if base_branch is None:\n# # Determine the name of the default branch\n# base_branch = (\n# subprocess.check_output([\"git\", \"symbolic-ref\", \"refs/remotes/origin/HEAD\"])\n# .decode()\n# .strip()\n# 
.split(\"/\")[-1]\n# )\n# \n# if not title or not body:\n# title_body_tuple = create_title_and_body(base_branch, title, body)\n# \n# if not title_body_tuple:\n# --------------------------------------------------\n# the below code fragment can be found in:\n# mindflow/core/git/pr.py\n# --------------------------------------------------\n# subprocess.check_output([\"git\", \"symbolic-ref\", \"refs/remotes/origin/HEAD\"])\n# .decode()\n# .strip()\n# .split(\"/\")[-1]\n# )\n# \n# if not title or not body:\n# title_body_tuple = create_title_and_body(base_branch, title, body)\n# \n# if not title_body_tuple:\n# return\n# \n# title, body = title_body_tuple\n# \n# command: List[str] = [\"gh\", \"pr\", \"create\"] + list(args) + [\"--title\", title, \"--body\", body] # type: ignore\n# print(execute_no_trace(command))\n# \n# \n# def create_title_and_body(\n# base_branch, title: Optional[str], body: Optional[str]\n# --------------------------------------------------\n\nfrom typing import Optional, Tuple, List\n\nfrom mindflow.core.git.pr import create_title_and_body\nfrom mindflow.utils.command_parse import get_flag_value\nfrom mindflow.utils.execute import execute_no_trace\n\n\ndef run_mr(\n args: Tuple[str], title: Optional[str] = None, description: Optional[str] = None\n):\n base_branch = get_flag_value(args, [\"--target-branch\", \"-b\"])\n\n if base_branch is None:\n # Determine the name of the default branch\n base_branch = (\n subprocess.check_output([\"git\", \"symbolic-ref\", \"refs/remotes/origin/HEAD\"])"
Thanks for the ticket. I think this could be a bit of a tricky one to debug: the GGML GPT-J tokenizer is implemented from scratch, whereas the Hugging Face CodeGen tokenizer also has a bunch of token-merging (BPE merge) logic which I don't think GGML's tokenizer has (I will try to confirm).
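As a toy illustration of why missing merge logic inflates token counts (this is not the actual GGML or Hugging Face code, and real BPE applies merges in learned rank order rather than greedily):

```python
# Toy byte-pair-merge sketch: a tokenizer without merge rules emits one
# token per character, while one with merge rules collapses known pairs.
# NOTE: real BPE applies merges in learned priority order; this greedy
# version only shows why the two token counts diverge.
def tokenize_no_merges(text):
    return list(text)

def tokenize_with_merges(text, merges):
    tokens = list(text)
    merged = True
    while merged:
        merged = False
        for i in range(len(tokens) - 1):
            if (tokens[i], tokens[i + 1]) in merges:
                tokens[i:i + 2] = [tokens[i] + tokens[i + 1]]
                merged = True
                break
    return tokens

merges = {("d", "e"), ("de", "f")}
print(len(tokenize_no_merges("def")))            # 3 tokens without merges
print(len(tokenize_with_merges("def", merges)))  # 1 token with merges
```

A tokenizer missing (or mis-applying) merge rules would over-segment the input in exactly this way, which is consistent with the inflated counts in the logs above.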
I can't comment on whether this is likely to significantly impact the performance of the model; that would need to be tested empirically.
Was there a specific use case you have in mind that this is blocking?
Hey, yeah, I was planning to use this for benchmarking the 4-bit performance of CodeGen models. Most of my prompts are 1500 tokens or more, and they overflow the 2048-token context when tokenized incorrectly. I guess one way to get around this is to accept pretokenized inputs.
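The overflow described here is simple context-window arithmetic against the 2048-token limit; a quick sketch using the counts from this thread (the generation budgets are made-up example figures):

```python
# Context-window check: the prompt plus the requested completion must fit
# within the model's context (2048 tokens for these CodeGen builds).
def fits_in_context(n_prompt_tokens, n_generate, context_size=2048):
    return n_prompt_tokens + n_generate <= context_size

# With the Hugging Face count (1144) there is room to generate,
# but the inflated 2B-log count (1473) can push past the limit.
print(fits_in_context(1144, 500))  # True
print(fits_in_context(1473, 600))  # False
```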
Ah OK, that makes sense, thanks for clarifying. I will look into the tokenizer behaviour properly, probably over the weekend, but in the meantime I will see if I can add a REST endpoint to the codegen server that accepts an array of token IDs as a JSON list. Then you can pretokenize your input using the Hugging Face tokenizer. I'll keep you posted!
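A minimal sketch of what the server side of such an endpoint might validate; the field name `tokens` and the payload shape are assumptions for illustration, not the actual codegen-server API:

```python
import json

# Hypothetical request parser for a pretokenized-input endpoint: the body
# is expected to be a JSON object with a "tokens" array of integer IDs.
def parse_pretokenized_request(body: bytes) -> list:
    data = json.loads(body)
    tokens = data["tokens"]
    if not isinstance(tokens, list) or not all(isinstance(t, int) for t in tokens):
        raise ValueError("'tokens' must be a JSON array of integer token IDs")
    return tokens

print(parse_pretokenized_request(b'{"tokens": [15496, 995, 0]}'))  # [15496, 995, 0]
```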
Thanks! I just created a PR here to allow pretokenized inputs: ravenscroftj/ggml#2
It seems to work fine for me.
That's really cool, thank you for your contribution; I have accepted the MR. I will leave this ticket open as a reminder to look into the tokenizer behaviour anyway.
Sidenote - I'd be really interested in your evaluation of the 4 bit model if you're willing to share it!
Thanks!
I have performed a preliminary evaluation of the 6B 4-bit model on Python. I ran the model on ~2000 Python code-completion scenarios (a custom dataset of mine) and found roughly a 15% degradation in the first-line exact-match metric. Here's what the graph looks like:
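For context, a first-line exact-match metric of the kind described can be sketched like this (the actual dataset and scoring code aren't shown in this thread):

```python
# Sketch of a first-line exact-match metric: a prediction counts as
# correct only if its first line matches the reference's first line.
def first_line_exact_match(prediction: str, reference: str) -> bool:
    return prediction.splitlines()[:1] == reference.splitlines()[:1]

def exact_match_rate(pairs) -> float:
    return sum(first_line_exact_match(p, r) for p, r in pairs) / len(pairs)

pairs = [
    ("return x + y\nprint(x)", "return x + y"),  # first line matches
    ("return x - y", "return x + y"),            # near miss, penalised
]
print(exact_match_rate(pairs))  # 0.5
```

As the next comment notes, this metric penalises plausible completions that differ superficially from the reference.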
I manually looked at some of the mispredictions and they seemed okay to me, but they were getting penalized because they weren't exact matches. I think one interesting thing to do would be to check how different the probabilities of the 16-bit and 4-bit predictions are.
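One way to quantify that suggestion is to compare the two models' next-token distributions, e.g. via KL divergence over their output logits. A generic sketch (how the logits are obtained from each model is left out, and the logit values below are made up):

```python
import math

def softmax(logits):
    # Numerically stable softmax over a list of logits.
    m = max(logits)
    exps = [math.exp(x - m) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]

def kl_divergence(p, q, eps=1e-12):
    # D_KL(p || q): how much the 4-bit distribution q diverges from
    # the 16-bit reference distribution p. eps guards against log(0).
    return sum(pi * math.log((pi + eps) / (qi + eps)) for pi, qi in zip(p, q))

fp16_logits = [2.0, 1.0, 0.1]  # hypothetical 16-bit next-token logits
q4_logits = [1.8, 1.2, 0.1]    # hypothetical 4-bit next-token logits
print(kl_divergence(softmax(fp16_logits), softmax(q4_logits)))
```

A divergence near zero at positions where the 4-bit model "misses" the exact match would support the intuition that the completions are still reasonable.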