Investigate tokenizer behaviour to understand why it differs from the Hugging Face tokenizer
thakkarparth007 opened this issue · comments
If we consider the following prompt, Hugging Face's tokenizer reports 1144 tokens, whereas the 2B model's logs show 1473 tokens and the 6B model's logs show 1222. I downloaded the models from the Google Drive and have not quantized them myself, so I'm not sure what causes this discrepancy.
"# Here are some relevant code fragments from other files of the repo:\n# --------------------------------------------------\n# the below code fragment can be found in:\n# mindflow/core/git/add.py\n# --------------------------------------------------\n# import subprocess\n# from typing import Tuple\n# \n# from mindflow.utils.execute import execute_no_trace\n# \n# \n# def run_add(args: Tuple[str]):\n# \"\"\"\n# Add command.\n# \"\"\"\n# command = [\"git\", \"add\"] + list(args)\n# \n# # Execute the git diff command and retrieve the output as a string\n# execute_no_trace(command)\n# --------------------------------------------------\n# the below code fragment can be found in:\n# mindflow/core/git/pr.py\n# --------------------------------------------------\n# return\n# \n# title, body = title_body_tuple\n# \n# command: List[str] = [\"gh\", \"pr\", \"create\"] + list(args) + [\"--title\", title, \"--body\", body] # type: ignore\n# print(execute_no_trace(command))\n# \n# \n# def create_title_and_body(\n# base_branch, title: Optional[str], body: Optional[str]\n# ) -> Optional[Tuple[str, str]]:\n# settings = Settings()\n# \n# diff_output = run_diff((base_branch,))\n# if not diff_output:\n# diff_output = \"\"\n# \n# title_response: Union[ModelError, str]\n# body_response: Union[ModelError, str]\n# if title is None and body is None:\n# --------------------------------------------------\n# the below code fragment can be found in:\n# mindflow/core/git/pr.py\n# --------------------------------------------------\n# from mindflow.utils.prompts import PR_BODY_PREFIX\n# from mindflow.utils.prompts import PR_TITLE_PREFIX\n# \n# \n# def run_pr(args: Tuple[str], title: Optional[str] = None, body: Optional[str] = None):\n# base_branch = get_flag_value(args, [\"--base\", \"-B\"])\n# \n# if base_branch is None:\n# # Determine the name of the default branch\n# base_branch = (\n# subprocess.check_output([\"git\", \"symbolic-ref\", \"refs/remotes/origin/HEAD\"])\n# .decode()\n# .strip()\n# 
.split(\"/\")[-1]\n# )\n# \n# if not title or not body:\n# title_body_tuple = create_title_and_body(base_branch, title, body)\n# \n# if not title_body_tuple:\n# --------------------------------------------------\n# the below code fragment can be found in:\n# mindflow/core/git/pr.py\n# --------------------------------------------------\n# subprocess.check_output([\"git\", \"symbolic-ref\", \"refs/remotes/origin/HEAD\"])\n# .decode()\n# .strip()\n# .split(\"/\")[-1]\n# )\n# \n# if not title or not body:\n# title_body_tuple = create_title_and_body(base_branch, title, body)\n# \n# if not title_body_tuple:\n# return\n# \n# title, body = title_body_tuple\n# \n# command: List[str] = [\"gh\", \"pr\", \"create\"] + list(args) + [\"--title\", title, \"--body\", body] # type: ignore\n# print(execute_no_trace(command))\n# \n# \n# def create_title_and_body(\n# base_branch, title: Optional[str], body: Optional[str]\n# --------------------------------------------------\n\nfrom typing import Optional, Tuple, List\n\nfrom mindflow.core.git.pr import create_title_and_body\nfrom mindflow.utils.command_parse import get_flag_value\nfrom mindflow.utils.execute import execute_no_trace\n\n\ndef run_mr(\n args: Tuple[str], title: Optional[str] = None, description: Optional[str] = None\n):\n base_branch = get_flag_value(args, [\"--target-branch\", \"-b\"])\n\n if base_branch is None:\n # Determine the name of the default branch\n base_branch = (\n subprocess.check_output([\"git\", \"symbolic-ref\", \"refs/remotes/origin/HEAD\"])"
Thanks for the ticket. I think this could be a bit of a tricky one to debug: the GGML GPT-J tokenizer is implemented from scratch, whereas the Hugging Face CodeGen tokenizer also has a bunch of token-merging (BPE merge) logic which I don't think GGML's tokenizer has (I will try to confirm).
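As a toy illustration of why missing merge logic inflates token counts (this is not the actual GGML or Hugging Face code, and real BPE applies merges in learned rank order rather than greedily):

```python
# Toy byte-pair-merge sketch: a tokenizer without merge rules emits one
# token per character, while one with merge rules collapses known pairs.
# NOTE: real BPE applies merges in learned priority order; this greedy
# version only shows why the two token counts diverge.
def tokenize_no_merges(text):
    return list(text)

def tokenize_with_merges(text, merges):
    tokens = list(text)
    merged = True
    while merged:
        merged = False
        for i in range(len(tokens) - 1):
            if (tokens[i], tokens[i + 1]) in merges:
                tokens[i:i + 2] = [tokens[i] + tokens[i + 1]]
                merged = True
                break
    return tokens

merges = {("d", "e"), ("de", "f")}
print(len(tokenize_no_merges("def")))            # 3 tokens without merges
print(len(tokenize_with_merges("def", merges)))  # 1 token with merges
```

A tokenizer missing (or mis-applying) merge rules would over-segment the input in exactly this way, which is consistent with the inflated counts in the logs above.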
I can't comment on whether this is likely to significantly impact the performance of the model; that would need to be tested empirically.
Was there a specific use case you have in mind that this is blocking?
Hey, yeah, I was planning to use this for benchmarking the 4-bit performance of CodeGen models. Most of my prompts are 1500 tokens or more, and they overflow the 2048-token context when tokenized incorrectly. I guess one way to get around this is to accept pretokenized inputs.
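The overflow described here is simple context-window arithmetic against the 2048-token limit; a quick sketch using the counts from this thread (the generation budgets are made-up example figures):

```python
# Context-window check: the prompt plus the requested completion must fit
# within the model's context (2048 tokens for these CodeGen builds).
def fits_in_context(n_prompt_tokens, n_generate, context_size=2048):
    return n_prompt_tokens + n_generate <= context_size

# With the Hugging Face count (1144) there is room to generate,
# but the inflated 2B-log count (1473) can push past the limit.
print(fits_in_context(1144, 500))  # True
print(fits_in_context(1473, 600))  # False
```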
Ah OK, that makes sense, thanks for clarifying. I will look into the tokenizer behaviour properly, probably over the weekend, but in the meantime I will see if I can add a REST endpoint to the codegen server that accepts an array of token IDs as a JSON list. Then you can pretokenize your input using the Hugging Face tokenizer. I'll keep you posted!
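A minimal sketch of what the server side of such an endpoint might validate; the field name `tokens` and the payload shape are assumptions for illustration, not the actual codegen-server API:

```python
import json

# Hypothetical request parser for a pretokenized-input endpoint: the body
# is expected to be a JSON object with a "tokens" array of integer IDs.
def parse_pretokenized_request(body: bytes) -> list:
    data = json.loads(body)
    tokens = data["tokens"]
    if not isinstance(tokens, list) or not all(isinstance(t, int) for t in tokens):
        raise ValueError("'tokens' must be a JSON array of integer token IDs")
    return tokens

print(parse_pretokenized_request(b'{"tokens": [15496, 995, 0]}'))  # [15496, 995, 0]
```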
Thanks! I just created a PR here to allow pretokenized inputs: ravenscroftj/ggml#2
It seems to work fine for me.
That's really cool, thank you for your contribution; I have accepted the MR. I will leave this ticket open as a reminder to look into the tokenizer behaviour anyway.
Sidenote - I'd be really interested in your evaluation of the 4 bit model if you're willing to share it!
Thanks!
I have performed a preliminary evaluation of the 6B 4-bit model on Python. I ran the model on ~2000 Python code-completion scenarios (a custom dataset of mine) and found roughly a 15% degradation in the first-line exact-match metric. Here's what the graph looks like:
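For context, a first-line exact-match metric of the kind described can be sketched like this (the actual dataset and scoring code aren't shown in this thread):

```python
# Sketch of a first-line exact-match metric: a prediction counts as
# correct only if its first line matches the reference's first line.
def first_line_exact_match(prediction: str, reference: str) -> bool:
    return prediction.splitlines()[:1] == reference.splitlines()[:1]

def exact_match_rate(pairs) -> float:
    return sum(first_line_exact_match(p, r) for p, r in pairs) / len(pairs)

pairs = [
    ("return x + y\nprint(x)", "return x + y"),  # first line matches
    ("return x - y", "return x + y"),            # near miss, penalised
]
print(exact_match_rate(pairs))  # 0.5
```

As the next comment notes, this metric penalises plausible completions that differ superficially from the reference.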
I manually looked at some of the mispredictions and they seemed okay to me, but they were getting penalized because they weren't exact matches. I think one interesting thing to do would be to check how different the probabilities of the 16-bit and 4-bit predictions are.
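One way to quantify that suggestion is to compare the two models' next-token distributions, e.g. via KL divergence over their output logits. A generic sketch (how the logits are obtained from each model is left out, and the logit values below are made up):

```python
import math

def softmax(logits):
    # Numerically stable softmax over a list of logits.
    m = max(logits)
    exps = [math.exp(x - m) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]

def kl_divergence(p, q, eps=1e-12):
    # D_KL(p || q): how much the 4-bit distribution q diverges from
    # the 16-bit reference distribution p. eps guards against log(0).
    return sum(pi * math.log((pi + eps) / (qi + eps)) for pi, qi in zip(p, q))

fp16_logits = [2.0, 1.0, 0.1]  # hypothetical 16-bit next-token logits
q4_logits = [1.8, 1.2, 0.1]    # hypothetical 4-bit next-token logits
print(kl_divergence(softmax(fp16_logits), softmax(q4_logits)))
```

A divergence near zero at positions where the 4-bit model "misses" the exact match would support the intuition that the completions are still reasonable.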