karpathy / minGPT

A minimal PyTorch re-implementation of the OpenAI GPT (Generative Pretrained Transformer) training

Should -1 marker (as special token) be counted in vocab_size?

mw66 opened this issue · comments

commented

y[:ndigit*2-1] = -1 # we will only train in the output locations. -1 will mask loss to zero

return 10 # digits 0..9
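For context, the -1 never enters the vocabulary; it is consumed by the loss, which minGPT computes with ignore_index=-1 in F.cross_entropy. A minimal standalone sketch of that masking behavior (shapes and values here are illustrative, not taken from the repo):

import torch
import torch.nn.functional as F

# 4 prediction positions over a 10-token digit vocabulary
logits = torch.randn(4, 10)
# positions holding -1 contribute nothing to the loss, so -1 never needs
# its own embedding row and is not counted in vocab_size
targets = torch.tensor([-1, -1, 3, 7])

loss = F.cross_entropy(logits, targets, ignore_index=-1)
print(loss)  # averaged only over the two unmasked positions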

To my understanding, we don't add negative values to the tokenizer; we just extend the vocab, like this:

import tiktoken

# gpt-2 encodings
print("loading GPT-2 encodings...")
enc = tiktoken.get_encoding("gpt2")
# note: these names must be registered as special tokens on the encoding
# for this to work; otherwise they are BPE-split as plain text
encode = lambda s: enc.encode(s, allowed_special={"<endOfText>", "<bot>", "<human>", "<system>"})
decode = lambda l: enc.decode(l)

This just adds 4 extra special tokens on top of the already ~50,000-token GPT-2 vocab (the registration step is sketched below).
You probably could have a negative tokenizer value (a -1 token), but you would have to customize tiktoken for that, and adding negative values to the tokenizer means you now have to account for a larger fixed-size integer range, which I think would make it slower.
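A minimal sketch of how those 4 tokens could actually be registered, following the Encoding-extension pattern from tiktoken's README; the encoding name and the ids 50257..50260 are illustrative assumptions:

import tiktoken

base = tiktoken.get_encoding("gpt2")

# GPT-2's vocab ends at id 50256 ("<|endoftext|>"), so the new special
# tokens are assigned the next four ids (assumed values)
enc = tiktoken.Encoding(
    name="gpt2_chat",  # hypothetical name for the extended encoding
    pat_str=base._pat_str,
    mergeable_ranks=base._mergeable_ranks,
    special_tokens={
        **base._special_tokens,
        "<endOfText>": 50257,
        "<bot>": 50258,
        "<human>": 50259,
        "<system>": 50260,
    },
)

# with the tokens registered, allowed_special maps each marker to one id
ids = enc.encode("<human> hi <bot>", allowed_special={"<human>", "<bot>"})

Note that a GPT-2 sized model would also need its embedding and output head grown from 50257 to 50261 rows to cover the new ids.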

tl;dr: it's possible, but people don't really need negative tokens; it's just extra work and slower.