allenai / OLMo-Eval

Evaluation suite for LLMs


OLMo's results seem very bad, and there's a weight initialization issue

Ivan-Zhou opened this issue

🐛 Describe the bug

Hi! Thanks for setting this repo up :)

I ran olmo_eval with allenai/OLMo-1B on the paloma dataset and I noticed two issues:

  1. The evaluation metrics seem worse than I anticipated, especially `ppl_token`, which looks far too high. I wonder whether that metric is averaged or summed over tokens. I've added a screenshot below, and you can see my full results in this Google Sheet.
[screenshot of results]
  2. This may be related to 1: I noticed that when it initialized allenai/OLMo-1B, there was a warning that many of the weights (if not all) were newly initialized rather than loaded from the checkpoint. From the log below, it seems to be initializing the model with the class OlmoForCausalLM.
Some weights of OlmoForCausalLM were not initialized from the model checkpoint at allenai/OLMo-1B and are newly initialized: ['lm_head.weight', 'model.embed_tokens.weight', 'model.layers.0.mlp.down_proj.weight', 'model.layers.0.mlp.gate_proj.weight', 'model.layers.0.mlp.up_proj.weight', 'model.layers.0.self_attn.k_proj.weight', 'model.layers.0.self_attn.o_proj.weight', 'model.layers.0.self_attn.q_proj.weight', 'model.layers.0.self_attn.v_proj.weight', 'model.layers.1.mlp.down_proj.weight', 'model.layers.1.mlp.gate_proj.weight', 'model.layers.1.mlp.up_proj.weight', 'model.layers.1.self_attn.k_proj.weight', 'model.layers.1.self_attn.o_proj.weight', 'model.layers.1.self_attn.q_proj.weight', 'model.layers.1.self_attn.v_proj.weight', 'model.layers.10.mlp.down_proj.weight', 'model.layers.10.mlp.gate_proj.weight', 'model.layers.10.mlp.up_proj.weight', 'model.layers.10.self_attn.k_proj.weight', 'model.layers.10.self_attn.o_proj.weight', 'model.layers.10.self_attn.q_proj.weight', 'model.layers.10.self_attn.v_proj.weight', 'model.layers.11.mlp.down_proj.weight', 'model.layers.11.mlp.gate_proj.weight', 'model.layers.11.mlp.up_proj.weight', 'model.layers.11.self_attn.k_proj.weight', 'model.layers.11.self_attn.o_proj.weight', 'model.layers.11.self_attn.q_proj.weight', 'model.layers.11.self_attn.v_proj.weight', 'model.layers.12.mlp.down_proj.weight', 'model.layers.12.mlp.gate_proj.weight', 'model.layers.12.mlp.up_proj.weight', 'model.layers.12.self_attn.k_proj.weight', 'model.layers.12.self_attn.o_proj.weight', 'model.layers.12.self_attn.q_proj.weight', 'model.layers.12.self_attn.v_proj.weight', 'model.layers.13.mlp.down_proj.weight', 'model.layers.13.mlp.gate_proj.weight', 'model.layers.13.mlp.up_proj.weight', 'model.layers.13.self_attn.k_proj.weight', 'model.layers.13.self_attn.o_proj.weight', 'model.layers.13.self_attn.q_proj.weight', 'model.layers.13.self_attn.v_proj.weight', 'model.layers.14.mlp.down_proj.weight', 'model.layers.14.mlp.gate_proj.weight', 
'model.layers.14.mlp.up_proj.weight', 'model.layers.14.self_attn.k_proj.weight', 'model.layers.14.self_attn.o_proj.weight', 'model.layers.14.self_attn.q_proj.weight', 'model.layers.14.self_attn.v_proj.weight', 'model.layers.15.mlp.down_proj.weight', 'model.layers.15.mlp.gate_proj.weight', 'model.layers.15.mlp.up_proj.weight', 'model.layers.15.self_attn.k_proj.weight', 'model.layers.15.self_attn.o_proj.weight', 'model.layers.15.self_attn.q_proj.weight', 'model.layers.15.self_attn.v_proj.weight', 'model.layers.16.mlp.down_proj.weight', 'model.layers.16.mlp.gate_proj.weight', 'model.layers.16.mlp.up_proj.weight', 'model.layers.16.self_attn.k_proj.weight', 'model.layers.16.self_attn.o_proj.weight', 'model.layers.16.self_attn.q_proj.weight', 'model.layers.16.self_attn.v_proj.weight', 'model.layers.17.mlp.down_proj.weight', 'model.layers.17.mlp.gate_proj.weight', 'model.layers.17.mlp.up_proj.weight', 'model.layers.17.self_attn.k_proj.weight', 'model.layers.17.self_attn.o_proj.weight', 'model.layers.17.self_attn.q_proj.weight', 'model.layers.17.self_attn.v_proj.weight', 'model.layers.18.mlp.down_proj.weight', 'model.layers.18.mlp.gate_proj.weight', 'model.layers.18.mlp.up_proj.weight', 'model.layers.18.self_attn.k_proj.weight', 'model.layers.18.self_attn.o_proj.weight', 'model.layers.18.self_attn.q_proj.weight', 'model.layers.18.self_attn.v_proj.weight', 'model.layers.19.mlp.down_proj.weight', 'model.layers.19.mlp.gate_proj.weight', 'model.layers.19.mlp.up_proj.weight', 'model.layers.19.self_attn.k_proj.weight', 'model.layers.19.self_attn.o_proj.weight', 'model.layers.19.self_attn.q_proj.weight', 'model.layers.19.self_attn.v_proj.weight', 'model.layers.2.mlp.down_proj.weight', 'model.layers.2.mlp.gate_proj.weight', 'model.layers.2.mlp.up_proj.weight', 'model.layers.2.self_attn.k_proj.weight', 'model.layers.2.self_attn.o_proj.weight', 'model.layers.2.self_attn.q_proj.weight', 'model.layers.2.self_attn.v_proj.weight', 'model.layers.20.mlp.down_proj.weight', 
'model.layers.20.mlp.gate_proj.weight', 'model.layers.20.mlp.up_proj.weight', 'model.layers.20.self_attn.k_proj.weight', 'model.layers.20.self_attn.o_proj.weight', 'model.layers.20.self_attn.q_proj.weight', 'model.layers.20.self_attn.v_proj.weight', 'model.layers.21.mlp.down_proj.weight', 'model.layers.21.mlp.gate_proj.weight', 'model.layers.21.mlp.up_proj.weight', 'model.layers.21.self_attn.k_proj.weight', 'model.layers.21.self_attn.o_proj.weight', 'model.layers.21.self_attn.q_proj.weight', 'model.layers.21.self_attn.v_proj.weight', 'model.layers.22.mlp.down_proj.weight', 'model.layers.22.mlp.gate_proj.weight', 'model.layers.22.mlp.up_proj.weight', 'model.layers.22.self_attn.k_proj.weight', 'model.layers.22.self_attn.o_proj.weight', 'model.layers.22.self_attn.q_proj.weight', 'model.layers.22.self_attn.v_proj.weight', 'model.layers.23.mlp.down_proj.weight', 'model.layers.23.mlp.gate_proj.weight', 'model.layers.23.mlp.up_proj.weight', 'model.layers.23.self_attn.k_proj.weight', 'model.layers.23.self_attn.o_proj.weight', 'model.layers.23.self_attn.q_proj.weight', 'model.layers.23.self_attn.v_proj.weight', 'model.layers.24.mlp.down_proj.weight', 'model.layers.24.mlp.gate_proj.weight', 'model.layers.24.mlp.up_proj.weight', 'model.layers.24.self_attn.k_proj.weight', 'model.layers.24.self_attn.o_proj.weight', 'model.layers.24.self_attn.q_proj.weight', 'model.layers.24.self_attn.v_proj.weight', 'model.layers.25.mlp.down_proj.weight', 'model.layers.25.mlp.gate_proj.weight', 'model.layers.25.mlp.up_proj.weight', 'model.layers.25.self_attn.k_proj.weight', 'model.layers.25.self_attn.o_proj.weight', 'model.layers.25.self_attn.q_proj.weight', 'model.layers.25.self_attn.v_proj.weight', 'model.layers.26.mlp.down_proj.weight', 'model.layers.26.mlp.gate_proj.weight', 'model.layers.26.mlp.up_proj.weight', 'model.layers.26.self_attn.k_proj.weight', 'model.layers.26.self_attn.o_proj.weight', 'model.layers.26.self_attn.q_proj.weight', 'model.layers.26.self_attn.v_proj.weight', 
'model.layers.27.mlp.down_proj.weight', 'model.layers.27.mlp.gate_proj.weight', 'model.layers.27.mlp.up_proj.weight', 'model.layers.27.self_attn.k_proj.weight', 'model.layers.27.self_attn.o_proj.weight', 'model.layers.27.self_attn.q_proj.weight', 'model.layers.27.self_attn.v_proj.weight', 'model.layers.28.mlp.down_proj.weight', 'model.layers.28.mlp.gate_proj.weight', 'model.layers.28.mlp.up_proj.weight', 'model.layers.28.self_attn.k_proj.weight', 'model.layers.28.self_attn.o_proj.weight', 'model.layers.28.self_attn.q_proj.weight', 'model.layers.28.self_attn.v_proj.weight', 'model.layers.29.mlp.down_proj.weight', 'model.layers.29.mlp.gate_proj.weight', 'model.layers.29.mlp.up_proj.weight', 'model.layers.29.self_attn.k_proj.weight', 'model.layers.29.self_attn.o_proj.weight', 'model.layers.29.self_attn.q_proj.weight', 'model.layers.29.self_attn.v_proj.weight', 'model.layers.3.mlp.down_proj.weight', 'model.layers.3.mlp.gate_proj.weight', 'model.layers.3.mlp.up_proj.weight', 'model.layers.3.self_attn.k_proj.weight', 'model.layers.3.self_attn.o_proj.weight', 'model.layers.3.self_attn.q_proj.weight', 'model.layers.3.self_attn.v_proj.weight', 'model.layers.30.mlp.down_proj.weight', 'model.layers.30.mlp.gate_proj.weight', 'model.layers.30.mlp.up_proj.weight', 'model.layers.30.self_attn.k_proj.weight', 'model.layers.30.self_attn.o_proj.weight', 'model.layers.30.self_attn.q_proj.weight', 'model.layers.30.self_attn.v_proj.weight', 'model.layers.31.mlp.down_proj.weight', 'model.layers.31.mlp.gate_proj.weight', 'model.layers.31.mlp.up_proj.weight', 'model.layers.31.self_attn.k_proj.weight', 'model.layers.31.self_attn.o_proj.weight', 'model.layers.31.self_attn.q_proj.weight', 'model.layers.31.self_attn.v_proj.weight', 'model.layers.4.mlp.down_proj.weight', 'model.layers.4.mlp.gate_proj.weight', 'model.layers.4.mlp.up_proj.weight', 'model.layers.4.self_attn.k_proj.weight', 'model.layers.4.self_attn.o_proj.weight', 'model.layers.4.self_attn.q_proj.weight', 
'model.layers.4.self_attn.v_proj.weight', 'model.layers.5.mlp.down_proj.weight', 'model.layers.5.mlp.gate_proj.weight', 'model.layers.5.mlp.up_proj.weight', 'model.layers.5.self_attn.k_proj.weight', 'model.layers.5.self_attn.o_proj.weight', 'model.layers.5.self_attn.q_proj.weight', 'model.layers.5.self_attn.v_proj.weight', 'model.layers.6.mlp.down_proj.weight', 'model.layers.6.mlp.gate_proj.weight', 'model.layers.6.mlp.up_proj.weight', 'model.layers.6.self_attn.k_proj.weight', 'model.layers.6.self_attn.o_proj.weight', 'model.layers.6.self_attn.q_proj.weight', 'model.layers.6.self_attn.v_proj.weight', 'model.layers.7.mlp.down_proj.weight', 'model.layers.7.mlp.gate_proj.weight', 'model.layers.7.mlp.up_proj.weight', 'model.layers.7.self_attn.k_proj.weight', 'model.layers.7.self_attn.o_proj.weight', 'model.layers.7.self_attn.q_proj.weight', 'model.layers.7.self_attn.v_proj.weight', 'model.layers.8.mlp.down_proj.weight', 'model.layers.8.mlp.gate_proj.weight', 'model.layers.8.mlp.up_proj.weight', 'model.layers.8.self_attn.k_proj.weight', 'model.layers.8.self_attn.o_proj.weight', 'model.layers.8.self_attn.q_proj.weight', 'model.layers.8.self_attn.v_proj.weight', 'model.layers.9.mlp.down_proj.weight', 'model.layers.9.mlp.gate_proj.weight', 'model.layers.9.mlp.up_proj.weight', 'model.layers.9.self_attn.k_proj.weight', 'model.layers.9.self_attn.o_proj.weight', 'model.layers.9.self_attn.q_proj.weight', 'model.layers.9.self_attn.v_proj.weight']
You should probably TRAIN this model on a down-stream task to be able to use it for predictions and inference.
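For context on the first point: per-token perplexity is conventionally the exponential of the *mean* negative log-likelihood over tokens, so if the NLLs were summed instead of averaged before exponentiating, the number would blow up exponentially with sequence length. A minimal sketch of the difference (the NLL values below are made up for illustration):

```python
import math

def perplexity(token_nlls, averaged=True):
    """Perplexity from per-token negative log-likelihoods (in nats).

    Conventionally the NLLs are averaged before exponentiating;
    summing them instead inflates the result exponentially.
    """
    agg = sum(token_nlls)
    if averaged:
        agg /= len(token_nlls)
    return math.exp(agg)

nlls = [2.3, 2.0, 2.6, 2.1]  # hypothetical per-token NLLs for one sequence
print(round(perplexity(nlls), 2))                   # exp of the mean: ~9.49
print(round(perplexity(nlls, averaged=False), 2))   # exp of the sum: ~8103.08
```

Numbers in the tens of thousands would therefore point at a summed (or otherwise un-normalized) aggregation rather than a genuinely bad model.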

To help reproduce the issue:

Here's my config in jsonnet:

/*--------------------------------------- Configurations -----------------------------------------*/

local utils = import 'utils.libsonnet';

//❗ To run this config you will need to first set up the data following the instructions in ai2-llm-eval/eval_data/README.md
//❗ Also note that this will run validation results. Change to paloma_hf_release_test.libsonnet to run test results.
local ppl_suite = import 'task_sets/paloma_hf_release_val.libsonnet';


//❗Set gsheet to the name of your google sheet.
// Set it to null if you do not want your results to be uploaded to a google sheet (they will still be saved as an object).
local gsheet = null;
//❗Set output_dir to a directory where you want to save outputs as jsonl.gz files.
// Set it to null if you do not want your results saved as jsonl.gz files.
local output_dir = "paloma_olmo";

local models = [
    {
        model_path: "allenai/OLMo-1B",
        gpus_needed: 1,
        prediction_kwargs: {
            model_max_length: 2048,
            max_batch_tokens: 20480,
            limit: null,
        }
    }
];
local task_sets = [
    ppl_suite.task_set
];


{
    steps: utils.create_fine_grained_pipeline(models, task_sets, gsheet, output_dir)
}

Versions

Python 3.10.14
absl-py==2.1.0
accelerate==0.30.1
ai2-catwalk==1.0.0rc0
ai2-olmo==0.3.0
-e git+ssh://git@github.com/allenai/OLMo-Eval.git@51ad8f9a95e2eabbddc04eda5e23aa16fb64138e#egg=ai2_olmo_eval
ai2-tango==1.3.2
aiohttp==3.9.5
aiosignal==1.3.1
annotated-types==0.6.0
antlr4-python3-runtime==4.9.3
appdirs==1.4.4
async-timeout==4.0.3
attrs==23.2.0
base58==2.1.1
beaker-py==1.26.12
bert-score==0.3.13
bettermap==1.3.1
blis==0.7.11
boto3==1.34.104
botocore==1.34.104
cached_path==1.6.2
cachetools==5.3.3
catalogue==2.0.10
certifi==2024.2.2
charset-normalizer==3.3.2
click==8.1.3
click-help-colors==0.9.4
cloudpathlib==0.16.0
colorama==0.4.6
confection==0.1.4
contourpy==1.2.1
cycler==0.12.1
cymem==2.0.8
datasets==2.19.1
decorator==5.1.1
dill==0.3.8
docker==6.1.3
docker-pycreds==0.4.0
exceptiongroup==1.2.1
fairscale==0.4.13
filelock==3.13.4
fonttools==4.51.0
frozenlist==1.4.1
fsspec==2024.3.1
gitdb==4.0.11
GitPython==3.1.43
glob2==0.7
google-api-core==2.19.0
google-api-python-client==2.129.0
google-auth==2.29.0
google-auth-httplib2==0.2.0
google-auth-oauthlib==1.2.0
google-cloud-core==2.4.1
google-cloud-datastore==2.19.0
google-cloud-storage==2.16.0
google-crc32c==1.5.0
google-resumable-media==2.7.0
googleapis-common-protos==1.63.0
grpcio==1.63.0
grpcio-status==1.48.2
httplib2==0.22.0
huggingface-hub==0.21.4
idna==3.7
iniconfig==2.0.0
iso639==0.1.4
Jinja2==3.1.4
jmespath==1.0.1
joblib==1.4.2
kiwisolver==1.4.5
langcodes==3.4.0
language_data==1.2.0
lxml==5.2.2
marisa-trie==1.1.1
markdown-it-py==3.0.0
MarkupSafe==2.1.5
matplotlib==3.8.4
mdurl==0.1.2
more-itertools==9.1.0
mpmath==1.3.0
multidict==6.0.5
multiprocess==0.70.16
murmurhash==1.0.10
mypy-extensions==1.0.0
networkx==3.3
nltk==3.8.1
numpy==1.26.4
oauthlib==3.2.2
omegaconf==2.3.0
packaging==24.0
pandas==2.2.2
pathtools==0.1.2
petname==2.6
pillow==10.3.0
pluggy==1.5.0
portalocker==2.8.2
preshed==3.0.9
proto-plus==1.23.0
protobuf==3.19.6
psutil==5.9.8
py==1.11.0
pyarrow==16.0.0
pyarrow-hotfix==0.6
pyasn1==0.6.0
pyasn1_modules==0.4.0
pydantic==2.7.1
pydantic_core==2.18.2
Pygments==2.18.0
pygsheets==2.0.6
pyparsing==3.1.2
pytest==8.2.0
python-dateutil==2.9.0.post0
pytz==2024.1
PyYAML==6.0.1
regex==2024.5.10
requests==2.31.0
requests-oauthlib==2.0.0
retry==0.9.2
rich==13.7.1
rjsonnet==0.5.4
rouge_score==0.1.2
rsa==4.9
s3transfer==0.10.1
sacrebleu==2.4.2
sacremoses==0.1.1
safetensors==0.4.3
scikit-learn==1.4.2
scipy==1.13.0
sentencepiece==0.1.98
sentry-sdk==2.1.1
setproctitle==1.3.3
six==1.16.0
smart-open==6.4.0
smmap==5.0.1
spacy==3.7.4
spacy-legacy==3.0.12
spacy-loggers==1.0.5
sqlitedict==2.1.0
srsly==2.4.8
sympy==1.12
tabulate==0.9.0
thinc==8.2.3
threadpoolctl==3.5.0
tiktoken==0.7.0
tokenizers==0.19.1
tomli==2.0.1
torch==2.0.1
torchmetrics==0.11.1
tqdm==4.66.4
transformers==4.40.2
typer==0.9.4
typing_extensions==4.11.0
tzdata==2024.1
uritemplate==4.1.1
urllib3==2.2.1
wandb==0.14.2
wasabi==1.1.2
weasel==0.3.4
websocket-client==1.8.0
wget==3.2
xxhash==3.4.1
yarl==1.9.4

As a follow-up, I confirmed that the issue occurs when loading allenai/OLMo-1B with AutoModelForCausalLM.from_pretrained:

>>> from hf_olmo import OLMoForCausalLM  # from the ai2-olmo package
>>> from transformers import AutoModelForCausalLM
>>> olmo = OLMoForCausalLM.from_pretrained("allenai/OLMo-1B")
>>> olmo_hf = AutoModelForCausalLM.from_pretrained("allenai/OLMo-1B-hf")
>>> olmo_auto = AutoModelForCausalLM.from_pretrained("allenai/OLMo-1B")
Some weights of OlmoForCausalLM were not initialized from the model checkpoint at allenai/OLMo-1B and are newly initialized: ['lm_head.weight', 'model.embed_tokens.weight', 'model.layers.0.mlp.down_proj.weight', 'model.layers.0.mlp.gate_proj.weight', 'model.layers.0.mlp.up_proj.weight', 'model.layers.0.self_attn.k_proj.weight', 'model.layers.0.self_attn.o_proj.weight', 'model.layers.0.self_attn.q_proj.weight', 'model.layers.0.self_attn.v_proj.weight', 'model.layer ......
You should probably TRAIN this model on a down-stream task to be able to use it for predictions and inference.

OLMo-Eval (here) constructs the model through Catwalk's implementation, which uses AutoModelForCausalLM, and that path is not compatible with allenai/OLMo-1B. Instead, I should probably use allenai/OLMo-1B-hf.

Yes, you should use allenai/OLMo-1B-hf. allenai/OLMo-1B does not support AutoModelForCausalLM (more info about the types of OLMo checkpoints here: https://github.com/allenai/OLMo/blob/main/docs/Checkpoints.md).
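For anyone landing here later, the two loading paths look roughly like this (a sketch, assuming ai2-olmo is installed so that the hf_olmo module is importable; both calls download the checkpoints on first use):

```python
>>> # HF-format checkpoint: works with AutoModelForCausalLM directly.
>>> from transformers import AutoModelForCausalLM
>>> olmo_hf = AutoModelForCausalLM.from_pretrained("allenai/OLMo-1B-hf")
>>> # Original-format checkpoint: load through the hf_olmo integration instead.
>>> from hf_olmo import OLMoForCausalLM
>>> olmo = OLMoForCausalLM.from_pretrained("allenai/OLMo-1B")
```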