allenai / OLMo-Eval

Evaluation suite for LLMs


OLMo's results seem very bad, and there's a weight initialization issue

Ivan-Zhou opened this issue

🐛 Describe the bug

Hi! Thanks for setting this repo up :)

I ran olmo_eval with allenai/OLMo-1B on the paloma dataset and I noticed two issues:

  1. The evaluation metrics seem worse than I anticipated, especially `ppl_token`, which looks far too high. I wonder whether that metric is averaged or summed over tokens. I've added a screenshot below, and you can see my full results in this Google Sheet.
[screenshot of results]
  2. This may be related to 1: I noticed that when it initialized allenai/OLMo-1B, there was a warning that many of the weights (if not all) were newly initialized rather than loaded from the checkpoint. From the log below, it seems to be initializing the model with the class OlmoForCausalLM.
Some weights of OlmoForCausalLM were not initialized from the model checkpoint at allenai/OLMo-1B and are newly initialized: ['lm_head.weight', 'model.embed_tokens.weight', 'model.layers.0.mlp.down_proj.weight', 'model.layers.0.mlp.gate_proj.weight', 'model.layers.0.mlp.up_proj.weight', 'model.layers.0.self_attn.k_proj.weight', 'model.layers.0.self_attn.o_proj.weight', 'model.layers.0.self_attn.q_proj.weight', 'model.layers.0.self_attn.v_proj.weight', 'model.layers.1.mlp.down_proj.weight', 'model.layers.1.mlp.gate_proj.weight', 'model.layers.1.mlp.up_proj.weight', 'model.layers.1.self_attn.k_proj.weight', 'model.layers.1.self_attn.o_proj.weight', 'model.layers.1.self_attn.q_proj.weight', 'model.layers.1.self_attn.v_proj.weight', 'model.layers.10.mlp.down_proj.weight', 'model.layers.10.mlp.gate_proj.weight', 'model.layers.10.mlp.up_proj.weight', 'model.layers.10.self_attn.k_proj.weight', 'model.layers.10.self_attn.o_proj.weight', 'model.layers.10.self_attn.q_proj.weight', 'model.layers.10.self_attn.v_proj.weight', 'model.layers.11.mlp.down_proj.weight', 'model.layers.11.mlp.gate_proj.weight', 'model.layers.11.mlp.up_proj.weight', 'model.layers.11.self_attn.k_proj.weight', 'model.layers.11.self_attn.o_proj.weight', 'model.layers.11.self_attn.q_proj.weight', 'model.layers.11.self_attn.v_proj.weight', 'model.layers.12.mlp.down_proj.weight', 'model.layers.12.mlp.gate_proj.weight', 'model.layers.12.mlp.up_proj.weight', 'model.layers.12.self_attn.k_proj.weight', 'model.layers.12.self_attn.o_proj.weight', 'model.layers.12.self_attn.q_proj.weight', 'model.layers.12.self_attn.v_proj.weight', 'model.layers.13.mlp.down_proj.weight', 'model.layers.13.mlp.gate_proj.weight', 'model.layers.13.mlp.up_proj.weight', 'model.layers.13.self_attn.k_proj.weight', 'model.layers.13.self_attn.o_proj.weight', 'model.layers.13.self_attn.q_proj.weight', 'model.layers.13.self_attn.v_proj.weight', 'model.layers.14.mlp.down_proj.weight', 'model.layers.14.mlp.gate_proj.weight', 
'model.layers.14.mlp.up_proj.weight', 'model.layers.14.self_attn.k_proj.weight', 'model.layers.14.self_attn.o_proj.weight', 'model.layers.14.self_attn.q_proj.weight', 'model.layers.14.self_attn.v_proj.weight', 'model.layers.15.mlp.down_proj.weight', 'model.layers.15.mlp.gate_proj.weight', 'model.layers.15.mlp.up_proj.weight', 'model.layers.15.self_attn.k_proj.weight', 'model.layers.15.self_attn.o_proj.weight', 'model.layers.15.self_attn.q_proj.weight', 'model.layers.15.self_attn.v_proj.weight', 'model.layers.16.mlp.down_proj.weight', 'model.layers.16.mlp.gate_proj.weight', 'model.layers.16.mlp.up_proj.weight', 'model.layers.16.self_attn.k_proj.weight', 'model.layers.16.self_attn.o_proj.weight', 'model.layers.16.self_attn.q_proj.weight', 'model.layers.16.self_attn.v_proj.weight', 'model.layers.17.mlp.down_proj.weight', 'model.layers.17.mlp.gate_proj.weight', 'model.layers.17.mlp.up_proj.weight', 'model.layers.17.self_attn.k_proj.weight', 'model.layers.17.self_attn.o_proj.weight', 'model.layers.17.self_attn.q_proj.weight', 'model.layers.17.self_attn.v_proj.weight', 'model.layers.18.mlp.down_proj.weight', 'model.layers.18.mlp.gate_proj.weight', 'model.layers.18.mlp.up_proj.weight', 'model.layers.18.self_attn.k_proj.weight', 'model.layers.18.self_attn.o_proj.weight', 'model.layers.18.self_attn.q_proj.weight', 'model.layers.18.self_attn.v_proj.weight', 'model.layers.19.mlp.down_proj.weight', 'model.layers.19.mlp.gate_proj.weight', 'model.layers.19.mlp.up_proj.weight', 'model.layers.19.self_attn.k_proj.weight', 'model.layers.19.self_attn.o_proj.weight', 'model.layers.19.self_attn.q_proj.weight', 'model.layers.19.self_attn.v_proj.weight', 'model.layers.2.mlp.down_proj.weight', 'model.layers.2.mlp.gate_proj.weight', 'model.layers.2.mlp.up_proj.weight', 'model.layers.2.self_attn.k_proj.weight', 'model.layers.2.self_attn.o_proj.weight', 'model.layers.2.self_attn.q_proj.weight', 'model.layers.2.self_attn.v_proj.weight', 'model.layers.20.mlp.down_proj.weight', 
'model.layers.20.mlp.gate_proj.weight', 'model.layers.20.mlp.up_proj.weight', 'model.layers.20.self_attn.k_proj.weight', 'model.layers.20.self_attn.o_proj.weight', 'model.layers.20.self_attn.q_proj.weight', 'model.layers.20.self_attn.v_proj.weight', 'model.layers.21.mlp.down_proj.weight', 'model.layers.21.mlp.gate_proj.weight', 'model.layers.21.mlp.up_proj.weight', 'model.layers.21.self_attn.k_proj.weight', 'model.layers.21.self_attn.o_proj.weight', 'model.layers.21.self_attn.q_proj.weight', 'model.layers.21.self_attn.v_proj.weight', 'model.layers.22.mlp.down_proj.weight', 'model.layers.22.mlp.gate_proj.weight', 'model.layers.22.mlp.up_proj.weight', 'model.layers.22.self_attn.k_proj.weight', 'model.layers.22.self_attn.o_proj.weight', 'model.layers.22.self_attn.q_proj.weight', 'model.layers.22.self_attn.v_proj.weight', 'model.layers.23.mlp.down_proj.weight', 'model.layers.23.mlp.gate_proj.weight', 'model.layers.23.mlp.up_proj.weight', 'model.layers.23.self_attn.k_proj.weight', 'model.layers.23.self_attn.o_proj.weight', 'model.layers.23.self_attn.q_proj.weight', 'model.layers.23.self_attn.v_proj.weight', 'model.layers.24.mlp.down_proj.weight', 'model.layers.24.mlp.gate_proj.weight', 'model.layers.24.mlp.up_proj.weight', 'model.layers.24.self_attn.k_proj.weight', 'model.layers.24.self_attn.o_proj.weight', 'model.layers.24.self_attn.q_proj.weight', 'model.layers.24.self_attn.v_proj.weight', 'model.layers.25.mlp.down_proj.weight', 'model.layers.25.mlp.gate_proj.weight', 'model.layers.25.mlp.up_proj.weight', 'model.layers.25.self_attn.k_proj.weight', 'model.layers.25.self_attn.o_proj.weight', 'model.layers.25.self_attn.q_proj.weight', 'model.layers.25.self_attn.v_proj.weight', 'model.layers.26.mlp.down_proj.weight', 'model.layers.26.mlp.gate_proj.weight', 'model.layers.26.mlp.up_proj.weight', 'model.layers.26.self_attn.k_proj.weight', 'model.layers.26.self_attn.o_proj.weight', 'model.layers.26.self_attn.q_proj.weight', 'model.layers.26.self_attn.v_proj.weight', 
'model.layers.27.mlp.down_proj.weight', 'model.layers.27.mlp.gate_proj.weight', 'model.layers.27.mlp.up_proj.weight', 'model.layers.27.self_attn.k_proj.weight', 'model.layers.27.self_attn.o_proj.weight', 'model.layers.27.self_attn.q_proj.weight', 'model.layers.27.self_attn.v_proj.weight', 'model.layers.28.mlp.down_proj.weight', 'model.layers.28.mlp.gate_proj.weight', 'model.layers.28.mlp.up_proj.weight', 'model.layers.28.self_attn.k_proj.weight', 'model.layers.28.self_attn.o_proj.weight', 'model.layers.28.self_attn.q_proj.weight', 'model.layers.28.self_attn.v_proj.weight', 'model.layers.29.mlp.down_proj.weight', 'model.layers.29.mlp.gate_proj.weight', 'model.layers.29.mlp.up_proj.weight', 'model.layers.29.self_attn.k_proj.weight', 'model.layers.29.self_attn.o_proj.weight', 'model.layers.29.self_attn.q_proj.weight', 'model.layers.29.self_attn.v_proj.weight', 'model.layers.3.mlp.down_proj.weight', 'model.layers.3.mlp.gate_proj.weight', 'model.layers.3.mlp.up_proj.weight', 'model.layers.3.self_attn.k_proj.weight', 'model.layers.3.self_attn.o_proj.weight', 'model.layers.3.self_attn.q_proj.weight', 'model.layers.3.self_attn.v_proj.weight', 'model.layers.30.mlp.down_proj.weight', 'model.layers.30.mlp.gate_proj.weight', 'model.layers.30.mlp.up_proj.weight', 'model.layers.30.self_attn.k_proj.weight', 'model.layers.30.self_attn.o_proj.weight', 'model.layers.30.self_attn.q_proj.weight', 'model.layers.30.self_attn.v_proj.weight', 'model.layers.31.mlp.down_proj.weight', 'model.layers.31.mlp.gate_proj.weight', 'model.layers.31.mlp.up_proj.weight', 'model.layers.31.self_attn.k_proj.weight', 'model.layers.31.self_attn.o_proj.weight', 'model.layers.31.self_attn.q_proj.weight', 'model.layers.31.self_attn.v_proj.weight', 'model.layers.4.mlp.down_proj.weight', 'model.layers.4.mlp.gate_proj.weight', 'model.layers.4.mlp.up_proj.weight', 'model.layers.4.self_attn.k_proj.weight', 'model.layers.4.self_attn.o_proj.weight', 'model.layers.4.self_attn.q_proj.weight', 
'model.layers.4.self_attn.v_proj.weight', 'model.layers.5.mlp.down_proj.weight', 'model.layers.5.mlp.gate_proj.weight', 'model.layers.5.mlp.up_proj.weight', 'model.layers.5.self_attn.k_proj.weight', 'model.layers.5.self_attn.o_proj.weight', 'model.layers.5.self_attn.q_proj.weight', 'model.layers.5.self_attn.v_proj.weight', 'model.layers.6.mlp.down_proj.weight', 'model.layers.6.mlp.gate_proj.weight', 'model.layers.6.mlp.up_proj.weight', 'model.layers.6.self_attn.k_proj.weight', 'model.layers.6.self_attn.o_proj.weight', 'model.layers.6.self_attn.q_proj.weight', 'model.layers.6.self_attn.v_proj.weight', 'model.layers.7.mlp.down_proj.weight', 'model.layers.7.mlp.gate_proj.weight', 'model.layers.7.mlp.up_proj.weight', 'model.layers.7.self_attn.k_proj.weight', 'model.layers.7.self_attn.o_proj.weight', 'model.layers.7.self_attn.q_proj.weight', 'model.layers.7.self_attn.v_proj.weight', 'model.layers.8.mlp.down_proj.weight', 'model.layers.8.mlp.gate_proj.weight', 'model.layers.8.mlp.up_proj.weight', 'model.layers.8.self_attn.k_proj.weight', 'model.layers.8.self_attn.o_proj.weight', 'model.layers.8.self_attn.q_proj.weight', 'model.layers.8.self_attn.v_proj.weight', 'model.layers.9.mlp.down_proj.weight', 'model.layers.9.mlp.gate_proj.weight', 'model.layers.9.mlp.up_proj.weight', 'model.layers.9.self_attn.k_proj.weight', 'model.layers.9.self_attn.o_proj.weight', 'model.layers.9.self_attn.q_proj.weight', 'model.layers.9.self_attn.v_proj.weight']
You should probably TRAIN this model on a down-stream task to be able to use it for predictions and inference.
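For context on the first point: per-token perplexity is conventionally the exponential of the *mean* negative log-likelihood over tokens, so if the NLLs were summed instead of averaged before exponentiating, the number would blow up exponentially with sequence length. A minimal sketch of the difference (the NLL values below are made up for illustration):

```python
import math

def perplexity(token_nlls, averaged=True):
    """Perplexity from per-token negative log-likelihoods (in nats).

    Conventionally the NLLs are averaged before exponentiating;
    summing them instead inflates the result exponentially.
    """
    agg = sum(token_nlls)
    if averaged:
        agg /= len(token_nlls)
    return math.exp(agg)

nlls = [2.3, 2.0, 2.6, 2.1]  # hypothetical per-token NLLs for one sequence
print(round(perplexity(nlls), 2))                   # exp of the mean: ~9.49
print(round(perplexity(nlls, averaged=False), 2))   # exp of the sum: ~8103.08
```

Numbers in the tens of thousands would therefore point at a summed (or otherwise un-normalized) aggregation rather than a genuinely bad model.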

To help reproduce the issue:

Here's my config in jsonnet:

/*--------------------------------------- Configurations -----------------------------------------*/

local utils = import 'utils.libsonnet';

//❗ To run this config you will need to first set up the data following the instructions in ai2-llm-eval/eval_data/README.md
//❗ Also note that this will run validation results. Change to paloma_hf_release_test.libsonnet to run test results.
local ppl_suite = import 'task_sets/paloma_hf_release_val.libsonnet';


//❗Set gsheet to the name of your google sheet.
// Set it to null if you do not want your results to be uploaded to a google sheet (they will still be saved as an object).
local gsheet = null;
//❗Set output_dir to a directory where you want to save outputs as jsonl.gz files.
// Set it to null if you do not want your results saved as jsonl.gz files.
local output_dir = "paloma_olmo";

local models = [
    {
        model_path: "allenai/OLMo-1B",
        gpus_needed: 1,
        prediction_kwargs: {
            model_max_length: 2048,
            max_batch_tokens: 20480,
            limit: null,
        }
    }
];
local task_sets = [
    ppl_suite.task_set
];


{
    steps: utils.create_fine_grained_pipeline(models, task_sets, gsheet, output_dir)
}

Versions

Python 3.10.14
absl-py==2.1.0
accelerate==0.30.1
ai2-catwalk==1.0.0rc0
ai2-olmo==0.3.0
-e git+ssh://git@github.com/allenai/OLMo-Eval.git@51ad8f9a95e2eabbddc04eda5e23aa16fb64138e#egg=ai2_olmo_eval
ai2-tango==1.3.2
aiohttp==3.9.5
aiosignal==1.3.1
annotated-types==0.6.0
antlr4-python3-runtime==4.9.3
appdirs==1.4.4
async-timeout==4.0.3
attrs==23.2.0
base58==2.1.1
beaker-py==1.26.12
bert-score==0.3.13
bettermap==1.3.1
blis==0.7.11
boto3==1.34.104
botocore==1.34.104
cached_path==1.6.2
cachetools==5.3.3
catalogue==2.0.10
certifi==2024.2.2
charset-normalizer==3.3.2
click==8.1.3
click-help-colors==0.9.4
cloudpathlib==0.16.0
colorama==0.4.6
confection==0.1.4
contourpy==1.2.1
cycler==0.12.1
cymem==2.0.8
datasets==2.19.1
decorator==5.1.1
dill==0.3.8
docker==6.1.3
docker-pycreds==0.4.0
exceptiongroup==1.2.1
fairscale==0.4.13
filelock==3.13.4
fonttools==4.51.0
frozenlist==1.4.1
fsspec==2024.3.1
gitdb==4.0.11
GitPython==3.1.43
glob2==0.7
google-api-core==2.19.0
google-api-python-client==2.129.0
google-auth==2.29.0
google-auth-httplib2==0.2.0
google-auth-oauthlib==1.2.0
google-cloud-core==2.4.1
google-cloud-datastore==2.19.0
google-cloud-storage==2.16.0
google-crc32c==1.5.0
google-resumable-media==2.7.0
googleapis-common-protos==1.63.0
grpcio==1.63.0
grpcio-status==1.48.2
httplib2==0.22.0
huggingface-hub==0.21.4
idna==3.7
iniconfig==2.0.0
iso639==0.1.4
Jinja2==3.1.4
jmespath==1.0.1
joblib==1.4.2
kiwisolver==1.4.5
langcodes==3.4.0
language_data==1.2.0
lxml==5.2.2
marisa-trie==1.1.1
markdown-it-py==3.0.0
MarkupSafe==2.1.5
matplotlib==3.8.4
mdurl==0.1.2
more-itertools==9.1.0
mpmath==1.3.0
multidict==6.0.5
multiprocess==0.70.16
murmurhash==1.0.10
mypy-extensions==1.0.0
networkx==3.3
nltk==3.8.1
numpy==1.26.4
oauthlib==3.2.2
omegaconf==2.3.0
packaging==24.0
pandas==2.2.2
pathtools==0.1.2
petname==2.6
pillow==10.3.0
pluggy==1.5.0
portalocker==2.8.2
preshed==3.0.9
proto-plus==1.23.0
protobuf==3.19.6
psutil==5.9.8
py==1.11.0
pyarrow==16.0.0
pyarrow-hotfix==0.6
pyasn1==0.6.0
pyasn1_modules==0.4.0
pydantic==2.7.1
pydantic_core==2.18.2
Pygments==2.18.0
pygsheets==2.0.6
pyparsing==3.1.2
pytest==8.2.0
python-dateutil==2.9.0.post0
pytz==2024.1
PyYAML==6.0.1
regex==2024.5.10
requests==2.31.0
requests-oauthlib==2.0.0
retry==0.9.2
rich==13.7.1
rjsonnet==0.5.4
rouge_score==0.1.2
rsa==4.9
s3transfer==0.10.1
sacrebleu==2.4.2
sacremoses==0.1.1
safetensors==0.4.3
scikit-learn==1.4.2
scipy==1.13.0
sentencepiece==0.1.98
sentry-sdk==2.1.1
setproctitle==1.3.3
six==1.16.0
smart-open==6.4.0
smmap==5.0.1
spacy==3.7.4
spacy-legacy==3.0.12
spacy-loggers==1.0.5
sqlitedict==2.1.0
srsly==2.4.8
sympy==1.12
tabulate==0.9.0
thinc==8.2.3
threadpoolctl==3.5.0
tiktoken==0.7.0
tokenizers==0.19.1
tomli==2.0.1
torch==2.0.1
torchmetrics==0.11.1
tqdm==4.66.4
transformers==4.40.2
typer==0.9.4
typing_extensions==4.11.0
tzdata==2024.1
uritemplate==4.1.1
urllib3==2.2.1
wandb==0.14.2
wasabi==1.1.2
weasel==0.3.4
websocket-client==1.8.0
wget==3.2
xxhash==3.4.1
yarl==1.9.4

As a follow-up, I confirmed that the issue occurs when loading allenai/OLMo-1B with AutoModelForCausalLM.from_pretrained:

>>> from hf_olmo import OLMoForCausalLM  # from the ai2-olmo package
>>> from transformers import AutoModelForCausalLM
>>> olmo = OLMoForCausalLM.from_pretrained("allenai/OLMo-1B")
>>> olmo_hf = AutoModelForCausalLM.from_pretrained("allenai/OLMo-1B-hf")
>>> olmo_auto = AutoModelForCausalLM.from_pretrained("allenai/OLMo-1B")
Some weights of OlmoForCausalLM were not initialized from the model checkpoint at allenai/OLMo-1B and are newly initialized: ['lm_head.weight', 'model.embed_tokens.weight', 'model.layers.0.mlp.down_proj.weight', 'model.layers.0.mlp.gate_proj.weight', 'model.layers.0.mlp.up_proj.weight', 'model.layers.0.self_attn.k_proj.weight', 'model.layers.0.self_attn.o_proj.weight', 'model.layers.0.self_attn.q_proj.weight', 'model.layers.0.self_attn.v_proj.weight', 'model.layer ......
You should probably TRAIN this model on a down-stream task to be able to use it for predictions and inference.

OLMo-Eval (here) constructs the model through Catwalk's implementation, which uses AutoModelForCausalLM, and that path is not compatible with allenai/OLMo-1B. Instead, I should probably use allenai/OLMo-1B-hf.

Yes, you should use allenai/OLMo-1B-hf. allenai/OLMo-1B does not support AutoModelForCausalLM (more info about the types of OLMo checkpoints here: https://github.com/allenai/OLMo/blob/main/docs/Checkpoints.md).
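For anyone landing here later, the two loading paths look roughly like this (a sketch, assuming ai2-olmo is installed so that the hf_olmo module is importable; both calls download the checkpoints on first use):

```python
>>> # HF-format checkpoint: works with AutoModelForCausalLM directly.
>>> from transformers import AutoModelForCausalLM
>>> olmo_hf = AutoModelForCausalLM.from_pretrained("allenai/OLMo-1B-hf")
>>> # Original-format checkpoint: load through the hf_olmo integration instead.
>>> from hf_olmo import OLMoForCausalLM
>>> olmo = OLMoForCausalLM.from_pretrained("allenai/OLMo-1B")
```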