mistralai / mistral-inference

Official inference library for Mistral models

Home Page: https://mistral.ai/



Did Mistral-7B-Instruct-v0.2 use Sliding Window Attention (SWA)?

matrixssy opened this issue

I have been fine-tuning Mistral-7B-Instruct-v0.2 recently, and I noticed that when I train without SWA at a sequence length of 32K, the initial loss is unusually high (6.0). However, when I train at a sequence length of 4096, the loss is normal (1.5). This leads me to suspect that Mistral-7B-Instruct-v0.2 might actually have been trained with a 4096-token sliding window, rather than without SWA as officially stated.
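
For reference, here is a minimal sketch (not from the issue) of how one might check what the released checkpoint declares about its attention window. It assumes the Hugging Face `transformers` package and the public model ID `mistralai/Mistral-7B-Instruct-v0.2`; in `MistralConfig`, a `sliding_window` of `None`/`null` indicates full attention, while an integer (e.g. 4096) indicates SWA with that window size.

```python
# Sketch: inspect the checkpoint's declared attention configuration.
# Assumes the `transformers` library is installed and the model ID is accessible.
from transformers import AutoConfig

config = AutoConfig.from_pretrained("mistralai/Mistral-7B-Instruct-v0.2")

# None/null => full attention is claimed; an integer => sliding-window attention.
print("sliding_window:", config.sliding_window)
print("max_position_embeddings:", config.max_position_embeddings)
```

Note that the config only reflects what the release states, not how the model was actually pretrained, which is exactly what the loss discrepancy above calls into question.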