marian-nmt / marian-dev

Fast Neural Machine Translation in C++ - development repository

Home Page: https://marian-nmt.github.io

High RAM usage with factors+shuffle-in-ram: false

eltorre opened this issue · comments

Bug description

We are trying to train a model with factors, but are running into out-of-memory problems:

  • When running marian with data shuffling, training uses ~90 GB of RAM, regardless of shuffle-in-ram.
  • With the same model and shuffling disabled, it peaks at ~40 GB.
  • The baseline SPM model, which uses exactly the same data but no factors, peaks at ~25 GB with shuffle-in-ram: false.

We are using factors-combine: sum, but are not sure whether this has a large effect on RAM usage.

It seems marian uses significantly more RAM when shuffling data for factored models. Maybe it is ignoring shuffle-in-ram: false?

For reference, here are the vocab, factor, and valid-entry stats, which look OK to me:

[2023-02-10 14:56:47] [vocab] Loading vocab spec file ../wd.all2022.en-fr.en-fr/vocab.en.new.fsv
[2023-02-10 14:56:47] [vocab] Factor group '(lemma)' has 32000 members
[2023-02-10 14:56:47] [vocab] Factor group '|d' has 114 members
[2023-02-10 14:56:47] [vocab] Factor group '|s' has 4 members
[2023-02-10 14:56:47] [vocab] Factor group '|c' has 3 members
[2023-02-10 14:56:47] [vocab] Factored-embedding map read with total/unique of 127985/32121 factors from 32000 example words (in space of 73,602,300)
[2023-02-10 14:56:47] [vocab] Expanding all valid vocab entries out of 73,602,300...
[2023-02-10 14:57:11] [vocab] Completed, total 43769165 valid combinations
[2023-02-10 14:57:11] [data] Setting vocabulary size for input 0 to 43,769,165
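
As a quick consistency check on the numbers in this log, here is a minimal Python sketch. The variable names are purely illustrative and are not Marian identifiers; the figures are copied from the log lines above. It shows that the four factor group sizes add up to the reported 32121 unique factors, and that the valid combinations fill roughly 59% of the expanded space.

```python
# Factor group sizes copied from the [vocab] log lines above.
group_sizes = {"(lemma)": 32000, "|d": 114, "|s": 4, "|c": 3}

# Should match the "unique of 32121 factors" figure in the log.
unique_factors = sum(group_sizes.values())

# Both figures are taken verbatim from the log.
total_space = 73_602_300
valid_combinations = 43_769_165

# Share of the expanded space that holds valid vocab entries.
valid_fraction = valid_combinations / total_space

print(f"unique factors: {unique_factors}")
print(f"valid fraction: {valid_fraction:.1%}")
```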

Context

Marian v1.11.0 f00d062 2022-02-08 08:39:24 -0800

We also observed the same behaviour with rev. 3c2a432

CMake command:
cmake .. -DCMAKE_BUILD_TYPE=Release \
  -DUSE_SENTENCEPIECE=ON \
  -DCOMPILE_CPU=on \
  -DUSE_STATIC_LIBS=on \
  -DUSE_FBGEMM=on

Comments

As a side question (and sorry to mix it with the bug), the size of the expanded space is:
(32000+1)*(114+1)*(4+1)*(3+1) = 73602300
It seems to me that marian is reserving an extra vocab entry for UNK in each factor group, but UNK will never occur in our case. Is there a flag to inhibit this behaviour?
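
The arithmetic can be reproduced with a short Python sketch. The group sizes come from the log in the bug description; the "without extra slot" figure is only a hypothetical comparison and says nothing about what Marian actually does with the extra entry.

```python
from math import prod

# Factor group sizes from the log: lemma, |d, |s, |c.
group_sizes = [32000, 114, 4, 3]

# One extra slot per group, matching the size Marian reports.
space_with_extra = prod(n + 1 for n in group_sizes)

# Hypothetical size if no extra slot were reserved per group.
space_without_extra = prod(group_sizes)

print(space_with_extra)     # 73602300
print(space_without_extra)  # 43776000
```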

Thanks a lot