tomtung / omikuji

An efficient implementation of Partitioned Label Trees & its variations for extreme multi-label classification

Home Page: https://crates.io/crates/omikuji

Issues when training on a large dataset

klimentij opened this issue

Hi Tom! At first, I wanted to thank you for your great contribution. This is the best implementation for XMC I've found (that is also feasible to use in production).

I ran a number of experiments, and my observation is that it works great when the training set is around 1-2M samples, but the task I'm trying to solve has 60M samples in the training set, with 1M labels and 3M features from TF-IDF. I always use the default Parabel-like parameters.

Once I managed to train a model on 60M samples with 260k labels, but the only machine that managed to fit it was a 160-CPU, 3.4 TB RAM GCP instance, which is very expensive.

I tried a 96-CPU, 1.4 TB RAM machine to decrease costs, but it hangs for 3-4 hours on the "Initializing tree trainer" step and then disconnects (I guess it runs out of memory).

Do you have any tips and tricks on how to run training on a dataset of this size at a reasonable cost? E.g., would it be possible to train in batches on smaller/cheaper machines? Or are there any "magic" hyperparameter settings that would achieve this?

Hi, first, please make sure that you're using the binary compiled from Rust directly instead of the Python wrapper. Currently, to keep the Python API ergonomic, a redundant copy of the dataset is kept in memory, which is fine for smaller datasets but could be problematic for really large ones.

I'm a bit surprised that you couldn't even finish initialization. I added some additional logging to facilitate debugging; could you try installing the latest source and see how far it gets? You can do so by running cargo install -f --git https://github.com/tomtung/omikuji.git --features cli.

Once initialization passes, you can try using the new --train_trees_1_by_1 flag I just added. Additionally, you can also consider limiting --n_threads; by default the program utilizes all available cores, but higher parallelism can also drive up memory usage.
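For illustration, putting those suggestions together, the invocation could look something like this (the paths and the thread count are just placeholders, not recommended values):

```bash
# Sketch only: train trees one at a time and cap the number of worker
# threads to trade longer training time for lower peak memory usage.
omikuji train ./train.txt \
    --model_path ./model \
    --train_trees_1_by_1 \
    --n_threads 48
```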

Thank you for the quick reply!

> you're using the binary compiled from Rust directly instead of the Python wrapper

Yes, I use the CLI compiled from source. I use the Python binding just for inference on new data.

> I'm a bit surprised that you couldn't even finish initialization

I'm not 100% sure it happened because of Omikuji: I was connected through SSH, and the only output I saw was "Broken pipe", so it could have been some SSH-related timeout.

Yesterday I re-ran this training, this time on the 3.4 TB RAM instance, and I executed the command as a background bash job (with &). It finished tree initialization, but it took 6 hours. By the way, it would be awesome to have a progress bar at this stage as well.
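In case it helps anyone hitting the same broken-pipe issue: detaching the job from the terminal and logging to a file is a simple way to keep it alive across dropped SSH sessions. Roughly (with placeholder paths and flags):

```bash
# Sketch only: detach the training run from the SSH session and keep a
# log file, so a dropped connection (broken pipe) doesn't kill the job.
nohup omikuji train ./train.txt --model_path ./model > train.log 2>&1 &

# Check on the progress output later, e.g. after reconnecting:
tail -f train.log
```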

Right now it's been running for 26 hours, and it's at 90% of forest training.

Thank you for the updates you've made: I'll reinstall your tool before the next training, try the --train_trees_1_by_1 flag, and play with --n_threads.

Also, just a suggestion: do you think it would be possible/feasible to implement checkpoints? Right now the cost of training on my dataset is about $600 with this instance, and if something happens before the model is saved, that's a painful waste of resources.

> It finished tree initialization, but it took 6 hours. By the way, it would be awesome to have a progress bar at this stage as well.

Yep, you might have noticed that this has already been added. When I get around to it, I could also try improving parallelization in this part; I didn't expect it to take as long as 6 hours.

> Do you think it would be possible/feasible to implement checkpoints?

In principle it should be possible, but it would require some fairly significant refactoring & redesign. I guess we could also aim for something simpler, e.g. saving the initialized trainer, or saving each tree immediately after it's trained, but that would still require some refactoring. I probably can't get to it at the moment, but you're welcome to take a stab at it :)

And one last question regarding usage on large datasets. Yesterday's training process finished successfully in 26 hours, and I'm very happy with the precision@5 I got on my test set.

But there's a new interesting issue: the resulting model consists of 3 trees of 120 GB each, and it takes 70 minutes to load the model (in fact, it requires more than 624 GB of memory to finish loading the model for inference, since it failed to load on a 624 GB instance).

I tried loading the model in Python, calling model.densify_weights(0.05), and saving the model, but that didn't seem to help with the model size (I even ended up with a bigger model for some reason, >140 GB per tree).

Also, I understand I can keep only one tree in the model folder, but it's still 120 GB, and there's quite a performance drop when I do that (tested only on the smaller 2M dataset).

Is model.densify_weights the right place to start if I want to optimize the model? Do you have any tips and tricks on how to decrease memory usage and loading time for inference in production?

Unfortunately, for now I can't really think of any way to further speed up model loading... I guess we could first load the entire files into memory, then parallelize deserializing the individual trees, but that would probably make the memory usage problem even worse.

Calling densify_weights would indeed only increase the model size (often with the benefit of faster prediction).

You could try increasing --linear.weight_threshold during training to prune weights more aggressively, but this might cause a noticeable performance drop. I could also try to add support for pruning trained models if you think that would be useful.
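For example, something along these lines; the threshold here is only illustrative, and the right value depends on how much precision loss is acceptable:

```bash
# Sketch only: prune more near-zero classifier weights at training time
# to shrink the model. 0.5 is an arbitrary example; larger thresholds
# drop more weights but risk a noticeable drop in precision.
omikuji train ./train.txt \
    --model_path ./model_pruned \
    --linear.weight_threshold 0.5
```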

I might eventually try to support a sparsity-inducing loss function like the L1-regularized SVM, but that will take time and might be tricky too. (E.g., according to Babbar & Schölkopf 2019, the LibLinear solver underfits, and they suggested using a proximal gradient method instead, which I suspect could be much slower.)

Out of curiosity, could you tell me a bit more about your use case? In particular, do you need to retrain the model regularly? If so, I could try to prioritize speeding up the initialization process, as I assume shaving 6 hours off of 26 would be quite significant.

Closing for now due to inactivity; feel free to re-open if you have more questions.