suvojit-0x55aa / mixed-precision-pytorch

Training with FP16 weights in PyTorch

Geek Repo:Geek Repo

Github PK Tool:Github PK Tool

Mixed Precision Training

in PyTorch


Training in FP16 that is in half precision results in slightly faster training in nVidia cards that supports half precision ops. Also the memory requirements of the models weights are almost halved since we use 16-bit format to store the weights instead of 32-bits.

Although training in half precision has it's own caveats. The problems that is encountered in half precision training are:

  • Imprecise weight update
  • Gradients underflow
  • Reductions overflow

Below is a discussion on how to deal with these problems.

FP16 Basics

IEEE-754 floating point starndard states that given a floating point number X if,
2^E <= abs(X) < 2^(E+1) then the distance from X to the next largest representable floating point number epsilon is:

  • epsilon = 2^(E-52) [For a 64-bit float (double precision)]
  • epsilon = 2^(E-23) [For a 32-bit float (single precision)]
  • epsilon = 2^(E-10) [For a 16-bit float (half precision)]

The above equations allow us to compute the following:

  • For half precision...

    If you want an accuracy of +/-0.5 (or 2^-1), the maximum size that the number can be is 2^10. Any larger than this and the distance between floating point numbers is greater than 0.5.

    If you want an accuracy of +/-0.0005 (about 2^-11), the maximum size that the number can be is 1. Any larger than this and the distance between floating point numbers is greater than 0.0005.

  • For single precision...

    If you want an accuracy of +/-0.5 (or 2^-1), the maximum size that the number can be is 2^23. Any larger than this and the distance between floating point numbers is greater than 0.5.

    If you want an accuracy of +/-0.0005 (about 2^-11), the maximum size that the number can be is 2^13. Any larger than this and the distance between floating point numbers is greater than 0.0005.

  • For double precision...

    If you want an accuracy of +/-0.5 (or 2^-1), the maximum size that the number can be is 2^52. Any larger than this and the distance between floating point numbers is greater than 0.5.

    If you want an accuracy of +/-0.0005 (about 2^-11), the maximum size that the number can be is 2^42. Any larger than this and the distance between floating point numbers is greater than 0.0005.

Imprecise Weight Update

Thus while training our network we'll need that added precision, since our weights will go through small updates. For example 1 + 0.0001 will result in:

  • 1.0001 in FP32
  • but in FP16 it will be 1

What that means is that we risk underflow (attempting to represent numbers so small they clamp to zero) and overflow (numbers so large they become NaN, not a number). With underflow, our network never learns anything, and with overflow, it learns garbage.
To overcome this we keep a "FP32 master copy". It is a copy of out FP16 model weights in FP32. We use these master params to update our weights and then copy them back to our model. We also update the gradients in the master copy as they are calculated in the model.

Gradients Underflow

Gradients are sometime not representable in FP16. This leads to the gradient underflow problem. A way to deal with this problem is to shift the gradient bitwise, so that they are in a range representable by half-precision floats. A way to do this is to multiply the loss by a large number like 2^7 that would shift the computed gradients during loss.backward() to the FP16 representable range. Then when we copy these gradients to FP32 master copy, we scale them back down by dividing the gradients with the same scaling factor.

Reduction Overflow

Another caveat with half-precision is that while doing large reductions it may overflow. For example consider two tensor:

  • a = torch.Tensor(4094).fill_(4.0).cuda()
  • b = torch.Tensor(4095).fill_(4.0).cuda()

If we were to do a.sum() and b.sum() it would result in 16376 and 16380 respectively, as expected in single-point precision. But if we did the same ops in half point precision it would result in 16376 and 16384 respectively. To overcome this problem we do the reduction ops like BatchNorm and loss calculation in FP32.

All these problems have been kept in mind to help us successfully train with FP16 weights. Implementation of the above ideas can be found in the train.py file.

Usage Instruction

python main.py [-h] [--lr LR] [--steps STEPS] [--gpu] [--fp16] [--loss_scaling] [--model MODEL]

PyTorch (FP16) CIFAR10 Training

optional arguments:
  -h, --help            Show this help message and exit
  --lr LR               Learning Rate
  --steps STEPS, -n STEPS
                        No of Steps
  --gpu, -p             Train on GPU
  --fp16                Train with FP16 weights
  --loss_scaling, -s    Scale FP16 losses
  --model MODEL, -m MODEL
                        Name of Network

To run in FP32 mode, use:
python main.py -n 200 -p --model resnet50

To train with FP16 weights, use:
python main.py -n 200 -p --fp16 -s --model resnet50
-s flag enables loss scaling.

Results

Training on a single P100 Pascal GPU, I was able to obtain the following result, while training with ResNet50 with a batch size of 128 over 200 epochs.

FP32 Mixed Precision
Time/Epoch 1m32s 1m15s
Storage 90 MB 46 MB
Accuracy 94.50% 94.43%

Training on 4x K80 Tesla GPUs, with ResNet50 with a batch size of 512 over 200 epochs.

FP32 Mixed Precision
Time/Epoch 1m24s 1m17s
Storage 90 MB 46 MB
Accuracy 94.634% 94.922%

Training on 4x P100 Tesla GPUs, with ResNet50 with a batch size of 512 over 200 epochs.

FP32 Mixed Precision
Time/Epoch 26s224ms 23s359ms
Storage 90 MB 46 MB
Accuracy 94.51% 94.78%

Training on a single V100 Volta GPUs, with ResNet50 with a batch size of 128 over 200 epochs.

FP32 Mixed Precision
Time/Epoch 47s112ms 25s601ms
Storage 90 MB 46 MB
Accuracy 94.87% 94.65%

Training on 4x V100 Volta GPUs, with ResNet50 with a batch size of 512 over 200 epochs.

FP32 Mixed Precision
Time/Epoch 17s841ms 12s833ms
Storage 90 MB 46 MB
Accuracy 94.38% 94.60%

Speedup of the row setup with respect to the column setup is summarized in the following table.

1xP100:FP32 1xP100:FP16 4xP100:FP32 4xP100:FP16 1xV100:FP32 1xV100:FP16 4xV100:FP32
1xP100:FP16 22.67 % 0.0 % 0.0 % 0.0 % 0.0 % 0.0 % 0.0 %
4xP100:FP32 250.82 % 186.0 % 0.0 % 0.0 % 79.65 % 0.0 % 0.0 %
4xP100:FP16 293.85 % 221.08 % 12.27 % 0.0 % 101.69 % 9.6 % 0.0 %
1xV100:FP32 95.28 % 59.2 % 0.0 % 0.0 % 0.0 % 0.0 % 0.0 %
1xV100:FP16 259.36 % 192.96 % 2.43 % 0.0 % 84.02 % 0.0 % 0.0 %
4xV100:FP32 415.67 % 320.38 % 46.99 % 30.93 % 164.07 % 43.5 % 0.0 %
4xV100:FP16 616.9 % 484.43 % 104.35 % 82.02 % 267.12 % 99.49 % 39.02 %

TODO
  • Test with all nets.
  • Test models on Volta GPUs.
  • Test runtimes on multi GPU setup.

Further Explorations:

Convenience:

nVidia provides the apex library that handles all the caveats of training in mixed precision. It also provides API for multiprocess distributed training with NCCL and SyncBatchNorm which reduces stats across processes during multiprocess distributed data parallel training.


Thanks:

The project heavily borrows from @kuangliu's project pytorch-cifar. The models have been directly borrowed from the repository with minimal change, so thanks to @kuangliu for maintaining such awesome project.