What is AllGather for? Why use AllGather?
lyccol opened this issue · comments
DeCLIP/prototype/model/clip.py
Lines 136 to 146 in 9d9e25d
Since CLIP (Contrastive Language-Image Pre-training) requires a large batch size, we use all-gather during DDP (Distributed Data Parallel) training to scale up the effective batch size by gathering the features computed on every GPU.
For example, with 128 GPUs and a batch size of 256 on each GPU, image_features (resp. text_features) in each process has shape [256, feature_dim], while gathered_image_features (resp. gathered_text_features) has shape [128*256, feature_dim]. After the gradients of the loss function are synchronised across processes, training is therefore equivalent to using a batch size of 32768 directly.
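To make the equivalence concrete, here is a minimal single-process sketch in plain Python. It is not the code from clip.py: the all-gather is only simulated by concatenating per-rank lists in rank order, and the names `simulate`, `image_to_text_loss`, and the `scale` value are hypothetical choices for illustration. It covers only the forward value of the image-to-text loss, not gradient flow, and checks that averaging the per-rank losses (local images against gathered texts) gives the same number as one loss over the full batch.

```python
import math
import random


def normalize(v):
    """Scale a vector to unit L2 norm."""
    norm = math.sqrt(sum(x * x for x in v))
    return [x / norm for x in v]


def cross_entropy(logits, target):
    """Numerically stable -log softmax(logits)[target]."""
    m = max(logits)
    log_sum = m + math.log(sum(math.exp(x - m) for x in logits))
    return log_sum - logits[target]


def image_to_text_loss(images, texts, offset=0, scale=10.0):
    """Mean contrastive loss of `images` against `texts`.

    Image i is assumed to match text `offset + i`; `scale` is an
    illustrative logit scale, not a value taken from the repository.
    """
    total = 0.0
    for i, img in enumerate(images):
        logits = [scale * sum(a * b for a, b in zip(img, txt)) for txt in texts]
        total += cross_entropy(logits, offset + i)
    return total / len(images)


def simulate(world_size=4, local_batch=3, feature_dim=8, seed=0):
    rng = random.Random(seed)

    def rand_feats():
        return [
            normalize([rng.gauss(0, 1) for _ in range(feature_dim)])
            for _ in range(local_batch)
        ]

    # One [local_batch, feature_dim] block per simulated rank.
    image_features = [rand_feats() for _ in range(world_size)]
    text_features = [rand_feats() for _ in range(world_size)]

    # Stand-in for all-gather: every rank sees the concatenation of all
    # blocks in rank order, i.e. [world_size * local_batch, feature_dim].
    gathered_image = [f for block in image_features for f in block]
    gathered_text = [f for block in text_features for f in block]

    # Each rank scores its local images against ALL gathered texts; the
    # matching text for local row i sits at index rank * local_batch + i.
    per_rank = [
        image_to_text_loss(image_features[r], gathered_text, offset=r * local_batch)
        for r in range(world_size)
    ]
    ddp_loss = sum(per_rank) / world_size

    # Reference: a single process holding the whole batch.
    full_loss = image_to_text_loss(gathered_image, gathered_text)
    return len(gathered_text), ddp_loss, full_loss
```

Calling `simulate()` returns the gathered batch size together with the two loss values, which agree up to floating-point rounding. The same arithmetic with 128 ranks and 256 samples per rank gives the 32768 negatives per step mentioned above.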
The same reasoning applies to SLIP, FILIP, DeCLIP, and DeFILIP.