NVIDIA / TransformerEngine

A library for accelerating Transformer models on NVIDIA GPUs, including using 8-bit floating point (FP8) precision on Hopper and Ada GPUs, to provide better performance with lower memory utilization in both training and inference.

https://docs.nvidia.com/deeplearning/transformer-engine/user-guide/index.html

NVIDIA/TransformerEngine Issues

[ERROR] Cannot install the package
Updated 5 days ago · 3
Can CUDA 12.1.1 really be used for compilation?
Closed 9 days ago · 4
CUDA 12.1 doesn't work
Updated 9 days ago · 2
Version constraint of `flash-attn` needs to be updated
Closed 9 days ago · 2
main branch cannot compile due to incompatibility with the main branch of cudnn-frontend
Updated 9 days ago · 2
Quantitative Analysis of FP8 GEMM's Impact on LLM Convergence
Closed 9 days ago · 1
MPI Dependency for Computation-Communication Overlapping in Tensor Parallelism
Updated 9 days ago · 1
torch.compile graph breaks at `forward`
Updated 9 days ago
FP8 not converging during Supervised Fine-Tuning (though BF16 is)
Updated 9 days ago · 1
[PyTorch] LayerNormMLP seems to cause grad norm explosion under multi-node
Updated 9 days ago · 1
What's the benefit of using comm_gemm_overlap.h:bulk_overlap?
Updated 9 days ago
[Question] Why Tensor parallel communication/GEMM overlap can happen only when sequence parallelism is enabled?
Updated 9 days ago · 1
Support for overlapping tensor-parallel collectives with matmuls in fprop?
Updated 9 days ago · 2
Best out of the box framework for training a BitNet model
Updated 9 days ago
Feature request: Add Llama-style MLP with three linear layers
Closed 9 days ago · 2
When using `import transformer_engine`, many processes are created
Updated 9 days ago
The package name passed to `find_package_handle_standard_args` (LIBRARY) does not match the name of the calling package (CUDNN)
Updated 9 days ago · 2
When using FP8, the loss may differ slightly after training is interrupted and then resumed. Is this caused by the FP8 mechanism?
Updated 9 days ago · 1
CPU Overhead of te.Linear FP8 Layers
Updated 9 days ago · 7
Could TransformerEngine work with Deepspeed Zero w/ offloading?
Updated 9 days ago
Replacing nn.Linear w/ te.Linear FP8 convergence issue
Updated 9 days ago · 8
Output scale not being used with `te_gemm` in FP8
Updated 9 days ago · 3
te.Checkpoint does not work for nested autocast
Updated 9 days ago · 3
When ub_overlap_rs_dgrad is set to True, the error "Caught signal 8 (Floating point exception: integer divide by zero)" is raised.
Updated 9 days ago · 1
MLP without LayerNorm
Updated 9 days ago
Training the 1B model on H800 resulted in a decrease in throughput
Updated 9 days ago · 3
`inv_freq` of `RotaryPositionEmbedding` is hard-coded to 10k
Updated 9 days ago · 1
[ERROR] cuBLAS error when launch training with Megatron-LM and TransformerEngine
Closed 11 days ago · 2
[Question] ub_tp_comm_overlap config setup
Closed 11 days ago · 9
Can TE optimize how it finds cuDNN?
Closed 12 days ago · 1
Request for Adaptive Layer Norm MLP
Updated 12 days ago · 6
How to disable fused_attention when building?
Closed 20 days ago · 3
v1.6: FP8GlobalStateManager seems to be preserving state in distributed setting
Closed 25 days ago · 1
`warnings.simplefilter('default')` in global scope causes excessive DeprecationWarnings
Closed a month ago · 5
ERROR: Failed building wheel for transformer-engine
Updated a month ago · 4
Some doubts about the usage of `DelayedScaling.interval`.
Updated a month ago
[JAX] Support fused SwiGLU MLP
Closed a month ago
ncclIpcSocketSendFd failed in register_user_buffer_collective(alloc=true), --tp-comm-overlap
Closed a month ago
Cannot import transformer_engine.pytorch
Closed a month ago
te.checkpoint does not work on nn.Module that consists of te blocks
Closed a month ago · 4
Build fails when using jax NGC image
Closed a month ago · 3
_ZN18transformer_engine6getenvIiEET_RKSsRKS1_ on the latest main branch
Closed a month ago · 6
When A and B are fp8 tensors, the compute type could be `CUBLAS_COMPUTE_16F`
Closed 2 months ago · 2
[Pytorch] Swiglu implementation not aligned with jiterator version in probability
Closed 2 months ago · 4
Primary weights profiling question
Closed 2 months ago · 4
Question: Scaling Factor of Weights Primary vs Not
Closed 2 months ago · 1
[JAX/PyTorch] slower kernel calls on `sm90_xmma_gemm_e4m3bf16_e4m3f32_f32`
Closed 2 months ago · 1
Tensor to FP8
Closed 3 months ago
Doesn't work on WSL2
Updated 3 months ago · 2
Incorrect error message when shape is not suitable for fp8 casting
Closed 3 months ago · 1