flash-attention

There are 0 repository under flash-attention topic.

QwenLM / Qwen
The official repo of Qwen (通义千问) chat & pretrained large language model proposed by Alibaba Cloud.
chinese flash-attention large-language-models llm natural-language-processing pretrained-models
Language:Python 11554
Chinese-LLaMA-Alpaca-2
ymcui / Chinese-LLaMA-Alpaca-2
中文LLaMA-2 & Alpaca-2大模型二期项目 + 64K超长上下文模型 (Chinese LLaMA-2 & Alpaca-2 LLMs with 64K long context models)
alpaca llama llm llama-2 large-language-models nlp alpaca-2 flash-attention llama2 alpaca2 64k yarn rlhf
Language:Python 6931
InternLM / InternLM
Official release of InternLM2 7B and 20B base and chat models. 200K context support
chatbot gpt large-language-model long-context rlhf fine-tuning-llm llm chinese flash-attention pretrained-models
Language:Python 5297
DefTruth / Awesome-LLM-Inference
📖A curated list of Awesome LLM Inference Paper with codes, TensorRT-LLM, vLLM, streaming-llm, AWQ, SmoothQuant, WINT8/4, Continuous Batching, FlashAttention, PagedAttention etc.
awesome-llm deepseek flash-attention flash-attention-2 llm llm-inference llms open-sora paged-attention sora streaming-llm tensorrt-llm vllm
1494
flashinfer-ai / flashinfer
FlashInfer: Kernel Library for LLM Serving
cuda flash-attention gpu large-large-models llm-inference pytorch tvm
Language:Cuda 679
DefTruth / CUDA-Learn-Notes
🎉CUDA 笔记 / 大模型手撕CUDA / C++笔记，更新随缘: flash_attn、sgemm、sgemv、warp reduce、block reduce、dot product、elementwise、softmax、layernorm、rmsnorm、hist etc.
block-reduce cuda cuda-kernels cuda-programming elementwise flash-attention flash-attention-2 gemm gemv layernorm rmsnorm softmax warp-reduce
Language:Cuda 599
CoinCheung / gdGPT
Train llm (bloom, llama, baichuan2-7b, chatglm3-6b) with deepspeed pipeline mode. Faster than zero/zero++/fsdp.
deepspeed llm pipeline nlp pytorch full-finetune model-parallization bloom flash-attention baichuan2-7b chatglm3-6b mixtral-8x7b llama2
Language:Python 83
RulinShao / FastCkpt
Python package for rematerialization-aware gradient checkpointing
flash-attention gradient-checkpointing
Language:Python 22
kklemon / FlashPerceiver
Fast and memory efficient PyTorch implementation of the Perceiver with FlashAttention.
attention-mechanism deep-learning flash-attention nlp perceiver transformer
Language:Python 12
kyegomez / FlashMHA
An simple pytorch implementation of Flash MultiHead Attention
artificial-intelligence artificial-neural-networks attention attention-mechanisms attentionisallyouneed gpt4 transformer flash-attention
Language:Jupyter Notebook 12
Naman-ntc / FastCode
Utilities for efficient fine-tuning, inference and evaluation of code generation models
code-generation efficient finetuning inference transformers flash-attention
Language:Python 10
graphcore-research / flash-attention-ipu
Poplar implementation of FlashAttention for IPU
deep-learning flash-attention flash-attention-2 graphcore poplar transformers ipu pytorch
Language:C++ 2

flash-attention

QwenLM / Qwen

ymcui / Chinese-LLaMA-Alpaca-2

InternLM / InternLM

DefTruth / Awesome-LLM-Inference

flashinfer-ai / flashinfer

DefTruth / CUDA-Learn-Notes

CoinCheung / gdGPT

RulinShao / FastCkpt

kklemon / FlashPerceiver

kyegomez / FlashMHA

Naman-ntc / FastCode

graphcore-research / flash-attention-ipu