A possible solution for getting Flash Attention 2 to build/compile on Windows
Akatsuki030 opened this issue
As a Windows user, I tried to compile this and found that the problem lies in two files, "flash_fwd_launch_template.h" and "flash_bwd_launch_template.h", under "./flash-attention/csrc/flash_attn/src". When a template tried to reference the variable "Headdim", it caused error C2975. I think this might be the reason we always get compile errors on Windows. Below is how I solved the problem:
First, in the file "flash_bwd_launch_template.h" you can find many functions like "run_mha_bwd_hdimXX", each with a constant declaration "Headdim = XX" and templates like run_flash_bwd<Flash_bwd_kernel_traits<Headdim, 64, 128, 8, 4, 2, 2, false, false, T>, Is_dropout>(params, stream, configure). What I did was replace every "Headdim" in these templates with the function's literal value. For example, if the function is called run_mha_bwd_hdim128 and has the constant declaration "Headdim = 128", you have to change Headdim to 128 in the templates, which gives run_flash_bwd<Flash_bwd_kernel_traits<128, 64, 128, 8, 2, 4, 2, false, false, T>, Is_dropout>(params, stream, configure). I did the same thing to the "run_mha_fwd_hdimXX" functions and their templates; a sketch of the edit follows.
Second, another error comes from "flash_fwd_launch_template.h", line 107: the same problem of referencing the constant "kBlockM" in the if-else statement below it. I rewrote it to:
if constexpr (Kernel_traits::kHeadDim % 128 == 0) {
    dim3 grid_combine((params.b * params.h * params.seqlen_q + 4 - 1) / 4);
    BOOL_SWITCH(is_even_K, IsEvenKConst, [&] {
        if (params.num_splits <= 2) {
            flash_fwd_splitkv_combine_kernel<Kernel_traits, 4, 1, IsEvenKConst><<<grid_combine, Kernel_traits::kNThreads, 0, stream>>>(params);
        } else if (params.num_splits <= 4) {
            flash_fwd_splitkv_combine_kernel<Kernel_traits, 4, 2, IsEvenKConst><<<grid_combine, Kernel_traits::kNThreads, 0, stream>>>(params);
        } else if (params.num_splits <= 8) {
            flash_fwd_splitkv_combine_kernel<Kernel_traits, 4, 3, IsEvenKConst><<<grid_combine, Kernel_traits::kNThreads, 0, stream>>>(params);
        } else if (params.num_splits <= 16) {
            flash_fwd_splitkv_combine_kernel<Kernel_traits, 4, 4, IsEvenKConst><<<grid_combine, Kernel_traits::kNThreads, 0, stream>>>(params);
        } else if (params.num_splits <= 32) {
            flash_fwd_splitkv_combine_kernel<Kernel_traits, 4, 5, IsEvenKConst><<<grid_combine, Kernel_traits::kNThreads, 0, stream>>>(params);
        } else if (params.num_splits <= 64) {
            flash_fwd_splitkv_combine_kernel<Kernel_traits, 4, 6, IsEvenKConst><<<grid_combine, Kernel_traits::kNThreads, 0, stream>>>(params);
        } else if (params.num_splits <= 128) {
            flash_fwd_splitkv_combine_kernel<Kernel_traits, 4, 7, IsEvenKConst><<<grid_combine, Kernel_traits::kNThreads, 0, stream>>>(params);
        }
        C10_CUDA_KERNEL_LAUNCH_CHECK();
    });
} else if constexpr (Kernel_traits::kHeadDim % 64 == 0) {
    dim3 grid_combine((params.b * params.h * params.seqlen_q + 8 - 1) / 8);
    BOOL_SWITCH(is_even_K, IsEvenKConst, [&] {
        if (params.num_splits <= 2) {
            flash_fwd_splitkv_combine_kernel<Kernel_traits, 8, 1, IsEvenKConst><<<grid_combine, Kernel_traits::kNThreads, 0, stream>>>(params);
        } else if (params.num_splits <= 4) {
            flash_fwd_splitkv_combine_kernel<Kernel_traits, 8, 2, IsEvenKConst><<<grid_combine, Kernel_traits::kNThreads, 0, stream>>>(params);
        } else if (params.num_splits <= 8) {
            flash_fwd_splitkv_combine_kernel<Kernel_traits, 8, 3, IsEvenKConst><<<grid_combine, Kernel_traits::kNThreads, 0, stream>>>(params);
        } else if (params.num_splits <= 16) {
            flash_fwd_splitkv_combine_kernel<Kernel_traits, 8, 4, IsEvenKConst><<<grid_combine, Kernel_traits::kNThreads, 0, stream>>>(params);
        } else if (params.num_splits <= 32) {
            flash_fwd_splitkv_combine_kernel<Kernel_traits, 8, 5, IsEvenKConst><<<grid_combine, Kernel_traits::kNThreads, 0, stream>>>(params);
        } else if (params.num_splits <= 64) {
            flash_fwd_splitkv_combine_kernel<Kernel_traits, 8, 6, IsEvenKConst><<<grid_combine, Kernel_traits::kNThreads, 0, stream>>>(params);
        } else if (params.num_splits <= 128) {
            flash_fwd_splitkv_combine_kernel<Kernel_traits, 8, 7, IsEvenKConst><<<grid_combine, Kernel_traits::kNThreads, 0, stream>>>(params);
        }
        C10_CUDA_KERNEL_LAUNCH_CHECK();
    });
} else {
    dim3 grid_combine((params.b * params.h * params.seqlen_q + 16 - 1) / 16);
    BOOL_SWITCH(is_even_K, IsEvenKConst, [&] {
        if (params.num_splits <= 2) {
            flash_fwd_splitkv_combine_kernel<Kernel_traits, 16, 1, IsEvenKConst><<<grid_combine, Kernel_traits::kNThreads, 0, stream>>>(params);
        } else if (params.num_splits <= 4) {
            flash_fwd_splitkv_combine_kernel<Kernel_traits, 16, 2, IsEvenKConst><<<grid_combine, Kernel_traits::kNThreads, 0, stream>>>(params);
        } else if (params.num_splits <= 8) {
            flash_fwd_splitkv_combine_kernel<Kernel_traits, 16, 3, IsEvenKConst><<<grid_combine, Kernel_traits::kNThreads, 0, stream>>>(params);
        } else if (params.num_splits <= 16) {
            flash_fwd_splitkv_combine_kernel<Kernel_traits, 16, 4, IsEvenKConst><<<grid_combine, Kernel_traits::kNThreads, 0, stream>>>(params);
        } else if (params.num_splits <= 32) {
            flash_fwd_splitkv_combine_kernel<Kernel_traits, 16, 5, IsEvenKConst><<<grid_combine, Kernel_traits::kNThreads, 0, stream>>>(params);
        } else if (params.num_splits <= 64) {
            flash_fwd_splitkv_combine_kernel<Kernel_traits, 16, 6, IsEvenKConst><<<grid_combine, Kernel_traits::kNThreads, 0, stream>>>(params);
        } else if (params.num_splits <= 128) {
            flash_fwd_splitkv_combine_kernel<Kernel_traits, 16, 7, IsEvenKConst><<<grid_combine, Kernel_traits::kNThreads, 0, stream>>>(params);
        }
        C10_CUDA_KERNEL_LAUNCH_CHECK();
    });
}
Third, for the function"run_mha_fwd_splitkv_dispatch
" in "flash_fwd_launch_template.h
", line 194, you also have to change "kBlockM
" in the template as 64. And then you can try to compile it.
These solutions look stupid but really solved my problem: I successfully compiled flash_attn_2 on Windows. I still need some time to test it on other computers.
I put the files I rewrote: link.
I think there might be a better solution, but for me, it at least works.
Oh, I didn't use Ninja and compiled it from source; maybe someone can try compiling it with Ninja?
EDIT: I used
- python 3.11
- Pytorch 2.2+cu121 Nightly
- CUDA 12.2
- Anaconda
- Windows 11 22H2
I did try replacing the .h files in my venv with your files, with:
- Python 3.10
- Pytorch 2.2 Nightly
- CUDA 12.1
- Visual Studio 2022
- Ninja
And the build failed fairly quickly. I have uninstalled ninja but it seems to be importing it anyway? How did you manage to build without ninja?
Also, I can't install your build since I'm on Python 3.10. Gonna see if I manage to compile it.
EDIT: Tried with CUDA 12.2, no luck either.
EDIT2: I managed to build it. I took your .h files and uncommented the variable declarations, and then it worked. It took ~30 minutes on a 7800X3D and 64GB RAM.
It seems that for some reason the compiler on Windows still tries to use/import those variables even when they're not declared; but at the same time, if they're used as template arguments a few lines below, it doesn't work.
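For what it's worth, this looks like a known MSVC quirk: a non-static constexpr local isn't always usable as a template argument inside a lambda, and BOOL_SWITCH expands to exactly such a lambda. A minimal, hypothetical repro of the pattern (not code from the repo; GCC/Clang accept it, while some MSVC versions historically reject it):

template <int N>
struct Traits { static constexpr int value = N; };

void dispatch() {
    constexpr int kBlockM = 64;  // non-static constexpr local
    auto body = [&] {
        // Some MSVC versions raise error C2975 here, treating kBlockM as
        // odr-used through the lambda capture rather than as a constant
        // expression; inlining the literal (or making the local static)
        // sidesteps the capture entirely.
        Traits<kBlockM> t;
        (void)t.value;
    };
    body();
}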
EDIT3: I can confirm it works for exllamav2 + FA v2
Without FA
-- Measuring token speed...
** Position 1 + 127 tokens: 13.5848 t/s
** Position 128 + 128 tokens: 13.8594 t/s
** Position 256 + 128 tokens: 14.1394 t/s
** Position 384 + 128 tokens: 13.8138 t/s
** Position 512 + 128 tokens: 13.4949 t/s
** Position 640 + 128 tokens: 13.6474 t/s
** Position 768 + 128 tokens: 13.7073 t/s
** Position 896 + 128 tokens: 12.3254 t/s
** Position 1024 + 128 tokens: 13.8960 t/s
** Position 1152 + 128 tokens: 13.7677 t/s
** Position 1280 + 128 tokens: 12.9869 t/s
** Position 1408 + 128 tokens: 12.1336 t/s
** Position 1536 + 128 tokens: 13.0463 t/s
** Position 1664 + 128 tokens: 13.2463 t/s
** Position 1792 + 128 tokens: 12.6211 t/s
** Position 1920 + 128 tokens: 13.1429 t/s
** Position 2048 + 128 tokens: 12.5674 t/s
** Position 2176 + 128 tokens: 12.5847 t/s
** Position 2304 + 128 tokens: 13.3471 t/s
** Position 2432 + 128 tokens: 12.9135 t/s
** Position 2560 + 128 tokens: 12.2195 t/s
** Position 2688 + 128 tokens: 11.6120 t/s
** Position 2816 + 128 tokens: 11.2545 t/s
** Position 2944 + 128 tokens: 11.5304 t/s
** Position 3072 + 128 tokens: 11.7982 t/s
** Position 3200 + 128 tokens: 11.8041 t/s
** Position 3328 + 128 tokens: 12.8038 t/s
** Position 3456 + 128 tokens: 12.7324 t/s
** Position 3584 + 128 tokens: 11.7733 t/s
** Position 3712 + 128 tokens: 10.7961 t/s
** Position 3840 + 128 tokens: 11.1014 t/s
** Position 3968 + 128 tokens: 10.8474 t/s
With FA
-- Measuring token speed...
** Position 1 + 127 tokens: 22.6606 t/s
** Position 128 + 128 tokens: 22.5140 t/s
** Position 256 + 128 tokens: 22.6111 t/s
** Position 384 + 128 tokens: 22.6027 t/s
** Position 512 + 128 tokens: 22.3392 t/s
** Position 640 + 128 tokens: 22.0570 t/s
** Position 768 + 128 tokens: 22.3728 t/s
** Position 896 + 128 tokens: 22.4983 t/s
** Position 1024 + 128 tokens: 21.9384 t/s
** Position 1152 + 128 tokens: 22.3509 t/s
** Position 1280 + 128 tokens: 22.3189 t/s
** Position 1408 + 128 tokens: 22.2739 t/s
** Position 1536 + 128 tokens: 22.4145 t/s
** Position 1664 + 128 tokens: 21.9608 t/s
** Position 1792 + 128 tokens: 21.7645 t/s
** Position 1920 + 128 tokens: 22.1468 t/s
** Position 2048 + 128 tokens: 22.3400 t/s
** Position 2176 + 128 tokens: 21.9830 t/s
** Position 2304 + 128 tokens: 21.8387 t/s
** Position 2432 + 128 tokens: 20.2306 t/s
** Position 2560 + 128 tokens: 21.0056 t/s
** Position 2688 + 128 tokens: 22.2157 t/s
** Position 2816 + 128 tokens: 22.1912 t/s
** Position 2944 + 128 tokens: 22.1835 t/s
** Position 3072 + 128 tokens: 22.1393 t/s
** Position 3200 + 128 tokens: 22.1182 t/s
** Position 3328 + 128 tokens: 22.0821 t/s
** Position 3456 + 128 tokens: 22.0308 t/s
** Position 3584 + 128 tokens: 22.0060 t/s
** Position 3712 + 128 tokens: 21.9909 t/s
** Position 3840 + 128 tokens: 21.9816 t/s
** Position 3968 + 128 tokens: 21.9757 t/s
This is very helpful, thanks @Akatsuki030 and @Panchovix.
@Akatsuki030 is it possible to fix it by declaring these variables (Headdim, kBlockM) with constexpr static int instead of constexpr int? I've just pushed a commit that does that. Can you check if it compiles on Windows?
A while back someone (I think it was Daniel Haziza from the xformers team) told me that they needed constexpr static int for Windows compilation.
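That lines up with the workaround above: a constant with static storage duration doesn't need to be captured by the BOOL_SWITCH lambdas, so MSVC accepts it as a constant expression. Sketched on one declaration (the commit presumably touches every such constant in the launch templates):

// before: error C2975 on MSVC when referenced as a template argument
// inside a BOOL_SWITCH lambda
constexpr int Headdim = 128;

// after: compiles on MSVC as well as GCC/Clang
constexpr static int Headdim = 128;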
@tridao just tested the compilation with your latest push, and now it works.
I did use
- Python 3.10
- Pytorch 2.2+cu121 Nightly
- CUDA 12.2
- Visual Studio 2022
- Ninja
Great, thanks for the confirmation @Panchovix. I'll cut a release now (v2.3.2). Ideally we'd set up prebuilt CUDA wheels for Windows at some point so folks can just download instead of having to compile locally, but that can wait till later.
Great! I built a wheel with python setup.py bdist_wheel, but it seems some people have issues with it; it's here in any case: https://huggingface.co/Panchovix/flash-attn-2-windows-test-wheel. Probably a missing step for now.
@tridao based on some tests, it seems you need at least CUDA 12.x and a matching torch version to build flash-attn 2 on Windows, or even to use the wheel; CUDA 11.8 fails to build. Exllamav2 needs to be built with torch+cu121 as well.
We have to be aware that ooba's webui comes by default with torch+cu118, so on Windows with that CUDA version it won't compile.
I see, thanks for the confirmation. I guess we rely on Cutlass and Cutlass requires CUDA 12.x to build on Windows.
Just built on CUDA 12.1 and tested with exllama_v2 on oobabooga's webui. I can confirm what @Panchovix said above: CUDA 12.x is required for Cutlass (12.1 if you want pytorch v2.1).
https://github.com/bdashore3/flash-attention/releases/tag/2.3.2
Another note, it may be a good idea to build wheels for cu121 as well, since github actions currently doesn't build for that version.
Right now github actions only builds for Linux. We intentionally don't build with CUDA 12.1 (due to some segfault with nvcc), but when installing on CUDA 12.1, setup.py will download the wheel for 12.2 and use that (they're compatible).
If you (or anyone) have experience with setting up github actions for Windows I'd love to get help there.
You are truly a godsend!
Works like a charm. I used:
- CUDA 12.2
- PyTorch 2.2.0.dev20231011+cu121 (installed with the command pip3 install --pre torch torchvision torchaudio --index-url https://download.pytorch.org/whl/nightly/cu121). Be sure to install this CUDA version and not the CPU version.
I have a CPU with 6 cores, so I set the environment variable MAX_JOBS to 4 (I previously set it to 6 but got an out-of-memory error); remember to restart your computer after you set it. It took about 3 hours to compile everything with 16GB of RAM.
If you get a "ninja: build stopped: subcommand failed" error, do this:
git clean -xdf
python setup.py clean
git submodule sync
git submodule deinit -f .
git submodule update --init --recursive
python setup.py install
GOOD🎶
RTX 4090 24GB, AMD 7950X, 64GB RAM
Python 3.8 and Python 3.10 both work
python3.10
https://www.python.org/downloads/release/python-3100/
win11
python -m venv venv
cd venv/Scripts
activate
-----------------------
git clone https://github.com/Dao-AILab/flash-attention
cd flash-attention
pip install packaging
pip install wheel
set MAX_JOBS=4
python setup.py install
Hey, I finally got the wheels to build (on Windows), but oobabooga's webui still doesn't detect it... It still gives me the message to install flash-attention... Anyone got a solution?
@Nicoolodion2 Use my PR until ooba merges it. FA2 on Windows requires Cuda 12.1 while ooba is still stuck on 11.8.
I'm trying to use flash attention in modelscope-agent, which needs layer_norm and rotary. Now flash attention and rotary have been built from @bdashore3's branch, while layer_norm errors out.
I used py3.10, VS2019, CUDA 12.1.
You don't have to use layer_norm.
However, I made it work.
The trouble is in ln_bwd_kernels.cuh, line 54.
For some unknown reason, BOOL_SWITCH does not work when turning bool has_colscale into constexpr bool HasColscaleConst, which causes error C2975. I just wrote it as:
if (HasColscaleConst) {
    using Kernel_traits_f = layer_norm::Kernel_traits_finalize<HIDDEN_SIZE,
                                                               weight_t,
                                                               input_t,
                                                               residual_t,
                                                               output_t,
                                                               compute_t,
                                                               index_t,
                                                               true,
                                                               32 * 32,  // THREADS_PER_CTA
                                                               BYTES_PER_LDG_FINAL>;
    auto kernel_f = &layer_norm::ln_bwd_finalize_kernel<Kernel_traits_f, HasColscaleConst, IsEvenColsConst>;
    kernel_f<<<Kernel_traits_f::CTAS, Kernel_traits_f::THREADS_PER_CTA, 0, stream>>>(launch_params.params);
} else {
    using Kernel_traits_f = layer_norm::Kernel_traits_finalize<HIDDEN_SIZE,
                                                               weight_t,
                                                               input_t,
                                                               residual_t,
                                                               output_t,
                                                               compute_t,
                                                               index_t,
                                                               false,
                                                               32 * 32,  // THREADS_PER_CTA
                                                               BYTES_PER_LDG_FINAL>;
    auto kernel_f = &layer_norm::ln_bwd_finalize_kernel<Kernel_traits_f, HasColscaleConst, IsEvenColsConst>;
    kernel_f<<<Kernel_traits_f::CTAS, Kernel_traits_f::THREADS_PER_CTA, 0, stream>>>(launch_params.params);
}
That's a stupid way, but it works, and it's compiling now.
Does this mean I can use FA2 on Windows if I build it from source?
Any compiled wheel for Windows 11 with:
- Python 3.11
- CUDA 12.2
- Torch 2.1.2?
note: This error originates from a subprocess, and is likely not a problem with pip.
ERROR: Failed building wheel for flash_attn
Running setup.py clean for flash_attn
Failed to build flash_attn
ERROR: Could not build wheels for flash_attn, which is required to install pyproject.toml-based projects
Confirmed the method above (venv + set MAX_JOBS=4 + python setup.py install) compiles on Windows 11 and works!
I have the following installed:
Python 3.11.9, PyTorch 2.3, CUDA 12.3, Visual Studio 2022
System specs:
AMD 7950x, 4090
I am trying to install Flash Attention 2 on Windows 11, with Python 3.12.3, and here is my setup:
RTX 3050 Laptop
16 GB RAM
Core i7 12650H
I have set up MSVC Build Tools 2022, alongside MS VS Community 2022. Once I cloned the Flash Attention git repo, I ran python setup.py install and it gives the error below:
running build_ext
D:\Github\Deep-Learning-Basics\LLM Testing\MultiModalAI\Flash-env\Lib\site-packages\torch\utils\cpp_extension.py:384: UserWarning: Error checking compiler version for cl: [WinError 2] The system cannot find the file specified
warnings.warn(f'Error checking compiler version for {compiler}: {error}')
building 'flash_attn_2_cuda' extension
creating D:\Github\Deep-Learning-Basics\LLM Testing\MultiModalAI\flash-attention\build\temp.win-amd64-cpython-312
creating D:\Github\Deep-Learning-Basics\LLM Testing\MultiModalAI\flash-attention\build\temp.win-amd64-cpython-312\Release
creating D:\Github\Deep-Learning-Basics\LLM Testing\MultiModalAI\flash-attention\build\temp.win-amd64-cpython-312\Release\csrc
creating D:\Github\Deep-Learning-Basics\LLM Testing\MultiModalAI\flash-attention\build\temp.win-amd64-cpython-312\Release\csrc\flash_attn
creating D:\Github\Deep-Learning-Basics\LLM Testing\MultiModalAI\flash-attention\build\temp.win-amd64-cpython-312\Release\csrc\flash_attn\src
Emitting ninja build file D:\Github\Deep-Learning-Basics\LLM Testing\MultiModalAI\flash-attention\build\temp.win-amd64-cpython-312\Release\build.ninja...
Compiling objects...
Using envvar MAX_JOBS (1) as the number of workers...
[1/49] cl /showIncludes /nologo /O2 /W3 /GL /DNDEBUG /MD /MD /wd4819 /wd4251 /wd4244 /wd4267 /wd4275 /wd4018 /wd4190 /wd4624 /wd4067 /wd4068 /EHsc "-ID:\Github\Deep-Learning-Basics\LLM Testing\MultiModalAI\flash-attention\csrc\flash_attn" "-ID:\Github\Deep-Learning-Basics\LLM Testing\MultiModalAI\flash-attention\csrc\flash_attn\src" "-ID:\Github\Deep-Learning-Basics\LLM Testing\MultiModalAI\flash-attention\csrc\cutlass\include" "-ID:\Github\Deep-Learning-Basics\LLM Testing\MultiModalAI\Flash-env\Lib\site-packages\torch\include" "-ID:\Github\Deep-Learning-Basics\LLM Testing\MultiModalAI\Flash-env\Lib\site-packages\torch\include\torch\csrc\api\include" "-ID:\Github\Deep-Learning-Basics\LLM Testing\MultiModalAI\Flash-env\Lib\site-packages\torch\include\TH" "-ID:\Github\Deep-Learning-Basics\LLM Testing\MultiModalAI\Flash-env\Lib\site-packages\torch\include\THC" "-IC:\Program Files\NVIDIA GPU Computing Toolkit\CUDA\v12.1\include" "-ID:\Github\Deep-Learning-Basics\LLM Testing\MultiModalAI\Flash-env\include" -IC:\Python312\include -IC:\Python312\Include "-IC:\Program Files (x86)\Microsoft Visual Studio\2022\BuildTools\VC\Tools\MSVC\14.40.33807\include" "-IC:\Program Files (x86)\Microsoft Visual Studio\2022\BuildTools\VC\Auxiliary\VS\include" -c "D:\Github\Deep-Learning-Basics\LLM Testing\MultiModalAI\flash-attention\csrc\flash_attn\flash_api.cpp" /Fo"D:\Github\Deep-Learning-Basics\LLM Testing\MultiModalAI\flash-attention\build\temp.win-amd64-cpython-312\Release\csrc/flash_attn/flash_api.obj" -O3 -std=c++17 -DTORCH_API_INCLUDE_EXTENSION_H -DTORCH_EXTENSION_NAME=flash_attn_2_cuda -D_GLIBCXX_USE_CXX11_ABI=0 /std:c++17
FAILED: D:/Github/Deep-Learning-Basics/LLM Testing/MultiModalAI/flash-attention/build/temp.win-amd64-cpython-312/Release/csrc/flash_attn/flash_api.obj
cl /showIncludes /nologo /O2 /W3 /GL /DNDEBUG /MD /MD /wd4819 /wd4251 /wd4244 /wd4267 /wd4275 /wd4018 /wd4190 /wd4624 /wd4067 /wd4068 /EHsc "-ID:\Github\Deep-Learning-Basics\LLM Testing\MultiModalAI\flash-attention\csrc\flash_attn" "-ID:\Github\Deep-Learning-Basics\LLM Testing\MultiModalAI\flash-attention\csrc\flash_attn\src" "-ID:\Github\Deep-Learning-Basics\LLM Testing\MultiModalAI\flash-attention\csrc\cutlass\include" "-ID:\Github\Deep-Learning-Basics\LLM Testing\MultiModalAI\Flash-env\Lib\site-packages\torch\include" "-ID:\Github\Deep-Learning-Basics\LLM Testing\MultiModalAI\Flash-env\Lib\site-packages\torch\include\torch\csrc\api\include" "-ID:\Github\Deep-Learning-Basics\LLM Testing\MultiModalAI\Flash-env\Lib\site-packages\torch\include\TH" "-ID:\Github\Deep-Learning-Basics\LLM Testing\MultiModalAI\Flash-env\Lib\site-packages\torch\include\THC" "-IC:\Program Files\NVIDIA GPU Computing Toolkit\CUDA\v12.1\include" "-ID:\Github\Deep-Learning-Basics\LLM Testing\MultiModalAI\Flash-env\include" -IC:\Python312\include -IC:\Python312\Include "-IC:\Program Files (x86)\Microsoft Visual Studio\2022\BuildTools\VC\Tools\MSVC\14.40.33807\include" "-IC:\Program Files (x86)\Microsoft Visual Studio\2022\BuildTools\VC\Auxiliary\VS\include" -c "D:\Github\Deep-Learning-Basics\LLM Testing\MultiModalAI\flash-attention\csrc\flash_attn\flash_api.cpp" /Fo"D:\Github\Deep-Learning-Basics\LLM Testing\MultiModalAI\flash-attention\build\temp.win-amd64-cpython-312\Release\csrc/flash_attn/flash_api.obj" -O3 -std=c++17 -DTORCH_API_INCLUDE_EXTENSION_H -DTORCH_EXTENSION_NAME=flash_attn_2_cuda -D_GLIBCXX_USE_CXX11_ABI=0 /std:c++17
cl : Command line warning D9002 : ignoring unknown option '-O3'
cl : Command line warning D9002 : ignoring unknown option '-std=c++17'
C:\Program Files (x86)\Microsoft Visual Studio\2022\BuildTools\VC\Tools\MSVC\14.40.33807\include\cstddef(11): fatal error C1083: Cannot open include file: 'stddef.h': No such file or directory
ninja: build stopped: subcommand failed.
Traceback (most recent call last):
File "D:\Github\Deep-Learning-Basics\LLM Testing\MultiModalAI\Flash-env\Lib\site-packages\torch\utils\cpp_extension.py", line 2107, in _run_ninja_build
subprocess.run(
File "C:\Python312\Lib\subprocess.py", line 571, in run
raise CalledProcessError(retcode, process.args,
subprocess.CalledProcessError: Command '['ninja', '-v', '-j', '1']' returned non-zero exit status 1.
The above exception was the direct cause of the following exception:
Traceback (most recent call last):
File "D:\Github\Deep-Learning-Basics\LLM Testing\MultiModalAI\flash-attention\setup.py", line 311, in <module>
setup(
File "D:\Github\Deep-Learning-Basics\LLM Testing\MultiModalAI\Flash-env\Lib\site-packages\setuptools\__init__.py", line 103, in setup
return distutils.core.setup(**attrs)
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
File "D:\Github\Deep-Learning-Basics\LLM Testing\MultiModalAI\Flash-env\Lib\site-packages\setuptools\_distutils\core.py", line 184, in setup
return run_commands(dist)
^^^^^^^^^^^^^^^^^^
File "D:\Github\Deep-Learning-Basics\LLM Testing\MultiModalAI\Flash-env\Lib\site-packages\setuptools\_distutils\core.py", line 200, in run_commands
dist.run_commands()
File "D:\Github\Deep-Learning-Basics\LLM Testing\MultiModalAI\Flash-env\Lib\site-packages\setuptools\_distutils\dist.py", line 969, in run_commands
self.run_command(cmd)
File "D:\Github\Deep-Learning-Basics\LLM Testing\MultiModalAI\Flash-env\Lib\site-packages\setuptools\dist.py", line 968, in run_command
super().run_command(command)
File "D:\Github\Deep-Learning-Basics\LLM Testing\MultiModalAI\Flash-env\Lib\site-packages\setuptools\_distutils\dist.py", line 988, in run_command
cmd_obj.run()
File "D:\Github\Deep-Learning-Basics\LLM Testing\MultiModalAI\Flash-env\Lib\site-packages\setuptools\command\install.py", line 87, in run
self.do_egg_install()
File "D:\Github\Deep-Learning-Basics\LLM Testing\MultiModalAI\Flash-env\Lib\site-packages\setuptools\command\install.py", line 139, in do_egg_install
self.run_command('bdist_egg')
File "D:\Github\Deep-Learning-Basics\LLM Testing\MultiModalAI\Flash-env\Lib\site-packages\setuptools\_distutils\cmd.py", line 316, in run_command
self.distribution.run_command(command)
File "D:\Github\Deep-Learning-Basics\LLM Testing\MultiModalAI\Flash-env\Lib\site-packages\setuptools\dist.py", line 968, in run_command
super().run_command(command)
File "D:\Github\Deep-Learning-Basics\LLM Testing\MultiModalAI\Flash-env\Lib\site-packages\setuptools\_distutils\dist.py", line 988, in run_command
cmd_obj.run()
File "D:\Github\Deep-Learning-Basics\LLM Testing\MultiModalAI\Flash-env\Lib\site-packages\setuptools\command\bdist_egg.py", line 167, in run
cmd = self.call_command('install_lib', warn_dir=0)
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
File "D:\Github\Deep-Learning-Basics\LLM Testing\MultiModalAI\Flash-env\Lib\site-packages\setuptools\command\bdist_egg.py", line 153, in call_command
self.run_command(cmdname)
File "D:\Github\Deep-Learning-Basics\LLM Testing\MultiModalAI\Flash-env\Lib\site-packages\setuptools\_distutils\cmd.py", line 316, in run_command
self.distribution.run_command(command)
File "D:\Github\Deep-Learning-Basics\LLM Testing\MultiModalAI\Flash-env\Lib\site-packages\setuptools\dist.py", line 968, in run_command
super().run_command(command)
File "D:\Github\Deep-Learning-Basics\LLM Testing\MultiModalAI\Flash-env\Lib\site-packages\setuptools\_distutils\dist.py", line 988, in run_command
cmd_obj.run()
File "D:\Github\Deep-Learning-Basics\LLM Testing\MultiModalAI\Flash-env\Lib\site-packages\setuptools\command\install_lib.py", line 11, in run
self.build()
File "D:\Github\Deep-Learning-Basics\LLM Testing\MultiModalAI\Flash-env\Lib\site-packages\setuptools\_distutils\command\install_lib.py", line 110, in build
self.run_command('build_ext')
File "D:\Github\Deep-Learning-Basics\LLM Testing\MultiModalAI\Flash-env\Lib\site-packages\setuptools\_distutils\cmd.py", line 316, in run_command
self.distribution.run_command(command)
File "D:\Github\Deep-Learning-Basics\LLM Testing\MultiModalAI\Flash-env\Lib\site-packages\setuptools\dist.py", line 968, in run_command
super().run_command(command)
File "D:\Github\Deep-Learning-Basics\LLM Testing\MultiModalAI\Flash-env\Lib\site-packages\setuptools\_distutils\dist.py", line 988, in run_command
cmd_obj.run()
File "D:\Github\Deep-Learning-Basics\LLM Testing\MultiModalAI\Flash-env\Lib\site-packages\setuptools\command\build_ext.py", line 91, in run
_build_ext.run(self)
File "D:\Github\Deep-Learning-Basics\LLM Testing\MultiModalAI\Flash-env\Lib\site-packages\setuptools\_distutils\command\build_ext.py", line 359, in run
self.build_extensions()
File "D:\Github\Deep-Learning-Basics\LLM Testing\MultiModalAI\Flash-env\Lib\site-packages\torch\utils\cpp_extension.py", line 870, in build_extensions
build_ext.build_extensions(self)
File "D:\Github\Deep-Learning-Basics\LLM Testing\MultiModalAI\Flash-env\Lib\site-packages\setuptools\_distutils\command\build_ext.py", line 479, in build_extensions
self._build_extensions_serial()
File "D:\Github\Deep-Learning-Basics\LLM Testing\MultiModalAI\Flash-env\Lib\site-packages\setuptools\_distutils\command\build_ext.py", line 505, in _build_extensions_serial
self.build_extension(ext)
File "D:\Github\Deep-Learning-Basics\LLM Testing\MultiModalAI\Flash-env\Lib\site-packages\setuptools\command\build_ext.py", line 252, in build_extension
_build_ext.build_extension(self, ext)
File "D:\Github\Deep-Learning-Basics\LLM Testing\MultiModalAI\Flash-env\Lib\site-packages\setuptools\_distutils\command\build_ext.py", line 560, in build_extension
objects = self.compiler.compile(
^^^^^^^^^^^^^^^^^^^^^^
File "D:\Github\Deep-Learning-Basics\LLM Testing\MultiModalAI\Flash-env\Lib\site-packages\torch\utils\cpp_extension.py", line 842, in win_wrap_ninja_compile
_write_ninja_file_and_compile_objects(
File "D:\Github\Deep-Learning-Basics\LLM Testing\MultiModalAI\Flash-env\Lib\site-packages\torch\utils\cpp_extension.py", line 1783, in _write_ninja_file_and_compile_objects
_run_ninja_build(
File "D:\Github\Deep-Learning-Basics\LLM Testing\MultiModalAI\Flash-env\Lib\site-packages\torch\utils\cpp_extension.py", line 2123, in _run_ninja_build
raise RuntimeError(message) from e
RuntimeError: Error compiling objects for extension
I'm pretty new to this, so I was hoping someone could point me in the right direction; I couldn't find any way to fix my issue elsewhere online. Any help would be appreciated. Thanks!
Seems like you are missing the CUDA Toolkit.
Download it from Nvidia's website: cuda
I recently recompiled mine with the following:
Windows 11
Python 3.12.4
pyTorch Nightly 2.4.0.dev20240606+cu124
Cuda 12.5.0_555.85
Nvidia v555.99 Drivers
If you want to use my batch file, it's hosted here: batch file
Oh sorry, I forgot to mention, I do have the CUDA Toolkit installed. Below is my nvcc -V:
nvcc --version
nvcc: NVIDIA (R) Cuda compiler driver
Copyright (c) 2005-2024 NVIDIA Corporation
Built on Tue_Feb_27_16:28:36_Pacific_Standard_Time_2024
Cuda compilation tools, release 12.4, V12.4.99
Build cuda_12.4.r12.4/compiler.33961263_0
And below is my nvidia-smi
nvidia-smi
Wed Jun 12 13:05:22 2024
+-----------------------------------------------------------------------------------------+
| NVIDIA-SMI 555.85 Driver Version: 555.85 CUDA Version: 12.5 |
|-----------------------------------------+------------------------+----------------------+
| GPU Name Driver-Model | Bus-Id Disp.A | Volatile Uncorr. ECC |
| Fan Temp Perf Pwr:Usage/Cap | Memory-Usage | GPU-Util Compute M. |
| | | MIG M. |
|=========================================+========================+======================|
| 0 NVIDIA GeForce RTX 3050 ... WDDM | 00000000:01:00.0 Off | N/A |
| N/A 66C P8 3W / 72W | 32MiB / 4096MiB | 0% Default |
| | | N/A |
+-----------------------------------------+------------------------+----------------------+
+-----------------------------------------------------------------------------------------+
| Processes: |
| GPU GI CI PID Type Process name GPU Memory |
| ID ID Usage |
|=========================================================================================|
| 0 N/A N/A 26140 C+G ...8bbwe\SnippingTool\SnippingTool.exe N/A |
+-----------------------------------------------------------------------------------------+
"C:\Program Files (x86)\Microsoft Visual Studio\2022\BuildTools\VC\Tools\MSVC\14.40.33807\include\cstddef(11): fatal error C1083: Cannot open include file: 'stddef.h': No such file or directory
ninja: build stopped: subcommand failed."
Have you tried installing Visual Studio 2022?
"C:\Program Files (x86)\Microsoft Visual Studio\2022\BuildTools\VC\Tools\MSVC\14.40.33807\include\cstddef(11): fatal error C1083: Cannot open include file: 'stddef.h': No such file or directory ninja: build stopped: subcommand failed."
Have you tried installing Visual Studio 2022?
Yes, I had installed Visual Studio 2022 along with the Build Tools 2022. But the issue seemed to be stemming from Visual Studio itself, since I managed to build Flash Attention 2 after modifying the Visual Studio Community 2022 installation and adding the Windows 11 SDK (available under Desktop Development with C++ >> Optional).
Thanks!
Just sharing: I was able to build this repo on Windows, without the need for the changes above, with these settings:
- Python 3.11
- VS 2022 C++ (v14.38-17.9)
- CUDA 12.2
Seems like CUDA 12.4 and 12.5 are not yet supported?
I was able to compile and build from the source repository on Windows 11 with:
CUDA 12.5
Python 3.12
I have a Visual Studio 2019 that came with Windows and I've never used it.
pip install has never failed for me.
Successfully installed on Windows 11 23H2 (OS Build 22631.3737) via pip install (took about an hour; system specs at the end):
pip install flash-attn --no-build-isolation
Python 3.11.5 & PIP 24.1.1
CUDA 12.4
PyTorch installed via:
pip3 install --pre torch torchvision torchaudio --index-url https://download.pytorch.org/whl/nightly/cu124
PIP dependencies:
pip install wheel==0.43.0
pip install ninja==1.11.1
pip install packaging==23.2
System Specs:
Intel Core i9 13900KF
Nvidia RTX 3090FE
32GB DDR5 5600MT/s (16x2)
took about an hour
Windows takes roughly an hour; Ubuntu (Linux) takes anywhere from a few seconds to a few minutes...