Dao-AILab / flash-attention

Fast and memory-efficient exact attention

Might be a solution to get Flash Attention 2 built/compiled on Windows

Akatsuki030 opened this issue · comments

As a Windows user, I tried to compile this and found the problem was in two files, "flash_fwd_launch_template.h" and "flash_bwd_launch_template.h", under "./flash-attention/csrc/flash_attn/src". When a template argument references the variable "Headdim", the compiler raises error C2975. I think this is why we always get compile errors on Windows. Below is how I solved the problem:

First, in the file "flash_bwd_launch_template.h", there are many functions named like "run_mha_bwd_hdimXX", each with a constant declaration such as "Headdim = XX" and template instantiations like run_flash_bwd<Flash_bwd_kernel_traits<Headdim, 64, 128, 8, 4, 2, 2, false, false, T>, Is_dropout>(params, stream, configure). What I did was replace every "Headdim" in those template arguments with the literal value. For example, in the function run_mha_bwd_hdim128, which declares
"Headdim = 128", you replace Headdim with 128 in the templates, giving run_flash_bwd<Flash_bwd_kernel_traits<128, 64, 128, 8, 2, 4, 2, false, false, T>, Is_dropout>(params, stream, configure). I did the same thing for the "run_mha_fwd_hdimXX" functions and their templates.

Second, another error comes from "flash_fwd_launch_template.h", line 107: the same problem of referencing the constant "kBlockM" in the if-else statement below. I rewrote it as:

		if constexpr(Kernel_traits::kHeadDim % 128 == 0){
			dim3 grid_combine((params.b * params.h * params.seqlen_q + 4 - 1) / 4);
			BOOL_SWITCH(is_even_K, IsEvenKConst, [&] {
				if (params.num_splits <= 2) {
					flash_fwd_splitkv_combine_kernel<Kernel_traits, 4, 1, IsEvenKConst><<<grid_combine, Kernel_traits::kNThreads, 0, stream>>>(params);
				} else if (params.num_splits <= 4) {
					flash_fwd_splitkv_combine_kernel<Kernel_traits, 4, 2, IsEvenKConst><<<grid_combine, Kernel_traits::kNThreads, 0, stream>>>(params);
				} else if (params.num_splits <= 8) {
					flash_fwd_splitkv_combine_kernel<Kernel_traits, 4, 3, IsEvenKConst><<<grid_combine, Kernel_traits::kNThreads, 0, stream>>>(params);
				} else if (params.num_splits <= 16) {
					flash_fwd_splitkv_combine_kernel<Kernel_traits, 4, 4, IsEvenKConst><<<grid_combine, Kernel_traits::kNThreads, 0, stream>>>(params);
				} else if (params.num_splits <= 32) {
					flash_fwd_splitkv_combine_kernel<Kernel_traits, 4, 5, IsEvenKConst><<<grid_combine, Kernel_traits::kNThreads, 0, stream>>>(params);
				} else if (params.num_splits <= 64) {
					flash_fwd_splitkv_combine_kernel<Kernel_traits, 4, 6, IsEvenKConst><<<grid_combine, Kernel_traits::kNThreads, 0, stream>>>(params);
				} else if (params.num_splits <= 128) {
					flash_fwd_splitkv_combine_kernel<Kernel_traits, 4, 7, IsEvenKConst><<<grid_combine, Kernel_traits::kNThreads, 0, stream>>>(params);
				}
				C10_CUDA_KERNEL_LAUNCH_CHECK();
			});
		}else if constexpr(Kernel_traits::kHeadDim % 64 == 0){
			dim3 grid_combine((params.b * params.h * params.seqlen_q + 8 - 1) / 8);
			BOOL_SWITCH(is_even_K, IsEvenKConst, [&] {
				if (params.num_splits <= 2) {
					flash_fwd_splitkv_combine_kernel<Kernel_traits, 8, 1, IsEvenKConst><<<grid_combine, Kernel_traits::kNThreads, 0, stream>>>(params);
				} else if (params.num_splits <= 4) {
					flash_fwd_splitkv_combine_kernel<Kernel_traits, 8, 2, IsEvenKConst><<<grid_combine, Kernel_traits::kNThreads, 0, stream>>>(params);
				} else if (params.num_splits <= 8) {
					flash_fwd_splitkv_combine_kernel<Kernel_traits, 8, 3, IsEvenKConst><<<grid_combine, Kernel_traits::kNThreads, 0, stream>>>(params);
				} else if (params.num_splits <= 16) {
					flash_fwd_splitkv_combine_kernel<Kernel_traits, 8, 4, IsEvenKConst><<<grid_combine, Kernel_traits::kNThreads, 0, stream>>>(params);
				} else if (params.num_splits <= 32) {
					flash_fwd_splitkv_combine_kernel<Kernel_traits, 8, 5, IsEvenKConst><<<grid_combine, Kernel_traits::kNThreads, 0, stream>>>(params);
				} else if (params.num_splits <= 64) {
					flash_fwd_splitkv_combine_kernel<Kernel_traits, 8, 6, IsEvenKConst><<<grid_combine, Kernel_traits::kNThreads, 0, stream>>>(params);
				} else if (params.num_splits <= 128) {
					flash_fwd_splitkv_combine_kernel<Kernel_traits, 8, 7, IsEvenKConst><<<grid_combine, Kernel_traits::kNThreads, 0, stream>>>(params);
				}
				C10_CUDA_KERNEL_LAUNCH_CHECK();
			});
		}else{
			dim3 grid_combine((params.b * params.h * params.seqlen_q + 16 - 1) / 16);
			BOOL_SWITCH(is_even_K, IsEvenKConst, [&] {
				if (params.num_splits <= 2) {
					flash_fwd_splitkv_combine_kernel<Kernel_traits, 16, 1, IsEvenKConst><<<grid_combine, Kernel_traits::kNThreads, 0, stream>>>(params);
				} else if (params.num_splits <= 4) {
					flash_fwd_splitkv_combine_kernel<Kernel_traits, 16, 2, IsEvenKConst><<<grid_combine, Kernel_traits::kNThreads, 0, stream>>>(params);
				} else if (params.num_splits <= 8) {
					flash_fwd_splitkv_combine_kernel<Kernel_traits, 16, 3, IsEvenKConst><<<grid_combine, Kernel_traits::kNThreads, 0, stream>>>(params);
				} else if (params.num_splits <= 16) {
					flash_fwd_splitkv_combine_kernel<Kernel_traits, 16, 4, IsEvenKConst><<<grid_combine, Kernel_traits::kNThreads, 0, stream>>>(params);
				} else if (params.num_splits <= 32) {
					flash_fwd_splitkv_combine_kernel<Kernel_traits, 16, 5, IsEvenKConst><<<grid_combine, Kernel_traits::kNThreads, 0, stream>>>(params);
				} else if (params.num_splits <= 64) {
					flash_fwd_splitkv_combine_kernel<Kernel_traits, 16, 6, IsEvenKConst><<<grid_combine, Kernel_traits::kNThreads, 0, stream>>>(params);
				} else if (params.num_splits <= 128) {
					flash_fwd_splitkv_combine_kernel<Kernel_traits, 16, 7, IsEvenKConst><<<grid_combine, Kernel_traits::kNThreads, 0, stream>>>(params);
				}
				C10_CUDA_KERNEL_LAUNCH_CHECK();
			});
		}

Third, for the function "run_mha_fwd_splitkv_dispatch" in "flash_fwd_launch_template.h", line 194, you also have to replace "kBlockM" in the template with 64. After that you can try to compile.
These changes look crude, but they solved my problem: I successfully compiled flash_attn_2 on Windows. I still need some time to test it on other computers.
I put up the files I rewrote: link.
I think there might be a better solution, but for me it at least works.
Oh, I didn't use Ninja; I compiled it from source directly. Maybe someone can try compiling it with Ninja?
EDIT: I used

  • python 3.11
  • Pytorch 2.2+cu121 Nightly
  • CUDA 12.2
  • Anaconda
  • Windows 11 22H2

I did try replacing the .h files with yours in my venv, with

  • Python 3.10
  • Pytorch 2.2 Nightly
  • CUDA 12.1
  • Visual Studio 2022
  • Ninja

And the build failed fairly quickly. I have uninstalled Ninja, but it seems to get imported anyway? How did you manage to build without Ninja?

Also, I can't install your build since I'm on Python 3.10. Gonna see if I manage to compile it.

EDIT: Tried with CUDA 12.2, no luck either.

EDIT2: I managed to build it. I took your .h files and uncommented the variable declarations, and then it worked. It took ~30 minutes on a 7800X3D and 64GB RAM.

It seems that for some reason the Windows compiler tries to use/reference those variables even when they're not declared. But at the same time, if they are used a few lines below (as template arguments), it doesn't work.
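
For what it's worth, the behavior can be reproduced outside this repo: MSVC sometimes refuses to treat a plain constexpr local as a constant expression when it is used as a template argument inside a lambda, which is exactly the BOOL_SWITCH pattern in the launch templates. A minimal sketch of the pattern (not flash-attention code, names are illustrative):

    #include <cstdio>

    template <int kBlockM>
    void combine_kernel_stub() { std::printf("kBlockM = %d\n", kBlockM); }

    template <int kHeadDim>
    void dispatch() {
        // A plain "constexpr int" here is what MSVC has rejected with C2975 when the
        // value is later named inside the lambda as a template argument; GCC/Clang
        // accept it. Declaring it "constexpr static int" (or writing the literal)
        // avoids the error.
        constexpr int kBlockM = kHeadDim % 128 == 0 ? 4 : (kHeadDim % 64 == 0 ? 8 : 16);
        auto body = [&] { combine_kernel_stub<kBlockM>(); };
        body();
    }

    int main() {
        dispatch<128>();
        dispatch<96>();
        dispatch<64>();
    }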

EDIT3: I can confirm it works for exllamav2 + FA v2

Without FA

-- Measuring token speed...
 ** Position     1 + 127 tokens:   13.5848 t/s
 ** Position   128 + 128 tokens:   13.8594 t/s
 ** Position   256 + 128 tokens:   14.1394 t/s
 ** Position   384 + 128 tokens:   13.8138 t/s
 ** Position   512 + 128 tokens:   13.4949 t/s
 ** Position   640 + 128 tokens:   13.6474 t/s
 ** Position   768 + 128 tokens:   13.7073 t/s
 ** Position   896 + 128 tokens:   12.3254 t/s
 ** Position  1024 + 128 tokens:   13.8960 t/s
 ** Position  1152 + 128 tokens:   13.7677 t/s
 ** Position  1280 + 128 tokens:   12.9869 t/s
 ** Position  1408 + 128 tokens:   12.1336 t/s
 ** Position  1536 + 128 tokens:   13.0463 t/s
 ** Position  1664 + 128 tokens:   13.2463 t/s
 ** Position  1792 + 128 tokens:   12.6211 t/s
 ** Position  1920 + 128 tokens:   13.1429 t/s
 ** Position  2048 + 128 tokens:   12.5674 t/s
 ** Position  2176 + 128 tokens:   12.5847 t/s
 ** Position  2304 + 128 tokens:   13.3471 t/s
 ** Position  2432 + 128 tokens:   12.9135 t/s
 ** Position  2560 + 128 tokens:   12.2195 t/s
 ** Position  2688 + 128 tokens:   11.6120 t/s
 ** Position  2816 + 128 tokens:   11.2545 t/s
 ** Position  2944 + 128 tokens:   11.5304 t/s
 ** Position  3072 + 128 tokens:   11.7982 t/s
 ** Position  3200 + 128 tokens:   11.8041 t/s
 ** Position  3328 + 128 tokens:   12.8038 t/s
 ** Position  3456 + 128 tokens:   12.7324 t/s
 ** Position  3584 + 128 tokens:   11.7733 t/s
 ** Position  3712 + 128 tokens:   10.7961 t/s
 ** Position  3840 + 128 tokens:   11.1014 t/s
 ** Position  3968 + 128 tokens:   10.8474 t/s

With FA

-- Measuring token speed...
** Position     1 + 127 tokens:   22.6606 t/s
** Position   128 + 128 tokens:   22.5140 t/s
** Position   256 + 128 tokens:   22.6111 t/s
** Position   384 + 128 tokens:   22.6027 t/s
** Position   512 + 128 tokens:   22.3392 t/s
** Position   640 + 128 tokens:   22.0570 t/s
** Position   768 + 128 tokens:   22.3728 t/s
** Position   896 + 128 tokens:   22.4983 t/s
** Position  1024 + 128 tokens:   21.9384 t/s
** Position  1152 + 128 tokens:   22.3509 t/s
** Position  1280 + 128 tokens:   22.3189 t/s
** Position  1408 + 128 tokens:   22.2739 t/s
** Position  1536 + 128 tokens:   22.4145 t/s
** Position  1664 + 128 tokens:   21.9608 t/s
** Position  1792 + 128 tokens:   21.7645 t/s
** Position  1920 + 128 tokens:   22.1468 t/s
** Position  2048 + 128 tokens:   22.3400 t/s
** Position  2176 + 128 tokens:   21.9830 t/s
** Position  2304 + 128 tokens:   21.8387 t/s
** Position  2432 + 128 tokens:   20.2306 t/s
** Position  2560 + 128 tokens:   21.0056 t/s
** Position  2688 + 128 tokens:   22.2157 t/s
** Position  2816 + 128 tokens:   22.1912 t/s
** Position  2944 + 128 tokens:   22.1835 t/s
** Position  3072 + 128 tokens:   22.1393 t/s
** Position  3200 + 128 tokens:   22.1182 t/s
** Position  3328 + 128 tokens:   22.0821 t/s
** Position  3456 + 128 tokens:   22.0308 t/s
** Position  3584 + 128 tokens:   22.0060 t/s
** Position  3712 + 128 tokens:   21.9909 t/s
** Position  3840 + 128 tokens:   21.9816 t/s
** Position  3968 + 128 tokens:   21.9757 t/s

This is very helpful, thanks @Akatsuki030 and @Panchovix.
@Akatsuki030 is it possible to fix it by declaring these variables (Headdim, kBlockM) with constexpr static int instead of constexpr int? I've just pushed a commit that does it. Can you check if that compiles on Windows?
A while back someone (I think it was Daniel Haziza from the xformers team) told me that they need constexpr static int for Windows compilation.
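
Schematically, the fix is a one-keyword change to those local declarations. A compile-only sketch, with the head-dim thresholds taken from the if/else rewrite above and a stand-in traits struct (not the repo's exact code):

    struct Traits_hdim128_stub { static constexpr int kHeadDim = 128; };

    template <typename Kernel_traits>
    void combine_dispatch_sketch() {
        // before: constexpr int kBlockM = ...;   // MSVC: error C2975 when kBlockM is
        //                                        // used as a template argument inside
        //                                        // the BOOL_SWITCH lambda
        constexpr static int kBlockM =             // after: accepted by MSVC as well
            Kernel_traits::kHeadDim % 128 == 0 ? 4
            : (Kernel_traits::kHeadDim % 64 == 0 ? 8 : 16);
        auto body = [&] { static_assert(kBlockM > 0, "kBlockM must be positive"); };
        body();
    }

    int main() { combine_dispatch_sketch<Traits_hdim128_stub>(); }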

@tridao I just tested compilation with your latest push, and now it works.

I did use

  • Python 3.10
  • Pytorch 2.2+cu121 Nightly
  • CUDA 12.2
  • Visual Studio 2022
  • Ninja

Great, thanks for the confirmation @Panchovix. I'll cut a release now (v2.3.2). Ideally we'd set up prebuilt CUDA wheels for Windows at some point so folks can just download instead of having to compile locally, but that can wait till later.

Great! I built a wheel with python setup.py bdist_wheel; it seems some people have issues with it, but it's here in any case: https://huggingface.co/Panchovix/flash-attn-2-windows-test-wheel. Probably a missing step for now.

@tridao based on some tests, it seems you need at least CUDA 12.x and a matching torch build to compile flash-attn 2 on Windows, or even to use the wheel. CUDA 11.8 fails to build. Exllamav2 needs to be built with torch+cu121 as well.

Keep in mind that the ooba webui ships with torch+cu118 by default, so on Windows with that CUDA version it won't compile.

I see, thanks for the confirmation. I guess we rely on Cutlass and Cutlass requires CUDA 12.x to build on Windows.

Just built on CUDA 12.1 and tested with exllama_v2 on oobabooga's webui. I can confirm what @Panchovix said above: CUDA 12.x is required for Cutlass (12.1 if you want PyTorch 2.1).

https://github.com/bdashore3/flash-attention/releases/tag/2.3.2

Another note, it may be a good idea to build wheels for cu121 as well, since github actions currently doesn't build for that version.

Right now GitHub Actions only builds for Linux. We intentionally don't build with CUDA 12.1 (due to some segfault with nvcc), but when installing on CUDA 12.1, setup.py will download the wheel for 12.2 and use that (they're compatible).

If you (or anyone) have experience with setting up github actions for Windows I'd love to get help there.

You truly are a miracle worker!

Works like a charm. I used:

  • CUDA 12.2
  • PyTorch 2.2.0.dev20231011+cu121 (installed with the command pip3 install --pre torch torchvision torchaudio --index-url https://download.pytorch.org/whl/nightly/cu121). Be sure you install this CUDA version and not the CPU version.

I have a CPU with 6 cores, so I set the environment variable MAX_JOBS to 4 (I previously set it to 6 but got an out-of-memory error); remember to restart your computer after you set it. It took roughly 3 hours to compile everything with 16 GB of RAM.

If you get a "ninja: build stopped: subcommand failed" error, do this:
git clean -xdf
python setup.py clean
git submodule sync
git submodule deinit -f .
git submodule update --init --recursive
python setup.py install

GOOD🎶
RTX 4090 24GB, AMD 7950X, 64GB RAM
Python 3.8 and Python 3.10 both work

python3.10
https://www.python.org/downloads/release/python-3100/
win11

python -m venv venv

cd venv/Scripts
activate
-----------------------

git clone https://github.com/Dao-AILab/flash-attention
cd flash-attention

pip install packaging 
pip install wheel

set MAX_JOBS=4
python setup.py install
flashattention2

Hey, I finally got the wheels built (on Windows), but oobabooga's webui still doesn't detect it... It still gives me the message to install flash-attention... Anyone got a solution?

@Nicoolodion2 Use my PR until ooba merges it. FA2 on Windows requires CUDA 12.1 while ooba is still stuck on 11.8.

I'm trying to use flash-attention in modelscope-agent, which needs layer_norm and rotary. Flash-attention and rotary have now been built from @bdashore3's branch, but layer_norm fails with an error.

I used Python 3.10, VS 2019, CUDA 12.1

You don't have to use layer_norm.

However, I made it work.

The trouble is in ln_bwd_kernels.cuh, line 54.

For some unknown reason, BOOL_SWITCH does not work for turning the bool has_colscale into the constexpr bool HasColscaleConst, which causes error C2975. I just rewrote it as:

	if (HasColscaleConst) {
		using Kernel_traits_f = layer_norm::Kernel_traits_finalize<HIDDEN_SIZE,
		                                                           weight_t,
		                                                           input_t,
		                                                           residual_t,
		                                                           output_t,
		                                                           compute_t,
		                                                           index_t,
		                                                           true,
		                                                           32 * 32,  // THREADS_PER_CTA
		                                                           BYTES_PER_LDG_FINAL>;

		auto kernel_f = &layer_norm::ln_bwd_finalize_kernel<Kernel_traits_f, HasColscaleConst, IsEvenColsConst>;
		kernel_f<<<Kernel_traits_f::CTAS, Kernel_traits_f::THREADS_PER_CTA, 0, stream>>>(launch_params.params);
	} else {
		using Kernel_traits_f = layer_norm::Kernel_traits_finalize<HIDDEN_SIZE,
		                                                           weight_t,
		                                                           input_t,
		                                                           residual_t,
		                                                           output_t,
		                                                           compute_t,
		                                                           index_t,
		                                                           false,
		                                                           32 * 32,  // THREADS_PER_CTA
		                                                           BYTES_PER_LDG_FINAL>;

		auto kernel_f = &layer_norm::ln_bwd_finalize_kernel<Kernel_traits_f, HasColscaleConst, IsEvenColsConst>;
		kernel_f<<<Kernel_traits_f::CTAS, Kernel_traits_f::THREADS_PER_CTA, 0, stream>>>(launch_params.params);
	}

That's a crude way to do it, but it works, and it's compiling now.

Does that mean I can use FA2 on Windows if I build it from source?

Is there any compiled wheel for Windows 11,
Python 3.11
CUDA 12.2
Torch 2.1.2?

note: This error originates from a subprocess, and is likely not a problem with pip.
ERROR: Failed building wheel for flash_attn
Running setup.py clean for flash_attn
Failed to build flash_attn
ERROR: Could not build wheels for flash_attn, which is required to install pyproject.toml-based projects

Confirmed: the method above compiles on Windows 11 and works!

I have the following installed:
Python 3.11.9, PyTorch 2.3, CUDA 12.3, Visual Studio 2022

System specs:
AMD 7950x, 4090

I am trying to install Flash Attention 2 on Windows 11 with Python 3.12.3, and here is my setup:
RTX 3050 Laptop
16 GB RAM
Core i7 12650H.

So I have set up MSVC Build Tools 2022 alongside MS VS Community 2022. After cloning the Flash Attention git repo, I ran python setup.py install and it gives the error below:

running build_ext
D:\Github\Deep-Learning-Basics\LLM Testing\MultiModalAI\Flash-env\Lib\site-packages\torch\utils\cpp_extension.py:384: UserWarning: Error checking compiler version for cl: [WinError 2] The system cannot find the file specified
  warnings.warn(f'Error checking compiler version for {compiler}: {error}')
building 'flash_attn_2_cuda' extension
creating D:\Github\Deep-Learning-Basics\LLM Testing\MultiModalAI\flash-attention\build\temp.win-amd64-cpython-312
creating D:\Github\Deep-Learning-Basics\LLM Testing\MultiModalAI\flash-attention\build\temp.win-amd64-cpython-312\Release
creating D:\Github\Deep-Learning-Basics\LLM Testing\MultiModalAI\flash-attention\build\temp.win-amd64-cpython-312\Release\csrc
creating D:\Github\Deep-Learning-Basics\LLM Testing\MultiModalAI\flash-attention\build\temp.win-amd64-cpython-312\Release\csrc\flash_attn
creating D:\Github\Deep-Learning-Basics\LLM Testing\MultiModalAI\flash-attention\build\temp.win-amd64-cpython-312\Release\csrc\flash_attn\src      
Emitting ninja build file D:\Github\Deep-Learning-Basics\LLM Testing\MultiModalAI\flash-attention\build\temp.win-amd64-cpython-312\Release\build.ninja...
Compiling objects...
Using envvar MAX_JOBS (1) as the number of workers...
[1/49] cl /showIncludes /nologo /O2 /W3 /GL /DNDEBUG /MD /MD /wd4819 /wd4251 /wd4244 /wd4267 /wd4275 /wd4018 /wd4190 /wd4624 /wd4067 /wd4068 /EHsc "-ID:\Github\Deep-Learning-Basics\LLM Testing\MultiModalAI\flash-attention\csrc\flash_attn" "-ID:\Github\Deep-Learning-Basics\LLM Testing\MultiModalAI\flash-attention\csrc\flash_attn\src" "-ID:\Github\Deep-Learning-Basics\LLM Testing\MultiModalAI\flash-attention\csrc\cutlass\include" "-ID:\Github\Deep-Learning-Basics\LLM Testing\MultiModalAI\Flash-env\Lib\site-packages\torch\include" "-ID:\Github\Deep-Learning-Basics\LLM Testing\MultiModalAI\Flash-env\Lib\site-packages\torch\include\torch\csrc\api\include" "-ID:\Github\Deep-Learning-Basics\LLM Testing\MultiModalAI\Flash-env\Lib\site-packages\torch\include\TH" "-ID:\Github\Deep-Learning-Basics\LLM Testing\MultiModalAI\Flash-env\Lib\site-packages\torch\include\THC" "-IC:\Program Files\NVIDIA GPU Computing Toolkit\CUDA\v12.1\include" "-ID:\Github\Deep-Learning-Basics\LLM Testing\MultiModalAI\Flash-env\include" -IC:\Python312\include -IC:\Python312\Include "-IC:\Program Files (x86)\Microsoft Visual Studio\2022\BuildTools\VC\Tools\MSVC\14.40.33807\include" "-IC:\Program Files (x86)\Microsoft Visual Studio\2022\BuildTools\VC\Auxiliary\VS\include" -c "D:\Github\Deep-Learning-Basics\LLM Testing\MultiModalAI\flash-attention\csrc\flash_attn\flash_api.cpp" /Fo"D:\Github\Deep-Learning-Basics\LLM Testing\MultiModalAI\flash-attention\build\temp.win-amd64-cpython-312\Release\csrc/flash_attn/flash_api.obj" -O3 -std=c++17 -DTORCH_API_INCLUDE_EXTENSION_H -DTORCH_EXTENSION_NAME=flash_attn_2_cuda -D_GLIBCXX_USE_CXX11_ABI=0 /std:c++17
FAILED: D:/Github/Deep-Learning-Basics/LLM Testing/MultiModalAI/flash-attention/build/temp.win-amd64-cpython-312/Release/csrc/flash_attn/flash_api.obj
cl /showIncludes /nologo /O2 /W3 /GL /DNDEBUG /MD /MD /wd4819 /wd4251 /wd4244 /wd4267 /wd4275 /wd4018 /wd4190 /wd4624 /wd4067 /wd4068 /EHsc "-ID:\Github\Deep-Learning-Basics\LLM Testing\MultiModalAI\flash-attention\csrc\flash_attn" "-ID:\Github\Deep-Learning-Basics\LLM Testing\MultiModalAI\flash-attention\csrc\flash_attn\src" "-ID:\Github\Deep-Learning-Basics\LLM Testing\MultiModalAI\flash-attention\csrc\cutlass\include" "-ID:\Github\Deep-Learning-Basics\LLM Testing\MultiModalAI\Flash-env\Lib\site-packages\torch\include" "-ID:\Github\Deep-Learning-Basics\LLM Testing\MultiModalAI\Flash-env\Lib\site-packages\torch\include\torch\csrc\api\include" "-ID:\Github\Deep-Learning-Basics\LLM Testing\MultiModalAI\Flash-env\Lib\site-packages\torch\include\TH" "-ID:\Github\Deep-Learning-Basics\LLM Testing\MultiModalAI\Flash-env\Lib\site-packages\torch\include\THC" "-IC:\Program Files\NVIDIA GPU Computing Toolkit\CUDA\v12.1\include" "-ID:\Github\Deep-Learning-Basics\LLM Testing\MultiModalAI\Flash-env\include" -IC:\Python312\include -IC:\Python312\Include "-IC:\Program Files (x86)\Microsoft Visual Studio\2022\BuildTools\VC\Tools\MSVC\14.40.33807\include" "-IC:\Program Files (x86)\Microsoft Visual Studio\2022\BuildTools\VC\Auxiliary\VS\include" -c "D:\Github\Deep-Learning-Basics\LLM Testing\MultiModalAI\flash-attention\csrc\flash_attn\flash_api.cpp" /Fo"D:\Github\Deep-Learning-Basics\LLM Testing\MultiModalAI\flash-attention\build\temp.win-amd64-cpython-312\Release\csrc/flash_attn/flash_api.obj" -O3 -std=c++17 -DTORCH_API_INCLUDE_EXTENSION_H -DTORCH_EXTENSION_NAME=flash_attn_2_cuda -D_GLIBCXX_USE_CXX11_ABI=0 /std:c++17
cl : Command line warning D9002 : ignoring unknown option '-O3'
cl : Command line warning D9002 : ignoring unknown option '-std=c++17'
C:\Program Files (x86)\Microsoft Visual Studio\2022\BuildTools\VC\Tools\MSVC\14.40.33807\include\cstddef(11): fatal error C1083: Cannot open include file: 'stddef.h': No such file or directory
ninja: build stopped: subcommand failed.
Traceback (most recent call last):
  File "D:\Github\Deep-Learning-Basics\LLM Testing\MultiModalAI\Flash-env\Lib\site-packages\torch\utils\cpp_extension.py", line 2107, in _run_ninja_build
    subprocess.run(
  File "C:\Python312\Lib\subprocess.py", line 571, in run
    raise CalledProcessError(retcode, process.args,
subprocess.CalledProcessError: Command '['ninja', '-v', '-j', '1']' returned non-zero exit status 1.

The above exception was the direct cause of the following exception:

Traceback (most recent call last):
  File "D:\Github\Deep-Learning-Basics\LLM Testing\MultiModalAI\flash-attention\setup.py", line 311, in <module>
    setup(
  File "D:\Github\Deep-Learning-Basics\LLM Testing\MultiModalAI\Flash-env\Lib\site-packages\setuptools\__init__.py", line 103, in setup
    return distutils.core.setup(**attrs)
           ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "D:\Github\Deep-Learning-Basics\LLM Testing\MultiModalAI\Flash-env\Lib\site-packages\setuptools\_distutils\core.py", line 184, in setup     
    return run_commands(dist)
           ^^^^^^^^^^^^^^^^^^
  File "D:\Github\Deep-Learning-Basics\LLM Testing\MultiModalAI\Flash-env\Lib\site-packages\setuptools\_distutils\core.py", line 200, in run_commands
    dist.run_commands()
  File "D:\Github\Deep-Learning-Basics\LLM Testing\MultiModalAI\Flash-env\Lib\site-packages\setuptools\_distutils\dist.py", line 969, in run_commands
    self.run_command(cmd)
  File "D:\Github\Deep-Learning-Basics\LLM Testing\MultiModalAI\Flash-env\Lib\site-packages\setuptools\dist.py", line 968, in run_command
    super().run_command(command)
  File "D:\Github\Deep-Learning-Basics\LLM Testing\MultiModalAI\Flash-env\Lib\site-packages\setuptools\_distutils\dist.py", line 988, in run_command
    cmd_obj.run()
  File "D:\Github\Deep-Learning-Basics\LLM Testing\MultiModalAI\Flash-env\Lib\site-packages\setuptools\command\install.py", line 87, in run        
    self.do_egg_install()
  File "D:\Github\Deep-Learning-Basics\LLM Testing\MultiModalAI\Flash-env\Lib\site-packages\setuptools\command\install.py", line 139, in do_egg_install
    self.run_command('bdist_egg')
  File "D:\Github\Deep-Learning-Basics\LLM Testing\MultiModalAI\Flash-env\Lib\site-packages\setuptools\_distutils\cmd.py", line 316, in run_command
    self.distribution.run_command(command)
  File "D:\Github\Deep-Learning-Basics\LLM Testing\MultiModalAI\Flash-env\Lib\site-packages\setuptools\dist.py", line 968, in run_command
    super().run_command(command)
  File "D:\Github\Deep-Learning-Basics\LLM Testing\MultiModalAI\Flash-env\Lib\site-packages\setuptools\_distutils\dist.py", line 988, in run_command
    cmd_obj.run()
  File "D:\Github\Deep-Learning-Basics\LLM Testing\MultiModalAI\Flash-env\Lib\site-packages\setuptools\command\bdist_egg.py", line 167, in run     
    cmd = self.call_command('install_lib', warn_dir=0)
          ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "D:\Github\Deep-Learning-Basics\LLM Testing\MultiModalAI\Flash-env\Lib\site-packages\setuptools\command\bdist_egg.py", line 153, in call_command
    self.run_command(cmdname)
  File "D:\Github\Deep-Learning-Basics\LLM Testing\MultiModalAI\Flash-env\Lib\site-packages\setuptools\_distutils\cmd.py", line 316, in run_command
    self.distribution.run_command(command)
  File "D:\Github\Deep-Learning-Basics\LLM Testing\MultiModalAI\Flash-env\Lib\site-packages\setuptools\dist.py", line 968, in run_command
    super().run_command(command)
  File "D:\Github\Deep-Learning-Basics\LLM Testing\MultiModalAI\Flash-env\Lib\site-packages\setuptools\_distutils\dist.py", line 988, in run_command
    cmd_obj.run()
  File "D:\Github\Deep-Learning-Basics\LLM Testing\MultiModalAI\Flash-env\Lib\site-packages\setuptools\command\install_lib.py", line 11, in run    
    self.build()
  File "D:\Github\Deep-Learning-Basics\LLM Testing\MultiModalAI\Flash-env\Lib\site-packages\setuptools\_distutils\command\install_lib.py", line 110, in build
    self.run_command('build_ext')
  File "D:\Github\Deep-Learning-Basics\LLM Testing\MultiModalAI\Flash-env\Lib\site-packages\setuptools\_distutils\cmd.py", line 316, in run_command
    self.distribution.run_command(command)
  File "D:\Github\Deep-Learning-Basics\LLM Testing\MultiModalAI\Flash-env\Lib\site-packages\setuptools\dist.py", line 968, in run_command
    super().run_command(command)
  File "D:\Github\Deep-Learning-Basics\LLM Testing\MultiModalAI\Flash-env\Lib\site-packages\setuptools\_distutils\dist.py", line 988, in run_command
    cmd_obj.run()
  File "D:\Github\Deep-Learning-Basics\LLM Testing\MultiModalAI\Flash-env\Lib\site-packages\setuptools\command\build_ext.py", line 91, in run      
    _build_ext.run(self)
  File "D:\Github\Deep-Learning-Basics\LLM Testing\MultiModalAI\Flash-env\Lib\site-packages\setuptools\_distutils\command\build_ext.py", line 359, in run
    self.build_extensions()
  File "D:\Github\Deep-Learning-Basics\LLM Testing\MultiModalAI\Flash-env\Lib\site-packages\torch\utils\cpp_extension.py", line 870, in build_extensions
    build_ext.build_extensions(self)
  File "D:\Github\Deep-Learning-Basics\LLM Testing\MultiModalAI\Flash-env\Lib\site-packages\setuptools\_distutils\command\build_ext.py", line 479, in build_extensions
    self._build_extensions_serial()
  File "D:\Github\Deep-Learning-Basics\LLM Testing\MultiModalAI\Flash-env\Lib\site-packages\setuptools\_distutils\command\build_ext.py", line 505, in _build_extensions_serial
    self.build_extension(ext)
  File "D:\Github\Deep-Learning-Basics\LLM Testing\MultiModalAI\Flash-env\Lib\site-packages\setuptools\command\build_ext.py", line 252, in build_extension
    _build_ext.build_extension(self, ext)
  File "D:\Github\Deep-Learning-Basics\LLM Testing\MultiModalAI\Flash-env\Lib\site-packages\setuptools\_distutils\command\build_ext.py", line 560, in build_extension
    objects = self.compiler.compile(
              ^^^^^^^^^^^^^^^^^^^^^^
  File "D:\Github\Deep-Learning-Basics\LLM Testing\MultiModalAI\Flash-env\Lib\site-packages\torch\utils\cpp_extension.py", line 842, in win_wrap_ninja_compile
    _write_ninja_file_and_compile_objects(
  File "D:\Github\Deep-Learning-Basics\LLM Testing\MultiModalAI\Flash-env\Lib\site-packages\torch\utils\cpp_extension.py", line 1783, in _write_ninja_file_and_compile_objects
    _run_ninja_build(
  File "D:\Github\Deep-Learning-Basics\LLM Testing\MultiModalAI\Flash-env\Lib\site-packages\torch\utils\cpp_extension.py", line 2123, in _run_ninja_build
    raise RuntimeError(message) from e
RuntimeError: Error compiling objects for extension

I'm pretty new to this, so I was hoping someone could point me in the right direction. I couldn't find any way to fix my issue elsewhere online. Any help would be appreciated. Thanks!

Seems like you are missing the CUDA Toolkit.

Download it from Nvidia's website: cuda

I recently recompiled mine with the following:
Windows 11
Python 3.12.4
PyTorch Nightly 2.4.0.dev20240606+cu124
CUDA 12.5.0_555.85
Nvidia v555.99 Drivers

If you want to use my batch file, it's hosted here:
batch file

Oh sorry, I forgot to mention: I do have the CUDA Toolkit installed. Below is my nvcc --version

 nvcc --version
nvcc: NVIDIA (R) Cuda compiler driver
Copyright (c) 2005-2024 NVIDIA Corporation
Built on Tue_Feb_27_16:28:36_Pacific_Standard_Time_2024
Cuda compilation tools, release 12.4, V12.4.99
Build cuda_12.4.r12.4/compiler.33961263_0

And below is my nvidia-smi

nvidia-smi
Wed Jun 12 13:05:22 2024
+-----------------------------------------------------------------------------------------+
| NVIDIA-SMI 555.85                 Driver Version: 555.85         CUDA Version: 12.5     |
|-----------------------------------------+------------------------+----------------------+
| GPU  Name                  Driver-Model | Bus-Id          Disp.A | Volatile Uncorr. ECC |
| Fan  Temp   Perf          Pwr:Usage/Cap |           Memory-Usage | GPU-Util  Compute M. |
|                                         |                        |               MIG M. |
|=========================================+========================+======================|
|   0  NVIDIA GeForce RTX 3050 ...  WDDM  |   00000000:01:00.0 Off |                  N/A |
| N/A   66C    P8              3W /   72W |      32MiB /   4096MiB |      0%      Default |
|                                         |                        |                  N/A |
+-----------------------------------------+------------------------+----------------------+

+-----------------------------------------------------------------------------------------+
| Processes:                                                                              |
|  GPU   GI   CI        PID   Type   Process name                              GPU Memory |
|        ID   ID                                                               Usage      |
|=========================================================================================|
|    0   N/A  N/A     26140    C+G   ...8bbwe\SnippingTool\SnippingTool.exe      N/A      |
+-----------------------------------------------------------------------------------------+
"C:\Program Files (x86)\Microsoft Visual Studio\2022\BuildTools\VC\Tools\MSVC\14.40.33807\include\cstddef(11): fatal error C1083: Cannot open include file: 'stddef.h': No such file or directory
ninja: build stopped: subcommand failed."

Have you tried installing Visual Studio 2022?

"C:\Program Files (x86)\Microsoft Visual Studio\2022\BuildTools\VC\Tools\MSVC\14.40.33807\include\cstddef(11): fatal error C1083: Cannot open include file: 'stddef.h': No such file or directory
ninja: build stopped: subcommand failed."

Have you tried installing Visual Studio 2022?

Yes, I had installed Visual Studio 2022 along with Build Tools 2022. But the issue seemed to stem from Visual Studio itself, since I managed to build Flash Attention 2 after modifying the Visual Studio Community 2022 installation and adding the Windows 11 SDK (available under Desktop development with C++ >> Optional).

Thanks!

Just sharing: I was able to build this repo on Windows, without needing the changes above, with these settings:

  1. Python 3.11
  2. VS 2022 C++ (v14.38-17.9)
  3. CUDA 12.2

Seems like CUDA 12.4 and 12.5 are not yet supported?

I was able to compile and build from the source repository on Windows 11 with:

CUDA 12.5
Python 3.12

I have a Visual Studio 2019 installation that came with Windows, and I've never used it.

pip install has never worked for me.

Successfully installed on Windows 11 23H2 (OS Build 22631.3737) via pip install (took about an hour; system specs at the end):

pip install flash-attn --no-build-isolation

Python 3.11.5 & PIP 24.1.1
CUDA 12.4
PyTorch installed via:

pip3 install --pre torch torchvision torchaudio --index-url https://download.pytorch.org/whl/nightly/cu124

PIP dependencies:

pip install wheel==0.43.0
pip install ninja==1.11.1
pip install packaging==23.2

System Specs:

Intel Core i9 13900KF
Nvidia RTX 3090FE
32GB DDR5 5600MT/s (16x2)

took about an hour

On Windows it's roughly an hour; on Ubuntu (Linux) it takes anywhere from a few seconds to a few minutes.