ziglang / zig

General-purpose programming language and toolchain for maintaining robust, optimal, and reusable software.

Home Page: https://ziglang.org

Pointers for NVPTX support

gwenzek opened this issue

The NVPTX backend is currently Tier 4 in Zig,
yet there still might be people interested in improving support for it (like myself).
Here I'm sharing a few pointers provided by @Snektron to get started, complemented by information I've gleaned from the LLVM and Nvidia documentation.

I've also started a branch with an ongoing implementation.
I'm learning both about the Zig compiler and LLVM at the same time, so beware! gwenzek#1

Address spaces

The PTX format defines a VM for the GPU and an ISA. The VM makes use of different memory spaces.
Address space support has been added to Stage 2, so you'll need to use stage 2.
Only a few generic address spaces have been defined so far, but more should be added for Nvidia GPUs:

zig/lib/std/builtin.zig

Lines 172 to 177 in 70ef9bc

pub const AddressSpace = enum {
    generic,
    gs,
    fs,
    ss,
};
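
For nvptx the enum would presumably grow variants matching the PTX memory spaces. A sketch (the new variant names are my assumption, chosen to line up with the LLVM numbering further down):

pub const AddressSpace = enum {
    generic,
    gs,
    fs,
    ss,
    // possible additions for nvptx (names assumed):
    global,
    constant,
    shared,
    param,
    local,
};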

And defaults here:

zig/src/target.zig

Lines 573 to 577 in 70ef9bc

) std.builtin.AddressSpace {
    _ = target;
    _ = context;
    return .generic;
}

(For nvptx you probably want the .constant address space for constants, .local for locals, and so on.)
Then convert Zig address spaces to LLVM address spaces here:

fn llvmAddressSpace(self: DeclGen, address_space: std.builtin.AddressSpace) c_uint {

The Zig bindings already have definitions for the Nvidia memory spaces:

pub const nvptx = struct {
    pub const generic: c_uint = 0;
    pub const global: c_uint = 1;
    pub const constant: c_uint = 2;
    pub const shared: c_uint = 3;
    pub const param: c_uint = 4;
    pub const local: c_uint = 5;
};
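
A minimal sketch of the mapping, assuming the enum additions above and that the binding shown here lives under llvm.address_space (how DeclGen exposes the target is also an assumption):

fn llvmAddressSpace(self: DeclGen, address_space: std.builtin.AddressSpace) c_uint {
    const target = self.module.getTarget(); // assumed accessor
    return switch (target.cpu.arch) {
        .nvptx, .nvptx64 => switch (address_space) {
            .generic => llvm.address_space.nvptx.generic,
            .global => llvm.address_space.nvptx.global,
            .constant => llvm.address_space.nvptx.constant,
            .shared => llvm.address_space.nvptx.shared,
            .param => llvm.address_space.nvptx.param,
            .local => llvm.address_space.nvptx.local,
            else => unreachable, // x86 segment address spaces don't apply here
        },
        else => 0, // generic/default address space on other targets
    };
}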

To better understand what Zig needs to do, you can look at a sample kernel written in LLVM IR:
https://llvm.org/docs/NVPTXUsage.html#the-kernel
I guess it would also be helpful to be able to generate LLVM IR from an arbitrary kernel;
apparently Clang can do so: https://www.llvm.org/docs/CompileCudaWithLLVM.html
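
For example, something like this should dump the device-side IR of a CUDA kernel (assuming a Clang build that can find a CUDA toolkit; the sm_50 architecture is an arbitrary choice):

clang++ -x cuda --cuda-device-only --cuda-gpu-arch=sm_50 -S -emit-llvm kernel.cu -o kernel.ll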

One would also need to implement support for the CUDA special variables gridDim, blockIdx, ..., which are stored in special registers:
https://llvm.org/docs/NVPTXUsage.html#id7
as well as the intrinsic for the block barrier __syncthreads.

I'm not sure how to do that yet.

Intrinsics can be added via the CreateBinaryIntrinsic, CreateUnaryIntrinsic, and CreateIntrinsic functions provided by LLVM. For an example, see how the Intrinsic::umax intrinsic is emitted in Zig:

zig/src/zig_llvm.cpp

Lines 472 to 475 in 969bcb6

LLVMValueRef ZigLLVMBuildUMax(LLVMBuilderRef B, LLVMValueRef LHS, LLVMValueRef RHS, const char *name) {
    CallInst *call_inst = unwrap(B)->CreateBinaryIntrinsic(Intrinsic::umax, unwrap(LHS), unwrap(RHS), nullptr, name);
    return wrap(call_inst);
}

According to the LLVM NVPTX page, __syncthreads is @llvm.nvvm.barrier0(). After some digging (opening build/include/llvm/IR/IntrinsicEnums.inc, where build is the LLVM build directory), we can see that this translates to Intrinsic::nvvm_barrier0. Note that there is no C API call for CreateIntrinsic, so a shim would need to be added to zig_llvm.cpp in order to use this, similar to the other Intrinsic usages.
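
Such a shim could look like the following (a sketch modeled on ZigLLVMBuildUMax above; the name ZigLLVMBuildNvvmBarrier0 is my invention, and it would also need matching declarations in zig_llvm.h and in the Zig bindings):

LLVMValueRef ZigLLVMBuildNvvmBarrier0(LLVMBuilderRef B) {
    // nvvm_barrier0 takes no arguments and has no overloaded types.
    CallInst *call_inst = unwrap(B)->CreateIntrinsic(Intrinsic::nvvm_barrier0, {}, {});
    return wrap(call_inst);
}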

@gwenzek FYI, on GitHub you can press y to turn the URL into a specific commit; that way your links with line numbers won't move around.

I've started a prototype on gwenzek#1
I've added some glue code so that I can run:

.../stage2/bin/zig build-obj kernel.zig -target nvptx64-cuda -O ReleaseSafe -femit-llvm-ir

This generates an invalid kernel.ptx which doesn't contain PTX code. I need to dig into the LLVM documentation to understand how to make it generate PTX.
I think there is an issue with how the LLVM TargetMachine is set up that prevents it from generating the output.
I hacked it to make it generate bitcode instead, but that's not PTX.

This will generate a good-looking .ptx.

The main thing that surprised me is that I had to modify codegen.zig even though I don't want to implement PTX generation
in Stage 2 and only want to use LLVM for that. Is there a way of having an architecture that is only supported by LLVM and not by Zig itself?

I've also created a new output format, ".ptx", for the linker, as done for the SPIR-V or asm architectures.

The path you're going down now with gwenzek#1 is implementing your own code generation for nvptx in stage 2. If you don't want to do that, you shouldn't need to create an entire custom link format, or enable nvptx in codegen.zig. In fact, can you try just running zig build-obj -target nvptx-cuda -fLLVM -fno-stage1 kernel.zig with an upstream build of stage 2? I think most of the LLVM target definitions are already present, and the main thing to figure out is whether we need to get LLVM to emit bitcode instead of ELF files.

@Snektron I just retried without the change in codegen.zig, and it works; I must have used the wrong command at the beginning.
The main thing I need to change to output PTX is to ask LLVM to output assembly instead of an object file.
For that I need to modify flushModule to ask for assembly.
Also, I don't want to link the ".ptx" file I obtain, so if I'm not mistaken I need to create a link/NvPtx.zig to disable that.
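
For the assembly-output part, the intent is roughly this (a sketch; the names are assumptions, guided by the asm=/bin=/ir=/bc= slots visible in the emit error further down):

// Hypothetical sketch inside flushModule: route the output path to the
// assembly slot of the LLVM emit call instead of the object slot, so the
// TargetMachine produces PTX text rather than an object file.
const want_asm = target.cpu.arch == .nvptx or target.cpu.arch == .nvptx64;
const emit_asm_path: ?[*:0]const u8 = if (want_asm) out_path else null;
const emit_bin_path: ?[*:0]const u8 = if (want_asm) null else out_path;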

I think a refactoring could make this cleaner, by having Compilation.zig check the architecture and enable/disable some options there. It could also be a good place to add the architecture-specific validation that you put at

zig/src/link/SpirV.zig

Lines 73 to 76 in 310f3df

switch (options.target.cpu.arch) {
    .spirv32, .spirv64 => {},
    else => return error.TODOArchNotSupported,
}
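
The nvptx analogue in a hypothetical link/NvPtx.zig would then presumably be:

switch (options.target.cpu.arch) {
    .nvptx, .nvptx64 => {},
    else => return error.TODOArchNotSupported,
}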

btw upstream stage2 yields:

> stage2/bin/zig build-obj cudaz/kernel.zig -target nvptx64-cuda -O ReleaseSafe 
error(codegen): LLVM failed to emit asm=(none) bin=kernel.o.o ir=(none) bc=(none): TargetMachine can't emit an object file
error: FailedToEmit

I've started looking into how to add support for the PTX intrinsics mentioned above. The default way seems to be to add @ptxSyncThreads, @ptxThreadId, etc. to Zig through the various stages of the compiler (like @wasmMemoryGrow is implemented).

Other approaches have been discussed in #7702 and #4466.
But while looking through the issue backlog I also found #2291, which says that the @"identifier" syntax can be abused to generate LLVM intrinsics.
I've tried it, and the following snippet compiles with the nvptx backend:

export fn hello(out: [*c]u8) void {
    out.* = 72 + @intCast(u8, threadIdX());
    __syncthreads();
}

extern fn @"llvm.nvvm.barrier0"() void;
inline fn __syncthreads() void {
    @"llvm.nvvm.barrier0"();
}

But for some reason, LLVM called from Zig crashes when doing something similar to read the special registers:

extern fn @"llvm.nvvm.read.ptx.sreg.tid.x"() i32;
inline fn threadIdX() i32 {
    return @"llvm.nvvm.read.ptx.sreg.tid.x"();
}

export fn hello(out: [*c]u8) void {
    out.* = 72 + @intCast(u8, threadIdX());
}

The error message:

~/github/zig/stage2/bin/zig build-obj cudaz/kernel.zig -target nvptx64-cuda -O ReleaseSafe --verbose-llvm-ir 
info(link): Creatinsg .ptx target kernel.ptx.ptx
info(link): Updating function: 3:3
info(link): Updating function: 41:46
info(link): Flush kernel.ptx.ptx
; ModuleID = 'kernel'
source_filename = "kernel"
target datalayout = "e-i64:64-i128:128-v16:16-v32:32-n16:32:64"

; Function Attrs: noredzone nounwind
define dso_local void @entry() #0 {
Entry:
  ret void
}

; Function Attrs: noredzone nounwind
define dso_local void @hello(i8* %0) #0 {
Entry:
  %1 = alloca i32, align 4
  %2 = alloca i8*, align 8
  store i8* %0, i8** %2, align 8
  %3 = call i32 @llvm.nvvm.read.ptx.sreg.tid.x()
  store i32 %3, i32* %1, align 4
  %4 = load i32, i32* %1, align 4
  %5 = trunc i32 %4 to i8
  %6 = add nuw i8 72, %5
  store i8 %6, i8* %0, align 1
  ret void
}

; Function Attrs: noredzone nounwind
declare i32 @llvm.nvvm.read.ptx.sreg.tid.x() #0

attributes #0 = { noredzone nounwind "frame-pointer"="none" }
info(codegen): emit LLVM object to kernel.ptx.ptx
LLVM ERROR: Cannot select: intrinsic %llvm.nvvm.read.ptx.sreg.tid.x
zsh: IOT instruction (core dumped)  ~/github/zig/stage2/bin/zig build-obj cudaz/kernel.zig -target  -O  
134

Interestingly, pasting the debug output into a .ll file and calling llc directly on it works, and generates the .ptx file below.
So there must be something special in how we call LLVM from Zig. I've tried changing some options, but without success.
I'm not familiar enough with LLVM to understand what could lead to a Cannot select error.

~/local/llvm13-release/bin/llc -mcpu=sm_20 --mtriple=nvptx64-nvidia-cuda kernel.ll -o kernel.ll.ptx

	// .globl	hello                   // -- Begin function hello
.visible .func hello(
	.param .b64 hello_param_0
)                                       // @hello
{
	.reg .b16 	%rs<3>;
	.reg .b32 	%r<2>;
	.reg .b64 	%rd<2>;

// %bb.0:                               // %Entry
	ld.param.u64 	%rd1, [hello_param_0];
	mov.u32 	%r1, %tid.x;
	cvt.u16.u32 	%rs1, %r1;
	add.s16 	%rs2, %rs1, 72;
	st.u8 	[%rd1], %rs2;
	ret;
                                        // -- End function
}

Anyway, given that the "assembly" corresponding to reading from a register is pretty simple, I also tried to
generate it directly using the asm keyword, which seems to have seen some recent improvements in Stage 2.

So to generate mov.u32 %r1, %tid.x; in PTX, it seems I need
call i32 asm sideeffect "mov.u32 \09$0, %tid.x;", "=r"() in LLVM, which can be generated by:

asm volatile ("mov.u32 \t$0, %tid.x;"
        : [ret] "=r" (-> i32)
    );
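
Putting it together, an intrinsic-free version of the kernel, assembled from the snippets above:

inline fn threadIdX() i32 {
    // Reads the PTX special register %tid.x via inline assembly
    // instead of the llvm.nvvm.read.ptx.sreg.tid.x intrinsic.
    return asm volatile ("mov.u32 \t$0, %tid.x;"
        : [ret] "=r" (-> i32)
    );
}

export fn hello(out: [*c]u8) void {
    out.* = 72 + @intCast(u8, threadIdX());
}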

So stay tuned, because I feel I'm making progress ^^
I still need to test the generated PTX, but I feel I'll be able to make progress without having to poke too much into stage 2 internals (which I appreciate).

Since my presentation at the last Zig Rush I've removed the worst hacks,
and made PR #10189.

I've opened #12878 to update the backend to 0.10. I feel we're in a good place now, so I'll close this issue.

Thanks to the progress of the self-hosted compiler, we can now generate debug information in the PTX files. The assembly syntax also now works as documented, so it's simpler to use the special PTX registers. And I can use the same Zig binary for building the device and the host code, so it's pretty exciting.

See https://github.com/gwenzek/cudaz/tree/e8895596009c689300fe7c7193fa2dbf7db07629 for user code using this Zig branch.

@gwenzek Is it possible to experiment with this on the latest Zig (since it looks like all the PRs have been merged)? I'm just curious, since Andrew added it to the 0.12.0 milestone - so is it not on main yet?

It's in the 0.10 milestone.

Support for the PTX backend is kind of a work in progress, but it should already work for some programs. You can check out this repository for some pointers on how to get started, though I think the Zig parts are a little bit out of date. In general it should mostly work.