Hands on: Try Arm SVE on Docker

Summary

This document is a hands-on tutorial for people who want to try Arm SVE on Docker. To get an environment for trying Arm SVE with QEMU, just run the following.

docker run -it kaityo256/xbyak_aarch64_handson

Then you will see something like this.

[user@2cd82e1ea4e3 ~]$

Now you are in an Arch Linux image with the necessary software pre-installed. You are logged in as the account user, but you can become root with su - using the password root. So if you need any package, install it with pacman.

In the following, you will try Arm SVE with intrinsic functions and Xbyak_aarch64.

Contribution

This document is an English translation of the Japanese version. The English in this document can be poor, so we appreciate pull requests for improvements.

Intrinsic Functions

You can use Arm SVE instructions from C and C++ through intrinsic functions, which are defined by the Arm C Language Extensions (ACLE) for SVE. The sample code for the intrinsic functions is in the directory ~/xbyak_aarch64_handson/sample/intrinsic.

1. SVE Length

The vector length of SVE is scalable and is not determined at compile time. So, let's first look at a sample that gets the vector length at runtime.

The sample code can be built as follows.

cd 01_sve_length/
make

Then you can run the executable using QEMU.

$ qemu-aarch64 ./a.out
SVE is available. The length is 512 bits

You can specify the vector length in the QEMU options.

$ qemu-aarch64 -cpu max,sve128=on ./a.out
SVE is available. The length is 128 bits

$ qemu-aarch64 -cpu max,sve256=on ./a.out
SVE is available. The length is 256 bits

Here is the source code (sve_length.cpp).

#include <cstdio>
#ifdef __ARM_FEATURE_SVE
#include <arm_sve.h>
#endif

int main() {
  int n = 0;
#ifdef __ARM_FEATURE_SVE
  n = svcntb() * 8;
#endif
  if (n) {
    printf("SVE is available. The length is %d bits\n", n);
  } else {
    printf("SVE is unavailable.\n");
  }
}

You can tell whether Arm SVE is available by checking whether __ARM_FEATURE_SVE is defined. If it is defined, you can use the intrinsic functions for SVE by including arm_sve.h.

The vector length can be obtained with svcntb(), which returns the vector length in bytes. The corresponding instruction is cntb. The name of an ACLE SVE function generally consists of the prefix sv followed by the name of the corresponding instruction in lower case.
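As a rough illustration of how the element-count instructions relate to the vector length, here is a plain C++ sketch. It is not SVE code, and lanes_per_vector is a hypothetical helper that is not part of ACLE. It computes the values that cntb, cnth, cntw, and cntd would report for a given vector length in bits.

```cpp
#include <cassert>

// Hypothetical helper, not part of ACLE: number of elements of width
// elem_bits that fit in a vector of vl_bits.
// Mirrors what cntb (8), cnth (16), cntw (32), and cntd (64) report.
int lanes_per_vector(int vl_bits, int elem_bits) {
  // SVE vector lengths are multiples of 128 bits, up to 2048 bits.
  assert(vl_bits >= 128 && vl_bits <= 2048 && vl_bits % 128 == 0);
  assert(elem_bits == 8 || elem_bits == 16 || elem_bits == 32 ||
         elem_bits == 64);
  return vl_bits / elem_bits;
}
```

For the 512-bit run shown above, lanes_per_vector(512, 8) is 64, and multiplying it by 8 gives the 512 bits printed by the sample.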

In order to enable SVE, you need to compile with the -march=armv8-a+sve option. Without the option, __ARM_FEATURE_SVE will not be defined.

$ aarch64-linux-gnu-g++ -static sve_length.cpp
$ qemu-aarch64 ./a.out
SVE is unavailable.

2. Predicate registers

Arm SVE adopts a predicate-centric approach. Most ACLE SVE functions take a predicate register, which controls whether the operation is performed on an element-by-element basis. The length of a predicate register depends on the vector length and is not determined at compile time. Here, we will visualize the predicate register.

The type corresponding to the predicate register is svbool_t.

The sample code can be built as follows.

cd 02_predicate
make

It is useful to prepare a function that takes a variable of type svbool_t and prints its bit representation.

void show_pr(svbool_t tp) {
  int n = svcntb();
  std::vector<int8_t> a(n);
  std::vector<int8_t> b(n);
  std::fill(a.begin(), a.end(), 1);
  std::fill(b.begin(), b.end(), 0);
  svint8_t va = svld1_s8(tp, a.data());
  svst1_s8(tp, b.data(), va);
  for (int i = 0; i < n; i++) {
    std::cout << (int)b[n - i - 1];
  }
  std::cout << std::endl;
}

To set all bits of the predicate register, use the svptrue family of functions. For example, to use a predicate register as a byte-by-byte mask, use svptrue_b8.

show_pr(svptrue_b8());

The output will look like this.

1111111111111111111111111111111111111111111111111111111111111111

The function svptrue_b8() is equivalent to svptrue_pat_b8 with the SV_ALL pattern, and the corresponding assembly is ptrue p0.b, ALL.

Similarly, the output results for svptrue_b16, svptrue_b32, and svptrue_b64 are as follows.

svptrue_b16
0101010101010101010101010101010101010101010101010101010101010101
svptrue_b32
0001000100010001000100010001000100010001000100010001000100010001
svptrue_b64
0000000100000001000000010000000100000001000000010000000100000001

The correspondence between the svptrue functions and the assembly is as follows.

svptrue_b8 => ptrue p0.b, ALL
svptrue_b16 => ptrue p0.h, ALL
svptrue_b32 => ptrue p0.s, ALL
svptrue_b64 => ptrue p0.d, ALL

There are various patterns that can be given to a predicate register. For example, SV_VL1 means "activate one element from the LSB", and SV_VL2 means "activate two elements from the LSB". Let's see how it works.

void ptrue_pat() {
  std::cout << "# ptrue_pat samples for various patterns" << std::endl;
  std::cout << "svptrue_pat_b8(SV_ALL)" << std::endl;
  show_pr(svptrue_pat_b8(SV_ALL));
  std::cout << "svptrue_pat_b8(SV_VL1)" << std::endl;
  show_pr(svptrue_pat_b8(SV_VL1));
  std::cout << "svptrue_pat_b8(SV_VL2)" << std::endl;
  show_pr(svptrue_pat_b8(SV_VL2));
  std::cout << "svptrue_pat_b8(SV_VL3)" << std::endl;
  show_pr(svptrue_pat_b8(SV_VL3));
  std::cout << "svptrue_pat_b8(SV_VL4)" << std::endl;
  show_pr(svptrue_pat_b8(SV_VL4));
}

The output will be as follows.

# ptrue_pat samples for various patterns
svptrue_pat_b8(SV_ALL)
1111111111111111111111111111111111111111111111111111111111111111
svptrue_pat_b8(SV_VL1)
0000000000000000000000000000000000000000000000000000000000000001
svptrue_pat_b8(SV_VL2)
0000000000000000000000000000000000000000000000000000000000000011
svptrue_pat_b8(SV_VL3)
0000000000000000000000000000000000000000000000000000000000000111
svptrue_pat_b8(SV_VL4)
0000000000000000000000000000000000000000000000000000000000001111

The positions of the bits to be set depend on the element type. Let's see which bits are set by SV_VL2 for various types.

Here is the source code.

void ptrue_pat_types() {
  std::cout << "# pture_pat samples for various types" << std::endl;
  std::cout << "svptrue_pat_b8(SV_VL2)" << std::endl;
  show_pr(svptrue_pat_b8(SV_VL2));
  std::cout << "svptrue_pat_b16(SV_VL2)" << std::endl;
  show_pr(svptrue_pat_b16(SV_VL2));
  std::cout << "svptrue_pat_b32(SV_VL2)" << std::endl;
  show_pr(svptrue_pat_b32(SV_VL2));
  std::cout << "svptrue_pat_b64(SV_VL2)" << std::endl;
  show_pr(svptrue_pat_b64(SV_VL2));
}

And here are the outputs.

# pture_pat samples for various types
svptrue_pat_b8(SV_VL2)
0000000000000000000000000000000000000000000000000000000000000011
svptrue_pat_b16(SV_VL2)
0000000000000000000000000000000000000000000000000000000000000101
svptrue_pat_b32(SV_VL2)
0000000000000000000000000000000000000000000000000000000000010001
svptrue_pat_b64(SV_VL2)
0000000000000000000000000000000000000000000000000000000100000001
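The bit patterns above can be reproduced with a small scalar model. The following plain C++ sketch is not SVE code, and ptrue_model is a hypothetical helper. It assumes one predicate bit per byte of the vector, sets only the lowest bit of each active element, and prints the LSB at the right like show_pr. It also assumes that a fixed-count pattern that does not fit in the vector activates no element.

```cpp
#include <cassert>
#include <string>

// Scalar model of a predicate register (hypothetical helper, not ACLE).
// vl_bytes:   vector length in bytes (one predicate bit per byte)
// elem_bytes: element size (1 for .b, 2 for .h, 4 for .s, 8 for .d)
// count:      number of leading elements to activate (SV_VLn);
//             pass a negative value to model SV_ALL
// Returns the bits as a string with the LSB at the right, like show_pr.
std::string ptrue_model(int vl_bytes, int elem_bytes, int count) {
  assert(vl_bytes > 0 && elem_bytes > 0 && vl_bytes % elem_bytes == 0);
  const int lanes = vl_bytes / elem_bytes;
  int active = (count < 0) ? lanes : count;
  // A fixed-count pattern that does not fit activates no element.
  if (active > lanes) active = 0;
  std::string bits(vl_bytes, '0');
  for (int i = 0; i < active; i++) {
    // Only the lowest bit of each element's group of bits is set.
    bits[vl_bytes - 1 - i * elem_bytes] = '1';
  }
  return bits;
}
```

For a 128-bit vector (16 bytes), ptrue_model(16, 4, 2) gives 0000000000010001, which has the same low-order bits as the svptrue_pat_b32(SV_VL2) output above.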

You can change the vector length and see how the results change.

qemu-aarch64 -cpu max,sve128=on ./a.out
qemu-aarch64 -cpu max,sve256=on ./a.out
qemu-aarch64 -cpu max,sve512=on ./a.out

3. Vector operations

In order to use SIMD instructions, data should be loaded into the SIMD registers. In the following, we will see how the loading of the registers and the arithmetic operations are performed, and how mask processing can be performed using the predicate register.

The sample code is in sample/intrinsic/03_load_add.

The name of an SVE ACLE vector type corresponds to the type of the elements it holds. For example, the type that holds float64_t elements is svfloat64_t. Since the length of an SVE register is not fixed, the number of float64_t elements in the register is not known at compile time.

It is useful to prepare a function to display the contents of the SIMD register.

void svshow(svfloat64_t va){
  int n = svcntd();
  std::vector<double> a(n);
  svbool_t tp = svptrue_b64();
  svst1_f64(tp, a.data(), va);
  for(int i=0;i<n;i++){
    printf("%+.7f ", a[n-i-1]);
  }
  printf("\n");
}

Use svld1_f64 to load data into svfloat64_t. Given a predicate and a start address, it loads as many elements as the predicate activates, up to the width of the register. The array below has eight elements, which exactly fills a 512-bit register. The code to define an array, load it into a register, and display the register can be written as follows.

  double a[] = {0, 1, 2, 3, 4, 5, 6, 7};
  svfloat64_t va = svld1_f64(svptrue_b64(), a);
  printf("va = ");
  svshow(va);

Here is the result.

va = +7.0000000 +6.0000000 +5.0000000 +4.0000000 +3.0000000 +2.0000000 +1.0000000 +0.0000000

Similarly, a register with all elements set to 1 is also prepared.

  double b[] = {1, 1, 1, 1, 1, 1, 1, 1};
  svfloat64_t vb = svld1_f64(svptrue_b64(), b);
  printf("vb = ");
  svshow(vb);
  printf("\n");

Use svadd_f64_z to add svfloat64_t.

  svfloat64_t vc1 = svadd_f64_z(svptrue_b64(), va, vb);
  printf("va + vb = ");
  svshow(vc1);

Here is the result.

va + vb = +8.0000000 +7.0000000 +6.0000000 +5.0000000 +4.0000000 +3.0000000 +2.0000000 +1.0000000

By specifying a pattern in the predicate register, it is possible to mask where the addition is performed. For example, if SV_VL2 is specified, only the two lowest elements are added.

For elements where the predicate is inactive, you can choose either to set them to zero or to keep the value of the first argument. With svadd_f64_z, inactive elements are cleared to zero as follows (zeroing predication).

  svfloat64_t vc2 = svadd_f64_z(svptrue_pat_b64(SV_VL2), va, vb);
  printf("va + vb = ");
  svshow(vc2);

va + vb = +0.0000000 +0.0000000 +0.0000000 +0.0000000 +0.0000000 +0.0000000 +2.0000000 +1.0000000

With svadd_f64_m, in which the trailing z of svadd_f64_z is replaced with m, inactive elements keep the value of the first argument (merging predication).

  svfloat64_t vc3 = svadd_f64_m(svptrue_pat_b64(SV_VL2), va, vb);
  printf("va + vb = ");
  svshow(vc3);

va + vb = +7.0000000 +6.0000000 +5.0000000 +4.0000000 +3.0000000 +2.0000000 +2.0000000 +1.0000000

In SVE, you write code for variable-length SIMD registers like this, making full use of masking.
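To make the difference between zeroing and merging predication concrete, the following plain C++ sketch models both on ordinary arrays. It is not SVE code. The helpers add_z, add_m, and first_n are hypothetical names chosen for this illustration.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

// Scalar models of predicated addition (hypothetical helpers, not ACLE).
// pg[i] == true means that lane i is active.

// Zeroing predication (like svadd_f64_z): inactive lanes become 0.
std::vector<double> add_z(const std::vector<bool>& pg,
                          const std::vector<double>& a,
                          const std::vector<double>& b) {
  assert(pg.size() == a.size() && a.size() == b.size());
  std::vector<double> r(a.size());
  for (std::size_t i = 0; i < a.size(); i++) {
    r[i] = pg[i] ? a[i] + b[i] : 0.0;
  }
  return r;
}

// Merging predication (like svadd_f64_m): inactive lanes keep a[i].
std::vector<double> add_m(const std::vector<bool>& pg,
                          const std::vector<double>& a,
                          const std::vector<double>& b) {
  assert(pg.size() == a.size() && a.size() == b.size());
  std::vector<double> r(a.size());
  for (std::size_t i = 0; i < a.size(); i++) {
    r[i] = pg[i] ? a[i] + b[i] : a[i];
  }
  return r;
}

// Predicate with the first `count` lanes active, like SV_VLn.
// If count does not fit in the vector, no lane is active.
std::vector<bool> first_n(std::size_t lanes, std::size_t count) {
  std::vector<bool> pg(lanes, false);
  if (count > lanes) return pg;
  for (std::size_t i = 0; i < count; i++) pg[i] = true;
  return pg;
}
```

With a = {0, ..., 7}, b all ones, and first_n(8, 2), add_z yields {1, 2, 0, 0, 0, 0, 0, 0} and add_m yields {1, 2, 2, 3, 4, 5, 6, 7}. These match the outputs above when read from right to left.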

4. Fizz Buzz Implementation with ACLE SVE

Let's try to write Fizz Buzz as an example of code that makes full use of scalable SIMD registers and mask operations. Instead of displaying Fizz or Buzz, we replace elements with -1, -2, or -3 when they are multiples of 3, 5, or 15, respectively.

The serial code looks like this.

#include <cstdio>
#include <vector>

int main() {
  // init
  const int n = 32;
  std::vector<int32_t> a(n);
  for (int i = 0; i < n; i++) {
    a[i] = i + 1;
  }
  // FizzBuzz
  for (int i = 0; i < n; i++) {
    if (a[i] % 15 == 0) {
      a[i] = -3;
    } else if (a[i] % 3 == 0) {
      a[i] = -1;
    } else if (a[i] % 5 == 0) {
      a[i] = -2;
    }
  }
  // Show Results
  for (int i = 0; i < n; i++) {
    if (a[i] == -1) {
      puts("Fizz");
    } else if (a[i] == -2) {
      puts("Buzz");
    } else if (a[i] == -3) {
      puts("FizzBuzz");
    } else {
      printf("%d\n", a[i]);
    }
  }
}

Now, this code consists of three parts: initialization, FizzBuzz, and result display. We rewrite the FizzBuzz part with SVE.

First of all, let's deal with the if statements. We use mask operations to express them. To make the mask, we first divide a[i] by 3, then multiply the quotient by 3, and check whether the result equals the original number. If it does, the number is a multiple of 3.

It would look like this.

  // FizzBuzz
  for (int i = 0; i < n; i++) {
    uint32_t t = a[i];
    uint32_t r3 = (t / 3) * 3;
    if (r3 == t) {
      a[i] = -1;
    }
    uint32_t r5 = (t / 5) * 5;
    if (r5 == t) {
      a[i] = -2;
    }
    uint32_t r15 = (t / 15) * 15;
    if (r15 == t) {
      a[i] = -3;
    }
  }

We will rewrite this part with intrinsic functions. We don't know the vector length of SVE until runtime, but for simplicity, let's assume that n is always a multiple of the number of elements corresponding to the vector length.

Before entering the loop, we prepare registers that contain the necessary constants. Most Arm SVE intrinsic functions take a predicate register, so we prepare one in which all elements are true.

  svbool_t tp = svptrue_b32();

Next, we make vectors filled with -1, -2, and -3 for value assignment. vf, vb, and vfb denote "Vector for Fizz", "Vector for Buzz", and "Vector for FizzBuzz", respectively.

  svint32_t vf = svdup_n_s32_x(tp, -1);
  svint32_t vb = svdup_n_s32_x(tp, -2);
  svint32_t vfb = svdup_n_s32_x(tp, -3);

We also make vectors filled with 3, 5, and 15 for divisions and multiplications.

  svint32_t v3 = svdup_n_s32_x(tp, 3);
  svint32_t v5 = svdup_n_s32_x(tp, 5);
  svint32_t v15 = svdup_n_s32_x(tp, 15);

The number of 32-bit integers (int32_t) that fit in an SVE register can be obtained with the cntw instruction. The corresponding intrinsic function is svcntw. Therefore, the loop structure looks as follows.

  int w = svcntw();
  int s = 0;
  while (s + w <= n) {
      // FizzBuzz
    s += w;
  }

Here, s is the index of the first element to be processed in each iteration.
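The loop above relies on n being a multiple of w. One common way to drop that assumption in SVE is to build, in each iteration, a predicate that is active only for indices below n (ACLE provides svwhilelt_b32 for this purpose). The following plain C++ sketch models that idea with hypothetical helpers. It is not taken from the sample code and it is not SVE code.

```cpp
#include <cassert>
#include <vector>

// Scalar model of a "while less than" predicate (hypothetical helper):
// lane i is active when s + i < n.
std::vector<bool> whilelt_model(int s, int n, int w) {
  assert(w > 0);
  std::vector<bool> pg(w);
  for (int i = 0; i < w; i++) pg[i] = (s + i < n);
  return pg;
}

// Counts how many elements a chunked loop touches when every chunk is
// guarded by whilelt_model. The result equals n for any w > 0.
int count_processed(int n, int w) {
  int processed = 0;
  for (int s = 0; s < n; s += w) {
    for (bool active : whilelt_model(s, n, w)) {
      processed += active ? 1 : 0;
    }
  }
  return processed;
}
```

For example, with n = 10 and w = 4, the last chunk starts at s = 8 and only its first two lanes are active, so no element beyond the end of the array is touched.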

To load w elements starting at index s of std::vector<int32_t> a(n) into a SIMD register, we use svld1_s32.

svint32_t va = svld1_s32(svptrue_b32(), a.data() + s);

We prepare a temporary variable svint32_t vr to store the value of va divided by 3. The function for integer division is svdiv_s32_z.

svint32_t vr;
vr = svdiv_s32_z(tp, va, v3);

We next multiply by 3. The function for multiplication of integers is svmul_s32_z.

vr = svmul_s32_z(tp, vr, v3);

Now vr stores the values of va divided by 3 and then multiplied by 3. Compare va and vr element by element; the elements that match are multiples of 3. We put the matching positions into a predicate register with svcmpeq_s32, which compares the two vector registers as vectors of int32_t and returns a predicate in which the matching positions are set.

svbool_t pg;
pg = svcmpeq_s32(tp, va, vr);

Since the positions of multiples of 3 are now stored in pg, we use it to write the register vf filled with -1 back to a.

svst1_s32(pg, a.data() + s, vf);

The same goes for multiples of 5 and multiples of 15. Putting all the above together, the code looks like the following.

  // FizzBuzz
  svbool_t tp = svptrue_b32();
  svint32_t vf = svdup_n_s32_x(tp, -1);
  svint32_t vb = svdup_n_s32_x(tp, -2);
  svint32_t vfb = svdup_n_s32_x(tp, -3);

  svint32_t v3 = svdup_n_s32_x(tp, 3);
  svint32_t v5 = svdup_n_s32_x(tp, 5);
  svint32_t v15 = svdup_n_s32_x(tp, 15);

  int w = svcntw();
  int s = 0;

  while (s + w <= n) {
    svint32_t va = svld1_s32(svptrue_b32(), a.data() + s);

    svint32_t vr;
    svbool_t pg;
    vr = svdiv_s32_z(tp, va, v3);
    vr = svmul_s32_z(tp, vr, v3);
    pg = svcmpeq_s32(tp, va, vr);
    svst1_s32(pg, a.data() + s, vf);

    vr = svdiv_s32_z(tp, va, v5);
    vr = svmul_s32_z(tp, vr, v5);
    pg = svcmpeq_s32(tp, va, vr);
    svst1_s32(pg, a.data() + s, vb);

    vr = svdiv_s32_z(tp, va, v15);
    vr = svmul_s32_z(tp, vr, v15);
    pg = svcmpeq_s32(tp, va, vr);
    svst1_s32(pg, a.data() + s, vfb);
    s += w;
  }
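The structure of the loop above can be checked on any machine with a scalar model. The following plain C++ sketch is not SVE code, and store_if_multiple and fizzbuzz_model are hypothetical helpers. It processes the array in chunks of w lanes, assuming that n is a multiple of w, and performs the three masked stores in the same order as the SVE version.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <vector>

// Scalar model of one masked store (hypothetical helper): writes `value`
// to a[s + i] for every lane i where (x / d) * d == x.
static void store_if_multiple(std::vector<int32_t>& a,
                              const std::vector<int32_t>& va, int s,
                              int32_t d, int32_t value) {
  for (std::size_t i = 0; i < va.size(); i++) {
    const int32_t vr = (va[i] / d) * d;  // divide, then multiply back
    if (vr == va[i]) a[s + i] = value;   // this lane is set in the mask
  }
}

// Runs FizzBuzz on 1..n in chunks of w lanes, assuming n % w == 0.
std::vector<int32_t> fizzbuzz_model(int n, int w) {
  assert(w > 0 && n % w == 0);
  std::vector<int32_t> a(n);
  for (int i = 0; i < n; i++) a[i] = i + 1;
  for (int s = 0; s < n; s += w) {
    // "Load" the chunk once. The later stores do not change va.
    const std::vector<int32_t> va(a.begin() + s, a.begin() + s + w);
    store_if_multiple(a, va, s, 3, -1);   // Fizz
    store_if_multiple(a, va, s, 5, -2);   // Buzz
    store_if_multiple(a, va, s, 15, -3);  // FizzBuzz overwrites both
  }
  return a;
}
```

The order of the stores matters: a multiple of 15 is first written as -1 and -2, and the final store replaces it with -3. The result does not depend on w, which corresponds to the vector-length independence of the SVE code.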

The sample code is in sample/intrinsic/04_fizzbuzz. You can build and run the sample as follows.

$ make
aarch64-linux-gnu-g++ -static -march=armv8-a+sve -O2 fizzbuzz.cpp
$ ./a.out
1
2
Fizz
4
Buzz
(snip)
26
Fizz
28
29
FizzBuzz
31
32

You can confirm that changing the register length does not change the result.

qemu-aarch64 -cpu max,sve128=on ./a.out
qemu-aarch64 -cpu max,sve256=on ./a.out
qemu-aarch64 -cpu max,sve512=on ./a.out

You should see the same results for all of the above.

Xbyak_aarch64

1. Test

Let's start by checking that Xbyak_aarch64 works. The sample code is in sample/xbyak. Compile and run the test code as follows.

$ cd xbyak_aarch64_handson
$ cd sample
$ cd xbyak
$ cd 01_test
$ make
aarch64-linux-gnu-g++ -static test.cpp -L/home/user/xbyak_aarch64_handson/xbyak_aarch64/lib -lxbyak_aarch64
$ ./a.out
1

Note that even though a.out is a binary for Arm, you can run it directly like this without typing qemu-aarch64. Even if you don't specify QEMU explicitly, a.out is executed through QEMU.

Here is the source code.

#include <cstdio>
#include <xbyak_aarch64/xbyak_aarch64.h>

struct Code : Xbyak_aarch64::CodeGenerator {
  Code() {
    mov(w0, 1);
    ret();
  }
};

int main() {
  Code c;
  auto f = c.getCode<int (*)()>();
  c.ready();
  printf("%d\n", f());
}

Here, the mov(w0, 1) part is where the return value of the function is assigned. Let's change the return value of the function to another value, say 42. Replace the code with mov(w0, 42), and compile and run it again.

$ make
aarch64-linux-gnu-g++ -static test.cpp -L/home/user/xbyak_aarch64_handson/xbyak_aarch64/lib -lxbyak_aarch64
$ ./a.out
42

You will see 42 as the result.

2. Calling convention

Xbyak is a tool for writing a function in full assembly. In assembly, function calls are jumps, and registers and other variables are all global variables, so it is up to the programmer to decide how to pass function arguments and how to return values. However, when using a high-level language such as C, it is inconvenient if each compiler has different calling conventions, because object files compiled by different compilers cannot be linked. The Application Binary Interface (ABI) defines binary-level interfaces for each ISA, and calling conventions are one of the many things defined by the ABI. In the following, we will take a brief look at the calling convention and how to write Xbyak.

In the directory sample/xbyak/02_abi, there is abi.cpp, a template for Xbyak code.

#include <cstdio>
#include <xbyak_aarch64/xbyak_aarch64.h>

struct Code : Xbyak_aarch64::CodeGenerator {
  Code() {
    ret();
  }
};

int main() {
  Code c;
  auto f = c.getCode<void (*)()>();
  c.ready();
}

This sample creates a function that does nothing but return when called. Since the signature of the function is void f(), we pass the type void (*)() to getCode. First, we will modify it into a function that returns the integer 1. To do this, we need to know how an integer is returned on AArch64. Of course, you can read the official document (136 pages, wow!), but it's not practical to read or memorize it every time. It is easier to write some simple code, compile it, and look at the generated assembly to see the calling convention.

Consider a code like the following.

int func(){
  return 1;
}

Compile it and see the assembly.

$ ag++ -S test.cpp
$ cat test.s
        .arch armv8-a+sve
        .file   "test.cpp"
        .text
        .align  2
        .p2align 4,,11
        .global _Z4funcv
        .type   _Z4funcv, %function
_Z4funcv:
.LFB0:
        .cfi_startproc
        mov     w0, 1
        ret
        .cfi_endproc
.LFE0:
        .size   _Z4funcv, .-_Z4funcv
        .ident  "GCC: (GNU) 11.2.0"
        .section        .note.GNU-stack,"",@progbits

This shows that an integer is returned by putting its value in the register w0. From here, we can see that we should modify abi.cpp as follows.

#include <cstdio>
#include <xbyak_aarch64/xbyak_aarch64.h>

struct Code : Xbyak_aarch64::CodeGenerator {
  Code() {
    mov(w0, 1); // mov w0, 1 
    ret();
  }
};

int main() {
  Code c;
  auto f = c.getCode<int (*)()>(); // Fixed function pointer type to correspond to int f().
  c.ready();
  printf("%d\n",f()); // Display the result of f()
}

Let's compile and run it.

$ make
$ ./a.out
1

You will see 1 as the result.

Next, let's take arguments. Consider a function that takes an integer and returns that value plus 1. As before, let's ask the compiler to show us the assembly.

int func(int i){
  return i+1;
}

If you compile the above code with ag++ -S, you will see that the corresponding assembly looks like the following.

  add w0, w0, 1
  ret

That is, the first integer argument comes in w0, so we should assign the result of adding 1 to it to w0. The corresponding code of Xbyak will be as follows.

#include <cstdio>
#include <xbyak_aarch64/xbyak_aarch64.h>

struct Code : Xbyak_aarch64::CodeGenerator {
  Code() {
    add(w0, w0, 1);
    ret();
  }
};

int main() {
  Code c;
  auto f = c.getCode<int (*)(int)>(); // the signature of function is changed to be `int f(int)`
  c.ready(); // call ready() before calling the generated function
  printf("%d\n",f(1)); // call f(1)
}

You will see 2 as the result.

$ make
$ ./a.out
2

In a similar manner, we can write a function that takes two arguments and returns the sum as follows.

#include <cstdio>
#include <xbyak_aarch64/xbyak_aarch64.h>

struct Code : Xbyak_aarch64::CodeGenerator {
  Code() {
    add(w0, w0, w1); // w0 = w0 + w1
    ret();
  }
};

int main() {
  Code c;
  auto f = c.getCode<int (*)(int, int)>(); // the signature is changed to be `int f(int, int)`
  c.ready(); // call ready() before calling the generated function
  printf("%d\n",f(3,4)); // calculate 3+4
}

You will get 7 as a result of adding 3 to 4.

Since we dealt with integer operations here, the registers were w0 and w1 and the add instruction was add. If we change the registers to d0 and d1 and the add instruction to fadd, we get a double-precision version.

#include <cstdio>
#include <xbyak_aarch64/xbyak_aarch64.h>

struct Code : Xbyak_aarch64::CodeGenerator {
  Code() {
    fadd(d0, d0, d1); // d0 = d0 + d1
    ret();
  }
};

int main() {
  Code c;
  auto f = c.getCode<double (*)(double, double)>(); // double f(double, double);
  c.ready(); // call ready() before calling the generated function
  printf("%f\n",f(3.0,4.0)); // calculate 3.0+4.0
}

You will see 7.000000 as the result.

3. Display the assembler mnemonics generated by Xbyak

Xbyak is a tool that puts machine instructions in memory and executes them from the first address. Which instructions are placed depends on the program you write. In other words, we are writing a C/C++ program that outputs assembly, that is, a code generator.

When there is a bug in the code generator, we want to debug it by looking at the output assembly. Therefore, we will see how to disassemble the code generated by Xbyak. The sample code is sample/xbyak/03_dump/dump.cpp.

To get the machine code generated by Xbyak, use Xbyak_aarch64::CodeGenerator::getCode(). You can also get its length in bytes with getSize(). Let's add a method dump to the code generator that saves the machine code to a file with a given name.

#include <cstdio>
#include <xbyak_aarch64/xbyak_aarch64.h>

struct Code : Xbyak_aarch64::CodeGenerator {
  Code() {
    mov(w0, 1);
    ret();
  }
  void dump(const char *filename) {
    FILE *fp = fopen(filename, "wb");
    if (fp == nullptr) return; // could not open the file
    fwrite(getCode(), 1, getSize(), fp);
    fclose(fp); // flush and close the file
  }
};

void dump(const char *filename) saves the machine code created by Xbyak to the given file.

The code that both runs the generated function and saves its machine code can be written as follows.

int main() {
  Code c;
  auto f = c.getCode<int (*)(int)>();
  c.ready();
  c.dump("xbyak.dump");
  printf("%d\n",f(10));
}

Here are the results.

$ make
$ ./a.out
1

The 1 is the return value of the function generated by Xbyak. The machine code generated by Xbyak is saved as xbyak.dump. You can disassemble it by passing it to objdump, but you need to tell objdump the format and architecture because the file has no header.

$ aarch64-linux-gnu-objdump -D -maarch64 -b binary -d xbyak.dump

xbyak.dump:     file format binary


Disassembly of section .data:

0000000000000000 <.data>:
   0:   52800020        mov     w0, #0x1                        // #1
   4:   d65f03c0        ret

You can see that the code mov w0, 1; ret is generated as intended.

Since it is troublesome to type aarch64-linux-gnu-objdump -D -maarch64 -b binary -d every time, the following alias is defined in .bashrc.

alias xdump="aarch64-linux-gnu-objdump -D -maarch64 -b binary -d"

You can use it as follows.

xdump xbyak.dump

Now, let's see how Xbyak generates code dynamically. Modify the generator so that the generated function adds 1 to its argument n times and returns the result.

struct Code : Xbyak_aarch64::CodeGenerator {
  Code(int n) {
    for(int i=0;i<n;i++){
      add(w0, w0, 1);
    }
    ret();
  }
};

The constructor Code receives int n and emits add(w0, w0, 1); n times. Specify the number of iterations as Code c(3);.

int main() {
  Code c(3); // Modified here
  auto f = c.getCode<int (*)(int)>();
  c.ready();
  printf("%d\n", f(10));
  dump(c.getCode(), c.getSize());
}

You can build and run as follows.

$ make
$ ./a.out
13

The program displays 13, which is 10 with 1 added three times. Here is the generated assembly.

$ xdump xbyak.dump

xbyak.dump:     file format binary


Disassembly of section .data:

0000000000000000 <.data>:
   0:   11000400        add     w0, w0, #0x1
   4:   11000400        add     w0, w0, #0x1
   8:   11000400        add     w0, w0, #0x1
   c:   d65f03c0        ret

As intended, the code executes add three times. Since the code is generated at runtime, the number of repetitions does not have to be fixed at compile time. Let's pass it as a command-line argument.

int main(int argc, char **argv) {
  Code c(atoi(argv[1]));
  auto f = c.getCode<int (*)(int)>();
  c.ready();
  printf("%d\n", f(10));
  dump(c.getCode(), c.getSize());
}

You can feed any number, say 5.

$ ./a.out  5
15

$ xdump xbyak.dump

xbyak.dump:     file format binary


Disassembly of section .data:

0000000000000000 <.data>:
   0:   11000400        add     w0, w0, #0x1
   4:   11000400        add     w0, w0, #0x1
   8:   11000400        add     w0, w0, #0x1
   c:   11000400        add     w0, w0, #0x1
  10:   11000400        add     w0, w0, #0x1
  14:   d65f03c0        ret

You can see that Xbyak generates code dynamically.
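The idea of a code generator can also be modeled without Xbyak. The following plain C++ sketch builds the same sequence of 32-bit instruction words as the disassembly above, using the A64 encodings of add (immediate, 32-bit) and ret. The helpers encode_add_imm_w, encode_ret, and generate are hypothetical names that cover only this example. They are not part of the Xbyak_aarch64 API.

```cpp
#include <cassert>
#include <cstdint>
#include <vector>

// Encoding of "add wd, wn, #imm12" (32-bit, no shift).
// Hypothetical helper that covers only this form of the instruction.
uint32_t encode_add_imm_w(unsigned rd, unsigned rn, unsigned imm12) {
  assert(rd < 32 && rn < 32 && imm12 < 4096);
  return 0x11000000u | (imm12 << 10) | (rn << 5) | rd;
}

// Encoding of "ret", which returns to the address in x30.
uint32_t encode_ret() {
  return 0xd65f0000u | (30u << 5);
}

// Model of Code(int n): n copies of "add w0, w0, #1" followed by "ret".
std::vector<uint32_t> generate(int n) {
  std::vector<uint32_t> code;
  for (int i = 0; i < n; i++) {
    code.push_back(encode_add_imm_w(0, 0, 1));
  }
  code.push_back(encode_ret());
  return code;
}
```

Calling generate(3) gives 0x11000400 three times followed by 0xd65f03c0, which are the words in the second column of the disassembly for Code c(3).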

4. Fizz Buzz Implementation with Xbyak

Finally, let's try to write FizzBuzz with Xbyak. As in the example of intrinsic functions, FizzBuzz is expressed by writing -1 for multiples of 3, -2 for multiples of 5, and -3 for multiples of 15 to an array of integers of type int32_t. The data are stored in std::vector of the type int32_t, and the first address is passed as an argument of the function made by Xbyak.

The algorithm is as follows.

Set all elements of p0 to true.
Prepare registers with all elements initialized to -1, -2, -3, 3, and 5 (in order, z1 to z5).
Load the values of the array into the register z0. Since int32_t is used, 16 elements can be loaded at a time (when the SVE register is 512 bits wide).
Copy the z0 register to the z7 register.
Divide all elements in the z7 register by 3, then multiply by 3 (use sdiv and mul).
Compare the z7 register with the z0 register, make a mask in which the matching positions are set, and put it in p1.
Write the z1 register back to the address of the array with the p1 register as a mask (Fizz).
Buzz is calculated in the same way, and the mask is put into p2 and used for the write-back.
For FizzBuzz, put the logical conjunction (AND) of the masks p1 and p2 into p3, and write back -3 using that mask.

Although the code is not very efficient, it is good for practicing SVE with Xbyak because it involves processing multiple elements at once, masked stores using predicate registers, and logical operations on predicate registers.
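The steps above can be sketched as a scalar model for a single register's worth of data. The following plain C++ code is not Xbyak or SVE code, and multiple_mask, masked_store, and fizzbuzz_lanes are hypothetical helpers. It shows how the FizzBuzz mask p3 is obtained as the AND of p1 and p2 instead of a third division.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <vector>

// Mask of lanes where (x / d) * d == x, like sdiv + mul + cmpeq.
std::vector<bool> multiple_mask(const std::vector<int32_t>& z0, int32_t d) {
  std::vector<bool> p(z0.size());
  for (std::size_t i = 0; i < z0.size(); i++) {
    p[i] = ((z0[i] / d) * d == z0[i]);
  }
  return p;
}

// Masked store: a[i] = value where p[i] is set, like a predicated st1w.
void masked_store(std::vector<int32_t>& a, const std::vector<bool>& p,
                  int32_t value) {
  for (std::size_t i = 0; i < a.size(); i++) {
    if (p[i]) a[i] = value;
  }
}

// One register's worth of FizzBuzz using p1, p2, and p3 = p1 AND p2.
std::vector<int32_t> fizzbuzz_lanes(std::vector<int32_t> a) {
  const std::vector<int32_t> z0 = a;  // the loaded values stay unchanged
  const std::vector<bool> p1 = multiple_mask(z0, 3);
  const std::vector<bool> p2 = multiple_mask(z0, 5);
  std::vector<bool> p3(z0.size());
  for (std::size_t i = 0; i < z0.size(); i++) p3[i] = p1[i] && p2[i];
  masked_store(a, p1, -1);  // Fizz
  masked_store(a, p2, -2);  // Buzz
  masked_store(a, p3, -3);  // FizzBuzz
  return a;
}
```

For the input {13, 14, 15, 16}, only the third lane is set in both p1 and p2, so the result is {13, 14, -3, 16}.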

The source code is in fizzbuzz.cpp in the directory sample/xbyak/04_fizzbuzz.

First, write Xbyak code that returns how many int32_t elements fit in an SVE register.

struct Cntw : Xbyak_aarch64::CodeGenerator {
  Cntw() {
    cntw(x0);
    ret();
  }
};

It simply executes cntw and puts the result in x0. You can get the number of elements as follows.

  Cntw cw;
  cw.ready(); // call ready() before calling the generated function
  int nw = cw.getCode<int (*)()>()();
  printf("Number of int32_t in a register is %d.\n",nw);

The signature of the FizzBuzz function generated by Xbyak is as follows.

void f(int32_t *);

Therefore, the type passed to getCode is as follows.

auto f = c.getCode<void (*)(int32_t *)>();

The generated function can be called as follows. a is a variable of type std::vector<int32_t>.

f(a.data());

Positive integers are stored in the vector a, and when f is called, multiples of 3 are rewritten to -1, multiples of 5 to -2, and multiples of 15 to -3, respectively. The code of Xbyak to make such a function f is as follows.

struct Code : Xbyak_aarch64::CodeGenerator {
  Code(int n, int nw) {
    ptrue(p0.s);
    dup(z1.s, -1);
    dup(z2.s, -2);
    dup(z3.s, -3);
    dup(z4.s, 3);
    dup(z5.s, 5);
    for (int i = 0; i < n/nw; i++) {
      ld1w(z0.s, p0, ptr(x0));
      // Fizz
      // b[i] = (a[i] / 3) * 3
      mov(z7.s, p0, z0.s);
      sdiv(z7.s, p0.s, z4.s);
      mul(z7.s, p0.s, z4.s);
      // Mask
      cmpeq(p1.s, p0, z0.s, z7.s);
      // Write -1
      st1w(z1.s, p1, ptr(x0));

      // Buzz
      // b[i] = (a[i] / 5) * 5
      mov(z7.s, p0, z0.s);
      sdiv(z7.s, p0.s, z5.s);
      mul(z7.s, p0.s, z5.s);
      // Mask
      cmpeq(p2.s, p0, z0.s, z7.s);
      // Write -2
      st1w(z2.s, p2, ptr(x0));

      // FizzBuzz
      and_(p3.b, p0, p1.b, p2.b);
      // Write -3
      st1w(z3.s, p3, ptr(x0));

      adds(x0, x0, nw*4);
    }

    ret();
  }
  void dump(const char *filename) {
    FILE *fp = fopen(filename, "wb");
    if (fp == nullptr) return; // could not open the file
    fwrite(getCode(), 1, getSize(), fp);
    fclose(fp); // flush and close the file
  }
};

Let me explain step by step.

First, the constructor takes the number of elements n and the number of elements per register nw, and generates code in which the loop is unrolled n/nw times. Note that these arguments are for the code generator, not for the function generated by Xbyak.

We prepare the registers that store the constants.

ptrue(p0.s);    // all true
dup(z1.s, -1);  // filled with -1
dup(z2.s, -2);  // filled with -2
dup(z3.s, -3);  // filled with -3
dup(z4.s, 3);   // filled with 3
dup(z5.s, 5);   // filled with 5

The function is called as f(a.data()), so it receives the first address of a. This address is in x0, and we use it to load data into the z0 register with the ld1w instruction. Since we want to load all the elements, we use p0, in which all elements are true.

ld1w(z0.s, p0, ptr(x0));

The elements of a are now stored together in z0.

Next, copy the value of z0 to z7.

mov(z7.s, p0, z0.s);

Divide all the elements of z7 by 3, and then multiply by 3. We have a register z4 whose elements are all 3, so we can use it.

sdiv(z7.s, p0.s, z4.s);
mul(z7.s, p0.s, z4.s);

We divide the elements by 3 (integer division, which discards the remainder) and multiply by 3. So, for example, if the register z0 stores "1,2,3,4", then the values in z7 will be "0,0,3,3". We make a mask by comparing z0 and z7 with cmpeq.

cmpeq(p1.s, p0, z0.s, z7.s);

Then only the third element of the predicate register p1 is set, i.e., p1 is "0,0,1,0" in element order. We can use it as a mask to write the data "-1,-1,-1,-1" to the address of the array a. We use the register z1, whose elements are all initialized to -1.

st1w(z1.s, p1, ptr(x0));

In this way, the data that was originally "1,2,3,4" became "1,2, -1,4". You can handle Buzz in exactly the same way.

Next, we will make a mask for FizzBuzz, i.e., for the positions divisible by 15. We already have the predicate register p1, which stores the positions divisible by 3, and p2, which stores the positions divisible by 5. So we can make the mask p3 for the positions divisible by 15 by taking the logical conjunction of p1 and p2.

and_(p3.b, p0, p1.b, p2.b);

Since and is a reserved word in C++, Xbyak names the method and_. Then, you can write back -3 in the same way.

Finally, shift the value of register x0, which is the address to be read or written, by the number of elements per register * 4 bytes (= sizeof(int32_t)).

adds(x0, x0, nw*4);

Repeat this as many times as necessary to complete the process for all the elements. While we can create a loop with Xbyak, let's unroll the loop completely to take advantage of the JIT.

You will obtain the following results by compiling and executing the code.

$ make
$ ./a.out
Number of int32_t in a register is 16.
1
2
Fizz
4
Buzz
Fizz
7
8
Fizz
Buzz
11
Fizz
13
14
FizzBuzz
16
17
Fizz
19
Buzz
Fizz
22
23
Fizz
Buzz
26
Fizz
28
29
FizzBuzz
31
32

Let's also take a look at the machine language generated by Xbyak.

$ xdump xbyak.dump

xbyak.dump:     file format binary


Disassembly of section .data:

0000000000000000 <.data>:
   0:   2598e3e0        ptrue   p0.s
   4:   25b8dfe1        mov     z1.s, #-1
   8:   25b8dfc2        mov     z2.s, #-2
   c:   25b8dfa3        mov     z3.s, #-3
  10:   25b8c064        mov     z4.s, #3
  14:   25b8c0a5        mov     z5.s, #5
  18:   a540a000        ld1w    {z0.s}, p0/z, [x0]
  1c:   05a7c007        mov     z7.s, p0/m, z0.s
  20:   04940087        sdiv    z7.s, p0/m, z7.s, z4.s
  24:   04900087        mul     z7.s, p0/m, z7.s, z4.s
  28:   2487a001        cmpeq   p1.s, p0/z, z0.s, z7.s
  2c:   e540e401        st1w    {z1.s}, p1, [x0]
  30:   05a7c007        mov     z7.s, p0/m, z0.s
  34:   049400a7        sdiv    z7.s, p0/m, z7.s, z5.s
  38:   049000a7        mul     z7.s, p0/m, z7.s, z5.s
  3c:   2487a002        cmpeq   p2.s, p0/z, z0.s, z7.s
  40:   e540e802        st1w    {z2.s}, p2, [x0]
  44:   25024023        and     p3.b, p0/z, p1.b, p2.b
  48:   e540ec03        st1w    {z3.s}, p3, [x0]
  4c:   b1010000        adds    x0, x0, #0x40
  50:   a540a000        ld1w    {z0.s}, p0/z, [x0]
  54:   05a7c007        mov     z7.s, p0/m, z0.s
  58:   04940087        sdiv    z7.s, p0/m, z7.s, z4.s
  5c:   04900087        mul     z7.s, p0/m, z7.s, z4.s
  60:   2487a001        cmpeq   p1.s, p0/z, z0.s, z7.s
  64:   e540e401        st1w    {z1.s}, p1, [x0]
  68:   05a7c007        mov     z7.s, p0/m, z0.s
  6c:   049400a7        sdiv    z7.s, p0/m, z7.s, z5.s
  70:   049000a7        mul     z7.s, p0/m, z7.s, z5.s
  74:   2487a002        cmpeq   p2.s, p0/z, z0.s, z7.s
  78:   e540e802        st1w    {z2.s}, p2, [x0]
  7c:   25024023        and     p3.b, p0/z, p1.b, p2.b
  80:   e540ec03        st1w    {z3.s}, p3, [x0]
  84:   b1010000        adds    x0, x0, #0x40
  88:   d65f03c0        ret

You can see that the loop in the code created by Xbyak has been unrolled twice.

The current result is for 512-bit registers, but you can also try it for 128-bit and 256-bit registers.

$ make run128
qemu-aarch64 -cpu max,sve128=on ./a.out
Number of int32_t in a register is 4.
1
2
Fizz
(snip)

$ make run256
qemu-aarch64 -cpu max,sve256=on ./a.out
Number of int32_t in a register is 8.
1
2
Fizz
(snip)

If you disassemble xbyak.dump after each execution, you can see that the loop is unrolled 8 times and 4 times, respectively. This shows that scalable code is generated dynamically.
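The unroll counts and the size of the generated function can be estimated from the vector length. The following plain C++ sketch uses instruction counts read off the 512-bit disassembly above (6 setup instructions, 14 per iteration, and 1 ret). The constants and the helpers unroll_count and generated_size are hypothetical and describe only this particular generator.

```cpp
#include <cassert>

// Instruction counts read off the disassembly above. These constants
// describe this particular generator, not Xbyak in general.
constexpr int kSetupInsns = 6;     // ptrue + five constant vectors
constexpr int kInsnsPerIter = 14;  // from ld1w to adds
constexpr int kRetInsns = 1;
constexpr int kBytesPerInsn = 4;   // A64 instructions are 4 bytes long

// Number of unrolled iterations for n int32_t elements.
int unroll_count(int n, int vl_bits) {
  const int nw = vl_bits / 32;  // int32_t lanes per register
  assert(nw > 0 && n % nw == 0);
  return n / nw;
}

// Expected size in bytes of the generated function.
int generated_size(int n, int vl_bits) {
  const int insns =
      kSetupInsns + kInsnsPerIter * unroll_count(n, vl_bits) + kRetInsns;
  return insns * kBytesPerInsn;
}
```

For n = 32 and 512-bit registers this gives 140 bytes (0x8c), which agrees with the listing above, where the final ret is at offset 0x88.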

Licence

MIT