Mozilla-Ocho / llamafile

Distribute and run LLMs with a single file.

Home Page: https://llamafile.ai

wrong endian

Djip007 opened this issue · comments

return (__m512bh)_mm512_loadu_ps((const float *)p);

I'm not 100% sure, but I think this code is wrong, because x86 CPUs are little endian...

_mm512_loadu_ps loads fp32 values, which are 4 bytes each: it loads {b1,b2,b3,b4} from memory and the register gets {b4,b3,b2,b1}.
I think in this case we need {b2,b1,b4,b3}... for that I would use _mm512_loadu_epi16:

static inline __m512bh load(const ggml_bf16_t *p) {
    return (__m512bh)_mm512_loadu_epi16(p);
}

But I may be wrong.

Note: here is the code I use to test (it is for store, but I expect the same behavior with load):

/*

g++ -Wall -O3 -fopenmp -march=native -o main main2.cpp; ./main

*/
#include <immintrin.h>
#include <iostream>
#include <iomanip>

using float32_t = float;

# pragma pack(push, 1)
union bf16_t {
  struct {
    unsigned short fraction:7;
    unsigned short exponent:8;  // -127
    bool           sign:1;      // +:false -:true
  } p;
  unsigned short u=0x8000; // -0 (sign bit set)
  // auto &sign = p.sign;
};
#pragma pack(pop)

# pragma pack(push, 1)
union fp32_t {
  struct {
    unsigned int fraction:23;
    unsigned int exponent:8;
    bool         sign:1;
  } p;
  unsigned int u;
  float f=1;
};
#pragma pack(pop)

std::ostream& operator<<(std::ostream& target, const bf16_t& source) {
    target <<"0x"<< std::hex << source.u << std::setbase(0) ;
    return target;
}
std::ostream& operator<<(std::ostream& target, const fp32_t& source) {
    target <<"0x"<< std::hex << source.u << std::setbase(0) ;
    return target;
}

template <int N, typename T>
void rang_init(T* val) {
    for (int i=0; i<N; ++i) val[i].f = i+1;
}

template <int N, typename T>
void print(T* val) {
    for (int i=0; i<N; ++i) std::cout << val[i] << "," ;
    std::cout << std::endl;
}

int main() {
    fp32_t in[32];
    bf16_t out[32];
    rang_init<32>(in);

    auto tmp1 = _mm512_loadu_ps(in);
    auto tmp2 = _mm512_loadu_ps(in+16);
    auto res  = _mm512_cvtne2ps_pbh(tmp2,tmp1); // thanks, little-endian
    _mm512_storeu_epi16(out, (__m512i)res);
    std::cout << "pf32: "; print<32>(in);
    std::cout << "bf16: "; print<32>(out);
    return 0;
}

Note: I forgot to say... nice work you are doing with llamafile 👍

Your program prints out:

pf32: 0x3f800000,0x40000000,0x40400000,0x40800000,0x40a00000,0x40c00000,0x40e00000,0x41000000,0x41100000...
bf16: 0x3f80,0x4000,0x4040,0x4080,0x40a0,0x40c0,0x40e0,0x4100,0x4110,0x4120,0x4130,0x4140,0x4150,0x4160...

That looks correct to me. Now consider this example.

#ifdef __x86_64__
#include "llama.cpp/ggml.h"
#include <assert.h>
#include <immintrin.h>
#include <string.h>
int main(int argc, char *argv[]) {
    float in[32];
    ggml_bf16_t out[32];
    for (int i = 0; i < 32; ++i)
        in[i] = i;
    __m512 tmp1 = _mm512_loadu_ps(in);
    __m512 tmp2 = _mm512_loadu_ps(in + 16);
    __m512bh res = _mm512_cvtne2ps_pbh(tmp2, tmp1);
    memcpy(out, &res, 64);
    for (int i = 0; i < 32; ++i)
        assert(ggml_bf16_to_fp32(out[i]) == in[i]);
    _mm512_storeu_epi16(out, (__m512i)res);
    for (int i = 0; i < 32; ++i)
        assert(ggml_bf16_to_fp32(out[i]) == in[i]);
}
#else
int main(int argc, char *argv[]) {
}
#endif

When Intel documents instructions as having a type (e.g. _mm512_loadu_ps() and _mm512_loadu_epi16()) it doesn't mean anything. How the bits are arranged in silicon is opaque to the programmer. The only thing endianness concerns is how words and vectors are serialized to a char[] array. John von Neumann had the insight in his first draft on the EDVAC that little endian is simply the natural order of memory, even though it's counter-intuitive to people raised on Arabic notation.

In any case that code has since been refactored. We're experimenting with a slightly different BF16 technique, but I might resurrect that code again sometime soon. Glad you've been enjoying it! This is one of my favorite issues so far, since I love this subject and I don't always get the opportunity to talk about it that much.

👍 Wow!!! good point for you.

__m512i _mm512_loadu_epi16 (void const* mem_addr)  => vmovdqu16
__m512i _mm512_loadu_epi32 (void const* mem_addr)  => vmovdqu32
__m512i _mm512_loadu_epi64 (void const* mem_addr)  => vmovdqu64
__m512i _mm512_loadu_epi8 (void const* mem_addr)   => vmovdqu8
__m512d _mm512_loadu_pd (void const* mem_addr)     => vmovupd
__m512h _mm512_loadu_ph (void const* mem_addr)     => vmovups
__m512 _mm512_loadu_ps (void const* mem_addr)      => vmovups
__m512i _mm512_loadu_si512 (void const* mem_addr)  => vmovdqu32

Looks like they define 6 different "vmov" instructions that do the same thing...
and that _mm512_loadu_ph is the same as _mm512_loadu_ps...
We may have to define a new one (gcc/clang/amd???):

 __m512bh _mm512_loadu_pbh(void const* mem_addr);
 void _mm512_storeu_ph (void * mem_addr, __m512bh a);

Just to avoid casting __m512bh 😉

I don't know why so many asm ops are defined. What it means is that, in register notation, the full __m512 vector appears "reversed" relative to memory order... I missed that.

char[64]="1,2,3 ... 64"  <=> __m512 = "64,63,...2,1"

(I'll try to write more useful issues next time.)