microsoft / DirectXMath

DirectXMath is an all inline SIMD C++ linear algebra library for use in games and graphics apps

Home Page: https://walbourn.github.io/introducing-directxmath/


Loads and stores are very inefficient for 3D vectors

Const-me opened this issue

C++ source code:

#include <vector>
#include <immintrin.h>
#include <DirectXMath.h>

template<class TLoadStore>
void scaleStdVector( std::vector<DirectX::XMFLOAT3>& vec )
{
    const __m128 mul = _mm_setr_ps( 2, 1, 0.5, 0 );
    for( DirectX::XMFLOAT3& f3 : vec )
    {
        __m128 v = TLoadStore::load( f3 );
        v = _mm_mul_ps( v, mul );
        TLoadStore::store( f3, v );
    }
}

// Reference implementation which calls DirectXMath functions.
struct BuiltinLoadStore
{
    __forceinline static __m128 load( const DirectX::XMFLOAT3& f3 )
    {
        return DirectX::XMLoadFloat3( &f3 );
    }
    __forceinline static void store( DirectX::XMFLOAT3& f3, __m128 v3 )
    {
        DirectX::XMStoreFloat3( &f3, v3 );
    }
};

// Improved implementation which uses SSE2 instructions
struct SSE2LoadStore
{
    __forceinline static __m128 load( const DirectX::XMFLOAT3& f3 )
    {
        // Load XY values with a single movsd instruction.
        const __m128 xy = _mm_castpd_ps( _mm_load_sd( reinterpret_cast<const double*>( &f3 ) ) );
        // Load Z value
        const __m128 z = _mm_load_ss( &f3.z );
        // Combine the 2
        return _mm_movelh_ps( xy, z );
    }
    __forceinline static void store( DirectX::XMFLOAT3& f3, __m128 v3 )
    {
        // Store XY values
        _mm_store_sd( reinterpret_cast<double*>( &f3 ), _mm_castps_pd( v3 ) );
        // Store Z value
        const __m128 z = _mm_movehl_ps( v3, v3 );
        _mm_store_ss( &f3.z, z );
    }
};

// Improved implementation which uses SSE 4.1 instructions
struct SSE41LoadStore
{
    __forceinline static __m128 load( const DirectX::XMFLOAT3& f3 )
    {
        __m128d dbl = _mm_load_sd( reinterpret_cast<const double*>( &f3 ) );
        __m128 xy = _mm_castpd_ps( dbl );
        // insertps can insert value directly from memory: https://www.felixcloutier.com/x86/insertps
        // Unfortunately, VC++ 2017 compiler is unable to combine 2 following instructions into a single insertps.
        // Other compilers, or newer versions of VC++, can do better.
        __m128 z = _mm_load_ss( &f3.z );
        return _mm_insert_ps( xy, z, 0x20 );
        // If you're thinking about _mm_insert_epi32 (which normally combines both load & insert just fine), it will probably be slower on many CPUs due to cross-domain latency.
        // https://www.agner.org/optimize/microarchitecture.pdf search for "Data bypass delays" (Intel) or "Data delay between different execution domains" (AMD)
        // Using `loadsd` to load 2 floats is fine because floats and doubles are in the same domains.
    }

    __forceinline static void store( DirectX::XMFLOAT3& f3, __m128 v3 )
    {
        _mm_store_sd( reinterpret_cast<double*>( &f3 ), _mm_castps_pd( v3 ) );
        // extractps can store directly to memory: https://www.felixcloutier.com/x86/extractps
        // Again, VC++ compiler still compiles the line below into 2 instructions, extractps into EAX register, only then store the value.
        *reinterpret_cast<int*>( &f3.z ) = _mm_extract_ps( v3, 2 );
    }
};

On my PC, scaleStdVector<BuiltinLoadStore> takes 71.9µs to process an 80000-element std::vector.
Both the SSE2 and SSE 4.1 versions take 43.2µs. That's a huge difference, roughly 1.7 times faster.
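
For context, a minimal timing harness along these lines reproduces the comparison. The element count is the one mentioned above, but the iteration count, timing method, and function name are assumptions, not the exact benchmark used:

#include <chrono>
#include <cstdio>
#include <vector>
#include <DirectXMath.h>

// Hypothetical harness; the actual benchmark code wasn't posted in the issue.
template<class TLoadStore>
double measureMicroseconds()
{
    std::vector<DirectX::XMFLOAT3> vec( 80000, DirectX::XMFLOAT3( 1.0f, 2.0f, 3.0f ) );
    constexpr int iterations = 1000;
    const auto start = std::chrono::high_resolution_clock::now();
    for( int i = 0; i < iterations; i++ )
        scaleStdVector<TLoadStore>( vec );
    const auto elapsed = std::chrono::high_resolution_clock::now() - start;
    return std::chrono::duration<double, std::micro>( elapsed ).count() / iterations;
}

// Usage:
// printf( "Builtin: %.1f us\n", measureMicroseconds<BuiltinLoadStore>() );
// printf( "SSE2:    %.1f us\n", measureMicroseconds<SSE2LoadStore>() );
// printf( "SSE4.1:  %.1f us\n", measureMicroseconds<SSE41LoadStore>() );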

Thanks. Appreciate the feedback!

See this pull request

@walbourn I’m not sure about this part:

_mm_storel_epi64( reinterpret_cast<__m128i*>(pDestination), _mm_castps_si128(V) );

I haven’t measured, and even if I had it would be CPU-specific, but I think you should use _mm_store_sd there instead.

When you use integer instructions like movq on float/double values, you’re introducing cross-domain latency. This is the reason _mm_and_ps and _mm_and_si128 map to two separate instructions, andps and pand respectively, even though they compute the same thing.
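
For concreteness, a standalone helper showing what that suggestion amounts to might look like the sketch below (the helper name is made up, and this isn’t necessarily the code that ended up in the pull request):

// A minimal sketch of the suggested float-domain store: write X and Y with a
// single movsd/movlps, then Z with movss, so the whole sequence stays in the
// floating-point execution domain.
inline void storeFloat3FloatDomain( DirectX::XMFLOAT3* pDestination, __m128 V )
{
    _mm_store_sd( reinterpret_cast<double*>( pDestination ), _mm_castps_pd( V ) );
    const __m128 z = _mm_movehl_ps( V, V );
    _mm_store_ss( &pDestination->z, z );
}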

@Const-me Sounds reasonable. Is there anything we can do here to take advantage of the fact that we know it's 16-byte aligned?

@Const-me Actually, I can only use the _mm_store_pd trick if the target memory is 16-byte aligned! Using _mm_load_sd is good as it doesn't have any alignment requirements.

@Const-me: To sum up:

  • _mm_load_pd is a great way to load 2 floats. It has no memory alignment requirement.

  • _mm_store_pd is a great way to write the lower 2 floats if the destination memory is 16-byte aligned.

  • _mm_storel_epi64 writes 2 floats. It's a little unclear about the destination memory requirement. It is prototyped using __m128i* which implies 16-byte alignment is required. The Intel Intrinsics Guide doesn't state whether it requires 16-byte alignment or not. The MOVQ m64, xmm docs page is a little vague. According to this blog post the instruction supports unaligned memory, but compilers can get confused since the type is aligned.

@walbourn 1 – I’m not aware of any instruction which does an aligned load/store but moves less than the complete register (128/256 bits). If I’m not mistaken, on modern CPUs there’s no penalty for using unaligned load/store instructions when the memory address is actually aligned. See this question on SO: https://stackoverflow.com/q/20259694/126995, especially the answers and comments; note it’s from 2013, so the CPUs they called “older” are prehistoric by now.

2 – I’m not sure I follow. When you want to load or store the lower 2 floats of a __m128, use the _mm_load_sd / _mm_store_sd intrinsics. That’s what I’m doing in my example code above. For both of them, the Intel docs say “mem_addr does not need to be aligned on any particular boundary.” Float and double instructions are in the same execution domain (at least on current and past CPUs), so I don’t think this causes the cross-domain latency that happens when mixing integer and float instructions (e.g. storing floats with _mm_storel_epi64 or inserting floats with _mm_insert_epi32).

When you want to load/store a complete SIMD register, all 128 bits of it like _mm_load_pd / _mm_store_pd do, you don’t need any tricks. All 3 data types (floats, doubles, integers) have corresponding intrinsics, in both aligned and unaligned versions: _mm_load_ps, _mm_loadu_ps, _mm_loadu_si128, _mm_store_si128, and so on.

If you want a faster version of the aligned float3 load, you can try _mm_load_ps, then _mm_and_ps with the g_XMMask3 constant. This may cause inconveniences due to false positives in ASAN and similar tools. It works nevertheless, because OS memory allocation and protection work at page granularity: pages are aligned, and an aligned XMFLOAT3A is guaranteed to not cross a page boundary, so if the process has access to read 3 floats, it can read 4 of them just fine. The trick won’t work for aligned stores. You’re guaranteed to have write access for the same reason, but overwriting the extra 4 bytes is not OK for correctness: the compiler may place a float or int variable immediately after the aligned XMFLOAT3A. And load-modify-store is much slower than 2 stores, one 64-bit and one 32-bit.
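
A sketch of that masked aligned load (the function name is made up; g_XMMask3 is the existing DirectXMath constant):

// Sketch only: single aligned 16-byte load, then mask off the W lane.
// Reads 4 bytes past the end of the XMFLOAT3A, which can't page-fault for
// aligned data but may upset ASAN-style tools.
inline DirectX::XMVECTOR loadFloat3AMasked( const DirectX::XMFLOAT3A* pSource )
{
    const __m128 v = _mm_load_ps( &pSource->x );
    return _mm_and_ps( v, DirectX::g_XMMask3 );
}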

@Const-me Is the Intel Intrinsics Guide entry on _mm_store_pd wrong?

Store 128-bits (composed of 2 packed double-precision (64-bit) floating-point elements) from a into memory. mem_addr must be aligned on a 16-byte boundary or a general-protection exception may be generated.

I understand that in practice modern machines no longer really care about aligned vs. unaligned. Do Visual C++ and clang/LLVM no longer use movapd and just use movupd for this intrinsic?

@walbourn

Is the Intel Intrinsics Guide entry on _mm_store_pd wrong?

The documentation is fine. _mm_store_pd stores all 128 bits of the value. It’s the complete equivalent of _mm_store_ps.

When loading/storing 3D float vectors as opposed to 4D, 128 bits are too many, we only need to load/store the first 96 of them. The code I wrote above does that with 2 load/store instructions: _mm_load_sd / _mm_store_sd to handle the first 64 bits, then another instruction for Z.

_mm_store_pd requires alignment just like _mm_store_ps (there are unaligned versions of both, _mm_storeu_pd and _mm_storeu_ps). Meanwhile, _mm_store_sd is guaranteed to work for unaligned addresses, just like _mm_store_ss.

Do Visual C++ and clang/LLVM no longer use movapd and just use movupd for this intrinsic?

In many cases, when compiling for AVX the compiler emits neither of them: it often merges the load into the subsequent operation that uses the value. When you build for AVX, the compiler uses VEX encoding (https://en.wikipedia.org/wiki/VEX_prefix) for everything, and VEX-encoded instructions may read directly from unaligned RAM addresses.

When compiling for SSE, compilers don’t use VEX. In this case, unaligned loads can’t be merged, as that would cause runtime “unaligned access” crashes; that’s when you’ll see movupd, movdqu or something else, depending on which intrinsic was used. In my experience, VC++ 2015 and 2017 respect the source, i.e. the instruction you get directly corresponds to the intrinsic you used.

When the compiler doesn’t merge, that’s when you’ll see these movapd/movupd/etc. instructions. This happens at least in the following cases (there’s a small sketch after the list).

  1. When the loaded value is used more than once.

  2. When you write code where a single intrinsic takes more than 1 value from memory. Most instructions can only take a single operand from RAM.

  3. As described above, when you’re building for SSE and the load was unaligned, e.g. _mm_loadu_ps.
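
Here’s a small illustration of the first case; the comments describe the codegen I’d typically expect from VC++ or clang when building with /arch:AVX, not guaranteed output:

#include <immintrin.h>

// Single use: with VEX encoding the compiler is free to fold the unaligned
// load into the multiply, e.g. vmulps xmm0, xmm1, xmmword ptr [mem].
__m128 scaleOnce( const float* p, __m128 mul )
{
    return _mm_mul_ps( _mm_loadu_ps( p ), mul );
}

// Used twice: the loaded value has to live in a register, so a standalone
// vmovups (or movups when building for SSE) typically shows up.
__m128 scaleTwice( const float* p, __m128 mul )
{
    const __m128 v = _mm_loadu_ps( p );
    return _mm_add_ps( _mm_mul_ps( v, mul ), v );
}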

Gah, right... Sorry, it's been a long week. _mm_store_sd the scalar store, not _mm_store_pd the vector store...

Thanks again. Everything in this commit