
Signed integer addressing downsides.

 

It isn't uncommon to hear things like "you should always use signed integers" from some developers.

Well, here is one downside of signed integers that bit me with regard to performance.

I had some SIMD code that needed to do some strided 64-byte loads.

The actual code is more complex, but here is a simplified version to demonstrate.

 

   
#include <immintrin.h>

typedef long long i64;

__m256i StridedLoadsSigned(i64* p, __m128i adr){
    alignas(16) int Mem[4];              // _mm_store_si128 requires 16-byte alignment
    _mm_store_si128((__m128i*)Mem, adr); // spill the four 32-bit indices to memory

    i64* low  = p + Mem[0];
    i64* b    = p + Mem[1];
    i64* c    = p + Mem[2];
    i64* high = p + Mem[3];
    // Build the two 128-bit halves from two 64-bit loads each, then combine.
    __m128i v  = _mm_loadl_epi64((const __m128i*)(low));
    __m128i v3 = _mm_loadl_epi64((const __m128i*)(c));
    __m128i v2 = _mm_insert_epi64(v,  *(i64*)(b),    1);
    __m128i v4 = _mm_insert_epi64(v3, *(i64*)(high), 1);
    __m256i a = _mm256_setr_m128i(v2, v4);

    return a;
}
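For context, a hypothetical call site (the index values here are made up for illustration) would look something like this:

    // Hypothetical usage: gathers p[3], p[11], p[40] and p[100] into one __m256i.
    __m128i idx  = _mm_setr_epi32(3, 11, 40, 100);
    __m256i vals = StridedLoadsSigned(p, idx);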

 

Here is the horror story MSVC generates for StridedLoadsSigned:

 

        vpextrd eax, xmm1, 3
        movsxd  r8, eax
        mov     r9, rcx
        vpextrd eax, xmm1, 2
        movsxd  rdx, eax
        vpextrd eax, xmm1, 1
        vmovq   xmm0, QWORD PTR [rcx+rdx*8]
        vpinsrq xmm3, xmm0, QWORD PTR [rcx+r8*8], 1
        movsxd  rdx, eax
        vmovd   eax, xmm1
        movsxd  rcx, eax
        vmovq   xmm0, QWORD PTR [r9+rcx*8]
        vpinsrq xmm1, xmm0, QWORD PTR [r9+rdx*8], 1
        vinsertf128 ymm0, ymm1, xmm3, 1    

 

According to uiCA, it has a predicted throughput of 14 cycles and issues 19 uops (Skylake).

Now let's do one tiny change and make Mem unsigned.

unsigned int Mem[4];

And here is what MSVC generates now:

        vpextrd edx, xmm1, 2
        vpextrd r8d, xmm1, 3
        vmovd   eax, xmm1
        vmovq   xmm0, QWORD PTR [rcx+rdx*8]
        vpinsrq xmm3, xmm0, QWORD PTR [rcx+r8*8], 1
        vmovq   xmm0, QWORD PTR [rcx+rax*8]
        vpextrd edx, xmm1, 1
        vpinsrq xmm1, xmm0, QWORD PTR [rcx+rdx*8], 1
        vinsertf128 ymm0, ymm1, xmm3, 1

 

 
 

So what is the difference, and why is the second version so much simpler? With signed indices the compiler has to insert a movsxd for each index to sign-extend the 32-bit value into the 64-bit address; with unsigned indices it no longer needs to, because on x86-64 writing a 32-bit register already zero-extends it to 64 bits for free.
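To see the same effect in isolation, here is a hypothetical scalar example (not from the original code, just for illustration): indexing with a signed 32-bit value forces a movsxd, while a 32-bit unsigned index typically costs nothing extra.

    // Hypothetical scalar illustration, not part of the original code.
    i64 LoadSigned  (const i64* p, int idx)          { return p[idx]; } // typically movsxd + mov
    i64 LoadUnsigned(const i64* p, unsigned int idx) { return p[idx]; } // typically just mov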

 

A further amusement is that this simplified code is basically the equivalent of _mm256_i32gather_epi64, but gather is implemented so poorly on most CPUs that this code will outperform the hardware gather.
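For comparison, a gather-based version might look like the sketch below (StridedLoadsGather is just a name I made up; it assumes the same i64 typedef as above):

    // Minimal sketch of the hardware-gather equivalent, for comparison only.
    __m256i StridedLoadsGather(const i64* p, __m128i adr){
        // Gather four 64-bit elements using the four signed 32-bit indices in adr,
        // each scaled by sizeof(i64) == 8 bytes.
        return _mm256_i32gather_epi64((const long long*)p, adr, 8);
    }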

  On Zen2 _mm256_i32gather_epi64 emits 32 uops! 

The only CPUs I am aware of that might be better off with hardware gather are the Raptor Lake/Alder Lake P-cores (7 uops); unfortunately, those are paired with E-cores where gather is *terrible* (54 uops). Even modern Zen4 still has a fairly terrible gather (24 uops).