It isn't uncommon to hear things like "you should always use signed integers" from some developers.
Well, here is one downside of signed integers that bit me with regard to performance.
I had some SIMD code that needed to do some strided 64 byte loads.
The actual code is more complex, but here is a simplified version to demonstrate.
Here is the horror story MSVC generates for this:
According to uiCA, it has a predicted throughput of 14 cycles and issues 19 uops (Skylake).
Now let's make one tiny change and make Mem unsigned.
unsigned int Mem[4];
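So what is the difference, and why is the second one so much simpler? Well, the compiler no longer felt the need to insert sign extensions, which carry the sign of each 32-bit integer into the 64-bit address. With unsigned values the upper 32 bits are simply zero, and on x86-64 writing a 32-bit register already zeroes them for free.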
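To make the difference concrete, here is a minimal sketch of the two widenings the compiler has to choose between when a 32-bit index becomes part of a 64-bit address. The helper names widen_signed and widen_unsigned are made up for illustration and are not from the original code.

```cpp
#include <cstdint>

// Hypothetical helpers, for illustration only.

// Widening a signed 32-bit index copies the sign bit into the upper 32 bits.
// On x86-64 this typically costs an explicit sign-extension instruction.
inline std::uint64_t widen_signed(std::int32_t index) {
    return static_cast<std::uint64_t>(static_cast<std::int64_t>(index));
}

// Widening an unsigned 32-bit index just fills the upper 32 bits with zeros,
// which x86-64 already does whenever a 32-bit register is written.
inline std::uint64_t widen_unsigned(std::uint32_t index) {
    return static_cast<std::uint64_t>(index);
}
```

For non-negative indices both give the same result, but the compiler generally can't prove the values are non-negative, so with signed elements it has to keep the extension around.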
A further amusement is that this simplified code is basically the equivalent of _mm256_i32gather_epi64, but gather is implemented so poorly on most CPUs that this code will outperform the hardware gather.
On Zen2 _mm256_i32gather_epi64 emits 32 uops!
The only CPUs that I am aware of that might be better off with hardware gather are the Raptor/Alder Lake P cores (7 uops). Unfortunately, those are paired with E cores where gather is *terrible* (54 uops). Even modern Zen4 still has a fairly terrible gather (24 uops).
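For reference, here is a rough scalar sketch of what a 4 x 64-bit gather with 32-bit indices computes. It is not the code from this post, and gather4_epi64 is a made-up name. It assumes the same semantics as the intrinsic: each 32-bit index is sign-extended, multiplied by a byte scale, and added to the base address.

```cpp
#include <array>
#include <cstdint>
#include <cstring>

// Hypothetical scalar reference, for illustration only.
// Loads four 64-bit values from base + index[lane] * scale (scale in bytes).
inline std::array<std::int64_t, 4> gather4_epi64(const void* base,
                                                 const std::int32_t (&index)[4],
                                                 int scale) {
    std::array<std::int64_t, 4> out{};
    const auto* bytes = static_cast<const unsigned char*>(base);
    for (int lane = 0; lane < 4; ++lane) {
        // The 32-bit index is sign-extended before being scaled.
        const std::int64_t offset = static_cast<std::int64_t>(index[lane]) * scale;
        // memcpy keeps the load well-defined for any alignment.
        std::memcpy(&out[lane], bytes + offset, sizeof(std::int64_t));
    }
    return out;
}
```

Note that the indices here are signed as well, so a hand-rolled version with unsigned indices is not a strict drop-in replacement if negative offsets are ever needed.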