I've just been writing a small library to add SIMD support for some common operations to Godot:
https://github.com/lawnjelly/godot-simd
I'm now detecting the CPU caps for x86 / x86_64, so that I can choose a different codepath at runtime. The problem I am facing is that GCC is complaining, e.g.
/usr/lib/gcc/x86_64-linux-gnu/5/include/pmmintrin.h:56: error: inlining failed in call to always_inline '__m128 _mm_hadd_ps(__m128, __m128)': target specific option mismatch
This happens when I try to use an SSE intrinsic from a higher instruction set than the one the compilation flags target (in this case an SSE3 instruction). I can *solve* this by adding e.g. -mavx to the compilation flags, but I am worried that this will allow AVX instructions to be used throughout, which I don't want because it would crash on older CPUs.
I never had this problem before: I don't remember doing anything special to get around it, and my builds ran successfully on both older and newer CPUs. I would have expected the sensible compiler behaviour to be to follow the flags unless intrinsics are explicitly used (in which case the intrinsic's instructions are allowed). Apparently not, though.
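For context, here is a rough sketch of the kind of runtime detection I mean. It is not the actual code from the library: CpuCaps, detect_cpu_caps, SimdPath and best_path are hypothetical names made up for illustration, and it assumes GCC or clang on x86, where the __builtin_cpu_init and __builtin_cpu_supports builtins are available (Visual Studio would need a different mechanism).

```cpp
// Hypothetical capability record; the field names are illustrative only.
struct CpuCaps {
    bool sse2 = false;
    bool sse3 = false;
    bool avx = false;
};

// Hypothetical list of codepaths, ordered from least to most capable.
enum class SimdPath { Scalar, SSE2, SSE3, AVX };

// Query the CPU at runtime. On anything other than GCC/clang on x86 this
// sketch simply reports no capabilities, so callers fall back to scalar code.
CpuCaps detect_cpu_caps() {
    CpuCaps caps;
#if (defined(__GNUC__) || defined(__clang__)) && (defined(__x86_64__) || defined(__i386__))
    __builtin_cpu_init();
    caps.sse2 = __builtin_cpu_supports("sse2") != 0;
    caps.sse3 = __builtin_cpu_supports("sse3") != 0;
    caps.avx = __builtin_cpu_supports("avx") != 0;
#endif
    return caps;
}

// Pick the most capable path the CPU reports support for.
SimdPath best_path(const CpuCaps &caps) {
    if (caps.avx) return SimdPath::AVX;
    if (caps.sse3) return SimdPath::SSE3;
    if (caps.sse2) return SimdPath::SSE2;
    return SimdPath::Scalar;
}
```

The idea is that this file is built with the default flags only, and the result of best_path(detect_cpu_caps()) decides which implementation gets called.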
Anyway I found a mention of this in the gcc docs:
https://gcc.gnu.org/onlinedocs/gcc-4.5.3/gcc/i386-and-x86_002d64-Options.html
-mmmx
-mno-mmx
-msse
-mno-sse
-msse2
These switches enable or disable the use of instructions in the MMX, SSE, SSE2, SSE3, SSSE3, SSE4.1, AVX, AES, PCLMUL, SSE4A, FMA4, XOP, LWP, ABM or 3DNow! extended instruction sets. These extensions are also available as built-in functions: see X86 Built-in Functions, for details of the functions enabled and disabled by these switches.
To have SSE/SSE2 instructions generated automatically from floating-point code (as opposed to 387 instructions), see -mfpmath=sse.
GCC depresses SSEx instructions when -mavx is used. Instead, it generates new AVX instructions or AVX equivalence for all SSEx instructions when needed.
These options will enable GCC to use these extended instructions in generated code, even without -mfpmath=sse. Applications which perform runtime CPU detection must compile separate files for each supported architecture, using the appropriate flags. In particular, the file containing the CPU detection code should be compiled without these options.
They seem to recommend compiling a separate file for each SSE version, and changing the flags for each file so that the higher instruction set is limited to that file only.
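As I understand it, the layout would look something like the sketch below. To keep it in one listing I have used the GCC/clang target function attribute as a single-file stand-in for "this function lives in its own file built with -msse3"; that substitution is my own assumption and not what the docs describe, and it needs a GCC recent enough (4.9 or later, I believe) to allow including pmmintrin.h without -msse3. The names sum4_scalar, sum4_sse3 and choose_sum4 are hypothetical.

```cpp
#if (defined(__GNUC__) || defined(__clang__)) && (defined(__x86_64__) || defined(__i386__))
#include <pmmintrin.h> // SSE3 intrinsics such as _mm_hadd_ps
#define SKETCH_HAVE_X86_DISPATCH 1
#endif

// Baseline path: would live in a file built with the project's default flags.
float sum4_scalar(const float *v) {
    return v[0] + v[1] + v[2] + v[3];
}

#ifdef SKETCH_HAVE_X86_DISPATCH
// SSE3 path: in the per-file scheme this would sit alone in its own file
// (e.g. a hypothetical sum4_sse3.cpp) compiled with -msse3. The target
// attribute below is only a single-file stand-in for that.
__attribute__((target("sse3")))
float sum4_sse3(const float *v) {
    __m128 x = _mm_loadu_ps(v);
    x = _mm_hadd_ps(x, x); // (v0+v1, v2+v3, v0+v1, v2+v3)
    x = _mm_hadd_ps(x, x); // every lane now holds v0+v1+v2+v3
    return _mm_cvtss_f32(x);
}
#endif

using Sum4Fn = float (*)(const float *);

// Detection and dispatch: built WITHOUT -msse3 / -mavx, as the docs advise,
// so the compiler cannot emit newer instructions before the check has run.
Sum4Fn choose_sum4() {
#ifdef SKETCH_HAVE_X86_DISPATCH
    __builtin_cpu_init();
    if (__builtin_cpu_supports("sse3")) {
        return sum4_sse3;
    }
#endif
    return sum4_scalar;
}
```

The caller would fetch the pointer once at startup (Sum4Fn sum4 = choose_sum4();) and call through it afterwards, so only the isolated function ever contains SSE3 instructions.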
My question is: is this a good approach, and is it cross-platform? I need to get things compiling with clang and Visual Studio in addition to GCC.
The docs are a bit unclear on this. Does -mavx make the compiler generate its own AVX instructions, or does it just allow you to use them? Or does -marchavx do this? If so, can I simply combine -mavx with -marchsse2 to limit compiler-generated code to SSE2 while still allowing AVX through intrinsics?