SIMD Compile-time vs. run-time
- Compile-time
  - Use intrinsics and/or autovectorization by telling the compiler which extensions are supported on the target host. The executable cannot be used on hosts that do not support the included instructions.
    - Autovectorization will try to use the extensions you tell the compiler you support.
    - For intrinsics, if you want to support multiple SIMD extensions / host machines, you can rely on macro definitions set by the compiler (e.g., `__AVX512F__`) to manually ensure that only the matching implementation is included. E.g., see how Kuffo's PDX does it (a minimal standalone sketch of the same pattern follows at the end of this Compile-time section).
      - In Kuffo's PDX, C++'s preprocessor is used to include/exclude architecture-specific headers. Thus, if you pass `-march=native` on a host with AVX-512, the compiler will set the `__AVX512F__` definition, and an `#ifdef` in the source code then ensures that the desired implementation header is included (which in turn defines an implementation of some function).
      - In this case the `neon_computers.hpp` header directly uses the NEON intrinsics:

        ```cpp
        #ifdef __ARM_NEON
        #include "pdxearch/distance_computers/neon_computers.hpp"
        #endif
        ```
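  - A minimal standalone sketch of the same compile-time pattern (illustrative code, not taken from PDX; the function name is made up). Built with `-march=native` on an AVX-512F host, the compiler defines `__AVX512F__` and the intrinsic path is compiled in; otherwise the scalar fallback is used:

    ```cpp
    #include <cstddef>

    #ifdef __AVX512F__
    #include <immintrin.h>

    // AVX-512 path: 16 floats per iteration, chosen at compile time.
    float sum_f32(const float* data, std::size_t n) {
        __m512 acc = _mm512_setzero_ps();
        std::size_t i = 0;
        for (; i + 16 <= n; i += 16)
            acc = _mm512_add_ps(acc, _mm512_loadu_ps(data + i));
        alignas(64) float lanes[16];
        _mm512_store_ps(lanes, acc);
        float total = 0.0f;
        for (float lane : lanes) total += lane;
        for (; i < n; ++i) total += data[i];  // scalar tail
        return total;
    }
    #else
    // Fallback compiled when AVX-512F is not enabled for this build.
    float sum_f32(const float* data, std::size_t n) {
        float total = 0.0f;
        for (std::size_t i = 0; i < n; ++i) total += data[i];
        return total;
    }
    #endif
    ```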
- Run-time
  - Dynamic CPU dispatching (see the sketches at the end of this block for both approaches)
    - DIY: The source code can perform dynamic dispatch itself, e.g., run `CPUID` and then choose one out of a few "backends". This seems to be what SimSIMD does in its dynamic dispatch mode: you "prove" your capabilities once, after which the dynamic dispatch mechanism is initialized.
    - (Better) Or utilize modern compiler (GCC, Clang) support, e.g., `__attribute__((target_clones("avx2", "sse4.2", "default")))`, to compile multiple versions of a function and have the compiler insert the dispatch logic for you automatically.
      - At process load time it will determine the supported extensions and store the correct function pointer in the procedure linkage table (PLT).
        - This is better than doing it at run time (i.e., the DIY approach above).
        - https://stackoverflow.com/a/61005989
      - This is also known as function multiversioning (FMV) in GCC.
        - LWN - Function multi-versioning in GCC 6.
          - Great overview of `target_clones` and `target`.
      - LLVM Clang
        - https://stackoverflow.com/questions/39958935/does-clang-offer-anything-similar-to-gcc-6-xs-function-multi-versioning-target
          - LLVM 7.0: “…function multiversioning in Clang with the ‘target’ attribute for ELF-based x86/x86_64 targets…”
        - https://lf-rise.atlassian.net/wiki/spaces/HOME/pages/8586554/CT_01_009+-+Target+Attribute+Support+LLVM
          - The `target` attribute is supported in LLVM 18.
          - `target_clones` support is in progress: https://github.com/llvm/llvm-project/pull/85786
      - It seems there is also a `target` attribute where you provide each implementation yourself (also sketched below).
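    - A hedged sketch of the DIY approach (illustrative code, not SimSIMD's actual implementation): query CPU features once via the GCC/Clang builtins (which use `CPUID`-derived data), cache a function pointer to the chosen backend, and call through it afterwards.

      ```cpp
      #include <cstddef>

      // Two illustrative backends. The AVX2 one uses a per-function target
      // override so the compiler may emit AVX2 instructions just for it.
      float dot_scalar(const float* a, const float* b, std::size_t n) {
          float acc = 0.0f;
          for (std::size_t i = 0; i < n; ++i) acc += a[i] * b[i];
          return acc;
      }

      __attribute__((target("avx2")))
      float dot_avx2(const float* a, const float* b, std::size_t n) {
          float acc = 0.0f;
          for (std::size_t i = 0; i < n; ++i) acc += a[i] * b[i];  // autovectorized with AVX2
          return acc;
      }

      using dot_fn = float (*)(const float*, const float*, std::size_t);

      // "Prove" the capabilities once and pick a backend.
      dot_fn resolve_dot() {
          __builtin_cpu_init();                 // populate CPU feature data (GCC/Clang, x86)
          if (__builtin_cpu_supports("avx2"))   // effectively a cached CPUID check
              return dot_avx2;
          return dot_scalar;
      }

      // Public entry point: the resolver runs exactly once, on first call.
      float dot(const float* a, const float* b, std::size_t n) {
          static const dot_fn impl = resolve_dot();
          return impl(a, b, n);
      }
      ```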
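    - A hedged sketch of the compiler-assisted approach with `target_clones` (GCC; recent Clang per the notes above). The compiler emits one clone of the function per listed target plus a resolver; after the loader runs the resolver, the PLT entry points at the best clone for the host.

      ```cpp
      #include <cstddef>

      // One source-level definition; the compiler generates avx2, sse4.2 and
      // default clones plus the dispatch/resolver logic.
      __attribute__((target_clones("avx2", "sse4.2", "default")))
      float dot(const float* a, const float* b, std::size_t n) {
          float acc = 0.0f;
          for (std::size_t i = 0; i < n; ++i)
              acc += a[i] * b[i];   // each clone autovectorizes this loop with its own ISA
          return acc;
      }
      ```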
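    - A hedged sketch of FMV with the `target` attribute (C++ on GCC; Clang per the LLVM 7.0 note above): unlike `target_clones`, you write each version yourself (so the bodies can differ, e.g., use intrinsics), and the compiler still generates the resolver that picks one per host.

      ```cpp
      #include <cstddef>

      __attribute__((target("avx2")))
      float dot(const float* a, const float* b, std::size_t n) {
          // A hand-tuned AVX2 implementation would go here.
          float acc = 0.0f;
          for (std::size_t i = 0; i < n; ++i) acc += a[i] * b[i];
          return acc;
      }

      __attribute__((target("default")))
      float dot(const float* a, const float* b, std::size_t n) {
          float acc = 0.0f;
          for (std::size_t i = 0; i < n; ++i) acc += a[i] * b[i];
          return acc;
      }
      ```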
  - JIT autovectorization can use the widest SIMD instructions available on the host machine if you use specific types (e.g., .NET's `Vector<T>`).