Add vectorization sample. #1067
base: main
Conversation
Deriving pointer alignment essentially makes this do nothing, since for whatever reason (fighting with Studio maybe?), for new files it's deriving the wrong style.
Currently a draft because I seem to have encountered an AGP bug that makes cross-module dependencies flaky, so I may need to deal with that first. The code itself wouldn't change to accommodate that though. Aside from the todo about checking …
portable as the auto-vectorization implementation, with the only caveat being
that it is limited by the width of the vector registers for the target hardware.
To deal with problems that don't fit in the target's vector registers, you would
need to either alter the algorithm to tile the operations, or use Scalable
link to an explanation/wikipedia page?
It is linked? Is there a better doc I should be linking to? I didn't find much.
"tile the operations" isn't a phrase in my vocabulary.
Ah, gotcha. I've not actually written or seen this either, but it's what all the references I used said too. I'll see if I can find something that illustrates it better, but as I understand it, it's just a matter of breaking up e.g. a 16x16 matrix multiply into a grid of 4x4 block multiplies.
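For readers who hit the same question, here's a minimal sketch of what tiling a matrix multiply looks like, with made-up names and sizes (a 16x16 product computed as a grid of 4x4 blocks); it isn't code from the sample.

```cpp
// Sketch only: compute a 16x16 matrix product block by block, so each inner
// 4x4 product is small enough to map onto the hardware's vector registers.
// `c` is assumed to be zero-initialized by the caller.
constexpr int kFull = 16;  // full matrix dimension
constexpr int kTile = 4;   // tile (block) dimension

void MatmulTiled(const float a[kFull][kFull], const float b[kFull][kFull],
                 float c[kFull][kFull]) {
  for (int bi = 0; bi < kFull; bi += kTile) {
    for (int bj = 0; bj < kFull; bj += kTile) {
      for (int bk = 0; bk < kFull; bk += kTile) {
        // Accumulate one kTile x kTile block of C from blocks of A and B.
        for (int i = bi; i < bi + kTile; ++i) {
          for (int j = bj; j < bj + kTile; ++j) {
            float sum = c[i][j];
            for (int k = bk; k < bk + kTile; ++k) {
              sum += a[i][k] * b[k][j];
            }
            c[i][j] = sum;
          }
        }
      }
    }
  }
}
```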
vectorization/README.md (Outdated)
making for less concise code than is needed for a constrained vector size like
we have here, handle windowing of data to fit the hardware vector size for you.
For problems like the small matrix multiply we do here, it's overkill. For
portability to a wide variety of (Arm) CPUs, it can reduce the difficulty of
this makes it sound more available than it is ... "qualcomm has not shipped SVE" is a pretty huge caveat for real-world use.
That's fair. I've toned it down, but avoided being specific because I know we won't remember to come fix it as the hardware does roll out. PTAL
Added a fair amount of content to the readme for explaining FMV (to the best of my knowledge, anyway, and since I learned it all just now it might be wrong). That probably belongs on DAC rather than here, but it can go here for now.
that it is limited by the width of the vector registers for the target hardware.
To deal with problems that don't fit in the target's vector registers, you would
need to either alter the algorithm to tile the operations, or use Scalable
Vector Extensions (AKA [SVE]).
does SVE actually help with this problem? (have you considered that negge might be a better reviewer for this stuff? :-) )
I pointed him at it a few days ago but haven't heard back yet. Tagging @negge explicitly now that I know his GitHub username :)
support big endian, so we can ignore that caveat) is use the `*` operator and
leave the correct instruction selection up to Clang.

In other words, you should probably never use the Neon-specific approach. The
(this really seems worth a separate note on the d.a.c neon page...)
Yes. One thing at a time though :)
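For context, a rough sketch of the contrast the quoted README text describes; the type and function names are illustrative rather than taken from the sample:

```cpp
#if defined(__ARM_NEON)
#include <arm_neon.h>

// Neon intrinsics: this only builds for Arm targets.
float32x4_t MultiplyNeon(float32x4_t a, float32x4_t b) {
  return vmulq_f32(a, b);
}
#endif

// Clang/GCC vector extensions: portable across architectures; the compiler
// selects the appropriate multiply instruction for whatever target it builds.
typedef float Float4 __attribute__((vector_size(16)));

Float4 MultiplyPortable(Float4 a, Float4 b) {
  return a * b;
}
```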
we have here, handle windowing of data to fit the hardware vector size for you.
For problems like the small matrix multiply we do here, it's overkill. For
portability across various vector widths for the Arm CPUs that support SVE, it
can reduce the difficulty of writing SIMD code.
"In practice though, most mobile hardware has no SVE support, and all the hardware that does support SVE only has 128-bit vector lengths. In practice, SVE's portability is currently only between mobile and server (because server SVE implementations do sometimes have larger vector lengths)."?
The `target` attribute makes it easier to write multiple implementations for a
function that should be selected based on the runtime hardware. If benchmarking
shows that one implementation performs better on armv8.2 and a different
implementation performs better on armv8 (see the docs for more details on
maybe explicitly say 8.0 or something?
I'd written that originally, but it looked odd enough that I wasn't sure I'd ever actually seen it phrased that way before. SGTM though
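For anyone unfamiliar with FMV, a minimal sketch of what runtime selection looks like using Clang's AArch64 function multi-versioning attributes; the function name, feature choice, and bodies are illustrative and not taken from the sample, and a recent Clang with AArch64 FMV support is assumed:

```cpp
#include <cstddef>

// Each version would normally contain a different implementation (e.g. one
// tuned for armv8.2 features and a baseline armv8.0 fallback); the resolver
// picks the best supported version on the CPU the code actually runs on.
__attribute__((target_version("dotprod")))
int Sum(const int* values, std::size_t n) {
  int total = 0;
  for (std::size_t i = 0; i < n; ++i) total += values[i];
  return total;
}

__attribute__((target_version("default")))
int Sum(const int* values, std::size_t n) {
  int total = 0;
  for (std::size_t i = 0; i < n; ++i) total += values[i];
  return total;
}
```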
This adds a new generic vectorization sample to replace hello-neon. Most importantly, it covers non-Neon options for SIMD. One of the useful things it shows, for example, is that there's actually no reason to write SIMD code the way that hello-neon does any more.
This also solves a much simpler problem (small matrix multiplication), which makes it easier to see how to deal with the SIMD features rather than figuring out what a FIR filter is.
Finally, this sample benchmarks each of the implementations so it's obvious what is and isn't worth doing. I was sort of surprised that auto-vectorization didn't do better, and was pleased to learn that there's no reason at all to write Neon intrinsics.
I'll delete hello-neon after this merges and I've fixed up the doc links.
#1011
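For readers who haven't opened the sample: a hypothetical sketch of the kind of scalar baseline such a benchmark compares against (sizes and signatures are assumptions, not the sample's actual code). Built at -O2 or higher, Clang is free to auto-vectorize loops like these.

```cpp
constexpr int kDim = 4;

// Plain scalar 4x4 matrix multiply; no intrinsics, no vector types. The
// compiler's auto-vectorizer is what turns this into SIMD code.
void MatrixMultiply(const float a[kDim][kDim], const float b[kDim][kDim],
                    float out[kDim][kDim]) {
  for (int i = 0; i < kDim; ++i) {
    for (int j = 0; j < kDim; ++j) {
      float sum = 0.0f;
      for (int k = 0; k < kDim; ++k) {
        sum += a[i][k] * b[k][j];
      }
      out[i][j] = sum;
    }
  }
}
```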