(dev/ramblings) Optimizing Android apps with NEON
Posted on 2013-07-07, 6 comments, 92 +1's, imported from Google+/Chainfire

NOTICE: This content was originally posted to Google+, then imported here. Some formatting may be lost, links may be dead, and images may be missing.

The past few weeks I've spent a lot of time optimizing quite a few DSLR Controller routines. Aside from doing some things in smarter (less expensive) ways, rescheduling code, and parallelizing where possible to make more efficient use of multi-core chips, there's also a lot of hand-crafted NEON code in there.

I remember when MMX was still new-ish, and I got my first taste of SIMD instructions. I was already fond of hand-optimizing (rather useless) things in assembler, and MMX added a whole new level of cool. Intel also had some rather groovy tutorials on the subject, visually teaching you all the basics of SIMD in a matter of hours.

As it was back then, I had great fun doing this sort of optimization again - it's actually one of the things in development I enjoy (as well as curse) the most. These sorts of optimizations rarely pay off: they are very expensive in terms of programming time, and often do not bring the user any noticeable improvement. As such, it is rare that a viable opportunity to spend time on them presents itself.

So I guess you could say I was pretty excited to get to work on this. But I quickly found out I had been spoiled by Intel in the past - I found the quick start tutorials on NEON lacking. Or maybe my Google-Fu just abandoned me. 

I was happy to find out about NEON intrinsics though - it allows you to use NEON as C functions instead of having to write in assembler. I'm sorry to say that even though I've spent most of the past 6 years exclusively working with ARM devices, my ARM assembler skills still are no match for the X86 assembler skills I had way back when. So this was a very welcome discovery indeed.

If you're interested in doing some NEON optimizations yourself, here's a short list of links you will want to read:

Enabling NEON support in the NDK

http://www.kandroid.org/ndk/docs/CPU-ARM-NEON.html

You will need this to get started
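As a quick reference, and assuming the standard ndk-build setup described in that document, enabling NEON looks roughly like this (the .neon suffix, LOCAL_ARM_NEON, and the cpufeatures library are from the NDK docs; the file names are made up):

```make
# Android.mk - compile only the files marked with the .neon suffix
# with NEON support, keeping the rest safe for non-NEON devices
LOCAL_SRC_FILES := main.c helper-neon.c.neon

# ...or enable NEON for every source file in this module:
# LOCAL_ARM_NEON := true

# link the cpufeatures helper library for runtime detection
LOCAL_STATIC_LIBRARIES := cpufeatures
```

At runtime you then check `(android_getCpuFeatures() & ANDROID_CPU_ARM_FEATURE_NEON) != 0` (from `<cpu-features.h>`) before calling into the NEON routines, since ARMv7-A does not guarantee NEON is present.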

Introduction to NEON on iPhone

http://wanderingcoder.net/2010/06/02/intro-neon/

Yes, it's for iPhone, but this article has some background and cleared up some questions for me

Coding for NEON series

http://blogs.arm.com/software-enablement/161-coding-for-neon-part-1-load-and-stores/

http://blogs.arm.com/software-enablement/196-coding-for-neon-part-2-dealing-with-leftovers/

http://blogs.arm.com/software-enablement/241-coding-for-neon-part-3-matrix-multiplication/

http://blogs.arm.com/software-enablement/277-coding-for-neon-part-4-shifting-left-and-right/

http://blogs.arm.com/software-enablement/684-coding-for-neon-part-5-rearranging-vectors/

http://blogs.arm.com/software-enablement/961-coding-using-neon-technology/

NEON assembler reference

http://infocenter.arm.com/help/index.jsp?topic=/com.arm.doc.dui0489i/CJABFHEJ.html

NEON C reference

http://infocenter.arm.com/help/index.jsp?topic=/com.arm.doc.dui0491i/Badcdfad.html

NEON header

http://androidxref.com/4.2.2_r1/xref/prebuilts/gcc/linux-x86/arm/arm-eabi-4.6/lib/gcc/arm-eabi/4.6.x-google/include/arm_neon.h

Just "#include <arm_neon.h>" to use it.

Aside from these links, if there's one thing I feel is not stressed enough in the other articles and is the number one rule for NEON performance, it is this:

DO repeat yourself

For lengthy pieces of code this is very cumbersome to write (lots of copy/paste involved), but it is a must with NEON. If you've used a NEON register as the target of one instruction, you cannot immediately use that register again in the next instruction without introducing a small stall while the result becomes available. Working around this limitation is usually referred to as latency hiding. It is generally assumed you know this, and samples only very rarely include (or even mention) the interleaved instructions. Sometimes you can simply reorder your instructions to hide the latency, but in my case interleaving several iterations of a loop worked best - and was easiest.

For example, if your inner loop would do nothing but load data from an array, perform two AND operations, and store it back, you'd get something like:

uint64x1_t reg1 = vld1_u64( u64ptr );
reg1 = vand_u64( reg1, mask1 );
reg1 = vand_u64( reg1, mask2 );
vst1_u64( u64ptr, reg1 );
u64ptr++;

What you would want to make from that is:

uint64x1_t reg1 = vld1_u64( u64ptr );
uint64x1_t reg2 = vld1_u64( u64ptr + 1 );
uint64x1_t reg3 = vld1_u64( u64ptr + 2 );
uint64x1_t reg4 = vld1_u64( u64ptr + 3 );
reg1 = vand_u64( reg1, mask1 );
reg2 = vand_u64( reg2, mask1 );
reg3 = vand_u64( reg3, mask1 );
reg4 = vand_u64( reg4, mask1 );
reg1 = vand_u64( reg1, mask2 );
reg2 = vand_u64( reg2, mask2 );
reg3 = vand_u64( reg3, mask2 );
reg4 = vand_u64( reg4, mask2 );
vst1_u64( u64ptr, reg1 );
vst1_u64( u64ptr + 1, reg2 );
vst1_u64( u64ptr + 2, reg3 );
vst1_u64( u64ptr + 3, reg4 );
u64ptr += 4;

In this example (untested, but you get the idea) you're putting four iterations of the old loop into a single iteration of the new loop, interleaving the use of the registers. This will typically be about three (but not four) times faster. Of course, your mileage will vary with the instructions you use, and how many iterations you need to interleave for optimal performance is something you have to benchmark. For some routines there is no extra benefit beyond two, but in my case three or four usually worked best. As three is not an easy number to work with, I tend to use four. You will also need enough registers available for this; complicated routines may use too many registers to allow interleaving four iterations.

Another thing you need to be aware of is that while you can move data between NEON registers and normal ARM registers, in practice it is rarely a good idea. Moving data from a normal register into a NEON register isn't a big deal, but moving a NEON result back to a normal register is incredibly expensive, as the core has to wait for the NEON pipeline to deliver the value.

Last but not least, especially older articles will advise you to prefetch manually. Do this only with extreme care and benchmarking. Modern ARM chips have a very good automatic prefetcher, and most (but not all) of the time it will do a better job than you will. Prefetching badly can certainly be slower than not prefetching at all, so take care doing it. If you do want to prefetch, you can use the "__builtin_prefetch()" builtin.
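For completeness, a sketch of what manual prefetching looks like. The distance of 64 elements ahead is an arbitrary guess on my part - this is exactly the kind of number you must benchmark per device, and the builtin is only a hint that the compiler and CPU are free to ignore:

```c
#include <stddef.h>
#include <stdint.h>

/* Sum an array while hinting the cache a fixed distance ahead.
   __builtin_prefetch is a GCC/Clang builtin: the second argument (0)
   means "prefetch for read", the third (0) means low temporal locality. */
static uint64_t sum_with_prefetch(const uint32_t *p, size_t count) {
    uint64_t sum = 0;
    for (size_t i = 0; i < count; i++) {
        if (i + 64 < count)
            __builtin_prefetch(&p[i + 64], 0, 0); /* hint only, never faults */
        sum += p[i];
    }
    return sum;
}
```

Benchmark this against the plain loop on every chip you care about; on a recent core with a good hardware prefetcher the hint is often a wash or a loss.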

Abdu Fenniche commented on 2013-07-07 at 23:02:

Thank's

Osa Jam commented on 2013-07-08 at 01:06:

Thanks for your hard work +Chainfire and keep going

Joe Philipps commented on 2013-07-08 at 02:24:

I just have to wonder what the compiler writers are doing wrong if you are feeling the need to do such optimizations by hand as instruction scheduling.  I thought that was one of the things taken into account by the code generation phase for whatever architecture the compiler supports, at least at some level of optimization (e.g., -O2 -O3 or whatever for the GCC).  Or at least one would hope that's taking place.  Therefore you wouldn't have to copy/paste asm code and instead could concentrate on the algorithms.  There are just a whole lotta GCC options for example, and not sure which ones would help more than others for your task at hand...a bunch of -m that might describe the target processor better (e.g., make it aware the NEON instructions can be used in the resulting object module), as well as -f options.

hmmm....

Chainfire commented on 2013-07-08 at 03:00:

+Joe Philipps I am using -Ofast but I'm not using any of the automatic NEON vectorisations. This is because the library may be used on devices that do not support NEON, so you have to decide at runtime which routines to use. It would not surprise me if this is the reason the instruction interleave is not done automagically. On the other hand, the compiler could also have been smart enough to detect a routine uses NEON, so the entire routine could be optimized. I've also seen it said that NEON "intrinsic" (using it in C instead of ASM) support is rather young and far from perfect yet. Whatever the reason, it most certainly makes a massive difference on performance, and it's well known enough that several (though far from all) articles and examples mention you might want to do this.

Regardless, it would be interesting to see if separating the armv7a NDK lib into a specific NEON and non-NEON version is worth the effort, compiling the NEON variant with all possible optimizations enabled. See if it has any additional overall performance increase, and also test if this manual code interleaving is still needed.

Pietari Heino commented on 2013-07-08 at 05:11:
Natron Blake commented on 2013-07-10 at 00:30:

Thank you for sharing you knowledge with us. I look forward to seeing your posts and projects. I got my device totally optimal. But....I'm trying to root a SEMC X10a and cannot find unbroken links for files

This post is over a month old, commenting has been disabled.