Getting the compiler to auto-vectorize code in a sensible manner - c

I'm trying to figure out how to structure the main loop code for a numerical simulation in such a way that the compiler generates nicely vectorized instructions in a compact way.
The problem is most easily explained with C pseudocode, but I also have a Fortran version which is affected by the same kind of issue. Consider the following loop, where lots_of_code_* are some complicated expressions which produce a fair number of machine instructions.
void process(const double *in_arr, double *out_arr, int len)
{
    for (int i = 0; i < len; i++)
    {
        const double a = lots_of_code_a(i, in_arr);
        const double b = lots_of_code_b(i, in_arr);
        ...
        const double z = lots_of_code_z(i, in_arr);
        out_arr[i] = final_expr(a, b, ..., z);
    }
}
When compiled with an AVX target the Intel compiler generates code which goes like
process:
AVX_loop
AVX_code_a
AVX_code_b
...
AVX_code_z
AVX_final_expr
...
SSE_loop
SSE_instructions
...
scalar_loop
scalar_instructions
...
The resulting binary is already quite sizable. My actual calculation loop, though, looks more like the following:
void process(const double *in_arr1, ... , const double *in_arr30,
             double *out_arr1, ... , double *out_arr30,
             int len)
{
    for (int i = 0; i < len; i++)
    {
        const double a1 = lots_of_code_a(i, in_arr1);
        ...
        const double a30 = lots_of_code_a(i, in_arr30);
        const double b1 = lots_of_code_b(i, in_arr1);
        ...
        const double b30 = lots_of_code_b(i, in_arr30);
        ...
        ...
        const double z1 = lots_of_code_z(i, in_arr1);
        ...
        const double z30 = lots_of_code_z(i, in_arr30);
        out_arr1[i] = final_expr1(a1, ..., z1);
        ...
        out_arr30[i] = final_expr30(a30, ..., z30);
    }
}
This results in a very large binary indeed (400KB for the Fortran version, 800KB for C99). If I now define lots_of_code_* as functions, then each function gets turned into non-vectorized code. Whenever the compiler decides to inline a function it does vectorize it, but it also seems to duplicate the code at each call site.
In my mind, the ideal code should look like:
AVX_lots_of_code_a:
AVX_code_a
AVX_lots_of_code_b:
AVX_code_b
...
AVX_lots_of_code_z:
AVX_code_z
SSE_lots_of_code_a:
SSE_code_a
...
scalar_lots_of_code_a:
scalar_code_a
...
...
process:
AVX_loop
call AVX_lots_of_code_a
call AVX_lots_of_code_a
...
SSE_loop
call SSE_lots_of_code_a
call SSE_lots_of_code_a
...
scalar_loop
call scalar_lots_of_code_a
call scalar_lots_of_code_a
...
This clearly results in much smaller code which is still just as well optimized as the fully-inlined version. With luck it might even fit in L1.
Obviously I can write this myself using intrinsics or whatever, but is it possible to get the compiler to automatically vectorize in the way described above through "normal" source code?
I understand that the compiler will probably never generate separate symbols for each vectorized version of the functions, but I thought it could still just inline each function once inside process and use internal jumps to repeat the same code block, rather than duplicating code for each input array.

A formal answer to questions like yours:
Consider using OpenMP 4.0 SIMD-enabled (I didn't say inlined) functions, or equivalent proprietary mechanisms. Available in the Intel compiler or in a fresh GCC 4.9.
See more details here: https://software.intel.com/en-us/node/522650
Example:
// Invoke this function from a vectorized loop
#pragma omp declare simd
int vfun(int x, int y)
{
    return x*x + y*y;
}
This gives you the capability to vectorize a loop containing function calls without inlining them, and as a result without huge code generation. (I didn't really explore your code snippet in detail; instead I answered the question you asked in textual form.)
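As a rough illustration of how that could map onto the original question, here is a minimal sketch, assuming the lots_of_code_* helpers can be rewritten in element-wise form (the single-argument prototypes and the two-term final expression are hypothetical):
/* Hypothetical sketch: SIMD-enabled helper declarations let the compiler
   emit one vector variant per helper instead of inlining and duplicating
   the code at every call site. Requires OpenMP 4.0 support. */
#pragma omp declare simd
double lots_of_code_a(double x);   /* assumed element-wise form */

#pragma omp declare simd
double lots_of_code_b(double x);

void process(const double *in_arr, double *out_arr, int len)
{
    #pragma omp simd
    for (int i = 0; i < len; i++)
    {
        const double a = lots_of_code_a(in_arr[i]);
        const double b = lots_of_code_b(in_arr[i]);
        out_arr[i] = a + b;        /* stand-in for final_expr(a, b, ...) */
    }
}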

The immediate problem that comes to mind is the lack of restrict on the input/output-pointers. The input is const though, so it's probably not too much of a problem, unless you have multiple output-pointers.
Other than that, I recommend -fassociative-math or whatever the ICC equivalent is. Structurally, you seem to iterate over the array, doing multiple independent operations on the array that are only munged together in the very end. Strict FP compliance might kill you on the array operations. Finally, there's probably no way this will get vectorized if you need more intermediate results than vector_registers - input_arrays.
Edit:
I think I see your problem now. You call the same function on different data, and want each result stored independently, right? The problem is that the same function always writes to the same output register, so subsequent vectorized calls would clobber earlier results. The solution could be:
A stack of results (either in memory or like the old x87 FPU stack) that gets pushed every time. If in memory, it is slow; if x87, it's not vectorized. Bad idea.
Effectively multiple functions to write into different registers. Code duplication. Bad idea.
Rotating registers, like on the Itanium. You don't have an Itanium? You're not alone.
It's possible that this can't be easily vectorized on current architectures. Sorry.
Edit, you're apparently fine with going to memory:
void function1(double const *restrict inarr1, double const *restrict inarr2,
               double *restrict outarr, size_t n)
{
    for (size_t i = 0; i < n; i++)
    {
        double intermediateres[NUMFUNCS];
        double *rescursor = intermediateres;
        *rescursor++ = mungefunc1(inarr1[i]);
        *rescursor++ = mungefunc1(inarr2[i]);
        *rescursor++ = mungefunc2(inarr1[i]);
        *rescursor++ = mungefunc2(inarr2[i]);
        ...
        outarr[i] = finalmunge(intermediateres[0], ..., intermediateres[NUMFUNCS-1]);
    }
}
This might be vectorizable. I don't think it'll be all that fast, going at memory speed, but you never know till you benchmark.

If you moved the lots_of_code blocks into separate compilation units without the for loop, they will probably not vectorize. Unless the compiler has a motive for vectorization, it will not vectorize the code, because vectorization might lead to longer latencies in the pipelines. To get around that, split the loop into 30 loops, and put each one of them in a separate compilation unit, like this (a sketch of one such unit follows the snippet):
for (int i = 0; i < len; i++)
{
    lots_of_code_a(i, in_arr1);
}
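A minimal sketch of what one such compilation unit might look like, assuming the intermediate results get stored into a temporary buffer (the compute_a and tmp_a names and the prototype are hypothetical, not from the question):
/* Hypothetical sketch: one compilation unit for the "a" block only.
   tmp_a is an assumed buffer for the intermediate results. */
double lots_of_code_a(int i, const double *in_arr);   /* assumed prototype */

void compute_a(const double *in_arr, double *tmp_a, int len)
{
    for (int i = 0; i < len; i++)
        tmp_a[i] = lots_of_code_a(i, in_arr);
}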

Related

how to include several lines of C code in one line of source code (AVR-GCC)

Is there a way to add an identifier that the compiler would replace with multiple lines of code?
I read up on macros and inline functions but am getting nowhere.
I need to write an Interrupt Service Routine and not call any functions for speed.
Trouble is I have several cases where I need to use a function so currently I just repeat all several lines in many places.
for example:
void ISR()
{
    int a = 1;
    int b = 2;
    int c = 3;
    // do some stuff here ...
    a = 1;
    b = 2;
    c = 3;
    // do more stuff here ...
    a = 1;
    b = 2;
    c = 3;
}
The function is many pages and I need the code to be more readable.
I basically agree with everyone else's reservations with regard to using macros for this. But, to answer your question, multiline macros can be created with a backslash.
#define INIT_VARS \
    int a = 1;    \
    int b = 2;    \
    int c = 3;

#define RESET_VARS \
    a = 1;         \
    b = 2;         \
    c = 3;

void ISR()
{
    INIT_VARS
    // do some stuff here ...
    RESET_VARS
    // do more stuff here ...
    RESET_VARS
}
You can use an inline function, which will be integrated into the place where it is called instead of actually being called (note that this behavior depends on several things, like compiler support, optimization settings, or the use of the -fno-inline flag). See the GCC documentation on inline functions.
For completeness - the other way would be defining // do some stuff here... as a preprocessor macro, which again gets inserted in the place where it is used, this time by the preprocessor - so there is no type safety, and it is harder to debug and to read. The usual good rule of thumb is to not write a macro for something that can be done with a function.
You are correct - it is recommended that you not place function calls in an ISR. It's not that you cannot do it, but it can be a memory burden depending on the type of call. The primary reason is for timing. ISRs should be quick in and out. You shouldn't be doing a lot of extended work inside them.
That said, here's how you can actually use inline functions.
// In main.c
#include "static_defs.h"
//...
void ISR() {
    inline_func();
    // ...
    inline_func();
}

// In static_defs.h
static inline void inline_func(void) __attribute__((always_inline));
// ... Further down in file
static inline void inline_func(void) {
    // do stuff
}
The compiler will basically just paste the "do stuff" code into the ISR multiple times, but as I said before, if it's a complex function, it's probably not a good idea to do it multiple times in a single ISR, inlined or not. It might be better to set a flag of some sort and do it in your main loop so that other interrupts can do their job, too. Then, you can use a normal function to save program memory space. That depends on what you are really doing and when/why it needs done.
If you are actually setting variables and returning values, that's fine too, although, setting multiple variables would be done by passing/returning a structure or using a pointer to a structure that describes all of the relevant variables.
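For the structure idea, a minimal sketch (the isr_vars_t type and reset_vars helper are hypothetical names, just to illustrate the pattern):
#include <stdint.h>

/* Hypothetical sketch: group the related variables in a struct so one
   inline helper can set them all, instead of repeating the assignments. */
typedef struct {
    uint8_t a, b, c;
} isr_vars_t;

static inline void reset_vars(isr_vars_t *v)
{
    v->a = 1;
    v->b = 2;
    v->c = 3;
}

void ISR()
{
    isr_vars_t v;
    reset_vars(&v);
    // do some stuff here ...
    reset_vars(&v);
    // do more stuff here ...
}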
If you'd prefer to use macros (I wouldn't, because function-like macros should be avoided), here's an example of that:
#define RESET_VARS() do { \
    a = 1;                \
    b = 2;                \
    c = 3;                \
} while (0)
//...
void ISR() {
    uint8_t a=1, b=2, c=3;
    RESET_VARS();
    // ...
    RESET_VARS();
}
Also, you said it was a hypothetical, but it's recommended to use the bit-width typedefs found in <stdint.h> (automatically included when you include <io.h>), such as uint8_t rather than int. On an 8-bit MCU with AVR-GCC, an int is a 16-bit signed variable, which requires (at least) 2 clock cycles for every operation that would have taken one with an 8-bit variable.

Why doesn't this OpenMP parallel for loop work properly?

I would like to implement OpenMP to parallelize my code. I am starting from a very basic example to understand how it works, but I am missing something...
So, my example looks like this, without parallelization:
int main() {
    ...
    for (i = 0; i < n-1; i++) {
        u[i+1] = (1+h)*u[i]; // Euler
        v[i+1] = v[i]/(1-h); // implicit Euler
    }
    ...
    return 0;
}
Here I omitted some parts in the "..." because they are not relevant. The code works, and if I print the u[] and v[] arrays to a file, I get the expected results.
Now, if I try to parallelize it just by adding:
#include <omp.h>

int main() {
    ...
    omp_set_num_threads(2);
    #pragma omp parallel for
    for (i = 0; i < n-1; i++) {
        u[i+1] = (1+h)*u[i]; // Euler
        v[i+1] = v[i]/(1-h); // implicit Euler
    }
    ...
    return 0;
}
The code compiles and the program runs, BUT the u[] and v[] arrays are half full of zeros.
If I set omp_set_num_threads( 4 ), I get three quarters of zeros.
If I set omp_set_num_threads( 1 ), I get the expected result.
So it looks like only the first thread is being executed, while not the other ones...
What am I doing wrong?
OpenMP assumes that each iteration of a loop is independent of the others. When you write this:
for (i = 0; i < n-1; i++) {
    u[i+1] = (1+h)*u[i]; // Euler
    v[i+1] = v[i]/(1-h); // implicit Euler
}
Iteration i of the loop is modifying the data used by iteration i+1. Meanwhile, iteration i+1 might be executing at the same time.
Unless you can make the iterations independent, this isn't a good use-case for parallelism.
And, if you think about what Euler's method does, it should be obvious that it is not possible to parallelize the code you're working on in this way. Euler's method calculates the state of a system at time t+1 based on its state at time t. Since you cannot know the state at t+1 without first knowing the state at t, there's no way to parallelize across the iterations of Euler's method.
u[i+1] = (1+h)*u[i];
v[i+1] = v[i]/(1-h);
is equivalent to
u[i] = pow((1+h), i)*u[0];
v[i] = v[0]*pow(1.0/(1-h), i);
therefore you can parallelize your code like this
#pragma omp parallel for
for (int i = 0; i < n; i++) {
    u[i] = pow((1+h), i)*u[0];
    v[i] = v[0]*pow(1.0/(1-h), i);
}
If you want to mitigate the cost of the pow function you can do it once per thread rather than once per iteration, like this (since the number of threads is much smaller than n):
#pragma omp parallel
{
    int nt = omp_get_num_threads();
    int t = omp_get_thread_num();
    int s = (t+0)*n/nt;
    int f = (t+1)*n/nt;
    u[s] = pow((1+h), s)*u[0];
    v[s] = v[0]*pow(1.0/(1-h), s);
    for (int i = s; i < f-1; i++) {
        u[i+1] = (1+h)*u[i];
        v[i+1] = v[i]/(1-h);
    }
}
You can also write your own pow(double, int) function optimized for integer powers.
Note that the relationship I used is not in fact 100% equivalent because floating point arithmetic is not associative. That's not usually a problem but it's something one should be aware of.
Before parallelizing your code you must identify its concurrency, i.e. the set of tasks that are logically happening at the same time and then figure out a way to make them actually happen in parallel.
As mentioned above, this is not a good example to apply parallelism to, because there is no concurrency in its nature. Attempting to use parallelism like that will lead to wrong results, due to so-called race conditions.
If you just want to learn how OpenMP works, try to come up with examples where you can clearly identify conceptually independent tasks. One of the simplest I can think of would be computing the area under a curve by means of integration, as in the sketch below.
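A minimal sketch of that integration idea, assuming a midpoint-rule sum combined with an OpenMP reduction (the integrand and bounds are just placeholders):
#include <omp.h>
#include <stdio.h>

/* Hypothetical sketch: each iteration handles its own sub-interval, so the
   iterations are independent and the partial sums are combined safely. */
static double f(double x) { return x * x; }   /* placeholder integrand */

int main(void)
{
    const int n = 1000000;
    const double a = 0.0, b = 1.0, h = (b - a) / n;
    double sum = 0.0;

    #pragma omp parallel for reduction(+:sum)
    for (int i = 0; i < n; i++) {
        double x = a + (i + 0.5) * h;   /* midpoint of sub-interval i */
        sum += f(x) * h;
    }

    printf("integral ~= %f\n", sum);    /* ~1/3 for x^2 on [0,1] */
    return 0;
}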
Welcome to the parallel ( or "just"-concurrent ) plurality of computing realities.
Why?
Any non-sequential schedule of processing the loop will have problems with hidden ( not correctly handled ) breach of data-{-access | -value}
integrity in time.
A pure-[SERIAL] flow of processing is free from such dangers as the principally serialised steps indirectly introduce ( right by a rigid order of executing nothing but a one-step-after-another as a sequence ) order, in which there is no chance to "touch" the same memory location twice or more times at the same time.
This "peace-of-mind" is inadvertently lost, once a process goes into a "just"-[CONCURRENT] or the true-[PARALLEL] processing.
Suddenly there is an almost random order (in the case of "just"-[CONCURRENT] execution) or a principally "immediate" singularity that loses any original meaning of "order" (in the case of true-[PARALLEL] execution). Think of a robot with 6 DoF: it arrives at each and every trajectory point in a true-[PARALLEL] fashion, driving all 6 DoF axes in parallel, not one after another in a pure-[SERIAL] manner, and not some-now-some-other-later-and-the-rest-as-it-gets in a "just"-[CONCURRENT] fashion, otherwise the 3D trajectory of the robot arm would become hardly predictable and mutual collisions would be frequent on a car assembly line.
Solution:
Using either a defensive tool, called atomic operations, or a principal approach - design (b)locking-free algorithm, where possible, or explicitly signal and coordinate reads and writes ( sure, at a cost in excess-time and degraded performance ), so as to warrant the values will not get damaged into an inconsistent digital trash, if protective steps ( ensuring all "old"-writes get safely "through" before any "next"-reads go ahead to grab a "right"-value ) were not coded in ( as was demonstrated above ).
Epilogue:
Using a tool, like OpenMP for problems, where it cannot bring any advantage, will result in spending time and decreased performance ( as there are needs to handle all tool-related overheads, while there is literally zero net-effect of parallelism in cases, where the algorithm does not allow any parallelism to be enjoyed ), so one finally pays ways more then one finally gets.
A good place to learn about OpenMP best practices is, for example, the material from Lawrence Livermore National Laboratory (indeed very competent) and similar publications on using OpenMP.

SSE intrinsics without compiler optimization

I am new to SSE intrinsics and am trying to optimise my code with them. Here is my program, which counts the array elements that are equal to a given value.
I changed my code to an SSE version but the speed barely changes. I am wondering whether I am using SSE in the wrong way...
This code is for an assignment where we're not allowed to enable compiler optimization options.
No SSE version:
int get_freq(const float* matrix, float value) {
    int freq = 0;
    // start and end are defined in the surrounding code (not shown)
    for (ssize_t i = start; i < end; i++) {
        if (fabsf(matrix[i] - value) <= FLT_EPSILON) {
            freq++;
        }
    }
    return freq;
}
SSE version:
#include <immintrin.h>
#include <math.h>
#include <float.h>

#define GETLOAD(n)  __m128 load##n = _mm_load_ps(&matrix[i + 4 * n])
#define GETEQU(n)   __m128 check##n = _mm_and_ps(_mm_cmpeq_ps(load##n, value), and_value)
#define GETCOUNT(n) count = _mm_add_ps(count, check##n)

int get_freq(const float* matrix, float givenValue, ssize_t g_elements) {
    int freq = 0;
    int i;
    __m128 value = _mm_set1_ps(givenValue);
    __m128 count = _mm_setzero_ps();
    __m128 and_value = _mm_set1_ps(0x00000001);

    for (i = 0; i + 15 < g_elements; i += 16) {
        GETLOAD(0);  GETLOAD(1);  GETLOAD(2);  GETLOAD(3);
        GETEQU(0);   GETEQU(1);   GETEQU(2);   GETEQU(3);
        GETCOUNT(0); GETCOUNT(1); GETCOUNT(2); GETCOUNT(3);
    }

    __m128 shuffle_a = _mm_shuffle_ps(count, count, _MM_SHUFFLE(1, 0, 3, 2));
    count = _mm_add_ps(count, shuffle_a);
    __m128 shuffle_b = _mm_shuffle_ps(count, count, _MM_SHUFFLE(2, 3, 0, 1));
    count = _mm_add_ps(count, shuffle_b);
    freq = _mm_cvtss_si32(count);

    for (; i < g_elements; i++) {
        if (fabsf(matrix[i] - givenValue) <= FLT_EPSILON) {
            freq++;
        }
    }
    return freq;
}
If you need to compile with -O0, then do as much as possible in a single statement. In normal code, int a=foo(); bar(a); will compile to the same asm as bar(foo()), but in -O0 code, the second version will probably be faster, because it doesn't store the result to memory and then reload it for the next statement.
-O0 is designed to give the most predictable results from debugging, which is why everything is stored to memory after every statement. This is obviously horrible for performance.
I wrote a big answer a while ago for a different question from someone else with a stupid assignment like yours that required them to optimize for -O0. Some of that may help.
Don't try too hard on this assignment. Probably most of the "tricks" that you figure out that make your code run faster with -O0 will only matter for -O0, but make no difference with optimization enabled.
In real life, code is typically compiled with clang or gcc -O2 at least, and sometimes -O3 -march=haswell or whatever to auto-vectorize. (Once it's debugged and you're ready to optimize.)
Re: your update:
Now it compiles, and the horrible asm from the SSE version can be seen. I put it on godbolt along with a version of the scalar code that actually compiles, too. Intrinsics usually compile very badly with optimization disabled, with the inline functions still having args and return values that result in actual load/store round trips (store-forwarding latency) even with __attribute__((always_inline)). See Demonstrator code failing to show 4 times faster SIMD speed with optimization disabled for example.
The scalar version comes out a lot less bad. Its source does everything in one expression, so temporaries stay in registers. The loop counter is still in memory, though, bottlenecking it to at best one iteration per 6 cycles on Haswell, for example. (See the x86 tag wiki for optimization resources.)
BTW, a vectorized fabsf() is easy, see Fastest way to compute absolute value using SSE. That and an SSE compare for less-than should do the trick to give you the same semantics as your scalar code. (But makes it even harder to get -O0 to not suck).
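A minimal sketch of what that could look like for one vector, reusing the load0, value and count names from the snippet above (illustration only, not a drop-in fix):
// Hypothetical sketch: per-lane |matrix[i] - value| <= FLT_EPSILON,
// counting matches as 1.0f per lane, as in the code above.
__m128 absmask = _mm_castsi128_ps(_mm_set1_epi32(0x7FFFFFFF)); // clears sign bits
__m128 absdiff = _mm_and_ps(_mm_sub_ps(load0, value), absmask); // vector fabsf
__m128 mask    = _mm_cmple_ps(absdiff, _mm_set1_ps(FLT_EPSILON));
count = _mm_add_ps(count, _mm_and_ps(mask, _mm_set1_ps(1.0f))); // +1.0f per match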
You might do better just manually unrolling your scalar version one or two times, because -O0 sucks too much.
Some compilers are pretty good about doing optimization of vectors. Did you check the generated assembly of optimized build of both versions? Isn't the "naive" version actually using SIMD or other optimization techniques?

C: How to extract a predefined huge switch from a huge loop without losing performance?

I have a bottleneck, which looks like this:
void function(int type) {
    for (int i = 0; i < m; i++) {
        // do some stuff A
        switch (type) {
        case 0:
            // do some stuff 0
            break;
        [...]
        case n:
            // do some stuff n
            break;
        }
        // do some stuff B
    }
}
n and m are large.
m is in the millions, sometimes hundreds of millions.
n is in the range 2^7 - 2^10 (128 - 1024).
The chunks of code A and B are fairly large.
I rewrote the code (via macros) as follows:
void function(int type) {
    switch (type) {
    case 0:
        for (int i = 0; i < m; i++) {
            // do some stuff A
            // do some stuff 0
            // do some stuff B
        }
        break;
    [...]
    case n:
        for (int i = 0; i < m; i++) {
            // do some stuff A
            // do some stuff n
            // do some stuff B
        }
        break;
    }
}
As a result, it looks like this in IDA for this function:
Is there a way to remove the switch from the loop:
without creating a bunch of copies of the loop,
without creating a huge function with macros,
and without losing performance?
A possible solution seems to me to be a goto through a label variable (a computed goto). Something like this:
void function(int type) {
    void *typeLabel;
    switch (type) {
    case 0:
        typeLabel = &&label_1;
        break;
    [...]
    case n:
        typeLabel = &&label_n;
        break;
    }
    for (int i = 0; i < m; i++) {
        // do some stuff A
        goto *typeLabel;
back:
        // do some stuff B
    }
    goto end;
label_1:
    // do some stuff 0
    goto back;
[...]
label_n:
    // do some stuff n
    goto back;
end:
    ;
}
The matter is also complicated by the fact that all of this will be carried out on different Android devices with different speeds.
Architecture as ARM, and x86.
Perhaps this can be done assembler inserts rather than pure C?
EDIT:
I ran some tests with n = 45,734,912:
loop-within-switch: 891,713 μs
switch-within-loop: 976,085 μs
loop-within-switch is 9.5% faster than switch-within-loop.
For comparison: a simple implementation without the switch takes 1,746,947 μs.
At the moment, the best solution I can see is:
Generate, with macros, n functions, each of which will look like this:
void func_n() {
    for (int i = 0; i < m; i++) {
        // do some stuff A
        // do some stuff n
        // do some stuff B
    }
}
Then make an array of pointers to them, and call the appropriate one from the dispatching function:
typedef void (*func)(void);

void function(int type) {
    func table[n];
    // fill the table with pointers to func_0 .. func_n
    table[type](); // call the appropriate func
}
This allows the compiler to optimize each of the functions func_0 .. func_n separately. Moreover, each of them will not be so big. A sketch of the macro generation is shown below.
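A minimal sketch of how the macro generation and dispatch could be wired up (DEFINE_FUNC, DO_STUFF_A/B and DO_STUFF_n are hypothetical names standing in for the real code blocks, and m is assumed to be visible here):
/* Hypothetical sketch: macro-generate one specialized loop per type and
   dispatch through a function-pointer table. */
typedef void (*func_t)(void);

#define DEFINE_FUNC(n)                 \
    static void func_##n(void) {       \
        for (int i = 0; i < m; i++) {  \
            DO_STUFF_A(i);             \
            DO_STUFF_##n(i);           \
            DO_STUFF_B(i);             \
        }                              \
    }

DEFINE_FUNC(0)
DEFINE_FUNC(1)
/* ... one per type ... */

static const func_t table[] = { func_0, func_1, /* ... */ };

void function(int type) {
    table[type]();  /* each func_n is a small loop the compiler can optimize on its own */
}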
Realistically, a static array of labels is likely the fastest sane option (array of pointers being the sanest fast option). But, let's get creative.
(Note that this should have been a comment, but I need the space).
Option 1: Exploit the branch predictor
Let's build on the fact that if a certain outcome of a branch happens, the predictor will likely predict the same outcome in the future. Especially if it happens more than once. The code would look something like:
for (int i = 0; i < m; i++)
{
    // do some stuff A
    if (type < n/2)
    {
        if (type < n/4)
        {
            if (type < n/8)
            {
                if (type == 0) // do some stuff 0
                else           // do some stuff 1
            }
            else
            {
                ...
            }
        }
        else
        {
            ...
        }
    }
    else
    {
        ...
        // do some stuff n
    }
    // do some stuff B
}
Basically, you binary-search what to do, in log(n) steps. That is log(n) possible jumps, but after only one or two iterations, the branch predictor will predict them all correctly and will speculatively execute the proper instructions without problem. Depending on the CPU, this could be faster than a goto *labelType; back: as some CPUs are unable to prefetch instructions when the jump address is calculated dynamically.
Option 2: JIT load the proper 'stuff'
So, ideally, your code would look like:
void function(int type) {
    for (int i = 0; i < m; i++) {
        // do some stuff A
        // do some stuff [type]
        // do some stuff B
    }
}
With all the other 0..n "stuffs" being junk in the current function invocation. Well, let's make it like that:
void function(int type) {
    prepare(type);
    for (int i = 0; i < m; i++) {
        // do some stuff A
reserved:
        doNothing(); doNothing(); doNothing(); doNothing(); doNothing();
        // do some stuff B
    }
}
The doNothing() calls are there just to reserve the space in the function. Best implementation would be goto B. The prepare(type) function will look in the lookup table for all the 0..n implementations, take the type one, and copy it over all those goto Bs. Then, when you are actually executing your loop, you have the optimal code where there are no needless jumps.
Just be sure to have some final goto B instruction in the stuff implementation - copying a smaller one over a larger one could cause problems otherwise. Alternatively, before exiting function you can restore all the placeholder goto B; instructions. It's a small cost, since you're only doing it once per invocation, not per iteration.
prepare() would be much easier to implement in assembly than in C, but it is doable. You just need the start/end addresses of all stuff_i implementations (in your post, these are label_[i] and label_[i+1]), and memcpy that into reserved.
Maybe the compiler will even let you do:
memcpy((uint8_t*)reserved, (uint8_t*)label_1, (uint8_t*)label_2 - (uint8_t*)label_1);
Likely not, though. You can, however, get the proper locations using setjmp or something like __builtin_return_address / _ReturnAddress within a function call.
Note that this will require write access to the instruction memory. Getting that is OS specific, and likely requires su/admin privileges.
The compiler is generally good at choosing an optimal form of the switch. For an ARM device you can have a few forms for dense code snippets: either a branch table (like a bunch of function pointers) or, if the code in the switch cases is nearly identical, an array index. Semantically it is something like this,
dest = &first_switch_pc;
dest += n*switch_code_size;
current_pc = dest;
An ARM CPU may do this in a single instruction. This is probably not profitable in your case as the type seems to be constant per loop iteration.
However, I would definitely explore restructuring your code like this,
void function(int type) {
    int i = 0;
    if (m == 0) return;
    // initialize type_label;
    goto entry;
    while (1) {
        // do some stuff B
        i++;
        if (i >= m) break;
entry:
        // do some stuff A
        goto *type_label;
label_1:
        // do some stuff 0
        continue;
[...]
label_n:
        // do some stuff n
        continue;
    }
}
This will merge the 'A' and 'B' code so that it fits well in the code cache. The 'control flow' from the 'goto label' will then return to the top of the loop. You may be able to simplify the control-flow logic depending on how i is used in the unknown snippets. A compiler may do this for you automatically depending on optimization levels, etc. No one can really give an answer without more information and profiling. The cost of 'stuff A', 'stuff B' and the size of the switch snippets are all important. Examining the assembler output is always helpful.
This pdf of slides from a presentation about getting gcc to thread jumps is interesting. This is the exact optimization gcc needs to do to compile the switch-inside-loop version similarly to the loop-inside-switch version.
BTW, the loop-inside-switch version should be equivalent in performance to the loop-inside-separate-functions version. Cache operates in terms of cache lines, not whole functions. If most of the code in a function never runs, it doesn't matter that it's there. Only the code that does run takes space in the cache.
If all ARM cores in Android devices have branch-target prediction for indirect jumps, your second implementation of doing the compiler's job for it, and doing an indirect goto inside the loop, is probably the best tradeoff between code size and performance. A correctly-predicted unconditional indirect branch costs about the same as a couple add instructions on x86. If ARM is similar, the savings in code size should pay for it. Those slides talk about some ARM cores having indirect-branch prediction, but doesn't say that all do.
This Anandtech article about A53 cores (the little cores in big.LITTLE) says that A53 vastly increased the indirect-branch prediction resources compared to A7. A7 cores have an 8-entry indirect branch target buffer. That should be enough to make the goto *label in your loop efficient, even on very weak LITTLE cores, unless the rest of your loop has some indirect branches inside the loop. One mispredict on the occasional iteration should only cost maybe 8 cycles. (A7 has a short 8-stage pipeline, and is "partial dual issue, in-order", so branch mispredicts are cheaper than on more powerful CPUs.)
Smaller code size means less code to be loaded from flash, and also less I-cache pressure if the function is called with different arguments for type while the do stuff for A and do stuff for B code is still present in I$, and has its branch-prediction history still available.
If the do stuff for [type] code changes how branches in the stuff for A and B code behave, it may be best to have the entire loop body duplicated, so different copies of the branch have their own prediction entries.
If you want to sort out what's actually slow, you're going to have to profile your code. If ARM is like x86 in having hardware performance counters, you should be able to see which instructions are taking a lot of cycles. Also actually count branch mispredicts, I$ misses, and lots of other stuff.
To make any further suggestions, we'd need to see how big your pieces of code are, and what sort of thing they're doing. Clearly you think loop and switch overhead are making this hot function more of a bottleneck than it needs to be, but you haven't actually said that loop-inside-switch gave better performance.
Unless all the do stuff A, do stuff B, and many of the do stuff [type] blocks are very small, the switch is probably not the problem. If they are small, then yes, it is probably worth duplicating the loop N times.
Another solution is use labels as values:
void function(int type) {
    void *type_switch = &&type_break;
    switch (type) {
    case 0:
        type_switch = &&type_0;
        break;
    [...]
    case n:
        type_switch = &&type_n;
        break;
    }
    for (int i = 0; i < m; i++) {
        // do some stuff A
        goto *type_switch;
type_0: {
            // do some stuff 0
            goto type_break;
        }
[...]
type_n: {
            // do some stuff n
            goto type_break;
        }
type_break: ;
        // do some stuff B
    }
}
This solution is worse than the version with lots of functions:
If optimization is not enabled, the variables will be reloaded from the stack each time in the code blocks 0 .. n.
The goto address can also be reloaded from the stack each time.
There are two extra gotos per iteration.

How to give hint to gcc about loop count

Knowing the number of iterations a loop will go through allows the compiler to do some optimization. Consider for instance the two loops below:
Unknown iteration count :
static void bitreverse(vbuf_desc * vbuf)
{
    unsigned int idx = 0;
    unsigned char * img = vbuf->usrptr;
    while (idx < vbuf->bytesused) {
        img[idx] = bitrev[img[idx]];
        idx++;
    }
}
Known iteration count
static void bitreverse(vbuf_desc * vbuf)
{
    unsigned int idx = 0;
    unsigned char * img = vbuf->usrptr;
    while (idx < 1280*400) {
        img[idx] = bitrev[img[idx]];
        idx++;
    }
}
The second version will compile to faster code, because it will be unrolled twice (on ARM with gcc 4.6.3 and -O2 at least). Is there a way to make assertion on the loop count that gcc will take into account when optimizing ?
There is a hot attribute on functions to give a hint to the compiler about hot spots: http://gcc.gnu.org/onlinedocs/gcc/Function-Attributes.html. Just add before your function:
static void bitreverse(vbuf_desc * vbuf) __attribute__ ((hot));
Here the docs about 'hot' from gcc:
hot The hot attribute on a function is used to inform the compiler
that the function is a hot spot of the compiled program. The function
is optimized more aggressively and on many target it is placed into
special subsection of the text section so all hot functions appears
close together improving locality. When profile feedback is available,
via -fprofile-use, hot functions are automatically detected and this
attribute is ignored.
The hot attribute on functions is not implemented in GCC versions
earlier than 4.3.
The hot attribute on a label is used to inform the compiler that path
following the label are more likely than paths that are not so
annotated. This attribute is used in cases where __builtin_expect
cannot be used, for instance with computed goto or asm goto.
The hot attribute on labels is not implemented in GCC versions earlier
than 4.8.
Also you can try to add __builtin_expect around your idx < vbuf->bytesused condition - it will be a hint that in most cases the expression is true.
In both cases I'm not sure that your loop will be optimized.
Alternatively you can try profile-guided optimization. Build profile-generating version of program with -fprofile-generate; run it on target, copy profile data to build-host and rebuild with -fprofile-use. This will give a lot of information to compiler.
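For the __builtin_expect suggestion above, a minimal sketch of how the loop condition might be annotated (note this is only a branch-probability hint, not an iteration-count assertion):
/* Hypothetical sketch: hint that the loop condition is usually true. */
while (__builtin_expect(idx < vbuf->bytesused, 1)) {
    img[idx] = bitrev[img[idx]];
    idx++;
}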
In some compilers (not in GCC) there are loop pragmas, including "#pragma loop count (N)" and "#pragma unroll (M)", e.g. in Intel; unroll in IBM; vectorizing pragmas in MSVC
The ARM compiler (armcc) also has some loop pragmas, such as unroll(n):
Loop Unrolling: http://infocenter.arm.com/help/index.jsp?topic=/com.arm.doc.dui0348b/CJACACFE.html and http://infocenter.arm.com/help/index.jsp?topic=/com.arm.doc.dui0348b/CJAHJDAB.html
and __promise intrinsic:
Using __promise to improve vectorization
The __promise(expr) intrinsic is a promise to the compiler that a given expression is nonzero. This enables the compiler to improve vectorization by optimizing away code that, based on the promise you have made, is redundant.
The disassembled output of Example 3.21 shows the difference that __promise makes, reducing the disassembly to a simple vectorized loop by the removal of a scalar fix-up loop.
Example 3.21. Using __promise(expr) to improve vectorization code
void f(int *x, int n)
{
    int i;
    __promise((n > 0) && ((n&7)==0));
    for (i=0; i<n; i++) x[i]++;
}
You can actually specify the exact count with __builtin_expect, like this:
while (idx < __builtin_expect(vbuf->bytesused, 1280*400)) {
This tells gcc that vbuf->bytesused is expected to be 1280*400 at runtime.
Alas, this does nothing for optimization with current gcc version. Haven't tried with 4.8, though.
Edit: Just realized that every standard C compiler has a way to exactly specify the loop count, via assert. Since the assert
#include <assert.h>
...
assert(loop_count == 4096);
for (i = 0; i < loop_count; i++) ...
will call exit() or abort() if the condition is not true, any compiler with value propagation will know the exact value of loop_count. I always thought that this would be the most elegant and standard-conforming way to give such optimization hints. Now, I want a C compiler that actually uses this information.
Note that if you want to make this faster, bytewise unrolling might be less effective than using a wider lookup table. A 16-bit table would occupy 128K, and thus often fit into the CPU cache. If the data is not completely random, an even wider table (3 bytes) might be effective.
2-byte example:
unsigned short *bitrev2;
...
for (idx = 0; idx < vbuf->bytesused; idx += 2) {
    *(unsigned short *)(&img[idx]) = bitrev2[*(unsigned short *)(&img[idx])];
}
This is an optimization the compiler can't perform, regardless of the information you give it.
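For completeness, a minimal sketch of how such a 16-bit table could be filled from the existing per-byte bitrev[] table, assuming the 2-byte loads above are in native (little-endian) byte order:
/* Hypothetical sketch: expand the per-byte bitrev[] table to 16-bit pairs,
   so each table entry bit-reverses both bytes of a 2-byte load at once. */
static void build_bitrev2(unsigned short *bitrev2)
{
    for (unsigned v = 0; v < 65536; v++)
        bitrev2[v] = (unsigned short)(bitrev[v & 0xFF] | ((unsigned)bitrev[v >> 8] << 8));
}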
