Incrementing: x++ vs x += 1 - arrays

I've read that many developers use x += 1 instead of x++ for clarity. I understand that x++ can be ambiguous for new developers and that x += 1 is always more clear, but is there any difference in efficiency between the two?
Example using for loop:
for(x = 0; x < 1000; x += 1) vs for(x = 0; x < 1000; x++)
I understand that it's usually not that big of a deal, but if I'm repeatedly calling a function that does this sort of loop, it could add up in the long run.
Another example:
while(x < 1000) {
    someArray[x];
    x += 1;
}
vs
while(x < 1000) {
    someArray[x++];
}
Can x++ be replaced with x += 1 without any performance loss? I'm especially concerned about the second example, because I'm using two lines instead of one.
What about incrementing an item in an array? Will someArray[i]++ be faster than doing someArray[i] += 1 when done in a large loop?

Any sane or insane compiler will produce identical machine code for both.

Assuming you are talking about applying these to basic types and not your own classes (where they could make a huge difference), they can produce the same output, especially when optimization is turned on. To my surprise, I have often found in decompiled applications that x += 1 is used instead of x++ at the assembly level (add vs inc).

Any decent compiler should be able to recognize that the two are the same so in the end there should be no performance difference between them.
If you want to convince yourself, just run a benchmark.

When you say "it could add up in the long run" - don't think about it that way.
Rather, think in terms of percentages. When you find the program counter is in that exact code 10% or more of the time, then worry about it.
The reason is, if the percent is small, then the most you could conceivably save by improving it is also small.
If the percent of time is less than 10%, you almost certainly have much bigger opportunities for speedup in other parts of the code, almost always in the form of function calls you could avoid.
Here's an example.

Consider you're a lazy compiler implementer and wouldn't bother writing OPTIMIZATION routines in the machine-code-gen module.
x = x + 1;
would get translated to THIS code:
mov $[x],$ACC
iadd $1,$ACC
mov $ACC,$[x]
And x++ would get translated to:
incr $[x] ;increment by 1
If ONE instruction executes in 1 machine cycle, then x = x + 1 would take 3 machine cycles whereas x++ would take 1 machine cycle (hypothetical machine used here).
BUT luckily, most compiler implementers are NOT lazy and will write optimizations in the machine-code-gen module. So x = x+1 and x++ SHOULD take equal time to execute. :-P
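If you want to see this for yourself, here is a minimal sketch (the file and function names are made up): put both forms in separate functions, compile with something like gcc -O2 -S, and compare the generated assembly for each pair.
// incr.c -- hypothetical check: compile with `gcc -O2 -S incr.c` and diff the output
void inc_postfix(int *x)               { (*x)++; }
void inc_compound(int *x)              { *x += 1; }
void array_inc_postfix(int *a, int i)  { a[i]++; }
void array_inc_compound(int *a, int i) { a[i] += 1; }
With optimization enabled, any mainstream compiler should emit identical code for each pair, which is exactly the claim made above.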

Related

Why can GCC only do loop interchange optimization when the int size is a compile-time constant?

When I compile this snippet (with -Ofast -floop-nest-optimize) gcc generates assembly which traverses the array in source order.
However, if I uncomment the line // n = 32767 and assign any number to n, it interchanges the index order to x[i * n + j]. Traversing memory in contiguous row-major order is much more cache-friendly than striding down columns.
float matrix_sum_column_major(float* x, int n) {
    // n = 32767;
    float sum = 0;
    for (int i = 0; i < n; i++)
        for (int j = 0; j < n; j++)
            sum += x[j * n + i];
    return sum;
}
On godbolt
Why can't GCC or clang do loop interchange with a runtime-variable int size? Real-world code won't usually have the size declared explicitly.
PD: I've tried this with different versions of gcc and clang-9 and it seems to happen in both.
PD2: Even if I make x be a local variable malloced inside the function it still happens.
Compilers generally focus their efforts (and should focus their efforts) on places where constructs that programmers interested in efficiency are likely to use can be replaced with other constructs that are easily proven equivalent in all cases that should be expected to matter. If n is a constant, a compiler can determine the exact set of array indices that will be used in the loop and then figure out how to process all those indices. If n isn't constant, a compiler might be able to determine that when n is positive, the code will use all indices from 0 to n*n-1, but that would likely require a lot more effort. The authors of clang and gcc might have been able to make such a determination in this case if they tried hard enough, but they likely thought the effort wasn't worthwhile.
Note that if code will use a few particular values of n far more than any others, and the code explicitly checks for those values and uses loops tailored to them, a compiler may be able to generate far more efficient code for those loops than would be possible for a loop that must handle an arbitrary n. Because many real-world problems will likely have some values of n that get used much more than others, it would not be unreasonable for a compiler writer to assume that programmers interested in performance would be likely to use such special-purpose loops, and spending a certain amount of effort improving the arbitrary-n loop may offer less benefit than spending the same effort elsewhere.
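As a hedged illustration of that last point, here is a sketch of what such a special-purpose path could look like; the value 32767 is simply borrowed from the commented-out line in the question, and the fallback loop is unchanged:
float matrix_sum_column_major(float* x, int n) {
    if (n == 32767) {                    /* hypothetical "hot" size: constant trip count, */
        float sum = 0;                   /* so the compiler can interchange/vectorize it  */
        for (int i = 0; i < 32767; i++)
            for (int j = 0; j < 32767; j++)
                sum += x[j * 32767 + i];
        return sum;
    }
    float sum = 0;                       /* generic fallback for arbitrary n */
    for (int i = 0; i < n; i++)
        for (int j = 0; j < n; j++)
            sum += x[j * n + i];
    return sum;
}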

optimization of a code in C

I am trying to optimize a piece of C code, specifically a critical loop which takes almost 99.99% of total execution time. Here is that loop:
#pragma omp parallel shared(NTOT,i) num_threads(4)
{
    #pragma omp for private(dx,dy,d,j,V,E,F,G) reduction(+:dU) nowait
    for(j = 1; j <= NTOT; j++){
        if(j == i) continue;
        dx = (X[j][0]-X[i][0])*a;
        dy = (X[j][1]-X[i][1])*a;
        d = sqrt(dx*dx+dy*dy);
        V = (D/(d*d*d))*(dS[0]*spin[2*j-2]+dS[1]*spin[2*j-1]);
        E = dS[0]*dx+dS[1]*dy;
        F = spin[2*j-2]*dx+spin[2*j-1]*dy;
        G = -3*(D/(d*d*d*d*d))*E*F;
        dU += (V+G);
    }
}
All variables are local. The loop takes 0.7 seconds for NTOT=3600, which is a lot of time, especially since I have to do this 500,000 times in the whole program, resulting in 97 hours spent in this loop. My question is whether there are other things that could be optimized in this loop.
My computer's processor is an Intel Core i5 with 4 cores (4 x 1600 MHz) and 3072K of L3 cache.
Optimize for hardware or software?
Soft:
Getting rid of time-consuming exceptional cases such as divide-by-zero:
d = sqrt(dx*dx+dy*dy + 0.001f );
V = (D/(d*d*d))*(dS[0]*spin[2*j-2]+dS[1]*spin[2*j-1]);
You could also try John Carmack, Terje Mathisen and Gary Tarolli's "Fast inverse square root" for the
D/(d*d*d)
part. You get rid of division too.
float qrsqrt=q_rsqrt(dx*dx+dy*dy + easing);
qrsqrt=qrsqrt*qrsqrt*qrsqrt * D;
at the cost of some precision.
There is another division that can also be eliminated:
(D/(d*d*d*d*d))
for example as
qrsqrt_to_the_power2 * qrsqrt_to_the_power3 * D
Here is the fast inverse sqrt:
float Q_rsqrt( float number )
{
    long i;
    float x2, y;
    const float threehalfs = 1.5F;

    x2 = number * 0.5F;
    y  = number;
    i  = * ( long * ) &y;                       // evil floating point bit level hacking
    i  = 0x5f3759df - ( i >> 1 );               // what ?
    y  = * ( float * ) &i;
    y  = y * ( threehalfs - ( x2 * y * y ) );   // 1st iteration
//  y  = y * ( threehalfs - ( x2 * y * y ) );   // 2nd iteration, this can be removed
    return y;
}
To overcome the poor cache behaviour of big arrays, you can do the computation in smaller patches/groups, especially when it is a many-to-many O(N*N) algorithm. Such as:
get 256 particles.
compute the 256 x 256 relations.
save the 256 results in variables.
select another 256 particles as the target (keeping the first group of 256 in place).
do the same calculations, but this time 1st group vs 2nd group.
save the first 256 results again.
move to the 3rd group.
repeat.
do the same until all particles have been compared against the first 256 particles.
Now get the second group of 256.
iterate until all groups of 256 are complete.
Your CPU has a big L3 cache, so you could try 32k particles versus 32k particles directly. But L1 is not that big, so I would stick with 512 vs 512 (or 500 vs 500 to avoid cache-line/set conflicts; this is going to depend on the architecture) if I were you.
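A rough sketch of that blocking idea, assuming the surrounding program also loops over i; TILE is a hypothetical block size to tune, and the body is the same as the original loop:
#define TILE 512
for (int ib = 1; ib <= NTOT; ib += TILE) {
    int iend = (ib + TILE - 1 < NTOT) ? (ib + TILE - 1) : NTOT;
    for (int jb = 1; jb <= NTOT; jb += TILE) {
        int jend = (jb + TILE - 1 < NTOT) ? (jb + TILE - 1) : NTOT;
        for (int i = ib; i <= iend; i++) {
            for (int j = jb; j <= jend; j++) {
                if (j == i) continue;
                /* ... same dx, dy, d, V, E, F, G, dU computation as the original loop ... */
            }
        }
    }
}
This way the X[] and spin[] entries for one pair of tiles stay resident in cache while they are reused.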
Hard:
SSE, AVX, GPGPU, FPGA .....
As @harold commented, SSE should be the starting point: you should vectorize, or at least parallelize through 4-wide packed vector instructions, which have the advantage of optimal memory fetching and pipelining. When you need 3x-10x more performance (on top of an SSE version using all cores), you will need an OpenCL/CUDA-capable GPU (priced about the same as an i5) and the OpenCL (or CUDA) API; you could also learn OpenGL, but it seems harder (maybe DirectX is easier).
Trying SSE is easiest and should give about 3x over the fast inverse square root I mentioned above. An equally priced GPU should give another 3x over SSE, at least for thousands of particles. At 100k particles or more, a whole GPU can achieve 80x the performance of a single CPU core for this type of algorithm when you optimize it enough (making it less dependent on main memory). OpenCL gives you the ability to keep your arrays in local, cache-like memory, so you can use terabytes/s of bandwidth there.
I would always do random pausing to pin down exactly which lines were most costly.
Then, after fixing something I would do it again, to find another fix, and so on.
That said, some things look suspicious.
People will say the compiler's optimizer should fix these, but I never rely on that if I can help it.
X[i], X[j], spin[2*j-2] and spin[2*j-1] look like candidates for pointers. There is no need to do this index calculation and then hope the optimizer can remove it.
You could define a variable d2 = dx*dx+dy*dy and then say d = sqrt(d2). Then wherever you have d*d you can instead write d2.
I suspect a lot of samples will land in the sqrt function, so I would try to figure a way around using that.
I do wonder if some of these quantities like (dS[0]*spin[2*j-2]+dS[1]*spin[2*j-1]) could be calculated in a separate unrolled loop outside this loop. In some cases two loops can be faster than one if the compiler can save some registers.
I cannot believe that 3600 iterations of an O(1) loop can take 0.7 seconds. Perhaps you meant the double loop with 3600 * 3600 iterations? Otherwise I can suggest checking if optimization is enabled, and how long threads spawning takes.
General
Your inner loop is very simple and it contains only a few operations. Note that divisions and square roots are roughly 15-30 times slower than additions, subtractions and multiplications. You are doing three of them, so most of the time is eaten by them.
First of all, you can compute reciprocal square root in one operation instead of computing square root, then getting reciprocal of it. Second, you should save the result and reuse it when necessary (right now you divide by d twice). This would result in one problematic operation per iteration instead of three.
invD = rsqrt(dx*dx+dy*dy);
V = (D * (invD*invD*invD))*(...);
...
G = -3*(D * (invD*invD*invD*invD*invD))*E*F;
dU += (V+G);
In order to further reduce time taken by rsqrt, I advise vectorizing it. I mean: compute rsqrt for two or four input values at once with SSE. Depending on size of your arguments and desired precision of result, you can take one of the routines from this question. Note that it contains a link to a small GitHub project with all the implementations.
Indeed you can go further and vectorize the whole loop with SSE (or even AVX), that is not hard.
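As a minimal sketch of the vectorized rsqrt idea (the packed inputs d2_0..d2_3 are hypothetical; packing four j iterations together is left to the caller):
#include <xmmintrin.h>   /* SSE */

__m128 d2  = _mm_set_ps(d2_3, d2_2, d2_1, d2_0);   /* four values of dx*dx+dy*dy */
__m128 inv = _mm_rsqrt_ps(d2);                     /* ~12-bit approximation      */
/* one Newton-Raphson step: inv = inv * (1.5 - 0.5 * d2 * inv * inv) */
__m128 half         = _mm_set1_ps(0.5f);
__m128 three_halves = _mm_set1_ps(1.5f);
inv = _mm_mul_ps(inv, _mm_sub_ps(three_halves,
          _mm_mul_ps(_mm_mul_ps(half, d2), _mm_mul_ps(inv, inv))));
Whether one refinement step is enough depends on the precision you need, as noted above.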
OpenCL
If you are ready to use some big framework, then I suggest using OpenCL. Your loop is very simple, so you won't have any problems porting it to OpenCL (except for some initial adaptation to OpenCL).
Then you can use CPU implementations of OpenCL, e.g. from Intel or AMD. Both of them would automatically use multithreading. Also, they are likely to automatically vectorize your loop (e.g. see this article). Finally, there is a chance that they would find a good implementation of rsqrt automatically, if you use native_rsqrt function or something like that.
Also, you would be able to run your code on GPU. If you use single precision, it may result in significant speedup. If you use double precision, then it is not so clear: modern consumer GPUs are often slow with double precision, because they lack the necessary hardware.
Minor optimisations:
(d * d * d) is calculated twice. Store d*d and use it for d^3 and d^5
Replace 2 * x with x << 1;
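A small sketch of the first suggestion (the d2, d3, d5 declarations are additions, and double is assumed for the types; everything else follows the original loop body):
double d2 = dx*dx + dy*dy;
d = sqrt(d2);
double d3 = d2 * d;      /* d^3, replaces d*d*d     */
double d5 = d3 * d2;     /* d^5, replaces d*d*d*d*d */
V = (D/d3)*(dS[0]*spin[2*j-2]+dS[1]*spin[2*j-1]);
G = -3*(D/d5)*E*F;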

Performance of modulus operator in C [closed]

From the execution-time perspective, is using the modulus operator more beneficial, or the manual way of doing it, if I have to do the modulus operation a large number of times (about 10^6 times)?
Manually doing (number % mod_number):
while(number >= mod_number) {
    number = number - mod_number;
}
Doing the same thing using % operator :
number = number % mod_number;
From what I have tested, doing it manually gives better time performance.
How is the modulus operator implemented? I know the results for negative numbers are implementation-defined; I am asking about the working of the operator, i.e., its complexity, so that I can justify the better manual performance.
Note: the question is specifically about implementations in C.
The code snippet:
for (j = 0; j < idx; j++) {
    num = mark[j];
    dif = k - num;
    if (dif < 0) dif = (-1 * dif) + 100;
    many = count[num];
    prev = ap[dif][k];
    ap[dif][k] = ap[dif][k] + ap[dif][num];
    // the manual way here works faster than %
    if (ap[dif][k] >= mod) ap[dif][k] -= mod;
    ap[dif][k] += many;
    if (ap[dif][k] >= mod) ap[dif][k] -= mod;
    sum = (sum + ap[dif][k]);
    if (sum >= mod) sum -= mod;
    sum = sum - prev;
}
The above loop is executed 2*(10^5)*t times, with idx gradually increasing up to 100 for each t. I used t = 10.
I would be very surprised if the loop were more efficient when number is many times larger than mod_number. Any CPU you're likely to use has a built-in division operation that returns both the quotient and the remainder in constant time, and this will be used to implement the % operator. Your loop takes O(number/mod_number) time.
I suggest you take a look at the generated assembly code for the two versions and you'll see this.
It depends on the implementation. It is pointless to discuss performance without a given system in mind.
The modulus operator will likely be implemented through the CPU's division instruction, which on most CPUs is relatively slow in comparison to other CPU instructions. However, it seems highly unlikely that a loop like the one in your example will be more efficient.
More likely, the performance difference you are experiencing is either related to wrong optimization settings or incorrect benchmarking.
In my experience, using the modulus operator should give you better performance; the people who write C compilers will have considered how to optimize that operation.
But your test results show otherwise, so it may depend on the code you have written. It would be easier to find out why if you showed your code...
The example you have shown (not that while loop at the top, the snippet at the bottom) is a case where the "divisor" is only subtracted at most once. That is essentially the one case in which "repeated" subtraction (0 or 1 times, a special case of repeated subtraction) can be (and commonly is, but not necessarily) faster than division-based modulo. Obviously it depends on how fast division is on the target, how fast a test/branch (or test/predicated instruction) is on the target, and in the case of branches it even depends on how predictable the branch will be.
A compiler is unlikely to make that optimization (but it's not impossible), because it only makes sense if it is known that the subtraction will only happen at most once (or perhaps more than one, if division is especially slow on the target, but some lowish bound is still needed), which is in general a hard thing to find out for a compiler.
To give some real life numbers, on a Haswell 32bit signed division (and therefore also modulo) would take 22 to 29 cycles, and a branch misprediction might take up to 20 cycles, but that's a worst case and the branch should not be mispredicted all the time. Also, you could avoid the branch (if it's badly predicted) and do something like this (not tested, just to give you some idea)
sub   eax, edx          ; eax = value - mod, sets the flags
lea   edx, [eax + edx]  ; edx = the original value (eax + mod)
cmovl eax, edx          ; if value - mod was negative, keep the original value
Which should only take about 4 cycles, independent of any predictability. Using a branch may be faster if it can be predicted well.
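In C, the same branchless reduction can be sketched like this (it assumes the value is already in [0, 2*mod), so at most one subtraction is needed, just like in the question's code):
int t = sum - mod;
sum = (t < 0) ? sum : t;   /* compilers commonly lower this to a cmov rather than a branch */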

Find out max & min of two number without using If else?

I am able to find out the logic from: Here
r = y ^ ((x ^ y) & -(x < y)); // min(x, y)
r = x ^ ((x ^ y) & -(x < y)); // max(x, y)
It says it is faster than doing
r = (x < y) ? x : y
Can someone explain it in a bit more detail, with an example?
How could it be faster?
Discussing optimization without specific hardware in mind doesn't make much sense. You really can't tell which alternative is fastest without going into the details of a specific system. Boldly stating that the first alternative is fastest without any specific hardware in mind is just premature optimization.
The obscure xor solution might be faster than the comparison alternative if the given CPU's performance relies heavily on branch prediction. In other words, if it executes regular instructions such as arithmetic ones very fast, but hits a performance bottleneck at any conditional statement (such as an if), where the code might branch one of several ways. Other factors, such as the amount of instruction cache, also matter.
Many CPUs will however execute the second alternative much faster, because it involves fewer operations.
So to sum it up, you'll have to be an expert of the given CPU to actually tell in theory which code that will be the fastest. If you aren't such an expert, simply benchmark it and see. Or look at the disassembly for notable differences.
In the link that you provided, it is explicitly stated:
On some rare machines where branching is very expensive and no condition move instructions exist, the [code] might be faster than the obvious approach, r = (x < y) ? x : y
Later on, it says:
On some machines, evaluating (x < y) as 0 or 1 requires a branch instruction, so there may be no advantage.
In short, the bit manipulation solution is only faster on machines that have poor branch execution, as it operates solely on the numerical values of the operands. On most machines the branching approach is just as fast (and sometimes even faster) and should be preferred for its readability.
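A small worked example of the mask trick, assuming 32-bit two's-complement int (the variable names are mine):
int x = 3, y = 7;
int mask = -(x < y);              /* x < y is 1, so mask = -1 = all ones */
int mn = y ^ ((x ^ y) & mask);    /* = y ^ (x ^ y) = x = 3               */
int mx = x ^ ((x ^ y) & mask);    /* = x ^ (x ^ y) = y = 7               */
/* when x >= y, mask = 0, so mn = y and mx = x */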
Using bit manipulation:
void func(int a, int b){
    int c = a - b;              // note: a - b may overflow for extreme inputs
    int k = (c >> 31) & 0x1;    // sign bit of c: 1 if a < b (assumes 32-bit int)
    int max = a - k * c;
    int min = b + k * c;
    printf("max = %d\nmin = %d", max, min);
}
The question does not specify the hardware this will run on. My answer will address the case where this is running on x86 (for instance any PC). Lets look at the assembly generated by each.
; r = y ^ ((x ^ y) & -(x < y))
xor edx,edx
cmp ebx,eax
mov ecx,eax
setl dl
xor ecx,ebx
neg edx
and edx,ecx
xor eax,edx
; r = (x < y) ? x : y
cmp ebx,eax
cmovl eax,ebx
The XOR version has to zero registers and move values around on top of the operations it inherently needs to perform, adding up to 8 instructions. However, x86 has a cmov, or conditional move, instruction. So the ?: version compiles to a comparison and a cmovl, just 2 instructions. However, this doesn't necessarily make the ?: version 4 times faster, since different instructions may have different latencies and different dependency chains. But you can certainly see how ?: will very likely be faster than the XOR version.
It's also worth noting that neither version requires a branch, and so there is no branch misprediction penalty.
The ?: operator risks being implemented with a conditional branch (instead of a conditional assignment).
Conditional branching is a small "catastrophe" for a processor, as it cannot guess which instruction will be fetched next. This breaks the pipeline organization of the ALU (several instructions being in progress concurrently to increase throughput) and causes pipeline re-initialization delays. To alleviate this, processors resort to branch prediction, i.e. they bet on the branch that will be taken, but they can't be successful all the time.
In conclusion: conditional branches can be slloooowwwwwwww...

32x32 Multiply and add optimization

I'm working on optimizing an application. I found that I need to optimize an inner loop for improved performance.
rgiFilter is a 16-bit array.
for (i = 0; i < iLen; i++) {
    iPredErr = (I32)*rgiResidue;
    rgiFilter = rgiFilterBuf;
    rgiPrevVal = rgiPrevValRdBuf + iRecent;
    rgiUpdate = rgiUpdateRdBuf + iRecent;
    iPred = iScalingOffset;
    for (j = 0; j < iOrder_Div_8; j++) {
        iPred += (I32) rgiFilter[0] * rgiPrevVal[0];
        rgiFilter[0] += rgiUpdate[0];
        iPred += (I32) rgiFilter[1] * rgiPrevVal[1];
        rgiFilter[1] += rgiUpdate[1];
        iPred += (I32) rgiFilter[2] * rgiPrevVal[2];
        rgiFilter[2] += rgiUpdate[2];
        iPred += (I32) rgiFilter[3] * rgiPrevVal[3];
        rgiFilter[3] += rgiUpdate[3];
        iPred += (I32) rgiFilter[4] * rgiPrevVal[4];
        rgiFilter[4] += rgiUpdate[4];
        iPred += (I32) rgiFilter[5] * rgiPrevVal[5];
        rgiFilter[5] += rgiUpdate[5];
        iPred += (I32) rgiFilter[6] * rgiPrevVal[6];
        rgiFilter[6] += rgiUpdate[6];
        iPred += (I32) rgiFilter[7] * rgiPrevVal[7];
        rgiFilter[7] += rgiUpdate[7];
        rgiFilter += 8;
        rgiPrevVal += 8;
        rgiUpdate += 8;
    }
    // ... rest of the outer loop body (omitted in the question)
}
Your only bet is to do more than one operation at a time, and that means one of these 3 options:
SSE instructions (SIMD). You process multiple memory locations with a single instructions
Multi-threading (MIMD). This works best if you have more than 1 CPU core. Split your array into multiple, similarly sized strips that are independent of each other (dependencies will increase this option's complexity a lot, to the point of being slower than sequentially calculating everything if you need a lot of locks). Note that the array has to be big enough to offset the extra context-switching and synchronization overhead (it's pretty small, but not negligible). Best for 4 cores or more.
Both at once. If your array is really big, you could gain a lot by combining both.
If rgiFilterBuf, rgiPrevValRdBuf and rgiUpdateRdBuf are function parameters that don't alias, declare them with the restrict qualifier. This will allow the compiler to optimise more aggressively.
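For instance, a minimal sketch of such a declaration (the I16 element type and the function name are assumptions based on the question):
void predict(I16 * restrict rgiFilterBuf,
             const I16 * restrict rgiPrevValRdBuf,
             const I16 * restrict rgiUpdateRdBuf /* ... other parameters ... */);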
As some others have commented, your inner loop looks like it may be a good fit for vector processing instructions (like SSE, if you're on x86). Check your compiler's intrinsics.
I don't think you can do much to optimize it in C. Your compiler might have options to generate SIMD code, but you probably need to just go and write your own SIMD assembly code if performance is critical...
You can replace the inner loop with very few SSE2 intrinsics
see _mm_madd_epi16 to replace the eight
iPred += (I32) rgiFilter[] * rgiPrevVal[];
and _mm_add_epi16 or _mm_add_epi32 to replace the eight
rgiFilter[] += rgiUpdate[];
You should see a nice acceleration with that alone.
These intrinsics are specific to Microsoft and Intel Compilers.
I am sure equivalents exist for GCC, I just haven't used them.
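To make the suggestion concrete, here is a hedged sketch of the inner loop with those intrinsics; it assumes rgiFilter, rgiPrevVal and rgiUpdate all point to 16-bit (I16) elements, as the question implies:
#include <emmintrin.h>   /* SSE2 */

__m128i vPred = _mm_setzero_si128();                            /* four I32 partial sums  */
for (j = 0; j < iOrder_Div_8; j++) {
    __m128i f = _mm_loadu_si128((const __m128i*)rgiFilter);     /* 8 filter taps          */
    __m128i p = _mm_loadu_si128((const __m128i*)rgiPrevVal);    /* 8 previous values      */
    __m128i u = _mm_loadu_si128((const __m128i*)rgiUpdate);     /* 8 updates              */
    vPred = _mm_add_epi32(vPred, _mm_madd_epi16(f, p));         /* iPred += filter * prev */
    _mm_storeu_si128((__m128i*)rgiFilter, _mm_add_epi16(f, u)); /* filter += update       */
    rgiFilter += 8;  rgiPrevVal += 8;  rgiUpdate += 8;
}
I32 tmp[4];                                                     /* fold the partial sums  */
_mm_storeu_si128((__m128i*)tmp, vPred);
iPred += tmp[0] + tmp[1] + tmp[2] + tmp[3];
As in the scalar code, each multiply uses the filter value from before that iteration's update.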
EDIT: based on the comments below I would change the following...
If you have mixed types the compiler is not always smart enough to figure it out. I would suggest the following to make it more obvious and give it a better chance at autovectorizing.
declare rgiFilter[] as I32 for the purposes of this function. You will pay one copy.
change iPred to iPred[] as I32 also.
do the iPred[] summing outside the inner (or even outer) loop.
Pack similar instructions in groups of four
iPred[0] += rgiFilter[0] * rgiPrevVal[0];
iPred[1] += rgiFilter[1] * rgiPrevVal[1];
iPred[2] += rgiFilter[2] * rgiPrevVal[2];
iPred[3] += rgiFilter[3] * rgiPrevVal[3];
rgiFilter[0] += rgiUpdate[0];
rgiFilter[1] += rgiUpdate[1];
rgiFilter[2] += rgiUpdate[2];
rgiFilter[3] += rgiUpdate[3];
This should be enough for the Intel compiler to figure it out
Ensure that iPred is held in a register (not read from memory before and not written back into memory after each += operation).
Optimize the memory layout for the 1st level cache. Ensure that the 3 arrays do not fight for the same cache entries. This depends on the CPU architecture and isn't simple at all.
Loop unrolling and vectorizing should be left to the compiler.
See Gcc Auto-vectorization
Start out by making sure that the data is laid out linearly in memory so that you get no cache misses. This doesn't seem to be an issue though.
If you can't use SSE for the operations (and if the compiler fails with it - look at the assembly), try to separate the code into several smaller for-loops (one for each of the 0 .. 8 statement groups). Compilers tend to do better optimizations on loops that perform fewer operations (except in cases like this where vectorization/SSE might apply).
16-bit integers are more expensive for a 32/64-bit architecture to use (unless it has specific 16-bit registers). Try converting them to 32 bits before doing the loop (most 64-bit architectures have 32-bit registers as well, AFAIK).
Pretty good code.
At each step, you're basically doing three things, a multiplication and two additions.
The other suggestions are good. Also, I've sometimes found that I get faster code if I separate those activities into different loops, like
one loop to do the multiplication and save to a temporary array.
one loop to sum that array in iPred.
one loop to add rgiUpdate to rgiFilter.
With the unrolling, your loop overhead is negligible, but if the number of different things done inside each loop is minimized, the compiler can sometimes make better use of its registers.
There's lots of optimizations that you can do that involve introducing target specific code. I'll stick mostly with generic stuff, though.
First, if you are going to loop with index limits then you should usually try to loop downward.
Change:
for (i = 0; i < iLen; i++) {
to
for (i = iLen-1; i >= 0; i--) {
This can take advantage of the fact that many common processors essentially do a comparison with 0 for the results of any math operation, so you don't have to do an explicit comparison.
This only works, though, if going backwards through the loop has the same results and if the index is signed (though you can sneak around that).
Alternately you could try limiting by pointer math. This might eliminate the need for an explicit index (counter) variable, which could speed things up, especially if registers are in short supply.
for (p = rgiFilter; p < rgiFilter + 8*iOrder_Div_8; ) {
    iPred += (I32) (*p) * *rgiPrevVal++;
    *p++ += *rgiUpdate++;
    ....
}
This also gets rid of the odd updating at the end of your inner loop. The updating at the end of the loop could confuse the compiler and make it produce worse code. You may also find that the loop unrolling you did produces worse, or only equally good, results compared to having only two statements in the body of the inner loop. The compiler is likely able to make good decisions about how this loop should be rolled/unrolled. Or you might just want to make sure that the loop is unrolled twice, since rgiFilter is an array of 16-bit values, and see if the compiler can take advantage of accessing it just twice to accomplish two reads and two writes -- doing one 32-bit load and one 32-bit store.
for (p = rgiFilter; p < rgiFilter + 8*iOrder_Div_8; ) {
    I16 x = *p;
    I16 y = *(p+1);                    // Hope that the compiler can combine these loads
    iPred += (I32) x * *rgiPrevVal++;
    iPred += (I32) y * *rgiPrevVal++;
    *p++ += *rgiUpdate++;
    *p++ += *rgiUpdate++;              // Hope that the compiler can combine these stores
    ....
}
If your compiler and/or target processor supports it you can also try issuing prefetch instructions. For instance gcc has:
__builtin_prefetch (const void * addr)
__builtin_prefetch (const void * addr, int rw)
__builtin_prefetch (const void * addr, int rw, int locality)
These can be used to tell the compiler that if the target has prefetch instructions, it should use them to try to get addr into the cache ahead of time. Optimally these should be issued once per cache-line step per array you're working on. The rw argument tells the compiler whether you want to read from or write to the address. locality has to do with whether the data needs to stay in cache after you access it. The compiler just tries to do the best it can to generate the right instructions for this, but if it can't do what you ask for on a certain target it just does nothing, and it doesn't hurt anything.
Also, since the __builtin_ functions are special the normal rules about variable number of arguments don't really apply -- this is a hint to the compiler, not a call to a function.
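A hedged usage sketch, reusing the arrays from the question (the prefetch distance of 64 elements is a placeholder to tune, not a recommendation):
__builtin_prefetch(rgiPrevVal + 64, 0, 1);   /* rw = 0: read; locality = 1: modest reuse */
__builtin_prefetch(rgiUpdate  + 64, 0, 1);
__builtin_prefetch(rgiFilter  + 64, 1, 1);   /* rw = 1: this array is also written       */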
You should also look into any vector operations your target supports as well as any generic or platform specific functions, builtins, or pragmas that your compiler supports for doing vector operations.
