How do I use the NEON comparison instructions in general?
Here is a case where I want to use the greater-than-or-equal-to instruction.
Currently I have a,
int x;
...
...
...
if(x >= 0)
{
....
}
In NEON, I would like to use x in the same way, just that x this time is a vector.
int32x4_t x;
...
...
...
if(vcgeq_s32(x, vdupq_n_s32(0))) // What's the best way to achieve this effect?
{
....
}
With SIMD it's not straightforward to go from a single scalar if/then to a test on multiple elements. Usually you want to test whether any element passes the comparison or whether all elements do, and there will usually be different SIMD predicates for each case which you can put inside an if (...). I don't see anything like this in NEON though, so you may be out of luck.
Often though you want to take a different approach, since branches are usually not desirable in optimised code. Ideally you will want to use the result of a SIMD comparison as a mask for subsequent operations (e.g. select different values based on mask using bitwise operations).
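Here is a minimal sketch of that mask-based approach; the function name and the choice of zeroing out negative lanes are my own, not from the question:

#include <arm_neon.h>

// vcgeq_s32 sets a lane to all ones where x >= 0 and to all zeros otherwise;
// vbslq_s32 then selects per lane based on that mask, so no branch is needed.
int32x4_t clamp_negative_to_zero(int32x4_t x)
{
    uint32x4_t mask = vcgeq_s32(x, vdupq_n_s32(0)); // lane-wise x >= 0
    return vbslq_s32(mask, x, vdupq_n_s32(0));      // keep x where true, else 0
}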
Suppose I have an integer that is a power of 2, eg. 1024:
int a = 1 << 10; //works with any power of 2 no.
Now I want to check whether another integer b is the same as a. Which is faster/better (especially on weak embedded systems):
if (b == a) {}
or
if (b & a) {}
?
Sorry if this is a noob question, but couldn't find an answer using the search.
edit: thanks for many insightful answers. I could select only one of them, but all of them are welcome.
These operations are not even equivalent: a & b is false when both a and b are 0, and it is also non-zero (true) whenever b merely has a's bit set alongside other bits, even though b != a.
So I'd suggest expressing the semantics that you want (i.e. a == b) and letting the compiler do the optimization.
If you then measure that you have performance issues at that point, you can start analyzing/optimizing...
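A small demonstration of the mismatch (the concrete values are mine, chosen only to illustrate the point):

#include <stdio.h>

int main(void)
{
    int a = 1 << 10;                // 1024
    int b = 1025;                   // has bit 10 set, but b != a

    printf("%d\n", b == a);         // prints 0
    printf("%d\n", (b & a) != 0);   // prints 1: the bitwise test accepts b
    return 0;
}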
The short answer is this - it depends on what sort of things you're comparing. However, in this case, I'll assume that you're comparing two variables to each other (as opposed to a variable and an immediate, etc.)
This website, although rather old, studied how many clock cycles different instructions took on the x86 platform. The two instructions we're interested in here are the "AND" instruction and the "CMP" instruction (which the compiler uses for & and == respectively). What we can see here is that both of these instructions take about 1/3 of a cycle - that is to say, you can execute 3 of them in 1 cycle on average. Compare this to the "DIV" instruction which (in 1996) took 23 cycles to execute.
However, this omits one important detail. An "AND" instruction is not sufficient to complete the behavior you're looking for. In fact, a brief compilation on x86_64 suggests that you need both an "AND" and a "TEST" instruction for the "&" version, while "==" simply uses the "CMP" instruction. Because all these instructions are otherwise equivalent in IPC, the "==" will in fact be slightly faster...as of 1996.
Nowadays, processors optimize so well at the bare metal layer that you're unlikely to notice a difference. That said, if you wanted to see for sure...simply write a test program and find out for yourself.
As noted above though, even in the case that you have a power of 2, these instructions are still not equivalent, since it doesn't work for 0. Well...I guess technically zero ISN'T a power of 2. :) However you want to spin it though, use "==".
An x86 CPU sets a flag according to how the result of an arithmetic or logical operation compares to zero.
For the ==, your compiler will either use a dedicated compare instruction or a subtraction, setting this flag in both cases. The if() is then implemented by a jump that is conditional on this bit.
For the &, another instruction is used: the logical bitwise AND instruction. That too sets the flag appropriately, so again the next instruction will be the conditional branch.
So, the question boils down to: Is there a performance difference between a subtraction and a bitwise and instruction? And the answer is "no" on any sane architecture. Both instructions use the same ALU, both set the same flags, and this ALU is typically designed to perform a subtraction in a single clock cycle.
Bottom line: Write readable code, and don't try to microoptimize what cannot be optimized.
Suppose I have an array:
uint8_t arr[256];
and an element
__m128i x
containing 16 bytes,
x_1, x_2, ... x_16
I would like to efficiently fill a new __m128i element
__m128i y
with values from arr depending on the values in x, such that:
y_1 = arr[x_1]
y_2 = arr[x_2]
.
.
.
y_16 = arr[x_16]
A command to achieve this would essentially be loading a register from a non-contiguous set of memory locations. I have a painfully vague memory of having seen documentation of such a command, but can't find it now. Does it exist? Thanks in advance for your help.
This kind of capability in SIMD architectures is known as load/store scatter/gather. Unfortunately SSE does not have it. Future SIMD architectures from Intel may have this - the ill-fated Larrabee processor was one case in point. For now though you will just need to design your data structures in such a way that this kind of functionality is not needed.
Note that you can achieve the equivalent effect by using e.g. _mm_set_epi8:
y = _mm_set_epi8(arr[x_16], arr[x_15], arr[x_14], ..., arr[x_1]);
although of course this will just generate a bunch of scalar code to load your y vector. This is fine if you are doing this kind of operation outside any performance-critical loops, e.g. as part of initialisation prior to looping, but inside a loop it is likely to be a performance-killer.
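If you want something that compiles (rather than writing out all 16 arguments), a scalar fallback along these lines does the same job; the function name is mine and this is just a sketch of that idea:

#include <emmintrin.h>  // SSE2
#include <stdint.h>

// Spill the 16 byte indices to memory, do 16 scalar table lookups,
// then reload the results as a vector.
static __m128i lookup_bytes(__m128i x, const uint8_t arr[256])
{
    uint8_t idx[16], out[16];
    _mm_storeu_si128((__m128i *)idx, x);
    for (int i = 0; i < 16; ++i)
        out[i] = arr[idx[i]];
    return _mm_loadu_si128((const __m128i *)out);
}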
How often do you use bitwise-operation "hacks" to do some kind of
optimization? In what kind of situations are they really useful?
Example: instead of using if:
if (data[c] >= 128) //in a loop
sum += data[c];
you write:
int t = (data[c] - 128) >> 31;
sum += ~t & data[c];
Of course, assuming it produces the same intended result for this specific situation.
Is it worth it? I find it unreadable. How often do you come across
this?
Note: I saw this code in the chosen answer to: Why is processing a sorted array faster than an unsorted array?
While that code was an excellent way to show what's going on, I usually wouldn't use code like that. If it had to be fast, there are usually even faster solutions, such as using SSE on x86 or NEON on ARM. If none of that is available, sure, I'll use it, provided it helps and it's necessary.
By the way, I explain how it works in this answer
Like Skylion, one thing I've used a lot is figuring out whether a number is a power of two. Think a while about how you'd do that.. then look at this: (x & (x - 1)) == 0 && x != 0
It's tricky the first time you see it, I suppose, but once you get used to it it's just so much simpler than any alternative that doesn't use bitmath. It works because subtracting 1 from a number means that the borrow starts at the rightmost end of the number and runs through all the zeroes, then stops at the first 1 which turns into a zero. ANDing that number with the original then makes the rightmost 1 zero. Powers of two only have one 1, which disappears, leaving zero. All other numbers will have at least one 1 left, except zero, which is a special case. A common variant doesn't test for zero, and is OK with treating it as power of two or knows that zero can't happen.
Similarly there are other things that you can easily do with bitmath, but not so easy without. As they say, use the right tool for the job. Sometimes bitmath is the right tool.
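A worked version of that check, with a couple of concrete values traced through (the helper function is mine):

#include <stdbool.h>
#include <stdint.h>

// 8  = 0b1000,  8 - 1 = 0b0111,  8 & 7  = 0       -> power of two
// 12 = 0b1100, 12 - 1 = 0b1011, 12 & 11 = 0b1000  -> not a power of two
static bool is_power_of_two(uint32_t x)
{
    return x != 0 && (x & (x - 1)) == 0;
}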
Bitwise operations are so useful that Prof. Knuth wrote a book about them: http://www.amazon.com/The-Computer-Programming-Volume-Fascicle/dp/0321580508
Just to mention a few of the simplest ones: integer multiplication and division by a power of two (using left and right shifts), mod with respect to a power of two, masking and so on. When using bitwise ops, just be sure to provide sufficient comments about what's going on.
However, your example, data[c] >= 128, is not applicable IMO; just keep it that way.
But if you want to compute data[c] % 128, then data[c] & 0x7f is much faster (where & represents bitwise AND), at least when data[c] is known to be non-negative.
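A few of those identities spelled out (the snippet is illustrative, not from the answer; the equivalences hold for unsigned or known-non-negative values):

#include <stdio.h>

int main(void)
{
    unsigned n = 1234;
    printf("%u %u\n", n >> 3, n / 8);       // division by a power of two
    printf("%u %u\n", n << 2, n * 4);       // multiplication by a power of two
    printf("%u %u\n", n & 0x7f, n % 128);   // mod by a power of two
    return 0;
}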
There are several instances where using such hacks may be useful. For instance, they can remove some Java Virtual Machine "optimizations" such as branch predictors. I have found them useful only in a few cases. The main one is multiplying by -1: if you are doing it hundreds of times across a massive array, it is more efficient to simply flip the sign bit (for floating-point values, where negation is just the sign bit) than to actually multiply. Another example where I have used it is checking whether a number is a power of 2 (since that's so easy to figure out in binary). Basically, bit hacks are useful when you want to cheat. Here is a human analogy: if you have a list of numbers and you need to know whether each is greater than 29, you automatically know that if the first digit is 3 or more the whole number is at least 30, and vice versa. Bitwise operations simply allow you to perform similar cheats in binary.
I'm trying to write optimized code for accessing image pixels and need to make a for loop super fast without going down to assembly level. Furthermore, the indexing is done along the rows to minimize cache misses.
This is what I have:
for (indr = 0; indr < (height-1)*width; indr += width) {
    for (indc = 0; indc < width; indc++) {
        I[indr+indc] = dostuff;
    }
}
I can't make it a single loop because the "dostuff" includes accessing elements that aren't on the same row.
Is there a faster way to do this?
EDIT
Okay, because my previous post was slightly unclear, I'm adding the full code here. It's pretty unreadable, but the general idea is that I'm performing a convolution with a simple box using an integral image. The image is first padded with ws+1 zeros on the left and bottom and ws zeros on the right and top. It is then made into an integral image Ii. The following function takes the integral image and extracts the convolution, where the result Ic is the same size as the original image.
void convI(float *Ic, float *Ii, int ws, int width, int height)
{
    int W = width + ws*2 + 1, indR;
    int H = height + ws*2 + 1, indC;
    int w = width, indr;
    int h = height, indc;
    int jmpA = W*(ws+1), jmpC = W*ws, jmpB = ws+1, jmpD = ws;
    for (indR = W*(ws+1), indr = 0; indr < width*(height-1); indR += W, indr += width) {
        for (indC = ws+1, indc = 0; indc < width; indC++, indc++) {
            // Performs I[indA]+I[indD]-I[indB]-I[indC];
            Ic[indr+indc] =
                Ii[indR-jmpA+indC-jmpB] +
                Ii[indR+jmpC+indC+jmpD] -
                Ii[indR+jmpC+indC-jmpB] -
                Ii[indR-jmpA+indC+jmpD];
        }
    }
}
So that's the "dostuff" part. The loop is sluggish.
There is not much reason to expect other code to perform better than what you have, as long as you compile with all optimization levels on.
Why do you suspect the loop itself to be the bottleneck? There is not much that can be said without knowing what you are actually doing. Benchmark your code and look at the assembler it produces if you have doubts.
Edit: After you showed the inner part of your loop.
There is a little bit of potential in moving your index computations as far outside of the loops as possible. Since they are intermixed with the loop variables, this probably can't be optimized as it should be. (Or just reorder the computations of the indices so that the compiler can see them and precompute as much as possible.)
Most likely the performance difficulties come from the access of your vectors. If you manage to compute your indices better, this might also improve, because the compiler/system will actually see that you access your vectors in a regular pattern.
If this doesn't help, reorganize your loop so that the loads of your vectors are incremental rather than the stores. Loads always have to wait until the data is there to perform the operation; stores are less sensitive to that.
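As a concrete sketch of the index hoisting suggested above, here is a rewrite of the question's convI with per-row base pointers pulled out of the inner loop; the row-pointer names are mine and this is untested:

void convI_hoisted(float *Ic, float *Ii, int ws, int width, int height)
{
    int W = width + ws*2 + 1;
    int jmpA = W*(ws+1), jmpC = W*ws, jmpB = ws+1, jmpD = ws;
    for (int indR = W*(ws+1), indr = 0; indr < width*(height-1); indR += W, indr += width) {
        const float *topRow = Ii + indR - jmpA;  // row holding the two "A/B" terms
        const float *botRow = Ii + indR + jmpC;  // row holding the two "C/D" terms
        for (int indC = ws+1, indc = 0; indc < width; indC++, indc++) {
            Ic[indr+indc] = topRow[indC-jmpB] + botRow[indC+jmpD]
                          - botRow[indC-jmpB] - topRow[indC+jmpD];
        }
    }
}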
You can unroll the innermost loop. You will lose readability, but the CPU's cache and the prefetch queue will do a better job. Although this is always true, I don't know how much speed you will gain.
You can declare both indc and indr as register variables, and try to avoid recalculating (height-1)*width; keep it in a temporary variable instead. You know, multiplications eat a lot of clock cycles...
Unless you want to use vectorizing instructions like SSE, there's not much that can be done.
What you have looks fine. If you want to avoid going into assembly, it's best to keep simple loops simple. GCC is smart. If you're clear about what you want your code doing it generally does a good job optimizing it. However, if you do fancy tricks that aren't common in production code, it might have trouble deducing what you "really mean".
Depending on what dostuff actually does, you might find some win in caching I[indr+indc] in a temporary so your code looks something like...
char t = I[indr+indc];
// do stuff
I[indr+indc] = t;
This code will not perform worse (I assume you have at least the basic optimizations turned on), but it might perform better if your do stuff is fancy enough (I can elaborate if you want).
And don't listen to the other guys lifting simple math out of loops. There's really no need. If you look at the assembly generated at -O1, you'll see this is done for you every time. It's one of the cheapest optimizations to make.
There MAY be a win in lifting the (height-1)*width computation in the outer loop to an assignment before the loop. But then, I suspect that a normal compiler these days would do that as a standard optimization. It may also be that having another pointer, set to I[indr], and then indexing off that gives a small win.
Both of these would require some pretty careful benchmarking to note.
// DragonLord style:
float *ic_p = I + (width * height) - 1; // fencepost
// Start at the end, and work backwards
// assumes I is 0-based and wraps, is contiguous
for (indr = (height-1) * width; indr >= 0; indr -= width) {
    // Sadly cannot test on indr -= width here
    // as the 0 pass is needed for the loop
    for (indc = width; indc--; ) {
        // Testing on postdecrement
        // allows you to use the 0 value one last time before testing it FTW
        // indr and indc are both 0-based inside the loop for you
        // e.g. indc varies from (width-1) down to 0
        // due to postdecrement before usage
        printf("I[ %d + %d ] == %f \n", indr, indc, *ic_p);
        // always use pointers in C/C++ for speed, we are not Java
        *ic_p-- = dostuff;
    }
}
Performance may be slightly improved by counting down from height towards 0 if you don't need to use indr inside the loop, or by predecrementing instead of postdecrementing indc if you can get by with a 1-based indc, in which case indc should be initialized to (width + 1):
for (indc=(width+1); --indc; ){
Sometimes a loop in which the CPU spends most of its time has a branch that is mispredicted very often (with probability near 0.5). I've seen a few techniques discussed in very isolated threads but never a list. The ones I know already fix situations where the condition can be turned into a bool, and that 0/1 value is then used in some way to avoid the branch. Are there other conditional branches that can be avoided?
e.g. (pseudocode)
loop () {
    if (in[i] < C)
        out[o++] = in[i++]
    ...
}
Can be rewritten, arguably losing some readability, with something like this:
loop() {
    out[o] = in[i]   // copy anyway, just don't increment
    inc = in[i] < C  // increment counters? (0 or 1)
    o += inc
    i += inc
}
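For reference, here is a compilable C variant of that rewrite; the detail that the input index always advances (the usual stream-compaction pattern) and the function name are my own additions:

#include <stddef.h>

// Copy every element of 'in' that is < C into 'out', without a branch in the
// loop body; returns how many elements were kept.
size_t filter_less_than(int *out, const int *in, size_t n, int C)
{
    size_t o = 0;
    for (size_t i = 0; i < n; i++) {
        out[o] = in[i];       // store unconditionally
        o += (in[i] < C);     // advance the output index only when it qualifies
    }
    return o;
}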
Also, I've seen techniques in the wild changing && to & in the conditional in certain contexts that are escaping my mind right now. I'm a rookie at this level of optimization, but it sure feels like there's got to be more.
Using Matt Joiner's example:
if (b > a) b = a;
You could also do the following, without having to dig into assembly code:
bool if_else = b > a;
b = a * if_else + b * !if_else;
I believe the most common way to avoid branching is to leverage bit parallelism in reducing the total jumps present in your code. The longer the basic blocks, the less often the pipeline is flushed.
As someone else has mentioned, if you want to do more than unrolling loops, and providing branch hints, you're going to want to drop into assembly. Of course this should be done with utmost caution: your typical compiler can write better assembly in most cases than a human. Your best hope is to shave off rough edges, and make assumptions that the compiler cannot deduce.
Here's an example, starting from the following C code:
if (b > a) b = a;
In assembly without any jumps, by using bit-manipulation (and extreme commenting):
sub eax, ebx ; = a - b
sbb edx, edx ; = (b > a) ? 0xFFFFFFFF : 0
and edx, eax ; = (b > a) ? a - b : 0
add ebx, edx ; b = (b > a) ? b + (a - b) : b + 0
Note that while conditional moves are immediately jumped on by assembly enthusiasts, that's only because they're easily understood and provide a higher level language concept in a convenient single instruction. They are not necessarily faster, not available on older processors, and by mapping your C code into corresponding conditional move instructions you're just doing the work of the compiler.
The generalization of the example you give is "replace conditional evaluation with math"; conditional-branch avoidance largely boils down to that.
What's going on with replacing && with & is that, since && is short-circuit, it constitutes conditional evaluation in and of itself. & gets you the same logical results if both sides are either 0 or 1, and isn't short-circuit. Same applies to || and | except you don't need to make sure the sides are constrained to 0 or 1 (again, for logic purposes only, i.e. you're using the result only Booleanly).
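A small illustration of that && -> & rewrite (the function names are mine; it is safe here because both comparisons are side-effect-free and yield 0 or 1, and whether the && version actually compiles to a branch depends on the compiler):

#include <stdio.h>

static int in_range_branchy(int x, int lo, int hi)    { return (x >= lo) && (x <= hi); }
static int in_range_branchless(int x, int lo, int hi) { return (x >= lo) & (x <= hi); }

int main(void)
{
    // Both return the same 0/1 result; the & version evaluates both sides
    // unconditionally, so there is no short-circuit to branch on.
    printf("%d %d\n", in_range_branchy(5, 1, 10), in_range_branchless(5, 1, 10));
    return 0;
}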
At this level things are very hardware-dependent and compiler-dependent. Is the compiler you're using smart enough to compile < without control flow? gcc on x86 is smart enough; lcc is not. On older or embedded instruction sets it may not be possible to compute < without control flow.
Beyond this Cassandra-like warning, it's hard to make any helpful general statements. So here are some general statements that may be unhelpful:
Modern branch-prediction hardware is terrifyingly good. If you could find a real program where bad branch prediction costs more than 1%-2% slowdown, I'd be very surprised.
Performance counters or other tools that tell you where to find branch mispredictions are indispensable.
If you actually need to improve such code, I'd look into trace scheduling and loop unrolling:
Loop unrolling replicates loop bodies and gives your optimizer more control flow to work with.
Trace scheduling identifies which paths are most likely to be taken, and among other tricks, it can tweak the branch directions so that the branch-prediction hardware works better on the most common paths. With unrolled loops, there are more and longer paths, so the trace scheduler has more to work with.
I'd be leery of trying to code this myself in assembly. When the next chip comes out with new branch-prediction hardware, chances are excellent that all your hard work goes down the drain. Instead I'd look for a feedback-directed optimizing compiler.
An extension of the technique demonstrated in the original question applies when you have to do several nested tests to get an answer. You can build a small bitmask from the results of all the tests, and then "look up" the answer in a table.
if (a) {
    if (b) {
        result = q;
    } else {
        result = r;
    }
} else {
    if (b) {
        result = s;
    } else {
        result = t;
    }
}
If a and b are nearly random (e.g., from arbitrary data), and this is in a tight loop, then branch prediction failures can really slow this down. Can be written as:
// assuming a and b are bools and thus exactly 0 or 1 ...
static const int table[] = { t, s, r, q };
unsigned index = (a << 1) | b;
result = table[index];
You can generalize this to several conditionals. I've seen it done for 4. If the nesting gets that deep, though, you want to make sure that testing all of them is really faster than doing just the minimal tests suggested by short-circuit evaluation.
GCC is already smart enough to replace conditionals with simpler instructions. For example, newer Intel processors provide cmov (conditional move). If you can use it, SSE2 provides some instructions to compare 4 integers (or 8 shorts, or 16 chars) at a time.
Additionally, to compute the minimum you can use (see these magic tricks):
min(x, y) = x+(((y-x)>>(WORDBITS-1))&(y-x))
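A compilable form of that identity (the wrapper is mine; note the subtraction can overflow when the operands are far apart in magnitude, and the arithmetic right shift of a negative value is only guaranteed on the usual two's-complement targets):

#include <limits.h>

static int branchless_min(int x, int y)
{
    // (y - x) >> (WORDBITS - 1) is all ones when y < x, all zeros otherwise
    return x + (((y - x) >> (sizeof(int) * CHAR_BIT - 1)) & (y - x));
}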
However, pay attention to things like:
c[i][j] = min(c[i][j], c[i][k] + c[j][k]); // from Floyd-Warshal algorithm
even though no jumps are implied, is much slower than
int tmp = c[i][k] + c[j][k];
if (tmp < c[i][j])
c[i][j] = tmp;
My best guess is that in the first snippet you pollute the cache more often, while in the second you don't.
In my opinion if you're reaching down to this level of optimization, it's probably time to drop right into assembly language.
Essentially you're counting on the compiler generating a specific pattern of assembly to take advantage of this optimization in C anyway. It's difficult to guess exactly what code a compiler is going to generate, so you'd have to look at it anytime a small change is made - why not just do it in assembly and be done with it?
Most processors provide branch prediction that is better than 50%. In fact, if you get a 1% improvement in branch prediction then you can probably publish a paper. There are a mountain of papers on this topic if you are interested.
You're better off worrying about cache hits and misses.
This level of optimization is unlikely to make a worthwhile difference in all but the hottest of hotspots. Assuming it does (without proving it in a specific case) is a form of guessing, and the first rule of optimization is don't act on guesses.