From a long time ago I have a memory which has stuck with me that says comparisons against zero are faster than any other value (ahem Z80).
In some C code I'm writing I want to skip values which have all their bits set. Currently the type of these values is char but may change. I have two different alternatives to perform the test:
if (!~b)
/* skip */
and
if (b == 0xff)
/* skip */
Apart from the latter making the assumption that b is an 8bit char whereas the former does not, would the former ever be faster due to the old compare to zero optimization trick, or are the CPUs of today way beyond this kind of thing?
If it is faster, the compiler will substitute it for you.
In general, you can't write C better than the compiler can optimize it. And it is architecture specific anyway.
In short, don't worry about it unless that sub-micro-nano-second is ultra important
From what I recall in my architecture classes, I believe they should be equally fast. Both have 2 instructions.
First example
1. Negate b into a temp register
2. Compare temp register equal 0
Second example
1. Subtract 0xff from b into a temp register
2. Compare temp register equal to 0
These are basically identical, and besides, even if your particular architecture requires more or less than this, is it really worth the fraction of a nanosecond? Several minutes have been spent just answering this question.
I would say it's not so much the that CPUs are beyond these kind of tricks as it is the compilers.
The CPUs of today are, however, beyond simple tricks which pull an extra clock-tick or two of speed. Even if you do this 100,000 times a second, we are still only talking about an increase in speed of 0.00003 seconds on a single-core 3Ghz computer - it is simply not worth your time to worry about things like this.
Go with the one that will be easier for the person who is maintaining your code to understand. If you have a successful product, most of the expense in software is in maintenance. If you write cryptic code you add to that expense. If you don't have a successful product, it doesn't matter because no one will have to maintain it. I have been in situations where I had to save every byte I could, and had to resort to tricks like the one you gave, but I only do it as the very very very last resort.
Related
With regards to coding in C, which would be faster, to check the statement with an If, or I just run the function anyway for example say the output is already 1.
if(a==b && output!=1)
{
output=1;
}
Or
if(a==b)
{
output=1;
}
In the first code, an extra check has to be run every time the code runs.
In the second you are running the code repeatedly unnecessarily
Which is more efficient??
The question basically boils down to the question of is a compare less expensive than a variable assignment. For integers, the answer is no. I am assuming this will be in a tight loop where the variables will already be in the CPU level 1 cache. The compare will compile down to Op codes like:
1) Move "output" memory locations data into Register A
2) Put 1 into Register B
3) Jump <somewhere> if Register A == Register B.
You might get an optimization where 2) is not done if comparing to 0 because there are special op codes for comparing to 0 in most CPUs.
The assignment will compiler to op codes like:
1) Put 1 into Register A
2) Push Register A to memory location of output
The question come down to clock cycles spent for each of the op codes. I think that they are all likely to be exactly the same clock cycles.
Regardless any possible optimization, as shown in the comments, the first code is less efficient than the second code due to the extra check.
Beware of your data meaning, that check may be mandatory.
If not, you should optimize your code as suggested.
Edit
I'm assuming your question to be more theoretical than practical. In any real scenario, the data context assume a huge role when we want to optimize some code.
The code don't need to be fast itself, but need to be fast in processing its data.
Which is more taxing? Is enclosing an array element exchange with a conditional if statement to prevent redundant exchanges, like say exchanging with itself, more efficient?
Or is having to check for an only probabilistic condition all the time more inefficient? Say the chance of the special condition increases every invocation.
Say you're developing an algorithm and is trying to check for efficiency: compares or exchanges(like insertion sort).
if(condition)
exchange two elements
This very much depends on your processor architecture, how often this would be done, the throughput its required to handle and the cost of doing said exchanges, in which case, the only viable, real world-answer is: "profile, profile and profile some more".
Basically, if you CPU suffers badly from branch miss-prediction, and the swapping of elements is trivial, then its makes sense to leave out the conditional.
however, if your target CPU architecture can support a fair amount of branch miss-predictions with cause too much stalling or the cost of swapping elements is not trivial, then you might gain performance, depending on the size of said array. you may also benefit from the use of instructions like MOVcc/CMPXCHG, or there non-x86 counterparts (though it this situation, you'd still need a read + compare, but it removes the branching).
With so many variable inputs, it makes sense to profile your code and find where its really bottlenecking, things like VTune or CodeAnalyst will also give you stats on branch miss-prediction so you can see how much it affects your algorithm as a whole.
A useful way to look at any condition-evaluation code to ask, "What is the probability of each outcome?"
For example, it there's a test expression test, and it's probability of being true is 1/100, then on average it is telling you very little, for your investment in processor cycles.
In fact you can quantify that.
If it's true, then the the amount of information you've learned is pretty good.
It is log2(100/1) = 6.6 bits, roughly, but that only happens 1 out of 100 times.
The other 99 times, the amount of information you learn is log2(100/99) = .014 bits.
Practically nothing.
So a condition like that is telling you very little, on average. It's not "working" very hard.
A good way to finish quantifying it is to multiply what you learn from each outcome by the probability of that outcode, and add those up.
That tells you what you learn on average.
That is 6.6 * 1/100 + .014 * 99/100 = .066 + .014 = .08 bits, which is very poor.
(This number is called the entropy of the decision.)
On the other hand, if you have a decision point where each outcome is equally likely, it learns a full 1 bit on average.
In fact that's the most work a binary decision can possibly do.
So if you're worried about the performance of a conditional test (you may not be) try to make it earn its cycles.
From a memory access standpoint... is it worth attempting an optimization like this?
int boolean_value = 0;
//magical code happens and boolean_value could be 0 or 1
if(boolean_value)
{
//do something
}
Instead of
unsigned char boolean_value = 0;
//magical code happens and boolean_value could be 0 or 1
if(boolean_value)
{
//do something
}
The unsigned char of course takes up only 1 byte as apposed to the integers 4 (assuming 32 bit platform here), but my understanding is that it would be faster for a processor to read the integer value from memory.
It may or may not be faster, and the speed depends on so many things that a generic answer is impossible. For example: hardware architecture, compiler, compiler options, amount of data (does it fit into L1 cache?), other things competing for the CPU, etc.
The correct answer, therefore, is: try both ways and measure for your particular case.
If measurement does not indicate that one method is significantly faster than the other, opt for the one that is clearer.
From a memory access standpoint... is
it worth attempting an optimization
like this?
Probably not. In almost all modern processors, memory will get fetched based on the word size of the processor. In your case, even to get one byte of memory out, your processor probably fetches the entire 32-bit word or more based on the caching of that processor. Your architecture may vary, so you will want to understand how your CPU works to gauge.
But as others have said, it doesn't hurt to try it and measure it.
This is almost never a good idea. Many systems can only read word-sized chunks from memory at once, so reading a byte then masking or shifting will actually take more code space and just as much (data) memory. If you're using an obscure tiny system, measure, but in general this will actually slow down and bloat your code.
Asking how much memory unsigned char takes versus int is only meaningful when it's in an array (or possibly a structure, if you're careful to order the elements to take care of alignment). As a lone variable, it's very unlikely that you save any memory at all, and the compiler is likely to generate larger code to truncate the upper bits of registers.
As a general policy, never use smaller-than-int types except in arrays unless you have a really good reason other than trying to save space.
Follow the standard rules of optimization. First, don't optimize. Then test if your code needs it at some point. Then optimize that point. This link provides an excellent intro to the topic of optimization.
http://www.catb.org/~esr/writings/taoup/html/optimizationchapter.html
Closed. This question needs to be more focused. It is not currently accepting answers.
Want to improve this question? Update the question so it focuses on one problem only by editing this post.
Closed 6 years ago.
Improve this question
I understand most of the micro-optimizations out there but are they really useful?
Exempli gratia: does doing ++i instead of i++, or while(1) or for(;;) really result in performance improvements (either in memory fingerprint or CPU cycles)?
So the question is, what micro-optimizations can be done in C? Are they really useful?
You should rely on your compiler to optimise this stuff. Concentrate on using appropriate algorithms and writing reliable, readable and maintainable code.
The day tclhttpd, a webserver written in Tcl, one of the slowest scripting language, managed to outperform Apache, a webserver written in C, one of the supposedly fastest compiled language, was the day I was convinced that micro-optimizations significantly pales in comparison to using a faster algorithm/technique*.
Never worry about micro-optimizations until you can prove in a debugger that it is the problem. Even then, I would recommend first coming here to SO and ask if it is a good idea hoping someone would convince you not to do it.
It is counter-intuitive but very often code, especially tight nested loops or recursion, are optimized by adding code rather than removing them. The gaming industry has come up with countless tricks to speed up nested loops using filters to avoid unnecessary processing. Those filters add significantly more instructions than the difference between i++ and ++i.
*note: We have learned a lot since then. The realization that a slow scripting language can outperform compiled machine code because spawning threads is expensive led to the developments of lighttpd, NginX and Apache2.
There's a difference, I think, between a micro-optimization, a trick, and alternative means of doing something. It can be a micro-optimization to use ++i instead of i++, though I would think of it as merely avoiding a pessimization, because when you pre-increment (or decrement) the compiler need not insert code to keep track of the current value of the variable for use in the expression. If using pre-increment/decrement doesn't change the semantics of the expression, then you should use it and avoid the overhead.
A trick, on the other hand, is code that uses a non-obvious mechanism to achieve a result faster than a straight-forward mechanism would. Tricks should be avoided unless absolutely needed. Gaining a small percentage of speed-up is generally not worth the damage to code readability unless that small percentage reflects a meaningful amount of time. Extremely long-running programs, especially calculation-heavy ones, or real-time programs are often candidates for tricks because the amount of time saved may be necessary to meet the systems performance goals. Tricks should be clearly documented if used.
Alternatives, are just that. There may be no performance gain or little; they just represent two different ways of expressing the same intent. The compiler may even produce the same code. In this case, choose the most readable expression. I would say to do so even if it results in some performance loss (though see the preceding paragraph).
I think you do not need to think about these micro-optimizations because most of them is done by compiler. These things can only make code more difficult to read.
Remember, [edited] premature [/edited] optimization is an evil.
To be honest, that question, while valid, is not relevant today - why?
Compiler writers are a lot more smarter than they were 20 years ago, rewind back in time, then these optimizations would have been very relevant, we were all working with old 80286/386 processors, and coders would often resort to tricks to squeeze even more bytes out of the compiled code.
Today, processors are too fast, compiler writers knows the intimate details of operand instructions to make every thing work, considering that there is pipe-lining, core processors, acres of RAM, remember, with a 80386 processor, there would be 4Mb RAM and if you're lucky, 8Mb was considered superior!!
The paradigm has shifted, it was about squeezing every byte out of compiled code, now it is more on programmer productivity and getting the release out the door much sooner.
The above I have stated the nature of the processor, and compilers, I was talking about the Intel 80x86 processor family, Borland/Microsoft compilers.
Hope this helps,
Best regards,
Tom.
If you can easily see that two different code sequences produce identical results, without making assumptions about the data other than what's present in the code, then the compiler can too, and generally will.
It's only when the transformation from one to the other is highly non-obvious or requires assuming something that you may know to be true but the compiler has no way to infer (eg. that an operation cannot overflow or that two pointers will never alias, even though they aren't declared with the restrict keyword) that you should spend time thinking about these things. Even then, the best thing to do is usually to find a way to inform the compiler about the assumptions that it can make.
If you do find specific cases where the compiler misses simple transformations, 99% of the time you should just file a bug against the compiler and get on with working on more important things.
Keeping the fact that memory is the new disk in mind will likely improve your performance far more than applying any of those micro-optimizations.
For a slightly more pragmatic take on the question of ++i vs. i++ (at least in a C++ context) see http://llvm.org/docs/CodingStandards.html#micro_preincrement.
If Chris Lattner says it, I've got to pay attention. ;-)
You would do better to consider every program you write primarily as a language in which you communicate your ideas, intentions and reasoning to other human beings who will have to bug-fix, reuse and understand it. They will spend more time on decoding garbled code than any compiler or runtime system will do executing it.
To summarise, say what you mean in the clearest way, using the common idioms of the language in question.
For these specific examples in C, for(;;) is the idiom for an infinite loop and "i++" is the usual idiom for "add one to i" unless you use the value in an expression, in which case it depends whether the value with the clearest meaning is the one before or after the increment.
Here's real optimization, in my experience.
Someone on SO once remarked that micro-optimization was like "getting a haircut to lose weight". On American TV there is a show called "The Biggest Loser" where obese people compete to lose weight. If they were able to get their body weight down to a few grams, then getting a haircut would help.
Maybe that's overstating the analogy to micro-optimization, because I have seen (and written) code where micro-optimization actually did make a difference, but when starting off there is a lot more to be gained by simply not solving problems you don't have.
x ^= y
y ^= x
x ^= y
++i should be prefered over i++ for situations where you don't use the return value because it better represents the semantics of what you are trying to do (increment i) rather than any possible optimisation (it might be slightly faster, and is probably not worse).
Generally, loops that count towards zero are faster than loops that count towards some other number. I can imagine a situation where the compiler can't make this optimization for you, but you can make it yourself.
Say that you have and array of length x, where x is some very big number, and that you need to perform some operation on each element of x. Further, let's say that you don't care what order these operations occur in. You might do this...
int i;
for (i = 0; i < x; i++)
doStuff(array[i]);
But, you could get a little optimization by doing it this way instead -
int i;
for (i = x-1; i != 0; i--)
{
doStuff(array[i]);
}
doStuff(array[0]);
The compiler doesn't do it for you because it can't assume that order is unimportant.
MaR's example code is better. Consider this, assuming doStuff() returns an int:
int i = x;
while (i != 0)
{
--i;
printf("%d\n",doStuff(array[i]));
}
This is ok as long as printing the array contents in reverse order is acceptable, but the compiler can't decide that for you.
This being an optimization is hardware dependent. From what I remember about writing assembler (many, many years ago), counting up rather than counting down to zero requires an extra machine instruction each time you go through the loop.
If your test is something like (x < y), then evaluation of the test goes something like this:
subtract y from x, storing the result in some register r1
test r1, to set the n and z flags
branch based on the values of the n and z flags
If your test is ( x != 0), you can do this:
test x, to set the z flag
branch based on the value of the z flag
You get to skip a subtract instruction for each iteration.
There are architectures where you can have the subtract instruction set the flags based on the result of the subtraction, but I'm pretty sure x86 isn't one of them, so most of us aren't using compilers that have access to such a machine instruction.
1) On a 32-bit CPU is it faster to acccess an array of 32 boolean values or to access the 32 bits within one word? (Assume we want to check the value of the Nth element and can use either a bit-mask (Nth bit is set) or the integer N as an array index.)
It seems to me that the array would be faster because all common computer architectures natively work at the word level (32 bits, 64 bits, etc., processed in parallel) and accessing the sub-word bits takes extra work.
I know different compilers will represent things differently, but it seems that the underlying hardware architecture would dictate the answer. Or does the answer depend on the language and compiler?
And,
2) Is the speed answer reversed if this array represents a state that I pass between client and server?
This question came to mind when reading question "How use bit/bit-operator to control object state?"
P.S. Yes, I could write code to test this myself, but then the SO community wouldn't get to play along!
Bear in mind that a theoretically faster solution that doesn't fit into a cache line might be slower than a theoretically slower one that does, depending on a whole host of things. If this is actually something that needs to be fast, as determined by profiling, test both ways and see. If it doesn't, do whatever looks like cleaner code, which is probably the array.
It depends on the compiler and the access patterns and the platform. Raymond Chen has an excellent cost-benefit analysis: http://blogs.msdn.com/oldnewthing/archive/2008/11/26/9143050.aspx .
Even on non x86 platforms the use of bits can be prohibitive as at least one PPC platform out there uses microcoded instructions to perform a variable shift which can do nasty things with other hardware threads.
So it can be a win, but you need to understand the context in which it will be good and bad. (Which is a general thing anyway.)
For question #1: Yes, on most 32-bit platforms, an array of boolean values should be faster, because you will just be loading each 32-bit-aligned value in the array and testing it against 0. If you use a single word, you will have all that work plus the overhead of bit-fiddling.
For question #2: Again, yes, since sending data over a network is significantly slower than operating on data in the CPU and main memory, the overhead of sending even one word will strongly outweigh any performance gain or loss you get by aligning words or bit fiddling.
This is the code generated by 0 != (value & (1 << index)) to test a bit:
00401000 mov eax,1
00401005 shl eax,cl
00401007 and eax,1
And this by values[index] to test a bool[]:
00401000 movzx eax,byte ptr [ecx+eax]
Can't figure out how to put a loop around it that doesn't get optimized away, I'll vote bool[].
If you are going to check more than one value at a time, doing it in parallel will obviously be faster. If you're only checking one value, it's probably the same.
If you need a better answer than that, write some tests and get back to us.
I think a byte array is probably better than a full-word array for simple random access.
It will give better cache locality than using the full word size, and I don't think byte access is any slower on most/all common architectures.