As an example, assume that the expression sys->pot.atoms[item->P.kind].mass is evaluated inside a loop. The loop only changes item, so the expression can be simplified as atoms[item->P.kind].mass by defining a variable as atoms = sys->pot.atoms before the loop. Do modern compilers like gcc perform this kind of optimization automatically (if optimization is enabled)? And is it reliable regardless of the number of expressions like atoms[item->P.kind].mass existing inside a loop?
Yes it is a very common optimisation called Loop invariant code motion, also called hoisting or scalar promotion, often performed as a side effect of Common subexpression elimination.
It is valid to compute sys->pot.atoms just once before the loop if the compiler can ascertain that neither sys nor sys->pot.atoms can be modified inside the loop.
Note however, as commented by Groo, that if sys or sys->pot or sys->pot.atoms are specified as volatile, then it would be incorrect to compute it only once if the expression sys->pot.atoms is evaluated multiple times in the loop body or expressions.
It's a very common optimization.
And is it reliable regardless of the number of expressions
No, because optimizations is not something you can rely on happening in general. The C standard says nothing about it, so it's up to the maker of the compiler to give guarantees or not. But that's nothing you really do for the optimizer. The optimizer has a "best effort" approach, and a missed optimization is often treated like a flaw rather than an actual bug.
EDIT:
From discussion in comments, I found it useful to mention that just because a certain optimization was performed, that does not guarantee faster code. For instance, the benefit of loop unrolling is that the test in the loop does not need to be performed every iteration. But on the other hand, longer code can be less cache friendly. So asking if it's guaranteed that a certain optimization is performed or not does not really give any useful information.
I always wonder where I should do optimization myself, and where I should sit relax and leave it to the compiler.
That's very hard to know in advance. Guys like Linus Torvalds can basically see the assembly code in their head just by watching the C code, but for us mere mortals, it comes down to benchmarking and profiling.
Before even considering micro optimizations, perform these checks
Make sure that the code you're about to optimize actually is a bottleneck
Make sure you're using a good algorithm
Make sure the code is cache friendly
Related
I am writing a c code and I would like to know if making simple operation like multiplication more CPU friendly makes any difference and the code faster. For example, replacing this line of code:
y = x * 15;
with
y = x << 4;
y -= x;
Does the compiler already do that? Should I use the -O2 option in order to make it happen?
There are two parts to the answer:
No, unless you are writing a very specialized function (e.g. a signal processing function that must execute in 20 clocks) you should not optimize; leave that to the compiler. In general your job is to write readable code, and the compiler will (to it's capability optimize it). Note that the optimization will be different for different processors as their hardware (computational capability) can be very different. For example a shift by N instruction (like that in your code) may take N clocks on a processor with a regular shifter, but it will take a single clock (or less) on processors with a hardware barrel shifter.
Yes, most modern optimizing compilers will optimize (for example replace multiplication by shifting where appropriate) without explicit optimization options.
Summarized, optimize only in rare situations when you already know that the compiler didn't do a good job, it is a problem that must be addressed, you know well how to do better than the compiler, and the resulting increased maintenance cost is worth it.
Optimizing code by hand is almost always an exercise in futility these days, especially in higher level languages. While C is almost assembly, modern compilers have a lot more tricks built into them than most people are aware of.
In addition, unless the code you're optimizing is going to be used a lot, i.e., millions of times in close succession, the work of optimizing the code will cost more time than the savings you achieve.
With that said, the only way to see if your code is measurably faster would be to test it: Put each version in a tight loop and execute it a million (or more) times, and see if there's a noticeable difference.
Note that your optimization is for a specific multiplier - any other operand you use it for is going to yield different results. Because it can't be generalized, there's little likelyhood this optimization will be done by any compiler in all cases - and just looking at the code and not knowing what processor architecture it will be run on, I can't say whether it would be faster or not.
Context
The function BN_consttime_swap in OpenSSL is a thing of beauty. In this snippet, condition has been computed as 0 or (BN_ULONG)-1:
#define BN_CONSTTIME_SWAP(ind) \
do { \
t = (a->d[ind] ^ b->d[ind]) & condition; \
a->d[ind] ^= t; \
b->d[ind] ^= t; \
} while (0)
…
BN_CONSTTIME_SWAP(9);
…
BN_CONSTTIME_SWAP(8);
…
BN_CONSTTIME_SWAP(7);
The intention is that so as to ensure that higher-level bignum operations take constant time, this function either swaps two bignums or leaves them in place in constant time. When it leaves them in place, it actually reads each word of each bignum, computes a new word that is identical to the old word, and write that result back to the original location.
The intention is that this will take the same time as if the bignums had effectively been swapped.
In this question, I assume a modern, widespread architecture such as those described by Agner Fog in his optimization manuals. Straightforward translation of the C code to assembly (without the C compiler undoing the efforts of the programmer) is also assumed.
Question
I am trying to understand whether the construct above characterizes as a “best effort” sort of constant-time execution, or as perfect constant-time execution.
In particular, I am concerned about the scenario where bignum a is already in the L1 data cache when the function BN_consttime_swap is called, and the code just after the function returns start working on the bignum a right away. On a modern processor, enough instructions can be in-flight at the same time for the copy not to be technically finished when the bignum a is used. The mechanism allowing the instructions after the call to BN_consttime_swap to work on a is memory dependence speculation. Let us assume naive memory dependence speculation for the sake of the argument.
What the question seems to boil down to is this:
When the processor finally detects that the code after BN_consttime_swap read from memory that had, contrary to speculation, been written to inside the function, does it cancel the speculative execution as soon as it detects that the address had been written to, or does it allow itself to keep it when it detects that the value that has been written is the same as the value that was already there?
In the first case, BN_consttime_swap looks like it implements perfect constant-time. In the second case, it is only best-effort constant-time: if the bignums were not swapped, execution of the code that comes after the call to BN_consttime_swap will be measurably faster than if they had been swapped.
Even in the second case, this is something that looks like it could be fixed for the foreseeable future (as long as processors remain naive enough) by, for each word of each of the two bignums, writing a value different from the two possible final values before writing either the old value again or the new value. The volatile type qualifier may need to be involved at some point to prevent an ordinary compiler to over-optimize the sequence, but it still sounds possible.
NOTE: I know about store forwarding, but store forwarding is only a shortcut. It does not prevent a read being executed before the write it is supposed to come after. And in some circumstances it fails, although one would not expect it to in this case.
Straightforward translation of the C code to assembly (without the C compiler undoing the efforts of the programmer) is also assumed.
I know it's not the thrust of your question, and I know that you know this, but I need to rant for a minute. This does not even qualify as a "best effort" attempt to provide constant-time execution. A compiler is licensed to check the value of condition, and skip the whole thing if condition is zero. Obfuscating the setting of condition makes this less likely to happen, but is no guarantee.
Purportedly "constant-time" code should not be written in C, full stop. Even if it is constant time today, on the compilers that you test, a smarter compiler will come along and defeat you. One of your users will use this compiler before you do, and they will not be aware of the risk to which you have exposed them. There are exactly three ways to achieve constant time that I am aware of: dedicated hardware, assembly, or a DSL that generates machine code plus a proof of constant-time execution.
Rant aside, on to the actual architecture question at hand: assuming a stupidly naive compiler, this code is constant time on the µarches with which I am familiar enough to evaluate the question, and I expect it to broadly be true for one simple reason: power. I expect that checking in a store queue or cache if a value being stored matches the value already present and conditionally short-circuiting the store or avoiding dirtying the cache line on every store consumes more energy than would be saved in the rare occasion that you get to avoid some work. However, I am not a CPU designer, and do not presume to speak on their behalf, so take this with several tablespoons of salt, and please consult one before assuming this to be true.
This blog post, and the comments made by the author, Henry, on the subject of this question should be considered as authoritative as anyone should allowed to expect. I will reproduce the latter here for archival:
I didn’t think the case of overwriting a memory location with the same value had a practical use. I think the answer is that in current processors, the value of the store is irrelevant, only the address is important.
Out here in academia, I’ve heard of two approaches to doing memory disambiguation: Address-based, or value-based. As far as I know, current processors all do address-based disambiguation.
I think the current microbenchmark has some evidence that the value isn’t relevant. Many of the cases involve repeatedly storing the same value into the same location (particularly those with offset = 0). These were not abnormally fast.
Address-based schemes uses a store queue and a load queue to track outstanding memory operations. Loads check the store queue to for an address match (Should this load do store-to-load forwarding instead of reading from cache?), while stores check the load queue (Did this store clobber the location of a later load I allowed to execute early?). These checks are based entirely on addresses (where a store and load collided). One advantage of this scheme is that it’s a fairly straightforward extension on top of store-to-load forwarding, since the store queue search is also used there.
Value-based schemes get rid of the associative search (i.e., faster, lower power, etc.), but requires a better predictor to do store-to-load forwarding (Now you have to guess whether and where to forward, rather than searching the SQ). These schemes check for ordering violations (and incorrect forwarding) by re-executing loads at commit time and checking whether their values are correct. In these schemes, if you have a conflicting store (or made some other mistake) that still resulted in the correct result value, it would not be detected as an ordering violation.
Could future processors move to value-based schemes? I suspect they might. They were proposed in the mid-2000s(?) to reduce the complexity of the memory execution hardware.
The idea behind constant-time implementation is not to actually perform everything in constant time. That will never happen on an out-of-order architecture.
The requirement is that no secret information can be revealed by timing analysis.
To prevent this there are basically two requirements:
a) Do not use anything secret as a stop condition for a loop, or as a predicate to a branch. Failing to do so will open you to a branch prediction attack https://eprint.iacr.org/2006/351.pdf
b) Do not use anything secret as an index to memory access. This leads to cache timing attacks http://www.daemonology.net/papers/htt.pdf
As for your code: assuming that your secret is "condition" and possibly the contents of a and b the code is perfectly constant time in the sense that its execution does not depend on the actual contents of a, b and condition. Of course the locality of a and b in memory will affect the execution time of the loop, but not the CONTENTS which are secret.
That is assuming of course condition was computed in a constant time manner.
As for C optimizations: the compiler can only optimize code based on information it knows. If "condition" is truly secret the compiler should not be able to discern it contents and optimize. If it can be deducted from your code then the compiler will most likely make optimization for the 0 case.
is there any performance difference between the following two cases:
First:
int test_some_condition(void);
if( some_variable == 2 && test_some_condition())
{
//body
}
Second:
int test_some_condition(void);
if( some_variable == 2 )
{
if(test_some_condition())
{
//body
}
}
UPDATE: I know how to create a test and measure the performance of each case or to look at the assembly produced for each case but I am sure I am not the first one to come across this question and it would be great if someone who has already tested that can be me a simple yes/no answer.
Difference in terms of what?
Readability? Yes, there's a difference. The first is much clearer and much better expresses your intent. And it also takes up fewer lines in the editor, which is not necessarily an advantage in itself, but it does make the code easier to read and grok at a glance for anyone who comes behind and wants to edit it.
Performance/"speed"? No. I'd be willing to bet actual money that there is absolutely no discernible difference once you run those two snippets of code through a compiler with optimizations turned on. And it wouldn't take much to convince me to bet on that same case even with optimizations disabled.
Why? Because in C (and all of the C-derived languages that I know of), the && operator performs short-circuit evaluation, which means that if the first condition evaluates to false, then it doesn't even bother to evaluate the second condition, because there's no way that the entire statement could ever turn out to be true.
Nesting the if statements was a common "optimization" trick in the bad old days of VB 6 when the And operator did not perform short-circuit evaluation. There's no purpose that I can imagine to using it in C code, unless it enhances readability. And honestly, if you run across a compiler that doesn't render these two code snippets completely equivalent in terms of performance, then it's time to throw that compiler away and stop using it. This is the most basic optimization under the sun, a "low-hanging fruit" for compiler writers. If they can't get this right, I wouldn't trust them with the rest of your code.
But, in general, worrying about this sort of thing (which definitely falls under the category of a "micro-optimization") is not helping you to write better code or become a better programmer. It's just causing you to waste a lot of time asking questions on Stack Overflow and contributing to the reputation of users like me who post this same answer to 2–3 similar questions a week. And that's time you're not spending writing code and improving your skills in tangible ways.
The only way anyone can really tell with a modern compiler is to look at the machine code.
There should be no difference. If there is a difference, it's likely not measurable (even if you do it millions of times you won't get conclusive results).
Testing these two examples in a loop running 10.000.000 times gives:
$ time ./test1
real 0m0.045s
user 0m0.044s
sys 0m0.001s
$ time ./test2
real 0m0.045s
user 0m0.043s
sys 0m0.003s
Also, keep in mind that if the first part of the expression fails, the second expression will never be evaluated. This is maybe not so clear in the first example.
Also, in conditions where the first evaluated expression returns false:
$ time ./test1_1
real 0m0.035s
user 0m0.034s
sys 0m0.001s
$ time ./test2_1
real 0m0.035s
user 0m0.034s
sys 0m0.000s
If you are thinking whether the first version always calls test_some_condition but the second version calls it only if the first condition is true, then the answer is that both versions are equivalent because the AND operator is lazy and will not evaluate its second argument if the first is already false.
The standard guarantees this behaviour. This makes it legal to say:
if (array_size > n && my_array[n] == 1) { ... }
This would be broken code without the laziness guarantee.
The two samples are logically equivalent, provided that compound conditions are lazily evaluated, meaning that in a && b, b won't be evaluated when a is known to be false. So even if there were a difference I wouldn't care because that might be an artefact (or a bug) of the compiler you're using, and it might change with the next release or bugfix.
Loop vectorization is when all right-hand-side expressions are computed at the onset. I just discovered my loops are being vectorized (in FORTRAN 77... don't ask). I need my loop condition variable to be updated in each iteration, but how can I rewrite to work around this vectorization?
In a related post, I'm looking for a way to disable this optimization "feature" in FORTRAN specifically, but here I am looking for a more algorithmic solution to the general case.
That's not what loop vectorisation means to me. To me the phrase means that the compiler will generate code which can take advantage of any vector computation capabilities of the hardware. On a simple Intel Xeon this might mean generating SSE4 instructions to simultaneously manipulate a few adjacent array elements together, on a Cray there may be much more available in terms of simultaneous execution of the same operation on vector registers.
How do you think that all the RHS expressions are 'computed at the onset' ? I'm not sure what you mean by that. Could you post some code to explain ? If you mean that the number of trips through the loop is computed on entry to the first iteration, then that is correct. That is a very useful feature when it comes to optimising code and not one most Fortran programs would benefit from avoiding.
If you are writing DO loops in Fortran updating the iteration variable is forbidden by the standard, and always has been so far as I recall. Your compiler might let you get away with it but I wouldn't trust a Fortran program in which this happened.
I have programmed an embedded software (using C of course) and now I'm considering ways to improve the running time of the system. The most important single module in my system is one very large nested for loop module.
That module consists of two nested for loops that loops max 122500 times. That's not very much yet, but the problem is that inside that nested for loop I have a function call to a function that is in another source file. That specific function consists mostly of two another nested for loops which loops always 22500 times. So now I have to make a function call 122500 times.
I have made that function that is to be called a lot lighter and shorter (yet still works as it should) and now I started to think that would it be faster to rip off that function call and write that process directly inside those first two for loops?
The processor in that system is ARM7TDMI and its frequency is 55MHz. The system itself isn't very time critical so it doesn't have to be real time capable. However the faster it can process its duties the better.
Also would it be also faster to use while loops instead of fors? And any piece of advice about how to improve the running time is appreciated.
-zaplec
TRY IT AND SEE!!
It'll almost certainly make a difference. Function call overhead isn't usually that much of an issue, but at over 100K repetitions it starts to add up.
...But whether or not it makes any real-world difference is something only you can answer, after trying it and timing the results.
As for for vs while... it shouldn't matter unless you actually change the behavior when changing the loop. If in doubt, make your compiler spit out assembler code for both and compare... or just change it and time it.
You need to be careful in the optimizations you make because you aren't always clear on which optimizations the compiler is making for you. Pre-optimization is a common mistake people make. Is it important that your code is readable and easily maintained or slightly faster? Like others have suggested, the best approach is to benchmark the different ways and see if there is a noticeable difference.
If you don't believe your compiler does much in the way of optimization I would look at some older concepts in optimizing C (searches on SO or google should provide some good links).
The ARM processor has an instruction pipeline (cache). When the processor encounters a branch (call) instruction, it must clear the pipeline and reload, thus wasting some time. One objective when optimizing for speed is to reduce the number of reloads to the instruction pipeline. This means reducing branch instructions.
As others have stated in SO, compile your code with optimization set for speed, and profile. I prefer to look at the assembly language listing as well (either printed from the compiler or displayed interwoven in the debugger). Use this as a baseline. If you can't profile, you can use assembly instruction counting as a rough estimate.
The next step is to reduce the number of branches; or the number times a branch is taken. Unrolling loops helps to reduce the number of times a branch is taken. Inlining helps reduce the number of branches. Before applying this fine-tuning techniques, review the design and code implementation to see if branches can be reduced. For example, reduce the number of "if" statements by using Boolean arithmetic or using Karnaugh Maps. My favorite is reducing requirements and eliminating code that doesn't need to be executed.
In the code implementation, move code that doesn't change outside of the for or while loops. Some loops may be reduce to equations (example, replacing a loop of additions with a multiplication). Also, reduce the quantity of iterations, by asking "does this loop really need to be executed this many times").
Another technique is to optimize for Data Oriented Design. Also check this reference.
Just remember to set a limit for optimizing. This is where you decide any more optimization is not generating any ROI or customer satisfaction. Also, apply optimizations in stages; which will allow you to have a deliverable when your manager asks for one.
Run a profiler on your code. If you are just guessing at where you are spending your time, you are probably wrong. A profiler will show what function is taking the most time and you can focus on that. You could be doing something in the function that takes longer than the function call itself. Did you look to see if you can change floating operations to integer, or integer math to shifts? You can spend a lot of time fiddling with things that don't make much difference. Run a profiler on your code and know for sure that the things you are changing will make a difference.
For function vs. inline, unfortunately there is no easy answer. I.e. it depends. See this FAQ. For "for" vs. "while", I wouldn't think there is any significant difference in performance.
In general, a function call should have more overhead than inlining. You really should profile however, as this can be affected quite a bit by your compiler (especially the compile/optimization settings). Some compilers will automatically inline code for example.