Compiler behavior? - c

I am reviewing some source code and I was wondering if the following was thread safe? I have heard of compiler or CPU instruction/read reordering (would it have something to do with branch prediction?) and the Data->unsafe_variable variable below can be modified at any time by another thread.
My question is: depending on how the compiler/CPU reorder read/writes, would it be possible that the below code would allow the Data->unsafe_variable to be fetched twice? (see 2nd snippet)
Note: I do not worry about the first access, any data can be there as long as it does not pass the 'if', I am just concerned by the possibility that the data would be fetched another time after the 'if'. I was also wondering if the cast into volatile here would help preventing a double fetch?
int function(void* Data) {
// Data is allocated on the heap
// What it contains at this point is not important
size_t _varSize = ((volatile DATA *)Data)->unsafe_variable;
if (_varSize > x * y)
{
return FALSE;
}
// I do not want Data->unsafe_variable to be fetch once this point reached,
// I want to use the value "supposedly" stored in _varSize
// Would any compiler/CPU reordering would allow it to be double fetched?
size_t size = _varSize - t * q;
function_xy(size);
return TRUE;
}
Basically I do not want the program to behave like this for security reasons:
_varSize = ((volatile DATA *)Data)->unsafe_variable;
if (_varSize > x * y)
{
return FALSE;
}
size_t size = ((volatile DATA *)Data)->unsafe_variable - t * q;
function10(size);
I am simplifying here and they cannot use mutex. However, would it be safer to use _ReadWriteBarrier() or MemoryBarrier() after the fist line instead of a volatile cast? (VS compiler)
Edit: Giving slightly more context to the code.

The code is broken for many reasons. I'll just point out one of the more subtle ones as others have pointed out the more obvious ones. The object is not volatile. Casting a pointer to a pointer to a volatile object doesn't make the object volatile, it just lies to the compiler.
But there's a much bigger point -- you are going about this totally the wrong way. You are supposed to be checking whether the code is correct, that is, whether it is guaranteed to work. You aren't clever enough, nobody is, to think of every possible way the system might fail to do what you assume it will do. So instead, just don't make those assumptions.
Thinking about things like CPU read re-ordering is totally wrong. You should expect the CPU to do what, and only what, it is required to do. You should definitely not think about specific mechanisms by which it might fail, but only whether it is guaranteed to work.
What you are doing is like trying to figure out if an employee is guaranteed to show up for work by checking if he had his flu shot, checking if he is still alive, and so on. You can't check for, or even think of, every possible way he might fail to show up. So if find that you have to check those kinds of things, then it's not guaranteed and relying on it is broken. Period.
You cannot make reliable code by saying "the CPU doesn't do anything that can break this, so it's okay". You can make reliable code by saying "I make sure my code doesn't rely on anything that isn't guaranteed by the relevant standards."
You are provided with all the tools you need to do the job, including memory barriers, atomic operations, mutexes, and so on. Please use them.
You are not clever enough to think of every way something not guaranteed to work might fail. And you have a plethora of things that are guaranteed to work. Fix this code, and if possible, have a talk with the person who wrote it about using proper synchronization.
This sounds a bit ranty, and I apologize for that. But I've seen too much code that used "tricks" like this that worked perfectly on the test machines but then broke when a new CPU came out, a new compiler, or a new version of the OS. Fixing code like this can be an incredible pain because these hacks hide the actual synchronization requirements. The right answer is almost always to code clearly and precisely what you actually want, rather than to assume that you'll get it because you don't know of any reason you won't.
This is valuable advice from painful experience.

The standard(s) are clear. If any thread may be modifying the object, all accesses, in all threads, must be synchronized, or you have undefined behavior.

The only portable solution for C++ is C++11 atomics, which is available in upcoming VS 2012.
As for C, I do not know if recent C standards bring some portable facilities, I am not following that, but as you are using Visal Studio, it does not matter anyway, as Microsoft is not implementing recent C standards.
Still, if you know you are developing for Visual Studio, you can rely on guarantees provided by this compiler, which apply to both C and C++. Some of them are implicit (accessing volatile variables implies also some memory barriers applied), some are explicit, like using _MemoryBarrier intrinsic.
The whole topic of the memory model is discussed in depth in Lockless Programming Considerations for Xbox 360 and Microsoft Windows, this should give you a good overview. Beware: the topic you are entering is full of hard topics and nasty surprises.
Note: Relying on volatile is not portable, but if you are using old C / C++ standards, there is no portable solution anyway, therefore be prepared to facing the need of reimplementing this for different platform should the need ever arise. When writing portable threaded code, volatile is considered almost useless:
For multi-threaded programming, there two key issues that volatile is often mistakenly thought to address:
atomicity
memory consistency, i.e. the order of a thread's operations as seen by another thread.

Related

How to find issues related to Data consistency in an Embedded C code base?

Let me explain what I mean by data consistency issue. Take following scenario for example
uint16 x,y;
x=0x01FF;
y=x;
Clearly, these variables are 16 bit but if an 8 bit CPU is used with this code, read or write operations would not be atomic. Thereby an interrupt can occur in between and change the value.This is one situation which MIGHT lead to data inconsistency.
Here's another example,
if(x>7) //x is global variable
{
switch(x)
{
case 8://do something
break;
case 10://do something
break;
default: //do default
}
}
In the above excerpt code, if an interrupt is changing the value of x from 8 to 5 after the if statement but before the switch statement,we end up in default case, instead of case 8.
Please note, I'm looking for ways to detect such scenarios (but not solutions)
Are there any tools that can detect such issues in Embedded C?
It is possible for a static analysis tool that is context (thread/interrupt) aware to determine the use of shared data, and that such a tool could recognise specific mechanisms to protect such data (or lack thereof).
One such tool is Polyspace Code Prover; it is very expensive and very complex, and does a lot more besides that described above. Specifically to quote (elided) from the whitepaper here:
With abstract interpretation the following program elements are interpreted in new ways:
[...]
Any global shared data may change at any time in a multitask program, except when protection
mechanisms, such as memory locks or critical sections, have been applied
[...]
It may have improved in the long time since I used it, but one issue I had was that it worked on a lock-access-unlock idiom, where you specified to the tool what the lock/unlock calls or macros were. The problem with that is that the C++ project I worked on used a smarter method where a locking object (mutex, scheduler-lock or interrupt disable for example) locked on instantiation (in the constructor) and unlocked in the destructor so that it unlocked automatically when the object went out of scope (a lock by scope idiom). This meant that the unlock was implicit and invisible to Polyspace. It could however at least identify all the shared data.
Another issue with the tool is that you must specify all thread and interrupt entry points for concurrency analysis, and in my case these were private-virtual functions in task and interrupt classes, again making them invisible to Polyspace. This was solved by conditionally making the entry-points public for the abstract analysis only, but meant that the code being tested does not have the exact semantics of the code to be run.
Of course these are non-problems for C code, and in my experience Polyspace is much more successfully applied to C in any case; you are far less likely to be writing code in a style to suit the tool rather than the tool working with your existing code-base.
There are no such tools as far as I am aware. And that is probably because you can't detect them.
Pretty much every operation in your C code has the potential to get interrupted before it is finished. Less obvious than the 16 bit scenario, there is also this:
uint8_t a, b;
...
a = b;
There is no guarantees that this is atomic! The above assignment may as well translate to multiple assembler instructions, such as 1) load a into register, 2) store register at memory address. You can't know this unless you disassemble the C code and check.
This can create very subtle bugs. Any assumption of the kind "as long as I use 8 bit variables on my 8 bit CPU, I can't get interrupted" is naive. Even if such code would result in atomic operations on a given CPU, the code is non-portable.
The only reliable, fully-portable solution is to use some manner of semaphore. On embedded systems, this could be as simple as a bool variable. Another solution is to use inline assembler, but that can't be ported cross platform.
To solve this, C11 introduced the qualifier _Atomic to the language. However, C11 support among embedded systems compilers is still mediocre.

Is there "compiler-friendly" code / convention [duplicate]

Many years ago, C compilers were not particularly smart. As a workaround K&R invented the register keyword, to hint to the compiler, that maybe it would be a good idea to keep this variable in an internal register. They also made the tertiary operator to help generate better code.
As time passed, the compilers matured. They became very smart in that their flow analysis allowing them to make better decisions about what values to hold in registers than you could possibly do. The register keyword became unimportant.
FORTRAN can be faster than C for some sorts of operations, due to alias issues. In theory with careful coding, one can get around this restriction to enable the optimizer to generate faster code.
What coding practices are available that may enable the compiler/optimizer to generate faster code?
Identifying the platform and compiler you use, would be appreciated.
Why does the technique seem to work?
Sample code is encouraged.
Here is a related question
[Edit] This question is not about the overall process to profile, and optimize. Assume that the program has been written correctly, compiled with full optimization, tested and put into production. There may be constructs in your code that prohibit the optimizer from doing the best job that it can. What can you do to refactor that will remove these prohibitions, and allow the optimizer to generate even faster code?
[Edit] Offset related link
Here's a coding practice to help the compiler create fast code—any language, any platform, any compiler, any problem:
Do not use any clever tricks which force, or even encourage, the compiler to lay variables out in memory (including cache and registers) as you think best. First write a program which is correct and maintainable.
Next, profile your code.
Then, and only then, you might want to start investigating the effects of telling the compiler how to use memory. Make 1 change at a time and measure its impact.
Expect to be disappointed and to have to work very hard indeed for small performance improvements. Modern compilers for mature languages such as Fortran and C are very, very good. If you read an account of a 'trick' to get better performance out of code, bear in mind that the compiler writers have also read about it and, if it is worth doing, probably implemented it. They probably wrote what you read in the first place.
Write to local variables and not output arguments! This can be a huge help for getting around aliasing slowdowns. For example, if your code looks like
void DoSomething(const Foo& foo1, const Foo* foo2, int numFoo, Foo& barOut)
{
for (int i=0; i<numFoo, i++)
{
barOut.munge(foo1, foo2[i]);
}
}
the compiler doesn't know that foo1 != barOut, and thus has to reload foo1 each time through the loop. It also can't read foo2[i] until the write to barOut is finished. You could start messing around with restricted pointers, but it's just as effective (and much clearer) to do this:
void DoSomethingFaster(const Foo& foo1, const Foo* foo2, int numFoo, Foo& barOut)
{
Foo barTemp = barOut;
for (int i=0; i<numFoo, i++)
{
barTemp.munge(foo1, foo2[i]);
}
barOut = barTemp;
}
It sounds silly, but the compiler can be much smarter dealing with the local variable, since it can't possibly overlap in memory with any of the arguments. This can help you avoid the dreaded load-hit-store (mentioned by Francis Boivin in this thread).
The order you traverse memory can have profound impacts on performance and compilers aren't really good at figuring that out and fixing it. You have to be conscientious of cache locality concerns when you write code if you care about performance. For example two-dimensional arrays in C are allocated in row-major format. Traversing arrays in column major format will tend to make you have more cache misses and make your program more memory bound than processor bound:
#define N 1000000;
int matrix[N][N] = { ... };
//awesomely fast
long sum = 0;
for(int i = 0; i < N; i++){
for(int j = 0; j < N; j++){
sum += matrix[i][j];
}
}
//painfully slow
long sum = 0;
for(int i = 0; i < N; i++){
for(int j = 0; j < N; j++){
sum += matrix[j][i];
}
}
Generic Optimizations
Here as some of my favorite optimizations. I have actually increased execution times and reduced program sizes by using these.
Declare small functions as inline or macros
Each call to a function (or method) incurs overhead, such as pushing variables onto the stack. Some functions may incur an overhead on return as well. An inefficient function or method has fewer statements in its content than the combined overhead. These are good candidates for inlining, whether it be as #define macros or inline functions. (Yes, I know inline is only a suggestion, but in this case I consider it as a reminder to the compiler.)
Remove dead and redundant code
If the code isn't used or does not contribute to the program's result, get rid of it.
Simplify design of algorithms
I once removed a lot of assembly code and execution time from a program by writing down the algebraic equation it was calculating and then simplified the algebraic expression. The implementation of the simplified algebraic expression took up less room and time than the original function.
Loop Unrolling
Each loop has an overhead of incrementing and termination checking. To get an estimate of the performance factor, count the number of instructions in the overhead (minimum 3: increment, check, goto start of loop) and divide by the number of statements inside the loop. The lower the number the better.
Edit: provide an example of loop unrolling
Before:
unsigned int sum = 0;
for (size_t i; i < BYTES_TO_CHECKSUM; ++i)
{
sum += *buffer++;
}
After unrolling:
unsigned int sum = 0;
size_t i = 0;
**const size_t STATEMENTS_PER_LOOP = 8;**
for (i = 0; i < BYTES_TO_CHECKSUM; **i = i / STATEMENTS_PER_LOOP**)
{
sum += *buffer++; // 1
sum += *buffer++; // 2
sum += *buffer++; // 3
sum += *buffer++; // 4
sum += *buffer++; // 5
sum += *buffer++; // 6
sum += *buffer++; // 7
sum += *buffer++; // 8
}
// Handle the remainder:
for (; i < BYTES_TO_CHECKSUM; ++i)
{
sum += *buffer++;
}
In this advantage, a secondary benefit is gained: more statements are executed before the processor has to reload the instruction cache.
I've had amazing results when I unrolled a loop to 32 statements. This was one of the bottlenecks since the program had to calculate a checksum on a 2GB file. This optimization combined with block reading improved performance from 1 hour to 5 minutes. Loop unrolling provided excellent performance in assembly language too, my memcpy was a lot faster than the compiler's memcpy. -- T.M.
Reduction of if statements
Processors hate branches, or jumps, since it forces the processor to reload its queue of instructions.
Boolean Arithmetic (Edited: applied code format to code fragment, added example)
Convert if statements into boolean assignments. Some processors can conditionally execute instructions without branching:
bool status = true;
status = status && /* first test */;
status = status && /* second test */;
The short circuiting of the Logical AND operator (&&) prevents execution of the tests if the status is false.
Example:
struct Reader_Interface
{
virtual bool write(unsigned int value) = 0;
};
struct Rectangle
{
unsigned int origin_x;
unsigned int origin_y;
unsigned int height;
unsigned int width;
bool write(Reader_Interface * p_reader)
{
bool status = false;
if (p_reader)
{
status = p_reader->write(origin_x);
status = status && p_reader->write(origin_y);
status = status && p_reader->write(height);
status = status && p_reader->write(width);
}
return status;
};
Factor Variable Allocation outside of loops
If a variable is created on the fly inside a loop, move the creation / allocation to before the loop. In most instances, the variable doesn't need to be allocated during each iteration.
Factor constant expressions outside of loops
If a calculation or variable value does not depend on the loop index, move it outside (before) the loop.
I/O in blocks
Read and write data in large chunks (blocks). The bigger the better. For example, reading one octect at a time is less efficient than reading 1024 octets with one read.
Example:
static const char Menu_Text[] = "\n"
"1) Print\n"
"2) Insert new customer\n"
"3) Destroy\n"
"4) Launch Nasal Demons\n"
"Enter selection: ";
static const size_t Menu_Text_Length = sizeof(Menu_Text) - sizeof('\0');
//...
std::cout.write(Menu_Text, Menu_Text_Length);
The efficiency of this technique can be visually demonstrated. :-)
Don't use printf family for constant data
Constant data can be output using a block write. Formatted write will waste time scanning the text for formatting characters or processing formatting commands. See above code example.
Format to memory, then write
Format to a char array using multiple sprintf, then use fwrite. This also allows the data layout to be broken up into "constant sections" and variable sections. Think of mail-merge.
Declare constant text (string literals) as static const
When variables are declared without the static, some compilers may allocate space on the stack and copy the data from ROM. These are two unnecessary operations. This can be fixed by using the static prefix.
Lastly, Code like the compiler would
Sometimes, the compiler can optimize several small statements better than one complicated version. Also, writing code to help the compiler optimize helps too. If I want the compiler to use special block transfer instructions, I will write code that looks like it should use the special instructions.
The optimizer isn't really in control of the performance of your program, you are. Use appropriate algorithms and structures and profile, profile, profile.
That said, you shouldn't inner-loop on a small function from one file in another file, as that stops it from being inlined.
Avoid taking the address of a variable if possible. Asking for a pointer isn't "free" as it means the variable needs to be kept in memory. Even an array can be kept in registers if you avoid pointers — this is essential for vectorizing.
Which leads to the next point, read the ^#$# manual! GCC can vectorize plain C code if you sprinkle a __restrict__ here and an __attribute__( __aligned__ ) there. If you want something very specific from the optimizer, you might have to be specific.
On most modern processors, the biggest bottleneck is memory.
Aliasing: Load-Hit-Store can be devastating in a tight loop. If you're reading one memory location and writing to another and know that they are disjoint, carefully putting an alias keyword on the function parameters can really help the compiler generate faster code. However if the memory regions do overlap and you used 'alias', you're in for a good debugging session of undefined behaviors!
Cache-miss: Not really sure how you can help the compiler since it's mostly algorithmic, but there are intrinsics to prefetch memory.
Also don't try to convert floating point values to int and vice versa too much since they use different registers and converting from one type to another means calling the actual conversion instruction, writing the value to memory and reading it back in the proper register set.
The vast majority of code that people write will be I/O bound (I believe all the code I have written for money in the last 30 years has been so bound), so the activities of the optimiser for most folks will be academic.
However, I would remind people that for the code to be optimised you have to tell the compiler to to optimise it - lots of people (including me when I forget) post C++ benchmarks here that are meaningless without the optimiser being enabled.
use const correctness as much as possible in your code. It allows the compiler to optimize much better.
In this document are loads of other optimization tips: CPP optimizations (a bit old document though)
highlights:
use constructor initialization lists
use prefix operators
use explicit constructors
inline functions
avoid temporary objects
be aware of the cost of virtual functions
return objects via reference parameters
consider per class allocation
consider stl container allocators
the 'empty member' optimization
etc
Attempt to program using static single assignment as much as possible. SSA is exactly the same as what you end up with in most functional programming languages, and that's what most compilers convert your code to to do their optimizations because it's easier to work with. By doing this places where the compiler might get confused are brought to light. It also makes all but the worst register allocators work as good as the best register allocators, and allows you to debug more easily because you almost never have to wonder where a variable got it's value from as there was only one place it was assigned.
Avoid global variables.
When working with data by reference or pointer pull that into local variables, do your work, and then copy it back. (unless you have a good reason not to)
Make use of the almost free comparison against 0 that most processors give you when doing math or logic operations. You almost always get a flag for ==0 and <0, from which you can easily get 3 conditions:
x= f();
if(!x){
a();
} else if (x<0){
b();
} else {
c();
}
is almost always cheaper than testing for other constants.
Another trick is to use subtraction to eliminate one compare in range testing.
#define FOO_MIN 8
#define FOO_MAX 199
int good_foo(int foo) {
unsigned int bar = foo-FOO_MIN;
int rc = ((FOO_MAX-FOO_MIN) < bar) ? 1 : 0;
return rc;
}
This can very often avoid a jump in languages that do short circuiting on boolean expressions and avoids the compiler having to try to figure out how to handle keeping
up with the result of the first comparison while doing the second and then combining them.
This may look like it has the potential to use up an extra register, but it almost never does. Often you don't need foo anymore anyway, and if you do rc isn't used yet so it can go there.
When using the string functions in c (strcpy, memcpy, ...) remember what they return -- the destination! You can often get better code by 'forgetting' your copy of the pointer to destination and just grab it back from the return of these functions.
Never overlook the oppurtunity to return exactly the same thing the last function you called returned. Compilers are not so great at picking up that:
foo_t * make_foo(int a, int b, int c) {
foo_t * x = malloc(sizeof(foo));
if (!x) {
// return NULL;
return x; // x is NULL, already in the register used for returns, so duh
}
x->a= a;
x->b = b;
x->c = c;
return x;
}
Of course, you could reverse the logic on that if and only have one return point.
(tricks I recalled later)
Declaring functions as static when you can is always a good idea. If the compiler can prove to itself that it has accounted for every caller of a particular function then it can break the calling conventions for that function in the name of optimization. Compilers can often avoid moving parameters into registers or stack positions that called functions usually expect their parameters to be in (it has to deviate in both the called function and the location of all callers to do this). The compiler can also often take advantage of knowing what memory and registers the called function will need and avoid generating code to preserve variable values that are in registers or memory locations that the called function doesn't disturb. This works particularly well when there are few calls to a function. This gets much of the benifit of inlining code, but without actually inlining.
I wrote an optimizing C compiler and here are some very useful things to consider:
Make most functions static. This allows interprocedural constant propagation and alias analysis to do its job, otherwise the compiler needs to presume that the function can be called from outside the translation unit with completely unknown values for the paramters. If you look at the well-known open-source libraries they all mark functions static except the ones that really need to be extern.
If global variables are used, mark them static and constant if possible. If they are initialized once (read-only), it's better to use an initializer list like static const int VAL[] = {1,2,3,4}, otherwise the compiler might not discover that the variables are actually initialized constants and will fail to replace loads from the variable with the constants.
NEVER use a goto to the inside of a loop, the loop will not be recognized anymore by most compilers and none of the most important optimizations will be applied.
Use pointer parameters only if necessary, and mark them restrict if possible. This helps alias analysis a lot because the programmer guarantees there is no alias (the interprocedural alias analysis is usually very primitive). Very small struct objects should be passed by value, not by reference.
Use arrays instead of pointers whenever possible, especially inside loops (a[i]). An array usually offers more information for alias analysis and after some optimizations the same code will be generated anyway (search for loop strength reduction if curious). This also increases the chance for loop-invariant code motion to be applied.
Try to hoist outside the loop calls to large functions or external functions that don't have side-effects (don't depend on the current loop iteration). Small functions are in many cases inlined or converted to intrinsics that are easy to hoist, but large functions might seem for the compiler to have side-effects when they actually don't. Side-effects for external functions are completely unknown, with the exception of some functions from the standard library which are sometimes modeled by some compilers, making loop-invariant code motion possible.
When writing tests with multiple conditions place the most likely one first. if(a || b || c) should be if(b || a || c) if b is more likely to be true than the others. Compilers usually don't know anything about the possible values of the conditions and which branches are taken more (they could be known by using profile information, but few programmers use it).
Using a switch is faster than doing a test like if(a || b || ... || z). Check first if your compiler does this automatically, some do and it's more readable to have the if though.
In the case of embedded systems and code written in C/C++, I try and avoid dynamic memory allocation as much as possible. The main reason I do this is not necessarily performance but this rule of thumb does have performance implications.
Algorithms used to manage the heap are notoriously slow in some platforms (e.g., vxworks). Even worse, the time that it takes to return from a call to malloc is highly dependent on the current state of the heap. Therefore, any function that calls malloc is going to take a performance hit that cannot be easily accounted for. That performance hit may be minimal if the heap is still clean but after that device runs for a while the heap can become fragmented. The calls are going to take longer and you cannot easily calculate how performance will degrade over time. You cannot really produce a worse case estimate. The optimizer cannot provide you with any help in this case either. To make matters even worse, if the heap becomes too heavily fragmented, the calls will start failing altogether. The solution is to use memory pools (e.g., glib slices ) instead of the heap. The allocation calls are going to be much faster and deterministic if you do it right.
A dumb little tip, but one that will save you some microscopic amounts of speed and code.
Always pass function arguments in the same order.
If you have f_1(x, y, z) which calls f_2, declare f_2 as f_2(x, y, z). Do not declare it as f_2(x, z, y).
The reason for this is that C/C++ platform ABI (AKA calling convention) promises to pass arguments in particular registers and stack locations. When the arguments are already in the correct registers then it does not have to move them around.
While reading disassembled code I've seen some ridiculous register shuffling because people didn't follow this rule.
Two coding technics I didn't saw in the above list:
Bypass linker by writing code as an unique source
While separate compilation is really nice for compiling time, it is very bad when you speak of optimization. Basically the compiler can't optimize beyond compilation unit, that is linker reserved domain.
But if you design well your program you can can also compile it through an unique common source. That is instead of compiling unit1.c and unit2.c then link both objects, compile all.c that merely #include unit1.c and unit2.c. Thus you will benefit from all the compiler optimizations.
It's very like writing headers only programs in C++ (and even easier to do in C).
This technique is easy enough if you write your program to enable it from the beginning, but you must also be aware it change part of C semantic and you can meet some problems like static variables or macro collision. For most programs it's easy enough to overcome the small problems that occurs. Also be aware that compiling as an unique source is way slower and may takes huge amount of memory (usually not a problem with modern systems).
Using this simple technique I happened to make some programs I wrote ten times faster!
Like the register keyword, this trick could also become obsolete soon. Optimizing through linker begin to be supported by compilers gcc: Link time optimization.
Separate atomic tasks in loops
This one is more tricky. It's about interaction between algorithm design and the way optimizer manage cache and register allocation. Quite often programs have to loop over some data structure and for each item perform some actions. Quite often the actions performed can be splitted between two logically independent tasks. If that is the case you can write exactly the same program with two loops on the same boundary performing exactly one task. In some case writing it this way can be faster than the unique loop (details are more complex, but an explanation can be that with the simple task case all variables can be kept in processor registers and with the more complex one it's not possible and some registers must be written to memory and read back later and the cost is higher than additional flow control).
Be careful with this one (profile performances using this trick or not) as like using register it may as well give lesser performances than improved ones.
I've actually seen this done in SQLite and they claim it results in performance boosts ~5%: Put all your code in one file or use the preprocessor to do the equivalent to this. This way the optimizer will have access to the entire program and can do more interprocedural optimizations.
Most modern compilers should do a good job speeding up tail recursion, because the function calls can be optimized out.
Example:
int fac2(int x, int cur) {
if (x == 1) return cur;
return fac2(x - 1, cur * x);
}
int fac(int x) {
return fac2(x, 1);
}
Of course this example doesn't have any bounds checking.
Late Edit
While I have no direct knowledge of the code; it seems clear that the requirements of using CTEs on SQL Server were specifically designed so that it can optimize via tail-end recursion.
Don't do the same work over and over again!
A common antipattern that I see goes along these lines:
void Function()
{
MySingleton::GetInstance()->GetAggregatedObject()->DoSomething();
MySingleton::GetInstance()->GetAggregatedObject()->DoSomethingElse();
MySingleton::GetInstance()->GetAggregatedObject()->DoSomethingCool();
MySingleton::GetInstance()->GetAggregatedObject()->DoSomethingReallyNeat();
MySingleton::GetInstance()->GetAggregatedObject()->DoSomethingYetAgain();
}
The compiler actually has to call all of those functions all of the time. Assuming you, the programmer, knows that the aggregated object isn't changing over the course of these calls, for the love of all that is holy...
void Function()
{
MySingleton* s = MySingleton::GetInstance();
AggregatedObject* ao = s->GetAggregatedObject();
ao->DoSomething();
ao->DoSomethingElse();
ao->DoSomethingCool();
ao->DoSomethingReallyNeat();
ao->DoSomethingYetAgain();
}
In the case of the singleton getter the calls may not be too costly, but it is certainly a cost (typically, "check to see if the object has been created, if it hasn't, create it, then return it). The more complicated this chain of getters becomes, the more wasted time we'll have.
Use the most local scope possible for all variable declarations.
Use const whenever possible
Dont use register unless you plan to profile both with and without it
The first 2 of these, especially #1 one help the optimizer analyze the code. It will especially help it to make good choices about what variables to keep in registers.
Blindly using the register keyword is as likely to help as hurt your optimization, It's just too hard to know what will matter until you look at the assembly output or profile.
There are other things that matter to getting good performance out of code; designing your data structures to maximize cache coherency for instance. But the question was about the optimizer.
Align your data to native/natural boundaries.
I was reminded of something that I encountered once, where the symptom was simply that we were running out of memory, but the result was substantially increased performance (as well as huge reductions in memory footprint).
The problem in this case was that the software we were using made tons of little allocations. Like, allocating four bytes here, six bytes there, etc. A lot of little objects, too, running in the 8-12 byte range. The problem wasn't so much that the program needed lots of little things, it's that it allocated lots of little things individually, which bloated each allocation out to (on this particular platform) 32 bytes.
Part of the solution was to put together an Alexandrescu-style small object pool, but extend it so I could allocate arrays of small objects as well as individual items. This helped immensely in performance as well since more items fit in the cache at any one time.
The other part of the solution was to replace the rampant use of manually-managed char* members with an SSO (small-string optimization) string. The minimum allocation being 32 bytes, I built a string class that had an embedded 28-character buffer behind a char*, so 95% of our strings didn't need to do an additional allocation (and then I manually replaced almost every appearance of char* in this library with this new class, that was fun or not). This helped a ton with memory fragmentation as well, which then increased the locality of reference for other pointed-to objects, and similarly there were performance gains.
A neat technique I learned from #MSalters comment on this answer allows compilers to do copy elision even when returning different objects according to some condition:
// before
BigObject a, b;
if(condition)
return a;
else
return b;
// after
BigObject a, b;
if(condition)
swap(a,b);
return a;
If you've got small functions you call repeatedly, i have in the past got large gains by putting them in headers as "static inline". Function calls on the ix86 are surprisingly expensive.
Reimplementing recursive functions in a non-recursive way using an explicit stack can also gain a lot, but then you really are in the realm of development time vs gain.
Here's my second piece of optimisation advice. As with my first piece of advice this is general purpose, not language or processor specific.
Read the compiler manual thoroughly and understand what it is telling you. Use the compiler to its utmost.
I agree with one or two of the other respondents who have identified selecting the right algorithm as critical to squeezing performance out of a program. Beyond that the rate of return (measured in code execution improvement) on the time you invest in using the compiler is far higher than the rate of return in tweaking the code.
Yes, compiler writers are not from a race of coding giants and compilers contain mistakes and what should, according to the manual and according to compiler theory, make things faster sometimes makes things slower. That's why you have to take one step at a time and measure before- and after-tweak performance.
And yes, ultimately, you might be faced with a combinatorial explosion of compiler flags so you need to have a script or two to run make with various compiler flags, queue the jobs on the large cluster and gather the run time statistics. If it's just you and Visual Studio on a PC you will run out of interest long before you have tried enough combinations of enough compiler flags.
Regards
Mark
When I first pick up a piece of code I can usually get a factor of 1.4 -- 2.0 times more performance (ie the new version of the code runs in 1/1.4 or 1/2 of the time of the old version) within a day or two by fiddling with compiler flags. Granted, that may be a comment on the lack of compiler savvy among the scientists who originate much of the code I work on, rather than a symptom of my excellence. Having set the compiler flags to max (and it's rarely just -O3) it can take months of hard work to get another factor of 1.05 or 1.1
When DEC came out with its alpha processors, there was a recommendation to keep the number of arguments to a function under 7, as the compiler would always try to put up to 6 arguments in registers automatically.
For performance, focus first on writing maintenable code - componentized, loosely coupled, etc, so when you have to isolate a part either to rewrite, optimize or simply profile, you can do it without much effort.
Optimizer will help your program's performance marginally.
You're getting good answers here, but they assume your program is pretty close to optimal to begin with, and you say
Assume that the program has been
written correctly, compiled with full
optimization, tested and put into
production.
In my experience, a program may be written correctly, but that does not mean it is near optimal. It takes extra work to get to that point.
If I can give an example, this answer shows how a perfectly reasonable-looking program was made over 40 times faster by macro-optimization. Big speedups can't be done in every program as first written, but in many (except for very small programs), it can, in my experience.
After that is done, micro-optimization (of the hot-spots) can give you a good payoff.
i use intel compiler. on both Windows and Linux.
when more or less done i profile the code. then hang on the hotspots and trying to change the code to allow compiler make a better job.
if a code is a computational one and contain a lot of loops - vectorization report in intel compiler is very helpful - look for 'vec-report' in help.
so the main idea - polish the performance critical code. as for the rest - priority to be correct and maintainable - short functions, clear code that could be understood 1 year later.
One optimization i have used in C++ is creating a constructor that does nothing. One must manually call an init() in order to put the object into a working state.
This has benefit in the case where I need a large vector of these classes.
I call reserve() to allocate the space for the vector, but the constructor does not actually touch the page of memory the object is on. So I have spent some address space, but not actually consumed a lot of physical memory. I avoid the page faults associated the associated construction costs.
As i generate objects to fill the vector, I set them using init(). This limits my total page faults, and avoids the need to resize() the vector while filling it.
One thing I've done is try to keep expensive actions to places where the user might expect the program to delay a bit. Overall performance is related to responsiveness, but isn't quite the same, and for many things responsiveness is the more important part of performance.
The last time I really had to do improvements in overall performance, I kept an eye out for suboptimal algorithms, and looked for places that were likely to have cache problems. I profiled and measured performance first, and again after each change. Then the company collapsed, but it was interesting and instructive work anyway.
I have long suspected, but never proved that declaring arrays so that they hold a power of 2, as the number of elements, enables the optimizer to do a strength reduction by replacing a multiply by a shift by a number of bits, when looking up individual elements.
Put small and/or frequently called functions at the top of the source file. That makes it easier for the compiler to find opportunities for inlining.

Does memory dependence speculation prevent BN_consttime_swap from being constant-time?

Context
The function BN_consttime_swap in OpenSSL is a thing of beauty. In this snippet, condition has been computed as 0 or (BN_ULONG)-1:
#define BN_CONSTTIME_SWAP(ind) \
do { \
t = (a->d[ind] ^ b->d[ind]) & condition; \
a->d[ind] ^= t; \
b->d[ind] ^= t; \
} while (0)
…
BN_CONSTTIME_SWAP(9);
…
BN_CONSTTIME_SWAP(8);
…
BN_CONSTTIME_SWAP(7);
The intention is that so as to ensure that higher-level bignum operations take constant time, this function either swaps two bignums or leaves them in place in constant time. When it leaves them in place, it actually reads each word of each bignum, computes a new word that is identical to the old word, and write that result back to the original location.
The intention is that this will take the same time as if the bignums had effectively been swapped.
In this question, I assume a modern, widespread architecture such as those described by Agner Fog in his optimization manuals. Straightforward translation of the C code to assembly (without the C compiler undoing the efforts of the programmer) is also assumed.
Question
I am trying to understand whether the construct above characterizes as a “best effort” sort of constant-time execution, or as perfect constant-time execution.
In particular, I am concerned about the scenario where bignum a is already in the L1 data cache when the function BN_consttime_swap is called, and the code just after the function returns start working on the bignum a right away. On a modern processor, enough instructions can be in-flight at the same time for the copy not to be technically finished when the bignum a is used. The mechanism allowing the instructions after the call to BN_consttime_swap to work on a is memory dependence speculation. Let us assume naive memory dependence speculation for the sake of the argument.
What the question seems to boil down to is this:
When the processor finally detects that the code after BN_consttime_swap read from memory that had, contrary to speculation, been written to inside the function, does it cancel the speculative execution as soon as it detects that the address had been written to, or does it allow itself to keep it when it detects that the value that has been written is the same as the value that was already there?
In the first case, BN_consttime_swap looks like it implements perfect constant-time. In the second case, it is only best-effort constant-time: if the bignums were not swapped, execution of the code that comes after the call to BN_consttime_swap will be measurably faster than if they had been swapped.
Even in the second case, this is something that looks like it could be fixed for the foreseeable future (as long as processors remain naive enough) by, for each word of each of the two bignums, writing a value different from the two possible final values before writing either the old value again or the new value. The volatile type qualifier may need to be involved at some point to prevent an ordinary compiler to over-optimize the sequence, but it still sounds possible.
NOTE: I know about store forwarding, but store forwarding is only a shortcut. It does not prevent a read being executed before the write it is supposed to come after. And in some circumstances it fails, although one would not expect it to in this case.
Straightforward translation of the C code to assembly (without the C compiler undoing the efforts of the programmer) is also assumed.
I know it's not the thrust of your question, and I know that you know this, but I need to rant for a minute. This does not even qualify as a "best effort" attempt to provide constant-time execution. A compiler is licensed to check the value of condition, and skip the whole thing if condition is zero. Obfuscating the setting of condition makes this less likely to happen, but is no guarantee.
Purportedly "constant-time" code should not be written in C, full stop. Even if it is constant time today, on the compilers that you test, a smarter compiler will come along and defeat you. One of your users will use this compiler before you do, and they will not be aware of the risk to which you have exposed them. There are exactly three ways to achieve constant time that I am aware of: dedicated hardware, assembly, or a DSL that generates machine code plus a proof of constant-time execution.
Rant aside, on to the actual architecture question at hand: assuming a stupidly naive compiler, this code is constant time on the µarches with which I am familiar enough to evaluate the question, and I expect it to broadly be true for one simple reason: power. I expect that checking in a store queue or cache if a value being stored matches the value already present and conditionally short-circuiting the store or avoiding dirtying the cache line on every store consumes more energy than would be saved in the rare occasion that you get to avoid some work. However, I am not a CPU designer, and do not presume to speak on their behalf, so take this with several tablespoons of salt, and please consult one before assuming this to be true.
This blog post, and the comments made by the author, Henry, on the subject of this question should be considered as authoritative as anyone should allowed to expect. I will reproduce the latter here for archival:
I didn’t think the case of overwriting a memory location with the same value had a practical use. I think the answer is that in current processors, the value of the store is irrelevant, only the address is important.
Out here in academia, I’ve heard of two approaches to doing memory disambiguation: Address-based, or value-based. As far as I know, current processors all do address-based disambiguation.
I think the current microbenchmark has some evidence that the value isn’t relevant. Many of the cases involve repeatedly storing the same value into the same location (particularly those with offset = 0). These were not abnormally fast.
Address-based schemes uses a store queue and a load queue to track outstanding memory operations. Loads check the store queue to for an address match (Should this load do store-to-load forwarding instead of reading from cache?), while stores check the load queue (Did this store clobber the location of a later load I allowed to execute early?). These checks are based entirely on addresses (where a store and load collided). One advantage of this scheme is that it’s a fairly straightforward extension on top of store-to-load forwarding, since the store queue search is also used there.
Value-based schemes get rid of the associative search (i.e., faster, lower power, etc.), but requires a better predictor to do store-to-load forwarding (Now you have to guess whether and where to forward, rather than searching the SQ). These schemes check for ordering violations (and incorrect forwarding) by re-executing loads at commit time and checking whether their values are correct. In these schemes, if you have a conflicting store (or made some other mistake) that still resulted in the correct result value, it would not be detected as an ordering violation.
Could future processors move to value-based schemes? I suspect they might. They were proposed in the mid-2000s(?) to reduce the complexity of the memory execution hardware.
The idea behind constant-time implementation is not to actually perform everything in constant time. That will never happen on an out-of-order architecture.
The requirement is that no secret information can be revealed by timing analysis.
To prevent this there are basically two requirements:
a) Do not use anything secret as a stop condition for a loop, or as a predicate to a branch. Failing to do so will open you to a branch prediction attack https://eprint.iacr.org/2006/351.pdf
b) Do not use anything secret as an index to memory access. This leads to cache timing attacks http://www.daemonology.net/papers/htt.pdf
As for your code: assuming that your secret is "condition" and possibly the contents of a and b the code is perfectly constant time in the sense that its execution does not depend on the actual contents of a, b and condition. Of course the locality of a and b in memory will affect the execution time of the loop, but not the CONTENTS which are secret.
That is assuming of course condition was computed in a constant time manner.
As for C optimizations: the compiler can only optimize code based on information it knows. If "condition" is truly secret the compiler should not be able to discern it contents and optimize. If it can be deducted from your code then the compiler will most likely make optimization for the 0 case.

Is there any performance difference in using int versus int8_t

My main question is, Is there any difference between int and int8_t for execution time ?
In a framework I am working on, I often read code where some paramteres are set as int8_t in function because "that particular parameter cannot be outside the -126,125 range".
In many places, int8_t is used for communication protocol, or to cut a packet into many fields into a __attribute((packed)) struct.
But at some point, it was mainly put there because someone thought it would be better to use a type that match more closely the size of the data, probably think ahead of the compiler.
Given that the code is made to run on Linux, compiled with gcc using glibc, and that memory or portability is not an issue, I am wondering if it is actually a good idea, performance-wise.
My first impression comes from the rule "Trying to be smarter than the compiler is always a bad idea" (unless you know where and how you need to optimize).
However, I do not know if using int8_t is actually a cost for performance (more testing and computation to match the int8_t size, more operations are needed to ensure the variable do not go out of bounds, etc.), or if it does improve performance in some way.
I am not good at reading simple asm, so I did not compile a test code into asm to try to know which one is better.
I tried to find a related question, but all discussion I found on int<size>_t versus int is about portability rather than performance.
Thanks for your input. Assembly samples explained or sources about this issue would be greatly appreciated.
int is generally equivalent of the size of register on CPU. C standard says that any smaller types must be converted to int before using operators on them.
These conversions (sign extension) can be costly.
int8_t a=1, b=2, c=3;
...
a = b + c; // This will translate to: a = (int8_t)((int)b + (int)c);
If you need speed, int is a safe bet, or use int_fast8_t (even safer). If exact size is important, use int8_t (if available).
when you talk about code performance, there are several things you need to take into account which affect this:
CPU architecture, more to the point, which data types does the cpu support natively ( does it support 8 bit operations? 16 bit? 32 bit? etc...)
compiler, working with a well known compiler is not enough, you need to be familiar with it: they way you write your code influences the code it generates
data types and compiler intrinsics: these are always considered by the compiler when generating code, using the correct data type (even signed vs unsigned matters) can have a dramatic performance impact.
"Trying to be smarter than the compiler is always a bad idea" - that is not actually true; remember, the compiler is written to optimize the general case and you are interested in you particular case; it's always a good idea to try and be smarter than the compiler.
Your question is really too broad for me to give a "to the point" answer (i.e. what is better performance wise). The only way to know for sure is to check the generated assembly code; at least count the number of cycles the code would take to execute in both cases. But you need to understand the code to understand how to help the compiler.

How bad is it to abandon THE rule in C (aka: return 0 on success)?

in a current project I dared to do away with the old 0 rule, i.e. returning 0 on success of a function. How is this seen in the community? The logic that I am imposing on the code (and therefore on the co-workers and all subsequent maintenance programmers) is:
.>0: for any kind of success/fulfillment, that is, a positive outcome
==0: for signalling no progress or busy or unfinished, which is zero information about the outcome
<0: for any kind of error/infeasibility, that is, a negative outcome
Sitting in between a lot of hardware units with unpredictable response times in a realtime system, many of the functions need to convey exactly this ternary logic so I decided it being legitimate to throw the minimalistic standard return logic away, at the cost of a few WTF's on the programmers side.
Opininons?
PS: on a side note, the Roman empire collapsed because the Romans with their number system lacking the 0, never knew when their C functions succeeded!
"Your program should follow an existing convention if an existing convention makes sense for it."
Source: The GNU C Library
By deviating from such a widely known convention, you are creating a high level of technical debt. Every single programmer that works on the code will have to ask the same questions, every consumer of a function will need to be aware of the deviation from the standard.
http://en.wikipedia.org/wiki/Exit_status
I think you're overstating the status of this mythical "rule". Much more often, it's that a function returns a nonnegative value on success indicating a result of some sort (number of bytes written/read/converted, current position, size, next character value, etc.), and that negative values, which otherwise would make no sense for the interface, are reserved for signalling error conditions. On the other hand, some functions need to return unsigned results, but zero never makes sense as a valid result, and then zero is used to signal errors.
In short, do whatever makes sense in the application or library you are developing, but aim for consistency. And I mean consistency with external code too, not just your own code. If you're using third-party or library code that follows a particular convention and your code is designed to be closely coupled to that third-party code, it might make sense to follow that code's conventions so that other programmers working on the project don't get unwanted surprises.
And finally, as others have said, whatever your convention, document it!
It is fine as long as you document it well.
I think it ultimately depends on the customers of your code.
In my last system we used more or less the same coding system as yours, with "0" meaning "I did nothing at all" (e.g. calling Init() twice on an object). This worked perfectly well and everybody who worked on that system knew this was the convention.
However, if you are writing an API that can be sold to external customers, or writing a module that will be plugged into an existing, "standard-RC" system, I would advise you to stick to the 0-on-success rule, in order to avoid future confusion and possible pitfalls for other developers.
And as per your PS, when in Rome, do like the romans do :-)
I think you should follow the Principle Of Least Astonishment
The POLA states that, when two
elements of an interface conflict, or
are ambiguous, the behaviour should be
that which will least surprise the
user; in particular a programmer
should try to think of the behavior
that will least surprise someone who
uses the program, rather than that
behavior that is natural from knowing
the inner workings of the program.
If your code is for internal consumption only, you may get away with it, though. So it really depends on the people your code will impact :)
There is nothing wrong with doing it that way, assuming you document it in a way that ensures others know what you're doing.
However, as an alternative, if might be worth exploring the option to return an enumerated type defining the codes. Something like:
enum returnCode {
SUCCESS, FAILURE, NO_CHANGE
}
That way, it's much more obvious what your code is doing, self-documenting even. But might not be an option, depending on your code base.
It is a convention only. I have worked with many api that abandon the principle when they want to convey more information to the caller. As long as your consistent with this approach any experienced programmer will quickly pick up the standard. What is hard is when each function uses a different approach IE with win32 api.
In my opinion (and that's the opinion of someone who tends to do out-of-band error messaging thanks to working in Java), I'd say it is acceptable if your functions are of a kind that require strict return-value processing anyway.
So if the return value of your method has to be inspected at all points where it's called, then such a non-standard solution might be acceptable.
If, however, the return value might be ignored or just checked for success at some points, then the non-standard solution produces quite some problem (for example you can no longer use the if(!myFunction()) ohNoesError(); idiom.
What is your problem? It is just a convention, not a law. If your logic makes more sense for your application, then it is fine, as long as it is well documented and consistent.
On Unix, exit status is unsigned, so this approach won't work if you ever have to run your program there, and this will confuse all your Unix programmers to no end. (I looked it up just now to make sure, and discovered to my surprised that Windows uses a signed exit status.) So I guess it will probably only mostly confuse your Windows programmers. :-)
I'd find another method to pass status between processes. There are many to choose from, some quite simple. You say "at the cost of a few WTF's on the programmers side" as if that's a small cost, but it sounds like a huge cost to me. Re-using an int in C is a miniscule benefit to be gained from confusing other programmers.
You need to go on a case by case basis. Think about the API and what you need to return. If your function only needs to return success or failure, I'd say give it an explicit type of bool (C99 has a bool type now) and return true for success and false for failure. That way things like:
if (!doSomething())
{
// failure processing
}
read naturally.
In many cases, however, you want to return some data value, in which case some specific unused or unlikely to be used value must be used as the failure case. For example the Unix system call open() has to return a file descriptor. 0 is a valid file descriptor as is theoretically any positive number (up to the maximum a process is allowed), so -1 is chosen as the failure case.
In other cases, you need to return a pointer. NULL is an obvious choice for failure of pointer returning functions. This is because it is highly unlikely to be valid and on most systems can't even be dereferenced.
One of the most important considerations is whether the caller and the called function or program will be updated by the same person at any given time. If you are maintaining an API where a function will return the value to a caller written by someone who may not even have access to your source code, or when it is the return code from a program that will be called from a script, only violate conventions for very strong reasons.
You are talking about passing information across a boundary between different layers of abstraction. Violating the convention ties both the caller and the callee to a different protocol increasing the coupling between them. If the different convention is fundamental to what you are communicating, you can do it. If, on the other hand, it is exposing the internals of the callee to the caller, consider whether you can hide the information.

Resources