Manual optimization in the past (C Language) [closed] - c

Closed. This question needs details or clarity. It is not currently accepting answers.
Closed 8 years ago.
Back in the 70s, when C had just started, I guess compiler-level optimization wasn't as advanced as in modern compilers (clang, gcc, etc.), and the computers themselves were limited hardware-wise. Was it common to prefer optimizations at the source code level over readability?
Example:
int arrayOfItems[30]; // Global variable
int GetItemAt(int index)
{
return arrayOfItems[index];
}
int main()
{
// Code
// ... arrayOfItems initialized somewhere
// More code
GetSomethingByItem(GetItemAt(4)); // Get at index 4
return 0;
}
Now this can be optimized to this:
int arrayOfItems[30]; // Global variable
int main()
{
// Code
// ... arrayOfItems initialized somewhere
// More code
GetSomethingByItem(arrayOfItems[4]); // Get at index 4
return 0;
}
This completely omits the function GetItemAt and thus saves time by accessing the value straight from its address instead of entering a function, creating a stack frame, accessing the value and pushing the result to some register. Did people prefer to write the second, 'optimized' version straight into the source code, or did they use the first version so the code would be more readable?
I know that in this example you can use the preprocessor to "mimic" this optimization (e.g. #define GetItemAt(x) arrayOfItems[x]), but you get my point.
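For reference, here is a minimal sketch of the two forms side by side: the macro mentioned above and a small function that a modern compiler can inline. The names ITEM_COUNT and GET_ITEM_AT are made up for this illustration; only arrayOfItems and GetItemAt come from the question.

```c
#include <assert.h>

#define ITEM_COUNT 30                   /* hypothetical name for the array size */
static int arrayOfItems[ITEM_COUNT];    /* same global as in the question */

/* Macro version: plain textual substitution, so no call is ever emitted. */
#define GET_ITEM_AT(x) (arrayOfItems[(x)])

/* Function version: readable and type-checked. A compiler that inlines it
   ends up with the same array access as the macro. */
static inline int GetItemAt(int index)
{
    assert(index >= 0 && index < ITEM_COUNT);   /* bounds check, debug builds only */
    return arrayOfItems[index];
}
```

Both GET_ITEM_AT(4) and GetItemAt(4) read the same element. Whether the function call actually disappears depends on the compiler and its optimization settings.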
Also, maybe this exact optimization feature was present from the start, if so, I should find another example, suggestions are welcome.
TL;DR -
Was it common in the past to prefer optimizations at the source code level over readability?
Bonus question:
Are there optimizations that are included only so the source code can be more readable?

I don't think many developers have ever preferred optimization over readability, but sometimes it might be argued that there were optimizations that harmed readability but were necessary for performance. Something like Duff's Device (a loop unrolling optimization).
From
do { /* count > 0 assumed */
*to = *from++; /* "to" pointer is NOT incremented, see explanation below */
} while(--count > 0);
to
register n = (count + 7) / 8;
switch(count % 8) {
case 0: do { *to = *from++;
case 7: *to = *from++;
case 6: *to = *from++;
case 5: *to = *from++;
case 4: *to = *from++;
case 3: *to = *from++;
case 2: *to = *from++;
case 1: *to = *from++;
} while(--n > 0);
}
Of course, it turns out that compilers got smarter, and it has been reported on the LKML that removing Duff's Device improved performance and reduced memory usage. From the linked Wikipedia article:
For the purpose of memory-to-memory copies (which was not the original use of Duff's device, although it can be modified to serve this purpose as described in section below), the standard C library provides function memcpy; it will not perform worse than a memory-to-memory copy version of this code, and may contain architecture-specific optimizations that will make it significantly faster
and from the LKML (in 2000)
... this effect in the X server.
It turns out that with branch predictions and the relative speed of CPU
vs. memory changing over the past decade, loop unrolling is pretty much
pointless. In fact, by eliminating all instances of Duff's Device from
the XFree86 4.0 server, the server shrunk in size by half a megabyte, and was faster to boot, because the elimination of all
that excess code meant that the X server wasn't thrashing the cache
lines as much.
As for optimizations that only improve readability, it would require that your code first be unreadable. Then anything that makes it more readable would seem to qualify. Finally, remember that premature optimization is the root of all evil.

Related

Assembly code's length can indicate execution speed?

I'm learning C, consider the following code snippet:
#include <stdio.h>
int main(void) {
int fahr;
float calc;
for (fahr = 300; fahr >= 0; fahr = fahr - 20) {
calc = (5.0 / 9.0) * (fahr - 32);
printf("%3d %6.1f\n", fahr, calc);
}
return 0;
}
This prints a Fahrenheit-to-Celsius conversion table from 300 down to 0. I compile this with:
$ clang -std=c11 -Wall -g -O3 -march=native main.c -o main
I also generate assembly code with this command:
$ clang -std=c11 -Wall -S -masm=intel -O3 -march=native main.c -o main
This generates a 1.26 KB file with 71 lines.
I slightly edited the code and moved the logic into another function, which is called from main():
#include <stdio.h>
void foo(void) {
int fahr;
float calc;
for (fahr = 300; fahr >= 0; fahr = fahr - 20) {
calc = (5.0 / 9.0) * (fahr - 32);
printf("%3d %6.1f\n", fahr, calc);
}
}
int main(void) {
foo();
return 0;
}
This generates 2.33 KB of assembly code with 128 lines.
Running both programs with time ./main I see no difference in execution speed.
My question is: does it make any sense to try to optimize C programs by the length of their assembly code?
It seems that you are comparing the sizes of the assembly files generated by the compiler. Since that obviously makes no sense, I'll pretend you were comparing the binary sizes of two compiler-generated code snippets.
While, all other conditions being the same, a shorter code size may give an increase in speed (due to a higher code density), in general x86 CPUs are complex enough to require a decoupling between optimizations for code size and optimizations for code speed.
Specifically, if you aim at code speed you should optimize for... code speed. Sometimes this requires choosing the shortest snippet, sometimes it doesn't.
Consider the classic example of compiler optimization, multiplication by powers of two:
int i = 4;
i = i * 8;
This may be badly translated as:
;NO optimizations at all
mov eax, 4      ;i = 4          B804000000      0-1 clocks
imul eax, 8     ;i = i * 8      6BC008          3 clocks
;eax = i                        8 bytes total   3-4 clocks total
;Slightly optimized
;4*8 gives no sign issue, we can use shl
mov eax, 4      ;i = 4          B804000000      0-1 clocks
shl eax, 3      ;i = i * 8      C1E003          1 clock
;eax = i                        8 bytes total   1-2 clocks total
Both snippets have the same code length, but the second runs nearly twice as fast.
This is a very basic example [1], where there is not even much need to take the micro-architecture into account.
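At the C level the same equivalence can be sketched as two functions that return the same value. The function names are invented for this illustration, and the shift version is restricted to non-negative inputs that do not overflow, which is the "no sign issue" condition noted in the listing.

```c
#include <assert.h>
#include <limits.h>

/* What the programmer writes. */
static int times_eight_mul(int i)
{
    return i * 8;
}

/* What an optimizing compiler may emit instead: a left shift by 3.
   Only equivalent here for 0 <= i <= INT_MAX / 8. */
static int times_eight_shift(int i)
{
    assert(i >= 0 && i <= INT_MAX / 8);
    return i << 3;
}
```

In practice there is rarely a reason to write the shift by hand, since compilers typically perform this strength reduction themselves.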
Another, more subtle example is the following, taken from Agner Fog's discussion of partial register stalls [2]:
;Version A                      Version B
mov al, byte ptr [mem8]         movzx ebx, byte ptr [mem8]
mov ebx, eax                    and eax, 0ffffff00h
                                or ebx, eax
;7 bytes                        14 bytes
Both versions give the same result but Version B is 5-6 clocks faster than Version A despite the former being twice the size of the latter.
The answer is then no, code size is not enough; it may be a tie-breaker though.
If you really are interested in optimizing assembly, you will enjoy these two readings:
Agner Fog's classics.
Intel optimization manual
The first link also has a manual on optimizing C and C++ code.
If you write in C, remember that the most impactful optimizations are 1) How data is represented/stored, i.e. Data structures 2) How data is processed, i.e. Algorithms.
These are the macro optimizations.
Taking the generated assembly into account is shifting into micro optimization, and there the most useful tools are 1) A smart compiler 2) A good set of intrinsics [3].
[1] So simple that it is optimized out in practice.
[2] Maybe a little obsolete now, but it serves the purpose.
[3] Built-in, non-standard functions that translate into specific assembly instructions.
As always, the answer is "it depends". Sometimes making code longer makes it more efficient: for example, the CPU doesn't have to waste extra instructions jumping after every loop iteration. A classic example (literally 'classic': 1983!) is "Duff's Device". The following code
register short *to, *from;
register count;
{
do { /* count > 0 assumed */
*to = *from++;
} while(--count > 0);
}
was made much faster by using this much larger, and more complicated, code:
register short *to, *from;
register count;
{
register n = (count + 7) / 8;
switch (count % 8) {
case 0: do { *to = *from++;
case 7: *to = *from++;
case 6: *to = *from++;
case 5: *to = *from++;
case 4: *to = *from++;
case 3: *to = *from++;
case 2: *to = *from++;
case 1: *to = *from++;
} while (--n > 0);
}
}
But that can be taken to extremes: making code too large increases cache misses and all sorts of other issues. In short: "premature optimisation is evil" - you need to test your before-and-after, and often on multiple platforms, before deciding that it is a good idea.
And I'll ask you: is the second version of the above code "better" than the first version? It's less readable, less maintainable, and far more complex than the code it replaces.
The code that actually runs is the same in both cases, after inlining. The 2nd way is bigger because it also has to emit a stand-alone definition of the function, not inlined into main.
You would have avoided this if you'd used static on the function, so the compiler would know that nothing could call it from outside the compilation unit, and thus a stand-alone definition wasn't needed if it was inlined into its only caller.
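A sketch of that change, applied to the foo() from the question: the only differences are the static keyword and a row counter that is returned so the result can be checked. The counter and the assert are additions for this illustration, not part of the original program.

```c
#include <assert.h>
#include <stdio.h>

/* static gives foo internal linkage: no other translation unit can call it,
   so once it is inlined into its only caller the compiler may drop the
   stand-alone copy. Returns the number of rows printed. */
static int foo(void)
{
    int rows = 0;
    for (int fahr = 300; fahr >= 0; fahr -= 20) {
        float calc = (5.0f / 9.0f) * (float)(fahr - 32);
        printf("%3d %6.1f\n", fahr, calc);
        rows++;
    }
    assert(rows == 16);   /* 300, 280, ..., 0 */
    return rows;
}
```

Calling foo() from main() as before should then give assembly output much closer in size to the single-function version, though the exact line count depends on the compiler and flags.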
Also, most of the .s lines in compiler output are comments or assembler directives, not instructions. So you're not even counting instructions.
The Godbolt compiler explorer is a good way to look at compiler asm output, with just the instructions and actually-used labels. Have a look at your code there.
Counting the total number of instructions in the executable is totally bogus if there are loops or branches. Or especially function calls inside loops, like in this case. Dynamic instruction count (how many instructions actually ran, i.e. counting each time through loops and so on) is very roughly correlated with performance, but some code runs at 4 instructions per cycle, while some runs at well below 1 (e.g. lots of div or sqrt, cache misses, and/or branch mispredicts).
To learn more about what makes code run slow or fast, see the x86 tag wiki, especially Agner Fog's stuff.
I also recently wrote an answer to Deoptimizing a program for the pipeline in Intel Sandybridge-family CPUs. Thinking of diabolical ways to make a program run slower is a fun exercise.

For loop is not incrementing

I have been working on this code class today and assure you I have gone through it a number of times. For some reason whenever I set my breakpoints to determine the value of "channelsel" all I get is "0". I never get 1,2,3 or 4 (my MAXCHANNELS is 5).
I'm using: P18F45K22 microcontroller, and mplab c18.
Please take a look at the following code, and thank you in advance
int channelsel = 0;
for (channelsel = 0; channelsel < MAXCHANNELS; channelsel++)
{
switch(channelsel)
{
case 0:
SetChanADC(ADC_CH0);
break;
case 1:
SetChanADC(ADC_CH1);
break;
case 2:
SetChanADC(ADC_CH2);
break;
case 3:
SetChanADC(ADC_CH3);
break;
case 4:
SetChanADC(ADC_CH4);
break;
default:
SetChanADC(ADC_CH0);
break;
}
ConvertADC();
while(BusyADC() == TRUE) Delay1TCY();
sampledValue = ReadADC();
setCurrentTemperatureForChannel(channelsel, sampledValue);
sprintf (buf, "current Temp of channel %i is %x \n\r", channelsel, sampledValue);
puts1USART(buf);
Delay10KTCYx(10);
}
Declare channelsel as volatile
volatile int channelsel
It is quite likely that your compiler is optimizing away the rest of the statements so that they are not even in the disassembly. When dealing with values that update extremely fast, or with conditional statements that are in close proximity to the control value's declaration and assignment, volatile tells the compiler that we always want a fresh value for this variable and that it must not take any shortcuts. Variables that depend on IO should always be declared volatile, and cases like this are good candidates for its use. Compilers are all different and your mileage may vary.
If you are sure that your hardware is configured correctly, this would be my suggestion. If in doubt, please post your disassembled code for this segment.
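As a rough, host-side sketch of the suggestion (not C18-specific code), the loop below declares the counter volatile and records which channel numbers it visits. The visited array and the function name scan_channels are stand-ins invented for this example. The real ADC calls from the question are left out.

```c
#include <assert.h>

#define MAXCHANNELS 5   /* value stated in the question */

/* Hypothetical stand-in for the per-channel work (SetChanADC, ReadADC, ...):
   it only records which channel numbers the loop visited. */
static int visited[MAXCHANNELS];

static int scan_channels(void)
{
    volatile int channelsel;   /* volatile: every read and write must really happen */
    int count = 0;

    for (channelsel = 0; channelsel < MAXCHANNELS; channelsel++) {
        int ch = channelsel;   /* take one snapshot of the counter per iteration */
        assert(ch >= 0 && ch < MAXCHANNELS);
        visited[ch] = 1;
        count++;
    }
    return count;
}
```

With the counter kept in memory, a debugger breakpoint inside the loop should show it stepping from 0 to 4. Whether this resolves the behaviour on the actual PIC18 toolchain is not something this sketch can confirm.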
I have been working on the PIC18. It's a bug that my co-worker discovered: for loops don't work with the C18 compiler. If you change it to a while loop, it will work fine.

Performance of array of functions over if and switch statements

I am writing a very performance critical part of the code and I had this crazy idea about substituting case statements (or if statements) with array of function pointers.
Let me demonstrate; here goes the normal version:
while(statement)
{
/* 'option' changes on every iteration */
switch(option)
{
case 0: /* simple task */ break;
case 1: /* simple task */ break;
case 2: /* simple task */ break;
case 3: /* simple task */ break;
}
}
And here is the "callback function" version:
void task0(void) {
/* simple task */
}
void task1(void) {
/* simple task */
}
void task2(void) {
/* simple task */
}
void task3(void) {
/* simple task */
}
void (*task[4]) (void);
task[0] = task0;
task[1] = task1;
task[2] = task2;
task[3] = task3;
while(statement)
{
/* 'option' changes on every iteration */
/* and now we call the function with 'case' number */
(*task[option]) ();
}
So which version will be faster? Is the overhead of the function call eliminating speed benefit over normal switch (or if) statement?
Of course the latter version is not so readable, but I am looking for all the speed I can get.
I am about to benchmark this when I get things set up, but if someone has an answer already, I won't bother.
I think at the end of the day your switch statements will be the fastest, because function pointers have the "overhead" of the lookup of the function and the function call itself. A switch is just a straight jmp table. It of course depends on different things which only testing can give you an answer to. That's my two cents' worth.
The switch statement should be compiled into a branch table, which is essentially the same thing as your array of functions, if your compiler has at least basic optimization capability.
Which version will be faster depends. The naive implementation of switch is a huge if ... else if ... else if ... construction meaning it takes on average O(n) time to execute where n is the number of cases. Your jump table is O(1) so the more different cases there are and the more the later cases are used, the more likely the jump table is to be better. For a small number of cases or for switches where the first case is chosen more frequently than others, the naive implementation is better. The matter is complicated by the fact that the compiler may choose to use a jump table even when you have written a switch if it thinks that will be faster.
The only way to know which you should choose is to performance test your code.
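Before timing anything, it helps to have both forms in a shape where they can be checked against each other. The sketch below is one way to set that up. The task bodies are placeholders invented for this example, and unlike the question's void tasks they take and return an int so the results can be compared.

```c
#include <assert.h>

enum { OPTION_COUNT = 4 };

/* Placeholder "simple tasks"; the real ones would come from the application. */
static int task0(int x) { return x + 1; }
static int task1(int x) { return x * 2; }
static int task2(int x) { return x - 3; }
static int task3(int x) { return -x; }

/* Switch version: the compiler is free to pick a jump table or a compare chain. */
static int run_switch(int option, int x)
{
    switch (option) {
    case 0:  return x + 1;
    case 1:  return x * 2;
    case 2:  return x - 3;
    case 3:  return -x;
    default: return x;
    }
}

/* Function-pointer version: one indexed load followed by an indirect call. */
static int (*const task[OPTION_COUNT])(int) = { task0, task1, task2, task3 };

static int run_table(int option, int x)
{
    assert(option >= 0 && option < OPTION_COUNT);   /* no default case here */
    return task[option](x);
}
```

Once run_switch and run_table agree for every option, each can be called in the hot loop and timed under the real workload.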
First, I would randomly-pause it a few times, to make certain enough time is spent in this dispatching to even bother optimizing it.
Second, if it is, since each branch spends very few cycles, you want a jump table to get to the desired branch. The reason switch statements exist is to suggest to the compiler that it can generate one if the switch values are compact.
How long is the list of switch values? If it's short, the if-ladder could still be faster, especially if you put the most frequently used codes at the top. An alternative to an if-ladder (that I've never actually seen anyone use) is an if-tree, the code equivalent of a binary tree.
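For readers who have not seen an if-tree, here is a small sketch next to the equivalent if-ladder for four options. The function names and the return values (100 to 103) are arbitrary placeholders chosen for this example.

```c
#include <assert.h>

/* If-ladder: tests the values one by one, so the last option costs the
   most comparisons. */
static int dispatch_ladder(int option)
{
    if (option == 0)      return 100;
    else if (option == 1) return 101;
    else if (option == 2) return 102;
    else                  return 103;
}

/* If-tree: halves the remaining range at each test, like a binary search,
   so each of the four options is reached after two comparisons. */
static int dispatch_tree(int option)
{
    assert(option >= 0 && option <= 3);
    if (option < 2) {
        if (option < 1) return 100;
        return 101;
    }
    if (option < 3) return 102;
    return 103;
}
```

With only four options the difference is tiny. The tree only starts to pay off when the list is long and no single value dominates.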
You probably don't want an array of function pointers. Yes, it's an array reference to get the function pointer, but there's several instructions' overhead in calling a function, and it sounds like that could overwhelm the small amount being done inside each function.
In any case, looking at the assembly language, or single-stepping at the instruction level, will give you a good idea how efficient it's being.
A good compiler will compile a switch with cases in a small numerical range as a single conditional to see if the value is in that range (which can sometimes be optimized out) followed by a jumptable jump. This will almost surely be faster than a function call (direct or indirect) because:
A jump is a lot less expensive than a call (which must save call-clobbered registers, adjust the stack, etc.).
The code in the switch statement cases can make use of expression values already cached in registers in the caller.
It's possible that an extremely advanced compiler could determine that the call-via-function pointer only refers to one of a small set of static-linkage functions, and thereby optimize things heavily, maybe even eliminating the calls and replacing them by jumps. But I wouldn't count on it.
I arrived at this post recently since I was wondering the same. I ended up taking the time to try it. It certainly depends greatly on what you're doing, but for my VM it was a decent speed up (15-25%), and allowed me to simplify some code (which is probably where a lot of the speedup came from). As an example (code simplified for clarity), a "for" loop was able to be easily implemented using a for loop:
void OpFor( Frame* frame, Instruction* &code )
{
i32 start = GET_OP_A(code);
i32 stop_value = GET_OP_B(code);
i32 step = GET_OP_C(code);
// instruction count (ie. block size)
u32 i_count = GET_OP_D(code);
// pointer to end of block (NOP if it branches)
Instruction* end = code + i_count;
if( step > 0 )
{
for( u32 i = start; i < stop_value; i += step )
{
// rewind instruction pointer
Instruction* cur = code;
// execute code inside for loop
while(cur != end)
{
cur->func(frame, cur);
++cur;
}
}
}
else
// same with <=
}

Switch case optimization scenario

I am aware of various switch case optimization techniques, but as per my understanding most modern compilers do not care how you write switch cases; they optimize them anyway.
Here is the issue:
void func( int num)
set = 1,2,3,4,6,7,8,10,11,15
{
if (num is not from set )
regular_action();
else
unusual_stuff();
}
The set would always have values mentioned above or something resembling with many of the elements closely spaced.
E.g.
set = 0,2,3,6,7,8,11,15,27 is another possible value.
Most of the time during my program run the number passed is not from this set, but when it is from the set I need to take some actions.
I have tried to simulate the above behavior with following functions just to figure out which way the switch statement should be written. Below functions do not do anything except the switch case - jump tables - comparisons.
I need to determine whether compare_1 is faster or compare_2 is faster. On my dual core machine, compare_2 always looks faster but I am unable to figure out why does this happen? Is the compiler so smart that it optimizes in such cases too?
There is no way of feeling that one function is faster than the other. Do measurements (without the printf) and also compare the assembler that is produced (use the option -S to the compiler).
Here are some suggestions for optimizing a switch statement:
Remove the switch statement
Redesign your code so that a switch statement is not necessary. For example, implementing virtual base methods in a base class. Or using an array.
Filter out common choices. If there are many choices in a range, reduce the choices to the first item in the range (although the compiler may do this automagically for you.)
Keep choices contiguous
This is very easy for the compiler to implement as a single indexed jump table.
Many choices, not contiguous
One method is to implement an associative array (key, function pointer). The code may search the table or, for larger tables, they could be implemented as a linked list. Other options are possible.
Few choices, not contiguous
Often implemented by compilers as an if-elseif ladder.
Profiling
The real proof is in setting compiler optimization switches and profiling.
The Assembly Listing
You may want to code up some switch statements and see how the compiler generates the assembly code. See which version generates the optimal assembly code for your situation.
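The (key, function pointer) table suggested above for many non-contiguous choices could look roughly like this. The keys, handler names and return values are invented for the example, and the table is searched with the standard bsearch, which requires it to be sorted by key.

```c
#include <assert.h>
#include <stdlib.h>

typedef int (*handler_fn)(void);

struct entry {
    int        key;
    handler_fn fn;
};

/* Hypothetical handlers; each returns a distinct code so calls can be told apart. */
static int on_small(void)  { return 1; }
static int on_medium(void) { return 2; }
static int on_large(void)  { return 3; }

/* Sparse keys. The table must stay sorted by key for bsearch to work. */
static const struct entry table[] = {
    { 3,    on_small  },
    { 27,   on_medium },
    { 1000, on_large  },
};

/* bsearch comparator: first argument is the search key, second a table element. */
static int compare_keys(const void *key, const void *elem)
{
    int k = *(const int *)key;
    int e = ((const struct entry *)elem)->key;
    return (k > e) - (k < e);
}

/* Runs the handler registered for key, or returns -1 if there is none. */
static int dispatch(int key)
{
    const struct entry *hit = bsearch(&key, table,
                                      sizeof table / sizeof table[0],
                                      sizeof table[0], compare_keys);
    if (hit == NULL)
        return -1;
    assert(hit->fn != NULL);
    return hit->fn();
}
```

For a handful of keys a linear scan would do just as well. The binary search is only there to show that the lookup does not have to be O(n).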
If your set really consists of numbers in the range 0 to 63, use:
#define SET 0x.......ULL
if (num < 64U && (1ULL<<num & SET)) foo();
else bar();
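As a worked instance of this bitmask test, the sketch below builds the mask for the first set given in the question (1,2,3,4,6,7,8,10,11,15). The names BIT, SET_MASK, in_set and check_in_set are made up for the example. The mask is built from the member list, so other sets only need a different list.

```c
#include <assert.h>

#define BIT(n) (1ULL << (n))

/* One bit per member of the set 1,2,3,4,6,7,8,10,11,15 (equals 0x8DDE). */
#define SET_MASK (BIT(1) | BIT(2) | BIT(3) | BIT(4) | BIT(6) | \
                  BIT(7) | BIT(8) | BIT(10) | BIT(11) | BIT(15))

/* Non-zero if num is in the set. The range check comes first so the shift
   count never reaches 64, which would be undefined behaviour. */
static int in_set(unsigned num)
{
    return num < 64U && (BIT(num) & SET_MASK) != 0;
}

/* Small usage example. */
static void check_in_set(void)
{
    assert(in_set(15));
    assert(!in_set(5));
}
```

The whole membership test is one comparison, one shift and one AND, with no branch per member.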
Looking at your comparison functions, the second one is always faster because it is optimized to always execute the default statement. The default statement is executed "in order" as it appears in the switch, so in the second function it is executed immediately. It is very efficiently giving you the same answer for every switch!
The default case conventionally appears as the last case in a switch (the C language itself allows it anywhere, and its position does not change which case is selected). See http://www.tutorialspoint.com/cplusplus/cpp_switch_statement.htm
for example, where it states "A switch statement can have an optional default case, which must appear at the end of the switch. The default case can be used for performing a task when none of the cases is true. No break is needed in the default case."
Here are the functions mentioned above
#include <stdio.h>  /* printf */
#include <stdlib.h> /* rand */
#define MAX 100000000
void compare_1(void)
{
unsigned long i;
unsigned long j;
printf("%s\n", __FUNCTION__);
for(i=0;i<MAX;i++)
{
j = rand()%100;
switch(j)
{
case 1:
case 2:
case 3:
case 4:
case 6:
case 7:
case 8:
case 10:
case 11:
case 15:
break ;
default:
break ;
}
}
}
void compare_2(void)
{
int i;
int j;
printf("%s\n", __FUNCTION__);
for(i=0;i<MAX;i++)
{
j = rand()%100;
switch(j)
{
default:
break ;
case 1:
case 2:
case 3:
case 4:
case 6:
case 7:
case 8:
case 10:
case 11:
case 15:
break ;
}
}
}

Why was the switch statement designed to need a break?

Given a simple switch statement
switch (int)
{
case 1 :
{
printf("1\n");
break;
}
case 2 :
{
printf("2\n");
}
case 3 :
{
printf("3\n");
}
}
The absence of a break statement in case 2, implies that execution will continue inside the code for case 3.
This is not an accident; it was designed that way. Why was this decision made? What benefit does this provide vs. having an automatic break semantic for the blocks? What was the rationale?
Many answers seem to focus on the ability to fall through as the reason for requiring the break statement.
I believe it was simply a mistake, due largely to the fact that when C was designed there was not nearly as much experience with how these constructs would be used.
Peter van der Linden makes the case in his book "Expert C Programming":
We analyzed the Sun C compiler sources
to see how often the default fall
through was used. The Sun ANSI C
compiler front end has 244 switch
statements, each of which has an
average of seven cases. Fall through
occurs in just 3% of all these cases.
In other words, the normal switch
behavior is wrong 97% of the time.
It's not just in a compiler - on the
contrary, where fall through was used
in this analysis it was often for
situations that occur more frequently
in a compiler than in other software,
for instance, when compiling operators
that can have either one or two
operands:
switch (operator->num_of_operands) {
case 2: process_operand( operator->operand_2);
/* FALLTHRU */
case 1: process_operand( operator->operand_1);
break;
}
Case fall through is so widely
recognized as a defect that there's
even a special comment convention,
shown above, that tells lint "this is
really one of those 3% of cases where
fall through was desired."
I think it was a good idea for C# to require an explicit jump statement at the end of each case block (while still allowing multiple case labels to be stacked - as long as there's only a single block of statements). In C# you can still have one case fall through to another - you just have to make the fall thru explicit by jumping to the next case using a goto.
It's too bad Java didn't take the opportunity to break from the C semantics.
In a lot of ways C is just a clean interface to standard assembly idioms. When writing jump table driven flow control, the programmer has the choice of falling through or jumping out of the "control structure", and a jump out requires an explicit instruction.
So, C does the same thing...
To implement Duff's device, obviously:
dsend(to, from, count)
char *to, *from;
int count;
{
int n = (count + 7) / 8;
switch (count % 8) {
case 0: do { *to = *from++;
case 7: *to = *from++;
case 6: *to = *from++;
case 5: *to = *from++;
case 4: *to = *from++;
case 3: *to = *from++;
case 2: *to = *from++;
case 1: *to = *from++;
} while (--n > 0);
}
}
If cases were designed to break implicitly then you couldn't have fallthrough.
case 0:
case 1:
case 2:
// all do the same thing.
break;
case 3:
case 4:
// do something different.
break;
default:
// something else entirely.
If the switch was designed to break out implicitly after every case you wouldn't have a choice about it. The switch-case structure was designed the way it is to be more flexible.
The case statements in a switch statement are simply labels.
When you switch on a value, the switch statement essentially does a goto to the label with the matching value.
This means that the break is necessary to avoid passing through to the code under the next label.
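The label behaviour can be observed directly with a small sketch that mirrors the switch from the question: case 1 ends in a break, case 2 does not. The function name run_from and the DID_* flags are invented for this example.

```c
#include <assert.h>

enum { DID_ONE = 1, DID_TWO = 2, DID_THREE = 4 };

/* Returns a bitmask of the case bodies that were executed. */
static int run_from(int start)
{
    int done = 0;
    switch (start) {          /* effectively: goto the label matching start */
    case 1:
        done |= DID_ONE;
        break;                /* leave the switch */
    case 2:
        done |= DID_TWO;
        /* no break: execution falls through into case 3 */
    case 3:
        done |= DID_THREE;
    }
    return done;
}

/* Small usage example. */
static void check_fall_through(void)
{
    assert(run_from(2) == (DID_TWO | DID_THREE));
}
```

Entering at case 2 runs the bodies of both case 2 and case 3, while entering at case 1 runs only its own body because of the break.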
As for the reason why it was implemented this way - the fall-through nature of a switch statement can be useful in some scenarios. For example:
case optionA:
// optionA needs to do its own thing, and also B's thing.
// Fall-through to optionB afterwards.
// Its behaviour is a superset of B's.
case optionB:
// optionB needs to do its own thing
// Its behaviour is a subset of A's.
break;
case optionC:
// optionC is quite independent so it does its own thing.
break;
To allow things like:
switch(foo) {
case 1:
/* stuff for case 1 only */
if (0) {
case 2:
/* stuff for case 2 only */
}
/* stuff for cases 1 and 2 */
case 3:
/* stuff for cases 1, 2, and 3 */
}
Think of the case keyword as a goto label and it comes a lot more naturally.
It eliminates code duplication when several cases need to execute the same code (or the same code in sequence).
Since at the assembly language level nothing cares whether you break between each one or not, there is zero overhead for fall-through cases anyway. So why not allow them, since they offer significant advantages in certain cases?
I happened to run into a case of assigning values in vectors to structs: it had to be done in such a manner that if the data vector was shorter than the number of data members in the struct, the rest of the members would remain at their default value. In that case omitting break was quite useful.
switch (nShorts)
{
case 4: frame.leadV1 = shortArray[3];
case 3: frame.leadIII = shortArray[2];
case 2: frame.leadII = shortArray[1];
case 1: frame.leadI = shortArray[0]; break;
default: TS_ASSERT(false);
}
As many here have specified, it's to allow a single block of code to work for multiple cases. This should be a more common occurrence for your switch statements than the "block of code per case" you specify in your example.
If you have a block of code per case without fall-through, perhaps you should consider using an if-elseif-else block, as that would seem more appropriate.
