Consider the following test programs:
Loop value on the stack
int main( void ) {
int iterations = 1000000000;
while ( iterations > 0 )
-- iterations;
}
Loop value on the stack (dereferenced)
int main( void ) {
int iterations = 1000000000;
int * p = & iterations;
while ( * p > 0 )
-- * p;
}
Loop value on the heap
#include <stdlib.h>
int main( void ) {
int * p = malloc( sizeof( int ) );
* p = 1000000000;
while ( *p > 0 )
-- * p;
}
Compiling them with -O0, I get the following execution times:
case1.c
real 0m2.698s
user 0m2.690s
sys 0m0.003s
case2.c
real 0m2.574s
user 0m2.567s
sys 0m0.000s
case3.c
real 0m2.566s
user 0m2.560s
sys 0m0.000s
[edit] Following is the average over 10 executions:
case1.c
2.70364
case2.c
2.57091
case3.c
2.57000
Why is the execution time longer for the first test case, which seems to be the simplest?
My current architecture is an x86 virtual machine (Archlinux). I get these results both with gcc (4.8.0) and clang (3.3).
[edit 1] The generated assembler code is almost identical, except that the second and third versions contain more instructions than the first.
[edit 2] These results are reproducible (on my system); each execution has the same order of magnitude.
[edit 3] I don't really care about the performance of a non-optimized program, but I don't understand why it would be slower, and I'm curious.
It's hard to say if this is the reason, since I'm doing some guessing and you haven't given many specifics (like which target you're using). But what I see when I compile without optimizations for an x86 target is the following sequences for decrementing the iterations variable:
Case 1:
L3:
sub DWORD PTR [esp+12], 1
L2:
cmp DWORD PTR [esp+12], 0
jg L3
Case 2:
L3:
mov eax, DWORD PTR [esp+12]
mov eax, DWORD PTR [eax]
lea edx, [eax-1]
mov eax, DWORD PTR [esp+12]
mov DWORD PTR [eax], edx
L2:
mov eax, DWORD PTR [esp+12]
mov eax, DWORD PTR [eax]
test eax, eax
jg L3
One big difference you see in case 1 is that the instruction at L3 both reads and writes the memory location. It is followed immediately by an instruction that reads the same memory location that was just written. This sort of instruction sequence (the same memory location written and then immediately used in the next instruction) often causes some sort of pipeline stall in modern CPUs.
You'll note that a write followed immediately by a read of the same location is not present in case 2.
Again - this answer is a bit of informed speculation.
Related
This program must search for all occurrences of string 2 in string 1.
It works fine with all the strings I have tried, except with
s1="Ciao Cia Cio Ociao ciao Ocio CiCiao CieCiaCiu CiAo eeCCia"
s2="Cia"
in this case the correct result would be: 0 5 31 39 54
instead, it prints 0 5 39.
I don't understand why; the operation seems the same as
s1="Sette scettici sceicchi sciocchi con la sciatica a Shanghai"
s2="icchi"
with which the program works correctly.
I can't find the error!
The code:
#include <stdio.h>
void main()
{
#define MAX_LEN 100
// Input
char s1[] = "Ciao Cia Cio Ociao ciao Ocio CiCiao CieCiaCiu CiAo eeCCia";
unsigned int lengthS1 = sizeof(s1) - 1;
char s2[] = "Cia";
unsigned int lengthS2 = sizeof(s2) - 1;
// Output
unsigned int positions[MAX_LEN];
unsigned int positionsLen;
// Assembler block
__asm
{
MOV ECX, 0
MOV EAX, 0
DEC lengthS1
DEC lengthS2
MOV EBX, lengthS1
CMP EBX, 0
JZ fine
MOV positionsLen, 0
XOR EBX, EBX
XOR EDX, EDX
uno: CMP ECX, lengthS1
JG fine
CMP EAX, lengthS2
JNG restart
XOR EAX, EAX
restart : MOV BH, s1[ECX]
CMP BH, s2[EAX]
JE due
JNE tre
due : XOR EBX, EBX
CMP EAX, 0
JNE duedue
MOV positions[EDX * 4], ECX
INC ECX
INC EAX
JMP uno
duedue : CMP EAX, lengthS2
JNE duetre
INC ECX
INC EDX
INC positionsLen
XOR EAX, EAX
JMP uno
duetre : INC EAX
INC ECX
JMP uno
tre : XOR EBX, EBX
XOR EAX, EAX
INC ECX
JMP uno
fine:
}
// Print to screen
{
unsigned int i;
for (i = 0; i < positionsLen; i++)
printf("Sottostringa in posizione=%d\n", positions[i]);
}
}
Please, help.
The trickier programming gets, the more systematic and thoughtful your approach should be. If you have programmed x86 assembly for a decade, you will be able to skip a few of the steps I outline below. But especially if you are a beginner, you are well advised not to expect that you can just hack in assembly with confidence and without safety nets.
The code below is just a best guess (I did not compile, run, or debug the C code). It is there to give the idea.
Make a plan for your implementation
So you will have 2 nested loops, comparing the characters and then collecting matches.
Implement the "assembly" in low-level C, which already resembles the end product.
C is nearly an assembly language itself...
Write yourself tests; debug and analyze your "pseudo assembly" C version.
Translate the C lines step by step into assembly lines, "promoting" the C lines to comments.
This is my first shot at doing that - the initial C version, which might or might not work. It is still faster and easier to write (with the assembly code in mind), and easier to debug and step through. Once this works, it is time to "translate".
#include <stdint.h>
#include <stddef.h>
#include <string.h>
size_t substring_positions(const char *s, const char* sub_string, size_t* positions, size_t positions_capacity) {
size_t positions_index = 0;
size_t i = 0;
size_t j = 0;
size_t i_max = strlen(s) - strlen(sub_string);
size_t j_max = strlen(sub_string); /* compare every character of the pattern */
loop0:
if (i > i_max)
goto end;
j = 0;
loop1:
if (j == j_max)
goto match;
if (s[i+j] == sub_string[j])
goto go_on;
i++;
goto loop0;
go_on:
j++;
goto loop1;
match:
positions[positions_index] = i;
positions_index++;
i++; /* advance past this match so overlapping matches are still found */
if (positions_index < positions_capacity)
goto loop0;
goto end;
end:
return positions_index;
}
As you can see, I did not use "higher level language features" for this function (does C even have such things?! :)). And now you can start to "assemble". If RAX is supposed to hold your i variable, you could replace size_t i = 0; with XOR RAX,RAX. And so on.
With that approach, other people even have a chance to read the assembly code, and with the comments (the former C code) you state the intent of your instructions.
I am completing an assignment related to C programming and assembly language. Here is the simple C program:
int multiply(int a, int b) {
int k = 4;
int c,d, e;
c = a*b ;
d = a*b + k*c;
return d;
}
And its optimised assembly is:
_a$ = 8 ; size = 4
_b$ = 12 ; size = 4
_multiply PROC
mov eax, DWORD PTR _a$[esp-4]
imul eax, DWORD PTR _b$[esp-4]
lea eax, DWORD PTR [eax+eax*4]
ret 0
_multiply ENDP
I want to know the value of eax register after this line of code in assembly
lea eax, DWORD PTR [eax+eax*4]
I know that when we add integers in assembly, the result is stored in the destination, and when we multiply, it is stored in eax. So if I call the function multiply( 3 , 8 ), the value of the eax register after that line should be 120. Am I correct?
lea is "load effective address".
Instruction sets can have some quite complex multi-register address calculation modes that are generally used just for reading and writing data in memory, but lea allows the programmer to get the address that would be accessed by the instruction.
Effectively, it performs the calculation inside the brackets and returns that value - it doesn't access memory (which is what the brackets usually imply).
In this case it is being used as a quick way to multiply by 5, because the rest of the function has been optimised away!
I have simple question that I just wasn't sure about.
Consider the code below:
#include <stdio.h>
static void turnOn(int *power);
static void turnOff(int *power);
int main(void)
{
int powerIsOn = 0;
turnOn(&powerIsOn);
printf("Power Status: %d\n", powerIsOn);
turnOff(&powerIsOn);
printf("Power Status: %d\n", powerIsOn);
return 0;
}
static void turnOn(int *power)
{
if (!*power)
*power = 1;
// Or
//*power = 1;
return;
}
static void turnOff(int *power)
{
if (*power)
*power = 0;
// Or
// *power = 0;
return;
}
I know that this wouldn't cause a noticeable difference in something this small. But in functions that do some sort of assignment, is it more efficient to check whether a Boolean (or whatever) is already true/false before re-assigning its value?
For example, the turnOn() function is set to only turn the power on if it is off. Would it be any slower or faster just to set it to 1 regardless of the value?
Thanks for your time.
Your code accesses memory in either case: it does so inside the "if", and it does so when you assign 1 to it. In addition, the "if" statement adds a few more instructions to the binary code, so if you can just assign without using "if", it is better and more efficient.
if (!*power) // one memory access, plus the conditional branch
*power = 1; // one more memory access, plus the assignment
Looking at the compiled assembly code for
static void turnOn(int *power)
{
if (!*power)
*power = 1;
return;
}
We will see the following code:
turnOn:
push rbp
mov rbp, rsp
mov QWORD PTR [rbp-8], rdi
mov rax, QWORD PTR [rbp-8]
mov eax, DWORD PTR [rax]
test eax, eax
jne .L4
mov rax, QWORD PTR [rbp-8]
mov DWORD PTR [rax], 1
nop
.L4:
nop
pop rbp
ret
And for:
static void turnOn(int *power)
{
*power = 1;
return;
}
The following code:
turnOn:
push rbp
mov rbp, rsp
mov QWORD PTR [rbp-8], rdi
mov rax, QWORD PTR [rbp-8]
mov DWORD PTR [rax], 1
nop
pop rbp
ret
It seems to me that the machine will run more operations in the first case.
I was using the https://godbolt.org/ compiler explorer.
Both operations involve accessing variables and therefore depend on memory access. Assignment involves writing to memory, and is therefore more "expensive" than comparing values (booleans, or 1 and 0).
That said, on modern hardware these differences are negligible and are therefore considered micro-optimizations, which aren't recommended.
I think that can only be answered by profiling the code in a specific scenario.
Reading a variable in C can be more efficient than assigning a value to it if the compiler generates code that reads the value directly (however, it depends on the implementation of a given compiler).
However, whether it executes faster is circumstantial. E.g. if the two functions are called sequentially many times, then the additional if statement you created is useless: the condition will always be true, and the check generates additional instructions (and the CPU spends additional cycles executing them) for a test that always succeeds. I would say it really needs to be profiled and is context-dependent.
You have two arrays and a function that counts differences between them:
for( i = 0; i < len; ++i ) {
int value1 = vector1[i];
int value2 = vector2[i];
if( value1 != value2 ) ++num_differences;
}
As branching can degrade performance, it can be optimized to:
for( i = 0; i < len; ++i ) {
num_differences += !!(vector1[i] != vector2[i]);
}
// !!(..) is to be sure that the result is boolean 0 or 1
so there is no if clause. But does it make sense in practice? With GCC (and other compilers) being so smart, does it make sense to play with such optimizations?
The short answer is: "Trust your Compiler".
In general you're not going to see much benefit from optimisations like this unless you're working with really huge datasets. Even then you really need to benchmark the code to see if there is any improvement.
Unless len is in the millions, or you're comparing a lot of arrays, then no. The second version is less readable (though not so much to an experienced programmer), so I'd prefer the first variant, unless this is the bottleneck (doubtful).
The following codes are generated, with optimizations:
for( i = 0; i < 4; ++i ) {
int value1 = vector1[i];
int value2 = vector2[i];
if( value1 != value2 ) ++num_differences;
00401000 mov ecx,dword ptr [vector1 (40301Ch)]
00401006 xor eax,eax
00401008 cmp ecx,dword ptr [vector2 (40302Ch)]
0040100E je wmain+15h (401015h)
00401010 mov eax,1
00401015 mov edx,dword ptr [vector1+4 (403020h)]
0040101B cmp edx,dword ptr [vector2+4 (403030h)]
00401021 je wmain+26h (401026h)
00401023 add eax,1
00401026 mov ecx,dword ptr [vector1+8 (403024h)]
0040102C cmp ecx,dword ptr [vector2+8 (403034h)]
00401032 je wmain+37h (401037h)
00401034 add eax,1
00401037 mov edx,dword ptr [vector1+0Ch (403028h)]
0040103D cmp edx,dword ptr [vector2+0Ch (403038h)]
00401043 je wmain+48h (401048h)
00401045 add eax,1
}
for( i = 0; i < 4; ++i ) {
num_differences += !!(vector1[i] != vector2[i]);
00401064 mov edx,dword ptr [vector1+0Ch (403028h)]
0040106A xor eax,eax
0040106C cmp edx,dword ptr [vector2+0Ch (403038h)]
00401072 mov edx,dword ptr [vector1+8 (403024h)]
00401078 setne al
0040107B xor ecx,ecx
0040107D cmp edx,dword ptr [vector2+8 (403034h)]
00401083 mov edx,dword ptr [vector1+4 (403020h)]
00401089 setne cl
0040108C add eax,ecx
0040108E xor ecx,ecx
00401090 cmp edx,dword ptr [vector2+4 (403030h)]
00401096 mov edx,dword ptr [vector1 (40301Ch)]
0040109C setne cl
0040109F add eax,ecx
004010A1 xor ecx,ecx
004010A3 cmp edx,dword ptr [vector2 (40302Ch)]
004010A9 setne cl
004010AC add eax,ecx
}
So, actually, the second version is slightly slower (theoretically). 19 instructions for the second vs. 17 for the first.
You should compare the code the compiler generates. It may be equivalent.
The compiler's very smart, but a good engineer can certainly improve a program's performance.
I don't think you are going to do much better. Your second example is hard to read/understand for the average programmer, which means two things: one, it is hard to understand and maintain; two, you may be creeping into dark, less tested/supported corners of the compiler. Drive down the road between the lines; don't wander about on the shoulder or in the wrong lane.
Go with this
for( i = 0; i < len; ++i ) {
int value1 = vector1[i];
int value2 = vector2[i];
if( value1 != value2 ) ++num_differences;
}
or this
for( i = 0; i < len; ++i ) {
if( vector1[i] != vector2[i] ) ++num_differences;
}
If it really is bothering you, and you have properly concluded that this is your performance bottleneck, then time the difference between them. From the disassembly shown, and the nature of this platform, it is very difficult to time such things properly and draw the right conclusions: too many caches and other factors cloud the results, leading to false conclusions. And no two x86 implementations have the same performance, so if you happen to tune for your computer, you are likely detuning it for another model of x86, or even the same make on a different motherboard with different I/O characteristics.
I need to translate what is commented within the function to assembler. I have a rough idea, but can't manage it.
Can anyone help me, please? It is for an Intel x32 architecture:
int
secuencia ( int n, EXPRESION * * o )
{
int a, i;
//--- Translate from here ...
for ( i = 0; i < n; i++ ){
a = evaluarExpresion( *o );
o++;
}
return a ;
//--- ... until here.
}
Translated code must be within __asm as:
__asm {
translated code
}
Thank you,
FINAL UPDATE:
This is the final version, working and commented, thanks to all for your help :)
int
secuencia ( int n, EXPRESION * * o )
{
int a = 0, i;
__asm
{
mov dword ptr [i],0 ; int i = 0
jmp salto1
ciclo1:
mov eax,dword ptr [i]
add eax,1 ; increment in 1 the value of i
mov dword ptr [i],eax ; i++
salto1:
mov eax,dword ptr [i]
cmp eax,dword ptr [n] ; Compare i and n
jge final ; If i >= n, jump to 'final'
mov eax,dword ptr [o]
mov ecx,dword ptr [eax] ; Recover * o (its value)
push ecx ; Make push of * o (At the stack, its value)
call evaluarExpresion ; call evaluarExpresion( * o )
add esp,4 ; Clean up the stack (4 bytes, corresponding to the * o pointer)
mov dword ptr [a],eax ; Save the result of evaluarExpresion as the value of a
mov eax,dword ptr [o] ; extract the pointer to o
add eax,4 ; advance the pointer by 4 bytes (the size of a pointer) to the next element
mov dword ptr [o],eax ; o++
jmp ciclo1 ; repeat
final: ; for's final
mov eax,dword ptr [a] ; return a - saves the return value in the eax register (by convention, this is where the result must be stored)
}
}
Essentially, in assembly languages there is, strictly speaking, no notion of a loop the way there is in a higher-level language; it's all implemented with jumps (e.g. as a "goto"...).
That said, x86 has some instructions designed with the assumption that you'll be writing "loops", implicitly using the register ECX as a loop counter.
Some examples:
mov ecx, 5 ; ecx = 5
.label:
; Loop body code goes here
; ECX will start out as 5, then 4, then 3, then 2, then 1
loop .label ; if (--ecx) goto .label;
Or:
jecxz .loop_end ; if (!ecx) goto .loop_end;
.loop_start:
; Loop body goes here
loop .loop_start ; if (--ecx) goto .loop_start;
.loop_end:
And if you don't like the loop instruction counting backwards, you can write something like:
xor ecx, ecx ; ecx = 0
.loop_start:
cmp ecx, 5 ; do (ecx-5) discarding result, then set FLAGS
jz .loop_end ; if (ecx-5) was zero (eg. ecx == 5), jump to .loop_end
; Loop body goes here.
inc ecx ; ecx++
jmp .loop_start
.loop_end:
This would be closer to the typical for (int i=0; i<5; ++i) { }
Note that
for (init; cond; advance) {
...
}
is essentially syntactic sugar for
init;
while(cond) {
...
advance;
}
which should be easy enough to translate into assembly language if you've been paying any attention in class.
Use gcc to generate the assembly code
gcc -S sample.c
man gcc is your friend
For that you would probably use the loop instruction, which decrements ecx (often called the extended counter) on each iteration and exits when ecx reaches zero. But why use inline asm for this anyway? I'm pretty sure something as simple as that will be optimized correctly by the compiler...
(We say x86 architecture, because it's based on the 80x86 computers, but "x32" is an "ok" mistake =p)