Do optimization like this make sense?

Do optimization like this make sense? - c

You have two arrays and a function that counts differences between them:
for( i = 0; i < len; ++i ) {
int value1 = vector1[i];
int value2 = vector2[i];
if( value1 != value2 ) ++num_differences;
}
As branching downgrades performance, it can be optimized to:
for( i = 0; i < len; ++i ) {
num_differences += !!(vector1[i] != vector2[i])
}
// !!(..) is to be sure that the result is boolean 0 or 1
so there is no if clause. But does it practically make sense? With GCC (and other compilers) being so smart, does it make sense to play with such optimizations?

The short answer is: "Trust your Compiler".
In general you're not going to see much benefit from optimisations like this unless you're working with really huge datasets. Even then you really need to benchmark the code to see if there is any improvement.

Unless len is several millions large or you're comparing a lot of arrays, then no. The second version is less readable (not so much to an experienced programmer), so I'd prefer the first variant, unless this is the bottleneck (doubtful).
The following codes are generated, with optimizations:
for( i = 0; i < 4; ++i ) {
int value1 = vector1[i];
int value2 = vector2[i];
if( value1 != value2 ) ++num_differences;
00401000 mov ecx,dword ptr [vector1 (40301Ch)]
00401006 xor eax,eax
00401008 cmp ecx,dword ptr [vector2 (40302Ch)]
0040100E je wmain+15h (401015h)
00401010 mov eax,1
00401015 mov edx,dword ptr [vector1+4 (403020h)]
0040101B cmp edx,dword ptr [vector2+4 (403030h)]
00401021 je wmain+26h (401026h)
00401023 add eax,1
00401026 mov ecx,dword ptr [vector1+8 (403024h)]
0040102C cmp ecx,dword ptr [vector2+8 (403034h)]
00401032 je wmain+37h (401037h)
00401034 add eax,1
00401037 mov edx,dword ptr [vector1+0Ch (403028h)]
0040103D cmp edx,dword ptr [vector2+0Ch (403038h)]
00401043 je wmain+48h (401048h)
00401045 add eax,1
}
for( i = 0; i < 4; ++i ) {
num_differences += !!(vector1[i] != vector2[i]);
00401064 mov edx,dword ptr [vector1+0Ch (403028h)]
0040106A xor eax,eax
0040106C cmp edx,dword ptr [vector2+0Ch (403038h)]
00401072 mov edx,dword ptr [vector1+8 (403024h)]
00401078 setne al
0040107B xor ecx,ecx
0040107D cmp edx,dword ptr [vector2+8 (403034h)]
00401083 mov edx,dword ptr [vector1+4 (403020h)]
00401089 setne cl
0040108C add eax,ecx
0040108E xor ecx,ecx
00401090 cmp edx,dword ptr [vector2+4 (403030h)]
00401096 mov edx,dword ptr [vector1 (40301Ch)]
0040109C setne cl
0040109F add eax,ecx
004010A1 xor ecx,ecx
004010A3 cmp edx,dword ptr [vector2 (40302Ch)]
004010A9 setne cl
004010AC add eax,ecx
}
So, actually, the second version is slightly slower (theoretically). 19 instructions for the second vs. 17 for the first.

You should compare the code the compiler generates. It may be equivalent.
The compiler's very smart, but a good engineer can certainly improve a program's performance.

I dont think you are going to do much better, your second example is hard to read/understand for the average programmer which means two things one hard to understand and maintain, two you may be creeping into dark, less tested/supported, corners of the compiler. Drive down the road between the lines, dont wander about on the shoulder or in the wrong lane.
Go with this
for( i = 0; i < len; ++i ) {
int value1 = vector1[i];
int value2 = vector2[i];
if( value1 != value2 ) ++num_differences;
}
or this
for( i = 0; i < len; ++i ) {
if( vector1[i] != vector2[i] ) ++num_differences;
}
if it really is bothering you and you have properly concluded this is your performance bottleneck then time the difference between them. From the disassembly shown, and the nature of this platform, it is very difficult to properly time such things and draw the right conclusions. Too many caches, and other factors that cloud over the results, leading to false conclusions, etc. and no two x86 implementations have the same performance so if you happen to tune for your computer you are likely detuning it for another model of x86 or even the same make on a different motherboard with different I/O characteristics.

Related

Is any performance gained by using an if/else statement for an assignment?

I have simple question that I just wasn't sure about.
Consider the code below:
#include <stdio.h>
static void turnOn(int *power);
static void turnOff(int *power);
int main(void)
{
int powerIsOn = 0;
turnOn(&powerIsOn);
printf("Power Status: %d\n", powerIsOn);
turnOff(&powerIsOn);
printf("Power Status: %d\n", powerIsOn);
return 0;
}
static void turnOn(int *power)
{
if (!*power)
*power = 1;
// Or
//*power = 1;
return;
}
static void turnOff(int *power)
{
if (*power)
*power = 0;
// Or
// *power = 0;
return;
}
I know that this wouldn't cause a noticeable difference in something this small. But In methods that do some sort of assignment, is it more efficient to check if a Boolean or whatever is already true/false before re-assigning it's value?
For example, the turnOn() function is set to only turn the power on if it is off. Would it be any slower or faster just to set it to 1 regardless of the value?
Thanks for your time.

In any case your code accessing the memory, it does inside of "if" and it does when you assign it to 1. In addition "if" statement adds few more lines to the binary code, so if you can just assign without using "if", it is better and more efficient.
if (!*power) #one memory access and addition if actions
*power = 1; #one more memory access and assignment
Looking at the compiled assembly code for
static void turnOn(int *power)
{
if (!*power)
*power = 1;
return;
}
We will see the next code
turnOn:
push rbp
mov rbp, rsp
mov QWORD PTR [rbp-8], rdi
mov rax, QWORD PTR [rbp-8]
mov eax, DWORD PTR [rax]
test eax, eax
jne .L4
mov rax, QWORD PTR [rbp-8]
mov DWORD PTR [rax], 1
nop
.L4:
nop
pop rbp
ret
And for:
static void turnOn(int *power)
{
*power = 1;
return;
}
Next code:
turnOn:
push rbp
mov rbp, rsp
mov QWORD PTR [rbp-8], rdi
mov rax, QWORD PTR [rbp-8]
mov DWORD PTR [rax], 1
nop
pop rbp
ret
It seem to me that the machine will run more operations in the first case.
I was using the https://godbolt.org/ compiler.

Both operations involve accessing variables and therefore depend on memory management. Assignment involves re-writing to memory, and therefore is more "expensive" than comparing constant values (booleans, or 1 and 0).
With that being said, with modern hardware these differences are negligible and therefore considered micro-optimizations, which aren't recommended.

I think that can only be answered profiling the code in a specific scenario.
Reading a variable in C can be more efficient than assigning a value to it if the compiler generates code that reads the value directly (however it is depends on the implementation of a given compiler).
However to determine whether it executes faster or not it is circumstantial e.g. if the both functions are called sequentially a lot of times then the additional if statement you created is useless as the condition will always return true and will generate additional instructions (and by that the CPU will spend additional clocks executing them) to perform a check that is always returning true. I would say it really need to be profiled and is context-dependent.

Is this function ok to convert binary to integer in c

I'm using this function to convert an 8-bit binary number represented as a boolean array to an integer. Is it efficient? I'm using it in an embedded system. It performs ok but I'm interested in some opinions or suggestions for improvement (or replacement) if there are any.
uint8_t b2i( bool *bs ){
uint8_t ret = 0;
ret = bs[7] ? 1 : 0;
ret += bs[6] ? 2 : 0;
ret += bs[5] ? 4 : 0;
ret += bs[4] ? 8 : 0;
ret += bs[3] ? 16 : 0;
ret += bs[2] ? 32 : 0;
ret += bs[1] ? 64 : 0;
ret += bs[0] ? 128 : 0;
return ret;
}

It is not possible to say without a specific system in mind. Disassemble the code and see what you got. Benchmark your code on a specific system. This is the key to understanding manual optimization.
Generally, there are lots of considerations. The CPU's data word size, instruction set, compiler optimizer performance, branch prediction (if any), data cache (if any) etc etc.
To make the code perform optimally regardless of data word size, you can change uint8_t to uint_fast8_t. That is unless you need exactly 8 bits, then leave it as uint8_t.
Cache use may or may not be more efficient if given an up-counting loop. At any rate, loop unrolling is an old kind of manual optimization that we shouldn't use in modern programming - the compiler is more capable of making that call than the programmer.
The worst problem with the code is the numerous branches. These might cause a bottleneck.
Your code results in the following x86 machine code gcc -O2:
b2i:
cmp BYTE PTR [rdi+6], 0
movzx eax, BYTE PTR [rdi+7]
je .L2
add eax, 2
.L2:
cmp BYTE PTR [rdi+5], 0
je .L3
add eax, 4
.L3:
cmp BYTE PTR [rdi+4], 0
je .L4
add eax, 8
.L4:
cmp BYTE PTR [rdi+3], 0
je .L5
add eax, 16
.L5:
cmp BYTE PTR [rdi+2], 0
je .L6
add eax, 32
.L6:
cmp BYTE PTR [rdi+1], 0
je .L7
add eax, 64
.L7:
lea edx, [rax-128]
cmp BYTE PTR [rdi], 0
cmovne eax, edx
ret
Whole lot of potentially inefficient branching. We can make the code both faster and more readable by using a loop:
uint8_t b2i (const bool bs[8])
{
uint8_t result = 0;
for(size_t i=0; i<8; i++)
{
result |= bs[8-1-i] << i;
}
return result;
}
(ideally the bool array should be arranged from LSB first but that would change the meaning of the code compared to the original)
Which gives this machine code instead:
b2i:
lea rsi, [rdi-8]
mov rax, rdi
xor r8d, r8d
.L2:
movzx edx, BYTE PTR [rax+7]
mov ecx, edi
sub ecx, eax
sub rax, 1
sal edx, cl
or r8d, edx
cmp rax, rsi
jne .L2
mov eax, r8d
ret
More instructions but less branching. It will likely perform better than your code on x86 and other high end CPUs with branch prediction and instruction cache. But worse than your code on a 8 bit microcontroller where only the total number of instructions count.

You can also do this with a loop and bit shifts to reduce code repetition:
int b2i(bool *bs) {
int ret = 0;
for (int i = 0; i < 8; i++) {
ret = ret << 1;
ret += bs[i];
}
return ret;
}

F# Breakable Array Iteration With Bounds Checking Elided?

I am interested in trying F# in a high-performance application. I do not want to have a large array's bounds checked during iteration and the lack of break/return statements is concerning.
This is a contrived example that will break upon finding a value, but can someone tell me if bounds checking is also elided?
let innerExists (item: Char) (items: Char array): bool =
let mutable state = false
let mutable i = 0
while not state && i < items.Length do
state <- item = items.[i]
i <- i + 1
state
let exists (input: Char array)(illegalChars: Char array): bool =
let mutable state = false
let mutable i = 0
while not state && i < input.Length do
state <- innerExists input.[i] illegalChars
i <- i + 1
state
exists [|'A'..'z'|] [|'.';',';';'|]
Here is the relevant disassembly:
while not state && i < input.Length do
000007FE6EB4237A cmp dword ptr [rbp-14h],0
000007FE6EB4237E jne 000007FE6EB42383
000007FE6EB42380 nop
000007FE6EB42381 jmp 000007FE6EB42386
000007FE6EB42383 nop
000007FE6EB42384 jmp 000007FE6EB423A9
000007FE6EB42386 nop
000007FE6EB42387 mov r8d,dword ptr [rbp-18h]
000007FE6EB4238B mov rdx,qword ptr [rbp+18h]
000007FE6EB4238F cmp r8d,dword ptr [rdx+8]
000007FE6EB42393 setl r8b
000007FE6EB42397 movzx r8d,r8b
000007FE6EB4239B mov dword ptr [rbp-24h],r8d
000007FE6EB4239F mov r8d,dword ptr [rbp-24h]
000007FE6EB423A3 mov dword ptr [rbp-1Ch],r8d
000007FE6EB423A7 jmp 000007FE6EB423B1
000007FE6EB423A9 nop
000007FE6EB423AA xor r8d,r8d
000007FE6EB423AD mov dword ptr [rbp-1Ch],r8d
000007FE6EB423B1 cmp dword ptr [rbp-1Ch],0
000007FE6EB423B5 je 000007FE6EB42409
state <- innerExists input.[i] illegalChars
000007FE6EB423B7 mov r8d,dword ptr [rbp-18h]
000007FE6EB423BB mov rdx,qword ptr [rbp+18h]
000007FE6EB423BF cmp r8,qword ptr [rdx+8]
000007FE6EB423C3 jb 000007FE6EB423CA
000007FE6EB423C5 call 000007FECD796850
000007FE6EB423CA lea rdx,[rdx+r8*2+10h]
000007FE6EB423CF movzx r8d,word ptr [rdx]
000007FE6EB423D3 mov rdx,qword ptr [rbp+10h]
000007FE6EB423D7 mov rdx,qword ptr [rdx+8]
000007FE6EB423DB mov r9,qword ptr [rbp+20h]
000007FE6EB423DF mov rcx,7FE6EEE0640h
000007FE6EB423E9 call 000007FE6EB41E40
000007FE6EB423EE mov dword ptr [rbp-20h],eax
000007FE6EB423F1 mov eax,dword ptr [rbp-20h]
000007FE6EB423F4 movzx eax,al
000007FE6EB423F7 mov dword ptr [rbp-14h],eax
i <- i + 1
000007FE6EB423FA mov eax,dword ptr [rbp-18h]

Others pointed out to use existing function FSharp.Core to achieve the same result but I think that OP asks if in loops like the boundary check of arrays are elided (as it is checked in the loop condition).
For simple code like above the jitter should be able to elide the checks. To see this it is correct to check the assembly code but it is important to not run with the VS debugger attached as the jitter don't optimize the code then. The reason that it can be impossible to show correct values in the debugger.
First let's look at exists optimized x64:
; not state?
00007ff9`1cd37551 85c0 test eax,eax
; if state is true then exit the loop
00007ff9`1cd37553 7521 jne 00007ff9`1cd37576
; i < input.Length?
00007ff9`1cd37555 395e08 cmp dword ptr [rsi+8],ebx
; Seems overly complex but perhaps this is as good as it gets?
00007ff9`1cd37558 0f9fc1 setg cl
00007ff9`1cd3755b 0fb6c9 movzx ecx,cl
00007ff9`1cd3755e 85c9 test ecx,ecx
; if we have reached end of the array then exit
00007ff9`1cd37560 7414 je 00007ff9`1cd37576
; mov i in ebx to rcx, unnecessary but moves like these are very cheap
00007ff9`1cd37562 4863cb movsxd rcx,ebx
; input.[i] (note we don't check the boundary again)
00007ff9`1cd37565 0fb74c4e10 movzx ecx,word ptr [rsi+rcx*2+10h]
; move illegalChars pointer to rdx
00007ff9`1cd3756a 488bd7 mov rdx,rdi
; call innerExists
00007ff9`1cd3756d e8ee9affff call 00007ff9`1cd31060
; i <- i + 1
00007ff9`1cd37572 ffc3 inc ebx
; Jump top of loop
00007ff9`1cd37574 ebdb jmp 00007ff9`1cd37551
; We are done!
00007ff9`1cd37576
So the code looks a bit too complex for what it should need to be but it seems it only checks the array condition once.
Now let's look at innerExists optimized x64:
# let mutable state = false
00007ff9`1cd375a0 33c0 xor eax,eax
# let mutable i = 0
00007ff9`1cd375a2 4533c0 xor r8d,r8d
; not state?
00007ff9`1cd375a5 85c0 test eax,eax
; if state is true then exit the loop
00007ff9`1cd375a7 752b jne 00007ff9`1cd375d4
; i < items.Length
00007ff9`1cd375a9 44394208 cmp dword ptr [rdx+8],r8d
; Seems overly complex but perhaps this is as good as it gets?
00007ff9`1cd375ad 410f9fc1 setg r9b
00007ff9`1cd375b1 450fb6c9 movzx r9d,r9b
00007ff9`1cd375b5 4585c9 test r9d,r9d
; if we have reached end of the array then exit
00007ff9`1cd375b8 741a je 00007ff9`1cd375d4
; mov i in r8d to rax, unnecessary but moves like these are very cheap
00007ff9`1cd375ba 4963c0 movsxd rax,r8d
; items.[i] (note we don't check the boundary again)
00007ff9`1cd375bd 0fb7444210 movzx eax,word ptr [rdx+rax*2+10h]
; mov item in cx to r9d, unnecessary but moves like these are very cheap
00007ff9`1cd375c2 440fb7c9 movzx r9d,cx
; item = items.[i]?
00007ff9`1cd375c6 413bc1 cmp eax,r9d
00007ff9`1cd375c9 0f94c0 sete al
; state <- ?
00007ff9`1cd375cc 0fb6c0 movzx eax,al
; i <- i + 1
00007ff9`1cd375cf 41ffc0 inc r8d
; Jump top of loop
00007ff9`1cd375d2 ebd1 jmp 00007ff9`1cd375a5
; We are done!
00007ff9`1cd375d4 c3 ret
So looks overly complex for what it should be but at least it looks like it only checks the array condition once.
So finally, it looks like the jitter eliminates the array boundary checks because it can prove this has already been checked successfully in the loop condition which I believe is what the OP wondered.
The x64 code doesn't look as clean as it could but from my experimentation x64 code that is cleaned up doesn't perform that much better, I suspect the CPU vendors optimize the CPU for the crappy code jitters produce.
An interesting exercise would be to code up an equivalent program in C++ and run it through https://godbolt.org/, choose x86-64 gcc (trunk) (gcc seems to do best right now) and specify the options -O3 -march=native and see the resulting x64 code.
Update
The code rewritten in https://godbolt.org/ to allow us seeing the assembly code generated by a c++ compiler:
template<int N>
bool innerExists(char item, char const (&items)[N]) {
for (auto i = 0; i < N; ++i) {
if (item == items[i]) return true;
}
return false;
}
template<int N1, int N2>
bool exists(char const (&input)[N1], char const (&illegalCharacters)[N2]) {
for (auto i = 0; i < N1; ++i) {
if (innerExists(input[i], illegalCharacters)) return true;
}
return false;
}
char const separators[] = { '.', ',', ';' };
char const str[58] = { };
bool test() {
return exists(str, separators);
}
x86-64 gcc (trunk) with options -O3 -march=native the following code is generated
; Load the string to test into edx
mov edx, OFFSET FLAT:str+1
.L2:
; Have we reached the end?
cmp rdx, OFFSET FLAT:str+58
; If yes, then jump to the end
je .L7
; Load a character
movzx ecx, BYTE PTR [rdx]
; Comparing the 3 separators are encoded in the assembler
; because the compiler detected the array is always the same
mov eax, ecx
and eax, -3
cmp al, 44
sete al
cmp cl, 59
sete cl
; increase outer i
inc rdx
; Did we find a match?
or al, cl
; If no then loop to .L2
je .L2
; We are done!
ret
.L7:
; No match found, clear result
xor eax, eax
; We are done!
ret
Looks pretty good but what I missing in the code above is using AVX to test multiple characters at once.

Bound check are eliminated by the JIT compiler, so it works the same for F# as for C#. You can expect elimination for code as in your example, as well as for
for i = 0 to data.Lenght - 1 do
...
and also for tail recursive functions, which compiles down to loops.
The built in Array.contains and Array.exists (source code) are written so that the JIT compiler can eliminate bounds checks.

What's wrong with the Array.contains and Array.exists functions ?
let exists input illegalChars =
input |> Array.exists (fun c -> illegalChars |> Array.contains c)

Does gcc automatically "unroll" if-statements?

Say I have a loop that looks like this:
for(int i = 0; i < 10000; i++) {
/* Do something computationally expensive */
if (i < 200 && !(i%20)) {
/* Do something else */
}
}
wherein some trivial task gets stuck behind an if-statement that only runs a handful of times.
I've always heard that "if-statements in loops are slow!" So, in the hopes of (marginally) increased performance, I split the loops apart into:
for(int i = 0; i < 200; i++) {
/* Do something computationally expensive */
if (!(i%20)) {
/* Do something else */
}
}
for(int i = 200; i < 10000; i++) {
/* Do something computationally expensive */
}
Will gcc (with the appropriate flags, like -O3) automatically break the one loop into two, or does it only unroll to decrease the number of iterations?

Why not just disassemble the program and see for yourself? But here we go. This is the testprogram:
int main() {
int sum = 0;
int i;
for(i = 0; i < 10000; i++) {
if (i < 200 && !(i%20)) {
sum += 0xC0DE;
}
sum += 0xCAFE;
}
printf("%d\n", sum);
return 0;
}
and this is the interesting part of the disassembled code compiled with gcc 4.3.3 and -o3:
0x08048404 <main+20>: xor ebx,ebx
0x08048406 <main+22>: push ecx
0x08048407 <main+23>: xor ecx,ecx
0x08048409 <main+25>: sub esp,0xc
0x0804840c <main+28>: lea esi,[esi+eiz*1+0x0]
0x08048410 <main+32>: cmp ecx,0xc7
0x08048416 <main+38>: jg 0x8048436 <main+70>
0x08048418 <main+40>: mov eax,ecx
0x0804841a <main+42>: imul esi
0x0804841c <main+44>: mov eax,ecx
0x0804841e <main+46>: sar eax,0x1f
0x08048421 <main+49>: sar edx,0x3
0x08048424 <main+52>: sub edx,eax
0x08048426 <main+54>: lea edx,[edx+edx*4]
0x08048429 <main+57>: shl edx,0x2
0x0804842c <main+60>: cmp ecx,edx
0x0804842e <main+62>: jne 0x8048436 <main+70>
0x08048430 <main+64>: add ebx,0xc0de
0x08048436 <main+70>: add ecx,0x1
0x08048439 <main+73>: add ebx,0xcafe
0x0804843f <main+79>: cmp ecx,0x2710
0x08048445 <main+85>: jne 0x8048410 <main+32>
0x08048447 <main+87>: mov DWORD PTR [esp+0x8],ebx
0x0804844b <main+91>: mov DWORD PTR [esp+0x4],0x8048530
0x08048453 <main+99>: mov DWORD PTR [esp],0x1
0x0804845a <main+106>: call 0x8048308 <__printf_chk#plt>
So as we see, for this particular example, no it does not. We have only one loop starting at main+32 and ending at main+85. If you've got problems reading the assembly code ecx = i; ebx = sum.
But still your mileage may vary - who knows what heuristics are used for this particular case, so you'll have to compile the code you've got in mind and see how longer/more complicated computations influence the optimizer.
Though on any modern CPU the branch predictor will do pretty good on such easy code, so you won't see much performance losses in either case. What's the performance loss of maybe a handful mispredictions if your computation intense code needs billions of cycles?

Translate a FOR to assembler

I need to translate what is commented within the method, to assembler. I have a roughly idea, but can't.
Anyone can help me please? Is for an Intel x32 architecture:
int
secuencia ( int n, EXPRESION * * o )
{
int a, i;
//--- Translate from here ...
for ( i = 0; i < n; i++ ){
a = evaluarExpresion( *o );
o++;
}
return a ;
//--- ... until here.
}
Translated code must be within __asm as:
__asm {
translated code
}
Thank you,
FINAL UPDATE:
This is the final version, working and commented, thanks to all for your help :)
int
secuencia ( int n, EXPRESION * * o )
{
int a = 0, i;
__asm
{
mov dword ptr [i],0 ; int i = 0
jmp salto1
ciclo1:
mov eax,dword ptr [i]
add eax,1 ; increment in 1 the value of i
mov dword ptr [i],eax ; i++
salto1:
mov eax,dword ptr [i]
cmp eax,dword ptr [n] ; Compare i and n
jge final ; If is greater goes to 'final'
mov eax,dword ptr [o]
mov ecx,dword ptr [eax] ; Recover * o (its value)
push ecx ; Make push of * o (At the stack, its value)
call evaluarExpresion ; call evaluarExpresion( * o )
add esp,4 ; Recover memory from the stack (4KB corresponding to the * o pointer)
mov dword ptr [a],eax ; Save the result of evaluarExpresion as the value of a
mov eax,dword ptr [o] ; extract the pointer to o
add eax,4 ; increment the pointer by a factor of 4 (next of the actual pointed by *o)
mov dword ptr [o],eax ; o++
jmp ciclo1 ; repeat
final: ; for's final
mov eax,dword ptr [a] ; return a - it save the return value at the eax registry (by convention this is where the result must be stored)
}
}

Essentially in assembly languages, strictly speaking there isn't a notion of a loop the same way there would be in a higher level language. It's all implemented with jumps (eg. as a "goto"...)
That said, x86 has some instructions with the assumption that you'll be writing "loops", implicitly using the register ECX as a loop counter.
Some examples:
mov ecx, 5 ; ecx = 5
.label:
; Loop body code goes here
; ECX will start out as 5, then 4, then 3, then 1...
loop .label ; if (--ecx) goto .label;
Or:
jecxz .loop_end ; if (!ecx) goto .loop_end;
.loop_start:
; Loop body goes here
loop .loop_start ; if (--ecx) goto .loop_start;
.loop_end:
And, if you don't like this loop instruction thing counting backwards... You can write something like:
xor ecx, ecx ; ecx = 0
.loop_start:
cmp ecx, 5 ; do (ecx-5) discarding result, then set FLAGS
jz .loop_end ; if (ecx-5) was zero (eg. ecx == 5), jump to .loop_end
; Loop body goes here.
inc ecx ; ecx++
jmp .loop_start
.loop_end:
This would be closer to the typical for (int i=0; i<5; ++i) { }

Note that
for (init; cond; advance) {
...
}
is essentially syntactic sugar for
init;
while(cond) {
...
advance;
}
which should be easy enough to translate into assembly language if you've been paying any attention in class.

Use gcc to generate the assembly code
gcc -S -c sample.c
man gcc is your friend

For that you would probably use the loop instruction that decrements the ecx (often called, extended counter) at each loop and goes out when ecx reaches zero.But why use inline asm for it anyway? I'm pretty sure something as simple as that will be optimized correctly by the compiler...
(We say x86 architecture, because it's based on 80x86 computers, but it's an "ok" mistake =p)

Develop Reference

c reactjs sql-server angularjs arrays wpf database batch-file google-app-engine silverlight

Do optimization like this make sense? - c

The short answer is: "Trust your Compiler". In general you're not going to see much benefit from optimisations like this unless you're working with really huge datasets. Even then you really need to benchmark the code to see if there is any improvement.

You should compare the code the compiler generates. It may be equivalent. The compiler's very smart, but a good engineer can certainly improve a program's performance.

Related

Is any performance gained by using an if/else statement for an assignment?

Is this function ok to convert binary to integer in c

F# Breakable Array Iteration With Bounds Checking Elided?

Does gcc automatically "unroll" if-statements?

Translate a FOR to assembler

Categories

Resources