Within the following block of code, num_insts is re-assigned to 0 following the first iteration of the loop.
inst_t buf[5] = {0};
num_insts = 10;
int i = 5;
for( ; i > 0; i-- )
{
buf[i] = buf[i-1];
}
buf[0] = next;
I cannot think of any possible valid reason for this behavior, but I'm also sleep deprived so a second opinion would be appreciated.
The assembly being executed for the buf shift is this:
004017ed: mov 0x90(%esp),%eax
004017f4: lea -0x1(%eax),%ecx
004017f7: mov 0x90(%esp),%edx
004017fe: mov %edx,%eax
00401800: shl $0x2,%eax
00401803: add %edx,%eax
00401805: shl $0x2,%eax
00401808: lea 0xa0(%esp),%edi
0040180f: lea (%edi,%eax,1),%eax
00401812: lea -0x7c(%eax),%edx
00401815: mov %ecx,%eax
00401817: shl $0x2,%eax
0040181a: add %ecx,%eax
0040181c: shl $0x2,%eax
0040181f: lea 0xa0(%esp),%ecx
And the register contents prior to executing the first assembly instruction above are these:
eax 0
ecx 0
edx 0
ebx 2665332
esp 0x28ab50
ebp 0x28ac08
esi 0
edi 2665432
eip 0x4017ed <main+1593>
Following those instructions, this:
eax 0
ecx 0
edx 2665432
ebx 2665332
esp 0x28ab50
ebp 0x28ac08
esi 0
edi 2665456
eip 0x401848 <main+1684>
I don't know nearly enough assembly to make sense of any of this, but maybe someone answering this will benefit from it.
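On the first iteration, i = 5, so your code:
for( ; i > 0; i-- ) // i = 5, and 5 > 0 is true
{
buf[i] = buf[i-1]; // buf[5] = buf[5 - 1]
}
executes buf[5] = buf[4];. Because buf has only 5 elements, the highest valid index is 4, so the bug in your code is an out-of-bounds write: the left-hand side buf[5] lies past the end of the array. That write is undefined behavior; most likely it lands on the stack slot that holds num_insts, which would explain why num_insts becomes 0 (buf[4] is 0 at that point).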
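A minimal sketch of a corrected shift, assuming the intent is to move every element up one slot and store the new value at index 0. The question does not show inst_t, so it is assumed here to be a plain int, and shift_in and BUF_LEN are hypothetical names. The loop starts at the last valid index, so no write leaves the array:

```c
#include <stddef.h>

typedef int inst_t;   /* assumption: the real inst_t is not shown in the question */

#define BUF_LEN 5     /* hypothetical name for the array size */

/* Shift buf up by one slot (the last element is dropped) and store next
   at index 0. The highest index written is BUF_LEN - 1. */
void shift_in(inst_t buf[BUF_LEN], inst_t next)
{
    for (size_t i = BUF_LEN - 1; i > 0; i--)
        buf[i] = buf[i - 1];
    buf[0] = next;
}
```

Calling shift_in(buf, next) in place of the original loop keeps neighbouring variables such as num_insts untouched.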
I am interested in trying F# in a high-performance application. I do not want to have a large array's bounds checked during iteration and the lack of break/return statements is concerning.
This is a contrived example that will break upon finding a value, but can someone tell me if bounds checking is also elided?
let innerExists (item: Char) (items: Char array): bool =
let mutable state = false
let mutable i = 0
while not state && i < items.Length do
state <- item = items.[i]
i <- i + 1
state
let exists (input: Char array)(illegalChars: Char array): bool =
let mutable state = false
let mutable i = 0
while not state && i < input.Length do
state <- innerExists input.[i] illegalChars
i <- i + 1
state
exists [|'A'..'z'|] [|'.';',';';'|]
Here is the relevant disassembly:
while not state && i < input.Length do
000007FE6EB4237A cmp dword ptr [rbp-14h],0
000007FE6EB4237E jne 000007FE6EB42383
000007FE6EB42380 nop
000007FE6EB42381 jmp 000007FE6EB42386
000007FE6EB42383 nop
000007FE6EB42384 jmp 000007FE6EB423A9
000007FE6EB42386 nop
000007FE6EB42387 mov r8d,dword ptr [rbp-18h]
000007FE6EB4238B mov rdx,qword ptr [rbp+18h]
000007FE6EB4238F cmp r8d,dword ptr [rdx+8]
000007FE6EB42393 setl r8b
000007FE6EB42397 movzx r8d,r8b
000007FE6EB4239B mov dword ptr [rbp-24h],r8d
000007FE6EB4239F mov r8d,dword ptr [rbp-24h]
000007FE6EB423A3 mov dword ptr [rbp-1Ch],r8d
000007FE6EB423A7 jmp 000007FE6EB423B1
000007FE6EB423A9 nop
000007FE6EB423AA xor r8d,r8d
000007FE6EB423AD mov dword ptr [rbp-1Ch],r8d
000007FE6EB423B1 cmp dword ptr [rbp-1Ch],0
000007FE6EB423B5 je 000007FE6EB42409
state <- innerExists input.[i] illegalChars
000007FE6EB423B7 mov r8d,dword ptr [rbp-18h]
000007FE6EB423BB mov rdx,qword ptr [rbp+18h]
000007FE6EB423BF cmp r8,qword ptr [rdx+8]
000007FE6EB423C3 jb 000007FE6EB423CA
000007FE6EB423C5 call 000007FECD796850
000007FE6EB423CA lea rdx,[rdx+r8*2+10h]
000007FE6EB423CF movzx r8d,word ptr [rdx]
000007FE6EB423D3 mov rdx,qword ptr [rbp+10h]
000007FE6EB423D7 mov rdx,qword ptr [rdx+8]
000007FE6EB423DB mov r9,qword ptr [rbp+20h]
000007FE6EB423DF mov rcx,7FE6EEE0640h
000007FE6EB423E9 call 000007FE6EB41E40
000007FE6EB423EE mov dword ptr [rbp-20h],eax
000007FE6EB423F1 mov eax,dword ptr [rbp-20h]
000007FE6EB423F4 movzx eax,al
000007FE6EB423F7 mov dword ptr [rbp-14h],eax
i <- i + 1
000007FE6EB423FA mov eax,dword ptr [rbp-18h]
Others pointed out that existing FSharp.Core functions achieve the same result, but I think the OP is asking whether array bounds checks are elided in loops like these (since the index is already checked in the loop condition).
For simple code like the above, the jitter should be able to elide the checks. Inspecting the assembly code is the right way to verify this, but it is important not to run with the VS debugger attached, because the jitter doesn't optimize the code then. The reason is that optimized code can make it impossible for the debugger to show correct values.
First let's look at exists optimized x64:
; not state?
00007ff9`1cd37551 85c0 test eax,eax
; if state is true then exit the loop
00007ff9`1cd37553 7521 jne 00007ff9`1cd37576
; i < input.Length?
00007ff9`1cd37555 395e08 cmp dword ptr [rsi+8],ebx
; Seems overly complex but perhaps this is as good as it gets?
00007ff9`1cd37558 0f9fc1 setg cl
00007ff9`1cd3755b 0fb6c9 movzx ecx,cl
00007ff9`1cd3755e 85c9 test ecx,ecx
; if we have reached end of the array then exit
00007ff9`1cd37560 7414 je 00007ff9`1cd37576
; mov i in ebx to rcx, unnecessary but moves like these are very cheap
00007ff9`1cd37562 4863cb movsxd rcx,ebx
; input.[i] (note we don't check the boundary again)
00007ff9`1cd37565 0fb74c4e10 movzx ecx,word ptr [rsi+rcx*2+10h]
; move illegalChars pointer to rdx
00007ff9`1cd3756a 488bd7 mov rdx,rdi
; call innerExists
00007ff9`1cd3756d e8ee9affff call 00007ff9`1cd31060
; i <- i + 1
00007ff9`1cd37572 ffc3 inc ebx
; Jump top of loop
00007ff9`1cd37574 ebdb jmp 00007ff9`1cd37551
; We are done!
00007ff9`1cd37576
So the code looks a bit too complex for what it should need to be but it seems it only checks the array condition once.
Now let's look at innerExists optimized x64:
; let mutable state = false
00007ff9`1cd375a0 33c0 xor eax,eax
; let mutable i = 0
00007ff9`1cd375a2 4533c0 xor r8d,r8d
; not state?
00007ff9`1cd375a5 85c0 test eax,eax
; if state is true then exit the loop
00007ff9`1cd375a7 752b jne 00007ff9`1cd375d4
; i < items.Length
00007ff9`1cd375a9 44394208 cmp dword ptr [rdx+8],r8d
; Seems overly complex but perhaps this is as good as it gets?
00007ff9`1cd375ad 410f9fc1 setg r9b
00007ff9`1cd375b1 450fb6c9 movzx r9d,r9b
00007ff9`1cd375b5 4585c9 test r9d,r9d
; if we have reached end of the array then exit
00007ff9`1cd375b8 741a je 00007ff9`1cd375d4
; mov i in r8d to rax, unnecessary but moves like these are very cheap
00007ff9`1cd375ba 4963c0 movsxd rax,r8d
; items.[i] (note we don't check the boundary again)
00007ff9`1cd375bd 0fb7444210 movzx eax,word ptr [rdx+rax*2+10h]
; mov item in cx to r9d, unnecessary but moves like these are very cheap
00007ff9`1cd375c2 440fb7c9 movzx r9d,cx
; item = items.[i]?
00007ff9`1cd375c6 413bc1 cmp eax,r9d
00007ff9`1cd375c9 0f94c0 sete al
; state <- ?
00007ff9`1cd375cc 0fb6c0 movzx eax,al
; i <- i + 1
00007ff9`1cd375cf 41ffc0 inc r8d
; Jump top of loop
00007ff9`1cd375d2 ebd1 jmp 00007ff9`1cd375a5
; We are done!
00007ff9`1cd375d4 c3 ret
So it looks overly complex for what it should be, but at least it looks like it only checks the array condition once.
So finally, it looks like the jitter eliminates the array bounds checks because it can prove the index has already been checked successfully in the loop condition, which I believe is what the OP was wondering about.
The x64 code doesn't look as clean as it could, but in my experiments x64 code that is cleaned up doesn't perform that much better. I suspect the CPU vendors optimize the CPU for the crappy code jitters produce.
An interesting exercise would be to code up an equivalent program in C++ and run it through https://godbolt.org/, choose x86-64 gcc (trunk) (gcc seems to do best right now) and specify the options -O3 -march=native and see the resulting x64 code.
Update
The code rewritten for https://godbolt.org/ so that we can see the assembly code generated by a C++ compiler:
template<int N>
bool innerExists(char item, char const (&items)[N]) {
for (auto i = 0; i < N; ++i) {
if (item == items[i]) return true;
}
return false;
}
template<int N1, int N2>
bool exists(char const (&input)[N1], char const (&illegalCharacters)[N2]) {
for (auto i = 0; i < N1; ++i) {
if (innerExists(input[i], illegalCharacters)) return true;
}
return false;
}
char const separators[] = { '.', ',', ';' };
char const str[58] = { };
bool test() {
return exists(str, separators);
}
With x86-64 gcc (trunk) and the options -O3 -march=native, the following code is generated:
; Load the string to test into edx
mov edx, OFFSET FLAT:str+1
.L2:
; Have we reached the end?
cmp rdx, OFFSET FLAT:str+58
; If yes, then jump to the end
je .L7
; Load a character
movzx ecx, BYTE PTR [rdx]
; The comparison against the 3 separators is encoded directly in the
; instructions because the compiler detected the array is always the same
mov eax, ecx
and eax, -3
cmp al, 44
sete al
cmp cl, 59
sete cl
; increase outer i
inc rdx
; Did we find a match?
or al, cl
; If no then loop to .L2
je .L2
; We are done!
ret
.L7:
; No match found, clear result
xor eax, eax
; We are done!
ret
Looks pretty good, but what I am missing in the code above is the use of AVX to test multiple characters at once.
Bounds checks are eliminated by the JIT compiler, so it works the same for F# as for C#. You can expect elimination for code as in your example, as well as for
for i = 0 to data.Length - 1 do
...
and also for tail-recursive functions, which compile down to loops.
The built in Array.contains and Array.exists (source code) are written so that the JIT compiler can eliminate bounds checks.
What's wrong with the Array.contains and Array.exists functions?
let exists input illegalChars =
input |> Array.exists (fun c -> illegalChars |> Array.contains c)
I'd like to show you my C program and the assembler code that goes with it. I also have some questions.
Here is the C code:
#include <stdio.h>
void podaj_znak(int tab[], int n);
int main()
{
int tab[7] = {4, 5, 6, 2, -80, 0, 56};
printf("Przed: ");
for (int i = 0; i < 7; i++)
printf("%d ", tab[i]);
printf("\n");
podaj_znak(tab, 7);
printf("Po: %d %d %d %d %d %d %d", tab[0], tab[1], tab[2], tab[3], tab[4], tab[5], tab[6]);
printf("\n");
return 0;
}
And here is the asm:
.686
.model flat
public _podaj_znak
.code
_podaj_znak PROC
push ebp
mov ebp, esp
mov edx, [ebp+8]
mov ecx, [ebp+12]
ptl:
mov eax, [edx]
cmp eax, 0
jl minus
ja plus
mov ebx, 0
mov [edx], ebx
jmp dalej
minus: mov ebx, -1
mov [edx], ebx
jmp dalej
plus: mov ebx, 1
mov [edx], ebx
jmp dalej
dalej: add edx, 4
sub ecx, 1
jnz ptl
pop ebp
ret
_podaj_znak ENDP
END
My question is: how can I simplify/condense the code?
Edit: here is what the program does and what I would like it to do. It is just practice for me, to get used to assembler. The program takes numbers anywhere from -inf to inf: when a number equals 0, it stays as it is; when it is less than 0, it is replaced by -1; and when it is greater than 0, it is replaced by 1. I would like to optimize the assembler code somehow, but I don't know whether it is even possible to condense it.
Not really a good fit for this forum, but still:
For the C code, I'd create a PrintTab function that accepts tab and count and prints the table. Then invoke it both before and after the podaj_znak call.
For the asm code:
PLEASE add comments. I know this is probably just a class project, but still, get in the habit.
Why move [edx] to eax instead of just cmp dword ptr [edx], 0?
If perf matters, perhaps skip prolog/epilog and use a 'fastcall' calling convention.
Why repeat "mov [edx], ebx" for each case? Move it down to dalej.
As a 'trick' you might try checking for -1, but then handle the other 2 cases with setnz.
nasm syntax, may need subtle fixing for other asm, my solution:
; converts values in tab into [-1, 0, 1] as sgn()
; arguments: two on stack(int tab[], int n)
; modified registers: esi, edi, eax, ebx
; "no branch" version (except loop itself)
_podaj_znak:
mov esi,[esp+4] ; tab ptr
mov eax,[esp+8] ; count
xor ebx,ebx
lea edi,[esi+eax*4] ; tab.end() ptr
sgn_loop:
lodsd ; eax = [ds:esi], esi += 4
; change eax to [-1, 0, 1] by sgn(eax)
test eax,eax
setnz bl
sar eax,31
or eax,ebx
; overwrite original value with sgn() result
cmp esi,edi ; test if end of tab was reached
mov [esi-4],eax
jb sgn_loop
ret
And then, out of curiosity, I searched the Internet and found a 3-instruction version (mine is 4); only the loop part is different:
...
; modifies also edx in this variant
sgn_loop:
lodsd ; eax = [ds:esi], esi += 4
; set edx to [-1, 0, 1] by sgn(eax)
cdq
cmp edx,eax
adc edx,ebx
; overwrite original value with sgn() result
cmp esi,edi
mov [esi-4],edx
jb sgn_loop
ret
Both variants are branch-less apart from the loop itself, so they should perform better than the branching variants, at least when the signs in the array are hard to predict (but I'm not going to profile it).
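For comparison, the same sgn() idea can be sketched in C. This is only an illustration: sgn and podaj_znak_c are hypothetical names, and whether a compiler turns the expression into branch-free machine code depends on the compiler and its flags.

```c
/* Returns -1, 0 or 1. In C each comparison evaluates to 0 or 1, so the
   subtraction gives the sign without any if/else in the source. */
int sgn(int x)
{
    return (x > 0) - (x < 0);
}

/* C counterpart of the assembly routine: replace every element of tab
   by its sign. */
void podaj_znak_c(int tab[], int n)
{
    for (int i = 0; i < n; i++)
        tab[i] = sgn(tab[i]);
}
```

Compiling this with optimizations enabled and reading the generated assembly is one way to see how close a compiler gets to the hand-written loops.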
The assembly can be optimized a little by executing mov [edx], ebx only once, as follows:
ptl:
mov eax, [edx]
cmp eax, 0
jl minus
ja plus
mov ebx, 0 ; only set to 0
jmp dalej
minus: mov ebx, -1 ; only set to -1
jmp dalej
plus: mov ebx, 1 ; only set to 1
jmp dalej
dalej: mov [edx], ebx ; update the array[edx]
add edx, 4
sub ecx, 1
jnz ptl
I am implementing a function for a bubblesort algorithm in assembly language (Linux, 64-bit, yasm). The function is called from within a C file where the array and the array size are passed through to assembly via rdi and rsi respectively.
xor rax, rax
xor rbx, rbx
xor r14, r14 ; r14 : int j = 0
xor r15, r15 ; r15 : boolean swapped
inc r15 ; swapped = true (=> swapped = 1)
while:
cmp r15, 1 ; while (swapped) (=> check if swapped == 1)
jne end_while
dec r15 ; swapped = false (=> swapped = 0)
inc r14 ; j++
mov rdx, rsi ; rdx = size
sub rdx, r14 ; size - j
xor rcx, rcx ; int i = 0
for:
cmp rcx, rdx ; i < size - j
je end_for
mov rax, [rdi+rcx*4+4] ; rax = rdi+rcx*4+4 => arr[i+1]
mov rbx, [rdi+rcx*4] ; rbx = rdi+rcx*4 => temp = arr[i]
cmp rbx, rax ; if(arr[i] > arr[i+1])
jng done_if
mov [rdi+rcx*4], rax ; arr[i] = arr[i+1]
mov [rdi+rcx*4+4], rbx ; arr[i+1] = temp
inc r15 ; swapped = true (=> swapped = 1)
done_if:
inc rcx ; ++i
jmp for
end_for:
end_while:
ret
The array sorts integers only. I coded the bubblesort in Java and tested it there - it works fine. However, when I pass the array {9,8,7,6,5,4,3,2,1,0} via the C file the output is {8,8,8,8,8,8,8,8,8,9}. I debugged with gdb but still can't see where the issue is. The for-loop construction works fine (rcx and rdx function correctly). It seems that there might be an issue with the way the array elements are accessed.
Any advice would be appreciated.
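Your problem is that you are using quadwords (64-bit integers) everywhere but your array is full of doublewords (32-bit integers). In particular, mov rax, [rdi+rcx*4+4] loads 8 bytes, that is, two adjacent elements at once. It should be changed to mov eax, [rdi+rcx*4+4], and the other load, the comparison and the two stores should similarly use the 32-bit registers eax and ebx. In this NASM-style syntax the operand size is determined by the register name.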
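For reference, here is a C sketch of the loop structure the assembly appears to implement, reconstructed from its comments (bubble_sort is a hypothetical name). The elements are declared as int32_t to make explicit that each one is 4 bytes wide, which is why the loads and stores need 32-bit registers. The sketch assumes the outer loop returns to the while test after every pass.

```c
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

/* Bubble sort over 32-bit integers, ascending order. */
void bubble_sort(int32_t *arr, size_t size)
{
    bool swapped = true;
    size_t j = 0;

    while (swapped) {
        swapped = false;
        j++;
        /* i + j < size is i < size - j, written so that it cannot
           underflow when size is 0 */
        for (size_t i = 0; i + j < size; i++) {
            if (arr[i] > arr[i + 1]) {
                int32_t temp = arr[i];
                arr[i] = arr[i + 1];
                arr[i + 1] = temp;
                swapped = true;
            }
        }
    }
}
```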
This is the code:
section .data
v dw 4, 6, 8, 12
len equ 4
section .text
global main
main:
mov eax, 0 ;this is i
mov ebx, 0 ;this is j
cycle:
cmp eax, 2 ;i < len/2
jge exit
mov ebx, 0
jmp inner_cycle
continue:
inc eax
jmp cycle
inner_cycle:
cmp ebx, 2
jge continue
mov di, [v + eax * 2 * 2 + ebx * 2]
inc ebx
jmp inner_cycle
exit:
push dword 0
mov eax, 0
sub esp, 4
int 0x80
I'm using an array and scanning it as a matrix, this is the C translation of the above code
int m[4] = {1,2,3,4};
for(i = 0; i < 2; i++){
for(j = 0; j < 2; j++){
printf("%d\n", m[i*2 + j]);
}
}
When I try to compile the assembly code I get this error:
DoubleForMatrix.asm:20: error: beroset-p-592-invalid effective address
which refers to this line
mov di, [v + eax * 2 * 2 + ebx * 2]
Can someone explain to me what is wrong with this line? I think it's because of the register sizes, so I tried with
mov edi, [v + eax * 2 * 2 + ebx * 2]
but I got the same error.
This is assembly for Mac OS X; to make it work on another OS you have to change the exit syscall.
You can't use arbitrary expressions in assembler. Only a few addressing modes are allowed.
Basically, the most complex form is base register + index register*scale + constant displacement, with a scale of 1, 2, 4 or 8.
Of course constants (like 2*2) will probably be folded to 4, so that counts as a single scale of 4 (not as two multiplications).
Your example tries to scale two different registers at once (eax*4 and ebx*2).
Solution: insert an extra LEA instruction to calculate v+ebx*2 and use the result in the mov.
lea regx , [v+ebx*2]
mov di, [eax*2*2+regx]
where regx is a free register.
The SIB (Scale-Index-Base) addressing mode takes only one scale factor (1, 2, 4 or 8), applied to exactly one register.
The proposed solution is to premultiply eax by 4 (the comparison also has to be modified). Then inc eax can be replaced with add eax,4 and the illegal instruction with mov di,[v+eax+ebx*2].
A higher-level optimization would be to simply write for (i=0;i<4;i++) printf("%d\n",m[i]);
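The address arithmetic behind mov di,[v+eax+ebx*2] can be sketched in C as follows. The function name load_word is hypothetical; the sketch assumes a row-major matrix with 2 columns of 16-bit elements, as in the question. The row index is premultiplied into a byte offset, so only the column index still needs a scale factor.

```c
#include <stddef.h>
#include <stdint.h>
#include <string.h>

/* Read element (i, j) by computing the byte offset explicitly:
   base + i*4 + j*2, mirroring v + eax + ebx*2 with eax already
   holding i*4. */
uint16_t load_word(const uint16_t *v, size_t i, size_t j)
{
    const unsigned char *base = (const unsigned char *)v;
    size_t row_offset = i * 4;   /* 2 columns * 2 bytes per element */
    size_t col_offset = j * 2;   /* the one scaled index: ebx * 2 */
    uint16_t value;

    memcpy(&value, base + row_offset + col_offset, sizeof value);
    return value;
}
```

For every valid (i, j) the result equals v[i * 2 + j], which is the expression used in the C translation in the question.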
I need to translate the commented part of this method into assembler. I have a rough idea, but I can't get it to work.
Can anyone help me, please? It is for an Intel x32 architecture:
int
secuencia ( int n, EXPRESION * * o )
{
int a, i;
//--- Translate from here ...
for ( i = 0; i < n; i++ ){
a = evaluarExpresion( *o );
o++;
}
return a ;
//--- ... until here.
}
Translated code must be within __asm as:
__asm {
translated code
}
Thank you,
FINAL UPDATE:
This is the final version, working and commented, thanks to all for your help :)
int
secuencia ( int n, EXPRESION * * o )
{
int a = 0, i;
__asm
{
mov dword ptr [i],0 ; int i = 0
jmp salto1
ciclo1:
mov eax,dword ptr [i]
add eax,1 ; increment the value of i by 1
mov dword ptr [i],eax ; i++
salto1:
mov eax,dword ptr [i]
cmp eax,dword ptr [n] ; Compare i and n
jge final ; If i >= n, go to 'final'
mov eax,dword ptr [o]
mov ecx,dword ptr [eax] ; Fetch * o (its value)
push ecx ; Push * o (its value) onto the stack
call evaluarExpresion ; call evaluarExpresion( * o )
add esp,4 ; Restore the stack pointer (4 bytes, the size of the pushed * o pointer)
mov dword ptr [a],eax ; Save the result of evaluarExpresion as the value of a
mov eax,dword ptr [o] ; load the pointer o
add eax,4 ; advance the pointer by 4 bytes (to the element after the one o currently points to)
mov dword ptr [o],eax ; o++
jmp ciclo1 ; repeat
final: ; end of the for loop
mov eax,dword ptr [a] ; return a - the return value is left in the eax register (by convention this is where the result must be stored)
}
}
Essentially in assembly languages, strictly speaking there isn't a notion of a loop the same way there would be in a higher level language. It's all implemented with jumps (eg. as a "goto"...)
That said, x86 has some instructions with the assumption that you'll be writing "loops", implicitly using the register ECX as a loop counter.
Some examples:
mov ecx, 5 ; ecx = 5
.label:
; Loop body code goes here
; ECX will start out as 5, then 4, then 3, then 2, then 1...
loop .label ; if (--ecx) goto .label;
Or:
jecxz .loop_end ; if (!ecx) goto .loop_end;
.loop_start:
; Loop body goes here
loop .loop_start ; if (--ecx) goto .loop_start;
.loop_end:
And, if you don't like this loop instruction thing counting backwards... You can write something like:
xor ecx, ecx ; ecx = 0
.loop_start:
cmp ecx, 5 ; do (ecx-5) discarding result, then set FLAGS
jz .loop_end ; if (ecx-5) was zero (i.e. ecx == 5), jump to .loop_end
; Loop body goes here.
inc ecx ; ecx++
jmp .loop_start
.loop_end:
This would be closer to the typical for (int i=0; i<5; ++i) { }
Note that
for (init; cond; advance) {
...
}
is essentially syntactic sugar for
init;
while(cond) {
...
advance;
}
which should be easy enough to translate into assembly language if you've been paying any attention in class.
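The same desugaring can be taken one step further, down to the compare-and-jump shape that assembly uses. The sketch below is illustrative only: the three functions are hypothetical and compute the same sum, first with for, then with while, then with explicit labels and goto.

```c
/* for loop */
int sum_for(int n)
{
    int total = 0;
    for (int i = 0; i < n; i++)
        total += i;
    return total;
}

/* the same loop with the for sugar removed */
int sum_while(int n)
{
    int total = 0;
    int i = 0;              /* init */
    while (i < n) {         /* cond */
        total += i;         /* body */
        i++;                /* advance */
    }
    return total;
}

/* the same loop as labels and jumps */
int sum_goto(int n)
{
    int total = 0;
    int i = 0;              /* init */
loop_start:
    if (!(i < n))           /* cond: compare, then jump out if it fails */
        goto loop_end;
    total += i;             /* body */
    i++;                    /* advance */
    goto loop_start;        /* unconditional jump back to the test */
loop_end:
    return total;
}
```

Each goto corresponds to a jump instruction and the if to a cmp followed by a conditional jump, which is the pattern used in the hand-written loop in the question.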
Use gcc to generate the assembly code
gcc -S -c sample.c
man gcc is your friend
For that you would probably use the loop instruction, which decrements ecx (often called the extended counter) on each iteration and exits the loop when ecx reaches zero. But why use inline asm for it anyway? I'm pretty sure something as simple as that will be optimized correctly by the compiler...
(We say x86 architecture, because it's based on the 80x86 family of processors, but it's an "ok" mistake =p)