I have a problem with asm code that works when called from C, but does not work when used from other asm code with what I believe are the proper parameters.
;; array - RDI, x- RSI, y- RDX
getValue:
mov r13, rsi
sal r13, $3
mov r14, rdx
sal r14, $2
mov r15, [rdi+r13]
mov rax, [r15+r14]
ret
Technically I want to keep the rdi, rsi and rdx registers untouched and thus I use other ones.
I am using an x64 machine and thus my pointers have 8 bytes. Technically speaking this code is supposed to do:
int getValue(int** array, int x, int y) {
return array[x][y];
}
It somehow works inside my C code, but does not when done in asm in this way:
mov rdi, [rdi] ;; get first pointer - first row
mov r9, $4 ;; we want second element from the row
mov rax, [rdi+r9] ;; get the element (4 bytes vs 8 bytes???)
mov rdi, FMT ;; prepare printf format "%d", 10, 0
mov rsi, rax ;; we want to print the element we just fetched
mov eax, $0 ;; say we have no non-integer argument
call printf ;; always gives 0, no matter what's in the matrix
Can someone see into this and help me? Thanks in advance.
The sal r14, $2 implies the elements are dwords, so the last line before the ret shouldn't load a qword. Besides, x86 has nice scaling addressing modes, so you can do this:
mov rax, [rdi + rsi * 8] ; load pointer to column
mov eax, [rax + rdx * 4] ; note this loads a dword
ret
That implies that you have an array of pointers to columns, which is unusual. You can do that, but was it intended?
This is a standard matrix of integers.
int** array;
sizeof(int*) == 8
sizeof(int) == 4
How I see it is that when I have that array at first, I have a pointer to a space of memory without "blanks" that holds all pointers one by one (index-by-index), so I say "let's go to the element rsi-th of the array" and that's why I shift by rsi-th * 8 bytes. So now I get the same situation, but the pointer should point to a space of integers, so 4-byte items. That's why I shift by 4 bytes there.
Is my thinking wrong?
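The reasoning above can be written out in C with explicit byte offsets. This is only a sketch, assuming a target where pointers and ints have the sizes quoted in the question; get_value_bytes is a hypothetical name that does not appear in the original code.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

/* Hypothetical helper: same result as array[x][y], with the byte offsets
   spelled out the way the two scaled loads compute them. */
int get_value_bytes(int **array, int x, int y)
{
    const unsigned char *base = (const unsigned char *)array;
    int *row;
    int value;

    assert(array != NULL);

    /* Step 1: row pointers are sizeof(int *) bytes apart (8 on x86-64),
       so this is a pointer-sized load from array + x * 8. */
    memcpy(&row, base + (size_t)x * sizeof(int *), sizeof row);

    /* Step 2: ints are sizeof(int) bytes apart (4 on x86-64), so this is
       a 4-byte load from row + y * 4, not an 8-byte one. */
    memcpy(&value, (const unsigned char *)row + (size_t)y * sizeof(int),
           sizeof value);

    return value;
}
```

So the two shifts in the question are right; what differs between the two steps is the width of the load, pointer-sized for the first and int-sized for the second.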
I have the following disassembly of a main function in which a user input is stored using scanf function (at address 0x0000089c). Due to the comparison that is made, I suppose that the user input is stored into the rsp register but I cannot figure out why, as rsp doesn't seem to be pushed on the stack (at least, not near the call to the scanf function).
Here is the disassembly:
0x00000850 sub rsp, 0x18
0x00000854 mov rax, qword fs:[0x28]
0x0000085d mov qword [canary], rax
0x00000862 xor eax, eax
0x00000864 call fcn.00000a3c
0x00000869 lea rsi, str.Insert_input:
0x00000870 mov edi, 1
0x00000875 xor eax, eax
0x00000877 mov dword [rsp], 0
0x0000087e mov dword [var_4h], 0
0x00000886 call sym.imp.__printf_chk
0x0000088b lea rdx, [var_4h]
0x00000890 lea rdi, str.u__u ; "%u %u" ;const char *format
0x00000897 xor eax, eax
0x00000899 mov rsi, rsp
0x0000089c call sym.imp.__isoc99_scanf ; int scanf(const char *format)
0x000008a1 mov eax, dword [rsp]
0x000008a4 cmp eax, 0x1336
0x000008a9 jg 0x867
On x86_64, parameters are passed in registers, so your call to scanf has 3 parameters stored in 3 registers:
rdi pointer to the string "%u %u", the format to parse (two unsigned integers)
rsi should be an unsigned *, a pointer to where to put the first parsed integer
rdx pointer to where to put the second parsed integer.
If you look just before the call, rsi is set to rsp (the stack pointer) while rdx is set to point at var_4h. The listing does not show where var_4h lives; a name of that form usually denotes another local stack slot (here probably rsp+4) rather than a global.
The stack is used to hold local variables, and in this case rsp points at a block of 0x18 (24) "free" bytes, allocated by the first instruction in your block, which is enough space for 6 four-byte integers. The one at offset 0 from rsp is what rsi points to, and it is the value read by the mov instruction immediately after the call.
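In C terms, the call sequence above probably corresponds to something like the following sketch. The names check_input, first and second are made up for illustration, and sscanf on a string stands in for scanf on stdin so the snippet is easy to try; the point is only that both destinations are addresses of variables, passed in rsi and rdx.

```c
#include <assert.h>
#include <stddef.h>
#include <stdio.h>

/* Hypothetical reconstruction: returns 1 if the first parsed number is
   at most 0x1336, 0 if it is greater, -1 if parsing failed. */
int check_input(const char *line)
{
    unsigned first = 0;   /* the slot at [rsp]; its address goes in rsi */
    unsigned second = 0;  /* a second variable; its address goes in rdx */

    assert(line != NULL);

    /* rdi = format, rsi = &first, rdx = &second */
    if (sscanf(line, "%u %u", &first, &second) != 2)
        return -1;

    /* mov eax, dword [rsp] ; cmp eax, 0x1336 ; jg ... (a signed compare) */
    return (int)first > 0x1336 ? 0 : 1;
}
```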
I'm in the process of learning Assembly language using NASM, and have run into a programming problem that I can't seem to figure out. The goal of the program is to solve this equation:
Picture of Equation
For those unable to see the photo, the equation says that for two arrays of length n, array a and array b, find the sum, for i = 0 to n-1, of ((a[i] + 3) - (b[i] - 4))
I'm only supposed to use three general registers, and I've figured out a code sample I think could possibly work, but I keep running into comma and operand errors with lines 16 and 19. I understand that in order to iterate through the array you need to move a pointer to each index, but since both arrays are of different values (array 1 is dw and array 2 is db) I am unsure how to account for that. I'm still very new to Assembly, and any help or pointers would be appreciated.
Here is my current code:
segment .data
a dw 12, 14, 16 ; array of three values
b db 2, 4, 5 ; array of three values
n dw 3 ; length of both arrays
result dq 0 ; memory to result
segment .text
global main
main:
mov rax, 0
mov rbx, 0
mov rdx, 0
loop_start:
cmp rax, [n]
jge loop_end
add rbx, a[rax*4] ; adding element of a at current index to rbx
add rbx, 3 ; adding 3 to current index value of array a in rbx
add rdx, BYTE b[rax]
sub rdx, 4
sub rbx, [rdx]
add [result], rbx
xor rbx, rbx
xor rdx, rdx
add rax, 1
loop_end:
ret
You are using 16-bit and 8-bit data, but 64-bit registers. Generally speaking, the processor requires the same data size throughout the operands of any single instruction.
cmp rax,[n] has mismatched data sizes, which is not allowed: rax is a 64-bit register, and [n] is a 16-bit data item. So, we can change this to cmp ax,[n], and now everything is 16-bit.
add rbx,a[rax*4] is also mixing different size operands (not allowed). rbx is 64 bits and a[] is 16 bits. You can change the register to bx and this will be allowed. But also note that *4 is too much; it should be *2, since dw is 16-bit (2-byte) data, not 32-bit (4-byte). Since you're clearing rbx, you don't need an add here; you can simply mov.
add rdx, BYTE b[rax] is also mixing different sizes. rdx is 64 bits wide whereas b[] is 8 bits wide. Use dl instead of rdx. There is nothing to add to here, so you should use a mov instead of add. Now that there's a value in dl, and you previously cleared rdx, you can switch to using dx (from dl); this will have the 16-bit value of b[i].
sub rbx, [rdx] has an erroneous dereference. Here you just want to sub bx,dx.
You are not using the label loop_start, so there is no loop. (Add a backward branch at the end of the loop.)
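For reference, here is what the loop is meant to compute, written in C so the operand widths are explicit. It is a sketch only: sum_mixed is a hypothetical name, and treating the elements as unsigned is an assumption, since the question does not say whether they are signed.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Sum of (a[i] + 3) - (b[i] - 4), with 16-bit a[] and 8-bit b[] elements. */
int64_t sum_mixed(const uint16_t *a, const uint8_t *b, size_t n)
{
    int64_t result = 0;

    assert(a != NULL && b != NULL);

    for (size_t i = 0; i < n; i++) {
        int64_t ai = a[i];  /* widen 16 -> 64 bits before doing arithmetic */
        int64_t bi = b[i];  /* widen 8 -> 64 bits before doing arithmetic  */
        result += (ai + 3) - (bi - 4);
    }
    return result;
}
```

Each element is widened to a common size before any arithmetic, which is the same requirement the assembler enforces per instruction.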
...but since both arrays are of different values (array 1 is dw and array 2 is db) I am unsure how to account for that
Erik Eidt's answer explains why you "keep running into comma and operand errors". Although you can revert to using the smaller registers (adding operand size prefixes), my answer takes a different approach.
The instruction set has the movzx (move with zero extension) and movsx (move with sign extension) instructions to deal with these varying sizes. See below how to use these.
I've applied a few changes too.
Don't miss an opportunity to simplify your calculation:
((a[i] + 3) - (b[i] - 4)) is equivalent to (a[i] - b[i] + 7)
None of these arrays is empty, so you can just put the loop condition below its body.
You can process the arrays starting at the end if it's convenient. The summation operation doesn't mind!
segment .data
a dw 12, 14, 16 ; array of three values
b db 2, 4, 5 ; array of three values
n dw 3 ; length of both arrays
result dq 0 ; memory to result
segment .text
global main
main:
movzx rcx, word [n]
loop_start:
movzx rax, word [a + rcx * 2 - 2]
movzx rbx, byte [b + rcx - 1]
sub rax, rbx ; a[i] - b[i] (lea can only add registers, so it cannot do this step)
add rax, 7 ; a[i] - b[i] + 7
add [result], rax
dec rcx
jnz loop_start
ret
Please notice that the additional negative offsets - 2 and - 1 exist to compensate for the fact that the loop control takes on the values {3, 2, 1} when {2, 1, 0} would have been perfect. This does not introduce an extra displacement component to the instruction since the mention of the a and b arrays is in fact already the displacement.
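The same countdown can be sketched in C to make the index arithmetic visible. sum_reverse is a hypothetical name, the elements are assumed to be unsigned, and n is assumed to be at least 1, because the loop condition sits below the body.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Sum of a[i] - b[i] + 7, walking the arrays from the end. */
int64_t sum_reverse(const uint16_t *a, const uint8_t *b, size_t n)
{
    int64_t result = 0;
    size_t i = n;  /* the counter takes the values n, n-1, ..., 1 */

    assert(a != NULL && b != NULL && n >= 1);

    do {
        /* The element index is i - 1, which in bytes is i*2 - 2 for the
           16-bit array and i - 1 for the 8-bit array. */
        result += (int64_t)a[i - 1] - (int64_t)b[i - 1] + 7;
    } while (--i != 0);

    return result;
}
```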
Although this is tagged x86-64, you can write the whole thing using 32-bit registers and not require the REX prefixes. Same result.
segment .data
a dw 12, 14, 16 ; array of three values
b db 2, 4, 5 ; array of three values
n dw 3 ; length of both arrays
result dq 0 ; memory to result
segment .text
global main
main:
movzx ecx, word [n]
loop_start:
movzx eax, word [a + ecx * 2 - 2]
movzx ebx, byte [b + ecx - 1]
sub eax, ebx ; a[i] - b[i] (lea can only add registers, so it cannot do this step)
add eax, 7 ; a[i] - b[i] + 7
add [result], eax
dec ecx
jnz loop_start
ret
I'm working on a C-based program that uses assembly for image flipping. The pseudocode that is supposed to work is this one (always using images of 240x320):
voltearHorizontal(imgO, imgD){
dirOrig = imgO;
dirDest = imgD;
dirOrig = dirOrig + 239*320; //bring the pointer to the first pixel of the last row
for(f=0; f<240; f++){
for(c=0; c<320; c++){
[dirDest]=[dirOrig];
dirOrig++;
dirDest++;
}
dirOrig=dirOrig+640;//move the pixel to the first one of the upper row
}
}
But when I implement this in assembly, the first rows of the result are not written, leaving that area black.
https://gyazo.com/7a76f147da96ae2bc27e109593ed6df8
This is the code I've written, which is supposed to work, and this is what really happens to the image:
https://gyazo.com/2e389248d9959a786e736eecd3bf1531
Why does this code not copy the upper rows of pixels from the source image to the second image? Which part of the code did I get wrong?
I have run out of tags that fit my problem. Thanks for any help in finding where I went wrong. Also, the horizontal flip (the one above is the vertical one) simply terminates the program unexpectedly:
https://gyazo.com/a7a18cf10ac3c06fc73a93d9e55be70c
Any special reason, why you write it as slow assembler?
Why don't you just keep it in fast C++? https://godbolt.org/g/2oIpzt
#include <cstring>
void voltearHorizontal(const unsigned char* imgO, unsigned char* imgD) {
imgO += 239*320; //bring the pointer to the first pixel of the last row
for(unsigned f=0; f<240; ++f) {
memcpy(imgD, imgO, 320);
imgD += 320;
imgO -= 320;
}
}
Will be compiled with gcc6.3 -O3 to:
voltearHorizontal(unsigned char const*, unsigned char*):
lea rax, [rdi+76480]
lea r8, [rdi-320]
mov rdx, rsi
.L2:
mov rcx, QWORD PTR [rax]
lea rdi, [rdx+8]
mov rsi, rax
sub rax, 320
and rdi, -8
mov QWORD PTR [rdx], rcx
mov rcx, QWORD PTR [rax+632]
mov QWORD PTR [rdx+312], rcx
mov rcx, rdx
add rdx, 320
sub rcx, rdi
sub rsi, rcx
add ecx, 320
shr ecx, 3
cmp rax, r8
rep movsq
jne .L2
rep ret
I.e. roughly 800% more efficient than your inline asm.
Anyway, in your question the problem is:
dirOrig=dirOrig+640;//move the pixel to the first one of the upper row
You need to do -= 640 to return two lines up.
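To make the fix concrete, here is a sketch of the pseudocode's pointer-walking loop in plain C with the corrected step. flip_rows, width and height are hypothetical names; the question fixes the sizes at 320 and 240.

```c
#include <assert.h>
#include <stddef.h>

/* Copy src to dst with the row order reversed, one pixel at a time. */
void flip_rows(const unsigned char *src, unsigned char *dst,
               size_t width, size_t height)
{
    assert(src != NULL && dst != NULL);
    if (width == 0 || height == 0)
        return;

    src += (height - 1) * width;      /* first pixel of the last row */
    for (size_t f = 0; f < height; f++) {
        for (size_t c = 0; c < width; c++)
            *dst++ = *src++;          /* src ends up one row further on */
        if (f + 1 < height)
            src -= 2 * width;         /* back two rows: -= 640 for width 320 */
    }
}
```

The inner loop already advances the source pointer by one row, so stepping back two rows lands on the start of the row above. The step is skipped after the final row so the pointer never moves before the start of the buffer.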
About the inline asm in those screenshots: please put it into the question as text. From a quick look, I would simply rewrite it in C++ and leave it to the compiler. You are doing many things in your asm that hurt performance, so I don't see any point in it; besides, inline asm is ugly, hard to maintain, and hard to write correctly.
I did even check the asm in the picture. You keep the line counter in eax, but you use al to copy the pixel, which destroys the line counter value.
Use a debugger next time.
BTW, your pictures are 320x240, not 240x320.
I'm trying to write a little program in assembler which takes three char arrays as input, calculates the average of each pair of elements in the first two arrays, and stores the result in the third array, like below.
%macro prologue 0
push rbp
mov rbp,rsp
push rbx
push r12
push r13
push r14
push r15
%endmacro
%macro epilogue 0
pop r15
pop r14
pop r13
pop r12
pop rbx
leave
ret
%endmacro
segment .data
offset db 1
segment .bss
a1 resq 1
a2 resq 1
avg resq 1
avgL resd 1
segment .text
global avgArray
avgArray:
prologue
mov [a1], rdi
mov [a2], rsi
mov [avg], rdx
mov [avgL], rcx
mov rsi, [a1]
mov r9, [a2]
mov rdi, [avg]
mov rcx, rsi
add rcx, [avgL] ; array length
xor rdx, rdx
xor rax, rax
xor rbx, rbx
avgArray_loop:
mov al, [rsi]
mov dl, [r9]
add ax, dx
shr ax, 1
mov [rdi], al
add rsi, [offset]
add r9, [offset]
add rdi, [offset]
cmp rsi, rcx
jb avgArray_loop
epilogue
When replacing [offset] with 1 it works perfectly fine. However, when using [offset] to step to the next array element, it seems that its value is not added to rsi, rdi and r9.
I already checked it using gdb. The address stored in rsi is still the same after executing add rsi, [offset].
Can someone tell me why using [offset] won't work but adding a simple 1 does?
By the way: Linux x86_64 machine
So I found the solution to that problem.
The addresses of avgL and offset were stored directly behind each other. When storing rcx to avgL, the 8-byte mov also overwrote the value of offset. Declaring avgL as a QWORD instead of a DWORD prevents mov from overwriting the offset data.
The new data and bss segments look like this:
segment .data
offset db 1
segment .bss
a1 resq 1
a2 resq 1
avg resq 1
avgL resq 1
Nice work on debugging your problem yourself. Since I already started to look at the code, I'll give you some efficiency / style critique as added comments:
%macro prologue 0
push rbp
mov rbp,rsp ; you can drop this and the LEAVE.
; Stack frames were useful before debuggers could keep track of things without them, and as a convenience
; so local variables were always at the same offset from your base pointer, even while you were pushing/popping stuff on the stack.
; With the SysV ABI, you can use the red zone for locals without even
; fiddling with RSP at all, if you don't push/pop or call anything.
push rbx
push r12
push r13
push r14
push r15
%endmacro
%macro epilogue 0
pop r15
pop r14
pop r13
pop r12
pop rbx
leave
ret
%endmacro
segment .data
offset db 1
segment .bss ; These should really be locals on the stack (or in regs!), not globals
a1 resq 1
a2 resq 1
avg resq 1
avgL resd 1
segment .text
; usually a comment with a C function prototype and description is a good idea for functions
global avgArray
avgArray:
prologue
mov [a1], rdi ; what is this silliness? you have 16 registers for a reason.
mov [a2], rsi ; shuffling the values you want into the regs you want them in
mov [avg], rdx ; is best done with reg-reg moves.
mov [avgL], rcx ; I like to just put a comment at the top of a block of code
; to document what goes in what reg.
mov rsi, [a1]
mov r9, [a2]
mov rdi, [avg]
mov rcx, rsi
add rcx, [avgL] ; This could be lea rcx, [rsi+rcx]
; (since avgL is in rcx anyway as a function arg).
xor rdx, rdx
xor rax, rax
xor rbx, rbx
avgArray_loop: ; you can use a local label here, starting with a .
; You don't need a diff name for each loop: the assembler will branch to the most recent instance of that label
mov al, [rsi] ; there's a data dependency on the old value of ax
mov dl, [r9] ; since the CPU doesn't "know" that shr ax, 1 will always leave ah zeroed in this algorithm
add ax, dx ; Avoid ALU ops on 16bit regs whenever possible. (8bit is fine, they have diff opcodes instead of a prefix)
; to avoid decode stalls on Intel
shr ax, 1 ; Better to use 32bit regs (movsx/movzx)
mov [rdi], al
add rsi, [offset] ; These are 64bit adds, so you're reading 7 bytes after the 1 you set with db.
add r9, [offset]
add rdi, [offset]
cmp rsi, rcx
jb avgArray_loop
epilogue
You have tons of registers free, why are you keeping the loop increment in memory? I hope it just ended up that way while debugging / trying stuff.
Also, 1-reg addressing modes are only more efficient when used as mem operands for ALU ops. Just increment a single counter and use base+index*scale addressing when you have a lot of pointers (unless you're unrolling the loop), esp. if you load them with mov.
Here's how I'd do it (with perf analysis for Intel SnB and later):
scalar
; no storage needed
segment .text
GLOBAL avgArray
avgArray:
; void avgArray (uint8_t *avg, const uint8_t *a1, const uint8_t *a2, size_t len)
; if you can choose your prototype, do it so args go where you want them anyway.
; prologue
; rdi = avg
; rsi = a1
; rdx = a2
; rcx = len
; mov [rsp-8], rcx ; if I wanted to spill len to memory
add rdi, rcx ; advance each pointer to one past the end of its array
add rsi, rcx
add rdx, rcx
neg rcx ; now [rdi+rcx] is the start of dest, and we can count rcx upwards towards zero.
; We could also have just counted down towards zero
; but HW memory prefetchers have more stream slots for forward patterns than reverse.
ALIGN 16
.loop:
; use movsx for signed char
movzx eax, byte [rsi+rcx] ; dependency-breaker (NASM needs the byte size here)
movzx r8d, byte [rdx+rcx] ; Using r8d to save push/pop of rbx
; on pre-Nehalem where insn decode can be a bottleneck even in tight loops
; using ebx or ebp would save a REX prefix (1 insn byte).
add eax, r8d
shr eax, 1
mov [rdi+rcx], al
inc rcx ; No cmp needed: this is the point of counting up towards zero
jl .loop ; inc/jl can Macro-fuse into one uop
; nothing to pop, we only used caller-saved regs.
ret
On Intel, the loop is 7 uops, (the store is 2 uops: store address and store data, and can't micro-fuse), so a CPU that can issue 4 uops per cycle will do it at 2 cycles per byte. movzx (to a 32 or 64bit reg) is 1 uop regardless, because there isn't a port 0/1/5 uop for it to micro-fuse with or not. (It's a read, not read-modify).
7 uops takes 2 chunks of up-to-4 uops, so the loop can issue in 2 cycles. There are no other bottlenecks that should prevent the execution units from keeping up with that, so it should run one per 2 cycles.
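The "count a negative index up towards zero" idiom used in the scalar loop can be sketched in C as follows. avg_array mirrors the prototype from the comment above, but the body is only an illustration of the indexing trick, not a claim about what a compiler would emit.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

void avg_array(uint8_t *avg, const uint8_t *a1, const uint8_t *a2, size_t len)
{
    assert(avg != NULL && a1 != NULL && a2 != NULL);

    /* Point all three pointers one past the end of their arrays... */
    avg += len;
    a1 += len;
    a2 += len;

    /* ...and let a single negative index run from -len up to 0, so the
       loop test is just "is the index still negative?". */
    for (ptrdiff_t i = -(ptrdiff_t)len; i < 0; i++) {
        unsigned sum = (unsigned)a1[i] + a2[i];  /* widen first: no 8-bit overflow */
        avg[i] = (uint8_t)(sum >> 1);
    }
}
```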
vector
There's a vector instruction to do exactly this operation: PAVGB is packed avg of unsigned bytes (with a 9-bit temporary to avoid overflow, same as your add/shr).
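One detail worth checking before swapping it in: PAVGB rounds up, computing (a + b + 1) >> 1 for each byte, whereas a plain add/shr truncates, so the two can differ by 1 when the sum is odd. A scalar model of both, with hypothetical function names:

```c
#include <stdint.h>

/* What add/shr computes: floor((a + b) / 2). */
uint8_t avg_truncating(uint8_t a, uint8_t b)
{
    return (uint8_t)(((unsigned)a + b) >> 1);
}

/* What PAVGB computes per byte: (a + b + 1) >> 1, i.e. rounded up. */
uint8_t avg_rounding(uint8_t a, uint8_t b)
{
    return (uint8_t)(((unsigned)a + b + 1) >> 1);
}
```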
; no storage needed
segment .text
GLOBAL avgArray
avgArray:
; void avgArray (uint8_t *avg, const uint8_t *a1, const uint8_t *a2, size_t len)
; rdi = avg
; rsi = a1
; rdx = a2
; rcx = len
; same setup
; TODO: scalar loop here until [rdx+rcx] is aligned.
ALIGN 16
.loop:
; use movsx for signed char
movdqu xmm0, [rsi+rcx] ; 1 uop
pavgb xmm0, [rdx+rcx] ; 2 uops (no micro-fusion)
movdqu [rdi+rcx], xmm0 ; 2 uops: no micro-fusion
add rcx, 16
jl .loop ; 1 macro-fused uop add/branch
; TODO: scalar cleanup.
ret
Getting the loop-exit condition right is tricky, since you need to end the vector loop if the next 16B goes off the end of the array. Prob. best to handle that by decrementing rcx by 15 or something before adding it to the pointers.
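One way to structure that exit condition, sketched in C: run the wide loop only while a full 16-byte chunk is still in bounds, then finish with a scalar tail. The inner loop here merely stands in for one vector load/average/store group, avg_array_chunked and CHUNK are hypothetical names, and the rounding follows PAVGB's (a + b + 1) >> 1 definition.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

enum { CHUNK = 16 };  /* bytes handled per vector iteration */

void avg_array_chunked(uint8_t *avg, const uint8_t *a1, const uint8_t *a2,
                       size_t len)
{
    size_t i = 0;

    assert(avg != NULL && a1 != NULL && a2 != NULL);

    /* Main loop: runs only while a whole 16-byte chunk is still in bounds. */
    for (; i + CHUNK <= len; i += CHUNK)
        for (size_t j = 0; j < CHUNK; j++)
            avg[i + j] = (uint8_t)(((unsigned)a1[i + j] + a2[i + j] + 1) >> 1);

    /* Scalar cleanup for the last len % 16 bytes. */
    for (; i < len; i++)
        avg[i] = (uint8_t)(((unsigned)a1[i] + a2[i] + 1) >> 1);
}
```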
So again, 6 uops / 2 cycles per iteration, but each iteration will do 16 bytes. It's ideal to unroll so your loop is a multiple of 4 uops, so you're not losing out on issue rate with a cycle of less-than-4 uops at the end of a loop. 2 loads / 1 store per cycle is our bottleneck here, since PAVGB has a throughput of 2 per cycle.
16B / cycle shouldn't be difficult on Haswell and later. With AVX2 using ymm registers, you'd get 32B / cycle. (SnB/IvB can only do two memory ops per cycle, at most one of them a store, unless you use 256b loads/stores). Anyway, at this point you've gained a massive 16x speedup from vectorizing, and usually that's good enough. I just enjoy tuning things for theoretical-max throughput by counting uops and unrolling. :)
If you were going to unroll the loop at all, then it would be worth incrementing pointers instead of just an index. (So there would be two uses of [rdx] and one add, vs. two uses of [rdx+rcx]).
Either way, cleaning up the loop setup and keeping everything in registers saves a decent amount of instruction bytes, and overhead for short arrays.
I'm trying to input values into an array in x86-64 Intel assembly, but I can't quite figure it out.
I'm creating an array in segment .bss. Then I try to pass the address of the array along to another module by using r15. Inside that module I prompt the user for a number that I then insert into the array. But it doesn't work.
I'm trying to do the following
segment .bss
dataArray resq 15 ; Array that will be manipulated
segment .text
mov rdi, dataArray ; Store memory address of array so the next module can use it.
call inputqarray ; Calling inputqarray module
Inside of inputqarray I have:
mov r15, rdi ; Move the memory address of the array into r15 for safe keeping
push qword 0 ; Make space on the stack for the value we are reading
mov rsi, rsp ; Set the second argument to point to the new location on the stack
mov rax, 0 ; No SSE input
mov rdi, oneFloat ; "%f", 0
call scanf ; Call C Standard Library scanf function
call getchar ; Clean the input stream
pop qword [r15]
I then try to output the value entered by the user by doing
push qword 0
mov rax, 1
mov rdi, oneFloat
movsd xmm0, [dataArray]
call printf
pop rax
Unfortunately, all I get for output is 0.00000
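Output is 0 because you are using the wrong format specifier in the scanf call. It should be "%lf": with "%f", scanf stores only a 4-byte float, but you then load 8 bytes with movsd and print them as a double.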
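The same rule seen from C, as a small sketch: read_double is a hypothetical helper, and sscanf on a string stands in for scanf on stdin so it is easy to try.

```c
#include <assert.h>
#include <stddef.h>
#include <stdio.h>

/* Parse one number into an 8-byte double. "%lf" tells the scanf family
   that the destination is a double; "%f" expects a pointer to float. */
int read_double(const char *text, double *out)
{
    assert(text != NULL && out != NULL);
    return sscanf(text, "%lf", out) == 1;
}
```

For printf the distinction does not matter, because a float argument is promoted to double in a variadic call, so "%f" and "%lf" both print a double there.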
Next, there is no need to push and pop in your procedure. Since you're going to pass the address of the data array to scanf, and that has to be in rsi, just pass it in rsi; one less move.
You declared your array as 15 QWORDs; is that correct (120 bytes)? Or did you mean resb 15?
This works and should get you on your way:
extern printf, scanf, exit
global main
section .rodata
fmtFloatIn db "%lf", 0
fmtFloatOut db `%lf\n`, 0
section .bss
dataArray resq 15 ; room for 15 doubles (a double needs 8 bytes; resb 15 would hold only one)
section .text
main:
sub rsp, 8 ; stack pointer 16 byte aligned
mov rsi, dataArray
call inputqarray
movsd xmm0, [dataArray]
mov rdi, fmtFloatOut
mov rax, 1
call printf
xor edi, edi ; exit status 0
call exit
inputqarray:
sub rsp, 8 ; stack pointer 16 byte aligned
; pointer to buffer is in rsi
mov rdi, fmtFloatIn
mov rax, 0
call scanf
add rsp, 8
ret
Since you are passing params in rdi to the C functions, this is not on Windows.