NASM - Using Labels as Array offsets

NASM - Using Labels as Array offsets - arrays

I'm trying to write a little program in assembler which takes three char arrays as input, calculates the avarage of each element in the first to arrays and stores the result in the third array like below.
%macro prologue 0
push rbp
mov rbp,rsp
push rbx
push r12
push r13
push r14
push r15
%endmacro
%macro epilogue 0
pop r15
pop r14
pop r13
pop r12
pop rbx
leave
ret
%endmacro
segment .data
offset db 1
segment .bss
a1 resq 1
a2 resq 1
avg resq 1
avgL resd 1
segment .text
global avgArray
avgArray:
prologue
mov [a1], rdi
mov [a2], rsi
mov [avg], rdx
mov [avgL], rcx
mov rsi, [a1]
mov r9, [a2]
mov rdi, [avg]
mov rcx, rsi
add rcx, [avgL] ; array length
xor rdx, rdx
xor rax, rax
xor rbx, rbx
avgArray_loop:
mov al, [rsi]
mov dl, [r9]
add ax, dx
shr ax, 1
mov [rdi], al
add rsi, [offset]
add r9, [offset]
add rdi, [offset]
cmp rsi, rcx
jb avgArray_loop
epilogue
When replacing [offset] with 1 it works perfectly fine. However when using [offset] to determine the next array element it seems that it wont add its value to rsi, rdi and r9.
I allready checked it using gdb. The adress stored in rsi is still the same after calling add rsi, [offset].
Can someone tell me why using [offset] wont work but adding a simple 1 does?
By the way: Linux x86_64 machine

So i found the solution for that problem.
The adresses of avgL and offset where stored directly behind each other. When reading from rcx and storing it to avgL it also overwrote the value of offset. Declaring avgL as a QWORD instead of a DWORD prevents mov from overwriting offset data.
The new data and bss segments look like this
segment .data
offset db 1
segment .bss
a1 resq 1
a2 resq 1
avg resq 1
avgL resq 1

Nice work on debugging your problem yourself. Since I already started to look at the code, I'll give you some efficiency / style critique as added comments:
%macro prologue 0
push rbp
mov rbp,rsp ; you can drop this and the LEAVE.
; Stack frames were useful before debuggers could keep track of things without them, and as a convenience
; so local variables were always at the same offset from your base pointer, even while you were pushing/popping stuff on the stack.
; With the SysV ABI, you can use the red zone for locals without even
; fiddling with RSP at all, if you don't push/pop or call anything.
push rbx
push r12
push r13
push r14
push r15
%endmacro
%macro epilogue 0
pop r15
pop r14
pop r13
pop r12
pop rbx
leave
ret
%endmacro
segment .data
offset db 1
segment .bss ; These should really be locals on the stack (or in regs!), not globals
a1 resq 1
a2 resq 1
avg resq 1
avgL resd 1
segment .text
; usually a comment with a C function prototype and description is a good idea for functions
global avgArray
avgArray:
prologue
mov [a1], rdi ; what is this sillyness? you have 16 registers for a reason.
mov [a2], rsi ; shuffling the values you want into the regs you want them in
mov [avg], rdx ; is best done with reg-reg moves.
mov [avgL], rcx ; I like to just put a comment at the top of a block of code
; to document what goes in what reg.
mov rsi, [a1]
mov r9, [a2]
mov rdi, [avg]
mov rcx, rsi
add rcx, [avgL] ; This could be lea rcx, [rsi+rcx]
; (since avgL is in rcx anyway as a function arg).
xor rdx, rdx
xor rax, rax
xor rbx, rbx
avgArray_loop: ; you can use a local label here, starting with a .
; You don't need a diff name for each loop: the assembler will branch to the most recent instance of that label
mov al, [rsi] ; there's a data dependency on the old value of ax
mov dl, [r9] ; since the CPU doesn't "know" that shr ax, 1 will always leave ah zeroed in this algorithm
add ax, dx ; Avoid ALU ops on 16bit regs whenever possible. (8bit is fine, they have diff opcodes instead of a prefix)
; to avoid decode stalls on Intel
shr ax, 1 ; Better to use 32bit regs (movsx/movzx)
mov [rdi], al
add rsi, [offset] ; These are 64bit adds, so you're reading 7 bytes after the 1 you set with db.
add r9, [offset]
add rdi, [offset]
cmp rsi, rcx
jb avgArray_loop
epilogue
You have tons of registers free, why are you keeping the loop increment in memory? I hope it just ended up that way while debugging / trying stuff.
Also, 1-reg addressing modes are only more efficient when used as mem operands for ALU ops. Just increment a single counter and used base+offset*scale addressing when you have a lot of pointers (unless you're unrolling the loop), esp. if you load them with mov.
Here's how I'd do it (with perf analysis for Intel SnB and later):
scalar
; no storage needed
segment .text
GLOBAL avgArray
avgArray:
; void avgArray (uint8_t *avg, const uint8_t *a1, const uint8_t *a2, size_t len)
; if you can choose your prototype, do it so args go where you want them anyway.
; prologue
; rdi = avg
; rsi = a1
; rdx = a2
; rcx = len
; mov [rsp-8], rcx ; if I wanted to spill len to memory
add rcx, rdi
add rcx, rsi
add rcx, rdx
neg rcx ; now [rdi+rcx] is the start of dest, and we can count rcx upwards towards zero.
; We could also have just counted down towards zero
; but HW memory prefetchers have more stream slots for forward patterns than reverse.
ALIGN 16
.loop:
; use movsx for signed char
movzx eax, [rsi+rcx] ; dependency-breaker
movzx r8d, [rdx+rcx] ; Using r8d to save push/pop of rbx
; on pre-Nehalem where insn decode can be a bottleneck even in tight loops
; using ebx or ebp would save a REX prefix (1 insn byte).
add eax, r8d
shr eax, 1
mov [rdi+rcx], al
inc rcx ; No cmp needed: this is the point of counting up towards zero
jl .loop ; inc/jl can Macro-fuse into one uop
; nothing to pop, we only used caller-saved regs.
ret
On Intel, the loop is 7 uops, (the store is 2 uops: store address and store data, and can't micro-fuse), so a CPU that can issue 4 uops per cycle will do it at 2 cycles per byte. movzx (to a 32 or 64bit reg) is 1 uop regardless, because there isn't a port 0/1/5 uop for it to micro-fuse with or not. (It's a read, not read-modify).
7 uops takes 2 chunks of up-to-4 uops, so the loop can issue in 2 cycles. There are no other bottlenecks that should prevent the execution units from keeping up with that, so it should run one per 2 cycles.
vector
There's a vector instruction to do exactly this operation: PAVGB is packed avg of unsigned bytes (with a 9-bit temporary to avoid overflow, same as your add/shr).
; no storage needed
segment .text
GLOBAL avgArray
avgArray:
; void avgArray (uint8_t *avg, const uint8_t *a1, const uint8_t *a2, size_t len)
; rdi = avg
; rsi = a1
; rdx = a2
; rcx = len
; same setup
; TODO: scalar loop here until [rdx+rcx] is aligned.
ALIGN 16
.loop:
; use movsx for signed char
movdqu xmm0, [rsi+rcx] ; 1 uop
pavgb xmm0, [rdx+rcx] ; 2 uops (no micro-fusion)
movdqu [rdi+rcx], xmm0 ; 2 uops: no micro-fusion
add rcx, 16
jl .loop ; 1 macro-fused uop add/branch
; TODO: scalar cleanup.
ret
Getting the loop-exit condition right is tricky, since you need to end the vector loop if the next 16B goes off the end of the array. Prob. best to handle that by decrementing rcx by 15 or something before adding it to the pointers.
So again, 6 uops / 2 cycles per iteration, but each iteration will do 16 bytes. It's ideal to unroll so your loop is a multiple of 4 uops, so you're not losing out on issue rate with a cycle of less-than-4 uops at the end of a loop. 2 loads / 1 store per cycle is our bottleneck here, since PAVGB has a throughput of 2 per cycle.
16B / cycle shouldn't be difficult on Haswell and later. With AVX2 using ymm registers, you'd get 32B / cycle. (SnB/IvB can only do two memory ops per cycle, at most one of them a store, unless you use 256b loads/stores). Anyway, at this point you've gained a massive 16x speedup from vectorizing, and usually that's good enough. I just enjoy tuning things for theoretical-max throughput by counting uops and unrolling. :)
If you were going to unroll the loop at all, then it would be worth incrementing pointers instead of just an index. (So there would be two uses of [rdx] and one add, vs. two uses of [rdx+rcx]).
Either way, cleaning up the loop setup and keeping everything in registers saves a decent amount of instruction bytes, and overhead for short arrays.

Related

Stuck at summing two arrays using MMX instructions using NASM

I was given the following task:
Given two arrays with 16 elements: NIZA RESW 16 and NIZB RESW 16
store in the third array (NIZC RESW 16) the following values: NIZC[i]=NIZA[i]+NIZB[i] using MMX instructions and compiling it with NASM
This is what I got so far:
%include "apicall.inc"
%include "print.inc"
segment .data
unos1 db "Array A: ", 0
unos2 db "Array B: ", 0
ispisC db "Array C : ", 0
segment .bss
NIZA RESW 16
NIZB RESW 16
NIZC RESW 16
segment .text
global start
start:
call init_console
mov esi,0
mov ecx, 16
mov eax, unos1
call print_string
call print_nl
unos_a:
call read_int
mov [NIZA+esi], eax
add esi, 2
loop unos_a
mov esi,0
mov ecx, 16
mov eax, unos2
call print_string
call print_nl
unos_b:
call read_int
mov [NIZB+esi], eax
add esi, 2
loop unos_b
movq mm0, qword [NIZA]
movq mm1, qword [NIZB]
paddq mm0, mm1
movq qword [NIZC], mm0
mov esi,NIZC
mov ecx,16
mov eax, ispisC
call print_string
call print_nl
ispis_c:
mov ax, [esi]
movsx eax, ax
call print_int
call print_nl
add esi, 2
loop ispis_c
APICALL ExitProcess, 0
After compiling the given array, and testing it with the following two arrays, the third array only stores 4 elements out of 16. (given in the following picture)
Does anybody know why it only stores 4 elements out of 16? Any help is appreciated.
If you have any question for the functions print_string print_int print_nl are functions for printing out a string, new line and a integer by pushing it in the EAX register, and also note this is a 32-bit program.

Does anybody know why it only stores 4 elements out of 16?
Because you let your MMX instructions only operate on the first 4 array elements. You need a loop to process all 16 array elements.
Your task description doesn't say it, but I see you sign-extend the values from NIZC before printing, so you seem expecting signed results. I also see that you use PADDQ to operate on 4 word-sized inputs. This will then not always give correct results! eg. If NIZA[0]=-1 and NIZB[0]=5, then you will get NIZC[0]=4 but there will have happened a carry from the first word into the second word, leaving NIZC[1] wrong. This will not happen if you use the right version of the packed addition: PADDW.
You got lucky with the size errors on mov [NIZA+esi], eax and mov [NIZB+esi], eax. Because NIZA and NIZB follow each other in memory in the same order that you assign to them, no harm was done. If NIZB would have been placed before NIZA, then assigning NIZB[15] would have corrupted NIZA[0].
Below is a partial rewrite where I used an input subroutine in order to not have to repeat myself.
mov eax, unos1
mov ebx, NIZA
call MyInput
mov eax, unos2
mov ebx, NIZB
call MyInput
xor esi, esi
more:
movq mm0, qword [NIZA + esi]
paddw mm0, qword [NIZB + esi]
movq qword [NIZC + esi], mm0
add esi, 8
cmp esi, 32
jb more
emms ; (*)
...
MyInput:
call print_string
call print_nl
xor esi, esi
.more:
call read_int ; -> EAX
mov [ebx + esi], ax
add esi, 2
cmp esi, 32 ; Repeat 16 times
jb .more
ret
(*) For info about emms (Empty MMX State) see https://www.felixcloutier.com/x86/emms
Tip: You can write mov ax, [esi] movsx eax, ax in one instruction: movsx eax, word [esi].

ASSEMBLY - output an array with 32 bit register vs 16 bit

I'm was working on some homework to print out an array as it's sorting some integers from an array. I have the code working fine, but decided to try using EAX instead of AL in my code and ran into errors. I can't figure out why that is. Is it possible to use EAX here at all?
; This program sorts an array of signed integers, using
; the Bubble sort algorithm. It invokes a procedure to
; print the elements of the array before, the bubble sort,
; once during each iteration of the loop, and once at the end.
INCLUDE Irvine32.inc
.data
myArray BYTE 5, 1, 4, 2, 8
;myArray DWORD 5, 1, 4, 2, 8
currentArray BYTE 'This is the value of array: ' ,0
startArray BYTE 'Starting array. ' ,0
finalArray BYTE 'Final array. ' ,0
space BYTE ' ',0 ; BYTE
.code
main PROC
MOV EAX,0 ; clearing registers, moving 0 into each, and initialize
MOV EBX,0 ; clearing registers, moving 0 into each, and initialize
MOV ECX,0 ; clearing registers, moving 0 into each, and initialize
MOV EDX,0 ; clearing registers, moving 0 into each, and initialize
PUSH EDX ; preserves the original edx register value for future writeString call
MOV EDX, OFFSET startArray ; load EDX with address of variable
CALL writeString ; print string
POP EDX ; return edx to previous stack
MOV ECX, lengthOf myArray ; load ECX with # of elements of array
DEC ECX ; decrement count by 1
L1:
PUSH ECX ; save outer loop count
MOV ESI, OFFSET myArray ; point to first value
L2:
MOV AL,[ESI] ; get array value
CMP [ESI+1], AL ; compare a pair of values
JGE L3 ; if [esi] <= [edi], don't exch
XCHG AL, [ESI+1] ; exchange the pair
MOV [ESI], AL
CALL printArray ; call printArray function
CALL crlf
L3:
INC ESI ; increment esi to the next value
LOOP L2 ; inner loop
POP ECX ; retrieve outer loop count
LOOP L1 ; else repeat outer loop
PUSH EDX ; preserves the original edx register value for future writeString call
MOV EDX, OFFSET finalArray ; load EDX with address of variable
CALL writeString ; print string
POP EDX ; return edx to previous stack
CALL printArray
L4 : ret
exit
main ENDP
printArray PROC uses ESI ECX
;myArray loop
MOV ESI, OFFSET myArray ; address of myArray
MOV ECX, LENGTHOF myArray ; loop counter (5 values within array)
PUSH EDX ; preserves the original edx register value for future writeString call
MOV EDX, OFFSET currentArray ; load EDX with address of variable
CALL writeString ; print string
POP EDX ; return edx to previous stack
L5 :
MOV AL, [ESI] ; add an integer into eax from array
CALL writeInt
PUSH EDX ; preserves the original edx register value for future writeString call
MOV EDX, OFFSET space
CALL writeString
POP EDX ; restores the original edx register value
ADD ESI, TYPE myArray ; point to next integer
LOOP L5 ; repeat until ECX = 0
CALL crlf
RET
printArray ENDP
END main
END printArray
; output:
;Starting array. This is the value of array: +1 +5 +4 +2 +8
;This is the value of array: +1 +4 +5 +2 +8
;This is the value of array: +1 +4 +2 +5 +8
;This is the value of array: +1 +2 +4 +5 +8
;Final array. This is the value of array: +1 +2 +4 +5 +8
As you can see the output sorts the array just fine from least to greatest. I was trying to see if I could move AL into EAX, but that gave me a bunch of errors. Is there a work around for this so I can use a 32 bit register and get the same output?

Using EAX is definitely possible, in fact you already are. You asked "I was trying to see if I could move AL into EAX, but that gave me a bunch of errors." Think about what that means. EAX is the extended AX register, and AL is the lower partition of AX. Take a look at this diagram:image of EAX register
. As you can see, moving AL into EAX using perhaps the MOVZX instruction would simply put the value in AL into EAX and fill zeroes in from right to left. You'd be moving AL into AL, and setting the rest of EAX to 0. You could actually move everything into EAX and run the program just the same and there'd be no difference because it's using the same part of memory.
Also, why are you pushing and popping EAX so much? The only reason to push/pop things from the runtime stack is to recover them later, but you never do that, so you can just let whatever is in EAX at the time just die.

If you still want to do an 8-bit store, you need to use an 8-bit register. (AL is an 8-bit register. IDK why you mention 16 in the title).
x86 has widening loads (movzx and movsx), but integer stores from a register operand always take a register the same width as the memory operand. i.e. the way to store the low byte of EAX is with mov [esi], al.
In printArray, you should use movzx eax, byte ptr [esi] to zero-extend into EAX. (Or movsx to sign-extend, if you want to treat your numbers as int8_t instead of uint8_t.) This avoids needing the upper 24 bits of EAX to be zeroed.
BTW, your code has a lot of unnecessary instructions. e.g.
MOV EAX,0 ; clearing registers, moving 0 into each, and initialize
totally pointless. You don't need to "init" or "declare" a register before using it for the first time, if your first usage is write-only. What you do with EDX is amusing:
MOV EDX,0 ; clearing registers, moving 0 into each, and initialize
PUSH EDX ; preserves the original edx register value for future writeString call
MOV EDX, OFFSET startArray ; load EDX with address of variable
CALL writeString ; print string
POP EDX ; return edx to previous stack
"Caller-saved" registers only have to be saved if you actually want the old value. I prefer the terms "call-preserved" and "call-clobbered". If writeString destroys its input register, then EDX holds an unknown value after the function returns, but that's fine. You didn't need the value anyway. (Actually I think Irvine32 functions at most destroy EAX.)
In this case, the previous instruction only zeroed the register (inefficiently). That whole block could be:
MOV EDX, OFFSET startArray ; load EDX with address of variable
CALL writeString ; print string
xor edx,edx ; edx = 0
Actually you should omit the xor-zeroing too, because you don't need it to be zeroed. You're not using it as counter in a loop or anything, all the other uses are write-only.
Also note that XCHG with memory has an implicit lock prefix, so it does the read-modify-write atomically (making it much slower than separate mov instructions to load and store).
You could load a pair of bytes using movzx eax, word ptr [esi] and use a branch to decide whether to rol ax, 8 to swap them or not. But store-forwarding stalls from byte stores forwarding to word loads isn't great either.
Anyway, this is getting way off topic from the title question, and this isn't codereview.SE.

looping over an array in NASM

I want to learn programming in assembly to write fast and efficient code.
How ever I stumble over a problem I can't solve.
I want to loop over an array of double words and add its components like below:
%include "asm_io.inc"
%macro prologue 0
push rbp
mov rbp,rsp
push rbx
push r12
push r13
push r14
push r15
%endmacro
%macro epilogue 0
pop r15
pop r14
pop r13
pop r12
pop rbx
leave
ret
%endmacro
segment .data
string1 db "result: ",0
array dd 1, 2, 3, 4, 5
segment .bss
segment .text
global sum
sum:
prologue
mov rdi, string1
call print_string
mov rbx, array
mov rdx, 0
mov ecx, 5
lp:
mov rax, [rbx]
add rdx, rax
add rbx, 4
loop lp
mov rdi, rdx
call print_int
call print_nl
epilogue
Sum is called by a simple C-driver. The functions print_string, print_int and print_nl look like this:
section .rodata
int_format db "%i",0
string_format db "%s",0
section .text
global print_string, print_nl, print_int, read_int
extern printf, scanf, putchar
print_string:
prologue
; string address has to be passed in rdi
mov rsi,rdi
mov rdi,dword string_format
xor rax,rax
call printf
epilogue
print_nl:
prologue
mov rdi,0xA
xor rax,rax
call putchar
epilogue
print_int:
prologue
;integer arg is in rdi
mov rsi, rdi
mov rdi, dword int_format
xor rax,rax
call printf
epilogue
When printing the result after summing all array elements it says "result: 14" instead of 15. I tried several combinations of elements, and it seems that my loop always skips the first element of the array.
Can somebody tell me why th loop skips the first element?
Edit
I forgot to mention that I'm using a x86_64 Linux system

I'm not sure why your code is printing the wrong number. Probably an off-by-one somewhere that you should track down with a debugger. gdb with layout asm and layout reg should help. Actually, I think you're going one past the end of the array. There's probably a -1 there, and you're adding it to your accumulator.
If your ultimate goal is writing fast & efficient code, you should have a look at some of the links I added recently to https://stackoverflow.com/tags/x86/info. Esp. Agner Fog's optimization guides are great for helping you understand what runs efficiently on today's machines, and what doesn't. e.g. leave is shorter, but takes 3 uops, compared to mov rsp, rbp / pop rbp taking 2. Or just omit the frame pointer. (gcc defaults to -fomit-frame-pointer for amd64 these days.) Messing around with rbp just wastes instructions and costs you a register, esp. in functions that are worth writing in ASM (i.e. usually everything lives in registers, and you don't call other functions).
The "normal" way to do this would be write your function in asm, call it from C to get the results, and then print the output with C. If you want your code to be portable to Windows, you can use something like
#define SYSV_ABI __attribute__((sysv_abi))
int SYSV_ABI myfunc(void* dst, const void* src, size_t size, const uint32_t* LH);
Then even if you compile for Windows, you don't have to change your ASM to look for its args in different registers. (The SysV calling convention is nicer than the Win64: more args in registers, and all the vector registers are allowed to be used without saving them.) Make sure you have a new enough gcc, that has the fix for https://gcc.gnu.org/bugzilla/show_bug.cgi?id=66275, though.
An alternative is to use some assembler macros to %define some register names so you can assemble the same source for Windows or SysV ABIs. Or have a Windows entry-point before the regular one, which uses some MOV instructions to put args in the registers the rest of the function is expecting. But that obviously is less efficient.
It's useful to know what function calls look like in asm, but writing them yourself is a waste of time, usually. Your finished routine will just return a result (in a register or memory), not print it. Your print_int etc. routines are hilariously inefficient. (push/pop every callee-saved register, even though you use none of them, and multiple calls to printf instead of using a single format string ending with a \n.) I know you didn't claim this code was efficient, and that you're just learning. You probably already had some idea that this wasn't very tight code. :P
My point is, compilers are REALLY good at their job, most of the time. Spend your time writing asm ONLY for the hot parts of your code: usually just a loop, sometimes including the setup / cleanup code around it.
So, on to your loop:
lp:
mov rax, [rbx]
add rdx, rax
add rbx, 4
loop lp
Never use the loop instruction. It decodes to 7 uops, vs. 1 for a macro-fused compare-and-branch. loop has a max throughput of one per 5 cycles (Intel Sandybridge/Haswell and later). By comparison, dec ecx / jnz lp or cmp rbx, array_end / jb lp would let your loop run at one iteration per cycle.
Since you're using a single-register addressing mode, using add rdx, [rbx] would also be more efficient than a separate mov-load. (It's a more complicated tradeoff with indexed addressing modes, since they can only micro-fuse in the decoders / uop-cache, not in the rest of the pipeline, on Intel SnB-family. In this case, add rdx, [rbx+rsi] or something would stay micro-fused on Haswell and later).
When writing asm by hand, if it's convenient, help yourself out by keeping source pointers in rsi and dest pointers in rdi. The movs insn implicitly uses them that way, which is why they're named si and di. Never use extra mov instructions just because of register names, though. If you want more readability, use C with a good compiler.
;;; This loop probably has lots of off-by-one errors
;;; and doesn't handle array-length being odd
mov rsi, array
lea rdx, [rsi + array_length*4] ; if len is really a compile-time constant, get your assembler to generate it for you.
mov eax, [rsi] ; load first element
mov ebx, [rsi+4] ; load 2nd element
add rsi, 8 ; eliminate this insn by loading array+8 in the first place earlier
; TODO: handle length < 4
ALIGN 16
.loop:
add eax, [ rsi]
add ebx, [4 + rsi]
add rsi, 8
cmp rsi, rdx
jb .loop ; loop while rsi is Below one-past-the-end
; TODO: handle odd-length
add eax, ebx
ret
Don't use this code without debugging it. gdb (with layout asm and layout reg) is not bad, and available in every Linux distro.
If your arrays are always going to be very short compile-time-constant lengths, just fully unroll the loops. Otherwise, an approach like this, with two accumulators, lets two additions happen in parallel. (Intel and AMD CPUs have two load ports, so they can sustain two adds from memory per clock. Haswell has 4 execution ports that can handle scalar integer ops, so it can execute this loop at 1 iteration per cycle. Previous Intel CPUs can issue 4 uops per cycle, but the execution ports will get behind on keeping up with them. Unrolling to minimize loop overhead would help.)
All these techniques (esp. multiple accumulators) are equally applicable to vector instructions.
segment .rodata ; read-only data
ALIGN 16
array: times 64 dd 1, 2, 3, 4, 5
array_bytes equ $-array
string1 db "result: ",0
segment .text
; TODO: scalar loop until rsi is aligned
; TODO: handle length < 64 bytes
lea rsi, [array + 32]
lea rdx, [rsi - 32 + array_bytes] ; array_length could be a register (or 4*a register, if it's a count).
; lea rdx, [array + array_bytes] ; This way would be lower latency, but more insn bytes, when "array" is a symbol, not a register. We don't need rdx until later.
movdqu xmm0, [rsi - 32] ; load first element
movdqu xmm1, [rsi - 16] ; load 2nd element
; note the more-efficient loop setup that doesn't need an add rsi, 32.
ALIGN 16
.loop:
paddd xmm0, [ rsi] ; add packed dwords
paddd xmm1, [16 + rsi]
add rsi, 32
cmp rsi, rdx
jb .loop ; loop: 4 fused-domain uops
paddd xmm0, xmm1
phaddd xmm0, xmm0 ; horizontal add: SSSE3 phaddd is simple but not optimal. Better to pshufd/paddd
phaddd xmm0, xmm0
movd eax, xmm0
; TODO: scalar cleanup loop
ret
Again, this code probably has bugs, and doesn't handle the general case of alignment and length. It's unrolled so each iteration does two * four packed ints = 32bytes of input data.
It should run at one iteration per cycle on Haswell, otherwise 1 iteration per 1.333 cycles on SnB/IvB. The frontend can issue all 4 uops in a cycle, but the execution units can't keep up without Haswell's 4th ALU port to handle the add and macro-fused cmp/jb. Unrolling to 4 paddd per iteration would do the trick for Sandybridge, and probably help on Haswell, too.
With AVX2 vpadd ymm1, [32+rsi], you get double the throughput (if the data is in the cache, otherwise you still bottleneck on memory). To do the horizontal sum for a 256b vector, start with a vextracti128 xmm1, ymm0, 1 / vpaddd xmm0, xmm0,xmm1, and then it's the same as the SSE case. See this answer for more details about efficient shuffles for horizontal ops.

How to write in nasm without writing null bytes?

Pretty simple problem.
This nasm is supposed to write a user-written message (i.e. hello) to a file, again determined by user input from an argument. It does this just fine, but the problem is, it writes all the null bytes not used afterwards as well. For example, if I reserve 32 bytes for user input, and the user only uses four for his input, those for bytes will be printed, along with 28 null bytes.
How do I stop printing null bytes?
Code used:
global _start
section .text
_start:
mov rax, 0 ; get input to write to file
mov rdi, 0
mov rsi, msg
mov rdx, 32
syscall
mov rax, 2 ; open the file at the third part of the stack
pop rdi
pop rdi
pop rdi
mov rsi, 1
syscall
mov rdi, rax
mov rax, 1 ; write message to file
mov rsi, msg
mov rdx, 32
syscall
mov rax, 3 ; close file
syscall
mov rax, 1 ; print success message
mov rdi, 1
mov rsi, output
mov rdx, outputL
syscall
mov rax, 60 ; exit
mov rdi, 0
syscall
section .bss
msg: resb 32
section .data
output: db 'Success!', 0x0A
outputL: equ $-output

Well, after doing some digging in header files and experimenting, I figured it out on my own.
Basically, the way it works is that you have to put the user's string through a byte counting process that counts along the string until it finds a null byte, and then stores that number of non-null bytes.
I'll post the workaround I'm using for anyone who's had the same problem as me. Keep in mind that this solution is for 64-bit nasm, NOT 32!
For 32-bit coders, change:
all instances of "rax" with "eax"
all instances of "rdi" with "ebx"
all instances of "rsi" with "ecx"
all instances of "rdx" with "edx"
all instances of "syscall" with "int 80h" (or equivelant)
all instances of "r8" with "edx" (you'll have to juggle this and rdx)
Here's the solution I use, in full:
global _start
; stack: (argc) ./a.out input filename
section .text
_start:
getInput:
mov rax, 0 ; syscall for reading user input
mov rdi, 0
mov rsi, msg ; store user input in the "msg" variable
mov rdx, 32 ; max input size = 32 bytes
xor r8, r8 ; set r8 to zero for counting purposes (this is for later)
getInputLength:
cmp byte [msg + r8], 0 ; compare ((a byte of user input) + 0) to 0
jz open ; if the difference is zero, we've found the end of the string
; so we move on. The length of the string is stored in r9.
inc r8 ; if not, onto the next byte...
jmp getInputLength ; so we jump back up four lines and repeat!
open:
mov rax, 2 ; syscall for opening files
pop rdi
pop rdi
pop rdi ; get the file to open from the stack (third argument)
mov rsi, 1 ; open in write mode
syscall
; the open syscall above has made us a full file descriptor in rax
mov rdi, rax ; so we move it into rdi for later
write:
mov rax, 1 ; syscall for writing to files
; rdi already holds our file descriptor
mov rsi, msg ; set the message we're writing to the msg variable
mov rdx, r8 ; set write length to the string length we measured earlier
syscall
close:
mov rax, 3 ; syscall for closing files
; our file descriptor is still in fd
syscall
exit:
mov rax, 60 ; syscall number for program exit
mov rdi, 0 ; return 0
Keep in mind that this is not a complete program. It totally lacks error handling, offers no user instruction, etc. It is only an illustration of method.

Array value fetching in asm x64

I have a problem with asm code that works when mixed with C, but does not when used in asm code with proper parameters.
;; array - RDI, x- RSI, y- RDX
getValue:
mov r13, rsi
sal r13, $3
mov r14, rdx
sal r14, $2
mov r15, [rdi+r13]
mov rax, [r15+r14]
ret
Technically I want to keep the rdi, rsi and rdx registers untouched and thus I use other ones.
I am using an x64 machine and thus my pointers have 8 bytes. Technically speaking this code is supposed to do:
int getValue(int** array, int x, int y) {
return array[x][y];
}
it somehow works inside my C code, but does not when used in asm in this way:
mov rdi, [rdi] ;; get first pointer - first row
mov r9, $4 ;; we want second element from the row
mov rax, [rdi+r9] ;; get the element (4 bytes vs 8 bytes???)
mov rdi, FMT ;; prepare printf format "%d", 10, 0
mov rsi, rax ;; we want to print the element we just fetched
mov eax, $0 ;; say we have no non-integer argument
call printf ;; always gives 0, no matter what's in the matrix
Can someone see into this and help me? Thanks in advance.

The sal r14, $2 implies the elements are dwords, so the last line before the ret shouldn't load a qword. Besides, x86 has nice scaling addressing modes, so you can do this:
mov rax, [rdi + rsi * 8] ; load pointer to column
mov eax, [rax + rdx * 4] ; note this loads a dword
ret
That implies that you have an array of pointers to columns, which is unusual. You can do that, but was it intended?

This is a standard matrix of integers.
int** array;
sizeof(int*) == 8
sizeof(int) == 4
How I see it is that when I have that array at first, I have a pointer to a space of memory without "blanks" that holds all pointers one by one (index-by-index), so I say "let's go to the element rsi-th of the array" and that's why I shift by rsi-th * 8 bytes. So now I get the same situation, but the pointer should point to a space of integers, so 4-byte items. That's why I shift by 4 bytes there.
Is my thinking wrong?

Develop Reference

c reactjs sql-server angularjs arrays wpf database batch-file google-app-engine silverlight