I want to learn programming in assembly to write fast and efficient code.
However, I've stumbled over a problem I can't solve.
I want to loop over an array of double words and add its components like below:
%include "asm_io.inc"
%macro prologue 0
push rbp
mov rbp,rsp
push rbx
push r12
push r13
push r14
push r15
%endmacro
%macro epilogue 0
pop r15
pop r14
pop r13
pop r12
pop rbx
leave
ret
%endmacro
segment .data
string1 db "result: ",0
array dd 1, 2, 3, 4, 5
segment .bss
segment .text
global sum
sum:
prologue
mov rdi, string1
call print_string
mov rbx, array
mov rdx, 0
mov ecx, 5
lp:
mov rax, [rbx]
add rdx, rax
add rbx, 4
loop lp
mov rdi, rdx
call print_int
call print_nl
epilogue
Sum is called by a simple C-driver. The functions print_string, print_int and print_nl look like this:
section .rodata
int_format db "%i",0
string_format db "%s",0
section .text
global print_string, print_nl, print_int, read_int
extern printf, scanf, putchar
print_string:
prologue
; string address has to be passed in rdi
mov rsi,rdi
mov rdi,dword string_format
xor rax,rax
call printf
epilogue
print_nl:
prologue
mov rdi,0xA
xor rax,rax
call putchar
epilogue
print_int:
prologue
;integer arg is in rdi
mov rsi, rdi
mov rdi, dword int_format
xor rax,rax
call printf
epilogue
When printing the result after summing all array elements it says "result: 14" instead of 15. I tried several combinations of elements, and it seems that my loop always skips the first element of the array.
Can somebody tell me why the loop skips the first element?
Edit
I forgot to mention that I'm using an x86_64 Linux system
I'm not sure why your code is printing the wrong number. Probably an off-by-one somewhere that you should track down with a debugger. gdb with layout asm and layout reg should help. Actually, I think you're going one past the end of the array. There's probably a -1 there, and you're adding it to your accumulator.
If your ultimate goal is writing fast & efficient code, you should have a look at some of the links I added recently to https://stackoverflow.com/tags/x86/info. Esp. Agner Fog's optimization guides are great for helping you understand what runs efficiently on today's machines, and what doesn't. e.g. leave is shorter, but takes 3 uops, compared to mov rsp, rbp / pop rbp taking 2. Or just omit the frame pointer. (gcc defaults to -fomit-frame-pointer for amd64 these days.) Messing around with rbp just wastes instructions and costs you a register, esp. in functions that are worth writing in ASM (i.e. usually everything lives in registers, and you don't call other functions).
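For instance, a version of your sum that skips the frame setup only needs to save what it actually uses (a sketch; the name sum_noframe is just illustrative):
sum_noframe:            ; hypothetical: your sum without the rbp frame
    push rbx            ; rbx is the only callee-saved register the body touches
                        ; (a single push also keeps the stack 16-byte aligned for the calls)
    ; ... body as before ...
    pop rbx
    ret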
The "normal" way to do this would be write your function in asm, call it from C to get the results, and then print the output with C. If you want your code to be portable to Windows, you can use something like
#define SYSV_ABI __attribute__((sysv_abi))
int SYSV_ABI myfunc(void* dst, const void* src, size_t size, const uint32_t* LH);
Then even if you compile for Windows, you don't have to change your ASM to look for its args in different registers. (The SysV calling convention is nicer than the Win64: more args in registers, and all the vector registers are allowed to be used without saving them.) Make sure you have a new enough gcc, that has the fix for https://gcc.gnu.org/bugzilla/show_bug.cgi?id=66275, though.
An alternative is to use some assembler macros to %define some register names so you can assemble the same source for Windows or SysV ABIs. Or have a Windows entry-point before the regular one, which uses some MOV instructions to put args in the registers the rest of the function is expecting. But that obviously is less efficient.
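A rough sketch of that entry-point approach (hypothetical names, reusing the myfunc prototype above; note that rsi and rdi are callee-saved in the Win64 ABI, so a real shim also has to preserve them and re-establish the stack alignment the SysV body expects):
global myfunc_win64     ; hypothetical Windows entry point
myfunc_win64:
    push rsi            ; rsi/rdi are callee-saved in the Win64 ABI
    push rdi
    mov  rdi, rcx       ; dst   (Win64 passes the first four args in rcx, rdx, r8, r9)
    mov  rsi, rdx       ; src
    mov  rdx, r8        ; size
    mov  rcx, r9        ; LH
    sub  rsp, 8         ; restore the 16-byte alignment convention before the call
    call myfunc
    add  rsp, 8
    pop  rdi
    pop  rsi
    ret
global myfunc           ; the regular SysV entry point
myfunc:
    ; ... body written for SysV registers: rdi, rsi, rdx, rcx ...
    ret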
It's useful to know what function calls look like in asm, but writing them yourself is a waste of time, usually. Your finished routine will just return a result (in a register or memory), not print it. Your print_int etc. routines are hilariously inefficient. (push/pop every callee-saved register, even though you use none of them, and multiple calls to printf instead of using a single format string ending with a \n.) I know you didn't claim this code was efficient, and that you're just learning. You probably already had some idea that this wasn't very tight code. :P
My point is, compilers are REALLY good at their job, most of the time. Spend your time writing asm ONLY for the hot parts of your code: usually just a loop, sometimes including the setup / cleanup code around it.
So, on to your loop:
lp:
mov rax, [rbx]
add rdx, rax
add rbx, 4
loop lp
Never use the loop instruction. It decodes to 7 uops, vs. 1 for a macro-fused compare-and-branch. loop has a max throughput of one per 5 cycles (Intel Sandybridge/Haswell and later). By comparison, dec ecx / jnz lp or cmp rbx, array_end / jb lp would let your loop run at one iteration per cycle.
Since you're using a single-register addressing mode, using add rdx, [rbx] would also be more efficient than a separate mov-load. (It's a more complicated tradeoff with indexed addressing modes, since they can only micro-fuse in the decoders / uop-cache, not in the rest of the pipeline, on Intel SnB-family. In this case, add rdx, [rbx+rsi] or something would stay micro-fused on Haswell and later).
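For example, your summing loop could be written something like this (an untested sketch; note the elements are dd, so a 32-bit add from memory matches the data):
    mov   rsi, array            ; source pointer
    lea   rdx, [array + 5*4]    ; one-past-the-end (5 dword elements)
    xor   eax, eax              ; accumulator
.sumloop:
    add   eax, [rsi]            ; 32-bit add straight from memory
    add   rsi, 4
    cmp   rsi, rdx
    jb    .sumloop              ; cmp/jb macro-fuses; no 7-uop loop instruction
    mov   rdi, rax              ; result in rdi if you still want to call print_int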
When writing asm by hand, if it's convenient, help yourself out by keeping source pointers in rsi and dest pointers in rdi. The movs insn implicitly uses them that way, which is why they're named si and di. Never use extra mov instructions just because of register names, though. If you want more readability, use C with a good compiler.
;;; This loop probably has lots of off-by-one errors
;;; and doesn't handle array-length being odd
mov rsi, array
lea rdx, [rsi + array_length*4] ; if len is really a compile-time constant, get your assembler to generate it for you.
mov eax, [rsi] ; load first element
mov ebx, [rsi+4] ; load 2nd element
add rsi, 8 ; eliminate this insn by loading array+8 in the first place earlier
; TODO: handle length < 4
ALIGN 16
.loop:
add eax, [ rsi]
add ebx, [4 + rsi]
add rsi, 8
cmp rsi, rdx
jb .loop ; loop while rsi is Below one-past-the-end
; TODO: handle odd-length
add eax, ebx
ret
Don't use this code without debugging it. gdb (with layout asm and layout reg) is not bad, and available in every Linux distro.
If your arrays are always going to be very short compile-time-constant lengths, just fully unroll the loops. Otherwise, an approach like this, with two accumulators, lets two additions happen in parallel. (Intel and AMD CPUs have two load ports, so they can sustain two adds from memory per clock. Haswell has 4 execution ports that can handle scalar integer ops, so it can execute this loop at 1 iteration per cycle. Previous Intel CPUs can issue 4 uops per cycle, but they don't have enough integer execution ports to keep up. Unrolling to minimize loop overhead would help.)
All these techniques (esp. multiple accumulators) are equally applicable to vector instructions.
segment .rodata ; read-only data
ALIGN 16
array: times 64 dd 1, 2, 3, 4, 5
array_bytes equ $-array
string1 db "result: ",0
segment .text
; TODO: scalar loop until rsi is aligned
; TODO: handle length < 64 bytes
lea rsi, [array + 32]
lea rdx, [rsi - 32 + array_bytes] ; array_length could be a register (or 4*a register, if it's a count).
; lea rdx, [array + array_bytes] ; This way would be lower latency, but more insn bytes, when "array" is a symbol, not a register. We don't need rdx until later.
movdqu xmm0, [rsi - 32] ; load first element
movdqu xmm1, [rsi - 16] ; load 2nd element
; note the more-efficient loop setup that doesn't need an add rsi, 32.
ALIGN 16
.loop:
paddd xmm0, [ rsi] ; add packed dwords
paddd xmm1, [16 + rsi]
add rsi, 32
cmp rsi, rdx
jb .loop ; loop: 4 fused-domain uops
paddd xmm0, xmm1
phaddd xmm0, xmm0 ; horizontal add: SSSE3 phaddd is simple but not optimal. Better to pshufd/paddd
phaddd xmm0, xmm0
movd eax, xmm0
; TODO: scalar cleanup loop
ret
Again, this code probably has bugs, and doesn't handle the general case of alignment and length. It's unrolled so each iteration does two vectors of four packed dwords = 32 bytes of input data.
It should run at one iteration per cycle on Haswell, or one iteration per 1.333 cycles on SnB/IvB. The frontend can issue all 4 uops in a cycle, but the execution units can't keep up without Haswell's 4th ALU port to handle the add and macro-fused cmp/jb. Unrolling to 4 paddd per iteration would do the trick for Sandybridge, and probably help on Haswell, too.
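The phaddd pair at the end is the simple option the comment mentions; a shuffle-and-add reduction is usually cheaper. A sketch of what could go in place of the two phaddd:
    paddd   xmm0, xmm1          ; combine the two accumulators, as before
    pshufd  xmm1, xmm0, 0x4E    ; swap the two 64-bit halves
    paddd   xmm0, xmm1
    pshufd  xmm1, xmm0, 0xB1    ; swap dwords within each half
    paddd   xmm0, xmm1          ; every element now holds the full sum
    movd    eax, xmm0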
With AVX2 (vpaddd ymm1, ymm1, [rsi+32]), you get double the throughput (if the data is in cache; otherwise you still bottleneck on memory). To do the horizontal sum for a 256b vector, start with vextracti128 xmm1, ymm0, 1 / vpaddd xmm0, xmm0, xmm1, and then it's the same as the SSE case. See this answer for more details about efficient shuffles for horizontal ops.
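Putting that together, the AVX2 loop body and reduction might look something like this (an untested sketch; it assumes the rsi/rdx setup from the SSE version, adjusted to step 64 bytes per iteration, and skips the alignment/length handling):
    vpxor   ymm0, ymm0, ymm0        ; zero the accumulators (or peel the first iteration like above)
    vpxor   ymm1, ymm1, ymm1
.loop256:
    vpaddd  ymm0, ymm0, [rsi]       ; 8 dwords per vector
    vpaddd  ymm1, ymm1, [rsi + 32]
    add     rsi, 64                 ; 16 dwords = 64 bytes per iteration
    cmp     rsi, rdx
    jb      .loop256
    vpaddd       ymm0, ymm0, ymm1
    vextracti128 xmm1, ymm0, 1      ; high 128 bits
    vpaddd       xmm0, xmm0, xmm1   ; now reduce 128 bits, same as the SSE case
    vpshufd      xmm1, xmm0, 0x4E   ; swap 64-bit halves
    vpaddd       xmm0, xmm0, xmm1
    vpshufd      xmm1, xmm0, 0xB1   ; swap dwords within each half
    vpaddd       xmm0, xmm0, xmm1
    vmovd        eax, xmm0
    vzeroupper                      ; avoid SSE/AVX transition penalties before returning
    ret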
Related
I am currently learning assembly language, and I have a program which outputs "Hello World!" :
section .text
global _start
_start:
mov ebx, 1
mov ecx, string
mov edx, string_len
mov eax, 4
int 0x80
mov eax, 1
int 0x80
section .data
string db "Hello World!", 10, 0
string_len equ $ - string
I understand how this code works. But now I want to display the line 10 times. The code which I saw on the internet for looping is the following:
mov ecx, 5
start_loop:
; the code here would be executed 5 times
loop start_loop
Problem: I tried to implement the loop in my code but it produces an infinite loop. I also noticed that the loop needs ECX and the write function also needs ECX. What is the correct way to display my "Hello World!" 10 times?
This is my current code (which produces infinite loop):
section .text
global _start
_start:
mov ecx, 10
myloop:
mov ebx, 1 ;file descriptor
mov ecx, string
mov edx, string_len
mov eax, 4 ; write func
int 0x80
loop myloop
mov eax, 1 ;exit
int 0x80
section .data
string db "Hello World!", 10, 0
string_len equ $ - string
Thank you very much
loop uses the ecx register. This is easy to remember because the c stands for counter. However, you overwrite the ecx register, so that will never work!
The easiest fix is to use a different register for your loop counter, and avoid the loop instruction.
mov edi,10 ;registers are general purpose
myloop:           ;don't name the label "loop": that's also an instruction mnemonic
..... do loop stuff as above
;push edi ;no need to save the register
int 0x80
.... ;unless you yourself are changing edi
int 0x80
;pop edi ;restore the register. Remember to always match push/pop pairs.
sub edi,1 ;sub sometimes faster than dec and never slower
jnz myloop
There is no right or wrong way to loop.
(Unless you're looking for every last cycle, which you are not, because you've got a system call inside the loop.)
The disadvantage of loop is that it's slower than the equivalent sub ecx,1 ; jnz start_of_loop.
The advantage of loop is that it uses less instruction bytes. Think of it as a code-size optimization you can use if ECX happens to be a convenient register for looping, but at the cost of speed on some CPUs.
Note that the use of dec reg + jcc label is discouraged for some CPUs (Silvermont / Goldmont are still relevant, Pentium 4 is not). Because dec only alters part of the flags register, it can require an extra merging uop. Mainstream Intel and AMD rename parts of EFLAGS separately, so there's no penalty for dec / jnz (because jnz only reads one of the flags written by dec), and it can even macro-fuse into a dec-and-branch uop on Sandybridge-family (but so can sub). sub is never worse, except in code size, so you may want to use sub reg,1 for the foreseeable future.
A system call using int 80h does not alter any registers other than eax, so as long as you remember not to mess with edi you don't need the push/pop pair. Pick a register you don't use inside the loop, or you could just use a stack location as your loop counter directly instead of push/pop of a register.
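Putting it together, the whole program with edi as the counter might look something like this (a sketch, untested):
section .text
global _start
_start:
    mov edi, 10           ; loop counter; edi survives int 0x80
print_loop:
    mov eax, 4            ; sys_write
    mov ebx, 1            ; fd 1 = stdout
    mov ecx, string
    mov edx, string_len
    int 0x80              ; clobbers only eax (the return value)
    sub edi, 1
    jnz print_loop
    mov eax, 1            ; sys_exit
    xor ebx, ebx          ; status 0
    int 0x80
section .data
string db "Hello World!", 10, 0
string_len equ $ - string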
Once again I'm jumping back into teaching myself basic assembly language so I don't completely forget everything.
I made this practice code the other day, and in it, it turned out that I had to plop the results of a vector operation into an array backwards; otherwise it gave a wrong answer. Incidentally, this is also how GCC outputs assembly code for SIMD operation results going back to a memory location, so I presume it's the "correct" way.
However, something occurred to me, prompted by something I've been conscious of as a game developer for quite a long time: cache friendliness. My understanding is that moving forward in a contiguous block of memory is always ideal; otherwise you risk cache misses.
My question is: Even if this example below is nothing more than calculating a couple of four-element vectors and spitting out four numbers before quitting, I have to wonder if this -- putting numbers back into an array in what's technically reverse order -- has any impact at all on cache misses in the real world, within a typical production-level program that does hundreds of thousands of SIMD vector calculations (and more specifically, returning them back to memory) per second?
Here is the full code (linux 64-bit NASM) with comments including the original one that prompted me to bring this curiosity of mine to stackexchange:
extern printf
extern fflush
global _start
section .data
outputText: db '[%f, %f, %f, %f]',10,0
align 16
vec1: dd 1.0, 2.0, 3.0, 4.0
vec2: dd 10.0,10.0,10.0,50.0
section .bss
result: resd 4 ; four 32-bit single-precision floats
section .text
_start:
sub rsp,16
movaps xmm0,[vec1]
movaps xmm1,[vec2]
mulps xmm0,xmm1 ; xmm0 = (vec1 * vec2)
movaps [result],xmm0 ; copy 4 floats back to result[]
; printf only accepts 64-bit floats for some dumb reason,
; so convert these 32-bit floats packed within the 128-bit xmm0
; register into four 64-bit floats, each in a separate xmm* reg
movss xmm0,[result+12] ; result[3]
unpcklps xmm0,xmm0 ; 32-->64 bit
cvtps2pd xmm3,xmm0 ; put double in 4th XMM
movss xmm0,[result+8] ; result[2]
unpcklps xmm0,xmm0 ; 32-->64 bit
cvtps2pd xmm2,xmm0 ; put double in 3rd XMM
movss xmm0,[result+4] ; result[1]
unpcklps xmm0,xmm0 ; 32-->64 bit
cvtps2pd xmm1,xmm0 ; put double in 2nd XMM
movss xmm0,[result] ; result[0]
unpcklps xmm0,xmm0 ; 32-->64 bit
cvtps2pd xmm0,xmm0 ; put double in 1st XMM
; FOOD FOR THOUGHT!
; *****************
; That was done backwards, going from highest element
; of what is technically an array down to the lowest.
;
; This is because when it was done from lowest to
; highest, this garbled bird poop was the answer:
; [13510801139695616.000000, 20.000000, 30.000000, 200.000000]
;
; HOWEVER, if the correct way is this way, in which
; it traipses through an array backwards...
; is that not cache-unfriendly? Or is it too tiny and
; miniscule to have any impact with cache misses?
mov rdi, outputText ; tells printf where is format string
mov rax,4 ; tells printf to print 4 XMM regs
call printf
mov rdi,0
call fflush ; ensure we see printf output b4 exit
add rsp,16
_exit:
mov eax,1 ; syscall id for sys_exit
mov ebx,0 ; exit with ret of 0 (no error)
int 80h
HW prefetchers can recognize streams with descending addresses as well as ascending. Intel's optimization manual documents the HW prefetchers in fair detail. I think AMD's prefetchers are broadly similar in terms of being able to recognize descending patterns as well.
Within a single cache-line, it doesn't matter at all what order you access things in, AFAIK.
See the x86 tag wiki for more links, especially Agner Fog's Optimizing Assembly guide to learn how to write asm that isn't slower than what a compiler could make. The tag wiki also has links to Intel's manuals.
Also, that is some ugly / bad asm. Here's how to do it better:
Printf only accepts double because of C rules for arg promotion to variadic functions. Yes, this is kinda dumb, but FP->base-10-text conversion dwarfs the overhead from an extra float->double conversion. If you need high-performance FP->string, you probably should avoid using a function that has to parse a format string every call.
debug-prints in ASM are usually more trouble than they're worth, compared to using a debugger.
Also:
This is 64-bit code, so don't use the 32-bit int 0x80 ABI to exit.
The UNPCKLPS instructions are pointless, because you only care about the low element anyway. CVTPS2PD produces two results, but you're converting the same number twice in parallel instead of converting two and then unpacking. Only the low double in an XMM matters, when calling a function that takes scalar args, so you can leave high garbage.
Store/reload is also pointless
DEFAULT REL ; use RIP-relative addressing for [vec1]
extern printf
;extern fflush ; just call exit(3) instead of manual fflush
extern exit
section .rodata ; read-only data can be part of the text segment
outputText: db '[%f, %f, %f, %f]',10,0
align 16
vec1: dd 1.0, 2.0, 3.0, 4.0
vec2: dd 10.0,10.0,10.0,50.0
section .bss
;; static scratch space is unwise. Use the stack to reduce cache misses, and for thread safety
; result: resd 4 ; four 32-bit single-precision floats
section .text
global _start
_start:
;; sub rsp,16 ; What was this for? We have a red-zone in x86-64 SysV, and we don't use the stack here anyway
movaps xmm2, [vec1]
; fold the load into the mulps
mulps xmm2, [vec2] ; (vec1 * vec2)
; printf only accepts 64-bit doubles, because it's a C variadic function.
; so convert these 32-bit floats packed within the 128-bit xmm2
; register into four 64-bit floats, each in a separate xmm* reg
; xmm2 = [f0,f1,f2,f3]
cvtps2pd xmm0, xmm2 ; xmm0=[d0,d1]
movaps xmm1, xmm0
unpckhpd xmm1, xmm1 ; xmm1=[d1,d1]
unpckhpd xmm2, xmm2 ; xmm2=[f2,f3, f2,f3]
cvtps2pd xmm2, xmm2 ; xmm2=[d2,d3]
movaps xmm3, xmm2 ; xmm3=[d2,d3]
unpckhpd xmm3, xmm3 ; xmm3=[d3,d3]
mov edi, outputText ; static data is in the low 2G, so we can use 32-bit absolute addresses
;lea rdi, [outputText] ; or this is the PIC way to do it
mov eax,4 ; tells printf to print 4 XMM regs
call printf
xor edi, edi
;call fflush ; flush before _exit()
jmp exit ; tailcall exit(3) which does flush, like if you returned from main()
; add rsp,16
;; this is how you would exit if you didn't use the libc function.
_exit:
xor edi, edi
mov eax, 231 ; exit_group(0)
syscall ; 64-bit code should use the 64-bit ABI
You could also use MOVHLPS to move the high 64 bits from one register into the low 64 bits of another reg, but that has a false dependency on the old contents.
cvtps2pd xmm0, xmm2 ; xmm0=[d0,d1]
;movaps xmm1, xmm0
;unpckhpd xmm1, xmm1 ; xmm1=[d1,d1]
;xorps xmm1, xmm1 ; break the false dependency
movhlps xmm1, xmm0 ; xmm1=[d1,??] ; false dependency on old value of xmm1
On Sandybridge, xorps and movhlps would be more efficient, because it can handle xor-zeroing without using an execution unit. IvyBridge and later, and AMD CPUs, can eliminate the MOVAPS the same way: zero latency. But still takes a uop and some frontend throughput resources.
If you were going to store and reload, and convert each float separately, you'd use CVTSS2SD, either as a load (cvtss2sd xmm2, [result + 12]) or after movss.
Using MOVSS first would break the false-dependency on the full register, which CVTSS2SD has because Intel designed it badly, to merge with the old value instead of replacing. Same for int->float or double conversion. The merging case is much rarer than scalar math, and can be done with a reg-reg MOVSS.
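If you did want the store/reload version, it might look like this (a sketch, assuming the result scratch buffer from the original code is still declared; the xorps instructions break the false dependency on each destination register):
    movups   [result], xmm2        ; store the four products (movaps if the buffer is 16B-aligned)
    xorps    xmm3, xmm3            ; zero first: breaks cvtss2sd's false dependency on the old value
    cvtss2sd xmm3, [result + 12]   ; result[3] -> double
    xorps    xmm2, xmm2
    cvtss2sd xmm2, [result + 8]
    xorps    xmm1, xmm1
    cvtss2sd xmm1, [result + 4]
    xorps    xmm0, xmm0
    cvtss2sd xmm0, [result]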
Hello Everyone!
I'm a newbie at NASM and I just started out recently. I currently have a program that reserves an array and is supposed to copy and display the contents of a string from the command line arguments into that array.
Now, I am not sure if I am copying the string correctly as every time I try to display this, I keep getting a segmentation error!
This is my code for copying the array:
example:
%include "asm_io.inc"
section .bss
X: resb 50 ;;This is our array
~some code~
mov eax, dword [ebp+12] ; eax holds address of 1st arg
add eax, 4 ; eax holds address of 2nd arg
mov ebx, dword [eax] ; ebx holds 2nd arg, which is pointer to string
mov ecx, dword 0
;Where our 2nd argument is a string eg "abcdefg" i.e ebx = "abcdefg"
copyarray:
mov al, [ebx] ;; get a character
inc ebx
mov [X + ecx], al
inc ecx
cmp al, 0
jz done
jmp copyarray
My question is whether this is the correct method of copying the array, and how I can display the contents of the array afterwards?
Thank you!
The loop looks ok, but clunky. If your program is crashing, use a debugger. See the x86 tag wiki for links and a quick intro to gdb for asm.
I think you're getting argv[1] loaded correctly. (Note that this is the first command-line arg, though. argv[0] is the command name.) https://en.wikibooks.org/wiki/X86_Disassembly/Functions_and_Stack_Frames says ebp+12 is the usual spot for the 2nd arg to 32-bit functions that bother to set up stack frames.
Michael Petch commented on Simon's deleted answer that the asm_io library has print_int, print_string, print_char, and print_nl routines, among a few others. So presumably you pass a pointer to your buffer to one of those functions and call it a day. Or you could call sys_write(2) directly with an int 0x80 instruction, since you don't need to do any string formatting and you already have the length.
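For example, writing the buffer with sys_write directly could look something like this (a sketch; it assumes that after the copy loop, ecx holds the number of bytes stored including the terminating 0):
    mov edx, ecx          ; byte count for sys_write
    dec edx               ; don't write the trailing 0
    mov ecx, X            ; buffer address
    mov ebx, 1            ; fd 1 = stdout
    mov eax, 4            ; sys_write
    int 0x80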
Instead of incrementing separately for two arrays, you could use the same index for both, with an indexed addressing mode for the load.
;; main (int argc ([esp+4]), char **argv ([esp+8]))
... some code you didn't show that I guess pushes some more stuff on the stack
mov eax, dword [ebp+12] ; load argv
;; eax + 4 is &argv[1], the address of the 1st cmdline arg (not counting the command name)
mov esi, dword [eax + 4] ; esi holds 2nd arg, which is pointer to string
xor ecx, ecx
copystring:
mov al, [esi + ecx]
mov [X + ecx], al
inc ecx
test al, al
jnz copystring
I changed the comments to say "cmdline arg", to distinguish between those and "function arguments".
When it doesn't cost any extra instructions, use esi for source pointers, edi for dest pointers, for readability.
Check the ABI for which registers you can use without saving/restoring (eax, ecx, and edx at least. That might be all for 32bit x86.). Other registers have to be saved/restored if you want to use them. At least, if you're making functions that follow the usual ABI. In asm you can do what you like, as long as you don't tell a C compiler to call non-standard functions.
Also note the improvement in the end of the loop. A single jnz to loop is more efficient than jz break / jmp.
This should run at one cycle per byte on Intel, because test/jnz macro-fuse into one uop. The load is one uop, and the store micro-fuses into one uop. inc is also one uop. Intel CPUs since Core2 are 4-wide: 4 uops issued per clock.
Your original loop runs at half that speed. Since it's 6 uops, it takes 2 clock cycles to issue an iteration.
Another hacky way to do this would be to get the offset between X and ebx into another register, so one of the effective addresses could use a one-register addressing mode, even if the dest wasn't a static array.
mov [ebx + ecx], al (where ecx = X - start_of_src_buf). But ideally you'd make the store the one that uses a one-register addressing mode, unless the load is a memory operand to an ALU instruction that could micro-fuse it. Where the dest is a static buffer, this address-difference hack isn't useful at all.
You can't use rep string instructions (like rep movsb) to implement strcpy for implicit-length strings (C null-terminated, rather than with a separately-stored length). Well, you could, but only by scanning the source twice: once to find the length, again to memcpy.
To go faster than one byte per clock, you'd have to use vector instructions to test for the null byte at any of 16 positions in parallel. Google up an optimized strcpy implementation for an example. Probably using pcmpeqb against a vector register of all-zeros.
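The core of such a check might look like this (an SSE2 sketch; alignment handling and the actual copying are left out):
    pxor     xmm1, xmm1          ; all-zero vector to compare against
.checkblock:
    movdqu   xmm0, [esi + ecx]   ; 16 source bytes (unaligned load, for the sketch)
    pcmpeqb  xmm0, xmm1          ; bytes that were 0 become 0xFF
    pmovmskb eax, xmm0           ; one mask bit per byte
    test     eax, eax
    jnz      .found              ; a set bit means a terminator; bsf eax, eax gives its position
    ; ... store these 16 bytes to the destination, add ecx, 16, and repeat ...
    jmp      .checkblock
.found:
    ; copy the final partial chunk byte-by-byte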
I'm trying to write a little program in assembler which takes three char arrays as input, calculates the average of each pair of elements in the first two arrays, and stores the result in the third array, like below.
%macro prologue 0
push rbp
mov rbp,rsp
push rbx
push r12
push r13
push r14
push r15
%endmacro
%macro epilogue 0
pop r15
pop r14
pop r13
pop r12
pop rbx
leave
ret
%endmacro
segment .data
offset db 1
segment .bss
a1 resq 1
a2 resq 1
avg resq 1
avgL resd 1
segment .text
global avgArray
avgArray:
prologue
mov [a1], rdi
mov [a2], rsi
mov [avg], rdx
mov [avgL], rcx
mov rsi, [a1]
mov r9, [a2]
mov rdi, [avg]
mov rcx, rsi
add rcx, [avgL] ; array length
xor rdx, rdx
xor rax, rax
xor rbx, rbx
avgArray_loop:
mov al, [rsi]
mov dl, [r9]
add ax, dx
shr ax, 1
mov [rdi], al
add rsi, [offset]
add r9, [offset]
add rdi, [offset]
cmp rsi, rcx
jb avgArray_loop
epilogue
When replacing [offset] with 1 it works perfectly fine. However, when using [offset] to determine the next array element, it seems that it won't add its value to rsi, rdi and r9.
I already checked it using gdb. The address stored in rsi is still the same after executing add rsi, [offset].
Can someone tell me why using [offset] won't work but adding a simple 1 does?
By the way: Linux x86_64 machine
So I found the solution to the problem.
The addresses of avgL and offset were stored directly behind each other. When reading from rcx and storing it to avgL, it also overwrote the value of offset. Declaring avgL as a QWORD instead of a DWORD prevents mov from overwriting the offset data.
The new data and bss segments look like this:
segment .data
offset db 1
segment .bss
a1 resq 1
a2 resq 1
avg resq 1
avgL resq 1
Nice work on debugging your problem yourself. Since I already started to look at the code, I'll give you some efficiency / style critique as added comments:
%macro prologue 0
push rbp
mov rbp,rsp ; you can drop this and the LEAVE.
; Stack frames were useful before debuggers could keep track of things without them, and as a convenience
; so local variables were always at the same offset from your base pointer, even while you were pushing/popping stuff on the stack.
; With the SysV ABI, you can use the red zone for locals without even
; fiddling with RSP at all, if you don't push/pop or call anything.
push rbx
push r12
push r13
push r14
push r15
%endmacro
%macro epilogue 0
pop r15
pop r14
pop r13
pop r12
pop rbx
leave
ret
%endmacro
segment .data
offset db 1
segment .bss ; These should really be locals on the stack (or in regs!), not globals
a1 resq 1
a2 resq 1
avg resq 1
avgL resd 1
segment .text
; usually a comment with a C function prototype and description is a good idea for functions
global avgArray
avgArray:
prologue
mov [a1], rdi ; what is this silliness? you have 16 registers for a reason.
mov [a2], rsi ; shuffling the values you want into the regs you want them in
mov [avg], rdx ; is best done with reg-reg moves.
mov [avgL], rcx ; I like to just put a comment at the top of a block of code
; to document what goes in what reg.
mov rsi, [a1]
mov r9, [a2]
mov rdi, [avg]
mov rcx, rsi
add rcx, [avgL] ; This could be lea rcx, [rsi+rcx]
; (since avgL is in rcx anyway as a function arg).
xor rdx, rdx
xor rax, rax
xor rbx, rbx
avgArray_loop: ; you can use a local label here, starting with a .
; You don't need a diff name for each loop: the assembler will branch to the most recent instance of that label
mov al, [rsi] ; there's a data dependency on the old value of ax
mov dl, [r9] ; since the CPU doesn't "know" that shr ax, 1 will always leave ah zeroed in this algorithm
add ax, dx ; Avoid ALU ops on 16bit regs whenever possible. (8bit is fine, they have diff opcodes instead of a prefix)
; to avoid decode stalls on Intel
shr ax, 1 ; Better to use 32bit regs (movsx/movzx)
mov [rdi], al
add rsi, [offset] ; These are 64bit adds, so you're reading 7 bytes after the 1 you set with db.
add r9, [offset]
add rdi, [offset]
cmp rsi, rcx
jb avgArray_loop
epilogue
You have tons of registers free, why are you keeping the loop increment in memory? I hope it just ended up that way while debugging / trying stuff.
Also, 1-reg addressing modes are only more efficient when used as mem operands for ALU ops. Just increment a single counter and use base+index*scale addressing when you have a lot of pointers (unless you're unrolling the loop), esp. if you load them with mov.
Here's how I'd do it (with perf analysis for Intel SnB and later):
scalar
; no storage needed
segment .text
GLOBAL avgArray
avgArray:
; void avgArray (uint8_t *avg, const uint8_t *a1, const uint8_t *a2, size_t len)
; if you can choose your prototype, do it so args go where you want them anyway.
; prologue
; rdi = avg
; rsi = a1
; rdx = a2
; rcx = len
; mov [rsp-8], rcx ; if I wanted to spill len to memory
add rdi, rcx
add rsi, rcx
add rdx, rcx
neg rcx ; now [rdi+rcx] is the start of dest, and we can count rcx upwards towards zero.
; We could also have just counted down towards zero
; but HW memory prefetchers have more stream slots for forward patterns than reverse.
ALIGN 16
.loop:
; use movsx for signed char
movzx eax, byte [rsi+rcx] ; dependency-breaker
movzx r8d, byte [rdx+rcx] ; Using r8d to save push/pop of rbx
; on pre-Nehalem where insn decode can be a bottleneck even in tight loops
; using ebx or ebp would save a REX prefix (1 insn byte).
add eax, r8d
shr eax, 1
mov [rdi+rcx], al
inc rcx ; No cmp needed: this is the point of counting up towards zero
jl .loop ; inc/jl can Macro-fuse into one uop
; nothing to pop, we only used caller-saved regs.
ret
On Intel, the loop is 7 uops, (the store is 2 uops: store address and store data, and can't micro-fuse), so a CPU that can issue 4 uops per cycle will do it at 2 cycles per byte. movzx (to a 32 or 64bit reg) is 1 uop regardless, because there isn't a port 0/1/5 uop for it to micro-fuse with or not. (It's a read, not read-modify).
7 uops takes 2 chunks of up-to-4 uops, so the loop can issue in 2 cycles. There are no other bottlenecks that should prevent the execution units from keeping up with that, so it should run one per 2 cycles.
vector
There's a vector instruction to do exactly this operation: PAVGB is packed avg of unsigned bytes (with a 9-bit temporary to avoid overflow, same as your add/shr).
; no storage needed
segment .text
GLOBAL avgArray
avgArray:
; void avgArray (uint8_t *avg, const uint8_t *a1, const uint8_t *a2, size_t len)
; rdi = avg
; rsi = a1
; rdx = a2
; rcx = len
; same setup
; TODO: scalar loop here until [rdx+rcx] is aligned.
ALIGN 16
.loop:
; use movsx for signed char
movdqu xmm0, [rsi+rcx] ; 1 uop
pavgb xmm0, [rdx+rcx] ; 2 uops (no micro-fusion)
movdqu [rdi+rcx], xmm0 ; 2 uops: no micro-fusion
add rcx, 16
jl .loop ; 1 macro-fused uop add/branch
; TODO: scalar cleanup.
ret
Getting the loop-exit condition right is tricky, since you need to end the vector loop if the next 16B goes off the end of the array. Prob. best to handle that by decrementing rcx by 15 or something before adding it to the pointers.
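One way to set that up, building on the count-up-towards-zero trick from the scalar version (a sketch; it treats len as a signed count for simplicity, and the scalar tail is still left as a TODO):
    ; rcx = len on entry
    sub  rcx, 15            ; last index where a full 16-byte load still fits is len-16
    jle  .scalar_tail       ; len < 16: let the scalar loop handle everything
    add  rdi, rcx
    add  rsi, rcx
    add  rdx, rcx
    neg  rcx                ; now [rsi+rcx] = a1[0], and the vector loop runs while rcx < 0
    ; ... the vector loop from above goes here ...
.scalar_tail:
    ; TODO: scalar loop for the last (up to 15) bytes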
So again, 6 uops / 2 cycles per iteration, but each iteration will do 16 bytes. It's ideal to unroll so your loop is a multiple of 4 uops, so you're not losing out on issue rate with a cycle of less-than-4 uops at the end of a loop. 2 loads / 1 store per cycle is our bottleneck here, since PAVGB has a throughput of 2 per cycle.
16B / cycle shouldn't be difficult on Haswell and later. With AVX2 using ymm registers, you'd get 32B / cycle. (SnB/IvB can only do two memory ops per cycle, at most one of them a store, unless you use 256b loads/stores). Anyway, at this point you've gained a massive 16x speedup from vectorizing, and usually that's good enough. I just enjoy tuning things for theoretical-max throughput by counting uops and unrolling. :)
If you were going to unroll the loop at all, then it would be worth incrementing pointers instead of just an index. (So there would be two uses of [rdx] and one add, vs. two uses of [rdx+rcx]).
Either way, cleaning up the loop setup and keeping everything in registers saves a decent amount of instruction bytes, and overhead for short arrays.
I have this simple binary search member function, where lastIndex, nIter and xi are class members:
uint32 scalar(float z) const
{
uint32 lo = 0;
uint32 hi = lastIndex;
uint32 n = nIter;
while (n--) {
int mid = (hi + lo) >> 1;
// defining this if-else assignment as below cause VS2015
// to generate two cmov instructions instead of a branch
if( z < xi[mid] )
hi = mid;
if ( !(z < xi[mid]) )
lo = mid;
}
return lo;
}
Both gcc and VS 2015 translate the inner loop with a code flow branch:
000000013F0AA778 movss xmm0,dword ptr [r9+rax*4]
000000013F0AA77E comiss xmm0,xmm1
000000013F0AA781 jbe Tester::run+28h (013F0AA788h)
000000013F0AA783 mov r8d,ecx
000000013F0AA786 jmp Tester::run+2Ah (013F0AA78Ah)
000000013F0AA788 mov edx,ecx
000000013F0AA78A mov ecx,r8d
Is there a way, without writing assembler inline, to convince them to use exactly 1 comiss instruction and 2 cmov instructions?
If not, can anybody suggest how to write a gcc assembler template for this?
Please note that I am aware that there are variations of the binary search algorithm where it is easy for the compiler to generate branch free code, but this is beside the question.
Thanks
As Matteo Italia already noted, this avoidance of conditional-move instructions is a quirk of GCC version 6. What he didn't notice, though, is that it applies only when optimizing for Intel processors.
With GCC 6.3, when targeting AMD processors (i.e., -march= any of k8, k10, opteron, amdfam10, btver1, bdver1, btver2, bdver2, bdver3, bdver4, znver1, and possibly others), you get exactly the code you want:
mov esi, DWORD PTR [rdi]
mov ecx, DWORD PTR [rdi+4]
xor eax, eax
jmp .L2
.L7:
lea edx, [rax+rsi]
mov r8, QWORD PTR [rdi+8]
shr edx
mov r9d, edx
movss xmm1, DWORD PTR [r8+r9*4]
ucomiss xmm1, xmm0
cmovbe eax, edx
cmova esi, edx
.L2:
dec ecx
cmp ecx, -1
jne .L7
rep ret
When optimizing for any generation of Intel processor, GCC 6.3 avoids conditional moves, preferring an explicit branch:
mov r9d, DWORD PTR [rdi]
mov ecx, DWORD PTR [rdi+4]
xor eax, eax
.L2:
sub ecx, 1
cmp ecx, -1
je .L6
.L8:
lea edx, [rax+r9]
mov rsi, QWORD PTR [rdi+8]
shr edx
mov r8d, edx
vmovss xmm1, DWORD PTR [rsi+r8*4]
vucomiss xmm1, xmm0
ja .L4
sub ecx, 1
mov eax, edx
cmp ecx, -1
jne .L8
.L6:
ret
.L4:
mov r9d, edx
jmp .L2
The likely justification for this optimization decision is that conditional moves are fairly inefficient on Intel processors. CMOV has a latency of 2 clock cycles on Intel processors compared to a 1-cycle latency on AMD. Additionally, while CMOV instructions are decoded into multiple µops (at least two, with no opportunity for µop fusion) on Intel processors because of the requirement that a single µop has no more than two input dependencies (a conditional move has at least three: the two operands and the condition flag), AMD processors can implement a CMOV with a single macro-operation since their design has no such limits on the input dependencies of a single macro-op. As such, the GCC optimizer is replacing branches with conditional moves only on AMD processors, where it might be a performance win—not on Intel processors and not when tuning for generic x86.
(Or, maybe the GCC devs just read Linus's infamous rant. :-)
Intriguingly, though, when you tell GCC to tune for the Pentium 4 processor (and you can't do this for 64-bit builds for some reason; GCC tells you that this architecture doesn't support 64-bit, even though there were definitely P4 processors that implemented EM64T), you do get conditional moves:
push edi
push esi
push ebx
mov esi, DWORD PTR [esp+16]
fld DWORD PTR [esp+20]
mov ebx, DWORD PTR [esi]
mov ecx, DWORD PTR [esi+4]
xor eax, eax
jmp .L2
.L8:
lea edx, [eax+ebx]
shr edx
mov edi, DWORD PTR [esi+8]
fld DWORD PTR [edi+edx*4]
fucomip st, st(1)
cmovbe eax, edx
cmova ebx, edx
.L2:
sub ecx, 1
cmp ecx, -1
jne .L8
fstp st(0)
pop ebx
pop esi
pop edi
ret
I suspect this is because branch misprediction is so expensive on Pentium 4, due to its extremely long pipeline, that the possibility of a single mispredicted branch outweighs any minor gains you might get from breaking loop-carried dependencies and the tiny amount of increased latency from CMOV. Put another way: mispredicted branches got a lot slower on P4, but the latency of CMOV didn't change, so this biases the equation in favor of conditional moves.
Tuning for later architectures, from Nocona to Haswell, GCC 6.3 goes back to its strategy of preferring branches over conditional moves.
So, although this looks like a major pessimization in the context of a tight inner loop (and it would look that way to me, too), don't be so quick to dismiss it out of hand without a benchmark to back up your assumptions. Sometimes, the optimizer is not as dumb as it looks. Remember, the advantage of a conditional move is that it avoids the penalty of branch mispredictions; the disadvantage of a conditional move is that it increases the length of a dependency chain and may require additional overhead because, on x86, only register→register or memory→register conditional moves are allowed (no constant→register). In this case, everything is already enregistered, but there is still the length of the dependency chain to consider. Agner Fog, in his Optimizing Subroutines in Assembly Language, gives us the following rule of thumb:
[W]e can say that a conditional jump is faster than a conditional move if the code is part of a dependency chain and the prediction rate is better than 75%. A conditional jump is also preferred if we can avoid a lengthy calculation ... when the other operand is chosen.
The second part of that doesn't apply here, but the first does. There is definitely a loop-carried dependency chain here, and unless you get into a really pathological case that disrupts branch prediction (which normally has a >90% accuracy), branching may actually be faster. In fact, Agner Fog continues:
Loop-carried dependency chains are particularly sensitive to the disadvantages of conditional moves. For example, [this code]
// Example 12.16a. Calculate pow(x,n) where n is a positive integer
double x, xp, power;
unsigned int n, i;
xp=x; power=1.0;
for (i = n; i != 0; i >>= 1) {
if (i & 1) power *= xp;
xp *= xp;
}
works more efficiently with a branch inside the loop than with a conditional move, even if the branch is poorly predicted. This is because the floating point conditional move adds to the loop-carried dependency chain and because the implementation with a conditional move has to calculate all the power*xp values, even when they are not used.
Another example of a loop-carried dependency chain is a binary search in a sorted list. If the items to search for are randomly distributed over the entire list then the branch prediction rate will be close to 50% and it will be faster to use conditional moves. But if the items are often close to each other so that the prediction rate will be better, then it is more efficient to use conditional jumps than conditional moves because the dependency chain is broken every time a correct branch prediction is made.
If the items in your list are actually random or close to random, then you'll be the victim of repeated branch-prediction failure, and conditional moves will be faster. Otherwise, in what is probably the more common case, branch prediction will succeed >75% of the time, such that you will experience a performance win from branching, as opposed to a conditional move that would extend the dependency chain.
It's hard to reason about this theoretically, and it's even harder to guess correctly, so you need to actually benchmark it with real-world numbers.
If your benchmarks confirm that conditional moves really would be faster, you have a couple of options:
Upgrade to a later version of GCC, like 7.1, that generates conditional moves in 64-bit builds even when targeting Intel processors.
Tell GCC 6.3 to optimize your code for AMD processors. (Maybe even just having it optimize one particular code module, so as to minimize the global effects.)
Get really creative (and ugly and potentially non-portable), writing some bit-twiddling code in C that does the comparison-and-set operation branchlessly. This might get the compiler to emit a conditional-move instruction, or it might get the compiler to emit a series of bit-twiddling instructions. You'd have to check the output to be sure, but if your goal is really just to avoid branch misprediction penalties, then either will work.
For example, something like this:
inline uint32 ConditionalSelect(bool condition, uint32 value1, uint32 value2)
{
const uint32 mask = condition ? static_cast<uint32>(-1) : 0;
uint32 result = (value1 ^ value2); // get bits that differ between the two values
result &= mask; // select based on condition
result ^= value2; // condition ? value1 : value2
return result;
}
which you would then call inside of your inner loop like so:
hi = ConditionalSelect(z < xi[mid], mid, hi);
lo = ConditionalSelect(z < xi[mid], lo, mid);
GCC 6.3 produces the following code for this when targeting x86-64:
mov rdx, QWORD PTR [rdi+8]
mov esi, DWORD PTR [rdi]
test edx, edx
mov eax, edx
lea r8d, [rdx-1]
je .L1
mov r9, QWORD PTR [rdi+16]
xor eax, eax
.L3:
lea edx, [rax+rsi]
shr edx
mov ecx, edx
mov edi, edx
movss xmm1, DWORD PTR [r9+rcx*4]
xor ecx, ecx
ucomiss xmm1, xmm0
seta cl // <-- begin our bit-twiddling code
xor edi, esi
xor eax, edx
neg ecx
sub r8d, 1 // this one's not part of our bit-twiddling code!
and edi, ecx
and eax, ecx
xor esi, edi
xor eax, edx // <-- end our bit-twiddling code
cmp r8d, -1
jne .L3
.L1:
rep ret
Notice that the inner loop is entirely branchless, which is exactly what you wanted. It may not be quite as efficient as two CMOV instructions, but it will be faster than chronically mispredicted branches. (It goes without saying that GCC and any other compiler will be smart enough to inline the ConditionalSelect function, which allows us to write it out-of-line for readability purposes.)
However, what I would definitely not recommend is that you rewrite any part of the loop using inline assembly. All of the standard reasons apply for avoiding inline assembly, but in this instance, even the desire for increased performance isn't a compelling reason to use it. You're more likely to confuse the compiler's optimizer if you try to throw inline assembly into the middle of that loop, resulting in sub-par code worse than what you would have gotten otherwise if you'd just left the compiler to its own devices. You'd probably have to write the entire function in inline assembly to get good results, and even then, there could be spill-over effects from this when GCC's optimizer tried to inline the function.
What about MSVC? Well, different compilers have different optimizers and therefore different code-generation strategies. Things can start to get really ugly really quickly if you have your heart set on cajoling all target compilers to emit a particular sequence of assembly code.
On MSVC 19 (VS 2015), when targeting 32-bit, you can write the code the way you did to get conditional-move instructions. But this doesn't work when building a 64-bit binary: you get branches instead, just like with GCC 6.3 targeting Intel.
There is a nice solution, though, that works well: use the conditional operator. In other words, if you write the code like this:
hi = (z < xi[mid]) ? mid : hi;
lo = (z < xi[mid]) ? lo : mid;
then VS 2013 and 2015 will always emit CMOV instructions, whether you're building a 32-bit or 64-bit binary, whether you're optimizing for size (/O1) or speed (/O2), and whether you're optimizing for Intel (/favor:Intel64) or AMD (/favor:AMD64).
This does fail to result in CMOV instructions back on VS 2010, but only when building 64-bit binaries. If you needed to ensure that this scenario also generated branchless code, then you could use the above ConditionalSelect function.
As said in the comments, there's no easy way to force what you are asking, although it seems that recent (>4.4) versions of gcc already optimize it like you said. Edit: interestingly, the gcc 6 series seems to use a branch, unlike both the gcc 5 and gcc 7 series, which use two cmov.
The usual __builtin_expect probably cannot do much into pushing gcc to use cmov, given that cmov is generally convenient when it's difficult to predict the result of a comparison, while __builtin_expect tells the compiler what is the likely outcome - so you would be just pushing it in the wrong direction.
Still, if you find that this optimization is extremely important, your compiler version typically gets it wrong and for some reason you cannot help it with PGO, the relevant gcc assembly template should be something like:
__asm__ (
"comiss %[xi_mid],%[z]\n"
"cmovb %[mid],%[hi]\n"
"cmovae %[mid],%[lo]\n"
: [hi] "+r"(hi), [lo] "+r"(lo)
: [mid] "rm"(mid), [xi_mid] "xm"(xi[mid]), [z] "x"(z)
: "cc"
);
The used constraints are:
hi and lo are in the "write" variables list, with the +r constraint, as cmov can only work with registers as target operands, and we are conditionally overwriting just one of them (we cannot use =, as it implies that the value is always overwritten, so the compiler would be free to give us a different target register than the current one, and use it to refer to that variable after our asm block);
mid is in the "read" list, rm as cmov can take either a register or a memory operand as input value;
xi[mid] and z are in the "read" list;
z has the special x constraint that means "any SSE register" (the comiss destination has to be an SSE register);
xi[mid] has xm, as the comiss source operand allows a memory operand; given the choice between z and xi[mid], I chose the latter as the better candidate for being taken directly from memory, given that z is already in a register (due to the System V calling convention, and it is going to be cached between iterations anyway) and xi[mid] is used just in this comparison;
cc (the FLAGS register) is in the "clobber" list - we do clobber the flags and nothing else.