I'm trying to learn some basic x86 32-bit assembly programming. So in pursuing this I decided to implement quicksort in assembly (sorting only integers). First I made a C-version of the sorting function and then I made an assembly version.
However, when comparing my assembly version with the my C-version (compiled with gcc on Debian), the C-version performs more then 10 times faster on a array of 10000 integers.
So my question is if anybody can give some feedback on obvious optimizations that can be made on my quick sort assembly routine. It's purely for educational purposes and I'm not expecting to beat the compiler makers in terms of producing high speed code but I'm interested in knowing if I'm making any obvious mistakes that hampers speed.
The C-version:
void myqsort(int* elems, int sidx, int eidx)
{
if (sidx < eidx)
{
int pivot = elems[eidx];
int i = sidx;
for (int j = sidx; j < eidx; j++)
{
if (elems[j] <= pivot)
{
swap(&elems[i], &elems[j]);
i = i + 1;
}
}
swap(&elems[i], &elems[eidx]);
myqsort(elems, sidx, i - 1);
myqsort(elems, i + 1, eidx);
}
}
void swap(int* a, int* b)
{
int tmp = *a;
*a = *b;
*b = tmp;
}
Assembly version (NASM):
;
; void asm_quick_sort(int* elems, int startindex, int endindex)
; Params:
; elems - pointer to elements to sort - [ebp + 0x8]
; sid - start index of items - [ebp + 0xC]
; eid - end index of items - [ebp + 0x10]
asm_quick_sort:
push ebp
mov ebp, esp
push edi
push esi
push ebx
mov eax, dword [ebp + 0xC] ; store start index, = i
mov ebx, dword [ebp + 0x10] ; store end index
mov esi, dword [ebp + 0x8] ; store pointer to first element in esi
cmp eax, ebx
jnl qsort_done
mov ecx, eax ; ecx = j, = sid
mov edx, dword [esi + (0x4 * ebx)] ; pivot element, elems[eid], edx = pivot
qsort_part_loop:
; for j = sid; j < eid; j++
cmp ecx, ebx ; if ecx < end index
jnb qsort_end_part
; if elems[j] <= pivot
cmp edx, dword [esi + (0x4*ecx)]
jb qsort_cont_loop
; do swap, elems[i], elems[j]
push edx ; save pivot for now
mov edx, dword [esi + (0x4*ecx)] ; edx = elems[j]
mov edi, dword [esi + (0x4*eax)] ; edi = elems[i]
mov dword [esi + (0x4*eax)], edx ; elems[i] = elems[j]
mov dword [esi + (0x4*ecx)], edi ; elems[j] = elems[i]
pop edx ; restore pivot
; i++
add eax, 0x1
qsort_cont_loop:
add ecx, 0x1
jmp qsort_part_loop
qsort_end_part:
; do swap, elems[i], elems[eid]
mov edx, dword [esi + (0x4*eax)] ; edx = elems[i]
mov edi, dword [esi + (0x4*ebx)] ; edi = elems[eid]
mov dword [esi + (0x4*ebx)], edx ; elems[eidx] = elems[i]
mov dword [esi + (0x4*eax)], edi ; elems[i] = elems[eidx]
; qsort(elems, sid, i - 1)
; qsort(elems, i + 1, eid)
sub eax, 0x1
push eax
push dword [ebp + 0xC] ; push start idx
push dword [ebp + 0x8] ; push elems vector
call asm_quick_sort
add esp, 0x8
pop eax
add eax, 0x1
push dword [ebp + 0x10] ; push end idx
push eax
push dword [ebp + 0x8] ; push elems vector
call asm_quick_sort
add esp, 0xC
qsort_done:
pop ebx
pop esi
pop edi
mov esp, ebp
pop ebp
ret
I call the assembly routine from C and I use clock() for timing the routines.
EDIT
The difference in performance is no longer an issue after correcting the bugs pointed out by my fellow stackoverflowers.
You have an error in your assembly sort implementation, and speed comparisons are useless until you resolve it. The problem is the recursive call:
myqsort(elems, sidx, i - 1);
Seeing as it is not necessarily the case that i is not sidx, this might pass a value less than sidx to the function, including -1 if sidx is 0. This is handled in your C implementation:
if (sidx < eidx)
But in your assembly version:
cmp eax, ebx
jae qsort_done
That's an unsigned comparison branch instruction! You should be using jge. I see a segfault due to this problem. When fixed, the performance of both implementations appears to be roughly the same according to my quick tests (compiling with -O3). I used the following test driver:
#include <stdlib.h>
#include <stdio.h>
void myqsort(int * elems, int sidx, int eidx);
#define SIZE 100000
int main(int argc, char **argv)
{
int * elems = malloc(SIZE * sizeof(int));
for (int j = 0; j < 1000; j++) {
for (int i = 0; i < SIZE; i++) {
elems[i] = rand();
}
myqsort(elems, 0, SIZE - 1);
}
return 0;
}
With the C version, run-time was approx 5.854 seconds.
With the assembly version, it was 5.829 seconds (i.e. slightly faster).
You can optimize the swapping of elements using only 1 additional register EDI and without the need for pushing and popping the pivot value in EDX:
mov edi, dword [esi + (0x4*eax)] ; edi = elems[i]
xchg dword [esi + (0x4*ecx)], edi ; elems[j] = edi, edi = elems[j]
mov dword [esi + (0x4*eax)], edi ; elems[i] = edi
The second swap can also be shortened:
mov edi, dword [esi + (0x4*ebx)] ; edi = elems[eid]
xchg dword [esi + (0x4*eax)], edi ; elems[i] = edi, edi = elems[i]
mov dword [esi + (0x4*ebx)], edi ; elems[eid] = edi
You can safely remove the mov esp, ebp from your epilog code because it is redundant. If those 3 pop's went well you already know that the stackpointer has the correct value.
qsort_done:
pop ebx
pop esi
pop edi
mov esp, ebp <-- This is useless!
pop ebp
ret
Related
I am trying to write an iterative version of InsertionSort using MASM. After repeatedly getting unexpected errors, I tried running through the code line by line and watching in the debugger that it did what I expected. Sure enough, I the cmp instruction within my 'while loop' doesn't seem to be jumping every time it's supposed to:
; store tmp in edx
movzx edx, word ptr[ebx + esi * 2]; edx: tmp = A[i]
...
//blah blah
while_loop :
...
movzx eax, word ptr[ebx + 2 * edi]
cmp dx, ax
jge exit_while
For example, if I use the data given, after the first 'for loop' iteration, I get to the point where EDX=0000ABAB, and EAX=00003333. I reach the lines:
cmp dx, ax
jge exit_while
Since ABAB>3333, I'd expect it to jump to exit_while, yet it doesn't!
What is going on here??? I'm at a total loss.
.data
arr word 3333h, 1111h, 0ABABh, 1999h, 25Abh, 8649h, 0DEh, 99h
sizeArr dword lengthof arr
printmsg byte "The array is: [ ", 0
comma byte ", ", 0
endmsg byte " ]", 0
.code
main proc
call printArr
call crlf
push sizeArr
push offset arr
call insertionsort
call crlf
call printArr
call crlf
call exitprocess
main endp
insertionsort proc
push ebp
mov ebp, esp
_arr = 8
len = _arr + 4
mov ebx, [ebp + _arr]
mov ecx, [ebp + len]
dec ecx
mov esi, 1; store i in esi, with i=1 initially
; for (i = 1; i < SIZE; i++)
forloop:
; store tmp in edx
movzx edx, word ptr[ebx + esi * 2]; edx: tmp = A[i]
; store j in edi, where in initially j = i
mov edi, esi
;set j=i-1
dec edi
;while (j >= 0 && tmp<arr[j])
while_loop :
cmp edi, 0
jl exit_while
movzx eax, word ptr[ebx + 2 * edi]
; cmp dx, [ebx + 2 * edi]
cmp dx, ax
jge exit_while
; A[j] = A[j-1]
push word ptr [ebx+2*edi]
pop word ptr [ebx+2*edi+2]
; j = j - 1
dec edi
jmp while_loop
exit_while:
push dx
pop word ptr[ebx + 2*edi+2]
; mov[ebx + edi], dx; A[j] = tmp
inc esi; i = i + 1
loop forloop
finished:
mov esp, ebp
pop ebp
ret 8
insertionsort endp
printArr proc
push ebp
mov ebp, esp
mov ebx, offset sizeArr
mov ecx, [ebx]
mov esi, offset arr
mov edx, offset printmsg
call writestring
mov edx, offset comma
loop1 :
movzx eax, word ptr [esi]
call writeHex
call writestring
add esi, 2
loop loop1
mov edx, offset endmsg
call writestring
mov esp, ebp
pop ebp
ret
printArr endp
jge is the version for signed values - as such, a word with the value ABAB is negative - hence the comparison result you see.
Try jae (jump if above or equal) instead - the unsigned equivalent.
I wrote this classic function : (in 32-bit mode)
void ex(size_t a, size_t b)
{
size_t c;
c = a;
a = b;
b = c;
}
I call it inside the main as follows :
size_t a = 4;
size_t b = 5;
ex(a,b);
What I was expecting from the assembly code generated when entering the function is something like this :
1-Push the values of b and a in the stack : (which was done)
mov eax,dword ptr [b]
push eax
mov ecx,dword ptr [a]
push ecx
2-Use the values of a and b in the stack :
push ebp
mov ebp, esp
sub esp, 4
c = a;
mov eax, dword ptr [ebp+8]
mov dword ptr [ebp-4], eax
and so on for the other variables.
However, this is what I find when debugging :
push ebp
mov ebp,esp
sub esp,0CCh // normal since it's in debug with ZI option
push ebx
push esi
push edi
lea edi,[ebp-0CCh]
mov ecx,33h
mov eax,0CCCCCCCCh
rep stos dword ptr es:[edi]
size_t c;
c = a;
mov eax,dword ptr [a]
mov dword ptr [c],eax
Why is it using the variable a directly instead of calling the value stored in the stack? I don't understand...
The debugger doesn't show the instruction using ebp to access a. The same syntax is permitted when you write inline assembly. Otherwise the reason that dword ptr still appears.
It is easy to get it your preferred way, right click > untick "Show Symbol Names".
Using the assembly output option (right click on file name, properties, ...), I get what you expect from debug assembly output. This could depend on which version of VS you use. For this example, I used VS2005. I have VS2015 on a different system, but didn't try it yet.
_c$ = -8 ; size = 4
_a$ = 8 ; size = 4
_b$ = 12 ; size = 4
_ex PROC ; COMDAT
push ebp
mov ebp, esp
sub esp, 204 ; 000000ccH
push ebx
push esi
push edi
lea edi, DWORD PTR [ebp-204]
mov ecx, 51 ; 00000033H
mov eax, -858993460 ; ccccccccH
rep stosd ;fill with 0cccccccch
mov eax, DWORD PTR _a$[ebp]
mov DWORD PTR _c$[ebp], eax
mov eax, DWORD PTR _b$[ebp]
mov DWORD PTR _a$[ebp], eax
mov eax, DWORD PTR _c$[ebp]
mov DWORD PTR _b$[ebp], eax
pop edi
pop esi
pop ebx
mov esp, ebp
pop ebp
ret 0
_ex ENDP
Note this doesn't work, you need to use pointers for the swap to work.
void ex(size_t *pa, size_t *pb)
{
size_t c;
c = *pa;
*pa = *pb;
*pb = c;
}
which gets translated into:
_c$ = -8 ; size = 4
_pa$ = 8 ; size = 4
_pb$ = 12 ; size = 4
_ex PROC ; COMDAT
push ebp
mov ebp, esp
sub esp, 204 ; 000000ccH
push ebx
push esi
push edi
lea edi, DWORD PTR [ebp-204]
mov ecx, 51 ; 00000033H
mov eax, -858993460 ; ccccccccH
rep stosd
mov eax, DWORD PTR _pa$[ebp]
mov ecx, DWORD PTR [eax]
mov DWORD PTR _c$[ebp], ecx
mov eax, DWORD PTR _pa$[ebp]
mov ecx, DWORD PTR _pb$[ebp]
mov edx, DWORD PTR [ecx]
mov DWORD PTR [eax], edx
mov eax, DWORD PTR _pb$[ebp]
mov ecx, DWORD PTR _c$[ebp]
mov DWORD PTR [eax], ecx
pop edi
pop esi
pop ebx
mov esp, ebp
pop ebp
ret 0
_ex ENDP
I write the following code to read some numbers ranging from -15 to 15 from the user and the user may define how many numbers to enter. Then I bubble sort the array to get the smallest number. (Bubble sort because I will need to print other information) However, the code is not working. Here is my code.
// oops.cpp : Defines the entry point for the console application.
//
#include "stdafx.h"
int _tmain(int argc, _TCHAR* argv[])
{
char message0[] = "How many numbers do you want to enter? \n";
char message1[] = "Enter the current reading: \n";
char message2[] = "Error!\n";
char message3[] = "The smallest number is: \n";
char format1[] = "%d";
char format2[] = "%s";;
int myarray[10000];
int No;
int counter;
int *p;
p = myarray - 1;
_asm{
lea eax, message0
push eax
call printf
add esp, 4
//read how many numbers the user would like to input
lea eax,counter
push eax
lea eax, format1
push eax
call scanf_s
add esp,8
mov No, 1
mov ecx, counter
mov ebx, 0
//read user's input
Input: push ecx
push No
lea eax, message1
push eax
call printf
add esp, 8
lea eax, myarray[ebx]
push eax
lea eax, format1
push eax
call scanf_s
add esp,8
//judge if the number is in the range of -15 to 15
JudgeInput: mov eax, myarray[ebx]
cmp eax,-15
jl Illegal
cmp eax,15
jle Legal
Illegal: lea eax,message2
push eax
call printf
add esp,4
pop ecx
jmp Input
Legal: add ebx,4
inc No
pop ecx
loop Input
//bubble sort
mov esi, p
mov ecx, counter
outer : mov edx, ecx
inner : cmp edx, ecx
jz exchangeNo
mov eax, [esi + ecx * 4]
mov ebx, [esi + edx * 4]
cmp eax, ebx
jnb exchangeNo
mov[esi + ecx * 4], ebx
mov[esi + edx * 4], eax
exchangeNo :
dec edx
jnz inner
loop outer
finish:
smallest: //print the smallest number
mov ebx,0
lea eax,message3
push eax
lea eax, format2
push eax
call printf
mov eax,0
lea ebx,myarray
sub ebx,4
add ebx,No
lea eax, [ebx]
push eax
lea eax,format1
call printf
add esp,16
}
return 0;
}
It would not return the smallest number. Sometimes it returns strange characters. I get really confusing. Additionally, when I enter negative numbers, the bubble sort seems not working well.
I have solved the problem. Here is my updated code:
int _tmain(int argc, _TCHAR* argv[])
{
char message0[] = "How many numbers do you want to enter? \n";
char message1[] = "Enter the current reading: \n";
char message2[] = "Error!\n";
char message3[] = "\nThe smallest number is: ";
char format1[] = "%d";
char format2[] = "%s";;
int myarray[10000];
int No;
int counter;
int *p;
p = myarray - 1;
_asm{
lea eax, message0
push eax
call printf
add esp, 4
lea eax,counter
push eax
lea eax, format1
push eax
call scanf_s
add esp,8
//get the user's input into the array
mov No, 1
mov ecx, counter
mov ebx, 0
Input:
push ecx
push No
lea eax, message1
push eax
call printf
add esp, 8
lea eax, myarray[ebx]
push eax
lea eax, format1
push eax
call scanf_s
add esp,8
//judge if the input is between -15 and 15
JudgeInput:
mov eax, myarray[ebx]
cmp eax,-15
jl Illegal
cmp eax,15
jle Legal
//if not, print out error message
Illegal:
lea eax,message2
push eax
call printf
add esp,4
pop ecx
jmp Input
//if yes, loop again
Legal:
add ebx,4
inc No
pop ecx
loop Input
//bubble sort
mov esi, p
mov ecx, counter
//the outer loop
outer : mov edx, ecx
//the inner loop
inner : cmp edx, ecx
je exchangeNo
mov eax, [esi + ecx * 4]
mov ebx, [esi + edx * 4]
cmp eax, ebx
jge exchangeNo
mov[esi + ecx * 4], ebx
mov[esi + edx * 4], eax
exchangeNo :
dec edx
jge inner
loop outer
finish:
//find out the smallest number
smallest :
lea eax, message3
push eax
lea eax, format2
push eax
call printf
lea ebx, myarray
mov eax, [ebx]
push eax
lea eax, format1
push eax
call printf
add esp, 16
}
}
I'm having some trouble with arrays in NASM. I am trying to implement this pseudo code but am obtaining incorrect results.
For instance, when I send in 'abcdefgh' as byte_arr I should get 8 7 6 5 4 3 2 1. However, I actually receive: 5 4 3 2 1 1 1 1. Here's the code:
maxLyn(Z[0 . . n − 1], n, k) : returns integer
if k = n − 1
return 1
p ← 1
for i ← k+1 to n − 1
{
if Z[i−p] = Z[i]
if Z[i−p] > Z[i]
return p
else
p ← i+1−k
}
return p.
Rather than returning p, I wish to store it as a global variable which I can access in my other subroutines. Here is my attempt at coding it:
enter 0, 0 ; creates stack
pusha ; pushes current register values onto stack
mov eax, [ebp+16] ; kth index
mov ebx, [ebp+12] ; length
mov edi, [ebp+8] ; byte_arr
mov [len], ebx ; len = length
sub [len], dword 1 ; len = len - 1
mov ebx, [len] ; ebx contains len - 1
mov [k], eax ; k = epb+16
mov [p], dword 1 ; p = 1
mov [i], eax ; i = k
inc dword [i] ; i = k+1
mov ecx, [i] ; ecx = i
cmp [k], ebx ; checks if kth index is last element
je ENDFOR
FOR: cmp dword [i], ebx ; goes from ith index to len
ja ENDFOR
mov esi, [edi + ecx*1] ; esi = array[i]
mov [i_temp], esi ; store array[i] in variable
sub ecx, [p] ; ecx = i-p
mov edx, [edi + ecx*1] ; edx = array [i-p]
cmp edx, [i_temp]
je REPEAT ; array[i-p] = array[i]
ja ENDFOR ; array[i-p] > array[i]
mov eax, dword [i]
mov [p], eax ;p = i
inc dword [p] ;p = i + 1
mov eax, dword [p]
sub eax, [k] ;eax = p - k
mov [p], eax ;p = i+1-k
REPEAT:
inc dword [i] ; i = i + 1
mov ecx, [i] ; ecx = i + 1
jmp FOR
ENDFOR:
popa ; saves register values and pops them off
leave ; cleans up
ret ; returns to caller
mov esi, [edi + ecx*1] ; esi = array[i]
mov [i_temp], esi ; store array[i] in variable
sub ecx, [p] ; ecx = i-p
mov edx, [edi + ecx*1] ; edx = array [i-p]
cmp edx, [i_temp]
je REPEAT ; array[i-p] = array[i]
ja ENDFOR ; array[i-p] > array[i]
In these lines you read your array elements as dwords but you actually are using an array of bytes!
That's why the comparing fails. Change this cmp into:
cmp dl, byte ptr [i_temp]
Alternatively rework your code to use byte sized variables/registers where appropriate.
Can someone explain me why the following c code is equivalent to the asm code?
void arith1(){
a=(b*10)+(5/(-e));
}
Why do we put the value of b in a ecx register.
ASM code:
mov ecx, DWORD PTR _b
imul ecx, 10 ; $1 = b * 10
mov esi, DWORD PTR _e
neg esi ; $2 = -e
mov eax, 5
cdq
idiv esi ; $3 = 5 / -e
add ecx, eax ; $4 = $1 + $3
mov DWORD PTR _a, ecx ; a = $4
mov ecx, DWORD PTR _b
imul ecx, 10 ; edx:eax = b * 10
mov esi, DWORD PTR _e
neg esi ; esi = -e
mov eax, 5
cdq ; edx:eax = 5
idiv esi ; eax = 5 / -e
add ecx, eax ; ecx = b * 10 + 5 / -e
mov DWORD PTR _a, ecx ; store result to a
The only non-intuitive part is that the imul and idiv instructions combine the edx and eax registers to form a 64-bit value. The upper 32-bits are discarded after the imul instruction, C doesn't perform overflow checks. The cdq instruction (convert double to quad) is required to turn the 32-bit value into a 64-bit value before dividing.
mov ecx, DWORD PTR _b
Move the variable b into ecx register
imul ecx, 10 ; $1 = b * 10
Multiply the ecx register by 10. (b*10)
mov esi, DWORD PTR _e
Move the variable e into the esi register.
neg esi ; $2 = -e
Negate the esi register. (-e)
mov eax, 5
Move 5 into the eax register.
cdq
Convert eax into a quad-word (I think, not sure what it's needed for).
idiv esi ; $3 = 5 / -e
Divide eax by esi (5 / -e)
add ecx, eax ; $4 = $1 + $3
Add eax to ecx (b*10)+(5/-e).
mov DWORD PTR _a, ecx ; a = $4
Move ecx into variable a (a = (b*10)+(5/-e)).