I wrote a program which reads the APERF/MPERF counters on an Intel chip (page 2 on http://www.intel.com/content/dam/doc/manual/64-ia-32-architectures-software-developer-vol-3b-part-2-manual.pdf).
These counters are readable/writable via the rdmsr/wrmsr instructions, and I'm currently just reading them at a regular interval from a device driver in Windows 7. The counters are 64 bits and increment approximately once per processor clock, so you'd expect them to take a very long time to overflow; but when I read them, their values jump around as if they were being reset by another program.
Is there any way to track down what program could be resetting the counters? Could something else be causing incorrect values to be read? The relevant assembly and corresponding C functions I'm using are attached below. The 64-bit result from rdmsr is split across edx:eax (high half : low half), so to make sure I wasn't missing any bits in the r_x registers, I run the read multiple times, returning a different register each time to check them all.
C:
long long test1, test2, test3, test4;
test1 = TST1();
test2 = TST2();
test3 = TST3();
test4 = TST4();
status = RtlStringCbPrintfA(buffer, sizeof(buffer), "Value: %llu %llu %llu %llu\n", test1, test2, test3, test4);
Assembly:
;;;;;;;;;;;;;;;;;;;
PUBLIC TST1
TST1 proc
    mov ecx, 231 ; 0xE7 = IA32_MPERF
    rdmsr
    ret          ; returns rax (rdmsr put the low 32 bits of the MSR in eax)
TST1 endp
;;;;;;;;;;;;;;;;;;;
PUBLIC TST2
TST2 proc
    mov ecx, 231 ; 0xE7
    rdmsr
    mov rax, rbx ; rbx is untouched by rdmsr
    ret          ; returns rax
TST2 endp
;;;;;;;;;;;;;;;;;;;
PUBLIC TST3
TST3 proc
    mov ecx, 231 ; 0xE7
    rdmsr
    mov rax, rcx ; rcx still holds the MSR index
    ret          ; returns rax
TST3 endp
;;;;;;;;;;;;;;;;;;;
PUBLIC TST4
TST4 proc
    mov ecx, 231 ; 0xE7
    rdmsr
    mov rax, rdx ; rdx holds the high 32 bits of the MSR
    ret          ; returns rax
TST4 endp
The result that prints out looks like the lines below; the only value that ever changes is the one from rax, and it doesn't increase monotonically (it can jump around):
Value: 312664 37 231 0
Value: 252576 37 231 0
Value: 1051857 37 231 0
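Side note: the four probe functions can be collapsed into one that returns the whole counter. In 64-bit mode, rdmsr writes the low half to EAX and the high half to EDX (zeroing the upper 32 bits of RAX and RDX), so the halves can be recombined before returning. A minimal sketch in the same MASM style (RDMSR64 is an invented name):
;;;;;;;;;;;;;;;;;;;
PUBLIC RDMSR64
RDMSR64 proc
    mov ecx, 231 ; 0xE7 = IA32_MPERF
    rdmsr        ; low half -> eax, high half -> edx
    shl rdx, 32  ; shift the high half into position
    or rax, rdx  ; rax = the full 64-bit counter value
    ret
RDMSR64 endp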
I was not able to figure out what was resetting my counters, but I was able to determine the frequency. The Intel docs state that when one counter overflows, the other counter will too. So even though the counters are constantly resetting, the ratio of APERF to MPERF still represents the processor's frequency.
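For example, with the nominal (P0) frequency known, the effective frequency over a sampling interval follows from the two deltas. A sketch (the register assignments are my own, not from the code above):
; rax = APERF delta, r8 = MPERF delta, r9 = nominal frequency in MHz
mul r9              ; rdx:rax = APERF_delta * nominal_MHz
div r8              ; rax = nominal_MHz * APERF_delta / MPERF_delta
                    ;     = effective frequency in MHz over the interval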
It seems that Windows 7 and Windows 8 read and reset the writable APERF/MPERF counters on AMD processors. So on those you want to access the read-only APERF/MPERF counters instead, at MSRs 0xC00000E7/0xC00000E8.
But there is a further wrinkle: on some of the latest AMD processors (Family 0x16), those MSRs are not always supported. To determine whether they are, you have to read the EffFreqRO bit in CPUID Fn8000_0007_EDX. As stated before, all of this applies only to AMD processors.
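A sketch of that check in the same MASM style (CheckEffFreqRO is an invented name; EffFreqRO is documented as bit 10 of CPUID Fn8000_0007_EDX):
PUBLIC CheckEffFreqRO
CheckEffFreqRO proc
    push rbx            ; cpuid clobbers rbx, which is callee-saved
    mov eax, 80000007h  ; Advanced Power Management Information leaf
    cpuid
    xor eax, eax
    bt edx, 10          ; EffFreqRO
    setc al             ; return 1 if the read-only MSRs are supported
    pop rbx
    ret
CheckEffFreqRO endp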
I have a counter in an ISR (which is triggered by an external IRQ every 50us). The counter increments and wraps around at MAX_VAL (240).
I have the following code:
if(condition){
    counter++;
    counter %= MAX_VAL;
    doStuff(table[counter]);
}
I am considering an alternative implementation:
if(condition){
    //counter++; //probably I would increment before the comparison in production code
    if(++counter >= MAX_VAL){
        counter = 0;
    }
    doStuff(table[counter]);
}
I know people recommend against trying to optimize like this, but it made me wonder: on x86, which is faster, and what value of MAX_VAL would justify the second implementation?
This gets called about every 50us, so reducing the instruction count is not a bad idea. The if(++counter >= MAX_VAL) would be predicted false, so it would remove the assignment to 0 in the vast majority of cases. For my purposes I'd prefer the consistency of the %= implementation.
As #RossRidge says, the overhead will mostly be lost in the noise of servicing an interrupt on a modern x86 (probably at least 100 clock cycles, and many many more if this is part of a modern OS with Meltdown + Spectre mitigation set up).
If MAX_VAL is a power of 2, counter %= MAX_VAL is excellent, especially if counter is unsigned: it's then just a simple AND, or in this case a movzx byte-to-dword, which can have zero latency on Intel CPUs. (It still has a throughput cost, of course: Can x86's MOV really be "free"? Why can't I reproduce this at all?)
Is it possible to fill the last 255-240 entries with something harmless, or repeats of something?
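If so, the wrap becomes a single cheap instruction. Here's roughly what the increment-and-wrap turns into with MAX_VAL = 256 and an unsigned counter in eax (a sketch, not actual compiler output):
    add eax, 1
    and eax, 0xFF       ; counter %= 256: one uop, 1 cycle latency
Or, with a uint8_t counter living in cl:
    add ecx, 1
    movzx eax, cl       ; zero-extend the low byte: can be eliminated (zero latency) on Intel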
As long as MAX_VAL is a compile-time constant, though, counter %= MAX_VAL will compile efficiently to just a couple of multiplies, shifts, and adds. (Again, more efficient for unsigned.) Why does GCC use multiplication by a strange number in implementing integer division?
But a check for wrap-around is even better. A branchless check (using cmov) has lower latency than the remainder using a multiplicative inverse, and costs fewer uops for throughput.
As you say, a branchy check can take the wrap entirely off the critical path, at the cost of an occasional mispredict.
// simple version that works exactly like your question
// further optimizations assume that counter isn't used by other code in the function,
// e.g. making it a pointer or incrementing it for the next iteration
void isr_countup(int condition) {
    static unsigned int counter = 0;

    if(condition){
        ++counter;
        counter = (counter>=MAX_VAL) ? 0 : counter;  // gcc uses cmov
        //if(counter >= MAX_VAL) counter = 0;        // gcc branches
        doStuff(table[counter]);
    }
}
I compiled many versions of this on the Godbolt compiler explorer, with recent gcc and clang.
(For more about static performance analysis of throughput and latency for short blocks of x86 asm, see What considerations go into predicting latency for operations on modern superscalar processors and how can I calculate them by hand?, and other links in the x86 tag wiki, especially Agner Fog's guides.)
clang uses branchless cmov for both versions. I compiled with -fPIE in case you're using that in your kernel. If you can use -fno-pie, then the compiler can save an LEA and use mov edi, [table + 4*rcx], assuming you're on a target where static addresses in position-dependent code fit in 32-bit sign-extended constants (e.g. true in the Linux kernel, but I'm not sure whether they compile with -fPIE or do kernel ASLR with relocations when the kernel is loaded).
# clang7.0 -O3 -march=haswell -fPIE.
# gcc's output is the same (with different registers), but it uses mov edx, 0 instead of xor-zeroing for no apparent reason: the zeroing happens before the cmp that sets the flags, so xor would have been safe
isr_countup:                             # @isr_countup
    test    edi, edi
    je      .LBB1_1                      # if condition is false
    mov     eax, dword ptr [rip + isr_countup.counter]
    add     eax, 1                       # counter++
    xor     ecx, ecx
    cmp     eax, 239                     # set flags from counter vs. MAX_VAL-1
    cmovbe  ecx, eax                     # ecx = (counter <= MAX_VAL-1) ? counter : 0
    mov     dword ptr [rip + isr_countup.counter], ecx   # store the new counter
    lea     rax, [rip + table]
    mov     edi, dword ptr [rax + 4*rcx] # index the table
    jmp     doStuff@PLT                  # TAILCALL
.LBB1_1:
    ret
The block of 8 instructions starting at the load of the old counter value is a total of 8 uops (on AMD, or Intel Broadwell and later, where cmov is only 1 uop). The critical-path latency from counter being ready to table[++counter % MAX_VAL] being ready is 1 (add) + 1 (cmp) + 1 (cmov) + load-use latency for the load. i.e. 3 extra cycles. That's the latency of 1 mul instruction. Or 1 extra cycle on older Intel where cmov is 2 uops.
By comparison, the version using modulo is 14 uops for that block with gcc, including a 3-uop mul r32. The latency is at least 8 cycles; I didn't count exactly. (For throughput it's only a little bit worse, though, unless the higher latency reduces the ability of out-of-order execution to overlap execution of work that depends on the counter.)
Other optimizations
Use the old value of counter, and prepare a value for next time (taking the calculation off the critical path.)
Use a pointer instead of a counter. This saves a couple of instructions, at the cost of 8 bytes of cache footprint for the variable instead of 1 or 4. (uint8_t counter compiles nicely with some versions, just using movzx to 64-bit.)
This counts upward, so the table can be in order. It increments after loading, taking that logic off the critical path dependency chain for out-of-order execution.
void isr_pointer_inc_after(int condition) {
    static int *position = table;

    if(condition){
        int tmp = *position;
        position++;
        position = (position >= table + MAX_VAL) ? table : position;
        doStuff(tmp);
    }
}
This compiles really nicely with both gcc and clang, especially if you're using -fPIE so the compiler needs the table address in a register anyway.
# gcc8.2 -O3 -march=haswell -fPIE
isr_pointer_inc_after(int):
    test    edi, edi
    je      .L29
    mov     rax, QWORD PTR isr_pointer_inc_after(int)::position[rip]
    lea     rdx, table[rip+960]   # table+MAX_VAL
    mov     edi, DWORD PTR [rax]
    add     rax, 4
    cmp     rax, rdx
    lea     rdx, -960[rdx]        # table, calculated relative to table+MAX_VAL
    cmovnb  rax, rdx
    mov     QWORD PTR isr_pointer_inc_after(int)::position[rip], rax
    jmp     doStuff(int)@PLT
.L29:
    ret
Again, 8 uops (assuming cmov is 1 uop). This has even lower latency than the counter version possibly could, because the [rax] addressing mode (with RAX coming from a load) has 1 cycle lower latency than an indexed addressing mode, on Sandybridge-family. With no displacement, it never suffers the penalty described in Is there a penalty when base+offset is in a different page than the base?
Or (with a counter) count down towards zero: potentially saves an instruction if the compiler can use flags set by the decrement to detect the value becoming negative. Or we can always use the decremented value, and do the wrap around after that: so when counter is 1, we'd use table[--counter] (table[0]), but store counter=MAX_VAL. Again, takes the wrap check off the critical path.
If you wanted a branchy version, you'd want it to branch on the carry flag, because sub eax,1 / jc can macro-fuse into 1 uop, but sub eax,1 / js can't macro-fuse on Sandybridge-family (see x86_64 - Assembly - loop conditions and out of order). But with branchless it's fine: cmovs (mov if sign flag set, i.e. if the last result was negative) is just as efficient as cmovc (mov if carry flag set).
It was tricky to get gcc to use the flag results from dec or sub without also doing a cdqe to sign-extend the index to pointer width. I guess I could use intptr_t counter, but that would be silly; might as well just use a pointer. With an unsigned counter, gcc and clang both want to do another cmp eax, 239 or something after the decrement, even though flags are already set just fine from the decrement. But we can get gcc to use SF by checking (int)counter < 0:
// Counts downward, table[] entries need to be reversed
void isr_branchless_dec_after(int condition) {
    static unsigned int counter = MAX_VAL-1;

    if(condition){
        int tmp = table[counter];
        --counter;
        counter = ((int)counter < 0) ? MAX_VAL-1 : counter;
        //counter = (counter >= MAX_VAL) ? MAX_VAL-1 : counter;
        //counter = (counter==0) ? MAX_VAL-1 : counter-1;
        doStuff(tmp);
    }
}
# gcc8.2 -O3 -march=haswell -fPIE
isr_branchless_dec_after(int):
    test    edi, edi
    je      .L20
    mov     ecx, DWORD PTR isr_branchless_dec_after(int)::counter[rip]
    lea     rdx, table[rip]
    mov     rax, rcx              # stupid compiler, this copy is unneeded
    mov     edi, DWORD PTR [rdx+rcx*4]   # load the arg for doStuff
    mov     edx, 239              # calculate the next counter value
    dec     eax
    cmovs   eax, edx
    mov     DWORD PTR isr_branchless_dec_after(int)::counter[rip], eax   # and store it
    jmp     doStuff(int)@PLT
.L20:
    ret
still 8 uops (should be 7), but no extra latency on the critical path. So all of the extra decrement and wrap instructions are juicy instruction-level parallelism for out-of-order execution.
I am currently learning assembly language, and I have a program which outputs "Hello World!":
section .text
global _start

_start:
    mov ebx, 1
    mov ecx, string
    mov edx, string_len
    mov eax, 4
    int 0x80

    mov eax, 1
    int 0x80

section .data
string db "Hello World!", 10, 0
string_len equ $ - string
I understand how this code works. But now I want to display the line 10 times. The code I saw on the internet for looping is the following:
    mov ecx, 5
start_loop:
    ; the code here would be executed 5 times
    loop start_loop
Problem: I tried to implement the loop in my code, but it produces an infinite loop. I also noticed that the loop instruction needs ECX and the write function also needs ECX. What is the correct way to display my "Hello World!" 10 times?
This is my current code (which produces infinite loop):
section .text
global _start

_start:
    mov ecx, 10
myloop:
    mov ebx, 1          ; file descriptor
    mov ecx, string
    mov edx, string_len
    mov eax, 4          ; write func
    int 0x80
    loop myloop

    mov eax, 1          ; exit
    int 0x80

section .data
string db "Hello World!", 10, 0
string_len equ $ - string
Thank you very much
loop uses the ecx register. This is easy to remember because the c stands for counter. However, you overwrite the ecx register, so that will never work!
The easiest fix is to use a different register for your loop counter, and avoid the loop instruction.
    mov edi, 10         ; registers are general purpose
loop:
    .....               ; do loop stuff as above
    ;push edi           ; no need to save the register
    int 0x80
    ....                ; unless you yourself are changing edi
    int 0x80
    ;pop edi            ; restore the register. Remember to always match push/pop pairs.
    sub edi, 1          ; sub is sometimes faster than dec and never slower
    jnz loop
There is no right or wrong way to loop.
(Unless you're looking for every last cycle, which you are not, because you've got a system call inside the loop.)
The disadvantage of loop is that it's slower than the equivalent sub ecx,1 ; jnz start_of_loop.
The advantage of loop is that it uses less instruction bytes. Think of it as a code-size optimization you can use if ECX happens to be a convenient register for looping, but at the cost of speed on some CPUs.
Note that the use of dec reg + jcc label is discouraged for some CPUs (Silvermont / Goldmont are still relevant; Pentium 4 isn't): because dec only alters part of the flags register, it can require an extra merging uop. Mainstream Intel and AMD rename parts of EFLAGS separately, so there's no penalty for dec / jnz (jnz only reads one of the flags written by dec), and it can even macro-fuse into a dec-and-branch uop on Sandybridge-family (but so can sub). sub is never worse except for code-size, so you may want to use sub reg,1 for the foreseeable future.
A system call using int 80h does not alter any registers other than eax, so as long as you remember not to mess with edi you don't need the push/pop pair. Pick a register you don't use inside the loop, or you could just use a stack location as your loop counter directly instead of push/pop of a register.
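Putting that together, a minimal corrected version of your program, keeping the counter in edi (which the int 0x80 interface leaves alone):
section .text
global _start

_start:
    mov edi, 10            ; loop counter; survives the system calls
print_loop:
    mov eax, 4             ; sys_write (reload each time: eax gets the return value)
    mov ebx, 1             ; fd = stdout
    mov ecx, string
    mov edx, string_len
    int 0x80
    sub edi, 1
    jnz print_loop

    mov eax, 1             ; sys_exit
    xor ebx, ebx           ; exit status 0
    int 0x80

section .data
string db "Hello World!", 10   ; no trailing 0 needed: write() takes an explicit length
string_len equ $ - string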
Why does this code work?
http://www.int80h.org/strlen/ says that the string address has to be in EDI register for scasb to work, but this assembly function doesn't seem to do this.
Assembly code for mystrlen:
global mystrlen
mystrlen:
    sub ecx, ecx        ; ecx = 0
    not ecx             ; ecx = -1, the maximum scan count
    sub al, al          ; al = 0, the byte scasb searches for
    cld                 ; scan forward
    repne scasb         ; scan [rdi], decrementing ecx once per byte,
                        ; until the 0 terminator is found
    neg ecx             ; ecx = bytes scanned + 1
    dec ecx             ; ecx = bytes scanned (including the terminator)
    dec ecx             ; ecx = string length
    mov eax, ecx        ; return value
    ret
C main:
int mystrlen(const char *);

int main()
{
    return (mystrlen("1234"));
}
Compilation:
nasm -f elf64 test.asm
gcc -c main.c
gcc main.o test.o
Output:
./a.out
echo $?
4
The code from the question is a 32-bit version of strlen, which works in a 64-bit environment only partially, sort of "by accident" (as most SW works in reality, anyway ;) ).
One of the accidental effects of the 64-bit environment is that the first argument of a function call is passed in the rdi register (in the System V ABI used by 64-bit Linux; other 64-bit platforms may follow a different calling convention, invalidating this!), and scasb uses es:rdi in 64-bit mode, so the two naturally fit together (as Jester's answer says).
The rest of the 64-bit effects are less benign: that code will return a wrong value for strings longer than 4 GiB (highly unlikely in practical usage, I know, but it can be demonstrated by a synthetic test providing such a long string).
Fixed 64-bit version (the end of the routine also exploits rax=0 to do the equivalent of both neg rcx and mov rax,rcx in a single instruction):
global mystrlen
mystrlen:
    xor ecx,ecx         ; rcx = 0
    dec rcx             ; rcx = -1 (0xFFFFFFFFFFFFFFFF)
                        ; rcx = maximum length to scan
    xor eax,eax         ; rax = 0 (al = 0 value to scan for)
    repne scasb         ; scan the memory for AL
    sub rax,rcx         ; rax = 0 - rcx_leftover = scanned bytes + 1
    sub rax,2           ; fix that into "string length" (-1 for '\0')
    ret
The 64-bit SysV calling convention places the first argument in rdi, so the caller (main) already did that load for you. You can examine its assembly code and see for yourself.
(Answer provided by Jester)
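For instance, gcc compiles that main to something like the following (a sketch in NASM-style Intel syntax; exact output varies by version and flags, and str1234 is a made-up label name):
main:
    lea rdi, [rel str1234]  ; first argument: pointer to "1234" goes in rdi
    call mystrlen
    ret                     ; mystrlen's result is already in eax, main's return value
str1234:
    db "1234", 0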
I want to learn programming in assembly to write fast and efficient code.
However, I stumbled over a problem I can't solve.
I want to loop over an array of double words and add up its components, like below:
%include "asm_io.inc"

%macro prologue 0
    push rbp
    mov rbp,rsp
    push rbx
    push r12
    push r13
    push r14
    push r15
%endmacro

%macro epilogue 0
    pop r15
    pop r14
    pop r13
    pop r12
    pop rbx
    leave
    ret
%endmacro

segment .data
string1 db "result: ",0
array dd 1, 2, 3, 4, 5

segment .bss

segment .text
global sum

sum:
    prologue
    mov rdi, string1
    call print_string
    mov rbx, array
    mov rdx, 0
    mov ecx, 5
lp:
    mov rax, [rbx]
    add rdx, rax
    add rbx, 4
    loop lp
    mov rdi, rdx
    call print_int
    call print_nl
    epilogue
Sum is called by a simple C driver. The functions print_string, print_int and print_nl look like this:
section .rodata
int_format db "%i",0
string_format db "%s",0

section .text
global print_string, print_nl, print_int, read_int
extern printf, scanf, putchar

print_string:
    prologue
    ; string address has to be passed in rdi
    mov rsi,rdi
    mov rdi,dword string_format
    xor rax,rax
    call printf
    epilogue

print_nl:
    prologue
    mov rdi,0xA
    xor rax,rax
    call putchar
    epilogue

print_int:
    prologue
    ;integer arg is in rdi
    mov rsi, rdi
    mov rdi, dword int_format
    xor rax,rax
    call printf
    epilogue
When printing the result after summing all array elements, it says "result: 14" instead of 15. I tried several combinations of elements, and it seems that my loop always skips the first element of the array.
Can somebody tell me why the loop skips the first element?
Edit
I forgot to mention that I'm using an x86_64 Linux system.
I'm not sure why your code is printing the wrong number. Probably an off-by-one somewhere that you should track down with a debugger. gdb with layout asm and layout reg should help. Actually, I think you're going one past the end of the array. There's probably a -1 there, and you're adding it to your accumulator.
If your ultimate goal is writing fast & efficient code, you should have a look at some of the links I added recently to https://stackoverflow.com/tags/x86/info. Esp. Agner Fog's optimization guides are great for helping you understand what runs efficiently on today's machines, and what doesn't. e.g. leave is shorter, but takes 3 uops, compared to mov rsp, rbp / pop rbp taking 2. Or just omit the frame pointer. (gcc defaults to -fomit-frame-pointer for amd64 these days.) Messing around with rbp just wastes instructions and costs you a register, esp. in functions that are worth writing in ASM (i.e. usually everything lives in registers, and you don't call other functions).
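To make that concrete, the two epilogues being compared are:
    leave            ; 1 instruction, but 3 uops
    ret
versus
    mov rsp, rbp     ; 2 instructions, only 2 uops total
    pop rbp
    ret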
The "normal" way to do this would be write your function in asm, call it from C to get the results, and then print the output with C. If you want your code to be portable to Windows, you can use something like
#define SYSV_ABI __attribute__((sysv_abi))
int SYSV_ABI myfunc(void* dst, const void* src, size_t size, const uint32_t* LH);
Then even if you compile for Windows, you don't have to change your ASM to look for its args in different registers. (The SysV calling convention is nicer than the Win64: more args in registers, and all the vector registers are allowed to be used without saving them.) Make sure you have a new enough gcc, that has the fix for https://gcc.gnu.org/bugzilla/show_bug.cgi?id=66275, though.
An alternative is to use some assembler macros to %define some register names so you can assemble the same source for Windows or SysV ABIs. Or have a Windows entry-point before the regular one, which uses some MOV instructions to put args in the registers the rest of the function is expecting. But that obviously is less efficient.
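For example, something like this (NASM syntax; the names are my own, and you'd assemble with -DWIN64 for a Windows build):
%ifdef WIN64
    %define ARG0 rcx        ; Windows x64 passes args in rcx, rdx, r8, r9
    %define ARG1 rdx
%else
    %define ARG0 rdi        ; System V passes args in rdi, rsi, rdx, rcx, r8, r9
    %define ARG1 rsi
%endif

global addfunc
addfunc:                    ; int64_t addfunc(int64_t a, int64_t b) { return a + b; }
    lea rax, [ARG0 + ARG1]
    ret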
It's useful to know what function calls look like in asm, but writing them yourself is a waste of time, usually. Your finished routine will just return a result (in a register or memory), not print it. Your print_int etc. routines are hilariously inefficient. (push/pop every callee-saved register, even though you use none of them, and multiple calls to printf instead of using a single format string ending with a \n.) I know you didn't claim this code was efficient, and that you're just learning. You probably already had some idea that this wasn't very tight code. :P
My point is, compilers are REALLY good at their job, most of the time. Spend your time writing asm ONLY for the hot parts of your code: usually just a loop, sometimes including the setup / cleanup code around it.
So, on to your loop:
lp:
    mov rax, [rbx]
    add rdx, rax
    add rbx, 4
    loop lp
Never use the loop instruction. It decodes to 7 uops, vs. 1 for a macro-fused compare-and-branch. loop has a max throughput of one per 5 cycles (Intel Sandybridge/Haswell and later). By comparison, dec ecx / jnz lp or cmp rbx, array_end / jb lp would let your loop run at one iteration per cycle.
Since you're using a single-register addressing mode, using add rdx, [rbx] would also be more efficient than a separate mov-load. (It's a more complicated tradeoff with indexed addressing modes, since they can only micro-fuse in the decoders / uop-cache, not in the rest of the pipeline, on Intel SnB-family. In this case, add rdx, [rbx+rsi] or something would stay micro-fused on Haswell and later).
When writing asm by hand, if it's convenient, help yourself out by keeping source pointers in rsi and dest pointers in rdi. The movs insn implicitly uses them that way, which is why they're named si and di. Never use extra mov instructions just because of register names, though. If you want more readability, use C with a good compiler.
;;; This loop probably has lots of off-by-one errors
;;; and doesn't handle array-length being odd
    mov rsi, array
    lea rdx, [rsi + array_length*4]  ; if len is really a compile-time constant, get your assembler to generate it for you.
    mov eax, [rsi]       ; load first element
    mov ebx, [rsi+4]     ; load 2nd element
    add rsi, 8           ; eliminate this insn by loading array+8 in the first place earlier
    ; TODO: handle length < 4

ALIGN 16
.loop:
    add eax, [ rsi]
    add ebx, [4 + rsi]
    add rsi, 8
    cmp rsi, rdx
    jb .loop             ; loop while rsi is Below one-past-the-end

    ; TODO: handle odd-length
    add eax, ebx
    ret
Don't use this code without debugging it. gdb (with layout asm and layout reg) is not bad, and available in every Linux distro.
If your arrays are always going to be very short compile-time-constant lengths, just fully unroll the loops. Otherwise, an approach like this, with two accumulators, lets two additions happen in parallel. (Intel and AMD CPUs have two load ports, so they can sustain two adds from memory per clock. Haswell has 4 execution ports that can handle scalar integer ops, so it can execute this loop at 1 iteration per cycle. Previous Intel CPUs can issue 4 uops per cycle, but the execution ports will get behind on keeping up with them. Unrolling to minimize loop overhead would help.)
All these techniques (esp. multiple accumulators) are equally applicable to vector instructions.
segment .rodata          ; read-only data
ALIGN 16
array: times 64 dd 1, 2, 3, 4, 5
array_bytes equ $-array
string1 db "result: ",0

segment .text
    ; TODO: scalar loop until rsi is aligned
    ; TODO: handle length < 64 bytes
    lea rsi, [array + 32]
    lea rdx, [rsi - 32 + array_bytes]  ; array_length could be a register (or 4*a register, if it's a count).
    ; lea rdx, [array + array_bytes]   ; This way would be lower latency, but more insn bytes, when "array" is a symbol, not a register. We don't need rdx until later.

    movdqu xmm0, [rsi - 32]  ; load first element
    movdqu xmm1, [rsi - 16]  ; load 2nd element
    ; note the more-efficient loop setup that doesn't need an add rsi, 32.

ALIGN 16
.loop:
    paddd xmm0, [ rsi]       ; add packed dwords
    paddd xmm1, [16 + rsi]
    add rsi, 32
    cmp rsi, rdx
    jb .loop                 ; loop: 4 fused-domain uops

    paddd xmm0, xmm1
    phaddd xmm0, xmm0        ; horizontal add: SSSE3 phaddd is simple but not optimal. Better to pshufd/paddd
    phaddd xmm0, xmm0
    movd eax, xmm0
    ; TODO: scalar cleanup loop
    ret
Again, this code probably has bugs, and doesn't handle the general case of alignment and length. It's unrolled so each iteration does two vectors of four packed ints = 32 bytes of input data.
It should run at one iteration per cycle on Haswell, otherwise 1 iteration per 1.333 cycles on SnB/IvB. The frontend can issue all 4 uops in a cycle, but the execution units can't keep up without Haswell's 4th ALU port to handle the add and macro-fused cmp/jb. Unrolling to 4 paddd per iteration would do the trick for Sandybridge, and probably help on Haswell, too.
With AVX2 vpaddd ymm1, ymm1, [32+rsi], you get double the throughput (if the data is hot in cache; otherwise you still bottleneck on memory). To do the horizontal sum for a 256b vector, start with a vextracti128 xmm1, ymm0, 1 / vpaddd xmm0, xmm0, xmm1, and then it's the same as the SSE case. See this answer for more details about efficient shuffles for horizontal ops.
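For reference, a sketch of that 256-bit horizontal sum, using the pshufd/paddd style recommended in the comment above instead of phaddd:
    vextracti128 xmm1, ymm0, 1    ; high 128 bits
    vpaddd xmm0, xmm0, xmm1       ; 8 dwords -> 4
    vpshufd xmm1, xmm0, 0x4E      ; swap the two 64-bit halves
    vpaddd xmm0, xmm0, xmm1       ; 4 -> 2
    vpshufd xmm1, xmm0, 0xB1      ; swap adjacent dwords
    vpaddd xmm0, xmm0, xmm1       ; 2 -> 1 (sum broadcast to all elements)
    vmovd eax, xmm0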
I was writing a program in NASM (x86 assembly) in which the user is asked to enter three 32-bit hex numbers (8 digits), which are stored in an array, and the program shows the largest of them. The program works fine in that it picks the largest of the three numbers, but it shows only 16 bits (4 digits) as output. For example, if I give the three numbers 11111111h, 22222222h and 10000000h, the output comes out to be only 2222. This is the code.
section .data
msg db "Enter the number : ",10d,13d
msglen equ $-msg
show db "The greatest number is : ",10d,13d
showlen equ $-show

%macro display 2
    mov eax,4
    mov ebx,1
    mov ecx,%1
    mov edx,%2
    int 80h
%endmacro

%macro input 2
    mov eax,3
    mov ebx,0
    mov ecx,%1
    mov edx,%2
    int 80h
%endmacro

section .bss
large resd 12
num resd 3

section .text
global _start

_start:
    mov esi,num
    mov edi,3

; Now taking input
nxt_num:
    display msg,msglen
    input esi,12
    add esi,12
    dec edi
    jnz nxt_num

    mov esi,num
    mov edi,3
add:
    mov eax,[esi]
    jmp check
next:
    add esi,12
    mov ebx,[esi]
    CMP ebx,eax
    jg add
check:
    dec edi
    jnz next

    mov [large],eax
    display show,showlen
    display large,12

;exit
    mov eax,1
    mov ebx,0
    int 80h
I even tried changing the reserved size of the array elements from dword to qword, but the result remains the same.
Also, when I execute the same code in x86_64 NASM, with only the registers and the system calls changed (i.e. eax to rax, ebx to rdi, int 80h to syscall, etc.), the output comes out to 32 bits (8 digits). Why so?
I need help. Thank you. :)
In your little program, you're trying to move a qword into a 32-bit register, which can hold just 4 bytes (a dword). Based on your response to Gunner, I guess you're misunderstanding this concept.
Each byte is represented by 8 bits:
a word is 2 bytes (16 bits)
a dword is 4 bytes (32 bits), which is the size of a register on a 32-bit x86 arch.
So whenever you take a byte, its binary equivalent always has a size of 8 bits.
So the binary equivalent of "FF" in hex is 11111111.
In your program, try to print your number as a string instead of printing it through a register. You can do that simply by using the pointer to the memory address where your number is stored, or by printing the input using printf.
P.S.: the string should be in ASCII, so to display 11111111 it should be in memory as 3131313131313131.
The output 2222 is correct for a 32-bit register. Each digit character is 8 bits (one byte), so 4 characters = 8 * 4 = 32 bits, the most a 32-bit register can hold. This is why the full number is printed if you change to 64-bit registers. You will need to convert the number into a string to display it in full.
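For example, a sketch of converting the 32-bit value in eax into 8 ASCII hex digits that can then be written with sys_write (hexbuf is a hypothetical 8-byte buffer you'd declare in .bss):
    xor ecx, ecx            ; digit index 0..7
to_hex:
    rol eax, 4              ; rotate the next most-significant nibble into al
    mov bl, al
    and bl, 0Fh
    add bl, '0'
    cmp bl, '9'
    jbe store_digit
    add bl, 7               ; adjust 3Ah..3Fh up to 'A'..'F'
store_digit:
    mov [hexbuf + ecx], bl
    inc ecx
    cmp ecx, 8
    jb to_hex               ; after 8 rotations, eax is back to its original value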