I have an issue with the C code below, into which I have included SPARC assembly. The code is compiled and run on Debian 9.0 Sparc64. It does a simple summation and prints the result, which should equal nLoop.
The problem is that for an initial number of iterations greater than 1e+9, the final sum is systematically equal to 1410065408: I don't understand why, since I explicitly declared sum with the unsigned long long int type, so it can hold values in the range [0, 18,446,744,073,709,551,615].
For example, for nLoop = 1e+9, I expect sum to equal 1e+9.
Does the issue come from the included SPARC assembly code, which might not be able to handle 64-bit variables (as input or output)?
#include <stdio.h>
#include <stdlib.h>

int main (int argc, char *argv[])
{
    int i;

    // Init sum
    unsigned long long int sum = 0ULL;
    // Number of iterations
    unsigned long long int nLoop = 10000000000ULL;

    // Loop with Sparc assembly into C source
    asm volatile ("clr %%g1\n\t"
                  "clr %%g2\n\t"
                  "mov %1, %%g1\n"          // %1 = input parameter
                  "loop:\n\t"
                  "add %%g2, 1, %%g2\n\t"
                  "subcc %%g1, 1, %%g1\n\t"
                  "bne loop\n\t"
                  "nop\n\t"
                  "mov %%g2, %0\n"          // %0 = output parameter
                  : "=r" (sum)              // output
                  : "r" (nLoop)             // input
                  : "g1", "g2");            // clobbers

    // Print results
    printf("Sum = %llu\n", sum);

    return 0;
}
How can I fix this range problem and use 64-bit variables in the SPARC assembly code?
PS: I tried compiling with gcc -m64; the issue remains.
Update 1
As requested by @zwol, below is the SPARC assembly output generated with: gcc -O2 -m64 -S loop.c -o loop.s
.file "loop.c"
.section ".text"
.section .rodata.str1.8,"aMS",#progbits,1
.align 8
.LC0:
.asciz "Sum = %llu\n"
.section .text.startup,"ax",#progbits
.align 4
.global main
.type main, #function
.proc 04
main:
.register %g2, #scratch
save %sp, -176, %sp
sethi %hi(_GLOBAL_OFFSET_TABLE_-4), %l7
call __sparc_get_pc_thunk.l7
add %l7, %lo(_GLOBAL_OFFSET_TABLE_+4), %l7
sethi %hi(9764864), %o1
or %o1, 761, %o1
sllx %o1, 10, %o1
#APP
! 13 "loop.c" 1
clr %g1
clr %g2
mov %o1, %g1
loop:
add %g2, 1, %g2
subcc %g1, 1, %g1
bne loop
nop
mov %g2, %o1
! 0 "" 2
#NO_APP
mov 0, %i0
sethi %gdop_hix22(.LC0), %o0
xor %o0, %gdop_lox10(.LC0), %o0
call printf, 0
ldx [%l7 + %o0], %o0, %gdop(.LC0)
return %i7+8
nop
.size main, .-main
.ident "GCC: (Debian 7.3.0-15) 7.3.0"
.section .text.__sparc_get_pc_thunk.l7,"axG",#progbits,__sparc_get_pc_thunk.l7,comdat
.align 4
.weak __sparc_get_pc_thunk.l7
.hidden __sparc_get_pc_thunk.l7
.type __sparc_get_pc_thunk.l7, #function
.proc 020
__sparc_get_pc_thunk.l7:
jmp %o7+8
add %o7, %l7, %l7
.section .note.GNU-stack,"",#progbits
Update 2
As suggested by @Martin Rosenau, I made the following modifications:
loop:
add %g2, 1, %g2
subcc %g1, 1, %g1
bpne %icc, loop
bpne %xcc, loop
nop
mov %g2, %o1
But at compilation, I get:
Error: Unknown opcode: `bpne'
What could be the reason for this compilation error?
subcc %%g1, 1, %%g1
bne loop
Your problem is the bne instruction:
Unlike the x86-64 CPU, Sparc64 CPUs don't have different instructions for 32- and 64-bit subtraction:
If you subtract 1 from 0x12345678 the result is 0x12345677. If you subtract 1 from 0xF00D12345678 the result is 0xF00D12345677, so if you only use the lower 32 bits of a register, a 64-bit subtraction has the same effect as a 32-bit subtraction.
Therefore the Sparc64 CPUs do not have different instructions for 64-bit and 32-bit addition, subtraction, multiplication, left shift etc.
These CPUs only have different instructions for 32-bit and 64-bit operations where the upper 32 bits influence the lower 32 bits (e.g. right shift).
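A quick C check of this (my own illustration, not part of the original answer): truncating a 64-bit subtraction to 32 bits gives the same low 32 bits as doing the subtraction in 32 bits.
#include <stdio.h>
#include <stdint.h>
int main(void)
{
    uint64_t a = 0xF00D12345678ULL;
    uint32_t low_of_64bit_sub = (uint32_t)(a - 1);  /* subtract in 64 bits, keep the low 32 */
    uint32_t sub_of_low_32bit = (uint32_t)a - 1;    /* keep the low 32 bits, subtract in 32 bits */
    printf("%08x %08x\n", low_of_64bit_sub, sub_of_low_32bit);  /* both print 12345677 */
    return 0;
}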
However the zero flag depends on the result of the subcc operation.
To solve this problem the Sparc64 CPUs have each of the integer flags (zero, overflow, carry, sign) twice:
The 32-bit zero flag will be set if the lower 32 bits of a register are zero; the 64-bit zero flag will be set if all 64 bits of a register are zero.
To be compatible with existing 32-bit programs the bne instruction will check the 32-bit zero flag, not the 64-bit zero flag.
is systematically equal to 1410065408
1e10 = 0x200000000 + 1410065408, so after 1410065408 steps the value 0x200000000 is reached, which has its lower 32 bits set to 0, and bne will not jump any more.
However, for 1e11 you should get 1215752192 rather than 1410065408, because 1e11 = 0x1700000000 + 1215752192.
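You can verify this arithmetic with a couple of lines of C (a throwaway check of my own, not part of the original answer): the final sum is simply nLoop modulo 2^32.
#include <stdio.h>
int main(void)
{
    printf("%llu\n", 10000000000ULL % 4294967296ULL);   /* 1e10: prints 1410065408 */
    printf("%llu\n", 100000000000ULL % 4294967296ULL);  /* 1e11: prints 1215752192 */
    return 0;
}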
bne
There is a new instruction named bpne which has up to 4 arguments!
In the simplest variant (with only two arguments) the instruction should (I have not used Sparc for 5 years now, so I'm not sure) work like this:
bpne %icc, loop # Like bne (based on the 32-bit result)
bpne %xcc, loop # Like bne, but based on the 64-bit result
EDIT
Error: Unknown opcode: 'bpne'
I just tried using the GNU assembler:
The GNU assembler names the new instruction bne, just like the old one:
bne loop # Old variant
bne %icc, loop # New variant based on the 32-bit result
bne %xcc, loop # (New variant) Based on the 64-bit result
subcc %g1, 1, %g1
bpne %icc, loop
bpne %xcc, loop
nop
The first bpne (or bne) makes no sense: whenever the first line would take the jump, the second line would also jump. And if you don't use .reorder (though this is the default) you would also need to add a nop between the two branch instructions...
The code should look like this (assuming your assembler also spells bpne as bne):
subcc %g1, 1, %g1
bne %xcc, loop
nop
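Putting that back into the question's C source, the whole inline-assembly block might end up looking like this. This is only a sketch of mine based on the above, untested; note that inside GCC's extended-asm template the condition-code register has to be written %%xcc, and I have also added a "cc" clobber since subcc modifies the condition codes.
#include <stdio.h>
int main (void)
{
    unsigned long long int sum = 0ULL;
    unsigned long long int nLoop = 10000000000ULL;
    asm volatile ("clr %%g1\n\t"
                  "clr %%g2\n\t"
                  "mov %1, %%g1\n"
                  "loop:\n\t"
                  "add %%g2, 1, %%g2\n\t"
                  "subcc %%g1, 1, %%g1\n\t"
                  "bne %%xcc, loop\n\t"      // branch on the 64-bit (xcc) result
                  "nop\n\t"
                  "mov %%g2, %0\n"
                  : "=r" (sum)               // output
                  : "r" (nLoop)              // input
                  : "g1", "g2", "cc");       // clobbers, including condition codes
    printf("Sum = %llu\n", sum);
    return 0;
}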
Try "bne %xcc, loop" which should branch based on the 64 bit result.
Related
I have fed this simple code to gcc
volatile signed char x, y, z;
void test()
{
x = 0x31;
y = x + 3;
}
Volatile has been added just to avoid gcc optimization (set to -O0 anyway).
The resulting MIPS code was:
x:
y:
z:
test():
addiu $sp,$sp,-8
sw $fp,4($sp)
move $fp,$sp
lui $2,%hi(x)
li $3,49 # 0x31
sb $3,%lo(x)($2)
lui $2,%hi(x)
lbu $2,%lo(x)($2)
seb $2,$2
andi $2,$2,0x00ff
addiu $2,$2,3
andi $2,$2,0x00ff
seb $3,$2
lui $2,%hi(y)
sb $3,%lo(y)($2)
nop
move $sp,$fp
lw $fp,4($sp)
addiu $sp,$sp,8
j $31
nop
For y = x + 3, gcc loads the byte as unsigned, then sign-extends it, and then ANDs it with 0x00ff?
Why not simply load it using lb (which is supposed to sign-extend it)?
GCC does the same for signed halfwords (using 0xffff of course).
I'm not particularly adept at reading MIPS assembly, but do note that you have compiled with -O0, which is supposed to generate unoptimized code. That more or less means code that implements the exact semantics of the C abstract machine. In particular,
0x31 is a constant of type int
the assignment x = 0x31 includes an implicit conversion (in this case) of the right-hand int operand to the type of the left-hand operand (signed char)
3 is also a constant of type int
evaluating x + 3 involves performing the integer promotions on the arguments, and in particular, converting the signed char value of x to (signed) int, then performing the addition
assigning the result to y involves another implicit conversion from int to signed char
In principle, all of those conversions and promotions that are implicit in the C code need to be performed explicitly by the unoptimized assembly program, and that appears roughly to be what you are seeing.
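To make those steps concrete, here is the same function with the implicit conversions written out as explicit casts (a sketch of mine, not taken from the question or the answer):
volatile signed char x, y, z;
void test(void)
{
    x = (signed char)0x31;     /* int constant converted to signed char on assignment */
    int promoted = (int)x;     /* integer promotion of x before the addition */
    int sum = promoted + 3;    /* the addition itself is done in int */
    y = (signed char)sum;      /* implicit conversion back to signed char on assignment */
}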
Overall, it's not very useful to ask why the assembly output of a non-optimizing compilation is not as efficient as you think it could be. If you want more efficient code, enable optimization.
gcc -march=native -O3 gives the following code on an Intel machine:
test:
.seh_endprologue
movb $49, x(%rip)
movzbl x(%rip), %eax
addl $3, %eax
movb %al, y(%rip)
ret
.seh_endproc
.comm z, 1, 0
.comm y, 1, 0
.comm x, 1, 0
.ident "GCC: (GNU) 6.4.0"
There's nothing wrong with your code. gcc/MIPS just isn't very good at code generation, which is not too surprising since it receives a lot less care and attention than gcc/Intel. In my experience clang usually generates better MIPS code than gcc does.
I've been searching the internet for some time now and have come up with an odd problem.
Using a C compiler, I converted the following into assembly to later be converted to Y86:
#include <stdio.h>
int main(void)
{
int j,k,i;
for (i=0; i <5; i++) {
j = i*2;
k = j+1;
}
}
After the conversion, I get the following .s file:
.file "Lab5_1.c"
.section ".text"
.align 4
.global main
.type main, #function
.proc 04
main:
save %sp, -112, %sp
st %g0, [%fp-4]
ba,pt %xcc, .LL2
nop
.LL3:
ld [%fp-4], %g1
add %g1, %g1, %g1
st %g1, [%fp-8]
ld [%fp-8], %g1
add %g1, 1, %g1
st %g1, [%fp-12]
ld [%fp-4], %g1
add %g1, 1, %g1
st %g1, [%fp-4]
.LL2:
ld [%fp-4], %g1
cmp %g1, 4
ble %icc, .LL3
nop
mov %g1, %i0
return %i7+8
nop
.size main, .-main
.ident "GCC: (GNU) 4.8.0"
My question is about the instructions themselves. Many sites I've found have instructions similar to these, such as movl for mov and cmpl for cmp. But I can't make heads or tails of some of the other instructions, such as st, ba,pt, or ld, when trying to convert them to Y86.
Can anyone shed some light on these instructions? Could it be a problem with the compiler?
For reference, I'm using Unix and the command gcc -S "filename.c"
The st and ld instructions are obviously store-to and load-from memory. From the looks of things, ba is a branch instruction of some description.
In fact, based on the instructions being generated and a bit of quick research, it looks like you might be running on a SPARC architecture. The ld/st pair, ba and save are all instructions on that architecture.
The save instruction is actually the SPARC way of handling register save and restore when calling functions (the in/local/out method).
And that "slightly deranged" ba instruction is actually the branch-prediction version introduced in SPARC version 9, ba,pt %xcc, .LL2 meaning branch always (with a prediction that the branch will be taken) based on condition code (obviously some new definition of the word "always" of which I was previously unaware).
The opposite instruction ba,pn means predict that the branch will not be taken.
The presence of nop instructions following a branch is to do with the fact that SPARC does delayed branching - the instruction following a branch is actually executed before the branch is taken. This has to do with the way it pipelines instructions and would probably be considered a bug on any other (less weird) architecture :-)
All those factors taken together pretty well guarantee you're running on a SPARC, so I'd be looking up opcodes for that to figure out how best to transform it into Y86.
The other alternative is, of course, to generate x86 instructions. That may be possible by using a cross-compiler on your SPARC, or simply by compiling on an x86 machine (assuming you have one available).
Pardon me if this question has been posed before. I looked for answers to similar questions, but I'm still puzzled with my problem. So I will shoot the question anyway.
I'm using a C library called libexif for image data. I run my application (which uses this library) both on my Linux desktop and my MIPS board.
For a particular image file when I try to fetch the created time, I was getting an error/invalid value. On debugging further I saw that for this particular image file, I was not getting the tag (EXIF_TAG_DATE_TIME) as expected.
This library has several utility functions. Most functions are structured like the ones below:
int16_t
exif_get_sshort (const unsigned char *buf, ExifByteOrder order)
{
if (!buf) return 0;
switch (order) {
case EXIF_BYTE_ORDER_MOTOROLA:
return ((buf[0] << 8) | buf[1]);
case EXIF_BYTE_ORDER_INTEL:
return ((buf[1] << 8) | buf[0]);
}
/* Won't be reached */
return (0);
}
uint16_t
exif_get_short (const unsigned char *buf, ExifByteOrder order)
{
return (exif_get_sshort (buf, order) & 0xffff);
}
When the library tries to investigate the presence of tags in raw data, it calls exif_get_short() and assigns the value returned to a variable which is of type enum (int).
In the error case, exif_get_short(), which is supposed to return an unsigned value (34687), returns a negative number (-30871), which messes up the whole tag extraction from the image data.
34687 is greater than the maximum value representable in an int16_t and therefore overflows it. When I make this slight modification to the code, everything seems to work fine:
uint16_t
exif_get_short (const unsigned char *buf, ExifByteOrder order)
{
int temp = (exif_get_sshort (buf, order) & 0xffff);
return temp;
}
But since this is a pretty stable library that has been in use for quite some time, I suspect I may be missing something here. Moreover, this is the general way the code is structured for the other utility functions as well; for example, exif_get_long() calls exif_get_slong(). I would then have to change all the utility functions.
What is confusing me is that when I run this piece of code on my Linux desktop for the error file, I see no problems and things work fine with the original library code. That led me to believe that perhaps the UINT16_MAX and INT16_MAX macros have different values on my desktop and on the MIPS board, but unfortunately that's not the case: both print identical values on the board and the desktop. If this piece of code fails, it should also fail on my desktop.
What am I missing here? Any hints would be much appreciated.
EDIT:
The code which calls exif_get_short() goes something like this:
ExifTag tag;
...
tag = exif_get_short (d + offset + 12 * i, data->priv->order);
switch (tag) {
...
...
The type ExifTag is as follows:
typedef enum {
EXIF_TAG_GPS_VERSION_ID = 0x0000,
EXIF_TAG_INTEROPERABILITY_INDEX = 0x0001,
...
...
}ExifTag ;
The cross compiler being used is mipsisa32r2el-timesys-linux-gnu-gcc
CFLAGS = -pipe -mips32r2 -mtune=74kc -mdspr2 -Werror -O3 -Wall -W -D_REENTRANT -fPIC $(DEFINES)
I'm using libexif within Qt - Qt Media hub (actually libexif comes along with Qt Media hub)
EDIT2: Some additional observations:
I'm observing something bizarre. I have put print statements in exif_get_short(), just before the return:
printf("return_value %d\n %u\n",exif_get_sshort (buf, order) & 0xffff, exif_get_sshort (buf, order) & 0xffff);
return (exif_get_sshort (buf, order) & 0xffff);
I see the following o/p:
return_value 34665 34665
I then also inserted print statements in the code which calls exif_get_short()
....
tag = exif_get_short (d + offset + 12 * i, data->priv->order);
printf("TAG %d %u\n",tag,tag);
I see the following o/p:
TAG -30871 4294936425
EDIT3: Posting the assembly code for exif_get_short() and exif_get_sshort(), taken on the MIPS board
.file 1 "exif-utils.c"
.section .mdebug.abi32
.previous
.gnu_attribute 4, 1
.abicalls
.text
.align 2
.globl exif_get_sshort
.ent exif_get_sshort
.type exif_get_sshort, #function
exif_get_sshort:
.set nomips16
.frame $sp,0,$31 # vars= 0, regs= 0/0, args= 0, gp= 0
.mask 0x00000000,0
.fmask 0x00000000,0
.set noreorder
.set nomacro
beq $4,$0,$L2
nop
beq $5,$0,$L3
nop
li $2,1 # 0x1
beq $5,$2,$L8
nop
$L2:
j $31
move $2,$0
$L3:
lbu $2,0($4)
lbu $3,1($4)
sll $2,$2,8
or $2,$2,$3
j $31
seh $2,$2
$L8:
lbu $2,1($4)
lbu $3,0($4)
sll $2,$2,8
or $2,$2,$3
j $31
seh $2,$2
.set macro
.set reorder
.end exif_get_sshort
.align 2
.globl exif_get_short
.ent exif_get_short
.type exif_get_short, #function
exif_get_short:
.set nomips16
.frame $sp,0,$31 # vars= 0, regs= 0/0, args= 0, gp= 0
.mask 0x00000000,0
.fmask 0x00000000,0
.set noreorder
.cpload $25
.set nomacro
lw $25,%call16(exif_get_sshort)($28)
jr $25
nop
.set macro
.set reorder
.end exif_get_short
Just for completeness, the ASM code taken from my Linux machine:
.file "exif-utils.c"
.text
.p2align 4,,15
.globl exif_get_sshort
.type exif_get_sshort, #function
exif_get_sshort:
.LFB1:
.cfi_startproc
xorl %eax, %eax
testq %rdi, %rdi
je .L2
testl %esi, %esi
jne .L8
movzbl (%rdi), %edx
movzbl 1(%rdi), %eax
sall $8, %edx
orl %edx, %eax
ret
.p2align 4,,10
.p2align 3
.L8:
cmpl $1, %esi
jne .L2
movzbl 1(%rdi), %edx
movzbl (%rdi), %eax
sall $8, %edx
orl %edx, %eax
.L2:
rep
ret
.cfi_endproc
.LFE1:
.size exif_get_sshort, .-exif_get_sshort
.p2align 4,,15
.globl exif_get_short
.type exif_get_short, #function
exif_get_short:
.LFB2:
.cfi_startproc
jmp exif_get_sshort#PLT
.cfi_endproc
.LFE2:
.size exif_get_short, .-exif_get_short
EDIT4: Hopefully my last update :-)
ASM code with compiler option set to -O1
exif_get_short:
.set nomips16
.frame $sp,32,$31 # vars= 0, regs= 1/0, args= 16, gp= 8
.mask 0x80000000,-4
.fmask 0x00000000,0
.set noreorder
.cpload $25
.set nomacro
addiu $sp,$sp,-32
sw $31,28($sp)
.cprestore 16
lw $25,%call16(exif_get_sshort)($28)
jalr $25
nop
lw $28,16($sp)
andi $2,$2,0xffff
lw $31,28($sp)
j $31
addiu $sp,$sp,32
.set macro
.set reorder
.end exif_get_short
One thing the MIPS assembly shows (though I'm not an expert in MIPS assembly, so there's a decent chance I'm missing something or otherwise wrong) is that the exif_get_short() function is just an alias for the exif_get_sshort() function. All that exif_get_short() does is jump to the address of the exif_get_sshort() function.
The exif_get_sshort() function sign-extends the 16-bit value it's returning to the full 32-bit register used for the return. There's nothing wrong with that - it's actually probably what the MIPS ABI specifies (I'm not sure).
However, since the exif_get_short() function just jumps to the exif_get_sshort() function, it has no opportunity to clear the upper 16 bits of the register.
So when the 16 bit value 0x8769 is being returned from the buffer (whether from exif_get_sshort() or exif_get_short()), the $2 register used to return the function result contains 0xffff8769, which can have the following interpretations:
as a 32-bit signed int: -30871
as a 32-bit unsigned int: 4294936425
as a 16-bit signed int16_t: -30871
as a 16-bit unsigned uint16_t: 34665
If the compiler is supposed to ensure that the $2 return register has the top 16 bits set to zero for a uint16_t return type, then it has a bug in the code it's emitting for exif_get_short() - instead of jumping to exif_get_sshort(), it should call exif_get_sshort() and clear the upper half of $2 before returning.
From the description of the behavior you're seeing, it looks like the code calling exif_get_short() expects that the $2 register used for the return value will have the upper 16 bits cleared so that the entire 32-bit register can be used as-is for the 16-bit uint16_t value.
I'm not sure what the MIPS ABI specifies (but I'd guess that it specifies that the upper 16 bits of the $2 register should be cleared by exif_get_short()), but there seems to be either a code-generation bug, in that exif_get_short() doesn't ensure $2 is entirely correct before it returns, or a bug where the caller of exif_get_short() assumes that the full 32 bits of $2 are valid when only 16 bits are.
This is broken on so many levels, I don't know where to begin. Just look at what is done here:
The unsigned chars are read from the buffer.
They are assigned to a signed int16_t in exif_get_sshort.
This is assigned to an unsigned uint16_t in exif_get_short.
This is finally assigned to an enum which is of type signed int.
I'd say it's a miracle it works at all.
First, the assignment from the chars to int16_t is done with the values, not with the representation:
return ((buf[0] << 8) | buf[1]);
Which already throws you into the pit of implementation-defined behaviour when the result actually is negative. Plus, it only works when the signed integer representation of the implementation is the same as the one used in the file format (two's complement, I guess). It will fail for one's complement and sign-magnitude. So check what's the case for the MIPS implementation.
The clean way would be the other way around: assign the two chars from the buffer to a uint16_t, which involves only well-defined operations, and use this to build the int16_t return value. You could then take care of further corrections of the value for different representations, if necessary.
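As a rough sketch of what that could look like (my own code, not libexif's; the byte-order parameter is simplified to an int here):
#include <stdint.h>
/* Hypothetical rewrite: build the value as uint16_t first, then derive int16_t from it. */
uint16_t get_u16 (const unsigned char *buf, int motorola_order)
{
    if (!buf) return 0;
    if (motorola_order)
        return (uint16_t)((buf[0] << 8) | buf[1]);   /* big-endian byte order */
    return (uint16_t)((buf[1] << 8) | buf[0]);       /* little-endian byte order */
}
int16_t get_s16 (const unsigned char *buf, int motorola_order)
{
    /* The conversion of an out-of-range value is implementation-defined,
       so this is where a correction for exotic representations would go. */
    return (int16_t)get_u16 (buf, motorola_order);
}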
Additionally, here:
if (!buf) return 0;
0 is a very bad choice of a return value, because it is a valid enum constant:
EXIF_TAG_GPS_VERSION_ID = 0x0000,
If this is expected to be the default for invalidity, then the constant should be returned, not the magic number. Though this seems to be a generic function returning an int16_t, so some other error mechanism should be used here.
For your specific question, well, follow the flow of conversions between signed and unsigned on your MIPS implementation, including the default promotions performed, and examine all the intermediate values to find the point where it breaks. Your MIPS uses 32-bit ints, not 16-bit, right? Check INT_MAX and UINT_MAX.
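Along those lines, here is a small standalone check (a sketch of mine; the byte values are just an example corresponding to the 0x8769 tag discussed above) that prints the limits and each intermediate value in the conversion chain:
#include <stdio.h>
#include <stdint.h>
#include <limits.h>
int main(void)
{
    unsigned char buf[2] = { 0x87, 0x69 };           /* Motorola (big-endian) order */
    int16_t  s = (int16_t)((buf[0] << 8) | buf[1]);  /* what exif_get_sshort returns */
    uint16_t u = (uint16_t)(s & 0xffff);             /* what exif_get_short should return */
    int      e = u;                                  /* what ends up in the enum-typed tag */
    printf("INT_MAX=%d UINT_MAX=%u\n", INT_MAX, UINT_MAX);
    printf("s=%d u=%u e=%d\n", s, (unsigned)u, e);   /* expect s=-30871 u=34665 e=34665 */
    return 0;
}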
I just tried out a simple C program using an if statement and analyzed its assembly. However, its behavior differs a lot when the -O2 flag is used for compilation.
The C code is:
#include<stdio.h>
int main(int argc, char **argv) {
int a;
if(a<0) {
printf("A is less than 0\n");
}
}
And the corresponding assembly is:
main:
push %ebp
mov %ebp, %esp
sub %esp, 8
and %esp, -16
sub %esp, 16
test %eax, %eax
js .L4
leave
ret
.p2align 4,,15
.L4:
sub %esp, 12
push OFFSET FLAT:.LC0
call puts
add %esp, 16
leave
ret
.size main, .-main
.section .note.GNU-stack,"",#progbits
.ident "GCC: (GNU) 3.4.6"
I read that the test instruction basically just performs the logical AND of the two operands. I also read that the js instruction performs a jump when there is a change in sign in the previous instruction. So, testing eax with eax would give 0 or 1 and the jump would depend on this.
I fail to understand how it is being used here for branching.
Could someone explain how this works?
JS doesn't jump when there is a change in sign; it jumps if the sign flag is 1.
The sign flag is set if the result of the last operation was negative (negative numbers in two's complement have the most significant bit set to 1).
So if the AND operation was between two negative integers (e.g. -1 & -1), the most significant bit of the result is 1, the sign flag is set, and the jump is taken. If the value had been positive, the most significant bit would be 0 and the jump would not be taken.
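To see it in C terms: the sign flag that js tests is just the most significant bit of the result. A small sketch of mine (not from the answer):
#include <stdio.h>
#include <stdint.h>
int main(void)
{
    int32_t a = -5;
    uint32_t bits = (uint32_t)a;
    int sf = (int)((bits >> 31) & 1u);   /* what ends up in SF after "test %eax, %eax" */
    printf("a < 0 : %d\n", a < 0);       /* prints 1 */
    printf("SF    : %d\n", sf);          /* prints 1 - js would take the jump */
    return 0;
}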
Well I don't know what you expect, but your program has undefined behavior since a is not initialized. So the assembler output could be literally anything.
The Intel manuals are good for this. They document that TEST computes the bitwise AND of its operands, discards the result, and sets SF to the most significant bit of that result (ZF is set if the result is zero; CF and OF are cleared).
SF is the sign flag, the one that's tested by the JS opcode. It is set to the most significant bit of eax here, the sign bit. The jump is thus taken when eax contains a negative number.
If eax is negative, the flags will indicate that after the test instruction, and the jump will be taken. That sends the PC to .L4, which does the printing. Otherwise, we leave.
test eax, eax will set the zero flag if eax = 0.
The js instruction will check the sign flag, which is basically the (a < 0) test.
Looks like when you specify -O2, the compiler is putting your int a into a register for speed optimization.
Since you never initialize int a, the assembly does not show anything written to eax; it instead holds whatever value was last left in it.
The other answers explain how test is working as a branching mechanism.
I am trying to understand how calculations involving numbers greater than 2^32 happen on a 32-bit machine.
C code
$ cat size.c
#include<stdio.h>
#include<math.h>
int main() {
printf ("max unsigned long long = %llu\n",
(unsigned long long)(pow(2, 64) - 1));
}
$
gcc output
$ gcc size.c -o size
$ ./size
max unsigned long long = 18446744073709551615
$
Corresponding assembly code
$ gcc -S size.c -O3
$ cat size.s
.file "size.c"
.section .rodata.str1.4,"aMS",#progbits,1
.align 4
.LC0:
.string "max unsigned long long = %llu\n"
.text
.p2align 4,,15
.globl main
.type main, #function
main:
pushl %ebp
movl %esp, %ebp
andl $-16, %esp
subl $16, %esp
movl $-1, 8(%esp) #1
movl $-1, 12(%esp) #2
movl $.LC0, 4(%esp) #3
movl $1, (%esp) #4
call __printf_chk
leave
ret
.size main, .-main
.ident "GCC: (Ubuntu 4.4.3-4ubuntu5) 4.4.3"
.section .note.GNU-stack,"",#progbits
$
What exactly happens on the lines 1 - 4?
Is this some kind of string concatenation at the assembly level?
__printf_chk is a wrapper around printf which checks for stack overflow, and takes an additional first parameter, a flag.
pow(2, 64) - 1 has been optimised to 0xffffffffffffffff as the arguments are constants.
As per the usual calling conventions, the first argument to __printf_chk() (int flag) is a 32-bit value on the stack (at %esp at the time of the call instruction). The next argument, const char * format, is a 32-bit pointer (the next 32-bit word on the stack, i.e. at %esp+4). And the 64-bit quantity that is being printed occupies the next two 32-bit words (at %esp+8 and %esp+12):
pushl %ebp ; prologue
movl %esp, %ebp ; prologue
andl $-16, %esp ; align stack pointer
subl $16, %esp ; reserve bytes for stack frame
movl $-1, 8(%esp) #1 ; store low half of 64-bit argument (a constant) to stack
movl $-1, 12(%esp) #2 ; store high half of 64-bit argument (a constant) to stack
movl $.LC0, 4(%esp) #3 ; store address of format string to stack
movl $1, (%esp) #4 ; store "flag" argument to __printf_chk to stack
call __printf_chk ; call routine
leave ; epilogue
ret ; epilogue
The compiler has effectively rewritten this:
printf("max unsigned long long = %llu\n", (unsigned long long)(pow(2, 64) - 1));
...into this:
__printf_chk(1, "max unsigned long long = %llu\n", 0xffffffffffffffffULL);
...and, at runtime, the stack layout for the call looks like this (showing the stack as 32-bit words, with addresses increasing from the bottom of the diagram upwards):
: :
: Stack :
: :
+-----------------+
%esp+12 | 0xffffffff | \
+-----------------+ } <-------------------------------------.
%esp+8 | 0xffffffff | / |
+-----------------+ |
%esp+4 |address of string| <---------------. |
+-----------------+ | |
%esp | 1 | <--. | |
+-----------------+ | | |
__printf_chk(1, "max unsigned long long = %llu\n", |
0xffffffffffffffffULL);
Similar to the way we handle numbers greater than 9 with only the digits 0 - 9: using positional digits. (Presuming the question is a conceptual one.)
In your case, the compiler knows that 2^64-1 is just 0xffffffffffffffff, so it has pushed -1 (low dword) and -1 (high dword) onto the stack as your argument to printf. It's just an optimization.
In general, 64-bit numbers (and even greater values) can be stored with multiple words, e.g. an unsigned long long uses two dwords. To add two 64-bit numbers, two additions are performed - one on the low 32 bits, and one on the high 32 bits, plus the carry:
; Add 64-bit number from esi onto edi:
mov eax, [esi] ; get low 32 bits of source
add [edi], eax ; add to low 32 bits of destination
; That add may have overflowed, and if it did, carry flag = 1.
mov eax, [esi+4] ; get high 32 bits of source
adc [edi+4], eax ; add to high 32 bits of destination, then add carry.
You can repeat this sequence of add and adcs as much as you like to add arbitrarily big numbers. The same thing can be done with subtraction - just use sub and sbb (subtract with borrow).
Multiplication and division are much more complicated, and the compiler usually produces some small helper functions to deal with these whenever you multiply 64-bit numbers together. Packages like GMP which support very, very large integers use SSE/SSE2 to speed things up. Take a look at this Wikipedia article for more information on multiplication algorithms.
As others have pointed out, all 64-bit arithmetic in your example has been optimised away. This answer focuses on the question in the title.
Basically we treat each 32-bit number as a digit and work in base 4294967296 (2^32). In this manner we can work on arbitrarily big numbers.
Addition and subtraction are easiest. We work through the digits one at a time, starting from the least significant and moving to the most significant. Generally the first digit is done with a normal add/subtract instruction and later digits are done using a specific "add with carry" or "subtract with borrow" instruction. The carry flag in the status register is used to take the carry/borrow bit from one digit to the next. Thanks to two's complement, signed and unsigned addition and subtraction are the same.
Multiplication is a little trickier: multiplying two 32-bit digits can produce a 64-bit result. Most 32-bit processors have instructions that multiply two 32-bit numbers and produce a 64-bit result in two registers. Additions are then needed to combine the partial results into a final answer. Thanks to two's complement, signed and unsigned multiplication are the same provided the desired result size is the same as the argument size. If the result is larger than the arguments then special care is needed.
For comparison we start from the most significant digit. If the digits are equal we move down to the next digit, continuing until we find a pair that differs or run out of digits (in which case the numbers are equal).
Division is too complex for me to describe in this post, but there are plenty of examples out there of algorithms. e.g. http://www.hackersdelight.org/hdcodetxt/divDouble.c.txt
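As a C-level illustration of the digit-by-digit addition described above (my own sketch, representing each 64-bit number as two 32-bit halves and propagating the carry by hand):
#include <stdio.h>
#include <stdint.h>
/* Add two 64-bit numbers given as (high, low) pairs of 32-bit "digits". */
static void add64_by_halves(uint32_t ah, uint32_t al,
                            uint32_t bh, uint32_t bl,
                            uint32_t *rh, uint32_t *rl)
{
    uint32_t low = al + bl;           /* like the "add" instruction */
    uint32_t carry = (low < al);      /* carry out of the low digit */
    *rl = low;
    *rh = ah + bh + carry;            /* like the "adc" instruction */
}
int main(void)
{
    uint32_t rh, rl;
    add64_by_halves(0x00000001u, 0xFFFFFFFFu,    /* 0x1FFFFFFFF */
                    0x00000000u, 0x00000001u,    /* + 1         */
                    &rh, &rl);
    printf("%08x%08x\n", rh, rl);                /* prints 0000000200000000 */
    return 0;
}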
Some real-world examples from gcc (https://godbolt.org/g/NclqXC); the assembler is in Intel syntax.
First, an addition: adding two 64-bit numbers and producing a 64-bit result. The asm is the same for both the signed and unsigned versions.
int64_t add64(int64_t a, int64_t b) { return a + b; }
add64:
mov eax, DWORD PTR [esp+12]
mov edx, DWORD PTR [esp+16]
add eax, DWORD PTR [esp+4]
adc edx, DWORD PTR [esp+8]
ret
This is pretty simple, load one argument into eax and edx, then add the other using an add followed by an add with carry. The result is left in eax and edx for return to the caller.
Now a multiplication of two 64-bit numbers to produce a 64-bit result. Again the code doesn't change from signed to unsigned. I've added some comments to make it easier to follow.
Before we look at the code, let's consider the math. a and b are 64-bit numbers; I will use lo() to represent the lower 32 bits of a 64-bit number and hi() to represent the upper 32 bits.
(a * b) = (lo(a) * lo(b)) + (hi(a) * lo(b) * 2^32) + (hi(b) * lo(a) * 2^32) + (hi(b) * hi(a) * 2^64)
(a * b) mod 2^64 = (lo(a) * lo(b)) + (lo(hi(a) * lo(b)) * 2^32) + (lo(hi(b) * lo(a)) * 2^32)
lo((a * b) mod 2^64) = lo(lo(a) * lo(b))
hi((a * b) mod 2^64) = hi(lo(a) * lo(b)) + lo(hi(a) * lo(b)) + lo(hi(b) * lo(a))
uint64_t mul64(uint64_t a, uint64_t b) { return a*b; }
mul64:
push ebx ;save ebx
mov eax, DWORD PTR [esp+8] ;load lo(a) into eax
mov ebx, DWORD PTR [esp+16] ;load lo(b) into ebx
mov ecx, DWORD PTR [esp+12] ;load hi(a) into ecx
mov edx, DWORD PTR [esp+20] ;load hi(b) into edx
imul ecx, ebx ;ecx = lo(hi(a) * lo(b))
imul edx, eax ;edx = lo(hi(b) * lo(a))
add ecx, edx ;ecx = lo(hi(a) * lo(b)) + lo(hi(b) * lo(a))
mul ebx ;eax = lo(low(a) * lo(b))
;edx = hi(low(a) * lo(b))
pop ebx ;restore ebx.
add edx, ecx ;edx = hi(low(a) * lo(b)) + lo(hi(a) * lo(b)) + lo(hi(b) * lo(a))
ret
Finally, when we try a division, we see:
int64_t div64(int64_t a, int64_t b) { return a/b; }
div64:
sub esp, 12
push DWORD PTR [esp+28]
push DWORD PTR [esp+28]
push DWORD PTR [esp+28]
push DWORD PTR [esp+28]
call __divdi3
add esp, 28
ret
The compiler has decided that division is too complex to implement inline and instead calls a library routine.
The compiler actually made a static optimization of your code.
Lines #1, #2, and #3 are parameters for printf().
As @Pafy mentions, the compiler has evaluated this as a constant.
2 to the 64th minus 1 is 0xffffffffffffffff.
As two 32-bit integers this is 0xffffffff and 0xffffffff, which, taken as a pair of 32-bit signed values, ends up as -1 and -1.
Thus for your compiler the code generated happens to be equivalent to:
printf("max unsigned long long = %llu\n", -1, -1);
In the assembly it's written like this:
movl $-1, 8(%esp) #Second -1 parameter
movl $-1, 12(%esp) #First -1 parameter
movl $.LC0, 4(%esp) #Format string
movl $1, (%esp) #A one. Kind of odd, perhaps __printf_chk
#in your C library expects this.
call __printf_chk
By the way, a better way to calculate powers of 2 is to shift 1 left, e.g. 1ULL << 63 for 2^63. (Shifting by the full width of the type, as in 1ULL << 64, is undefined, so 2^64 - 1 is better written as ~0ULL or UINT64_MAX.)
No one in this thread noticed that the OP asked to explain the first 4 lines, not lines 11-14.
The first 4 lines are:
.file "size.c"
.section .rodata.str1.4,"aMS",#progbits,1
.align 4
.LC0:
Here's what happens in first 4 lines:
.file "size.c"
This is an assembler directive that says that we are about to start a new logical file called "size.c".
.section .rodata.str1.4,"aMS",#progbits,1
This is also a directive; it selects the section that holds the read-only strings in the program.
.align 4
This directive sets the location counter to always be a multiple of 4.
.LC0:
This is a label LC0 that can be jumped to, for example.
I hope I provided the right answer to the question, as I answered exactly what the OP asked.