For GNU Assembly x64 AT&T syntax: How to add 2 quad numbers? [duplicate]

I have written an assembly program in AT&T syntax to display the factorial of a number, but it's not working. Here is my code:
.text
.globl _start
_start:
movq $5,%rcx
movq $5,%rax
Repeat: #function to calculate factorial
decq %rcx
cmp $0,%rcx
je print
imul %rcx,%rax
cmp $1,%rcx
jne Repeat
# Now result of factorial stored in rax
print:
xorq %rsi, %rsi
# function to print integer result digit by digit by pushing in
#stack
loop:
movq $0, %rdx
movq $10, %rbx
divq %rbx
addq $48, %rdx
pushq %rdx
incq %rsi
cmpq $0, %rax
jz next
jmp loop
next:
cmpq $0, %rsi
jz bye
popq %rcx
decq %rsi
movq $4, %rax
movq $1, %rbx
movq $1, %rdx
int $0x80
addq $4, %rsp
jmp next
bye:
movq $1,%rax
movq $0, %rbx
int $0x80
.data
num : .byte 5
This program prints nothing. I also stepped through it in gdb: it works fine up to the loop label, but once it reaches next, seemingly random values start appearing in various registers. Please help me debug it so that it prints the factorial.

As @ped7g points out, you're doing several things wrong: using the int 0x80 32-bit ABI in 64-bit code, and passing character values instead of pointers to the write() system call.
Here's how to print an integer in x86-64 Linux, the simple and somewhat-efficient way (see footnote 1), using the same repeated division / modulo by 10.
System calls are expensive (probably thousands of cycles for write(1, buf, 1)), and doing a syscall inside the loop steps on registers so it's inconvenient and clunky as well as inefficient. We should write the characters into a small buffer, in printing order (most-significant digit at the lowest address), and make a single write() system call on that.
But then we need a buffer. The maximum length of a 64-bit integer is only 20 decimal digits, so we can just use some stack space. In x86-64 Linux, we can use stack space below RSP (up to 128B) without "reserving" it by modifying RSP. This is called the red-zone. If you wanted to pass the buffer to another function instead of a syscall, you would have to reserve space with sub $24, %rsp or something.
Instead of hard-coding system-call numbers, using GAS makes it easy to use the constants defined in .h files. Note the mov $__NR_write, %eax near the end of the function. The x86-64 SystemV ABI passes system-call arguments in similar registers to the function-calling convention. (So it's totally different from the 32-bit int 0x80 ABI, which you shouldn't use in 64-bit code.)
// building with gcc foo.S will use CPP before GAS so we can use headers
#include <asm/unistd.h> // This is a standard Linux / glibc header file
// includes unistd_64.h or unistd_32.h depending on current mode
// Contains only #define constants (no C prototypes) so we can include it from asm without syntax errors.
.p2align 4
.globl print_uint64 # void print_uint64(uint64_t value)
print_uint64:
lea -1(%rsp), %rsi # We use the 128B red-zone as a buffer to hold the string
# a 64-bit integer is at most 20 digits long in base 10, so it fits.
movb $'\n', (%rsi) # store the trailing newline byte. (Right below the return address).
# If you need a null-terminated string, leave an extra byte of room and store '\n\0'. Or push $'\n'
mov $10, %ecx # same as mov $10, %rcx but 2 bytes shorter
# note that newline (\n) has ASCII code 10, so we could actually have stored the newline with movb %cl, (%rsi) to save code size.
mov %rdi, %rax # function arg arrives in RDI; we need it in RAX for div
.Ltoascii_digit: # do{
xor %edx, %edx
div %rcx # rax = rdx:rax / 10. rdx = remainder
# store digits in MSD-first printing order, working backwards from the end of the string
add $'0', %edx # integer to ASCII. %dl would work, too, since we know this is 0-9
dec %rsi
mov %dl, (%rsi) # *--p = (value%10) + '0';
test %rax, %rax
jnz .Ltoascii_digit # } while(value != 0)
# If we used a loop-counter to print a fixed number of digits, we would get leading zeros
# The do{}while() loop structure means the loop runs at least once, so we get "0\n" for input=0
# Then print the whole string with one system call
mov $__NR_write, %eax # call number from asm/unistd_64.h
mov $1, %edi # fd=1
# %rsi = start of the buffer
mov %rsp, %rdx
sub %rsi, %rdx # length = one_past_end - start
syscall # write(fd=1 /*rdi*/, buf /*rsi*/, length /*rdx*/); 64-bit ABI
# rax = return value (or -errno)
# rcx and r11 = garbage (destroyed by syscall/sysret)
# all other registers = unmodified (saved/restored by the kernel)
# we don't need to restore any registers, and we didn't modify RSP.
ret
To test this function, I put this in the same file to call it and exit:
.p2align 4
.globl _start
_start:
mov $10120123425329922, %rdi
# mov $0, %edi # Yes, it does work with input = 0
call print_uint64
xor %edi, %edi
mov $__NR_exit, %eax
syscall # sys_exit(0)
I built this into a static binary (with no libc):
$ gcc -Wall -static -nostdlib print-integer.S && ./a.out
10120123425329922
$ strace ./a.out > /dev/null
execve("./a.out", ["./a.out"], 0x7fffcb097340 /* 51 vars */) = 0
write(1, "10120123425329922\n", 18) = 18
exit(0) = ?
+++ exited with 0 +++
$ file ./a.out
./a.out: ELF 64-bit LSB executable, x86-64, version 1 (SYSV), statically linked, BuildID[sha1]=69b865d1e535d5b174004ce08736e78fade37d84, not stripped
Footnote 1: See Why does GCC use multiplication by a strange number in implementing integer division? for avoiding div r64 for division by 10, because that's very slow (21 to 83 cycles on Intel Skylake). A multiplicative inverse would make this function actually efficient, not just "somewhat". (But of course there'd still be room for optimizations...)
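To make the footnote concrete, here is a minimal C sketch of dividing by 10 with a multiplicative inverse instead of div. The helper names div10 and mod10 are hypothetical, and the code assumes a compiler with the unsigned __int128 extension (GCC or clang on x86-64). The constant 0xCCCCCCCCCCCCCCCD is ceil(2^67 / 10), so the quotient is the high half of the 128-bit product shifted right by 3 more bits, which corresponds to the mul / shr $3, %rdx pair.

```c
#include <stdint.h>

/* Hypothetical helper: n / 10 without a div instruction.
 * Take the high 64 bits of n * ceil(2^67 / 10), then shift right by 3. */
uint64_t div10(uint64_t n)
{
    unsigned __int128 product = (unsigned __int128)n * 0xCCCCCCCCCCCCCCCDull;
    return (uint64_t)(product >> 64) >> 3;
}

/* The remainder (the next decimal digit) falls out of the quotient. */
unsigned mod10(uint64_t n)
{
    return (unsigned)(n - div10(n) * 10);
}
```

In the digit loop, mod10(val) + '0' gives the character to store and div10(val) gives the next val.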
Related: Linux x86-32 extended-precision loop that prints 9 decimal digits from each 32-bit "limb": see .toascii_digit: in my Extreme Fibonacci code-golf answer. It's optimized for code-size (even at the expense of speed), but well-commented.
It uses div like you do, because that's smaller than using a fast multiplicative inverse. It uses loop for the outer loop (over multiple integers for extended precision), again for code-size at the cost of speed.
It uses the 32-bit int 0x80 ABI, and prints into a buffer that was holding the "old" Fibonacci value, not the current.
Another way to get efficient asm is from a C compiler. For just the loop over digits, look at what gcc or clang produce for this C source (which is basically what the asm is doing). The Godbolt Compiler explorer makes it easy to try with different options and different compiler versions.
See gcc7.2 -O3 asm output which is nearly a drop-in replacement for the loop in print_uint64 (because I chose the args to go in the same registers):
void itoa_end(unsigned long val, char *p_end) {
const unsigned base = 10;
do {
*--p_end = (val % base) + '0';
val /= base;
} while(val);
// write(1, p_end, orig-current);
}
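A minimal sketch of how itoa_end might be completed into something callable from C, mirroring the asm print_uint64. The names itoa_end_ptr and print_uint64_c are hypothetical; the buffer is a local array rather than the red zone, unsigned long is assumed to be 64 bits (as on x86-64 Linux), and write() comes from unistd.h.

```c
#include <unistd.h>   /* write, ssize_t, size_t */

/* Same loop as itoa_end, but returns the start of the digit string. */
char *itoa_end_ptr(unsigned long val, char *p_end)
{
    const unsigned base = 10;
    do {
        *--p_end = (char)((val % base) + '0');
        val /= base;
    } while (val);
    return p_end;
}

/* Format val plus a trailing newline and emit it with a single write(). */
ssize_t print_uint64_c(int fd, unsigned long val)
{
    char buf[21];                      /* at most 20 digits, plus '\n' */
    char *end = buf + sizeof buf;
    *--end = '\n';
    char *start = itoa_end_ptr(val, end);
    return write(fd, start, (size_t)(buf + sizeof buf - start));
}
```

Calling print_uint64_c(1, value) writes the decimal digits and a newline to stdout and returns the number of bytes written.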
I tested performance on a Skylake i7-6700k by commenting out the syscall instruction and putting a repeat loop around the function call. The version with mul %rcx / shr $3, %rdx is about 5 times faster than the version with div %rcx for storing a long number-string (10120123425329922) into a buffer. The div version ran at 0.25 instructions per clock, while the mul version ran at 2.65 instructions per clock (although requiring many more instructions).
It might be worth unrolling by 2, and doing a divide by 100 and splitting up the remainder of that into 2 digits. That would give a lot better instruction-level parallelism, in case the simpler version bottlenecks on mul + shr latency. The chain of multiply/shift operations that brings val to zero would be half as long, with more work in each short independent dependency chain to handle a 0-99 remainder.
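The unroll-by-2 idea could look roughly like this in C. This is only a sketch under assumptions: the name itoa_end_by100 is hypothetical, it was not benchmarked, and the compiler is left to turn the divisions by constants into multiplies. The tail handles the last one or two digits separately so that no leading zero is stored.

```c
/* Store the decimal digits of val backwards from p_end, two per iteration.
 * Returns a pointer to the first (most significant) digit. */
char *itoa_end_by100(unsigned long val, char *p_end)
{
    while (val >= 100) {
        unsigned rem = (unsigned)(val % 100);   /* 0..99: two digits */
        val /= 100;                             /* the long dependency chain */
        *--p_end = (char)('0' + rem % 10);      /* short independent work */
        *--p_end = (char)('0' + rem / 10);
    }
    /* val is now 0..99: emit one or two digits without a leading zero. */
    if (val >= 10) {
        *--p_end = (char)('0' + val % 10);
        val /= 10;
    }
    *--p_end = (char)('0' + val);
    return p_end;
}
```

The loop-carried chain through val runs half as many iterations as the one-digit version, while splitting rem into two digits does not feed back into that chain.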
Related:
NASM version of this answer, for x86-64 or i386 Linux: How do I print an integer in Assembly Level Programming without printf from the c library?
How to convert a binary integer number to a hex string? - Base 16 is a power of 2, conversion is much simpler and doesn't require div.

Several things:
0) I guess this is a 64b Linux environment, but you should have stated so (if it is not, some of my points will be invalid).
1) int 0x80 is the 32b call interface, but you are using 64b registers, so you should use syscall (with different arguments).
2) int 0x80 with eax=4 requires ecx to contain the address of the memory where the content is stored, while you give it the ASCII character itself in ecx = illegal memory access (the first call should return an error, i.e. eax is a negative value). Running strace <your binary> should also reveal the wrong arguments and the error returned.
3) Why addq $4, %rsp? It makes no sense to me: you are damaging rsp, so the next pop rcx will pop the wrong value, and in the end you will run way "up" into the stack.
... maybe some more, I didn't debug it, this list is just from reading the source (so I may even be wrong about something, although that would be rare).
BTW your code is working. It just doesn't do what you expected. But it works fine, precisely as the CPU is designed and precisely as you wrote it in the code. Whether that achieves what you wanted, or makes sense, is a different topic, but don't blame the HW or the assembler.
... I can do a quick guess how the routine may be fixed (just partial hack-fix, still needs rewrite for syscall under 64b linux):
next:
cmpq $0, %rsi
jz bye
movq %rsp,%rcx # make ecx point to stack memory (with the stored char)
# this will work if you are lucky enough that rsp fits into 32b
# if it is beyond the 4GiB logical address, then you have bad luck (syscall needed)
decq %rsi
movq $4, %rax
movq $1, %rbx
movq $1, %rdx
int $0x80
addq $8, %rsp # now rsp += 8 is needed, because there's no POP
jmp next
Again, I didn't try this myself, I'm just writing it from memory, so let me know how it changes the situation.
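Point 2 is easier to see from the C side of the same system call. The sketch below is illustrative only (the helper name put_digit is hypothetical): write() takes the address of the bytes to output, so the character has to live in memory and a pointer to it is what gets passed.

```c
#include <unistd.h>   /* write, ssize_t */

/* write() wants a pointer to the data, not the data itself. */
ssize_t put_digit(int fd, unsigned digit)
{
    char c = (char)('0' + digit);   /* ASCII code for a digit 0..9 */
    return write(fd, &c, 1);        /* pass the address of the character */
}
```

The original asm did the equivalent of passing c itself as the second argument, which the kernel then treats as an address.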

Related

Trying to obtain addq, but keep getting leaq

So, I'm trying to get familiar with assembly by reverse-engineering some code. My problem lies in trying to get an addq, which I understand performs Destination = Source + Destination.
I am using the assumptions that parameters x, y, and z are passed in registers %rdi, %rsi, and %rdx. The return value is stored in %rax.
long someFunc(long x, long y, long z){
1. long temp=(x-z)*x;
2. long temp2= (temp<<63)>>63;
3. long temp3= (temp2 ^ x);
4. long answer=y+temp3;
5. return answer;
}
So far everything above line 4 is exactly what I am wanting. However, line 4 gives me leaq (%rsi,%rdi), %rax rather than addq %rsi, %rax. I'm not sure if this is something I am doing wrong, but I am looking for some insight.

Those instructions aren't equivalent. For LEA, rax is a pure output. For your hoped-for add, it's rax += rsi so the compiler would have to mov %rdi, %rax first. That's less efficient so it doesn't do that.
lea is a totally normal way for compilers to implement dst = src1 + src2, saving a mov instruction. In general don't expect C operators to compile to instruction named after them. Especially small left-shifts and add, or multiply by 3, 5, or 9, because those are prime targets for optimization with LEA. e.g. lea (%rsi, %rsi, 2), %rax implements result = y*3. See Using LEA on values that aren't addresses / pointers? for more. LEA is also useful to avoid destroying either of the inputs, if they're both needed later.
Assuming you meant t3 to be the same variable as temp3, clang does compile it the way you were expecting: it does a better job of register allocation than GCC, so it can use a shorter and more efficient add for the last instruction, without any extra mov instructions, instead of needing lea (Godbolt). This saves code-size (because of the indexed addressing mode), and add has slightly better throughput than LEA on most CPUs, like 4/clock instead of 2/clock.
Clang also optimized the shifts into andl $1, %eax / negq %rax to create the 0 or -1 result of that arithmetic right shift = bit-broadcast. It also optimized to 32-bit operand-size for the first few steps because the shifts throw away all but the low bit of temp1.
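The and/neg replacement for the shift pair can be checked in C. This is a minimal sketch with hypothetical function names; it assumes a compiler such as GCC or clang, where right-shifting a negative int64_t is an arithmetic shift and converting an out-of-range unsigned value to int64_t wraps.

```c
#include <stdint.h>

/* (temp << 63) >> 63: copy bit 0 into all 64 bits (the "bit-broadcast").
 * The left shift is done as unsigned to avoid signed-overflow UB. */
int64_t broadcast_shift(int64_t temp)
{
    return (int64_t)((uint64_t)temp << 63) >> 63;
}

/* The form clang emitted: and $1 / neg. */
int64_t broadcast_and_neg(int64_t temp)
{
    return -(temp & 1);
}
```

Both return 0 for even inputs and -1 (all bits set) for odd inputs.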
# side by side comparison, like the Godbolt diff pane
clang:                 | gcc:
movl  %edi, %eax       | movq  %rdi, %rax
subl  %edx, %eax       | subq  %rdx, %rdi
imull %edi, %eax       | imulq %rax, %rdi          # temp1
andl  $1, %eax         | salq  $63, %rdi
negq  %rax             | sarq  $63, %rdi           # temp2
xorq  %rdi, %rax       | xorq  %rax, %rdi          # temp3
addq  %rsi, %rax       | leaq  (%rdi,%rsi), %rax   # answer
retq                   | ret
Notice that clang chose imul %edi, %eax (into RAX) but GCC chose to multiply into RDI. That's the difference in register allocation that leads to GCC needing an lea at the end instead of an add.
Compilers sometimes even get stuck with an extra mov instruction at the end of a small function when they make poor choices like this, if the last operation wasn't something like addition that can be done with lea as a non-destructive op-and-copy. These are missed-optimization bugs; you can report them on GCC's bugzilla.
Other missed optimizations
GCC and clang could have optimized by using and instead of imul to set the low bit only if both inputs are odd.
Also, since only the low bit of the sub output matters, XOR (add without carry) would have worked, or even addition! (odd+-even = odd, even+-even = even, odd+-odd = even.) That would have allowed an lea instead of mov/sub as the first instruction.
lea (%rdi,%rdx), %eax # x + z (z is in %rdx): same low bit as x - z
and %edi, %eax # low bit matches (x-z)*x
andl $1, %eax # keep only the low bit
negq %rax # temp2
Let's make a truth table for the low bits of x and z to see how this shakes out if we want to optimize more / differently:
# truth table for low bit: input to shifts that broadcasts this to all bits
x&1 | z&1 | x-z = x^z | x*(x-z) = x & (x-z)
 0  |  0  |     0     |  0
 0  |  1  |     1     |  0
 1  |  0  |     1     |  1
 1  |  1  |     0     |  0
                         (last column = x & (~z) = BMI1 andn)
So temp2 = (x^z) & x & 1 ? -1 : 0. But also temp2 = -((x & ~z) & 1).
We can rearrange that to -((x&1) & ~z) which lets us start with not z and and $1, x in parallel, for better ILP. Or if z might be ready first, we could do operations on it and shorten the critical path from x -> answer, at the expense of z.
Or with a BMI1 andn instruction which does (~z) & x, we can do this in one instruction. (Plus another to isolate the low bit)
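The truth table can be confirmed mechanically. In this sketch (function names are hypothetical) each candidate expression for the low bit of temp is written out in C, using unsigned arithmetic so that the multiply wraps instead of overflowing:

```c
#include <stdint.h>

/* Low bit of temp = (x - z) * x, as in the original source. */
unsigned low_bit_mul(uint64_t x, uint64_t z)  { return (unsigned)(((x - z) * x) & 1); }

/* Cheaper forms with the same low bit. */
unsigned low_bit_xor(uint64_t x, uint64_t z)  { return (unsigned)(((x ^ z) & x) & 1); }
unsigned low_bit_add(uint64_t x, uint64_t z)  { return (unsigned)(((x + z) & x) & 1); }
unsigned low_bit_andn(uint64_t x, uint64_t z) { return (unsigned)((x & ~z) & 1); }
```

All four agree on every combination of low bits, and only x odd with z even gives 1.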
I think this function has the same behaviour for every possible input, so compilers could have emitted it from your source code. This is one possibility you should wish your compiler emitted:
# hand-optimized
# long someFunc(long x, long y, long z)
someFunc:
not %edx # ~z
and $1, %edx
and %edi, %edx # x&1 & ~z = low bit of temp1
neg %rdx # temp2 = 0 or -1
xor %rdi, %rdx # temp3 = x or ~x
lea (%rsi, %rdx), %rax # answer = y + temp3
ret
So there's still no ILP, unless z is ready before x and/or y. Using an extra mov instruction, we could do x&1 in parallel with not z.
Possibly you could do something with test/setz or cmov, but IDK if that would beat lea/and (temp1) + and/neg (temp2) + xor + add.
I haven't looked into optimizing the final xor and add, but note that temp3 is basically a conditional NOT of x. You could maybe improve latency at the expense of throughput by calculating both ways at once and selecting between them with cmov. Possibly by involving the 2's complement identity that -x - 1 = ~x. Maybe improve ILP / latency by doing x+y and then correcting that with something that depends on the x and z condition? Since we can't subtract using LEA, it seems best to just NOT and ADD.
# return y + x or y + (~x) according to the condition on x and z
someFunc:
lea (%rsi, %rdi), %rax # y + x
andn %edi, %edx, %ecx # ecx = x & (~z)
not %rdi # ~x
add %rsi, %rdi # y + (~x)
test $1, %cl
cmovnz %rdi, %rax # select between y+x and y+~x
retq
This has more ILP, but needs BMI1 andn to still be only 6 (single-uop) instructions. Broadwell and later have single-uop CMOV; on earlier Intel it's 2 uops.
The other function could be 5 uops using BMI andn.
In this version, the first 3 instructions can all run in the first cycle, assuming x,y, and z are all ready. Then in the 2nd cycle, ADD and TEST can both run. In the 3rd cycle, CMOV can run, taking integer inputs from LEA, ADD, and flag input from TEST. So the total latency from x->answer, y->answer, or z->answer is 3 cycles in this version. (Assuming single-uop / single-cycle cmov). Great if it's on the critical path, not very relevant if it's part of an independent dep chain and throughput is all that matters.
That compares with 5 cycles (with andn) or 6 cycles (without) for the previous attempt, or even worse for the compiler output using imul instead of and (3 cycle latency just for that instruction).
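As a sanity check, the two hand-written variants above can be modelled in C and compared against the original function. This is a sketch with hypothetical names; all arithmetic is done on uint64_t so overflow wraps the way the machine instructions do, and the final conversion back to int64_t assumes the usual wrapping behaviour of GCC and clang.

```c
#include <stdint.h>

/* Reference: the original C, with wrapping unsigned arithmetic. */
int64_t someFunc_ref(int64_t x, int64_t y, int64_t z)
{
    uint64_t temp  = ((uint64_t)x - (uint64_t)z) * (uint64_t)x;
    uint64_t temp2 = (temp & 1) ? ~(uint64_t)0 : 0;   /* (temp << 63) >> 63 */
    uint64_t temp3 = temp2 ^ (uint64_t)x;
    return (int64_t)((uint64_t)y + temp3);
}

/* First version: not / and / and / neg / xor / lea. */
int64_t someFunc_andn(int64_t x, int64_t y, int64_t z)
{
    uint64_t temp2 = -(((uint64_t)x & ~(uint64_t)z) & 1);   /* 0 or all-ones */
    return (int64_t)((uint64_t)y + ((uint64_t)x ^ temp2));
}

/* Second version: compute y + x and y + ~x, then select (the cmov). */
int64_t someFunc_select(int64_t x, int64_t y, int64_t z)
{
    uint64_t plain   = (uint64_t)y + (uint64_t)x;
    uint64_t flipped = (uint64_t)y + ~(uint64_t)x;
    return (int64_t)((((uint64_t)x & ~(uint64_t)z) & 1) ? flipped : plain);
}
```

For example, with x = 3, y = 10, z = 2 the condition is true (x odd, z even), so all three return 10 + ~3 = 6.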

Collatz function in assembly - segmentation fault

I am trying to write a hybrid program between C and x86-64 assembly language. This program should calculate the largest stopping time of a number between 1 and given parameter n using the Collatz function. The main function is written in C and in its for-loop it calls an external function written in assembly.
However, I am getting a segmentation fault when running the compiled hybrid program for values larger than 2. Using gdb I've found the error to be when I make the recursive call. This is the error I am getting:
Program received signal SIGSEGV, Segmentation fault.
0x00000000004006c3 in is_odd ()
C code:
#include <stdio.h>
#include <stdlib.h>
int noOfOp = 0;
extern int collatz(long long n);
// The main function. Main expects one parameter n.
// Then, it computes collatz(1), collatz(2), ..., collatz(n) and finds
// a number m, 1 <= m <= n, with the maximum stopping time.
int main(int argc, char *argv[]){
if (argc != 2) {
printf("Parameter \"n\" is missing. \n");
return -1;
} else {
int max=0;
long long maxn=0;
int tmp=0;
long long n = atoll(argv[1]);
for (long long i=1 ; i<=n ; i++) {
tmp = collatz(i);
if (tmp > max) {
max = tmp;
maxn=i;
}
}
printf("The largest stopping time between 1 and %lld was %lld ", n,maxn);
printf("with the stopping time of %d. \n", max);
}
}
And this is the x86-64 assembly code I've written. I expect this code reflects my lack of proper understanding of assembly so far. This is a class assignment that we were given four days to complete on this new topic. Normally I would have read more documentation, but I simply lack the time. And assembly language is hard.
.section .text
.global collatz
collatz:
pushq %rbp # save old base pointer
movq %rsp, %rbp # create new base pointer
subq $16, %rsp # local variable space
cmpq $1, %rdi # compare n to 1
je is_one # if n = 1, return noOfOp
incq noOfOp # else n > 1, then increment noOfOp
movq %rdi, %rdx # move n to register rdx
cqto # sign extend rdx:rax
movq $2, %rbx # move 2 to register rbx
idivq %rbx # n / 2 -- quotient is in rax, remainder in rdx
cmpq $1, %rdx # compare remainder to 1
je is_odd # if n is odd, jump to is_odd
jl is_even # else n is even, jump to is_even
leave # remake stack
ret # return
is_odd:
movq %rdi, %rdx # move n to register rdx
cqto # sign extend rdx:rax
movq $3, %rbx # move 3 to register rbx
imulq %rbx # n * 3 -- result is in rax:rdx
movq %rax, %rdi # move n to register rdi
incq %rdi # n = n + 1
call collatz # recursive call: collatz(3n+1) <---- this is where the segmentation fault seems to happen
leave # remake stack
ret # return
is_even:
movq %rax, %rdi # n = n / 2 (quotient from n/2 is still in rax)
call collatz # recursive call: collatz(n/2) <---- I seem to have gotten the same error here by commenting out most of the stuff in is_odd
leave # remake stack
ret # return
is_one:
movq noOfOp, %rax # set return value to the value of noOfOp variable
leave # remake stack
ret # return
I appreciate any and all the help and suggestions I can get.

Two problems that I see just from inspecting the code:
noOfOp is declared as an int, which is a 32-bit type on x86-64. Your assembly code, however, treats it as if it were a 64-bit type, specifically where you increment it with incq. That should instead be incl noOfOp or addl $1, noOfOp.
Along the same lines, your collatz function is prototyped as returning an int, but your code returns a 64-bit value in rax, loaded with movq noOfOp, %rax. This won't necessarily cause visible problems, because the caller uses only the lower 32 bits, but the 64-bit load also reads the 4 bytes that follow noOfOp in memory, so it is still incorrect.
You are ignoring the calling convention when recursively calling the collatz function. Assuming that you are on Linux, the applicable one would be the System V AMD64 calling convention. Here, the RBP and RBX registers are callee-save. Therefore, you need to preserve their contents. Do be sure to familiarize yourself with the calling convention and follow its rules.
As one of the commenters suggested, it may be easiest to write the function first in C or C++, before translating it to assembly. This will also make it easier to debug, and it also makes it possible to see what code the compiler emits. You can check the compiler's output against your own hand-written assembly code.
There may be additional problems with your code that I didn't spot. You can find them for yourself by single-stepping through your code with a debugger. You are already using GDB, so this should be simple to do.
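A C starting point for that approach might look like the following. This is only a sketch: the name collatz_c is hypothetical (chosen so it does not clash with the assembly collatz), it keeps the step count in a local instead of the global noOfOp, and it assumes n >= 1.

```c
/* Number of steps for n to reach 1 (the stopping time); assumes n >= 1. */
int collatz_c(long long n)
{
    int steps = 0;
    while (n != 1) {
        n = (n % 2 != 0) ? 3 * n + 1 : n / 2;
        ++steps;
    }
    return steps;
}
```

Compiling this with gcc -O2 -S gives assembly to compare against the hand-written version, including which registers the compiler saves and how it returns a 32-bit int in eax.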

After what Peter suggested in the comments above, I read what he and other brilliant people discussed in another thread on the same topic. This is the code I ended up with after implementing some of those ideas. It is now 30% faster than the version compiled with gcc -O3. I cannot believe how much faster the program can be with these different "tricks"; I truly learned a lot about efficiency. Thank you to those who helped.
.section .text
.global collatz
collatz:
pushq %rbp # save old base pointer
movq %rsp, %rbp # create new base pointer
subq $16, %rsp # local variable space
movq $-1, %r10 # start counter at -1
while_loop:
incq %r10 # increment counter
leaq (%rdi, %rdi, 2), %rdx # rdx = 2 * n + n
incq %rdx # rdx = 3n+1
sarq %rdi # rdi = n/2
cmovc %rdx, %rdi # if CF, rdi = rdx
# (if CF was set during right shift (i.e. n is odd) set rdi to 3n+1)
# else keep rdi to n/2
jnz while_loop # if n =/= 1 do loop again:
# Z flag is only set if sarq shifts when n is 1 making result 0.
# else
movq %r10, %rax # set return value to counter
leave # remake stack
ret # return
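For readers who find the flag handling hard to follow, here is a rough C model of that loop. It is a sketch, not the code that was benchmarked: the name collatz_branchless is hypothetical, n is assumed to be >= 1, and whether a compiler turns the conditional expression into a cmov is up to the compiler.

```c
/* C model of the lea / sar / cmovc / jnz loop; assumes n >= 1. */
int collatz_branchless(long long n)
{
    int steps = -1;                          /* counter starts at -1, like r10 */
    long long half;
    do {
        ++steps;
        long long odd_next = 3 * n + 1;      /* lea + inc: computed unconditionally */
        half = n >> 1;                       /* sar: the low bit goes to CF */
        n = (n & 1) ? odd_next : half;       /* cmovc: pick 3n+1 if n was odd */
    } while (half != 0);                     /* jnz tests the sar result, not n */
    return steps;
}
```

The loop exits on the iteration where n was 1 on entry, because that is the only case (for n >= 1) where the shift produces 0, so the count excludes that final pass.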

Thank you for all the answers. I apologize if my question did not follow Stack Overflow guidelines.
I meant that normally I would not bother others with this if I had more time. Instead I sought guidance - not presumed debugging service - that could lead me on the right path.
For anyone interested I got the program to work. I went with a different approach than originally posted and made some changes for speed up. Below is the new assembly code.
.section .text
.global collatz
collatz:
pushq %rbp # save old base pointer
movq %rsp, %rbp # create new base pointer
subq $16, %rsp # local variable space
cmpq $1, %rdi # compare n to 1
je is_one # if n = 1, jump to is_one
# else n > 1
incl noOfOp # increment noOfOp
movq %rdi, %rax # move n to rax
andq $1, %rax # AND 1 with n
jz is_even # if n is even jump to is_even
# else n is odd
movq $3, %rdx # move 3 to rdx
imul %rdx, %rdi # n = 3 * n
incq %rdi # n = 3n + 1
call collatz # recursive call: collatz(3n+1)
leave # remake stack
ret # return
is_even:
sarq %rdi # arithmetic right shift by 1 - divide n by 2
call collatz # recursive call: collatz(n/2)
leave # remake stack
ret # return
is_one:
movl noOfOp, %eax # set return value to noOfOp
movl $0, noOfOp # reset noOfOp
leave # remake stack
ret # return
This works and is approx. 30% faster than the code I have written only in C. But I know from the assignment that I can shave off even more time making it more effective. If anyone has any ideas how to do so feel free to comment.
Thank you again.

Translating O2 optimized for-loop from assembly to C

This is a homework question.
I am attempting to obtain information from the following assembly code (x86 linux machine, compiled with gcc -O2 optimization). I have commented each section to show what I know. A big chunk of my assumptions could be wrong, but I have done enough searching to the point where I know I should ask these questions here.
.section .rodata.str1.1,"aMS",@progbits,1
.LC0:
.string "result %lx\n" //Printed string at end of program
.text
main:
.LFB13:
xorl %esi, %esi // value of esi = 0; x
movl $1, %ecx // value of ecx = 1; result
xorl %edx, %edx // value of edx = 0; Loop increment variable (possibly mask?)
.L2:
movq %rcx, %rax // value of rax = 1; ?
addl $1, %edx // value of edx = 1; Increment loop by one;
salq $3, %rcx // value of rcx = 8; Shift left rcx;
andl $3735928559, %eax // value of eax = 1; Value AND 1 = 1;
orq %rax, %rsi // value of rsi = 1; 1 OR 0 = 1;
cmpl $22, %edx // edx != 22
jne .L2 // if true, go back to .L2 (loop again)
movl $.LC0, %edi // Point to string
xorl %eax, %eax // value of eax = 0;
jmp printf // print
.LFE13: ret // return
And I am supposed to turn it into the following C code with the blanks filled in
#include <stdio.h>
int main()
{
long x = 0x________;
long result = ______;
long mask;
for (mask = _________; mask _______; mask = ________) {
result |= ________;
}
printf("result %lx\n",result);
}
I have a couple of questions and sanity checks that I want to make sure I am getting right since none of the similar examples I have found are for optimized code. Upon compiling some trials myself I get something close but the middle part of L2 is always off.
MY UNDERSTANDING
At the beginning, esi is xor'd with itself, resulting in 0 which is represented by x. 1 is then added to ecx, which would be represented by the variable result.
x = 0; result = 1;
Then, I believe a loop increment variable is stored in edx and set to 0. This will be used in the third part of the for loop (update expression). I also think that this variable must be mask, because later on 1 is added to edx, signifying a loop increment (mask = mask++), along with edx being compared in the middle part of the for loop (test expression aka mask != 22).
mask = 0; (in a way)
The loop is then entered, with rax being set to 1. I don't understand where this is used at all since there is no fourth variable I have declared, although it shows up later to be anded and zeroed out .
movq %rcx, %rax;
The loop variable is then incremented by one
addl $1, %edx;
THE NEXT PART MAKES THE LEAST AMOUNT OF SENSE TO ME
The next three operations I feel make up the body expression of the loop, however I have no idea what to do with them. It would result in something similar to result |= x ... but I don't know what else
salq $3, %rcx
andl $3735928559, %eax
orq %rax, %rsi
The rest I feel I have a good grasp on. A comparison is made ( if mask != 22, loop again), and the results are printed.
PROBLEMS I AM HAVING
I don't understand a couple of things.
1) I don't understand how to figure out my variables. There seem to be 3 hardcoded ones along with one increment or temporary storage variable that is found in the assembly (rax, rcx, rdx, rsi). I think rsi would be the x , and rcx would be result, yet I am unsure of if mask would be rdx or rax, and either way, what would the last variable be?
2) What do the 3 expressions of which I am unsure of do? I feel that I have them mixed up with the incrementation somehow, but without knowing the variables I don't know how to go about solving this.
Any and all help will be great, thank you!

The answer is :
#include <stdio.h>
int main()
{
long x = 0xDEADBEEF;
long result = 0;
long mask;
for (mask = 1; mask != 0; mask = mask << 3) {
result |= mask & x;
}
printf("result %lx\n",result);
}
In the assembly :
rsi is result. We deduce that because it is the only value that gets ORed, and it is the second argument of the printf (in x64 Linux, arguments are passed in rdi, rsi, rdx, and some others, in order).
x is a constant that is set to 0xDEADBEEF (3735928559 in decimal, the immediate of the andl). This cannot be deduced for certain, but it makes sense because it seems to be set as a constant in the C code, and doesn't seem to be modified after that.
Now for the rest, it is obfuscated by an anti-optimization by GCC. You see, GCC detected that the loop would be executed exactly 22 times, and thought it was clever to mangle the condition and replace it by a useless counter. Knowing that, we see that edx is the useless counter, and rcx is mask. We can then deduce the real condition and the real "increment" operation. We can see the <<= 3 in the assembly, and notice that if you shift a 64-bit int left by 3 bits 22 times, it becomes 0 (that is a shift by 66 bits in total, so the single set bit is shifted out).
This anti-optimization is sadly really common for GCC. The assembly can be replaced with:
.LFB13:
xorl %esi, %esi
movl $1, %ecx
.L2:
movq %rcx, %rax
andl $3735928559, %eax
orq %rax, %rsi
salq $3, %rcx // implicit test for 0
jne .L2
movl $.LC0, %edi
xorl %eax, %eax
jmp printf
It does exactly the same thing, but we removed the useless counter and saved 3 assembly instructions. It also matches the C code better.
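The iteration count can be checked by running the reconstructed loop with a counter added. This is a sketch for verification only: the function name masked_or is hypothetical, and uint64_t replaces long so that shifting the top bit out is well defined in C.

```c
#include <stdint.h>

/* Runs the reconstructed loop and reports how many times the body executed. */
uint64_t masked_or(uint64_t x, int *iterations)
{
    uint64_t result = 0;
    int count = 0;
    for (uint64_t mask = 1; mask != 0; mask <<= 3) {
        result |= mask & x;      /* keep every third bit of x */
        ++count;
    }
    *iterations = count;
    return result;
}
```

With x = 0xDEADBEEF the body runs 22 times (mask takes the values 1 << 0, 1 << 3, ..., 1 << 63) and the result is 0x48249249.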

Let's work backwards a bit. We know that result must be the second argument to printf(). In the x86_64 calling convention, that's %rsi. The loop is everything between the .L2 label and the jne .L2 instruction. We see in the template that there's a result |= line at the end of the loop, and indeed, there's an orq instruction there with %rsi as its target, so that checks out. We can now see what it's initialized to at the top of main.
ElderBug is correct that the compiler spuriously optimized by adding a counter. But we can still figure out: which instruction runs immediately after the |= when the loop repeats? That must be the third part of the loop. What runs immediately before the body of the loop? That must be the loop initialization. Unfortunately, you'll have to figure out what would have happened on the 22nd iteration of the original loop to reverse-engineer the loop condition. (But sal is a left-shift, and that line is a vestige of the original loop condition, which would have been followed by a conditional branch before the %edx test was inserted.)
Note that the code keeps a copy of the value of mask around in %rcx before modifying it in %rax, and x is folded into a constant (take a close look at the andl line).
Also note that you can feed the .S file to gas to get a .o and see what it does.

Use of DI register in String operations

I was looking at the Compiler output for a C program, just for academic purposes and happened to get the following output.
.file "test.c"
.section .rodata
.LC0:
.string "Hello World"
.text
.globl main
.type main, @function
main:
.LFB0:
.cfi_startproc
pushq %rbp
.cfi_def_cfa_offset 16
.cfi_offset 6, -16
movq %rsp, %rbp
.cfi_def_cfa_register 6
movl $.LC0, %edi
movl $0, %eax
call printf
movl $0, %eax
popq %rbp
.cfi_def_cfa 7, 8
ret
.cfi_endproc
.LFE0:
.size main, .-main
.ident "GCC: (Ubuntu 4.8.2-19ubuntu1) 4.8.2"
.section .note.GNU-stack,"",@progbits
I understand the parts where the base pointer and stack pointer operations take place, and the other operations. What I want to know is the purpose of
movl $.LC0, %edi
How does loading the address of the text "Hello World" from the data block into the destination register serve the purpose? We could have just loaded that address into the accumulator and let printf handle it. I am not used to programming in assembly, but I can make out what the program is doing. Am I missing something obvious here?
Google searches showed that these registers are used for string operations, but none said why.

First of all, your call on printf may be passing arguments by registers and not by stack because it was optimised in that way, or because its attributes during compilation were set to __fastcall (MSVC) or __attribute__((fastcall)).
The %esi and %edi registers are used in string operations because of their meaning to string instructions such as cmps, lods, movs, scas, stos, outs or ins. These instructions use the source and destination registers for quick sequential access to a string of bytes/words/doublewords. They can be used in loops for simple operations that are performed repeatedly over memory, and in combination with the rep prefixes they can shorten execution time by removing the need for explicit pointer manipulation and limit checking.
A very good example of this is the movs instruction (which also has the forms movsb, movsw and movsd). If you wanted to write a simple string copy procedure without string instructions, you would write something like this:
; IN: EAX=source&, EBX=dest&, ECX=count
; OUT: nothing
copy:
.loop:
cmp ecx, 0
jz .end
dec ecx
mov dl, byte [eax+ecx] ; DL as scratch: AL is the low byte of EAX, the source pointer
mov byte [ebx+ecx], dl
jmp .loop
.end:
ret
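The movsb instruction copies the byte at [esi] to [edi], then increments esi and edi (assuming the direction flag is clear). On its own it does not touch ecx, so the loop still has to count. With this in mind you can write something similar to this:
; IN: ESI=source&, EDI=dest&, ECX=count
; OUT: nothing
copy:
.loop:
jecxz .end
movsb
dec ecx ; movsb alone does not decrement ECX
jmp .loop
.end:
ret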
Using the rep prefix, which repeats movsb and decrements ecx until it reaches zero, you can speed up the whole operation again:
; IN: ESI=source&, EDI=dest&, ECX=count
; OUT: nothing
copy:
rep movsb
ret
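For readers who want to try the rep movsb version from C, here is a minimal sketch using GCC-style inline assembly. The helper name copy_bytes is hypothetical, and the sketch assumes a GCC-compatible compiler on x86 or x86-64 with the direction flag clear (which the usual ABIs guarantee for compiled code); on any other target it simply falls back to memcpy.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

/* Hypothetical helper: copy `count` bytes from src to dst, front to back. */
static void copy_bytes(void *dst, const void *src, size_t count)
{
    assert(dst != NULL && src != NULL);
#if defined(__GNUC__) && (defined(__x86_64__) || defined(__i386__))
    /* rep movsb copies (R|E)CX bytes from [(R|E)SI] to [(R|E)DI].
       The "+D", "+S" and "+c" constraints pin the operands to exactly
       those registers and tell the compiler they are modified. */
    __asm__ volatile("rep movsb"
                     : "+D"(dst), "+S"(src), "+c"(count)
                     :
                     : "memory");
#else
    /* Assumption: on other targets we just use the library routine. */
    memcpy(dst, src, count);
#endif
}
```

Calling copy_bytes(out, "Hello World", 12) copies the string including its terminating NUL into out. The register constraints are the point of the example: the instruction only works with the source in SI and the destination in DI, which is where those register names come from.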

I am going to say yes and no to user35443's answer.
I wanted to know what is the use of putting
movl $.LC0, %edi
Since you are using 64-bit Linux (judging from the use of rbp), parameters are passed in registers. rdi contains the first integer or pointer parameter, rsi the second, rdx the third, rcx the fourth, r8 the fifth and r9 the sixth; any further parameters are passed on the stack.
we could have just loaded that address in the accumulator and let
printf handle it
No! When using assembly, it is up to you to read and understand the ABI for the OS you are using and follow it to the T! If you were using Windows, the first parameter would be in rcx instead. It has nothing to do with source or destination.
The "accumulator" is actually a parameter to printf, and to all vararg functions really. al (the low byte of rax/eax) holds the number of vector (xmm) registers used to pass floating-point arguments; since your example code passes no floats, eax is set to 0.
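To make the register assignment concrete, here is a small C sketch of a variadic call with the x86-64 System V argument locations noted in comments. The helper name format_greeting is hypothetical, and the register notes are what the ABI prescribes for this call, not something the C code itself can observe; compiling with gcc -S is the way to confirm them on your own system.

```c
#include <assert.h>
#include <stddef.h>
#include <stdio.h>

/* Hypothetical helper: format one string and one double into buf. */
static int format_greeting(char *buf, size_t size, const char *who, double x)
{
    assert(buf != NULL && size > 0);
    /* For this snprintf call, the x86-64 System V ABI places:
         buf    -> %rdi   (1st integer/pointer argument)
         size   -> %rsi   (2nd)
         format -> %rdx   (3rd)
         who    -> %rcx   (4th)
         x      -> %xmm0  (1st floating-point argument)
       and %al is set to 1, because one vector register carries
       a variadic argument. */
    return snprintf(buf, size, "Hello %s %.1f", who, x);
}
```

With no floating-point arguments, as in the printf("Hello World") call from the question, the compiler emits movl $0, %eax before the call instead.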

c & gcc : Stack growth and alignment - for a 64 bit machine

I have the following program. I wonder why it outputs -4 on the following 64-bit machine. Which of my assumptions is wrong?
[Linux ubuntu 3.2.0-23-generic #36-Ubuntu SMP Tue Apr 10 20:39:51 UTC
2012 x86_64 x86_64 x86_64 GNU/Linux]
With the above machine and gcc compiler, b should by default be pushed first and a second.
The stack grows downwards, so b should have the higher address and a the lower one, and the result should be positive. But I got -4. Can anybody explain this?
The arguments are two chars, which would occupy 2 bytes in the stack frame, yet I see a difference of 4 whereas I am expecting 1. Even if somebody says it is because of alignment, I then wonder why a structure with 2 chars is not aligned to 4 bytes.
#include <stdio.h>
#include <stdint.h> /* intptr_t */
#include <stdlib.h>
#include <unistd.h>
void CompareAddress(char a, char b)
{
printf("Differs=%ld\n", (long)((intptr_t)&b - (intptr_t)&a));
}
int main()
{
CompareAddress('a','b');
return 0;
}
/* Differs= -4 */

Here's my guess:
On x64 Linux, the calling convention states that the first few parameters are passed in registers.
So in your case, both a and b are passed in registers rather than on the stack. However, since you take their addresses, the compiler has to store them somewhere on the stack once the function has been entered (not necessarily in downward order).
It's also possible that the function is just outright inlined.
In either case, the compiler makes temporary stack space to store the variables. Those can be in any order and subject to optimizations. So they may not be in any particular order that you might expect.
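The contrast with the two-char structure mentioned in the question can be sketched in C. The names two_chars, member_gap and param_gap are hypothetical. Struct members are laid out in declaration order, so their distance is fixed by the ABI (1 byte on common platforms, which is an assumption about the target rather than a guarantee of the C standard), while the distance between two parameters is whatever the compiler happened to choose.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical type: two chars declared back to back. */
struct two_chars { char a; char b; };

/* Distance between the members inside the struct.
   Members are laid out in declaration order, starting at offset 0. */
static ptrdiff_t member_gap(void)
{
    assert(offsetof(struct two_chars, a) == 0);
    return (ptrdiff_t)offsetof(struct two_chars, b)
         - (ptrdiff_t)offsetof(struct two_chars, a);
}

/* Distance between two parameters. Both arrive in registers on
   x86-64 System V and are spilled to stack slots only because their
   addresses are taken; sign and size of the gap are unspecified. */
static intptr_t param_gap(char a, char b)
{
    return (intptr_t)&b - (intptr_t)&a;
}
```

The only thing that can be relied on for param_gap is that the result is non-zero, since a and b are distinct objects. Values such as -4, 4, 1 or -1 are all legitimate and may change with the compiler version or optimisation level.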

The best way to answer this sort of question (about the behaviour of a specific compiler on a specific platform) is to look at the assembly output. You can get gcc to dump its assembly by passing the -S flag (and the -fverbose-asm flag is nice too). Running
gcc -S -fverbose-asm file.c
gives a file.s that looks a little like this (I've removed all the irrelevant bits, and the bits in parentheses are my notes):
CompareAddress:
# ("allocate" memory on the stack for local variables)
subq $16, %rsp
# (put a and b onto the stack)
movl %edi, %edx # a, tmp62
movl %esi, %eax # b, tmp63
movb %dl, -4(%rbp) # tmp62, a
movb %al, -8(%rbp) # tmp63, b
# (get their addresses)
leaq -8(%rbp), %rdx #, b.0
leaq -4(%rbp), %rax #, a.1
subq %rax, %rdx # a.1, D.4597 (&b - &a)
# (set up the parameters for the printf call)
movl $.LC0, %eax #, D.4598
movq %rdx, %rsi # D.4597,
movq %rax, %rdi # D.4598,
movl $0, %eax #,
call printf #
main:
# (put 'a' and 'b' into the registers for the function call)
movl $98, %esi #,
movl $97, %edi #,
call CompareAddress
(This question explains nicely what [re]bp and [re]sp are.)
The reason the difference is negative is that the stack grows downward: if you put two things onto the stack, the one placed first has the larger address. Here a is stored first, at -4(%rbp), and b below it at -8(%rbp), so &b - &a is -4.
The reason it is -4 rather than -1 is that the compiler has decided that aligning the arguments to 4-byte boundaries is "better", probably because a 32-bit/64-bit CPU deals with 4 bytes at a time better than it handles single bytes.
(Also, looking at the assembler shows the effect that -mpreferred-stack-boundary has: it essentially means that memory on the stack is allocated in different sized chunks.)

I think the answer the program gives you is correct. GCC's default preferred stack boundary is 4 (that is, 2^4 = 16 bytes); you can pass -mpreferred-stack-boundary=num to GCC to change the stack boundary, and the program will then give you a different answer according to your setting.
