Clang 11 and GCC 8 O2 Breaks Inline Assembly

I have a short snippet of code with some inline assembly that prints argv[0] properly at -O0 but prints nothing at -O2 when compiled with Clang. GCC, on the other hand, prints the string stored in envp[0] when asked to print argv[0]. The problem is restricted to argv: the other two function parameters can be used as expected with or without optimizations enabled. I tested this with both GCC and Clang, and both compilers have this issue.
Here is the code:
void exit(unsigned long long status) {
asm volatile("movq $60, %%rax;" //system call 60 is exit
"movq %0, %%rdi;" //return code 0
"syscall"
: //no outputs
:"r"(status)
:"rax", "rdi");
}
int open(const char *pathname, unsigned long long flags) {
asm volatile("movq $2, %%rax;" //system call 2 is open
"movq %0, %%rdi;"
"movq %1, %%rsi;"
"syscall"
: //no outputs
:"r"(pathname), "r"(flags)
:"rax", "rdi", "rsi");
return 1;
}
int write(unsigned long long fd, const void *buf, size_t count) {
asm volatile("movq $1, %%rax;" //system call 1 is write
"movq %0, %%rdi;"
"movq %1, %%rsi;"
"movq %2, %%rdx;"
"syscall"
: //no outputs
:"r"(fd), "r"(buf), "r"(count)
:"rax", "rdi", "rsi", "rdx");
return 1;
}
static void entry(unsigned long long argc, char** argv, char** envp);
/*https://www.systutorials.com/x86-64-calling-convention-by-gcc/: "The calling convention of the System V AMD64 ABI is followed on GNU/Linux. The registers RDI, RSI, RDX, RCX, R8, and R9 are used for integer and memory address arguments
and XMM0, XMM1, XMM2, XMM3, XMM4, XMM5, XMM6 and XMM7 are used for floating point arguments.
For system calls, R10 is used instead of RCX. Additional arguments are passed on the stack and the return value is stored in RAX."*/
//__attribute__((naked)) defines a pure-assembly function
__attribute__((naked)) void _start() {
asm volatile("xor %%rbp,%%rbp;" //http://dbp-consulting.com/tutorials/debugging/linuxProgramStartup.html: "%ebp,%ebp sets %ebp to zero. This is suggested by the ABI (Application Binary Interface specification), to mark the outermost frame."
"pop %%rdi;" //rdi: arg1: argc -- can be popped off the stack because it is copied onto register
"mov %%rsp, %%rsi;" //rsi: arg2: argv
"mov %%rdi, %%rdx;"
"shl $3, %%rdx;" //each argv pointer takes up 8 bytes (so multiply argc by 8)
"add $8, %%rdx;" //add size of null word at end of argv-pointer array (8 bytes)
"add %%rsp, %%rdx;" //rdx: arg3: envp
"andq $-16, %%rsp;" //align stack to 16 bytes (which is required on x86-64)
"jmp %P0" //https://stackoverflow.com/questions/3467180/direct-c-function-call-using-gccs-inline-assembly: "After looking at the GCC source code, it's not exactly clear what the code P in front of a constraint means. But, among other things, it prevents GCC from putting a $ in front of constant values. Which is exactly what I need in this case."
:
:"i"(entry)
:"rdi", "rsp", "rsi", "rdx", "rbp", "memory");
}
//Function cannot be optimized-away, since it is passed-in as an argument to asm-block above
//Compiler Options: -fno-asynchronous-unwind-tables;-O2;-Wall;-nostdlibinc;-nobuiltininc;-fno-builtin;-nostdlib; -nodefaultlibs;--no-standard-libraries;-nostartfiles;-nostdinc++
//Linker Options: -nostdlib; -nodefaultlibs
static void entry(unsigned long long argc, char** argv, char** envp) {
int ttyfd = open("/dev/tty", O_WRONLY);
write(ttyfd, argv[0], 9);
write(ttyfd, "\n", 1);
exit(0);
}
Edit: Added syscall definitions.
Edit: Adding rcx and r11 to the clobber list for the syscalls fixed the issue for Clang, but GCC continued to have the error.
Edit: GCC actually was not having an error, but some kind of strange error in my build system (CodeLite) made it so that the program ran some kind of partially-built program, even though GCC reported errors about it not recognizing two of the compiler flags passed-in.
For GCC, use these flags instead: -fomit-frame-pointer;-fno-asynchronous-unwind-tables;-O2;-Wall;-nostdinc;-fno-builtin;-nostdlib; -nodefaultlibs;--no-standard-libraries;-nostartfiles;-nostdinc++. You can also use these flags for Clang, due to Clang's support for the above GCC options.

You can't use extended asm in a naked function, only basic asm, according to the gcc manual. You don't need to inform the compiler of clobbered registers (since it won't do anything about them anyway; in a naked function you are responsible for all register management). And passing the address of entry in an extended operand is unnecessary; just do jmp entry.
(In my tests your code doesn't compile at all, so I assume you weren't showing us your exact code - next time please do, so as to avoid wasting people's time.)
Linux x86-64 syscall system calls are allowed to clobber the rcx and r11 registers, so you need to add those to the clobber lists of your system calls.
You align the stack to a 16-byte boundary before jumping to entry. However, the 16-byte alignment rule is based on the assumption that you will be calling the function with call, which would push an additional 8 bytes onto the stack. As such, the called function actually expects the stack to initially be, not a multiple of 16, but 8 more or less than a multiple of 16. So you are actually aligning the stack incorrectly, and this can be a cause of all sorts of mysterious trouble.
So either replace your jmp with call, or else subtract a further 8 bytes from rsp (or just push some 64-bit register of your choice).
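As a rough illustration of the arithmetic (not part of the original program; the helper names and the sample address are made up), the stack pointer that entry sees can be modelled in plain C:

```c
#include <assert.h>
#include <stdint.h>

/* Hypothetical helpers modelling what _start does to rsp (illustration only). */

/* rsp as seen by entry() when _start aligns the stack and then uses jmp. */
static uint64_t rsp_at_entry_via_jmp(uint64_t rsp)
{
    return rsp & ~(uint64_t)15;          /* andq $-16, %rsp; nothing is pushed */
}

/* rsp as seen by entry() when _start aligns the stack and then uses call. */
static uint64_t rsp_at_entry_via_call(uint64_t rsp)
{
    return (rsp & ~(uint64_t)15) - 8;    /* call pushes an 8-byte return address */
}

/* Compiled code assumes rsp % 16 == 8 on function entry. */
static int entry_alignment_ok(uint64_t rsp_at_entry)
{
    return rsp_at_entry % 16 == 8;
}

/* Self-check with an arbitrary sample stack address. */
static void alignment_demo(void)
{
    uint64_t sample = 0x7ffc12345678ULL;
    assert(!entry_alignment_ok(rsp_at_entry_via_jmp(sample)));
    assert(entry_alignment_ok(rsp_at_entry_via_call(sample)));
}
```

With the sample address, the jmp variant hands entry a value that is an exact multiple of 16, which is not what compiled code assumes, while the call variant hands it 8 modulo 16.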
Style note: unsigned long is already 64 bits on Linux x86-64, so it would be more idiomatic to use that in place of unsigned long long everywhere.
General hint: learn about register constraints in extended asm. You can have the compiler load your desired registers for you, instead of writing instructions in your asm to do it yourself. So your exit function could instead look like:
void exit(unsigned long status) {
asm volatile("syscall"
: //no outputs
:"a"(60), "D" (status)
:"rcx", "r11");
}
This in particular saves you a few instructions, since status is already in the %rdi register on function entry. With your original code, the compiler has to move it somewhere else so that you can then load it into %rdi yourself.
Your open function always returns 1, which will typically not be the fd that was actually opened. So if your program is run with standard output redirected, your program will write to the redirected stdout, instead of to the tty as it seems to want to do. Indeed, this makes the open syscall completely pointless, because you never use the file you opened.
You should arrange for open to return the value that was actually returned by the system call, which will be left in the %rax register when syscall returns. You can use an output operand to have this stored in a temporary variable (which the compiler will likely optimize out), and return that. You'll need to use a digit constraint since it is going in the same register as an input operand. I leave this as an exercise for you. It would likewise be nice if your write function actually returned the number of bytes written.
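One possible way to do that exercise is sketched below. It assumes x86-64 Linux, and the names sys_open and sys_write are hypothetical, picked only to avoid clashing with the libc prototypes. The "0" constraint places the system call number in the same register (rax) that carries the result back.

```c
#include <assert.h>
#include <stddef.h>

/* Sketch for x86-64 Linux only; sys_open and sys_write are hypothetical names. */

/* open(2) is system call 2. Returns the new fd, or a negative errno value. */
static long sys_open(const char *pathname, long flags, long mode)
{
    long ret;
    __asm__ volatile("syscall"
                     : "=a"(ret)                 /* rax holds the result */
                     : "0"(2L),                  /* rax also carries the call number */
                       "D"(pathname), "S"(flags), "d"(mode)
                     : "rcx", "r11", "memory");  /* syscall clobbers rcx and r11 */
    return ret;
}

/* write(2) is system call 1. Returns bytes written, or a negative errno value. */
static long sys_write(long fd, const void *buf, size_t count)
{
    long ret;
    __asm__ volatile("syscall"
                     : "=a"(ret)
                     : "0"(1L), "D"(fd), "S"(buf), "d"(count)
                     : "rcx", "r11", "memory");
    return ret;
}

/* Self-check: open /dev/null write-only and write three bytes to it. */
static void sys_wrappers_demo(void)
{
    long fd = sys_open("/dev/null", 1 /* O_WRONLY on Linux */, 0);
    assert(fd >= 0);
    assert(sys_write(fd, "hi\n", 3) == 3);
}
```

With wrappers like these, entry can pass the descriptor returned by sys_open straight to sys_write instead of assuming it is 1.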

Related

Getrusage inline assembly

I am trying to implement the getrusage function in my client-server program, which uses sockets and runs on FreeBSD. I want to print out processor time usage and memory usage.
I have tried the following code, but I get the output Illegal instruction (Core dumped)
int getrusage(int who, struct rusage *usage){
int errorcode;
__asm__(
"syscall"
: "=a" (errorcode)
: "a" (117), "D" (who), "S" (usage) //const Sysgetrusage : scno = 117
: "memory"
);
if (errorcode<0) {
printf("error");
}
return 1;
}
UPDATE: I have tried to run this but I get zero values or some random number value or negative number values. Any ideas what am I missing?
int getrusage(int who, struct rusage *usage){
int errorcode;
__asm__("push $0;"
"push %2;"
"push %1;"
"movl $117, %%eax;"
"int $0x80;"
:"=r"(errorcode)
:"D"(who),"S"(usage)
:"%eax"
);
if (errorcode<0) {
printf("error");
}
return 1;
}
I would like to use system call write more likely, but it is giving me a compilation warning: passing arg 1 of 'strlen' makes pointer from integer without a cast
EDIT: (this is working code now, regarding to comment)
struct rusage usage;
getrusage(RUSAGE_SELF,&usage);
char tmp[300];
write(i, "Memory: ", 7);
sprintf (tmp, "%ld", usage.ru_maxrss);
write(i, tmp, strlen(tmp));
write(i, "Time: ", 5);
sprintf (tmp, "%lds", usage.ru_utime.tv_sec);
write(i, tmp, strlen(tmp));
sprintf (tmp, "%ldms", usage.ru_utime.tv_usec);
write(i, tmp, strlen(tmp));
Any ideas what is wrong?

The reason you are getting an illegal instruction error is because the SYSCALL instruction is only available on 64-bit FreeBSD running a 64-bit program. This is a serious issue since one of your comments suggests that your code is running on 32-bit FreeBSD.
Under normal circumstances you don't need to write your own getrusage since it is part of the C library (libc) on that platform. It appears you have been tasked to do it with inline assembly.
64-bit FreeBSD and SYSCALL Instruction
There is a bit of a bug in your 64-bit code since SYSCALL destroys the contents of RCX and R11. Your code may work but may fail in the future especially as the program expands and you enable optimizations. The following change adds those 2 registers to the clobber list:
int errorcode;
__asm__(
"syscall"
: "=a" (errorcode)
: "a" (117), "D" (who), "S" (usage) //const Sysgetrusage : scno = 117
: "memory", "rcx", "r11"
);
Using the memory clobber can lead to generation of inefficient code so I use it only if necessary. As you become more of an expert the need for memory clobber can be eliminated. I would have used a function like the following if I wasn't allowed to use the C library version of getrusage:
int getrusage(int who, struct rusage *usage){
int errorcode;
__asm__(
"syscall"
: "=a"(errorcode), "=m"(*usage)
: "0"(117), "D"(who), "S"(usage)
: "rcx", "r11"
);
if (errorcode<0) {
printf("error");
}
return errorcode;
}
This uses a memory operand as an output constraint and drops the memory clobber. Since the compiler knows how large a rusage structure is, and =m says the asm modifies that memory, we don't need the memory clobber.
32-bit FreeBSD System Calls via Int 0x80
As mentioned in the comments and shown in your updated code, to make a system call in 32-bit code on FreeBSD you have to use int 0x80. This is described in the FreeBSD System Calls Convention. Parameters are pushed on the stack right to left, and you must allocate 4 extra bytes on the stack by pushing any 4-byte value after you push the last parameter.
Your edited code has a few bugs. First, you push the extra 4 bytes before the rest of the arguments; you need to push them after. You also need to adjust the stack after int 0x80 to reclaim the stack space used by the arguments. You pushed three 4-byte values on the stack, so you need to add 12 to ESP after int 0x80.
You also need a memory clobber because the compiler doesn't know you have modified memory at all. With the constraints as written, the data that usage points to gets modified, but the compiler is never told so.
The return value of int 0x80 will be in EAX, but you use the constraint =r. It should have been =a. Since using =a tells the compiler EAX is modified, you don't need to list it as a clobber anymore.
The modified code could have looked like:
int getrusage(int who, struct rusage *usage){
int errorcode;
__asm__("push %2;"
"push %1;"
"push $0;"
"movl $117, %%eax;"
"int $0x80;"
"add $12, %%esp"
:"=a"(errorcode)
:"D"(who),"S"(usage)
:"memory"
);
if (errorcode<0) {
printf("error");
}
return errorcode;
}
Another way one could have written this with more advanced techniques is:
int getrusage(int who, struct rusage *usage){
int errorcode;
__asm__("push %[usage]\n\t"
"push %[who]\n\t"
"push %%eax\n\t"
"int $0x80\n\t"
"add $12, %%esp"
:"=a"(errorcode), "=m"(*usage)
:"0"(117), [who]"ri"(who), [usage]"r"(usage)
:"cc" /* Don't need this with x86 inline asm but use for clarity */
);
if (errorcode<0) {
printf("error");
}
return errorcode;
}
This uses symbolic names (usage and who) to identify each operand rather than numerical positions like %3, %4 etc. This makes the inline assembly easier to follow. Since any 4-byte value can be pushed onto the stack just before int 0x80, we can save a few bytes by simply pushing the contents of any register. In this case I used %%eax. This uses the =m constraint like I did in the 64-bit example.
More information on extended inline assembler can be found in the GCC documentation.

GCC INLINE ASSEMBLY Won't Let Me Overwrite $esp

I'm writing code to temporarily use my own stack for experimentation. This worked when I used literal inline assembly. I was hardcoding the variable locations as offsets off of ebp. However, I wanted my code to work without having to hard code memory addresses into it, so I've been looking into GCC's EXTENDED INLINE ASSEMBLY. What I have is the following:
volatile intptr_t new_stack_ptr = (intptr_t) MY_STACK_POINTER;
volatile intptr_t old_stack_ptr = 0;
asm __volatile__("movl %%esp, %0\n\t"
"movl %1, %%esp"
: "=r"(old_stack_ptr) /* output */
: "r"(new_stack_ptr) /* input */
);
The point of this is to first save the stack pointer into the variable old_stack_ptr. Next, the stack pointer (%esp) is overwritten with the address I have saved in new_stack_ptr.
Despite this, I found that GCC was saving the %esp into old_stack_ptr, but was NOT replacing %esp with new_stack_ptr. Upon deeper inspection, I found it actually expanded my assembly and added its own instructions, which are the following:
mov -0x14(%ebp),%eax
mov %esp,%eax
mov %eax,%esp
mov %eax,-0x18(%ebp)
I think GCC is trying to preserve the %esp, because I don't have it explicitly declared as an "output" operand... I could be totally wrong with this...
I really wanted to use extended inline assembly to do this, because if not, it seems like I have to "hard code" the location offsets off of %ebp into the assembly, and I'd rather use the variable names like this... especially because this code needs to work on a few different systems, which seem to all offset my variables differently, so using extended inline assembly allows me to explicitly say the variable location... but I don't understand why it is doing the extra stuff and not letting me overwrite the stack pointer like it was before, ever since I started using extended assembly, it's been doing this.
I appreciate any help!!!

Okay so the problem is gcc is allocating input and output to the same register eax. You want to tell gcc that you are clobbering the output before using the input, aka. "earlyclobber".
asm __volatile__("movl %%esp, %0\n\t"
"movl %1, %%esp"
: "=&r"(old_stack_ptr) /* output */
: "r"(new_stack_ptr) /* input */
);
Notice the & sign for the output. This should fix your code.
Update: alternatively, you could force input and output to be the same register and use xchg, like so:
asm __volatile__("xchg %%esp, %0\n\t"
: "=r"(old_stack_ptr) /* output */
: "0"(new_stack_ptr) /* input */
);
Notice the "0" that says "put this into the same register as argument 0".
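The earlyclobber rule can be tried out without touching the stack pointer. The sketch below is a made-up x86 example (add_two_step is a hypothetical name): the output is written by the first instruction while input b is still needed by the second, so without the & the compiler would be allowed to give out and b the same register.

```c
#include <assert.h>

/* Hypothetical example (x86 only): computes a + b in two steps. */
static long add_two_step(long a, long b)
{
    long out;
    __asm__("mov %1, %0\n\t"   /* out = a   (the output is written here) */
            "add %2, %0"       /* out += b  (input b is read afterwards) */
            : "=&r"(out)       /* & = earlyclobber: never share a register with an input */
            : "r"(a), "r"(b)
            : "cc");           /* add modifies the flags */
    return out;
}

/* Self-check. */
static void add_two_step_demo(void)
{
    assert(add_two_step(2, 40) == 42);
}
```

Dropping the & may still appear to work at some optimization levels, which is exactly why this kind of bug tends to show up only later.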

why addresses of elements in the stack are reversed in ubuntu64?

I wrote a simple program to print out the addresses of the elements in the stack
#include <stdio.h>
#include <stdlib.h>
void f(int i,int j,int k)
{
int *pi = (int*)malloc(sizeof(int));
int a =20;
printf("%p,%p,%p,%p,%p\n",&i,&j,&k,&a,pi);
}
int main()
{
f(1,2,3);
return 0;
}
output:(in ubuntu64, unexpected)
0x7fff4e3ca5dc,0x7fff4e3ca5d8,0x7fff4e3ca5d4,0x7fff4e3ca5e4,0x2052010
output:(in ubuntu32 , as expected)
0xbf9525f0,0xbf9525f4,0xbf9525f8,0xbf9525d8,0x931f008
environment for ubuntu64:
$uname -a
Linux 3.8.0-26-generic #38-Ubuntu SMP Mon Jun 17 21:43:33 UTC 2013 x86_64 x86_64 x86_64 GNU/Linux
$gcc -v
Target: x86_64-linux-gnu
gcc version 4.8.1 (Ubuntu 4.8.1-2ubuntu1~13.04)
According to the diagram above, the earlier an element is pushed onto the stack, the higher its address,
and with the cdecl calling convention the rightmost parameter is pushed onto the stack first.
Local variables should be pushed onto the stack after the parameters.
But on ubuntu64 the output is the reverse of what I expected:
the address of k is :0x7fff4e3ca5d4 //<---should have been pushed to the stack first
the address of j is :0x7fff4e3ca5d8
the address of i is :0x7fff4e3ca5dc
the address of a is :0x7fff4e3ca5e4 //<---should have been pushed to the stack after i,j,k
Any ideas about it?

Even though a clear ABI has been defined for both architectures, compilers are free to deviate from the textbook stack layout wherever the ABI does not constrain them. The reason is usually performance: passing variables on the stack is slower than using registers, since the program has to access memory to retrieve them. Another example of this habit is how compilers use the EBP/RBP register. EBP/RBP is meant to hold the frame pointer, that is, the base address of the current stack frame, which makes local variables easy to address. However, the frame-pointer register is often used as a general-purpose register to improve performance. This avoids the instructions to save, set up and restore frame pointers; it also makes an extra register available in many functions, which is particularly important on x86_32, where programs are usually short of registers. The main drawback is that it makes debugging impossible on some machines. For more info, check the -fomit-frame-pointer option of gcc.
The calling conventions of x86_32 and x86_64 are rather different. The most relevant difference is that x86_64 passes function arguments in registers where possible, and uses the stack only if no register is available or the argument is an aggregate too large to be passed in registers (more than 16 bytes in most cases).
We start from the x86_32 ABI, I have slightly changed your example :
#include <stdio.h>
#include <stddef.h>
#include <stdint.h>
#if defined(__i386__)
#define STACK_POINTER "ESP"
#define FRAME_POINTER "EBP"
#elif defined(__x86_64__)
#define STACK_POINTER "RSP"
#define FRAME_POINTER "RBP"
#else
#error Architecture not supported yet!!
#endif
void foo(int i,int j,int k)
{
int a =20;
uint64_t stack=0, frame_pointer=0;
// Retrieve stack
asm volatile(
#if defined (__i386__)
"mov %%esp, %0\n"
"mov %%ebp, %1\n"
#else
"mov %%rsp, %0\n"
"mov %%rbp, %1\n"
#endif
: "=m"(stack), "=m"(frame_pointer)
:
: "memory");
// retrieve paramters x86_64
#if defined (__x86_64__)
int i_reg=-1, j_reg=-1, k_reg=-1;
asm volatile ( "mov %%edi, %0\n" // int arguments occupy the low 32 bits: edi, esi, edx
"mov %%esi, %1\n"
"mov %%edx, %2\n"
: "=m"(i_reg), "=m"(j_reg), "=m"(k_reg)
:
: "memory");
#endif
printf("%s=%p %s=%p\n", STACK_POINTER, (void*)stack, FRAME_POINTER, (void*)frame_pointer);
printf("%d, %d, %d\n", i, j, k);
printf("%p\n%p\n%p\n%p\n",&i,&j,&k,&a);
#if defined (__i386__)
// Calling convention c
// EBP --> Saved EBP
char * EBP=(char*)frame_pointer;
printf("Function return address : 0x%x \n", *(unsigned int*)(EBP +4));
printf("- i=%d &i=%p \n",*(int*)(EBP+8) , EBP+8 );
printf("- j=%d &j=%p \n",*(int*)(EBP+ 12), EBP+12);
printf("- k=%d &k=%p \n",*(int*)(EBP+ 16), EBP+16);
#else
printf("- i=%d &i=%p \n",i_reg, &i );
printf("- j=%d &j=%p \n",j_reg, &j );
printf("- k=%d &k=%p \n",k_reg ,&k );
#endif
}
int main()
{
foo(1,2,3);
return 0;
}
The ESP register is being used by foo to point to the top of the stack. The EBP register is acting as a "base pointer". All arguments have been pushed onto the stack in reverse order. The arguments passed by main to foo and the local variables in foo can all be referenced as an offset from the base pointer. After calling foo, EBP points to the saved EBP, EBP+4 holds the return address, and EBP+8, EBP+12 and EBP+16 hold i, j and k.
Assuming that the compiler is using the frame pointer, we can access the function arguments by adding multiples of 4 bytes to the EBP register. Note that the first argument is located at offset 8, because the call instruction pushes the return address onto the stack and foo's prologue then pushes the saved EBP.
printf("Function return address : 0x%x \n", *(unsigned int*)(EBP +4));
printf("- i=%d &i=%p \n",*(int*)(EBP+8) , EBP+8 );
printf("- j=%d &j=%p \n",*(int*)(EBP+ 12), EBP+12);
printf("- k=%d &k=%p \n",*(int*)(EBP+ 16), EBP+16);
This is more or less how arguments are passed to a function in x86_32.
In x86_64 there are more registers available, it makes sense to use them to pass the parameter of a function. The x86_64 ABI can be found here : http://www.uclibc.org/docs/psABI-x86_64.pdf. The calling convention starts at page 14.
First the parameters are divided into classes. The class of each parameter determines the manner in which it is passed to the called function. Some of the most relevant are :
INTEGER This class consists of integral types that fit into one of the general purpose registers. For example (int, long, bool)
SSE The class consists of types that fits into a SSE register. (float, double)
SSEUP The class consists of types that fit into a SSE register and can be passed and returned in the most significant half of it. ( float_128, __m128,__m256)
NO_CLASS This class is used as initializer in the algorithms. It will be used for padding and empty structures and unions.
MEMORY This class consists of types that will be passed and returned in memory via the stack ( structure types)
Once a parameter is assigned to a class, it is passed to the function according to these rules:
MEMORY, pass the argument on the stack.
INTEGER, the next available register of the sequence %rdi, %rsi, %rdx, %rcx, %r8 and %r9 is used.
SSE, the next available SSE register is used, the registers are taken in the order from %xmm0 to %xmm7.
SSEUP, the eightbyte is passed in the upper half of the last used SSE register.
If there are no registers available for any eightbyte of an argument, the whole argument is passed on the stack. If registers have already been assigned for some eightbytes of such an argument, the assignments get reverted. Once registers are assigned, the arguments passed in memory are pushed on the stack in reversed order.
Since you are passing int variables, the arguments will be inserted into the general purpose registers.
%rdi --> i
%rsi --> j
%rdx --> k
So you can retrieve them with the following code:
#if defined (__x86_64__)
int i_reg=-1, j_reg=-1, k_reg=-1;
asm volatile ( "mov %%edi, %0\n" // int arguments occupy the low 32 bits: edi, esi, edx
"mov %%esi, %1\n"
"mov %%edx, %2\n"
: "=m"(i_reg), "=m"(j_reg), "=m"(k_reg)
:
: "memory");
#endif
I hope I have been clear.
In conclusion,
why addresses of elements in the stack are reversed in ubuntu64?
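Because the caller does not store them on the stack: they arrive in registers. The addresses you printed are those of copies that f itself spilled into its own stack frame, because taking the address of a parameter forces it into memory, and the compiler is free to place those copies in any order.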

There is absolutely no restriction on how arguments are passed to a function, nor where they go on the stack (or in a register, or in shared memory for that matter). It is up to the compiler to instrument passing the variables in such a manner that the caller and callee agree upon. Unless you force a specific calling convention (for linking code that was compiled with different compilers), or unless there is a hardware dictated ABI - there is no guarantee.

Swap with push / assignment / pop in GNU C inline assembly?

I was reading some answers and questions on here and kept coming up with this suggestion but I noticed no one ever actually explained "exactly" what you need to do to do it, On Windows using Intel and GCC compiler. Commented below is exactly what I am trying to do.
#include <stdio.h>
int main()
{
int x = 1;
int y = 2;
//assembly code begin
/*
push x into stack; < Need Help
x=y; < With This
pop stack into y; < Please
*/
//assembly code end
printf("x=%d,y=%d",x,y);
getchar();
return 0;
}

You can't safely push/pop from inline asm if the code is going to be portable to systems with a red zone. That includes every non-Windows x86-64 platform. (There's no way to tell gcc you want to clobber it.) Well, you could add rsp, -128 first to skip past the red zone before pushing/popping anything, then restore it later. But then you can't use "m" constraints, because the compiler might use RSP-relative addressing with offsets that assume RSP hasn't been modified.
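For illustration only, the red-zone workaround described above might be sketched as follows. This assumes x86-64 and GNU C; swap_via_stack is a hypothetical name, and only register operands are used so that nothing is addressed relative to the temporarily moved RSP.

```c
#include <assert.h>

/* Sketch for x86-64 only; swap_via_stack is a hypothetical name. */
static void swap_via_stack(long *x, long *y)
{
    long a = *x, b = *y;
    __asm__ volatile(
        "add $-128, %%rsp\n\t"   /* step over the 128-byte red zone */
        "push %[a]\n\t"          /* push x onto the stack */
        "mov %[b], %[a]\n\t"     /* x = y */
        "pop %[b]\n\t"           /* pop the old x into y */
        "sub $-128, %%rsp"       /* put the stack pointer back */
        : [a] "+r"(a), [b] "+r"(b)   /* register operands only, no "m" */
        :
        : "cc", "memory");
    *x = a;
    *y = b;
}

/* Self-check. */
static void swap_via_stack_demo(void)
{
    long x = 1, y = 2;
    swap_via_stack(&x, &y);
    assert(x == 2 && y == 1);
}
```

The add/sub pair uses -128 in both places because that value still fits in a sign-extended 8-bit immediate.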
But really this is a ridiculous thing to be doing in inline asm.
Here's how you use inline-asm to swap two C variables:
#include <stdio.h>
int main()
{
int x = 1;
int y = 2;
asm("" // no actual instructions.
: "=r"(y), "=r"(x) // request both outputs in the compiler's choice of register
: "0"(x), "1"(y) // matching constraints: request each input in the same register as the other output
);
// apparently "=m" doesn't compile: you can't use a matching constraint on a memory operand
printf("x=%d,y=%d\n",x,y);
// getchar(); // Set up your terminal not to close after the program exits if you want similar behaviour: don't embed it into your programs
return 0;
}
gcc -O3 output (targeting the x86-64 System V ABI, not Windows) from the Godbolt compiler explorer:
.section .rodata
.LC0:
.string "x=%d,y=%d"
.section .text
main:
sub rsp, 8
mov edi, OFFSET FLAT:.LC0
xor eax, eax
mov edx, 1
mov esi, 2
#APP
# 8 "/tmp/gcc-explorer-compiler116814-16347-5i3lz1/example.cpp" 1
# I used "\n" instead of just "" so we could see exactly where our inline-asm code ended up.
# 0 "" 2
#NO_APP
call printf
xor eax, eax
add rsp, 8
ret
C variables are a high level concept; it doesn't cost anything to decide that the same registers now logically hold different named variables, instead of swapping the register contents without changing the varname->register mapping.
When hand-writing asm, use comments to keep track of the current logical meaning of different registers, or parts of a vector register.
The inline-asm didn't lead to any extra instructions outside the inline-asm block either, so it's perfectly efficient in this case. Still, the compiler can't see through it, and doesn't know that the values are still 1 and 2, so further constant-propagation would be defeated. https://gcc.gnu.org/wiki/DontUseInlineAsm

#include <stdio.h>
int main()
{
int x=1;
int y=2;
printf("x::%d,y::%d\n",x,y);
__asm__( "movl %1, %%eax;"
"movl %%eax, %0;"
:"=r"(y)
:"r"(x)
:"%eax"
);
printf("x::%d,y::%d\n",x,y);
return 0;
}
/* Load x to eax
Load eax to y */
This copies the value of x into y through EAX (it does not exchange the two values). Please note that listing EAX in the clobber list instructs GCC to take care of the clobbered register. For educational purposes this is okay, but I find it more suitable to leave micro-optimizations to the compiler.
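Note that the code above only moves x into y. If an actual exchange is wanted, one minimal sketch (x86 only, with the hypothetical name swap_xchg) lets the compiler pick both registers and uses a single xchg:

```c
#include <assert.h>

/* Hypothetical helper (x86 only): exchanges *x and *y with one instruction.
   "+r" marks each operand as both read and written, in a register of the
   compiler's choosing, so no clobber list is needed. */
static void swap_xchg(int *x, int *y)
{
    int a = *x, b = *y;
    __asm__("xchg %0, %1" : "+r"(a), "+r"(b));
    *x = a;
    *y = b;
}

/* Self-check. */
static void swap_xchg_demo(void)
{
    int x = 1, y = 2;
    swap_xchg(&x, &y);
    assert(x == 2 && y == 1);
}
```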

You can use extended inline assembly. It is a compiler feature which allows you to write assembly instructions within your C code. A good reference for inline gcc assembly is available here.
The following code swaps the values of x and y using push and pop instructions.
(compiled and tested using gcc on x86_64)
This is only safe if compiled with -mno-red-zone, or if you subtract 128 from RSP before pushing anything. It will happen to work without problems in some functions: testing with one set of surrounding code is not sufficient to verify the correctness of something you did with GNU C inline asm.
#include <stdio.h>
int main()
{
int x = 1;
int y = 2;
asm volatile (
"pushq %%rax\n" /* Push x into the stack */
"movq %%rbx, %%rax\n" /* Copy y into x */
"popq %%rbx\n" /* Pop x into y */
: "=b"(y), "=a"(x) /* OUTPUT values */
: "a"(x), "b"(y) /* INPUT values */
: /*No need for the clobber list, since the compiler knows
which registers have been modified */
);
printf("x=%d,y=%d",x,y);
getchar();
return 0;
}
Result x=2 y=1, as you expected.
The Intel compiler works in a similar way; I think you just have to change the keyword asm to __asm__. You can find info about inline assembly for the Intel compiler here.

at&t asm inline c++ problem

My Code
const int howmany = 5046;
char buffer[howmany];
asm("lea buffer,%esi"); //Get the address of buffer
asm("mov howmany,%ebx"); //Set the loop number
asm("buf_loop:"); //Lable for beginning of loop
asm("movb (%esi),%al"); //Copy buffer[x] to al
asm("inc %esi"); //Increment buffer address
asm("dec %ebx"); //Decrement loop count
asm("jnz buf_loop"); //jump to buf_loop if(ebx>0)
My Problem
I am using the gcc compiler. For some reason my buffer/howmany variables are undefined in the eyes of my asm. I'm not sure why. I just want to move the beginning address of my buffer array into the esi register, loop it 'howmany' times while copying each element to the al register.

Are you using the inline assembler in gcc? (If not, in what other C++ compiler, exactly?)
If gcc, see the details here, and in particular this example:
asm ("leal (%1,%1,4), %0"
: "=r" (five_times_x)
: "r" (x)
);
%0 and %1 are referring to the C-level variables, and they're listed specifically as the second (for outputs) and third (for inputs) parameters to asm. In your example you have only "inputs" so you'd have an empty second operand (traditionally one uses a comment after that colon, such as /* no output registers */, to indicate that more explicitly).
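To make the operand numbering concrete, here is the same idea wrapped in a complete function. This is a sketch for x86 only; the function name five_times and the use of long (so the register width matches the address size on both 32-bit and 64-bit targets) are choices made for this illustration.

```c
#include <assert.h>

/* Hypothetical wrapper (x86 only) around the lea example. */
static long five_times(long x)
{
    long five_times_x;
    __asm__("lea (%1,%1,4), %0"    /* %0 = %1 + %1*4 */
            : "=r"(five_times_x)   /* second parameter: outputs, here %0 */
            : "r"(x));             /* third parameter: inputs, here %1 */
    return five_times_x;
}

/* Self-check. */
static void five_times_demo(void)
{
    assert(five_times(5) == 25);
}
```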

The part that declares an array like that
int howmany = 5046;
char buffer[howmany];
is not valid C++. In C++ it is impossible to declare an array that has "variable" or run-time size. In C++ array declarations the size is always a compile-time constant.
If your compiler allows this array declaration, it means that it implements it as an extension. In that case you have to do your own research to figure out how it implements such a run-time sized array internally. I would guess that internally buffer will be implemented as a pointer, not as a true array. If my guess is correct and it is really a pointer, then the proper way to load the address of the array into esi might be
mov buffer,%esi
and not a lea, as in your code. lea will only work with "normal" compile-time sized arrays, but not with run-time sized arrays.
Another question is whether you really need a run-time sized array in your code. Could it be that you just made it so by mistake? If you simply change the howmany declaration to
const int howmany = 5046;
the array will turn into an "normal" C++ array and your code might start working as is (i.e. with lea).

All of those asm instructions need to be in the same asm statement if you want to be sure they're contiguous (without compiler-generated code between them), and you need to declare input / output / clobber operands or you will step on the compiler's registers.
You can't use lea or mov to/from a C variable name (except for global / static symbols which are actually defined in the compiler's asm output, but even then you usually shouldn't).
Instead of using mov instructions to set up inputs, ask the compiler to do it for you using input operand constraints. If the first or last instruction of a GNU C inline asm statement is a mov, usually that means you're doing it wrong and writing inefficient code.
And BTW, GNU C++ allows C99-style variable-length arrays, so howmany is allowed to be non-const and even set in a way that doesn't optimize away to a constant. Any compiler that can compile GNU-style inline asm will also support variable-length arrays.
How to write your loop properly
If this looks over-complicated, then https://gcc.gnu.org/wiki/DontUseInlineAsm. Write a stand-alone function in asm so you can just learn asm instead of also having to learn about gcc and its complex but powerful inline-asm interface. You basically have to know asm and understand compilers to use it correctly (with the right constraints to prevent breakage when optimization is enabled).
Note the use of named operands like %[ptr] instead of %2 or %%ebx. Letting the compiler choose which registers to use is normally a good thing, but for x86 there are letters other than "r" you can use, like "=a" for rax/eax/ax/al specifically. See https://gcc.gnu.org/onlinedocs/gcc/Extended-Asm.html, and also other links in the inline-assembly tag wiki.
I also used buf_loop%=: to append a unique number to the label, so if the optimizer clones the function or inlines it multiple places, the file will still assemble.
Source + compiler asm output on the Godbolt compiler explorer.
void ext(char *);
int foo(void)
{
int howmany = 5046; // could be a function arg
char buffer[howmany];
//ext(buffer);
const char *bufptr = buffer; // copy the pointer to a C var we can use as a read-write operand
unsigned char result;
asm("buf_loop%=: \n\t" // do {
" movb (%[ptr]), %%al \n\t" // Copy buffer[x] to al
" inc %[ptr] \n\t"
" dec %[count] \n\t"
" jnz buf_loop%= \n\t" // } while(--count != 0)
: [res]"=a"(result) // al = write-only output
, [count] "+r" (howmany) // input/output operand, any register
, [ptr] "+r" (bufptr)
: // no input-only operands
: "memory" // we read memory that isn't an input operand, only pointed to by inputs
);
return result;
}
I used %%al as an example of how to write register names explicitly: Extended Asm (with operands) needs a double % to get a literal % in the asm output. You could also use %[res] or %0 and let the compiler substitute %al in its asm output. (And then you'd have no reason to use a specific-register constraint unless you wanted to take advantage of cbw or lodsb or something like that.) result is unsigned char, so the compiler will pick a byte register for it. If you want the low byte of a wider operand, you could use %b[count] for example.
This uses a "memory" clobber, which is inefficient. You don't need the compiler to spill everything to memory, only to make sure that the contents of buffer[] in memory matches the C abstract machine state. (This is not guaranteed by passing a pointer to it in a register).
gcc7.2 -O3 output:
pushq %rbp
movl $5046, %edx
movq %rsp, %rbp
subq $5056, %rsp
movq %rsp, %rcx # compiler-emitted to satisfy our "+r" constraint for bufptr
# start of the inline-asm block
buf_loop18:
movb (%rcx), %al
inc %rcx
dec %edx
jnz buf_loop18
# end of the inline-asm block
movzbl %al, %eax
leave
ret
Without a memory clobber or input constraint, leave appears before the inline asm block, releasing that stack memory before the inline asm uses the now-stale pointer. A signal-handler running at the wrong time would clobber it.
A more efficient way is to use a dummy memory operand which tells the compiler that the entire array is a read-only memory input to the asm statement. See get string length in inline GNU Assembler for more about this flexible-array-member trick for telling the compiler you read all of an array without specifying the length explicitly.
In C you can define a new type inside a cast, but you can't in C++, hence the using instead of a really complicated input operand.
int bar(unsigned howmany)
{
//int howmany = 5046;
char buffer[howmany];
//ext(buffer);
buffer[0] = 1;
buffer[100] = 100; // test whether we got the input constraints right
//using input_t = const struct {char a[howmany];}; // requires a constant size
using flexarray_t = const struct {char a; char x[];};
const char *dummy;
unsigned char result;
asm("buf_loop%=: \n\t" // do {
" movb (%[ptr]), %%al \n\t" // Copy buffer[x] to al
" inc %[ptr] \n\t"
" dec %[count] \n\t"
" jnz buf_loop%= \n\t" // } while(--count != 0)
: [res]"=a"(result) // al = write-only output
, [count] "+r" (howmany) // input/output operand, any register
, "=r" (dummy) // output operand in the same register as buffer input, so we can modify the register
: [ptr] "2" (buffer) // matching constraint for the dummy output
, "m" (*(flexarray_t *) buffer) // whole buffer as an input operand
//, "m" (*buffer) // just the first element: doesn't stop the buffer[100]=100 store from sinking past the inline asm, even if you used asm volatile
: // no clobbers
);
buffer[100] = 101;
return result;
}
I also used a matching constraint so buffer could be an input directly, and the output operand in the same register means we can modify that register. We got the same effect in foo() by using const char *bufptr = buffer; and then using a read-write constraint to tell the compiler that the new value of that C variable is what we leave in the register. Either way we leave a value in a dead C variable that goes out of scope without being read, but the matching constraint way can be useful for macros where you don't want to modify the value of your input (and don't need the type of your input: int dummy would work fine, too.)
The buffer[100] = 100; and buffer[100] = 101; assignments are there to show that they both appear in the asm, instead of being merged across the inline-asm (which does happen if you leave out the "m" input operand). IDK why the buffer[100] = 101; isn't optimized away; it's dead so it should be. Also note that asm volatile doesn't block this reordering, so it's not an alternative to a "memory" clobber or using the right constraints.
