I recently dabbled in low-level programming and want to write a function somesyscall that accepts (CType rax, CType rbx, CType rcx, CType rdx). The struct CType looks like this:
/*
TYPES:
0 int
1 string
2 bool
*/
typedef struct {
void* val;
int typev;
} CType;
the function is a bit messy, but in theory should work:
#include <errno.h>
#include <stdbool.h>
#include "ctypes.h"
//define functions to set registers
#define seteax(val) asm("mov %0, %%rax" :: "g" (val) : "%rax")
#define setebx(val) asm("mov %0, %%rbx" :: "g" (val) : "%rbx")
#define setecx(val) asm("mov %0, %%rcx" :: "g" (val) : "%rcx")
#define setedx(val) asm("mov %0, %%rdx" :: "g" (val) : "%rdx")
///////////////////////////////////
#define setregister(value, register) \
switch (value.typev) { \
case 0: { \
register(*((double*)value.val)); \
break; \
} \
case 1: { \
register(*((char**)value.val)); \
break; \
} \
case 2: { \
register(*((bool*)value.val)); \
break; \
} \
}
static inline long int somesyscall(CType a0, CType a1, CType a2, CType a3) {
//set the registers
setregister(a0, seteax);
setregister(a1, setebx);
setregister(a2, setecx);
setregister(a3, setedx);
///////////////////
asm("int $0x80"); //interrupt
//fetch back the rax
long int raxret;
asm("mov %%rax, %0" : "=r" (raxret));
return raxret;
}
when I run with:
#include "syscall_unix.h"
int main() {
CType rax;
rax.val = 39;
rax.typev = 0;
CType rbx;
rbx.val = 0;
rbx.typev = 0;
CType rcx;
rcx.val = 0;
rcx.typev = 0;
CType rdx;
rdx.val = 0;
rdx.typev = 0;
printf("%ld", somesyscall(rax, rbx, rcx, rdx));
}
and compile (and run binary) with
clang test.c
./a.out
I get a segfault. However, everything seems to look correct. Am I doing anything wrong here?
After macro expansion you will have something like
long int raxret;
asm("mov %0, %%rax" :: "g" (a0) : "%rax");
asm("mov %0, %%rbx" :: "g" (a1) : "%rbx");
asm("mov %0, %%rcx" :: "g" (a2) : "%rcx");
asm("mov %0, %%rdx" :: "g" (a3) : "%rdx");
asm("int $0x80");
asm("mov %%rax, %0" : "=r" (raxret));
This doesn't work because you haven't told the compiler that it's not allowed to reuse rax, rbx, rcx, and rdx for something else during the sequence of asm statements. For instance, the register allocator might decide to copy a2 from the stack to rax and then use rax as the input operand for the mov %0, %%rcx instruction -- clobbering the value you put in rax.
(asm statements with no outputs are implicitly volatile, so the first five can't be reordered relative to each other, but the final one can move anywhere. For example, it can be moved after later code, to wherever the compiler finds it convenient to generate raxret in a register of its choice. RAX might no longer hold the system call's return value at that point. You need to tell the compiler that the output comes from the asm statement that actually produces it, without assuming any registers survive between asm statements.)
There are two different ways to tell the compiler not to do that:
Put only the int instruction in an asm, and express all of the requirements for what goes in what register with constraint letters:
asm volatile ("int $0x80"
: "=a" (raxret) // outputs
: "a" (a0), "b" (a1), "c" (a2), "d" (a3) // pure inputs
: "memory", "r8", "r9", "r10", "r11" // clobbers
// 32-bit int 0x80 system calls in 64-bit code zero R8..R11
// for native "syscall", clobber "rcx", "r11".
);
This is possible for this simple example but not always possible in general, because there aren't constraint letters for every single register, especially not on CPUs other than x86.
// Remove the r8..r11 clobbers when building for 32-bit mode.
// In 64-bit code, prefer the native 64-bit syscall ABI.
Put only the int instruction in an asm, and express the requirements for what goes in what register with explicit register variables:
register long rax asm("rax") = a0;
register long rbx asm("rbx") = a1;
register long rcx asm("rcx") = a2;
register long rdx asm("rdx") = a3;
// Note that int $0x80 only looks at the low 32 bits of input regs
// so `uint32_t` would be more appropriate than long
// but really you should just use "syscall" in 64-bit code.
asm volatile ("int $0x80"
: "+r" (rax) // read-write: in=call num, out=retval
: "r" (rbx), "r" (rcx), "r" (rdx) // read-only inputs
: "memory", "r8", "r9", "r10", "r11"
);
return rax;
This will work regardless of which registers you need to use. It's also probably more compatible with the macros you're trying to use to erase types.
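On the type-erasure side, here is a rough sketch of one way to make the macros unnecessary: flatten each CType to a plain long first, then initialise the register variables from those longs. It assumes val points at storage of the tagged type. (The test program in the question stores 39 directly in val, so the dereference in setregister reads from address 39, and case 0 reads it as a double.) ctype_to_long is a hypothetical helper, not something from the original code.

```c
#include <stdbool.h>

/* Same layout as the CType in the question: 0 = int, 1 = string, 2 = bool. */
typedef struct {
    void *val;   /* assumed to point at the value, not to hold it */
    int typev;
} CType;

/* Hypothetical helper: flatten a CType into the integer a register would hold. */
static long ctype_to_long(CType v) {
    switch (v.typev) {
    case 0:  return *(const int *)v.val;          /* int: the value itself */
    case 1:  return (long)*(char *const *)v.val;  /* string: the pointer, as an integer */
    case 2:  return *(const bool *)v.val;         /* bool: 0 or 1 */
    default: return 0;                            /* unknown tag */
    }
}
```

With that in place, something like register long rbx asm("rbx") = ctype_to_long(a1); would feed the asm statement above without any per-type macro.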
Incidentally, if this is 64-bit x86/Linux then you should be using syscall rather than int $0x80, and the arguments belong in rdi, rsi, rdx, r10, r8, r9, in that order, not in rbx, rcx, rdx etc. That is the ABI-standard sequence of incoming-argument registers, except that r10 takes the place of rcx because the syscall instruction itself overwrites rcx. The system call number still goes in rax, though. (Use call numbers from #include <asm/unistd.h> or <sys/syscall.h>, which will be appropriate for the native ABI of the mode you're compiling for; that is another reason not to use int $0x80 in 64-bit mode.)
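To make that concrete, here is a minimal sketch of a six-argument wrapper for the native syscall instruction. It combines both techniques from above: constraint letters where they exist, and explicit register variables for r10, r8 and r9, which have none. The name raw_syscall6 is made up for this example, and the #else branch (which just defers to libc) is only there so the sketch also builds on other targets.

```c
#define _GNU_SOURCE
#include <sys/mman.h>      /* PROT_*, MAP_* for the usage example */
#include <sys/syscall.h>   /* SYS_* call numbers for the native ABI */
#include <unistd.h>        /* syscall(), used only by the fallback */

/* Hypothetical six-argument wrapper around the native syscall instruction. */
static inline long raw_syscall6(long nr, long a1, long a2, long a3,
                                long a4, long a5, long a6) {
#if defined(__x86_64__) && defined(__linux__)
    long ret;
    /* r10, r8 and r9 have no constraint letters, so pin them by name. */
    register long r10 __asm__("r10") = a4;
    register long r8  __asm__("r8")  = a5;
    register long r9  __asm__("r9")  = a6;
    __asm__ volatile ("syscall"
        : "=a" (ret)                            /* return value comes back in rax */
        : "a" (nr), "D" (a1), "S" (a2), "d" (a3),
          "r" (r10), "r" (r8), "r" (r9)
        : "rcx", "r11", "memory");              /* syscall overwrites rcx and r11 */
    return ret;   /* values in [-4095, -1] are -errno */
#else
    /* Other targets: defer to libc (returns -1 and sets errno on error). */
    return syscall(nr, a1, a2, a3, a4, a5, a6);
#endif
}
```

For example, raw_syscall6(SYS_mmap, 0, 4096, PROT_READ | PROT_WRITE, MAP_PRIVATE | MAP_ANONYMOUS, -1, 0) should return the address of a fresh anonymous page, which exercises all six argument registers.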
Also, the asm statement for the system-call instruction should have a "memory" clobber and be declared volatile; almost all system calls access memory somehow.
(As a micro-optimization, I suppose you could have a list of system calls that don't read memory, write memory, or modify the virtual address space, and avoid the memory clobber for them. It would be a pretty short list and I'm not sure it would be worth the trouble. Or use the syntax shown in How can I indicate that the memory *pointed* to by an inline ASM argument may be used? to tell GCC which memory might be read or written, instead of a "memory" clobber, if you write wrappers for specific syscalls.
Some of the no-pointer cases include getpid where it would be a lot faster to call into the VDSO to avoid a round trip to kernel mode and back, like glibc does for the appropriate syscalls. That also applies to clock_gettime which does take pointers.)
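As a sketch of the second option, this is roughly what a write-only wrapper could look like when it names the memory it reads through a dummy "m" input instead of using a "memory" clobber. The name sys_write is hypothetical, the call number 1 is the x86-64 one, and the array-cast trick assumes GCC or Clang compiling C (it relies on a variable-length array type, so count must not be 0).

```c
#define _GNU_SOURCE
#include <stddef.h>
#include <unistd.h>

/* Hypothetical write(2) wrapper that tells the compiler exactly which
   memory the system call reads, instead of clobbering all of memory. */
static inline long sys_write(int fd, const void *buf, size_t count) {
#if defined(__x86_64__) && defined(__linux__)
    long ret;
    __asm__ volatile ("syscall"
        : "=a" (ret)
        : "a" (1L),                              /* write is call 1 on x86-64 */
          "D" ((long)fd), "S" (buf), "d" (count),
          "m" (*(const char (*)[count]) buf)     /* dummy input: the count bytes at buf */
        : "rcx", "r11");
    return ret;
#else
    return write(fd, buf, count);                /* fallback: libc wrapper */
#endif
}
```

The dummy operand is never referenced in the template; it only exists so the compiler knows those bytes must be up to date in memory before the instruction runs.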
Incidentally, beware of the actual kernel interfaces not matching up with the interfaces presented by the C library's wrappers. This is generally documented in the NOTES section of the man page, e.g. for brk(2) and getpriority(2)
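A small illustration of that mismatch, using brk: per the NOTES in brk(2), the raw Linux system call returns the program break (the new one on success, the current one on failure), whereas the C library's brk() returns 0 or -1. The sketch below goes through libc's generic syscall() function rather than inline asm, and current_break is a made-up name.

```c
#define _GNU_SOURCE
#include <sys/syscall.h>
#include <unistd.h>

/* Ask the kernel directly where the program break is.
   Requesting address 0 can never succeed, so the kernel leaves the break
   alone and reports its current value. */
static long current_break(void) {
    return syscall(SYS_brk, 0L);
}
```

If you wrapped this call yourself and treated the result like the return value of the library's brk(), every successful call would look like an error.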
I have a short snippet of code with some inline assembly that prints argv[0] properly at -O0 but prints nothing at -O2 when using Clang. (GCC, on the other hand, prints the string stored in envp[0] when asked to print argv[0].) The problem is restricted to argv; the other two function parameters can be used as expected with or without optimizations enabled. I tested this with both GCC and Clang, and both compilers have the issue.
Here is the code:
void exit(unsigned long long status) {
asm volatile("movq $60, %%rax;" //system call 60 is exit
"movq %0, %%rdi;" //return code 0
"syscall"
: //no outputs
:"r"(status)
:"rax", "rdi");
}
int open(const char *pathname, unsigned long long flags) {
asm volatile("movq $2, %%rax;" //system call 2 is open
"movq %0, %%rdi;"
"movq %1, %%rsi;"
"syscall"
: //no outputs
:"r"(pathname), "r"(flags)
:"rax", "rdi", "rsi");
return 1;
}
int write(unsigned long long fd, const void *buf, size_t count) {
asm volatile("movq $1, %%rax;" //system call 1 is write
"movq %0, %%rdi;"
"movq %1, %%rsi;"
"movq %2, %%rdx;"
"syscall"
: //no outputs
:"r"(fd), "r"(buf), "r"(count)
:"rax", "rdi", "rsi", "rdx");
return 1;
}
static void entry(unsigned long long argc, char** argv, char** envp);
/*https://www.systutorials.com/x86-64-calling-convention-by-gcc/: "The calling convention of the System V AMD64 ABI is followed on GNU/Linux. The registers RDI, RSI, RDX, RCX, R8, and R9 are used for integer and memory address arguments
and XMM0, XMM1, XMM2, XMM3, XMM4, XMM5, XMM6 and XMM7 are used for floating point arguments.
For system calls, R10 is used instead of RCX. Additional arguments are passed on the stack and the return value is stored in RAX."*/
//__attribute__((naked)) defines a pure-assembly function
__attribute__((naked)) void _start() {
asm volatile("xor %%rbp,%%rbp;" //http://dbp-consulting.com/tutorials/debugging/linuxProgramStartup.html: "%ebp,%ebp sets %ebp to zero. This is suggested by the ABI (Application Binary Interface specification), to mark the outermost frame."
"pop %%rdi;" //rdi: arg1: argc -- can be popped off the stack because it is copied onto register
"mov %%rsp, %%rsi;" //rsi: arg2: argv
"mov %%rdi, %%rdx;"
"shl $3, %%rdx;" //each argv pointer takes up 8 bytes (so multiply argc by 8)
"add $8, %%rdx;" //add size of null word at end of argv-pointer array (8 bytes)
"add %%rsp, %%rdx;" //rdx: arg3: envp
"andq $-16, %%rsp;" //align stack to 16 bytes (which is required on x86-64)
"jmp %P0" //https://stackoverflow.com/questions/3467180/direct-c-function-call-using-gccs-inline-assembly: "After looking at the GCC source code, it's not exactly clear what the code P in front of a constraint means. But, among other things, it prevents GCC from putting a $ in front of constant values. Which is exactly what I need in this case."
:
:"i"(entry)
:"rdi", "rsp", "rsi", "rdx", "rbp", "memory");
}
//Function cannot be optimized-away, since it is passed-in as an argument to asm-block above
//Compiler Options: -fno-asynchronous-unwind-tables;-O2;-Wall;-nostdlibinc;-nobuiltininc;-fno-builtin;-nostdlib; -nodefaultlibs;--no-standard-libraries;-nostartfiles;-nostdinc++
//Linker Options: -nostdlib; -nodefaultlibs
static void entry(unsigned long long argc, char** argv, char** envp) {
int ttyfd = open("/dev/tty", O_WRONLY);
write(ttyfd, argv[0], 9);
write(ttyfd, "\n", 1);
exit(0);
}
Edit: Added syscall definitions.
Edit: Adding rcx and r11 to the clobber list for the syscalls fixed the issue for Clang, but GCC continued to have the error.
Edit: GCC actually was not having an error, but some kind of strange error in my build system (CodeLite) made it so that the program ran some kind of partially-built program, even though GCC reported errors about it not recognizing two of the compiler flags passed-in.
For GCC, use these flags instead: -fomit-frame-pointer;-fno-asynchronous-unwind-tables;-O2;-Wall;-nostdinc;-fno-builtin;-nostdlib; -nodefaultlibs;--no-standard-libraries;-nostartfiles;-nostdinc++. You can also use these flags for Clang, due to Clang's support for the above GCC options.
You can't use extended asm in a naked function, only basic asm, according to the gcc manual. You don't need to inform the compiler of clobbered registers (since it won't do anything about them anyway; in a naked function you are responsible for all register management). And passing the address of entry in an extended operand is unnecessary; just do jmp entry.
(In my tests your code doesn't compile at all, so I assume you weren't showing us your exact code - next time please do, so as to avoid wasting people's time.)
Linux x86-64 syscall system calls are allowed to clobber the rcx and r11 registers, so you need to add those to the clobber lists of your system calls.
You align the stack to a 16-byte boundary before jumping to entry. However, the 16-byte alignment rule is based on the assumption that you will be calling the function with call, which would push an additional 8 bytes onto the stack. As such, the called function actually expects the stack to initially be, not a multiple of 16, but 8 more or less than a multiple of 16. So you are actually aligning the stack incorrectly, and this can be a cause of all sorts of mysterious trouble.
So either replace your jmp with call, or else subtract a further 8 bytes from rsp (or just push some 64-bit register of your choice).
Style note: unsigned long is already 64 bits on Linux x86-64, so it would be more idiomatic to use that in place of unsigned long long everywhere.
General hint: learn about register constraints in extended asm. You can have the compiler load your desired registers for you, instead of writing instructions in your asm to do it yourself. So your exit function could instead look like:
void exit(unsigned long status) {
asm volatile("syscall"
: //no outputs
:"a"(60), "D" (status)
:"rcx", "r11");
}
This in particular saves you a few instructions, since status is already in the %rdi register on function entry. With your original code, the compiler has to move it somewhere else so that you can then load it into %rdi yourself.
Your open function always returns 1, which will typically not be the fd that was actually opened. So if your program is run with standard output redirected, your program will write to the redirected stdout, instead of to the tty as it seems to want to do. Indeed, this makes the open syscall completely pointless, because you never use the file you opened.
You should arrange for open to return the value that was actually returned by the system call, which will be left in the %rax register when syscall returns. You can use an output operand to have this stored in a temporary variable (which the compiler will likely optimize out), and return that. You'll need to use a digit constraint since it is going in the same register as an input operand. I leave this as an exercise for you. It would likewise be nice if your write function actually returned the number of bytes written.
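For reference, one possible shape for those wrappers, as a sketch only. The names my_open and my_write are invented to avoid clashing with libc, the call numbers 2 and 1 are the x86-64 ones from the question, and the #else branch is just a libc fallback so the example also builds elsewhere. The "0" input is the digit (matching) constraint mentioned above: it puts the call number in the same register as output operand 0.

```c
#define _GNU_SOURCE
#include <fcntl.h>
#include <stddef.h>
#include <unistd.h>

/* Returns the new fd, or a negative errno value on failure. */
static inline long my_open(const char *pathname, long flags) {
#if defined(__x86_64__) && defined(__linux__)
    long ret;
    __asm__ volatile ("syscall"
        : "=a" (ret)                          /* result comes back in rax */
        : "0" (2L),                           /* call number, same register as operand 0 */
          "D" (pathname), "S" (flags),
          "d" (0L)                            /* mode, only used with O_CREAT */
        : "rcx", "r11", "memory");
    return ret;
#else
    return open(pathname, (int)flags);
#endif
}

/* Returns the number of bytes written, or a negative errno value. */
static inline long my_write(long fd, const void *buf, size_t count) {
#if defined(__x86_64__) && defined(__linux__)
    long ret;
    __asm__ volatile ("syscall"
        : "=a" (ret)
        : "0" (1L), "D" (fd), "S" (buf), "d" (count)
        : "rcx", "r11", "memory");
    return ret;
#else
    return write((int)fd, buf, count);
#endif
}
```

entry would then keep the value returned by my_open and pass it to my_write, instead of always writing to fd 1.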
I am trying to write a preemptive scheduler for AVR and therefore I need some assembler code ... but I have no experience with assembler. However, I wrote all the assembler code I think I need in some C macros. At compilation I get some assembler-related errors (constant value required and garbage at end of line), which makes me think that something is not correct in my macros ...
The following macro, load_SP(some_unsigned_char, some_unsigned_char), sets the stack pointer to a known memory location ... I am keeping the location in the global struct aux_SP;
Something similar is with load_PC(...) which is loading on the stack, a program counter: "func_pointer" which is actually, as the name suggest, a pointer to a function. I assume here that the program counter as well as the function pointer are represented on 2 bytes (because the flash is small enough)
For this I am using processor register R16. In order to leave this register untouched, I am saving its value first with the macro "save_R16(tempR)" and then restoring its value with the macro "load_R16(tempR)", where "tempR", as can be seen, is a global C variable.
This is simply written in a header file. This along with another two macros (not written here because of their size) "pushRegs()" and "popRegs()" which are basically pushing and then popping all processors registers is ALL my assembler code ...
What should I do to correct my macros?
// used to store the current StackPointer when creating a new task until it is restored at the
// end of createNewTask function.
struct auxSP
{
unsigned char auxSPH;
unsigned char auxSPL;
};
struct auxSP cSP = {0,0};
// used to restore processor register when using load_SP or load_PC macros to perform
// a Stack Pointer or Program Counter load.
unsigned char tempReg = 0;
////////////////////////////////////////////////////////////////////////////////////////////
//////////////////////////// assembler macros begin ////////////////////////////////////////
////////////////////////////////////////////////////////////////////////////////////////////
// save processor register R16
#define save_R16(tempR) \
asm volatile( \
"STS tempR, R16 \n\t" \
);
// load processor register R16
#define load_R16(tempR) \
asm volatile( \
"LDS R16, tempR \n\t" \
);
// load the Stack Pointer. Warning: Alters the processor registers
#define load_SP(new_SP_H, new_SP_L) \
asm volatile( \
"LDI R16, new_SP_H \n\t" \
"OUT SPH, R16 \n\t" \
"LDI R16, new_SP_L \n\t" \
"OUT SPL, R16 \n\t" \
);
// load the Program Counter on stack. Warning: Alters the processor registers
#define load_PC(func_pointer) \
asm volatile( \
"LDS r16, LOW(func_pointer) \n\t" \
"PUSH r16 \n\t" \
"LDS r16, HIGH(func_pointer) \n\t" \
"PUSH r16 \n\t" \
);
Your main source of reference for this should be http://www.nongnu.org/avr-libc/user-manual/inline_asm.html
Avoid using "unsigned char" - use "uint8_t", as it is shorter and more explicit. Avoid macros - use static inline functions instead whenever possible. Don't invent your own struct for auxSP, especially not using a different endian ordering than the target normally uses - just use uint16_t. Don't write things in assembly when you can write them in C. And don't split up asm statements that need to be combined together (such as preserving R16 in one statement, then using it in a second statement).
Where does that leave us?
It's a long time since I have done much AVR programming, but this might get you started:
static inline uint16_t read_SP(void) {
uint16_t sp;
asm volatile(
"in %A[sp], __SP_L__ \n\t"
"in %B[sp], __SP_H__ \n\t"
: [sp] "=r" (sp) :: );
return sp;
}
static inline void write_SP(uint16_t sp) {
asm volatile(
"out __SP_L__, %A[sp] \n\t"
"out __SP_H__, %B[sp] \n\t"
:: [sp] "r" (sp) : );
}
typedef void (*FVoid)(void);
static inline void load_PC(FVoid f) __attribute__((noreturn));
static inline void load_PC(FVoid f) {
asm volatile(
"ijmp"
:: "z" (f) );
__builtin_unreachable();
}
You will probably also want to make sure you disable interrupts before using any of these.
Here is an example of C code I did on an AVR platform. It is not macros but functions because it was more adapted for me.
void call_xxx(uint32_t address, uint16_t data)
{
asm volatile("push r15"); //has to be saved
asm volatile("ldi r18, 0x01"); //r15 will be copied to SPMCSR
asm volatile("mov r15, r18"); //copy r18 to r15 (cannot be done directly)
asm volatile("movw r0, r20"); //r1:r0 <= r21:r20 //should contain "data" parameter
asm volatile("movw r30, r22"); //r31:r30 <= r23:r22 //should contain "address" parameter ...
asm volatile("sts 0x5b, r24"); //RAMPZ
asm volatile("rcall .+0"); //push PC on top of stack and never pop it
asm volatile("jmp 0x3ecb7"); //secret function
asm volatile("eor r1, r1"); //null r1
asm volatile("pop r15"); //restore value
return;
}
Also try without your \n\t; this may be the cause of the "garbage at end of line" error.
The problem of constant value required may come from here :
#define save_R16(tempR) \
asm volatile( \
"STS tempR, R16 \n\t" \
);
For this I am less sure but STS (and other) requires an address that may need to be fixed at compile time. So depending on how you use the macro it may not compile. If tempR is not fixed, you may use functions instead of macro.
According to GCC's Extended ASM and Assembler Template, to keep instructions consecutive, they must be in the same ASM block. I'm having trouble understanding what provides the scheduling or timings of reads and writes to the operands in a block with multiple statements.
As an example, EBX or RBX needs to be preserved when using CPUID because, according to the ABI, the caller owns it. There are some open questions with respect to the use of EBX and RBX, so we want to preserve it unconditionally (it's a requirement). So three instructions need to be encoded into a single ASM block to ensure that the instructions stay consecutive (re: the assembler template discussed in the first paragraph):
unsigned int __FUNC = 1, __SUBFUNC = 0;
unsigned int __EAX, __EBX, __ECX, __EDX;
__asm__ __volatile__ (
"push %%ebx;"
"cpuid;"
"pop %%ebx"
: "=a"(__EAX), "=b"(__EBX), "=c"(__ECX), "=d"(__EDX)
: "a"(__FUNC), "c"(__SUBFUNC)
);
If the expression representing the operands is interpreted at the wrong point in time, then __EBX will be the saved EBX (and not the CPUID's EBX), which will likely be a pointer to the Global Offset Table (GOT) if PIC is enabled.
Where, exactly, does the expression specify that the store of CPUID's %EBX into __EBX should happen (1) after the PUSH %EBX; (2) after the CPUID; but (3) before the POP %EBX?
In your question you present some code that does a push and pop of ebx. The idea of saving ebx in the event that you compile with gcc using -fPIC (position independent code) is correct. It is up to our function not to clobber ebx upon return in that situation. Unfortunately the way you have defined the constraints you explicitly use ebx. Generally the compiler will warn you (error: inconsistent operand constraints in an 'asm') if you are using PIC code and you specify =b as an output constraint. Why it doesn't produce a warning for you is unusual.
To get around this problem you can let the assembler template choose a register for you. Instead of pushing and popping we simply exchange %ebx with an unused register chosen by the compiler and restore it by exchanging it back after. Since we don't wish to have the compiler clobber our input registers during the exchange we specify the early-clobber modifier, thus ending up with a constraint of =&r (instead of =b in the OP's code). More on modifiers can be found here. Your code (for 32 bit) would look something like:
unsigned int __FUNC = 1, __SUBFUNC = 0;
unsigned int __EAX, __EBX, __ECX, __EDX;
__asm__ __volatile__ (
"xchgl\t%%ebx, %k1\n\t" \
"cpuid\n\t" \
"xchgl\t%%ebx, %k1\n\t"
: "=a"(__EAX), "=&r"(__EBX), "=c"(__ECX), "=d"(__EDX)
: "a"(__FUNC), "c"(__SUBFUNC));
If you intend to compile for X86_64 (64 bit) you'll need to save the entire contents of %rbx. The code above will not quite work. You'd have to use something like:
uint32_t __FUNC = 1, __SUBFUNC = 0;
uint32_t __EAX, __ECX, __EDX;
uint64_t __BX; /* Big enough to hold a 64 bit value */
__asm__ __volatile__ (
"xchgq\t%%rbx, %q1\n\t" \
"cpuid\n\t" \
"xchgq\t%%rbx, %q1\n\t"
: "=a"(__EAX), "=&r"(__BX), "=c"(__ECX), "=d"(__EDX)
: "a"(__FUNC), "c"(__SUBFUNC));
You could code this up using conditional compilation to deal with both X86_64 and i386:
uint32_t __FUNC = 1, __SUBFUNC = 0;
uint32_t __EAX, __ECX, __EDX;
uint64_t __BX; /* Big enough to hold a 64 bit value */
#if defined(__i386__)
__asm__ __volatile__ (
"xchgl\t%%ebx, %k1\n\t" \
"cpuid\n\t" \
"xchgl\t%%ebx, %k1\n\t"
: "=a"(__EAX), "=&r"(__BX), "=c"(__ECX), "=d"(__EDX)
: "a"(__FUNC), "c"(__SUBFUNC));
#elif defined(__x86_64__)
__asm__ __volatile__ (
"xchgq\t%%rbx, %q1\n\t" \
"cpuid\n\t" \
"xchgq\t%%rbx, %q1\n\t"
: "=a"(__EAX), "=&r"(__BX), "=c"(__ECX), "=d"(__EDX)
: "a"(__FUNC), "c"(__SUBFUNC));
#else
#error "Unknown architecture."
#endif
GCC has a __cpuid macro defined in cpuid.h. It defined the macro so that it only saves the ebx and rbx register when required. You can find the GCC 4.8.1 macro definition here to get an idea of how they handle cpuid in cpuid.h.
The astute reader may ask the question - what stops the compiler from choosing ebx or rbx as the scratch register to use for the exchange. The compiler knows about ebx and rbx in the context of PIC, and will not allow it to be used as a scratch register. This is based on my personal observations over the years and reviewing the assembler (.s) files generated from C code. I can't say for certain how more ancient versions of gcc handled it so it could be a problem.
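If you'd rather not maintain the asm yourself, a thin wrapper over the cpuid.h helper is one option. This is a sketch assuming GCC or Clang; query_cpuid is a hypothetical name, and __get_cpuid is the function that header provides (it returns 0 when the requested leaf is not supported). The non-x86 branch exists only so the example compiles everywhere.

```c
#if defined(__i386__) || defined(__x86_64__)
#include <cpuid.h>   /* GCC/Clang header providing __get_cpuid */
#endif

/* Hypothetical helper: fills regs[0..3] with eax, ebx, ecx, edx for the
   given leaf. Returns 1 on success, 0 if the leaf is unsupported or the
   target is not x86. */
static int query_cpuid(unsigned leaf, unsigned regs[4]) {
#if defined(__i386__) || defined(__x86_64__)
    return __get_cpuid(leaf, &regs[0], &regs[1], &regs[2], &regs[3]);
#else
    (void)leaf;
    (void)regs;
    return 0;
#endif
}
```

The header takes care of the ebx/rbx handling for PIC builds, so callers just read the value out of regs[1].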
I think you understand, but to be clear, the "consecutive" rule means that this:
asm ("a");
asm ("b");
asm ("c");
... might get other instructions interposed, so if that's not desirable then it must be rewritten like this:
asm ("a\n"
"b\n"
"c");
... and now it will be inserted as a whole.
As for the cpuid snippet, we have two problems:
The cpuid instruction will overwrite ebx, and hence clobber the data that PIC code must keep there.
We want to extract the value that cpuid places in ebx while never returning to compiled code with the "wrong" ebx value.
One possible solution would be this:
unsigned int __FUNC = 1, __SUBFUNC = 0;
unsigned int __EAX, __EBX, __ECX, __EDX;
__asm__ __volatile__ (
"push %%ebx;"
"cpuid;"
"mov %%ebx, %%ecx;"
"pop %%ebx"
: "=c"(__EBX), "=a"(__EAX), "=d"(__EDX) // eax and edx are overwritten too; inputs can't also be listed as clobbers
: "a"(__FUNC), "c"(__SUBFUNC)
);
__asm__ __volatile__ (
"push %%ebx;"
"cpuid;"
"pop %%ebx"
: "=a"(__EAX), "=c"(__ECX), "=d"(__EDX)
: "a"(__FUNC), "c"(__SUBFUNC)
);
There's no need to mark ebx as clobbered as you're putting it back how you found it.
(I don't do much Intel programming, so I may have some of the assembler-specific details off there, but this is how asm works.)
I've seen the post about the same error, but I'm still getting these errors:
too many memory references for `mov'
junk `hCPUIDmov buffer' after expression
... here's the code (mingw compiler / C::B) :
#include <iostream>
using namespace std;
union aregister
{
int theint;
unsigned bits[32];
};
union tonibbles
{
int integer;
short parts[2];
};
void GetSerial()
{
int part1,part2,part3;
aregister issupported;
int buffer;
__asm(
"mov %eax, 01h"
"CPUID"
"mov buffer, edx"
);//do the cpuid, move the edx (feature set register) to "buffer"
issupported.theint = buffer;
if(issupported.bits[18])//it is supported
{
__asm(
"mov part1, eax"
"mov %eax, 03h"
"CPUID"
);//move the first part into "part1" and call cpuid with the next subfunction to get
//the next 64 bits
__asm(
"mov part2, edx"
"mov part3, ecx"
);//now we have all the 96 bits of the serial number
tonibbles serial[3];//to split it up into two nibbles
serial[0].integer = part1;//first part
serial[1].integer = part2;//second
serial[2].integer = part3;//third
}
}
Your assembly code is not correctly formatted for gcc.
Firstly, gcc uses AT&T syntax (EDIT: by default, thanks nrz), so it needs a % added for each register reference and a $ for immediate operands. The destination operand is always on the right side.
Secondly, you'll need to pass a line separator (for example \n\t) for a new line. Since gcc passes your string straight to the assembler, it requires a particular syntax.
You should usually try hard to minimize your assembler since it may cause problems for the optimizer. Simplest way to minimize the assembler required would probably be to break the cpuid instruction out into a function, and reuse that.
void cpuid(int32_t *peax, int32_t *pebx, int32_t *pecx, int32_t *pedx)
{
__asm(
"CPUID"
/* All outputs (eax, ebx, ecx, edx) */
: "=a"(*peax), "=b"(*pebx), "=c"(*pecx), "=d"(*pedx)
/* All inputs (eax) */
: "a"(*peax)
);
}
Then simply call it like this:
int a=1, b, c, d;
cpuid(&a, &b, &c, &d);
Another possibly more elegant way is to do it using macros.
Because of how C works,
__asm(
"mov %eax, 01h"
"CPUID"
"mov buffer, edx"
);
is equivalent to
__asm("mov %eax, 01h" "CPUID" "mov buffer, edx");
which is equivalent to
__asm("mov %eax, 01hCPUIDmov buffer, edx");
which isn't what you want.
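You can see the concatenation for yourself in plain C, without any asm. The sketch below only demonstrates the string merging and the effect of adding "\n\t" separators; it deliberately leaves the operand syntax unfixed, and the names are made up for the example.

```c
#include <stddef.h>

/* Adjacent string literals are merged at compile time, with nothing
   inserted between them. */
static const char merged[] =
    "mov %eax, 01h"
    "CPUID"
    "mov buffer, edx";

/* Ending each piece with "\n\t" gives the assembler one instruction per line. */
static const char separated[] =
    "mov %eax, 01h\n\t"
    "CPUID\n\t"
    "mov buffer, edx";

/* Counts the line breaks the assembler would see in a template string. */
static size_t count_newlines(const char *s) {
    size_t n = 0;
    for (; *s; s++)
        if (*s == '\n')
            n++;
    return n;
}
```

count_newlines(merged) is 0, so the assembler receives a single line, which is where the "junk `hCPUIDmov buffer' after expression" message comes from.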
AT&T syntax (GAS's default) puts the destination register at the end.
AT&T syntax requires immediates to be prefixed with $.
You can't reference local variables like that; you need to pass them in as operands.
Wikipedia's article gives a working example that returns eax.
The following snippet might cover your use-cases (I'm not intricately familiar with GCC inline assembly or CPUID):
int eax, ebx, ecx, edx;
eax = 1;
__asm( "cpuid"
: "+a" (eax), "=b" (ebx), "=c" (ecx), "=d" (edx));
buffer = edx;
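Pulling those points together, a rough sketch of the feature check might look like this. The function names are invented, the non-x86 branch is only there so it compiles on other targets, and it assumes a compiler that accepts "=b" here (older GCC versions reject it for 32-bit PIC code). Note also that an array of 32 unsigneds in a union does not give you the 32 individual bits of the int; a shift and mask does.

```c
#include <stdint.h>

/* Hypothetical helper: returns EDX from CPUID leaf 1 (the feature flags),
   or 0 when not compiled for x86. */
static uint32_t cpuid_features_edx(void) {
#if defined(__i386__) || defined(__x86_64__)
    uint32_t eax = 1, ebx, ecx = 0, edx;
    __asm__ volatile ("cpuid"
        : "+a" (eax), "=b" (ebx), "+c" (ecx), "=d" (edx));
    (void)ebx;   /* written by cpuid, not needed here */
    return edx;
#else
    return 0;
#endif
}

/* Bit 18 of EDX is the processor-serial-number flag tested in the question. */
static int has_serial_number(void) {
    return (int)((cpuid_features_edx() >> 18) & 1u);
}
```

On most current CPUs has_serial_number() will return 0, since the serial number feature is rarely present or enabled.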
In GCC, I want to do a 128-bit XOR of two C variables via asm code. How?
asm (
"movdqa %1, %%xmm1;"
"movdqa %0, %%xmm0;"
"pxor %%xmm1,%%xmm0;"
"movdqa %%xmm0, %0;"
:"=x"(buff) /* output operand */
:"x"(bu), "x"(buff)
:"%xmm0","%xmm1"
);
But I get a segmentation fault.
This is the objdump output:
movq -0x80(%rbp),%xmm2
movq -0x88(%rbp),%xmm3
movdqa %xmm2,%xmm1
movdqa %xmm2,%xmm0
pxor %xmm1,%xmm0
movdqa %xmm0,%xmm2
movq %xmm2,-0x78(%rbp)
You would see segfault issues if the variables aren't 16-byte aligned. The CPU can't MOVDQA to/from unaligned memory addresses, and would generate a processor-level "GP exception", prompting the OS to segfault your app.
C variables you declare (stack, global) or allocate on the heap aren't generally aligned to a 16 byte boundary, though occasionally you may get an aligned one by chance. You could direct the compiler to ensure proper alignment by using the __m128 or __m128i data types. Each of those declares a properly-aligned 128 bit value.
Further, reading the objdump, it looks like the compiler wrapped the asm sequence with code to copy the operands from the stack to the xmm2 and xmm3 registers using the MOVQ instruction, only to have your asm code then copy the values to xmm0 and xmm1. After xor-ing into xmm0, the wrapper copies the result to xmm2 only to then copy it back to the stack. Overall, not terribly efficient. MOVQ copies 8 bytes at a time, and expects (under some circumstances) an 8-byte aligned address. Getting an unaligned address, it could fail just like MOVDQA. The wrapper code, however, adds an aligned offset (-0x80, -0x88, and later -0x78) to the BP register, which may or may not contain an aligned value. Overall, there's no guarantee of alignment in the generated code.
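If you can't control the alignment of the data, another option is to sidestep the requirement with unaligned loads and stores. A minimal sketch, assuming SSE2 is available at compile time (with a plain byte loop otherwise); xor128 is a made-up name:

```c
#include <stdint.h>
#if defined(__SSE2__)
#include <emmintrin.h>
#endif

/* Hypothetical helper: dst = a XOR b over 16 bytes, with no alignment
   requirement on any of the three pointers. */
static void xor128(uint8_t dst[16], const uint8_t a[16], const uint8_t b[16]) {
#if defined(__SSE2__)
    __m128i va = _mm_loadu_si128((const __m128i *)a);   /* unaligned load */
    __m128i vb = _mm_loadu_si128((const __m128i *)b);
    _mm_storeu_si128((__m128i *)dst, _mm_xor_si128(va, vb));
#else
    for (int i = 0; i < 16; i++)                        /* portable fallback */
        dst[i] = a[i] ^ b[i];
#endif
}
```

_mm_loadu_si128 and _mm_storeu_si128 are the unaligned counterparts of the aligned load and store, so this works even on buffers that start at an odd address.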
The following ensures the arguments and result are stored in correctly aligned memory locations, and seems to work fine:
#include <stdio.h>
#include <stdint.h>
#include <emmintrin.h>
void print128(__m128i value) {
int64_t *v64 = (int64_t*) &value;
printf("%.16llx %.16llx\n", v64[1], v64[0]);
}
int main(void) {
__m128i a = _mm_setr_epi32(0x00ffff00, 0x00ffff00, 0x00ffff00, 0x10ffff00), /* low dword first! */
b = _mm_setr_epi32(0x0000ffff, 0x0000ffff, 0x0000ffff, 0x0000ffff),
x;
asm (
"movdqa %1, %%xmm0;" /* xmm0 <- a */
"movdqa %2, %%xmm1;" /* xmm1 <- b */
"pxor %%xmm1, %%xmm0;" /* xmm0 <- xmm0 xor xmm1 */
"movdqa %%xmm0, %0;" /* x <- xmm0 */
:"=x"(x) /* output operand, %0 */
:"x"(a), "x"(b) /* input operands, %1, %2 */
:"%xmm0","%xmm1" /* clobbered registers */
);
/* printf the arguments and result as 2 64-bit hex values */
print128(a);
print128(b);
print128(x);
}
compile with (gcc, ubuntu 32 bit)
gcc -msse2 -o app app.c
output:
10ffff0000ffff00 00ffff0000ffff00
0000ffff0000ffff 0000ffff0000ffff
10ff00ff00ff00ff 00ff00ff00ff00ff
In the code above, _mm_setr_epi32 is used to initialize a and b with 128-bit values, as the compiler may not support 128-bit integer literals.
print128 writes out the hexadecimal representation of a 128 bit integer, as printf may not be able to do so.
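The following is shorter and avoids some of the duplicate copying. The compiler adds its own hidden movdqa's to load the operands, so you don't have to load the registers yourself. For pxor %2,%0 to work, though, the output register must start out holding a, so the first input has to be tied to the output's register (a matching "0" constraint or a "+x" read-write operand does this):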
#include <stdio.h>
#include <stdint.h>
#include <emmintrin.h>
void print128(__m128i value) {
int64_t *px = (int64_t*) &value;
printf("%.16llx %.16llx\n", px[1], px[0]);
}
int main(void) {
__m128i a = _mm_setr_epi32(0x00ffff00, 0x00ffff00, 0x00ffff00, 0x10ffff00),
b = _mm_setr_epi32(0x0000ffff, 0x0000ffff, 0x0000ffff, 0x0000ffff);
asm (
"pxor %2, %0;" /* a <- a xor b */
:"=x"(a) /* output operand, %0 */
:"0"(a), "x"(b) /* input operands: %1 shares %0's register, %2 */
);
print128(a);
}
compile as before:
gcc -msse2 -o app app.c
output:
10ff00ff00ff00ff 00ff00ff00ff00ff
Alternatively, if you'd like to avoid the inline assembly, you could use the SSE intrinsics instead. Those are inlined functions/macros that encapsulate MMX/SSE instructions with a C-like syntax. _mm_xor_si128 reduces your task to a single call:
#include <stdio.h>
#include <stdint.h>
#include <emmintrin.h>
void print128(__m128i value) {
int64_t *v64 = (int64_t*) &value;
printf("%.16llx %.16llx\n", v64[1], v64[0]);
}
int main(void)
{
__m128i x = _mm_xor_si128(
_mm_setr_epi32(0x00ffff00, 0x00ffff00, 0x00ffff00, 0x10ffff00), /* low dword first !*/
_mm_setr_epi32(0x0000ffff, 0x0000ffff, 0x0000ffff, 0x0000ffff));
print128(x);
}
compile:
gcc -msse2 -o app app.c
output:
10ff00ff00ff00ff 00ff00ff00ff00ff
Umm, why not use the __builtin_ia32_pxor intrinsic?
Under late model gcc (mine is 4.5.5) the option -O2 or above implies -fstrict-aliasing which causes the code given above to complain:
supersuds.cpp:31: warning: dereferencing pointer ‘v64’ does break strict-aliasing rules
supersuds.cpp:30: note: initialized from here
This can be remedied by supplying additional type attributes as follows:
typedef int64_t __attribute__((__may_alias__)) alias_int64_t;
void print128(__m128i value) {
alias_int64_t *v64 = (alias_int64_t*) &value;
printf("%.16lx %.16lx\n", v64[1], v64[0]);
}
I first tried the attribute directly without the typedef. It was accepted, but I still got the warning. The typedef seems to be a necessary piece of the magic.
BTW, this is my second answer here and I still hate the fact that I can't yet tell where I'm permitted to edit, so I wasn't able to post this where it belonged.
And one more thing, under AMD64, the %llx format specifier needs to be changed to %lx.
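Both the aliasing warning and the %llx versus %lx question can be avoided with standard C alone, for what it's worth. The sketch below copies the bytes out with memcpy and uses the PRIx64 macro from <inttypes.h>, which expands to the right length modifier on either platform. format128 is a hypothetical helper; it takes a pointer to any 16-byte object (for example &x for an __m128i x) and assumes a little-endian target like the rest of this thread.

```c
#include <inttypes.h>   /* uint64_t and PRIx64 */
#include <stdio.h>
#include <string.h>

/* Hypothetical helper: formats a 16-byte value as two 64-bit hex words,
   high word first, into out (which must hold at least 34 bytes). */
static void format128(const void *value, char out[34]) {
    uint64_t w[2];
    memcpy(w, value, sizeof w);   /* copying bytes sidesteps strict aliasing */
    snprintf(out, 34, "%.16" PRIx64 " %.16" PRIx64, w[1], w[0]);
}
```

Writing into a buffer rather than printing directly is just a choice made here so the result can be checked or logged; puts(out) afterwards gives the same output as print128.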