Compile GCC Inline Assembly into Microsoft Visual C++ 2008 - c

I'm having trouble porting this GCC inline assembly to Microsoft Visual C++ 2008.
GCC inline assembly:
"smull %0, %1, %2, %3 \n\t"
"mov %0, %0, LSR #16 \n\t"
"add %1, %0, %1, LSL #16 \n\t"
: "=&r"(lo), "=&r"(hi)
: "r"(rb), "r"(ra));
The compiler says:
error C2143: syntax error : missing ')' before ':'
The complete function is:
static __inline Word32 mull(Word32 a, Word16 b)
{
register Word32 ra = a;
register Word32 rb = b;
Word32 lo, hi;
__asm__ volatile(
"smull %0, %1, %2, %3 \n\t"
"mov %0, %0, LSR #16 \n\t"
"add %1, %0, %1, LSL #16 \n\t"
: "=&r"(lo), "=&r"(hi)
: "r"(rb), "r"(ra));
return hi;
}

Visual Studio does not support ARM inline assembly. See: Inline assembly is not supported on the ARM. You will need to either reverse-engineer the assembly code to C, or use a separate assembler and link this as a separate function.
It looks like this function just does a 32 x 32 -> 64 bit signed multiply and then shifts the 64 bit result right by 16 bits and truncates it to 32 bits:
static __inline Word32 mull(Word32 a, Word16 b)
{
return (Word32)(((Word64)a * (Word64)b) >> 16);
}
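To make the C replacement above easy to try outside the original code base, here is a minimal self-contained sketch. The three typedefs are assumptions standing in for the project's own Word16/Word32/Word64 definitions, which the question does not show.

```c
/* Assumed stand-ins for the project's own fixed-width typedefs. */
typedef short     Word16;
typedef int       Word32;
typedef long long Word64;

/* Signed multiply, keeping bits 16..47 of the 64-bit product. */
static __inline Word32 mull(Word32 a, Word16 b)
{
    /* Right-shifting a negative value is implementation-defined in C;
       GCC, Clang and MSVC perform an arithmetic shift here, which is
       what the original smull/LSR/LSL sequence relies on. */
    return (Word32)(((Word64)a * (Word64)b) >> 16);
}
```

For example, mull(0x10000, 2) yields 2, because 0x10000 represents 1.0 in 16.16 fixed point.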


ARM inline assembly, registers are read in an incorrect order

I am trying to read ARM registers via C using inline assembly but it doesn't follow the order of execution as it should.
volatile uint32_t var1 = 0;
volatile uint32_t var2 = 0;
volatile uint32_t var3 = 0;
volatile uint32_t var4 = 0;
__asm volatile(
"mov r0, #5\n\t"
"mov r1, #6\n\t"
"mov r2, #7\n\t"
"mov r3, #8\n\t"
"mov %0, r0\n\t"
"mov %0, r1\n\t"
"mov %0, r2\n\t"
"mov %0, r3\n\t"
: "=r" (var1),
"=r" (var2),
"=r" (var3),
"=r" (var4));
What happens is that the output is;
var1 = 8
var2 = 5
var3 = 6
var4 = 7
What I expect is;
var1 = 5
var2 = 6
var3 = 7
var4 = 8
It seems r0 is read last and it starts with r1. Why is that happening?
Note: Ignore the purpose of the code or how variables are defined, it's a copy/paste from a bigger application. It's just this specific register behavior in question.
GCC-style assembly inserts are designed to insert a single instruction per asm("..."), and they require you to give accurate information about all of the registers involved. In this case, you have not notified the compiler that registers r0, r1, r2, and r3 are used internally, so it probably thinks it's OK to reuse some of those for var1 through var4. You have also reused %0 as the destination of all four of the final mov instructions, so all four of them are actually writing to var1.
Also, it may or may not be your immediate problem, but volatile doesn't do what you think it does and probably isn't accomplishing anything useful here.
How to fix it? Well, first off, there needs to be a really strong reason why you can't just write
uint32_t var1 = 5;
uint32_t var2 = 6;
uint32_t var3 = 7;
uint32_t var4 = 8;
Assuming there is such a reason, then you should instead try writing one instruction per asm, and not using any scratch registers at all...
asm ("mov %0, #5" : "=r" (var1));
asm ("mov %0, #6" : "=r" (var2));
asm ("mov %0, #7" : "=r" (var3));
asm ("mov %0, #8" : "=r" (var4));
If you really absolutely have to do a whole bunch of work in a single asm then the first thing you should consider is putting it in a separate .S file and making it conform to the ABI so that you can call it like a normal function:
.globl do_the_thing
.type do_the_thing, #function
do_the_thing:
@ all your actual code here
bx r14
.size do_the_thing, .-do_the_thing
and then in your C
rv = do_the_thing(arg1, arg2, ...);
If there's no way to make that work, then, and only then, should you sit down and read the entire "Extended Asm" chapter of the GCC manual and work out how to wedge the complete register-usage behavior of your insert into the constraints. If you need help with that, post a new question in which you show the real assembly language construct you need to insert, rather than a vague example, because every little detail matters.
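For the snippet in the question, a version with complete constraints might look roughly like the sketch below. It assumes a 32-bit ARM target in ARM or Thumb-2 mode; the function name read_regs and the non-ARM fallback branch are additions for illustration only, so that the example also builds elsewhere.

```c
#include <stdint.h>

/* Fills out[0..3] with 5, 6, 7, 8, going through r0-r3 on 32-bit ARM. */
static void read_regs(uint32_t out[4])
{
#if defined(__arm__)
    uint32_t v1, v2, v3, v4;
    __asm__ volatile(
        "mov r0, #5\n\t"
        "mov r1, #6\n\t"
        "mov r2, #7\n\t"
        "mov r3, #8\n\t"
        "mov %0, r0\n\t"   /* each output gets its own operand number */
        "mov %1, r1\n\t"
        "mov %2, r2\n\t"
        "mov %3, r3\n\t"
        : "=r"(v1), "=r"(v2), "=r"(v3), "=r"(v4)
        :
        : "r0", "r1", "r2", "r3");  /* scratch registers declared as clobbered */
    out[0] = v1; out[1] = v2; out[2] = v3; out[3] = v4;
#else
    /* Illustration-only fallback for non-ARM builds. */
    out[0] = 5; out[1] = 6; out[2] = 7; out[3] = 8;
#endif
}
```

Listing r0 through r3 as clobbers keeps the compiler from placing any of the four outputs in those registers, which is what scrambled the values in the question.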
The operand numbers used in the inline assembly need to be incremented, one per output:
"mov %0, r0\n\t"
"mov %1, r1\n\t"
"mov %2, r2\n\t"
"mov %3, r3\n\t"

64bit dividend on 32bit architecture, works in assembly but not in C

I have a 64bit dividend and a 32bit divisor.
GCC does not seem to be able to generate this kind of assembly on its own. It complains about an undefined reference to '__udivdi3'. I know this is because I use the -nostdlib flag, but I cannot use any standard libraries.
The 64bit variables are of type unsigned long long.
Is there a more elegant way to do this than the inline assembly below?
My goal is: my64bit / 32bitDivisor.
volatile uint32_t high = my64bit >> 32;
volatile uint32_t low = my64bit;
volatile uint32_t out;
__asm__ __volatile__ (
"movl %0, %%edx\n\t"
"movl %1, %%eax\n\t"
"div %2\n\t"
"movl %%eax, (%3)\n\t"
:: "r" (high), "r" (low), "r" (32bitDivisor), "r" (&out)
: "%eax", "%edx", "memory"
);
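One tidier way to express the snippet above is to let the constraints place the operands in EAX and EDX directly, so that no mov instructions or pointer stores are needed. This is only a sketch: the name div64by32 is made up, and the fallback branch exists solely so the example builds on non-x86 targets (where it may again need a library routine). Note that div faults if the divisor is zero or the quotient does not fit in 32 bits, that is, unless the high half of the dividend is smaller than the divisor.

```c
#include <stdint.h>

/* Divides a 64-bit value by a 32-bit value; the quotient must fit in 32 bits. */
static uint32_t div64by32(uint64_t n, uint32_t d)
{
#if defined(__i386__) || defined(__x86_64__)
    uint32_t quot = (uint32_t)n;          /* EAX: low half in, quotient out   */
    uint32_t rem  = (uint32_t)(n >> 32);  /* EDX: high half in, remainder out */
    __asm__("divl %2"
            : "+a"(quot), "+d"(rem)
            : "rm"(d)
            : "cc");
    (void)rem;
    return quot;
#else
    /* Illustration-only fallback for other architectures. */
    return (uint32_t)(n / d);
#endif
}
```

Because EAX and EDX are declared as read-write operands, they do not need to appear in the clobber list.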

ARM assembly: can’t find a register in class ‘GENERAL_REGS’ while reloading ‘asm’

I am trying to implement a function which multiplies 32-bit operand with 256-bit operand in ARM assembly on ARM Cortex-a8. The problem is I am running out of registers and I have no idea how I can reduce the number of used registers here. Here is my function:
typedef struct UN_256fe{
uint32_t uint32[8];
}UN_256fe;
typedef struct UN_288bite{
uint32_t uint32[9];
}UN_288bite;
void multiply32x256(uint32_t A, UN_256fe* B, UN_288bite* res){
asm (
"umull r3, r4, %9, %10;\n\t"
"mov %0, r3; \n\t"/*res->uint32[0] = r3*/
"umull r3, r5, %9, %11;\n\t"
"adds r6, r3, r4; \n\t"/*res->uint32[1] = r3 + r4*/
"mov %1, r6; \n\t"
"umull r3, r4, %9, %12;\n\t"
"adcs r6, r5, r3; \n\t"
"mov %2, r6; \n\t"/*res->uint32[2] = r6*/
"umull r3, r5, %9, %13;\n\t"
"adcs r6, r3, r4; \n\t"
"mov %3, r6; \n\t"/*res->uint32[3] = r6*/
"umull r3, r4, %9, %14;\n\t"
"adcs r6, r3, r5; \n\t"
"mov %4, r6; \n\t"/*res->uint32[4] = r6*/
"umull r3, r5, %9, %15;\n\t"
"adcs r6, r3, r4; \n\t"
"mov %5, r6; \n\t"/*res->uint32[5] = r6*/
"umull r3, r4, %9, %16;\n\t"
"adcs r6, r3, r5; \n\t"
"mov %6, r6; \n\t"/*res->uint32[6] = r6*/
"umull r3, r5, %9, %17;\n\t"
"adcs r6, r3, r4; \n\t"
"mov %7, r6; \n\t"/*res->uint32[7] = r6*/
"adc r6, r5, #0 ; \n\t"
"mov %8, r6; \n\t"/*res->uint32[8] = r6*/
: "=r"(res->uint32[8]), "=r"(res->uint32[7]), "=r"(res->uint32[6]), "=r"(res->uint32[5]), "=r"(res->uint32[4]),
"=r"(res->uint32[3]), "=r"(res->uint32[2]), "=r"(res->uint32[1]), "=r"(res->uint32[0])
: "r"(A), "r"(B->uint32[7]), "r"(B->uint32[6]), "r"(B->uint32[5]),
"r"(B->uint32[4]), "r"(B->uint32[3]), "r"(B->uint32[2]), "r"(B->uint32[1]), "r"(B->uint32[0]), "r"(temp)
: "r3", "r4", "r5", "r6", "cc", "memory");
}
EDIT-1: I updated my clobber list based on the first comment, but I still get the same error
A simple solution is to break this up and not use clobbers. Declare the variables as 'tmp1', etc. Try not to use any mov statements; let the compiler do this if it has to. The compiler will use an algorithm to figure out the best 'flow' of information. If you use clobbers, it cannot reuse those registers. The way it is now, you make it load all the memory before the assembler executes. This is bad, as you want the memory accesses and the CPU ALU to pipeline.
void multiply32x256(uint32_t A, UN_256fe* B, UN_288bite* res)
{
uint32_t mulhi1, mullo1;
uint32_t mulhi2, mullo2;
uint32_t tmp;
asm("umull %0, %1, %2, %3;\n\t"
: "=r" (mullo1), "=r" (mulhi1)
: "r"(A), "r"(B->uint32[7])
);
res->uint32[8] = mullo1; /* was 'mov %0, r3; */
asm volatile("umull %0, %1, %3, %4;\n\t"
"adds %2, %0, %5; \n\t"/* tmp = new low word + previous high word (was r3 + r4) */
: "=&r" (mullo2), "=&r" (mulhi2), "=r" (tmp) /* '&': written before mulhi1 is read */
: "r"(A), "r"(B->uint32[6]), "r"(mulhi1)
: "cc"
);
res->uint32[7] = tmp; /* was 'mov %1, r6; */
/* ... etc */
}
The whole purpose of the 'gcc inline assembler' is not to code assembler directly in a 'C' file. It is to use the register allocation logic of the compiler AND do something that cannot easily be done in 'C': the use of carry logic, in your case.
By not making it one huge 'asm' clause, the compiler can schedule the loads from memory as it needs new registers. It will also pipeline your 'UMULL' ALU activity with the load/store unit.
You should only use clobber if an instruction implicitly clobbers a specific register. You may also use something like,
register int *p1 asm ("r0");
and use that as an output. However, I don't know of any ARM instructions that implicitly clobber a register like this, apart from those that alter the stack, which your code doesn't use, and of course the carry flag.
GCC knows that memory changes if it is listed as an input/output, so you don't need a memory clobber. In fact it is detrimental, as the memory clobber is a compiler memory barrier and this will cause memory to be written when the compiler might be able to schedule that for later.
The moral is use gcc inline assembler to work with the compiler. If you code in assembler and you have huge routines, the register use can become complex and confusing. Typical assembler coders will keep only one thing in a register per routine, but that is not always the best use of registers. The compiler will shuffle the data around in a fairly smart way that is difficult to beat (and not very satisfying to hand code IMO) when the code size gets larger.
You might want to look at the GMP library which has lots of ways to efficiently tackle some of the same issues it looks like your code has.
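For comparison, the same 32 x 256-bit product can also be written in portable C with a 64-bit accumulator, leaving register allocation and carry handling entirely to the compiler. This is a sketch and not the asker's code: the name multiply32x256_c is made up, and it assumes uint32[0] holds the least significant word, which the question does not state. It also sidesteps having to keep the carry flag alive between separate asm statements, something the compiler does not guarantee.

```c
#include <stdint.h>

typedef struct UN_256fe   { uint32_t uint32[8]; } UN_256fe;
typedef struct UN_288bite { uint32_t uint32[9]; } UN_288bite;

/* res = A * B. Assumption: uint32[0] is the least significant word. */
static void multiply32x256_c(uint32_t A, const UN_256fe *B, UN_288bite *res)
{
    uint64_t carry = 0;
    int i;
    for (i = 0; i < 8; i++) {
        /* (2^32-1)^2 + (2^32-1) < 2^64, so this sum cannot overflow. */
        uint64_t t = (uint64_t)A * B->uint32[i] + carry;
        res->uint32[i] = (uint32_t)t;  /* low 32 bits are the result word   */
        carry = t >> 32;               /* high 32 bits feed the next column */
    }
    res->uint32[8] = (uint32_t)carry;
}
```

On 32-bit ARM a compiler can typically map the widening multiply onto the umull/umlal family, so it is worth checking the generated code before hand-writing the assembly.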

SSE2 instruction in C code

I am trying to reverse engineer some C code, but I can't really understand this piece of assembly. I know it is part of the SSE extension. However, some things are really different from what I am used to in x86 instructions.
static int sad16_sse2(void *v, uint8_t *blk2, uint8_t *blk1, int stride, int h)
{
int ret;
__asm__ volatile(
"pxor %%xmm6, %%xmm6 \n\t"
"1: \n\t"
"movdqu (%1), %%xmm0 \n\t"
"movdqu (%1, %3), %%xmm1 \n\t"
"psadbw (%2), %%xmm0 \n\t"
"psadbw (%2, %3), %%xmm1 \n\t"
"paddw %%xmm0, %%xmm6 \n\t"
"paddw %%xmm1, %%xmm6 \n\t"
"lea (%1,%3,2), %1 \n\t"
"lea (%2,%3,2), %2 \n\t"
"sub $2, %0 \n\t"
" jg 1b \n\t"
: "+r" (h), "+r" (blk1), "+r" (blk2)
: "r" ((x86_reg)stride)
);
__asm__ volatile(
"movhlps %%xmm6, %%xmm0 \n\t"
"paddw %%xmm0, %%xmm6 \n\t"
"movd %%xmm6, %0 \n\t"
: "=r"(ret)
);
return ret;
}
What are the %1, %2, and %3? what does (%1,%2,%3) mean? Also what does "+r", "-r", "=r" mean?
You'll want to have a look at this GCC Inline Assembly HOWTO.
The percent sign numbers are the instruction operands.
The inline assembler works similarly to a macro preprocessor. Operands with exactly one leading percent sign are replaced by the operands in the order in which they appear in the operand lists (outputs first, then inputs); names with two percent signs, such as %%xmm6, are literal registers. In this case:
%0 h output, register, r/w
%1 blk1 output, register, r/w
%2 blk2 output, register, r/w
%3 (x86_reg)stride input, register, read only
The parameters are normal C expressions. They can be further specified by "constraints"; in this case "r" means the value should be in a register, as opposed to "m", which is a memory operand. The constraint modifier in "=r" makes this a write-only operand, "+r" is a read-write operand, and plain "r" is a normal read-only operand.
After the first colon the output operands appear, after the second the input operands and after the optional third the clobbered registers.
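A tiny x86 example may make the numbering and the three kinds of operand more concrete. It is only a sketch with a made-up function name, add_and_copy, and the non-x86 branch is an illustration-only fallback.

```c
/* Returns a + b and also stores a copy of the sum in *copy. */
static int add_and_copy(int a, int b, int *copy)
{
    int sum = a;  /* "+r": read (as a) and then overwritten with a + b */
    int dup;      /* "=r": only written                                */
#if defined(__i386__) || defined(__x86_64__)
    __asm__("addl %2, %0\n\t"  /* %0 is sum, %2 is b */
            "movl %0, %1"      /* %1 is dup          */
            : "+r"(sum), "=r"(dup)  /* outputs: %0, %1       */
            : "r"(b)                /* inputs:  %2           */
            : "cc");                /* clobbers: flags       */
#else
    sum += b;
    dup = sum;
#endif
    *copy = dup;
    return sum;
}
```

The operands are numbered across both lists, so the first input is %2 here because two outputs precede it.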
So the instruction sequence calculates the sum of the absolute differences between the bytes of blk1 and blk2. This happens in 16-byte rows, so if stride is 16 the rows are contiguous; otherwise there are gaps between them. Each instruction appears twice because the loop is unrolled once: every iteration handles two rows and subtracts 2 from h, so h is the number of 16-byte rows to process. The second asm block folds the result together. On an XMM register, psadbw produces two partial sums, one in the low 16 bits of each 64-bit half, so movhlps copies the upper half down, paddw adds the two partial sums, and movd moves the total into ret.
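As a plain-C reference for what the loop computes, a scalar version could look like the sketch below. The name sad16_ref is made up, and it assumes that h counts 16-byte rows, which is how the `sub $2, %0` with two rows per iteration reads.

```c
#include <stdint.h>
#include <stdlib.h>

/* Sum of absolute differences over h rows of 16 bytes each. */
static int sad16_ref(const uint8_t *blk2, const uint8_t *blk1, int stride, int h)
{
    int sum = 0;
    int x, y;
    for (y = 0; y < h; y++) {
        for (x = 0; x < 16; x++)
            sum += abs(blk1[x] - blk2[x]);  /* per-byte |a - b| */
        blk1 += stride;                     /* advance both blocks by one row */
        blk2 += stride;
    }
    return sum;
}
```

A reference like this is handy for checking the SSE2 routine against known inputs.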
