Struct offsets in inline assembly - c

I'm working on a project where an interrupt service routine has to be handled in assembler. The handler function is called from an interrupt vector wrapper. The handler body is written in assembler and receives a single (pointer) parameter in a specific register.
The target is MSP430 and the code has to compile with both MSP430-gcc and the TI compiler. I already have a working solution for MSP430-gcc and it looks like this:
static void __attribute__((naked)) _shared_vector_handler(Timer_driver_t *driver) {
__asm__(
" MOVX.W %c[iv_register_offset](R12),R14 ; \n"
" ADD @R14,PC ; \n"
" RETA ; \n"
" JMP CCIFG_1_HND ; Vector 2 \n"
" JMP CCIFG_2_HND ; Vector 4 \n"
" JMP CCIFG_3_HND ; Vector 6 \n"
" JMP CCIFG_4_HND ; Vector 8 \n"
" JMP CCIFG_5_HND ; Vector 10 \n"
" JMP CCIFG_6_HND ; Vector 12 \n"
"TIFG_HND: \n"
" MOVX.A %c[overflow_handle_offset](R12),R14 ; \n"
" MOVX.A %c[handler_param_offset](R14),R12 ; \n"
" MOVX.A %c[handler_offset](R14),R14 ; \n"
" CALLA R14 ; \n"
" RETA ; \n"
"CCIFG_1_HND: \n"
" MOVX.A %c[ccr1_handle_offset](R12),R14 ; \n"
" MOVX.A %c[handler_param_offset](R14),R12 ; \n"
" MOVX.A %c[handler_offset](R14),R14 ; \n"
" CALLA R14 ; \n"
" RETA ; \n"
"CCIFG_2_HND: \n"
" MOVX.A %c[ccr2_handle_offset](R12),R14 ; \n"
" MOVX.A %c[handler_param_offset](R14),R12 ; \n"
" MOVX.A %c[handler_offset](R14),R14 ; \n"
" CALLA R14 ; \n"
" RETA ; \n"
"CCIFG_3_HND: \n"
" MOVX.A %c[ccr3_handle_offset](R12),R14 ; \n"
" MOVX.A %c[handler_param_offset](R14),R12 ; \n"
" MOVX.A %c[handler_offset](R14),R14 ; \n"
" CALLA R14 ; \n"
" RETA ; \n"
"CCIFG_4_HND: \n"
" MOVX.A %c[ccr4_handle_offset](R12),R14 ; \n"
" MOVX.A %c[handler_param_offset](R14),R12 ; \n"
" MOVX.A %c[handler_offset](R14),R14 ; \n"
" CALLA R14 ; \n"
" RETA ; \n"
"CCIFG_5_HND: \n"
" MOVX.A %c[ccr5_handle_offset](R12),R14 ; \n"
" MOVX.A %c[handler_param_offset](R14),R12 ; \n"
" MOVX.A %c[handler_offset](R14),R14 ; \n"
" CALLA R14 ; \n"
" RETA ; \n"
"CCIFG_6_HND: \n"
" MOVX.A %c[ccr6_handle_offset](R12),R14 ; \n"
" MOVX.A %c[handler_param_offset](R14),R12 ; \n"
" MOVX.A %c[handler_offset](R14),R14 ; \n"
" CALLA R14 ; \n" ::
[iv_register_offset] "i" (offsetof(Timer_driver_t, _IV_register)),
[overflow_handle_offset] "i" (offsetof(Timer_driver_t, _overflow_handle)),
[ccr1_handle_offset] "i" (offsetof(Timer_driver_t, _CCR1_handle)),
[ccr2_handle_offset] "i" (offsetof(Timer_driver_t, _CCR2_handle)),
[ccr3_handle_offset] "i" (offsetof(Timer_driver_t, _CCR3_handle)),
[ccr4_handle_offset] "i" (offsetof(Timer_driver_t, _CCR4_handle)),
[ccr5_handle_offset] "i" (offsetof(Timer_driver_t, _CCR5_handle)),
[ccr6_handle_offset] "i" (offsetof(Timer_driver_t, _CCR6_handle)),
[handler_offset] "i" (offsetof(Timer_channel_handle_t, _handler)),
[handler_param_offset] "i" (offsetof(Timer_channel_handle_t, _handler_param)) :
);
}
Translated to English: the driver structure contains the address of the IV register at a specific offset. The content at that address is added to PC, so a jump to a specific label (depending on which interrupt flag is set) occurs. This is the recommended usage as described by TI in the user's guide, page 653. All labels do the same thing: they take a pointer to a handle from the driver structure at a specific offset. The handle in turn holds, at specific offsets, a function pointer (the interrupt service handler) and a pointer to a parameter that is passed to the handler. The structures in short:
typedef struct Timer_driver {
// enable dispose(Timer_driver_t *)
Disposable_t _disposable;
// base of HW timer registers, (address of corresponding TxCTL register)
uint16_t _CTL_register;
...
// interrupt vector register
uint16_t _IV_register;
// stored mode control
uint8_t _mode;
// amount of CCRn registers
uint8_t _available_handles_cnt;
// main (CCR0) handle
Timer_channel_handle_t *_CCR0_handle;
// up to six (CCRn) handles sharing one interrupt vector
Timer_channel_handle_t *_CCR1_handle;
Timer_channel_handle_t *_CCR2_handle;
...
}
and
struct Timer_channel_handle {
// vector wrapper, enable dispose(Timer_channel_handle_t *)
Vector_handle_t vector;
// HW timer driver reference
Timer_driver_t *_driver;
// capture / compare control register
uint16_t _CCTLn_register;
// capture / compare register
uint16_t _CCRn_register;
// vector interrupt service handler
void (*_handler)(void *);
// vector interrupt service handler parameter
void *_handler_param;
...
}
Now the problem:
offsets are not known until compile time
I can't pass offsetof(s, m) to the assembler
offsets depend on the memory model used (pointer size of 16 or 32 bits)
offsets depend on the size of the first member of both structures, and this size depends on preprocessor definitions (1 pointer or 4 pointers)
offsets cannot be precomputed, because each compiler adds its own alignment and padding to the first member structure
the first member must stay the first member (reordering is not allowed)
the TI compiler does not support passing compile-time values to inline assembly code
The goal:
support both compilers
do not duplicate code, do not hardcode offsets
if possible, avoid extracting the whole handler to an asm file and including headers via .cdecls (or #include in the case of gcc). The two compilers handle inclusion of C headers very differently, structure offsets are also defined very differently, and some non-trivial restructuring of headers would be required, which I believe is nearly impossible.
When I compile this with the TI compiler I get the following error:
"../module/driver/src/timer.c", line 274: error #18: expected a ")"
"../module/driver/src/timer.c", line 285: warning #12-D: parsing restarts here after previous syntax error
1 error detected in the compilation of "../module/driver/src/timer.c".
gmake: *** [module/driver/src/timer.obj] Error 1
My build is handled by CMake and I can think of one solution: pregenerate those offsets into a header file that is then included in the driver. How to do that is described here. But if possible I'd like to avoid this step as well, since the code also needs to compile in Code Composer Studio, which does not run CMake.
So how do I create a CMake target to pregenerate those offsets? Or any other ideas?

Thanks to everyone, special thanks to @CL. I had been stuck on the thought that this has to be done in assembler for so many reasons, and that I only needed to get those offsets somehow. The solution is simple:
static void _shared_vector_handler(Timer_driver_t *driver) {
uint16_t interrupt_source;
Timer_channel_handle_t *handle;
if ( ! (interrupt_source = hw_register_16(driver->_IV_register))) {
return;
}
handle = *((Timer_channel_handle_t **)
(((uintptr_t)(&driver->_CCR0_handle)) + (interrupt_source * _POINTER_SIZE_ / 2)));
(*handle->_handler)(handle->_handler_param);
}
translated to assembler (TI compiler, memory model large):
_shared_vector_handler():
011ef6: 4C1F 0008 MOV.W 0x0008(R12),R15
011efa: 4F2F MOV.W @R15,R15
011efc: 930F TST.W R15
011efe: 240D JEQ (0x1f1a)
231 (*handle->_handler)(handle->_handler_param);
011f00: F03F 3FFF AND.W #0x3fff,R15
011f04: 025F RLAM.W #1,R15
011f06: 4F0F MOV.W R15,R15
011f08: 00AC 000C ADDA #0x0000c,R12
011f0c: 0FEC ADDA R15,R12
011f0e: 0C0F MOVA @R12,R15
011f10: 0F3C 003E MOVA 0x003e(R15),R12
011f14: 00AF 003A ADDA #0x0003a,R15
011f18: 0F00 BRA @R15
$C$L12:
011f1a: 0110 RETA
The original assembler takes 7 instructions to execute the handler, but the add-IV-to-PC breaks the pipeline. Here we have 13 instructions, so the efficiency is almost equal.
BTW the actual commit is here.
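The index arithmetic in the C version can be tried out on a host machine. What follows is only a minimal sketch, not the driver's actual code: demo_driver_t, demo_handle_t, demo_count_call and the eight-slot handles array are hypothetical stand-ins, and it assumes the IV value is always even (2 for CCR1, 4 for CCR2, and so on), so that iv / 2 is the slot number counted from the CCR0 slot.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical, simplified stand-ins for Timer_channel_handle_t and Timer_driver_t. */
typedef struct demo_handle {
    void (*handler)(void *);
    void *handler_param;
} demo_handle_t;

#define DEMO_SLOT_COUNT 8u /* assumed layout: CCR0, CCR1..CCR6, overflow */

typedef struct demo_driver {
    volatile uint16_t *iv_register;          /* points at a simulated IV register */
    demo_handle_t *handles[DEMO_SLOT_COUNT]; /* consecutive handle pointers */
} demo_driver_t;

/* Example handler for the demonstration: counts how often it was called. */
static void demo_count_call(void *param) {
    ++*(int *)param;
}

static void demo_shared_vector_handler(demo_driver_t *driver) {
    uint16_t interrupt_source = *driver->iv_register;
    if (interrupt_source == 0) {
        return; /* nothing pending */
    }
    size_t slot = interrupt_source / 2u; /* the IV value steps by 2 per source */
    assert(slot < DEMO_SLOT_COUNT);
    demo_handle_t *handle = driver->handles[slot];
    if (handle != NULL && handle->handler != NULL) {
        handle->handler(handle->handler_param);
    }
}
```

With the simulated register reading 4, the sketch calls the handler stored in handles[2], which plays the role of the CCR2 handle.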

For constants that are available in numeric form when the C preprocessor runs, you can use #define macros to stringify and concat them with the inline-asm string, like asm("blah blah " stringify(MYCONST) "\nblah blah");, but that won't work for offsetof, which requires the compiler proper to evaluate it to a number.
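As a small illustration of that preprocessor trick, here is a sketch in which DEMO_OFFSET is a made-up constant and the instruction text is only a placeholder string. It shows how a two-level STR macro pastes the number into the template before the compiler proper ever sees it.

```c
#include <assert.h>
#include <string.h>

/* Two-level stringification: STR(x) expands x first, then quotes the result. */
#define STR_HELPER(x) #x
#define STR(x) STR_HELPER(x)

#define DEMO_OFFSET 12 /* hypothetical constant known to the preprocessor */

/* Adjacent string literals are concatenated, so this is one template string. */
static const char demo_asm_template[] =
    "MOVX.A " STR(DEMO_OFFSET) "(R12),R14\n";
```

The same concatenation works directly inside asm(...). With a single-level #x macro the template would contain the text DEMO_OFFSET instead of 12, which is why the helper macro is there.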
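FIXME: the header-generation method described below won't work as easily when cross-compiling, because the generator program has to run on the build machine, where offsetof may give different values than on the target. You'd have to parse compiler-generated asm, or dump static data from a .o file. Both of those are possible as minor modifications to this method, but are kinda ugly. I'm going to leave this answer here in case it's useful for non-cross-compiling use-cases.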
However, since you tagged this cmake, you have a build system that can handle a chain of dependencies. You could write a program that uses offsetof to create a .h with contents like this, using some simple printf statements
// already stringified to simplify
// if you want them as numeric literals, leave out the double quotes and use a STR() macro
#define OFFSET_Timer_driver_t__CCR1_handle "12"
#define OFFSET_Timer_driver_t__CCR2_handle "16"
...
Then you can #include "struct_offsets.h" in files that need it, and use it for inline asm like
asm("insn " OFFSET_Timer_driver_t__CCR1_handle "\n\t"
"insn blah, blah \n\t"
"insn foo " OFFSET_Timer_driver_t__CCR2_handle "\n\t"
);
Or use pure asm instead of a naked function.
Use CMake build dependencies so that any files that need struct_offsets.h are rebuilt if it changes.
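A generator along those lines might look like the sketch below. Everything named demo_* or DEMO_* is hypothetical, including the struct, which only stands in for the real Timer_driver_t. The sketch also inherits the limitation from the FIXME above: the numbers are only right if the generator is built with the same ABI, options and preprocessor definitions as the code that uses the header.

```c
#include <assert.h>
#include <stddef.h>
#include <stdio.h>
#include <string.h>

/* Hypothetical stand-in for the real driver structure. */
typedef struct demo_driver {
    void *disposable[4];
    unsigned short ctl_register;
    unsigned short iv_register;
    void *ccr1_handle;
    void *ccr2_handle;
} demo_driver_t;

/* Formats one already-stringified offset macro; returns snprintf's result. */
static int demo_format_offset(char *buf, size_t size, const char *type_name,
                              const char *member_name, size_t offset) {
    return snprintf(buf, size, "#define OFFSET_%s_%s \"%zu\"\n",
                    type_name, member_name, offset);
}

/* Wrapper so that the type and member are spelled only once per line. */
#define DEMO_EMIT(out, type, member)                                       \
    do {                                                                   \
        char line_[128];                                                   \
        int len_ = demo_format_offset(line_, sizeof line_, #type, #member, \
                                      offsetof(type, member));             \
        assert(len_ > 0 && (size_t)len_ < sizeof line_);                   \
        fputs(line_, (out));                                               \
    } while (0)

/* Writes all offset macros; a generator's main() would pass the opened header. */
static void demo_write_offsets(FILE *out) {
    DEMO_EMIT(out, demo_driver_t, ccr1_handle);
    DEMO_EMIT(out, demo_driver_t, ccr2_handle);
}
```

A generator's main() would open struct_offsets.h for writing, call demo_write_offsets on it and close the file; the build system then treats that header as a generated dependency.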

Related

C embedded assembly error: ‘asm’ operand has impossible constraints

When I embedded assembly in C, I got the following error when compiling this code with a shell command on Ubuntu Linux 14.04.
IFR_temp_measure.cpp: In function ‘void BlockTempClc(char*, char*,
int, int, char, int, int, int, int*, int, int*, int)’:
IFR_temp_measure.cpp:1843:6: error: ‘asm’ operand has impossible
constraints);
^
&make: *** [IFR_temp_measure.o] Error 1
The error position, lines 1842 and 1843, corresponds to this code:
:"cc", "memory","q0", "q1", "q2", "q3", "q4", "q5", "q6", "q7", "q8", "q10", "q11", "q12", "q13", "q14", "q15","r0", "r1", "r3", "r4", "r5","r6","r8", "r9", "r10", "r12"
);
I have tried to solve this problem, but few references are available online. These are the links I found:
Gcc inline assembly what does "'asm' operand has impossible constraints" mean? and http://www.ethernut.de/en/documents/arm-inline-asm.html
but they did not help.
My code is as follows:
void BlockTempClc(char* src1,char* src2,int StrideDist,int height,char temp_comp1,int numofiterations,int temp_comp2,int temp_comp3,int *dstData,int width,int *dstSum,int step)
{
volatile char array1[16] = {0,0,0,0,0,0,0,0,
0,0,0,0,0,0,0,0};
volatile char array2[16] = {0,0,1,0,2,0,3,0,
4,0,5,0,6,0,7,0};
asm volatile(
"mov r0, %0; " //image[0]
"mov r1, %1; " //image[1]
"mov r12,%11; " //m
"mov r3, %4; " //n
"mov r4, %2; " //store data
"mov r8, %12; " //step down for loading next line of image
"mov r5, %6; " //numofiterations
"mov r6, %3; " //out
"mov.8 r9,%5;"//isp_temp_comp
"mov.8 r10,%7;"//led_temp_comp
"mov.8 r11,%8;"//fac_temp_comp
"vdup.8 d20,r9;"//copy arm register value isp_temp_comp to neon register
"VMOV.S16 q9, d20; " //isp_temp_comp transfer to signed short type
"VLD1.8 {d16,d17}, [%9];"//q8 array1 sum
"VLD1.8 {d6,d7}, [%10];"//q3 array2
"VMOV.S16 q0, #256; "
"VMOV.S16 q1, #2730; " //Assign immediate number 2730 to each 16 bits of d1
".loop:;"
"vdup.8 d21,r10;"//copy arm register value led_temp_comp to neon register
"vdup.8 d22,r11;"//copy arm register value fac_temp_comp to neon register
"VLD1.8 d14, [r1],r8; " // q7 *(image[1] + tmp + n) Load: Load Picture Pixels r6:move step ?
"VLD1.8 d15, [r0],r8 " // *(image[0] + tmp + n) Load: Load Picture Pixels
"PLD [r1]; " //Preload: one line in cache
"PLD [r0]; " //?
"VMOV.S16 q5, d14; " //q5 8*16 transfer to signed short type:*(image[1] + tmp + n)
"VMOV.S16 q6, d15; " //q6 8*16 transfer to signed short type : *(image[0] + tmp + n)
"VADD.S16 q12,q6, q9;"//*(image[0] + tmp + n) + isp_temp_comp
"VMOV.S16 q6, d21; " //led_temp_comp
"VADD.S16 q13,q12, q6;"//*(image[0] + tmp + n) + isp_temp_comp+ + led_temp_comp
"VMOV.S16 q6, d22; " //fac_temp_comp
"VADD.S16 q14,q13, q6;"//*(image[0] + tmp + n) + isp_temp_comp+ + led_temp_comp+ fac_temp_comp
"VSUB.S16 q15,q14, q1;"//*(image[0] + tmp + n) + isp_temp_comp+ + led_temp_comp+ fac_temp_comp-2730
"VMLA.S16 q15, q5, q0;"//img_temp[m][n]=*(image[0] + tmp + n) + isp_temp_comp+ + led_temp_comp+ fac_temp_comp-2730+*(image[1] + tmp + n) *256
"VADD.S16 q2,q15, q8;"//sum
"VMOV.S16 q8, q2; " //q8
"vdup.8 d20,r3;"//n
"vdup.8 d21,r12;"//m
"VMOV.S16 q11, d20; " //n
"VMOV.S16 q10, d21; " //m
"VADD.S16 q4,q3, q11;"//(n,n+1,n+2,n+3,n+4,n+5,n+6,n+7)
"VADD.S16 q7,q3, q10;"//(m,m+1,m+2,m+3,m+4,m+5,m+6,m+7) q7
"VST1.16 {d30[0]}, [r4]!; "//restore img_temp[m][n] to pointer data
"VST1.16 {d14[0]}, [r4]!; "//restore m
"VST1.16 {d8[0]}, [r4]!; " //restore n
"VST1.16 {d30[1]}, [r4]!; "
"VST1.16 {d14[1]}, [r4]!; "
"VST1.16 {d8[1]}, [r4]!; "
"VST1.16 {d30[2]}, [r4]!; "
"VST1.16 {d14[2]}, [r4]!; "
"VST1.16 {d8[2]}, [r4]!; "
"VST1.16 {d30[3]}, [r4]!; "
"VST1.16 {d14[3]}, [r4]!; "
"VST1.16 {d8[3]}, [r4]!; "//response to array
"subs r5, r5, #1; " // decrement: numofinteration -= 1;
"bne .loop; " // Branch If Not Zero; to .loop
"VST1.16 {d4[0]}, [r6]!; "//q2 refer to sum restore the final result to pointer out
"VST1.16 {d4[1]}, [r6]!; "
"VST1.16 {d4[2]}, [r6]!; "
"VST1.16 {d4[3]}, [r6]!; "
"VST1.16 {d5[0]}, [r6]!; "
"VST1.16 {d5[1]}, [r6]!; "
"VST1.16 {d5[2]}, [r6]!; "
"VST1.16 {d5[3]}, [r6]!; "
:"+r"(src1),"+r"(src2),"+r"(dstData),"+r"(dstSum),"+r"(height)
:"r"(temp_comp1),"r"(numofiterations),"r"(temp_comp2),"r"(temp_comp3),
"r"(array1),"r"(array2), "r"(width),"r"(step)
:"cc", "memory","q0", "q1", "q2", "q3", "q4", "q5", "q6", "q7", "q8", "q10", "q11", "q12", "q13", "q14", "q15","r0", "r1", "r3", "r4", "r5","r6","r8", "r9", "r10", "r12"
);
}
I suppose the problem may be in the output operand list or the input operand list.
What causes the error in my code, and how do I solve it?
You declare clobbers on most of the integer registers, but then you ask for 13 different input variables. 32-bit ARM only has 16 registers, and 2 of those are PC and SP leaving only 14 at best really general purpose registers.
We can test that too many clobbers + operands are the problem by removing all the clobbers on r0.. r12; this lets it compile (into incorrect code!!). https://godbolt.org/z/Z6x78N This is not the solution because it introduces huge bugs, it's just how I confirmed that this is the problem.
Any time your inline asm template starts with mov to copy from an input register operand into a hard-coded register, you're usually doing it wrong. Even if you had enough registers, the compiler is going to have to emit code to get the variable into a register, then your hand-written asm uses another mov to copy it for no reason.
See https://stackoverflow.com/tags/inline-assembly/info for more guides.
Instead ask the compiler for the input in that register in the first place with register int foo asm("r0"), or better let the compiler do register allocation by using %0 or the equivalent named operand like %[src1] instead of a hard-coded r0 everywhere inside your asm template. The syntax for naming an operand is [name] "r" (C_var_name). The operand name and the C variable name don't have to match, but they don't have to differ either; it's often convenient to use the same asm operand name as the C var name.
Then you can remove the clobbers on most of the GP registers. You do need to tell the compiler about any input registers you modify, e.g. by using a "+r" constraint instead of "r" (and then not using that C variable after the asm modifies it). Or use an "=r" output constraint and a matching input constraint like "0" (var) to put that input in the same register as output operand 0. "+r" is much easier in a wrapper function where the C variable is not used afterwards anyway.
You can remove the clobbers on vector registers if you use dummy output operands to get the compiler to do register allocation, but it's basically fine if you just leave those hard-coded.
asm( // "mov r0, %[src1]; " // remove this and just use %[src1] instead of r0
"... \n\t"
"VST1.16 {d30[0]}, [%[dstData]]! \n\t" //restore img_temp[m][n] to pointer data
"... \n\t"
: [src1]"+&r"(src1), [src2]"+&r"(src2), [dstData]"+&r"(dstData),
[dstSum]"+&r"(dstSum), [height]"+&r"(height)
: [temp_comp1] "r"(temp_comp1), [niter] "r"(numofiterations),
[temp_comp2] "r"(temp_comp2), [temp_comp3] "r"(temp_comp3),
...
: "memory", "cc", all the q and d regs you use. // but not r0..r13
);
You can look at the compiler's asm output to see how it filled in the %0 and %[name] operands in the asm template you gave it. Use "instruction \n\t" to make this readable, ; puts all the instructions onto the same line in the asm output. (C string-literal concatenation doesn't introduce newlines).
The early-clobber declarations on the read/write operands make sure that none of the input-only operands share a register with them, even if the compiler knows that temp_comp1 == height for example. That matters because the original value of temp_comp1 still needs to be readable from the register %[temp_comp1], even after something has modified %[height]. So they can't both be r4 for example. Otherwise, without the & in "+&r", the compiler could choose that to gain efficiency if outputs are only written after all inputs are read (e.g. when wrapping a single instruction, like GNU C inline asm is designed to do efficiently).
side-note: array1[16] and array2[16] don't need to be volatile; the "memory" clobber on the asm statement is sufficient even though you just pass pointers to them, not use them as "m" input operands.
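To make the named-operand advice concrete, here is a much smaller sketch than the NEON routine above. demo_sum_bytes is a hypothetical helper, not part of the question's code. The ARM branch (assumed to be built as A32 or Thumb-2 with GCC or Clang) lets the compiler pick every register and declares no integer-register clobbers at all; the other branch is a plain C fallback so the file still builds on other targets.

```c
#include <assert.h>
#include <stddef.h>

/* Sums count bytes starting at src (hypothetical helper for illustration). */
static unsigned demo_sum_bytes(const unsigned char *src, unsigned count) {
    unsigned sum = 0;
    assert(src != NULL || count == 0);
    if (count == 0) {
        return 0; /* the asm loop below runs at least once */
    }
#if defined(__GNUC__) && defined(__arm__)
    unsigned tmp; /* scratch register, allocated by the compiler */
    __asm__ volatile(
        "1:                              \n\t"
        "ldrb  %[tmp], [%[src]], #1      \n\t" /* load byte, post-increment pointer */
        "add   %[sum], %[sum], %[tmp]    \n\t"
        "subs  %[count], %[count], #1    \n\t"
        "bne   1b                        \n\t"
        : [src] "+&r"(src), [count] "+&r"(count),
          [sum] "+&r"(sum), [tmp] "=&r"(tmp)
        : /* no input-only operands */
        : "cc", "memory");
#else
    /* Portable fallback with the same result. */
    while (count--) {
        sum += *src++;
    }
#endif
    return sum;
}
```

Every register the asm modifies is a "+&r" or "=&r" operand, so the clobber list only needs "cc" and "memory". The numeric local label 1: with bne 1b also avoids duplicate-label errors if the compiler inlines the function more than once.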

Operands mismatch for mul when inserting asm into c

I'm trying to insert assembly into C code. However, when I try to multiply two registers inside it I get an operand mismatch error. I tried "mul %%bl, %%cl\n" (double %% because it's in C code). From my past experience with asm I think this should work. I also tried "mul %%cl\n" (moving bl to al first), but in this case I get tons of errors from the linker:
zad3:(.rodata+0x4): multiple definition of `len'
/tmp/ccJxYyIp.o:(.rodata+0x0): first defined here
zad3: In function `_fini':
(.fini+0x0): multiple definition of `_fini'
/usr/lib/gcc/x86_64-linux-gnu/5/../../../x86_64-linux-gnu/crti.o:(.fini+0x0): first defined here
zad3: In function `data_start':
(.data+0x0): multiple definition of `__data_start'
/usr/lib/gcc/x86_64-linux-gnu/5/../../../x86_64-linux-gnu/crt1.o:(.data+0x0): first defined here
zad3: In function `data_start':
(.data+0x8): multiple definition of `__dso_handle'
/usr/lib/gcc/x86_64-linux-gnu/5/crtbegin.o:(.data+0x0): first defined here
zad3:(.rodata+0x0): multiple definition of `_IO_stdin_used'
/usr/lib/gcc/x86_64-linux-gnu/5/../../../x86_64-linux-gnu/crt1.o:(.rodata.cst4+0x0): first defined here
zad3: In function `_start':
(.text+0x0): multiple definition of `_start'
/usr/lib/gcc/x86_64-linux-gnu/5/../../../x86_64-linux-gnu/crt1.o: (.text+0x0): first defined here
zad3: In function `data_start':
(.data+0x10): multiple definition of `str'
/tmp/ccJxYyIp.o:(.data+0x0): first defined here
/usr/bin/ld: Warning: size of symbol `str' changed from 4 in /tmp/ccJxYyIp.o to 9 in zad3
zad3: In function `main':
(.text+0xf6): multiple definition of `main'
/tmp/ccJxYyIp.o:zad3.c:(.text+0x0): first defined here
zad3: In function `_init':
(.init+0x0): multiple definition of `_init'
/usr/lib/gcc/x86_64-linux-gnu/5/../../../x86_64-linux-gnu/crti.o:(.init+0x0): first defined here
/usr/lib/gcc/x86_64-linux-gnu/5/crtend.o:(.tm_clone_table+0x0): multiple definition of `__TMC_END__'
zad3:(.data+0x20): first defined here
/usr/bin/ld: error in zad3(.eh_frame); no .eh_frame_hdr table will be created.
collect2: error: ld returned 1 exit status
From what I understand, it tells me I defined len and a few other variables a few times, but I cannot see this multiple definition.
The goal of my program is to take a string of numbers and count sum of them but using 2 as a base. So let's say string is 293, then I want to count 2*2^2+9*2^1+3*2^0
Code:
#include <stdio.h>
char str[] = "543";
const int len = 3;
int main(void)
{
asm(
"mov $0, %%rbx \n"
"mov $1, %%rcx \n"
"potega: \n"
"shl $1, %%cl \n"
"inc %%rbx \n"
"cmp len, %%ebx \n"
"jl potega \n"
"mov $0, %%rbx \n"
"petla: \n"
"mov (%0, %%rbx, 1), %%al \n"
"sub $48, %%al \n"
"mul %%al, %%cl \n"
"shr $1, %%cl \n"
"add $48, %%al \n"
"mov %%al, (%0, %%rbx, 1) \n"
"inc %%rbx \n"
"cmp len, %%ebx \n"
"jl petla \n"
:"r"(&str)
:"%rax", "%rbx", "%rcx"
);
printf("Wynik: %s\n", str);
return 0;
}
While I try to avoid "doing people's homework" for them, you have already solved this and given that it has been over a week, have probably already turned it in.
So, looking at your final solution, there are a few things you might want to consider doing differently. In no particular order:
Comments. While all code needs comments, asm REALLY needs comments. As you'll see from my solution (below), having comments alongside the code really helps clarify what the code does. It might seem like a homework project hardly needs them. But since you posted this here, 89 people have tried to read this code. Comments would have made this easier for all of us. Not to mention that it will make life easier for your 'future self,' when you come back months from now to try to maintain it. Comments. Nuff said.
Zeroing registers. While mov $0, %%rbx will indeed put zero in rbx, this is not the most efficient way to zero a register. Using xor %%rbx, %%rbx is both (microscopically) faster and produces (slightly) smaller executable code.
potega. Without comments, it took me a bit to sort out what you were doing in your first loop. You are using rbx to keep track of how many characters you have processed, and cl gets shifted one to the left for each character. A few thoughts here:
3a. First thing I'd do is look at moving the shl $1, %%cl out of the loop. Instead of doing both increment and shift, just count the characters, then do a single shift of the appropriate size. This is (slightly) complicated by the fact that if you want to shift by a variable amount, the amount must be specified in cl (ie shl %%cl, %%rbx). Why cl? Who knows? That's just how shl works. So you'd want to do the counting in cl instead of rbx.
3b. Second thing about this loop has to do with len1. Since you already know the size (it's in len1), why would you even need a loop? Perhaps a more sensible approach would be:
3c. Strings in C are terminated with a null character (aka 0). If you want to find the length of a string, normally you'd walk the string until you find it. This removes the requirement to even have len1.
3d. Your code assumes that the input string is valid. What would happen if you got passed "abc"? Or ""? Validating parameters is boring, time consuming, and makes the program bigger and run slower. On the other hand, it pays HUGE dividends when something unexpected goes wrong. At the very least you should specify your assumptions about your input.
3e. Using global variables is usually a bad idea. You run into naming collisions (2 files both using the name len1), code in several different files all changing the value (making bugs difficult to track down) and it can make your program bigger than it needs to be. There are times when globals are useful, but this does not appear to be one of them. The only purpose here seems to be to allow access to these variables from within the asm, and there are other ways to do that.
3f. You use %0 to refer to str. That works (and is better than accessing the global symbol directly), but it is harder to read than it needs to be. You can associate a name with the parameter and use that instead.
Let's take a break for a moment to see what we've got so far:
"xor %%rcx, %%rcx\n" // Zero the strlen count
// Count how many characters in string
"potega%=: \n\t"
"mov (%[pstr], %%rcx), %%bl\n\t" // Read the next char
"test %%bl, %%bl \n\t" // Check for 0 at end of string
"jz kont%= \n\t"
"cmp $'0', %%bl\n\t" // Ensure digit is 0-9
"jl gotowe%=\n\t"
"cmp $'9', %%bl\n\t"
"jg gotowe%=\n\t"
"inc %%rcx \n\t" // Increment index/len
"jmp potega%= \n"
"kont%=:\n\t"
// rcx = the number of characters in the string excluding null
You'll notice that I'm using %= at the end of all the labels. You can read about what this does in the gcc docs, but mostly it just appends a number to the labels. Why do that? Well, if you wanted to try computing multiple strings in a single run (like I do below), you might call this code several times. But compilers (being the tricky devils that they are) might choose to "inline" your assembler. That would mean you'd have several chunks of code that all had the same label names in the same routine. Which would cause your compile to fail.
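Here is a small self-contained sketch of that situation. The names demo_count_leading_digits and demo_count_both are made up for the illustration, and the asm branch assumes GCC or Clang targeting x86-64 with the default AT&T syntax (other targets take the plain C branch). The helper is called twice from one function, so if the compiler inlines both calls the labels only stay unique because of %=.

```c
#include <assert.h>
#include <stddef.h>

/* Counts the decimal digits at the start of str (hypothetical helper). */
static inline size_t demo_count_leading_digits(const char *str) {
    size_t n = 0;
    assert(str != NULL);
#if defined(__GNUC__) && defined(__x86_64__)
    unsigned int ch; /* scratch register chosen by the compiler */
    __asm__(
        "scan%=:                          \n\t"
        "movzbl (%[str],%[n]), %[ch]      \n\t" /* load next byte, zero-extended */
        "subl   $48, %[ch]                \n\t" /* '0' is 48 in ASCII */
        "cmpl   $9, %[ch]                 \n\t"
        "ja     done%=                    \n\t" /* not a digit (or NUL): stop */
        "incq   %[n]                      \n\t"
        "jmp    scan%=                    \n"
        "done%=:"
        : [n] "+&r"(n), [ch] "=&r"(ch)
        : [str] "r"(str)
        : "cc", "memory");
#else
    /* Portable fallback with the same result. */
    while (str[n] >= '0' && str[n] <= '9') {
        ++n;
    }
#endif
    return n;
}

/* Two uses in one function: after inlining, each asm copy gets its own labels. */
static size_t demo_count_both(const char *a, const char *b) {
    return demo_count_leading_digits(a) + demo_count_leading_digits(b);
}
```

Replacing scan%= and done%= with fixed names would typically still assemble at -O0, but can fail with duplicate-symbol errors once optimization inlines both calls.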
Note that I don't check to see if the string is "too long" or NULL. Left as an exercise for the student...
Ok, what else?
petla. Mostly my code matches yours.
4a. I did change to sub $'0', %%al instead of just using $48. It does the same thing, but subtracting '0' seems to me to be more "self-documenting."
4b. I also slightly reordered things to put the shr at the end. Why do that? You use cmp along with jz to see when it's time to exit the loop. The way cmp works is that it sets some flags in the flags register, then jz looks at those flags to figure out whether to jump or not. However shr sets those flags too. Each time you shift, you are moving that '1' further and further to the right. What happens when it's at the rightmost position and you shift it 1 more? You get zero. At which point the "jump if not zero" (aka jnz) works as expected. Since you have to do the shr anyway, why not use it to tell you when to exit the loop too?
That gives me:
"petla%=:\n\t"
"mov (%[pstr], %%rcx, 1), %%al\n\t" // read the next char
"sub $'0', %%al\n\t" // convert char to value
"mul %%bl\n\t" // mul bl * al -> ax
"add %%ax, %[res]\n\t" // Accumulate result
"inc %%rcx\n\t" // move to next char
"shr $1, %%rbx\n\t" // decrease our exponent
"jnz petla%=\n" // Has our exponent gone to 0?
"gotowe%=:"
Lastly, the parameters:
:[res] "=r"(result)
:[pstr] "r"(str), "0"(0)
:"%rax", "%rbx", "%rcx", "cc"
I'm going to store the result in the C variable named result. Since I specify =r with this constraint, I know that it is stored in a register, although I don't know which register the compiler will pick. But I don't need to. I can just refer to it using %[res] and let the compiler sort it out. Likewise I refer to the string using %[pstr]. I could use %0 like you did, except that since I've added result, pstr isn't %0 anymore, it's %1 (result is now %0). This is another reason to use names instead of numbers.
That last bit ("0"(0)) might take a bit of explaining. Using "0" for the constraint (instead of say "r") tells the compiler to put this value into the same place as parameter #0. The (0) says store a zero there before starting the asm. In other words, initialize the register that is going to hold result to 0. Yes, I could do this in the asm. But I prefer to let the compiler do this for me. While it may not matter in a tiny program like this, letting the C compiler do as much work as possible tends to produce the most efficient code.
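The matching constraint can be tried in isolation with a sketch like the following. demo_sum3 is a hypothetical function, and the asm branch assumes GCC or Clang on x86-64 with AT&T syntax (other targets use the plain C line). One deliberate difference from the code above: the output is marked "=&r" (early-clobber), because here the result register is written while b and c still have to be read.

```c
#include <assert.h>

/* Adds three ints, letting the compiler zero the accumulator via "0"(0). */
static int demo_sum3(int a, int b, int c) {
    int result;
#if defined(__GNUC__) && defined(__x86_64__)
    __asm__(
        "addl %[a], %[res]\n\t"
        "addl %[b], %[res]\n\t"
        "addl %[c], %[res]"
        : [res] "=&r"(result)          /* operand 0: output register */
        : [a] "r"(a), [b] "r"(b), [c] "r"(c),
          "0"(0)                       /* same register as operand 0, preloaded with 0 */
        : "cc");
#else
    /* Portable fallback with the same result. */
    result = a + b + c;
#endif
    return result;
}
```

An alternative with the same effect would be to initialise result to 0 in C and pass it as a "+r" operand; which one reads better is mostly a matter of taste.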
So, when we wrap this all together, I get:
/*
my_file.c - The goal of this program is to take a string of numbers and
count sum of them but using 2 as a base.
example: "543" -> 5*(2^2)+4*(2^1)+3*(2^0)=31
*/
#include <stdio.h>
void TestOne(const char *str)
{
short result;
// Code assumes str is not NULL. Strings with non-digits and zero
// length strings return 0.
asm(
"xor %%rcx, %%rcx\n" // Zero the strlen count
// Count how many characters in string
"potega%=: \n\t"
"mov (%[pstr], %%rcx), %%bl\n\t" // Read the next char
"test %%bl, %%bl \n\t" // Check for 0 at end of string
"jz kont%= \n\t"
"cmp $'0', %%bl\n\t" // Ensure digit is 0-9
"jl gotowe%=\n\t"
"cmp $'9', %%bl\n\t"
"jg gotowe%=\n\t"
"inc %%rcx \n\t" // Increment index/len
"jmp potega%= \n"
"kont%=:\n\t"
// rcx = the number of characters in the string excluding null
"dec %%rcx \n\t" // We want to shift rbx 1 less than pstr length
"jl gotowe%=\n\t" // Check for zero length string
"mov $1, %%rbx\n\t" // Set exponent for first digit
"shl %%cl, %%rbx\n\t"
"xor %%rcx, %%rcx\n" // Reset string index
"petla%=:\n\t"
"mov (%[pstr], %%rcx, 1), %%al\n\t" // read the next char
"sub $'0', %%al\n\t" // convert char to value
"mul %%bl\n\t" // mul bl * al -> ax
"add %%ax, %[res]\n\t" // Accumulate result
"inc %%rcx\n\t" // move to next char
"shr $1, %%rbx\n\t" // decrease our exponent
"jnz petla%=\n" // Has our exponent gone to 0?
"gotowe%=:"
:[res] "=r"(result)
:[pstr] "r"(str), "0"(0)
:"%rax", "%rbx", "%rcx", "cc"
);
printf("Wynik: \"%s\" = %d\n", str, result);
}
int main(){
TestOne("x");
TestOne("");
TestOne("5");
TestOne("54");
TestOne("543");
TestOne("5432");
return 0;
}
Notice: No global variables. And no len1. Just a pointer to the string.
It might be interesting to experiment and see how long a string you can support. Using mul %%bl, add %%ax and short result works for tiny strings like these, but will eventually be insufficient as the strings get longer (requiring eax or rax etc). I'll leave that for you too. Warning: There's a trick when moving 'up' from mul %%bl to mul %%bx.
One last point about how letting the compiler do as much work as possible tends to produce the most efficient code: Sometimes people assume that since they are writing assembler, this will result in faster code than if they write it in C. However, these people fail to take into account the fact that the entire purpose of a C compiler is to turn your C code into assembler. When you turn on optimization (-O2), the compiler is almost certainly going to turn your (well-written) C code into better assembler code than anything you can write by hand.
There are thousands of tweaks and tricks like the ones I've mentioned here. And the people who write compilers know them all. While there are a few places where inline asm can make sense, smart programmers leave this work to the lunatics who write compilers whenever possible. See also this.
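For comparison, this is roughly what the same computation looks like in plain C. It is only a sketch: demo_base2_digit_sum is a made-up name, and it mirrors the behaviour of TestOne above under the assumption that any non-digit makes the whole string count as 0. Instead of computing the highest power of two first, it uses Horner's rule, doubling the running result before adding each digit.

```c
#include <assert.h>
#include <stddef.h>

/* "543" -> 5*(2^2) + 4*(2^1) + 3*(2^0) = 31; invalid or empty input gives 0. */
static int demo_base2_digit_sum(const char *str) {
    int result = 0;
    assert(str != NULL);
    for (const char *p = str; *p != '\0'; ++p) {
        if (*p < '0' || *p > '9') {
            return 0; /* reject the whole string, like the asm version */
        }
        result = result * 2 + (*p - '0'); /* Horner's rule in base 2 */
    }
    return result;
}
```

There are no hard-coded registers, no clobber lists and no labels to keep unique, and an optimizing compiler is free to choose the instructions. Like the asm version it will overflow for long enough strings, so the result type would need widening for those.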
I realize this is just a school project and you are only doing what your teacher requires, but since she has elected to use the most difficult way possible to teach you asm, perhaps she failed to mention that the thing you are doing is something you should (almost) never do in real life.
This post turned out longer than I expected. Hopefully there is information here that you can use. And forgive my attempts at Polish labels. Hopefully I haven't said anything obscene...
As somebody pointed out - yes it's a student exercise.
As for my original problem: when I removed the line add $48, %%al it worked. I also switched to mul %%cl.
As for the rest of the problems you pointed out, I talked with my professor and she slightly changed her mind (or I got the assignment wrong the first time, whichever you find more plausible) and now she wants me to return a value from the inline asm, and said the integer type was fine. It resulted in me writing this piece of code (which actually does what I wanted):
example: "543" -> 5*(2^2)+4*(2^1)+3*(2^0)=31
#include <stdio.h>
char str[] = "543";
const int len = 3;
int len1 = 2;
int result;
int main(){
asm(
"mov $0, %%rbx\n"
"mov $1, %%rcx\n"
"mov $0, %%rdx\n"
"potega: \n"
"inc %%rbx\n"
"shl $1, %%cl\n"
"cmp len1, %%ebx \n"
"jl potega\n"
"mov $0, %%rbx\n"
"petla:\n"
"mov (%0, %%rbx, 1), %%al\n"
"sub $48, %%al\n"
"mul %%cl\n"
"shr $1, %%cl\n"
"add %%al, %%dl\n"
"inc %%rbx\n"
"cmp len, %%ebx\n"
"jl petla\n"
"movl %%edx, result\n"
://"=r"(result)
:"r"(&str), "r"(&result)
:"%rax", "%rbx", "%rcx", "%rdx"
);
printf("Wynik: %d\n", result);
return 0;
}
Also, I do realise that normally you return variables the way it's shown in the comment, but it didn't work, so at my professor's suggestion I wrote the program this way.
Thanks everybody for help!

Assembly loop through a string to count characters

I'm trying to write assembly code that counts how many characters are in a string, but I get an error.
My code (I use gcc and intel_syntax):
#include <stdio.h>
int main(){
char *s = "aqr b qabxx xryc pqr";
int x;
asm volatile (
".intel_syntax noprefix;"
"mov eax, %1;"
"xor ebx,ebx;"
"loop:"
"mov al,[eax];"
"or al, al;"
"jz print;"
"inc ebx;"
"jmp loop"
"print:"
"mov %0, ebx;"
".att_syntax prefix;"
: "=r" (x)
: "r" (s)
: "eax", "ebx"
);
printf("Length of string: %d\n", x);
return 0;
}
And I get this error:
Error: invalid use of register
Ultimately I want to write a program that searches for the regex pattern ([pq][^a]+a) and prints its start position and length. I wrote it in C, but I have to make it work in assembly:
My C code:
#include <stdio.h>
#include <string.h>
int main(){
char *s = "aqr b qabxx xryc pqr";
int y=0,i;
int x=-1,length=0, pos = 0;
int len = strlen(s);
for(i=0; i<len;i++){
if((s[i] == 'p' || s[i] == 'q') && length<=0){
pos = i;
length++;
continue;
} else if((s[i] != 'a') && pos>0){
length++;
} else if((s[i] == 'a') && pos>0){
length++;
if(y < length) {
y=length;
length = 0;
x = pos;
pos = 0;
}
else
length = 0;
pos = 0;
}
}
printf("position: %d, length: %d", x, y);
return 0;
}
You omitted the semicolon after jmp loop and print:.
Also your asm isn't going to work correctly. You move the pointer to s into eax, but then you overwrite it with mov al,[eax]. So the next pass thru the loop, eax doesn't point to the string anymore.
And when you fix that, you need to think about the fact that each pass thru the loop needs to change eax to point to the next character, otherwise mov al,[eax] keeps reading the same character.
Since you haven't accepted an answer yet (by clicking the checkmark to the left), there's still time for one more edit.
Normally I don't "do people's homework", but it's been a few days. Presumably the due date for the assignment has passed. Such being the case, here are a few solutions, both for the education of the OP and for future SO users:
1) Following the (somewhat odd) limitations of the assignment:
asm volatile (
".intel_syntax noprefix;"
"mov eax, %1;"
"xor ebx,ebx;"
"cmp byte ptr[eax], 0;"
"jz print;"
"loop:"
"inc ebx;"
"inc eax;"
"cmp byte ptr[eax], 0;"
"jnz loop;"
"print:"
"mov %0, ebx;"
".att_syntax prefix;"
: "=r" (x)
: "r" (s)
: "eax", "ebx"
);
2) Violating some of the assignment rules to make slightly better code:
asm (
"\n.intel_syntax noprefix\n\t"
"mov eax, %1\n\t"
"xor %0,%0\n\t"
"cmp byte ptr[eax], 0\n\t"
"jz print\n"
"loop:\n\t"
"inc %0\n\t"
"inc eax\n\t"
"cmp byte ptr[eax], 0\n\t"
"jnz loop\n"
"print:\n"
".att_syntax prefix"
: "=r" (x)
: "r" (s)
: "eax", "cc", "memory"
);
This uses 1 fewer register (no ebx) and omits the (unnecessary) volatile qualifier. It also adds the "cc" clobber to indicate that the code modifies the flags, and uses the "memory" clobber to ensure that any 'pending' writes to s get flushed to memory before executing the asm. It also uses formatting (\n\t) so the output from building with -S is readable.
3) Advanced version which uses even fewer registers (no eax), checks to ensure that s is not NULL (returns -1), uses symbolic names and assumes -masm=intel which results in more readable code:
__asm__ (
"test %[string], %[string]\n\t"
"jz print\n"
"loop:\n\t"
"inc %[length]\n\t"
"cmp byte ptr[%[string] + %[length]], 0\n\t"
"jnz loop\n"
"print:"
: [length] "=r" (x)
: [string] "r" (s), "[length]" (-1)
: "cc", "memory"
);
Getting rid of the (arbitrary and not well thought out) assignment constraints allows us to reduce this to 7 lines (5 if we don't check for NULL, 3 if we don't count labels [which aren't actually instructions]).
There are ways to improve this even further (using %= on the labels to avoid possible duplicate symbol issues, using local labels (.L), even writing it so it works for both -masm=intel and -masm=att, etc.), but I daresay that any of these 3 are better than the code in the original question.
Well Kuba, I'm not sure what more you are after here before you'll accept an answer. Still, it does give me the chance to include Peter's version.
4) Pointer increment:
__asm__ (
"cmp byte ptr[%[string]], 0\n\t"
"jz .Lprint%=\n"
".Loop%=:\n\t"
"inc %[length]\n\t"
"cmp byte ptr[%[length]], 0\n\t"
"jnz .Loop%=\n"
".Lprint%=:\n\t"
"sub %[length], %[string]"
: [length] "=&r" (x)
: [string] "r" (s), "[length]" (s)
: "cc", "memory"
);
This does not do the 'NULL pointer' check from #3, but it does do the 'pointer increment' that Peter was recommending. It also avoids potential duplicate symbols (using %=), and uses 'local' labels (ones that start with .L) to avoid extra symbols getting written to the object file.
From a "performance" point of view, this might be slightly better (I haven't timed it). However from a "school project" point of view, the clarity of #3 seems like it would be a better choice. From a "what would I write in the real world if for some bizarre reason I HAD to write this in asm instead of just using a standard c function" point of view, I'd probably look at usage, and unless this was performance critical, I'd be tempted to go with #3 in order to ease future maintenance.
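One caveat on all four versions: they load the pointer into a 32-bit register, which only works when pointers are 32 bits wide. As a hedged sketch of the same pointer-increment idea that lets the compiler pick a register of the right width, the following assumes GCC or Clang on x86 with the default AT&T syntax (so no -masm=intel); the name asm_strlen is made up for illustration, and other targets fall back to plain C.

```c
#include <stddef.h>

/* Hypothetical helper: counts the characters before the terminating NUL.
   s must not be NULL. */
size_t asm_strlen(const char *s)
{
    const char *p = s;
#if defined(__GNUC__) && (defined(__i386__) || defined(__x86_64__))
    __asm__ (
        "cmpb $0, (%[p])\n\t"   /* empty string? */
        "je .Ldone%=\n"
        ".Lloop%=:\n\t"
        "inc %[p]\n\t"          /* step to the next character */
        "cmpb $0, (%[p])\n\t"
        "jne .Lloop%=\n"
        ".Ldone%=:"
        : [p] "+r"(p)           /* read and written, so "+r" */
        :
        : "cc", "memory");
#else
    /* Portable fallback (assumption: used on any non-x86 target). */
    while (*p != '\0')
        p++;
#endif
    return (size_t)(p - s);     /* the subtraction is left to the compiler */
}
```

Leaving the final subtraction in C keeps the template down to the loop itself and avoids mixing operand sizes inside the asm.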

Inline assembly: clarification of constraint modifiers

Two questions:
(1) If I understand ARM inline assembly correctly, a constraint of "r" says that the instruction operand can only be a core register and that by default is a read-only operand. However, I've noticed that if the same instruction has an output operand with the constraint "=r", the compiler may re-use the same register. This seems to violate the "read-only" attribute. So my question is: Does "read-only" refer to the register, or to the C variable that it is connected to?
(2) Is it correct to say that presence of "&" in the constraint of "=&r" simply requires that the register chosen for the output operand must not be the same as one of the input operand registers? My question relates to the code below used to compute the integer power function: i.e., are the "&" constraint modifiers necessary/appropriate?
asm (
" MOV %[power],1 \n\t"
"loop%=: \n\t"
" CBZ %[exp],done%= \n\t"
" LSRS %[exp],%[exp],1 \n\t"
" IT CS \n\t"
" MULCS %[power],%[power],%[base] \n\t"
" MUL %[base],%[base],%[base] \n\t"
" B loop%= \n\t"
"done%=: "
: [power] "+&r" (power),
[base] "+&r" (base),
[exp] "+&r" (exp)
:
: "cc"
) ;
Thanks!
Dan
Read-only refers to the use of the operand in assembly code. The assembly code can only read from the operand, and it must do so before any normal output operand (not an early clobber or a read/write operand) is written. This is because, as you've seen, the same register can be allocated to both an input and output operand. The assumption is that inputs are fully consumed before any output is written, which is normally the case for an assembly instruction.
I don't think using an early-clobber modifier & with a read/write modifier + has any effect, since a register allocated to a read/write operand can't be used for anything else.
Here's how I'd write your code:
unsigned power = 1;
asm (
" CBZ %[exp],done%= \n\t"
"loop%=: \n\t"
" LSRS %[exp],%[exp],1 \n\t"
" IT CS \n\t"
" MULCS %[power],%[power],%[base] \n\t"
" MUL %[base],%[base],%[base] \n\t"
" BNE loop%= \n\t"
"done%=: "
: [power] "+r" (power),
[base] "+r" (base),
[exp] "+r" (exp)
:
: "cc"
) ;
Note the transformation of putting the loop test at the end of the loop, saving one instruction. Without it the code doesn't have any obvious improvement over what the compiler can generate. I also let the compiler do the initialization of the register used for the power operand. There's a small chance it will be able to allocate a register that already has the value 1 in it.
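To make the control flow of the rewritten loop easier to follow, here is a plain C rendering of the same square-and-multiply algorithm. It is only an illustration: ipow is a made-up name, and the comments map each statement to the instruction it stands for rather than claiming this is what a compiler would emit.

```c
/* Hypothetical helper: computes base raised to exp, with unsigned wraparound. */
unsigned ipow(unsigned base, unsigned exp)
{
    unsigned power = 1;             /* initialised in C, as in the rewrite */

    if (exp != 0) {                 /* CBZ  exp, done                     */
        do {
            unsigned carry = exp & 1u;
            exp >>= 1;              /* LSRS exp, exp, 1 (sets C and Z)    */
            if (carry)              /* IT CS                              */
                power *= base;      /* MULCS power, power, base           */
            base *= base;           /* MUL  base, base, base              */
        } while (exp != 0);         /* BNE  loop (Z flag from LSRS)       */
    }
    return power;
}
```

The zero test runs once before the loop and then only at the bottom, which is what removes the unconditional branch from each iteration.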
Thanks to all of you for the clarification. Just to be sure that I have it right, would it be correct to say that the choice between "=r" and "+r" for an output operand comes down to how the corresponding register is first used in the assembly template? I.e.,
"=r": The first use of the register is as a write-only output of an instruction.
The register may be re-used later by another instruction as an input or output. Adding an early clobber constraint (e.g., "=&r") prevents the compiler from assigning a register that was previously used as an input operand.
"+r": The first use of the register is as an input to an instruction, but the register is used again later as an output.
Best,
Dan

Using FPU with C inline assembly

I wrote a vector structure like this:
struct vector {
float x1, x2, x3, x4;
};
Then I created a function which does some operations with inline assembly using the vector:
struct vector *adding(const struct vector v1[], const struct vector v2[], int size) {
struct vector vec[size];
int i;
for(i = 0; i < size; i++) {
asm(
"FLDL %4 \n" //v1.x1
"FADDL %8 \n" //v2.x1
"FSTL %0 \n"
"FLDL %5 \n" //v1.x2
"FADDL %9 \n" //v2.x2
"FSTL %1 \n"
"FLDL %6 \n" //v1.x3
"FADDL %10 \n" //v2.x3
"FSTL %2 \n"
"FLDL %7 \n" //v1.x4
"FADDL %11 \n" //v2.x4
"FSTL %3 \n"
:"=m"(vec[i].x1), "=m"(vec[i].x2), "=m"(vec[i].x3), "=m"(vec[i].x4) //wyjscie
:"g"(&v1[i].x1), "g"(&v1[i].x2), "g"(&v1[i].x3), "g"(&v1[i].x4), "g"(&v2[i].x1), "g"(&v2[i].x2), "g"(&v2[i].x3), "g"(&v2[i].x4) //wejscie
:
);
}
return vec;
}
Everything looks OK, but when I try to compile this with GCC I get these errors:
Error: Operand type mismatch for 'fadd'
Error: Invalid instruction suffix for 'fld'
On OS X in Xcode everything works correctly. What is wrong with this code?
Coding Issues
I'm not looking at making this efficient (I'd be using SSE/SIMD if the processor supports it). Since this part of the assignment is to use the FPU stack, here are some concerns I have:
Your function declares a local stack based variable:
struct vector vec[size];
The problem is that your function returns a vector * and you do this:
return vec;
This is very bad. The stack based variable could get clobbered after the function returns and before the data gets consumed by the caller. One alternative is to allocate memory on the heap rather than the stack. You can replace struct vector vec[size]; with:
struct vector *vec = malloc(sizeof(struct vector)*size);
This allocates enough space for an array of size vectors (malloc and free are declared in <stdlib.h>). Whoever calls your function has to use free to deallocate the memory from the heap when finished.
Your vector structure uses float, not double. The instructions FLDL, FADDL, FSTL all operate on double (64-bit floats). Each of these instructions will load and store 64-bits when used with a memory operand. This would lead to the wrong values being loaded/stored to/from the FPU stack. You should be using FLDS, FADDS, FSTS to operate on 32-bit floats.
In the assembler templates you use the g constraint on the inputs. This means the compiler is free to use any general purpose registers, a memory operand, or an immediate value. FLDS, FADDS, FSTS do not take immediate values or general purpose registers (non-FPU registers) so if the compiler attempts to do so it will likely produce errors similar to Error: Operand type mismatch for xxxx.
Since these instructions understand a memory reference use m instead of g constraint. You will need to remove the & (ampersands) from the input operands since m implies that it will be dealing with the memory address of a variable/C expression.
You don't pop the values off the FPU stack when finished. FST with a single operand copies the value at the top of the stack to the destination; the value on the stack remains. You should store it and pop it off with an FSTP instruction. You want the FPU stack to be empty when your assembler template ends. The FPU stack is very limited, with only 8 slots available. If the FPU stack is not clear when the template completes, you run the risk of the FPU stack overflowing on subsequent executions. Since your template leaves 4 values on the stack every time it runs, its third execution should fail.
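The points above (32-bit suffixes, m constraints, and popping with FSTP) can be sketched for a single pair of floats as follows. This is a minimal illustration, assuming GCC or Clang on x86 with AT&T syntax; fpu_add_mem is a made-up name and other targets fall back to plain C.

```c
/* Hypothetical helper: adds two floats on the x87 stack via memory operands. */
float fpu_add_mem(float a, float b)
{
    float sum;
#if defined(__GNUC__) && (defined(__i386__) || defined(__x86_64__))
    __asm__ (
        "flds  %1\n\t"   /* push a        -> stack depth 1 */
        "fadds %2\n\t"   /* st(0) += b    -> stack depth 1 */
        "fstps %0"       /* store and pop -> stack depth 0 */
        : "=m"(sum)
        : "m"(a), "m"(b));
#else
    /* Portable fallback (assumption: used on any non-x86 target). */
    sum = a + b;
#endif
    return sum;
}
```

Because the template pops everything it pushes, it can run any number of times without filling the 8 stack slots.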
To simplify the code a bit I'd recommend using a typedef to define vector. Define your structure this way:
typedef struct {
float x1, x2, x3, x4;
} vector;
All references to struct vector can simply become vector.
With all of these things in mind your code could look something like this:
#include <stdlib.h> /* malloc */

typedef struct {
float x1, x2, x3, x4;
} vector;
vector *adding(const vector v1[], const vector v2[], int size) {
vector *vec = malloc(sizeof(vector)*size);
int i;
for(i = 0; i < size; i++) {
__asm__(
"FLDS %4 \n" //v1.x1
"FADDS %8 \n" //v2.x1
"FSTPS %0 \n"
"FLDS %5 \n" //v1.x2
"FADDS %9 \n" //v2.x2
"FSTPS %1 \n"
"FLDS %6 \n" //v1.x3
"FADDS %10 \n" //v2.x3
"FSTPS %2 \n"
"FLDS %7 \n" //v1.x4
"FADDS %11 \n" //v2.x4
"FSTPS %3 \n"
:"=m"(vec[i].x1), "=m"(vec[i].x2), "=m"(vec[i].x3), "=m"(vec[i].x4)
:"m"(v1[i].x1), "m"(v1[i].x2), "m"(v1[i].x3), "m"(v1[i].x4),
"m"(v2[i].x1), "m"(v2[i].x2), "m"(v2[i].x3), "m"(v2[i].x4)
:
);
}
return vec;
}
Alternative Solutions
I don't know the parameters of the assignment, but if it were to make you use GCC extended assembler templates to manually do an operation on the vector with an FPU instruction then you could define the vector with an array of 4 float. Use a nested loop to process each element of the vector independently passing each through to the assembler template to be added together.
Define the vector as:
typedef struct {
float x[4];
} vector;
The function as:
vector *adding(const vector v1[], const vector v2[], int size) {
int i, e;
vector *vec = malloc(sizeof(vector)*size);
for(i = 0; i < size; i++)
for (e = 0; e < 4; e++) {
__asm__(
"FADDP\n"
:"=t"(vec[i].x[e])
:"0"(v1[i].x[e]), "u"(v2[i].x[e])
);
}
return vec;
}
This uses the i386 machine constraints t and u on the operands. Rather than passing a memory address we allow GCC to pass them via the top two slots on the FPU stack. t and u are defined as:
t
Top of 80387 floating-point stack (%st(0)).
u
Second from top of 80387 floating-point stack (%st(1)).
The no operand form of FADDP does this:
Add ST(0) to ST(1), store result in ST(1), and pop the register stack
We pass the two values to add at the top of the stack and perform an operation leaving ONLY the result in ST(0). We can then get the assembler template to copy the value on the top of the stack and pop it off automatically for us.
We can use an output operand of =t to specify that the value we want moved comes from the top of the FPU stack. =t will also pop (if needed) the value off the top of the FPU stack for us. We can use the top of the stack as an input value too! If the output operand is %0 we can reference it as an input operand with the constraint 0 (which means use the same constraint as operand 0). The second vector value will use the u constraint so it is passed as the second FPU stack element (ST(1)).
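As a standalone illustration of the t, 0 and u constraints, here is a sketch for one pair of floats. It assumes GCC or Clang on x86; fpu_add_stack is a made-up name and other targets fall back to plain C. One deliberate difference from the template above: GCC's documentation on x87 operands says an input register that the asm pops implicitly must be listed as a clobber unless it is tied to an output, so this sketch declares "st(1)" as clobbered.

```c
/* Hypothetical helper: adds two floats passed in ST(0) and ST(1). */
float fpu_add_stack(float a, float b)
{
    float sum;
#if defined(__GNUC__) && (defined(__i386__) || defined(__x86_64__))
    __asm__ (
        "faddp"          /* ST(1) += ST(0), then pop: result is the new ST(0) */
        : "=t"(sum)      /* output is read from the top of the stack */
        : "0"(a),        /* a is loaded into ST(0), tied to the output */
          "u"(b)         /* b is loaded into ST(1) */
        : "st(1)");      /* the slot holding b is consumed by the pop */
#else
    /* Portable fallback (assumption: used on any non-x86 target). */
    sum = a + b;
#endif
    return sum;
}
```

Here the compiler generates all the loads and the final store; the template contributes only the single FADDP.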
A slight improvement that could potentially allow GCC to optimize the code it generates would be to use the % modifier on the first input operand. The % modifier is documented as:
Declares the instruction to be commutative for this operand and the following operand. This means that the compiler may interchange the two operands if that is the cheapest way to make all operands fit the constraints. ‘%’ applies to all alternatives and must appear as the first character in the constraint. Only read-only operands can use ‘%’.
Because x+y and y+x yield the same result we can tell the compiler that it can swap the operand marked with % with the one defined immediately after in the template. "0"(v1[i].x[e]) could be changed to "%0"(v1[i].x[e])
Disadvantages: We've reduced the code in the assembler template to a single instruction, and we've used the template to do most of the work of setting things up and tearing them down. The problem is that if the vectors live in memory, we transfer values between FPU registers and memory and back more times than we would like. The code generated may not be very efficient, as we can see in this Godbolt output.
We can force memory usage by applying the idea in your original code to the template. This code may yield more reasonable results:
vector *adding(const vector v1[], const vector v2[], int size) {
int i, e;
vector *vec = malloc(sizeof(vector)*size);
for(i = 0; i < size; i++)
for (e = 0; e < 4; e++) {
__asm__(
"FADDS %2\n"
:"=&t"(vec[i].x[e])
:"0"(v1[i].x[e]), "m"(v2[i].x[e])
);
}
return vec;
}
Note: I've removed the % modifier in this case. In theory it should work, but GCC seems to emit less efficient code (CLANG seems okay) when targeting x86-64. I'm unsure if it is a bug; whether my understanding is lacking in how this operator should work; or there is an optimization being done I don't understand. Until I look at it closer I am leaving it off to generate the code I would expect to see.
In the last example we are forcing the FADDS instruction to operate on a memory operand. The Godbolt output is considerably cleaner, with the loop itself looking like:
.L3:
flds (%rdi) # MEM[base: _51, offset: 0B]
addq $16, %rdi #, ivtmp.6
addq $16, %rcx #, ivtmp.8
FADDS (%rsi) # _31->x
fstps -16(%rcx) # _28->x
addq $16, %rsi #, ivtmp.9
flds -12(%rdi) # MEM[base: _51, offset: 4B]
FADDS -12(%rsi) # _31->x
fstps -12(%rcx) # _28->x
flds -8(%rdi) # MEM[base: _51, offset: 8B]
FADDS -8(%rsi) # _31->x
fstps -8(%rcx) # _28->x
flds -4(%rdi) # MEM[base: _51, offset: 12B]
FADDS -4(%rsi) # _31->x
fstps -4(%rcx) # _28->x
cmpq %rdi, %rdx # ivtmp.6, D.2922
jne .L3 #,
In this final example GCC unrolled the inner loop and only the outer loop remains. The code generated by the compiler is similar in nature to what was produced by hand in the original question's assembler template.

Resources