When looping over an array with inline assembly, should I use the register modifier "r" or the memory modifier "m"?
Let's consider an example which adds two float arrays x and y and writes the result to z. Normally I would use intrinsics, like this:
for(int i=0; i<n/4; i++) {
    __m128 x4 = _mm_load_ps(&x[4*i]);
    __m128 y4 = _mm_load_ps(&y[4*i]);
    __m128 s = _mm_add_ps(x4,y4);
    _mm_store_ps(&z[4*i], s);
}
Here is the inline assembly solution I have come up with using the register modifier "r"
void add_asm1(float *x, float *y, float *z, unsigned n) {
    for(int i=0; i<n; i+=4) {
        __asm__ __volatile__ (
            "movaps (%1,%%rax,4), %%xmm0\n"
            "addps (%2,%%rax,4), %%xmm0\n"
            "movaps %%xmm0, (%0,%%rax,4)\n"
            :
            : "r" (z), "r" (y), "r" (x), "a" (i)
            :
        );
    }
}
This generates similar assembly to GCC. The main difference is that GCC adds 16 to the index register and uses a scale of 1 whereas the inline-assembly solution adds 4 to the index register and uses a scale of 4.
I was not able to use a general register for the iterator. I had to specify one which in this case was rax. Is there a reason for this?
Here is the solution I came up with using the memory modifier "m"
void add_asm2(float *x, float *y, float *z, unsigned n) {
    for(int i=0; i<n; i+=4) {
        __asm__ __volatile__ (
            "movaps %1, %%xmm0\n"
            "addps %2, %%xmm0\n"
            "movaps %%xmm0, %0\n"
            : "=m" (z[i])
            : "m" (y[i]), "m" (x[i])
            :
        );
    }
}
This is less efficient as it does not use an index register and instead has to add 16 to the base register of each array. The generated assembly is (gcc (Ubuntu 5.2.1-22ubuntu2) with gcc -O3 -S asmtest.c):
.L22:
movaps (%rsi), %xmm0
addps (%rdi), %xmm0
movaps %xmm0, (%rdx)
addl $4, %eax
addq $16, %rdx
addq $16, %rsi
addq $16, %rdi
cmpl %eax, %ecx
ja .L22
Is there a better solution using the memory modifier "m"? Is there some way to get it to use an index register? The reason I asked is that it seemed more logical to me to use the memory modifier "m" since I am reading and writing memory. Additionally, with the register modifier "r" I never use an output operand list, which seemed odd to me at first.
Maybe there is a better solution than using "r" or "m"?
Here is the full code I used to test this
#include <stdio.h>
#include <x86intrin.h>

#define N 64

void add_intrin(float *x, float *y, float *z, unsigned n) {
    for(int i=0; i<n; i+=4) {
        __m128 x4 = _mm_load_ps(&x[i]);
        __m128 y4 = _mm_load_ps(&y[i]);
        __m128 s = _mm_add_ps(x4,y4);
        _mm_store_ps(&z[i], s);
    }
}

void add_intrin2(float *x, float *y, float *z, unsigned n) {
    for(int i=0; i<n/4; i++) {
        __m128 x4 = _mm_load_ps(&x[4*i]);
        __m128 y4 = _mm_load_ps(&y[4*i]);
        __m128 s = _mm_add_ps(x4,y4);
        _mm_store_ps(&z[4*i], s);
    }
}

void add_asm1(float *x, float *y, float *z, unsigned n) {
    for(int i=0; i<n; i+=4) {
        __asm__ __volatile__ (
            "movaps (%1,%%rax,4), %%xmm0\n"
            "addps (%2,%%rax,4), %%xmm0\n"
            "movaps %%xmm0, (%0,%%rax,4)\n"
            :
            : "r" (z), "r" (y), "r" (x), "a" (i)
            :
        );
    }
}

void add_asm2(float *x, float *y, float *z, unsigned n) {
    for(int i=0; i<n; i+=4) {
        __asm__ __volatile__ (
            "movaps %1, %%xmm0\n"
            "addps %2, %%xmm0\n"
            "movaps %%xmm0, %0\n"
            : "=m" (z[i])
            : "m" (y[i]), "m" (x[i])
            :
        );
    }
}

int main(void) {
    float x[N], y[N], z1[N], z2[N], z3[N];
    for(int i=0; i<N; i++) x[i] = 1.0f, y[i] = 2.0f;
    add_intrin2(x,y,z1,N);
    add_asm1(x,y,z2,N);
    add_asm2(x,y,z3,N);
    for(int i=0; i<N; i++) printf("%.0f ", z1[i]); puts("");
    for(int i=0; i<N; i++) printf("%.0f ", z2[i]); puts("");
    for(int i=0; i<N; i++) printf("%.0f ", z3[i]); puts("");
}
Avoid inline asm whenever possible: https://gcc.gnu.org/wiki/DontUseInlineAsm. It blocks many optimizations. But if you really can't hand-hold the compiler into making the asm you want, you should probably write your whole loop in asm so you can unroll and tweak it manually, instead of doing stuff like this.
You can use an r constraint for the index. Use the q modifier to get the name of the 64bit register, so you can use it in an addressing mode. When compiled for 32bit targets, the q modifier selects the name of the 32bit register, so the same code still works.
If you want to choose what kind of addressing mode is used, you'll need to do it yourself, using pointer operands with r constraints.
GNU C inline asm syntax doesn't assume that you read or write the memory pointed to by pointer operands. (e.g. maybe you're using an inline-asm `and` on the pointer value, not dereferencing it). So you need to do something with either a "memory" clobber or memory input/output operands to let it know what memory you modify. A "memory" clobber is easy, but forces everything except locals to be spilled/reloaded. See the Clobbers section in the docs for an example of using a dummy input operand.
Specifically, an "m" (*(const float (*)[]) fptr) operand will tell the compiler that the entire array object is an input, arbitrary-length. i.e. the asm can't reorder with any stores that use fptr as part of the address (or that use the array it's known to point into). This also works with an "=m" or "+m" constraint (without the const, obviously).
Using a specific size like "m" (*(const float (*)[4]) fptr) lets you tell the compiler what you do/don't read. (Or write). Then it can (if otherwise permitted) sink a store to a later element past the asm statement, and combine it with another store (or do dead-store elimination) of any stores that your inline asm doesn't read.
(See How can I indicate that the memory *pointed* to by an inline ASM argument may be used? for a whole Q&A about this.)
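For instance, here's a minimal sketch using the sized form (the function name is mine; assumes fptr is 16-byte aligned), so the compiler knows exactly which 16 bytes the asm reads:
#include <immintrin.h>

// Sketch: [p] supplies the address; the dummy "m" operand tells the
// compiler that exactly fptr[0..3] (16 bytes) are read by this asm.
__m128 load_four(const float *fptr) {
    __m128 tmp;
    __asm__ ("movaps (%[p]), %[t]"
             : [t] "=x" (tmp)
             : [p] "r" (fptr),
               "m" (*(const float (*)[4]) fptr));
    return tmp;
}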
Another huge benefit of an m constraint is that -funroll-loops can work by generating addresses with constant offsets. Doing the addressing ourselves prevents the compiler from doing a single increment every 4 iterations or something, because every source-level value of i needs to appear in a register.
Here's my version, with some tweaks as noted in comments. This is not optimal, e.g. can't be unrolled efficiently by the compiler.
#include <immintrin.h>
void add_asm1_memclobber(float *x, float *y, float *z, unsigned n) {
    __m128 vectmp;  // let the compiler choose a scratch register
    for(int i=0; i<n; i+=4) {
        __asm__ __volatile__ (
            "movaps (%[y],%q[idx],4), %[vectmp]\n\t"  // q modifier: 64bit version of a GP reg
            "addps (%[x],%q[idx],4), %[vectmp]\n\t"
            "movaps %[vectmp], (%[z],%q[idx],4)\n\t"
            : [vectmp] "=x" (vectmp)  // "=m" (z[i]) gives worse code if the compiler prepares a reg we don't use
            : [z] "r" (z), [y] "r" (y), [x] "r" (x),
              [idx] "r" (i)  // unrolling is impossible this way (without an insn for every increment by 4)
            : "memory"  // you can avoid a "memory" clobber with dummy input/output operands
        );
    }
}
Godbolt compiler explorer asm output for this and a couple versions below.
Your version needs to declare %xmm0 as clobbered, or you will have a bad time when this is inlined. My version uses a temporary variable as an output-only operand that's never used. This gives the compiler full freedom for register allocation.
If you want to avoid the "memory" clobber, you can use dummy memory input/output operands like "m" (*(const __m128*)&x[i]) to tell the compiler which memory is read and written by your function. This is necessary to ensure correct code-generation if you did something like x[4] = 1.0; right before running that loop. (And even if you didn't write something that simple, inlining and constant propagation can boil it down to that.) And also to make sure the compiler doesn't read from z[] before the loop runs.
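A sketch of the loop body with such dummy operands in place of the "memory" clobber (roughly what produced the output below):
__asm__ __volatile__ (
    "movaps (%[y],%q[idx],4), %[vectmp]\n\t"
    "addps (%[x],%q[idx],4), %[vectmp]\n\t"
    "movaps %[vectmp], (%[z],%q[idx],4)\n\t"
    : [vectmp] "=x" (vectmp),
      "=m" (*(__m128*)&z[i])           // dummy output: z[i..i+3] is written
    : [z] "r" (z), [y] "r" (y), [x] "r" (x),
      [idx] "r" (i),
      "m" (*(const __m128*)&x[i]),     // dummy inputs: x[i..i+3] and
      "m" (*(const __m128*)&y[i])      //   y[i..i+3] are read
);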
In this case, we get horrible results: gcc5.x actually increments 3 extra pointers because it decides to use [reg] addressing modes instead of indexed. It doesn't know that the inline asm never actually references those memory operands using the addressing mode created by the constraint!
# gcc5.4 with dummy constraints like "=m" (*(__m128*)&z[i]) instead of "memory" clobber
.L11:
movaps (%rsi,%rax,4), %xmm0 # y, i, vectmp
addps (%rdi,%rax,4), %xmm0 # x, i, vectmp
movaps %xmm0, (%rdx,%rax,4) # vectmp, z, i
addl $4, %eax #, i
addq $16, %r10 #, ivtmp.19
addq $16, %r9 #, ivtmp.21
addq $16, %r8 #, ivtmp.22
cmpl %eax, %ecx # i, n
ja .L11 #,
r8, r9, and r10 are the extra pointers that the inline asm block doesn't use.
You can use a constraint that tells gcc an entire array of arbitrary length is an input or an output: "m" (*(const char (*)[]) pStr). This casts the pointer to a pointer-to-array (of unspecified size). See How can I indicate that the memory *pointed* to by an inline ASM argument may be used?
If we want to use indexed addressing modes, we will have the base address of all three arrays in registers, and this form of constraint asks for the base address (of the whole array) as an operand, rather than a pointer to the current memory being operated on.
This actually works without any extra pointer or counter increments inside the loop: (avoiding a "memory" clobber, but still not easily unrollable by the compiler).
void add_asm1_dummy_whole_array(const float *restrict x, const float *restrict y,
                                float *restrict z, unsigned n) {
    __m128 vectmp;  // let the compiler choose a scratch register
    for(int i=0; i<n; i+=4) {
        __asm__ __volatile__ (
            "movaps (%[y],%q[idx],4), %[vectmp]\n\t"  // q modifier: 64bit version of a GP reg
            "addps (%[x],%q[idx],4), %[vectmp]\n\t"
            "movaps %[vectmp], (%[z],%q[idx],4)\n\t"
            : [vectmp] "=x" (vectmp),
              "=m" (*(float (*)[]) z)  // "=m" (z[i]) // gives worse code if the compiler prepares a reg we don't use
            : [z] "r" (z), [y] "r" (y), [x] "r" (x),
              [idx] "r" (i),  // unrolling is impossible this way (without an insn for every increment by 4)
              "m" (*(const float (*)[]) x),
              "m" (*(const float (*)[]) y)  // pointer to unsized array = all memory from this pointer
        );
    }
}
This gives us the same inner loop we got with a "memory" clobber:
.L19: # with clobbers like "m" (*(const struct {float a; float x[];} *) y)
movaps (%rsi,%rax,4), %xmm0 # y, i, vectmp
addps (%rdi,%rax,4), %xmm0 # x, i, vectmp
movaps %xmm0, (%rdx,%rax,4) # vectmp, z, i
addl $4, %eax #, i
cmpl %eax, %ecx # i, n
ja .L19 #,
It tells the compiler that each asm block reads or writes the entire arrays, so it may unnecessarily stop it from interleaving with other code (e.g. after fully unrolling with low iteration count). It doesn't stop unrolling, but the requirement to have each index value in a register does make it less effective. There's no way for this to end up with a 16(%rsi,%rax,4) addressing mode in a 2nd copy of this block in the same loop, because we're hiding the addressing from the compiler.
A version with m constraints, that gcc can unroll:
#include <immintrin.h>
void add_asm1(float *x, float *y, float *z, unsigned n) {
    // x, y, z are assumed to be aligned
    __m128 vectmp;  // let the compiler choose a scratch register
    for(int i=0; i<n; i+=4) {
        __asm__ __volatile__ (
            // "movaps %[yi], %[vectmp]\n\t"  // get the compiler to do this load instead
            "addps %[xi], %[vectmp]\n\t"
            "movaps %[vectmp], %[zi]\n\t"
            // __m128 is a may_alias type so these casts are safe.
            : [vectmp] "=x" (vectmp),  // let the compiler pick a scratch reg
              [zi] "=m" (*(__m128*)&z[i])  // actual memory output for the movaps store
            : [yi] "0" (*(__m128*)&y[i])  // or [yi] "xm" (*(__m128*)&y[i]), and uncomment the movaps load
            , [xi] "xm" (*(__m128*)&x[i])
            //, [idx] "r" (i)  // unrolling with this would need an insn for every increment by 4
        );
    }
}
Using [yi] as a +x input/output operand would be simpler, but writing it this way makes a smaller change for uncommenting the load in the inline asm, instead of letting the compiler get one value into registers for us.
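For reference, that simpler variant would look something like this (a sketch; the compiler does the load of y[i] for us):
// Sketch: "+x" makes vectmp a read-write register operand, so the
// compiler loads y[i] into it before the asm and we add x[i] in place.
__m128 vectmp = *(__m128*)&y[i];
__asm__ __volatile__ (
    "addps %[xi], %[vectmp]\n\t"
    "movaps %[vectmp], %[zi]\n\t"
    : [vectmp] "+x" (vectmp),
      [zi] "=m" (*(__m128*)&z[i])
    : [xi] "xm" (*(__m128*)&x[i])
);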
When I compile your add_asm2 code with gcc (4.9.2) I get:
add_asm2:
.LFB0:
.cfi_startproc
xorl %eax, %eax
xorl %r8d, %r8d
testl %ecx, %ecx
je .L1
.p2align 4,,10
.p2align 3
.L5:
#APP
# 3 "add_asm2.c" 1
movaps (%rsi,%rax), %xmm0
addps (%rdi,%rax), %xmm0
movaps %xmm0, (%rdx,%rax)
# 0 "" 2
#NO_APP
addl $4, %r8d
addq $16, %rax
cmpl %r8d, %ecx
ja .L5
.L1:
rep; ret
.cfi_endproc
so it is not perfect (it uses a redundant register), but does use indexed loads...
gcc also has builtin vector extensions, which are even cross-platform:
typedef float v4sf __attribute__((vector_size(16)));
void add_vector(float *x, float *y, float *z, unsigned n) {
    for(int i=0; i<n/4; i+=1) {
        *(v4sf*)(z + 4*i) = *(v4sf*)(x + 4*i) + *(v4sf*)(y + 4*i);
    }
}
On my gcc version 4.7.2 the generated assembly is:
.L28:
movaps (%rdi,%rax), %xmm0
addps (%rsi,%rax), %xmm0
movaps %xmm0, (%rdx,%rax)
addq $16, %rax
cmpq %rcx, %rax
jne .L28
Related
I'm trying to compile a simple C program (Win7 32bit, Mingw32 Shell and GCC 5.3.0). The C code is like this:
#include <stdio.h>
#include <stdlib.h>
#define _set_tssldt_desc(n,addr,type) \
__asm__ ("movw $104,%1\n\t" \
:\
:"a" (addr),\
"m" (*(n)),\
"m" (*(n+2)),\
"m" (*(n+4)),\
"m" (*(n+5)),\
"m" (*(n+6)),\
"m" (*(n+7))\
)
#define set_tss_desc(n,addr) _set_tssldt_desc(((char *) (n)),addr,"0x89")
char *n;
char *addr;
int main(void) {
    char *n = (char *)malloc(100*sizeof(int));
    char *addr = (char *)malloc(100*sizeof(int));
    set_tss_desc(n, addr);
    free(n);
    free(addr);
    return 0;
}
_set_tssldt_desc(n,addr,type) is a macro whose body is assembly code. set_tss_desc(n,addr) is another macro, very similar to _set_tssldt_desc(n,addr,type), and it is called in the main function.
When I try to compile this code, the compiler shows me the following error:
$ gcc test.c
test.c: In function 'main':
test.c:5:1: error: 'asm' operand has impossible constraints
__asm__ ("movw $104,%1\n\t" \
^
test.c:16:30: note: in expansion of macro '_set_tssldt_desc'
#define set_tss_desc(n,addr) _set_tssldt_desc(((char *) (n)),addr,"0x89")
^
test.c:25:3: note: in expansion of macro 'set_tss_desc'
set_tss_desc(n, addr);
^
The strange thing is, if I comment out the invocation in the main function, the code compiles successfully.
int main(void) {
    char *n = (char *)malloc(100*sizeof(int));
    char *addr = (char *)malloc(100*sizeof(int));
    //I commented it out and the code compiled.
    //set_tss_desc(n, addr);
    free(n);
    free(addr);
    return 0;
}
Or, if I delete some of the operands in the input part of the assembly code, it also compiles.
#include <stdio.h>
#include <stdlib.h>
#define _set_tssldt_desc(n,addr,type) \
__asm__ ("movw $104,%1\n\t" \
:\
:"a" (addr),\
"m" (*(n)),\
"m" (*(n+2)),\
"m" (*(n+4)),\
"m" (*(n+5)),\
"m" (*(n+6))\
)
//I DELETED "m" (*(n+7)) and the code compiled
#define set_tss_desc(n,addr) _set_tssldt_desc(((char *) (n)),addr,"0x89")
char *n;
char *addr;
int main(void) {
    char *n = (char *)malloc(100*sizeof(int));
    char *addr = (char *)malloc(100*sizeof(int));
    set_tss_desc(n, addr);
    free(n);
    free(addr);
    return 0;
}
Can someone explain to me why that is and how to fix this?
As @MichaelPetch says, you're approaching this the wrong way. If you're trying to set up an operand for lgdt, do that in C and only use inline-asm for the lgdt instruction itself. See the inline-assembly tag wiki, and the x86 tag wiki.
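For example, a minimal sketch of that approach for 32-bit protected mode (the struct and function names are mine; the pseudo-descriptor layout is a 16-bit limit followed by a 32-bit base):
#include <stdint.h>

// lgdt takes a 6-byte pseudo-descriptor from memory: build it in C.
struct __attribute__((packed)) gdt_ptr {
    uint16_t limit;   // size of the GDT minus 1
    uint32_t base;    // linear address of the GDT
};

static inline void load_gdt(const struct gdt_ptr *g) {
    __asm__ __volatile__ ("lgdt %0" : : "m" (*g));
}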
Related: a C struct/union for messing with Intel descriptor-tables: How to do computations with addresses at compile/linking time?. (The question wanted to generate the table as static data, hence asking about breaking addresses into low / high halves at compile time).
Also: Implementing GDT with basic kernel for some C + asm GDT manipulation. Or maybe not, since the answer there just says the code in the question is problematic, without a detailed fix.
Linker error setting loading GDT register with LGDT instruction using Inline assembly has an answer from Michael Petch, with some links to more guides/tutorials.
It's still useful to answer the specific question, even though the right fix is https://gcc.gnu.org/wiki/DontUseInlineAsm.
This compiles fine with optimization enabled.
With -O0, gcc doesn't notice or take advantage of the fact that the operands are all small constant offsets from each other, so it can't just use the same base register with offset addressing modes. It wants to put a pointer to each input memory operand into a separate register, but runs out of registers. With -O1 or higher, CSE does what you'd expect.
You can see this in a reduced example with the last 3 memory operands commented out, and with the asm string changed to include an asm comment showing all the operands. From gcc5.3 -O0 -m32 on the Godbolt compiler explorer:
#define _set_tssldt_desc(n,addr,type) \
__asm__ ("movw $104,%1\n\t" \
"#operands: %0, %1, %2, %3\n" \
...
void simple_wrapper(char *n, char *addr) {
    set_tss_desc(n, addr);
}
pushl %ebp
movl %esp, %ebp
pushl %ebx
movl 8(%ebp), %eax
leal 2(%eax), %ecx
movl 8(%ebp), %eax
leal 4(%eax), %ebx
movl 12(%ebp), %eax
movl 8(%ebp), %edx
#APP # your inline-asm code
movw $104,(%edx)
#operands: %eax, (%edx), (%ecx), (%ebx)
#NO_APP
nop # no idea why the compiler inserted a literal NOP here (not .p2align)
popl %ebx
popl %ebp
ret
But with optimization enabled, you get
simple_wrapper:
movl 4(%esp), %edx
movl 8(%esp), %eax
#APP
movw $104,(%edx)
#operands: %eax, (%edx), 2(%edx), 4(%edx)
#NO_APP
ret
Notice how the later operands use base+disp addressing modes.
Your constraints are totally backwards. You're writing to memory that you've told the compiler is an input operand. It will assume that the memory is not modified by the asm statement, so if you load from it in C, it might move that load ahead of the asm. And other possible breakage.
If you had used "=m" output operands, this code would be correct (but still inefficient compared to letting the compiler do it for you.)
You could have written your asm to do the offsetting itself from a single memory operand, but then you'd need to do something to tell the compiler about all the memory read or written by the asm statement; e.g. "=m" (*(struct {char a; char x[];} *) n) to tell it that you write the entire object starting at n. (See this answer).
AT&T-syntax x86 memory operands are always offsettable, so you can use 2 + %[nbase] instead of a separate operand:
asm("movw $104, %[nbase]\n\t"
"movw $123, 2 + %[nbase]\n\t"
: [nbase] "=m" (*(struct {char a; char x[];} *) n)
: [addr] "ri" (addr)
);
gas will warn about 2 + (%ebx) or whatever it ends up being, but that's ok.
Using a separate memory output operand for each place you write will avoid any problems about telling the compiler which memory you write. But you got it wrong: you've told the compiler that your code doesn't use n+1 when in fact you're using movw $104 to store 2 bytes starting at n. So that should be a uint16_t memory operand. If this sounds complicated, https://gcc.gnu.org/wiki/DontUseInlineAsm. Like Michael said, do this part in C with a struct, and only use inline asm for a single instruction that needs it.
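A corrected sketch along those lines (still not recommended; I've dropped the unused type parameter): the store target becomes an output operand of the width actually written:
#include <stdint.h>

// Sketch: movw writes 2 bytes, so the operand is a uint16_t *output*.
// Any other store this macro grows would need its own "=m" operand
// of the right width, too.
#define _set_tssldt_desc(n, addr)          \
  __asm__ ("movw $104, %0"                 \
           : "=m" (*(uint16_t *)(n))       \
           : "a" (addr))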
It would obviously be more efficient to use fewer wider store instructions. IDK what you're planning to do next, but any adjacent constants should be coalesced into a 32-bit store, like mov $(104 + 0x1234 << 16), %[n0] or something. Again, https://gcc.gnu.org/wiki/DontUseInlineAsm.
Since I'm very new to GCC, I'm facing a problem with inline assembly code. The problem is that I'm not able to figure out how to copy the contents of a C variable (which is of type UINT32) into the register eax. I have tried the code below:
__asm__
(
    // If the LSB of src is a 0, use ~src. Otherwise, use src.
    "mov $src1, %eax;"
    "and $1, %eax;"
    "dec %eax;"
    "xor $src2, %eax;"
    // Find the number of zeros before the most significant one.
    "mov $0x3F, %ecx;"
    "bsr %eax, %eax;"
    "cmove %ecx, %eax;"
    "xor $0x1F, %eax;"
);
However, mov $src1, %eax doesn't work.
Could someone suggest a solution to this?
I guess what you are looking for is extended assembly e.g.:
int a = 10, b;
asm ("movl %1, %%eax;"  /* eax = a */
     "movl %%eax, %0;"  /* b = eax */
     : "=r" (b)         /* output */
     : "r" (a)          /* input */
     : "%eax"           /* clobbered register */
);
In the example above, we made the value of b equal to that of a using assembly instructions and eax register:
int a = 10, b;
b = a;
Please see the inline comments.
note:
mov $4, %eax // AT&T notation
mov eax, 4 // Intel notation
A good read about inline assembly in the GCC environment is the GCC documentation on extended asm.
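Applied to the code in the question, a sketch might look like this (the function and operand names are mine, untested; the compiler now supplies src1 and src2 in registers instead of a hard-coded mov):
#include <stdint.h>

static inline uint32_t clz_cond(uint32_t src1, uint32_t src2)
{
    uint32_t res = src1;       // "+&r": res enters holding src1
    uint32_t fallback = 0x3F;
    __asm__ ("and $1, %[res]\n\t"      /* res = src1 & 1             */
             "dec %[res]\n\t"          /* res = 0 or -1              */
             "xor %[s2], %[res]\n\t"   /* res = src2 or ~src2        */
             "bsr %[res], %[res]\n\t"  /* index of highest set bit   */
             "cmove %[fb], %[res]\n\t" /* use 0x3F if input was zero */
             "xor $0x1F, %[res]"
             : [res] "+&r" (res)       // early-clobber: written before inputs are done
             : [s2] "r" (src2), [fb] "r" (fallback)
             : "cc");
    return res;
}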
I'm having trouble with a gcc inline asm statement; gcc seems to think the result is a constant (which it isn't) and optimizes the statement away. I think I am using the operand constraints correctly, but would like a second opinion on the matter. If the problem is not in my use of constraints, I'll try to isolate a test case for a gcc bug report, but that may be difficult as even subtle changes in the surrounding code cause the problem to disappear.
The inline asm in question is
static inline void
ularith_div_2ul_ul_ul_r (unsigned long *r, unsigned long a1,
const unsigned long a2, const unsigned long b)
{
ASSERT(a2 < b); /* Or there will be quotient overflow */
__asm__(
"# ularith_div_2ul_ul_ul_r: divq %0 %1 %2 %3\n\t"
"divq %3"
: "+a" (a1), "=d" (*r)
: "1" (a2), "rm" (b)
: "cc");
}
which is a pretty run-of-the-mill remainder of a two-word dividend by a one-word divisor. Note that the high word of the input, a2, and the remainder output, *r, are tied to the same register %rdx by the "1" constraint.
From the surrounding code, ularith_div_2ul_ul_ul_r() gets effectively called as if by
if (s == 1)
    modpp[0].one = 0;
else
    ularith_div_2ul_ul_ul_r(&modpp[0].one, 0UL, 1UL, s);
so the high word of the input, a2, is the constant 1UL.
The resulting asm output of gcc -S -fverbose-asm looks like:
(earlier:)
xorl %r8d, %r8d # cstore.863
(then:)
cmpq $1, -208(%rbp) #, %sfp
movl $1, %eax #, tmp841
movq %rsi, -184(%rbp) # prephitmp.966, MEM[(struct __modulusredcul_t *)&modpp][0].invm
cmovne -208(%rbp), %rcx # prephitmp.966,, %sfp, prephitmp.966
cmovne %rax, %r8 # cstore.863,, tmp841, cstore.863
movq %r8, -176(%rbp) # cstore.863, MEM[(struct __modulusredcul_t *)&modpp][0].one
The effect is that the result of the ularith_div_2ul_ul_ul_r() call is assumed to be the constant 1; the divq never appears in the output.
Various changes make the problem disappear; different compiler flags, different code context or marking the asm block __asm__ __volatile__ (...). The output then correctly contains the divq instruction:
#APP
# ularith_div_2ul_ul_ul_r: divq %rax %rdx %rdx -208(%rbp) # a1, tmp590, tmp590, %sfp
divq -208(%rbp) # %sfp
#NO_APP
So, my question to the inline assembly guys here: did I do something wrong with the constraints?
The bug affects only Ubuntu versions of gcc; the stock GNU gcc is unaffected as far as we can tell. The bug was reported to Ubuntu launchpad and confirmed: Bug #1029454
This post is closely related to another one I posted some days ago. This time, I wrote a simple code that just adds a pair of arrays of elements, multiplies the result by the values in another array, and stores it in a fourth array, all variables being double-precision floating point.
I made two versions of that code: one with SSE instructions, using calls to SSE intrinsics, and another one without them. I then compiled them with gcc at the -O0 optimization level. I write them below:
// SSE VERSION
#define N 10000
#define NTIMES 100000
#include <time.h>
#include <stdio.h>
#include <xmmintrin.h>
#include <pmmintrin.h>
double a[N] __attribute__((aligned(16)));
double b[N] __attribute__((aligned(16)));
double c[N] __attribute__((aligned(16)));
double r[N] __attribute__((aligned(16)));
int main(void){
    int i, times;
    for( times = 0; times < NTIMES; times++ ){
        for( i = 0; i < N; i += 2 ){
            __m128d mm_a = _mm_load_pd( &a[i] );
            _mm_prefetch( &a[i+4], _MM_HINT_T0 );
            __m128d mm_b = _mm_load_pd( &b[i] );
            _mm_prefetch( &b[i+4], _MM_HINT_T0 );
            __m128d mm_c = _mm_load_pd( &c[i] );
            _mm_prefetch( &c[i+4], _MM_HINT_T0 );
            __m128d mm_r;
            mm_r = _mm_add_pd( mm_a, mm_b );
            mm_a = _mm_mul_pd( mm_r, mm_c );
            _mm_store_pd( &r[i], mm_a );
        }
    }
}
//NO SSE VERSION
//same definitions as before
int main(void){
    int i, times;
    for( times = 0; times < NTIMES; times++ ){
        for( i = 0; i < N; i++ ){
            r[i] = (a[i]+b[i])*c[i];
        }
    }
}
When compiling them with -O0, gcc makes use of XMM/MMX registers and SSE instructions, if not specifically given the -mno-sse (and others) options. I inspected the assembly code generated for the second code and noticed that it makes use of movsd, addsd and mulsd instructions. So it makes use of SSE instructions, but only of those that use the lowest part of the registers, if I am not wrong. The assembly code generated for the first C code made use, as expected, of the addpd and mulpd instructions, though a considerably larger amount of assembly code was generated.
Anyway, the first code should, as far as I know, benefit more from the SIMD paradigm, since every iteration computes two result values. Even so, the second code performs about 25 percent faster than the first one. I also made a test with single-precision values and got similar results. What's the reason for that?
Vectorization in GCC is enabled at -O3. That's why at -O0, you see only the ordinary scalar SSE2 instructions (movsd, addsd, etc). Using GCC 4.6.1 and your second example:
#define N 10000
#define NTIMES 100000
double a[N] __attribute__ ((aligned (16)));
double b[N] __attribute__ ((aligned (16)));
double c[N] __attribute__ ((aligned (16)));
double r[N] __attribute__ ((aligned (16)));
int
main (void)
{
  int i, times;
  for (times = 0; times < NTIMES; times++)
    {
      for (i = 0; i < N; ++i)
        r[i] = (a[i] + b[i]) * c[i];
    }
  return 0;
}
and compiling with gcc -S -O3 -msse2 sse.c produces for the inner loop the following instructions, which is pretty good:
.L3:
movapd a(%eax), %xmm0
addpd b(%eax), %xmm0
mulpd c(%eax), %xmm0
movapd %xmm0, r(%eax)
addl $16, %eax
cmpl $80000, %eax
jne .L3
As you can see, with vectorization enabled GCC emits code to perform two loop iterations in parallel. It can be improved, though - this code uses the lower 128 bits of the SSE registers, but it can use the full 256-bit YMM registers by enabling the AVX encoding of SSE instructions (if available on the machine). So, compiling the same program with gcc -S -O3 -msse2 -mavx sse.c gives for the inner loop:
.L3:
vmovapd a(%eax), %ymm0
vaddpd b(%eax), %ymm0, %ymm0
vmulpd c(%eax), %ymm0, %ymm0
vmovapd %ymm0, r(%eax)
addl $32, %eax
cmpl $80000, %eax
jne .L3
Note the v in front of each instruction, and that the instructions use the 256-bit YMM registers; four iterations of the original loop are executed in parallel.
I would like to extend chill's answer and draw your attention to the fact that GCC seems not to be able to make the same smart use of the AVX instructions when iterating backwards.
Just replace the inner loop in chill's sample code with:
for (i = N-1; i >= 0; --i)
    r[i] = (a[i] + b[i]) * c[i];
GCC (4.8.4) with options -S -O3 -mavx produces:
.L5:
vmovsd a+79992(%rax), %xmm0
subq $8, %rax
vaddsd b+80000(%rax), %xmm0, %xmm0
vmulsd c+80000(%rax), %xmm0, %xmm0
vmovsd %xmm0, r+80000(%rax)
cmpq $-80000, %rax
jne .L5
This question already has answers here: How to invoke a system call via syscall or sysenter in inline assembly?
Is it possible to write a single character using a syscall from within an inline assembly block? If so, how? It should look something like this:
__asm__ __volatile__
(
" movl $1, %%edx \n\t"
" movl $80, %%ecx \n\t"
" movl $0, %%ebx \n\t"
" movl $4, %%eax \n\t"
" int $0x80 \n\t"
::: "%eax", "%ebx", "%ecx", "%edx"
);
$80 is 'P' in ASCII, but that returns nothing.
Any suggestions much appreciated!
You can use architecture-specific constraints to directly place the arguments in specific registers, without needing the movl instructions in your inline assembly. Furthermore, then you can then use the & operator to get the address of the character:
#include <sys/syscall.h>

void sys_putc(char c) {
    // write(int fd, const void *buf, size_t count);
    int ret;
    asm volatile("int $0x80"
                 : "=a"(ret)                                // outputs
                 : "a"(SYS_write), "b"(1), "c"(&c), "d"(1)  // inputs
                 : "memory");                               // clobbers
}

int main(void) {
    sys_putc('P');
    sys_putc('\n');
}
(Editor's note: the "memory" clobber is needed, or some other way of telling the compiler that the memory pointed-to by &c is read. How can I indicate that the memory *pointed* to by an inline ASM argument may be used?)
(In this case, =a(ret) is needed to indicate that the syscall clobbers EAX. We can't list EAX as a clobber because we need an input operand to use that register. The "a" constraint is like "r" but can only pick AL/AX/EAX/RAX. )
$ cc -m32 sys_putc.c && ./a.out
P
You could also return the number of bytes written that the syscall returns, and use "0" as a constraint to indicate EAX again:
int sys_putc(char c) {
int ret;
asm volatile("int $0x80" : "=a"(ret) : "0"(SYS_write), "b"(1), "c"(&c), "d"(1) : "memory");
return ret;
}
Note that on error, the system call return value will be a -errno code like -EBADF (bad file descriptor) or -EFAULT (bad pointer).
The normal libc system call wrapper functions check for a return value of unsigned eax > -4096UL and set errno + return -1.
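A sketch of that convention (the function name is mine):
#include <errno.h>

// Raw values in [-4095, -1] are -errno codes; anything else is success.
static long check_syscall_ret(long raw) {
    if ((unsigned long)raw > -4096UL) {
        errno = -raw;
        return -1;
    }
    return raw;
}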
Also note that compiling with -m32 is required: the 64-bit syscall ABI uses different call numbers (and registers), but this asm is hard-coding the slow way of invoking the 32-bit ABI, int $0x80.
Compiling in 64-bit mode will get sys/syscall.h to define SYS_write with 64-bit call numbers, which would break this code. So would 64-bit stack addresses even if you used the right numbers. What happens if you use the 32-bit int 0x80 Linux ABI in 64-bit code? - don't do that.
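For comparison, a sketch of the native 64-bit version (my function name): the syscall instruction, SYS_write == 1 in the 64-bit ABI, args in rdi/rsi/rdx, and rcx/r11 clobbered by syscall itself:
#include <sys/syscall.h>

void sys_putc64(char c) {   // compile WITHOUT -m32
    long ret;
    asm volatile("syscall"
                 : "=a"(ret)
                 : "a"(SYS_write), "D"(1), "S"(&c), "d"(1)
                 : "rcx", "r11", "memory");
}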
IIRC, two things are wrong in your example.
Firstly, you're writing to stdin with mov $0, %ebx.
Second, write takes a pointer as its second argument, so to write a single character you need that character stored somewhere in memory; you can't put the value directly into %ecx.
ex:
.data
char: .byte 80
.text
mov $char, %ecx
I've only done pure asm on Linux, never inline using gcc. You can't drop data into the middle of the assembly, so I'm not sure how you'd get the pointer using inline assembly.
EDIT: I think I just remembered how to do it. You could push 'P' onto the stack and use %esp:
pushw $80
movl %%esp, %%ecx
... int $0x80 ...
addl $2, %%esp
Something like
char p = 'P';
int main()
{
    __asm__ __volatile__
    (
        " movl $1, %%edx \n\t"
        " leal p, %%ecx \n\t"
        " movl $0, %%ebx \n\t"
        " movl $4, %%eax \n\t"
        " int $0x80 \n\t"
        ::: "%eax", "%ebx", "%ecx", "%edx"
    );
}
Note: I've used lea to Load the Effective Address of the char into the ecx register; for the value of ebx I tried $0 and $1 and it seems to work either way ...
To avoid the use of an external char:
int main()
{
    __asm__ __volatile__
    (
        " movl $1, %%edx \n\t"
        " subl $4, %%esp \n\t"
        " movl $80, (%%esp) \n\t"
        " movl %%esp, %%ecx \n\t"
        " movl $1, %%ebx \n\t"
        " movl $4, %%eax \n\t"
        " int $0x80 \n\t"
        " addl $4, %%esp \n\t"
        ::: "%eax", "%ebx", "%ecx", "%edx"
    );
}
N.B.: it works because of the little-endianness of Intel processors: the 32-bit store movl $80, (%%esp) puts the byte 'P' at the lowest address, which is exactly the one byte write reads. :D