I'm trying to learn how to write gcc inline assembly.
The following code is supposed to perform an shl instruction and return the result.
#include <stdio.h>
#include <inttypes.h>
uint64_t rotate(uint64_t x, int b)
{
int left = x;
__asm__ ("shl %1, %0"
:"=r"(left)
:"i"(b), "0"(left));
return left;
}
int main()
{
uint64_t a = 1000000000;
uint64_t res = rotate(a, 10);
printf("%llu\n", res);
return 0;
}
Compilation fails with error: impossible constraint in asm
The problem is basically with "i"(b). I've tried "o", "n", "m" among others but it still doesn't work. Either its this error or operand size mismatch.
What am I doing wrong?
As written, you code compiles correctly for me (I have optimization enabled). However, I believe you may find this to be a bit better:
#include <stdio.h>
#include <inttypes.h>
uint64_t rotate(uint64_t x, int b)
{
__asm__ ("shl %b[shift], %[value]"
: [value] "+r"(x)
: [shift] "Jc"(b)
: "cc");
return x;
}
int main(int argc, char *argv[])
{
uint64_t a = 1000000000;
uint64_t res = rotate(a, 10);
printf("%llu\n", res);
return 0;
}
Note that the 'J' is for 64bit. If you are using 32bit, 'I' is the correct value.
Other things of note:
You are truncating your rotate value from uint64_t to int? Are you compiling for 32bit code? I don't believe shl can do 64bit rotates when compiled as 32bit.
Allowing 'c' on the input constraint means you can use variable rotate amounts (ie not hard-coded at compile time).
Since shl modifies the flags, use "cc" to let the compiler know.
Using the [name] form makes the asm easier to read (IMO).
The %b is a modifier. See https://gcc.gnu.org/onlinedocs/gcc/Extended-Asm.html#i386Operandmodifiers
If you want to really get smart about inline asm, check out the latest gcc docs: https://gcc.gnu.org/onlinedocs/gcc/Extended-Asm.html
Related
I want to verify the role of volatile by this method. But my inline assembly code doesn't seem to be able to modify the value of i without the compiler knowing. According to the articles I read, I only need to write assembly code like __asm { mov dword ptr [ebp-4], 20h }, I think I write the same as what he did.
actual output:
before = 10
after = 123
expected output:
before = 10
after = 10
Article link: https://www.runoob.com/w3cnote/c-volatile-keyword.html
#include <stdio.h>
int main() {
int a, b;
// volatile int i = 10;
int i = 10;
a = i;
printf("before = %d\n", a);
// Change the value of i in memory without letting the compiler know.
// I can't run the following statement here, so I wrote one myself
// mov dword ptr [ebp-4], 20h
asm("movl $123, -12(%rbp)");
b = i;
printf("after = %d\n", b);
}
I want to verify the role of volatile ...
You can't.
If a variable is not volatile, the compiler may optimize; it does not need to do this.
A compiler may always treat any variable as volatile.
How to change the value of a variable without the compiler knowing?
Create a second thread writing to the variable.
Example
The following example is for Linux (under Windows, you need a different function than pthread_create()):
#include <stdio.h>
#include <pthread.h>
int testVar;
volatile int waitVar;
void * otherThread(void * dummy)
{
while(waitVar != 2) { /* Wait */ }
testVar = 123;
waitVar = 3;
return NULL;
}
int main()
{
pthread_t pt;
waitVar = 1;
pthread_create(&pt, 0, otherThread, NULL);
testVar = 10;
waitVar = 2;
while(waitVar != 3) { /* Wait */ }
printf("%d\n", testVar - 10);
return 0;
}
If you compile with gcc -O0 -o x x.c -lpthread, the compiler does not optimize and works like all variables are volatile. printf() prints 113.
If you compile with -O3 instead of -O0, printf() prints 0.
If you replace int testVar by volatile int testVar, printf() always prints 113 (independent of -O0/-O3).
(Tested with the GCC 9.4.0 compiler.)
I'm trying to understand the usage of "early-clobber outputs" but I stumbled upon a snipped which confuses me. Consider the following multiply-modulo function:
static inline uint64_t mulmod64(uint64_t a, uint64_t b, uint64_t n)
{
uint64_t d;
uint64_t unused;
asm ("mulq %3\n\t"
"divq %4"
:"=a"(unused), "=&d"(d)
:"a"(a), "rm"(b), "rm"(n)
:"cc");
return d;
}
Why has RDX the early-clobber flag (&)? Is it because mulq implicitly modified RDX? Would the example work without the flag? (I tried and it seems it does. But would it be correct as well?) On the other had, isn't it enough that the function outputs RDX to tell the compiler RDX was modified?
Also, why there is that unused variable? I assume it's there to denote that RAX was modified, correct? Can I remove it? (I tried and it seems to work.) I would have expected the correct way of marking the modified RAX is by including "rax" to "clobbers", along with "cc". But that does not work.
While this doesn't answer the question - I think the comments have it covered - I would simplify this, by letting the compiler choose registers vs memory, and allowing it to schedule mulq and divq as required... The problem is that div has register restrictions:
static inline uint64_t mulmod64(uint64_t a, uint64_t b, uint64_t n)
{
uint64_t ret, q, rh, rl;
__asm__ ("mulq %3" : "=a,a" (rl), "=d,d" (rh)
: "%0,0" (a), "r,m" (b) : "cc");
/* assert(rh < n), otherwise `div` raises a 'divide error' - the quotient is
* too large to store in in `%rax`. */
/* the "%0,0" notation implies that `(a)` and `(b)` are commutative.
* the "cc" clobber is implicit in gcc / clang asm (and, I expect, Intel icc)
* for the x86-64 asm statements. */
__asm__ ("divq %4" : "=a,a" (q), "=d,d" (ret)
: "0,0" (rl), "1,1" (rh), "r,m" (n), "cc");
return ret;
}
The codes:
#include <stdio.h>
#include <stdarg.h>
#include <stdlib.h>
typedef unsigned int uint32_t;
float average(int n_values, ... )
{
va_list var_arg;
int count;
float sum = 0;
va_start(var_arg, n_values);
for (count = 0; count < n_values; count += 1) {
sum += va_arg(var_arg, signed long long int);
}
va_end(var_arg);
return sum / n_values;
}
int main(int argc, char *argv[])
{
(void)argc;
(void)argv;
printf("hello world!\n");
uint32_t t1 = 1;
uint32_t t2 = 4;
uint32_t t3 = 4;
printf("result:%f\n", average(3, t1, t2, t3));
return 0;
}
When I run in ubuntu (x86_64), It's Ok.
lix#lix-VirtualBox:~/test/c$ ./a.out
hello world!
result:3.000000
lix#lix-VirtualBox:~/test/c$ uname -a
Linux lix-VirtualBox 4.4.0-116-generic #140-Ubuntu SMP Mon Feb 12 21:23:04 UTC 2018 x86_64 x86_64 x86_64 GNU/Linux
lix#lix-VirtualBox:~/test/c$
But when I cross-compiler and run it in openwrt(ARM 32bit), It's wrong.
[root#OneCloud_0723:/root/lx]#./helloworld
hello world!
result:13952062464.000000
[root#OneCloud_0723:/root/lx]#uname -a
Linux OneCloud_0723 3.10.33 #1 SMP PREEMPT Thu Nov 2 19:55:17 CST 2017 armv7l GNU/Linux
I know do not call va_arg with an argument of the incorrect type. But Why we can get right result in x86_64 not in arm?
Thank you.
On x86-64 Linux, each 32-bit arg is passed in a separate 64-bit register (because that's what the x86-64 System V calling convention requires).
The caller happens to have zero-extended the 32-bit arg into the 64-bit register. (This isn't required; the undefined behaviour in your program could bite you with a different caller that left high garbage in the arg-passing registers.)
The callee (average()) is looking for three 64-bit args, and looks in the same registers where the caller put them, so it happens to work.
On 32-bit ARM, long long is doesn't fit in a single register, so the callee looking for long long args is definitely looking in different places than where the caller placed uint32_t args.
The first 64-bit arg the callee sees is probably ((long long)t1<<32) | t2, or the other way around. But since the callee is looking for 6x 32 bits of args, it will be looking at registers / memory that the caller didn't intend as args at all.
(Note that this could cause corruption of the caller's locals on the stack, because the callee is allowed to clobber stack args.)
For the full details, look at the asm output of your code with your compiler + compile options to see what exactly what behaviour resulted from the C Undefined Behaviour in your source. objdump -d ./helloworld should do the trick, or look at compiler output directly: How to remove "noise" from GCC/clang assembly output?.
On my system (x86_64)
#include <stdio.h>
int main(void)
{
printf("%zu\n", sizeof(long long int));
return 0;
}
this prints 8, which tells me that long long int is 64bits wide, I don't know
the size of a long long int on arm.
Regardless your va_arg call is wrong, you have to use the correct type, in
this case uint32, so your function has undefined behaviour and happens to get
the correct values. average should look like this:
float average(int n_values, ... )
{
va_list var_arg;
int count;
float sum = 0;
va_start(var_arg, n_values);
for (count = 0; count < n_values; count += 1) {
sum += va_arg(var_arg, uint32_t);
}
va_end(var_arg);
return sum / n_values;
}
Also don't declare your uint32_t as
typedef unsigned int uint32_t;
this is not portable, because int is not guaranteed to be 4 bytes long across
all architectures. The Standard C Library actually declares this type in
stdint.h, you should use the thos types instead.
So you program should look like this:
#include <stdio.h>
#include <stdarg.h>
#include <stdlib.h>
#include <stdint.h>
float average(int n_values, ... )
{
va_list var_arg;
int count;
float sum = 0;
va_start(var_arg, n_values);
for (count = 0; count < n_values; count += 1) {
sum += va_arg(var_arg, uint32_t);
}
va_end(var_arg);
return sum / n_values;
}
int main(void)
{
printf("hello world!\n");
uint32_t t1 = 1;
uint32_t t2 = 4;
uint32_t t3 = 4;
printf("result:%f\n", average(3, t1, t2, t3));
return 0;
}
this is portable and should yield the same results across different
architectures.
I am new to use XeonPhi Intel co-processor. I want to write code for a simple Vector sum using AVX 512 bit instructions. I use k1om-mpss-linux-gcc as a compiler and want to write inline assembly. Here it is my code:
#include <stdio.h>
#include <stdlib.h>
#include <string.h>
#include <sys/time.h>
#include <assert.h>
#include <stdint.h>
void* aligned_malloc(size_t size, size_t alignment) {
uintptr_t r = (uintptr_t)malloc(size + --alignment + sizeof(uintptr_t));
uintptr_t t = r + sizeof(uintptr_t);
uintptr_t o =(t + alignment) & ~(uintptr_t)alignment;
if (!r) return NULL;
((uintptr_t*)o)[-1] = r;
return (void*)o;
}
int main(int argc, char* argv[])
{
printf("Starting calculation...\n");
int i;
const int length = 65536;
unsigned *A = (unsigned*) aligned_malloc(length * sizeof(unsigned), 64);
unsigned *B = (unsigned*) aligned_malloc(length * sizeof(unsigned), 64);
unsigned *C = (unsigned*) aligned_malloc(length * sizeof(unsigned), 64);
for(i=0; i<length; i++){
A[i] = 1;
B[i] = 2;
}
const int AVXLength = length / 16;
unsigned char * pA = (unsigned char *) A;
unsigned char * pB = (unsigned char *) B;
unsigned char * pC = (unsigned char *) C;
for(i=0; i<AVXLength; i++ ){
__asm__("vmovdqa32 %1,%%zmm0\n"
"vmovdqa32 %2,%%zmm1\n"
"vpaddd %0,%%zmm0,%%zmm1;"
: "=m" (pC) : "m" (pA), "m" (pB));
pA += 64;
pB += 64;
pC += 64;
}
// To prove that the program actually worked
for (i=0; i <5 ; i++)
{
printf("C[%d] = %f\n", i, C[i]);
}
}
However when I run the program, I've got segmentation fault from the asm part. Can somebody help me with that???
Thanks
Xeon Phi Knights Corner doesn't support AVX. It only supports a special set of vector extensions, called Intel Initial Many Core Instructions (Intel IMCI) with a vector size of 512b. So trying to put any sort of AVX specific assembly into a KNC code will lead to crashes.
Just wait for Knights Landing. It will support AVX-512 vector extensions.
Although Knights Corner (KNC) does not have AVX512 it has something very similar. Many of the mnemonics are the same. In fact, in the OP's case the mnemoics vmovdqa32 and vpaddd are the same for AVX512 and KNC.
The opcodes likely differ but the compiler/assembler takes care of this. In the OPs case he/she is using a special version of GCC, k1om-mpss-linux-gcc which is part of the many core software stack KNC which presumably generates the correct opcodes. One can compile on the host using k1om-mpss-linux-gcc and then scp the binary to the KNC card. I learned about this from a comment in this question.
As to why the OPs code is failing I can only make guess since I don't have a KNC card to test with.
In my limited experience with GCC inline assembly I have learned that it's good to look at the generated assembly in the object file to make sure the compiler did what you expect.
When I compile your code with a normal version of GCC I see that the line "vpaddd %0,%%zmm0,%%zmm1;" produces assembly with the semicolon. I don't think the semicolon should be there. That could be one problem.
But since the OPs mnemonics are the same as AVX512 we can using AVX512 intrinsics to figure out the correct assembly
#include <x86intrin.h>
void foo(int *A, int *B, int *C) {
__m512i a16 = _mm512_load_epi32(A);
__m512i b16 = _mm512_load_epi32(B);
__m512i s16 = _mm512_add_epi32(a16,b16);
_mm512_store_epi32(C, s16);
}
and gcc -mavx512f -O3 -S knc.c procudes
vmovdqa64 (%rsi), %zmm0
vpaddd (%rdi), %zmm0, %zmm0
vmovdqa64 %zmm0, (%rdx)
GCC chose vmovdqa64 instead of vmovdqa32 even though the Intel documentaion says it should be vmovdqa32. I am not sure why. I don't know what the difference is. I could have used the intrinsic _mm512_load_si512 which does exist and according to Intel should map vmovdqa32 but GCC maps it to vmovdqa64 as well. I am not sure why there are also _mm512_load_epi32 and _mm512_load_epi64 now. SSE and AVX don't have these corresponding intrinsics.
Based on GCC's code here is the inline assembly I would use
__asm__ ("vmovdqa64 (%1), %%zmm0\n"
"vpaddd (%2), %%zmm0, %%zmm0\n"
"vmovdqa64 %%zmm0, (%0)"
:
: "r" (pC), "r" (pA), "r" (pB)
: "memory"
);
Maybe vmovdqa32 should be used instead of vmovdqa64 but I expect it does not matter.
I used the register modifier r instead of the memory modifier m because from past experience m the memory modifier did not produce the assembly I expected.
Another possibility to consider is to use a version of GCC that supports AVX512 intrinsics to generate the assembly and then use the special KNC version of GCC to convert the assembly to binary. For example
gcc-5.1 -O3 -S foo.c
k1om-mpss-linux-gcc foo.s
This may be asking for trouble since k1om-mpss-linux-gcc is likely an older version of GCC. I have never done something like this before but it may work.
As explained here the reason the AVX512 intrinsics
_mm512_load/store(u)_epi32
_mm512_load/store(u)_epi64
_mm512_load/store(u)_si512
is that the parameters have been converted to void*. For example with SSE you have to cast
int *x;
__m128i v;
__mm_store_si128((__m128*)x,v)
whereas with SSE you no longer need to
int *x;
__m512i;
__mm512_store_epi32(x,v);
//__mm512_store_si512(x,v); //this is also fine
It's still not clear to me why there is vmovdqa32 and vmovdqa64 (GCC only seems to use vmovdqa64 currently) but it's probably similar to movaps and movapd in SSE which have not real difference and exists only in case they may make a difference in the future.
The purpose of vmovdqa32 and vmovdqa64 is for masking which can be doing with these intrsics
_mm512_mask_load/store_epi32
_mm512_mask_load/store_epi64
Without masks the instructions are equivalent.
I was doing porting of armcc compiler to ARM GNU GCC , I pretty much figured out everything but I am stuck at this point :
A code is using something like this :
unsigned int add_bytes(unsigned int val1, unsigned int val2)
{
unsigned int res;
res = __uadd8(val1,val2); /* res[7:0] = val1[7:0] + val2[7:0]
res[15:8] = val1[15:8] + val2[15:8]
res[23:16] = val1[23:16] + val2[23:16]
res[31:24] = val1[31:24] + val2[31:24]
*/
return res;
}
__uadd8 is RVCT specific , Is there something equivalent provided by GCC or how can I achieve this?
GCC doesn't provide intrinsics for ARMv6 SIMD instructions. However you can define your own __UADD8 like below.
__attribute__( ( always_inline ) ) static __inline__ uint32_t __UADD8(uint32_t op1, uint32_t op2)
{
uint32_t result;
__asm__ volatile ("uadd8 %0, %1, %2" : "=r" (result) : "r" (op1), "r" (op2) );
return(result);
}
This is from one of the CMSIS header files. I didn't test it myself but including that file might fix every other v6 intrinsics. At worst you may need to do some copy-pasting.