In the following code, I can get the result of mm0 - mm1 in mm0 by PSUBSW instruction. When I compiled on Mac book air by gcc.
But, PSUBSW instruction is explained that we can get the result of mm1 - mm0 in mm1 in Intel developer's manual: PSUBSW mm, mm/m64, Subtract signed packed words in mm/m64 from signed packed words in mm and saturate results.
#include <stdio.h>
int
main()
{
short int a[4] = {1111,1112,1113,1114};
short int b[4] = {1111,2112,3113,4114};
short int c[4];
asm volatile (
"movq (%1),%%mm0\n\t"
"movq (%2),%%mm1\n\t"
"psubsw %%mm1,%%mm0\n\t"
"movq %%mm0,%0\n\t"
"emms"
: "=g"(c): "r"(&a),"r"(&b));
printf("%d %d %d %d\n", c[0], c[1], c[2], c[3]);
return 0;
}
What is this difference? Which is the src, mm0 or mm1? If this difference is Intel syntax and AT&T syntax.
In the GNU manual it states that gcc is based not on the Intel assembly language, but rather on a language that descends from the AT&T Unix assembler. There are two main camps on assembly syntax:
Intel: opcode dest, src
AT&T: opcode src, dest
In your example you're using AT&T syntax, the default. However, if you prefer to switch to Intel syntax, you can either compile with -masm=intel or use the following:
__asm__(".intel_syntax;" ...);
You can find more discussions of that topic in this or this Stackoverflow thread, or read more here.
Related
I'm trying to understand the usage of "early-clobber outputs" but I stumbled upon a snipped which confuses me. Consider the following multiply-modulo function:
static inline uint64_t mulmod64(uint64_t a, uint64_t b, uint64_t n)
{
uint64_t d;
uint64_t unused;
asm ("mulq %3\n\t"
"divq %4"
:"=a"(unused), "=&d"(d)
:"a"(a), "rm"(b), "rm"(n)
:"cc");
return d;
}
Why has RDX the early-clobber flag (&)? Is it because mulq implicitly modified RDX? Would the example work without the flag? (I tried and it seems it does. But would it be correct as well?) On the other had, isn't it enough that the function outputs RDX to tell the compiler RDX was modified?
Also, why there is that unused variable? I assume it's there to denote that RAX was modified, correct? Can I remove it? (I tried and it seems to work.) I would have expected the correct way of marking the modified RAX is by including "rax" to "clobbers", along with "cc". But that does not work.
While this doesn't answer the question - I think the comments have it covered - I would simplify this, by letting the compiler choose registers vs memory, and allowing it to schedule mulq and divq as required... The problem is that div has register restrictions:
static inline uint64_t mulmod64(uint64_t a, uint64_t b, uint64_t n)
{
uint64_t ret, q, rh, rl;
__asm__ ("mulq %3" : "=a,a" (rl), "=d,d" (rh)
: "%0,0" (a), "r,m" (b) : "cc");
/* assert(rh < n), otherwise `div` raises a 'divide error' - the quotient is
* too large to store in in `%rax`. */
/* the "%0,0" notation implies that `(a)` and `(b)` are commutative.
* the "cc" clobber is implicit in gcc / clang asm (and, I expect, Intel icc)
* for the x86-64 asm statements. */
__asm__ ("divq %4" : "=a,a" (q), "=d,d" (ret)
: "0,0" (rl), "1,1" (rh), "r,m" (n), "cc");
return ret;
}
I am new to use XeonPhi Intel co-processor. I want to write code for a simple Vector sum using AVX 512 bit instructions. I use k1om-mpss-linux-gcc as a compiler and want to write inline assembly. Here it is my code:
#include <stdio.h>
#include <stdlib.h>
#include <string.h>
#include <sys/time.h>
#include <assert.h>
#include <stdint.h>
void* aligned_malloc(size_t size, size_t alignment) {
uintptr_t r = (uintptr_t)malloc(size + --alignment + sizeof(uintptr_t));
uintptr_t t = r + sizeof(uintptr_t);
uintptr_t o =(t + alignment) & ~(uintptr_t)alignment;
if (!r) return NULL;
((uintptr_t*)o)[-1] = r;
return (void*)o;
}
int main(int argc, char* argv[])
{
printf("Starting calculation...\n");
int i;
const int length = 65536;
unsigned *A = (unsigned*) aligned_malloc(length * sizeof(unsigned), 64);
unsigned *B = (unsigned*) aligned_malloc(length * sizeof(unsigned), 64);
unsigned *C = (unsigned*) aligned_malloc(length * sizeof(unsigned), 64);
for(i=0; i<length; i++){
A[i] = 1;
B[i] = 2;
}
const int AVXLength = length / 16;
unsigned char * pA = (unsigned char *) A;
unsigned char * pB = (unsigned char *) B;
unsigned char * pC = (unsigned char *) C;
for(i=0; i<AVXLength; i++ ){
__asm__("vmovdqa32 %1,%%zmm0\n"
"vmovdqa32 %2,%%zmm1\n"
"vpaddd %0,%%zmm0,%%zmm1;"
: "=m" (pC) : "m" (pA), "m" (pB));
pA += 64;
pB += 64;
pC += 64;
}
// To prove that the program actually worked
for (i=0; i <5 ; i++)
{
printf("C[%d] = %f\n", i, C[i]);
}
}
However when I run the program, I've got segmentation fault from the asm part. Can somebody help me with that???
Thanks
Xeon Phi Knights Corner doesn't support AVX. It only supports a special set of vector extensions, called Intel Initial Many Core Instructions (Intel IMCI) with a vector size of 512b. So trying to put any sort of AVX specific assembly into a KNC code will lead to crashes.
Just wait for Knights Landing. It will support AVX-512 vector extensions.
Although Knights Corner (KNC) does not have AVX512 it has something very similar. Many of the mnemonics are the same. In fact, in the OP's case the mnemoics vmovdqa32 and vpaddd are the same for AVX512 and KNC.
The opcodes likely differ but the compiler/assembler takes care of this. In the OPs case he/she is using a special version of GCC, k1om-mpss-linux-gcc which is part of the many core software stack KNC which presumably generates the correct opcodes. One can compile on the host using k1om-mpss-linux-gcc and then scp the binary to the KNC card. I learned about this from a comment in this question.
As to why the OPs code is failing I can only make guess since I don't have a KNC card to test with.
In my limited experience with GCC inline assembly I have learned that it's good to look at the generated assembly in the object file to make sure the compiler did what you expect.
When I compile your code with a normal version of GCC I see that the line "vpaddd %0,%%zmm0,%%zmm1;" produces assembly with the semicolon. I don't think the semicolon should be there. That could be one problem.
But since the OPs mnemonics are the same as AVX512 we can using AVX512 intrinsics to figure out the correct assembly
#include <x86intrin.h>
void foo(int *A, int *B, int *C) {
__m512i a16 = _mm512_load_epi32(A);
__m512i b16 = _mm512_load_epi32(B);
__m512i s16 = _mm512_add_epi32(a16,b16);
_mm512_store_epi32(C, s16);
}
and gcc -mavx512f -O3 -S knc.c procudes
vmovdqa64 (%rsi), %zmm0
vpaddd (%rdi), %zmm0, %zmm0
vmovdqa64 %zmm0, (%rdx)
GCC chose vmovdqa64 instead of vmovdqa32 even though the Intel documentaion says it should be vmovdqa32. I am not sure why. I don't know what the difference is. I could have used the intrinsic _mm512_load_si512 which does exist and according to Intel should map vmovdqa32 but GCC maps it to vmovdqa64 as well. I am not sure why there are also _mm512_load_epi32 and _mm512_load_epi64 now. SSE and AVX don't have these corresponding intrinsics.
Based on GCC's code here is the inline assembly I would use
__asm__ ("vmovdqa64 (%1), %%zmm0\n"
"vpaddd (%2), %%zmm0, %%zmm0\n"
"vmovdqa64 %%zmm0, (%0)"
:
: "r" (pC), "r" (pA), "r" (pB)
: "memory"
);
Maybe vmovdqa32 should be used instead of vmovdqa64 but I expect it does not matter.
I used the register modifier r instead of the memory modifier m because from past experience m the memory modifier did not produce the assembly I expected.
Another possibility to consider is to use a version of GCC that supports AVX512 intrinsics to generate the assembly and then use the special KNC version of GCC to convert the assembly to binary. For example
gcc-5.1 -O3 -S foo.c
k1om-mpss-linux-gcc foo.s
This may be asking for trouble since k1om-mpss-linux-gcc is likely an older version of GCC. I have never done something like this before but it may work.
As explained here the reason the AVX512 intrinsics
_mm512_load/store(u)_epi32
_mm512_load/store(u)_epi64
_mm512_load/store(u)_si512
is that the parameters have been converted to void*. For example with SSE you have to cast
int *x;
__m128i v;
__mm_store_si128((__m128*)x,v)
whereas with SSE you no longer need to
int *x;
__m512i;
__mm512_store_epi32(x,v);
//__mm512_store_si512(x,v); //this is also fine
It's still not clear to me why there is vmovdqa32 and vmovdqa64 (GCC only seems to use vmovdqa64 currently) but it's probably similar to movaps and movapd in SSE which have not real difference and exists only in case they may make a difference in the future.
The purpose of vmovdqa32 and vmovdqa64 is for masking which can be doing with these intrsics
_mm512_mask_load/store_epi32
_mm512_mask_load/store_epi64
Without masks the instructions are equivalent.
I have some simple code that finds the difference between two assembly labels:
#include <stdio.h>
static void foo(void){
__asm__ __volatile__("_foo_start:");
printf("Hello, world.\n");
__asm__ __volatile__("_foo_end:");
}
int main(void){
extern const char foo_start[], foo_end[];
printf("foo_start: %p, foo_end: %p\n", foo_start, foo_end);
printf("Difference = 0x%tx.\n", foo_end - foo_start);
foo();
return 0;
}
Now, this code works perfectly on 64-bit processors, just like you would expect it to. However, on 32-bit processors, the address of foo_start is the same as foo_end.
I'm sure it has to do with 32 to 64 bit. On i386, it results in 0x0, and x86_64 results in 0x7. On ARMv7 (32 bit), it results in 0x0, while on ARM64, it results in 0xC. (the 64-bit results are correct, I checked them with a disassembler)
I'm using Clang+LLVM to compile.
I'm wondering if it has to do with non-lazy pointers. In the assembly output of both 32-bit processor archs mentioned above, they have something like this at the end:
L_foo_end$non_lazy_ptr:
.indirect_symbol _foo_end
.long 0
L_foo_start$non_lazy_ptr:
.indirect_symbol _foo_start
.long 0
However, this is not present in the assembly output of both x86_64 and ARM64. I messed with removing the non-lazy pointers and addressing the labels directly yesterday, but to no avail. Any ideas on why this happens?
EDIT:
It appears that when compiled for 32 bit processors, foo_start[] and foo_end[] point to main. I....I'm so confused.
I didn't check on real code but suspect you are a victim of instruction reordering. As long as you do not define proper memory barriers, the compiler ist free to move your code within the function around as it sees fit since there is no interdependency between labels and printf() call.
Try adding ::: "memory" to your asm statements which should nail them where you wrote them.
I finally found the solution (or, alternative, I suppose). Apparently, the && operator can be used to get the address of C labels, removing the need for me to use inline assembly at all. I don't think it's in the C standard, but it looks like Clang supports it, and I've heard GCC does too.
#include <stdio.h>
int main(void){
foo_start:
printf("Hello, world.\n");
foo_end:
printf("Foo has ended.");
void* foo_start_ptr = &&foo_start;
void* foo_end_ptr = &&foo_end;
printf("foo_start: %p, foo_end: %p\n", foo_start_ptr, foo_end_ptr);
printf("Difference: 0x%tx\n", (long)foo_end_ptr - (long)foo_start_ptr);
return 0;
}
Now, this only works if the labels are in the same function, but for what I intend to use this for, it's perfect. No more ASM, and it doesn't leave a symbol behind. It appears to work just how I need it to. (Not tested on ARM64)
When compiling this code with Apple LLVM 4.1 in Xcode I get an error:
#include <stdio.h>
int main(int argc, const char * argv[])
{
int a = 1;
printf("a = %d\n", a);
asm volatile(".intel_syntax noprefix;"
"mov [%0], 2;"
:
: "r" (&a)
);
printf("a = %d\n", a);
return 0;
}
The error is Unknown token in expression.
If I use AT&T syntax it works fine:
asm volatile("movl $0x2, (%0);"
:
: "r" (&a)
: "memory"
);
What is wrong with the first code?
It looks like the compiler is translating %0 to %reg (%rcx on my machine) and the assembler does not like the % (as it is in intel mode).
I don't know if it's possible to mix the automatic register allocation feature (extended asm) with the intel syntax, as I've not seen any example yet.
Good documentation about gcc inline assembly is usually hard to come by, and clang states in its documentation that it's mostly compatible with gcc in this area...
I want to write a small low level program. For some parts of it I will need to use assembly language, but the rest of the code will be written on C/C++.
So, if I will use GCC to mix C/C++ with assembly code, do I need to use AT&T syntax or can
I use Intel syntax? Or how do you mix C/C++ and asm (intel syntax) in some other way?
I realize that maybe I don't have a choice and must use AT&T syntax, but I want to be sure..
And if there turns out to be no choice, where I can find full/official documentation about the AT&T syntax?
Thanks!
If you are using separate assembly files, gas has a directive to support Intel syntax:
.intel_syntax noprefix # not recommended for inline asm
which uses Intel syntax and doesn't need the % prefix before register names.
(You can also run as with -msyntax=intel -mnaked-reg to have that as the default instead of att, in case you don't want to put .intel_syntax noprefix at the top of your files.)
Inline asm: compile with -masm=intel
For inline assembly, you can compile your C/C++ sources with gcc -masm=intel (See How to set gcc to use intel syntax permanently? for details.) The compiler's own asm output (which the inline asm is inserted into) will use Intel syntax, and it will substitute operands into asm template strings using Intel syntax like [rdi + 8] instead of 8(%rdi).
This works with GCC itself and ICC, but for clang only clang 14 and later.
(Not released yet, but the patch is in current trunk.)
Using .intel_syntax noprefix at the start of inline asm, and switching back with .att_syntax can work, but will break if you use any m constraints. The memory reference will still be generated in AT&T syntax. It happens to work for registers because GAS accepts %eax as a register name even in intel-noprefix mode.
Using .att_syntax at the end of an asm() statement will also break compilation with -masm=intel; in that case GCC's own asm after (and before) your template will be in Intel syntax. (Clang doesn't have that "problem"; each asm template string is local, unlike GCC where the template string truly becomes part of the text file that GCC sends to as to be assembled separately.)
Related:
GCC manual: asm dialect alternatives: writing an asm statement with {att | intel} in the template so it works when compiled with -masm=att or -masm=intel. See an example using lock cmpxchg.
https://stackoverflow.com/tags/inline-assembly/info for more about inline assembly in general; it's important to make sure you're accurately describing your asm to the compiler, so it knows what registers and memory are read / written.
AT&T syntax: https://stackoverflow.com/tags/att/info
Intel syntax: https://stackoverflow.com/tags/intel-syntax/info
The x86 tag wiki has links to manuals, optimization guides, and tutorials.
You can use inline assembly with -masm=intel as ninjalj wrote, but it may cause errors when you include C/C++ headers using inline assembly. This is code to reproduce the errors on Cygwin.
sample.cpp:
#include <cstdint>
#include <iostream>
#include <boost/thread/future.hpp>
int main(int argc, char* argv[]) {
using Value = uint32_t;
Value value = 0;
asm volatile (
"mov %0, 1\n\t" // Intel syntax
// "movl $1, %0\n\t" // AT&T syntax
:"=r"(value)::);
auto expr = [](void) -> Value { return 20; };
boost::unique_future<Value> func { boost::async(boost::launch::async, expr) };
std::cout << (value + func.get());
return 0;
}
When I built this code, I got error messages below.
g++ -E -std=c++11 -Wall -o sample.s sample.cpp
g++ -std=c++11 -Wall -masm=intel -o sample sample.cpp -lboost_system -lboost_thread
/tmp/ccuw1Qz5.s: Assembler messages:
/tmp/ccuw1Qz5.s:1022: Error: operand size mismatch for `xadd'
/tmp/ccuw1Qz5.s:1049: Error: no such instruction: `incl DWORD PTR [rax]'
/tmp/ccuw1Qz5.s:1075: Error: no such instruction: `movl DWORD PTR [rcx],%eax'
/tmp/ccuw1Qz5.s:1079: Error: no such instruction: `movl %eax,edx'
/tmp/ccuw1Qz5.s:1080: Error: no such instruction: `incl edx'
/tmp/ccuw1Qz5.s:1082: Error: no such instruction: `cmpxchgl edx,DWORD PTR [rcx]'
To avoid these errors, it needs to separate inline assembly (the upper half of the code) from C/C++ code which requires boost::future and the like (the lower half). The -masm=intel option is used to compile .cpp files that contain Intel syntax inline assembly, not to other .cpp files.
sample.hpp:
#include <cstdint>
using Value = uint32_t;
extern Value GetValue(void);
sample1.cpp: compile with -masm=intel
#include <iostream>
#include "sample.hpp"
int main(int argc, char* argv[]) {
Value value = 0;
asm volatile (
"mov %0, 1\n\t" // Intel syntax
:"=r"(value)::);
std::cout << (value + GetValue());
return 0;
}
sample2.cpp: compile without -masm=intel
#include <boost/thread/future.hpp>
#include "sample.hpp"
Value GetValue(void) {
auto expr = [](void) -> Value { return 20; };
boost::unique_future<Value> func { boost::async(boost::launch::async, expr) };
return func.get();
}