I am working on a machine with an Intel(R) Xeon(R) CPU E5-2640 v2 @ 2.00GHz. It supports SSE4.2.
I have written C code to perform an XOR operation over string bits, but I want to write the corresponding SIMD code and check for a performance improvement. Here is my code:
#include <stdio.h>
#include <stdlib.h>
#include <math.h>
#include <time.h>

#define LENGTH 10

unsigned char xor_val[LENGTH];

void oper_xor(const unsigned char *r1, const unsigned char *r2)
{
    unsigned int i;
    for (i = 0; i < LENGTH; ++i) {
        xor_val[i] = (unsigned char)(r1[i] ^ r2[i]);
        printf("%d", xor_val[i]);
    }
}

int main() {
    int i;
    clock_t start, stop;   /* clock() returns clock_t, not time_t */
    double cur_time;

    start = clock();
    oper_xor((const unsigned char *)"1110001111",
             (const unsigned char *)"0000110011");
    stop = clock();
    cur_time = ((double)(stop - start)) / CLOCKS_PER_SEC;
    printf("Time used %f seconds.\n", cur_time);

    for (i = 0; i < LENGTH; ++i)
        printf("%d", xor_val[i]);
    printf("\n");
    return 0;
}
On compiling and running this sample I get the output shown below. The time is zero here, but in the actual project it consumes a significant amount of time.
gcc xor_scalar.c -o xor_scalar
pan88: ./xor_scalar
1110111100 Time used 0.000000 seconds.
1110111100
How can I start writing the corresponding SIMD code for SSE4.2?
The Intel Compiler and any OpenMP compiler support #pragma simd and #pragma omp simd, respectively. These are your best bet to get the compiler to do SIMD codegen for you. If that fails, you can use intrinsics or, as a means of last resort, inline assembly.
Note that the printf calls will almost certainly interfere with vectorization, so you should remove them from any loop in which you want to see SIMD.
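If you end up writing intrinsics, here is a minimal SSE2 sketch of the XOR loop (my own illustration, not tuned; the function and variable names are made up). It processes 16 bytes per iteration and falls back to scalar code for the tail:

#include <stddef.h>
#include <emmintrin.h>  /* SSE2 */

/* XOR len bytes of r1 and r2 into out, 16 bytes at a time. Unaligned
   loads/stores are used, so no particular alignment is required. */
void oper_xor_sse2(const unsigned char *r1, const unsigned char *r2,
                   unsigned char *out, size_t len)
{
    size_t i = 0;
    for (; i + 16 <= len; i += 16) {
        __m128i a = _mm_loadu_si128((const __m128i *)(r1 + i));
        __m128i b = _mm_loadu_si128((const __m128i *)(r2 + i));
        _mm_storeu_si128((__m128i *)(out + i), _mm_xor_si128(a, b));
    }
    for (; i < len; ++i)  /* scalar tail for the leftover bytes */
        out[i] = (unsigned char)(r1[i] ^ r2[i]);
}

Note that with LENGTH being 10 a full 16-byte vector never fits, so at this toy size the scalar tail does all the work; the SIMD path only pays off on longer buffers.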
The code:
#include <stdio.h>
#include <stdarg.h>
#include <stdlib.h>

typedef unsigned int uint32_t;

float average(int n_values, ...)
{
    va_list var_arg;
    int count;
    float sum = 0;

    va_start(var_arg, n_values);
    for (count = 0; count < n_values; count += 1) {
        sum += va_arg(var_arg, signed long long int);
    }
    va_end(var_arg);

    return sum / n_values;
}

int main(int argc, char *argv[])
{
    (void)argc;
    (void)argv;
    printf("hello world!\n");

    uint32_t t1 = 1;
    uint32_t t2 = 4;
    uint32_t t3 = 4;

    printf("result:%f\n", average(3, t1, t2, t3));
    return 0;
}
When I run it on Ubuntu (x86_64), it's OK.
lix@lix-VirtualBox:~/test/c$ ./a.out
hello world!
result:3.000000
lix@lix-VirtualBox:~/test/c$ uname -a
Linux lix-VirtualBox 4.4.0-116-generic #140-Ubuntu SMP Mon Feb 12 21:23:04 UTC 2018 x86_64 x86_64 x86_64 GNU/Linux
lix@lix-VirtualBox:~/test/c$
But when I cross-compile and run it on OpenWrt (ARM 32-bit), it's wrong.
[root@OneCloud_0723:/root/lx]# ./helloworld
hello world!
result:13952062464.000000
[root@OneCloud_0723:/root/lx]# uname -a
Linux OneCloud_0723 3.10.33 #1 SMP PREEMPT Thu Nov 2 19:55:17 CST 2017 armv7l GNU/Linux
I know one should not call va_arg with an argument of the incorrect type. But why do we get the right result on x86_64 and not on ARM?
Thank you.
On x86-64 Linux, each 32-bit arg is passed in a separate 64-bit register (because that's what the x86-64 System V calling convention requires).
The caller happens to have zero-extended the 32-bit arg into the 64-bit register. (This isn't required; the undefined behaviour in your program could bite you with a different caller that left high garbage in the arg-passing registers.)
The callee (average()) is looking for three 64-bit args, and looks in the same registers where the caller put them, so it happens to work.
On 32-bit ARM, long long doesn't fit in a single register, so the callee looking for long long args is definitely looking in different places than where the caller placed the uint32_t args.
The first 64-bit arg the callee sees is probably ((long long)t1<<32) | t2, or the other way around. But since the callee is looking for 6x 32 bits of args, it will be looking at registers / memory that the caller didn't intend as args at all.
(Note that this could cause corruption of the caller's locals on the stack, because the callee is allowed to clobber stack args.)
For the full details, look at the asm output of your code with your compiler + compile options to see exactly what behaviour resulted from the C undefined behaviour in your source. objdump -d ./helloworld should do the trick, or look at compiler output directly: How to remove "noise" from GCC/clang assembly output?.
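As a quick experiment (still undefined behaviour, purely to observe what actually arrives on a given ABI), a hypothetical diagnostic callee can read the variadic area as raw 64-bit values and print them:

#include <stdio.h>
#include <stdarg.h>

/* Hypothetical diagnostic: read n 64-bit slots from the variadic area and
   print them raw. This is itself UB when the caller passed 32-bit args;
   it only serves to observe what the callee actually sees. */
static void dump_args(int n, ...)
{
    va_list ap;
    va_start(ap, n);
    for (int i = 0; i < n; i++)
        printf("slot %d: %#llx\n", i, va_arg(ap, unsigned long long));
    va_end(ap);
}

Called as dump_args(3, t1, t2, t3), on x86-64 SysV this would likely print the zero-extended 0x1, 0x4, 0x4; on 32-bit ARM you would likely see two uint32_t values fused into each 64-bit read, plus garbage.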
On my system (x86_64)
#include <stdio.h>

int main(void)
{
    printf("%zu\n", sizeof(long long int));
    return 0;
}
this prints 8, which tells me that long long int is 64 bits wide (the C standard requires it to be at least 64 bits). I don't know the size of a long long int on ARM.
Regardless, your va_arg call is wrong; you have to use the correct type, in this case uint32_t. As it stands, your function has undefined behaviour and only happens to get the correct values on x86_64. average should look like this:
float average(int n_values, ...)
{
    va_list var_arg;
    int count;
    float sum = 0;

    va_start(var_arg, n_values);
    for (count = 0; count < n_values; count += 1) {
        sum += va_arg(var_arg, uint32_t);
    }
    va_end(var_arg);

    return sum / n_values;
}
Also, don't declare your uint32_t as
typedef unsigned int uint32_t;
This is not portable, because int is not guaranteed to be 4 bytes wide across all architectures. The Standard C Library actually declares this type in stdint.h; you should use those types instead.
So your program should look like this:
#include <stdio.h>
#include <stdarg.h>
#include <stdlib.h>
#include <stdint.h>

float average(int n_values, ...)
{
    va_list var_arg;
    int count;
    float sum = 0;

    va_start(var_arg, n_values);
    for (count = 0; count < n_values; count += 1) {
        sum += va_arg(var_arg, uint32_t);
    }
    va_end(var_arg);

    return sum / n_values;
}

int main(void)
{
    printf("hello world!\n");

    uint32_t t1 = 1;
    uint32_t t2 = 4;
    uint32_t t3 = 4;

    printf("result:%f\n", average(3, t1, t2, t3));
    return 0;
}
this is portable and should yield the same results across different
architectures.
I tried compiling this code,
#include <stdlib.h>

struct rgb {
    int r, g, b;
};

void adjust_brightness(struct rgb *picdata, size_t len, int adjustment) {
    // assume adjustment is between 0 and 255.
    for (int i = 0; i < len; i++) {
        picdata[i].r += adjustment;
        picdata[i].g += adjustment;
        picdata[i].b += adjustment;
    }
}
on OSX using this command,
$ cc -Rpass-analysis=loop-vectorize -c -std=c99 -O3 brightness.c
brightness.c:13:3: remark: loop not vectorized: unsafe dependent memory operations in loop [-Rpass-analysis=loop-vectorize]
for (int i = 0; i < len; i++) {
^
Can someone explain what is unsafe and dependent here? I'm learning about SIMD, and this was presented as the most obvious use case for it. I was hoping to learn how the compiler would generate SIMD instructions for a simple example. In my head, instead of incrementing by 1, I expect the compiler to increment by enough elements at a time to fit the loop body into vector registers.
Do I misunderstand?
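For comparison, here is a variant that tends to silence that remark on some compiler versions (a sketch, not a guaranteed fix): a size_t induction variable avoids the signed/unsigned mix with len, and restrict asserts that nothing aliases the array.

#include <stdlib.h>

struct rgb {
    int r, g, b;
};

/* Same loop, but with a size_t counter and a restrict-qualified pointer so
   the vectorizer can rule out wrapping indices and aliasing stores. */
void adjust_brightness2(struct rgb *restrict picdata, size_t len, int adjustment)
{
    for (size_t i = 0; i < len; i++) {
        picdata[i].r += adjustment;
        picdata[i].g += adjustment;
        picdata[i].b += adjustment;
    }
}

When the loop does vectorize, the compiler indeed processes several elements per iteration, which is exactly the "increment by more than 1" behaviour you describe.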
I am new to the Intel Xeon Phi co-processor. I want to write code for a simple vector sum using AVX 512-bit instructions. I use k1om-mpss-linux-gcc as the compiler and want to write inline assembly. Here is my code:
#include <stdio.h>
#include <stdlib.h>
#include <string.h>
#include <sys/time.h>
#include <assert.h>
#include <stdint.h>

/* Over-allocate, round up to the requested alignment, and stash the
   original pointer just below the aligned block so it could be freed. */
void* aligned_malloc(size_t size, size_t alignment) {
    uintptr_t r = (uintptr_t)malloc(size + --alignment + sizeof(uintptr_t));
    if (!r) return NULL;
    uintptr_t t = r + sizeof(uintptr_t);
    uintptr_t o = (t + alignment) & ~(uintptr_t)alignment;
    ((uintptr_t*)o)[-1] = r;
    return (void*)o;
}

int main(int argc, char* argv[])
{
    printf("Starting calculation...\n");
    int i;
    const int length = 65536;

    unsigned *A = (unsigned*) aligned_malloc(length * sizeof(unsigned), 64);
    unsigned *B = (unsigned*) aligned_malloc(length * sizeof(unsigned), 64);
    unsigned *C = (unsigned*) aligned_malloc(length * sizeof(unsigned), 64);

    for (i = 0; i < length; i++) {
        A[i] = 1;
        B[i] = 2;
    }

    const int AVXLength = length / 16;   /* 16 dwords per 512-bit vector */
    unsigned char * pA = (unsigned char *) A;
    unsigned char * pB = (unsigned char *) B;
    unsigned char * pC = (unsigned char *) C;
    for (i = 0; i < AVXLength; i++) {
        __asm__("vmovdqa32 %1,%%zmm0\n"
                "vmovdqa32 %2,%%zmm1\n"
                "vpaddd %0,%%zmm0,%%zmm1;"
                : "=m" (pC) : "m" (pA), "m" (pB));
        pA += 64;
        pB += 64;
        pC += 64;
    }

    // To prove that the program actually worked
    for (i = 0; i < 5; i++) {
        printf("C[%d] = %u\n", i, C[i]);   /* %u, not %f: C[] holds unsigned */
    }
}
However, when I run the program, I get a segmentation fault from the asm part. Can somebody help me with that?
Thanks
Xeon Phi Knights Corner doesn't support AVX. It only supports a special set of vector extensions, called Intel Initial Many Core Instructions (Intel IMCI), with a vector size of 512 bits. So trying to put any sort of AVX-specific assembly into KNC code will lead to crashes.
Just wait for Knights Landing. It will support AVX-512 vector extensions.
Although Knights Corner (KNC) does not have AVX512, it has something very similar, and many of the mnemonics are the same. In fact, in the OP's case the mnemonics vmovdqa32 and vpaddd are the same for AVX512 and KNC.
The opcodes likely differ, but the compiler/assembler takes care of this. In the OP's case, he/she is using a special version of GCC, k1om-mpss-linux-gcc, which is part of the KNC many-core software stack and presumably generates the correct opcodes. One can compile on the host using k1om-mpss-linux-gcc and then scp the binary to the KNC card. I learned about this from a comment in this question.
As to why the OP's code is failing, I can only guess, since I don't have a KNC card to test with.
In my limited experience with GCC inline assembly I have learned that it's good to look at the generated assembly in the object file to make sure the compiler did what you expect.
When I compile your code with a normal version of GCC I see that the line "vpaddd %0,%%zmm0,%%zmm1;" produces assembly with the semicolon. I don't think the semicolon should be there. That could be one problem.
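For example, assuming the object file is named knc.o (a hypothetical name), disassembling it shows exactly what was emitted:

objdump -d knc.o

The inline-asm block appears verbatim in the output, so a stray character like that semicolon is easy to spot.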
But since the OP's mnemonics are the same as AVX512, we can use AVX512 intrinsics to figure out the correct assembly:
#include <x86intrin.h>

void foo(int *A, int *B, int *C) {
    __m512i a16 = _mm512_load_epi32(A);
    __m512i b16 = _mm512_load_epi32(B);
    __m512i s16 = _mm512_add_epi32(a16, b16);
    _mm512_store_epi32(C, s16);
}
and gcc -mavx512f -O3 -S knc.c produces
vmovdqa64 (%rsi), %zmm0
vpaddd (%rdi), %zmm0, %zmm0
vmovdqa64 %zmm0, (%rdx)
GCC chose vmovdqa64 instead of vmovdqa32 even though the Intel documentation says it should be vmovdqa32. I am not sure why. I don't know what the difference is. I could have used the intrinsic _mm512_load_si512, which does exist and according to Intel should map to vmovdqa32, but GCC maps it to vmovdqa64 as well. I am not sure why there are also _mm512_load_epi32 and _mm512_load_epi64 now. SSE and AVX don't have these corresponding intrinsics.
Based on GCC's code here is the inline assembly I would use
__asm__ ("vmovdqa64 (%1), %%zmm0\n"
"vpaddd (%2), %%zmm0, %%zmm0\n"
"vmovdqa64 %%zmm0, (%0)"
:
: "r" (pC), "r" (pA), "r" (pB)
: "memory"
);
Maybe vmovdqa32 should be used instead of vmovdqa64 but I expect it does not matter.
I used the register modifier r instead of the memory modifier m because, from past experience, the memory modifier m did not produce the assembly I expected.
Another possibility to consider is to use a version of GCC that supports AVX512 intrinsics to generate the assembly and then use the special KNC version of GCC to convert the assembly to binary. For example
gcc-5.1 -O3 -S foo.c
k1om-mpss-linux-gcc foo.s
This may be asking for trouble since k1om-mpss-linux-gcc is likely an older version of GCC. I have never done something like this before but it may work.
As explained here, the reason the AVX512 intrinsics
_mm512_load/store(u)_epi32
_mm512_load/store(u)_epi64
_mm512_load/store(u)_si512
take void* parameters is that it removes the need for casting. For example, with SSE you have to cast:
int *x;
__m128i v;
_mm_store_si128((__m128i*)x, v);
whereas with AVX512 you no longer need to:
int *x;
__m512i v;
_mm512_store_epi32(x, v);
//_mm512_store_si512(x, v); // this is also fine
It's still not clear to me why there are vmovdqa32 and vmovdqa64 (GCC only seems to use vmovdqa64 currently), but it's probably similar to movaps and movapd in SSE, which have no real difference and exist only in case they may make a difference in the future.
The purpose of vmovdqa32 and vmovdqa64 is masking, which can be done with these intrinsics:
_mm512_mask_load/store_epi32
_mm512_mask_load/store_epi64
Without masks the instructions are equivalent.
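For illustration, here is a sketch using those masked intrinsics (AVX-512F names as documented by Intel; the pointers are assumed 64-byte aligned, since the masked aligned forms correspond to vmovdqa32):

#include <x86intrin.h>

/* Copy only the low 8 of 16 dwords, leaving the rest of dst untouched.
   The element width (32 vs 64) determines how the mask bits map to lanes,
   which is where vmovdqa32 and vmovdqa64 actually differ. */
void masked_copy(int *dst, const int *src)
{
    __mmask16 k = 0x00FF;
    __m512i v = _mm512_mask_load_epi32(_mm512_setzero_si512(), k, src);
    _mm512_mask_store_epi32(dst, k, v);
}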
I'm learning the basics of SIMD so I was given a simple code snippet to see the principle at work with SSE and SSE2.
I recently installed MinGW to compile C code on Windows with gcc instead of using the Visual Studio compiler.
The objective of the example is to add two floats and then multiply by a third one.
The headers included are the following (which I guess are needed to use the SSE intrinsics):
#include <time.h>
#include <stdio.h>
#include <xmmintrin.h>
#include <pmmintrin.h>
#include <sys/time.h> // for timing
Then I have a function to check what time it is, to compare time between calculations:
double now(){
    struct timeval t;
    double f_t;
    gettimeofday(&t, NULL);
    f_t = t.tv_usec;
    f_t = f_t/1000000.0;
    f_t += t.tv_sec;
    return f_t;
}
The function to do the calculation in the "scalar" sense is the following:
void run_scalar(){
    unsigned int i;
    for( i = 0; i < N; i++ ){
        rs[i] = (a[i]+b[i])*c[i];
    }
}
Here is the code for the SSE2 function:
void run_sse2(){
    unsigned int i;
    __m128 *mm_a = (__m128 *)a;
    __m128 *mm_b = (__m128 *)b;
    __m128 *mm_c = (__m128 *)c;
    __m128 *mm_r = (__m128 *)rv;
    for( i = 0; i < N/4; i++)
        mm_r[i] = _mm_mul_ps(_mm_add_ps(mm_a[i],mm_b[i]),mm_c[i]);
}
The vectors are defined the following way (N is the size of the vectors and it is defined elsewhere) and a function init() is called to initialize them:
float a[N] __attribute__((aligned(16)));
float b[N] __attribute__((aligned(16)));
float c[N] __attribute__((aligned(16)));
float rs[N] __attribute__((aligned(16)));
float rv[N] __attribute__((aligned(16)));

void init(){
    unsigned int i;
    for( i = 0; i < N; i++ ){
        a[i] = (float)rand () / RAND_MAX / N;
        b[i] = (float)rand () / RAND_MAX / N;
        c[i] = (float)rand () / RAND_MAX / N;
    }
}
Finally, here is the main function that calls the other functions and prints the results and computation times:
int main(){
    double t;
    init();

    t = now();
    run_scalar();
    t = now()-t;
    printf("S = %10.9f Time for scalar code: %f second(s)\n", 1e5*sum(rs), t);

    t = now();
    run_sse2();
    t = now()-t;
    printf("S = %10.9f Time for vector code 2: %f second(s)\n", 1e5*sum(rv), t);
}
For some reason, if I compile this code with "gcc -o vec vectorial.c -msse -msse2 -msse3" or "mingw32-gcc -o vec vectorial.c -msse -msse2 -msse3", it compiles without any problems, but I can't run it on my Windows machine: in the command prompt I get "access denied", and a big message appears on the screen saying "This app can't run on your PC, to find a version for your PC, check with the software publisher".
I don't really understand what is going on, nor do I have much experience with MinGW or C (just an introductory course to C++ done on Linux machines). I've tried playing around with different headers because I thought maybe I was targeting a different processor than the one on my PC, but I couldn't solve the issue. Most of the info I found was confusing.
Can someone help me understand what is going on? Is it a problem with the MinGW configuration, which might be compiling for a Linux target? Is it something in the code that doesn't have an equivalent on Windows?
I'm trying to run it on a 64-bit Windows 8.1 PC.
Edit: Tried the configuration suggested in the site linked below. The output remains the same.
If I try to run through MSYS I get a "Bad File number"
If I try to run through the command prompt I get Access is Denied.
I'm guessing there's some sort of bug arising from permissions. Tried turning off the antivirus and User Account control but still no luck.
Any ideas?
There is nothing wrong with your code, besides that you did not provide the definitions of sum() or N, which is, however, not a problem. The switches -msse -msse2 appear not to be required.
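For reference, here is the minimal definition I assumed for the missing pieces (a guess at what sum() does, not the OP's actual code):

#define N 1048576  /* vector length; a multiple of 4 so run_sse2() covers it all */

/* Assumed helper: accumulate a result vector so it can be printed. */
float sum(float *v){
    float s = 0;
    unsigned int i;
    for (i = 0; i < N; i++)
        s += v[i];
    return s;
}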
I was able to compile and run your code on Linux (Ubuntu x86_64, compiled with gcc 4.8.2 and 4.6.3, on Atom D2700 and AMD Athlon LE-1640) and Windows7/64 (compiled with gcc 4.5.3 (32bit) and 4.8.2 (64bit), on Core i3-4330 and Core i7-4960X). It was running without problem.
Are you sure your CPU supports the required instructions? What exactly was the error code you got? Which MinGW configuration did you use? Out of curiosity, I used the one available at http://win-builds.org/download.html which was very straightforward.
However, using the optimization flag -O3 created the best result -- with the scalar loop! Also useful are -m64 -mtune=native -s.
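Combined, that is an invocation along the lines of:

gcc -O3 -m64 -mtune=native -s -o vec vectorial.c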
I have
#include <stdlib.h>
#include <time.h>
#include <sys/time.h>
#include <stdio.h>
#include <math.h>

int fib(int n) {
    return n < 2 ? n : fib(n-1) + fib(n-2);
}

double clock_now()
{
    struct timeval now;
    gettimeofday(&now, NULL);
    return (double)now.tv_sec + (double)now.tv_usec/1.0e6;
}

#define NITER 5
and in my main(), I'm doing a simple benchmark like this:
printf("hi\n");
double t = clock_now();
int f = 0;
double tmin = INFINITY;
for (int i=0; i<NITER; ++i) {
printf("run %i, %f\n", i, clock_now()-t);
t = clock_now();
f += fib(40);
t = clock_now()-t;
printf("%i %f\n", f, t);
if (t < tmin) tmin = t;
t = clock_now();
}
printf("fib,%.6f\n", tmin*1000);
When I compile with clang -O3 (LLVM 5.0 from Xcode 5.0.1), it always prints out zero time, except at the start of the for loop, i.e. this:
hi
run 0, 0.866536
102334155 0.000000
run 1, 0.000001
204668310 0.000000
run 2, 0.000000
307002465 0.000000
run 3, 0.000000
409336620 0.000000
run 4, 0.000001
511670775 0.000000
fib,0.000000
It seems that it statically precalculates the fib(40) and stores it somewhere. Right? The strange lag at the beginning (0.8 secs) is probably because it loads that cache?
I'm doing this for benchmarking. The C compiler should optimize fib() itself as much as it can. However, I don't want it to precalculate it already at compile time. So basically I want all code optimized as heavily as possible, but not main() (or at least not this specific optimization). Can I do that somehow?
What optimization is it anyway in this specific case? It's somehow strange and quite nice.
I found a solution by marking certain data volatile. Specifically, what I did was:
volatile int f = 0;
//...
volatile int FibArg = 40;
f += fib(FibArg);
That way, the compiler is forced to read FibArg when calling the function and cannot assume that it is constant, so it must actually call the function to calculate the result.
The volatile int f was not necessary for my compiler at the moment, but it might be in the future, when the compiler figures out that fib has no side effects and that neither its result nor f is ever used.
Note that this is still not the end. A future compiler could have advanced so far that it guesses that 40 is a likely argument for fib. Maybe it builds a database of likely values, and for the most likely ones it builds up a small cache. When fib is called, it does a fast runtime check whether it has that value cached. Of course, the runtime check adds some overhead, but maybe the compiler estimates that this overhead is minor for some particular code in relation to the speed gained by the caching.
I'm not sure if a compiler will ever do such optimization but it could. Profile Guided Optimization (PGO) goes already in that direction.
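A related trick, not used in the answer above but common in benchmark harnesses, is an empty inline-asm statement that makes a value opaque to the optimizer without the per-access cost of volatile (a sketch for GCC/clang):

/* Optimization barrier: the empty asm claims to read and modify *x, so the
   compiler can neither constant-fold x nor discard a result stored in it. */
static inline void opaque(int *x)
{
    __asm__ volatile("" : "+r"(*x));
}

/* Usage: hide the argument from constant propagation, then keep the result. */
int arg = 40;
opaque(&arg);   /* the compiler no longer knows arg == 40 */
int f = fib(arg);
opaque(&f);     /* the result must actually be computed */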