This is my first question ;-)
I am trying to use AVX in a CUDA application (ccminer), but nvcc reports an error:
/usr/local/cuda/bin/nvcc -Xcompiler "-Wall -mavx" -O3 -I . -Xptxas "-abi=no -v" -gencode=arch=compute_50,code=\"sm_50,compute_50\" --maxrregcount=80 --ptxas-options=-v -I./compat/jansson -o x11/x11.o -c x11/x11.cu
/usr/lib/gcc/x86_64-linux-gnu/4.8/include/avxintrin.h(118): error: identifier "__builtin_ia32_addpd256" is undefined
[...]
This is just the first error; there are many more 'undefined' builtin functions.
Everything is fine for C/C++ sources (.c or .cpp files), but .cu files give this error. What am I doing wrong? I can compile ccminer, but I cannot add AVX intrinsics to .cu files, only to .c files. I am using the Intel intrinsics, not the gcc builtins.
Any help greatly appreciated. Thanks in advance.
Linux Mint (Ubuntu 13) 64-bit, gcc 4.8.1, CUDA 6.5.
I do not expect AVX to work on the GPU. The .cu file contains a small portion of CPU-based code which I want to vectorize.
Here is an example that reproduces the error. I took the simplest example from:
http://computer-graphics.se/hello-world-for-cuda.html
I added this line at the beginning:
#include <immintrin.h>
and tried to compile with the command:
nvcc cudahello.cu -Xcompiler -mavx
got an error:
/usr/lib/gcc/x86_64-linux-gnu/4.8/include/avxintrin.h(118): error:
identifier "__builtin_ia32_addpd256" is undefined
The same code without #include <immintrin.h>
compiles without problems.
Here is the whole code:
#include <stdio.h>
#include <stdlib.h>   // for EXIT_SUCCESS
#if defined(__AVX__)
#include <immintrin.h>
#endif

const int N = 16;
const int blocksize = 16;

__global__
void hello(char *a, int *b)
{
    a[threadIdx.x] += b[threadIdx.x];
}

int main()
{
    char a[N] = "Hello \0\0\0\0\0\0";
    int b[N] = {15, 10, 6, 0, -11, 1, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0};
    char *ad;
    int *bd;
    const int csize = N*sizeof(char);
    const int isize = N*sizeof(int);

    printf("%s", a);

    cudaMalloc( (void**)&ad, csize );
    cudaMalloc( (void**)&bd, isize );
    cudaMemcpy( ad, a, csize, cudaMemcpyHostToDevice );
    cudaMemcpy( bd, b, isize, cudaMemcpyHostToDevice );

    dim3 dimBlock( blocksize, 1 );
    dim3 dimGrid( 1, 1 );
    hello<<<dimGrid, dimBlock>>>(ad, bd);
    cudaMemcpy( a, ad, csize, cudaMemcpyDeviceToHost );
    cudaFree( ad );
    cudaFree( bd );

    printf("%s\n", a);
    return EXIT_SUCCESS;
}
Compile with
nvcc cudahello.cu -Xcompiler -mavx
to reproduce the error, or with
nvcc cudahello.cu
to compile cleanly.
I think I have an answer. Functions like
__builtin_ia32_addpd256
are built into gcc, and nvcc does not know about them. Since they are declared in immintrin.h, nvcc reports errors when compiling a .cu file that includes immintrin.h. So we cannot mix CUDA features with gcc builtin functions in one file.
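A common workaround is to keep the vectorized host code in its own .c (or .cpp) file that gcc compiles with -mavx, and call it from the .cu file through a plain declaration, so nvcc never sees immintrin.h. A minimal sketch with hypothetical file and function names:

/* avx_part.c -- compiled by gcc with -mavx, never passed to nvcc */
#include <immintrin.h>

void add4d_avx(const double *a, const double *b, double *out)
{
    __m256d va = _mm256_loadu_pd(a);              /* load 4 doubles */
    __m256d vb = _mm256_loadu_pd(b);
    _mm256_storeu_pd(out, _mm256_add_pd(va, vb)); /* out[i] = a[i] + b[i] */
}

In the .cu file only a declaration is needed (no immintrin.h there):

extern "C" void add4d_avx(const double *a, const double *b, double *out);

Then compile the two parts separately and let nvcc link them, for example:

gcc -c -O3 -mavx avx_part.c
nvcc -c x11/x11.cu -o x11/x11.o
nvcc -o myprog x11/x11.o avx_part.o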
This issue was actually fixed with CUDA 8: with the nvcc version shipping with CUDA 8 I can compile code that contains AVX intrinsics, which I could not with older versions.
I am trying to implement stack canaries manually and without the standard library. Therefore I have created a simple PoC with the help of this guide from the OSDev wiki. The article suggests that a simple implementation must provide the __stack_chk_guard variable and the __stack_chk_fail() handler.
However, when I compile using GCC and provide the -fstack-protector-all flag, the executable does not contain any stack canary check at all. What am I missing to get GCC to include the stack canary logic?
gcc -Wall -nostdlib -nodefaultlibs -fstack-protector-all -g -m64 -o poc main.c customlib.h
main.c
#include "customlib.h"
#define STACK_CHK_GUARD (0xDEADBEEFFFFFFFF & ~0xFF)
uintptr_t __stack_chk_guard = STACK_CHK_GUARD;
__attribute__((noreturn)) void __stack_chk_fail()
{
    _exit(123);
    while(1);
}

int main()
{
    __attribute__((unused)) char buffer[16];
    for (size_t index = 0; index < 32; index++)
    {
        buffer[index] = 'A';
    }
    return 0;
}
customlib.h
This code is mostly irrelevant and is just necessary so that the program can be compiled and linked correctly.
typedef unsigned long int size_t;
typedef unsigned long int uintptr_t;
size_t __syscall(size_t arg1, size_t arg2, size_t arg3, size_t arg4, size_t arg5, size_t arg6)
{
    asm("int $0x80\n"
        : "=a"(arg1)
        : "a"(arg1), "b"(arg2), "c"(arg3), "d"(arg4), "S"(arg5), "D"(arg6));
    return arg1;
}

void _exit(int exit_code)
{
    __syscall(1, exit_code, 0, 0, 0, 0);
    while(1);
}

extern int main();

void _start()
{
    main();
    _exit(0);
}
GCC version 10.2.0, Linux 5.10.36-2-MANJARO GNU/Linux
It looks like the Arch gcc package (which the Manjaro package is based on) is turning off -fstack-protector when building without the standard library (Done for Arch bug 64270).
This behavior is apparently also present in Gentoo.
I haven't tried this, but I believe you should be able to dump the GCC specs to a file with gcc -dumpspecs, keep only the *cc1_options section, remove %{nostdlib|nodefaultlibs|ffreestanding:-fno-stack-protector} from it, and pass the file to gcc with gcc -specs=your_spec_file.
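A sketch of that workflow (untested, as noted; the file name your_spec_file is arbitrary):

gcc -dumpspecs > your_spec_file
# edit your_spec_file: keep only the *cc1_options section and delete the fragment
#   %{nostdlib|nodefaultlibs|ffreestanding:-fno-stack-protector}
gcc -specs=your_spec_file -Wall -nostdlib -nodefaultlibs -fstack-protector-all -g -m64 -o poc main.c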
Alternately, you can rebuild the gcc package with this patch removed.
I'm trying to run this code on an NVIDIA GPU and it returns strange values. It consists of two modules, main.cu and exmodul.cu (listed below). For building I'm using:
nvcc -dc -arch sm_35 main.cu
nvcc -dc -arch sm_35 exmodul.cu
nvcc -arch sm_35 -lcudart -o main main.o exmodul.o
If I run it, I get a strange last line. gd should be 1.
result=0
result=0
result=0
result=0
gd=-0.5
When I change the 1.0 in exmodul.cu to a number greater than 1.000000953 or below 0.999999999999999945, it returns the proper result.
When I change the 1.1 in exmodul.cu it also fails, except for the value 1.0.
The behavior doesn't depend on the constant 2.0 in the same module.
When I use another function instead of cos, like sin or exp, it works properly.
Using double q = cos(1.1); has no effect.
When I copy the function extFunc() into main.cu it works properly.
If I uncomment *gd=1.0; in main.cu it returns the correct 1.0.
Tested on an NVIDIA GT750M and a GeForce GTX TITAN Black (the GT750M returns a different value, gd=6.1232329394368592e-17, but it is still wrong). OS: Debian Jessie.
$ nvcc --version
nvcc: NVIDIA (R) Cuda compiler driver
Copyright (c) 2005-2013 NVIDIA Corporation
Built on Thu_Mar_13_11:58:58_PDT_2014
Cuda compilation tools, release 6.0, V6.0.1
Have you got any idea what is wrong?
Thanks, Lukas
main.cu
#include <stdio.h>   // printf
#include "exmodul.h" // extFunc

__global__ void mykernel(double *gd);
void deviceCheck();

int main(int argc, char *argv[])
{
    double gd, *d_gd;
    cudaMalloc(&d_gd, sizeof(double)); deviceCheck();
    mykernel<<<1,1>>>(d_gd); deviceCheck();
    cudaMemcpy(&gd, d_gd, sizeof(double), cudaMemcpyDeviceToHost);
    deviceCheck();
    cudaFree(d_gd); deviceCheck();
    fprintf(stderr, "gd=%.17g\n", gd);
    return 0;
}

void deviceCheck()
{
    cudaError_t result = cudaSuccess;
    cudaDeviceSynchronize();
    result = cudaGetLastError();
    fprintf(stderr, "result=%d\n", result); fflush(stderr);
}

__global__ void mykernel(double *gd)
{
    *gd = extFunc();
    //*gd = 1.0;
    __syncthreads();
    return;
}
exmodul.cu
#include "exmodul.h"
__device__ double extFunc()
{
double q = 1.1;
q = cos(q);
if(q<2.0) { q = 1.0; }
return q;
}
exmodul.h
__device__ double extFunc();
I was able to reproduce the problematic behavior on a supported config (CUDA 6.5, CentOS 6.2, K40).
When I switched from CUDA 6.5 to the CUDA 7 RC, the problem went away.
The problem also did not appear to be reproducible on an older config (CUDA 5.5, CentOS 6.2, M2070).
I suggest switching to the CUDA 7 RC to address this issue. I suspect an underlying bug in the compilation process that has already been fixed.
I'm learning the basics of SIMD so I was given a simple code snippet to see the principle at work with SSE and SSE2.
I recently installed MinGW to compile C code on Windows with gcc instead of the Visual Studio compiler.
The objective of the example is to add two floats and then multiply by a third one.
The included headers are the following (which I guess are needed to use the SSE intrinsics):
#include <time.h>
#include <stdio.h>
#include <xmmintrin.h>
#include <pmmintrin.h>
#include <time.h>
#include <sys/time.h> // for timing
Then I have a function to check what time it is, to compare time between calculations:
double now(){
    struct timeval t; double f_t;
    gettimeofday(&t, NULL);
    f_t = t.tv_usec; f_t = f_t/1000000.0; f_t += t.tv_sec;
    return f_t;
}
The function to do the calculation in the "scalar" sense is the following:
void run_scalar(){
    unsigned int i;
    for( i = 0; i < N; i++ ){
        rs[i] = (a[i]+b[i])*c[i];
    }
}
Here is the code for the sse2 function:
void run_sse2(){
    unsigned int i;
    __m128 *mm_a = (__m128 *)a;
    __m128 *mm_b = (__m128 *)b;
    __m128 *mm_c = (__m128 *)c;
    __m128 *mm_r = (__m128 *)rv;
    for( i = 0; i < N/4; i++)
        mm_r[i] = _mm_mul_ps(_mm_add_ps(mm_a[i], mm_b[i]), mm_c[i]);
}
The vectors are defined as follows (N is the size of the vectors and is defined elsewhere), and a function init() is called to initialize them:
float a[N] __attribute__((aligned(16)));
float b[N] __attribute__((aligned(16)));
float c[N] __attribute__((aligned(16)));
float rs[N] __attribute__((aligned(16)));
float rv[N] __attribute__((aligned(16)));
void init(){
    unsigned int i;
    /* rand() and RAND_MAX come from <stdlib.h> */
    for( i = 0; i < N; i++ ){
        a[i] = (float)rand () / RAND_MAX / N;
        b[i] = (float)rand () / RAND_MAX / N;
        c[i] = (float)rand () / RAND_MAX / N;
    }
}
Finally, here is the main function that calls the other functions and prints the results and the computation time.
int main(){
    double t;

    init();

    t = now();
    run_scalar();
    t = now()-t;
    printf("S = %10.9f Temps du code scalaire : %f seconde(s)\n", 1e5*sum(rs), t);

    t = now();
    run_sse2();
    t = now()-t;
    printf("S = %10.9f Temps du code vectoriel 2: %f seconde(s)\n", 1e5*sum(rv), t);
}
For some reason, if I compile this code with "gcc -o vec vectorial.c -msse -msse2 -msse3" or "mingw32-gcc -o vec vectorial.c -msse -msse2 -msse3", it compiles without any problems, but I can't run it on my Windows machine: in the command prompt I get "access denied", and a big message appears on the screen saying "This app can't run on your PC; to find a version for your PC, check with the software publisher".
I don't really understand what is going on, nor do I have much experience with MinGW or C (just an introductory course to C++ done on Linux machines). I've tried playing around with different headers because I thought maybe I was targeting a different processor than the one in my PC, but I couldn't solve the issue. Most of the info I found was confusing.
Can someone help me understand what is going on? Is it a problem with the MinGW configuration, such that it is compiling for a Linux target? Is it something in the code that doesn't have an equivalent on Windows?
I'm trying to run it on a 64-bit Windows 8.1 PC.
Edit: Tried the configuration suggested in the site linked below. The output remains the same.
If I try to run it through MSYS I get a "Bad file number" error.
If I try to run it through the command prompt I get "Access is denied".
I'm guessing there's some sort of permissions problem. I tried turning off the antivirus and User Account Control, but still no luck.
Any ideas?
There is nothing wrong with your code; you did not provide the definitions of sum() or N, but that is not a problem here. The switches -msse -msse2 do not appear to be required.
I was able to compile and run your code on Linux (Ubuntu x86_64, compiled with gcc 4.8.2 and 4.6.3, on an Atom D2700 and an AMD Athlon LE-1640) and on Windows 7/64 (compiled with gcc 4.5.3 (32-bit) and 4.8.2 (64-bit), on a Core i3-4330 and a Core i7-4960X). It ran without problems.
Are you sure your CPU supports the required instructions? What exactly was the error code you got? Which MinGW configuration did you use? Out of curiosity, I used the one available at http://win-builds.org/download.html which was very straightforward.
However, using the optimization flag -O3 created the best result -- with the scalar loop! Also useful are -m64 -mtune=native -s.
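If you want to double-check at run time that your CPU really supports the instructions, GCC 4.8 and later provide __builtin_cpu_supports; a small standalone sketch (not part of your program):

#include <stdio.h>

int main(void)
{
    __builtin_cpu_init();   /* initialize CPU feature detection */
    printf("SSE:  %s\n", __builtin_cpu_supports("sse")  ? "yes" : "no");
    printf("SSE2: %s\n", __builtin_cpu_supports("sse2") ? "yes" : "no");
    printf("SSE3: %s\n", __builtin_cpu_supports("sse3") ? "yes" : "no");
    return 0;
}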
Here's my code:
#include <stdio.h>
#include <CL/cl.h>
#include <CL/cl_platform.h>

int main(){
    cl_float3 f3 = (cl_float3){1, 1, 1};
    cl_float3 f31 = (cl_float3){2, 2, 2};
    cl_float3 f32 = (cl_float3){2, 2, 2};

    f3 = f31 + f32;

    printf("%g %g %g \n", f3.x, f3.y, f3.z);
    return 0;
}
When compiling with gcc 4.6, it produces the error
test.c:14:11: error: invalid operands to binary + (have ‘cl_float3’ and ‘cl_float3’)
This is very strange to me, because the OpenCL Specification demonstrates exactly that in section 6.4: an addition of two floatn values. Do I need to include any other headers?
Even stranger, when compiling with -std=c99 I get errors like
test.c:16:26: error: ‘cl_float3’ has no member named ‘x’
...and similar errors for all components (x, y, and z).
The reason for the compilation problem with structure subscripts can be seen in the implementation of the standard in the AMD SDK.
If you look at the <CL/cl_platform.h> header from the AMD toolkit you can see how the structures are defined.
typedef cl_float4 cl_float3;

typedef union
{
    cl_float CL_ALIGNED(16) s[4];
#if (defined( __GNUC__) || defined( __IBMC__ )) && ! defined( __STRICT_ANSI__ )
    __extension__ struct{ cl_float x, y, z, w; };
    ....
#endif
} cl_float4;
The #if clause is skipped when gcc is invoked with --std=c99, because strict ISO modes define __STRICT_ANSI__, so the anonymous struct with the x, y, z, w members is never declared.
To make your code work with --std=c99 you could replace references to f3.x with f3.s[0] and so on.
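For instance, here is a version of your snippet that goes through .s[] only (a sketch; it avoids both the vector + and the anonymous-struct members, so it builds as plain C99):

#include <stdio.h>
#include <CL/cl.h>
#include <CL/cl_platform.h>

int main(void)
{
    cl_float3 f31 = {{2, 2, 2}};
    cl_float3 f32 = {{2, 2, 2}};
    cl_float3 f3;

    /* element-wise addition through the .s[] array member */
    for (int i = 0; i < 3; i++)
        f3.s[i] = f31.s[i] + f32.s[i];

    printf("%g %g %g\n", f3.s[0], f3.s[1], f3.s[2]);
    return 0;
}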
OpenCL programs consist of two parts.
A program which runs on the host. This is normally written in C or C++, but it's nothing special except that it uses the API described in sections 4 & 5 of the OpenCL Specification.
A kernel which runs on the OpenCL device (normally a GPU). This is written in the language specified in section 6. This isn't C, but it's close. It adds things like vector operations (like you're trying to use). This is compiled by the host program passing a string which contains the kernel code to OpenCL via the API.
You've confused the two, and tried to use features of the kernel language in the host code.
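For contrast, the vector addition you wrote is perfectly legal in the kernel language. A minimal sketch of such a kernel (this source would be handed to OpenCL as a string and built at run time, not compiled by gcc):

__kernel void vec_add(__global const float3 *a,
                      __global const float3 *b,
                      __global float3 *out)
{
    size_t i = get_global_id(0);
    out[i] = a[i] + b[i];   /* float3 + float3 is valid in OpenCL C */
}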
The .v4 member of cl_float4 is another option:
#include <assert.h>
#include <stdlib.h> /* for EXIT_SUCCESS */
#include <CL/cl.h>

int main(void) {
    cl_float4 f = {{1, 2, 3, 4}};
    cl_float4 g = {{5, 6, 7, 8}};
    cl_float4 h;
    h.v4 = f.v4 + g.v4;
    assert(h.s[0] == 6);
    assert(h.s[1] == 8);
    return EXIT_SUCCESS;
}
which can be run as:
gcc -std=c89 -Wall -Wextra tmp.c -lOpenCL && ./a.out
in Ubuntu 16.10, gcc 6.2.0.
The .v4 member is defined on Linux with GCC on x86 via GCC vector extensions.
The file https://github.com/KhronosGroup/OpenCL-Headers/blob/bf0f43b76f4556c3d5717f8ba8a01216b27f4af7/cl_platform.h contains:
#if defined( __SSE__ )
[...]
#if defined( __GNUC__ )
typedef float __cl_float4 __attribute__((vector_size(16)));
[...]
#define __CL_FLOAT4__ 1
and then:
typedef union
{
[...]
#if defined( __CL_FLOAT4__)
__cl_float4 v4;
#endif
}cl_float4;
Not sure this barrage of ifdefs was a good move by Khronos, but it is what we have.
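For what it's worth, here is a minimal standalone sketch of the GCC vector extension those headers rely on, independent of OpenCL:

#include <stdio.h>

typedef float v4sf __attribute__((vector_size(16)));  /* 4 packed floats */

union f4 { v4sf v; float s[4]; };  /* same union trick the CL headers use */

int main(void)
{
    union f4 a = { {1, 2, 3, 4} };
    union f4 b = { {5, 6, 7, 8} };
    union f4 c;
    c.v = a.v + b.v;   /* element-wise addition on the vector member */
    printf("%g %g %g %g\n", c.s[0], c.s[1], c.s[2], c.s[3]);
    return 0;
}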
I recommend that you just always use .s[0], which is the most portable option. We shouldn't need to speed up the host with SIMD if we are focusing on the GPU...
C11 anonymous structs
The error error: ‘cl_float3’ has no member named ‘x’ happens because of the lines mentioned at: https://stackoverflow.com/a/10981639/895245
More precisely, this feature is called "anonymous struct", and it is an extension that was standardized in C11.
So in theory it should also work with -std=c11, but it currently doesn't because the CL headers weren't updated to check for C11, see also: https://github.com/KhronosGroup/OpenCL-Headers/issues/18
My code is trying to find the entropy of a signal (stored in 'data' and 'interframe' - in the full code these would contain the signal, here I've just put in some random values). When I compile with 'gcc temp.c' it compiles and runs fine.
Output:
entropy: 40.174477
features: 0022FD06
features[0]: 40
entropy: 40
But when I compile with 'gcc -mstackrealign -msse -Os -ftree-vectorize temp.c' it compiles, but execution fails at the line marked in the code below (line 48 in the original file). All four flags are needed for it to fail; with any three of them it runs fine.
The code probably looks weird: I've chopped just the failing bits out of a much bigger program. I only have the foggiest idea of what the compiler flags do; someone else put them in (there are usually more of them, but I worked out that these were the bad ones).
All help much appreciated!
#include <stdint.h>
#include <inttypes.h>
#include <stdio.h>
#include <math.h>

static void calc_entropy(volatile int16_t *features, const int16_t* data,
                         const int16_t* interframe, int frame_length);

int main()
{
    int frame_length = 128;
    int16_t data[128] = {1, 2, 3, 4};
    int16_t interframe[128] = {1, 1, 1};
    int16_t a = 0;
    int16_t* features = &a;

    calc_entropy(features, data, interframe, frame_length);
    features += 1;

    fprintf(stderr, "\nentropy: %d", a);
    return 0;
}

static void calc_entropy(volatile int16_t *features, const int16_t* data,
                         const int16_t* interframe, int frame_length)
{
    float histo[65536] = {0};
    float* histo_zero = histo + 32768;
    volatile float entropy = 0.0f;
    int i;

    for(i=0; i<frame_length; i++){
        histo_zero[data[i]]++;
        histo_zero[interframe[i]]++;
    }

    for(i=-32768; i < 32768; i++){
        if(histo_zero[i])
            entropy -= histo_zero[i]*logf(histo_zero[i]/(float)(frame_length*2));
    }

    fprintf(stderr, "\nentropy: %f", entropy);
    fprintf(stderr, "\nfeatures: %p", features);
    features[0] = entropy; //execution fails here
    fprintf(stderr, "\nfeatures[0]: %d", features[0]);
}
Edit: I'm using gcc 4.5.2, on x86. Also, if I compile and run it in VirtualBox running Ubuntu (gcc -lm -mstackrealign -msse -Os -ftree-vectorize temp.c) it executes correctly.
Edit2: I get
entropy: 40.174477
features: 00000000
and then a message from Windows telling me that the program has stopped running.
Edit3: In the five months since I originally posted the question I've updated to gcc 4.7.0, and the code now runs fine. I went back to gcc 4.5.2, and it failed. Still don't know why!
ottavio@magritte:/tmp$ gcc x.c -o x -lm -mstackrealign -msse -Os -ftree-vectorize
ottavio@magritte:/tmp$ ./x
entropy: 40.174477
features: 0x7fff5fe151ce
features[0]: 40
entropy: 40
ottavio@magritte:/tmp$ gcc x.c -o x -lm
ottavio@magritte:/tmp$ ./x
entropy: 40.174477
features: 0x7fffd7eff73e
features[0]: 40
entropy: 40
ottavio@magritte:/tmp$
So, what's wrong with it? gcc 4.6.1 and x86_64 architecture.
It seems to run here as well, and the only thing I see that might be funky is that you are storing a 32-bit float (entropy) into a 16-bit value (features[0]):
features[0] = entropy; //execution fails here
which of course will truncate it.
It shouldn't matter, but for the heck of it, see if it makes any difference if you change your int16_t values to int32_t values.
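For illustration (a standalone sketch, not part of the original program), the implicit float-to-int16_t conversion simply drops the fractional part:

#include <stdint.h>
#include <stdio.h>

int main(void)
{
    float entropy = 40.174477f;
    int16_t feature = entropy;               /* implicit conversion truncates toward zero */
    printf("%f -> %d\n", entropy, feature);  /* prints 40.174477 -> 40 */
    return 0;
}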