My code is trying to find the entropy of a signal (stored in 'data' and 'interframe' - in the full code these would contain the signal, here I've just put in some random values). When I compile with 'gcc temp.c' it compiles and runs fine.
Output:
entropy: 40.174477
features: 0022FD06
features[0]: 40
entropy: 40
But when I compile with 'gcc -mstackrealign -msse -Os -ftree-vectorize temp.c' it compiles, but fails to execute past the line marked '//execution fails here' below. It needs all four flags in order to fail - with any three of them it runs fine.
The code probably looks weird - I've chopped just the failing bits out of a much bigger program. I only have the foggiest idea of what the compiler flags do; someone else put them in (and there are usually more of them, but I worked out that these were the bad ones).
All help much appreciated!
#include <stdint.h>
#include <inttypes.h>
#include <stdio.h>
#include <math.h>

static void calc_entropy(volatile int16_t *features, const int16_t *data,
                         const int16_t *interframe, int frame_length);

int main()
{
    int frame_length = 128;
    int16_t data[128] = {1, 2, 3, 4};
    int16_t interframe[128] = {1, 1, 1};
    int16_t a = 0;
    int16_t *features = &a;

    calc_entropy(features, data, interframe, frame_length);
    features += 1;

    fprintf(stderr, "\nentropy: %d", a);
    return 0;
}

static void calc_entropy(volatile int16_t *features, const int16_t *data,
                         const int16_t *interframe, int frame_length)
{
    float histo[65536] = {0};
    float *histo_zero = histo + 32768;
    volatile float entropy = 0.0f;
    int i;

    for (i = 0; i < frame_length; i++) {
        histo_zero[data[i]]++;
        histo_zero[interframe[i]]++;
    }

    for (i = -32768; i < 32768; i++) {
        if (histo_zero[i])
            entropy -= histo_zero[i] * logf(histo_zero[i] / (float)(frame_length * 2));
    }

    fprintf(stderr, "\nentropy: %f", entropy);
    fprintf(stderr, "\nfeatures: %p", features);
    features[0] = entropy; //execution fails here
    fprintf(stderr, "\nfeatures[0]: %d", features[0]);
}
Edit: I'm using gcc 4.5.2, with x86 architecture. Also, if I compile and run it on VirtualBox running ubuntu (gcc -lm -mstackrealign -msse -Os -ftree-vectorize temp.c) it executes correctly.
Edit2: I get
entropy: 40.174477
features: 00000000
and then a message from Windows telling me that the program has stopped running.
Edit3: In the five months since I originally posted the question I've updated to gcc 4.7.0, and the code now runs fine. I went back to gcc 4.5.2, and it failed. Still don't know why!
ottavio#magritte:/tmp$ gcc x.c -o x -lm -mstackrealign -msse -Os -ftree-vectorize
ottavio#magritte:/tmp$ ./x
entropy: 40.174477
features: 0x7fff5fe151ce
features[0]: 40
entropy: 40
ottavio#magritte:/tmp$ gcc x.c -o x -lm
ottavio#magritte:/tmp$ ./x
entropy: 40.174477
features: 0x7fffd7eff73e
features[0]: 40
entropy: 40
ottavio#magritte:/tmp$
So, what's wrong with it? gcc 4.6.1 and x86_64 architecture.
It seems to be running here as well, and the only thing I see that might be funky is that you are assigning a 32-bit float (entropy) to a 16-bit integer (features[0]):
features[0] = entropy; //execution fails here
which of course will truncate it.
It shouldn't matter, but for the heck of it, see if it makes any difference if you change your int16_t values to int32_t values.
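If the narrowing bothers you, here's a minimal sketch (a hypothetical helper, not a fix for the crash) that makes the conversion explicit:

#include <math.h>
#include <stdint.h>

/* hypothetical helper: clamp and round before storing into an int16_t,
   so the narrowing from float is explicit instead of silent */
static int16_t float_to_int16(float f)
{
    if (f >= 32767.0f)
        return INT16_MAX;
    if (f <= -32768.0f)
        return INT16_MIN;
    return (int16_t)lrintf(f); /* round to nearest rather than truncate */
}

/* usage: features[0] = float_to_int16(entropy); */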
I thought I'd first share this here to get your opinions before doing anything else. I found out while designing an algorithm that the performance of gcc-compiled code for some simple code was catastrophic compared to clang's.
How to reproduce
Create a test.c file containing this code:
#include <sys/stat.h>
#include <sys/types.h>
#include <stdio.h>
#include <stdint.h>
#include <stdlib.h>
#include <stdbool.h>
#include <string.h>

int main(int argc, char *argv[]) {
    const uint64_t size = 1000000000;
    const size_t alloc_mem = size * sizeof(uint8_t);
    uint8_t *mem = (uint8_t *)malloc(alloc_mem);
    for (uint_fast64_t i = 0; i < size; i++)
        mem[i] = (uint8_t)(i >> 7);

    uint8_t block = 0;
    uint_fast64_t counter = 0;
    uint64_t total = 0x123456789abcdefllu;
    uint64_t receiver = 0;

    for (block = 1; block <= 8; block++) {
        printf("%u ...\n", block);
        counter = 0;
        while (counter < size - 8) {
            __builtin_memcpy(&receiver, &mem[counter], block);
            receiver &= (0xffffffffffffffffllu >> (64 - ((block) << 3)));
            total += ((receiver * 0x321654987cbafedllu) >> 48);
            counter += block;
        }
    }

    printf("=> %llu\n", total);
    return EXIT_SUCCESS;
}
gcc
Compile and run:
gcc-7 -O3 test.c
time ./a.out
1 ...
2 ...
3 ...
4 ...
5 ...
6 ...
7 ...
8 ...
=> 82075168519762377
real 0m23.367s
user 0m22.634s
sys 0m0.495s
info:
gcc-7 -v
Using built-in specs.
COLLECT_GCC=gcc-7
COLLECT_LTO_WRAPPER=/usr/local/Cellar/gcc/7.3.0/libexec/gcc/x86_64-apple-darwin17.4.0/7.3.0/lto-wrapper
Target: x86_64-apple-darwin17.4.0
Configured with: ../configure --build=x86_64-apple-darwin17.4.0 --prefix=/usr/local/Cellar/gcc/7.3.0 --libdir=/usr/local/Cellar/gcc/7.3.0/lib/gcc/7 --enable-languages=c,c++,objc,obj-c++,fortran --program-suffix=-7 --with-gmp=/usr/local/opt/gmp --with-mpfr=/usr/local/opt/mpfr --with-mpc=/usr/local/opt/libmpc --with-isl=/usr/local/opt/isl --with-system-zlib --enable-checking=release --with-pkgversion='Homebrew GCC 7.3.0' --with-bugurl=https://github.com/Homebrew/homebrew-core/issues --disable-nls
Thread model: posix
gcc version 7.3.0 (Homebrew GCC 7.3.0)
So we get about 23s of user time. Now let's do the same with cc (clang on macOS):
clang
cc -O3 test.c
time ./a.out
1 ...
2 ...
3 ...
4 ...
5 ...
6 ...
7 ...
8 ...
=> 82075168519762377
real 0m9.832s
user 0m9.310s
sys 0m0.442s
info:
Apple LLVM version 9.0.0 (clang-900.0.39.2)
Target: x86_64-apple-darwin17.4.0
Thread model: posix
InstalledDir: /Applications/Xcode.app/Contents/Developer/Toolchains/XcodeDefault.xctoolchain/usr/bin
That's more than 2.5x faster! Any thoughts?
I replaced __builtin_memcpy with plain memcpy to test things out, and this time the compiled code runs in about 34s on both sides - consistently slower, as expected.
It would appear that the combination of __builtin_memcpy and bitmasking is handled very differently by the two compilers.
I had a look at the assembly code, but as I'm not an asm expert I couldn't see anything standing out that would explain such a drop in performance.
Edit 03-05-2018:
Posted this bug: https://gcc.gnu.org/bugzilla/show_bug.cgi?id=84719
I find it suspicious that you get different code for memcpy vs __builtin_memcpy. I don't think that's supposed to happen, and indeed I cannot reproduce it on my (Linux) system.
If you add #pragma GCC unroll 16 (implemented in gcc-8+) before the for loop, gcc gets the same perf as clang (making block a constant is essential to optimize the code), so essentially llvm's unrolling is more aggressive than gcc's, which can be good or bad depending on cases. Still, feel free to report it to gcc, maybe they'll tweak the unrolling heuristics some day and an extra testcase could help.
Once unrolling is taken care of, gcc does ok for some values (block equals 4 or 8 in particular), but much worse for some others, in particular 3. But that's better analyzed with a smaller testcase without the loop on block. Gcc seems to have trouble with memcpy(,,3), it works much better if you always read 8 bytes (the next line already takes care of the extra bytes IIUC). Another thing that could be reported to gcc.
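Putting those two suggestions together, the loop from the question would become something like this (a sketch: the pragma needs gcc-8+, and the fixed 8-byte read relies on the loop bound already leaving 8 bytes of headroom):

/* unroll the outer loop fully so 'block' becomes a compile-time constant
   in each copy, and always copy 8 bytes -- the mask on the next line
   discards the bytes that don't belong to the current block size */
#pragma GCC unroll 16
for (block = 1; block <= 8; block++) {
    counter = 0;
    while (counter < size - 8) {
        __builtin_memcpy(&receiver, &mem[counter], 8); /* fixed-size read */
        receiver &= (0xffffffffffffffffllu >> (64 - ((block) << 3)));
        total += ((receiver * 0x321654987cbafedllu) >> 48);
        counter += block;
    }
}

Fully unrolling the outer loop is what makes block a constant in each copy, which in turn lets gcc specialize the copy.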
I was trying today to check an answer, and I realized that if I use Code::Blocks (with gcc) I have to treat the error differently than when using gcc on the command line (Ubuntu Linux).
The program is like this:
#include <stdio.h>
#include <math.h>

int main(void)
{
    double len, x, y = 0;
    int n = 123456;

    len = floor(log10(abs(n))) + 1;
    x = n / pow(10, len / 2);
    y = n - x * pow(10, len / 2);
    printf("First Half = %f", x);
    printf("\nSecond Half = %f", y);
    return 0;
}
And if I try to compile it I get:
error: implicit declaration of function ‘abs’ [-Werror=implicit-function-declaration]|
So here is the funny thing. I added -lm to Compiler => Global compiler settings => Other settings, but the result is the same.
It works only if I include stdlib.h:
#include <stdio.h>
#include <stdlib.h>
#include <math.h>

int main(void)
{
    double len, x, y = 0;
    int n = 123456;

    len = floor(log10(abs(n))) + 1;
    x = n / pow(10, len / 2);
    y = n - x * pow(10, len / 2);
    printf("First Half = %f", x);
    printf("\nSecond Half = %f", y);
    return 0;
}
But if I use the command line (in a terminal) with the command:
gcc program.c -o program -lm
the program compiles successfully.
My question: why does this happen?
I did some research on the internet and found that people say the abs function is declared in stdlib.h, not math.h. But if I compile on the command line (without including stdlib.h) with -lm, it works. I'm confused.
Short answer: try
gcc -Wall -Wextra -pedantic program.c -o program -lm
or
gcc -Wall -Wextra -Werror -pedantic program.c -o program -lm
to make it fail on warnings, as Code::Blocks seems to do.
Long answer: linking to a library is a completely different matter than including a header file. In C, for historic reasons, it is "allowed" to use a function that has not been declared. In this case the compiler assumes a function returning int and taking whatever arguments you give it. For abs(), these assumptions happen to hold, so later the linker finds the function when linking with libm and everything is fine.
But there are quite a few catches: first, you will miss simple typos if you don't enable warnings. Second, the compiler is unable to check the arguments you give -> crashing program ahead. And even more problems are to be expected if the function returns something other than int.
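A minimal, hypothetical two-file example of that last catch (the names are invented for illustration):

/* file1.c */
#include <stdio.h>

int main(void)
{
    /* No prototype for half() is visible here: the compiler silently
       assumes 'int half()', so the int argument is passed where a
       double is expected and the double result is read as an int. */
    printf("%d\n", half(10)); /* prints garbage, not 5 */
    return 0;
}

/* file2.c */
double half(double x)
{
    return x / 2.0;
}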
abs() is declared in stdlib.h. To use it, include this header. And always enable compiler warnings (Code::Blocks obviously does that for you).
I'm learning the basics of SIMD so I was given a simple code snippet to see the principle at work with SSE and SSE2.
I recently installed minGW to compile C code in windows with gcc instead of using the visual studio compiler.
The objective of the example is to add two floats and then multiply by a third one.
The headers included are the following (which I guess are needed to use the SSE intrinsics):
#include <time.h>
#include <stdio.h>
#include <xmmintrin.h>
#include <pmmintrin.h>
#include <sys/time.h> // for timing
Then I have a function to get the current time, to compare the time taken by the calculations:
double now()
{
    struct timeval t;
    double f_t;
    gettimeofday(&t, NULL);
    f_t = t.tv_usec;
    f_t = f_t / 1000000.0;
    f_t += t.tv_sec;
    return f_t;
}
The function to do the calculation in the "scalar" sense is the following:
void run_scalar()
{
    unsigned int i;
    for (i = 0; i < N; i++) {
        rs[i] = (a[i] + b[i]) * c[i];
    }
}
Here is the code for the sse2 function:
void run_sse2()
{
    unsigned int i;
    __m128 *mm_a = (__m128 *)a;
    __m128 *mm_b = (__m128 *)b;
    __m128 *mm_c = (__m128 *)c;
    __m128 *mm_r = (__m128 *)rv;
    for (i = 0; i < N/4; i++)
        mm_r[i] = _mm_mul_ps(_mm_add_ps(mm_a[i], mm_b[i]), mm_c[i]);
}
The vectors are defined as follows (N is the size of the vectors and is defined elsewhere), and a function init() is called to initialize them:
float a[N] __attribute__((aligned(16)));
float b[N] __attribute__((aligned(16)));
float c[N] __attribute__((aligned(16)));
float rs[N] __attribute__((aligned(16)));
float rv[N] __attribute__((aligned(16)));

void init()
{
    unsigned int i;
    for (i = 0; i < N; i++) {
        a[i] = (float)rand() / RAND_MAX / N;
        b[i] = (float)rand() / RAND_MAX / N;
        c[i] = (float)rand() / RAND_MAX / N;
    }
}
Finally, here is the main that calls the functions and prints the results and the computing time:
int main()
{
    double t;
    init();

    t = now();
    run_scalar();
    t = now() - t;
    printf("S = %10.9f Scalar code time: %f second(s)\n", 1e5 * sum(rs), t);

    t = now();
    run_sse2();
    t = now() - t;
    printf("S = %10.9f Vector code time 2: %f second(s)\n", 1e5 * sum(rv), t);
}
For some reason, if I compile this code with "gcc -o vec vectorial.c -msse -msse2 -msse3" or "mingw32-gcc -o vec vectorial.c -msse -msse2 -msse3" it compiles without any problems, but I can't run it on my Windows machine: in the command prompt I get "access denied", and a big message appears on screen saying "This app can't run on your PC, to find a version for your PC, check with the software publisher".
I don't really understand what is going on, and I don't have much experience with MinGW or C (just an introductory course to C++ done on Linux machines). I've tried playing around with different headers because I thought maybe I was targeting a different processor than the one in my PC, but I couldn't solve the issue. Most of the info I found was confusing.
Can someone help me understand what is going on? Is it a problem in the MinGW configuration, so that it compiles targeting a Linux platform? Is it something in the code that doesn't have an equivalent on Windows?
I'm trying to run it on a 64-bit Windows 8.1 PC.
Edit: I tried the configuration suggested on the site linked below. The output remains the same.
If I try to run it through MSYS I get a "Bad File number" error.
If I try to run it through the command prompt I get "Access is denied".
I'm guessing there's some sort of bug arising from permissions. I tried turning off the antivirus and User Account Control, but still no luck.
Any ideas?
There is nothing wrong with your code, apart from the fact that you did not provide the definitions of sum() and N, which is, however, not a problem. The switches -msse -msse2 appear not to be required.
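For completeness, plausible stand-ins for the missing pieces might look like this (assumed definitions, not the originals):

#define N 1048576 /* assumed size; should be a multiple of 4 for run_sse2() */

/* assumed sum(): plain accumulation over a result vector */
float sum(const float *v)
{
    float s = 0.0f;
    unsigned int i;
    for (i = 0; i < N; i++)
        s += v[i];
    return s;
}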
I was able to compile and run your code on Linux (Ubuntu x86_64, compiled with gcc 4.8.2 and 4.6.3, on Atom D2700 and AMD Athlon LE-1640) and Windows 7/64 (compiled with gcc 4.5.3 (32-bit) and 4.8.2 (64-bit), on Core i3-4330 and Core i7-4960X). It ran without problems.
Are you sure your CPU supports the required instructions? What exactly was the error code you got? Which MinGW configuration did you use? Out of curiosity, I used the one available at http://win-builds.org/download.html which was very straightforward.
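If in doubt about CPU support, a quick check like this should settle it (a sketch; __builtin_cpu_supports requires gcc 4.8 or later):

#include <stdio.h>

int main(void)
{
    /* gcc builtin; returns non-zero if the host CPU has the feature */
    printf("SSE:  %s\n", __builtin_cpu_supports("sse")  ? "yes" : "no");
    printf("SSE2: %s\n", __builtin_cpu_supports("sse2") ? "yes" : "no");
    return 0;
}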
However, using the optimization flag -O3 produced the best result -- with the scalar loop! Also useful are -m64 -mtune=native -s.
This is my first question ;-)
I'm trying to use AVX in a CUDA application (ccminer), but nvcc shows an error:
/usr/local/cuda/bin/nvcc -Xcompiler "-Wall -mavx" -O3 -I . -Xptxas "-abi=no -v" -gencode=arch=compute_50,code=\"sm_50,compute_50\" --maxrregcount=80 --ptxas-options=-v -I./compat/jansson -o x11/x11.o -c x11/x11.cu
/usr/lib/gcc/x86_64-linux-gnu/4.8/include/avxintrin.h(118): error: identifier "__builtin_ia32_addpd256" is undefined
[...]
This is just the first error. There are many 'undefined' builtin functions :-(
Everything is OK for C/C++ programs with .c or .cpp extensions, but with .cu I get the error :-( What am I doing wrong? I can compile ccminer, but I cannot add AVX intrinsics to .cu files, only to .c files. I use Intel intrinsics, not gcc ones.
Any help greatly appreciated. Thanks in advance.
Linux Mint (Ubuntu 13) 64-bit, gcc 4.8.1, CUDA 6.5.
I do not expect AVX to work on the GPU. In the .cu file there is a small portion of CPU-based code which I want to vectorize.
Here is an example to reproduce the error. I took the simplest example from:
http://computer-graphics.se/hello-world-for-cuda.html
I added a line at the beginning:
#include <immintrin.h>
and tried to compile with the command:
nvcc cudahello.cu -Xcompiler -mavx
and got an error:
/usr/lib/gcc/x86_64-linux-gnu/4.8/include/avxintrin.h(118): error:
identifier "__builtin_ia32_addpd256" is undefined
The same code without #include <immintrin.h> compiles without problems.
Here is whole code:
#include <stdio.h>
#include <stdlib.h> /* for EXIT_SUCCESS */
#if defined(__AVX__)
#include <immintrin.h>
#endif

const int N = 16;
const int blocksize = 16;

__global__
void hello(char *a, int *b)
{
    a[threadIdx.x] += b[threadIdx.x];
}

int main()
{
    char a[N] = "Hello \0\0\0\0\0\0";
    int b[N] = {15, 10, 6, 0, -11, 1, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0};
    char *ad;
    int *bd;
    const int csize = N * sizeof(char);
    const int isize = N * sizeof(int);

    printf("%s", a);

    cudaMalloc((void **)&ad, csize);
    cudaMalloc((void **)&bd, isize);
    cudaMemcpy(ad, a, csize, cudaMemcpyHostToDevice);
    cudaMemcpy(bd, b, isize, cudaMemcpyHostToDevice);

    dim3 dimBlock(blocksize, 1);
    dim3 dimGrid(1, 1);
    hello<<<dimGrid, dimBlock>>>(ad, bd);
    cudaMemcpy(a, ad, csize, cudaMemcpyDeviceToHost);
    cudaFree(ad);
    cudaFree(bd);

    printf("%s\n", a);
    return EXIT_SUCCESS;
}
Compile with
nvcc cudahello.cu -Xcompiler -mavx
to get the error or with
nvcc cudahello.cu
to compile cleanly.
I think I have an answer. Functions like
__builtin_ia32_addpd256
are built into gcc, and nvcc does not know about them. Since they are declared in immintrin.h, nvcc reports errors when compiling a .cu file that includes immintrin.h. So we cannot mix CUDA features and gcc builtin functions in one file.
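A workaround that follows from this (a sketch with invented file and function names): keep the intrinsics in a .c file compiled by gcc, and let the .cu file see only a plain declaration, so nvcc never parses immintrin.h.

/* avx_part.c -- compiled by gcc with -mavx */
#include <immintrin.h>

void add4(const double *a, const double *b, double *r)
{
    __m256d va = _mm256_loadu_pd(a);
    __m256d vb = _mm256_loadu_pd(b);
    _mm256_storeu_pd(r, _mm256_add_pd(va, vb));
}

/* cudahello.cu -- only the declaration is needed on the CUDA side */
extern "C" void add4(const double *a, const double *b, double *r);

/* assumed build steps:
   gcc -mavx -c avx_part.c
   nvcc cudahello.cu avx_part.o
*/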
This issue was actually fixed with CUDA 8: with the nvcc version shipping with CUDA 8, I can compile code that contains AVX intrinsics (which I couldn't with older versions).
The source code of square.c is:
#include <stdio.h>
#include <stdlib.h>
#include <time.h>

int square(int *ptr)
{
    int a;
    a = *ptr;
    return a * a;
}

int main(int argc, char **argv)
{
    int a, aa;
    srandom(time(NULL));
    a = random() % 10 + 1;
    aa = square(&a);
    printf("%d\n", aa);
    return 0;
}
The command-line to compile the source code is:
gcc square.c -o square
Is it possible to run the square executable in Linux so that the printed value will not be a square of any integer number?
Any method of running the program is allowed.
Yes. We can override printf.
Write the code from your post into square.c and compile it with gcc square.c.
Make this file, fakesquare.c:
int puts(const char *s); /* declared by hand; including stdio.h would
                            conflict with this non-variadic printf */

int printf(char *str, int i)
{
    return puts("7");
}
Compile fakesquare.c as a shared library:
gcc -fPIC -o libfakesquare.so -shared fakesquare.c
Run the square program with libfakesquare.so preloaded:
[15:27:27 0 /tmp] $ LD_PRELOAD=./libfakesquare.so ./a.out
7
[15:29:16 0 /tmp] $ LD_PRELOAD=./libfakesquare.so ./a.out
7
[15:29:16 0 /tmp] $ LD_PRELOAD=./libfakesquare.so ./a.out
7
Without libfakesquare.so preloaded:
[15:29:40 0 /tmp] $ ./a.out
36
[15:29:41 0 /tmp] $ ./a.out
16
[15:29:42 0 /tmp] $ ./a.out
64
You could use this:
Fastest way to determine if an integer's square root is an integer
Their code seems optimized, but whichever is simplest should do the trick for you.
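The simplest variant there boils down to something like this (a sketch, adequate for the small values this program prints; see the link for faster versions):

#include <math.h>
#include <stdbool.h>

/* round the square root to the nearest integer and square it again */
bool is_perfect_square(int n)
{
    if (n < 0)
        return false;
    int r = (int)(sqrt((double)n) + 0.5);
    return r * r == n;
}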
The only dependency of your code is libc. If libc stays unmodified, then your code will always work.
Also, your program would fail if all available memory were exhausted before it runs. You can always check whether ptr != NULL.
Assuming a standard C environment, I don't see a reason why this should fail on a standard platform. The code might fail if printf is not doing what it is intended to do, but probably that is not what you are asking about. It might also fail on a platform where int is as small as a byte and a byte is only 6 bits wide; in that case the square function might calculate 9*9=81, which will not fit in the result type int (0..63 for a 6-bit byte). But in my opinion this is a quite academic case.