Very slow speed of gcc compiled C-program under Linux - c

I have two operating systems on my PC with an i7-3770 @ 3.40 GHz: the latest Kubuntu 18.04 and Windows 10 Pro, both on the same HDD.
I tested a simple program written in C that does some arithmetic calculations from number theory. On Kubuntu it was compiled with gcc 7.3.0, on Windows with gcc 5.2.0 built by the MinGW-W64 project.
The result is surprising: the program ran 4 times slower on Linux than on Windows.
On Windows the elapsed time is just 6 seconds. On Linux the elapsed time is 24 seconds, on the same hardware.
On Kubuntu I tried compiling with some CPU-specific options like "gcc -corei7" etc., but nothing helped. The program uses the "math.h" library, so it is compiled with "-lm" on both systems. The source code is the same.
Is there a reason for this slow speed under Linux?
Furthermore, I compiled the same code on an older 32-bit machine with a Core Duo T2250 @ 1.73 GHz under Linux Mint 19 with gcc 7.3.0. The elapsed time was 28 seconds, not much different from the 64-bit machine running at double the frequency under Linux.
The source code is below; you can compile it and test it.
/* Program for playing with sigma(n) and tau(n) functions */
/* Compilation of code: "gcc name.c -o name -lm" */
#include <stdio.h>
#include <math.h>
#include <time.h>
int main(void)
{
double i, nq, x, zacatek, konec, p;
double odx, soucet, delitel, celkem, ZM;
unsigned long cas1, cas2;
i=(double)0; soucet=(double)0; celkem=(double)0; nq=(double)0;
zacatek=(double)1; konec=(double)1000000; x=zacatek;
ZM=(double)16 / (double)10;
printf("\n Program for playing with sigma(n) and tau(n) functions \n");
printf("---------------------------------------------------------\n");
printf("Calculation is running in range from %.0lf to %.0lf\n\n\n", zacatek, konec);
printf("Finding numbers which have sigma(n)/n = %.3lf\n\n", ZM);
cas1=time(NULL);
while (x <= konec) {
i=1; celkem=0; nq=0;
odx=sqrt(x)+1;
while (i <= odx) {
if (fmod(x, i)==0) {
nq++;
celkem=celkem+x/i+i;
}
i++;
}
nq=2*nq-1;
if ((odx-floor(odx))==0) {celkem=celkem-odx;}
if (fabs(celkem - (ZM*x)) < 0.001) {
printf("%.0lf has sum of all divisors = %.3lf times the number itself (%.0lf, %.0lf)\n", x, ZM, celkem, nq+1);
}
x++;
}
cas2=time(NULL);
printf("\n\nProgram ended.\n\n");
printf("Elapsed time %lu seconds.\n\n", cas2-cas1);
return (0);
}
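If the floating-point fmod call turns out to be the hot spot (an assumption; the question does not include a profile), one way to check is to redo the divisor loop with integer arithmetic and compare timings on both systems. The sketch below is not the original program: divisor_sum, divisor_count and has_ratio_16_10 are hypothetical helper names, and the loop uses the % operator on 64-bit unsigned integers instead of fmod on doubles.

```c
#include <stdint.h>

/* Hypothetical helper: sigma(n), the sum of all positive divisors of n,
   by integer trial division up to sqrt(n). Returns 0 for n == 0. */
uint64_t divisor_sum(uint64_t n)
{
    uint64_t sum = 0;
    for (uint64_t d = 1; d <= n / d; d++) {   /* d <= n/d is d*d <= n without overflow */
        if (n % d == 0) {
            sum += d;
            if (d != n / d)                   /* add the paired divisor, but not twice for squares */
                sum += n / d;
        }
    }
    return sum;
}

/* Hypothetical helper: tau(n), the number of positive divisors of n. */
uint64_t divisor_count(uint64_t n)
{
    uint64_t count = 0;
    for (uint64_t d = 1; d <= n / d; d++) {
        if (n % d == 0)
            count += (d != n / d) ? 2 : 1;
    }
    return count;
}

/* sigma(n)/n == 16/10, tested exactly as 10*sigma(n) == 16*n. */
int has_ratio_16_10(uint64_t n)
{
    return n != 0 && 10 * divisor_sum(n) == 16 * n;
}
```

Calling has_ratio_16_10 for every n from 1 to 1000000 inside the same time(NULL) bracket as the original would show whether the Linux/Windows gap remains once fmod is out of the loop.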

Related

Why `memmove` function has significant difference in two different computers?

I've tried to run the following C code from two different computers.
#include <string.h>
int a[100000];
int main(){
for(int sz = 100000; sz > 1; sz --){
memmove(a, a+1, 4*(sz - 1));
}
}
Computer A uses 800ms, while computer B uses 6200ms. The running time of B is always far more than A.
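For scale, the loop above passes roughly 20 GB to memmove in total, so 800ms corresponds to about 25 GB/s and 6200ms to about 3.2 GB/s. A small sketch of that arithmetic, assuming 4-byte int as in the literal 4 used in the code (total_bytes_moved and throughput_gb_per_s are illustrative names):

```c
#include <stdint.h>

/* Total bytes passed to memmove by the loop
       for (sz = n; sz > 1; sz--) memmove(a, a + 1, 4 * (sz - 1));
   which is 4 * (1 + 2 + ... + (n - 1)) = 2 * n * (n - 1). */
uint64_t total_bytes_moved(uint64_t n)
{
    return n == 0 ? 0 : 2 * n * (n - 1);
}

/* Throughput in GB/s (1 GB = 1e9 bytes) for a run time given in milliseconds. */
double throughput_gb_per_s(uint64_t bytes, double millis)
{
    return ((double)bytes / 1e9) / (millis / 1000.0);
}
```

With n = 100000 the total is 19,999,800,000 bytes.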
Environment
compile command
The shell is bash.
The gcc -O option doesn't influence the runtime.
gcc myfile.c -o mybin
time ./mybin
Computer A
gcc 9.3.0
glibc: ldd (Ubuntu GLIBC 2.31-0ubuntu9) 2.31
uname_result(system='Linux', release='5.4.0-100-generic', machine='x86_64')
CPU: Intel(R) Xeon(R) Gold 6140 CPU @ 2.30GHz
Computer B
gcc 9.3.0
glibc: ldd (Ubuntu GLIBC 2.31-0ubuntu9.2) 2.31
uname_result(system='Linux', release='4.4.0-210-generic', machine='x86_64')
CPU: Intel(R) Xeon(R) Platinum 8369B CPU @ 2.70GHz
Question
Then I ran the following file on the same virtual machines, with different kernel versions (4.4.210-0404210-generic x86_64 and 5.4.0-113-generic x86_64) and gcc 9.3.0. Both tests took less than 500ms.
#include <string.h>
#include <time.h>
#include <stdio.h>
#define TICK(X) clock_t X = clock()
#define TOCK(X) printf("time %s: %g sec.\n", (#X), (double)(clock() - (X)) / CLOCKS_PER_SEC)
int a[100000];
int main(){
TICK(timer);
for(int sz = 100000; sz > 100; sz --){
memmove(a, a+1, 4*(sz - 1));
}
TOCK(timer);
}
How can I find the cause?
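One way to narrow this down might be to time memmove separately at a few fixed sizes on both machines and see where the results start to diverge. The sketch below is only a starting point under stated assumptions: time_memmove and BUF_LEN are names made up for illustration, it uses the C11 timespec_get wall clock rather than clock() (which reports CPU time), and the empty asm statement is a GCC/Clang-specific barrier.

```c
#include <stddef.h>
#include <string.h>
#include <time.h>

#define BUF_LEN 100000
static int buf[BUF_LEN];

/* Wall-clock seconds for `reps` calls of memmove(buf, buf + 1, bytes). */
double time_memmove(size_t bytes, int reps)
{
    const size_t max_bytes = (BUF_LEN - 1) * sizeof buf[0];
    struct timespec t0, t1;

    if (bytes > max_bytes)
        bytes = max_bytes;                 /* clamp so the copy stays inside buf */

    timespec_get(&t0, TIME_UTC);           /* C11 wall-clock timestamp */
    for (int r = 0; r < reps; r++) {
        memmove(buf, buf + 1, bytes);
        /* GCC/Clang-specific compiler barrier so the calls are not optimized away */
        __asm__ volatile("" : : "r"(buf) : "memory");
    }
    timespec_get(&t1, TIME_UTC);

    return (double)(t1.tv_sec - t0.tv_sec)
         + (double)(t1.tv_nsec - t0.tv_nsec) / 1e9;
}
```

Comparing, say, time_memmove(64, 1000000), time_memmove(4096, 100000) and time_memmove(399996, 1000) on computers A and B would show whether the gap is specific to large copies or present at every size.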

gcc's __builtin_memcpy performance with certain number of bytes is terrible compared to clang's

I thought I'd first share this here to get your opinions before doing anything else. While designing an algorithm, I found that the performance of gcc-compiled code for some simple code was catastrophic compared to clang's.
How to reproduce
Create a test.c file containing this code :
#include <sys/stat.h>
#include <sys/types.h>
#include <stdio.h>
#include <stdint.h>
#include <stdlib.h>
#include <stdbool.h>
#include <string.h>
int main(int argc, char *argv[]) {
const uint64_t size = 1000000000;
const size_t alloc_mem = size * sizeof(uint8_t);
uint8_t *mem = (uint8_t*)malloc(alloc_mem);
for (uint_fast64_t i = 0; i < size; i++)
mem[i] = (uint8_t) (i >> 7);
uint8_t block = 0;
uint_fast64_t counter = 0;
uint64_t total = 0x123456789abcdefllu;
uint64_t receiver = 0;
for(block = 1; block <= 8; block ++) {
printf("%u ...\n", block);
counter = 0;
while (counter < size - 8) {
__builtin_memcpy(&receiver, &mem[counter], block);
receiver &= (0xffffffffffffffffllu >> (64 - ((block) << 3)));
total += ((receiver * 0x321654987cbafedllu) >> 48);
counter += block;
}
}
printf("=> %llu\n", total);
return EXIT_SUCCESS;
}
gcc
Compile and run :
gcc-7 -O3 test.c
time ./a.out
1 ...
2 ...
3 ...
4 ...
5 ...
6 ...
7 ...
8 ...
=> 82075168519762377
real 0m23.367s
user 0m22.634s
sys 0m0.495s
info :
gcc-7 -v
Using built-in specs.
COLLECT_GCC=gcc-7
COLLECT_LTO_WRAPPER=/usr/local/Cellar/gcc/7.3.0/libexec/gcc/x86_64-apple-darwin17.4.0/7.3.0/lto-wrapper
Target: x86_64-apple-darwin17.4.0
Configured with: ../configure --build=x86_64-apple-darwin17.4.0 --prefix=/usr/local/Cellar/gcc/7.3.0 --libdir=/usr/local/Cellar/gcc/7.3.0/lib/gcc/7 --enable-languages=c,c++,objc,obj-c++,fortran --program-suffix=-7 --with-gmp=/usr/local/opt/gmp --with-mpfr=/usr/local/opt/mpfr --with-mpc=/usr/local/opt/libmpc --with-isl=/usr/local/opt/isl --with-system-zlib --enable-checking=release --with-pkgversion='Homebrew GCC 7.3.0' --with-bugurl=https://github.com/Homebrew/homebrew-core/issues --disable-nls
Thread model: posix
gcc version 7.3.0 (Homebrew GCC 7.3.0)
So we get about 23s of user time. Now let's do the same with cc (clang on macOS) :
clang
cc -O3 test.c
time ./a.out
1 ...
2 ...
3 ...
4 ...
5 ...
6 ...
7 ...
8 ...
=> 82075168519762377
real 0m9.832s
user 0m9.310s
sys 0m0.442s
info :
Apple LLVM version 9.0.0 (clang-900.0.39.2)
Target: x86_64-apple-darwin17.4.0
Thread model: posix
InstalledDir: /Applications/Xcode.app/Contents/Developer/Toolchains/XcodeDefault.xctoolchain/usr/bin
That's about 2.4x faster! Any thoughts?
I replaced the __builtin_memcpy function by memcpy to test things out and this time the compiled code runs in about 34s on both sides - consistent and slower as expected.
It would appear that the combination of __builtin_memcpy and bitmasking is interpreted very differently by both compilers.
I had a look at the assembly code, but couldn't see anything standing out that would explain such a drop in performance as I'm not an asm expert.
Edit 03-05-2018 :
Posted this bug : https://gcc.gnu.org/bugzilla/show_bug.cgi?id=84719.
I find it suspicious that you get different code for memcpy vs __builtin_memcpy. I don't think that's supposed to happen, and indeed I cannot reproduce it on my (linux) system.
If you add #pragma GCC unroll 16 (implemented in gcc-8+) before the for loop, gcc gets the same perf as clang (making block a constant is essential to optimize the code), so essentially llvm's unrolling is more aggressive than gcc's, which can be good or bad depending on cases. Still, feel free to report it to gcc, maybe they'll tweak the unrolling heuristics some day and an extra testcase could help.
Once unrolling is taken care of, gcc does ok for some values (block equals 4 or 8 in particular), but much worse for some others, in particular 3. But that's better analyzed with a smaller testcase without the loop on block. Gcc seems to have trouble with memcpy(,,3), it works much better if you always read 8 bytes (the next line already takes care of the extra bytes IIUC). Another thing that could be reported to gcc.
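The "always read 8 bytes" idea can be sketched as follows. This is an illustration rather than a drop-in patch: load_masked is a hypothetical helper name, it assumes a little-endian target (as on x86), and it assumes the 8 bytes starting at p are readable, which the question's loop bound of size - 8 provides.

```c
#include <assert.h>
#include <stdint.h>
#include <string.h>

/* Read 8 bytes at p and keep only the low `block` bytes (block in 1..8).
   The caller must guarantee that p[0..7] is readable. */
uint64_t load_masked(const uint8_t *p, unsigned block)
{
    uint64_t v;

    assert(block >= 1 && block <= 8);              /* a shift by 64 would be undefined */
    memcpy(&v, p, sizeof v);                       /* constant size: a single 8-byte load */
    return v & (UINT64_MAX >> (64 - 8 * block));   /* same mask as in the question */
}
```

Because the memcpy size is a compile-time constant, the compiler does not have to generate a variable-length copy; the mask then discards the bytes beyond block.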

Lower than expected speedup when using multithreading

Remark: I feel a little bit stupid about this, but this might help someone
So, I am trying to improve the performance of a program by using parallelism. However, I am encountering an issue with the measured speedup. I have 4 CPUs:
~% lscpu
...
CPU(s): 4
...
However, the speedup is much lower than fourfold. Here is a minimal working example, with a sequential version, a version using OpenMP and a version using POSIX threads (to be sure it is not due to either implementation).
Purely sequential (add_seq.c):
#include <stddef.h>
int main() {
for (size_t i = 0; i < (1ull<<36); i += 1) {
__asm__("add $0x42, %%eax" : : : "eax");
}
return 0;
}
OpenMP (add_omp.c):
#include <stddef.h>
int main() {
#pragma omp parallel for schedule(static)
for (size_t i = 0; i < (1ull<<36); i += 1) {
__asm__("add $0x42, %%eax" : : : "eax");
}
return 0;
}
POSIX threads (add_pthread.c):
#include <pthread.h>
#include <stddef.h>
void* f(void* x) {
(void) x;
const size_t count = (1ull<<36) / 4;
for (size_t i = 0; i < count; i += 1) {
__asm__("add $0x42, %%eax" : : : "eax");
}
return NULL;
}
int main() {
pthread_t t[4];
for (size_t i = 0; i < 4; i += 1) {
pthread_create(&t[i], NULL, f, NULL);
}
for (size_t i = 0; i < 4; i += 1) {
pthread_join(t[i], NULL);
}
return 0;
}
Makefile:
CFLAGS := -O3 -fopenmp
LDFLAGS := -O3 -lpthread # just to be sure
all: add_seq add_omp add_pthread
So, now, running this (using zsh's time builtin):
% make -B && time ./add_seq && time ./add_omp && time ./add_pthread
cc -O3 -fopenmp -O3 -lpthread add_seq.c -o add_seq
cc -O3 -fopenmp -O3 -lpthread add_omp.c -o add_omp
cc -O3 -fopenmp -O3 -lpthread add_pthread.c -o add_pthread
./add_seq 24.49s user 0.00s system 99% cpu 24.494 total
./add_omp 52.97s user 0.00s system 398% cpu 13.279 total
./add_pthread 52.92s user 0.00s system 398% cpu 13.266 total
Checking CPU frequency, sequential code has maximum CPU frequency of 2.90 GHz, and parallel code (all versions) has uniform CPU frequency of 2.60 GHz. So counting billions of instructions:
>>> 24.494 * 2.9
71.0326
>>> 13.279 * 2.6
34.5254
>>> 13.266 * 2.6
34.4916
So, all in all, threaded code is only running twice as fast as sequential code, although it is using four times as much CPU time. Why is it so?
Remark: the assembly for add_omp.c seemed less efficient, since it did the for-loop by incrementing a register and comparing it to the number of iterations, rather than decrementing and directly checking ZF; however, this had no effect on performance.
Well, the answer is quite simple: there are really only two CPU cores:
% lscpu
...
Thread(s) per core: 2
Core(s) per socket: 2
Socket(s): 1
...
So, although htop shows four CPUs, two are virtual and only there because of hyper-threading. Since the core idea of hyper-threading is to share the resources of a single core between two threads, it does not help run similar code faster (it is only useful when the two threads use different resources).
So, in the end, what happens is that time/clock() measures the usage of each logical core as that of the underlying physical core. Since all report ~100% usage, we get a ~400% usage, although it only represents a twofold speedup.
Up until then, I was convinced this computer contained 4 physical cores, and had completely forgotten to check about hyperthreading.
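The measurements can be put into the usual speedup and efficiency form. A minimal sketch (speedup and efficiency are illustrative helper names, not from any library):

```c
/* Speedup of a parallel run over a sequential one, from wall-clock times. */
double speedup(double t_seq, double t_par)
{
    return t_seq / t_par;
}

/* Parallel efficiency: speedup divided by the number of execution units assumed. */
double efficiency(double t_seq, double t_par, int units)
{
    return speedup(t_seq, t_par) / (double)units;
}
```

With the times above (24.494 s sequential, 13.279 s with OpenMP) the speedup is about 1.84. That is an efficiency of about 0.46 if the machine is counted as 4 CPUs, but about 0.92 if it is counted as 2 physical cores, which matches the explanation.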

cacosf (Complex arc cos) function in C returns indefinite

I have an algorithm coded in MATLAB, which contains complex arc cos of some value (computation requires arccos of 15, which is approximately 3.4i). I want to code C or C++ counterpart of this code running on my Windows 7 PC. Actually, I want to produce it as a mex function compiled with Visual Studio C++.
I included "complex.h" and used cacosf function (complex arccos returning float _Complex) but I could not compile it as a mex function because Visual C++ compiler does not have "complex.h" support. However, mex file can take libraries as input, so I can compile my c code with another compiler that MATLAB does support (for example mingw, I integrated it to matlab with gnumex utility.) I downloaded Bloodshed C++ IDE which uses mingw at backend, I can compile my c++ code. The following C++ code represents a similar operation to my goal:
#include <stdio.h>
#include <complex.h>
int main() {
float _Complex myComplex;
myComplex = cacosf(5);
printf("Complex number result of acos(5) is : %f + %fi \r\n",crealf(myComplex),cimagf(myComplex));
return 0;
}
The output should be:
Complex number result of acos(5) is : 0.000000 + -2.292432i
However I get
Complex number result of acos(5) is : -1.#IND00, -0.000000
When I compile my code with GCC on an Ubuntu 14.04 computer (with Eclipse CDT Luna), I get the expected output:
Complex number result of acos(5) is : 0.000000 + -2.292432i
Where can I be wrong? Why can't I compile this code in Windows + mingw setup?
Note: I can compute cacosf(0) as 1.570796 + -0.000000 when I use mingw.
What version of mingwrt are you using? With mingwrt-3.21.1, the following works for me, (cross-compiling on a Linux host, and running under wine):
$ cat foo.c
#include <stdio.h>
#include <complex.h>
int main()
{
double _Complex Z = cacos(5.0);
printf( "arcos(5) = (%g, %gi)\n", __real__ Z, __imag__ Z );
return 0;
}
$ mingw32-gcc -o foo.exe foo.c
$ ./foo.exe
arcos(5) = (0, -2.29243i)
This seems to be consistent with your expected result. However, if you use any mingwrt version pre-dating mingwrt-3.21, (and the less said about utterly broken mingwrt-4.x the better), then there is a known bug resulting from arbitrarily deeming any purely real cacos() argument value greater than (1.0, 0.0i) to be outside the valid domain, (as would be the case for acos() on its real part), which would yield the result you report.
Visual C++, as the name says, is a C++ compiler. C++ uses the <complex> header and std::complex<float> type. Since C++ has overloading, you can call std::acos for complex values too.
Your code is in fact C, which is no longer supported by MSVC++. (They stopped doing that back in 1996 or so)

Why does this SIMD example code in C compile with minGW but the executable doesn't run on my windows machine?

I'm learning the basics of SIMD so I was given a simple code snippet to see the principle at work with SSE and SSE2.
I recently installed minGW to compile C code in windows with gcc instead of using the visual studio compiler.
The objective of the example is to add two floats and then multiply by a third one.
The headers included are the following (which I guess are used to be able to use the SSE intrinsics):
#include <time.h>
#include <stdio.h>
#include <xmmintrin.h>
#include <pmmintrin.h>
#include <time.h>
#include <sys/time.h> // for timing
Then I have a function to check what time it is, to compare time between calculations:
double now(){
struct timeval t; double f_t;
gettimeofday(&t, NULL);
f_t = t.tv_usec; f_t = f_t/1000000.0; f_t +=t.tv_sec;
return f_t;
}
The function to do the calculation in the "scalar" sense is the following:
void run_scalar(){
unsigned int i;
for( i = 0; i < N; i++ ){
rs[i] = (a[i]+b[i])*c[i];
}
}
Here is the code for the sse2 function:
void run_sse2(){
unsigned int i;
__m128 *mm_a = (__m128 *)a;
__m128 *mm_b = (__m128 *)b;
__m128 *mm_c = (__m128 *)c;
__m128 *mm_r = (__m128 *)rv;
for( i = 0; i <N/4; i++)
mm_r[i] = _mm_mul_ps(_mm_add_ps(mm_a[i],mm_b[i]),mm_c[i]);
}
The vectors are defined the following way (N is the size of the vectors and it is defined elsewhere) and a function init() is called to initialize them:
float a[N] __attribute__((aligned(16)));
float b[N] __attribute__((aligned(16)));
float c[N] __attribute__((aligned(16)));
float rs[N] __attribute__((aligned(16)));
float rv[N] __attribute__((aligned(16)));
void init(){
unsigned int i;
for( i = 0; i < N; i++ ){
a[i] = (float)rand () / RAND_MAX / N;
b[i] = (float)rand () / RAND_MAX / N;
c[i] = (float)rand () / RAND_MAX / N;
}
}
Finally here is the main that calls the functions and prints the results and computing time.
int main(){
double t;
init();
t = now();
run_scalar();
t = now()-t;
printf("S = %10.9f Temps du code scalaire : %f seconde(s)\n",1e5*sum(rs),t);
t = now();
run_sse2();
t = now()-t;
printf("S = %10.9f Temps du code vectoriel 2: %f seconde(s)\n",1e5*sum(rv),t);
}
For some reason, if I compile this code with "gcc -o vec vectorial.c -msse -msse2 -msse3" or "mingw32-gcc -o vec vectorial.c -msse -msse2 -msse3", it compiles without any problems, but I can't run it on my Windows machine: in the command prompt I get "access denied", and a big message appears on the screen saying "This app can't run on your PC, to find a version for your PC, check with the software publisher".
I don't really understand what is going on, neither do I have much experience with MinGW or C (just an introductory course to C++ done on Linux machines). I've tried playing around with different headers because I thought maybe I was targeting a different processor than the one on my PC but couldn't solve the issue. Most of the info I found was confusing.
Can someone help me understand what is going on? Is it a problem in the minGW configuration that is compiling in targeting a Linux platform? Is it something in the code that doesn't have the equivalent in windows?
I'm trying to run it on a 64 bit Windows 8.1 pc
Edit: Tried the configuration suggested in the site linked below. The output remains the same.
If I try to run it through MSYS I get "Bad File number".
If I try to run it through the command prompt I get "Access is denied".
I'm guessing there's some sort of bug arising from permissions. Tried turning off the antivirus and User Account control but still no luck.
Any ideas?
There is nothing wrong with your code. You did not provide the definitions of sum() or N, but that is not a problem. The switches -msse -msse2 do not appear to be required.
I was able to compile and run your code on Linux (Ubuntu x86_64, compiled with gcc 4.8.2 and 4.6.3, on Atom D2700 and AMD Athlon LE-1640) and Windows7/64 (compiled with gcc 4.5.3 (32bit) and 4.8.2 (64bit), on Core i3-4330 and Core i7-4960X). It was running without problem.
Are you sure your CPU supports the required instructions? What exactly was the error code you got? Which MinGW configuration did you use? Out of curiosity, I used the one available at http://win-builds.org/download.html which was very straight-forward.
However, using the optimization flag -O3 created the best result -- with the scalar loop! Also useful are -m64 -mtune=native -s.
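As a side note, the run_sse2 function in the question casts the float arrays to __m128 pointers, which relies on 16-byte alignment and on N being a multiple of 4. A variant without those requirements could look like the sketch below. It is unrelated to the "This app can't run on your PC" error, and add_mul_sse is a name chosen here for illustration.

```c
#include <stddef.h>
#include <xmmintrin.h>   /* SSE intrinsics */

/* r[i] = (a[i] + b[i]) * c[i], four floats at a time. Unaligned loads and
   stores are used, so the arrays need no particular alignment. */
void add_mul_sse(const float *a, const float *b, const float *c,
                 float *r, size_t n)
{
    size_t i = 0;

    for (; i + 4 <= n; i += 4) {
        __m128 va = _mm_loadu_ps(a + i);
        __m128 vb = _mm_loadu_ps(b + i);
        __m128 vc = _mm_loadu_ps(c + i);
        _mm_storeu_ps(r + i, _mm_mul_ps(_mm_add_ps(va, vb), vc));
    }
    for (; i < n; i++)                 /* scalar tail when n is not a multiple of 4 */
        r[i] = (a[i] + b[i]) * c[i];
}
```

It would be called as add_mul_sse(a, b, c, rv, N) in place of run_sse2().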
