Large loop was ignored by the intel compiler? - c

All:
I have a very simple C test program, compiled with the Intel compiler, that times a large loop of floating-point operations. The code (test.c) is as follows:
#include <sys/time.h>
#include <time.h>
#include <stdlib.h>
#include <stdio.h>
#include <math.h>
#include <omp.h>
int main(int argc, char **argv) {
const long N = 1000000000;
double t0, t1, t2, t3;
double sum=0.0;
clock_t start, end;
struct timeval r_start, r_end;
long i;
gettimeofday(&r_start, NULL);
start = clock();
for (i=0;i<N;i++)
sum += i*2.0+i/2.0; // doing some floating point operations
end = clock();
gettimeofday(&r_end, NULL);
double cputime_elapsed_in_seconds = (end - start)/(double)CLOCKS_PER_SEC;
double realtime_elapsed_in_seconds = ((r_end.tv_sec * 1000000 + r_end.tv_usec)
- (r_start.tv_sec * 1000000 + r_start.tv_usec))/1000000.0;
printf("cputime_elapsed_in_sec: %e\n", cputime_elapsed_in_seconds);
printf("realtime_elapsed_in_sec: %e\n", realtime_elapsed_in_seconds);
//printf("sum= %4.3e\n", sum);
return 0;
}
However, when I compile and run it with the Intel 13.0 compiler, the large loop seems to be ignored and the execution reports essentially zero time:
$ icc test.c
$ ./a.out
cputime_elapsed_in_sec: 0.000000e+00
realtime_elapsed_in_sec: 9.000000e-06
The loop is actually executed only if I print the sum (by uncommenting the printf of sum near the end of main):
$ icc test.c
$ ./a.out
cputime_elapsed_in_sec: 2.730000e+00
realtime_elapsed_in_sec: 2.736198e+00
sum= 1.250e+18
The question is: why does the loop seem not to be executed if I do not print the sum?
The same issue does not occur with the gcc-4.4.7 compiler. I guess the Intel compiler performs an optimization that drops the loop when its result is never referenced?
The system information is as follows:
$ uname -a
Linux node001 2.6.32-642.11.1.el6.x86_64 #1 SMP Wed Oct 26 10:25:23 EDT 2016 x86_64 x86_64 x86_64 GNU/Linux
$ icc -v
icc version 13.0.0 (gcc version 4.4.7 compatibility)
$ gcc -v
Using built-in specs.
Target: x86_64-redhat-linux
Configured with: ../configure --prefix=/usr --mandir=/usr/share/man --infodir=/usr/share/info --with-bugurl=http://bugzilla.redhat.com/bugzilla --enable-bootstrap --enable-shared --enable-threads=posix --enable-checking=release --with-system-zlib --enable-__cxa_atexit --disable-libunwind-exceptions --enable-gnu-unique-object --enable-languages=c,c++,objc,obj-c++,java,fortran,ada --enable-java-awt=gtk --disable-dssi --with-java-home=/usr/lib/jvm/java-1.5.0-gcj-1.5.0.0/jre --enable-libgcj-multifile --enable-java-maintainer-mode --with-ecj-jar=/usr/share/java/eclipse-ecj.jar --disable-libjava-multilib --with-ppl --with-cloog --with-tune=generic --with-arch_32=i686 --build=x86_64-redhat-linux
Thread model: posix
gcc version 4.4.7 20120313 (Red Hat 4.4.7-17) (GCC)
Thanks for any suggestions!
Roy

Given your observation that printing the final value slows it down(a), there's a fairly good chance that the optimiser is figuring out that you're not actually using sum for anything after you've calculated it, so it's optimising the entire calculation loop out of existence.
I actually saw something similar quite a while ago when we were testing the performance of the latest VAX 11/780 machine our university had received (showing my age there). It was faster by a factor of several thousand percent for exactly the same reason, the new optimising compiler having decided that the loop wasn't actually needed.
To be certain, you'd have to examine the assembly output. I believe this can be done with icc by using the -Fa <asmFileName> option and then examining the file whose name you used in place of <asmFileName>.
(a) The other possibility I thought of seems to be discounted here.
That was the possibility that, given the range of i is constant (based on N) and that the calculation otherwise involves constants, it may be that the compiler itself had calculated the final value while compiling it, resulting in a simple constant load operation.
I've seen gcc do this sort of thing at its -O3 "insane" optimisation level.
I discount that possibility since the printing of the value would most likely not affect this operation.

Related

Is there no gcc warning when a literal declared as long is assigned to an int in c?

I can compile and run a program that assigns a long int literal, albeit one that would fit into an int, to an int variable.
$ cat assign-long-to-int.c
#include <stdio.h>
int main(void){
int i = 1234L; //assign long to an int
printf("i: %d\n", i);
return 0;
}
$ gcc assign-long-to-int.c -o assign-long-to-int
$ ./assign-long-to-int
i: 1234
I know that 1234 would fit into an int but would still expect to be able to enable a warning. I've been through all the gcc options but can't find anything suitable.
Is it possible to generate a warning for this situation? From the discussion here, and the gcc options, the short answer is no. It isn't possible.
Would there be any point in such a warning?
It's obvious in the trivial example I posted that 1234L is being assigned to an int variable, and that it will fit. However, what if the declaration and the assignment were separated by many lines of code? The programmer writing 1234L is signaling that they expect this literal integer to be assigned to a long. Otherwise, what's the point of appending the L?
In some situations, appending the L does make a difference. For example
$ cat sizeof-test.c
#include <stdio.h>
int main(void){
/* sizeof yields a size_t, which is printed with %zu */
printf("%zu\n", sizeof(1234));
printf("%zu\n", sizeof(1234L));
return 0;
}
$ ./sizeof-test
4
8
Although the compiler must know that 1234L would fit into a 4 byte int, it puts it into an 8 byte long.
$ gcc -v
Using built-in specs.
COLLECT_GCC=gcc
COLLECT_LTO_WRAPPER=/usr/lib/gcc/x86_64-linux-gnu/9/lto-wrapper
OFFLOAD_TARGET_NAMES=nvptx-none:hsa
OFFLOAD_TARGET_DEFAULT=1
Target: x86_64-linux-gnu
Configured with: ../src/configure -v --with-pkgversion='Ubuntu 9.3.0-17ubuntu1~20.04' --with-bugurl=file:///usr/share/doc/gcc-9/README.Bugs --enable-languages=c,ada,c++,go,brig,d,fortran,objc,obj-c++,gm2 --prefix=/usr --with-gcc-major-version-only --program-suffix=-9 --program-prefix=x86_64-linux-gnu- --enable-shared --enable-linker-build-id --libexecdir=/usr/lib --without-included-gettext --enable-threads=posix --libdir=/usr/lib --enable-nls --enable-clocale=gnu --enable-libstdcxx-debug --enable-libstdcxx-time=yes --with-default-libstdcxx-abi=new --enable-gnu-unique-object --disable-vtable-verify --enable-plugin --enable-default-pie --with-system-zlib --with-target-system-zlib=auto --enable-objc-gc=auto --enable-multiarch --disable-werror --with-arch-32=i686 --with-abi=m64 --with-multilib-list=m32,m64,mx32 --enable-multilib --with-tune=generic --enable-offload-targets=nvptx-none=/build/gcc-9-HskZEa/gcc-9-9.3.0/debian/tmp-nvptx/usr,hsa --without-cuda-driver --enable-checking=release --build=x86_64-linux-gnu --host=x86_64-linux-gnu --target=x86_64-linux-gnu
Thread model: posix
gcc version 9.3.0 (Ubuntu 9.3.0-17ubuntu1~20.04)
Compilers should check the value range, not the type of the integer constant. Otherwise we would end up with a lot of whining whenever we initialize a small integer type, since there are no small integer constants smaller than int.
short i = 32768; does for example yield a warning with clang -Wconstant-conversion but not with gcc. There's -Wconversion but it's prone to false positives on either compiler.
If you want to guard against implicit conversions between various integer types, you should probably use a static analyser instead.
In the case of constants, the compiler can see that the value in question fits into the type being assigned to, so there's really no point in warning. If the constant is out of range, e.g. 5000000000L, then the compiler will see that and generate a warning.
What the compiler can do, however, is warn when an integer value that is not a compile-time constant is assigned to a narrower type:
long y = 1;
int x = y;
If you add the -Wconversion flag (not included in either -Wall or -Wextra), you'll get this warning:
x1.c:6:5: warning: conversion to ‘int’ from ‘long int’ may alter its value [-Wconversion]
int x = y;
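The compiler will automatically convert between most primitive integer types. When you convert from a larger type to a smaller unsigned type, the C language defines the result: the value is reduced modulo 2^N, so the high bits are dropped. When an out-of-range value is converted to a smaller signed type, the result is implementation-defined.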
For example, the following code will print "0xef":
#include <stdio.h>
#include <stdint.h>
int main() {
uint32_t x = 0xdeadbeef;
uint8_t y = x;
printf("0x%x\n", y);
return 0;
}
To address your question specifically, I don't think there is a warning for this behavior, because this conversion is technically a defined feature of the C language.

Why Long double data type is working strange in my C code?

I'm using Windows 10 and the MinGW GCC compiler (not the mingw64 one), and I get unexpected output when I run the code below.
This is my GCC version:
C:\Users\94768>gcc -v
Using built-in specs.
COLLECT_GCC=gcc
COLLECT_LTO_WRAPPER=c:/mingw/bin/../libexec/gcc/mingw32/6.3.0/lto-wrapper.exe
Target: mingw32
Configured with: ../src/gcc-6.3.0/configure --build=x86_64-pc-linux-gnu --host=mingw32 --target=mingw32 --with-gmp=/mingw --with-mpfr --with-mpc=/mingw --with-isl=/mingw --prefix=/mingw --disable-win32-registry --with-arch=i586 --with-tune=generic --enable-languages=c,c++,objc,obj-c++,fortran,ada --with-pkgversion='MinGW.org GCC-6.3.0-1' --enable-static --enable-shared --enable-threads --with-dwarf2 --disable-sjlj-exceptions --enable-version-specific-runtime-libs --with-libiconv-prefix=/mingw --with-libintl-prefix=/mingw --enable-libstdcxx-debug --enable-libgomp --disable-libvtv --enable-nls
Thread model: win32
gcc version 6.3.0 (MinGW.org GCC-6.3.0-1)
The code is,
#include <stdio.h>
int main(void) {
float a = 1.12345;
double b = 1.12345;
long double c = 1.12345;
printf("float value is is %f\n", a);
printf("double value is %lf\n", b);
printf("long double value is %Lf\n", c);
}
I got the output below, which is not what I expected. I don't understand what the issue is; I am an absolute beginner at C programming.
float value is 1.123450
double value is 1.123450
long double value is -0.000000
Immensely appreciate your guidance!
GCC 6.3? That's ancient!
Latest GCC release at the time of this writing is 11.1.
I recommend you move away from MinGW and switch to MinGW-w64 as it is much more up to date (and supports both 32-bit and 64-bit Windows).
You can get a recent standalone MinGW-w64 build from https://winlibs.com/
Built with that compiler the output of your code is:
float value is is 1.123450
double value is 1.123450
long double value is 1.123450

Very slow speed of gcc compiled C-program under Linux

I have two operating systems on my PC with an i7-3770 @ 3.40 GHz. One is the latest Linux Kubuntu 18.04; the other is Windows 10 Pro, running on the same HDD.
I have tested a simple program written in C that does some arithmetic calculations from number theory. On Kubuntu it was compiled with gcc 7.3.0; on Windows with gcc 5.2.0 built by the MinGW-W64 project.
The result is surprising: the program ran 4 times slower on Linux than on Windows.
On Windows the elapsed time is just 6 seconds. On Linux the elapsed time is 24 seconds! On the same hardware.
On Kubuntu I tried compiling with some CPU-specific options like "gcc -corei7" etc., but nothing helped. The program uses the "math.h" library, so it is compiled with "-lm" on both systems. The source code is the same.
Is there a reason for this slow speed under Linux?
Furthermore, I have compiled the same code on an older 32-bit machine with a Core Duo T2250 @ 1.73 GHz under Linux Mint 19 with gcc 7.3.0. The elapsed time was 28 seconds! Not much different from the 64-bit machine running at double the frequency under Linux.
The source code is below; you can compile it and test it.
/* Program for playing with sigma(n) and tau(n) functions */
/* Compilation of code: "gcc name.c -o name -lm" */
#include <stdio.h>
#include <math.h>
#include <time.h>
int main(void)
{
double i, nq, x, zacatek, konec, p;
double odx, soucet, delitel, celkem, ZM;
unsigned long cas1, cas2;
i=(double)0; soucet=(double)0; celkem=(double)0; nq=(double)0;
zacatek=(double)1; konec=(double)1000000; x=zacatek;
ZM=(double)16 / (double)10;
printf("\n Program for playing with sigma(n) and tau(n) functions \n");
printf("---------------------------------------------------------\n");
printf("Calculation is running in range from %.0lf to %.0lf\n\n\n", zacatek, konec);
printf("Finding numbers which have sigma(n)/n = %.3lf\n\n", ZM);
cas1=time(NULL);
while (x <= konec) {
i=1; celkem=0; nq=0;
odx=sqrt(x)+1;
while (i <= odx) {
if (fmod(x, i)==0) {
nq++;
celkem=celkem+x/i+i;
}
i++;
}
nq=2*nq-1;
if ((odx-floor(odx))==0) {celkem=celkem-odx;}
if (fabs(celkem - (ZM*x)) < 0.001) {
printf("%.0lf has sum of all divisors = %.3lf times the number itself (%.0lf, %.0lf)\n", x, ZM, celkem, nq+1);
}
x++;
}
cas2=time(NULL);
printf("\n\nProgram ended.\n\n");
printf("Elapsed time %lu seconds.\n\n", cas2-cas1);
return (0);
}

Why does the math library only need to be linked when used outside of main? [duplicate]

Using gcc test.c, the first code sample compiles, while the second does not. Why?
They both work with explicit linking of the math library (i.e. gcc test.c -lm).
First Sample:
#include <stdio.h>
#include <math.h>
int main () {
printf("%lf\n", sqrt(4.0) );
return 0;
}
Second Sample:
#include <math.h>
#include <stdio.h>
double sqrt2(double a) { return sqrt(a); }
int main() {
printf("%lf\n", sqrt(4.0));
printf("%lf\n", sqrt2(4.0));
return 0;
}
Linker error with second sample:
/tmp/ccuYdso7.o: In function `sqrt2':
test.c:(.text+0x13): undefined reference to `sqrt'
collect2: error: ld returned 1 exit status
gcc -v:
Using built-in specs.
COLLECT_GCC=gcc
COLLECT_LTO_WRAPPER=/usr/lib/gcc/x86_64-pc-linux-gnu/8.1.0/lto-wrapper
Target: x86_64-pc-linux-gnu
Configured with: /build/gcc/src/gcc/configure --prefix=/usr --libdir=/usr/lib --libexecdir=/usr/lib --mandir=/usr/share/man --infodir=/usr/share/info --with-bugurl=https://bugs.archlinux.org/ --enable-languages=c,c++,ada,fortran,go,lto,objc,obj-c++ --enable-shared --enable-threads=posix --enable-libmpx --with-system-zlib --with-isl --enable-__cxa_atexit --disable-libunwind-exceptions --enable-clocale=gnu --disable-libstdcxx-pch --disable-libssp --enable-gnu-unique-object --enable-linker-build-id --enable-lto --enable-plugin --enable-install-libiberty --with-linker-hash-style=gnu --enable-gnu-indirect-function --enable-mul
Thread model: posix
gcc version 8.1.0 (GCC)
There is no linker error for uses of sqrt in the main function. Is there any particular reason that this is the case?
I also checked with clang, but neither compiled (linker errors) without the -lm.
gcc is a particularly clever compiler.
It will optimise out sqrt(4.0) as a compile-time evaluable constant expression. It can do that since the behaviour of sqrt is defined by the standard and its return value is solely a function of its input. (Note further that under IEEE754, sqrt must return the closest double to the exact result. This further supports the optimisation hypothesis.)
In the second case, the presence of the function sqrt2 (which could be used by other translation units) defeats this optimisation: its argument a is not known at compile time, so a real call to sqrt has to be emitted and resolved by the linker.
Since the argument to sqrt is known at compile time, and its behaviour is standardised, sqrt(4.0) can be calculated at compile time.

gcc's __builtin_memcpy performance with certain number of bytes is terrible compared to clang's

I thought I'd first share this here to get your opinions before doing anything else. While designing an algorithm, I found that the performance of gcc-compiled code for some simple code was catastrophic compared to clang's.
How to reproduce
Create a test.c file containing this code :
#include <sys/stat.h>
#include <sys/types.h>
#include <stdio.h>
#include <stdint.h>
#include <stdlib.h>
#include <stdbool.h>
#include <string.h>
int main(int argc, char *argv[]) {
const uint64_t size = 1000000000;
const size_t alloc_mem = size * sizeof(uint8_t);
uint8_t *mem = (uint8_t*)malloc(alloc_mem);
for (uint_fast64_t i = 0; i < size; i++)
mem[i] = (uint8_t) (i >> 7);
uint8_t block = 0;
uint_fast64_t counter = 0;
uint64_t total = 0x123456789abcdefllu;
uint64_t receiver = 0;
for(block = 1; block <= 8; block ++) {
printf("%u ...\n", block);
counter = 0;
while (counter < size - 8) {
__builtin_memcpy(&receiver, &mem[counter], block);
receiver &= (0xffffffffffffffffllu >> (64 - ((block) << 3)));
total += ((receiver * 0x321654987cbafedllu) >> 48);
counter += block;
}
}
printf("=> %llu\n", total);
return EXIT_SUCCESS;
}
gcc
Compile and run :
gcc-7 -O3 test.c
time ./a.out
1 ...
2 ...
3 ...
4 ...
5 ...
6 ...
7 ...
8 ...
=> 82075168519762377
real 0m23.367s
user 0m22.634s
sys 0m0.495s
info :
gcc-7 -v
Using built-in specs.
COLLECT_GCC=gcc-7
COLLECT_LTO_WRAPPER=/usr/local/Cellar/gcc/7.3.0/libexec/gcc/x86_64-apple-darwin17.4.0/7.3.0/lto-wrapper
Target: x86_64-apple-darwin17.4.0
Configured with: ../configure --build=x86_64-apple-darwin17.4.0 --prefix=/usr/local/Cellar/gcc/7.3.0 --libdir=/usr/local/Cellar/gcc/7.3.0/lib/gcc/7 --enable-languages=c,c++,objc,obj-c++,fortran --program-suffix=-7 --with-gmp=/usr/local/opt/gmp --with-mpfr=/usr/local/opt/mpfr --with-mpc=/usr/local/opt/libmpc --with-isl=/usr/local/opt/isl --with-system-zlib --enable-checking=release --with-pkgversion='Homebrew GCC 7.3.0' --with-bugurl=https://github.com/Homebrew/homebrew-core/issues --disable-nls
Thread model: posix
gcc version 7.3.0 (Homebrew GCC 7.3.0)
So we get about 23s of user time. Now let's do the same with cc (clang on macOS) :
clang
cc -O3 test.c
time ./a.out
1 ...
2 ...
3 ...
4 ...
5 ...
6 ...
7 ...
8 ...
=> 82075168519762377
real 0m9.832s
user 0m9.310s
sys 0m0.442s
info :
Apple LLVM version 9.0.0 (clang-900.0.39.2)
Target: x86_64-apple-darwin17.4.0
Thread model: posix
InstalledDir: /Applications/Xcode.app/Contents/Developer/Toolchains/XcodeDefault.xctoolchain/usr/bin
That's roughly 2.4x faster! Any thoughts?
I replaced the __builtin_memcpy function by memcpy to test things out and this time the compiled code runs in about 34s on both sides - consistent and slower as expected.
It would appear that the combination of __builtin_memcpy and bitmasking is handled very differently by the two compilers.
I had a look at the assembly code, but couldn't see anything standing out that would explain such a drop in performance as I'm not an asm expert.
Edit 03-05-2018 :
Posted this bug : https://gcc.gnu.org/bugzilla/show_bug.cgi?id=84719.
I find it suspicious that you get different code for memcpy vs __builtin_memcpy. I don't think that's supposed to happen, and indeed I cannot reproduce it on my (linux) system.
If you add #pragma GCC unroll 16 (implemented in gcc-8+) before the for loop, gcc gets the same perf as clang (making block a constant is essential to optimize the code), so essentially llvm's unrolling is more aggressive than gcc's, which can be good or bad depending on cases. Still, feel free to report it to gcc, maybe they'll tweak the unrolling heuristics some day and an extra testcase could help.
Once unrolling is taken care of, gcc does ok for some values (block equals 4 or 8 in particular), but much worse for some others, in particular 3. But that's better analyzed with a smaller testcase without the loop on block. Gcc seems to have trouble with memcpy(,,3), it works much better if you always read 8 bytes (the next line already takes care of the extra bytes IIUC). Another thing that could be reported to gcc.
