Is this clang optimization a bug?

I ran into an interesting issue when compiling some code with -O3 using clang on OSX High Sierra. The code is this:
#include <stdint.h>
#include <limits.h> /* for CHAR_BIT */
#include <stdio.h> /* for printf() */
#include <stddef.h> /* for size_t */
uint64_t get_morton_code(uint16_t x, uint16_t y, uint16_t z)
{
    /* Returns the number formed by interleaving the bits in x, y, and z, also
     * known as the morton code.
     *
     * See https://graphics.stanford.edu/~seander/bithacks.html#InterleaveTableObvious
     */
    size_t i;
    uint64_t a = 0;
    for (i = 0; i < sizeof(x)*CHAR_BIT; i++) {
        a |= (x & 1U << i) << (2*i)
           | (y & 1U << i) << (2*i + 1)
           | (z & 1U << i) << (2*i + 2);
    }
    return a;
}
int main(int argc, char **argv)
{
    printf("get_morton_code(99,159,46) = %llu\n", get_morton_code(99,159,46));
    return 0;
}
When compiling this with cc -O1 -o test_morton_code test_morton_code.c I get the following output:
get_morton_code(99,159,46) = 4631995
which is correct. However, when compiling with cc -O3 -o test_morton_code test_morton_code.c:
get_morton_code(99,159,46) = 4294967295
which is wrong.
What is also odd is that this bug appears in my code when switching from -O2 to -O3 whereas in the minimal working example above it appears when going from -O1 to -O2.
Is this a bug in the compiler optimization or am I doing something stupid that's only appearing when the compiler is optimizing more aggressively?
I'm using the following version of clang:
snotdaqs-iMac:snoFitter snoperator$ cc --version
Apple LLVM version 9.1.0 (clang-902.0.39.1)
Target: x86_64-apple-darwin17.5.0
Thread model: posix
InstalledDir: /Applications/Xcode.app/Contents/Developer/Toolchains/XcodeDefault.xctoolchain/usr/bin

UndefinedBehaviorSanitizer is really helpful in catching such mistakes:
$ clang -fsanitize=undefined -O3 o3.c
$ ./a.out
o3.c:19:2: runtime error: shift exponent 32 is too large for 32-bit type 'unsigned int'
get_morton_code(99,159,46) = 4294967295
A possible fix would be replacing the 1Us with 1ULL; an unsigned long long is at least 64 bits wide and can be shifted that far.
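With that change, the loop from the question would read something like this (a sketch of the suggested fix; the rest of the function is unchanged):

    for (i = 0; i < sizeof(x)*CHAR_BIT; i++) {
        /* 1ULL is unsigned long long, so each shift below is performed in a
         * type that is at least 64 bits wide. */
        a |= (x & 1ULL << i) << (2*i)
           | (y & 1ULL << i) << (2*i + 1)
           | (z & 1ULL << i) << (2*i + 2);
    }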

When i is 15 in the loop, 2*i+2 is 32, and you are shifting an unsigned int by the number of bits in an unsigned int, which is undefined.
You apparently intend to work in a 64-bit field, so cast the left side of the shift to uint64_t.
A proper printf format for uint64_t is "get_morton_code(99,159,46) = %" PRIu64 "\n". PRIu64 is defined in the <inttypes.h> header.
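As a sketch of that variant (assuming the same includes as in the question, plus <inttypes.h> for PRIu64), the function and the call would become:

    #include <inttypes.h> /* for PRIu64 */

    uint64_t get_morton_code(uint16_t x, uint16_t y, uint16_t z)
    {
        size_t i;
        uint64_t a = 0;
        for (i = 0; i < sizeof(x)*CHAR_BIT; i++) {
            /* Cast the left operand so each shift is performed in uint64_t. */
            a |= (uint64_t)(x & 1U << i) << (2*i)
               | (uint64_t)(y & 1U << i) << (2*i + 1)
               | (uint64_t)(z & 1U << i) << (2*i + 2);
        }
        return a;
    }

    printf("get_morton_code(99,159,46) = %" PRIu64 "\n", get_morton_code(99,159,46));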

Related

"warning: left shift count >= width of type" seems to happen too late [duplicate]

This question has been marked as a duplicate of "bitwise shift promotes unsigned char to int".
Take the following code snippet:
#include <stdio.h>
#include <stdint.h>
void decodeStatus(uint8_t d)
{
    uint64_t a = d << 31;
    uint64_t b = d << 32;
}
When shifting a uint8_t eight or more places to the left, it becomes 0, which results in a warning. The compiler is obligated to make sure that a uint8_t is exactly 8 bits (C11 spec 7.20.1.1, paragraphs 1 & 2), so both shifts should result in a warning. But apparently, only the second shift produces a warning:
gcc -c main.c -o a.out
main.c: In function ‘decodeStatus’:
main.c:7:20: warning: left shift count >= width of type [-Wshift-count-overflow]
uint64_t b = d << 32;
Is this a compiler bug, or is there a logical explanation? I verified this on the following compilers:
$> gcc --version
gcc (Ubuntu 7.5.0-3ubuntu1~18.04) 7.5.0
$> arm-linux-gnueabihf-gcc --version
arm-linux-gnueabihf-gcc (GCC) 9.2.0
The shift does not happen in uint8_t at all: the integer promotions convert d to int first, so d << 31 is still within the 32-bit width of int, and only d << 32 triggers -Wshift-count-overflow. If you want the shifts to be performed in a 64-bit type, you need to cast the shifted data:
uint64_t decodeStatus(uint8_t *buf)
{
    uint64_t raw = (uint64_t)buf[0]
                 | ((uint64_t)buf[1] << (1*8))
                 | ((uint64_t)buf[2] << (2*8))
                 | ((uint64_t)buf[3] << (3*8))
                 | ((uint64_t)buf[4] << (4*8))
                 | ((uint64_t)buf[5] << (5*8))
                 | ((uint64_t)buf[6] << (6*8))
                 | ((uint64_t)buf[7] << (7*8));
    return raw;
}
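A minimal usage sketch, assuming the decodeStatus shown above is in scope (the buffer contents here are made up for illustration):

    #include <stdio.h>
    #include <stdint.h>
    #include <inttypes.h>

    int main(void)
    {
        /* Hypothetical little-endian byte buffer. */
        uint8_t buf[8] = {0x01, 0x02, 0x03, 0x04, 0x05, 0x06, 0x07, 0x08};
        printf("status = 0x%" PRIx64 "\n", decodeStatus(buf)); /* prints 0x807060504030201 */
        return 0;
    }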

gcc's __builtin_memcpy performance with certain numbers of bytes is terrible compared to clang's

I thought I'd first share this here to get your opinions before doing anything else. I found out while designing an algorithm that the performance of the gcc-compiled code for some simple code was catastrophic compared to clang's.
How to reproduce
Create a test.c file containing this code:
#include <sys/stat.h>
#include <sys/types.h>
#include <stdio.h>
#include <stdint.h>
#include <stdlib.h>
#include <stdbool.h>
#include <string.h>
int main(int argc, char *argv[]) {
    const uint64_t size = 1000000000;
    const size_t alloc_mem = size * sizeof(uint8_t);
    uint8_t *mem = (uint8_t*)malloc(alloc_mem);
    for (uint_fast64_t i = 0; i < size; i++)
        mem[i] = (uint8_t) (i >> 7);
    uint8_t block = 0;
    uint_fast64_t counter = 0;
    uint64_t total = 0x123456789abcdefllu;
    uint64_t receiver = 0;
    for (block = 1; block <= 8; block++) {
        printf("%u ...\n", block);
        counter = 0;
        while (counter < size - 8) {
            __builtin_memcpy(&receiver, &mem[counter], block);
            receiver &= (0xffffffffffffffffllu >> (64 - ((block) << 3)));
            total += ((receiver * 0x321654987cbafedllu) >> 48);
            counter += block;
        }
    }
    printf("=> %llu\n", total);
    return EXIT_SUCCESS;
}
gcc
Compile and run:
gcc-7 -O3 test.c
time ./a.out
1 ...
2 ...
3 ...
4 ...
5 ...
6 ...
7 ...
8 ...
=> 82075168519762377
real 0m23.367s
user 0m22.634s
sys 0m0.495s
Info:
gcc-7 -v
Using built-in specs.
COLLECT_GCC=gcc-7
COLLECT_LTO_WRAPPER=/usr/local/Cellar/gcc/7.3.0/libexec/gcc/x86_64-apple-darwin17.4.0/7.3.0/lto-wrapper
Target: x86_64-apple-darwin17.4.0
Configured with: ../configure --build=x86_64-apple-darwin17.4.0 --prefix=/usr/local/Cellar/gcc/7.3.0 --libdir=/usr/local/Cellar/gcc/7.3.0/lib/gcc/7 --enable-languages=c,c++,objc,obj-c++,fortran --program-suffix=-7 --with-gmp=/usr/local/opt/gmp --with-mpfr=/usr/local/opt/mpfr --with-mpc=/usr/local/opt/libmpc --with-isl=/usr/local/opt/isl --with-system-zlib --enable-checking=release --with-pkgversion='Homebrew GCC 7.3.0' --with-bugurl=https://github.com/Homebrew/homebrew-core/issues --disable-nls
Thread model: posix
gcc version 7.3.0 (Homebrew GCC 7.3.0)
So we get about 23s of user time. Now let's do the same with cc (clang on macOS):
clang
cc -O3 test.c
time ./a.out
1 ...
2 ...
3 ...
4 ...
5 ...
6 ...
7 ...
8 ...
=> 82075168519762377
real 0m9.832s
user 0m9.310s
sys 0m0.442s
Info:
Apple LLVM version 9.0.0 (clang-900.0.39.2)
Target: x86_64-apple-darwin17.4.0
Thread model: posix
InstalledDir: /Applications/Xcode.app/Contents/Developer/Toolchains/XcodeDefault.xctoolchain/usr/bin
That's roughly 2.4x faster! Any thoughts?
I replaced the __builtin_memcpy call with memcpy to test things out, and this time the compiled code runs in about 34s on both sides, consistent and slower as expected.
It would appear that the combination of __builtin_memcpy and bit masking is interpreted very differently by the two compilers.
I had a look at the assembly code, but since I'm not an asm expert I couldn't see anything standing out that would explain such a drop in performance.
Edit 03-05-2018:
Posted this bug: https://gcc.gnu.org/bugzilla/show_bug.cgi?id=84719.
I find it suspicious that you get different code for memcpy vs __builtin_memcpy. I don't think that's supposed to happen, and indeed I cannot reproduce it on my (Linux) system.
If you add #pragma GCC unroll 16 (implemented in gcc-8+) before the for loop, gcc gets the same perf as clang (making block a constant is essential to optimize the code), so essentially llvm's unrolling is more aggressive than gcc's, which can be good or bad depending on cases. Still, feel free to report it to gcc, maybe they'll tweak the unrolling heuristics some day and an extra testcase could help.
Once unrolling is taken care of, gcc does ok for some values (block equals 4 or 8 in particular), but much worse for some others, in particular 3. But that's better analyzed with a smaller testcase without the loop on block. Gcc seems to have trouble with memcpy(,,3), it works much better if you always read 8 bytes (the next line already takes care of the extra bytes IIUC). Another thing that could be reported to gcc.
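For illustration, the suggested change would look something like this in the code from the question (a sketch only; the pragma requires gcc 8 or later, and I have not re-run the benchmark with it here):

    /* Ask gcc to fully unroll the loop over `block`, so that `block` is a
     * compile-time constant in each unrolled copy and the __builtin_memcpy
     * size is known at compile time. */
    #pragma GCC unroll 16
    for (block = 1; block <= 8; block++) {
        printf("%u ...\n", block);
        counter = 0;
        while (counter < size - 8) {
            __builtin_memcpy(&receiver, &mem[counter], block);
            receiver &= (0xffffffffffffffffllu >> (64 - ((block) << 3)));
            total += ((receiver * 0x321654987cbafedllu) >> 48);
            counter += block;
        }
    }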

Why does clang produce wrong results for my C code compiled with -O1 but not with -O0?

For input 0xffffffff, the following C code works fine with no optimization, but produces wrong results when compiled with -O1. Other compilation options are -g -m32 -Wall. The code was tested with clang-900.0.39.2 on macOS 10.13.2.
#include <stdio.h>
#include <stdlib.h>
int main(int argc, char *argv[]) {
    if (argc < 2) return 1;
    char *endp;
    int x = (int)strtoll(argv[1], &endp, 0);
    int mask1 = 0x55555555;
    int mask2 = 0x33333333;
    int count = (x & mask1) + ((x >> 1) & mask1);
    int v1 = count >> 2;
    printf("v1 = %#010x\n", v1);
    int v2 = v1 & mask2;
    printf("v2 = %#010x\n", v2);
    return 0;
}
Input: 0xffffffff
Outputs with -O0: (expected)
v1 = 0xeaaaaaaa
v2 = 0x22222222
Outputs with -O1: (wrong)
v1 = 0x2aaaaaaa
v2 = 0x02222222
Below are disassembled instructions for the line "int v1 = count >> 2;" with -O0 and -O1.
With -O0:
sarl $0x2, %esi
With -O1:
shrl $0x2, %esi
Below are disassembled instructions for the line "int v2 = v1 & mask2;" with -O0 and -O1.
With -O0:
andl -0x24(%ebp), %esi //-0x24(%ebp) stores 0x33333333
With -O1:
andl $0x13333333, %esi //why does the optimization change 0x33333333 to 0x13333333?
In addition, if x is set to 0xffffffff locally instead of getting its value from arguments, the code will work as expected even with -O1.
P.S.: The code is an experimental piece based on my solution to the Data Lab from the CS:APP course at CMU. The lab asks the student to implement a function that counts the number of 1 bits in an int variable without using any type other than int.
As several commenters have pointed out, right-shifting negative signed values is not well defined (the result is implementation-defined in C).
I changed the declaration and initialization of x to
unsigned int x = (unsigned int)strtoll(argv[1], &endp, 0);
and got consistent results under -O0 and -O1. (But before making that change, I was able to reproduce your result with clang on macOS.)
As you have discovered, you invoke implementation-defined behavior in your attempt to store 0xffffffff (4294967295) in int x (where INT_MAX is 0x7fffffff, or 2147483647); see C11 Standard §6.3.1.3 (draft n1570), Signed and unsigned integers. Whenever you use strtoll (or strtoull; the single-l versions strtol and strtoul would be fine as well) and intend to store the value as an int, you must check the result against INT_MAX before making the assignment with a cast (or, if using exact-width types, against INT32_MAX, or UINT32_MAX for unsigned).
Further, in circumstances such as this where bit operations are involved, you can remove uncertainty and ensure portability by using the exact-width types provided in stdint.h and the associated format specifiers provided in inttypes.h. Here, there is no need for a signed int; it would make more sense to handle all values as unsigned (or uint32_t).
For example, the following provides a default value for the input to avoid the undefined behavior invoked if your code is executed without an argument (you can also simply test argc), replaces the use of strtoll with strtoul, validates that the input fits in the associated variable before assignment (handling the error if it does not), and then makes use of the unambiguous exact-width types, e.g.
#include <stdio.h>
#include <stdlib.h>
#include <stdint.h>
#include <inttypes.h>

int main (int argc, char *argv[]) {

    uint64_t tmp = argc > 1 ? strtoul (argv[1], NULL, 0) : 0xffffffff;

    if (tmp > UINT32_MAX) {
        fprintf (stderr, "input exceeds UINT32_MAX.\n");
        return 1;
    }

    uint32_t x = (uint32_t)tmp,
             mask1 = 0x55555555,
             mask2 = 0x33333333,
             count = (x & mask1) + ((x >> 1) & mask1),
             v1 = count >> 2,
             v2 = v1 & mask2;

    printf("v1 = 0x%" PRIx32 "\n", v1);
    printf("v2 = 0x%" PRIx32 "\n", v2);

    return 0;
}
Example Use/Output
$ ./bin/masktst
v1 = 0x2aaaaaaa
v2 = 0x22222222
Compiled with
$ gcc -Wall -Wextra -pedantic -std=gnu11 -Ofast -o bin/masktst masktst.c
Look things over and let me know if you have further questions.
This statement:
int x = (int)strtoll(argv[1], &endp, 0);
results in an out-of-range conversion to a signed type, which is implementation-defined behavior (on my system, the result is -1431655766).
The resulting values tend to go downhill from there:
The variable v1 receives -357913942.
The variable v2 receives 572662306.
The %x format specifier only works correctly with unsigned variables.
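If you do keep the variables signed, one small workaround for the printing issue (a sketch, not the only option) is to convert explicitly to an unsigned type before handing the value to %x:

    /* %x expects an unsigned int; the conversion preserves the bit pattern. */
    printf("v1 = %#010x\n", (unsigned int)v1);
    printf("v2 = %#010x\n", (unsigned int)v2);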

Code fails to execute when compiled with these flags

My code is trying to find the entropy of a signal (stored in 'data' and 'interframe' - in the full code these would contain the signal, here I've just put in some random values). When I compile with 'gcc temp.c' it compiles and runs fine.
Output:
entropy: 40.174477
features: 0022FD06
features[0]: 40
entropy: 40
But when I compile with 'gcc -mstackrealign -msse -Os -ftree-vectorize temp.c', it compiles but fails to execute beyond line 48. It needs all four flags to fail; with any three of them it runs fine.
The code probably looks weird - I've chopped just the failing bits out of a much bigger program. I only have the foggiest idea of what the compiler flags do; someone else put them in (and there are usually more of them, but I worked out that these were the bad ones).
All help much appreciated!
#include <stdint.h>
#include <inttypes.h>
#include <stdio.h>
#include <math.h>
static void calc_entropy(volatile int16_t *features, const int16_t* data,
                         const int16_t* interframe, int frame_length);

int main()
{
    int frame_length = 128;
    int16_t data[128] = {1, 2, 3, 4};
    int16_t interframe[128] = {1, 1, 1};
    int16_t a = 0;
    int16_t* features = &a;

    calc_entropy(features, data, interframe, frame_length);
    features += 1;

    fprintf(stderr, "\nentropy: %d", a);
    return 0;
}

static void calc_entropy(volatile int16_t *features, const int16_t* data,
                         const int16_t* interframe, int frame_length)
{
    float histo[65536] = {0};
    float* histo_zero = histo + 32768;
    volatile float entropy = 0.0f;
    int i;

    for (i = 0; i < frame_length; i++) {
        histo_zero[data[i]]++;
        histo_zero[interframe[i]]++;
    }

    for (i = -32768; i < 32768; i++) {
        if (histo_zero[i])
            entropy -= histo_zero[i]*logf(histo_zero[i]/(float)(frame_length*2));
    }

    fprintf(stderr, "\nentropy: %f", entropy);
    fprintf(stderr, "\nfeatures: %p", features);
    features[0] = entropy; //execution fails here
    fprintf(stderr, "\nfeatures[0]: %d", features[0]);
}
Edit: I'm using gcc 4.5.2, with x86 architecture. Also, if I compile and run it on VirtualBox running Ubuntu (gcc -lm -mstackrealign -msse -Os -ftree-vectorize temp.c), it executes correctly.
Edit2: I get
entropy: 40.174477
features: 00000000
and then a message from Windows telling me that the program has stopped running.
Edit3: In the five months since I originally posted the question I've updated to gcc 4.7.0, and the code now runs fine. I went back to gcc 4.5.2, and it failed. Still don't know why!
ottavio@magritte:/tmp$ gcc x.c -o x -lm -mstackrealign -msse -Os -ftree-vectorize
ottavio@magritte:/tmp$ ./x
entropy: 40.174477
features: 0x7fff5fe151ce
features[0]: 40
entropy: 40
ottavio@magritte:/tmp$ gcc x.c -o x -lm
ottavio@magritte:/tmp$ ./x
entropy: 40.174477
features: 0x7fffd7eff73e
features[0]: 40
entropy: 40
ottavio@magritte:/tmp$
So, what's wrong with it? gcc 4.6.1 and x86_64 architecture.
It seems to be running here as well, and the only thing I see that might be funky is that you are taking a 16-bit value (features[0]) and converting a 32-bit float (entropy)
features[0] = entropy; //execution fails here
into that value, which of course will shave off the fractional part.
It shouldn't matter, but for the heck of it, see if it makes any difference if you change your int16_t values to int32_t values.
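As a small standalone illustration of the truncation being described, using the entropy value from the output above (assumes <stdint.h>):

    float entropy = 40.174477f;
    int16_t truncated = (int16_t)entropy; /* fractional part discarded: truncated == 40 */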

What is the difference between `cc -std=c99` and `c99` on Mac OS?

Given the following program:
/* Find the sum of all the multiples of 3 or 5 below 1000. */
#include <stdio.h>

unsigned long int method_one(const unsigned long int n);

int
main(int argc, char *argv[])
{
    unsigned long int sum = method_one(1000000000);
    if (sum != 0) {
        printf("Sum: %lu\n", sum);
    } else {
        printf("Error: Unsigned Integer Wrapping.\n");
    }
    return 0;
}

unsigned long int
method_one(const unsigned long int n)
{
    unsigned long int i;
    unsigned long int sum = 0;
    for (i=1; i!=n; ++i) {
        if (!(i % 3) || !(i % 5)) {
            unsigned long int tmp_sum = sum;
            sum += i;
            if (sum < tmp_sum)
                return 0;
        }
    }
    return sum;
}
On a Mac OS system (Xcode 3.2.3), if I use cc for compilation with the -std=c99 flag, everything seems just right:
nietzsche:problem_1 robert$ cc -std=c99 problem_1.c -o problem_1
nietzsche:problem_1 robert$ ./problem_1
Sum: 233333333166666668
However, if I use c99 to compile it this is what happens:
nietzsche:problem_1 robert$ c99 problem_1.c -o problem_1
nietzsche:problem_1 robert$ ./problem_1
Error: Unsigned Integer Wrapping.
Can you please explain this behavior?
c99 is a wrapper of gcc. It exists because POSIX requires it. c99 will generate a 32-bit (i386) binary by default.
cc is a symlink to gcc, so it takes whatever default configuration gcc has. gcc produces a binary for the native architecture by default, which is x86_64.
unsigned long is 32 bits wide on i386 on OS X, and 64 bits wide on x86_64. Therefore, c99 will hit the "Unsigned Integer Wrapping" branch, which cc -std=c99 does not.
You can force c99 to generate a 64-bit binary on OS X with the -W 64 flag:
c99 -W 64 problem_1.c -o problem_1
(Note: by gcc I mean the actual gcc binary like i686-apple-darwin10-gcc-4.2.1.)
Under Mac OS X, cc is a symlink to gcc (defaults to 64-bit), and c99 is not (defaults to 32-bit).
/usr/bin/cc -> gcc-4.2
And they use different default byte-sizes for data types.
/** sizeof.c
 */
#include <stdio.h>
#include <stdlib.h>

int main(int argc, char *argv[])
{
    printf("sizeof(unsigned long int)==%d\n", (int)sizeof(unsigned long int));
    return EXIT_SUCCESS;
}
cc -std=c99 sizeof.c
./a.out
sizeof(unsigned long int)==8
c99 sizeof.c
./a.out
sizeof(unsigned long int)==4
Quite simply, you are overflowing (aka wrapping) your integer variable when using the c99 compiler.

Resources