I'm trying to understand how bitwise operations are handled by C/C++ compilers.
Specifically, I'm talking about C compiled with gcc, but I believe that the question is a bit more general than that.
Anyway, suppose I have a macro defined as follows:
#define SOME_CONSTANT 0x111UL
#define SOME_OFFSET 2
#define SOME_MASK 7
#define SOME_VALUE ((SOME_CONSTANT) << (SOME_OFFSET)) & (SOME_MASK)
static inline void foo(void) { printf("Value: %lu\n", SOME_VALUE); }
All the ingredients of SOME_VALUE are constants, and they are all known at compile time.
So my question is: will gcc evaluate SOME_VALUE at compile time, or will it be done at runtime only?
How do I check whether gcc supports such an optimization?
Your compiler never sees SOME_VALUE. The C code is first run through the C preprocessor before it reaches the compiler proper. You can see the preprocessor's output by running gcc as:
gcc -E code.c
You'll see that the real code fed to the compiler is:
int main(void) {
printf("Value: %lu\n", ((0x111UL) << (2)) & (7));
return 0;
}
So the question becomes "Does GCC's C compiler optimize ((0x111UL) << (2)) & (7)?", and the answer is yes (as other answerers have shown by inspecting the generated assembly).
Yes, gcc will optimise this as it is a completely constant expression.
To check this, look at the assembly code, for example with this tool: https://gcc.godbolt.org/
#include <stdio.h>
#define SOME_CONSTANT 0x111UL
#define SOME_OFFSET 2
#define SOME_MASK 7
#define SOME_VALUE ((SOME_CONSTANT) << (SOME_OFFSET)) & (SOME_MASK)
void foo(void) { printf("Value: %lu\n", SOME_VALUE); }
I had to modify your code slightly as otherwise gcc optimises away the whole thing and leaves nothing!
.LC0:
.string "Value: %lu\n"
foo():
movl $4, %esi
movl $.LC0, %edi
xorl %eax, %eax
jmp printf
will gcc evaluate SOME_VALUE at compile time
I don't know about yours, mine does
How do I check whether a gcc supports such optimization?
I used -S flag to generate assembly code and checked it
movl $4, %esi
As others have answered, yes, it will. But do consider that it is not required to; if you want that certainty, just pre-calculate the value yourself, since you have all the elements to do so.
Related
I have some code that rotates my data. I know GAS syntax has a single assembly instruction that can rotate an entire byte. However, when I try to follow any of the advice in Best practices for circular shift (rotate) operations in C++, my C code compiles into at least 5 instructions, which use up three registers, even when compiling with -O3. Maybe those are best practices in C++, but not in C?
In either case, how can I force C to use the ROR x86 instruction to rotate my data?
The precise line of code which is not getting compiled to the rotate instruction is:
value = (((y & mask) << 1 ) | (y >> (size-1))) //rotate y right 1
^ (((z & mask) << n ) | (z >> (size-n))) // rotate z left by n
// size can be 64 or 32, depending on whether we are rotating a long or an int, and
// mask would be 0xff or 0xffffffff, accordingly
I do not mind using __asm__ __volatile__ to do this rotate, if that's what I must do. But I don't know how to do so correctly.
Your macro compiles to a single ror instruction for me... specifically, I compiled this test file:
#define ROR(x,y) ((unsigned)(x) >> (y) | (unsigned)(x) << 32 - (y))
unsigned ror(unsigned x, unsigned y)
{
return ROR(x, y);
}
as C, using gcc 6, with -O2 -S, and this is the assembly I got:
.file "test.c"
.text
.p2align 4,,15
.globl ror
.type ror, @function
ror:
.LFB0:
.cfi_startproc
movl %edi, %eax
movl %esi, %ecx
rorl %cl, %eax
ret
.cfi_endproc
.LFE0:
.size ror, .-ror
.ident "GCC: (Debian 6.4.0-1) 6.4.0 20170704"
.section .note.GNU-stack,"",@progbits
Please try to do the same, and report the assembly you get. If your test program is substantially different from mine, please tell us how it differs. If you are using a different compiler or a different version of GCC please tell us exactly which one.
Incidentally, I get the same assembly output when I compile the code in the accepted answer for "Best practices for circular shift (rotate) operations in C++", as C.
How old is your compiler? As I noted in the linked question, the UB-safe variable-count rotate idiom (with extra & masking of the count) confuses old compilers, like gcc before 4.9. Since you're not masking the shift count, it should be recognized with even older gcc.
Your big expression is maybe confusing the compiler. Write an inline function for rotate, and call it, like
value = rotr32(y & mask, 1) ^ rotr32(z & mask, n);
Much more readable, and may help stop the compiler from trying to do things in the wrong order and breaking the idiom before recognizing it as a rotate.
Maybe those are best practices in C++, and not in C?
My answer on the linked question clearly says that it's the best practice for C as well as C++. They are different languages, but they overlap completely for this, according to my testing.
Here's a version of the Godbolt link using -xc to compile as C, not C++. I had a couple C++isms in the link in the original question, for experimenting with integer types for the rotate count.
Like the original linked from the best-practices answer, it has a version that uses x86 intrinsics if available. clang doesn't seem to provide any in x86intrin.h, but other compilers have _rotl / _rotr for 32-bit rotates, with other sizes available.
Actually, I talked about rotate intrinsics at length in the answer on the best-practices question, not just in the godbolt link. Did you even read the answer there, apart from the code block? (If you did, your question doesn't reflect it.)
Using intrinsics, or the idiom in your own inline function, is much better than using inline asm. Asm defeats constant propagation, among other things. Also, compilers can use BMI2 rorx dst, src, imm8 to copy-and-rotate with one instruction, if you compile with -march=haswell or -mbmi2. It's a lot harder to write an inline-asm rotate that can use rorx for immediate-count rotates but ror r32, cl for variable-count rotates. You could try with __builtin_constant_p(), but clang evaluates that before inlining, so it's basically useless for meta-programming-style choice of which code to use. It works with gcc, though. But it's still much better not to use inline asm unless you've exhausted all other avenues (like asking on SO) to avoid it. https://gcc.gnu.org/wiki/DontUseInlineAsm
Fun fact: the rotate functions in gcc's x86intrin.h are just pure C using the rotate idiom that gcc recognizes. Except for 16-bit rotates, where they use __builtin_ia32_rolhi.
You might need to be a bit more specific with what integral type / width you're rotating, and whether you have a fixed or variable rotation. ror{b,w,l,q} (8, 16, 32, 64-bit) has forms for (1), imm8, or the %cl register. As an example:
static inline uint32_t rotate_right (uint32_t u, size_t r)
{
__asm__ ("rorl %%cl, %0" : "+r" (u) : "c" (r));
return u;
}
I haven't tested this, it's just off the top of my head. And I'm sure multiple constraint syntax could be used to optimize cases where a constant (r) value is used, so %e/rcx is left alone.
If you're using a recent version of gcc or clang (or even icc), the intrinsics header <x86intrin.h> may provide __ror{b|w|d|q} intrinsics. I haven't tried them.
Best Way:
#define rotr32(x, n) (( x>>n ) | (x<<(32-n)))
#define rotr64(x, n) (( x>>n ) | (x<<(64-n)))
More generic:
#define rotr(x, n) (( x>>n ) | (x<<((sizeof(x)<<3)-n)))
And it compiles (in GCC) to exactly the same code as the asm versions below.
For 64 bit:
__asm__ __volatile__("rorq %b1, %0" : "=g" (u64) : "Jc" (cShift), "0" (u64));
or
static inline uint64_t CC_ROR64(uint64_t word, int i)
{
__asm__("rorq %%cl,%0"
:"=r" (word)
:"0" (word),"c" (i));
return word;
}
I have some code that runs x64 programs under ptrace and manipulates their code. For testing this sort of thing, I have a placeholder function in one of my test programs:
uint32_t __attribute__((noinline)) func(void) {
return 0xCCCCCCCC;
}
The 0xCCCCCCCC is then replaced at runtime with some other constant.
When I compile with clang and optimisation, though, clang replaces calls to the function with the constant 0xCCCCCCCC (since the function is so simple). For example, here's part of the code generated to check that the modified TestFunc() returns 0xCDCDCDCD, which it should after having been modified:
movl $3435973836, %edi ## imm = 0xCCCCCCCC
movl $3435973836, %eax ## imm = 0xCCCCCCCC
leaq 16843009(%rax), %rdx ## [16843009 = 0x01010101]
As you might imagine, this test then fails, because these two mismatching constant values won't match!
Can I force clang (and, hopefully in the same way, gcc...) to generate the function call?
(I don't want to disable optimisations globally - I already have a build that does that (and with that build, all tests pass). I'd just like to do a spot fix, if that's possible.)
Use attribute optnone or optimize to set the optimization level for func to -O0.
Only optnone is defined on my machine, so the code for me is the following; with gcc you would use __attribute__ ((optimize("O0"))) instead.
#include <stdint.h>
uint32_t __attribute__ ((optnone)) func(void) {
return 0xCCCCCCCC;
}
int main()
{
return func();
}
See In clang, how do you use per-function optimization attributes?.
I'm working with a small UART device and frequently need to switch the baud rate at which it operates.
Essentially the whole set-up boils down to
#define FOSC 2000000
#define BAUD 9600
uint8_t rate = (uint8_t) ((FOSC / (16.0 * BAUD)) - 1 + 0.5);
(Where +0.5 is used to round the result.)
I'm currently compiling with gcc 4.8.1, -O1.
Does the compiler optimize away the whole cast or am I left with a cast followed by a constant? Would this differ with different -O# values (besides -O0)? What about -Os (which I might have to compile with eventually)?
If it matters, I'm developing for the Atmel AT90USB647 (or the datasheet [pdf]).
It is extremely likely that any sane compiler will convert that entire expression (including the cast) into a constant when compiling with optimizations enabled.
However, to be sure, you'll need to look at the assembly output of your compiler.
But what about GCC 4.8.1 in particular?
Code
#include <stdint.h>
#include <stdio.h>
#define FOSC 2000000
#define BAUD 9600
int main() {
uint8_t rate = (uint8_t) ((FOSC / (16.0 * BAUD)) - 1 + 0.5);
printf("%u", rate);
}
Portion of the generated assembly with gcc -O1 red.c
main:
.LFB11:
.cfi_startproc
subq $8, %rsp
.cfi_def_cfa_offset 16
movl $12, %esi
movl $.LC0, %edi
movl $0, %eax
call printf
We can see clearly that gcc has precomputed the value of 12 for rate.
Atmel ships the newlib C library for AVR, which is a simple ANSI C library, math library, and collection of board support packages. You may refer to the ANSI C specification to find out; specifically, the section on conversions.
I would make sure that rate is written to a volatile variable or through a volatile pointer. The rate value may be calculated correctly, but if the write to the peripheral destination lacks the volatile qualifier, the optimizer may drop the write operation entirely.
I had been struggling for weeks with a poor-performing translator I had written.
On the following simple benchmark
#include<stdio.h>
int main()
{
int x;
char buf[2048];
FILE *test = fopen("test.out", "wb");
setvbuf(test, buf, _IOFBF, sizeof buf);
for(x=0;x<1024*1024; x++)
fprintf(test, "%04d", x);
fclose(test);
return 0;
}
we see the following result
bash-3.1$ gcc -O2 -static test.c -o test
bash-3.1$ time ./test
real 0m0.334s
user 0m0.015s
sys 0m0.016s
As you can see, the moment the -std=c99 flag is added, performance comes crashing down:
bash-3.1$ gcc -O2 -static -std=c99 test.c -o test
bash-3.1$ time ./test
real 0m2.477s
user 0m0.015s
sys 0m0.000s
The compiler I'm using is gcc 4.6.2 mingw32.
The generated file is about 12M, so this is a throughput difference of about 21 MB/s between the two.
Running diff shows that the generated files are identical.
I assumed this has something to do with file locking in fprintf, of which the program makes heavy use, but I haven't been able to find a way to switch that off in the C99 version.
I tried flockfile on the stream at the beginning of the program, and a corresponding funlockfile at the end, but was greeted with compiler errors about implicit declarations, and linker errors claiming undefined references to those functions.
Could there be another explanation for this problem, and more importantly, is there any way to use C99 on windows without paying such an enormous performance price?
Edit:
After looking at the code generated by these options, it looks like in the slow versions, mingw sticks in the following:
_fprintf:
LFB0:
.cfi_startproc
subl $28, %esp
.cfi_def_cfa_offset 32
leal 40(%esp), %eax
movl %eax, 8(%esp)
movl 36(%esp), %eax
movl %eax, 4(%esp)
movl 32(%esp), %eax
movl %eax, (%esp)
call ___mingw_vfprintf
addl $28, %esp
.cfi_def_cfa_offset 4
ret
.cfi_endproc
In the fast version, this simply does not exist; otherwise, both are exactly the same. I assume __mingw_vfprintf is the slowpoke here, but I have no idea what behavior it needs to emulate that makes it so slow.
After some digging in the source code, I have found why the MinGW function is so terribly slow:
At the beginning of a [v,f,s]printf in MinGW, there is some innocent-looking initialization code:
__pformat_t stream = {
dest, /* output goes to here */
flags &= PFORMAT_TO_FILE | PFORMAT_NOLIMIT, /* only these valid initially */
PFORMAT_IGNORE, /* no field width yet */
PFORMAT_IGNORE, /* nor any precision spec */
PFORMAT_RPINIT, /* radix point uninitialised */
(wchar_t)(0), /* leave it unspecified */
0, /* zero output char count */
max, /* establish output limit */
PFORMAT_MINEXP /* exponent chars preferred */
};
However, PFORMAT_MINEXP is not what it appears to be:
#ifdef _WIN32
# define PFORMAT_MINEXP __pformat_exponent_digits()
# ifndef _TWO_DIGIT_EXPONENT
# define _get_output_format() 0
# define _TWO_DIGIT_EXPONENT 1
# endif
static __inline__ __attribute__((__always_inline__))
int __pformat_exponent_digits( void )
{
char *exponent_digits = getenv( "PRINTF_EXPONENT_DIGITS" );
return ((exponent_digits != NULL) && ((unsigned)(*exponent_digits - '0') < 3))
|| (_get_output_format() & _TWO_DIGIT_EXPONENT)
? 2
: 3
;
}
This winds up getting called every time I want to print, and getenv on Windows must not be very quick. Replacing that define with a 2 brings the runtime back to where it should be.
So, the answer comes down to this: when using -std=c99 or any ANSI-compliant mode, MinGW switches the CRT runtime with its own. Normally, this wouldn't be an issue, but the MinGW lib had a bug which slowed its formatting functions down far beyond anything imaginable.
Using -std=c99 disables all GNU extensions.
With GNU extensions and optimization, your fprintf(test, "B") is probably replaced by a fputc('B', test)
Note this answer is obsolete, see https://stackoverflow.com/a/13973562/611560 and https://stackoverflow.com/a/13973933/611560
After some consideration of your assembler output, it looks like the slow version is using MinGW's own *printf() implementation, no doubt based on the GCC one, while the fast version is using the Microsoft implementation from msvcrt.dll.
Now, the MS one is notable for lacking a lot of features that the GCC one does implement. Some of these are GNU extensions, but some others are needed for C99 conformance. And since you are using -std=c99, you are requesting that conformance.
But why so slow? Well, one factor is simplicity: the MS version is far simpler, so it is expected to run faster, even in the trivial cases. Another factor is that you are running under Windows, so it is expected that the MS version will be more efficient than one ported from the Unix world.
Does it explain a factor of x10? Probably not...
Another thing you can try:
Replace fprintf() with sprintf(), printing into a memory buffer without touching the file at all. Then try fwrite() without any printf-style formatting. That way you can tell whether the time is lost formatting the data or writing to the FILE.
Since MinGW32 3.15, compliant printf functions are available to use instead of those found in Microsoft C runtime (CRT).
The new printf functions are used when compiling in strict ANSI, POSIX and/or C99 modes.
For more information see the mingw32 changelog
You can use __msvcrt_fprintf() to use the fast (non-compliant) function.
Religious arguments aside:
Option1:
if (pointer[i] == NULL) ...
Option2:
if (!pointer[i]) ...
In C is option1 functionally equivalent to option2?
Does the latter resolve quicker due to the absence of a comparison?
I prefer the explicit style (first version). It makes it obvious that there is a pointer involved and not an integer or something else but it's just a matter of style.
From a performance point of view, it should make no difference.
Equivalent. It says so in the language standard. And people have the damndest religious preferences!
I like the second, other people like the first.
Actually, I prefer a third kind to the first:
if (NULL == ptr) {
...
}
Because then I:
won't be able to miss and just type one =
won't miss the == NULL and mistake it for the opposite if the condition is long (multiple lines)
Functionally they are equivalent.
Even if a null pointer is not represented as all-zero bits, if (!ptr) still compares against the null pointer.
The following is incorrect. It's still here because there are many comments referring to it:
Do not compare a pointer with literal zero, however. It will work almost everywhere but is undefined behavior IIRC.
It is often useful to assume that compiler writers have at least a minimum of intelligence. Your compiler is not written by concussed ducklings. It is written by human beings, with years of programming experience, and years spent studying compiler theory. This doesn't mean that your compiler is perfect, and always knows best, but it does mean that it is perfectly capable of handling trivial automatic optimizations.
If the two forms are equivalent, then why wouldn't the compiler just translate one into the other to ensure both are equally efficient?
If if (pointer[i] == NULL) was slower than if (!pointer[i]), wouldn't the compiler just change it into the second, more efficient form?
So no, assuming they are equivalent, they are equally efficient.
As for the first part of the question, yes, they are equivalent. The language standard actually states this explicitly somewhere -- a pointer evaluates to true if it is non-NULL, and false if it is NULL, so the two are exactly identical.
Almost certainly no difference in performance. I prefer the implicit style of the second, though.
NULL should be declared in one of the standard header files as such:
#define NULL ((void*)0)
So either way, you are comparing against zero, and the compiler should optimize both the same way. Every processor has some "optimization" or opcode for comparing with zero.
Premature optimization is bad. Micro-optimization is also bad; unless you are trying to squeeze every last Hz from your CPU, there is no point in doing it. As people have already shown, the compiler will optimize most of your code away anyway.
It's best to make your code as concise and readable as possible. If this is more readable
if (!ptr)
than this
if (NULL==ptr)
then use it. As long as everyone who will be reading your code agrees.
Personally, I use the fully spelled-out form (NULL==ptr) so it is clear what I am checking for. It might be longer to type, but I can easily read it. I'd think the ! in !ptr would be easy to miss if reading too quickly.
It really depends on the compiler. I'd be surprised if most modern C compilers didn't generate virtually identical code for the specific scenario you describe.
Get your compiler to generate an assembly listing for each of those scenarios and you can answer your own question (for your particular compiler :)).
And even if they are different, the performance difference will probably be irrelevant in practical applications.
Turn on compiler optimization and they're basically the same
tested this on gcc 4.3.3
int main (int argc, char** argv) {
char c = getchar();
int x = (c == 'x');
if(x == NULL)
putchar('y');
return 0;
}
vs
int main (int argc, char** argv) {
char c = getchar();
int x = (c == 'x');
if(!x)
putchar('y');
return 0;
}
gcc -O -o test1 test1.c
gcc -O -o test2 test2.c
diff test1 test2
produced no output :)
I did a assembly dump, and found the difference between the two versions:
@@ -11,8 +11,7 @@
pushl %ecx
subl $20, %esp
movzbl -9(%ebp), %eax
- movsbl %al,%eax
- testl %eax, %eax
+ testb %al, %al
It looks like the latter actually generates one instruction and the former generates two, but this is pretty unscientific.
This is gcc, no optimizations:
test1.c:
#include <stdio.h>
#include <stdlib.h>
int main(int argc, char *argv[])
{
char *pointer[5];
if(pointer[0] == NULL) {
exit(1);
}
exit(0);
}
test2.c: Change pointer[0] == NULL to !pointer[0]
gcc -S test1.c, gcc -S test2.c, diff -u test1.s test2.s
#include <stdio.h>
#include <stdlib.h>
int main(int argc, char *argv[])
{
char pointer[5];
/* This makes no sense: you are comparing a char to a null pointer constant */
if(pointer[0] == NULL) {
exit(1);
}
...
}
=> ...
movzbl -9(%ebp), %eax # your code compares a 1-byte value to a signed 4-byte one
movsbl %al,%eax # Will result in sign extension...
testl %eax, %eax
...
Beware: gcc should have emitted a warning here; if it did not, compile with the -Wall flag.
Though, you should really be compiling with optimizations enabled when inspecting generated gcc code.
BTW, declare your variable volatile to keep gcc from optimizing it away...
Always mention your compiler build version :)