Floating point anomaly when an unused statement is not commented out?

When the program as shown below is run, it produces ok output:
j= 0 9007199616606190.000000 = x
k= 0 9007199616606190.000000 = [x]
r= 31443101 0.000000 = m*(x-[x])
But when the commented-out line (i.e. //if (argc>1) r = atol(argv[1]);) is uncommented, it produces:
j= 20000 9007199616606190.000000 = x
k= 17285 9007199616606190.000000 = [x]
r= 31443101 0.000000 = m*(x-[x])
even though that line should have no effect, since argc>1 is false. Has anybody got a plausible explanation for this problem? Is it reproducible on any other systems?
#include <stdio.h>
#include <stdlib.h>
#include <math.h>
int main(int argc, char *argv[]) {
    int j, k, m = 10000;
    double r = 31443101, jroot = sqrt(83), x;
    //if (argc>1) r = atol(argv[1]);
    x = r * r * jroot;
    j = m * (x - floor(x));
    k = floor(m * (x - floor(x)));
    printf("j= %9d %24.6f = x\n", j, x);
    printf("k= %9d %24.6f = [x]\n", k, floor(x));
    printf("r= %9.0f %24.6f = m*(x-[x]) \n", r, m * (x - floor(x)));
    return 0;
}
Note: test system = AMD Athlon 64 5200+ with Linux 2.6.35.14-96.fc14.i686 (i.e., booted to run a 32-bit OS on 64-bit hardware) with gcc (GCC) 4.5.1 20100924 (Red Hat 4.5.1-4)
Update -- A few hours ago I posted a comment that code generated with and without the if statement differed only in stack offsets and some skipped code. I now find that comment was not entirely correct; i.e. it is true for non-optimized code, but not true for the -O3 code I executed.
Effect of optimization switch on problem:
-O0 : Both program versions run ok
-O2 or -O3 : Version with comment has error as above, where j=20000 and k=17285
-O1 : Version with comment has j=20000 (an error) and k=0 (OK)
Anyhow, looking at -O3 -S code listings, the two cases differ mostly in skipped if code and stack offsets up to the line before call floor, at which point the with-if code has one more fstpl than the without-if code:
... ;; code without comment:
fmul %st, %st(1)
fxch %st(1)
fstpl (%esp)
fxch %st(1)
fstpl 48(%esp)
fstpl 32(%esp)
call floor
movl $.LC2, (%esp)
fnstcw 86(%esp)
movzwl 86(%esp), %eax
...
... ;; versus code with comment:
fmul %st, %st(1)
fxch %st(1)
fstpl (%esp)
fxch %st(1)
fstpl 48(%esp)
fstpl 32(%esp)
fstpl 64(%esp)
call floor
movl $.LC3, (%esp)
fnstcw 102(%esp)
movzwl 102(%esp), %eax
...
I haven't figured out the reason for the difference.

Not duplicated on my system, Win7 running Cygwin with gcc 4.3.4. Both with and without the if statement, the value of j is set to zero, not 20000.
My only suggestion would be to use gcc -S to get a look at the assembler output. That should hopefully tell you what's going wrong.
Specifically, generate the assembler output to two separate files, one each for the working and non-working variant, then vgrep them (eyeball them side by side) to try and ascertain the difference.
This is a serious failure in your environment, by the way. With m being 10000 and j coming out as 20000, x - floor(x) must have been equal to 2. I can't for the life of me think of any real number where that would be the case :-)

I think there are two reasons why that line could have an effect:
Without that line, the values of all of these variables can be (and, IMHO, most likely are) determined at compile-time; with that line, the computations have to be performed at run-time. But obviously, the compiler's precomputed values are supposed to be the same as values computed at run-time, and I'm inclined to discount this as the actual reason for the different observed behavior. (It would certainly show up as a huge difference in the assembler output, though!)
On many machines, floating-point arithmetic is performed using more bits in intermediate values than can actually be stored in a double-precision floating-point number. Your second version, by creating two different code-paths to set x, basically restricts x to what can be stored in a double-precision floating-point number, whereas your first version can allow the initially-calculated value for x to still be available as an intermediate value, with extra bits, when computing subsequent values. (This could be the case whether all of these values are computed at compile-time or at run-time.)
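A quick way to check whether a given build evaluates intermediates with extra bits is C99's FLT_EVAL_METHOD macro from <float.h>; a minimal sketch, assuming a C99 compiler:

#include <float.h>
#include <stdio.h>

int main(void)
{
    /* 0: operations evaluate in the type's own precision (typical with SSE math)
       2: float/double operations evaluate in long double (typical 32-bit x87) */
    printf("FLT_EVAL_METHOD = %d\n", (int)FLT_EVAL_METHOD);
    return 0;
}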

The reason that uncommenting that line might affect the result is that without that line, the compiler can see that r and jroot cannot change after initialisation, so it can calculate x at compile-time rather than runtime. When the line is uncommented, r might change, so the calculation of x must be deferred to runtime, which can result in it being done with a different precision (particularly if 387 floating point math is being used).
You can try using -mfpmath=sse -march=native to use the SSE unit for floating point calculations, which doesn't exhibit excess precision; or you can try using the -ffloat-store switch.
Your subtraction x - floor(x) exhibits catastrophic cancellation - this is the root cause of the problem and something to be avoided ;).
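As a source-level alternative to those switches, a store/reload through a volatile object discards x87 excess precision at a single point. A minimal sketch; force_double is a hypothetical helper, not part of the original program:

/* Hypothetical helper: the volatile store/reload forces rounding to a
   true 64-bit double, similar in effect to -ffloat-store at one site. */
static double force_double(double v)
{
    volatile double tmp = v;
    return tmp;
}
/* in the original program: x = force_double(r * r * jroot); */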

EDITED:
I also do not see a difference when I compile your code on my computer using -O0, -O1, -O2 or -O3.
AMD Phenom quad-core, 64-bit.
gcc (Ubuntu 4.4.3-4ubuntu5) 4.4.3
I also tried clang (LLVM) from release 3.0, with and without the if: same results.
I agree that the compiler can pre-compute everything without that if line; you would definitely see that in the assembly output, though.
Floating point and C can be nasty; there is a lot to know to get it to really work. Forcing the int-to-double conversions is good for accuracy, but int-to/from-float conversion is where FPUs tend to have their bugs (I think I saw that mentioned with TestFloat or somewhere like that). Also note that the C library built into the compiler and the C library compiled into or used by your program can and will differ, which can cause problems even when the FPU is good. You might try running TestFloat on your system to see if your FPU is sound. Between the famous Pentium floating-point bug and the Pentium IV days, most processors had floating-point bugs: the Pentium III I had was solid, but the Pentium IV I had would fail. I rarely use floating point anymore, so I don't bother to test my systems.
Playing with optimization did change your results according to your edit, so this is most likely a gcc problem, or a combination of your code and gcc (and not a hardware FPU problem). Try a different version of gcc on the same computer, 4.4.x instead of 4.5.x for example.

Related

What is happening here in pow function?

I have seen various answers here that depict strange behavior of the pow function in C.
But I have something different to ask.
In the code below I have initialized int x = pow(10,2) and int y = pow(10,n) (with int n = 2).
In the first case, printing the result shows 100; in the other case it comes out as 99.
I know that pow returns a double and it gets truncated on storing in an int, but I want to ask why the two outputs come out different.
CODE1
#include <stdio.h>
#include <math.h>

int main()
{
    int n = 2;
    int x;
    int y;
    x = pow(10,2); // Printing gives output 100
    y = pow(10,n); // Printing gives output 99
    printf("%d %d", x, y);
}
Output : 100 99
Why is the output coming out different?
My gcc version is 4.9.2
Update :
Code 2
#include <stdio.h>
#include <math.h>

int main()
{
    int n = 2;
    int x;
    int y;
    x = pow(10,2); // Printing gives output 100
    y = pow(10,n); // Printing gives output 99
    double k = pow(10,2);
    double l = pow(10,n);
    printf("%d %d\n", x, y);
    printf("%f %f\n", k, l);
}
Output : 100 99
100.000000 100.000000
Update 2: Assembly instructions for Code 1
Generated with GCC 4.9.2 using gcc -S -masm=intel:
.LC1:
.ascii "%d %d\0"
.text
.globl _main
.def _main; .scl 2; .type 32; .endef
_main:
push ebp
mov ebp, esp
and esp, -16
sub esp, 48
call ___main
mov DWORD PTR [esp+44], 2
mov DWORD PTR [esp+40], 100 //Concerned Line
fild DWORD PTR [esp+44]
fstp QWORD PTR [esp+8]
fld QWORD PTR LC0
fstp QWORD PTR [esp]
call _pow //Concerned Line
fnstcw WORD PTR [esp+30]
movzx eax, WORD PTR [esp+30]
mov ah, 12
mov WORD PTR [esp+28], ax
fldcw WORD PTR [esp+28]
fistp DWORD PTR [esp+36]
fldcw WORD PTR [esp+30]
mov eax, DWORD PTR [esp+36]
mov DWORD PTR [esp+8], eax
mov eax, DWORD PTR [esp+40]
mov DWORD PTR [esp+4], eax
mov DWORD PTR [esp], OFFSET FLAT:LC1
call _printf
leave
ret
.section .rdata,"dr"
.align 8
LC0:
.long 0
.long 1076101120
.ident "GCC: (tdm-1) 4.9.2"
.def _pow; .scl 2; .type 32; .endef
.def _printf; .scl 2; .type 32; .endef
I know that pow returns a double and it gets truncated on storing in an int, but I want to ask why the two outputs come out different.
You must first, if you haven't already, divest yourself of the idea that floating-point numbers are in any way sensible or predictable. double only approximates real numbers and almost anything you do with a double is likely to be an approximation to the actual result.
That said, as you have realized, pow(10, n) resulted in a value like 99.99999999999997, which is an approximation accurate to 15 significant figures. And then you told it to truncate to the largest integer less than that, so it threw away most of those.
(Aside: there is rarely a good reason to convert a double to an int. Usually you should either format it for display with something like sprintf("%.0f", x), which does rounding correctly, or use the floor function, which can handle floating-point numbers that may be out of the range of an int. If neither of those suit your purpose, like in currency or date calculations, possibly you should not be using floating point numbers at all.)
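For illustration, a small hedged example of those conversion choices; the literal is chosen so the nearest double is just below 100:

#include <math.h>
#include <stdio.h>

int main(void)
{
    double v = 99.99999999999999;          /* nearest double is just below 100 */
    printf("cast:   %d\n", (int)v);        /* 99: the cast drops the fraction  */
    printf("printf: %.0f\n", v);           /* "100": %f rounds to nearest      */
    printf("floor:  %.0f\n", floor(v));    /* "99": largest integer <= v       */
    return 0;
}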
There are two weird things going on here. First, why is pow(10, n) inaccurate? 10, 2, and 100 are all precisely representable as double. The best answer I can offer is that the C standard library you are using has a bug. (The compiler and the standard library, which I assume are gcc and glibc, are developed on different release schedules and by different teams. If pow is returning inaccurate results, that is probably a bug in glibc, not gcc.)
In the comments on your question, amdn found a glibc bug to do with FP rounding that might be related and another Q&A that goes into more detail about why this happens and how it's not a violation of the C standard. chux's answer also addresses this. (C doesn't require implementation of IEEE 754, but even if it did, pow isn't required to use correct rounding.) I will still call this a glibc bug, because it's an undesirable property.
(It's also conceivable, though unlikely, that your processor's FPU is wrong.)
Second, why is pow(10, n) different from pow(10, 2)? This one is far easier. gcc optimizes away function calls for which the result can be calculated at compile time, so pow(10, 2) is almost certainly being optimized to 100.0. If you look at the generated assembly code, you will find only one call to pow.
The GCC manual, section 6.59 describes which standard library functions may be treated in this way (follow the link for the full list):
The remaining functions are provided for optimization purposes.
With the exception of built-ins that have library equivalents such as the standard C library functions discussed below, or that expand to library calls, GCC built-in functions are always expanded inline and thus do not have corresponding entry points and their address cannot be obtained. Attempting to use them in an expression other than a function call results in a compile-time error.
[...]
The ISO C90 functions abort, abs, acos, asin, atan2, atan, calloc, ceil, cosh, cos, exit, exp, fabs, floor, fmod, fprintf, fputs, frexp, fscanf, isalnum, isalpha, iscntrl, isdigit, isgraph, islower, isprint, ispunct, isspace, isupper, isxdigit, tolower, toupper, labs, ldexp, log10, log, malloc, memchr, memcmp, memcpy, memset, modf, pow, printf, putchar, puts, scanf, sinh, sin, snprintf, sprintf, sqrt, sscanf, strcat, strchr, strcmp, strcpy, strcspn, strlen, strncat, strncmp, strncpy, strpbrk, strrchr, strspn, strstr, tanh, tan, vfprintf, vprintf and vsprintf are all recognized as built-in functions unless -fno-builtin is specified (or -fno-builtin-function is specified for an individual function).
So it would seem you can disable this behavior with -fno-builtin-pow.
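For instance, a hedged sketch; -fno-builtin-pow is the per-function form of GCC's -fno-builtin, and compiling with it should make the constant-folded call reappear as a real call to the library pow:

/* pow_builtin_test.c
 *   gcc -O2 pow_builtin_test.c -lm                    # pow(10,2) folded at compile time
 *   gcc -O2 -fno-builtin-pow pow_builtin_test.c -lm   # always calls libm's pow
 */
#include <math.h>
#include <stdio.h>

int main(void)
{
    printf("%d\n", (int)pow(10, 2));
    return 0;
}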
Why is the output coming out different? (in the updated appended code)
We do not know that the values are really that different.
When comparing the textual output of an int and a double, be sure to print the double with sufficient precision to see whether it is exactly 100.000000 or just near 100.000000, or print it in hex to remove all doubt.
printf("%d %d\n" , x , y);
// printf("%f %f\n" , k , l);
// Is it the FP number just less than 100?
printf("%.17e %.17e\n" , k , l); // maybe 9.99999999999999858e+01
printf("%a %a\n" , k , l); // maybe 0x1.8ffffffffffff0000p+6
Why is the output coming out different? (in the original code)
C does not specify the accuracy of most <math.h> functions. The following are all compliant results.
// Higher quality functions return 100.0
pow(10,2) --> 100.0
// Lower quality and/or faster ones may return nearby results
pow(10,2) --> 100.0000000000000142...
pow(10,2) --> 99.9999999999999857...
Assigning a floating-point (FP) number to an int simply drops the fraction, regardless of how close the fraction is to 1.0.
When converting FP to an integer, it is better to control the conversion and round, to cope with minor computational differences.
// long int lround(double x);
long i = lround(pow(10.0,2.0));
You're not the first to find this. Here's a discussion from 2013:
pow() cast to integer, unexpected result
I'm speculating that the assembly code produced by the tcc guys is causing the second value to be rounded down after calculating a result that is REALLY close to 100.
As mikijov said in that historic post, it looks like the bug has been fixed.
As others have mentioned, Code 2 returns 99 due to floating point truncation. The reason why Code 1 returns a different and correct answer is because of a libc optimization.
When the power is a small positive integer, it is more efficient to perform the operation as repeated multiplication. The simpler path removes roundoff. Since this is inlined you don't see function calls being made.
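A hedged sketch of that repeated-multiplication idea, done explicitly in integer arithmetic so no rounding can creep in; ipow is a hypothetical helper, not a standard function:

#include <stdint.h>

/* Hypothetical helper: exact powers by repeated multiplication. No floating
   point is involved, so 10^2 is always exactly 100; caller must avoid overflow. */
static int64_t ipow(int64_t base, unsigned exp)
{
    int64_t result = 1;
    while (exp--)
        result *= base;
    return result;
}
/* usage: int y = (int)ipow(10, n);   // exactly 100 when n == 2 */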
You've fooled it into thinking that the inputs are real numbers, so it gives an approximate answer which happens to be slightly under 100, e.g. 99.999999, and that is then truncated to 99.

AVR GCC, assembly C stub functions, eor and the required constant value

I have this code:
uint16_t swap_bytes(uint16_t x)
{
    asm volatile(
        "eor, %A0, %B0" "\n\t"
        "eor, %B0, %A0" "\n\t"
        "eor, %A0, %B0" "\n\t"
        : "=r" (x)
        : "0" (x)
    );
    return x;
}
Which translates (by avr-gcc version 4.8.1 with -std=gnu99 -save-temps) to:
.global swap_bytes
.type swap_bytes, #function
swap_bytes:
/* prologue: function */
/* frame size = 0 */
/* stack size = 0 */
.L__stack_usage = 0
/* #APP */
; 43 "..\lib\own\ownlib.c" 1
eor, r24, r25
eor, r25, r24
eor, r24, r25
; 0 "" 2
/* #NOAPP */
ret
.size swap_bytes, .-swap_bytes
But then the compiler complains like this:
|65|Error: constant value required|
|65|Error: garbage at end of line|
|66|Error: constant value required|
|66|Error: garbage at end of line|
|67|Error: constant value required|
|67|Error: garbage at end of line|
||=== Build failed: 6 error(s), 0 warning(s) (0 minute(s), 0 second(s)) ===|
The mentioned lines are the ones with the eor commands. Why does the compiler have problems with that? The registers are even upper ones (>= r16), where nearly all operations are possible. constant value required sounds to me like it expects a literal... I don't get it.
Just to clarify for future googlers:
eor, r24, r25
has an extra comma after the eor. This should be written as:
eor r24, r25
I would also encourage you (again) to consider using gcc's __builtin_bswap16. In case you are not familiar with the gcc 'builtin' functions, they are functions that are built into the compiler, and (despite looking like functions) are typically inlined. They have been written and optimized by people who understand all the ins and outs of the various processors and can take into account things you may not have considered.
I understand the desire to keep code as small as possible. And I accept that it is possible that (somehow) this builtin on your specific processor is producing sub-optimal code (I assume you have checked?). On the other hand, it may produce exactly the same code. Or it may use some even more clever trick to do this. Or it might interleave instructions from the surrounding code to take advantage of pipelining (or some other avr-specific thing that I have never heard of because I don't speak 'avr').
What's more, consider this code:
int main()
{
    return __builtin_bswap16(12345);
}
Your code always takes 3 instructions to process a swap. However with builtins, the compiler can recognize that the arg is constant and compute the value at compile time instead of at run time. Hard to be more efficient than that.
I could also point out the benefits of "easier to support." Writing inline asm is HARD to do correctly. And future maintainers hate to touch it cuz they're never quite sure how it works. And of course, the builtin is going to be more cross-platform portable.
Still not convinced? My last pitch: Even after you fix the commas, your inline asm code still isn't quite right. Consider this code:
int main(int argc, char *argv[])
{
    return swap_bytes(argc) + swap_bytes(argc);
}
Because of the way you have written swap_bytes (i.e. using volatile), gcc must compute the value twice (see the definition of volatile). Had you omitted volatile (or had you used the builtin, which gets this right), it would have realized that argc doesn't change and re-used the output from the first call. Did I mention that correctly writing inline asm is HARD?
I don't know your code, constraints, level of expertise or requirements. Maybe your solution really is the best. The most I can do is to encourage you to think long and hard before using inline asm in production code.
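For completeness, a sketch of the original function with both issues addressed: the stray commas removed and volatile dropped, so GCC is free to reuse the result:

#include <stdint.h>

uint16_t swap_bytes(uint16_t x)
{
    /* plain xor-swap of the two bytes of the register pair */
    asm("eor %A0, %B0" "\n\t"
        "eor %B0, %A0" "\n\t"
        "eor %A0, %B0"
        : "=r" (x)
        : "0" (x));
    return x;
}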

How to merge a scalar into a vector without the compiler wasting an instruction zeroing upper elements? Design limitation in Intel's intrinsics?

I don't have a particular use-case in mind; I'm asking if this is really a design flaw / limitation in Intel's intrinsics or if I'm just missing something.
If you want to combine a scalar float with an existing vector, there doesn't seem to be a way to do it without high-element-zeroing or broadcasting the scalar into a vector, using Intel intrinsics. I haven't investigated GNU C native vector extensions and the associated builtins.
This wouldn't be too bad if the extra intrinsic optimized away, but it doesn't with gcc (5.4 or 6.2). There's also no nice way to use pmovzx or insertps as loads, for the related reason that their intrinsics only take vector args. (And gcc doesn't fold a scalar->vector load into the asm instruction.)
__m128 replace_lower_two_elements(__m128 v, float x) {
    __m128 xv = _mm_set_ss(x);       // WANTED: something else for this step; some compilers actually compile this to a separate insn
    return _mm_shuffle_ps(v, xv, 0); // lower 2 elements are both x, and the garbage is gone
}
gcc 5.3 -march=nehalem -O3 output, to enable SSE4.1 and tune for that Intel CPU: (It's even worse without SSE4.1; multiple instructions to zero the upper elements).
insertps xmm1, xmm1, 0xe # pointless zeroing of upper elements. shufps only reads the low element of xmm1
shufps xmm0, xmm1, 0 # The function *should* just compile to this.
ret
TL:DR: the rest of this question is just asking if you can actually do this efficiently, and if not why not.
clang's shuffle-optimizer gets this right, and doesn't waste instructions on zeroing high elements (_mm_set_ss(x)), or duplicating the scalar into them (_mm_set1_ps(x)). Instead of writing something the compiler has to optimize away, shouldn't there be a way to write it "efficiently" in C in the first place? Even very recent gcc doesn't optimize it away, so this is a real (but minor) problem.
This would be possible if there was a scalar->128b equivalent of __m256 _mm256_castps128_ps256 (__m128 a). i.e. produce a __m128 with undefined garbage in upper elements, and the float in the low element, compiling to zero asm instructions if the scalar float/double was already in an xmm register.
None of the following intrinsics exist, but they should.
a scalar->__m128 equivalent of _mm256_castps128_ps256 as described above. The most general solution for the scalar-already-in-register case.
__m128 _mm_move_ss_scalar (__m128 a, float s): replace low element of vector a with scalar s. This isn't actually necessary if there's a general-purpose scalar->__m128 (previous bullet point). (The reg-reg form of movss merges, unlike the load form which zeros, and unlike movd which zeros upper elements in both cases. To copy a register holding a scalar float without false dependencies, use movaps).
__m128i _mm_loadzxbd (const uint8_t *four_bytes) and other sizes of PMOVZX / PMOVSX: AFAICT, there's no good safe way to use the PMOVZX intrinsics as a load, because the inconvenient safe way doesn't optimize away with gcc.
__m128 _mm_insertload_ps (__m128 a, float *s, const int imm8). INSERTPS behaves differently as a load: the upper 2 bits of the imm8 are ignored, and it always takes the scalar at the effective address (instead of an element from a vector in memory). This lets it work with addresses that aren't 16B-aligned, and work without faulting even if the float is right before an unmapped page.
Like with PMOVZX, gcc fails to fold an upper-element-zeroing _mm_load_ss() into a memory operand for INSERTPS. (Note that if the upper 2 bits of the imm8 aren't both zero, then _mm_insert_ps(xmm0, _mm_load_ss(), imm8) can compile to insertps xmm0,xmm0,foo, with a different imm8 that zeros elements in vec as-if the src element was actually a zero produced by MOVSS from memory. Clang actually uses XORPS/BLENDPS in that case)
Are there any viable workarounds to emulate any of those that are both safe (don't break at -O0 by e.g. loading 16B that might touch the next page and segfault), and efficient (no wasted instructions at -O3 with current gcc and clang at least, preferably also other major compilers)? Preferably also in a readable way, but if necessary it could be put behind an inline wrapper function like __m128 float_to_vec(float a){ something(a); }.
Is there any good reason for Intel not to introduce intrinsics like that? They could have added a float->__m128 with undefined upper elements at the same time as adding _mm256_castps128_ps256. Is this a matter of compiler internals making it hard to implement? Perhaps specifically ICC internals?
The major calling conventions on x86-64 (SysV or MS __vectorcall) take the first FP arg in xmm0 and return scalar FP args in xmm0, with upper elements undefined. (See the x86 tag wiki for ABI docs). This means it's not uncommon for the compiler to have a scalar float/double in a register with unknown upper elements. This will be rare in a vectorized inner loop, so I think avoiding these useless instructions will mostly just save a bit of code size.
The pmovzx case is more serious: that is something you might use in an inner loop (e.g. for a LUT of VPERMD shuffle masks, saving a factor of 4 in cache footprint vs. storing each index padded to 32 bits in memory).
The pmovzx-as-a-load issue has been bothering me for a while now, and the original version of this question got me thinking about the related issue of using a scalar float in an xmm register. There are probably more use-cases for pmovzx as a load than for scalar->__m128.
It's doable with GNU C inline asm, but this is ugly and defeats many optimizations, including constant-propagation (https://gcc.gnu.org/wiki/DontUseInlineAsm). This will not be the accepted answer. I'm adding this as an answer instead of part of the question so the question stays short and isn't huge.
// don't use this: defeating optimizations is probably worse than an extra instruction
#ifdef __GNUC__
__m128 float_to_vec_inlineasm(float x) {
    __m128 retval;
    asm ("" : "=x"(retval) : "0"(x)); // matching constraint: provide x in the same xmm reg as retval
    return retval;
}
#endif
This does compile to a single ret, as desired, and will inline to let you shufps a scalar into a vector:
gcc5.3
float_to_vec_and_shuffle_asm(float __vector(4), float):
shufps xmm0, xmm1, 0 # tmp93, xv,
ret
See this code on the Godbolt compiler explorer.
This is obviously trivial in pure assembly language, where you don't have to fight with a compiler to get it not to emit instructions you don't want or need.
I haven't found any real way to write a __m128 float_to_vec(float a){ something(a); } that compiles to just a ret instruction. An attempt for double using _mm_undefined_pd() and _mm_move_sd() actually makes worse code with gcc (see the Godbolt link above). None of the existing float->__m128 intrinsics help.
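For reference, the double attempt mentioned above might be written as in this sketch; it compiles, but as noted gcc generates worse code for it than for _mm_set_sd alone:

#include <immintrin.h>

/* Sketch: the hoped-for result is x in the low element with don't-care garbage
   above it, ideally costing zero instructions when x is already in an xmm reg. */
__m128d double_to_vec(double x)
{
    return _mm_move_sd(_mm_undefined_pd(), _mm_set_sd(x));
}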
Off-topic: actual _mm_set_ss() code-gen strategies: When you do write code that has to zero upper elements, compilers pick from an interesting range of strategies. Some good, some weird. The strategies also differ between double and float on the same compiler (gcc or clang), as you can see on the Godbolt link above.
One example: __m128 float_to_vec(float x){ return _mm_set_ss(x); } compiles to:
# gcc5.3 -march=core2
movd eax, xmm0 # movd xmm0,xmm0 would work; IDK why gcc doesn't do that
movd xmm0, eax
ret
# gcc5.3 -march=nehalem
insertps xmm0, xmm0, 0xe
ret
# clang3.8 -march=nehalem
xorps xmm1, xmm1
blendps xmm0, xmm1, 14 # xmm0 = xmm0[0],xmm1[1,2,3]
ret

Massive fprintf speed difference without "-std=c99"

I had been struggling for weeks with a poor-performing translator I had written.
On the following simple benchmark
#include <stdio.h>

int main()
{
    int x;
    char buf[2048];
    FILE *test = fopen("test.out", "wb");
    setvbuf(test, buf, _IOFBF, sizeof buf);
    for (x = 0; x < 1024*1024; x++)
        fprintf(test, "%04d", x);
    fclose(test);
    return 0;
}
we see the following result
bash-3.1$ gcc -O2 -static test.c -o test
bash-3.1$ time ./test
real 0m0.334s
user 0m0.015s
sys 0m0.016s
As you can see, the moment the "-std=c99" flag is added in, performance comes crashing down:
bash-3.1$ gcc -O2 -static -std=c99 test.c -o test
bash-3.1$ time ./test
real 0m2.477s
user 0m0.015s
sys 0m0.000s
The compiler I'm using is gcc 4.6.2 mingw32.
The generated file is about 12M, so this is a difference of about 21 MB/s between the two.
Running diff shows that the generated files are identical.
I assumed this has something to do with file locking in fprintf, of which the program makes heavy use, but I haven't been able to find a way to switch that off in the C99 version.
I tried flockfile on the stream I use at the beginning of the program, with a corresponding funlockfile at the end, but was greeted with compiler errors about implicit declarations, and linker errors claiming undefined references to those functions.
Could there be another explanation for this problem, and more importantly, is there any way to use C99 on windows without paying such an enormous performance price?
Edit:
After looking at the code generated by these options, it looks like in the slow versions, mingw sticks in the following:
_fprintf:
LFB0:
.cfi_startproc
subl $28, %esp
.cfi_def_cfa_offset 32
leal 40(%esp), %eax
movl %eax, 8(%esp)
movl 36(%esp), %eax
movl %eax, 4(%esp)
movl 32(%esp), %eax
movl %eax, (%esp)
call ___mingw_vfprintf
addl $28, %esp
.cfi_def_cfa_offset 4
ret
.cfi_endproc
In the fast version, this simply does not exist; otherwise, both are exactly the same. __mingw_vfprintf seems to be the slowpoke here, but I have no idea what behavior it needs to emulate that makes it so slow.
After some digging in the source code, I have found why the MinGW function is so terribly slow:
At the beginning of a [v,f,s]printf in MinGW, there is some innocent-looking initialization code:
__pformat_t stream = {
    dest,                                       /* output goes to here        */
    flags &= PFORMAT_TO_FILE | PFORMAT_NOLIMIT, /* only these valid initially */
    PFORMAT_IGNORE,                             /* no field width yet         */
    PFORMAT_IGNORE,                             /* nor any precision spec     */
    PFORMAT_RPINIT,                             /* radix point uninitialised  */
    (wchar_t)(0),                               /* leave it unspecified       */
    0,                                          /* zero output char count     */
    max,                                        /* establish output limit     */
    PFORMAT_MINEXP                              /* exponent chars preferred   */
};
However, PFORMAT_MINEXP is not what it appears to be:
#ifdef _WIN32
# define PFORMAT_MINEXP __pformat_exponent_digits()
# ifndef _TWO_DIGIT_EXPONENT
#  define _get_output_format() 0
#  define _TWO_DIGIT_EXPONENT 1
# endif

static __inline__ __attribute__((__always_inline__))
int __pformat_exponent_digits( void )
{
    char *exponent_digits = getenv( "PRINTF_EXPONENT_DIGITS" );
    return ((exponent_digits != NULL) && ((unsigned)(*exponent_digits - '0') < 3))
           || (_get_output_format() & _TWO_DIGIT_EXPONENT)
        ? 2
        : 3
        ;
}
This winds up getting called every time I want to print, and getenv on Windows must not be very quick. Replacing that define with a 2 brings the runtime back to where it should be.
So, the answer comes down to this: when using -std=c99 or any ANSI-compliant mode, MinGW swaps out the Microsoft CRT's printf family for its own. Normally this wouldn't be an issue, but the MinGW lib had a bug which slowed its formatting functions down far beyond anything imaginable.
Using -std=c99 disables all GNU extensions.
With GNU extensions and optimization, a call like fprintf(test, "B") is probably replaced by fputc('B', test).
Note this answer is obsolete, see https://stackoverflow.com/a/13973562/611560 and https://stackoverflow.com/a/13973933/611560
After some consideration of your assembler output, it looks like the slow version is using the *printf() implementation of MinGW, undoubtedly based on the GCC one, while the fast version is using the Microsoft implementation from msvcrt.dll.
Now, the MS one is notable for lacking a lot of features that the GCC one does implement. Some of these are GNU extensions, but others are needed for C99 conformance. And since you are using -std=c99, you are requesting that conformance.
But why so slow? Well, one factor is simplicity: the MS version is far simpler, so it is expected to run faster, even in the trivial cases. Another factor is that you are running under Windows, so it is expected that the MS version is more efficient than one ported from the Unix world.
Does it explain a factor of x10? Probably not...
Another thing you can try:
Replace fprintf() with sprintf(), printing into a memory buffer without touching the file at all. Then you can try doing fwrite() without printfing. That way you can guess if the loss is in the formatting of the data or in the writing to the FILE.
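A hedged sketch of that experiment, separating the formatting cost (sprintf into a buffer) from the writing cost (fwrite), so each can be timed on its own:

#include <stdio.h>

int main(void)
{
    char line[32];
    char buf[2048];
    FILE *test = fopen("test.out", "wb");
    if (!test) return 1;
    setvbuf(test, buf, _IOFBF, sizeof buf);
    for (int x = 0; x < 1024 * 1024; x++) {
        int len = sprintf(line, "%04d", x);   /* formatting only */
        fwrite(line, 1, (size_t)len, test);   /* writing only    */
    }
    fclose(test);
    return 0;
}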
Since MinGW32 3.15, compliant printf functions are available to use instead of those found in Microsoft C runtime (CRT).
The new printf functions are used when compiling in strict ANSI, POSIX and/or C99 modes.
For more information see the mingw32 changelog
You can use __msvcrt_fprintf() to use the fast (non compliant) function.

Can I compute pow(10,x) at compile-time in c?

Is it possible to compute pow(10,x) at compile time?
I've got a processor without floating point support and slow integer division. I'm trying to perform as many calculations as possible at compile time. I can dramatically speed up one particular function if I pass both x and C/pow(10,x) as arguments (x and C are always constant integers, but they are different constants for each call). I'm wondering if I can make these function calls less error prone by introducing a macro which does the 1/pow(10,x) automatically, instead of forcing the programmer to calculate it?
Is there a pre-processor trick? Can I force the compiler optimize out the library call?
There are very few values possible before you overflow int (or even long). For clarity's sake, make it a table (see the sketch after this answer)!
edit: If you are using floats (looks like you are), then no it's not going to be possible to call the pow() function at compile time without actually writing code that runs in the make process and outputs the values to a file (such as a header file) which is then compiled.
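A minimal sketch of the table idea for integer results; only ten powers of 10 fit in a 32-bit signed int:

#include <stdint.h>

/* Powers of 10 exactly representable in a 32-bit signed int: 10^0 .. 10^9. */
static const int32_t POW10_INT[10] = {
    1, 10, 100, 1000, 10000, 100000,
    1000000, 10000000, 100000000, 1000000000
};
/* usage: int32_t p = POW10_INT[x];   // requires 0 <= x <= 9 */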
GCC will do this at a sufficiently high optimization level (-O1 does it for me). For example:
#include <math.h>

int test() {
    double x = pow(10, 4);
    return (int)x;
}
Compiles at -O1 -m32 to:
.file "test.c"
.text
.globl test
.type test, #function
test:
pushl %ebp
movl %esp, %ebp
movl $10000, %eax
popl %ebp
ret
.size test, .-test
.ident "GCC: (Ubuntu 4.3.3-5ubuntu4) 4.3.3"
.section .note.GNU-stack,"",#progbits
This works without the cast as well - of course, you do get a floating-point load instruction in there, as the Linux ABI passes floating point return values in FPU registers.
You can do it with Boost.Preprocessor:
http://www.boost.org/doc/libs/1_39_0/libs/preprocessor/doc/index.html
Code:
#include <boost/preprocessor/repeat.hpp>
#define _TIMES_10(z, n, data) * 10
#define POW_10(n) (1 BOOST_PP_REPEAT(n, _TIMES_10, _))
int test[4] = {POW_10(0), POW_10(1), POW_10(2), POW_10(3)};
Actually, by exploiting the C preprocessor, you can get it to compute C pow(10, x) for any real C and integral x. Observe that, as #quinmars noted, C allows you to use scientific syntax to express numerical constants:
#define myexp 1.602E-19 // == 1.602 * pow(10, -19)
to be used for constants. With this in mind, and a bit of cleverness, we can construct a preprocessor macro that takes C and x and combines them into an exponentiation token:
#define EXP2(a, b) a ## b
#define EXP(a, b) EXP2(a ## e,b)
#define CONSTPOW(C,x) EXP(C, x)
This can now be used as a constant numerical value:
const int myint = CONSTPOW(3, 4); // == 30000
const double myfloat = CONSTPOW(M_PI, -2); // == 0.03141592653
You can use scientific notation for floating point values, which is part of the C language. It looks like this:
e = 1.602E-19 // == 1.602 * pow(10, -19)
The number before the E (the E may be capital or small: 1.602e-19) is the fraction part, whereas the (signed) digit sequence after the E is the exponent part. By default the number is of type double, but you can attach a floating-point suffix (f, F, l or L) if you need a float or a long double.
I would not recommend packing this semantic into a macro:
It will not work for variables, floating point values, etc.
The scientific notation is more readable.
Actually, you have M4, which is a preprocessor far more powerful than GCC's. A main difference between the two is that GCC's is not recursive whereas M4 is. That makes things like compile-time arithmetic possible (and much more!). The code sample below is what you would like to do, isn't it? I made it bulky in a one-file source, but I usually put M4's macro definitions in separate files and tune my Makefile rules. This way, your C source code is kept free of the ugly, intrusive M4 definitions I've placed directly in it here.
$ cat foo.c
define(M4_POW_AUX, `ifelse($2, 1, $1, `eval($1 * M4_POW_AUX($1, decr($2)))')')dnl
define(M4_POW, `ifelse($2, 0, 1, `M4_POW_AUX($1, $2)')')dnl
#include <stdio.h>
int main(void)
{
    printf("2^0 = %d\n", M4_POW(2, 0));
    printf("2^1 = %d\n", M4_POW(2, 1));
    printf("2^4 = %d\n", M4_POW(2, 4));
    return 0;
}
The command line to compile this code sample uses the ability of GCC and M4 to read from the standard input.
$ cat foo.c | m4 - | gcc -x c -o m4_pow -
$ ./m4_pow
2^0 = 1
2^1 = 2
2^4 = 16
Hope this helps!
If you just need to use the value at compile time, use the scientific notation like 1e2 for pow(10, 2)
If you want to populate the values at compile time and then use them later at runtime then simply use a lookup table because there are only 23 different powers of 10 that are exactly representable in double precision
double POW10[] = {1., 1e1, 1e2, 1e3, 1e4, 1e5, 1e6, 1e7, 1e8, 1e9, 1e10,
1e11, 1e12, 1e13, 1e14, 1e15, 1e16, 1e17, 1e18, 1e19, 1e20, 1e21, 1e22};
You can get larger powers of 10 at runtime from the above lookup table to quickly get the result without multiplying by 10 again and again, but the result is then just a value close to a power of 10, as when you write 1eX with X > 22
double pow10(int x)
{
    if (x > 22)
        return POW10[22] * pow10(x - 22);
    else if (x >= 0)
        return POW10[x];
    else
        return 1 / pow10(-x);
}
If negative exponents are not needed, the final branch can be removed.
You can also reduce the lookup table size further if memory is a constraint. For example by storing only even powers of 10 and multiply by 10 when the exponent is odd, the table size is now only a half.
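A sketch of that halving trick, assuming only non-negative exponents up to 22 are needed; odd powers up to 10^21 stay exact because each is an exact even entry times 10:

/* Even powers of 10 only: 10^0, 10^2, ..., 10^22 (12 entries, half the table). */
static const double POW10_EVEN[12] = {
    1., 1e2, 1e4, 1e6, 1e8, 1e10,
    1e12, 1e14, 1e16, 1e18, 1e20, 1e22
};

double pow10_halved(int x)   /* assumes 0 <= x <= 22 */
{
    double r = POW10_EVEN[x / 2];
    return (x & 1) ? r * 10.0 : r;
}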
Recent versions of GCC (around 4.3) added the ability to use GMP and MPFR to do some compile-time optimizations by evaluating more complex functions that are constant. That approach leaves your code simple and portable, and trusts the compiler to do the heavy lifting.
Of course, there are limits to what it can do. Here's a link to the description in the changelog, which includes a list of functions that are supported by this. 'pow' is one of them.
Unfortunately, you can't use the preprocessor to precalculate library calls. If x is integral you could write your own function, but if it's a floating-point type I don't see any good way to do this.
bdonlan's reply is spot on, but keep in mind that you can perform nearly any optimization you choose on the build machine, provided you are willing to parse and analyze the code in your own custom preprocessor. It is a trivial task in most versions of Unix to override the implicit rules that call the compiler so that a custom step of your own runs before the code hits the compiler.
