I was having some problems with some sample code I was testing, since my abs function was not returning the correct result: abs(-2) was outputting -2 (this, by the way, is supposed to be the absolute value function, if that was unclear).
After getting a bit desperate, I eventually ended up with the following code:
#include <stdio.h>
unsigned int abs(int x) {
return 1;
}
int main() {
printf("%d\n", abs(-2));
return 0;
}
This does nothing useful, but it serves to show my problem: it outputs -2 when it is expected to output 1.
If I change the function name to something else (abs2, for example), the result is correct. Changing it to take two arguments instead of one also fixes the problem.
My obvious guess: a conflict with the standard abs function. But this still doesn't explain why the output is -2 (it should be 2 if the standard abs function were used). I tried checking the assembly output of both versions (with the function named abs and abs2).
Here's the diff of the two assembly files:
23,25c23,25
< .globl abs
< .type abs, @function
< abs:
---
> .globl abs2
> .type abs2, @function
> abs2:
54c54
< .size abs, .-abs
---
> .size abs2, .-abs2
71c71,74
< movl -4(%rbp), %edx
---
> movl -4(%rbp), %eax
> movl %eax, %edi
> call abs2
> movl %eax, %edx
From what I understand, the first version (where the function is named abs) simply discards the function call, and so uses the parameter x instead of abs(x).
So, to sum up: why does this happen, especially since I couldn't find a way to get any sort of warning or error about this?
Tested on Debian Squeeze with gcc 4.4.5, and also on gcc 4.1.2.
GCC is playing tricks on you due to the interplay of the following:
abs is a built-in function;
you're declaring abs to return unsigned int while the standard (and built-in) abs returns signed int.
Try compiling with gcc -fno-builtin; on my box, that gives the expected result of 1. Compiling without that option but with abs declared as returning signed int causes the program to print 2.
(The real solution to this problem is to not use library identifiers for your own functions. Also note that you shouldn't be printing an unsigned int with %d.)
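As a minimal sketch of that advice, assuming you want an unsigned result: the function below uses a name that is not a library identifier (my_abs is a made-up name, nothing from the standard library), and its result would be printed with %u rather than %d.

```c
/* my_abs is a hypothetical name, chosen only to avoid clashing with abs(). */
unsigned int my_abs(int x)
{
    /* Negate in unsigned arithmetic so that even INT_MIN cannot overflow. */
    return x < 0 ? 0u - (unsigned int)x : (unsigned int)x;
}
```

A call such as printf("%u\n", my_abs(-2)); then goes through your own function, since GCC has no built-in by that name.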
gcc optimizes the call to abs() to use its built-in abs(). So if you use the -fno-builtin option (or define your abs() as returning int), you'll notice you get the correct result. According to the GCC documentation (quoting):
GCC includes built-in versions of many of the functions in the
standard C library. The versions prefixed with __builtin_ will always
be treated as having the same meaning as the C library function even
if you specify the -fno-builtin option. (see C Dialect Options) Many
of these functions are only optimized in certain cases; if they are
not optimized in a particular case, a call to the library function
will be emitted.
Had you included stdlib.h, which declares abs() in the first place, you'd get an error at compile time.
Sounds a lot like this bug, which is from 2007 and noted as being fixed.
You should of course try to compile without GCC's intrinsics, i.e. pass -fno-builtin (or just -fno-builtin-abs to snipe out only abs()) when compiling.
In a C99 program, under the (theoretical) assumption that I'm not using variable-length arrays, and each of my automatic variables can only exist once at a time in the whole stack (by forbidding circular function calls and explicit recursion), if I sum up all the space they are consuming, could I declare that this is the maximal stack size that can ever happen?
A bit of context here: I told a friend that I wrote a program not using dynamic memory allocation ("malloc") and allocate all memory static (by modeling all my state variables in a struct, which I then declared global). He then told me that if I'm using automatic variables, I still make use of dynamic memory. I argued that my automatic variables are not state variables but control variables, so my program is still to be considered static. We then discussed that there has to be a way to make a statement about the absolute worst-case behaviour about my program, so I came up with the above question.
Bonus question: If the assumptions above hold, I could simply declare all automatic variables static and would end up with a "truly" static program?
Even if array sizes are constant, a C implementation could allocate arrays and even structures dynamically. I'm not aware of any that do (anyone?) and it would appear quite unhelpful. But the C Standard doesn't make such guarantees.
There is also (almost certainly) some further overhead in the stack frame (the data added to the stack on call and released on return).
You would need to declare all your functions as taking no parameters and returning void to ensure no program variables in the stack. Finally the 'return address' of where execution of a function is to continue after return is pushed onto the stack (at least logically).
So having moved all parameters, automatic variables and return values into your 'state' struct, there will still be something going onto the stack - probably.
I say probably because I'm aware of a (non-standard) embedded C compiler that forbids recursion and can determine the maximum size of the stack by examining the call tree of the whole program and identifying the call chain that reaches the peak size of the stack.
You could achieve this with a monstrous pile of goto statements (some conditional, where a function is logically called from two places) or by duplicating code.
It's often important in embedded code on devices with tiny memory to avoid any dynamic memory allocation and know that any 'stack-space' will never overflow.
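The kind of call-tree analysis mentioned above can be sketched in a few lines. This is only an illustration, not how any particular compiler does it: struct func_info, MAX_CALLEES and the per-function frame sizes are all made up, and a real tool would take the frame sizes (including return addresses and saved registers) from the compiler itself. It assumes the call graph has no cycles.

```c
#include <stddef.h>

#define MAX_CALLEES 4   /* arbitrary limit for this sketch */

/* Hypothetical description of one function in the program. */
struct func_info {
    size_t frame_bytes;         /* assumed stack cost of one call */
    int callees[MAX_CALLEES];   /* indices into the table, -1 ends the list */
};

/* Worst-case stack use when entering function `fn`: its own frame plus
 * the deepest chain among its callees. Terminates only if the call
 * graph is acyclic, i.e. there is no recursion in the analysed program. */
size_t worst_case_stack(const struct func_info *table, int fn)
{
    size_t deepest = 0;
    for (int i = 0; i < MAX_CALLEES && table[fn].callees[i] >= 0; i++) {
        size_t d = worst_case_stack(table, table[fn].callees[i]);
        if (d > deepest)
            deepest = d;
    }
    return table[fn].frame_bytes + deepest;
}
```

The helper itself recurses, which is fine here because it would run on the development host, not on the constrained target.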
I'm happy this is a theoretical discussion. What you suggest is a mad way to write code and would throw away most of the (ultimately limited) services C provides as infrastructure for procedural coding (pretty much the call stack).
Footnote: See the comment below about the 8-bit PIC architecture.
Bonus question: If the assumptions above hold, I could simply declare
all automatic variables static and would end up with a "truly" static
program?
No. This would change the function of the program. static variables are initialized only once.
Compare these two functions:
int canReturn0Or1(void)
{
static unsigned a=0;
a++;
if(a>1)
{
return 1;
}
return 0;
}
int willAlwaysReturn0(void)
{
unsigned a=0;
a++;
if(a>1)
{
return 1;
}
return 0;
}
In a C99 program, under the (theoretical) assumption that I'm not using variable-length arrays, and each of my automatic variables can only exist once at a time in the whole stack (by forbidding circular function calls and explicit recursion), if I sum up all the space they are consuming, could I declare that this is the maximal stack size that can ever happen?
No, because of function pointers..... Read n1570.
Consider the following code, where rand(3) is some pseudo random number generator (it could also be some input from a sensor) :
typedef int foosig(int);
int foo(int x) {
foosig* fptr = (x>rand())?&foo:NULL;
if (fptr)
return (*fptr)(x);
else
return x+rand();
}
An optimizing compiler (such as some recent GCC suitably invoked with enough optimizations) would make a tail-recursive call for (*fptr)(x). Some other compiler won't.
Depending on how you compile that code, it would use a bounded stack or could produce a stack overflow. With some ABI and calling conventions, both the argument and the result could go thru a processor register and won't consume any stack space.
Experiment with a recent GCC (e.g. on Linux/x86-64, some GCC 10 in 2020) invoked as gcc -O2 -fverbose-asm -S foo.c then look inside foo.s. Change the -O2 to a -O0.
Observe that the naive recursive factorial function could be compiled into some iterative machine code with a good enough C compiler and optimizer. In practice GCC 10 on Linux compiling the below code:
int fact(int n)
{
if (n<1) return 1;
else return n*fact(n-1);
}
as gcc -O3 -fverbose-asm tmp/fact.c -S -o tmp/fact.s produces the following assembler code:
.type fact, @function
fact:
.LFB0:
.cfi_startproc
endbr64
# tmp/fact.c:3: if (n<1) return 1;
movl $1, %eax #, <retval>
testl %edi, %edi # n
jle .L1 #,
.p2align 4,,10
.p2align 3
.L2:
imull %edi, %eax # n, <retval>
subl $1, %edi #, n
jne .L2 #,
.L1:
# tmp/fact.c:5: }
ret
.cfi_endproc
.LFE0:
.size fact, .-fact
.ident "GCC: (Ubuntu 10.2.0-5ubuntu1~20.04) 10.2.0"
And you can observe that the call stack is not increasing above.
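Read back into C, the loop in that listing corresponds roughly to the following hand-written version (fact_iterative is a made-up name; the compiler of course never produces C source, this is just a reading aid):

```c
/* Hand-written equivalent of the optimized machine code above. */
int fact_iterative(int n)
{
    int result = 1;      /* movl $1, %eax */
    while (n >= 1) {     /* testl/jle guard on entry, jne at the loop end */
        result *= n;     /* imull %edi, %eax */
        n -= 1;          /* subl $1, %edi */
    }
    return result;       /* no call instruction, so no stack growth */
}
```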
If you have serious and documented arguments against GCC, please submit a bug report.
BTW, you could write your own GCC plugin which would choose to randomly apply or not such an optimization. I believe it stays conforming to the C standard.
The above optimization is essential for many compilers generating C code, such as Chicken/Scheme or Bigloo.
A related theorem is Rice's theorem. See also this draft report funded by the CHARIOT project.
See also the Compcert project.
Sometimes a function doesn't use an argument (perhaps because another "flags" argument doesn't enable a specific feature).
However, you have to specify something, so usually you just put 0. But if you do that, and the function is external, gcc will emit code to "really make sure" that parameter gets set to 0.
Is there a way to tell gcc that a particular argument to a function doesn't matter and it can leave alone whatever value it is that happens to be in the argument register right then?
Update: Someone asked about the XY problem. The context behind this question is I want to implement a varargs function in x86_64 without using the compiler varargs support. This is simplest when the parameters are on the stack, so I declare my functions to take 5 or 6 dummy parameters first, so that the last non-vararg parameter and all of the vararg parameters end up on the stack. This works fine, except it's clearly not optimal - when looking at the assembly code it's clear that gcc is initializing all those argument registers to zero in the caller.
Please don't take below answer seriously. The question asks for a hack so there you go.
GCC will effectively treat value of uninitialized variable as "don't care" so we can try exploiting this:
int foo(int x, int y);
int bar_1(int y) {
int tmp = tmp; // Suppress uninitialized warnings
return foo(tmp, y);
}
Unfortunately my version of GCC still cowardly initializes tmp to zero but yours may be more aggressive:
bar_1:
.LFB0:
.cfi_startproc
movl %edi, %esi
xorl %edi, %edi
jmp foo
.cfi_endproc
Another option is (ab)using inline assembly to fake GCC into thinking that tmp is defined (when in fact it isn't):
int bar_2(int y) {
int tmp;
asm("" : "=r"(tmp));
return foo(tmp, y);
}
With this GCC managed to get rid of parameter initializations:
bar_2:
.LFB1:
.cfi_startproc
movl %edi, %esi
jmp foo
.cfi_endproc
Note that inline asm must be immediately before the function call, otherwise GCC will think it has to preserve output values which would harm register allocation.
I stumbled over an interesting question in a forum a long time ago and I want to know the answer.
Consider the following C function:
f1.c
#include <stdbool.h>
bool f1()
{
int var1 = 1000;
int var2 = 2000;
int var3 = var1 + var2;
return (var3 == 0) ? true : false;
}
This should always return false since var3 == 3000. The main function looks like this:
main.c
#include <stdio.h>
#include <stdbool.h>
int main()
{
printf( f1() == true ? "true\n" : "false\n");
if( f1() )
{
printf("executed\n");
}
return 0;
}
Since f1() should always return false, one would expect the program to print only one false to the screen. But after compiling and running it, executed is also displayed:
$ gcc main.c f1.c -o test
$ ./test
false
executed
Why is that? Does this code have some sort of undefined behavior?
Note: I compiled it with gcc (Ubuntu 4.9.2-10ubuntu13) 4.9.2.
As noted in other answers, the problem is that you use gcc with no compiler options set. If you do this, it defaults to what is called "gnu90", which is a non-standard implementation of the old, withdrawn C90 standard from 1990.
In the old C90 standard there was a major flaw in the C language: if you didn't declare a prototype before using a function, it would default to int func () (where ( ) means "accept any parameter"). This changes the calling convention of the function func, but it doesn't change the actual function definition. Since the size of bool and int are different, your code invokes undefined behavior when the function is called.
This dangerous nonsense behavior was fixed in the year 1999, with the release of the C99 standard. Implicit function declarations were banned.
Unfortunately, GCC up to version 5.x.x still uses the old C standard by default. There is probably no reason why you should want to compile your code as anything but standard C. So you have to explicitly tell GCC that it should compile your code as modern C code, instead of some 25+ years old, non-standard GNU crap.
Fix the problem by always compiling your program as:
gcc -std=c11 -pedantic-errors -Wall -Wextra
-std=c11 tells it to make a half-hearted attempt to compile according to the (current) C standard (informally known as C11).
-pedantic-errors tells it to whole-heartedly do the above, and give compiler errors when you write incorrect code which violates the C standard.
-Wall means give me some extra warnings that might be good to have.
-Wextra means give me some other extra warnings that might be good to have.
You don't have a prototype declared for f1() in main.c, so it is implicitly defined as int f1(), meaning it is a function that takes an unknown number of arguments and returns an int.
If int and bool are of different sizes, this will result in undefined behavior. For example, on my machine, int is 4 bytes and bool is one byte. Since the function is defined to return bool, it only sets one byte of the return value. However, since it's implicitly declared to return int in main.c, the calling function will try to read 4 bytes.
The default compiler options in gcc won't tell you that it's doing this. But if you compile with -Wall -Wextra, you'll get this:
main.c: In function ‘main’:
main.c:6: warning: implicit declaration of function ‘f1’
To fix this, add a declaration for f1 in main.c, before main:
bool f1(void);
Note that the argument list is explicitly set to void, which tells the compiler the function takes no arguments, as opposed to an empty parameter list, which means an unknown number of arguments. The definition of f1 in f1.c should also be changed to reflect this.
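Put together, a minimal sketch of the fixed arrangement could look like this, assuming the prototype lives in a small shared header (f1.h is a hypothetical file name) that both main.c and f1.c include. It is shown as one listing only for brevity.

```c
#include <stdbool.h>

/* --- would go in the hypothetical f1.h, included by main.c and f1.c --- */
bool f1(void);

/* --- f1.c --- */
bool f1(void)
{
    int var1 = 1000;
    int var2 = 2000;
    int var3 = var1 + var2;
    return var3 == 0;   /* always false, since var3 == 3000 */
}
```

Because the definition in f1.c sees the same prototype as the caller, the compiler can also diagnose a mismatch between the two.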
I think it's interesting to see where the size-mismatch mentioned in Lundin's excellent answer actually happens.
If you compile with --save-temps, you will get assembly files that you can look at. Here's the part where f1() does the == 0 comparison and returns its value:
cmpl $0, -4(%rbp)
sete %al
The returning part is sete %al. In C's x86 calling conventions, return values 4 bytes or smaller (which includes int and bool) are returned via register %eax. %al is the lowest byte of %eax. So, the upper 3 bytes of %eax are left in an uncontrolled state.
Now in main():
call f1
testl %eax, %eax
je .L2
This checks whether the whole of %eax is zero, because it thinks it's testing an int.
Adding an explicit function declaration changes main() to:
call f1
testb %al, %al
je .L2
which is what we want.
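The effect can be mimicked in plain C. The sketch below models %eax as a 32-bit value; the helper names are invented for illustration, and the stale value 3000 used in the note underneath is an assumption about what happened to be left in the register.

```c
#include <stdint.h>

/* Model of "sete %al": only the lowest byte of the register is written. */
uint32_t write_low_byte(uint32_t eax, uint8_t al)
{
    return (eax & 0xFFFFFF00u) | al;
}

/* Model of "testl %eax, %eax": the caller believes it received an int. */
int nonzero_as_int(uint32_t eax)
{
    return eax != 0;
}

/* Model of "testb %al, %al": the caller knows it received a bool. */
int nonzero_as_bool(uint32_t eax)
{
    return (eax & 0xFFu) != 0;
}
```

With a stale 3000 (0x00000BB8) in the register and a false result written to the low byte, the register holds 0x00000B00: nonzero_as_int() reports true while nonzero_as_bool() reports false.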
Please compile with a command such as this one:
gcc -Wall -Wextra -Werror -std=gnu99 -o main.exe main.c
Output:
main.c: In function 'main':
main.c:14:5: error: implicit declaration of function 'f1' [-Werror=implicit-function-declaration]
printf( f1() == true ? "true\n" : "false\n");
^
cc1.exe: all warnings being treated as errors
With such a message, you should know what to do to correct it.
Edit: After reading a (now deleted) comment, I tried to compile your code without the flags. Well, this led me to linker errors with no compiler warnings instead of compiler errors. And those linker errors are more difficult to understand, so even if -std=gnu99 is not necessary, please try to always use at least -Wall -Werror; it will save you a lot of pain in the ass.
I am using C99 under GCC.
I have a function declared static inline in a header that I cannot modify.
The function never returns but is not marked __attribute__((noreturn)).
How can I call the function in a way that tells the compiler it will not return?
I am calling it from my own noreturn function, and partly want to suppress the "noreturn function returns" warning but also want to help the optimizer etc.
I have tried including a declaration with the attribute but get a warning about the repeated declaration.
I have tried creating a function pointer and applying the attribute to that, but it says the function attribute cannot apply to a pointed function.
In the function you defined, which calls the external function, add a call to __builtin_unreachable, which is built into at least the GCC and Clang compilers and is marked noreturn. In fact, this function does nothing else and should never actually be executed. It's only there so that the compiler can infer that program execution will stop at this point.
static inline void external_function(void) // lacks the noreturn attribute
{ /* does not return */ }
__attribute__((noreturn)) void your_function(void) {
external_function(); // the compiler thinks execution may continue ...
__builtin_unreachable(); // ... and now it knows it won't go beyond here
}
Edit: Just to clarify a few points raised in the comments, and generally give a bit of context:
A function has only two ways of not returning: loop forever, or short-circuit the usual control flow (e.g. throw an exception, jump out of the function, terminate the process, etc.)
In some cases, the compiler may be able to infer and prove through static analysis that a function will not return. Even theoretically, this is not always possible, and since we want compilers to be fast only obvious/easy cases are detected.
__attribute__((noreturn)) is an annotation (like const) by which the programmer informs the compiler that he's absolutely sure a function will not return. Following the trust-but-verify principle, the compiler tries to prove that the function does indeed not return. It may then issue an error if it proves the function may return, or a warning if it was not able to prove whether the function returns or not.
__builtin_unreachable has undefined behaviour because it is not meant to be called. It's only meant to help the compiler's static analysis. Indeed the compiler knows that this function does not return, so any following code is provably unreachable (except through a jump).
Once the compiler has established (either by itself, or with the programmer's help) that some code is unreachable, it may use this information to do optimizations like these:
Remove the boilerplate code used to return from a function to its caller, if the function never returns
Propagate the unreachability information, i.e. if the only execution path to a code point is through unreachable code, then this point is also unreachable. Examples:
if a function does not return, any code following its call and not reachable through jumps is also unreachable. Example: code following __builtin_unreachable() is unreachable.
in particular, if the only path to a function's return is through unreachable code, the function can be marked noreturn. That's what happens for your_function.
any memory location / variable only used in unreachable code is not needed, therefore setting/computing the content of such data is not needed.
any computation which is provably (1) unnecessary (previous bullet) and (2) free of side effects (such as pure functions) may be removed.
Illustration:
The call to external_function cannot be removed because it might have side-effects. In fact, it probably has at least the side effect of terminating the process!
The return boilerplate of your_function may be removed.
Here's another example showing how code before the unreachable point may be removed:
int compute(int input) __attribute__((pure)); // expensive computation, no side effects
if(condition) {
int x = compute(input); // (1) no side effect => keep if x is used
// (8) x is not used => remove
printf("hello "); // (2) reachable + side effect => keep
your_function(); // (3) reachable + side effect => keep
// (4) unreachable beyond this point
printf("world!\n"); // (5) unreachable => remove
printf("%d\n", x); // (6) unreachable => remove
// (7) mark 'x' as unused
} else {
// follows unreachable code, but can jump here
// from reachable code, so this is reachable
do_stuff(); // keep
}
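For reference, here is the first snippet fleshed out into something that should compile with GCC or Clang. external_fatal stands in for the header function you cannot modify, and all names are invented for the sketch.

```c
#include <stdlib.h>

/* Stand-in for the unmodifiable header function (hypothetical name):
 * it never returns, but is not marked noreturn. */
static inline void external_fatal(int code)
{
    exit(code);
}

/* Our own wrapper, which we are free to annotate. */
__attribute__((noreturn)) static void fatal(int code)
{
    external_fatal(code);     /* the compiler thinks this may return ... */
    __builtin_unreachable();  /* ... so tell it that it never does */
}

int deref_plus_two(const int *p)
{
    if (!p)
        fatal(1);             /* no fall-through: p is non-null below */
    return *p + 2;
}
```

The promise made by __builtin_unreachable() is yours to keep: if external_fatal ever did return, the behaviour would be undefined.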
Several solutions:
redeclaring your function with the __attribute__
You should try to modify that function in its header by adding __attribute__((noreturn)) to it.
You can redeclare some functions with a new attribute, as this stupid test demonstrates (adding an attribute to fopen):
#include <stdio.h>
extern FILE *fopen (const char *__restrict __filename,
const char *__restrict __modes)
__attribute__ ((warning ("fopen is used")));
void
show_map_without_care (void)
{
FILE *f = fopen ("/proc/self/maps", "r");
if (!f)
return;
char lin[64];
while (fgets (lin, sizeof (lin), f))
fputs (lin, stdout);
fclose (f);
}
overriding with a macro
Finally, you could define a macro like
#define func(A) {func(A); __builtin_unreachable();}
(this uses the fact that inside a macro, the macro name is not macro-expanded).
If your never-returning func is declared as returning e.g. int, you'll use a statement expression like
#define func(A) ({func(A); __builtin_unreachable(); (int)0; })
Macro-based solutions like above won't always work, e.g. if func is passed as a function pointer, or simply if some guy codes (func)(1) which is legal but ugly.
redeclaring a static inline with the noreturn attribute
And the following example:
// file ex.c
// declare exit without any standard header
void exit (int);
// define myexit as a static inline
static inline void
myexit (int c)
{
exit (c);
}
// redeclare it as noreturn
static inline void myexit (int c) __attribute__ ((noreturn));
int
foo (int *p)
{
if (!p)
myexit (1);
if (p)
return *p + 2;
return 0;
}
when compiled with GCC 4.9 (from Debian/Sid/x86-64) as gcc -S -fverbose-asm -O2 ex.c, gives an assembly file containing the expected optimization:
.type foo, @function
foo:
.LFB1:
.cfi_startproc
testq %rdi, %rdi # p
je .L5 #,
movl (%rdi), %eax # *p_2(D), *p_2(D)
addl $2, %eax #, D.1768
ret
.L5:
pushq %rax #
.cfi_def_cfa_offset 16
movb $1, %dil #,
call exit #
.cfi_endproc
.LFE1:
.size foo, .-foo
You could play with #pragma GCC diagnostic to selectively disable a warning.
Customizing GCC with MELT
Finally, you could customize your recent gcc using the MELT plugin and coding your simple extension (in the MELT domain specific language) to add the attribute noreturn when encountering the desired function. It is probably a dozen lines of MELT, using register_finish_decl_first and a match on the function name.
Since I am the main author of MELT (free software GPLv3+) I could perhaps even code that for you if you ask, e.g. here or preferably on gcc-melt@googlegroups.com; give the concrete name of your never-returning function.
The MELT code would probably look like:
;;file your_melt_mode.melt
(module_is_gpl_compatible "GPLv3+")
(defun my_finish_decl (decl)
(let ( (tdecl (unbox :tree decl))
)
(match tdecl
(?(tree_function_decl_named
?(tree_identifier ?(cstring_same "your_function_name")))
;;; code to add the noreturn attribute
;;; ....
))))
(register_finish_decl_first my_finish_decl)
The real MELT code is slightly more complex. You want to define your_adding_attr_mode there. Ask me for more.
Once you have coded your MELT extension your_melt_mode.melt for your needs (and compiled that MELT extension into your_melt_mode.quicklybuilt.so as documented in the MELT tutorials) you'll compile your code with
gcc -fplugin=melt \
-fplugin-arg-melt-extra=your_melt_mode.quicklybuilt \
-fplugin-arg-melt-mode=your_adding_attr_mode \
-O2 -I/your/include -c yourfile.c
In other words, you just add a few -fplugin-* flags to your CFLAGS in your Makefile!
BTW, I'm just coding in the MELT monitor (on github: https://github.com/bstarynk/melt-monitor ..., file meltmom-process.melt) something quite similar.
With a MELT extension, you won't get any additional warning, since the MELT extension would alter the internal GCC AST (a GCC Tree) of the declared function on the fly!
Customizing GCC with MELT is probably the most bullet-proof solution, since it is modifying the GCC internal AST. Of course, it is probably the most costly solution (and it is GCC specific and might need -small- changes when GCC is evolving, e.g. when using the next version of GCC), but as I am trying to show it is quite easy in your case.
PS. In 2019, GCC MELT is an abandoned project. If you want to customize GCC (for any recent version of GCC, e.g. GCC 7, 8, or 9), you need to write your own GCC plugin in C++.
I had been struggling for weeks with a poor-performing translator I had written.
On the following simple benchmark
#include<stdio.h>
int main()
{
int x;
char buf[2048];
FILE *test = fopen("test.out", "wb");
setvbuf(test, buf, _IOFBF, sizeof buf);
for(x=0;x<1024*1024; x++)
fprintf(test, "%04d", x);
fclose(test);
return 0;
}
we see the following result
bash-3.1$ gcc -O2 -static test.c -o test
bash-3.1$ time ./test
real 0m0.334s
user 0m0.015s
sys 0m0.016s
As you can see, the moment the "-std=c99" flag is added in, performance comes crashing down:
bash-3.1$ gcc -O2 -static -std=c99 test.c -o test
bash-3.1$ time ./test
real 0m2.477s
user 0m0.015s
sys 0m0.000s
The compiler I'm using is gcc 4.6.2 mingw32.
The file generated is about 12M, so this is a difference of about 21MB/s between the two.
Running diff shows that the generated files are identical.
I assumed this has something to do with file locking in fprintf, of which the program makes heavy use, but I haven't been able to find a way to switch that off in the C99 version.
I tried flockfile on the stream I use at the beginning of the program, and a corresponding funlockfile at the end, but was greeted with compiler errors about implicit declarations, and linker errors claiming undefined references to those functions.
Could there be another explanation for this problem, and more importantly, is there any way to use C99 on windows without paying such an enormous performance price?
Edit:
After looking at the code generated by these options, it looks like in the slow versions, mingw sticks in the following:
_fprintf:
LFB0:
.cfi_startproc
subl $28, %esp
.cfi_def_cfa_offset 32
leal 40(%esp), %eax
movl %eax, 8(%esp)
movl 36(%esp), %eax
movl %eax, 4(%esp)
movl 32(%esp), %eax
movl %eax, (%esp)
call ___mingw_vfprintf
addl $28, %esp
.cfi_def_cfa_offset 4
ret
.cfi_endproc
In the fast version, this simply does not exist; otherwise, both are exactly the same. I assume __mingw_vfprintf is the slowpoke here, but I have no idea what behavior it needs to emulate that makes it so slow.
After some digging in the source code, I have found why the MinGW function is so terribly slow:
At the beginning of a [v,f,s]printf in MinGW, there is some innocent-looking initialization code:
__pformat_t stream = {
dest, /* output goes to here */
flags &= PFORMAT_TO_FILE | PFORMAT_NOLIMIT, /* only these valid initially */
PFORMAT_IGNORE, /* no field width yet */
PFORMAT_IGNORE, /* nor any precision spec */
PFORMAT_RPINIT, /* radix point uninitialised */
(wchar_t)(0), /* leave it unspecified */
0, /* zero output char count */
max, /* establish output limit */
PFORMAT_MINEXP /* exponent chars preferred */
};
However, PFORMAT_MINEXP is not what it appears to be:
#ifdef _WIN32
# define PFORMAT_MINEXP __pformat_exponent_digits()
# ifndef _TWO_DIGIT_EXPONENT
# define _get_output_format() 0
# define _TWO_DIGIT_EXPONENT 1
# endif
static __inline__ __attribute__((__always_inline__))
int __pformat_exponent_digits( void )
{
char *exponent_digits = getenv( "PRINTF_EXPONENT_DIGITS" );
return ((exponent_digits != NULL) && ((unsigned)(*exponent_digits - '0') < 3))
|| (_get_output_format() & _TWO_DIGIT_EXPONENT)
? 2
: 3
;
}
This winds up getting called every time I want to print, and getenv on windows must not be very quick. Replacing that define with a 2 brings the runtime back to where it should be.
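One way to avoid paying for getenv on every call would be to look the variable up once and remember the answer. The following is only a sketch of that idea, not the actual MinGW fix: exponent_digits_cached is a made-up name, the _get_output_format() part of the original test is left out, and caching means later changes to the environment variable go unnoticed.

```c
#include <stdlib.h>

/* Hypothetical cached variant of the lookup shown above. */
int exponent_digits_cached(void)
{
    static int cached = 0;   /* 0 means "not looked up yet" */

    if (cached == 0) {
        const char *s = getenv("PRINTF_EXPONENT_DIGITS");
        /* Same test as the original: a leading '0', '1' or '2' selects
         * two exponent digits, anything else (or unset) selects three. */
        cached = (s != NULL && (unsigned)(*s - '0') < 3) ? 2 : 3;
    }
    return cached;
}
```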
So, the answer comes down to this: when using -std=c99 or any ANSI-compliant mode, MinGW switches the CRT runtime with its own. Normally, this wouldn't be an issue, but the MinGW lib had a bug which slowed its formatting functions down far beyond anything imaginable.
Using -std=c99 disables all GNU extensions.
With GNU extensions and optimization, your fprintf(test, "B") is probably replaced by a fputc('B', test)
Note this answer is obsolete, see https://stackoverflow.com/a/13973562/611560 and https://stackoverflow.com/a/13973933/611560
After some consideration of your assembler, it looks like the slow version is using the *printf() implementation of MinGW, based undoubtedly on the GCC one, while the fast version is using the Microsoft implementation from msvcrt.dll.
Now, the MS one is notable for lacking a lot of features that the GCC one does implement. Some of these are GNU extensions, but others are there for C99 conformance. And since you are using -std=c99, you are requesting the conformance.
But why so slow? Well, one factor is simplicity: the MS version is far simpler, so it is expected to run faster, even in the trivial cases. Another factor is that you are running under Windows, so it is expected that the MS version is more efficient than one copied from the Unix world.
Does it explain a factor of x10? Probably not...
Another thing you can try:
Replace fprintf() with sprintf(), printing into a memory buffer without touching the file at all. Then you can try doing fwrite() without printfing. That way you can guess if the loss is in the formatting of the data or in the writing to the FILE.
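A rough sketch of that experiment, assuming you time the two phases separately. format_numbers and write_block are invented helper names, and snprintf is used instead of sprintf so the buffer cannot overflow.

```c
#include <stdio.h>

/* Phase 1: formatting only, no FILE involved. Returns bytes produced. */
size_t format_numbers(char *out, size_t cap, int count)
{
    size_t used = 0;

    for (int x = 0; x < count; x++) {
        int n = snprintf(out + used, cap - used, "%04d", x);
        if (n < 0 || (size_t)n >= cap - used) {
            if (used < cap)
                out[used] = '\0';   /* drop the partially written number */
            break;                  /* buffer full: stop early */
        }
        used += (size_t)n;
    }
    return used;
}

/* Phase 2: writing only, no formatting involved. Returns bytes written. */
size_t write_block(FILE *fp, const char *data, size_t len)
{
    return fwrite(data, 1, len, fp);
}
```

If the first phase alone is already slow with -std=c99, the cost is in the formatting; if only the second phase is slow, look at the stream instead.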
Since MinGW32 3.15, compliant printf functions are available to use instead of those found in Microsoft C runtime (CRT).
The new printf functions are used when compiling in strict ANSI, POSIX and/or C99 modes.
For more information see the mingw32 changelog
You can use __msvcrt_fprintf() to use the fast (non compliant) function.