Can I compute pow(10,x) at compile-time in c?

Can I compute pow(10,x) at compile-time in c? - c

Is it possible to compute pow(10,x) at compile time?
I've got a processor without floating point support and slow integer division. I'm trying to perform as many calculations as possible at compile time. I can dramatically speed up one particular function if I pass both x and C/pow(10,x) as arguments (x and C are always constant integers, but they are different constants for each call). I'm wondering if I can make these function calls less error prone by introducing a macro which does the 1/pow(10,x) automatically, instead of forcing the programmer to calculate it?
Is there a pre-processor trick? Can I force the compiler optimize out the library call?

There are very few values possible before you overflow int (or even long). For clarities sake, make it a table!
edit: If you are using floats (looks like you are), then no it's not going to be possible to call the pow() function at compile time without actually writing code that runs in the make process and outputs the values to a file (such as a header file) which is then compiled.

GCC will do this at a sufficiently high optimization level (-O1 does it for me). For example:
#include <math.h>
int test() {
double x = pow(10, 4);
return (int)x;
}
Compiles at -O1 -m32 to:
.file "test.c"
.text
.globl test
.type test, #function
test:
pushl %ebp
movl %esp, %ebp
movl $10000, %eax
popl %ebp
ret
.size test, .-test
.ident "GCC: (Ubuntu 4.3.3-5ubuntu4) 4.3.3"
.section .note.GNU-stack,"",#progbits
This works without the cast as well - of course, you do get a floating-point load instruction in there, as the Linux ABI passes floating point return values in FPU registers.

You can do it with Boost.Preprocessor:
http://www.boost.org/doc/libs/1_39_0/libs/preprocessor/doc/index.html
Code:
#include <boost/preprocessor/repeat.hpp>
#define _TIMES_10(z, n, data) * 10
#define POW_10(n) (1 BOOST_PP_REPEAT(n, _TIMES_10, _))
int test[4] = {POW_10(0), POW_10(1), POW_10(2), POW_10(3)};

Actually, by exploiting the C preprocessor, you can get it to compute C pow(10, x) for any real C and integral x. Observe that, as #quinmars noted, C allows you to use scientific syntax to express numerical constants:
#define myexp 1.602E-19 // == 1.602 * pow(10, -19)
to be used for constants. With this in mind, and a bit of cleverness, we can construct a preprocessor macro that takes C and x and combine them into an exponentiation token:
#define EXP2(a, b) a ## b
#define EXP(a, b) EXP2(a ## e,b)
#define CONSTPOW(C,x) EXP(C, x)
This can now be used as a constant numerical value:
const int myint = CONSTPOW(3, 4); // == 30000
const double myfloat = CONSTPOW(M_PI, -2); // == 0.03141592653

You can use the scientific notation for floating point values which is part of the C language. It looks like that:
e = 1.602E-19 // == 1.602 * pow(10, -19)
The number before the E ( the E maybe capital or small 1.602e-19) is the fraction part where as the (signed) digit sequence after the E is the exponent part. By default the number is of the type double, but you can attach a floating point suffix (f, F, l or L) if you need a float or a long double.
I would not recommend to pack this semantic into a macro:
It will not work for variables, floating point values, etc.
The scientific notation is more readable.

Actually, you have M4 which is a pre-processor way more powerful than the GCC’s. A main difference between those two is GCC’s is not recursive whereas M4 is. It makes possible things like doing arithmetic at compile-time (and much more!). The below code sample is what you would like to do, isn’t it? I made it bulky in a one-file source; but I usually put M4's macro definitions in separate files and tune my Makefile rules. This way, your code is kept from ugly intrusive M4 definitions into the C source code I've done here.
$ cat foo.c
define(M4_POW_AUX, `ifelse($2, 1, $1, `eval($1 * M4_POW_AUX($1, decr($2)))')')dnl
define(M4_POW, `ifelse($2, 0, 1, `M4_POW_AUX($1, $2)')')dnl
#include <stdio.h>
int main(void)
{
printf("2^0 = %d\n", M4_POW(2, 0));
printf("2^1 = %d\n", M4_POW(2, 1));
printf("2^4 = %d\n", M4_POW(2, 4));
return 0;
}
The command line to compile this code sample uses the ability of GCC and M4 to read from the standard input.
$ cat foo.c | m4 - | gcc -x c -o m4_pow -
$ ./m4_pow
2^0 = 1
2^1 = 2
2^4 = 16
Hope this help!

If you just need to use the value at compile time, use the scientific notation like 1e2 for pow(10, 2)
If you want to populate the values at compile time and then use them later at runtime then simply use a lookup table because there are only 23 different powers of 10 that are exactly representable in double precision
double POW10[] = {1., 1e1, 1e2, 1e3, 1e4, 1e5, 1e6, 1e7, 1e8, 1e9, 1e10,
1e11, 1e12, 1e13, 1e14, 1e15, 1e16, 1e17, 1e18, 1e19, 1e20, 1e21, 1e22};
You can get larger powers of 10 at runtime from the above lookup table to quickly get the result without needing to multiply by 10 again and again, but the result is just a value close to a power of 10 like when you use 10eX with X > 22
double pow10(int x)
{
if (x > 22)
return POW10[22] * pow10(x - 22);
else if (x >= 0)
return POW10[x];
else
return 1/pow10(-x);
}
If negative exponents is not needed then the final branch can be removed.
You can also reduce the lookup table size further if memory is a constraint. For example by storing only even powers of 10 and multiply by 10 when the exponent is odd, the table size is now only a half.

Recent versions of GCC ( around 4.3 ) added the ability to use GMP and MPFR to do some compile-time optimizations by evaluating more complex functions that are constant. That approach leaves your code simple and portable, and trust the compiler to do the heavy lifting.
Of course, there are limits to what it can do. Here's a link to the description in the changelog, which includes a list of functions that are supported by this. 'pow' is one them.

Unfortunately, you can't use the preprocessor to precalculate library calls. If x is integral you could write your own function, but if it's a floating-point type I don't see any good way to do this.

bdonlan's replay is spot on but keep in mind that you can perform nearly any optimization you chose on the compile box provided you are willing to parse and analyze the code in your own custom preprocessor. It is a trivial task in most version of unix to override the implicit rules that call the compiler to call a custom step of your own before it hits the compiler.

Related

Why doesn't the same generated assembler code lead to the same output?

Sample code (t0.c):
#include <stdio.h>
float f(float a, float b, float c) __attribute__((noinline));
float f(float a, float b, float c)
{
return a * c + b * c;
}
int main(void)
{
void* p = V;
printf("%a\n", f(4476.0f, 20439.0f, 4915.0f));
return 0;
}
Invocation & execution (via godbolt.org):
# icc 2021.1.2 on Linux on x86-64
$ icc t0.c -fp-model=fast -O3 -DV=f
0x1.d32322p+26
$ icc t0.c -fp-model=fast -O3 -DV=0
0x1.d32324p+26
Generated assembler code is the same: https://godbolt.org/z/osra5jfYY.
Why doesn't the same generated assembler code lead to the same output?
Why does void* p = f; matter?

Godbolt shows you the assembly emitted by running the compiler with -S. But in this case, that's not the code that actually gets run, because further optimizations can be done at link time.
Try checking the "Compile to binary" box instead (https://godbolt.org/z/ETznv9qP4), which will actually compile and link the binary and then disassemble it. We see that in your -DV=f version, the code for f is:
addss xmm0,xmm1
mulss xmm0,xmm2
ret
just as before. But with -DV=0, we have:
movss xmm0,DWORD PTR [rip+0x2d88]
ret
So f has been converted to a function which simply returns a constant loaded from memory. At link time, the compiler was able to see that f was only ever called with a particular set of constant arguments, and so it could perform interprocedural constant propagation and have f merely return the precomputed result.
Having an additional reference to f evidently defeats this. Probably the compiler or linker sees that f had its address taken, and didn't notice that nothing was ever done with the address. So it assumes that f might be called elsewhere in the program, and therefore it has to emit code that would give the correct result for arbitrary arguments.
As to why the results are different: The precomputation is done strictly, evaluating both a*c and b*c as float and then adding them. So its result of 122457232 is the "right" one by the rules of C, and it is also what you get when compiling with -O0 or -fp-model=strict. The runtime version has been optimized to (a+b)*c, which is actually more accurate because it avoids an extra rounding; it yields 122457224, which is closer to the exact value of 122457225.

How to write rotate code in C to compile into the `ror` x86 instruction?

I have some code that rotates my data. I know GAS syntax has a single assembly instruction that can rotate an entire byte. However, when I try to follow any of the advice on Best practices for circular shift (rotate) operations in C++, my C code compiles into at least 5 instructions, which use up three registers-- even when compiling with -O3. Maybe those are best practices in C++, and not in C?
In either case, how can I force C to use the ROR x86 instruction to rotate my data?
The precise line of code which is not getting compiled to the rotate instruction is:
value = (((y & mask) << 1 ) | (y >> (size-1))) //rotate y right 1
^ (((z & mask) << n ) | (z >> (size-n))) // rotate z left by n
// size can be 64 or 32, depending on whether we are rotating a long or an int, and
// mask would be 0xff or 0xffffffff, accordingly
I do not mind using __asm__ __volatile__ to do this rotate, if that's what I must do. But I don't know how to do so correctly.

Your macro compiles to a single ror instruction for me... specifically, I compiled this test file:
#define ROR(x,y) ((unsigned)(x) >> (y) | (unsigned)(x) << 32 - (y))
unsigned ror(unsigned x, unsigned y)
{
return ROR(x, y);
}
as C, using gcc 6, with -O2 -S, and this is the assembly I got:
.file "test.c"
.text
.p2align 4,,15
.globl ror
.type ror, #function
ror:
.LFB0:
.cfi_startproc
movl %edi, %eax
movl %esi, %ecx
rorl %cl, %eax
ret
.cfi_endproc
.LFE0:
.size ror, .-ror
.ident "GCC: (Debian 6.4.0-1) 6.4.0 20170704"
.section .note.GNU-stack,"",#progbits
Please try to do the same, and report the assembly you get. If your test program is substantially different from mine, please tell us how it differs. If you are using a different compiler or a different version of GCC please tell us exactly which one.
Incidentally, I get the same assembly output when I compile the code in the accepted answer for "Best practices for circular shift (rotate) operations in C++", as C.

How old is your compiler? As I noted in the linked question, the UB-safe variable-count rotate idiom (with extra & masking of the count) confuses old compilers, like gcc before 4.9. Since you're not masking the shift count, it should be recognized with even older gcc.
Your big expression is maybe confusing the compiler. Write an inline function for rotate, and call it, like
value = rotr32(y & mask, 1) ^ rotr32(z & mask, n);
Much more readable, and may help stop the compiler from trying to do things in the wrong order and breaking the idiom before recognizing it as a rotate.
Maybe those are best practices in C++, and not in C?
My answer on the linked question clearly says that it's the best practice for C as well as C++. They are different languages, but they overlap completely for this, according to my testing.
Here's a version of the Godbolt link using -xc to compile as C, not C++. I had a couple C++isms in the link in the original question, for experimenting with integer types for the rotate count.
Like the original linked from the best-practices answer, it has a version that uses x86 intrinsics if available. clang doesn't seem to provide any in x86intrin.h, but other compilers have _rotl / _rotr for 32-bit rotates, with other sizes available.
Actually, I talked about rotate intrinsics at length in the answer on the best-practices question, not just in the godbolt link. Did you even read the answer there, apart from the code block? (If you did, your question doesn't reflect it.)
Using intrinsics, or the idiom in your own inline function, is much better than using inline asm. Asm defeats constant-propagation, among other things. Also, compilers can use BMI2 rorx dst, src, imm8 to copy-and-rotate with one instruction, if you compile with -march=haswell or -mbmi2. It's a lot harder to write an inline-asm rotate that can use rorx for immediate-count rotates but ror r32, cl for variable-count rotates. You could try with _builtin_constant_p(), but clang evaluates that before inlining, so it's basically useless for meta-programming style choice of which code to use. It works with gcc though. But it's still much better not to use inline asm unless you've exhausted all other avenues (like asking on SO) to avoid it. https://gcc.gnu.org/wiki/DontUseInlineAsm
Fun fact: the rotate functions in gcc's x86intrin.h are just pure C using the rotate idiom that gcc recognizes. Except for 16-bit rotates, where they use __builtin_ia32_rolhi.

You might need to be a bit more specific with what integral type / width you're rotating, and whether you have a fixed or variable rotation. ror{b,w,l,q} (8, 16, 32, 64-bit) has forms for (1), imm8, or the %cl register. As an example:
static inline uint32_t rotate_right (uint32_t u, size_t r)
{
__asm__ ("rorl %%cl, %0" : "+r" (u) : "c" (r));
return u;
}
I haven't tested this, it's just off the top of my head. And I'm sure multiple constraint syntax could be used to optimize cases where a constant (r) value is used, so %e/rcx is left alone.
If you're using a recent version of gcc or clang (or even icc). The intrinsics header <x86intrin.h>, may provide __ror{b|w|d|q} intrinsics. I haven't tried them.

Best Way:
#define rotr32(x, n) (( x>>n ) | (x<<(64-n)))
#define rotr64(x, n) (( x>>n ) | (x<<(32-n)))
More generic:
#define rotr(x, n) (( x>>n ) | (x<<((sizeof(x)<<3)-n)))
And it compiles (in GCC) with exactly the same code as the asm versions below.
For 64 bit:
__asm__ __volatile__("rorq %b1, %0" : "=g" (u64) : "Jc" (cShift), "0" (u64));
or
static inline uint64_t CC_ROR64(uint64_t word, int i)
{
__asm__("rorq %%cl,%0"
:"=r" (word)
:"0" (word),"c" (i));
return word;
}

Floating point anomaly when an unused statement is not commented out?

When the program as shown below is run, it produces ok output:
j= 0 9007199616606190.000000 = x
k= 0 9007199616606190.000000 = [x]
r= 31443101 0.000000 = m*(x-[x])
But when the commented-out line (i.e. //if (argc>1) r = atol(argv[1]);) is uncommented, it produces:
j= 20000 9007199616606190.000000 = x
k= 17285 9007199616606190.000000 = [x]
r= 31443101 0.000000 = m*(x-[x])
even though that line should have no effect, since argc>1 is false. Has anybody got a plausible explanation for this problem? Is it reproducible on any other systems?
#include <stdio.h>
#include <stdlib.h>
#include <math.h>
int main(int argc, char *argv[]) {
int j, k, m=10000;
double r=31443101, jroot=sqrt(83), x;
//if (argc>1) r = atol(argv[1]);
x = r * r * jroot;
j = m*(x-floor(x));
k = floor(m*(x-floor(x)));
printf ("j= %9d %24.6f = x\n", j, x);
printf ("k= %9d %24.6f = [x]\n", k, floor(x));
printf ("r= %9.0f %24.6f = m*(x-[x]) \n", r, m*(x-floor(x)));
return 0;
}
Note, test system = AMD Athlon 64 5200+ system with Linux 2.6.35.14-96.fc14.i686 (i.e., booted to run a 32-bit OS on 64-bit HW) with gcc (GCC) 4.5.1 20100924 (Red Hat 4.5.1-4)
Update -- A few hours ago I posted a comment that code generated with and without the if statement differed only in stack offsets and some skipped code. I now find that comment was not entirely correct; i.e. it is true for non-optimized code, but not true for the -O3 code I executed.
Effect of optimization switch on problem:
-O0 : Both program versions run ok
-O2 or -O3 : Version with comment has error as above, where j=20000 and k=17285
-O1 : Version with comment has j=20000 (an error) and k=0 (OK)
Anyhow, looking at -O3 -S code listings, the two cases differ mostly in skipped if code and stack offsets up to the line before call floor, at which point the with-if code has one more fstpl than the without-if code:
... ;; code without comment:
fmul %st, %st(1)
fxch %st(1)
fstpl (%esp)
fxch %st(1)
fstpl 48(%esp)
fstpl 32(%esp)
call floor
movl $.LC2, (%esp)
fnstcw 86(%esp)
movzwl 86(%esp), %eax
...
... ;; versus code with comment:
fmul %st, %st(1)
fxch %st(1)
fstpl (%esp)
fxch %st(1)
fstpl 48(%esp)
fstpl 32(%esp)
fstpl 64(%esp)
call floor
movl $.LC3, (%esp)
fnstcw 102(%esp)
movzwl 102(%esp), %eax
...
I haven't figured out the reason for the difference.

Not duplicated on my system, Win7 running CygWin with gcc 4.3.4. Both with and without the if statement, the value of j is set to zero, not 20K.
My only suggestion would be to use gcc -S to get a look at the assembler output. That should hopefully tell you what's going wrong.
Specifically, generate the assembler output to two separate files, one each for the working and non-working variant, then vgrep them (eyeball them side by side) to try and ascertain the difference.
This is a serious failure in your environment by the way. With m being 10000, that means the x - floor(x) must be equal to 2. I can't for the life of me think of any real number where that would be the case :-)

I think there are two reasons why that line could have an effect:
Without that line, the values of all of these variables can be (and, IMHO, most likely are) determined at compile-time; with that line, the computations have to be performed at run-time. But obviously, the compiler's precomputed values are supposed to be the same as values computed at run-time, and I'm inclined to discount this as the actual reason for the different observed behavior. (It would certainly show up as a huge difference in the assembler output, though!)
On many machines, floating-point arithmetic is performed using more bits in intermediate values than can actually be stored in a double-precision floating-point number. Your second version, by creating two different code-paths to set x, basically restricts x to what can be stored in a double-precision floating-point number, whereas your first version can allow the initially-calculated value for x to still be available as an intermediate value, with extra bits, when computing subsequent values. (This could be the case whether all of these values are computed at compile-time or at run-time.)

The reason that uncommenting that line might affect the result is that without that line, the compiler can see that r and jroot cannot change after initialisation, so it can calculate x at compile-time rather than runtime. When the line is uncommented, r might change, so the calculation of x must be deferred to runtime, which can result it in being done with a different precision (particularly if 387 floating point math is being used).
You can try using -mfpmath=sse -march=native to use the SSE unit for floating point calculations, which doesn't exhibit excess precision; or you can try using the -ffloat-store switch.
Your subtraction x - floor(x) exhibits catastrophic cancellation - this is the root cause of the problem something to be avoided ;).

EDITED:
I also do not see a difference when I compile your code on my computer using the -O0, -O1, -O2 and -O3.
AMD Phenom Quad 64 bit.
gcc (Ubuntu 4.4.3-4ubuntu5) 4.4.3
I also tried clang (llvm) from release 3.0 with and without the if same results.
I agree that the compiler can pre-compute everything without that if line, you would definitely see that in the assembly output though.
Floating point and C can be nasty, lots of stuff to know to get it to really work. Forcing the int to double conversions is good for accuracy (c libraries in the compiler, even if the fpu is good have been known to have problems and the compilers C library it uses and the C library compiled into or used by your program can/will differ), but int to/from float is where FPU's tend to have their bugs (I think I saw that mentioned with TestFloat or somewhere like that). Might try running TestFloat on your system to see if your FPU is good. Between the famous pentium floating point bug and into the PentiumIV and beyond days most processors had floating point bugs, the pentium III I had was solid but the Pentium IV I had would fail. I rarely use floating point anymore so dont bother to test my systems.
Playing with optimization did change your results according to your edit so this is most likely a gcc problem or a combination of your code and gcc (and not a hardware fpu problem). Then try a different version of gcc on the same computer. 4.4.x instead of 4.5.x for example.

Mixing variables and integer constants in the Boost Preprocessor

I am using BOOST_PP for to do precompile computations in the preprocessor. I am focusing on an application where code size is extremely important to me. (So please don't say the compiler should or usually does that, I need to control what is performed at compile time and what code is generated). However, I want to be able to use the same name of the macro/function for both integer constants and variables. As trivial example, I can have
#define TWICE(n) BOOST_PP_MUL(n,2)
//.....
// somewhere else in code
int a = TWICE(5);
This does what I want it to, evaluating to
int a = 10;
during compile time.
However, I also want it to be used at
int b = 5;
int a = TWICE(b);
This should be preprocessed to
int b = 5;
int a = 5 * 2;
Of course, I can do so by using the traditional macros like
#define TWICE(n) n * 2
But then it doesnt do what I want it to do for integer constants (evaluating them during compile time).
So, my question is, is there a trick to check whether the argument is a literal or a variable, and then use different definitions. i.e., something like this:
#define TWICE(n) BOOST_PP_IF( _IS_CONSTANT(n), \
BOOST_PP_MUL(n,2), \
n * 2 )
edit:
So what I am really after is some way to check if something is a constant available at compile time, and hence a good argument for the BOOST_PP_ functions. I realize that this is different from what most people expect from a preprocessor and general programming recommendations. But there is no wrong way of programming, so please don't hate on the question if you disagree with its philosophy. There is a reason the BOOST_PP library exists, and this question is in the same spirit. It might just not be possible though.

You're trying to do something that is better left to the optimizations of the compiler.
int main (void) {
int b = 5;
int a = b * 2;
return a; // return it so we use a and it's not optimized away
}
gcc -O3 -s t.c
.file "t.c"
.text
.p2align 4,,15
.globl main
.type main, #function
main:
.LFB0:
.cfi_startproc
movl $10, %eax
ret
.cfi_endproc
.LFE0:
.size main, .-main
.ident "GCC: (Debian 4.5.0-6) 4.5.1 20100617 (prerelease)"
.section .note.GNU-stack,"",#progbits
Optimizing compiler will optimize.
EDIT: I'm aware that you don't want to hear that the compiler "should" or "usually" does that. However, what you are trying to do is not something that is meant to be done in the C preprocessor; the people who designed the C language and the C preprocessor designed it to work with the preprocessing token as it's basic atom. The CPP is, in many ways, "dumb". This is not a bad thing (in fact, this is in many cases what makes it so useful), but at the end of the day, it is a preprocessor. It preprocesses source files before they are parsed. It preprocesses source files before semantic analysis occurs. It preprocesses source files before the validity of a given source file is checked. I understand that you do not want to hear that this is something that the parser and semantic analyzer should handle or usually does. However, this is the reality of the situation. If you want to design code that is incredibly small, then you should rely on your compiler to do it's job instead of trying to create preprocessor constructs to do the work. Think of it this way: thousands of hours of work went into your compiler, so try to reuse as much of that work as possible!

not quite direct approach, however:
struct operation {
template<int N>
struct compile {
static const int value = N;
};
static int runtime(int N) { return N; }
};
operation::compile<5>::value;
operation::runtime(5);
alternatively
operation<5>();
operation(5);

There is no real chance of mixing the two levels (preprocessor and evaluation of variables). From what I understand from you question, b should be a symbolic constant?
I think you should use the traditional one
#define TWICE(n) ((n) * 2)
but then instead of initializing variables with expressions, you should initialize them with compile time constants. The only thing that I see to force evaluation at compile time and to have symbolic constants in C are integral enumeration constants. These are defined to have type int and are evaluated at compile time.
enum { bInit = 5 };
int b = bInit;
enum { aInit = TWICE(bInit) };
int a = aInit;
And generally you should not be too thrifty with const (as for your b) and check the produced assembler with -S.

double-precision numbers in inline assembly (GCC, IA-32)

I'm just starting to learn assembly and I want to round a floating-point value using a specified rounding mode. I've tried to implement this using fstcw, fldcw, and frndint.
Right now I get a couple of errors:
~ $ gc a02p
gcc -Wall -g a02p.c -o a02p
a02p.c: In function `roundD':
a02p.c:33: error: parse error before '[' token
a02p.c:21: warning: unused variable `mode'
~ $
I'm not sure if I am even doing this right at all. I don't want to use any predefined functions. I want to use GCC inline assembly.
This is the code:
#include <stdio.h>
#include <stdlib.h>
#define PRECISION 3
#define RND_CTL_BIT_SHIFT 10
// floating point rounding modes: IA-32 Manual, Vol. 1, p. 4-20
typedef enum {
ROUND_NEAREST_EVEN = 0 << RND_CTL_BIT_SHIFT,
ROUND_MINUS_INF = 1 << RND_CTL_BIT_SHIFT,
ROUND_PLUS_INF = 2 << RND_CTL_BIT_SHIFT,
ROUND_TOWARD_ZERO = 3 << RND_CTL_BIT_SHIFT
} RoundingMode;
double roundD (double n, RoundingMode roundingMode)
{
short c;
short mode = (( c & 0xf3ff) | (roundingMode));
asm("fldcw %[nIn] \n"
"fstcw %%eax \n" // not sure why i would need to store the CW
"fldcw %[modeIn] \n"
"frndint \n"
"fistp %[nOut] \n"
: [nOut] "=m" (n)
: [nIn] "m" (n)
: [modeIn] "m" (mode)
);
return n;
}
int main (int argc, char **argv)
{
double n = 0.0;
if (argc > 1)
n = atof(argv[1]);
printf("roundD even %.*f = %.*f\n",
PRECISION, n, PRECISION, roundD(n, ROUND_NEAREST_EVEN));
printf("roundD down %.*f = %.*f\n",
PRECISION, n, PRECISION, roundD(n, ROUND_MINUS_INF));
printf("roundD up %.*f = %.*f\n",
PRECISION, n, PRECISION, roundD(n, ROUND_PLUS_INF));
printf("roundD zero %.*f = %.*f\n",
PRECISION, n, PRECISION, roundD(n, ROUND_TOWARD_ZERO));
return 0;
}
Am I even remotely close to getting this right?

A better process is to write a simple function that rounds a floating point value. Next, instruct your compiler to print an assembly listing for the function. You may want to put the function in a separate file.
This process will show you the calling and exiting conventions used by the compiler. By placing the function in a separate file, you won't have to build other files. Also, it will give you the opportunity to replace the C language function with an assembly language function.
Although inline assembly is supported, I prefer to replace an entire function in assembly language and not use inline assembly (inline assembly isn't portable, so the source will have to be changed when porting to a different platform).

GCC's inline assembler syntax is arcane to say the least, and I do not claim to be an expert, but when I have used it I used this howto guide. In all examples all template markers are of the form %n where n is a number, rather then the %[ttt] form that you have used.
I also note that the line numbers reported in your error messages do not seem to correspond with the code you posted. So I wonder if they are in fact for this exact code?

Develop Reference

c reactjs sql-server angularjs arrays wpf database batch-file google-app-engine silverlight