I'm graphing accelerometer data, and I'm trying to correct for gravity. To do this, I get the acceleration vector in spherical coordinates, decrease the radius by 1g, and convert back to Cartesian. This method is called on a timer every 0.03 seconds:
//poll acceleration
ThreeAxisAcceleration current = self.accelerationData;
//math to correct for gravity:
float radius = sqrt(pow(current.x, 2) + pow(current.y, 2) + pow(current.z, 2));
float theta = atan2(current.y, current.x);
float phi = acos(current.z/radius);
//NSLog(#"SPHERICAL--- RADIUS: %.2f -- THETA: %.2f -- PHI: %.2f", radius, theta, phi);
radius = radius - 1.0;
float newX = radius * cos(theta) * sin(phi);
float newY = radius * sin(theta) * sin(phi);
float newZ = radius * cos(phi);
current = (ThreeAxisAcceleration){newX, newY, newZ};
//end math
NSValue *arrayVal = [NSValue value:&current withObjCType:@encode(ThreeAxisAcceleration)];
if ([_dataHistoryBuffer count] > self.bounds.size.width) {
[_dataHistoryBuffer removeObjectAtIndex:0];
}
[_dataHistoryBuffer addObject:arrayVal];
[self setNeedsDisplay];
Somehow, the addition of the gravity correction is gradually slowing my code horrendously. I find it hard to believe that this amount of math can slow down the program, yet without it the program runs through my entire display method, which is quite lengthy, just fine. Are there any options I can consider here to avoid this? Am I missing something, or is the math just that slow? I can comment out everything between the //math and //end math tags and be fine.
Thanks for any help.
P.S. In case it matters, to whom it may interest: I'm programming in Cocoa, and this method belongs to a subclass of CALayer, with -drawInContext: implemented.
Are you on iPhone? Try using the float variants of these functions: powf, sqrtf, etc.
There's more info in point #4 of Kendall Helmstetter Gelner's answer to this SO question.
Besides the fact that it's theoretically impossible to simply factor out Earth's gravity, the first step I would take would be to benchmark each of the operations that you're performing (multiplication, division, sin, atan2, etc) and then engineer a way around the operations that take significantly longer to compute (or avoid computing the problematic operations). Make sure to use the same data types in your benchmarking as you will in your finished product.
This is a classic example of the time/accuracy trade-off. There are usually multiple algorithms for performing the same computation and you also have LUTs/interpolation at your disposal.
I ran into the same issues when I made my own Wii-style remote controller. If you identify the expensive operation and are having trouble engineering around it then start another question. :)
The normal way to shorten a vector would be along the lines of:
float originalMagnitude = sqrtf(current.x * current.x + current.y * current.y + current.z * current.z);
float desiredMagnitude = originalMagnitude - 1.0f;
float scaleFactor = (originalMagnitude != 0) ? desiredMagnitude / originalMagnitude : 0.0f; // avoid divide-by-zero
current.x *= scaleFactor;
current.y *= scaleFactor;
current.z *= scaleFactor;
That said, no, calling a few trig functions 33 times a second shouldn’t be slowing you down much. On the other hand, -[NSMutableArray removeObjectAtIndex:] could potentially be slow for a big array. A ring buffer (either using NSMutableArray or a C array of structs) would be more efficient.
Profile, don't speculate. Don't change a damn thing until you know what to change.
Assuming that you get a profile that shows that all the math really is slowing you down:
Don't ever write pow(someFloat, 2). The compiler should be able to optimize this away for you, but often, on newer hardware, those optimizations may not yet be in place. This should always be written someFloat*someFloat. The pow( ) function is generally the most expensive function in the math library. Simple multiplication will always be at least as fast as calling pow( ), and will always be at least as accurate (assuming IEEE-754 compliant arithmetic). Plus, it's easier for the compiler to optimize. (See the sketch after this list.)
When working with floats in C, use the suffixed forms of the math library function. sinf is faster than sin. sqrtf is faster than sqrt. Beyond the functions themselves being faster, you avoid unnecessary conversions to and from double.
If you're seeing the slowdown on an ARMv6 processor (not the 3GS or the new iPod touch), make sure you are not compiling to Thumb code when you are doing a lot of floating-point computation. The Thumb instruction set (prior to Thumb-2) cannot access the VFP registers, and thus needs a shim for every floating-point operation. This can be quite expensive.
If you just want to decrease the length of the acceleration vector by 1.0 (hint: this doesn't do what you want), there are more efficient algorithms to do so.
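Putting the first two points together, the magnitude computation from the question could be sketched like this (plain multiplications instead of pow(), and single-precision sqrtf):
// squared terms via plain multiplication, then the float variant of sqrt
float radius = sqrtf(current.x * current.x
                     + current.y * current.y
                     + current.z * current.z);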
Those math lines look fine. I don't know enough Objective C to know what the current = ... line is doing though. Could it be allocating memory on the heap which isn't being reclaimed?
What happens if you just comment it out? Have you watched the processes' execution with top, to see if it starts slurping more CPU or memory?
Other than the other commenters' point about using the float (as opposed to double) operators, all that _dataHistoryBuffer work will be what's killing your app. It churns up memory like there's no tomorrow, and since you are using NSValue, all those objects get added to the autorelease pool, making memory consumption spike. You're much better off avoiding keeping a list of values unless you really, really need it; and if so, figure out a more appropriate (read: fixed-size, non-object) mechanism to store them in. Even a circular buffer of structs (e.g. an array of 10 structs, with a counter that does i++ % 10 to index into it) would be better.
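A minimal sketch of such a circular buffer in plain C, assuming a simple three-float struct; the capacity and names here are illustrative, not from the question:
#define HISTORY_CAPACITY 480  /* illustrative; roughly one screen width of samples */

typedef struct { float x, y, z; } AccelSample;

static AccelSample historyBuffer[HISTORY_CAPACITY];
static unsigned historyCount = 0;  /* total samples written so far */

/* Overwrites the oldest slot once full: no per-sample allocation,
   no element shifting, no autoreleased NSValue wrappers. */
static void historyAppend(AccelSample sample)
{
    historyBuffer[historyCount % HISTORY_CAPACITY] = sample;
    historyCount++;
}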
Profile it to see exactly where the problem is. If necessary, comment out a subset of the "math" part at a time. Performance is something people usually guess wrong, even smart, thoughtful, experienced people.
Just out of interest: do you know how the math library's sqrt function is implemented? If it is using an inefficient approximation algorithm, then it might be your culprit. Your best option is to create some sort of test harness that can get an average performance figure for each of the instructions that you are using.
Another question: does increasing or reducing the precision of the operations (i.e. by using double-precision values rather than singles) change the performance in any way?
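As a rough sketch of such a harness, which also probes the single- versus double-precision question (the iteration count is arbitrary; the volatile accumulators keep the compiler from discarding the loops):
#include <math.h>
#include <stdio.h>
#include <time.h>

int main(void)
{
    enum { N = 1000000 };
    volatile float fsum = 0.0f;   /* volatile so the work isn't optimized away */
    clock_t t0 = clock();
    for (int i = 1; i <= N; i++)
        fsum += sqrtf((float)i);
    clock_t t1 = clock();
    printf("sqrtf: %.3fs\n", (double)(t1 - t0) / CLOCKS_PER_SEC);

    volatile double dsum = 0.0;
    t0 = clock();
    for (int i = 1; i <= N; i++)
        dsum += sqrt((double)i);
    t1 = clock();
    printf("sqrt:  %.3fs\n", (double)(t1 - t0) / CLOCKS_PER_SEC);
    return 0;
}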
As others have said, you should profile to be sure. Having said that, yes, it is quite likely that adding the extra calculations did slow it down.
By default, all code for iPhone is compiled for the Thumb-1 instruction set. Thumb-1 does not have native floating point support, so it ends up calling out to a SOFTWARE floating point implementation. There are 2 ways to handle this:
Compile the code for ARM. The processor in the iPhone can freely intermix Thumb and ARM code, so you can just compile the necessary pieces as ARM. Note that GCC (and by proxy Xcode) cannot compile an individual function as ARM; you will need to isolate all the relevant code into its own compilation unit. It is probably easiest just to set the entire project to compile for ARM to see if it fixes things (uncheck "Build Options" > "Compile for Thumb"). Note that while ARM will speed up floating point, it reduces instruction density, thereby hurting cache efficiency and degrading all of your other code, so try to avoid it where you can.
Compile for Thumb-2. Thumb-2 is an enhanced version of Thumb that adds support for some floating-point operations. It is only available on iPhone 3GS and the new iPod touch, so this may not be an option for you. You can do that by switching your architecture to "Optimized," which will build a fat binary with the current slow version for older devices, and the faster version for ones that support it.
You can also combine both of these options, if that seems like the best choice.
Unless I misunderstand your code, you basically scale your point by some factor.
I think the following code should be equivalent to what you do.
double radius = sqrt(current.x * current.x
+ current.y * current.y
+ current.z * current.z);
double newRadius = radius - 1.0;
double scale = newRadius/radius;
current.x *= scale;
current.y *= scale;
current.z *= scale;
This method will find out what the problem is. The worse your slowdown is, the quicker it will find it. Guesses are things that you suspect but don't know, such as thinking the math is the problem. Guesses are usually wrong, at least to begin with. If you are right, the samples will show you. If you are wrong, they will show you what is right. It never misses.
My guess is since you're using autoreleased memory (for NSValue) every 0.03 seconds you're probably not giving the pool much time to release itself. I could be wrong - profiling is the only way to tell.
Try manually allocating and releasing the NSValue and see if it makes a difference.
Assume that you have chosen the most efficient algorithm for solving a problem where performance is the first priority, and now that you're implementing it you have to decide about details like this:
v[i*3+0], v[i*3+1] and v[i*3+2] contain the components of the velocity of particle i and we want to calculate the total kinetic energy. Given that all particles are of the same mass, one may write:
inline double sqr(double x)
{
return x*x;
}
double get_kinetic_energy(double v[], int n)
{
double sum = 0.0;
for (int i=0; i < n; i++)
sum += sqr(v[i*3+0]) + sqr(v[i*3+1]) + sqr(v[i*3+2]);
return 0.5 * mass * sum;
}
To reduce the number of multiplications, it can be written as:
double get_kinetic_energy(double v[], int n)
{
double sum = 0.0;
for (int i=0; i < n; i++)
{
double *w = v + i*3;
sum += sqr(w[0]) + sqr(w[1]) + sqr(w[2]);
}
return 0.5 * mass * sum;
}
(one may write a function with even fewer multiplications, but that's not the point of this question)
Now my question is: Since many C compilers can do this kind of optimizations automatically, where should the developer rely on the compiler and where should she/he try to do some optimization manually?
where should the developer rely on the compiler and where should she/he try to do some optimization manually?
Do I have fairly in-depth knowledge of the target hardware as well as how C code translates to assembler? If no, forget about manual optimizations.
Are there any obvious bottlenecks in this code - how do I know that it needs optimization in the first place? Obvious culprits are I/O, complex loops, busy-wait loops, naive algorithms etc.
When I found this bottleneck, how exactly did I benchmark it and am I certain that the problem doesn't lie in the benchmarking method itself? Experience from SO shows that some 9 out of 10 strange performance questions can be explained by incorrect benchmarking. Including: benchmarking with compiler optimizations disabled...
From there on you can start looking at system-specific things as well as the algorithms themselves - there's far too many things to look at to cover in an SO answer. It's a huge difference between optimizing code for a low-end microcontroller and a 64-bit desktop PC (and everything in between).
One thing that looks a bit like premature optimization, but could just be unawareness of what the language offers, is that you have all of the information describing each particle flattened into an array of double values.
I would suggest instead that you break this down, making your code easier to read by creating a struct to hold the three datapoints on each particle. At that point you can create functions which take a single particle or multiple particles and do computations on them.
This will be much easier for you than having to pass three times the number of particles arguments to functions, or trying to "slice" the array. If it's easier for you to reason about, you're less likely to generate warnings/errors.
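A sketch of what that might look like; the struct and helper names here are made up for illustration:
typedef struct {
    double vx, vy, vz;  /* velocity components of one particle */
} Particle;

static inline double speed_squared(const Particle *p)
{
    return p->vx * p->vx + p->vy * p->vy + p->vz * p->vz;
}

double get_kinetic_energy(const Particle particles[], int n, double mass)
{
    double sum = 0.0;
    for (int i = 0; i < n; i++)
        sum += speed_squared(&particles[i]);
    return 0.5 * mass * sum;
}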
Looking at how both gcc and clang handle your code, the micro-optimisation you contemplate is futile. The compilers already apply standard common-subexpression-elimination techniques that remove the overhead you are trying to eliminate.
As a matter of fact, the code generated handles 2 components at a time using XMM registers.
If performance is a must, then here are steps that will save the day:
the real judge is the wall clock. Write a benchmark with realistic data and measure performance until you get consistent results.
if you have a profiler, use it to determine where the bottlenecks are, if any. Changing the algorithm for the parts that hog performance is an effective approach.
try and get the best from the compiler: study the optimization options and let the compiler use more aggressive techniques if they are appropriate for the target system. For example, -mavx512f -mavx512cd let gcc generate code that handles 8 components at a time using the 512-bit ZMM registers.
This is a non-intrusive technique, as the code does not change, so you don't risk introducing new bugs by hand-optimizing the code.
Optimisation is a difficult art. In my experience, simplifying the code gets better results and far fewer bugs than adding extra subtle stuff to try and improve performance at the cost of readability and correctness.
Looking at the code, an obvious simplification seems to generate the same results and might facilitate the optimizer's job (but again, let the wall clock be the judge):
double get_kinetic_energy(const double v[], int n, double mass)
{
double sum = 0.0;
for (int i = 0; i < 3 * n; i++)
sum += v[i] * v[i];
return 0.5 * mass * sum;
}
Compilers like clang and gcc are simultaneously far more capable and far less capable than a lot of people give them credit for.
They have an exceptionally wide range of patterns where they can transform code into an alternative form which is likely to be more efficient and still behave as required.
Neither, however, is especially good at judging when optimizations will be useful. Both are prone to making some "optimization" decisions that are almost comically absurd.
For example, given
void test(char *p)
{
char *e = p+5;
do
{
p[0] = p[1];
p++;
}while(p < e);
}
when targeting the Cortex-M0 with an optimization level below -O2, gcc 10.2.1 will generate code equivalent to calling memmove(p, p+1, 5);. While it would be theoretically possible that a library implementation of memmove might optimize the n==5 case in such a way as to outperform the five-instruction byte-based loop generated at -Og (or even -O0), it seems far more likely that any plausible implementation would spend some time analyzing what needs to be done, and then after doing that spend just as long executing the loop as code generated with -O0 would.
What happens, in essence, is that gcc analyzes the loop, figures out what it's trying to do, and then uses its own recipe to perform that action in a manner that may or may not be any better than what the programmer was trying to do in the first place.
I'm trying to optimize some C code, and it's my first time.
As a first step I dumped my executable file in order to see the assembler code.
For example for this function:
void init_twiddle(int N)
{
int i=0;
for(i=0; i<ELEMENTS_HALF; i++)
{
twiddle_table[i].re = (float) cos((float)i * 2.0 * PI / (float)N);
twiddle_table[i].im = (float) - sin((float)i * 2.0 * PI / (float)N);
}
}
wouldn't it be better if I did this instead:
void init_twiddle(int N)
{
int i=0;
float puls = 2.0 * PI / (float)N;
for(i=0; i<ELEMENTS_HALF; i++)
{
twiddle_table[i].re = (float) cos((float)i * puls);
twiddle_table[i].im = (float) - sin((float)i * puls);
}
}
in order to avoid the multiply and divide operations being repeated thousands of times?
Unfortunately, your first step was already kind of wrong.
Don't blindly walk through your code optimizing arbitrary loops which might or (more probably) might not affect performance (maybe because that code is so rarely called that it doesn't really use any run-time).
Optimizing means: you first need to find out where the time is spent in your program. Use timing measurements to narrow down where your program spends most of its time (you can use homegrown logging with microsecond timers, or a profiling application). Without at least some figures you won't even see where the compiler has already helped you and maybe has already maxed out all possibilities, even if your code looks like it has some potential left for being faster (modern compilers are really very good at that).
Only if you know the hot spots in your application you should start optimizing those.
The problem is that it is a floating-point expression, and floating-point operations are not associative. So the optimization is invalid in general for any compiler that follows IEEE 754. Either you do this optimization manually, or you tell the compiler to treat floating-point math as associative for optimization purposes (in gcc and clang you use -ffast-math for this). This will introduce slight changes in the resulting values.
For comparison, inspect the assembly generated without -ffast-math and then with -ffast-math.
You can do this much faster; indeed you need only 1 sine and 1 cosine (which are disastrously slow). What you're actually doing is calculating the coordinates of a little vector that you spin around the origin; the alternative is to actually spin that vector around the origin, one step at a time. The rotation matrix for a single step is what costs the single sine and cosine.
Of course this may be a bit less accurate, but no significant trouble should build up over a reasonable number of steps.
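A sketch of that incremental-rotation idea applied to the twiddle table from the question (same PI, ELEMENTS_HALF and twiddle_table as above; rounding error grows slowly with the number of steps):
void init_twiddle(int N)
{
    double step = 2.0 * PI / (double)N;
    double c = cos(step), s = sin(step);  /* the only two trig calls */
    double re = 1.0, im = 0.0;            /* cos(0), sin(0) */
    for (int i = 0; i < ELEMENTS_HALF; i++)
    {
        twiddle_table[i].re = (float)re;
        twiddle_table[i].im = (float)-im;
        /* rotate (re, im) by one step using the 2-D rotation matrix */
        double t = re * c - im * s;
        im = re * s + im * c;
        re = t;
    }
}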
Premature optimization is the root of all evil.
– Donald Knuth
You should optimize if you have a problem with execution time. There are tools that record the duration of every single statement, or at least of every function call.
I think most compilers detect such constant expressions in a loop, and then there is nothing left to optimize, because it is already optimized.
First of all, use double, not float. In C, library routines are all in double, so you're just doing a lot of converting.
Second, calculate the angle once and put it in a variable, not twice.
Maybe the compiler recognizes that it can do this for you, but I prefer not to count on it.
Third, is there a sincos function? The sine and cosine functions are closely related, so one can be calculated at the same time as the other (a sketch follows this list).
Fourth, when thinking about performance, switch your brain to thinking in percent of total time, not doing something "thousands of times". That way, you will concentrate on what has the greatest overall benefit, not things that might well be irrelevant.
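For instance, glibc offers a non-standard sincos() (exposed with _GNU_SOURCE) that returns both values from one call. A sketch combining the second and third suggestions, reusing ELEMENTS_HALF and twiddle_table from the question:
#define _GNU_SOURCE  /* for glibc's non-standard sincos() */
#include <math.h>

void init_twiddle(int N)
{
    double step = 2.0 * M_PI / (double)N;  /* the angle factor, computed once */
    for (int i = 0; i < ELEMENTS_HALF; i++)
    {
        double s, c;
        sincos(i * step, &s, &c);  /* one call yields both sine and cosine */
        twiddle_table[i].re = (float)c;
        twiddle_table[i].im = (float)-s;
    }
}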
This probably won't change your code's performance, since this is a standard loop-invariant optimization that any standard compiler performs (assuming optimizations aren't turned off).
Suppose I have a very small float a (for instance a=0.5) that enters the following expression:
6000.f * a * a;
Does the order of the operands make any difference? Is it better to write
6000.f * (a*a);
Or even
float result = a*a;
result *= 6000.f;
I've checked the classic What Every Computer Scientist Should Know About Floating-Point Arithmetic but couldn't find anything.
Is there an optimal way to order operands in a floating point operation?
It really depends on the values and your goals. For instance if a is very small, a*a might be zero, whereas 6000.0*a*a (which means (6000.0*a)*a) could still be nonzero. For avoiding overflow and underflow, the general rule is to apply the associative law to first perform multiplications where the operands' logs have opposite sign, which means squaring first is generally the worst strategy. On the other hand, for performance reasons, squaring first might be a very good strategy if you can reuse the value of the square. You may encounter yet another issue, which could matter more for correctness than overflow/underflow if your numbers will never be very close to zero or infinity: certain multiplications are guaranteed to have exact answers, while others involve rounding. In general you'll get the most accurate results by minimizing the number of rounding steps.
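A tiny demonstration of the ordering point; the constant is picked so that a*a underflows single precision to zero while (6000.f*a)*a is still representable as a subnormal:
#include <stdio.h>

int main(void)
{
    float a = 1e-23f;
    printf("a*a        = %g\n", (double)(a * a));           /* prints 0 */
    printf("6000.f*a*a = %g\n", (double)(6000.f * a * a));  /* ~6e-43 */
    return 0;
}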
The optimal way depends on the purpose, really.
First of all, multiplication is faster than division.
So if you have to write a = a / 2;, it is better to write a = a * 0.5f;.
Your compiler is usually smart enough to replace division with multiplication by a constant if the result is the same, but it will not do that with variables, of course.
Sometimes, you can optimize a bit by replacing divisions with multiplications, but there may be problems with precision.
Some other operations may be faster but less precise.
Let's take an example.
float f = (a * 100000) / (b * 10);
float g = (a / b) * (100000 / 10);
These are mathematically equivalent, but the results can differ slightly.
The first uses two multiplications and one division; the second uses one division and one multiplication. In both cases there may be a loss of precision, depending on the sizes of a and b: for small values the first works better, for large values the second does.
Then, if you have several constants and you want speed, group the constants together. Instead of
a = 6.3f * a * 2.0f * 3.1f;
Just write
a = a * (6.3f * 2.0f * 3.1f);
Some compilers optimize well, some others less, but in either case there is no risk in keeping all constants together.
Having said that, we could talk for hours about how processors work.
Even processors in the same family, like Intel's, work differently between generations!
Some compilers use SSE instructions, some don't.
Some processors support SSE2, some SSE, some only MMX... and some systems don't have an FPU at all!
Each system does some calculations better than others; finding a common rule is hard.
You should just write readable code, clean and simple, without worrying too much about these unpredictable, very low-level optimizations.
If your expression looks complicated, do some algebra and/or go to the wolframalpha search engine and ask it to simplify it for you :)
That said, you don't really need to declare one variable and overwrite its content over and over; compilers usually optimize less well in this situation.
a = 5 + b;
a /= 2 * c;
a += 2 - c;
a *= 7;
just write your expression avoiding this mess :)
a = ((5 + b) / (2 * c) + 2 - c) * 7;
About your specific example, 6000.f * a * a: just write it as you wrote it, no need to change it; it is fine as it is.
Not typically, no.
That being said, if you're doing multiple operations with large values, it may make sense to order them in a way that avoids overflows or reduces precision errors, based on their precedence and associativity, if the algorithm provides a way to make that obvious. This would, however, require advance knowledge of the values involved, and not just be based on the syntax.
There are indeed algorithms to minimize cumulative error in a sequence of floating-point operations. One such is http://en.wikipedia.org/wiki/Kahan_summation_algorithm. Others exist for other operations: http://www.cs.cmu.edu/~quake-papers/related/Priest.ps.
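For reference, a sketch of Kahan (compensated) summation in C; note that flags like -ffast-math allow the compiler to reassociate the arithmetic and silently defeat the compensation:
/* Sums an array while tracking the low-order bits lost at each step,
   so the rounding error stays roughly constant instead of growing with n. */
double kahan_sum(const double x[], int n)
{
    double sum = 0.0, c = 0.0;  /* c holds the running compensation */
    for (int i = 0; i < n; i++)
    {
        double y = x[i] - c;  /* subtract previously lost bits      */
        double t = sum + y;   /* big + small: low bits of y are lost */
        c = (t - sum) - y;    /* recover exactly what was lost      */
        sum = t;
    }
    return sum;
}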
Are the old tricks (lookup table, approx functions) for creating faster implementations of sqrt() still useful, or is the default implementation as fast as it is going to get with modern compilers and hardware?
Rule 1: profile before optimizing
Before investing any effort in the belief that you can beat the optimizer, you must profile everything and discover where the bottleneck really lies. In general, it is unlikely that sqrt() itself is your bottleneck.
Rule 2: replace the algorithm before replacing a standard function
Even if sqrt() is the bottleneck, then it is still reasonably likely that there are algorithmic approaches (such as sorting distances by length squared which is easily computed without a call to any math function) that can eliminate the need to call sqrt() in the first place.
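For example, when distances are only being compared, the sqrt() drops out entirely because squaring is monotonic for non-negative values (a sketch):
/* d1 < d2 exactly when d1*d1 < d2*d2, since distances are non-negative,
   so nearest-point searches never need a square root. */
static double dist_squared(const double a[3], const double b[3])
{
    double dx = a[0] - b[0];
    double dy = a[1] - b[1];
    double dz = a[2] - b[2];
    return dx * dx + dy * dy + dz * dz;
}

/* usage: if (dist_squared(p, q) < dist_squared(p, r)) ... */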
What the compiler does for you if you do nothing else
Many modern C compilers are willing to inline CRT functions at higher optimization levels, making the natural expression including calls to sqrt() as fast as it needs to be.
In particular, I checked MinGW gcc v3.4.5 and it replaced a call to sqrt() with inline code that shuffled the FPU state and at the core used the FSQRT instruction. Thanks to the way that the C standard interacts with IEEE 754 floating point, it did have to follow the FSQRT with some code to check for exceptional conditions and a call to the real sqrt() function from the runtime library so that floating point exceptions can be handled by the library as required by the standard.
With sqrt() inline and used in the context of a larger all-double expression, the result is as efficient as possible given the constraints of standards compliance and preservation of full precision.
For this (very common) combination of compiler and target platform and given no knowledge of the use case, this result is pretty good, and the code is clear and maintainable.
In practice, any tricks will make the code less clear, and likely less maintainable. After all, would you rather maintain (-b + sqrt(b*b - 4.*a*c)) / (2*a) or an opaque block of inline assembly and tables?
Also, in practice, you can generally count on the compiler and library authors to take good advantage of your platform's capabilities, and usually to know more than you do about the subtleties of optimizations.
However, on rare occasions, it is possible to do better.
One such occasion is in calculations where you know how much precision you really need, and also know that you aren't depending on the C standard's floating-point exception handling and can get along with what the hardware platform supplies instead.
Edit: I rearranged the text a bit to put emphasis on profiling and algorithms as suggested by Jonathan Leffler in comments. Thanks, Jonathan.
Edit2: Fixed precedence typo in the quadratic example spotted by kmm's sharp eyes.
Sqrt is basically unchanged on most systems. It's a relatively slow operation, but the total system speeds have improved, so it may not be worth trying to use "tricks".
The decision to optimize it with approximations for the (minor) gains this can achieve are really up to you. Modern hardware has eliminated some of the need for these types of sacrifices (speed vs. precision), but in certain situations, this is still valuable.
I'd use profiling to determine whether this is "still useful".
If you have proven that the call to sqrt() in your code is a bottleneck with a profiler then it may be worth trying to create an optimizated version. Otherwise it's a waste of time.
This probably is the fastest method of computing the square root:
float fastsqrt(float val) {
union
{
int tmp;
float val;
} u;
u.val = val;
u.tmp -= 1<<23; /* Remove last bit so 1.0 gives 1.0 */
/* tmp is now an approximation to logbase2(val) */
u.tmp >>= 1; /* divide by 2 */
u.tmp += 1<<29; /* add 64 to exponent: (e+127)/2 =(e/2)+63, */
/* that represents (e/2)-64 but we want e/2 */
return u.val;
}
wikipedia article
This probably is the fastest method of computing the inverse square root, with at most 0.00175228 error.
float InvSqrt (float x)
{
float xhalf = 0.5f*x;
int i = *(int*)&x;
i = 0x5f3759df - (i>>1);
x = *(float*)&i;
return x*(1.5f - xhalf*x*x);
}
This is (very roughly) about 4 times faster than (float)(1.0/sqrt(x))
wikipedia article
It is generally safe to assume that the standard library developers are quite clever, and have written performant code. You're unlikely to be able to match them in general.
So the question becomes: do you know something that'll let you do a better job? I'm not asking about special algorithms for computing the square root (the standard library developers know of these too, and if they were worthwhile in general, they'd have used them already), but do you have any specific information about your use case that changes the situation?
Do you only need limited precision? If so, you can speed it up compared to the standard library version, which has to be accurate.
Or do you know that your application will always run on a specific type of CPU? Then you can look at how efficient that CPU's sqrt instruction is, and see if there are better alternatives. Of course, the downside to this is that if I run your app on another CPU, your code might turn out slower than the standard sqrt().
Can you make assumptions in your code, that the standard library developers couldn't?
You're unlikely to be able to come up with a better solution to the problem "implement an efficient replacement for the standard library sqrt".
But you might be able to come up with a solution to the problem "implement an efficient square root function for this specific situation".
Why not? You'll probably learn a lot!
I find it very hard to believe that the sqrt function is your application's bottleneck, because of the way modern computers are designed. Assuming this isn't in reference to some crazy low-end processor, you take a tremendous speed hit to access memory outside of your CPU caches, so unless your algorithm is doing math on very few numbers (enough that they all basically fit within the L1 and L2 caches) you're not going to notice any speedup from optimizing your arithmetic.
I still find it useful even now, though this is the context of normalizing a million+ vectors every frame in response to deforming meshes.
That said, I'm generally not creating my own optimizations but relying on a crude approximation of inverse square root provided as a SIMD instruction: rsqrtps. That is still really useful in speeding up some real-world cases if you're willing to sacrifice precision for speed. Using rsqrtps can actually reduce the entirety of the operation which includes deforming and normalizing vertex normals to almost half the time, but at the cost of the precision of the results (that said, in ways that can barely be noticed by the human eye).
I've also found the fast inverse sqrt, often incorrectly credited to John Carmack, to still improve performance in scalar cases, though I don't use it much nowadays. It's generally natural to get some speed boost if you're willing to sacrifice accuracy. That said, I wouldn't even attempt to beat C's sqrt if you aren't trying to sacrifice precision for speed.
You generally have to sacrifice the generality of the solution (like its precision) if you want to beat standard implementations, and that tends to apply whether it's a mathematical function or, say, malloc. I can easily beat malloc with a narrowly-applicable free list lacking thread-safety that's suitable for very specific contexts. It's another thing to beat it with a general-purpose allocator which can allocate variable-sized chunks of memory and free any one of them at any given time.
I wish to make a cosine table at compile time. Is there a way to do this without hard coding anything?
Why not hardcode it? I am not aware of any changes in the result of the cosine function that they are planning, well not for another 100 years or so.
I am not convinced that precalculating a sine table would result in a performance improvement. I suggest:
Benchmark your application calling fcos() to decide whether it's fast enough. If it is, stop here.
If it really is too slow, consider using -ffast-math if it is acceptable for your usage.
Lookup tables, particularly large ones, will increase the size of your program that needs to be held in the CPU cache, which reduces its hit rate. This in turn will slow other parts of your application down.
I am assuming you're doing this in an incredibly tight loop, as that's the only case where it could possibly matter anyway.
If you actually DID discover that using a lookup table was beneficial, why not just precalculate it at runtime? It's going to have hardly any impact on startup time (unless it's a huuuuuge one). It may actually be faster to do it at runtime, because your CPU may be able to do sines faster than your disc can load floats in.
With C++, you can use template metaprogramming to generate your lookup table at compile time.
Now, here is a standard C trick that may or may not accomplish what you want.
Write a program (say, cosgen) that generates the C statements for the cosine table (i.e., the code that you desire).
Run cosgen and dump the output (c code) to a file, say cos_table.c
In your main program, use a #include "cos_table.c" to insert the table where you want.
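The generated cos_table.c might look something like this (a hypothetical excerpt; the array name and step size are whatever cosgen chooses to emit):
/* cos_table.c -- machine-generated by cosgen; do not edit by hand. */
static const double cos_table[] = {
    1.00000000, /* cos(0 deg) */
    0.99984770, /* cos(1 deg) */
    0.99939083, /* cos(2 deg) */
    0.99862953, /* cos(3 deg) */
    /* ... */
};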
You could generate it with any scripting language you liked and include the result. Use make to have the scripting language do its thing anytime you change the source. It's hard coded to C but not to you, really.
With the magic of computers, the apparently impossible becomes possible:
#include <stdio.h>
#include <math.h>
#define MAX_ANGLE 90
double kinopiko_krazy_kosines[MAX_ANGLE + 1]; /* 0 through 90 inclusive */
int main ()
{
int i;
for (i = 0; i <= MAX_ANGLE; i++) {
double angle = (M_PI * i) / (2.0 * MAX_ANGLE);
kinopiko_krazy_kosines[i] = cos (angle);
printf ("#define cos_%d %f\n", i, kinopiko_krazy_kosines[i]);
}
}
http://codepad.org/G6JTATne
Since you're targeting Cell, you're probably targeting the SPEs? They do have proper FP support, vectorised in fact, but they do not have large working memories. For that reason it's in fact a BAD IDEA to use tables: you're sacrificing a very limited resource.
I'd create a hard-coded lookup table - once with a scripting language - but I'm not sure it'll be faster than just using the standard math library.
I guess it depends on the size of the table, but I would suspect getting the FPU to do the calculation might be faster than accessing memory. So once you've got your table solution, I'd benchmark it to see if it's faster than the standard function.
Wave tables are the way to go. You can hard code it as suggested, or run it during application start up.