Comparing execution time of two functions in C

I wrote two functions which do the same thing but use different algorithms. I compare their execution time using clock() from time.h, but the results are inconsistent.
I have tried changing the order in which the functions are executed, but whichever function runs first always seems to have the longer running time.
#include <stdio.h>
#include <time.h>
long exponent(int a, int b);
long exponentFast(int a, int b);
int main(void)
{
    int base;
    int power;
    clock_t begin;
    clock_t end;
    double time_spent;

    base = 2;
    power = 25;

    begin = clock();
    long result1 = exponentFast(base, power);
    end = clock();
    time_spent = (double)(end - begin) / CLOCKS_PER_SEC;
    printf("Result1: %li, Time: %.9f\n", result1, time_spent);

    begin = clock();
    long result2 = exponent(base, power);
    end = clock();
    time_spent = (double)(end - begin) / CLOCKS_PER_SEC;
    printf("Result2: %li, Time: %.9f\n", result2, time_spent);
}
long exponent(int a, int b)
{
...
}
long exponentFast(int a, int b)
{
...
}
I expect time_spent for result1 to have a lower value than that for result2, but the output is
Result1: 33554432, Time: 0.000002000
Result2: 33554432, Time: 0.000001000
Executing exponent() before exponentFast() also yields the same result, suggesting that my benchmarking approach is wrong.

It can be surprisingly difficult to perform timings on function calls like these accurately and significantly. Here's a modification of your program that illustrates the difficulties:
#include <stdio.h>
#include <time.h>
#include <math.h>
long exponent(int a, int b);
long exponentFast(int a, int b);
void tester(long (*)(int, int));
#define NTRIALS 1000000000
int main(void)
{
    clock_t begin;
    clock_t end;
    double time_spent;

    begin = clock();
    tester(exponentFast);
    end = clock();
    time_spent = (double)(end - begin) / CLOCKS_PER_SEC;
    printf("exponentFast: Time: %.9f = %.10f/call\n", time_spent, time_spent / NTRIALS);

    begin = clock();
    tester(exponent);
    end = clock();
    time_spent = (double)(end - begin) / CLOCKS_PER_SEC;
    printf("exponent: Time: %.9f = %.10f/call\n", time_spent, time_spent / NTRIALS);
}

void tester(long (*func)(int, int))
{
    int base = 2;
    int power = 25;
    int i;
    unsigned long accum = 0;

    for (i = 0; i < NTRIALS; i++) {
        accum += (*func)(base, power);
        base = (base + 1) % 5;
        power = (power + 7) % 16;
    }
    printf("(accum = %lu)\n", accum);
}

long exponent(int a, int b)
{
    return pow(a, b);
}

long exponentFast(int a, int b)
{
    long ret = 1;
    int i;
    for (i = 0; i < b; i++)
        ret *= a;
    return ret;
}
You will notice that:
I've arranged to perform multiple trials, which involved adding a new function tester() to do this. (tester() therefore takes a pointer to the function being tested, which is a technique you may not have been familiar with yet.)
I've arranged to vary the arguments to the function under test, between calls.
I've arranged to do something with the return values of the functions under test, namely add them all up.
(The second and third bullets follow suggestions by Jonathan Leffler, and are intended to ensure that a too-clever compiler doesn't optimize out some or all of the interesting work.)
When I ran this on my computer (an ordinary consumer-grade laptop), these were the results I got:
(accum = 18165558496053920)
exponentFast: Time: 20.954286000 = 0.0000000210/call
(accum = 18165558496053920)
exponent: Time: 23.409001000 = 0.0000000234/call
There are two huge things to notice here.
I ran each function one billion times. That's right, a billion, a thousand million. And yet this only took about 20 seconds.
Even with that many trials, there's still almost no visible difference between the "regular" and "fast" versions. On average, one took 21 nanoseconds (nanoseconds!); the other took 23 nanoseconds. Big whoop.
(Actually, however, this first trial was significantly misleading. More on that in a bit.)
Before continuing, it's worth asking a couple of questions.
Do these results even make sense? Is it actually possible for these functions to be taking mere tens of nanoseconds to do their work?
Presuming they're accurate, do these results tell us that we're wasting our time, that the difference between the "regular" and "fast" versions is so slight as to not be worth the effort to write and debug the "fast" one?
To the first point, let's do a quick back-of-the-envelope analysis. My machine claims to have a 2.2 GHz CPU. That means (roughly speaking) that it can do 2.2 billion things per second, or about 0.45 nanoseconds per thing. So a function that's taking 21 nanoseconds can be doing roughly 21 ÷ 0.45 = 46 things. And since my sample exponentFast function is doing roughly as many multiplications as the value of the exponent, it looks like we're probably in the right ballpark.
The other thing I did to confirm that I was getting at least quasi-reasonable results was to vary the number of trials. With NTRIALS reduced to 100000000, the overall program took just about exactly a tenth of the time to run, meaning that the time per call was consistent.
Now, to point 2, I still remember one of my formative experiences as a programmer, when I wrote a new and improved version of a standard function which I just knew was going to be gobs faster, and after several hours spent debugging it to get it to work at all, I discovered that it wasn't measurably faster, until I increased the number of trials up into the millions, and the penny (as they say) dropped.
But as I said, the results I've presented so far were, by a funny coincidence, misleading. When I first threw together some simple code to vary the arguments given to the function calls, as shown above, I had:
int base = 2;
int power = 25;
and then, within the loop
base = (base + 1) % 5;
power = (power + 7) % 16;
This was intended to allow base to run from 0 to 4, and power from 0 to 15, with the numbers chosen to ensure that the result wouldn't overflow even when base was 4. But this means that power was, on average, only 8, meaning that my simpleminded exponentFast call was only having to make 8 trips through its loop, on average, not 25 as in your original post.
When I changed the iteration step to
power = 25 + (power - 25 + 1) % 5;
-- that is, not varying base (and therefore allowing it to remain as the constant 2) and allowing power to vary between 25 and 30, now the time per call for exponentFast rose to about 63 nanoseconds. The good news is that this makes sense (roughly three times as many iterations, on average, made it about three times slower), but the bad news is that it looks like my "exponentFast" function isn't very fast! (Obviously, though, I didn't expect it to be, with its simpleminded, brute-force loop. If I wanted to make it faster, the first thing I'd do, at little additional cost in complexity, would be to apply "binary exponentiation".)
There's at least one more thing to worry about, though, which is that if we call these functions one billion times, we're not only counting one billion times the amount of time each function takes to do its work, but also one billion times the function call overhead. If the function call overhead is on a par with the amount of work the function is doing, we will (a) have a hard time measuring the actual work time, but also (b) have a hard time speeding things up! (We could get rid of the function call overhead by inlining the functions for our test, but that obviously wouldn't be meaningful if the actual use of the functions in the end program were going to involve real function calls.)
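One rough way to gauge that overhead (a sketch of my own, not part of the harness above) is to time a function that does essentially nothing through the same kind of loop; whatever per-call figure that reports is mostly call overhead plus loop bookkeeping, and can be mentally subtracted from the other measurements:

#include <stdio.h>
#include <time.h>

#define NTRIALS 1000000000

/* A deliberately trivial function with the same signature as the ones under
   test; timing it estimates call overhead rather than exponentiation work. */
long noop(int a, int b)
{
    return a + b;
}

int main(void)
{
    long (*func)(int, int) = noop;   /* call through a pointer, as tester() does,
                                        to discourage inlining */
    int base = 2, power = 25;
    unsigned long accum = 0;
    long i;
    clock_t begin, end;

    begin = clock();
    for (i = 0; i < NTRIALS; i++) {
        accum += func(base, power);
        base = (base + 1) % 5;
        power = (power + 7) % 16;
    }
    end = clock();

    printf("(accum = %lu)\n", accum);
    printf("noop: %.10f sec/call\n",
           (double)(end - begin) / CLOCKS_PER_SEC / NTRIALS);
}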
And then yet one more inaccuracy is that we're introducing a timing artifact by computing new and different values of base and/or power for each call, and adding up all the results, so that the amortized time to do that work goes into what we've been calling "time per call". (This problem, at least, since it affects either exponentiation function equally, won't detract from our ability to assess which one, if either, is faster.)
Addendum: Since my initial version of exponentFast really was pretty embarrassingly simpleminded, and since binary exponentiation is so easy and elegant, I performed one more test, rewriting exponentFast as
long exponentFast(int a, int b)
{
    long ret = 1;
    long fac = a;
    while (1) {
        if (b & 1) ret *= fac;
        b >>= 1;
        if (b == 0) break;
        fac *= fac;
    }
    return ret;
}
Now -- Hooray! -- the average call to exponentFast goes down to about 16 ns per call on my machine. But the "Hooray!" is qualified. It's evidently about 25% faster than calling pow(), and that's nicely significant, but not an order of magnitude or anything. If the program where I'm using this is spending all its time exponentiating, I'll have made that program 25% faster, too, but if not, the improvement will be less. And there are cases where the improvement (the time saved over all anticipated runs of the program) will be less than the time I spent writing and testing my own version. And I haven't yet spent any time doing proper regression tests on my improved exponentFast function, but if this were anything other than a Stack Overflow post, I'd have to. It's got several sets of edge cases, and might well contain lurking bugs.
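For what it's worth, here is the kind of quick sanity check I have in mind (only a sketch, not a real regression suite): compare the binary-exponentiation version against the simple loop over a range of small inputs, deliberately staying inside ranges where long cannot overflow even if it is only 32 bits:

#include <stdio.h>

/* simple brute-force loop, as in the original exponentFast */
long exponent_loop(int a, int b)
{
    long ret = 1;
    for (int i = 0; i < b; i++)
        ret *= a;
    return ret;
}

/* binary exponentiation, as in the addendum above */
long exponent_binary(int a, int b)
{
    long ret = 1;
    long fac = a;
    while (1) {
        if (b & 1) ret *= fac;
        b >>= 1;
        if (b == 0) break;
        fac *= fac;
    }
    return ret;
}

int main(void)
{
    int failures = 0;
    /* small ranges: 4 to the 15th is 2^30, which fits even a 32-bit long */
    for (int a = 0; a <= 4; a++)
        for (int b = 0; b <= 15; b++)
            if (exponent_loop(a, b) != exponent_binary(a, b)) {
                printf("mismatch: a=%d b=%d\n", a, b);
                failures++;
            }
    if (failures)
        printf("%d mismatches\n", failures);
    else
        printf("all cases agree\n");
    return failures != 0;
}

(Negative bases and negative exponents are exactly the kind of edge cases this sketch does not cover.)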

Related

Endless sine generation in C

I am working on a project which incorporates computing a sine wave as input for a control loop.
The sine wave has a frequency of 280 Hz, the control loop runs every 30 µs, and everything is written in C for an Arm Cortex-M7.
At the moment we are simply doing:
double time;
void control_loop() {
    time += 30e-6;
    double sine = sin(2 * M_PI * 280 * time);
    ...
}
Two problems/questions arise:
When running for a long time, time grows large. Suddenly there is a point where the computation time for the sine function increases drastically (see image). Why is this? How are these functions usually implemented? Is there a way to circumvent this (without noticeable precision loss), since speed is a huge factor for us? We are using sin from math.h (Arm GCC).
How can I deal with time in general? When running for a long time, the variable time will inevitably reach the limits of double precision. Even using a counter, time = counter++ * 30e-6;, only improves this; it does not solve it. As I am certainly not the first person who wants to generate a sine wave for a long time, there must be some ideas/papers/... on how to implement this fast and precisely.
Instead of calculating sine as a function of time, maintain a sine/cosine pair and advance it through complex number multiplication. This doesn't require any trigonometric functions or lookup tables; only four multiplies and an occasional re-normalization:
static const double a = 2 * M_PI * 280 * 30e-6;
static const double dx = cos(a);
static const double dy = sin(a);

double x = 1, y = 0;   // complex x + iy
int counter = 0;

void control_loop() {
    double xx = dx*x - dy*y;
    double yy = dx*y + dy*x;
    x = xx, y = yy;

    // renormalize once in a while, based on
    // https://www.gamedev.net/forums/topic.asp?topic_id=278849
    if((counter++ & 0xff) == 0) {
        double d = 1 - (x*x + y*y - 1)/2;
        x *= d, y *= d;
    }

    double sine = y;   // this is your sine
}
The frequency can be adjusted, if needed, by recomputing dx, dy.
Additionally, all the operations here can be done, rather easily, in fixed point.
Rationality
As #user3386109 points out below (+1), 280 * 30e-6 = 21 / 2500 is a rational number, so the sine should loop around after exactly 2500 samples. We can combine this method with theirs by resetting our generator (x=1, y=0) every 2500 iterations (or 5000, or 10000, etc...). This would eliminate the need for renormalization, as well as get rid of any long-term phase inaccuracies.
(Technically any floating point number is a dyadic rational. However 280 * 30e-6 doesn't have an exact representation in binary. Yet, by resetting the generator as suggested, we'll get an exactly periodic sine as intended.)
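A sketch of how that reset might be folded into the loop above (same variable names as the code above; dx and dy are set up in an init function rather than as static const initializers, to stay within plain C, and the 2500-sample period is the one derived in the next answer):

#include <math.h>

#define PERIOD 2500                 /* 2500 samples of 30 us = exactly 21 cycles of 280 Hz */

static double dx, dy;               /* cos(a), sin(a): rotation per sample */
static double x = 1, y = 0;         /* current point on the unit circle */
static int counter = 0;

void init_generator(void)           /* call once at start-up */
{
    double a = 2 * M_PI * 280 * 30e-6;
    dx = cos(a);
    dy = sin(a);
}

void control_loop(void)
{
    double xx = dx * x - dy * y;
    double yy = dx * y + dy * x;
    x = xx, y = yy;

    if (++counter == PERIOD) {      /* one exact period has elapsed: */
        x = 1, y = 0;               /* restart from phase 0, so there is no */
        counter = 0;                /* drift and no renormalization needed */
    }

    double sine = y;                /* this is your sine */
    (void)sine;                     /* use it in the control loop */
}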
Explanation
Some people in the comments asked for an explanation of why this works. The simplest explanation is to use the angle sum trigonometric identities:
xx = cos((n+1)*a) = cos(n*a)*cos(a) - sin(n*a)*sin(a) = x*dx - y*dy
yy = sin((n+1)*a) = sin(n*a)*cos(a) + cos(n*a)*sin(a) = y*dx + x*dy
and the correctness follows by induction.
This is essentially De Moivre's formula, if we view those sine/cosine pairs as complex numbers, in accordance with Euler's formula.
A more insightful way might be to look at it geometrically. Complex multiplication by exp(ia) is equivalent to rotation by a radians. Therefore, by repeatedly multiplying by dx + idy = exp(ia), we incrementally rotate our starting point 1 + 0i along the unit circle. The y coordinate, according to Euler's formula again, is the sine of the current phase.
Normalization
While the phase continues to advance with each iteration, the magnitude (aka norm) of x + iy drifts away from 1 due to round-off errors. However, we're interested in generating a sine of amplitude 1, so we need to normalize x + iy to compensate for the numeric drift. The straightforward way is, of course, to divide it by its own norm:
double d = 1/sqrt(x*x + y*y);
x *= d, y *= d;
This requires the calculation of a reciprocal square root. Even though we normalize only once every X iterations, it'd still be nice to avoid it. Fortunately |x + iy| is already close to 1, so we only need a slight correction to keep it at bay. Expanding the expression for d around 1 (a first-order Taylor approximation), we get the formula that's in the code:
d = 1 - (x*x + y*y - 1)/2
TODO: to fully understand the validity of this approximation one needs to prove that it compensates for round-off errors faster than they accumulate -- and thus get a bound on how often it needs to be applied.
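As a quick numerical illustration of that approximation (my own check, not part of the answer above), one can compare the exact correction factor 1/sqrt(x*x + y*y) with the first-order formula for norms that have drifted slightly away from 1:

#include <stdio.h>
#include <math.h>

int main(void)
{
    /* values of s = x*x + y*y slightly off from 1, as round-off drift produces */
    double samples[] = { 0.999999, 0.9999999999, 1.0000000001, 1.000001 };
    for (int i = 0; i < 4; i++) {
        double s      = samples[i];
        double exact  = 1.0 / sqrt(s);         /* exact correction factor */
        double approx = 1.0 - (s - 1.0) / 2;   /* first-order Taylor approximation */
        printf("s = %.10f  exact = %.15f  approx = %.15f  diff = %.3e\n",
               s, exact, approx, exact - approx);
    }
    return 0;
}

The difference between the two is of second order in (s - 1), which is why the cheap formula is good enough as long as it is applied often enough to keep the drift small.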
The function can be rewritten as
double n;
void control_loop() {
    n += 1;
    double sine = sin(2 * M_PI * 280 * 30e-6 * n);
    ...
}
That does exactly the same thing as the code in the question, with exactly the same problems. But it can now be simplified:
280 * 30e-6 = 280 * 30 / 1000000 = 21 / 2500 = 8.4e-3
Which means that when n reaches 2500, you've output exactly 21 cycles of the sine wave. Which means that you can set n back to 0.
The resulting code is:
int n;
void control_loop() {
    n += 1;
    if (n == 2500)
        n = 0;
    double sine = sin(2 * M_PI * 8.4e-3 * n);
    ...
}
As long as your code can run for 21 cycles without problems, it'll run forever without problems.
I'm rather shocked at the existing answers. The first problem you detect is easily solved, and the next problem magically disappears when you solve the first problem.
You need a basic understanding of math to see how it works. Recall that, mathematically, sin(x + 2*pi) is just sin(x). The large increase in computation time you see happens when your sin(float) implementation switches to another algorithm, and you really want to avoid that.
Remember that float has only about 6 significant digits. 100000.0f*M_PI+x uses those 6 digits for 100000.0f*M_PI, so there's nothing left for x.
So the easiest solution is to keep track of x yourself. At t=0 you initialize x to 0.0f. Every 30 µs, you increment x += 2 * M_PI * 280 * 30e-6;. The time does not appear in this formula! Finally, if x > 2*M_PI, you decrement x -= 2*M_PI; (since sin(x) == sin(x - 2*pi)).
You now have an x that stays nicely in the range 0 to 6.2832, where sin is fast and the 6 digits of precision are all useful.
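A minimal sketch of that phase-accumulator idea, written against the question's control_loop() (the variable and macro names are mine; the step is the full argument increment of sin(2*M_PI*280*t) per 30 µs tick):

#include <math.h>

static double phase = 0.0;                               /* current sin() argument, kept in [0, 2*pi) */
static const double PHASE_STEP = 2 * M_PI * 280 * 30e-6; /* radians advanced per 30 us tick */

void control_loop(void)
{
    double sine = sin(phase);     /* argument stays small, so sin() stays on its fast path */
    (void)sine;                   /* use sine here */

    phase += PHASE_STEP;
    if (phase >= 2 * M_PI)        /* wrap: sin(x) == sin(x - 2*pi) */
        phase -= 2 * M_PI;
}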
How to generate a lovely sine.
The DAC is 12 bits, so you have only 4096 levels. It makes no sense to send more than 4096 samples per period. In real life you will need far fewer samples to generate a good-quality waveform.
Create a C file with the lookup table (using your PC). Redirect the output to a file (https://helpdeskgeek.com/how-to/redirect-output-from-command-line-to-text-file/).
#define STEP ((2*M_PI) / 4096.0)

int main(void)
{
    double alpha = 0;
    printf("#include <stdint.h>\nconst uint16_t sine[4096] = {\n");
    for(int x = 0; x < 4096 / 16; x++)
    {
        for(int y = 0; y < 16; y++)
        {
            printf("%d, ", (int)(4095 * (sin(alpha) + 1.0) / 2.0));
            alpha += STEP;
        }
        printf("\n");
    }
    printf("};\n");
}
https://godbolt.org/z/e899d98oW
Configure the timer to trigger its overflow 4096 * 280 = 1146880 times per second and set the timer to generate the DAC trigger event. With a 180 MHz timer clock this will not be exact: the actual frequency will be 279.906449045 Hz (see the arithmetic sketch after this answer). If you need better precision, change the number of samples to match your timer frequency and/or change the timer clock frequency (H7 timers can run at up to 480 MHz).
Configure the DAC to use DMA and transfer values from the lookup table created in step 1 to the DAC on each trigger event.
Enjoy a beautiful sine wave on your oscilloscope. Note that your microcontroller core will not be loaded at all; you will have it free for other tasks. If you want to change the period, simply reconfigure the timer. You can do that as many times per second as you wish. To reconfigure the timer, use the timer DMA burst mode, which reloads the PSC & ARR registers automatically on the update event without disturbing the generated waveform.
I know this is advanced STM32 programming and it requires register-level programming. I use it to generate complex waveforms in our devices.
It is the correct way of doing it: no control loops, no calculations, no core load.
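For instance, a few lines of arithmetic (my own sketch, using the 180 MHz figure from this answer) show where the 279.906... Hz value comes from: the ideal divider 180 MHz / 1146880 ≈ 156.96 is not an integer, so rounding it to 157 shifts the output frequency slightly:

#include <stdio.h>

int main(void)
{
    double timer_clk = 180e6;               /* timer clock, Hz */
    double target    = 4096 * 280.0;        /* required trigger rate: 1146880 Hz */
    double ideal_div = timer_clk / target;  /* ~156.96, not an integer */
    int    real_div  = (int)(ideal_div + 0.5);          /* nearest usable divider: 157 */
    double actual_f  = timer_clk / real_div / 4096.0;   /* resulting sine frequency */

    printf("ideal divider %.3f, rounded to %d -> sine frequency %.6f Hz\n",
           ideal_div, real_div, actual_f);
    return 0;
}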
I'd like to address the embedded programming issues in your code directly - #0___________'s answer is the correct way to do this on a microcontroller and I won't retread the same ground.
Variables representing time should never be floating point. If your increment is not a power of two, errors will always accumulate. Even if it is, eventually your increment will become smaller than the smallest representable step at that magnitude and the time value will stop advancing. Always use integers for time. You can pick an integer size big enough to ignore rollover: an unsigned 32-bit integer representing milliseconds will take about 50 days to roll over, while an unsigned 64-bit integer will take over 500 million years.
Generating any periodic signal where you do not care about the signal's phase does not require a time variable. Instead, you can keep an internal counter which resets to 0 at the end of a period. (When you use DMA with a look-up table, that's exactly what you're doing - the counter is the DMA controller's next-read pointer.)
Whenever you use a transcendental function such as sine on a microcontroller, your first thought should be "can I use a look-up table for this?" You don't have access to the luxury of a modern operating system optimally shuffling your load around on a 4 GHz+ multi-core processor. You're often dealing with a single thread that will stall while your 200 MHz microcontroller brings the FPU out of standby and runs the approximation algorithm. There is a significant cost to transcendental functions. There's a cost to LUTs too, but if you're hitting the function constantly, there's a good chance you'll like the tradeoffs of the LUT a lot better.
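Putting those points together, a sketch of what the control loop might look like with an integer sample counter and a precomputed table (names are illustrative; the 2500-sample period is the exact 75 ms period worked out elsewhere in this thread, and the table could just as well hold uint16_t DAC codes instead of doubles):

#include <math.h>

#define N_SAMPLES 2500                 /* 2500 * 30 us = 75 ms = exactly 21 cycles of 280 Hz */

static double sine_table[N_SAMPLES];
static unsigned int sample = 0;        /* integer "time": index of the current sample */

void init_table(void)                  /* call once at start-up */
{
    for (int i = 0; i < N_SAMPLES; i++)
        sine_table[i] = sin(2 * M_PI * 280 * i * 30e-6);
}

void control_loop(void)
{
    double sine = sine_table[sample];  /* no transcendental call in the loop */
    (void)sine;                        /* use sine here */

    if (++sample == N_SAMPLES)         /* wrap the counter; time never accumulates */
        sample = 0;
}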
As noted in some of the comments, the time value grows continually. This poses two problems:
The sin function likely has to perform a modulus operation internally to reduce the argument to a supported range.
The resolution of time becomes worse and worse as the value increases, because the small increment is being added to an ever-larger number.
Making the following changes should improve the performance:
double time;
void control_loop() {
    time += 30.0e-6;
    if ((1.0/280.0) < time)
    {
        time -= 1.0/280.0;
    }
    double sine = sin(2 * M_PI * 280 * time);
    ...
}
Note that once this change is made, the time variable no longer represents total elapsed time; it only tracks the phase within the current 1/280-second period.
Use a look-up table. Your comment in the discussion with Eugene Sh.:
A small deviation from the sine frequency (like 280.1Hz) would be ok.
In that case, with a control interval of 30 µs, if you have a table of 119 samples that you repeat over and over, you will get a sine wave of 280.112 Hz. Since you have a 12-bit DAC, you only need 119 * 2 = 238 bytes to store this if you would output it directly to the DAC. If you use it as input for further calculations like you mention in the comments, you can store it as float or double as desired. On an MCU with embedded static RAM, it only takes a few cycles at most to load from memory.
If you have a few kilobytes of memory available, you can eliminate this problem completely with a lookup table.
With a sampling period of 30 µs, 2500 samples will have a total duration of 75 ms. This is exactly equal to the duration of 21 cycles at 280 Hz.
I haven't tested or compiled the following code, but it should at least demonstrate the approach:
double sin2500() {
    static double *table = NULL;
    static int n = 2499;
    if (!table) {
        table = malloc(2500 * sizeof(double));
        for (int i = 0; i < 2500; i++)
            table[i] = sin(2 * M_PI * 280 * i * 30e-06);
    }
    n = (n + 1) % 2500;
    return table[n];
}
How about a variant of the modulo-based concept from the other answers:
int t = 0;
int divisor = 1000000;
void control_loop() {
    t += 30 * 280;
    if (t > divisor) t -= divisor;
    double sine = sin(2 * M_PI * t / (double)divisor);
    ...
}
Because the modulo is calculated in integer arithmetic, it introduces no round-off errors.
There is an alternative approach to calculating a series of values of sine (and cosine) for angles that increase by some very small amount. It essentially comes down to calculating the X and Y coordinates of a point moving around a circle, then dividing the Y value by some constant to produce the sine, and dividing the X value by the same constant to produce the cosine.
If you are content to generate a "very round ellipse", you can use the following hack, which is attributed to Marvin Minsky in the 1960s. It's much faster than calculating sines and cosines, although it introduces a very small error into the series. Here is an extract from the HAKMEM document, Item 149, where the Minsky circle algorithm is outlined.
ITEM 149 (Minsky): CIRCLE ALGORITHM
Here is an elegant way to draw almost circles on a point-plotting display:
NEW X = OLD X - epsilon * OLD Y
NEW Y = OLD Y + epsilon * NEW(!) X
This makes a very round ellipse centered at the origin with its size determined by the initial point. epsilon determines the angular velocity of the circulating point, and slightly affects the eccentricity. If epsilon is a power of 2, then we don't even need multiplication, let alone square roots, sines, and cosines! The "circle" will be perfectly stable because the points soon become periodic.
The circle algorithm was invented by mistake when I tried to save one register in a display hack! Ben Gurley had an amazing display hack using only about six or seven instructions, and it was a great wonder. But it was basically line-oriented. It occurred to me that it would be exciting to have curves, and I was trying to get a curve display hack with minimal instructions.
Here is a link to the hakmem: http://inwap.com/pdp10/hbaker/hakmem/hacks.html
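For concreteness, the recurrence translates into C roughly like this (a sketch of my own; the essential detail from the HAKMEM item is that the new x, not the old one, is used when updating y):

#include <stdio.h>
#include <math.h>

int main(void)
{
    double epsilon = 2 * M_PI * 280 * 30e-6;   /* angular step per 30 us sample */
    double x = 1.0, y = 0.0;                   /* the initial point sets the amplitude */

    for (int i = 0; i < 10; i++) {
        x = x - epsilon * y;
        y = y + epsilon * x;                   /* note: uses the NEW x, per the HAKMEM item */
        printf("%f %f\n", x, y);               /* y approximates the sine, x the cosine */
    }
    return 0;
}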
I think it should be possible to use a modulo operation, because sin() is periodic.
Then you don't have to worry about those problems.
double time = 0;
long unsigned int timesteps = 0;
double sine;

void control_loop()
{
    timesteps++;
    time += 30e-6;
    if (time > 1)
    {
        time -= 1;
    }
    sine = sin(2 * M_PI * 280 * time);
    ...
}
Fascinating thread. Minsky's algorithm mentioned in Walter Mitty's answer reminded me of a method for drawing circles that was published in Electronics & Wireless World and that I kept. (Credit: https://www.electronicsworld.co.uk/magazines/). I'm attaching it here for interest.
However, for my own similar projects (for audio synthesis) I use a lookup table, with enough points that linear interpolation is accurate enough (do the math(s)!)

Program stops unexpectedly when I use gettimeofday() in an infinite loop

I've written code to make each iteration of a while(1) loop take a specific amount of time (in this example 10000 µs, which equals 0.01 seconds). The problem is that this code works pretty well at the start but somehow stops after less than a minute. It's as if there were a limit on accessing Linux time. For now, I am using a boolean flag so that this time calculation runs only once instead of indefinitely. Since performance varies over time, it would be good to calculate the computation time for each loop. Is there any other way to accomplish this?
void some_function(){
    struct timeval tstart, tend;
    while (1){
        gettimeofday(&tstart, NULL);
        ...
        Some computation
        ...
        gettimeofday(&tend, NULL);
        diff = (tend.tv_sec - tstart.tv_sec)*1000000L+(tend.tv_usec - tstart.tv_usec);
        usleep(10000-diff);
    }
}
From the man page of usleep:
#include <unistd.h>
int usleep(useconds_t usec);
usec is an unsigned int; now guess what happens when diff is greater than 10000 in the line below:
usleep(10000-diff);
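In other words, whenever the computation takes longer than 10000 µs, 10000-diff wraps around to a huge unsigned value and usleep() appears to hang for over an hour. A guard along these lines (just a sketch, with a hypothetical helper name, doing the difference in signed arithmetic) avoids that:

#include <sys/time.h>
#include <unistd.h>

/* Sleep for whatever is left of a 10000 us time slot, but never pass a
   negative (i.e. huge unsigned) value to usleep(). */
static void sleep_remaining(const struct timeval *tstart, const struct timeval *tend)
{
    long diff = (tend->tv_sec - tstart->tv_sec) * 1000000L
              + (long)tend->tv_usec - (long)tstart->tv_usec;
    if (diff >= 0 && diff < 10000)
        usleep(10000 - diff);
}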
Well, the computation you make to get the difference is wrong:
diff = (tend.tv_sec - tstart.tv_sec)*1000000L+(tend.tv_usec - tstart.tv_usec);
You are mixing different integer types and missing that tv_usec can be an unsigned quantity, which you are subtracting from another unsigned quantity; the subtraction can overflow. After that, you get as a result a full second plus a quantity that is around 4.0E9 µsec, which is some 4000 s, or more than an hour, approximately. It is better to check for a borrow between the two tv_usec fields and, in that case, adjust the seconds difference by one and the microseconds difference by 1000000, so you always end up with a proper positive value.
I don't know the implementation of struct timeval that you are using, but most probably tv_sec is a time_t (which can even be 64-bit) while tv_usec is normally just an unsigned 32-bit value, as it is never going to exceed 1000000.
Let me illustrate... suppose you have spent 100 ms doing calculations, and this happens to occur in the middle of a second. You have
tstart.tv_sec = 123456789; tstart.tv_usec = 123456;
tend.tv_sec = 123456789; tend.tv_usec = 223456;
when you subtract, it leads to:
tv_sec = 0; tv_usec = 100000;
but let's suppose you have done your computation while the second changes:
tstart.tv_sec = 123456789; tstart.tv_usec = 923456;
tend.tv_sec = 123456790; tend.tv_usec = 23456;
the time difference is again 100 msec, but now, when you calculate your expression, you get 1000000 (one full second) for the first part, and after subtracting the second part you get 23456 - 923456, which wraps around to 4294067296 because of the unsigned overflow.
So you end up calling usleep(4295067296), which is about 4295 s, or more than 1 h 11 m.
I think you simply have not had enough patience to wait for it to finish... but this is something that can be happening to your program, depending on how struct timeval is defined.
A proper way to make the carry work is to reorder the summation so that the additions are done first and the subtraction last; that way the intermediate results never go negative, so the mix of signed and unsigned quantities cannot underflow:
diff = (tend.tv_sec - tstart.tv_sec) * 1000000 + tend.tv_usec - tstart.tv_usec;
which is parsed as
diff = (((tend.tv_sec - tstart.tv_sec) * 1000000) + tend.tv_usec) - tstart.tv_usec;

Interbench Benchmark Code

I want to ask whether anyone here is familiar with the function below from Interbench. I want to port it to the Windows platform but keep failing. I can only get microsecond accuracy by using timeval instead of timespec, and in the end there are errors: divide-by-zero and access-violation exceptions.
unsigned long get_usecs(struct timeval *myts)
{
    if (clock_gettime(myts))
        terminal_error("clock_gettime");
    return (myts->tv_sec * 1000000 + myts->tv_usec);
}

void burn_loops(unsigned long loops)
{
    unsigned long i;

    /*
     * We need some magic here to prevent the compiler from optimising
     * this loop away. Otherwise trying to emulate a fixed cpu load
     * with this loop will not work.
     */
    for (i = 0; i < loops; i++)
        _ReadWriteBarrier();
}

void calibrate_loop()
{
    unsigned long long start_time, loops_per_msec, run_time = 0;
    unsigned long loops;
    struct timeval myts;

    loops_per_msec = 100000;
redo:
    /* Calibrate to within 1% accuracy */
    while (run_time > 1010000 || run_time < 990000) {
        loops = loops_per_msec;
        start_time = get_usecs(&myts);
        burn_loops(loops);
        run_time = get_usecs(&myts) - start_time;
        loops_per_msec = (1000000 * loops_per_msec / run_time ? run_time : loops_per_msec );
    }

    /* Rechecking after a pause increases reproducibility */
    Sleep(1 * 1000);
    loops = loops_per_msec;
    start_time = get_usecs(&myts);
    burn_loops(loops);
    run_time = get_usecs(&myts) - start_time;

    /* Tolerate 5% difference on checking */
    if (run_time > 1050000 || run_time < 950000)
        goto redo;

    loops_per_ms = loops_per_msec;
}
The only clock_gettime() function I know is the one specified by POSIX, and that function has a different signature than the one you are using. It does provide nanosecond resolution (though it is unlikely to provide single-nanosecond precision). To the best of my knowledge, however, it is not available on Windows. Microsoft's answer to obtaining nanosecond-scale time differences is to use its proprietary "Query Performance Counter" (QPC) API. Do put that aside for the moment, however, because I suspect clock resolution isn't your real problem.
Supposing that your get_usecs() function successfully retrieves a clock time with microsecond resolution and at least (about) millisecond precision, as seems to be the expectation, your code looks a bit peculiar. In particular, this assignment ...
loops_per_msec = (1000000 * loops_per_msec / run_time
? run_time
: loops_per_msec );
... looks quite wrong, as is more apparent when the formatting emphasizes operator precedence, as above (* and / have higher precedence than ?:). It will give you your divide-by-zero if you don't get a measurable positive run time, or otherwise it will always give you either the same loops_per_msec value you started with or else run_time, the latter of which doesn't even have the right units.
I suspect the intent was something more like this ...
loops_per_msec = ((1000000 * loops_per_msec)
/ (run_time ? run_time : loops_per_msec));
..., but that still has a problem: if 1000000 loops is not sufficient to consume at least one microsecond (as measured) then you will fall into an infinite loop, with loops_per_msec repeatedly set to 1000000.
This would be less susceptible to that particular problem ...
loops_per_msec = ((1000000 * loops_per_msec) / (run_time ? run_time : 1));
... and it makes more sense to me, too, because if the measured run time is 0 microseconds, then 1 microsecond is a better non-zero approximation to that than any other possible value. Do note that this will scale up your loops_per_msec quite rapidly (one million-fold) when the measured run time is zero microseconds. You can't do that many times without overflowing, even if unsigned long long turns out to have 128 bits, and if you get an overflow then you will go into an infinite loop. On the other hand, if that overflow happens then it indicates an absurdly large correct value for the loops_per_msec you are trying to estimate.
And that leads me to my conclusion: I suspect your real problem is that your timing calculations are wrong or invalid, either because get_usecs() isn't working correctly or because the body of burn_loops() is being optimized away (despite your effort to avoid that). You don't need sub-microsecond precision for your time measurements. In fact, you don't even really need better than millisecond precision, as long as your burn_loops() actually does work proportional to the value of its argument.
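If you do end up needing a Windows replacement for get_usecs(), the usual shape of a version built on the QPC API mentioned above is roughly the following (a sketch, not drop-in Interbench code; get_usecs_qpc is my own name, and it drops the struct timeval argument entirely):

#include <windows.h>

/* Microsecond timestamp derived from the performance counter.
   The counter frequency is fixed at boot, so it is cached on first use. */
unsigned long long get_usecs_qpc(void)
{
    static LARGE_INTEGER freq;        /* counts per second */
    LARGE_INTEGER now;

    if (freq.QuadPart == 0)
        QueryPerformanceFrequency(&freq);
    QueryPerformanceCounter(&now);
    return (unsigned long long)now.QuadPart * 1000000ULL / (unsigned long long)freq.QuadPart;
}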

Why is addition using bitwise operators in this code much slower than arithmetic addition?

I tried comparing arithmetic addition with a function I wrote that uses bitwise operations, only to find that the latter was almost 10x slower. What could be the reason for such a disparity in speed? Since I am adding to the same number in the loop, does the compiler rewrite it to something more optimal in the first case?
Using arithmetic operation:
#include <stdio.h>
#include <time.h>

int main()
{
    clock_t begin = clock();
    int x;
    int i = 1000000000;
    while (i--) {
        x = 1147483000 + i;
    }
    printf("%d\n", x);
    clock_t end = clock();
    double time_spent = (double)(end - begin) / CLOCKS_PER_SEC;
    printf("time spent = %f\n", time_spent);
    return 0;
}
Output:
1147483000
time spent = 3.520000
Using bitwise operators:
The line inside the while loop was replaced with:
x = add(1147483000, i);
and here's the add function:
int add(int x, int y) {
    while (y != 0) {
        int carry = (x & y);
        x = x ^ y;
        y = carry << 1;
    }
    return x;
}
Output:
1147483000
time spent = 32.940000
Integer arithmetic is performed in hardware typically in a very small number of clock cycles.
You will not be able to get close to this performance in software. Your implementation using bitwise operations involves a function call and a loop. The bitwise operations that you perform typically cost similar numbers of clock cycles as arithmetic.
You are performing three bitwise operations per iteration. Frankly, I'm astonished that there is only a factor of 10 here.
I also wonder what your compiler settings are, specifically any optimizations. A good compiler could eliminate your while loop in the arithmetic version. For performance comparisons you should be comparing optimised code. It looks as if you might not be doing so.
It's difficult to know what you are trying to achieve here, but do not expect to beat the performance of hardware arithmetic units.
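On the point about optimization settings: one common way to keep the arithmetic loop from being eliminated entirely while still compiling with optimization turned on (a sketch of the general technique, not the code from the question) is to make the stored result observable, for example with volatile:

#include <stdio.h>
#include <time.h>

int main(void)
{
    volatile int x = 0;         /* volatile: each store must actually happen,
                                   so the loop cannot be removed entirely */
    long i = 1000000000;

    clock_t begin = clock();
    while (i--)
        x = 1147483000 + (int)i;
    clock_t end = clock();

    printf("%d\ntime spent = %f\n", x,
           (double)(end - begin) / CLOCKS_PER_SEC);
    return 0;
}

Accumulating the results and printing the total, as the tester() harness near the top of this page does, achieves the same thing.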
You have replaced this:
x = 1147483000 + i;
with this:
while(y != 0) {
int carry = (x & y);
x = x ^ y;
y = carry << 1;
}
Of course you get a huge slow-down! + of two integers is one assembly instruction. Your while loop executes many instructions, effectively simulating in software what the hardware does when it executes an addition.
To elaborate, this is what a full adder looks like. For 32-bit addition, the ALU contains 32 of these units cascaded. Each of those hardware elements has a very, very small delay, and the delay of the wires is negligible. So when two 32-bit numbers are added with a single hardware instruction, it takes very, very little time.
On the other hand, if you try to simulate the addition by hand, you make the CPU fetch and decode instructions from memory some 32 times, which takes considerably longer.
When you replace addition with a function call:
calling the function is more time-intensive than a simple addition, because a function call involves stack operations;
in the function you replace the addition with three bitwise operations per loop iteration; how fast they are compared to addition may also be an issue, though I can't confirm that without testing. Can you post the individual times for the three bitwise operations here?
1.
//tic
while(i--) {
int carry = (x & y);
}
//toc
2.
//tic
while(i--) {
x = x ^ y;
}
//toc
3.
//tic
while(i--) {
y = carry << 1;
}
//toc
But the function call should be the main reason.

What other mathematical operators can one use to transform an algorithm?

The difference operator (similar to the derivative operator) and the sum operator (similar to the integration operator) can be used to transform an algorithm, because they are inverses.
Sum of (difference of y) = y
Difference of (sum of y) = y
An example of using them that way in a C program is below.
This C program demonstrates three approaches to making an array of squares.
The first approach is the simple, obvious one: y = x*x.
The second approach uses the equation (difference in y) = (x0 + x1)*(difference in x).
The third approach is the reverse and uses the equation (sum of y) = x(x+1)(2x+1)/6.
The second approach is consistently slightly faster than the first one, even though I haven't bothered optimizing it. I imagine that if I tried harder I could make it even better.
The third approach is consistently twice as slow, but this doesn't mean the basic idea is dumb. I could imagine that for some function other than y = x*x this approach might be faster. Also, there is an integer overflow issue.
Trying out all these transformations was very interesting, so now I want to know what are some other pairs of mathematical operators I could use to transform the algorithm?
Here is the code:
#include <stdio.h>
#include <time.h>

#define tries 201
#define loops 100000

void printAllIn(unsigned int array[tries]){
    unsigned int index;
    for (index = 0; index < tries; ++index)
        printf("%u\n", array[index]);
}

int main (int argc, const char * argv[]) {
    /*
     Goal: calculate an array of squares from 0 to 200 as fast as possible
     */
    long unsigned int obvious[tries];
    long unsigned int sum_of_differences[tries];
    long unsigned int difference_of_sums[tries];
    clock_t time_of_obvious1;
    clock_t time_of_obvious0;
    clock_t time_of_sum_of_differences1;
    clock_t time_of_sum_of_differences0;
    clock_t time_of_difference_of_sums1;
    clock_t time_of_difference_of_sums0;
    long unsigned int j;
    long unsigned int index;
    long unsigned int sum1;
    long unsigned int sum0;
    long signed int signed_index;

    time_of_obvious0 = clock();
    for (j = 0; j < loops; ++j)
        for (index = 0; index < tries; ++index)
            obvious[index] = index*index;
    time_of_obvious1 = clock();

    time_of_sum_of_differences0 = clock();
    for (j = 0; j < loops; ++j)
        for (index = 1, sum_of_differences[0] = 0; index < tries; ++index)
            sum_of_differences[index] = sum_of_differences[index-1] + 2 * index - 1;
    time_of_sum_of_differences1 = clock();

    time_of_difference_of_sums0 = clock();
    for (j = 0; j < loops; ++j)
        for (signed_index = 0, sum0 = 0; signed_index < tries; ++signed_index) {
            sum1 = signed_index*(signed_index+1)*(2*signed_index+1);
            difference_of_sums[signed_index] = (sum1 - sum0)/6;
            sum0 = sum1;
        }
    time_of_difference_of_sums1 = clock();

    // printAllIn(obvious);
    printf(
        "The obvious approach y = x*x took, %f seconds\n",
        ((double)(time_of_obvious1 - time_of_obvious0))/CLOCKS_PER_SEC
    );
    // printAllIn(sum_of_differences);
    printf(
        "The sum of differences approach y1 = y0 + 2x - 1 took, %f seconds\n",
        ((double)(time_of_sum_of_differences1 - time_of_sum_of_differences0))/CLOCKS_PER_SEC
    );
    // printAllIn(difference_of_sums);
    printf(
        "The difference of sums approach y = sum1 - sum0, sum = (x - 1)x(2(x - 1) + 1)/6 took, %f seconds\n",
        (double)(time_of_difference_of_sums1 - time_of_difference_of_sums0)/CLOCKS_PER_SEC
    );
    return 0;
}
There are two classes of optimizations here: strength reduction and peephole optimizations.
Strength reduction is the usual term for replacing "expensive" mathematical functions with cheaper functions -- say, replacing a multiplication with two logarithm table lookups, an addition, and then an inverse logarithm lookup to find the final result.
Peephole optimization is the usual term for replacing something like multiplication by a power of two with a left shift. Some CPUs have simple instructions for these operations that run faster than generic integer multiplication for the specific case of multiplying by powers of two.
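As a tiny illustration of both ideas (my own example, not code from the question): a multiplication by a power of two can be rewritten as a shift, and a multiplication inside a loop can be reduced to a running addition, which is essentially what the sum-of-differences approach in the question does:

#include <stdio.h>

int main(void)
{
    /* peephole-style rewrite: multiplying by a power of two becomes a shift */
    int a = 37;
    printf("%d %d\n", a * 8, a << 3);      /* prints the same value twice */

    /* strength reduction: y = 5*i inside a loop becomes a running addition */
    int y = 0;
    for (int i = 0; i < 10; i++) {
        printf("%d %d\n", 5 * i, y);       /* the two columns stay equal */
        y += 5;                            /* cheaper than re-multiplying each time */
    }
    return 0;
}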
You can also perform optimizations of individual algorithms. You might write a * b, but there are many different ways to perform multiplication, and different algorithms perform better or worse under different conditions. Many of these decisions are made by the chip designers, but arbitrary-precision integer libraries make their own choices based on the merits of the primitives available to them.
When I tried to compile your code on Ubuntu 10.04, I got a segmentation fault right when main() started because you are declaring many megabytes worth of variables on the stack. I was able to compile it after I moved most of your variables outside of main to make them be global variables.
Then I got these results:
The obvious approach y = x*x took, 0.000000 seconds
The sum of differences approach y1 = y0 + 2x - 1 took, 0.020000 seconds
The difference of sums approach y = sum1 - sum0, sum = (x - 1)x(2(x - 1) + 1)/6 took, 0.000000 seconds
The program runs so fast it's hard to believe it really did anything. I put the "-O0" option in to disable optimizations but it's possible GCC still might have optimized out all of the computations. So I tried adding the "volatile" qualifier to your arrays but still got similar results.
That's where I stopped working on it. In conclusion, I don't really know what's going on with your code but it's quite possible that something is wrong.
