I've been trying to mix together two 16-bit linear PCM audio streams and I can't seem to overcome the noise issues. I think they are coming from overflow when mixing samples together.
I have the following function ...
short int mix_sample(short int sample1, short int sample2)
{
return #mixing_algorithm#;
}
... and here's what I have tried as #mixing_algorithm#
sample1/2 + sample2/2
2*(sample1 + sample2) - 2*(sample1*sample2) - 65535
(sample1 + sample2) - sample1*sample2
(sample1 + sample2) - sample1*sample2 - 65535
(sample1 + sample2) - ((sample1*sample2) >> 0x10) // same as dividing by 65536
Some of them have produced better results than others but even the best result contained quite a lot of noise.
Any ideas how to solve it?
The best solution I have found is given by Viktor Toth. He provides a solution for 8-bit unsigned PCM, and changing that for 16-bit signed PCM, produces this:
int a = 111; // first sample (-32768..32767)
int b = 222; // second sample
int m; // mixed result will go here
// Make both samples unsigned (0..65535)
a += 32768;
b += 32768;
// Pick the equation
if ((a < 32768) || (b < 32768)) {
// Viktor's first equation when both sources are "quiet"
// (i.e. less than middle of the dynamic range)
m = a * b / 32768;
} else {
// Viktor's second equation when one or both sources are loud
m = 2 * (a + b) - (a * b) / 32768 - 65536;
}
// Output is unsigned (0..65536) so convert back to signed (-32768..32767)
if (m == 65536) m = 65535;
m -= 32768;
Using this algorithm means there is almost no need to clip the output as it is only one value short of being within range. Unlike straight averaging, the volume of one source is not reduced even when the other source is silent.
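Wrapped into the question's mix_sample signature, the same algorithm might look like the sketch below. One deviation from the snippet above: the product a * b is computed in 64 bits here, because when both inputs are near full scale the 32-bit product can overflow.

```c
#include <stdint.h>

/* Viktor Toth's two mixing equations, adapted for 16-bit signed PCM.
   Sketch only; the 64-bit product guards against overflow at full scale. */
short int mix_sample(short int sample1, short int sample2)
{
    int32_t a = (int32_t)sample1 + 32768;   /* make unsigned: 0..65535 */
    int32_t b = (int32_t)sample2 + 32768;
    int32_t m;

    if (a < 32768 || b < 32768) {
        m = (int32_t)(((int64_t)a * b) / 32768);
    } else {
        m = 2 * (a + b) - (int32_t)(((int64_t)a * b) / 32768) - 65536;
    }

    if (m == 65536)
        m = 65535;                          /* clamp the one out-of-range value */
    return (short int)(m - 32768);          /* back to signed */
}
```

As described above, mixing silence with a signal leaves it unchanged: mix_sample(0, 12345) yields 12345.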
Here's a descriptive implementation:
#include <cstdint>
#include <limits>

short int mix_sample(short int sample1, short int sample2) {
    const int32_t result = static_cast<int32_t>(sample1) + static_cast<int32_t>(sample2);
    typedef std::numeric_limits<short int> Range;
    if (result > Range::max())
        return Range::max();
    else if (result < Range::min())
        return Range::min();
    else
        return static_cast<short int>(result);
}
To mix, it's just add and clip!
To avoid clipping artifacts, you will want to use saturation or a limiter. Ideally, you will have a small int32_t buffer with a small amount of lookahead; this will introduce latency.
More common than limiting everywhere is to leave a few bits' worth of 'headroom' in your signal.
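As a deliberately simple illustration of the headroom idea (the function name is an invention here), one might attenuate each input by one bit, i.e. 6 dB, before summing; the sum then can never leave the 16-bit range, so no limiter is needed, at the cost of level:

```c
/* One bit of headroom: halve each input before summing.
   Each term lies within [-16384, 16383], so the sum always fits in 16 bits. */
short int mix_with_headroom(short int s1, short int s2)
{
    return (short int)(s1 / 2 + s2 / 2);
}
```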
Here is what I did on my recent synthesizer project.
int* unfiltered = (int *)malloc(lengthOfLongPcmInShorts*4);
int i;
for(i = 0; i < lengthOfShortPcmInShorts; i++){
unfiltered[i] = shortPcm[i] + longPcm[i];
}
for(; i < lengthOfLongPcmInShorts; i++){
unfiltered[i] = longPcm[i];
}
int max = 0;
for(int i = 0; i < lengthOfLongPcmInShorts; i++){
int val = unfiltered[i];
if(abs(val) > max)
max = val;
}
short int *newPcm = (short int *)malloc(lengthOfLongPcmInShorts*2);
for(int i = 0; i < lengthOfLongPcmInShorts; i++){
newPcm[i] = (unfilted[i]/max) * MAX_SHRT;
}
I added all the PCM data into an integer array, so that I get all the data unfiltered.
After doing that I looked for the absolute max value in the integer array.
Finally, I took the integer array and put it into a short int array by taking each element dividing by that max value and then multiplying by the max short int value.
This way you get the minimum amount of 'headroom' needed to fit the data.
You might be able to do some statistics on the integer array and integrate some clipping, but for what I needed the minimum amount of headroom was good enough for me.
There's a discussion here: https://dsp.stackexchange.com/questions/3581/algorithms-to-mix-audio-signals-without-clipping about why the A+B - A*B solution is not ideal. Hidden down in one of the comments on this discussion is the suggestion to sum the values and divide by the square root of the number of signals. And an additional check for clipping couldn't hurt. This seems like a reasonable (simple and fast) middle ground.
I think they should be functions mapping [MIN_SHORT, MAX_SHORT] -> [MIN_SHORT, MAX_SHORT], and they clearly are not (besides the first one), so overflows occur.
If unwind's proposition won't work you can also try:
((long int)(sample1) + sample2) / 2
Since you are in the time domain, the frequency information is carried in the difference between successive samples; when you divide by two you damage that information. That's why adding and clipping works better. Clipping will of course add very high-frequency noise, which is probably filtered out.
I need to calculate the entropy and due to the limitations of my system I need to use restricted C features (no loops, no floating point support) and I need as much precision as possible. From here I figure out how to estimate the floor log2 of an integer using bitwise operations. Nevertheless, I need to increase the precision of the results. Since no floating point operations are allowed, is there any way to calculate log2(x/y) with x < y so that the result would be something like log2(x/y)*10000, aiming at getting the precision I need through arithmetic integer?
You will base an algorithm on the formula
log2(x/y) = K*(-log(x/y));
where
K = -1.0/log(2.0); // you can precompute this constant before run-time
a = (y-x)/y;
-log(x/y) = a + a^2/2 + a^3/3 + a^4/4 + a^5/5 + ...
If you write the loop correctly—or, if you prefer, unroll the loop to code the same sequence of operations looplessly—then you can handle everything in integer operations:
(y^N*(1*2*3*4*5*...*N)) * (-log(x/y))
= y^(N-1)*(2*3*4*5*...*N)*(y-x) + y^(N-2)*(1*3*4*5*...*N)*(y-x)^2 + ...
Of course, ^, the power operator, binding tighter than *, is not a C operator, but you can implement that efficiently in the context of your (perhaps unrolled) loop as a running product.
The N is an integer large enough to afford desired precision but not so large that it overruns the number of bits you have available. If unsure, then try N = 6 for instance. Regarding K, you might object that that is a floating-point number, but this is not a problem for you because you are going to precompute K, storing it as a ratio of integers.
SAMPLE CODE
This is a toy code but it works for small values of x and y such as 5 and 7, thus sufficing to prove the concept. In the toy code, larger values can silently overflow the default 64-bit registers. More work would be needed to make the code robust.
#include <stddef.h>
#include <stdlib.h>
// Your program will not need the below headers, which are here
// included only for comparison and demonstration.
#include <math.h>
#include <stdio.h>
const size_t N = 6;
const long long Ky = 1 << 10; // denominator of K
// Your code should define a precomputed value for Kx here.
int main(const int argc, const char *const *const argv)
{
// Your program won't include the following library calls but this
// does not matter. You can instead precompute the value of Kx and
// hard-code its value above with Ky.
const long long Kx = lrintl((-1.0/log(2.0))*Ky); // numerator of K
printf("K == %lld/%lld\n", Kx, Ky);
if (argc != 3) exit(1);
// Read x and y from the command line.
const long long x0 = atoll(argv[1]);
const long long y = atoll(argv[2]);
printf("x/y == %lld/%lld\n", x0, y);
if (x0 <= 0 || y <= 0 || x0 > y) exit(1);
// If 2*x <= y, then, to improve accuracy, double x repeatedly
// until 2*x > y. Each doubling offsets the log2 by 1. The offset
// is to be recovered later.
long long x = x0;
int integral_part_of_log2 = 0;
while (1) {
const long long trial_x = x << 1;
if (trial_x > y) break;
x = trial_x;
--integral_part_of_log2;
}
printf("integral_part_of_log2 == %d\n", integral_part_of_log2);
// Calculate the denominator of -log(x/y).
long long yy = 1;
for (size_t j = N; j; --j) yy *= j*y;
// Calculate the numerator of -log(x/y).
long long xx = 0;
{
const long long y_minus_x = y - x;
for (size_t i = N; i; --i) {
long long term = 1;
size_t j = N;
for (; j > i; --j) {
term *= j*y;
}
term *= y_minus_x;
--j;
for (; j; --j) {
term *= j*y_minus_x;
}
xx += term;
}
}
// Convert log to log2.
xx *= Kx;
yy *= Ky;
// Restore the aforementioned offset.
for (; integral_part_of_log2; ++integral_part_of_log2) xx -= yy;
printf("log2(%lld/%lld) == %lld/%lld\n", x0, y, xx, yy);
printf("in floating point, this ratio of integers works out to %g\n",
(1.0*xx)/(1.0*yy));
printf("the CPU's floating-point unit computes the log2 to be %g\n",
log2((1.0*x0)/(1.0*y)));
return 0;
}
Running this on my machine with command-line arguments of 5 7, it outputs:
K == -1477/1024
x/y == 5/7
integral_part_of_log2 == 0
log2(5/7) == -42093223872/86740254720
in floating point, this ratio of integers works out to -0.485279
the CPU's floating-point unit computes the log2 to be -0.485427
Accuracy would be substantially improved by N = 12 and Ky = 1 << 20, but for that you need either thriftier code or more than 64 bits.
THRIFTIER CODE
Thriftier code, wanting more effort to write, might represent the numerator and denominator in prime factors. For example, it might represent 500 as [2 0 3], meaning 2^2 * 3^0 * 5^3.
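As an illustration of that representation (a toy sketch; the helper name and the fixed prime list are inventions for this example), a number can be stored as exponents over the first few primes. Multiplying two numbers is then just adding their exponent vectors, with no risk of the product itself overflowing.

```c
/* Represent n as exponents of 2, 3, 5, 7: e.g. 500 = 2^2 * 3^0 * 5^3
   becomes {2, 0, 3, 0}. Illustrative only; real code would handle
   leftover factors larger than the last table entry. */
#define NPRIMES 4
static const long small_primes[NPRIMES] = {2, 3, 5, 7};

void factor_vector(long n, int exp[NPRIMES])
{
    for (int i = 0; i < NPRIMES; i++) {
        exp[i] = 0;
        while (n % small_primes[i] == 0) {
            n /= small_primes[i];
            exp[i]++;
        }
    }
}
```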
Yet further improvements might occur to your imagination.
AN ALTERNATE APPROACH
For an alternate approach, though it might not meet your requirements precisely as you have stated them, @phuclv has given the suggestion I would be inclined to follow if your program were mine: work the problem in reverse, guessing a value c/d for the logarithm and then computing 2^(c/d), presumably via a Newton-Raphson iteration. Personally, I like the Newton-Raphson approach better. See sect. 4.8 here (my original).
MATHEMATICAL BACKGROUND
Several sources including mine already linked explain the Taylor series underlying the first approach and the Newton-Raphson iteration of the second approach. The mathematics unfortunately is nontrivial, but there you have it. Good luck.
I am trying to find the largest prime factor of a huge number in C. For small numbers like 100 or even 10000 it works fine, but it fails (by fail I mean it keeps running for tens of minutes on my Core 2 Duo and i5) for very big target numbers (see code for the target number).
Is my algorithm correct?
I am new to C and really struggling with big numbers. What I want is correction or guidance, not a solution; I could do this in Python with bignum bindings and such (I have not tried yet but am pretty sure), but not in C. Or I might have made some tiny mistake that I am too tired to realize. Anyway, here is the code I wrote:
#include <stdio.h>
// To find largest prime factor of target
int is_prime(unsigned long long int num);
int main(void) {
unsigned long long int target = 600851475143;
unsigned long long int current_factor = 1;
register unsigned long long int i = 2;
while (i < target) {
if ( (target % i) == 0 && is_prime(i) && (i > current_factor) ) { //verify i as a prime factor and greater than last factor
current_factor = i;
}
i++;
}
printf("The greatest is: %llu \n", current_factor);
return(0);
}
int is_prime (unsigned long long int num) { //if num is prime 1 else 0
unsigned long long int z = 2;
while (num > z && z !=num) {
if ((num % z) == 0) {return 0;}
z++;
}
return 1;
}
600 billion iterations of anything will take some non-trivial amount of time. You need to substantially reduce this.
Here's a hint: Given an arbitrary integer value x, if we discover that y is a factor, then we've implicitly discovered that x / y is also a factor. In other words, factors always come in pairs. So there's a limit to how far we need to iterate before we're doing redundant work.
What is that limit? Well, what's the crossover point where y will be greater than x / y?
Once you've applied this optimisation to the outer loop, you'll find that your code's runtime will be limited by the is_prime function. But of course, you may apply a similar technique to that too.
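To make the pairing concrete without spoiling the exercise, here is a small standalone illustration (the function name is invented here): every divisor y found at or below the crossover immediately yields its partner x / y, so the loop can stop once y exceeds x / y.

```c
/* Count the divisor pairs of x by iterating only up to sqrt(x).
   Each y with x % y == 0 pairs with x / y; writing the bound as
   y <= x / y avoids the overflow risk of y * y <= x. */
int count_factor_pairs(unsigned long long x)
{
    int pairs = 0;
    for (unsigned long long y = 1; y <= x / y; y++) {
        if (x % y == 0)
            pairs++;
    }
    return pairs;
}
```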
By iterating up to the square root of the number, we can get all of its factors (factor and N/factor, with factor <= sqrt(N)). This small observation is the whole idea. Any factor we check that is less than sqrt(N) will have a corresponding factor larger than sqrt(N), so we only need to check up to sqrt(N) and we can recover the remaining factors.
You don't need an explicit prime-finding algorithm here. The factorization logic itself deduces whether the target is prime. So all that is left is to check the pairwise factors.
unsigned long long ans = 1;
for(unsigned long long i = 2; i<=target/i; i++)
while(target % i == 0){
ans = i;
target/=i;
}
if( target > 1 ) ans = target; // that means target is a prime.
//print ans
Edit: a point added by chux: the bound i*i used in an earlier version of this code may overflow; this is avoided by writing i <= target/i.
Also, another choice would be to have
unsigned long long square_root = isqrt(target);
for(unsigned long long i = 2; i<=square_root; i++){
...
}
Note that using sqrt here is not a wise choice, since mixing double math with integer operations is prone to round-off errors.
For the given target, the answer will be 6857.
Code has 2 major problems
The while (i < target) loop is very inefficient. Upon finding a factor, target could be reduced to target = target / i;. Further, a factor i could occur multiple times. Fix not shown.
is_prime(n) is very inefficient. Its while (num > z && z !=num) could loop n times. Here too, use the quotient to limit the iterations to about sqrt(n).
int is_prime (unsigned long long int num) {
unsigned long long int z = 2;
while (z <= num/z) {
if ((num % z) == 0) return 0;
z++;
}
return num > 1;
}
Nothing is wrong, it just needs optimization, for example:
int is_prime(unsigned long long int num) {
if (num == 2) {
return (1); /* Special case */
}
if (num % 2 == 0 || num <= 1) {
return (0);
}
unsigned long long int z = 3; /* We skipped all the even numbers */
while (z < num) { /* Do a single test instead of your redundant ones */
if ((num % z) == 0) {
return 0;
}
z += 2; /* Here we go twice as fast */
}
return 1;
}
Also, the other big problem is while (z < num), but since you don't want the solution I'll let you find out how to optimize that; similarly, look over the first function by yourself.
EDIT: Someone else posted, 50 seconds before me, the array-of-primes solution, which is the best one, but I chose to give an easy solution since you are just a beginner and manipulating arrays may not be easy at first (you need to learn pointers and such).
is_prime has a chicken-and-egg aspect in that you only need to test num against other primes. For example, you don't need to check against 9, because that is a multiple of 3.
is_prime could maintain an array of primes: each time a new num tested turns out to be prime, it is added to the array. num is tested against each prime in the array, and if it is not divisible by any of them, it is itself prime and is added to the array. The array needs to be malloc'd and realloc'd unless there is a formula to calculate the number of primes up until your target (I believe no exact formula exists).
EDIT: the number of primes to test for the target 600,851,475,143 will be approximately 7,500,000,000 and the table could run out of memory.
The approach can be adapted as follows:
to use unsigned int up until primes of UINT_MAX
to use unsigned long long int for primes above that
to use brute force above a certain memory consumption.
UINT_MAX is defined as 4,294,967,295 and would cover the primes up to around 100,000,000,000, at a cost of roughly 7.5e9 * 4 bytes = 30 GB.
See also The Prime Pages.
#include <stdio.h>
#include <math.h>
int main()
{
int i,j,k,t;
long int n;
int count;
int a,b;
float c;
scanf("%d",&t);
for(k=0;k<t;k++)
{
count=0;
scanf("%ld",&n);
for(i=1;i<n;i++)
{
a=pow(i,2);
for(j=i;j<n;j++)
{
b=pow(j,2);
c=sqrt(a+b);
if((c-floor(c)==0)&&c<=n)
++count;
}
}
printf("%d\n",count);
}
return 0;
}
The above is C code that counts the number of Pythagorean triplets within the range 1..n.
How do I optimize it? It times out for large input.
1<=T<=100
1<=N<=10^6
Your inner two loops are O(n*n) so there's not too much that can be done without changing algorithms. Just looking at the inner loop the best I could come up with in a short time was the following:
unsigned long long int i,j,k,t;
unsigned long long int n = 30000; //Example for testing
unsigned long long int count = 0;
unsigned long long int a, b;
unsigned long long int c;
unsigned long long int n2 = n * n;
for(i=1; i<n; i++)
{
a = i*i;
for(j=i; j<n; j++)
{
b = j*j;
unsigned long long int sum = a + b;
if (sum > n2) break;
// Quick reject: values with these low-bit patterns can never be perfect squares
if ( (sum & 2) || ((sum & 7) == 5) || ((sum & 11) == 8) ) continue;
c = sqrt((double)sum);
if (c*c == sum) ++count;
}
}
A few comments:
For the case of n=30000 this is roughly twice as fast as your original.
If you don't mind n being limited to 65535 you can switch to unsigned int to get a x2 speed increase (or roughly x4 faster than your original).
The quick bit test on sum increases the speed by a factor of two. You may be able to increase this by looking at the answers to this question.
Your original code has integer overflows when i > 65535 which is the reason I switched to 64-bit integers for everything.
I think your method of checking for a perfect square doesn't always work due to the inherent imprecision of floating-point numbers. The method in my example should get around that and is slightly faster anyway.
You are still bound to the O(n*n) algorithm. On my machine the code for n=30000 runs in about 6 seconds which means the n=1000000 case will take close to 2 hours. Looking at Wikipedia shows a host of other algorithms you could explore.
It really depends on what the benchmark is that you are expecting.
But for now, the power function could be a bottleneck. I think you can do either of two things:
a) Precalculate all the squared values, save them to a file, and load them into a dictionary. Depending on the input size, that might strain your memory.
b) Memoize previously calculated squared values so that when one is requested again you can reuse it, saving CPU time. This, too, will eventually strain your memory.
You can define your indexes as (unsigned) long or even (unsigned) long long, but you may have to use bignum libraries to solve your problem for huge numbers. Using unsigned raises your maximum value but restricts you to non-negative numbers. I doubt you'll need bigger than long long, though.
It seems your question is about optimising your code to make it faster. If you read up on Pythagorean triplets you will see there is a way to calculate them using integer parameters. If 3 4 5 are triplets then we know that 2*3 2*4 2*5 are also triplets and k*3 k*4 k*5 are also triplets. Your algorithm is checking all of those triplets. There are better algorithms to use, but I'm afraid you will have to search on Google to study about Pythagorean triplets.
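For reference, the parametric generation hinted at is Euclid's formula: for integers m > n > 0, the numbers a = m*m - n*n, b = 2*m*n, c = m*m + n*n always form a Pythagorean triple. A minimal sketch (enumerating all triples up to N additionally needs coprimality and parity conditions plus the multiples k mentioned above):

```c
/* Euclid's formula: (m*m - n*n, 2*m*n, m*m + n*n) for m > n > 0
   always satisfies a*a + b*b == c*c. */
void euclid_triple(int m, int n, int *a, int *b, int *c)
{
    *a = m * m - n * n;
    *b = 2 * m * n;
    *c = m * m + n * n;
}
```

For example, m = 2, n = 1 produces the familiar 3-4-5 triple.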
What would be the best way to identify all the set bit positions in a 64-bit bitmask? Suppose my bitmask is 0xDeadBeefDeadBeef; what is the best way to identify the positions of all the set bits in it?
unsigned long long mask = 0xdeadbeefdeadbeefULL;
unsigned int bit_pos = 0;
while (mask) {
    if ((mask & 1) == 1) {
        printf("Set bit position is: %u\n", bit_pos);
    }
    bit_pos++;
    mask >>= 1;
}
One way is to loop through it and check whether each bit is set; if it is, report the position and continue until either all the set bits or all 64 bits have been traversed. But there must be a better way of doing it?
Algorithm from Hacker's Delight (book):
int count_bits(long long s)
{
s = (s&0x5555555555555555L) + ((s>>1)&0x5555555555555555L);
s = (s&0x3333333333333333L) + ((s>>2)&0x3333333333333333L);
s = (s&0x0F0F0F0F0F0F0F0FL) + ((s>>4)&0x0F0F0F0F0F0F0F0FL);
s = (s&0x00FF00FF00FF00FFL) + ((s>>8)&0x00FF00FF00FF00FFL);
s = (s&0x0000FFFF0000FFFFL) + ((s>>16)&0x0000FFFF0000FFFFL);
s = (s&0x00000000FFFFFFFFL) + ((s>>32)&0x00000000FFFFFFFFL);
return (int)s;
}
Besides the nice bit-twiddling hacks already explained, there are other options.
Assuming you have x86(-64), SSE4, and gcc, and compile with the -msse4 switch, you can use:
int CountOnesSSE4(unsigned int x)
{
return __builtin_popcount(x);
}
This will compile into single popcnt instruction. If you need fast code you can actually check for SSE at runtime and use best function available.
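For the 64-bit mask in the question there is also a long long variant of the builtin (assuming a GCC-compatible compiler):

```c
/* 64-bit popcount via the GCC/Clang builtin; with -mpopcnt/-msse4 this
   compiles to a single POPCNT instruction, otherwise to a library routine. */
int CountOnes64(unsigned long long x)
{
    return __builtin_popcountll(x);
}
```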
If you expect the number to have only a few set bits, this could also be fast (and is always faster than the usual shift-and-compare loop):
#include <strings.h> /* for ffs */

int CountOnes(unsigned int x)
{
    int cnt = 0;
    while (x) {
        x >>= ffs(x);
        cnt++;
    }
    return cnt;
}
On x86 (even without SSE), ffs will compile into a single instruction (bsf), and the number of loop iterations will depend on the number of ones.
You might do this:
unsigned long long bit_mask = 0xdeadbeefdeadbeefULL;
int i;
for (i = 0; i < (sizeof(long long) * 8); i++) {
int res = bit_mask & 1;
printf ("Pos %i is %i\n", i, res);
bit_mask >>= 1;
}
It depends if you want clarity in your code or a very fast result. I almost always choose clarity in the code unless profiling tells me otherwise. For clarity, you might do something like:
int count_bits(long long value) {
int n = 0;
while(value) {
n += (value & 1);
value >>= 1;
}
return n;
}
For performance you might want to call count_bits from X J's answer.
int count_bits(long long s)
{
s = (s&0x5555555555555555L) + ((s>>1)&0x5555555555555555L);
s = (s&0x3333333333333333L) + ((s>>2)&0x3333333333333333L);
s = (s&0x0F0F0F0F0F0F0F0FL) + ((s>>4)&0x0F0F0F0F0F0F0F0FL);
s = (s&0x00FF00FF00FF00FFL) + ((s>>8)&0x00FF00FF00FF00FFL);
s = (s&0x0000FFFF0000FFFFL) + ((s>>16)&0x0000FFFF0000FFFFL);
s = (s&0x00000000FFFFFFFFL) + ((s>>32)&0x00000000FFFFFFFFL);
return (int)s;
}
It depends on whether you want to look through your code and say to yourself, "yeah, that makes sense", or "I'll take that guy's word for it".
I've been called out on this before in stack overflow. Some people do not agree. Some very smart people choose complexity over simplicity. I believe clean code is simple code.
If performance calls for it, use complexity. If not, don't.
Also, consider a code review. What are you going to say when someone says "How does count_bits work?"
If you are counting the ones, you can use the Hacker's Delight solution, which is fast, but a lookup table can be (though isn't always) faster, and it is much more understandable. You could pre-prepare a table, for example 256 entries deep, that holds the bit counts for the byte values 0x00 to 0xFF:
0, //0x00
1, //0x01
1, //0x02
2, //0x03
1, //0x04
2, //0x05
2, //0x06
3, //0x07
...
The code to build that table would likely use the slow step through every bit approach.
Once built though you can break your larger number into bytes
count = table8[number&0xFF]; number>>=8;
count += table8[number&0xFF]; number>>=8;
count += table8[number&0xFF]; number>>=8;
count += table8[number&0xFF]; number>>=8;
...
If you have more memory, you can make the table even bigger by representing wider numbers, e.g. a 65536-entry table for the values 0x0000 to 0xFFFF:
count = table16[number&0xFFFF]; number>>=16;
count += table16[number&0xFFFF]; number>>=16;
count += table16[number&0xFFFF]; number>>=16;
count += table16[number&0xFFFF];
Tables are a general way to make things like this faster at the expense of memory consumption. The more memory you are able to consume the more you can pre-compute (at or before compile time) rather than real-time compute.
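Putting the two halves together: build the byte table once with the slow per-bit loop, then count a 64-bit value one byte lookup at a time (the function names are inventions for this sketch):

```c
#include <stdint.h>

static unsigned char table8[256];

/* Build the 256-entry table once, using the slow step-through-every-bit loop. */
void init_table8(void)
{
    for (int i = 0; i < 256; i++) {
        int n = i, count = 0;
        while (n) {
            count += n & 1;
            n >>= 1;
        }
        table8[i] = (unsigned char)count;
    }
}

/* Count set bits in a 64-bit value, one byte lookup at a time. */
int count_bits64(uint64_t number)
{
    int count = 0;
    for (int i = 0; i < 8; i++) {
        count += table8[number & 0xFF];
        number >>= 8;
    }
    return count;
}
```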
The difference operator, (similar to the derivative operator), and the sum operator, (similar to the integration operator), can be used to change an algorithm because they are inverses.
Sum of (difference of y) = y
Difference of (sum of y) = y
An example of using them that way in a c program is below.
This c program demonstrates three approaches to making an array of squares.
The first approach is the simple obvious approach, y = x*x .
The second approach uses the equation (difference in y) = (x0 + x1)*(difference in x) .
The third approach is the reverse and uses the equation (sum of y) = x(x+1)(2x+1)/6 .
The second approach is consistently slightly faster than the first one, even though I haven't bothered optimizing it. I imagine that if I tried harder I could make it even better.
The third approach is consistently twice as slow, but this doesn't mean the basic idea is dumb. I could imagine that for some function other than y = x*x this approach might be faster. Also there is an integer overflow issue.
Trying out all these transformations was very interesting, so now I want to know what are some other pairs of mathematical operators I could use to transform the algorithm?
Here is the code:
#include <stdio.h>
#include <time.h>
#define tries 201
#define loops 100000
void printAllIn(unsigned int array[tries]){
unsigned int index;
for (index = 0; index < tries; ++index)
printf("%u\n", array[index]);
}
int main (int argc, const char * argv[]) {
/*
Goal: calculate an array of squares from 0 to 200 as fast as possible
*/
long unsigned int obvious[tries];
long unsigned int sum_of_differences[tries];
long unsigned int difference_of_sums[tries];
clock_t time_of_obvious1;
clock_t time_of_obvious0;
clock_t time_of_sum_of_differences1;
clock_t time_of_sum_of_differences0;
clock_t time_of_difference_of_sums1;
clock_t time_of_difference_of_sums0;
long unsigned int j;
long unsigned int index;
long unsigned int sum1;
long unsigned int sum0;
long signed int signed_index;
time_of_obvious0 = clock();
for (j = 0; j < loops; ++j)
for (index = 0; index < tries; ++index)
obvious[index] = index*index;
time_of_obvious1 = clock();
time_of_sum_of_differences0 = clock();
for (j = 0; j < loops; ++j)
for (index = 1, sum_of_differences[0] = 0; index < tries; ++index)
sum_of_differences[index] = sum_of_differences[index-1] + 2 * index - 1;
time_of_sum_of_differences1 = clock();
time_of_difference_of_sums0 = clock();
for (j = 0; j < loops; ++j)
for (signed_index = 0, sum0 = 0; signed_index < tries; ++signed_index) {
sum1 = signed_index*(signed_index+1)*(2*signed_index+1);
difference_of_sums[signed_index] = (sum1 - sum0)/6;
sum0 = sum1;
}
time_of_difference_of_sums1 = clock();
// printAllIn(obvious);
printf(
"The obvious approach y = x*x took, %f seconds\n",
((double)(time_of_obvious1 - time_of_obvious0))/CLOCKS_PER_SEC
);
// printAllIn(sum_of_differences);
printf(
"The sum of differences approach y1 = y0 + 2x - 1 took, %f seconds\n",
((double)(time_of_sum_of_differences1 - time_of_sum_of_differences0))/CLOCKS_PER_SEC
);
// printAllIn(difference_of_sums);
printf(
"The difference of sums approach y = sum1 - sum0, sum = (x - 1)x(2(x - 1) + 1)/6 took, %f seconds\n",
(double)(time_of_difference_of_sums1 - time_of_difference_of_sums0)/CLOCKS_PER_SEC
);
return 0;
}
There are two classes of optimizations here: strength reduction and peephole optimizations.
Strength reduction is the usual term for replacing "expensive" mathematical functions with cheaper functions -- say, replacing a multiplication with two logarithm table lookups, an addition, and then an inverse logarithm lookup to find the final result.
Peephole optimizations is the usual term for replacing something like multiplication by a power of two with left shifts. Some CPUs have simple instructions for these operations that run faster than generic integer multiplication for the specific case of multiplying by powers of two.
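A trivial sketch of the power-of-two case (modern compilers perform this rewrite automatically, so it is shown only to illustrate the term):

```c
/* Multiplying by 8, a power of two, is the same as shifting left by 3. */
unsigned mul8(unsigned x)
{
    return x << 3;   /* equivalent to x * 8 */
}
```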
You can also perform optimizations of individual algorithms. You might write a * b, but there are many different ways to perform multiplication, and different algorithms perform better or worse under different conditions. Many of these decisions are made by the chip designers, but arbitrary-precision integer libraries make their own choices based on the merits of the primitives available to them.
When I tried to compile your code on Ubuntu 10.04, I got a segmentation fault right when main() started because you are declaring many megabytes worth of variables on the stack. I was able to compile it after I moved most of your variables outside of main to make them be global variables.
Then I got these results:
The obvious approach y = x*x took, 0.000000 seconds
The sum of differences approach y1 = y0 + 2x - 1 took, 0.020000 seconds
The difference of sums approach y = sum1 - sum0, sum = (x - 1)x(2(x - 1) + 1)/6 took, 0.000000 seconds
The program runs so fast it's hard to believe it really did anything. I put the "-O0" option in to disable optimizations but it's possible GCC still might have optimized out all of the computations. So I tried adding the "volatile" qualifier to your arrays but still got similar results.
That's where I stopped working on it. In conclusion, I don't really know what's going on with your code but it's quite possible that something is wrong.