Fast Multiplication - c

I'm writing code for a microprocessor with fast integer arithmetic and not-so-fast float arithmetic. I need to divide an integer by a number from 1 to 9 and convert the result back to an integer.
I made a float array with members like 0, 1, 0.5, 0.3333, etc.
But I think there are magic constants (like 0x55555556 instead of 1/3) for these numbers.
What are these numbers?

If the division instruction on your microcontroller is fast enough, use that. If you need the fractional part of the result, you may be able to use the remainder; on most architectures, the division instruction puts the quotient in one register and the remainder in another.
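For instance, writing the quotient and remainder side by side usually lets the compiler compute both with one division (a trivial sketch; the divmod name is just illustrative):

/* On targets whose divide instruction yields both results (e.g. x86),
   the compiler typically computes q and r with a single division. */
void divmod(unsigned x, unsigned d, unsigned *q, unsigned *r)
{
    *q = x / d;
    *r = x % d;
}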
If your division instruction is not fast enough but the multiplication instruction is, you can use the following technique (and it sounds as if this is the technique you're after). On most architectures, multiplying a 32-bit number by another 32-bit number results in a 64-bit result; the more significant half is stored in one register and the less significant half is stored in the other. You can exploit this by realizing that division by a number n is the same as multiplying by (2^32)/n, then taking the more significant 32 bits of the result. In other words, if you want to divide by 3, you can instead multiply by 0x100000000/3 rounded up, i.e. 0x55555556 (the constant you quoted), then take the more significant 32 bits of the result.
What you're doing here is really a form of fixed-point arithmetic. Take a look at the Wikipedia article for more information.
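For example, a minimal sketch of this trick for divide-by-3, assuming unsigned 32-bit inputs below 2^31 (the constant is 0x100000000/3 rounded up; the name div3 is just illustrative):

#include <stdint.h>
#include <stdio.h>

/* Multiply-high trick for dividing by 3. The constant is ceil(2^32 / 3)
   = 0x55555556; with this rounded-up value the result is exact for inputs
   below 2^31 (compilers use a slightly different constant/shift pair to
   cover the full unsigned 32-bit range). */
static uint32_t div3(uint32_t x) {
    return (uint32_t)(((uint64_t)x * 0x55555556u) >> 32);
}

int main(void) {
    for (uint32_t x = 0; x < 100000; x++) {
        if (div3(x) != x / 3) {
            printf("mismatch at %u\n", x);
            return 1;
        }
    }
    puts("div3 matches x/3 for the tested range");
    return 0;
}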

I'm assuming, based on the micro-controller tag, you don't have a fast integer divide. My answer is also for unsigned values - it will work for signed values, you just have to limit the numbers used in the tricky bit below.
A good start is divide by 2, 4 and 8. These can be done with right shifts of 1, 2 and 3 bits respectively, assuming your CPU has a logical right-shift instruction.
Secondly, dividing by 1 is just keeping the number as-is. That just leaves 3, 5, 6, 7 and 9.
Tricky bit starts here:
For the other numbers, you can use the fact that a divide can be replaced by a multiply-and-shift.
Let's say you have a 16-bit processor. To divide by N, you multiply by 256/N (rounded to the nearest integer) and shift right 8 bits:
N = 3, multiply by 85
N = 5, multiply by 51
N = 6, multiply by 43
N = 7, multiply by 37
N = 9, multiply by 28
Take the random example of 72 / 5. Multiply 72 by 51 to get 3672 then shift right 8 bits to get 14.
In order for this to work, your numbers that you're using must not overflow the 16 bits. Since your worst case is multiply-by-85, you can handle numbers up to 771.
The reason this works is because a shift-right of 8 bits is the same as dividing by 256, and:
m * (256 / n) / 256
= m / (n / 256) / 256
= m / n * 256 / 256
= m / n * (256 / 256)
= m / n
If you have a 32-bit processor, the values and ranges change somewhat, since it's 65536/N:
N = 3, multiply by 21,846, right shift 16 bits, max value roughly 196,600.
N = 5, multiply by 13,108.
N = 6, multiply by 10,923.
N = 7, multiply by 9,363.
N = 9, multiply by 7,282.
Again, let's choose the random 20,000 / 7: 20,000 multiplied by 9,363 is 187,260,000 and, when you right shift that 16 bits, you get 2,857 - the real result is 2,857.
The following test program in C shows the accuracy figures for the values given. It uses signed values so is only good up to about 98,000 but you can see that the largest error is 1 and that it occurs at the low point of 13,110 (only 0.008% error).
#include <stdio.h>
#include <stdlib.h>   /* for abs() */

int res[5] = {0};
int low[5] = {-1, -1, -1, -1, -1};
int da[] = {3, 5, 6, 7, 9};
int ma[] = {21846, 13108, 10923, 9363, 7282};

int main (void) {
    int n, i;
    for (n = 0; n < 98000; n++) {
        for (i = 0; i < sizeof(da)/sizeof(da[0]); i++) {
            int r1 = n / da[i];
            int r2 = (n * ma[i]) >> 16;
            int dif = abs (r1 - r2);
            if (dif >= 5) {
                printf ("%d / %d gives %d and %d\n", n, da[i], r1, r2);
                return 1;
            }
            res[dif]++;
            if (low[dif] == -1) {
                low[dif] = n;
            }
        }
    }
    for (i = 0; i < sizeof(res)/sizeof(res[0]); i++) {
        printf ("Difference of %d: %6d, lowest value was %6d\n", i, res[i], low[i]);
    }
    return 0;
}
This outputs:
Difference of 0: 335874, lowest value was 0
Difference of 1: 154126, lowest value was 13110
Difference of 2: 0, lowest value was -1
Difference of 3: 0, lowest value was -1
Difference of 4: 0, lowest value was -1


How can I get the first 3 digits of a given unsigned long long without converting it to a string and without knowing its length?

How can I get the first 3 digits of a given unsigned long long
without converting it to a string,
without knowing the length of the number in advance,
and without using the naive approach of repeatedly dividing by 10,
like this:
int first_3_digits(unsigned long long number)
{
    unsigned long long n = number;
    while (n >= 1000) {
        n = n / 10;
    }
    return n;
}
I had an interview and they said that this solution wasn't good enough.
The interviewer said the solution resembles a binary search.
I know how binary search works, but I don't know how to connect it to this problem.
Modern processors are able to compute the log2 of an integer in only a few cycles using specific low-level instructions (e.g. bsr on mainstream x86-64 processors). Based on this great previous post, one can compute the log10 of an integer very quickly. The idea is to use a lookup table to translate between log2 and log10. Once the log10 has been computed, one can just use another lookup table to perform the division by 10 ** log10(number). However, non-constant 64-bit divisions are very expensive on almost all processors. An alternative solution is to use a switch with all the possible cases so the compiler can generate efficient code and use a fast jump table. Indeed, a division by a constant can be optimized by the compiler into a few fast instructions (i.e. multiplications and shifts) that are much faster than a non-constant division. The resulting code is not very beautiful/simple though. Here it is:
#include <math.h>
#include <assert.h>

static inline unsigned int baseTwoDigits(unsigned long long x) {
    return x ? 64 - __builtin_clzll(x) : 0;
}

static inline unsigned int baseTenDigits(unsigned long long x) {
    static const unsigned char guess[65] = {
        0, 0, 0, 0, 1, 1, 1, 2, 2, 2, 3, 3, 3, 3, 4, 4, 4,
        5, 5, 5, 6, 6, 6, 6, 7, 7, 7, 8, 8, 8, 9, 9, 9, 9,
        10, 10, 10, 11, 11, 11, 12, 12, 12, 12, 13, 13, 13, 14, 14, 14, 15,
        15, 15, 15, 16, 16, 16, 17, 17, 17, 18, 18, 18, 18, 19
    };
    static const unsigned long long tenToThe[] = {
        1ull, 10ull, 100ull, 1000ull, 10000ull, 100000ull, 1000000ull, 10000000ull, 100000000ull,
        1000000000ull, 10000000000ull, 100000000000ull, 1000000000000ull,
        10000000000000ull, 100000000000000ull, 1000000000000000ull,
        10000000000000000ull, 100000000000000000ull, 1000000000000000000ull,
        10000000000000000000ull
    };
    unsigned int digits = guess[baseTwoDigits(x)];
    return digits + (x >= tenToThe[digits]);
}

inline int optimized(unsigned long long number)
{
    const unsigned int intLog = baseTenDigits(number);

    switch (intLog)
    {
        case 0: return number;
        case 1: return number;
        case 2: return number;
        case 3: return number;
        case 4: return number / 10ull;
        case 5: return number / 100ull;
        case 6: return number / 1000ull;
        case 7: return number / 10000ull;
        case 8: return number / 100000ull;
        case 9: return number / 1000000ull;
        case 10: return number / 10000000ull;
        case 11: return number / 100000000ull;
        case 12: return number / 1000000000ull;
        case 13: return number / 10000000000ull;
        case 14: return number / 100000000000ull;
        case 15: return number / 1000000000000ull;
        case 16: return number / 10000000000000ull;
        case 17: return number / 100000000000000ull;
        case 18: return number / 1000000000000000ull;
        case 19: return number / 10000000000000000ull;
        case 20: return number / 100000000000000000ull;
        default: assert(0); return 0;
    }
}
Note that this code uses the non-standard compiler built-in __builtin_clzll, available on both Clang and GCC. For MSVC, please read this post.
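If neither compiler-specific intrinsic is available, a portable (but slower) fallback for the bit-length helper could be a sketch like this:

/* Portable replacement for baseTwoDigits() above: returns 0 for x == 0,
   otherwise the index of the highest set bit plus one. */
static inline unsigned int baseTwoDigitsPortable(unsigned long long x) {
    unsigned int bits = 0;
    while (x) {
        x >>= 1;
        bits++;
    }
    return bits;
}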
[Update] The previous benchmark did not inline the code of the proposed function, unlike the others, resulting in slower execution (especially on Clang). Using static+inline helped the compiler properly inline the function calls.
Results
Here are the results of the methods on QuickBench, respectively on GCC and Clang (with -O3). Note that the input distribution is chosen so that the logarithms are roughly uniform and pseudo-random (i.e. log-uniform). This choice was made since the interviewer said a binary search was a good solution, and this distribution is the best one for such an algorithm.
One can see that this solution is the fastest. The ones of Yakov Khodorkovski and qqNade give wrong results for big values due to floating-point rounding. The one of qqNade is not presented in the benchmark for the sake of clarity, as it is more than 10 times slower than the original one.
The reason the solution of Gerhardh is so fast with Clang is that the compiler is able to partially generate fast conditional moves instead of slow conditional branches. This optimization is quite clever since it is only possible on 32-bit integers (and only if the division-by-a-constant optimization is performed first), but Clang is able to know that n is small enough after the first 2 conditions! That being said, this optimization is fragile, since small changes in the code often appear to break it.
One can note that the original code is surprisingly fast (especially on GCC). This is due to branch prediction: modern processors speculatively execute many iterations without first checking whether they should be executed (and roll back if needed). Each iteration is very fast since the division by a constant is optimized: it only takes 2 cycles/iteration on my machine. On modern x86-64 Intel processors a branch misprediction takes 14 cycles, while a well-predicted branch takes only 1-2 cycles (similar on AMD Zen processors). The average number of iterations is ~9 and only the last iteration is expensive. The solution of Gerhardh results in far fewer instructions executed, but it can result in up to 4 mispredicted branches with GCC and up to 2 with Clang. The proposed solution results in only 1 mispredicted indirect branch (which processors handle less efficiently, though). Since the optimized implementations run in only ~10 cycles on average on QuickBench, the effect of branch misprediction is huge.
Note that using another input distribution has an impact on the results, though the overall trend remains the same. Here are results for a uniform distribution: GCC and Clang. The original algorithm is significantly slower since the average number of digits is twice as big (17-18 instead of ~9), and so is the number of iterations. The speed of the other algorithms is not very different from the previous distribution, and the overall trend is unchanged.
Conclusion
To conclude, the solution of Gerhardh is relatively portable, simple and pretty fast. The new proposed solution is more complex, but it is the fastest on both GCC and Clang. Thus, the solution of Gerhardh should be preferred unless the performance of this function is very important.
Finding the "best" solution often depends on which criteria you decide matter.
If you want the smallest or simplest solution, your approach is not bad.
If you want the fastest solution (and you got the hint about "binary search" from the interviewer), then you might try something like this (not tested):
int first_3_digits(unsigned long long number)
{
    unsigned long long n = number;
    unsigned long long chopped;

    // n has up to 20 digits. We need to chop up to 17 digits.

    // Chop 8 digits
    chopped = n / 100000000ull;
    if (chopped >= 100)
    {
        n = chopped;
    }
    // chopped has up to 12 digits,
    // If we use old n we have up to 11 digits
    // 9 more to go...

    // Chop 4 digits
    chopped = n / 10000ull;
    if (chopped >= 100)
    {
        n = chopped;
    }
    // chopped has up to 8 digits,
    // If we use old n we have up to 7 digits
    // 5 more to go...

    // Chop 2 digits
    chopped = n / 100ull;
    if (chopped >= 100)
    {
        n = chopped;
    }
    // chopped has up to 6 digits,
    // If we use old n we have up to 5 digits
    // 3 more to go...

    // Chop 2 digits again
    chopped = n / 100ull;
    if (chopped >= 100)
    {
        n = chopped;
    }
    // chopped has up to 4 digits,
    // If we use old n we have up to 3 digits
    // 1 more to go...

    // Chop last digit if required.
    if (n >= 1000)
    {
        n /= 10;
    }

    return n;
}
For 64-bit values, the maximum number is 18446744073709551615, i.e. 20 digits.
We have to remove at most 17 digits.
As 17 is not a power of 2, we cannot simply halve the number of digits to chop at every step, so we repeat the 2-digit chop once.
That solution might be a bit faster but is likely to take more code.
When the interviewer says:
the solution resembles a binary search
that's evidence that the interviewer has in mind a particular distribution of inputs for this function, and that distribution is not a uniform distribution over the range of unsigned long long values. At that point, it's necessary to ask the interviewer what input distribution they expect, since it's not possible to optimise algorithms like this without knowing that.
In particular, if the inputs were selected from a uniform sample of the range of 64-bit unsigned values, then the following simple function would be close to optimal:
/* See Note 1 for this constant */
#define MAX_POWER (10ULL * 1000 * 1000 * 1000 * 1000 * 1000)

int first_3_digits(unsigned long long n) {
    if (n >= 1000) { /* Almost always true but needed for correctness */
        while (n < MAX_POWER * 100) n *= 10;
        if (n >= MAX_POWER * 1000) n /= 10;
        n /= MAX_POWER;
    }
    return n;
}
I hope this demonstrates how much difference the expected input distribution makes. The above solution optimises for the case where the input is close to the maximum number of digits, which is almost always the case with a uniform distribution. In that case, the while loop will hardly ever execute (with the code above it runs only when n < 10^18, i.e. for roughly 1 in 18 uniformly distributed 64-bit inputs).
That solution also has the advantage that the while loop, even if it does execute, only does multiplications by 10, not divisions by 10. Both GCC and Clang understand how to optimise divisions by constants into multiplication and shift, but a multiplication is still faster, and multiplication by 10 is particularly easy to optimise. [Note 2]
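For illustration (a small sketch, not part of the solution above): a multiplication by 10 reduces to two shifts and an add, which is one reason compilers handle it so cheaply.

/* 10*n = 8*n + 2*n; compilers typically lower this to shift/add or LEA
   sequences rather than a hardware multiply. */
unsigned long long times10(unsigned long long n) {
    return (n << 3) + (n << 1);
}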
Notes
Note that the constant MAX_POWER was precomputed for the case where unsigned long long is a 64-bit value. That's not guaranteed, although it's common, and it's also the minimum possible size of an unsigned long long. Computing the correct value with the C preprocessor is possible, at least up to a certain point, but it's tedious, so I left it out. The value needed is the largest power of 10 no greater than ULLONG_MAX / 1000. (Since the range of unsigned long long is always a power of 2, ULLONG_MAX / 1000 cannot itself be a power of 10, so the test could equally be for the largest power of 10 less than ULLONG_MAX / 1000, if that were more convenient.)
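If hard-coding the 64-bit value is a concern, one alternative (a sketch, computing the constant once at run time instead of with the preprocessor; the name max_power is just illustrative) could be:

#include <limits.h>

/* Largest power of 10 no greater than ULLONG_MAX / 1000. */
static unsigned long long max_power(void) {
    unsigned long long p = 1;
    while (p <= ULLONG_MAX / 1000 / 10)
        p *= 10;
    return p;
}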
In the benchmark provided by Jérôme Richard, whose sample inputs are chosen so that their logarithms are roughly uniform (a very different input distribution), this comes out a bit slower than his optimised solution, although it's within the margin of error of the benchmark tool. (On the other hand, the code is quite a bit simpler.) On Jérôme's second benchmark, with a uniform sample, it comes out a lot faster.
The hint for a binary search implies the last few tests and divides should be as below, using powers of 10 with exponents ..., 8, 4, 2, 1:
// Pseudo code
if (big enough)
    divide by 100,000,000
if (big enough)
    divide by 10,000
if (big enough)
    divide by 100
if (big enough)
    divide by 10
Working this backwards, we need up to 5 divisions for a 64-bit unsigned long long.
As unsigned long long may be wider than 64 bits, add tests for that.
Provide not only a solution, but a test harness to demonstrate correctness.
Example:
#include <errno.h>
#include <inttypes.h>
#include <limits.h>
#include <stdio.h>
#include <stdlib.h>
#include <string.h>
unsigned first3(unsigned long long x) {
    static const unsigned long long ten18 = 1000000ull * 1000000 * 1000000;
    static const unsigned long long ten16 = 10000ull * 1000000 * 1000000;
    static const unsigned long long ten10 = 10000ull * 1000000;
    static const uint32_t ten8 = 100ul * 1000000;
    static const uint32_t ten6 = 1000000u;
    static const uint32_t ten4 = 10000u;
    static const uint32_t ten3 = 1000u;
    static const uint32_t ten2 = 100u;

    // while loop used in case unsigned long long is more than 64-bit
    // We could use macro-magic to make this an `if` for common 64-bit.
    while (x >= ten18) {
        x /= ten16;
    }
    if (x >= ten10) {
        x /= ten8;
    }
    if (x >= ten6) {
        x /= ten4;
    }

    uint32_t x32 = (uint32_t) x; // Let us switch to narrow math
    if (x32 >= ten4) {
        x32 /= ten2;
    }
    if (x32 >= ten3) {
        x32 /= 10;
    }
    return (unsigned) x32;
}
Test code
int test_first3(void) {
    char buf[sizeof(unsigned long long) * CHAR_BIT];
    for (size_t i = 1;; i++) {
        for (int dig = '0'; dig <= '9'; dig += 9) {
            memset(buf, dig, i);
            if (dig == '0') {
                buf[0]++;
            }
            buf[i] = '\0';
            errno = 0;
            unsigned long long x = strtoull(buf, 0, 10);
            if (errno) {
                puts("Success!");
                return 0;
            }
            unsigned f3 = first3(x);
            char buf3[sizeof(unsigned) * CHAR_BIT];
            int len = sprintf(buf3, "%u", f3);
            printf("%2zu <%s> <%s>\n", i, buf3, buf);
            if (len > 3) {
                printf("i:%zu dig:%c\n", i, dig);
                return -1;
            }
            if (strncmp(buf, buf3, 3) != 0) {
                return -1;
            }
        }
    }
}

int main(void) {
    test_first3();
}
Output
1 <1> <1>
1 <9> <9>
2 <10> <10>
2 <99> <99>
3 <100> <100>
3 <999> <999>
4 <100> <1000>
4 <999> <9999>
5 <100> <10000>
5 <999> <99999>
...
17 <100> <10000000000000000>
17 <999> <99999999999999999>
18 <100> <100000000000000000>
18 <999> <999999999999999999>
19 <100> <1000000000000000000>
19 <999> <9999999999999999999>
20 <100> <10000000000000000000>
Success!
Here is a short solution using the log10 function:
#include <math.h>

int first_n_digits(unsigned long long number) {
    return number < 1000 ? (int)number
                         : (int)(number / pow(10, (int)(log10(number) + 1) - 3));
}
You can calculate the length of a number (its number of decimal digits) using the decimal logarithm, then use that exponent to build a divisor such that integer-dividing the number by it leaves only the first three digits:
#include <assert.h>
#include <math.h>

int first_3_digits(unsigned long long number)
{
    if (number < 1000)
        return number;
    int number_length = (int)floorl(log10l(number)) + 1;
    assert(number_length > 3);
    unsigned long long divider = exp10l(number_length - 3);
    return number / divider;
}

int main(void)
{
    assert(first_3_digits(0) == 0);
    assert(first_3_digits(999) == 999);
    assert(first_3_digits(1234) == 123);
    assert(first_3_digits(9876543210123456789ull) == 987);
    return 0;
}

Calculating sum of digits of 2^n in C

I am new to C and trying to write a program that calculates the sum of the digits of 2^n, where n<10^8.
For example, for 2^10, we'd have 1+0+2+4, which is 7.
Here's what I came up with:
#include <stdio.h>
#include <math.h>

int main()
{
    int n, t, sum = 0, remainder;

    printf("Enter an integer\n");
    scanf("%d", &n);

    t = pow(2, n);

    while (t != 0)
    {
        remainder = t % 10;
        sum = sum + remainder;
        t = t / 10;
    }

    printf("Sum of digits of 2 to the power of %d = %d\n", n, sum);
    return 0;
}
The problem is: the program works fine with numbers smaller than 30. Once I set n to a number higher than 30, the result is always -47.
I really do not understand this error and what causes it.
An interesting problem to be sure, but I think the solution is way outside the scope of a simple answer if you wish to support large values of n, such as the 10^8 you mentioned. The number 2^(10^8) requires 10^8 + 1 (100,000,001) bits, or around 12 megabytes of memory, to store in binary. In decimal it has around 30 million digits.
Your int is 32 bits wide, which is why the signed int can't store 2^31: the 32nd bit is the sign, while 2^31 has a 1 followed by 31 zeros in binary, requiring 32 bits without the sign. So it overflows and is interpreted as a negative number. (Technically signed integer overflow is undefined behaviour in C.)
You can switch to an unsigned int to get rid of the sign and the undefined behaviour, in which case your new highest supported n will be 31. You almost certainly have 64-bit integers available, and perhaps even 128-bit, but 2^127 is still way less than 2^100000000.
So either you need to find an algorithm to compute the decimal digits of a power of 2 without actually storing them (and only store the sum), or forget about trying to use any scalar types in standard C and get (or implement) an arbitrary precision math library operating on arrays (of bits, decimal digits, or binary-coded decimal digits). Alternatively, you can limit your solution to, say, uint64_t, but then you have n < 64, which is not nearly as interesting… =)
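To make the array-of-digits route concrete, here is a small sketch (the helper name digit_sum_of_pow2 and the 400-digit limit are just illustrative); doubling a decimal array n times is O(n * digits), which is fine for moderate n such as 1000 but nowhere near fast enough for n close to 10^8:

#include <stdio.h>

#define MAX_DIGITS 400   /* enough for 2^1000 (302 decimal digits) */

int digit_sum_of_pow2(int n)
{
    unsigned char d[MAX_DIGITS] = {1};  /* decimal digits, least significant first; starts at 1 = 2^0 */
    int len = 1;

    for (int i = 0; i < n; i++) {       /* double the number n times */
        int carry = 0;
        for (int j = 0; j < len; j++) {
            int v = d[j] * 2 + carry;
            d[j] = v % 10;
            carry = v / 10;
        }
        if (carry)
            d[len++] = carry;
    }

    int sum = 0;
    for (int j = 0; j < len; j++)
        sum += d[j];
    return sum;
}

int main(void)
{
    printf("%d\n", digit_sum_of_pow2(10));    /* 7: 1+0+2+4 */
    printf("%d\n", digit_sum_of_pow2(1000));  /* 1366 */
    return 0;
}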
For signed int t = pow(2,n), if n >= 31 then t > INT_MAX.
You can use unsigned long long t = pow(2,n) instead.
This will allow you to go as high as n == 63.
Also, since you're using base 2, you can use (unsigned long long)1 << n instead of pow(2,n).
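A minimal reworking of the original program along those lines (a sketch, exact only for 0 <= n <= 63):

#include <stdio.h>

int main(void)
{
    int n, sum = 0;
    unsigned long long t;

    printf("Enter an integer (0..63)\n");
    if (scanf("%d", &n) != 1 || n < 0 || n > 63)
        return 1;

    t = 1ULL << n;              /* exact 2^n, no floating point involved */
    while (t != 0) {
        sum += (int)(t % 10);   /* add the lowest decimal digit */
        t /= 10;
    }

    printf("Sum of digits of 2 to the power of %d = %d\n", n, sum);
    return 0;
}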

In C bits, multiply by 3 and divide by 16

A buddy of mine had these puzzles and this is one that is eluding me. Here is the problem: you are given a number and you want to return that number times 3 and divided by 16, rounding towards 0. Should be easy. The catch? You can only use the ! ~ & ^ | + << >> operators, and at most 12 of them in combination.
int mult(int x){
    //some code here...
    return y;
}
My attempt at it has been:
int hold = x + x + x;
int hold1 = 8;
hold1 = hold1 & hold;
hold1 = hold1 >> 3;
hold = hold >> 4;
hold = hold + hold1;
return hold;
But that doesn't seem to be working. I think I have a problem of losing bits, but I can't seem to come up with a way of saving them. Another perspective would be nice. Just to add: you can also only use variables of type int, and no loops, if statements or function calls may be used.
Right now I have the number 0xfffffff. It is supposed to return 0x2ffffff but it is returning 0x3000000.
For this question you need to worry about the lost bits before your division (obviously).
Essentially, if it is negative then you want to add 15 after you multiply by 3. A simple if statement (using your operators) should suffice.
I am not going to give you the code but a step by step would look like,
x = x*3
get the sign and store it in variable foo.
have another variable hold x + 15;
Set up an if statement so that if x is negative it uses that added 15 and if not then it uses the regular number (times 3 which we did above).
Then divide by 16 which you already showed you know how to do. Good luck!
This seems to work (as long as no overflow occurs):
((num<<2)+~num+1)>>4
Try this JavaScript code, run in console:
for (var num = -128; num <= 128; ++num) {
    var a = Math.floor(num * 3 / 16);
    var b = ((num << 2) + ~num + 1) >> 4;
    console.log(
        "Input:", num,
        "Regular math:", a,
        "Bit math:", b,
        "Equal: ", a === b
    );
}
The Maths
When you divide a positive integer n by 16, you get a positive integer quotient k and a remainder c < 16:
(n/16) = k + (c/16).
(Or simply apply Euclidean division.) The question asks for multiplication by 3/16, so multiply by 3:
(n/16) * 3 = 3k + (c/16) * 3.
The number k is an integer, so the part 3k is still a whole number. However, int arithmetic rounds down, so the second term may lose precision if you divide first. And since c < 16, you can safely multiply first without overflowing (3c is at most 45, which fits in any int). So the algorithm design can be
(3n/16) = 3k + (3c/16).
The design
The integer k is simply n/16 rounded down towards 0, so k can be found with a single right-shift (n >> 4). Two further operations (two additions) give 3k. Operation count: 3.
The remainder c can also be found using an AND operation (masking the low 4 bits). Multiplication by 3 uses two more operations, and a shift finishes the division. Operation count: 4.
Add them together gives you the final answer.
Total operation count: 8.
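Putting the non-negative case together under those counts (a small sketch, assuming x >= 0; the name mul3div16_nonneg is just illustrative):

int mul3div16_nonneg(int x) {
    int k = x >> 4;                         /* k = x / 16 (x non-negative) */
    int c = x & 15;                         /* c = x % 16                  */
    return k + k + k + ((c + c + c) >> 4);  /* 3k + (3c)/16: 8 operators   */
}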
Negatives
The above algorithm uses shift operations, so it may not work well on negatives. However, assuming two's complement, the sign of n is stored in a sign bit. It can be removed before applying the algorithm and reapplied to the answer.
To find and store the sign of n, a single AND is sufficient.
To remove this sign, an XOR with the stored sign bit can be used.
Apply the above algorithm.
To restore the sign bit, use a final OR operation on the algorithm output with the stored sign bit.
This brings the final operation count up to 11.
What you can do is first divide by 4, then add 3 times, then divide by 4 again.
3*x/16 = (x/4 + x/4 + x/4)/4
With this logic the program can be:
#include <stdio.h>

int main(void)
{
    int x = 0xefffffff;
    int y;

    printf("%x", x);
    y = x & (0x80000000);
    y = y >> 31;
    x = (y & (~x + 1)) + (~y & (x));
    x = x >> 2;
    x = x & (0x3fffffff);
    x = x + x + x;
    x = x >> 2;
    x = x & (0x3fffffff);
    x = (y & (~x + 1)) + (~y & (x));
    printf("\n%x %d", x, x);
    return 0;
}
AND with 0x3fffffff to make the MSBs zero; it'll even convert the numbers to positive.
This uses the 2's complement of negative numbers. With direct methods of dividing there will be a loss of bit accuracy for negative numbers, so use this workaround of converting the negative number to a positive one and then performing the division operations.
Note that the C99 standard states in section 6.5.7 that right shifts of negative signed integers invoke implementation-defined behavior. Under the provisions that int is 32 bits wide and that right shifting of signed integers maps to an arithmetic shift instruction, the following code works for all int inputs. A fully portable solution that also fulfills the requirements set out in the question may be possible, but I cannot think of one right now.
My basic idea is to split the number into high and low bits to prevent intermediate overflow. The high bits are divided by 16 first (this is an exact operation), then multiplied by three. The low bits are first multiplied by three, then divided by 16. Since arithmetic right shift rounds towards negative infinity instead of towards zero like integer division, a correction needs to be applied to the right shift for negative numbers. For a right shift by N, one needs to add 2^N - 1 prior to the shift if the number to be shifted is negative.
#include <stdio.h>
#include <stdlib.h>

int ref (int a)
{
    long long int t = ((long long int)a * 3) / 16;
    return (int)t;
}

int main (void)
{
    int a, t, r, c, res;
    a = 0;
    do {
        t = a >> 4;          /* high order bits */
        r = a & 0xf;         /* low order bits */
        c = (a >> 31) & 15;  /* shift correction. Portable alternative: (a < 0) ? 15 : 0 */
        res = t + t + t + ((r + r + r + c) >> 4);
        if (res != ref(a)) {
            printf ("!!!! error a=%08x res=%08x ref=%08x\n", a, res, ref(a));
            return EXIT_FAILURE;
        }
        a++;
    } while (a);
    return EXIT_SUCCESS;
}

Formula for division of each individual term in a summation

Example: take the sum of the first 8 odd numbers, each divided by 3.
When the division is applied to the sum as a whole, the result is (1+3+5+7+9+11+13+15)/3 = 64/3 = 21 (using integer division).
The summation formula is given by 1+3+5+...+(2N-1) = N^2, so the above can be easily calculated in O(1), using the rules of summation.
But when the division is applied to each term individually (truncating after the decimal point in each quotient), the result is
1/3 + 3/3 + 5/3 + 7/3 + 9/3 + 11/3 + 13/3 + 15/3 = 0+1+1+2+3+3+4+5 = 19. [using normal int/int division in C]
The latter method requires O(N), as the rules of summation can NOT be applied.
I understand that the loss of precision is greater when the division is applied to each term rather than only at the end. But this is exactly what I need. [In the above example, 19 is the required solution and not 21.]
Is there a formula that would serve as a shortcut for applying division individually to each term, similar to summation?
So, you get:
0 + (1+1+2) + (3+3+4) + 5
Let's multiply this by 3:
0 + (3+3+6) + (9+9+12) + 15
And compare it with the numerator of (1+...+15)/3:
1 + (3+5+7) + (9+11+13) + 15
You can clearly see that the sum you're seeking is losing 3 every 3 terms to the numerator or 1 in every term on average. And it doesn't matter how we group terms into triples:
(0+3+3) + (6+9+9) + 12+15
(1+3+5) + (7+9+11) + 13+15
or even
0+3 + (3+6+9) + (9+12+15)
1+3 + (5+7+9) + (11+13+15)
So your sum*3 is less than the numerator of (1+...+15)/3 by about the number of terms.
And the numerator can be calculated using the formula for the sum of the arithmetic progression: n^2, where n is the number of terms in the sum:
1+3+5+7+9+11+13+15 = 8^2 = 64
Now you subtract 8 from 64, get 56 and divide it by 3, getting 18.6(6). The number isn't equal to 19 because n (the number of terms) wasn't a multiple of 3.
So, the final formula isn't exactly (n^2 - n)/3, but differs in value by at most 1 from the correct one.
In fact, it's:
(n*n-n+1)/3 rounded down or calculated using integer division.
Plugging the number(s) into it we get:
(8*8-8+1)/3 = 57/3 = 19
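As a quick sanity check (a small test sketch, not part of the answer above) comparing the closed form against the O(N) term-by-term sum, both with C integer division:

#include <stdio.h>

int main(void)
{
    for (int n = 1; n <= 1000; n++) {
        long long direct = 0;
        for (int i = 1; i <= n; i++)
            direct += (2 * i - 1) / 3;                      /* truncating division per term */
        long long closed = ((long long)n * n - n + 1) / 3;  /* closed form from the answer  */
        if (direct != closed) {
            printf("mismatch at n=%d: %lld vs %lld\n", n, direct, closed);
            return 1;
        }
    }
    printf("(n*n - n + 1)/3 matches the term-by-term sum for n = 1..1000\n");
    return 0;
}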
Short answer: Yes there is such a formula.
Long answer (as I guess you want the formula):
How to get it: you already realized that the difference between the summation formula and the sum of the integer divisions comes from the rounding of the integer division at each summand.
Make a table with three rows:
First row: the result of each summand when you divide with full precision.
Second row: the result of each summand when you perform integer division.
Third row: the difference of both.
Now you should see the pattern: it's always 1/3, 0, 2/3.
That comes from the division by 3; you could prove it formally if you want (e.g. by induction).
So in the end your formula is: (n^2)/3 - (n/3)
The n*n/3 part is the regular summation formula, and since 1 is lost for every 3 full summands, we subtract n/3.
The result is going to be
1+1 + 2 + 3+3 + 4 + 5+5 + 6 + 7+7 + 8 + 9+9 + 10 + ...
in other words all odd numbers appear twice and all even numbers once. The sum of first n natural numbers is n*(n+1)/2 so the sum of first n even natural numbers is twice that and the sum of first n odd numbers is instead n*n.
I think you now have all pieces to get the result you need...
The sums you need are, for increasing ns:
1, 2, 4, 7, 10, 14, 19, 24, 30, 37 ...
Plug these into The On-Line Encyclopedia of Integer Sequences™ (OEIS™) and you get A007980 as the series that fits your requirements. It is calculated as a(n) = ceil((n+1)*(n+2)/3).
This makes a(0) = 1, a(1) = 2, a(6) = 19, meaning the index is offset by 2: sum(1,8) = a(8-2).
Σ((2i+1)/3) for i = 0 to n-1 is exactly n*n/3 when computed with rational arithmetic; the following program sums the terms as exact fractions:
#include <stdio.h>

typedef struct fraction {
    int n; //numerator
    int d; //denominator
} Fraction;

int gcd(int x, int y){
    x = (x < 0) ? -x : x;
    y = (y < 0) ? -y : y;
    while (y) {
        int wk;
        wk = x % y;
        x = y;
        y = wk;
    }
    return x;
}

Fraction rcd(Fraction x){
    int gcm;
    gcm = gcd(x.n, x.d);
    x.n /= gcm;
    x.d /= gcm;
    return x;
}

Fraction add(Fraction x, Fraction y){
    x.n = y.d*x.n + x.d*y.n;
    x.d = x.d*y.d;
    return rcd(x);
}

int main(void){
    Fraction sum = {0, 1};
    int n;
    for (n = 1; n <= 8; ++n) {
        Fraction x = { 2*n - 1, 3 };
        sum = add(sum, x);
    }
    printf("%d/%d=", sum.n, sum.d);
    printf("%d", sum.n/sum.d);
    return 0;
}

speeding up "base conversion" for large integers

I am using a base-conversion algorithm to generate a permutation from a large integer (split into 32-bit words).
I use a relatively standard algorithm for this:
/* N = count,K is permutation index (0..N!-1) A[N] contains 0..N-1 */
i = 0;
while (N > 1) {
swap A[i] and A[i+(k%N)]
k = k / N
N = N - 1
i = i + 1
}
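For reference, here is a concrete C version of this loop for the small case where k fits in 64 bits (N <= 20, since 20! < 2^64); the function name nth_permutation is just illustrative, and my actual k spans several 32-bit words, but the structure is identical:

#include <stdint.h>
#include <stdio.h>

/* Decode permutation index k (0..N!-1) into A[0..N-1], following the
   pseudocode above; only valid while k fits in a uint64_t (N <= 20). */
static void nth_permutation(int *A, int N, uint64_t k)
{
    for (int j = 0; j < N; j++)   /* A starts as the identity 0..N-1 */
        A[j] = j;

    int n = N;
    int i = 0;
    while (n > 1) {
        int j = i + (int)(k % (uint64_t)n);   /* A[i] <-> A[i + k%n] */
        int tmp = A[i];
        A[i] = A[j];
        A[j] = tmp;
        k /= (uint64_t)n;
        n = n - 1;
        i = i + 1;
    }
}

int main(void)
{
    int A[4];
    for (uint64_t k = 0; k < 24; k++) {   /* all 4! = 24 permutations */
        nth_permutation(A, 4, k);
        printf("%2llu: %d %d %d %d\n", (unsigned long long)k, A[0], A[1], A[2], A[3]);
    }
    return 0;
}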
Unfortunately, the divide and modulo in each iteration add up, especially when moving to large integers. But it seems I could just use multiplication!
/* As before, N is count, K is index, A[N] contains 0..N-1 */
/* Split is arbitrarily 128 (bits), for my current choice of N */
/* "Adjust" is precalculated: (1 << Split)/(N!) */
a = k*Adjust; /* a can be treated as a fixed point fraction */
i = 0;
while (N > 1) {
    a = a*N;
    index = a >> Split;
    a = a & ((1 << Split) - 1); /* actually, just zeroing a register */
    swap A[i] and A[i+index]
    N = N - 1
    i = i + 1
}
This is nicer, but doing large integer multiplies is still sluggish.
Question 1:
Is there a way of doing this faster?
E.g. since I know that N*(N-1) is less than 2^32, could I pull out those numbers from one word, and merge in the 'leftovers'?
Or, is there a way to modify an arithmetic decoder to pull out the indices one at a time?
Question 2:
For the sake of curiosity: if I use multiplication to convert a number to base 10 without the adjustment, then the result is multiplied by (10^digits / 2^shift). Is there a tricky way to remove this factor while working with the decimal digits? Even with the adjustment factor, this seems like it would be faster; why wouldn't standard libraries use this instead of divide and mod?
Seeing that you are talking about numbers like 2^128/(N!), it seems that in your problem N is going to be rather small (N < 35 according to my calculations).
I suggest taking the original algorithm as a starting point; first switch the direction of the loop:
i = 2;
while (i < N) {
    swap A[N - 1 - i] and A[N - i + k % i]
    k = k / i
    i = i + 1
}
Now change the loop to do several permutations per iteration. I guess the speed of division is the same regardless of the number i, as long as i < 2^32.
Split the range 2...N-1 into sub-ranges so that the product of the numbers in each sub-range is less than 2^32:
2, 3, 4, ..., 12: product is 479001600
13, 14, ..., 19: product is 253955520
20, 21, ..., 26: product is 3315312000
27, 28, ..., 32: product is 652458240
33, 34, 35: product is 39270
Then, divide the long number k by the products instead of dividing by i. Each iteration will yield a remainder (less than 2^32) and a smaller number k. When you have the remainder, you can work with it in an inner loop using the original algorithm; which will now be faster because it doesn't involve long division.
Here is some code:
static const int rangeCount = 5;
static const int rangeLimit[rangeCount] = {13, 20, 27, 33, 36};
static uint32_t rangeProduct[rangeCount] = {
479001600,
253955520,
3315312000,
652458240,
39270
};
for (int rangeIndex = 0; rangeIndex < rangeCount; ++rangeIndex)
{
// The following two lines involve long division;
// math libraries probably calculate both quotient and remainder
// in one function call
uint32_t rangeRemainder = k % rangeProduct[rangeIndex];
k /= rangeProduct[rangeIndex];
// A range starts where the previous range ended
int rangeStart = (rangeIndex == 0) ? 2 : rangeLimit[rangeIndex - 1];
// Iterate over range
for (int i = rangeStart; i < rangeLimit[rangeIndex] && i < n; ++i)
{
// The following two lines involve a 32-bit division;
// it produces both quotient and remainder in one Pentium instruction
int remainder = rangeRemainder % i;
rangeRemainder /= i;
std::swap(permutation[n - 1 - i], permutation[n - i + remainder]);
}
}
Of course, this code can be extended into more than 128 bits.
Another optimization could involve extraction of powers of 2 from the products of ranges; this might add a slight speedup by making the ranges longer. Not sure whether this is worthwhile (maybe for large values of N, like N=1000).
I don't know about better algorithms, but the ones you use seem pretty simple, so I don't really see how you can optimize the algorithm.
You may use alternative approaches:
Use ASM (assembler). From my experience, after a long time trying to figure out how a certain algorithm should be written in ASM, it ended up being slower than the version generated by the compiler :) Probably because the compiler also knows how to lay out the code so the CPU cache is used more efficiently, and/or which instructions are actually faster in which situations (this was on GCC/Linux).
Use multi-processing:
Make your algorithm multithreaded, and make sure you run with the same number of threads as the number of available CPU cores (most CPUs nowadays have multiple cores/multithreading).
Make your algorithm capable of running on multiple machines on a network, and devise a way of sending these numbers to the machines, so you may use their CPU power.
