Related
There is an existing question "Average of 3 long integers" that is specifically concerned with the efficient computation of the average of three signed integers.
The use of unsigned integers, however, allows for additional optimizations not applicable to the scenario covered in the previous question. This question is about the efficient computation of the average of three unsigned integers, where the average is rounded towards zero, i.e. in mathematical terms I want to compute ⌊ (a + b + c) / 3 ⌋.
A straightforward way to compute this average is
avg = a / 3 + b / 3 + c / 3 + (a % 3 + b % 3 + c % 3) / 3;
To first order, modern optimizing compilers will transform the divisions into multiplications with a reciprocal plus a shift, and the modulo operations into a back-multiply and a subtraction, where the back-multiply may use a scale_add idiom available on many architectures, e.g. lea on x86_64, add with lsl #n on ARM, iscadd on NVIDIA GPUs.
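For illustration, here is a sketch of what that strength reduction looks like for a 32-bit operand (the function name is mine, not from the original post). The constant 0xAAAAAAAB is ceil(2^33 / 3), so a widening multiply followed by a 33-bit shift reproduces x / 3 exactly for every uint32_t value:
#include <stdint.h>
/* sketch of the compiler's reciprocal transformation for unsigned x / 3 */
uint32_t div3_by_reciprocal (uint32_t x)
{
    return (uint32_t)(((uint64_t)x * 0xAAAAAAABu) >> 33); // MUL, SHR
}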
In trying to optimize the above in a generic fashion suitable for many common platforms, I observe that typically the cost of integer operations is in the relationship logical ≤ (add | sub) ≤ shift ≤ scale_add ≤ mul. Cost here refers to all of latency, throughput limitations, and power consumption. Any such differences become more pronounced when the integer type processed is wider than the native register width, e.g. when processing uint64_t data on a 32-bit processor.
My optimization strategy was therefore to minimize instruction count and replace "expensive" with "cheap" operations where possible, while not increasing register pressure and retaining exploitable parallelism for wide out-of-order processors.
The first observation is that we can reduce a sum of three operands into a sum of two operands by first applying a CSA (carry save adder) that produces a sum value and a carry value, where the carry value has twice the weight of the sum value. The cost of a software-based CSA is five logicals on most processors. Some processors, like NVIDIA GPUs, have a LOP3 instruction that can compute an arbitrary logical expression of three operands in one fell swoop, in which case CSA condenses to two LOP3s (note: I have yet to convince the CUDA compiler to emit those two LOP3s; it currently produces four LOP3s!).
The second observation is that because we are computing the modulo of division by 3, we don't need a back-multiply to compute it. We can instead use dividend % 3 = ((dividend / 3) + dividend) & 3, reducing the modulo to an add plus a logical since we already have the division result. This is an instance of the general algorithm: dividend % (2^n - 1) = ((dividend / (2^n - 1)) + dividend) & (2^n - 1).
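As a quick standalone sanity check of this identity (my addition): with d = 3*q + r we have q + d = 4*q + r, and masking with 3 extracts r because r < 4, so an exhaustive loop over a narrow type confirms it:
#include <stdint.h>
#include <stdio.h>
int main (void)
{
    uint16_t d = 0;
    do {
        /* check d % 3 == ((d / 3) + d) & 3 for every 16-bit value */
        if ((uint16_t)((d / 3 + d) & 3) != d % 3) {
            printf ("mismatch at d=%u\n", d);
            return 1;
        }
        d++;
    } while (d);
    printf ("remainder identity holds for all uint16_t\n");
    return 0;
}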
Finally, for the division by 3 in the correction term (a % 3 + b % 3 + c % 3) / 3 we don't need the code for generic division by 3. Since the dividend is very small, in [0, 6], we can simplify x / 3 into (3 * x) / 8, which requires just a scale_add plus a shift; for x = 0 ... 6 this yields 0, 0, 0, 1, 1, 1, 2, matching x / 3.
The code below shows my current work-in-progress. Using Compiler Explorer to check the code generated for various platforms shows the tight code I would expect (when compiled with -O3).
However, in timing the code on my Ivy Bridge x86_64 machine using the Intel 13.x compiler, a flaw became apparent: while my code improves latency (from 18 cycles to 15 cycles for uint64_t data) compared to the simple version, throughput worsens (from one result every 6.8 cycles to one result every 8.5 cycles for uint64_t data). Looking at the assembly code more closely it is quite apparent why that is: I basically managed to take the code down from roughly three-way parallelism to roughly two-way parallelism.
Is there a generically applicable optimization technique, beneficial on common processors in particular all flavors of x86 and ARM as well as GPUs, that preserves more parallelism? Alternatively, is there an optimization technique that further reduces overall operation count to make up for reduced parallelism? The computation of the correction term (tail in the code below) seems like a good target. The simplification (carry_mod_3 + sum_mod_3) / 2 looked enticing but delivers an incorrect result for one of the nine possible combinations.
#include <stdio.h>
#include <stdlib.h>
#include <stdint.h>
#define BENCHMARK (1)
#define SIMPLE_COMPUTATION (0)
#if BENCHMARK
#define T uint64_t
#else // !BENCHMARK
#define T uint8_t
#endif // BENCHMARK
T average_of_3 (T a, T b, T c)
{
T avg;
#if SIMPLE_COMPUTATION
avg = a / 3 + b / 3 + c / 3 + (a % 3 + b % 3 + c % 3) / 3;
#else // !SIMPLE_COMPUTATION
/* carry save adder */
T a_xor_b = a ^ b;
T sum = a_xor_b ^ c;
T carry = (a_xor_b & c) | (a & b);
/* here 2 * carry + sum = a + b + c */
T sum_div_3 = (sum / 3); // {MUL|MULHI}, SHR
T sum_mod_3 = (sum + sum_div_3) & 3; // ADD, AND
if (sizeof (size_t) == sizeof (T)) { // "native precision" (well, not always)
T two_carry_div_3 = (carry / 3) * 2; // MULHI, ANDN
T two_carry_mod_3 = (2 * carry + two_carry_div_3) & 6; // SCALE_ADD, AND
T head = two_carry_div_3 + sum_div_3; // ADD
T tail = (3 * (two_carry_mod_3 + sum_mod_3)) / 8; // ADD, SCALE_ADD, SHR
avg = head + tail; // ADD
} else {
T carry_div_3 = (carry / 3); // MUL, SHR
T carry_mod_3 = (carry + carry_div_3) & 3; // ADD, AND
T head = (2 * carry_div_3 + sum_div_3); // SCALE_ADD
T tail = (3 * (2 * carry_mod_3 + sum_mod_3)) / 8; // SCALE_ADD, SCALE_ADD, SHR
avg = head + tail; // ADD
}
#endif // SIMPLE_COMPUTATION
return avg;
}
#if !BENCHMARK
/* Test correctness on 8-bit data exhaustively. Should catch most errors */
int main (void)
{
T a, b, c, res, ref;
a = 0;
do {
b = 0;
do {
c = 0;
do {
res = average_of_3 (a, b, c);
ref = ((uint64_t)a + (uint64_t)b + (uint64_t)c) / 3;
if (res != ref) {
printf ("a=%08x b=%08x c=%08x res=%08x ref=%08x\n",
a, b, c, res, ref);
return EXIT_FAILURE;
}
c++;
} while (c);
b++;
} while (b);
a++;
} while (a);
return EXIT_SUCCESS;
}
#else // BENCHMARK
#include <math.h>
// A routine to give access to a high precision timer on most systems.
#if defined(_WIN32)
#if !defined(WIN32_LEAN_AND_MEAN)
#define WIN32_LEAN_AND_MEAN
#endif
#include <windows.h>
double second (void)
{
LARGE_INTEGER t;
static double oofreq;
static int checkedForHighResTimer;
static BOOL hasHighResTimer;
if (!checkedForHighResTimer) {
hasHighResTimer = QueryPerformanceFrequency (&t);
oofreq = 1.0 / (double)t.QuadPart;
checkedForHighResTimer = 1;
}
if (hasHighResTimer) {
QueryPerformanceCounter (&t);
return (double)t.QuadPart * oofreq;
} else {
return (double)GetTickCount() * 1.0e-3;
}
}
#elif defined(__linux__) || defined(__APPLE__)
#include <stddef.h>
#include <sys/time.h>
double second (void)
{
struct timeval tv;
gettimeofday(&tv, NULL);
return (double)tv.tv_sec + (double)tv.tv_usec * 1.0e-6;
}
#else
#error unsupported platform
#endif
#define N (3000000)
int main (void)
{
double start, stop, elapsed = INFINITY;
int i, k;
T a, b;
T avg0 = 0xffffffff, avg1 = 0xfffffffe;
T avg2 = 0xfffffffd, avg3 = 0xfffffffc;
T avg4 = 0xfffffffb, avg5 = 0xfffffffa;
T avg6 = 0xfffffff9, avg7 = 0xfffffff8;
T avg8 = 0xfffffff7, avg9 = 0xfffffff6;
T avg10 = 0xfffffff5, avg11 = 0xfffffff4;
T avg12 = 0xfffffff3, avg13 = 0xfffffff2;
T avg14 = 0xfffffff1, avg15 = 0xfffffff0;
a = 0x31415926;
b = 0x27182818;
avg0 = average_of_3 (a, b, avg0);
for (k = 0; k < 5; k++) {
start = second();
for (i = 0; i < N; i++) {
avg0 = average_of_3 (a, b, avg0);
avg0 = average_of_3 (a, b, avg0);
avg0 = average_of_3 (a, b, avg0);
avg0 = average_of_3 (a, b, avg0);
avg0 = average_of_3 (a, b, avg0);
avg0 = average_of_3 (a, b, avg0);
avg0 = average_of_3 (a, b, avg0);
avg0 = average_of_3 (a, b, avg0);
avg0 = average_of_3 (a, b, avg0);
avg0 = average_of_3 (a, b, avg0);
avg0 = average_of_3 (a, b, avg0);
avg0 = average_of_3 (a, b, avg0);
avg0 = average_of_3 (a, b, avg0);
avg0 = average_of_3 (a, b, avg0);
avg0 = average_of_3 (a, b, avg0);
avg0 = average_of_3 (a, b, avg0);
b = (b + avg0) ^ a;
a = (a ^ b) + avg0;
}
stop = second();
elapsed = fmin (stop - start, elapsed);
}
printf ("a=%016llx b=%016llx avg=%016llx",
(uint64_t)a, (uint64_t)b, (uint64_t)avg0);
printf ("\rlatency: each average_of_3() took %.6e seconds\n",
elapsed / 16 / N);
a = 0x31415926;
b = 0x27182818;
avg0 = average_of_3 (a, b, avg0);
for (k = 0; k < 5; k++) {
start = second();
for (i = 0; i < N; i++) {
avg0 = average_of_3 (a, b, avg0);
avg1 = average_of_3 (a, b, avg1);
avg2 = average_of_3 (a, b, avg2);
avg3 = average_of_3 (a, b, avg3);
avg4 = average_of_3 (a, b, avg4);
avg5 = average_of_3 (a, b, avg5);
avg6 = average_of_3 (a, b, avg6);
avg7 = average_of_3 (a, b, avg7);
avg8 = average_of_3 (a, b, avg8);
avg9 = average_of_3 (a, b, avg9);
avg10 = average_of_3 (a, b, avg10);
avg11 = average_of_3 (a, b, avg11);
avg12 = average_of_3 (a, b, avg12);
avg13 = average_of_3 (a, b, avg13);
avg14 = average_of_3 (a, b, avg14);
avg15 = average_of_3 (a, b, avg15);
b = (b + avg0) ^ a;
a = (a ^ b) + avg0;
}
stop = second();
elapsed = fmin (stop - start, elapsed);
}
printf ("a=%016llx b=%016llx avg=%016llx", (uint64_t)a, (uint64_t)b,
(uint64_t)(avg0 + avg1 + avg2 + avg3 + avg4 + avg5 + avg6 + avg7 +
avg8 + avg9 +avg10 +avg11 +avg12 +avg13 +avg14 +avg15));
printf ("\rthroughput: each average_of_3() took %.6e seconds\n",
elapsed / 16 / N);
return EXIT_SUCCESS;
}
#endif // BENCHMARK
Let me throw my hat in the ring. Not doing anything too tricky here, I think.
#include <stdint.h>
uint64_t average_of_three(uint64_t a, uint64_t b, uint64_t c) {
uint64_t hi = (a >> 32) + (b >> 32) + (c >> 32);
uint64_t lo = hi + (a & 0xffffffff) + (b & 0xffffffff) + (c & 0xffffffff);
return 0x55555555 * hi + lo / 3;
}
Following discussion below about different splits, here's a version that saves a multiply at the expense of three bitwise-ANDs:
T hi = (a >> 2) + (b >> 2) + (c >> 2);
T lo = (a & 3) + (b & 3) + (c & 3);
avg = hi + (hi + lo) / 3;
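Since a + b + c = 4*hi + lo by construction, the identity avg = hi + (hi + lo) / 3 follows from ⌊(4*hi + lo) / 3⌋ = hi + ⌊(hi + lo) / 3⌋. Here is a hedged exhaustive check on 8-bit data (my harness, not the answerer's):
#include <stdint.h>
#include <stdio.h>
int main (void)
{
    uint8_t a = 0, b, c;
    do {
        b = 0;
        do {
            c = 0;
            do {
                uint8_t hi = (a >> 2) + (b >> 2) + (c >> 2);
                uint8_t lo = (a & 3) + (b & 3) + (c & 3);
                uint8_t avg = hi + (hi + lo) / 3;
                uint8_t ref = ((unsigned)a + b + c) / 3;
                if (avg != ref) {
                    printf ("a=%d b=%d c=%d\n", a, b, c);
                    return 1;
                }
                c++;
            } while (c);
            b++;
        } while (b);
        a++;
    } while (a);
    printf ("split verified exhaustively for uint8_t\n");
    return 0;
}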
I'm not sure if it fits your requirements, but maybe it works to just calculate the result and then fix up the error from the overflow:
T average_of_3 (T a, T b, T c)
{
T r = ((T) (a + b + c)) / 3;
T o = (a > (T) ~b) + ((T) (a + b) > (T) (~c));
if (o) r += ((T) 0x5555555555555555) << (o - 1);
T rem = ((T) (a + b + c)) % 3;
if (rem >= (3 - o)) ++r;
return r;
}
[EDIT] Here is the best branch-and-compare-less version I can come up with. On my machine, this version actually has slightly higher throughput than njuffa's code. __builtin_add_overflow(x, y, r) is supported by gcc and clang and returns 1 if the sum x + y overflows the type of *r and 0 otherwise, so the calculation of o is equivalent to the portable code in the first version, but at least gcc produces better code with the builtin.
T average_of_3 (T a, T b, T c)
{
T r = ((T) (a + b + c)) / 3;
T rem = ((T) (a + b + c)) % 3;
T dummy;
T o = __builtin_add_overflow(a, b, &dummy) + __builtin_add_overflow((T) (a + b), c, &dummy);
r += -((o - 1) & 0xaaaaaaaaaaaaaaab) ^ 0x5555555555555555;
r += (rem + o + 1) >> 2;
return r;
}
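An annotation of my own for readers tracing the offset expression (the answer itself doesn't spell this out): the three possible values of o select the right multiple of floor(2^64 / 3).
/* o == 0: o - 1 == ~0, so -(~0 & 0xaaa...aab) == -0xaaa...aab == 0x555...555,
           and the XOR with 0x5555555555555555 yields 0 (no wraparound occurred)
   o == 1: the negated term is 0, so the XOR yields
           0x5555555555555555 == floor(2^64 / 3)   (one wraparound of 2^64)
   o == 2: -(1 & 0x...ab) == ~0, so the XOR yields
           0xaaaaaaaaaaaaaaaa == floor(2^65 / 3)   (two wraparounds)          */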
I answered the question you linked to already, so I am only answering the part that is different about this one: performance.
If you really cared about performance, then the answer is:
( a + b + c ) / 3
Since you cared about performance, you should have an intuition about the size of the data you are working with. You should not have worried about overflow on the addition (multiplication is another matter) of only 3 values, because if your data is already big enough to use the high bits of your chosen data type, you are in danger of overflow anyway and should have used a larger integer type. If you are overflowing on uint64_t, then you should really ask yourself why exactly you need to count accurately up to 18 quintillion, and perhaps consider using float or double.
Now, having said all that, I will give you my actual reply: It doesn't matter. The question doesn't come up in real life and when it does, perf doesn't matter.
It could be a real performance question if you are doing it a million times in SIMD, because there, you are really incentivized to use integers of smaller width and you may need that last bit of headroom, but that wasn't your question.
New answer, new idea. This one's based on the mathematical identity
floor((a+b+c)/3) = floor(x + (a+b+c - 3x)/3)
When does this work with machine integers and unsigned division?
When the difference doesn't wrap, i.e. 0 ≤ a+b+c - 3x ≤ T_MAX.
This definition of x is fast and gets the job done.
T avg3(T a, T b, T c) {
T x = (a >> 2) + (b >> 2) + (c >> 2);
return x + (a + b + c - 3 * x) / 3;
}
Weirdly, ICC inserts an extra neg unless I do this:
T avg3(T a, T b, T c) {
T x = (a >> 2) + (b >> 2) + (c >> 2);
return x + (a + b + c - (x + x * 2)) / 3;
}
Note that T must be at least five bits wide.
If T is two platform words long, then you can save some double word operations by omitting the low word of x.
Alternative version with worse latency but maybe slightly higher throughput?
T lo = a + b;
T hi = lo < b;
lo += c;
hi += lo < c;
T x = (hi << (sizeof(T) * CHAR_BIT - 2)) + (lo >> 2);
avg = x + (T)(lo - 3 * x) / 3;
I suspect SIMPLE is defeating the throughput benchmark by CSEing and hoisting a/3+b/3 and a%3+b%3 out of the loop, reusing those results for all 16 avg0..15 results.
(The SIMPLE version can hoist much more of the work than the tricky version; really just a ^ b and a & b in that version.)
Forcing the function to not inline introduces more front-end overhead, but does make your version win, as we expect it should on a CPU with deep out-of-order execution buffers to overlap independent work. There's lots of ILP to find across iterations, for the throughput benchmark. (I didn't look closely at the asm for the non-inline version.)
https://godbolt.org/z/j95qn3 (using __attribute__((noinline)) with clang -O3 -march=skylake on Godbolt's SKX CPUs) shows 2.58 nanosec throughput for the simple way, 2.48 nanosec throughput for your way. vs. 1.17 nanosec throughput with inlining for the simple version.
-march=skylake allows mulx for more flexible full-multiply, but otherwise no benefit from BMI2. andn isn't used; the line you commented with mulhi / andn is mulx into RCX / and rcx, -2 which only requires a sign-extended immediate.
Another way to do this without forcing call/ret overhead would be inline asm like in Preventing compiler optimizations while benchmarking (Chandler Carruth's CppCon talk has some examples of how he uses a couple of wrappers), or Google Benchmark's benchmark::DoNotOptimize.
Specifically, GNU C asm("" : "+r"(a), "+r"(b)) between each avgX = average_of_3 (a, b, avgX); statement will make the compiler forget everything it knows about the values of a and b, while keeping them in registers.
My answer on I don't understand the definition of DoNotOptimizeAway goes into more detail about using a read-only "r" register constraint to force the compiler to materialize a result in a register, vs. "+r" to make it assume the value has been modified.
If you understand GNU C inline asm well, it may be easier to roll your own in ways that you know exactly what they do.
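For instance, a minimal sketch of how such barriers could be dropped into the question's throughput loop (GNU C only; the arrangement is mine):
for (i = 0; i < N; i++) {
    avg0 = average_of_3 (a, b, avg0);
    asm("" : "+r"(a), "+r"(b)); // compiler must assume a and b changed
    avg1 = average_of_3 (a, b, avg1);
    asm("" : "+r"(a), "+r"(b));
    /* ... likewise after each of the remaining avg2 .. avg15 statements ... */
}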
[Falk Hüffner points out in comments that this answer has similarities to his answer. Looking at his code more closely belatedly, I do find some similarities. However, what I posted here is the product of an independent thought process, a continuation of my original idea "reduce three items to two prior to div-mod". I understood Hüffner's approach to be different: "naive computation followed by corrections".]
I have found a better way than the CSA technique in my question to reduce the division and modulo work from three operands to two operands. First, form the full double-word sum, then apply the division and modulo by 3 to each of the halves separately, and finally combine the results. Since the most significant half can only take the values 0, 1, or 2, computing the quotient and remainder of the division by three is trivial. Also, the combination into the final result becomes simpler.
Compared to the non-simple code variant from the question this achieves speedup on all platforms I examined. The quality of the code generated by compilers for the simulated double-word addition varies but is satisfactory overall. Nonetheless it may be worthwhile to code this portion in a non-portable way, e.g. with inline assembly.
T average_of_3_hilo (T a, T b, T c)
{
const T fives = (((T)(~(T)0)) / 3); // 0x5555...
T avg, hi, lo, lo_div_3, lo_mod_3, hi_div_3, hi_mod_3;
/* compute the full sum a + b + c into the operand pair hi:lo */
lo = a + b;
hi = lo < a;
lo = c + lo;
hi = hi + (lo < c);
/* determine quotient and remainder of each half separately */
lo_div_3 = lo / 3;
lo_mod_3 = (lo + lo_div_3) & 3;
hi_div_3 = hi * fives;
hi_mod_3 = hi;
/* combine partial results into the division result for the full sum */
avg = lo_div_3 + hi_div_3 + ((lo_mod_3 + hi_mod_3 + 1) / 4);
return avg;
}
An experimental build of GCC-11 compiles the obvious naive function to something like:
uint32_t avg3t (uint32_t a, uint32_t b, uint32_t c) {
a += b;
b = a < b;
a += c;
b += a < c;
b = b + a;
b += b < a;
return (a - (b % 3)) * 0xaaaaaaab;
}
Which is similar to some of the other answers posted here.
Any explanation of how these solutions work would be welcome (not sure of the netiquette here).
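Here is my own annotated reading of the GCC-11 output above, in case it helps; the two facts doing the work are 2^32 ≡ 1 (mod 3) and that 0xaaaaaaab is the multiplicative inverse of 3 modulo 2^32:
uint32_t avg3t_annotated (uint32_t a, uint32_t b, uint32_t c) {
    a += b;      // low 32 bits of a + b
    b = a < b;   // carry out of that addition
    a += c;      // low 32 bits of the full sum S = a + b + c
    b += a < c;  // b:a now holds S exactly (34 bits at most)
    b = b + a;   // 2^32 == 1 (mod 3), so S == b + a (mod 3)
    b += b < a;  // end-around carry preserves the congruence mod 3
    // b % 3 == S % 3, so a - S % 3 == S - S % 3 (mod 2^32) is an exact
    // multiple of 3; multiplying by 3^-1 mod 2^32 (0xaaaaaaab) yields
    // floor(S / 3), which always fits in 32 bits.
    return (a - (b % 3)) * 0xaaaaaaab;
}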
Let's say I've been given two integers a, b where a is a positive integer and smaller than b. I have to find an efficient algorithm that gives me the sum of the number of base-2 digits (number of bits) over the interval [a, b]. For example, in the interval [0, 4] the sum of digits is equal to 9, because 0 = 1 digit, 1 = 1 digit, 2 = 2 digits, 3 = 2 digits and 4 = 3 digits.
My program is capable of calculating this number by using a loop but I'm looking for something more efficient for large numbers. Here are the snippets of my code just to give you an idea:
#include <math.h> /* needed for log2() */
int numberOfBits(int i) {
if(i == 0) {
return 1;
}
else {
return (int) log2(i) + 1;
}
}
The function above is for calculating the number of digits of one number in the interval.
The code below shows you how I use it in my main function.
for(i = a; i <= b; i++) {
l = l + numberOfBits(i);
}
printf("Digits: %d\n", l);
Ideally I should be able to get the number of digits by using the two values of my interval and using some special algorithm to do that.
Try this code; I think it gives you what you need to count the binary digits:
int bit(int x)
{
if(!x) return 1;
else
{
int i;
for(i = 0; x; i++, x >>= 1);
return i;
}
}
The main thing to understand here is that the number of digits used to represent a number in binary increases by one with each power of two:
+--------------+---------------+
| number range | binary digits |
+==============+===============+
| 0 - 1 | 1 |
+--------------+---------------+
| 2 - 3 | 2 |
+--------------+---------------+
| 4 - 7 | 3 |
+--------------+---------------+
| 8 - 15 | 4 |
+--------------+---------------+
| 16 - 31 | 5 |
+--------------+---------------+
| 32 - 63 | 6 |
+--------------+---------------+
| ... | ... |
A trivial improvement over your brute force algorithm would then be to figure out how many times this number of digits has increased between the two numbers passed in (given by the base two logarithm) and add up the digits by multiplying the count of numbers that can be represented by the given number of digits (given by the power of two) with the number of digits.
A naive implementation of this algorithm is:
int digits_sum_seq(int a, int b)
{
int sum = 0;
int i = 0;
int log2b = b <= 0 ? 1 : floor(log2(b));
int log2a = a <= 0 ? 1 : floor(log2(a)) + 1;
sum += (pow(2, log2a) - a) * (log2a);
for (i = log2b; i > log2a; i--)
sum += pow(2, i - 1) * i;
sum += (b - pow(2, log2b) + 1) * (log2b + 1);
return sum;
}
It can then be improved by the more efficient versions of the log and pow functions seen in the other answers.
First, we can improve the speed of log2, but that only gives us a fixed factor speed-up and doesn't change the scaling.
Faster log2 adapted from: https://graphics.stanford.edu/~seander/bithacks.html#IntegerLogLookup
The lookup table method takes only about 7 operations to find the log
of a 32-bit value. If extended for 64-bit quantities, it would take
roughly 9 operations. Another operation can be trimmed off by using
four tables, with the possible additions incorporated into each. Using
int table elements may be faster, depending on your architecture.
Second, we must re-think the algorithm. If you know that numbers between N and M have the same number of digits, would you add them up one by one or would you rather do (M-N+1)*numDigits?
But what do we do if the range spans several digit counts? Let's just find the intervals of same digit count, and add up the sums of those intervals. This is implemented below. I think that my findEndLimit could be further optimized with a lookup table.
Code
#include <stdio.h>
#include <limits.h>
#include <time.h>
unsigned int fastLog2(unsigned int v)
{
static const char LogTable256[256] =
{
#define LT(n) n, n, n, n, n, n, n, n, n, n, n, n, n, n, n, n
-1, 0, 1, 1, 2, 2, 2, 2, 3, 3, 3, 3, 3, 3, 3, 3,
LT(4), LT(5), LT(5), LT(6), LT(6), LT(6), LT(6),
LT(7), LT(7), LT(7), LT(7), LT(7), LT(7), LT(7), LT(7)
};
register unsigned int t, tt; // temporaries
if (tt = v >> 16)
{
return (t = tt >> 8) ? 24 + LogTable256[t] : 16 + LogTable256[tt];
}
else
{
return (t = v >> 8) ? 8 + LogTable256[t] : LogTable256[v];
}
}
unsigned int numberOfBits(unsigned int i)
{
if (i == 0) {
return 1;
}
else {
return fastLog2(i) + 1;
}
}
unsigned int findEndLimit(unsigned int sx, unsigned int ex)
{
unsigned int sy = numberOfBits(sx);
unsigned int ey = numberOfBits(ex);
unsigned int mx;
unsigned int my;
if (sy == ey) // this also means sx == ex
return ex;
// assumes sy < ey
mx = (ex - sx) / 2 + sx; // will eq. sx for sx + 1 == ex
my = numberOfBits(mx);
while (ex - sx != 1) {
mx = (ex - sx) / 2 + sx; // will eq. sx for sx + 1 == ex
my = numberOfBits(mx);
if (my == ey) {
ex = mx;
ey = numberOfBits(ex);
}
else {
sx = mx;
sy = numberOfBits(sx);
}
}
return sx+1;
}
int main(void)
{
unsigned int a, b, m;
unsigned long l;
clock_t start, end;
l = 0;
a = 0;
b = UINT_MAX;
start = clock();
unsigned int i;
for (i = a; i < b; ++i) {
l += numberOfBits(i);
}
if (i == b) {
l += numberOfBits(i);
}
end = clock();
printf("Naive\n");
printf("Digits: %ld; Time: %fs\n",l, ((double)(end-start))/CLOCKS_PER_SEC);
l=0;
start = clock();
do {
m = findEndLimit(a, b);
l += (b-m + 1) * (unsigned long)numberOfBits(b);
b = m-1;
} while (b > a);
l += (b-a+1) * (unsigned long)numberOfBits(b);
end = clock();
printf("Binary search\n");
printf("Digits: %ld; Time: %fs\n",l, ((double)(end-start))/CLOCKS_PER_SEC);
}
Output
From 0 to UINT_MAX
$ ./main
Naive
Digits: 133143986178; Time: 25.722492s
Binary search
Digits: 133143986178; Time: 0.000025s
My findEndLimit can take a long time in some edge cases:
From UINT_MAX/16+1 to UINT_MAX/8
$ ./main
Naive
Digits: 7784628224; Time: 1.651067s
Binary search
Digits: 7784628224; Time: 4.921520s
Conceptually, you would need to split the task into two subproblems -
1) find the sum of digits from 0..M, and from 0..N, then subtract.
2) find the floor(log2(x)), because e.g. for the number 77 the numbers 64, 65, ..., 77 all have 7 digits, the 32 numbers below them have 6 digits, the 16 below those have 5 digits and so on, which makes a geometric progression.
Thus:
int digits(int a) {
    if (a == 0) return 1; // should digits(0) be 0 or 1 ?
    int b = (int)floor(log2(a)); // use any all-integer calculation hack
    int sum = 1 + (b+1) * (a - (1<<b) + 1); // added 1, due to digits(0)==1
    while (--b >= 0) // down to and including b == 0, so the digit of 1 is counted
        sum += (b + 1) << b; // shortcut for (b + 1) * (1 << b)
    return sum;
}
int digits_range(int a, int b) {
if (a <= 0 || b <= 0) return -1; // formulas work for strictly positive numbers
return digits(b)-digits(a-1);
}
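As an aside (my addition, not the answerer's), the loop in digits() can be eliminated entirely, because the geometric progression has the closed form sum_{k=0}^{b-1} (k+1)*2^k = (b-1)*2^b + 1:
#include <math.h>
/* loop-free variant of digits() above, using the closed form */
int digits_closed(int a) {
    if (a == 0) return 1;
    int b = (int)floor(log2(a));
    return 1                            /* digits(0) */
         + (b + 1) * (a - (1 << b) + 1) /* values 2^b .. a */
         + (b - 1) * (1 << b) + 1;      /* values 1 .. 2^b - 1, closed form */
}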
As efficiency depends on the tools available, one approach would be doing it "analog":
#include <stdlib.h>
#include <stdio.h>
#include <math.h>
unsigned long long pow2sum_min(unsigned long long n, long long unsigned m)
{
if (m >= n)
{
return 1;
}
--n;
return (2ULL << n) + pow2sum_min(n, m);
}
#define LN(x) (log2(x)/log2(M_E))
int main(int argc, char** argv)
{
if (2 >= argc)
{
fprintf(stderr, "%s a b\n", argv[0]);
exit(EXIT_FAILURE);
}
long a = atol(argv[1]), b = atol(argv[2]);
if (0L >= a || 0L >= b || b < a)
{
puts("Na ...!");
exit(EXIT_FAILURE);
}
/* Expand interval to cover full powers of two: */
unsigned long long a_c = pow(2, floor(log2(a)));
unsigned long long b_c = pow(2, floor(log2(b+1)) + 1);
double log2_a_c = log2(a_c);
double log2_b_c = log2(b_c);
unsigned long p2s = pow2sum_min(log2_b_c, log2_a_c) - 1;
/* Integral log2(x) between a_c and b_c: */
double A = ((b_c * (LN(b_c) - 1))
- (a_c * (LN(a_c) - 1)))/LN(2)
+ (b+1 - a);
/* "Integer"-integral - integral of log2(x)'s inverse function (2**x) between log(a_c) and log(b_c): */
double D = p2s - (b_c - a_c)/LN(2);
/* Corrective from a_c/b_c to a/b : */
double C = (log2_b_c - 1)*(b_c - (b+1)) + log2_a_c*(a - a_c);
printf("Total used digits: %lld\n", (long long) ((A - D - C) +.5));
}
:-)
The main thing here is the number and kind of iterations done. The number of iterations is
log(floor(b_c)) - log(floor(a_c))
and each iteration does one
n - 1 /* integer decrement */
and one
2**n + s /* one bit-shift and one integer addition */
Here's an entirely look-up based approach. You don't even need the log2 :)
Algorithm
First we precompute the interval limits where the number of bits changes and create a lookup table. In other words we create an array limits[n], where limits[i] gives us the biggest integer that can be represented with (i+1) bits. Our array is then {1, 3, 7, ..., 2^n-1}.
Then, when we want to determine the sum of bits for our range, we must first match our range limits a and b with the smallest index for which a <= limits[i] and b <= limits[j] holds, which will then tell us that we need (i+1) bits to represent a, and (j+1) bits to represent b.
If the indexes are the same, then the result is simply (b-a+1)*(i+1), otherwise we must separately get the number of bits from our value to the edge of same number of bits interval, and add up total number of bits for each interval between as well. In any case, simple arithmetic.
Code
#include <stdio.h>
#include <limits.h>
#include <time.h>
unsigned long bitsnumsum(unsigned int a, unsigned int b)
{
// generate lookup table
// limits[i] is the max. number we can represent with (i+1) bits
static const unsigned int limits[32] =
{
#define LTN(n) n*2u-1, n*4u-1, n*8u-1, n*16u-1, n*32u-1, n*64u-1, n*128u-1, n*256u-1
LTN(1),
LTN(256),
LTN(256*256),
LTN(256*256*256)
};
// make it work for any order of arguments
if (b < a) {
unsigned int c = a;
a = b;
b = c;
}
// find interval of a
unsigned int i = 0;
while (a > limits[i]) {
++i;
}
// find interval of b
unsigned int j = i;
while (b > limits[j]) {
++j;
}
// add it all up
unsigned long sum = 0;
if (i == j) {
// a and b in the same range
// conveniently, this also deals with j == 0
// so no danger to do [j-1] below
return (i+1) * (unsigned long)(b - a + 1);
}
else {
// add sum of digits in range [a, limits[i]]
sum += (i+1) * (unsigned long)(limits[i] - a + 1);
// add sum of digits in range (limits[j-1], b]
sum += (j+1) * (unsigned long)(b - limits[j-1]);
// add sum of digits in the ranges (limits[i-1], limits[i]] in between
for (++i; i<j; ++i) {
sum += (i+1) * (unsigned long)(limits[i] - limits[i-1]);
}
return sum;
}
}
int main(void)
{
clock_t start, end;
unsigned int a=0, b=UINT_MAX;
start = clock();
printf("Sum of binary digits for numbers in range "
"[%u, %u]: %lu\n", a, b, bitsnumsum(a, b));
end = clock();
printf("Time: %fs\n", ((double)(end-start))/CLOCKS_PER_SEC);
}
Output
$ ./lookup
Sum of binary digits for numbers in range [0, 4294967295]: 133143986178
Time: 0.000282s
Algorithm
The main idea is to find n2 = log2(x) rounded down; n2 + 1 is then the number of digits in x. Let pow2 = 1 << n2. (n2 + 1) * (x - pow2 + 1) is the number of digits used by the values [pow2...x]. Now add the sum of digits for the values below pow2, stepping down through the powers of 2.
Code
I am certain various simplifications can be made.
Untested code. Will review later.
// Let us use unsigned for everything.
unsigned ulog2(unsigned value) {
unsigned result = 0;
if (0xFFFF0000u & value) {
value >>= 16; result += 16;
}
if (0xFF00u & value) {
value >>= 8; result += 8;
}
if (0xF0u & value) {
value >>= 4; result += 4;
}
if (0xCu & value) {
value >>= 2; result += 2;
}
if (0x2 & value) {
value >>= 1; result += 1;
}
return result;
}
unsigned bit_count_helper(unsigned x) {
    // total number of binary digits used by the values 0..x
    if (x == 0) {
        return 1;
    }
    unsigned n2 = ulog2(x);
    unsigned pow2 = 1u << n2;
    unsigned sum = (n2 + 1) * (x - pow2 + 1u); // values pow2..x use n2+1 digits
    while (n2 > 0) {
        // ... + 5*16 + 4*8 + 3*4 + 2*2 + 1*1
        pow2 /= 2;
        sum += n2 * pow2; // values pow2..2*pow2-1 use n2 digits
        n2--;
    }
    return sum + 1; // one more digit for the value 0
}
unsigned bit_count(unsigned a, unsigned b) {
    assert(a <= b); // needs <assert.h>
    return bit_count_helper(b) - (a ? bit_count_helper(a - 1) : 0);
}
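A quick check of the repaired code against the examples from this page (my addition): the interval [0, 4] must give 9 and [4, 7] must give 12.
#include <assert.h>
#include <stdio.h>
/* assumes ulog2(), bit_count_helper() and bit_count() from above */
int main (void) {
    assert (bit_count (0, 4) == 9);  /* 1+1+2+2+3, from the question */
    assert (bit_count (4, 7) == 12); /* 3+3+3+3 */
    printf ("ok\n");
    return 0;
}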
For this problem your solution is the simplest, the so-called "naive" one, where you visit every element of the sequence (in your case, the interval) to check something or execute operations.
Naive Algorithm
Assuming that a and b are positive integers with b greater than a, let's call the size of the interval [a,b] n = (b-a).
Having our number of elements n and using big-O notation, the worst case cost is O(n*(numberOfBits_cost)).
From this we can see that we can speed up our algorithm either by using a faster algorithm for computing numberOfBits(), or by finding a way to avoid looking at every element of the interval, which costs us n operations.
Intuition
Now looking at a possible interval [6,14] you can see that 6 and 7 need 3 digits, while 4 digits are needed for 8, 9, 10, 11, 12, 13, 14. Your code calls numberOfBits() for every number that uses the same number of digits, while the following multiplication per sub-interval would be faster:
(number_in_subinterval)*digitsForThisInterval
((14-8)+1)*4 = 28
((7-6)+1)*3 = 6
So we reduced a loop over 9 elements, taking 9 operations, to only 2 operations.
Writing a function that uses this intuition gives us an algorithm that is more efficient in time, though not necessarily in memory. Using your numberOfBits() function I have created this solution:
int intuitionSol(int a, int b){
int digitsForA = numberOfBits(a);
int digitsForB = numberOfBits(b);
if(digitsForA != digitsForB){
// because a or b may not be the first or last element of the interval
// that a given number of digits can represent, we need some correction
// operations on a and b first
int tmp = pow(2,digitsForA) - a;
int result = tmp*digitsForA; // will contain the final result that will be returned
int i;
for(i = digitsForA + 1; i < digitsForB; i++){
int interval_elements = pow(2,i) - pow(2,i-1);
result = result + ((interval_elements) * i);
//printf("NumOfElem: %i for %i digits; sum:= %i\n", interval_elements, i, result);
}
int tmp1 = ((b + 1) - pow(2,digitsForB-1));
result = result + tmp1*digitsForB;
return result;
}
else {
int elements = (b - a) + 1;
return elements * digitsForA; // or digitsForB
}
}
Let's look at the cost: this algorithm's cost is that of the correction operations on a and b plus the most expensive part, the for-loop. In my solution, however, I'm not looping over all elements but only over numberOfBits(b)-numberOfBits(a) values, which in the worst case, [0,n], becomes log(n)-1, equivalent to O(log n).
To summarize, we went from a linear cost O(n) to a logarithmic one O(log n) in the worst case; a plot of n against log n shows the difference between the two.
Note
When I talk about interval or sub-interval I refer to the interval of elements that use the same number of digits to represent the number in binary.
Here are some outputs from my tests, with the last one showing the difference:
Considered interval is [0,4]
YourSol: 9 in time: 0.000015s
IntuitionSol: 9 in time: 0.000007s
Considered interval is [0,0]
YourSol: 1 in time: 0.000005s
IntuitionSol: 1 in time: 0.000005s
Considered interval is [4,7]
YourSol: 12 in time: 0.000016s
IntuitionSol: 12 in time: 0.000005s
Considered interval is [2,123456]
YourSol: 1967697 in time: 0.005010s
IntuitionSol: 1967697 in time: 0.000015s
I am trying to solve Project Euler+ #97 from Hackerrank. The problem asks for the last 12 digits of A x B ** C + D. My attempt was to use modular exponentiation mod 10 ** 12 from Wikipedia in order to efficiently calculate the last 12 digits and avoid overflow. However, for all cases aside from the sample 2 x 3 ** 4 + 5 I am getting wrong answers. According to the constraints there should be no overflow for unsigned long long.
The problem:
Now we want to learn how to calculate some last digits of such big numbers. Let's assume we have a lot of numbers A x B ** C + D and we want to know last 12 digits of these numbers.
Constraints:
1 ≤ T ≤ 500000
1 ≤ A, B, C, D ≤ 10 ** 9
Input: First line contains one integer T - the number of tests.
T lines follow containing 4 integers (A, B, C and D) each.
Output: Output exactly one line containing exactly 12 digits - the last 12 digits of the sum of all results. If the sum is less than 10 ** 12 print corresponding number of leading zeroes then.
My attempt in C
#include <stdio.h>
int main() {
const unsigned long long limit = 1000000000000;
int cases;
for (scanf("%d", &cases); cases; --cases) {
// mult = A, base = B, exp = C, add = D
unsigned long long mult, base, exp, add;
scanf("%llu %llu %llu %llu", &mult, &base, &exp, &add);
base = base % limit;
while (exp) {
if (exp & 1) {
mult = (mult * base) % limit;
}
exp >>= 1;
base = (base * base) % limit;
}
printf("%012llu\n", (mult + add) % limit);
}
return 0;
}
I think you can overflow the unsigned long long math (i.e., modulo 2^64), because the computation of base in your inner loop can get as high as (10^12 - 1)^2 ~= 10^24 ~= 2^79.726, which is much more than 2^64. For example, think about B = 10^6 - 1 and C = 4.
On my MacBook Pro running a 64b version of Mac OS X with clang 8.1.0:
#include <stdio.h>
int main()
{
fprintf(stdout, "sizeof(unsigned long long) = %u\n", (unsigned) sizeof(unsigned long long));
fprintf(stdout, "sizeof(__uint128_t) = %u\n", (unsigned) sizeof(__uint128_t));
fprintf(stdout, "sizeof(long double) = %u\n", (unsigned) sizeof(long double));
return 0;
}
Says:
sizeof(unsigned long long) = 8
sizeof(__uint128_t) = 16
sizeof(long double) = 16
If your platform says 16 or 10 instead for long long, then I think you are in the clear. If it says 8 like mine does, then you need to rework your answer to either force 128b (or 80b) integer math natively or mimic it some other way.
You can try __uint128_t, which is supported by gcc and clang. Otherwise, you'd need to resort to something like long double and fmodl(), which might have enough mantissa bits but might not give exact answers like you want.
Also, you don't accumulate multiple results like the task says. Here's my shot at it, based on your program, but using __uint128_t.
#include <stdio.h>
#include <stdlib.h>
#define BILLION 1000000000
#define TRILLION 1000000000000
int main()
{
const __uint128_t limit = TRILLION;
unsigned long cases = 0;
__uint128_t acc = 0;
if (scanf("%lu", &cases) != 1 || cases == 0 || cases > 500000)
abort();
while (cases-- > 0)
{
unsigned long a, b, c, d;
__uint128_t b2c = 1, bbase;
if (scanf("%lu %lu %lu %lu", &a, &b, &c, &d) != 4 ||
a == 0 || a > BILLION || b == 0 || b > BILLION ||
c == 0 || c > BILLION || d == 0 || d > BILLION)
abort();
for (bbase = b; c > 0; c >>= 1)
{
if ((c & 0x1) != 0)
b2c = (b2c * bbase) % limit; // 64b overflow: ~10^12 * ~10^12 ~= 10^24 > 2^64
bbase = (bbase * bbase) % limit; // same overflow issue as above
}
// can do modulus on acc only once at end of program instead because
// 5 * 10^5 * (10^9 * (10^12 - 1) + 10^9) = 5 * 10^26 < 2^128
acc += a * b2c + d;
}
acc %= limit;
printf("%012llu\n", (unsigned long long) acc);
return 0;
}
I would like to compare two positive integers and add a comparison sign between them. I may not use any logical, relational, or bitwise operators, and no if-then-else, while loop, or ternary operator.
I found the max and min of these two numbers.
How can I preserve the order and still insert the comparison sign? Any ideas?
E.g.:
4 6 was entered by user output must be 4 < 6
10 2 was entered by user output must be 10 > 2
2 2 was entered by user output must be 2 = 2
f1 = x / y;
f2 = y / x;
f1 = (f1 + 2) % (f1 + 1);
f2 = (f2 + 2) % (f2 + 1);
max = f1 * x + f2 * y ;
max = max / (f1 + f2);
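As an aside, here is an annotated version of that max-finding trick (the function name and comments are mine, not the asker's); it relies on (f + 2) % (f + 1) collapsing every non-zero f to 1 while leaving 0 at 0:
unsigned max_no_compare (unsigned x, unsigned y) /* assumes x, y > 0 */
{
    unsigned f1 = x / y;      /* non-zero iff x >= y */
    unsigned f2 = y / x;      /* non-zero iff y >= x */
    f1 = (f1 + 2) % (f1 + 1); /* 0 stays 0; anything else becomes 1 */
    f2 = (f2 + 2) % (f2 + 1);
    /* x if f1:f2 == 1:0, y if 0:1, and (x + y) / 2 == x == y if 1:1 */
    return (f1 * x + f2 * y) / (f1 + f2);
}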
You can use an array of char:
#include <stdio.h>
int main(void)
{
unsigned a, b;
scanf("%u %u", &a, &b);
size_t cmp = (_Bool)(a / b) - (_Bool)(b / a);
char relation = "<=>"[cmp + 1];
printf("%u %c %u\n", a, relation, b);
return 0;
}
This approach doesn't require min and max to be found first.
Explanation:
(_Bool)exp will be 1 if exp is non-zero, and 0 if exp equals to 0.
Since a and b are positive integers, a / b will be 0 when a < b, and non-zero when a >= b. See the truth table below for details.
+-------+----------------+----------------+---------------------------------+
|       | (_Bool)(a / b) | (_Bool)(b / a) | (_Bool)(a / b) - (_Bool)(b / a) |
+=======+================+================+=================================+
| a > b | 1              | 0              | 1                               |
+-------+----------------+----------------+---------------------------------+
| a = b | 1              | 1              | 0                               |
+-------+----------------+----------------+---------------------------------+
| a < b | 0              | 1              | -1                              |
+-------+----------------+----------------+---------------------------------+
As a result, cmp evaluates to -1, 0, 1, like a typical comparison function. And thus, cmp + 1 will conveniently lead to 0, 1, 2 valid array indexes.
Thanks @janos for his help.
Edit:
As #chux carefully points out,
OP stated "I may not use any logical, ... operators". The C spec has
"... the logical negation operator !...". §6.5.3.3 5. Using ! may not
meet OP's goals.
So I changed !!exp to (_Bool)exp to meet OP's demand.
Edit II:
OP commented:
Thanks. This does not work when one of the inputs is 0.
But isn't the input numbers granted to be positive? Well, to handle zeros you can use size_t cmp = (_Bool)((a + (_Bool)(a - UINT_MAX)) / (b + (_Bool)(b - UINT_MAX))) - (_Bool)((b + (_Bool)(b - UINT_MAX)) / (a + (_Bool)(a - UINT_MAX)));. Don't forget to #include <limits.h>.
EDIT III (The last edit, I hope):
#include <stdio.h>
#include <limits.h>
#define ISEQUAL(x, y) (_Bool)((_Bool)((x) - (y)) - 1) // 1 if x == y, 0 if x != y
#define NOTEQUAL(x, y) (_Bool)((x) - (y)) // 0 if x == y, 1 if x != y
int main(void)
{
unsigned a, b;
printf("%u\n", UINT_MAX);
scanf("%u %u", &a, &b);
_Bool hasZero = NOTEQUAL(ISEQUAL(a, 0) + ISEQUAL(b, 0), 0);
_Bool hasMax = NOTEQUAL(ISEQUAL(a, UINT_MAX) + ISEQUAL(b, UINT_MAX), 0);
int hasBoth = ISEQUAL(hasZero + hasMax, 2);
int cmp = (_Bool)((a + hasZero + hasBoth) / (b + hasZero + hasBoth))\
- (_Bool)((b + hasZero + hasBoth) / (a + hasZero + hasBoth));
// "+ hasZero + hasBoth" to avoid div 0: UINT_MAX -> 1, while 0 -> 2.
hasBoth = 1 - hasBoth * 2; // 1 if hasBoth == 0, or -1 if hasBoth == 1
char relation = "<=>"[hasBoth * cmp + 1]; // reverse if has both 0 and UINT_MAX
printf("%u %c %u\n", a, relation, b);
return 0;
}
Fixed a bug when a == UINT_MAX - 1 and b == UINT_MAX, as @chux points out.
Used macro to improve readability.
Added some comments.
As OP has x, y and has computed their minimum min and maximum max
void prt(unsigned x, unsigned y, unsigned min, unsigned max) {
// min not used
unsigned cmp = 1 + x/max - y/max;
printf("%u %c %u\n", x, "<=>"[cmp], y);
}
#include <stdio.h>
static unsigned int cmpgt(const unsigned int a, const unsigned int b)
{
return b?(a/b ? (a-b):0):a;
// if B is 0, then return A. non zero A will be treated as true
// if a is zero then is false
// if b is not zero then do a/b, if non zero then return (a-b)
// non zero (a-b) will be treated as true
// if (a-b) is zero then will be treated as false
//
// This is a very ugly way of implementing operator >
// There are other ways to do it
// But the point is, you need operator >, but you can not use it
// ( for whatever reason), then you just make it, which is doable
}
static const char *mark(const unsigned int a, const unsigned int b)
{
return cmpgt(a, b)?">":(cmpgt(b,a)?"<":"=");
// no if-else, but ternary operator is a good alternative
// so those are two nested operator ?:
// basically :
// if a>b then return ">"
// else if a<b return "<"
// else return "="
// with cmpgt/operator > implemented, this is a lot easier
}
int main(void) {
const int input[] = {1,3,4,5,5,2,3,4}; //test input
size_t input_size = sizeof(input)/sizeof(int);
for (size_t i=0;cmpgt(input_size-1, i);i++){
// while loop is banned, but for loop is still usable
// the loop condition is handled by cmpgt
printf("%d %s ",input[i],mark(input[i], input[i+1]));
}
printf("%d\n", input[input_size-1]);
return 0;
}
sample output:
1 < 3 < 4 < 5 = 5 > 2 < 3 < 4
https://ideone.com/dFHWnn
A simple compare is to use <=, >= and then look up the compare character from a string.
void cmp1(unsigned x, unsigned y) {
int cmp = (x >= y) - (x <= y);
printf("%u %c %u\n", x, "<=>"[cmp + 1], y);
}
Yet since we cannot use various operators, etc., all we need to do is replace >=.
_Bool foo_ge(unsigned x, unsigned y) {
_Bool yeq0 = 1 - (_Bool)y; // y == 0?
_Bool q = (x + yeq0)/(y + yeq0); // Offset both x,y, by yeq0
return q + yeq0;
}
void cmp2(unsigned x, unsigned y) {
int cmp = foo_ge(x,y) - foo_ge(y,x);
printf("%u %c %u\n", x, "<=>"[cmp + 1], y);
}
Heavy use of _Bool credit to @sun qingyao.
I'm implementing an algorithm in C that needs to do modular addition and subtraction quickly on unsigned integers and can handle overflow conditions correctly. Here's what I have now (which does work):
/* a and/or b may be greater than m */
uint32_t modadd_32(uint32_t a, uint32_t b, uint32_t m) {
uint32_t tmp;
if (b <= UINT32_MAX - a)
return (a + b) % m;
if (m <= (UINT32_MAX>>1))
return ((a % m) + (b % m)) % m;
tmp = a + b;
if (tmp > (uint32_t)(m * 2)) // m*2 must be truncated before compare
tmp -= m;
tmp -= m;
return tmp % m;
}
/* a and/or b may be greater than m */
uint32_t modsub_32(uint32_t a, uint32_t b, uint32_t m) {
uint32_t tmp;
if (a >= b)
return (a - b) % m;
tmp = (m - ((b - a) % m)); /* results in m when 0 is needed */
if (tmp == m)
return 0;
return tmp;
}
Anybody know of a better algorithm? The libraries I've found that do modular arithmetic all seem to be for large arbitrary precision numbers which is way overkill.
Edit: I want this to run well on a 32 bit machine. Also, my existing functions are trivially converted to work on other sizes of unsigned integers, a property which would be nice to retain.
Modular operations usually assume that a and b are less than m. This allows simpler algorithms:
umod_t sub_mod(umod_t a, umod_t b, umod_t m)
{
if ( a>=b )
return a - b;
else
return m - b + a;
}
umod_t add_mod(umod_t a, umod_t b, umod_t m)
{
if ( 0==b ) return a;
// return sub_mod(a, m-b, m);
b = m - b;
if ( a>=b )
return a - b;
else
return m - b + a;
}
Source: Matters Computational, chapter 39.1.
I'd just do the arithmetic in uint32_t if it fits and in uint64_t otherwise.
uint32_t modadd_32(uint32_t a, uint32_t b, uint32_t m) {
if (b <= UINT32_MAX - a)
return (a + b) % m;
else
return ((uint64_t)a + b) % m;
}
On an architecture with 64-bit integer types, this should be almost no overhead; you could even think of just doing everything in uint64_t. On architectures where uint64_t is synthesized from 32-bit operations, let the compiler decide what it thinks is best, and then look at the generated assembler and measure to see if this is satisfactory.
Overflow-safe modular addition
First establish that a<m and b<m with the usual % m.
Add updated a and b.
Should a (or b) exceed the uintN_t sum, then the mathematical sum overflowed uintN_t, and a subtraction of m will "mod" the mathematical sum into the range of uintN_t.
If the sum is at least m, then as in the above step, a single subtraction of m will "mod" the sum.
uintN_t modadd_N(uintN_t a, uintN_t b, uintN_t m) {
// may omit these 2 steps if a < m and b < m are known before the call.
a %= m;
b %= m;
uintN_t sum = a + b;
if (sum >= m || sum < a) {
sum -= m;
}
return sum;
}
Quite simple in the end.
Overflow-safe modular subtraction
Variation on @Evgeny Kluev's good answer.
uintN_t modsub_N(uintN_t a, uintN_t b, uintN_t m) {
// may omit these 2 steps if a < m and b < m are known before the call.
a %= m;
b %= m;
uintN_t diff = a - b;
if (a < b) {
diff += m;
}
return diff;
}
Note this approach works for various N such as 32, 64, 16 or unsigned, unsigned long, etc. without resorting to wider types. It also works for unsigned types narrower than int/unsigned.
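For concreteness, a hedged usage sketch instantiating the N = 32 variants above (the typedef and values are mine); the operands sit near the top of the 32-bit range, where a naive (a + b) % m would silently wrap:
#include <stdint.h>
#include <stdio.h>
typedef uint32_t uintN_t; /* instantiate the sketches above for N = 32 */
/* ... modadd_N() and modsub_N() exactly as defined above ... */
int main (void) {
    uintN_t m = 4000000000u; /* modulus larger than UINT32_MAX / 2 */
    uintN_t a = 3999999999u, b = 3999999998u;
    printf ("%u\n", modadd_N (a, b, m)); /* 3999999997, despite a + b wrapping */
    printf ("%u\n", modsub_N (b, a, m)); /* 3999999999 == m - 1 */
    return 0;
}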