A few days ago I had an interview with Qualcomm. I got stuck on one question. It looked very simple, but neither the interviewer nor I was satisfied with my answers. Can anyone provide a good solution to this problem?
The question is:
Multiply 2 numbers without using any loops and additions and of course no multiplication and division.
To which I replied: recursion
He asked for something else, at a very low level.
The first thing that came to mind was bit shifting, but shifting only multiplies a number by a power of 2; for other numbers we still have to do an addition at the end.
For example, 10 * 7 can be done as follows (7 is 111 in binary):
(10 << 2) + (10 << 1) + 10
40 + 20 + 10 = 70
But again addition was not allowed.
Any thoughts on this, guys?
Here is a solution just using lookup, addition and shifting. The lookup does not require multiplication as it is an array of pointers to another array - hence addition required to find the right array. Then using the second value you can repeat pointer arithmetic and get the lookup result.
#include <stdio.h>
#include <stdlib.h>
int main(int argc, char **argv)
{
/* Note:As this is an array of pointers to an array of values, addition is
only required for the lookup.
i.e.
First part: lookup + a value -> A pointer to an array
Second part - Add a value to the pointer to above pointer to get the value
*/
unsigned char lookup[16][16] = {
{ 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0 },
{ 0, 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14, 15 },
{ 0, 2, 4, 6, 8, 10, 12, 14, 16, 18, 20, 22, 24, 26, 28, 30 },
{ 0, 3, 6, 9, 12, 15, 18, 21, 24, 27, 30, 33, 36, 39, 42, 45 },
{ 0, 4, 8, 12, 16, 20, 24, 28, 32, 36, 40, 44, 48, 52, 56, 60 },
{ 0, 5, 10, 15, 20, 25, 30, 35, 40, 45, 50, 55, 60, 65, 70, 75 },
{ 0, 6, 12, 18, 24, 30, 36, 42, 48, 54, 60, 66, 72, 78, 84, 90 },
{ 0, 7, 14, 21, 28, 35, 42, 49, 56, 63, 70, 77, 84, 91, 98, 105 },
{ 0, 8, 16, 24, 32, 40, 48, 56, 64, 72, 80, 88, 96, 104, 112, 120 },
{ 0, 9, 18, 27, 36, 45, 54, 63, 72, 81, 90, 99, 108, 117, 126, 135 },
{ 0, 10, 20, 30, 40, 50, 60, 70, 80, 90, 100, 110, 120, 130, 140, 150 },
{ 0, 11, 22, 33, 44, 55, 66, 77, 88, 99, 110, 121, 132, 143, 154, 165 },
{ 0, 12, 24, 36, 48, 60, 72, 84, 96, 108, 120, 132, 144, 156, 168, 180 },
{ 0, 13, 26, 39, 52, 65, 78, 91, 104, 117, 130, 143, 156, 169, 182, 195 },
{ 0, 14, 28, 42, 56, 70, 84, 98, 112, 126, 140, 154, 168, 182, 196, 210 },
{ 0, 15, 30, 45, 60, 75, 90, 105, 120, 135, 150, 165, 180, 195, 210, 225 }
};
unsigned short answer, mult;
unsigned char x, y, a, b;
if (argc < 3) {
fprintf(stderr, "usage: %s x y\n", argv[0]);
return 1;
}
x = (unsigned char)atoi(argv[1]);
y = (unsigned char)atoi(argv[2]);
printf("Multiply %d by %d\n", x, y);
answer = 0;
/* First nibble of x, First nibble of y */
a = x & 0xf;
b = y & 0xf;
mult = lookup[a][b];
answer += mult;
printf("Looking up %d, %d get %d - Answer so far %d\n", a, b, mult, answer);
/* First nibble of x, Second nibble of y */
a = x & 0xf;
b = (y & 0xf0) >> 4;
mult = lookup[a][b];
answer += mult << 4;
printf("Looking up %d, %d get %d - Answer so far %d\n", a, b, mult, answer);
/* Second nibble of x, First nibble of y */
a = (x & 0xf0) >> 4;
b = y & 0xf;
mult = lookup[a][b];
answer += mult << 4;
printf("Looking up %d, %d get %d - Answer so far %d\n", a, b, mult, answer);
/* Second nibble of x, Second nibble of y */
a = (x & 0xf0) >> 4;
b = (y & 0xf0) >> 4;
mult = lookup[a][b];
answer += mult << 8;
printf("Looking up %d, %d get %d - Answer so far %d\n", a, b, mult, answer);
return 0;
}
Perhaps you could recursively add, using bitwise operations as a replacement for the addition operator. See: Adding Two Numbers With Bitwise and Shift Operators
You can separate your problems by first implementing the addition and then the multiplication based on the addition.
For the addition, implement what they do on processors at the gate level using the C bitwise operators:
http://en.wikipedia.org/wiki/Full_adder
Then for the multiplication, with the addition you implemented, use goto statements and labels so no loop statement (the for, while and do iteration statements) will be used.
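As a rough sketch of the two steps described above (bitwise addition first, then multiplication built on it with goto instead of a loop statement): the names add_bits and mul_bits are made up for illustration, and the code assumes unsigned operands so that overflow simply wraps.

```c
/* Add without the + operator: XOR yields the sum bits and
 * (a & b) << 1 yields the carries; repeat until no carry is left. */
static unsigned add_bits(unsigned a, unsigned b)
{
again:
    if (b != 0) {
        unsigned carry = (a & b) << 1;
        a ^= b;
        b = carry;
        goto again;   /* goto instead of a for/while/do statement */
    }
    return a;
}

/* Shift-and-add multiplication built on add_bits. */
static unsigned mul_bits(unsigned a, unsigned b)
{
    unsigned result = 0;
again:
    if (b != 0) {
        if (b & 1u)
            result = add_bits(result, a);   /* this bit of b is set */
        a <<= 1;
        b >>= 1;
        goto again;
    }
    return result;
}
```

For example, mul_bits(10u, 7u) evaluates to 70. Whether an interviewer would accept goto as "no loops" is a separate question.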
What about russian peasant multiplication without using addition? Is there an easy way (a few lines, no loops) to simulate addition using only AND, OR, XOR and NOT?
You can implement addition with bitwise operators. But if you also want to avoid loops, you will have to write a lot of code. (I once implemented multiplication without arithmetic operators, but I used a loop, shifting the index until it became zero. If that would help, tell me and I will look for the file.)
You could use logarithms and subtraction instead.
log(a*b) = log(a) + log(b)
a+b = -(-a-b)
exp(log(a)) = a
round(exp(-(-log(a)-log(b))))
How about multiplication tables?
Question: Multiply 2 numbers without using any loops and additions and of course no multiplication and division.
Multiplication is defined in terms of addition. It is impossible not to find addition in an implementation of multiplication.
Arbitrary precision numbers cannot be multiplied without loop/recursion.
Multiplication of two numbers of fixed bit-lengths can be implemented via a table lookup. The problem is the size of the table. Generating the table requires addition.
The answer is: It cannot be done.
I've created a function to reverse an array in C. The first run of the function will reverse the array perfectly. Runs following that will fill the first and/or last index of the array with seemingly random info. I'm assuming this may have to do with the array accidentally receiving information from surrounding memory addresses after several runs.
My function is written as follows:
void reverse_array(int array[], int* result, int size) {
int index = 0;
int reverse_index;
for (reverse_index = size - 1; reverse_index > 0; reverse_index--) {
result[index] = array[reverse_index];
index++;
}
}
Running the following code with "nums" being an array filled with integers 0 to 99, the expected output is returned:
int main(void) {
int size = 100;
int nums[size];
int i;
for (i = 0; i < size; i++) {
nums[i] = i;
}
int reversed[size];
reverse_array(nums, reversed, size);
print_array(reversed, size);
return 0;
}
The problem arises when I try and run the function several times consecutively. In the following code, I attempt to flip an array several times:
int main(void) {
int size = 100;
int nums[size];
int i;
for (i = 0; i < size; i++) {
nums[i] = i;
}
int reversed[size];
reverse_array(nums, reversed, size);
int reversed2[size];
reverse_array(reversed, reversed2, size);
int reversed3[size];
reverse_array(reversed2, reversed3, size);
int reversed4[size];
reverse_array(reversed3, reversed4, size);
int reversed5[size];
reverse_array(reversed4, reversed5, size);
print_array(reversed5, size);
return 0;
}
Printing out the second flip, or "reversed2," returns the following:
0, 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14, 15, 16, 17, 18, 19, 20,
21, 22, 23, 24, 25, 26, 27, 28, 29, 30, 31, 32, 33, 34, 35, 36, 37, 38, 39, 40,
41, 42, 43, 44, 45, 46, 47, 48, 49, 50, 51, 52, 53, 54, 55, 56, 57, 58, 59, 60,
61, 62, 63, 64, 65, 66, 67, 68, 69, 70, 71, 72, 73, 74, 75, 76, 77, 78, 79, 80,
81, 82, 83, 84, 85, 86, 87, 88, 89, 90, 91, 92, 93, 94, 95, 96, 97, 98, 32760,
The output is almost exactly as to be expected, except for the final index. This happens with all following flips, though with different values. Printing the 5th flip, or "reversed5," yields the following:
0, 98, 97, 96, 95, 94, 93, 92, 91, 90, 89, 88, 87, 86, 85, 84, 83, 82, 81, 80,
79, 78, 77, 76, 75, 74, 73, 72, 71, 70, 69, 68, 67, 66, 65, 64, 63, 62, 61, 60,
59, 58, 57, 56, 55, 54, 53, 52, 51, 50, 49, 48, 47, 46, 45, 44, 43, 42, 41, 40,
39, 38, 37, 36, 35, 34, 33, 32, 31, 30, 29, 28, 27, 26, 25, 24, 23, 22, 21, 20,
19, 18, 17, 16, 15, 14, 13, 12, 11, 10, 9, 8, 7, 6, 5, 4, 3, 2, 1, 32760,
I'm very new to C, and can only guess what is causing this. As stated above, I think it may have to do with the array accidentally picking up information from surrounding memory addresses, but that's only a guess. Thanks for reading, any help is greatly appreciated.
Edit: Thank you all very much. I could not ask for more straight forward and helpful responses. I feel kind of dumb now, but that's part of the learning process.
for (reverse_index = size - 1; reverse_index > 0; reverse_index--)
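Your problem is with this line of code: the > should be changed to >=. As written, the loop stops before reverse_index reaches 0, so array[0] is never copied and result[size - 1] is left uninitialized.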
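A minimal sketch of the function with the bound fixed. It keeps the original name and parameters but uses a single index, which is a stylistic assumption rather than something taken from the question:

```c
/* Copy array[] into result[] in reverse order.
 * Every index from 0 to size - 1 is written, including the last one. */
void reverse_array(const int array[], int *result, int size)
{
    for (int i = 0; i < size; i++) {
        result[i] = array[size - 1 - i];
    }
}
```

With this version, reversing an array twice gives back the original contents.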
I have seen floating-point bit hacks that produce the square root, as in "fast floating point square root", but that method works only for floats.
Is there a similar loop-free method for finding the integer square root of a 32-bit unsigned integer? I have been scouring the web for one, but haven't found any.
(My thought is that a pure binary representation doesn't carry enough information to do it, but since the input is constrained to 32 bits I would guess otherwise.)
This answer assumes that the target platform does not have floating-point support, or very slow floating-point support (perhaps via emulation).
As has been pointed out in comments, a count leading zeros (CLZ) instruction can be used to provide the fast log2 functionality that is provided via the exponent part of floating-point operands. CLZ can also be emulated with reasonable efficiency on platforms that don't provide the functionality via an intrinsic, as shown below.
An initial approximation good for a few bits can be pulled from a lookup table (LUT) and then refined by Newton iterations, just as in the floating-point case. One to two iterations will typically be sufficient for a 32-bit integer square root. The ISO C99 code below shows a working example implementation, including an exhaustive test.
#include <stdio.h>
#include <stdlib.h>
#include <stdint.h>
#include <math.h>
uint8_t clz (uint32_t a); // count leading zeros
uint32_t umul_16_16 (uint16_t a, uint16_t b); // 16x16 bit multiply
uint16_t udiv_32_16 (uint32_t x, uint16_t y); // 32/16 bit division
/* LUT for initial square root approximation */
static const uint16_t sqrt_tab[32] =
{
0x0000, 0x0000, 0x0000, 0x0000, 0x0000, 0x0000, 0x0000, 0x0000,
0x85ff, 0x8cff, 0x94ff, 0x9aff, 0xa1ff, 0xa7ff, 0xadff, 0xb3ff,
0xb9ff, 0xbeff, 0xc4ff, 0xc9ff, 0xceff, 0xd3ff, 0xd8ff, 0xdcff,
0xe1ff, 0xe6ff, 0xeaff, 0xeeff, 0xf3ff, 0xf7ff, 0xfbff, 0xffff
};
/* table lookup for initial guess followed by division-based Newton iteration */
uint16_t my_isqrt (uint32_t x)
{
uint16_t q, lz, y, i, xh;
if (x == 0) return x; // early out, code below can't handle zero
// initial guess based on leading 5 bits of argument normalized to 2.30
lz = clz (x);
i = ((x << (lz & ~1)) >> 27);
y = sqrt_tab[i] >> (lz >> 1);
xh = x >> 16; // use for overflow check on divisions
// first Newton iteration, guard against overflow in division
q = 0xffff;
if (xh < y) q = udiv_32_16 (x, y);
y = (q + y) >> 1;
if (lz < 10) {
// second Newton iteration, guard against overflow in division
q = 0xffff;
if (xh < y) q = udiv_32_16 (x, y);
y = (q + y) >> 1;
}
if (umul_16_16 (y, y) > x) y--; // adjust quotient if too large
return y; // (uint16_t)sqrt((double)x)
}
static const uint8_t clz_tab[32] =
{
31, 22, 30, 21, 18, 10, 29, 2, 20, 17, 15, 13, 9, 6, 28, 1,
23, 19, 11, 3, 16, 14, 7, 24, 12, 4, 8, 25, 5, 26, 27, 0
};
/* count leading zeros (for non-zero argument); a machine instruction on many architectures */
uint8_t clz (uint32_t a)
{
a |= a >> 16;
a |= a >> 8;
a |= a >> 4;
a |= a >> 2;
a |= a >> 1;
return clz_tab [0x07c4acdd * a >> 27];
}
/* 16x16->32 bit unsigned multiply; machine instruction on many architectures */
uint32_t umul_16_16 (uint16_t a, uint16_t b)
{
return (uint32_t)a * b;
}
/* 32/16->16 bit division. Note: Will overflow if x[31:16] >= y */
uint16_t udiv_32_16 (uint32_t x, uint16_t y)
{
uint16_t r = x / y;
return r;
}
int main (void)
{
uint32_t x;
uint16_t res, ref;
printf ("testing 32-bit integer square root\n");
x = 0;
do {
ref = (uint16_t)sqrt((double)x);
res = my_isqrt (x);
if (res != ref) {
printf ("error: x=%08x res=%08x ref=%08x\n", x, res, ref);
printf ("exhaustive test FAILED\n");
return EXIT_FAILURE;
}
x++;
} while (x);
printf ("exhaustive test PASSED\n");
return EXIT_SUCCESS;
}
Below are some variants of the code given by @njuffa, including some that are not just loop-free, but also branch-free (the ones that aren't branch free are almost branch free). All of the variants below are derived from the algorithm that's used internally for Python's math.isqrt. The variants are:
32-bit square root: almost branch-free, 1 division
32-bit square root: branch-free, 1 division
32-bit square root: branch-free, 3 divisions, no lookup table
64-bit square root: almost branch-free, 2 divisions
64-bit square root: branch-free, 4 divisions, no lookup table
To finish, we give a very fast division free (and still loop-free, almost branch-free) square root algorithm for unsigned 64-bit inputs. The catch is that it only works for inputs already known to be square; for non-square inputs it will give results that are not useful. It manages to compute each square root in around 3.3 nanoseconds on my 2018 2.7 GHz Intel Core i7 laptop. Scroll all the way down to the bottom of the answer to see it.
32-bit square root: almost branch-free, 1 division
The first variant is the one I'd recommend, all other things being equal. It uses:
a single lookup into a 192-byte lookup table
a single 32-bit-by-16-bit division with 16-bit result
a single 16-bit-by-16-bit multiplication with 32-bit result
a count-leading-zeros operation, and a handful of shifts and other cheap operations
After the early return for the case x == 0, it's essentially branch free. (There's a hint of a branch in the final correction step, but the compilers I tried manage to give jump-free code for this.)
Here's the code. Explanations follow.
#include <stdint.h>
// count leading zeros of nonzero 32-bit unsigned integer
int clz32(uint32_t x);
// isqrt32_tab[k] = isqrt(256*(k+64)-1) for 0 <= k < 192
static const uint8_t isqrt32_tab[192] = {
127, 128, 129, 130, 131, 132, 133, 134, 135, 136,
137, 138, 139, 140, 141, 142, 143, 143, 144, 145,
146, 147, 148, 149, 150, 150, 151, 152, 153, 154,
155, 155, 156, 157, 158, 159, 159, 160, 161, 162,
163, 163, 164, 165, 166, 167, 167, 168, 169, 170,
170, 171, 172, 173, 173, 174, 175, 175, 176, 177,
178, 178, 179, 180, 181, 181, 182, 183, 183, 184,
185, 185, 186, 187, 187, 188, 189, 189, 190, 191,
191, 192, 193, 193, 194, 195, 195, 196, 197, 197,
198, 199, 199, 200, 201, 201, 202, 203, 203, 204,
204, 205, 206, 206, 207, 207, 208, 209, 209, 210,
211, 211, 212, 212, 213, 214, 214, 215, 215, 216,
217, 217, 218, 218, 219, 219, 220, 221, 221, 222,
222, 223, 223, 224, 225, 225, 226, 226, 227, 227,
228, 229, 229, 230, 230, 231, 231, 232, 232, 233,
234, 234, 235, 235, 236, 236, 237, 237, 238, 238,
239, 239, 240, 241, 241, 242, 242, 243, 243, 244,
244, 245, 245, 246, 246, 247, 247, 248, 248, 249,
249, 250, 250, 251, 251, 252, 252, 253, 253, 254,
254, 255,
};
// integer square root of 32-bit unsigned integer
uint16_t isqrt32(uint32_t x)
{
if (x == 0) return 0;
int lz = clz32(x) & 30;
x <<= lz;
uint16_t y = 1 + isqrt32_tab[(x >> 24) - 64];
y = (y << 7) + (x >> 9) / y;
y -= x < (uint32_t)y * y;
return y >> (lz >> 1);
}
Above, we omitted the definition of the function clz32, which can be implemented in exactly the same way as clz in @njuffa's post. With gcc or Clang, you can use something like __builtin_clz in place of clz32. If you have to roll your own, steal it from @njuffa's answer.
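For reference, here is one possible clz32, as a sketch. The GCC/Clang branch assumes unsigned int is 32 bits wide; the fallback branch is a plain binary search on the top bits, not the de Bruijn method from the other answer.

```c
#include <stdint.h>

/* Count leading zeros of a nonzero 32-bit unsigned integer.
 * The result for x == 0 is not meaningful in either branch. */
int clz32(uint32_t x)
{
#if defined(__GNUC__) || defined(__clang__)
    return __builtin_clz(x);   /* assumes a 32-bit unsigned int */
#else
    int n = 0;
    if ((x >> 16) == 0) { n += 16; x <<= 16; }
    if ((x >> 24) == 0) { n += 8;  x <<= 8;  }
    if ((x >> 28) == 0) { n += 4;  x <<= 4;  }
    if ((x >> 30) == 0) { n += 2;  x <<= 2;  }
    if ((x >> 31) == 0) { n += 1; }
    return n;
#endif
}
```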
Explanation: after handling the special case x = 0, we start by normalising x, shifting by an even integer (effectively multiplying by a power of 4) so that 2**30 <= x < 2**32. We then compute the integer square root y for x, and shift y back to compensate just before returning.
This brings us to the central three lines of code, which are the heart of the algorithm:
uint16_t y = 1 + isqrt32_tab[(x >> 24) - 64];
y = (y << 7) + (x >> 9) / y;
y -= x < (uint32_t)y * y;
Remarkably, these three lines are all that's needed to compute an integer square root y of a 32-bit integer x under the assumption that 2**30 <= x < 2**32.
The first of the three lines uses the lookup table to retrieve an approximation to the square root of the topmost 16 bits of x, x >> 16. The approximation will always have an error smaller than 1: that is, after the first line is executed, the condition:
(y-1) * (y-1) < (x >> 16) && (x >> 16) < (y+1) * (y+1)
will always be true.
The second line is the most interesting one: before that line is executed, y is an approximation to the square root of x >> 16, with around 7-8 bits of accuracy. It follows that y << 8 is an approximation to the square root of x, again with around 7-8 accurate bits. So a single Newton iteration applied to y << 8 should give a better approximation to x, roughly doubling the number of accurate bits, and that's exactly what the second line computes. The really beautiful part is that if you work through the mathematics, it turns out that you can prove that the resulting y again has error smaller than 1: that is, the condition
(y-1) * (y-1) < x && x < (y+1) * (y+1)
will be true. That means that either y*y <= x and y is already the integer square root of x, or x < y*y and y - 1 is the integer square root of x. So the third line applies that -1 correction to y in the event that x < y*y.
There's one slightly subtle point above. If you follow through the possible sizes of y as the algorithm progresses: after the lookup we have 128 <= y <= 256. After the second line, mathematically we have 32768 <= y <= 65536. But if y = 65536 then we have a problem, since y can no longer be stored in a uint16_t. However, it's possible to prove that that can't happen: roughly, the only way for y to end up being that large is if the result of the lookup is 256, and in that case, it's straightforward to see that the next line can only produce a maximum of 65535.
32-bit square root: branch-free, 1 division
The algorithm above is tantalisingly close to being completely branch free. Here's a variant that's genuinely branch free, at least with the compilers I tried. It uses the same clz32 and lookup table as the previous example.
uint16_t isqrt32(uint32_t x)
{
int lz = clz32(x | 1) & 30;
x <<= lz;
int index = (x >> 24) - 64;
uint16_t y = 1 + isqrt32_tab[index >= 0 ? index : 0];
y = (y << 7) + (x >> 9) / y;
y -= x < (uint32_t)y * y;
return y >> (lz >> 1);
}
Here we're simply letting the special case x == 0 propagate through the algorithm as usual. We compute clz32(x | 1) in place of clz32(x), partly because clz32(0) may not be well-defined (it isn't for gcc's __builtin_clz, for example), and partly because even if it were well-defined, the resulting shift of 32 would give us undefined behaviour in x <<= lz on the following line. And we have to adjust our lookup to correct for a possibly negative lookup index. (We could also extend the lookup table from 192 entries to 256 entries just to accommodate this case, but that seems wasteful.) Most platforms should have compilers clever enough to turn index >= 0 ? index : 0 into something branch-free. Some architectures even provide a saturating subtraction instruction.
32-bit square root: branch-free, 3 divisions, no lookup table
The next variant needs no lookup table, but uses three divisions instead of one.
uint16_t isqrt32(uint32_t x)
{
int lz = clz32(x | 1) & 30;
x <<= lz;
uint32_t y = 1 + (x >> 30);
y = (y << 1) + (x >> 27) / y;
y = (y << 3) + (x >> 21) / y;
y = (y << 7) + (x >> 9) / y;
y -= x < (uint32_t)y * y;
return y >> (lz >> 1);
}
The idea is exactly the same as before, except that we start with smaller approximations.
After the initial line uint32_t y = 1 + (x >> 30);, y is an approximation to the square root of x >> 28.
After y = (y << 1) + (x >> 27) / y;, y is an approximation to the square root of the eight topmost significant bits of x, x >> 24.
After y = (y << 3) + (x >> 21) / y;, y is an approximation to the square root of the sixteen topmost significant bits of x, x >> 16.
The rest of the algorithm proceeds as before.
64-bit square root: almost branch-free, 2 divisions
The OP asked for an unsigned 32-bit square root, but the technique above works just as well for computing the integer square root of an unsigned 64-bit integer.
Here's some code for that case, which is essentially the same code that Python's math.isqrt uses for inputs smaller than 2**64 (and also to provide the starting guesses for larger inputs).
#include <stdint.h>
// count leading zeros of nonzero 64-bit unsigned integer
int clz64(uint64_t x);
// isqrt64_tab[k] = isqrt(256*(k+65)-1) for 0 <= k < 192
static const uint8_t isqrt64_tab[192] = {
128, 129, 130, 131, 132, 133, 134, 135, 136, 137,
138, 139, 140, 141, 142, 143, 143, 144, 145, 146,
147, 148, 149, 150, 150, 151, 152, 153, 154, 155,
155, 156, 157, 158, 159, 159, 160, 161, 162, 163,
163, 164, 165, 166, 167, 167, 168, 169, 170, 170,
171, 172, 173, 173, 174, 175, 175, 176, 177, 178,
178, 179, 180, 181, 181, 182, 183, 183, 184, 185,
185, 186, 187, 187, 188, 189, 189, 190, 191, 191,
192, 193, 193, 194, 195, 195, 196, 197, 197, 198,
199, 199, 200, 201, 201, 202, 203, 203, 204, 204,
205, 206, 206, 207, 207, 208, 209, 209, 210, 211,
211, 212, 212, 213, 214, 214, 215, 215, 216, 217,
217, 218, 218, 219, 219, 220, 221, 221, 222, 222,
223, 223, 224, 225, 225, 226, 226, 227, 227, 228,
229, 229, 230, 230, 231, 231, 232, 232, 233, 234,
234, 235, 235, 236, 236, 237, 237, 238, 238, 239,
239, 240, 241, 241, 242, 242, 243, 243, 244, 244,
245, 245, 246, 246, 247, 247, 248, 248, 249, 249,
250, 250, 251, 251, 252, 252, 253, 253, 254, 254,
255, 255,
};
// integer square root of a 64-bit unsigned integer
uint32_t isqrt64(uint64_t x)
{
if (x == 0) return 0;
int lz = clz64(x) & 62;
x <<= lz;
uint32_t y = isqrt64_tab[(x >> 56) - 64];
y = (y << 7) + (x >> 41) / y;
y = (y << 15) + (x >> 17) / y;
y -= x < (uint64_t)y * y;
return y >> (lz >> 1);
}
The above code assumes the existence of a function clz64 for counting leading zeros in a uint64_t value. The function clz64 is permitted to have undefined behaviour in the case x == 0. Again, on gcc and Clang, __builtin_clzll should be usable in place of clz64, assuming that unsigned long long has a width of 64. For completeness, here's an implementation of clz64, based on the usual de Bruijn sequence trickery.
#include <assert.h>
static const uint8_t clz64_tab[64] = {
63, 5, 62, 4, 16, 10, 61, 3, 24, 15, 36, 9, 30, 21, 60, 2,
12, 26, 23, 14, 45, 35, 43, 8, 33, 29, 52, 20, 49, 41, 59, 1,
6, 17, 11, 25, 37, 31, 22, 13, 27, 46, 44, 34, 53, 50, 42, 7,
18, 38, 32, 28, 47, 54, 51, 19, 39, 48, 55, 40, 56, 57, 58, 0,
};
// count leading zeros of nonzero 64-bit unsigned integer. Analogous to the
// 32-bit version at
// https://graphics.stanford.edu/~seander/bithacks.html#IntegerLogDeBruijn.
int clz64(uint64_t x)
{
assert(x);
x |= x >> 1;
x |= x >> 2;
x |= x >> 4;
x |= x >> 8;
x |= x >> 16;
x |= x >> 32;
return clz64_tab[(uint64_t)(x * 0x03f6eaf2cd271461u) >> 58];
}
Doing exhaustive testing for a 64-bit input is no longer reasonable without a supercomputer to hand, but checking all inputs of the form s*s, s*s + s, and s*s + 2*s for 0 <= s < 2**32 is feasible, and gives some confidence that the code is working correctly. The following code performs that check. It takes around 148 seconds to run to completion on my Intel Core i7-based laptop, which works out at around 11.5ns per square root computation. That comes down to around 9.2ns per square root if I replace the custom clz64 implementation with __builtin_clzll. (Both of those timings still include the overhead of the testing code, of course.)
#include <stdio.h>
int check_isqrt64(uint64_t x) {
uint64_t y = isqrt64(x);
int y_ok = y*y <= x && x - y*y <= 2*y;
if (!y_ok) {
printf("isqrt64(%llu) returned incorrect answer %llu\n", (unsigned long long)x, (unsigned long long)y);
}
return y_ok;
}
int main(void)
{
printf("Checking isqrt64 for selected values in [0, 2**64) ...\n");
for (uint64_t s = 0; s < 0x100000000u; s++) {
if (!check_isqrt64(s*s)) return 1;
if (!check_isqrt64(s*s + s)) return 1;
if (!check_isqrt64(s*s + 2*s)) return 1;
}
printf("All tests passed\n");
return 0;
}
64-bit square root: branch-free, 4 divisions, no lookup table
The final variant gives a branch-free, lookup-table-free implementation of integer square root for a 64-bit input, just to demonstrate that it's possible. It needs 4 divisions. On most machines, those divisions will be slow enough that this variant is slower than the previous variant.
uint32_t isqrt64(uint64_t x)
{
int lz = clz64(x | 1) & 62;
x <<= lz;
uint32_t y = 2 + (x >> 63);
y = (y << 1) + (x >> 59) / y;
y = (y << 3) + (x >> 53) / y;
y = (y << 7) + (x >> 41) / y;
y = (y << 15) + (x >> 17) / y;
y -= x < (uint64_t)y * y;
return y >> (lz >> 1);
}
64-bit square root of exact square; division-free!
Finally, as promised, here's an algorithm that's a bit different from those above. It computes the square root of an input that's already known to be square, using a Newton iteration for the inverse square root of the input, and operating in the 2-adic domain rather than the real domain. It requires no divisions, but it uses a lookup table, as well as relying on GCC's __builtin_ctzl intrinsic to count trailing zeros efficiently.
#include <stdint.h>
static const uint8_t lut[128] = {
0, 85, 83, 102, 71, 2, 36, 126, 15, 37, 28, 22, 87, 50, 107, 46,
31, 10, 115, 57, 103, 98, 4, 33, 47, 58, 3, 118, 119, 109, 116, 113,
63, 106, 108, 38, 120, 61, 27, 62, 79, 101, 35, 41, 104, 13, 84, 17,
95, 53, 76, 121, 88, 34, 59, 97, 111, 5, 67, 54, 72, 82, 52, 78,
127, 42, 44, 25, 56, 125, 91, 1, 112, 90, 99, 105, 40, 77, 20, 81,
96, 117, 12, 70, 24, 29, 123, 94, 80, 69, 124, 9, 8, 18, 11, 14,
64, 21, 19, 89, 7, 66, 100, 65, 48, 26, 92, 86, 23, 114, 43, 110,
32, 74, 51, 6, 39, 93, 68, 30, 16, 122, 60, 73, 55, 45, 75, 49,
};
uint32_t isqrt64_exact(uint64_t n)
{
uint32_t m, k, x, b;
if (n == 0)
return 0;
int j = __builtin_ctzl(n);
n >>= j;
m = (uint32_t)n;
k = (uint32_t)(n >> 2);
x = lut[k >> 1 & 127];
x += (m * x * ~x - k) * (x - ~x);
x += (m * x * ~x - k) * (x - ~x);
b = m * x + 2 * k;
b ^= -(b >> 31);
return (b - ~b) << (j >> 1);
}
For purists, it's not hard to rewrite it to be completely branch free, to eliminate the lookup table, and to make the code completely portable and C99-compliant. Here's one way to do that:
#include <stdint.h>
uint32_t isqrt64_exact(uint64_t n)
{
uint64_t t = (n & -n) - 1 & 0x5555555555555555U;
t = t + (t >> 2) & 0x3333333333333333U;
t = t + (t >> 4) & 0x0f0f0f0f0f0f0f0fU;
int j = (uint64_t)(t * 0x0101010101010101U) >> 56;
n >>= 2 * j;
uint32_t m, k, x, b;
m = (uint32_t)n;
k = (uint32_t)(n >> 2);
x = k * ~k;
x += (m * x * ~x - k) * (x - ~x);
x += (m * x * ~x - k) * (x - ~x);
x += (m * x * ~x - k) * (x - ~x);
b = m * x + 2 * k;
b ^= -(b >> 31);
return (2 * b | (m & 1)) << j;
}
For a complete explanation of how this code works, along with code to do exhaustive testing, see the GitHub gist at https://gist.github.com/mdickinson/e087001d213725a93eeb8d8f447a2f40.
No. You'd need to introduce a log somewhere; the fast floating point square root works because of the log in the bit representation.
The fastest method is probably a lookup table of n -> floor(sqrt(n)). You don't store all the values in the table, but only the values for which the square root changes. Use binary search to find the result in the table in log(n) time.
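To make the table idea concrete, here is a scaled-down sketch for 8-bit inputs only. The table stores just the points where the square root changes (the perfect squares), and the binary search is unrolled into four steps so that it stays loop-free. The names squares and isqrt8_lut are invented for this example; a 32-bit version would need a 65536-entry table and 16 steps.

```c
#include <stdint.h>

/* squares[k] = k*k, the smallest n for which floor(sqrt(n)) == k. */
static const uint8_t squares[16] = {
    0, 1, 4, 9, 16, 25, 36, 49, 64, 81, 100, 121, 144, 169, 196, 225
};

/* floor(sqrt(n)) for an 8-bit n: binary search over the table,
 * deciding one result bit per step, from the highest bit down. */
static uint8_t isqrt8_lut(uint8_t n)
{
    uint8_t r = 0;
    if (squares[r | 8] <= n) r |= 8;
    if (squares[r | 4] <= n) r |= 4;
    if (squares[r | 2] <= n) r |= 2;
    if (squares[r | 1] <= n) r |= 1;
    return r;
}
```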
I have code to perform a permutation on a block of data (64 bit) and the code works but I don't understand how the addbit function works. This is the function that performs the permutation on a to and from bit position.
I understand why there is a separate source and destination data block: if a bit in the destination were overwritten before it had itself been permuted, its original value would be lost.
But I don't understand the logic in addbit.
Why is FIRSTBIT used?
The code works, but I would like to understand why.
#include <stdint.h>
#include <stdio.h>
// FIRSTBIT is first bit in 64 bit data?
#define FIRSTBIT 0x8000000000000000 // 1000000000...
// eg move bit 64 in input data to bit position 1
// then move bit 63 in input data to bit position 2 etc
const int TestPermutation[64] = {
64, 63, 62, 61, 60, 59, 58, 57,
56, 55, 54, 53, 52, 51, 50, 49,
48, 47, 46, 45, 44, 43, 42, 41,
40, 39, 38, 37, 36, 35, 34, 33,
32, 31, 30, 29, 28, 27, 26, 25,
24, 23, 22, 21, 20, 19, 18, 17,
16, 15, 14, 13, 12, 11, 10, 9,
8, 7, 6, 5, 4, 3, 2, 1
};
// move data bit from 'from' at position_from to position_to in block
// How does this work?
void addbit(uint64_t *block, uint64_t from, int position_from, int position_to)
{
if (((from << (position_from)) & FIRSTBIT) != 0)
*block += (FIRSTBIT >> position_to);
}
// perform permutation based on TestPermutation array
void permute(uint64_t* data) {
uint64_t data_temp = 0;
for (int ii = 0; ii < 64; ii++)
{
addbit(&data_temp, *data, TestPermutation[ii] - 1, ii);
}
*data = data_temp;
}
void print_binary(uint64_t number) {
for (int i = sizeof(uint64_t) * 8 - 1; i >= 0; --i) {
printf("%c", ((number >> i) & 1) ? '1' : '0');
}
printf("\n");
}
int main() {
uint64_t data = 0xF0F0F0F0F0F0F0F0; // test block
print_binary(data); // 1111000011110000111100001111000011110000111100001111000011110000
permute(&data);
print_binary(data); // 0000111100001111000011110000111100001111000011110000111100001111
}
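One way to read addbit, offered as a sketch rather than a definitive explanation: both positions are counted from the most significant bit, so position 0 is the leftmost bit. Shifting from left by position_from moves the wanted bit into the top position, where masking with FIRSTBIT tests it; shifting FIRSTBIT right by position_to builds a mask with only the destination bit set. The helper names below are hypothetical, and |= is used instead of += (the two agree as long as each destination bit is set at most once, as in permute).

```c
#include <stdint.h>

#define FIRSTBIT 0x8000000000000000ULL /* only the most significant bit set */

/* Is the bit at position pos (0 = most significant) set in v? */
static int test_bit_msb_first(uint64_t v, int pos)
{
    return ((v << pos) & FIRSTBIT) != 0;
}

/* Mask with only the bit at position pos (0 = most significant) set. */
static uint64_t mask_msb_first(int pos)
{
    return FIRSTBIT >> pos;
}

/* Same effect as addbit when the destination bit is still clear. */
static void addbit_or(uint64_t *block, uint64_t from,
                      int position_from, int position_to)
{
    if (test_bit_msb_first(from, position_from))
        *block |= mask_msb_first(position_to);
}
```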
My code is as follows:
#include<stdio.h>
int max(int a,int b)
{
return a>b?a:b;
}
int Knapsack(int items,int weight[],int value[],int maxWeight)
{
int dp[items+1][maxWeight+1];
/* dp[i][w] represents maximum value that can be attained if the maximum weight is w and
items are chosen from 1...i */
/* dp[0][w] = 0 for all w because we have chosen 0 items */
int iter,w;
for(iter=0;iter<=maxWeight;iter++)
{
dp[0][iter]=0;
}
/* dp[i][0] = 0 for all w because maximum weight we can take is 0 */
for(iter=0;iter<=items;iter++)
{
dp[iter][0]=0;
}
for(iter=1;iter<=items;iter++)
{
for(w=0;w<=maxWeight;w=w+1)
{
dp[iter][w] = dp[iter-1][w]; /* If I do not take this item */
if(w-weight[iter] >=0)
{
/* suppose if I take this item */
dp[iter][w] = max( (dp[iter][w]) , (dp[iter-1][w-weight[iter]])+value[iter]);
}
}
}
return dp[items][maxWeight];
}
int main()
{
int items=9;
int weight[/*items+1*/10]={60, 10, 20, 20, 20, 20, 10, 10, 10};
int value[/*items+1*/10]={73, 81, 86, 72, 90, 77, 85, 70, 87};
int iter;
int i;
int maxWeight=180;
for (i=0;i<10;i++){
value[i] = value[i]*weight[i];
}
printf("Max value attained can be %d\n",Knapsack(items,weight,value,maxWeight));
}
My knapsack code is working when
items=12;
int weight[/*items+1*/13]={60, 20, 20, 20, 10, 20, 10, 10, 10, 20, 20, 10};
int value[/*items+1*/13]={48, 77, 46, 82, 85, 43, 49, 73, 65, 48, 47, 51};
where it returned the correct output 7820.
But it doesn't return the correct output when
items=9;
int weight[/*items+1*/10]={60, 10, 20, 20, 20, 20, 10, 10, 10};
int value[/*items+1*/10]={73, 81, 86, 72, 90, 77, 85, 70, 87};
where it returned 9730; the correct output should be 14110.
From observation, the program somehow skipped the first item (weight=60, value=73).
I have checked the code several times, but I just can't find what's wrong.
Can someone explain to me why? Thank you!
In your code, the loops index the weight and value arrays from 1 to items, but your data starts at index 0. The first item is therefore never considered, and when iter reaches 9, weight[iter] and value[iter] read the spare last element (zero here, because the arrays are declared with 10 elements; had they been declared with exactly 9, this would be an out-of-bounds access, which is undefined behaviour in C and would throw an exception in Java). Change the code in your inner for loop to:
dp[iter][w] = dp[iter-1][w]; /* If I do not take this item */
if(w-weight[iter - 1] >=0)
{
/* suppose if I take this item */
dp[iter][w] = max( (dp[iter][w]) , (dp[iter-1][w-weight[iter - 1]])+value[iter - 1]);
}
and it will work fine.
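Put together, a complete version with 0-based item arrays might look like the sketch below. The names knapsack and max_int are chosen for this example, and the dp table is a C99 variable-length array as in the question.

```c
static int max_int(int a, int b)
{
    return a > b ? a : b;
}

/* 0/1 knapsack. weight[] and value[] hold `items` entries, indexed from 0.
 * dp[i][w] is the best value using the first i items within capacity w. */
static int knapsack(int items, const int weight[], const int value[],
                    int maxWeight)
{
    int dp[items + 1][maxWeight + 1];

    for (int w = 0; w <= maxWeight; w++)
        dp[0][w] = 0;                       /* no items chosen */

    for (int i = 1; i <= items; i++) {
        for (int w = 0; w <= maxWeight; w++) {
            dp[i][w] = dp[i - 1][w];        /* skip item i - 1 */
            if (w >= weight[i - 1])         /* or take it, if it fits */
                dp[i][w] = max_int(dp[i][w],
                                   dp[i - 1][w - weight[i - 1]] + value[i - 1]);
        }
    }
    return dp[items][maxWeight];
}
```

With the nine weights from the question and each value already multiplied by its weight, as main does, this returns 14110 for maxWeight = 180.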
int weight[/*items+1*/10]={60, 10, 20, 20, 20, 20, 10, 10, 10};
int value[/*items+1*/10]={73, 81, 86, 72, 90, 77, 85, 70, 87};
Your arrays have length 10, but you are filling only 9 entries, so the last entry is initialized to 0 (see: How to initialize all members of an array to the same value?). In effect you have:
int weight[/*items+1*/10]={60, 10, 20, 20, 20, 20, 10, 10, 10, 0};
int value[/*items+1*/10]={73, 81, 86, 72, 90, 77, 85, 70, 87, 0};
But you are trying to access the indices (1 to 9) in your algorithm.
Instead try filling all entries:
int weight[/*items+1*/10]={0, 60, 10, 20, 20, 20, 20, 10, 10, 10};
int value[/*items+1*/10]={0, 73, 81, 86, 72, 90, 77, 85, 70, 87};
EDIT:
The first case gives correct output since in that case the first entry is not included in the optimal solution.
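If you would rather keep the data 0-based and only pad it for a 1-based algorithm, a small helper can do the shift. This is just a sketch; the name to_one_based is made up for illustration.

```c
#include <assert.h>

/* Copies n items from a 0-based array into a 1-based one:
   dst[0] becomes a zero sentinel and dst[1..n] hold the items.
   dst must have room for n + 1 elements. */
void to_one_based(const int *src, int n, int *dst)
{
    assert(src && dst && n >= 0);
    dst[0] = 0;
    for (int i = 0; i < n; i++) {
        dst[i + 1] = src[i];
    }
}
```

Note that a partial initializer such as int weight[10]={60, 10, ...} with nine values leaves weight[9] equal to 0, which is why the original program ran without crashing but ignored the first item.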
I'm developing a custom thin-client server that serves rendered webpages to its clients. The server runs on a multicore Linux box, with Webkit providing the HTML rendering engine.
The only problem is that the clients' display is limited to a 4-bit (16-color) grayscale palette. I'm currently using LibGraphicsMagick to dither images (RGB->4bit grayscale), which is an apparent bottleneck in the server performance. Profiling shows that more than 70% of the time is spent running GraphicsMagick dithering functions.
I've explored stackoverflow and the Interwebs for a good high performance solution, but it seems that nobody did any benchmarks on various image manipulation libraries and dithering solutions.
I would be more than happy to find out:
What are the highest performance libraries for dithering / halftoning / quantizing RGB images to 4bit grayscale?
Are there any specialized dithering libs or any public domain code snippets that you could point me to?
What libraries do you prefer for high performance graphics manipulation?
C language libraries are preferred.
Dithering is going to take quite a bit of time depending on the algorithm chosen.
It's fairly trivial to implement Bayer (Matrix) and Floyd-Steinberg (Diffusion) dithering.
Bayer dithering can be made extremely fast when coded with MMX/SSE to process several pixels in parallel. You may also be able to do the dithering / conversion using a GPU shader.
FWIW, you're already using GraphicsMagick but there's an entire list of OSS graphics libraries here
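To give an idea of how little work Bayer dithering needs per pixel, here is a minimal scalar sketch for the 8-bit gray to 16-level case. It assumes a 4x4 threshold matrix and a tightly packed single-channel buffer; the function name and the buffer layout are illustrative and not taken from any library. An SSE version would apply the same arithmetic to several pixels at once.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Classic 4x4 Bayer index matrix, entries 0..15 */
static const uint8_t BAYER_4X4[4][4] = {
    {  0,  8,  2, 10 },
    { 12,  4, 14,  6 },
    {  3, 11,  1,  9 },
    { 15,  7, 13,  5 }
};

/* Ordered dither from 8-bit gray (0..255) to 16 levels (0..15).
   src and dst each hold width * height bytes, one pixel per byte. */
void bayer_gray8_to_gray4(const uint8_t *src, uint8_t *dst, int width, int height)
{
    assert(src && dst && width >= 0 && height >= 0);
    for (int y = 0; y < height; y++) {
        for (int x = 0; x < width; x++) {
            size_t i = (size_t)y * (size_t)width + (size_t)x;
            /* Threshold in 8..248, so 0 stays 0 and 255 maps to 15 */
            int t = BAYER_4X4[y & 3][x & 3] * 16 + 8;
            dst[i] = (uint8_t)((src[i] * 15 + t) / 255);
        }
    }
}
```

There is no error term carried between pixels, so rows (or tiles) can be processed independently on separate cores.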
From the list provided by Adisak, without any testing, I would bet on AfterImage. The Afterstep people are obsessed with speed, and also described a clever algorithm.
You could take an alternative approach, if your server could be equipped with a decent PCI-express graphics card featuring OpenGL. Here are some specs from Nvidia. Search for "index mode". What you could do is select a 16 or 256 color display mode, render your image as a texture on a flat polygon (like the side of a cube) and then read the frame back.
When reading a frame from an OpenGL card, it is important that bandwidth is good from the card, hence the need for PCI-express. As the documentation says, you also have to choose your colors in indexed mode for decent effects.
I know it's not a C library, but this got me curious about what's available for .NET to do error-diffusion which I used some 20 years ago on a project. I found this and specifically this method.
But to try to be helpful :) I found this C library.
Here is an implementation of the Floyd-Steinberg method for half-toning:
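#include <opencv2/opencv.hpp>
using namespace cv;
int main(){
uchar scale = 128; // change this parameter to 8, 32, 64, 128 to change the dot size
// IMREAD_GRAYSCALE has the same value as the older CV_LOAD_IMAGE_GRAYSCALE constant
Mat img = imread("../halftone/tom.jpg", IMREAD_GRAYSCALE);
if (img.empty()) return 1; // image could not be loaded
for (int r=1; r<img.rows-1; r++) {
for (int c=1; c<img.cols-1; c++) {
uchar oldPixel = img.at<uchar>(r,c);
uchar newPixel = oldPixel / scale * scale; // quantize down to a multiple of scale
img.at<uchar>(r,c) = newPixel;
int quantError = oldPixel - newPixel;
// Diffuse the error only to pixels that have not been visited yet:
// right, below-left, below, below-right. at<>() takes (row, column).
// saturate_cast keeps the sum inside 0..255 instead of wrapping around.
img.at<uchar>(r,c+1) = saturate_cast<uchar>(img.at<uchar>(r,c+1) + 7./16. * quantError);
img.at<uchar>(r+1,c-1) = saturate_cast<uchar>(img.at<uchar>(r+1,c-1) + 3./16. * quantError);
img.at<uchar>(r+1,c) = saturate_cast<uchar>(img.at<uchar>(r+1,c) + 5./16. * quantError);
img.at<uchar>(r+1,c+1) = saturate_cast<uchar>(img.at<uchar>(r+1,c+1) + 1./16. * quantError);
}
}
imshow("halftone", img);
waitKey();
}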
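Since the question asks for C, here is a rough sketch of the same error-diffusion idea without OpenCV, quantizing 8-bit gray to 16 levels. It assumes a tightly packed single-channel buffer and keeps the running error in two integer rows (in units of 1/16) instead of writing it back into the image. The function name and layout are my own assumptions.

```c
#include <assert.h>
#include <stdint.h>
#include <stdlib.h>
#include <string.h>

/* Floyd-Steinberg dither from 8-bit gray to 16 levels (0..15).
   Returns 0 on success, -1 if the error buffer cannot be allocated. */
int fs_gray8_to_gray4(const uint8_t *src, uint8_t *dst, int width, int height)
{
    assert(src && dst && width > 0 && height > 0);
    size_t row_len = (size_t)width + 2;   /* one padding cell on each side */
    int *err = calloc(row_len * 2, sizeof *err);
    if (!err) return -1;
    int *cur = err + 1;                   /* error for the current row */
    int *next = err + row_len + 1;        /* error for the row below */

    for (int y = 0; y < height; y++) {
        for (int x = 0; x < width; x++) {
            size_t i = (size_t)y * (size_t)width + (size_t)x;
            int v = src[i] + cur[x] / 16;         /* error is stored in 1/16 units */
            if (v < 0) v = 0;
            if (v > 255) v = 255;
            int level = (v * 15 + 127) / 255;     /* nearest of the 16 levels */
            dst[i] = (uint8_t)level;
            int e = v - level * 17;               /* level k displays as k * 17 */
            cur[x + 1]  += e * 7;                 /* right */
            next[x - 1] += e * 3;                 /* below-left */
            next[x]     += e * 5;                 /* below */
            next[x + 1] += e * 1;                 /* below-right */
        }
        int *tmp = cur; cur = next; next = tmp;   /* move down one row */
        memset(next - 1, 0, row_len * sizeof *next);
    }
    free(err);
    return 0;
}
```

Unlike ordered dithering, each pixel depends on its left and upper neighbours, so this variant is harder to vectorize or split across cores.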
Here is the solution you are looking for.
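This is a C-style function that performs Ordered Dither (Bayer) with a parameter for colors. As written it needs a C++ compiler because of noexcept; to build it as plain C, drop noexcept and, if BYTE is not already defined (as it is on Windows), define it as unsigned char.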
It is fast enough to be used in realtime processing.
#ifndef MIN
#define MIN(a,b) (((a) < (b)) ? (a) : (b))
#endif
#ifndef MAX
#define MAX(a,b) (((a) > (b)) ? (a) : (b))
#endif
#ifndef CLAMP
// This produces faster code without jumps
#define CLAMP( x, xmin, xmax ) (x) = MAX( (xmin), (x) ); \
(x) = MIN( (xmax), (x) )
#define CLAMPED( x, xmin, xmax ) MAX( (xmin), MIN( (xmax), (x) ) )
#endif
const int BAYER_PATTERN_16X16[16][16] = { // 16x16 Bayer Dithering Matrix. Color levels: 256
{ 0, 191, 48, 239, 12, 203, 60, 251, 3, 194, 51, 242, 15, 206, 63, 254 },
{ 127, 64, 175, 112, 139, 76, 187, 124, 130, 67, 178, 115, 142, 79, 190, 127 },
{ 32, 223, 16, 207, 44, 235, 28, 219, 35, 226, 19, 210, 47, 238, 31, 222 },
{ 159, 96, 143, 80, 171, 108, 155, 92, 162, 99, 146, 83, 174, 111, 158, 95 },
{ 8, 199, 56, 247, 4, 195, 52, 243, 11, 202, 59, 250, 7, 198, 55, 246 },
{ 135, 72, 183, 120, 131, 68, 179, 116, 138, 75, 186, 123, 134, 71, 182, 119 },
{ 40, 231, 24, 215, 36, 227, 20, 211, 43, 234, 27, 218, 39, 230, 23, 214 },
{ 167, 104, 151, 88, 163, 100, 147, 84, 170, 107, 154, 91, 166, 103, 150, 87 },
{ 2, 193, 50, 241, 14, 205, 62, 253, 1, 192, 49, 240, 13, 204, 61, 252 },
{ 129, 66, 177, 114, 141, 78, 189, 126, 128, 65, 176, 113, 140, 77, 188, 125 },
{ 34, 225, 18, 209, 46, 237, 30, 221, 33, 224, 17, 208, 45, 236, 29, 220 },
{ 161, 98, 145, 82, 173, 110, 157, 94, 160, 97, 144, 81, 172, 109, 156, 93 },
{ 10, 201, 58, 249, 6, 197, 54, 245, 9, 200, 57, 248, 5, 196, 53, 244 },
{ 137, 74, 185, 122, 133, 70, 181, 118, 136, 73, 184, 121, 132, 69, 180, 117 },
{ 42, 233, 26, 217, 38, 229, 22, 213, 41, 232, 25, 216, 37, 228, 21, 212 },
{ 169, 106, 153, 90, 165, 102, 149, 86, 168, 105, 152, 89, 164, 101, 148, 85 }
};
// This is the ultimate method for Bayer Ordered Dither with 16x16 matrix
// ncolors - number of color ranges to use. Valid values 1..255 (0 would divide by zero), but the interesting ones are 1..40
// 1 - color (1 bit per color plane, 3 bits per pixel)
// 3 - color (2 bit per color plane, 6 bits per pixel)
// 7 - color (3 bit per color plane, 9 bits per pixel)
// 15 - color (4 bit per color plane, 12 bits per pixel)
// 31 - color (5 bit per color plane, 15 bits per pixel)
void makeDitherBayerRgbNbpp( BYTE* pixels, int width, int height, int ncolors ) noexcept
{
int divider = 256 / ncolors;
for( int y = 0; y < height; y++ )
{
const int row = y & 15; // y % 16
for( int x = 0; x < width; x++ )
{
const int col = x & 15; // x % 16
const int t = BAYER_PATTERN_16X16[col][row];
const int corr = (t / ncolors);
const int blue = pixels[x * 3 + 0];
const int green = pixels[x * 3 + 1];
const int red = pixels[x * 3 + 2];
int i1 = (blue + corr) / divider; CLAMP( i1, 0, ncolors );
int i2 = (green + corr) / divider; CLAMP( i2, 0, ncolors );
int i3 = (red + corr) / divider; CLAMP( i3, 0, ncolors );
// If you want to compress the image, use the values of i1,i2,i3
// they have values in the range 0..ncolors
// So if the ncolors is 3 - you have values: 0,1,2,3 which is encoded in 2 bits
// 2 bits for 3 planes == 6 bits per pixel
pixels[x * 3 + 0] = CLAMPED( i1 * divider, 0, 255 ); // blue
pixels[x * 3 + 1] = CLAMPED( i2 * divider, 0, 255 ); // green
pixels[x * 3 + 2] = CLAMPED( i3 * divider, 0, 255 ); // red
}
pixels += width * 3;
}
}
In your case, you need to call the function with parameter ncolors=15 (see the table in the comment above; ncolors=4 would give only 5 levels per plane).
This means that each color plane (for grayscale it is 1 plane) will use 4 bits per pixel.
So, you have to call:
makeDitherBayerRgbNbpp( pixels, width, height, 15 );
The input pixels are in BGR format.
The output pixels are also in BGR format for visualisation purposes.
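If you only need a single gray plane, you can collapse the BGR triplets to 8-bit gray before dithering. Below is a small sketch with integer weights that approximate the usual 0.299 / 0.587 / 0.114 luma coefficients; the function name and the choice of weights are my assumptions, not part of the function above.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Converts packed BGR pixels (3 bytes each) to 8-bit gray.
   Weights 29 + 150 + 77 sum to 256, so the shift by 8 divides exactly
   and pure white stays 255. */
void bgr_to_gray8(const uint8_t *bgr, uint8_t *gray, size_t pixel_count)
{
    assert(bgr && gray);
    for (size_t i = 0; i < pixel_count; i++) {
        unsigned b = bgr[i * 3 + 0];
        unsigned g = bgr[i * 3 + 1];
        unsigned r = bgr[i * 3 + 2];
        gray[i] = (uint8_t)((29u * b + 150u * g + 77u * r) >> 8);
    }
}
```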
To obtain the bits, you have to replace this code:
pixels[x * 3 + 0] = CLAMPED( i1 * divider, 0, 255 ); // blue
pixels[x * 3 + 1] = CLAMPED( i2 * divider, 0, 255 ); // green
pixels[x * 3 + 2] = CLAMPED( i3 * divider, 0, 255 ); // red
With something like this:
out.writeBit( i1 ); // blue
out.writeBit( i2 ); // green
out.writeBit( i3 ); // red
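The out.writeBit calls above stand for whatever bit writer you use. For the single-plane 4-bit case, one simple option is to pack two pixels into each byte. This is only a sketch: the function name and the high-nibble-first order are assumptions, so check which order your client display expects.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Packs count 4-bit levels (0..15, one per input byte) two per output byte,
   first pixel in the high nibble. An odd trailing pixel is padded with 0.
   out must have room for (count + 1) / 2 bytes. Returns bytes written. */
size_t pack_gray4(const uint8_t *levels, size_t count, uint8_t *out)
{
    assert(levels && out);
    size_t n = 0;
    for (size_t i = 0; i < count; i += 2) {
        uint8_t hi = levels[i] & 0x0F;
        uint8_t lo = (i + 1 < count) ? (uint8_t)(levels[i + 1] & 0x0F) : 0;
        out[n++] = (uint8_t)((hi << 4) | lo);
    }
    return n;
}
```

This halves the payload compared with sending one byte per pixel.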
Here is a sample picture with your parameters (4bit grayscale)
For more dithering source code and demo app, you can see here