Fastest dithering / halftoning library in C - c

I'm developing a custom thin-client server that serves rendered webpages to its clients. The server runs on a multicore Linux box, with WebKit providing the HTML rendering engine.
The only problem is that the clients' display is limited to a 4-bit (16-color) grayscale palette. I'm currently using LibGraphicsMagick to dither images (RGB -> 4-bit grayscale), which is an apparent bottleneck in server performance. Profiling shows that more than 70% of the time is spent running GraphicsMagick dithering functions.
I've explored stackoverflow and the Interwebs for a good high performance solution, but it seems that nobody did any benchmarks on various image manipulation libraries and dithering solutions.
I would be more than happy to find out:
What are the highest-performance libraries for dithering / halftoning / quantizing RGB images to 4-bit grayscale?
Are there any specialized dithering libs or any public domain code snippets that you could point me to?
What libraries do you prefer for high-performance graphics manipulation?
C language libraries are preferred.

Dithering is going to take quite a bit of time depending on the algorithm chosen.
It's fairly trivial to implement Bayer (Matrix) and Floyd-Steinberg (Diffusion) dithering.
Bayer filtering can be made extremely fast when coded with MMX/SSE to process pixels in parallel. You may also be able to do the dithering / conversion using a GPU shader.
FWIW, you're already using GraphicsMagick but there's an entire list of OSS graphics libraries here
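As a rough illustration of how little code the diffusion approach needs, here is a minimal sketch of Floyd-Steinberg dithering from 8-bit grayscale to 16 levels. The function name fs_dither_gray4, the in-place output convention, and the assumption that the input has already been converted from RGB to one gray byte per pixel are all choices made for this example; none of it comes from GraphicsMagick or any other library.

```c
#include <assert.h>
#include <stdint.h>
#include <stdlib.h>
#include <string.h>

/* Floyd-Steinberg error diffusion from 8-bit gray to 16 gray levels.
 * On return every sample in `gray` holds its 4-bit level (0..15).
 * Returns 0 on success, -1 if the scratch rows cannot be allocated. */
int fs_dither_gray4(uint8_t *gray, int width, int height)
{
    assert(gray != NULL && width > 0 && height > 0);

    /* Error accumulators for the current and the next row, in 1/16 gray
     * units, padded by one cell per side so x-1 and x+1 are always valid. */
    size_t cells = (size_t)width + 2;
    int *cur = calloc(cells, sizeof *cur);
    int *next = calloc(cells, sizeof *next);
    if (cur == NULL || next == NULL) {
        free(cur);
        free(next);
        return -1;
    }

    for (int y = 0; y < height; y++) {
        uint8_t *row = gray + (size_t)y * (size_t)width;
        for (int x = 0; x < width; x++) {
            int v = row[x] * 16 + cur[x + 1];   /* sample plus diffused error */
            if (v < 0) v = 0;
            if (v > 255 * 16) v = 255 * 16;
            int level = (v + 136) / 272;        /* nearest of 16 levels, 17 gray apart */
            int err = v - level * 272;
            cur[x + 2]  += err * 7 / 16;        /* right */
            next[x]     += err * 3 / 16;        /* below left */
            next[x + 1] += err * 5 / 16;        /* below */
            next[x + 2] += err * 1 / 16;        /* below right */
            row[x] = (uint8_t)level;
        }
        int *tmp = cur; cur = next; next = tmp; /* advance one row */
        memset(next, 0, cells * sizeof *next);
    }
    free(cur);
    free(next);
    return 0;
}
```

Because each pixel depends on the error from its left neighbour, this loop does not vectorize as easily as ordered (Bayer) dithering; splitting the image into horizontal bands per core is one possible way to use a multicore box.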

From the list provided by Adisak, without any testing, I would bet on AfterImage. The Afterstep people are obsessed with speed, and also described a clever algorithm.
You could take an alternative approach, if your server could be equipped with a decent PCI-express graphics card featuring OpenGL. Here are some specs from Nvidia. Search for "index mode". What you could do is select a 16 or 256 color display mode, render your image as a texture on a flat polygon (like the side of cube) and then read the frame back.
When reading a frame from an OpenGL card, it is important that bandwidth is good from the card, hence the need for PCI-express. As the documentation says, you also have to choose your colors in indexed mode for decent effects.

I know it's not a C library, but this got me curious about what's available for .NET to do error-diffusion which I used some 20 years ago on a project. I found this and specifically this method.
But to try to be helpful :) I found this C library.

Here is an implementation of the Floyd-Steinberg method for half-toning:
#include <opencv2/opencv.hpp>
using namespace cv;
int main(){
uchar scale = 128; // change this parameter to 8, 32, 64, 128 to change the dot size
Mat img = imread("../halftone/tom.jpg", CV_LOAD_IMAGE_GRAYSCALE);
for (int r=1; r<img.rows-1; r++) {
for (int c=1; c<img.cols-1; c++) {
uchar oldPixel = img.at<uchar>(r,c);
uchar newPixel = oldPixel / scale * scale;
img.at<uchar>(r,c) = newPixel;
int quantError = oldPixel - newPixel;
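// Row-major scan: push the error only to pixels that have not been visited yet,
// and saturate so that the 8-bit values cannot wrap around.
img.at<uchar>(r,c+1) = saturate_cast<uchar>(img.at<uchar>(r,c+1) + 7./16. * quantError);
img.at<uchar>(r+1,c-1) = saturate_cast<uchar>(img.at<uchar>(r+1,c-1) + 3./16. * quantError);
img.at<uchar>(r+1,c) = saturate_cast<uchar>(img.at<uchar>(r+1,c) + 5./16. * quantError);
img.at<uchar>(r+1,c+1) = saturate_cast<uchar>(img.at<uchar>(r+1,c+1) + 1./16. * quantError);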
}
}
imshow("halftone", img);
waitKey();
}

Here is the solution you are looking for.
This is a C function that performs Ordered Dither (Bayer) with a parameter for colors.
It is fast enough to be used in realtime processing.
#ifndef MIN
#define MIN(a,b) (((a) < (b)) ? (a) : (b))
#endif
#ifndef MAX
#define MAX(a,b) (((a) > (b)) ? (a) : (b))
#endif
#ifndef CLAMP
// This produces faster code without jumps
#define CLAMP( x, xmin, xmax ) (x) = MAX( (xmin), (x) ); \
(x) = MIN( (xmax), (x) )
#define CLAMPED( x, xmin, xmax ) MAX( (xmin), MIN( (xmax), (x) ) )
#endif
const int BAYER_PATTERN_16X16[16][16] = { // 16x16 Bayer Dithering Matrix. Color levels: 256
{ 0, 191, 48, 239, 12, 203, 60, 251, 3, 194, 51, 242, 15, 206, 63, 254 },
{ 127, 64, 175, 112, 139, 76, 187, 124, 130, 67, 178, 115, 142, 79, 190, 127 },
{ 32, 223, 16, 207, 44, 235, 28, 219, 35, 226, 19, 210, 47, 238, 31, 222 },
{ 159, 96, 143, 80, 171, 108, 155, 92, 162, 99, 146, 83, 174, 111, 158, 95 },
{ 8, 199, 56, 247, 4, 195, 52, 243, 11, 202, 59, 250, 7, 198, 55, 246 },
{ 135, 72, 183, 120, 131, 68, 179, 116, 138, 75, 186, 123, 134, 71, 182, 119 },
{ 40, 231, 24, 215, 36, 227, 20, 211, 43, 234, 27, 218, 39, 230, 23, 214 },
{ 167, 104, 151, 88, 163, 100, 147, 84, 170, 107, 154, 91, 166, 103, 150, 87 },
{ 2, 193, 50, 241, 14, 205, 62, 253, 1, 192, 49, 240, 13, 204, 61, 252 },
{ 129, 66, 177, 114, 141, 78, 189, 126, 128, 65, 176, 113, 140, 77, 188, 125 },
{ 34, 225, 18, 209, 46, 237, 30, 221, 33, 224, 17, 208, 45, 236, 29, 220 },
{ 161, 98, 145, 82, 173, 110, 157, 94, 160, 97, 144, 81, 172, 109, 156, 93 },
{ 10, 201, 58, 249, 6, 197, 54, 245, 9, 200, 57, 248, 5, 196, 53, 244 },
{ 137, 74, 185, 122, 133, 70, 181, 118, 136, 73, 184, 121, 132, 69, 180, 117 },
{ 42, 233, 26, 217, 38, 229, 22, 213, 41, 232, 25, 216, 37, 228, 21, 212 },
{ 169, 106, 153, 90, 165, 102, 149, 86, 168, 105, 152, 89, 164, 101, 148, 85 }
};
// This is the ultimate method for Bayer Ordered Dither with 16x16 matrix
// ncolors - number of color ranges to use. Valid values 1..255 (0 would divide by zero), but the interesting ones are 1..40
// 1 - color (1 bit per color plane, 3 bits per pixel)
// 3 - color (2 bit per color plane, 6 bits per pixel)
// 7 - color (3 bit per color plane, 9 bits per pixel)
// 15 - color (4 bit per color plane, 12 bits per pixel)
// 31 - color (5 bit per color plane, 15 bits per pixel)
// pixels: BGR, 3 bytes per pixel, modified in place (plain C: no BYTE typedef, no noexcept)
void makeDitherBayerRgbNbpp( unsigned char* pixels, int width, int height, int ncolors )
{
int divider = 256 / ncolors;
for( int y = 0; y < height; y++ )
{
const int row = y & 15; // y % 16
for( int x = 0; x < width; x++ )
{
const int col = x & 15; // x % 16
const int t = BAYER_PATTERN_16X16[col][row];
const int corr = (t / ncolors);
const int blue = pixels[x * 3 + 0];
const int green = pixels[x * 3 + 1];
const int red = pixels[x * 3 + 2];
int i1 = (blue + corr) / divider; CLAMP( i1, 0, ncolors );
int i2 = (green + corr) / divider; CLAMP( i2, 0, ncolors );
int i3 = (red + corr) / divider; CLAMP( i3, 0, ncolors );
// If you want to compress the image, use the values of i1,i2,i3
// they have values in the range 0..ncolors
// So if ncolors is 3 - you have values: 0,1,2,3 which are encoded in 2 bits
// 2 bits for 3 planes == 6 bits per pixel
pixels[x * 3 + 0] = CLAMPED( i1 * divider, 0, 255 ); // blue
pixels[x * 3 + 1] = CLAMPED( i2 * divider, 0, 255 ); // green
pixels[x * 3 + 2] = CLAMPED( i3 * divider, 0, 255 ); // red
}
pixels += width * 3;
}
}
In your case, you need to call the function with parameter ncolors=15 (as the table in the comments shows, ncolors is the highest level index, so 15 gives 16 levels).
This means that each color plane (for grayscale it is 1 plane) will use 4 bits per pixel.
So, you have to call:
makeDitherBayerRgbNbpp( pixels, width, height, 15 );
The input pixels are in BGR format.
The output pixels are also in BGR format for visualisation purposes.
To obtain the bits, you have to replace this code:
pixels[x * 3 + 0] = CLAMPED( i1 * divider, 0, 255 ); // blue
pixels[x * 3 + 1] = CLAMPED( i2 * divider, 0, 255 ); // green
pixels[x * 3 + 2] = CLAMPED( i3 * divider, 0, 255 ); // red
With something like this:
out.writeBit( i1 ); // blue
out.writeBit( i2 ); // green
out.writeBit( i3 ); // red
Here is a sample picture with your parameters (4bit grayscale)
For more dithering source code and demo app, you can see here
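For the asker's actual target (one 4-bit gray value per pixel rather than three color planes), the same ordered-dither idea can be sketched as below. This is an illustration under several assumptions that are not in the answer: a 4x4 Bayer matrix instead of the 16x16 one, integer Rec. 601 luma weights for the BGR-to-gray step, and an output layout of two pixels per byte with the first pixel in the high nibble. The name bgr_to_gray4_bayer is made up for this example.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* 4x4 Bayer threshold matrix, values 0..15. */
static const uint8_t BAYER4[4][4] = {
    {  0,  8,  2, 10 },
    { 12,  4, 14,  6 },
    {  3, 11,  1,  9 },
    { 15,  7, 13,  5 }
};

/* Convert a tightly packed BGR image (3 bytes per pixel) to 4-bit gray
 * with ordered dithering. Two pixels are packed per output byte, first
 * pixel in the high nibble; each output row occupies (width + 1) / 2
 * bytes, so `out` must hold height * ((width + 1) / 2) bytes. */
void bgr_to_gray4_bayer(const uint8_t *bgr, int width, int height, uint8_t *out)
{
    assert(bgr != NULL && out != NULL && width > 0 && height > 0);
    size_t stride = ((size_t)width + 1) / 2;

    for (int y = 0; y < height; y++) {
        const uint8_t *src = bgr + (size_t)y * (size_t)width * 3;
        uint8_t *dst = out + (size_t)y * stride;
        for (int x = 0; x < width; x++) {
            int b = src[x * 3 + 0];
            int g = src[x * 3 + 1];
            int r = src[x * 3 + 2];
            /* Integer luma approximation; the weights sum to 256. */
            int gray = (77 * r + 150 * g + 29 * b) >> 8;
            /* Levels are 17 gray values apart (255 / 15); the threshold
             * 0..15 spans roughly one quantization step. Result is 0..15. */
            int level = (gray + BAYER4[y & 3][x & 3]) / 17;
            if (x & 1)
                dst[x / 2] |= (uint8_t)level;          /* low nibble */
            else
                dst[x / 2] = (uint8_t)(level << 4);    /* high nibble */
        }
    }
}
```

Every pixel is independent here, so rows can be processed on different cores or the inner loop vectorized without changing the result.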

Related

Is there a non-looping unsigned 32-bit integer square root function C

I have seen floating point bit hacks to produce the square root as seen here fast floating point square root, but this method works for floats.
Is there a similar method for finding the integer square root without loops of a 32-bit unsigned integer? I have been scouring the web for one, but haven't seen any
(my thoughts are that a pure binary representation doesn't have enough information to do it, but since it is constrained to 32-bit I would guess otherwise)
This answer assumes that the target platform does not have floating-point support, or very slow floating-point support (perhaps via emulation).
As has been pointed out in comments, a count leading zeros (CLZ) instruction can be used to provide the fast log2 functionality that is provided via the exponent part of floating-point operands. CLZ can also be emulated with reasonable efficiency on platforms that don't provide the functionality via an intrinsic, as shown below.
An initial approximation good for a few bits can be pulled from a lookup table (LUT), which can be further refined by Newton iterations just like in the floating-point case. One to two iterations will typically be sufficient for a 32-bit integer square root. The ISO-C99 code below shows a working example implementation, including an exhaustive test.
#include <stdio.h>
#include <stdlib.h>
#include <stdint.h>
#include <math.h>
uint8_t clz (uint32_t a); // count leading zeros
uint32_t umul_16_16 (uint16_t a, uint16_t b); // 16x16 bit multiply
uint16_t udiv_32_16 (uint32_t x, uint16_t y); // 32/16 bit division
/* LUT for initial square root approximation */
static const uint16_t sqrt_tab[32] =
{
0x0000, 0x0000, 0x0000, 0x0000, 0x0000, 0x0000, 0x0000, 0x0000,
0x85ff, 0x8cff, 0x94ff, 0x9aff, 0xa1ff, 0xa7ff, 0xadff, 0xb3ff,
0xb9ff, 0xbeff, 0xc4ff, 0xc9ff, 0xceff, 0xd3ff, 0xd8ff, 0xdcff,
0xe1ff, 0xe6ff, 0xeaff, 0xeeff, 0xf3ff, 0xf7ff, 0xfbff, 0xffff
};
/* table lookup for initial guess followed by division-based Newton iteration */
uint16_t my_isqrt (uint32_t x)
{
uint16_t q, lz, y, i, xh;
if (x == 0) return x; // early out, code below can't handle zero
// initial guess based on leading 5 bits of argument normalized to 2.30
lz = clz (x);
i = ((x << (lz & ~1)) >> 27);
y = sqrt_tab[i] >> (lz >> 1);
xh = x >> 16; // use for overflow check on divisions
// first Newton iteration, guard against overflow in division
q = 0xffff;
if (xh < y) q = udiv_32_16 (x, y);
y = (q + y) >> 1;
if (lz < 10) {
// second Newton iteration, guard against overflow in division
q = 0xffff;
if (xh < y) q = udiv_32_16 (x, y);
y = (q + y) >> 1;
}
if (umul_16_16 (y, y) > x) y--; // adjust quotient if too large
return y; // (uint16_t)sqrt((double)x)
}
static const uint8_t clz_tab[32] =
{
31, 22, 30, 21, 18, 10, 29, 2, 20, 17, 15, 13, 9, 6, 28, 1,
23, 19, 11, 3, 16, 14, 7, 24, 12, 4, 8, 25, 5, 26, 27, 0
};
/* count leading zeros (for non-zero argument); a machine instruction on many architectures */
uint8_t clz (uint32_t a)
{
a |= a >> 16;
a |= a >> 8;
a |= a >> 4;
a |= a >> 2;
a |= a >> 1;
return clz_tab [0x07c4acdd * a >> 27];
}
/* 16x16->32 bit unsigned multiply; machine instruction on many architectures */
uint32_t umul_16_16 (uint16_t a, uint16_t b)
{
return (uint32_t)a * b;
}
/* 32/16->16 bit division. Note: Will overflow if x[31:16] >= y */
uint16_t udiv_32_16 (uint32_t x, uint16_t y)
{
uint16_t r = x / y;
return r;
}
int main (void)
{
uint32_t x;
uint16_t res, ref;
printf ("testing 32-bit integer square root\n");
x = 0;
do {
ref = (uint16_t)sqrt((double)x);
res = my_isqrt (x);
if (res != ref) {
printf ("error: x=%08x res=%08x ref=%08x\n", x, res, ref);
printf ("exhaustive test FAILED\n");
return EXIT_FAILURE;
}
x++;
} while (x);
printf ("exhaustive test PASSED\n");
return EXIT_SUCCESS;
}
Below are some variants of the code given by @njuffa, including some that are not just loop-free, but also branch-free (the ones that aren't branch free are almost branch free). All of the variants below are derived from the algorithm that's used internally for Python's math.isqrt. The variants are:
32-bit square root: almost branch-free, 1 division
32-bit square root: branch-free, 1 division
32-bit square root: branch-free, 3 divisions, no lookup table
64-bit square root: almost branch-free, 2 divisions
64-bit square root: branch-free, 4 divisions, no lookup table
To finish, we give a very fast division free (and still loop-free, almost branch-free) square root algorithm for unsigned 64-bit inputs. The catch is that it only works for inputs already known to be square; for non-square inputs it will give results that are not useful. It manages to compute each square root in around 3.3 nanoseconds on my 2018 2.7 GHz Intel Core i7 laptop. Scroll all the way down to the bottom of the answer to see it.
32-bit square root: almost branch-free, 1 division
The first variant is the one I'd recommend, all other things being equal. It uses:
a single lookup into a 192-byte lookup table
a single 32-bit-by-16-bit division with 16-bit result
a single 16-bit-by-16-bit multiplication with 32-bit result
a count-leading-zeros operation, and a handful of shifts and other cheap operations
After the early return for the case x == 0, it's essentially branch free. (There's a hint of a branch in the final correction step, but the compilers I tried manage to give jump-free code for this.)
Here's the code. Explanations follow.
#include <stdint.h>
// count leading zeros of nonzero 32-bit unsigned integer
int clz32(uint32_t x);
// isqrt32_tab[k] = isqrt(256*(k+64)-1) for 0 <= k < 192
static const uint8_t isqrt32_tab[192] = {
127, 128, 129, 130, 131, 132, 133, 134, 135, 136,
137, 138, 139, 140, 141, 142, 143, 143, 144, 145,
146, 147, 148, 149, 150, 150, 151, 152, 153, 154,
155, 155, 156, 157, 158, 159, 159, 160, 161, 162,
163, 163, 164, 165, 166, 167, 167, 168, 169, 170,
170, 171, 172, 173, 173, 174, 175, 175, 176, 177,
178, 178, 179, 180, 181, 181, 182, 183, 183, 184,
185, 185, 186, 187, 187, 188, 189, 189, 190, 191,
191, 192, 193, 193, 194, 195, 195, 196, 197, 197,
198, 199, 199, 200, 201, 201, 202, 203, 203, 204,
204, 205, 206, 206, 207, 207, 208, 209, 209, 210,
211, 211, 212, 212, 213, 214, 214, 215, 215, 216,
217, 217, 218, 218, 219, 219, 220, 221, 221, 222,
222, 223, 223, 224, 225, 225, 226, 226, 227, 227,
228, 229, 229, 230, 230, 231, 231, 232, 232, 233,
234, 234, 235, 235, 236, 236, 237, 237, 238, 238,
239, 239, 240, 241, 241, 242, 242, 243, 243, 244,
244, 245, 245, 246, 246, 247, 247, 248, 248, 249,
249, 250, 250, 251, 251, 252, 252, 253, 253, 254,
254, 255,
};
// integer square root of 32-bit unsigned integer
uint16_t isqrt32(uint32_t x)
{
if (x == 0) return 0;
int lz = clz32(x) & 30;
x <<= lz;
uint16_t y = 1 + isqrt32_tab[(x >> 24) - 64];
y = (y << 7) + (x >> 9) / y;
y -= x < (uint32_t)y * y;
return y >> (lz >> 1);
}
Above, we omitted the definition of the function clz32, which can be implemented in exactly the same way as clz in @njuffa's post. With gcc or Clang, you can use something like __builtin_clz in place of clz32. If you have to roll your own, steal it from @njuffa's answer.
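One possible way to provide clz32 is sketched below. It assumes that the __GNUC__ / __clang__ macros are a good enough test for the availability of __builtin_clz, and it falls back to a loop-free binary search elsewhere; the fallback is an alternative to the de Bruijn table, not the code from the other answer.

```c
#include <assert.h>
#include <limits.h>
#include <stdint.h>

/* Count leading zeros of a nonzero 32-bit unsigned integer. */
int clz32(uint32_t x)
{
    assert(x != 0);   /* the result for 0 is deliberately left unspecified */
#if defined(__GNUC__) || defined(__clang__)
    /* __builtin_clz works on unsigned int; correct for wider ints. */
    return __builtin_clz(x) - (int)(sizeof(unsigned int) * CHAR_BIT - 32);
#else
    /* Portable fallback: halve the search range five times. */
    int n = 0;
    if (!(x & 0xFFFF0000u)) { n += 16; x <<= 16; }
    if (!(x & 0xFF000000u)) { n += 8;  x <<= 8;  }
    if (!(x & 0xF0000000u)) { n += 4;  x <<= 4;  }
    if (!(x & 0xC0000000u)) { n += 2;  x <<= 2;  }
    if (!(x & 0x80000000u)) { n += 1; }
    return n;
#endif
}
```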
Explanation: after handling the special case x = 0, we start by normalising x, shifting by an even integer (effectively multiplying by a power of 4) so that 2**30 <= x < 2**32. We then compute the integer square root y for x, and shift y back to compensate just before returning.
This brings us to the central three lines of code, which are the heart of the algorithm:
uint16_t y = 1 + isqrt32_tab[(x >> 24) - 64];
y = (y << 7) + (x >> 9) / y;
y -= x < (uint32_t)y * y;
Remarkably, these three lines are all that's needed to compute an integer square root y of a 32-bit integer x under the assumption that 2**30 <= x < 2**32.
The first of the three lines uses the lookup table to retrieve an approximation to the square root of the topmost 16 bits of x, x >> 16. The approximation will always have an error smaller than 1: that is, after the first line is executed, the condition:
(y-1) * (y-1) < (x >> 16) && (x >> 16) < (y+1) * (y+1)
will always be true.
The second line is the most interesting one: before that line is executed, y is an approximation to the square root of x >> 16, with around 7-8 bits of accuracy. It follows that y << 8 is an approximation to the square root of x, again with around 7-8 accurate bits. So a single Newton iteration applied to y << 8 should give a better approximation to the square root of x, roughly doubling the number of accurate bits, and that's exactly what the second line computes. The really beautiful part is that if you work through the mathematics, it turns out that you can prove that the resulting y again has error smaller than 1: that is, the condition
(y-1) * (y-1) < x && x < (y+1) * (y+1)
will be true. That means that either y*y <= x and y is already the integer square root of x, or x < y*y and y - 1 is the integer square root of x. So the third line applies that -1 correction to y in the event that x < y*y.
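To make the connection between the Newton iteration and the second line explicit, here is the step written out in exact arithmetic, with a = 2^8 y standing for the current estimate of the square root of x (the symbol a is introduced only for this derivation):

```latex
a' = \frac{1}{2}\left(a + \frac{x}{a}\right)
   = \frac{1}{2}\left(2^{8} y + \frac{x}{2^{8} y}\right)
   = 2^{7} y + \frac{x / 2^{9}}{y}
```

The C expression (y << 7) + (x >> 9) / y is the same formula with the exact division and the division by 2^9 replaced by truncating integer operations.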
There's one slightly subtle point above. If you follow through the possible sizes of y as the algorithm progresses: after the lookup we have 128 <= y <= 256. After the second line, mathematically we have 32768 <= y <= 65536. But if y = 65536 then we have a problem, since y can no longer be stored in a uint16_t. However, it's possible to prove that that can't happen: roughly, the only way for y to end up being that large is if the result of the lookup is 256, and in that case, it's straightforward to see that the next line can only produce a maximum of 65535.
32-bit square root: branch-free, 1 division
The algorithm above is tantalisingly close to being completely branch free. Here's a variant that's genuinely branch free, at least with the compilers I tried. It uses the same clz32 and lookup table as the previous example.
uint16_t isqrt32(uint32_t x)
{
int lz = clz32(x | 1) & 30;
x <<= lz;
int index = (x >> 24) - 64;
uint16_t y = 1 + isqrt32_tab[index >= 0 ? index : 0];
y = (y << 7) + (x >> 9) / y;
y -= x < (uint32_t)y * y;
return y >> (lz >> 1);
}
Here we're simply letting the special case x == 0 propagate through the algorithm as usual. We compute clz32(x | 1) in place of clz32(x), partly because clz32(0) may not be well-defined (it isn't for gcc's __builtin_clz, for example), and partly because even if it were well-defined, the resulting shift of 32 would give us undefined behaviour in x <<= lz on the following line. And we have to adjust our lookup to correct for a possibly negative lookup index. (We could also extend the lookup table from 192 entries to 256 entries just to accommodate this case, but that seems wasteful.) Most platforms should have compilers clever enough to turn index >= 0 ? index : 0 into something branch-free. Some architectures even provide a saturating subtraction instruction.
32-bit square root: branch-free, 3 divisions, no lookup table
The next variant needs no lookup table, but uses three divisions instead of one.
uint16_t isqrt32(uint32_t x)
{
int lz = clz32(x | 1) & 30;
x <<= lz;
uint32_t y = 1 + (x >> 30);
y = (y << 1) + (x >> 27) / y;
y = (y << 3) + (x >> 21) / y;
y = (y << 7) + (x >> 9) / y;
y -= x < (uint32_t)y * y;
return y >> (lz >> 1);
}
The idea is exactly the same as before, except that we start with smaller approximations.
After the initial line uint32_t y = 1 + (x >> 30);, y is an approximation to the square root of x >> 28.
After y = (y << 1) + (x >> 27) / y;, y is an approximation to the square root of the eight topmost significant bits of x, x >> 24
After y = (y << 3) + (x >> 21) / y;, y is an approximation to
the square root of the sixteen topmost significant bits of x, x >> 16.
The rest of the algorithm proceeds as before.
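Any of the 32-bit variants can be sanity-checked against the defining property y*y <= x < (y+1)*(y+1). The sketch below is a hypothetical harness, not part of the original answer: check_isqrt32 probes the values s*s, s*s + s and s*s + 2*s for every 16-bit s, and isqrt32_ref is a deliberately simple loop-based stand-in so the harness has something to run on. Pass your own isqrt32 instead.

```c
#include <assert.h>
#include <stdint.h>

/* True if y is the integer square root of x: y*y <= x < (y+1)*(y+1). */
int isqrt32_is_correct(uint32_t x, uint16_t y)
{
    uint64_t lo = (uint64_t)y * y;
    uint64_t hi = ((uint64_t)y + 1) * ((uint64_t)y + 1);
    return lo <= x && x < hi;
}

/* Run candidate f on the boundary values around every perfect square
 * and return the number of wrong results. */
unsigned check_isqrt32(uint16_t (*f)(uint32_t))
{
    assert(f != 0);
    unsigned failures = 0;
    for (uint32_t s = 0; s <= 0xFFFFu; s++) {
        uint32_t sq = s * s;   /* fits: s < 2^16 */
        uint32_t probes[3] = { sq, sq + s, sq + 2 * s };
        for (int i = 0; i < 3; i++)
            failures += !isqrt32_is_correct(probes[i], f(probes[i]));
    }
    return failures;
}

/* Slow bit-by-bit reference, used here only as a stand-in candidate. */
uint16_t isqrt32_ref(uint32_t x)
{
    uint32_t res = 0, bit = 1u << 30;
    while (bit > x) bit >>= 2;
    while (bit) {
        if (x >= res + bit) { x -= res + bit; res = (res >> 1) + bit; }
        else res >>= 1;
        bit >>= 2;
    }
    return (uint16_t)res;
}
```

Since a 32-bit input space is small, a full loop over all 2^32 inputs is also feasible and gives a stronger guarantee than these boundary probes.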
64-bit square root: almost branch-free, 2 divisions
The OP asked for an unsigned 32-bit square root, but the technique above works just as well for computing the integer square root of an unsigned 64-bit integer.
Here's some code for that case, which is essentially the same code that Python's math.isqrt uses for inputs smaller than 2**64 (and also to provide the starting guesses for larger inputs).
#include <stdint.h>
// count leading zeros of nonzero 64-bit unsigned integer
int clz64(uint64_t x);
// isqrt64_tab[k] = isqrt(256*(k+65)-1) for 0 <= k < 192
static const uint8_t isqrt64_tab[192] = {
128, 129, 130, 131, 132, 133, 134, 135, 136, 137,
138, 139, 140, 141, 142, 143, 143, 144, 145, 146,
147, 148, 149, 150, 150, 151, 152, 153, 154, 155,
155, 156, 157, 158, 159, 159, 160, 161, 162, 163,
163, 164, 165, 166, 167, 167, 168, 169, 170, 170,
171, 172, 173, 173, 174, 175, 175, 176, 177, 178,
178, 179, 180, 181, 181, 182, 183, 183, 184, 185,
185, 186, 187, 187, 188, 189, 189, 190, 191, 191,
192, 193, 193, 194, 195, 195, 196, 197, 197, 198,
199, 199, 200, 201, 201, 202, 203, 203, 204, 204,
205, 206, 206, 207, 207, 208, 209, 209, 210, 211,
211, 212, 212, 213, 214, 214, 215, 215, 216, 217,
217, 218, 218, 219, 219, 220, 221, 221, 222, 222,
223, 223, 224, 225, 225, 226, 226, 227, 227, 228,
229, 229, 230, 230, 231, 231, 232, 232, 233, 234,
234, 235, 235, 236, 236, 237, 237, 238, 238, 239,
239, 240, 241, 241, 242, 242, 243, 243, 244, 244,
245, 245, 246, 246, 247, 247, 248, 248, 249, 249,
250, 250, 251, 251, 252, 252, 253, 253, 254, 254,
255, 255,
};
// integer square root of a 64-bit unsigned integer
uint32_t isqrt64(uint64_t x)
{
if (x == 0) return 0;
int lz = clz64(x) & 62;
x <<= lz;
uint32_t y = isqrt64_tab[(x >> 56) - 64];
y = (y << 7) + (x >> 41) / y;
y = (y << 15) + (x >> 17) / y;
y -= x < (uint64_t)y * y;
return y >> (lz >> 1);
}
The above code assumes the existence of a function clz64 for counting leading zeros in a uint64_t value. The function clz64 is permitted to have undefined behaviour in the case x == 0. Again, on gcc and Clang, __builtin_clzll should be usable in place of clz64, assuming that unsigned long long has a width of 64. For completeness, here's an implementation of clz64, based on the usual de Bruijn sequence trickery.
#include <assert.h>
static const uint8_t clz64_tab[64] = {
63, 5, 62, 4, 16, 10, 61, 3, 24, 15, 36, 9, 30, 21, 60, 2,
12, 26, 23, 14, 45, 35, 43, 8, 33, 29, 52, 20, 49, 41, 59, 1,
6, 17, 11, 25, 37, 31, 22, 13, 27, 46, 44, 34, 53, 50, 42, 7,
18, 38, 32, 28, 47, 54, 51, 19, 39, 48, 55, 40, 56, 57, 58, 0,
};
// count leading zeros of nonzero 64-bit unsigned integer. Analogous to the
// 32-bit version at
// https://graphics.stanford.edu/~seander/bithacks.html#IntegerLogDeBruijn.
int clz64(uint64_t x)
{
assert(x);
x |= x >> 1;
x |= x >> 2;
x |= x >> 4;
x |= x >> 8;
x |= x >> 16;
x |= x >> 32;
return clz64_tab[(uint64_t)(x * 0x03f6eaf2cd271461u) >> 58];
}
Doing exhaustive testing for a 64-bit input is no longer reasonable without a supercomputer to hand, but checking all inputs of the form s*s, s*s + s, and s*s + 2*s for 0 <= s < 2**32 is feasible, and gives some confidence that the code is working correctly. The following code performs that check. It takes around 148 seconds to run to completion on my Intel Core i7-based laptop, which works out at around 11.5ns per square root computation. That comes down to around 9.2ns per square root if I replace the custom clz64 implementation with __builtin_clzll. (Both of those timings still include the overhead of the testing code, of course.)
#include <stdio.h>
int check_isqrt64(uint64_t x) {
uint64_t y = isqrt64(x);
int y_ok = y*y <= x && x - y*y <= 2*y;
if (!y_ok) {
printf("isqrt64(%llu) returned incorrect answer %llu\n", (unsigned long long)x, (unsigned long long)y);
}
return y_ok;
}
int main(void)
{
printf("Checking isqrt64 for selected values in [0, 2**64) ...\n");
for (uint64_t s = 0; s < 0x100000000u; s++) {
if (!check_isqrt64(s*s)) return 1;
if (!check_isqrt64(s*s + s)) return 1;
if (!check_isqrt64(s*s + 2*s)) return 1;
};
printf("All tests passed\n");
return 0;
}
64-bit square root: branch-free, 4 divisions, no lookup table
The final variant gives a branch-free, lookup-table-free implementation of integer square root for a 64-bit input, just to demonstrate that it's possible. It needs 4 divisions. On most machines, those divisions will be slow enough that this variant is slower than the previous variant.
uint32_t isqrt64(uint64_t x)
{
int lz = clz64(x | 1) & 62;
x <<= lz;
uint32_t y = 2 + (x >> 63);
y = (y << 1) + (x >> 59) / y;
y = (y << 3) + (x >> 53) / y;
y = (y << 7) + (x >> 41) / y;
y = (y << 15) + (x >> 17) / y;
y -= x < (uint64_t)y * y;
return y >> (lz >> 1);
}
64-bit square root of exact square; division-free!
Finally, as promised, here's an algorithm that's a bit different from those above. It computes the square root of an input that's already known to be square, using a Newton iteration for the inverse square root of the input, and operating in the 2-adic domain rather than the real domain. It requires no divisions, but it uses a lookup table, as well as relying on GCC's __builtin_ctzl intrinsic to count trailing zeros efficiently.
#include <stdint.h>
static const uint8_t lut[128] = {
0, 85, 83, 102, 71, 2, 36, 126, 15, 37, 28, 22, 87, 50, 107, 46,
31, 10, 115, 57, 103, 98, 4, 33, 47, 58, 3, 118, 119, 109, 116, 113,
63, 106, 108, 38, 120, 61, 27, 62, 79, 101, 35, 41, 104, 13, 84, 17,
95, 53, 76, 121, 88, 34, 59, 97, 111, 5, 67, 54, 72, 82, 52, 78,
127, 42, 44, 25, 56, 125, 91, 1, 112, 90, 99, 105, 40, 77, 20, 81,
96, 117, 12, 70, 24, 29, 123, 94, 80, 69, 124, 9, 8, 18, 11, 14,
64, 21, 19, 89, 7, 66, 100, 65, 48, 26, 92, 86, 23, 114, 43, 110,
32, 74, 51, 6, 39, 93, 68, 30, 16, 122, 60, 73, 55, 45, 75, 49,
};
uint32_t isqrt64_exact(uint64_t n)
{
uint32_t m, k, x, b;
if (n == 0)
return 0;
int j = __builtin_ctzl(n);
n >>= j;
m = (uint32_t)n;
k = (uint32_t)(n >> 2);
x = lut[k >> 1 & 127];
x += (m * x * ~x - k) * (x - ~x);
x += (m * x * ~x - k) * (x - ~x);
b = m * x + 2 * k;
b ^= -(b >> 31);
return (b - ~b) << (j >> 1);
}
For purists, it's not hard to rewrite it to be completely branch free, to eliminate the lookup table, and to make the code completely portable and C99-compliant. Here's one way to do that:
#include <stdint.h>
uint32_t isqrt64_exact(uint64_t n)
{
uint64_t t = (n & -n) - 1 & 0x5555555555555555U;
t = t + (t >> 2) & 0x3333333333333333U;
t = t + (t >> 4) & 0x0f0f0f0f0f0f0f0fU;
int j = (uint64_t)(t * 0x0101010101010101U) >> 56;
n >>= 2 * j;
uint32_t m, k, x, b;
m = (uint32_t)n;
k = (uint32_t)(n >> 2);
x = k * ~k;
x += (m * x * ~x - k) * (x - ~x);
x += (m * x * ~x - k) * (x - ~x);
x += (m * x * ~x - k) * (x - ~x);
b = m * x + 2 * k;
b ^= -(b >> 31);
return (2 * b | (m & 1)) << j;
}
For a complete explanation of how this code works, along with code to do exhaustive testing, see the GitHub gist at https://gist.github.com/mdickinson/e087001d213725a93eeb8d8f447a2f40.
No. You'd need to introduce a log somewhere; the fast floating point square root works because of the log in the bit representation.
The fastest method is probably a lookup table of n -> floor(sqrt(n)). You don't store all the values in the table, but only the values for which the square root changes. Use binary search to find the result in the table in log(n) time.
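A minimal sketch of that table idea for 32-bit inputs follows. The names squares, isqrt_table_init and isqrt32_bsearch are made up for this example, and the table is filled at run time for brevity (it could equally be a precomputed constant array). Note that the binary search is still a loop, although one with a fixed bound of 16 iterations, so this does not strictly meet the "no loops" requirement of the question.

```c
#include <assert.h>
#include <stdint.h>

/* squares[k] = k*k: the inputs at which floor(sqrt(n)) changes (256 KiB). */
static uint32_t squares[65536];
static int squares_ready = 0;

void isqrt_table_init(void)
{
    for (uint32_t k = 0; k < 65536; k++)
        squares[k] = k * k;
    squares_ready = 1;
}

/* Largest k with squares[k] <= n, found by binary search. */
uint16_t isqrt32_bsearch(uint32_t n)
{
    assert(squares_ready);
    uint32_t lo = 0, hi = 65535;   /* invariant: squares[lo] <= n */
    while (lo < hi) {
        uint32_t mid = lo + (hi - lo + 1) / 2;
        if (squares[mid] <= n)
            lo = mid;
        else
            hi = mid - 1;
    }
    return (uint16_t)lo;
}
```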

Generate a random array of integers in a loop in C

I was trying to generate a random array of integers of length 16 inside a loop. The elements of the array will lie inside [0, 255]. I thought assigning random integers from [0, 255] would suffice; but it does not work (one random array is created which remains unchanged over iterations). Then I tried shuffling, but the situation does not improve (I also notice that the probability of 0 in the 'random' array is significantly larger).
Here is my code:
#include <stdio.h>
#include <stdlib.h>
#include <time.h>
//#define Shuffle
/* Generate a random integer array of length `size` with elements from [0, 255] */
int* random_sample(int size)
{
int i, j, k;
int *elements = malloc(size*sizeof(int));
/* Assign random integers */
for (i = size - 1; i > 0; --i)
{
elements[i] = rand() % 256;
}
/* Shuffle */
#ifdef Shuffle
for (i = size - 1; i > 0; --i) {
j = rand() % size;
k = elements[i];
elements[i] = elements[j];
elements[j] = k;
}
#endif
return elements;
}
int main(int argc, char const *argv[])
{
int LENGTH = 16, i, iteration;
int *random_array = random_sample(LENGTH);
srand(time(NULL));
for (iteration = 0; iteration < 10; ++iteration)
{
for (i = 0; i < LENGTH; ++i)
printf("%d, ", random_array[i]);
puts("");
}
return 0;
}
A typical output looks like:
0, 227, 251, 242, 171, 186, 205, 41, 236, 74, 255, 81, 115, 105, 198, 103,
0, 227, 251, 242, 171, 186, 205, 41, 236, 74, 255, 81, 115, 105, 198, 103,
0, 227, 251, 242, 171, 186, 205, 41, 236, 74, 255, 81, 115, 105, 198, 103,
0, 227, 251, 242, 171, 186, 205, 41, 236, 74, 255, 81, 115, 105, 198, 103,
0, 227, 251, 242, 171, 186, 205, 41, 236, 74, 255, 81, 115, 105, 198, 103,
0, 227, 251, 242, 171, 186, 205, 41, 236, 74, 255, 81, 115, 105, 198, 103,
0, 227, 251, 242, 171, 186, 205, 41, 236, 74, 255, 81, 115, 105, 198, 103,
0, 227, 251, 242, 171, 186, 205, 41, 236, 74, 255, 81, 115, 105, 198, 103,
0, 227, 251, 242, 171, 186, 205, 41, 236, 74, 255, 81, 115, 105, 198, 103,
0, 227, 251, 242, 171, 186, 205, 41, 236, 74, 255, 81, 115, 105, 198, 103,
I tried several variations of the above code, but none of them works. Need help!
EDIT
I was trying variations of the code, and mistakenly printed the same unchanged array (as pointed out by some others - thanks to them); originally I wrote
int main(int argc, char const *argv[])
{
int LENGTH = 16, i, iteration;
int *random_array;
srand(time(NULL));
for (iteration = 0; iteration < 10; ++iteration)
{
random_array = random_sample(LENGTH);
for (i = 0; i < LENGTH; ++i)
printf("%d, ", random_array[i]);
puts("");
}
return 0;
}
which prints many more 0s.
EDIT (2)
Thanks to @pmg, I found the problem. Inside the random_sample function, changing the first
for (i = size - 1; i > 0; --i) to for (i = size - 1; i >= 0; --i) works fine!
Thank you all.
Try the code below. For the array to contain different (random) values at every iteration, you need to put different values in it :)
#include <stdio.h>
#include <stdlib.h>
#include <time.h>
//#define Shuffle
/* Generate a random integer array of length `size` with elements from [0, 255] */
int* random_sample(int size)
{
int i, j, k;
int *elements = malloc(size*sizeof(int));
/* Assign random integers */
for (i = size - 1; i >= 0; --i) /* i >= 0 so that elements[0] is filled too */
{
elements[i] = rand() % 256;
}
/* Shuffle */
#ifdef Shuffle
for (i = size - 1; i > 0; --i) {
j = rand() % size;
k = elements[i];
elements[i] = elements[j];
elements[j] = k;
}
#endif
return elements;
}
int main(int argc, char const *argv[])
{
int LENGTH = 16, i, iteration;
int *random_array;
srand(time(NULL));
for (iteration = 0; iteration < 10; ++iteration)
{
random_array = random_sample(LENGTH);
for (i = 0; i < LENGTH; ++i)
printf("%d, ", random_array[i]);
puts("");
free(random_array);
}
return 0;
}
Here are some comments on your code to clarify the problem.
int main(int argc, char const *argv[])
{
int LENGTH = 16, i, iteration;
int *random_array = random_sample(LENGTH); // ARRAY is created and filled in here.
srand(time(NULL)); // Random generator is initialized here.
// WAIT WHAT? You already filled in the array above!
// It is too late now to initialize the Generator!
In addition to what others have said above, I think it's important to note the following on the rand function as written in the man page:
Note:
The versions of rand() and srand() in the Linux C Library use the same random number generator as random(3) and srandom(3), so the lower-order bits should be as random as the higher-order bits. However, on older rand() implementations, and on current implementations on different systems, the lower-order bits are much less random than the higher-order bits. Do not use this function in applications intended to be portable when good randomness is needed. (Use random(3) instead.)
There are two possible alternatives:
1) Use the random function as recommended by the note.
2) Use the higher-order bits of the random number as follows:
elements[i] = (int)(rand() / (RAND_MAX + 1.0) * 256);
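The second alternative can be wrapped in a small helper, sketched below. The names rand_below and fill_random are hypothetical; the scaling goes through double so that the multiplication cannot overflow an int and the result can never reach n.

```c
#include <assert.h>
#include <stdlib.h>

/* Map one rand() result to the range 0..n-1 using its high-order bits. */
int rand_below(int n)
{
    assert(n > 0);
    /* rand() / (RAND_MAX + 1.0) lies in [0, 1), so the product is < n. */
    return (int)(rand() / (RAND_MAX + 1.0) * n);
}

/* Fill every element, including index 0, with a value in 0..n-1. */
void fill_random(int *elements, int size, int n)
{
    assert(elements != NULL && size >= 0);
    for (int i = 0; i < size; i++)
        elements[i] = rand_below(n);
}
```

As in the question, srand should be called once before the first call, not after the array has been filled.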

Knapsack code - not working for some cases

My code is as the following:
#include<stdio.h>
int max(int a,int b)
{
return a>b?a:b;
}
int Knapsack(int items,int weight[],int value[],int maxWeight)
{
int dp[items+1][maxWeight+1];
/* dp[i][w] represents maximum value that can be attained if the maximum weight is w and
items are chosen from 1...i */
/* dp[0][w] = 0 for all w because we have chosen 0 items */
int iter,w;
for(iter=0;iter<=maxWeight;iter++)
{
dp[0][iter]=0;
}
/* dp[i][0] = 0 for all w because maximum weight we can take is 0 */
for(iter=0;iter<=items;iter++)
{
dp[iter][0]=0;
}
for(iter=1;iter<=items;iter++)
{
for(w=0;w<=maxWeight;w=w+1)
{
dp[iter][w] = dp[iter-1][w]; /* If I do not take this item */
if(w-weight[iter] >=0)
{
/* suppose if I take this item */
dp[iter][w] = max( (dp[iter][w]) , (dp[iter-1][w-weight[iter]])+value[iter]);
}
}
}
return dp[items][maxWeight];
}
int main()
{
int items=9;
int weight[/*items+1*/10]={60, 10, 20, 20, 20, 20, 10, 10, 10};
int value[/*items+1*/10]={73, 81, 86, 72, 90, 77, 85, 70, 87};
int iter;
int i;
int maxWeight=180;
for (i=0;i<10;i++){
value[i] = value[i]*weight[i];
}
printf("Max value attained can be %d\n",Knapsack(items,weight,value,maxWeight));
}
My knapsack code works when
items=12;
int weight[/*items+1*/13]={60, 20, 20, 20, 10, 20, 10, 10, 10, 20, 20, 10};
int value[/*items+1*/13]={48, 77, 46, 82, 85, 43, 49, 73, 65, 48, 47, 51};
where it returns the correct output, 7820.
But it doesn't return the correct output when
items=9;
int weight[/*items+1*/10]={60, 10, 20, 20, 20, 20, 10, 10, 10};
int value[/*items+1*/10]={73, 81, 86, 72, 90, 77, 85, 70, 87};
where it returns 9730; the correct output should be 14110.
From observation, the program somehow skips the first item (weight=60, value=73).
I have checked the code several times, but I just can't find what's wrong.
Can someone explain to me why? Thank you!
In your code, weight[] and value[] are filled starting at index 0, but the loop reads weight[iter] and value[iter] for iter = 1..items, so the first item is never considered. When iter reaches 9, weight[iter] and value[iter] point one past the last real item. Here the arrays are declared with 10 slots, so you read the zero padding; with exactly 9 slots it would be undefined behavior in C, and Java would throw an exception. Change the code in your inner for loop to:
dp[iter][w] = dp[iter-1][w]; /* If I do not take this item */
if(w-weight[iter - 1] >=0)
{
/* suppose if I take this item */
dp[iter][w] = max( (dp[iter][w]) , (dp[iter-1][w-weight[iter - 1]])+value[iter - 1]);
}
and it will work fine.
int weight[/*items+1*/10]={60, 10, 20, 20, 20, 20, 10, 10, 10};
int value[/*items+1*/10]={73, 81, 86, 72, 90, 77, 85, 70, 87};
Your arrays are of length 10, but you are filling only 9 entries, so the last entry is initialized to 0 (see "How to initialize all members of an array to the same value?"). In effect you have:
int weight[/*items+1*/10]={60, 10, 20, 20, 20, 20, 10, 10, 10, 0};
int value[/*items+1*/10]={73, 81, 86, 72, 90, 77, 85, 70, 87, 0};
But your algorithm reads indices 1 to 9, so it skips the first item and picks up the trailing 0.
Instead, put a dummy entry at index 0 and fill all entries:
int weight[/*items+1*/10]={0, 60, 10, 20, 20, 20, 20, 10, 10, 10};
int value[/*items+1*/10]={0, 73, 81, 86, 72, 90, 77, 85, 70, 87};
EDIT:
The first case gives the correct output because in that case the first entry is not part of the optimal solution.

multiplication of two numbers

A few days back I had an interview with Qualcomm. I got stuck on one question. It looked very simple, but neither the interviewer nor I was satisfied with my answers. Can anyone provide a good solution to this problem?
The question is:
Multiply 2 numbers without using any loops and additions and of course no multiplication and division.
To which I replied: recursion.
He asked for something else, at a very low level.
The only thing that came to mind was bit shifting, but bit shifting only multiplies by a power of 2, and for other numbers we still need an addition at the end.
For example, 10 * 7 can be done as (7 is 111 in binary):
(10 << 2) + (10 << 1) + 10
40 + 20 + 10 = 70
But again, addition was not allowed.
Any thoughts on this?
Here is a solution using only lookup, addition and shifting. The table holds every 4-bit by 4-bit product. Indexing it needs no multiplication: the row offset is the first index shifted left by 4 (16 entries per row) and the second index is added to it. Each 8-bit operand is split into two nibbles, and the four partial products are looked up, shifted into place and added.
#include <stdio.h>
#include <stdlib.h>
int main(int argc, char **argv)
{
/* Note: lookup[a][b] is plain array indexing. The row offset is a << 4
(16 entries per row) and the column index b is added to it, so the
lookup itself needs only shifting and addition, no multiplication.
*/
unsigned char lookup[16][16] = {
{ 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0 },
{ 0, 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14, 15 },
{ 0, 2, 4, 6, 8, 10, 12, 14, 16, 18, 20, 22, 24, 26, 28, 30 },
{ 0, 3, 6, 9, 12, 15, 18, 21, 24, 27, 30, 33, 36, 39, 42, 45 },
{ 0, 4, 8, 12, 16, 20, 24, 28, 32, 36, 40, 44, 48, 52, 56, 60 },
{ 0, 5, 10, 15, 20, 25, 30, 35, 40, 45, 50, 55, 60, 65, 70, 75 },
{ 0, 6, 12, 18, 24, 30, 36, 42, 48, 54, 60, 66, 72, 78, 84, 90 },
{ 0, 7, 14, 21, 28, 35, 42, 49, 56, 63, 70, 77, 84, 91, 98, 105 },
{ 0, 8, 16, 24, 32, 40, 48, 56, 64, 72, 80, 88, 96, 104, 112, 120 },
{ 0, 9, 18, 27, 36, 45, 54, 63, 72, 81, 90, 99, 108, 117, 126, 135 },
{ 0, 10, 20, 30, 40, 50, 60, 70, 80, 90, 100, 110, 120, 130, 140, 150 },
{ 0, 11, 22, 33, 44, 55, 66, 77, 88, 99, 110, 121, 132, 143, 154, 165 },
{ 0, 12, 24, 36, 48, 60, 72, 84, 96, 108, 120, 132, 144, 156, 168, 180 },
{ 0, 13, 26, 39, 52, 65, 78, 91, 104, 117, 130, 143, 156, 169, 182, 195 },
{ 0, 14, 28, 42, 56, 70, 84, 98, 112, 126, 140, 154, 168, 182, 196, 210 },
{ 0, 15, 30, 45, 60, 75, 90, 105, 120, 135, 150, 165, 180, 195, 210, 225 }
};
unsigned short answer, mult;
unsigned char x, y, a, b;
if (argc < 3) {
fprintf(stderr, "usage: %s x y (each 0..255)\n", argv[0]);
return 1;
}
x = (unsigned char)atoi(argv[1]);
y = (unsigned char)atoi(argv[2]);
printf("Multiply %d by %d\n", x, y);
answer = 0;
/* First nibble of x, First nibble of y */
a = x & 0xf;
b = y & 0xf;
mult = lookup[a][b];
answer += mult;
printf("Looking up %d, %d get %d - Answer so far %d\n", a, b, mult, answer);
/* First nibble of x, Second nibble of y */
a = x & 0xf;
b = (y & 0xf0) >> 4;
mult = lookup[a][b];
answer += mult << 4;
printf("Looking up %d, %d get %d - Answer so far %d\n", a, b, mult, answer);
/* Second nibble of x, First nibble of y */
a = (x & 0xf0) >> 4;
b = y & 0xf;
mult = lookup[a][b];
answer += mult << 4;
printf("Looking up %d, %d get %d - Answer so far %d\n", a, b, mult, answer);
/* Second nibble of x, Second nibble of y */
a = (x & 0xf0) >> 4;
b = (y & 0xf0) >> 4;
mult = lookup[a][b];
answer += mult << 8;
printf("Looking up %d, %d get %d - Answer so far %d\n", a, b, mult, answer);
return 0;
}
Perhaps you could recursively add, using bitwise operations as a replacement for the addition operator. See: Adding Two Numbers With Bitwise and Shift Operators
You can separate your problems by first implementing the addition and then the multiplication based on the addition.
For the addition, implement what they do on processors at the gate level using the C bitwise operators:
http://en.wikipedia.org/wiki/Full_adder
Then for the multiplication, with the addition you implemented, use goto statements and labels so no loop statement (the for, while and do iteration statements) will be used.
What about Russian peasant multiplication without using addition? Is there an easy way (a few lines, no loops) to simulate addition using only AND, OR, XOR and NOT?
You can implement addition with bitwise operators. But if you also want to avoid loops, you will have to write a lot of code. (I once implemented multiplication without arithmetic operators, but I used a loop, shifting the index until it became zero. If that would help, tell me and I will look for the file.)
You could use logarithms and subtraction instead.
log(a*b) = log(a) + log(b)
a+b = -(-a-b)
exp(log(a)) = a
round(exp(-(-log(a)-log(b))))
How about multiplication tables?
Question: Multiply 2 numbers without using any loops and additions and of course no multiplication and division.
Multiplication is defined in terms of addition. It is impossible not to find addition in an implementation of multiplication.
Arbitrary precision numbers cannot be multiplied without loop/recursion.
Multiplication of two numbers of fixed bit-lengths can be implemented via a table lookup. The problem is the size of the table. Generating the table requires addition.
The answer is: It cannot be done.

what is the fastest way to generate random ip numbers in c?

I need a fast way to generate IP numbers that are valid (reserved IPs are valid too).
For now I am using this:
unsigned char *p_ip;
unsigned long ul_dst;
p_ip = (unsigned char*) &ul_dst;
for(int i=0;i<sizeof(unsigned long);i++)
*p_ip++ = rand()%255;
ip.sin_addr.s_addr = ul_dst;
This code can generate about 10k valid IPs per second, but sometimes it generates invalid numbers. Can anyone contribute?
Thank you
Calling rand() is probably the slowest part of your code. Try the multiply-with-carry generator described at http://en.wikipedia.org/wiki/Multiply-with-carry instead.
It is a very fast C function for generating random numbers.
Storing sizeof(unsigned long) in a register variable, i.e.:
register int size = sizeof(unsigned long);
may also help slightly, although sizeof is a compile-time constant and an optimizing compiler will normally take care of this for you.
Since you are using 4 chars = 4 x 8 bits of memory, you can instead use a single 32-bit integer, which needs only one memory write.
Combining the bit shifting, the new random function and register variables should reduce the running time by quite a bit.
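#include <stdio.h>
#include <stdlib.h>
#include <stdint.h>
#include <time.h>
/* Paste init_rand() and rand_cmwc() from the Wikipedia article here. */
int main(void)
{
    uint32_t ul_dst;
    init_rand((uint32_t)time(NULL));
    uint32_t random_num = rand_cmwc();
    /* Reassemble the four bytes; the result equals random_num itself. */
    ul_dst = (random_num >> 24 & 0xFF) << 24 |
             (random_num >> 16 & 0xFF) << 16 |
             (random_num >> 8 & 0xFF) << 8 |
             (random_num & 0xFF);
    printf("%u\n", (unsigned)ul_dst);
    return 0;
}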
Above this code I have an exact copy of the random function from Wikipedia.
Hopefully this will run much faster.
We know a 32-bit int is 4*8 bits, so there is no need for sizeof any more, and I replaced the %255 with a 0xFF (255) bit mask. Unlike %255, the mask can also produce the value 255.
Currently, your code writes four chars to memory. You can optimize this by writing one int32 to memory.
Roll your own random generator. For this purpose anything with a period of (1<<32) is valid, so you could construct a linear congruential generator. (You would not need to build the address from 4 separate characters, either.)
Also, your *p_ip is uninitialised. you probably want a
p_ip = (unsigned char *) &ul_dst;
somewhere.
Building on @Serdalis's idea of using init_rand(), you could try something I came across called the Knuth shuffle, which helps you generate random numbers with a uniform distribution.
I guess the code would look something like this. Note that the example below still uses rand():
#include <stdio.h>
#include <stdlib.h>
#include <sys/types.h>
#include <sys/socket.h>
#include <netinet/in.h>
#include <arpa/inet.h>
#define TOTAL_VAL_COUNT 254
int byteval_array[TOTAL_VAL_COUNT] = {
1, 2, 3, 4, 5, 6, 7, 8, 9, 10,
11, 12, 13, 14, 15, 16, 17, 18, 19, 20,
21, 22, 23, 24, 25, 26, 27, 28, 29, 30,
31, 32, 33, 34, 35, 36, 37, 38, 39, 40,
41, 42, 43, 44, 45, 46, 47, 48, 49, 50,
51, 52, 53, 54, 55, 56, 57, 58, 59, 60,
61, 62, 63, 64, 65, 66, 67, 68, 69, 70,
71, 72, 73, 74, 75, 76, 77, 78, 79, 80,
81, 82, 83, 84, 85, 86, 87, 88, 89, 90,
91, 92, 93, 94, 95, 96, 97, 98, 99, 100,
101, 102, 103, 104, 105, 106, 107, 108, 109, 110,
111, 112, 113, 114, 115, 116, 117, 118, 119, 120,
121, 122, 123, 124, 125, 126, 127, 128, 129, 130,
131, 132, 133, 134, 135, 136, 137, 138, 139, 140,
141, 142, 143, 144, 145, 146, 147, 148, 149, 150,
151, 152, 153, 154, 155, 156, 157, 158, 159, 160,
161, 162, 163, 164, 165, 166, 167, 168, 169, 170,
171, 172, 173, 174, 175, 176, 177, 178, 179, 180,
181, 182, 183, 184, 185, 186, 187, 188, 189, 190,
191, 192, 193, 194, 195, 196, 197, 198, 199, 200,
201, 202, 203, 204, 205, 206, 207, 208, 209, 210,
211, 212, 213, 214, 215, 216, 217, 218, 219, 220,
221, 222, 223, 224, 225, 226, 227, 228, 229, 230,
231, 232, 233, 234, 235, 236, 237, 238, 239, 240,
241, 242, 243, 244, 245, 246, 247, 248, 249, 250,
251, 252, 253, 254
};
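/* Number of values not yet handed out in the current pass. */
unsigned char denominator = TOTAL_VAL_COUNT;
unsigned char generate_byte_val(void);
unsigned char generate_byte_val(void) {
unsigned char inx, random_val;
/* All values handed out: start a new pass over the whole array. */
if (denominator == 0)
denominator = TOTAL_VAL_COUNT;
/* Pick among the first `denominator` entries (indices 0..denominator-1). */
inx = rand() % denominator;
random_val = byteval_array[inx];
/* Swap the pick to the end so it is not chosen again in this pass. */
byteval_array[inx] = byteval_array[--denominator];
byteval_array[denominator] = random_val;
return random_val;
}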
int main(int argc, char **argv) {
int i;
struct in_addr ip;
for (i = 1; i < 255; ++i) {
ip.s_addr = ((unsigned int)generate_byte_val() |
((unsigned int)generate_byte_val() << 8) |
((unsigned int)generate_byte_val() << 16) |
((unsigned int)generate_byte_val() << 24));
printf ("IP = %s\n", inet_ntoa(ip));
}
}
HTH

Resources