Is there any fast algorithm to compute log2 for numbers that are all power of 2? - c

Is there any fast algorithm to compute log2 for numbers that are all powers of 2, e.g.:
log2(1), log2(2), log2(4), log2(1024), log2(4096)...
I'm considering using it to implement bit set iteration.

Assuming you know the number must be a power of 2, its binary form is a 1 followed by n zeros, where n is the number you are looking for.
If you are using gcc or clang, you can use a builtin function:
— Built-in Function: int __builtin_ctz (unsigned int x)
Returns the number of trailing 0-bits in x, starting at the least
significant bit position. If x is 0, the result is undefined.
For a pure C implementation, this is already answered in the question "Finding trailing 0s in a binary number".
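For the bit set iteration mentioned in the question, here is a minimal sketch, assuming GCC or Clang (for __builtin_ctz). The helper names log2_pow2 and sum_set_bit_indices are made up for illustration, and the sum merely stands in for whatever work is done per set bit:

```c
#include <stdint.h>

/* log2 of a power of two: the exponent equals the number of trailing zeros.
   x must be nonzero, because __builtin_ctz(0) is undefined. */
static int log2_pow2(uint32_t x)
{
    return __builtin_ctz(x);
}

/* Visit every set bit from lowest to highest (hypothetical example task:
   add up the bit indices). */
static int sum_set_bit_indices(uint32_t set)
{
    int sum = 0;
    while (set != 0) {
        sum += __builtin_ctz(set); /* index of the lowest set bit */
        set &= set - 1;            /* clear that bit */
    }
    return sum;
}
```

The loop runs once per set bit rather than once per bit position, which is usually the point of using ctz for this.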

Three more theoretically possibly efficient algorithms in addition to the ones already given or linked. n is the number of bits, N = 2^n:
1. Big LUT: one lookup
2. Simple binary search: log2(n) comparisons
3. LUT[N % k] with k-position LUT: one modulo, one lookup (k=37 for 32-bit and 67 for 64-bit numbers)
In practice, #1 is great with small n, #2 may be fastest on certain hardware (something without fast multiply), but the code looks ugly. #3 probably never beats DeBruijn on a real machine, but it has fewer operations.
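A rough sketch of the modulo-based lookup for 32-bit values. The function name log2_pow2_mod37 is invented here, and the table is built at first use instead of being written out by hand:

```c
#include <stdint.h>

/* Position of the single set bit in a 32-bit power of two, via x % 37.
   This works because 2^0 .. 2^31 all leave distinct remainders modulo 37. */
static int log2_pow2_mod37(uint32_t x)
{
    static int8_t lut[37];
    static int ready = 0;

    if (!ready) { /* build the table on first use (not thread-safe) */
        for (int i = 0; i < 37; i++)
            lut[i] = -1;
        for (int i = 0; i < 32; i++)
            lut[(UINT32_C(1) << i) % 37] = (int8_t)i;
        ready = 1;
    }
    /* Returns -1 for x == 0; other non-powers of two give meaningless results. */
    return lut[x % 37];
}
```

In real code the table would probably be a precomputed constant array, so the lookup is just one modulo and one load.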

Related

Optimal frequency of modulo operation in finite field arithmetic implementation

I'm trying to implement finite field arithmetic to use it in Elliptic Curve calculations. Since all that's ever used are arithmetic operations that commute with the modulo operator, I don't see a reason not to delay that operation till the very end. One thing that may happen is that the numbers involved might become (way) too big and impractical/inefficient to work with, but I was wondering if there was a way to determine the optimal conditions/frequency which should trigger a modulo operation in the calculations.
I'm coding in C.
To avoid the complexity of elliptic curve crypto (as I'm unfamiliar with its algorithms), let's assume you're doing temp = (a * b) % M; result = (temp * c) % M, and you're thinking about just doing result = (a * b * c) % M instead.
Let's also assume that you're doing this a lot with the same modulo M; so you've precomputed "multiples of M" lookup tables, so that your modulo code can use the table to find the highest multiple of "M shifted left by N" that is not greater than the dividend and subtract it from the dividend, and repeat that with decreasing values of N until you're left with the remainder.
If your lookup table has 256 entries, the dividend is 4096 bits and the divisor is 2048 bits; then you'd reduce the size of the dividend by 8 bits per iteration, so the dividend would become smaller than the divisor (and you'd have the remainder) after no more than 256 "search and subtract" operations.
For multiplication; it's almost purely "multiply and add digits" for each pair of digits. E.g. using uint64_t as a digit, multiplying 2048 bit numbers is multiplying 32 digit numbers and involves 32 * 32 = 1024 of those "multiply and add digits" operations.
Now we can make comparisons. Specifically, assuming a, b, c, M are 2048-bit numbers:
a) the original temp = (a * b) % M; result = (temp * c) % M would be 1024 "multiply and add", then 256 "search and subtract", then 1024 "multiply and add", then 256 "search and subtract". For totals it'd be 2048 "multiply and add" and 512 "search and subtract".
b) the proposed result = (a * b * c) % M would be 1024 "multiply and add", then would be 2048 "multiply and add" (as the result of a*b will be a "twice as big" 4096-bit number), then 512 "search and subtract" (as a*b*c will be twice as big as a*b). For totals it'd be 3072 "multiply and add" and 512 "search and subtract".
In other words (under all of the assumptions above), the proposed result = (a * b * c) % M would be worse, with 50% more "multiply and add" and the exact same "search and subtract".
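The counting above can be written down as a tiny cost model. This is only a sketch of the arithmetic in this answer, under the same assumptions (64-bit digits, 8 bits removed per "search and subtract"); the type and function names are invented for illustration:

```c
#include <stddef.h>

typedef struct {
    size_t mul_add;    /* "multiply and add digits" operations */
    size_t search_sub; /* "search and subtract" operations */
} cost_t;

/* Schoolbook multiply: one operation per pair of 64-bit digits. */
static size_t mul_cost(size_t bits_a, size_t bits_b)
{
    return (bits_a / 64) * (bits_b / 64);
}

/* Table-driven modulo: each step removes 8 bits from the dividend. */
static size_t mod_cost(size_t dividend_bits, size_t divisor_bits)
{
    return (dividend_bits - divisor_bits) / 8;
}

/* temp = (a * b) % M; result = (temp * c) % M */
static cost_t cost_reduce_each_step(size_t bits)
{
    cost_t c;
    c.mul_add = 2 * mul_cost(bits, bits);
    c.search_sub = 2 * mod_cost(2 * bits, bits);
    return c;
}

/* result = (a * b * c) % M */
static cost_t cost_reduce_at_end(size_t bits)
{
    cost_t c;
    c.mul_add = mul_cost(bits, bits) + mul_cost(2 * bits, bits);
    c.search_sub = mod_cost(3 * bits, bits);
    return c;
}
```

Plugging in bits = 2048 reproduces the totals above; changing the digit size or the table size means changing the two helper functions.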
Of course none of this (the operations you need for elliptic curve crypto, the sizes of your variables, etc) can be assumed to apply for your specific case.
I was wondering if there was a way to determine the optimal conditions/frequency which should trigger a modulo operation in the calculations.
Yes; the way to determine the optimal conditions/frequency is to do similar to what I did above - determine the true costs (in terms of lower level operations, like my "search and subtract" and "multiply and add") and compare them.
In general (regardless of how modulo is implemented, etc) I'd expect you'll find that doing modulo as often as possible is the fastest option (as it reduces the cost of multiplications and also reduces the cost of later/final modulo) for all cases that don't involve addition or subtraction, and that don't fit in simple integers.
If M is a constant, then an alternative for modulo is to multiply by the logical inverse of M. Looking at Polk's comment about 256 bits being a common case, then assuming M is polynomial of degree 256 with 1 bit coefficients, then define the inverse of M to be x^512 / M, which results in a 256 bit "inverse". Name this inverse to be I. Then for a multiply modulo M:
C = A * B ; 512 bit product
Q = (upper 256 bits of C * I)>>256 ; Q = C / M = 256 bit quotient
P = M * Q ; 512 bit product
R = lower 256 bits of (C xor P) ; (A * B)% M
So this requires 3 extended precision multiplies and one xor.
If the processor for this code has a carryless multiply, such as X86 PCLMULQDQ, which multiplies two 64 bit operands to produce a 128 bit result, then that could be used as the basis for an extended precision multiply. A basic implementation would need 16 multiplies for a 256 bit by 256 bit multiply to produce a 512 bit product. This could be improved using something like Karatsuba:
https://en.wikipedia.org/wiki/Karatsuba_algorithm
but on current X86, PCLMULQDQ is fast, taking 1 to 3 cycles, so the main issue would be loading the data into the XMM registers, and I'm not sure Karatsuba would save much time.
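To make the four steps (C, Q, P, R) concrete, here is a scaled-down sketch using a degree 8 polynomial instead of degree 256. Everything specific in it is an assumption for illustration: the modulus 0x11B (the AES polynomial), the helper names, and the bit-by-bit clmul that stands in for a hardware carryless multiply.

```c
#include <stdint.h>

/* Carry-less (polynomial over GF(2)) multiply of two small operands. */
static uint32_t clmul(uint32_t a, uint32_t b)
{
    uint32_t r = 0;
    while (b != 0) {
        if (b & 1)
            r ^= a;
        a <<= 1;
        b >>= 1;
    }
    return r;
}

/* I = quotient of x^16 / m for a degree-8 m, by polynomial long division. */
static uint32_t poly_inverse8(uint32_t m)
{
    uint32_t rem = UINT32_C(1) << 16, q = 0;
    for (int shift = 8; shift >= 0; shift--) {
        if (rem & (UINT32_C(1) << (shift + 8))) {
            rem ^= m << shift;
            q |= UINT32_C(1) << shift;
        }
    }
    return q;
}

/* (a * b) % m for 8-bit polynomials a, b, with inv = poly_inverse8(m). */
static uint32_t mulmod_barrett8(uint32_t a, uint32_t b, uint32_t m, uint32_t inv)
{
    uint32_t c = clmul(a, b);             /* C = A * B, up to 15 bits */
    uint32_t q = clmul(c >> 8, inv) >> 8; /* Q = C / M */
    uint32_t p = clmul(m, q);             /* P = M * Q */
    return (c ^ p) & 0xFF;                /* R = low 8 bits of (C xor P) */
}
```

With m = 0x11B the precomputed inverse comes out as 0x11A, and multiplying 0x57 by 0x83 gives 0xC1, which matches the worked example in the AES specification.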
optimal conditions/frequency which should trigger a modulo operation in the calculations
Standard practice is to replace all actual modulo operations with something else. So the frequency is never. There are different ways to accomplish that:
Choose the modulus to be a Mersenne prime or pseudo-Mersenne prime. There is a large repertoire of mathematical tricks to implement arithmetic modulo a (pseudo-)Mersenne prime efficiently, without doing any actual modulo operations. In the context of elliptic curves, the prime-modulus NIST curves are chosen this way and for this reason.
Use Barrett reduction. This has the same effect as a real modulo operation, but relies on some precomputation and a precondition on the range of the input to be able to reduce the cost of a modulo-like operation to the cost to a couple of multiplications (plus some supporting operations). Also applicable to polynomial fields.
Do arithmetic in Montgomery form.
Additionally, and perhaps more in the spirit of your question, a common technique is to do various additions without reducing every time (addition does not significantly change the size of a number). It takes a lot of additions before you need an extra limb in your integers, so a lot of them can be done before it starts to make sense to reduce. For multiplications, unless it's by a small constant it almost always makes sense to reduce immediately afterwards to prevent the numbers from getting much physically larger than they need to be (which would be especially bad if the result was fed into another multiplication).
Another technique especially associated with Barrett reductions is to work, most of the time, in a slightly larger range than [0 .. N), e.g. [0 .. 2N). This enables skipping the conditional subtraction that Barrett reduction needs in order to fully reduce to the range [0 .. N), while still using the most important part, the reduction from the range [0 .. N²) to the range [0 .. 2N).
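A small sketch of the "reduce after multiplying, but let additions pile up" idea, using plain 64-bit integers and the built-in % as a stand-in for a real reduction routine. The function names and the sum-of-squares task are invented for illustration; m must be nonzero and fit in 32 bits.

```c
#include <stdint.h>

/* Reduce after every multiplication and after every addition. */
static uint32_t sum_sq_eager(uint32_t k, uint32_t m)
{
    uint64_t acc = 0;
    for (uint64_t i = 1; i <= k; i++) {
        uint64_t term = (i * i) % m; /* reduce right after multiplying */
        acc = (acc + term) % m;      /* and again after adding */
    }
    return (uint32_t)acc;
}

/* Reduce after every multiplication, but only once for all additions. */
static uint32_t sum_sq_lazy(uint32_t k, uint32_t m)
{
    uint64_t acc = 0;
    for (uint64_t i = 1; i <= k; i++)
        acc += (i * i) % m; /* each term < 2^32, so fewer than 2^32 terms
                               cannot overflow the 64-bit accumulator */
    return (uint32_t)(acc % m);
}
```

Both functions return the same value; the lazy one simply performs k fewer reductions.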

How does C perform the % operation internally

I am curious to understand the logic behind the mod operation since I understand that bit-shifting operations can be performed to do different things such as bit shifting to multiply.
One way I can see it being done is by a recursive algorithm that keeps dividing until you cannot divide anymore, but this does not seem efficient.
Any ideas will be helpful. Thanks in advance!
The quick version is: it depends on the hardware, the optimizer, whether it's division by a constant or not, whether there are exceptions to be checked for (e.g. modulo by 0), whether and how negative numbers are handled (this is a scary question for C++), etc...
R gave a nice, concise answer for unsigned integers, but it's difficult to understand unless you're well versed with C.
The crux of the technique illuminated by R is to strip away multiples of q until there's no more multiples of q left. We could naively do this with a simple loop:
while (p >= q) p -= q; // One liner, woohoo!
The code may be short, but for large values of p and small values of q this might take a very long time.
Better than stripping away one q at a time would be to strip away many q's at a time. Note that we actually want to strip away as many q's as possible -- that is, floor(p/q) many q's... And indeed, that's a valid technique. For unsigned integers, one would expect that p % q == p - (p / q) * q. (Note that unsigned integer division rounds down.)
But this almost feels like cheating because division and remainder operations are so intimately related. (In fact, often if hardware natively supports division, it supports a divide-and-compute-remainder operation because they're so strongly related.)
Assuming we've no access to division, how shall we find a multiple of q greater than 1 to strip away? In hardware, fixed shift operations are cheap (if not practically free) and conceptually represent multiplication by a non-negative power of two. For example, shifting a bit string left by 3 is equivalent to multiplying by 8 (that is, 2^3), e.g. 5 decimal is equivalent to '101' binary. Shift '101' in binary by adding three zeroes on the right (giving '101000') and the result is 40 in decimal -- five times eight.
Likewise, shift operations are very cheap as software operations and you'll struggle to find a controller that doesn't support them and quickly. (Some architectures such as ARM can even combine shifts with other instructions to make them 'free' a good deal of the time.)
ARMed (couldn't resist) with these shift operations, we can proceed as follows:
Find the largest power of two we can multiply q by without the product exceeding p.
Working from the largest power of two to the smallest, multiply q by each power of two and, if the product is no greater than what's left of p, subtract it from what's left of p.
Whatever you've got left is the remainder.
Why does this work? Because in the end you'll find that all the subtracted powers of two actually sum to floor(p / q)! Don't take my word for it, similar knowledge has been known for a very long time.
Breaking apart R's answer:
#define HI (-1U-(-1U/2))
This effectively gives you an unsigned integer with only the highest value bit set.
unsigned i;
for (i=0; !(HI & (q<<i)); i++);
This loop finds the highest power of two that q can be multiplied by without overflowing an unsigned integer. Starting that high isn't strictly necessary, but it doesn't change the result; it only increases the execution time required.
In case you're not familiar with the C-isms in this line:
(q<<i) is a left bit shift by i. Recall this is equivalent to multiplying by 2^i.
HI & (q<<i) performs a bitwise-AND. Since HI only has its top bit populated this will only result in a non-zero value when (q<<i) is large enough to cause the top bit to be non-zero. One more shift over to the left and there'd be an integer overflow.
!(HI & (q<<i)) is 'true' when (HI & (q<<i)) is zero and 'false' otherwise.
do { if (p >= (q<<i)) p -= (q<<i); } while (i--);
This is a simple decreasing loop do { .... } while (i--);. Note that post-decrementing is used on i so the loop executes, then it checks to see if i is not zero, then it subtracts one from i, and then if its earlier check resulted in true it continues. This has the property that the loop executes its last time when i is 0. This is important because we may need to strip away an unmultiplied copy of q.
if (p >= (q<<i)) checks if the 2^i * q is less than or equal to p. If it is, p -= (q<<i) strips it away.
The remainder is left.
While most C implementations run on hardware that has a division instruction, the remainder operation can be performed roughly like this, for computing p%q, assuming unsigned values:
#define HI (-1U-(-1U/2))
unsigned i;
for (i=0; !(HI & (q<<i)); i++);
do { if (p >= (q<<i)) p -= (q<<i); } while (i--);
The resulting remainder is in p.
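For anyone who wants to try it, here is the same shift-and-subtract idea wrapped in a function. The name umod_shift and the q == 0 guard are additions for illustration (without the guard, the search loop would never terminate for q == 0):

```c
#include <limits.h>

/* p % q for unsigned values, using only shifts, compares and subtracts. */
static unsigned umod_shift(unsigned p, unsigned q)
{
    const unsigned hi = ~(UINT_MAX >> 1); /* only the top bit set */
    unsigned i = 0;

    if (q == 0)
        return p; /* assumption: pick some behaviour; real % is undefined here */

    /* Find the largest shift that keeps q << i inside an unsigned. */
    while (!(hi & (q << i)))
        i++;

    /* Strip away q * 2^i for i counting down to 0. */
    do {
        if (p >= (q << i))
            p -= (q << i);
    } while (i--);

    return p;
}
```

For example, umod_shift(100, 7) leaves 2, the same as 100 % 7.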
In addition to a hardware instruction and implementation using shifts, as R.. suggests, there's also reciprocal multiplication.
This technique can be used when the right-hand side of % is a constant, known at compile time.
Reciprocal multiplication is used to implement division, but using it for % is easy, based on the formula a%b == a-(a/b)*b.
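As a sketch of what a compiler might emit for x % 10 on 32-bit unsigned values: the constant 0xCCCCCCCD is 2^35 / 10 rounded up, and the function names are made up here. Other divisors need their own constant and shift.

```c
#include <stdint.h>

/* x / 10 without a divide: multiply by ceil(2^35 / 10), then shift right by 35. */
static uint32_t div10_recip(uint32_t x)
{
    return (uint32_t)(((uint64_t)x * UINT32_C(0xCCCCCCCD)) >> 35);
}

/* x % 10 via a % b == a - (a / b) * b */
static uint32_t mod10_recip(uint32_t x)
{
    return x - div10_recip(x) * 10u;
}
```

The result is exact for every 32-bit x because the rounding error in the constant is too small to push the product past the next multiple of 2^35.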
Depending on the smarts of the optimizer, there is a shortcut for modulo by a power of two. For example, a % 32 can be implemented as a & 31. In general, a % (2^N) == a & (2^N - 1). This is lightning fast compared to division. Most dividers (even hardware ones) require at least 1 cycle for each bit of the result, while a logical AND takes only a few cycles at most (in the pipeline).
EDIT: this only works if a is unsigned (or known to be non-negative)!
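A quick illustration of both the shortcut and the caveat. The function names are invented, and the signed case assumes C99 or later (where % truncates toward zero) on a two's complement machine:

```c
/* Unsigned: the mask and the remainder always agree. */
static unsigned mod32_mask(unsigned a)
{
    return a & 31u;
}

/* Signed remainder: keeps the sign of a, so -1 % 32 is -1. */
static int rem32_signed(int a)
{
    return a % 32;
}

/* Signed mask: always lands in 0..31, so -1 & 31 is 31. */
static int mask32_signed(int a)
{
    return a & 31;
}
```

This is why a compiler needs a few extra instructions to fix up the sign when it turns a signed % 32 into a mask.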

loop over 2^n states of n bits in C with n > 32

I'd like to have a loop in C over all possible 2^n states of n bits. For example if n=4 I'd like to loop over 0000, 0001, 0010, 0011, ..., 1110, 1111. The bits can be represented in any way, for example an integer array of length n with values 0 or 1, or a character array of length n with values "0" or "1", etc, it doesn't really matter.
For smallish n what I do is calculate x=2^n using integer arithmetic (both n and x are integers), then
for(i=0;i<x;i++) {
bits = convert_integer_to_bits( i );
work_on_bits( bits );
}
Here 'bits' is in the given representation of bits, what was useful so far is an integer array of length n with values 0 or 1 (but can be anything else).
If n>32 this approach obviously doesn't work even with longs.
How would I work with n>32?
Specifically, do I really need to evaluate 2^n, or is there a tricky way of writing the loop which does not refer to the actual value of 2^n but nevertheless iterates 2^n times?
For n > 32 use unsigned long long. This will work for n up to 64. Still, for values even close to 50 you will have to wait a long time until the loop finishes.
It's not clear why you say that if n>32, it obviously won't work. Is your concern the width of bits, or is your concern the run time?
If you're concerned about number width, investigate a big math library such as http://gmplib.org/.
If you're concerned about run time... you won't live long enough for your loop to complete if the width is large enough, so get a different hobby ;) Seriously... figure out the rough run time of one iteration through your loop and multiply that by 4 billion, divide by 20 years, and you'll have an estimate of the number of generations of your descendants that will need to wait for the answer.
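On the question of whether 2^n has to be evaluated at all: it doesn't. One possible sketch is to treat the bit array itself as a counter and stop when it wraps around. The names next_state and count_states are made up, and the 128-entry buffer is an arbitrary limit:

```c
#include <stddef.h>

/* Advance bits[0..n-1] (bits[0] is the least significant) to the next state.
   Returns 0 once the counter has wrapped back to all zeros. */
static int next_state(unsigned char *bits, size_t n)
{
    for (size_t i = 0; i < n; i++) {
        if (bits[i] == 0) {
            bits[i] = 1;
            return 1;
        }
        bits[i] = 0; /* carry into the next position */
    }
    return 0;
}

/* Walk all states of n bits and count them, without ever computing 2^n. */
static unsigned long long count_states(size_t n)
{
    unsigned char bits[128] = {0};
    unsigned long long count = 0;

    if (n > sizeof bits)
        return 0;
    do {
        count++; /* work_on_bits(bits) would go here */
    } while (next_state(bits, n));
    return count;
}
```

This removes the limit on n, but of course not the run time: the loop body still executes 2^n times.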

How to determine the number of useful bits of a number?

I'm writing a function that determines the number of useful bits of a 16-bit integer.
int16_t
f(int16_t x)
{
/* ... */
}
For example, the number "00000010 00100101" has 10 useful bits. I think I should use some bitwise operators, but I don't know how. I'm looking for some ways to do it.
If you're using gcc (or a gcc-compatible compiler such as ICC) then you can use built in intrinsics, e.g.
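#include <limits.h>
#include <stdint.h>
int f(int16_t x)
{
/* __builtin_clz takes an unsigned int, so go through uint16_t (no sign
   extension) and count against the width of unsigned int, not of x. */
unsigned int u = (uint16_t)x;
return u != 0 ? (int)(sizeof(u) * CHAR_BIT) - __builtin_clz(u) : 0;
}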
This assumes you just want the number of bits to the right of the last leading zero bit.
For MSVC you can use _BitScanReverse with some adjustment.
Otherwise if you need this to be portable then you can implement your own general purpose clz function, see e.g. http://en.wikipedia.org/wiki/Find_first_set
These are called bit-scan operations. Intel architectures have assembly instructions for them, which you can call directly from C, and MS compilers provide their own intrinsics for them.
Logarithms compute the number of digits needed to represent a certain number to a certain base:
Let floor(x) be x rounded down to the nearest integer.
Then floor(log_b(x)) + 1 is the number of digits needed to represent a positive integer x in base b.
Hence, if you want to know the number of significant bits of some x > 0 in C, then floor(log2(x)) + 1 will tell you.
Unless you have a hardware instruction or compiler intrinsic that counts leading zeros, computing the logarithm may actually be faster than naively iterating over the bits.
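For reference, the naive portable version is just a shift loop. This is a sketch; the name bit_length16 is made up, and the input is taken as uint16_t so that negative values are not sign-extended:

```c
#include <stdint.h>

/* Number of bits from the highest set bit down to bit 0 (0 for x == 0). */
static int bit_length16(uint16_t x)
{
    int n = 0;
    while (x != 0) {
        n++;
        x >>= 1;
    }
    return n;
}
```

For the example in the question, 00000010 00100101 (0x0225), this returns 10.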

Fast inverse square of double in C/C++

Recently I was profiling a program in which the hotspot is definitely this
double d = somevalue();
double d2=d*d;
double c = 1.0/d2; // HOT SPOT
The value d2 is not used afterwards because I only need the value c. Some time ago I read about the Carmack method of fast inverse square root. This is obviously not the same case, but I'm wondering if a similar algorithm can help me compute 1/x^2.
I need quite accurate precision, I've checked that my program doesn't give correct results with gcc -ffast-math option. (g++-4.5)
The tricks for doing fast square roots and the like get their performance by sacrificing precision. (Well, most of them.)
Are you sure you need double precision? You can sacrifice precision easily enough:
double d = somevalue();
float c = 1.0f / ((float) d * (float) d);
The 1.0f is absolutely mandatory in this case, if you use 1.0 instead you will get double precision.
Have you tried enabling "sloppy" math on your compiler? On GCC you can use -ffast-math, there are similar options for other compilers. The sloppy math may be more than good enough for your application. (Edit: I did not see any difference in the resulting assembly.)
If you are using GCC, have you considered using -mrecip? There is a "reciprocal estimate" function which only has about 12 bits of precision, but it is much faster. You can use the Newton-Raphson method to increase the precision of the result. The -mrecip option will cause the compiler to automatically generate the reciprocal estimate and Newton-Raphson steps for you, although you can always write the assembly yourself if you want to fine tune the performance-precision trade-off. (Newton-Raphson converges very quickly.) (Edit: I was unable to get GCC to generate RCPSS. See below.)
I found a blog post (source) discussing the exact problem you are going through, and the author's conclusion is that the techniques like the Carmack method are not competitive with the RCPSS instruction (which the -mrecip flag on GCC uses).
The reason why division can be so slow is because processors generally only have one division unit and it's often not pipelined. So, you can have a few multiplications in the pipe all executing simultaneously, but no division can be issued until the previous division finishes.
Tricks that don't work
Carmack's method: It is obsolete on modern processors, which have reciprocal estimation opcodes. For reciprocals, the best version I've seen only gives one bit of precision -- nothing compared to the 12 bits of RCPSS. I think it is a coincidence that the trick works so well for reciprocal square roots; a coincidence that is unlikely to be repeated.
Relabeling variables. As far as the compiler is concerned, there is very little difference between 1.0/(x*x) and double x2 = x*x; 1.0/x2. I would be surprised if you found a compiler that generates different code for the two versions with optimizations turned on even to the lowest level.
Using pow. The pow library function is a total monster. With GCC's -ffast-math turned off, the library call is fairly expensive. With GCC's -ffast-math turned on, you get the exact same assembly code for pow(x, -2) as you do for 1.0/(x*x), so there is no benefit.
Update
Here is an example of a Newton-Raphson approximation for the inverse square of a double-precision floating-point value.
static double invsq(double x)
{
double y;
int i;
__asm__ (
"cvtpd2ps %1, %0\n\t"
"rcpss %0, %0\n\t"
"cvtps2pd %0, %0"
: "=x"(y)
: "x"(x));
for (i = 0; i < RECIP_ITER; ++i)
y *= 2 - x * y;
return y * y;
}
Unfortunately, with RECIP_ITER=1 benchmarks on my computer put it slightly slower (~5%) than the simple version 1.0/(x*x). It's faster (2x as fast) with zero iterations, but then you only get 12 bits of precision. I don't know if 12 bits is enough for you.
I think one of the problems here is that this is too small of a micro-optimization; at this scale the compiler writers are on nearly equal footing with the assembly hackers. Maybe if we had the bigger picture we could see a way to make it faster.
For example, you said that -ffast-math caused an undesirable loss of precision; this may indicate a numerical stability problem in the algorithm you are using. With the right choice of algorithm, many problems can be solved with float instead of double. (Of course, you may just need more than 24 bits. I don't know.)
I suspect the RCPSS method shines if you want to compute several of these in parallel.
Yes, you can certainly try and work something out. Let me just give you some general ideas, you can fill in the details.
First, let's see why Carmack's root works:
We write x = M × 2^E in the usual way. Now recall that the IEEE float stores the exponent offset by a bias: If e denoted the exponent field, we have e = Bias + E ≥ 0. Rearranging, we get E = e − Bias.
Now for the inverse square root: x^(−1/2) = M^(−1/2) × 2^(−E/2). The new exponent field is:
e' = Bias − E/2 = 3/2 Bias − e/2
With bit fiddling, we can get the value e/2 from e by shifting, and 3/2 Bias is just a constant.
Moreover, the mantissa M is stored as 1.0 + x with x < 1, and we can approximate M^(−1/2) as 1 − x/2. Again, the fact that only x is stored in binary means that we get the division by two by simple bit shifting.
Now we look at x^(−2): this is equal to M^(−2) × 2^(−2E), and we are looking for an exponent field:
e' = Bias − 2 E = 3 Bias − 2 e
Again, 3 Bias is just a constant, and you can get 2 e from e by bitshifting. As for the mantissa, you can approximate (1 + x)^(−2) by 1 − 2 x, and so the problem reduces to obtaining 2 x from x.
Note that Carmack's magic floating point fiddling doesn't actually compute the result right away. Rather, it produces a remarkably accurate estimate, which is used as the starting point for a traditional, iterative computation. But because the estimate is so good, you only need very few rounds of subsequent iteration to get an acceptable result.
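One way the details might be filled in for doubles, as a sketch only. It applies e' = 3 Bias − 2 e to the whole 64-bit word (so the mantissa gets the 1 − 2 x treatment for free), then polishes with Newton steps. The function names and the untuned constant are assumptions, and the input is assumed to be a positive, normal double that is not extremely large or small:

```c
#include <stdint.h>
#include <string.h>

/* Crude estimate of 1/x^2 from the bit pattern: new = 3*Bias - 2*old,
   where 0xBFD0000000000000 is 3 * (1023 << 52). */
static double invsq_estimate(double x)
{
    uint64_t bits;
    memcpy(&bits, &x, sizeof bits);
    bits = UINT64_C(0xBFD0000000000000) - 2 * bits;
    memcpy(&x, &bits, sizeof x);
    return x;
}

/* Refine the estimate; each Newton step roughly squares the relative error. */
static double invsq_refined(double x, int iterations)
{
    double x2 = x * x;
    double y = invsq_estimate(x);
    for (int i = 0; i < iterations; i++)
        y *= 2.0 - x2 * y; /* Newton step for f(y) = 1/y - x^2 */
    return y;
}
```

The raw estimate is exact for powers of two but can be off by well over 10% elsewhere, so several Newton steps are needed for double precision. Whether that beats a plain division is something only a benchmark can tell.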
For your current program you have identified the hotspot - good. As an alternative to speeding up 1/d^2, you have the option of changing the program so that it does not compute 1/d^2 so often. Can you hoist it out of an inner loop? For how many different values of d do you compute 1/d^2? Could you pre-compute all the values you need and then look up the results? This is a bit cumbersome for 1/d^2, but if 1/d^2 is part of some larger chunk of code, it might be worthwhile applying this trick to that. You say that if you lower the precision, you don't get good enough answers. Is there any way you can rephrase the code, that might provide better behaviour? Numerical analysis is subtle enough that it might be worth trying a few things and seeing what happened.
Ideally, of course, you would find some optimised routine that draws on years of research - is there anything in lapack or linpack that you could link to?

Resources