Optimize modulus arithmetic with big numbers - c

I'd like to implement big number arithmetic operations modulo P, with P = 2^256 - 2^32 - 977. Note that P is fixed so any optimization can be hardcoded.
I'm using an array of 8 u32 to represent a u256:
struct fe {
uint32_t b[8]; // 256 = 8 x 32
};
Now a simple version of the addition could look like this
void fe_add(struct fe *x, struct fe *y, struct fe *res) {
int carry = 0;
for (int i = 0; i < 8; ++i) {
uint32_t tmp = x->b[i] + y->b[i] + carry;
carry = tmp < x->b[i] && tmp < y->b[i];
res->b[i] = tmp;
}
}
Now to support (x + y) % P, I could use this definition and define -, *, and / over struct fe:
// (x + y) % P = (x + y) - (P * int((x + y) / P))
fe_add(&x, &y, &t1); // t1 = x + y
fe_div(&t1, &P, &t2); // t2 = (x + y) / P
fe_mult(&P, &t2, &t3); // t3 = P * ((x + y) / P)
fe_sub(&t1, &t3, &res); // res = x + y - (P * ((x + y) / P))
What would be a better way to implement (x + y) % P directly during the addition, knowing that P won't change?

As Eric wrote in a comment, you should pay attention to the carry. After your loop is done, you may have some carry from the highest position. If in the end carry is not zero, then it has to be one. Then its value is 2^256, corresponding to index 8. Since
2^256 ≡ 2^32 + 977 (mod P)
you may account for this carry by adding 2^32 + 977 to your result so far. You can probably do so in an optimized manner (i.e. not re-using the same add loop), since you know the one term to be mostly zeros so you can stop after the first (least significant) two “digits” are added as soon as the carry has become zero. (I'm using the term “digit” for each of your u32 array members.)
What do you do if during that addition the carry at the high end of the addition is non-zero a second time? As Eric noted, when each of your inputs is less than P, the sum will be less than 2P so subtracting P once (which is what the shift from 2^256 to 2^32 + 977 does) will make it less than P. So no need to worry, you can stop the loop when carry becomes zero no matter the digit count.
And what if the resulting sum is at least P but less than 2^256, so you don't get any carry? To also cover this situation, you can compare the result against P, and subtract P unless the result is smaller. Subtraction is a lot easier than division. You can skip this check if you took the code path for the non-zero carry. You can also optimize that check somewhat, aborting if any of the six most significant “digits” is less than 2^32-1. Only if they all equal 2^32-1 do you need some minor comparisons and computations to do the actual subtraction in the lowest two “digits” before clearing all the higher “digits”.
In Python-like pseudo-code and glossing over the details of how to detect overflow or underflow happening in the line before:
def fe_add(x, y, res):
carry = 0
for i in 0 .. 7:
res[i] = x[i] + y[i] + carry
carry = 1 if overflow else 0
# So far this is what you had.
if carry != 0:
# If carry == 1: add 2^32 + 977 instead.
res[0] += 977
res[1] += 1 + (1 if overflow else 0)
carry = 1 if overflow else 0
i = 2
while carry != 0:
res[i] += 1
carry = 1 if overflow else 0
i++
else:
# Compare res against P.
for i in 7 .. 2:
if res[i] != 2^32 - 1:
return
if res[1] == 2^32 - 1 or (res[1] == 2^32 - 2 and res[0] >= 2^32 - 977):
# If res >= P, subtract P.
res[0] -= 2^32 - 977
res[1] -= 2^32 - 2 + (1 if underflow else 0)
for i in 2 .. 7:
res[i] = 0
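The pseudo-code above could be rendered in C roughly as follows. This is only a sketch: it assumes both inputs are already reduced (less than P), it detects carries with a 64-bit accumulator, it is not constant-time, and the name fe_add_mod is made up for illustration.

```c
#include <stdint.h>

struct fe {
    uint32_t b[8]; /* 256 = 8 x 32, least significant word first */
};

/* res = (x + y) mod P, with P = 2^256 - 2^32 - 977; assumes x, y < P. */
void fe_add_mod(const struct fe *x, const struct fe *y, struct fe *res) {
    uint64_t acc = 0;
    for (int i = 0; i < 8; ++i) {
        acc += (uint64_t)x->b[i] + y->b[i];
        res->b[i] = (uint32_t)acc;
        acc >>= 32; /* carry into the next word */
    }
    if (acc) {
        /* Carry out is worth 2^256, which is 2^32 + 977 (mod P). */
        acc = (uint64_t)res->b[0] + 977;
        res->b[0] = (uint32_t)acc;
        acc = (acc >> 32) + res->b[1] + 1;
        res->b[1] = (uint32_t)acc;
        acc >>= 32;
        for (int i = 2; i < 8 && acc; ++i) {
            acc += res->b[i];
            res->b[i] = (uint32_t)acc;
            acc >>= 32;
        }
        return;
    }
    /* No carry: res >= P is only possible if words 7..2 are all ones. */
    for (int i = 7; i >= 2; --i)
        if (res->b[i] != 0xFFFFFFFFu) return;
    if (res->b[1] == 0xFFFFFFFFu ||
        (res->b[1] == 0xFFFFFFFEu && res->b[0] >= 0xFFFFFC2Fu)) {
        /* res - P equals res + 2^32 + 977 with the 2^256 dropped. */
        acc = (uint64_t)res->b[0] + 977;
        res->b[0] = (uint32_t)acc;
        res->b[1] = (uint32_t)((acc >> 32) + res->b[1] + 1);
        for (int i = 2; i < 8; ++i) res->b[i] = 0;
    }
}
```

For example, adding 1 to P - 1 takes the comparison path and yields zero, while adding P - 1 to itself takes the carry path and yields P - 2.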
There is an alternative. Instead of using numbers from the range [0 .. P-1] to represent your elements of the modulo group, you might also choose to use [2^32 + 977 .. 2^256-1] instead. That would simplify some operations but complicate others. Additions in particular would be simpler, because just handling the nonzero carry situation discussed above would be enough. Comparing whether a number is ≡ 0 (mod P) would be more complicated, for example. And it might also be confusing some code contributors. As usual with changes that might improve performance, tests would be best suited to tell whether one or the other solution performs better in practice. But perhaps you might want to design your API so that you can swap these implementation details without any code using them even noticing it. This could mean e.g. not relying on zero initialization to initialize a zero element of that data type but having a function instead.


modulo arithmetic steps for this program

I have written this code in C where each of a,b,cc,ma,mb,mcc,N,k are int. But as per the specification of the problem, N and k could be as big as 10^9. 10^9 can be stored within an int variable on my machine. But the intermediate and final values of a,b,cc,ma,mb,mcc will be much bigger for bigger values of N and k, and cannot be stored even in an unsigned long long int variable.
Now, I want to print value of mcc % 1000000007 as you can see in the code. I know, some clever modulo arithmetic tricks in the operations of the body of the for loop can create correct output without any overflow and also can make the program time efficient. Being new in modulo arithmetic, I failed to solve this. Can someone point me out those steps?
ma=1;mb=0;mcc=0;
for(i=1; i<=N; ++i){
a=ma;b=mb;cc=mcc;
ma = k*a + 1;
mb = k*b + k*(k-1)*a*a;
mcc = k*cc + k*(k-1)*a*(3*b+(k-2)*a*a);
}
printf("%d\n",mcc%1000000007);
My attempt:
I used a,b,cc,ma,mb,mcc as long long and done this. Could it be optimized more ??
ma=1;mb=0;cc=0;
ok = k*(k-1);
for(i=1; i<=N; ++i){
a=ma;b=mb;
as = (a*a)%MOD;
ma = (k*a + 1)%MOD;
temp1 = (k*b)%MOD;
temp2 = (as*ok)%MOD;
mb = (temp1+temp2)%MOD;
temp1 = (k*cc)%MOD;
temp2 = (as*(k-2))%MOD;
temp3 = (3*b)%MOD;
temp2 = (temp2+temp3)%MOD;
temp2 = (temp2*a)%MOD;
temp2 = (ok*temp2)%MOD;
cc = (temp1 + temp2)%MOD;
}
printf("%lld\n",cc);
Let's look at a small example:
mb = (k*b + k*(k-1)*a*a)%MOD;
Here, k*b, k*(k-1)*a*a can overflow, so can the sum, taking into account
(x + y) mod m = (x mod m + y mod m) mod m
we can rewrite this (x= k*b, y=k*(k-1)*a*a and m=MOD)
mb = ((k*b) % MOD + (k*(k-1)*a*a) %MOD) % MOD
Now we could go one step further. Since
(x * y) mod m = ((x mod m) * (y mod m)) mod m
we can also rewrite the multiplication k*(k-1)*a*a % MOD, with x=k*(k-1) and y=a*a, to
(((k*(k-1)) % MOD) * ((a*a) % MOD)) % MOD
I'm sure you can do the rest. While you can sprinkle % MOD all over the place, you should carefully consider whether you need it or not, taking John's hint into account:
Adding two n-digit numbers produces a number of up to n+1 digits, and
multiplying an n-digit number by an m-digit number produces a result
with up to n + m digits.
As such, there are places where you will need to use modulus properties, and there are some where you surely don't need them, but this is your part of the work ;).
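Putting these two identities together, the whole loop from the question might look like the sketch below. It assumes 64-bit unsigned arithmetic, so that the product of two residues below 1000000007 cannot overflow; the helper mulmod and the function name mcc_mod are hypothetical.

```c
#include <stdint.h>

#define MOD 1000000007ULL

/* (x * y) mod MOD; both factors are reduced first, so the product fits in 64 bits. */
static uint64_t mulmod(uint64_t x, uint64_t y) {
    return (x % MOD) * (y % MOD) % MOD;
}

/* Runs the recurrence from the question n times and returns mcc mod MOD. */
uint64_t mcc_mod(uint64_t n, uint64_t k) {
    uint64_t ma = 1, mb = 0, mcc = 0;
    uint64_t kk1 = mulmod(k, k % MOD + MOD - 1); /* k*(k-1), kept non-negative */
    uint64_t k2 = (k % MOD + MOD - 2) % MOD;     /* k-2, kept non-negative */
    for (uint64_t i = 0; i < n; ++i) {
        uint64_t a = ma, b = mb, cc = mcc;
        uint64_t a2 = mulmod(a, a);
        ma = (mulmod(k, a) + 1) % MOD;
        mb = (mulmod(k, b) + mulmod(kk1, a2)) % MOD;
        uint64_t inner = (3 * b + mulmod(k2, a2)) % MOD; /* 3*b + (k-2)*a*a */
        mcc = (mulmod(k, cc) + mulmod(kk1, mulmod(a, inner))) % MOD;
    }
    return mcc;
}
```

Adding MOD before subtracting keeps k-1 and k-2 from going negative, which matters because % in C can return a negative value for a negative operand.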
That's a good exercise to build a template class along these lines:
template <int N>
class modulo_int_t
{
public:
modulo_int_t(int value) : value_(value % N) {}
modulo_int_t<N> operator+(const modulo_int_t<N> &rhs)
{
return modulo_int_t<N>(value_ + rhs.value_) ;
}
// fill in the other operations
private:
int value_ ;
} ;
Then write the operations using modulo_int_t<1000000007> objects instead of int.
Disclaimer: make use of long long where appropriate and take care of negative differences...

Fastest algorithm to identify the smallest and largest x that make the double-precision equation x + a == b true

In the context of static analysis, I am interested in determining the values of x in the then-branch of the conditional below:
double x;
x = …;
if (x + a == b)
{
…
a and b can be assumed to be double-precision constants (generalizing to arbitrary expressions is the easiest part of the problem), and the compiler can be assumed to follow IEEE 754 strictly (FLT_EVAL_METHOD is 0). The rounding mode at run-time can be assumed to be to nearest-even.
If computing with rationals was cheap, it would be simple: the values for x would be the double-precision numbers contained in the rational interval (b - a - 0.5 * ulp1(b) … b - a + 0.5 * ulp2(b)). The bounds should be included if b is even, excluded if b is odd, and ulp1 and ulp2 are two slightly different definitions of “ULP” that can be taken identical if one does not mind losing a little precision on powers of two.
Unfortunately, computing with rationals can be expensive. Consider that another possibility is to obtain each of the bounds by dichotomy, in 64 double-precision additions (each operation deciding one bit of the result). 128 floating-point additions to obtain the lower and upper bounds may well be faster than any solution based on maths.
I am wondering if there is a way to improve over the “128 floating-point additions” idea. Actually I have my own solution involving changes of rounding mode and nextafter calls, but I wouldn't want to cramp anyone's style and cause them to miss a more elegant solution than the one I currently have. Also I am not sure that changing the rounding mode twice is actually cheaper than 64 floating-point additions.
You already gave a nice and elegant solution in your question:
If computing with rationals was cheap, it would be simple: the values
for x would be the double-precision numbers contained in the rational
interval (b - a - 0.5 * ulp1(b) … b - a + 0.5 * ulp2(b)). The bounds
should be included if b is even, excluded if b is odd, and ulp1 and
ulp2 are two slightly different definitions of “ULP” that can be taken
identical if one does not mind losing a little precision on powers of
two.
What follows is a half-reasoned sketch of a partial solution to the problem based on this paragraph. Hopefully I'll get a chance to flesh it out soon. To get a real solution, you'll have to handle subnormals, zeroes, NaNs, and all that other fun stuff. I'm going to assume that a and b are, say, such that 1e-300 < |a| < 1e300 and 1e-300 < |b| < 1e300 so that no craziness occurs at any point.
Absent overflow and underflow, you can get ulp1(b) from b - nextafter(b, -1.0/0.0). You can get ulp2(b) from nextafter(b, 1.0/0.0) - b.
If b/2 <= a <= 2b, then Sterbenz's theorem tells you that b - a is exact. So (b - a) - ulp1 / 2 will be the closest double to the lower bound and (b - a) + ulp2 / 2 will be the closest double to the upper bound. Try these values, and the values immediately before and after, and pick the widest interval that works.
If b > 2a, b - a > b/2. The computed value of b - a is off by at most half an ulp. One ulp1 is at most two ulp, as is one ulp2, so the rational interval you gave is at most two ulp wide. Figure out which of the five closest values to b-a work.
If a > 2b, an ulp of b-a is at least as big as an ulp of b; if anything works, I bet it'll have to be among the three closest values to b-a. I imagine the case where a and b have different signs works similarly.
I wrote a small pile of C++ code implementing this idea. It didn't fail random fuzz testing (in a few different ranges) before I got bored of waiting. Here it is:
void addeq_range(double a, double b, double &xlo, double &xhi) {
if (a != a) return; // empty interval
if (b != b) {
if (a-a != 0) { xlo = xhi = -a; return; }
else return; // empty interval
}
if (b-b != 0) {
// TODO: handle me.
}
// b is now guaranteed to be finite.
if (a-a != 0) return; // empty interval
if (b < 0) {
addeq_range(-a, -b, xlo, xhi);
xlo = -xlo;
xhi = -xhi;
return;
}
// b is now guaranteed to be zero or positive finite and a is finite.
if (a >= b/2 && a <= 2*b) {
double upulp = nextafter(b, 1.0/0.0) - b;
double downulp = b - nextafter(b, -1.0/0.0);
xlo = (b-a) - downulp/2;
xhi = (b-a) + upulp/2;
if (xlo + a == b) {
xlo = nextafter(xlo, -1.0/0.0);
if (xlo + a != b) xlo = nextafter(xlo, 1.0/0.0);
} else xlo = nextafter(xlo, 1.0/0.0);
if (xhi + a == b) {
xhi = nextafter(xhi, 1.0/0.0);
if (xhi + a != b) xhi = nextafter(xhi, -1.0/0.0);
} else xhi = nextafter(xhi, -1.0/0.0);
} else {
double xmid = b-a;
if (xmid + a < b) {
xhi = xlo = nextafter(xmid, 1.0/0.0);
if (xhi + a != b) xhi = xmid;
} else if (xmid + a == b) {
xlo = nextafter(xmid, -1.0/0.0);
xhi = nextafter(xmid, 1.0/0.0);
if (xlo + a != b) xlo = xmid;
if (xhi + a != b) xhi = xmid;
} else {
xlo = xhi = nextafter(xmid, -1.0/0.0);
if (xlo + a != b) xlo = xmid;
}
}
}

Best way to compute ((2^n )-1)mod p

I'm working on a cryptographic exercise, and I'm trying to calculate (2^n - 1) mod p where p is a prime number.
What would be the best approach to do this? I'm working with C, so 2^n - 1 becomes too large to hold when n is large.
I came across the equation (a*b) mod p = (a * (b mod p)) mod p, but I'm not sure this applies in this case, as 2^n - 1 may be prime (or I'm not sure how to factorise this).
Help much appreciated.
A couple tips to help you come up with a better way:
Don't use (a*b) mod p = (a * (b mod p)) mod p to compute 2^n - 1 mod p; use it to compute 2^n mod p and then subtract afterward.
Fermat's little theorem can be useful here. That way, the exponent you actually have to deal with won't exceed p.
You mention in the comments that n and p are 9 or 10 digits, or something. If you restrict them to 32 bit (unsigned long) values, you can find 2^n mod p with a simple (binary) modular exponentiation:
unsigned long long u = 1, w = 2;
while (n != 0)
{
if ((n & 0x1) != 0)
u = (u * w) % p; /* (mul-rdx) */
if ((n >>= 1) != 0)
w = (w * w) % p; /* (sqr-rdx) */
}
r = (unsigned long) u;
And, since (2^n - 1) mod p = r - 1 mod p :
r = (r == 0) ? (p - 1) : (r - 1);
If 2^n mod p = 0 - which doesn't actually occur if p > 2 is prime - but we might as well consider the general case - then (2^n - 1) mod p = -1 mod p.
Since the 'common residue' or 'remainder' (mod p) is in [0, p - 1], we add some multiple of p so that it is in this range.
Otherwise, the result of 2^n mod p was in [1, p - 1], and subtracting 1 will be in this range already. It's probably better expressed as:
if (r == 0)
r = p - 1; /* -1 mod p */
else
r = r - 1;
To take the modulus you somehow must have 2^n - 1, or else you move toward a different, interesting but separate, class of algorithms, so I recommend using a big-integer approach, as it will be easy: make a structure that represents a big value as several small values, e.g.
struct bigint{
int lowerbits;
int upperbits;
};
Decomposing the expression is also a solution, e.g. 2^n - 1 = (2^(n-4) * 2^4) - 1; decompose, reduce each factor mod p and handle them separately. That will be more algorithmic.
To compute 2^n - 1 mod p, you can use exponentiation by squaring after first removing any multiple of (p - 1) from n (since a^{p-1} = 1 mod p). In pseudo-code:
n = n % (p - 1)
result = 1
pow = 2
while n {
if n % 2 {
result = (result * pow) % p
}
pow = (pow * pow) % p
n /= 2
}
result = (result + p - 1) % p
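In C, the pseudo-code above might be written as in the following sketch. It assumes p is an odd prime below 2^32, so that every intermediate product fits in 64 bits; the function name pow2_minus1_mod is made up for illustration.

```c
#include <stdint.h>

/* (2^n - 1) mod p for an odd prime p < 2^32 (assumption of this sketch). */
uint32_t pow2_minus1_mod(uint64_t n, uint32_t p) {
    uint64_t result = 1;
    uint64_t pw = 2 % p;
    n %= (uint64_t)p - 1; /* Fermat: 2^(p-1) == 1 (mod p) */
    while (n) {
        if (n & 1)
            result = result * pw % p; /* multiply in the current power */
        pw = pw * pw % p;             /* square for the next bit */
        n >>= 1;
    }
    return (uint32_t)((result + p - 1) % p); /* subtract 1 without going negative */
}
```

With p = 7 and n = 3 this returns 0, since 2^3 - 1 = 7.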
I came across the answer that I am posting here, when solving one of the mathematical problems on HackerRank, and it has worked for all the given test cases given there.
If you restrict n and p to 64 bit (unsigned long) values, then here is the mathematical approach :
2^n - 1 can be written as 1*[ (2^n - 1)/(2 - 1) ]
If you look at this carefully, this is the sum of the GP 1 + 2 + 4 + .. + 2^(n-1)
And voila, we know that (a+b)%m = ( (a%m) + (b%m) )%m
If you have a confusion whether the above relation is true or not for addition, you can google for it or you can check this link : http://www.inf.ed.ac.uk/teaching/courses/dmmr/slides/13-14/Ch4.pdf
So, now we can apply the above mentioned relation to our GP, and you would have your answer!!
That is,
(2^n - 1)%p is equivalent to ( 1 + 2 + 4 + .. + 2^(n-1) )%p and now apply the given relation.
First, focus on 2^n mod p because you can always subtract one at the end.
Consider the powers of two. This is a sequence of numbers produced by repeatedly multiplying by two.
Consider the modulo operation. If the number is written in base p, you're just grabbing the last digit. Higher digits can be thrown away.
So at some point(s) in the sequence, you get a two-digit number (a 1 in the p's place), and your task is really just to get rid of the first digit (subtract p) when that happens.
Stopping here conceptually, the brute-force approach would be something like this:
uint64_t exp2modp( uint64_t n, uint64_t p ) {
uint64_t ret = 1;
uint64_t limit = p - p / 2; // smallest ret for which 2 * ret >= p
n %= p - 1; // Apply Fermat's Little Theorem (p an odd prime): 2^(p-1) == 1 (mod p).
while ( n -- ) {
if ( ret >= limit ) {
ret *= 2;
ret -= p;
} else {
ret *= 2;
}
}
return ret;
}
Unfortunately, this still takes forever for large n and p, and I can't think of any better number theory offhand.
If you have a multiplication facility which can compute (p-1)^2 without overflow, then you can use an analogous algorithm using repeated squaring with a modulo after each square operation, and then take the product of the series of square residuals, again with a modulo after each multiplication.
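If no wide multiplication is available, one possible substitute is a shift-and-add modular multiply, which never needs more than 64 bits. The sketch below combines it with repeated squaring; the helper names addmod, mulmod and pow2mod are hypothetical, and this is a different (slower) multiply than the overflow-free one the paragraph above presumes.

```c
#include <stdint.h>

/* (a + b) mod p for a, b < p, without overflowing 64 bits. */
static uint64_t addmod(uint64_t a, uint64_t b, uint64_t p) {
    return (a >= p - b) ? a - (p - b) : a + b;
}

/* (a * b) mod p by binary shift-and-add. */
static uint64_t mulmod(uint64_t a, uint64_t b, uint64_t p) {
    uint64_t r = 0;
    a %= p;
    b %= p;
    while (b) {
        if (b & 1) r = addmod(r, a, p);
        a = addmod(a, a, p); /* double */
        b >>= 1;
    }
    return r;
}

/* 2^n mod p by repeated squaring, reducing after every multiplication. */
uint64_t pow2mod(uint64_t n, uint64_t p) {
    uint64_t result = 1 % p;
    uint64_t base = 2 % p;
    while (n) {
        if (n & 1) result = mulmod(result, base, p);
        base = mulmod(base, base, p);
        n >>= 1;
    }
    return result;
}
```

This takes on the order of 64 * 64 additions per call instead of n doublings, so it stays fast even when n and p use all 64 bits.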
step 1. x= shifting 1 n times and then subtract 1
step 2.result = logical and operation of x and p

bitwise division by multiples of 2

I found many posts about bitwise division and I completely understand most bitwise usage but I can't think of a specific division. I want to divide a given number (lets say 100) with all the multiples of 2 possible (ATTENTION: I don't want to divide with powers of 2 bit multiples!)
For example: 100/2, 100/4, 100/6, 100/8, 100/10...100/100
Also I know that because of using unsigned int the answers will be rounded, for example 100/52=1, but it doesn't really matter, because I can either skip those answers or print them, no problem. My concern is mostly how I can divide by 6 or 10, etc. (multiples of 2). There is no need for it to be done in C, because I can manage to transform any code you give me from Java to C.
Following the math shown for the accepted solution to the division by 3 question, you can derive a recurrence for the division algorithm:
To compute (int)(X / Y)
Let k be such that 2^k ≥ Y and 2^(k-1) < Y
(note, 2^k = (1 << k))
Let d = 2^k - Y
Then, if A = (int)(X / 2^k) and B = X % 2^k,
X = (1 << k) * A + B
= (1 << k) * A - Y * A + Y * A + B
= d * A + Y * A + B
= Y * A + (d * A + B)
Thus,
X/Y = A + (d * A + B)/Y
In other words,
If S(X, Y) := X/Y, then S(X, Y) := A + S(d * A + B, Y).
This recurrence can be implemented with a simple loop. The stopping condition for the loop is when the numerator falls below 2^k. The function divu implements the recurrence, using only bitwise operators and using unsigned types. Helper functions for the math operations are left unimplemented, but shouldn't be too hard (the linked answer provides a full add implementation already). The rs() function is for "right-shift", which does sign extension on the unsigned input. The function div is the actual API for int, and checks for divide by zero and negative y before delegating to divu. negate does 2's complement negation.
static unsigned divu (unsigned x, unsigned y) {
unsigned k = 0;
unsigned pow2 = 0;
unsigned mask = 0;
unsigned diff = 0;
unsigned sum = 0;
while ((1u << k) < y) k = add(k, 1);
pow2 = (1u << k);
mask = sub(pow2, 1);
diff = sub(pow2, y);
while (x >= pow2) {
sum = add(sum, rs(x, k));
x = add(mul(diff, rs(x, k)), (x & mask));
}
if (x >= y) sum = add(sum, 1);
return sum;
}
int div (int x, int y) {
assert(y);
if (y > 0) return divu(x, y);
return negate(divu(x, negate(y)));
}
This implementation depends on signed int using 2's complement. For maximal portability, div should convert negative arguments to 2's complement before calling divu. Then, it should convert the result from divu back from 2's complement to the native signed representation.
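To see the recurrence at work without writing the helper functions, here is a sketch that uses the ordinary +, * and >> operators in their place. It is meant only to illustrate S(X, Y) = A + S(d * A + B, Y), not to be a purely bitwise implementation; the name div_recur is made up, and 64-bit intermediates are an assumption added so that 2^k and d * A cannot overflow.

```c
#include <stdint.h>

/* floor(x / y) for y != 0, following S(X, Y) = A + S(d*A + B, Y). */
uint32_t div_recur(uint32_t x, uint32_t y) {
    unsigned k = 0;
    while (k < 32 && ((uint64_t)1 << k) < y) ++k; /* smallest k with 2^k >= y */
    uint64_t pow2 = (uint64_t)1 << k;
    uint64_t mask = pow2 - 1;
    uint64_t d = pow2 - y;
    uint64_t num = x;
    uint64_t sum = 0;
    while (num >= pow2) {
        uint64_t a = num >> k;      /* A = num / 2^k */
        sum += a;
        num = d * a + (num & mask); /* d*A + B, strictly smaller than before */
    }
    if (num >= y) ++sum;            /* remaining numerator is below 2*y */
    return (uint32_t)sum;
}
```

For 100 / 6 the loop uses k = 3 and d = 2, and the numerator shrinks 100, 28, 10, 4 while the running sum grows 12, 15, 16.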
The following code works for positive numbers. When the dividend or the divisor or both are negative, have flags to change the sign of the answer appropriately.
int divi(long long m, long long n)
{
if(m==0 || n==0 || m<n)
return 0;
long long a,b;
int f=0;
a=n;b=1;
while(a<=m)
{
b = b<<1;
a = a<<1;
f=1;
}
if(f)
{
b = b>>1;
a = a>>1;
}
b = b + divi(m-a,n);
return b;
}
Use the operator / for integer division as much as you can.
For instance, when you want to divide 100 by 6 or 10 you should write 100/6 or 100/10.
When you mention bitwise division, do you mean (1) an implementation of operator / or (2) division by a power of two?
For (1) a processor should have an integer division unit. If not, the compiler should provide a good implementation.
For (2) you can use 100>>2 instead of 100/4. If the divisor is a power of two known at compile time, then a good compiler should automatically use the shift instruction.

Explain this snippet which finds the maximum of two integers without using if-else or any other comparison operator?

Find the maximum of two numbers. You should not use if-else or any other comparison operator. I found this question on online bulletin board, so i thought i should ask in StackOverflow
EXAMPLE
Input: 5, 10
Output: 10
I found this solution, can someone help me understand these lines of code
int getMax(int a, int b) {
int c = a - b;
int k = (c >> 31) & 0x1;
int max = a - k * c;
return max;
}
Let's dissect this. This first line appears to be straightforward - it stores the difference of a and b. This value is negative if a < b and is nonnegative otherwise. But there's actually a bug here - if the difference of the numbers a and b is so big that it can't fit into an integer, this will lead to undefined behavior - oops! So let's assume that doesn't happen here.
In the next line, which is
int k = (c >> 31) & 0x1;
the idea is to check if the value of c is negative. In virtually all modern computers, numbers are stored in a format called two's complement in which the highest bit of the number is 0 if the number is non-negative and 1 if the number is negative. Moreover, most ints are 32 bits. (c >> 31) shifts the number down 31 bits, leaving the highest bit of the number in the spot for the lowest bit. The next step of taking this number and ANDing it with 1 (whose binary representation is 0 everywhere except the last bit) erases all the higher bits and just gives you the lowest bit. Since the lowest bit of c >> 31 is the highest bit of c, this reads the highest bit of c as either 0 or 1. Since the highest bit is 1 iff c is negative, this is a way of checking whether c is negative (1) or non-negative (0). Combining this reasoning with the above, k is 1 if a < b and is 0 otherwise.
The final step is to do this:
int max = a - k * c;
If a < b, then k == 1 and k * c = c = a - b, and so
a - k * c = a - (a - b) = a - a + b = b
Which is the correct max, since a < b. Otherwise, if a >= b, then k == 0 and
a - k * c = a - 0 = a
Which is also the correct max.
Here we go: (a + b) / 2 + |a - b| / 2
Use bitwise hacks
r = x ^ ((x ^ y) & -(x < y)); // max(x, y)
If you know that INT_MIN <= x - y <= INT_MAX, then you can use the following, which is faster because (x - y) only needs to be evaluated once.
r = x - ((x - y) & ((x - y) >> (sizeof(int) * CHAR_BIT - 1))); // max(x, y)
Source : Bit Twiddling Hacks by Sean Eron Anderson
(sqrt( a*a + b*b - 2*a*b ) + a + b) / 2
This is based on the same technique as mike.dld's solution, but it is less "obvious" here what I am doing. An "abs" operation looks like you are comparing the sign of something but I here am taking advantage of the fact that sqrt() will always return you the positive square root so I am squaring (a-b) writing it out in full then square-rooting it again, adding a+b and dividing by 2.
You will see it always works: eg the user's example of 10 and 5 you get sqrt(100 + 25 - 100) = 5 then add 10 and 5 gives you 20 and divide by 2 gives you 10.
If we use 9 and 11 as our numbers we would get (sqrt(121 + 81 - 198) + 11 + 9)/2 = (sqrt(4) + 20) / 2 = 22/2 = 11
The simplest answer is below.
#include <math.h>
int Max(int x, int y)
{
/* fabs, not abs: the integer abs() would truncate the .5 of an odd difference */
return (float)(x + y) / 2.0 + fabs((float)(x - y) / 2);
}
int Min(int x, int y)
{
return (float)(x + y) / 2.0 - fabs((float)(x - y) / 2);
}
int max(int i, int j) {
int m = ((i-j) >> 31);
return (m & j) + ((~m) & i);
}
This solution avoids multiplication.
m will either be 0x00000000 or 0xffffffff
Using the shifting idea to extract the sign as posted by others, here's another way:
max (a, b) = new[] { a, b } [((a - b) >> 31) & 1]
This pushes the two numbers into an array with the maximum number given by the array-element whose index is sign bit of the difference between the two numbers.
Do note that:
The difference (a - b) may overflow.
If the numbers are unsigned and the >> operator refers to a logical right-shift, the & 1 is unnecessary.
Here's how I think I'd do the job. It's not as readable as you might like, but when you start with "how do I do X without using the obvious way of doing X", you have to kind of expect that.
In theory, this gives up some portability too, but you'd have to find a pretty unusual system to see a problem.
#define BITS (CHAR_BIT * sizeof(int) - 1)
int findmax(int a, int b) {
int rets[] = {a, b};
return rets[unsigned(a-b)>>BITS];
}
This does have some advantages over the one shown in the question. First of all, it calculates the correct size of shift, instead of being hard-coded for 32-bit ints. Second, with most compilers we can expect all the multiplication to happen at compile time, so all that's left at run time is trivial bit manipulation (subtract and shift) followed by a load and return. In short, this is almost certain to be pretty fast, even on the smallest microcontroller, where the original used multiplication that had to happen at run-time, so while it's probably pretty fast on a desktop machine, it'll often be quite a bit slower on a small microcontroller.
Here's what those lines are doing:
c is a-b. if c is negative, a<b.
k is 32nd bit of c which is the sign bit of c (assuming 32 bit integers. If done on a platform with 64 bit integers, this code will not work). It's shifted 31 bits to the right to remove the rightmost 31 bits leaving the sign bit in the right most place and then anding it with 1 to remove all the bits to the left (which will be filled with 1s if c is negative). So k will be 1 if c is negative and 0 if c is positive.
Then max = a - k * c. If k is 0, this means a>=b, so max is a - 0 * c = a. If k is 1, this means that a<b and then a - 1 * c = a - (a - b) = a - a + b = b.
Overall, it's just using the sign bit of the difference to avoid using greater than or less than operations. It's honestly a little silly to say that this code doesn't use a comparison. c is the result of comparing a and b. The code just doesn't use a comparison operator. You could do a similar thing in many assembly codes by just subtracting the numbers and then jumping based on the values set in the status register.
I should also add that all of these solutions are assuming that the two numbers are integers. If they are floats, doubles, or something more complicated (BigInts, Rational numbers, etc.) then you really have to use a comparison operator. Bit-tricks will not generally do for those.
getMax() Function Without Any Logical Operation-
int getMax(int a, int b){
return (a+b+((a-b)>>sizeof(int)*8-1|1)*(a-b))/2;
}
Explanation:
Lets smash the 'max' into pieces,
max
= ( max + max ) / 2
= ( max + (min+differenceOfMaxMin) ) / 2
= ( max + min + differenceOfMaxMin ) / 2
= ( max + min + | max - min | ) / 2
So the function should look like this-
getMax(a, b)
= ( a + b + absolute(a - b) ) / 2
Now,
absolute(x)
= x [if 'x' is positive] or -x [if 'x' is negative]
= x * ( 1 [if 'x' is positive] or -1 [if 'x' is negative] )
In integer positive number the first bit (sign bit) is- 0; in negative it is- 1. By shifting bits to the right (>>) the first bit can be captured.
During right shift the empty space is filled by the sign bit. So 01110001 >> 2 = 00011100, while 10110001 >> 2 = 11101100.
As a result, shifting an 8-bit number right by 7 bits produces either 11111111 (for a negative number) or 00000000 (for a non-negative one).
Now, if OR operation is performed with 00000001 (= 1), negative number yields- 11111111 (= -1), and positive- 00000001 (= 1).
So,
absolute(x)
= x * ( 1 [if 'x' is positive] or -1 [if 'x' is negative] )
= x * ( ( x >> (numberOfBitsInInteger-1) ) | 1 )
= x * ( ( x >> ((numberOfBytesInInteger*bitsInOneByte) - 1) ) | 1 )
= x * ( ( x >> ((sizeOf(int)*8) - 1) ) | 1 )
Finally,
getMax(a, b)
= ( a + b + absolute(a - b) ) / 2
= ( a + b + ((a-b) * ( ( (a-b) >> ((sizeOf(int)*8) - 1) ) | 1 )) ) / 2
Another way-
int getMax(int a, int b){
int i[] = {a, b};
return i[( (i[0]-i[1]) >> (sizeof(int)*8 - 1) ) & 1 ];
}
static int mymax(int a, int b)
{
int[] arr;
arr = new int[3];
arr[0] = b;
arr[1] = a;
arr[2] = a;
return arr[Math.Sign(a - b) + 1];
}
If b > a then (a-b) will be negative, sign will return -1, by adding 1 we get index 0 which is b, if b=a then a-b will be 0, +1 will give 1 index so it does not matter if we are returning a or b, when a > b then a-b will be positive and sign will return 1, adding 1 we get index 2 where a is stored.
#include<stdio.h>
#include<stdlib.h> /* abs */
int main(void)
{
int num1,num2,diff;
printf("Enter number 1 : ");
scanf("%d",&num1);
printf("Enter number 2 : ");
scanf("%d",&num2);
diff=num1-num2;
num1=abs(diff);
num2=num1+diff;
if(num1==num2)
printf("Both number are equal\n");
else if(num2==0)
printf("Num2 > Num1\n");
else
printf("Num1 > Num2\n");
}
The code which I am providing is for finding maximum between two numbers, the numbers can be of any data type(integer, floating). If the input numbers are equal then the function returns the number.
double findmax(double a, double b)
{
//find the difference of the two numbers
double diff=a-b;
double temp_diff=diff;
int int_diff=temp_diff;
/*
For the floating point numbers the difference contains decimal
values (for example 0.0009, 2.63 etc.) if the left side of '.' contains 0 then we need
to get a non-zero number on the left side of '.'
*/
while ( (!(int_diff|0)) && ((temp_diff-int_diff)||(0.0)) )
{
temp_diff = temp_diff * 10;
int_diff = temp_diff;
}
/*
shift the sign bit of variable 'int_diff' to the LSB position and find if it is
1(difference is -ve) or 0(difference is +ve) , then multiply it with the difference of
the two numbers (variable 'diff') then subtract it with the variable a.
*/
return a- (diff * ( int_diff >> (sizeof(int) * 8 - 1 ) & 1 ));
}
Description
The first thing the function takes the arguments as double and has return type as double. The reason for this is that to create a single function which can find maximum for all types. When integer type numbers are provided or one is an integer and other is the floating point then also due to implicit conversion the function can be used to find the max for integers also.
The basic logic is simple, let's say we have two numbers a & b if a-b>0(i.e. the difference is positive) then a is maximum else if a-b==0 then both are equal and if a-b<0(i.e. diff is -ve) b is maximum.
The sign bit is saved as the Most Significant Bit (MSB) in memory. If the MSB is 1 the number is negative, and if it is 0 the number is non-negative. To check if the MSB is 1 or 0 we shift the MSB to the LSB position and bitwise & with 1; if the result is 1 then the number is -ve, else the number is +ve. This result is obtained by the statement:
int_diff >> (sizeof(int) * 8 - 1 ) & 1
Here to get the sign bit from the MSB to LSB we right shift it to k-1 bits(where k is the number of bits needed to save an integer number in the memory which depends on the type of system). Here k= sizeof(int) * 8 as sizeof() gives the number of bytes needed to save an integer to get no. of bits, we multiply it with 8. After the right shift, we apply the bitwise & with 1 to get the result.
Now after obtaining the result(let us assume it as r) as 1(for -ve diff) and 0(for +ve diff) we multiply the result with the difference of the two numbers, the logic is given as follows:
if a>b then a-b>0 i.e., is +ve so the result is 0(i.e., r=0). So a-(a-b)*r => a-(a-b)*0, which gives 'a' as the maximum.
if a < b then a-b<0 i.e., is -ve so the result is 1(i.e., r=1). So a-(a-b)*r => a-(a-b)*1 => a-a+b =>b , which gives 'b' as the maximum.
Now there are two remaining points 1. the use of while loop and 2. why I have used the variable 'int_diff' as an integer. To answer these properly we have to understand some points:
Floating type values cannot be used as an operand for the bitwise operators.
Due to above reason, we need to get the value in an integer value to get the sign of difference by using bitwise operators. These two points describe the need of variable 'int_diff' as integer type.
Now let's say we find the difference in variable 'diff' now there are 3 possibilities for the values of 'diff' irrespective of the sign of these values. (a). |diff|>=1 , (b). 0<|diff|<1 , (c). |diff|==0.
When we assign a double value to integer variable the decimal part is lost.
For case(a) the value of 'int_diff' >0 (i.e.,1,2,...). For other two cases int_diff=0.
The condition (temp_diff-int_diff)||0.0 checks if diff==0 so both numbers are equal.
If diff!=0 then we check if int_diff|0 is true i.e., case(b) is true
In the while loop, we try to get the value of int_diff as non-zero so that the value of int_diff also gets the sign of diff.
Here are a couple of bit-twiddling methods to get the max of two integral values:
Method 1
int max1(int a, int b) {
static const size_t SIGN_BIT_SHIFT = sizeof(a) * 8 - 1;
int mask = (a - b) >> SIGN_BIT_SHIFT;
return (a & ~mask) | (b & mask);
}
Explanation:
(a - b) >> SIGN_BIT_SHIFT - If a >= b then a - b is non-negative, thus the sign bit is 0, and the mask is 0x00..00. Otherwise, a < b so a - b is negative, the sign bit is 1 and after shifting, we get a mask of 0xFF..FF
(a & ~mask) - If the mask is 0xFF..FF, then ~mask is 0x00..00 and then this value is 0. Otherwise, ~mask is 0xFF..FF and the value is a
(b & mask) - If the mask is 0xFF..FF, then this value is b. Otherwise, mask is 0x00..00 and the value is 0.
Finally:
If a >= b then a - b is positive, we get max = a | 0 = a
If a < b then a - b is negative, we get max = 0 | b = b
Method 2
int max2(int a, int b) {
static const size_t SIGN_BIT_SHIFT = sizeof(a) * 8 - 1;
int mask = (a - b) >> SIGN_BIT_SHIFT;
return a ^ ((a ^ b) & mask);
}
Explanation:
Mask explanation is the same as for Method 1. If a > b the mask is 0x00..00, otherwise the mask is 0xFF..FF
If the mask is 0x00..00, then (a ^ b) & mask is 0x00..00
If the mask is 0xFF..FF, then (a ^ b) & mask is a ^ b
Finally:
If a >= b, we get a ^ 0x00..00 = a
If a < b, we get a ^ a ^ b = b
//In C# you can use math library to perform min or max function
using System;
class NumberComparator
{
static void Main()
{
Console.Write(" write the first number to compare: ");
double first_Number = double.Parse(Console.ReadLine());
Console.Write(" write the second number to compare: ");
double second_Number = double.Parse(Console.ReadLine());
double compare_Numbers = Math.Max(first_Number, second_Number);
Console.Write("{0} is greater",compare_Numbers);
}
}
No logical operators, no libs (JS)
function (x, y) {
let z = (x - y) ** 2;
z = z ** .5;
return (x + y + z) / 2
}
The logic described in a problem can be explained as if 1st number is smaller then 0 will be subtracted else difference will be subtracted from 1st number to get 2nd number.
I found one more mathematical solution which I think is bit simpler to understand this concept.
Considering a and b as given numbers
c=|a/b|+1;
d=(c-1)/b;
smallest number= a - d*(a-b);
Again, the idea is to find k which is either 0 or 1 and multiply it with the difference of the two numbers. Finally, this number is subtracted from the 1st number to yield the smaller of the two numbers.
P.S. this solution will fail in case 2nd number is zero
There is one way
public static int Min(int a, int b)
{
int dif = (int)(((uint)(a - b)) >> 31);
return a * dif + b * (1 - dif);
}
and one
return (a>=b)?b:a;
int a=151;
int b=121;
int k=Math.abs(a-b);
int j= a+b;
double k1=(double)(k);
double j1= (double) (j);
double c=Math.ceil(k1/2) + Math.floor(j1/2);
int c1= (int) (c);
System.out.println(" Max value = " + c1);
Guess we can just multiply the numbers with their bitwise comparisons eg:
int max=(a>b)*a+(a<=b)*b;
