I found many posts about bitwise division, and I understand most bitwise usage, but I can't work out one specific kind of division. I want to divide a given number (let's say 100) by every possible multiple of 2 (ATTENTION: I don't mean only powers of 2, I mean all multiples of 2!)
For example: 100/2, 100/4, 100/6, 100/8, 100/10...100/100
Also, I know that because I'm using unsigned int the answers will be truncated, for example 100/52=1, but it doesn't really matter, because I can either skip those answers or print them, no problem. My concern is mostly how I can divide by 6 or 10, etc. (multiples of 2). There is no need for it to be done in C, because I can manage to transform any code you give me from Java to C.
Following the math shown for the accepted solution to the division by 3 question, you can derive a recurrence for the division algorithm:
To compute (int)(X / Y)
Let k be such that 2^k ≥ Y and 2^(k-1) < Y
(note, 2^k = (1 << k))
Let d = 2^k - Y
Then, if A = (int)(X / 2^k) and B = X % 2^k,
X = (1 << k) * A + B
= (1 << k) * A - Y * A + Y * A + B
= d * A + Y * A + B
= Y * A + (d * A + B)
Thus,
X/Y = A + (d * A + B)/Y
In other words,
If S(X, Y) := X/Y, then S(X, Y) := A + S(d * A + B, Y).
This recurrence can be implemented with a simple loop. The stopping condition for the loop is when the numerator falls below 2^k. The function divu implements the recurrence, using only bitwise operators and using unsigned types. Helper functions for the math operations are left unimplemented, but shouldn't be too hard (the linked answer provides a full add implementation already). The rs() function is for "right-shift", which does sign extension on the unsigned input. The function div is the actual API for int, and checks for divide by zero and negative y before delegating to divu. negate does 2's complement negation.
static unsigned divu (unsigned x, unsigned y) {
unsigned k = 0;
unsigned pow2 = 0;
unsigned mask = 0;
unsigned diff = 0;
unsigned sum = 0;
while ((1u << k) < y) k = add(k, 1);
pow2 = (1u << k);
mask = sub(pow2, 1);
diff = sub(pow2, y);
while (x >= pow2) {
sum = add(sum, rs(x, k));
x = add(mul(diff, rs(x, k)), (x & mask));
}
if (x >= y) sum = add(sum, 1);
return sum;
}
int div (int x, int y) {
assert(y);
if (y > 0) return divu(x, y);
return negate(divu(x, negate(y)));
}
This implementation depends on signed int using 2's complement. For maximal portability, div should convert negative arguments to 2's complement before calling divu. Then, it should convert the result from divu back from 2's complement to the native signed representation.
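Since the helper functions above are left unimplemented, here is a minimal sketch of the same recurrence with the helpers swapped for ordinary +, - and * so it can be compiled and checked as-is. The name divu_plain is made up for this illustration, and the sketch assumes 1 <= y <= 2^31 so that (1u << k) does not overflow a 32-bit unsigned.

```c
/* Unsigned division by the recurrence S(X, Y) = A + S(d*A + B, Y).
   Hypothetical name; assumes 1 <= y <= 2^31 (32-bit unsigned). */
unsigned divu_plain(unsigned x, unsigned y) {
    unsigned k = 0;
    while ((1u << k) < y) k++;          /* smallest k with 2^k >= y */
    unsigned pow2 = 1u << k;
    unsigned mask = pow2 - 1;           /* x & mask == x % 2^k */
    unsigned diff = pow2 - y;           /* d = 2^k - y */
    unsigned sum = 0;
    while (x >= pow2) {
        unsigned a = x >> k;            /* A = x / 2^k */
        sum += a;
        x = diff * a + (x & mask);      /* next numerator: d*A + B */
    }
    if (x >= y) sum++;                  /* at most one more y fits */
    return sum;
}
```

For the divisors in the question, divu_plain(100, 6) gives 16 and divu_plain(100, 10) gives 10.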
The following code works for positive numbers. When the dividend or the divisor or both are negative, have flags to change the sign of the answer appropriately.
int divi(long long m, long long n)
{
if(m==0 || n==0 || m<n)
return 0;
long long a,b;
int f=0;
a=n;b=1;
while(a<=m)
{
b = b<<1;
a = a<<1;
f=1;
}
if(f)
{
b = b>>1;
a = a>>1;
}
b = b + divi(m-a,n);
return b;
}
Use the operator / for integer division as much as you can.
For instance, when you want to divide 100 by 6 or 10 you should write 100/6 or 100/10.
When you mention bitwise division, do you mean (1) an implementation of operator /, or (2) division by a power of two?
For (1), a processor should have an integer division unit. If not, the compiler should provide a good implementation.
For (2), you can use 100>>2 instead of 100/4. If the denominator is known at compile time, then a good compiler should automatically use the shift instruction.
How do we write a program in C which can calculate an average of 2 16 bit signed numbers on a 16 bit processor.
int getAverage(int x, int y)
{
int result=0;
result = ((x+y)/2);
return result;
}
The above works for most cases, except when x + y overflows, for example when both x and y are at the maximum value (32767 for a signed 16-bit int).
If x and y have the same sign (both positive or both negative), I would divide the difference between the numbers by 2 and add that result to the number that was subtracted. Mathematically, this is equivalent to what you currently have:
(y - x)/2 + x = y/2 - x/2 + x = y/2 + x/2 = (x + y)/2
If x is positive and y is negative or vice versa, the original method of calculation that you have should be used.
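The two cases described above can be sketched as a single function. This is only an illustration under the assumption that int16_t is available; the name average16 is made up. Note that the two branches can round differently for odd sums, because integer division truncates toward zero.

```c
#include <stdint.h>

/* Average of two 16-bit signed values without overflowing 16-bit arithmetic.
   Hypothetical helper; rounding follows C's truncating division. */
int16_t average16(int16_t x, int16_t y) {
    if ((x < 0) == (y < 0)) {
        /* Same sign: the difference y - x always fits in 16 bits. */
        return (int16_t)(x + (y - x) / 2);
    }
    /* Opposite signs: the sum x + y always fits in 16 bits. */
    return (int16_t)((x + y) / 2);
}
```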
Simplest possible solution with some crude integer rounding:
int32_t getAverage (int16_t x, int16_t y)
{
int32_t sum = (int32_t)x + (int32_t)y;
return sum/2 + sum%2;
}
This will work just fine since your 16 bit compiler will have software routines to handle 32 bit integers.
Let's say I have a double a = 0.3;. How would I be able to change the exponent of the variable without using math functions like pow(), or multiplying it manually?
I am guessing I would have to access the memory address of the variable using pointers, find the exponent and change it manually. But how would I accomplish this?
Note, that this is on a 8-bit system, and I am trying to find a faster way to multiply the number by 10^12, 10^9, 10^6 or 10^3.
Best regards!
Note that a*10^3 = a*1000 = a*1024 - a*16 - a*8 = a*2^10 - a*2^4 - a*2^3.
So you can calculate a*10^3 as follows:
Read the 11 exponent bits into int exp
Read the 52 fraction bits into double frac
Calculate double x with exp+10 as the exponent and frac as the fraction
Calculate double y with exp+4 as the exponent and frac as the fraction
Calculate double z with exp+3 as the exponent and frac as the fraction
Calculate the output as x-y-z, and don't forget to add the sign bit if a < 0
You can use a similar method for the other options (a*10^6, a*10^9 and a*10^12)...
Here is how you can do the whole thing in a "clean" manner:
double MulBy1000(double a)
{
double x = a;
double y = a;
double z = a;
unsigned long long* px = (unsigned long long*)&x;
unsigned long long* py = (unsigned long long*)&y;
unsigned long long* pz = (unsigned long long*)&z;
*px += 10ULL << 52;
*py += 4ULL << 52;
*pz += 3ULL << 52;
return x - y - z;
}
Please note that I'm not sure whether or not this code breaks strict-aliasing rules.
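Reading a double through an unsigned long long pointer is generally treated as a strict-aliasing violation, and copying the bytes with memcpy is the usual way to avoid the question. The sketch below does the same exponent arithmetic that way. The names scale_by_pow2 and mul_by_1000_memcpy are made up, and the code assumes an IEEE 754 64-bit double that is normal, nonzero, and far enough from overflow that adding to the exponent field is safe.

```c
#include <stdint.h>
#include <string.h>

/* Multiplies a by 2^n by adding n to the exponent field.
   Assumes IEEE 754 binary64 and a normal, nonzero, non-overflowing value. */
double scale_by_pow2(double a, unsigned n) {
    uint64_t bits;
    memcpy(&bits, &a, sizeof bits);     /* copy the representation out */
    bits += (uint64_t)n << 52;          /* exponent field starts at bit 52 */
    memcpy(&a, &bits, sizeof a);        /* copy it back */
    return a;
}

/* a*1000 = a*1024 - a*16 - a*8 */
double mul_by_1000_memcpy(double a) {
    return scale_by_pow2(a, 10) - scale_by_pow2(a, 4) - scale_by_pow2(a, 3);
}
```

Zero is a case this sketch does not handle: its exponent field is 0, so adding to it produces a nonzero result.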
Multiplying a number by 10 is the equivalent of
a) Multiplying the original number by 2
b) Multiplying the original number by 8
c) Adding the results of (a) and (b).
This works because 10 is binary 1010.
One approach would therefore be to increment the exponent (for (a)), add 3 to the exponent (for (b)), then add the results.
To multiply by 10^n, repeat the above n times. Alternatively, work out the binary representation of 1,000, 1,000,000, etc., and add the relevant 1s. You may make things easier by noting that 1000, for instance, is 1024 - 16 - 8, i.e.
a) Add 10 to the exponent of the original to multiply by 1024
b) Add 4 to the exponent of the original to multiply by 16
c) Add 3 to the exponent of the original to multiply by 8
d) From (a) subtract (b) and (c) to get the answer.
Again, you can do that multiple times for 10^6, 10^9 etc.
For a quick approximation when the power n is a multiple of 3, just add 10n/3 to the exponent (as 1024 ~= 1000)
For fun a simple recursive solution.
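/* Computes 10^power by repeated squaring. */
static double Power10(unsigned power) {
if (power == 0) return 1.0;
double y = Power10(power/2);
y = y*y;
if (power%2) y *= 10.0;
return y;
}
/* Squaring must be applied to the power of 10 only, not to x itself. */
double ScalePower10(double x, unsigned power) {
return x * Power10(power);
}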
I was looking at another question (here) where someone was looking for a way to get the square root of a 64 bit integer in x86 assembly.
This turns out to be very simple. The solution is to convert to a floating point number, calculate the sqrt and then convert back.
I need to do something very similar in C however when I look into equivalents I'm getting a little stuck. I can only find a sqrt function which takes in doubles. Doubles do not have the precision to store large 64bit integers without introducing significant rounding error.
Is there a common math library that I can use which has a long double sqrt function?
There is no need for long double; the square root can be calculated with double (if it is IEEE-754 64-bit binary). The rounding error in converting a 64-bit integer to double is nearly irrelevant in this problem.
The rounding error is at most one part in 2^53. This causes an error in the square root of at most one part in 2^54. The sqrt itself has a rounding error of less than one part in 2^53, due to rounding the mathematical result to the double format. The sum of these errors is tiny; the largest possible square root of a 64-bit integer (rounded to 53 bits) is 2^32, so an error of three parts in 2^54 is less than .00000072.
For a uint64_t x, consider sqrt(x). We know this value is within .00000072 of the exact square root of x, but we do not know its direction. If we adjust it to sqrt(x) - 0x1p-20, then we know we have a value that is less than, but very close to, the square root of x.
Then this code calculates the square root of x, truncated to an integer, provided the operations conform to IEEE 754:
uint64_t y = sqrt(x) - 0x1p-20;
if (2*y < x - y*y)
++y;
(2*y < x - y*y is equivalent to (y+1)*(y+1) <= x except that it avoids wrapping the 64-bit integer if y+1 is 2^32.)
Function sqrtl(), taking a long double, is part of C99.
Note that your compilation platform does not have to implement long double as 80-bit extended-precision. It is only required to be as wide as double, and Visual Studio implements it as a plain double. GCC and Clang do compile long double to 80-bit extended-precision on Intel processors.
Yes, the standard library has sqrtl() (since C99).
If you only want to calculate sqrt for integers, using divide and conquer should find the result in max 32 iterations:
uint64_t mysqrt (uint64_t a)
{
uint64_t min=0;
//uint64_t max=1<<32;
uint64_t max=((uint64_t) 1) << 32; //chux' bugfix
while(1)
{
if (max <= 1 + min)
return min;
uint64_t sqt = min + (max - min)/2;
uint64_t sq = sqt*sqt;
if (sq == a)
return sqt;
if (sq > a)
max = sqt;
else
min = sqt;
}
}
Debugging is left as exercise for the reader.
Here we collect several observations in order to arrive at a solution:
1. In standard C >= 1999, it is guaranteed that non-negative integers have a representation in bits as one would expect for any base-2 number.
----> Hence, we can trust bit manipulation of this type of number.
2. If x is an unsigned integer type, then x >> 1 == x / 2 and x << 1 == x * 2.
(!) But: It is very probable that bit operations will be done faster than their arithmetical counterparts.
3. sqrt(x) is mathematically equivalent to exp(log(x)/2.0).
4. If we consider truncated logarithms and base-2 exponentials for integers, we can obtain a fair estimate: IntExp2( IntLog2(x) / 2) "==" IntSqrtDn(x), where "==" is informal notation meaning almost equal to (in the sense of a good approximation).
5. If we write IntExp2( IntLog2(x) / 2 + 1) "==" IntSqrtUp(x), we obtain an "above" approximation for the integer square root.
6. The approximations obtained in (4.) and (5.) are a little rough (they enclose the true value of sqrt(x) between two consecutive powers of 2), but they can be a very good starting point for any algorithm that searches for the square root of x.
7. The Newton algorithm for square root can work well for integers, if we have a good first approximation to the real solution.
http://en.wikipedia.org/wiki/Integer_square_root
The final algorithm needs some mathematical verification to be fully sure that it always works properly, but I will not do it right now... I will show you the final program instead:
#include <stdio.h> /* For printf()... */
#include <stdint.h> /* For uintmax_t... */
#include <math.h> /* For sqrt() .... */
int IntLog2(uintmax_t n) {
if (n == 0) return -1; /* Error */
int L;
for (L = 0; n >>= 1; L++)
;
return L; /* It takes < 64 steps for long long */
}
uintmax_t IntExp2(int n) {
if (n < 0)
return 0; /* Error */
uintmax_t E;
for (E = 1; n-- > 0; E <<= 1)
;
return E; /* It takes < 64 steps for long long */
}
uintmax_t IntSqrtDn(uintmax_t n) { return IntExp2(IntLog2(n) / 2); }
uintmax_t IntSqrtUp(uintmax_t n) { return IntExp2(IntLog2(n) / 2 + 1); }
int main(void) {
uintmax_t N = 947612934; /* Try here your number! */
uintmax_t sqrtn = IntSqrtDn(N), /* 1st approx. to sqrt(N) by below */
sqrtn0 = IntSqrtUp(N); /* 1st approx. to sqrt(N) by above */
/* The following means while( abs(sqrt-sqrt0) > 1) { stuff... } */
/* However, we take care of subtractions on unsigned arithmetic, just in case... */
while ( (sqrtn > sqrtn0 + 1) || (sqrtn0 > sqrtn+1) )
sqrtn0 = sqrtn, sqrtn = (sqrtn0 + N/sqrtn0) / 2; /* Newton iteration */
printf("N==%ju, sqrt(N)==%g, IntSqrtDn(N)==%ju, IntSqrtUp(N)==%ju, sqrtn==%ju, sqrtn*sqrtn==%ju\n\n",
N, sqrt((double)N), IntSqrtDn(N), IntSqrtUp(N), sqrtn, sqrtn*sqrtn); /* %ju is the conversion for uintmax_t */
return 0;
}
The last value stored in sqrtn is the integer square root of N.
The last line of the program just shows all the values, for verification purposes.
So, you can try different values of N and see what happens.
If we add a counter inside the while-loop, we'll see that no more than a few iterations happen.
Remark: It is necessary to verify that the condition abs(sqrtn-sqrtn0)<=1 is always achieved when working in the integer-number setting. If not, we shall have to fix the algorithm.
Remark2: In the initialization sentences, observe that sqrtn0 == sqrtn * 2 == sqrtn << 1. This avoids us some calculations.
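The remarks above leave the stopping condition unverified. A commonly used alternative formulation of the integer Newton iteration, which is not the program above, starts from a power of two at or above the true root and stops as soon as the iterate stops decreasing. The sketch below follows that variant; the name isqrt_newton is made up for illustration.

```c
#include <stdint.h>

/* Integer square root (floor) by Newton's method.
   Starts from a power of two >= sqrt(n), so the iterates decrease
   monotonically until they reach floor(sqrt(n)). Hypothetical name. */
uint64_t isqrt_newton(uint64_t n) {
    if (n < 2) return n;
    int bits = 0;
    for (uint64_t t = n; t != 0; t >>= 1) bits++;   /* bits = floor(log2(n)) + 1 */
    uint64_t x = (uint64_t)1 << ((bits + 1) / 2);   /* upper estimate of sqrt(n) */
    for (;;) {
        uint64_t y = (x + n / x) / 2;               /* Newton step */
        if (y >= x) return x;                       /* no further decrease: done */
        x = y;
    }
}
```

With the value used in the program, isqrt_newton(947612934) returns 30783.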
// sqrt_i64 returns the integer square root of v.
int64_t sqrt_i64(int64_t v) {
uint64_t q = 0, b = 1, r = v;
for( b <<= 62; b > 0 && b > r; b >>= 2);
while( b > 0 ) {
uint64_t t = q + b;
q >>= 1;
if( r >= t ) {
r -= t;
q += b;
}
b >>= 2;
}
return q;
}
The for loop may be optimized by using the clz machine code instruction.
I'm looking for implementation of log() and exp() functions provided in C library <math.h>. I'm working with 8 bit microcontrollers (OKI 411 and 431). I need to calculate Mean Kinetic Temperature. The requirement is that we should be able to calculate MKT as fast as possible and with as little code memory as possible. The compiler comes with log() and exp() functions in <math.h>. But calling either function and linking with the library causes the code size to increase by 5 Kilobytes, which will not fit in one of the micro we work with (OKI 411), because our code already consumed ~12K of available ~15K code memory.
The implementation I'm looking for should not use any other C library functions (like pow(), sqrt() etc). This is because all library functions are packed in one library and even if one function is called, the linker will bring whole 5K library to code memory.
EDIT
The algorithm should be correct up to 3 decimal places.
Using Taylor series is neither the simplest nor the fastest way of doing this. Most professional implementations use approximating polynomials. I'll show you how to generate one in Maple (a computer algebra program), using the Remez algorithm.
For 3 digits of accuracy execute the following commands in Maple:
with(numapprox):
Digits := 8
minimax(ln(x), x = 1 .. 2, 4, 1, 'maxerror')
maxerror
Its response is the following polynomial:
-1.7417939 + (2.8212026 + (-1.4699568 + (0.44717955 - 0.056570851 * x) * x) * x) * x
With the maximal error of: 0.000061011436
We generated a polynomial which approximates ln(x), but only inside the [1..2] interval. Widening the interval is not wise, because that would increase the maximal error even more. Instead, use the following decomposition:
ln(y) = ln(2^n * x) = n*ln(2) + ln(x), with 1 <= x < 2
So first find the highest power of 2 that is not greater than the number (See: What is the fastest/most efficient way to find the highest set bit (msb) in an integer in C?). Its exponent n is the integer part of the base-2 logarithm. Divide by that power of 2, and the result falls into the 1..2 interval. At the end we have to add n*ln(2) to get the final result.
An example implementation for numbers >= 1:
float ln(float y) {
int log2;
float divisor, x, result;
log2 = msb((int)y); // See: https://stackoverflow.com/a/4970859/6630230
divisor = (float)(1 << log2);
x = y / divisor; // normalized value between [1.0, 2.0]
result = -1.7417939 + (2.8212026 + (-1.4699568 + (0.44717955 - 0.056570851 * x) * x) * x) * x;
result += ((float)log2) * 0.69314718; // ln(2) = 0.69314718
return result;
}
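The function above relies on an msb() helper from the linked answer. As a self-contained sketch, the helper can be replaced by a simple shift loop. The names msb_index and ln_poly are made up here, and the sketch assumes 1 <= y < 2^31.

```c
/* Index of the highest set bit, i.e. floor(log2(v)) for v >= 1.
   Hypothetical stand-in for the msb() helper of the linked answer. */
int msb_index(unsigned long v) {
    int n = 0;
    while (v >>= 1) n++;
    return n;
}

/* ln(y) for 1 <= y < 2^31 using the minimax polynomial on [1, 2]. */
float ln_poly(float y) {
    int n = msb_index((unsigned long)y);          /* integer part of log2(y) */
    float x = y / (float)(1UL << n);              /* normalized to [1.0, 2.0) */
    float r = -1.7417939f + (2.8212026f + (-1.4699568f
              + (0.44717955f - 0.056570851f * x) * x) * x) * x;
    return r + (float)n * 0.69314718f;            /* add n * ln(2) */
}
```

For example, ln_poly(10.0f) should land within about 0.0001 of 2.302585.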
Although if you plan to use it only in the [1.0, 2.0] interval, then the function is like:
float ln(float x) {
return -1.7417939 + (2.8212026 + (-1.4699568 + (0.44717955 - 0.056570851 * x) * x) * x) * x;
}
The Taylor series for e^x converges extremely quickly, and you can tune your implementation to the precision that you need. (http://en.wikipedia.org/wiki/Taylor_series)
The Taylor series for log is not as nice...
If you don't need floating-point math for anything else, you may compute an approximate fractional base-2 log pretty easily. Start by shifting your value left until it's 32768 or higher and store the number of times you did that in count. Then, repeat some number of times (depending upon your desired scale factor):
n = (mult(n,n) + 32768u) >> 16; // If a function is available for 16x16->32 multiply
count<<=1;
if (n < 32768) n*=2; else count+=1;
If the above loop is repeated 8 times, then the log base 2 of the number will be count/256. If ten times, count/1024. If eleven, count/2048. Effectively, this function works by computing the integer power-of-two logarithm of n**(2^reps), but with intermediate values scaled to avoid overflow.
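Put together, the normalization step and the squaring loop might look like the sketch below. It assumes a nonzero 16-bit input and at most 11 repetitions, and it starts count at the integer part of the logarithm (15 minus the number of left shifts), which is one interpretation of the description above. The name log2_fixed is made up.

```c
#include <stdint.h>

/* Returns log2(v) scaled by 2^reps, e.g. reps = 8 gives count/256.
   Hypothetical sketch; assumes v != 0 and reps <= 11. */
uint16_t log2_fixed(uint16_t v, unsigned reps) {
    uint32_t n = v;
    uint16_t count = 15;
    while (n < 32768u) {            /* normalize so that 32768 <= n < 65536 */
        n <<= 1;
        count--;                    /* count ends as floor(log2(v)) */
    }
    while (reps--) {
        n = (n * n + 32768u) >> 16; /* square the mantissa, rescaled */
        count <<= 1;
        if (n < 32768u) n *= 2;     /* square < 2: next fraction bit is 0 */
        else count += 1;            /* square >= 2: next fraction bit is 1 */
    }
    return count;
}
```

With 8 repetitions, log2_fixed(3, 8) comes out at 405, i.e. about 1.582 against the exact 1.585.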
Would basic table with interpolation between values approach work? If ranges of values are limited (which is likely for your case - I doubt temperature readings have huge range) and high precisions is not required it may work. Should be easy to test on normal machine.
Here is one of many topics on table representation of functions: Calculating vs. lookup tables for sine value performance?
Necromancing.
I had to implement logarithms on rational numbers.
This is how I did it:
According to Wikipedia, there is the Halley-Newton approximation method
which can be used for very-high precision.
Using Newton's method, the iteration simplifies to (implementation), which has cubic convergence to ln(x), which is way better than what the Taylor-Series offers.
// Using Newton's method, the iteration simplifies to (implementation)
// which has cubic convergence to ln(x).
public static double ln(double x, double epsilon)
{
double yn = x - 1.0d; // using the first term of the taylor series as initial-value
double yn1 = yn;
do
{
yn = yn1;
yn1 = yn + 2 * (x - System.Math.Exp(yn)) / (x + System.Math.Exp(yn));
} while (System.Math.Abs(yn - yn1) > epsilon);
return yn1;
}
This is not C, but C#, but I'm sure anybody capable to program in C will be able to deduce the C-Code from that.
Furthermore, since
logn(x) = ln(x)/ln(n).
You have therefore just implemented logN as well.
public static double log(double x, double n, double epsilon)
{
return ln(x, epsilon) / ln(n, epsilon);
}
where epsilon (error) is the minimum precision.
Now as to speed, you're probably better off using the ln-cast-in-hardware, but as I said, I used this as a base to implement logarithms on a rational numbers class working with arbitrary precision.
Arbitrary precision might be more important than speed, under certain circumstances.
Then, use the logarithmic identities for rational numbers:
logB(x/y) = logB(x) - logB(y)
In addition to Crouching Kitten's answer which gave me inspiration, you can build a pseudo-recursive (at most 1 self-call) logarithm to avoid using polynomials. In pseudo code
ln(x) :=
If (x <= 0)
return NaN
Else if (!(1 <= x < 2))
return LN2 * b + ln(a)
Else
return taylor_expansion(x - 1)
This is pretty efficient and precise since on [1; 2) the taylor series converges A LOT faster, and we get such a number 1 <= a < 2 with the first call to ln if our input is positive but not in this range.
You can find 'b' as your unbiased exponent from the data held in the float x, and 'a' from the mantissa of the float x (a is exactly the same float as x, but now with exponent biased_0 rather than exponent biased_b). LN2 should be kept as a macro in hexadecimal floating point notation IMO. You can also use http://man7.org/linux/man-pages/man3/frexp.3.html for this.
Also, the trick
unsigned long tmp = *(ulong*)(&d);
for "memory-casting" double to unsigned long, rather than "value-casting", is very useful to know when dealing with floats memory-wise, as bitwise operators will cause warnings or errors depending on the compiler.
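The pointer cast above reads a double through an integer pointer, which can run into strict-aliasing rules, and unsigned long is only 32 bits on some platforms. A hedged alternative is to copy the bytes into a uint64_t with memcpy and pull 'b' and 'a' out of the bit fields. The function names below are made up, and the sketch assumes an IEEE 754 64-bit double holding a positive, normal value.

```c
#include <stdint.h>
#include <string.h>

/* Unbiased exponent b of x, so that x = a * 2^b with 1 <= a < 2.
   Assumes IEEE 754 binary64 and a positive, normal x. */
int unbiased_exponent(double x) {
    uint64_t bits;
    memcpy(&bits, &x, sizeof bits);
    return (int)((bits >> 52) & 0x7FF) - 1023;
}

/* The same value with its exponent field reset to the bias, i.e. a in [1, 2). */
double mantissa_1_to_2(double x) {
    uint64_t bits;
    memcpy(&bits, &x, sizeof bits);
    bits = (bits & ~((uint64_t)0x7FF << 52)) | ((uint64_t)1023 << 52);
    memcpy(&x, &bits, sizeof x);
    return x;
}
```

For x = 10.0 this gives b = 3 and a = 1.25, so ln(10) = 3 * LN2 + ln(1.25).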
Possible computation of ln(x) and expo(x) in C without <math.h> :
static double expo(double n) {
int a = 0, b = n > 0;
double c = 1, d = 1, e = 1;
for (b || (n = -n); e + .00001 < (e += (d *= n) / (c *= ++a)););
// approximately 15 iterations
return b ? e : 1 / e;
}
static double native_log_computation(const double n) {
// Basic logarithm computation.
static const double euler = 2.7182818284590452354 ;
unsigned a = 0, d;
double b, c, e, f;
if (n > 0) {
for (c = n < 1 ? 1 / n : n; (c /= euler) > 1; ++a);
c = 1 / (c * euler - 1), c = c + c + 1, f = c * c, b = 0;
for (d = 1, c /= 2; e = b, b += 1 / (d * c), b - e/* > 0.0000001 */;)
d += 2, c *= f;
} else b = (n == 0) / 0.;
return n < 1 ? -(a + b) : a + b;
}
static inline double native_ln(const double n) {
// Returns the natural logarithm (base e) of N.
return native_log_computation(n) ;
}
static inline double native_log_base(const double n, const double base) {
// Returns the logarithm (base b) of N.
return native_log_computation(n) / native_log_computation(base) ;
}
Building off @Crouching Kitten's great natural log answer above, if you need it to be accurate for inputs <1 you can add a simple scaling factor. Below is an example in C++ that I've used in microcontrollers. It has a scaling factor of 256 and it's accurate for inputs down to 1/256 = ~0.004, and up to 2^32/256 = 16777216 (due to overflow of a uint32 variable).
It's interesting to note that even on an STMF103 Arm M3 with no FPU, the float implementation below is significantly faster (eg 3x or better) than the 16 bit fixed-point implementation in libfixmath (that being said, this float implementation still takes a few thousand cycles so it's still not ~fast~)
#include <float.h>
float TempSensor::Ln(float y)
{
// Algo from: https://stackoverflow.com/a/18454010
// Accurate between (1 / scaling factor) < y < (2^32 / scaling factor). Read comments below for more info on how to extend this range
float divisor, x, result;
const float LN_2 = 0.69314718; //pre calculated constant used in calculations
uint32_t log2 = 0;
//handle if input is less than zero
if (y <= 0)
{
return -FLT_MAX;
}
//scaling factor. The polynomial below is accurate when the input y>1, therefore using a scaling factor of 256 (aka 2^8) extends this to 1/256 or ~0.004. Given use of uint32_t, the input y must stay below 2^24 or 16777216 (aka 2^(32-8)), otherwise uint_y used below will overflow. Increasing the scaling factor will reduce the lower accuracy bound and also reduce the upper overflow bound. If you need the range to be wider, consider changing uint_y to a uint64_t
const uint32_t SCALING_FACTOR = 256;
const float LN_SCALING_FACTOR = 5.545177444; //this is the natural log of the scaling factor and needs to be precalculated
y = y * SCALING_FACTOR;
uint32_t uint_y = (uint32_t)y;
while (uint_y >>= 1) // Convert the number to an integer and then find the location of the MSB. This is the integer portion of Log2(y). See: https://stackoverflow.com/a/4970859/6630230
{
log2++;
}
divisor = (float)((uint32_t)1 << log2); // unsigned shift, so that log2 == 31 does not overflow a signed int
x = y / divisor; // Find the remainder value between [1.0, 2.0] then calculate the natural log of this remainder using a polynomial approximation
result = -1.7417939 + (2.8212026 + (-1.4699568 + (0.44717955 - 0.056570851 * x) * x) * x) * x; //This polynomial approximates ln(x) between [1,2]
result = result + ((float)log2) * LN_2 - LN_SCALING_FACTOR; // Using the log product rule Log(A) + Log(B) = Log(AB) and the log base change rule log_x(A) = log_y(A)/Log_y(x), calculate all the components in base e and then sum them: = Ln(x_remainder) + (log_2(x_integer) * ln(2)) - ln(SCALING_FACTOR)
return result;
}
Find the maximum of two numbers. You should not use if-else or any other comparison operator. I found this question on online bulletin board, so i thought i should ask in StackOverflow
EXAMPLE
Input: 5, 10
Output: 10
I found this solution, can someone help me understand these lines of code
int getMax(int a, int b) {
int c = a - b;
int k = (c >> 31) & 0x1;
int max = a - k * c;
return max;
}
int getMax(int a, int b) {
int c = a - b;
int k = (c >> 31) & 0x1;
int max = a - k * c;
return max;
}
Let's dissect this. This first line appears to be straightforward - it stores the difference of a and b. This value is negative if a < b and is nonnegative otherwise. But there's actually a bug here - if the difference of the numbers a and b is so big that it can't fit into an integer, this will lead to undefined behavior - oops! So let's assume that doesn't happen here.
In the next line, which is
int k = (c >> 31) & 0x1;
the idea is to check if the value of c is negative. In virtually all modern computers, numbers are stored in a format called two's complement in which the highest bit of the number is 0 if the number is non-negative and 1 if the number is negative. Moreover, most ints are 32 bits. (c >> 31) shifts the number down 31 bits, leaving the highest bit of the number in the spot for the lowest bit. The next step of taking this number and ANDing it with 1 (whose binary representation is 0 everywhere except the last bit) erases all the higher bits and just gives you the lowest bit. Since the lowest bit of c >> 31 is the highest bit of c, this reads the highest bit of c as either 0 or 1. Since the highest bit is 1 iff c is negative, this is a way of checking whether c is negative (1) or non-negative (0). Combining this reasoning with the above, k is 1 if a < b and is 0 otherwise.
The final step is to do this:
int max = a - k * c;
If a < b, then k == 1 and k * c = c = a - b, and so
a - k * c = a - (a - b) = a - a + b = b
Which is the correct max, since a < b. Otherwise, if a >= b, then k == 0 and
a - k * c = a - 0 = a
Which is also the correct max.
Here we go: (a + b) / 2 + |a - b| / 2
Use bitwise hacks
r = x ^ ((x ^ y) & -(x < y)); // max(x, y)
If you know that INT_MIN <= x - y <= INT_MAX, then you can use the following, which is faster because (x - y) only needs to be evaluated once.
r = x - ((x - y) & ((x - y) >> (sizeof(int) * CHAR_BIT - 1))); // max(x, y)
Source : Bit Twiddling Hacks by Sean Eron Anderson
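Wrapped into functions, the two expressions above can be tried out directly. This is only a sketch: the names max_xor and max_shift are made up, the first variant still contains a < inside the expression, and the second assumes that x - y does not overflow and that >> on a negative int is an arithmetic shift (implementation-defined in C, but what common compilers do).

```c
#include <limits.h>

/* max(x, y) via XOR masking: -(x < y) is all ones when x < y, else zero. */
int max_xor(int x, int y) {
    return x ^ ((x ^ y) & -(x < y));
}

/* max(x, y) via the sign of the difference.
   Assumes INT_MIN <= x - y <= INT_MAX and an arithmetic right shift. */
int max_shift(int x, int y) {
    int d = x - y;
    int mask = d >> (sizeof(int) * CHAR_BIT - 1);   /* all ones if d < 0, else 0 */
    return x - (d & mask);
}
```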
(sqrt( a*a + b*b - 2*a*b ) + a + b) / 2
This is based on the same technique as mike.dld's solution, but it is less "obvious" here what I am doing. An "abs" operation looks like you are comparing the sign of something but I here am taking advantage of the fact that sqrt() will always return you the positive square root so I am squaring (a-b) writing it out in full then square-rooting it again, adding a+b and dividing by 2.
You will see it always works: eg the user's example of 10 and 5 you get sqrt(100 + 25 - 100) = 5 then add 10 and 5 gives you 20 and divide by 2 gives you 10.
If we use 9 and 11 as our numbers we would get (sqrt(121 + 81 - 198) + 11 + 9)/2 = (sqrt(4) + 20) / 2 = 22/2 = 11
The simplest answer is below.
#include <math.h>
int Max(int x, int y)
{
/* fabs, not abs: abs would truncate the halved difference to an int first */
return (float)(x + y) / 2.0 + fabs((float)(x - y) / 2);
}
int Min(int x, int y)
{
return (float)(x + y) / 2.0 - fabs((float)(x - y) / 2);
}
int max(int i, int j) {
int m = ((i-j) >> 31);
return (m & j) + ((~m) & i);
}
This solution avoids multiplication.
m will either be 0x00000000 or 0xffffffff
Using the shifting idea to extract the sign as posted by others, here's another way:
max (a, b) = new[] { a, b } [((a - b) >> 31) & 1]
This pushes the two numbers into an array with the maximum number given by the array-element whose index is sign bit of the difference between the two numbers.
Do note that:
The difference (a - b) may overflow.
If the numbers are unsigned and the >> operator refers to a logical right-shift, the & 1 is unnecessary.
Here's how I think I'd do the job. It's not as readable as you might like, but when you start with "how do I do X without using the obvious way of doing X", you have to kind of expect that.
In theory, this gives up some portability too, but you'd have to find a pretty unusual system to see a problem.
#include <limits.h> /* for CHAR_BIT */
#define BITS (CHAR_BIT * sizeof(int) - 1)
int findmax(int a, int b) {
int rets[] = {a, b};
return rets[(unsigned)(a-b)>>BITS];
}
This does have some advantages over the one shown in the question. First of all, it calculates the correct size of shift, instead of being hard-coded for 32-bit ints. Second, with most compilers we can expect all the multiplication to happen at compile time, so all that's left at run time is trivial bit manipulation (subtract and shift) followed by a load and return. In short, this is almost certain to be pretty fast, even on the smallest microcontroller, where the original used multiplication that had to happen at run-time, so while it's probably pretty fast on a desktop machine, it'll often be quite a bit slower on a small microcontroller.
Here's what those lines are doing:
c is a-b. if c is negative, a<b.
k is 32nd bit of c which is the sign bit of c (assuming 32 bit integers. If done on a platform with 64 bit integers, this code will not work). It's shifted 31 bits to the right to remove the rightmost 31 bits leaving the sign bit in the right most place and then anding it with 1 to remove all the bits to the left (which will be filled with 1s if c is negative). So k will be 1 if c is negative and 0 if c is positive.
Then max = a - k * c. If k is 0, this means a>=b, so max is a - 0 * c = a. If k is 1, this means that a<b and then a - 1 * c = a - (a - b) = a - a + b = b.
Overall, it's just using the sign bit of the difference to avoid using greater than or less than operations. It's honestly a little silly to say that this code doesn't use a comparison. c is the result of comparing a and b. The code just doesn't use a comparison operator. You could do a similar thing in many assembly languages by just subtracting the numbers and then jumping based on the values set in the status register.
I should also add that all of these solutions are assuming that the two numbers are integers. If they are floats, doubles, or something more complicated (BigInts, Rational numbers, etc.) then you really have to use a comparison operator. Bit-tricks will not generally do for those.
getMax() Function Without Any Logical Operation-
int getMax(int a, int b){
return (a+b+((a-b)>>sizeof(int)*8-1|1)*(a-b))/2;
}
Explanation:
Lets smash the 'max' into pieces,
max
= ( max + max ) / 2
= ( max + (min+differenceOfMaxMin) ) / 2
= ( max + min + differenceOfMaxMin ) / 2
= ( max + min + | max - min | ) / 2
So the function should look like this-
getMax(a, b)
= ( a + b + absolute(a - b) ) / 2
Now,
absolute(x)
= x [if 'x' is positive] or -x [if 'x' is negative]
= x * ( 1 [if 'x' is positive] or -1 [if 'x' is negative] )
In integer positive number the first bit (sign bit) is- 0; in negative it is- 1. By shifting bits to the right (>>) the first bit can be captured.
During right shift the empty space is filled by the sign bit. So 01110001 >> 2 = 00011100, while 10110001 >> 2 = 11101100.
As a result, for 8 bit number shifting 7 bit will either produce- 1 1 1 1 1 1 1 [0 or 1] for negative, or 0 0 0 0 0 0 0 [0 or 1] for positive.
Now, if OR operation is performed with 00000001 (= 1), negative number yields- 11111111 (= -1), and positive- 00000001 (= 1).
So,
absolute(x)
= x * ( 1 [if 'x' is positive] or -1 [if 'x' is negative] )
= x * ( ( x >> (numberOfBitsInInteger-1) ) | 1 )
= x * ( ( x >> ((numberOfBytesInInteger*bitsInOneByte) - 1) ) | 1 )
= x * ( ( x >> ((sizeOf(int)*8) - 1) ) | 1 )
Finally,
getMax(a, b)
= ( a + b + absolute(a - b) ) / 2
= ( a + b + ((a-b) * ( ( (a-b) >> ((sizeOf(int)*8) - 1) ) | 1 )) ) / 2
Another way-
int getMax(int a, int b){
int i[] = {a, b};
return i[( (i[0]-i[1]) >> (sizeof(int)*8 - 1) ) & 1 ];
}
static int mymax(int a, int b)
{
int[] arr;
arr = new int[3];
arr[0] = b;
arr[1] = a;
arr[2] = a;
return arr[Math.Sign(a - b) + 1];
}
If b > a then (a-b) will be negative, sign will return -1, by adding 1 we get index 0 which is b, if b=a then a-b will be 0, +1 will give 1 index so it does not matter if we are returning a or b, when a > b then a-b will be positive and sign will return 1, adding 1 we get index 2 where a is stored.
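#include <stdio.h>
#include <stdlib.h> /* abs */

int main(void)
{
    int num1, num2, diff;
    printf("Enter number 1 : ");
    scanf("%d", &num1);
    printf("Enter number 2 : ");
    scanf("%d", &num2);
    diff = num1 - num2;
    num1 = abs(diff);   /* |diff| */
    num2 = num1 + diff; /* 2*diff if diff > 0, otherwise 0 */
    if (num1 == num2)   /* both are 0 only when diff == 0 */
        printf("Both numbers are equal\n");
    else if (num2 == 0) /* diff was negative */
        printf("Num2 > Num1\n");
    else
        printf("Num1 > Num2\n");
    return 0;
}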
The code I am providing finds the maximum of two numbers; the numbers can be of any numeric type (integer or floating point). If the inputs are equal, the function returns that number.
double findmax(double a, double b)
{
//find the difference of the two numbers
double diff=a-b;
double temp_diff=diff;
int int_diff=temp_diff;
/*
For the floating point numbers the difference contains decimal
values (for example 0.0009, 2.63 etc.) if the left side of '.' contains 0 then we need
to get a non-zero number on the left side of '.'
*/
while ( (!(int_diff|0)) && ((temp_diff-int_diff)||(0.0)) )
{
temp_diff = temp_diff * 10;
int_diff = temp_diff;
}
/*
shift the sign bit of variable 'int_diff' to the LSB position and find if it is
1(difference is -ve) or 0(difference is +ve) , then multiply it with the difference of
the two numbers (variable 'diff') then subtract it with the variable a.
*/
return a- (diff * ( int_diff >> (sizeof(int) * 8 - 1 ) & 1 ));
}
Description
First, the function takes its arguments as double and returns a double. The reason is to have a single function that can find the maximum for all numeric types. When integers are passed, or one integer and one floating-point number, implicit conversion lets the same function be used for them as well.
The basic logic is simple: given two numbers a and b, if a-b>0 (the difference is positive) then a is the maximum; if a-b==0 then both are equal; and if a-b<0 (the difference is negative) then b is the maximum.
The sign bit is stored as the Most Significant Bit (MSB). If the MSB is 1 the number is negative; if it is 0 the number is non-negative. To check the MSB we shift it to the LSB position and bitwise & it with 1: if the result is 1 the number is negative, otherwise it is positive or zero. This result is obtained by the expression:
int_diff >> (sizeof(int) * 8 - 1 ) & 1
Here, to move the sign bit from the MSB to the LSB, we right-shift by k-1 bits, where k is the number of bits used to store an int (which depends on the system). Here k = sizeof(int) * 8, since sizeof gives the number of bytes and we multiply by 8 to get the number of bits. After the right shift, we apply bitwise & with 1 to get the result.
Having obtained the result (call it r) as 1 (for a negative diff) or 0 (for a non-negative diff), we multiply it by the difference of the two numbers. The logic is as follows:
if a>b then a-b>0, i.e. it is positive, so r=0. Then a-(a-b)*r => a-(a-b)*0, which gives 'a' as the maximum.
if a<b then a-b<0, i.e. it is negative, so r=1. Then a-(a-b)*r => a-(a-b)*1 => a-a+b => b, which gives 'b' as the maximum.
Two points remain: 1. the use of the while loop, and 2. why the variable 'int_diff' is an integer. To answer these properly we have to understand some points:
Floating type values cannot be used as operands of the bitwise operators.
For that reason, we need the value in an integer to get the sign of the difference with bitwise operators. These two points explain why 'int_diff' is an integer.
Now say we store the difference in 'diff'. Irrespective of sign, there are 3 possibilities for its value: (a) |diff|>=1, (b) 0<|diff|<1, (c) |diff|==0.
When we assign a double value to an integer variable, the fractional part is lost.
For case (a) the magnitude of 'int_diff' is non-zero (i.e., 1, 2, ...). For the other two cases int_diff=0.
The condition (temp_diff-int_diff)||0.0 is false only when temp_diff has no fractional part; with int_diff equal to 0 that means diff==0, so both numbers are equal and the loop is skipped.
If diff!=0, the condition !(int_diff|0) is true exactly when int_diff is 0, i.e. when case (b) holds.
In the while loop we multiply by 10 until int_diff becomes non-zero, so that int_diff carries the sign of diff.
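As a variation on the description above, the sign of the difference can also be read straight from the double's bit pattern instead of scaling it into an int. This is a different technique from the loop in findmax, sketched under the assumption that double is IEEE-754 binary64 (so bit 63 is the sign bit) and that both inputs are finite; sign_bit_of and find_max are names invented for this illustration.

```c
#include <stdint.h>
#include <string.h>

/* Copies the bytes of x into a 64-bit integer and returns bit 63.
   Assumes double is IEEE-754 binary64. */
unsigned sign_bit_of(double x) {
    uint64_t bits;
    memcpy(&bits, &x, sizeof bits);
    return (unsigned)(bits >> 63);
}

/* r is 1 when a - b is negative (a < b) and 0 otherwise, so exactly one
   of the two products survives. Intended for finite inputs only. */
double find_max(double a, double b) {
    unsigned r = sign_bit_of(a - b);
    return a * (1 - r) + b * r;
}
```

Selecting with a * (1 - r) + b * r returns one of the inputs unchanged, whereas a - (a - b) * r can be off by a rounding error when b is selected.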
Here are a couple of bit-twiddling methods to get the max of two integral values:
Method 1
int max1(int a, int b) {
static const size_t SIGN_BIT_SHIFT = sizeof(a) * 8 - 1;
int mask = (a - b) >> SIGN_BIT_SHIFT;
return (a & ~mask) | (b & mask);
}
Explanation:
(a - b) >> SIGN_BIT_SHIFT - If a >= b then a - b is non-negative, thus the sign bit is 0, and the mask is 0x00..00. Otherwise, a < b so a - b is negative, the sign bit is 1 and after shifting (arithmetically, so the sign bit fills every position), we get a mask of 0xFF..FF
(a & ~mask) - If the mask is 0xFF..FF, then ~mask is 0x00..00 and then this value is 0. Otherwise, ~mask is 0xFF..FF and the value is a
(b & mask) - If the mask is 0xFF..FF, then this value is b. Otherwise, mask is 0x00..00 and the value is 0.
Finally:
If a >= b then a - b is non-negative, we get max = a | 0 = a
If a < b then a - b is negative, we get max = 0 | b = b
Method 2
int max2(int a, int b) {
static const size_t SIGN_BIT_SHIFT = sizeof(a) * 8 - 1;
int mask = (a - b) >> SIGN_BIT_SHIFT;
return a ^ ((a ^ b) & mask);
}
Explanation:
Mask explanation is the same as for Method 1. If a >= b the mask is 0x00..00, otherwise the mask is 0xFF..FF
If the mask is 0x00..00, then (a ^ b) & mask is 0x00..00
If the mask is 0xFF..FF, then (a ^ b) & mask is a ^ b
Finally:
If a >= b, we get a ^ 0x00..00 = a
If a < b, we get a ^ a ^ b = b
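One caveat with both methods is that a - b can overflow when the inputs are far apart (for example a large positive a and a large negative b), and signed overflow is undefined behaviour in C. A possible way around this, assuming 32-bit inputs and an arithmetic right shift, is to compute the difference in 64 bits. The names max_masked and min_masked below are made up for this sketch.

```c
#include <stdint.h>

/* Same XOR-select as Method 2, but the mask comes from a 64-bit
   difference, which cannot overflow for 32-bit inputs. */
int32_t max_masked(int32_t a, int32_t b) {
    int64_t diff = (int64_t)a - (int64_t)b;
    int32_t mask = (int32_t)(diff >> 63); /* all ones if a < b, else 0 */
    return a ^ ((a ^ b) & mask);          /* a < b: a ^ a ^ b = b */
}

/* Starting from b instead of a flips the selection and yields the minimum. */
int32_t min_masked(int32_t a, int32_t b) {
    int64_t diff = (int64_t)a - (int64_t)b;
    int32_t mask = (int32_t)(diff >> 63);
    return b ^ ((a ^ b) & mask);          /* a < b: b ^ a ^ b = a */
}
```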
//In C# you can use math library to perform min or max function
using System;
class NumberComparator
{
static void Main()
{
Console.Write(" write the first number to compare: ");
double first_Number = double.Parse(Console.ReadLine());
Console.Write(" write the second number to compare: ");
double second_Number = double.Parse(Console.ReadLine());
double compare_Numbers = Math.Max(first_Number, second_Number);
Console.Write("{0} is greater",compare_Numbers);
}
}
No logical operators, no libs (JS)
function max(x, y) {
  let z = (x - y) ** 2; // squared difference
  z = z ** .5;          // square root gives |x - y|
  return (x + y + z) / 2;
}
The logic of the problem can be described as follows: if the 1st number is the smaller one, 0 is subtracted from it; otherwise the difference of the two numbers is subtracted from the 1st number, which gives the 2nd number.
I found one more mathematical solution which I think makes this concept a bit simpler to understand.
Considering a and b as given numbers
c=|a/b|+1;
d=(c-1)/b;
smallest number= a - d*(a-b);
Again, the idea is to find a factor k (d above) which is either 0 or 1 and multiply it by the difference of the two numbers. Finally, this product is subtracted from the 1st number to yield the smaller of the two numbers.
P.S. This solution fails if the 2nd number is zero.
There is one way
public static int Min(int a, int b)
{
int dif = (int)(((uint)(a - b)) >> 31);
return a * dif + b * (1 - dif);
}
and one
return (a>=b)?b:a;
int a=151;
int b=121;
int k=Math.abs(a-b);
int j= a+b;
double k1=(double)(k);
double j1= (double) (j);
double c=Math.ceil(k1/2) + Math.floor(j1/2);
int c1= (int) (c);
System.out.println(" Max value = " + c1);
Guess we can just multiply the numbers by the results of their comparisons (each comparison evaluates to 0 or 1), e.g.:
int max=(a>b)*a+(a<=b)*b;