C - Bit operation without loop - c

I am trying to calculate a value in function of bits in a word of variable length. Starting MSB, if bit is 1 then the value is 1/2^i. This value is multiplied by a scaling factor
Example: 110010
this would be (1/2 + 1/4 + 1/64) * scaling_factor
I have programmed it with a for loop; any idea of how this could be done avoiding the loop?
This is the code:
double dec_bnr (unsigned long data, int significant_bits, double scaling_factor)
{
unsigned int temp;
unsigned int bnr_locmask = 0x8000000;
temp = data & bnr_locmasks[significant_bits-1];
double result = 0;
for (int i=0; i<significant_bits; i++){
if((temp & bnr_locmask)==bnr_locmask){
result+=pow (0.5, i+1);
}
bnr_locmask = bnr_locmask >> 1;
}
return result * scaling_factor;
}
Thank you in advance!
Edit: Thank you for your answers; however, what I am trying to say is not what you propose. Please, let me add an example:
data=a0280
A 2 4 8 16 32 64 128 256 512 1024 2048 4096 8192
1/A 0,5 0,125 0,000488281 0,00012207
data 1 0 1 0 0 0 0 0 0 0 1 0 1 0000000
result = scaling_factor*Sum(data/A)
We only take into account the value 1/A if the bit for that position is 1.

It's actually very easy to do without a loop:
double dec_bnr (unsigned long data, double scaling_factor)
{
return data*scaling_factor/(ULONG_MAX+1.0);
}
It's worth noting what happens in this code. First, data is converted into a double to match scaling_factor and then the numbers are multiplied, but then we do a further scaling dividing by ULONG_MAX+1.0 which is also converted to a double before dividing.
Note that
This can't be ULONG_MAX + 1 because that would cause the number to remain an integer type and wrap around to zero (and thus causing a divide-by-zero error at runtime).
The number ULONG_MAX + 1.0, interpreted as a double, may well be identical to ULONG_MAX on 64-bit machines
It's called fixed point arithmetic and there are many resources available on the internet and elsewhere that explain it very well.

Or you could use ldexp():
double result = ldexp(data * scaling_factor, -significant_bits) ;
which has the advantage of expressing exactly what you are doing ! (Assuming that the scaling_factor is double.)
It also avoids any issues with constructing large powers of two ((double)(ULONG_MAX + 1) doesn't quite work !) and dividing, or doing pow(2.0, -significant_bits) and multiplying.
Further thought... this is, of course, equivalent:
double result = ldexp((double)data, -significant_bits) * scaling_factor ;
But you could lump the "binary point shift" in with the scaling_factor (once):
double scaling_factor_x = ldexp(scaling_factor, -significant_bits) ;
and then the conversion is simply:
double result = (double)data * scaling_factor_x ;

This, or something similar, should be equivalent to what you are doing:
(double)data / 0x100000000 * scaling_factor
The way binary works is that each bit has a weight twice the weight of the bit after it, so you don't need to loop through the bits like you are doing.

Related

Our Exercise code is outputting an unexpected -1 and we don't know why

So we are working on an exercise for Uni and we can't figure out why this code outputs the second value as -1
We believe that it is due to the 16-bit limit but don't understand exactly why and can't find any sources on this issue since we don't know what it actually is. I'm sorry if this seems really stupid, please help D:
#include <stdio.h>
#include <stdint.h>
int main() {
int16_t y = 1024, z = 65; // auch "short" als Datentyp verwendbar
y = y * z;
printf("1. Ausgabe: %d\n", y);
printf("2. Ausgabe: %d\n", y / 3 * 3 - 3 * y / 3);
printf("\n");
return 0;
}
we expected the result to be 0 for 2. Ausgabe but it outputs -1
the range for a int16_t is −32,768 ... 32,767
y * z = 1024*65 = 66560 but hence will be stored as 66560 % 2^16 = 1024
so you still have y = 1024 and your statement y = y * z is useless
y / 3 * 3 = (y / 3) * 3 = 341 * 3 = 1023 != y because of rounding
3 * y / 3 = (3 * y) / 3 = y because there is no rounding
when you substract you get -1
the problem is you're overflowing your variable and doing integer divisions
use floats instead of int16_t
int16_t has the range [−32768, +32767] - you are overflowing that range immediately with the first multiplication.
What does the first printf print out? That should have been an indication. You also mention it in your question.
Google int16_t range and it will show you the range above. Try using an int32_t and see what difference that makes.
Given y = y * z; where all operands are type int16_t and the values are 1024 * 65 = 66560,
then there are two possibilities:
If your system is 8 or 16 bit, you will have 16 bit int type. There will be no type promotion of the operands since they are already 16 bits and you get a signed integer overflow, invoking undefined behavior. Since 16 bit signed integers can only store values up to 32767.
If your system is 32 bit, then both y and z are implicitly promoted (see Implicit type promotion rules) to type int which is then 32 bits. The result of the multiplication is of type int. The value 66560 will fit just fine.
You then convert this 32 bit int to int16_t upon assignment. The value will no longer fit - what will happen is an implementation-defined conversion from signed 32 to signed 16. (In theory your system may raise a signal here.)
In practice most systems will simply take the 66560 = 10400h and cut off the MS bytes, leaving you with 1024 = 400h.
In either case, the equation y = y * z; is highly questionable given the size of the input values! This is to be regarded as a bug. You should use uint16_t or int32_t instead.
As for y / 3 * 3 - 3 * y / 3, it will be 1024 / 3 * 3 - 3 * 1024 / 3. All operands are integers and the operator associativity of multiplicative operators * and / is left-to-right.
So you get 341 * 3 - 3072 / 3 -> 1023 - 1024 = -1.
As a side-note, you are using the wrong printf conversion specifier. The most correct one for int16_t is this:
#include <inttypes.h>
printf("1. Ausgabe: %"PRIi16 "\n", y);

How to sum large numbers?

I am trying to calculate 1 + 1 * 2 + 1 * 2 * 3 + 1 * 2 * 3 * 4 + ... + 1 * 2 * ... * n where n is the user input.
It works for values of n up to 12. I want to calculate the sum for n = 13, n = 14 and n = 15. How do I do that in C89? As I know, I can use unsigned long long int only in C99 or C11.
Input 13, result 2455009817, expected 6749977113
Input 14, result 3733955097, expected 93928268313
Input 15, result 1443297817, expected 1401602636313
My code:
#include <stdio.h>
#include <stdlib.h>
int main()
{
unsigned long int n;
unsigned long int P = 1;
int i;
unsigned long int sum = 0;
scanf("%lu", &n);
for(i = 1; i <= n; i++)
{
P *= i;
sum += P;
}
printf("%lu", sum);
return 0;
}
In practice, you want some arbitrary precision arithmetic (a.k.a. bigint or bignum) library. My recommendation is GMPlib but there are other ones.
Don't try to code your own bignum library. Efficient & clever algorithms exist, but they are unintuitive and difficult to grasp (you can find entire books devoted to that question). In addition, existing libraries like GMPlib are taking advantage of specific machine instructions (e.g. ADC -add with carry) that a standard C compiler won't emit (from pure C code).
If this is a homework and you are not allowed to use external code, consider for example representing a number in base or radix 1000000000 (one billion) and code yourself the operations in a very naive way, similar to what you have learned as a kid. But be aware that more efficient algorithms exist (and that real bignum libraries are using them).
A number could be represented in base 1000000000 by having an array of unsigned, each being a "digit" of base 1000000000. So you need to manage arrays (probably heap allocated, using malloc) and their length.
You could use a double, especially if your platform uses IEEE754.
Such a double gives you 53 bits of precision, which means integers are exact up to the 53rd power of 2. That's good enough for this case.
If your platform doesn't use IEEE754 then consult the documentation on the floating point scheme adopted. It might be adequate.
A simple approach when you're just over the limit of MaxInt, is to do the computations modulo 10^n for a suitable n and you do the same computation as floating point computation but where you divide everything by 10^r.The former result will give you the first n digits while the latter result will give you the last digits of the answer with the first r digits removed. Then the last few digits here will be inaccurate due to roundoff errors, so you should choose r a bit smaller than n. In this case taking n = 9 and r = 5 will work well.

accuracy of sqrt of integers

I have a loop like this:
for(uint64_t i=0; i*i<n; i++) {
This requires doing a multiplication every iteration. If I could calculate the sqrt before the loop then I could avoid this.
unsigned cut = sqrt(n)
for(uint64_t i=0; i<cut; i++) {
In my case it's okay if the sqrt function rounds up to the next integer but it's not okay if it rounds down.
My question is: is the sqrt function accurate enough to do this for all cases?
Edit: Let me list some cases. If n is a perfect square so that n = y^2 my question would be - is cut=sqrt(n)>=y for all n? If cut=y-1 then there is a problem. E.g. if n = 120 and cut = 10 it's okay but if n=121 (11^2) and cut is still 10 then it won't work.
My first concern was the fractional part of float only has 23 bits and double 52 so they can't store all the digits of some 32-bit or 64-bit integers. However, I don't think this is a problem. Let's assume we want the sqrt of some number y but we can't store all the digits of y. If we let the fraction of y we can store be x we can write y = x + dx then we want to make sure that whatever dx we choose does not move us to the next integer.
sqrt(x+dx) < sqrt(x) + 1 //solve
dx < 2*sqrt(x) + 1
// e.g for x = 100 dx < 21
// sqrt(100+20) < sqrt(100) + 1
Float can store 23 bits so we let y = 2^23 + 2^9. This is more than sufficient since 2^9 < 2*sqrt(2^23) + 1. It's easy to show this for double as well with 64-bit integers. So although they can't store all the digits as long as the sqrt of what they can store is accurate then the sqrt(fraction) should be sufficient. Now let's look at what happens for integers close to INT_MAX and the sqrt:
unsigned xi = -1-1;
printf("%u %u\n", xi, (unsigned)(float)xi); //4294967294 4294967295
printf("%u %u\n", (unsigned)sqrt(xi), (unsigned)sqrtf(xi)); //65535 65536
Since float can't store all the digits of 2^31-2 and double can they get different results for the sqrt. But the float version of the sqrt is one integer larger. This is what I want. For 64-bit integers as long as the sqrt of the double always rounds up it's okay.
First, integer multiplication is really quite cheap. So long as you have more than a few cycles of work per loop iteration and one spare execute slot, it should be entirely hidden by reorder on most non-tiny processors.
If you did have a processor with dramatically slow integer multiply, a truly clever compiler might transform your loop to:
for (uint64_t i = 0, j = 0; j < cut; j += 2*i+1, i++)
replacing the multiply with an lea or a shift and two adds.
Those notes aside, let’s look at your question as stated. No, you can’t just use i < sqrt(n). Counter-example: n = 0x20000000000000. Assuming adherence to IEEE-754, you will have cut = 0x5a82799, and cut*cut is 0x1ffffff8eff971.
However, a basic floating-point error analysis shows that the error in computing sqrt(n) (before conversion to integer) is bounded by 3/4 of an ULP. So you can safely use:
uint32_t cut = sqrt(n) + 1;
and you’ll perform at most one extra loop iteration, which is probably acceptable. If you want to be totally precise, instead use:
uint32_t cut = sqrt(n);
cut += (uint64_t)cut*cut < n;
Edit: z boson clarifies that for his purposes, this only matters when n is an exact square (otherwise, getting a value of cut that is “too small by one” is acceptable). In that case, there is no need for the adjustment and on can safely just use:
uint32_t cut = sqrt(n);
Why is this true? It’s pretty simple to see, actually. Converting n to double introduces a perturbation:
double_n = n*(1 + e)
which satisfies |e| < 2^-53. The mathematical square root of this value can be expanded as follows:
square_root(double_n) = square_root(n)*square_root(1+e)
Now, since n is assumed to be a perfect square with at most 64 bits, square_root(n) is an exact integer with at most 32 bits, and is the mathematically precise value that we hope to compute. To analyze the square_root(1+e) term, use a taylor series about 1:
square_root(1+e) = 1 + e/2 + O(e^2)
= 1 + d with |d| <~ 2^-54
Thus, the mathematically exact value square_root(double_n) is less than half an ULP away from[1] the desired exact answer, and necessarily rounds to that value.
[1] I’m being fast and loose here in my abuse of relative error estimates, where the relative size of an ULP actually varies across a binade — I’m trying to give a bit of the flavor of the proof without getting too bogged down in details. This can all be made perfectly rigorous, it just gets to be a bit wordy for Stack Overflow.
All my answer is useless if you have access to IEEE 754 double precision floating point, since Stephen Canon demonstrated both
a simple way to avoid imul in loop
a simple way to compute the ceiling sqrt
Otherwise, if for some reason you have a non IEEE 754 compliant platform, or only single precision, you could get the integer part of square root with a simple Newton-Raphson loop. For example in Squeak Smalltalk we have this method in Integer:
sqrtFloor
"Return the integer part of the square root of self"
| guess delta |
guess := 1 bitShift: (self highBit + 1) // 2.
[
delta := (guess squared - self) // (guess + guess).
delta = 0 ] whileFalse: [
guess := guess - delta ].
^guess - 1
Where // is operator for quotient of integer division.
Final guard guess*guess <= self ifTrue: [^guess]. can be avoided if initial guess is fed in excess of exact solution as is the case here.
Initializing with approximate float sqrt was not an option because integers are arbitrarily large and might overflow
But here, you could seed the initial guess with floating point sqrt approximation, and my bet is that the exact solution will be found in very few loops. In C that would be:
uint32_t sqrtFloor(uint64_t n)
{
int64_t diff;
int64_t delta;
uint64_t guess=sqrt(n); /* implicit conversions here... */
while( (delta = (diff=guess*guess-n) / (guess+guess)) != 0 )
guess -= delta;
return guess-(diff>0);
}
That's a few integer multiplications and divisions, but outside the main loop.
What you are looking for is a way to calculate a rational upper bound of the square root of a natural number. Continued fraction is what you need see wikipedia.
For x>0, there is
.
To make the notation more compact, rewriting the above formula as
Truncate the continued fraction by removing the tail term (x-1)/2's at each recursion depth, one gets a sequence of approximations of sqrt(x) as below:
Upper bounds appear at lines with odd line numbers, and gets tighter. When distance between an upper bound and its neighboring lower bound is less than 1, that approximation is what you need. Using that value as the value of cut, here cut must be a float number, solves the problem.
For very large number, rational number should be used, so no precision is lost during conversion between integer and floating point number.

In C bits, multiply by 3 and divide by 16

A buddy of mine had these puzzles and this is one that is eluding me. Here is the problem, you are given a number and you want to return that number times 3 and divided by 16 rounding towards 0. Should be easy. The catch? You can only use the ! ~ & ^ | + << >> operators and of them only a combination of 12.
int mult(int x){
//some code here...
return y;
}
My attempt at it has been:
int hold = x + x + x;
int hold1 = 8;
hold1 = hold1 & hold;
hold1 = hold1 >> 3;
hold = hold >> 4;
hold = hold + hold1;
return hold;
But that doesn't seem to be working. I think I have a problem of losing bits but I can't seem to come up with a way of saving them. Another perspective would be nice. Just to add, you also can only use variables of type int and no loops, if statements or function calls may be used.
Right now I have the number 0xfffffff. It is supposed to return 0x2ffffff but it is returning 0x3000000.
For this question you need to worry about the lost bits before your division (obviously).
Essentially, if it is negative then you want to add 15 after you multiply by 3. A simple if statement (using your operators) should suffice.
I am not going to give you the code but a step by step would look like,
x = x*3
get the sign and store it in variable foo.
have another variable hold x + 15;
Set up an if statement so that if x is negative it uses that added 15 and if not then it uses the regular number (times 3 which we did above).
Then divide by 16 which you already showed you know how to do. Good luck!
This seems to work (as long as no overflow occurs):
((num<<2)+~num+1)>>4
Try this JavaScript code, run in console:
for (var num = -128; num <= 128; ++num) {
var a = Math.floor(num * 3 / 16);
var b = ((num<<2)+~num+1)>>4;
console.log(
"Input:", num,
"Regular math:", a,
"Bit math:", b,
"Equal: ", a===b
);
}
The Maths
When you divide a positive integer n by 16, you get a positive integer quotient k and a remainder c < 16:
(n/16) = k + (c/16).
(Or simply apply the Euclidan algorithm.) The question asks for multiplication by 3/16, so multiply by 3
(n/16) * 3 = 3k + (c/16) * 3.
The number k is an integer, so the part 3k is still a whole number. However, int arithmetic rounds down, so the second term may lose precision if you divide first, And since c < 16, you can safely multiply first without overflowing (assuming sizeof(int) >= 7). So the algorithm design can be
(3n/16) = 3k + (3c/16).
The design
The integer k is simply n/16 rounded down towards 0. So k can be found by applying a single AND operation. Two further operations will give 3k. Operation count: 3.
The remainder c can also be found using an AND operation (with the missing bits). Multiplication by 3 uses two more operations. And shifts finishes the division. Operation count: 4.
Add them together gives you the final answer.
Total operation count: 8.
Negatives
The above algorithm uses shift operations. It may not work well on negatives. However, assuming two's complement, the sign of n is stored in a sign bit. It can be removed beforing applying the algorithm and reapplied on the answer.
To find and store the sign of n, a single AND is sufficient.
To remove this sign, OR can be used.
Apply the above algorithm.
To restore the sign bit, Use a final OR operation on the algorithm output with the stored sign bit.
This brings the final operation count up to 11.
what you can do is first divide by 4 then add 3 times then again devide by 4.
3*x/16=(x/4+x/4+x/4)/4
with this logic the program can be
main()
{
int x=0xefffffff;
int y;
printf("%x",x);
y=x&(0x80000000);
y=y>>31;
x=(y&(~x+1))+(~y&(x));
x=x>>2;
x=x&(0x3fffffff);
x=x+x+x;
x=x>>2;
x=x&(0x3fffffff);
x=(y&(~x+1))+(~y&(x));
printf("\n%x %d",x,x);
}
AND with 0x3fffffff to make msb's zero. it'l even convert numbers to positive.
This uses 2's complement of negative numbers. with direct methods to divide there will be loss of bit accuracy for negative numbers. so use this work arround of converting -ve to +ve number then perform division operations.
Note that the C99 standard states in section section 6.5.7 that right shifts of signed negative integer invokes implementation-defined behavior. Under the provisions that int is comprised of 32 bits and that right shifting of signed integers maps to an arithmetic shift instruction, the following code works for all int inputs. A fully portable solution that also fulfills the requirements set out in the question may be possible, but I cannot think of one right now.
My basic idea is to split the number into high and low bits to prevent intermediate overflow. The high bits are divided by 16 first (this is an exact operation), then multiplied by three. The low bits are first multiplied by three, then divided by 16. Since arithmetic right shift rounds towards negative infinity instead of towards zero like integer division, a correction needs to be applied to the right shift for negative numbers. For a right shift by N, one needs to add 2N-1 prior to the shift if the number to be shifted is negative.
#include <stdio.h>
#include <stdlib.h>
int ref (int a)
{
long long int t = ((long long int)a * 3) / 16;
return (int)t;
}
int main (void)
{
int a, t, r, c, res;
a = 0;
do {
t = a >> 4; /* high order bits */
r = a & 0xf; /* low order bits */
c = (a >> 31) & 15; /* shift correction. Portable alternative: (a < 0) ? 15 : 0 */
res = t + t + t + ((r + r + r + c) >> 4);
if (res != ref(a)) {
printf ("!!!! error a=%08x res=%08x ref=%08x\n", a, res, ref(a));
return EXIT_FAILURE;
}
a++;
} while (a);
return EXIT_SUCCESS;
}

Rounding up integer without using float, double, or division

Its an embedded platform thats why such restrictions.
original equation: 0.02035*c*c - 2.4038*c
Did this:
int32_t val = 112; // this value is arbitrary
int32_t result = (val*((val * 0x535A8) - 0x2675F70));
result = result>>24;
The precision is still poor. When we multiply val*0x535A8 Is there a way we can further improve the precision by rounding up, but without using any float, double, or division.
The problem is not precision. You're using plenty of bits.
I suspect the problem is that you're comparing two different methods of converting to int. The first is a cast of a double, the second is a truncation by right-shifting.
Converting floating point to integer simply drops the fractional part, leading to a round towards zero; right-shifting does a round down or floor. For positive numbers there's no difference, but for negative numbers the two methods will be 1 off from each other. See an example at http://ideone.com/rkckuy and some background reading at Wikipedia.
Your original code is easy to fix:
int32_t result = (val*((val * 0x535A8) - 0x2675F70));
if (result < 0)
result += 0xffffff;
result = result>>24;
See the results at http://ideone.com/D0pNPF
You might also just decide that the right shift result is OK as is. The conversion error isn't greater than it is for the other method, just different.
Edit: If you want to do rounding instead of truncation the answer is even easier.
int32_t result = (val*((val * 0x535A8) - 0x2675F70));
result = (result + (1L << 23)) >> 24;
I'm going to join in with some of the others in suggesting that you use a constant expression to replace those magic constants with something that documents how they were derived.
static const int32_t a = (int32_t)(0.02035 * (1L << 24) + 0.5);
static const int32_t b = (int32_t)(2.4038 * (1L << 24) + 0.5);
int32_t result = (val*((val * a) - b));
How about just scaling your constants by 10000. The maximum number you then get is 2035*120*120 - 24038*120 = 26419440, which is far below the 2^31 limit. So maybe there is no need to do real bit-tweaking here.
As noted by Joe Hass, your problem is that you shift your precision bits into the dustbin.
Whether shifting your decimals by 2 or by 10 to the left does actually not matter. Just pretend your decimal point is not behind the last bit but at the shifted position. If you keep computing with the result, shifting by 2 is likely easier to handle. If you just want to output the result, shift by powers of ten as proposed above, convert the digits and insert the decimal point 5 characters from the right.
Givens:
Lets assume 1 <= c <= 120,
original equation: 0.02035*c*c - 2.4038*c
then -70.98586 < f(c) < 4.585
--> -71 <= result <= 5
rounding f(c) to nearest int32_t.
Arguments A = 0.02035 and B = 2.4038
A & B may change a bit with subsequent compiles, but not at run-time.
Allow coder to input values like 0.02035 & 2.4038. The key components shown here and by others it to scale the factors like 0.02035 to by some power-of-2, do the equation (simplified into the form (A*c - B)*c) and the scale the result back.
Important features:
1 When determining A and B, insure the compile time floating point multiplication and final conversion occurs via a round and not a truncation. With positive values, the + 0.5 achieves that. Without a rounded answer UD_A*UD_Scaling could end up just under a whole number and truncate away 0.999999 when converting to the int32_t
2 Instead of doing expensive division at run-time, we do >> (right shift). By adding half the divisor (as suggested by #Joe Hass), before the division, we get a nicely rounded answer. It is important not to code in / here as some_signed_int / 4 and some_signed_int >> 2 do not round the same way. With 2's complement, >> truncates toward INT_MIN whereas / truncates toward 0.
#define UD_A (0.02035)
#define UD_B (2.4038)
#define UD_Shift (24)
#define UD_Scaling ((int32_t) 1 << UD_Shift)
#define UD_ScA ((int32_t) (UD_A*UD_Scaling + 0.5))
#define UD_ScB ((int32_t) (UD_B*UD_Scaling + 0.5))
for (int32_t val = 1; val <= 120; val++) {
int32_t result = ((UD_A*val - UD_B)*val + UD_Scaling/2) >> UD_Shift;
printf("%" PRId32 "%" PRId32 "\n", val, result);
}
Example differences:
val, OP equation, OP code, This code
1, -2.38345, -3, -2
54, -70.46460, -71, -70
120, 4.58400, 4, 5
This is a new answer. My old +1 answer deleted.
If you r input uses max 7 bits and you have 32 bit available then your best bet is to shift everything by as many bits as possible and work with that:
int32_t result;
result = (val * (int32_t)(0.02035 * 0x1000000)) - (int32_t)(2.4038 * 0x1000000);
result >>= 8; // make room for another 7 bit multiplication
result *= val;
result >>= 16;
Constant conversion will be done by an optimising compiler at compile time.

Resources