Overflow of 32 bit variable

Currently I am implementing the equation (2^A)[X + Y*(2^B)] in one of my applications.
The issue is overflow of a 32-bit value, and I cannot use a 64-bit data type.
For example, when B = 3 and Y = 805306367, Y*(2^B) overflows 32 bits, but with X = -2147483648 the final result comes back into the 32-bit range.
So I want to store the intermediate result of Y*(2^B). Can anyone suggest a solution? A and B range from -15 to 15, and X and Y range from -2147483648 to 2147483647.
The output can range from 0 to 4294967295.

If the number is too big for a 32-bit variable, then you either use more bits (by storing it in a bigger variable or by using multiple variables), or you give up precision and store it in a float. Since Y can be INT_MAX, by definition you can't multiply it by a number greater than 1 and still have it fit in a 32-bit int.

I'd use a loop instead of multiplication in this case. Something like this:
// Assumes 0 <= A, B <= 15; shifting by a negative count is undefined in C.
unsigned int newX = (unsigned int)X; // unsigned, so intermediate wraparound is well defined
unsigned int poweredB = ( 1u << B ); // 2^B
for( unsigned int i = 0; i < poweredB ; ++i )
{
newX += (unsigned int)Y; // adds Y a total of 2^B times
}
unsigned int result = ( 1u << A ) * newX;
But note: this works only in some situations, namely when you have a guarantee that the final result will not overflow. In your case, when Y is a large positive number and X is a large negative number, this will work. But if X and Y are both large positive numbers, there will be overflow again, and the same holds for many other combinations.
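Since the question states that the final result fits in 0...4294967295, the loop may not be needed at all: unsigned 32-bit arithmetic in C wraps modulo 2^32, so an intermediate value that overflows still leaves the correct low 32 bits. A minimal sketch of that idea, assuming A and B are non-negative (negative exponents would need separate handling) and using the hypothetical name scaled_sum_mod32:

```c
#include <assert.h>
#include <stdint.h>

/* Computes (2^a) * (x + y * 2^b) modulo 2^32.
 * Assumption: 0 <= a, b <= 15; negative exponents are not handled here. */
uint32_t scaled_sum_mod32(int32_t x, int32_t y, int a, int b)
{
    assert(a >= 0 && a <= 15 && b >= 0 && b <= 15);

    /* Converting a signed value to uint32_t is well defined: value mod 2^32. */
    uint32_t ux = (uint32_t)x;
    uint32_t uy = (uint32_t)y;

    uint32_t inner = ux + (uy << b);  /* wraps modulo 2^32 if it overflows */
    return inner << a;                /* bits shifted past bit 31 are dropped */
}
```

With X = -2147483648, Y = 805306367, A = 0 and B = 3 this returns 4294967288, which is the case from the question. If the true result does not fit in 32 bits, only its low 32 bits come back, so the range guarantee still has to hold.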

Based on the ranges of A and B in the assignment, I suppose the expected solution involves the following steps:
The steps below are best done unsigned, so store the signs of X and Y and operate on their absolute values.
Store X and Y in two variables each, one holding the high 16 bits and the other holding the low 16 bits,
something like
int hiY = Y & 0xFFFF0000;
int loY = Y & 0x0000FFFF;
Shift the high parts right by 16 so that all the variables have their high 16 bits set to 0.
Y*(2^B) is actually a shift of Y to the left by B bits. It is equivalent to shifting the high and low parts by B bits each and, since the high part has already been shifted down, both results fit inside a 32-bit integer.
Process X similarly in its high and low parts.
Keeping track of the signs of X and Y, calculate X + Y*(2^B) for the high and low parts.
Again shift both the high and low results by A bits.
Join the high and low parts into the final result.
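The shift step from the list above can be sketched as follows. This covers only the computation of |Y| * 2^B from 16-bit halves; sign handling, the addition of X and the final shift by A are left out. The wide_t struct and the function name are hypothetical, introduced just for illustration:

```c
#include <assert.h>
#include <stdint.h>

/* Hypothetical two-word container for a result of up to 47 bits. */
typedef struct {
    uint32_t hi;  /* bits 32 and up */
    uint32_t lo;  /* bits 0..31 */
} wide_t;

/* Returns mag * 2^b (0 <= b <= 15) using only 32-bit variables. */
wide_t shift_left_split(uint32_t mag, unsigned int b)
{
    assert(b <= 15);

    uint32_t hi16 = (mag >> 16) << b;      /* high half, at most 31 bits */
    uint32_t lo16 = (mag & 0xFFFFu) << b;  /* low half, at most 31 bits */

    wide_t r;
    /* hi16 carries a weight of 2^16. Its bits start at position 16 + b,
     * while lo16 ends at position 15 + b, so the two never overlap. */
    r.lo = (hi16 << 16) | lo16;
    r.hi = hi16 >> 16;
    return r;
}
```

For Y = 805306367 and B = 3 this gives hi = 1 and lo = 0x7FFFFFF8, i.e. the 33-bit value 6442450936 spread over two 32-bit words.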

If you can't use 64 bits because your local C compiler does not support them, rather than for some other overriding reason, you might consider the GNU Multiple Precision Arithmetic Library at http://gmplib.org/

Related

How can I split one long value which was 'built' from 2 int values back into its 2 integers?

I have a program (not mine) that reads long values from a data file.
I can change the numbers in the data file, and I can do something with the number the program has read from the data file.
Now I want to write 2 integer values (2 * 4 bytes) into the data file instead of one (small) long value (8 bytes).
What do I have to do with the number I get in the program to 'split' it into the two initial integer values?
What I read is something like 54257654438765. How do I split that?
That program offers me some (C-like) bitwise operations:
x = x << y; // left shift
x = x >> y; // right shift
b = ((x & y) != 0); // Bitwise AND
b = ((x | y) != 0); // Bitwise OR
b = x ^ y; // Bitwise Exclusive Operation OR
But in that program these operators only work with integer values, not long values, and I assume that 2 integers together get bigger than the highest possible integer (+-2'147'483'647).
Is there a numeric approach (from the value I see) to get back the two int values?
I have never tried that and I appreciate any hint!

That is an easy one. You have your 64-bit value. The upper 32 bits are one value, the lower 32 bits the other.
The trick is to get the values into position so that a cast to a 32-bit integer works. Casting your 64-bit value to an integer and storing it in an integer variable gives you the lower 32-bit value right away.
For the upper value you need to do some shifting: move the upper bits 32 bits to the right to get them into position.
So basically:
uint64 longValue = /* Your long value. */;
uint32 firstIntValue = (uint32) longValue;
uint32 secondIntValue = (uint32) (longValue >> 32);
Since the cast discards all bits that do not fit into the new variable, that should work just fine.
EDIT:
As requested in a comment, here is the other way round:
uint64 longValue = secondIntValue;
longValue = longValue << 32; /* If it's C: longValue <<= 32; */
longValue = longValue | firstIntValue; /* If it's C: longValue |= firstIntValue; */
The idea here is to first put the integer that is supposed to end up in the higher bits into the 64-bit storage and move it up to the upper bits with the shift. After that, place the lower value in the lower bits with an OR operation. You can't perform a simple assignment in the last operation because that would wipe out the upper bits.
Just as additional information: you can avoid all that shifting entirely if the language you are using supports unions and structs. In plain C that is possible.
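For reference, the split and the join described above might look like this in standard C, using the fixed-width types from <stdint.h> in place of the uint64/uint32 names. The function names are hypothetical:

```c
#include <assert.h>
#include <stdint.h>

/* Splits a 64-bit value into its low and high 32-bit halves. */
void split_u64(uint64_t value, uint32_t *low, uint32_t *high)
{
    assert(low && high);
    *low  = (uint32_t)value;          /* the conversion keeps the low 32 bits */
    *high = (uint32_t)(value >> 32);  /* move the upper half down first */
}

/* Rebuilds the 64-bit value from the two halves. */
uint64_t join_u64(uint32_t low, uint32_t high)
{
    /* Widen before shifting; shifting a 32-bit value by 32 is undefined. */
    return ((uint64_t)high << 32) | low;
}
```

For the value 54257654438765 from the question, the halves are high = 12632 and low = 3627555693, and joining them gives the original number back.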

Data conversion from accelerometer

Hi all, I am working with a BMA220 accelerometer, and its datasheet says that the data is in 2's complement form. So all I had to do was read that 8-bit data into an 8-bit signed char and be done.
The BMA220 has 8-bit registers in which the first 6 bits are data and the last two are zero.
void properdata(int16_t *msgData)
{
printf("\nin proper data\n");
int16_t temp, i;
for(i=0; i<3; i++)
{
temp = *(msgData + i);
printf("temp = %d sense = %d\n", temp, sense);
temp = temp >> 2; // only 6 bits data
temp = temp / sense; //decimal value * .0625 = value in g
printf("temp = %d\n", temp);
}
}
In this program I am taking the data in an unsigned variable msgData and doing all the calculations on a signed variable. I just need to know if this is the correct way to convert the data.
After some suggestions I changed my code to this:
void properdata(uint16_t *msgData)
{
int arr[3];
arr[0] = msgData[0];
arr[1] = msgData[1];
arr[2] = msgData[2];
arr[0] = arr[0]/4;
arr[1] = arr[1]/4;
arr[2] = arr[2]/4;
printf("x = %d y = %d z = %d\n", arr[0], arr[1], arr[2]);
}
Now, in a standstill condition, I am getting 61, 60 and 17. I think the data should be in the range 31 to -32, but here it is out of range?

"In this program I am taking the data in an unsigned variable msgData"
No, you aren't. msgData is a signed variable.
"I just need to know if this is the correct way to convert the data"
Using bitwise operators on signed variables is almost always a bug. You perform a right shift on a signed variable; for negative values this is implementation-defined behavior, and what happens to the sign bit depends on the compiler.

I see two problems in your code:
1) As Lundin pointed out, shifting negative values to the right is dangerous since the behavior is compiler specific.
2) According to the data sheet, the range of the accelerometer is +1.94 g to -2.00 g. You try to store the value as a plain integer. At least fixed-point arithmetic is needed here (or float); otherwise your result will just be 1, 0, -1 or -2.
The following code should take these points into account (not tested):
int16_t raw; // the 8 bit raw value from the chip
int32_t accel; // acceleration in mg
raw = (int16_t) read_value_from_chip(); // get 8 bits raw value from chip
accel = (int32_t)(raw / 4) * 625; // to avoid to shift to right, use division here
if ( accel >= 0 )
accel = ( accel + 5 ) / 10;
else
accel = ( accel - 5 ) / 10;
printf("%ld\n", (long)accel); // accel is int32_t, so cast it to match %ld
Explanation:
According to the data sheet the resolution is 62.5 mg, and the most significant 6 bits hold the signed raw value.
To avoid dealing with the sign explicitly when bringing the bits into position, division by 4 is used here instead of >> 2. This keeps the sign as required.
An optimizing compiler will typically replace this division with an arithmetic shift (plus a small correction for negative values) if the target supports one. If not, a real division is used.
*625 is done to get the acceleration in the required resolution of 1/10 mg (1 digit is 0.1 mg). 625 is the short form of 0.0625 * 10000.
To get it in mg the acceleration is divided by 10 (I do this here just because mg is handier than 0.1 mg). To round correctly, half of the divisor must be added or subtracted, according to the sign, before dividing; here this is 10/2 = 5.
The result is now in mg.
If you want to avoid the division, you must handle negative and positive values explicitly when bringing the significant bits into place.
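Putting the explanation together, a self-contained sketch of the conversion could look like the following. It assumes the register layout described in the question (6 signed data bits, left-aligned, two low bits zero) and the 62.5 mg per count quoted above; raw_to_mg is a hypothetical name, and the sign extension is done arithmetically so that it does not depend on how the compiler converts to a signed type:

```c
#include <assert.h>
#include <stdint.h>

/* Converts one raw 8-bit axis register to milli-g, rounded to nearest. */
int32_t raw_to_mg(uint8_t reg)
{
    /* Interpret the byte as 8-bit two's complement: 128..255 map to -128..-1. */
    int32_t value = (reg & 0x80u) ? (int32_t)reg - 256 : (int32_t)reg;

    /* Drop the two unused low bits; division keeps the sign. */
    int32_t counts = value / 4;
    assert(counts >= -32 && counts <= 31);

    /* 62.5 mg per count, expressed as 625 units of 0.1 mg. */
    int32_t tenth_mg = counts * 625;

    /* Divide by 10 with rounding: add or subtract half the divisor first. */
    return (tenth_mg >= 0) ? (tenth_mg + 5) / 10 : (tenth_mg - 5) / 10;
}
```

As a check against the ranges mentioned above, 0x7C (31 counts) gives 1938 mg and 0x80 (-32 counts) gives -2000 mg. A register byte of 0xF4 (244) is -3 counts, or -188 mg; dividing 244 by 4 without sign extension is how a reading like 61 can appear.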

Typically the spec sheet will have an example conversion or two. It might show the value 0000 0000 (binary) is zero, and 0100 0000 is 47.25 g. Run the example values through your code to validate.

optimizing bitwise operations

I have an unsigned integer N = abcd, where a, b, c, d represent bits from MSB to LSB. I want to get the following numbers
x1 = ab0cd
x2 = ab1cd
What is the fastest way to do it using bitwise operations in C?
What I'm trying right now is as follows
unsigned int blockid1 = N>>offset;
unsigned int key1 = (blockid1<<(1+offset))|(((1<<offset)-1)&N);
unsigned int key2 = (key1)|(1<<offset);
Here offset is the location where I want to insert 0 and 1.

const unsigned int mask = ~0u << offset; // same as -(2**offset); the unsigned constant avoids left-shifting a negative int
unsigned int key1 = N + (N & mask);
unsigned int key2 = key1 - mask;
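Why this works: N & mask is the part of N at and above the insertion point, so adding it to N doubles that part (shifts it left by one) while leaving the low bits alone, which opens up a 0 bit at offset. Subtracting mask, which equals -(2^offset) in unsigned arithmetic, then adds 2^offset and sets that bit. A small sketch wrapping the three lines in functions, with hypothetical names:

```c
#include <assert.h>
#include <limits.h>

/* Inserts a 0 bit at position `offset`; higher bits of n move up by one. */
unsigned int insert_zero_bit(unsigned int n, unsigned int offset)
{
    assert(offset < sizeof(unsigned int) * CHAR_BIT);
    unsigned int mask = ~0u << offset;  /* ones from bit `offset` upward */
    return n + (n & mask);              /* doubles the high part only */
}

/* Same as above, but the inserted bit is 1. */
unsigned int insert_one_bit(unsigned int n, unsigned int offset)
{
    assert(offset < sizeof(unsigned int) * CHAR_BIT);
    unsigned int mask = ~0u << offset;
    /* Unsigned subtraction wraps, so "- mask" adds 2^offset. */
    return insert_zero_bit(n, offset) - mask;
}
```

For N = 1011 (binary) and offset = 2 this yields 10011 and 10111, matching x1 = ab0cd and x2 = ab1cd from the question.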

Since your input is only 4 bits wide, which means there are a total of only 16 outputs, I would recommend at least testing (i.e. implementing and profiling) a look-up table.
These days, with super-fast ALUs and slow(ish) memories, look-ups are not often faster, but it's always worth testing. A small table means it won't pollute the cache very much, which might make it faster than a sequence of arithmetic instructions.
Since your outputs are pretty small too, the complete table could be represented as:
static const uint8_t keys[16][2];
32 bytes is very small, and if you do this often (i.e. many times in a row in a tight loop), the table should fit totally in cache.

You should have a look at Jasper Neumann's pages about bit permutations. They include an online code generator. However, they may be needlessly complex for your specific case (a permutation of one bit, if we consider the 0 or 1 to be the MSB).
Note: I'll let you google the address, since it has no domain name and direct IPs are not allowed by SO.

Homework - C bit puzzle - Perform % using C bit operations (no looping, conditionals, function calls, etc)

I'm completely stuck on how to do this homework problem and looking for a hint or two to keep me going. I'm limited to 20 operations (= doesn't count in this 20).
I'm supposed to fill in a function that looks like this:
/* Supposed to do x%(2^n).
For example: for x = 15 and n = 2, the result would be 3.
Additionally, if positive overflow occurs, the result should be the
maximum positive number, and if negative overflow occurs, the result
should be the most negative number.
*/
int remainder_power_of_2(int x, int n){
int twoToN = 1 << n;
/* Magic...? How can I do this without looping? We are assuming it is a
32 bit machine, and we can't use constants bigger than 8 bits
(0xFF is valid for example).
However, I can make a 32 bit number by ORing together a bunch of stuff.
Valid operations are: << >> + ~ ! | & ^
*/
return theAnswer;
}
I was thinking maybe I could shift the twoToN over left... until I somehow check (without if/else) that it is bigger than x, and then shift back to the right once... then xor it with x... and repeat? But I only have 20 operations!

Hint: In the decimal system, to compute a modulo by a power of 10 you just keep the last few digits and zero the others. E.g. 12345 % 100 = 00045 = 45. In a computer, numbers are binary, so you have to zero the binary digits (bits) instead. Look at the various bit manipulation operators (&, |, ^) to do so.

Since binary is base 2, remainders mod 2^N are exactly represented by the rightmost bits of a value. For example, consider the following 32 bit integer:
00000000001101001101000110010101
This has the two's complement value of 3461525. The remainder mod 2 is exactly the last bit (1). The remainder mod 4 (2^2) is exactly the last 2 bits (01). The remainder mod 8 (2^3) is exactly the last 3 bits (101). Generally, the remainder mod 2^N is exactly the last N bits.
In short, you need to be able to take your input number, and mask it somehow to get only the last few bits.
A tip: say you're using mod 64. The value of 64 in binary is:
00000000000000000000000001000000
The modulus you're interested in is the last 6 bits. I'll provide you a sequence of operations that can transform that number into a mask (but I'm not going to tell you what they are, you can figure them out yourself :D)
00000000000000000000000001000000 // starting value
11111111111111111111111110111111 // ???
11111111111111111111111111000000 // ???
00000000000000000000000000111111 // the mask you need
Each of those steps equates to exactly one operation that can be performed on an int type. Can you figure them out? Can you see how to simplify my steps? :D
Another hint:
00000000000000000000000001000000 // 64
11111111111111111111111111000000 // -64

Since your divisor is always a power of two, it's easy.
uint32_t remainder(uint32_t number, uint32_t power)
{
power = 1u << power; // unsigned constant, so that power = 31 is well defined
return (number & (power - 1));
}
Suppose you input number as 5 and divisor as 2
`00000000000000000000000000000101` number
AND
`00000000000000000000000000000001` divisor - 1
=
`00000000000000000000000000000001` remainder (what we expected)
Suppose you input number as 7 and divisor as 4
`00000000000000000000000000000111` number
AND
`00000000000000000000000000000011` divisor - 1
=
`00000000000000000000000000000011` remainder (what we expected)
This only works as long as the divisor is a power of two (divisor = 1 included, where the mask is 0 and the remainder is always 0), so use it carefully.
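The function above takes unsigned input, while the homework signature takes a signed int. In C99 and later, % truncates toward zero, so a negative x gives a zero or negative remainder, and the plain mask alone would not match that. The sketch below shows the adjustment; it deliberately uses an if for readability, so it illustrates the semantics and is not a solution within the homework's operator restrictions. The name rem_pow2 is hypothetical:

```c
#include <assert.h>
#include <stdint.h>

/* Remainder of x modulo 2^n with the sign of x, like x % (1 << n) in C99. */
int32_t rem_pow2(int32_t x, unsigned int n)
{
    assert(n <= 30);                              /* keeps 2^n inside int32_t */

    uint32_t mask = (UINT32_C(1) << n) - 1u;      /* the n low bits set */
    int32_t low = (int32_t)((uint32_t)x & mask);  /* always in 0 .. 2^n - 1 */

    /* For negative x with a nonzero low part, move the result
     * from 1 .. 2^n - 1 down to -(2^n - 1) .. -1. */
    if (x < 0 && low != 0) {
        low -= (int32_t)mask + 1;
    }
    return low;
}
```

For example, rem_pow2(15, 2) is 3 and rem_pow2(-15, 2) is -3, the same as 15 % 4 and -15 % 4.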

How to subtract two unsigned ints with wrap around or overflow

There are two unsigned ints (x and y) that need to be subtracted. x is always larger than y. However, both x and y can wrap around; for example, if they were both bytes, after 0xff comes 0x00. The problem case is if x wraps around, while y does not. Now x appears to be smaller than y. Luckily, x will not wrap around twice (only once is guaranteed). Assuming bytes, x has wrapped and is now 0x2, whereas y has not and is 0xFE. The right answer of x - y is supposed to be 0x4.
Maybe,
( x > y) ? (x-y) : (x+0xff-y);
But I think there is another way, something involving 2's complement? Also, in this embedded system x and y are of the largest unsigned int type, so adding 0xff... is not possible.
What is the best way to write the statement (target language is C)?

Assuming two unsigned integers:
If you know that one is supposed to be "larger" than the other, just subtract. It will work provided you haven't wrapped around more than once (obviously, if you have, you won't be able to tell).
If you don't know that one is larger than the other, subtract and cast the result to a signed int of the same width. It will work provided the difference between the two is in the range of the signed int (if not, you won't be able to tell).
To clarify: the scenario described by the original poster seems to be confusing people, but is typical of monotonically increasing fixed-width counters, such as hardware tick counters, or sequence numbers in protocols. The counter goes (e.g. for 8 bits) 0xfc, 0xfd, 0xfe, 0xff, 0x00, 0x01, 0x02, 0x03 etc., and you know that of the two values x and y that you have, x comes later. If x==0x02 and y==0xfe, the calculation x-y (as an 8-bit result) will give the correct answer of 4, assuming that subtraction of two n-bit values wraps modulo 2^n, which C99 guarantees for subtraction of unsigned values. (Note: the C standard does not guarantee this behaviour for subtraction of signed values.)
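One detail matters when trying this with types narrower than int: the operands are promoted to int before the subtraction, so the "8-bit result" only appears once the difference is converted back to uint8_t. A minimal sketch of both cases from this answer, with hypothetical function names:

```c
#include <stdint.h>

/* Number of ticks from `earlier` to `later` on a wrapping 8-bit counter. */
uint8_t ticks_between(uint8_t later, uint8_t earlier)
{
    /* later - earlier is computed as int; the cast reduces it modulo 256. */
    return (uint8_t)(later - earlier);
}

/* Signed difference a - b when it is not known which value is later.
 * Only meaningful while the true difference lies in -128 .. 127. */
int signed_difference(uint8_t a, uint8_t b)
{
    uint8_t d = (uint8_t)(a - b);
    /* Map 128..255 to -128..-1 by hand instead of casting to int8_t. */
    return (d < 128) ? (int)d : (int)d - 256;
}
```

With x = 0x02 and y = 0xfe, ticks_between(x, y) is 4, while signed_difference(y, x) is -4.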

Here's a little more detail of why it 'just works' when you subtract the 'smaller' from the 'larger'.
A couple of things going into this…
1. In hardware, subtraction uses addition: The appropriate operand is simply negated before being added.
2. In two’s complement (which pretty much everything uses), an integer is negated by inverting all the bits then adding 1.
Hardware does this more efficiently than it sounds from the above description, but that’s the basic algorithm for subtraction (even when values are unsigned).
So, let's figure 2 – 250 using 8-bit unsigned integers. In binary we have
0 0 0 0 0 0 1 0
- 1 1 1 1 1 0 1 0
We negate the operand being subtracted and then add. Recall that to negate we invert all the bits then add 1. After inverting the bits of the second operand we have
0 0 0 0 0 1 0 1
Then after adding 1 we have
0 0 0 0 0 1 1 0
Now we perform addition...
0 0 0 0 0 0 1 0
+ 0 0 0 0 0 1 1 0
= 0 0 0 0 1 0 0 0 = 8, which is the result we wanted from 2 - 250
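The same invert, add 1, then add sequence can be written out in C to check the worked example. This is only an illustration of the arithmetic (the compiler simply emits a subtract instruction); the function name is hypothetical:

```c
#include <stdint.h>

/* Computes a - b on 8 bits by negating b (invert, add 1) and adding. */
uint8_t sub_via_negate(uint8_t a, uint8_t b)
{
    uint8_t negated = (uint8_t)(~b + 1u);  /* two's complement negation, low 8 bits */
    return (uint8_t)(a + negated);         /* the carry out of bit 7 is discarded */
}
```

sub_via_negate(2, 250) returns 8, the same value as (uint8_t)(2 - 250).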

Maybe I don't understand, but what's wrong with:
unsigned r = x - y;

The question, as stated, is confusing. You said that you are subtracting unsigned values. If x is always larger than y, as you said, then x - y cannot possibly wrap around or overflow. So you just do x - y (if that's what you need) and that's it.

This is an efficient way to determine the amount of free space in a circular buffer or to do sliding-window flow control.
Use unsigned ints for head and tail, increment them and let them wrap!
The buffer length has to be a power of 2.
free = ((head - tail) & size_mask), where size_mask is 2^n - 1 for a buffer or window of size 2^n.
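A small ring buffer built on this idea might look like the sketch below. It assumes one particular convention, which the answer does not spell out: head counts items written and tail counts items read, both free-running, so head - tail is the number of occupied slots and the free space is the size minus that. With the opposite naming the roles swap. All names and the size of 8 are hypothetical:

```c
#include <stdint.h>

#define RING_SIZE 8u                 /* hypothetical capacity; must be a power of two */
#define RING_MASK (RING_SIZE - 1u)

typedef struct {
    uint32_t head;               /* total items written; wraps modulo 2^32 */
    uint32_t tail;               /* total items read; wraps modulo 2^32 */
    uint8_t  slots[RING_SIZE];
} ring_t;

/* Unsigned subtraction stays correct even after head has wrapped past tail. */
uint32_t ring_used(const ring_t *r) { return r->head - r->tail; }
uint32_t ring_free(const ring_t *r) { return RING_SIZE - ring_used(r); }

/* Returns 1 on success, 0 if the buffer is full. */
int ring_put(ring_t *r, uint8_t value)
{
    if (ring_free(r) == 0) return 0;
    r->slots[r->head & RING_MASK] = value;  /* the mask turns the counter into an index */
    r->head++;
    return 1;
}

/* Returns 1 on success, 0 if the buffer is empty. */
int ring_get(ring_t *r, uint8_t *out)
{
    if (ring_used(r) == 0) return 0;
    *out = r->slots[r->tail & RING_MASK];
    r->tail++;
    return 1;
}
```

Starting both counters just below the wrap point (for example at 0xFFFFFFFE) is a quick way to confirm that the counts stay correct across the wraparound.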

Just to put the already correct answer into code:
If you know that x is the smaller value, the following calculation just works:
#include <stdint.h>
#include <stdio.h>
int main()
{
uint8_t x = 0xff;
uint8_t y = x + 20; // wraps around to 19
uint8_t res = y - x;
printf("Expect 20: %d\n", res); // res is 20
return 0;
}
If you do not know which one is smaller:
#include <stdint.h>
#include <stdio.h>
int main()
{
uint8_t x = 0xff;
uint8_t y = x + 20;
int8_t res1 = (int8_t)(x - y); // cast the difference, not just one operand
int8_t res2 = (int8_t)(y - x);
printf("Expect -20 and 20: %d and %d\n", res1, res2);
return 0;
}
Where the difference must be inside the range of int8_t in this case.
The code experiment helped me to understand the solution better.

The problem should be stated as follows:
Let's assume the position (angle) of two hands a and b of a clock is given by a uint8_t each. The whole circumference is divided into the 256 values of a uint8_t. How can the smaller distance between the two hands be calculated efficiently?
A solution is:
uint8_t smaller_distance = abs( (int8_t)( a - b ) );
I suspect there is nothing more efficient, as otherwise there would be something more efficient than abs().
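A variation on the one-liner above, not necessarily faster, avoids both the conversion to int8_t (whose result for values above 127 is implementation-defined) and abs(): compute the distance in both directions modulo 256 and take the smaller one. The function name is hypothetical:

```c
#include <stdint.h>

/* Shorter of the two arcs between positions a and b on a 256-step circle. */
uint8_t shorter_arc(uint8_t a, uint8_t b)
{
    uint8_t forward  = (uint8_t)(a - b);  /* steps from b up to a, modulo 256 */
    uint8_t backward = (uint8_t)(b - a);  /* steps from a up to b, modulo 256 */
    return (forward < backward) ? forward : backward;
}
```

For a = 250 and b = 5 the result is 11, and for opposite positions such as 0 and 128 it is 128.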

To echo everyone else replying, if you just subtract the two and interpret the result as unsigned you'll be fine.
Unless you have an explicit counterexample.
Your example of x = 0x2, y= 0x14 would not result in 0x4, it would result in 0xEE, unless you have more constraints on the math that are unstated.

Yet another answer, and hopefully easy to understand:
SUMMARY:
It's assumed the OP's x and y are assigned values from a counter, e.g., from a timer.
(x - y) will always give the value desired, even if the counter wraps.
This assumes the counter is incremented less than 2^N times between y and x,
for N-bit unsigned int's.
DESCRIPTION:
A counter variable is unsigned and it can wrap around.
A uint8 counter would have values:
0, 1, 2, ..., 255, 0, 1, 2, ..., 255, ...
The number of counter tics between two points can be calculated as shown below.
This assumes the counter is incremented less than 256 times, between y and x.
uint8 x, y, counter, counterTics;
<initialize the counter>
<do stuff while the counter increments>
y = counter;
<do stuff while the counter increments>
x = counter;
counterTics = x - y;
EXPLANATION:
For uint8, when the number of counter tics from y to x is less than 256 (i.e., less than 2^8):
If (x >= y) then: the counter did not wrap, counterTics == x - y
If (x < y) then: the counter wrapped, counterTics == (256-y) + x
(256-y) is the number of tics before wrapping.
x is the number of tics after wrapping.
Note: if those calculations are made in the order shown, no negative numbers are involved.
This equation holds for both cases: counterTics == (256+x-y) mod 256
For uintN, where N is the number of bits:
counterTics == ((2^N)+x-y) mod (2^N)
The last equation also describes the result in C when subtracting unsigned int's, in general.
This is not to say the compiler or processor uses that equation when subtracting unsigned int's.
RATIONALE:
The explanation is consistent with what is described in this ACM paper:
"Understanding Integer Overflow in C/C++", by Dietz, et al.
HARDWARE INTEGER ARITHMETIC
When an n-bit addition or subtraction operation on unsigned or two’s complement integers overflows, the result “wraps around,” effectively subtracting 2^n from, or adding 2^n to, the true mathematical result. Equivalently, the result can be considered to occupy n+1 bits; the lower n bits are placed into the result register and the highest-order bit is placed into the processor’s carry flag.
INTEGER ARITHMETIC IN C AND C++
3.3. Unsigned Overflow
A computation involving unsigned operands can never overflow, because a result that cannot be represented by the resulting unsigned integer type is reduced modulo the number that is one greater than the largest value that can be represented by the resulting type.
Thus, the semantics for unsigned overflow in C/C++ are precisely the same as the semantics of processor-level unsigned overflow as described in Section 2. As shown in Table I, UINT_MAX+1 must evaluate to zero in a conforming C and C++ implementation.
Also, it's easy to write a C program to test that the cases shown work as described.
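One possible shape for such a test, as a sketch: loop over every pair of 8-bit values and compare the wrapped C subtraction against the formula (256 + x - y) mod 256. The function name is hypothetical:

```c
#include <stdint.h>

/* Counts the (x, y) pairs for which the wrapped 8-bit subtraction
 * differs from (256 + x - y) mod 256. */
unsigned int count_mismatches(void)
{
    unsigned int mismatches = 0;
    for (unsigned int x = 0; x < 256; ++x) {
        for (unsigned int y = 0; y < 256; ++y) {
            /* What C gives for uint8_t operands once stored back in a uint8_t. */
            uint8_t wrapped = (uint8_t)((uint8_t)x - (uint8_t)y);
            /* The closed-form expression from the explanation above. */
            unsigned int formula = (256u + x - y) % 256u;
            if (wrapped != formula) {
                ++mismatches;
            }
        }
    }
    return mismatches;
}
```

The function returns 0, meaning the two agree for all 65536 pairs.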
