How to subtract two unsigned ints with wrap around or overflow - c

There are two unsigned ints (x and y) that need to be subtracted. x is always larger than y. However, both x and y can wrap around; for example, if they were both bytes, after 0xff comes 0x00. The problem case is if x wraps around, while y does not. Now x appears to be smaller than y. Luckily, x will not wrap around twice (only once is guaranteed). Assuming bytes, x has wrapped and is now 0x2, whereas y has not and is 0xFE. The right answer of x - y is supposed to be 0x4.
Maybe,
(x > y) ? (x - y) : (x + 0xff - y);
But I think there is another way, something involving two's complement? And in this embedded system, x and y are already of the largest unsigned int type, so adding 0xff... is not possible.
What is the best way to write the statement (target language is C)?

Assuming two unsigned integers:
If you know that one is supposed to be "larger" than the other, just subtract. It will work provided you haven't wrapped around more than once (obviously, if you have, you won't be able to tell).
If you don't know that one is larger than the other, subtract and cast the result to a signed int of the same width. It will work provided the difference between the two is in the range of the signed int (if not, you won't be able to tell).
To clarify: the scenario described by the original poster seems to be confusing people, but is typical of monotonically increasing fixed-width counters, such as hardware tick counters, or sequence numbers in protocols. The counter goes (e.g. for 8 bits) 0xfc, 0xfd, 0xfe, 0xff, 0x00, 0x01, 0x02, 0x03 etc., and you know that of the two values x and y that you have, x comes later. If x==0x02 and y==0xfe, the calculation x-y (as an 8-bit result) will give the correct answer of 4, assuming that subtraction of two n-bit values wraps modulo 2^n - which C99 guarantees for subtraction of unsigned values. (Note: the C standard does not guarantee this behaviour for subtraction of signed values.)

Here's a little more detail of why it 'just works' when you subtract the 'smaller' from the 'larger'.
A couple of things going into this…
1. In hardware, subtraction uses addition: The appropriate operand is simply negated before being added.
2. In two’s complement (which pretty much everything uses), an integer is negated by inverting all the bits then adding 1.
Hardware does this more efficiently than it sounds from the above description, but that’s the basic algorithm for subtraction (even when values are unsigned).
So, let's figure 2 - 250 using 8-bit unsigned integers. In binary we have
0 0 0 0 0 0 1 0
- 1 1 1 1 1 0 1 0
We negate the operand being subtracted and then add. Recall that to negate we invert all the bits then add 1. After inverting the bits of the second operand we have
0 0 0 0 0 1 0 1
Then after adding 1 we have
0 0 0 0 0 1 1 0
Now we perform addition...
0 0 0 0 0 0 1 0
+ 0 0 0 0 0 1 1 0
= 0 0 0 0 1 0 0 0 = 8, which is the result we wanted from 2 - 250

Maybe I don't understand, but what's wrong with:
unsigned r = x - y;

The question, as stated, is confusing. You said that you are subtracting unsigned values. If x is always larger than y, as you said, then x - y cannot possibly wrap around or overflow. So you just do x - y (if that's what you need) and that's it.

This is an efficient way to determine the amount of free space in a circular buffer or do sliding window flow control.
Use unsigned ints for head and tail - increment them and let them wrap!
Buffer length has to be a power of 2.
free = (head - tail) & size_mask, where size_mask is 2^n - 1 and 2^n is the buffer or window size.

Just to put the already correct answer into code:
If you know that x is the smaller value, the following calculation just works:
#include <stdio.h>
#include <stdint.h>

int main(void)
{
    uint8_t x = 0xff;
    uint8_t y = x + 20;
    uint8_t res = y - x;
    printf("Expect 20: %d\n", res); // res is 20
    return 0;
}
If you do not know which one is smaller:
#include <stdio.h>
#include <stdint.h>

int main(void)
{
    uint8_t x = 0xff;
    uint8_t y = x + 20;
    int8_t res1 = (int8_t)(x - y);
    int8_t res2 = (int8_t)(y - x);
    printf("Expect -20 and 20: %d and %d\n", res1, res2);
    return 0;
}
This works provided the difference is within the range of int8_t in this case.
The code experiment helped me to understand the solution better.

The problem should be stated as follows:
Let's assume the positions (angles) of two pointers a and b of a clock are given as uint8_t values. The whole circumference is divided into the 256 values of a uint8_t. How can the smaller distance between the two pointers be calculated efficiently?
A solution is:
uint8_t smaller_distance = abs( (int8_t)( a - b ) );
I suspect there is nothing more efficient, as otherwise there would have to be something more efficient than abs().

To echo everyone else replying, if you just subtract the two and interpret the result as unsigned you'll be fine.
Unless you have an explicit counterexample.
Your example of x = 0x2, y = 0x14 would not result in 0x4; it would result in 0xEE, unless you have more constraints on the math that are unstated.

Yet another answer, and hopefully easy to understand:
SUMMARY:
It's assumed the OP's x and y are assigned values from a counter, e.g., from a timer.
(x - y) will always give the value desired, even if the counter wraps.
This assumes the counter is incremented less than 2^N times between y and x,
for N-bit unsigned int's.
DESCRIPTION:
A counter variable is unsigned and it can wrap around.
A uint8 counter would have values:
0, 1, 2, ..., 255, 0, 1, 2, ..., 255, ...
The number of counter tics between two points can be calculated as shown below.
This assumes the counter is incremented less than 256 times, between y and x.
uint8 x, y, counter, counterTics;
<initialize the counter>
<do stuff while the counter increments>
y = counter;
<do stuff while the counter increments>
x = counter;
counterTics = x - y;
EXPLANATION:
For uint8, and the counter-tics from y to x is less than 256 (i.e., less than 2^8):
If (x >= y) then: the counter did not wrap, counterTics == x - y
If (x < y) then: the counter wrapped, counterTics == (256-y) + x
(256-y) is the number of tics before wrapping.
x is the number of tics after wrapping.
Note: if those calculations are made in the order shown, no negative numbers are involved.
This equation holds for both cases: counterTics == (256+x-y) mod 256
For uintN, where N is the number of bits:
counterTics == ((2^N)+x-y) mod (2^N)
The last equation also describes the result in C when subtracting unsigned int's, in general.
This is not to say the compiler or processor uses that equation when subtracting unsigned int's.
RATIONALE:
The explanation is consistent with what is described in this ACM paper:
"Understanding Integer Overflow in C/C++", by Dietz, et al.
HARDWARE INTEGER ARITHMETIC
When an n-bit addition or subtraction operation on unsigned or two’s complement integers overflows, the result “wraps around,” effectively subtracting 2^n from, or adding 2^n to, the true mathematical result. Equivalently, the result can be considered to occupy n+1 bits; the lower n bits are placed into the result register and the highest-order bit is placed into the processor’s carry flag.
INTEGER ARITHMETIC IN C AND C++
3.3. Unsigned Overflow
A computation involving unsigned operands can never overflow, because a result that cannot be represented by the resulting unsigned integer type is reduced modulo the number that is one greater than the largest value that can be represented by the resulting type.
Thus, the semantics for unsigned overflow in C/C++ are precisely the same as the semantics of processor-level unsigned overflow as described in Section 2. As shown in Table I, UINT_MAX + 1 must evaluate to zero in a conforming C and C++ implementation.
Also, it's easy to write a C program to test that the cases shown work as described.


SWAR byte counting methods from 'Bit Twiddling Hacks' - why do they work?

Bit Twiddling Hacks contains the following macros, which count the number of bytes in a word x that are less than, or greater than, n:
#define countless(x,n) \
(((~0UL/255*(127+(n))-((x)&~0UL/255*127))&~(x)&~0UL/255*128)/128%255)
#define countmore(x,n) \
(((((x)&~0UL/255*127)+~0UL/255*(127-(n))|(x))&~0UL/255*128)/128%255)
However, it doesn't explain why they work. What's the logic behind these macros?
Let's try for intuition on countmore.
First, ~0UL/255*(127-n) is a clever way of copying the value 127-n to all bytes in the word in parallel. Why does it work? ~0 is 255 in all bytes. Consequently, ~0/255 is 1 in all bytes. Multiplying by (127-n) does the "copying" mentioned at the outset.
The term ~0UL/255*127 is just a special case of the above where n is zero. It copies 127 into all bytes. That's 0x7f7f7f7f if words are 4 bytes. "Anding" with x zeros out the high order bit in each byte.
That's the first term, ((x)&~0UL/255*127). The result is the same as x except the high bit in each byte is zeroed.
The second term ~0UL/255*(127-(n)) is as above: 127-n copied to each byte.
For any given byte x[i], adding the two terms gives us 127-n+x[i] if x[i]<=127. This quantity will have the high order bit set whenever x[i]>n. It's easiest to see this as adding two 7-bit unsigned numbers. The result "overflows" into the 8th bit because the result is 128 or more.
So it looks like the algorithm is going to use the 8th bit of each byte as a boolean indicating x[i]>n.
So what about the other case, x[i]>127? Here we know the byte is more than n because the algorithm stipulates n<=127. The 8th bit ought to be always 1. Happily, the sum's 8th bit doesn't matter because the next step "or"s the result with x. Since x[i] has the 8th bit set to 1 if and only if it's 128 or larger, this operation "forces" the 8th bit to 1 just when the sum might provide a bad value.
To summarize so far, the "or" result has the 8th bit set to 1 in its i'th byte if and only if x[i]>n. Nice.
The next operation &~0UL/255*128 sets everything to zero except all those 8th bits of interest. It's "anding" with 0x80808080...
Now the task is to find the number of these bits set to 1. For this, countmore uses some basic number theory. First it shifts right 7 bits so the bits of interest are b0, b8, b16... The value of this word is
b0 + b8*2^8 + b16*2^16 + ...
A beautiful fact is that 1 == 2^8 == 2^16 == ... mod 255. In other words, each 1 bit is 1 mod 255. It follows that finding mod 255 of the shifted result is the same as summing b0+b8+b16+...
Yikes. We're done.
Let's analyse countless macro. We can simplify this macro as following code:
#define A(n) (0x0101010101010101UL * (0x7F+n))
#define B(x) (x & 0x7F7F7F7F7F7F7F7FUL)
#define C(x,n) (A(n) - B(x))
#define countless(x,n) (( C(x,n) & ~x & 0x8080808080808080UL) / 0x80 % 0xFF )
A(n) will be:
A(0) = 0x7F7F7F7F7F7F7F7F
A(1) = 0x8080808080808080
A(2) = 0x8181818181818181
A(3) = 0x8282828282828282
....
And for B(x), each byte of x will mask with 0x7F.
If we suppose x = 0xb0b1b2b3b4b5b6b7 (where each bi is one byte) and n = 0, then C(x,n) will equal 0x(7F-b0)(7F-b1)(7F-b2)..., i.e. each byte becomes 0x7F minus the corresponding (masked) byte of x.
For example, We suppose x = 0x1234567811335577 and n = 0x50. So:
A(0x50) = 0xCFCFCFCFCFCFCFCF
B(0x1234567811335577) = 0x1234567811335577
C(0x1234567811335577, 0x50) = 0xBD9B7957BE9C7A58
~(0x1234567811335577) = 0xEDCBA987EECCAA88
0xEDCBA987EECCAA88 & 0x8080808080808080UL = 0x8080808080808080
C(0x1234567811335577, 0x50) & 0x8080808080808080 = 0x8080000080800000
(0x8080000080800000 / 0x80) % 0xFF = 4 // counts the 0x80 bytes: four bytes of x are less than 0x50

Overflow and underflow in unsigned integers

Suppose, I'm trying to subtract 2 unsigned integers:
247 = 1111 0111
135 = 1000 0111
If we subtract these 2 binary numbers we get = 0111 0000
Is this an underflow, since we only need 7 bits now? Or how does that work?
Underflow in unsigned subtraction c = a - b occurs whenever b is larger than a.
However, that's somewhat of a circular definition, because the way many machines perform the a < b comparison is by subtracting the operands using wraparound arithmetic, and then detecting the overflow based on the two operands and the result.
Note also that in C we don't speak about "overflow", because there is no error condition: C unsigned integers provide that wraparound arithmetic that is commonly found in hardware.
So, given that we have the wraparound arithmetic, we can detect whether wraparound (or overflow, depending on point of view) has taken place in a subtraction.
What we need is the most significant bits from a, b and c. Let's call them A, B and C. From these, the overflow V is calculated like this:
A B C | V
------+--
0 0 0 | 0
0 0 1 | 1
0 1 0 | 1
0 1 1 | 1
1 0 0 | 0
1 0 1 | 0
1 1 0 | 0
1 1 1 | 1
This simplifies to
A'B + A'C + BC
In other words, overflow in the unsigned subtraction c = a - b happens whenever:
the msb of a is 0 and that of b is 1;
or the msb of a is 0 and that of c is 1;
or the msb of b is 1 and that of c is also 1.
Subtracting 247 - 135 = 112 is clearly not overflow, since 247 is larger than 135. Applying the rules above, A = 1, B = 1 and C = 0. The 1 1 0 row of the table has a 0 in the V column: no overflow.
Generally, “underflow” means the ideal mathematical result of a calculation is below what the type can represent. If 7 is subtracted from 5 in unsigned arithmetic, the ideal mathematical result would be −2, but an unsigned type cannot represent −2, so the operation underflows. Or, in an eight-bit signed type that can represent numbers from −128 to +127, subtracting 100 from −100 would ideally produce −200, but this cannot be represented in the type, so the operation underflows.
In C, unsigned arithmetic is said not to underflow or overflow because the C standard defines the operations to be performed using modulo arithmetic instead of real-number arithmetic. For example, with 32-bit unsigned arithmetic, subtracting 7 from 5 would produce 4,294,967,294 (in hexadecimal, 0xFFFFFFFE), because it has wrapped modulo 2^32 = 4,294,967,296. People may nonetheless use the terms “underflow” or “overflow” when discussing these operations, intended to refer to the mathematical issues rather than the defined C behavior.
In other words, for whatever type you are using for arithmetic there is some lower limit L and some upper limit U that the type can represent. If the ideal mathematical result of an operation is less than L, the operation underflows. If the ideal mathematical result of an operation is greater than U, the operation overflows. “Underflow” and “overflow” mean the operation has gone out of the bounds of the type. “Overflow” may also be used to refer to any exceeding of the bounds of the type, including in the low direction.
It does not mean that fewer bits are needed to represent the result. When 10000111₂ is subtracted from 11110111₂, the result, 01110000₂ = 1110000₂, is within bounds, so there is no overflow or underflow. The fact that it needs fewer bits to represent is irrelevant.
(Note: For integer arithmetic, “underflow” or “overflow” is defined relative to the absolute bounds L and U. For floating-point arithmetic, these terms have somewhat different meanings. They may be defined relative to the magnitude of the result, neglecting the sign, and they are defined relative to the finite non-zero range of the format. A floating-point format may be able to represent 0, then various finite non-zero numbers, then infinity. Certain results between 0 and the smallest non-zero number the format can represent are said to underflow even though they are technically inside the range of representable numbers, which is from 0 to infinity in magnitude. Similarly, certain results above the greatest representable finite number are said to overflow even though they are inside the representable range, since they are less than infinity.)
Long story short, this is what happens when you have:
unsigned char n = 255; /* highest possible value for an unsigned char */
n = n + 1; /* now n is "overflowing" (although the terminology is not correct) to 0 */
printf("overflow: 255 + 1 = %u\n", n);
n = n - 1; /* n will now "underflow" from 0 to 255; */
printf("underflow: 0 - 1 = %u\n", n);
n *= 2; /* n will now be (255 * 2) % 256 = 254 */
/* when the result is too high, modulo with 2 to the power of 8 is used */
/* for an 8-bit variable such as unsigned char */
printf("large overflow: 255 * 2 = %u\n", n);
n = n * (-2) + 100; /* n should now be -408 which is 104 in terms of unsigned char. */
/* (Logic is this: 408 % 256 = 152; 256 - 152 = 104) */
printf("large underflow: 255 * 2 = %u\n", n);
The result of that is (compiled with gcc 11.1, flags -Wall -Wextra -std=c99):
overflow: 255 + 1 = 0
underflow: 0 - 1 = 255
large overflow: 255 * 2 = 254
large underflow: 255 * 2 = 104
Now the scientific version: The comments above represent just a mathematical model of what is going on. To better understand what is actually happening, the following rules apply:
Integer promotion:
Integer types smaller than int are promoted when an operation is
performed on them. If all values of the original type can be
represented as an int, the value of the smaller type is converted to
an int; otherwise, it is converted to an unsigned int.
So what actually happens in memory when the computer does n = 255; n = n + 1; is this:
First, the right side is evaluated as an int (signed), because the result fits in a signed int according to the rule of integer promotion. So the right side of the expression becomes in binary: 0b00000000000000000000000011111111 + 0b00000000000000000000000000000001 = 0b00000000000000000000000100000000 (a 32 bit int).
Truncation
The 32-bit int loses the most significant 24 bits when being assigned
back to an 8-bit number.
So, when 0b00000000000000000000000100000000 is assigned to variable n, which is an unsigned char, the 32-bit value is truncated to an 8-bit value (only the right-most 8 bits are copied) => n becomes 0b00000000.
The same thing happens for each operation: the expression on the right side evaluates to a signed int, then it is truncated to 8 bits.

Simple program convert int16_t array to uint16_t

I have used the WinFilter program to compute the FIR filter in C code, but I ran into an issue:
The program only provides a 16-bit signed array, and I need this vector to be unsigned. So I am looking for a simple solution to relocate the array values into the unsigned range.
int16_t FIRCoef[Ntap] = {
    -1029, -1560, -1188,     0,  1405,  2186,  1718,     0,
    -2210, -3647, -3095,     0,  5160, 10947, 15482, 17197,
    15482, 10947,  5160,     0, -3095, -3647, -2210,     0,
     1718,  2186,  1405,     0, -1188, -1560, -1029,     0
};
uint16_t fir(uint16_t NewSample) {
    static uint16_t x[Ntap]; // input samples
    uint32_t y = 0;          // output sample
    int n;
    // shift the old samples
    for (n = Ntap - 1; n > 0; n--)
        x[n] = x[n-1];
    // calculate the new output
    x[0] = NewSample;
    for (n = 0; n < Ntap; n++)
        y += FIRCoef[n] * x[n]; // calculation of the convolution in the sample
    return y / DCgain;
}
I think that one solution should be like this:
uint16_t    int16_t    index
      0     -32767         1
      1     -32766         2
      2     -32765         3
    ...        ...       ...
  65535      32767     65535
any hint?
The value range for an int16_t is -32768 to 32767. Your question is unclear on this point, but it seems like you want simply to shift these values into the range of a uint16_t, 0 to 65535. That's reasonable, as the number of representable values for the two types is the same; it would be accomplished by adding the inverse of the minimum possible value of an int16_t to the input.
Of course, the Devil is in the details. When signed addition overflows, undefined behavior results. When an out-of-range value is converted to a signed integer type, the result is implementation-defined, or an implementation-defined signal may be raised. It is desirable to avoid implementation-defined behavior and essential to avoid undefined behavior; that can be done in this case with just a little care:
uint16_t convert(int16_t in) {
    return (uint16_t) 32768 + (uint16_t) in;
}
That reliably does the right thing on any conforming system that provides the uint16_t and int16_t types in the first place, because the conversion and the addition operate modulo one plus the maximum value of a uint16_t. The negative input values are converted to unsigned values in the upper half of the range of uint16_t, and the addition then rotates all the values, bringing those from the upper half of the range around to the lower half.
As for doing it to a whole array, if you want to rely only on well-defined C behavior (i.e. if you want a solution that strictly conforms to the standard) then you'll need to make a copy of the data. You could use a function such as the above to populate the copy.

Unary negation of unsigned integer 4

If x is an unsigned int type is there a difference in these statements:
return (x & 7);
and
return (-x & 7);
I understand negating an unsigned value gives a value of max_int - value. But is there a difference in the return value (i.e. true/false) among the above two statements under any specific boundary conditions OR are they both same functionally?
Test code:
#include <stdio.h>
static unsigned neg7(unsigned x) { return -x & 7; }
static unsigned pos7(unsigned x) { return +x & 7; }
int main(void)
{
    for (unsigned i = 0; i < 8; i++)
        printf("%u: pos %u; neg %u\n", i, pos7(i), neg7(i));
    return 0;
}
Test results:
0: pos 0; neg 0
1: pos 1; neg 7
2: pos 2; neg 6
3: pos 3; neg 5
4: pos 4; neg 4
5: pos 5; neg 3
6: pos 6; neg 2
7: pos 7; neg 1
For the specific case of 4 (and also 0), there isn't a difference; for other values, there is a difference. You can extend the range of the input, but the outputs will produce the same pattern.
If you ask specifically for true/false (i.e. is zero / not zero) and two's complement then there is indeed no difference. (You do however return not just a simple truth value but allow different bit patterns for true. As long as the caller does not distinguish, that is fine.)
Consider how a two's complement negation is formed: invert the bits then increment. Since you take only the least significant bits, there will be no carry in for the increment. This is a necessity, so you can't do this with anything but a range of least significant bits.
Let's look at the two cases:
First, if the three low bits are zero (for a false equivalent). Inverting gives all ones, incrementing turns them to zero again. The fourth and more significant bits might be different, but they don't influence the least significant bits and they don't influence the result since they are masked out. So this stays.
Second, if the three low bits are not all zero (for a true equivalent). The only way this can change into false is when the increment operation leaves them at zero, which can only happen if they were all ones before, which in turn could only happen if they were all zeros before the inversion. That can't be, since that is the first case. Again, the more significant bits don't influence the three low bits and they are masked out. So the result does not change.
But again, this only works when the caller considers only the truth value (all bits zero / not all bits zero) and when the mask allows a range of bits starting from the least significant without a gap.
Firstly, negating an unsigned int value produces UINT_MAX - original_value + 1. (For example, 0 remains 0 under negation). The alternative way to describe negation is full inversion of all bits followed by increment.
It is not clear why you'd even ask this question, since it is obvious that basically the very first example that comes to mind — an unsigned int value 1 — already produces different results in your expression. 1u & 7 is 1, while -1u & 7 is 7. Did you mean something else, by any chance?

Overflow of 32 bit variable

Currently I am implementing an equation (2^A)[X + Y*(2^B)] in one of my applications.
The issue is overflow of a 32-bit value, and I cannot use a 64-bit data type.
For example, when B = 3 and Y = 805306367, Y*(2^B) overflows the 32-bit range, but when X = -2147483648 the result comes back into the 32-bit range.
So I want to store the result of Y*(2^B). Can anyone suggest a solution? A and B have values from -15 to 15, and X and Y can have values from -2147483648 to 2147483647.
Output can range from 0...4294967295.
If the number is too big for a 32 bit variable, then you either use more bits (either by storing in a bigger variable, or using multiple variables) or you give up precision and store it in a float. Since Y can be MAX_INT, by definition you can't multiply it by a number greater than 1 and still have it fit in a 32 bit int.
I'd use a loop instead of multiplication in this case. Something like this:
int newX = X;
int poweredB = 1 << B; // 2^B
for (int i = 0; i < poweredB; ++i)
{
    newX += Y; // or change X directly, if you will not need it later
}
int result = (1 << A) * newX;
But note: this will only work in some situations - only if you have a guarantee that the result will not overflow. In your case, when Y is a large positive number and X a large negative one ("large" is admittedly subjective), this will definitely work. But if X and Y are both large positive numbers, there will be overflow again - and not only in that case, but in many others.
Based on the values for A and B in the assignment I suppose the expected solution would involve this:
the following are best done unsigned so store the signs for X and Y and operate on their absolute value
Store X and Y in two variables each, one holding the high 16 bits the other holding the low bits
something like
int hiY = Y & 0xFFFF0000;
int loY = Y & 0x0000FFFF;
Shift the high parts right so that all the variables have the high bits 0
Y*(2^B) is actually a shift of Y to the left by B bits. It is equivalent to shifting the high and low parts by B bits and, since you've shifted the high part, both operations will fit inside their 32 bit integer
Process X similarly in the high and low parts
Keeping track of the signs of X and Y calculate X + Y*(2^B) for the high and low parts
Again shift both the high and low results by A bits
Join the high and low parts into the final result
If you can't use 64-bits because your local C does not support them rather than some other overriding reason, you might consider The GNU Multiple Precision Arithmetic Library at http://gmplib.org/
