Bitwise operations and shifts - C

I'm having some trouble understanding how and why this code works the way it does. My partner in this assignment finished this part and I can't get ahold of him to find out how and why it works. I've tried a few different things to understand it, but any help would be much appreciated. This code is using 2's complement and a 32-bit representation.
/*
* fitsBits - return 1 if x can be represented as an
* n-bit, two's complement integer.
* 1 <= n <= 32
* Examples: fitsBits(5,3) = 0, fitsBits(-4,3) = 1
* Legal ops: ! ~ & ^ | + << >>
* Max ops: 15
* Rating: 2
*/
int fitsBits(int x, int n) {
    int r, c;
    c = 33 + ~n;
    r = !(((x << c) >> c) ^ x);
    return r;
}

c = 33 + ~n;
This calculates how many high-order bits remain after using n low-order bits.
((x << c) >> c)
This fills the high-order bits with the same value as the sign bit of x.
!(blah ^ x)
This is equivalent to
blah == x

On a 2's-complement platform, -n is equivalent to ~n + 1. For this reason, c = 33 + ~n on such a platform is actually equivalent to c = 32 - n. This c is intended to represent how many higher-order bits remain in a 32-bit int value if n lower bits are occupied.
Note two pieces of platform dependence present in this code: 2's-complement platform, 32-bit int type.
Then ((x << c) >> c) is intended to sign-fill those c higher-order bits. Sign-fill means that for values of x that have 0 in bit position n - 1, those higher-order bits have to be zeroed out, while for values of x that have 1 in bit position n - 1, they have to be filled with 1s. This is important to make the code work properly for negative values of x.
This introduces another two pieces of platform dependence: a << operator that behaves nicely when shifting negative values or when a 1 is shifted into the sign bit (formally this is undefined behavior), and a >> operator that performs sign-extension when shifting negative values (formally this is implementation-defined).
The rest is, as answered above, just a comparison with the original value of x: !(a ^ b) is equivalent to a == b. If the above transformations did not destroy the original value of x then x does indeed fit into n lower bits of 2's-complement representation.
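To make the sign-fill step concrete, here is a small trace harness (my addition; like the original code, it assumes a 32-bit two's-complement int, an arithmetic >>, and the platform-tolerated << of negative or overflowing values discussed above):

#include <stdio.h>

int main(void) {
    int n = 3, c = 33 + ~n;              /* c = 29 */
    printf("%d\n", (5 << c) >> c);       /* -3 != 5,  so fitsBits(5, 3)  == 0 */
    printf("%d\n", (-4 << c) >> c);      /* -4 == -4, so fitsBits(-4, 3) == 1 */
    return 0;
}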

Using the bitwise complement (unary ~) operator on a signed integer has implementation-defined and undefined aspects. In other words, this code isn't portable, even when you consider only two's complement implementations.
It is important to note that even two's complement representations in C may have trap representations. 6.2.6.2p2 even states this quite clearly:
If the sign bit is one, the value shall be modified in one of the following ways:
— the corresponding value with sign bit 0 is negated (sign and magnitude);
— the sign bit has the value −(2^M) (two's complement);
— the sign bit has the value −(2^M − 1) (ones' complement).
Which of these applies is implementation-defined, as is whether the value with sign bit 1 and all value bits zero (for the first two), or with sign bit and all value bits 1 (for ones' complement), is a trap representation or a normal value.
The emphasis is mine. Using trap representations is undefined behaviour.
There are actual implementations that reserve that value as a trap representation in the default mode. The notable one I tend to cite is the Unisys ClearPath Dorado on OS 2200 (go to 2-29). Do note the date on that document; such implementations aren't necessarily ancient (hence the reason I cite this one).
According to 6.5.7p4, shifting negative values left is undefined behaviour, too. I haven't done a whole lot of research into what behaviours are out there in reality, but I would reasonably expect that there might be implementations that sign-extend, as well as implementations that don't. This would also be one way of forming the trap representations mentioned above, which are undefined in nature and thus undesirable. Theoretically (or perhaps some time in the distant or not-so-distant future), you might also face signals "corresponding to a computational exception" (that's a C standard category similar to the one SIGSEGV falls into, covering things like "division by zero") or otherwise erratic and/or undesirable behaviours...
In conclusion, the only reason the code in the question works is by coincidence that the decisions your implementation made happen to align in the right way. If you use the implementation I've listed, you'll probably find that this code doesn't work as expected for some values.
Such heavy wizardry (as it has been described in comments) isn't really necessary, and doesn't really look that optimal to me. If you want something that doesn't rely upon magic (e.g. something portable) to solve this problem consider using this (actually, this code will work for at least 1 <= n <= 64):
#include <stdint.h>

int fits_bits(intmax_t x, unsigned int n) {
    uintmax_t min = 1ULL << (n - 1),
              max = min - 1;
    return (x < 0) * min + x <= max;
}
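For instance, a quick sanity check against the examples from the problem statement (my addition; compile it together with the definition above):

#include <assert.h>

int main(void) {
    assert(fits_bits(5, 3) == 0);    /* 5 needs 4 bits: 0101 */
    assert(fits_bits(-4, 3) == 1);   /* -4 is 100 in 3-bit two's complement */
    assert(fits_bits(-5, 3) == 0);   /* one below the 3-bit minimum */
    assert(fits_bits(3, 3) == 1);    /* the 3-bit maximum */
    return 0;
}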


C program to check odd or even without using modulus operator
We can check whether a number is even or odd without using the modulus or division operators in a C program.
Another method is using shift operators: if ((number >> 1) << 1 == number) then it is an even number. Can someone explain this?
A right shift by x is essentially a truncating division by 2^x.
A left shift by x is essentially a multiplication by 2^x.
6 ⇒ 3 ⇒ 6
7 ⇒ 3 ⇒ 6
If this produces the original number, it's even.
An unsigned number is odd if its lowest (1s) bit is 1. What this does is shift right by 1 and then left by 1, effectively zeroing out that lowest bit. If the bit was already 0 (i.e. the number is even), the number doesn't change and the == passes. If it was 1, it's now a 0 and the equality check fails.
A better, more obvious implementation of this logic would be:
if((number & 0x1) == 0)
which checks directly whether the bit is a 0 (i.e. the number is even)
We can check whether a number is even or odd without using the modulus or division operators in a C program. Another method is using shift operators: if ((number >> 1) << 1 == number) then it is an even number.
Do not use if ((number >> 1) << 1 == number), as it risks implementation-defined behavior when number < 0.
Only for the pedantic
This is likely incorrect on rare machines that use ones' complement int encoding.
Of course such beasts are rare these days, and the next version of C, C2x, is expected to drop support for non-2's-complement encodings.
Alternative
Instead code can use:
is_odd = number & 1LLU;
This will convert the various possible integer types of number to unsigned long long and then perform a simple mask of the least significant bit (the ones place). Even negative values, in any encoding, convert mathematically to an unsigned value with the same evenness.
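A minimal demonstration (my addition) that the mask agrees with mathematical parity even for negative inputs:

#include <stdio.h>

int main(void) {
    int tests[] = { 6, 7, 0, -1, -2, -2147483647 };
    for (size_t i = 0; i < sizeof tests / sizeof tests[0]; ++i) {
        unsigned long long is_odd = tests[i] & 1LLU;   /* conversion is mod 2^64 */
        printf("%d is %s\n", tests[i], is_odd ? "odd" : "even");
    }
    return 0;
}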
Modulus operator??
... using modulus and division operator ...
In C there is no operator defined as modulus. There is %, which results in the remainder.
See What's the difference between “mod” and “remainder”?.
Right-shifting by one shifts off the low bit, left-shifting back restores the value without that low bit.
All odd numbers end in a low bit of 1, which this removes, so the equality comparison returns true only for even numbers, where removing and re-adding the low 0 bit does not change the value.

Output of C program changes when optimisation is enabled

I am solving one of the lab exercises from the CS:APP course as a self-study.
In the CS:APP course the maximum positive number that can be represented with 4 bytes in two's complement is marked as Tmax (which is equal to 0x7fffffff).
Likewise, the most negative number is marked as Tmin (which is equal to 0x80000000).
The goal of the exercise was to implement an isTmax() function which should return 1 when given Tmax; otherwise it should return 0. This should be done only with a restricted set of operators, namely ! ~ & ^ | +, and a maximum of 10 operators.
Below you can see my implementation of isTmax() function, with comments explaining how it should work.
#include <stdio.h>

int isTmax(int x)
{
    /* OK, let's assume that x really is tMax.
     * This means that if we add 1 to it we get tMin; let's call it
     * possible_tmin. We can produce an actual tMin with a left shift.
     * We can now xor both tmins; let's call the result check.
     * If the inputs to xor are identical then check will be equal to
     * 0x00000000; if they are not identical then the result will be some
     * value different from 0x00000000.
     * As a final step we logically negate check to get the requested behaviour.
     */
    int possible_tmin = x + 1;
    int tmin = 1 << 31;
    int check = possible_tmin ^ tmin;
    int negated_check = !check;

    printf("input =\t\t 0x%08x\n", x);
    printf("possible_tmin =\t 0x%08x\n", possible_tmin);
    printf("tmin =\t\t 0x%08x\n", tmin);
    printf("check =\t\t 0x%08x\n", check);
    printf("negated_check =\t 0x%08x\n", negated_check);
    return negated_check;
}

int main()
{
    printf("output: %i", isTmax(0x7fffffff));
    return 0;
}
The problem I am facing is that I get different output depending on whether I set an optimization flag when compiling the program. I am using gcc 11.1.0.
With no optimizations I get this output, which is correct for the given input:
$ gcc main.c -lm -m32 -Wall && ./a.out
input = 0x7fffffff
possible_tmin = 0x80000000
tmin = 0x80000000
check = 0x00000000
negated_check = 0x00000001
output: 1
With optimization enabled I get this output, which is incorrect:
$ gcc main.c -lm -m32 -Wall -O1 && ./a.out
input = 0x7fffffff
possible_tmin = 0x80000000
tmin = 0x80000000
check = 0x00000000
negated_check = 0x00000000
output: 0
For some reason the logical negation is not applied to the check variable when optimization is enabled.
The problem persists with any other level of optimization (-O2, -O3, -Os).
Even if I write the expressions as a one-liner return !((x + 1) ^ (1 << 31)); nothing changes.
I can "force" correct behavior if I declare check as volatile.
I am using the same level of optimization as the automated checker that came with the exercise; if I turn it off, my code passes all checks.
Can anyone shed some light on why this is happening? Why doesn't the logical negation happen?
EDIT: I have added a section with the extra guidelines and restrictions connected to the exercise that I forgot to include in the original post. Specifically, I am not allowed to use any data type other than int. I am not sure if that also includes the literal suffix U.
Replace the "return" statement in each function with one
or more lines of C code that implements the function. Your code
must conform to the following style:
int Funct(arg1, arg2, ...) {
/* brief description of how your implementation works */
int var1 = Expr1;
...
int varM = ExprM;
varJ = ExprJ;
...
varN = ExprN;
return ExprR;
}
Each "Expr" is an expression using ONLY the following:
1. Integer constants 0 through 255 (0xFF), inclusive. You are
not allowed to use big constants such as 0xffffffff.
2. Function arguments and local variables (no global variables).
3. Unary integer operations ! ~
4. Binary integer operations & ^ | + << >>
Some of the problems restrict the set of allowed operators even further.
Each "Expr" may consist of multiple operators. You are not restricted to
one operator per line.
You are expressly forbidden to:
1. Use any control constructs such as if, do, while, for, switch, etc.
2. Define or use any macros.
3. Define any additional functions in this file.
4. Call any functions.
5. Use any other operations, such as &&, ||, -, or ?:
6. Use any form of casting.
7. Use any data type other than int. This implies that you
cannot use arrays, structs, or unions.
You may assume that your machine:
1. Uses 2s complement, 32-bit representations of integers.
2. Performs right shifts arithmetically.
3. Has unpredictable behavior when shifting an integer by more
than the word size.
The specific cause is most likely in 1 << 31. Nominally, this would produce 2^31, but 2^31 is not representable in a 32-bit int. In C 2018 6.5.7 4, where the C standard specifies the behavior of <<, it says the behavior in this case is not defined.
When optimization is disabled, the compiler may generate a processor instruction that shifts 1 left by 31 bits. This produces the bit pattern 0x80000000, and subsequent instructions interpret that as −2^31.
In contrast, with optimization enabled, the optimization software recognizes that 1 << 31 is not defined and does not generate a shift instruction for it. It may replace it with a compile-time value. Since the behavior is not defined by the C standard, the compiler is allowed to use any value for that. It might use zero, for example. (Since the entire behavior is not defined, not just the result, the compiler is actually allowed to replace this part of your program with anything. It could use entirely different instructions or just abort.)
You can start to fix that by using 1u << 31. That is defined because 2^31 fits in the unsigned int type. However, there is a problem when assigning that to tmin, because tmin is an int, and the value still does not fit in an int. However, for this conversion, the behavior is implementation-defined, not undefined. Common C implementations define the conversion to wrap modulo 2^32, which means that the assignment will store −2^31 in tmin. However, an alternative is to change tmin from int to unsigned int (which may also be written just as unsigned) and then work with unsigned integers. That will give fully defined behavior, rather than undefined or implementation-defined, except for assuming the int width is 32 bits.
Another problem is x + 1. When x is INT_MAX, that overflows. That is likely not the cause of the behavior you observe, as common compilers simply wrap the result. Nonetheless, it can be corrected similarly, by using x + 1u and changing the type of possible_tmin to unsigned.
That said, the desired result can be computed with return ! (x ^ ~0u >> 1);. This takes zero as an unsigned int, complements it to produce all 1 bits, and shifts it right one bit, which gives a single 0 bit followed by all 1 bits. That is the INT_MAX value, and it works regardless of the width of int. Then this is XORed with x. The result of that has all zero bits if and only if x is also INT_MAX. Then ! either changes that zero into 1 or changes a non-zero value into 0.
Change the type of the variables from int to unsigned int (or just unsigned), because some bitwise operations on signed values, such as this left shift, cause undefined behavior.
@Voo made a correct observation: x + 1 creates undefined behavior, which was not apparent at first, as the printf calls did not show anything weird happening.
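For completeness, here is a variant commonly seen for this lab that stays within the allowed operators and the int type (my sketch, not taken from the answers above; note that x + 1 still overflows for x == INT_MAX in strict ISO C terms, but the lab's stated assumptions of wrapping 32-bit two's-complement arithmetic make it behave as intended):

int isTmax(int x)
{
    /* For x == Tmax, x + 1 wraps to Tmin, and ~(x + 1) == ~Tmin == Tmax == x.
     * The identity ~(x + 1) == x also holds for x == -1 (x + 1 == 0, ~0 == -1),
     * so the factor !!(x + 1) rules that case out. */
    return !(~(x + 1) ^ x) & !!(x + 1);
}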

Understanding fixed-point arithmetic

I am struggling with how to implement arithmetic on fixed-point numbers of different precision. I have read the paper by R. Yates, but I'm still lost. In what follows, I use Yates's notation, in which A(n,m) designates a signed fixed-point format with n integer bits, m fraction bits, and n + m + 1 bits overall.
Short question: How exactly are A(a,b)*A(c,d) and A(a,b)+A(c,d) carried out when a != c and b != d?
Long question: In my FFT algorithm, I am generating a random signal (in) having values between -10V and 10V, which is scaled to A(15,16), and the twiddle factors (tw) are scaled to A(2,29). Both are stored as ints. Something like this:
float temp = (((float)rand() / (float)RAND_MAX) * (MAX_SIG - MIN_SIG)) + MIN_SIG;
in_seq[i][j] = (int)(roundf(temp * (1 << numFracBits)));
And similarly for the twiddle factors.
Now I need to perform
res = a*tw
Questions:
a) how do I implement this?
b) Should the size of res be 64 bit?
c) can I make res A(17,14), since I know the ranges of a and tw? If yes, should I be scaling a*tw by 2^14 to store the correct value in res?
a + res
Questions:
a) How do I add these two numbers of different Q formats?
b) if not, how do I do this operation?
Maybe it's easiest to make an example.
Suppose you want to add two numbers, one in the format A(3, 5), and the other in the format A(2, 10).
You can do it by converting both numbers to a "common" format - that is, they should have the same number of bits in the fractional part.
A conservative way of doing that is to choose the greater number of bits. That is, convert the first number to A(3, 10) by shifting it 5 bits left. Then, add the second number.
The result of an addition has the range of the greater format, plus 1 bit. In my example, if you add A(3, 10) and A(2, 10), the result has the format A(4, 10).
I call this the "conservative" way because you cannot lose information - it guarantees that the result is representable in the fixed-point format, without losing precision. However, in practice, you will want to use smaller formats for your calculation results. To do that, consider these ideas:
You can use the less-accurate format as your common representation. In my example, you can convert the second number to A(2, 5) by shifting the integer right by 5 bits. This will lose precision, and usually this precision loss is not problematic, because you are going to add a less-precise number to it anyway.
You can use 1 fewer bit for the integer part of the result. In applications, it often happens that the result cannot be too big. In this case, you can allocate 1 fewer bit to represent it. You might want to check if the result is too big, and clamp it to the needed range.
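As a concrete illustration of the conservative addition rule (my sketch; the formats A(3,5) and A(2,10) are the ones from the example above, and the alignment uses a multiplication rather than a left shift to avoid the formally undefined shift of negative values):

#include <stdint.h>

/* Add x in A(3,5) to y in A(2,10); the result is in A(4,10). */
int32_t add_a3_5_a2_10(int32_t x, int32_t y) {
    int32_t x_aligned = x * (1 << 5);   /* A(3,5) -> A(3,10): scale by 2^5 */
    return x_aligned + y;               /* A(3,10) + A(2,10) = A(4,10) */
}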
Now, on multiplication.
It's possible to multiply two fixed-point numbers directly - they can be in any format. The format of the result is the "sum of the input formats" - all the parts added together - and add 1 to the integer part. In my example, multiplying A(3, 5) with A(2, 10) gives a number in the format A(6, 15). This is a conservative rule - the output format is able to store the result without loss of precision, but in applications, almost always you want to cut the precision of the output, because it's just too many bits.
In your case, where the number of bits for all numbers is 32, you probably want to lose precision in such a way that all intermediate results have 32 bits.
For example, multiplying A(17, 14) with A(2, 29) gives A(20, 43) - 64 bits required. You probably should keep 32 bits of it, and throw away the rest. What is the range of the result? If your twiddle factor is a number up to 4, the result is probably limited by 2^19 (the conservative figure of 20 above is only needed to accommodate the edge case of multiplying -1 << 31 by -1 << 31 - it's almost always worth rejecting this edge case).
So use A(19, 12) for your output format, i.e. remove 31 bits from the fractional part of your output.
So, instead of
res = a*tw;
you probably want
int64_t res_tmp = (int64_t)a * tw;           // A(20, 43)
if (res_tmp == ((int64_t)1 << 62))           // you might want to neglect this edge case
    --res_tmp;                               // A(19, 43)
int32_t res = (int32_t)(res_tmp >> 31);      // A(19, 12)
Your question seems to assume that there is a single right way to perform the operations you are interested in, but you are explicitly asking about some of the details that direct how the operations should be performed. Perhaps this is the kernel of your confusion.
res = a*tw
a is represented as A(15,16) and tw is represented as A(2,29), so the natural representation of their product is A(18,45). You need more value bits (as many bits as the two factors have combined) to maintain full precision. A(18,45) is how you should interpret the result of widening your ints to a 64-bit signed integer type (e.g. int64_t) and computing their product.
If you don't actually need or want 45 bits of fraction, then you can indeed round that to A(18,13) (or to A(18+x,13-x) for any non-negative x) without changing the magnitude of the result. That does require scaling. I would probably implement it like this:
/*
 * Computes a magnitude-preserving fixed-point product of any two signed
 * fixed-point numbers with a combined 31 (or fewer) value bits. If x
 * is represented as A(s,t) and y is represented as A(u,v),
 * where s + t == u + v == 31, then the representation of the result is
 * A(s + u + 1, t + v - 32).
 */
int32_t fixed_product(int32_t x, int32_t y) {
    int64_t full_product = (int64_t) x * (int64_t) y;
    int32_t truncated = full_product / (1U << 31);
    int round_up = ((uint32_t) full_product) >> 31;
    return truncated + round_up;
}
That avoids several potential issues and implementation-defined characteristics of signed integer arithmetic. It assumes that you want the results to be in a consistent format (that is, depending only on the formats of the inputs, not on their actual values), without overflowing.
a + res
Addition is actually a little harder if you cannot rely on the operands to initially have the same scale. You need to rescale so that they match before you can perform the addition. In the general case, you may not be able to do that without rounding away some precision.
In your case, you start with one A(15,16) and one A(18,13). You can compute an intermediate result in A(19,16) or wider (presumably A(47,16) in practice) that preserves magnitude without losing any precision, but if you want to represent that in 32 bits then the best you can do without risk of changing the magnitude is A(19,11). That would be this:
int32_t a_plus_res(int32_t a, int32_t res) {
    int64_t res16 = ((int64_t) res) * (1 << 3);
    int64_t sum16 = a + res16;
    int round_up = (((uint32_t) sum16) >> 4) & 1;
    return (int32_t) ((sum16 / (1 << 5)) + round_up);
}
A generic version would need to accept the scales of the operands' representations as additional arguments. Such a thing is possible, but the above is enough to chew on as it is.
All of the foregoing assumes that the fixed-point format for each operand and result is constant. That is more or less the distinguishing feature of fixed-point, differentiating it from floating-point formats on one hand and from arbitrary-precision formats on the other. You do, however, have the alternative of allowing formats to vary, and tracking them with a separate variable per value. That would be basically a hybrid of fixed-point and arbitrary-precision formats, and it would be messier.
Additionally, the foregoing assumes that overflow must be avoided at all costs. It would also be possible to instead put operands and results on a consistent scale; this would make addition simpler and multiplication more complicated, and it would afford the possibility of arithmetic overflow. That might nevertheless be acceptable if you have reason to believe that such overflow is unlikely for your particular data.

Logical right shift in binary search preventing arithmetic overflow

In a binary search implementation, obviously:
mid = (low + high)/2
can cause overflow. I have read a lot of documentation (like this) that the following prevents the problem:
mid = (low + high) >>> 1
However, I did not see a reason why this would work. Can anyone throw some light on this?
>>> is the unsigned right shift operator in Java (ref). Since mid, low, and high are signed integers, the addition of low and high can overflow to a negative value. >>> ignores the potential negative-ness of this result and shifts it to the right as if it were an unsigned number (and in Java, there are no unsigned numbers).
In C and C++, this is the equivalent of
mid = ((unsigned int)low + (unsigned int)high) >> 1;
(which is explicitly mentioned in the article you link to).
This ends up being the same as
mid = ((unsigned int)low + (unsigned int)high) / 2;
Note that you probably don't want to do it like this. If you're going to be using unsigned values, you should stick with unsigned values and avoid bouncing back and forth between signed and unsigned.
There is no such thing as a "logical right shift" in C (there's no >>> operator), so you're probably talking about Java.
This works because low and high are presumed to be in the range 0 to 2^31-1 (assuming we're talking about int here). The maximum possible value of low+high is then no greater than 2^32-2, which would be representable by an unsigned int if such a thing existed in Java. It doesn't, so the sum overflows to a negative value. However, the logical shift operator >>> treats its operand as if it were unsigned, so this gives the expected result.
The same link states the reason for using Java's >>>: the sum (low+high) may exceed the maximum value mid can hold:
In Programming Pearls Bentley says that the analogous line "sets m to the average of l and u, truncated down to the nearest integer." On the face of it, this assertion might appear correct, but it fails for large values of the int variables low and high. Specifically, it fails if the sum of low and high is greater than the maximum positive int value (2^31 - 1). The sum overflows to a negative value, and the value stays negative when divided by two. In C this causes an array index out of bounds with unpredictable results.
It also states the equivalent operation in C:
......
In C and C++ (where you don't have the >>> operator), you can do this:
6: mid = ((unsigned int)low + (unsigned int)high)) >> 1;
So the solution is to read and understand that article completely.
As mentioned in other answers, >>> is not a C operator.
However, if you want to avoid overflow in C, you can try this :
mid = (high - low)/2 + low;
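Putting the two fixes side by side, here is a minimal C sketch (my addition; both helpers assume 0 <= low <= high, as is the case for binary-search indices):

#include <assert.h>
#include <limits.h>

/* Overflow-safe midpoints for 0 <= low <= high. */
int mid_unsigned(int low, int high) {
    /* The sum may exceed INT_MAX, but it fits in unsigned int. */
    return (int)(((unsigned int)low + (unsigned int)high) >> 1);
}

int mid_difference(int low, int high) {
    /* high - low never overflows when both are non-negative. */
    return low + (high - low) / 2;
}

int main(void) {
    int low = INT_MAX - 1, high = INT_MAX;
    assert(mid_unsigned(low, high) == INT_MAX - 1);
    assert(mid_difference(low, high) == INT_MAX - 1);
    return 0;
}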

Programmatically determining max value of a signed integer type

This related question is about determining the max value of a signed type at compile-time:
C question: off_t (and other signed integer types) minimum and maximum values
However, I've since realized that determining the max value of a signed type (e.g. time_t or off_t) at runtime seems to be a very difficult task.
The closest thing to a solution I can think of is:
uintmax_t x = (uintmax_t)1 << (CHAR_BIT*sizeof(type) - 2);
while ((type)x <= 0) x >>= 1;
This avoids any looping as long as type has no padding bits, but if type does have padding bits, the cast invokes implementation-defined behavior, which could be a signal or a nonsensical implementation-defined conversion (e.g. stripping the sign bit).
I'm beginning to think the problem is unsolvable, which is a bit unsettling and would be a defect in the C standard, in my opinion. Any ideas for proving me wrong?
Let's first see how C defines "integer types". Taken from ISO/IEC 9899, §6.2.6.2:
6.2.6.2 Integer types
1 For unsigned integer types other than unsigned char, the bits of the object representation shall be divided into two groups: value bits and padding bits (there need not be any of the latter). If there are N value bits, each bit shall represent a different power of 2 between 1 and 2^(N−1), so that objects of that type shall be capable of representing values from 0 to 2^N − 1 using a pure binary representation; this shall be known as the value representation. The values of any padding bits are unspecified.
2 For signed integer types, the bits of the object representation shall be divided into three
groups: value bits, padding bits, and the sign bit. There need not be any padding bits;
there shall be exactly one sign bit. Each bit that is a value bit shall have the same value as the same bit in the object representation of the corresponding unsigned type (if there are M value bits in the signed type and N in the unsigned type, then M ≤ N). If the sign bit
is zero, it shall not affect the resulting value. If the sign bit is one, the value shall be
modified in one of the following ways:
— the corresponding value with sign bit 0 is negated (sign and magnitude);
— the sign bit has the value −(2^M) (two's complement);
— the sign bit has the value −(2^M − 1) (ones' complement).
Which of these applies is implementation-defined, as is whether the value with sign bit 1
and all value bits zero (for the first two), or with sign bit and all value bits 1 (for ones’
complement), is a trap representation or a normal value. In the case of sign and
magnitude and ones’ complement, if this representation is a normal value it is called a
negative zero.
Hence we can conclude the following:
~(int)0 may be a trap representation, i.e. setting all bits to 1 is a bad idea
There might be padding bits in an int that have no effect on its value
The order of the bits actually representing powers of two is undefined; so is the position of the sign bit, if it exists.
The good news is that:
there's only a single sign bit
there's only a single bit that represents the value 1
With that in mind, there's a simple technique to find the maximum value of an int. Find the sign bit, then set it to 0 and set all other bits to 1.
How do we find the sign bit? Consider int n = 1;, which is strictly positive and guaranteed to have only the ones bit (and maybe some padding bits) set to 1. Then for every other bit i that is 0, set it to 1 and see if the resulting value is negative. If it's not, revert it back to 0. Otherwise, we've found the sign bit.
Now that we know the position of the sign bit, we take our int n, set the sign bit to zero and all other bits to 1, and tadaa, we have the maximum possible int value.
Determining the int minimum is slightly more complicated and left as an exercise to the reader.
Note that the C standard humorously doesn't require two different ints to behave the same. If I'm not mistaken, there may be two distinct int objects that have e.g. their respective sign bits at different positions.
EDIT: while discussing this approach with R.. (see comments below), I have become convinced that it is flawed in several ways and, more generally, that there is no solution at all. I can't see a way to fix this posting (except deleting it), so I let it unchanged for the comments below to make sense.
Mathematically, if you have a finite set X of size n (n a positive integer) and a comparison operator (for x, y, z in X: x<=y and y<=z implies x<=z), it's a very simple problem to find the maximum value. (Also, it exists.)
The easiest way to solve this problem, but the most computationally expensive, is to generate an array with all possible values, then find the max.
Part 1. For any type with a finite member set, there's a finite number of bits (m) which can be used to uniquely represent any given member of that type. We just make an array which contains all possible bit patterns, where any given bit pattern corresponds to some value of the specific type.
Part 2. Next we'd need to convert each binary number into the given type. This task is where my programming inexperience makes me unable to speak to how this may be accomplished. I've read some about casting, maybe that would do the trick? Or some other conversion method?
Part 3. Assuming that the previous step was finished, we now have a finite set of values in the desired type and a comparison operator on that set. Find the max.
But what if...
...we don't know the exact number of members of the given type? Then we over-estimate. If we can't produce a reasonable over-estimate, then there should be physical bounds on the number. Once we have an over-estimate, we check all of those possible bit patterns to confirm which bit patterns represent members of the type. After discarding those which aren't used, we now have a set of all possible bit patterns which represent some member of the given type. This most recently generated set is what we'd use now at part 1.
...we don't have a comparison operator in that type? Then the specific problem is not only impossible, but logically irrelevant. That is, if our program cannot give a meaningful result when comparing two values from our given type, then our given type has no ordering in the context of our program. Without an ordering, there's no such thing as a maximum value.
...we can't convert a given binary number into a given type? Then the method breaks. But similar to the previous exception, if you can't convert types, then our tool-set seems logically very limited.
Technically, you may not need to convert between binary representations and a given type. The entire point of the conversion is to ensure the generated list is exhaustive.
...we want to optimize the problem? Then we need some information about how the given type maps from binary numbers. For example, unsigned int, signed int (2's complement), and signed int (1's complement) each map from bits into numbers in a very documented and simple way. Thus, if we wanted the highest possible value for unsigned int and we knew we were working with m bits, then we could simply fill each bit with a 1, convert the bit pattern to decimal, then output the number.
This relates to optimization because the most expensive part of this solution is the listing of all possible answers. If we have some previous knowledge of how the given type maps from bit patterns, we can skip the exhaustive listing and generate only the plausible candidates.
Good luck.
Update: Thankfully, my previous answer below was wrong, and there seems to be a solution to this question.
intmax_t x;
for (x = INTMAX_MAX; (T)x != x; x /= 2);
This program either yields x containing the max possible value of type T, or generates an implementation-defined signal.
Working around the signal case may be possible but difficult and computationally infeasible (as in having to install a signal handler for every possible signal number), so I don't think this answer is fully satisfactory. POSIX signal semantics may give enough additional properties to make it feasible; I'm not sure.
The interesting part, especially if you're comfortable assuming you're not on an implementation that will generate a signal, is what happens when (T)x results in an implementation-defined conversion. The trick of the above loop is that it does not rely at all on the implementation's choice of value for the conversion. All it relies upon is that (T)x==x is possible if and only if x fits in type T, since otherwise the value of x is outside the range of possible values of any expression of type T.
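Wrapped up for concreteness, this is what using the loop might look like (my sketch; the TYPE_MAX macro and its shape are my invention, and the caveat above about implementation-defined signals from the (T)x conversion still applies):

#include <stdint.h>
#include <stdio.h>
#include <time.h>

/* Halve INTMAX_MAX until the value survives a round trip through T.
 * Leaves the result in the intmax_t lvalue named by `out`. */
#define TYPE_MAX(T, out) \
    for ((out) = INTMAX_MAX; (T)(out) != (out); (out) /= 2)

int main(void) {
    intmax_t m;
    TYPE_MAX(time_t, m);   /* assumes time_t is a signed integer type here */
    printf("apparent max of time_t: %jd\n", m);
    return 0;
}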
Old idea, wrong because it does not account for the above (T)x==x property:
I think I have a sketch of a proof that what I'm looking for is impossible:
Let X be a conforming C implementation and assume INT_MAX>32767.
Define a new C implementation Y identical to X, but where the values of INT_MAX and INT_MIN are each divided by 2.
Prove that Y is a conforming C implementation.
The essential idea of this outline is that, due to the fact that everything related to out-of-bound values with signed types is implementation-defined or undefined behavior, an arbitrary number of the high value bits of a signed integer type can be considered as padding bits without actually making any changes to the implementation except the limit macros in limits.h.
Any thoughts on if this sounds correct or bogus? If it's correct, I'd be happy to award the bounty to whoever can do the best job of making it more rigorous.
I might just be writing stupid things here, since I'm relatively new to C, but wouldn't this work for getting the max of a signed?
unsigned x = ~0;
signed y = x / 2;
This might be a dumb way to do it, but as far as I've seen, unsigned max values are signed max * 2 + 1. Won't it work backwards?
Sorry for the time wasted if this proves to be completely inadequate and incorrect.
Shouldn't something like the following pseudo code do the job?
signed_type_of_max_size test_values =
    [(1<<7)-1, (1<<15)-1, (1<<31)-1, (1<<63)-1];

for test_value in test_values:
    signed_foo_t a = test_value;
    signed_foo_t b = a + 1;
    if (b < a):
        print "Max positive value of signed_foo_t is ", a
Or much simpler, why shouldn't the following work?
signed_foo_t signed_foo_max = (1<<(sizeof(signed_foo_t)*8-1))-1;
For my own code, I would definitely go for a build-time check defining a preprocessor macro, though.
Assuming modifying padding bits won't create trap representations, you could use an unsigned char * to loop over and flip individual bits until you hit the sign bit. If your initial value was ~(type)0, this should get you the maximum:
type value = ~(type)0;
assert(value < 0);
unsigned char *bytes = (void *)&value;
size_t i = 0;
for (; i < sizeof value * CHAR_BIT; ++i)
{
    bytes[i / CHAR_BIT] ^= 1 << (i % CHAR_BIT);
    if (value > 0) break;
    bytes[i / CHAR_BIT] ^= 1 << (i % CHAR_BIT);
}
assert(value != ~(type)0);
// value == TYPE_MAX
Since you allow this to be at runtime, you could write a function that in effect does an iterative left shift of (type)3. If you stop once the value has fallen below 0, this can never give you a trap representation. And the number of iterations - 1 will tell you the position of the sign bit.
That leaves the problem of the left shift. Since just using the operator << would lead to an overflow, and that would be undefined behavior, we can't use the operator directly.
The simplest solution is not to use a shifted 3 as above, but to iterate over the bit positions and always set the least significant bit as well.
type x;
unsigned char *B = (void *)&x;
size_t signbit = 7;
for (;; ++signbit) {
    size_t bpos = signbit / CHAR_BIT;
    size_t apos = signbit % CHAR_BIT;
    x = 1;
    B[bpos] |= (1 << apos);
    if (x < 0) break;
}
(The start value 7 is the lowest position the sign bit can have, since a signed type must be at least 8 bits wide, I think.)
Why would this present a problem? The size of the type is fixed at compile time, so the problem of determining the runtime size of the type reduces to the problem of determining the compile-time size of the type. For any given target platform, a declaration such as off_t offset will be compiled to use some fixed size, and that size will then always be used when running the resulting executable on the target platform.
ETA: You can get the size of the type type via sizeof(type). You could then compare against common integer sizes and use the corresponding MAX/MIN preprocessor define. You might find it simpler to just use:
uintmax_t bitWidth = sizeof(type) * CHAR_BIT;
intmax_t sizeMax = (((intmax_t)1 << (bitWidth - 2)) - 1) * 2 + 1;  /* 2^(bitWidth-1) - 1; note that ^ in C is XOR, not exponentiation */
intmax_t sizeMin = -sizeMax - 1;                                   /* assuming two's complement */
Just because a value is representable by the underlying "physical" type does not mean that value is valid for a value of the "logical" type. I imagine the reason max and min constants are not provided is that these are "semi-opaque" types whose use is restricted to particular domains. Where less opacity is desirable, you will often find ways of getting the information you want, such as the constants you can use to figure out how big an off_t is that are mentioned by the SUSv2 in its description of <unistd.h>.
For an opaque signed type for which you don't have a name of the associated unsigned type, this is unsolvable in a portable way, because any attempt to detect whether there is a padding bit will yield implementation-defined behavior or undefined behavior. The best thing you can deduce by testing (without additional knowledge) is that there are at least K padding bits.
BTW, this doesn't really answer the question, but can still be useful in practice: If one assumes that the signed integer type T has no padding bits, one can use the following macro:
#define MAXVAL(T) (((((T) 1 << (sizeof(T) * CHAR_BIT - 2)) - 1) * 2) + 1)
This is probably the best that one can do. It is simple and does not need to assume anything else about the C implementation.
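As a usage example (my addition; it assumes a POSIX system where off_t and time_t are signed integer types without padding bits):

#include <limits.h>
#include <stdint.h>
#include <stdio.h>
#include <sys/types.h>   /* off_t on POSIX systems */
#include <time.h>

#define MAXVAL(T) (((((T) 1 << (sizeof(T) * CHAR_BIT - 2)) - 1) * 2) + 1)

int main(void) {
    printf("max off_t:  %jd\n", (intmax_t)MAXVAL(off_t));
    printf("max time_t: %jd\n", (intmax_t)MAXVAL(time_t));
    return 0;
}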
Maybe I'm not getting the question right, but since C gives you 3 possible representations for signed integers (http://port70.net/~nsz/c/c11/n1570.html#6.2.6.2):
sign and magnitude
ones' complement
two's complement
and the max in any of these should be 2^(N-1)-1, you should be able to get it by taking the max of the corresponding unsigned type, >>1-shifting it, and casting the result to the proper type (which it should fit).
I don't know how to get the corresponding minimum if trap representations get in the way, but if they don't, the min should be either (Tp)((Tp)-1|(Tp)TP_MAX(Tp)) (all bits set) or (Tp)~TP_MAX(Tp), and which it is should be simple to find out.
Example:
#include <limits.h>

#define UNSIGNED(Tp,Val) \
    _Generic((Tp)0, \
        _Bool: (_Bool)(Val), \
        char: (unsigned char)(Val), \
        signed char: (unsigned char)(Val), \
        unsigned char: (unsigned char)(Val), \
        short: (unsigned short)(Val), \
        unsigned short: (unsigned short)(Val), \
        int: (unsigned int)(Val), \
        unsigned int: (unsigned int)(Val), \
        long: (unsigned long)(Val), \
        unsigned long: (unsigned long)(Val), \
        long long: (unsigned long long)(Val), \
        unsigned long long: (unsigned long long)(Val) \
    )

#define MIN2__(X,Y) ((X)<(Y)?(X):(Y))
#define UMAX__(Tp)  ((Tp)(~((Tp)0)))
#define SMAX__(Tp)  ((Tp)( UNSIGNED(Tp,~UNSIGNED(Tp,0))>>1 ))
#define SMIN__(Tp)  ((Tp)MIN2__( \
        (Tp)(((Tp)-1)|SMAX__(Tp)), \
        (Tp)(~SMAX__(Tp)) ))

#define TP_MAX(Tp) ((((Tp)-1)>0)?UMAX__(Tp):SMAX__(Tp))
#define TP_MIN(Tp) ((((Tp)-1)>0)?((Tp)0):SMIN__(Tp))

int main()
{
#define STC_ASSERT(X) _Static_assert(X,"")

    STC_ASSERT(TP_MAX(int)==INT_MAX);
    STC_ASSERT(TP_MAX(unsigned int)==UINT_MAX);
    STC_ASSERT(TP_MAX(long)==LONG_MAX);
    STC_ASSERT(TP_MAX(unsigned long)==ULONG_MAX);
    STC_ASSERT(TP_MAX(long long)==LLONG_MAX);
    STC_ASSERT(TP_MAX(unsigned long long)==ULLONG_MAX);

    /*STC_ASSERT(TP_MIN(unsigned short)==USHRT_MIN);*/
    STC_ASSERT(TP_MIN(int)==INT_MIN);
    /*STC_ASSERT(TP_MIN(unsigned int)==UINT_MIN);*/
    STC_ASSERT(TP_MIN(long)==LONG_MIN);
    /*STC_ASSERT(TP_MIN(unsigned long)==ULONG_MIN);*/
    STC_ASSERT(TP_MIN(long long)==LLONG_MIN);
    /*STC_ASSERT(TP_MIN(unsigned long long)==ULLONG_MIN);*/

    STC_ASSERT(TP_MAX(char)==CHAR_MAX);
    STC_ASSERT(TP_MAX(signed char)==SCHAR_MAX);
    STC_ASSERT(TP_MAX(short)==SHRT_MAX);
    STC_ASSERT(TP_MAX(unsigned short)==USHRT_MAX);
    STC_ASSERT(TP_MIN(char)==CHAR_MIN);
    STC_ASSERT(TP_MIN(signed char)==SCHAR_MIN);
    STC_ASSERT(TP_MIN(short)==SHRT_MIN);
}
For all real machines (two's complement and no padding):

type tmp = ((type)1) << (CHAR_BIT*sizeof(type) - 2);
max = tmp + (tmp - 1);
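For instance, instantiated for int (my addition), this yields INT_MAX without overflowing at any step:

#include <limits.h>
#include <stdio.h>

int main(void) {
    /* With type = int: tmp = 2^30, max = 2^30 + (2^30 - 1) = 2^31 - 1 */
    int tmp = ((int)1) << (CHAR_BIT * sizeof(int) - 2);
    int max = tmp + (tmp - 1);
    printf("%d == %d\n", max, INT_MAX);
    return 0;
}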
With C++, you can calculate it at compile time.
template <class T>
struct signed_max
{
    static const T max_tmp = T(T(1) << (sizeof(T)*CHAR_BIT - 2u));
    static const T value = max_tmp + T(max_tmp - 1u);
};
