Subtracting 1 from 0 in 8 bit binary - c

I have 8 bit int zero = 0b00000000; and 8 bit int one = 0b00000001;
according to binary arithmetic rule,
0 - 1 = 1 (borrow 1 from next significant bit).
So if I have:
int s = zero - one;
s = -1;
-1 = 0b11111111;
Where are all those 1s coming from? There is nothing to borrow, since all bits are 0 in the zero variable.

This is a great question and has to do with how computers represent integer values.
If you’re writing out a negative number in base ten, you just write out the regular number and then prefix it with a minus sign. But if you’re working inside a computer where everything needs to either be a zero or a one, you don’t have any minus signs. The question then comes up of how you then choose to represent negative values.
One popular way of doing this is to use signed two’s complement form. The way this works is that you write the number using ones and zeros, except that the meaning of those ones and zeros differs from “standard” binary in how they’re interpreted. Specifically, if you have a signed 8-bit number, the lower seven bits have their standard meaning as 2^0, 2^1, 2^2, etc. However, the meaning of the most significant bit is changed: instead of representing 2^7, it represents the value -2^7.
So let’s look at the number 0b11111111. This would be interpreted as
-2^7 + 2^6 + 2^5 + 2^4 + 2^3 + 2^2 + 2^1 + 2^0
= -128 + 64 + 32 + 16 + 8 + 4 + 2 + 1
= -1
which is why this collection of bits represents -1.
There’s another way to interpret what’s going on here. Given that our integer only has eight bits to work with, we know that there’s no way to represent all possible integers. If you pick any 257 integer values, given that there are only 256 possible bit patterns, there’s no way to uniquely represent all these numbers.
To address this, we could alternatively say that we’re going to have our integer values represent not the true value of the integer, but the value of that integer modulo 256. All of the values we’ll store will be between 0 and 255, inclusive.
In that case, what is 0 - 1? It’s -1, but if we take that value mod 256 and force it to be nonnegative, then we get back that -1 = 255 (mod 256). And how would you write 255 in binary? It’s 0b11111111.
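If you want to see this on an actual machine, here is a minimal sketch (assuming the fixed-width types from <stdint.h> are available); it prints 255 and the all-ones bit pattern:

#include <stdio.h>
#include <stdint.h>

int main(void)
{
    uint8_t zero = 0, one = 1;
    uint8_t s = zero - one;      // the subtraction wraps around mod 256

    printf("%d\n", s);           // prints 255

    printf("0b");
    for (int bit = 7; bit >= 0; bit--)
        printf("%d", (s >> bit) & 1);   // prints 0b11111111
    printf("\n");

    return 0;
}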
There’s a ton of other cool stuff to learn here if you’re interested, so I’d recommend reading up on signed and unsigned two’s-complement numbers.
As some exercises: what would -4 look like in this format? How about -9?
These aren't the only ways you can represent numbers in a computer, but they're probably the most popular. Some older computers used the balanced ternary number system (notably the Setun machine). There's also the one's complement format, which isn't super popular these days.

Zero minus one must give some number such that if you add one to it, you get zero. The only number you can add one to and get zero is the one represented in binary as all 1's. So that's what you get.
So long as you use any valid form of arithmetic, you get the same results. If there are eight cars and someone takes away three cars, the value you get for how many cars are left should be five, regardless of whether you do the math with binary, decimal, or any other kind of representation.
So any valid system of representation that supports the operations you are using with their normal meanings must produce the same result. When you take the representation for zero and perform the subtraction operation using the representation for one, you must get the representation such that when you add one to it, you get the representation for zero. Otherwise, the result is just wrong based on the definitions of addition, subtraction, zero, one, and so on.

Related

Problem of a simple float number serialization example

I am reading the Serialization section of a tutorial http://beej.us/guide/bgnet/html/#serialization .
I am reviewing the code which encodes the number into a portable binary form.
#include <stdint.h>

uint32_t htonf(float f)
{
    uint32_t p;
    uint32_t sign;

    if (f < 0) { sign = 1; f = -f; }
    else { sign = 0; }

    p = ((((uint32_t)f)&0x7fff)<<16) | (sign<<31); // whole part and sign
    p |= (uint32_t)(((f - (int)f) * 65536.0f))&0xffff; // fraction

    return p;
}

float ntohf(uint32_t p)
{
    float f = ((p>>16)&0x7fff); // whole part
    f += (p&0xffff) / 65536.0f; // fraction

    if (((p>>31)&0x1) == 0x1) { f = -f; } // sign bit set

    return f;
}
I ran into problems with this line p = ((((uint32_t)f)&0x7fff)<<16) | (sign<<31); // whole part and sign .
According to the original code comments, this line extracts the whole part and sign, and the next line deals with fraction part.
Then I found an image about how float is represented in memory and started the calculation by hand.
From Wikipedia Single-precision floating-point format:
So I then presumed that whole part == exponent part.
But ((((uint32_t)f)&0x7fff)<<16) is getting the last 15 bits of the fraction part, going by the image above.
Now I am confused: where did I go wrong?
It's important to realize what this code is not. This code does not do anything with the individual bits of a float value. (If it did, it wouldn't be portable and machine-independent, as it claims to be.) And the "portable" string representation it creates is fixed point, not floating point.
For example, if we use this code to convert the number -123.125, we will get the binary result
10000000011110110010000000000000
or in hexadecimal
807b2000
Now, where did that number 10000000011110110010000000000000 come from? Let's break it up into its sign, whole number, and fractional parts:
1 000000001111011 0010000000000000
The sign bit is 1 because our original number was negative. 000000001111011 is the 15-bit binary representation of 123. And 0010000000000000 is 8192. Where did 8192 come from? Well, 8192 ÷ 65536 is 0.125, which was our fractional part. (More on this below.)
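As a quick check, here is a sketch that assumes the htonf()/ntohf() definitions from the question are compiled in the same file:

#include <stdio.h>
#include <stdint.h>

uint32_t htonf(float f);   // definitions as given in the question
float ntohf(uint32_t p);

int main(void)
{
    uint32_t p = htonf(-123.125f);
    printf("%08x\n", (unsigned)p);   // prints 807b2000
    printf("%f\n", ntohf(p));        // prints -123.125000
    return 0;
}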
How did the code do this? Let's walk through it step by step.
(1) Extract sign. That's easy: it's the ordinary test if(f < 0).
(2) Extract whole-number part. That's also easy: We take our floating-point number f, and cast it to type uint32_t. When you convert a floating-point number to an integer in C, the behavior is pretty obvious: it throws away the fractional part and gives you the integer. So if f is 123.125, (uint32_t)f is 123.
(3) Extract fraction. Since we've already got the integer part, we can isolate the fraction by starting with the original floating-point number f, and subtracting the integer part. That is, 123.125 - 123 = 0.125. Then we multiply the fractional part by 65536, which is 2^16.
It may not be obvious why we multiplied by 65536 and not some other number. In one sense, it doesn't matter what number you use. The goal here is to take a fractional number f and turn it into two integers a and b such that we can recover the fractional number f again later (although perhaps approximately). The way we're going to recover the fractional number f again later is by computing
a + b / x
where x is, well, some other number. If we chose 1000 for x, we'd break 123.125 up into a and b values of 123 and 125. We're choosing 65536, or 2^16, for x because that lets us make maximal use of the 16 bits we've allocated for the fractional part in our representation. Since x is 65536, b has to be some number we can divide by 65536 in order to get 0.125. So since b / 65536 = 0.125, by simple algebra we have b = 0.125 * 65536. Make sense?
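For instance, here is the recovery arithmetic for 123.125 with x = 65536 (so a = 123 and b = 8192), just to make the formula concrete:

#include <stdio.h>

int main(void)
{
    unsigned a = 123, b = 8192;        // whole part and scaled fraction
    printf("%f\n", a + b / 65536.0);   // prints 123.125000
    return 0;
}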
Anyway, let's now look at the actual code for performing steps 1, 2, and 3.
if (f < 0) { sign = 1; f = -f; }
Easy peasy. If f is negative, our sign bit will be 1, and we want the rest of the code to operate on the positive version of f.
p = ((((uint32_t)f)&0x7fff)<<16) | (sign<<31);
As mentioned, the important part here is (uint32_t)f, which just grabs the integer (whole-number) part of f. The bitmask & 0x7fff extracts the low-order 15 bits of it, throwing anything else away. (This is because our "portable representation" only allocates 15 bits for the whole-number part, meaning that numbers greater than 2^15-1 or 32767 can't be represented). The shift << 16 moves it into the high half of the eventual uint32_t result, where it belongs. And then | (sign<<31) takes the sign bit and puts it in the high-order position where it belongs.
p |= (uint32_t)(((f - (int)f) * 65536.0f))&0xffff; // fraction
Here, (int)f recomputes the integer (whole-number) part of f, and then f - (int)f extracts the fraction. We multiply it by 65536, as explained above. There may still be a fractional part (even after the multiplication, that is), so we cast to (uint32_t) again to throw that away, retaining only the integer part. We can only handle 16 bits of fraction, so we extract those bits (discarding anything else) with & 0xffff, although this should be unnecessary since we started with a positive fractional number less than 1, and multiplied it by 65536, so we should end up with a positive number less than 65536, i.e. we shouldn't have a number that won't exactly fit in 16 bits. Finally, the p |= operation stuffs these 16 bits we've just computed into the low-order half of p, and we're done.
Addendum: It may still not be obvious where the number 65536 came from, and why that was used instead of 10000 or some other number. So let's review two key points: we're ultimately dealing with integers here. Also, in one sense, the number 65536 actually was pretty arbitrary.
At the end of the day, any bit pattern we're working with is "really" just an integer. It's not a character, or a floating-point number, or a pointer — it's just an integer. If it has N bits, it represents integers from 0 to 2^N-1.
In the fixed-point representation we're using here, there are three subfields: a 1-bit sign, a 15-bit whole-number part, and a 16-bit fraction part.
The interpretation of the sign and whole-number parts is obvious. But the question is: how shall we represent a fraction using a 16-bit integer?
And the answer is, we pick a number, almost any number, to divide by. We can call this number a "scaling factor".
We really can pick almost any number. Suppose I chose the number 3467 as my scaling factor. Here is how I would then represent several different fractions as integers:
    ½ → 1734/3467 → 1734
    ⅓ → 1155/3467 → 1155
    0.125 → 433/3467 → 433
So my fractions ½, ⅓, and 0.125 are represented by the integers 1734, 1155, and 433. To recover my original fractions, I just divide by 3467:
    1734 → 1734 ÷ 3467 → 0.500144
    1155 → 1155 ÷ 3467 → 0.333141
    433 → 433 ÷ 3467 → 0.124891
Obviously I wasn't able to recover my original fractions exactly, but I came pretty close.
The other thing to wonder about is, where does that number 3467 "live"? If you're just looking at the numbers 1734, 1155, and 433, how do you know you're supposed to divide them by 3467? And the answer is, you don't know, at least, not just by looking at them. 3467 would have to be part of the definition of my silly fractional number format; people would just have to know, because I said so, that they had to multiply by 3467 when constructing integers to represent fractions, and divide by 3467 when recovering the original fractions.
And the other thing to look at is what the implications are of choosing various different scaling factors. The first thing is that, since in the end we're going to be using a 16-bit integer for the fractional representation, we absolutely can't use a scaling factor any greater than 65536. If we did, sometimes we'd end up with an integer greater than 65535, and it wouldn't fit in 16 bits. For example, suppose we tried to use a scaling factor of 70000, and suppose we tried to represent the fraction 0.95. Now, 0.95 is equal to 66500/70000, so our integer would be 66500, but that doesn't fit in 16 bits.
On the other hand, it turns out that ideally we don't want to use a number less than 65536, either. The smaller a number we use, the more of our 16-bit fractional representation we'll waste. When I chose 3467 in my silly example a little earlier, that meant I would represent fractions from 0/3467 = 0.00000 and 1/3467 = 0.000288 up to 3466/3467 = 0.999711. But I'd never use any of the integers from 3467 through 65535. They'd be wasted, and by not using them, I'd unnecessarily limit the precision of the fractions I could represent.
The "best" (least wasteful) scaling factor to use is 65536, although there's one other consideration, namely, which fractions do you want to be able to represent exactly? When I used 3467 as my scaling factor, I couldn't represent any of my test numbers ½, ⅓, or 0.125 exactly. If we use 65536 as the scaling factor, it turns out that we can represent fractions involving small powers of two exactly — that is, halves, quarters, eights, sixteenths, etc. — but not any other fractions, and in particular not most of the decimal fractions like 0.1. If we wanted to be able to represent decimal fractions exactly, we would have to use a scaling factor that was a power of 10. The largest power of 10 that will fit in 16 bits is 10000, and that would indeed let us exactly represent decimal fractions as small as 0.00001, although we'd waste about 5/6 (or 85%) of our 16-bit fractional range.
So if we wanted to represent decimal fractions exactly, without wasting precision, the inescapable conclusion is that we should not have allocated 16 bits for our fraction field in the first place. Better choices would have been 10 bits (ideal scaling factor 1024, we'd use 1000, wasting only 2%) or 20 bits (ideal scaling factor 1048576, we'd use 1000000, wasting about 5%).
The relevant excerpts from the page are
The thing to do is to pack the data into a known format and send that over the wire for decoding. For example, to pack floats, here’s something quick and dirty with plenty of room for improvement
and
On the plus side, it’s small, simple, and fast. On the minus side, it’s not an efficient use of space and the range is severely restricted—try storing a number greater-than 32767 in there and it won’t be very happy! You can also see in the above example that the last couple decimal places are not correctly preserved.
The code is presented only as an example. It is really quick and dirty, because it packs and unpacks the float as a fixed point number with 16 bits for fractions, 15 bits for integer magnitude and one for sign. It is an example and does not attempt to map floats 1:1.
It is in fact a rather incredibly stupid algorithm: it can map 1:1 all IEEE 754 float32s within the magnitude range ~256...32767 without losing a bit of information, it truncates the fractions of floats in the range 0...255 to 16 bits, and it fails spectacularly for any number >= 32768. And for NaNs.
As for the endianness problem: for any protocol that does not work with integers >= 32 bits intrinsically, someone needs to decide how to again serialize these integers into the other format. For example in the Internet at lowest levels data consists of 8-bit octets.
There are 24 obvious ways of mapping a 32-bit unsigned integer onto 4 octets, of which 2 are now generally used, and some more historically. Of course there are a countably infinite (and exponentially sillier) number of ways of encoding them...

Understanding casts from integer to float

Could someone explain this weird looking output on a 32 bit machine?
#include <stdio.h>

int main() {
    printf("16777217 as float is %.1f\n", (float)16777217);
    printf("16777219 as float is %.1f\n", (float)16777219);
    return 0;
}
Output
16777217 as float is 16777216.0
16777219 as float is 16777220.0
The weird thing is that 16777217 casts to a lower value and 16777219 casts to a higher value...
In the IEEE-754 basic 32-bit binary floating-point format, all integers from −16,777,216 to +16,777,216 are representable. From 16,777,216 to 33,554,432, only even integers are representable. Then, from 33,554,432 to 67,108,864, only multiples of four are representable. (Since the question does not necessitate discussion of which numbers are representable, I will omit explanation and just take this for granted.)
The most common default rounding mode is to round the exact mathematical result to the nearest representable value and, in case of a tie, to round to the representable value which has zero in the low bit of its significand.
16,777,217 is equidistant between the two representable values 16,777,216 and 16,777,218. These values are represented as 100000000000000000000000_2 • 2^1 and 100000000000000000000001_2 • 2^1. The former has 0 in the low bit of its significand, so it is chosen as the result.
16,777,219 is equidistant between the two representable values 16,777,218 and 16,777,220. These values are represented as 100000000000000000000001_2 • 2^1 and 100000000000000000000010_2 • 2^1. The latter has 0 in the low bit of its significand, so it is chosen as the result.
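If you want to watch the rounding happen, here is a small sketch that prints the two values along with their stored bit patterns (using memcpy to inspect the bits, which is the portable way to do it):

#include <stdio.h>
#include <stdint.h>
#include <string.h>

int main(void)
{
    float a = 16777217;   // rounds down to 16777216.0 (tie, low bit even)
    float b = 16777219;   // rounds up to 16777220.0 (tie, low bit even)
    uint32_t ua, ub;

    memcpy(&ua, &a, sizeof ua);   // copy out the stored bit patterns
    memcpy(&ub, &b, sizeof ub);

    printf("%.1f -> 0x%08x\n", a, (unsigned)ua);
    printf("%.1f -> 0x%08x\n", b, (unsigned)ub);
    return 0;
}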
You may have heard of the concept of "precision", as in "this fractional representation has 3 digits of precision".
This is very easy to think about in a fixed-point representation. If I have, say, three digits of precision past the decimal, then I can exactly represent 1/2 = 0.5, and I can exactly represent 1/4 = 0.25, and I can exactly represent 1/8 = 0.125, but if I try to represent 1/16, I can not get 0.0625; I will either have to settle for 0.062 or 0.063.
But that's for fixed-point. The computer you're using uses floating-point, which is a lot like scientific notation. You get a certain number of significant digits total, not just digits to the right of the decimal point. For example, if you have 3 decimal digits worth of precision in a floating-point format, you can represent 0.123 but not 0.1234, and you can represent 0.0123 and 0.00123, but not 0.01234 or 0.001234. And if you have digits to the left of the decimal point, those take away from the number of digits you can use to the right of the decimal point. You can use 1.23 but not 1.234, and 12.3 but not 12.34, and 123.0 but not 123.4 or 123.anythingelse.
And -- you can probably see the pattern by now -- if you're using a floating-point format with only three significant digits, you can't represent all numbers greater than 999 perfectly accurately at all, even though they don't have a fractional part. You can represent 1230 but not 1234, and 12300 but not 12340.
So that's decimal floating-point formats. Your computer, on the other hand, uses a binary floating-point format, which ends up being somewhat trickier to think about. We don't have an exact number of decimal digits' worth of precision, and the numbers that can't be exactly represented don't end up being nice even multiples of 10 or 100.
In particular, type float on most machines has 24 binary bits worth of precision, which works out to 6-7 decimal digits' worth of precision. That's obviously not enough for numbers like 16777217.
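You can query these limits from <float.h>; on a typical IEEE-754 machine this prints 24 and 6:

#include <stdio.h>
#include <float.h>

int main(void)
{
    printf("FLT_MANT_DIG = %d\n", FLT_MANT_DIG);  // base-2 digits in the significand (24 for IEEE-754 single)
    printf("FLT_DIG      = %d\n", FLT_DIG);       // decimal digits guaranteed to round-trip (6)
    return 0;
}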
So where did the numbers 16777216 and 16777220 come from? As Eric Postpischil has already explained, it ends up being because they're multiples of 2. If we look at the binary representations of nearby numbers, the pattern becomes clear:
16777208 111111111111111111111000
16777209 111111111111111111111001
16777210 111111111111111111111010
16777211 111111111111111111111011
16777212 111111111111111111111100
16777213 111111111111111111111101
16777214 111111111111111111111110
16777215 111111111111111111111111
16777216 1000000000000000000000000
16777218 1000000000000000000000010
16777220 1000000000000000000000100
16777215 is the biggest number that can be represented exactly in 24 bits. After that, you can represent only even numbers, because the low-order bit is the 25th, and essentially has to be 0.
Type float cannot hold that much significance. The significand can only hold 24 bits. Of those 23 are stored and the 24th is 1 and not stored, because the significand is normalised.
Please read this which says "Integers in [−16777216, 16777216] can be exactly represented", but yours are out of that range.
Floating-point representation follows a method similar to one we use in everyday life, called exponential notation. We write the value using a number of digits that we decide is enough to represent it realistically; this part is called the mantissa, or significand. It is then multiplied by a base, or radix, raised to a power called the exponent. In plain words:
num*base^exp
We generally use 10 as the base, because we have 10 fingers on our hands, so we are used to numbers like 1e2, which is 100 = 1*10^2.
Of course we wouldn't bother with exponential representation for such small numbers, but we prefer it when acting on very large numbers, or, better, when our number has as many digits as we consider enough to represent the quantity we are working with.
The correct number of digits could be how many we can handle in our head, or how many are required for an engineering application. Once we have decided how many digits we need, we no longer care how closely the numeric representation matches the real value. I.e. for a number like 123456.789e5 it is understood that adding 99 units changes nothing: we tolerate the rounded representation and consider it acceptable anyway; if not, we should change the representation and use a different one with an appropriate number of digits, as in 12345678900.
On a computer, when you have to handle very large numbers that couldn't fit in a standard integer, or when you have to represent a real number (with a decimal part), the right choice is a floating-point or double-precision floating-point representation. It uses the same layout we discussed above, but the base is 2 instead of 10, because a computer has only 2 fingers: the states 0 and 1. So the formula we used before, to represent 100, becomes:
1100100*2^0
That still isn't the real floating-point representation, but it gives the idea. Now consider that in a computer the floating-point format is standardized: for a standard float, as per IEEE-754, the memory layout (we will see later why one more bit is assumed for the mantissa) is 23 bits for the mantissa, 1 bit for the sign, and 8 bits for the exponent, stored with a bias of 127 (that simply means the effective exponent ranges between -126 and +127 without the need for a sign bit, with the values 0x00 and 0xff reserved for special meanings).
Now consider using 0 as the exponent; this means that the value 2^exponent = 2^0 = 1 multiplied by the mantissa gives the same behavior as a 23-bit integer. This implies that incrementing a count as in:
float f = 0;
while (1)
{
    f += 1;
    printf("%f\n", f);
}
You will see that the printed value increases linearly by one until it saturates the 23 bits, and then the exponent starts to grow.
If the base, or radix, of our floating-point number had been 10, we would see an increase every 10 loops for the first 100 (10^2) values, then an increase of 100 for the next 1000 (10^3) values, and so on. You can see that this corresponds to the truncation we have to make due to the limited number of available digits.
The same phenomenon is observed when using the binary base, only the changes happen at power-of-2 intervals.
What we have discussed up to now is called the denormalized form of a floating-point number; what is normally used is its normalized counterpart. The latter simply means that there is a 24th bit, not stored, that is always 1. In plain words, we wouldn't use an exponent of 0 for numbers less than 2^24; instead we shift the number (multiply by 2) until the most significant 1 bit reaches the 24th position, and then the exponent is adjusted to a negative value that forces the conversion to shift the number back to its original value.
Remember the reserved exponent values we talked about above? Well, an exponent==0x00 means that we have a denormalized number, while exponent==0xff indicates a NaN (not-a-number), or +/-infinity if mantissa==0.
It should be clear now that when the number we express is beyond the 24 bits of the significand (mantissa), we should expect an approximation of the real value, depending on how far we are from 2^24.
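Here is a small sketch that pulls a float apart into the three fields described above (sign, biased exponent, stored mantissa), so you can check the layout yourself:

#include <stdio.h>
#include <stdint.h>
#include <string.h>

int main(void)
{
    float f = 16777216.0f;            // 2^24, right at the edge discussed below
    uint32_t bits;
    memcpy(&bits, &f, sizeof bits);   // grab the stored bit pattern

    unsigned sign     = bits >> 31;
    unsigned exponent = (bits >> 23) & 0xff;   // 8 biased exponent bits
    unsigned mantissa = bits & 0x7fffff;       // 23 stored mantissa bits

    // For 2^24: sign=0, exponent=151 (151-127=24), mantissa=0
    printf("sign=%u exponent=%u (unbiased %d) mantissa=0x%06x\n",
           sign, exponent, (int)exponent - 127, mantissa);
    return 0;
}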
Now the numbers you are using are just on the edge of 2^24 = 16,777,216:
+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+
|0|1|0|0|1|0|1|1|0|1|1|1|1|1|1|1|1|1|1|1|1|1|1|1|1|1|1|1|1|1|1|1| = 16,777,215
+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+
s\__ exponent __/\_________________ mantissa __________________/
Now increasing by 1 we have:
+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+
|0|1|0|0|1|0|1|1|1|0|0|0|0|0|0|0|0|0|0|0|0|0|0|0|0|0|0|0|0|0|0|0| = 16,777,216
+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+
s\__ exponent __/\_________________ mantissa __________________/
Note that we have set the 24th bit to 1, but from now on we are above the 24-bit representation, and each further representable value is in steps of 2^1=2. The value simply advances by 2, i.e. it can represent only even numbers (multiples of 2^1=2). E.g. setting the least significant bit to 1 we have:
+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+
|0|1|0|0|1|0|1|1|1|0|0|0|0|0|0|0|0|0|0|0|0|0|0|0|0|0|0|0|0|0|0|1| = 16,777,218
+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+
s\__ exponent __/\_________________ mantissa __________________/
Increasing again:
+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+
|0|1|0|0|1|0|1|1|1|0|0|0|0|0|0|0|0|0|0|0|0|0|0|0|0|0|0|0|0|0|1|0| = 16,777,220
+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+
s\__ exponent __/\_________________ mantissa __________________/
As you can see we cannot exactly represent 16,777,219. In your code:
// This will print 16777216, because an increment of 1 isn't enough to
// change the significand, which at this magnitude can only express
// intervals of 2^1
printf("16777217 as float is %.1f\n", (float)16777217);

// This will print 16777220, because an increment of 3 above the base
// 16777216 = 2^24 gets rounded to the closest exact representation
printf("16777219 as float is %.1f\n", (float)16777219);
As said above, the choice of numeric format must be appropriate for the usage: a floating-point value is only an approximate representation of a real number, and it is definitely our duty to carefully use the right type.
If we need more precision we could use a double, or an integer type such as long long int.
Just for the sake of completeness, I would add a few words on the approximate representation of fractions that are not expressible as sums of powers of 2. Such values can never be exact in float format, and need to be rounded to the correct value during conversion to decimal representation.
For more details see:
https://en.wikipedia.org/wiki/IEEE_754
https://en.wikipedia.org/wiki/Single-precision_floating-point_format
Online demo applets:
https://babbage.cs.qc.cuny.edu/IEEE-754/
https://evanw.github.io/float-toy/
https://www.h-schmidt.net/FloatConverter/IEEE754.html

The max number of digits in an int based on number of bits

So, I needed a constant value to represent the max number of digits in an int, and it needed to be calculated at compile time to pass into the size of a char array.
To add some more detail: The compiler/machine I'm working with has a very limited subset of the C language, so none of the std libraries work as they have unsupported features. As such I cannot use INT_MIN/MAX as I can neither include them, nor are they defined.
I need a compile time expression that calculates the size. The formula I came up with is:
((sizeof(int) / 2) * 3 + sizeof(int)) + 2
It is marginally successful with n byte integers based on hand calculating it.
sizeof(int)    INT_MAX                  characters    formula
2              32767                     5             7
4              2147483647               10            12
8              9223372036854775807      19            22
You're looking for a result related to a logarithm of the maximum value of the integer type in question (which logarithm depends on the radix of the representation whose digits you want to count). You cannot compute exact logarithms at compile time, but you can write macros that estimate them closely enough for your purposes, or that compute a close enough upper bound for your purposes. For example, see How to compute log with the preprocessor.
It is useful also to know that you can convert between logarithms in different bases by multiplying by appropriate constants. In particular, if you know the base-a logarithm of a number and you want the base-b logarithm, you can compute it as
logb(x) = loga(x) / loga(b)
Your case is a bit easier than the general one, though. For the dimension of an array that is not a variable-length array, you need an "integer constant expression". Furthermore, your result does not need more than two digits of precision (three if you wanted the number of binary digits) for any built-in integer type you'll find in a C implementation, and it seems like you need only a close enough upper bound.
Moreover, you get a head start from the sizeof operator, which can appear in integer constant expressions and which, when applied to an integer type, gives you an upper bound on the base-256 logarithm of values of that type (supposing that CHAR_BIT is 8). This estimate is very tight if every bit is a value bit, but signed integers have a sign bit, and they may have padding bits as well, so this bound is a bit loose for them.
If you want a bound on the number of digits in a power-of-two radix then you can use sizeof pretty directly. Let's suppose, though, that you're looking for the number of decimal digits. Mathematically, the maximum number of digits in the decimal representation of an int is
N = ceil(log10(INT_MAX))
or
N = floor(log10(INT_MAX)) + 1
provided that INT_MAX is not a power of 10. Let's express that in terms of the base-256 logarithm:
N = floor( log256(INT_MAX) / log256(10) ) + 1
Now, log256(10) cannot be part of an integer constant expression, but it or its reciprocal can be pre-computed: 1 / log256(10) = 2.40824 (to a pretty good approximation; the actual value is slightly less). Now, let's use that to rewrite our expression:
N <= floor( sizeof(int) * 2.40824 ) + 1
That's not yet an integer constant expression, but it's close. This expression is an integer constant expression, and a good enough approximation to serve your purpose:
N = 241 * sizeof(int) / 100 + 1
Here are the results for various integer sizes:
sizeof(int)    INT_MAX                True N    Computed N
1              127                     3         3
2              32767                   5         5
4              2147483647             10        10
8              ~9.223372037e+18       19        20
(The values in the INT_MAX and True N columns suppose one of the allowed forms of signed representation, and no padding bits; the former and maybe both will be smaller if the representation contains padding bits.)
I presume that in the unlikely event that you encounter a system with 8-byte ints, the extra one byte you provide for your digit array will not break you. The discrepancy arises from the difference between having (at most) 63 value bits in a signed 64-bit integer, and the formula accounting for 64 value bits in that case, with the result that sizeof(int) is a bit too much of an overestimation of the base-256 log of INT_MAX. The formula gives exact results for unsigned int up to at least size 8, provided there are no padding bits.
As a macro, then:
// Expands to an integer constant expression evaluating to a close upper bound
// on the number of decimal digits in a value expressible in the integer type
// given by the argument (if it is a type name) or the integer type of the
// argument (if it is an expression). The meaning of the resulting expression
// is unspecified for other arguments.
#define DECIMAL_DIGITS_BOUND(t) (241 * sizeof(t) / 100 + 1)
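A usage sketch (on a hosted system, just to check the bound; INT_MIN is written out as a literal because the original constraint was that the standard headers are unavailable, and the example assumes a 32-bit int):

#include <stdio.h>

#define DECIMAL_DIGITS_BOUND(t) (241 * sizeof(t) / 100 + 1)

int main(void)
{
    // +2 leaves room for a minus sign and the terminating '\0'.
    char buf[DECIMAL_DIGITS_BOUND(int) + 2];

    sprintf(buf, "%d", -2147483647 - 1);   // INT_MIN, assuming 32-bit int
    printf("%s (buffer size %zu)\n", buf, sizeof buf);
    return 0;
}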
An upper bound on the number of decimal digits an int may produce depends on INT_MIN.
// Mathematically
max_digits = ceil(log10(-INT_MIN))
It is easier to use the bit-width of the int as that approximates a log of -INT_MIN. sizeof(int)*CHAR_BIT - 1 is the max number of value bits in an int.
// Mathematically
max_digits = ceil((sizeof(int)*CHAR_BIT - 1)* log10(2))
// log10(2) --> ~ 0.30103
On rare machines, int has padding, so the above will overestimate.
For log10(2), which is about 0.30103, we could use 1/3 or one-third.
As a macro, perform integer math and add 1 for the ceiling
#include <limits.h>
#define INT_DIGIT10_WIDTH ((sizeof(int)*CHAR_BIT - 1)/3 + 1)
To account for a sign and the null character, add 2. The following uses a very tight log10(2) fraction so as not to overestimate the buffer needs:
#define INT_STRING_SIZE ((sizeof(int)*CHAR_BIT - 1)*28/93 + 3)
Note 28/93 = 0.3010752... > log10(2)
The size needed for any base down to base 2 follows below. It is interesting that +2 is needed and not +1: consider that a 2-bit signed number in base 2 could be "-10", which needs a size of 4.
#define INT_STRING2_SIZE (sizeof(int)*CHAR_BIT + 2)
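For example (again on a hosted system with a 32-bit int, just to sanity-check the bound), INT_STRING_SIZE comes out as 12, which is exactly enough for "-2147483648" plus the terminating null:

#include <limits.h>
#include <stdio.h>

#define INT_STRING_SIZE ((sizeof(int)*CHAR_BIT - 1)*28/93 + 3)

int main(void)
{
    char buf[INT_STRING_SIZE];           // 12 bytes when int is 32 bits
    sprintf(buf, "%d", INT_MIN);
    printf("%s fits in %zu bytes\n", buf, sizeof buf);
    return 0;
}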
Boringly, I think you need to hardcode this, centred around inspecting sizeof(int) and consulting your compiler documentation to see what kind of int you actually have. (All the C standard specifies is that it can't be smaller than a short, and needs to have a range of at least -32767 to +32767, and 1's complement, 2's complement, and signed magnitude can be chosen. The manner of storage is arbitrary although big and little endianness are common.) Note that an arbitrary number of padding bits are allowed, so you can't, in full generality, impute the number of decimal digits from the sizeof.
C doesn't support the level of compile time evaluable constant expressions you'd need for this.
So hardcode it and make your code intentionally brittle so that compilation fails if a compiler encounters a case that you have not thought of.
You could solve this in C++ using constexpr and metaprogramming techniques.
((sizeof(int) / 2) * 3 + sizeof(int)) + 2
is the formula I came up with.
The +2 is for the negative sign and the null terminator.
If we suppose that integral values are either 2, 4, or 8 bytes, and if we determine the respective digits to be 5, 10, 20, then an integer constant expression yielding the exact values could be written as follows:
const int digits = (sizeof(int)==8) ? 20 : ((sizeof(int)==4) ? 10 : 5);
int testArray[digits];
I hope that I did not miss something essential. I've tested this at file scope.

Multiplication of two 32 bit numbers using only 8 bit numbers

I saw this interview question online and can't find a good method other than the usual additive methods.
Any suggestions if this can be done quicker using some bitshift / recursion or something similar ?
Bitshifting would be natural part of a solution.
To multiply a value a by an eight-bit value b, for each 1 bit in b, add up all the values of a multiplied by b with all other bits set to 0. For example, a * 10100001 = a * 10000000 + a * 00100000 + a * 00000001.
Taking this further, suppose we want to multiply 11001011 by 00010000, this is 11001011(bin) << 4(dec). Doing this on an eight-bit value gives you 10110000. You have also lost (8-4)=4 bits from the beginning. Hence you would also want to do 11001011(bin) >> 4(dec) to get 00001100 as a carry into the next "8-bit column" (assuming that we are using 8 columns to represent a 64-bit answer).
Recursion would not really be necessary. All you'd need is a loop through the 4 bytes of the first 32-bit number, with another loop through the 4 bytes of the second number inside, multiplying each pair of bytes together in turn and adding it to your solution.
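A sketch of that byte-by-byte approach (the names are mine; it assumes the 8x8 products may be held in a 16-bit temporary, which is what an 8-bit multiply with a high/low result pair gives you on most small CPUs):

#include <stdio.h>
#include <stdint.h>

// Multiply two 32-bit values using only 8-bit pieces of the operands:
// each of the 4*4 byte products is added into the proper "8-bit column"
// of an 8-byte (64-bit) result, propagating carries to the left.
static void mul32x32(uint32_t a, uint32_t b, uint8_t result[8])
{
    uint8_t x[4], y[4];
    for (int i = 0; i < 4; i++) {
        x[i] = (uint8_t)(a >> (8 * i));   // little-endian byte split
        y[i] = (uint8_t)(b >> (8 * i));
        result[i] = result[i + 4] = 0;
    }
    for (int i = 0; i < 4; i++) {
        for (int j = 0; j < 4; j++) {
            unsigned carry = (unsigned)x[i] * y[j];   // 8x8 -> 16-bit product
            for (int k = i + j; k < 8 && carry; k++) {
                carry += result[k];
                result[k] = (uint8_t)carry;   // low byte stays in this column
                carry >>= 8;                  // the rest moves one column left
            }
        }
    }
}

int main(void)
{
    uint8_t r[8];
    mul32x32(0xFFFFFFFFu, 0xFFFFFFFFu, r);
    for (int k = 7; k >= 0; k--)              // prints fffffffe00000001
        printf("%02x", (unsigned)r[k]);
    printf("\n");
    return 0;
}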

Printing multiple integers as one arbitrarily long decimal string

Say I have 16 64-bit unsigned integers. I have been careful to feed the carry as appropriate between them when performing operations. Could I feed them into a method to convert all of them into a single string of decimal digits, as though it was one 1024-bit binary number? In other words, is it possible to make a method that will work for an arbitrary number of integers that represent one larger integer?
I imagine that it would be more difficult for signed integers, as there is the most significant bit to deal with. I suppose it would be that the most significant integer would be the signed integer, and the rest would be unsigned, to represent the remaining 'parts' of the number.
(This is semi-related to another question.)
You could use the double dabble algorithm, which circumvents the need for multi-precision multiplication and division. In fact, the Wikipedia page contains a C implementation for this algorithm.
This is a bit unclear.
Of course, a function such as
void print_1024bit(uint64_t digits[]);
could be written to do this. But if you mean if any of the standard library's printf()-family of functions can do this, then I think the answer is no.
As you probably saw in the other question, the core of converting a binary number into a different base b is made of two operations:
Modulo b, to figure out the current least significant digit
Division by b, to remove that digit once it's been generated
When applied until the number is 0, this generates all the digits in reverse order.
So, you need to implement "modulo 10" and "divide by 10" for your 1024-bit number.
For instance, consider the decimal number 4711, which we want to convert to octal just for this example:
4711 % 8 is 7, so the right-most digit is 7
4711 / 8 is 588
588 % 8 is 4, the next digit is 4
588 / 8 is 73
73 % 8 is 1
73 / 8 is 9
9 % 8 is 1
9 / 8 is 1
1 % 8 is 1
1 / 8 is 0, we're done.
So, reading the remainder digits from the bottom up towards the right-most digit, we conclude that 4711 (decimal) = 11147 (octal). You can use a calculator to verify this, or just trust me. :)
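Here is a sketch of those two operations combined into a repeated divide-by-10 over an array of limbs. I've used 32-bit limbs so the intermediate fits in a uint64_t; with the 64-bit limbs from the question you would need either a 128-bit intermediate or to split each limb in two. The helper names are mine:

#include <stdio.h>
#include <stdint.h>

// Divide a little-endian array of 32-bit limbs by 10 in place,
// returning the remainder (the next decimal digit).
static unsigned divmod10(uint32_t *limbs, size_t n)
{
    uint64_t rem = 0;
    for (size_t i = n; i-- > 0; ) {              // most significant limb first
        uint64_t cur = (rem << 32) | limbs[i];
        limbs[i] = (uint32_t)(cur / 10);
        rem = cur % 10;
    }
    return (unsigned)rem;
}

// Print the whole multi-limb number in decimal (destroys its value).
static void print_decimal(uint32_t *limbs, size_t n)
{
    char buf[128];                               // plenty for a few limbs
    size_t len = 0;
    int nonzero;
    do {
        buf[len++] = (char)('0' + divmod10(limbs, n));
        nonzero = 0;
        for (size_t i = 0; i < n; i++)
            if (limbs[i]) { nonzero = 1; break; }
    } while (nonzero);
    while (len--)                                // digits came out in reverse
        putchar(buf[len]);
    putchar('\n');
}

int main(void)
{
    uint32_t limbs[3] = { 0, 0, 1 };             // the value 2^64, little-endian
    print_decimal(limbs, 3);                     // prints 18446744073709551616
    return 0;
}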
It's possible, of course, but not terribly straight-forward.
Rather than reinventing the wheel, how about reusing a library?
The GNU Multi Precision Arithmetic Library is one such possibility. I've not needed such things myself, but it seems to fit your bill.
