Can I use long double to compute integers in C?

I'm trying to write a factorial function to compute a large number (factorial(105)); the result has 169 digits, so I used long double, but there seems to be an error. Can't I use it like this?
#include <stdio.h>

long double factorial(long double n, long double base) {
    if (n <= 1) {
        return base;
    } else {
        return factorial(n - 1, n * base);
    }
}

int main(void) {
    printf("%.0Lf\n", factorial(25, 1)); /* this result is correct */
    printf("%.0Lf\n", factorial(26, 1));
    /* correct result is 403291461126605635584000000,
       but it prints    403291461126605635592388608 */
    return 0;
}

Back-of-the-envelope calculation: 25! is slightly more than 10²⁵; three orders of magnitude are roughly 10 bits (10³ ≈ 2¹⁰), so you would need roughly 83 bits of mantissa just to represent the result precisely.
Given that a long double, on platforms that support it, is generally 80 bits for the whole value (not just the mantissa!), apparently there's no way you have enough mantissa to perform calculations of that order of magnitude with integer precision.
However: the factorial here is a bit magic, as many of the factors contain powers of two, which just append binary zeroes on the right and do not require mantissa (they end up in the exponent). In particular:
Writing each factor of 25! = 2 · 3 · 4 · ... · 25 as a power of two times an odd part, the even factors 2, 4, 6, ..., 24 contribute the powers of two 2, 4, 2, 8, 2, 4, 2, 16, 2, 4, 2, 8 (2²² in total), while the odd parts of the factors from 3 to 25 are 3, 5, 3, 7, 9, 5, 11, 3, 13, 7, 15, 17, 9, 19, 5, 21, 11, 23, 3, 25. Hence
25! = 2²² · m
(m being the product of all the odd parts, namely m = 3¹⁰ · 5⁶ · 7³ · 11² · 13 · 17 · 19 · 23, so effectively the data we have to store in the mantissa)
Hence, our initial estimate exceeds the actual requirements by 22 bits.
It turns out that
log₂(m) = 10·log₂3 + 6·log₂5 + 3·log₂7 + 2·log₂11 + log₂13 + log₂17 + log₂19 + log₂23 ≈ 61.68
which is indeed just below the size of the mantissa of an 80-bit long double (64 bits). But when you multiply it by 26 (or rather, excluding the factor 2, which ends up in the exponent, by 13), you are adding log₂(13) ≈ 3.7 bits. 61.7 + 3.7 = 65.4, so from 26! onwards you no longer have the precision to perform your calculation exactly.
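To see the cutoff concretely, here is a small check I'd sketch (it assumes GCC/Clang's unsigned __int128 extension for the exact value, and an x87-style 80-bit long double; the last line may differ on other platforms):

#include <stdio.h>

/* Print an unsigned __int128 in decimal; printf has no conversion for it. */
static void print_u128(unsigned __int128 v) {
    char buf[40];
    int pos = sizeof buf - 1;
    buf[pos] = '\0';
    do {
        buf[--pos] = '0' + (int)(v % 10);
        v /= 10;
    } while (v != 0);
    puts(buf + pos);
}

int main(void) {
    unsigned __int128 exact = 1;  /* 26! fits comfortably in 128 bits */
    long double approx = 1.0L;
    for (int i = 2; i <= 26; i++) {
        exact *= (unsigned)i;
        approx *= i;
    }
    print_u128(exact);          /* 403291461126605635584000000 */
    printf("%.0Lf\n", approx);  /* 403291461126605635592388608: one rounding, off by 2^23 */
    return 0;
}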

Firstly, nobody knows what long double can or cannot represent. Specific properties of the format depend on the implementation.
Secondly, the x86 extended precision floating-point format has a 64-bit significand with an explicit leading 1, which means that it can represent contiguous integers in the ±2⁶⁴ range. Beyond that range, representable integers are non-contiguous (i.e. they begin to "skip" with wider and wider gaps). Your factorials fall far outside that range, meaning that it is perfectly expected that they won't be represented precisely.

Since 105! is a huge number that does not fit in a single word (or two of them), you want arbitrary-precision arithmetic, also known as bignums. Use Stirling's approximation to get an idea of how big 105! is, and read the Wikipedia page on factorials.
Standard C (read n1570 to check) doesn't have bignums natively, but you'll find many libraries for that.
I recommend GMPlib. BTW, some of its code is hand-written assembly for performance reasons (when coding bignum addition, you want to take advantage of add-with-carry machine instructions).
I recommend against writing your own bignum operations. The existing libraries use very clever algorithms (you'd need to do PhD-level work to get something better). If you try coding your own bignum library, it will probably be much worse than the competition (unless you spend years of work on it).
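For example, with GMP the whole computation is a single call to mpz_fac_ui, GMP's built-in factorial; compile with -lgmp:

#include <gmp.h>
#include <stdio.h>

int main(void) {
    mpz_t result;
    mpz_init(result);
    mpz_fac_ui(result, 105);            /* exact 105! */
    gmp_printf("105! = %Zd\n", result); /* prints all the digits */
    mpz_clear(result);
    return 0;
}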

Related

How are results rounded in floating-point arithmetic?

I wrote this code, which simply sums a list of n numbers, to practice floating-point arithmetic, and I don't understand this:
I am working with float, which means I have 7 decimal digits of precision. Therefore, if I do the operation 10002*10002 = 100040004, the result in float will be 100040000.000000, since I lose any digit beyond the 7th (the program still knows the exponent, as seen here).
If the input in this program is
3
10000
10001
10002
You will see, however, that when this program computes 30003*30003 = 900180009 we get 900180032.000000.
I understand this 32 appears because I am working with float, and my goal is not to make the program more precise but to understand why this is happening. Why is it 900180032.000000 and not 900180000.000000? Why does this decimal noise (32) appear in 30003*30003 and not in 10002*10002, even though the magnitudes of the numbers are the same? Thank you for your time.
#include <stdio.h>

#define MAX_SIZE 200

int main()
{
    int numbers[MAX_SIZE];
    int i, N;
    float sum = 0;
    float sumb = 0;
    float sumc = 0;

    printf("introduce n");
    scanf("%d", &N);
    printf("write %d numbers:\n", N);
    for (i = 0; i < N; i++) {
        scanf("%d", &numbers[i]);
    }

    int r = 0;
    while (r < N) {
        sum = sum + numbers[r];
        sumb = sumb + (numbers[r] * numbers[r]);
        printf("sum is %f\n", sum);
        printf("sumb is %f\n", sumb);
        r++;
    }
    sumc = (sum * sum);
    printf("sumc is %f\n", sumc);
}
As explained below, the computed result of multiplying 10,002 by 10,002 must be a multiple of eight, and the computed result of multiplying 30,003 by 30,003 must be a multiple of 64, due to the magnitudes of the numbers and the number of bits available for representing them. Although your question asks about “decimal noise,” there are no decimal digits involved here. The results are entirely due to rounding to multiples of powers of two. (Your C implementation appears to use the common IEEE 754 format for binary floating-point.)
When you multiply 10,002 by 10,002, the computed result must be a multiple of eight. I will explain why below. The mathematical result is 100,040,004. The nearest multiples of eight are 100,040,000 and 100,040,008. They are equally far from the exact result, and the rule used to break ties chooses the even multiple (100,040,000 is eight times 12,505,000, an even number, while 100,040,008 is eight times 12,505,001, an odd number).
Many C implementations use IEEE 754 32-bit basic binary floating-point for float. In this format, a number is represented as an integer M multiplied by a power of two, 2^e. The integer M must be less than 2²⁴ in magnitude. The exponent e may be from −149 to 104. These limits come from the numbers of bits used to represent the integer and the exponent.
So all float values in this format have the value M · 2^e for some M and some e. There are no decimal digits in the format, just an integer multiplied by a power of two.
Consider the number 100,040,004. The biggest M we can use is 16,777,215 (2²⁴ − 1). That is not big enough that we can write 100,040,004 as M · 2⁰. So we must increase the exponent. Even with 2², the biggest we can get is 16,777,215 · 2² = 67,108,860. So we must use 2³. And that is why the computed result must be a multiple of eight, in this case.
So, to produce a result for 10,002 · 10,002 in float, the computer uses 12,505,000 · 2³, which is 100,040,000.
In 30,003 · 30,003, the result must be a multiple of 64. The exact result is 900,180,009. 2⁵ is not enough, because 16,777,215 · 2⁵ = 536,870,880. So we need 2⁶, which is 64. The two nearest multiples of 64 are 900,179,968 and 900,180,032. In this case, the latter is closer (23 away versus 41 away), so it is chosen.
(While I have described the format as an integer times a power of two, it can also be described as a binary numeral with one binary digit before the radix point and 23 binary digits after it, with the exponent range adjusted to compensate. These are mathematically equivalent. The IEEE 754 standard uses the latter description. Textbooks may use the former description because it makes analyzing some of the numerical properties easier.)
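You can watch both the rounding and the spacing of representable values directly; here is a minimal sketch, assuming IEEE 754 binary32 float (link with -lm):

#include <stdio.h>
#include <math.h>

int main(void) {
    float a = 10002.0f * 10002.0f;  /* exact product: 100040004 */
    float b = 30003.0f * 30003.0f;  /* exact product: 900180009 */
    printf("%.1f\n", a);            /* 100040000.0, a multiple of 8 */
    printf("%.1f\n", b);            /* 900180032.0, a multiple of 64 */
    /* the distance to the next representable float shows the step size */
    printf("%.1f\n", nextafterf(a, HUGE_VALF) - a);  /* 8.0 */
    printf("%.1f\n", nextafterf(b, HUGE_VALF) - b);  /* 64.0 */
    return 0;
}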
Floating point arithmetic is done in binary, not in decimal.
Floats have 24 binary bits of precision in the significand: 23 explicitly stored bits plus one implicit leading 1 bit (the remaining bits of a 32-bit float are a sign bit and 8 exponent bits). This converts to approximately 7 decimal digits of precision.
The number you're looking at, 900180032, is already 9 digits long, so it makes sense that the last two digits (the 32) might be wrong. The rounding, like the arithmetic, is done in binary, so the reason for the difference in rounding can only be seen if you break things down into binary.
900180032 = 110101101001111010100001000000
900180000 = 110101101001111010100000100000
If you count from the first 1 to the last 1 in each of those numbers, that is how many significand bits it takes to store the number. 900180032 spans only 24 significand bits, which fits (a float stores 23 bits plus the implicit leading 1), while 900180000 spans 25 significand bits, which makes 900180000 impossible to store in a float. 900180032 is the closest number to the correct answer, 900180009, that a float can store.
In the other example
100040000 = 101111101100111110101000000
100040004 = 101111101100111110101000100
The correct answer, 100040004, spans 25 significand bits, too many for a float. The nearest number that fits (24 bits or fewer from first 1 to last 1) is 100040000, which spans only 21 bits.
For more on how floating-point arithmetic works, try http://steve.hollasch.net/cgindex/coding/ieeefloat.html

Subtracting 1 from 0 in 8 bit binary

I have an 8-bit int zero = 0b00000000; and an 8-bit int one = 0b00000001;
According to the rules of binary arithmetic,
0 - 1 = 1 (borrow 1 from the next significant bit).
So if I have:
int s = zero - one;
s = -1;
-1 = 0b11111111;
Where are all those 1s coming from? There is nothing to borrow, since all the bits in the zero variable are 0.
This is a great question and has to do with how computers represent integer values.
If you’re writing out a negative number in base ten, you just write out the regular number and then prefix it with a minus sign. But if you’re working inside a computer where everything needs to either be a zero or a one, you don’t have any minus signs. The question then comes up of how you then choose to represent negative values.
One popular way of doing this is to use signed two's complement form. The way this works is that you write the number using ones and zeros, except that the meaning of those ones and zeros differs from "standard" binary in how they're interpreted. Specifically, if you have a signed 8-bit number, the lower seven bits have their standard meanings of 2⁰, 2¹, 2², etc. However, the meaning of the most significant bit is changed: instead of representing 2⁷, it represents the value −2⁷.
So let’s look at the number 0b11111111. This would be interpreted as
−2⁷ + 2⁶ + 2⁵ + 2⁴ + 2³ + 2² + 2¹ + 2⁰
= −128 + 64 + 32 + 16 + 8 + 4 + 2 + 1
= −1
which is why this collection of bits represents -1.
There’s another way to interpret what’s going on here. Given that our integer only has eight bits to work with, we know that there’s no way to represent all possible integers. If you pick any 257 integer values, given that there are only 256 possible bit patterns, there’s no way to uniquely represent all these numbers.
To address this, we could alternatively say that we’re going to have our integer values represent not the true value of the integer, but the value of that integer modulo 256. All of the values we’ll store will be between 0 and 255, inclusive.
In that case, what is 0 − 1? It's −1, but if we take that value mod 256 and force it to be nonnegative, we get −1 ≡ 255 (mod 256). And how would you write 255 in binary? It's 0b11111111.
There’s a ton of other cool stuff to learn here if you’re interested, so I’d recommend reading up on signed and unsigned two’s-complement numbers.
As some exercises: what would -4 look like in this format? How about -9?
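If you want to check your answers, here is a small helper (a sketch using the fixed-width types from <stdint.h>) that prints the bit pattern of an 8-bit value:

#include <stdio.h>
#include <stdint.h>

/* Print the 8 bits of a value, most significant bit first. */
static void print_bits(uint8_t v) {
    for (int i = 7; i >= 0; i--)
        putchar(((v >> i) & 1) ? '1' : '0');
    putchar('\n');
}

int main(void) {
    int8_t s = 0 - 1;
    print_bits((uint8_t)s);          /* 11111111 */
    print_bits((uint8_t)(int8_t)-4); /* try predicting this one first */
    print_bits((uint8_t)(int8_t)-9); /* ...and this one */
    return 0;
}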
These aren't the only ways you can represent numbers in a computer, but they're probably the most popular. Some older computers used the balanced ternary number system (notably the Setun machine). There's also the one's complement format, which isn't super popular these days.
Zero minus one must give some number such that if you add one to it, you get zero. The only number you can add one to and get zero is the one represented in binary as all 1's. So that's what you get.
So long as you use any valid form of arithmetic, you get the same results. If there are eight cars and someone takes away three cars, the value you get for how many cars are left should be five, regardless of whether you do the math in binary, decimal, or any other kind of representation.
So any valid system of representation that supports the operations you are using with their normal meanings must produce the same result. When you take the representation for zero and perform the subtraction operation using the representation for one, you must get the representation such that when you add one to it, you get the representation for zero. Otherwise, the result is just wrong based on the definitions of addition, subtraction, zero, one, and so on.

The max number of digits in an int based on number of bits

So, I needed a constant value to represent the max number of digits in an int, and it needed to be calculated at compile time to pass into the size of a char array.
To add some more detail: the compiler/machine I'm working with supports a very limited subset of the C language, so none of the standard libraries work, as they rely on unsupported features. As such, I cannot use INT_MIN/INT_MAX, since I can neither include the header that defines them nor are they otherwise defined.
I need a compile time expression that calculates the size. The formula I came up with is:
((sizeof(int) / 2) * 3 + sizeof(int)) + 2
It is marginally successful with n-byte integers, based on hand-calculating it:
sizeof(int)   INT_MAX                characters   formula
2             32767                  5            7
4             2147483647             10           12
8             9223372036854775807    19           22
You're looking for a result related to a logarithm of the maximum value of the integer type in question (which logarithm depends on the radix of the representation whose digits you want to count). You cannot compute exact logarithms at compile time, but you can write macros that estimate them closely enough for your purposes, or that compute a close enough upper bound for your purposes. For example, see How to compute log with the preprocessor.
It is useful also to know that you can convert between logarithms in different bases by multiplying by appropriate constants. In particular, if you know the base-a logarithm of a number and you want the base-b logarithm, you can compute it as
log_b(x) = log_a(x) / log_a(b)
Your case is a bit easier than the general one, though. For the dimension of an array that is not a variable-length array, you need an "integer constant expression". Furthermore, your result does not need more than two digits of precision (three if you wanted the number of binary digits) for any built-in integer type you'll find in a C implementation, and it seems like you need only a close enough upper bound.
Moreover, you get a head start from the sizeof operator, which can appear in integer constant expressions and which, when applied to an integer type, gives you an upper bound on the base-256 logarithm of values of that type (supposing that CHAR_BIT is 8). This estimate is very tight if every bit is a value bit, but signed integers have a sign bit, and they may have padding bits as well, so this bound is a bit loose for them.
If you want a bound on the number of digits in a power-of-two radix, then you can use sizeof pretty directly. Let's suppose, though, that you're looking for the number of decimal digits. Mathematically, the maximum number of digits in the decimal representation of an int is
N = ceil(log₁₀(INT_MAX))
or
N = floor(log₁₀(INT_MAX)) + 1
provided that INT_MAX is not a power of 10. Let's express that in terms of the base-256 logarithm:
N = floor( log₂₅₆(INT_MAX) / log₂₅₆(10) ) + 1
Now, log₂₅₆(10) cannot be part of an integer constant expression, but it or its reciprocal can be pre-computed: 1 / log₂₅₆(10) ≈ 2.40824 (to a pretty good approximation; the actual value is slightly less). Now, let's use that to rewrite our expression:
N <= floor( sizeof(int) * 2.40824 ) + 1
That's not yet an integer constant expression, but it's close. This expression is an integer constant expression, and a good enough approximation to serve your purpose:
N = 241 * sizeof(int) / 100 + 1
Here are the results for various integer sizes:
sizeof(int)   INT_MAX               True N   Computed N
1             127                   3        3
2             32767                 5        5
4             2147483647            10       10
8             ~9.223372037e+18      19       20
(The values in the INT_MAX and True N columns suppose one of the allowed forms of signed representation, and no padding bits; the former and maybe both will be smaller if the representation contains padding bits.)
I presume that in the unlikely event that you encounter a system with 8-byte ints, the extra byte you provide for your digit array will not break anything. The discrepancy arises from the difference between having (at most) 63 value bits in a signed 64-bit integer and the formula accounting for 64 value bits in that case, with the result that sizeof(int) is a bit too much of an overestimate of the base-256 log of INT_MAX. The formula gives exact results for unsigned int up to at least size 8, provided there are no padding bits.
As a macro, then:
// Expands to an integer constant expression evaluating to a close upper bound
// on the number of decimal digits in a value expressible in the integer type
// given by the argument (if it is a type name) or in the integer type of the
// argument (if it is an expression). The meaning of the resulting expression
// is unspecified for other arguments.
#define DECIMAL_DIGITS_BOUND(t) (241 * sizeof(t) / 100 + 1)
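As a quick sanity check of the macro (a hypothetical test; the expected numbers assume a platform where int is 4 bytes):

#include <stdio.h>

#define DECIMAL_DIGITS_BOUND(t) (241 * sizeof(t) / 100 + 1)

int main(void) {
    /* The macro is an integer constant expression, so it can size an array;
       +2 leaves room for a minus sign and the terminating null. */
    char buf[DECIMAL_DIGITS_BOUND(int) + 2];
    printf("%zu\n", sizeof buf);                          /* 12 */
    printf("%d\n", sprintf(buf, "%d", -2147483647 - 1));  /* 11 */
    return 0;
}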
An upper bound on the number of decimal digits an int may produce depends on INT_MIN.
// Mathematically
max_digits = ceil(log10(-INT_MIN))
It is easier to use the bit width of the int, as that approximates the log of -INT_MIN. sizeof(int)*CHAR_BIT - 1 is the maximum number of value bits in an int.
// Mathematically
max_digits = ceil((sizeof(int)*CHAR_BIT - 1)* log10(2))
// log10(2) --> ~ 0.30103
On rare machines, int has padding bits, so the above will overestimate.
For log10(2), which is about 0.30103, we could use the slightly larger fraction 1/3.
As a macro, perform integer math and add 1 for the ceiling:
#include <limits.h>  /* CHAR_BIT */
#define INT_DIGIT10_WIDTH ((sizeof(int)*CHAR_BIT - 1)/3 + 1)
To account for a sign and the null character, add 2. The following uses a very tight fraction for log10(2) so as not to over-allocate the buffer:
#define INT_STRING_SIZE ((sizeof(int)*CHAR_BIT - 1)*28/93 + 3)
Note: 28/93 = 0.3010752... > log10(2).
The buffer size needed for any base down to base 2 follows below. It is interesting that +2 is needed and not +1: consider that a 2-bit signed number in base 2 could be "-10", which requires a size of 4 including the null terminator.
#define INT_STRING2_SIZE (sizeof(int)*CHAR_BIT + 2)
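A quick check of these macros (a sketch assuming CHAR_BIT is 8 and a 32-bit int):

#include <limits.h>
#include <stdio.h>

#define INT_DIGIT10_WIDTH ((sizeof(int)*CHAR_BIT - 1)/3 + 1)
#define INT_STRING_SIZE   ((sizeof(int)*CHAR_BIT - 1)*28/93 + 3)

int main(void) {
    char buf[INT_STRING_SIZE];          /* 12 bytes with a 32-bit int */
    printf("%zu %zu\n", INT_DIGIT10_WIDTH, sizeof buf);  /* 11 12 */
    printf("%d\n", sprintf(buf, "%d", INT_MIN));         /* 11 characters */
    return 0;
}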
Boringly, I think you need to hardcode this, centred around inspecting sizeof(int) and consulting your compiler documentation to see what kind of int you actually have. (All the C standard specifies is that it can't be smaller than a short, it needs a range of at least -32767 to +32767, and one's complement, two's complement, or sign-magnitude representation may be chosen. The manner of storage is arbitrary, although big and little endianness are common.) Note that an arbitrary number of padding bits is allowed, so you can't, in full generality, deduce the number of decimal digits from sizeof.
C doesn't support the level of compile time evaluable constant expressions you'd need for this.
So hardcode it and make your code intentionally brittle so that compilation fails if a compiler encounters a case that you have not thought of.
You could solve this in C++ using constexpr and metaprogramming techniques.
((sizeof(int) / 2) * 3 + sizeof(int)) + 2
is the formula I came up with.
The +2 is for the negative sign and the null terminator.
If we suppose that integer values are either 2, 4, or 8 bytes, and we determine the respective digit counts to be 5, 10, and 20, then an integer constant expression yielding the exact values could be written as follows (a const int would not do, since in C a const variable is not a constant expression):
#define DIGITS ((sizeof(int) == 8) ? 20 : ((sizeof(int) == 4) ? 10 : 5))
int testArray[DIGITS];
I hope that I did not miss something essential. I've tested this at file scope.

Printing multiple integers as one arbitrarily long decimal string

Say I have 16 64-bit unsigned integers, and I have been careful to propagate the carry between them as appropriate when performing operations. Could I feed them into a method that converts them into a single string of decimal digits, as though they were one 1024-bit binary number? In other words, is it possible to write a method that will work for an arbitrary number of integers that together represent one larger integer?
I imagine that it would be more difficult for signed integers, as there is the most significant bit to deal with. I suppose the most significant integer would be the signed one, and the rest unsigned, to represent the remaining 'parts' of the number.
(This is semi-related to another question.)
You could use the double dabble algorithm, which circumvents the need for multi-precision multiplication and division. In fact, the Wikipedia page contains a C implementation for this algorithm.
This is a bit unclear.
Of course, a function such as
void print_1024bit(uint64_t digits[]);
could be written to do this. But if you mean whether any of the standard library's printf()-family functions can do it, then I think the answer is no.
As you probably saw in the other question, the core of converting a binary number into a different base b is made of two operations:
Modulo b, to figure out the current least significant digit
Division by b, to remove that digit once it's been generated
When applied until the number is 0, this generates all the digits in reverse order.
So, you need to implement "modulo 10" and "divide by 10" for your 1024-bit number.
For instance, consider the number decimal 4711, which we want to convert to octal just for this example:
4711 % 8 is 7, so the right-most digit is 7
4711 / 8 is 588
588 % 8 is 4, the next digit is 4
588 / 8 is 73
73 % 8 is 1
73 / 8 is 9
9 % 8 is 1
9 / 8 is 1
1 % 8 is 1
1 / 8 is 0, we're done.
So, reading the remainders from the bottom up, we conclude that 4711₁₀ = 11147₈. You can use a calculator to verify this, or just trust me. :)
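Here is a sketch of those two operations applied to a multi-word number like the 16-word case above (it assumes GCC/Clang's unsigned __int128 extension for the two-word intermediate division; portable code would split each 64-bit word into halves instead):

#include <stdint.h>
#include <stdio.h>

#define WORDS 16  /* 16 x 64 bits = 1024 bits */

/* One step of "modulo 10" and "divide by 10": divide the little-endian
   multi-word number in place by 10 and return the remainder. */
static unsigned div10(uint64_t n[WORDS]) {
    unsigned rem = 0;
    for (int i = WORDS - 1; i >= 0; i--) {
        unsigned __int128 cur = ((unsigned __int128)rem << 64) | n[i];
        n[i] = (uint64_t)(cur / 10);
        rem  = (unsigned)(cur % 10);
    }
    return rem;
}

static int is_zero(const uint64_t n[WORDS]) {
    for (int i = 0; i < WORDS; i++)
        if (n[i] != 0) return 0;
    return 1;
}

void print_1024bit(uint64_t n[WORDS]) {  /* note: destroys n */
    char buf[340];                       /* a 1024-bit number has at most 309 digits */
    int pos = sizeof buf - 1;
    buf[pos] = '\0';
    do {
        buf[--pos] = (char)('0' + div10(n));
    } while (!is_zero(n));
    puts(buf + pos);
}

int main(void) {
    uint64_t n[WORDS] = { 18446744073709551615ULL, 1 }; /* 2^65 - 1 */
    print_1024bit(n);  /* 36893488147419103231 */
    return 0;
}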
It's possible, of course, but not terribly straight-forward.
Rather than reinventing the wheel, how about reusing a library?
The GNU Multiple Precision Arithmetic Library (GMP) is one such possibility. I've not needed such things myself, but it seems to fit the bill.
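With GMP, the 16-word case reduces to one call to mpz_import, which assembles a bignum from an array of words:

#include <gmp.h>
#include <stdint.h>
#include <stdio.h>

int main(void) {
    uint64_t words[16] = {0};
    words[0] = 0xFFFFFFFFFFFFFFFFULL;  /* least significant word */
    words[1] = 1;                      /* whole value: 2^65 - 1 */

    mpz_t n;
    mpz_init(n);
    /* 16 words, least-significant word first (-1), native byte order (0) */
    mpz_import(n, 16, -1, sizeof(uint64_t), 0, 0, words);
    gmp_printf("%Zd\n", n);            /* 36893488147419103231 */
    mpz_clear(n);
    return 0;
}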

Modulo operation in output

In most coding competitions where the output of a program is presumed to be very large, you are generally instructed to give the output modulo 1000000007 (or, in any case, a prime number). What is the significance of the modulus being prime, given that in many cases I find a non-prime number such as 100004 given instead?
A prime number is used for two reasons. One reason is that the integers modulo a prime form a mathematical field. Arithmetic in a field works in many ways like arithmetic on integers. This makes a field useful in certain contest problems where otherwise the sequences used might collapse in some cases: certain arithmetic might produce zero, other trivial results, or simpler results than desired, because the numbers involved happen upon a factor of the modulus, causing some reduction or elimination to occur.
Another reason is to compel the programmer to deal with arithmetic on integers of a certain size. If a composite number were used, then other techniques could be used that did not resort to arithmetic with large integers.
For example, suppose we want to know what 13² is modulo 35, but we have only a very small processor that cannot handle three-digit numbers, so it cannot compute 13² = 169.
Well, 35 = 5·7, and 13 is congruent to 3 modulo 5 and to 6 modulo 7. Instead of computing the square of 13, we can compute the squares of these residues, which tells us that 13² is congruent to 3² = 9 ≡ 4 modulo 5 and congruent to 6² = 36 ≡ 1 modulo 7. Combining these residues requires some additional knowledge (the Extended Euclidean Algorithm). For these particular numbers, we can multiply the mod-5 residue by 21 and the mod-7 residue by 15 to get 4·21 + 1·15 = 99. Reducing that modulo 35 yields 29, which is the answer (the residue of 13² modulo 35).
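The same trick in code (a toy sketch; the weights 21 and 15 are the ones derived above for the factors 5 and 7):

#include <stdio.h>

int main(void) {
    /* square 13 modulo 35 = 5*7 without ever computing 13*13 */
    int s5 = ((13 % 5) * (13 % 5)) % 5;  /* 3*3 = 9, which is 4 (mod 5) */
    int s7 = ((13 % 7) * (13 % 7)) % 7;  /* 6*6 = 36, which is 1 (mod 7) */
    int x  = (s5 * 21 + s7 * 15) % 35;   /* recombine via the CRT weights */
    printf("%d\n", x);                   /* 29, and indeed 169 % 35 == 29 */
    return 0;
}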
If the modulus is prime, this circumvention of the arithmetic is not available. A prime modulus essentially requires the use of arithmetic on numbers up to the square of the modulus (or time-consuming workarounds), but a composite modulus would allow the use of arithmetic on smaller numbers, up to just twice the modulus.
