Can someone explain why GMP mpz_sizeinbase returns 1 too big? - c

I'm using the GMP library to write a C program with a line like
len = mpz_sizeinbase(res, 10);
When res = 9, it gives me 2. So I checked the manual and it says
size_t mpz_sizeinbase (mpz_t op, int base)
Return the size of op measured in number of digits in the given base. base can vary from 2 to 62. The sign of op is ignored, just the absolute value is used. The result will be either exact or 1 too big. If base is a power of 2, the result is always exact. If op is zero the return value is always 1.
I just want to know why this function is designed with this slack. Why CAN'T it be exact?
Some similar questions I found:
GMP mpz_sizeinbase returns size 2 for 9 in base 10
Number of digits of GMP integer

mpz_sizeinbase does not look at the whole number but only at the highest word. It then estimates the size. The problem is that it might be looking at 999999999 or 1000000000; to know exactly which of the two it is, all bits of the number would have to be examined. What mpz_sizeinbase does is (using word == digit for the example) compute the size for 9xxxxxxxx. The xxxxxxxx part is ignored, and a carry out of it could overflow into the first digit. So the size is increased by one and returned.
This lets you allocate enough space for converting the number quickly, with only minimal waste in some cases. The alternative would be to convert the whole number just to get the size, allocate the buffer, and then do it all over again to actually store the result.
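To make the behaviour concrete, here is a rough sketch (not GMP's actual code, and for a plain unsigned long rather than mpz_t) of estimating the base-10 size from the bit length alone, the way the manual describes: exact or 1 too big, never too small.

```c
/* Rough sketch (not GMP's actual code): estimate the number of base-10
   digits of n from its bit length alone, without inspecting the low bits.
   Like mpz_sizeinbase, the result is exact or 1 too big, and 0 maps to 1. */
static int size_in_base10(unsigned long n)
{
    int bits = 0;
    while (n) {                      /* count the bit length of n */
        n >>= 1;
        bits++;
    }
    /* log10(2) ~= 0.30103; scale the bit count, then add 1 digit. */
    return (int)(bits * 0.30103) + 1;
}
```

For n = 9 this returns 2, which is exactly the overestimate asked about, while for n = 10 it returns 2, which is exact.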

Related

Multiplication of 2 numbers with a maximum of 2000 digits [duplicate]

This question already has answers here:
What is the simplest way of implementing bigint in C?
(5 answers)
How can I compute a very big digit number like (1000 digits ) in c , and print it out using array
(4 answers)
Store very big numbers in an integer in C
(2 answers)
Closed 3 months ago.
Implement a program to multiply two numbers, with the mention that the first can have a maximum of 2048 digits, and the second number is less than 100. HINT: multiplication can be done using repeated additions.
Up to a certain point, the program works using long double, but when working with larger numbers, only INF is displayed. Any ideas?
Implement a program to multiply two numbers, with the mention that the first can have a maximum of 2048 digits, and the second number is less than 100.
OK. The nature of multiplication is that if a number with N bits is multiplied by a number with M bits, then the result will have up to N+M bits. In other words, you need to handle a result that has 2148 bits.
A long double could be anything (it's implementation dependent). Most likely (Windows or not 80x86) is that it's a synonym for double, but sometimes it might be larger (e.g. the 80-bit format described on this Wikipedia page ). The best you can realistically hope for is a dodgy estimate with lots of precision loss and not a correct result.
The worst case (the most likely case) is that the exponent isn't big enough either. E.g. for double the (unbiased) exponent has to be in the range −1022 to +1023 so attempting to shove a 2048 bit number in there will cause an overflow (an infinity).
What you're actually being asked to do is implement a program that uses "big integers". The idea would be to store the numbers as arrays of integers, like uint32_t result[2148/32];, so that you actually do have enough bits to get a correct result without precision loss or overflow problems.
With this in mind, you want a multiplication algorithm that can work with big integers. Note: I'd recommend something from that Wikipedia page's "Algorithms for multiplying by hand" section - there's faster/more advanced algorithms that are way too complicated for (what I assume is) a university assignment.
Also, the "HINT: multiplication can be done using repeated additions" is a red herring to distract you. It'd take literally days for a computer to do the equivalent of a while(source2 != 0) { result += source1; source2--; } with large numbers.
Here's a few hints.
Multiplying a 2048-digit string by a 100-digit string might yield a string with as many as 2148 digits. That's too big for any primitive C type, so you'll have to do all the math the hard way against "strings". Staying in string space makes sense anyway, since your input will most likely be read in as a string.
Let's say you are trying to multiply "123456" x "789".
That's equivalent to 123456 * (700 + 80 + 9)
Which is equivalent to 123456 * 700 + 123456 * 80 + 123456 * 9
Which is equivalent to doing these steps:
result1 = Multiply 123456 by 7 and add two zeros at the end
result2 = Multiply 123456 by 8 and add one zero at the end
result3 = Multiply 123456 by 9
final result = result1+result2+result3
So all you need is a handful of primitives that can take a digit string of arbitrary length and do some math operations on it.
You just need these three functions:
// Returns a new string that is identical to s but with a specific number of
// zeros added to the end.
// e.g. MultiplyByPowerOfTen("123", 3) returns "123000"
char* MultiplyByPowerOfTen(char* s, size_t zerosToAdd)
{
}
// Performs multiplication on the big integer represented by s
// by the specified digit
// e.g. Multiply("12345", 2) returns "24690"
char* Multiply(char* s, int digit) // where digit is between 0 and 9
{
}
// Performs addition on the big integers represented by s1 and s2
// e.g. Add("12345", "678") returns "13023"
char* Add(char* s1, char* s2)
{
}
Final hint. Any character at position i in your string can be converted to its integer equivalent like this:
int digit = s[i] - '0';
And any digit can be converted back to a printable char:
char c = '0' + digit;
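As a sketch of how the Multiply primitive above might look (school method, right to left with a carry; error handling omitted, and the caller frees the returned string):

```c
#include <stdlib.h>
#include <string.h>

// One possible sketch of the Multiply primitive: multiplies the digit
// string s by a single digit 0-9, school style, and returns a freshly
// malloc'd digit string that the caller must free.
char* Multiply(const char* s, int digit)
{
    size_t len = strlen(s);
    char* out = malloc(len + 2);        // room for one extra digit + NUL
    size_t pos = len + 1;
    int carry = 0;
    out[pos] = '\0';
    for (size_t i = len; i > 0; i--) {  // right to left, carrying as we go
        int t = (s[i - 1] - '0') * digit + carry;
        out[--pos] = '0' + t % 10;
        carry = t / 10;
    }
    out[--pos] = '0' + carry;
    // Drop a leading zero unless the result is exactly "0".
    if (out[0] == '0' && out[1] != '\0')
        memmove(out, out + 1, len + 1);
    return out;
}
```

Multiply("12345", 9) then yields "111105", and Add and MultiplyByPowerOfTen follow the same pattern.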

Why is log base 10 used in this code to convert int to string?

I saw a post explaining how to convert an int to a string. In the explanation there is a line of code to get the number of chars in a string:
(int)((ceil(log10(num))+1)*sizeof(char))
I’m wondering why log base 10 is used?
ceil(log10(num))+1 is incorrectly being used instead of floor(log10(num))+2.
The code is attempting to determine the amount of memory needed to store the decimal representation of the positive integer num as a string.
The two formulas presented above are equal except for numbers which are exact powers of 10, in which case the former version returns one less than the desired number.
For example, 10,000 requires 6 bytes, yet ceil(log10(10000))+1 returns 5. floor(log10(10000))+2 correctly returns 6.
How was floor(log10(num))+2 obtained?
A 4-digit number such as 4567 will be between 1,000 (inclusive) and 10,000 (exclusive), so it will be between 10³ (inclusive) and 10⁴ (exclusive), so log10(4567) will be between 3 (inclusive) and 4 (exclusive).
As such, floor(log10(num))+1 will return number of digits needed to represent the positive value num in decimal.
As such, floor(log10(num))+2 will return the amount of memory needed to store the decimal representation of the positive integer num as a string. (The extra char is for the NUL that terminates the string.)
I’m wondering why log base 10 is used?
I'm wondering the same thing. It uses a very complex calculation that happens at runtime, to save a couple bytes of temporary storage. And it does it wrong.
In principle, you get the number of digits in base 10 by taking the base-10 logarithm and flooring and adding 1. It comes exactly from the fact that
log10(1) = log10(10⁰) = 0
log10(10) = log10(10¹) = 1
log10(100) = log10(10²) = 2
and all numbers between 10 and 100 have their logarithms between 1 and 2 so if you floor the logarithm for any two digit number you get 1... add 1 and you get the number of digits.
But you do not need to do this at runtime. The maximum number of bytes needed for a 32-bit int in base 10 is 10 digits, negative sign and null terminator for 12 chars. The maximum you can save with the runtime calculation are 10 bytes of RAM, but it is usually temporary so it is not worth it. If it is stack memory, well, the call to log10, ceil and so forth might require far more.
In fact, we know the maximum number of bits needed to represent an integer: sizeof (int) * CHAR_BIT. This is greater than or equal to log2 of INT_MAX + 1. And we know that log2(x) =~ 3.32192809489 * log10(x), so we get a good (possibly floored) approximation of log10(INT_MAX) by just dividing sizeof (int) * CHAR_BIT by 3. Then add 1 because we were supposed to add 1 to the floored logarithm to get the number of digits, then 1 for a possible sign, and 1 for the null terminator, and we get
sizeof (int) * CHAR_BIT / 3 + 3
Unlike the one from your question, this is an integer constant expression, i.e. the compiler can easily fold it at the compilation time, and it can be used to set the size of a statically-typed array, and for 32-bits it gives 13 which is only one more than the 12 actually required, for 16 bits it gives 8 which is again only one more than the maximum required 7 and for 8 bits it gives 5 which is the exact maximum.
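A minimal sketch of that constant in use (the macro and function names are mine, not from the question):

```c
#include <limits.h>
#include <stdio.h>

/* Digits + sign + NUL for any int, as an integer constant expression;
   for a 32-bit int this evaluates to 13. INT_DEC_SIZE is a made-up name. */
#define INT_DEC_SIZE (sizeof (int) * CHAR_BIT / 3 + 3)

/* Formats n into a caller-provided buffer of INT_DEC_SIZE chars and
   returns the number of characters written (excluding the NUL). */
int int_to_dec(int n, char buf[INT_DEC_SIZE])
{
    return snprintf(buf, INT_DEC_SIZE, "%d", n);
}
```

Because it is a constant expression, char buf[INT_DEC_SIZE]; is a plain statically sized array, with no runtime log10 call.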
ceil(log10(num)) + 1 is intended to provide the number of characters needed for the output string.
For example, if num=101, the expression's value is 4, the correct length of "101" plus the null terminator.
But if num=100, the value is 3. This behavior is incorrect.
This is because it's allocating enough space for the number to fit in the string.
If, for example, you had the number 1034, log10(1034) = 3.0145.... ceil(3.0145) is 4, which is the number of digits in the number. The + 1 is for the null-terminator.
This isn't perfect though: take 1000, for example. Despite having four digits, log10(1000) = 3, and ceil(3) = 3, so this will allocate space for too few digits. Plus, as @phuclv mentions below, the log() function is very time-consuming for this purpose, especially since the length of a number has a (relatively low) upper bound.
The reason it's log base 10 is because, presumably, this function represents the number in decimal form. If, for example, it were hexadecimal, log base 16 would be used.
A number N has n decimal digits iff 10^(n-1) <= N < 10^n which is equivalent to n-1 <= log(N) < n or n = floor(log(N)) + 1.
Since double representation has only limited precision, floor(log(N)) may be off by 1 for certain values, so it is safer to allow for an extra digit, i.e. allocate floor(log(N)) + 2 characters, and then another char for the nul terminator, for a total of floor(log(N)) + 3.
The expression in the original question ceil(log(N)) + 1 appears to not count the nul terminator, and neither allow for the chance of rounding errors, so it is one shorter in general, and two shorter for powers of 10.

Generating random numbers in ranges from 32 bytes of random data, without bignum library

I have 32 bytes of random data.
I want to generate random numbers within variable ranges between 0-9 and 0-100.
If I used an arbitrary precision arithmetic (bignum) library, and treated the 32 bytes as a big number, I could simply do:
random = random_source % range;
random_source = random_source / range;
as often as I liked (with different ranges) until the product of the ranges nears 2^256.
Is there a way of doing this using only (fixed-size) integer arithmetic?
Certainly you can do this by doing base-256 long division (or push-up multiplication). It is just like the long division you learnt in primary school, but with bytes instead of digits. It involves doing a cascade of divides and remainders for each byte in turn. Note that you also need to be aware of how you are consuming the big number, and that as you consume it and it becomes smaller, there is an increasing bias against the larger values in the range. E.g. if you only have 110 left and you ask for rnd(100), the values 0-9 would each be twice as likely as each of 10-99.
But, you don't really need the bignum techniques for this, you can use ideas from arithmetic encoding compression, where you build up the single number without actually ever dealing with the whole thing.
If you start by reading 4 bytes into a uint32_t buffer, it has a range 0..4294967295, a non-inclusive max of 4294967296. I will refer to this synthesised value as the "carry forward", and this exclusive max value is also important to record.
[For simplicity, you might start with reading 3 bytes to your buffer, generating a max of 16M. This avoids ever having to deal with the 4G value that can't be held in a 32 bit integer.]
There are 2 ways to use this, both with accuracy implications:
Stream down:
Do your modulo range. The modulo is your random answer. The division result is your new carry forward and has a smaller range.
Say you want 0..99, so you modulo by 100, your upper part has a range max 42949672 (4294967296/100) which you carry forward for the next random request
We can't feed another byte in yet...
Say you now want 0..9, so you modulo by 10, and now your upper part has a range 0..4294967 (42949672/10)
As max is less than 16M, we can now bring in the next byte. Multiply it by the current max 4294967 and add it to the carry forward. The max is also multiplied by 256 -> 1099511552
This method has a slight bias towards small values: 1 in "next max" times, the available range of values will not be the full range, because the last value is truncated. But by choosing to maintain 3-4 good bytes in max, that bias is minimised; it will occur at most 1 in 16 million times.
The computational cost of this algorithm is the div by the random range of both carry forward and max, and then the multiply each time you feed in a new byte. I assume the compiler will optimise the modulo
Stream up:
Say you want 0..99
Divide your max by range, to get the nextmax, and divide carryforward by nextmax. Now, your random number is in the division result, and the remainder forms the value you carry forward to get the next random.
When nextmax becomes less than 16M, simply multiply both nextmax and your carry forward by 256 and add in the next byte.
The downside of this method is that, depending on the division used to generate nextmax, the top value result (i.e. 99 or 9) is heavily biased against, OR sometimes you will generate the over-value (100) - this depends on whether you round up or down in the first division.
The computational cost here is again 2 divides, presuming the compiler optimiser blends div and mod operations. The multiply by 256 is fast.
In both cases you could choose to say that if the input carry forward value is in this "high bias range" then you will perform a different technique. You could even oscillate between the techniques - use the second in preference, but if it generates the over-value, then use the first technique, though on its own the likelihood is that both techniques will bias for similar input random streams when the carry forward value is near max. This bias can be reduced by making the second method generate -1 as the out-of-range, but each of these fixes adds an extra multiply step.
Note that in arithmetic encoding this overflow zone is effectively discarded as each symbol is extracted. It is guaranteed during decoding that those edge values won't happen, and this contributes to the slight suboptimal compression.
/* The 32 bytes in data are treated as a base-256 numeral following a "." (a
radix point marking where fractional digits start). This routine
multiplies that numeral by range, updates data to contain the fractional
portion of the product, and returns the integer portion.
8-bit bytes are assumed, or "t /= 256" could be changed to
"t >>= CHAR_BIT". But then you have to check the sizes of int
and unsigned char to consider overflow.
*/
int r(int range, unsigned char *data)
{
// Start with 0 carried from a lower position.
int t = 0;
// Iterate through each byte.
for (int i = 32; 0 < i;)
{
--i;
// Multiply next byte by our multiplier and add the carried data.
t = data[i] * range + t;
// Store the low bits of the result.
data[i] = t;
// Carry the high bits of the result to the next position.
t /= 256;
}
// Return the bits that carried out of the multiplication.
return t;
}
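For instance, seeding this with a buffer of all 0xFF bytes (a fraction just below 1) and drawing twice; r() is repeated here so the snippet stands alone:

```c
/* r() from the answer above, repeated so this snippet stands alone:
   multiply the base-256 fraction in data by range, keep the fractional
   part in data, and return the integer part. */
static int r(int range, unsigned char *data)
{
    int t = 0;
    for (int i = 32; i-- > 0;)
    {
        t = data[i] * range + t;    /* next byte times range, plus carry */
        data[i] = (unsigned char)t; /* low 8 bits stay as the fraction */
        t /= 256;                   /* high bits carry to the next byte */
    }
    return t;
}
```

With every byte 0xFF the stored fraction is 1 − 2⁻²⁵⁶, so r(10, data) returns 9 and leaves the remaining fraction in data for the next draw.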

Declaring the array size in C

It's quite embarrassing, but I really want to know... I needed to make a conversion program that converts decimal (base 10) to binary and hex. I used arrays to store values and everything worked out fine, but I declared the array as int arr[1000]; because I thought 1000 was just an OK number, not too big, not too small... Someone in class said "why would you declare an array of 1000? Integers are 32 bits". I was too embarrassed to ask what that meant, so I didn't say anything. But does this mean that I can just declare the array as int arr[32]; instead? I'm using C, btw.
No. The int type typically has a 32-bit size, but when you declare
int arr[1000];
you are reserving space for 1000 integers, i.e. 32,000 bits, while with
int arr[32];
you can store up to 32 integers.
You are practically asking yourself a question like this: if an apple weighs 32 grams, do I want my bag to contain 1000 apples or 32 apples?
Don't be embarrassed. Fear is your enemy and in the end you will be perceived based on contexts that you have no hope of significantly influencing. Anyway, to answer your question, your approach is incorrect. You should declare the array with a size completely determined by the number of positions used.
Concretely, if you access the array at 87 distinct positions (from 0 to 86) then you need a size of 87.
0 to 4,294,967,295 is the maximum possible range of numbers you can store in 32 bits. If your number is outside this range, you cannot store it in 32 bits. Since each bit occupies one index of your array, an array size of 32 will do fine as long as your number falls in that range. For example, the number 9 would be stored in the array as a[] = {1,0,0,1}.
To know the range of numbers representable in n bits, the formula is 0 to (2^n - 1). So with an array of size 4 (i.e. 4 bits) you can only store numbers in the range 0 to 15.
In C, the integer datatype can typically store up to 2,147,483,647, or 4,294,967,295 if you are using unsigned integers. Since the maximum value an int can store in C fits within 32 bits, it is safe to say that an array size of 32 is enough here: you will never require more than 32 bits to express the binary digits of an int.
I would use
int a = 42;
char bin[sizeof a * CHAR_BIT + 1];
char hex[sizeof a * CHAR_BIT / 4 + 1];
I think this covers all possibilities.
Consider also that the 'int' type is ambiguous. It generally depends on the machine you're working on; at minimum its range is -32767 to +32767:
https://en.wikipedia.org/wiki/C_data_types
May I suggest using the stdint types?
int32_t/uint32_t
What you did is okay. If that is precisely what you want to do. C is a language that lets you do whatever you want. Whenever you want. The reason you were berated on the declaration is because of 'hogging' memory. The thought being, how DARE YOU take up space that is possibly never used... it is inefficient.
And it is. But who cares if you just want to run a program that has a simple purpose? A 1000 16 or 32 bit block of memory is weeeeeensy teeeeny tiny compared to computers from the way back times when it was necessary to watch over how much RAM you were taking up. So - go ahead.
But what they should have said next is how to avoid that. More on that at the end - but first a thing about built in data types in C.
An int can be 16 or 32 bits, depending on the platform and your compiler's settings...
A long int is at least 32.
consider:
short int x = 10; // declares an integer that is at least 16 bits
signed int x = 10; // typically a 32 bit integer with negative and positive range
unsigned int x = 10; // same size integer - but only 0 to positive values
To guarantee at least a 32 bit integer, declare it 'long':
long int x = 10; // at least 32 bits
unsigned long int x = 10; // at least 32 bits, 0 to positive values
Typical nomenclature is to call a 16 bit value a WORD and a 32 bit value a DWORD - (double word). But why would you want to type in:
long int x = 10;
instead of:
int x = 10;
?? For a few reasons. Some compilers may handle int as a 16-bit WORD if keeping up with older standards. But the only real reason is to maintain a convention of strongly typed code. Make it read directly what you intend it to do. This also helps readability: you will KNOW, when you see it, what size it is for sure, and be reminded whilst coding. Many, many code mishaps happen for lack of attention to code practices and naming things well. Save yourself hours of headache later on by learning good habits now. Create YOUR OWN style of coding. Take a look at other styles just to get an idea of what the industry may expect. But in the end you will find your own way in it.
On to the array issue ---> So, I expect you know that the array takes up memory right when the program runs. Right then, wham - the RAM for that array is set aside just for your program. It is locked out from use by any other resource, service, etc the operating system is handling.
But wouldn't it be neat if you could just use the memory you needed when you wanted, and then let it go when done? Inside the program - as it runs. So when your program first started, the array (so to speak) would be zero. And when you needed a 'slot' in the array, you could just add one.... use it, and then let it go - or add another - or ten more... etc.
That is called dynamic memory allocation. And it requires the use of a data type that you may not have encountered yet. Look up "Pointers in C" to get an intro.
If you are coding in regular C there are a few functions that assist in performing dynamic allocation of memory:
malloc and free ~ declared in the <stdlib.h> header
in C++ they are implemented differently. Look for:
new and delete
A common construct for handling dynamic 'arrays' is called a "linked-list." Look that up too...
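As a tiny taste of that (the function name is mine, and real code would check more errors), here is a digit buffer grown with realloc instead of a fixed int arr[1000]:

```c
#include <stdlib.h>

/* Hypothetical sketch: convert n to binary digits, least significant
   first, growing the array one slot at a time with realloc instead of
   declaring int arr[1000] up front.  The caller frees the result. */
int *to_binary(unsigned n, size_t *len)
{
    int *bits = NULL;
    *len = 0;
    do {
        int *tmp = realloc(bits, (*len + 1) * sizeof *bits);
        if (tmp == NULL) {      /* allocation failed: clean up */
            free(bits);
            return NULL;
        }
        bits = tmp;
        bits[(*len)++] = (int)(n % 2);  /* next binary digit */
        n /= 2;
    } while (n != 0);
    return bits;
}
```

to_binary(9, &len) yields {1, 0, 0, 1} with len == 4 - exactly the four slots 9 needs, no more.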
Don't let someone get you flustered with code concepts. Next time just say your program is designed to handle exactly what you have intended. That usually stops the discussion.
Atomkey

GMP most significant digits

I'm performing some calculations on arbitrary precision integers using GNU Multiple Precision (GMP) library. Then I need the decimal digits of the result. But not all of them: just, let's say, a hundred of most significant digits (that is, the digits the number starts with) or a selected range of digits from the middle of the number (e.g. digits 100..200 from a 1000-digit number).
Is there any way to do it in GMP?
I couldn't find any functions in the documentation to extract a range of decimal digits as a string. The conversion functions which convert mpz_t to character strings always convert the entire number. One can only specify the radix, but not the starting/ending digit.
Is there any better way to do it other than converting the entire number into a humongous string only to take a small piece of it and throw out the rest?
Edit: What I need is not to control the precision of my numbers or limit it to a particular fixed amount of digits, but selecting a subset of digits from the digit string of the number of arbitrary precision.
Here's an example of what I need:
7^1316831 = 19821203202357042996...2076482743
The actual number has 1112852 digits, which I contracted into the ....
Now, I need only an arbitrarily chosen substring of this humongous string of digits. For example, the ten most significant digits (1982120320 in this case). Or the digits from 1112841th to 1112849th (21203202 in this case). Or just a single digit at the 1112841th position (2 in this case).
If I were to first convert my GMP number to a string of decimal digits with mpz_get_str, I would have to allocate a tremendous amount of memory for these digits only to use a tiny fraction of them and throw out the rest. (Not to mention that the original mpz_t number in binary representation already eats up quite a lot.)
If you know the number of decimal digits of x = 7^1316831 in advance (here, 1112852), then you get your lower, say, 10 digits with:
x % (10^10), and the upper 20 digits with:
x / (10^(1112852 - 20)).
Note, I get 19821203202357042995 for the latter; 5 at final, not 6.
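In GMP terms that is one mpz_ui_pow_ui plus one mpz_tdiv_q or mpz_mod; here is the same arithmetic sketched with ordinary 64-bit integers on a small example (7^7 instead of 7^1316831):

```c
#include <stdint.h>

/* 10^e for small e (no overflow checking; illustration only). */
static uint64_t pow10u(unsigned e)
{
    uint64_t p = 1;
    while (e-- > 0)
        p *= 10;
    return p;
}

/* x has `digits` decimal digits.  Same divisions you would do on the mpz_t. */
static uint64_t top_digits(uint64_t x, unsigned digits, unsigned k)
{
    return x / pow10u(digits - k);   /* like mpz_tdiv_q by 10^(digits-k) */
}

static uint64_t low_digits(uint64_t x, unsigned k)
{
    return x % pow10u(k);            /* like mpz_mod by 10^k */
}
```

For x = 823543 (7^7, six digits), top_digits(x, 6, 2) gives 82 and low_digits(x, 2) gives 43.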
I don't think you can do that in GMP. However you can use Boost Multiprecision Library
Depending upon the number type, precision may be arbitrarily large (limited only by available memory), fixed at compile time (for example 50 or 100 decimal digits), or a variable controlled at run-time by member functions. The types are expression-template-enabled for better performance than naive user-defined types.
Emphasis mine
Another alternative is ttmath with the type ttmath::Big<e,m> that you can control the needed precision. Any fixed-precision types will work, provided that you only need the most significant digits, as they all drop the low significant digits like how float and double work. Those digits don't affect the high digits of the result, hence can be omitted safely. For instance if you need the high 20 digits then use a type that can store 20 digits and a little more, in order to provide enough data for correct rounding later
For demonstration let's take a simple example of 7^7 = 823543, where you only need the top 2 digits. Using a 4-digit type for calculation you'll get this
7⁵ = 16807 => round to 1681×10¹ and store
7⁵×7 = 1681×10¹×7 = 11767×10¹ ≈ 1177×10²
7⁵×7×7 = 1177×10²×7 = 8239×10²
As you can see the top digits are the same even without needing to get the full exact result. Calculating the full precision using GMP not only wastes a lot of time but also memory. Think about the amount of memory you need to store the result of another operation on 2 bigints to get the digits you want. By fixing the precision instead of leaving it at infinite you'll decrease the CPU and memory usage significantly.
If you need the 100th to 200th high-order digits then use a type that has enough room for 201 digits and a little more, and extract those 101 digits after the calculation. But this will be more wasteful, so you may want to change to an arbitrary-precision (or fixed-precision) type that uses a base that's a power of 10 for its limbs (I'm using GMP notation here). For example, if the type uses base 10⁹ then each limb represents 9 digits in the decimal output, and you can get an arbitrary decimal digit directly without any conversion from binary to decimal. That means zero waste for the string. I'm not sure which library uses base 10ⁿ, but you can look at Mini-Pi's implementation which uses base 10⁹, or write it yourself. This way it also works for efficiently getting the high digits.
See
How are extremely large floating-point numbers represented in memory?
What is the simplest way of implementing bigint in C?
