I'm thinking of working with very large integers. One way I can think of is using Gmplib; I worked with small examples, but can it work with numbers like 2 ^ (2 ^ (2 ^ 1024))?
My question is how to represent such a big number, because (I'm not sure) even calculators might overflow.
I'm thinking of working with very large integers. One way I can think of is using Gmplib; I worked with small examples, but can it work with numbers like 2 ^ (2 ^ (2 ^ 1024))?
No. GMP has two operating modes: large integers and large floating-point numbers. The first one can only operate on numbers whose integer value can be fully represented in memory; the second is limited to exponents that can be represented within about 64 bits. The number you're describing does not fit within either of those limits. (The exponent alone is too large to fit into memory!)
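For a sense of scale, here is a rough sketch (my own example, not from the answer) of what GMP can comfortably do: 2^(2^20) takes roughly 128 KB, and the cost grows linearly with the exponent, so a number whose exponent itself has 2^1024 bits is hopeless.

/* A rough sketch, not from the question: GMP copes with 2^(2^20) easily,
 * but the exponent of mpz_ui_pow_ui must fit in an unsigned long and the
 * full result must fit in memory, so 2^(2^(2^1024)) is far out of reach. */
#include <gmp.h>
#include <stdio.h>

int main(void)
{
    mpz_t x;
    mpz_init(x);
    mpz_ui_pow_ui(x, 2, 1048576UL);  /* 2^(2^20): roughly 128 KB of limbs */
    printf("2^(2^20) has %lu bits\n", (unsigned long)mpz_sizeinbase(x, 2));
    mpz_clear(x);
    return 0;
}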
My approach: I'll try to reduce the noise by storing them as binary numbers / bitvectors, because it'll let me get away with more than one 2^ step.
It's not entirely clear what you're trying to say here or in the following paragraph, but what you're describing sounds like a typical multiprecision integer implementation. It's no different from what GMP does to store large integers, and it won't work for this application.
Numbers of the scale you're describing are not easy to work with. Whether you find a library to work with them or write one yourself, it'll likely need to be designed specifically for the purpose of operating on numbers with this particular structure. They're simply too large to do anything else with.
I would like to generate a nicely-mixed-up integer fingerprint of an arbitrary C string (s). Most C strings will consist of ASCII text characters:
I want very different fingerprints for similar strings, especially for such similar strings as "ab" and "ba";
I want it to be difficult to invert back from the fingerprint to the string (well, my string is typically longer than 32 bits, which means that many strings would map into the same integer), which means again that I want similar strings to yield very different codes;
I want to use the 32 bits available to me efficiently in the integer result,
I want the function source to be small
I want the function to be fast.
One usage is security (but not encryption) related. I can ask a user for a text password, convert it into an integer for storage, and later test whether this integer is correct. (I know I could store strings, but I don't want to. Guessing a 32-bit integer correctly is impossible if my program can slow down incorrect attempts to the point where brute force cannot work faster than password guessing. Another use of this function is as the start of a hash index function (mod array length) into an array.)
Alas, I am probably reinventing the wheel here. Such functions have probably been written a million times, and by people who are much more versed in cryptography. I don't need AES, of course, but something much more lightweight; the use is different.
My first thinking was (a rough sketch in C follows the list):
Mod 64 each character to take advantage of the ASCII text aspect. Now I have 6 bits; call this x.
I can place a 6-bit string into 5 locations in a 32-bit space, leaving 2 bits over.
Take the current string index position (0, 1, 2...), mod-5 it to determine where I want to start placing my x into my running integer result code, and XOR my x into this running-result integer.
Use the remaining 2 bits to increment a counter [mod 4 to prevent overflow] for each character processed.
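Here is my reading of that scheme as a rough C sketch (the function name and the details are mine, and it is shown only to make the idea concrete, not as a vetted hash):

#include <stdint.h>

/* 6-bit chunks XORed into one of five 6-bit slots chosen by the character
 * index (mod 5), plus a 2-bit character counter kept in the top two bits. */
uint32_t fingerprint(const char *s)
{
    uint32_t code = 0;
    uint32_t i;

    for (i = 0; s[i] != '\0'; i++) {
        uint32_t x = (unsigned char)s[i] % 64;      /* keep 6 bits per char */
        code ^= x << (6 * (i % 5));                 /* slots at bits 0,6,12,18,24 */
        code = (code & 0x3FFFFFFFu)                 /* bump counter in bits 30-31 */
             | ((((code >> 30) + 1) & 3u) << 30);
    }
    return code;
}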
Then I thought that bit operations may be computer-fast but take more source code. I can think of other choices: take each index position i and multiply it by an ASCII representation of each character [or the x from above], and call this y[i]. Now do the following:
Calculate the natural logarithm of the sum of the y (or this sum plus the running result), and just pretend that the first 32 bits of this result [maybe leaving off the first few bits], which are really a double, are an integer representation. I can XOR each bitint(log(y[i])) into the running integer result.
Or do it even cheaper: just add the y's, and then do the logarithm with the 32-bit pickoff just once at the end. Alternatively, run a sum of the y's through srand as a seed and grab a rand.
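A rough sketch of the cheaper log-based variant just described (again my own reading of it; it assumes a 64-bit double and is only meant to make the idea concrete):

#include <math.h>
#include <stdint.h>
#include <string.h>

/* Sum i * s[i] over the string, take log(), and reinterpret the top 32 bits
 * of the double as the fingerprint.  memcpy avoids strict-aliasing issues. */
uint32_t log_fingerprint(const char *s)
{
    double sum = 0.0, lg;
    uint64_t bits;
    size_t i;

    for (i = 0; s[i] != '\0'; i++)
        sum += (double)(i + 1) * (unsigned char)s[i];  /* y[i]; i+1 so s[0] counts */

    lg = log(sum + 1.0);               /* +1 avoids log(0) for empty strings */
    memcpy(&bits, &lg, sizeof bits);   /* assumes sizeof(double) == 8 */
    return (uint32_t)(bits >> 32);     /* "first" (most significant) 32 bits */
}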
There are probably a few other ways to do it, too. In sum, the function should map strings into very different integers, be short to code, and be very fast.
Any pointers?
A common method of generating a non-reversible digest or hash of a string is to generate a Cyclic Redundancy Checksum (CRC).
Source for CRC is widely available; in this case you should use a common CRC-32 such as that used by Ethernet. Different CRCs work on the same principle, but use different polynomials. Do not be tempted to invent your own polynomial; the distribution is likely to be sub-optimal.
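For illustration, a minimal table-free CRC-32 over a C string using the reflected Ethernet/zlib polynomial 0xEDB88320 (my own sketch; production code usually uses a 256-entry lookup table for speed):

unsigned long crc32_of_string(const char *s)
{
    unsigned long crc = 0xFFFFFFFFUL;   /* standard initial value */
    int bit;

    while (*s) {
        crc ^= (unsigned char)*s++;
        for (bit = 0; bit < 8; bit++)
            crc = (crc & 1) ? (crc >> 1) ^ 0xEDB88320UL : crc >> 1;
    }
    return crc ^ 0xFFFFFFFFUL;          /* standard final XOR */
}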
What you're looking for is called a "hash". Two examples of hash functions I'm aware of that return short integers are MurmurHash and SipHash. MurmurHash, as I recall, is not designed to be a cryptographic hash, while SipHash is designed with security in mind, as stated on its homepage. MurmurHash has two versions that return a 32-bit and a 64-bit output; SipHash returns a 64-bit output.
In some situations, one generally uses a large enough integer value to represent infinity. I usually use the largest representable positive/negative integer. That usually yields more code, since you need to check whether one of the operands is infinity before virtually all arithmetic operations in order to avoid overflows. Sometimes it would be desirable to have saturated integer arithmetic. For that reason, some people use smaller values for infinity that can be added or multiplied several times without overflow. What intrigues me is the fact that it's extremely common to see (especially in programming competitions):
const int INF = 0x3f3f3f3f;
Why is that number special? Its binary representation is:
00111111001111110011111100111111
I don't see any especially interesting property here. I see it's easy to type, but if that were the reason, almost anything would do (0x3e3e3e3e, 0x2f2f2f2f, etc.). It can be added once without overflow, which allows for:
a = min(INF, b + c);
But any of the other constants would do, then. Googling only shows me a lot of code snippets that use that constant, but no explanations or comments.
Can anyone spot it?
I found some evidence about this here (original content in Chinese); the basic idea is that 0x7fffffff is problematic since it's already "the top" of the range of 4-byte signed ints; so, adding anything to it results in negative numbers; 0x3f3f3f3f, instead:
is still quite big (same order of magnitude as 0x7fffffff);
has a lot of headroom; if you say that the valid range of integers is limited to numbers below it, you can add any "valid positive number" to it and still get an infinity (i.e. something >= INF). Even INF+INF doesn't overflow. This allows you to keep it always "under control":
a+=b;
if(a>INF)
a=INF;
is a repetition of equal bytes, which means you can easily memset stuff to INF;
also, as @Jörg W Mittag noticed, it has a nice ASCII representation, which allows you both to spot it on the fly when looking at memory dumps and to write it directly into memory.
I may or may not be one of the earliest discoverers of 0x3f3f3f3f. I published a Romanian article about it in 2004 (http://www.infoarena.ro/12-ponturi-pentru-programatorii-cc #9), but I've been using this value since 2002 at least for programming competitions.
There are two reasons for it:
0x3f3f3f3f + 0x3f3f3f3f doesn't overflow int32. For this, some use 1000000000 (one billion).
one can set an array of ints to infinity by doing memset(array, 0x3f, sizeof(array)) (see the sketch after this list);
As a bonus, 0x3f is the ASCII code of '?', so 0x3f3f3f3f is the ASCII representation of the string "????".
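To illustrate the memset point above (the array name and size are my own):

#include <string.h>

#define INF 0x3f3f3f3f

int dist[1000];

void reset_distances(void)
{
    /* every byte becomes 0x3f, so every int reads as 0x3f3f3f3f == INF */
    memset(dist, 0x3f, sizeof dist);
}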
Krugle finds 48 instances of that constant in its entire database. 46 of those instances are in a Java project, where it is used as a bitmask for some graphics manipulation.
1 project is an operating system, where it is used to represent an unknown ACPI device.
1 project is again a bitmask for Java graphics.
So, in all of the projects indexed by Krugle, it is used 47 times because of its bitpattern, once because of its ASCII interpretation, and not a single time as a representation of infinity.
I am writing functions that serialize/deserialize a large data structure for efficient reloading later on. There is a particular set of decimal numbers for which precision is not a huge deal, and I would like to store them in 4 bytes of binary data.
For most, reading the bytes into a buffer and using memcpy to place them into a float is sufficient, and is the most common solution I've found. However, this is not portable, as floats on the systems this software is meant for are not guaranteed to be 4 bytes in size.
What I would like is something very portable (which is one of the reasons I'm limited to C89). I'm not wedded to 4 byte storage, but it is an attractive option to me. I am pretty wholly against storing the numbers as strings. I'm familiar with endianness issues, and such things are already taken into account.
What I am looking for, therefore, is a system-independent way to store and retrieve floating point numbers in a small amount of binary data (preferably around 4 bytes). I, in my foolishness, imagined this would be the easiest part of this task, since it seems like such a common problem, but popular search engines and various reference books have provided no material assistance.
You could store them in 32-bit IEEE float format (or a very close approximation to it; for instance you might want to restrict denorms and NaNs). Then have each platform adjust as necessary to coerce its own float type to that format and back.
Of course there will be some loss of accuracy, but that's inevitable anyway if you're transferring float values of different precisions from one system to another.
It should be possible to write portable code to find the closest IEEE value to a native float value, and vice-versa, if that's required. You wouldn't really want to use it, though, because it would probably be far less efficient than code that takes advantage of knowing the float format. In the common case where the platform uses an IEEE representation it's a no-op or a simple narrowing/widening conversion. Even in the worst case you're likely to encounter, as long as it's a binary fraction you basically just have to extract the sign, exponent and significand bits and do the right thing with them (discard bits from the significand if it's too big, adjust the bias and possibly the width of the exponent, do the right thing with underflow and overflow).
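A rough sketch of the packing direction under those assumptions (no rounding, and denormals, NaNs, infinities and out-of-range exponents are not handled; the function name is mine):

#include <math.h>

/* Pack a value into an IEEE-754 binary32 bit pattern using only portable
 * arithmetic: sign bit, 8-bit biased exponent, 23-bit truncated significand. */
unsigned long pack_ieee754_32(double value)
{
    unsigned long sign = 0, exponent, mantissa;
    double m;
    int e;

    if (value == 0.0)
        return 0;                  /* +0.0 (the sign of -0.0 is lost here) */
    if (value < 0.0) {
        sign = 1;
        value = -value;
    }

    m = frexp(value, &e);          /* value == m * 2^e, 0.5 <= m < 1 */
    exponent = (unsigned long)(e - 1 + 127);                      /* bias 127 */
    mantissa = (unsigned long)floor((m * 2.0 - 1.0) * 8388608.0); /* 2^23 */

    return (sign << 31) | (exponent << 23) | mantissa;
}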
If you want to avoid losing accuracy in the case where the file is saved and then reloaded on the same system (but that system doesn't use 32bit IEEE), you could look at storing some data indicating the format in the file (size of each value, number of bits of significand and exponent), then store each value at native precision, so that it only gets rounded if it's ever loaded onto a less-precise system. I don't know whether ASN.1 has a standard to encode floating-point values along these lines, but it's the kind of complicated trickery I'd expect from it.
Check this out: http://steve.hollasch.net/cgindex/coding/portfloat.html
They give a routine which is portable and doesn't add too much overhead.
I am to program the Solovay-Strassen primality test presented in the original paper on RSA.
Additionally I will need to write a small bignum library, and so when searching for a convenient representation for bignum I came across this specification:
struct {
    int sign;   /* sign of the number: -1, 0 or +1 */
    int size;   /* number of array elements ("digits"/limbs) in use */
    int *tab;   /* the digits themselves, in whatever base you choose */
} bignum;
I will also be writing a multiplication routine using the Karatsuba method.
So, for my question:
What base would be convenient to store integer data in the bignum struct?
Note: I am not allowed to use third party or built-in implementations for bignum such as GMP.
Thank you.
A power of 2.
For a simple implementation, probably half the size of a word on your machine, so that you can multiply two digits without overflow. So 65536 or 4294967296. Or possibly half the size of the largest integer type, for the same reason but maybe better performance overall.
But I've never actually implemented such a library: if you're using the best known algorithms then you won't be doing school-style long multiplication. Karatsuba multiplication (and whatever other clever tricks you use) might benefit from being done in an integer that's more than twice the size of the digits; I really don't know how the performance works out. If so, then you'd be best off using base 256 with 32-bit arithmetic, or base 65536 with 64-bit arithmetic.
In any case if your representation is binary, then you can pick and choose larger power-of-two bases as convenient for each operation. For instance, you could treat the data as base 2^16 for multiplication, but base 2^32 for addition. It's all the same thing provided you're careful about endian-ness. I'd probably start with base 2^16 (since that forces me to get the endian-ness right to begin with, while 2^8 wouldn't), and see how I get on - as each operation is optimised, part of the optimisation is to identify the best base.
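As an illustration of picking a base that leaves room for the products, here is a sketch assuming base-2^16 limbs stored least significant first, with the result array able to hold na+nb limbs (the helper name is mine):

#include <stdint.h>

/* Schoolbook multiplication in base 2^16.  Each partial product plus carry
 * fits in 32 bits: (2^16-1)^2 + 2*(2^16-1) < 2^32. */
void mul_limbs(const uint16_t *a, int na, const uint16_t *b, int nb, uint16_t *r)
{
    int i, j;

    for (i = 0; i < na + nb; i++)
        r[i] = 0;

    for (i = 0; i < na; i++) {
        uint32_t carry = 0;
        for (j = 0; j < nb; j++) {
            uint32_t t = (uint32_t)a[i] * b[j] + r[i + j] + carry;
            r[i + j] = (uint16_t)(t & 0xFFFFu);
            carry = t >> 16;
        }
        r[i + nb] = (uint16_t)carry;
    }
}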
Using a size which isn't a multiple of bytes is a possibility, but then you have to use the same base for everything, because there are unused bits in the storage in specific places according to the base.
You will be doing the following operation a whole lot:
a*b + c*d + ...;
Either choose 1/4 the largest word size, or 1/2 the largest word size less a bit or two. That would be either 2^16 or 2^30 for 64 bit systems and 2^8 or 2^14 for 32 bit systems. Use the largest size the compiler supports, not the hardware.
If you choose 2^31 on a 64 bit system, that means you can add 4 products without overflow. If you choose 2^30 then you can add 16 products without overflow. The more you can add without overflow, the larger interim blocks you can use.
If you choose 1/4 the word size you will still have a native type so it will be easier to store results back out. You can pretty much ignore overflow too. This will basically make writing code faster and less error prone, and is slightly more memory efficient. I would recommend this unless you like lots of bit manipulation along with your math.
Choosing a larger base will make the big O numbers look better. In practice, while it would probably be faster to have a larger base, it will not be the 4x speed bump that you might hope for.
The base you use should be a power of 2. Since it looks like you're going to keep track of sign separately, you can use unsigned ints for storing the numbers themselves. You're going to need the ability to multiply 2 pieces/digits/units of these numbers at a time, so the size must be no more than half the word size you've got available. i.e. on x86 an unsigned int is 32 bits, so you'd want your digits to be not more than 16 bits. You may also use "long long" for the intermediate results of products of unsigned ints. Then you're looking at 2^32 for your base. One last thing to consider is that you may want to add sums of products, which will overflow unless you use fewer bits.
If performance is not a major concern, I'd just use base 256 and call it a day. You may want to use typedefs and defined constants so you can later change these parameters easily.
The integers in the tab array should be unsigned. They should be the largest possible size (base) that you can multiply and still represent the product. If your compiler/processor supports a 64-bit unsigned long long, for example, you might use uint32_t for the array of "digits." If your compiler/processor can only natively produce 32-bit products, you should use uint16_t.
When you sum two arrays you will need to deal with overflow; in assembly this is easy. In C you may opt to use one less bit (31 or 15) to make the overflow detection easier.
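For example, a carry-propagating add over base-2^32 limbs that detects overflow purely by checking for wraparound, with no wider type needed (a sketch; the names are mine, arrays are least significant limb first):

#include <stdint.h>

/* r = a + b over n limbs; returns the final carry (0 or 1). */
uint32_t add_limbs(const uint32_t *a, const uint32_t *b, uint32_t *r, int n)
{
    uint32_t carry = 0;
    int i;

    for (i = 0; i < n; i++) {
        uint32_t s = a[i] + carry;
        carry = (s < carry);       /* wrapped while adding the incoming carry */
        r[i] = s + b[i];
        carry += (r[i] < s);       /* wrapped while adding b[i] */
    }
    return carry;
}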
Also consider endianness, and the effect it and the algorithm will have on cache behavior.
I'm writing a utility to calculate π to a million digits after the decimal. On a 32- or 64-bit consumer desktop system, what is the most efficient way to store and work with such a large number accurate to the millionth digit?
Clarification: the language would be C.
Forget floating point; you need bit strings that represent integers.
This takes a bit less than half a megabyte per number (10^6 digits × log2(10) ≈ 3.33 million bits ≈ 415 KB). "Efficient" can mean a number of things. Space-efficient? Time-efficient? Easy to program with?
Your question is tagged floating-point, but I'm quite sure you do not want floating point at all. The entire idea of floating point is that our data is only known to a few significant figures and even the famous constants of physics and chemistry are known precisely to only a handful or two of digits. So there it makes sense to keep a reasonable number of digits and then simply record the exponent.
But your task is quite different. You must account for every single bit. Given that, no floating point or decimal arithmetic package is going to work unless it's a template you can arbitrarily size, and then the exponent will be useless. So you may as well use integers.
What you really really need is a string of bits. This is simply an array of convenient types. I suggest <stdint.h> and simply using uint32_t[125000] (or uint64_t) to get started. This actually might be a great use of the more obscure types from that header that pick out bit sizes that are fast on a given platform.
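As a sketch of that storage (the numbers and names are mine): 125,000 32-bit words give 4,000,000 bits, comfortably more than the roughly 3.33 million bits a million decimal digits require.

#include <stdint.h>

#define PI_WORDS 125000                  /* 125000 * 32 = 4,000,000 bits */

static uint32_t pi_bits[PI_WORDS];       /* one big number as a flat bit string */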
To be more specific we would need to know more about your goals. Is this for practice in a specific language? For some investigation into number theory? If the latter, why not just use a language that already supports Bignum's, like Ruby?
Then the storage is someone else's problem. But if what you really want to do is implement a big-number package, then I might suggest using BCD (4-bit) strings or even ordinary ASCII 8-bit strings with printable digits, simply because things will be easier to write and debug, and maximum space and time efficiency may not matter so much.
I'd recommend storing it as an array of short ints, one per digit, and then carefully write utility classes to add and subtract portions of the number. You'll end up moving from this array of ints to floats and back, but you need a 'perfect' way of storing the number - so use its exact representation. This isn't the most efficient way in terms of space, but a million ints isn't very big.
It's all in the way you use the representation. Decide how you're going to 'work with' this number, and write some good utility functions.
If you're willing to tolerate computing pi in hex instead of decimal, there's a very cute algorithm that allows you to compute a given hexadecimal digit without knowing the previous digits. This means, by extension, that you don't need to store (or be able to do computation with) million digit numbers.
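The algorithm being alluded to is presumably the Bailey–Borwein–Plouffe (BBP) formula; the digit-extraction trick works because the 16^(n-k) factor in front of each term can be reduced modulo the small denominators:

\pi = \sum_{k=0}^{\infty} \frac{1}{16^k} \left( \frac{4}{8k+1} - \frac{2}{8k+4} - \frac{1}{8k+5} - \frac{1}{8k+6} \right)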
Of course, if you want to get the nth decimal digit, you will need to know all of the hex digits up to that precision in order to do the base conversion, so depending on your needs, this may not save you much (if anything) in the end.
Unless you're writing this purely for fun and/or learning, I'd recommend using a library such as GNU Multiprecision. Look into the mpf_t data type and its associated functions for storing arbitrary-precision floating-point numbers.
If you are just doing this for fun/learning, then represent numbers as an array of chars, which each array element storing one decimal digit. You'll have to implement long addition, long multiplication, etc.
Try PARI/GP, see wikipedia.
You could store its decimal digits as text in a file and mmap it to an array.
I once worked on an application that used really large numbers (but didn't need good precision). What we did was store the numbers as logarithms, since you can store a pretty big number as a log10 within an int.
Think along these lines before resorting to bit stuffing or some complex bit representations.
I am not too good with complex math, but I reckon there are elegant solutions for storing numbers with millions of bits of precision.
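A tiny sketch of that idea (my own illustration): keep log10 of each magnitude, so multiplication becomes addition and the representable range is enormous, at the cost of almost all precision.

#include <stdio.h>

int main(void)
{
    double log_a = 123456.0;             /* stands for 10^123456 */
    double log_b = 654321.0;             /* stands for 10^654321 */
    double log_prod = log_a + log_b;     /* a * b == 10^(log_a + log_b) */

    printf("a*b is about 10^%.0f\n", log_prod);
    return 0;
}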
IMO, any programmer of arbitrary-precision arithmetic needs an understanding of base conversion. It solves two problems at once: being able to calculate pi in hex digits and convert the result to a decimal representation, and finding the optimal container.
The dominant constraint is the number of correct bits in the multiplication instruction.
In JavaScript one always has 53 bits of accuracy, meaning that a Uint32Array with numbers having at most 26 bits can be processed natively (a waste of 6 bits per word).
On a 32-bit architecture with C/C++ one can easily get A*B mod 2^32, suggesting a basic element of 16 bits. (Those can be parallelized on many SIMD architectures, starting from MMX.) Also, each 16-bit word can hold a 4-digit decimal number (wasting about 2.7 bits per word).