What is the fastest way to calculate e to 2 trillion digits?

I want to calculate e to 2 trillion (2,000,000,000,000) digits. This is about 1.8 TiB of pure e. I just implemented a Taylor series expansion algorithm using GMP (code can be found here).
Unfortunately it crashes when summing more than 4000 terms on my computer, probably because it runs out of memory.
What is the current state of the art in computing e? Which algorithm is the fastest? Any open source implementations that are worth looking at? Please don't mention y-cruncher, it's closed source.

Since I'm the author of the y-cruncher program that you mention, I'll add my 2 cents.
For such a large task, the two biggest barriers that must be tackled are as follows:
Memory
Run-time Complexity
Memory
2 trillion digits is extreme - to say the least. That's double the current record set by Shigeru Kondo and myself back in 2010. (It took us more than 9 days to compute 1 trillion digits using y-cruncher.)
In plain text, that's about 1.8 TiB in decimal. In packed binary representation, that's 773 GiB.
If you're going to be doing arithmetic on numbers of this size, you're gonna need 773 GiB for each operand, not counting scratch memory.
Feasibly speaking, y-cruncher actually needs 8.76 TiB of memory to do this computation all in RAM. So you can expect other implementations to need about the same, give or take a factor of 2 at most.
That said, I doubt you're gonna have enough RAM. And even if you did, it'd be heavily NUMA. So the alternative is to use disk. But this is not trivial: to be efficient, you need to treat memory as a cache and micromanage all data that is transferred between memory and disk.
Run-time Complexity
Here we have the other problem. For 2 trillion digits, you're gonna need a very fast algorithm. Not just any fast algorithm, but a quasi-linear run-time algorithm.
Your current attempt runs in about O(N^2). So even if you had enough memory, it wouldn't finish in your lifetime.
The standard approach to computing e to high precision runs in O(N log(N)^2) and combines the following algorithms:
Binary Splitting on the Taylor series expansion of e.
FFT-based large multiplication
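For a feel of the first ingredient, here is a minimal, memory-only binary-splitting sketch using GMP (my own illustration, nothing to do with y-cruncher's internals). It computes P/Q = sum over [a, b) of a!/k!, with Q = b!/a!, so the top-level call gives e ~= P/Q:

    #include <gmp.h>
    #include <stdio.h>

    /* P/Q = sum_{k=a}^{b-1} a!/k!, with Q = b!/a!.
     * Called as split(P, Q, 0, n), this gives e ~= P/Q. */
    static void split(mpz_t P, mpz_t Q, unsigned long a, unsigned long b)
    {
        if (b - a == 1) {                 /* single term */
            mpz_set_ui(P, a + 1);
            mpz_set_ui(Q, a + 1);
            return;
        }
        unsigned long m = a + (b - a) / 2;
        mpz_t P2, Q2;
        mpz_inits(P2, Q2, NULL);
        split(P, Q, a, m);                /* left half  */
        split(P2, Q2, m, b);              /* right half */
        mpz_mul(P, P, Q2);                /* P = P1*Q2 + P2 */
        mpz_add(P, P, P2);
        mpz_mul(Q, Q, Q2);                /* Q = Q1*Q2 */
        mpz_clears(P2, Q2, NULL);
    }

    int main(void)
    {
        mpz_t P, Q, pow10;
        mpz_inits(P, Q, pow10, NULL);
        split(P, Q, 0, 32);               /* 32 terms: ~35 digits */
        mpz_ui_pow_ui(pow10, 10, 30);     /* scale for 30 decimals */
        mpz_mul(P, P, pow10);
        mpz_tdiv_q(P, P, Q);
        gmp_printf("e ~ %Zd / 10^30\n", P);
        mpz_clears(P, Q, pow10, NULL);
        return 0;
    }

The recursion keeps the operands of each merge balanced in size, so the big FFT multiplies do the heavy lifting. At 2 trillion digits you would of course need the out-of-core machinery described above on top of this.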
Fortunately, GMP already uses FFT-based large multiplication. But it lacks two crucial features:
Out-of-core ("swap") computation, to use disk when there isn't enough memory.
Parallelized (multi-threaded) execution.
The second point isn't as important, since you can just wait longer. But for all practical purposes, you're probably gonna need to roll your own. And that's what I did when I wrote y-cruncher.
That said, there are many other loose ends that also need to be taken care of:
The final division will require a fast algorithm like Newton's Method.
If you're gonna compute in binary, you're gonna need to do a radix conversion.
If the computation is gonna take a lot of time and a lot of resources, you may need to implement fault-tolerance to handle hardware failures.
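To illustrate the first of those loose ends: Newton's method for the reciprocal uses the recurrence x ← x·(2 − d·x), which roughly doubles the number of correct digits per step, so only the last couple of iterations run at full precision. A toy double-precision sketch of the recurrence (the real thing runs it over big numbers at increasing precision):

    #include <stdio.h>

    int main(void)
    {
        double d = 7.0;                   /* divisor */
        double x = 0.1;                   /* crude initial guess for 1/7 */
        for (int i = 0; i < 6; i++) {
            x = x * (2.0 - d * x);        /* error is squared each step */
            printf("iter %d: x = %.17f\n", i, x);
        }
        return 0;                         /* converges to 0.142857... */
    }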

Since you have a goal for how many digits you want (2 trillion), you can estimate how many terms you'll need to calculate e to that number of digits. From this, you can estimate how many extra digits of precision you'll need to carry to avoid rounding errors at the 2-trillionth place.
If my calculation from Stirling's approximation is correct, the reciprocal of 10 to the 2 trillion is about the reciprocal of 185 billion factorial. So that's roughly how many terms you'll need (about 1.85×10^11). The story's a little better than that, though, because you'll start being able to throw away a lot of the digits in the calculation of the terms well before that.
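To make that estimate concrete, here is a throwaway search of mine for the smallest n with log10(n!) ≥ 2×10^12, using lgamma from C99's <math.h>; it lands just under 2×10^11:

    #include <math.h>
    #include <stdio.h>

    /* lgamma(n + 1) = ln(n!), so log10(n!) = lgamma(n + 1) / ln(10). */
    int main(void)
    {
        const double target = 2.0e12;     /* want 1/n! <= 10^-target */
        double lo = 1.0, hi = 1.0e12;
        while (hi - lo > 1.0) {           /* bisect on n */
            double mid = 0.5 * (lo + hi);
            if (lgamma(mid + 1.0) / log(10.0) < target)
                lo = mid;
            else
                hi = mid;
        }
        printf("need about %.3e terms\n", hi);
        return 0;
    }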
Since e is calculated as a sum of inverse factorials, all of your terms are rational, and hence they are expressible as repeating decimals. So the decimal expansion of your terms will be (a) an exponent, (b) a non-repeating part, and (c) a repeating part. There may be some efficiencies you can take advantage of if you look at the terms in this way.
Anyway, good luck!

Related

Determine if a given integer number is element of the Fibonacci sequence in C without using float

I recently had an interview, where I failed and was finally told I didn't have enough experience to work for them.
The position was embedded C software developer. The target platform was some kind of very simple 32-bit architecture whose processor does not support floating-point numbers or operations on them. Therefore double and float numbers cannot be used.
The task was to develop a C routine for this architecture that takes one integer and returns whether or not it is a Fibonacci number. However, only an additional 1K of temporary memory may be used during execution. That means: even if I simulate very large integers, I can't just build up the sequence and iterate through it.
As far as I know, a positive integer n is a Fibonacci number exactly when one of
5n² + 4
or
5n² − 4
is a perfect square. So I answered that it's simple: the routine only has to determine whether that is the case.
They then responded: on the target architecture no floating-point-like operations are supported, so no square roots can be obtained using the stdlib's sqrt function. It was also mentioned that basic operations like division and modulus may not work either, because of the architecture's limitations.
Then I said: okay, we could build an array of the square numbers up to 256, iterate through it, and compare the entries against the numbers given by the formulas above. They said this was a bad approach, even if it would work, and did not accept the answer.
Finally I gave up, since I had no other ideas. I asked what the solution would be; they said it wouldn't be told, but advised me to look for it myself: my first approach (the two formulas) should be the key, but the square root might be done another way.
At home I googled a lot, but never found any "alternative" square root algorithms; everywhere, floating-point numbers were permitted.
For operations like division and modulus, so-called integer division can be used. But what can be used for the square root?
Even though I failed the interview test, this is a very interesting topic for me: working on architectures where no floating-point operations are available.
Therefore my questions:
How can floating-point numbers be simulated if only integers are allowed?
What would be a possible solution in C for the problem described? Code examples are welcome.
The point of this type of interview is to see how you approach new problems. If you happen to already know the answer, that is undoubtedly to your credit but it doesn't really answer the question. What's interesting to the interviewer is watching you grapple with the issues.
For this reason, it is common that an interviewer will add additional constraints, trying to take you out of your comfort zone and seeing how you cope.
I think it's great that you knew that fact about recognising Fibonacci numbers. I wouldn't have known it without consulting Wikipedia. It's an interesting fact but does it actually help solve the problem?
Apparently, it would be necessary to compute 5n²±4, compute the square roots, and then verify that one of them is an integer. With access to a floating point implementation with sufficient precision, this would not be too complicated. But how much precision is that? If n can be an arbitrary 32-bit signed number, then n² is obviously not going to fit into 32 bits. In fact, 5n²+4 could be as big as 65 bits, not including a sign bit. That's far beyond the precision of a double (normally 52 bits) and even of a long double, if available. So computing the precise square root will be problematic.
Of course, we don't actually need a precise computation. We can start with an approximation, square it, and see if it is either four more or four less than 5n². And it's easy to see how to compute a good guess: it will be very close to n×√5. By using a good precomputed approximation of √5, we can easily do this computation without the need for floating point, without division, and without a sqrt function. (If the approximation isn't accurate, we might need to adjust the result up or down, but that's easy to do using the identity (n+1)² = n²+2n+1; once we have n², we can compute (n+1)² with only addition.)
We still need to solve the problem of precision, so we'll need some way of dealing with 66-bit integers. But we only need to implement addition and multiplication of positive integers, which is considerably simpler than a full-fledged bignum package. Indeed, if we can prove that our square root estimate is close enough, we could safely do the verification modulo 2³¹.
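As an aside, if a square root is wanted with no floating point at all, the classic bit-by-bit integer square root (shifts, adds and compares only, no division) is enough for the perfect-square test; a minimal sketch:

    #include <stdint.h>

    /* Returns floor(sqrt(n)) using only shifts, adds and compares. */
    static uint32_t isqrt64(uint64_t n)
    {
        uint64_t root = 0;
        uint64_t bit = UINT64_C(1) << 62;     /* highest even bit */
        while (bit > n) bit >>= 2;
        while (bit != 0) {
            if (n >= root + bit) {
                n -= root + bit;
                root = (root >> 1) + bit;
            } else {
                root >>= 1;
            }
            bit >>= 2;
        }
        return (uint32_t)root;
    }

    /* Perfect-square test for 5n^2 +/- 4. For |n| up to about 2^30
     * this fits in 64 bits; the full 32-bit range needs the wider
     * arithmetic discussed above. */
    static int is_square(uint64_t x)
    {
        uint64_t r = isqrt64(x);
        return r * r == x;
    }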
So the analytic solution can be made to work, but before diving into it, we should ask whether it's the best solution. One very common category of suboptimal programming is clinging desperately to the first idea you come up with even as its complications become increasingly evident. That will be one of the things the interviewer wants to know about you: how flexible are you when presented with new information or new requirements?
So what other ways are there to know whether n is a Fibonacci number? One interesting fact is that if n is Fib(k), then k is the floor of logφ(n×√5 + 0.5). Since logφ is easily computed from log2, which in turn can be approximated by a simple bitwise operation, we could try finding an approximation of k and verifying it using the classic O(log k) recursion for computing Fib(k). None of this involves numbers bigger than the capacity of a 32-bit signed type.
Even more simply, we could just run through the Fibonacci series in a loop, checking to see if we hit the target number. Only 47 loops are necessary. Alternatively, these 47 numbers could be precalculated and searched with binary search, using far less than the 1k bytes you are allowed.
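A minimal sketch of that loop, assuming a 32-bit unsigned input; three variables and an overflow check, nowhere near the 1K allowance:

    #include <stdint.h>

    /* Nonzero iff n appears in the Fibonacci sequence. */
    int is_fib(uint32_t n)
    {
        uint32_t a = 0, b = 1;            /* Fib(0), Fib(1) */
        while (b < n) {                   /* at most ~47 iterations */
            uint32_t t = a + b;
            if (t < b) return 0;          /* overflow past Fib(47) */
            a = b;
            b = t;
        }
        return n == b || n == 0;
    }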
It is unlikely an interviewer for a programming position would be testing for knowledge of a specific property of the Fibonacci sequence. Thus, unless they present the property to be tested, they are examining the candidate’s approaches to problems of this nature and their general knowledge of algorithms. Notably, the notion to iterate through a table of squares is a poor response on several fronts:
At a minimum, binary search should be the first thought for table look-up. Some calculated look-up approaches could also be proposed for discussion, such as using find-first-set-bit instruction to index into a table.
Hashing might be another idea worth considering, especially since an efficient customized hash might be constructed.
Once we have decided to use a table, it is likely a direct table of Fibonacci numbers would be more useful than a table of squares.

Searching missing number - simple example

A little task on search algorithms and complexity in C. I just want to make sure I'm right.
I have n natural numbers from 1 to n+1, ordered from small to big, and I need to find the missing one.
For example: 1 2 3 5 6 7 8 9 10 11 - ans: 4
The fastest and simplest answer is to do one loop and check every number against the one that comes after it. The complexity of that is O(n) in the worst case.
I thought maybe I'm missing something and could find it using binary search. Can anybody think of a more efficient algorithm for this simple example, like O(log(n)) or something?
There are obviously two answers:
If your problem is a purely theoretical one, especially for large n, you'd do something like a binary search: check whether the element in the middle of the current range still has the value it would have if there were no gap, and recurse into the half where it doesn't.
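A sketch of that search, assuming the layout from the question: while the prefix is intact, a[i] == i + 1, and after the gap a[i] == i + 2, so we binary-search for the first index where the offset jumps:

    #include <stddef.h>

    /* a[] holds n of the numbers 1..n+1 in ascending order, one missing. */
    size_t find_missing(const int a[], size_t n)
    {
        size_t lo = 0, hi = n;
        while (lo < hi) {
            size_t mid = lo + (hi - lo) / 2;
            if (a[mid] == (int)(mid + 1))
                lo = mid + 1;             /* gap is to the right */
            else
                hi = mid;                 /* gap is at mid or left */
        }
        return lo + 1;                    /* the missing value */
    }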
However, if this is a practical question, for modern systems executing programs written in C and compiled by a modern, highly optimizing compiler, for n << 10000 I'd assume that the linear search approach is much, much faster, simply because it can be vectorized so easily. In fact, modern CPUs have instructions to, e.g.,
load 4 integers at once,
subtract four other integers,
compare the result to [4 4 4 4],
increment the counter by 4,
load the next 4 integers,
and so on. This lends itself very neatly to the fact that CPUs and memory controllers prefetch linear memory; jumping around in logarithmically descending step sizes, by contrast, can have an enormous performance impact.
So: For large n, where linear search would be impractical, go for the binary search approach; for n where that is questionable, go for the linear search. If you not only have SIMD capabilities but also multiple cores, you will want to split your problem. If your problem is not actually exactly 1 missing number, you might want to use a completely different approach ... The whole O(n) business is generally more of a benchmark usable purely for theoretical constructs, and unless the difference is immensely large, is rarely the sole reason to pick a specific algorithm in a real-world implementation.
For a comparison-based algorithm, you can't beat lg(N) comparisons in the worst case. This is simply because the answer is a number between 1 and N, and it takes lg(N) bits of information to represent such a number. (And a comparison gives you a single bit.)
Unless the distribution of the answers is very skewed, you can't do much better than lg(N) on average.
Now I don't see how a non-comparison-based method could exploit the fact that the sequence is ordered, and do better than O(N).

How to multiply terabyte-sized numbers?

When multiplying very large numbers, you use FFT-based multiplication (see the Schönhage–Strassen algorithm). For performance reasons I'm caching the twiddle factors. The problem is that for huge numbers (gigabyte-sized) I need FFT tables of size 2^30 and more, which occupy too much RAM (16 GB and above). So it seems I should use another algorithm.
There is a software called y-cruncher, which is used to calculate Pi and other constants, which can multiply terabyte-sized numbers. It uses an algorithm called Hybrid NTT and another algorithm called VST (see A Peak into y-cruncher v0.6.1 in section The VST Multiplication Algorithm).
Can anyone shed some light on these algorithms or any other algorithm which can be used to multiply terabyte-sized numbers?
The FFT can be done in place on the same array with a constant amount of additional memory (the elements may need to be swapped smartly, via a bit-reversal permutation). Therefore it can be done on the hard disk as well. In the worst case it costs on the order of N·log(N) disk accesses. That seems a lot slower than doing it in RAM, but the overall complexity remains the same.
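To see why constant extra memory suffices, here is a minimal in-memory radix-2 sketch (plain complex doubles, not what y-cruncher does). Both the bit-reversal pass and each butterfly pass sweep the array in a regular pattern, which is what makes a disk-based variant workable:

    #include <complex.h>
    #include <math.h>
    #include <stddef.h>

    /* In-place iterative Cooley-Tukey FFT; n must be a power of two. */
    static void fft(double complex *a, size_t n)
    {
        /* Bit-reversal permutation: swaps only, O(1) extra memory. */
        for (size_t i = 1, j = 0; i < n; i++) {
            size_t bit = n >> 1;
            for (; j & bit; bit >>= 1) j ^= bit;
            j ^= bit;
            if (i < j) { double complex t = a[i]; a[i] = a[j]; a[j] = t; }
        }
        /* log2(n) butterfly passes, each a linear sweep of the array. */
        const double pi = acos(-1.0);
        for (size_t len = 2; len <= n; len <<= 1) {
            double complex wlen = cexp(-2.0 * pi * I / (double)len);
            for (size_t i = 0; i < n; i += len) {
                double complex w = 1.0;
                for (size_t k = 0; k < len / 2; k++) {
                    double complex u = a[i + k];
                    double complex v = a[i + k + len / 2] * w;
                    a[i + k] = u + v;
                    a[i + k + len / 2] = u - v;
                    w *= wlen;
                }
            }
        }
    }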

Efficiency of arcsin computation from sine lookup table

I have implemented a lookup table to compute sine/cosine values in my system. I now need inverse trigonometric functions (arcsin/arccos).
My application is running on an embedded device on which I can't add a second lookup table for arcsin as I am limited in program memory. So the solution I had in mind was to browse over the sine lookup table to retrieve the corresponding index.
I am wondering if this solution will be more efficient than using the standard implementation coming from the math standard library.
Has someone already experimented on this?
The current implementation of the LUT is an array of the sine values from 0 to PI/2. The values stored in the table are multiplied by 4096 to stay with integer values with enough precision for my application. The lookup table has a resolution of 1/4096, which gives an array of 6434 values.
Then I have two functions, sine and cosine, that take an angle in radians multiplied by 4096 as argument. These functions convert the given angle to the corresponding angle in the first quadrant and read the corresponding value from the table.
My application runs on a dsPIC33F at 40 MIPS, and I use the C30 compiler suite.
It's pretty hard to say anything with certainty since you have not told us about the hardware, the compiler or your code. However, a priori, I'd expect the standard library from your compiler to be more efficient than your code.
It is perhaps unfortunate that you have to use the C30 compiler which does not support C++, otherwise I'd point you to Optimizing Math-Intensive Applications with Fixed-Point Arithmetic and its associated library.
However, the general principles of the CORDIC algorithm apply, and the memory footprint will be far smaller than your current implementation. The article explains the generation of arctan(), and arccos() and arcsin() can be calculated from that as described here.
Of course, this suggests that you will also need square root and division. These may be expensive, though PIC24/dsPIC parts have hardware integer division. The article on math acceleration deals with square root too. Your look-up table approach will likely be faster for the direct look-up, but perhaps not for the reverse search; the approaches explained in the article are more general and more precise (the library uses 64-bit integers as 36.28-bit fixed point; you might get away with less precision and range in your application), and certainly faster than a standard library implementation using software floating point.
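For a flavor of CORDIC in the question's fixed-point convention (angles and values scaled by 4096), a minimal vectoring-mode arctan sketch; the table constants are my own, rounded from atan(2^-i)·4096:

    #include <stdint.h>

    /* atan(2^-i) * 4096, i = 0..11 */
    static const int32_t atan_tab[12] = {
        3217, 1899, 1003, 509, 256, 128, 64, 32, 16, 8, 4, 2
    };

    /* CORDIC vectoring mode: drives y to 0 while accumulating the
     * rotation angle; returns ~atan2(y, x) * 4096 for x > 0.
     * Shifts and adds only, a good fit for a dsPIC. */
    static int32_t cordic_atan2(int32_t x, int32_t y)
    {
        int32_t angle = 0;
        for (int i = 0; i < 12; i++) {
            int32_t xs = x >> i, ys = y >> i;
            if (y > 0) { x += ys; y -= xs; angle += atan_tab[i]; }
            else       { x -= ys; y += xs; angle -= atan_tab[i]; }
        }
        return angle;
    }

arcsin(v) then follows as atan2(v, sqrt(1 − v²)), which is where the square root and division mentioned above come in.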
You can use a "halfway" approach, combining a coarse-grained lookup table to save memory with a numeric approximation for the intermediate values (e.g. a Maclaurin series, which will be more accurate than linear interpolation).
Some examples here.
This question also has some related links.
A binary search of 6434 entries will take ~13 lookups to find the value, followed by an interpolation if more accuracy is needed. Due to the nature of the sine curve, you will get much more accuracy at one end than at the other. If you can spare the memory, making your own inverse table, evenly spaced on the inputs, is likely a better bet for both speed and accuracy.
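A sketch of that reverse search, assuming the question's table layout (sin_tab is a stand-in name for your LUT, sin_tab[i] = sin(i/4096)·4096, monotonic over the first quadrant):

    #include <stdint.h>

    #define TAB_SIZE 6434
    extern const int16_t sin_tab[TAB_SIZE];  /* the question's LUT */

    /* arcsin via binary search: s = sin(x)*4096, 0 <= s <= 4096;
     * returns the angle*4096 in [0, PI/2*4096]. ~13 iterations. */
    int32_t arcsin_lut(int32_t s)
    {
        int32_t lo = 0, hi = TAB_SIZE - 1;
        while (lo < hi) {
            int32_t mid = (lo + hi) / 2;
            if (sin_tab[mid] < s) lo = mid + 1;
            else                  hi = mid;
        }
        return lo;           /* index doubles as the fixed-point angle */
    }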
In terms of comparison to the built-in version, you'll have to test that. When you do, pay attention to how much the size of your image increases: the standard-library implementations can be pretty hefty on some systems.

Which is faster — sorting or multiplying a small array of elements?

Reading through Cactus Kev's Poker Hand Evaluator, I noticed the following statements:
At first, I thought that I could always simply sort the hand first before passing it to the evaluator; but sorting takes time, and I didn't want to waste any CPU cycles sorting hands. I needed a method that didn't care what order the five cards were given as.
...
After a lot of thought, I had a brainstorm to use prime numbers. I would assign a prime number value to each of the thirteen card ranks... The beauty of this system is that if you multiply the prime values of the rank of each card in your hand, you get a unique product, regardless of the order of the five cards.
...
Since multiplication is one of the fastest calculations a computer can make, we have shaved hundreds of milliseconds off our time had we been forced to sort each hand before evaluation.
I have a hard time believing this.
Cactus Kev represents each card as a 4-byte integer, and evaluates hands by calling eval_5cards( int c1, int c2, int c3, int c4, int c5 ). We could represent cards as one byte, and a poker hand as a 5-byte array. Sorting this 5-byte array to get a unique hand must be pretty fast. Is it faster than his approach?
What if we keep his representation (cards as 4-byte integers)? Can sorting an array of 5 integers be faster than multiplying them? If not, what sort of low-level optimizations can be done to make sorting a small number of elements faster?
Thanks!
Good answers everyone; I'm working on benchmarking the performance of sorting vs multiplication, to get some hard performance statistics.
Of course it depends a lot on the CPU of your computer, but a typical Intel CPU (e.g. Core 2 Duo) can multiply two 32-bit numbers in 3 clock cycles. For a sort algorithm to beat that, it would need to run in less than the 3 × 4 = 12 cycles the four multiplications take, which is a very tight constraint. None of the standard sorting algorithms can do it in fewer than 12 cycles for sure. The comparison of two numbers alone takes one cycle, the conditional branch on the result takes another, and whatever you do after that takes at least one more (swapping two cards will actually take at least 4 cycles). So multiplying wins.
Of course this does not take into account the latency of fetching the card values from first- or second-level cache, or maybe even memory; however, that latency applies in either case, multiplying or sorting.
Without testing, I'm sympathetic to his argument. You can do it in 4 multiplications, as compared to sorting, which is n log n. Specifically, the optimal sorting network requires 9 comparisons. The evaluator then has to at least look at every element of the sorted array, which is another 5 operations.
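For reference, one published optimal 9-comparator network for 5 elements looks like this (a sketch; the comparator order is one of several valid arrangements):

    /* Compare-exchange: order the pair so *a <= *b. */
    static void cswap(int *a, int *b)
    {
        if (*a > *b) { int t = *a; *a = *b; *b = t; }
    }

    /* 9-comparator sorting network for 5 elements. */
    static void sort5(int v[5])
    {
        cswap(&v[0], &v[1]); cswap(&v[3], &v[4]);
        cswap(&v[2], &v[4]); cswap(&v[2], &v[3]);
        cswap(&v[0], &v[3]); cswap(&v[0], &v[2]);
        cswap(&v[1], &v[4]); cswap(&v[1], &v[3]);
        cswap(&v[1], &v[2]);
    }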
Sorting is not intrinsically harder than multiplying numbers. On paper, they're about the same, and you also need a sophisticated multiplication algorithm to make large multiplication competitive with large sort. Moreover, when the proposed multiplication algorithm is feasible, you can also use bucket sort, which is asymptotically faster.
However, a poker hand is not an asymptotic problem. It's just 5 cards and he only cares about one of the 13 number values of the card. Even if multiplication is complicated in principle, in practice it is implemented in microcode and it's incredibly fast. What he's doing works.
Now, if you're interested in the theoretical question, there is also a solution using addition rather than multiplication. There can only be 4 cards of any one value, so you could just as well assign the values 1,5,25,...,5^12 and add them. It still fits in 32-bit arithmetic. There are also other addition-based solutions with other mathematical properties. But it really doesn't matter, because microcoded arithmetic is so much faster than anything else that the computer is doing.
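A sketch of that addition-based encoding (the names are mine): at most four cards share a rank, so the base-5 digits never carry, and the sum is order-independent and collision-free:

    #include <stdint.h>

    /* Order-independent 5-card rank signature: rank r adds 5^r. */
    static uint32_t hand_signature(const int ranks[5])
    {
        static const uint32_t pow5[13] = {
            1, 5, 25, 125, 625, 3125, 15625, 78125, 390625,
            1953125, 9765625, 48828125, 244140625
        };
        uint32_t sig = 0;
        for (int i = 0; i < 5; i++)
            sig += pow5[ranks[i]];       /* max 5 * 5^12 < 2^31 */
        return sig;
    }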
5 elements can be sorted using an optimized decision tree, which is much faster than using a general-purpose sorting algorithm.
However, the fact remains that sorting means lots of branches (as do the comparisons that are necessary afterwards). Branches are really bad for modern pipelined CPU architectures, especially branches that go either way with similar likelihood (thus defeating branch prediction logic). That, much more than the theoretical cost of multiplication vs. comparisons, makes multiplication faster.
But if you could build custom hardware to do the sorting, it might end up faster.
That shouldn't really be relevant, but he is correct: sorting takes much longer than multiplying.
The real question is what he did with the resulting product, and how that was helpful (since factoring it would, I expect, take longer than sorting).
It's hard to think of any sorting operation that could be faster than multiplying the same set of numbers. At the processor level, the multiplication is just load, load, multiply, load, multiply, ..., with maybe some manipulation of the accumulator thrown in. It's linear, easily pipelined, no comparisons with the associated branch mis-prediction costs. It should average about 2 instructions per value to be multiplied. Unless the multiply instruction is painfully slow, it's really hard to imagine a faster sort.
One thing worth mentioning is that even if your CPU's multiply instruction is dead slow (or nonexistent...) you can use a lookup table to speed things even further.
After a lot of thought, I had a brainstorm to use prime numbers. I would assign a prime number value to each of the thirteen card ranks... The beauty of this system is that if you multiply the prime values of the rank of each card in your hand, you get a unique product, regardless of the order of the five cards.
That's an example of a non-positional number system.
I can't find the link to the theory. I studied it as part of applied algebra, somewhere around Euler's totient and encryption. (I may be wrong with the terminology, as I studied all of that in my native language.)
What if we keep his representation (cards as 4-byte integers)? Can sorting an array of 5 integers be faster than multiplying them?
RAM is an external resource and is generally slower compared to the CPU. Sorting 5 ints would always have to go to RAM due to swap operations. Add to this the overhead of the sorting function itself, and multiplication stops looking all that bad.
I think that on modern CPUs integer multiplication would pretty much always be faster than sorting, since several multiplications can be executed at the same time on different ALUs, while there is only one bus connecting the CPU to RAM.
If not, what sort of low-level optimizations can be done to make sorting a small number of elements faster?
5 integers can be sorted quite quickly using bubble sort: qsort would use more memory (for recursion), while a well-optimized bubble sort would work completely from the d-cache.
As others have pointed out, sorting alone isn't quicker than multiplying for 5 values. This ignores, however, the rest of his solution. After disdaining a 5-element sort, he proceeds to do a binary search over an array of 4888 values - at least 12 comparisons, more than the sort ever required!
Note that I'm not saying there's a better solution that involves sorting - I haven't given it enough thought, personally - just that sorting alone is only part of the problem.
He also didn't have to use primes. If he simply encoded the value of each card in 4 bits, he'd need 20 bits to represent a hand, giving a range of 0 to 2^20 = 1048576, about 1/100th of the range produced using primes, and small enough (though still suffering cache coherency issues) to produce a lookup table over.
Of course, an even more interesting variant is to take 7 cards, such as are found in games like Texas Holdem, and find the best 5 card hand that can be made from them.
The multiplication is faster.
Multiplication of any given array will always be faster than sorting the array, presuming the multiplication produces a meaningful result, and the lookup table is irrelevant: the code is designed to evaluate a poker hand, so you'd need to do a lookup on the sorted set anyway.
An example of a ready made Texas Hold'em 7- and 5-card evaluator can be found here with documentation and further explained here. All feedback welcome at the e-mail address found therein.
You don't need to sort, and can typically (~97% of the time) get away with just 6 additions and a couple of bit shifts when evaluating 7-card hands. The algorithm uses a generated lookup table which occupies about 9 MB of RAM and is generated near-instantly. Cheap. All of this is done inside 32 bits, and "inlining" the 7-card evaluator is good for evaluating about 50M randomly generated hands per second on my laptop.
Oh, and multiplication is faster than sorting.
