How to multiply terabyte-sized numbers? - c

When multiplying very large numbers, you use FFT-based multiplication (see Schönhage–Strassen algorithm). For performance reasons I'm caching the twiddle factors. The problem is that for huge numbers (gigabyte-sized) I need FFT tables of size 2^30 and more, which occupy too much RAM (16 GB and above). So it seems I should use another algorithm.
There is a software called y-cruncher, which is used to calculate Pi and other constants, which can multiply terabyte-sized numbers. It uses an algorithm called Hybrid NTT and another algorithm called VST (see A Peak into y-cruncher v0.6.1 in section The VST Multiplication Algorithm).
Can anyone shed some light on these algorithms or any other algorithm which can be used to multiply terabyte-sized numbers?

An FFT can be computed in place on the same array with only a constant amount of additional memory (you may need to swap the elements carefully). Therefore it can be done on the hard disk as well. In the worst case that is on the order of N*log(N) disk accesses. That is a lot slower than doing it in RAM, but the overall complexity remains the same.
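To make the in-place idea concrete, here is a minimal sketch, assuming a plain radix-2 complex FFT in double precision. It works on a single array and recomputes each twiddle factor when it is needed instead of reading it from a cached table. The names fft_in_place, to_digits and multiply are made up for this illustration, and this is not the Hybrid NTT or VST algorithm that y-cruncher uses; floating-point rounding also limits it to modest sizes.

```python
import cmath
import math


def fft_in_place(a, invert=False):
    """Radix-2 FFT on the list a (length must be a power of two), in place."""
    n = len(a)
    # Bit-reversal permutation: only swaps inside the same array.
    j = 0
    for i in range(1, n):
        bit = n >> 1
        while j & bit:
            j ^= bit
            bit >>= 1
        j ^= bit
        if i < j:
            a[i], a[j] = a[j], a[i]
    length = 2
    while length <= n:
        angle = (2 * math.pi / length) * (1 if invert else -1)
        half = length // 2
        for start in range(0, n, length):
            for k in range(half):
                # Twiddle factor computed on demand, no precomputed table.
                w = cmath.exp(1j * angle * k)
                u = a[start + k]
                v = a[start + k + half] * w
                a[start + k] = u + v
                a[start + k + half] = u - v
        length <<= 1
    if invert:
        for i in range(n):
            a[i] /= n


def to_digits(x, base):
    """Little-endian digits of a non-negative integer."""
    digits = []
    while x:
        digits.append(x % base)
        x //= base
    return digits or [0]


def multiply(x, y, base=100):
    """Multiply two non-negative integers via FFT convolution of their digits."""
    xd, yd = to_digits(x, base), to_digits(y, base)
    n = 1
    while n < len(xd) + len(yd):
        n <<= 1
    fa = [complex(d) for d in xd] + [0j] * (n - len(xd))
    fb = [complex(d) for d in yd] + [0j] * (n - len(yd))
    fft_in_place(fa)
    fft_in_place(fb)
    for i in range(n):
        fa[i] *= fb[i]
    fft_in_place(fa, invert=True)
    # Round each convolution coefficient and let Python's integers carry.
    return sum(round(c.real) * base**i for i, c in enumerate(fa))
```

Recomputing the twiddles costs extra trigonometric calls, so this trades time for memory. Whether that trade is acceptable at gigabyte sizes is something you would have to measure.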

Related

Memory and time efficient PI algorithm (binary)

just a very simple question. I would like to efficiently and SEQUENTIALLY compute binary digits of pi on an Arduino microprocessor. There is no actual computing purpose in this project, it is for an artistic installation for a friend, with a light pulsing the digits of pi as they are generated.
Therefore I'd need an algorithm sequentially generating binary digits of PI with the following requirements:
- Low memory requirement
- Speed is not really important since the pulsing light frequency will be of the order of a second
- Good asymptotic time complexity; for example, the BBP algorithm grows linearly in time with the digits computed and it soon gets to be slow on an Arduino board, and I can compute the previous digits since I want to show them.
Any ideas? Thank you very much indeed!
Matteo
Well you may easily find billions of binary digits of pi online, just copy them, put into a file, and ......
I really think it's the best way to solve your problem.
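For reference, the BBP digit extraction mentioned in the question can be sketched as follows. This is a minimal illustration in double-precision floats, so it is only trustworthy for fairly small positions, and the function names pi_hex_digit and pi_bits are invented here. Each hexadecimal digit yields four binary digits, and the work per digit grows with its position, which is the slowdown the question describes.

```python
def pi_hex_digit(n):
    """Return the n-th hexadecimal digit of pi after the point (n = 0 is the first)."""

    def series(j):
        # Fractional part of sum over k of 16^(n-k) / (8k + j).
        total = 0.0
        for k in range(n + 1):
            denom = 8 * k + j
            # Modular exponentiation keeps the numerator small.
            total = (total + pow(16, n - k, denom) / denom) % 1.0
        k = n + 1
        while True:
            term = 16.0 ** (n - k) / (8 * k + j)
            if term < 1e-17:
                break
            total += term
            k += 1
        return total % 1.0

    x = (4 * series(1) - 2 * series(4) - series(5) - series(6)) % 1.0
    return int(x * 16)


def pi_bits(count):
    """Return the first count binary digits of pi after the point."""
    bits = []
    n = 0
    while len(bits) < count:
        d = pi_hex_digit(n)
        bits.extend((d >> shift) & 1 for shift in (3, 2, 1, 0))
        n += 1
    return bits[:count]
```

Memory use is constant, which fits the first requirement, but nothing from earlier digits is reused.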

What is the fastest way to calculate e to 2 trillion digits?

I want to calculate e to 2 trillion (2,000,000,000,000) digits. This is about 1.8 TiB of pure e. I just implemented a Taylor series expansion algorithm using GMP (code can be found here).
Unfortunately it crashes when summing more than 4000 terms on my computer, probably because it runs out of memory.
What is the current state of the art in computing e? Which algorithm is the fastest? Any open source implementations that are worth looking at? Please don't mention y-cruncher, it's closed source.
Since I'm the author of the y-cruncher program that you mention, I'll add my 2 cents.
For such a large task, the two biggest barriers that must be tackled are as follows:
Memory
Run-time Complexity
Memory
2 trillion digits is extreme - to say the least. That's double the current record set by Shigeru Kondo and myself back in 2010. (It took us more than 9 days to compute 1 trillion digits using y-cruncher.)
In plain text, that's about 1.8 TiB in decimal. In packed binary representation, that's 773 GiB.
If you're going to be doing arithmetic on numbers of this size, you're gonna need 773 GiB for each operand not counting scratch memory.
Feasibly speaking, y-cruncher actually needs 8.76 TiB of memory to do this computation all in ram. So you can expect other implementations to need the same give or take a factor of 2 at most.
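The two size figures above follow from simple arithmetic: one byte per digit as text, and log2(10) bits per digit in packed binary. A small sketch to reproduce them (the function name storage_sizes is made up for this example):

```python
import math


def storage_sizes(decimal_digits):
    """Return (size as plain text in TiB, size as packed binary in GiB)."""
    text_bytes = decimal_digits                        # one ASCII character per digit
    binary_bytes = decimal_digits * math.log2(10) / 8  # about 3.32 bits per digit
    return text_bytes / 2**40, binary_bytes / 2**30


text_tib, binary_gib = storage_sizes(2 * 10**12)
```

For 2 trillion digits this comes out at roughly 1.8 TiB of text and about 773 GiB in binary, matching the numbers quoted.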
That said, I doubt you're gonna have enough ram. And even if you did, it'd be heavily NUMA. So the alternative is to use disk. But this is not trivial, as to be efficient, you need to treat memory as a cache and micromanage all data that is transferred between memory and disk.
Run-time Complexity
Here we have the other problem. For 2 trillion digits, you're gonna need a very fast algorithm. Not just any fast algorithm, but a quasi-linear run-time algorithm.
Your current attempt runs in about O(N^2). So even if you had enough memory, it won't finish in your lifetime.
The standard approach to computing e to high precision runs in O(N log(N)^2) and combines the following algorithms:
Binary Splitting on the Taylor series expansion of e.
FFT-based large multiplication
Fortunately, GMP already uses FFT-based large multiplication. But it lacks two crucial features:
Out-of-core (swap) computation to use disk when there isn't enough memory.
It isn't parallelized.
The second point isn't as important since you can just wait longer. But for all practical purposes, you're probably gonna need to roll out your own. And that's what I did when I wrote y-cruncher.
That said, there are many other loose-ends that also need to be taken care of:
The final division will require a fast algorithm like Newton's Method.
If you're gonna compute in binary, you're gonna need to do a radix conversion.
If the computation is gonna take a lot of time and a lot of resources, you may need to implement fault-tolerance to handle hardware failures.
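As a rough illustration of the binary splitting step, here is a minimal sketch that uses Python's built-in big integers. It is not y-cruncher's implementation, and it leaves out everything that makes the real problem hard: out-of-core storage, parallelism, FFT multiplication, Newton division and radix conversion. The names split and e_digits, and the guard of 10 extra digits, are assumptions made for this example.

```python
import math


def split(a, b):
    """Return (P, Q) with P/Q = sum for k in a+1..b of 1/((a+1)(a+2)...k)."""
    if b - a == 1:
        return 1, b
    m = (a + b) // 2
    p1, q1 = split(a, m)
    p2, q2 = split(m, b)
    # Combine the two halves into one fraction over the common denominator.
    return p1 * q2 + p2, q1 * q2


def e_digits(digits):
    """Return floor(e * 10**digits) as an integer."""
    guard = 10  # extra digits so that the truncated series does not matter
    target = (digits + guard) * math.log(10)
    n = 2
    while math.lgamma(n + 1) < target:  # until n! exceeds 10**(digits + guard)
        n += 1
    p, q = split(0, n)  # e = 1 + P/Q, with Q = n!
    scale = 10**digits
    return scale + (p * scale) // q
```

The point of the recursion is that the big multiplications happen between numbers of similar size, which is where fast multiplication pays off. Summing the terms one at a time, as in the question, never gets that benefit.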
Since you have a goal how many digits you want (2 trillion) you can estimate how many terms you'll need to calculate e to that number of digits. From this, you can estimate how many extra digits of precision you'll need to keep track of to avoid rounding errors at the 2 trillionth place.
If my calculation from Stirling's approximation is correct, the reciprocal of 10 to the 2 trillion is about the reciprocal of 185 billion factorial. So that's about how many terms you'll need (roughly 185 billion, since 100 billion factorial has only about 1 trillion digits). The story's a little better than that, though, because you'll start being able to throw away a lot of the numbers in the calculation of the terms well before that.
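The term count can be checked numerically without computing any factorial, by comparing log-gamma values against the target number of digits. A minimal sketch, with terms_needed as a made-up helper name:

```python
import math


def terms_needed(decimal_digits):
    """Smallest n such that n! >= 10**decimal_digits, found via log-gamma."""
    target = decimal_digits * math.log(10)
    lo, hi = 1, 2
    while math.lgamma(hi + 1) < target:  # lgamma(n + 1) equals ln(n!)
        hi *= 2
    while lo < hi:
        mid = (lo + hi) // 2
        if math.lgamma(mid + 1) >= target:
            hi = mid
        else:
            lo = mid + 1
    return lo
```

For 2 trillion digits this lands at roughly 1.8 × 10^11 terms, the same order of magnitude as the estimate above.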
Since e is calculated as a sum of inverse factorials, all of your terms are rational, and hence they are expressible as repeating decimals. So the decimal expansion of your terms will be (a) an exponent, (b) a non-repeating part, and (c) a repeating part. There may be some efficiencies you can take advantage of if you look at the terms in this way.
Anyway, good luck!

Should cublas be outperformed by atlas?

According to my measurements of dgemm from both cublas and atlas, atlas severely beats cublas in terms of speed. Is this to be expected for a system with an Intel i7 950 and Nvidia GTX470?
I tested matrices of size 10x10 up to 6000x6000 in increments of 50. Atlas always wins. I measure both total application execution and just the multiplication step.
Anyone else have experience with this? Are these the expected results?
Thanks in advance.
edit: (same code, same results on a Xeon X5670 and Nvidia Tesla C2050)
edit2: It appears a great deal of the slowness is attributed to initialisation of the cublas library. I continue to work on it. I'll update here when I learn more.
Did you use the single-threaded versions of both libraries? As far as I understand, both GotoBLAS and Atlas tend to sneakily use multiple threads when working on large matrices.
That said, at large matrix sizes the algorithm used tends to matter much more than the low-level implementation. Naive matrix multiplication is O(N^3), whereas Strassen algorithm scales much better, about O(N^2.81) or so. However, Strassen algorithm happens to vectorize very nicely (to much larger SSE and AVX registers, yielding almost 2 to 8-fold increase in efficiency, depending on floating-point format and register size).
I am not sure how well the two GPUs you mentioned handle double-precision math. Typically they're optimized for single precision (32-bit floats), dropping to a third or a quarter of that speed when handling doubles.
There are other factors in your tests that may skew the results. For example, you may be including the matrix transfer time to the CPU. Whether that matches real world use cases, I don't know; I don't have an Nvidia GPU to test.. but I suspect not. Usually there are multiple operations, and the matrix does not need to be transferred between operations.
I've been writing my own low-level SSE3 matrix functions using SSE/AVX vector built-in functions provided by GCC and ICC C99 compilers; early testing indicates it beats the current Fortran implementations by a wide margin, especially at the very small (say up to 8x8, optimized for each size) and very large (above 1000x1000, using Strassen algorithm) sizes for dense matrices.
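For readers who have not seen it, the Strassen recursion referred to above replaces the eight block products of the naive method with seven. The following is a minimal pure-Python sketch, assuming square matrices whose size is a power of two. It says nothing about SSE/AVX or about how any BLAS library actually does it, and the helper names are invented for this example.

```python
def add(A, B):
    return [[x + y for x, y in zip(r, s)] for r, s in zip(A, B)]


def sub(A, B):
    return [[x - y for x, y in zip(r, s)] for r, s in zip(A, B)]


def naive(A, B):
    """Textbook O(N^3) product of two square matrices."""
    n = len(A)
    return [[sum(A[i][k] * B[k][j] for k in range(n)) for j in range(n)]
            for i in range(n)]


def strassen(A, B, cutoff=2):
    """Strassen product; assumes len(A) is a power of two."""
    n = len(A)
    if n <= cutoff:
        return naive(A, B)
    h = n // 2
    A11 = [r[:h] for r in A[:h]]
    A12 = [r[h:] for r in A[:h]]
    A21 = [r[:h] for r in A[h:]]
    A22 = [r[h:] for r in A[h:]]
    B11 = [r[:h] for r in B[:h]]
    B12 = [r[h:] for r in B[:h]]
    B21 = [r[:h] for r in B[h:]]
    B22 = [r[h:] for r in B[h:]]
    # Seven recursive products instead of eight.
    M1 = strassen(add(A11, A22), add(B11, B22), cutoff)
    M2 = strassen(add(A21, A22), B11, cutoff)
    M3 = strassen(A11, sub(B12, B22), cutoff)
    M4 = strassen(A22, sub(B21, B11), cutoff)
    M5 = strassen(add(A11, A12), B22, cutoff)
    M6 = strassen(sub(A21, A11), add(B11, B12), cutoff)
    M7 = strassen(sub(A12, A22), add(B21, B22), cutoff)
    C11 = add(sub(add(M1, M4), M5), M7)
    C12 = add(M3, M5)
    C21 = add(M2, M4)
    C22 = add(add(sub(M1, M2), M3), M6)
    top = [r1 + r2 for r1, r2 in zip(C11, C12)]
    bottom = [r1 + r2 for r1, r2 in zip(C21, C22)]
    return top + bottom
```

In practice the cutoff would be much larger than 2, since the extra additions only pay off for big blocks.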

Which is faster — sorting or multiplying a small array of elements?

Reading through Cactus Kev's Poker Hand Evaluator, I noticed the following statements:
At first, I thought that I could always simply sort the hand first before passing it to the evaluator; but sorting takes time, and I didn't want to waste any CPU cycles sorting hands. I needed a method that didn't care what order the five cards were given as.
...
After a lot of thought, I had a brainstorm to use prime numbers. I would assign a prime number value to each of the thirteen card ranks... The beauty of this system is that if you multiply the prime values of the rank of each card in your hand, you get a unique product, regardless of the order of the five cards.
...
Since multiplication is one of the fastest calculations a computer can make, we have shaved hundreds of milliseconds off our time had we been forced to sort each hand before evaluation.
I have a hard time believing this.
Cactus Kev represents each card as a 4-byte integer, and evaluates hands by calling eval_5cards( int c1, int c2, int c3, int c4, int c5 ). We could represent cards as one byte, and a poker hand as a 5-byte array. Sorting this 5-byte array to get a unique hand must be pretty fast. Is it faster than his approach?
What if we keep his representation (cards as 4-byte integers)? Can sorting an array of 5 integers be faster than multiplying them? If not, what sort of low-level optimizations can be done to make sorting a small number of elements faster?
Thanks!
Good answers everyone; I'm working on benchmarking the performance of sorting vs multiplication, to get some hard performance statistics.
Of course it depends a lot on the CPU of your computer, but a typical Intel CPU (e.g. Core 2 Duo) can multiply two 32-bit numbers within 3 CPU clock cycles. For a sort algorithm to beat that, the algorithm needs to be faster than 3 * 4 = 12 CPU cycles (four multiplications for five cards), which is a very tight constraint. None of the standard sorting algorithms can do it in less than 12 cycles for sure. The comparison of two numbers alone will take one CPU cycle, the conditional branch on the result will also take one CPU cycle, and whatever you do then will take at least one CPU cycle (swapping two cards will actually take at least 4 CPU cycles). So multiplying wins.
Of course this is not taking the latency into account to fetch the card value from either 1st or 2nd level cache or maybe even memory; however, this latency applies to either case, multiplying and sorting.
Without testing, I'm sympathetic to his argument. You can do it in 4 multiplications, as compared to sorting, which is n log n. Specifically, the optimal sorting network requires 9 comparisons. The evaluator then has to at least look at every element of the sorted array, which is another 5 operations.
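A 9-comparison sorting network for 5 elements can be written down as a fixed list of compare-and-swap pairs. The pair list below is one commonly cited arrangement, and the names NETWORK_5 and sort5 are made up for this sketch. It shows the structure only and does not measure speed.

```python
# Each pair (i, j) means: make sure a[i] <= a[j], swapping if necessary.
NETWORK_5 = [
    (0, 1), (3, 4),   # sort the pair, start on the triple
    (2, 4), (2, 3),   # positions 2..4 are now sorted
    (1, 4), (0, 3),   # merge the pair into the triple
    (0, 2), (1, 3),
    (1, 2),
]


def sort5(cards):
    """Sort exactly five values with a fixed sequence of 9 compare-and-swaps."""
    a = list(cards)
    for i, j in NETWORK_5:
        if a[i] > a[j]:
            a[i], a[j] = a[j], a[i]
    return a
```

The sequence of comparisons does not depend on the data, so in a compiled language each step can be done with min/max instead of a branch.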
Sorting is not intrinsically harder than multiplying numbers. On paper, they're about the same, and you also need a sophisticated multiplication algorithm to make large multiplication competitive with large sort. Moreover, when the proposed multiplication algorithm is feasible, you can also use bucket sort, which is asymptotically faster.
However, a poker hand is not an asymptotic problem. It's just 5 cards and he only cares about one of the 13 number values of the card. Even if multiplication is complicated in principle, in practice it is implemented in microcode and it's incredibly fast. What he's doing works.
Now, if you're interested in the theoretical question, there is also a solution using addition rather than multiplication. There can only be 4 cards of any one value, so you could just as well assign the values 1,5,25,...,5^12 and add them. It still fits in 32-bit arithmetic. There are also other addition-based solutions with other mathematical properties. But it really doesn't matter, because microcoded arithmetic is so much faster than anything else that the computer is doing.
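Both order-independent keys are easy to try out: the prime product from the question and the powers-of-5 sum suggested here. This is a minimal sketch, assuming ranks are numbered 0 to 12 and mapped to the first thirteen primes; the exact assignment and the function names are illustrative, not taken from the evaluator's source.

```python
from math import prod

# One prime per rank (rank 0 = deuce ... rank 12 = ace); assignment assumed here.
PRIMES = [2, 3, 5, 7, 11, 13, 17, 19, 23, 29, 31, 37, 41]


def prime_key(ranks):
    """Product of the rank primes: the same for every ordering of the hand."""
    return prod(PRIMES[r] for r in ranks)


def base5_key(ranks):
    """Sum of 5**rank: unique because a rank appears at most 4 times in a hand."""
    return sum(5**r for r in ranks)
```

With at most four cards of any rank, each base-5 "digit" stays below 5, so the sum never carries and the key identifies the multiset of ranks. Both keys stay below 2^31 for any legal five-card hand.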
5 elements can be sorted using an optimized decision tree, which is much faster than using a general-purpose sorting algorithm.
However, the fact remains that sorting means lots of branches (as do the comparisons that are necessary afterwards). Branches are really bad for modern pipelined CPU architectures, especially branches that go either way with similar likelihood (thus defeating branch prediction logic). That, much more than the theoretical cost of multiplication vs. comparisons, makes multiplication faster.
But if you could build custom hardware to do the sorting, it might end up faster.
That shouldn't really be relevant, but he is correct. Sorting takes much longer than multiplying.
The real question is what he did with the resulting prime number, and how that was helpful (since I would expect factoring it to take longer than sorting).
It's hard to think of any sorting operation that could be faster than multiplying the same set of numbers. At the processor level, the multiplication is just load, load, multiply, load, multiply, ..., with maybe some manipulation of the accumulator thrown in. It's linear, easily pipelined, no comparisons with the associated branch mis-prediction costs. It should average about 2 instructions per value to be multiplied. Unless the multiply instruction is painfully slow, it's really hard to imagine a faster sort.
One thing worth mentioning is that even if your CPU's multiply instruction is dead slow (or nonexistent...) you can use a lookup table to speed things even further.
After a lot of thought, I had a brainstorm to use prime numbers. I would assign a prime number value to each of the thirteen card ranks... The beauty of this system is that if you multiply the prime values of the rank of each card in your hand, you get a unique product, regardless of the order of the five cards.
That's an example of a non-positional number system.
I can't find the link to the theory. I studied that as part of applied algebra, somewhere around the Euler's totient and encryption. (I can be wrong with terminology as I have studied all that in my native language.)
What if we keep his representation (cards as 4-byte integers)? Can sorting an array of 5 integers be faster than multiplying them?
RAM is an external resource and is generally slower compared to the CPU. Sorting 5 ints would always have to go to RAM due to swap operations. Add to that the overhead of the sorting function itself, and multiplication stops looking all that bad.
I think on modern CPUs integer multiplication would pretty much always be faster than sorting, since several multiplications can be executed at the same time on different ALUs, while there is only one bus connecting the CPU to RAM.
If not, what sort of low-level optimizations can be done to make sorting a small number of elements faster?
5 integers can be sorted quite quickly using bubble sort: qsort would use more memory (for recursion) while well optimized bubble sort would work completely from d-cache.
As others have pointed out, sorting alone isn't quicker than multiplying for 5 values. This ignores, however, the rest of his solution. After disdaining a 5-element sort, he proceeds to do a binary search over an array of 4888 values - at least 12 comparisons, more than the sort ever required!
Note that I'm not saying there's a better solution that involves sorting - I haven't given it enough thought, personally - just that sorting alone is only part of the problem.
He also didn't have to use primes. If he simply encoded the value of each card in 4 bits, he'd need 20 bits to represent a hand, giving a range of 0 to 2^20 = 1048576, about 1/100th of the range produced using primes, and small enough (though still suffering cache coherency issues) to produce a lookup table over.
Of course, an even more interesting variant is to take 7 cards, such as are found in games like Texas Holdem, and find the best 5 card hand that can be made from them.
The multiplication is faster.
Multiplication of any given array will always be faster than sorting the array, presuming the multiplication results in a meaningful result, and the lookup table is irrelevant because the code is designed to evaluate a poker hand so you'd need to do a lookup on the sorted set anyway.
An example of a ready made Texas Hold'em 7- and 5-card evaluator can be found here with documentation and further explained here. All feedback welcome at the e-mail address found therein.
You don't need to sort, and can typically (~97% of the time) get away with just 6 additions and a couple of bit shifts when evaluating 7-card hands. The algo uses a generated look up table which occupies about 9MB of RAM and is generated in a near-instant. Cheap. All of this is done inside of 32-bits, and "inlining" the 7-card evaluator is good for evaluating about 50m randomly generated hands per second on my laptop.
Oh, and multiplication is faster than sorting.

Finding prime factors to large numbers using specially-crafted CPUs

My understanding is that many public key cryptographic algorithms these days depend on large prime numbers to make up the keys, and it is the difficulty in factoring the product of two primes that makes the encryption hard to break. It is also my understanding that one of the reasons that factoring such large numbers is so difficult, is that the sheer size of the numbers used means that no CPU can efficiently operate on the numbers, since our minuscule 32 and 64 bit CPUs are no match for 1024, 2048 or even 4096 bit numbers. Specialized Big Integer math libraries must be used in order to process those numbers, and those libraries are inherently slow since a CPU can only hold (and process) small chunks (like 32 or 64 bits) at one time.
So...
Why can't you build a highly specialized custom chip with 2048 bit registers, and giant arithmetic circuits, much in the same way that we scaled from 8 to 16 to 32 to 64-bit CPUs, just build one a LOT larger? This chip wouldn't need most of the circuitry on conventional CPUs, after all it wouldn't need to handle things like virtual memory, multithreading or I/O. It wouldn't even need to be a general-purpose processor supporting stored instructions. Just the bare minimum to perform the necessary arithmetical calculations on ginormous numbers.
I don't know a whole lot about IC design, but I do remember learning about how logic gates work, how to build a half adder, full adder, then link together a bunch of adders to do multi-bit arithmetic. Just scale up. A lot.
Now, I'm fairly certain that there is a very good reason (or 17) that the above won't work (since otherwise one of the many people smarter than I am would have already done it) but I am interested in knowing why it won't work.
(Note: This question may need some re-working, as I'm not even sure yet if the question makes sense)
What #cube said, and the fact that a giant arithmetic logic unit would take more time for the logic signals to stabilize, and include other complications in digital design. Digital logic design includes something that you take for granted in software, namely that signals through combinational logic take a small but nonzero time to propagate and settle. A 32x32 multiplier needs to be designed carefully. A 1024x1024 multiplier would not only take a huge amount of physical resources in a chip, but it also would be slower than a 32x32 multiplier (though perhaps faster than a 32x32 multiplier computing all the partial products needed to perform a 1024x1024 multiply). Plus it's not only the multiplier that's the bottleneck: you've got memory pathways. You'd have to spend a bunch of time gathering the 1024 bits from a memory circuit that's only 32 bits wide, and storing the resulting 2048 bits back into the memory circuit.
Almost certainly it's better to get a bunch of "conventional" 32-bit or 64-bit systems working in parallel: you get the speedup w/o the hardware design complexity.
edit: if anyone has ACM access (I don't), perhaps take a look at this paper to see what it says.
It's because this speedup would be only in O(n), but the complexity of factoring the number is something like O(2^n) (with respect to the number of bits). So if you made this überprocessor and factorized the numbers 1000 times faster, I would only have to make the numbers 10 bits larger and we would be back at the start again.
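The "10 bits" figure is just log2(1000). A tiny sketch under the 2^n cost model used in this answer (a simplification, and extra_bits_to_cancel is a made-up name):

```python
import math


def extra_bits_to_cancel(speedup):
    """Extra key bits that undo a constant-factor speedup if cost grows like 2**n."""
    return math.log2(speedup)


bits = extra_bits_to_cancel(1000)
```

Rounded up, that is 10 bits, because 2^10 = 1024 is just over 1000.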
As indicated above, the primary problem is simply how many possibilities you have to go through to factor a number. That being said, specialized computers do exist to do this sort of thing.
The real progress for this sort of cryptography is improvements in number factoring algorithms. Currently, the fastest known general algorithm is the general number field sieve.
Historically, we seem to be able to factor numbers twice as large each decade. Part of that is faster hardware, and part of it is simply a better understanding of mathematics and how to perform factoring.
I can't comment on the feasibility of an approach exactly like the one you described, but people do similar things very frequently using FPGAs:
Crack DES keys
Crack GSM conversations
Open source graphics card
Shamir & Tromer suggest a similar approach, using a kind of grid computing:
This article discusses a new design for a custom hardware implementation of the sieving step, which reduces [the cost of sieving, relative to TWINKLE,] to about $10M. The new device, called TWIRL, can be seen as an extension of the TWINKLE device. However, unlike TWINKLE it does not have optoelectronic components, and can thus be manufactured using standard VLSI technology on silicon wafers. The underlying idea is to use a single copy of the input to solve many subproblems in parallel. Since input storage dominates cost, if the parallelization overhead is kept low then the resulting speedup is obtained essentially for free. Indeed, the main challenge lies in achieving this parallelism efficiently while allowing compact storage of the input. Addressing this involves myriad considerations, ranging from number theory to VLSI technology.
Why don't you try building an uber-quantum computer and run Shor's algorithm on it?
"... If a quantum computer with a sufficient number of qubits were to be constructed, Shor's algorithm could be used to break public-key cryptography schemes such as the widely used RSA scheme. RSA is based on the assumption that factoring large numbers is computationally infeasible. So far as is known, this assumption is valid for classical (non-quantum) computers; no classical algorithm is known that can factor in polynomial time. However, Shor's algorithm shows that factoring is efficient on a quantum computer, so a sufficiently large quantum computer can break RSA. ..." -Wikipedia

Resources