Memory and time efficient PI algorithm (binary) - c

just a very simple question. I would like to efficiently and SEQUENTIALLY compute binary digits of pi on an Arduino microcontroller. There is no actual computing purpose in this project; it is for an artistic installation for a friend, with a light pulsing the digits of pi as they are generated.
Therefore I'd need an algorithm sequentially generating binary digits of PI with the following requirements:
- Low memory requirement
- Speed is not really important since the pulsing light frequency will be of the order of a second
- Good asymptotic time complexity: for example, with the BBP algorithm the cost of each digit grows linearly with its position, so it soon gets to be slow on an Arduino board; and since I want to show every digit, I can't skip computing the previous ones.
Any ideas? Thank you very much indeed!
Matteo

Well, you may easily find billions of binary digits of pi online; just copy them, put them into a file, and ......
I really think it's the best way to solve your problem.
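If you go that route, a minimal Arduino sketch along those lines could look like the following. Everything here is a placeholder: the few bits of pi shown, the pin number, and the timing. The bits are kept in flash via PROGMEM so they don't eat into the tiny RAM.

```c
/* Minimal sketch (AVR-based Arduinos): store precomputed binary digits of
   pi in flash and pulse an LED, one digit per second. The pi_bits array is
   a tiny placeholder; fill it with as many bits as the flash can hold. */
#include <avr/pgmspace.h>

const uint8_t pi_bits[] PROGMEM = { 1, 1, 0, 0, 1, 0, 0, 1 }; /* 3.14... = 11.001001... in binary */
const uint16_t N_BITS = sizeof(pi_bits);
const int LED_PIN = 13;

void setup() {
  pinMode(LED_PIN, OUTPUT);
}

void loop() {
  for (uint16_t i = 0; i < N_BITS; i++) {
    uint8_t bit = pgm_read_byte(&pi_bits[i]);  /* read one digit from flash */
    digitalWrite(LED_PIN, bit ? HIGH : LOW);   /* pulse the digit           */
    delay(1000);                               /* ~1 s per digit            */
    digitalWrite(LED_PIN, LOW);
    delay(250);                                /* gap between digits        */
  }
}
```

Packed 8 bits per byte, a typical 32 KiB flash could hold roughly 260,000 bits, which is about three days of pulses at one per second.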

Related

How to multiply terabyte-sized numbers?

When multiplying very large numbers, you use FFT-based multiplication (see the Schönhage–Strassen algorithm). For performance reasons I'm caching the twiddle factors. The problem is that for huge numbers (gigabyte-sized) I need FFT tables of size 2^30 and more, which occupy too much RAM (16 GB and above). So it seems I should use another algorithm.
There is a program called y-cruncher, which is used to calculate Pi and other constants, and which can multiply terabyte-sized numbers. It uses an algorithm called Hybrid NTT and another algorithm called VST (see A Peek into y-cruncher v0.6.1, section The VST Multiplication Algorithm).
Can anyone shed some light on these algorithms or any other algorithm which can be used to multiply terabyte-sized numbers?
FFT can be done in place on the same array with a constant amount of additional memory (the elements may need to be swapped smartly). Therefore it can be done on the hard disk as well. In the worst case that's on the order of N*log(N) disk accesses. It seems a lot slower than doing it in RAM, but the overall complexity remains the same.
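For reference, a minimal in-place iterative radix-2 FFT in C (length must be a power of two; double precision for clarity). The bit-reversal pass is the "smart swapping" mentioned above, and every butterfly writes back into the same array:

```c
/* In-place iterative radix-2 FFT; n must be a power of two. The only data
   movement is the bit-reversal swap pass, then butterflies done in place. */
#include <math.h>

void fft_inplace(double *re, double *im, int n)
{
    /* bit-reversal permutation: the "smart swapping" */
    for (int i = 1, j = 0; i < n; i++) {
        int bit = n >> 1;
        for (; j & bit; bit >>= 1)
            j ^= bit;
        j |= bit;
        if (i < j) {
            double t = re[i]; re[i] = re[j]; re[j] = t;
            t = im[i]; im[i] = im[j]; im[j] = t;
        }
    }
    /* log2(n) butterfly passes over the whole array */
    for (int len = 2; len <= n; len <<= 1) {
        double ang = -2.0 * M_PI / len;
        double wr = cos(ang), wi = sin(ang);
        for (int i = 0; i < n; i += len) {
            double cr = 1.0, ci = 0.0;           /* current twiddle factor */
            for (int k = 0; k < len / 2; k++) {
                int a = i + k, b = i + k + len / 2;
                double xr = re[b] * cr - im[b] * ci;
                double xi = re[b] * ci + im[b] * cr;
                re[b] = re[a] - xr; im[b] = im[a] - xi;
                re[a] += xr;        im[a] += xi;
                double ncr = cr * wr - ci * wi;  /* advance the twiddle    */
                ci = cr * wi + ci * wr; cr = ncr;
            }
        }
    }
}
```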

What is the fastest way to calculate e to 2 trillion digits?

I want to calculate e to 2 trillion (2,000,000,000,000) digits. That is about 1.8 TiB of pure e. I just implemented a Taylor series expansion algorithm using GMP (code can be found here).
Unfortunately it crashes when summing more than 4000 terms on my computer, probably because it runs out of memory.
What is the current state of the art in computing e? Which algorithm is the fastest? Any open source implementations that are worth looking at? Please don't mention y-cruncher, it's closed source.
Since I'm the author of the y-cruncher program that you mention, I'll add my 2 cents.
For such a large task, the two biggest barriers that must be tackled are as follows:
Memory
Run-time Complexity
Memory
2 trillion digits is extreme - to say the least. That's double the current record set by Shigeru Kondo and myself back in 2010. (It took us more than 9 days to compute 1 trillion digits using y-cruncher.)
In plain text, that's about 1.8 TiB in decimal. In packed binary representation, that's 773 GiB.
If you're going to be doing arithmetic on numbers of this size, you're gonna need 773 GiB for each operand not counting scratch memory.
Feasibly speaking, y-cruncher actually needs 8.76 TiB of memory to do this computation all in RAM. So you can expect other implementations to need the same, give or take a factor of 2 at most.
That said, I doubt you're gonna have enough ram. And even if you did, it'd be heavily NUMA. So the alternative is to use disk. But this is not trivial, as to be efficient, you need to treat memory as a cache and micromanage all data that is transferred between memory and disk.
Run-time Complexity
Here we have the other problem. For 2 trillion digits, you're gonna need a very fast algorithm. Not just any fast algorithm, but a quasi-linear run-time algorithm.
Your current attempt runs in about O(N^2). So even if you had enough memory, it won't finish in your lifetime.
The standard approach to computing e to high precision runs in O(N log(N)^2) and combines the following algorithms:
Binary Splitting on the Taylor series expansion of e (a minimal sketch follows this list).
FFT-based large multiplication
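As a toy illustration of the binary-splitting bullet, here is a sketch using GMP's mpz integers. The term count n and digit count are tiny placeholders; a real run would derive n from the target precision and would hit the memory and complexity walls discussed above:

```c
/* Binary splitting for e = sum 1/k!, using GMP.
   Invariant: P(a,b)/Q(a,b) = sum_{k=a+1}^{b} 1/((a+1)(a+2)...k),
   so e ~ 1 + P(0,n)/Q(0,n). */
#include <gmp.h>
#include <stdio.h>

static void bs(unsigned long a, unsigned long b, mpz_t P, mpz_t Q)
{
    if (b - a == 1) {                 /* base case: a single term      */
        mpz_set_ui(P, 1);             /* P(a,a+1) = 1                  */
        mpz_set_ui(Q, b);             /* Q(a,a+1) = a+1                */
        return;
    }
    unsigned long m = a + (b - a) / 2;
    mpz_t P2, Q2;
    mpz_inits(P2, Q2, NULL);
    bs(a, m, P, Q);                   /* left half                     */
    bs(m, b, P2, Q2);                 /* right half                    */
    mpz_mul(P, P, Q2);                /* P = P1*Q2 + P2                */
    mpz_add(P, P, P2);
    mpz_mul(Q, Q, Q2);                /* Q = Q1*Q2                     */
    mpz_clears(P2, Q2, NULL);
}

int main(void)
{
    unsigned long n = 30, digits = 30; /* toy values: ~30 digits of e  */
    mpz_t P, Q, e_scaled;
    mpz_inits(P, Q, e_scaled, NULL);
    bs(0, n, P, Q);
    /* e * 10^digits ~ (P + Q) * 10^digits / Q  -- the final division  */
    mpz_add(P, P, Q);
    mpz_ui_pow_ui(e_scaled, 10, digits);
    mpz_mul(e_scaled, e_scaled, P);
    mpz_tdiv_q(e_scaled, e_scaled, Q);
    gmp_printf("e ~ %Zd x 10^-%lu\n", e_scaled, digits);
    mpz_clears(P, Q, e_scaled, NULL);
    return 0;
}
```

The point of the splitting is that the expensive multiplications happen on large, balanced operands near the top of the recursion, where FFT-based multiplication shines, which is what gives the O(N log(N)^2) behavior.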
Fortunately, GMP already uses FFT-based large multiplication. But it lacks two crucial features:
Out-of-core (swap) computation to use disk when there isn't enough memory.
Parallelized computation.
The second point isn't as important since you can just wait longer. But for all practical purposes, you're probably gonna need to roll your own. And that's what I did when I wrote y-cruncher.
That said, there are many other loose-ends that also need to be taken care of:
The final division will require a fast algorithm like Newton's Method (see the sketch after this list).
If you're gonna compute in binary, you're gonna need to do a radix conversion.
If the computation is gonna take a lot of time and a lot of resources, you may need to implement fault-tolerance to handle hardware failures.
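For the division loose end, here is a toy sketch of Newton's method for a reciprocal, in plain double precision. An arbitrary-precision version runs the same iteration x_{k+1} = x_k*(2 - d*x_k), doubling the working precision at each step, since each step roughly doubles the number of correct digits:

```c
#include <stdio.h>

/* Newton iteration for 1/d, with d pre-scaled into [0.5, 1).
   The seed 48/17 - (32/17)d has relative error <= 1/17, so four
   doublings push the error below double-precision epsilon. */
static double reciprocal(double d)
{
    double x = 48.0 / 17.0 - (32.0 / 17.0) * d;  /* classic linear seed   */
    for (int i = 0; i < 4; i++)
        x = x * (2.0 - d * x);                   /* x_{k+1} = x_k(2-d x_k) */
    return x;
}

int main(void)
{
    double d = 0.7;
    printf("1/%.1f ~ %.17g (true %.17g)\n", d, reciprocal(d), 1.0 / d);
    return 0;
}
```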
Since you have a goal for how many digits you want (2 trillion), you can estimate how many terms you'll need to calculate e to that number of digits. From this, you can estimate how many extra digits of precision you'll need to keep track of to avoid rounding errors at the 2-trillionth place.
If my calculation from Stirling's approximation is correct, the reciprocal of 10 to the 2 trillion is about the reciprocal of 100 billion factorial. So that's about how many terms you'll need (100 billion). The story's a little better than that, though, because you'll start being able to throw away a lot of the numbers in the calculation of the terms well before that.
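For reference, the estimate above comes down to choosing the term count n via Stirling's approximation (a sketch; d is the target number of decimal digits):

```latex
% pick the smallest n whose first neglected term 1/n! is below 10^{-d}:
\log_{10} n! \;\ge\; d, \qquad
\log_{10} n! \;\approx\; n \log_{10}\!\frac{n}{e} + \tfrac{1}{2}\log_{10}(2\pi n)
\quad\text{(Stirling)}
```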
Since e is calculated as a sum of inverse factorials, all of your terms are rational, and hence they are expressible as repeating decimals. So the decimal expansion of your terms will be (a) an exponent, (b) a non-repeating part, and (c) a repeating part. There may be some efficiencies you can take advantage of if you look at the terms in this way.
Anyway, good luck!

Efficiency of arcsin computation from sine lookup table

I have implemented a lookup table to compute sine/cosine values in my system. I now need inverse trigonometric functions (arcsin/arccos).
My application is running on an embedded device on which I can't add a second lookup table for arcsin as I am limited in program memory. So the solution I had in mind was to browse over the sine lookup table to retrieve the corresponding index.
I am wondering if this solution will be more efficient than using the standard implementation coming from the math standard library.
Has someone already experimented on this?
The current implementation of the LUT is an array of the sine values from 0 to PI/2. The values stored in the table are multiplied by 4096 to stay with integer values with enough precision for my application. The lookup table has a resolution of 1/4096, which gives us an array of 6434 values.
Then I have two functions, sine & cosine, that take an angle in radians multiplied by 4096 as argument. Those functions convert the given angle to the corresponding angle in the first quadrant and read the corresponding value in the table.
My application runs on a dsPIC33F at 40 MIPS and I use the C30 compiling suite.
It's pretty hard to say anything with certainty since you have not told us about the hardware, the compiler or your code. However, a priori, I'd expect the standard library from your compiler to be more efficient than your code.
It is perhaps unfortunate that you have to use the C30 compiler which does not support C++, otherwise I'd point you to Optimizing Math-Intensive Applications with Fixed-Point Arithmetic and its associated library.
However the general principles of the CORDIC algorithm apply, and the memory footprint will be far smaller than your current implementation. The article explains the generation of arctan(); arccos() and arcsin() can be calculated from that as described here.
Of course that also suggests that you will need square-root and division. These may be expensive, though PIC24/dsPIC have hardware integer division. The article on math acceleration deals with square-root as well. It is likely that your look-up table approach will be faster for the direct look-up, but perhaps not for the reverse search; the approaches explained in this article are more general and more precise (the library uses 64-bit integers as 36.28-bit fixed point, but you might get away with less precision and range in your application), and certainly faster than a standard library implementation using software floating-point.
You can use a "halfway" approach, combining a coarse-grained lookup table to save memory, and a numeric approximation for the intermediate values (e.g. Maclaurin Series, which will be more accurate than linear interpolation.)
Some examples here.
This question also has some related links.
A binary search of 6434 entries will take ~12 lookups to find the value, followed by an interpolation if more accuracy is needed (a sketch of this search follows below). Due to the nature of the sine curve, you will get much more accuracy at one end than at the other. If you can spare the memory, making your own inverse table evenly spaced on the inputs is likely a better bet for speed and accuracy.
In terms of comparison to the built-in version, you'll have to test that. When you do, pay attention to how much the size of your image increases. The standard library implementations can be pretty hefty on some systems.
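To make the reverse-search idea concrete, here is a minimal sketch in C. The table name and layout are assumptions matching the question's description (entry a holds sin(a/4096)*4096 for a = 0..6433); the search exploits the fact that sine is monotonic on the first quadrant:

```c
/* arcsin via binary search over the question's sine LUT (hypothetical
   table layout: sine_lut[a] = sin(a/4096)*4096, first quadrant only). */
#include <stdint.h>

#define LUT_SIZE 6434
extern const int16_t sine_lut[LUT_SIZE];

/* Returns arcsin(x) in radians * 4096, for x in [0, 4096]. */
int16_t arcsin_lut(int16_t x)
{
    int32_t lo = 0, hi = LUT_SIZE - 1;
    while (lo < hi) {                  /* ~13 iterations for 6434 entries */
        int32_t mid = (lo + hi) / 2;
        if (sine_lut[mid] < x)
            lo = mid + 1;
        else
            hi = mid;
    }
    /* optionally interpolate between lo-1 and lo to win back accuracy
       near x = 4096, where the sine curve flattens out */
    return (int16_t)lo;
}
```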

How to use cepstral?

Recently I asked this question: How to get the fundamental frequency from FFT? (you don't actually need to read it)
My doubt right now is: how do I use the cepstral algorithm?
I just don't know how to use it because the only language that I know is ActionScript 3, and for this reason I have few references about the native functions found in C, Java and so on, and how I should implement them in AS. Most articles are about those languages =/
(although answers in languages other than AS are welcome, just explain how the script works, please)
The articles I found about cepstral to find the fundamental frequency of a FFT result told me that I should do this:
signal → FT → abs() → square → log → FT → abs() → square → power cepstrum
mathematically:
|F{log(|F{f(t)}|²)}|²
Important info:
I am developing a GUITAR TUNER in flash
This is the first time I am dealing with advanced sound
I am using an FFT to extract frequency bins from the signal that reaches user's microphone, but I got stuck in getting the fundamental frequency from it
I don't know:
How to apply a square to an ARRAY (I mean, the data that my FFT gives me is an array. Should I multiply it by itself? ActionScript's debugger throws errors when I try fftResults * fftResults)
How to apply the "log". I would not know how to apply it even if I had a single number.
What is the difference between the complex cepstrum and the power cepstrum? Also, which of them should I use? I am trying to develop a guitar tuner.
Thanks!
Note that the output of an FFT is an array of complex values, i.e. each bin = re + j*im. I think you can just combine the abs and square operations and calculate re*re + im*im for each bin. This gives you a single positive value for each bin, and obviously you can calculate the log value for each bin quite easily. You then need to do a second FFT on this log squared data, and again using the output of this second FFT you will calculate re*re + im*im for each bin. You will then have an array of positive values which will have one or more peaks representing the fundamental frequency or frequencies of your input.
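A sketch of that pipeline in C, assuming some in-place FFT routine fft_inplace(re, im, n) exists (not shown) and has already been applied once to the windowed input; the small bias is just a guard against log(0):

```c
/* Power cepstrum: power spectrum -> log -> second FFT -> power. */
#include <math.h>

extern void fft_inplace(double *re, double *im, int n); /* assumed */

void power_cepstrum(double *re, double *im, double *cep, int n)
{
    for (int i = 0; i < n; i++) {
        double p = re[i] * re[i] + im[i] * im[i]; /* abs()+square in one go */
        re[i] = log(p + 1e-12);                   /* bias guards log(0)     */
        im[i] = 0.0;
    }
    fft_inplace(re, im, n);                       /* second FT on log power */
    for (int i = 0; i < n; i++)
        cep[i] = re[i] * re[i] + im[i] * im[i];   /* power cepstrum bins    */
}
```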
The autocorrelation is the easiest and most logical approach, and the best place to start.
To get this working, start with a simple autocorrelation, and then, if necessary, improve it following the outline provided by YIN. (YIN is based on the autocorrelation with refinements. But whether or not you'll need these refinements depends on details of your situation.) This way also, you can learn as you go rather than trying to understand the whole thing in one shot.
Although FFT approaches can also work, they are a bit more confusing. The issue is that what you are really after is the period, and this isn't well represented by the FFT. The missing fundamental is a good example of this, where if you have 2Hz and 3Hz, the fundamental is 1Hz, but is nowhere in the FFT, while 1Hz is obvious in a time based representation (e.g. the autocorrelation). Add to this that overtones aren't necessarily harmonic, and noise, etc... and all of these issues make it usually best to start with a direct approach to the problem.
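To make "start with a simple autocorrelation" concrete, here is a minimal sketch; the normalization and lag range are design choices, not part of any standard. For a guitar at 44.1 kHz, a pitch range of roughly 70-350 Hz corresponds to lags of about 126 to 630 samples:

```c
/* Plain time-domain autocorrelation pitch estimator: pick the lag with
   the highest normalized correlation within a plausible range. */
#include <stddef.h>

double estimate_f0(const double *x, size_t n, double fs,
                   size_t min_lag, size_t max_lag)
{
    size_t best_lag = min_lag;
    double best_r = -1.0;
    for (size_t lag = min_lag; lag <= max_lag && lag < n; lag++) {
        double r = 0.0, e = 0.0;
        for (size_t i = 0; i + lag < n; i++) {
            r += x[i] * x[i + lag];   /* correlation at this lag   */
            e += x[i] * x[i];         /* energy, for normalization */
        }
        if (e > 0.0 && r / e > best_r) {
            best_r = r / e;
            best_lag = lag;
        }
    }
    return fs / (double)best_lag;     /* lag in samples -> Hz      */
}
```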
There are many ways of finding fundamental frequency (F0).
For languages like Java etc. there are many libraries with these kinds of algorithms already implemented (you can study their sources).
MFCC (based on the cepstrum), implemented in Comirva (open source).
Audacity (beta version!) (open source) presents the cepstrum, autocorrelation, and enhanced autocorrelation.
YIN, based on autocorrelation (example).
Finding max signal values after FFT.
All these algorithms may be very helpful for you. However, the easiest way to get F0 (one value in Hz) would be to use YIN.

Finding prime factors to large numbers using specially-crafted CPUs

My understanding is that many public key cryptographic algorithms these days depend on large prime numbers to make up the keys, and it is the difficulty in factoring the product of two primes that makes the encryption hard to break. It is also my understanding that one of the reasons that factoring such large numbers is so difficult, is that the sheer size of the numbers used means that no CPU can efficiently operate on the numbers, since our minuscule 32 and 64 bit CPUs are no match for 1024, 2048 or even 4096 bit numbers. Specialized Big Integer math libraries must be used in order to process those numbers, and those libraries are inherently slow since a CPU can only hold (and process) small chunks (like 32 or 64 bits) at one time.
So...
Why can't you build a highly specialized custom chip with 2048 bit registers, and giant arithmetic circuits, much in the same way that we scaled from 8 to 16 to 32 to 64-bit CPUs, just build one a LOT larger? This chip wouldn't need most of the circuitry on conventional CPUs, after all it wouldn't need to handle things like virtual memory, multithreading or I/O. It wouldn't even need to be a general-purpose processor supporting stored instructions. Just the bare minimum to perform the necessary arithmetical calculations on ginormous numbers.
I don't know a whole lot about IC design, but I do remember learning about how logic gates work, how to build a half adder, full adder, then link together a bunch of adders to do multi-bit arithmetic. Just scale up. A lot.
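(For contrast, this chaining is exactly what a big-integer library does in software: each machine word plays the role of one adder stage, with the carry rippling between words. A minimal sketch:)

```c
/* Ripple-carry addition over 32-bit limbs: a += b, both n limbs long,
   little-endian limb order; returns the final carry out. Each limb is
   one "adder", and the carry links them, just like chained full adders. */
#include <stdint.h>
#include <stddef.h>

uint32_t bigint_add(uint32_t *a, const uint32_t *b, size_t n)
{
    uint64_t carry = 0;
    for (size_t i = 0; i < n; i++) {
        uint64_t s = (uint64_t)a[i] + b[i] + carry;
        a[i] = (uint32_t)s;            /* low 32 bits stay in this limb */
        carry = s >> 32;               /* high bit ripples to the next  */
    }
    return (uint32_t)carry;
}
```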
Now, I'm fairly certain that there is a very good reason (or 17) that the above won't work (since otherwise one of the many people smarter than I am would have already done it) but I am interested in knowing why it won't work.
(Note: This question may need some re-working, as I'm not even sure yet if the question makes sense)
What #cube said, and the fact that a giant arithmetic logic unit would take more time for the logic signals to stabilize, and include other complications in digital design. Digital logic design includes something that you take for granted in software, namely that signals through combinational logic take a small but nonzero time to propagate and settle. A 32x32 multiplier needs to be designed carefully. A 1024x1024 multiplier would not only take a huge amount of physical resources in a chip, but it also would be slower than a 32x32 multiplier (though perhaps faster than a 32x32 multiplier computing all the partial products needed to perform a 1024x1024 multiply). Plus it's not only the multiplier that's the bottleneck: you've got memory pathways. You'd have to spend a bunch of time gathering the 1024 bits from a memory circuit that's only 32 bits wide, and storing the resulting 2048 bits back into the memory circuit.
Almost certainly it's better to get a bunch of "conventional" 32-bit or 64-bit systems working in parallel: you get the speedup w/o the hardware design complexity.
edit: if anyone has ACM access (I don't), perhaps take a look at this paper to see what it says.
It's because this speedup would only be O(n), while the complexity of factoring the number is something like O(2^n) (with respect to the number of bits). So if you made this überprocessor and factorized numbers 1000 times faster, I would only have to make the numbers 10 bits larger (2^10 = 1024 ≈ 1000) and we would be back at the start again.
As indicated above, the primary problem is simply how many possibilities you have to go through to factor a number. That being said, specialized computers do exist to do this sort of thing.
The real progress for this sort of cryptography is improvements in number factoring algorithms. Currently, the fastest known general algorithm is the general number field sieve.
Historically, we seem to be able to factor numbers twice as large each decade. Part of that is faster hardware, and part of it is simply a better understanding of mathematics and how to perform factoring.
I can't comment on the feasibility of an approach exactly like the one you described, but people do similar things very frequently using FPGAs:
Crack DES keys
Crack GSM conversations
Open source graphics card
Shamir & Tromer suggest a similar approach, using a kind of grid computing:
This article discusses a new design for a custom hardware implementation of the sieving step, which reduces [the cost of sieving, relative to TWINKLE,] to about $10M. The new device, called TWIRL, can be seen as an extension of the TWINKLE device. However, unlike TWINKLE it does not have optoelectronic components, and can thus be manufactured using standard VLSI technology on silicon wafers. The underlying idea is to use a single copy of the input to solve many subproblems in parallel. Since input storage dominates cost, if the parallelization overhead is kept low then the resulting speedup is obtained essentially for free. Indeed, the main challenge lies in achieving this parallelism efficiently while allowing compact storage of the input. Addressing this involves myriad considerations, ranging from number theory to VLSI technology.
Why don't you try building an uber-quantum computer and run Shor's algorithm on it?
"... If a quantum computer with a sufficient number of qubits were to be constructed, Shor's algorithm could be used to break public-key cryptography schemes such as the widely used RSA scheme. RSA is based on the assumption that factoring large numbers is computationally infeasible. So far as is known, this assumption is valid for classical (non-quantum) computers; no classical algorithm is known that can factor in polynomial time. However, Shor's algorithm shows that factoring is efficient on a quantum computer, so a sufficiently large quantum computer can break RSA. ..." -Wikipedia
