What good source of entropy are available on a smart phone? - mobile

I'm thinking of this in the context of how much (kB/s) cryptographically secure entropy can be generated by a smart phone. For an example application: a VoIP app that continually generates new encryption keys.
Things I can think of off hand:
- camera(s)
- microphones
- accelerometers
- magnetometer
- touch screen
- signal strength and timing for Call, WiFi, Bluetooth, etc.
Clearly, each of these will generate differing amounts of signal (predictable data) and noise (the entropy that is wanted) but suitably combined it should be good.
Also if any one has any estimates on the amount of entropy various sources would produce under normal conditions that would also be on interest.

The usual answer is that you do not need much entropy. 128 bits are enough; once you have 128 truly random bits, you can use them in a cryptographically secure pseudo-random number generator (PRNG) which will produce as many random bits as you need, at a high rate, limited only by the local computing power (on a smartphone, PRNG bandwidth will be in megabytes per second, not kilobytes per second).
Continuous entropy gathering is more a fetish than a scientific, rational need. Some say that getting "true" random protects you from any future cryptanalytic breach on the PRNG; but that argument holds only if you can get fresh uniformly random bits (which does not happen in practice: you need to apply a hash function to smooth out the gathered "noise") and if you use the random bits directly, not as keys in an encryption algorithm. A stronger case for continuous entropy gathering can be made about seed storage: the fear that an attacker, getting hold of the PRNG, could look at its entrails, recover internal state, and retroactively guess random bits which were previously emitted. Good PRNG protect against that. At the very least, you can reseed with 128 fresh bits every second, which is a low rate.
That being said, if you need entropy, the phone camera is probably the best source to use, because the CCD detector is very sensitive to heat-generated noise, and it outputs data with a very high bandwidth. A basic phone camera single picture will easily contain a megabyte worth of data, and, even if the phone is inside a pitch-back fridge, you will still have thousands of bits worth of noise (only one thousand bits of noise means that over the million pixels, 99.9% are "perfect", a somewhat ludicrous notion in a 400$ phone -- NASA engineers cannot do that in space probes which cost a million times more).
So just take a picture, hash it with any convenient hash function (e.g. SHA-256), and voila! you have 256 bits of entropy, which you use in a PRNG. If you really get nervous about the PRNG after some time, just take another snapshot.

Related

Fastest disk based solution to cache trillions of unique md5 hashes

Is there a very low latency disk based caching solution that I can use to store only unique values (NOT key+value)?
My script needs to keep track of which files it has processed so it doesn't redo any work. I need to check the cache to search for the md5 hash of the file, if it doesn't exist, I process the file and add the hash to the cache.
Is there a faster disk based caching solution than using a key-value based solution?
Try LevelDB.
It's a key-value store but is very compact due to the trie structure.
Less space usage => less I/O => better performance.
Not sure about "trillions" (a trillion MD5 hashes would be 16,000 TB), but Bitcoin core as well as Ethereum implementations all use LevelDB.
In your case, there is no need for an "Ordered Key-Value Store". That is you can rely on plain Key-Value stores (direct dbm successors):
Good candidates are:
tokyo cabinet it has a hash-based format, that might be faster in your case.
gdbm
In the case where the datatset fit into memory, you might want to try LMDB.
I do not recomment LevelDB because it is slow.
Do the math. 1 trillion MD5s, without any tricks, would take 16TB of disk space. This is, I assume, far more than your RAM size.
Since each MD5 lookup is essentially a 'random' probe into the disk, there will necessarily be about 1 disk hit per check.
If, say, an SSD read is 1ms, that is 1e9 seconds to insert (or check) a trillion hashes. That's 30 years.
There are a lot flaws in my math, but I think this says that it is not practical today to store and check a trillion of anything random.
If you want to crank it down to a billion MD5s, now we are getting in the range of RAM sizes. But you probably want to have the data persisted? So you really need some database-like tool that will do the persisting for you, while making the checks purely in RAM (CPU-speed).
In any case, I would consider writing code that breaks the MD5 into 2 or 3 chunks, then use the chunks like a directory structure. At the bottom level, you have a variable-length bunch of values for the last chunk. Each is perhaps 8 bytes long. That would need a linear or binary search into a bunch of numbers that are half the size of an MD5. The savings here helps compensate for the various overheads in the rest of the structure, plus the need for writing blocks to disk. Hence, I would still expect needing about 16GB of RAM to house a billion MD5s.
Given that approach, virtually any database engine is already geared up to do most of the work reasonably efficiently. The lowest level would be some type of BLOB containing multiple 8-byte chunks.
Another trick to use... Let's look at just the first 5 bytes of an MD5. There are a trillion different values in 5 bytes. If you have only a billion entries in your dataset, then checking the 5 bytes has a 99.9% chance of correctly saying "the md5 is not in the dataset" versus less than 0.01% chance of saying "the md5 might be in the dataset". In the former case, you get a quick answer with only 5GB for a billion items. In the latter case, you may have to go to disk and be slower. Still the average time is better. This helps with the speed of checking. (But does not address the speed of loading.)

Generating totally random numbers without random function? [duplicate]

This question already has answers here:
Closed 12 years ago.
Possible Duplicate:
True random number generator
I was talking to a friend the other day and we were trying to figure out if it is possible to generate completely random numbers without the help of a random function? In C for example "rand" generates pseudo-random numbers. Or we can use something like "srand( time( NULL ) );" This will allow the computer to read numbers from its clock as seed values. So if I understand everything I have read so far right, then I am pretty sure that no random function actually produces truely random numbers. How would one write a program that generates numbers that are completely random and what would code look like?
Check out this question:
True random number generator
Also, from wikipedia's entry on pseudorandom numbers
As John von Neumann joked, "Anyone who considers arithmetical methods of producing random digits is, of course, in a state of sin."
The excellent random.org website provides hardware-based random numbers as well as a number of software interfaces to retrieve these.
This can be used e.g. for genuinely unpredictable seeds or for 'true' random numbers. Being a web service, there are limits on the number of draws you can make, so don't try to use this for your graduate school monte carlo simulation.
FWIW, I wrapped one of those interface in the R package random.
It would look like:
int random = CallHardwareRandomGenerator();
Even with hardware, randomness is tricky. There are things which are physically random (atomic decay is random, but with predictable average amounts, so that can be used as a source of random information) there are things that are physically random enough to make prediction impractical (this is how casinos make money).
There are things that are largely indeterminate (mix up information from key-stroke rate, mouse-movements, and a few things like that), which are a good-enough source of "randomness" for many uses.
Mathematically, we cannot produce randomness, but we can improve distribution and make something harder to predict. Cryptographic PRNGs do a stronger job at this than most, but are more expensive in terms of resources.
This is more of a physics question I think. If you think about it nothing is random, it just occurs due to events the complexity of which make them unpredictable to us. A computer is a subsystem just like any other in the universe and by giving it unpredictable external inputs (RTC, I/O garbage) we can get the same kind of randomness that that a roulette wheel gets from varying friction, air resistance, initial impulse and millions of factors that I can't wrap my head around.
There's room for a fair amount of philosophical debate about what "truly random" really even means. From a practical viewpoint, even sources we know aren't truly random can be used in ways that produce what are probably close enough for almost any practical purpose though (in particular, that at least with current technology, full knowledge of the previously produced bitstream appears to be insufficient to predict the next bit accurately). Most of those do involve a bit of extra hardware though -- for example, it's pretty easy to put a source together from a little bit of Americium out of a smoke detector.
There are quite a few more sources as well, though they're mostly pretty low bandwidth (e.g., collect one bit for each keystroke, based on whether the interval between keystrokes was an even or odd number of CPU clocks -- assuming the CPU clock and keyboard clock are derived from separate crystals). OTOH, you have to be really careful with this -- a fair number of security holes (e.g., in Netscape around v. 4.0 or so) have stemmed from people believing that such sources were a lot more random than they really were.
While there are a number of web sites that produce random numbers from hardware sources, most of them are useless from a viewpoint of encryption. Even at best, you're just trusting your SSL (or TLS) connection to be secure so nobody captured the data you got from the site.

Arbitrary-precision random numbers in C: generation for Monte Carlo simulation without atmospheric noise

I know that there are other questions similar to this one, however the following question pertains to arbitrary-precision random number generation in C for use in Monte Carlo simulation.
How can we generate good quality arbitrary-precision random numbers in C, when atmospheric noise isn't always available, without relying on disk i/o or network access that would create bottlenecks?
libgmp is capable of generating random numbers, but, like other implementations of pseudo-random number generators, it requires a seed. As the manual mentions, "the system time is quite easy to guess, so if unpredictability is required then it should definitely not be the only source for the seed value."
Is there a portable/ported library for generating random numbers, or seeds for random numbers? The libgmp also mentions that "On some systems there's a special device /dev/random which provides random data better suited for use as a seed." However, /dev/random and /dev/urandom can only be used on *nix systems.
Don't overestimate importance of seed.
Firstly, it doesn't need to be truly chaotic - only to have good distribution and not be correlated with any processes in your simulation or pseudo-random generator.
Secondly, for Monte-Carlo statistical characteristics matter, not randomness (in any sense) of a particular number.
Low bytes of high-precision time or some derivative of keyboard-mouse actions make a good seed for anything that is going to be run on a regular PC.
By definition, true random numbers require chaotic information from the real world. /dev/random often (but not always) provides this. Another option for *ix is Entropy Gathering Demon. /dev/urandom by design will happily provide non-random data, since it doesn't block when the entropy pool is exhausted.
Internet APIs that provide this include HotBits (radioactive decay), LavaRnd (CCD), and Random.org (atmospheric noise, which I realize you don't want). See also Hardware random number generator
This device (no affiliation) has drivers for Windows and *ix.
Why your arbitrary precision requirement? There is no "random number between 0 and infinity". You always need a range.

Finding prime factors to large numbers using specially-crafted CPUs

My understanding is that many public key cryptographic algorithms these days depend on large prime numbers to make up the keys, and it is the difficulty in factoring the product of two primes that makes the encryption hard to break. It is also my understanding that one of the reasons that factoring such large numbers is so difficult, is that the sheer size of the numbers used means that no CPU can efficiently operate on the numbers, since our minuscule 32 and 64 bit CPUs are no match for 1024, 2048 or even 4096 bit numbers. Specialized Big Integer math libraries must be used in order to process those numbers, and those libraries are inherently slow since a CPU can only hold (and process) small chunks (like 32 or 64 bits) at one time.
So...
Why can't you build a highly specialized custom chip with 2048 bit registers, and giant arithmetic circuits, much in the same way that we scaled from 8 to 16 to 32 to 64-bit CPUs, just build one a LOT larger? This chip wouldn't need most of the circuitry on conventional CPUs, after all it wouldn't need to handle things like virtual memory, multithreading or I/O. It wouldn't even need to be a general-purpose processor supporting stored instructions. Just the bare minimum to perform the necessary arithmetical calculations on ginormous numbers.
I don't know a whole lot about IC design, but I do remember learning about how logic gates work, how to build a half adder, full adder, then link together a bunch of adders to do multi-bit arithmetic. Just scale up. A lot.
Now, I'm fairly certain that there is a very good reason (or 17) that the above won't work (since otherwise one of the many people smarter than I am would have already done it) but I am interested in knowing why it won't work.
(Note: This question may need some re-working, as I'm not even sure yet if the question makes sense)
What #cube said, and the fact that a giant arithmetic logic unit would take more time for the logic signals to stabilize, and include other complications in digital design. Digital logic design includes something that you take for granted in software, namely that signals through combinational logic take a small but nonzero time to propagate and settle. A 32x32 multiplier needs to be designed carefully. A 1024x1024 multiplier would not only take a huge amount of physical resources in a chip, but it also would be slower than a 32x32 multiplier (though perhaps faster than a 32x32 multiplier computing all the partial products needed to perform a 1024x1024 multiply). Plus it's not only the multiplier that's the bottleneck: you've got memory pathways. You'd have to spend a bunch of time gathering the 1024 bits from a memory circuit that's only 32 bits wide, and storing the resulting 2048 bits back into the memory circuit.
Almost certainly it's better to get a bunch of "conventional" 32-bit or 64-bit systems working in parallel: you get the speedup w/o the hardware design complexity.
edit: if anyone has ACM access (I don't), perhaps take a look at this paper to see what it says.
Its because this speedup would be only in O(n), but the complexity of factoring the number is something like O(2^n) (with respect to the number of bits). So if you made this überprocessor and factorized the numbers 1000 times faster, I would only have to make the numbers 10 bits larger and we would be back on the start again.
As indicated above, the primary problem is simply how many possibilities you have to go through to factor a number. That being said, specialized computers do exist to do this sort of thing.
The real progress for this sort of cryptography is improvements in number factoring algorithms. Currently, the fastest known general algorithm is the general number field sieve.
Historically, we seem to be able to factor numbers twice as large each decade. Part of that is faster hardware, and part of it is simply a better understanding of mathematics and how to perform factoring.
I can't comment on the feasibility of an approach exactly like the one you described, but people do similar things very frequently using FPGAs:
Crack DES keys
Crack GSM conversations
Open source graphics card
Shamir & Tromer suggest a similar approach, using a kind of grid computing:
This article discusses a new design for a custom hardware
implementation of the sieving step, which
reduces [the cost of sieving, relative to TWINKLE,] to about $10M. The new device,
called TWIRL, can be seen as an extension of the
TWINKLE device. However, unlike TWINKLE it
does not have optoelectronic components, and can
thus be manufactured using standard VLSI technology
on silicon wafers. The underlying idea is to use
a single copy of the input to solve many subproblems
in parallel. Since input storage dominates cost, if the
parallelization overhead is kept low then the resulting
speedup is obtained essentially for free. Indeed, the
main challenge lies in achieving this parallelism efficiently while allowing compact storage of the input.
Addressing this involves myriad considerations, ranging
from number theory to VLSI technology.
Why don't you try building an uber-quantum computer and run Shor's algorithm on it?
"... If a quantum computer with a sufficient number of qubits were to be constructed, Shor's algorithm could be used to break public-key cryptography schemes such as the widely used RSA scheme. RSA is based on the assumption that factoring large numbers is computationally infeasible. So far as is known, this assumption is valid for classical (non-quantum) computers; no classical algorithm is known that can factor in polynomial time. However, Shor's algorithm shows that factoring is efficient on a quantum computer, so a sufficiently large quantum computer can break RSA. ..." -Wikipedia

Alternative Entropy Sources

Okay, I guess this is entirely subjective and whatnot, but I was thinking about entropy sources for random number generators. It goes that most generators are seeded with the current time, correct? Well, I was curious as to what other sources could be used to generate perfectly valid, random (The loose definition) numbers.
Would using multiple sources (Such as time + current HDD seek time [We're being fantastical here]) together create a "more random" number than a single source? What are the logical limits of the amount of sources? How much is really enough? Is the time chosen simply because it is convenient?
Excuse me if this sort of thing is not allowed, but I'm curious as to the theory behind the sources.
The Wikipedia article on Hardware random number generator's lists a couple of interesting sources for random numbers using physical properties.
My favorites:
A nuclear decay radiation source detected by a Geiger counter attached to a PC.
Photons travelling through a semi-transparent mirror. The mutually exclusive events (reflection — transmission) are detected and associated to "0" or "1" bit values respectively.
Thermal noise from a resistor, amplified to provide a random voltage source.
Avalanche noise generated from an avalanche diode. (How cool is that?)
Atmospheric noise, detected by a radio receiver attached to a PC
The problems section of the Wikipedia article also describes the fragility of a lot of these sources/sensors. Sensors almost always produce decreasingly random numbers as they age/degrade. These physical sources should be constantly checked by statistical tests which can analyze the generated data, ensuring the instruments haven't broken silently.
SGI once used photos of a lava lamp at various "glob phases" as the source for entropy, which eventually evolved into an open source random number generator called LavaRnd.
I use Random.ORG, they provide free random data from Atmospheric noise, that I use to periodically re-seed a Mersene-Twister RNG. Its about as random as you can get with no hardware dependencies.
Don't worry about a "good" seed for a random number generator. The statistical properties of the sequence do not depend on how the generator is seeded. There are other things, however. to worry about. See Pitfalls in Random Number Generation.
As for hardware random number generators, these physical sources have to be measured, and the measurement process has systematic errors. You might find "pseudo" random numbers to have higher quality than "real" random numbers.
Linux kernel uses device interrupt timing (mouse, keyboard, hard drives) to generate entropy. There's a nice article on Wikipedia on entropy.
Modern RNGs are both checked against correlations in nearby seeds and run several hundred iterations after the seeding. So, the unfortunately boring but true answer is that it really doesn't matter very much.
Generally speaking, using random physical processes have to be checked that they conform to a uniform distribution and are otherwise detrended.
In my opinion, it's often better to use a very well understood pseudo-random number generator.
I've used an encryption program that used the users mouse movement to generate random numbers. The only problem was that the program had to pause and ask the user to move the mouse around randomly for a few seconds to work properly which might not always be practical.
I found HotBits several years ago - the numbers are generated from radioactive decay, genuinely random numbers.
There are limits on how many numbers you can download a day, but it has always amused me to use these as really, really random seeds for RNG.
Some TPM (Trusted Platform Module) "chips" have a hardware RNG. Unfortunately, the (Broadcom) TPM in my Dell laptop lacks this feature, but many computers sold today come with a hardware RNG that uses truly unpredictable quantum mechanical processes. Intel has implemented the thermal noise variety.
Also, don't use the current time alone to seed an RNG for cryptographic purposes, or any application where unpredictability is important. Using a few low order bits from the time in conjunction with several other sources is probably okay.
A similar question may be useful to you.
Sorry I'm late to this discussion (what is it 3 1/2 years old now?), but I've a rekindled interest in PRN generation and alternate sources of entropy. Linux kernel developer Rusty Russell recently had a discussion on his blog on alternate sources of entropy (other than /dev/urandom).
But, I'm not all that impressed with his choices; a NIC's MAC address never changes (although it is unique from all others), and PID seems like too small a possible sample size.
I've dabbled with a Mersenne Twister (on my Linux box) which is seeded with the following algorithm. I'm asking for any comments/feedback if anyone's willing and interested:
Create an array buffer of 64 bits + 256 bits * number of /proc files below.
Place the time stamp counter (TSC) value in the first 64 bits of this buffer.
For each of the following /proc files, calculate the SHA256 sum:
/proc/meminfo
/proc/self/maps
/proc/self/smaps
/proc/interrupts
/proc/diskstats
/proc/self/stat
Place each 256-bit hash value into its own area of the array created in (1).
Create a SHA256 hash of this entire buffer. NOTE: I could (and probably should) use a different hash function completely independent of the SHA functions - this technique has been proposed as a "safeguard" against weak hash functions.
Now I have 256 bits of HOPEFULLY random (enough) entropy data to seed my Mersenne Twister. I use the above to populate the beginning of the MT Array (624 32-bit integers), and then initialize the remainder of that array with the MT author's code. Also, I could use a different hash function (e.g. SHA384, SHA512), but I'd need a different size array buffer (obviously).
The original Mersenne Twister code called for one single 32-bit seed, but I feel that's horribly inadequate. Running "merely" 2^32-1 different MTs in search of breaking the crypto is not beyond the realm of practical possibility in this day and age.
I'd love to read anyone's feedback on this. Criticism is more than welcome. I will defend my use of the /proc files as above because they're constantly changing (especially the /proc/self/* files, and the TSC always yields a different value (nanosecond [or better] resolution, IIRC). I've run Diehard tests on this (to the tune of several hundred billion bits), and it seems to be passing with flying colors. But that's probably more testament to the soundness of the Mersenne Twister as a PRNG than to how I'm seeding it.
Of course, these aren't totally impervious to someone hacking them, but I just don't see all of these (and SHA*) being hacked and broken to in my lifetime.
Some use keyboard input (timeouts between keystrokes), I heard of I think in a novel that radio static reception can be used - but of course that requires other hardware and software...
Noise on top of the Cosmic Microwave Background spectrum. Of course you must first remove some anisotropy, foreground objects, correlated detector noise, galaxy and local group velocities, polarizations etc. Many pitfalls remain.
Source of seed isn't that much important. More important is the pseudo numbers generator algorithm. However I've heard some time ago about generating seed for some bank operations. They took many factors together:
time
processor temperature
fan speed
cpu voltage
I don't remember more :)
Even if some of these parameters doesn't change much in time, you can put them into some good hashing function.
How to generate good random number?
Maybe we can take into account inifinite number of universes? If this is true, that all the time new parallel universes are being created, we can do something like this:
int Random() {
return Universe.object_id % MAX_INT;
}
In every moment we should be on another branch of parallel universes, so we should have different id. The only problem is how to get Universe object :)
How about spinning off a thread that will manipulate some variable in a tight loop for a fixed amount of time before it is killed. What you end up with will depend on the processor speed, system load, etc... Very hokey, but better than just srand(time(NULL))...
Don't worry about a "good" seed for a random number generator. The statistical properties of the sequence do not depend on how the generator is seeded.
I disagree with John D. Cook's advice. If you seed the Mersenne Twister with all bits set to zero except one, it will initially generate numbers which are anything but random. It takes a long time for the generator to churn this state into anything that would pass statistical tests. Simply setting the first 32 bits of the generator to a seed will have a similar effect. Also, if the entire state is set to zero the generator will produce endless zeroes.
Properly written RNG code will have a properly written seeding algorithm that accepts say a 64 bit value and seeds the generator so it will produce decent random numbers for each possible input. So if you are using a reliable library then any seed will do. But if you hack together your own implementation then you need to be careful.

Resources