Generating totally random numbers without random function? [duplicate] - c

Possible Duplicate:
True random number generator
I was talking to a friend the other day and we were trying to figure out if it is possible to generate completely random numbers without the help of a random function. In C, for example, rand() generates pseudo-random numbers, or we can use something like srand( time( NULL ) ); so that the computer reads its clock as a seed value. From everything I have read so far, I am pretty sure that no random function actually produces truly random numbers. How would one write a program that generates completely random numbers, and what would the code look like?

Check out this question:
True random number generator
Also, from Wikipedia's entry on pseudorandom numbers:
As John von Neumann joked, "Anyone who considers arithmetical methods of producing random digits is, of course, in a state of sin."

The excellent random.org website provides hardware-based random numbers, along with a number of software interfaces for retrieving them.
This can be used e.g. for genuinely unpredictable seeds or for 'true' random numbers. Being a web service, there are limits on the number of draws you can make, so don't try to use this for your graduate-school Monte Carlo simulation.
FWIW, I wrapped one of those interfaces in the R package random.
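If you want to script against it from C, random.org also exposes a plain-text HTTP interface. A minimal sketch using libcurl (the query parameters follow random.org's documented integers API, but check their current docs and quota rules before relying on this; compile with -lcurl):
#include <stdio.h>
#include <curl/curl.h>

/* libcurl write callback: print the response body (one integer per line). */
static size_t print_body(char *ptr, size_t size, size_t nmemb, void *userdata)
{
    (void)userdata;
    fwrite(ptr, size, nmemb, stdout);
    return size * nmemb;
}

int main(void)
{
    curl_global_init(CURL_GLOBAL_DEFAULT);
    CURL *curl = curl_easy_init();
    if (!curl) return 1;
    /* Ask for 5 integers in [1,100] as plain text, one per line. */
    curl_easy_setopt(curl, CURLOPT_URL,
        "https://www.random.org/integers/?num=5&min=1&max=100"
        "&col=1&base=10&format=plain&rnd=new");
    curl_easy_setopt(curl, CURLOPT_WRITEFUNCTION, print_body);
    CURLcode rc = curl_easy_perform(curl);
    curl_easy_cleanup(curl);
    curl_global_cleanup();
    return rc == CURLE_OK ? 0 : 1;
}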

It would look like:
int random = CallHardwareRandomGenerator();
Even with hardware, randomness is tricky. There are things which are physically random (atomic decay is random, but with predictable average rates, so it can be used as a source of random information), and there are things that are physically random enough to make prediction impractical (this is how casinos make money).
There are things that are largely indeterminate (a mix of information from keystroke timing, mouse movements, and the like), which are a good-enough source of "randomness" for many uses.
Mathematically, we cannot produce randomness, but we can improve distribution and make output harder to predict. Cryptographic PRNGs do a stronger job at this than most, but are more expensive in terms of resources.
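On Unix-like systems you don't have to build that mixing yourself: the kernel already gathers interrupt and device timings into an entropy pool and exposes a cryptographic PRNG over it. A minimal sketch reading /dev/urandom:
#include <stdio.h>

int main(void)
{
    unsigned int value;
    FILE *f = fopen("/dev/urandom", "rb");    /* kernel-mixed entropy pool */
    if (!f) return 1;
    if (fread(&value, sizeof value, 1, f) != 1) { fclose(f); return 1; }
    fclose(f);
    printf("%u\n", value);
    return 0;
}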

This is more of a physics question, I think. If you think about it, nothing is random; it just occurs due to events whose complexity makes them unpredictable to us. A computer is a subsystem just like any other in the universe, and by giving it unpredictable external inputs (RTC, I/O garbage) we can get the same kind of randomness that a roulette wheel gets from varying friction, air resistance, initial impulse and millions of factors that I can't wrap my head around.

There's room for a fair amount of philosophical debate about what "truly random" even means. From a practical viewpoint, though, even sources we know aren't truly random can be used in ways that produce results close enough for almost any practical purpose (in particular: at least with current technology, full knowledge of the previously produced bitstream appears to be insufficient to predict the next bit accurately). Most of those do involve a bit of extra hardware -- for example, it's pretty easy to put a source together from a little bit of Americium out of a smoke detector.
There are quite a few more sources as well, though they're mostly pretty low bandwidth (e.g., collect one bit for each keystroke, based on whether the interval between keystrokes was an even or odd number of CPU clocks -- assuming the CPU clock and keyboard clock are derived from separate crystals). OTOH, you have to be really careful with this -- a fair number of security holes (e.g., in Netscape around v. 4.0 or so) have stemmed from people believing that such sources were a lot more random than they really were.
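A crude sketch of that keystroke-interval idea on a POSIX terminal. Note this samples the parity of a nanosecond timestamp rather than truly comparing two independent crystals, and the raw bits would need debiasing (e.g. von Neumann's trick) before any serious use:
#include <stdio.h>
#include <time.h>
#include <termios.h>
#include <unistd.h>

/* Collect one byte of "randomness", one bit per keystroke, from the
   parity of the nanosecond timestamp at which each key arrives. */
int main(void)
{
    struct termios old, raw;
    tcgetattr(STDIN_FILENO, &old);
    raw = old;
    raw.c_lflag &= ~(ICANON | ECHO);          /* unbuffered, no echo */
    tcsetattr(STDIN_FILENO, TCSANOW, &raw);
    unsigned char byte = 0;
    printf("Type 8 keys: ");
    fflush(stdout);
    for (int i = 0; i < 8; i++) {
        struct timespec ts;
        getchar();
        clock_gettime(CLOCK_MONOTONIC, &ts);
        byte = (unsigned char)((byte << 1) | (ts.tv_nsec & 1));
    }
    tcsetattr(STDIN_FILENO, TCSANOW, &old);
    printf("\ncollected byte: 0x%02x\n", byte);
    return 0;
}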
While there are a number of web sites that produce random numbers from hardware sources, most of them are useless from a viewpoint of encryption. Even at best, you're just trusting your SSL (or TLS) connection to be secure so nobody captured the data you got from the site.

Related

Energy consumption of an algorithm in c code

I need to calculate the energy consumption of an algorithm in c code. Any ideas how this can be done and if there are pre-defined functions for that?
Thanks in advance,
It is nigh impossible to do it from C code alone. The multiple ways a C-compiler is allowed to translate a chunk of C-code do not allow for a precise calculation of the energy consumption. You need to know how the compiler translates which code for which architecture.
It is much simpler (for certain degrees of "simple") to count the assembler instructions and multiply them by their respective latencies (listed in the technical specs, e.g. for a randomly chosen MCU from the Wiki list: http://www.atmel.com/Images/doc32002.pdf). This is still not exact; division, for example, may take a different number of CPU cycles depending on the input, CPU architecture and hardware implementation. But it comes quite close and is rather simple, although a bit tedious.
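As a back-of-envelope illustration of that method: once you have a total cycle count, energy follows from E = V * I * t. All figures below are hypothetical placeholders; take real ones from the datasheet and from counting the compiled instructions:
#include <stdio.h>

int main(void)
{
    /* Hypothetical placeholder figures - substitute datasheet values. */
    double cycles   = 1200.0;     /* summed instruction latencies */
    double f_clk    = 16e6;       /* 16 MHz core clock            */
    double i_active = 10e-3;      /* 10 mA active current         */
    double v_supply = 3.3;        /* 3.3 V supply                 */
    double t = cycles / f_clk;                /* runtime in seconds */
    double e = v_supply * i_active * t;       /* E = V * I * t      */
    printf("runtime %.2f us, energy %.3f uJ\n", t * 1e6, e * 1e6);
    return 0;
}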
And then there are loops with a completely unknown number of iterations, inputs taking different paths with different runtimes, and much more. The people writing encryption software know more about this, especially how to avoid it; you might not like their solutions.
Otherwise: check what you expect for input (you do know which path each input takes, don't you?), write a test program, go to e.g. https://www.rohde-schwarz.com/ to get a good meter, and measure the power consumption. You also need an engineer who knows how to do that; it's not easy!

Randomness in Artificial Intelligence & Machine Learning

This question came to my mind while working on two projects in AI and ML. What if I'm building a model (e.g. a classification neural network, k-NN, etc.) and this model uses some function that involves randomness? If I don't fix the seed, then I'm going to get different accuracy results every time I run the algorithm on the same training data. However, if I fix it, then some other setting might give better results.
Is averaging a set of accuracies enough to say that the accuracy of this model is xx % ?
I'm not sure If this is the right place to ask such a question/open such a discussion.
Simple answer, yes, you randomize it and use statistics to show the accuracy. However, it's not sufficient to just average a handful of runs. You need, at a minimum, some notion of the variability as well. It's important to know whether "70%" accurate means "70% accurate for each of 100 runs" or "100% accurate once and 40% accurate once".
If you're just trying to play around a bit and convince yourself that some algorithm works, then you can just run it 30 or so times and look at the mean and standard deviation and call it a day. If you're going to convince anyone else that it works, you need to look into how to do more formal hypothesis testing.
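A minimal sketch of that "mean plus variability" reporting (the accuracy values here are made up):
#include <stdio.h>
#include <math.h>

/* Mean and sample standard deviation of per-run accuracies. */
static void summarize(const double *acc, int n, double *mean, double *sd)
{
    double sum = 0.0, sq = 0.0;
    for (int i = 0; i < n; i++) sum += acc[i];
    *mean = sum / n;
    for (int i = 0; i < n; i++) sq += (acc[i] - *mean) * (acc[i] - *mean);
    *sd = sqrt(sq / (n - 1));                 /* Bessel's correction */
}

int main(void)
{
    double acc[] = { 0.71, 0.69, 0.74, 0.68, 0.72 };   /* made-up runs */
    double mean, sd;
    summarize(acc, 5, &mean, &sd);
    printf("accuracy %.1f%% +/- %.1f%% over 5 runs\n", mean * 100, sd * 100);
    return 0;
}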
There are models which are naturally dependent on randomness (e.g., random forests) and models which only use randomness as part of exploring the space (e.g., initialisation of values for neural networks), but actually have a well-defined, deterministic, objective function.
For the first case, you will want to use multiple seeds and report average accuracy, std. deviation, and the minimum you obtained. It is often good if you have a way to reproduce this, so just use multiple fixed seeds.
For the second case, you can always tell, just on the training data, which run is best (although it might actually not be the one which gives you the best test accuracy!). Thus, if you have the time, it is good to do, say, 10 runs and then evaluate the one with the best training error (or validation error; just never evaluate on the test set for this decision). You can go a level up and repeat this whole procedure several times to get a standard deviation too. However, if you find that this variation is significant, it probably means you weren't trying enough initialisations or that you are not using the right model for your data.
Stochastic techniques are typically used to search very large solution spaces where exhaustive search is not feasible. So it's almost inevitable that you will be trying to iterate over a large number of sample points with as even a distribution as possible. As mentioned elsewhere, basic statistical techniques will help you determine when your sample is big enough to be representative of the space as a whole.
To test accuracy, it is a good idea to set aside a portion of your input patterns and avoid training against those patterns (assuming you are learning from a data set). Then you can use the set to test whether your algorithm is learning the underlying pattern correctly, or whether it's simply memorizing the examples.
Another thing to think about is the randomness of your random number generator. Standard random number generators (such as rand from <stdlib.h>) may not make the grade in many cases so look around for a more robust algorithm.
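For instance, Marsaglia's xorshift generators are a well-known step up from a typical rand() in statistical quality while staying tiny. A sketch of xorshift64 (still not cryptographically secure):
#include <stdio.h>
#include <stdint.h>

/* Marsaglia's xorshift64: fast, decent statistical quality,
   but NOT cryptographically secure. State must be nonzero. */
static uint64_t xorshift64(uint64_t *state)
{
    uint64_t x = *state;
    x ^= x << 13;
    x ^= x >> 7;
    x ^= x << 17;
    return *state = x;
}

int main(void)
{
    uint64_t state = 88172645463325252ULL;    /* any nonzero seed */
    for (int i = 0; i < 5; i++)
        printf("%llu\n", (unsigned long long)xorshift64(&state));
    return 0;
}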
I'll generalize the answer from what I get of your question.
I suppose accuracy is always the average accuracy of multiple runs plus the standard deviation. So if you are considering the accuracy you get using different seeds for the random generator, you are actually considering a greater range of input (which should be a good thing), but you have to report the standard deviation along with the accuracy. Or did I get your question totally wrong?
I believe cross-validation may give you what you ask about: an averaged, and therefore more reliable, estimate of classification performance. It contains no randomness, except in permuting the data set initially. The variation comes from choosing different train/test splits.
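A sketch of the mechanics: k-fold cross-validation is just one shuffle of the indices followed by slicing, so the only random step really is the initial permutation (sample count, fold count and seed below are arbitrary placeholders):
#include <stdio.h>
#include <stdlib.h>

#define N 10   /* number of samples */
#define K 5    /* number of folds   */

int main(void)
{
    int idx[N];
    for (int i = 0; i < N; i++) idx[i] = i;
    /* Fisher-Yates shuffle: the only random step in cross-validation. */
    srand(42);
    for (int i = N - 1; i > 0; i--) {
        int j = rand() % (i + 1);
        int t = idx[i]; idx[i] = idx[j]; idx[j] = t;
    }
    /* Each fold takes a contiguous slice of the shuffled indices as
       its test set; the remaining indices form the training set. */
    for (int f = 0; f < K; f++) {
        printf("fold %d test set:", f);
        for (int i = f * (N / K); i < (f + 1) * (N / K); i++)
            printf(" %d", idx[i]);
        printf("\n");
    }
    return 0;
}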

What is a pseudo-random integer?

I am reading a C book. In the description of the rand() function, they say:
rand returns a pseudo-random integer in the range 0 to RAND_MAX.  RAND_MAX is implementation dependent but at least 32767.
I don't understand; what is a "pseudo-random integer"?
Thanks.
Informally, a pseudorandom number is a number that isn't truly random, but is "random enough" for most purposes.
Computers are inherently deterministic devices. The processor executes specific commands in a specific order, and programs control how the processor does so. Consequently, it's hard for programs to generate random numbers because no deterministic process can create a random number. Thus what many programs do is use a pseudorandom number generator, which is a function that produces numbers according to some deterministic formula that appear to be random but actually are not. Most programming languages provide some sort of pseudorandom number generator for general programming use, and when true randomness isn't needed they work just fine.
However, they have their limitations. In cryptographic settings, in many cases true randomness is required in order to prevent attackers from guessing the workings of a system and compromising it, for example. In this case, it is possible to get truly random numbers by using specialized hardware that can amplify background noise or use quantum effects. This sort of randomness is extremely hard to generate, though, and so it's not commonly used unless absolute unpredictability is required.
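To make the "deterministic formula" idea concrete, here is a minimal linear congruential generator, one of the oldest PRNG constructions (the multiplier/increment pair is the well-known Numerical Recipes one; real rand() implementations vary):
#include <stdio.h>
#include <stdint.h>

/* A linear congruential generator: x' = a*x + c (mod 2^32).
   The constants are the well-known Numerical Recipes pair. */
static uint32_t lcg_state = 1;                /* the "seed" */

static uint32_t lcg_next(void)
{
    lcg_state = lcg_state * 1664525u + 1013904223u;
    return lcg_state;
}

int main(void)
{
    for (int i = 0; i < 5; i++)
        printf("%u\n", lcg_next());           /* same seed, same output */
    return 0;
}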
From wiki entry : Pseudorandom number generator
A pseudorandom number generator (PRNG), also known as a deterministic random bit generator (DRBG), is an algorithm for generating a sequence of numbers that approximates the properties of random numbers.
In other words, it's approximately random (to a known extent), but not truly random in the sense of random noise from a physical source.
It means that the number may seem to be random, but the truth is that a computer can't generate a truly random number; it works out a number based on logic that merely tends to look random.
For example, observing the time difference between keystrokes and then using it as one input in calculating a number. Such things, combined with several other inputs, tend to give random-looking numbers, but in reality it's just an algorithm. If the same conditions are met, it will give the same number.
To add some pedantry to the other correct answers that you've already received, there really is no such thing as a "pseudorandom integer" (or for that matter, a random integer). This is the point that John von Neumann was making in his famous quote (which is usually misleadingly abbreviated to just the first sentence):
"Any one who considers arithmetical
methods of producing random digits is,
of course, in a state of sin. For, as
has been pointed out several times,
there is no such thing as a random
number — there are only methods to
produce random numbers, and a strict
arithmetic procedure of course is not
such a method"
Consider the number 7. Is it a random integer? If it is, is it a pseudorandom integer or a true random integer? What about the number -10? Clearly these questions don't make sense (obligatory link to xkcd 221).
So, as others have pointed out, what the book actually means is that the number is generated by a pseudorandom (i.e. deterministic) number generator as opposed to being obtained from a truly random sequence. There are several different pseudorandom number generator algorithms and the best ones generate output that is statistically indistinguishable from a true random sequence.
It means that it generates a number sequence that exhibits the properties of a random number in terms of distribution, but which is mathematically generated and deterministic in its output for any particular seed value.
You can see this by creating a program that uses such a function to generate a short sequence, and observing that each time you run it, it generates the same sequence. This often surprises the unsuspecting, but is also quite a useful property when testing code. When a less predictable sequence is required, one normally initialises the PRNG with a seed value that is itself unpredictable, often derived from the current system time, since it is not normally predictable when a program will be started.
Pseudo random number generators are not normally suitable for security applications such as data encryption, or on-line gambling applications, where their ultimate predictability could be a serious weakness.
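The repeatability described above is easy to demonstrate; run this twice and the output is identical both times:
#include <stdio.h>
#include <stdlib.h>

int main(void)
{
    srand(12345);                  /* fixed seed: run the program twice ... */
    for (int i = 0; i < 5; i++)
        printf("%d ", rand());     /* ... and the sequence is identical     */
    printf("\n");
    return 0;
}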
In most computers, a random number is considered pseudorandom because it is not completely random.
pseudo = fake
Basically, the current time is often used to seed the generator, so the initial number can be predicted; for most applications, this is fine.

Finding prime factors to large numbers using specially-crafted CPUs

My understanding is that many public key cryptographic algorithms these days depend on large prime numbers to make up the keys, and it is the difficulty in factoring the product of two primes that makes the encryption hard to break. It is also my understanding that one of the reasons that factoring such large numbers is so difficult, is that the sheer size of the numbers used means that no CPU can efficiently operate on the numbers, since our minuscule 32 and 64 bit CPUs are no match for 1024, 2048 or even 4096 bit numbers. Specialized Big Integer math libraries must be used in order to process those numbers, and those libraries are inherently slow since a CPU can only hold (and process) small chunks (like 32 or 64 bits) at one time.
So...
Why can't you build a highly specialized custom chip with 2048 bit registers, and giant arithmetic circuits, much in the same way that we scaled from 8 to 16 to 32 to 64-bit CPUs, just build one a LOT larger? This chip wouldn't need most of the circuitry on conventional CPUs, after all it wouldn't need to handle things like virtual memory, multithreading or I/O. It wouldn't even need to be a general-purpose processor supporting stored instructions. Just the bare minimum to perform the necessary arithmetical calculations on ginormous numbers.
I don't know a whole lot about IC design, but I do remember learning about how logic gates work, how to build a half adder, full adder, then link together a bunch of adders to do multi-bit arithmetic. Just scale up. A lot.
Now, I'm fairly certain that there is a very good reason (or 17) that the above won't work (since otherwise one of the many people smarter than I am would have already done it) but I am interested in knowing why it won't work.
(Note: This question may need some re-working, as I'm not even sure yet if the question makes sense)
What #cube said, plus the fact that a giant arithmetic logic unit would take more time for the logic signals to stabilize, and introduces other complications in digital design. Digital logic design includes something you take for granted in software, namely that signals through combinational logic take a small but nonzero time to propagate and settle. A 32x32 multiplier needs to be designed carefully. A 1024x1024 multiplier would not only take a huge amount of physical resources in a chip, but would also be slower than a 32x32 multiplier (though perhaps faster than a 32x32 multiplier computing all the partial products needed to perform a 1024x1024 multiply). And it's not only the multiplier that's the bottleneck: you've got memory pathways. You'd have to spend a bunch of time gathering the 1024 bits from a memory circuit that's only 32 bits wide, and storing the resulting 2048 bits back into it.
Almost certainly it's better to get a bunch of "conventional" 32-bit or 64-bit systems working in parallel: you get the speedup w/o the hardware design complexity.
edit: if anyone has ACM access (I don't), perhaps take a look at this paper to see what it says.
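To see why big-number arithmetic costs so many instructions on a narrow CPU, consider that even a single 2048-bit addition is a carry-propagating loop over word-sized limbs. A sketch (real bignum libraries are considerably more optimized):
#include <stdio.h>
#include <stdint.h>

#define LIMBS 64   /* 64 x 32-bit limbs = 2048 bits */

/* r = a + b over 2048-bit little-endian limb arrays; returns final carry.
   One hardware add per limb - a 2048-bit ALU would do this in one step. */
static uint32_t big_add(uint32_t *r, const uint32_t *a, const uint32_t *b)
{
    uint64_t carry = 0;
    for (int i = 0; i < LIMBS; i++) {
        uint64_t s = (uint64_t)a[i] + b[i] + carry;
        r[i] = (uint32_t)s;
        carry = s >> 32;
    }
    return (uint32_t)carry;
}

int main(void)
{
    uint32_t a[LIMBS] = { 0xFFFFFFFFu }, b[LIMBS] = { 1 }, r[LIMBS];
    uint32_t c = big_add(r, a, b);
    printf("low limbs: %08x %08x, carry out %u\n", r[0], r[1], c);
    return 0;
}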
It's because this speedup would only be linear in the number of bits, while the complexity of factoring grows something like O(2^n) with the number of bits. So if you made this überprocessor and factorized numbers 1000 times faster, I would only have to make the numbers 10 bits larger and we would be back where we started.
As indicated above, the primary problem is simply how many possibilities you have to go through to factor a number. That being said, specialized computers do exist to do this sort of thing.
The real progress for this sort of cryptography is improvements in number factoring algorithms. Currently, the fastest known general algorithm is the general number field sieve.
Historically, we seem to be able to factor numbers twice as large each decade. Part of that is faster hardware, and part of it is simply a better understanding of mathematics and how to perform factoring.
I can't comment on the feasibility of an approach exactly like the one you described, but people do similar things very frequently using FPGAs:
Crack DES keys
Crack GSM conversations
Open source graphics card
Shamir & Tromer suggest a similar approach, using a kind of grid computing:
This article discusses a new design for a custom hardware implementation of the sieving step, which reduces [the cost of sieving, relative to TWINKLE,] to about $10M. The new device, called TWIRL, can be seen as an extension of the TWINKLE device. However, unlike TWINKLE it does not have optoelectronic components, and can thus be manufactured using standard VLSI technology on silicon wafers. The underlying idea is to use a single copy of the input to solve many subproblems in parallel. Since input storage dominates cost, if the parallelization overhead is kept low then the resulting speedup is obtained essentially for free. Indeed, the main challenge lies in achieving this parallelism efficiently while allowing compact storage of the input. Addressing this involves myriad considerations, ranging from number theory to VLSI technology.
Why don't you try building an uber-quantum computer and run Shor's algorithm on it?
"... If a quantum computer with a sufficient number of qubits were to be constructed, Shor's algorithm could be used to break public-key cryptography schemes such as the widely used RSA scheme. RSA is based on the assumption that factoring large numbers is computationally infeasible. So far as is known, this assumption is valid for classical (non-quantum) computers; no classical algorithm is known that can factor in polynomial time. However, Shor's algorithm shows that factoring is efficient on a quantum computer, so a sufficiently large quantum computer can break RSA. ..." -Wikipedia

Alternative Entropy Sources

Okay, I guess this is entirely subjective and whatnot, but I was thinking about entropy sources for random number generators. Most generators are seeded with the current time, correct? Well, I was curious as to what other sources could be used to generate perfectly valid random (in the loose sense) numbers.
Would using multiple sources together (such as time + current HDD seek time; we're being fantastical here) create a "more random" number than a single source? What are the logical limits on the number of sources? How many is really enough? Is time chosen simply because it is convenient?
Excuse me if this sort of thing is not allowed, but I'm curious as to the theory behind the sources.
The Wikipedia article on hardware random number generators lists a couple of interesting sources of random numbers based on physical properties.
My favorites:
A nuclear decay radiation source detected by a Geiger counter attached to a PC.
Photons travelling through a semi-transparent mirror. The mutually exclusive events (reflection — transmission) are detected and associated to "0" or "1" bit values respectively.
Thermal noise from a resistor, amplified to provide a random voltage source.
Avalanche noise generated from an avalanche diode. (How cool is that?)
Atmospheric noise, detected by a radio receiver attached to a PC.
The problems section of the Wikipedia article also describes the fragility of a lot of these sources/sensors. Sensors almost always produce decreasingly random numbers as they age/degrade. These physical sources should be constantly checked by statistical tests which can analyze the generated data, ensuring the instruments haven't broken silently.
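One of the classic checks is the FIPS 140-1 monobit test: take 20,000 bits from the source and fail it unless the number of ones falls strictly between 9654 and 10346. A sketch, using /dev/urandom as a stand-in for the hardware source:
#include <stdio.h>
#include <stdint.h>

/* FIPS 140-1 style monobit test over a 20,000-bit (2,500-byte) sample. */
static int monobit_ok(const uint8_t *buf, int nbytes)
{
    int ones = 0;
    for (int i = 0; i < nbytes; i++)
        for (int b = 0; b < 8; b++)
            ones += (buf[i] >> b) & 1;
    return ones > 9654 && ones < 10346;
}

int main(void)
{
    uint8_t sample[2500];
    FILE *f = fopen("/dev/urandom", "rb");    /* stand-in for real HW */
    if (!f || fread(sample, 1, sizeof sample, f) != sizeof sample) return 1;
    fclose(f);
    printf("monobit test: %s\n", monobit_ok(sample, 2500) ? "pass" : "FAIL");
    return 0;
}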
SGI once used photos of a lava lamp at various "glob phases" as the source for entropy, which eventually evolved into an open source random number generator called LavaRnd.
I use Random.ORG; they provide free random data from atmospheric noise, which I use to periodically re-seed a Mersenne Twister RNG. It's about as random as you can get with no hardware dependencies.
Don't worry about a "good" seed for a random number generator. The statistical properties of the sequence do not depend on how the generator is seeded. There are other things, however, to worry about. See Pitfalls in Random Number Generation.
As for hardware random number generators, these physical sources have to be measured, and the measurement process has systematic errors. You might find "pseudo" random numbers to have higher quality than "real" random numbers.
Linux kernel uses device interrupt timing (mouse, keyboard, hard drives) to generate entropy. There's a nice article on Wikipedia on entropy.
Modern RNGs are both checked against correlations in nearby seeds and run several hundred iterations after the seeding. So, the unfortunately boring but true answer is that it really doesn't matter very much.
Generally speaking, numbers drawn from physical processes have to be checked for conformance to a uniform distribution and otherwise detrended.
In my opinion, it's often better to use a very well understood pseudo-random number generator.
I've used an encryption program that used the user's mouse movements to generate random numbers. The only problem was that the program had to pause and ask the user to move the mouse around randomly for a few seconds to work properly, which might not always be practical.
I found HotBits several years ago - the numbers are generated from radioactive decay, genuinely random numbers.
There are limits on how many numbers you can download a day, but it has always amused me to use these as really, really random seeds for RNG.
Some TPM (Trusted Platform Module) "chips" have a hardware RNG. Unfortunately, the (Broadcom) TPM in my Dell laptop lacks this feature, but many computers sold today come with a hardware RNG that uses truly unpredictable quantum mechanical processes. Intel has implemented the thermal noise variety.
Also, don't use the current time alone to seed an RNG for cryptographic purposes, or any application where unpredictability is important. Using a few low order bits from the time in conjunction with several other sources is probably okay.
A similar question may be useful to you.
Sorry I'm late to this discussion (what is it 3 1/2 years old now?), but I've a rekindled interest in PRN generation and alternate sources of entropy. Linux kernel developer Rusty Russell recently had a discussion on his blog on alternate sources of entropy (other than /dev/urandom).
But, I'm not all that impressed with his choices; a NIC's MAC address never changes (although it is unique from all others), and PID seems like too small a possible sample size.
I've dabbled with a Mersenne Twister (on my Linux box) which is seeded with the following algorithm. I'm asking for any comments/feedback if anyone's willing and interested:
1. Create an array buffer of 64 bits + 256 bits * the number of /proc files below.
2. Place the time stamp counter (TSC) value in the first 64 bits of this buffer.
3. For each of the following /proc files, calculate the SHA256 sum:
   /proc/meminfo
   /proc/self/maps
   /proc/self/smaps
   /proc/interrupts
   /proc/diskstats
   /proc/self/stat
4. Place each 256-bit hash value into its own area of the array created in (1).
5. Create a SHA256 hash of this entire buffer. NOTE: I could (and probably should) use a different hash function completely independent of the SHA functions - this technique has been proposed as a "safeguard" against weak hash functions.
Now I have 256 bits of HOPEFULLY random (enough) entropy data to seed my Mersenne Twister. I use the above to populate the beginning of the MT Array (624 32-bit integers), and then initialize the remainder of that array with the MT author's code. Also, I could use a different hash function (e.g. SHA384, SHA512), but I'd need a different size array buffer (obviously).
The original Mersenne Twister code called for one single 32-bit seed, but I feel that's horribly inadequate. Running "merely" 2^32-1 different MTs in search of breaking the crypto is not beyond the realm of practical possibility in this day and age.
I'd love to read anyone's feedback on this. Criticism is more than welcome. I will defend my use of the /proc files as above because they're constantly changing (especially the /proc/self/* files), and the TSC always yields a different value (nanosecond [or better] resolution, IIRC). I've run Diehard tests on this (to the tune of several hundred billion bits), and it seems to be passing with flying colors. But that's probably more a testament to the soundness of the Mersenne Twister as a PRNG than to how I'm seeding it.
Of course, these aren't totally impervious to someone hacking them, but I just don't see all of these (and SHA*) being broken in my lifetime.
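For the curious, here is a condensed sketch of steps 1 through 5 using OpenSSL's legacy one-shot SHA256() (link with -lcrypto). clock_gettime() stands in for the raw TSC read, and the final eight 32-bit words would be handed to the reference Mersenne Twister's init_by_array():
#include <stdio.h>
#include <stdint.h>
#include <string.h>
#include <time.h>
#include <openssl/sha.h>

#define NFILES 6

/* Hash the first chunk of one file's current contents into out[32]. */
static int hash_file(const char *path, unsigned char *out)
{
    static unsigned char buf[65536];          /* 64 KiB is plenty here */
    FILE *f = fopen(path, "rb");
    if (!f) return -1;
    size_t n = fread(buf, 1, sizeof buf, f);
    fclose(f);
    SHA256(buf, n, out);
    return 0;
}

int main(void)
{
    const char *files[NFILES] = {
        "/proc/meminfo", "/proc/self/maps", "/proc/self/smaps",
        "/proc/interrupts", "/proc/diskstats", "/proc/self/stat"
    };
    unsigned char pool[8 + NFILES * SHA256_DIGEST_LENGTH];
    unsigned char seed[SHA256_DIGEST_LENGTH];
    /* Step 2: a high-resolution timestamp in the first 64 bits. */
    struct timespec ts;
    clock_gettime(CLOCK_MONOTONIC, &ts);
    uint64_t t = (uint64_t)ts.tv_sec * 1000000000ull + (uint64_t)ts.tv_nsec;
    memcpy(pool, &t, sizeof t);
    /* Steps 3-4: one digest per /proc file, each in its own slot. */
    for (int i = 0; i < NFILES; i++)
        if (hash_file(files[i], pool + 8 + i * SHA256_DIGEST_LENGTH) != 0)
            return 1;
    /* Step 5: hash the whole pool down to 256 seed bits. */
    SHA256(pool, sizeof pool, seed);
    /* These eight 32-bit words would go to MT's init_by_array(). */
    uint32_t words[8];
    memcpy(words, seed, sizeof words);
    for (int i = 0; i < 8; i++)
        printf("%08x%c", words[i], i == 7 ? '\n' : ' ');
    return 0;
}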
Some use keyboard input (the timing between keystrokes). I've also heard, I think in a novel, that radio static reception can be used, but of course that requires other hardware and software...
Noise on top of the Cosmic Microwave Background spectrum. Of course you must first remove some anisotropy, foreground objects, correlated detector noise, galaxy and local group velocities, polarizations etc. Many pitfalls remain.
The source of the seed isn't that important; more important is the pseudorandom number generator algorithm. However, I heard some time ago about seed generation for some banking operations, where they took many factors together:
time
processor temperature
fan speed
cpu voltage
I don't remember more :)
Even if some of these parameters don't change much over time, you can feed them into a good hash function.
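A sketch of that folding step: whatever the sources are, mix each reading into a running 64-bit value with a strong mixing function. The constants below are the published splitmix64 finalizer; the temperature/fan/voltage readings are stubbed out since reading them is platform-specific:
#include <stdio.h>
#include <stdint.h>
#include <time.h>
#include <unistd.h>

/* splitmix64-style mixing step (Steele/Lea/Flood constants). */
static uint64_t mix(uint64_t h, uint64_t v)
{
    uint64_t z = h + v + 0x9E3779B97F4A7C15ull;
    z = (z ^ (z >> 30)) * 0xBF58476D1CE4E5B9ull;
    z = (z ^ (z >> 27)) * 0x94D049BB133111EBull;
    return z ^ (z >> 31);
}

int main(void)
{
    struct timespec ts;
    clock_gettime(CLOCK_REALTIME, &ts);
    uint64_t seed = 0;
    seed = mix(seed, (uint64_t)ts.tv_sec);     /* time             */
    seed = mix(seed, (uint64_t)ts.tv_nsec);    /* sub-second bits  */
    seed = mix(seed, (uint64_t)getpid());      /* process id       */
    /* ... temperature, fan speed, voltage readings would go here. */
    printf("seed: %016llx\n", (unsigned long long)seed);
    return 0;
}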
How do you generate a good random number? Maybe we can take into account an infinite number of universes. If it is true that new parallel universes are being created all the time, we could do something like this:
int Random() {
    return Universe.object_id % MAX_INT;
}
At every moment we should be on a different branch of the parallel universes, so we should get a different id. The only problem is how to get the Universe object :)
How about spinning off a thread that will manipulate some variable in a tight loop for a fixed amount of time before it is killed. What you end up with will depend on the processor speed, system load, etc... Very hokey, but better than just srand(time(NULL))...
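A sketch of that trick with POSIX threads (compile with -lpthread). The volatile flag here is not proper synchronization, just enough for a toy; the final count varies run to run with load and frequency scaling:
#include <stdio.h>
#include <stdlib.h>
#include <pthread.h>
#include <unistd.h>

static volatile unsigned long counter = 0;
static volatile int done = 0;

/* Spin as fast as the scheduler lets us; the final count depends on
   CPU speed, system load, frequency scaling, interrupts, ... */
static void *spin(void *arg)
{
    (void)arg;
    while (!done)
        counter++;
    return NULL;
}

int main(void)
{
    pthread_t t;
    pthread_create(&t, NULL, spin, NULL);
    usleep(100 * 1000);                /* let it run ~100 ms */
    done = 1;
    pthread_join(t, NULL);
    srand((unsigned)counter);          /* jitter-derived seed */
    printf("seed %lu -> first draw %d\n", counter, rand());
    return 0;
}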
Don't worry about a "good" seed for a random number generator. The statistical properties of the sequence do not depend on how the generator is seeded.
I disagree with John D. Cook's advice. If you seed the Mersenne Twister with all bits set to zero except one, it will initially generate numbers which are anything but random. It takes a long time for the generator to churn this state into anything that would pass statistical tests. Simply setting the first 32 bits of the generator to a seed will have a similar effect. Also, if the entire state is set to zero the generator will produce endless zeroes.
Properly written RNG code will have a properly written seeding algorithm that accepts, say, a 64-bit value and seeds the generator so it will produce decent random numbers for each possible input. So if you are using a reliable library then any seed will do, but if you hack together your own implementation then you need to be careful.
