Can you tell me some ways to generate non-uniform random numbers?
I am using Java, but the code examples can be in any language.
One way is to create a skewed distribution by adding two uniform random numbers together (i.e. rolling two dice and summing them), which gives a triangular distribution.
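A minimal sketch of that idea in C, using the standard rand() generator and ignoring its slight modulo bias:

    #include <stdio.h>
    #include <stdlib.h>
    #include <time.h>

    /* Sum of two uniform deviates: values near the middle of the range
       are more likely than values at the extremes (triangular shape). */
    int roll_two_dice(void)
    {
        return (rand() % 6 + 1) + (rand() % 6 + 1);   /* 2..12, peaked at 7 */
    }

    int main(void)
    {
        srand((unsigned) time(NULL));
        for (int i = 0; i < 10; i++)
            printf("%d\n", roll_two_dice());
        return 0;
    }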
Try generating uniformly distributed random numbers, then applying the inverse of your non-uniform cumulative distribution function to each of them.
What distribution of deviates do you want?
Here is a technique which always works, but isn't always the most efficient. The cumulative distribution function P(x) gives the fraction of the time that values fall below x. Thus P(x) = 0 at the lowest possible value of x and P(x) = 1 at the highest possible value of x. Every distribution has a unique CDF, which encodes all the properties of the distribution in the way that P(x) rises from 0 to 1. If y is a uniform deviate on the interval [0,1], then x satisfying P(x) = y will be distributed according to your distribution. To make this work computationally, you just need a way of computing the inverse of P(x) for your distribution.
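A minimal C sketch of inverse-CDF sampling, using the exponential distribution as the example because its CDF, P(x) = 1 - exp(-lambda*x), inverts in closed form (compile with -lm):

    #include <math.h>
    #include <stdio.h>
    #include <stdlib.h>
    #include <time.h>

    /* Uniform deviate on [0,1). */
    static double uniform01(void)
    {
        return rand() / (RAND_MAX + 1.0);
    }

    /* Inverse CDF of the exponential distribution:
       y = 1 - exp(-lambda*x)  =>  x = -ln(1 - y) / lambda. */
    static double exponential_deviate(double lambda)
    {
        return -log(1.0 - uniform01()) / lambda;
    }

    int main(void)
    {
        srand((unsigned) time(NULL));
        for (int i = 0; i < 5; i++)
            printf("%f\n", exponential_deviate(2.0));
        return 0;
    }

For distributions whose inverse CDF has no closed form (e.g. the normal distribution), you would invert P(x) numerically, for example by bisection.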
The Meta.Numerics library defines a large number of commonly used distributions (e.g. normal, lognormal, exponential, chi squared, etc.) and has functions for computing the CDF (Distribution.LeftProbability) and the inverse CDF (Distribution.InverseLeftProbability) of each.
For specialized techniques that are fast for particular distributions, e.g. the Box-Muller technique for normally distributed deviates, see the book Numerical Recipes.
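For reference, a minimal C sketch of the basic (non-polar) Box-Muller transform, which turns two uniform deviates into two independent standard normal deviates:

    #include <math.h>
    #include <stdio.h>
    #include <stdlib.h>
    #include <time.h>

    #ifndef M_PI
    #define M_PI 3.14159265358979323846
    #endif

    /* Basic Box-Muller: maps two uniform deviates on (0,1] to two
       independent standard normal (mean 0, variance 1) deviates. */
    static void box_muller(double *z0, double *z1)
    {
        double u1 = (rand() + 1.0) / (RAND_MAX + 1.0);   /* avoid u1 == 0 */
        double u2 = (rand() + 1.0) / (RAND_MAX + 1.0);
        double r  = sqrt(-2.0 * log(u1));

        *z0 = r * cos(2.0 * M_PI * u2);
        *z1 = r * sin(2.0 * M_PI * u2);
    }

    int main(void)
    {
        double a, b;
        srand((unsigned) time(NULL));
        box_muller(&a, &b);
        printf("%f %f\n", a, b);
        return 0;
    }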
If you are using Java then my Uncommons Maths library may be of interest. It includes classes for generating random numbers for Uniform, Gaussian, Poisson, Binomial and Exponential distributions. This article shows how you might use these distributions.
I've seen plenty of questions asking whether quicksort or mergesort is 'better', and when to use each of them, but what I'd like to see is some input on when to use them with regard to the size of the data being sorted.
Let's say I have a number of items, whether they are ints or custom objects, and I want to sort them.
I see mergesort, in a way, as being the optimal case of quicksort (picking the median as the pivot) at every step, but with some overhead. So at a certain size, when the overhead is negligible in comparison to the consistent optimal nature of mergesort, it would make sense to use it in favor of quicksort.
Radix sort has a 'linear' runtime provided that the number of digits in the keys being sorted does not approach the number of separate items being sorted. However, to my knowledge radix sort also carries a relatively large constant factor in its runtime.
If I recall from some testing in the past, it made sense to use mergesort when the number of items being sorted began to number in the millions, and radix in the high millions/billion range.
Am I reasonably accurate in these assessments? Can someone confirm, deny, or correct them to some extent?
(I'm talking about rather 'simple' implementations of each sort. Also, in the case of radix sort, let's say that the largest single key is no larger than twice the number of items being sorted. i.e. sorting 4,000,000 items, the largest possible key is 8,000,000)
edit - I would like some input on the general size ranges in which each of the given sorts is fastest. I provided some in the question, and that may have been a mistake. What I'd like to see in an answer is an opinion on those ranges. I know quicksort tends to be the default since it's usually 'good enough', doesn't have the space complexity of mergesort, and doesn't come with the worry of malicious data purposely constructed with obscenely large keys (radix).
I want to make an algorithm that will enable A/B testing over a variable number of subjects with a variable number of properties per subject.
For example, I have 1000 people with the following properties: they come from two departments, some are managers, some are women, etc. These properties may increase or decrease depending on the situation.
I want to make an algorithm which will split the population in two with the best representation possible of all the properties in both A and B. So I want two groups of 500 people with an equal number from both departments in each, an equal number of managers and an equal number of women. More specifically, I would like to maintain the ratio of each property in both A and B. So if we have 10% managers, I want 10% of sample A and 10% of sample B to be managers.
Any pointers on where to begin? I am pretty sure that such an algorithm exists. I have a gut feeling that this may be unsolvable in some cases as there may be an odd number of managers AND women AND Dept. 1.
Make a list of permutations of all a/b variables.
Dept1,Manager,Male
Dept1,Manager,Female
Dept1,Junior,Male
...
Dept2,Junior,Female
Go through all the people and assign them to their respective permutation. Maybe randomise the order of the people first just to be sure there is no bias in the order they are added to each permutation.
Dept1,Manager,Male-> Person1, Person16, Person143...
Dept1,Manager,Female-> Person7, Person10, Person83...
Have a second process that goes through each permutation and assigns half the people to one test group and half to the other. You will need to account for odd numbers of people in the group, but that should be fairly easy to factor in, obviously a larger sample size will reduce the impact of this odd number on the final results.
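A minimal C sketch of this grouping-and-splitting idea, assuming three binary properties (department, manager, gender); the person fields and the perm_key helper are purely illustrative:

    #include <stdio.h>
    #include <stdlib.h>

    /* Hypothetical subject record; the fields stand in for whatever
       properties the real population has. */
    struct person {
        int id;
        int dept;      /* 0 or 1 */
        int manager;   /* 0 or 1 */
        int female;    /* 0 or 1 */
        int group;     /* 0 = A, 1 = B, filled in by split_ab() */
    };

    /* Collapse the properties into a single "permutation" key. */
    static int perm_key(const struct person *p)
    {
        return p->dept * 4 + p->manager * 2 + p->female;
    }

    static int by_key(const void *a, const void *b)
    {
        return perm_key((const struct person *) a)
             - perm_key((const struct person *) b);
    }

    /* Sort people so everyone with the same property combination is adjacent
       (shuffle beforehand to remove ordering bias), then alternate A/B within
       each run: every combination is split as evenly as possible. */
    static void split_ab(struct person *people, int n)
    {
        int i, toggle = 0;

        qsort(people, n, sizeof people[0], by_key);
        for (i = 0; i < n; i++) {
            if (i == 0 || perm_key(&people[i]) != perm_key(&people[i - 1]))
                toggle = 0;            /* new property combination */
            people[i].group = toggle;
            toggle ^= 1;
        }
    }

    int main(void)
    {
        struct person people[] = {
            {1, 0, 1, 0, 0}, {2, 0, 1, 1, 0}, {3, 0, 0, 0, 0},
            {4, 1, 0, 1, 0}, {5, 1, 0, 1, 0}, {6, 0, 0, 0, 0},
        };
        int n = sizeof people / sizeof people[0];

        split_ab(people, n);
        for (int i = 0; i < n; i++)
            printf("person %d -> group %c\n", people[i].id, "AB"[people[i].group]);
        return 0;
    }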
The algorithm for splitting the groups is simple - take each group of people who have all dimensions in common and assign half to the treatment and half to the control. You don't need to worry about odd numbers of people; whatever statistical test you are using will account for that. If some dimension is very skewed (e.g., there are only 2 females in your entire sample), it may be wise to throw that dimension out.
Simple A/B tests usually use a t-test or g-test, but in your case, you'd be better off using an ANOVA to determine the significance of the treatment on each of the individual dimensions.
I have 8 data points that form the peak of a partial sine wave. I am trying to fit these to get an equation so I can find the true maximum position (which most likely lies between the data points). The coding will be in C. Does anyone have any info on algorithms or, ideally, code samples?
Since the data points are all near a maximum, the wave y = A*sin(B*x + C) + D can be approximated by a parabola, much like cos(x) is approximated by the first two terms of its series, 1 - x*x/2!.
So find the best fit parabola for the 8 data points and calculate the maximum.
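A minimal C sketch of that approach: fit y = a*x^2 + b*x + c to the samples by least squares (solving the 3x3 normal equations with Cramer's rule) and return the vertex position -b/(2a). The sample values in main are made up:

    #include <stdio.h>

    static double det3(double m[3][3])
    {
        return m[0][0] * (m[1][1] * m[2][2] - m[1][2] * m[2][1])
             - m[0][1] * (m[1][0] * m[2][2] - m[1][2] * m[2][0])
             + m[0][2] * (m[1][0] * m[2][1] - m[1][1] * m[2][0]);
    }

    /* Least-squares fit of y = a*x^2 + b*x + c to n points, then return
       the x position of the parabola's extremum, -b / (2a). */
    static double parabola_peak(const double *x, const double *y, int n)
    {
        double s0 = n, s1 = 0, s2 = 0, s3 = 0, s4 = 0;
        double t0 = 0, t1 = 0, t2 = 0;
        int i;

        for (i = 0; i < n; i++) {
            double xi = x[i], xi2 = xi * xi;
            s1 += xi;   s2 += xi2;       s3 += xi2 * xi;  s4 += xi2 * xi2;
            t0 += y[i]; t1 += xi * y[i]; t2 += xi2 * y[i];
        }

        /* Normal equations: [s4 s3 s2; s3 s2 s1; s2 s1 s0] * [a b c]' = [t2 t1 t0]' */
        double m[3][3]  = {{s4, s3, s2}, {s3, s2, s1}, {s2, s1, s0}};
        double ma[3][3] = {{t2, s3, s2}, {t1, s2, s1}, {t0, s1, s0}};
        double mb[3][3] = {{s4, t2, s2}, {s3, t1, s1}, {s2, t0, s0}};

        double d = det3(m);
        double a = det3(ma) / d;
        double b = det3(mb) / d;

        return -b / (2.0 * a);
    }

    int main(void)
    {
        /* Made-up sample points near a peak, just to exercise the routine. */
        double x[8] = {0, 1, 2, 3, 4, 5, 6, 7};
        double y[8] = {1.0, 2.6, 3.8, 4.6, 4.9, 4.7, 4.1, 3.0};

        printf("estimated peak at x = %f\n", parabola_peak(x, y, 8));
        return 0;
    }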
Peak detection via a quadratic fit is a standard technique in C; lots of Google examples exist.
Provided your sample values form a "hump", i.e. increasing followed by decreasing samples, you could try viewing the sample values as "weights" and computing the "center of gravity":
    float cog = 0.0f;
    float total = 0.0f;
    int i;

    for (i = 0; i < num_samples; i++) {
        cog   += i * samples[i];   /* index weighted by sample value */
        total += samples[i];
    }
    cog /= total;                  /* weighted mean index = centre of gravity */
I've used that in similar cases in the past.
NOTE: This scheme only works if the set of samples used contains a single peak, which the question phrasing certainly made me think was the case. Finding regions of interest can easily be done by monitoring whether sample values are increasing or decreasing, selecting an "interesting" range of samples, and computing the peak location as described.
Also note that if the actual goal is to determine the phase or frequency of an input sine wave, it would be a lot better to correlate the signal against a reference set of sine waves (in other words, do a Fourier transform).
There are some related questions that I've come across (like this, this, this, and this) but they all deal with fitting data to a known curve. Is there a way to fit given data to an unknown curve? By which I mean, given some data the algorithm will give me a fit which is one function or a sum of functions. I'm programming in C, but I'm at a complete loss on how to use the gsl package to do this. I'm open to using anything that can (ideally) be piped through C. But any help on what direction I should look will be greatly appreciated.
EDIT: This is basically experimental (physics) data that I've collected, so the data will have some trend modified by additive gaussian distributed noise. In general the trend will be non-linear, so I guess that a linear regression fitting method will be unsuitable. As for the ordering, the data is time-ordered, so the curve necessarily has to be fit in that order.
You might be looking for polynomial interpolation, in the field of numerical analysis.
In polynomial interpolation, given a set of points (x, y), you are trying to find the polynomial that best fits these points. One way to do it is Newton interpolation, which is fairly easy to program.
The field of numerical analysis, and interpolation in particular, is widely studied, and you might be able to get a nice upper bound on the error of the polynomial.
Note, however, that because you are looking for a polynomial that best fits your data, and the underlying function is not really a polynomial, the error blows up quickly once you move far from your initial training set.
Also note that your data set is finite, and there are infinitely many (in fact, uncountably many) functions that can fit the data (exactly or approximately), so which of these is best depends on what you are actually trying to achieve.
If you are looking for a model to fit your data, note that linear regression and polynomial interpolation are at opposite ends of the scale: polynomial interpolation is likely to overfit the data, while linear regression may underfit it. What exactly should be used is case specific and varies from one application to another.
Simple polynomial interpolation example:
Let's say we have (0,1),(1,2),(3,10) as our data.
The table(1) we get using Newton's method is:

    0 | 1  |                  |
    1 | 2  | (2-1)/(1-0) = 1  |
    3 | 10 | (10-2)/(3-1) = 4 | (4-1)/(3-0) = 1
Now, the polynomial we get is the "diagonal" that ends with the last element:

    1 + 1*(x-0) + 1*(x-0)*(x-1) = 1 + x + x^2 - x = x^2 + 1
(and that is a perfect fit indeed to the data we used)
(1) The table is built recursively: the first two columns are the x and y values, and each subsequent column is computed from the prior one. It is really easy to implement once you get it; the full explanation is in the Wikipedia page for Newton interpolation.
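A minimal C sketch of Newton's divided-difference method, hard-coded to the three example points above just to show the mechanics:

    #include <stdio.h>

    #define N 3   /* number of data points in this sketch */

    /* Build the divided-difference coefficients in place: after the call,
       coef[j] holds f[x0, ..., xj], the "diagonal" of the table above. */
    static void divided_differences(const double *x, double *coef, int n)
    {
        int i, j;
        for (j = 1; j < n; j++)
            for (i = n - 1; i >= j; i--)
                coef[i] = (coef[i] - coef[i - 1]) / (x[i] - x[i - j]);
    }

    /* Evaluate the Newton-form polynomial at xq (Horner-like scheme). */
    static double newton_eval(const double *x, const double *coef, int n, double xq)
    {
        double result = coef[n - 1];
        int i;
        for (i = n - 2; i >= 0; i--)
            result = result * (xq - x[i]) + coef[i];
        return result;
    }

    int main(void)
    {
        double x[N]    = {0.0, 1.0, 3.0};
        double coef[N] = {1.0, 2.0, 10.0};   /* start with the y values */

        divided_differences(x, coef, N);
        printf("p(2) = %f\n", newton_eval(x, coef, N, 2.0));  /* x^2 + 1 -> 5 */
        return 0;
    }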
Another alternative is using linear regression, but multi-dimensional.
The trick here is to artificially generate extra dimensions. You can do so by simply applying some functions to the original data set. A common usage is to generate polynomials to match the data, so here the functions you apply are f(x) = x^i for all i <= k (where k is the degree of the polynomial you want to get).
For example, the data set (0,2),(2,3) with k = 3 gains two extra dimensions, and becomes ((0,0,0),2),((2,4,8),3).
The linear-regression algorithm will find the values a_0,a_1,...,a_k for the polynomial p(x) = a_0 + a_1*x + ... + a_k*x^k that minimize the error of each point in the data compared to the predicted model (the value of p(x)).
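A minimal C sketch of just the feature-expansion step for the example above; the regression itself would be handed to a library routine (e.g. GSL's gsl_multifit_linear), which is not shown here:

    #include <stdio.h>

    #define NPOINTS 2
    #define K       3   /* degree of the polynomial features */

    /* Expand each x into the feature vector (x, x^2, ..., x^K); a linear
       regression over these columns then yields a_1..a_K, with a_0 as the
       intercept. */
    static void expand_features(const double *x, int n, double design[][K])
    {
        int i, j;
        for (i = 0; i < n; i++) {
            double p = 1.0;
            for (j = 0; j < K; j++) {
                p *= x[i];
                design[i][j] = p;
            }
        }
    }

    int main(void)
    {
        double x[NPOINTS] = {0.0, 2.0};
        double y[NPOINTS] = {2.0, 3.0};
        double design[NPOINTS][K];
        int i, j;

        expand_features(x, NPOINTS, design);
        for (i = 0; i < NPOINTS; i++) {
            for (j = 0; j < K; j++)
                printf("%8.2f ", design[i][j]);
            printf("| y = %.2f\n", y[i]);
        }
        return 0;
    }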
Now, the problem is that as you increase the dimension, you move from underfitting (one-dimensional linear regression) to overfitting (when k == n, you are effectively getting polynomial interpolation).
To choose the best k value, you can use cross-validation, and pick the k that minimizes the error according to your cross-validation.
Note that this process can be fully automated; all you need to do is iteratively check all k values in the desired range(1), and choose the model with the k that minimizes the cross-validation error.
(1) The range could be [1, n], though that will probably be way too time-consuming; I'd go for [1, sqrt(n)] or even [1, log(n)], but it is just a hunch.
You might want to use (Fast) Fourier Transforms to convert data to frequency domain.
With the result of the transform (a set of amplitudes and phases and frequencies) even the most twisted set of data can be represented by several functions (harmonics) of the form:
r * cos(f * t - p)
where r is the harmonic amplitude, f is the frequency and p the phase.
Finally, the unknown data curve is the sum of all harmonics.
I have done this in R, but I believe C has enough tools to manage it. It is also possible to pipe between C and R, though I don't know much about that.
This method is really good for large chunks of data because of its complexity:
1) decompose the data with a Fast Fourier Transform (FFT) = O(n log n)
2) build the function from the resulting components = O(n)
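A minimal C sketch of the decomposition step, using a naive O(n^2) DFT instead of a real FFT to keep the code short (a production version would use a library such as FFTW; compile with -lm). The sample data is made up:

    #include <math.h>
    #include <stdio.h>

    #ifndef M_PI
    #define M_PI 3.14159265358979323846
    #endif

    #define N 8   /* number of samples in this sketch */

    int main(void)
    {
        /* Made-up, roughly sinusoidal sample data. */
        double y[N] = {0.0, 1.0, 1.4, 1.0, 0.0, -1.0, -1.4, -1.0};
        double re[N], im[N];
        int k, t;

        /* Naive DFT: correlate the data against cosine and sine reference
           waves at each frequency bin k. */
        for (k = 0; k < N; k++) {
            re[k] = im[k] = 0.0;
            for (t = 0; t < N; t++) {
                double ang = 2.0 * M_PI * k * t / N;
                re[k] += y[t] * cos(ang);
                im[k] -= y[t] * sin(ang);
            }
        }

        /* For a real signal, bins k and N-k are conjugates; for 0 < k < N/2
           the harmonic r*cos(2*pi*k*t/N - p) has r = 2*mag/N and p = -phase. */
        for (k = 0; k < N; k++) {
            double mag   = sqrt(re[k] * re[k] + im[k] * im[k]);
            double phase = atan2(im[k], re[k]);
            printf("bin %d: magnitude %.3f  phase %.3f\n", k, mag, phase);
        }
        return 0;
    }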
I'm studying programming on my own and I would like to have an idea how to solve this problem.
I have been given a set of resistors with given resistances and a target value restot. I can pick a given number of those resistors. How can I make a circuit whose resistance is as near as possible to restot? A programmer told me that one can use genetic algorithms, but I'm not limited to that approach.
I guess I have to build a linear system of equations using Kirchhoff's laws, but as I don't have much experience with electricity problems or with numerical algorithms for linear systems, I would like some guidance on how to build those equations in the computer's memory automatically as the system changes, and on how to make sure that the algorithm converges to better solutions.
The problem is from a Finnish discussion forum.
Resistors can be combined either in series or in parallel, and their resistances add up differently (add the values for series, add the reciprocals for parallel).
You can also have networks of resistors in series and parallel.
This sounds to me like a classic case of a recursive data structure, and you could probably represent it as a tree, in a similar way to a binary expression tree: http://en.wikipedia.org/wiki/Binary_expression_tree
Combine that with some exploratory tree building (you should look into the way Prolog does this) and you can find the best combination of resistors that gets close to your total.
No genetic algorithms in this approach, although you could take a genetic approach to building and refining the tree.
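A minimal C sketch of such a tree, where each node is either a single resistor or a series/parallel combination of two sub-networks; the search over candidate trees is not shown:

    #include <stdio.h>
    #include <stdlib.h>

    /* A node is either a single resistor (leaf) or a series/parallel
       combination of two sub-networks, as in a binary expression tree. */
    enum kind { RESISTOR, SERIES, PARALLEL };

    struct node {
        enum kind kind;
        double ohms;              /* used only for RESISTOR leaves */
        struct node *left, *right;
    };

    static struct node *leaf(double ohms)
    {
        struct node *n = malloc(sizeof *n);
        n->kind = RESISTOR; n->ohms = ohms; n->left = n->right = NULL;
        return n;
    }

    static struct node *combine(enum kind kind, struct node *l, struct node *r)
    {
        struct node *n = malloc(sizeof *n);
        n->kind = kind; n->ohms = 0.0; n->left = l; n->right = r;
        return n;
    }

    /* Total resistance: add values in series, add reciprocals in parallel. */
    static double resistance(const struct node *n)
    {
        double a, b;
        if (n->kind == RESISTOR)
            return n->ohms;
        a = resistance(n->left);
        b = resistance(n->right);
        return n->kind == SERIES ? a + b : 1.0 / (1.0 / a + 1.0 / b);
    }

    int main(void)
    {
        /* (100 ohm in series with 220 ohm) in parallel with 330 ohm. */
        struct node *net = combine(PARALLEL,
                                   combine(SERIES, leaf(100), leaf(220)),
                                   leaf(330));
        printf("total = %.2f ohms\n", resistance(net));
        return 0;
    }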
To apply a genetic algorithm you would need to find a way to represent, mutate and combine the "DNA" of a resistor network.
One way would be to:
Add some number of 0 ohm resistors to your resistor set (representing wires).
Number the resistors from 1 to N
For some M, imagine a set of M junctions including the source (1) and sink (M).
You could define which junctions the two endpoints of each resistor are connected to as the unique identifier of a network. This is just an N-tuple of integer pairs in the range 1..M. This tuple can be the "DNA".
Then:
1. Generate a bunch of random networks from random tuples.
2. Calculate each network's resistance.
3. Discard some portion of the population farthest from the target resistance.
4. Combine random pairs of them to form new networks (perhaps by randomly selecting each resistor endpoint from either parent A or parent B with 50% probability).
5. Randomly change a few endpoints (mutation).
6. Go to step 2.
Not sure if it would actually work exactly like this, but you get the general idea.
There is undoubtedly a better non-genetic algorithm, but you specifically asked for a genetic one, so there you go.
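A minimal C sketch of the representation, crossover and mutation steps described above; evaluating a network's actual resistance requires nodal analysis (solving Kirchhoff's equations) and is only stubbed out here:

    #include <stdio.h>
    #include <stdlib.h>
    #include <time.h>

    #define N_RESISTORS 5   /* resistors (including the 0-ohm "wires") */
    #define M_JUNCTIONS 4   /* junctions; 1 is the source, M_JUNCTIONS the sink */

    /* The "DNA": for each resistor, the two junctions its endpoints touch. */
    struct network {
        int endpoint[N_RESISTORS][2];   /* values in 1..M_JUNCTIONS */
    };

    static int random_junction(void)
    {
        return 1 + rand() % M_JUNCTIONS;
    }

    static struct network random_network(void)
    {
        struct network net;
        for (int i = 0; i < N_RESISTORS; i++) {
            net.endpoint[i][0] = random_junction();
            net.endpoint[i][1] = random_junction();
        }
        return net;
    }

    /* Crossover: each endpoint comes from parent a or parent b with 50% odds. */
    static struct network crossover(const struct network *a, const struct network *b)
    {
        struct network child;
        for (int i = 0; i < N_RESISTORS; i++)
            for (int j = 0; j < 2; j++)
                child.endpoint[i][j] = (rand() % 2 ? a : b)->endpoint[i][j];
        return child;
    }

    /* Mutation: reconnect a few endpoints to random junctions. */
    static void mutate(struct network *net, int changes)
    {
        while (changes-- > 0)
            net->endpoint[rand() % N_RESISTORS][rand() % 2] = random_junction();
    }

    /* Fitness stub: the real thing would compute the source-to-sink resistance
       by nodal analysis and return its distance from the target value. */
    static double fitness(const struct network *net)
    {
        (void) net;
        return 0.0;
    }

    int main(void)
    {
        srand((unsigned) time(NULL));
        struct network a = random_network(), b = random_network();
        struct network child = crossover(&a, &b);
        mutate(&child, 2);
        printf("child fitness (stub): %f\n", fitness(&child));
        return 0;
    }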
If you are not limited to a genetic algorithm, then I think you can also solve this problem with the help of linear programming. You can encode the problem as below and ask a solver to give you the answer.
    Required_Resistance_Of_Circuit = x ohms

    // We want to use a total of 33 resistors.
    selected_in_series_1 + selected_in_series_2 + ... + selected_in_series_211 + selected_in_parallel_1 + selected_in_parallel_2 + ... + selected_in_parallel_211 = 33

    // Resistors in series
    (selected_in_series_1 * Resistor_1) + (selected_in_series_2 * Resistor_2) + ... + (selected_in_series_211 * Resistor_211) = total_resistance_in_series

    // Similarly, for the resistors in parallel
    (selected_in_parallel_1 * 1/Resistor_1) + (selected_in_parallel_2 * 1/Resistor_2) + ... + (selected_in_parallel_211 * 1/Resistor_211) = 1/total_resistance_in_parallel

    total_resistance_in_series + total_resistance_in_parallel = Required_Resistance_Of_Circuit