Generating random poll numbers - c

I struggle with this simple problem: I want to create some random poll numbers. I have 4 variables I need to fill with data (actually an array of integers). These numbers should represent random percentages, and all the percentages added together must equal 100%. Sounds simple.
But I think it isn't that easy. My first attempt was to generate a random number between 10 and a base (base = 100) and subtract that number from the base. I did this 3 times, and the last value was assigned whatever remained of the base. Is there a more elegant way to do this?
My question in a few words:
How can I fill this array with random values, which will be 100 when added together?
int values[4];

You need to write your code to emulate what you are simulating.
So if you have four choices, generate a whole sample of random numbers in the range 0..3 (for example, the floor of a random value in [0,1) times 4; note that 4 itself won't be picked), and then count all the 0's, 1's, 2's, and 3's. Then divide each count by the sample size.
for (each sample) {
    poll = random(choices);
    survey[poll] += 1;
}
It's easy to use a computer to simulate things, and simple simulations like this are very fast.
Keep in mind that you are working with integers, and integers don't divide nicely without converting them to floats or doubles. If you are missing a few percentage points, odds are it has to do with integer division discarding the remainders.
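For concreteness, here is a minimal, self-contained C sketch of this simulation approach (the names CHOICES and SAMPLES are mine, and giving the rounding remainder to the last bucket is just one arbitrary way to make the integer percentages total exactly 100):

#include <stdio.h>
#include <stdlib.h>
#include <time.h>

#define CHOICES 4
#define SAMPLES 1000    /* larger samples give smoother percentages */

int main(void)
{
    int survey[CHOICES] = {0};
    int values[CHOICES];
    int i, total = 0;

    srand((unsigned)time(NULL));

    /* Simulate SAMPLES respondents, each picking one of the choices. */
    for (i = 0; i < SAMPLES; ++i)
        survey[rand() % CHOICES] += 1;

    /* Convert counts to integer percentages; give any rounding
       remainder to the last bucket so the values sum to exactly 100. */
    for (i = 0; i < CHOICES - 1; ++i) {
        values[i] = survey[i] * 100 / SAMPLES;
        total += values[i];
    }
    values[CHOICES - 1] = 100 - total;

    for (i = 0; i < CHOICES; ++i)
        printf("choice %d: %d%%\n", i, values[i]);
    return 0;
}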

What you have here is a problem of partitioning the number 100 into 4 random integers. This is called partitioning in number theory.
This problem has been addressed here.
The solution presented there does essentially the following:
It computes how many partitions of an integer n there are, in O(n^2) time. This produces a table of size O(n^2), which can then be used to generate the kth partition of n, for any integer k, in O(n) time.
In your case n = 100 and the partitions have 4 parts; picking the index k uniformly at random then yields a uniformly random partition.
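The linked solution itself is not reproduced above, but the O(n^2) counting table it describes is a standard dynamic program. A minimal sketch of just that counting step (my own illustration, not the linked code):

#include <stdio.h>

enum { N = 100, K = 4 };

int main(void)
{
    /* p[n][k] = number of partitions of n into at most k parts.
       Recurrence: p(n, k) = p(n, k-1) + p(n-k, k) -- either fewer
       than k parts are used, or all k parts are >= 1 and we can
       subtract 1 from each of them. */
    static long long p[N + 1][K + 1];
    int n, k;

    for (k = 0; k <= K; ++k)
        p[0][k] = 1;                 /* one (empty) partition of 0 */
    for (n = 1; n <= N; ++n) {
        p[n][0] = 0;                 /* no parts: nothing fits */
        for (k = 1; k <= K; ++k)
            p[n][k] = p[n][k - 1] + (n >= k ? p[n - k][k] : 0);
    }
    printf("partitions of %d into at most %d parts: %lld\n", N, K, p[N][K]);
    return 0;
}

Generating the kth partition from such a table is the part covered by the linked answer.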

Generate x1 in the range [0, 1] and subtract it from 1; then generate x2 in the range [0, 1-x1], and so on. The last value should not be random; in your case it should simply equal 1-x1-x2-x3.

I don't think this is a whole lot prettier than what it sounds like you've already done, but it does work. (The only advantage is it's scalable if you want more than 4 elements).
Make sure you #include <stdlib.h> (for rand()); if you want different numbers on each run, also seed the generator once with srand().
int prev_sum = 0, j;
for (j = 0; j < 3; ++j)
{
    /* pick a value no larger than what is still unallocated */
    values[j] = rand() % (100 - prev_sum);
    prev_sum += values[j];
}
values[3] = 100 - prev_sum;   /* the last element takes the remainder */

It takes some work to get a truly unbiased solution to the "random partition" problem. But it's first necessary to understand what "unbiased" means in this context.
One line of reasoning is based on the intuition of a random coin toss. An unbiased coin will come up heads as often as it comes up tails, so we might think that we could produce an unbiased partition of 100 tosses into two parts (head-count and tail-count) by tossing the unbiased coin 100 times and counting. That's the essence of Edwin Buck's proposal, modified to produce a four-partition instead of a two-partition.
However, what we'll find is that many partitions never show up. There are 101 two-partitions of 100 -- {0, 100}, {1, 99} … {100, 0} but the coin sampling solution finds less than half of them in 10,000 tries. As might be expected, the partition {50, 50} is the most common (7.8%), while all of the partitions from {0, 100} to {39, 61} in total achieved less than 1.7% (and, in the trial I did, the partitions from {0, 100} to {31, 69} didn't show up at all.) [Note 1]
So that doesn't seem like an unbiased sample of possible partitions. An unbiased sample of partitions would return every partition with equal probability.
So another temptation would be to select the size of the first part of the partition from all the possible sizes, then the size of the second part from whatever is left, and so on until we've reached one less than the number of parts, at which point anything left over goes into the last part. However, this turns out to be biased as well, because the first part is much more likely to be large than any other part.
Finally, we could enumerate all the possible partitions and then choose one of them at random. That would obviously be unbiased, but unfortunately there are a lot of possible partitions: for 4-partitions of 100, for example, there are 176,851 possibilities (103 choose 3). Perhaps that is feasible in this case, but it doesn't seem like it will lead to a general solution.
For a better algorithm, we can start with the observation that a partition
{p1, p2, p3, p4}
could be rewritten without loss of information as a cumulative distribution function (CDF):
{p1, p1+p2, p1+p2+p3, p1+p2+p3+p4}
where the last term is just the desired sum, in this case 100.
That is still a collection of four integers in the range [0, 100]; however, it is guaranteed to be in non-decreasing order.
It's not easy to generate a random sorted sequence of four numbers ending in 100, but it is trivial to generate three random integers no greater than 100, sort them, and then find adjacent differences. And that leads to an almost unbiased solution, which is probably close enough for most practical purposes, particularly since the implementation is almost trivial:
(Python)
from random import randrange

def random_partition(n, k):
    # sort k-1 random cut points in [0, n]; the k adjacent
    # differences are non-negative and sum to n
    d = sorted(randrange(n + 1) for i in range(k - 1))
    return [b - a for a, b in zip([0] + d, d + [n])]
Unfortunately, this is still biased, because of the sort. The unsorted list is selected without bias from the universe of possible lists, but the sorting step is not one-to-one: lists with repeated elements have fewer permutations than lists without repeated elements, so the probability of a particular sorted list without repeats is much higher than the probability of a sorted list with repeats.
As n grows large with respect to k, the number of lists with repeats declines rapidly. (These correspond to final partitions in which one or more of the parts is 0.) In the limit, where we are selecting from a continuum and collisions have probability 0, the algorithm is unbiased. Even in the case of n=100, k=4, the bias is probably ignorable for many practical applications. Increasing n to 1000 or 10000 (and then scaling the resulting random partition) would reduce the bias.
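Since the question is in C, here is a sketch of the same sort-and-take-differences idea there (the helper names are mine; rand() is used for brevity, although it has a small modulo bias of its own):

#include <stdio.h>
#include <stdlib.h>
#include <time.h>

static int cmp_int(const void *a, const void *b)
{
    return *(const int *)a - *(const int *)b;
}

/* Fill values[0..k-1] with non-negative integers summing to n:
   sort k-1 random cut points in [0, n], then take differences. */
static void random_partition(int *values, int k, int n)
{
    int cuts[k + 1];                 /* C99 variable-length array */
    int i;

    cuts[0] = 0;
    cuts[k] = n;
    for (i = 1; i < k; ++i)
        cuts[i] = rand() % (n + 1);
    qsort(cuts + 1, k - 1, sizeof cuts[0], cmp_int);

    for (i = 0; i < k; ++i)
        values[i] = cuts[i + 1] - cuts[i];
}

int main(void)
{
    int values[4];
    srand((unsigned)time(NULL));
    random_partition(values, 4, 100);
    printf("%d %d %d %d\n", values[0], values[1], values[2], values[3]);
    return 0;
}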
There are fast algorithms which can produce unbiased integer partitions, but they are typically either hard to understand or slow. The slow one, which takes O(n) time, is similar to reservoir sampling; for a faster algorithm, see the work of Jeffrey Vitter.
Notes
Here's the quick-and-dirty Python + shell test:
$ python -c '
from random import randrange
n = 2
for i in range(10000):
    d = n * [0]
    for j in range(100):
        d[randrange(n)] += 1
    print(" ".join(str(f) for f in d))
' | sort -n | uniq -c
1 32 68
2 34 66
5 35 65
15 36 64
45 37 63
40 38 62
66 39 61
110 40 60
154 41 59
219 42 58
309 43 57
385 44 56
462 45 55
610 46 54
648 47 53
717 48 52
749 49 51
779 50 50
788 51 49
723 52 48
695 53 47
591 54 46
498 55 45
366 56 44
318 57 43
234 58 42
174 59 41
118 60 40
66 61 39
45 62 38
22 63 37
21 64 36
15 65 35
2 66 34
4 67 33
2 68 32
1 70 30
1 71 29

You can brute-force it: write a function that adds up the numbers in your array; if they do not total 100, regenerate the random values and check again.
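For what it's worth, a minimal sketch of that retry loop (my own illustration; with four values drawn from 0..100 the sum lands on exactly 100 only about 0.17% of the time, so expect a few hundred iterations on average). One nice property: every accepted 4-tuple is equally likely, so this is unbiased over ordered outcomes, just slow.

#include <stdio.h>
#include <stdlib.h>
#include <time.h>

int main(void)
{
    int values[4], sum, i;

    srand((unsigned)time(NULL));
    do {
        sum = 0;
        for (i = 0; i < 4; ++i) {
            values[i] = rand() % 101;    /* 0..100 */
            sum += values[i];
        }
    } while (sum != 100);                /* reject and try again */

    printf("%d %d %d %d\n", values[0], values[1], values[2], values[3]);
    return 0;
}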

Related

Maximize sum of subset of arrays under given cost

I have an array of n Objects with an int 'value' and another int 'cost'. I want to get the subset of size k (k < n) of that array that maximizes the sum of the values. For instance...
Value - Cost
32 - 24
25 - 17
39 - 40
10 - 10
47 - 44
0 - 10
18 - 10
I need to choose, say, 5 of those that will maximize value while staying below a certain total cost (say, 100). I don't get bonus points for having the lowest cost, only for having the highest value. I'm not looking to get the most bang for my buck, I'm looking to get the most bang, while staying below a given buck.
Is there just a name for the algorithm I'm looking for? I don't want to have to brute force it.
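No answer is quoted here, but as described this reads like what is usually called the 0/1 knapsack problem with an added cardinality constraint: choose exactly k items, keep total cost within a budget, maximize total value. A minimal DP sketch under that reading (the data is the table above; the naming and everything else is my own illustration):

#include <stdio.h>

#define N      7        /* number of items */
#define K      5        /* items to choose */
#define BUDGET 100      /* maximum total cost */
#define NEG    (-1000000)

int main(void)
{
    int value[N] = {32, 25, 39, 10, 47, 0, 18};
    int cost[N]  = {24, 17, 40, 10, 44, 10, 10};

    /* dp[j][c] = best value using exactly j items with total cost
       at most c; NEG marks unreachable states. */
    static int dp[K + 1][BUDGET + 1];
    int i, j, c;

    for (j = 0; j <= K; ++j)
        for (c = 0; c <= BUDGET; ++c)
            dp[j][c] = (j == 0) ? 0 : NEG;

    for (i = 0; i < N; ++i)                       /* each item once:  */
        for (j = K; j >= 1; --j)                  /* iterate j and c  */
            for (c = BUDGET; c >= cost[i]; --c)   /* backwards        */
                if (dp[j - 1][c - cost[i]] + value[i] > dp[j][c])
                    dp[j][c] = dp[j - 1][c - cost[i]] + value[i];

    printf("best value with exactly %d items, cost <= %d: %d\n",
           K, BUDGET, dp[K][BUDGET]);
    return 0;
}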

Add Countif to Array Formula (Subtotal) in Excel

I am new to array formulae and have noticed that while SUBTOTAL includes many functions, it does not feature COUNTIF (only COUNT and COUNTA).
I'm trying to figure out how I can integrate a COUNTIF-like feature to my array formula.
I have a matrix, a small subset of which looks like this (row 2 is missing two values and row 4 is entirely empty):
 A  B  C  D  E
48 53 46 64 66
48 66 89
40 38 42 49 44

37 33 35 39 41
Thanks to the help of Tom Sharpe in this post, I (he) was able to average the sum of each row in the matrix provided it had complete data (so rows 2 and 4 in the example above would not be included).
Now I would like to count the number of rows with complete data (so rows 2 and 4 would be ignored) which include at least one value above a given threshold (say 45).
In the current example, the result would be 2, since row 1 has 5/5 values > 45, and row 3 has 1 value > 45. Row 5 has values < 45, and rows 2 and 4 have partially and fully missing data, respectively.
I have recently discovered the SUMPRODUCT function and think that perhaps SUMPRODUCT(--(A1:E1>=45)) could be useful, but I'm not sure how to integrate it within Tom Sharpe's elegant code, e.g.,
=AVERAGE(IF(SUBTOTAL(2,OFFSET(A1,ROW(A1:A5)-ROW(A1),0,1,COLUMNS(A1:E1)))=COLUMNS(A1:E1),SUBTOTAL(9,OFFSET(A1,ROW(A1:A5)-ROW(A1),0,1,COLUMNS(A1:E1))),""))
Remember, I am no longer looking for the average: I want to filter rows for whether they have full data, and if they do, I want to count rows with at least 1 entry > 45.
Try the following. Enter as array formula.
=COUNT(IF(SUBTOTAL(4,OFFSET(A1,ROW(A1:A5)-ROW(A1),0,1,COLUMNS(A1:E1)))>45,IF(SUBTOTAL(2,OFFSET(A1,ROW(A1:A5)-ROW(A1),0,1,COLUMNS(A1:E1)))=COLUMNS(A1:E1),SUBTOTAL(9,OFFSET(A1,ROW(A1:A5)-ROW(A1),0,1,COLUMNS(A1:E1))))))

Multi-Dimensional Arrays Julia

I am new to using Julia and have little experience with the language. I am trying to understand how multi-dimensional arrays work in it and how to access the array at the different dimensions. The documentation confuses me, so maybe someone here can explain it better.
I created an array (m = Array{Int64}(6,3)) and am trying to access the different parts of that array. Clearly I am understanding it wrong so any help in general about Arrays/Multi-Dimensional Arrays would help.
Thanks
Edit: I am trying to read in a file that has the contents
58 129 10
58 129 7
25 56 10
24 125 25
24 125 15
13 41 10
0
The purpose of the project is to take these fractions (58/129) and round them using the Farey sequence. The last number in each row is what both parts of the fraction need to stay below. Currently, I am not looking for help on how to do the problem, just how to create a multidimensional array holding all the numbers except the last row (the 0). My trouble is how to put the numbers into the array after I have created it.
So I want m[0][0] = 58, and so on. I'm not sure how the syntax for this works, and the manual is confusing. Hopefully this is enough information.
Julia's arrays are not lists-of-lists or arrays of pointers. They are a single container, with elements arranged in a rectangular shape. As such, you do not access successive dimensions with repeated indexing calls like m[j][i] — instead you use one indexing call with multiple indices: m[i, j].
If you trim off that last 0 in your file, you can just use the built-in readdlm to load that file into a matrix. I've copied those first six rows into my clipboard to make it a bit easier to follow here:
julia> str = clipboard()
"58 129 10\n58 129 7\n25 56 10\n24 125 25\n24 125 15\n13 41 10"
julia> readdlm(IOBuffer(str), Int) # or readdlm("path/to/trimmed/file", Int)
6×3 Array{Int64,2}:
58 129 10
58 129 7
25 56 10
24 125 25
24 125 15
13 41 10
That's not very helpful in teaching you how Julia's arrays work, though. Constructing an array like m = Array{Int64}(6,3) creates an uninitialized matrix with 18 elements arranged in 6 rows and 3 columns. It's a bit easier to see how things work if we fill it with a sensible pattern:
julia> m .= [10,20,30,40,50,60] .+ [1 2 3]
6×3 Array{Int64,2}:
11 12 13
21 22 23
31 32 33
41 42 43
51 52 53
61 62 63
This has set up the values of the array to have the row number in their tens place and the column number in the ones place. Accessing m[r,c] returns the value in m at row r and column c.
julia> m[2,3] # second row, third column
23
Now, r and c don't have to be integers — they can also be vectors of integers to select multiple rows or columns:
julia> m[[2,3,4],[1,2]] # Selects rows 2, 3, and 4 across columns 1 and 2
3×2 Array{Int64,2}:
21 22
31 32
41 42
Of course ranges like 2:4 are just vectors themselves, so you can more easily and efficiently write that example as m[2:4, 1:2]. A : by itself is a shorthand for a vector of all the indices within the dimension it indexes into:
julia> m[1, :] # the first row of all columns
3-element Array{Int64,1}:
11
12
13
julia> m[:, 1] # all rows of the first column
6-element Array{Int64,1}:
11
21
31
41
51
61
Finally, note that Julia's Array is column-major and arranged contiguously in memory. This means that if you just use one index, like m[2], you're just going to walk down that first column. As a special extension, we support what's commonly referred to as "linear indexing", where we allow that single index to span into the higher dimensions. So m[7] accesses the 7th contiguous element, wrapping around into the first row of the second column:
julia> m[5],m[6],m[7],m[8]
(51, 61, 12, 22)

An Algorithm Comparing Peaks: are they in phase or not?

I am developing an algorithm for comparing two lists of numbers. The lists represent peaks discovered in a signal using a robust peak detection method. I wish to come up with some way of determining whether the peaks are either in phase, out of phase, or neither (cannot be determined). For example:
These arrays would be considered in phase:
[ 94 185 278 373 469], [ 89 180 277 369 466]
But these arrays would be out of phase:
[51 146 242 349], [99 200 304 401]
There is no requirement that the arrays must be the same length. I have looked into measuring periodicity, however in this case I can assume the signal is already periodic.
Another idea I had was to divide all the array elements by their index (or their index+1) to see if they cluster around one or two points, but this is not robust and fails if a single peak is missing.
What approaches might be useful in solving this problem?
One approach would be to find the median distance from each peak in the first list to a peak in the second list.
If you divide this distance by the median distance between peaks in the first list, you will get a fraction where 0 means in phase, and 0.5 means out of phase.
For example:
[ 94 185 278 373 469], [ 89 180 277 369 466]
94->89 = 5
185->180 = 5
278->277 = 1
373->369 = 4
469->466 = 3
Score = median(5, 5, 1, 4, 3) / median distance between peaks
= 4 / 94 ≈ 4.3% => in phase
[51 146 242 349], [99 200 304 401]
51->99 = 48
146->99 = 47
242->200 = 42
349->304 = 45
score = median(48,47,42,45) / median distance between peaks
= 46 / 96
= 48% => out of phase
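A minimal C sketch of this scoring idea (the helper names are mine; both arrays are assumed sorted, as peak positions normally are):

#include <stdio.h>
#include <stdlib.h>

static int cmp_double(const void *a, const void *b)
{
    double d = *(const double *)a - *(const double *)b;
    return (d > 0) - (d < 0);
}

static double median(double *v, int n)
{
    qsort(v, n, sizeof *v, cmp_double);
    return n % 2 ? v[n / 2] : (v[n / 2 - 1] + v[n / 2]) / 2.0;
}

/* Score = median nearest-peak distance / median peak spacing.
   Near 0 => in phase; near 0.5 => out of phase. */
static double phase_score(const int *a, int na, const int *b, int nb)
{
    double dist[na], gaps[na - 1];   /* C99 variable-length arrays */
    int i, j;

    for (i = 0; i < na; ++i) {       /* nearest peak in b to a[i] */
        int best = abs(a[i] - b[0]);
        for (j = 1; j < nb; ++j)
            if (abs(a[i] - b[j]) < best)
                best = abs(a[i] - b[j]);
        dist[i] = best;
    }
    for (i = 0; i + 1 < na; ++i)     /* spacing between a's peaks */
        gaps[i] = a[i + 1] - a[i];

    return median(dist, na) / median(gaps, na - 1);
}

int main(void)
{
    int a[] = {94, 185, 278, 373, 469}, b[] = {89, 180, 277, 369, 466};
    int c[] = {51, 146, 242, 349},      d[] = {99, 200, 304, 401};
    printf("in phase:     %.3f\n", phase_score(a, 5, b, 5));
    printf("out of phase: %.3f\n", phase_score(c, 4, d, 4));
    return 0;
}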
I would enter the peak locations, using them as index locations, into a much larger array (ideally with a length close to an integer multiple of the periodicity distance of your peaks), and then run either a complex Goertzel filter (if you know the frequency) or a DFT/FFT (if you don't) over that array. Then use atan2() on the complex result (at the peak-magnitude frequency, for the FFT) to measure each list's phase relative to the start of the array, and compare the unwrapped phases against some difference threshold.

How to find subarray between min and max

I have a sorted array. Let's assume it is
{4, 7, 9, 12, 23, 34, 56, 78}
Given a min and a max, I want to find the elements in the array between min and max in an efficient way.
Cases:
min = 23, max = 78: {23, 34, 56, 78}
min = 10, max = 65: {12, 23, 34, 56}
min = 0, max = 100: {4, 7, 9, 12, 23, 34, 56, 78}
min = 30, max = 300: {34, 56, 78}
min = 100, max = 300: {} // empty
I want an efficient way to do this. I am not asking for code, just for an algorithm I can use here, like DP or exponential search.
Since it's sorted, you can easily find the lowest element greater than or equal to the minimum desired, by using a binary search over the entire array.
A binary search basically reduces the search space by half with each iteration. Given the example with a min of 10, you start as follows, with the midpoint on the 12:
   0   1   2   3   4   5   6   7   <- index
   4   7   9  12  23  34  56  78
              ^^
Since the element you're looking at is higher than 10 and the element before it is lower, you've found it.
Then, you can use a similar binary search but only over that section from the element you just found to the end. This time you're looking for the highest element less than or equal to the maximum desired.
On the same example as previously mentioned, you start with:
   3   4   5   6   7   <- index
  12  23  34  56  78
          ^^
Since that's less than 65 and the following one is also, you need to increase the pointer to the halfway point of 34..78:
   3   4   5   6   7   <- index
  12  23  34  56  78
              ^^
And there you have it: that number is less than 65 and the following number is more.
Then you have the start and stop indexes (3 and 6) for extracting the values.
   0   1   2     3   4   5   6    7  <- index
   4   7   9  ((12  23  34  56)) 78
              ------------------
The time complexity of the algorithm is O(log N). Though keep in mind that this really only becomes important when dealing with larger data sets. If your data sets do consist of only about eight elements, you may as well use a linear search since (1) it'll be easier to write; and (2) the time differential will be irrelevant.
I tend not to worry about time complexity unless the operations are really expensive, the data set size gets into the thousands, or I'm having to do it thousands of times a second.
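A sketch of those two binary searches in C, in lower-bound/upper-bound style (the function names are mine):

#include <stdio.h>

/* First index whose value is >= min (lower bound). */
static int first_at_least(const int *a, int n, int min)
{
    int lo = 0, hi = n;                /* search within [lo, hi) */
    while (lo < hi) {
        int mid = lo + (hi - lo) / 2;
        if (a[mid] < min) lo = mid + 1;
        else              hi = mid;
    }
    return lo;
}

/* One past the last index whose value is <= max (upper bound). */
static int first_above(const int *a, int n, int max)
{
    int lo = 0, hi = n;
    while (lo < hi) {
        int mid = lo + (hi - lo) / 2;
        if (a[mid] <= max) lo = mid + 1;
        else               hi = mid;
    }
    return lo;
}

int main(void)
{
    int a[] = {4, 7, 9, 12, 23, 34, 56, 78};
    int i, start, stop;

    start = first_at_least(a, 8, 10);  /* index 3 (value 12)    */
    stop  = first_above(a, 8, 65);     /* index 7 (one past 56) */

    for (i = start; i < stop; ++i)
        printf("%d ", a[i]);           /* prints: 12 23 34 56   */
    printf("\n");
    return 0;
}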
Since it is sorted, this should do:
List<Integer> subarray = new ArrayList<Integer>();
for (int n : numbers) {
    if (n >= MIN && n <= MAX) subarray.add(n);
}
It's O(n) as you only look at every number once.
