How to get an evenly distributed sample from Perl array values? - arrays

I have an array containing many values between 0 and 360 (like degrees in a circle), but unevenly distributed:
1,45,46,47,48,49,50,51,52,53,54,55,100,120,140,188, 210, 280, 355
Now I need to reduce those values to e.g. 4 only, but as evenly as possible distributed values.
How to do that?
Thanks,
Jan

Put the numbers on a circle, like a clock. Now construct a logical cross, say at 12, 3, 6, and 9 o’clock. Put the 12 at the first number. Now find what numbers would be nearest to 3, 6, and 9 o’clock, and record the sum of those three numbers’ distances next to the first number.
Iterate by rotating the top of your cross — the 12 o’clock point — clockwise until it exactly lines up with the next number. Again measure how far the nearest numbers are to each of your three other crosspoints, and record that score next to this current 12 o’clock number.
Repeat until you reach your 12 o’clock has rotated all the way to the original 3 o’clock, at which point you’re done. Whichever number has the lowest sum assigned to it determines the winning configuration.
This solution generalizes to any range of values R and any number N of final points you wish to reduce the set to. Each point on the “cross” is R/N away from each other, and you need only rotate until the top of your cross reaches where the next arm was in the original position. So if you wanted 6 points, you would have a 6-pointed cross, each 60 degrees apart instead of a 4-pointed cross each 90 degrees apart. If your range is different, you still do the same sort of operation. That way you don’t need a physical clock and cross to implement this algorithm: it works for any R and N.
I feel bad about this answer from a Perl perspective, as I’ve not managed to include any dollar signs in the solution. :)

Use a clustering algorithm to divide your data into evenly distributed partitions. Then grab a random value from each cluster. The following $datafile looks like this:
1 1
45 45
46 46
...
210 210
280 280
355 355
First column is a tag, second column is data. Running the following with $K = 4:
use strict; use warnings;
use Algorithm::KMeans;
my $datafile = $ARGV[0] or die;
my $K = $ARGV[1] or 0;
my $mask = 'N1';
my $clusterer = Algorithm::KMeans->new(
datafile => $datafile,
mask => $mask,
K => $K,
terminal_output => 0,
);
$clusterer->read_data_from_file();
my ($clusters, $cluster_centers) = $clusterer->kmeans();
my %clusters;
while (#$clusters) {
my $cluster = shift #$clusters;
my $center = shift #$cluster_centers;
$clusters{"#$center"} = $cluster->[int rand( #$cluster - 1)];
}
use YAML; print Dump \%clusters;
returns this:
120: 120
199: 188
317.5: 355
45.9166666666667: 46
First column is the center of the cluster, second is the selected value from that cluster. The centers' distance to one another should be maximized according to the Expectation Maximization algorithm.

Related

How do you find a value between multiple named number ranges in Google Sheets

I have values in a certain column as follows.
Rank Score
A 10
B 24
C 35
D 88
E 192
.
.
.
And so on. There are far too many entries to do an IFS statement and the numbers have an arbitrary difference between levels (A to Z). If I have a number, say 85, as per the info above, it should be rank C (between 35 and 88).
I want to check which rank it falls under. I need a single formula so I can apply it across another sheet with multiple scores that need to be ranked.
use floating VLOOKUP:
=VLOOKUP(D2, {B:B, A:A}, 2, 1)
for arrayformula do:
=ARRAYFORMULA(IFNA(VLOOKUP(D2:D, {B:B, A:A}, 2, 1)))
also see alternatives: https://webapps.stackexchange.com/q/123729/186471

Determine the adjacency of two fibonacci number

I have many fibonacci numbers, if I want to determine whether two fibonacci number are adjacent or not, one basic approach is as follows:
Get the index of the first fibonacci number, say i1
Get the index of the second fibonacci number, say i2
Get the absolute value of i1-i2, that is |i1-i2|
If the value is 1, then return true.
else return false.
In the first step and the second step, it may need many comparisons to get the correct index by using accessing an array.
In the third step, it need one subtraction and one absolute operation.
I want to know whether there exists another approach to quickly to determine the adjacency of the fibonacci numbers.
I don't care whether this question could be solved mathematically or by any hacking techniques.
If anyone have some idea, please let me know. Thanks a lot!
No need to find the index of both number.
Given that the two number belongs to Fibonacci series, if their difference is greater than the min. number among them then those two are not adjacent. Other wise they are.
Because Fibonacci series follows following rule:
F(n) = F(n-1) + F(n-2) where F(n)>F(n-1)>F(n-2).
So F(n) - F(n-1) = F(n-2) ,
=> Diff(n,n-1) < F(n-1) < F(n-k) for k >= 1
Difference between two adjacent fibonaci number will always be less than the min number among them.
NOTE : This will only hold if numbers belong to Fibonacci series.
Simply calculate the difference between them. If it is smaller than the smaller of the 2 numbers they are adjacent, If it is bigger, they are not.
Each triplet in the Fibonacci sequence a, b, c conforms to the rule
c = a + b
So for every pair of adjacent Fibonaccis (x, y), the difference between them (y-x) is equal to the value of the previous Fibonacci, which of course must be less than x.
If 2 Fibonaccis, say (x, z) are not adjacent, then their difference must be greater than the smaller of the two. At minimum, (if they are one Fibonacci apart) the difference would be equal to the Fibonacci between them, (which is of course greater than the smaller of the two numbers).
Since for (a, b, c, d)
since c= a+b
and d = b+c
then d-b = (b+c) - b = c
By Binet's formula, the nth Fibonacci number is approximately sqrt(5)*phi**n, where phi is the golden ration. You can use base phi logarithms to recover the index easily:
from math import log, sqrt
def fibs(n):
nums = [1,1]
for i in range(n-2):
nums.append(sum(nums[-2:]))
return nums
phi = (1 + sqrt(5))/2
def fibIndex(f):
return round((log(sqrt(5)*f,phi)))
To test this:
for f in fibs(20): print(fibIndex(f),f)
Output:
2 1
2 1
3 2
4 3
5 5
6 8
7 13
8 21
9 34
10 55
11 89
12 144
13 233
14 377
15 610
16 987
17 1597
18 2584
19 4181
20 6765
Of course,
def adjacentFibs(f,g):
return abs(fibIndex(f) - fibIndex(g)) == 1
This fails with 1,1 -- but there is little point for explicit testing special logic for such an edge-case. Add it in if you want.
At some stage, floating-point round-off error will become an issue. For that, you would need to replace math.log by an integer log algorithm (e.g. one which involves binary search).
On Edit:
I concentrated on the question of how to recover the index (and I will keep the answer since that is an interesting problem in its own right), but as #LeandroCaniglia points out in their excellent comment, this is overkill if all you want to do is check if two Fibonacci numbers are adjacent, since another consequence of Binet's formula is that sufficiently large adjacent Fibonacci numbers have a ratio which differs from phi by a negligible amount. You could do something like:
def adjFibs(f,g):
f,g = min(f,g), max(f,g)
if g <= 34:
return adjacentFibs(f,g)
else:
return abs(g/f - phi) < 0.01
This assumes that they are indeed Fibonacci numbers. The index-based approach can be used to verify that they are (calculate the index and then use the full-fledged Binet's formula with that index).

Generating random poll numbers

I struggle with this simple problem: I want to create some random poll numbers. I have 4 variables I need to fill with data (actually an array of integer). These numbers should represent a random percentage. All percentages added will be 100% . Sounds simple.
But I think it isn't that easy. My first attempt was to generate a random number between 10 and base (base = 100), and substract the number from the base. Did this 3 times, and the last value was assigned the base. Is there a more elegant way to do that?
My question in a few words:
How can I fill this array with random values, which will be 100 when added together?
int values[4];
You need to write your code to emulate what you are simulating.
So if you have four choices, generate a sample size of random number (0..1 * 4) and then sum all the 0's, 1's, 2's, and 3's (remember 4 won't be picked). Then divide the counts by the sample size.
for (each sample) {
poll = random(choices);
survey[poll] += 1;
}
It's easy to use a computer to simulate things, simple simulations are very fast.
Keep in mind that you are working with integers, and integers don't divide nicely without converting them to floats or doubles. If you are missing a few percentage points, odds are it has to do with your integers dividing with remainders.
What you have here is a problem of partitioning the number 100 into 4 random integers. This is called partitioning in number theory.
This problem has been addressed here.
The solution presented there does essentially the following:
If computes, how many partitions of an integer n there are in O(n^2) time. This produces a table of size O(n^2) which can then be used to generate the kth partition of n, for any integer k, in O(n) time.
In your case, n = 100, and k = 4.
Generate x1 in range <0..1>, subtract it from 1, then generate x2 in range <0..1-x1> and so on. Last value should not be randomed, but in your case equal 1-x1-x2-x3.
I don't think this is a whole lot prettier than what it sounds like you've already done, but it does work. (The only advantage is it's scalable if you want more than 4 elements).
Make sure you #include <stdlib.h>
int prev_sum = 0, j = 0;
for(j = 0; j < 3; ++j)
{
values[j] = rand() % (100-prev_sum);
prev_sum += values[j];
}
values[3] = 100 - prev_sum;
It takes some work to get a truly unbiased solution to the "random partition" problem. But it's first necessary to understand what "unbiased" means in this context.
One line of reasoning is based on the intuition of a random coin toss. An unbiased coin will come up heads as often as it comes up tails, so we might think that we could produce an unbiased partition of 100 tosses into two parts (head-count and tail-count) by tossing the unbiased coin 100 times and counting. That's the essence of Edwin Buck's proposal, modified to produce a four-partition instead of a two-partition.
However, what we'll find is that many partitions never show up. There are 101 two-partitions of 100 -- {0, 100}, {1, 99} … {100, 0} but the coin sampling solution finds less than half of them in 10,000 tries. As might be expected, the partition {50, 50} is the most common (7.8%), while all of the partitions from {0, 100} to {39, 61} in total achieved less than 1.7% (and, in the trial I did, the partitions from {0, 100} to {31, 69} didn't show up at all.) [Note 1]
So that doesn't seem like a unbiased sample of possible partitions. An unbiased sample of partitions would return every partition with equal probability.
So another temptation would be to select the size of the first part of the partition from all the possible sizes, and then the size of the second part from whatever is left, and so on until we've reached one less than the size of the partition at which point anything left is in the last part. However, this will turn out to be biased as well, because the first part is much more likely to be large than any other part.
Finally, we could enumerate all the possible partitions, and then choose one of them at random. That will obviously be unbiased, but unfortunately there are a lot of possible partitions. For the case of 4-partitions of 100, for example, there are 176,581 possibilities. Perhaps that is feasible in this case, but it doesn't seem like it will lead to a general solution.
For a better algorithm, we can start with the observation that a partition
{p1, p2, p3, p4}
could be rewritten without bias as a cumulative distribution function (CDF):
{p1, p1+p2, p1+p2+p3, p1+p2+p3+p4}
where the last term is just the desired sum, in this case 100.
That is still a collection of four integers in the range [0, 100]; however, it is guaranteed to be in increasing order.
It's not easy to generate a random sorted sequence of four numbers ending in 100, but it is trivial to generate three random integers no greater than 100, sort them, and then find adjacent differences. And that leads to an almost unbiased solution, which is probably close enough for most practical purposes, particularly since the implementation is almost trivial:
(Python)
def random_partition(n, k):
d = sorted(randrange(n+1) for i in range(k-1))
return [b - a for a, b in zip([0] + d, d + [n])]
Unfortunately, this is still biased because of the sort. The unsorted list is selected without bias from the universe of possible lists, but the sortation step is not a simple one-to-one match: lists with repeated elements have fewer permutations than lists without repeated elements, so the probability of a particular sorted list without repeats is much higher than the probability of a sorted list with repeats.
As n grows large with respect to k, the number of lists with repeats declines rapidly. (These correspond to final partitions in which one or more of the parts is 0.) In the asymptote, where we are selecting from a continuum and collisions have probability 0, the algorithm is unbiased. Even in the case of n=100, k=4, the bias is probably ignorable for many practical applications. Increasing n to 1000 or 10000 (and then scaling the resulting random partition) would reduce the bias.
There are fast algorithms which can produce unbiased integer partitions, but they are typically either hard to understand or slow. The slow one, which takes time(n), is similar to reservoir sampling; for a faster algorithm, see the work of Jeffrey Vitter.
Notes
Here's the quick-and-dirty Python + shell test:
$ python -c '
from random import randrange
n = 2
for i in range(10000):
d = n * [0]
for j in range(100):
d[randrange(n)] += 1
print(' '.join(str(f) for f in d))
' | sort -n | uniq -c
1 32 68
2 34 66
5 35 65
15 36 64
45 37 63
40 38 62
66 39 61
110 40 60
154 41 59
219 42 58
309 43 57
385 44 56
462 45 55
610 46 54
648 47 53
717 48 52
749 49 51
779 50 50
788 51 49
723 52 48
695 53 47
591 54 46
498 55 45
366 56 44
318 57 43
234 58 42
174 59 41
118 60 40
66 61 39
45 62 38
22 63 37
21 64 36
15 65 35
2 66 34
4 67 33
2 68 32
1 70 30
1 71 29
You can brute force it by, creating a calculation function that adds up the numbers in your array. If they do not equal 100 then regenerate the random values in array, do calculation again.

Get average of two consecutive values in a vector depending on logical vector

I am reading data from a file and I am trying to do some manipulation on the vector containing the data basically i want to check if the values come from consecutive lines and if so i want to average each two and put the value in a output vector
part of the data and lines
lines=[153 152 153 154 233 233 234 235 280 279 280 281];
Sail=[ 3 4 3 1.5 3 3 1 2 2.5 5 2.5 2 ];
here is what i am doing
Sail=S(lines);
Y=diff(lines)==1;
for ii=1:length(Y)
if Y(ii)
output(ceil(ii/2))=(Sail(ii)+Sail(ii+1))/2;
end
end
is this correct also is there a way to do that without a for loop
Thanks
My suggestion:
y = find(diff(lines)==1);
output = mean([Sail(y);Sail(y+1)]);
This assumes that when you have, say [233 234 235], you want one value averaging the values from lines [233 234] and one value averaging those from [234 245]. If you wanted to do something more complex when longer sets of consecutive lines exist in your data, then the problem becomes more complex.
Incidentally it's a bad idea to do something like (ceil(ii/2)) - you can't guarantee a unique index for each matching value of ii. If you did want an output the same size as Sail (will have zeros in non-matching areas) then you can do something like this:
output2 = zeros(size(Sail));
output2(y)=output;

Find all possible row-wise sums in a 2D array

Ideally I'm looking for a c# solution, but any help on the algorithm will do.
I have a 2-dimension array (x,y). The max columns (max x) varies between 2 and 10 but can be determined before the array is actually populated. Max rows (y) is fixed at 5, but each column can have a varying number of values, something like:
1 2 3 4 5 6 7...10
A 1 1 7 9 1 1
B 2 2 5 2 2
C 3 3
D 4
E 5
I need to come up with the total of all possible row-wise sums for the purpose of looking for a specific total. That is, a row-wise total could be the cells A1 + B2 + A3 + B5 + D6 + A7 (any combination of one value from each column).
This process will be repeated several hundred times with different cell values each time, so I'm looking for a somewhat elegant solution (better than what I've been able to come with). Thanks for your help.
The Problem Size
Let's first consider the worst case:
You have 10 columns and 5 (full) rows per column. It should be clear that you will be able to get (with the appropriate number population for each place) up to 5^10 ≅ 10^6 different results (solution space).
For example, the following matrix will give you the worst case for 3 columns:
| 1 10 100 |
| 2 20 200 |
| 3 30 300 |
| 4 40 400 |
| 5 50 500 |
resulting in 5^3=125 different results. Each result is in the form {a1 a2 a3} with ai ∈ {1,5}
It's quite easy to show that such a matrix will always exist for any number n of columns.
Now, to get each numerical result, you will need to do n-1 sums, adding up to a problem size of O(n 5^n). So, that's the worst case and I think nothing can be done about it, because to know the possible results you NEED to effectively perform the sums.
More benign incarnations:
The problem complexity may be cut off in two ways:
Less numbers (i.e. not all columns are full)
Repeated results (i.e. several partial sums give the same result, and you can join them in one thread). Much more in this later.
Let's see a simplified example of the later with two rows:
| 7 6 100 |
| 3 4 200 |
| 1 2 200 |
at first sight you will need to do 2 3^3 sums. But that's not the real case. As you add up the first column you don't get the expected 9 different results, but only 6 ({13,11,9,7,5,3}).
So you don't have to carry your nine results up to the third column, but only 6.
Of course, that is on the expense of deleting the repeating numbers from the list. The "Removal of Repeated Integer Elements" was posted before in SO and I'll not repeat the discussion here, but just cite that doing a mergesort O(m log m) in the list size (m) will remove the duplicates. If you want something easier, a double loop O(m^2) will do.
Anyway, I'll not try to calculate the size of the (mean) problem in this way for several reasons. One of them is that the "m" in the sort merge is not the size of the problem, but the size of the vector of results after adding up any two columns, and that operation is repeated (n-1) times ... and I really don't want to do the math :(.
The other reason is that as I implemented the algorithm, we will be able to use some experimental results and save us from my surely leaking theoretical considerations.
The Algorithm
With what we said before, it is clear that we should optimize for the benign cases, as the worst case is a lost one.
For doing so, we need to use lists (or variable dim vectors, or whatever can emulate those) for the columns and do a merge after every column add.
The merge may be replaced by several other algorithms (such as an insertion on a BTree) without modifying the results.
So the algorithm (procedural pseudocode) is something like:
Set result_vector to Column 1
For column i in (2 to n-1)
Remove repeated integers in the result_vector
Add every element of result_vector to every element of column i+1
giving a new result vector
Next column
Remove repeated integers in the result_vector
Or as you asked for it, a recursive version may work as follows:
function genResVector(a:list, b:list): returns list
local c:list
{
Set c = CartesianProduct (a x b)
Set c = Sum up each element {a[i],b[j]} of c </code>
Drop repeated elements of c
Return(c)
}
function ResursiveAdd(a:matrix, i integer): returns list
{
genResVector[Column i from a, RecursiveAdd[a, i-1]];
}
function ResursiveAdd(a:matrix, i==0 integer): returns list={0}
Algorithm Implementation (Recursive)
I choose a functional language, I guess it's no big deal to translate to any procedural one.
Our program has two functions:
genResVector, which sums two lists giving all possible results with repeated elements removed, and
recursiveAdd, which recurses on the matrix columns adding up all of them.
recursiveAdd, which recurses on the matrix columns adding up all of them.
The code is:
genResVector[x__, y__] := (* Header: A function that takes two lists as input *)
Union[ (* remove duplicates from resulting list *)
Apply (* distribute the following function on the lists *)
[Plus, (* "Add" is the function to be distributed *)
Tuples[{x, y}],2] (*generate all combinations of the two lists *)];
recursiveAdd[t_, i_] := genResVector[t[[i]], recursiveAdd[t, i - 1]];
(* Recursive add function *)
recursiveAdd[t_, 0] := {0}; (* With its stop pit *)
Test
If we take your example list
| 1 1 7 9 1 1 |
| 2 2 5 2 2 |
| 3 3 |
| 4 |
| 5 |
And run the program the result is:
{11, 12, 13, 14, 15, 16, 17, 18, 19, 20, 21, 22, 23, 24, 25, 26, 27}
The maximum and minimum are very easy to verify since they correspond to taking the Min or Max from each column.
Some interesting results
Let's consider what happens when the numbers on each position of the matrix is bounded. For that we will take a full (10 x 5 ) matrix and populate it with Random Integers.
In the extreme case where the integers are only zeros or ones, we may expect two things:
A very small result set
Fast execution, since there will be a lot of duplicate intermediate results
If we increase the Range of our Random Integers we may expect increasing result sets and execution times.
Experiment 1: 5x10 matrix populated with varying range random integers
It's clear enough that for a result set near the maximum result set size (5^10 ≅ 10^6 ) the Calculation time and the "Number of != results" have an asymptote. The fact that we see increasing functions just denote that we are still far from that point.
Morale: The smaller your elements are, the better chances you have to get it fast. This is because you are likely to have a lot of repetitions!
Note that our MAX calculation time is near 20 secs for the worst case tested
Experiment 2: Optimizations that aren't
Having a lot of memory available, we can calculate by brute force, not removing the repeated results.
The result is interesting ... 10.6 secs! ... Wait! What happened ? Our little "remove repeated integers" trick is eating up a lot of time, and when there are not a lot of results to remove there is no gain, but looses in trying to get rid of the repetitions.
But we may get a lot of benefits from the optimization when the Max numbers in the matrix are well under 5 10^5. Remember that I'm doing these tests with the 5x10 matrix fully loaded.
The Morale of this experiment is: The repeated integer removal algorithm is critical.
HTH!
PS: I have a few more experiments to post, if I get the time to edit them.

Resources