Closeness score to a target array of numbers - arrays

It's hard to know what terms to search for on stackoverflow for this problem. Say you have a target array of numbers like [100, 250, 400, 60]
I want to be able to score the closeness other arrays have to this target based on a threshold / error bars of say 10. So for example, the array:
[90, 240, 390, 50] would get a high score (or positive match result) because of the error bars.
The order matters, so
[60, 400, 250, 100] would get zero score (or negative match result)
The arrays can be different sizes so
[33, 77, 300, 110, 260, 410, 60, 99, 23] would get a good score (or positive match result).
A good way to think about the problem is to imagine these numbers are frequencies of musical notes like C,G,E,F and I'm trying to match a sequence of notes against a target.
Searching stackoverflow, I'm not sure if this post will work, but it's close:
Compare difference between multiple numbers
Update 17th Jan 2015:
I failed to mention a scenario that might affect current answers. If the array has noise between those target numbers, I still want to find a positive match. For example [33, 77, 300, 110, 260, 300, 410, 40, 60, 99, 23].

I believe what you're looking for is sequence similarity.
You can read about them on this wikipedia page. Your case seems to fit the local alignment category. There are several algorithms you can choose from:
Smith–Waterman algorithm (the usual choice for local alignment)
Needleman–Wunsch algorithm
Levenshtein distance
However, since these algorithms compare strings, you have to design your own scoring rule when inserting, deleting or comparing numbers.
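As a minimal Python sketch of one such scoring rule (the function name and the ±10 tolerance are illustrative assumptions, not anything from a library): a plain Levenshtein distance in which two numbers count as a match when they fall within the error bars:

```python
def tolerant_levenshtein(target, seq, tol=10):
    # Classic Levenshtein DP over numbers; two values "match" when they
    # differ by at most `tol` (the error bars) instead of being equal.
    m, n = len(target), len(seq)
    d = [[0] * (n + 1) for _ in range(m + 1)]
    for i in range(m + 1):
        d[i][0] = i                      # deletions down to an empty sequence
    for j in range(n + 1):
        d[0][j] = j                      # insertions up from an empty target
    for i in range(1, m + 1):
        for j in range(1, n + 1):
            cost = 0 if abs(target[i - 1] - seq[j - 1]) <= tol else 1
            d[i][j] = min(d[i - 1][j] + 1,          # deletion
                          d[i][j - 1] + 1,          # insertion
                          d[i - 1][j - 1] + cost)   # match / substitution
    return d[m][n]

print(tolerant_levenshtein([100, 250, 400, 60], [90, 240, 390, 50]))   # 0: within error bars
print(tolerant_levenshtein([100, 250, 400, 60], [60, 400, 250, 100]))  # > 0: order matters
```

A distance of 0 means a perfect match within tolerance; insertions (noise values between the target numbers) cost 1 each, so you can turn the distance into a score or accept anything below a cutoff.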

Sounds like what you're looking for is the RMS error, where RMS is the square Root of the Mean Squared error. Let me illustrate by example. Assume the target array is [100, 250, 400, 60] and the array to be scored is [104, 240, 410, 55]
First compute the difference values, i.e. the errors:
 100  250  400   60
-104 -240 -410  -55
---- ---- ---- ----
  -4   10  -10    5
Then square the errors to get 16 100 100 25. Compute the mean of the squared errors
(16 + 100 + 100 + 25) / 4 = 60.25
And finally, take the square root sqrt(60.25) = 7.76
When the arrays are different sizes, you can speed things up by only computing the RMS error when the first value is within a certain threshold, say ±30. Using the example [33, 77, 300, 110, 260, 410, 60, 99, 23], there would only be two alignments to check, because with the other alignments the first number is more than 30 away from 100:
 33   77  300  110  260  410   60   99   23
     100  250  400   60                      --> RMS score = 178
               100  250  400   60            --> RMS score = 8.7
Low score wins!
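As a sketch, the sliding RMS comparison described above might look like this in Python (the names `rms_error`/`best_alignment` and the ±30 first-value threshold are just the illustrative choices from this answer):

```python
import math

def rms_error(target, window):
    # Square root of the mean squared error between two equal-length lists
    return math.sqrt(sum((t - w) ** 2 for t, w in zip(target, window)) / len(target))

def best_alignment(target, candidate, first_threshold=30):
    # Slide the target over the candidate; only score alignments whose first
    # element is within the threshold of target[0], and keep the lowest score
    best = None
    for i in range(len(candidate) - len(target) + 1):
        if abs(candidate[i] - target[0]) <= first_threshold:
            score = rms_error(target, candidate[i:i + len(target)])
            if best is None or score < best:
                best = score
    return best

score = best_alignment([100, 250, 400, 60], [33, 77, 300, 110, 260, 410, 60, 99, 23])
print(round(score, 1))  # 8.7 -- the alignment starting at 110
```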

Related

Identifying Selection Sort vs Insertion Sort

I've read multiple articles on how Selection Sort and Insertion sort work, and believe I understand their implementations. Selection sort iterates over the unsorted numbers in the inner loop, whereas insertion sort iterates over the sorted numbers in the inner loop. From what I understand, that's basically the only difference.
My question lies in the scenario where you're posed an input array, let's say it's this one:
Input Array: 30, 70, 40, 60, 50
Now, you're given a further list where the iterations are shown:
30, 70, 40, 60, 50
30, 40, 70, 60, 50
30, 40, 50, 60, 70
30, 40, 50, 60, 70
How is one meant to identify whether Insertion Sort or Selection sort has been used based PURELY on this? There is no code given, nor are we required to write any code. We are only required to choose which algorithm has been used from a multiple choice list. (Yes, both appear in the list).
To be clear, this is not an assignment question. However, this is assisting me with revision for an exam.
Think about what happens in each of the algorithms: selection sort always selects the minimum of the unsorted elements and adds it to the end of the sorted elements; insertion sort always takes the first of the unsorted elements and inserts it in the correct place in the sorted list.
Selection sort:
Sorted | Unsorted
| 30 70 40 60 50
30 | 70 40 60 50 # selects 30, the minimum unsorted element
30 40 | 70 60 50 # selects 40
30 40 50 | 70 60 # selects 50
30 40 50 60 | 70 # selects 60
30 40 50 60 70 | # selects 70
Insertion sort:
Sorted | Unsorted
| 30 70 40 60 50
30 | 70 40 60 50 # inserts 30, the first unsorted element
30 70 | 40 60 50 # inserts 70
30 40 70 | 60 50 # inserts 40
30 40 60 70 | 50 # inserts 60
30 40 50 60 70 | # inserts 50
The arrays listed in each iteration would be the concatenation of the sorted and unsorted portions of the array. It looks like these iterations show neither selection sort nor insertion sort.
After speaking with the lecturer via email, I have a solution to this question. This is indeed a Selection Sort, with the elements therefore being swapped in place. (See https://en.wikipedia.org/wiki/Selection_sort).
Now, for the explanation:
Selection Sort:
Input Array: 30, 70, 40, 60, 50
Sorting:
30, 70, 40, 60, 50 // 30 is already sorted.
30, 40, 70, 60, 50 // Swap 40 and 70.
30, 40, 50, 60, 70 // Swap 70 and 50.
30, 40, 50, 60, 70 // Array is sorted.
Here's what it looks like for an insertion sort:
Input Array: 30, 70, 40, 60, 50
Sorting:
30, 70, 40, 60, 50 // 30 is inserted.
30, 70, 40, 60, 50 // 70 is inserted.
30, 40, 70, 60, 50 // 40 is inserted.
30, 40, 60, 70, 50 // 60 is inserted.
30, 40, 50, 60, 70 // 50 is inserted.
Array is now sorted.
I hope this helps anybody else that may come across a similar problem in the future while undertaking an algorithms course at college or university.
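The in-place selection sort walked through above can be checked with a short Python sketch (the helper name is just illustrative): recording the array after each outer pass reproduces exactly the four iterations given in the question.

```python
def selection_sort_trace(a):
    # In-place selection sort; record the array after each outer pass
    a = list(a)
    trace = []
    for i in range(len(a) - 1):
        m = min(range(i, len(a)), key=a.__getitem__)  # index of min of unsorted suffix
        a[i], a[m] = a[m], a[i]                       # swap it into place
        trace.append(list(a))
    return trace

for step in selection_sort_trace([30, 70, 40, 60, 50]):
    print(step)
# [30, 70, 40, 60, 50]
# [30, 40, 70, 60, 50]
# [30, 40, 50, 60, 70]
# [30, 40, 50, 60, 70]
```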

Understanding input and labels in word2vec (TensorFlow)

I am trying to properly understand the batch_input and batch_labels from the tensorflow "Vector Representations of Words" tutorial.
For instance, my data
1 1 1 1 1 1 1 1 5 251 371 371 1685 ...
... starts with
skip_window = 2 # How many words to consider left and right.
num_skips = 1 # How many times to reuse an input to generate a label.
Then the generated input array is:
batch_input = 1 1 1 1 1 1 5 251 371 ....
This makes sense: it starts after the first 2 words (= window size) and then continues. The labels:
batch_labels = 1 1 1 1 1 1 251 1 1685 371 589 ...
I don't understand these labels very well. There are supposed to be 4 labels for each input, right (window size 2, on each side)? But the batch_labels variable is the same length.
From the tensorflow tutorial:
The skip-gram model takes two inputs. One is a batch full of integers
representing the source context words, the other is for the target
words.
As per the tutorial, I have declared the two variables as:
batch = np.ndarray(shape=(batch_size), dtype=np.int32)
labels = np.ndarray(shape=(batch_size, 1), dtype=np.int32)
How should I interpret the batch_labels?
There are supposed to be 4 labels for each input, right (window size 2, on each side)? But the batch_labels variable is the same length.
The key setting is num_skips = 1. This value defines the number of (input, label) tuples each word generates. See the examples with different num_skips below (my data sequence seems to be different from yours, sorry about that).
Example #1 - num_skips=4
batch, labels = generate_batch(batch_size=8, num_skips=4, skip_window=2)
It generates 4 labels for each word, i.e. it uses the whole context; since batch_size=8, only 2 words are processed in this batch (12 and 6), and the rest will go into the next batch:
data = [5239, 3084, 12, 6, 195, 2, 3137, 46, 59, 156, 128, 742, 477, 10572, ...]
batch = [12 12 12 12 6 6 6 6]
labels = [[6 3084 5239 195 195 3084 12 2]]
Example #2 - num_skips=2
batch, labels = generate_batch(batch_size=8, num_skips=2, skip_window=2)
Here you would expect each word to appear twice in the batch sequence; the 2 labels are randomly sampled from the 4 possible context words:
data = [5239, 3084, 12, 6, 195, 2, 3137, 46, 59, 156, 128, 742, 477, 10572, ...]
batch = [ 12 12 6 6 195 195 2 2]
labels = [[ 195 3084 12 195 3137 12 46 195]]
Example #3 - num_skips=1
batch, labels = generate_batch(batch_size=8, num_skips=1, skip_window=2)
Finally, this setting, the same as yours, produces exactly one label per word; each label is drawn randomly from the 4-word context:
data = [5239, 3084, 12, 6, 195, 2, 3137, 46, 59, 156, 128, 742, 477, 10572, ...]
batch = [ 12 6 195 2 3137 46 59 156]
labels = [[ 6 12 12 195 59 156 46 46]]
How should I interpret the batch_labels?
Each label is a context word to be predicted from the given center word. But the generator may not emit all (center, context) tuples, depending on its settings.
Also note that the train_labels tensor is effectively 1-dimensional (shape (batch_size, 1)). Skip-gram trains the model to predict any single context word from the given center word, not all 4 context words at once. This is why all the training pairs (12, 6), (12, 3084), (12, 5239) and (12, 195) are valid.
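To make the sampling behaviour concrete, here is a simplified sketch modeled on the tutorial's generate_batch (not the exact tutorial code; this version takes the data explicitly and skips the global index bookkeeping):

```python
import collections
import random

def generate_batch(data, batch_size, num_skips, skip_window):
    # Each center word yields num_skips (center, context) pairs; the context
    # words are sampled without replacement from the 2*skip_window window.
    assert batch_size % num_skips == 0
    assert num_skips <= 2 * skip_window
    span = 2 * skip_window + 1        # [ skip_window left | center | skip_window right ]
    buffer = collections.deque(data[:span], maxlen=span)
    index = span
    batch, labels = [], []
    while len(batch) < batch_size:
        contexts = [i for i in range(span) if i != skip_window]
        for c in random.sample(contexts, num_skips):
            batch.append(buffer[skip_window])   # center word is the input
            labels.append(buffer[c])            # one context word is the label
        buffer.append(data[index % len(data)])  # slide the window one word
        index += 1
    return batch, labels

data = [5239, 3084, 12, 6, 195, 2, 3137, 46, 59, 156]
batch, labels = generate_batch(data, batch_size=8, num_skips=4, skip_window=2)
print(batch)   # [12, 12, 12, 12, 6, 6, 6, 6]
```

With num_skips=4 the labels for each center word are its whole context in random order; with num_skips=1 only one context word per center survives, which is exactly the behaviour in the question.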

How are 3-state cellular automata rules generated?

Let's limit the neighborhood to n=1 (which means we always need 3 cells to evaluate the next-gen cell).
Here's an example of a 2-state rule. Note that the upper row of the rules is generated in a particular order, whereas the lower row is the bit representation of the number 30.
I cannot find a single visualization of the equivalent for a 3-state CA. Following the logic of the 2-state CA, it should contain 27 possible neighborhoods, but I have no clue in which order they should be generated. The lower row should be 30 in ternary (with leading zeroes to occupy a total of 27 positions).
Is there a general algorithm for generating these permutations in the conventional order of CAs (regardless of the number of states)?
Thank you very much in advance and sorry if the question is stupid. :(
What you are using is called the Wolfram code (after Stephen Wolfram), which is used for elementary CAs.
If you use more states or bigger neighborhoods, it extends naturally.
Your question is not stupid.
For three states, this gives you ternary numbers. First write all the three-digit numbers in ternary, in descending order:
222, 221, 220, 212, 211, 210, 202, 201, 200, 122, 121, 120, 112, 111, 110, 102, 101, 100, 022, 021, 020, 012, 011, 010, 002, 001, 000
There are 27 of them (3^3), and 222_3 = 26, 221_3 = 25, ..., 001_3 = 1, 000_3 = 0.
Now decompose 30 into a 27-digit base-3 number: 30 = 1*3^3 + 1*3^1, so only two digits equal 1, the fourth and the second (from the right). Here is rule 30 for the radius-1, 3-state CA:
000000000000000000000001010
This CA behaves very differently from the radius-1, 2-state rule 30.
Here is rule 33 for radius-1, 3 states (33 = 1*3^3 + 2*3^1):
000000000000000000000001020
So for n states and radius r, enumerate in descending order all (2r+1)-digit numbers in base n and associate with each of them a value in [0, n).
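A Python sketch of that procedure (the name rule_table is just illustrative): decompose the rule number in base `states`, so that digit i is the next state for the neighborhood whose base-`states` value is i, then print the digits in the conventional descending neighborhood order:

```python
def rule_table(rule, states=3, radius=1):
    # Wolfram-code convention: digit i of `rule` in base `states` is the
    # next state for the neighborhood whose base-`states` value is i.
    size = states ** (2 * radius + 1)   # number of possible neighborhoods
    digits = []
    for _ in range(size):
        digits.append(rule % states)
        rule //= states
    return digits

# Output row in descending neighborhood order (222, 221, ..., 001, 000)
table = rule_table(30, states=3, radius=1)
row = ''.join(str(table[i]) for i in range(len(table) - 1, -1, -1))
print(row)  # 000000000000000000000001010
```

The same call with rule=33 yields the ...1020 row shown above, and changing `states` or `radius` generalizes to any n and r.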

separate chaining vs linear probing

a set of objects with keys: 12, 44, 13, 88, 23, 94, 11, 39, 20, 16, 5
Write the hash table where M=N=11 and collisions are handled using separate chaining.
h(x) = | 2x + 5 | mod M
So I did it with linear probing and got
index: 0   1   2   3   4   5   6   7   8   9   10
value: 11  39  20  5   16  44  88  12  23  13  94
which I am pretty sure is right, but how do you do it with separate chaining? I realize separate chaining uses linked lists, but what would the hash table look like?
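Not a full answer, but as an illustrative Python sketch (same hash function, keys appended in insertion order): with separate chaining each of the M slots holds its own list, so colliding keys simply share a slot instead of probing for the next free one:

```python
def build_chained_table(keys, M=11):
    # Separate chaining: each slot holds the chain (a list) of the keys
    # that hash to it, in insertion order
    table = [[] for _ in range(M)]
    for x in keys:
        table[abs(2 * x + 5) % M].append(x)   # h(x) = |2x + 5| mod M
    return table

keys = [12, 44, 13, 88, 23, 94, 11, 39, 20, 16, 5]
for i, chain in enumerate(build_chained_table(keys)):
    print(i, chain)
# 0 []
# 1 [20]
# 2 []
# 3 []
# 4 [16, 5]
# 5 [44, 88, 11]
# 6 [94, 39]
# 7 [12, 23]
# 8 []
# 9 [13]
# 10 []
```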

How to get an evenly distributed sample from Perl array values?

I have an array containing many values between 0 and 360 (like degrees in a circle), but unevenly distributed:
1, 45, 46, 47, 48, 49, 50, 51, 52, 53, 54, 55, 100, 120, 140, 188, 210, 280, 355
Now I need to reduce those values to e.g. 4 only, but as evenly as possible distributed values.
How to do that?
Thanks,
Jan
Put the numbers on a circle, like a clock. Now construct a logical cross, say at 12, 3, 6, and 9 o’clock. Put the 12 at the first number. Now find what numbers would be nearest to 3, 6, and 9 o’clock, and record the sum of those three numbers’ distances next to the first number.
Iterate by rotating the top of your cross — the 12 o’clock point — clockwise until it exactly lines up with the next number. Again measure how far the nearest numbers are to each of your three other crosspoints, and record that score next to this current 12 o’clock number.
Repeat until your 12 o'clock has rotated all the way to the original 3 o'clock position, at which point you're done. Whichever number has the lowest sum recorded next to it determines the winning configuration.
This solution generalizes to any range of values R and any number N of final points you wish to reduce the set to. Each point on the “cross” is R/N away from each other, and you need only rotate until the top of your cross reaches where the next arm was in the original position. So if you wanted 6 points, you would have a 6-pointed cross, each 60 degrees apart instead of a 4-pointed cross each 90 degrees apart. If your range is different, you still do the same sort of operation. That way you don’t need a physical clock and cross to implement this algorithm: it works for any R and N.
I feel bad about this answer from a Perl perspective, as I’ve not managed to include any dollar signs in the solution. :)
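A rough Python sketch of the rotating-cross idea (the function names are illustrative, and ties between equally good anchors go to whichever value is tried first):

```python
def circular_dist(a, b, R=360):
    # Shortest distance between two points on a circle of circumference R
    d = abs(a - b) % R
    return min(d, R - d)

def pick_evenly_spaced(values, n=4, R=360):
    # Anchor the first crosspoint at each value in turn; an anchor's score is
    # the total distance from each of the n crosspoints to its nearest value
    best_score, best_pick = None, None
    for anchor in values:
        points = [(anchor + k * R / n) % R for k in range(n)]
        chosen = [min(values, key=lambda v: circular_dist(v, p)) for p in points]
        score = sum(circular_dist(c, p) for c, p in zip(chosen, points))
        if best_score is None or score < best_score:
            best_score, best_pick = score, chosen
    return best_pick

values = [1, 45, 46, 47, 48, 49, 50, 51, 52, 53, 54, 55,
          100, 120, 140, 188, 210, 280, 355]
print(sorted(pick_evenly_spaced(values)))  # [1, 100, 188, 280]
```

On the sample data the winning cross sits on 100, 190, 280 and 10 degrees, picking 100, 188, 280 and 1 as the four roughly evenly spaced values.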
Use a clustering algorithm to divide your data into evenly distributed partitions. Then grab a random value from each cluster. The following $datafile looks like this:
1 1
45 45
46 46
...
210 210
280 280
355 355
First column is a tag, second column is data. Running the following with $K = 4:
use strict;
use warnings;
use Algorithm::KMeans;

my $datafile = $ARGV[0] or die;
my $K        = $ARGV[1] || 0;
my $mask     = 'N1';

my $clusterer = Algorithm::KMeans->new(
    datafile        => $datafile,
    mask            => $mask,
    K               => $K,
    terminal_output => 0,
);
$clusterer->read_data_from_file();
my ($clusters, $cluster_centers) = $clusterer->kmeans();

my %clusters;
while (@$clusters) {
    my $cluster = shift @$clusters;
    my $center  = shift @$cluster_centers;
    # pick a random member of this cluster, keyed by the cluster center
    $clusters{"@$center"} = $cluster->[ int rand @$cluster ];
}
use YAML; print Dump \%clusters;
returns this:
120: 120
199: 188
317.5: 355
45.9166666666667: 46
First column is the center of the cluster, second is the selected value from that cluster. The centers' distances to one another should be maximized by the expectation-maximization algorithm.
