Creating an array of arrays in Perl and deleting from the array
I'm writing this to avoid O(n!) time complexity, but I only have pseudocode right now because there are some things I'm unsure how to implement.
This is the format of the file that I want to pass into this script. The data is sorted by the third column -- the start position.
93 Blue19 1 82
87 Green9 1 7912
76 Blue7 2 20690
65 Red4 2 170
...
...
256 Orange50 17515 66740
166 Teal68 72503 123150
228 Green89 72510 114530
Explanation of the code:
I want to create an array of arrays to find when two pieces of information have overlapping lengths.
Columns 3 and 4 of the input file are start and stop positions on a single track line. If any row x has a start position in column 3 that is smaller than the stop position in column 4 of some row y, then x starts before y ends and there is some overlap.
I want to find every row that overlaps with any other row without having to compare every row to every row. Because the rows are sorted, I simply add a string to the inner array that represents one row.
If the new row being looked at does not overlap with one of the rows already in the array, then (because the array is sorted by the third column) no later row will be able to overlap with that stored row either, and it can be removed.
This is the idea I have so far:
#!/usr/bin/perl
use strict;
use warnings;

my @array;    # each element: [ $id, $name, $begin, $end, overlap strings... ]
while (<>) {
    my ($id, $name, $begin, $end) = split;
    my @keep;
    for my $inner (@array) {    # loop through the rows kept so far to see
                                # whether they overlap with the current line
        if ( $begin > $inner->[3] ) {
            # no overlap: because the input is sorted, nothing later can
            # overlap this row either, so print its strings and drop it
            print "$_\n" for @{$inner}[4 .. $#$inner];
        }
        else {
            # there is overlap, so add a statement to the inner array
            # explaining the overlap
            push @$inner, "$id overlap with $inner->[0]\t"
                . "$inner->[0]: $inner->[2], $inner->[3]\t$id: $begin, $end";
            push @keep, $inner;
        }
    }
    @array = ( @keep, [ $id, $name, $begin, $end ] );
}
print "$_\n" for map { @{$_}[4 .. $#$_] } @array;    # flush remaining rows
The code should produce something like
87 overlap with 93 93: 1 82 87: 1 7912
76 overlap with 93 93: 1 82 76: 1 20690
65 overlap with 93 93: 1 82 65: 2 170
76 overlap with 87 87: 1 7912 76: 2 20690
65 overlap with 87 87: 1 7912 65: 2 170
65 overlap with 76 76: 2 20690 65: 2 170
256 overlap with 76 76: 2 20690 256: 17515 66740
228 overlap with 166 166: 72503 123150 228: 72510 114530
This was tricky to explain, so ask me if you have any questions.
I am using the posted input and output files as a guide on what is required.
A note on complexity: in principle, each line has to be compared to all following lines. The number of operations actually carried out depends on the data. Since the data is stated to be sorted on the field being compared, the inner loop can be cut short as soon as overlapping stops. A comment on the complexity estimate is at the end.
This compares each line to the ones following it. For that, all lines are first read into an array. If the data set is very large, this should be changed to read line by line and the procedure turned around, comparing the currently read line to all previous ones. This is a very basic approach. It may well be better to build auxiliary data structures first, possibly making use of suitable libraries.
use warnings;
use strict;

my $file = 'data_overlap.txt';

my @lines = do {
    open my $fh, '<', $file or die "Can't open $file -- $!";
    <$fh>;
};

# For each element compare all following ones, but cut out
# as soon as there's no overlap since data is sorted
for my $i (0..$#lines)
{
    my @ref_fields = split '\s+', $lines[$i];
    for my $j ($i+1..$#lines)
    {
        my @curr_fields = split '\s+', $lines[$j];
        if ( $ref_fields[-1] > $curr_fields[-2] ) {
            print "$curr_fields[0] overlap with $ref_fields[0]\t" .
                  "$ref_fields[0]: $ref_fields[-2] $ref_fields[-1]\t" .
                  "$curr_fields[0]: $curr_fields[-2] $curr_fields[-1]\n";
        }
        else { print "\tNo overlap, move on.\n"; last }
    }
}
With the input in file 'data_overlap.txt' this prints
87 overlap with 93 93: 1 82 87: 1 7912
76 overlap with 93 93: 1 82 76: 2 20690
65 overlap with 93 93: 1 82 65: 2 170
No overlap, move on.
76 overlap with 87 87: 1 7912 76: 2 20690
65 overlap with 87 87: 1 7912 65: 2 170
No overlap, move on.
65 overlap with 76 76: 2 20690 65: 2 170
256 overlap with 76 76: 2 20690 256: 17515 66740
No overlap, move on.
No overlap, move on.
No overlap, move on.
228 overlap with 166 166: 72503 123150 228: 72510 114530
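The answer above also mentions turning the procedure around for very large files: read one line at a time and compare it only to earlier lines that can still overlap. A minimal sketch of that streaming variant, wrapped in a hypothetical helper sub `overlaps_stream` so the result is easy to inspect (it assumes the same sorted input format, and the pairs come out in reading order rather than grouped as in the posted output):

```perl
use strict;
use warnings;

# Streaming variant: keep a window of earlier rows that may still overlap,
# and compare each newly read line only against that window. Relies on the
# input being sorted by the start field (third column).
sub overlaps_stream {
    my @statements;
    my @window;    # earlier rows whose end positions have not yet been passed
    for my $line (@_) {
        my ($id, $name, $begin, $end) = split ' ', $line;
        # rows that ended at or before the current start can never overlap
        # any later row either, so drop them from the window
        @window = grep { $_->[3] > $begin } @window;
        push @statements, "$id overlap with $_->[0]" for @window;
        push @window, [ $id, $name, $begin, $end ];
    }
    return @statements;
}

# prints: 87 overlap with 93 / 76 overlap with 93 / 76 overlap with 87
print "$_\n" for overlaps_stream(
    '93 Blue19 1 82',
    '87 Green9 1 7912',
    '76 Blue7 2 20690',
);
```

This keeps memory proportional to the size of the overlap window instead of the whole file.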
A comment on complexity
Worst case: each element has to be compared to every other (they all overlap). This means that for each element we need N-1 comparisons, and we have N elements. This is O(N^2) complexity. This complexity is not good for operations that are used often and on potentially large data sets, like what libraries do. But it is not necessarily bad for a particular problem -- the data set still needs to be quite large for that to result in prohibitively long runtimes.
Best case: each element is compared only once (no overlap at all). This implies N comparisons, thus O(N) complexity.
Average: let us assume that each element overlaps with a "few" next ones, say 3 (three). This means that there would be 3N comparisons, which is still O(N) complexity. This holds as long as the number of comparisons does not depend on the length of the list (but is constant), which is a very reasonable typical scenario here. This is good.
Thanks to ikegami for bringing this up in the comment, along with the estimate.
Remember that the importance of the computational complexity of a technique depends on its use.
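As a sanity check on these estimates, here is a small sketch that counts how many comparisons the sorted-and-cut nested loop actually performs on the sample data (the cut-off condition is the same one used in the program above):

```perl
use strict;
use warnings;

# Count the comparisons the nested loop performs on the sample data,
# to check the "a few comparisons per element" average-case estimate.
my @data = map { [ split ] } (
    '93 Blue19 1 82',
    '87 Green9 1 7912',
    '76 Blue7 2 20690',
    '65 Red4 2 170',
    '256 Orange50 17515 66740',
    '166 Teal68 72503 123150',
    '228 Green89 72510 114530',
);

my $comparisons = 0;
for my $i (0 .. $#data) {
    for my $j ($i + 1 .. $#data) {
        $comparisons++;
        # once overlap stops, sorted order guarantees it cannot resume
        last if $data[$i][-1] <= $data[$j][-2];
    }
}
print "$comparisons comparisons for ", scalar(@data), " lines\n";   # 13 for 7
```

Thirteen comparisons for seven lines is roughly 2N rather than N^2, consistent with the average-case argument.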
This produces exactly the output that you asked for, given your sample data as input. It runs in well under one millisecond.
Do you have other constraints that you haven't explained? Making your code run faster should never be an end in itself. There is nothing inherently wrong with O(n!) time complexity: it is the execution time that you must consider, and if your code is fast enough then your job is done.
use strict;
use warnings 'all';

my @data = map [ split ], grep /\S/, <DATA>;

for my $i1 ( 0 .. $#data ) {
    my $v1 = $data[$i1];
    for my $i2 ( $i1 .. $#data ) {
        my $v2 = $data[$i2];
        next if $v1 == $v2;
        unless ( $v1->[3] < $v2->[2] or $v1->[2] > $v2->[3] ) {
            my $statement = sprintf "%d overlap with %d", $v2->[0], $v1->[0];
            printf "%-22s %d: %d %-7d %d: %d %-7d\n",
                $statement, @{$v1}[0, 2, 3], @{$v2}[0, 2, 3];
        }
    }
}
__DATA__
93 Blue19 1 82
87 Green9 1 7912
76 Blue7 2 20690
65 Red4 2 170
256 Orange50 17515 66740
166 Teal68 72503 123150
228 Green89 72510 114530
output
87 overlap with 93 93: 1 82 87: 1 7912
76 overlap with 93 93: 1 82 76: 2 20690
65 overlap with 93 93: 1 82 65: 2 170
76 overlap with 87 87: 1 7912 76: 2 20690
65 overlap with 87 87: 1 7912 65: 2 170
65 overlap with 76 76: 2 20690 65: 2 170
256 overlap with 76 76: 2 20690 256: 17515 66740
228 overlap with 166 166: 72503 123150 228: 72510 114530
Related
SockMerchant Challenge Ruby Array#count not counting?
So, I'm doing a beginners' challenge on HackerRank, and a strange behavior of Ruby is boggling my mind. The challenge is: find and count how many pairs there are in the array (sock pairs). Here's my code:

n = 100
ar = %w(50 49 38 49 78 36 25 96 10 67 78 58 98 8 53 1 4 7 29 6 59 93 74 3 67 47 12 85 84 40 81 85 89 70 33 66 6 9 13 67 75 42 24 73 49 28 25 5 86 53 10 44 45 35 47 11 81 10 47 16 49 79 52 89 100 36 6 57 96 18 23 71 11 99 95 12 78 19 16 64 23 77 7 19 11 5 81 43 14 27 11 63 57 62 3 56 50 9 13 45)

def sockMerchant(n, ar)
  counter = 0
  ar.each do |item|
    if ar.count(item) >= 2
      counter += ar.count(item) / 2
      ar.delete(item)
    end
  end
  counter
end

print sockMerchant(n, ar)

The problem is, it doesn't count well. After running the function, its internal array ar still has countable pairs, and I prove it by running it again. There's more: if you sort the array, it behaves differently. It doesn't make sense to me. You can check the behavior at this link: https://repl.it/repls/HuskyFrighteningNaturallanguage
You're deleting items from a collection while iterating over it: expect bad stuff to happen. In short, don't do that if you don't want to have such problems. See:

> arr = [1,2,1]
# => [1, 2, 1]
> arr.each {|x| puts x; arr.delete(x) }
# 1
# => [2]

We never get the 2 in our iteration.

A simple solution, that is a small variation of your code, could look as follows:

def sock_merchant(ar)
  ar.uniq.sum do |item|
    ar.count(item) / 2
  end
end

Which is basically finding all unique socks, and then counting pairs for each of them. Note that its complexity is n^2, as for each unique element n of the array, you have to go through the whole array in order to find all elements that are equal to n.

An alternative: first group all socks, then check how many pairs of each type we have:

ar.group_by(&:itself).sum { |k, v| v.size / 2 }

As ar.group_by(&:itself), short for ar.group_by { |x| x.itself }, will loop through the array and create a hash looking like this:

{"50"=>["50", "50"], "49"=>["49", "49", "49", "49"], "38"=>["38"], ...}

And by calling sum on it, we'll iterate over it, summing the number of found elements (/ 2).
How can we find the complexity of merge sort with an array of size 16?
I have an array of size 16 and have to find its theta and big-Oh. The general case is n log n, but what will it be for this specific case? 73 3 69 88 36 56 44 63 14 60 80 84 6 80 55 62
The array's size and its composition/pattern don't affect the merge sort technique, so it is going to be the same for a 16-element array as well. Mergesort will in any case first divide the array, then compare and merge.
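To make that concrete, here is a quick sketch (in Perl, the main language of this thread; merge_sort is a hypothetical helper written for illustration) that runs a comparison-counting merge sort on the 16 numbers from the question. Whatever the data's composition, the count stays inside the usual merge-sort envelope for n = 16:

```perl
use strict;
use warnings;

# Count element comparisons made by a plain top-down merge sort,
# to show the work is bounded by the structure of the splits, not the data.
my $comparisons = 0;

sub merge_sort {
    my @a = @_;
    return @a if @a < 2;
    my $mid = int(@a / 2);
    my @l = merge_sort(@a[0 .. $mid - 1]);
    my @r = merge_sort(@a[$mid .. $#a]);
    my @merged;
    while (@l && @r) {
        $comparisons++;
        push @merged, ($l[0] <= $r[0] ? shift @l : shift @r);
    }
    return (@merged, @l, @r);    # append whichever side still has elements
}

my @sorted = merge_sort(73, 3, 69, 88, 36, 56, 44, 63, 14, 60, 80, 84, 6, 80, 55, 62);
print "@sorted\n";
print "$comparisons element comparisons for 16 items\n";
```

For 16 elements the split structure forces between 32 and 49 comparisons (at most size-1 per merge across the 4 levels), i.e. on the order of n log2 n = 64, regardless of the input pattern.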
Array manipulation in Perl
The scenario is as follows: I have a dynamically changing text file which I'm passing to a variable to capture a pattern that occurs throughout the file. It looks something like this:

my @array1;
my $file = `cat <file_name>.txt`;
if (@array1 = ( $file =~ m/<pattern_match>/g) ) {
    print "@array1\n";
}

The array looks something like this:

10:38:49 788 56 51 56 61 56 59 56 51 56 80 56 83 56 50 45 42 45 50 45 50 45 43 45 54
10:38:51 788 56 51 56 61 56 59 56 51 56 80 56 83 56 50 45 42 45 50 45 50 45 43 45 54

From the above array1 output, the pattern of the array is something like this:

T1 P1 t1(1) t1(2) ... t1(25)
T2 P2 t2(1) t2(2) ... t2(25)

and so on and so forth.

Currently, /g in the regex returns a set of values that occurs only twice (only because the txt file contains this pattern that number of times). This particular pattern occurrence will change depending on the file name that I plan to pass dynamically.

What I intend to achieve: the final result should be a CSV file that contains these values in the following format:

T1,P1,t1(1),t1(2),...,t1(25)
T2,P2,t2(1),t2(2),...,t2(25)

and so on and so forth. For instance, my final CSV file should look like this:

10:38:49,788,56,51,56,61,56,59,56,51,56,80,56,83,56,50,45,42,45,50,45,50,45,43,45,54
10:38:51,788,56,51,56,61,56,59,56,51,56,80,56,83,56,50,45,42,45,50,45,50,45,43,45,54

The delimiter for this pattern is T1, which is a time in the format \d\d:\d\d:\d\d, for example 10:38:49, 10:38:51, etc.

What I have tried so far:

use Data::Dumper;
use List::MoreUtils qw(part);

my $partitions = 2;
my $i = 0;
print Dumper part { $partitions * $i++ / @array1 } @array1;

In this particular case, my $partitions = 2; holds good since the pattern occurs in the txt file only twice, and hence I'm splitting the array into two. However, as mentioned earlier, the number of pattern occurrences keeps changing according to the txt file I use.
The question: how can I make this code more generic to achieve my final goal of splitting the array into multiple equal-sized arrays without losing the contents of the original array, and then converting these mini-arrays into one single CSV file? If there is any other workaround for this besides array manipulation, please do let me know. Thanks in advance.
PS: I considered a hash of hashes and an array of hashes, but that kind of data structure did not seem a healthy solution for the problem I'm facing right now.
As far as I can tell, all you need is splice, which will work fine as long as you know the record size and it's constant.

The data you showed has 52 fields, but the description of it requires 27 fields per record. It looks like each line has T, P, and t1 .. t24, rather than ending at t25.

Here's how it looks if I split the data into 26-element chunks:

use strict;
use warnings 'all';

my @data = qw/
    10:38:49 788 56 51 56 61 56 59 56 51 56 80 56 83 56 50 45 42 45 50 45 50 45 43 45 54
    10:38:51 788 56 51 56 61 56 59 56 51 56 80 56 83 56 50 45 42 45 50 45 50 45 43 45 54
/;

while ( @data ) {
    my @set = splice @data, 0, 26;
    print join(',', @set), "\n";
}

output

10:38:49,788,56,51,56,61,56,59,56,51,56,80,56,83,56,50,45,42,45,50,45,50,45,43,45,54
10:38:51,788,56,51,56,61,56,59,56,51,56,80,56,83,56,50,45,42,45,50,45,50,45,43,45,54

If you wanted to use List::MoreUtils instead of splice, the natatime function returns an iterator that will do the same thing as the splice above. Like this:

use List::MoreUtils qw/ natatime /;

my $iter = natatime 26, @data;

while ( my @set = $iter->() ) {
    print join(',', @set), "\n";
}

The output is identical to that of the program above.

Note: it is very wrong to start a new shell process just to use cat to read a file. The standard method is to undefine the input record separator $/ like this:

my $file = do {
    open my $fh, '<', '<file_name>.txt' or die "Unable to open file for input: $!";
    local $/;
    <$fh>;
};

Or if you prefer, you could use File::Slurper like this:

use File::Slurper qw/ read_binary /;
my $file = read_binary '<file_name>.txt';

although you will probably have to install it, as it is not a core module.
comparing multiple column files using python3
input_file1:

a 1 33
a 34 67
a 68 78
b 1 99
b 100 140
c 1 70
c 71 100
c 101 190

input_file2:

a 5 23
a 30 72
a 76 78
b 5 30
c 23 88
c 92 98

I want to compare these two files such that for every value of 'a' in file2, the two integers (boundary) fall in the range (boundaries) of 'a' in file1 or between two ranges.
Instead of storing values like 'a 1 33', you can use one structure (like 'a:1:33') for your data while writing to the file, so that it also becomes easy to read the data back. Then you can read each line, split it on the ':' separator, and compare it with the other file easily.
How can I create an array of ratios inside a for loop in MATLAB?
I would like to create an array or vector of musical notes using a for loop. Every musical note, A, A#, B, C, etc., is a 2^(1/12) ratio of the previous/next. E.g. the note A is 440 Hz, and A# is 440 * 2^(1/12) Hz = 466.16 Hz. Starting from 27.5 Hz (A0), I want a loop that iterates 88 times to create an array of each note's frequency up to 4186 Hz, so that will look like

f = [27.5 29.14 30.87 ... 4186.01]

So far, I've understood this much:

f = [];
for i = 1:87
    % what goes here?
    % f = [27.5 * 2^(i/12)]; ?
end
return;
There is no need to use a loop for this in MATLAB; you can simply do:

f = 27.5 * 2.^((0:87)/12)

The answer:

f =
  Columns 1 through 13
    27.5 29.135 30.868 32.703 34.648 36.708 38.891 41.203 43.654 46.249 48.999 51.913 55
  Columns 14 through 26
    58.27 61.735 65.406 69.296 73.416 77.782 82.407 87.307 92.499 97.999 103.83 110 116.54
  Columns 27 through 39
    123.47 130.81 138.59 146.83 155.56 164.81 174.61 185 196 207.65 220 233.08 246.94
  Columns 40 through 52
    261.63 277.18 293.66 311.13 329.63 349.23 369.99 392 415.3 440 466.16 493.88 523.25
  Columns 53 through 65
    554.37 587.33 622.25 659.26 698.46 739.99 783.99 830.61 880 932.33 987.77 1046.5 1108.7
  Columns 66 through 78
    1174.7 1244.5 1318.5 1396.9 1480 1568 1661.2 1760 1864.7 1975.5 2093 2217.5 2349.3
  Columns 79 through 88
    2489 2637 2793.8 2960 3136 3322.4 3520 3729.3 3951.1 4186
maxind = 87;
f = zeros(1, maxind);  % preallocate: better performance, and avoids mlint warnings
for ii = 1:maxind
    f(ii) = 27.5 * 2^(ii/12);
end

The reason I named the loop variable ii is that i is the name of a builtin function, so it's considered bad practice to use that as a variable name.

Also, in your description you said you want to iterate 88 times, but the above loop only iterates 1 through 87 (both inclusive). If you want to iterate 88 times, change maxind to 88.