Perl multidimensional array: compare columns across rows and display the whole rows that satisfy a condition - arrays

I am facing a small issue with taking array indexes for comparison and displaying the result. I have a tab-delimited file with 9 columns and more than 100 rows. I want to compare the 8th column of row i with the 7th column of row i+1. If it is smaller than that 7th column value, print the entire row; if it is greater, compare the 6th column of both rows and print a row only if its 6th column is bigger.
Sample File
Recep_L_domain PF01030.22 112 sp|P00533|EGFR_HUMAN 2.50E-30 104.7 57 167 Receptor
Furin-like PF00757.18 149 sp|P00533|EGFR_HUMAN 4.10E-29 101.3 185 338 Furin-like
Recep_L_domain PF01030.22 112 sp|P00533|EGFR_HUMAN 3.60E-28 97.8 361 480 Receptor
GF_recep_IV PF14843.4 132 sp|P00533|EGFR_HUMAN 1.60E-46 157.2 505 636 Growth
Pkinase PF00069.23 264 sp|P00533|EGFR_HUMAN 2.70E-39 135 712 964 Protein
Pkinase_Tyr PF07714.15 260 sp|P00533|EGFR_HUMAN 8.40E-88 293.9 714 965 Protein
For example, if we compare the last two rows, the 8th column of the first is bigger than the next row's 7th column, so in this case it should compare the two 6th column values and print only the row with the bigger one. So from these two rows it should print only the last row. For me, the code below only prints a row when the value is smaller, but I want to ask how I can compare the 6th column and print the result when the 8th column is bigger.
#!/usr/bin/perl
use strict;
use warnings;
use Data::Dumper;
open(IN,"<samplecode.txt");
my @Alifrom;
my @Alito;
my @data; ## multidimensional array
while(<IN>){
    chomp $_;
    #next if $_=undef;
    my @line = split("\t", $_);
    ##my($a, $b, $c, $d, $e, $f, $g, $h, $i) = split(/\t/,$_); ## catch data and store into multiple scalar variables
    push @data, [@line];
}
for (my $i = 0; $i < @data ; $i++){
    if ($data[$i][7] gt $data[$i][6]){
        for (my $j = 0; $j < @{$data[$i]}; $j++){
            #@Alifrom = map $data[$i][$j+6], @data;
            print "$data[$i][$j]\t";
        }
    }
    #else
    print "\n";
}

The description in your question is not entirely clear, but I'm taking an educated guess.
First, you should not read the whole file into an array. If your file really only has 100 rows, it's not a problem, but if there are more rows this will consume a lot of memory.
You say you want to compare values in every row i to values in row i+1, so essentially in every row you want to look at values in the next row. That means you need to keep a maximum of two rows in memory at one time. Since the comparison only ever looks one row ahead, you can just read the first row, then read the second row, compare them, and when you're done make the second row the new first row.
In the loop, you always read the second row, and keep the first row around from the previous iteration, where it was read as the second row.
For that, it makes sense to turn the reading and splitting into a function. You can pass it a file handle. In my example below, I've used DATA with the __DATA__ section, but you can just open my $fh, '<', 'samplecode.txt' and pass $fh around.
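For example, here is a minimal sketch of that filehandle variant (read_and_split being the function defined in the example below):
open my $fh, '<', 'samplecode.txt' or die "Can't open samplecode.txt: $!";
my ( $row, $cols ) = read_and_split($fh);   # same function as below, just with a lexical filehandle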
Because you want to print the whole row in some cases, you should not chomp and split it destructively, but rather keep the actual full row around, including the line break. We therefore make the read-and-split function return two values: the full row as a scalar string, and an array reference of the cleaned-up columns.
If there are no more lines to read, we return an implicit undef, which makes the while loop stop. As a consequence, the last row of the file only ever appears as row i+1 and never gets its turn as row i, so it is never a candidate for printing.
When comparing, note that array indexes in Perl start at zero, so column 7 is index [6] and column 8 is index [7].
Here's an example implementation.
use strict;
use warnings;
# this function reads a line from the filehandle that's passed in and returns
# the row as a string and an array ref of all columns, or undef if there are
# no more lines to read
sub read_and_split {
    my $fh = shift;
    # read one line and return undef if there is no more data
    my $row = <$fh>;
    return unless defined $row;
    # split into columns
    my @cols = split /\s+/, $row; # Stack Overflow does not like tabs, use \t
    # only chomp after splitting so we retain the original line for printing
    chomp $cols[-1];
    # return both things
    return $row, \@cols;
}
# read the first line
my ( $row_i, $cols_i ) = read_and_split( \*DATA );
# read subsequent lines
while ( my ( $row_i_plus_one, $cols_i_plus_one ) = read_and_split( \*DATA ) ) {
    # 8th column of row i is smaller than the 7th column of row i+1
    if ( $cols_i->[7] < $cols_i_plus_one->[6] ) {
        print $row_i;
    }
    else {
        # otherwise compare the 6th column of both rows and only print
        # row i if its 6th column is bigger
        if ( $cols_i->[5] > $cols_i_plus_one->[5] ) {
            print $row_i;
        }
    }
    # turn the current i+1 into i for the next iteration
    $row_i  = $row_i_plus_one;
    $cols_i = $cols_i_plus_one;
}
__DATA__
Recep_L_domain PF01030.22 112 sp|P00533|EGFR_HUMAN 2.50E-30 104.7 57 167 Receptor
Furin-like PF00757.18 149 sp|P00533|EGFR_HUMAN 4.10E-29 101.3 185 338 Furin-like
Recep_L_domain PF01030.22 112 sp|P00533|EGFR_HUMAN 3.60E-28 97.8 361 480 Receptor
GF_recep_IV PF14843.4 132 sp|P00533|EGFR_HUMAN 1.60E-46 157.2 505 636 Growth
Pkinase PF00069.23 264 sp|P00533|EGFR_HUMAN 2.70E-39 135 712 964 Protein
Pkinase_Tyr PF07714.15 260 sp|P00533|EGFR_HUMAN 8.40E-88 293.9 714 965 Protein
It outputs these lines:
Recep_L_domain PF01030.22 112 sp|P00533|EGFR_HUMAN 2.50E-30 104.7 57 167 Receptor
Furin-like PF00757.18 149 sp|P00533|EGFR_HUMAN 4.10E-29 101.3 185 338 Furin-like
Recep_L_domain PF01030.22 112 sp|P00533|EGFR_HUMAN 3.60E-28 97.8 361 480 Receptor
GF_recep_IV PF14843.4 132 sp|P00533|EGFR_HUMAN 1.60E-46 157.2 505 636 Growth
Note that the part about comparing the sixth columns was not entirely clear in your question. I assumed we compare both sixth columns and print row i if its value is the bigger one. If we were to print row i+1 instead, we might end up printing that line twice.
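If you did want that other interpretation (printing row i+1 when its 6th column wins), one way to avoid double printing is to remember whether the row carried over into the next iteration was already printed. Here is a minimal sketch of that variant; it reuses read_and_split and the initial read from above:
my $printed_i = 0;    # has the current row i already been printed?
while ( my ( $row_i_plus_one, $cols_i_plus_one ) = read_and_split( \*DATA ) ) {
    if ( $cols_i->[7] < $cols_i_plus_one->[6] ) {
        # no overlap: row i is printed by default
        print $row_i unless $printed_i;
        $printed_i = 0;
    }
    elsif ( $cols_i->[5] > $cols_i_plus_one->[5] ) {
        # overlap and row i has the bigger 6th column
        print $row_i unless $printed_i;
        $printed_i = 0;
    }
    else {
        # overlap and row i+1 has the bigger (or equal) 6th column
        print $row_i_plus_one;
        $printed_i = 1;    # it will be row i next time around
    }
    $row_i  = $row_i_plus_one;
    $cols_i = $cols_i_plus_one;
}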

Related

How is an array sliced?

I have some sample code where the array is sliced as follows:
A = X(:,2:300)
What does this slice of the array mean?
: stands for 'all' when used by itself, and 2:300 gives an array of integers from 2 to 300 with a spacing of 1 (the 1 is implicit) in MATLAB. 2:300 is the same as 2:1:300, and you can use any spacing you wish, for example 2:37:300 (result: [2 39 76 113 150 187 224 261 298]) to generate equally spaced numbers.
Your statement says: select every row of the matrix X and columns 2 to 300, and assign the result to A. Suggested reading
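Since the main question on this page is about Perl, a rough Perl analogue of that slice may also help; this is just an illustration with made-up data, not a MATLAB feature:
use strict;
use warnings;
# rough analogue of A = X(:,2:300): keep columns 2..300 of every row
# (Perl arrays index from 0, so MATLAB columns 2..300 become indices 1..299)
my @X = ( [ 1 .. 400 ], [ 401 .. 800 ] );   # two made-up rows of 400 columns each
my @A = map { [ @{$_}[ 1 .. 299 ] ] } @X;   # slice each row reference
print scalar @{ $A[0] }, "\n";              # 299 columns remain per row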

Issues with manipulating a transposed matrix using a bash script

I wrote a bash script that is supposed to calculate the statistical average and median of each column of an input file. The input file format is shown below. Each number is separated by a tab.
1 2 3
3 2 8
3 4 2
My approach is to first transpose the matrix, so that rows become columns and vice versa. The transposed matrix is stored in a temporary text file. Then I calculate the average and median for each row. However, the script gives me the wrong output. First of all, the array that holds the average and median for each column only produces one output. Secondly, the median value calculated is incorrect.
After a bit of code inspection and testing, I discovered that while the transposed matrix did get written to the text file, it is not read correctly by the script. Specifically, each line read only gives one number. Below is my script.
#if column is chosen instead
if [[ $initial == "-c" ]]
then
    echo "Calculating column stats"

    #transpose columns to rows to make life easier
    WORD=$(head -n 1 $filename | wc -w); #counts the number of columns
    for((index=1; index<=$WORD; index++)) #loop over the number of columns
    do
        awk '{print $'$index'}' $filename | tr '\n' ' ';echo; #compact way of performing a row-col transposition
        #prints the column as determined by $index, then replaces each newline with a space
    done > tmp.txt

    array=()
    averageArray=()
    medianArray=()
    sortedArray=()

    #calculate average and median, just like the one used for rows
    while read -a cols
    do
        total=0
        sum=0

        for number in "${cols[@]}" #for every item in the transposed column
        do
            (( sum += $number )) #the total sum of the numbers in the column
            (( total++ ))        #the number of items in the column
            array+=( $number )
        done

        sortedArray=( $( printf "%s\n" "${array[@]}" | sort -n) )
        arrayLength=${#sortedArray[@]}
        #echo sorted array is $sortedArray
        #based on array length, construct the median array
        if [[ $(( arrayLength % 2 )) -eq 0 ]]
        then #even
            upper=$(( arrayLength / 2 ))
            lower=$(( (arrayLength/2) - 1 ))
            median=$(( (${sortedArray[lower]} + ${sortedArray[upper]}) / 2 ))
            #echo median is $median
            medianArray+=$index
        else #odd
            middle=$(( (arrayLength) / 2 ))
            median=${sortedArray[middle]}
            #echo median is $median
            medianArray+=$index
        fi
        averageArray+=( $((sum/total)) ) #the final row array of averages that is displayed

    done < tmp.txt
fi
Thanks for the help.

creating an array of arrays in perl and deleting from the array

I'm writing this to avoid an O(n!) time complexity, but I only have pseudocode right now because there are some things I'm unsure about implementing.
This is the format of the file that I want to pass into this script. The data is sorted by the third column -- the start position.
93 Blue19 1 82
87 Green9 1 7912
76 Blue7 2 20690
65 Red4 2 170
...
...
256 Orange50 17515 66740
166 Teal68 72503 123150
228 Green89 72510 114530
Explanation of the code:
I want to create an array of arrays to find when two pieces of information have overlapping lengths.
Columns 3 and 4 of the input file are start and stop positions on a single track line. If any row x has a position in column 3 that is smaller than the position in column 4 of any row y, then x starts before y ends and there is some overlap.
I want to find every row that overlaps with any other row without having to compare every row to every row. Because the rows are sorted, I simply add a string to an inner array of the outer array, where each inner array represents one row.
If the new row being looked at does not overlap with one of the rows already in the array, then (because the array is sorted by the third column) no further row will be able to overlap with that row and it can be removed.
This is the idea I have so far:
#!/usr/bin/perl -w
use strict;
my @array
while (<>) {
    my thisLoop = ($id, $name, $begin, $end) = split;
    my @innerArray = split; # make an inner array with the current line, to
                            # have strings that will be printed after it
    push @array(@innerArray)
    for ( @array ) { # loop through the outer array being made to see if there
                     # are overlaps with the current item
        if ( $begin > $innerArray[3]) # if there are no overlaps then print
                                      # this inner array and remove it
                                      # (because it is sorted and everything
                                      # else cannot overlap because it is
                                      # larger)
            # print @array[4-]
            # remove this item from the array
        else
            # add to array this string
            "$id overlap with innerArray[0] \t innerArray[0]: $innerArray[2], $innerArray[3] "\t" $id : $begin, $end
            # otherwise because there is overlap add a statement to the inner
            # array explaining the overlap
The code should produce something like
87 overlap with 93 93: 1 82 87: 1 7912
76 overlap with 93 93: 1 82 76: 1 20690
65 overlap with 93 93: 1 82 65: 2 170
76 overlap with 87 87: 1 7912 76: 2 20690
65 overlap with 87 87: 1 7912 65: 2 170
65 overlap with 76 76: 2 20690 65: 2 170
256 overlap with 76 76: 2 20690 256: 17515 66740
228 overlap with 166 166: 72503 123150 228: 72510 114530
This was tricky to explain, so ask me if you have any questions.
I am using the posted input and output files as a guide on what is required.
A note on complexity. In principle, each line has to be compared to all following lines, but the number of operations actually carried out depends on the data. Since it is stated that the data is sorted on the field to be compared, the inner loop can be cut short as soon as overlapping stops. A comment on the complexity estimate is at the end.
This compares each line to the ones following it. For that, all lines are first read into an array. If the data set is very large this should be changed to read line by line, with the procedure turned around: compare the currently read line to all previous ones. This is a very basic approach. It may well be better to build auxiliary data structures first, possibly making use of suitable libraries.
use warnings;
use strict;
my $file = 'data_overlap.txt';
my @lines = do {
    open my $fh, '<', $file or die "Can't open $file -- $!";
    <$fh>;
};
# For each element compare all following ones, but cut out
# as soon as there's no overlap since data is sorted
for my $i (0..$#lines)
{
    my @ref_fields = split '\s+', $lines[$i];
    for my $j ($i+1..$#lines)
    {
        my @curr_fields = split '\s+', $lines[$j];
        if ( $ref_fields[-1] > $curr_fields[-2] ) {
            print "$curr_fields[0] overlap with $ref_fields[0]\t" .
                  "$ref_fields[0]: $ref_fields[-2] $ref_fields[-1]\t" .
                  "$curr_fields[0]: $curr_fields[-2] $curr_fields[-1]\n";
        }
        else { print "\tNo overlap, move on.\n"; last }
    }
}
With the input in file 'data_overlap.txt' this prints
87 overlap with 93 93: 1 82 87: 1 7912
76 overlap with 93 93: 1 82 76: 2 20690
65 overlap with 93 93: 1 82 65: 2 170
No overlap, move on.
76 overlap with 87 87: 1 7912 76: 2 20690
65 overlap with 87 87: 1 7912 65: 2 170
No overlap, move on.
65 overlap with 76 76: 2 20690 65: 2 170
256 overlap with 76 76: 2 20690 256: 17515 66740
No overlap, move on.
No overlap, move on.
No overlap, move on.
228 overlap with 166 166: 72503 123150 228: 72510 114530
A comment on complexity
Worst case: each element has to be compared to every other (they all overlap). This means that for each element we need N-1 comparisons, and we have N elements. This is O(N^2) complexity. This complexity is not good for operations that are used often and on potentially large data sets, like what libraries do. But it is not necessarily bad for a particular problem -- the data set still needs to be quite large for that to result in prohibitively long runtimes.
Best case: each element is compared only once (no overlap at all). This implies N comparisons, thus O(N) complexity.
Average: let us assume that each element overlaps with a "few" next ones, let us say 3 (three). This means that there would be 3N comparisons. This is still O(N) complexity. This holds as long as the number of comparisons does not depend on the length of the list (but is constant), which is a very reasonable typical scenario here. This is good.
Thanks to ikegami for bringing this up in the comment, along with the estimate.
Remember that the importance of the computational complexity of a technique depends on its use.
This produces exactly the output that you asked for given your sample data as input. It runs in well under one millisecond.
Do you have other constraints that you haven't explained? Making your code run faster should never be an end in itself. There is nothing inherently wrong with an O(n!) time complexity: it is the execution time that you must consider, and if your code is fast enough then your job is done.
use strict;
use warnings 'all';
my @data = map [ split ], grep /\S/, <DATA>;
for my $i1 ( 0 .. $#data ) {
    my $v1 = $data[$i1];
    for my $i2 ( $i1 .. $#data ) {
        my $v2 = $data[$i2];
        next if $v1 == $v2;
        unless ( $v1->[3] < $v2->[2] or $v1->[2] > $v2->[3] ) {
            my $statement = sprintf "%d overlap with %d", $v2->[0], $v1->[0];
            printf "%-22s %d: %d %-7d %d: %d %-7d\n", $statement, @{$v1}[0, 2, 3], @{$v2}[0, 2, 3];
        }
    }
}
__DATA__
93 Blue19 1 82
87 Green9 1 7912
76 Blue7 2 20690
65 Red4 2 170
256 Orange50 17515 66740
166 Teal68 72503 123150
228 Green89 72510 114530
output
87 overlap with 93 93: 1 82 87: 1 7912
76 overlap with 93 93: 1 82 76: 2 20690
65 overlap with 93 93: 1 82 65: 2 170
76 overlap with 87 87: 1 7912 76: 2 20690
65 overlap with 87 87: 1 7912 65: 2 170
65 overlap with 76 76: 2 20690 65: 2 170
256 overlap with 76 76: 2 20690 256: 17515 66740
228 overlap with 166 166: 72503 123150 228: 72510 114530

MATLAB: Zeros appearing when storing values from different arrays to another array

I have two double arrays like the ones below:
K>> var_conx
var_conx =
1
3
127
129
216
217
252
253
302
303
342
343
and
K>> var_cony
var_cony =
2
126
128
216
217
252
253
302
303
342
343
My task is pretty simple: I only need to store all the common values of the two arrays in another double array; let's call the other array "common_convar".
To be specific, for the two arrays above, I only want to store the values 216, 217, 252, 253, 302, 303, 342, 343. I do not care about the other values and do not want them stored.
I have written the following code:
for i=1:length(var_conx)
    for j=1:length(var_cony)
        if var_conx(i)==var_cony(j)
            common_convar(i,:)=[var_conx(i)];
        end
    end
end
The problem I encountered here is that the array common_convar also stores some zeros at the beginning:
K>> common_convar
common_convar =
0
0
0
0
216
217
252
253
302
303
342
343
How is it possible to get rid of the zeros and only store the desired common values of the two arrays var_conx and var_cony?
Thanks in advance for your time.
Firstly, you could find the elements common to both arrays, without having to do nested loops, using the MATLAB set intersect function:
common_values= intersect(var_conx,var_cony);
Then you could find the nonzero elements of the common array via logical indexing:
common_values = common_values(common_values > 0);
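For what it's worth, the same intersection is also easy in Perl, the language of the main question on this page; here is a minimal sketch using a hash as a lookup set (variable names taken from the question):
use strict;
use warnings;
# mark every value seen in the first array, then keep the values of the
# second array that were marked
my @var_conx = ( 1, 3, 127, 129, 216, 217, 252, 253, 302, 303, 342, 343 );
my @var_cony = ( 2, 126, 128, 216, 217, 252, 253, 302, 303, 342, 343 );
my %in_x   = map { $_ => 1 } @var_conx;
my @common = grep { $in_x{$_} } @var_cony;
print "@common\n";   # 216 217 252 253 302 303 342 343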

Perl: Sort part of array

I have an array where each line has many fields separated by varying amounts of whitespace, like:
INDDUMMY drawing2 139 30 1 0 0 0 0 0
RMDUMMY drawing2 69 2 1 0 0 0 0 0
PIMP drawing 7 0 1444 718 437 0 0 0
I'm trying to sort this array by the number in the 3rd field, so the desired output should be:
PIMP drawing 7 0 1444 718 437 0 0 0
RMDUMMY drawing2 69 2 1 0 0 0 0 0
INDDUMMY drawing2 139 30 1 0 0 0 0 0
I tried to split using a regular expression within the sort function, like:
@sortedListOfLayers = sort {
    split(m/\w+\s+(\d+)\s/gm,$a)
    cmp
    split(m/\w+\s+(\d+)\s/gm,$b)
} @listOfLayers;
but it doesn't work correctly. How could I make that type of sorting work?
You need to expand out your sort function a little further. I'm also not sure that split is working the way you think it is. Split turns text into an array based on a delimiter.
I think your problem is that your regular expression - thanks to the gm flags - isn't matching what you think it's matching. I'd perhaps approach it slightly differently though:
#!/usr/bin/perl
use strict;
use warnings;
my @array = <DATA>;
sub sort_third_num {
    my $a1 = (split ( ' ', $a ) )[2];
    my $b1 = (split ( ' ', $b ) )[2];
    return $a1 <=> $b1;
}
print sort sort_third_num @array;
__DATA__
INDDUMMY drawing2 139 30 1 0 0 0 0 0
RMDUMMY drawing2 69 2 1 0 0 0 0 0
PIMP drawing 7 0 1444 718 437 0 0 0
This does the trick, for example.
If you're set on doing a regex approach:
sub sort_third_num {
    my ($a1) = $a =~ m/\s(\d+)/;
    my ($b1) = $b =~ m/\s(\d+)/;
    return $a1 <=> $b1;
}
Not matching globally means only the first capture is returned: the first occurrence of 'whitespace followed by digits'. We also compare numerically, rather than stringwise.
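To see why the numeric comparison matters, here is a small sketch using the third-column values from the sample data:
use strict;
use warnings;
# string comparison sorts "139" before "69" because "1" sorts before "6";
# numeric comparison gives the intended order
my @nums = ( 139, 69, 7 );
print join( ' ', sort { $a cmp $b } @nums ), "\n";   # 139 69 7
print join( ' ', sort { $a <=> $b } @nums ), "\n";   # 7 69 139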
If you want to sort a list and the operation used in the sort block is expensive, an often used Perl idiom is the Schwartzian Transform: you apply the operation once to each list element and store the result alongside the original element, sort, then map back to your original format.
The classic textbook example is sorting files in a directory by size using the expensive -s file test. A naïve approach would be
my @sorted = sort { -s $a <=> -s $b } @unsorted;
which has to perform -s twice for each comparison operation.
Using the Schwartzian Transform, we map the file names into a list of array references, each referencing an array containing the list element and its size (which has to be determined only once per file), then sort by file size, and finally map the array references back to just the file names. This is all done in a single step:
my @sorted =
    map $_->[0],                  # 3. map to file name
    sort { $a->[1] <=> $b->[1] }  # 2. sort by size
    map [ $_, -s $_ ],            # 1. evaluate size once for each file
    @unsorted;
In your case, the question is how expensive it is to extract the third field of each array element. When in doubt, measure to compare different methods. The speedup in the file size example is dramatic, at about a factor of 10 for a few dozen files!
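If you do want to measure, the core Benchmark module can compare the two approaches directly. Here is a minimal sketch for the file-size example (it assumes the current directory contains some files):
use strict;
use warnings;
use Benchmark qw(cmpthese);

my @unsorted = grep { -f } glob '*';   # some plain files to sort by size

cmpthese( -2, {                        # run each variant for at least 2 CPU seconds
    naive => sub {
        my @sorted = sort { -s $a <=> -s $b } @unsorted;
    },
    schwartzian => sub {
        my @sorted =
            map  $_->[0],
            sort { $a->[1] <=> $b->[1] }
            map  [ $_, -s $_ ],
            @unsorted;
    },
} );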
The Schwartzian Transform applied to your problem would look something like this:
my @sorted =
    map $_->[0],                          # 3. Map to original array
    sort { $a->[1] <=> $b->[1] }          # 2. Sort by third column
    map [ $_, ( split( ' ', $_ ) )[2] ],  # 1. Use Sobrique's idea
    @array;
If the operation used is so expensive that you want to avoid performing it more than once per value in case you have identical array elements, you can cache the results as outlined in this question; this is known as the Orcish Maneuver.
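A minimal sketch of that caching idea applied to the third-column sort (the %key_for cache name is made up for illustration):
use strict;
use warnings;
# Orcish Maneuver: compute the sort key at most once per element by caching it,
# even though the sort block may look at the same element many times
my @listOfLayers = (
    "INDDUMMY drawing2 139 30 1 0 0 0 0 0",
    "RMDUMMY drawing2 69 2 1 0 0 0 0 0",
    "PIMP drawing 7 0 1444 718 437 0 0 0",
);
my %key_for;
my @sortedListOfLayers = sort {
    ( $key_for{$a} //= ( split ' ', $a )[2] )
        <=>
    ( $key_for{$b} //= ( split ' ', $b )[2] )
} @listOfLayers;
print "$_\n" for @sortedListOfLayers;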
