perl compare elements of arrays and grouping - arrays

I am back with another question. I have a list of data:
1 L DIELTQSPE H EVQLQESDAELVKPGASVKISCKASGYTFTDHE
2 L DIVLTQSPRVT H EVQLQQSGAELVKPGASIKDTY
3 A ALQLTQSPSSLSAS B RITLKESGPPLVKPTCS C ELDKWAN
4 A ALQLTQSPSSLSAS B RITLKESGPPLVKPTCS C ELDKWAG
5 A ALQLTQSPSSLSAS B RITLKESGPPLVKPTCS C LELDKWASL
6 L DIQMTQIPSSLSASLSIC H EVQLQQSGVEVKMSCKASGYTFTS
7 L SYELTQPPSVSVSPGSIT H QVQLVQSAKGSGYSFS P YNKRKAFYTTKNIIG
8 L SYELTQPPSVSVSPGRIT H EVQLVQSGAASGYSFS P NNTRKAFYATGDIIG
9 A MPIMGSSVAVLAIL B DIVMTQSPTVTI C EVQLQQSGRGP
10 A MPIMGSSVVLAIL B DIVMTQSPTVTI C EVQLQQSGRGP
11 L DVVMTQTPLQ H EVKLDESVTVTSSTWPSQSITCNVAHPASSTKVDKKIE
12 A DIVMTQSPDAQYYSTPYSFGQGTKLEIKR
And I would like to compare the 3rd elements && 5th elements of each row, then group them if they have the same 3rd && 5th elements.
For example, with the data above, the results will be :
3: 3 A ALQLTQSPSSLSAS B RITLKESGPPLVKPTCS C ELDKWAN
4 A ALQLTQSPSSLSAS B RITLKESGPPLVKPTCS C ELDKWAG
5 A ALQLTQSPSSLSAS B RITLKESGPPLVKPTCS C LELDKWASL
9: 9 A MPIMGSSVAVLAIL B DIVMTQSPTVTI C EVQLQQSGRGP
10 A MPIMGSSVVLAIL B DIVMTQSPTVTI C EVQLQQSGRGP
Fyi, in the actual data, the 3rd, 5th, 7th elements are very long. I have made them cut to see the whole.
This is what I have done, I know it is very clumsy, but as a beginner, I am doing my best.
And the problem is that it shows only the first set of 'same' group.
Could you show me where it went wrong and/or other pretty methods to solve this, please?
my $file = <>;
open(IN, $file)|| die "no $file: $!\n";
my #arr;
while (my $line=<IN>){
push #arr, [split (/\s+/, $line)] ;
}
close IN;
my (#temp1, #temp2,%hash1);
for (my $i=0;$i<=$#arr ;$i++) {
push #temp1, [$arr[$i][2], $arr[$i][4]];
for (my $j=$i+1;$j<=$#arr ;$j++) {
push #temp2, [$arr[$j][2], $arr[$j][4]];
if (($temp1[$i][0] eq $temp2[$j][0])&& ($temp1[$i][1] eq $temp2[$j][1])) {
push #{$hash1{$arr[$i][0]}}, $arr[$i], $arr[$j];
}
}
}
print Dumper \%hash1;

You appear to have overcomplicated this a bit more than it needs to be, but that's common for beginners. Think more about how you would do this manually:
Look at each line.
See whether the third and fifth fields are the same as the previous line.
If so, print them.
The looping and all that is completely unnecessary:
#!/usr/bin/env perl
use strict;
use warnings;
my ($previous_row, $third, $fifth) = ('') x 3;
while (<DATA>) {
my #fields = split;
if ($fields[2] eq $third && $fields[4] eq $fifth) {
print $previous_row if $previous_row;
print "\t$_";
$previous_row = '';
} else {
$previous_row = $fields[0] . "\t" . $_;
$third = $fields[2];
$fifth = $fields[4];
}
}
__DATA__
1 L DIELTQSPE H EVQLQESDAELVKPGASVKISCKASGYTFTDHE
2 L DIVLTQSPRVT H EVQLQQSGAELVKPGASIKDTY
3 A ALQLTQSPSSLSAS B RITLKESGPPLVKPTCS C ELDKWAN
4 A ALQLTQSPSSLSAS B RITLKESGPPLVKPTCS C ELDKWAG
5 A ALQLTQSPSSLSAS B RITLKESGPPLVKPTCS C LELDKWASL
6 L DIQMTQIPSSLSASLSIC H EVQLQQSGVEVKMSCKASGYTFTS
7 L SYELTQPPSVSVSPGSIT H QVQLVQSAKGSGYSFS P YNKRKAFYTTKNIIG
8 L SYELTQPPSVSVSPGRIT H EVQLVQSGAASGYSFS P NNTRKAFYATGDIIG
9 A MPIMGSSVAVLAIL B DIVMTQSPTVTI C EVQLQQSGRGP
10 A MPIMGSSVAVLAIL B DIVMTQSPTVTI C EVQLQQSGRGP
11 L DVVMTQTPLQ H EVKLDESVTVTSSTWPSQSITCNVAHPASSTKVDKKIE
12 A DIVMTQSPDAQYYSTPYSFGQGTKLEIKR
(Note that I changed line 10 slightly so that its third field will match line 9 in order to get the same groups in the output as specified.)
Edit: One line of code was duplicated by a copy/paste error.
Edit 2: In response to comments, here's a second version which doesn't assume that the lines which should be grouped are contiguous:
#!/usr/bin/env perl
use strict;
use warnings;
my #lines;
while (<DATA>) {
push #lines, [ $_, split ];
}
# Sort #lines based on third and fifth fields (alphabetically), then on
# first field/line number (numerically) when third and fifth fields match
#lines = sort {
$a->[3] cmp $b->[3] || $a->[5] cmp $b->[5] || $a->[1] <=> $b->[1]
} #lines;
my ($previous_row, $third, $fifth) = ('') x 3;
for (#lines) {
if ($_->[3] eq $third && $_->[5] eq $fifth) {
print $previous_row if $previous_row;
print "\t$_->[0]";
$previous_row = '';
} else {
$previous_row = $_->[1] . "\t" . $_->[0];
$third = $_->[3];
$fifth = $_->[5];
}
}
__DATA__
1 L DIELTQSPE H EVQLQESDAELVKPGASVKISCKASGYTFTDHE
3 A ALQLTQSPSSLSAS B RITLKESGPPLVKPTCS C ELDKWAN
2 L DIVLTQSPRVT H EVQLQQSGAELVKPGASIKDTY
5 A ALQLTQSPSSLSAS B RITLKESGPPLVKPTCS C LELDKWASL
7 L SYELTQPPSVSVSPGSIT H QVQLVQSAKGSGYSFS P YNKRKAFYTTKNIIG
6 L DIQMTQIPSSLSASLSIC H EVQLQQSGVEVKMSCKASGYTFTS
9 A MPIMGSSVAVLAIL B DIVMTQSPTVTI C EVQLQQSGRGP
8 L SYELTQPPSVSVSPGRIT H EVQLVQSGAASGYSFS P NNTRKAFYATGDIIG
11 L DVVMTQTPLQ H EVKLDESVTVTSSTWPSQSITCNVAHPASSTKVDKKIE
10 A MPIMGSSVAVLAIL B DIVMTQSPTVTI C EVQLQQSGRGP
12 A DIVMTQSPDAQYYSTPYSFGQGTKLEIKR
4 A ALQLTQSPSSLSAS B RITLKESGPPLVKPTCS C ELDKWAG

Slightly different approach:
#!/usr/bin/perl
use strict;
use warnings;
my %lines; # hash with 3rd and 5th elements as key
my %first_line_per_group; # stores in which line a group appeared first
while(my $line = <>) {
# remove line break
chomp $line;
# retrieve elements form line
my #elements = split /\s+/, $line;
# ignore invalid lines
next if #elements < 5;
# build key from elements 3 and 5 (array 0-based!)
my $key = $elements[2] . " " . $elements[4];
if(! $lines{key}) {
$first_line_per_group{$key} = $elements[0];
}
push #{ $lines{$key} }, $line;
}
# output
for my $key (keys %lines) {
print $first_line_per_group{$key} . ":\n";
print " $_\n" for #{ $lines{$key} };
}

Example:
use strict;
use warnings;
{ ... }
open my $fh, '<', $file or die "can't open $file: $!";
my %hash;
# read and save it
while(my $line = <$fh>){
my #line = split /\s+/, $line;
my $key = $line[2] . ' ' . $line[4];
$hash{$key} ||= [];
push #{$hash{$key}}, $line;
}
# remove single elements
for my $key (keys %hash){
delete $hash{$key} if #{$hash{$key}} < 2;
}
print Dumper \%hash;

Your approach shows a pretty solid grasp of Perl idiom and has merit, but still is not how I would do it.
I think that you will have an easier time with this if you structure your data slightly differently: Let %hash1 be something like
(
'ALQLTQSPSSLSAS' => {
'RITLKESGPPLVKPTCS' => [3, 4, 5],
'ABCXYZ' => [93, 95, 96],
},
'MPIMGSSVAVLAIL' => {
'DIVMTQSPTVTI' => [9, 10],
},
)
where I have added a datum ABCXYZ which is not in your example to show the data structure in its fullness.

You should be using the 3-argument form of open() and you can simplify reading in the data:
open my $fh, '<', $file
or die "Cannot open '$file': $!\n";
chomp(my #rows = <$fh>);
#rows = map {[split]} #rows;
close $fh;
To group the rows, you can use a hash with the 3rd and 5th fields concatenated as the keys. Edit: You have to add a separation character to eliminate invalid results "if different lines produce the same concatenation" (Qtax). Additional data, for example, the number of the individual data rows, can be stored as the hash value. Here, the row's fields are stored:
my %groups;
for (#rows) {
push #{ $groups{$_->[2] . ' ' . $_->[4]} }, $_
if #$_ >= 4;
}
Sort out single elements:
#{ $groups{$_} } < 2 && delete $groups{$_}
for keys %groups;
greets,
Matthias

Related

Perl remove same value back to back with splice

I am trying to remove, the same values twice in an array, it is located back to back, this is my code
#{$tmp_h->{'a'}} = qw/A B B C/;
print Dumper ($tmp_h);
my $j = 0;
foreach my $cur (#{$tmp_h->{'a'}}) {
if ($cur eq 'B') {
splice(#{$tmp_h->{'a'}}, $j, 1);
}
$j++;
}
print Dumper $tmp_h;
However what got is,
$VAR1 = {
'a' => [
'A',
'B',
'B',
'C'
]
};
$VAR1 = {
'a' => [
'A',
'B',
'C'
]
};
I am expecting both 'B' to be removed in this case, what could possibly went wrong?
That code is removing from an array while iterating over it, pulling the carpet from underneath itself; is that necessary?
Instead, iterate and put elements on another array if the adjacent ones aren't equal. So iterate over the index, looking up an element and the next (or previous) one.†
I presume that B is just an example while in fact it can be any value, equal to its adjacent one.
It's interesting that regex can help too, with its simple way to find repeated patterns using backreferences
my #ary = qw(a b b c d d e f f f g);
my $str_ary = join '', #ary;
$str_ary =~ s/(.)\g{-1}//g;
my #new_ary = split //, $str_ary;
say "#new_ary"; #--> a c e f g
This removes pairs of adjacent values, so if there is an odd number of equal adjacent values it leaves the odd one (f above). As a curiosity note that it can be written in one statement
my #new_ary = split //, join('', #ary) =~ s/(.)\g{-1}//gr;
The join-ed array, forming a string, is bound to the substitution operator where /r modifier is crucial, for allowing this and returning the changed string which is then split back into a list.
To change an array in place have it assign to itself.‡
But single-letter elements are only an example, likely. With multiple characters in elements we can't join them by empty string because we wouldn't know how to split that back into an array; we have to join by something that can't be in any one element, clearly a tricky proposition. A reasonable take is a line-feed, as one can expect to know whether elements are/not multiline strings
my #ary = qw(aa no no way bah bah bah go);
my $str_ary = join "\n", #ary ;
$str_ary =~ s/([^\n]+)\n\g{-1}//g;
my #new = grep { $_ } split /\n/, $str_ary;
say "#new"; #--> aa way bah go
This would still have edge cases with interesting elements, like spaces and empty strings (but then any approach would).
† For example
use warnings;
use strict;
use feature 'say';
my #ary = qw(a b b c d d e f f f g);
my #new_ary;
my $i = 0;
while (++$i <= $#ary) {
if ($ary[$i] ne $ary[$i-1]) {
push #new_ary, $ary[$i-1]
}
else { ++$i }
}
push #new_ary, $ary[-1] if $ary[-1] ne $ary[-2];
say "#new_ary"; #--> a c e f g
‡ Done for the arrayref in the question
#{ $hr->{a} } = qw/A B B C/;
#{$hr->{a}} = split //, join('', #{$hr->{a}}) =~ s/(.)\g{-1}//gr;
say "#{$hr->{a}}"; #--> A C
The Perl documentation tells you in perlsyn under Foreach Loops:
If any part of LIST is an array, foreach will get very confused if you
add or remove elements within the loop body, for example with splice. So
don't do that.
You can iterate over the indices instead, but don't forget to not increment the index when removing a value:
#!/usr/bin/perl
use warnings;
use strict;
use Data::Dumper;
my $tmp_h = {a => [qw[ A B B C ]]};
print Dumper($tmp_h);
my $j = 0;
while ($j <= $#{ $tmp_h->{a} }) {
my $cur = $tmp_h->{a}[$j];
if ($cur eq 'B') {
splice #{ $tmp_h->{a} }, $j, 1;
} else {
++$j;
}
}
print Dumper($tmp_h);
Or start from the right so you don't have to worry:
my $j = $#{ $tmp_h->{a} };
while ($j-- >= 0) {
my $cur = $tmp_h->{a}[$j];
splice #{ $tmp_h->{a} }, $j, 1 if $cur eq 'B';
}
But the most straight forward way is to use grep:
#{ $tmp_h->{a} } = grep $_ ne 'B', #{ $tmp_h->{a} };

Perl Array handling

I have a global array containing elements such as:
#myarray = ("A","B","C","D","E");
I'm reading a column line by line which has values like:
Row1: A
Row2: Z
Row3: B C
Row4: A B C
Row5: A B C Z
Row6: A C
Row7: E
Problem 1 : If Row1 is read and has "A" which is present in #myarray -> no action required, but in case of Row2 "Z" is not a part of #myarray it should fail with some message.
Some rows have multiple elements, it should check for all, for example row3 "A","B","C" all three are part of #myarray --> no action required, but incase of Row4 it should read "A" , "B","C", then comes "Z" which is not a valid element it should fail with some message.
First, create a hash so we can easily and efficiently lookup if a value is valid.
my %ok = map { $_ => 1 } #array;
Then, it's just a question of checking if all the values are in the hash.
while (<>) {
my ($hdr, $values) = /^([^:]+):\s*(.*)/
or do {
warn("Invalid input at \"$ARGV\" line $.\n");
next;
};
my #values = split(' ', $values);
if ( my #invalid = grep { !$ok{$_} } #values ) {
warn("Invalid values (#invalid) for $hdr at \"$ARGV\" line $.\n");
next;
}
}
See if this could help you.
#!/usr/bin/perl
use strict;
use warnings;
use List::Util qw(any);
my #array = qw/A B C D E/;
while(<DATA>){
chomp($_);
print "At line -> $_\n";
my #contents = split(' ', $_);
foreach my $each_element (#contents){
if (not (any { $_ eq $each_element } #array)) {
print "$each_element -> Not exists in array\n";
}
}
}
__DATA__
A
Z
B C
A B C
A B C Z
A C
E
As suggested by #ikegami, this could also work as per the expectation:
...
my #array = qw/A B C D E/;
my %skip_hash = map { $_ => 1 } #array;
while(<DATA>){
chomp($_);
print "At line -> $_\n";
my #contents = split(' ', $_);
foreach my $each_element (#contents){
if (not ($skip_hash{$each_element})) {
print "$each_element -> Not exists in array\n";
}
}
}

Referencing an element in a 2D array in Perl

I have the following code which reads in a 6x6 array from STDIN and saves it as an array of anonymous arrays. I am trying to print out each element with $arr[i][j], but the code below isn't working. It just prints out the first element over and over. How am I not accessing the element correctly?
#!/user/bin/perl
my $arr_i = 0;
my #arr = ();
while ($arr_i < 6){
my $arr_temp = <STDIN>;
my #arr_t = split / /, $arr_temp;
chomp #arr_t;
push #arr,\#arr_t;
$arr_i++;
}
foreach my $i (0..5){
foreach my $j (0..5){
print $arr[i][j] . "\n";
}
}
i and j are not the same as the variables you declared in the foreach lines. Change:
print $arr[i][j] . "\n";
to:
print $arr[$i][$j] . "\n";
warnings alerted me to this issue. You should add these lines to all your Perl code:
use warnings;
use strict;
To demonstrate the Perlish mantra that there's "more than one way to do it":
use 5.10.0; # so can use "say"
use strict;
use warnings qw(all);
sub get_data {
my ($cols, $rows) = #_;
my ($line, #rows);
my $i;
for ($i = 1; $i <= $rows and $line = <DATA>; $i++) {
chomp $line;
my $cells = [ split ' ', $line ];
die "Row $i had ", scalar(#$cells), " instead of $cols" if #$cells != $cols;
push #rows, $cells;
}
die "Not enough rows, got ", $i - 1, "\n" if $i != $rows + 1;
\#rows;
}
sub print_data {
my ($cols, $rows, $data) = #_;
for (my $i = 0; $i < $rows; $i++) {
for (my $j = 0; $j < $cols; $j++) {
say $data->[$i][$j];
}
}
}
my $data = get_data(6, 6);
print_data(6, 6, $data);
__DATA__
1 2 3 4 5 6
a b c d e f
6 5 4 3 2 1
f e d c b a
A B C D E F
7 8 9 10 11 12
Explanation:
if we use say, that avoids unsightly print ..., "\n"
get_data is a function that can be called and/or reused, instead of just being part of the main script
get_data knows what data-shape it expects and throws an error if it doesn't get it
[ ... ] creates an anonymous array and returns a reference to it
get_data returns an array-reference so data isn't copied
print_data is a function too
both functions use a conventional for loop instead of making lists of numbers, which in Perl 5 needs to allocate memory
There is also a two-line version of the program (with surrounding bits, and test data):
use 5.10.0; # so can use "say"
my #lines = map { [ split ' ', <DATA> ] } (1..6);
map { say join ' ', map qq{"$_"}, #$_ } #lines;
__DATA__
1 2 3 4 5 6
a b c d e f
6 5 4 3 2 1
f e d c b a
A B C D E F
7 8 9 10 11 12
Explanation:
using map is the premier way to iterate over lists of things where you don't need to know how many you've seen (otherwise, a for loop is needed)
the adding of " around the cell contents is only to prove they've been processed. Otherwise the second line could just be: map { say join ' ', #$_ } #lines;

Extract information from lines and columns in PERL

I have a huge file with multiple lines and columns. Each line has many columns and many lines have the same name in the same position. E.g.
A C Z Y X
A C E J
B E K L M
What is the best way to Find all lines that share the same items in a certain position? For instance, I would like to know that there are 2 A, 2 C, 1 D, etc., all ordered by column.
I am really new to Perl, and so I am struggling a lot to advance in this so any tips are appreciated.
I got to this point:
#!/usr/local/bin/perl -w
use strict;
my $path='My:\Path\To\My\File.txt';
my $columns;
my $line;
open (FILE,$path), print "Opened!\n" or die ("Error opening");
while (<FILE>)
{
#line=split('\t',$_);
}
close FILE;
The output of this can be another TSV, that examines the file only until the 5th column, ordered from top to bottom, like:
A 2
C 2
Z 1
Y 1
E 1
J 1
B 1
E 1
K 1
L 1
Note that the first items appear first and, when shared among lines, do not show again for subsequent lines.
Edit: as per the questions in the comments, I changed the dataset and output. Note that two E appear: one belonging to the third column, the other belonging to the second column.
Edit2: Alternatively, this could also be analyzed column by column, thus showing the results in the first column, then in the second, and so on, as long as they were clearly separated. Something like
"1st" "col"
A 2
B 1
"2nd" "col"
C 2
E 1
"3rd" "col"
Z 1
E 1
K 1
"4th" "col"
Y 1
J 1
L 1
I did not fully understand the formatting of your desired output, so the below script outputs all the data from the first col on the first row, and so on. This can easily be modified to the format that you desire, but is a quick starting point to how to acummulate the data first and then processing it.
use strict;
use warnings;
use autodie;
my $path='My:\Path\To\My\File.txt';
open my $fh, '<', $path;
my #data;
# while (<$fh>) { Switch these lines when ready for real data
while (<DATA>) {
my #row = split ' ';
for my $col (0..$#row) {
$data[$col]{$row[$col]}++;
}
}
for my $coldata (#data) {
for my $letter (sort keys %$coldata) {
print "$letter $coldata->{$letter} ";
}
print "\n";
}
close $fh;
__DATA__
A C Z Y X
A C D J
B E K L M
Outputs
A 2 B 1
C 2 E 1
D 1 K 1 Z 1
J 1 L 1 Y 1
M 1 X 1
Perhaps the following will be helpful:
use strict;
use warnings;
my $path = 'My:\Path\To\My\File.txt';
my %hash;
open my $fh, '<', $path or die $!;
while (<$fh>) {
my #cols = split ' ', $_, 5;
$hash{$_}{ $cols[$_] || '' }++ for 0 .. 3;
}
close $fh;
for my $key ( sort { $a <=> $b } keys %hash ) {
print "Col ", $key + 1, "\n";
print "$_ $hash{$key}{$_}\n"
for sort { $hash{$key}->{$b} <=> $hash{$key}->{$a} } grep $_,
keys %{ $hash{$key} };
}
Output on your dataset:
Col 1
A 2
B 1
Col 2
C 2
E 1
Col 3
Z 1
K 1
E 1
Col 4
J 1
L 1
Y 1

searching two array string for equal words

I am a beginner in Perl. I have two string arrays array1 and array2. I want to check the each and every element in 2nd array. if there is i want to give a relative value one to that particular element in the 2nd array. the relative values are store in an array.I try it out but it wont work and git gives a warning like" Use of uninitialized value in string eq at pjt.pl line 52, line 3".
while($i <= (scalar #resultarray-1))
{
while ($j <= (scalar #inputsymbl-1))
{
if ($resultarray[$i] eq $inputsymbl[$j])
{
$rel[$j]=1;
$i=$i+1;
$j=0;
}
else
{
$j=$j+1;
}
}
if($j==(scalar #inputsymbl))
{
$i=$i+1;
$j=0;
}
}
try this:
my $i = 0;
my $j = 0;
## walk each array element
foreach(#resultarray) {
my $result = $_;
foreach(#inputsymbl) {
my $symbl = $_;
if ($result eq $symbl) {
$rel[$j] = 1;
$i++;
} else {
$j++;
}
}
if ($j == (scalar #inputsymbl - 1)) {
$i++;
$j = 0;
}
}
provide more informations if you need detailed help.
From your question and code, it appears that you want to flag the indexes, by using a third array, of the two array's elements that are equal. By doing this, however, you're creating a sparse array. Also, if the two arrays don't have the same number of elements, a "Use of uninitialized value in string eq..." warning will eventually occur. Given these issues, consider using the smaller index of the two arrays (done using the ternary operator below) and pushing the indexes of the equal elements onto the third array:
use strict;
use warnings;
use Data::Dumper;
my #results;
my #arr1 = qw/A B C D E F G H I J/;
my #arr2 = qw/A D C H E K L H N J P Q R S T/;
# Equal: ^ ^ ^ ^ ^
# Index: 0 2 4 7 9
for my $i ( 0 .. ( $#arr1 <= $#arr2 ? $#arr1 : $#arr2 ) ) {
push #results, $i if $arr1[$i] eq $arr2[$i];
}
print Dumper \#results;
Output:
$VAR1 = [
0,
2,
4,
7,
9
];
Hope this helps!

Resources