compare an array of string with another array of strings in perl - arrays

I want to compare an array of string with another array of strings; if it matches, print matched.
Example:
#array = ("R-ID 1.0001", "RA-ID 61.02154", "TCA-ID 49.021456","RCID 61.02154","RB-ID 61.02154");
#var = ("TCA-ID 49", "R-ID 1");
for (my $x = 0; $x <= 4; $x++)
{
$array[$x] =~ /(.+?)\./;
if( ($var[0] eq $1) or ($var[1] eq $1) )
{
print "\n deleted rows are :#array\n";
}
else
{
print "printed rows are : #array \n";
push(#Matrix, \#array);
}
Then I need to compare #var with the #array; if it is matched, print the matched pattern.
Here the entire logic is in a hireartical for loop which gives a new #array in each iteration. so every time this logic is executed #array has different strings.
Then comes with #var it is user input field, this #var can be of any size. So in order to run the logic according to these constraints, I need to iterate the condition inside the if loop when the user input #var size is 3 for example.
So the goal is to match and delete the user input stings using the above mentioned logic. But unfortunately tis logic is not working. Could you please help me out in this issue.

The builtin grep keyword is a good place to start.
my $count = grep { $_ eq $var } #array;
This returns a count of items ($_) in the array which are equal (eq) to $var.
If you needed case-insensitive matching, you could use lc (or in Perl 5.16 or above, fc) to do that:
my $count = grep { lc($_) eq lc($var) } #array;
Now, a disadvantage to grep is that it is counting the matches. So after if finds the first match, it will keep on going until the end of the array. You don't seem to want that, but just want to know if any item in the array matches, in which case keeping on going might be slower than you need if it's a big array with thousands of elements.
So instead, use any from the List::Util module (which is bundled with Perl).
use List::Util qw( any );
my $matched = any { $_ eq $var } #array;
This will match as soon as it finds the first matching element, and skip searching the rest of the array.

Here is a couple of versions that allows multiple strings to be matched. Not clear what form $var takes when you want to store multiple, so assuming they are in an array #var for now.
The key point is this one is the use of the lookup hash to to the matching.
use strict;
use warnings;
my #var = ("TCA-ID 49", "RA-ID 61");
my #array = ("R-ID 1", "RA-ID 61", "TCA-ID 49");
# create a lookup for the strings to match
my %lookup = map { $_ => 1} #var ;
for my $entry (#array)
{
print "$entry\n"
if $lookup{$entry} ;
}
running gives
RA-ID 61
TCA-ID 49
Next, using a regular expression to do the matching
use strict;
use warnings;
my #var = ("TCA-ID 49", "RA-ID 61");
my #array = ("R-ID 1", "RA-ID 61", "TCA-ID 49");
my $re = join "|", map { quotemeta } #var;
print "$_\n" for grep { /^($re)$/ } #array ;
output is the same

Related

Extract number from array in Perl

I have a array which have certain elements. Each element have two char "BC" followed by a number
e.g - "BC6"
I want to extract the number which is present and store in a different array.
use strict;
use warnings;
use Scalar::Util qw(looks_like_number);
my #band = ("BC1", "BC3");
foreach my $elem(#band)
{
my #chars = split("", $elem);
foreach my $ele (#chars) {
looks_like_number($ele) ? 'push #band_array, $ele' : '';
}
}
After execution #band_array should contain (1,3)
Can someone please tell what I'm doing wrong? I am new to perl and still learning
To do this with a regular expression, you need a very simple pattern. /BC(\d)/ should be enough. The BC is literal. The () are a capture group. They save the match inside into a variable. The first group relates to $1 in Perl. The \d is a character group for digits. That's 0-9 (and others, but that's not relevant here).
In your program, it would look like this.
use strict;
use warnings;
use Data::Dumper;
my #band = ('BC1', 'BC2');
my #numbers;
foreach my $elem (#band) {
if ($elem =~ m/BC(\d)/) {
push #numbers, $1;
}
}
print Dumper #numbers;
This program prints:
$VAR1 = '1';
$VAR2 = '2';
Note that your code had several syntax errors. The main one is that you were using #band = [ ... ], which gives you an array that contains one array reference. But your program assumed there were strings in that array.
Just incase your naming contains characters other than BC this will exctract all numeric values from your list.
use strict;
use warnings;
my #band = ("AB1", "BC2", "CD3");
foreach my $str(#band) {
$str =~ s/[^0-9]//g;
print $str;
}
First, your array is an anonymous array reference; use () for a regular array.
Then, i would use grep to filter out the values into a new array
use strict;
use warnings;
my #band = ("BC1", "BC3");
my #band_array = grep {s/BC(\d+)/$1/} #band;
$"=" , "; # make printing of array nicer
print "#band_array\n"; # print array
grep works by passing each element of an array in the code in { } , just like a sub routine. $_ for each value in the array is passed. If the code returns true then the value of $_ after the passing placed in the new array.
In this case the s/// regex returns true if a substitution is made e.g., the regex must match. Here is link for more info on grep

How to search for overlapping matches for a regex pattern within a string

I have this string
my $line = "MZEFSRGGRMEAZFE*MQZEFFMAEZF*"
and I want to find every substring starting with M and ending with * and add it to an array. This means that the above string would give me 6 elements in my array.
I have this code
foreach ( $line =~ m/M.*?\*/g ) {
push #ORF, $_;
}
but it only gives me two elements in my array since it ignores overlapping strings.
Is there any way to get all matches? I tried googling but could not find an answer.
Can use code within re and Backtracking control verbs for a little magic:
#!/usr/bin/env perl
use strict;
use warnings;
my $line = "MZEFSRGGRMEAZFE*MQZEFFMAEZF*";
local our #match;
$line =~ m/(M.*\*)(?{ push #match, $1 })(*FAIL)/;
use Data::Dump;
dd #match;
Outputs:
(
"MZEFSRGGRMEAZFE*MQZEFFMAEZF*",
"MZEFSRGGRMEAZFE*",
"MEAZFE*MQZEFFMAEZF*",
"MEAZFE*",
"MQZEFFMAEZF*",
"MAEZF*",
)
I don't believe it's possible to create a single regex pattern that will match all such substrings, because you're asking for both a greedy and a non-greedy match at the same time, and everything else in-between
I suggest you store all possible start and end positions of these substrings and use a double loop to combine all start positions with all end positions
This program demonstrates
use strict;
use warnings 'all';
use feature 'say';
my $line = 'MZEFSRGGRMEAZFE*MQZEFFMAEZF*';
my #orf;
{
my (#s, #e);
push #s, $-[0] while $line =~/M/g;
push #e, $+[0] while $line =~/\*/g;
for my $s ( #s ) {
for my $e ( #e ) {
push #orf, substr $line, $s, $e-$s if $e > $s;
}
}
}
say for #orf;
output
MZEFSRGGRMEAZFE*
MZEFSRGGRMEAZFE*MQZEFFMAEZF*
MEAZFE*
MEAZFE*MQZEFFMAEZF*
MQZEFFMAEZF*
MAEZF*

Comparing two strings line by line in Perl

I am looking for code in Perl similar to
my #lines1 = split /\n/, $str1;
my #lines2 = split /\n/, $str2;
for (int $i=0; $i<lines1.length; $i++)
{
if (lines1[$i] ~= lines2[$i])
print "difference in line $i \n";
}
To compare two strings line by line and show the lines at which there is any difference.
I know what I have written is mixture of C/Perl/Pseudo-code. How do I write it in the way that it works on Perl?
What you have written is sort of ok, except you cannot use that notation in Perl lines1.length, int $i, and ~= is not an operator, you mean =~, but that is the wrong tool here. Also if must have a block { } after it.
What you want is simply $i < #lines1 to get the array size, my $i to declare a lexical variable, and eq for string comparison. Along with if ( ... ) { ... }.
Technically you can use the binding operator to perform a string comparison, for example:
"foo" =~ "foobar"
But it is not a good idea when comparing literal strings, because you can get partial matches, and you need to escape meta characters. Therefore it is easier to just use eq.
Using C-style for loops is valid, but the more Perl-ish way is to use this notation:
for my $i (0 .. $#lines1)
Which will iterate over the range 0 to the max index of the array.
Perl allows you to open filehandles on strings by using a reference to the scalar variable that holds the string:
open my $string1_fh, '<', \$string1 or die '...';
open my $string2_fh, '<', \$string2 or die '...';
while( my $line1 = <$string1_fh> ) {
my $line2 = <$string2_fh>;
....
}
But, depending on what you mean by difference (does that include insertion or deletion of lines?), you might want something different.
There are several modules on CPAN that you can inspect for ideas, such as Test::LongString or Algorithm::Diff.
my #lines1 = split(/^/, $str1);
my #lines2 = split(/^/, $str2);
# splits at start of line
# use /\n/ if you want to ignore newline and trailing spaces
for ($i=0; $i < #lines1; $i++) {
print "difference in line $i \n" if (lines1[$i] ne lines2[$i]);
}
Comparing Arrays is a way easier if you create a Hashmap out of it...
#Searching the difference
#isect = ();
#diff = ();
%count = ();
foreach $item ( #array1, #array2 ) { $count{$item}++; }
foreach $item ( keys %count ) {
if ( $count{$item} == 2 ) {
push #isect, $item;
}
else {
push #diff, $item;
}
}
#Output
print "Different= #diff\n\n";
print "\nA Array = #array1\n";
print "\nB Array = #array2\n";
print "\nIntersect Array = #isect\n";
Even after spliting you could compare them as Array.

"Use of uninitialized value" when indexing an array

I get the following error from Perl when trying to run the code below
Use of uninitialized value within #words in concatenation (.) or string...
It references the line where I try to create an array made up of three-word sequences (the line that starts with $trigrams). Can anyone help me figure out the problem?
my %hash;
my #words;
my $word;
my #trigrams;
my $i = 0;
while (<>) {
#words = split;
foreach $word (#words) {
$hash{$word}++;
# Now trying to create the distinct six-grams in the 10-K.
$trigrams[$i] = join " ", $words[$i], $words[$i + 1], $words[$i + 2];
print "$words[$i]\n";
$i++;
}
}
All that is happening is that you are falling off the end of the array #words. You are executing the loop for each element of #words, so the value of $i goes from 0 to $#words, or the index of the final element of the array. So the line
join " ", $words[$i], $words[$i + 1], $words[$i + 2];
accesses the last element of the array $words[$i] and two elements beyond that which don't exist.
In this case, as with any loop which uses the current index of an array, it is easiest to iterate over the array indices instead of the contents. For the join to be valid you need to start at zero and stop at two elements before the end, so 0 .. $#words-2.
It is also neater to use an array slice to select the three elements for the trigram, and use the fact that interpolating an array into a string, as in "#array", will do the same as join ' ', #array. (More precisely, it does join $", #array, and $" is set to a single space by default.)
I suggest this fix. It is essential to use strict and use warnings at the start of every Perl program, and you should declare all your variables using my as late as possible.
use strict;
use warnings;
my %hash;
while (<>) {
my #words = split;
my #trigrams;
for my $i (0 .. $#words - 2) {
my $word = $words[$i];
++$hash{$word};
$trigrams[$i] = "#words[$i,$i+1,$i+2]";
print "$word\n";
}
}
Update
You may prefer this if it isn't too terse for you
use strict;
use warnings;
my %hash;
while (<>) {
my #words = split;
my #trigrams = map "#words[$_,$_+1,$_+2]", 0 .. $#words-2;
}

Remove elements from an array which have a substring that is itself an element of the array

In Perl, I'd like to remove all elements from an array where another element of the same array is a non-empty substring of said element.
Say I have the array
#itemlist = ("abcde", "ab", "khi", "jklm");
In this instance I would like to have the element "abcde" removed, because "ab" is a substring of "abcde".
I could make a copy of the array (maybe as a hash?), iterate over it, try to index with every element of the original array and remove it, but there has to be a more elegant way, no?
Thanks for your help!
Edited for clarity a bit.
You could construct a regex from all the items and throw out anything that matches:
$alternation = join('|', map(quotemeta, #itemlist));
#itemlist = grep !/($alternation).|.($alternation)/, #itemlist;
The ().|.() thing just ensures that an item doesn't match itself.
Well, I wouldn't call this elegant, but here goes:
#!usr/bin/perl
use strict;
use warnings;
my #itemlist = ("abcde", "ab", "khi", "jklm");
#itemlist = grep {
#itemlist ~~ sub {$_ !~ /(?:.\Q$_[0]\E|\Q$_[0]\E.)/}
} #itemlist;
print "#itemlist";
It relies on a rather obscure behavior of smart match: if the left argument is an array and the right argument a sub, it calls the sub for each element, and the final result is true only if the sub returns true for each element.
Explanation: for each element of the array, it checks that no other element is a substring of that element (requiring at least one additional character so that elements won't match themselves).
Note: wdebeaum's answer is probably the one I would prefer in the real world. Still, it is kind of interesting the strange things one can do with smart match.
wdebeaum's answer is the solution to use, not the one below, but I learned something by doing it and perhaps someone else will too. After I had written mine I decided to test it on lists of several thousand elements.
b.pl:
#!/usr/bin/perl
use strict;
use warnings;
my #itemlist = <>;
for(#itemlist) { chomp; }
my $regex;
if(defined $ENV{wdebeaum}) {
# wdebeaum's solution
my $alternation = join('|', map(quotemeta, #itemlist));
$regex = qr/(?:$alternation).|.(?:$alternation)/;
} else {
# my solution
$regex = join "|", map {qq{(?:\Q$_\E.)|(?:.\Q$_\E)}} #itemlist;
}
my #result = grep !/$regex/, #itemlist;
print scalar #itemlist, "\t", scalar #result, "\n";
I generated a list of 5000 random words.
sort -R /usr/share/dict/american-english|head -5000 > some-words
For small lists both solutions seem fine.
$ time head -200 some-words | wdebeaum=1 ./b.pl
200 198
real 0m0.012s
user 0m0.004s
sys 0m0.004s
$ time head -200 some-words | ./b.pl
200 198
real 0m0.068s
user 0m0.060s
sys 0m0.004s
But for larger lists, wdebeaum's is clearly better.
$ time cat some-words | wdebeaum=1 ./b.pl
5000 1947
real 0m0.068s
user 0m0.064s
sys 0m0.000s
$ time cat some-words | ./b.pl
5000 1947
real 0m8.305s
user 0m8.277s
sys 0m0.012s
I think the reason for the difference, is that even though both regular expressions have the same number of possible paths, my regex has more paths that have to be tried, since it has the same number of .s as paths, while wdebebaum's has only two.
You can use a hash to count substrings of all the words. Any word in the list that has a higher count than one is then a substring of another word. The minimum length of the substrings is two in this example:
use strict;
use warnings;
use feature 'say';
my #list = qw(abcde ab foo foobar de oba cd xs);
my %count;
for my $word (#list) {
my $len = length $word;
$count{$word}++;
for my $start (0 .. $len - 2) {
for my $long (2 .. $len - 2) {
my $sub = substr($word, $start, $long);
$count{$sub}++;
}
}
}
say for grep $count{$_} == 1, #list;
Output:
abcde
foobar
xs
The following will remove the substring from the array.
#!/usr/bin/perl
use strict;
use warnings;
my #ar=("asl","pwe","jsl","nxu","sl","baks","ak");
foreach my $i (#ar){
my $p = grep /$i/, #ar;
if ( $p == 1 ){
print "$i" , "\n";
}
}
I had the inverse problem: removing from the list strings which are substrings of other strings. Here is my not-too-elegant solution.
sub remove_substrings_from_list {
my #list = #_;
my #vals_without_superstrings;
my %hash_of_others;
for ( 0 .. $#list ) {
my $a = shift #list;
$hash_of_others{$a} = [ #list ];
push #list, $a;
}
foreach my $k ( keys %hash_of_others ) {
push #vals_without_superstrings, $k unless grep { index( $_, $k ) != -1 } #{ $hash_of_others{$k} };
}
return #vals_without_superstrings;
}

Resources