How do I find overlapping regex in a string? - arrays

I have this string:
my $line = "MZEFSRGGRMEAZFE*MQZEFFMAEZF*"
I want to find every substring starting with M and ending with *, without * within them. this means that the above string would give me 4 elements in my final array.
#ORF= (MZEFSRGGRMEAZFE*,MEAZFE*, MQZEFFMAEZF*,MAEZF*)
A simple regex will not do since it does not find overlapping substrings. Is there a simple way to do this?

Regular expression matching consumes the pattern as it matches - that's by design.
You can use a lookahead expression to avoid this happening PerlMonks:
Using Look-ahead and Look-behind
So something like this will work:
#!/usr/bin/env perl
use strict;
use warnings;
use Data::Dumper;
my $line = "MZEFSRGGRMEAZFE*MQZEFFMAEZF*";
my #matches = $line =~ m/(?=(M[^*]+))/g;
print Dumper \#matches;
Which gives you:
$VAR1 = [
'MZEFSRGGRMEAZFE',
'MEAZFE',
'MQZEFFMAEZF',
'MAEZF'
];

You can also use a recursive approach instead of an advanced-feature regex to do that. The program below takes each match and reparses the match, but omitting the starting M so it won't match the whole thing again.
use strict;
use warnings;
use Data::Printer;
my $line = "MZEFSRGGRMEAZFE*MQZEFFMAEZF*";
my #matches;
sub parse {
my ( $string ) = #_;
while ($string =~ m/(M[^*]+\*)/g ) {
push #matches, $1;
parse(substr $1, 1);
}
}
parse($line);
p #matches;
Here's the output:
[
[0] "MZEFSRGGRMEAZFE*",
[1] "MEAZFE*",
[2] "MQZEFFMAEZF*",
[3] "MAEZF*"
]

Related

compare an array of string with another array of strings in perl

I want to compare an array of string with another array of strings; if it matches, print matched.
Example:
#array = ("R-ID 1.0001", "RA-ID 61.02154", "TCA-ID 49.021456","RCID 61.02154","RB-ID 61.02154");
#var = ("TCA-ID 49", "R-ID 1");
for (my $x = 0; $x <= 4; $x++)
{
$array[$x] =~ /(.+?)\./;
if( ($var[0] eq $1) or ($var[1] eq $1) )
{
print "\n deleted rows are :#array\n";
}
else
{
print "printed rows are : #array \n";
push(#Matrix, \#array);
}
Then I need to compare #var with the #array; if it is matched, print the matched pattern.
Here the entire logic is in a hireartical for loop which gives a new #array in each iteration. so every time this logic is executed #array has different strings.
Then comes with #var it is user input field, this #var can be of any size. So in order to run the logic according to these constraints, I need to iterate the condition inside the if loop when the user input #var size is 3 for example.
So the goal is to match and delete the user input stings using the above mentioned logic. But unfortunately tis logic is not working. Could you please help me out in this issue.
The builtin grep keyword is a good place to start.
my $count = grep { $_ eq $var } #array;
This returns a count of items ($_) in the array which are equal (eq) to $var.
If you needed case-insensitive matching, you could use lc (or in Perl 5.16 or above, fc) to do that:
my $count = grep { lc($_) eq lc($var) } #array;
Now, a disadvantage to grep is that it is counting the matches. So after if finds the first match, it will keep on going until the end of the array. You don't seem to want that, but just want to know if any item in the array matches, in which case keeping on going might be slower than you need if it's a big array with thousands of elements.
So instead, use any from the List::Util module (which is bundled with Perl).
use List::Util qw( any );
my $matched = any { $_ eq $var } #array;
This will match as soon as it finds the first matching element, and skip searching the rest of the array.
Here is a couple of versions that allows multiple strings to be matched. Not clear what form $var takes when you want to store multiple, so assuming they are in an array #var for now.
The key point is this one is the use of the lookup hash to to the matching.
use strict;
use warnings;
my #var = ("TCA-ID 49", "RA-ID 61");
my #array = ("R-ID 1", "RA-ID 61", "TCA-ID 49");
# create a lookup for the strings to match
my %lookup = map { $_ => 1} #var ;
for my $entry (#array)
{
print "$entry\n"
if $lookup{$entry} ;
}
running gives
RA-ID 61
TCA-ID 49
Next, using a regular expression to do the matching
use strict;
use warnings;
my #var = ("TCA-ID 49", "RA-ID 61");
my #array = ("R-ID 1", "RA-ID 61", "TCA-ID 49");
my $re = join "|", map { quotemeta } #var;
print "$_\n" for grep { /^($re)$/ } #array ;
output is the same

Extract number from array in Perl

I have a array which have certain elements. Each element have two char "BC" followed by a number
e.g - "BC6"
I want to extract the number which is present and store in a different array.
use strict;
use warnings;
use Scalar::Util qw(looks_like_number);
my #band = ("BC1", "BC3");
foreach my $elem(#band)
{
my #chars = split("", $elem);
foreach my $ele (#chars) {
looks_like_number($ele) ? 'push #band_array, $ele' : '';
}
}
After execution #band_array should contain (1,3)
Can someone please tell what I'm doing wrong? I am new to perl and still learning
To do this with a regular expression, you need a very simple pattern. /BC(\d)/ should be enough. The BC is literal. The () are a capture group. They save the match inside into a variable. The first group relates to $1 in Perl. The \d is a character group for digits. That's 0-9 (and others, but that's not relevant here).
In your program, it would look like this.
use strict;
use warnings;
use Data::Dumper;
my #band = ('BC1', 'BC2');
my #numbers;
foreach my $elem (#band) {
if ($elem =~ m/BC(\d)/) {
push #numbers, $1;
}
}
print Dumper #numbers;
This program prints:
$VAR1 = '1';
$VAR2 = '2';
Note that your code had several syntax errors. The main one is that you were using #band = [ ... ], which gives you an array that contains one array reference. But your program assumed there were strings in that array.
Just incase your naming contains characters other than BC this will exctract all numeric values from your list.
use strict;
use warnings;
my #band = ("AB1", "BC2", "CD3");
foreach my $str(#band) {
$str =~ s/[^0-9]//g;
print $str;
}
First, your array is an anonymous array reference; use () for a regular array.
Then, i would use grep to filter out the values into a new array
use strict;
use warnings;
my #band = ("BC1", "BC3");
my #band_array = grep {s/BC(\d+)/$1/} #band;
$"=" , "; # make printing of array nicer
print "#band_array\n"; # print array
grep works by passing each element of an array in the code in { } , just like a sub routine. $_ for each value in the array is passed. If the code returns true then the value of $_ after the passing placed in the new array.
In this case the s/// regex returns true if a substitution is made e.g., the regex must match. Here is link for more info on grep

How to search for overlapping matches for a regex pattern within a string

I have this string
my $line = "MZEFSRGGRMEAZFE*MQZEFFMAEZF*"
and I want to find every substring starting with M and ending with * and add it to an array. This means that the above string would give me 6 elements in my array.
I have this code
foreach ( $line =~ m/M.*?\*/g ) {
push #ORF, $_;
}
but it only gives me two elements in my array since it ignores overlapping strings.
Is there any way to get all matches? I tried googling but could not find an answer.
Can use code within re and Backtracking control verbs for a little magic:
#!/usr/bin/env perl
use strict;
use warnings;
my $line = "MZEFSRGGRMEAZFE*MQZEFFMAEZF*";
local our #match;
$line =~ m/(M.*\*)(?{ push #match, $1 })(*FAIL)/;
use Data::Dump;
dd #match;
Outputs:
(
"MZEFSRGGRMEAZFE*MQZEFFMAEZF*",
"MZEFSRGGRMEAZFE*",
"MEAZFE*MQZEFFMAEZF*",
"MEAZFE*",
"MQZEFFMAEZF*",
"MAEZF*",
)
I don't believe it's possible to create a single regex pattern that will match all such substrings, because you're asking for both a greedy and a non-greedy match at the same time, and everything else in-between
I suggest you store all possible start and end positions of these substrings and use a double loop to combine all start positions with all end positions
This program demonstrates
use strict;
use warnings 'all';
use feature 'say';
my $line = 'MZEFSRGGRMEAZFE*MQZEFFMAEZF*';
my #orf;
{
my (#s, #e);
push #s, $-[0] while $line =~/M/g;
push #e, $+[0] while $line =~/\*/g;
for my $s ( #s ) {
for my $e ( #e ) {
push #orf, substr $line, $s, $e-$s if $e > $s;
}
}
}
say for #orf;
output
MZEFSRGGRMEAZFE*
MZEFSRGGRMEAZFE*MQZEFFMAEZF*
MEAZFE*
MEAZFE*MQZEFFMAEZF*
MQZEFFMAEZF*
MAEZF*

How to split the entire string into array in Perl

I'm trying to process an entire string but the way my code is written, part of it is not being processed. Here's a representation of my code:
#!/usr/bin/perl
my $string = "MAGRSHPGPLRPLLPLLVVAACVLPGAGGTCPERALERREEEAN
VVLTGTVEEILNVDPVQHTYSCKVRVWRYLKGKDLVARESLLDGGNKVVISGFGDPLI
CDNQVSTGDTRIFFVNPAPPYLWPAHKNELMLNSSLMRITLRNLEEVEFCVEDKPGTH
LRDVVVGRHPLHLLEDAVTKPELRPCPTP";
$string =~ s/\s+//g; # remove white space from string
# split the string into fragments of 58 characters and store in array
my #array = $string =~ /[A-Z]{58}/g;
my $len = scalar #array;
print $len . "\n"; # this prints 3
# print the fragments
print $array[0] . "\n";
print $array[1] . "\n";
print $array[2] . "\n";
print $array[3] . "\n";
The code outputs the following:
3
MAGRSHPGPLRPLLPLLVVAACVLPGAGGTCPERALERREEEANVVLTGTVEEILNVD
PVQHTYSCKVRVWRYLKGKDLVARESLLDGGNKVVISGFGDPLICDNQVSTGDTRIFF
VNPAPPYLWPAHKNELMLNSSLMRITLRNLEEVEFCVEDKPGTHLRDVVVGRHPLHLL
<blank space>
Notice that the rest of the string EDAVTKPELRPCPTP is not stored in #array. When I'm creating my array, how do I store EDAVTKPELRPCPTP? Perhaps I could store it in $array[3]?
You've almost got it. You need to change your regex to allow for 1 to 58 characters.
my #array = $string =~ /[A-Z]{1,58}/g;
In addition, you have an error in your script using #prot_seq instead of #array. You should always use strict to protect yourself against this sort of thing. Here's the script with strict, warnings, and 5.10 features (to get say).
#!/usr/bin/perl
use strict;
use warnings;
use v5.10;
my $string = "MAGRSHPGPLRPLLPLLVVAACVLPGAGGTCPERALERREEEAN
VVLTGTVEEILNVDPVQHTYSCKVRVWRYLKGKDLVARESLLDGGNKVVISGFGDPLI
CDNQVSTGDTRIFFVNPAPPYLWPAHKNELMLNSSLMRITLRNLEEVEFCVEDKPGTH
LRDVVVGRHPLHLLEDAVTKPELRPCPTP";
# Strip whitespace.
$string =~ s/\s+//g;
# Split the string into fragments of 58 characters or less
my #fragments = $string =~ /[A-Z]{1,58}/g;
say "Num fragments: ".scalar #fragments;
say join "\n", #fragments;
What you're missing is the ability to capture less than 58 characters. And since you only want to do that if it's the end, you can do this:
/[A-Z]{58}|[A-Z]{1,57}\z/
Which I would prefer to write like this:
/\p{Upper}{58}|\p{Upper}{1,57}\z/
However, since this expression is greedy by default, it will prefer to gather 58 characters, and only default to less when it runs out of matching input.
/\p{Upper}{1,58}/
Or, for reasons as Schwern mentions (such as avoiding any foreign letters)
/[A-Z]{1,58}/
You may prefer to use unpack, like this
$string =~ s/\s+//g;
my #fragments = unpack '(A58)*', $string;
Or if you would rather leave $string unchanged and have v5.14 or better of Perl, then you can write
my #fragments = unpack '(A58)*', $string =~ s/\s+//gr;
If you don't actually need regex character classes, this is how I'd do it:
use strict;
use warnings;
use Data::Dump;
my $string = "MAGRSHPGPLRPLLPLLVVAACVLPGAGGTCPERALERREEEAN
VVLTGTVEEILNVDPVQHTYSCKVRVWRYLKGKDLVARESLLDGGNKVVISGFGDPLI
CDNQVSTGDTRIFFVNPAPPYLWPAHKNELMLNSSLMRITLRNLEEVEFCVEDKPGTH
LRDVVVGRHPLHLLEDAVTKPELRPCPTP";
$string =~ s/\s+//g;
my #chunks;
while (length($string)) {
push(#chunks, substr($string, 0, 58, ''));
}
dd($string, \#chunks);
Output:
(
"",
[
"MAGRSHPGPLRPLLPLLVVAACVLPGAGGTCPERALERREEEANVVLTGTVEEILNVD",
"PVQHTYSCKVRVWRYLKGKDLVARESLLDGGNKVVISGFGDPLICDNQVSTGDTRIFF",
"VNPAPPYLWPAHKNELMLNSSLMRITLRNLEEVEFCVEDKPGTHLRDVVVGRHPLHLL",
"EDAVTKPELRPCPTP",
],
)

In Perl, how can I replace sequences of duplicates with one element in an array?

I have a string that I read in like:
a+c+c+b+v+f+d+d+d+c
I need to write the program so it splits at the + then deletes the duplicates so the output is:
acbvfdc
I've tried tr///cs; but I guess I'm not using it right?
#!/usr/bin/env perl
use strict; use warnings;
my #strings = qw(
a+c+c+b+v+f+d+d+d+c
alpha+bravo+bravo+bravo+charlie+delta+delta+delta+echo+delta
foo+food
bark+ark
);
for my $s (#strings) {
# Thanks #ikegami
$s =~ s/ (?<![^+]) ([^+]+) \K (?: [+] \1 )+ (?![^+]) //gx;
print "$s\n";
}
Output:
a+c+b+v+f+d+c
alpha+bravo+charlie+delta+echo+delta
foo+food
bark+ark
Now, you can split the string and have no sequences of duplicates using split /[+]/, $s because the first argument of split is a pattern.
Note to any who reads: this does not address the OP's question directly, though in my defense the question was worded ambiguously. :-) Still, it answers an interpretation of the question that others might have, so I'll leave it as-is.
Does order matter? If not, you can always try something like this:
use strict;
use warnings;
my $string = 'a+c+c+b+v+f+d+d+d+c';
# Extract unique 'words'
my #words = keys %{{map {$_ => 1} split /\+/, $string}};
print "$_\n" for #words;
Better yet, use List::MoreUtils from CPAN (which does preserve the order):
use strict;
use warnings;
use List::MoreUtils 'uniq';
my $string = 'a+c+c+b+v+f+d+d+d+c';
# Extract unique 'words'
my #words = uniq split /\+/, $string;
print "$_\n" for #words;
my $s="a+c+c+b+v+f+d+d+d+c";
$s =~ tr/+//d;
$s =~ tr/\0-\xff/\0-\xff/s;
print "$s\n"; # => acbvfdc

Resources