Related to regular expressions in Perl - arrays

I have a string written in some text file -
A[B[C[10]]]
I need to extract the arrays used in this string. For example, I have to store the information in an array:
@str = (A[], B[], C[10])
I want to accomplish this using a regex in Perl.
I want a solution that works for every case of an array inside an array, like this:
A[B[C[D[E[F[10]]]]]]
So, how do I create that @str array?

You can use a regex pattern to parse this string recursively, and call Perl code to save the intermediate values.
Like this:
use strict;
use warnings 'all';
use v5.14;
my @items;
my $i;
my $re = qr{
([A-Z]\[) (?{ $items[$i++] = $^N })
(?:
(?R)
|
(\d+) (?{ $items[$i-1] .= $^N })
)
\] (?{ $items[--$i] .= ']' })
}x;
my $s = 'A[B[C[D[E[F[10]]]]]]';
use Data::Dump;
(@items, $i) = ();
dd \@items if 'A[B[C[10]]]' =~ /$re/g;
(@items, $i) = ();
dd \@items if 'A[B[C[D[E[F[10]]]]]]' =~ /$re/g;
output
["A[]", "B[]", "C[10]"]
["A[]", "B[]", "C[]", "D[]", "E[]", "F[10]"]

Related

Extract number from array in Perl

I have an array with certain elements. Each element has the two characters "BC" followed by a number,
e.g. "BC6".
I want to extract the number and store it in a different array.
use strict;
use warnings;
use Scalar::Util qw(looks_like_number);
my @band = ("BC1", "BC3");
foreach my $elem(@band)
{
my @chars = split("", $elem);
foreach my $ele (@chars) {
looks_like_number($ele) ? 'push @band_array, $ele' : '';
}
}
After execution @band_array should contain (1,3)
Can someone please tell me what I'm doing wrong? I am new to Perl and still learning.
To do this with a regular expression, you need a very simple pattern. /BC(\d)/ should be enough. The BC is literal. The () are a capture group. They save whatever matched inside them into a variable. The first group corresponds to $1 in Perl. The \d is a character class for digits. That's 0-9 (and others, but that's not relevant here).
In your program, it would look like this.
use strict;
use warnings;
use Data::Dumper;
my @band = ('BC1', 'BC2');
my @numbers;
foreach my $elem (@band) {
    if ($elem =~ m/BC(\d)/) {
        push @numbers, $1;
    }
}
print Dumper @numbers;
This program prints:
$VAR1 = '1';
$VAR2 = '2';
Note that your code had several syntax errors. The main one is that you were using @band = [ ... ], which gives you an array that contains one array reference. But your program assumed there were strings in that array.
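The same extraction can be written without an explicit loop. This is a minimal sketch using map; the sample data is made up, and the pattern uses \d+ instead of \d on the assumption that multi-digit numbers should be captured whole.

```perl
use strict;
use warnings;

my @band = ('BC1', 'BC2', 'XY3');

# For each element, return $1 when the pattern matches and the
# empty list otherwise, so non-matching elements are skipped.
my @numbers = map { /BC(\d+)/ ? $1 : () } @band;

print "@numbers\n";    # prints "1 2"
```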
Just in case your names contain characters other than BC, this will extract all numeric values from your list.
use strict;
use warnings;
my @band = ("AB1", "BC2", "CD3");
foreach my $str(@band) {
$str =~ s/[^0-9]//g;
print $str;
}
First, your array is an anonymous array reference; use () for a regular array.
Then, I would use grep to filter the values into a new array:
use strict;
use warnings;
my @band = ("BC1", "BC3");
my @band_array = grep {s/BC(\d+)/$1/} @band;
$"=" , "; # make printing of array nicer
print "@band_array\n"; # print array
grep works by running the code in { } once for each element of the array, much like a subroutine, with $_ set to the current element. If the code returns true, the value of $_ after the code has run is placed in the new array.
In this case the s/// returns true if a substitution was made, i.e. the regex must match. See the grep documentation for more info.
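One thing to be aware of with this approach: inside grep, $_ is an alias for the original element, so s/// also changes the source array. The sketch below works on a copy to avoid that; the extra element 'XY9' and the name @copy are illustrative only.

```perl
use strict;
use warnings;

my @band = ('BC1', 'BC3', 'XY9');
my @copy = @band;    # work on a copy so @band stays intact

# s/// returns true only when a substitution happened, so 'XY9' is dropped.
my @band_array = grep { s/BC(\d+)/$1/ } @copy;

print "@band_array\n";    # prints "1 3"
print "@band\n";          # prints "BC1 BC3 XY9"
```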

How to search for overlapping matches for a regex pattern within a string

I have this string
my $line = "MZEFSRGGRMEAZFE*MQZEFFMAEZF*"
and I want to find every substring starting with M and ending with * and add it to an array. This means that the above string would give me 6 elements in my array.
I have this code
foreach ( $line =~ m/M.*?\*/g ) {
push @ORF, $_;
}
but it only gives me two elements in my array since it ignores overlapping strings.
Is there any way to get all matches? I tried googling but could not find an answer.
You can use code within the regex and backtracking control verbs for a little magic:
#!/usr/bin/env perl
use strict;
use warnings;
my $line = "MZEFSRGGRMEAZFE*MQZEFFMAEZF*";
local our @match;
$line =~ m/(M.*\*)(?{ push @match, $1 })(*FAIL)/;
use Data::Dump;
dd @match;
Outputs:
(
"MZEFSRGGRMEAZFE*MQZEFFMAEZF*",
"MZEFSRGGRMEAZFE*",
"MEAZFE*MQZEFFMAEZF*",
"MEAZFE*",
"MQZEFFMAEZF*",
"MAEZF*",
)
I don't believe it's possible to create a single regex pattern that will match all such substrings, because you're asking for both a greedy and a non-greedy match at the same time, and everything else in between.
I suggest you store all possible start and end positions of these substrings and use a double loop to combine all start positions with all end positions.
This program demonstrates:
use strict;
use warnings 'all';
use feature 'say';
my $line = 'MZEFSRGGRMEAZFE*MQZEFFMAEZF*';
my @orf;
{
    my (@s, @e);
    push @s, $-[0] while $line =~/M/g;
    push @e, $+[0] while $line =~/\*/g;
    for my $s ( @s ) {
        for my $e ( @e ) {
            push @orf, substr $line, $s, $e-$s if $e > $s;
        }
    }
}
say for @orf;
output
MZEFSRGGRMEAZFE*
MZEFSRGGRMEAZFE*MQZEFFMAEZF*
MEAZFE*
MEAZFE*MQZEFFMAEZF*
MQZEFFMAEZF*
MAEZF*

How do I find overlapping regex in a string?

I have this string:
my $line = "MZEFSRGGRMEAZFE*MQZEFFMAEZF*"
I want to find every substring starting with M and ending with *, without any * inside it. This means that the above string would give me 4 elements in my final array.
@ORF = (MZEFSRGGRMEAZFE*, MEAZFE*, MQZEFFMAEZF*, MAEZF*)
A simple regex will not do since it does not find overlapping substrings. Is there a simple way to do this?
Regular expression matching consumes the string as it matches; that's by design.
You can use a lookahead expression to avoid this. See the PerlMonks tutorial "Using Look-ahead and Look-behind".
So something like this will work:
#!/usr/bin/env perl
use strict;
use warnings;
use Data::Dumper;
my $line = "MZEFSRGGRMEAZFE*MQZEFFMAEZF*";
my @matches = $line =~ m/(?=(M[^*]+))/g;
print Dumper \@matches;
Which gives you:
$VAR1 = [
'MZEFSRGGRMEAZFE',
'MEAZFE',
'MQZEFFMAEZF',
'MAEZF'
];
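Note that the output above leaves out the terminating *. If the * should be part of each element, as in the question, the capture inside the lookahead can be extended to include it. A minimal sketch of that variant (the array name @orf is just illustrative):

```perl
use strict;
use warnings;

my $line = 'MZEFSRGGRMEAZFE*MQZEFFMAEZF*';

# The lookahead consumes nothing, so the next attempt starts one
# character further on and overlapping candidates are all found.
# The capture now also requires, and keeps, the closing '*'.
my @orf = $line =~ /(?=(M[^*]*\*))/g;

print "$_\n" for @orf;
# prints:
# MZEFSRGGRMEAZFE*
# MEAZFE*
# MQZEFFMAEZF*
# MAEZF*
```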
You can also use a recursive approach instead of an advanced-feature regex to do that. The program below takes each match and reparses the match, but omitting the starting M so it won't match the whole thing again.
use strict;
use warnings;
use Data::Printer;
my $line = "MZEFSRGGRMEAZFE*MQZEFFMAEZF*";
my @matches;
sub parse {
    my ( $string ) = @_;
    while ($string =~ m/(M[^*]+\*)/g ) {
        push @matches, $1;
        parse(substr $1, 1);
    }
}
parse($line);
p @matches;
Here's the output:
[
[0] "MZEFSRGGRMEAZFE*",
[1] "MEAZFE*",
[2] "MQZEFFMAEZF*",
[3] "MAEZF*"
]

How to split the entire string into array in Perl

I'm trying to process an entire string but the way my code is written, part of it is not being processed. Here's a representation of my code:
#!/usr/bin/perl
my $string = "MAGRSHPGPLRPLLPLLVVAACVLPGAGGTCPERALERREEEAN
VVLTGTVEEILNVDPVQHTYSCKVRVWRYLKGKDLVARESLLDGGNKVVISGFGDPLI
CDNQVSTGDTRIFFVNPAPPYLWPAHKNELMLNSSLMRITLRNLEEVEFCVEDKPGTH
LRDVVVGRHPLHLLEDAVTKPELRPCPTP";
$string =~ s/\s+//g; # remove white space from string
# split the string into fragments of 58 characters and store in array
my @array = $string =~ /[A-Z]{58}/g;
my $len = scalar @array;
print $len . "\n"; # this prints 3
# print the fragments
print $array[0] . "\n";
print $array[1] . "\n";
print $array[2] . "\n";
print $array[3] . "\n";
The code outputs the following:
3
MAGRSHPGPLRPLLPLLVVAACVLPGAGGTCPERALERREEEANVVLTGTVEEILNVD
PVQHTYSCKVRVWRYLKGKDLVARESLLDGGNKVVISGFGDPLICDNQVSTGDTRIFF
VNPAPPYLWPAHKNELMLNSSLMRITLRNLEEVEFCVEDKPGTHLRDVVVGRHPLHLL
<blank space>
Notice that the rest of the string EDAVTKPELRPCPTP is not stored in @array. When I'm creating my array, how do I store EDAVTKPELRPCPTP? Perhaps I could store it in $array[3]?
You've almost got it. You need to change your regex to allow for 1 to 58 characters.
my @array = $string =~ /[A-Z]{1,58}/g;
In addition, you have an error in your script using @prot_seq instead of @array. You should always use strict to protect yourself against this sort of thing. Here's the script with strict, warnings, and 5.10 features (to get say).
#!/usr/bin/perl
use strict;
use warnings;
use v5.10;
my $string = "MAGRSHPGPLRPLLPLLVVAACVLPGAGGTCPERALERREEEAN
VVLTGTVEEILNVDPVQHTYSCKVRVWRYLKGKDLVARESLLDGGNKVVISGFGDPLI
CDNQVSTGDTRIFFVNPAPPYLWPAHKNELMLNSSLMRITLRNLEEVEFCVEDKPGTH
LRDVVVGRHPLHLLEDAVTKPELRPCPTP";
# Strip whitespace.
$string =~ s/\s+//g;
# Split the string into fragments of 58 characters or less
my @fragments = $string =~ /[A-Z]{1,58}/g;
say "Num fragments: ".scalar @fragments;
say join "\n", @fragments;
What you're missing is the ability to capture less than 58 characters. And since you only want to do that if it's the end, you can do this:
/[A-Z]{58}|[A-Z]{1,57}\z/
Which I would prefer to write like this:
/\p{Upper}{58}|\p{Upper}{1,57}\z/
However, since this expression is greedy by default, it will prefer to gather 58 characters, and only default to less when it runs out of matching input.
/\p{Upper}{1,58}/
Or, for reasons as Schwern mentions (such as avoiding any foreign letters)
/[A-Z]{1,58}/
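To see the greedy behaviour on a small scale, here is a sketch with a short string and a chunk size of 4 instead of 58. The sub name chunks and its $size parameter are hypothetical and not part of the original script.

```perl
use strict;
use warnings;

# Hypothetical helper: split $string into pieces of at most $size
# uppercase letters. The quantifier is greedy, so every piece is
# full length except possibly the last one.
sub chunks {
    my ( $string, $size ) = @_;
    return $string =~ /[A-Z]{1,$size}/g;
}

my @parts = chunks( 'ABCDEFGHIJ', 4 );
print "@parts\n";    # prints "ABCD EFGH IJ"
```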
You may prefer to use unpack, like this
$string =~ s/\s+//g;
my @fragments = unpack '(A58)*', $string;
Or if you would rather leave $string unchanged and have v5.14 or better of Perl, then you can write
my @fragments = unpack '(A58)*', $string =~ s/\s+//gr;
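A small-scale sketch of the unpack approach, with a chunk size of 4 instead of 58. It also illustrates one detail worth knowing: when unpacking, the A format strips trailing whitespace from each chunk while a does not. That makes no difference once the whitespace has already been removed, as it is here. The sample strings are made up.

```perl
use strict;
use warnings;

# '(A4)*' repeats the group as often as the input allows;
# the last chunk may be shorter than 4 characters.
my @parts = unpack '(A4)*', 'ABCDEFGHIJ';
print join( '|', @parts ), "\n";       # prints "ABCD|EFGH|IJ"

# 'A' strips trailing spaces from each chunk, 'a' keeps them.
my @stripped = unpack '(A4)*', 'AB  CD';
my @kept     = unpack '(a4)*', 'AB  CD';
print join( '|', @stripped ), "\n";    # prints "AB|CD"
print join( '|', @kept ), "\n";        # prints "AB  |CD"
```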
If you don't actually need regex character classes, this is how I'd do it:
use strict;
use warnings;
use Data::Dump;
my $string = "MAGRSHPGPLRPLLPLLVVAACVLPGAGGTCPERALERREEEAN
VVLTGTVEEILNVDPVQHTYSCKVRVWRYLKGKDLVARESLLDGGNKVVISGFGDPLI
CDNQVSTGDTRIFFVNPAPPYLWPAHKNELMLNSSLMRITLRNLEEVEFCVEDKPGTH
LRDVVVGRHPLHLLEDAVTKPELRPCPTP";
$string =~ s/\s+//g;
my @chunks;
while (length($string)) {
    push(@chunks, substr($string, 0, 58, ''));
}
dd($string, \@chunks);
Output:
(
"",
[
"MAGRSHPGPLRPLLPLLVVAACVLPGAGGTCPERALERREEEANVVLTGTVEEILNVD",
"PVQHTYSCKVRVWRYLKGKDLVARESLLDGGNKVVISGFGDPLICDNQVSTGDTRIFF",
"VNPAPPYLWPAHKNELMLNSSLMRITLRNLEEVEFCVEDKPGTHLRDVVVGRHPLHLL",
"EDAVTKPELRPCPTP",
],
)

Comparing two strings line by line in Perl

I am looking for code in Perl similar to
my @lines1 = split /\n/, $str1;
my @lines2 = split /\n/, $str2;
for (int $i=0; $i<lines1.length; $i++)
{
if (lines1[$i] ~= lines2[$i])
print "difference in line $i \n";
}
To compare two strings line by line and show the lines at which there is any difference.
I know what I have written is mixture of C/Perl/Pseudo-code. How do I write it in the way that it works on Perl?
What you have written is close, but several things are not valid Perl: lines1.length and int $i do not exist, and ~= is not an operator. You probably mean =~, but that is the wrong tool here. Also, if must be followed by a block { }.
What you want is simply $i < @lines1 to compare against the array size, my $i to declare a lexical variable, and eq for string comparison (or ne to test for a difference), along with if ( ... ) { ... }.
Technically you can use the binding operator to perform a string comparison, for example:
"foo" =~ "foobar"
But it is not a good idea when comparing literal strings, because you can get partial matches, and you need to escape meta characters. Therefore it is easier to just use eq.
Using C-style for loops is valid, but the more Perl-ish way is to use this notation:
for my $i (0 .. $#lines1)
Which will iterate over the range 0 to the max index of the array.
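Putting these pieces together, a minimal sketch might look like the following. The sub name differing_lines is hypothetical, and treating a line that is missing from the shorter string as an empty string is an assumption; the question does not say how strings of different lengths should be handled.

```perl
use strict;
use warnings;

# Hypothetical helper: return the indexes of the lines that differ.
# Assumption: a line missing from the shorter string counts as ''.
sub differing_lines {
    my ( $str1, $str2 ) = @_;
    my @lines1 = split /\n/, $str1;
    my @lines2 = split /\n/, $str2;
    my $last   = $#lines1 > $#lines2 ? $#lines1 : $#lines2;

    my @diff;
    for my $i ( 0 .. $last ) {
        my $l1 = defined $lines1[$i] ? $lines1[$i] : '';
        my $l2 = defined $lines2[$i] ? $lines2[$i] : '';
        push @diff, $i if $l1 ne $l2;
    }
    return @diff;
}

print "difference in line $_\n" for differing_lines( "a\nb\nc", "a\nx\nc\nd" );
# prints:
# difference in line 1
# difference in line 3
```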
Perl allows you to open filehandles on strings by using a reference to the scalar variable that holds the string:
open my $string1_fh, '<', \$string1 or die '...';
open my $string2_fh, '<', \$string2 or die '...';
while( my $line1 = <$string1_fh> ) {
my $line2 = <$string2_fh>;
....
}
But, depending on what you mean by difference (does that include insertion or deletion of lines?), you might want something different.
There are several modules on CPAN that you can inspect for ideas, such as Test::LongString or Algorithm::Diff.
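The filehandle approach could be fleshed out roughly as follows. This is only a sketch: the sub name first_difference and its return convention (1-based line number, 0 for no difference) are assumptions, and lines are compared including their trailing newline.

```perl
use strict;
use warnings;

# Hypothetical helper: return the 1-based number of the first differing
# line, or 0 if the two strings are identical line by line.
sub first_difference {
    my ( $string1, $string2 ) = @_;
    open my $fh1, '<', \$string1 or die "Cannot open first string: $!";
    open my $fh2, '<', \$string2 or die "Cannot open second string: $!";

    my $n = 0;
    while ( defined( my $line1 = <$fh1> ) ) {
        $n++;
        my $line2 = <$fh2>;
        return $n if !defined $line2 or $line1 ne $line2;
    }

    # The first string is exhausted; a leftover line in the second
    # one counts as a difference.
    my $extra = <$fh2>;
    return defined $extra ? $n + 1 : 0;
}

print first_difference( "a\nb\n", "a\nc\n" ), "\n";    # prints 2
```

This only reports positional differences. If inserted or deleted lines should be detected as such, a diff module is the better fit.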
my @lines1 = split(/^/, $str1);
my @lines2 = split(/^/, $str2);
# splits at start of line, keeping the newline on each line
# use /\n/ if you want to drop the newlines and any trailing empty lines
for (my $i = 0; $i < @lines1; $i++) {
    print "difference in line $i \n" if ($lines1[$i] ne $lines2[$i]);
}
Comparing arrays is much easier if you build a hash from them:
# Searching for the difference
# (@array1 and @array2 are the two arrays to compare)
my @isect = ();
my @diff = ();
my %count = ();
foreach my $item ( @array1, @array2 ) { $count{$item}++; }
foreach my $item ( keys %count ) {
    if ( $count{$item} == 2 ) {
        push @isect, $item;
    }
    else {
        push @diff, $item;
    }
}
# Output
print "Different= @diff\n\n";
print "\nA Array = @array1\n";
print "\nB Array = @array2\n";
print "\nIntersect Array = @isect\n";
Even after splitting, you can compare the lines as arrays.

Resources