How to split an entire string into an array in Perl

I'm trying to process an entire string but the way my code is written, part of it is not being processed. Here's a representation of my code:
#!/usr/bin/perl
my $string = "MAGRSHPGPLRPLLPLLVVAACVLPGAGGTCPERALERREEEAN
VVLTGTVEEILNVDPVQHTYSCKVRVWRYLKGKDLVARESLLDGGNKVVISGFGDPLI
CDNQVSTGDTRIFFVNPAPPYLWPAHKNELMLNSSLMRITLRNLEEVEFCVEDKPGTH
LRDVVVGRHPLHLLEDAVTKPELRPCPTP";
$string =~ s/\s+//g; # remove white space from string
# split the string into fragments of 58 characters and store in array
my @array = $string =~ /[A-Z]{58}/g;
my $len = scalar @array;
print $len . "\n"; # this prints 3
# print the fragments
print $array[0] . "\n";
print $array[1] . "\n";
print $array[2] . "\n";
print $array[3] . "\n";
The code outputs the following:
3
MAGRSHPGPLRPLLPLLVVAACVLPGAGGTCPERALERREEEANVVLTGTVEEILNVD
PVQHTYSCKVRVWRYLKGKDLVARESLLDGGNKVVISGFGDPLICDNQVSTGDTRIFF
VNPAPPYLWPAHKNELMLNSSLMRITLRNLEEVEFCVEDKPGTHLRDVVVGRHPLHLL
<blank space>
Notice that the rest of the string, EDAVTKPELRPCPTP, is not stored in @array. When I'm creating my array, how do I store EDAVTKPELRPCPTP? Perhaps I could store it in $array[3]?

You've almost got it. You need to change your regex to allow for 1 to 58 characters.
my @array = $string =~ /[A-Z]{1,58}/g;
In addition, you have an error in your script: it uses @prot_seq instead of @array. You should always use strict to protect yourself against this sort of thing. Here's the script with strict, warnings, and 5.10 features (to get say).
#!/usr/bin/perl
use strict;
use warnings;
use v5.10;
my $string = "MAGRSHPGPLRPLLPLLVVAACVLPGAGGTCPERALERREEEAN
VVLTGTVEEILNVDPVQHTYSCKVRVWRYLKGKDLVARESLLDGGNKVVISGFGDPLI
CDNQVSTGDTRIFFVNPAPPYLWPAHKNELMLNSSLMRITLRNLEEVEFCVEDKPGTH
LRDVVVGRHPLHLLEDAVTKPELRPCPTP";
# Strip whitespace.
$string =~ s/\s+//g;
# Split the string into fragments of 58 characters or less
my @fragments = $string =~ /[A-Z]{1,58}/g;
say "Num fragments: ".scalar @fragments;
say join "\n", @fragments;

What you're missing is the ability to capture fewer than 58 characters. Since you only want to do that at the end of the string, you can do this:
/[A-Z]{58}|[A-Z]{1,57}\z/
Which I would prefer to write like this:
/\p{Upper}{58}|\p{Upper}{1,57}\z/
However, quantifiers are greedy by default, so a single range quantifier will prefer to gather 58 characters and only settle for fewer when it runs out of matching input. That makes the alternation unnecessary:
/\p{Upper}{1,58}/
Or, for the reasons Schwern mentions (such as avoiding any foreign letters):
/[A-Z]{1,58}/
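A small sketch of why the greedy range quantifier is enough. It uses a made-up 10-letter string and a width of 4 instead of 58 so the result is easy to check by eye; both the string and the width are illustrative assumptions, not taken from the question.

```perl
use strict;
use warnings;

# Illustrative input: 10 letters, chunk width 4 (stand-ins for the real sequence and 58).
my $demo = 'ABCDEFGHIJ';

# Exact-width quantifier: the trailing 2 letters never match.
my @exact = $demo =~ /[A-Z]{4}/g;

# Greedy range quantifier: takes 4 whenever it can, fewer only at the end.
my @greedy = $demo =~ /[A-Z]{1,4}/g;

print "exact:  @exact\n";     # exact:  ABCD EFGH
print "greedy: @greedy\n";    # greedy: ABCD EFGH IJ
```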

You may prefer to use unpack, like this
$string =~ s/\s+//g;
my @fragments = unpack '(A58)*', $string;
Or, if you would rather leave $string unchanged and have Perl v5.14 or later, you can write
my @fragments = unpack '(A58)*', $string =~ s/\s+//gr;
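To see what the '(A58)*' template does, here is a minimal sketch with a width of 4 and an invented 10-letter string (both are assumptions chosen for readability). The parenthesised group repeats for as long as input remains, so the last element simply holds whatever is left over. Note that the A format also strips trailing spaces, which is harmless once whitespace has been removed.

```perl
use strict;
use warnings;

# Illustrative input; the real code would use the whitespace-stripped sequence.
my $demo = 'ABCDEFGHIJ';

# '(A4)*' means: repeat "a 4-character ASCII field" until the input runs out.
my @parts = unpack '(A4)*', $demo;

print join(',', @parts), "\n";    # ABCD,EFGH,IJ
```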

If you don't actually need regex character classes, this is how I'd do it:
use strict;
use warnings;
use Data::Dump;
my $string = "MAGRSHPGPLRPLLPLLVVAACVLPGAGGTCPERALERREEEAN
VVLTGTVEEILNVDPVQHTYSCKVRVWRYLKGKDLVARESLLDGGNKVVISGFGDPLI
CDNQVSTGDTRIFFVNPAPPYLWPAHKNELMLNSSLMRITLRNLEEVEFCVEDKPGTH
LRDVVVGRHPLHLLEDAVTKPELRPCPTP";
$string =~ s/\s+//g;
my @chunks;
while (length($string)) {
    push(@chunks, substr($string, 0, 58, ''));
}
dd($string, \@chunks);
Output:
(
"",
[
"MAGRSHPGPLRPLLPLLVVAACVLPGAGGTCPERALERREEEANVVLTGTVEEILNVD",
"PVQHTYSCKVRVWRYLKGKDLVARESLLDGGNKVVISGFGDPLICDNQVSTGDTRIFF",
"VNPAPPYLWPAHKNELMLNSSLMRITLRNLEEVEFCVEDKPGTHLRDVVVGRHPLHLL",
"EDAVTKPELRPCPTP",
],
)
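As the output shows, the four-argument substr empties $string as it goes. If the original string is still needed afterwards, one possible variant walks an offset instead. This is only a sketch: chunk_string is a hypothetical helper name, and the demo string and width of 4 are invented for brevity.

```perl
use strict;
use warnings;

# chunk_string is a hypothetical helper, not part of the answer above.
# It returns the pieces of $str, each at most $size characters long,
# without modifying $str.
sub chunk_string {
    my ($str, $size) = @_;
    die "chunk size must be positive\n" if $size < 1;

    my @chunks;
    for (my $pos = 0; $pos < length($str); $pos += $size) {
        # substr returns a shorter piece when fewer than $size characters remain.
        push @chunks, substr($str, $pos, $size);
    }
    return @chunks;
}

my $demo   = 'ABCDEFGHIJ';
my @pieces = chunk_string($demo, 4);
print "$_\n" for @pieces;
```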

Related

Related to regular expressions in Perl

I have a string written in some text file -
A[B[C[10]]]
I need to extract the information about the arrays used in this string. For example, I have to store the information in an array:
@str = (A[], B[], C[10])
I want to accomplish this using a regex in Perl.
I want a solution that works for every case of an array inside an array, like this:
A[B[C[D[E[F[10]]]]]]
So, how do I create that @str array?
You can use a regex pattern to parse this string recursively, and call Perl code to save intermediate values, like this:
use strict;
use warnings 'all';
use v5.14;
my @items;
my $i;
my $re = qr{
([A-Z]\[) (?{ $items[$i++] = $^N })
(?:
(?R)
|
(\d+) (?{ $items[$i-1] .= $^N })
)
\] (?{ $items[--$i] .= ']' })
}x;
my $s = 'A[B[C[D[E[F[10]]]]]]';
use Data::Dump;
(@items, $i) = ();
dd \@items if 'A[B[C[10]]]' =~ /$re/g;
(@items, $i) = ();
dd \@items if 'A[B[C[D[E[F[10]]]]]]' =~ /$re/g;
output
["A[]", "B[]", "C[10]"]
["A[]", "B[]", "C[]", "D[]", "E[]", "F[10]"]
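If the input can be assumed to be well formed, a simpler non-recursive match may be enough. The sketch below is not the recursive technique shown above: it just collects each name and any digits that directly follow its opening bracket, and it does not check that the brackets balance. array_parts is a hypothetical helper name.

```perl
use strict;
use warnings;

# array_parts is a hypothetical helper. Assumption: the input is well formed,
# so bracket balance is NOT verified here.
sub array_parts {
    my ($text) = @_;
    my @parts;
    # Capture a single-letter name, then the digits (possibly none) after "[".
    while ($text =~ /([A-Z])\[(\d*)/g) {
        push @parts, $1 . '[' . $2 . ']';
    }
    return @parts;
}

print join(', ', array_parts('A[B[C[10]]]')), "\n";    # A[], B[], C[10]
```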

How to search for overlapping matches for a regex pattern within a string

I have this string
my $line = "MZEFSRGGRMEAZFE*MQZEFFMAEZF*";
and I want to find every substring starting with M and ending with * and add it to an array. This means that the above string would give me 6 elements in my array.
I have this code
foreach ( $line =~ m/M.*?\*/g ) {
    push @ORF, $_;
}
but it only gives me two elements in my array since it ignores overlapping strings.
Is there any way to get all matches? I tried googling but could not find an answer.
You can use embedded code within the regex and backtracking control verbs for a little magic:
#!/usr/bin/env perl
use strict;
use warnings;
my $line = "MZEFSRGGRMEAZFE*MQZEFFMAEZF*";
local our @match;
$line =~ m/(M.*\*)(?{ push @match, $1 })(*FAIL)/;
use Data::Dump;
dd @match;
Outputs:
(
"MZEFSRGGRMEAZFE*MQZEFFMAEZF*",
"MZEFSRGGRMEAZFE*",
"MEAZFE*MQZEFFMAEZF*",
"MEAZFE*",
"MQZEFFMAEZF*",
"MAEZF*",
)
I don't believe it's possible to create a single regex pattern that will match all such substrings, because you're asking for both a greedy and a non-greedy match at the same time, and everything else in between.
I suggest you store all possible start and end positions of these substrings and use a double loop to combine all start positions with all end positions.
This program demonstrates the idea:
use strict;
use warnings 'all';
use feature 'say';
my $line = 'MZEFSRGGRMEAZFE*MQZEFFMAEZF*';
my @orf;
{
    my (@s, @e);
    push @s, $-[0] while $line =~ /M/g;
    push @e, $+[0] while $line =~ /\*/g;
    for my $s ( @s ) {
        for my $e ( @e ) {
            push @orf, substr $line, $s, $e-$s if $e > $s;
        }
    }
}
say for @orf;
output
MZEFSRGGRMEAZFE*
MZEFSRGGRMEAZFE*MQZEFFMAEZF*
MEAZFE*
MEAZFE*MQZEFFMAEZF*
MQZEFFMAEZF*
MAEZF*

How do I find overlapping regex in a string?

I have this string:
my $line = "MZEFSRGGRMEAZFE*MQZEFFMAEZF*";
I want to find every substring starting with M and ending with *, without * within them. This means that the above string would give me 4 elements in my final array.
@ORF = (MZEFSRGGRMEAZFE*, MEAZFE*, MQZEFFMAEZF*, MAEZF*)
A simple regex will not do since it does not find overlapping substrings. Is there a simple way to do this?
Regular expression matching consumes the string as it matches; that's by design.
You can use a lookahead expression to avoid this. See PerlMonks:
Using Look-ahead and Look-behind
So something like this will work:
#!/usr/bin/env perl
use strict;
use warnings;
use Data::Dumper;
my $line = "MZEFSRGGRMEAZFE*MQZEFFMAEZF*";
my @matches = $line =~ m/(?=(M[^*]+))/g;
print Dumper \@matches;
Which gives you:
$VAR1 = [
'MZEFSRGGRMEAZFE',
'MEAZFE',
'MQZEFFMAEZF',
'MAEZF'
];
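The matches above stop just before the *, because the capture group does not include it. If the terminating * should be part of each element, as in the question's expected list, one possible adjustment is to put \* inside the capture. This is a sketch of that variation, not the answerer's own code:

```perl
use strict;
use warnings;

my $line = 'MZEFSRGGRMEAZFE*MQZEFFMAEZF*';

# The lookahead itself consumes nothing, so the next attempt starts one
# character later and overlapping candidates are all found.
# The capture inside it now runs up to and including the next "*".
my @with_star = $line =~ /(?=(M[^*]*\*))/g;

print "$_\n" for @with_star;
```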
You can also use a recursive approach instead of an advanced-feature regex to do that. The program below takes each match and reparses the match, but omitting the starting M so it won't match the whole thing again.
use strict;
use warnings;
use Data::Printer;
my $line = "MZEFSRGGRMEAZFE*MQZEFFMAEZF*";
my @matches;
sub parse {
    my ( $string ) = @_;
    while ($string =~ m/(M[^*]+\*)/g ) {
        push @matches, $1;
        parse(substr $1, 1);
    }
}
parse($line);
p @matches;
Here's the output:
[
[0] "MZEFSRGGRMEAZFE*",
[1] "MEAZFE*",
[2] "MQZEFFMAEZF*",
[3] "MAEZF*"
]

Perl: string to an array reference?

Say, there is a string "[1,2,3,4,5]", how could I change it to an array reference as [1,2,3,4,5]? Using split and recomposing the array is one way, but it looks like there should be a simpler way.
eval is the simplest way
$string = "[1,2,3,4,5]";
$ref = eval $string;
but this is insecure if you don't have control over the contents of $string.
Your input string is also valid JSON, though, so you could say
use JSON;
$ref = decode_json( $string );
You can use eval but this should definitely be avoided when the string in question comes from untrusted sources.
Otherwise you have to parse it yourself:
my @arr = split(/\s*,\s*/, substr($string, 1, -1));
my $ref = \@arr;
You really should avoid eval if you can. If the string comes from outside the program then untold damage can be done by simply applying eval to it.
If the contents of the array is just numbers then you can use a regex to extract the information you need.
Here's an example
use strict;
use warnings;
my $string = "[1,2,3,4,5]";
my $data = [ $string =~ /\d+/g ];
use Data::Dump;
dd $data;
output
[1 .. 5]
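When the string comes from outside the program, it can help to reject anything that is not a bracketed list of integers before extracting the numbers. A minimal sketch under that assumption follows; parse_int_list is a hypothetical helper name, and it accepts only non-negative integers separated by commas.

```perl
use strict;
use warnings;

# parse_int_list is a hypothetical helper. It returns an array reference for
# input shaped like "[1,2,3]" and undef for anything else, without using eval.
sub parse_int_list {
    my ($text) = @_;
    # Whole string must be: "[", optional comma-separated digits, "]".
    return undef
        unless $text =~ /\A\[\s*(\d+(?:\s*,\s*\d+)*)?\s*\]\z/;
    my $body = defined $1 ? $1 : '';    # empty brackets give an empty list
    return [ split /\s*,\s*/, $body ];
}

my $ref = parse_int_list('[1,2,3,4,5]');
print scalar(@$ref), " elements\n";    # 5 elements
```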

In Perl, how can I replace sequences of duplicates with one element in an array?

I have a string that I read in like:
a+c+c+b+v+f+d+d+d+c
I need to write the program so it splits at the + then deletes the duplicates so the output is:
acbvfdc
I've tried tr///cs; but I guess I'm not using it right?
#!/usr/bin/env perl
use strict; use warnings;
my #strings = qw(
a+c+c+b+v+f+d+d+d+c
alpha+bravo+bravo+bravo+charlie+delta+delta+delta+echo+delta
foo+food
bark+ark
);
for my $s (#strings) {
    # Thanks @ikegami
$s =~ s/ (?<![^+]) ([^+]+) \K (?: [+] \1 )+ (?![^+]) //gx;
print "$s\n";
}
Output:
a+c+b+v+f+d+c
alpha+bravo+charlie+delta+echo+delta
foo+food
bark+ark
Now, you can split the string and have no sequences of duplicates using split /[+]/, $s because the first argument of split is a pattern.
Note to any who reads: this does not address the OP's question directly, though in my defense the question was worded ambiguously. :-) Still, it answers an interpretation of the question that others might have, so I'll leave it as-is.
Does order matter? If not, you can always try something like this:
use strict;
use warnings;
my $string = 'a+c+c+b+v+f+d+d+d+c';
# Extract unique 'words'
my @words = keys %{{map {$_ => 1} split /\+/, $string}};
print "$_\n" for @words;
Better yet, use List::MoreUtils from CPAN (which does preserve the order):
use strict;
use warnings;
use List::MoreUtils 'uniq';
my $string = 'a+c+c+b+v+f+d+d+d+c';
# Extract unique 'words'
my @words = uniq split /\+/, $string;
print "$_\n" for @words;
my $s="a+c+c+b+v+f+d+d+d+c";
$s =~ tr/+//d;
$s =~ tr/\0-\xff/\0-\xff/s;
print "$s\n"; # => acbvfdc
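The tr///s approach squeezes repeated characters, so it fits input whose items are single characters. For items longer than one character, a possible sketch is to split first and then drop an item only when it equals the one kept just before it. squeeze_runs is a hypothetical helper name, and the longer sample string is the one used in an earlier answer.

```perl
use strict;
use warnings;

# squeeze_runs is a hypothetical helper: it collapses each run of identical
# adjacent items into one item and keeps non-adjacent repeats.
sub squeeze_runs {
    my @tokens = @_;
    my @out;
    for my $t (@tokens) {
        push @out, $t unless @out && $out[-1] eq $t;
    }
    return @out;
}

my @letters = squeeze_runs(split /\+/, 'a+c+c+b+v+f+d+d+d+c');
print join('', @letters), "\n";     # acbvfdc

my @names = squeeze_runs(split /\+/,
    'alpha+bravo+bravo+bravo+charlie+delta+delta+delta+echo+delta');
print join('+', @names), "\n";      # alpha+bravo+charlie+delta+echo+delta
```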
