Perl regex not matching as expected - arrays

I'm trying to compare each word in a list to a string to find matching words, but I can't seem to get this to work.
Here is some sample code
my $sent = "this is a test line";
foreach (@keywords) { # array of words (contains the word 'test')
    if ($sent =~ /$_/) {
        print "match found";
    }
}
It seems to work if I manually enter /test/ instead of $_, but I can't enter words manually.

Your code works fine. I hope you have use strict and use warnings in place in the real program? Here is an example where I have populated @keywords with a few items including test.
use strict;
use warnings;
my $sent = "this is a test line";
my @keywords = qw/ a b test d e /;
foreach (@keywords) {
    if ($sent =~ /$_/) {
        print "match found\n";
    }
}
output
match found
match found
match found
So your array doesn't contain what you think it does. I would bet that you've read the data from a file or from the keyboard and forgot to remove the newline from the end of each word with chomp.
You can do that by simply writing
chomp @keywords;
which will remove a newline (if there is one) from the end of every element of @keywords. To see the real contents of @keywords, you can add these lines to your program
use Data::Dumper;
$Data::Dumper::Useqq = 1;
print Dumper \@keywords;
You will also see that the elements a and e produce a match as well as test, which I guess you don't want. You could add a word boundary metacharacter \b before and after the value of $_, like this
foreach (@keywords) {
    if ( $sent =~ /\b$_\b/ ) {
        print "match found\n";
    }
}
but a regular expression's definition of a word is very restrictive and allows only alphanumeric characters or an underscore _, so Roger's, "essay", 99%, and nicely-formatted are not "words" in this sense. Depending on your actual data you may want something different.
Finally, I would write this loop more compactly using for instead of foreach (they are identical in every respect) and the postfixed statement modifier form of if, like this
for (@keywords) {
    print "match found\n" if $sent =~ /\b$_\b/;
}
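If the keywords can contain punctuation, a whitespace-based boundary may suit the data better than \b. The following is only a sketch: it assumes the words in the sentence are separated by whitespace, and the sub name matching_keywords and the sample keywords are made up for illustration.

```perl
use strict;
use warnings;

# Hypothetical helper: return the keywords that occur in $sent as whole,
# whitespace-delimited words.
sub matching_keywords {
    my ($sent, @keywords) = @_;
    chomp @keywords;    # drop trailing newlines left over from file or keyboard input
    # \Q...\E quotes regex metacharacters in the keyword; the lookarounds
    # require whitespace or the edge of the string on both sides.
    return grep { $sent =~ /(?<!\S)\Q$_\E(?!\S)/ } @keywords;
}

my @found = matching_keywords("this is a 99% test line", "a\n", "b", "test", "99%", "e");
print "match found: $_\n" for @found;
```

Here "e" no longer matches inside "test" or "line", and "99%" is matched literally even though it is not a word in the \b sense.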

Related

split string to some parts and replace a one sub-string with another string

I am trying to replace the part of a string that sits between two fixed parts.
To describe it: I have a text file in which one word has some of its characters written incorrectly, for example:
acc\E34rate
acc\?4rate
acc§54rate
.....
What I want to write is code that looks for acc followed by rate and replaces whatever is between them with u, because all of the strings share the same first part and last part.
I wonder how I can do it in Perl?
Thanks!
Update: including Code
well what I have written is:
use strict;
use warnings;
my @stringArray = ('acc\E34rate', 'acc\?4rate');
my $find = '\E34';
my $replace = 'u';
my @newArray;
foreach my $str (@stringArray)
{
    my $pos = index($str, $find);
    while ($pos > -1) {
        substr($str, $pos, length($find), $replace);
        $pos = index($str, $find, $pos + length($replace));
    }
    push @newArray, $str;
}
foreach (@newArray)
{
    print "$_\r\n";
}
To simplify, I have used an array instead of a file. The problem is that this works only for one specific word rather than for the whole array/file.
I think this is what you want but the requirements are not clear. See perldoc perlre for more details.
#!/usr/bin/env perl
use strict;
use warnings;
my $begin = 'acc';
my $end = 'rate';
my $replace = 'u';
while( my $line = <DATA> ){
$line =~ s{ \Q$begin\E \S*? \Q$end\E }{$begin$replace$end}gmsx;
print $line;
}
__DATA__
acc\E34rate
acc\?4rate
acc§54rate
acc\E34rate acc\?4rate acc§54rate
accFOOacc
rateFOOrate
rateFOOrate accFOOacc
accFOOacc rateFOOrate
Try this:
$Text = 'acc\E34rate
acc\?4rate
acc§54rate'; # This is the joined string (using enter key) after reading from the file; single quotes keep the backslashes literal
$Text =~ s/^acc.*?rate$/accutext/mg;
print $Text;
I've just tested it in my system and it is working fine.
Output:
accutext
accutext
accutext
m makes ^ and $ match at the start and end of every line in the string (that is, just after and just before each \n), not only at the start and end of the whole string.
g is to replace all possible occurrences.
To get back as an array, split using \n.
Please note that the above code assumes that each line in the file begins with acc and ends with rate, with no additional text before or after them on that line (i.e., the file has lines like "acc\?4rate" only, not "Driving acc\?4rate at 60kmph").
If the word can appear in the middle of a sentence, use this substitution instead:
$Text =~ s/acc.*?rate/accutext/g;
Incidentally, this will work in all possible inputs too, including the code at the top.
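Applied to the array from the question, the same idea might look like the sketch below. It assumes the part between acc and rate never contains whitespace; the sub name repair_words and the third sample string are made up for illustration.

```perl
use strict;
use warnings;

# Hypothetical helper: replace any run of non-whitespace characters between
# "acc" and "rate" with "u", for every string in the list.
sub repair_words {
    my @fixed = @_;                       # copy, so the caller's data is untouched
    s/acc\S*?rate/accurate/g for @fixed;  # non-greedy, so each word is handled separately
    return @fixed;
}

# Single quotes keep the backslashes literal.
my @stringArray = ('acc\E34rate', 'acc\?4rate', 'Driving acc\?4rate at 60kmph');
print "$_\n" for repair_words(@stringArray);
```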

Changing element's positions in Perl

So I have a problem and I can't solve it. I read some words from a file in Perl. The words in that file aren't in order, but each has a number as its first character that gives the position the word should have in the sentence. A 0 means the word belongs in position [0], a 1 means it belongs in position [1], etc.
The file looks like: 0This 3a 4sentence 2be 1should, and the solution should look like 0This 1should 2be 3a 4sentence.
In a for loop I go through the words array that I get from the file, and this is how I get the first character (the number): $firstCharacter = substr $words[$i], 0, 1;, but I don't know how to properly rearrange the array.
Here's the code that I use
#!/usr/bin/perl -w
$arg = $ARGV[0];
open FILE, "< $arg" or die "Can't open file: $!\n";
$/ = ".\n";
while ($row = <FILE>)
{
    chomp $row;
    @words = split(' ', $row);
}
for ($i = 0; $i < scalar @words; $i++)
{
    $firstCharacter = substr $words[$i], 0, 1;
    if ($firstCharacter != 0)
    {
    }
}
Just use sort. You can use a match in list context to extract the numbers; using \d+ works even for numbers greater than 9:
#! /usr/bin/perl
use warnings;
use strict;
use feature qw{ say };
my @words = qw( 0This 3a 4sentence 2be 1should );
say join ' ', sort { ($a =~ /\d+/g)[0] <=> ($b =~ /\d+/g)[0] } @words;
If you don't mind the warnings, or you are willing to turn them off, you can use numeric comparison directly on the words; Perl will extract the numeric prefixes itself:
no warnings 'numeric';
say join ' ', sort { $a <=> $b } @words;
Assuming you have an array like this:
my @words = ('0This', '3a', '4sentence', '2be', '1should');
And you want it sorted like so:
('0This', '1should', '2be', '3a', '4sentence');
There are two steps to this. First is extracting the leading number. Then sorting by that number.
You can't use substr, because you don't know how long the number might be. For example, ('9Second', '12345First'). If you only looked at the first character you'd get 9 and 1 and sort them incorrectly.
Instead, you'd use a regex to capture the number.
my($num) = $word =~ /^(\d+)/;
See perlretut for more on how that works, particularly Extracting Matches.
Now that you can capture the numbers, you can sort by them. Rather than doing it in a loop yourself, let sort handle the sorting for you. All you have to do is supply the criterion for the sorting. In this case we capture the number from each word (assigned to $a and $b by sort) and compare them as numbers.
@words = sort {
    # Capture the number from each word.
    my($anum) = $a =~ /^(\d+)/;
    my($bnum) = $b =~ /^(\d+)/;
    # Compare the numbers.
    $anum <=> $bnum
} @words;
There are various ways to make this more efficient, in particular the Schwartzian Transform.
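A Schwartzian Transform for this case could be sketched as follows. The sub name sort_by_prefix is made up for illustration, and treating a word with no leading digits as 0 is an assumption of this sketch, not something the question specifies.

```perl
use strict;
use warnings;

# Hypothetical helper: sort words by their leading number, running the
# regex only once per word instead of once per comparison.
sub sort_by_prefix {
    return
        map  { $_->[1] }                      # 3. keep only the original word
        sort { $a->[0] <=> $b->[0] }          # 2. compare the cached numbers
        map  { [ /^(\d+)/ ? $1 : 0, $_ ] }    # 1. pair each word with its number (0 if none: assumption)
        @_;
}

my @words = qw( 0This 3a 4sentence 2be 1should );
my @sorted = sort_by_prefix(@words);
print "@sorted\n";
```

The extra maps only pay off when the list is long or the key is expensive to extract; for five words the plain sort block is just as good.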
You can also cheat a bit.
If you ask Perl to treat something as a number, it will do its damnedest to comply. If the string starts with a number, it will use that and ignore the rest, though it will complain.
$ perl -wle 'print "23foo" + "42bar"'
Argument "42bar" isn't numeric in addition (+) at -e line 1.
Argument "23foo" isn't numeric in addition (+) at -e line 1.
65
We can take advantage of that to simplify the sort by just comparing the words as numbers directly.
{
    no warnings 'numeric';
    @words = sort { $a <=> $b } @words;
}
Note that I turned off the warning about using a word as a number. use warnings and no warnings only have effect within the enclosing block, so by putting the no warnings 'numeric' and the sort in their own block I've only turned off the warning for that one sort statement.
Finally, if the words are in a file you can use the Unix sort utility from the command line. Use -n for "numeric sorting" and it will do the same trick as above.
$ cat test.data
00This
3a
123sentence
2be
1should
$ sort -n test.data
00This
1should
2be
3a
123sentence
You should be able to split on the spaces, which will make the numbers the first character of the word. With that assumption, you can simply compare using the numerical comparison operator (<=>) as opposed to the string comparison (cmp).
The operators are important because if you compare strings, the first character is used, meaning 10, 11, and 12 would be out of order, and listed near the 1 (1,10,11,12,2,3,4… instead of 1,2,3,4…10,11,12).
Split, Then Sort
Note: @schwern commented an important point. If you use warnings -- and you should -- you will receive warnings. This is because the values of the internal comparison variables, $a and $b, aren't numbers, but strings (e.g., "0This", "3a"). I've updated the following Codepad and provided more suitable alternatives to avoid this issue.
http://codepad.org/xs2GH9xT
use strict;
use warnings;
my $line = q{0This 3a 4sentence 2be 1should};
my @words = split /\s/, $line;
my @sorted = sort {$a <=> $b} @words;
print qq{
Line: $line
Words: @words
Sorted: @sorted
};
Alternatives
One method is to ignore the warning using no warnings 'numeric' as in Schwern's answer. As he has shown, turning off the warnings inside a block re-enables them afterwards, which may be a little more foolproof than Choroba's answer, which applies it to the broader scope.
Choroba's solution works by parsing the digits from those values inside the sort block. This is much fewer lines of code, but I would generally advise against it for performance reasons: the regex isn't run just once per word, but many times over the course of the sort.
Another method is to strip the numbers out and use them for the sort comparison. I attempt to do this below by creating a hash, where the key will be the number and the value will be the word.
Hash Mapping / Key Sort
Once you have an array where the values are the words prefixed by the numbers, you could just as easily split those number/word combos into a hash that has the number as the key and the word as the value. This is accomplished by using split.
The important thing to note about the split statement is that a limit is passed (in this case 2), which limits the maximum number of fields the string is split into.
The two values are then used in the map to build the key/value assignment. Thus "0This" is split into "0" and "This" to be used in the hash as "0"=>"This"
http://codepad.org/kY8wwajc
use strict;
use warnings;
my $line = q{0This 3a 4sentence 2be 1should};
my @words = split /\s/, $line; # [ '0This', '3a', ... ]
my %mapped = map { split /(?=\D)/, $_, 2 } @words; # { '0'=>'This', '3'=>'a', ... }
my @sorted = @mapped{ sort { $a <=> $b } keys %mapped }; # [ 'This', 'should', 'be', ... ]
print qq{
Line: $line
Words: @words
Sorted: @sorted
};
This also can be further optimized, but uses multiple variables to illustrate the steps in the process.

Perl matching multidimensional array elements

I'm not getting any output. Can anyone see where the issue lies:
in the matching or in the calling?
(The two subarrays in the multidimensional array have the same length.)
//Multidimensional array,
//Idarray = Fasta ID, Seqarray = "ATTGTTGGT" sequences
@ordarray = (\@idarray, \@seqarray);
//This calling works
print $ordarray[0][0] , "\n";
print $ordarray[1][0] , "\n", "\n";
// Ordarray output = "TTGTGGCACATAATTTGTTTAATCCAGAT....."
User inputs a search string, loop iterates the sequence dimension,
and counts amount of matches. Prints number of matches and the corresponding ID from the ID dimension.
//The user input-searchstring
$sestri = <>;
for($r=0;$r<@idarray;$r++) {
    if ($sestri =~ $ordarray[1][$r] ){
        print $ordarray[0][$r] , "\n";
        $counts = () = $ordarray[0][$r] =~ /$sestri/g;
        print "number of counts: ", $counts ;
    }
}
I think the problem lies with this:
$sestri = <>;
That may well not be doing what you intended - your comment says "user specified search string" but that's not what that operator does.
What it does is read the first line from the file you specified on the command line (or from standard input if you didn't name one), and that line keeps its trailing newline.
I would suggest that if you want to grab a search string from the command line, you do it via @ARGV
E.g.
my ( $sestri ) = @ARGV; # will give first word.
However, please please please switch on use strict and use warnings. You should always do this prior to posting on a forum for assistance.
I would also question quite why you need a two dimensional array with two elements in it though. It seems unnecessary.
Why not instead make a hash, and key your "fasta ids" to the sequence?
E.g.
my %id_of;
@id_of{@seqarray} = @idarray;
my %seq_of;
@seq_of{@idarray} = @seqarray;
I think this would suit your code a bit better, because then you don't have to worry about the array indices at all.
use strict;
use warnings;
my ($sestri) = @ARGV;
my %id_of;
@id_of{@seqarray} = @idarray;
foreach my $sequence ( keys %id_of ) {
    ##NB - this is a pattern match, and will be 'true'
    ## if $sestri is a substring of $sequence
    if ( $sequence =~ m/$sestri/ ) {
        print $id_of{$sequence}, "\n";
        my $count = () = $sequence =~ m/$sestri/g;
        print "number of counts: ", $count, "\n";
    }
}
I've rewritten it a bit, because I'm not entirely understanding what your code is doing. It looks like it's substring matching in @seqarray but then returning the count of matching elements in @idarray. I don't think that makes sense, but if it does, then amend according to your needs.
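For comparison, here is a sketch of the other hash from above, keyed by ID rather than by sequence, so that two IDs with identical sequences do not overwrite each other. The sample IDs and sequences are made up, and count_matches is a hypothetical helper name; it counts non-overlapping occurrences only.

```perl
use strict;
use warnings;

# Made-up sample data standing in for the parsed FASTA file.
my @idarray  = ('seq1', 'seq2', 'seq3');
my @seqarray = ('ATTGTTGGT', 'TTGTGGCACATAATTTG', 'CCCCCC');

# Key each sequence by its ID using a hash slice.
my %seq_of;
@seq_of{@idarray} = @seqarray;

# Hypothetical helper: count non-overlapping occurrences of $motif in each
# sequence and return ID => count for the sequences that contain it.
sub count_matches {
    my ($motif, %sequences) = @_;
    chomp $motif;    # a line read with <> or <STDIN> still ends in a newline
    my %count_for;
    for my $id (keys %sequences) {
        my $count = () = $sequences{$id} =~ /\Q$motif\E/g;
        $count_for{$id} = $count if $count;
    }
    return %count_for;
}

my %count_for = count_matches("TTG\n", %seq_of);
print "$_: number of counts: $count_for{$_}\n" for sort keys %count_for;
```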

How to find index of string in array Perl without iterating

I need to find value in array without iterating through whole array.
I get array of strings from file, and I need to get index of some value in this array, I have tried this code, but it doesn't work.
my @array = <$file>;
my $search = "SomeValue";
my $index = first { $array[$_] eq $search } 0 .. $#array;
print "index of $search = $index\n";
Please suggest how can I get index of value, or it is better to get all indexes of line if there are more than one entry.
Thx in advance.
What does "it doesn't work" mean?
The code you have will work fine, except that an element in the array is going to be "SomeValue\n", not "SomeValue". You can remove the newlines with chomp(@array) or include a newline in your $search string.
Your initial question: "I need to find value in array without iterating through whole array."
You can't. It is impossible to check every element of an array, without checking every element of an array. The very best you can do is stop looking once you've found it - but you indicate in your question multiple matches.
There are various options that will do this for you - like List::Util and grep. But they are still doing a loop, they're just hiding it behind the scenes.
The reason first doesn't work for you, is probably because you need to load it from List::Util first. Alternatively - you forgot to chomp, which means your list includes line feeds, where your search pattern doesn't.
Anyway - in the interests of actually giving something that'll do the job:
while ( my $line = <$file> ) {
chomp ( $line );
#could use regular expression based matching for e.g. substrings.
if ( $line eq $search ) { print "Match on line $.\n"; last; }
}
If you want every match, omit the last;
Alternatively - you can match with:
if ( $line =~ m/\Q$search\E/ ) {
Which will substring match (Which in turn means the line feeds are irrelevant).
So you can do this instead:
while ( <$file> ) {
print "Match on line $.\n" if m/\Q$search\E/;
}
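If the lines are already in an array, the List::Util and grep options mentioned above could be sketched like this. The sample array is made up and stands in for the lines read from the file; both calls still loop over the array internally.

```perl
use strict;
use warnings;
use List::Util qw(first);

# Made-up lines standing in for the contents of the file.
my @array = ("alpha\n", "SomeValue\n", "beta\n", "SomeValue\n");
chomp @array;

my $search = "SomeValue";

# Index of the first match (undef if there is none); stops at the first hit.
my $index = first { $array[$_] eq $search } 0 .. $#array;

# Indexes of every match; always scans the whole array.
my @indexes = grep { $array[$_] eq $search } 0 .. $#array;

print "first index of $search = $index\n" if defined $index;
print "all indexes of $search = @indexes\n";
```

Testing with defined matters here, because a match at index 0 is a valid result that is false in boolean context.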

How to compare each element of an array using regex?

I am using the Lingua::EN::Tagger Perl module in order to tag parts of speech from a user's input. That portion of my code works perfectly. However, the problem is that I only want to keep the input that has the noun tags, which are "NN, NNS, NNP, NNPS", and store these words in a separate array @nounArray. The user will be inputting a question such as "what is your name?" Each element of the question will be tagged: What/WP is/is your/PN name/NN
my @UserInput = $readable_text;
my @nounArray;
foreach my $UserInput (@UserInput){
    if ($UserInput =~ m/NN|NNS$|NNP$|NNPS$/){
        $UserInput = @nounArray;
    }
    print @nounArray;
}
However, nothing happens when I run the code. The goal is to have the nouns of the user's input placed in a separate array after separating them from the original array. I do not want to print the array, but I do this in order to see if the code is working.
Since you want to iterate over the words in $readable_text you can split them first into array,
my $readable_text = "What/WP is/is your/PN name/NN";
my @UserInput = split ' ', $readable_text;
my @nounArray;
foreach my $UserInput (@UserInput) {
    if ($UserInput =~ m/NN|NNS$|NNP$|NNPS$/) {
        # print "$UserInput\n";
        push @nounArray, $UserInput;
    }
}
print @nounArray;
$ matches at the end of the string. I suppose your strings have at least a \n at the end, which would prevent them from matching.
But as you point out in your comment, it looks like you're trying to match word boundaries here, so just replace all $ in your expression with \b.
First, split your words by whitespace:
my @UserInput = split /\s+/, $UserInput;
Then grep for the nouns:
my @nouns = grep { m%/N% } @UserInput; # only noun tags include /N
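A stricter variant anchors the tag at the end of each token and returns the bare words. This is only a sketch: it assumes the word/TAG format shown in the question, and nouns_from_tagged and the extra sample token dogs/NNS are made up for illustration.

```perl
use strict;
use warnings;

# Hypothetical helper: return the bare words whose tag is NN, NNS, NNP or NNPS.
sub nouns_from_tagged {
    my ($tagged_text) = @_;
    my @nouns;
    for my $token (split ' ', $tagged_text) {
        # Capture the word, then require one of the four noun tags at the end.
        push @nouns, $1 if $token =~ m{^(.+)/NNP?S?$};
    }
    return @nouns;
}

my @nounArray = nouns_from_tagged("What/WP is/is your/PN name/NN dogs/NNS");
print "@nounArray\n";
```

Because split ' ' discards leading and trailing whitespace, including a final newline, the $ anchor is safe here.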
