search for string in file using an aray elements perl - arrays

I have a array which contains set of unique elements my_array= [aab, abc def, fgh,]
I have a file which containing these elements(repeated also)
I want to count each unique element has how many repetitions if no repetition then count is 1
example of file :
i want to have aab but no i dont want abc
i want to have aab but no i dont want def
output should be
aab - 2
abc - 1
def - 1
I tried to search first and print it its not woking
use strict;
use warnings;
my #my_array;
#my_array =("abc", "aab", "def");
open (my $file, '<', 'filename.txt') or die;
my $value;
foreach $value (#my_array) {
while(<$file>) {
if ($_ =~ /$value/){
print "found : $value\n";
}
}
}
Also tried 2nd method
use strict;
use warnings;
my #my_array;
#my_array =("abc", "aab", "def");
open (my $file, '<', 'filename.txt') or die;
while (<$file>) {
my $k=0;
if ($_ =~ /$my_array[$k]/) {
print "$my_array[$k]”;
}
}

Sample input data does not specify if lookup words repeat in the line or not.
Following demo code assumes that lookup words do not repeat in the line.
If this statement above does not true then the line should be split into tokens and each token must be inspected to get correct count of lookup words.
use strict;
use warnings;
use feature 'say';
use Data::Dumper;
my(%count,#lookup);
#lookup =('abc', 'aab', 'def');
while( my $line = <DATA> ) {
for ( #lookup ) {
$count{$_}++ if $line =~ /\b$_\b/;
}
}
say Dumper(\%count);
exit 0;
__DATA__
i want to have aab but no i dont want abc
i want to have aab but no i dont want def
Output
$VAR1 = {
'aab' => 2,
'abc' => 1,
'def' => 1
};

I'm a fan of the Algorithm::AhoCorasick::XS module for performing efficient searches for multiple strings at once. An example:
#!/usr/bin/env perl
use warnings;
use strict;
use Algorithm::AhoCorasick::XS;
my #words = qw/abc aab def/;
my $aho = Algorithm::AhoCorasick::XS->new(\#words);
my %counts;
while (my $line = <DATA>) {
$counts{$_}++ for $aho->matches($line);
}
for my $word (#words) {
printf "%s - %d\n", $word, $counts{$word}//1;
}
__DATA__
i want to have aab but no i dont want abc
i want to have aab but no i dont want def
outputs
abc - 1
aab - 2
def - 1
The $counts{$word}//1 bit in the output will give you a 1 if that word doesn't exist in the hash because it wasn't encountered in the text.

Can build an alternation pattern from the keywords and so match all that are on the line in one regex run, then populate a frequency hash with the matches
use warnings;
use strict;
use feature 'say';
use Data::Dumper;
my #keywords = qw(aab abc def fgh);
my $re_w = join '|', #keywords;
my %freq;
while (<>) {
++$freq{$_} for /($re_w)/g
}
say Dumper \%freq;
The <> operator reads line by line the files with names given on the command line, so the program is used as prog.pl file. (Or open the file "manually" in the program.)
The for loop imposes list context on its expression, so that regex returns the list of matches (captures), as the match operator does in the list context, and the ++$freq{$_} expression works with them one at a time.
The code counts all instances of keywords that repeat on a line. If that's not desired please clarify (can add a call to List::Util::uniq before feeding the list of matches to the for loop).
There are a number of other details that may need closer attention.
One example: if there are overlapping keywords, which one takes precedence? For instance, with keywords the and there, once the word there is encountered in the text should it be matched by there or by the? If it is there then keywords in the alternation pattern should be ordered from longest to shortest,
my $re_w = join '|', sort { length $b <=> length $a } #w;
Please clarify if there are additional considerations.

Related

compare an array of string with another array of strings in perl

I want to compare an array of string with another array of strings; if it matches, print matched.
Example:
#array = ("R-ID 1.0001", "RA-ID 61.02154", "TCA-ID 49.021456","RCID 61.02154","RB-ID 61.02154");
#var = ("TCA-ID 49", "R-ID 1");
for (my $x = 0; $x <= 4; $x++)
{
$array[$x] =~ /(.+?)\./;
if( ($var[0] eq $1) or ($var[1] eq $1) )
{
print "\n deleted rows are :#array\n";
}
else
{
print "printed rows are : #array \n";
push(#Matrix, \#array);
}
Then I need to compare #var with the #array; if it is matched, print the matched pattern.
Here the entire logic is in a hireartical for loop which gives a new #array in each iteration. so every time this logic is executed #array has different strings.
Then comes with #var it is user input field, this #var can be of any size. So in order to run the logic according to these constraints, I need to iterate the condition inside the if loop when the user input #var size is 3 for example.
So the goal is to match and delete the user input stings using the above mentioned logic. But unfortunately tis logic is not working. Could you please help me out in this issue.
The builtin grep keyword is a good place to start.
my $count = grep { $_ eq $var } #array;
This returns a count of items ($_) in the array which are equal (eq) to $var.
If you needed case-insensitive matching, you could use lc (or in Perl 5.16 or above, fc) to do that:
my $count = grep { lc($_) eq lc($var) } #array;
Now, a disadvantage to grep is that it is counting the matches. So after if finds the first match, it will keep on going until the end of the array. You don't seem to want that, but just want to know if any item in the array matches, in which case keeping on going might be slower than you need if it's a big array with thousands of elements.
So instead, use any from the List::Util module (which is bundled with Perl).
use List::Util qw( any );
my $matched = any { $_ eq $var } #array;
This will match as soon as it finds the first matching element, and skip searching the rest of the array.
Here is a couple of versions that allows multiple strings to be matched. Not clear what form $var takes when you want to store multiple, so assuming they are in an array #var for now.
The key point is this one is the use of the lookup hash to to the matching.
use strict;
use warnings;
my #var = ("TCA-ID 49", "RA-ID 61");
my #array = ("R-ID 1", "RA-ID 61", "TCA-ID 49");
# create a lookup for the strings to match
my %lookup = map { $_ => 1} #var ;
for my $entry (#array)
{
print "$entry\n"
if $lookup{$entry} ;
}
running gives
RA-ID 61
TCA-ID 49
Next, using a regular expression to do the matching
use strict;
use warnings;
my #var = ("TCA-ID 49", "RA-ID 61");
my #array = ("R-ID 1", "RA-ID 61", "TCA-ID 49");
my $re = join "|", map { quotemeta } #var;
print "$_\n" for grep { /^($re)$/ } #array ;
output is the same

How to read a .txt file and store it into an array

I know this is a fairly simple question, but I cannot figure out how to store all of the values in my array the way I want to.
Here is a small portion what the .txt file looks like:
0 A R N D
A 2 -2 0 0
R -2 6 0 -1
N 0 0 2 2
D 0 -1 2 4
Each value is delimited by either two spaces - if the next value is positive - or a space and a '-' - if the next value is negative
Here is the code:
use strict;
use warnings;
open my $infile, '<', 'PAM250.txt' or die $!;
my $line;
my #array;
while($line = <$infile>)
{
$line =~ /^$/ and die "Blank line detected at $.\n";
$line =~ /^#/ and next; #skips the commented lines at the beginning
#array = $line;
print "#array"; #Prints the array after each line is read
};
print "\n\n#array"; #only prints the last line of the array ?
I understand that #array only holds the last line that was passed to it. Is there a way where I can get #array to hold all of the lines?
You are looking for push.
push #array, $line;
You undoubtedly want to precede this with chomp to snip any newlines, first.
If file is small as compared to available memory of your machine then you can simply use below method to read content of file in to an array
open my $infile, '<', 'PAM250.txt' or die $!;
my #array = <$infile>;
close $infile;
If you are going to read a very large file then it is better to read it line by line as you are doing but use PUSH to add each line at end of array.
push(#array,$line);
I will suggest you also read about some more array manipulating functions in perl
You're unclear to what you want to achieve.
Is every line an element of your array?
Is every line an array in your array and your "words" are the elements of this array?
Anyhow.
Here is how you can achieve both:
use strict;
use warnings;
use Data::Dumper;
# Read all lines into your array, after removing the \n
my #array= map { chomp; $_ } <>;
# show it
print Dumper \#array;
# Make each line an array so that you have an array of arrays
$_= [ split ] foreach #array;
# show it
print Dumper \#array;
try this...
sub room
{
my $result = "";
open(FILE, <$_[0]);
while (<FILE>) { $return .= $_; }
close(FILE);
return $result;
}
so you have a basic functionality without great words. the suggest before contains the risk to fail on large files. fastest safe way is that. call it as you like...
my #array = &room('/etc/passwd');
print room('/etc/passwd');
you can shorten, rename as your convinience believes.
to the kidding ducks nearby: by this way the the push was replaced by simplictiy. a text-file contains linebreaks. the traditional push removes the linebreak and pushing up just the line. the construction of an array is a simple string with linebreaks. now contain the steplength...

Extract number from array in Perl

I have a array which have certain elements. Each element have two char "BC" followed by a number
e.g - "BC6"
I want to extract the number which is present and store in a different array.
use strict;
use warnings;
use Scalar::Util qw(looks_like_number);
my #band = ("BC1", "BC3");
foreach my $elem(#band)
{
my #chars = split("", $elem);
foreach my $ele (#chars) {
looks_like_number($ele) ? 'push #band_array, $ele' : '';
}
}
After execution #band_array should contain (1,3)
Can someone please tell what I'm doing wrong? I am new to perl and still learning
To do this with a regular expression, you need a very simple pattern. /BC(\d)/ should be enough. The BC is literal. The () are a capture group. They save the match inside into a variable. The first group relates to $1 in Perl. The \d is a character group for digits. That's 0-9 (and others, but that's not relevant here).
In your program, it would look like this.
use strict;
use warnings;
use Data::Dumper;
my #band = ('BC1', 'BC2');
my #numbers;
foreach my $elem (#band) {
if ($elem =~ m/BC(\d)/) {
push #numbers, $1;
}
}
print Dumper #numbers;
This program prints:
$VAR1 = '1';
$VAR2 = '2';
Note that your code had several syntax errors. The main one is that you were using #band = [ ... ], which gives you an array that contains one array reference. But your program assumed there were strings in that array.
Just incase your naming contains characters other than BC this will exctract all numeric values from your list.
use strict;
use warnings;
my #band = ("AB1", "BC2", "CD3");
foreach my $str(#band) {
$str =~ s/[^0-9]//g;
print $str;
}
First, your array is an anonymous array reference; use () for a regular array.
Then, i would use grep to filter out the values into a new array
use strict;
use warnings;
my #band = ("BC1", "BC3");
my #band_array = grep {s/BC(\d+)/$1/} #band;
$"=" , "; # make printing of array nicer
print "#band_array\n"; # print array
grep works by passing each element of an array in the code in { } , just like a sub routine. $_ for each value in the array is passed. If the code returns true then the value of $_ after the passing placed in the new array.
In this case the s/// regex returns true if a substitution is made e.g., the regex must match. Here is link for more info on grep

Using reference file to find desired data in another file - Perl

I used a Perl script to compare two arrays and retrieve data of interest. Now I want to use that list of retrieved data to get desired information from another list, using only the first part as an identifier but pulling all the information in that line.
Example data:
Reference:
apple
orange
pear
Search list:
apple 439
plum 657
orange 455
Result:
apple 439
orange 455
I've tried doing this with Array::Compare but haven't had any luck as it compares the entire line not just the first portion.
Thanks!
EDIT: Thanks to DVK I now have the following code:
#!/usr/bin/perl
use strict;
use warnings;
use File::Slurp;
#Convert the first file into an array of keys #keys
my #keys = read_file('Matching_strains.txt');
#Convert the second file into an array of lines #lines2
my #lines = read_file('output2.txt');
#Convert that array of lines into a hash using map and split
my %data = map { split(/\s+/, $, 2) } #lines; # 2 limits # of entries
#Get a list of data for which keys are in the first list
my %final = map { exists $data{$_} ? ( $_=>$data{$_} ) : () } #keys;
#Print that hash out
print "%final\n";
But I'm getting a number found where operator expected for the my %data, I've consulted perldoc but I'm not sure what number its referring to.
Thanks!
Convert the first file into an array of keys #keys
Left as excercise for the reader
Convert the second file into an array of lines #lines2
Left as excercise for the reader
Convert that array of lines into a hash using map and split
my %data = map { split(/\s+/, $, 2) } #lines; # 2 limits # of entries
Get a list of data for which keys are in the first list
my %final = map { exists $data{$_} ? ( $_=>$data{$_} ) : () } #keys;
Print that hash out
This sort of thing should achieve what you're after:
use warnings;
use strict;
open my $file1, '<', 'in.txt' or die $!;
open my $file2, '<', 'in.2.txt' or die $!;
my (%keys, %data);
while(<$file1>){
chomp;
$keys{$_} = 1;
}
while(<$file2>){
chomp;
my #split = split/\s/;
$data{$split[0]} = $split[1];
}
foreach (keys %keys){
print "$_ $data{$_}\n" if exists $data{$_};
}
apple 439
orange 455

Swap key and array value pair

I have a text file layed out like this:
1 a, b, c
2 c, b, c
2.5 a, c
I would like to reverse the keys (the number) and values (CSV) (they are separated by a tab character) to produce this:
a 1, 2.5
b 1, 2
c 1, 2, 2.5
(Notice how 2 isn't duplicated for c.)
I do not need this exact output. The numbers in the input are ordered, while the values are not. The output's keys must be ordered, as well as the values.
How can I do this? I have access to standard shell utilities (awk, sed, grep...) and GCC. I can probably grab a compiler/interpreter for other languages if needed.
If you have python (if you're on linux you probably already have) i'd use a short python script to do this. Note that we use sets to filter out "double" items.
Edited to be closer to requester's requirements:
import csv
from decimal import *
getcontext().prec = 7
csv_reader = csv.reader(open('test.csv'), delimiter='\t')
maindict = {}
for row in csv_reader:
value = row[0]
for key in row[1:]:
try:
maindict[key].add(Decimal(value))
except KeyError:
maindict[key] = set()
maindict[key].add(Decimal(value))
csv_writer = csv.writer(open('out.csv', 'w'), delimiter='\t')
sorted_keys = [x[1] for x in sorted([(x.lower(), x) for x in maindict.keys()])]
for key in sorted_keys:
csv_writer.writerow([key] + sorted(maindict[key]))
I would try perl if that's available to you. Loop through the input a row at a time. Split the line on the tab then the right hand part on the commas. Shove the values into an associative array with letters as the keys and the value being another associative array. The second associative array will be playing the part of a set so as to eliminate duplicates.
Once you read the input file, sort based on the keys of the associative array, loop through and spit out the results.
here's a small utility in php:
// load and parse the input file
$data = file("path/to/file/");
foreach ($data as $line) {
list($num, $values) = explode("\t", $line);
$newData["$num"] = explode(", ", trim($values));
}
unset($data);
// reverse the index/value association
foreach ($newData as $index => $values) {
asort($values);
foreach($values as $value) {
if (!isset($data[$value]))
$data[$value] = array();
if (!in_array($index, $data[$value]))
array_push($data[$value], $index);
}
}
// printout the result
foreach ($data as $index => $values) {
echo "$index\t" . implode(", ", $values) . "\n";
}
not really optimized or good looking, but it works...
# use Modern::Perl;
use strict;
use warnings;
use feature qw'say';
our %data;
while(<>){
chomp;
my($number,$csv) = split /\t/;
my #csv = split m"\s*,\s*", $csv;
push #{$data{$_}}, $number for #csv;
}
for my $number (sort keys %data){
my #unique = sort keys %{{ map { ($_,undef) } #{$data{$number}} }};
say $number, "\t", join ', ', #unique;
}
Here is an example using CPAN's Text::CSV module rather than manual parsing of CSV fields:
use strict;
use warnings;
use Text::CSV;
my %hash;
my $csv = Text::CSV->new({ allow_whitespace => 1 });
open my $file, "<", "file/to/read.txt";
while(<$file>) {
my ($first, $rest) = split /\t/, $_, 2;
my #values;
if($csv->parse($rest)) {
#values = $csv->fields()
} else {
warn "Error: invalid CSV: $rest";
next;
}
foreach(#values) {
push #{ $hash{$_} }, $first;
}
}
# this can be shortened, but I don't remember whether sort()
# defaults to <=> or cmp, so I was explicit
foreach(sort { $a cmp $b } keys %hash) {
print "$_\t", join(",", sort { $a <=> $b } #{ $hash{$_} }), "\n";
}
Note that it will print to standard output. I recommend just redirecting standard output, and if you expand this program at all, make sure to use warn() to print any errors, rather than just print()ing them. Also, it won't check for duplicate entries, but I don't want to make my code look like Brad Gilbert's, which looks a bit wack even to a Perlite.
Here's an awk(1) and sort(1) answer:
Your data is basically a many-to-many data set so the first step is to normalise the data with one key and value per line. We'll also swap the keys and values to indicate the new primary field, but this isn't strictly necessary as the parts lower down do not depend on order. We use a tab or [spaces],[spaces] as the field separator so we split on the tab between the key and values, and between the values. This will leave spaces embedded in the values, but trim them from before and after:
awk -F '\t| *, *' '{ for (i=2; i<=NF; ++i) { print $i"\t"$1 } }'
Then we want to apply your sort order and eliminate duplicates. We use a bash feature to specify a tab char as the separator (-t $'\t'). If you are using Bourne/POSIX shell, you will need to use '[tab]', where [tab] is a literal tab:
sort -t $'\t' -u -k 1f,1 -k 2n
Then, put it back in the form you want:
awk -F '\t' '{
if (key != $1) {
if (key) printf "\n";
key=$1;
printf "%s\t%s", $1, $2
} else {
printf ", %s", $2
}
}
END {printf "\n"}'
Pipe them altogether and you should get your desired output. I tested with the GNU tools.

Resources