Most efficient process of matching array elements in Perl? - arrays

I have two arrays (Array1 and Array2). Array1 contains approximately 20,000 non-unique elements. Array2 contains approximately 90,000 unique elements. I am trying to count the number of elements in Array1 that also show up as elements of Array2. The perl script below does this successfully but slowly. Is there another method for counting the number of Array1 elements that exist in Array2 that is likely to be faster than this?
#!/usr/bin/perl
use strict;
use warnings;
use autodie;
# This program counts the total number of elements in one array that exist in a separate array.
my $Original_Array_File = "U:/Perl/MasterArray.txt";
# Read in the master file into an array.
open my $file, '<', $Original_Array_File or die $!;
my #Array2 = <$file>;
close $file;
my $path = "C:/Files by Year/1993";
chdir($path) or die "Cant chdir to $path $!";
for my $new_file ( grep -f, glob('*.txt') ) {
open my ($new_fh), '<', $new_file;
my #Array1 = <$new_file>;
my #matched_array_count ;
foreach #Array1 {
++$matched_array_count if ($_ ~~ #Array2 ) ;
}

The primary source of your performance problem is:
foreach #Array1 {
++$matched_array_count if ($_ ~~ #Array2 );
}
According to the smartmatch documentation, the behavior of
$string ~~ #array
is like
grep { $string ~~ $_ } #array
and
$string1 ~~ $string2
is like
$string1 eq $string2
Putting these together with your original code snippet, we get something like:
foreach my $string1 #Array1 {
++$matched_array_count if grep { $string1 eq $_ } #Array2;
}
In other words, take the first element in #Array1 and compare it to every single element in #Array2, then take the second element in #Array1 and compare it to every single element in #Array2, and so on. This works out to
scalar #Array1 * scalar #Array2
comparisons, or roughly 1.8 billion total.
The way this is usually done in Perl is with a hash. It is significantly faster to do a single hash lookup than to search through every element in an array. The basic algorithm is:
Load your "haystack" (what you're searching in) into a hash.
Loop through your "needles" (what you're searching for) and use exists to see if there's a matching key in the hash.
In your particular case, you would load the contents of U:/Perl/MasterArray.txt into a hash and then perform
scalar #Array1
hash lookups, or ~20,000. That will be significantly faster than what you have now.
Here is an example of searching for words in a Linux dictionary file:
#!/usr/bin/perl
use strict;
use warnings;
use 5.010;
my $file = '/usr/share/dict/linux.words';
open my $fh, '<', $file or die "Failed to open '$file': $!";
my %haystack = map { chomp; $_ => 1 } <$fh>;
my $count;
while (<DATA>) {
chomp;
$count++ if exists $haystack{$_};
}
say $count;
__DATA__
foo
bar
foobar
fubar
Output:
3
(Apparently, "foobar" is a word. "FUBAR" is a word, but "fubar" in lowercase is not.)
Smartmatch is experimental
You should also be aware that as of Perl 5.18.0, the smartmatch family of features are now experimental:
Smart match, added in v5.10.0 and significantly revised in v5.10.1, has been a regular point of complaint. Although there are a number of ways in which it is useful, it has also proven problematic and confusing for both users and implementors of Perl. There have been a number of proposals on how to best address the problem. It is clear that smartmatch is almost certainly either going to change or go away in the future. Relying on its current behavior is not recommended. (emphasis added)

It looks like you want to compare many files.
You might split this up to use a pool of threads with one thread per file. Or if you are more memory constrained you might split one of the arrays into parts and have each part sent to a different thread.

Related

compare an array of string with another array of strings in perl

I want to compare an array of string with another array of strings; if it matches, print matched.
Example:
#array = ("R-ID 1.0001", "RA-ID 61.02154", "TCA-ID 49.021456","RCID 61.02154","RB-ID 61.02154");
#var = ("TCA-ID 49", "R-ID 1");
for (my $x = 0; $x <= 4; $x++)
{
$array[$x] =~ /(.+?)\./;
if( ($var[0] eq $1) or ($var[1] eq $1) )
{
print "\n deleted rows are :#array\n";
}
else
{
print "printed rows are : #array \n";
push(#Matrix, \#array);
}
Then I need to compare #var with the #array; if it is matched, print the matched pattern.
Here the entire logic is in a hireartical for loop which gives a new #array in each iteration. so every time this logic is executed #array has different strings.
Then comes with #var it is user input field, this #var can be of any size. So in order to run the logic according to these constraints, I need to iterate the condition inside the if loop when the user input #var size is 3 for example.
So the goal is to match and delete the user input stings using the above mentioned logic. But unfortunately tis logic is not working. Could you please help me out in this issue.
The builtin grep keyword is a good place to start.
my $count = grep { $_ eq $var } #array;
This returns a count of items ($_) in the array which are equal (eq) to $var.
If you needed case-insensitive matching, you could use lc (or in Perl 5.16 or above, fc) to do that:
my $count = grep { lc($_) eq lc($var) } #array;
Now, a disadvantage to grep is that it is counting the matches. So after if finds the first match, it will keep on going until the end of the array. You don't seem to want that, but just want to know if any item in the array matches, in which case keeping on going might be slower than you need if it's a big array with thousands of elements.
So instead, use any from the List::Util module (which is bundled with Perl).
use List::Util qw( any );
my $matched = any { $_ eq $var } #array;
This will match as soon as it finds the first matching element, and skip searching the rest of the array.
Here is a couple of versions that allows multiple strings to be matched. Not clear what form $var takes when you want to store multiple, so assuming they are in an array #var for now.
The key point is this one is the use of the lookup hash to to the matching.
use strict;
use warnings;
my #var = ("TCA-ID 49", "RA-ID 61");
my #array = ("R-ID 1", "RA-ID 61", "TCA-ID 49");
# create a lookup for the strings to match
my %lookup = map { $_ => 1} #var ;
for my $entry (#array)
{
print "$entry\n"
if $lookup{$entry} ;
}
running gives
RA-ID 61
TCA-ID 49
Next, using a regular expression to do the matching
use strict;
use warnings;
my #var = ("TCA-ID 49", "RA-ID 61");
my #array = ("R-ID 1", "RA-ID 61", "TCA-ID 49");
my $re = join "|", map { quotemeta } #var;
print "$_\n" for grep { /^($re)$/ } #array ;
output is the same

How to read a .txt file and store it into an array

I know this is a fairly simple question, but I cannot figure out how to store all of the values in my array the way I want to.
Here is a small portion what the .txt file looks like:
0 A R N D
A 2 -2 0 0
R -2 6 0 -1
N 0 0 2 2
D 0 -1 2 4
Each value is delimited by either two spaces - if the next value is positive - or a space and a '-' - if the next value is negative
Here is the code:
use strict;
use warnings;
open my $infile, '<', 'PAM250.txt' or die $!;
my $line;
my #array;
while($line = <$infile>)
{
$line =~ /^$/ and die "Blank line detected at $.\n";
$line =~ /^#/ and next; #skips the commented lines at the beginning
#array = $line;
print "#array"; #Prints the array after each line is read
};
print "\n\n#array"; #only prints the last line of the array ?
I understand that #array only holds the last line that was passed to it. Is there a way where I can get #array to hold all of the lines?
You are looking for push.
push #array, $line;
You undoubtedly want to precede this with chomp to snip any newlines, first.
If file is small as compared to available memory of your machine then you can simply use below method to read content of file in to an array
open my $infile, '<', 'PAM250.txt' or die $!;
my #array = <$infile>;
close $infile;
If you are going to read a very large file then it is better to read it line by line as you are doing but use PUSH to add each line at end of array.
push(#array,$line);
I will suggest you also read about some more array manipulating functions in perl
You're unclear to what you want to achieve.
Is every line an element of your array?
Is every line an array in your array and your "words" are the elements of this array?
Anyhow.
Here is how you can achieve both:
use strict;
use warnings;
use Data::Dumper;
# Read all lines into your array, after removing the \n
my #array= map { chomp; $_ } <>;
# show it
print Dumper \#array;
# Make each line an array so that you have an array of arrays
$_= [ split ] foreach #array;
# show it
print Dumper \#array;
try this...
sub room
{
my $result = "";
open(FILE, <$_[0]);
while (<FILE>) { $return .= $_; }
close(FILE);
return $result;
}
so you have a basic functionality without great words. the suggest before contains the risk to fail on large files. fastest safe way is that. call it as you like...
my #array = &room('/etc/passwd');
print room('/etc/passwd');
you can shorten, rename as your convinience believes.
to the kidding ducks nearby: by this way the the push was replaced by simplictiy. a text-file contains linebreaks. the traditional push removes the linebreak and pushing up just the line. the construction of an array is a simple string with linebreaks. now contain the steplength...

Changing element's positions in Perl

So I have a problem and I can't solve it. If I read some words from a file in Perl, in that file the words aren't in order, but have a number (as a first character) that should be the element's position to form a sentence.The 0 means that position is correct, 1 means that the word should be in position [1] etc.
The file looks like: 0This 3a 4sentence 2be 1should, and the solution should look like 0This 1should 2be 3a 4sentence.
In a for loop I get through the words array that i get from the file, and this is how i get the first character(the number) $firstCharacter = substr $words[$i], 0, 1;, but i don't know how to properly change the array.
Here's the code that I use
#!/usr/bin/perl -w
$arg = $ARGV[0];
open FILE, "< $arg" or die "Can't open file: $!\n";
$/ = ".\n";
while($row = <FILE>)
{
chomp $row;
#words = split(' ',$row);
}
for($i = 0; $i < scalar #words; $i++)
{
$firstCharacter = substr $words[$i], 0, 1;
if($firstCharacter != 0)
{
}
}
Just use sort. You can use a match in list context to extract the numbers, using \d+ will work even for numbers > 9:
#! /usr/bin/perl
use warnings;
use strict;
use feature qw{ say };
my #words = qw( 0This 3a 4sentence 2be 1should );
say join ' ', sort { ($a =~ /\d+/g)[0] <=> ($b =~ /\d+/g)[0] } #words;
If you don't mind the warnings, or you are willing to turn them off, you can use numeric comparison directly on the words, Perl will extract the numeric prefixes itself:
no warnings 'numeric';
say join ' ', sort { $a <=> $b } #words;
Assuming you have an array like this:
my #words = ('0This', '3a', '4sentence', '2be', '1should');
And you want it sorted like so:
('0This', '1should', '2be', '3a', '4sentence');
There's two steps to this. First is extracting the leading number. Then sorting by that number.
You can't use substr, because you don't know how long the number might be. For example, ('9Second', '12345First'). If you only looked at the first character you'd get 9 and 1 and sort them incorrectly.
Instead, you'd use a regex to capture the number.
my($num) = $word =~ /^(\d+)/;
See perlretut for more on how that works, particularly Extracting Matches.
Now that you can capture the numbers, you can sort by them. Rather than doing it in loop yourself, sort handles the sorting for you. All you have to do is supply the criterion for the sorting. In this case we capture the number from each word (assigned to $a and $b by sort) and compare them as numbers.
#words = sort {
# Capture the number from each word.
my($anum) = $a =~ /^(\d+)/;
my($bnum) = $b =~ /^(\d+)/;
# Compare the numbers.
$anum <=> $bnum
} #words;
There are various ways to make this more efficient, in particular the Schwartzian Transform.
You can also cheat a bit.
If you ask Perl to treat something as a number, it will do its damnedest to comply. If the string starts with a number, it will use that and ignore the rest, though it will complain.
$ perl -wle 'print "23foo" + "42bar"'
Argument "42bar" isn't numeric in addition (+) at -e line 1.
Argument "23foo" isn't numeric in addition (+) at -e line 1.
65
We can take advantage of that to simplify the sort by just comparing the words as numbers directly.
{
no warnings 'numeric';
#words = sort { $a <=> $b } #words;
}
Note that I turned off the warning about using a word as a number. use warnings and no warnings only has effect within the current block, so by putting the no warnings 'numeric' and the sort in their own block I've only turned off the warning for that one sort statement.
Finally, if the words are in a file you can use the Unix sort utility from the command line. Use -n for "numeric sorting" and it will do the same trick as above.
$ cat test.data
00This
3a
123sentence
2be
1should
$ sort -n test.data
00This
1should
2be
3a
123sentence
You should be able to split on the spaces, which will make the numbers the first character of the word. With that assumption, you can simply compare using the numerical comparison operator (<=>) as opposed to the string comparison (cmp).
The operators are important because if you compare strings, the first character is used, meaning 10, 11, and 12 would be out of order, and listed near the 1 (1,10,11,12,2,3,4… instead of 1,2,3,4…10,11,12).
Split, Then Sort
Note: #schwern commented an important point. If you use warnings -- and you should -- you will receive warnings. This is because the values of the internal comparison variables, $a and $b, aren't numbers, but strings (e.g., `"0this", "3a"). I've update the following Codepad and provided more suitable alternatives to avoid this issue.
http://codepad.org/xs2GH9xT
use strict;
use warnings;
my $line = q{0This 3a 4sentence 2be 1should};
my #words = split /\s/,$line;
my #sorted = sort {$a <=> $b} #words;
print qq{
Line: $line
Words: #words
Sorted: #sorted
};
Alternatives
One method is to ignore the warning using no warnings 'numeric' as in Schwern's answer. As he has shown, turning off the warnings in a block will re-enable it afterwards, which may be a little foolproof compared to Choroba's answer, which applies it to the broader scope.
Choroba's solution works by parsing the digits from the those values internally. This is much fewer lines of code, but I would generally advise against that for performance reasons. The regex isn't only run once per word, but multiple times over the sorting process.
Another method is to strip the numbers out and use them for the sort comparison. I attempt to do this below by creating a hash, where the key will be the number and the value will be the word.
Hash Mapping / Key Sort
Once you have an array where the values are the words prefixed by the numbers, you could just as easily split those number/word combo into a hash that has the key as the number and value as the word. This is accomplished by using split.
The important thing to note about the split statement is that a limit is passed (in this case 2), which limits the maximum number of fields the string is split into.
The two values are then used in the map to build the key/value assignment. Thus "0This" is split into "0" and "This" to be used in the hash as "0"=>"This"
http://codepad.org/kY8wwajc
use strict;
use warnings;
my $line = q{0This 3a 4sentence 2be 1should};
my #words = split /\s/, $line; # [ '0This', '3a', ... ]
my %mapped = map { split /(?=\D)/, $_, 2 } #words; # { '0'=>'This, '3'=>'a', ... }
my #sorted = #mapped{ sort { $a <=> $b } keys %mapped }; # [ 'This', 'should', 'be', ... ]
print qq{
Line: $line
Words: #words
Sorted: #sorted
};
This also can be further optimized, but uses multiple variables to illustrate the steps in the process.

Is there a way to create variable Arrays in Perl?

So, well I am trying around again and now I am stuck.
while (<KOERGEBNIS>){
my $counter = 0;
my $curline = $_;
for (my $run = 0; $run < $arrayvalue; $run++){
if ($curline =~ m/#tidgef[$counter]/){
my $row = substr($curline, 0, 140);
push #array$counter, $row;
print "Row $. was saved in ID: #filtered[$counter]\n";
}
$counter++;
}
}
Background is that I want to save all lines beginning with the same 8 characters in the same array so I can count the lines and start working with those arrays. The only thing I could think of right now is with switch and cases but I thought I'd ask first before throwing this code to garbage.
Example:
if theres a line in a .txt like this:
50004000_xxxxxxxxxxxxxx31
50004000_xxxxxxxxxxxxxx33
60004001_xxxxxxxxxxxxxx11
60004001_xxxxxxxxxxxxxx45
I took the first 8 chars of each line and used uniq to filter duplicates and saved them in the array #tidgef, now I want to save Line1 and Line2 in #array1 or even better #array50004000 and Line4 and Line4 to #array2 or #array60004001.
I hope I explained my problem well enough! thank you guys
You're hovering dangerously close to an idea called "symbolic references" (also known as "use a variable to get a variable's name"). It's a very bad idea, for all sorts of reasons.
It's a much better idea to use this as an excuse to learn about complex data structures in Perl. It's not really clear what you want to do with this data, but this example should get you started:
#!/usr/bin/perl
use strict;
use warnings;
use 5.010;
use Data::Dumper;
my %lines;
while (<DATA>) {
chomp;
my $key = substr($_, 0, 8);
push #{$lines{$key}}, $_;
}
say Dumper \%lines;
__DATA__
50004000_xxxxxxxxxxxxxx31
50004000_xxxxxxxxxxxxxx33
60004001_xxxxxxxxxxxxxx11
60004001_xxxxxxxxxxxxxx45
You should think carefully about why you want arrays called #array50004000 #array60004001. Your program could create them, but you have no way of knowing what those names are. While the code is running, unless you are stepping through it with the debugger, they may be called #x and #y for all you know. You can't even dump their contents because you have no idea what to dump
What you're looking for is a hash, specifically a hash of arrays. Unlike the symbol table, there are operators like keys, values and each that will allow you to enquire what values have been stored in a hash
Your code would look something like this. I have used the example data from your question and put it into myfile
use strict;
use warnings 'all';
my %data;
open KOERGEBNIS, '<', 'myfile' or die $!;
while ( <KOERGEBNIS> ) {
chomp;
my ($key) = split /_/;
push #{ $data{$key} }, $_;
}
for my $key ( sort keys %data ) {
my $val = $data{$key};
print $key, "\n";
print " $_\n" for #$val;
print "\n";
}
output
50004000
50004000_xxxxxxxxxxxxxx31
50004000_xxxxxxxxxxxxxx33
60004001
60004001_xxxxxxxxxxxxxx11
60004001_xxxxxxxxxxxxxx45

Perl - initialization of hash

I'm not sure how to correctly initialize my hash - I'm trying to create a key/value pair for values in coupled lines in my input file.
For example, my input looks like this:
#cluster t.18
46421 ../../../output###.txt/
#cluster t.34
41554 ../../../output###.txt/
I'm extracting the t number from line 1 (#cluster line) and matching it to output###.txt in the second line (line starting with 46421). However, I can't seem to get these values into my hash with the script that I have written.
#!/usr/bin/perl
use warnings;
use strict;
my $key;
my $value;
my %hash;
my $filename = 'input.txt';
open my $fh, '<', $filename or die "Can't open $filename: $!";
while (my $line = <$fh>) {
chomp $line;
if ($line =~ m/^\#cluster/) {
my #fields = split /(\d+)/, $line;
my $key = $fields[1];
}
elsif ($line =~ m/^(\d+)/) {
my #output = split /\//, $line;
my $value = $output[5];
}
$hash{$key} = $value;
}
It's a good idea, but your $key that is created with my in the if block is a local variable scoped to that block, masking the global $key. Inside the if block the symbol $key has nothing to do with the one you nicely declared upfront. See my in perlsub.
This local $key goes out of scope as soon as if is done and does not exist outside the if block. The global $key is again available after the if, being visible elsewhere in the loop, but is undefined since it has never been assigned to. The same goes for $value in the elsif block.
Just drop the my declaration inside the loop, thus assign to those global variables (as intended?). So, $key = ... and $value = ..., and the hash will be assigned correctly.
Note -- this is about how to get that hash assignment right. I don't know how your actual data looks and whether the line is parsed correctly. Here is a toy input.txt
#cluster t.1
1111 ../../../output1.1.txt/
#cluster t.2
2222 ../../../output2.2.txt/
I pick the 4th field instead of the 6th, $value = $output[3];, and add
print "$_ => $hash{$_}\n" for keys %hash;
after the loop. This prints
1 => output1.1.txt
2 => output2.2.txt
I am not sure whether this is what you want but the hash is built fine.
A comment on choice of tools in parsing
You parse the lines for numbers, by using the property of split to return the separators as well, when they are captured. That is neat, but in some sense it reverses its main purpose, which is to extract other components from the string, as delimited by the pattern. Thus it may make the purpose of the code a little bit convoluted, and you also have to index very precisely to retrieve what you need.
Instead of using split to extract the delimiter itself, which is given by a regex, why not extract it by a regex? That makes the intention crystal clear, too. For example, with input
#cluster t.10 has 4319 elements, 0 subclusters
37652 ../../../../clust/output43888.txt 1.397428
the parsing can go as
if ($line =~ m/^\#cluster/) {
($key) = $line =~ /t\.(\d+)/;
}
elsif ($line =~ m/^(\d+)/) {
($value) = $line =~ m|.*/(\w+\.txt)|;
}
$hash{$key} = $value if defined $key and defined $value;
where t\. and \.txt are added to more precisely specify the targets. If the target strings aren't certain to have that precise form, just capture \d+, and in the second case all non-space after the last /, say by m|^\d+.*/(\S+)|. We use the greediness of .*, which matches everything possible up to the thing that comes after it (a /), thus all the way to the very last /.
Then you can also reduce it to a single regex for each line, for example
if ($line =~ m/^\#cluster\s+t\.(\d+)/) {
$key = $1;
}
elsif ($line =~ m|^\d+.*/(\w+\.txt)|) {
$value = $1;
}
Note that I've added a condition to the hash assignment. The original code in fact assigns an undef on the first iteration, since no $value had yet been seen at that point. This is overwritten on the next iteration and we don't see it if we only print the hash afterwards. The condition also guards you against failed matches, for malformatted lines or such. Of course, far better checks can be run.

Resources