I have a file with 1 million lines like this:
aaa,111
bbb,222
...
...
a3z,222 (line# 500,000)
...
...
bz1,444 (last line# 1 million)
What I need to check is whether the second value after the comma is unique or not. If not, then print out the line number. In the example above it should print out
Duplicate: line: 500000 value: a3z,222
For this I am using Perl and storing the value of the second column in an array. If I don't find a value in the array, I add it. If the value already exists, I print it out as a duplicate.
The problem is that the logic I am using is super slow. It takes anywhere from 2-3 hours to complete. Is there a way I can speed this up? I don't want to create an array if I don't have to. I just want to check for duplicate values in column 2 of the file.
If there is a faster way to do it in a batch-file I am open to it.
Here's my working code.
# header
use warnings;
use DateTime;
use strict;
use POSIX qw(strftime);
use File::Find;
use File::Slurp;
use File::Spec;
use List::MoreUtils qw(uniq);
print "Perl Starting ... \n\n";
# Open the file for read access:
open my $filehandle, '<', 'my_huge_input_file.txt';
my $counter = 0;
my @uniqueArray;
# Loop through each line:
while (defined(my $recordLine = <$filehandle>))
{
# Keep track of line numbers
$counter++;
# Strip the linebreak character at the end.
chomp $recordLine;
my @fields = split(/,/, $recordLine);
my $val1=$fields[0];
my $val2=$fields[1];
if ( !($val2 ~~ @uniqueArray) && ($val2 ne "") )
{
push(@uniqueArray, $val2);
}
else
{
print ("DUP line: $counter - val1: $val1 - val2: $val2 \n");
}
}
print "\nPerl End ... \n\n";
That's one of the things a hash is for
use feature qw(say);
...
my %second_field_value;
while (defined(my $recordLine = <$filehandle>))
{
chomp $recordLine;
my @fields = split /,/, $recordLine;
if (exists $second_field_value{$fields[1]}) {
say "DUP line: $. -- #fields[0,1]";
}
++$second_field_value{$fields[1]};
}
This will accumulate all possible values for this field, as it must. We can also store suitable info about dupes as they are found, depending on what needs to be reported about them.
The line number (of the last-read filehandle) is available in the $. variable; no need for $counter.
Note that a check and a flag/counter setting can be done in one expression, for
if ($second_field_value{$fields[1]}++) { say ... } # already seen before
which is an idiom when checking for duplicates. Thanks to ikegami for bringing it up. This works by having the post-increment in the condition (so the check is done with the old value, and the count is up to date in the block).
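Putting that together, here is a minimal, self-contained sketch of the hash approach applied to the original problem (the file name is the one from the question; the error handling on open is my own addition):
use strict;
use warnings;
use feature qw(say);

open my $filehandle, '<', 'my_huge_input_file.txt'
    or die "Cannot open input file: $!";

my %seen;
while (my $recordLine = <$filehandle>) {
    chomp $recordLine;
    my ($val1, $val2) = split /,/, $recordLine;
    next unless defined $val2 and $val2 ne '';
    # post-increment in the condition: true only if the value was seen before
    say "DUP line: $. - val1: $val1 - val2: $val2" if $seen{$val2}++;
}
close $filehandle;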
I have to comment on the smart-match operator (~~) as well. It is widely understood that it has great problems in its current form and it is practically certain that it will suffer major changes, or worse. Thus, simply put, I'd say: don't use it. The code with it has every chance of breaking at some unspecified point in the future, possibly without a warning, and perhaps quietly.
Note on performance and "computational complexity," raised in comments.
Searching through an array on every line has O(n*m) complexity (n lines, m values), which is really O(n^2) here since the array gets a new value on each line (so m = n-1); further, the whole array gets searched for (practically) every line, as there normally aren't dupes. With the hash the complexity is O(n), as we have a constant-time lookup on each line.
The crucial thing is that all that is about the size of input. For a file of a few hundred lines we can't tell a difference. With a million lines, the reported run times are "anywhere from 2-3 hours" with array and "under 5 seconds" with hash.
The fact that "complexity" assessment deals with input size has practical implications.
For one, code with carelessly built algorithms which "runs great" may break miserably for unexpectedly large inputs -- or, rather, for realistic data once it comes to production runs.
On the other hand, it is often quite satisfactory to run with code that is cleaner and simpler even as it has worse "complexity" -- when we understand its use cases.
Generally, the complexity tells us how the runtime depends on size, not what exactly it is. So an O(n2) algorithm may well run faster than an O(n log n) one for small enough input. This has great practical importance and is used widely in choosing algorithms.
Use a hash. Arrays are good for storing sequential data, and hashes are good for storing random-access data. Your search of @uniqueArray is O(n) on each search, which is done once per line, making your algorithm O(n^2). A hash solution would be O(1) (more or less) on each search, which is done once per line, making it O(n) overall.
Also, use $. for line numbers - perl tracks it for you.
my %seen;
while(<$filehandle>)
{
chomp;
my ($val1, $val2) = split /,/;
# track all values and their line numbers.
push @{$seen{$val2}}, [$., $val1];
}
# now go through the full list, looking for anything that was seen
# more than once.
for my $val2 (grep { @{$seen{$_}} > 1 } keys %seen)
{
print "DUP line: $val2 was seen on lines ", join ", ", map { "$_[0] ($_[1]) " } #{$seen{$val2}};
print "\n";
}
This is all O(n). Much faster.
The hash answer you've accepted would be the standard approach here. But I wonder if using an array would be a little faster. (I've also switched to using $_ as I think it makes the code cleaner.)
use feature qw(say);
...
my @second_field_value;
while (<$filehandle>)
{
chomp;
my @fields = split /,/;
if ($second_field_value[$fields[1]]) {
say "DIP line: $. -- #fields";
}
++$second_field_value[$fields[1]];
}
It would be a pretty sparse array, but it might still be faster than the hash version. (I'm afraid I don't have the time to benchmark it.)
Update: I ran some basic tests. The array version is faster. But not by enough that it's worth worrying about.
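For anyone who wants to repeat such a test, a rough, untested sketch using the core Benchmark module might look like this; the generated data set and the sub names are made up for illustration, and the second field must be a non-negative integer for the array version to work:
use strict;
use warnings;
use Benchmark qw(cmpthese);

# Hypothetical data: lines whose second field is a smallish integer.
my @lines = map { "id$_," . int rand 1_000_000 } 1 .. 100_000;

cmpthese(-3, {
    hash_lookup => sub {
        my %seen;
        for my $line (@lines) {
            my (undef, $val) = split /,/, $line;
            my $dup = $seen{$val}++;
        }
    },
    array_lookup => sub {
        my @seen;
        for my $line (@lines) {
            my (undef, $val) = split /,/, $line;
            my $dup = $seen[$val]++;
        }
    },
});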
I have a large document that I want to build an index of for word searching. (I hear this type of array is really called a concordance.) Currently it takes about 10 minutes. Is there a fast way to do it? Currently I iterate through each paragraph, and if I find a word I have not encountered before, I add it to my word array, along with the paragraph number in a subsidiary array; any time I encounter that same word again, I add the paragraph number to the index:
associativeArray={chocolate:[10,30,35,200,50001],parsnips:[5,500,100403]}
This takes forever, well, 5 minutes or so. I tried converting this array to a string, but it is so large it won't work to include in a program file, even after removing stop words, and would take a while to convert back to an array anyway.
Is there a faster way to build a text index other than linear brute force? I'm not looking for a product that will do the index for me, just the fastest known algorithm. The index should be accurate, not fuzzy, and there will be no need for partial searches.
I think the best idea is to build a trie, adding one word at a time from your text, and storing for each leaf a list of the locations where that word can be found.
This would not only save you some space, since storing words with similar prefixes requires way less space, but the search will be faster too. Search time is O(M), where M is the maximum string length, and insert time is O(n), where n is the length of the key you are inserting.
Since the obvious alternative is a hash table, here you can find some more comparisons between the two.
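To make the trie idea concrete, here is a rough, untested Perl sketch using nested hashes; the '-locations' leaf key and the helper names are conventions I invented for the example (needs Perl 5.10+ for //):
use strict;
use warnings;

my %trie;   # each node is a hash of next characters; the leaf key holds locations

# Record that $word occurs at location $where (e.g. a paragraph number).
sub trie_insert {
    my ($word, $where) = @_;
    my $node = \%trie;
    $node = $node->{$_} //= {} for split //, $word;
    push @{ $node->{-locations} }, $where;
}

# Return the list of locations for $word, or an empty list if it is absent.
sub trie_lookup {
    my ($word) = @_;
    my $node = \%trie;
    for my $char (split //, $word) {
        return () unless $node = $node->{$char};
    }
    return @{ $node->{-locations} // [] };
}

trie_insert('chocolate', $_) for 10, 30, 35;
print join(', ', trie_lookup('chocolate')), "\n";   # 10, 30, 35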
I would use a HashMap<String, List<Occurrence>>. This way you can check whether a word is already in your index in about O(1).
At the end, when you have all words collected and want to search them very often, you might try to find a hash function that has no or nearly no collisions. This way you can guarantee O(1) time for the search (or nearly O(1) if you still have some collisions).
Well, apart from going along with MrSmith42's suggestion of using the built-in HashMap, I also wonder how much time you are spending tracking the paragraph number?
Would it be faster to change things to track line numbers instead? (Especially if you are reading the input line-by-line).
There are a few things unclear in your question, like what you mean by "I tried converting this array to a string, but it is so large it won't work to include in a program file, even after removing stop words, and would take a while to convert back to an array anyway." What array? Is your input in the form of an array of paragraphs, or do you mean the concordance entries per word, or what?
It is also unclear why your program is so slow; probably there is something inefficient there. I suspect it is how you check "if I find a word I have not encountered before" - I presume you look up the word in the dictionary and then iterate through the array of occurrences to see if the paragraph number is there? That is a slow linear search; you would be better served by using a set there (think hash/dictionary where you care only about the keys), kind of:
concord = {
'chocolate': {10:1, 30:1, 35:1, 200:1, 50001:1},
'parsnips': {5:1, 500:1, 100403:1}
}
and your check then becomes if paraNum in concord[word]: ... instead of a loop or binary search.
P.S. Actually, assuming you are keeping the list of occurrences in an array AND scanning the text from the first to the last paragraph, the arrays will be built already sorted, so you only need to check the very last element there: if word in concord and paraNum == concord[word][-1]:. (Examples are in pseudocode/Python, but you can translate them to your language.)
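For comparison, a minimal, untested Perl sketch of the same idea - a hash of sets, so the membership check is constant time; the sample text and the blank-line paragraph split are assumptions about the input:
use strict;
use warnings;

# Hypothetical input: the whole document as one string, paragraphs separated by blank lines.
my $text = "I like chocolate.\n\nParsnips are fine.\n\nMore chocolate, please.\n";

my %concord;                        # word => { paragraph number => 1 }
my @paragraphs = split /\n\s*\n/, $text;

for my $i (0 .. $#paragraphs) {
    for my $word (split /\W+/, lc $paragraphs[$i]) {
        next unless length $word;
        $concord{$word}{$i} = 1;    # set membership; repeats within a paragraph cost nothing
    }
}

# Constant-time check, no scan of an occurrence list:
print "found\n" if $concord{chocolate}{2};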
I have got a huge tab-separated file with up to 200 million rows (normally around 20 million) and two columns: the first column contains an ASCII word with up to 40 chars, the second contains an integer.
I would like to do the following steps:
sort by first column
delete duplicate rows to make all rows unique
read out all rows for given entry in first column
I have a memory limit of 3 GB (so reading all the data into a hash won't work), unlimited hard disk space, and want to run the script on a single core. I am intending to run several scripts in parallel, so the read and write operations on the hard disk shouldn't be too high.
How should I proceed with the implementation of my script (in Perl) considering the size of the file?
Which algorithm do you recommend for the first step considering the size of the file?
Step 3 is the most complex part I think. How should I handle this? I am not familiar with indexing algorithms. Could you suggest one that is best for the problem? Are there any Perl modules that I could use?
Does it make sense to first convert the file into a binary file (Like converting SAM to BAM)? If yes, have you got any instructions or algorithms for converting and handling such files?
Reading the entire file into a SQLite database would be my first attempt.
Define the table like:
create table mytuples (
mykey varchar(40),
myval integer,
constraint tuple_pk primary key(mykey, myval) on conflict ignore
);
A simple script using DBI which ignores insert errors should do it.
Untested, and error checks omitted
#!/usr/bin/env perl
use strict; use warnings;
use autodie;
use DBI;
my ($infile) = (@ARGV);
open my $in, '<', $infile;
my $dbh = DBI->connect('dbi:SQLite:some.db', undef, undef, {
AutoCommit => 0,
RaiseError => 0,
},
);
while (my $line = <$in>) {
my ($key, $val) = split ' ', $line;
$dbh->do(q{INSERT INTO mytuples VALUES(?, ?)}, undef, $key, $val);
}
$dbh->commit;
$dbh->disconnect;
This may end up slower than sort and grep on the command line for the initial processing, but you may appreciate the flexibility of having SQL at your disposal.
Use the system sort to sort the file; the latest GNU sort has a parallel option. Run uniq, and then reading the sorted file one line at a time and noticing when the first column changes is easy. The sort uses a sort/merge algorithm which splits the file up into smaller chunks to sort and then merges them, so memory is not an issue (except for speed) as long as you have plenty of disk.
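A rough, untested sketch of that approach; the file names are hypothetical, the sort/uniq step is shown only as a comment (check the exact flags against your GNU sort version), and the grouping pass assumes tab-separated columns as in the question:
use strict;
use warnings;

# Steps 1 and 2, done outside Perl first, e.g.:
#   sort -u huge.tsv > huge.sorted.tsv     # add --parallel=N with a recent GNU sort

open my $in, '<', 'huge.sorted.tsv' or die "open: $!";

my $current = '';
my @group;
while (my $line = <$in>) {
    chomp $line;
    my ($word, $num) = split /\t/, $line;
    if ($word ne $current) {
        process_group($current, \@group) if @group;   # step 3: all rows for one key
        $current = $word;
        @group   = ();
    }
    push @group, $num;
}
process_group($current, \@group) if @group;

sub process_group {
    my ($word, $nums) = @_;
    print "$word: @$nums\n";
}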
I am new to Perl and having some difficulty with arrays in Perl. Can somebody explain to me why I am not able to print the value of an array in the script below?
$sum=();
$min = 999;
$LogEntry = '';
foreach $item (1, 2, 3, 4, 5)
{
$min = $item if $min > $item;
if ($LogEntry eq '') {
push(#sum,"1"); }
print "debugging the IF condition\n";
}
print "Array is: $sum\n";
print "Min = $min\n";
The output I get is:
debugging the IF condition
debugging the IF condition
debugging the IF condition
debugging the IF condition
debugging the IF condition
Array is:
Min = 1
Shouldn't I get Array is: 1 1 1 1 1 (5 times)?
Can somebody please help?
Thanks.
You need two things:
use strict;
use warnings;
at which point the bug in your code ($sum instead of @sum) should become obvious...
$sum is not the same variable as @sum.
In this case you would benefit from starting your script with:
use strict;
use warnings;
Strict forces you to declare all variables, and warnings gives warnings.
In the meantime, change the first line to:
@sum = ();
and the second-to-last line to:
print "Array is: " . join (', ', #sum) . "\n";
See join.
As others have noted, you need to understand the way Perl uses sigils ($, #, %) to denote data structures and the access of the data in them.
You are using a scalar sigil ($), which will simply try to access a scalar variable named $sum; that has nothing to do with the completely distinct array variable named @sum - and you obviously want the latter.
What confuses you is likely the fact that, once the array variable @sum exists, you can access individual values in the array using $sum[0] syntax, but here the sigil+brackets ($...[...]) act as a "unified" syntactic construct.
The first thing you need to do (after using strict and warnings) is to read the following documentation on sigils in Perl (aside from good Perl book):
https://stackoverflow.com/a/2732643/119280 - brian d. foy's excellent summary
The rest of the answers to the same question
This SO answer
The best summary I can give you on the syntax of accessing data structures in Perl is (quoting from my older comment)
the sigil represents the amount of data from the data structure that you are retrieving ($ for 1 element, @ for a list of elements, % for the entire hash)
whereas the bracket style represents what your data structure is (square for an array, curly for a hash).
As a special case, when there are NO brackets, the sigil will represent BOTH the amount of data, as well as what the data structure is.
Please note that in your specific case, it's the last bullet point that matters. Since you're referring to the array as a whole, you won't have brackets, and therefore the sigil will represent the data structure type - since it's an array, you must use the @ sigil.
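A tiny sketch illustrating those three cases (the variable name is arbitrary):
use strict;
use warnings;

my @sum = (10, 20, 30);

my $one   = $sum[0];       # $ + square brackets: one element of the array
my @slice = @sum[0, 1];    # @ + square brackets: a slice (a list of elements)
my @whole = @sum;          # @ with no brackets: the whole array

print "$one | @slice | @whole\n";   # 10 | 10 20 | 10 20 30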
You push the values into the array @sum, then finish up by printing the scalar $sum. @sum and $sum are two completely independent variables. If you print "@sum\n" instead, you should get the output "1 1 1 1 1".
print "Array is: $sum\n";
will print a non-existent scalar variable called $sum, not the array @sum and not the first item of the array.
If you use strict it will flag the use of undeclared variables like this.
You should definitely add use strict; and use warnings; to your script. That would have complained about the print "Array is: $sum\n"; line (among others).
And you initialize an array with my @sum = ();, not with my $sum = ();
Like CFL_Jeff mentions, you can't just do a quick print. Instead, do something like:
print "Array is ".join(', ',#array);
Still would like to add some details to this picture. )
See, Perl is well-known as a Very High Level Language. And this is not just because you can replace (1,2,3,4,5) with (1..5) and get the same result.
And not because you may leave your variables without (explicitly) assigning some initial values to them: my @arr is as good as my @arr = (), and my $scal (instead of my $scal = 'some filler value') may actually save you an hour or two one day. Perl is usually (with use warnings, yes) good at spotting undefined values in unusual places - but not so lucky with 'filler values'...
The true point of a VHLL is, in my opinion, that you can express a solution in Perl code much as you would in any human language available (and some of those may be even less suitable for the case).
Don't believe me? Ok, check your code - or rather your set of tasks, for example.
Need to find the lowest element in an array? Or the sum of all values in an array? The List::Util module is at your command:
use feature qw(say);
use List::Util qw( min sum );
my @array_of_values = (1..10);
my $min_value = min( @array_of_values );
my $sum_of_values = sum( @array_of_values );
say "In the beginning was... @array_of_values";
say "With lowest point at $min_value";
say "Collected they give $sum_of_values";
Need to construct an array from another array, filtering out unneeded values? grep is here to save the day:
@filtered_array = grep { $filter_condition } @source_array;
See the pattern? Don't try to code your solution into some machine-codish mumbo-jumbo. ) Find a solution in your own language, then just find means to translate THAT solution into Perl code instead. It's easier than you thought. )
Disclaimer: I do understand that reinventing the wheel may be good for learning why wheels are so useful in the first place. ) But I do see how often wheels are reimplemented - becoming uglier and slower in the process - in production code, just because people got used to this mode of thinking.
If I have an array:
@int_array = (7,101,80,22,42);
How can I check if the integer value 80 is in the array without looping through every element?
You can't without looping. That's part of what it means to be an array. You can use an implicit loop using grep or smartmatch, but there's still a loop. If you want to avoid the loop, use a hash instead (or in addition).
# grep
if ( grep $_ == 80, @int_array ) ...
# smartmatch
use 5.010001;
if ( 80 ~~ @int_array ) ...
Before using smartmatch, note:
http://search.cpan.org/dist/perl-5.18.0/pod/perldelta.pod#The_smartmatch_family_of_features_are_now_experimental:
The smartmatch family of features are now experimental
Smart match, added in v5.10.0 and significantly revised in v5.10.1, has been a regular point of complaint. Although there are a number of ways in which it is useful, it has also proven problematic and confusing for both users and implementors of Perl. There have been a number of proposals on how to best address the problem. It is clear that smartmatch is almost certainly either going to change or go away in the future. Relying on its current behavior is not recommended.
Warnings will now be issued when the parser sees ~~, given, or when. To disable these warnings, you can add this line to the appropriate scope
CPAN solution: use List::MoreUtils
use List::MoreUtils qw{any};
print "found!\n" if any { $_ == 7 } (7,101,80,22,42);
If you need to do MANY MANY lookups in the same array, a more efficient way is to store the array in a hash once and look up in the hash:
@int_array{@int_array} = 1;
foreach my $lookup_value (@lookup_values) {
print "found $lookup_value\n" if exists $int_array{$lookup_value}
}
Why use this solution over the alternatives?
Can't use smart match in Perl before 5.10. According to this SO post by brian d foy, smart match is short-circuiting, so it's as good as the "any" solution for 5.10.
The grep solution loops through the entire list even if the first element of a 1,000,000-element list matches. any will short-circuit and quit the moment the first match is found, so it is more efficient. The original poster explicitly said "without looping through every element".
If you need to do LOTs of lookups, the one-time sunk cost of hash creation makes the hash lookup method a LOT more efficient than any other. See this SO post for details
Yet another way to check for a number in an array:
#!/usr/bin/env perl
use strict;
use warnings;
use List::Util 'first';
my @int_array = qw( 7 101 80 22 42 );
my $number_to_check = 80;
if ( first { $_ == $number_to_check } @int_array ) {
print "$number_to_check exists in ", join ', ', @int_array;
}
See List::Util.
if ( grep /^80$/, @int_array ) {
...
}
If you are using Perl 5.10 or later, you can use the smart match operator ~~:
my $found = (80 ~~ @int_array);
I need to see if there are duplicates in an array of strings, what's the most time-efficient way of doing it?
One of the things I love about Perl is its ability to almost read like English. It just sort of makes sense.
use strict;
use warnings;
my @array = qw/yes no maybe true false false perhaps no/;
my %seen;
foreach my $string (@array) {
next unless $seen{$string}++;
print "'$string' is duplicated.\n";
}
Output
'false' is duplicated.
'no' is duplicated.
Turning the array into a hash is the fastest way [O(n)], though it's memory inefficient. Using a for loop is a bit faster than grep, but I'm not sure why.
#!/usr/bin/perl
use strict;
use warnings;
my %count;
my %dups;
for (@array) {
$dups{$_}++ if $count{$_}++;
}
A memory efficient way is to sort the array in place and iterate through it looking for equal and adjacent entries.
# not exactly sort in place, but Perl does a decent job optimizing it
@array = sort @array;
my $last;
my %dups;
for my $entry (@array) {
$dups{$entry}++ if defined $last and $entry eq $last;
$last = $entry;
}
This is O(n log n) speed, because of the sort, but it only needs to store the duplicates rather than a second copy of the data in %count. Worst-case memory usage is still O(n) (when everything is duplicated), but if your array is large and there are not a lot of duplicates you'll win.
Theory aside, benchmarking shows the latter starts to lose on large arrays (like over a million) with a high percentage of duplicates.
If you need the uniquified array anyway, it is fastest to use the heavily-optimized library List::MoreUtils, and then compare the result to the original:
use strict;
use warnings;
use List::MoreUtils 'uniq';
my @array = qw(1 1 2 3 fibonacci!);
my @array_uniq = uniq @array;
print( (scalar(@array) == scalar(@array_uniq) ? "no dupes" : "dupes") . " found!\n" );
Or if the list is large and you want to bail as soon as a duplicate entry is found, use a hash:
my %uniq_elements;
foreach my $element (@array)
{
die "dupe found!" if $uniq_elements{$element}++;
}
Create a hash or a set or use a collections.Counter().
As you encounter each string/input check to see if there's an instance of that in the hash. If so, it's a duplicate (do whatever you want about those). Otherwise add a value (such as, oh, say, the numeral one) to the hash, using the string as the key.
Example (using Python collections.Counter):
#!python
import collections
counts = collections.Counter(mylist)
uniq = [i for i,c in counts.iteritems() if c==1]
dupes = [i for i, c in counts.iteritems() if c>1]
These Counters are built around dictionaries (Python's name for hashed mapping collections).
This is time efficient because hash keys are indexed. In most cases the lookup and insertion of keys takes near-constant time. (In fact, Perl "hashes" are so called because they are implemented using an algorithmic trick called "hashing" --- a sort of checksum chosen for its extremely low probability of collision when fed arbitrary inputs.)
If you initialize values to integers, starting with 1, then you can increment each value as you find its key already in the hash. This is just about the most efficient general purpose means of counting strings.
Not a direct answer, but this will return an array without duplicates:
#!/usr/bin/perl
use strict;
use warnings;
my @arr = ('a','a','a','b','b','c');
my %count;
my @arr_no_dups = grep { !$count{$_}++ } @arr;
print @arr_no_dups, "\n";
Please don't ask about the most time efficient way to do something unless you have some specific requirements, such as "I have to dedupe a list of 100,000 integers in under a second." Otherwise, you're worrying about how long something takes for no reason.
Similar to @Schwern's second solution, but it checks for duplicates a little earlier, from within the comparison function of sort:
use strict;
use warnings;
@_ = sort { print "dup = $a$/" if $a eq $b; $a cmp $b } @ARGV;
It won't be as fast as the hashing solutions, but it requires less memory and is pretty darn cute.