Perl: Create index for tab file with two columns - database

I have a huge tab-separated file with up to 200 million rows (normally around 20 million) and two columns: the first column contains an ASCII word of up to 40 characters, the second contains an integer.
I would like to do the following steps:
1. sort by the first column
2. delete duplicate rows to make all rows unique
3. read out all rows for a given entry in the first column
I have a memory limit of 3 GB (so reading all the data into a hash won't work), unlimited hard disk space, and want to run the script on a single core. I intend to run several scripts in parallel, so the read and write operations on the hard disk shouldn't be too high.
How should I proceed with the implementation of my script (in Perl), considering the size of the file?
Which algorithm do you recommend for the first step considering the size of the file?
Step 3 is the most complex part I think. How should I handle this? I am not familiar with indexing algorithms. Could you suggest one that is best for the problem? Are there any Perl modules that I could use?
Does it make sense to first convert the file into a binary file (like converting SAM to BAM)? If yes, do you have any instructions or algorithms for converting and handling such files?

Reading the entire file into a SQLite database would be my first attempt.
Define the table like:
create table mytuples (
    mykey varchar(40),
    myval integer,
    constraint tuple_pk primary key(mykey, myval) on conflict ignore
);
A simple script using DBI which ignores insert errors should do it.
Untested, and error checks omitted
#!/usr/bin/env perl
use strict; use warnings;
use autodie;
use DBI;
my ($infile) = @ARGV;
open my $in, '<', $infile;
my $dbh = DBI->connect('dbi:SQLite:some.db', undef, undef, {
    AutoCommit => 0,
    RaiseError => 0,
});
while (my $line = <$in>) {
    my ($key, $val) = split ' ', $line;
    $dbh->do(q{INSERT INTO mytuples VALUES(?, ?)}, undef, $key, $val);
}
$dbh->commit;
$dbh->disconnect;
This may end up slower than sort and grep on the command line for the initial processing, but you may appreciate the flexibility of having SQL at your disposal.
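Once the data is loaded, step 3 (reading out all rows for a given first-column entry) becomes a plain query, and since the primary key starts with mykey, SQLite can use its index for the lookup. A minimal sketch, assuming the some.db/mytuples names from the script above and taking the key on the command line:
#!/usr/bin/env perl
use strict; use warnings;
use DBI;
# Sketch only: look up every value stored for one key (step 3).
# Assumes the 'some.db' database and 'mytuples' table created above.
my ($wanted_key) = @ARGV;
my $dbh = DBI->connect('dbi:SQLite:some.db', undef, undef, { RaiseError => 1 });
my $vals = $dbh->selectcol_arrayref(
    q{SELECT myval FROM mytuples WHERE mykey = ?},
    undef, $wanted_key,
);
print "$wanted_key\t$_\n" for @$vals;
$dbh->disconnect;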

Use the system sort to sort the file. The latest GNU sort has a parallel option. Run uniq to remove duplicates; after that, reading the sorted file one line at a time and noticing when the first column changes is easy. sort uses a sort/merge algorithm that splits the file into smaller chunks, sorts them, and then merges them, so memory is not an issue (except for speed) as long as you have plenty of disk.
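As a rough sketch of that last part, assuming the file has already been sorted and de-duplicated on disk (the file name below is just a placeholder), grouping by the first column is a few lines of Perl:
#!/usr/bin/env perl
use strict; use warnings;
# Sketch: read a file that was already sorted and de-duplicated,
# e.g. with:  sort -u input.tsv > sorted.tsv
# and print each key once, followed by all of its values.
open my $in, '<', 'sorted.tsv' or die "open: $!";
my ($current_key, @values);
while (my $line = <$in>) {
    chomp $line;
    my ($key, $val) = split /\t/, $line, 2;
    if (defined $current_key && $key ne $current_key) {
        print join("\t", $current_key, @values), "\n";   # first column changed
        @values = ();
    }
    $current_key = $key;
    push @values, $val;
}
print join("\t", $current_key, @values), "\n" if defined $current_key;  # last group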

Related

How can I speed up the search on a file of about 2 GB in C under Linux

I'm developing a C program under Linux which searches a large file of about 2 GB.
The file consists of text rows terminated by '\n'; each row consists of five '|'-separated fields, like a|b|c|d|e|.
I then need to parse every row to carry out the search.
The file is sorted by field a, but the search is done mainly using fields b and c as search keys!
I tried using a memory-mapped file to speed up the search through the file, but I did not get satisfactory results, mainly, I think, for the reasons explained above.
Now I am thinking of using an array into which I insert the already-parsed data as structs, then sorting the array by fields b and c and applying a binary search only when the search keys are b and c; otherwise I use a sequential search.
Is it useful to use mapped memory to fill an array from a sequential file?
Is it a good way to improve the search?
Any suggestions are appreciated

Fastest way to check duplicate columns in a line in perl

I have a file with 1 million lines like this
aaa,111
bbb,222
...
...
a3z,222 (line# 500,000)
...
...
bz1,444 (last line# 1 million)
What I need to check is whether the second value after the comma is unique or not. If it is not, print out the line number. In the example above it should print out
Duplicate: line: 500000 value: a3z,222
For this I am using Perl and storing the value of the second column in an array. If I don't find a value in the array, I add it. If the value already exists, I print it out as a duplicate.
The problem is that the logic I am using is super slow. It takes anywhere from 2-3 hours to complete. Is there a way I can speed this up? I don't want to create an array if I don't have to. I just want to check for duplicate values in column 2 of the file.
If there is a faster way to do it in a batch file, I am open to it.
Here's my working code.
# header
use warnings;
use DateTime;
use strict;
use POSIX qw(strftime);
use File::Find;
use File::Slurp;
use File::Spec;
use List::MoreUtils qw(uniq);
print "Perl Starting ... \n\n";
# Open the file for read access:
open my $filehandle, '<', 'my_huge_input_file.txt';
my $counter = 0;
my @uniqueArray;
# Loop through each line:
while (defined(my $recordLine = <$filehandle>))
{
    # Keep track of line numbers
    $counter++;
    # Strip the linebreak character at the end.
    chomp $recordLine;
    my @fields = split(/,/, $recordLine);
    my $val1 = $fields[0];
    my $val2 = $fields[1];
    if ( !($val2 ~~ @uniqueArray) && ($val2 ne "") )
    {
        push(@uniqueArray, $val2);
    }
    else
    {
        print ("DUP line: $counter - val1: $val1 - val2: $val2 \n");
    }
}
print "\nPerl End ... \n\n";
That's one of the things a hash is for
use feature qw(say);
...
my %second_field_value;
while (defined(my $recordLine = <$filehandle>))
{
    chomp $recordLine;
    my @fields = split /,/, $recordLine;
    if (exists $second_field_value{$fields[1]}) {
        say "DUP line: $. -- @fields[0,1]";
    }
    ++$second_field_value{$fields[1]};
}
This will accumulate all possible values for this field, as it must. We can also store suitable info about dupes as they are found, depending on what needs to be reported about them.
The line number (of the last read filehandle) is available in the $. variable; there is no need for $counter.
Note that a check and a flag/counter setting can be done in one expression, for example
if ($second_field_value{$fields[1]}++) { say ... }  # already seen before
which is an idiom when checking for duplicates. Thanks to ikegami for bringing it up. This works by having the post-increment in the condition (so the check is done with the old value, and the count is up to date in the block).
I have to comment on the smart-match operator (~~) as well. It is widely understood that it has great problems in its current form and it is practically certain that it will suffer major changes, or worse. Thus, simply put, I'd say: don't use it. The code with it has every chance of breaking at some unspecified point in the future, possibly without a warning, and perhaps quietly.
Note on performance and "computational complexity," raised in comments.
Searching through an array on every line has O(n m) complexity (n lines, m stored values), which is really O(n^2) here since the array gets a new value on each line (so m = n-1); further, the whole array gets searched for (practically) every line, as there normally aren't dupes. With the hash the complexity is O(n), as we have a constant-time lookup on each line.
The crucial thing is that all that is about the size of input. For a file of a few hundred lines we can't tell a difference. With a million lines, the reported run times are "anywhere from 2-3 hours" with array and "under 5 seconds" with hash.
The fact that "complexity" assessment deals with input size has practical implications.
For one, code with carelessly built algorithms which "runs great" may break miserably for unexpectedly large inputs -- or, rather, for realistic data once it comes to production runs.
On the other hand, it is often quite satisfactory to run with code that is cleaner and simpler even as it has worse "complexity" -- when we understand its use cases.
Generally, the complexity tells us how the runtime depends on size, not what exactly it is. So an O(n^2) algorithm may well run faster than an O(n log n) one for small enough input. This has great practical importance and is used widely in choosing algorithms.
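If you want to see that effect directly rather than take the asymptotics on faith, a throwaway comparison with the core Benchmark module is one way to do it. This is only a sketch with a made-up data size, not something from the original post:
#!/usr/bin/env perl
use strict; use warnings;
use Benchmark qw(cmpthese);
use List::Util qw(first);

# Made-up data set: 5_000 distinct strings, purely for illustration.
my @values = map { "val$_" } 1 .. 5_000;

cmpthese(-1, {
    # Linear scan of everything seen so far, once per value: O(n^2) overall.
    array_scan => sub {
        my @seen;
        for my $v (@values) {
            push @seen, $v unless first { $_ eq $v } @seen;
        }
    },
    # Constant-time hash lookup per value: O(n) overall.
    hash_lookup => sub {
        my %seen;
        $seen{$_}++ for @values;
    },
});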
Use a hash. Arrays are good for storing sequential data, and hashes are good for storing random-access data. Your search of @uniqueArray is O(n) on each search, which is done once per line, making your algorithm O(n^2). A hash solution would be O(1) (more or less) on each search, which is done once per line, making it O(n) overall.
Also, use $. for line numbers - perl tracks it for you.
my %seen;
while (<$filehandle>)
{
    chomp;
    my ($val1, $val2) = split /,/;
    # track all values and their line numbers.
    push @{$seen{$val2}}, [$., $val1];
}
# now go through the full list, looking for anything that was seen
# more than once.
for my $val2 (grep { @{$seen{$_}} > 1 } keys %seen)
{
    print "DUP line: $val2 was seen on lines ", join ", ", map { "$_->[0] ($_->[1]) " } @{$seen{$val2}};
    print "\n";
}
This is all O(n). Much faster.
The hash answer you've accepted would be the standard approach here. But I wonder if using an array would be a little faster. (I've also switched to using $_ as I think it makes the code cleaner.)
use feature qw(say);
...
my @second_field_value;
while (<$filehandle>)
{
    chomp;
    my @fields = split /,/;
    if ($second_field_value[$fields[1]]) {
        say "DUP line: $. -- @fields";
    }
    ++$second_field_value[$fields[1]];
}
It would be a pretty sparse array, but it might still be faster than the hash version. (I'm afraid I don't have the time to benchmark it.)
Update: I ran some basic tests. The array version is faster. But not by enough that it's worth worrying about.

perl - searching a large /sorted/ array for index of a string

I have a large array of approx 100,000 items, and a small array of approx 1000 items. I need to search the large array for each of the strings in the small array, and I need the index of the string returned. (So I need to search the 100k array 1000 times)
The large array has been sorted, so I guess some kind of binary-chop-type search would be a lot more efficient than using a foreach loop (using 'last' to break the loop when found), which is what I started with (this first attempt results in some 30 million comparisons!).
Is there a built in search method that would produce a more efficient result, or am I going to have to manually code a binary search? I also want to avoid using external modules.
For the purposes of the question, just assume that I need to find the index of a single string in the large sorted array. (I only mention the 1000 items to give an idea of the scale)
This sounds like a classic hash use case:
my %index_for = map { $large_array[$_] => $_ } 0 .. $#large_array;
print "index in large array:", $index_for{ $small_array[1000] };
Using a binary search is probably optimal here. Binary search only needs O(log n) comparisons (here ~17 comparisons per lookup).
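If you do decide to hand-roll it, a binary search over a string-sorted array is only a few lines. A minimal sketch (the toy array below is just for illustration; in the question it would be the sorted 100k-item array):
use strict; use warnings;

# Return the index of $target in the string-sorted array @$sorted, or -1 if absent.
sub binary_search {
    my ($sorted, $target) = @_;
    my ($lo, $hi) = (0, $#$sorted);
    while ($lo <= $hi) {
        my $mid = int(($lo + $hi) / 2);
        my $cmp = $sorted->[$mid] cmp $target;
        if    ($cmp < 0) { $lo = $mid + 1 }
        elsif ($cmp > 0) { $hi = $mid - 1 }
        else             { return $mid }
    }
    return -1;
}

my @large_array = sort qw(apple banana cherry date fig grape);
print binary_search(\@large_array, 'cherry'), "\n";   # prints 2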
Alternatively, you can create a hash table that maps items to their indices:
my %positions;
$positions{ $large_array[$_] } = $_ for 0 .. $#large_array;
for my $item (@small_array) {
    say "$item has position $positions{$item}";
}
While now each lookup is possible in O(1) without any comparisons, you do have to create the hash table first. This may or may not be faster. Note that hashes can only use strings for keys. If your items are complex objects with their own concept of equality, you will have to derive a suitable key first.
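As a purely hypothetical illustration of that last point (the field names here are invented): if the items were hash references and two of their fields defined equality, you could derive a string key from those fields before building the lookup hash:
# Hypothetical items: equality is defined by the 'name' and 'version' fields.
my @large_array = (
    { name => 'foo', version => 1 },
    { name => 'bar', version => 2 },
);

my %index_for;
for my $i (0 .. $#large_array) {
    # Join the identifying fields with a separator that cannot occur in them.
    my $key = join "\0", @{ $large_array[$i] }{qw(name version)};
    $index_for{$key} = $i;
}

print $index_for{ join "\0", 'bar', 2 }, "\n";   # prints 1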

Perl - Associative Array with index

Ok, I'm new to Perl, but I think this question is for Perl gurus only :)
I need a well-explained example of how to store and keep control of data read from a file.
I want to store it using an associative array with an index and then use a for loop to go over the array and print it to the screen.
For example:
my %array;
$array{$1} = [0]
foreach $array (sort values $array)
print "$value";
Something like this.
First, Perl refers to associative arrays as "hashes". Here's a simple example of reading the lines of a file and storing them in a hash to print them in reverse order. We use the line number $. of the file as the hash key and simply assign the line itself ($_) as the hash value.
#!/usr/bin/env perl
use strict;
use warnings;
my %hash_of_lines;
while (<>) {
    chomp;
    $hash_of_lines{$.} = $_;
}
for my $lineno ( sort { $b <=> $a } keys %hash_of_lines ) {
    printf "%3d %s\n", $lineno, $hash_of_lines{$lineno};
}
Or even easier via the slurp method from IO::All:
@lines = io('file.txt')->slurp;
If you are reading a larger file, you will probably want to lock the file to prevent race conditions, and IO::All makes it really easy to lock files while you are working with them.
You most likely do not want to use a hash (the term "associative array" is no longer used in Perl) at all. What you are describing calls for an array. Hashes are used for storing data connected with unique keys, arrays for sequential data.
open my $fh, "<", $inputfile or die $!;
my @array = <$fh>;
print @array; # will preserve the order of the lines from the file
Of course, if you want the data sorted, you can do that with print sort @array.
Now, if that had been done with a hash, you'd do something like:
my %hash = map { $_ => 1 } <$fh>;
print sort keys %hash; # will not preserve order
And as you can see, the end result is only that you do not preserve the original order of the file, but instead have to sort it, or get a semi-random order. All the while, you're running the risk of overwriting keys, if you have identical lines. This is good for de-duping data, but not for true representation of file content.
You might think "But hey, what if I use another kind of key and store the line as the value?" Well, sure, you might take JRFerguson's advice and use a numerical index. But then you are using an array, just forsaking the natural benefits of using a proper array. You do not actually gain anything by doing this, only lose things.

What's the most efficient way to check for duplicates in an array of data using Perl?

I need to see if there are duplicates in an array of strings, what's the most time-efficient way of doing it?
One of the things I love about Perl is its ability to almost read like English. It just sort of makes sense.
use strict;
use warnings;
my @array = qw/yes no maybe true false false perhaps no/;
my %seen;
foreach my $string (@array) {
    next unless $seen{$string}++;
    print "'$string' is duplicated.\n";
}
Output
'false' is duplicated.
'no' is duplicated.
Turning the array into a hash is the fastest way [O(n)], though it's memory-inefficient. Using a for loop is a bit faster than grep, but I'm not sure why.
#!/usr/bin/perl
use strict;
use warnings;
my %count;
my %dups;
for (@array) {
    $dups{$_}++ if $count{$_}++;
}
A memory efficient way is to sort the array in place and iterate through it looking for equal and adjacent entries.
# not exactly sort in place, but Perl does a decent job optimizing it
@array = sort @array;
my $last;
my %dups;
for my $entry (@array) {
    $dups{$entry}++ if defined $last and $entry eq $last;
    $last = $entry;
}
This is O(n log n) because of the sort, but it only needs to store the duplicates rather than a second copy of the data in %count. Worst-case memory usage is still O(n) (when everything is duplicated), but if your array is large and there are not a lot of duplicates, you'll win.
Theory aside, benchmarking shows the latter starts to lose on large arrays (like over a million) with a high percentage of duplicates.
If you need the uniquified array anyway, it is fastest to use the heavily-optimized library List::MoreUtils, and then compare the result to the original:
use strict;
use warnings;
use List::MoreUtils 'uniq';
my @array = qw(1 1 2 3 fibonacci!);
my @array_uniq = uniq @array;
print( ((scalar(@array) == scalar(@array_uniq)) ? "no dupes" : "dupes") . " found!\n" );
Or if the list is large and you want to bail as soon as a duplicate entry is found, use a hash:
my %uniq_elements;
foreach my $element (@array)
{
    die "dupe found!" if $uniq_elements{$element}++;
}
Create a hash or a set or use a collections.Counter().
As you encounter each string/input check to see if there's an instance of that in the hash. If so, it's a duplicate (do whatever you want about those). Otherwise add a value (such as, oh, say, the numeral one) to the hash, using the string as the key.
Example (using Python collections.Counter):
#!python
import collections
counts = collections.Counter(mylist)
uniq = [i for i,c in counts.iteritems() if c==1]
dupes = [i for i, c in counts.iteritems() if c>1]
These Counters are built around dictionaries (Python's name for hashed mapping collections).
This is time efficient because hash keys are indexed. In most cases the lookup and insertion time for keys is done in near constant time. (In fact Perl "hashes" are so-called because they are implemented using an algorithmic trick called "hashing" --- a sort of checksum chosen for its extremely low probability of collision when fed arbitrary inputs).
If you initialize values to integers, starting with 1, then you can increment each value as you find its key already in the hash. This is just about the most efficient general purpose means of counting strings.
Not a direct answer, but this will return an array without duplicates:
#!/usr/bin/perl
use strict;
use warnings;
my @arr = ('a','a','a','b','b','c');
my %count;
my @arr_no_dups = grep { !$count{$_}++ } @arr;
print @arr_no_dups, "\n";
Please don't ask about the most time efficient way to do something unless you have some specific requirements, such as "I have to dedupe a list of 100,000 integers in under a second." Otherwise, you're worrying about how long something takes for no reason.
Similar to @Schwern's second solution, but this checks for duplicates a little earlier, from within the comparison function of sort:
use strict;
use warnings;
@_ = sort { print "dup = $a$/" if $a eq $b; $a cmp $b } @ARGV;
it won't be as fast as the hashing solutions, but it requires less memory and is pretty darn cute
