Auto increment numeric key values in a perl hash? - arrays

I have a perl script in which I am reading files from a given directory, and then placing those files into an array. I then want to be able to move those array elements into a perl hash, with the array elements being the hash value, and automatically assigning numeric keys to each hash value.
Here's the code:
# Open the current user's directory and get all the builds. If you can't open the dir
# then die.
opendir(D, "$userBuildLocation") || die "Can't opendir $userBuildLocation: $!\n";
# Put build files into an array.
my @builds = readdir(D);
closedir(D);
print join("\n", @builds, "\n");
This prints out:
test.dlp
test1.dlp
I want to take those values and insert them into a hash that looks just like this:
my %hash = (
1 => 'test.dlp',
2 => 'test1.dlp',
);
I want the numbered keys to be auto incrementing based on how many files I may find in a given directory.
I'm just not sure how to get the auto-incrementing keys to be set to unique numeric values for each item in the hash.

I am not sure I understand the need, but this should do:
my $i = 0;
my %hash = map { ++$i => $_ } @builds;
Another way to do it:
my $i = 0;
my %hash;
for (@builds) {
$hash{++$i} = $_;
}

The most straightforward and boring way:
my %hash;
for (my $i=0; $i<@builds; ++$i) {
$hash{$i+1} = $builds[$i];
}
or if you prefer:
foreach my $i (0 .. $#builds) {
$hash{$i+1} = $builds[$i];
}
I like this approach:
@hash{1..@builds} = @builds;

Another:
my %hash = map { $_+1, $builds[$_] } 0..$#builds;
or:
my %hash = map { $_, $builds[$_-1] } 1..@builds;
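Putting the pieces together, a minimal sketch of the hash-slice approach might look like the following. The file list is hard-coded as a stand-in for the readdir result (an assumption for illustration), and the '.' and '..' entries that readdir also returns are filtered out before numbering.

```perl
use strict;
use warnings;

# Hypothetical file list standing in for what readdir() returned.
my @builds = ('.', '..', 'test1.dlp', 'test.dlp');

# Drop the '.' and '..' directory entries and sort the names,
# because readdir() does not guarantee any particular order.
my @files = grep { !/^\.\.?\z/ } @builds;
@files = sort @files;

# Hash slice: keys 1..N are paired with the N file names.
my %hash;
@hash{ 1 .. @files } = @files;

# Print in numeric key order.
print "$_ => $hash{$_}\n" for sort { $a <=> $b } keys %hash;
# prints:
# 1 => test.dlp
# 2 => test1.dlp
```

The numeric sort on output matters because a hash has no inherent order; the keys only look sequential when you ask for them sorted.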

Related

Perl: Load file into hash

I'm struggling to understand the logic behind hashes in Perl. The task is to load a file into a hash and assign values to keys that are created from this file.
The file contains the alphabet with each letter on its own line:
a
b
c
d
e
and so on.
When using an array instead of a hash, the logic is simple: load the file into an array and then print each element with its corresponding number using a counter ($counter++).
But now my question is: how can I read the file into a hash, assign automatically generated values, and sort it so that the output is printed like this:
a:1
b:2
c:3
I've tried to first create an array and then link it to a hash using
%hash = @array
but it makes my hash non-sortable.
There are a number of ways to approach this. The most direct would be to load the data into the hash as you read through the file.
my %hash;
while(<>)
{
chomp;
$hash{$_} = $.; #Use the line number as your autogenerated counter.
}
You can also perform similar logic if you already have a populated array.
for (0..$#array)
{
$hash{$array[$_]} = $_ + 1; # +1 so that numbering starts at 1, like $. above
}
Although, if you are in that situation, map is the more Perlish way of doing things.
%hash = map { $array[$_] => $_ + 1 } 0..$#array;
Think of a hash as a set of pairs (key, value), where the keys must be unique. You want to read the file one line at a time, and add a pair to the hash:
$record = <$file_handle>;
$hash{$record} = $counter++;
Of course, you could read the entire file into an array at once and then assign to your hash. But the solution is not:
@records = <$file_handle>;
%hash = #records;
... as you found out. If you think in terms of (key, value) pairs, you will see that the above is equivalent to:
$hash{a} = 'b';
$hash{c} = 'd';
$hash{e} = 'f';
...
and so on. You still are going to need a loop, either an explicit one like this:
foreach my $rec (@records)
{
$hash{$rec} = $counter++;
}
or an implicit one like one of these:
%hash = map {$_ => $counter++} @records;
# or:
$hash{$_} = $counter++ for @records;
This code should generate the proper output, where my-text-file is the path to your data file:
my %hash;
my $counter = 0;
open(FILE, "my-text-file") or die "Can't open my-text-file: $!";
while (<FILE>) {
chomp;
$counter++;
$hash{$_} = $counter;
}
# Now to sort
foreach my $key (sort(keys(%hash))) {
print $key . ":" . $hash{$key} . "\n";
}
I assume you want to sort the hash alphabetically. keys(%hash) and values(%hash) return the keys and values of %hash as lists, respectively. Run the program on this file:
f
a
b
d
e
c
And we get:
a:2
b:3
c:6
d:4
e:5
f:1
I hope this helps you.
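If the goal is instead to print the entries in the order they appeared in the file (a:1, b:2, c:3 for the alphabet file in the question), one option is to sort the keys by their stored counter. The sketch below uses a hard-coded list in place of the file contents, which is an assumption made to keep it self-contained.

```perl
use strict;
use warnings;

# Hypothetical input, standing in for the lines read from the data file.
my @letters = qw(f a b d e c);

# Assign each line its 1-based position as the hash value.
my %hash;
my $counter = 0;
$hash{$_} = ++$counter for @letters;

# Sort the keys by their stored number instead of alphabetically.
my @ordered = sort { $hash{$a} <=> $hash{$b} } keys %hash;

print "$_:$hash{$_}\n" for @ordered;
# prints f:1, a:2, b:3, d:4, e:5, c:6 (one per line)
```

The hash itself is never "sorted"; only the list of keys is, each time you need ordered output.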

Comparing two strings line by line in Perl

I am looking for code in Perl similar to
my @lines1 = split /\n/, $str1;
my @lines2 = split /\n/, $str2;
for (int $i=0; $i<lines1.length; $i++)
{
if (lines1[$i] ~= lines2[$i])
print "difference in line $i \n";
}
To compare two strings line by line and show the lines at which there is any difference.
I know what I have written is mixture of C/Perl/Pseudo-code. How do I write it in the way that it works on Perl?
What you have written is roughly right, but Perl has no lines1.length or int $i notation, and ~= is not an operator. You probably mean =~, but that is the wrong tool here. Also, if must be followed by a block { }.
What you want is simply $i < @lines1 to get the array size, my $i to declare a lexical variable, and eq (or ne) for string comparison, along with if ( ... ) { ... }.
Technically you can use the binding operator to perform a string comparison, for example:
"foobar" =~ "foo"
But it is not a good idea when comparing literal strings, because you can get partial matches (the example above is true even though the two strings differ), and you need to escape metacharacters. Therefore it is easier to just use eq.
Using C-style for loops is valid, but the more Perl-ish way is to use this notation:
for my $i (0 .. $#lines1)
Which will iterate over the range 0 to the max index of the array.
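Combining these corrections, a runnable version of the loop might look like the sketch below. The two sample strings are made up for illustration, and treating a missing line as an empty string is a design assumption, so a blank line and an absent line compare as equal here.

```perl
use strict;
use warnings;

# Hypothetical sample strings for illustration.
my $str1 = "alpha\nbeta\ngamma\n";
my $str2 = "alpha\nBETA\ngamma\ndelta\n";

my @lines1 = split /\n/, $str1;
my @lines2 = split /\n/, $str2;

# Walk up to the last index of the longer array so extra lines are reported too.
my $last = $#lines1 > $#lines2 ? $#lines1 : $#lines2;

my @different;
for my $i (0 .. $last) {
    # A line missing from the shorter string is treated as an empty string.
    my $line1 = defined $lines1[$i] ? $lines1[$i] : '';
    my $line2 = defined $lines2[$i] ? $lines2[$i] : '';
    push @different, $i + 1 if $line1 ne $line2;    # 1-based line numbers
}

print "difference in line $_\n" for @different;
# prints:
# difference in line 2
# difference in line 4
```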
Perl allows you to open filehandles on strings by using a reference to the scalar variable that holds the string:
open my $string1_fh, '<', \$string1 or die '...';
open my $string2_fh, '<', \$string2 or die '...';
while( my $line1 = <$string1_fh> ) {
my $line2 = <$string2_fh>;
....
}
But, depending on what you mean by difference (does that include insertion or deletion of lines?), you might want something different.
There are several modules on CPAN that you can inspect for ideas, such as Test::LongString or Algorithm::Diff.
my @lines1 = split(/^/, $str1);
my @lines2 = split(/^/, $str2);
# splits at start of line, keeping the newline on each line
# use /\n/ if you want to drop the newlines and any trailing empty lines
for (my $i=0; $i < @lines1; $i++) {
print "difference in line $i \n" if ($lines1[$i] ne $lines2[$i]);
}
Comparing arrays is much easier if you build a hash from them...
# Searching for the difference
# (assumes no element is repeated within either array)
@isect = ();
@diff = ();
%count = ();
foreach $item ( @array1, @array2 ) { $count{$item}++; }
foreach $item ( keys %count ) {
if ( $count{$item} == 2 ) {
push @isect, $item;
}
else {
push @diff, $item;
}
}
#Output
print "Different= @diff\n\n";
print "\nA Array = @array1\n";
print "\nB Array = @array2\n";
print "\nIntersect Array = @isect\n";
Even after splitting, you can compare them as arrays.
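The counting trick only works when neither array contains repeated elements, since a value that appears twice in one array also reaches a count of 2. A variant that tolerates duplicates keeps one lookup hash per array; the sample arrays below are made up for illustration.

```perl
use strict;
use warnings;

# Hypothetical sample data; note the repeated 'c' in the first array.
my @array1 = qw(a b c c);
my @array2 = qw(b c d);

# One "seen" hash per array, so duplicates inside an array do not matter.
my %in1 = map { $_ => 1 } @array1;
my %in2 = map { $_ => 1 } @array2;

# Elements present in both arrays.
my @common = grep { $in2{$_} } keys %in1;
my @isect  = sort(@common);

# Elements present in exactly one of the arrays.
my @only1 = grep { !$in2{$_} } keys %in1;
my @only2 = grep { !$in1{$_} } keys %in2;
my @diff  = sort(@only1, @only2);

print "Intersect = @isect\n";    # Intersect = b c
print "Different = @diff\n";     # Different = a d
```

Note that this compares the arrays as sets; it says nothing about which line numbers differ.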

How to search through array elements for match in hash keys

I've an array that contains unique IDs (numeric) for DNA sequences. I've put my DNA sequences in a hash so that each key contains a descriptive header, and its value is the DNA sequence. Each header in this list contains gene information and is suffixed with its unique ID number:
Unique ID: 14272
Header(hash key): PREDICTEDXenopusSiluranatropicalishypotheticalproteinLOCLOCmRNA14272
Sequence (hash value): ATGGGTC...
I want to cycle through each Unique ID and see if it matches the number at the end of each header(hash key) and, if so, print the hash key + value into a file. So far I've got this:
my %hash;
@hash{@hash_index} = @hash_seq;
foreach $hash_index (sort keys %hash) {
for ($i=0; $i <= $#scaffoldnames; $i++) {
if ($hash_index =~ /$scaffoldnames[$i]/) {
print GENE_ID "$hash_index\n$hash{$hash_index}\n";
}
}
}
close(GENE_ID);
Whereby the unique IDs are contained within @scaffoldnames.
This doesn't work! I'm unsure as to how best to loop through both the hash and the array to find a match.
I'll expand below:
Upstream code:
foreach(@scaffoldnames) {
s/[^0-9]*//g;
} #Remove all non-numerics
my @genes = read_file('splice.txt'); #Splice.txt is a fasta file
my $hash_index = '';
my $hash_seq = '';
foreach(@genes){
if (/^>/){
my $head = $_;
$hash_index .= $head; #Collect all heads for hash
}
else {
my $sequence = $_;
$hash_seq .= $sequence; #Collect all sequences for hash
}
}
my @hash_index = split(/\n/,$hash_index); #element[0]=head1, element[1]=head2
my @hash_seq = split(/\n/, $hash_seq); #element[0]=seq1, element[1]=seq2
my %hash; # Make hash from both arrays - heads as keys, seqs as values
@hash{@hash_index} = @hash_seq;
foreach $hash_index (sort keys %hash) {
for ($i=0; $i <= $#scaffoldnames; $i++) {
if ($hash_index =~ /$scaffoldnames[$i]$/) {
print GENE_ID "$hash_index\n$hash{$hash_index}\n";
}
}
}
close(GENE_ID);
I'm trying to isolate all differently expressed genes (by unique ID) as outputted by cuffdiff (RNA-Seq) and relate them to the scaffolds (in this case expressed sequences) from which they came.
I'm hoping therefore that I can take isolate each unique ID and search through the original fasta file to pull out the header it matches and the sequence it's associated with.
You seem to have missed the point of hashes: they are used to index your data by keys so that you can access the relevant information in one step, like you can with arrays. Looping over every hash element kinda spoils the point. For instance, you wouldn't write
my $value;
for my $i (0 .. $#data) {
$value = $data[$i] if $i == 5;
}
you would simply do this
my $value = $data[5];
It is hard to help properly without some more information about where your information has come from and exactly what it is you want, but this code should help.
I have used one-element arrays that I think look like what you are using, and built a hash that indexes both the header and the sequence as a two-element array, using the ID (the trailing digits of the header) as a key. Then you can just look up the information for, say, ID 14272 using $hash{14272}. The header is $hash{14272}[0] and the sequence is $hash{14272}[1].
If you provide more of an indication about your circumstances then we can help you further.
use strict;
use warnings;
my @hash_index = ('PREDICTEDXenopusSiluranatropicalishypotheticalproteinLOCLOCmRNA14272');
my @hash_seq = ('ATGGGTC...');
my @scaffoldnames = (14272);
my %hash = map {
my ($key) = $hash_index[$_] =~ /(\d+)\z/;
$key => [ $hash_index[$_], $hash_seq[$_] ];
} 0 .. $#hash_index;
open my $gene_fh, '>', 'gene_id.txt' or die $!;
for my $name (@scaffoldnames) {
next unless my $info = $hash{$name};
printf $gene_fh "%s\n%s\n", @$info;
}
close $gene_fh;
Update
From the new code you have posted it looks like you can replace that section with this code.
It works by taking the trailing digits from every sequence header that it finds, and using that as a key to choose a hash element to append the data to. The hash values are the header and the sequence, all in a single string. If you have a reason for keeping them separate then please let me know.
foreach (@scaffoldnames) {
s/\D+//g;
} # Remove all non-numerics
open my $splice_fh, '<', 'splice.txt' or die $!; # splice.txt is a FASTA file
my %sequences;
my $id;
while (<$splice_fh>) {
($id) = /(\d+)$/ if /^>/;
$sequences{$id} .= $_ if $id;
}
for my $id (@scaffoldnames) {
if (my $sequence = $sequences{$id}) {
print GENE_ID $sequence;
}
}
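For reference, here is a self-contained sketch of the same idea that can be run without any input files. The FASTA text, the header names and the ID list are all invented for illustration, and the file is replaced by an in-memory filehandle; it also tests defined $id rather than $id so that an ID of 0 would not be skipped.

```perl
use strict;
use warnings;

# Hypothetical FASTA data and IDs, made up for illustration.
my $fasta = <<'END_FASTA';
>geneA14272
ATGGGTC
CCGTA
>geneB99
TTTT
END_FASTA
my @scaffoldnames = (14272, 555);

# Read from the string as if it were a file.
open my $splice_fh, '<', \$fasta or die "Can't open in-memory FASTA: $!";

my (%sequences, $id);
while (my $line = <$splice_fh>) {
    if ($line =~ /^>/) {
        # Trailing digits of the header become the key; undef if there are none.
        ($id) = $line =~ /(\d+)\s*$/;
    }
    # Append header and sequence lines to the record for the current ID.
    $sequences{$id} .= $line if defined $id;
}
close $splice_fh;

# One direct hash lookup per wanted ID; no nested loop over all headers.
my $output = '';
for my $wanted (@scaffoldnames) {
    $output .= $sequences{$wanted} if exists $sequences{$wanted};
}
print $output;
```

With this data only the geneA14272 record is printed, because 555 has no matching header.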

Trouble converting array to hash

I have an array where elements of the array have values that are separated by tabs.
For example:
client_name \t owner \t date \t port_number.
I need to convert that into a hash so it can be dumped into a MySQL database.
Something like:
my %foo = ();
$foo{date} = "111208";
$foo{port} = "2222";
$foo{owner} = "ownername";
$foo{name} = "clientname";
The problem I have is that there are duplicate client names but they exist on different port numbers. If I convert it directly to a hash using client_name as a key it will delete duplicate client names. The MySQL table is indexed based on {name} and {port}.
Is there any way I can convert this into a hash without losing duplicate client names?
You would go through your file, build up the hash like you've done, then push a reference to that hash onto an array. Something like:
foreach my $line ( @lines ) {
# Make your %foo hash (declare it with "my" inside the loop so each record gets its own hash).
push @clients, \%foo;
}
Then afterwards, when you're inserting into your DB, you just iterate through the elements in @clients:
foreach my $client ( @clients ) {
my $date = $client->{'date'};
...
}
Edit: If you want to turn this into a hash of hashes, then as you loop through the list of lines, you'd do something like:
foreach my $line ( @lines ) {
# Make your %foo hash.
$clients{$foo{'port'}} = \%foo;
}
Then you'll have a hash of hashes using the port number as the key.
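Since the table is indexed on both name and port, keying on the port alone could still lose records when two clients share a port. One possible variation is a two-level hash keyed by name and then port; the sample lines below are invented, and the field order (client_name, owner, date, port_number) is taken from the question.

```perl
use strict;
use warnings;

# Hypothetical tab-separated records: client_name, owner, date, port_number.
my @lines = (
    "acme\talice\t111208\t2222",
    "acme\talice\t111209\t3333",
    "zeta\tbob\t111208\t2222",
);

my %clients;
for my $line (@lines) {
    my %foo;    # a fresh hash for every record
    @foo{qw(name owner date port)} = split /\t/, $line;

    # Two-level key: the same name can appear once per port.
    $clients{ $foo{name} }{ $foo{port} } = \%foo;
}

for my $name (sort keys %clients) {
    for my $port (sort { $a <=> $b } keys %{ $clients{$name} }) {
        print "$name:$port dated $clients{$name}{$port}{date}\n";
    }
}
```

Both "acme" records survive because they differ in port, and "zeta" keeps its own entry even though it shares port 2222 with "acme".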
Why not just store it in a list (array)?
my @records = ();
while (my $line = <INFILE>) {
chomp $line;
my @fields = split /\t/, $line;
push @records => { date => $fields[2],
name => $fields[0],
port => $fields[3],
owner => $fields[1] };
}
for my $record (@records) {
# Pass the values in the order of the statement's placeholders (order assumed here).
$insert_query->execute (@$record{qw(name owner date port)});
}
my @record_list;
while ( <$generic_input> ) {
chomp;
my $foo = {};
# field order follows the input line: name, owner, date, port
@$foo{ qw<name owner date port> } = split /\t/;
push @record_list, $foo;
}
As a "pipeline" you could do this:
use List::MoreUtils qw<pairwise>;
my @fields = qw<name owner date port>;
my @records
= map { chomp; +{ pairwise { $a => $b } @fields, @{[ split /\t/ ]} } }
<$input>
;

How would I know if there is any similar element in my array?

I have this code which lists all files in my directory:
$dir = '/var/www/corpIDD/rawFile/';
opendir DIR, $dir or die "cannot open dir $dir: $!";
my @file= readdir DIR;
closedir DIR;
which returns an array containing something like this:
$array (0 => 'ipax3_2011_01_27.txt', 1 => 'ipax3_2011_02_01.txt', 2 => 'ipax3_2011_02_03.txt')
My problem here is, how will I store elements 1 => 'ipax3_2011_02_01.txt' and 2 => 'ipax3_2011_02_03.txt' to separate variable as they belong to the same month and year(2011_02)?
Thanks!
In Perl, when you need to use a string as the key in a data structure, you are looking for the HASH builtin type, designated by the % sigil. A nice feature of Perl's hashes is that you do not have to pre-declare a complex data structure. You can use it, and Perl will infer the structure from that usage.
my @file = qw(ipax3_2011_01_27.txt ipax3_2011_02_01.txt ipax3_2011_02_03.txt);
my %ipax3;
for (@file) {
if (/^ipax3_(\d{4}_\d{2})_(\d{2})\.txt$/) {
$ipax3{$1}{$2} = $_
}
else {
warn "bad file: $_\n"
}
}
for my $year_month (sort keys %ipax3) {
my $days = keys %{ $ipax3{$year_month} };
if ($days > 1) {
print "$year_month has $days files\n";
}
else {
print "$year_month has 1 file\n";
}
}
which prints:
2011_01 has 1 file
2011_02 has 2 files
To get at the individual files:
my $year_month = '2011_02';
my $day = '01';
my $file = $ipax3{$year_month}{$day};
Above I used the return value of the keys function as both the list to iterate over, and as the number of days. This is possible because keys will return all of the keys when in list context, and will return the number of keys in scalar context. Context is provided by the surrounding code:
my $number = keys %ipax3; # number of year_month entries
my @keys = keys %ipax3; # contains ('2011_01', '2011_02')
my @days = keys %{ $ipax3{$year_month} };
In the last example, each value in %ipax3 is a reference to a hash. Since keys takes a literal hash, you need to wrap $ipax3{$year_month} in %{ ... }. Perl 5.14 added an experimental feature that let keys and a few other data structure access functions take a reference directly, but it was removed again in Perl 5.24, so keep the %{ ... }.
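If all you need is the list of files for each month, rather than a lookup by day, a hash of arrays is a slightly simpler structure. This is a sketch under the same file naming assumption as above; %by_month is a name introduced here for illustration.

```perl
use strict;
use warnings;

my @file = qw(ipax3_2011_01_27.txt ipax3_2011_02_01.txt ipax3_2011_02_03.txt);

# Group the file names by their year_month part.
my %by_month;
for my $name (@file) {
    next unless $name =~ /^ipax3_(\d{4}_\d{2})_\d{2}\.txt$/;
    push @{ $by_month{$1} }, $name;    # the array is created on first use
}

for my $year_month (sort keys %by_month) {
    print "$year_month: @{ $by_month{$year_month} }\n";
}
# prints:
# 2011_01: ipax3_2011_01_27.txt
# 2011_02: ipax3_2011_02_01.txt ipax3_2011_02_03.txt
```

The files for February 2011 are then available as the list @{ $by_month{'2011_02'} }.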
People are responding really fast here :) anyway, I'll post mine, just for your reference. Basically, I'm also using a hash.
use warnings qw(all);
use strict;
my ($dir, $hdir) = 'C:\Work';
opendir($hdir, $dir) || die "Can't open dir \"$dir\" because $!\n";
my (@files) = readdir($hdir);
closedir($hdir);
my %yearmonths;
foreach(@files)
{
my ($year, $month);
next unless(($year, $month) = ($_ =~ /ipax3_(\d{4})_(\d{2})/));
$year += 0;
--$month; #assuming that months are in range 1-12
my $key = ($year * 12) + $month;
++$yearmonths{ $key };
}
foreach(keys %yearmonths)
{
next if($yearmonths{ $_ } < 2);
my $year = int($_ / 12);
my $month = 1 + ($_ % 12);
printf "There were %d files from year %d, month %d\n", $yearmonths{$_}, $year, $month;
}

Resources