Perl: Load file into hash - arrays

I'm struggling to understand logic behind hashes in Perl. Task is to load file in to hash and assign values to keys which are created using this file.
File contains alphabet with each letter on its own line:
a
b
c
d
e
and etc,.
When using array instead of hash, logic is simple: load file into array and then print each element with corresponding number using some counter ($counter++).
But now my question is, how can I read file into my hash, assign automatically generated values and sort it in that way where output is printed like this:
a:1
b:2
c:3
I've tried to first create array and then link it to hash using
%hash = #array
but it makes my hash non-sortable.

There are a number of ways to approach this. The most direct would be to load the data into the hash as you read through the file.
my %hash;
while(<>)
{
chomp;
$hash{$_} = $.; #Use the line number as your autogenerated counter.
}
You can also perform simliar logic if you already have a populated array.
for (0..$#array)
{
$hash{$array[$_]} = $_;
}
Although, if you are in that situation, map is the perlier way of doing things.
%hash = map { $array[$_] => $_ } #array;

Think of a hash as a set of pairs (key, value), where the keys must be unique. You want to read the file one line at a time, and add a pair to the hash:
$record = <$file_handle>;
$hash{$record} = $counter++;
Of course, you could read the entire file into an array at once and then assign to your hash. But the solution is not:
#records = <$file_handle>;
%hash = #records;
... as you found out. If you think in terms of (key, value) pairs, you will see that the above is equivalent to:
$hash{a} = 'b';
$hash{c} = 'd';
$hash{e} = 'f';
...
and so on. You still are going to need a loop, either an explicit one like this:
foreach my $rec (#records)
{
$hash{$rec} = $counter++;
}
or an implicit one like one of these:
%hash = map {$_ => $counter++} #records;
# or:
$hash{$_} = $counter++ for #records;

This code should generate the proper output, where my-text-file is the path to your data file:
my %hash;
my $counter = 0;
open(FILE, "my-text-file");
while (<FILE>) {
chomp;
$counter++;
$hash{$_} = $counter;
}
# Now to sort
foreach $key (sort(keys(%hash))) {
print $key . ":" . $hash{$key} . "\n";
}
I assume you want to sort the hash aplhabetically. keys(%hash) and values(%hash) return the keys and values of %hash as an array, respectively. Run the program on this file:
f
a
b
d
e
c
And we get:
a:2
b:3
c:6
d:4
e:5
f:1
I hope this helps you.

Related

Generate array and subarray dynamically (perl)

I have a several files that have contain product install information.
I am able to grep for the elements that I want (for example, "version") from the files. And I end up with something like this:
instancefile1:ABC_version=1.2.3
instancefile1:XYZ_version=2.5
instancefile2:ABC_version=3.4.5
instancefile3:XYZ_version=1.1
Where the components are named ABC or XYZ.
What I'd like to do is take this grep output and parse it through perl to build arrays on the file.
The first array would be composed of the instance number (pulled from the filenames) - above, I'd have an array that would have 1,2,3 as elements.
And inside each of those arrays would be the components that that particular instance has.
Full expected arrays and components from above:
array[0] = instancefile1 # can keep this named like the filename,
or assign name. Does not matter
array[0][0] = ABC_version=1.2.3
array[0][1] = XYZ_version=2.5
array[1] = instancefile2
array[1][0] = ABC_version=3.4.5
array[2] = instancefile3
array[2][0] = XYZ_version=1.1
(I know my notation for referencing subarrays is not correct - I'm rusty on my perl.)
How might I do that?
(I've been doing it with just bash arrays and grep and then reiterating through the initial grep output with my first array and doing another grep to fill another array - but this seems like it is going through the data more than one time, instead of building it on the fly.)
The idea is for it to build each array as it sees it. It sees "fileinstance1", it stores the values to the right in that array, as it sees it. Then if it sees "fileinstance2", it creates that array and populates with those values, all in one pass. I believe perl is the best tool for this?
Unless you can guaranteed the records with the same key will be next to each other, it's easier to start with a HoA.
my %data;
while (<>) {
chomp;
my ($key, $val) = split /:/;
push #{ $data{$key} }, $val;
}
Then convert to AoA:
my #data = values(%data);
Order-preserving:
my %data;
my #data;
while (<>) {
chomp;
my ($key, $val) = split /:/;
push #data, $data{$key} = []
if !$data{$key};
push #{ $data{$key} }, $val;
}

perl: using push() on an array inside a hash

Is it possible to use Perl's push() function on an array inside a hash?
Below is what I believe to be the offending part of a larger program that I am working on.
my %domains = ();
open (TABLE, "placeholder.foo") || die "cannot read domtblout file\n";
while ($line = <TABLE>)
{
if (!($line =~ /^#/))
{
#split_line = split(/\t/, $line); # splits on tabs - some entries contain whitespace
if ($split_line[13] >= $domain_cutoff)
{
push($domains{$split_line[0]}[0], $split_line[19]); # adds "env from" coordinate to array
push($domains{$split_line[0]}[1], $split_line[20]); # adds "env to" coordinate to array
# %domains is a hash, but $domains{identifier}[0] and $domains{$identifier}[1] are both arrays
# this way, all domains from one sequence are stored with the same hash key, but can easily be processed iteratively
}
}
}
Later I try to interact with these arrays using
for ($i = 0, $i <= $domains{$identifier}[0], $i++)
{
$from = $domains{$identifier}[0][$i];
$to = $domains{$identifier}[1][$i];
$length = ($to - $from);
$tmp_seq =~ /.{$from}(.{$length})/;
print("$header"."$1");
}
but it appears as if the arrays I created are empty.
If $domains{$identifier}[0] is an array, then why can I not use the push statement to add an element to it?
$domains{identifier}[0] is not an array.
$domains{identifier}[0] is an array element, a scalar.
$domains{identifier}[0] is a reference to an array.
If it's
#array
when you have an array, it's
#{ ... }
when you have a reference to an array, so
push(#{ $domains{ $split_line[0] }[0] }, $split_line[19]);
References:
Mini-Tutorial: Dereferencing Syntax
References quick reference
perlref
perlreftut
perldsc
perllol

Auto increment numeric key values in a perl hash?

I have a perl script in which I am reading files from a given directory, and then placing those files into an array. I then want to be able to move those array elements into a perl hash, with the array elements being the hash value, and automatically assigning numeric keys to each hash value.
Here's the code:
# Open the current users directory and get all the builds. If you can open the dir
# then die.
opendir(D, "$userBuildLocation") || die "Can't opedir $userBuildLocation: $!\n";
# Put build files into an array.
my #builds = readdir(D);
closedir(D);
print join("\n", #builds, "\n");
This print out:
test.dlp
test1.dlp
I want to take those value and insert them into a hash that looks just like this:
my %hash (
1 => test.dlp
2 => test1.dlp
);
I want the numbered keys to be auto incrementing based on how many files I may find in a given directory.
I'm just not sure how to get the auto-incrementing keys to be set to unique numeric values for each item in the hash.
I am not sure to understand the need, but this should do
my $i = 0;
my %hash = map { ++$i => $_ } #builds;
another way to do it
my $i = 0;
for( #builds ) {
$hash{++$i} = $_;
}
The most straightforward and boring way:
my %hash;
for (my $i=0; $i<#builds; ++$i) {
$hash{$i+1} = $builds[$i];
}
or if you prefer:
foreach my $i (0 .. $#builds) {
$hash{$i+1} = $builds[$i];
}
I like this approach:
#hash{1..#builds} = #builds;
Another:
my %hash = map { $_+1, $builds[$_] } 0..$#builds;
or:
my %hash = map { $_, $builds[$_-1] } 1..#builds;

How to search through array elements for match in hash keys

I've an array that contains unique IDs (numeric) for DNA sequences. I've put my DNA sequences in a hash so that each key contains a descriptive header, and its value is the DNA sequence. Each header in this list contains gene information and is suffixed with its unique ID number:
Unique ID: 14272
Header(hash key): PREDICTEDXenopusSiluranatropicalishypotheticalproteinLOCLOCmRNA14272
Sequence (hash value): ATGGGTC...
I want to cycle through each Unique ID and see if it matches the number at the end of each header(hash key) and, if so, print the hash key + value into a file. So far I've got this:
my %hash;
#hash{#hash_index} = #hash_seq;
foreach $hash_index (sort keys %hash) {
for ($i=0; $i <= $#scaffoldnames; $i++) {
if ($hash_index =~ /$scaffoldnames[$i]/) {
print GENE_ID "$hash_index\n$hash{$hash_index}\n";
}
}
}
close(GENE_ID);
Whereby the unique IDs are contained within #scaffoldnames.
This doesn't work! I'm unsure as to how best to loop through both the hash and the array to find a match.
I'll expand below:
Upstream code:
foreach(#scaffoldnames) {
s/[^0-9]*//g;
} #Remove all non-numerics
my #genes = read_file('splice.txt'); #Splice.txt is a fasta file
my $hash_index = '';
my $hash_seq = '';
foreach(#genes){
if (/^>/){
my $head = $_;
$hash_index .= $head; #Collect all heads for hash
}
else {
my $sequence = $_;
$hash_seq .= $sequence; #Collect all sequences for hash
}
}
my #hash_index = split(/\n/,$hash_index); #element[0]=head1, element[1]=head2
my #hash_seq = split(/\n/, $hash_seq); #element[0]=seq1, element[1]=seq2
my %hash; # Make hash from both arrays - heads as keys, seqs as values
#hash{#hash_index} = #hash_seq;
foreach $hash_index (sort keys %hash) {
for ($i=0; $i <= $#scaffoldnames; $i++) {
if ($hash_index =~ /$scaffoldnames[$i]$/) {
print GENE_ID "$hash_index\n$hash{$hash_index}\n";
}
}
}
close(GENE_ID);
I'm trying to isolate all differently expressed genes (by unique ID) as outputted by cuffdiff (RNA-Seq) and relate them to the scaffolds (in this case expressed sequences) from which they came.
I'm hoping therefore that I can take isolate each unique ID and search through the original fasta file to pull out the header it matches and the sequence it's associated with.
You seem to have missed the point of hashes: they are used to index your data by keys so that you can access the relevant information in one step, like you can with arrays. Looping over every hash element kinda spoils the point. For instance, you wouldn't write
my $value;
for my $i (0 .. $#data) {
$value = $data[i] if $i == 5;
}
you would simply do this
my $value = $data[5];
It is hard to help properly without some more information about where your information has come from and exactly what it is you want, but this code should help.
I have used one-element arrays that I think look like what you are using, and built a hash that indexes both the header and the sequence as a two-element array, using the ID (the trailing digits of the header) as a key. The you can just look up the information for, say, ID 14272 using $hash{14272}. The header is $hash{14272}[0] and the sequence is $hash{14272}[1]
If you provide more of an indication about your circumstances then we can help you further.
use strict;
use warnings;
my #hash_index = ('PREDICTEDXenopusSiluranatropicalishypotheticalproteinLOCLOCmRNA14272');
my #hash_seq = ('ATGGGTC...');
my #scaffoldnames = (14272);
my %hash = map {
my ($key) = $hash_index[$_] =~ /(\d+)\z/;
$key => [ $hash_index[$_], $hash_seq[$_] ];
} 0 .. $#hash_index;
open my $gene_fh, '>', 'gene_id.txt' or die $!;
for my $name (#scaffoldnames) {
next unless my $info = $hash{$name};
printf $gene_fh "%s\n%s\n", #$info;
}
close $gene_fh;
Update
From the new code you have posted it looks like you can replace that section with this code.
It works by taking the trailing digits from every sequence header that it finds, and using that as a key to choose a hash element to append the data to. The hash values are the header and the sequence, all in a single string. If you have a reason for keeping them separate then please let me know.
foreach (#scaffoldnames) {
s/\D+//g;
} # Remove all non-numerics
open my $splice_fh, '<', 'splice.txt' or die $!; # splice.txt is a FASTA file
my %sequences;
my $id;
while (<$splice_fh>) {
($id) = /(\d+)$/ if /^>/;
$sequences{$id} .= $_ if $id;
}
for my $id (#scaffoldnames) {
if (my $sequence = $sequences{$id}) {
print GENE_ID $sequence;
}
}

Swap key and array value pair

I have a text file layed out like this:
1 a, b, c
2 c, b, c
2.5 a, c
I would like to reverse the keys (the number) and values (CSV) (they are separated by a tab character) to produce this:
a 1, 2.5
b 1, 2
c 1, 2, 2.5
(Notice how 2 isn't duplicated for c.)
I do not need this exact output. The numbers in the input are ordered, while the values are not. The output's keys must be ordered, as well as the values.
How can I do this? I have access to standard shell utilities (awk, sed, grep...) and GCC. I can probably grab a compiler/interpreter for other languages if needed.
If you have python (if you're on linux you probably already have) i'd use a short python script to do this. Note that we use sets to filter out "double" items.
Edited to be closer to requester's requirements:
import csv
from decimal import *
getcontext().prec = 7
csv_reader = csv.reader(open('test.csv'), delimiter='\t')
maindict = {}
for row in csv_reader:
value = row[0]
for key in row[1:]:
try:
maindict[key].add(Decimal(value))
except KeyError:
maindict[key] = set()
maindict[key].add(Decimal(value))
csv_writer = csv.writer(open('out.csv', 'w'), delimiter='\t')
sorted_keys = [x[1] for x in sorted([(x.lower(), x) for x in maindict.keys()])]
for key in sorted_keys:
csv_writer.writerow([key] + sorted(maindict[key]))
I would try perl if that's available to you. Loop through the input a row at a time. Split the line on the tab then the right hand part on the commas. Shove the values into an associative array with letters as the keys and the value being another associative array. The second associative array will be playing the part of a set so as to eliminate duplicates.
Once you read the input file, sort based on the keys of the associative array, loop through and spit out the results.
here's a small utility in php:
// load and parse the input file
$data = file("path/to/file/");
foreach ($data as $line) {
list($num, $values) = explode("\t", $line);
$newData["$num"] = explode(", ", trim($values));
}
unset($data);
// reverse the index/value association
foreach ($newData as $index => $values) {
asort($values);
foreach($values as $value) {
if (!isset($data[$value]))
$data[$value] = array();
if (!in_array($index, $data[$value]))
array_push($data[$value], $index);
}
}
// printout the result
foreach ($data as $index => $values) {
echo "$index\t" . implode(", ", $values) . "\n";
}
not really optimized or good looking, but it works...
# use Modern::Perl;
use strict;
use warnings;
use feature qw'say';
our %data;
while(<>){
chomp;
my($number,$csv) = split /\t/;
my #csv = split m"\s*,\s*", $csv;
push #{$data{$_}}, $number for #csv;
}
for my $number (sort keys %data){
my #unique = sort keys %{{ map { ($_,undef) } #{$data{$number}} }};
say $number, "\t", join ', ', #unique;
}
Here is an example using CPAN's Text::CSV module rather than manual parsing of CSV fields:
use strict;
use warnings;
use Text::CSV;
my %hash;
my $csv = Text::CSV->new({ allow_whitespace => 1 });
open my $file, "<", "file/to/read.txt";
while(<$file>) {
my ($first, $rest) = split /\t/, $_, 2;
my #values;
if($csv->parse($rest)) {
#values = $csv->fields()
} else {
warn "Error: invalid CSV: $rest";
next;
}
foreach(#values) {
push #{ $hash{$_} }, $first;
}
}
# this can be shortened, but I don't remember whether sort()
# defaults to <=> or cmp, so I was explicit
foreach(sort { $a cmp $b } keys %hash) {
print "$_\t", join(",", sort { $a <=> $b } #{ $hash{$_} }), "\n";
}
Note that it will print to standard output. I recommend just redirecting standard output, and if you expand this program at all, make sure to use warn() to print any errors, rather than just print()ing them. Also, it won't check for duplicate entries, but I don't want to make my code look like Brad Gilbert's, which looks a bit wack even to a Perlite.
Here's an awk(1) and sort(1) answer:
Your data is basically a many-to-many data set so the first step is to normalise the data with one key and value per line. We'll also swap the keys and values to indicate the new primary field, but this isn't strictly necessary as the parts lower down do not depend on order. We use a tab or [spaces],[spaces] as the field separator so we split on the tab between the key and values, and between the values. This will leave spaces embedded in the values, but trim them from before and after:
awk -F '\t| *, *' '{ for (i=2; i<=NF; ++i) { print $i"\t"$1 } }'
Then we want to apply your sort order and eliminate duplicates. We use a bash feature to specify a tab char as the separator (-t $'\t'). If you are using Bourne/POSIX shell, you will need to use '[tab]', where [tab] is a literal tab:
sort -t $'\t' -u -k 1f,1 -k 2n
Then, put it back in the form you want:
awk -F '\t' '{
if (key != $1) {
if (key) printf "\n";
key=$1;
printf "%s\t%s", $1, $2
} else {
printf ", %s", $2
}
}
END {printf "\n"}'
Pipe them altogether and you should get your desired output. I tested with the GNU tools.

Resources