How to search through array elements for match in hash keys - arrays

I've an array that contains unique IDs (numeric) for DNA sequences. I've put my DNA sequences in a hash so that each key contains a descriptive header, and its value is the DNA sequence. Each header in this list contains gene information and is suffixed with its unique ID number:
Unique ID: 14272
Header(hash key): PREDICTEDXenopusSiluranatropicalishypotheticalproteinLOCLOCmRNA14272
Sequence (hash value): ATGGGTC...
I want to cycle through each Unique ID and see if it matches the number at the end of each header(hash key) and, if so, print the hash key + value into a file. So far I've got this:
my %hash;
@hash{@hash_index} = @hash_seq;
foreach $hash_index (sort keys %hash) {
for ($i=0; $i <= $#scaffoldnames; $i++) {
if ($hash_index =~ /$scaffoldnames[$i]/) {
print GENE_ID "$hash_index\n$hash{$hash_index}\n";
}
}
}
close(GENE_ID);
Whereby the unique IDs are contained within @scaffoldnames.
This doesn't work! I'm unsure as to how best to loop through both the hash and the array to find a match.
I'll expand below:
Upstream code:
foreach (@scaffoldnames) {
s/[^0-9]*//g;
} #Remove all non-numerics
my @genes = read_file('splice.txt'); #Splice.txt is a fasta file
my $hash_index = '';
my $hash_seq = '';
foreach (@genes) {
if (/^>/){
my $head = $_;
$hash_index .= $head; #Collect all heads for hash
}
else {
my $sequence = $_;
$hash_seq .= $sequence; #Collect all sequences for hash
}
}
my @hash_index = split(/\n/, $hash_index); #element[0]=head1, element[1]=head2
my @hash_seq = split(/\n/, $hash_seq); #element[0]=seq1, element[1]=seq2
my %hash; # Make hash from both arrays - heads as keys, seqs as values
@hash{@hash_index} = @hash_seq;
foreach $hash_index (sort keys %hash) {
for ($i=0; $i <= $#scaffoldnames; $i++) {
if ($hash_index =~ /$scaffoldnames[$i]$/) {
print GENE_ID "$hash_index\n$hash{$hash_index}\n";
}
}
}
close(GENE_ID);
I'm trying to isolate all differently expressed genes (by unique ID) as outputted by cuffdiff (RNA-Seq) and relate them to the scaffolds (in this case expressed sequences) from which they came.
I'm hoping therefore that I can take isolate each unique ID and search through the original fasta file to pull out the header it matches and the sequence it's associated with.

You seem to have missed the point of hashes: they are used to index your data by keys so that you can access the relevant information in one step, like you can with arrays. Looping over every hash element kinda spoils the point. For instance, you wouldn't write
my $value;
for my $i (0 .. $#data) {
$value = $data[$i] if $i == 5;
}
you would simply do this
my $value = $data[5];
It is hard to help properly without some more information about where your information has come from and exactly what it is you want, but this code should help.
I have used one-element arrays that I think look like what you are using, and built a hash that stores both the header and the sequence as a two-element array, using the ID (the trailing digits of the header) as the key. Then you can just look up the information for, say, ID 14272 using $hash{14272}. The header is $hash{14272}[0] and the sequence is $hash{14272}[1].
If you provide more of an indication about your circumstances then we can help you further.
use strict;
use warnings;
my @hash_index = ('PREDICTEDXenopusSiluranatropicalishypotheticalproteinLOCLOCmRNA14272');
my @hash_seq = ('ATGGGTC...');
my @scaffoldnames = (14272);
my %hash = map {
my ($key) = $hash_index[$_] =~ /(\d+)\z/;
$key => [ $hash_index[$_], $hash_seq[$_] ];
} 0 .. $#hash_index;
open my $gene_fh, '>', 'gene_id.txt' or die $!;
for my $name (@scaffoldnames) {
next unless my $info = $hash{$name};
printf $gene_fh "%s\n%s\n", @$info;
}
close $gene_fh;
Update
From the new code you have posted it looks like you can replace that section with this code.
It works by taking the trailing digits from every sequence header that it finds, and using that as a key to choose a hash element to append the data to. The hash values are the header and the sequence, all in a single string. If you have a reason for keeping them separate then please let me know.
foreach (@scaffoldnames) {
s/\D+//g;
} # Remove all non-numerics
open my $splice_fh, '<', 'splice.txt' or die $!; # splice.txt is a FASTA file
my %sequences;
my $id;
while (<$splice_fh>) {
($id) = /(\d+)$/ if /^>/;
$sequences{$id} .= $_ if $id;
}
for my $id (@scaffoldnames) {
if (my $sequence = $sequences{$id}) {
print GENE_ID $sequence;
}
}
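As a self-contained illustration of the same idea, here is a minimal sketch that indexes FASTA records by the trailing digits of their header and then looks up each wanted ID directly. The FASTA text, the second header and the scaffold names are made up for the example, and an in-memory filehandle stands in for splice.txt; the output is collected in $output rather than written to GENE_ID.

```perl
use strict;
use warnings;

# Hypothetical FASTA content standing in for splice.txt.
my $fasta = <<'END';
>PREDICTEDXenopusSiluranatropicalishypotheticalproteinLOCLOCmRNA14272
ATGGGTC
GGATTAC
>AnotherHypotheticalHeader99
TTTTCCC
END

my @scaffoldnames = ('scaffold14272', 'scaffold555');   # made-up IDs
s/\D+//g for @scaffoldnames;                             # keep digits only

# Read from an in-memory filehandle so the sketch needs no files on disk.
open my $fh, '<', \$fasta or die "Cannot open in-memory FASTA: $!";

my (%sequences, $id);
while (my $line = <$fh>) {
    # A header line starts a new record: take its trailing digits as the key.
    ($id) = $line =~ /(\d+)\s*$/ if $line =~ /^>/;
    $sequences{$id} .= $line if defined $id;
}
close $fh;

# One hash lookup per wanted ID, no nested loop over all headers.
my $output = '';
for my $wanted (@scaffoldnames) {
    next unless exists $sequences{$wanted};
    $output .= $sequences{$wanted};
}
print $output;
```

Only the record whose header ends in 14272 is printed; ID 555 has no matching header and is skipped.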

Related

Eliminating unitialized values in my Perl hash of arrays

I successfully create a hash of arrays, and I am using it to calculate log-odds scores for each DNA sequence from a file (Creating a hash of arrays for DNA sequences, Perl has input file format). I get a score for each sequence, but I get a warning for each calculation. Naturally, I want to clear up the warning. The warning is: Use of uninitialized value in string eq at line 148.
Here is a summarized version of the code (I can post the full code if necessary):
use strict;
use warnings;
use Data::Dumper;
#USER SPECIFICATIONS
print "Please enter the filename of the fasta sequence data: ";
my $filename1 = <STDIN>;
#Remove newline from file
chomp $filename1;
#Open the file and store each dna seq in hash
my %id2seq = ();
my %HoA = ();
my %loscore = ();
my $id = '';
open (FILE, '<', $filename1) or die "Cannot open $filename1.",$!;
my $dna;
while (<FILE>)
{
if($_ =~ /^>(.+)/)
{
$id = $1; #Stores 'Sequence 1' as the first $id, for example
}
else
{
$HoA{$id} = [ split(//) ]; #Splits the contents to allow for position reference later
$id2seq{$id} .= $_; #Creates a hash with each seq associated to an id number, used for calculating tables that have been omitted for space
$loscore{$id} .= 0; #Creates a hash with each id number to have a log-odds score
}
}
close FILE;
#User specifies motif width
print "Please enter the motif width:\n";
my $width = <STDIN>;
#Remove newline from file
chomp $width;
#Default width is 3 (arbitrary number chosen)
if ($width eq '')
{
$width = 3;
}
#Omitting code about $width<=0, creation of log-odds score hash to save space
foreach $id (keys %HoA, %loscore)
{
for my $pos (0..($width-1))
{
for my $base (qw( A C G T))
{
if ($HoA{$id}[$pos] eq $base) #ERROR OCCURS HERE
{
$loscore{$id} += $logodds{$base}[$pos];
}
elsif ( ! defined $HoA{$id}[$pos])
{
print "$pos\n";
}
}
}
}
print Dumper(\%loscore);
The output I get is:
Use of uninitialized value in string eq at line 148, <STDIN> line 2.
2
(This error repeats 4 times for each position - most likely due to matching to each $base?)
$VAR1 = {
'Sequence 15' => '-1.27764697876093',
'Sequence 4' => '0.437512962981119',
(continues for 29 sequences)
}
To summarize, I want to calculate the log-odds score of each sequence. I have a log-odds score hash %loscore that contains the score of a base at each location within a motif. The log-odds score is calculated by summing the referenced values. For example, if the log-odds table was
A 4 3 2
C 7 2 1
G 6 9 2
T 1 0 3
The log-odds score of the sequence CAG would be 7+3+2=12.
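The arithmetic in that example can be sketched as follows. This is only an illustration using the small table above; the helper name log_odds_score is hypothetical, and the guard that skips undefined entries is one assumed way of avoiding an "uninitialized value" warning when a sequence is shorter or longer than the table.

```perl
use strict;
use warnings;

# The example table from the question: one score per base per motif position.
my %logodds = (
    A => [4, 3, 2],
    C => [7, 2, 1],
    G => [6, 9, 2],
    T => [1, 0, 3],
);

# Hypothetical helper: sum the table entries selected by each base of $seq.
sub log_odds_score {
    my ($seq, $table) = @_;
    my @bases = split //, $seq;
    my $score = 0;
    for my $pos (0 .. $#bases) {
        my $row = $table->{ $bases[$pos] };
        # Skip unknown characters and positions beyond the table width.
        next unless $row && defined $row->[$pos];
        $score += $row->[$pos];
    }
    return $score;
}

my $score = log_odds_score('CAG', \%logodds);
print "CAG scores $score\n";
```

For CAG this adds 7 (C at position 0), 3 (A at position 1) and 2 (G at position 2), giving 12.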
At the moment, I believe that the error occurs because of the way I split the strings of DNA to be put into the hash of arrays. As I previously stated, if you want all the code so you can copy-paste, I can provide it. I think the solution is pretty simple, and I just need someone to point me in the right direction. Any and all help is appreciated, and I can clarify as questions arise. Also, any tips that could help me to post more concise questions are appreciated (I know this one is lengthy, I just want to provide enough background information).
Here is the code that I am using to iterate through the %HoA. It calculates a log-odds score for each sequence, then works through each sequence to find a maximum score for each sequence. Big thanks to everyone for helping out!
foreach $id (keys %HoA)
{
for my $pos1 (0..length($HoA{$id})-1)
{
for my $pos2 ($pos1..$pos1+($width-1))
{
for my $base (qw( A C G T))
{
if ($HoA{$id}[$pos2] eq $base)
{
for my $pos3 (0..$width-1)
{
$loscore{$id} += $logodds{$base}[$pos3];
if ($loscore{$id} > $maxscore{$id})
{
$maxscore{$id} = $loscore{$id};
}
}
}
elsif ( ! defined $HoA{$id}[$pos2])
{
print "$pos2\n";
}
}
}
}
}

Separating CSV file into key and array

I am new to perl, and I am trying to separate a csv file (has 10 comma-separated items per line) into a key (first item) and an array (9 items) to put in a hash. Eventually, I want to use an if function to match another variable to the key in the hash and print out the elements in the array.
Here's the code I have, which doesn't work right.
use strict;
use warnings;
my %hash;
my $in2 = "metadata1.csv";
open IN2, "<$in2" or die "Cannot open the file: $!";
while (my $line = <IN2>) {
my ($key, @value) = split (/,/, $line, 2);
%hash = (
$key => @value
);
}
foreach my $key (keys %hash)
{
print "The key is $key and the array is $hash{$key}\n";
}
Thank you for any help!
Don't use 2 as the third argument to split: it will split the line into only two elements, so there'll be just one element in @value.
Also, by doing %hash =, you're overwriting the hash in each iteration of the loop. Just add a new key/value pair:
$hash{$key} = \@value;
Note the \ sign: you can't store an array directly as a hash value, you have to store a reference to it. When printing the value, you have to dereference it back:
#! /usr/bin/perl
use warnings;
use strict;
my %hash;
while (<DATA>) {
my ($key, @values) = split /,/;
$hash{$key} = \@values;
}
for my $key (keys %hash) {
print "$key => @{ $hash{$key} }";
}
__DATA__
id0,1,2,a
id1,3,4,b
id2,5,6,c
If your CSV file contains quoted or escaped commas, you should use Text::CSV.
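To make the effect of the third argument to split concrete, here is a small sketch comparing the two calls on one made-up line, and then storing the fields under the key as an array reference.

```perl
use strict;
use warnings;

my $line = "id0,1,2,a\n";   # made-up sample line
chomp $line;

my @limited = split /,/, $line, 2;   # at most two fields: ('id0', '1,2,a')
my @all     = split /,/, $line;      # every field: ('id0', '1', '2', 'a')

my %hash;
my ($key, @values) = @all;
$hash{$key} = \@values;              # a hash value must be a scalar, so store a reference

print scalar(@limited), " fields with the limit, ", scalar(@all), " without\n";
print "$key => @{ $hash{$key} }\n";
```

With the limit, everything after the first comma stays in a single string; without it, each field becomes its own array element.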
First of all, each key in a hash is unique, so when you have lines like these in your CSV file:
key1,val11,val12,val13,val14,val15,val16,val17,val18,val19
key1,val21,val22,val23,val24,val25,val26,val27,val28,val29
after adding both key/value pairs with the 'key1' key to the hash, you'll get just one pair saved in the hash: the one that was added later.
So to keep all records, you probably need an array-of-hashes structure, where the value in each hash is an array reference, like this:
@result = (
{ 'key1' => ['val11','val12','val13','val14','val15','val16','val17','val18','val19'] },
{ 'key1' => ['val21','val22','val23','val24','val25','val26','val27','val28','val29'] },
{ 'and' => ['so on'] },
);
In order to achieve that your code should become like this:
use strict;
use warnings;
my @AoH; # array of hashes containing data from CSV
my $in2 = "metadata1.csv";
open IN2, "<$in2" or die "Cannot open the file: $!";
while (my $line = <IN2>) {
my @string_bits = split (/,/, $line);
my $key = $string_bits[0]; # first element - key
my $value = [ @string_bits[1 .. $#string_bits] ]; # rest get into arr ref
push @AoH, {$key => $value}; # array of hashes structure
}
foreach my $hash_ref (@AoH)
{
my $key = (keys %$hash_ref)[0]; # get hash key
my $value = join ', ', @{ $hash_ref->{$key} }; # join array into str
print "The key is '$key' and the array is '$value'\n";
}
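If you still want to look rows up by key, one possible alternative (an assumption about what you need, not something the question asks for) is a hash whose values are lists of rows. The sketch below uses made-up lines with a repeated key and contrasts plain assignment, where the later row wins, with push, which keeps every row.

```perl
use strict;
use warnings;

# Made-up lines with a repeated key.
my @lines = ("key1,val11,val12\n", "key1,val21,val22\n", "key2,val31,val32\n");

my (%last_wins, %all_rows);
for my $line (@lines) {
    chomp(my $copy = $line);
    my ($key, @values) = split /,/, $copy;
    $last_wins{$key} = \@values;           # plain assignment: a later row replaces an earlier one
    push @{ $all_rows{$key} }, \@values;   # push: every row is kept under its key
}

print "last_wins key1: @{ $last_wins{key1} }\n";
print "all_rows key1 has ", scalar(@{ $all_rows{key1} }), " rows\n";
```

Here $all_rows{key1} holds two array references, one per line, while $last_wins{key1} holds only the values from the second line.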

Perl: Load file into hash

I'm struggling to understand logic behind hashes in Perl. Task is to load file in to hash and assign values to keys which are created using this file.
File contains alphabet with each letter on its own line:
a
b
c
d
e
and so on.
When using array instead of hash, logic is simple: load file into array and then print each element with corresponding number using some counter ($counter++).
But now my question is, how can I read file into my hash, assign automatically generated values and sort it in that way where output is printed like this:
a:1
b:2
c:3
I've tried to first create array and then link it to hash using
%hash = @array
but it makes my hash non-sortable.
There are a number of ways to approach this. The most direct would be to load the data into the hash as you read through the file.
my %hash;
while(<>)
{
chomp;
$hash{$_} = $.; #Use the line number as your autogenerated counter.
}
You can also perform similar logic if you already have a populated array.
for (0..$#array)
{
$hash{$array[$_]} = $_;
}
Although, if you are in that situation, map is the perlier way of doing things.
%hash = map { $array[$_] => $_ } 0 .. $#array;
Think of a hash as a set of pairs (key, value), where the keys must be unique. You want to read the file one line at a time, and add a pair to the hash:
$record = <$file_handle>;
$hash{$record} = $counter++;
Of course, you could read the entire file into an array at once and then assign to your hash. But the solution is not:
@records = <$file_handle>;
%hash = @records;
... as you found out. If you think in terms of (key, value) pairs, you will see that the above is equivalent to:
$hash{a} = 'b';
$hash{c} = 'd';
$hash{e} = 'f';
...
and so on. You still are going to need a loop, either an explicit one like this:
foreach my $rec (@records)
{
$hash{$rec} = $counter++;
}
or an implicit one like one of these:
%hash = map {$_ => $counter++} @records;
# or:
$hash{$_} = $counter++ for @records;
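To see the pairing behaviour described above next to the counter-based version, here is a minimal sketch. The list of letters stands in for the already chomped lines of the file, and the counter starts at 1 only to match the output format asked for in the question.

```perl
use strict;
use warnings;

my @records = qw(a b c d e f);   # stand-in for the chomped lines of the file

# Assigning a flat list to a hash consumes it two items at a time.
my %paired = @records;           # a=>b, c=>d, e=>f

# Pairing each record with a counter gives one entry per record.
my $counter = 1;
my %numbered = map { $_ => $counter++ } @records;

printf "%s=>%s\n", $_, $paired{$_}   for sort keys %paired;
printf "%s:%d\n",  $_, $numbered{$_} for sort keys %numbered;
```

The first hash ends up with three keys (a, c, e); the second has all six letters numbered 1 to 6.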
This code should generate the proper output, where my-text-file is the path to your data file:
my %hash;
my $counter = 0;
open(FILE, "my-text-file");
while (<FILE>) {
chomp;
$counter++;
$hash{$_} = $counter;
}
# Now to sort
foreach $key (sort(keys(%hash))) {
print $key . ":" . $hash{$key} . "\n";
}
I assume you want to sort the hash alphabetically. keys(%hash) and values(%hash) return the keys and values of %hash as lists, respectively. Run the program on this file:
f
a
b
d
e
c
And we get:
a:2
b:3
c:6
d:4
e:5
f:1
I hope this helps you.
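If the goal is instead to print the entries in the order they appeared in the file, sorting the keys by their stored counter would do it. This is a small sketch under that assumption, using the same six letters as the sample file.

```perl
use strict;
use warnings;

my %hash;
my $counter = 0;
$hash{$_} = ++$counter for qw(f a b d e c);   # same order as the sample file

my @by_key   = sort keys %hash;                               # alphabetical order
my @by_value = sort { $hash{$a} <=> $hash{$b} } keys %hash;   # original file order

printf "%s:%d\n", $_, $hash{$_} for @by_value;
```

The comparison block looks up each key's value, so the numeric counters drive the order rather than the letters.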

Auto increment numeric key values in a perl hash?

I have a perl script in which I am reading files from a given directory, and then placing those files into an array. I then want to be able to move those array elements into a perl hash, with the array elements being the hash value, and automatically assigning numeric keys to each hash value.
Here's the code:
# Open the current user's directory and get all the builds. If you can't open the dir
# then die.
opendir(D, "$userBuildLocation") || die "Can't opendir $userBuildLocation: $!\n";
# Put build files into an array.
my @builds = readdir(D);
closedir(D);
print join("\n", @builds, "\n");
This prints out:
test.dlp
test1.dlp
I want to take those value and insert them into a hash that looks just like this:
my %hash (
1 => test.dlp
2 => test1.dlp
);
I want the numbered keys to be auto incrementing based on how many files I may find in a given directory.
I'm just not sure how to get the auto-incrementing keys to be set to unique numeric values for each item in the hash.
I am not sure I understand the need, but this should do:
my $i = 0;
my %hash = map { ++$i => $_ } @builds;
Another way to do it:
my $i = 0;
for( @builds ) {
$hash{++$i} = $_;
}
The most straightforward and boring way:
my %hash;
for (my $i=0; $i<@builds; ++$i) {
$hash{$i+1} = $builds[$i];
}
or if you prefer:
foreach my $i (0 .. $#builds) {
$hash{$i+1} = $builds[$i];
}
I like this approach:
@hash{1..@builds} = @builds;
Another:
my %hash = map { $_+1, $builds[$_] } 0..$#builds;
or:
my %hash = map { $_, $builds[$_-1] } 1..@builds;
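A short runnable sketch of the hash-slice form next to one of the map forms, using the two file names from the question as sample data. The variable names %by_slice and %by_map are only for this illustration.

```perl
use strict;
use warnings;

my @builds = ('test.dlp', 'test1.dlp');   # the two files from the question

# Hash slice: the keys 1..N line up with the elements of @builds.
my %by_slice;
@by_slice{ 1 .. @builds } = @builds;

# Equivalent map form, building (index + 1, element) pairs.
my %by_map = map { $_ + 1 => $builds[$_] } 0 .. $#builds;

printf "%d => %s\n", $_, $by_slice{$_} for sort { $a <=> $b } keys %by_slice;
```

Both hashes map 1 to test.dlp and 2 to test1.dlp. Note that readdir normally also returns the . and .. entries, so a real script would probably filter those out before numbering.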

How to generate an array with a counter in Perl?

I want to generate a list of unique IDs. Because some of the IDs are duplicates, I need to add a number at the end to make them unique, like so:
ID=exon00001
ID=exon00002
ID=exon00003
ID=exon00004
Here's what I have so far.
while (loop through the IDs) {
# if $id is an exon, then increment the counter by one and add it
# to the end of the ID
if ($id =~ m/exon/) {
my $exon_count = 0;
my @exon = $exon_count++; #3
$number = pop @exon; # removes the first element of the list
$id = $id.$number;
print $id."/n"
}
}
Basically I want to dynamically generate an array with a counter. It's supposed to create an array (1, 2, 3, 4, ... ) for the total number of exons, then remove the elements and add it to the string. This code doesn't work properly. I think there's something wrong with line #3. Do you guys know? Any ideas? thank you
Is this what you need? The counter needs to retain its value, so you can't keep resetting it as you are:
use v5.10;
my $exon_count = 0;
while( my $id = <DATA> ) {
chomp $id;
if( $id =~ m/exon/ ) {
$id = sprintf "%s.%03d", $id, $exon_count++;
}
say $id;
}
__END__
ID=exon00001
ID=exon00002
ID=exon00003
ID=exon00004
The output looks like:
ID=exon00001.000
ID=exon00002.001
ID=exon00003.002
ID=exon00004.003
If you're on 5.10 or later, you can use state to declare the variable inside the loop but let it keep its value:
use v5.10;
while( my $id = <DATA> ) {
chomp $id;
state $exon_count = 0;
if( $id =~ m/exon/ ) {
$id = sprintf "%s.%03d", $id, $exon_count++;
}
say $id;
}
I figure you are new to Perl since your code looks like a mishmash of unrelated things that probably do something much different than you think they do. There's a Perl tutorial for biologists, "Unix and Perl". There's also my Learning Perl book.
Joel asked about using a string as the additional tag. That's fine; Perl lets you increment a string, but only on the ranges a-z and A-Z. We can mix numbers and letters by having a numeric tag that we present in base 36:
use v5.10;
use Math::Base36 'encode_base36';
while( my $id = <DATA> ) {
chomp $id;
state $exon_count = 30;
if( $id =~ m/exon/ ) {
$id = sprintf "%s.%-5s", $id, encode_base36($exon_count++);
}
say $id;
}
Now you have tags like this:
ID=exon00003.1Q
ID=exon00004.1R
ID=exon00001.1S
ID=exon00002.1T
ID=exon00003.1U
ID=exon00004.1V
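For reference, the plain string increment mentioned above behaves like this. It is a minimal sketch with arbitrary starting tags, showing how ++ carries within letter and digit ranges when the variable has only been used as a string.

```perl
use strict;
use warnings;

my @tags;
for my $start ('aa', 'Az', 'zz', 'a9') {
    my $tag = $start;
    $tag++;   # "magic" increment: carries within a-z, A-Z and 0-9
    push @tags, "$start becomes $tag";
}
print "$_\n" for @tags;
```

This prints that aa becomes ab, Az becomes Ba, zz becomes aaa, and a9 becomes b0.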
As noted in my comment, your code does not compile, and does not work. Start by counting the duplicates, then print the correct count of duplicates based on the ids found. Using printf will be suitable for formatting your number.
my %seen;
my @ids = ( bunch of ids );
map $seen{$_}++, @ids; # count the duplicates
for my $id (keys %seen) {
for my $num (1 .. $seen{$id}) {
printf "%s%05d\n", $id, $num;
}
}
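A variation on the same counting idea, in case the output should follow the input order: keep a running count per ID and append it as each ID is read. This is a sketch with made-up IDs, not a drop-in replacement for the code above.

```perl
use strict;
use warnings;

my @ids = qw(exon exon intron exon);   # made-up input with duplicates

my %seen;
my @unique;
for my $id (@ids) {
    # ++$seen{$id} is 1 the first time an ID appears, 2 the second, and so on.
    push @unique, sprintf '%s%05d', $id, ++$seen{$id};
}
print "$_\n" for @unique;
```

This yields exon00001, exon00002, intron00001, exon00003, in that order, whereas iterating over keys %seen gives no guaranteed order.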
You want to generate a list of unique ids for these exons (to output into a GFF file?).
You have to be sure to initialize the counter outside of the loop. I'm not sure what you wanted to accomplish with the array. However, the program below will generate unique exon ids according to the format you posted (exon00001, etc).
my $exon_count=0;
while(my $id=<SOMEINPUT>){
chomp $id; # strip the newline so the number lands on the same line as the ID
if($id=~m/exon/){
$exon_count++;
my $num='0' x (5 - length $exon_count) . $exon_count;
print "$id$num\n";
}
}

Resources