I want to generate a list of unique IDs. Because some of the IDs are duplicates, I need to add a number at the end to make them unique, like so:
ID=exon00001
ID=exon00002
ID=exon00003
ID=exon00004
Here's what I have so far.
while (loop through the IDs) {
# if $id is an exon, then increment the counter by one and add it
# to the end of the ID
if ($id =~ m/exon/) {
my $exon_count = 0;
my @exon = $exon_count++; #3
$number = pop @exon; # removes the first element of the list
$id = $id.$number;
print $id."/n"
}
}
Basically I want to dynamically generate an array with a counter. It's supposed to create an array (1, 2, 3, 4, ... ) for the total number of exons, then remove the elements and add it to the string. This code doesn't work properly. I think there's something wrong with line #3. Do you guys know? Any ideas? thank you
Is this what you need? The counter needs to retain its value, so you can't keep resetting it as you are:
use v5.10;
my $exon_count = 0;
while( my $id = <DATA> ) {
chomp $id;
if( $id =~ m/exon/ ) {
$id = sprintf "%s.%03d", $id, $exon_count++;
}
say $id;
}
__END__
ID=exon00001
ID=exon00002
ID=exon00003
ID=exon00004
The output looks like:
ID=exon00001.000
ID=exon00002.001
ID=exon00003.002
ID=exon00004.003
If you're on 5.10 or later, you can use state to declare the variable inside the loop but let it keep its value:
use v5.10;
while( my $id = <DATA> ) {
chomp $id;
state $exon_count = 0;
if( $id =~ m/exon/ ) {
$id = sprintf "%s.%03d", $id, $exon_count++;
}
say $id;
}
I figure you are new to Perl since your code looks like a mishmash of unrelated things that probably do something much different than you think they do. There's a Perl tutorial for biologists, "Unix and Perl". There's also my Learning Perl book.
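Joel asked about using a string as the additional tag. That's fine; Perl lets you increment a string, but the magic only applies to strings that match /^[a-zA-Z]*[0-9]*\z/ (letters followed by digits), and each character stays within its own range. We can mix numbers and letters freely by having a numeric tag that we present in base 36: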
use v5.10;
use Math::Base36 'encode_base36';
while( my $id = <DATA> ) {
chomp $id;
state $exon_count = 30;
if( $id =~ m/exon/ ) {
$id = sprintf "%s.%-5s", $id, encode_base36($exon_count++);
}
say $id;
}
Now you have tags like this:
ID=exon00003.1Q
ID=exon00004.1R
ID=exon00001.1S
ID=exon00002.1T
ID=exon00003.1U
ID=exon00004.1V
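For comparison, here is a minimal sketch of the plain string increment mentioned above, without Math::Base36. The helper name next_tags is made up for illustration; the only thing it relies on is Perl's documented behaviour of ++ on a string that has only been used as a string.

```perl
use strict;
use warnings;

# next_tags is a hypothetical helper: it returns $count tags,
# starting at $start and stepping with Perl's string increment.
sub next_tags {
    my ($start, $count) = @_;
    my $tag = $start;    # never used as a number, so ++ stays "magical"
    my @tags;
    for (1 .. $count) {
        push @tags, $tag;
        $tag++;          # increments with carry: "az" becomes "ba"
    }
    return @tags;
}

print join(' ', next_tags('ay', 4)), "\n";    # ay az ba bb
print join(' ', next_tags('a8', 3)), "\n";    # a8 a9 b0
```

Letters only ever roll over into letters and digits into digits, which is why a base 36 tag is the easier route if you want both in every position.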
As noted in my comment, your code does not compile, and does not work. Start by counting the duplicates, then print the correct count of duplicates based on the ids found. Using printf will be suitable for formatting your number.
my %seen;
my @ids = ( bunch of ids );
map $seen{$_}++, @ids; # count the duplicates
for my $id (keys %seen) {
for my $num (1 .. $seen{$id}) {
printf "%s%05d\n", $id, $num;
}
}
You want to generate a list of unique ids for these exons (to output into a GFF file?).
You have to be sure to initialize the counter outside of the loop. I'm not sure what you wanted to accomplish with the array. However, the program below will generate unique exon ids according to the format you posted (exon00001, etc).
my $exon_count = 0;
while ( my $id = <SOMEINPUT> ) {
    chomp $id;    # drop the newline so the number lands on the same line
    if ( $id =~ m/exon/ ) {
        $exon_count++;
        my $num = '0' x (5 - length $exon_count) . $exon_count;
        print "$id$num\n";
    }
}
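The same zero padding can also be left to sprintf. This is only a sketch: number_exon_ids is a made-up name, and the input IDs are assumed to arrive as a plain list rather than from a filehandle.

```perl
use strict;
use warnings;

# number_exon_ids is a hypothetical helper: it appends a
# five-digit running number to every ID that mentions "exon".
sub number_exon_ids {
    my @ids = @_;
    my $exon_count = 0;    # initialised once, outside the loop
    my @numbered;
    for my $id (@ids) {
        next unless $id =~ /exon/;
        push @numbered, sprintf '%s%05d', $id, ++$exon_count;
    }
    return @numbered;
}

print "$_\n" for number_exon_ids('ID=exon', 'ID=gene', 'ID=exon');
# ID=exon00001
# ID=exon00002
```

%05d pads with zeros to five digits, so there is no need to work out the number of zeros by hand.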
I have the following code that uses references to hashes:
sub readAll {
    my ( $main, $dbh ) = @_;
    my @SessSeq = ();
    my $sql;
    my $rec = 0;
    $sql = "SELECT * FROM sys_table ";
    my $sth = PrepAndExecuteQuery( $dbh, $sql );
    while ( my $result = $sth->fetchrow_hashref() ){
        push @SessSeq, $result;
        $rec++;
    }
    $$main{_SessSeq} = \@SessSeq;
}
The above code works: I get an array of hashes in the main hash.
I'm struggling to retrieve the data due to my lack of knowledge.
This doesn't seem to work:
foreach my $ses ( @($$main{_SessSeq}) ){
print STDERR Dumper $ses;
}
What am I doing wrong?
Assuming that your foreach my $ses loop is inside the readAll subroutine, the only thing you are doing wrong is to use parentheses ( ... ) instead of braces { ... } in your dereferencing
$$main{_SessSeq} would be very much better written in modern terms as $main->{_SessSeq}, but both evaluate to the same thing: a reference to an array of hashes with the data from a database table row in each hash
The correct loop will look like this
for my $ses ( @{ $main->{_SessSeq} } ) {
print STDERR Dumper $ses;
}
but there is no real advantage over writing just
print STDERR Dumper $main->{_SessSeq}
You say nothing about what "this doesn't seem to work" might mean, but print Dumper $ses will output the contents of a table row each time around the loop
If you need any further help then you must describe the problem properly, showing the output that you're getting and describing carefully what you expect
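In case it helps to see the dereferencing in isolation, here is a small sketch that runs without a database. The two rows and the helper session_names are invented for the example; only the _SessSeq key comes from the question.

```perl
use strict;
use warnings;

# Hypothetical stand-in for the rows that readAll would fetch.
my $main = {
    _SessSeq => [
        { id => 1, name => 'alpha' },
        { id => 2, name => 'beta' },
    ],
};

# session_names is a made-up helper that walks the array of hashes.
sub session_names {
    my ($main) = @_;
    my @names;
    # @{ ... } dereferences the array reference stored in the hash
    for my $ses ( @{ $main->{_SessSeq} } ) {
        push @names, $ses->{name};    # each $ses is a hash reference
    }
    return @names;
}

print join(',', session_names($main)), "\n";    # alpha,beta
print scalar @{ $main->{_SessSeq} }, "\n";      # 2
```

The braces after @ hold an expression that yields an array reference; parentheses in that position are not dereferencing syntax.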
I have been struggling with this for a while in a Perl script I have. Probably a slam dunk for you Perl experts, and probably should be easier, but I can't quite crack the nut on this. I might be needing to split this, not sure.
My array code is as follows.
while ( my $row = $query_handle->fetchrow_hashref('NAME_lc') ){
push @query_output, $row;
push (@{portfo->{label}},$row->{data},$row->{label});
}
And then my print of the array is as follows:
print "array here--";
print "[";
foreach (@{portfo->{label}}) {
#(@{portfo->{label}},$row->{data});
print "\{\"data\":";
print "".$_.",";
print "\"label\":";
print "\"".$row[1]."\"\},";
}
print "]";
print "\n";
And then my output looks like this:
[{"data":2943,"label":""},{"data":CDI3,"label":""},
{"data":1,"label":""},{"data":COS-COS2,"label":""},
{"data":1087,"label":""},{"data":COS1,"label":""},
{"data":5183,"label":""},{"data":COS2,"label":""},
{"data":2731,"label":""},{"data":CULB,"label":""},{"data":1,"label":""},
{"data":EQUIT,"label":""},{"data":4474,"label":""},
{"data":Network,"label":""},]
I am trying to put the alphanumeric array items like CDI3, COS1, COS2, etc. in quotes, in the label part. Somehow I'm getting them separated. Meanwhile, I do want the numeric values left with the "data" name pair.
[{"data":2943,"label":""},{"data":"CDI3","label":""},
{"data":1,"label":""},{"data":"COS-COS2","label":""},
{"data":1087,"label":""},{"data":"COS1","label":""},
{"data":5183,"label":""},{"data":"COS2","label":""},
{"data":2731,"label":""},{"data":"CULB","label":""},{"data":1,"label":""},
{"data":"EQUIT","label":""},{"data":4474,"label":""},
{"data":"Network","label":""}]
I'm sure it's a simpler fix than I'm making it, but so far no luck. Any tips would be greatly appreciated!
Thanks!
use JSON::XS qw( encode_json );
my @data;
while ( my $row = $query_handle->fetchrow_hashref('NAME_lc') ) {
    # If $row->{data} is a number,
    # make sure it's stored as a number
    # so that it gets serialized as a number.
    $row->{data} += 0 if $row->{data} =~ /^\d+\z/;
    push @data, $row;
}
print(encode_json(\@data));
Or
my $data = $query_handle->fetchall_arrayref({ data => 1, label => 1 });
for my $row (@$data) {
$row->{data} += 0 if $row->{data} =~ /^\d+\z/;
}
print(encode_json($data));
Or, if you ensure the field names are returned as lowercase[1],
my $data = $query_handle->fetchall_arrayref({});
for my $row (@$data) {
$row->{data} += 0 if $row->{data} =~ /^\d+\z/;
}
print(encode_json($data));
[1] This can be done using $dbh->{FetchHashKeyName} = 'NAME_lc'; or AS `label`.
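To see the number-versus-string effect without a database handle, here is a sketch using the core module JSON::PP instead of JSON::XS. The rows and the helper rows_to_json are made up, and canonical is switched on only so that the key order is predictable.

```perl
use strict;
use warnings;
use JSON::PP;

# Hypothetical rows, shaped like what DBI returns: every value is a string.
my @rows = (
    { data => '2943', label => 'CDI3' },
    { data => 'COS1', label => 'COS1' },
);

# rows_to_json is a made-up helper: it numifies all-digit "data"
# values on shallow copies of the rows, then serialises them.
sub rows_to_json {
    my @copies = map { +{ %$_ } } @_;
    for my $row (@copies) {
        $row->{data} += 0 if $row->{data} =~ /^\d+\z/;
    }
    return JSON::PP->new->canonical->encode(\@copies);
}

print rows_to_json(@rows), "\n";
# [{"data":2943,"label":"CDI3"},{"data":"COS1","label":"COS1"}]
```

Adding 0 replaces the string with a real number, so the encoder emits it without quotes; anything that is not all digits stays a quoted string.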
I'm struggling to understand the logic behind hashes in Perl. The task is to load a file into a hash and assign values to keys that are created from this file.
File contains alphabet with each letter on its own line:
a
b
c
d
e
and so on.
With an array instead of a hash, the logic is simple: load the file into the array and then print each element with its corresponding number using a counter ($counter++).
But now my question is: how can I read the file into a hash, assign automatically generated values, and sort it so that the output is printed like this:
a:1
b:2
c:3
I've tried to first create array and then link it to hash using
%hash = @array
but it makes my hash non-sortable.
There are a number of ways to approach this. The most direct would be to load the data into the hash as you read through the file.
my %hash;
while(<>)
{
chomp;
$hash{$_} = $.; #Use the line number as your autogenerated counter.
}
You can also perform similar logic if you already have a populated array.
for (0..$#array)
{
$hash{$array[$_]} = $_;
}
Although, if you are in that situation, map is the perlier way of doing things.
%hash = map { $array[$_] => $_ } 0 .. $#array;
Think of a hash as a set of pairs (key, value), where the keys must be unique. You want to read the file one line at a time, and add a pair to the hash:
$record = <$file_handle>;
$hash{$record} = $counter++;
Of course, you could read the entire file into an array at once and then assign to your hash. But the solution is not:
@records = <$file_handle>;
%hash = @records;
... as you found out. If you think in terms of (key, value) pairs, you will see that the above is equivalent to:
$hash{a} = 'b';
$hash{c} = 'd';
$hash{e} = 'f';
...
and so on. You still are going to need a loop, either an explicit one like this:
foreach my $rec (@records)
{
$hash{$rec} = $counter++;
}
or an implicit one like one of these:
%hash = map {$_ => $counter++} @records;
# or:
$hash{$_} = $counter++ for @records;
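A short sketch of the difference, using a hard-coded list in place of the file (an assumption made so that it runs on its own):

```perl
use strict;
use warnings;

# Stand-in for the chomped lines of the file.
my @records = qw(a b c d e f);

# Assigning a list to a hash pairs up consecutive elements:
# a => 'b', c => 'd', e => 'f'
my %paired = @records;

# What was wanted: every record is a key, numbered from 1.
my $counter = 1;
my %numbered = map { $_ => $counter++ } @records;

print join(' ', map { "$_:$paired{$_}" } sort keys %paired), "\n";
# a:b c:d e:f
print join(' ', map { "$_:$numbered{$_}" } sort keys %numbered), "\n";
# a:1 b:2 c:3 d:4 e:5 f:6
```

Hashes have no inherent order, so the sort keys at print time is what produces the alphabetical output.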
This code should generate the proper output, where my-text-file is the path to your data file:
my %hash;
my $counter = 0;
open(my $fh, '<', 'my-text-file') or die "Cannot open my-text-file: $!";
while (<$fh>) {
    chomp;
    $counter++;
    $hash{$_} = $counter;
}
close($fh);
# Now to sort
foreach my $key (sort(keys(%hash))) {
    print $key . ":" . $hash{$key} . "\n";
}
I assume you want to sort the hash alphabetically. keys(%hash) and values(%hash) return the keys and values of %hash as lists, respectively. Run the program on this file:
f
a
b
d
e
c
And we get:
a:2
b:3
c:6
d:4
e:5
f:1
I hope this helps you.
I am looking for code in Perl similar to
my @lines1 = split /\n/, $str1;
my @lines2 = split /\n/, $str2;
for (int $i=0; $i<lines1.length; $i++)
{
if (lines1[$i] ~= lines2[$i])
print "difference in line $i \n";
}
To compare two strings line by line and show the lines at which there is any difference.
I know what I have written is mixture of C/Perl/Pseudo-code. How do I write it in the way that it works on Perl?
What you have written is sort of OK, except that Perl has no lines1.length or int $i notation, and ~= is not an operator. You mean =~, but that is the wrong tool here. Also, if must have a block { } after it.
What you want is simply $i < @lines1 to get the array size, my $i to declare a lexical variable, and eq for string comparison (or ne to detect a difference), along with if ( ... ) { ... }.
Technically you can use the binding operator to perform a string comparison, for example:
"foo" =~ "foobar"
But it is not a good idea when comparing literal strings, because you can get partial matches, and you need to escape meta characters. Therefore it is easier to just use eq.
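A small sketch of both pitfalls, with made-up strings:

```perl
use strict;
use warnings;

# Two invented lines that are clearly not the same string.
my $line1 = 'a.c';
my $line2 = 'xabcx';

# Used as a pattern, $line1 is unanchored and its "." matches any
# character, so it finds "abc" inside $line2.
my $regex_says_equal = $line2 =~ $line1 ? 1 : 0;

# eq compares the two strings exactly.
my $eq_says_equal = $line2 eq $line1 ? 1 : 0;

# A regex can be made exact with anchors and \Q...\E quoting,
# but that is just a longer way of writing eq.
my $quoted_says_equal = $line2 =~ /\A\Q$line1\E\z/ ? 1 : 0;

print "$regex_says_equal $eq_says_equal $quoted_says_equal\n";    # 1 0 0
```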
Using C-style for loops is valid, but the more Perl-ish way is to use this notation:
for my $i (0 .. $#lines1)
Which will iterate over the range 0 to the max index of the array.
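Putting those pieces together, one possible version looks like the sketch below. The name differing_lines is invented, line numbers are zero-based as in the question, and treating a line that exists in only one string as a difference is my own assumption about what you want.

```perl
use strict;
use warnings;

# differing_lines is a hypothetical helper: it returns the zero-based
# indexes of the lines at which the two strings differ.
sub differing_lines {
    my ($str1, $str2) = @_;
    my @lines1 = split /\n/, $str1;
    my @lines2 = split /\n/, $str2;

    # Walk to the end of the longer list so extra lines are reported too.
    my $last = $#lines1 > $#lines2 ? $#lines1 : $#lines2;

    my @diff;
    for my $i (0 .. $last) {
        my ($l1, $l2) = ($lines1[$i], $lines2[$i]);
        push @diff, $i
            unless defined($l1) && defined($l2) && $l1 eq $l2;
    }
    return @diff;
}

print "difference in line $_\n"
    for differing_lines("a\nb\nc\n", "a\nX\nc\nd\n");
# difference in line 1
# difference in line 3
```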
Perl allows you to open filehandles on strings by using a reference to the scalar variable that holds the string:
open my $string1_fh, '<', \$string1 or die '...';
open my $string2_fh, '<', \$string2 or die '...';
while( my $line1 = <$string1_fh> ) {
my $line2 = <$string2_fh>;
....
}
But, depending on what you mean by difference (does that include insertion or deletion of lines?), you might want something different.
There are several modules on CPAN that you can inspect for ideas, such as Test::LongString or Algorithm::Diff.
my @lines1 = split(/^/, $str1);
my @lines2 = split(/^/, $str2);
# splits at start of line, so each element keeps its newline
# use /\n/ instead if you want the newlines removed
for (my $i = 0; $i < @lines1; $i++) {
    print "difference in line $i \n" if ($lines1[$i] ne $lines2[$i]);
}
Comparing arrays is much easier if you build a hash from them...
#Searching the difference
@isect = ();
@diff = ();
%count = ();
foreach $item ( @array1, @array2 ) { $count{$item}++; }
foreach $item ( keys %count ) {
    if ( $count{$item} == 2 ) {
        push @isect, $item;
    }
    else {
        push @diff, $item;
    }
}
#Output
print "Different= @diff\n\n";
print "\nA Array = @array1\n";
print "\nB Array = @array2\n";
print "\nIntersect Array = @isect\n";
Even after splitting, you can compare the lines as arrays.
I've an array that contains unique IDs (numeric) for DNA sequences. I've put my DNA sequences in a hash so that each key contains a descriptive header, and its value is the DNA sequence. Each header in this list contains gene information and is suffixed with its unique ID number:
Unique ID: 14272
Header(hash key): PREDICTEDXenopusSiluranatropicalishypotheticalproteinLOCLOCmRNA14272
Sequence (hash value): ATGGGTC...
I want to cycle through each Unique ID and see if it matches the number at the end of each header(hash key) and, if so, print the hash key + value into a file. So far I've got this:
my %hash;
@hash{@hash_index} = @hash_seq;
foreach $hash_index (sort keys %hash) {
    for ($i=0; $i <= $#scaffoldnames; $i++) {
        if ($hash_index =~ /$scaffoldnames[$i]/) {
            print GENE_ID "$hash_index\n$hash{$hash_index}\n";
        }
    }
}
close(GENE_ID);
Whereby the unique IDs are contained within @scaffoldnames.
This doesn't work! I'm unsure as to how best to loop through both the hash and the array to find a match.
I'll expand below:
Upstream code:
foreach(@scaffoldnames) {
    s/[^0-9]*//g;
} #Remove all non-numerics
my @genes = read_file('splice.txt'); #Splice.txt is a fasta file
my $hash_index = '';
my $hash_seq = '';
foreach(@genes){
    if (/^>/){
        my $head = $_;
        $hash_index .= $head; #Collect all heads for hash
    }
    else {
        my $sequence = $_;
        $hash_seq .= $sequence; #Collect all sequences for hash
    }
}
my @hash_index = split(/\n/,$hash_index); #element[0]=head1, element[1]=head2
my @hash_seq = split(/\n/, $hash_seq); #element[0]=seq1, element[1]=seq2
my %hash; # Make hash from both arrays - heads as keys, seqs as values
@hash{@hash_index} = @hash_seq;
foreach $hash_index (sort keys %hash) {
    for ($i=0; $i <= $#scaffoldnames; $i++) {
        if ($hash_index =~ /$scaffoldnames[$i]$/) {
            print GENE_ID "$hash_index\n$hash{$hash_index}\n";
        }
    }
}
close(GENE_ID);
I'm trying to isolate all differently expressed genes (by unique ID) as outputted by cuffdiff (RNA-Seq) and relate them to the scaffolds (in this case expressed sequences) from which they came.
I'm hoping therefore that I can isolate each unique ID and search through the original fasta file to pull out the header it matches and the sequence it's associated with.
You seem to have missed the point of hashes: they are used to index your data by keys so that you can access the relevant information in one step, like you can with arrays. Looping over every hash element kinda spoils the point. For instance, you wouldn't write
my $value;
for my $i (0 .. $#data) {
$value = $data[$i] if $i == 5;
}
you would simply do this
my $value = $data[5];
It is hard to help properly without some more information about where your information has come from and exactly what it is you want, but this code should help.
I have used one-element arrays that I think look like what you are using, and built a hash that indexes both the header and the sequence as a two-element array, using the ID (the trailing digits of the header) as a key. Then you can just look up the information for, say, ID 14272 using $hash{14272}. The header is $hash{14272}[0] and the sequence is $hash{14272}[1]
If you provide more of an indication about your circumstances then we can help you further.
use strict;
use warnings;
my @hash_index = ('PREDICTEDXenopusSiluranatropicalishypotheticalproteinLOCLOCmRNA14272');
my @hash_seq = ('ATGGGTC...');
my @scaffoldnames = (14272);
my %hash = map {
    my ($key) = $hash_index[$_] =~ /(\d+)\z/;
    $key => [ $hash_index[$_], $hash_seq[$_] ];
} 0 .. $#hash_index;
open my $gene_fh, '>', 'gene_id.txt' or die $!;
for my $name (@scaffoldnames) {
    next unless my $info = $hash{$name};
    printf $gene_fh "%s\n%s\n", @$info;
}
close $gene_fh;
Update
From the new code you have posted it looks like you can replace that section with this code.
It works by taking the trailing digits from every sequence header that it finds, and using that as a key to choose a hash element to append the data to. The hash values are the header and the sequence, all in a single string. If you have a reason for keeping them separate then please let me know.
foreach (@scaffoldnames) {
    s/\D+//g;
} # Remove all non-numerics
open my $splice_fh, '<', 'splice.txt' or die $!; # splice.txt is a FASTA file
my %sequences;
my $id;
while (<$splice_fh>) {
    ($id) = /(\d+)$/ if /^>/;
    $sequences{$id} .= $_ if $id;
}
for my $id (@scaffoldnames) {
    if (my $sequence = $sequences{$id}) {
        print GENE_ID $sequence;
    }
}
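If you want to try the indexing idea without your real files, here is a self-contained sketch of it. The FASTA text, the second header and the helper name index_fasta_by_id are all invented; it reads from an in-memory filehandle and tests the ID with defined rather than for truth, so an ID of 0 would not be skipped.

```perl
use strict;
use warnings;

# Hypothetical FASTA text; the real data would come from splice.txt.
my $fasta = <<'END';
>PREDICTEDhypotheticalproteinmRNA14272
ATGGGTC
CCGTA
>anotherheader999
TTTT
END

# index_fasta_by_id is a made-up helper: it returns a hash reference
# mapping the trailing digits of each header to header plus sequence.
sub index_fasta_by_id {
    my ($text) = @_;
    open my $fh, '<', \$text or die "Cannot open in-memory file: $!";
    my (%sequences, $id);
    while (my $line = <$fh>) {
        # A header line starts a new record and sets the current ID
        ($id) = $line =~ /(\d+)$/ if $line =~ /^>/;
        $sequences{$id} .= $line if defined $id;
    }
    close $fh;
    return \%sequences;
}

my $sequences = index_fasta_by_id($fasta);
for my $id (14272, 12345) {
    # One hash lookup per ID; unknown IDs are silently skipped
    print $sequences->{$id} if exists $sequences->{$id};
}
```

Each ID costs a single lookup, so there is no nested loop over headers and scaffold names.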