Perl array - trying to parse quotes in proper array elements - arrays

I have been struggling with this for a while in a Perl script I have. Probably a slam dunk for you Perl experts, and probably should be easier, but I can't quite crack the nut on this. I might be needing to split this, not sure.
My array code as is follows.
while ( my $row = $query_handle->fetchrow_hashref('NAME_lc') ){
push #query_output, $row;
push (#{portfo->{label}},$row->{data},$row->{label});
}
And then my print of the array is as follows:
print "array here--";
print "[";
foreach (#{portfo->{label}}) {
#(#{portfo->{label}},$row->{data});
print "\{\"data\":";
print "".$_.",";
print "\"label\":";
print "\"".$row[1]."\"\},";
}
print "]";
print "\n";
And then my output looks like this:
[{"data":2943,"label":""},{"data":CDI3,"label":""},
{"data":1,"label":""},{"data":COS-COS2,"label":""},
{"data":1087,"label":""},{"data":COS1,"label":""},
{"data":5183,"label":""},{"data":COS2,"label":""},
{"data":2731,"label":""},{"data":CULB,"label":""},{"data":1,"label":""},
{"data":EQUIT,"label":""},{"data":4474,"label":""},
{"data":Network,"label":""},]
I am trying to make the apha-num string array items like CDI3, COS1, COS2, etc in quotes, in the label part. Somehow I'm getting it separated. Meanwhile, I do want the numeric values left with the "data" name pair.
[{"data":2943,"label":""},{"data":"CDI3","label":""},
{"data":1,"label":""},{"data":"COS-COS2","label":""},
{"data":1087,"label":""},{"data":"COS1","label":""},
{"data":5183,"label":""},{"data":"COS2","label":""},
{"data":2731,"label":""},{"data":"CULB","label":""},{"data":1,"label":""},
{"data":"EQUIT","label":""},{"data":4474,"label":""},
{"data":"Network","label":""}]
I'm sure it's a simpler fix that I'm making it but so far no luck. Any tips would be greatly appreciated!
Thanks!

use JSON::XS qw( encode_json );
my #data;
while ( my $row = $query_handle->fetchrow_hashref('NAME_lc') ) {
# If $row->{data} is a number,
# make sure it's stored as a number
# so that it gets serialized as a number.
$row->{data} += 0 if $row->{data} =~ /^\d+\z/;
push #data, $row;
}
print(encode_json(\#data));
Or
my $data = $query_handle->fetchall_arrayref({ data => 1, label => 1 });
for my $row (#$data) {
$row->{data} += 0 if $row->{data} =~ /^\d+\z/;
}
print(encode_json($data));
Or if you ensure the fields names are returned as lowercase[1],
my $data = $query_handle->fetchall_arrayref({});
for my $row (#$data) {
$row->{data} += 0 if $row->{data} =~ /^\d+\z/;
}
print(encode_json($data));
This can be done using $dbh->{FetchHashKeyName} = 'NAME_lc'; or AS `label`.

Related

Comparing two strings line by line in Perl

I am looking for code in Perl similar to
my #lines1 = split /\n/, $str1;
my #lines2 = split /\n/, $str2;
for (int $i=0; $i<lines1.length; $i++)
{
if (lines1[$i] ~= lines2[$i])
print "difference in line $i \n";
}
To compare two strings line by line and show the lines at which there is any difference.
I know what I have written is mixture of C/Perl/Pseudo-code. How do I write it in the way that it works on Perl?
What you have written is sort of ok, except you cannot use that notation in Perl lines1.length, int $i, and ~= is not an operator, you mean =~, but that is the wrong tool here. Also if must have a block { } after it.
What you want is simply $i < #lines1 to get the array size, my $i to declare a lexical variable, and eq for string comparison. Along with if ( ... ) { ... }.
Technically you can use the binding operator to perform a string comparison, for example:
"foo" =~ "foobar"
But it is not a good idea when comparing literal strings, because you can get partial matches, and you need to escape meta characters. Therefore it is easier to just use eq.
Using C-style for loops is valid, but the more Perl-ish way is to use this notation:
for my $i (0 .. $#lines1)
Which will iterate over the range 0 to the max index of the array.
Perl allows you to open filehandles on strings by using a reference to the scalar variable that holds the string:
open my $string1_fh, '<', \$string1 or die '...';
open my $string2_fh, '<', \$string2 or die '...';
while( my $line1 = <$string1_fh> ) {
my $line2 = <$string2_fh>;
....
}
But, depending on what you mean by difference (does that include insertion or deletion of lines?), you might want something different.
There are several modules on CPAN that you can inspect for ideas, such as Test::LongString or Algorithm::Diff.
my #lines1 = split(/^/, $str1);
my #lines2 = split(/^/, $str2);
# splits at start of line
# use /\n/ if you want to ignore newline and trailing spaces
for ($i=0; $i < #lines1; $i++) {
print "difference in line $i \n" if (lines1[$i] ne lines2[$i]);
}
Comparing Arrays is a way easier if you create a Hashmap out of it...
#Searching the difference
#isect = ();
#diff = ();
%count = ();
foreach $item ( #array1, #array2 ) { $count{$item}++; }
foreach $item ( keys %count ) {
if ( $count{$item} == 2 ) {
push #isect, $item;
}
else {
push #diff, $item;
}
}
#Output
print "Different= #diff\n\n";
print "\nA Array = #array1\n";
print "\nB Array = #array2\n";
print "\nIntersect Array = #isect\n";
Even after spliting you could compare them as Array.

Empty array in a perl while loop, should have input

Was working on this script when I came across a weird anomaly. When I go to print #extract after declaring it, it prints correctly the following:
------MMMMMMMMMMMMMMMMMMMMMMMMMM-M-MMMMMMMM
------SSSSSSSSSSSSSSSSSSSSSSSSSS-S-SSSSSDTA
------TIIIIIIIIIIIIITIIIVVIIIIII-I-IIIIITTT
Now the weird part, when I then try to print or return #extract (or $column) inside of the while loop, it comes up empty, thus rendering the rest of the script useless. I've never come across this before up until now, haven't been able to find any documentation or people with similar problems as mine. Below is the code, I marked with #<------ where the problems are and are not, to see if anyone can have any idea what is going on? Thank you kindly.
P.S. I am utilizing perl version 5.12.2
use strict;
use warnings;
#use diagnostics;
#use feature qw(say);
open (S, "Val nuc align.txt") || die "cannot open FASTA file to read: $!";
open (OUTPUT, ">output.txt");
my #extract;
my $sum = 0;
my #lines = <S>;
my #seq = ();
my $start = 0; #amino acid column start
my $end = 10; #amino acid column end
#Removing of the sequence tag until amino acid sequence composition (from >gi to )).
foreach my $line (#lines) {
$line =~ s/\n//g;
if ($line =~ />/g) {
$line =~ s/>.*\]/>/g;
push #seq, $line;
}
else {
push #seq, $line;
}
}
my $seq = join ('', #seq);
my #seq_prot = join "\n", split '>', $seq;
#seq_prot = grep {/[A-Z]/} #seq_prot;
#number of sequences
print OUTPUT "Number of sequences:", scalar (grep {defined} #seq_prot), "\n";
#selection of amino acid sequence. From $start to $end.
my #vertical_array;
while ( my $line = <#seq_prot> ) {
chomp $line;
my #split_line = split //, $line;
for my $index ( $start..$end ) { #AA position, extracts whole columns
$vertical_array[$index] .= $split_line[$index];
}
}
# Print out your vertical lines
for my $line ( #vertical_array ) {
my $extract = say OUTPUT for unpack "(a200)*", $line; #split at end of each column
#extract = grep {defined} $extract;
}
print OUTPUT #extract; #<--------------- This prints correctly the input
#Count selected amino acids excluding '-'.
my %counter;
while (my $column = #extract) {
print #extract; #<------------------------ Empty print, no input found
}
Update: Found the main problem to be with the unpack command, I thought I could utilize it to split my columns of my input at X elements (43 in this case). While this works, the minute I change $start to another number that is not 0 (say 200), the code brings up errors. Probably has something to do with the number of column elements does not match the lines. Will keep updated.
Write your last while loop the same way as your previous for loop. The assignment
my $column = #extract
is in scalar context, which does not give you the same result as:
for my $column (#extract)
Instead, it will give you the number of elements in the array. Try this second option and it should work.
However, I still have a concern, because in fact, if #extract had anything in it, you would obtain an infinite loop. Is there any code that you did not include between your two commented lines?

How to generate an array with a counter in Perl?

I want to generate a list of unique IDs. Because some of the IDs are duplicates, I need to add a number at the end to make them unique, like so:
ID=exon00001
ID=exon00002
ID=exon00003
ID=exon00004
Here's what I have so far.
while (loop through the IDs) {
# if $id is an exon, then increment the counter by one and add it
# to the end of the ID
if ($id =~ m/exon/) {
my $exon_count = 0;
my #exon = $exon_count++; #3
$number = pop #exon; # removes the first element of the list
$id = $id.$number;
print $id."/n"
}
}
Basically I want to dynamically generate an array with a counter. It's supposed to create an array (1, 2, 3, 4, ... ) for the total number of exons, then remove the elements and add it to the string. This code doesn't work properly. I think there's something wrong with line #3. Do you guys know? Any ideas? thank you
Is this what you need? The counter needs to retain its value, so you can't keep resetting it as you are:
use v5.10;
my $exon_count = 0;
while( my $id = <DATA> ) {
chomp $id;
if( $id =~ m/exon/ ) {
$id = sprintf "%s.%03d", $id, $exon_count++;
}
say $id;
}
__END__
ID=exon00001
ID=exon00002
ID=exon00003
ID=exon00004
The output looks like:
ID=exon00001.000
ID=exon00002.001
ID=exon00003.002
ID=exon00004.003
If you're on 5.10 or later, you can use state to declare the variable inside the loop but let it keep its value:
use v5.10;
while( my $id = <DATA> ) {
chomp $id;
state $exon_count = 0;
if( $id =~ m/exon/ ) {
$id = sprintf "%s.%03d", $id, $exon_count++;
}
say $id;
}
I figure you are new to Perl since your code looks like a mishmash of unrelated things that probably do something much different than you think they do. There's a Perl tutorial for biologists, "Unix and Perl". There's also my Learning Perl book.
Joel asked about using a string as the additional tag. That's fine; Perl lets you increment a string, but only on the ranges a-z and A-Z. We can mix numbers and letters by having a numeric tag that we present in base 36:
use v5.10;
use Math::Base36 'encode_base36';
while( my $id = <DATA> ) {
chomp $id;
state $exon_count = 30;
if( $id =~ m/exon/ ) {
$id = sprintf "%s.%-5s", $id, encode_base36($exon_count++);
}
say $id;
}
Now you have tags like this:
ID=exon00003.1Q
ID=exon00004.1R
ID=exon00001.1S
ID=exon00002.1T
ID=exon00003.1U
ID=exon00004.1V
As noted in my comment, your code does not compile, and does not work. Start by counting the duplicates, then print the correct count of duplicates based on the ids found. Using printf will be suitable for formatting your number.
my %seen;
my #ids = ( bunch of ids );
map $seen{$_}++, #ids; # count the duplicates
for my $id (keys %seen) {
for my $num (1 .. $seen{$id}) {
printf "%s%05d\n", $id, $num;
}
}
You want to generate a list of unique ids for these exons (to output into a GFF file?).
You have to be sure to initialize the counter outside of the loop. I'm not sure what you wanted to accomplish with the array. However, the program below will generate unique exon ids according to the format you posted (exon00001, etc).
my $exon_count=0;
while(my $id=<SOMEINPUT>){
if($id=~m/exon/){
$exon_count++;
my $num='0' x (5 - length $exon_count) . $exon_count;
print "$id$num\n";
}
}

Need help with a perl program

Ok, so I am trying take a hash and if any string in an array contains the key(not value actual key name) in the hash discard it. Else print out the string. This issue is with a portion of the findHidden sub routine. I have tried a lot of different things, I will comment below where I have issues. I'm sure someone has an answer, always get one on stack overflow :)
#!/usr/bin/perl
# Configure
use strict;
use warnings;
use Data::Dumper;
#
sub findHidden;
sub GetInfo;
sub defineHash;
##############
$passwd = '/etc/passwd';
%info = ();
sub GetInfo {
die "Cannot open: $passwd"
unless (open(PW,$passwd));
while(<PW>) {
chomp;
my ($uname,$junk1,$junk2,$junk3,$domain,$home) = split(':', $_);
next unless ($home =~ /vs/);
%info = (
domain => $domain,
home => "$home/",
tmp => "$home/tmp",
htdocs => "$home/www/htdocs",
cgibin => "$home/www/cgi\-bin",
);
print "\n" . $info{domain} . "\n";
print "+"x40,"\n\n";
findHidden($info{tmp});
}
}
sub findHidden {
defineHash;
print "Searching " . $_[0] . "\n";
print "-"x30,"\n\n";
#hidden = `find $_[0] -iname ".*"`;
for(#hidden) {
foreach $key (keys % hExcludes) {
if ($_ =~ /$key/){ #
last; # This portion is
}else{ # Only an issue when using more
print "$_"; # than 2 keys in my hash.
last;
}
}
}
}
sub defineHash {
%hExcludes = ();
%hExcludes = map { $_, 1 } (
'spamd','.nfs' # If I add another key here, it breaks.
);
%knownExploits =
( );
print Dumper \%hExcludes;
}
GetInfo;
This Works, and prints out something like this:
/somedir/tmp/.testthis
/somedir/tmp/.sdkfbsdif
/somedir/tmp/.asdasdasd
I understand why It is not working, because it is looping through the keys where some are false and some are positive, I just cannot think of how to make it do what I want, please assume I might want to you 10 keys. I know there are ways to do it without using hash key values for my excludes but it is what I want to accomplish.
I have also tried shift #hidden as below to no avail.
foreach $key (keys % hExcludes) {
if ($_ =~ /$key/){ #
last; #
shift #hidden;# This portion is
}else{ # Only an issue when using more
print "$_"; # than 2 keys in my hash.
last;
}
Also, keep in mind that things only stop working when I add the third...or more keys.
%hExcludes = map { $_, 1 } (
'spamd','.nfs','key3' # If I add another key here, it breaks
);
What you need is this:
#hidden = `find $_[0] -iname ".*"`;
for(#hidden) {
undef $isExcluded;
foreach $key (keys % hExcludes) {
if ($_ =~ /$key/){
$isExcluded=1;
last;
}
}
if( ! $isExcluded ) {
print "$_";
}
}
Whatever happened in your scan through the keys of hExcludes, the code encountered a last on the first key and did not process any more. You need to set a flag and continue iterating until either there are no more keys to set, or a match is found. Then you can print out the values that were not matched.

Swap key and array value pair

I have a text file layed out like this:
1 a, b, c
2 c, b, c
2.5 a, c
I would like to reverse the keys (the number) and values (CSV) (they are separated by a tab character) to produce this:
a 1, 2.5
b 1, 2
c 1, 2, 2.5
(Notice how 2 isn't duplicated for c.)
I do not need this exact output. The numbers in the input are ordered, while the values are not. The output's keys must be ordered, as well as the values.
How can I do this? I have access to standard shell utilities (awk, sed, grep...) and GCC. I can probably grab a compiler/interpreter for other languages if needed.
If you have python (if you're on linux you probably already have) i'd use a short python script to do this. Note that we use sets to filter out "double" items.
Edited to be closer to requester's requirements:
import csv
from decimal import *
getcontext().prec = 7
csv_reader = csv.reader(open('test.csv'), delimiter='\t')
maindict = {}
for row in csv_reader:
value = row[0]
for key in row[1:]:
try:
maindict[key].add(Decimal(value))
except KeyError:
maindict[key] = set()
maindict[key].add(Decimal(value))
csv_writer = csv.writer(open('out.csv', 'w'), delimiter='\t')
sorted_keys = [x[1] for x in sorted([(x.lower(), x) for x in maindict.keys()])]
for key in sorted_keys:
csv_writer.writerow([key] + sorted(maindict[key]))
I would try perl if that's available to you. Loop through the input a row at a time. Split the line on the tab then the right hand part on the commas. Shove the values into an associative array with letters as the keys and the value being another associative array. The second associative array will be playing the part of a set so as to eliminate duplicates.
Once you read the input file, sort based on the keys of the associative array, loop through and spit out the results.
here's a small utility in php:
// load and parse the input file
$data = file("path/to/file/");
foreach ($data as $line) {
list($num, $values) = explode("\t", $line);
$newData["$num"] = explode(", ", trim($values));
}
unset($data);
// reverse the index/value association
foreach ($newData as $index => $values) {
asort($values);
foreach($values as $value) {
if (!isset($data[$value]))
$data[$value] = array();
if (!in_array($index, $data[$value]))
array_push($data[$value], $index);
}
}
// printout the result
foreach ($data as $index => $values) {
echo "$index\t" . implode(", ", $values) . "\n";
}
not really optimized or good looking, but it works...
# use Modern::Perl;
use strict;
use warnings;
use feature qw'say';
our %data;
while(<>){
chomp;
my($number,$csv) = split /\t/;
my #csv = split m"\s*,\s*", $csv;
push #{$data{$_}}, $number for #csv;
}
for my $number (sort keys %data){
my #unique = sort keys %{{ map { ($_,undef) } #{$data{$number}} }};
say $number, "\t", join ', ', #unique;
}
Here is an example using CPAN's Text::CSV module rather than manual parsing of CSV fields:
use strict;
use warnings;
use Text::CSV;
my %hash;
my $csv = Text::CSV->new({ allow_whitespace => 1 });
open my $file, "<", "file/to/read.txt";
while(<$file>) {
my ($first, $rest) = split /\t/, $_, 2;
my #values;
if($csv->parse($rest)) {
#values = $csv->fields()
} else {
warn "Error: invalid CSV: $rest";
next;
}
foreach(#values) {
push #{ $hash{$_} }, $first;
}
}
# this can be shortened, but I don't remember whether sort()
# defaults to <=> or cmp, so I was explicit
foreach(sort { $a cmp $b } keys %hash) {
print "$_\t", join(",", sort { $a <=> $b } #{ $hash{$_} }), "\n";
}
Note that it will print to standard output. I recommend just redirecting standard output, and if you expand this program at all, make sure to use warn() to print any errors, rather than just print()ing them. Also, it won't check for duplicate entries, but I don't want to make my code look like Brad Gilbert's, which looks a bit wack even to a Perlite.
Here's an awk(1) and sort(1) answer:
Your data is basically a many-to-many data set so the first step is to normalise the data with one key and value per line. We'll also swap the keys and values to indicate the new primary field, but this isn't strictly necessary as the parts lower down do not depend on order. We use a tab or [spaces],[spaces] as the field separator so we split on the tab between the key and values, and between the values. This will leave spaces embedded in the values, but trim them from before and after:
awk -F '\t| *, *' '{ for (i=2; i<=NF; ++i) { print $i"\t"$1 } }'
Then we want to apply your sort order and eliminate duplicates. We use a bash feature to specify a tab char as the separator (-t $'\t'). If you are using Bourne/POSIX shell, you will need to use '[tab]', where [tab] is a literal tab:
sort -t $'\t' -u -k 1f,1 -k 2n
Then, put it back in the form you want:
awk -F '\t' '{
if (key != $1) {
if (key) printf "\n";
key=$1;
printf "%s\t%s", $1, $2
} else {
printf ", %s", $2
}
}
END {printf "\n"}'
Pipe them altogether and you should get your desired output. I tested with the GNU tools.

Resources