Store and read hash and array in files in Perl

I'm a noob. I need some basic knowledge about how data should be saved and read in Perl, say to save a hash and an array. What file format (extension) should be used? .txt? So far I can only save everything as a string with print FILE %hash and read it back as a string with <FILE>. What should I do if my function needs its hash and array inputs to come from a file? How do I turn them back into a hash and an array?

You're looking for data serialisation. Popular, robust choices are Sereal, JSON::XS and YAML::XS. Lesser-known formats are ASN.1, Avro, BERT, BSON, CBOR, JSYNC, MessagePack, Protocol Buffers and Thrift.
Other often-mentioned choices are Storable and Data::Dumper (or similar)/eval, but I cannot recommend them: Storable's format is Perl-version dependent, and eval is unsafe because it executes arbitrary code. As of 2012, the parsing counterpart Data::Undump has not progressed very far yet. I also cannot recommend XML, because it does not map Perl data types well and there exist multiple competing/incompatible schemes for translating between XML and data.
Code examples (tested):
use JSON::XS qw(encode_json decode_json);
use File::Slurp qw(read_file write_file);
my %hash;
{
    my $json = encode_json \%hash;
    write_file('dump.json', { binmode => ':raw' }, $json);
}
{
    my $json = read_file('dump.json', { binmode => ':raw' });
    %hash = %{ decode_json $json };
}
use YAML::XS qw(Load Dump);
use File::Slurp qw(read_file write_file);
my %hash;
{
    my $yaml = Dump \%hash;
    write_file('dump.yml', { binmode => ':raw' }, $yaml);
}
{
    my $yaml = read_file('dump.yml', { binmode => ':raw' });
    %hash = %{ Load $yaml };
}
The next step up from here is object persistence.
Also read: Serializers for Perl: when to use what

Perlmonks has two good discussions on serialization.
How to save and reload my hash
How can I visualize my complex data structure?

This really depends upon how you'd like to store your data in your file. I will try writing some basic Perl code to enable you to read a file into a hash and/or write a hash back into a file.
# Load a file into a hash.
# My text file has the following format:
#   field1=value1
#   field2=value2
# Open the sample text file (the name is just an example) in read-only mode.
open(FILE1, '<', 'sample.txt') or die "Cannot open sample.txt: $!";
my %hash;
while (<FILE1>)
{
    chomp;
    my ($key, $val) = split /=/;
    # Repeated keys are collected as a comma-separated list.
    $hash{$key} .= exists $hash{$key} ? ",$val" : $val;
}
close(FILE1);
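To write the hash back out in the same field=value format, here is a minimal sketch along the same lines (the output filename is just an example):
# Write the hash back to a file, one field=value pair per line.
open(FILE2, '>', 'output.txt') or die "Cannot open output.txt: $!";
foreach my $key (sort keys %hash) {
    print FILE2 "$key=$hash{$key}\n";
}
close(FILE2);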

If you are new to this, I would just suggest making a string from the array/hash with join(), writing it with print, and then reading it back and using split() to make the array/hash again. That is the simpler way, like a Perl textbook teaching example.
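For example, a minimal sketch of that textbook approach (the separator and filename are arbitrary choices):
my @array = ('red', 'green', 'blue');
# Write: join the elements into one line with a separator that does not occur in the data.
open(my $out, '>', 'colors.txt') or die "Cannot write colors.txt: $!";
print $out join(',', @array), "\n";
close($out);
# Read: take the line back and split it on the same separator.
open(my $in, '<', 'colors.txt') or die "Cannot read colors.txt: $!";
chomp(my $line = <$in>);
close($in);
my @again = split /,/, $line;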

Related

Can't use string as an ARRAY ref while strict refs in use

Getting an error when I attempt to dump out part of a multi-dimensional hash array. Perl spits out
Can't use string ("somedata") as an ARRAY ref while "strict refs" in use at ./myscript.pl
I have tried multiple ways to access part of the array I want to see but I always get an error. I've used Dumper to see the entire array and it looks fine.
#!/usr/bin/perl
use strict;
use warnings;
use Data::Dumper qw(Dumper);
use String::Util qw(trim);
my %arrHosts;
open(my $filehdl, "<textfile.txt") || die "Cannot open or find file textfile.txt: $!\n";
while( my $strInputline = <$filehdl> ) {
    chomp($strInputline);
    my ($strHostname,$strOS,$strVer,$strEnv) = split(/:/, $strInputline);
    $strOS = lc($strOS);
    $strVer = trim($strVer);
    $strEnv = trim($strEnv);
    $strOS = trim($strOS);
    $arrHosts{$strOS}{$strVer}{$strEnv} = $strHostname;
}
# If you want to see the entire database, remove the # in front of Dumper
print Dumper \%arrHosts;
foreach my $machine (@{$arrHosts{solaris}{10}{DEV}}) {
    print "$machine\n";
}
close($filehdl);
The data is in the form
machine:OS:OS version:Environment
For example
bigserver:solaris:11:PROD
smallerserver:solaris:11:DEV
I want to print out only the servers that are solaris, version 11, in DEV. Using hashes seems the easiest way to store the data but alas, Perl barfs when attempting to print out only a portion of it. Dumper works great but I don't want to see everything. Where did I go wrong??
You have the following:
$arrHosts{$strOS}{$strVer}{$strEnv} = $strHostname;
That means the following contains a string:
$arrHosts{solaris}{10}{DEV}
You are treating it as if it contains a reference to an array. To group the hosts by OS+ver+env, replace
$arrHosts{$strOS}{$strVer}{$strEnv} = $strHostname;
with
push @{ $arrHosts{$strOS}{$strVer}{$strEnv} }, $strHostname;
Iterating over @{ $arrHosts{solaris}{10}{DEV} } will then make sense.
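Put together, a short sketch of the corrected storage and lookup (variable names taken from the question):
# Store: group hosts so each OS/version/environment combination holds a list.
push @{ $arrHosts{$strOS}{$strVer}{$strEnv} }, $strHostname;
# Look up: print every Solaris 10 DEV host, guarding against a missing combination.
if (my $hosts = $arrHosts{solaris}{10}{DEV}) {
    print "$_\n" for @$hosts;
}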
My previous code also had the obvious problem whereby if the combo of OS, Version, and Environment were the same it wrote over previous data. Blunderful. Push is the trick.

Perl data structures - looping an array of hashes inside a hash

I need a data structure to keep metadata about a field in a database, which I'm going to access to write dynamic SQL.
I'm using a hash to store things like the name, maybe data type, etc. And most importantly, an array of hashes containing information about the values I want to query out of the field, and the name I want to alias them with.
When I try to access elements of that array, I get:
Global symbol "%elem" requires explicit package name at test.pl line 18.
It sounds like maybe it's having trouble registering the fact that the loop variable representing the array elements is a hash, not a scalar. If I try:
foreach my %elem
then I get:
Missing $ on loop variable at test.pl line 17 (#1)
So far I can't find the relevant Perl documentation that addresses this.
#!/usr/local/bin/perl
use warnings;
use strict;
use diagnostics;
use POSIX 'strftime';
my %struct = (
    #"field" = "foobar",
    "values" => [
        {value => "Y", name => "FOO"}
        , {value => "N", name => "BAR"}
    ]
);
foreach my $elem (@{$struct->{'values'}}) {
    print $elem->{'value'};
}
I expect the program to print "YN" to the console.
UPDATE: as someone pointed out, I needed to use %hash->{'ref'}-style addressing in the loop, so I added it. Now I get a notification saying that using a hash as a reference is deprecated (?), but it is printing to the console now!
When I tried running your code, I got a different error than you reported:
Global symbol "$struct" requires explicit package name
This is because you've defined a hash %struct, not a hashref $struct, so you don't need to dereference it. Thus, I changed the line
foreach my $elem (@{$struct->{'values'}}) {
to
foreach my $elem (@{$struct{'values'}}) {
(note no -> to dereference) and it ran perfectly, no errors or warnings, and emitted the output
YN
as expected.
%struct is a hash, not a hash reference. Therefore, $struct->{'values'} is not the correct way to access the values key.
for my $elem (@{$struct{values}}) {
print "$elem->{value}\n";
}
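For contrast, the arrow syntax from the question would only be right if the data had been declared as a hash reference, e.g. (a hypothetical variant, not the question's code):
my $struct = {
    "values" => [
        {value => "Y", name => "FOO"},
        {value => "N", name => "BAR"}
    ]
};
# $struct is now a reference, so the dereferencing arrow is needed.
foreach my $elem (@{ $struct->{'values'} }) {
    print $elem->{'value'};
}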

How do I get gene features in FASTA nucleotide format from NCBI using Perl?

I am able to download a FASTA file manually that looks like:
>lcl|CR543861.1_gene_1...
ATGCTTTGGACA...
>lcl|CR543861.1_gene_2...
GTGCGACTAAAA...
by clicking "Send to" and selecting "Gene Features", FASTA Nucleotide is the only option (which is fine because that's all I want) on this page.
With a script like this:
#!/usr/bin/env perl
use strict;
use warnings;
use Bio::DB::EUtilities;
my $factory = Bio::DB::EUtilities->new(-eutil   => 'efetch',
                                       -db      => 'nucleotide',
                                       -id      => 'CR543861',
                                       -rettype => 'fasta');
my $file = 'CR543861.fasta';
$factory->get_Response(-file => $file);
I get a file that looks like:
>gi|49529273|emb|CR543861.1| Acinetobacter sp. ADP1 complete genome
GATATTTTATCCACA...
with the whole genomic sequence lumped together. How do I get information like in the first (manually downloaded) file?
I looked at a couple of other posts:
how to download complete genome sequence in biopython entrez.esearch (this answer seemed relevant)
How can I download the entire GenBank file with just an accession number?
As well as this section from EUtilities Cookbook.
I tried fetching and saving a GenBank file (since the .gb file I get seems to have separate sequences for each gene), but when I go to work with it using Bio::SeqIO, I still get only one large sequence.
With that accession number and return type, you are getting the complete genome sequence. If you want the individual gene sequences, specify that you want the complete GenBank file, then parse out the genes. Here is an example:
#!/usr/bin/env perl
use 5.010;
use strict;
use warnings;
use Bio::SeqIO;
use Bio::DB::EUtilities;
my $factory = Bio::DB::EUtilities->new(-eutil   => 'efetch',
                                       -email   => 'foo@bar.com',
                                       -db      => 'nucleotide',
                                       -id      => 'CR543861',
                                       -rettype => 'gb');
my $file = 'CR543861.gb';
$factory->get_Response(-file => $file);
my @gene_features = grep { $_->primary_tag eq 'gene' }
    Bio::SeqIO->new(-file => $file)->next_seq->get_SeqFeatures;
for my $feat_object (@gene_features) {
    for my $tag ($feat_object->get_all_tags) {
        # open a filehandle here for writing each to a separate file
        say ">", $feat_object->get_tag_values($tag);
        say $feat_object->spliced_seq->seq;
        # close it!
    }
}
This will write each gene to the same place (it just writes to STDOUT; redirect it if you want everything in one file), but I indicated where you could make a small change to write them to separate files. Parsing GenBank can be a bit tricky at times, so it is always helpful to read the docs, and in particular the excellent Feature Annotation HOWTO.
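For example, a sketch of that small change, writing one FASTA file per gene (the filename scheme based on the tag value is my assumption, not part of the original answer):
# Inside the inner loop, instead of printing to STDOUT:
my ($label) = $feat_object->get_tag_values($tag);   # e.g. a locus_tag value
open my $out, '>', "$label.fasta" or die "Cannot write $label.fasta: $!";
print $out ">", $label, "\n";
print $out $feat_object->spliced_seq->seq, "\n";
close $out;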

Creating a list of duplicate filenames with Perl

I've been trying to write a script to pre-process some long lists of files, but I am not confident (nor competent) with Perl yet and am not getting the results I want.
The script below is very much work in progress but I'm stuck on the check for duplicates and would be grateful if anyone could let me know where I am going wrong. The block dealing with duplicates seems to be of the same form as examples I have found but it doesn't seem to work.
#!/usr/bin/perl
use strict;
use warnings;
open my $fh, '<', $ARGV[0] or die "can't open: $!";
foreach my $line (<$fh>) {
    # Trim list to remove directories which do not need to be checked
    next if $line =~ m/Inventory/;
    # MORE TO DO
    next if $line =~ m/Scanned photos/;
    $line =~ s/\n//; # just for a tidy list when testing
    my @split = split(/\/([^\/]+)$/, $line); # separate filename from rest of path
    foreach (@split) {
        push (my @filenames, "$_");
        # print "@filenames\n"; # check content of array
        my %dupes;
        foreach my $item (@filenames) {
            next unless $dupes{$item}++;
            print "$item\n";
        }
    }
}
I am struggling to understand what is wrong with my check for duplicates. I know the array contains duplicates (uncommenting the first print statement gives me a list with lots of duplicates), yet the code as it stands prints nothing.
Not the main purpose of my post, but my final aim is to remove unique filenames from the list and keep filenames which are duplicated in other directories.
I know that none of these files are identical but many are different versions of the same file which is why I'm focussing on filename.
Eg I would want an input of:
~/Pictures/2010/12345678.jpg
~/Pictures/2010/12341234.jpg
~/Desktop/temp/12345678.jpg
to give an output of:
~/Pictures/2010/12345678.jpg
~/Desktop/temp/12345678.jpg
So I suppose ideally it would be good to check for uniqueness of a match based on the regex without splitting if that is possible.
The loop below does nothing, because the hash and the array only contain one value for each loop iteration:
foreach (@split) {
    push (my @filenames, "$_"); # add one element to lexical array
    my %dupes;
    foreach my $item (@filenames) { # loop one time
        next unless $dupes{$item}++; # add one key to lexical hash
        print "$item\n";
    }
} # @filenames and %dupes go out of scope
A lexical variable (declared with my) has a scope that extends to the surrounding block { ... }, in this case your foreach loop. When it goes out of scope, it is reset and all its data is lost.
I don't know why you copy the file names from @split to @filenames; it seems very redundant. The way to dedupe this would be:
my %seen;
my @uniq;
@uniq = grep !$seen{$_}++, @split;
Additional information:
You might also be interested in using File::Basename to get the file name:
use File::Basename;
my $fullpath = "~/Pictures/2010/12345678.jpg";
my $name = basename($fullpath); # 12345678.jpg
Your substitution
$line =~ s/\n//;
Should probably be
chomp($line);
When you read from a file handle, using for (foreach) means you read all the lines and store them in memory. It is preferable most times to instead use while, like this:
while (my $line = <$fh>)
TLP's answer gives lots of good advice. In addition:
Why use both an array and a hash to store the filenames? Simply use the hash as your one storage solution, and you will automatically remove duplicates. i.e:
my %filenames; # outside of the loops
...
foreach (@split) {
    $filenames{$_}++;
}
Now when you want to get the list of unique filenames, just use keys %filenames or, if you want them in alphabetical order, sort keys %filenames. And the value for each hash key is a count of occurrences, so you can find out which ones were duplicated if you care.
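For example, a small sketch of pulling out just the duplicated names afterwards:
# Keys seen more than once are the duplicate filenames.
my @duplicated = grep { $filenames{$_} > 1 } sort keys %filenames;
print "$_\n" for @duplicated;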

How do I serialize an array of array-references in Perl?

There are so many modules for serializing data for Perl, and I don't know which one to choose.
I've the following data that I need to serialize as a string so I can put it in the database:
my @categories = (
    ["Education", "Higher Education", "Colleges"],
    ["Schooling", "Colleges"]
);
How could I turn it into text, and then later when I need it, turn back into an array of array-references?
I vote for JSON (or Data::Serializer as mentioned in another answer, in conjunction with JSON).
The JSON module is plenty fast and efficient (if you install JSON::XS from cpan, it will compile the C version for you, and use JSON will automatically use that).
It works great with Perl data structures, is standardized, and the Javascript syntax is so similar to Perl syntax. There are options you can set with the JSON module to improve human readability (linebreaks, etc.)
I've also used Storable. I don't like it--the interface is weird, and the output is nonsensical, and it is a proprietary format. Data::Dumper is fast and quite readable but is really meant to be one-way (evaling it is slightly hackish), and again, it's Perl only. I've also rolled my own as well. In the end, I concluded JSON is the best, is fast, flexible, and robust.
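For instance, a minimal round-trip sketch with the JSON module (variable names taken from the question):
use JSON;   # uses JSON::XS automatically if it is installed
my @categories = (
    ["Education", "Higher Education", "Colleges"],
    ["Schooling", "Colleges"]
);
# Turn the array of array-references into a JSON string for the database.
my $json_text = encode_json(\@categories);
# Later, turn the string from the database back into the same structure.
my @restored = @{ decode_json($json_text) };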
You can use Data::Serializer:
Examples/Information from OnPerl.net
Data::Serializer Module from CPAN
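A brief sketch of what that might look like (the 'JSON' backend and the serialize/deserialize method names reflect Data::Serializer's documented interface; adjust to taste):
use Data::Serializer;
my @categories = (
    ["Education", "Higher Education", "Colleges"],
    ["Schooling", "Colleges"]
);
# Wrap a concrete backend; Data::Dumper is the default, JSON also works if installed.
my $serializer = Data::Serializer->new(serializer => 'JSON');
my $string   = $serializer->serialize(\@categories);    # store this string in the database
my $restored = $serializer->deserialize($string);       # back to an array of array-references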
You could roll your own, but you have to worry about tricky issues such as escaping quotes and backslashes or the separators you choose.
The program below shows how you can use standard Perl modules Data::Dumper and Storable to serialize and deserialize your data in a way that is suitable for storing in a database.
#! /usr/bin/env perl
use strict;
use warnings;
use Data::Dumper;
use Storable qw/ nfreeze thaw /;
use Test::More tests => 2;
my @categories = (
    ["Education", "Higher Education", "Colleges"],
    ["Schooling", "Colleges"]
);
{
    local $Data::Dumper::Indent = 0;
    local $Data::Dumper::Terse  = 1;
    my $serialized = Dumper \@categories;
    print $serialized, "\n";
    my $restored = eval($serialized) || die "deserialization failed: $@";
    is_deeply $restored, \@categories;
}
{
    my $serialized = unpack "H*", nfreeze \@categories;
    print $serialized, "\n";
    my $restored = thaw pack "H*", $serialized;
    die "deserialization failed: $@" unless defined $restored;
    is_deeply $restored, \@categories;
}
Data::Dumper has the nice property of being human readable, but the severe negative of requiring eval to deserialize. Storable is nice and compact but opaque.
