I am currently parsing a comma separated string of 2-tuples into a hash of scalars. For example, given the input:
"ip=192.168.100.1,port=80,file=howdy.php",
I end up with a hash that looks like:
%hash =
{
ip => 192.168.100.1,
port => 80,
file => howdy.php
}
Code works fine and looks something like this:
my $paramList = $1;
my @paramTuples = split(/,/, $paramList);
my %hash;
foreach my $paramTuple (@paramTuples) {
my($key, $val) = split(/=/, $paramTuple, 2);
$hash{$key} = $val;
}
I'd like to expand the functionality from just taking scalars to also take arrays and hashes. So, another example input could be:
"ips=(192.168.100.1,192.168.100.2),port=80,file=howdy.php,hashthing={key1 => val1, key2 => val2}",
I end up with a hash that looks like:
%hash =
{
ips => (192.168.100.1, 192.168.100.2), # <--- this is an array
port => 80,
file => howdy.php,
hashthing => { key1 => val1, key2 => val2 } # <--- this is a hash
}
I know I can parse the input string character by character. For each tuple I would do the following: If the first character is a ( then parse an array. Else, if the first character is a { then parse a hash. Else parse a scalar.
A co-worker of mine indicated he thought you could turn a string that looked like "(red,yellow,blue)" into an array or "{c1 => red, c2 => yellow, c3 => blue}" into a hash with some kind of cast function. If I went this route, I could use a different delimiter instead of a comma to separate my 2-tuples like a |.
Is this possible in perl?
I think the "cast" function you're referring to might be eval.
Using eval
use strict;
use warnings;
use Data::Dumper;
my $string = "{ a => 1, b => 2, c => 3}";
my $thing = eval $string;
print "thing is a ", ref($thing),"\n";
print Dumper $thing;
Will print:
thing is a HASH
$VAR1 = {
'a' => 1,
'b' => 2,
'c' => 3
};
Or for arrays:
my $another_string = "[1, 2, 3 ]";
my $another_thing = eval $another_string;
print "another_thing is ", ref ( $another_thing ), "\n";
print Dumper $another_thing;
another_thing is ARRAY
$VAR1 = [
1,
2,
3
];
Although note that eval requires you to use brackets suitable for the appropriate data types - {} for anon hashes, and [] for anon arrays. String values must be quoted too; in particular a bare dotted IP address would be read by Perl as a v-string rather than as text. Bear in mind also that eval runs whatever code the string contains, so only use it on input you trust. So to take your example above:
my %hash4;
my $ip_string = "ips=['192.168.100.1','192.168.100.2']";
my ( $key, $value ) = split ( /=/, $ip_string );
$hash4{$key} = eval $value;
my $hashthing_string = "{ key1 => 'val1', key2 => 'val2' }";
$hash4{'hashthing'} = eval $hashthing_string;
print Dumper \%hash4;
Gives:
$VAR1 = {
'hashthing' => {
'key2' => 'val2',
'key1' => 'val1'
},
'ips' => [
'192.168.100.1',
'192.168.100.2'
]
};
Using map to make an array into a hash
If you want to turn an array into a hash, the map function is for that.
my @array = ( "red", "yellow", "blue" );
my %hash = map { $_ => 1 } @array;
print Dumper \%hash;
Using slices of hashes
You can also use a slice if you have known values and known keys:
my @keys = ( "c1", "c2", "c3" );
my %hash2;
@hash2{@keys} = @array;
print Dumper \%hash2;
JSON / XML
Or if you have control over the export mechanism, you may find exporting as JSON or XML format would be a good choice, as they're well defined standards for 'data as text'. (You could perhaps use Perl's Storable too, if you're just moving data between Perl processes).
Again, to take the %hash4 above (with the IPs quoted as strings):
use JSON;
print encode_json(\%hash4);
Gives us:
{"hashthing":{"key2":"val2","key1":"val1"},"ips":["192.168.100.1","192.168.100.2"]}
Which you can also pretty-print:
use JSON;
print to_json(\%hash4, { pretty => 1} );
To get:
{
"hashthing" : {
"key2" : "val2",
"key1" : "val1"
},
"ips" : [
"192.168.100.1",
"192.168.100.2"
]
}
This can be read back in with a simple:
my $data_structure = decode_json ( $input_text );
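As a rough sketch of what reading the data back looks like, the snippet below decodes a JSON string and reaches into the nested structure through references. It uses JSON::PP (bundled with Perl since 5.14, and exporting the same encode_json/decode_json functions) instead of the JSON module so that it runs without installing anything; the sample text in $input_text is made up for illustration.

```perl
use strict;
use warnings;
use JSON::PP;    # core module; swapped in here for the JSON module used above

# Hypothetical input in the same shape as the encoded %hash4.
my $input_text =
    '{"ips":["192.168.100.1","192.168.100.2"],"port":80,"hashthing":{"key1":"val1"}}';

my $data_structure = decode_json($input_text);

# Nested arrays and hashes come back as references.
my @ips  = @{ $data_structure->{ips} };
my $val1 = $data_structure->{hashthing}{key1};

print "first ip: $ips[0]\n";
print "port: $data_structure->{port}\n";
print "key1: $val1\n";
```

Unlike eval, decoding JSON never executes the input as code, which is a reasonable argument for this route when the data comes from outside.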
Style point
As a point of style - can I suggest that the way you've formatted your data structures isn't ideal. If you 'print' them with Dumper then that's a common format that most people will recognise. So your 'first hash' looks like:
Declared as (note the my prefix, the () for the declaration, and the quotes required under strict):
my %hash3 = (
"ip" => "192.168.100.1",
"port" => 80,
"file" => "howdy.php"
);
Dumped as (brackets of {} because it's an anonymous hash, but still quoting strings):
$VAR1 = {
'file' => 'howdy.php',
'ip' => '192.168.100.1',
'port' => 80
};
That way you'll have a bit more joy with people being able to reconstruct and interpret your code.
Note too - that the dumper style format is also suitable (in specific limited cases) for re-reading via eval.
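One of those limited cases can be sketched as follows. This assumes the dump was produced by your own program, and it sets $Data::Dumper::Terse so the output is a bare "{ ... }" expression rather than an assignment to $VAR1 (which would not compile under strict inside the eval). The %original hash is illustrative only.

```perl
use strict;
use warnings;
use Data::Dumper;

$Data::Dumper::Terse = 1;    # emit "{ ... }" instead of "$VAR1 = { ... }"

my %original = ( ip => '192.168.100.1', port => 80, file => 'howdy.php' );
my $text     = Dumper( \%original );

# eval runs arbitrary Perl, so only feed it text you produced yourself.
my $copy = eval $text;
die "could not re-read dump: $@" if $@;

print "$_ => $copy->{$_}\n" for sort keys %$copy;
```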
Try this but compound values will have to be parsed separately.
my $qr_key_1 = qr{
( # begin capture
[^=]+ # equal sign is separator. NB: spaces captured too.
) # end capture
}msx;
my $qr_value_simple_1 = qr{
( # begin capture
[^,]+ # comma is separator. NB: spaces captured too.
) # end capture
}msx;
my $qr_value_parenthesis_1 = qr{
\( # starts with parenthesis
( # begin capture
[^)]+ # end with parenthesis NB: spaces captured too.
) # end capture
\) # end with parenthesis
}msx;
my $qr_value_brace_1 = qr{
\{ # starts with brace
( # begin capture
[^\}]+ # end with brace NB: spaces captured too.
) # end capture
\} # end with brace
}msx;
my $qr_value_3 = qr{
(?: # group alternative
$qr_value_parenthesis_1
| # or other value
$qr_value_brace_1
| # or other value
$qr_value_simple_1
) # end group
}msx;
my $qr_end = qr{
(?: # begin group
\, # ends in comma
| # or
\z # end of string
) # end group
}msx;
my $qr_all_4 = qr{
$qr_key_1 # capture a key
\= # separates key from value(s)
$qr_value_3 # capture a value
$qr_end # end of key-value pair
}msx;
while( my $line = <DATA> ){
print "\n\n$line"; # for demonstration; remove in real script
chomp $line;
while( $line =~ m{ \G $qr_all_4 }cgmsx ){
my $key = $1;
my $value = $2 // $3 // $4; # defined-or (Perl 5.10+), so a value of "0" is kept
print "$key = $value\n"; # for demonstration; remove in real script
}
}
__DATA__
ip=192.168.100.1,port=80,file=howdy.php
ips=(192.168.100.1,192.168.100.2),port=80,file=howdy.php,hashthing={key1 => val1, key2 => val2}
Addendum:
The reason why it is so difficult to expand the parse is, in one word, context. The first line of data, ip=192.168.100.1,port=80,file=howdy.php, is context free. That is, none of the symbols in it change their meaning. A context-free data format can be parsed with regular expressions alone.
Rule #1: If the symbols denoting the data structure never change, it is a context-free format and regular expressions can parse it.
The second line, ips=(192.168.100.1,192.168.100.2),port=80,file=howdy.php,hashthing={key1 => val1, key2 => val2} is a different issue. The meaning of the comma and equal sign changes.
Now, you're thinking the comma doesn't change; it still separates things, doesn't it? But it changes what it separates. That is why the second line is more difficult to parse. The second line has three contexts, in a tree:
main context
+--- list context
+--- hash context
The tokenizer must switch parsing sets as the data switches context. This requires a state machine.
Rule #2: If the contexts of the data format form a tree, then it requires a state machine and different parsers for each context. The state machine determines which parser is in use. Since every context except the root has only one parent, the state machine can switch back to the parent at the end of its current context.
And this is the last rule, for completeness' sake. It is not used in this problem.
Rule #3: If the contexts form a DAG (directed acyclic graph) or a recursive (aka cyclic) graph, then the state machine requires a stack so it will know which context to switch back to when it reaches the end of the current context.
Now, you may have noticed that there is no state machine in the above code. It's there but it's hidden in the regular expressions. But hiding it has a cost: the list and hash contexts are not parsed. Only their strings are found. They have to be parsed separately.
Explanation:
The above code uses the qr// operator to create the parsing regular expression. The qr// operator compiles a regular expression and returns a reference to it. This reference can be used in a match, substitute, or another qr// expression. Think of each qr// expression as a subroutine. Just like normal subroutines, qr// expressions can be used in other qr// expressions, building up complex regular expressions from simpler ones.
The first expression, $qr_key_1, captures the key name in the main context. Since the equal sign separates the key from the value, it captures all non-equal-sign characters. The "_1" on the end of the variable name is what I use to remind myself that one capture group is present.
The options on the end of the expression, /m, /s, and /x, are recommended in Perl Best Practices but only the /x option has an effect. It allows spaces and comments in the regular expression.
The next expression, $qr_value_simple_1, captures simple values for the key.
The next one, $qr_value_parenthesis_1, handles the list context. This is possible only because a closing parenthesis has only one meaning: end of list context. But it also has a price: the list is not parsed; only its string is found.
And again for $qr_value_brace_1: the closing brace has only one meaning. And the hash is also not parsed.
The $qr_value_3 expression combines the value REs into one. The $qr_value_simple_1 must be last but the others can be in any order.
The $qr_end parses the end of a field in the main context. There is no number at its end because it does not capture anything.
And finally, $qr_all_4 puts them all together to create the RE for data.
The RE used in the inner loop, m{ \G $qr_all_4 }cgmsx, parses out each field in the main context. The \G assertion means: if the string has been changed since the last match (or no match has been attempted yet), then start the match at the beginning of the string; otherwise, start where the last match finished. This is used in conjunction with the /c and /g options to parse each field out from the $line, one at a time, for processing inside the loop.
And that is briefly what is happening inside the code. ☺
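The list and hash strings still have to be parsed separately. A minimal sketch of that second step might look like the following; it is not part of the code above. The helper name parse_value and the condensed $pair expression are made up for illustration, and the sketch assumes values contain no nested parentheses or braces and that every hash entry has the form key => value.

```perl
use strict;
use warnings;

# Hypothetical helper: turn one captured value into a scalar,
# an array reference, or a hash reference.
sub parse_value {
    my ($raw) = @_;
    if ( $raw =~ m{ \A \( (.*) \) \z }msx ) {             # list context
        my $body = $1;
        return [ split m{ \s* , \s* }msx, $body ];
    }
    if ( $raw =~ m{ \A \{ \s* (.*?) \s* \} \z }msx ) {    # hash context
        my $body  = $1;
        my %inner = map { split m{ \s* => \s* }msx, $_, 2 }
                    split m{ \s* , \s* }msx, $body;
        return \%inner;
    }
    return $raw;                                          # simple value
}

# Condensed main-context expression: one key-value pair per match.
my $pair = qr{
    ( [^=,]+ ) =          # key, then the separator
    (                     # value: list, hash, or simple
        \( [^)]* \)
      | \{ [^\}]* \}
      | [^,]*
    )
    (?: , | \z )          # end of the key-value pair
}msx;

my $line = 'ips=(192.168.100.1,192.168.100.2),port=80,file=howdy.php,'
         . 'hashthing={key1 => val1, key2 => val2}';

my %hash;
while ( $line =~ m{ \G $pair }gcmsx ) {
    my ( $key, $raw ) = ( $1, $2 );    # copy before further matching
    $hash{$key} = parse_value($raw);
}

print join( ', ', @{ $hash{ips} } ), "\n";
print "$hash{hashthing}{key2}\n";
```

The main loop plays the role of the root context, and parse_value switches to the list or hash parser based on the first character, which is the tree-shaped arrangement described in Rule #2.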
Related
I have found a couple of ways to copy the elements of a list to the keys of a hash, but could somebody please explain how this works?
#!/usr/bin/perl
use v5.34.0;
my @arry = qw( ray bill lois shirly missy hank );
my %hash;
$hash{$_}++ for @arry; # What is happening here?
foreach (keys %hash) {
say "$_ => " . $hash{$_};
}
The output is what I expected. I don't know how the assignment is being made.
hank => 1
shirly => 1
missy => 1
bill => 1
lois => 1
ray => 1
$hash{$_}++ for @array;
Can also be written
for (@array) {
$hash{$_}++;
}
Or more explicitly
for my $key (@array) {
$hash{$key}++;
}
$_ is "the default input and pattern-searching space"-variable. Often in Perl functions, you can leave out naming an explicit variable to use, and it will default to using $_. for is an example of that. You can also write an explicit variable name, that might feel more informative for your code:
for my $word (@words)
Or idiomatically:
for my $key (keys %hash) # using $key variable name for hash keys
You should also be aware that for and foreach are exactly identical in Perl. They are synonyms for the same keyword. Hence, I always use for because it is shorter.
The second part of the code is the assignment, using the auto-increment operator ++
It is appended to a variable on the LHS and increments its value by 1. E.g.
$_++ means $_ = $_ + 1
$hash{$_}++ means $hash{$_} = $hash{$_} + 1
...etc
It also has a certain Perl magic included, which you can read more about in the documentation. In this case, it means that it can increment even undefined variables without issuing a warning about it. This is ideal when it comes to initializing hash keys, which do not exist beforehand.
Your code will initialize a hash key for each word in your @arry list, and also count the occurrences of each word, which happens to be 1 in this case. This is relevant to point out because hash keys are unique: your array may be bigger than the list of keys in the hash, since duplicate words share a single key.
my @words = qw(foo bar bar baaz);
my %hash1;
for my $key (@words) {
$hash1{$key} = 1; # initialize each word
}
# %hash1 = ( foo => 1, bar => 1, baaz => 1 );
# note -^^
my %hash2; # new hash
for my $key (@words) {
$hash2{$key}++; # use auto-increment: words are counted
}
# %hash2 = ( foo => 1, bar => 2, baaz => 1);
# note -^^
Here is another one
my %hash = map { $_ => 1 } @ary;
Explanation: map takes one element of the input array at a time and for each prepares a list, here of two -- the element itself ($_, also quoted because of =>) and a 1. Such a list of pairs then populates a hash, as a list of an even length can be assigned to a hash, whereby each two successive elements form a key-value pair.
Note: This does not account for possibly multiple occurrences of the same elements in the array but only builds an existence-check structure (whether an element is in the array or not).
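To make the difference concrete, here is a small sketch that builds both structures from the same list. The variable names %seen and %count and the sample words are chosen for illustration.

```perl
use strict;
use warnings;

my @ary = qw( foo bar bar baaz );

my %seen = map { $_ => 1 } @ary;    # existence check only
my %count;
$count{$_}++ for @ary;              # occurrences per element

print "bar is present\n" if $seen{bar};
print "bar count: $count{bar}\n";
print "qux is absent\n" unless exists $seen{qux};
```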
$hash{$_}++ for @arry; # What is happening here?
It is iterating over the array, and for each element, it's assigning it as a key to the hash, and incrementing the value of that key by one. You could also write it like this:
my %hash;
my @array = (1, 2, 2, 3);
for my $element (@array) {
$hash{$element}++;
}
The result would be:
$VAR1 = {
'2' => 2,
'1' => 1,
'3' => 1
};
$hash{$_}++ for @arry; # What is happening here?
Read perlsyn, specifically simple statements and statement modifiers:
Simple Statements
The only kind of simple statement is an expression evaluated for its side-effects. Every simple statement must be terminated with a semicolon, unless it is the final statement in a block, in which case the semicolon is optional. But put the semicolon in anyway if the block takes up more than one line, because you may eventually add another line. Note that there are operators like eval {}, sub {}, and do {} that look like compound statements, but aren't--they're just TERMs in an expression--and thus need an explicit termination when used as the last item in a statement.
Statement Modifiers
Any simple statement may optionally be followed by a SINGLE modifier, just before the terminating semicolon (or block ending). The possible modifiers are:
if EXPR
unless EXPR
while EXPR
until EXPR
for LIST
foreach LIST
when EXPR
[...]
The for(each) modifier is an iterator: it executes the statement once for each item in the LIST (with $_ aliased to each item in turn). There is no syntax to specify a C-style for loop or a lexically scoped iteration variable in this form.
print "Hello $_!\n" for qw(world Dolly nurse);
I have some data that should be able to be easily split into a hash.
The following code is intended to split the string into its corresponding key/value pairs and store the output in a hash.
Code:
use Data::Dumper;
# create a test string
my $string = "thing1:data1thing2:data2thing3:data3";
# Doesn't split properly into a hash
my %hash = split m{(thing.):}, $string;
print Dumper(\%hash);
However upon inspecting the output it is clear that this code does not work as intended.
Output:
$VAR1 = {
'data3' => undef,
'' => 'thing1',
'data2' => 'thing3',
'data1' => 'thing2'
};
To further investigate the problem I split the output into an array instead and printed the results.
Code:
# There is an extra blank element at the start of the array
my @data = split m{(thing.):}, $string;
for my $line (@data) {
print "LINE: $line\n";
}
Output:
LINE:
LINE: thing1
LINE: data1
LINE: thing2
LINE: data2
LINE: thing3
LINE: data3
As you can see the problem is that split is returning an extra empty element at the start of the array.
Is there any way that I can throw away the first element from the split output and store it in a hash in one line?
I know I can store the output in an array and then just shift off the first value and store the array in a hash... but I'm just curious whether or not this can be done in one step.
my (undef, %hash) = split m{(thing.):}, $string; will throw away the first value.
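A short runnable sketch of that one-liner, using the test string from the question:

```perl
use strict;
use warnings;

my $string = "thing1:data1thing2:data2thing3:data3";

# split returns ('', 'thing1', 'data1', ...); the undef placeholder
# absorbs the leading empty string and the rest fills the hash.
my ( undef, %hash ) = split m{(thing.):}, $string;

print "$_ => $hash{$_}\n" for sort keys %hash;
```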
I'd alternatively suggest - use regex not split:
#!/usr/bin/env perl
use strict;
use warnings;
use Data::Dumper;
my $string = "thing1:data1thing2:data2thing3:data3";
my %results = $string =~ m/(thing\d+):([A-Z]+\d+)/ig;
print Dumper \%results;
Of course, this does make the assumption that you're matching 'word+digit' groups, as without that "numeric" separator it won't work as well.
I'm aiming to primarily illustrate the technique - grab 'paired' values out of a string, because then they assign straight to a hash.
You might have to be a bit more complicated with the regex, for example nongreedy quantifiers:
my %results = $string =~ m/(thing.):(\w+?)(?=thing|$)/ig;
This may devalue it in terms of clarity.
I'm not getting any output. Can anyone see where the issue lies:
the matching or the calling?
(The two subarrays in the multidimensional array have the same length.)
# Multidimensional array,
# Idarray = Fasta ID, Seqarray = "ATTGTTGGT" sequences
@ordarray = (\@idarray, \@seqarray);
# This calling works
print $ordarray[0][0] , "\n";
print $ordarray[1][0] , "\n", "\n";
# Ordarray output = "TTGTGGCACATAATTTGTTTAATCCAGAT....."
The user inputs a search string; the loop iterates over the sequence dimension
and counts the matches. It prints the number of matches and the corresponding ID from the ID dimension.
# The user input-searchstring
$sestri = <>;
for($r=0;$r<@idarray;$r++) {
if ($sestri =~ $ordarray[1][$r] ){
print $ordarray[0][$r] , "\n";
$counts = () = $ordarray[0][$r] =~ /$sestri/g;
print "number of counts: ", $counts ;
}
}
I think the problem lies with this:
$sestri = <>;
That may well not be doing what you intended - your comment says "user specified search string" but that's not what that operator does.
What it does is open the file you specified on the command line (or read STDIN if no arguments were given) and 'return' the first line, trailing newline included.
I would suggest that if you want to grab a search string from the command line, you want to do it via @ARGV.
E.g.
my ( $sestri ) = @ARGV; # will give first word.
However, please please please switch on use strict and use warnings. You should always do this prior to posting on a forum for assistance.
I would also question quite why you need a two dimensional array with two elements in it though. It seems unnecessary.
Why not instead make a hash, and key your "fasta ids" to the sequence?
E.g.
my %id_of;
@id_of{@seqarray} = @idarray;
my %seq_of;
@seq_of{@idarray} = @seqarray;
I think this would suit your code a bit better, because then you don't have to worry about the array indices at all.
use strict;
use warnings;
my ($sestri) = @ARGV;
my %id_of;
@id_of{@seqarray} = @idarray;
foreach my $sequence ( keys %id_of ) {
##NB - this is a pattern match, and will be 'true'
## if $sestri is a substring of $sequence
if ( $sequence =~ m/$sestri/ ) {
print $id_of{$sequence}, "\n";
my $count = () = $sequence =~ m/$sestri/g;
print "number of counts: ", $count, "\n";
}
}
I've rewritten it a bit, because I'm not entirely understanding what your code is doing. It looks like it's substring matching in @seqarray but then returning the count of matching elements in @idarray. I don't think that makes sense, but if it does, then amend according to your needs.
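For a version that can be run on its own, here is a sketch with made-up sample data standing in for @idarray and @seqarray. It keys the hash by ID instead of by sequence (an assumption on my part, so that two IDs sharing a sequence do not collide), falls back to a demo search string when no argument is given, and uses \Q...\E so the search string is matched literally.

```perl
use strict;
use warnings;

# Hypothetical sample data standing in for the parsed FASTA file.
my @idarray  = qw( seq1 seq2 seq3 );
my @seqarray = qw( ATTGTTGGT CCCGGG TTGA );

# Search string from the command line, falling back to a demo value.
my $sestri = @ARGV ? $ARGV[0] : 'TTG';

my %seq_of;
@seq_of{@idarray} = @seqarray;

my %count_for;
for my $id ( sort keys %seq_of ) {
    # Count non-overlapping literal occurrences of the search string.
    my $count = () = $seq_of{$id} =~ m/\Q$sestri\E/g;
    next unless $count;
    $count_for{$id} = $count;
    print "$id\nnumber of counts: $count\n";
}
```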
Array @p is a multiline array, e.g. $p[1] is the second line.
This code will explain what I want:
$size=@p; # number of lines in array @p
for($i=0; $i<$size; $i++)
{
@p{$i}= split(/ +/,$p[$i]);
}
I want the result to be like this:
@p0 = $p[0] first line of array @p goes to array @p0;
@p1 = $p[1] second line of array @p goes to array @p1;
...
...
and so on.
But the above code does not work. How can I do it?
It is a bad idea to dynamically generate variable names.
I suggest the best solution here is to convert each line in your @p array to an array of fields.
Let's suppose you have a better name for @p, say @lines. Then you can write
my @lines = map [ split ], <$fh>;
to read in all the lines from the file handle $fh and split them on whitespace. The first field of the first line is then $lines[0][0]. The third field of the first line is $lines[0][2] etc.
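Here is a small self-contained sketch of that idea. It reads from an in-memory filehandle with invented content so that it runs without a real file.

```perl
use strict;
use warnings;

# In-memory filehandle with made-up content, standing in for a real file.
my $text = "alpha beta gamma\none two\n";
open my $fh, '<', \$text or die "cannot open in-memory file: $!";

# Each line becomes an anonymous array of its whitespace-separated fields.
my @lines = map [ split ], <$fh>;
close $fh;

print "first field of first line: $lines[0][0]\n";
print "third field of first line: $lines[0][2]\n";
print "fields on second line: ", scalar @{ $lines[1] }, "\n";
```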
First, the syntax @p{$i} accesses the entry with the key $i in a hash %p, and returns it in list context. I don't think you meant that. use strict; use warnings; to get warned about undeclared variables.
You can declare variables with my, e.g. my @p; or my $size = @p;
Creating variable names on the fly is possible, but a bad practice. The good thing is that we don't need to: Perl has references. A reference to an array allows us to nest arrays, e.g.
my @AoA = (
[1, 2, 3],
["a", "b"],
);
say $AoA[0][1]; # 2
say $AoA[1][0]; # a
We can create an array reference by using brackets, e.g. [ @array ], or via the reference operator \:
my @inner_array = (1 .. 3);
my @other_inner = ("a", "b");
my @AoA = (\@inner_array, \@other_inner);
But careful: the array references still point to the same array as the original names, thus
push @other_inner, "c";
also updates the entry in @AoA:
say $AoA[1][2]; # c
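The two ways of making a reference behave differently in this respect, which the following sketch tries to show: \@array points at the original array, while [ @array ] copies the elements into a new anonymous array. The variable names are illustrative.

```perl
use strict;
use warnings;

my @inner = ( "a", "b" );

my $alias = \@inner;       # reference to the same array
my $copy  = [ @inner ];    # new anonymous array holding copied elements

push @inner, "c";

print "alias has ", scalar @$alias, " elements\n";    # sees the push
print "copy has ",  scalar @$copy,  " elements\n";    # unaffected by it
```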
Translated to your problem this means that you want:
my @pn;
for (@p) {
push @pn, [ split /[ ]+/ ];
}
There are many other ways to express this, e.g.
my @pn = map [ split /[ ]+/ ], @p;
or
my @pn;
for my $i ( 0 .. $#p ) {
$pn[$i] = [ split /[ ]+/, $p[$i] ];
}
To learn more about references, read
perlreftut,
perldsc, and
perlref.
I have a text file laid out like this:
1 a, b, c
2 c, b, c
2.5 a, c
I would like to reverse the keys (the number) and values (CSV) (they are separated by a tab character) to produce this:
a 1, 2.5
b 1, 2
c 1, 2, 2.5
(Notice how 2 isn't duplicated for c.)
I do not need this exact output. The numbers in the input are ordered, while the values are not. The output's keys must be ordered, as well as the values.
How can I do this? I have access to standard shell utilities (awk, sed, grep...) and GCC. I can probably grab a compiler/interpreter for other languages if needed.
If you have Python (if you're on Linux you probably already do), I'd use a short Python script to do this. Note that we use sets to filter out duplicate items.
Edited to be closer to requester's requirements:
import csv
from decimal import Decimal

maindict = {}
with open('test.csv', newline='') as infile:
    for row in csv.reader(infile, delimiter='\t'):
        if len(row) < 2:
            continue  # skip blank or malformed lines
        value = Decimal(row[0])
        # the second field holds the comma-separated keys, e.g. "a, b, c"
        for key in row[1].split(','):
            # a set per key filters out duplicate numbers
            maindict.setdefault(key.strip(), set()).add(value)

with open('out.csv', 'w', newline='') as outfile:
    csv_writer = csv.writer(outfile, delimiter='\t')
    # sort keys case-insensitively, and each key's numbers numerically
    for key in sorted(maindict, key=str.lower):
        csv_writer.writerow([key] + sorted(maindict[key]))
I would try perl if that's available to you. Loop through the input a row at a time. Split the line on the tab then the right hand part on the commas. Shove the values into an associative array with letters as the keys and the value being another associative array. The second associative array will be playing the part of a set so as to eliminate duplicates.
Once you read the input file, sort based on the keys of the associative array, loop through and spit out the results.
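The steps just described could be sketched roughly like this. It reads from an in-memory filehandle holding the sample data from the question so that it runs as-is; in real use you would open the actual file instead. The names %numbers_for and @report are made up for illustration.

```perl
use strict;
use warnings;

# Sample input from the question: number, tab, comma-separated letters.
my $input = "1\ta, b, c\n2\tc, b, c\n2.5\ta, c\n";
open my $in, '<', \$input or die "cannot open in-memory input: $!";

my %numbers_for;    # letter => { number => 1, ... }; inner hash acts as a set
while ( my $line = <$in> ) {
    chomp $line;
    my ( $number, $letters ) = split /\t/, $line, 2;
    next unless defined $letters;    # skip lines without a tab
    $numbers_for{$_}{$number} = 1 for split /\s*,\s*/, $letters;
}
close $in;

my @report;
for my $letter ( sort keys %numbers_for ) {
    my @sorted = sort { $a <=> $b } keys %{ $numbers_for{$letter} };
    push @report, "$letter\t" . join( ', ', @sorted );
}
print "$_\n" for @report;
```

Because the inner hash is keyed by number, the repeated c on the second input line collapses to a single entry without any explicit duplicate check.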
here's a small utility in php:
// load and parse the input file
$data = file("path/to/file/");
foreach ($data as $line) {
list($num, $values) = explode("\t", $line);
$newData["$num"] = explode(", ", trim($values));
}
unset($data);
// reverse the index/value association
foreach ($newData as $index => $values) {
asort($values);
foreach($values as $value) {
if (!isset($data[$value]))
$data[$value] = array();
if (!in_array($index, $data[$value]))
array_push($data[$value], $index);
}
}
// printout the result
foreach ($data as $index => $values) {
echo "$index\t" . implode(", ", $values) . "\n";
}
not really optimized or good looking, but it works...
# use Modern::Perl;
use strict;
use warnings;
use feature qw'say';
our %data;
while(<>){
chomp;
my($number,$csv) = split /\t/;
my @csv = split m"\s*,\s*", $csv;
push @{$data{$_}}, $number for @csv;
}
for my $number (sort keys %data){
my @unique = sort keys %{{ map { ($_,undef) } @{$data{$number}} }};
say $number, "\t", join ', ', @unique;
}
Here is an example using CPAN's Text::CSV module rather than manual parsing of CSV fields:
use strict;
use warnings;
use Text::CSV;
my %hash;
my $csv = Text::CSV->new({ allow_whitespace => 1 });
open my $file, "<", "file/to/read.txt" or die "Cannot open file: $!";
while(<$file>) {
my ($first, $rest) = split /\t/, $_, 2;
my @values;
if($csv->parse($rest)) {
@values = $csv->fields();
} else {
warn "Error: invalid CSV: $rest";
next;
}
foreach(@values) {
push @{ $hash{$_} }, $first;
}
}
# this can be shortened, but I don't remember whether sort()
# defaults to <=> or cmp, so I was explicit
foreach(sort { $a cmp $b } keys %hash) {
print "$_\t", join(",", sort { $a <=> $b } @{ $hash{$_} }), "\n";
}
Note that it will print to standard output. I recommend just redirecting standard output, and if you expand this program at all, make sure to use warn() to print any errors, rather than just print()ing them. Also, it won't check for duplicate entries, but I don't want to make my code look like Brad Gilbert's, which looks a bit wack even to a Perlite.
Here's an awk(1) and sort(1) answer:
Your data is basically a many-to-many data set so the first step is to normalise the data with one key and value per line. We'll also swap the keys and values to indicate the new primary field, but this isn't strictly necessary as the parts lower down do not depend on order. We use a tab or [spaces],[spaces] as the field separator so we split on the tab between the key and values, and between the values. This will leave spaces embedded in the values, but trim them from before and after:
awk -F '\t| *, *' '{ for (i=2; i<=NF; ++i) { print $i"\t"$1 } }'
Then we want to apply your sort order and eliminate duplicates. We use a bash feature to specify a tab char as the separator (-t $'\t'). If you are using Bourne/POSIX shell, you will need to use '[tab]', where [tab] is a literal tab:
sort -t $'\t' -u -k 1f,1 -k 2n
Then, put it back in the form you want:
awk -F '\t' '{
if (key != $1) {
if (key) printf "\n";
key=$1;
printf "%s\t%s", $1, $2
} else {
printf ", %s", $2
}
}
END {printf "\n"}'
Pipe them all together and you should get your desired output. I tested with the GNU tools.