Perl split and throw away the first element in one line - arrays

I have some data that should be able to be easily split into a hash.
The following code is intended to split the string into its corresponding key/value pairs and store the output in a hash.
Code:
use Data::Dumper;
# create a test string
my $string = "thing1:data1thing2:data2thing3:data3";
# Doesn't split properly into a hash
my %hash = split m{(thing.):}, $string;
print Dumper(\%hash);
However upon inspecting the output it is clear that this code does not work as intended.
Output:
$VAR1 = {
'data3' => undef,
'' => 'thing1',
'data2' => 'thing3',
'data1' => 'thing2'
};
To further investigate the problem I split the output into an array instead and printed the results.
Code:
# There is an extra blank element at the start of the array
my #data = split m{(thing.):}, $string;
for my $line (#data) {
print "LINE: $line\n";
}
Output:
LINE:
LINE: thing1
LINE: data1
LINE: thing2
LINE: data2
LINE: thing3
LINE: data3
As you can see the problem is that split is returning an extra empty element at the start of the array.
Is there any way that I can throw away the first element from the split output and store it in a hash in one line?
I know I can store the output in an array and then just shift off the first value and store the array in a hash... but I'm just curious whether or not this can be done in one step.

my (undef, %hash) = split m{(thing.):}, $string; will throw away the first value.

I'd alternatively suggest - use regex not split:
#!/usr/bin/env perl
use strict;
use warnings;
use Data::Dumper;
my $string = "thing1:data1thing2:data2thing3:data3";
my %results = $string =~ m/(thing\d+):([A-Z]+\d+)/ig;
print Dumper \%results;
Of course, this does make the assumption that you're matching 'word+digit' groups, as without that "numeric" separator it won't work as well.
I'm aiming to primarily illustrate the technique - grab 'paired' values out of a string, because then they assign straight to a hash.
You might have to be a bit more complicated with the regex, for example nongreedy quantifiers:
my %results = $string =~ m/(thing.):(\w+?)(?=thing|$)/ig;
This may devalue it in terms of clarity.

Related

Extract number from array in Perl

I have a array which have certain elements. Each element have two char "BC" followed by a number
e.g - "BC6"
I want to extract the number which is present and store in a different array.
use strict;
use warnings;
use Scalar::Util qw(looks_like_number);
my #band = ("BC1", "BC3");
foreach my $elem(#band)
{
my #chars = split("", $elem);
foreach my $ele (#chars) {
looks_like_number($ele) ? 'push #band_array, $ele' : '';
}
}
After execution #band_array should contain (1,3)
Can someone please tell what I'm doing wrong? I am new to perl and still learning
To do this with a regular expression, you need a very simple pattern. /BC(\d)/ should be enough. The BC is literal. The () are a capture group. They save the match inside into a variable. The first group relates to $1 in Perl. The \d is a character group for digits. That's 0-9 (and others, but that's not relevant here).
In your program, it would look like this.
use strict;
use warnings;
use Data::Dumper;
my #band = ('BC1', 'BC2');
my #numbers;
foreach my $elem (#band) {
if ($elem =~ m/BC(\d)/) {
push #numbers, $1;
}
}
print Dumper #numbers;
This program prints:
$VAR1 = '1';
$VAR2 = '2';
Note that your code had several syntax errors. The main one is that you were using #band = [ ... ], which gives you an array that contains one array reference. But your program assumed there were strings in that array.
Just incase your naming contains characters other than BC this will exctract all numeric values from your list.
use strict;
use warnings;
my #band = ("AB1", "BC2", "CD3");
foreach my $str(#band) {
$str =~ s/[^0-9]//g;
print $str;
}
First, your array is an anonymous array reference; use () for a regular array.
Then, i would use grep to filter out the values into a new array
use strict;
use warnings;
my #band = ("BC1", "BC3");
my #band_array = grep {s/BC(\d+)/$1/} #band;
$"=" , "; # make printing of array nicer
print "#band_array\n"; # print array
grep works by passing each element of an array in the code in { } , just like a sub routine. $_ for each value in the array is passed. If the code returns true then the value of $_ after the passing placed in the new array.
In this case the s/// regex returns true if a substitution is made e.g., the regex must match. Here is link for more info on grep

Generate array and subarray dynamically (perl)

I have a several files that have contain product install information.
I am able to grep for the elements that I want (for example, "version") from the files. And I end up with something like this:
instancefile1:ABC_version=1.2.3
instancefile1:XYZ_version=2.5
instancefile2:ABC_version=3.4.5
instancefile3:XYZ_version=1.1
Where the components are named ABC or XYZ.
What I'd like to do is take this grep output and parse it through perl to build arrays on the file.
The first array would be composed of the instance number (pulled from the filenames) - above, I'd have an array that would have 1,2,3 as elements.
And inside each of those arrays would be the components that that particular instance has.
Full expected arrays and components from above:
array[0] = instancefile1 # can keep this named like the filename,
or assign name. Does not matter
array[0][0] = ABC_version=1.2.3
array[0][1] = XYZ_version=2.5
array[1] = instancefile2
array[1][0] = ABC_version=3.4.5
array[2] = instancefile3
array[2][0] = XYZ_version=1.1
(I know my notation for referencing subarrays is not correct - I'm rusty on my perl.)
How might I do that?
(I've been doing it with just bash arrays and grep and then reiterating through the initial grep output with my first array and doing another grep to fill another array - but this seems like it is going through the data more than one time, instead of building it on the fly.)
The idea is for it to build each array as it sees it. It sees "fileinstance1", it stores the values to the right in that array, as it sees it. Then if it sees "fileinstance2", it creates that array and populates with those values, all in one pass. I believe perl is the best tool for this?
Unless you can guaranteed the records with the same key will be next to each other, it's easier to start with a HoA.
my %data;
while (<>) {
chomp;
my ($key, $val) = split /:/;
push #{ $data{$key} }, $val;
}
Then convert to AoA:
my #data = values(%data);
Order-preserving:
my %data;
my #data;
while (<>) {
chomp;
my ($key, $val) = split /:/;
push #data, $data{$key} = []
if !$data{$key};
push #{ $data{$key} }, $val;
}

Changing element's positions in Perl

So I have a problem and I can't solve it. If I read some words from a file in Perl, in that file the words aren't in order, but have a number (as a first character) that should be the element's position to form a sentence.The 0 means that position is correct, 1 means that the word should be in position [1] etc.
The file looks like: 0This 3a 4sentence 2be 1should, and the solution should look like 0This 1should 2be 3a 4sentence.
In a for loop I get through the words array that i get from the file, and this is how i get the first character(the number) $firstCharacter = substr $words[$i], 0, 1;, but i don't know how to properly change the array.
Here's the code that I use
#!/usr/bin/perl -w
$arg = $ARGV[0];
open FILE, "< $arg" or die "Can't open file: $!\n";
$/ = ".\n";
while($row = <FILE>)
{
chomp $row;
#words = split(' ',$row);
}
for($i = 0; $i < scalar #words; $i++)
{
$firstCharacter = substr $words[$i], 0, 1;
if($firstCharacter != 0)
{
}
}
Just use sort. You can use a match in list context to extract the numbers, using \d+ will work even for numbers > 9:
#! /usr/bin/perl
use warnings;
use strict;
use feature qw{ say };
my #words = qw( 0This 3a 4sentence 2be 1should );
say join ' ', sort { ($a =~ /\d+/g)[0] <=> ($b =~ /\d+/g)[0] } #words;
If you don't mind the warnings, or you are willing to turn them off, you can use numeric comparison directly on the words, Perl will extract the numeric prefixes itself:
no warnings 'numeric';
say join ' ', sort { $a <=> $b } #words;
Assuming you have an array like this:
my #words = ('0This', '3a', '4sentence', '2be', '1should');
And you want it sorted like so:
('0This', '1should', '2be', '3a', '4sentence');
There's two steps to this. First is extracting the leading number. Then sorting by that number.
You can't use substr, because you don't know how long the number might be. For example, ('9Second', '12345First'). If you only looked at the first character you'd get 9 and 1 and sort them incorrectly.
Instead, you'd use a regex to capture the number.
my($num) = $word =~ /^(\d+)/;
See perlretut for more on how that works, particularly Extracting Matches.
Now that you can capture the numbers, you can sort by them. Rather than doing it in loop yourself, sort handles the sorting for you. All you have to do is supply the criterion for the sorting. In this case we capture the number from each word (assigned to $a and $b by sort) and compare them as numbers.
#words = sort {
# Capture the number from each word.
my($anum) = $a =~ /^(\d+)/;
my($bnum) = $b =~ /^(\d+)/;
# Compare the numbers.
$anum <=> $bnum
} #words;
There are various ways to make this more efficient, in particular the Schwartzian Transform.
You can also cheat a bit.
If you ask Perl to treat something as a number, it will do its damnedest to comply. If the string starts with a number, it will use that and ignore the rest, though it will complain.
$ perl -wle 'print "23foo" + "42bar"'
Argument "42bar" isn't numeric in addition (+) at -e line 1.
Argument "23foo" isn't numeric in addition (+) at -e line 1.
65
We can take advantage of that to simplify the sort by just comparing the words as numbers directly.
{
no warnings 'numeric';
#words = sort { $a <=> $b } #words;
}
Note that I turned off the warning about using a word as a number. use warnings and no warnings only has effect within the current block, so by putting the no warnings 'numeric' and the sort in their own block I've only turned off the warning for that one sort statement.
Finally, if the words are in a file you can use the Unix sort utility from the command line. Use -n for "numeric sorting" and it will do the same trick as above.
$ cat test.data
00This
3a
123sentence
2be
1should
$ sort -n test.data
00This
1should
2be
3a
123sentence
You should be able to split on the spaces, which will make the numbers the first character of the word. With that assumption, you can simply compare using the numerical comparison operator (<=>) as opposed to the string comparison (cmp).
The operators are important because if you compare strings, the first character is used, meaning 10, 11, and 12 would be out of order, and listed near the 1 (1,10,11,12,2,3,4… instead of 1,2,3,4…10,11,12).
Split, Then Sort
Note: #schwern commented an important point. If you use warnings -- and you should -- you will receive warnings. This is because the values of the internal comparison variables, $a and $b, aren't numbers, but strings (e.g., `"0this", "3a"). I've update the following Codepad and provided more suitable alternatives to avoid this issue.
http://codepad.org/xs2GH9xT
use strict;
use warnings;
my $line = q{0This 3a 4sentence 2be 1should};
my #words = split /\s/,$line;
my #sorted = sort {$a <=> $b} #words;
print qq{
Line: $line
Words: #words
Sorted: #sorted
};
Alternatives
One method is to ignore the warning using no warnings 'numeric' as in Schwern's answer. As he has shown, turning off the warnings in a block will re-enable it afterwards, which may be a little foolproof compared to Choroba's answer, which applies it to the broader scope.
Choroba's solution works by parsing the digits from the those values internally. This is much fewer lines of code, but I would generally advise against that for performance reasons. The regex isn't only run once per word, but multiple times over the sorting process.
Another method is to strip the numbers out and use them for the sort comparison. I attempt to do this below by creating a hash, where the key will be the number and the value will be the word.
Hash Mapping / Key Sort
Once you have an array where the values are the words prefixed by the numbers, you could just as easily split those number/word combo into a hash that has the key as the number and value as the word. This is accomplished by using split.
The important thing to note about the split statement is that a limit is passed (in this case 2), which limits the maximum number of fields the string is split into.
The two values are then used in the map to build the key/value assignment. Thus "0This" is split into "0" and "This" to be used in the hash as "0"=>"This"
http://codepad.org/kY8wwajc
use strict;
use warnings;
my $line = q{0This 3a 4sentence 2be 1should};
my #words = split /\s/, $line; # [ '0This', '3a', ... ]
my %mapped = map { split /(?=\D)/, $_, 2 } #words; # { '0'=>'This, '3'=>'a', ... }
my #sorted = #mapped{ sort { $a <=> $b } keys %mapped }; # [ 'This', 'should', 'be', ... ]
print qq{
Line: $line
Words: #words
Sorted: #sorted
};
This also can be further optimized, but uses multiple variables to illustrate the steps in the process.

Perl: string to an array reference?

Say, there is a string "[1,2,3,4,5]", how could I change it to an array reference as [1,2,3,4,5]? Using split and recomposing the array is one way, but it looks like there should be a simpler way.
eval is the simplest way
$string = "[1,2,3,4,5]";
$ref = eval $string;
but this is insecure if you don't have control over the contents of $string.
Your input string is also valid JSON, though, so you could say
use JSON;
$ref = decode_json( $string );
You can use eval but this should definitely be avoided when the string in question comes from untrusted sources.
Otherwise you have to parse it yourself:
my #arr = split(/\s*,\s*/, substr($string, 1, -1));
my $ref = \#arr;
You really should avoid eval if you can. If the string comes from outside the program then untold damage can be done by simply applying eval to it.
If the contents of the array is just numbers then you can use a regex to extract the information you need.
Here's an example
use strict;
use warnings;
my $string = "[1,2,3,4,5]";
my $data = [ $string =~ /\d+/g ];
use Data::Dump;
dd $data;
output
[1 .. 5]

Swap key and array value pair

I have a text file layed out like this:
1 a, b, c
2 c, b, c
2.5 a, c
I would like to reverse the keys (the number) and values (CSV) (they are separated by a tab character) to produce this:
a 1, 2.5
b 1, 2
c 1, 2, 2.5
(Notice how 2 isn't duplicated for c.)
I do not need this exact output. The numbers in the input are ordered, while the values are not. The output's keys must be ordered, as well as the values.
How can I do this? I have access to standard shell utilities (awk, sed, grep...) and GCC. I can probably grab a compiler/interpreter for other languages if needed.
If you have python (if you're on linux you probably already have) i'd use a short python script to do this. Note that we use sets to filter out "double" items.
Edited to be closer to requester's requirements:
import csv
from decimal import *
getcontext().prec = 7
csv_reader = csv.reader(open('test.csv'), delimiter='\t')
maindict = {}
for row in csv_reader:
value = row[0]
for key in row[1:]:
try:
maindict[key].add(Decimal(value))
except KeyError:
maindict[key] = set()
maindict[key].add(Decimal(value))
csv_writer = csv.writer(open('out.csv', 'w'), delimiter='\t')
sorted_keys = [x[1] for x in sorted([(x.lower(), x) for x in maindict.keys()])]
for key in sorted_keys:
csv_writer.writerow([key] + sorted(maindict[key]))
I would try perl if that's available to you. Loop through the input a row at a time. Split the line on the tab then the right hand part on the commas. Shove the values into an associative array with letters as the keys and the value being another associative array. The second associative array will be playing the part of a set so as to eliminate duplicates.
Once you read the input file, sort based on the keys of the associative array, loop through and spit out the results.
here's a small utility in php:
// load and parse the input file
$data = file("path/to/file/");
foreach ($data as $line) {
list($num, $values) = explode("\t", $line);
$newData["$num"] = explode(", ", trim($values));
}
unset($data);
// reverse the index/value association
foreach ($newData as $index => $values) {
asort($values);
foreach($values as $value) {
if (!isset($data[$value]))
$data[$value] = array();
if (!in_array($index, $data[$value]))
array_push($data[$value], $index);
}
}
// printout the result
foreach ($data as $index => $values) {
echo "$index\t" . implode(", ", $values) . "\n";
}
not really optimized or good looking, but it works...
# use Modern::Perl;
use strict;
use warnings;
use feature qw'say';
our %data;
while(<>){
chomp;
my($number,$csv) = split /\t/;
my #csv = split m"\s*,\s*", $csv;
push #{$data{$_}}, $number for #csv;
}
for my $number (sort keys %data){
my #unique = sort keys %{{ map { ($_,undef) } #{$data{$number}} }};
say $number, "\t", join ', ', #unique;
}
Here is an example using CPAN's Text::CSV module rather than manual parsing of CSV fields:
use strict;
use warnings;
use Text::CSV;
my %hash;
my $csv = Text::CSV->new({ allow_whitespace => 1 });
open my $file, "<", "file/to/read.txt";
while(<$file>) {
my ($first, $rest) = split /\t/, $_, 2;
my #values;
if($csv->parse($rest)) {
#values = $csv->fields()
} else {
warn "Error: invalid CSV: $rest";
next;
}
foreach(#values) {
push #{ $hash{$_} }, $first;
}
}
# this can be shortened, but I don't remember whether sort()
# defaults to <=> or cmp, so I was explicit
foreach(sort { $a cmp $b } keys %hash) {
print "$_\t", join(",", sort { $a <=> $b } #{ $hash{$_} }), "\n";
}
Note that it will print to standard output. I recommend just redirecting standard output, and if you expand this program at all, make sure to use warn() to print any errors, rather than just print()ing them. Also, it won't check for duplicate entries, but I don't want to make my code look like Brad Gilbert's, which looks a bit wack even to a Perlite.
Here's an awk(1) and sort(1) answer:
Your data is basically a many-to-many data set so the first step is to normalise the data with one key and value per line. We'll also swap the keys and values to indicate the new primary field, but this isn't strictly necessary as the parts lower down do not depend on order. We use a tab or [spaces],[spaces] as the field separator so we split on the tab between the key and values, and between the values. This will leave spaces embedded in the values, but trim them from before and after:
awk -F '\t| *, *' '{ for (i=2; i<=NF; ++i) { print $i"\t"$1 } }'
Then we want to apply your sort order and eliminate duplicates. We use a bash feature to specify a tab char as the separator (-t $'\t'). If you are using Bourne/POSIX shell, you will need to use '[tab]', where [tab] is a literal tab:
sort -t $'\t' -u -k 1f,1 -k 2n
Then, put it back in the form you want:
awk -F '\t' '{
if (key != $1) {
if (key) printf "\n";
key=$1;
printf "%s\t%s", $1, $2
} else {
printf ", %s", $2
}
}
END {printf "\n"}'
Pipe them altogether and you should get your desired output. I tested with the GNU tools.

Resources