Sorting by columns in perl - arrays

I have a file named all_energy.out and I am trying to sort it in a way so I can renumber files in the directory based on the lowest energy in the all_energy.out file. So I want to create an array with the file names and energy starting at the lowest energy and going to increasing energy like name and age.
Analogous Example:
Don 24
Jan 30
Sue 19
sorted to
Sue 19
Don 24
Jan 30
Example of all_energy.out file: The highest negative value is the lowest energy.
Energy
0001_IP3_fullBinding_Rigid0001 -219.209742
0001_IP3_fullBinding_Rigid0002 -219.188106
0001_IP3_fullBinding_Rigid0003 -219.064542
0001_IP3_fullBinding_Rigid0004 -219.050730
0001_IP3_fullBinding_Rigid0005 -219.044573
0001_IP3_fullBinding_Rigid0006 -218.927479
0001_IP3_fullBinding_Rigid0007 -218.919717
0001_IP3_fullBinding_Rigid0008 -218.900923
0001_IP3_fullBinding_Rigid0009 -218.898945
0001_IP3_fullBinding_Rigid0010 -218.889269
0001_IP3_fullBinding_Rigid0011 -218.871619
0001_IP3_fullBinding_Rigid0012 -218.859429
0001_IP3_fullBinding_Rigid0013 -218.848516
0001_IP3_fullBinding_Rigid0014 -218.835355
0001_IP3_fullBinding_Rigid0015 -218.822244
0001_IP3_fullBinding_Rigid0016 -218.819328
0001_IP3_fullBinding_Rigid0017 -218.818431
0001_IP3_fullBinding_Rigid0018 -218.815494
0001_IP3_fullBinding_Rigid0019 -218.798388
0001_IP3_fullBinding_Rigid0020 -218.792151
Energy
0002_IP3_fullBinding_Rigid0001 -226.007998
0002_IP3_fullBinding_Rigid0002 -225.635657
The file names are given before the energy value, for example 0001_IP3_fullBinding_Rigid0001.mol2 is the name of the first file.
Example solution:
0002_IP3_fullBinding_Rigid0001 -226.007998
0002_IP3_fullBinding_Rigid0002 -225.635657
0001_IP3_fullBinding_Rigid0001 -219.209742
0001_IP3_fullBinding_Rigid0002 -219.188106
0001_IP3_fullBinding_Rigid0003 -219.064542
My current script is:
#!/usr/bin/perl
use strict;
use warnings;
print "Name of all total energy containing file:\n";
my $energy_file = <STDIN>;
chomp $energy_file;
my $inputfile_energy = $energy_file;
open (INPUTFILE_ENERGY, "<", $inputfile_energy) or die $!;
print map $inputfile_energy->[0],
sort { $a->[1] <=> $b->[1] }
map { [ $_, /(\d+)$/ ] }
<INPUTFILE_ENERGY>;
close $inputfile_energy;
At this point I am just trying to get the energy with their names to print to the correct order. Then I will loop through the files in the directory and when the name matches with the sorted energy names it will be renumber.

Problems with your script:
/(\d+)$/ only matches digits (0-9) at the end of a line. Your file contains floating point numbers, so only digits after the decimal point will be matched. You could get away with /(\S+)$/ instead. (Actually, in your sample input there is a line with a trailing space, so let's make that /(\S+)\s*$/ instead)
$inputfile_energy is a filename, a scalar, and not a reference, so $inputfile_energy->[0] doesn't make sense. You use it as the expression in a map construction, and in a map EXPR, LIST construction, $_ refers to the current element of the list that is being iterated through, so you probably just meant to say $_->[0].
Your input contains a few lines -- all with the keyword Energy -- that don't have the same format as the other lines you want to sort and should be filtered out.
Putting this all together, I get working code when the penultimate statement looks like:
print map $_->[0],
sort { $a->[1] <=> $b->[1] }
map { [ $_, /(\S+)\s*$/ ] }
grep /\d/,
<INPUTFILE_ENERGY>;

you can use oneliner like this and run it from command line:
perl -lnae 'push #arr, [$_, $F[1]] if $F[1]; END { print join "\n", map {$_->[0]} sort {$a->[1] <=> $b->[1]} #arr }' energy_file.txt
1) special key -n makes the loop over all lines in input file (energy_file.txt); current line is available in $_ variable.
2) then key -a splits each line by whitespaces and puts nonempty values into #F array.

A less "idiomatic" solution could be :
#data = <DATA>;
my #table;
foreach(#data){
chomp;
next unless /^0/; # skip Energy lines (or any other cleaning test)
#line = split /\s+/;
push #table,[#line]; # build a 2d array
}
my #sortedTable = sort { $a->[1] <=> $b->[1] } #table;
foreach(#sortedTable){
printf(
"%5s,%25s\n",
$_->[0],
$_->[1]
) # some pretty printing
}
__DATA__
Energy
0001_IP3_fullBinding_Rigid0001 -219.209742
0001_IP3_fullBinding_Rigid0002 -219.188106
0001_IP3_fullBinding_Rigid0003 -219.064542
0001_IP3_fullBinding_Rigid0004 -219.050730
....

Try this:
print join "\n", sort {(split /\s+/,$a)[1] <=> (split /\s+/,$b)[1]} map{chomp $_; $_} <INPUTFILE_ENERGY>;

Related

Advice on Perl sort that uses Schwartzian transform

I have been looking at an old post about sorting an array by using a regular expression in Perl. The original post is here
I am struggling to totally understand the script that was voted as the ‘correct’ answer.
The original post was regarding sorting the array below:
my #array = (
"2014 Computer Monitor 200",
"2010 Keyboard 30",
"2012 Keyboard 80",
"2011 Study Desk 100"
);
The question was how to use regular expressions in perl to sort the entire array by year, item name, and price? For example, if the user wants to sort by price they type 'price' and it sorts like this:
2010 Keyboard 30
2012 Keyboard 80
2011 Study Desk 100
2014 Computer Monitor 200
A solution was proposed, that uses a Schwartzian transform. I have just started to learn about this, and this script is a little different to the other examples I've seen. The script that was voted as the correct answer is below. I am looking for advice on how it works.
my $order = "price";
my #array = (
"2014 Computer Monitor 200",
"2010 Keyboard 30",
"2012 Keyboard 80",
"2011 Study Desk 100"
);
my %sort_by = (
year => sub { $a->{year} <=> $b->{year} },
price => sub { $a->{price} <=> $b->{price} },
name => sub { $a->{name} cmp $b->{name} },
);
#array = sort {
local ($a, $b) = map {
my %h;
#h{qw(year name price)} = /(\d+) \s+ (.+) \s+ (\S+)/x;
\%h;
} ($a, $b);
$sort_by{$order}->();
} #array;
# S. transform
# #array =
# map { $_->{line} }
# sort { $sort_by{$order}->() }
# map {
# my %h = (line => $_);
# #h{qw(year name price)} = /(\d+) \s+ (.+) \s+ (\S+)/x;
# $h{name} ? \%h : ();
# } #array;
use Data::Dumper; print Dumper \#array;
I know the script is using the regular expression /(\d+) \s+ (.+) \s+ (\S+)/x to match on year name and price.
I think the rest of the script works as below:
• The initial sort on line 14 takes in items from #array two at a time, one in $a and one in $b
• The map function then takes items $a and $b and maps each to a hash - each item becomes a hash with keys 'year', 'price', and 'name. This is based on the regex /(\d+) \s+ (.+) \s+ (\S+)/x
• Map returns the two hashes, as references, to local variables $a and $b
• I think it is necessary to use local $a and $b otherwise sort will use the default $a and $b taken in at the start of the sort on line 17?
• The 'price' sort function is stored as an coderef in the %sort_by hash
• This is called at line 26 by the code $sort_by{$order}->() on the local versions of $a and $b
This repeated until all items are returned to #array in line 14
Please can anyone tell me if I'm on the right lines here, or correct any misunderstandings. Also can you advise on the use of the local $a and $b variables.
thanks
J
A Schwartzian transform is a way to avoid computing the sorting keys too many times, like in the solution - the one with the local ($a,$b)
The steps of a S. tranform are basically:
use a Map to enrich the list elements with computed sorted keys. Here, %h is used as the new element, containing the original line as line
use a Sort to sort this rich list. The sort with a bit of dirty $a $b magic.
use a Map to extract the original list elements. Here by extracting the line key.
A note on $a $b
Very sadly, $a and $b are global variables in Perl. They usually get automagically assigned inside a sort block. Like in sort { $a <=> $b } (3,2,1)
This explains why the S. solution works even though the compared elements are not given as arguments to the sorting subs. And it also explains the need for local (another Perl horror to pretend a global variable is local) so the naive solution's sort function get the right values in $a, $b.
I strongly encourage you to forget about this and avoid implicit use of $a , $b deeper than the sort block itself.
A slightly more understandable version would be:
my $order = "price";
my #array = (
"2014 Computer Monitor 200",
"2010 Keyboard 30",
"2012 Keyboard 80",
"2011 Study Desk 100"
);
my %sort_by = (
year => sub { shift->{year} <=> shift->{year} },
price => sub { shift->{price} <=> shift->{price} },
name => sub { shift->{name} cmp shift->{name} },
);
my #sorted =
map { $_->{line} }
sort { $sort_by{$order}->($a, $b) }
map {
my %h = (line => $_); # $_ is the array element (the input line)
#h{qw(year name price)} = ( $_ =~ /(\d+) \s+ (.+) \s+ (\S+)/x );
# Did the regex capture a name, i.e. did it work?
if( $h{name} ){
\%h
} else{
(); # Empty array will cause the invalid line to disappear, but you can choose to do something else with it.
}
} #array;
print(join("\n", #sorted))

Changing element's positions in Perl

So I have a problem and I can't solve it. If I read some words from a file in Perl, in that file the words aren't in order, but have a number (as a first character) that should be the element's position to form a sentence.The 0 means that position is correct, 1 means that the word should be in position [1] etc.
The file looks like: 0This 3a 4sentence 2be 1should, and the solution should look like 0This 1should 2be 3a 4sentence.
In a for loop I get through the words array that i get from the file, and this is how i get the first character(the number) $firstCharacter = substr $words[$i], 0, 1;, but i don't know how to properly change the array.
Here's the code that I use
#!/usr/bin/perl -w
$arg = $ARGV[0];
open FILE, "< $arg" or die "Can't open file: $!\n";
$/ = ".\n";
while($row = <FILE>)
{
chomp $row;
#words = split(' ',$row);
}
for($i = 0; $i < scalar #words; $i++)
{
$firstCharacter = substr $words[$i], 0, 1;
if($firstCharacter != 0)
{
}
}
Just use sort. You can use a match in list context to extract the numbers, using \d+ will work even for numbers > 9:
#! /usr/bin/perl
use warnings;
use strict;
use feature qw{ say };
my #words = qw( 0This 3a 4sentence 2be 1should );
say join ' ', sort { ($a =~ /\d+/g)[0] <=> ($b =~ /\d+/g)[0] } #words;
If you don't mind the warnings, or you are willing to turn them off, you can use numeric comparison directly on the words, Perl will extract the numeric prefixes itself:
no warnings 'numeric';
say join ' ', sort { $a <=> $b } #words;
Assuming you have an array like this:
my #words = ('0This', '3a', '4sentence', '2be', '1should');
And you want it sorted like so:
('0This', '1should', '2be', '3a', '4sentence');
There's two steps to this. First is extracting the leading number. Then sorting by that number.
You can't use substr, because you don't know how long the number might be. For example, ('9Second', '12345First'). If you only looked at the first character you'd get 9 and 1 and sort them incorrectly.
Instead, you'd use a regex to capture the number.
my($num) = $word =~ /^(\d+)/;
See perlretut for more on how that works, particularly Extracting Matches.
Now that you can capture the numbers, you can sort by them. Rather than doing it in loop yourself, sort handles the sorting for you. All you have to do is supply the criterion for the sorting. In this case we capture the number from each word (assigned to $a and $b by sort) and compare them as numbers.
#words = sort {
# Capture the number from each word.
my($anum) = $a =~ /^(\d+)/;
my($bnum) = $b =~ /^(\d+)/;
# Compare the numbers.
$anum <=> $bnum
} #words;
There are various ways to make this more efficient, in particular the Schwartzian Transform.
You can also cheat a bit.
If you ask Perl to treat something as a number, it will do its damnedest to comply. If the string starts with a number, it will use that and ignore the rest, though it will complain.
$ perl -wle 'print "23foo" + "42bar"'
Argument "42bar" isn't numeric in addition (+) at -e line 1.
Argument "23foo" isn't numeric in addition (+) at -e line 1.
65
We can take advantage of that to simplify the sort by just comparing the words as numbers directly.
{
no warnings 'numeric';
#words = sort { $a <=> $b } #words;
}
Note that I turned off the warning about using a word as a number. use warnings and no warnings only has effect within the current block, so by putting the no warnings 'numeric' and the sort in their own block I've only turned off the warning for that one sort statement.
Finally, if the words are in a file you can use the Unix sort utility from the command line. Use -n for "numeric sorting" and it will do the same trick as above.
$ cat test.data
00This
3a
123sentence
2be
1should
$ sort -n test.data
00This
1should
2be
3a
123sentence
You should be able to split on the spaces, which will make the numbers the first character of the word. With that assumption, you can simply compare using the numerical comparison operator (<=>) as opposed to the string comparison (cmp).
The operators are important because if you compare strings, the first character is used, meaning 10, 11, and 12 would be out of order, and listed near the 1 (1,10,11,12,2,3,4… instead of 1,2,3,4…10,11,12).
Split, Then Sort
Note: #schwern commented an important point. If you use warnings -- and you should -- you will receive warnings. This is because the values of the internal comparison variables, $a and $b, aren't numbers, but strings (e.g., `"0this", "3a"). I've update the following Codepad and provided more suitable alternatives to avoid this issue.
http://codepad.org/xs2GH9xT
use strict;
use warnings;
my $line = q{0This 3a 4sentence 2be 1should};
my #words = split /\s/,$line;
my #sorted = sort {$a <=> $b} #words;
print qq{
Line: $line
Words: #words
Sorted: #sorted
};
Alternatives
One method is to ignore the warning using no warnings 'numeric' as in Schwern's answer. As he has shown, turning off the warnings in a block will re-enable it afterwards, which may be a little foolproof compared to Choroba's answer, which applies it to the broader scope.
Choroba's solution works by parsing the digits from the those values internally. This is much fewer lines of code, but I would generally advise against that for performance reasons. The regex isn't only run once per word, but multiple times over the sorting process.
Another method is to strip the numbers out and use them for the sort comparison. I attempt to do this below by creating a hash, where the key will be the number and the value will be the word.
Hash Mapping / Key Sort
Once you have an array where the values are the words prefixed by the numbers, you could just as easily split those number/word combo into a hash that has the key as the number and value as the word. This is accomplished by using split.
The important thing to note about the split statement is that a limit is passed (in this case 2), which limits the maximum number of fields the string is split into.
The two values are then used in the map to build the key/value assignment. Thus "0This" is split into "0" and "This" to be used in the hash as "0"=>"This"
http://codepad.org/kY8wwajc
use strict;
use warnings;
my $line = q{0This 3a 4sentence 2be 1should};
my #words = split /\s/, $line; # [ '0This', '3a', ... ]
my %mapped = map { split /(?=\D)/, $_, 2 } #words; # { '0'=>'This, '3'=>'a', ... }
my #sorted = #mapped{ sort { $a <=> $b } keys %mapped }; # [ 'This', 'should', 'be', ... ]
print qq{
Line: $line
Words: #words
Sorted: #sorted
};
This also can be further optimized, but uses multiple variables to illustrate the steps in the process.

Create Perl array without the use of curly brackets so that array is not confined to specific block

I want to maintain an array outside of a block of code. The array that I am interested in is #csv2years. This is currently within a set of curly brackets which contain the code for a for loop. I realise I somehow need to rearrange this so that the curly brackets do not contain the conditions which create the array. I don't know how to specifically reference csv2 without the for { } section.
#!/bin/perl
# creates subroutines for discovering year
sub find_year {
my ( $str ) = #_;
my $year;
$year = $1 if( $str =~ /\b((?:19|20)\d\d)\b/ );
return $year
}
#####CREATE CSV2 DATA
my #csv2 = (); # Creates new empty array #csv2
open CSV2, "<csv2" or die; # Creates the file handle CSV2, inputs the data from csv2 into CVS2
#csv2=<CSV2>; # populates #csv2 array with data from file handle CSV2
close CSV2;
my %csv2hash = (); # Creates empty hash csv2hash
for (#csv2) { # for data in #csv2 array
chomp;
my ($title) = $_ =~ /^.+?,\s*([^,]+?),/; #/define the data which is the title
#Indicate that title data will input into csv2hash
$csv2hash{$_} = $title;
#Create a hash for find year of csv2
my %csv2_year = find_year($title);
my #csv2years = keys(%csv2_year);
print "#csv2years\n";
}
The above code currently prints out #csv2years, but if the print "#csv2years\n"; is placed outside of the curly brackets, nothing is printed out. How can I rearrange this code so that there is no need for the for loop and/or curly brackets?
Here is the csv2 data:
14564564,1989 the door to the others,546456,47878787
456456445,the Twin Peaks theory project 1979,45454545,45454545
456456445,the Twin Peaks forget that,45454545,45454545
454654564, 1969 hello good world your great ,45456456, 54564654
The output of print "#csv2years\n"; is :
1989
1979
1969
To answer your immediate need:
my #csv2years;
for ( #csv2 ) {
...
push #csv2years, find_year($title);
}
print "$_\n" for #csv2years;
But there are a ton of improvements that will help you up your Perl-fu. A few of the salient ones are:
use strict; use warnings;
use a dedicated CSV parser like Text::CSV
iterate over the file line by line instead of cramming everything into an array and then loop over each element
replace bareword filehandles with lexical ones (open my $fh, ...)
assign a default value in case find_year doesn't find a year (empty string, perhaps?)
assign the result of find_year to a scalar instead of a hash

Can I use the contents of an array as the keys of a hash?

I want my array to become the keys of my new hash. I am writing a program that counts the number of word occurrences in a document.
my #array = split(" ", $line);
keys my %word_count = #array; #This does nothing
This segment is occuring while I am reading the infile line by line. I am trying to find a way to complete this project using hashes. The words are the keys, and the number of times they appear are the values. But, this step in particular is puzzling me.
Use a hash slice.
my %word_count;
#word_count{split ' ', $line} = ();
# if you like wasting memory:
# my #array = split ' ', $line;
# #word_count{#array} = (0) x #array;
You can't do it that way, certainly.
my %word_count = map {($_, 0)} #array;
would initialize the keys of the hash; but generally in Perl you don't want to do that. Two issues here are that
you need a second pass to actually account for the words in the line;
you can't cheat and change the 0 to 1 above, because if a word is repeated in the line you will count it only once, the others being overwritten.
my %word_count = map { $_ => 0 } split(" ", $line);
You're trying to count the number of occurences of words in a line, right? If so, you want
my %word_count;
++$word_count for split(/\s+/, $line);
Or to put it on its head in order to facilitate refining the definition of a word:
my %word_count;
++$word_count for $line =~ /(\S+)/g;

Swap key and array value pair

I have a text file layed out like this:
1 a, b, c
2 c, b, c
2.5 a, c
I would like to reverse the keys (the number) and values (CSV) (they are separated by a tab character) to produce this:
a 1, 2.5
b 1, 2
c 1, 2, 2.5
(Notice how 2 isn't duplicated for c.)
I do not need this exact output. The numbers in the input are ordered, while the values are not. The output's keys must be ordered, as well as the values.
How can I do this? I have access to standard shell utilities (awk, sed, grep...) and GCC. I can probably grab a compiler/interpreter for other languages if needed.
If you have python (if you're on linux you probably already have) i'd use a short python script to do this. Note that we use sets to filter out "double" items.
Edited to be closer to requester's requirements:
import csv
from decimal import *
getcontext().prec = 7
csv_reader = csv.reader(open('test.csv'), delimiter='\t')
maindict = {}
for row in csv_reader:
value = row[0]
for key in row[1:]:
try:
maindict[key].add(Decimal(value))
except KeyError:
maindict[key] = set()
maindict[key].add(Decimal(value))
csv_writer = csv.writer(open('out.csv', 'w'), delimiter='\t')
sorted_keys = [x[1] for x in sorted([(x.lower(), x) for x in maindict.keys()])]
for key in sorted_keys:
csv_writer.writerow([key] + sorted(maindict[key]))
I would try perl if that's available to you. Loop through the input a row at a time. Split the line on the tab then the right hand part on the commas. Shove the values into an associative array with letters as the keys and the value being another associative array. The second associative array will be playing the part of a set so as to eliminate duplicates.
Once you read the input file, sort based on the keys of the associative array, loop through and spit out the results.
here's a small utility in php:
// load and parse the input file
$data = file("path/to/file/");
foreach ($data as $line) {
list($num, $values) = explode("\t", $line);
$newData["$num"] = explode(", ", trim($values));
}
unset($data);
// reverse the index/value association
foreach ($newData as $index => $values) {
asort($values);
foreach($values as $value) {
if (!isset($data[$value]))
$data[$value] = array();
if (!in_array($index, $data[$value]))
array_push($data[$value], $index);
}
}
// printout the result
foreach ($data as $index => $values) {
echo "$index\t" . implode(", ", $values) . "\n";
}
not really optimized or good looking, but it works...
# use Modern::Perl;
use strict;
use warnings;
use feature qw'say';
our %data;
while(<>){
chomp;
my($number,$csv) = split /\t/;
my #csv = split m"\s*,\s*", $csv;
push #{$data{$_}}, $number for #csv;
}
for my $number (sort keys %data){
my #unique = sort keys %{{ map { ($_,undef) } #{$data{$number}} }};
say $number, "\t", join ', ', #unique;
}
Here is an example using CPAN's Text::CSV module rather than manual parsing of CSV fields:
use strict;
use warnings;
use Text::CSV;
my %hash;
my $csv = Text::CSV->new({ allow_whitespace => 1 });
open my $file, "<", "file/to/read.txt";
while(<$file>) {
my ($first, $rest) = split /\t/, $_, 2;
my #values;
if($csv->parse($rest)) {
#values = $csv->fields()
} else {
warn "Error: invalid CSV: $rest";
next;
}
foreach(#values) {
push #{ $hash{$_} }, $first;
}
}
# this can be shortened, but I don't remember whether sort()
# defaults to <=> or cmp, so I was explicit
foreach(sort { $a cmp $b } keys %hash) {
print "$_\t", join(",", sort { $a <=> $b } #{ $hash{$_} }), "\n";
}
Note that it will print to standard output. I recommend just redirecting standard output, and if you expand this program at all, make sure to use warn() to print any errors, rather than just print()ing them. Also, it won't check for duplicate entries, but I don't want to make my code look like Brad Gilbert's, which looks a bit wack even to a Perlite.
Here's an awk(1) and sort(1) answer:
Your data is basically a many-to-many data set so the first step is to normalise the data with one key and value per line. We'll also swap the keys and values to indicate the new primary field, but this isn't strictly necessary as the parts lower down do not depend on order. We use a tab or [spaces],[spaces] as the field separator so we split on the tab between the key and values, and between the values. This will leave spaces embedded in the values, but trim them from before and after:
awk -F '\t| *, *' '{ for (i=2; i<=NF; ++i) { print $i"\t"$1 } }'
Then we want to apply your sort order and eliminate duplicates. We use a bash feature to specify a tab char as the separator (-t $'\t'). If you are using Bourne/POSIX shell, you will need to use '[tab]', where [tab] is a literal tab:
sort -t $'\t' -u -k 1f,1 -k 2n
Then, put it back in the form you want:
awk -F '\t' '{
if (key != $1) {
if (key) printf "\n";
key=$1;
printf "%s\t%s", $1, $2
} else {
printf ", %s", $2
}
}
END {printf "\n"}'
Pipe them altogether and you should get your desired output. I tested with the GNU tools.

Resources