Perl: Sort part of array

I have an array whose elements each contain many fields separated by varying amounts of whitespace, like:
INDDUMMY drawing2 139 30 1 0 0 0 0 0
RMDUMMY drawing2 69 2 1 0 0 0 0 0
PIMP drawing 7 0 1444 718 437 0 0 0
I'm trying to sort this array by the number in the third field, so the desired output would be:
PIMP drawing 7 0 1444 718 437 0 0 0
RMDUMMY drawing2 69 2 1 0 0 0 0 0
INDDUMMY drawing2 139 30 1 0 0 0 0 0
I tried to split with a regular expression inside the sort block, like:
@sortedListOfLayers = sort {
    split(m/\w+\s+(\d+)\s/gm, $a)
    cmp
    split(m/\w+\s+(\d+)\s/gm, $b)
} @listOfLayers;
but it doesn't work correctly. How could I do that kind of sort?

You need to expand out your sort function a little further. I'm also not sure that split is working the way you think it is. Split turns text into an array based on a delimiter.
I think your problem is that your regular expression - thanks to the gm flags - isn't matching what you think it's matching. I'd perhaps approach it slightly differently though:
#!/usr/bin/perl
use strict;
use warnings;
my @array = <DATA>;
sub sort_third_num {
    my $a1 = ( split( ' ', $a ) )[2];
    my $b1 = ( split( ' ', $b ) )[2];
    return $a1 <=> $b1;
}
print sort sort_third_num @array;
__DATA__
INDDUMMY drawing2 139 30 1 0 0 0 0 0
RMDUMMY drawing2 69 2 1 0 0 0 0 0
PIMP drawing 7 0 1444 718 437 0 0 0
This does the trick, for example.
If you're set on doing a regex approach:
sub sort_third_num {
    my ($a1) = $a =~ m/\s(\d+)/;
    my ($b1) = $b =~ m/\s(\d+)/;
    return $a1 <=> $b1;
}
Not matching globally means only the first capture is returned, i.e. only the first 'whitespace-digits' match in the line. We also compare numerically rather than stringwise.
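For illustration, here is a minimal snippet (hypothetical, using the first data line from the question) showing what the match returns in list context with and without the /g flag:
my $line = 'INDDUMMY drawing2 139 30 1 0 0 0 0 0';
my ($first) = $line =~ /\s(\d+)/;    # without /g: only the first capture -> 139
my @all     = $line =~ /\s(\d+)/g;   # with /g: every whitespace-digits match -> (139, 30, 1, 0, ...)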

If you want to sort a list and the operation used in the sort block is expensive, an often used Perl idiom is the Schwartzian Transform: you apply the operation once to each list element and store the result alongside the original element, sort, then map back to your original format.
The classic textbook example is sorting files in a directory by size using the expensive -s file test. A naïve approach would be
my @sorted = sort { -s $a <=> -s $b } @unsorted;
which has to perform -s twice for each comparison operation.
Using the Schwartzian Transform, we map the file names into a list of array references, each referencing an array containing the list element and its size (which has to be determined only once per file), then sort by file size, and finally map the array references back to just the file names. This is all done in a single step:
my @sorted =
    map  $_->[0],                  # 3. map to file name
    sort { $a->[1] <=> $b->[1] }   # 2. sort by size
    map  [ $_, -s $_ ],            # 1. evaluate size once for each file
    @unsorted;
In your case, the question is how expensive it is to extract the third field of each array element. When in doubt, measure and compare the different methods. The speedup in the file size example is dramatic, at about a factor of 10 for a few dozen files!
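If you do want to measure, a rough sketch using the core Benchmark module might look like this (the generated @array below is hypothetical test data; the numbers you get will depend entirely on your real input):
use Benchmark qw(cmpthese);
my @array = map { "LAYER$_ drawing " . int(rand 1000) . " 0 1 0 0 0 0 0" } 1 .. 1000;
cmpthese( -2, {
    plain       => sub {
        my @s = sort { (split ' ', $a)[2] <=> (split ' ', $b)[2] } @array;
    },
    schwartzian => sub {
        my @s = map  $_->[0],
                sort { $a->[1] <=> $b->[1] }
                map  [ $_, (split ' ', $_)[2] ],
                @array;
    },
} );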
The Schwartzian Transform applied to your problem would look something like this:
my @sorted =
    map  $_->[0],                          # 3. Map to original array
    sort { $a->[1] <=> $b->[1] }           # 2. Sort by third column
    map  [ $_, ( split( ' ', $_ ) )[2] ],  # 1. Use Sobrique's idea
    @array;
If the operation used is so expensive that you want to avoid performing it more than once per value in case you have identical array elements, you can cache the results as outlined in this question; this is known as the Orcish Maneuver.
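A minimal sketch of the Orcish Maneuver for this data (assuming Perl 5.10+ for the //= defined-or assignment; the hash simply caches the extracted third field per distinct line):
my %third;   # cache: line => its third field
my @sortedListOfLayers = sort {
    ( $third{$a} //= (split ' ', $a)[2] )
        <=>
    ( $third{$b} //= (split ' ', $b)[2] )
} @listOfLayers;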

Related

How to use offset in arrays in bash?

Here is my code.
#! /bin/bash
array=(3 2 1 0 0 0 0 0 0 0)
for i in {0..10}
do
this=${array:$i:$((i+1))}
echo $this
done
I want to print each number of my array separately. I have used this line to get the array elements using an offset:
this=${array:$i:$((i+1))}
However, I am only getting 3 printed and the rest are all blank lines. I basically want to print 3, 2, 1, etc. on separate lines. How do I correct this?
First, you need to use the whole array, ${array[@]}, not array.
echo "${array[@]:3:2}"
Then, you may change the index to simple variable names:
this=${array[@]:i:i+1}
And then, you probably need to extract just one value of the list:
this=${array[@]:i:1}
Try this code:
array=(3 2 1 0 0 0 0 0 0 0)
for i in {0..10}
do
    this=${array[@]:i:1}
    echo "$this"
done
There is no reason to use an array slice here, just access the individual elements of the array. Try this:
#! /bin/bash
array=(3 2 1 0 0 0 0 0 0 0)
for i in {0..10}
do
    this=${array[$i]}
    echo "$this"
done
In general you can access a single element of an array like that: ${array[3]}.
Note that in this case, it would have been preferable to do this:
array=(3 2 1 0 0 0 0 0 0 0)
for this in "${array[@]}"
do
    echo "$this"
done

How do I sort a Tcl array by values?

How do I sort an array's output? For example:
Sample input coming from
puts "$word $count($word)"}
Sample Input
Roger 15
Martin 18
Jemmy 16
Jon 12
Sara 12
Expected Output
Martin 18
Jemmy 16
Roger 15
Jon 12
Sara 12
Tcl's arrays are unsorted, always, and indeed the order of the elements changes from time to time as you add elements in (when the underlying hash table is rebuilt). To get the output you want, you're best off getting the contents of the array and using lsort with the -stride 2 option:
# Convert the array to a Tcl list
set contents [array get count]
# First sort by name, as a secondary key
set contents [lsort -stride 2 -index 0 $contents]
# Then sort by count, descending, as a primary key
set contents [lsort -stride 2 -index 1 -integer -decreasing $contents]
# Print the values
foreach {name score} $contents {
puts "$name $score"
}
The -stride option requires Tcl 8.6.
In older versions of Tcl, you have to pack things up into a list of tuples:
# Convert the array to a list of pairs
set contents {}
foreach {name score} [array get count] {
lappend contents [list $name $score]
}
# Do the sorting
set contents [lsort -index 0 $contents]
set contents [lsort -index 1 -integer -decreasing $contents]
# Print the values
foreach pair $contents {
    # Unpack; *not* needed here, but useful for anything more complicated
    foreach {name score} $pair break
    # You could use “lassign $pair name score” but you're on 8.4
    puts "$name $score"
}
Note that Tcl 8.4 is unsupported software, not even receiving security fixes, and that 8.5 only has a year or two of extended support left. There's a limit to how long we'll hold people's hands…
You probably have something like this
array set count { Roger 15 Martin 18 Jemmy 16 Jon 12 Sara 12 }
foreach word [array names count] {puts "$word $count($word)"}
Jemmy 16
Sara 12
Jon 12
Martin 18
Roger 15
What you want to do is transform the array into a list, step over it in pairs, and sort the pairs based on the number:
foreach {name num} \
[lsort -integer -decreasing -stride 2 -index 1 [array get count]] \
{puts "$name $num"}
Martin 18
Jemmy 16
Roger 15
Sara 12
Jon 12
refs:
http://www.tcl.tk/man/tcl8.6/TclCmd/lsort.htm
http://www.tcl.tk/man/tcl8.6/TclCmd/foreach.htm
Solution for Tcl < 8.6:
Given
array set count {Roger 15 Martin 18 Jemmy 16 Jon 12 Sara 12}
The way to get sorted output is
set L [list]
foreach {k v} [array get count] {
    lappend L [list $k $v]
}
foreach e [lsort -index 1 -decreasing -integer $L] {
    lassign $e k v
    puts "$k $v"
}
Explanation:
Get a flat list of the interleaved keys and values using array get.
From it, produce a list of key/value pairs—a list of lists.
Given that list, sort it using the lsort command passing it
the -index 1 option which makes lsort interpret the elements of the
list it sorts as lists, and use their elements at index 1
(the 2nd position) for sorting.
To print out the elements of the sorted list, you need to extract the
key and the value back from each of them. The easiest way is to use
lassign but if you have Tcl < 8.5 you can use either the
foreach {k v} $e break trick or directly access the elements using
lindex $e 0 and lindex $e 1 to get the key and the value, respectively.

Tcl array sorting based on values

I have an array with dynamic 'keys' and values associated with them.
I want to sort the array based on the value and want to be able to retrieve the 'keys' from the sorted array.
For example, say I have,
for {set i 0} {$i < [db_get_nrows $rs]} {incr i} {
set x [db_get_col $rs $i abc]
set ARRAY_A($x) [db_get_col $rs $i def]
}
So, my array would look like,
ARRAY_A(111) 10
ARRAY_A(222) 50
ARRAY_A(333) 20
Now, I want to sort this array based on its values (with 50 first, then 20 and then 10). And then I'm interested in its keys (222, 333 and 111) for further processing.
I couldn't find much on the internet about such arrays with dynamically generated keys.
Any help is much appreciated.
Thanks.
Well, I just wanted to first mention that you cannot sort arrays as they don't really have a fixed order, but are saved in a manner that makes it easier/faster for the interpreter to retrieve values.
If you want to get the keys of the array in the order of the values, you can maybe use something like that:
set key_value [lmap {key val} [array get ARRAY_A] {list $key $val}]
set key_value [lsort -index 1 -integer -decreasing $key_value]
The list key_value now holds the key/value pairs of your array, sorted by value in decreasing order. -index 1 indicates that the sort uses the 2nd element of each sublist (Tcl lists are 0-based). -integer instructs it to sort integers (rather than using the default dictionary sort). You then just need to get the keys from the list:
foreach n $key_value {
puts [lindex $n 0]
}
You can combine the above into a single loop if you want (I combined the loop and the second line; folding in the first line as well would make it a bit too dense):
foreach n [lsort -index 1 -integer -decreasing $key_value] {
puts [lindex $n 0]
}
This part of the answer is mostly an addendum to Dinesh's answer and not complete in itself.
Once you have created a list containing the array elements sorted according to value, you can put it in a dictionary (which is another kind of associative list structure):
set d [lsort -stride 2 -integer -decreasing -index 1 $l]
The dictionary will preserve the order of insertion and allow easy access to e.g. the keys:
dict keys $d
# -> 222 333 111
ETA:
If you can't use lmap or -stride, you can still generate a dictionary like this:
set pairs {}
foreach {a b} [array get ARRAY_A] {
lappend pairs [list $a $b]
}
set DICT_A [concat {*}[lsort -index 1 -integer -decreasing $pairs]]
This method packs the elements into "pairs", sorts the packed list, and then unpacks it into a flat list to be usable as a dictionary as above.
Documentation: array, concat, dict, foreach, lappend, list, lsort, set
% set tcl_version
8.6
% array set n {111 10 222 50 333 20}
% parray n
n(111) = 10
n(222) = 50
n(333) = 20
% set l [array get n]
333 20 222 50 111 10
% lsort -stride 2 -integer -index 1 $l
111 10 333 20 222 50
% lsort -stride 2 -integer -decreasing -index 1 $l
222 50 333 20 111 10
%
You can get them as a list in the expected order and then apply your logic further.

how to sort an array by a selected value in perl

I have strings inside an array like this:
/hello/Stack/oveflow 14
/hello/Stack/oveflow 11
/hello/Stack/oveflow 12
/hello/Stack/oveflow 166
/hello/Stack/oveflow 1
/hello/Stack/oveflow 2
/hello/Stack/oveflow 5
I have to sort by the last number.
Is it possible to use sort to do that?
Yes, sort is exactly what you need. Just provide the code block to compare two elements:
my @sorted = sort { ($a =~ /[0-9]+/g)[-1]
                        <=>
                    ($b =~ /[0-9]+/g)[-1]
                  } @array;
<=> does the numeric comparison. The matching returns all the numbers in the string, [-1] selects the last one.

comparing all elements of two big files

How do I compare all elements of a file with all elements of another file, using C or Perl, for fairly large data? File 1 contains 100,000 such numbers and file 2 contains 500,000 elements.
I used a foreach within a foreach, splitting each and every element into arrays. It worked correctly in Perl, but checking and printing every occurrence of the elements of just a single column of File 2 in File 1 took 40 minutes. There are 28 such columns.
Is there any way to reduce the time, perhaps by using another language like C?
File 1:
0.1
0.11
0.12
0.13
0.14
0.15
0.16
0.17
0.18
0.19
0.2
File 2:
0.1 0.2 0.3 0.4 0.5 0.6 0.7 0.8 0.9 0.1 0.11 0.12 0.13 0.14 0.15 0.16 0.17 0.18 0.19 0.2 0.21 0.22 0.23 0.24 0.25 0.26 0.27 0.28
1.1 1.2 1.3 1.4 1.5 1.6 1.7 1.8 1.9 1.1 1.11 1.12 1.13 1.14 1.15 1.16 1.17 1.18 1.19 1.2 1.21 1.22 1.23 1.24 1.25 1.26 1.27 1.28
EDIT:
Expected output:
If an element in file 2 matches, print its column number; if not, print '0'.
1 2 0 0 0 0 0 0 0 10 11 12 13 14 15 16 17 18 19 20 0 0 0 0 0 0 0 0
0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0
Here is the code I am using. Note: it checks File 2 column-wise against File 1 and prints the column number on a match and '0' otherwise. It prints the output for every column to 28 different files.
#!/usr/bin/perl -w
chomp($file = "File1.txt");
open(FH, $file);
@k_org = <FH>;
chomp($hspfile = 'file2.txt');
open(FH1, $hspfile);
@hsporg = <FH1>;
for $z (1 .. 28) {
    open(OUT, ">$z.txt");
    foreach (@hsporg) {
        $i = 0;
        @h_org = split('\t', $_);
        chomp ($h_org[0]);
        foreach (@k_org) {
            @orginfo = split('\t', $_);
            chomp($orginfo[0]);
            if ($h_org[0] eq $orginfo[0]) {
                print OUT "$z\n";
                $i = 1;
                goto LABEL;
            } elsif ($h_org[0] ne $orginfo[0]) {
                if ($h_org[0] =~ /(\w+\s\w+)\s/) {
                    if ($orginfo[0] eq $1) {
                        print OUT "0";
                        $i = 1;
                        goto LABEL;
                    }
                }
            }
        }
        if ($i == 0) {
            print OUT "0";
        }
        LABEL:
    }
}
close FH;
close FH1;
close OUT;
If you sort(1) the files, you can then check them in a single pass. It should not take more than a couple of seconds (including the sort).
Another way is to load all values from file 1 into a hash. It's a bit more memory-consuming, especially if file 1 is large, but it should be fast (again, no more than a couple of seconds).
I would choose Perl over C for such a job, even though I'm more proficient in C than in Perl. It's much faster to code this kind of job in Perl, it's less error-prone, and it runs fast enough.
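A minimal sketch of the hash approach (assuming the file names from the question, whitespace-separated columns in file 2, and that an exact string match is what you want):
#!/usr/bin/perl
use strict;
use warnings;
use autodie;

open my $fh1, '<', 'File1.txt';
chomp( my @wanted = <$fh1> );
my %in_file1 = map { $_ => 1 } @wanted;   # build the lookup table once

open my $fh2, '<', 'file2.txt';
while (<$fh2>) {
    my @fields = split;
    # print the 1-based column number for a match, 0 otherwise
    print join( ' ', map { $in_file1{ $fields[$_] } ? $_ + 1 : 0 } 0 .. $#fields ), "\n";
}
Each hash lookup is effectively constant-time, so the total work grows with the number of values read rather than with the product of the two file sizes.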
This script runs a test case. Note that your expected output is objectively wrong: In file 2, line 1, column 20, the value 0.2 exists.
#!perl
use 5.010; # just for `say`
use strict; use warnings;
use Test::More;
# define input files + expected outcome
my $file_1_contents = <<'_FILE1_';
0.1
0.11
0.12
0.13
0.14
0.15
0.16
0.17
0.18
0.19
0.2
_FILE1_
my $file_2_contents = <<'_FILE2_';
0.1 0.2 0.3 0.4 0.5 0.6 0.7 0.8 0.9 0.1 0.11 0.12 0.13 0.14 0.15 0.16 0.17 0.18 0.19 0.2 0.21 0.22 0.23 0.24 0.25 0.26 0.27 0.28
1.1 1.2 1.3 1.4 1.5 1.6 1.7 1.8 1.9 1.1 1.11 1.12 1.13 1.14 1.15 1.16 1.17 1.18 1.19 1.2 1.21 1.22 1.23 1.24 1.25 1.26 1.27 1.28
_FILE2_
my $expected_output = <<'_OUTPUT_';
1 2 0 0 0 0 0 0 0 10 11 12 13 14 15 16 17 18 19 20 0 0 0 0 0 0 0 0
0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0
_OUTPUT_
# open the filehandles
open my $file1, "<", \$file_1_contents or die "$!";
open my $file2, "<", \$file_2_contents or die "$!";
open my $expected, "<", \$expected_output or die "$!";
my %file1 = map { chomp; 0+$_ => undef } <$file1>;
while (<$file2>) {
    chomp;
    my @vals = split;
    # If value exists in file1, print the col number.
    my $line = join " " => map { exists $file1{0+$vals[$_]} ? $_+1 : 0 } 0 .. $#vals;
    chomp(my $expected_line = <$expected>);
    is $line, $expected_line;
}
done_testing;
To print the exact same output to 28 files, you would remove the testing code, and rather
say {$_} $line for @filehandles;
instead.
Old answer
Your existing code is simply quite inefficient and unidiomatic. Let me help you fix that.
First, start all your Perl scripts with use strict; use warnings;, and if you have a modern perl (5.10 or later), you can add use 5.010; (or whatever your version is) to activate additional features.
The chomp call takes a variable and removes the current value of $/ (usually a newline) from the end of the string. This is important because the readline operator doesn't do that for us. It is not a tool for declaring a constant variable. Rather, do
my $file = "File1.txt";
my $hspfile = "File2.txt";
The use strict pragma forces you to declare your variables properly; you can do so with my.
To open a file, you should use the following idiom:
open my $fh, "<", $filename or die "Can't open $filename: $!";
Instead of or die ... you can use autodie at the top of your script.
This aborts the script if you can't open the file, tells you the reason ($!), and specifies an explicit open mode (here: < = read). This avoids bugs with special characters in filenames.
Lexical filehandles (in my variables, as contrasted to bareword filehandles) have proper scope, and close themselves. There are various other reasons why you should use them.
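As a small sketch of what that scoping buys you (a hypothetical block, just to illustrate the automatic close):
{
    open my $fh, "<", "File1.txt" or die "Can't open File1.txt: $!";
    my @lines = <$fh>;
}   # $fh goes out of scope here, so the file is closed automatically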
The split function takes a regex, not a string, as its first argument. If you inspect your program closely, you will see that you split each element in @hsporg 28 times, and each element in @k_org 28 × @hsporg times. This is extremely slow and unnecessary, as we can do the splitting beforehand.
If a condition is false, you don't need to explicitly test for the falseness again in
if ($h_org[0] eq $orginfo[0]) {
    ...;
} elsif ($h_org[0] ne $orginfo[0]) {
    ...;
}
as $a ne $b is exactly equivalent to not $a eq $b.
It is quite unidiomatic in Perl to use a goto, and jumping to a label somewhere isn't especially fast either. Labels are mainly used for loop control:
# random example
LOOP: for my $i (1 .. 10) {
    for my $j (1 .. 5) {
        next if $i == $j;           # start next iteration of current loop
        next LOOP if 2 * $i == $j;  # start next iteration of labeled loop
        last LOOP if $i + $j == 13; # like `break` in C
    }
}
The redo loop control verb is similar to next, but doesn't recheck the loop condition, if there is one.
Because of these loop control facilities, and the ability to break out of any enclosing loop, maintaining flags or writing elaborate gotos is usually quite unnecessary.
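For completeness, a tiny (contrived) illustration of redo; unlike next, it repeats the current iteration without advancing the loop variable:
my $tries = 0;
for my $task ('a', 'b') {
    $tries++;
    redo if $task eq 'b' and $tries < 4;   # repeat this iteration; $task stays 'b'
    print "$task finished after $tries passes\n";
}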
Here is a cleaned-up version of your script, without fixing too much of the actual algorithm:
#!/usr/bin/perl
use strict; use warnings;
use autodie; # automatic error messages

my ($file, $hspfile) = ("File1.txt", "file2.txt");
open my $fh1, "<", $file;
open my $fh2, "<", $hspfile;
my @k_org  = <$fh1>;
my @hsporg = <$fh2>;

# Presplit the contents of the arrays:
for my $arr (\@k_org, \@hsporg) {
    for (@$arr) {
        chomp;
        $_ = [ split /\t/ ]; # put an *anonymous arrayref* into each slot
    }
}

my $output_files = 28;
for my $z (1 .. $output_files) {
    open my $out, ">", "$z.txt";

    H_ORG:
    for my $h_org (@hsporg) {
        my $i = 0;

        ORGINFO:
        for my $orginfo (@k_org) {
            # elements in array references are accessed like $arrayref->[$i]
            if ($h_org->[0] eq $orginfo->[0]) {
                print $out "$z\n";
                $i = 1;
                last ORGINFO; # break out of this loop
            } elsif ($h_org->[0] =~ /(\w+\s\w+)\s/ and $orginfo->[0] eq $1) {
                print $out "0";
                $i = 1;
                last ORGINFO;
            }
        }
        print $out "0" if not $i;
    }
}
# filehandles are closed automatically.
Now we can optimize further: In each line, you only ever use the first element. This means we don't have to store the rest:
...;
for (@$arr) {
    chomp;
    $_ = (split /\t/, $_, 2)[0]; # save just the first element
}
...;

ORGINFO:
for my $orginfo (@k_org) {
    # the elements are now plain strings, so no arrayref access is needed
    if ($h_org eq $orginfo) {
        ...;
    } elsif ($h_org =~ /(\w+\s\w+)\s/ and $orginfo eq $1) {
        ...;
    }
}
Also, accessing scalars is a bit faster than accessing array elements.
The third arg to split limits the number of resulting fragments. Because we are only interested in the first field, we don't have to split the rest too.
Next, we last out of the ORGINFO loop, then check a flag. This is unnecessary: we can jump directly to the next iteration of the H_ORG loop instead of setting the flag. If we drop out of the ORGINFO loop naturally, the flag is guaranteed not to be set, so we can execute the print anyway:
H_ORG:
for my $h_org (@hsporg) {
    for my $orginfo (@k_org) {
        if ($h_org eq $orginfo) {
            print $out "$z\n";
            next H_ORG;
        } elsif ($h_org =~ /(\w+\s\w+)\s/ and $orginfo eq $1) {
            print $out "0";
            next H_ORG;
        }
    }
    print $out "0";
}
Then, you compare the same data 28 times to print it to different files. Better: Define two subs print_index and print_zero. Inside these, you loop over the output filehandles:
# make this initialization *before* you use the subs!
my @filehandles = map { open my $fh, ">", "$_.txt"; $fh } 1 .. $output_files;
...; # the H_ORG loop
sub print_index {
    for my $i (0 .. $#filehandles) {
        print {$filehandles[$i]} $i+1, "\n";
    }
}
sub print_zero {
    print {$_} 0 for @filehandles;
}
Then:
# no enclosing $z loop!
H_ORG:
for my $h_org (@hsporg) {
    for my $orginfo (@k_org) {
        if ($h_org eq $orginfo) {
            print_index();
            next H_ORG;
        } elsif ($h_org =~ /(\w+\s\w+)\s/ and $orginfo eq $1) {
            print_zero();
            next H_ORG;
        }
    }
    print_zero();
}
This avoids checking data you already know doesn't match.
In C you could try using the qsort and bsearch functions.
First you need to load your files into an array.
Then you should run qsort() (unless you are sure the elements are already ordered), and use bsearch() to do a binary search in your array.
http://linux.die.net/man/3/bsearch
This will be much faster than checking all elements one by one.
You could try to implement a binary search in Perl if one does not already exist; it is a simple algorithm.
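As a sketch, a numeric binary search in Perl could look like this (a hypothetical helper, assuming the list is already sorted in ascending numeric order):
sub binary_search {
    my ($sorted, $target) = @_;            # $sorted is an arrayref
    my ($lo, $hi) = (0, $#$sorted);
    while ($lo <= $hi) {
        my $mid = int( ($lo + $hi) / 2 );
        if    ($sorted->[$mid] < $target) { $lo = $mid + 1 }
        elsif ($sorted->[$mid] > $target) { $hi = $mid - 1 }
        else                              { return $mid }   # found
    }
    return -1;                             # not found
}
my @sorted = sort { $a <=> $b } (0.2, 0.1, 0.13, 0.11);
print binary_search(\@sorted, 0.11), "\n";   # prints 1
That said, for this particular problem the hash lookup shown earlier is simpler and already gives effectively constant-time checks per value.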
