Printing lines containing the least number in groups - AWK/SED/PERL - loops

I would like to print only the lines containing the least number in each group. My file contains multiple columns, and I use the first column to determine groups. Say the 1st, 4th, and 6th lines are in the same group because the content of the first column is the same. My goal is to print, for each group, the line that contains the least number in the second column.
file.txt:
VDDA 0.7 ....
VDDB 0.2 ....
VDDB 0.3 ....
VDDA 0.4 ....
VSS 0.1 ....
VDDA 0.2 ....
VSS 0.2 ....
output.txt:
VDDA 0.2 ....
VDDB 0.2 ....
VSS 0.1 ....
I think I can do this job with C using a for loop and comparisons, but I think there is a better way using AWK/SED/PERL.

If you don't care about preserving the order in which the 1st field appears in Input_file, the following may help. It finds the smallest second-column value for each distinct first field and prints it:
awk '!($1 in a) || $2 < a[$1] {a[$1] = $2} END {for (i in a) print i, a[i]}' Input_file
EDIT1: If you want the output in the same order in which $1 appears in the input, the following may help:
awk '!($1 in a){b[++i]=$1; a[$1]=$2; c[$1]=$0; next} $2 < a[$1] {a[$1]=$2; c[$1]=$0} END{for(j=1;j<=i;j++) print c[b[j]]}' Input_file
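A different sketch of the same task, assuming GNU sort (for the -g general-numeric flag): sort by group and then numerically by the second column, and let awk keep only the first line it sees for each group. The sample file below reuses the question's data with the trailing "...." columns dropped for brevity.

```shell
# Build the question's sample file (trailing columns shortened), then
# sort by group (field 1) and numerically by field 2; the first line
# seen per group is that group's minimum.
printf '%s\n' 'VDDA 0.7' 'VDDB 0.2' 'VDDB 0.3' 'VDDA 0.4' \
              'VSS 0.1' 'VDDA 0.2' 'VSS 0.2' > file.txt
sort -k1,1 -k2,2g file.txt | awk '!seen[$1]++'
```

This keeps the whole line and is easy to reason about, at the cost of a full sort rather than a single pass over the data.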

$ awk '{split(a[$1],s);a[$1]=(s[2]<$2 && s[2])?a[$1]:$0} END{for(i in a)print a[i]}' file.txt
VDDA 0.2 ....
VDDB 0.2 ....
VSS 0.1 ....
Brief explanation:
Save $0 into a[$1]
split(a[$1],s): splits the saved line into array s, so that s[2] holds its numeric second field for comparison
if the condition s[2]<$2 && s[2] is met (a saved value exists and is smaller), keep a[$1] unchanged; otherwise store the current line: a[$1]=$0

With GNU datamash tool:
Assuming the following example input file containing rows with 5 columns:
VDDA 0.7 c1 2 a
VDDB 0.2 c2 3 b
VDDB 0.3 c4 5 c
VDDA 0.4 c5 6 d
VSS 0.1 c6 7 e
VDDA 0.2 c7 8 f
VSS 0.2 c8 9 g
datamash -sWf -g1 min 2 < file | awk '{--NF}1'
The output:
VDDA 0.2 c7 8 f
VDDB 0.2 c2 3 b
VSS 0.1 c6 7 e


Is there a way to compare two arrays based on user input in matlab

MATLAB software
i=[0 1.264241 1.729329 1.900426 1.963369 1.986524 1.995042 1.998176 1.999329 1.999753 1.999909];
t=[0 0.2 0.4 0.6 0.8 1 1.2 1.4 1.6 1.8 2];
How can I retrieve the value of i when the user input is a value from the t array, given that corresponding elements of the two arrays share the same position?
For example, if the user enters 0.2, the program should return the value 1.264241 from array i.
You can use input to get the user to enter a number, and ismembertol to find the number's index in t. Once you have the index, you can get the corresponding value in i. You could even throw an error if the number entered is not found in t. Here's an example:
i=[0 1.264241 1.729329 1.900426 1.963369 1.986524 1.995042 1.998176 1.999329 1.999753 1.999909];
t=[0 0.2 0.4 0.6 0.8 1 1.2 1.4 1.6 1.8 2];
x = input('Enter number:\n');
[~,ind] = ismembertol(x,t);
if ind > 0
fprintf('Corresponding number in i is %g\n', i(ind))
else
error('Number not found in t')
end

Create an array with a sequence of numbers in bash

I would like to write a script that will create me an array with the following values:
{0.1 0.2 0.3 ... 2.5}
Until now I was using a script as follows:
plist=(0.1 0.2 0.3 0.4)
for i in "${plist[@]}"; do
echo "submit a simulation with this parameter:"
echo "$i"
done
But now I need the list to be much longer (but still with constant intervals).
Is there a way to create such an array in a single command? What is the most efficient way to create such a list?
Using seq you can say seq FIRST STEP LAST. In your case:
seq 0 0.1 2.5
Then it is a matter of storing these values in an array:
vals=($(seq 0 0.1 2.5))
You can then check the values with:
$ printf "%s\n" "${vals[@]}"
0,0
0,1
0,2
...
2,3
2,4
2,5
Yes, my locale is set to use commas instead of dots for decimals. This can be changed by setting LC_NUMERIC="en_US.UTF-8".
By the way, brace expansion also allows setting an increment. The problem is that it has to be an integer:
$ echo {0..15..3}
0 3 6 9 12 15
Bash supports C-style for loops:
$ for ((i=1;i<5;i+=1)); do echo "0.${i}" ; done
0.1
0.2
0.3
0.4
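If seq's locale behaviour is a concern, the C-style loop above can also generate the whole 0.1 to 2.5 list using only integer arithmetic, so the decimal separator is always a dot regardless of LC_NUMERIC. A sketch:

```shell
# Generate 0.1 .. 2.5 in steps of 0.1 using only integer arithmetic;
# bash has no floating point, so split each counter value into its
# whole part (i / 10) and fractional part (i % 10).
vals=()
for ((i = 1; i <= 25; i++)); do
  vals+=("$((i / 10)).$((i % 10))")
done
echo "${vals[@]}"
```

This only works for a step of one tenth; for other steps the divisor and modulus would need adjusting.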
Complementing the main answer
In my case, seq was not the best choice.
To produce a sequence, you can also use the jot utility. However, this command has a more elaborate syntax.
# 1 2 3 4
jot - 1 4
# 3 evenly distributed numbers between 0 and 10
# 0 5 10
jot 3 0 10
# a b c ... z
jot -c - 97 122

how to split file in arrays and find maximum value in each of them

I have a file:
1 0.5
2 0.7
3 0.55
4 0.7
5 0.45
6 0.8
7 0.75
8 0.3
9 0.35
10 0.5
11 0.65
12 0.75
I want to split the file into 4 groups, each ending on every 3rd line, and then find the maximum value in the second column for every group. For this file the outcome would be:
3 0.7
6 0.8
9 0.75
12 0.75
I have managed so far to split the file into several by
awk 'NR%3==1{x="L"++i;}{print > x}' filename
then to find the maximum in every file:
awk 'BEGIN{max=0}{if(($2)>max) max=($2)}END {print $1,max}'
However, this creates additional files, which is fine for this example, but the real file contains 65 million lines, so I would be overwhelmed by the number of files. I am trying to avoid that by writing a short script that combines both of the steps above.
I tried this one:
awk 'BEGIN {for (i=1; i<=12; i+=3) {max=0} {if(($2)>max) max=($2)}}END {print $1,max}' Filename
but it produces something irrelevant.
So if you can help me out it will be much appreciated!
You could go for something like this:
awk 'NR % 3 == 1 || $2 > max {max = $2} NR % 3 == 0 {print $1, max}' file
max is reset at the start of every group of three rows and then updated whenever the value of the second column exceeds it. At the end of every group of three, the first column and max are printed.
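The same idea works with the group size as a parameter (-v n=3), sketched here against the first six lines of the sample data so it can be run directly:

```shell
# First two groups of the question's data.
printf '%s\n' '1 0.5' '2 0.7' '3 0.55' '4 0.7' '5 0.45' '6 0.8' > blocks.txt
# Reset max at each group start (NR % n == 1), update it whenever
# column 2 is larger, and print the group's last row number with max.
awk -v n=3 'NR % n == 1 || $2 > max { max = $2 }
            NR % n == 0 { print $1, max }' blocks.txt
```

Note that nothing is printed until a group completes, so a trailing partial group is silently dropped; an END block would be needed if that matters for the 65-million-line file.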

Counting and manipulating occurrences in text file (Perl)

I have a tab separated text file that is like
1J L 0.5
1J P 0.4
1J K 0.2
1J L 0.3
1B K 0.7
1B L 0.2
1B P 0.3
1B L 0.6
1B L 0.3
And I want to manipulate it in order to get the following information:
For each element in the 1st column, count how many repeated elements in the second column there are, and do the average of all numbers in the third column for each element of the second column. The desired output can be another tab separated text file, where "Average" is the average number for that element in the 2nd column:
1st K# Average L# Average P# Average
1J 1 0.2 2 0.4 1 0.4
1B 1 0.7 3 0.38 1 0.3
How should I proceed? I thought about doing a Hash of Arrays with key = 1st column, but I don't think this would be too advantageous.
I also thought about creating multiple arrays named @L, @P, @K to count the occurrences of each of these elements for each element of the 1st column, and other arrays @Ln, @Pn, @Kn that would collect all the numbers for each. In the end, the sum of the numbers divided by scalar @L would give me the average.
But my main problem in these is: how can I do all of this processing for each element of the 1st column?
Edit: another possibility (which I am trying right now) is to create an array of all unique elements of the first column, then grep for each one and do the processing. But there may be easier ways?
Edit2: it may happen that some elements of the second column do not exist for some elements in the first column - problem: division by 0. E.g.:
1J L 0.5
1J P 0.4
1J K 0.2
1J L 0.3
1B K 0.7
1B L 0.2
1B L 0.3 <- note that this is not P as in the example above.
1B L 0.6
1B L 0.3
Here is a way to go:
use feature 'say';  # needed for say() below
my $result;
while(<DATA>){
chomp;
my @data = split;
$result->{$data[0]}{$data[1]}{sum} += $data[2];
$result->{$data[0]}{$data[1]}{nbr}++;
}
say "1st\tK#\tavg\tL#\tavg\tP#\tavg";
foreach my $k(keys %$result) {
print "$k\t";
for my $c (qw(K L P)) {
if (exists($result->{$k}{$c}{nbr}) && $result->{$k}{$c}{nbr} != 0) {
printf("%d\t%.2f\t",$result->{$k}{$c}{nbr},$result->{$k}{$c}{sum}/$result->{$k}{$c}{nbr});
} else {
printf("%d\t%.2f\t",0,0);
}
}
print "\n";
}
__DATA__
1J L 0.5
1J P 0.4
1J K 0.2
1J L 0.3
1B K 0.7
1B L 0.2
1B P 0.3
1B L 0.6
1B L 0.3
output:
1st K# avg L# avg P# avg
1B 1 0.70 3 0.37 1 0.30
1J 1 0.20 2 0.40 1 0.40
Untested code:
while (<>) {
chomp;
($x, $y, $z) = split /\t/;
push @{$f{$x}{$y}}, $z; # E.g. $f{'1J'}{'L'}[1] will be 0.3
}
@cols = qw/L P K/;
foreach $x (sort keys %f) {
print "$x\t";
foreach $y (@cols) {
$t = $n = 0;
foreach $z (@{$f{$x}{$y}}) {
$t += $z;
++$n;
}
$avg = $n ? $t / $n : 'N/A';
print "$n\t$avg\t";
}
print "\n";
}
For each of the count and sum I would use a Hash of Hashes where the first column is the key to the outer hash and the second column is the key to the inner hash. So something like:
my (%count, %sum);
while(<>) {
my @F = split ' ', $_;  # awk-like split on whitespace, tabs included
$count{$F[0]}->{$F[1]}++;
$sum{$F[0]}->{$F[1]} += $F[2];
}
for my $key (keys %count) {
print $key;
for my $subkey ("K", "L", "P") {
my $average = defined($count{$key}->{$subkey}) ? $sum{$key}->{$subkey} / $count{$key}->{$subkey} : 0;
...; # and print the result
}
print "\n";
}
I am sorry I did this - really - but here is a "one-liner" (ahem) that I will try to translate into a real script and explain - as an exercise for myself :-) I hope this admittedly artificial example of a one line solution adds something to the more clearly written and scripted examples submitted by the others.
perl -anE '$seen{$F[0]}->{$F[1]}++; $sum{$F[0]}->{$F[1]} += $F[2];}{
for(keys %seen){say " $_:"; for $F1(sort keys %{$seen{$_}}) {
say "$F1","s: $seen{$_}->{$F1} avg:",$sum{$_}->{$F1}/$seen{$_}->{$F1}}}' data.txt
See perlrun(1) for a more detailed explanation of Perl's switches. Essentially, perl -anE starts Perl in "autosplit" mode (-a) and creates a while <> loop to read input (-n) for the code that is executed between the ' ' quotes. The -E turns on all the newest bells and whistles for execution (normally one uses -e). Here's my attempt at explaining what it does.
First, in the while loop this (sort of) "oneliner":
autosplits input into an array (@F ... awkish for "fields" I guess) using whitespace as the delimiter.
uses the %seen trick to count occurrences: it increments the nested %seen entry keyed by column one ($F[0]) and column two ($F[1]) of @F each time that pair of values appears on a line
uses a hash %sum (or %total) to add up the values in column three ($F[2]) using the += operator. See this perlmonks node for another example.
Then it breaks out of the while <> loop created with -n by using a "butterfly" }{ that acts like an END block, allowing a nested for loop to spit everything out. I use $F1 as the subkey for the inner for loop to remind myself that I'm getting it from the second column of the autosplit array @F.
Output (we need printf to get nicer numerical results):
1B:
Ks: 1 avg:0.7
Ls: 3 avg:0.366666666666667
Ps: 1 avg:0.3
1J:
Ks: 1 avg:0.2
Ls: 2 avg:0.4
Ps: 1 avg:0.4
This makes the numbers look nicer (using printf to format)
perl -anE '$seen{$F[0]}->{$F[1]}++; $sum{$F[0]}->{$F[1]} += $F[2];}{
for(keys %seen){say " $_:"; for $F1(sort keys %{$seen{$_}}) {
printf("%ss %d avg: %.2f\n", $F1, $seen{$_}->{$F1}, $sum{$_}->{$F1}/$seen{$_}->{$F1})}}' data.txt
Script version. It increments the values of repeated keys drawn from column two ($fields[1]) of the data; in a second hash it sums the values drawn from column three ($fields[2]). I wanted to impress you with a more functional style or the exact right CPAN module for the job but had to $work. Cheers and be sure to ask more Perl questions!
#!/usr/bin/env perl
use strict;
use warnings;
my %seen ;
my %sum ;
while(<DATA>){
my @fields = split;
$seen{$fields[0]}{$fields[1]}++ ;
$sum{$fields[0]}{$fields[1]} += $fields[2];
}
for(keys %seen) {
print " $_:\n";
for my $f (sort keys %{$seen{$_}}) {
printf("%ss %d avg: %.2f\n",
$f, $seen{$_}->{$f}, $sum{$_}->{$f}/$seen{$_}->{$f} );
}
}
__DATA__
1J L 0.5
1J P 0.4
1J K 0.2
1J L 0.3
1B K 0.7
1B L 0.2
1B L 0.3
1B L 0.6
1B L 0.3
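For comparison with the Perl answers above, the same count/sum/average bookkeeping fits in a single awk pass. This is a sketch with the subcolumn names K, L, P hard-coded as in the question; since awk's for-in order is unspecified, the output is piped through sort.

```shell
# Recreate the question's tab-separated input.
printf '1J\tL\t0.5\n1J\tP\t0.4\n1J\tK\t0.2\n1J\tL\t0.3\n' > pairs.txt
printf '1B\tK\t0.7\n1B\tL\t0.2\n1B\tP\t0.3\n1B\tL\t0.6\n1B\tL\t0.3\n' >> pairs.txt
awk -F'\t' '
  # Count and sum per (group, subcolumn) pair; remember each group.
  { cnt[$1 SUBSEP $2]++; sum[$1 SUBSEP $2] += $3; groups[$1] }
  END {
    n = split("K L P", col, " ")
    for (g in groups) {
      line = g
      for (i = 1; i <= n; i++) {
        c = cnt[g SUBSEP col[i]] + 0            # 0 when pair never seen
        avg = c ? sprintf("%.2f", sum[g SUBSEP col[i]] / c) : "0.00"
        line = line "\t" c "\t" avg
      }
      print line
    }
  }' pairs.txt | sort
```

The `c ? ... : "0.00"` guard covers the asker's Edit2 case, where a subcolumn is missing for some group and a naive average would divide by zero.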

merge / append files and re-number first column in unix

I have many (3 is just an example) text files in different directories (3 different names), like the following:
Directory: A, file name: run.txt, format: txt, tab-delimited
; file one
10 0.2 0.5 0.3
20 0.1 0.6 0.8
30 0.2 0.1 0.1
40 0.1 0.5 0.3
Directory: B, file name: run.txt, format: txt, tab-delimited
; file two
10 0.2 0.1 0.2
30 0.1 0.6 0.8
50 0.2 0.1 0.1
70 0.3 0.4 0.4
Directory: C, file name: run.txt, format: txt, tab-delimited
; file three
10 0.3 0.3 0.3
20 0.3 0.6 0.8
30 0.1 0.1 0.1
40 0.2 0.2 0.3
I want to combine all three run.txt files into single and renumber the first column. The resulting new file will look like:
; file combined
10 0.2 0.5 0.3
20 0.1 0.6 0.8
30 0.2 0.1 0.1
40 0.1 0.5 0.3
50 0.2 0.1 0.2
70 0.1 0.6 0.8
90 0.2 0.1 0.1
110 0.3 0.4 0.4
120 0.3 0.3 0.3
130 0.3 0.6 0.8
140 0.1 0.1 0.1
150 0.2 0.2 0.3
This what my codes are at:
cat A/run.txt B/run.txt C/run.txt > combined.txt
(1) I do not know how to take care of renumbering by first column
(2) Also I do not how to take care of comment starting with ";"
Edit:
Let me be clear about the number scheme:
A/run.txt, B/run.txt and C/run.txt are actually parallel runs to be combined into one,
so each file stores samples with a run number. However, the gap can be uneven between runs.
(1) for the first file A/run.txt the gap is 10 (20-10, 30-20):
10, 10+10, 20+10, 30+10
(2) the second file B/run.txt starts from 10 but has a gap of 20
(e.g. 30-10, 50-30, 70-50):
40 (from the last line of the first file) + 10 (first in file two) = 50,
50 + 20 = 70, 70 + 20 = 90, 90 + 20 = 110
(3) file C/run.txt starts from 10 and the increment is 10:
110 (last number in file 2) + 10 = 120, 120 + 10 = 130,
130 + 10 = 140, 140 + 10 = 150
You could use awk:
awk 'BEGIN{print "; file combined"} !/^;/{l+=10; print l,$2,$3,$4}' A/run.txt B/run.txt C/run.txt > combined.txt
EDIT
I made a guess about your numbering scheme (you've provided still no spec) and come up with:
awk 'BEGIN{print "; file combined"} FNR==1{offset=line} !/^;/{line=offset+$1; print line,$2,$3,$4}' \
A/run.txt B/run.txt C/run.txt > combined.txt
Is it what you mean?
#!/usr/bin/awk -f
BEGIN {
OFS = "\t"
printf "%s\n", "; file combined"
}
! /^;/ {
if (FILENAME != prevfile) {
prevnum = $1
prevfile = FILENAME
interval = 10
c = 0
}
c++
if (c == 2) {
interval = $1 - prevnum
}
$1 = (i += interval)
print
}
To run it:
$ ./renumber {A,B,C}/run.txt
Given your sample input, it produces output that exactly matches your sample.
awk '!/^;/{$1=++n*10; print}' A/run.txt B/run.txt C/run.txt > combined.txt
This might work for you:
awk 'FNR==1{i=l} /^;/{next} {$1+=i; l=$1; print}' A/run.txt B/run.txt C/run.txt > combined.txt
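A minimal runnable sketch of the offset idea the last answers rely on: when a new file starts, freeze the last emitted number as an offset and add each file's own first-column values on top of it. The two-file setup below (directories A and B, two data rows each) is an assumption for the demo, trimmed from the question's sample.

```shell
# Two small run.txt files, as in the question.
mkdir -p A B
printf '; file one\n10 0.2 0.5 0.3\n20 0.1 0.6 0.8\n' > A/run.txt
printf '; file two\n10 0.2 0.1 0.2\n30 0.1 0.6 0.8\n' > B/run.txt
# FNR==1 fires on each new file's first line: freeze the running
# offset there, skip ";" comment lines, renumber everything else.
awk 'BEGIN { print "; file combined" }
     FNR == 1 { offset = last }
     /^;/ { next }
     { $1 += offset; last = $1; print }' A/run.txt B/run.txt
```

Because each file's own gaps are carried through unchanged (B's 10 and 30 become 30 and 50 here), this matches the uneven-gap scheme described in the question's edit.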
