I wrote a bash script that is supposed to calculate the statistical average and median of each column of an input file. The input file format is shown below. Each number is separated by a tab.
1 2 3
3 2 8
3 4 2
My approach is to first transpose the matrix, so that rows become columns and vice versa. The transposed matrix is stored in a temporary text file. Then I calculate the average and median for each row. However, the script gives me the wrong output. First, the array that holds the average and median for each column only produces one output. Second, the median value calculated is incorrect.
After some code inspection and testing, I discovered that while the transposed matrix does get written to the text file, it is not read back correctly by the script. Specifically, each line read yields only one number. Below is my script.
#if column is chosen instead
if [[ $initial == "-c" ]]
then
echo "Calculating column stats"

#transpose columns to rows to make life easier
WORD=$(head -n 1 $filename | wc -w); #counts the number of columns
for((index=1; index<=$WORD; index++)) #loop over the number of columns
do
awk '{print $'$index'}' $filename | tr '\n' ' ';echo; #compact way of performing a row-col transposition
#prints the column as determined by $index, then replaces each newline with a space
done > tmp.txt

array=()
averageArray=()
medianArray=()
sortedArray=()

#calculate average and median, just like the one used for rows
while read -a cols
do
total=0
sum=0

for number in "${cols[@]}" #for every item in the transposed column
do
(( sum += $number )) #the total sum of the numbers in the column
(( total++ )) #the number of items in the column
array+=( $number )
done

sortedArray=( $( printf "%s\n" "${array[@]}" | sort -n) )
arrayLength=${#sortedArray[@]}
#echo sorted array is $sortedArray
#based on array length, construct the median array
if [[ $(( arrayLength % 2 )) -eq 0 ]]
then #even
upper=$(( arrayLength / 2 ))
lower=$(( (arrayLength/2) - 1 ))
median=$(( (${sortedArray[lower]} + ${sortedArray[upper]}) / 2 ))
#echo median is $median
medianArray+=$index
else #odd
middle=$(( (arrayLength) / 2 ))
median=${sortedArray[middle]}
#echo median is $median
medianArray+=$index
fi
averageArray+=( $((sum/total)) ) #the final row array of averages that is displayed

done < tmp.txt
fi
Thanks for the help.
I am grateful for the many prompt solutions contributed to my earlier question (AWK: print ALL rows with MAX value in one field per the other field, including identical rows with max value)!
This question involves data with one more column. I'd like to keep the rows with the highest value in column 2 per column 1, including identical rows with the max value, and print all columns.
Data
a 130 data1
a 55 data2
a 66 data3
b 88 data4
b 99 data5
b 99 data6
c 110 data7
c 130 data8
c 130 data9
Desired output
a 130 data1
b 99 data5
b 99 data6
c 130 data8
c 130 data9
Code from @jared_mamrot works perfectly and prints all columns.
awk 'NR==FNR{if($2 > max[$1]){max[$1]=$2}; next} max[$1] == $2' file file
Code @Andre Wildberg provided also works perfectly and prints all columns.
awk 'arr[$1] < $2{arr[$1] = $2}
arr[$1] == $2{n[$1,arr[$1]]++; line[$1,arr[$1],n[$1,arr[$1]]] = $0}
END{for(i in arr){
j=0; do{j++; print line[i,arr[i],j]} while(j < n[i,arr[i]])}}' file
The awk script below by @Ed Morton also works perfectly for my previous data with 2 columns. It prints two columns: key and val.
My further question is: when I have multiple columns in the data, how should I modify this script to print all columns?
sort file | awk '
{ cnt[$1,$2]++; max[$1]=$2 }
END { for (key in max) { val=max[key]; for (i=1; i<=cnt[key,val]; i++) print key, val } }
'
Thank you all for the great help!
Using any awk and sort:
$ sort -k1,1 -k2,2nr file | awk '!seen[$1]++{max=$2} $2==max'
a 130 data1
b 99 data5
b 99 data6
c 130 data8
c 130 data9
or:
$ sort -k1,1 -k2,2nr file | awk '$1!=prev{prev=$1; max=$2} $2==max'
a 130 data1
b 99 data5
b 99 data6
c 130 data8
c 130 data9
original script before realising I'd over-thought it:
$ sort -k1,1 -k2,2nr file | awk '!seen[$1]++{key=$1; max=$2} $1==key && $2==max'
a 130 data1
b 99 data5
b 99 data6
c 130 data8
c 130 data9
The value of seen[$1]++ is 0 the first time any given value of $1 appears in the input, and some incremental non-zero number when that same $1 appears again. So, the value of !seen[$1]++ is 1 (i.e. true in a conditional context) the first time a given $1 is seen in the input, and 0 (false) afterwards. So, the first time a appears as $1 we set key to a and max to whatever value $2 has, i.e. 130 in this case. That's it for the involvement of !seen["a"]++.
From then on we just print every line for which $1 is a and $2 is 130, which in this case is just the first line of input.
Then the same happens when b is first seen as $1.
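The idiom can be demonstrated in isolation with a tiny made-up input:

```shell
# Each line is printed only if its first field has not been seen before:
# seen[$1]++ is 0 (so !seen[$1]++ is true) exactly on its first appearance.
printf 'a 1\na 2\nb 3\n' | awk '!seen[$1]++'
# prints:
# a 1
# b 3
```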
You just need one additional associative array that stores the whole line, keyed by the first two columns plus a running counter kept in the cnt array:
awk '{
map[$1,$2,++cnt[$1,$2]] = $0
max[$1] = ($2 > max[$1] ? $2 : max[$1])
}
END {
for (key in max) {
val = max[key]
for (i=1; i<=cnt[key,val]; i++)
print map[key,val,i]
}
}' file
a 130 data1
b 99 data5
b 99 data6
c 130 data8
c 130 data9
There is no need to sort the file for this awk solution.
Assuming there may be more than 3 fields to a row:
$ cat file
a 130 data1
a 55 data2
a 66 data3
b 88 data4
b 99 data5
b 99 data6
c 110 data7
c 130 data8
c 130 data9 data10 data11
One idea for modifying the current awk code:
awk '
{ key=$1; val=$2 # save 1st two fields
$1=$2="" # clear 1st two fields
gsub(/^[[:space:]]+/,"") # remove leading white space from line
++cnt[key,val]
max[key]=(val > max[key] ? val : max[key])
row[key,val,cnt[key,val]]=$0 # save rest of line
}
END { for (key in max) {
val=max[key]
for (i=1; i<=cnt[key,val]; i++)
print key, val, row[key,val,i]
}
}
' file
This generates:
a 66 data3
b 99 data5
b 99 data6
c 130 data8
c 130 data9 data10 data11
awk '
$1 != firstcol{ firstcol=$1; max=$2; map[NR]=$0 }
$1 == firstcol{
if($2>max){ map[NR--]=$0; max=$2 }
if($2==max) map[NR]=$0
}
END{
for(i in map) print map[i]
}
' inputfile
a 130 data1
b 99 data5
b 99 data6
c 130 data8
c 130 data9
The same ruby works with minor adjustments:
ruby -e '
grps=$<.read.split(/\R/).
group_by{|line| line[/^\S+/]}
# {"a"=>["a 130 data1", "a 55 data2", "a 66 data3"], "b"=>["b 88 data4", "b 99 data5", "b 99 data6"], "c"=>["c 110 data7", "c 130 data8", "c 130 data9"]}
maxes=grps.map{|k,v| v.max_by{|s| s.split[1].to_f}}.map{|s| s.split[0..1] }
# [["a", "130"], ["b", "99"], ["c", "130"]]
grps.values.flatten.each{|s| puts s if maxes.include?(s.split[0..1])}
' file
Prints:
a 130 data1
b 99 data5
b 99 data6
c 130 data8
c 130 data9
Once you start getting into 3 or more columns to manage, it is easier to use ruby (or Perl, Python, etc) because of the support for slicing, grouping and joining arrays.
I want to keep only the lines in results.txt that match the IDs in uniq.txt, based on matches in column 3 of results.txt. Usually I would use grep -f uniq.txt results.txt, but this does not restrict the match to column 3.
uniq.txt
9606
234831
131
31313
results.txt
readID seqID taxID score 2ndBestScore hitLength queryLength numMatches
A00260:70:HJM2YDSXX:4:1111:15519:16720 NC_000011.10 9606 169 0 28 151 1
A00260:70:HJM2YDSXX:3:1536:9805:14841 NW_021160017.1 9606 81 0 24 151 1
A00260:70:HJM2YDSXX:3:1366:27181:24330 NC_014803.1 234831 121 121 26 151 3
A00260:70:HJM2YDSXX:3:1366:27181:24330 NC_014973.1 443143 121 121 26 151 3
With your shown samples, please try the following code.
awk 'FNR==NR{arr[$0];next} ($3 in arr)' uniq.txt results.txt
Explanation:
awk ' ##Starting awk program from here.
FNR==NR{ ##Checking condition which will be TRUE when uniq.txt is being read.
arr[$0] ##Creating array with index of current line.
next ##next will skip all further statements from here.
}
($3 in arr) ##If 3rd field is present in arr then print line from results.txt here.
' uniq.txt results.txt ##Mentioning Input_file names here.
2nd solution: In case the field number is not fixed in results.txt and you want to search for the values in the whole line, try the following.
awk 'FNR==NR{arr[$0];next} {for(key in arr){if(index($0,key)){print;next}}}' uniq.txt results.txt
You can use grep in combination with sed to manipulate the input patterns and achieve what you're looking for.
grep -Ef <(sed -e 's/^/^(\\S+\\s+){2}/;s/$/\\s*/' uniq.txt) result.txt
If you want to match nth column, replace 2 in above command with n-1
outputs
A00260:70:HJM2YDSXX:4:1111:15519:16720 NC_000011.10 9606 169 0 28 151 1
A00260:70:HJM2YDSXX:3:1536:9805:14841 NW_021160017.1 9606 81 0 24 151 1
A00260:70:HJM2YDSXX:3:1366:27181:24330 NC_014803.1 234831 121 121 26 151 3
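For reference, this is the pattern the sed step builds from a single ID (shown here for 9606, the first ID in uniq.txt); the (\S+\s+){2} prefix is what skips the first two whitespace-separated fields:

```shell
# Print the regex that the process substitution feeds to grep -E
# for one input ID.
echo 9606 | sed -e 's/^/^(\\S+\\s+){2}/;s/$/\\s*/'
# prints: ^(\S+\s+){2}9606\s*
```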
Edit:
I've change my definition of allVals to the somewhat cleaner/simpler:
allVals=( $( printf '%s\n' \
$( printf '%s\n' \
{0..9}{0..9}{0,3,5,7} | \
sed -e 's#^0*##g' ) | \
awk '$1>='"$valMin"' && $1<='"$valMax" ) \
${exptVals[@]} )
I have a short BASH script used to produce space separated configuration files for a secondary executable. The scripted part is determining which values to print to column 1.
To accomplish this my script uses brace expansion to create an array of integers with the following rules:
Numbers are no more than 3 digits.
Numbers are integer (no decimal).
Products of 5 must be included in series.
I need at least one evenly spaced point between the ##0 and ##5 (i.e. a number ending in 3 or 7, as appropriate).
I use sed to clean up the case where the second most significant digit is blank (I'll probably replace the ' ' with '0' and write a simpler equivalent by removing leading '0's when I get around to it...).
Anyhow, these are values I input into a second program to produce computed predictions for certain properties. I also want to be sure to include numbers corresponding to certain experimental values I have... so I do that by creating an array of experimental values and then merging the two arrays together, sorting them and removing redundant values.
The script is given below (it was a oneliner -- I've edited it into script form for readability below):
#!/bin/bash
lineItem5=61
valMax=433
valMin=260
exptVals=( 257 261 265 269 273 277 281 285 289 293 297 \
301 305 309 313 317 321 325 329 333 337 341 \
345 349 353 357 361 365 369 373 377 381 385 \
389 393 397 401 405 409 413 417 421 425 429 \
433 )
allVals=( $( printf '%s\n' \
$( printf '%s\n' {' ',{1..9}}{' ',{1..9}}{0,3,5,7} | \
sed -e 's# \([1-9]\) 0 [1-9] 3 [1-9] 5 [1-9] 7 # \100 \103 \105 \107 #g' ) | \
awk '$1>='"$valMin"' && $1<='"$valMax" ) \
${exptVals[@]} )
sortVals=( $( printf '%s\n' ${allVals[@]} | sort -nr | uniq ) )
for ((t=0;t<${#sortVals[@]};t++)); do
printf '%s\n' "${sortVals[t]}"' -4000 -4000 200 '"${lineItem5}"' -1.0'
done
unset exptVals allVals sortVals
It works, but I would like to cut down on the number of lines (which equate to evaluated points and hence computation cost) and improve the spacing of values (which improves my statistical accuracy as each point of outputted properties depends on the previous calculations).
Specifically I'd like to remove the value ##7 if the sequence ##7 ##8 is encountered, and likewise ##3 if the sequence ##2 ##3... but only if the ##3 or ##7 value is not found in my list of experimental values. Also I want to change ##3 ##4 to ##2 ##4 and ##6 ##7 to ##6 ##8 to improve the spacing -- but only if the ##3 or ##7 are not in the experimental sequence.
So far the best way to do this I can think of is doing something like
valStart=0
for ((e=0; e<${#exptVals[@]}; e++)); do
for ((v=valStart; v<${#allT[@]}; v++)); do
if [[ ${allVals[v]} -ge ${exptVals[$((e+1))]} ]]; then
valStart=v
break
else
#Do edits to list here...
fi
done
done
The code isn't finished, but I think it would be moderately efficient as I don't have to loop through the second list entirely... just a small stretch of it (my experimental list is in order).
But I feel like there are easier ways to delete 'Y' from 'X Y' if 'Y' is not in array $vals or change 'Y' to 'Z' for 'X Y' if 'Y' is not in array $vals?
Is there a simple way to in a single expression using some sort of built in accomplish:
delete 'Y' from 'X Y' if 'Y' is not in array $vals
change 'Y' to 'Z' for 'X Y' if 'Y' is not in array $vals
...which does not involve looping through the values in bash-style loops (my brute-force method)?
The script you have makes calls to sed and awk to remove the spaces created by the brace expansion you used. A simpler brace expansion is:
$ echo {0..9}{0..9}{0,3,5,7}
The problem of the leading 0s is easy to solve with printf '%3.0f'.
A shorter list (as an example) will be created with this:
$ printf '%3.0f ' {0..1}{0..9}{0,3,5,7}
0 3 5 7 10 13 15 17 20 23 25 27 30 33 35 37 40 43
45 47 50 53 55 57 60 63 65 67 70 73 75 77 80 83 85 87
90 93 95 97 100 103 105 107 110 113 115 117 120 123 125 127 130 133
135 137 140 143 145 147 150 153 155 157 160 163 165 167 170 173 175 177
180 183 185 187 190 193 195 197
Once this issue is cleared, we need to limit values between valMin and valMax.
Instead of calling an external awk to process a short list, a loop is better. With sorting (only called once) and printing, this script does about the same as yours with a lot fewer external calls:
#!/bin/bash
lineItem5=61 valMin=260 valMax=433
exptVals=( 257 261 265 269 273 277 281 285 289 293 297 \
301 305 309 313 317 321 325 329 333 337 341 \
345 349 353 357 361 365 369 373 377 381 385 \
389 393 397 401 405 409 413 417 421 425 429 \
433 )
for v in $( printf '%3.0f\n' {0..9}{0..9}{0,3,5,7} )
do (( v>=valMin && v<=valMax )) && allVals+=( "$v" )
done
sortVals=( $(printf '%s\n' "${allVals[@]}" "${exptVals[@]}"|sort -nu) )
printf '%s ' "${sortVals[@]}"
Here we get to the core of your question. How to:
remove the value ##7 if the sequence ##7 ##8 is encountered
The usual wisdom to do this is to call sed. Something like:
printf '%s ' "${sortVals[@]}" | sed -e 's/\(..7 \)\(..8\)/\2/g'
That will convert ..7 ..8 to ..8 (the backreference \2).
Then, you may add more filters for more changes. Something similar to:
printf '%s ' "${sortVals[@]}" |
sed -e 's/\(..7 \)\(..8\)/\2/g' |
sed -e 's/\(..\)3\( ..4\)/\12\2/g'
echo
That will solve the ..7 ..8 to ..8 and the ..3 ..4 to ..2 ..4 items.
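A quick check of the two filters on a short made-up sequence:

```shell
# '107 108' collapses to just '108'; '113 114' becomes '112 114'.
echo '105 107 108 110 113 114' |
  sed -e 's/\(..7 \)\(..8\)/\2/g' |
  sed -e 's/\(..\)3\( ..4\)/\12\2/g'
# prints: 105 108 110 112 114
```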
But your requirement of:
but only if the ##3 or ##7 value is not found in my list
Is more complex to meet. We need to scan all the values with grep, and execute different code for each option. One usual solution is to use grep:
if printf '%s ' "${sortVals[@]}" | grep -Eq '..3|..7'; then
cmd2=(cat)
else
cmd2=(sed -e 's/\(..2\)\( ..3\)/\1/g')
fi
But that means to scan all values with grep for each condition.
The command created, cmd2, is an array and may be used like this:
printf '%s ' "${sortVals[@]}" |
sed -e 's/\(..7 \)\(..8\)/\2/g' | "${cmd2[@]}" |
sed -e 's/\(..\)3\( ..4\)/\12\2/g' | "${cmd4[@]}"
echo
No grep
The values you are testing are only the last digit, which can easily be extracted with a modulo 10 math operation. And to make the testing of values easier/faster, we can create an array of indexes like this:
unset indexVals; declare -A indexVals
for v in "${sortVals[@]}"; do indexVals[$((v%10))]=1; done
That's only one scan of values, no external tool called, and a big simplification of the testing of values (for example, for ..2 or ..3):
(( ${indexVals[2]-0} || ${indexVals[3]-0} ))
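A minimal runnable sketch of that index-and-test idea, with illustrative values:

```shell
# Index values by their last digit, then test for any ##3 or ##7 entry.
# ${indexVals[3]-0} expands to 0 when the key is unset.
declare -A indexVals
for v in 260 263 265 270; do indexVals[$((v%10))]=1; done
if (( ${indexVals[3]-0} || ${indexVals[7]-0} )); then
  echo "has a ##3 or ##7 value"
fi
# prints: has a ##3 or ##7 value
```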
A script with all the changes is:
#!/bin/bash
lineItem5=61 valMin=260 valMax=433
exptVals=( 257 261 265 269 273 277 281 285 289 293 297 \
301 305 309 313 317 321 325 329 333 337 341 \
345 349 353 357 361 365 369 373 377 381 385 \
389 393 397 401 405 409 413 417 421 425 429 \
433 )
for v in $( printf '%3.0f\n' {0..9}{0..9}{0,3,5,7} )
do (( v>=valMin && v<=valMax )) && allVals+=( "$v" )
done
sortVals=( $(printf '%s\n' "${allVals[@]}" "${exptVals[@]}" | sort -nu) )
unset indexVals; declare -A indexVals
for v in "${sortVals[@]}"; do indexVals[$((v%10))]=1; done
cmd1=( sed -e 's/\(..7 \)\(..8\)/\2/g' )
(( ${indexVals[2]-0} || ${indexVals[3]-0} )) &&
cmd2=( cat ) ||
cmd2=( sed -e 's/\(..2\)\( ..3\)/\1/g' )
cmd3=( sed -e 's/\(..\)3\( ..4\)/\12\2/g' )
(( ${indexVals[3]-0} || ${indexVals[7]-0} )) &&
cmd4=( cat ) ||
cmd4=( sed -e 's/\(..6 ..\)7/\18/g' )
printf '%s ' "${sortVals[@]}" | "${cmd1[@]}" | "${cmd2[@]}" |
"${cmd3[@]}" | "${cmd4[@]}" ; echo
Instead of generating the numbers by pattern, why not use awk to generate the numbers as a numerical sequence?
for example,
$ awk -v from=100 -v to=200 -v ORS=' ' 'BEGIN{for(i=from;i<=to-10;i+=10)
print i,i+3,i+5,i+7; ORS="\n"; print""}'
100 103 105 107 110 113 115 117 120 123 125 127 130 133 135 137 140 143 145 147 150 153 155 157 160 163 165 167 170 173 175 177 180
183 185 187 190 193 195 197
I want to write/make/use a 3D array of [m][n][k] in BASH. From what I understand, BASH does not support arrays that are not 1D.
Any ideas how to do it?
Fake multi-dimensionality with a crafted associative array key:
declare -A ary
for i in 1 2 3; do
for j in 4 5 6; do
for k in 7 8 9; do
ary["$i,$j,$k"]=$((i*j*k))
done
done
done
for key in "${!ary[@]}"; do printf "%s\t%d\n" "$key" "${ary[$key]}"; done | sort
1,4,7 28
1,4,8 32
1,4,9 36
1,5,7 35
1,5,8 40
1,5,9 45
1,6,7 42
1,6,8 48
1,6,9 54
2,4,7 56
2,4,8 64
2,4,9 72
2,5,7 70
2,5,8 80
2,5,9 90
2,6,7 84
2,6,8 96
2,6,9 108
3,4,7 84
3,4,8 96
3,4,9 108
3,5,7 105
3,5,8 120
3,5,9 135
3,6,7 126
3,6,8 144
3,6,9 162
I used sort because the keys of an associative array have no inherent order.
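Single elements are read back with the same composite key:

```shell
# Store and retrieve one element via a crafted "i,j,k" key.
declare -A ary
ary["2,5,8"]=$((2*5*8))
echo "${ary[2,5,8]}"
# prints: 80
```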
You can use associative arrays if your bash is recent enough:
unset assoc
declare -A assoc
assoc["1.2.3"]=x
But, I'd rather switch to a language that supports multidimensional arrays (e.g. Perl).
As in C, you can simulate a multidimensional array using an offset.
#! /bin/bash
xmax=100
ymax=150
zmax=80
xymax=$((xmax*ymax))
vol=()
for ((z=0; z<zmax; z++)); do
for ((y=0; y<ymax; y++)); do
for ((x=0; x<xmax; x++)); do
((t = z*xymax+y*xmax+x))
if ((vol[t] == 0)); then
((vol[t] = vol[t-xymax] + vol[t-xmax] + vol[t-1]))
fi
done
done
done
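The same offset arithmetic can be inverted to recover (x, y, z) from a flat index t; a small sketch with illustrative coordinates:

```shell
# Flatten (x,y,z) into t, then recover each coordinate with
# integer division and modulo.
xmax=100 ymax=150
xymax=$((xmax*ymax))
x=5 y=7 z=2
t=$((z*xymax + y*xmax + x))
echo "$t -> z=$((t/xymax)) y=$(((t%xymax)/xmax)) x=$((t%xmax))"
# prints: 30705 -> z=2 y=7 x=5
```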
I'm quite new to awk, which I am using more and more to process the output files from a model I am running. Right now, I am stuck on a multiplication issue.
I would like to calculate relative change in percentage.
Example:
A B
1 150 0
2 210 10
3 380 1000
...
I would like to calculate Ax = (Ax-A1)/A1 * 100.
Output:
New_A B
1 0 0
2 10 40
3 1000 153.33
...
I can multiply columns together but don't know how to anchor a value to a fixed position in the text file (i.e. row 1, column 1).
Thank you.
Assuming your actual file does not have the "A B" header and the row numbers in it:
$ cat file
150 0
210 10
380 1000
$ awk 'NR==1 {a1=$1} {printf "%s %.1f\n", $2, ($1-a1)/a1*100}' file | column -t
0 0.0
10 40.0
1000 153.3
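If the actual file does keep the header and row numbers shown in the question, a variant (a sketch; the field numbers shift by one) can skip the header and take the baseline from the first data row:

```shell
printf 'A B\n1 150 0\n2 210 10\n3 380 1000\n' |
  awk 'NR==1 {print "New_A", "B"; next}  # pass the header through, renamed
       NR==2 {a1=$2}                     # baseline A1 from the first data row
       {printf "%s %.2f\n", $3, ($2-a1)/a1*100}'
# prints:
# New_A B
# 0 0.00
# 10 40.00
# 1000 153.33
```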