Increment array by 1 in awk - arrays

I am trying to use awk to increment an array element by one inside an "if":
inf_a='1.0'
inf_b='4.0'
inf=($(seq $inf_a 1.0 $inf_b))
gro_a='0.0'
gro_b='4.0'
gro=($(seq $gro_a 0.5 $gro_b))
tri_a='1'
tri_b='12'
tri=($(seq $tri_a 1 $tri_b))
# counter
declare -A succ
declare -A fail
num_inf=${#inf[@]}
num_gro=${#gro[@]}
num_tri=${#tri[@]}
for ((i=1;i<num_inf;i++)) do
ii=${inf[$i]}
for ((j=1;j<num_gro;j++)) do
jj=${gro[$j]}
for ((k=1;k<num_tri;k++)) do
kk=${tri[$k]}
awk 'END{x=($2+$8);if($x<10) (( fail[$i,$j]++ )) ;else (( succ[$i,$j]++ )) }' infect$ii/$jj/trial$kk/out.dat
done
echo
done
done
printf '%s\n' "fail[1,1]"
printf '%s\n' "succ[1,1]"
However, they both return 0. It seems that the problem originates from (( array[x,y]++ )). The "++" does not seem to work.
Is there any suggestion? Anything I missed?
Thanks.
Edited :
This is an example of my out.dat
47990 1451 234803 25 9816 2 593 1478 245212 605053 999.999695
47991 1451 234811 25 9806 2 593 1478 245210 605618 1000.010071
47992 1451 234821 25 9810 2 592 1478 245223 605928 1000.000610
47993 1450 234828 25 9812 2 593 1477 245233 605786 1000.003357
47994 1450 234826 25 9837 2 588 1477 245251 605642 1000.017029
47995 1450 234837 25 9832 2 588 1477 245257 605191 1000.104919
47996 1449 234849 26 9830 2 589 1477 245268 605350 1000.089966
47997 1449 234856 26 9831 2 588 1477 245275 605260 999.999695
47998 1447 234852 26 9834 2 589 1475 245275 605122 1000.124512
47999 1447 234838 26 9852 2 590 1475 245280 605124 999.999695
Edited :
Modifying the code this way does not seem to work.
declare $( awk 'BEGIN{print "let "s=s+1""}' )

There are some problems in your script.
$i and $j are bash variables, so awk does not know their values. You have to use -v to pass their values to awk.
Also, fail[] and succ[] are awk arrays here. How would bash know their values? You have to print the values in awk, capture them in bash using $() (command substitution), and use them from there.
Refer below links:
How to use shell variables in an awk script
http://tldp.org/LDP/abs/html/commandsub.html
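A minimal sketch of both points, under some assumptions: the file layout and the threshold of 10 come from the question, while the helper name classify, the limit variable and the demo directory are made up for illustration. awk only prints a verdict; bash captures it with $() and increments its own associative array.

```shell
#!/bin/bash
# classify FILE: hypothetical helper. awk cannot touch bash arrays, so it
# only prints "fail" or "succ" and bash captures that with $( ).
classify() {
    # limit is handed to awk with -v; sum is remembered from the last line read.
    awk -v limit=10 '{ sum = $2 + $8 } END { print (sum < limit ? "fail" : "succ") }' "$1"
}

declare -A succ fail

# Demo data (made up): directories named the way the question names them.
demo=$(mktemp -d)
mkdir -p "$demo/infect1.0/0.5/trial1" "$demo/infect1.0/0.5/trial2"
echo "47999 1447 234838 26 9852 2 590 1475 245280 605124 999.999695" > "$demo/infect1.0/0.5/trial1/out.dat"
echo "48000 3 0 0 0 0 0 4 0 0 1000.0" > "$demo/infect1.0/0.5/trial2/out.dat"

i=1.0 j=0.5
for k in 1 2; do
    verdict=$(classify "$demo/infect$i/$j/trial$k/out.dat")
    if [[ $verdict == fail ]]; then
        fail[$i,$j]=$(( ${fail[$i,$j]:-0} + 1 ))
    else
        succ[$i,$j]=$(( ${succ[$i,$j]:-0} + 1 ))
    fi
done

printf 'fail=%s succ=%s\n' "${fail[$i,$j]:-0}" "${succ[$i,$j]:-0}"
rm -rf "$demo"
```

The counters live entirely in bash, so they are still set after the loop. Starting one awk process per file is simple but not fast; for many files a single awk call over all of them may be preferable.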

Related

How to compare 2 files and returning matching values with awk [duplicate]

I want to keep only the lines in results.txt that matched the IDs in uniq.txt based on matches in column 3 of results.txt. Usually I would use grep -f uniq.txt results.txt, but this does not specify column 3.
uniq.txt
9606
234831
131
31313
results.txt
readID seqID taxID score 2ndBestScore hitLength queryLength numMatches
A00260:70:HJM2YDSXX:4:1111:15519:16720 NC_000011.10 9606 169 0 28 151 1
A00260:70:HJM2YDSXX:3:1536:9805:14841 NW_021160017.1 9606 81 0 24 151 1
A00260:70:HJM2YDSXX:3:1366:27181:24330 NC_014803.1 234831 121 121 26 151 3
A00260:70:HJM2YDSXX:3:1366:27181:24330 NC_014973.1 443143 121 121 26 151 3
With your shown samples, please try following code.
awk 'FNR==NR{arr[$0];next} ($3 in arr)' uniq.txt results.txt
Explanation:
awk ' ##Starting awk program from here.
FNR==NR{ ##Checking condition which will be TRUE when uniq.txt is being read.
arr[$0] ##Creating array with index of current line.
next ##next will skip all further statements from here.
}
($3 in arr) ##If 3rd field is present in arr then print line from results.txt here.
' uniq.txt results.txt ##Mentioning Input_file names here.
2nd solution: In case your field number is not set in results.txt and you want to search values in whole line then try following.
awk 'FNR==NR{arr[$0];next} {for(key in arr){if(index($0,key)){print;next}}}' uniq.txt results.txt
You can use grep in combination with sed to manipulate the input patterns and achieve what you're looking for:
grep -Ef <(sed -e 's/^/^(\\S+\\s+){2}/;s/$/\\s*/' uniq.txt) results.txt
If you want to match the nth column, replace 2 in the above command with n-1.
outputs
A00260:70:HJM2YDSXX:4:1111:15519:16720 NC_000011.10 9606 169 0 28 151 1
A00260:70:HJM2YDSXX:3:1536:9805:14841 NW_021160017.1 9606 81 0 24 151 1
A00260:70:HJM2YDSXX:3:1366:27181:24330 NC_014803.1 234831 121 121 26 151 3
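For the awk variant, the column can be made a parameter in the same spirit. This is only a sketch: the wrapper name filter_by_column and the shortened sample rows are assumptions, and it relies on the ID file being non-empty (otherwise FNR==NR is also true for the second file).

```shell
#!/bin/bash
# filter_by_column IDS FILE COL: hypothetical wrapper around the two-file idiom.
# While the first file is read (FNR==NR) every line becomes an array key;
# lines of the second file are printed when field COL is one of those keys.
filter_by_column() {
    awk -v col="$3" 'FNR==NR { ids[$0]; next } ($col in ids)' "$1" "$2"
}

tmp=$(mktemp -d)
printf '%s\n' 9606 234831 131 31313 > "$tmp/uniq.txt"
cat > "$tmp/results.txt" <<'EOF'
readID seqID taxID score
r1 NC_000011.10 9606 169
r2 NC_014803.1 234831 121
r3 NC_014973.1 443143 121
EOF

matched=$(filter_by_column "$tmp/uniq.txt" "$tmp/results.txt" 3)
printf '%s\n' "$matched"
rm -rf "$tmp"
```

Because the comparison is an exact key lookup on one field, an ID such as 9606 cannot accidentally match 96061.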

Issue with AWK array length?

I have a tab separated matrix (say filename).
If I do:
head -1 filename | awk -F "\t" '{i=0;med=0;for(i=2;i<=NF;i++) array[i]=$i;asort(array);print length(array)}'
followed by:
head -2 filename | tail -1 | awk -F "\t" '{i=0;med=0;for(i=2;i<=NF;i++) array[i]=$i;asort(array);print length(array)}'
I get an answer of 24 (same answer) for all rows basically.
But if I do it:
cat filename | awk -F "\t" '{i=0;med=0;for(i=2;i<=NF;i++) array[i]=$i;asort(array);print length(array)}'
I get:
24
25
25
25
25 ...
Why is it so?
Following is the inputfile:
Case1 17.49 0.643 0.366 11.892 0.85 5.125 0.589 0.192 0.222 0.231 27.434 0.228 0 0.111 0.568 0.736 0.125 0.038 0.218 0.253 0.055 0.019 0 0.078
Case2 0.944 2.412 4.296 0.329 0.399 1.625 0.196 0.038 0.381 0.208 0.045 1.253 0.382 0.111 0.324 0.268 0.458 0.352 0 1.423 0.887 0.444 5.882 0.543
Case3 21.266 14.952 24.406 10.977 8.511 21.75 6.68 0.613 12.433 1.48 1.441 21.648 6.972 42.931 8.029 4.883 11.912 6.248 4.949 26.882 9.756 5.366 38.655 12.723
Case4 0.888 0 0.594 0.549 0.105 0.125 0 0 0.571 0.116 0.019 1.177 0.573 0.111 0.081 0.401 0 0.05 0.073 0 0 0 0 0.543
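Well, I found an answer to my own problem:
I wonder how I missed it, but clearing the array at the end of each iteration is always critical when the same array name is reused (no matter which language or script one uses). awk arrays persist across input records, and asort() renumbers the elements from 1, so after the first row the array holds indices 1 to 24; the next row overwrites indices 2 to 25 and the stale element at index 1 survives, which gives 25.
The correct awk command was: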
cat filename | awk -F "\t" '{i=0;med=0;for(i=2;i<=NF;i++) array[i]=$i;asort(array);print length(array);delete array}'
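The persistence of awk arrays across records can be reproduced without asort (which is a gawk extension). The following sketch uses made-up input with rows of different lengths and a hypothetical keep/reset switch; it counts elements with a loop so that it does not depend on length(array) being supported.

```shell
#!/bin/bash
# Two made-up rows: the first has 3 values after the label, the second only 1.
input=$'a 1 2 3\nb 9'

# count_fields MODE: MODE is "keep" or "reset" (a switch invented for this demo).
count_fields() {
    awk -v mode="$1" '{
        for (i = 2; i <= NF; i++) array[i] = $i
        n = 0
        for (k in array) n++                 # portable element count
        print n
        if (mode == "reset") delete array    # start the next record empty
    }' <<< "$input"
}

count_fields keep     # second row still sees the elements left by the first
count_fields reset    # second row sees only its own element
```

With keep, both rows report 3 elements; with reset, the second row reports 1.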

Remove or Replace Value in Array1 If Value Isn't in Array2 in Shell Script Without Looping?

Edit:
I've changed my definition of allVals to the somewhat cleaner/simpler:
allVals=( $( printf '%s\n' \
$( printf '%s\n' \
{0..9}{0..9}{0,3,5,7} | \
sed -e 's#^0*##g' ) | \
awk '$1>='"$valMin"' && $1<='"$valMax" ) \
${exptVals[@]} )
I have a short BASH script used to produce space separated configuration files for a secondary executable. The scripted part is determining which values to print to column 1.
To accomplish this my script uses brace expansion to create an array of integers with the following rules:
Numbers are no more than 3 digits.
Numbers are integer (no decimal).
Multiples of 5 must be included in the series.
I need at least one evenly spaced point between the [#][#]0 and [#][#]5 (i.e. a number ending in 3 or 7, as appropriate).
I use sed to clean up the case where the second most significant digit is blank (I'll probably replace the ' ' with '0' and write a simpler equivalent by removing leading '0's when I get around to it...).
Anyhow, these are values I input into a second program to produce computed predictions for certain properties. I also want to be sure to include numbers corresponding to certain experimental values I have... so I do that by creating an array of experimental values and then merging the two arrays together, sorting them and removing redundant values.
The script is given below (it was a oneliner -- I've edited it into script form for readability below):
#!/bin/bash
lineItem5=61
valMax=433
valMin=260
exptVals=( 257 261 265 269 273 277 281 285 289 293 297 \
301 305 309 313 317 321 325 329 333 337 341 \
345 349 353 357 361 365 369 373 377 381 385 \
389 393 397 401 405 409 413 417 421 425 429 \
433 )
allVals=( $( printf '%s\n' \
$( printf '%s\n' {' ',{1..9}}{' ',{1..9}}{0,3,5,7} | \
sed -e 's# \([1-9]\) 0 [1-9] 3 [1-9] 5 [1-9] 7 # \100 \103 \105 \107 #g' ) | \
awk '$1>='"$valMin"' && $1<='"$valMax" ) \
${exptVals[@]} )
sortVals=( $( printf '%s\n' ${allVals[@]} | sort -nr | uniq ) )
for ((t=0;t<${#sortVals[@]};t++)); do
printf '%s\n' "${sortVals[t]}"' -4000 -4000 200 '"${lineItem5}"' -1.0'
done
unset exptVals allVals sortVals
It works, but I would like to cut down on the number of lines (which equate to evaluated points and hence computation cost) and improve the spacing of values (which improves my statistical accuracy as each point of outputted properties depends on the previous calculations).
Specifically I'd like to remove the value ##7 if the sequence ##7 ##8 is encountered, and likewise ##3 if the sequence ##2 ##3... but only if the ##3 or ##7 value is not found in my list of experimental values. Also I want to change ##3 ##4 to ##2 ##4 and ##6 ##7 to ##6 ##8 to improve the spacing -- but only if the ##3 or ##7 are not in the experimental sequence.
So far the best way to do this I can think of is doing something like
valStart=0
for ((e=0; e<${#exptVals[@]}; e++)); do
for ((v=valStart; v<${#allT[@]}; v++)); do
if [[ ${allVals[v]} -ge ${exptVals[$((e+1))]} ]]; then
valStart=v
break
else
#Do edits to list here...
fi
done
done
The code isn't finished, but I think it would be moderately efficient as I don't have to loop through the second list entirely... just a small stretch of it (my experimental list is in order).
But I feel like there are easier ways to delete 'Y' from 'X Y' if 'Y' is not in array $vals or change 'Y' to 'Z' for 'X Y' if 'Y' is not in array $vals?
Is there a simple way, in a single expression using some sort of builtin, to accomplish the following:
delete 'Y' from 'X Y' if 'Y' is not in array $vals
change 'Y' to 'Z' for 'X Y' if 'Y' is not in array $vals
...which does not involve looping through the values in bash-style loops (my brute-force method)?
The script you have makes calls to sed and awk to remove the spaces created by the brace expansion you used. A simpler brace expansion is:
$ echo {0..9}{0..9}{0,3,5,7}
The problem of the leading 0s is easy to solve with printf '%3.0f'.
A shorter list (as an example) will be created with this:
$ printf '%3.0f ' {0..1}{0..9}{0,3,5,7}
0 3 5 7 10 13 15 17 20 23 25 27 30 33 35 37 40 43
45 47 50 53 55 57 60 63 65 67 70 73 75 77 80 83 85 87
90 93 95 97 100 103 105 107 110 113 115 117 120 123 125 127 130 133
135 137 140 143 145 147 150 153 155 157 160 163 165 167 170 173 175 177
180 183 185 187 190 193 195 197
With this issue cleared, we need to limit values to the range between valMin and valMax.
Instead of calling an external awk to process a short list, a loop is better. With sorting (only called once) and printing, this script does about the same as yours with far fewer external calls:
#!/bin/bash
lineItem5=61 valMin=260 valMax=433
exptVals=( 257 261 265 269 273 277 281 285 289 293 297 \
301 305 309 313 317 321 325 329 333 337 341 \
345 349 353 357 361 365 369 373 377 381 385 \
389 393 397 401 405 409 413 417 421 425 429 \
433 )
for v in $( printf '%3.0f\n' {0..9}{0..9}{0,3,5,7} )
do (( v>=valMin && v<=valMax )) && allVals+=( "$v" )
done
sortVals=( $(printf '%s\n' "${allVals[@]}" "${exptVals[@]}"|sort -nu) )
printf '%s ' "${sortVals[@]}"
Here we get to the core of your question. How to:
remove the value ##7 if the sequence ##7 ##8 is encountered
The usual wisdom to do this is to call sed. Something like:
printf '%s ' "${sortVals[@]}" | sed -e 's/\(..7 \)\(..8\)/\2/g'
That will convert ..7 ..8 to ..8 (the backreference \2).
Then, you may add more filters for more changes. Something similar to:
printf '%s ' "${sortVals[@]}" |
sed -e 's/\(..7 \)\(..8\)/\2/g' |
sed -e 's/\(..\)3\( ..4\)/\12\2/g'
echo
That will solve the ..7 ..8 to ..8 and the ..3 ..4 to ..2 ..4 items.
But your requirement of:
but only if the ##3 or ##7 value is not found in my list
Is more complex to meet. We need to scan all the values and run different code for each case. One usual solution is to use grep:
if printf '%s ' "${sortVals[@]}" | grep -Eq '..3|..7'; then
cmd2=(cat)
else
cmd2=(sed -e 's/\(..2\)\( ..3\)/\1/g')
fi
But that means scanning all values with grep for each condition.
The resulting command, cmd2, is an array and may be used like this:
printf '%s ' "${sortVals[@]}" |
sed -e 's/\(..7 \)\(..8\)/\2/g' | "${cmd2[@]}" |
sed -e 's/\(..\)3\( ..4\)/\12\2/g' | "${cmd4[@]}"
echo
No grep
The values you are testing are only the last digit, which can easily be extracted with a modulo 10 operation. To make testing the values easier and faster, we can create an array of indexes like this:
unset indexVals; declare -A indexVals
for v in "${sortVals[@]}"; do indexVals[$((v%10))]=1; done
That's only one scan of values, no external tool called, and a big simplification of the testing of values (for example, for ..2 or ..3):
(( ${indexVals[2]-0} || ${indexVals[3]-0} ))
A script with all the changes is:
#!/bin/bash
lineItem5=61 valMin=260 valMax=433
exptVals=( 257 261 265 269 273 277 281 285 289 293 297 \
301 305 309 313 317 321 325 329 333 337 341 \
345 349 353 357 361 365 369 373 377 381 385 \
389 393 397 401 405 409 413 417 421 425 429 \
433 )
for v in $( printf '%3.0f\n' {0..9}{0..9}{0,3,5,7} )
do (( v>=valMin && v<=valMax )) && allVals+=( "$v" )
done
sortVals=( $(printf '%s\n' "${allVals[@]}" "${exptVals[@]}" | sort -nu) )
unset indexVals; declare -A indexVals
for v in "${sortVals[@]}"; do indexVals[$((v%10))]=1; done
cmd1=( sed -e 's/\(..7 \)\(..8\)/\2/g' )
(( ${indexVals[2]-0} || ${indexVals[3]-0} )) &&
cmd2=( cat ) ||
cmd2=( sed -e 's/\(..2\)\( ..3\)/\1/g' )
cmd3=( sed -e 's/\(..\)3\( ..4\)/\12\2/g' )
(( ${indexVals[3]-0} || ${indexVals[7]-0} )) &&
cmd4=( cat ) ||
cmd4=( sed -e 's/\(..6 ..\)7/\18/g' )
printf '%s ' "${sortVals[@]}" | "${cmd1[@]}" | "${cmd2[@]}" |
"${cmd3[@]}" | "${cmd4[@]}" ; echo
Instead of generating the numbers by pattern, why not use awk to generate them as a numerical sequence?
For example:
$ awk -v from=100 -v to=200 -v ORS=' ' 'BEGIN{for(i=from;i<=to-10;i+=10)
print i,i+3,i+5,i+7; ORS="\n"; print""}'
100 103 105 107 110 113 115 117 120 123 125 127 130 133 135 137 140 143 145 147 150 153 155 157 160 163 165 167 170 173 175 177 180
183 185 187 190 193 195 197
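Coming back to the "only if the value is not in my experimental list" condition from the question: one possible way to test it per value is to load the experimental values into an associative array and use that as a set. The sketch below is an illustration rather than any of the scripts above. The function name prune and the short sample lists are made up, only the two removal rules are implemented (drop ##7 before ##8, drop ##3 after ##2), and values are assumed to have no leading zeros. It still makes one pass over the sorted values, but the membership test is a direct lookup instead of an inner loop.

```shell
#!/bin/bash
# Experimental values that must never be removed (shortened, made-up list).
declare -A expt
for e in 261 265 269 273 277; do expt[$e]=1; done

# prune V1 V2 ...: hypothetical helper. Takes sorted values as arguments and
# prints the ones that survive the two removal rules, space separated.
prune() {
    local -a vals=( "$@" ) out=()
    local idx v prev next
    for idx in "${!vals[@]}"; do
        v=${vals[idx]} prev='' next=''
        (( idx > 0 )) && prev=${vals[idx-1]}
        (( idx + 1 < ${#vals[@]} )) && next=${vals[idx+1]}
        if [[ -z ${expt[$v]+set} ]]; then              # v is not experimental
            (( v % 10 == 7 && next == v + 1 )) && continue   # ##7 followed by ##8
            (( v % 10 == 3 && prev == v - 1 )) && continue   # ##3 preceded by ##2
        fi
        out+=( "$v" )
    done
    printf '%s\n' "${out[*]}"
}

prune 262 263 265 267 268 272 273 277 278
```

In this run 263 and 267 are dropped, while 273 and 277 stay because they are in the experimental set.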

Bash Indented Output for Multiple Variables

I have a script that loops over every text file in a directory, and stores the content in variables. The content can be anywhere from 1-50 characters long. The number of text files is unknown. I would like to print the content in such a way that each variable falls into a clean column.
for file in $LIBPATH/*.txt; do
name=$( awk 'FNR == 1 {print $0}' $file )
height=$( awk 'FNR == 2 {print $0}' $file )
weight=$( awk 'FNR == 3 {print $0}' $file )
echo $name $height $weight
done
This code produces the output:
Avril Stewart 99 54
Sally Kinghorn 170 60
John Young 195 120
While the desired output is:
Avril Stewart   99   54
Sally Kinghorn  170  60
John Young      195  120
Thanks!
Use printf:
printf '%-20s %3s %3s\n' "$name" "$height" "$weight"
%3s pads each field to at least three characters (right-aligned); %-20s does the same for 20 characters, but the - in front makes the output left-aligned.
If you want to limit the output to e.g. 20 characters, you can use
printf '%-20.20s %3s %3s\n' "$name" "$height" "$weight"
This will give you a left aligned minimum width of 20 characters and a maximum width of 20 characters, in other words it will ensure that you always have exactly 20 characters.
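A small sketch of the format in use, with the rows from the question fed through a here-document instead of the file loop. The helper name format_row and the | separator are assumptions made for the demo.

```shell
#!/bin/bash
# format_row NAME HEIGHT WEIGHT: hypothetical helper around the printf above.
# The name is padded or truncated to exactly 20 characters; the two numbers
# are right-aligned in at least 3 characters each.
format_row() {
    printf '%-20.20s %3s %3s\n' "$1" "$2" "$3"
}

while IFS='|' read -r name height weight; do
    format_row "$name" "$height" "$weight"
done <<'EOF'
Avril Stewart|99|54
Sally Kinghorn|170|60
John Young|195|120
EOF
```

Every printed line is 28 characters wide (20 + 1 + 3 + 1 + 3), so the height always starts in the same column regardless of the name's length.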

Getting output of shell command in bash array

I have a uniq -c output, that outputs about 7-10 lines with the count of each pattern that was repeated for each unique line pattern. I want to store the output of my uniq -c file.txt into a bash array. Right now all I can do is store the output into a variable and print it. However, bash currently thinks the entire output is just one big string.
How does bash recognize delimiters? How do you store UNIX shell command output as Bash arrays?
Here is my current code:
proVar=`awk '{printf ("%s\t\n"), $1}' file.txt | grep -P 'pattern' | uniq -c`
echo $proVar
And current output I get:
587 chr1 578 chr2 359 chr3 412 chr4 495 chr5 362 chr6 287 chr7 408 chr8 285 chr9 287 chr10 305 chr11 446 chr12 247 chr13 307 chr14 308 chr15 365 chr16 342 chr17 245 chr18 252 chr19 210 chr20 193 chr21 173 chr22 145 chrX 58 chrY
Here is what I want:
proVar[1] = 2051
proVar[2] = 1243
proVar[3] = 1068
...
proVar[22] = 814
proVar[X] = 72
proVar[Y] = 13
In the long run, I'm hoping to make a barplot based on the counts for each index, where every 50 counts equals one "=" sign. It will hopefully look like the below
chr1 ===========
chr2 ===========
chr3 =======
chr4 =========
...
chrX ==
chrY =
Any help, guys?
To build the associative array, try this:
declare -A proVar
while read -r val key; do
proVar[${key#chr}]=$val
done < <(awk '{printf ("%s\t\n"), $1}' file.txt | grep -P 'pattern' | uniq -c)
Note: This assumes that your command's output is composed of multiple lines, each containing one key-value pair; the single-line output shown in your question comes from passing $proVar to echo without double quotes.
Uses a while loop to read each output line from a process substitution (<(...)).
The value of each assoc. array entry is each input line's first whitespace-separated token (the count printed by uniq -c), whereas the key is formed by stripping the prefix chr from the rest of the line.
To then create the bar plot, use:
while IFS= read -r key; do
echo "chr${key} $(printf '=%.s' $(seq $(( ${proVar[$key]} / 50 ))))"
done < <(printf '%s\n' "${!proVar[@]}" | sort -n)
Note: Using sort -n to sort the keys will put non-numeric keys such as X and Y before numeric ones in the output.
$(( ${proVar[$key]} / 50 )) calculates the number of = chars. to display, using integer division in an arithmetic expansion.
The purpose of $(seq ...) is to simply create as many tokens (arguments) as = chars. should be displayed (the tokens created are numbers, but their content doesn't matter).
printf '=%.s' ... is a trick that effectively prints as many = chars. as there are arguments following the format string.
printf '%s\n' "${!proVar[@]}" | sort -n sorts the keys of the assoc. array numerically, and its output is fed via a process substitution to the while loop, which therefore iterates over the keys in sorted order.
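One edge case of the printf '=%.s' trick: when a count is below 50, seq produces no arguments, and printf with no arguments still prints its format once, so a single = appears. A possible variant that prints nothing in that case is sketched below; the helper name bar and the three sample counts are made up for the demo.

```shell
#!/bin/bash
# bar COUNT: hypothetical helper. Prints one "=" per full 50 counts and an
# empty line when COUNT is below 50.
bar() {
    local n=$(( $1 / 50 )) pad=''
    (( n > 0 )) && printf -v pad '%*s' "$n" ''   # n spaces
    printf '%s\n' "${pad// /=}"                  # turn each space into "="
}

declare -A proVar=( [1]=587 [2]=578 [Y]=58 )
for key in 1 2 Y; do
    printf 'chr%-3s %s\n' "$key" "$(bar "${proVar[$key]}")"
done
```

The bar is built with parameter expansion only, so no external seq call is needed per key.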
You can create an array in an assignment using parentheses:
proVar=(`awk '{printf ("%s\t\n"), $1}' file.txt | grep -P 'pattern' | uniq -c`)
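Note that this splits the output on whitespace, so each count and each name becomes a separate element. There's no built-in way to create an associative array directly from input. For that you'll need an additional loop.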
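If one array element per output line is wanted, bash 4 and later offer the mapfile builtin. The sketch below assumes bash 4+ and replaces the real pipeline with a stand-in function, fake_counts, that emits lines shaped like uniq -c output; the loop then builds the associative array from those lines.

```shell
#!/bin/bash
# fake_counts: stand-in (assumption) for the awk | grep | uniq -c pipeline.
fake_counts() {
    printf '%7d %s\n' 587 chr1 578 chr2 58 chrY
}

mapfile -t lines < <(fake_counts)    # one array element per output line

declare -A proVar
for line in "${lines[@]}"; do
    read -r val key <<< "$line"      # val = count, key = chromosome name
    proVar[${key#chr}]=$val
done

echo "${#lines[@]} lines; chr2 -> ${proVar[2]}; chrY -> ${proVar[Y]}"
```

Keeping the raw lines in an indexed array preserves the original order, which an associative array does not guarantee.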

Resources