I am storing details of invalid options passed to a function in two arrays: w stores the positional index and warg stores the option name.
declare -a w=()
declare -a warg=()
local k=0 vb=0 ctp="" sty=""
while (( $# > 0 )); do
k=$((k+1))
arg="$1"
case $arg in
("--vlt")
ctp="$vlt" ; r=$k ; shift 1 ;;
("--blu")
blu=$(tput setaf 12)
ctp="$blu" ; r=$k ; shift 1 ;;
("--grn")
grn=$(tput setaf 2)
ctp="$grn" ; r=$k ; shift 1 ;;
("-m"*)
sty="${1#-m}" ; r=$k ; shift 1 ;;
("--")
shift 1 ; break ;;
("-"*)
w+=($k) ; warg+=("$arg")
shift 1 ;;
(*)
break ;;
esac
done
After that, I try to loop through the invalid options, but as one can see, the positions held in iw do not map to ${warg[$iw]}, so I cannot print the invalid option name. What can I do?
r=4
local iw
for iw in "${w[@]}"; do
if (( iw < r )); then
printf '%s\n' "Invalid option | ${warg[$iw]}"
fi
done
Numerically-indexed arrays don't have to have sequential indices.
Using an associative array that maps color name to color number reduces the duplication in the case branches.
declare -A colors=(
[blu]=12
[grn]=2
[ylw]=3
[orn]=166
[pur]=93
[red]=1
[wht]=7
)
declare -a warg=()
for ((i = 1; i <= $#; i++)); do
arg=${!i} # value at position $i
case $arg in
--blu|--grn|--ylw|--orn|--pur|--red|--wht)
ctp=$(tput setaf "${colors[${arg#--}]}") ;;
--vlt) ctp="$vlt" ;;
-m*) sty="${arg#-m}" ;;
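--) i=$((i + 1)) ; break ;; # also consume the -- itself
-*) warg[i]=$arg ;;
*) break ;;
esac
done
# i is now the position of the first argument to keep, so drop the i-1 before it
shift $((i - 1))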
# iterate over the _indices_ of the wrong args array
for i in "${!warg[@]}"; do
echo "wrong arg ${warg[i]} at position $i"
done
That is untested, so there may be bugs.
Addressing just the mapping issue between w[] and warg[] ...
The problem seems to be that $k is being incremented for every $arg, but the index for warg[] is only incremented for an 'invalid' $arg.
If the sole purpose of the w[] array is to track the position of 'invalid' args then you can eliminate the w[] array and populate warg[] as a sparse array with the following change to the current code:
# replace:
warg+=("$arg")
# with:
warg[$k]="$arg"
With this change your for loop becomes:
for iw in "${!warg[@]}"
do
....
printf '%s\n' "Invalid option | ${warg[$iw]}"
done
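A minimal, self-contained sketch of the sparse-array idea from the answers above. The function name parse_opts and the sample arguments are made up for illustration; only two color options are recognised here, to keep it short.

```shell
#!/usr/bin/env bash
# Hypothetical helper: record unknown options in a sparse array whose
# index is the option's position on the command line.
parse_opts() {
    local i arg
    warg=()                          # global on purpose, so the caller can read it
    for ((i = 1; i <= $#; i++)); do
        arg=${!i}                    # value of positional parameter number i
        case $arg in
            --blu|--grn) ;;          # known options, ignored in this sketch
            --) break ;;
            -*) warg[i]=$arg ;;      # unknown option: store it at its own position
            *)  break ;;
        esac
    done
}

parse_opts --blu --bad1 --grn -x file

# "${!warg[@]}" expands to the indices that are actually set (2 and 4 here).
for i in "${!warg[@]}"; do
    printf 'Invalid option | %s (position %s)\n' "${warg[i]}" "$i"
done
```

Because the index doubles as the position, no second array is needed to remember where each invalid option appeared.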
I am trying to do something which I thought would be fairly simple but not having any joy.
If I run the following, it successfully pulls the region (reg) from an array called PORT330 and checks to see it contains the value of $i. For example, it could be checking to see if "Europe London" contains the word "London". Here is what works:
if [[ ${PORT330[reg]} == *"$i"* ]] ; then
echo 302 is in $i ;
fi
However, I actually have a list of port arrays to check, so it may be PORT330, PORT550 and so on. I want to be able to substitute the port number with a variable inside the array name. Here is what I am trying to do:
This works:
for portid in ${portids[@]} ; do
for i in ${regions[@]} ; do
if [[ ${PORT330[reg]} == *"$i"* ]] ; then
echo $portid is in $i ;
fi ;
done ;
done
However this doesn't work:
for portid in ${portids[@]} ; do
for i in ${regions[@]} ; do
if [[ ${PORT$portid[reg]} == *"$i"* ]] ; then
echo $portid is in $i ;
fi ;
done ;
done
It throws this error:
-su: ${PORT$portid[reg]}: bad substitution
Any pointers as to where I am going wrong?
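Bash variable expansion supports indirection via ${!name}: the value of name is used as the name of the variable to expand. To reach an array element, put the subscript into that value as well, e.g. ref="PORT330[reg]" followed by ${!ref}. (Writing ${!arr[reg]} with arr=PORT330 only appears to work for indexed arrays, where reg evaluates to index 0; it does not look up the key reg in an associative array.)
for portid in ${portids[@]} ; do
for i in ${regions[@]} ; do
ref="PORT${portid}[reg]"
if [[ ${!ref} == *"$i"* ]] ; then
echo $portid is in $i ;
fi ;
done ;
done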
Here is a complete example
PORT330[reg]=a ;
PORT550[reg]=b ;
for portid in 330 550 ; do
for i in a b ;
do ref="PORT${portid}[reg]" ;
if [[ ${!ref} == *"$i"* ]] ; then
echo $portid is in $i ;
fi ;
done ;
done
Produces
330 is in a
550 is in b
Another example
:~> portids=(330 350 )
:~> echo ${portids[@]}
330 350
:~> PORT350[reg]=London
:~> PORT330[reg]=Berlin
:~> for portid in ${portids[@]} ; do ref="PORT${portid}[reg]" ; echo ${!ref} ; done
Berlin
London
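As an alternative to ${!ref}, a nameref can alias each array in turn. This is only a sketch: it assumes bash 4.3 or later (for declare -n), assumes the PORT arrays are associative, and the sample data and the matches array are invented for the example.

```shell
#!/usr/bin/env bash
# Sample data (hypothetical): one associative array per port.
declare -A PORT330=([reg]="Europe London")
declare -A PORT550=([reg]="Europe Berlin")
portids=(330 550)
regions=(London Berlin)
matches=()

for portid in "${portids[@]}"; do
    declare -n port="PORT$portid"    # port is now an alias for PORT330, then PORT550
    for region in "${regions[@]}"; do
        if [[ ${port[reg]} == *"$region"* ]]; then
            matches+=("$portid is in $region")
        fi
    done
    unset -n port                    # drop the alias itself, not the array it points to
done

printf '%s\n' "${matches[@]}"
```

With a nameref the subscript is written in the usual place, so the body of the loop reads the same as the hard-coded PORT330 version.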
I am trying to write a function in bash but it won't work. The function is as follows, it gets a file in the format of:
1 2 first 3
4 5 second 6
...
I'm trying to access only the strings in the 3rd word in every line and to fill the array "arr" with them, without repeating identical strings.
When I activated the "echo" command right after the for loop, it printed only the first string in every iteration (in the above case "first").
Thank you!
function storeDevNames {
n=0
b=0
while read line; do
line=$line
tempArr=( $line )
name=${tempArr[2]}
for i in $arr ; do
#echo ${arr[i]}
if [ "${arr[i]}" == "$name" ]; then
b=1
break
fi
done
if [ "$b" -eq 0 ]; then
arr[n]=$name
n=$(($n+1))
fi
b=0
done < $1
}
The following line seems suspicious
for i in $arr ; do
I changed it as follows and it works for me:
#! /bin/bash
function storeDevNames {
n=0
b=0
while read line; do
# line=$line # ?!
tempArr=( $line )
name=${tempArr[2]}
for i in "${arr[@]}" ; do
if [ "$i" == "$name" ]; then
b=1
break
fi
done
if [ "$b" -eq 0 ]; then
arr[n]=$name
(( n++ ))
fi
b=0
done
}
storeDevNames < <(cat <<EOF
1 2 first 3
4 5 second 6
7 8 first 9
10 11 third 12
13 14 second 15
EOF
)
echo "${arr[@]}"
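If the inner loop over arr becomes slow for long inputs, an associative array can do the duplicate check instead. The following is a sketch of that variation, not the answer's own code; it assumes bash 4 or later, and the seen array is a name introduced here.

```shell
#!/usr/bin/env bash
storeDevNames() {
    local -A seen=()                 # names already stored (bash 4+ associative array)
    local name
    arr=()
    # read splits each line on whitespace; only the third word is kept.
    while read -r _ _ name _; do
        [[ -n $name && -z ${seen[$name]} ]] || continue
        seen[$name]=1
        arr+=("$name")
    done
}

storeDevNames <<'EOF'
1 2 first 3
4 5 second 6
7 8 first 9
10 11 third 12
13 14 second 15
EOF

echo "${arr[@]}"                     # prints: first second third
```

Each name is looked up once in seen, so the original order is kept and no nested loop is needed.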
You can replace all of your read block with:
arr=( $(awk '{print $3}' <"$1" | sort | uniq) )
This will fill arr with only unique names from the 3rd word such as first, second, ... This will reduce the entire function to:
function storeDevNames {
arr=( $(awk '{print $3}' <"$1" | sort | uniq) )
}
Note: this will provide a list of all unique device names in sorted order, so removing duplicates this way also destroys the original order. If you need to preserve the original order apart from the removed duplicates, see 4ae1e1's alternative.
You're using the wrong tool. awk is designed for this kind of job.
awk '{ if (!seen[$3]++) print $3 }' <"$1"
This one-liner prints the third column of each line, removing duplicates along the way while preserving the order of lines (only the first occurrence of each unique string is printed). sort | uniq, on the other hand, breaks the original order of lines. This one-liner is also faster than using sort | uniq (for large files, which doesn't seem to be applicable in OP's case), since this one-liner linearly scans the file once, while sort is obviously much more expensive.
As an example, for an input file with contents
1 2 first 3
4 5 second 6
7 8 third 9
10 11 second 12
13 14 fourth 15
the above awk one-liner gives you
first
second
third
fourth
To put the results in an array:
arr=( $(awk '{ if (!seen[$3]++) print $3 }' <"$1") )
Then echo ${arr[@]} will give you first second third fourth.
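One caveat with arr=( $(...) ) is that the command output goes through word splitting and glob expansion. A hedged variation, assuming bash 4 or later for mapfile, reads one array element per output line instead. The temporary file and its contents are made up for the example.

```shell
#!/usr/bin/env bash
# Hypothetical input file with the same layout as in the question.
input=$(mktemp)
printf '%s\n' '1 2 first 3' '4 5 second 6' '7 8 first 9' > "$input"

# mapfile -t stores each line of awk's output as one element, without
# word splitting or glob expansion.
mapfile -t arr < <(awk '!seen[$3]++ { print $3 }' "$input")
rm -f "$input"

printf '%s\n' "${arr[@]}"
```

The awk program is the same first-occurrence filter as above, written as a pattern instead of an if statement.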
I have the input below and would like to compute the geometric average of the rows whose “Cpd_number” and ”ID3” are the same. The files contain a lot of data, so we might need arrays to do the trick. However, as an awk beginner, I am not sure how to start. Could anyone kindly offer some hints?
input:
“ID1”,“Cpd_number”, “ID2”,”ID3”,”activity”
“95”,“123”,”4”,”5”,”10”
“95”, “123”,”4”,”5”,”100”
“95”, “123”,”4”,”5”,”1”
“95”, “123”,”4”,”6”,”10”
“95”, “123”,”4”,”6”,”100”
“95”, “456”,”4”,”6”,”10”
“95”, “456”,”4”,”6”,”100”
Three lines of “95”,“123”,”4”,”5” should do a geometric average
Two lines of “95”, “123”,”4”,”6” should do a geometric average
Two lines of “95”, “456”,”4”,”6” should do a geometric average
Here is the desired output:
“ID1”,“Cpd_number”, “ID2”,”ID3”,”activity”
“95”,“123”,”4”,”5”,”10”
“95”, “123”,”4”,”6”,”31.62”
“95”, “456”,”4”,”6”,”31.62”
Some info about geometric mean:
http://en.wikipedia.org/wiki/Geometric_mean
This script computes a geometric mean
#!/usr/bin/awk -f
{
b = $1; # value of 1st column
C += log(b);
D++;
}
END {
print "Geometric mean : ",exp(C/D);
}
Having this file:
$ cat infile
"ID1","Cpd_number","ID2","ID3","activity"
"95","123","4","5","10"
"95","123","4","5","100"
"95","123","4","5","1"
"95","123","4","6","10"
"95","123","4","6","100"
"95","456","4","6","10"
"95","456","4","6","100"
This piece:
awk -F\" 'BEGIN{print} # Print headers
last != $4""$8 && last{ # ONLY When last key "Cpd_number + ID3"
print line,exp(C/D) # differs from the current one, print line + average
C=D=0} # reset accumulators
{ # This block processes each line of infile
C += log($(NF-1)+0) # C calc
D++ # D counter
$(NF-1)="" # Get rid of activity col in order to print line
line=$0 # Line will be the current line without activity
last=$4""$8} # Store the key in order to track switching
END{ # This block triggers after the complete file read
# to print the last average that cannot be triggered during
# the previous block
print line,exp(C/D)}' infile
Will print:
ID1 , Cpd_number , ID2 , ID3 , 0
95 , 123 , 4 , 5 , 10
95 , 123 , 4 , 6 , 31.6228
95 , 456 , 4 , 6 , 31.6228
Still some work left for formatting.
NOTE: char " is used instead of “ and ”
EDIT: NF is the number of fields in the current line, so $(NF-1) is the next-to-last field:
$ awk -F\" 'BEGIN{getline}{print $(NF-1)}' infile
10
100
1
10
100
10
100
So in log($(NF-1)+0) we apply the log function to that value (adding 0 forces a numeric value).
D++ is just a counter.
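Since the question mentions arrays, here is a sketch of an array-based variant that does not rely on rows with the same key being adjacent. It assumes plain " quotes and comma separators as in infile above; the function name geomean_by_group and the compact output format (Cpd_number, ID3, mean) are choices made for this example only.

```shell
#!/usr/bin/env bash
# Hypothetical helper: geometric mean of activity per (Cpd_number, ID3) pair.
geomean_by_group() {
    awk -F, '
        { gsub(/"/, "") }                  # strip the quotes; $0 is re-split on commas
        NR == 1 { next }                   # skip the header line
        {
            key = $2 FS $4                 # Cpd_number and ID3 form the group key
            if (!(key in n)) order[++groups] = key
            sum[key] += log($5)            # accumulate the log of the activity
            n[key]++
        }
        END {
            for (g = 1; g <= groups; g++) {
                key = order[g]
                printf "%s,%.2f\n", key, exp(sum[key] / n[key])
            }
        }
    ' "$@"
}

result=$(geomean_by_group <<'EOF'
"ID1","Cpd_number","ID2","ID3","activity"
"95","123","4","5","10"
"95","123","4","5","100"
"95","123","4","5","1"
"95","123","4","6","10"
"95","123","4","6","100"
"95","456","4","6","10"
"95","456","4","6","100"
EOF
)
printf '%s\n' "$result"
```

The order array keeps the groups in first-seen order, because iterating with for (key in n) gives no guaranteed order in awk.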
Why use awk? Just do it in bash, with either bc or calc to handle the floating-point math. You can download calc at http://www.isthe.com/chongo/src/calc/ (2.12.4.13-11 is latest). There are rpms, binary and source tarballs available. It is far superior to bc in my opinion. The routine is fairly simple. You need to remove the extraneous " quotes from your data file first, leaving a plain CSV file. That helps. See the sed command used in the comments below. Note that the geometric mean below is the nth root of the product of the n activity values in each group. If you need a different mean, just adjust the code below:
#!/bin/bash
##
## You must strip all quotes from data before processing, or write more code to do
## it here. Just do "$ sed 's/\"//g' < datafile > newdatafile" Then use
## newdatafile as command line argument to this program
##
## Additionally, this script uses 'calc' for floating point math. go download it
## from: http://www.isthe.com/chongo/src/calc/ (2.12.4.13-11 is latest). You can also
## use bc if you like, but why, calc is so much better.
##
## test to make sure file passed as argument is readable
test -r "$1" || { echo "error: invalid input, usage: ${0//*\//} filename"; exit 1; }
## function to strip extraneous whitespace from input
trimWS() {
[[ -z $1 ]] && return 1
strln="${#1}"
[[ strln -lt 2 ]] && return 1
trimSTR=$1
trimSTR="${trimSTR#"${trimSTR%%[![:space:]]*}"}" # remove leading whitespace characters
trimSTR="${trimSTR%"${trimSTR##*[![:space:]]}"}" # remove trailing whitespace characters
echo $trimSTR
return 0
}
let cnt=0
let oldsum=0 # holds value to compare against new Cpd_number & ID3
product=1 # initialize product to 1
pcnt=0 # initialize the number of values in product
IFS=$',\n' # Internal Field Separator, set to break on ',' or newline
while read newid1 newcpd newid2 newid3 newact || test -n "$act"; do
cpd=`trimWS $cpd` # trimWS from cpd (only one that needed it)
# if first iteration, just output first row
test "$cnt" -eq 0 && echo " $newid1 $newcpd $newid2 $newid3 $newact"
# after first iteration, test oldsum -ne sum, if so do geometric mean
# and reset product and counters
if test "$cnt" -gt 0 ; then
sum=$((newcpd+newid3)) # calculate sum to test against oldsum
if test "$oldsum" -ne "$sum" && test "$cnt" -gt 1; then
# geometric mean (nth root of product)
# mean=`calc -p "root ($product, $pcnt)"` # using calc
mean=`echo "scale=6; e( l($product) / $pcnt)" | bc -l` # using bc
echo " $id1 $cpd $id2 $id3 average: $mean"
pcnt=0
product=1
fi
# update last values to new values
oldsum=$sum
id1="$newid1"
cpd="$newcpd"
id2="$newid2"
id3="$newid3"
act="$newact"
((product*=act)) # accumulate product
((pcnt+=1))
fi
((cnt+=1))
done < "$1"
output:
# output using calc
ID1 Cpd_number ID2 ID3 activity
95 123 4 5 average: 10
95 123 4 6 average: 31.62277660168379331999
95 456 4 6 average: 31.62277660168379331999
# output using bc
ID1 Cpd_number ID2 ID3 activity
95 123 4 5 average: 9.999999
95 123 4 6 average: 31.622756
95 456 4 6 average: 31.622756
The updated script calculates the proper mean. It is a bit more involved due to having to keep old/new values to test for the change in cpd & id3. This may be where awk is the simpler way to go. But if you need more flexibility later, bash may be the answer.
Programming beginner here needs some help modifying an AWK script to make it conditional. Alternative non-awk solutions are also very welcome.
NOTE Main filtering is now working thanks to help from Birei but I have an additional problem, see note below in question for details.
I have a series of input files with 3 columns like so:
chr4 190499999 190999999
chr6 61999999 62499999
chr1 145499999 145999999
I want to use these rows to filter another file (refGene.txt) and, if a row in file one matches a row in refGene.txt, to output column 13 of refGene.txt to a new file 'ListofGenes_$f'.
The tricky part for me is that I want it to count as a match as long as column one (eg 'chr4', 'chr6', 'chr1' ) and column 2 AND/OR column 3 matches the equivalent columns in the refGene.txt file. The equivalent columns between the two files are $1=$3, $2=$5, $3=$6.
Then I am not sure in awk how to not print the whole row from refGene.txt but only column 13.
NOTE I have achieved the conditional filtering described above thanks to help from Birei. Now I need to incorporate an additional filter condition. I also need to output column $13 from the refGene.txt file if any part of the region between values $2 and $3 overlaps with the region between $5 and $6 in the refGene.txt file. This seems a lot trickier, as it involves mathematical computation to see if the regions overlap.
My script so far:
FILES=/files/*txt
for f in $FILES ;
do
awk '
BEGIN {
FS = "\t";
}
FILENAME == ARGV[1] {
pair[ $1, $2, $3 ] = 1;
next;
}
{
if ( pair[ $3, $5, $6 ] == 1 ) {
print $13;
}
}
' $(basename $f) /files/refGene.txt > /files/results/$(basename $f) ;
done
Any help is really appreciated. Thanks so much!
Rubal
One way.
awk '
BEGIN { FS = "\t"; }
## Save third, fifth and sixth field of first file in arguments (refGene.txt) as the key
## to compare later. As value the field to print.
FNR == NR {
pair[ $3, $5, $6 ] = $13;
next;
}
## Set the name of the output file.
FNR == 1 {
output_file = "";
n = split( ARGV[ARGIND], path, /\// );
## Rebuild the directory part of the path, component by component.
for ( i = 1; i < n; i++ ) {
output_file = output_file ( i > 1 ? "/" : "" ) path[i];
}
output_file = ( n > 1 ? output_file "/" : "" ) "ListOfGenes_" path[n];
}
## If $1 = $3, $2 = $5 and $3 = $6, print $13 to output file.
{
if ( pair[ $1, $2, $3 ] ) {
print pair[ $1, $2, $3 ] >output_file;
}
}
' refGene.txt /files/rubal/*.txt
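For the additional overlap condition in the question, two closed ranges [s1, e1] and [s2, e2] on the same chromosome overlap exactly when s1 <= e2 and s2 <= e1. The sketch below applies that test; it is not part of the answer above. The function name overlapping_genes, the temporary files and the gene names are invented, and the column layout (chromosome in $3, start in $5, end in $6, gene in $13 of refGene.txt) is taken from the question.

```shell
#!/usr/bin/env bash
# Hypothetical helper. Usage: overlapping_genes regions_file refgene_file
overlapping_genes() {
    awk -F'\t' '
        # First file: remember every region (chromosome, start, end).
        FNR == NR { n++; chr[n] = $1; lo[n] = $2 + 0; hi[n] = $3 + 0; next }
        # Second file: print column 13 once if the gene range overlaps any region.
        {
            for (i = 1; i <= n; i++)
                if ($3 == chr[i] && ($5 + 0) <= hi[i] && lo[i] <= ($6 + 0)) {
                    print $13
                    break
                }
        }
    ' "$1" "$2"
}

regions=$(mktemp); refgene=$(mktemp)
printf '%s\t%s\t%s\n' chr4 100 200  chr6 500 600 > "$regions"
# 13 tab-separated columns; only columns 3, 5, 6 and 13 carry real values here.
printf 'x\tx\t%s\tx\t%s\t%s\tx\tx\tx\tx\tx\tx\t%s\n' \
    chr4 150 250 GENE_A \
    chr4 201 300 GENE_B \
    chr6 400 500 GENE_C \
    chr1 100 200 GENE_D > "$refgene"

result=$(overlapping_genes "$regions" "$refgene")
rm -f "$regions" "$refgene"
printf '%s\n' "$result"
```

With these sample rows only GENE_A and GENE_C are printed: GENE_B starts one position after the chr4 region ends, and GENE_D is on a chromosome that is not listed. Ranges that merely touch at an endpoint, like GENE_C, count as overlapping under this closed-interval test.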