Nested loop with two arrays - arrays

I have two sets of arrays with a variable number of elements, for instance:
chain=(B C)
hresname=(BMA MAN NAG NDG)
I am parsing a number of files that may contain elements from the array chain at a given position and elements of the array hresname at a different position (position is always fixed in both cases). This is a sample of the data:
ATOM 5792 CB MET D 213 49.385 -5.683 125.489 1.00142.66 C
ATOM 5793 CG MET D 213 50.834 -5.674 125.990 1.00154.50 C
ATOM 5794 SD MET D 213 51.530 -7.337 126.277 1.00164.73 S
ATOM 5795 CE MET D 213 52.854 -7.386 125.068 1.00169.73 C
HETATM 5797 C1 NAG B 323 70.090 50.934 125.869 1.00 86.35 C
HETATM 5798 C2 NAG B 323 69.687 52.074 126.879 1.00 95.95 C
HETATM 5799 C3 NAG B 323 68.377 52.740 126.390 1.00 87.65 C
HETATM 5800 C4 NAG B 323 68.598 53.314 125.014 1.00 83.97 C
First I need to copy lines starting with ATOM whose 5th column matches each of the elements of the array chain to separate files:
while read pdb ; do
for c in "${chain[#]}" ; do
#if [ ${#chain[#]} -eq 1 ] && \
if [ $(echo "$pdb" | cut -c1-4) == "ATOM" ] && \
[ $(echo "$pdb" | cut -c22-23) == "${chain[$c]}" ]; then
echo "$pdb" >> ../../properpdb/${pdbid}_${chain[$c]}.pdb
fi
done
done < ${pdbid}.pdb
This works well (slow but sure). Both the commented and the uncommented versions work.
Next I want to copy lines that start with HETATM and whose 4th column matches elements of the hresname but only if those lines also match an element form the chain array at cloumn number 5th:
while read pdb ; do
for c in "${chain[#]}" ; do
for h in "${hresname[#]}" ; do
if [ ${#chain[#]} -eq 1 ] && \
[ $(echo "$pdb" | cut -c1-6) == "HETATM" ] && \
[ $(echo "$pdb" | cut -c22-23) == "${chain[$c]}" ] \
[ $(echo "$pdb" | cut -c18-20) == "${hresname[$h]}" ] ; then
echo "$pdb" >> ../../properpdb/${pdbid}_${chain[$c]}.pdb
fi
done
done
done < ${pdbid}.pdb
However, this does not work. I repeatedly receive an error:
line 66: [: too many arguments
Line 66 is:
[ $(echo "$pdb" | cut -c22-23) == "${chain[$c]}" ] \
Which puzzles me because the error happens even if I restrict the loop to chain arrays containing a single element.
According to other StackOverflow questions, it should be perfectly possible to do this in bash. Any idea what the problem could be?

You for get to add &&, change this line:
[ $(echo "$pdb" | cut -c22-23) == "${chain[$c]}" ] \
To
[ $(echo "$pdb" | cut -c22-23) == "${chain[$c]}" ] &&\
Update:
You have too many errors in the script, I fixed it and it's now working. I suggest you read the for loop syntax in the bash manual first.
chain=(B C)
hresname=(BMA MAN NAG NDG)
while read pdb ; do
for c in ${chain[#]} ; do
for h in ${hresname[#]} ; do
if [ $(echo "$pdb" | cut -c1-6) == "HETATM" ] && \
[ $(echo "$pdb" | cut -c22-23) == "$c" ] && \
[ $(echo "$pdb" | cut -c18-20) == "$h" ] ; then
echo "$pdb" >> ../../properpdb/${pdbid}_${chain[$c]}.pdb
fi
done
done
done

Related

Bash find unique values in array1 not in array2 (and vice-versa)

In bash, I know to be able to find the unique values between two arrays can be found by:
echo "${array1[#]} ${array2[#]}" | tr ' ' '\n' | sort | uniq -u
However, this gives the unique values between BOTH arrays. What if I wanted something about the elements that are unique only to array1 and elements that are unique only to array2? For example:
array1=(1 2 3 4 5)
array2=(2 3 4 5 6)
original_command_output = 1 6
new_command_output1 = 1
new_command_output2 = 6
You could use the comm command.
To get elements unique to the first array:
comm -23 \
<(printf '%s\n' "${array1[#]}" | sort) \
<(printf '%s\n' "${array2[#]}" | sort)
and elements unique to the second array:
comm -13 \
<(printf '%s\n' "${array1[#]}" | sort) \
<(printf '%s\n' "${array2[#]}" | sort)
Or, more robust, allowing for any character including newlines to be part of the elements, split on the null byte:
comm -z -23 \
<(printf '%s\0' "${array1[#]}" | sort -z) \
<(printf '%s\0' "${array2[#]}" | sort -z)
comm is probably the way to go but if you're running bash >= 4 then you can do it with associative arrays:
#!/bin/bash
declare -a array1=(1 2 3 4 5) array2=(2 3 4 5 6)
declare -A uniq1=() uniq2=()
for e in "${array1[#]}"; do uniq1[$e]=; done
for e in "${array2[#]}"; do
if [[ ${uniq1[$e]-1} ]]
then
uniq2[$e]=
else
unset "uniq1[$e]"
fi
done
echo "new_command_output1 = ${!uniq1[*]}"
echo "new_command_output2 = ${!uniq2[*]}"
new_command_output1 = 1
new_command_output2 = 6
BASH builtins can handle this cleaner and quicker. This will read both arrays for each element, comparing if they exist in either. If no match is found, output unique elements
arr1=(1 2 3 4 5)
arr2=(1 3 2 4)
for i in "${arr1[#]}" "${arr2[#]}" ; do
[[ ${arr2[#]} =~ $i ]] || echo $i
[[ ${arr1[#]} =~ $i ]] || echo $i
done
output: 5
If one of yours arrays have multiple character elements, e.g. 152, then you must convert the arrays, adding a literal character before and after. This way regex can identify an exact match
arr1=(1 2 3 4 5)
arr2=(1 3 2 4 152)
for i in "${arr1[#]}" ; do
var1+="^$i$"
done
for i in "${arr2[#]}" ; do
var2+="^$i$"
done
for i in "${arr1[#]}" "${arr2[#]}" ; do
[[ $var1 =~ "^$i$" ]] || echo $i
[[ $var2 =~ "^$i$" ]] || echo $i
done
output: 5 152

Computing sum of specific field from array entries

I have an array trf. Would like to compute the sum of the second element in each array entry.
Example of array contents
trf=( "2 13 144" "3 21 256" "5 34 389" )
Here is the current implementation, but I do not find it robust enough. For instance, it fails with arbitrary number of elements (but considered constant from one array element to another) in each array entry.
cnt=0
m=${#trf[#]}
while (( cnt < m )); do
while read -r one two three
do
sum+="$two"+
done <<< $(echo ${array[$count]})
let count=$count+1
done
sum+=0
result=`echo "$sum" | /usr/bin/bc -l`
You're making it way too complicated. Something like
#!/usr/bin/env bash
trf=( "2 13 144" "3 21 256" "5 34 389" )
declare -i sum=0 # Integer attribute; arithmetic evaluation happens when assigned
for (( n = 0; n < ${#trf[#]}; n++)); do
read -r _ val _ <<<"${trf[n]}"
sum+=$val
done
printf "%d\n" "$sum"
in pure bash, or just use awk (This is handy if you have floating point numbers in your real data):
printf "%s\n" "${trf[#]}" | awk '{ sum += $2 } END { print sum }'
You can use printf to print the entire array, one entry per line. On such an input, one loop (while read) would be sufficient. You can even skip the loop entirely using cut and tr to build the bc command. The echo 0 is there so that bc can handle empty arrays and the trailing + inserted by tr.
{ printf %s\\n "${trf[#]}" | cut -d' ' -f2 | tr \\n +; echo 0; } | bc -l
For your examples this generates prints 68 (= 13+21+34+0).
Try this printf + awk combo:
$ printf '%s\n' "${trf[#]}" | awk '{print $2}{a+=$2}END{print "sum:", a}'
13
21
34
sum: 68
Oh, it's already suggested by Shawn. Then with loop:
$ for item in "${trf[#]}"; do
echo $item
done | awk '{print $2}{a+=$2}END{print "sum:", a}'
13
21
34
sum: 68
For relatively small arrays a for/while double loop should be ok re: performance; placing the final sum in the $result variable (as in OP's code):
result=0
for element in "${trf[#]}"
do
while read -r a b c
do
((result+=b))
done <<< "${element}"
done
echo "${result}"
This generates:
68
For larger data sets I'd probably opt for one of the awk-only solutions (for performance reasons).

array substitution throwing set -a error

I have following piece of code which is working fine if executed standalone -
I am facing a weird error, code is executed fine when executed standalone but throws error when embedded with another piece of code described below in the post
date=$1
set -A max_month 0 31 28 31 30 31 30 31 31 30 31 30 31
eval $(echo $date|sed 's!\(....\)\(..\)\(..\)!year=\1;month=\2;day=\3!')
(( year4=year%4 ))
(( year100=year%100 ))
(( year400=year%400 ))
if [ \( $year4 -eq 0 -a \
$year100 -ne 0 \) -o \
$year400 -eq 0 ]
then
set -A max_month 0 31 29 31 30 31 30 31 31 30 31 30 31
fi
day=$((day+1))
echo $day ${max_month[$month]}
if [ $day -gt ${max_month[$month]} ]
then
day=1
month=$((month+1))
if [ $month -gt 12 ]
then
year=$((year+1))
month=1
fi
fi
new_date=$(printf "%4.4d%2.2d%2.2d" $year $month $day)
echo $new_date
When I try to embed it into following code highlighted in red it throws error, obviously I replaced date=$1 to julian_date_14 -
#!/bin/bash
cd
unset project_env
cd /wload/baot/home/baotasa0/UKRB_UKBE/sandboxes/EXTRACTS/UK/RB/UKBA/ukrb_ukba_pbe_acq
. ab* . >> project_setup.log 2>&1
echo unset the environment is doNe
cd /wload/baot/app/data_abinitio/serial/uk_cust
param1=$1
param2=$2
param3=$3
email=$4
header_date_14=$(m_dump /wload/baot/app/data_abinitio/serial/uk_cust/ukrb_ukba_acnt_bde27_src.dml $param1 | head -35)
hdr_dt_14=$(echo "$header_date_14" | awk '$1=="bdfo_run_date" {print $2}')
julian_date_14=$(m_eval '(date("YYYYMMDD"))( unsigned integer(2)) '$hdr_dt_14'') 2>&1
header_date_15=$(m_dump /wload/baot/app/data_abinitio/serial/uk_cust/ukrb_ukba_acnt_bde27_src.dml $param2 | head -35)
hdr_dt_15=$(echo "$header_date_15" | awk '$1=="bdfo_run_date" {print $2}')
julian_date_15=$(m_eval '(date("YYYYMMDD"))( unsigned integer(2)) '$hdr_dt_15'')
header_date_16=$(m_dump /wload/baot/app/data_abinitio/serial/uk_cust/ukrb_ukba_acnt_bde27_src.dml $param3 | head -35)
hdr_dt_16=$(echo "$header_date_16" | awk '$1=="bdfo_run_date" {print $2}')
julian_date_16=$(m_eval '(date("YYYYMMDD"))( unsigned integer(2)) '$hdr_dt_16'')
echo $julian_date_16
if [ "$julian_date_14" = "$julian_date_15" -a "$julian_date_15" = "$julian_date_16" ]
then
echo all are same
else
echo check the file date please
fi
DATE=`echo $julian_date_14 | cut -c8-9`
Date_minus_1=`expr $DATE - 1`
DATE_1=`echo $julian_date_14 | cut -c2-7`
DATE_FINAL="$DATE_1$Date_minus_1"
echo $DATE_FINAL
Error is below -
./auto1.sh: line 70: set: -A: invalid option
set: usage: set [-abefhkmnptuvxBCHP] [-o option-name] [--] [arg ...]
Any help will be greatly appreciated.
Thank you in advance!

Getting output of shell command in bash array

I have a uniq -c output, that outputs about 7-10 lines with the count of each pattern that was repeated for each unique line pattern. I want to store the output of my uniq -c file.txt into a bash array. Right now all I can do is store the output into a variable and print it. However, bash currently thinks the entire output is just one big string.
How does bash recognize delimiters? How do you store UNIX shell command output as Bash arrays?
Here is my current code:
proVar=`awk '{printf ("%s\t\n"), $1}' file.txt | grep -P 'pattern' | uniq -c`
echo $proVar
And current output I get:
587 chr1 578 chr2 359 chr3 412 chr4 495 chr5 362 chr6 287 chr7 408 chr8 285 chr9 287 chr10 305 chr11 446 chr12 247 chr13 307 chr14 308 chr15 365 chr16 342 chr17 245 chr18 252 chr19 210 chr20 193 chr21 173 chr22 145 chrX 58 chrY
Here is what I want:
proVar[1] = 2051
proVar[2] = 1243
proVar[3] = 1068
...
proVar[22] = 814
proVar[X] = 72
proVar[Y] = 13
In the long run, I'm hoping to make a barplot based on the counts for each index, where every 50 counts equals one "=" sign. It will hopefully look like the below
chr1 ===========
chr2 ===========
chr3 =======
chr4 =========
...
chrX ==
chrY =
Any help, guys?
To build the associative array, try this:
declare -A proVar
while read -r val key; do
proVar[${key#chr}]=$val
done < <(awk '{printf ("%s\t\n"), $1}' file.txt | grep -P 'pattern' | uniq -c)
Note: This assumes that your command's output is composed of multiple lines, each containing one key-value pair; the single-line output shown in your question comes from passing $proVar to echo without double quotes.
Uses a while loop to read each output line from a process substitution (<(...)).
The key for each assoc. array entry is formed by stripping prefix chr from each input line's first whitespace-separated token, whereas the value is the rest of the line (after the separating space).
To then create the bar plot, use:
while IFS= read -r key; do
echo "chr${key} $(printf '=%.s' $(seq $(( ${proVar[$key]} / 50 ))))"
done < <(printf '%s\n' "${!proVar[#]}" | sort -n)
Note: Using sort -n to sort the keys will put non-numeric keys such as X and Y before numeric ones in the output.
$(( ${proVar[$key]} / 50 )) calculates the number of = chars. to display, using integer division in an arithmetic expansion.
The purpose of $(seq ...) is to simply create as many tokens (arguments) as = chars. should be displayed (the tokens created are numbers, but their content doesn't matter).
printf '=%.s' ... is a trick that effectively prints as many = chars. as there are arguments following the format string.
printf '%s\n' "${!proVar[#]}" | sort -n sorts the keys of the assoc. array numerically, and its output is fed via a process substitution to the while loop, which therefore iterates over the keys in sorted order.
You can create an array in an assignment using parentheses:
proVar=(`awk '{printf ("%s\t\n"), $1}' file.txt | grep -P 'pattern' | uniq -c`)
There's no built-in way to create an associative array directly from input. For that you'll need an additional loop.

Cannot print entire array in Bash Shell script

I've written a shell script to get the PIDs of specific process names (e.g. pgrep python, pgrep java) and then use top to get the current CPU and Memory usage of those PIDs.
I am using top with the '-p' option to give it a list of comma-separated PID values. When using it in this mode, you can only query 20 PIDs at once, so I've had to come up with a way of handling scenarios where I have more than 20 PIDs to query. I'm splitting up the list of PIDs passed to the function below and "despatching" multiple top commands to query the resources:
# $1 = List of PIDs to query
jobID=0
for pid in $1; do
if [ -z $pidsToQuery ]; then
pidsToQuery="$pid"
else
pidsToQuery="$pidsToQuery,$pid"
fi
pidsProcessed=$(($pidsProcessed+1))
if [ $(($pidsProcessed%20)) -eq 0 ]; then
debugLog "DESPATCHED QUERY ($jobID): top -bn 1 -p $pidsToQuery | grep \"^ \" | awk '{print \$9,\$10}' | grep -o '.*[0-9].*' | sed ':a;N;\$!ba;s/\n/ /g'"
resourceUsage[$jobID]=`top -bn 1 -p "$pidsToQuery" | grep "^ " | awk '{print $9,$10}' | grep -o '.*[0-9].*' | sed ':a;N;$!ba;s/\n/ /g'`
jobID=$(($jobID+1))
pidsToQuery=""
fi
done
resourceUsage[$jobID]=`top -bn 1 -p "$pidsToQuery" | grep "^ " | awk '{print $9,$10}' | grep -o '.*[0-9].*' | sed ':a;N;$!ba;s/\n/ /g'`
The top command will return the CPU and Memory usage for each PID in the format (CPU, MEM, CPU, MEM etc)...:
13 31.5 23 22.4 55 10.1
The problem is with the resourceUsage array. Say, I have 25 PIDs I want to process, the code above will place the results of the first 20 PIDs in to $resourceUsage[0] and the last 5 in to $resourceUsage[1]. I have tested this out and I can see that each array element has the list of values returned from top.
The next bit is where I'm having difficulty. Any time I've ever wanted to print out or use an entire array's set of values, I use ${resourceUsage[#]}. Whenever I use that command in the context of this script, I only get element 0's data. I've separated out this functionality in to a script below, to try and debug. I'm seeing the same issue here too (data output to debug.log in same dir as script):
#!/bin/bash
pidList="1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 21 22 23 24 25"
function quickTest() {
for ((i=0; i<=1; i++)); do
resourceUsage[$i]=`echo "$i"`
done
echo "${resourceUsage[0]}"
echo "${resourceUsage[1]}"
echo "${resourceUsage[#]}"
}
function debugLog() {
debugLogging=1
if [ $debugLogging -eq 1 ]; then
currentTime=$(getCurrentTime 1)
echo "$currentTime - $1" >> debug.log
fi
}
function getCurrentTime() {
if [ $1 -eq 0 ]; then
echo `date +%s`
elif [ $1 -eq 1 ]; then
echo `date`
fi
}
jobID=0
for pid in $pidList; do
if [ -z $pidsToQuery ]; then
pidsToQuery="$pid"
else
pidsToQuery="$pidsToQuery,$pid"
fi
pidsProcessed=$(($pidsProcessed+1))
if [ $(($pidsProcessed%20)) -eq 0 ]; then
debugLog "DESPATCHED QUERY ($jobID): top -bn 1 -p $pidsToQuery | grep \"^ \" | awk '{print \$9,\$10}' | grep -o '.*[0-9].*' | sed ':a;N;\$!ba;s/\n/ /g'"
resourceUsage[$jobID]=`echo "10 10.5 11 11.5 12 12.5 13 13.5"`
debugLog "Resource Usage [$jobID]: ${resourceUsage[$jobID]}"
jobID=$(($jobID+1))
pidsToQuery=""
fi
done
#echo "Dispatched job: $pidsToQuery"
debugLog "DESPATCHED QUERY ($jobID): top -bn 1 -p $pidsToQuery | grep \"^ \" | awk '{print \$9,\$10}' | grep -o '.*[0-9].*' | sed ':a;N;\$!ba;s/\n/ /g'"
resourceUsage[$jobID]=`echo "14 14.5 15 15.5"`
debugLog "Resource Usage [$jobID]: ${resourceUsage[$jobID]}"
memUsageInt=0
memUsageDec=0
cpuUsage=0
i=1
debugLog "Row 0: ${resourceUsage[0]}"
debugLog "Row 1: ${resourceUsage[1]}"
debugLog "All resource usage results: ${resourceUsage[#]}"
for val in ${resourceUsage[#]}; do
resourceType=$(($i%2))
if [ $resourceType -eq 0 ]; then
debugLog "MEM RAW: $val"
memUsageInt=$(($memUsageInt+$(echo $val | cut -d '.' -f 1)))
memUsageDec=$(($memUsageDec+$(echo $val | cut -d '.' -f 2)))
debugLog " MEM INT: $memUsageInt"
debugLog " MEM DEC: $memUsageDec"
elif [ $resourceType -ne 0 ]; then
debugLog "CPU RAW: $val"
cpuUsage=$(($cpuUsage+$val))
debugLog "CPU TOT: $cpuUsage"
fi
i=$(($i+1))
done
debugLog "$MEM DEC FINAL: $memUsageDec (pre)"
memUsageDec=$(($memUsageDec/10))
debugLog "$MEM DEC FINAL: $memUsageDec (post)"
memUsage=$(($memUsageDec+$memUsageInt))
debugLog "MEM USAGE: $memUsage"
debugLog "CPU USAGE: $cpuUsage"
debugLog "MEM USAGE: $memUsage"
debugLog "PROCESSED VALS: $cpuUsage,$memUsage"
echo "$cpuUsage,$memUsage"
I'm really stuck here as I've printed out entire arrays before in Bash Shell with no problem. I've even repeated this in the shell console with a few lines and it works fine there:
listOfValues[0]="1 2 3 4"
listOfValues[1]="5 6 7 8"
echo "${listOfValues[#]}"
Am I missing something totally obvious? Any help would be greatly appreciated!
Thanks in advance! :)
Welcome to StackOverflow, and thanks for providing a test case! The bash tag wiki has additional suggestions for creating small, simplified test cases. Here's a minimal version that shows your problem:
log() {
echo "$1"
}
array=(foo bar)
log "Values: ${array[#]}"
Expected: Values: foo bar. Actual: Values: foo.
This happens because ${array[#]} is magic in quotes, and turns into multiple arguments. The same is true for $#, and for brevity, let's consider that:
Let's say $1 is foo and $2 is bar.
The single parameter "$#" (in quotes) is equivalent to the two arguments "foo" "bar".
"Values: $#" is equivalent to the two parameters "Values: foo" "bar"
Since your log statement ignores all arguments after the first one, none of them show up. echo does not ignore them, and instead prints all arguments space separated, which is why it appeared to work interactively.
This is as opposed to ${array[*]} and $*, which are exactly like $# except not magic in quotes, and does not turn into multiple arguments.
"$*" is equivalent to "foo bar"
"Values: $*" is equivalent to "Values: foo bar"
In other words: If you want to join the elements in an array into a single string, Use *. If you want to add all the elements in an array as separate strings, use #.
Here is a fixed version of the test case:
log() {
echo "$1"
}
array=(foo bar)
log "Values: ${array[*]}"
Which outputs Values: foo bar
I would use ps, not top, to get the desired information. Regardless, you probably want to put the data for each process in a separate element of the array, not one batch of 20 per element. You can do this using a while loop and a process substitution. I use a few array techniques to simplify the process ID handling.
pid_array=(1 2 3 4 5 6 7 8 9 ... )
while (( ${#pid_array[#]} > 0 )); do
printf -v pidsToQuery "%s," "${pid_array[#]:0:20}"
pid_array=( "${pid_array[#]:20}" )
while read cpu mem; do
resourceUsage+=( "$cpu $mem" )
done < <( top -bn -1 -p "${pidsToQuery%,}" ... )
done

Resources