Picking input record fields with AWK - arrays

Let's say we have a shell variable $x containing a space separated list of numbers from 1 to 30:
$ x=$(for i in {1..30}; do echo -n "$i "; done)
$ echo $x
1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 21 22 23 24 25 26 27 28 29 30
We can print the first three input record fields with AWK like this:
$ echo $x | awk '{print $1 " " $2 " " $3}'
1 2 3
How can we print all the fields starting from the Nth field with AWK? E.g.
4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 21 22 23 24 25 26 27 28 29 30
EDIT: I can use cut, sed etc. to do the same but in this case I'd like to know how to do this with AWK.

Converting my comment to answer so that solution is easy to find for future visitors.
You may use this awk:
awk '{for (i=3; i<=NF; ++i) printf "%s", $i (i<NF?OFS:ORS)}' file
or pass start position as argument:
awk -v n=3 '{for (i=n; i<=NF; ++i) printf "%s", $i (i<NF?OFS:ORS)}' file

Version 4: Shortest is probably using sub to cut off the first three fields and their separators:
$ echo $x | awk 'sub(/^ *([^ ]+ +){3}/,"")'
Output:
4 5 6 7 8 9 ...
This will, however, preserve all space after $4:
$ echo "1 2 3 4 5" | awk 'sub(/^ *([^ ]+ +){3}/,"")'
4 5
so if you wanted the space squeezed, you'd need to, for example:
$ echo "1 2 3 4 5" | awk 'sub(/^ *([^ ]+ +){3}/,"") && $1=$1'
4 5
with the exception that if there are only 4 fields and the 4th field happens to be a 0:
$ echo "1 2 3 0" | awk 'sub(/^ *([^ ]+ +){3}/,"")&&$1=$1'
$ [no output]
in which case you'd need to:
$ echo "1 2 3 0" | awk 'sub(/^ *([^ ]+ +){3}/,"") && ($1=$1) || 1'
0
Version 1: cut is better suited for the job:
$ cut -d\ -f 4- <<<$x
Version 2: Using awk you could:
$ echo -n $x | awk -v RS=\ -v ORS=\ 'NR>=4;END{printf "\n"}'
Version 3: If you want to preserve those varying amounts of space, using GNU awk you could use split's fourth parameter seps:
$ echo "1 2 3 4 5 6 7" |
gawk '{
n=split($0,a,FS,seps) # actual separators goes to seps
for(i=4;i<=n;i++) # loop from 4th
printf "%s%s",a[i],(i==n?RS:seps[i]) # get fields from arrays
}'

Adding one more approach to add all value into a variable and once all fields values are done with reading just print the value of variable. Change the value of n= as per from which field onwards you want to get the data.
echo "$x" |
awk -v n=3 '{val="";for(i=n; i<=NF; i++){val=(val?val OFS:"")$i};print val}'

With GNU awk, you can use the join function which has been a built-in include since gawk 4.1:
x=$(seq 30 | tr '\n' ' ')
echo "$x" | gawk '#include "join"
{split($0, arr)
print join(arr, 4, length(arr), "|")}
'
4|5|6|7|8|9|10|11|12|13|14|15|16|17|18|19|20|21|22|23|24|25|26|27|28|29|30
(Shown here with a '|' instead of a ' ' for clarity...)
Alternative way of including join:
echo "$x" | gawk -i join '{split($0, arr); print join(arr, 4, length(arr), "|")}'

Using gnu awk and gensub:
echo $x | awk '{ print gensub(/^([[:digit:]]+[[:space:]]){3}(.*$)/,"\\2",$0)}'
Using gensub, split the string into two sections based on regular expressions and print the second section only.

Related

Bad substitution error when passing decremented variable to "sed" command

I like to use "sed" command to delete two consecutive lines from a file.
I can delete single line using following syntax where
variable "index" holds the line number:
sed -i "${index}d" "$PWD$DEBUG_DIR$DEBUG_MENU"
Since I like to delete consecutive lines I did test this
syntax which supposedly decrements / increments the variable but I am getting a bad substitution error.
Addendum
I am not sure where to post this , but during troubleshooting of
this issue I have discovered that this syntax does not work as expected
(( index++)) nor (( index-- ))
Using sed with "index" in single line deletion twice in a row / sequentially / works and resolves few issues.
#delete command line
sed -i "${index}d" "$PWD$DEBUG_DIR$DEBUG_MENU"
#delete description line
sed -i "${index}d" "$PWD$DEBUG_DIR$DEBUG_MENU"
sed is the right tool for doing s/old/new, that is all. What you're trying to do isn't that so why are you trying to coerce sed into doing something when there's far more appropriate tools can do the job far easier?
The right way to do what your script:
#retrieve first matching line number
index=$(sed -n "/$choice/=" "$PWD$DEBUG_DIR$DEBUG_MENU")
#delete matching line plus next line from file
sed -i "/$index[1], (( $index[1]++))/"
"$PWD$DEBUG_DIR$DEBUG_MENU"
seems to be trying to do is this:
awk -v choice="$choice" '$0~choice{skip=2} !(skip&&skip--)' "$PWD$DEBUG_DIR$DEBUG_MENU"
For example:
$ seq 20 | awk -v choice="4" '$0~choice{skip=2} !(skip&&skip--)'
1
2
3
6
7
8
9
10
11
12
13
16
17
18
19
20
if you only want to delete the first match:
$ seq 20 | awk -v choice="4" '$0~choice{skip=2;cnt++} !(cnt==1 && skip && skip--)'
1
2
3
6
7
8
9
10
11
12
13
14
15
16
17
18
19
20
or the 2nd:
$ seq 20 | awk -v choice="4" '$0~choice{skip=2;cnt++} !(cnt==2 && skip && skip--)'
1
2
3
4
5
6
7
8
9
10
11
12
13
16
17
18
19
20
and to skip 5 lines instead of 2:
$ seq 20 | awk -v choice="4" '$0~choice{skip=5} !(skip&&skip--)'
1
2
3
9
10
11
12
13
19
20
Just use the right tool for the right job, don't go digging holes to plant trees with a teaspoon.
If you just want the first one, then quit when you see it:
sed -n "/$choice/ {=;q}" file
But you look like you're processing this file multiple times. There must be a simpler way to do it, if you can describe your over-arching goal.
For example, if you just want to remove the matched line and the next line, but only the first time, you can use awk: here we see "4" and "5" are gone, but "14" and "15" remain:
$ seq 20 | awk '/4/ && !seen {getline; seen++; next} 1'
1
2
3
6
7
8
9
10
11
12
13
14
15
16
17
18
19
20
With GNU sed, you can use +1 as the second address in an address range:
index=4
seq 10 | sed "$index,+1 d"
Or if you want to use bash/ksh arithmetic expansion: use the post-increment operator
seq 10 | sed "$((index++)),$index d"
Note here that, due to the pipeline, sed is running in a subshell: even though the index value is now 5, this occurs in a subshell. After the sed command ends, index has value 4 in the current shell.

Use bash variable as array in awk and filter input file by comparing with array

I have bash variable like this:
val="abc jkl pqr"
And I have a file that looks smth like this:
abc 4 5
abc 8 8
def 43 4
def 7 51
jkl 4 0
mno 32 2
mno 9 2
pqr 12 1
I want to throw away rows from file which first field isn't present in the val:
abc 4 5
abc 8 8
jkl 4 0
pqr 12 1
My solution in awk doesn't work at all and I don't have any idea why:
awk -v var="${val}" 'BEGIN{split(var, arr)}$1 in arr{print $0}' file
Just slice the variable into array indexes:
awk -v var="${val}" 'BEGIN{split(var, arr)
for (i in arr)
names[arr[i]]
}
$1 in names' file
As commented in the linked question, when you call split() you get values for the array, while what you want to set are indexes. The trick is to generate another array with this content.
As you see $1 in names suffices, you don't have to call for the action {print $0} when this happens, since it is the default.
As a one-liner:
$ awk -v var="${val}" 'BEGIN{split(var, arr); for (i in arr) names[arr[i]]} $1 in names' file
abc 4 5
abc 8 8
jkl 4 0
pqr 12 1
grep -E "$( echo "${val}"| sed 's/ /|/g' )" YourFile
# or
awk -v val="${val}" 'BEGIN{gsub(/ /, "|",val)} $1 ~ val' YourFile
Grep:
it use a regex (extended version with option -E) that filter all the lines that contains the value. The regex is build OnTheMove in a subshell with a sed that replace the space separator by a | meaning OR
Awk:
use the same princip as the grep but everything is made inside (so no subshell)
use the variable val assigned to the shell variable of the same name
At start of the script (before first line read) change the space, (in val) by | with BEGIN{gsub(/ /, "|",val)}
than, for every line where first field (default field separator is space/blank in awk, so first is the letter group) matching, print it (defaut action of a filter with $1 ~ val.

Cannot print entire array in Bash Shell script

I've written a shell script to get the PIDs of specific process names (e.g. pgrep python, pgrep java) and then use top to get the current CPU and Memory usage of those PIDs.
I am using top with the '-p' option to give it a list of comma-separated PID values. When using it in this mode, you can only query 20 PIDs at once, so I've had to come up with a way of handling scenarios where I have more than 20 PIDs to query. I'm splitting up the list of PIDs passed to the function below and "despatching" multiple top commands to query the resources:
# $1 = List of PIDs to query
jobID=0
for pid in $1; do
if [ -z $pidsToQuery ]; then
pidsToQuery="$pid"
else
pidsToQuery="$pidsToQuery,$pid"
fi
pidsProcessed=$(($pidsProcessed+1))
if [ $(($pidsProcessed%20)) -eq 0 ]; then
debugLog "DESPATCHED QUERY ($jobID): top -bn 1 -p $pidsToQuery | grep \"^ \" | awk '{print \$9,\$10}' | grep -o '.*[0-9].*' | sed ':a;N;\$!ba;s/\n/ /g'"
resourceUsage[$jobID]=`top -bn 1 -p "$pidsToQuery" | grep "^ " | awk '{print $9,$10}' | grep -o '.*[0-9].*' | sed ':a;N;$!ba;s/\n/ /g'`
jobID=$(($jobID+1))
pidsToQuery=""
fi
done
resourceUsage[$jobID]=`top -bn 1 -p "$pidsToQuery" | grep "^ " | awk '{print $9,$10}' | grep -o '.*[0-9].*' | sed ':a;N;$!ba;s/\n/ /g'`
The top command will return the CPU and Memory usage for each PID in the format (CPU, MEM, CPU, MEM etc)...:
13 31.5 23 22.4 55 10.1
The problem is with the resourceUsage array. Say, I have 25 PIDs I want to process, the code above will place the results of the first 20 PIDs in to $resourceUsage[0] and the last 5 in to $resourceUsage[1]. I have tested this out and I can see that each array element has the list of values returned from top.
The next bit is where I'm having difficulty. Any time I've ever wanted to print out or use an entire array's set of values, I use ${resourceUsage[#]}. Whenever I use that command in the context of this script, I only get element 0's data. I've separated out this functionality in to a script below, to try and debug. I'm seeing the same issue here too (data output to debug.log in same dir as script):
#!/bin/bash
pidList="1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 21 22 23 24 25"
function quickTest() {
for ((i=0; i<=1; i++)); do
resourceUsage[$i]=`echo "$i"`
done
echo "${resourceUsage[0]}"
echo "${resourceUsage[1]}"
echo "${resourceUsage[#]}"
}
function debugLog() {
debugLogging=1
if [ $debugLogging -eq 1 ]; then
currentTime=$(getCurrentTime 1)
echo "$currentTime - $1" >> debug.log
fi
}
function getCurrentTime() {
if [ $1 -eq 0 ]; then
echo `date +%s`
elif [ $1 -eq 1 ]; then
echo `date`
fi
}
jobID=0
for pid in $pidList; do
if [ -z $pidsToQuery ]; then
pidsToQuery="$pid"
else
pidsToQuery="$pidsToQuery,$pid"
fi
pidsProcessed=$(($pidsProcessed+1))
if [ $(($pidsProcessed%20)) -eq 0 ]; then
debugLog "DESPATCHED QUERY ($jobID): top -bn 1 -p $pidsToQuery | grep \"^ \" | awk '{print \$9,\$10}' | grep -o '.*[0-9].*' | sed ':a;N;\$!ba;s/\n/ /g'"
resourceUsage[$jobID]=`echo "10 10.5 11 11.5 12 12.5 13 13.5"`
debugLog "Resource Usage [$jobID]: ${resourceUsage[$jobID]}"
jobID=$(($jobID+1))
pidsToQuery=""
fi
done
#echo "Dispatched job: $pidsToQuery"
debugLog "DESPATCHED QUERY ($jobID): top -bn 1 -p $pidsToQuery | grep \"^ \" | awk '{print \$9,\$10}' | grep -o '.*[0-9].*' | sed ':a;N;\$!ba;s/\n/ /g'"
resourceUsage[$jobID]=`echo "14 14.5 15 15.5"`
debugLog "Resource Usage [$jobID]: ${resourceUsage[$jobID]}"
memUsageInt=0
memUsageDec=0
cpuUsage=0
i=1
debugLog "Row 0: ${resourceUsage[0]}"
debugLog "Row 1: ${resourceUsage[1]}"
debugLog "All resource usage results: ${resourceUsage[#]}"
for val in ${resourceUsage[#]}; do
resourceType=$(($i%2))
if [ $resourceType -eq 0 ]; then
debugLog "MEM RAW: $val"
memUsageInt=$(($memUsageInt+$(echo $val | cut -d '.' -f 1)))
memUsageDec=$(($memUsageDec+$(echo $val | cut -d '.' -f 2)))
debugLog " MEM INT: $memUsageInt"
debugLog " MEM DEC: $memUsageDec"
elif [ $resourceType -ne 0 ]; then
debugLog "CPU RAW: $val"
cpuUsage=$(($cpuUsage+$val))
debugLog "CPU TOT: $cpuUsage"
fi
i=$(($i+1))
done
debugLog "$MEM DEC FINAL: $memUsageDec (pre)"
memUsageDec=$(($memUsageDec/10))
debugLog "$MEM DEC FINAL: $memUsageDec (post)"
memUsage=$(($memUsageDec+$memUsageInt))
debugLog "MEM USAGE: $memUsage"
debugLog "CPU USAGE: $cpuUsage"
debugLog "MEM USAGE: $memUsage"
debugLog "PROCESSED VALS: $cpuUsage,$memUsage"
echo "$cpuUsage,$memUsage"
I'm really stuck here as I've printed out entire arrays before in Bash Shell with no problem. I've even repeated this in the shell console with a few lines and it works fine there:
listOfValues[0]="1 2 3 4"
listOfValues[1]="5 6 7 8"
echo "${listOfValues[#]}"
Am I missing something totally obvious? Any help would be greatly appreciated!
Thanks in advance! :)
Welcome to StackOverflow, and thanks for providing a test case! The bash tag wiki has additional suggestions for creating small, simplified test cases. Here's a minimal version that shows your problem:
log() {
echo "$1"
}
array=(foo bar)
log "Values: ${array[#]}"
Expected: Values: foo bar. Actual: Values: foo.
This happens because ${array[#]} is magic in quotes, and turns into multiple arguments. The same is true for $#, and for brevity, let's consider that:
Let's say $1 is foo and $2 is bar.
The single parameter "$#" (in quotes) is equivalent to the two arguments "foo" "bar".
"Values: $#" is equivalent to the two parameters "Values: foo" "bar"
Since your log statement ignores all arguments after the first one, none of them show up. echo does not ignore them, and instead prints all arguments space separated, which is why it appeared to work interactively.
This is as opposed to ${array[*]} and $*, which are exactly like $# except not magic in quotes, and does not turn into multiple arguments.
"$*" is equivalent to "foo bar"
"Values: $*" is equivalent to "Values: foo bar"
In other words: If you want to join the elements in an array into a single string, Use *. If you want to add all the elements in an array as separate strings, use #.
Here is a fixed version of the test case:
log() {
echo "$1"
}
array=(foo bar)
log "Values: ${array[*]}"
Which outputs Values: foo bar
I would use ps, not top, to get the desired information. Regardless, you probably want to put the data for each process in a separate element of the array, not one batch of 20 per element. You can do this using a while loop and a process substitution. I use a few array techniques to simplify the process ID handling.
pid_array=(1 2 3 4 5 6 7 8 9 ... )
while (( ${#pid_array[#]} > 0 )); do
printf -v pidsToQuery "%s," "${pid_array[#]:0:20}"
pid_array=( "${pid_array[#]:20}" )
while read cpu mem; do
resourceUsage+=( "$cpu $mem" )
done < <( top -bn -1 -p "${pidsToQuery%,}" ... )
done

bash, return position of the smallest entry in an array

I have an array in bash. For example
array=(1 3 4e-10 6 4 2e-4 7 5 2 9)
I would like to know how to return position of the smallest number, in this case 3.
Try this:
arr=(1 3 4e-10 6 4 2e-4 7 5 2 9)
for min value and position
echo "${arr[#]}" | tr -s ' ' '\n' | awk '{print($0" "NR)}' |
sort -g -k1,1 | head -1
4e-10 3
for position of min value
echo "${arr[#]}" | tr -s ' ' '\n' | awk '{print($0" "NR)}' |
sort -g -k1,1 | head -1 | cut -f2 -d' '
3
$ array=(1 3 4e-10 6 4 2e-4 7 5 2 9)
$ echo "${array[*]}" | tr ' ' '\n' | awk 'NR==1{min=$0}NR>1 && $1<min{min=$1;pos=NR}END{print min,pos}'
4e-10 3
or just
$ echo "${array[*]}" | tr ' ' '\n' | awk 'NR==1{min=$0}NR>1 && $1<min{min=$1;pos=NR}END{print pos}'
3
to get just the position.
Like the comments say, Bash doesn't do floats. I'd loop through the array and use perl or awk. Something like this should work:
for i in ${array[#]}; do echo $i; done | perl -e 'use strict; my #array; while(<STDIN>) { chomp $_; push (#array,$_+0); } foreach my $number (sort {$a <=> $b} #array) { print "$number\n"; } '
A very "line-noisy" combination of bash and perl
array=(1 3 4e-10 6 4 2e-4 7 5 2 9)
perl -lanE '
say 1 + (sort {$a->[1] <=> $b->[1]} map {[$_, $F[$_]]} 0..$#F)[0]->[0]
' <<< "${array[#]}"
outputs 3

bash + awk : looping over an array of 4 elements that returns size of 1?

I have the following script which uses awk to match fields with user input
NB=$#
FILE=myfile
#GET INPUT
if [ $NB -eq 1 ]
then
A=`awk -F "\t" -v town="$1" 'tolower($3) ~ tolower(town) {print NR}' $FILE`
fi
If I print the output, it reads :
7188 24369 77205 101441
Which is what I expected. Then if I do the following:
IFS=' '
array=($A)
echo ${#array[#]}
I actually get a length of 1 (?). Furthermore, if I try:
for x in $array
do
echo $x
done
It actually prints out :
7188
24369
77205
101441
How can I have it return the length of 4. I don't understand how the for...in works if there's only 1 element?
EDIT :
echo $A | od -c before I create the array is:
0000000 7 1 8 8 2 4 3 6 9 7 7 2 0 5
0000020 1 0 1 4 4 1 \n
0000030
echo $A | od -c after I create the array is:
0000000 7 1 8 8 \n 2 4 3 6 9 \n 7 7 2 0 5
0000020 \n 1 0 1 4 4 1 \n
0000030
It is because returned output from awk is newline (\n) delimited instead of space delimited. So if you have IFS like this instead:
IFS=$'\n' # newline between quotes
Then it will echo array length = 4 as you are expecting.

Resources