Understanding IN Statement in awk

Understanding IN Statement in awk - arrays

I'm having trouble understanding the in statement in bash... First of all, here's the code:
#! /bin/bash
dns=()
while read line; do
up=$(nslookup $line | awk -F ': ' 'NR==6 {print $2} ')
dns+=($up)
done < dns.blacklist.txt.txt
awk '{if( $1 in dns ) print $1 " Blacklisted"; else print $1}' thttpd2.log
So thttpd2.log is just a list of IPs, while nslookup gets the IPs of the hostnames (for blacklisting purposes). Now I want to check whether an IP that connected in the log is on the blacklist, i.e. in the dns array in the code.
All IPs and lookups from nslookup are good: dns=81.169.145.82 192.0.3.45 and awk $1=81.169.145.82. So how can I check in the awk statement at the lower part, if $1 is in dns?
I've been trying for half a day now... I am pretty sure I have not understood "in" so can someone please give me at least a tip?
PS: Current result is just:
81.169.145.82
81.169.145.82
81.169.145.82
192.0.3.45
Goal:
81.169.145.82 Blacklisted
81.169.145.82 Blacklisted
81.169.145.82 Blacklisted
192.0.3.45

It seems better to feed the output of your while loop to awk directly. There is no reason to go through a bash array in between: awk works on input streams and cannot see bash variables.
So, you produce a stream of IPs by reading your blacklist.txt file and parsing the nslookup output. I treat that part as a black box in my answer; I assume it gives good results and that you want to run your logic against the other file. Running one nslookup and one awk per line is also inefficient for large inputs, but since I don't know what else you do in that part, I leave it as is.
while read -r line; do
nslookup "$line" | awk -F ': ' 'NR==6 {print $2}'
done < blacklist.txt | awk 'FNR==NR {dns[$0]; next}
{print ($1 in dns)? $1 " Blacklisted": $1}' - thttpd2.log
You could also pass the blacklist file directly to awk and call the external command from inside awk, but I think this way is simpler.

So how can I check in the awk statement at the lower part, if $1 is in dns?
awk is not shell and shell is not awk. A shell variable is unrelated to any awk variable, and awk variables are unrelated to the shell. awk is a separate program with its own syntax, unrelated to the shell, and the shell is a separate program with its own syntax, unrelated to awk.
The construct subscript in array is part of awk syntax. It checks whether subscript is one of the subscripts (indices) of the awk array named array. It is unrelated to shell variables and bash arrays. Note that the subscript is the index, not the value of the element: array[subscript]=value
Understanding IN - Statement in Linux bash
In the bash shell, in is only a keyword, used in case statements (and in for and select loops):
case something in
pattern) ;;
esac
Its usage is unrelated to the awk usage, because shell is not awk.
please give me at least a tip?
First read the input into awk as subscripts of the array dns. After that, you can use the awk construct something in dns to check whether something is a subscript of that array.
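A minimal sketch of that tip, assuming the blacklisted IPs are already in a file, one per line. The function name mark_blacklisted, the temporary files, and the sample IPs are made up for illustration; the point is only that in tests array subscripts, so the IPs have to be stored as subscripts first.

```shell
#!/bin/bash
# mark_blacklisted is a hypothetical helper: $1 = file of blacklisted IPs,
# $2 = log file whose first field is an IP.
mark_blacklisted() {
    # While reading the first file (FNR==NR), store each line as a subscript
    # of the awk array dns. For the second file, "in" tests the subscripts.
    awk 'FNR==NR { dns[$0]; next }
         { print (($1 in dns) ? $1 " Blacklisted" : $1) }' "$1" "$2"
}

# Demo with made-up sample data.
tmpdir=$(mktemp -d)
printf '%s\n' 81.169.145.82 > "$tmpdir/blacklist"
printf '%s\n' 81.169.145.82 192.0.3.45 > "$tmpdir/log"
mark_blacklisted "$tmpdir/blacklist" "$tmpdir/log"
rm -rf "$tmpdir"
```

With this sample data the first log line gets the Blacklisted suffix and the second is printed unchanged.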

You already got answers explaining what in means, but also note that nslookup can read a list of domain names from stdin:
$ cat dns.blacklist.txt.txt
google.com
yahoo.com
$ nslookup < dns.blacklist.txt.txt
Default Server: cdns01.foo.net
Address: 2222:555:beef::1
> Server: cdns01.foo.net
Address: 2222:555:beef::1
Non-authoritative answer:
Name: google.com
Addresses: 2607:f8b0:4009:804::200e
172.217.9.78
> Server: cdns01.foo.net
Address: 2222:555:beef::1
Non-authoritative answer:
Name: yahoo.com
Addresses: 2001:4998:44:3507::8000
2001:4998:124:1507::f001
2001:4998:124:1507::f000
2001:4998:44:3507::8001
2001:4998:24:120d::1:1
2001:4998:24:120d::1:0
98.137.11.163
74.6.143.25
74.6.231.21
98.137.11.164
74.6.143.26
74.6.231.20
you don't need to wrap anything in a shell loop, e.g. (untested):
nslookup < dns.blacklist.txt.txt |
awk '
NR==FNR {
if ( sub(/^Addresses:/,"") ) { inAddrs=1 }
if ( inAddrs ) {
if ( NF ) { dns[$1] }
else { inAddrs=0 }
}
next
}
{ print $1, ($1 in dns ? "Blacklisted" : "") }
' - thttpd2.log
Note that nslookup can output a list of IP addresses for a given domain, not just 1 as your existing script expects, and the above script will accommodate that.

Related

script to associate columns

I have a file consisting of definitions where a variable name points to an ip address. I want a script (bash/python/similar) to output each ip address followed by a list of each variable where it is defined as well as how many variables were in the list.
Input:
define alpha 192.168.1.1
define beta 192.168.1.3
define gamma 192.168.1.2
define delta 192.168.1.1
define epsilon 192.168.1.3
define zeta 192.168.1.1
define eta 192.168.1.3
define theta 192.168.1.1
Output
192.168.1.1:alpha,delta,zeta,theta:4
192.168.1.3:beta,epsilon,eta:3
192.168.1.2:gamma:1
Do I use associative arrays in bash to do this or is there a better way? I tried to do it but only ended up with a bash script which I had to combine with linux sort and uniq commands but still couldn't get it quite right.
Apologies for crappy title but I couldn't formulate this in a better way so feel free to edit.

1st solution: Could you please try the following. It also handles lines that have more than one value between the keyword define (1st field) and the IP address (last field). Your example has no more than 3 fields per line, but I have allowed for that case in your Input_file too.
awk '{$1="";val=$NF;$NF="";gsub(/^ +| +$/,"");a[val]=a[val]?a[val]","$0:$0;b[val]++} END{for(i in a){print i":"a[i]":"b[i]}}' Input_file
OR adding a non-one liner form of solution here.
awk '
{
$1=""
val=$NF
$NF=""
gsub(/^ +| +$/,"")
a[val]=a[val]?a[val]","$0:$0
b[val]++
}
END{
for(i in a){
print i":"a[i]":"b[i]
}
}' Input_file
2nd solution: The 1st solution above will NOT print the IPs in the order in which they appear in Input_file; this solution preserves that order in the output.
awk '{$1="";val=$NF;$NF="";gsub(/^ +| +$/,"")} !c[val]++{d[++count]=val} {a[val]=a[val]?a[val]","$0:$0;b[val]++} END{for(i=1;i<=count;i++){print d[i]":"a[d[i]]":"b[d[i]]}}' Input_file
OR(adding a non-one liner form of above solution).
awk '
{
$1=""
val=$NF
$NF=""
gsub(/^ +| +$/,"")
}
!c[val]++{
d[++count]=val
}
{
a[val]=a[val]?a[val]","$0:$0
b[val]++
}
END{
for(i=1;i<=count;i++){
print d[i]":"a[d[i]]":"b[d[i]]
}
}' Input_file

Using awk:
awk '
/^define/{
a[$3]++
b[$3]=(b[$3]?b[$3]",":"")$2
}
END {
for(i in a)
print i,b[i],a[i]
}' OFS=: file
a is an array that holds the count of each different IP.
b is an array that holds a string containing all keywords for each IP. A comma is inserted in between every keyword.
At the end of the parsing the index containing the IP and both arrays content are printed.
The OFS=: sets the output field separator to :.

Using array inside awk in shell script

I am very new to Unix shell script and trying to get some knowledge in shell scripting. Please check my requirement and my approach.
I have an input file with this data:
ABC = A:3 E:3 PS:6
PQR = B:5 S:5 AS:2 N:2
I am trying to parse the data and get the result as
ABC
A=3
E=3
PS=6
PQR
B=5
S=5
AS=2
N=2
The values can be added horizontally and vertically so I am trying to use an array. I am trying something like this:
myarr=(main.conf | awk -F"=" 'NR!=1 {print $1}'))
echo ${myarr[1]}
# Or loop through every element in the array
for i in "${myarr[@]}"
do
:
echo $i
done
or
awk -F"=" 'NR!=1 {
print $1"\n"
STR=$2
IFS=':' read -r -a array <<< "$STR"
for i in "${!array[@]}"
do
echo "$i=>${array[i]}"
done
}' main.conf
But when I add this code to a .sh file and try to run it, I get syntax errors as
$ awk -F"=" 'NR!=1 {
> print $1"\n"
> STR=$2
> FS= read -r -a array <<< "$STR"
> for i in "${!array[@]}"
> do
> echo "$i=>${array[i]}"
> done
>
> }' main.conf
awk: cmd. line:4: FS= read -r -a array <<< "$STR"
awk: cmd. line:4: ^ syntax error
awk: cmd. line:5: for i in "${!array[@]}"
awk: cmd. line:5: ^ syntax error
awk: cmd. line:8: done
awk: cmd. line:8: ^ syntax error
How can I complete the above expectations?

This is the awk code to do what you want:
$ cat tst.awk
BEGIN { FS="[ =:]+"; OFS="=" }
{
print $1
for (i=2;i<NF;i+=2) {
print $i, $(i+1)
}
print ""
}
and this is the shell script (yes, all a shell script does to manipulate text is call awk):
$ awk -f tst.awk file
ABC
A=3
E=3
PS=6
PQR
B=5
S=5
AS=2
N=2
A UNIX shell is an environment from which to call UNIX tools (find, sort, sed, grep, awk, tr, cut, etc.). It has its own language for manipulating (e.g. creating/destroying) files and processes and sequencing calls to tools but it is NOT intended to be used to manipulate text. The guys who invented shell also invented awk for shell to call to manipulate text.
Read https://unix.stackexchange.com/questions/169716/why-is-using-a-shell-loop-to-process-text-considered-bad-practice and the book Effective Awk Programming, 4th Edition, by Arnold Robbins.

First off, a command that does what you want:
$ sed 's/ = /\n/;y/: /=\n/' main.conf
ABC
A=3
E=3
PS=6
PQR
B=5
S=5
AS=2
N=2
This replaces, on each line, the first (and only) occurrence of = with a newline (the s command), then turns all : into = and all spaces into newlines (the y command). Notice that
this works only because there is a space at the end of the first line (otherwise it would be a bit more involved to get the empty line between the blocks) and
this works only with GNU sed because it substitutes newlines; see this fantastic answer for all the details and how to get it to work with BSD sed.
As for what you tried, there is almost too much wrong with it to try and fix it piece by piece: from the wild mixing of awk and Bash to syntax errors all over the place. I recommend you read good tutorials for both, for example:
The BashGuide
Effective AWK Programming
A Bash solution
Here is a way to solve the same in Bash; I didn't use any arrays.
#!/bin/bash
# Read line by line into the 'line' variable. Setting 'IFS' to the empty string
# preserves leading and trailing whitespace; '-r' prevents interpretation of
# backslash escapes
while IFS= read -r line; do
# Three parameter expansions:
# Replace ' = ' by newline (escape backslash)
line="${line/ = /\\n}"
# Replace ':' by '='
line="${line//:/=}"
# Replace spaces by newlines (escape backslash)
line="${line// /\\n}"
# Print the modified input line; '%b' expands backslash escapes
printf "%b" "$line"
done < "$1"
Output:
$ ./SO.sh main.conf
ABC
A=3
E=3
PS=6
PQR
B=5
S=5
AS=2
N=2

Can I pass an array to awk using -v?

I would like to be able to pass an array variable to awk. I don't mean a shell array but a native awk one. I know I can pass scalar variables like this:
awk -vfoo="1" 'NR==foo' file
Can I use the same mechanism to define an awk array? Something like:
$ awk -v"foo[0]=1" 'NR==foo' file
awk: fatal: `foo[0]' is not a legal variable name
I've tried a few variations of the above but none of them work on GNU awk 4.1.1 on my Debian. So, is there any version of awk (gawk,mawk or anything else) that can accept an array from the -v switch?
I know I can work around this and can easily think of ways to do so, I am just wondering if any awk implementation supports this kind of functionality natively.

You can use the split() function inside mawk or gawk to split the input of the "-v" value (here is the gawk man page):
split(s, a [, r [, seps] ])
Split the string s into the array a and the separators array seps on
the regular expression r, and return the number of fields.
An example: here I pass ARRAYVAR, a comma-separated list of values that serves as my array, to the awk program with -v, split it into the awk array arrayval with split(), and then print the 3rd element of that array:
echo 0 | gawk -v ARRAYVAR="a,b,c,d,e,f" '{ split(ARRAYVAR,arrayval,","); print(arrayval[3]) }'
c
Seems to work :)
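A small follow-on sketch, not part of the original answer: split() stores the list items as values under the subscripts 1..n, so if you want to test membership with in, you first have to re-key them by value. The names mark_members, ALLOWED, tmp and allowed are hypothetical.

```shell
#!/bin/bash
# mark_members is a hypothetical helper: $1 is a comma-separated list,
# stdin is one candidate per line.
mark_members() {
    awk -v ALLOWED="$1" '
    BEGIN {
        n = split(ALLOWED, tmp, ",")        # tmp[1], tmp[2], ... hold the values
        for (i = 1; i <= n; i++)
            allowed[tmp[i]]                 # referencing creates the subscript
    }
    { print $0, (($0 in allowed) ? "yes" : "no") }'
}

# Demo with made-up input.
printf '%s\n' a x c | mark_members "a,b,c"
```

Here a and c are reported as members and x is not.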

It looks like it is impossible by definition.
From man awk we have that:
-v var=val
--assign var=val
Assign the value val to the variable var, before execution of the
program begins. Such variable values are available to the BEGIN rule
of an AWK program.
Then we read in Using Variables in a Program that:
The name of a variable must be a sequence of letters, digits, or
underscores, and it may not begin with a digit.
Variables in awk can be assigned either numeric or string values.
So the way the -v implementation is defined makes it impossible to provide an array as a variable, since any kind of usage of the characters = or [ is not allowed as part of the -v variable passing. And both are required, since arrays in awk are only associative.

If you don't insist on using -v you could use -i (include) instead to read an awk file that contains the variable settings.
Like this:
if F=$(mktemp inputXXXXXX); then
cat >$F << 'END'
BEGIN {
foo[0]=1
}
END
cat $F
awk -i $F 'BEGIN { print foo[0] }' </dev/null
rm $F
fi
Sample trace (using gawk-4.2.1):
bash -x /tmp/test.sh
++ mktemp inputXXXXXX
+ F=inputrpMsan
+ cat
+ cat inputrpMsan
BEGIN {
foo[0]=1
}
+ awk -i inputrpMsan 'BEGIN { print foo[0] }'
1
+ rm inputrpMsan

Unfortunately, this is not possible. However, you can convert a bash array to an awk array using a few clever methods.
I wanted to do this recently by passing a bash array to awk to use it for filtering, so here is what I did:
$ arr=( hello world this is bash array )
$ echo -e 'this\nmight\nnot\nshow\nup' | awk 'BEGIN {
for (i = 1; i < ARGC; i++) {
my_filter[ARGV[i]]=1
ARGV[i]="" # unset ARGV[i] otherwise awk might try to read it as a file
}
} !my_filter[$0]' "${arr[@]}"
Output:
might
not
show
up

For associative arrays, you could pass it as a string of key-value pairs, and then reformat it in the BEGIN section.
$ echo | awk -v m="a,b;c,d" '
BEGIN {
split(m,M,";")
for (i in M) {
split(M[i],MM,",")
MA[MM[1]]=MM[2]
}
}
{
for (a in MA) {
printf("MA[%s]=%s\n",a, MA[a])
}
}'
Output:
MA[a]=b
MA[c]=d

Shell - Looping Array with command and increment command values

var1=$(echo $getDate | awk '{print $1} {print $2}')
var2=$(echo $getDate | awk '{print $3} {print $4}')
var3=$(echo $getDate | awk '{print $5} {print $6}')
Instead of repeating like the code above, I need to:
loop the same command
increment the values ({print $1} {print $2})
store the value in an array
I was doing something like below but I am stuck maybe someone can help me please:
COMMAND=`find $locationA -type f | wc -l`
getDate=$(find $locationA -type f | xargs ls -lrt | awk '{print $6} {print $7}')
a=1
b=2
for i in $COMMAND
do
i=$(echo $getDate | awk '{print $a} {print $b}')
myarray+=('$i')
a=$((a+1))
b=$((b+1))
done
PS - using ksh
Problem: $COMMAND stores the number of files found in $locationA. I need to loop through the amount of files found and store their dates in an array.

I don't get the meaning of your example code (what is the 'for' loop supposed to do? What is the content of the variable COMMAND?), but in your question you ask to store something in an array, while in the code you wish to simplify, you don't use an array, but simple variables (var1, var2, ....).
If I understand your requirement correctly, your variable getDate contains a string of several words, which are separated by spaces, and you want to assign the first two words to var1, the following two words to var2, and so on. Is this correct?

Now the edited code is at least a bit clearer, though I still don't understand, why you use i as a loop variable, and overwrite it in the first statement inside the loop.
However, a few comments:
If you push '$i' into your array, you will get a literal '$' sign followed by the letter 'i'. To add the value of the variable i, which contains two numbers, you need double quotes ("$i").
I don't understand why you want to loop over the content of the variable COMMAND. This variable will always hold a single number, which means that the loop body will be executed exactly once.
You could use a counting loop, incrementing the loop variable by 2 on each iteration. You would have to calculate the number of iterations beforehand.
Perhaps an easier alternative, which would work in bash or in zsh (I did not try other shells), is to first turn your variable into an array,
tmparr=($(echo $getDate|fmt -w 1))
and then use a loop to collect pairs of its elements:
myarray=()
for ((i=0; i<${#tmparr[*]}; i+=2))
do
myarray+=("${tmparr[$i]} ${tmparr[$((i+1))]}")
done
${myarray[0]} will hold a string consisting of the first two words from getDate, etc.
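Since the question mentions ksh, here is an alternative sketch of the same pairing that avoids the (( ... )) loop and the fmt call by using the positional parameters. I have only reasoned it through for bash; I assume ksh93 handles myarray=() and += the same way. The value of getDate is made-up sample data.

```shell
#!/bin/bash
# Sample stand-in for the question's getDate value (month/day pairs).
getDate='Jan 5 Feb 12 Mar 30'

myarray=()
# Unquoted on purpose: word splitting turns the string into $1, $2, ...
set -- $getDate
while [ "$#" -ge 2 ]; do
    myarray+=("$1 $2")   # store one pair per array element
    shift 2
done

printf '%s\n' "${myarray[@]}"
```

Each element of myarray then holds one pair, for example ${myarray[1]} is the second pair. Note that the unquoted expansion would also expand glob characters if getDate contained any.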

This one should work on zsh, at least with newer versions:
myarray=()
echo $g|fmt -w 1|paste -s -d " \n"|while read s; do myarray+=("$s"); done
This leaves the first pair in ${myarray[1]}, etc.
It doesn't work with bash (and old zsh versions), because these shells would execute the body of the loop in a subshell.
ADDED:
On a second thought, in zsh this one would be simpler:
myarray=("${(f)$(echo $g|fmt -w 1|paste -s -d ' \n')}")

Enumerate the number of running processes with a given name - assign to variable

I need to know how many processes are running for a specific task (e.g. number of Apache tomcats) and if it's 1, then print the PID. Otherwise print out a message.
I need this in a BASH script, now when I perform something like:
result=`ps aux | grep tomcat | awk '{print $2}' | wc -l`
The number of items is assigned to result. Hurrah! But I don't have the PID(s). However when I attempt to perform this as an intermediary step (without the wc), I encounter problems. So if I do this:
result=`ps aux | grep tomcat | awk '{print $2}'`
Any attempts I make to modify the variable result just don't seem to work. I've tried set and tr (replace blanks with line-breaks), but I just cannot get the right result. Ideally I'd like the variable result to be an array with the PIDs as individual elements. Then I can see size, elements, easily.
Can anyone suggest what I am doing wrong?
Thanks,
Phil
Update:
I ended up using the following syntax:
pids=(`ps aux | grep "${searchStr}"| grep -v grep | awk '{print $2}'`)
number=${#pids[@]}
The key was putting the brackets around the back-ticked commands. Now the variable pids is an array and can be asked for length and elements.
Thanks to both choroba and Dimitre for their suggestions and help.

pids=($(
ps -eo pid,command |
sed -n '/[t]omcat/{s/^ *\([0-9]\+\).*/\1/;p}'
))
number=${#pids[@]}
pids=( ... ) creates an array.
$( ... ) returns its output as a string (similar to backquote).
Then, sed is called on the list of all the processes: for lines containing tomcat (the [t] prevents the sed process itself from being matched), only the pid is preserved and printed.

You may need to adjust the pgrep command (you may need or may not need the -f option).
_pids=(
$( pgrep -f tomcat )
)
(( ${#_pids[@]} == 1 )) &&
echo ${_pids[0]} ||
echo message
If you want to print the number of pids (with a message):
_pids=(
$( pgrep -f tomcat )
)
(( ${#_pids[@]} == 1 )) &&
echo ${_pids[0]} ||
echo "${#_pids[@]} running"
It should be noted that the pgrep utility and the syntax used are not standard.
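One caveat about the A && B || C form used above: C also runs if B itself fails, so it is not a strict if/else. With echo that rarely matters, but a plain if is a safe equivalent. A sketch, where report_pids is a hypothetical helper that takes the PIDs as arguments so the logic can be tried without a running tomcat:

```shell
#!/bin/bash
# report_pids is a hypothetical helper: its arguments are the PIDs found.
report_pids() {
    if [ "$#" -eq 1 ]; then
        echo "$1"              # exactly one process: print its PID
    else
        echo "$# running"      # zero or several: print the count
    fi
}

report_pids 4242               # one match
report_pids 4242 4243          # several matches
report_pids                    # no match
```

With a real array it would be called as report_pids "${_pids[@]}".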