Script to group variable names by IP address (associative arrays)

I have a file consisting of definitions where a variable name points to an IP address. I want a script (bash/python/similar) to output each IP address followed by a list of the variables defined with that address, as well as how many variables are in that list.
Input:
define alpha 192.168.1.1
define beta 192.168.1.3
define gamma 192.168.1.2
define delta 192.168.1.1
define epsilon 192.168.1.3
define zeta 192.168.1.1
define eta 192.168.1.3
define theta 192.168.1.1
Output:
192.168.1.1:alpha,delta,zeta,theta:4
192.168.1.3:beta,epsilon,eta:3
192.168.1.2:gamma:1
Do I use associative arrays in bash to do this, or is there a better way? I tried, but only ended up with a bash script that I had to combine with the Linux sort and uniq commands, and I still couldn't get it quite right.
Apologies for the title, but I couldn't formulate it any better, so feel free to edit.

1st solution: Could you please try the following. It also works when there is more than one word between the define keyword (1st field) and the IP address (last field); your example never has more than 3 fields per line, but that case is covered too, in case your Input_file ever contains it.
awk '{$1="";val=$NF;$NF="";gsub(/^ +| +$/,"");a[val]=a[val]?a[val]","$0:$0;b[val]++} END{for(i in a){print i":"a[i]":"b[i]}}' Input_file
Or, a non one-liner form of the solution:
awk '
{
$1=""
val=$NF
$NF=""
gsub(/^ +| +$/,"")
a[val]=a[val]?a[val]","$0:$0
b[val]++
}
END{
for(i in a){
print i":"a[i]":"b[i]
}
}' Input_file
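As an illustration of the multi-word handling, a hypothetical extra input line such as
define long name 192.168.1.9
would still be handled correctly: the last field is taken as the IP and everything between define and the IP becomes the name, so it would be reported as 192.168.1.9:long name:1.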
2nd solution: The 1st solution above will NOT preserve the order in which the IPs appear in Input_file; this one makes sure the IPs come out in the same order as they occur in the input.
awk '{$1="";val=$NF;$NF="";gsub(/^ +| +$/,"")} !c[val]++{d[++count]=val} {a[val]=a[val]?a[val]","$0:$0;b[val]++} END{for(i=1;i<=count;i++){print d[i]":"a[d[i]]":"b[d[i]]}}' Input_file
Or, a non one-liner form of the above solution:
awk '
{
$1=""
val=$NF
$NF=""
gsub(/^ +| +$/,"")
}
!c[val]++{
d[++count]=val
}
{
a[val]=a[val]?a[val]","$0:$0
b[val]++
}
END{
for(i=1;i<=count;i++){
print d[i]":"a[d[i]]":"b[d[i]]
}
}' Input_file

Using awk:
awk '
/^define/{
a[$3]++
b[$3]=(b[$3]?b[$3]",":"")$2
}
END {
for(i in a)
print i,b[i],a[i]
}' OFS=: file
a is an array that holds the count of each different IP.
b is an array that holds a string containing all keywords for each IP. A comma is inserted in between every keyword.
At the end of the parsing, each IP (the array index) and the contents of both arrays are printed.
The OFS=: sets the output field separator to :.
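The question also asks about bash associative arrays: that works too, without needing sort or uniq. A minimal bash 4+ sketch, assuming exactly three space-separated fields per line (define, name, IP) and the same Input_file name used above:
#!/usr/bin/env bash
declare -A names counts        # associative arrays keyed by IP
order=()                       # indexed array remembering first-seen IP order

while read -r _ name ip; do
  [[ -n ${counts[$ip]+x} ]] || order+=("$ip")   # record the IP the first time it is seen
  names[$ip]+=${names[$ip]:+,}$name             # append the variable name, comma separated
  (( counts[$ip]++ ))
done < Input_file

for ip in "${order[@]}"; do
  printf '%s:%s:%s\n' "$ip" "${names[$ip]}" "${counts[$ip]}"
done
The order array is only there to keep the output in the same order the IPs first appear in the input; iterating "${!names[@]}" directly would also work, but bash does not guarantee any particular key order.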

Related

Understanding IN Statement in awk

I have a problem understanding the in statement in bash... First of all, here's the code:
#! /bin/bash
dns=()
while read line; do
up=$(nslookup $line | awk -F ': ' 'NR==6 {print $2} ')
dns+=($up)
done < dns.blacklist.txt.txt
awk '{if( $1 in dns ) print $1 " Blacklisted"; else print $1}' thttpd2.log
So thttpd2.log is just a list of IPs, while nslookup is getting the IPs of hostnames (for blacklisting purposes). Now I want to check whether an IP that connected, according to the log, is on the blacklist, i.e. in the dns array in the code.
All IPs and lookups from nslookup are good: dns=81.169.145.82 192.0.3.45 and in awk $1=81.169.145.82. So how can I check, in the awk statement at the bottom, whether $1 is in dns?
I've been trying for half a day now... I am pretty sure I have not understood "in" so can someone please give me at least a tip?
PS: Current result is just:
81.169.145.82
81.169.145.82
81.169.145.82
192.0.3.45
Goal:
81.169.145.82 Blacklisted
81.169.145.82 Blacklisted
81.169.145.82 Blacklisted
192.0.3.45
It seems better to use the output of your while loop as awk input; there is no reason to put a bash array in the middle, as awk prefers a stream to bash variables.
So you produce a stream of IPs by reading your blacklist file and parsing the nslookup output. I treat that part as a black box in my answer: I assume it gives good results and that you want to run your logic against the other file. Also, running one nslookup and one awk per line is not efficient for large input, but since I don't know what you do in that part, I leave it as is.
while read -r line; do
nslookup "$line" | awk -F ': ' 'NR==6 {print $2}'
done < blacklist.txt | awk 'FNR==NR {dns[$0]; next}
{print ($1 in dns)? $1 " Blacklisted": $1}' - thttpd2.log
You could also give the blacklist file directly to awk and have awk itself call the external command you use, but I think it is simpler like this.
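For completeness, a rough sketch of that alternative (untested; it keeps the same fragile "take line 6 of the nslookup output" parsing as the original script, so treat it only as an illustration of calling an external command from awk):
awk 'NR==FNR {
       cmd = "nslookup " $0          # run one lookup per blacklist entry
       n = 0
       while ((cmd | getline line) > 0)
         if (++n == 6) { split(line, f, ": "); dns[f[2]] }
       close(cmd)
       next
     }
     { print (($1 in dns) ? $1 " Blacklisted" : $1) }' blacklist.txt thttpd2.log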
So how can I check in the awk statement at the lower part, if $1 is in dns?
awk is not shell and shell is not awk. A shell variable is unrelated to any awk variable, and awk variables are unrelated to the shell: they are separate programs, each with its own syntax.
The construct subscript in array is awk syntax for checking whether subscript is one of the subscripts of the awk array array. It is unrelated to shell variables and bash arrays. Note that subscript is the index, not the value of the element: array[subscript]=value.
Understanding IN - Statement in Linux bash
In the bash shell, in is only a keyword, used in case statements (and in for ... in and select loops):
case something in
pattern) ;;
esac
Its usage is unrelated to the awk usage, because shell is not awk.
please give me at least a tip?
First read the input into awk as subscripts of array dns. After that, you may use the awk construct something in dns to check if something is a subscript of an array.
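A minimal illustration of that construct, assuming the blacklisted IPs have already been written one per line to a file (blacklist_ips.txt is just a placeholder name here):
awk 'NR==FNR { dns[$1]; next }                         # 1st file: each IP becomes a subscript of dns
     { print ($1 in dns ? $1 " Blacklisted" : $1) }    # 2nd file: test membership with "in"
' blacklist_ips.txt thttpd2.log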
You already got answers explaining what in means but also - since nslookup can read a list of domain names from stdin:
$ cat dns.blacklist.txt.txt
google.com
yahoo.com
$ nslookup < dns.blacklist.txt.txt
Default Server: cdns01.foo.net
Address: 2222:555:beef::1
> Server: cdns01.foo.net
Address: 2222:555:beef::1
Non-authoritative answer:
Name: google.com
Addresses: 2607:f8b0:4009:804::200e
172.217.9.78
> Server: cdns01.foo.net
Address: 2222:555:beef::1
Non-authoritative answer:
Name: yahoo.com
Addresses: 2001:4998:44:3507::8000
2001:4998:124:1507::f001
2001:4998:124:1507::f000
2001:4998:44:3507::8001
2001:4998:24:120d::1:1
2001:4998:24:120d::1:0
98.137.11.163
74.6.143.25
74.6.231.21
98.137.11.164
74.6.143.26
74.6.231.20
you don't need to wrap anything in a shell loop, e.g. (untested):
nslookup < dns.blacklist.txt.txt |
awk '
NR==FNR {
if ( sub(/^Addresses:/,"") ) { inAddrs=1 }
if ( inAddrs ) {
if ( NF ) { dns[$1] }
else { inAddrs=0 }
}
next
}
{ print $1, ($1 in dns ? "Blacklisted" : "") }
' - thttpd2.log
Note that nslookup can output a list of IP addresses for a given domain, not just 1 as your existing script expects, and the above script will accommodate that.

Bash: how to extract longest directory paths from an array?

I put the output of the find command into an array like this:
pathList=($(find /foo/bar/ -type d))
How do I extract the longest paths from the array when it contains several equal-length longest paths? The contents look like this:
printf '%s\n' "${pathList[@]}"
/foo/bar/raw/
/foo/bar/raw/2020/
/foo/bar/raw/2020/02/
/foo/bar/logs/
/foo/bar/logs/2020/
/foo/bar/logs/2020/02/
After extraction, I would like to assign /foo/bar/raw/2020/02/ and /foo/bar/logs/2020/02/ to another array.
Thank you
Could you please try the following. It prints the longest path(s) (there can be several sharing the same maximum depth), and you could later assign the result to another array too.
printf '%s\n' "${pathList[@]}" |
awk -F'/' '{max=max>NF?max:NF;a[NF]=(a[NF]?a[NF] ORS:"")$0} END{print a[max]}'
I just created a test array with values provided by you and tested it as follows:
arr1=($(printf '%s\n' "${pathList[#]}" |\
awk -F'/' '{max=max>NF?max:NF;a[NF]=(a[NF]?a[NF] ORS:"")$0} END{print a[max]}'))
When I check the new array's contents, they are as follows:
printf '%s\n' "${arr1[@]}"
/foo/bar/raw/2020/02/
/foo/bar/logs/2020/02/
Explanation of awk code: here is a detailed, line-by-line explanation.
awk -F'/' ' ##Starting awk program from here and setting field separator as / for all lines.
{
max=max>NF?max:NF ##Create the variable max; if it is already greater than NF keep it, otherwise set it to the current NF.
a[NF]=(a[NF]?a[NF] ORS:"")$0 ##Create array a indexed by NF and keep appending the current line to it, separated by newlines (ORS).
}
END{ ##Starting END section of this program.
print a[max] ##Print the element of array a indexed by max, i.e. the path(s) with the most components.
}'
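For comparison, a pure-bash sketch of the same idea (it counts path components by splitting on /, and assumes, like the original pathList=($(find ...)), that paths contain no whitespace or newlines):
max=0
longest=()
for p in "${pathList[@]}"; do
  IFS='/' read -r -a parts <<< "${p%/}"   # strip one trailing slash, then split on "/"
  n=${#parts[@]}
  if (( n > max )); then
    max=$n
    longest=("$p")                        # new deepest level: start over
  elif (( n == max )); then
    longest+=("$p")                       # same depth: keep it too
  fi
done
printf '%s\n' "${longest[@]}"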

Reading several files into an associative array in bash (>4.0) [duplicate]

This question already has answers here:
How to pipe input to a Bash while loop and preserve variables after loop ends
(3 answers)
Closed 4 years ago.
I am new to associative arrays in bash, so please forgive me if I sound silly somewhere. Let's say I am reading through a large file and using a bash (version 4.2.46) associative array to store FDR values for genes. For one file, I simply do:
declare -A array
while read ID GeneID geneSymbol chr strand exonStart_0base exonEnd upstreamES upstreamEE downstreamES downstreamEE ID IJC_SAMPLE_1 SJC_SAMPLE_1 IJC_SAMPLE_2 SJC_SAMPLE_2 IncFormLen SkipFormLen PValue FDR IncLevel1 IncLevel2 IncLevelDifference; do
array[$geneSymbol]="${array[$geneSymbol]}${array[$geneSymbol]:+,}$FDR" ;
done < input.txt
This stores the FDR values, which I can print by doing:
for key in "${!array[@]}"; do echo "$key->${array[$key]}"; done
# Prints out
"ABHD14B"->0.285807588279,0.898327660004,0.820468496328
"DHFR"->0.464931314555,0.449582575347
...
I naively tried to read several files into my array by doing:
declare -A array
find ./aligned.filtered/rMAT*/MATS_output/SE.MATS.JunctionCountOnly.txt -type f -exec cat {} + |
while read ID GeneID geneSymbol chr strand exonStart_0base exonEnd upstreamES upstreamEE downstreamES downstreamEE ID IJC_SAMPLE_1 SJC_SAMPLE_1 IJC_SAMPLE_2 SJC_SAMPLE_2 IncFormLen SkipFormLen PValue FDR IncLevel1 IncLevel2 IncLevelDifference;
do array[$geneSymbol]="${array[$geneSymbol]}${array[$geneSymbol]:+,}$FDR" ;
done
But in this case my array ends up being empty. I can of course cat all the files I need and save them into a single file that I can use as above, but it would be nice to know how to make an associative array to store data from several distinct files.
Thank you very much!
You probably shouldn't be doing this in bash in the first place, but your main problem is that the while loop runs in a subshell induced by the pipeline. Use process substitution to invert the relationship.
(Also, don't give names to all the fields you don't actually use; just split the line into an indexed array and pick out the two fields you actually want.)
while read -a fields; do
geneSymbol=${fields[2]}
FDR=${fields[...]} # some number; i'm not counting
array[$geneSymbol]="${array[$geneSymbol]}${array[$geneSymbol]:+,}$FDR"
done < <(find ./aligned.filtered/rMAT*/MATS_output/SE.MATS.JunctionCountOnly.txt -type f -exec cat {} +)
find probably isn't necessary; just put your while loop inside a for loop:
for f in ./aligned.filtered/rMAT*/MATS_output/SE.MATS.JunctionCountOnly.txt; do
while read -a fields; do
...
done < "$f"
done
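Putting the pieces together, a sketch of the whole loop (the field indices are my own counting of the 0-indexed column list in the question, geneSymbol at 2 and FDR at 19, so please verify them against your actual files):
declare -A array
for f in ./aligned.filtered/rMAT*/MATS_output/SE.MATS.JunctionCountOnly.txt; do
  while read -r -a fields; do
    geneSymbol=${fields[2]}     # 3rd column
    FDR=${fields[19]}           # 20th column, per the header order shown in the question
    array[$geneSymbol]+=${array[$geneSymbol]:+,}$FDR
  done < "$f"
done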

How to parse only selected column values using awk

I have a sample flat file which contains the following block
test my array which array is better array huh got it?
INDIA USA SA NZ AUS ARG ARM ARZ GER BRA SPN
I also have an array(ksh_arr2) which was defined like this
ksh_arr2=$(awk '{if(NR==1){for(i=1;i<=NF;i++){if($i~/^arr/){print i}}}}' testUnix.txt)
and contains the following integers
3 5 8
Now I want to parse only those column values which are at the respective numbered positions, i.e. the third, fifth, and eighth.
I also want the output from the 2nd line onwards.
So I tried the following
awk '{for(i=1;i<=NF;i++){if(NR >=1 && i=${ksh_arr2[i]}) do print$i ; done}}' testUnix.txt
but it is apparently not printing the desired outputs.
What am I missing ? Please help.
How I would approach it:
awk -vA="${ksh_arr2[*]}" 'BEGIN{split(A,B," ")}{for(i in B)print $B[i]}' file
Explanation
-vA="${ksh_arr2[*]}" - Set variable A to expanded ksh array
'BEGIN{split(A,B," ") - Splits the expanded array on spaces
(effictively recreating it in awk)
{for(i in B)print $B[i]} - Index in the new array print the field that is the number
contained in that index
Edit
If you want to preserve the order of the fields when printing then this would be better
awk -vA="${ksh_arr2[*]}" 'BEGIN{split(A,B," ")}{while(++i<=length(B))print $B[i]}' file
Since no sample output is shown, I don't know if this output is what you want. It is the output one gets from the code provided with the minimal changes required to get it to run:
$ awk -v k='3 5 8' 'BEGIN{split(k,a," ");} {for(i=1;i<=length(a);i++){print $a[i]}}' testUnix.txt
array
array
array
SA
AUS
ARZ
The above code prints out the selected columns in the same order supplied by the variable k.
Notes
The awk code never defined ksh_arr2. I presume that the value of this array was to be passed in from the shell. It is done here using the -v option to set the variable k to the value of ksh_arr2.
It is not possible to pass into awk an array directly. It is possible to pass in a string, as above, and then convert it to an array using the split function. Above the string k is converted to the awk array a.
awk syntax is different from shell syntax. For instance, awk does not use do or done.
Details
-v k='3 5 8'
This defines an awk variable k. To do this programmatically, replace 3 5 8 with a string or array from the shell.
BEGIN{split(k,a," ");}
This converts the space-separated values in variable k into an array named a.
for(i=1;i<=length(a);i++){print $a[i]}
This prints out each column in array a in order.
Alternate Output
If you want to keep the output from each line on a single line:
$ awk -v k='3 5 8' 'BEGIN{split(k,a," ");} {for(i=1;i<length(a);i++) printf "%s ",$a[i]; print $a[length(a)]}' testUnix.txt
array array array
SA AUS ARZ
awk 'NR>=1 { print $3 " " $5 " " $8 }' testUnix.txt

How to get array dimension in 1 direction in awk multidimension array

Is there any way to get the length of just one dimension of an awk array, as you can in PHP?
Look at this simple example:
awk 'BEGIN{
a[1,1]=1;
a[1,2]=2;
a[2,1]=3;
a[2,3]=2;
print length(a)
}'
Here the length of the array is 4, since every element counts individually; what I want is how many rows are in the array. In my real code I have n fields and I populate the array like this:
for(i=1;i<=NF;i++)A[FNR,i]=$i
The problem is that the number of fields is not fixed in my file; it varies from row to row, so I cannot even calculate it as length(array)/NF.
Is there any solution?
Use GNU awk since it has true multi-dimensional arrays:
awk 'BEGIN{
a[1][1]=1;
a[1][2]=2;
a[1][3]=3;
a[2][1]=4;
a[2][2]=5;
print length(a)
print length(a[1])
print length(a[2])
}'
2
3
2
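Applied to the loading loop from the question, a GNU awk (gawk 4+) sketch, with file standing in for your real input:
gawk '{ for (i = 1; i <= NF; i++) A[FNR][i] = $i }     # one sub-array per input line
END {
  print "rows:", length(A)                             # row count, independent of NF per row
  for (r in A) print "row", r, "has", length(A[r]), "fields"
}' file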
This can also be achieved by counting the unique row indices in the array; try something like this:
awk '
function _get_rowlength(Arr,fnumber, i,t,c){
for(i in Arr){
split(i,sep,SUBSEP)
if(!(sep[fnumber] in t))
{
c++
t[sep[fnumber]]
}
}
return c;
}
BEGIN{
a[1,1]=1;
a[1,2]=2;
a[2,1]=3;
a[2,3]=2;
print _get_rowlength(a,1)
}'
Resulting in:
$ ./tester
2
If you want to try this on a Solaris/SunOS system, change awk to /usr/xpg4/bin/awk, /usr/xpg6/bin/awk, or nawk.
