Using "comm" to find matches between two arrays


I have two arrays and am trying to find matching values using comm. Each element of Array1 contains additional information that I strip out for the comparison, but I would like to keep that information after the comparison is complete.
For example:
Array1=("abc",123,"hello" "def",456,"world")
Array2=("abc")
declare -a Array1
declare -a Array2
I then compare the two arrays:
oldIFS=$IFS IFS=$'\n\t'
array3=($(comm -12 <(echo "${Array1[*]}" | awk -F "," {'print $1'} | sort) <(echo "${Array2[*]}" | sort)))
IFS=$oldIFS
Which finds the match of abc:
echo ${array3[0]}
abc
However, what I want is the full matching element from Array1, including the fields that were stripped out for the comm comparison:
abc,123,hello
EDIT: For more clarification
The arrays in this example are populated with dummy data.
My real example pulls information from server logs, which I save into array1. array1 contains (userIDs,hostIPs,count), which I want to cross-reference against a list of userIDs (array2). My goal is to find out which userIDs exist in both array1 and array2 and save those IDs, with the additional information from array1 (hostIPs,count), into array3.
array1 is populated from a variable that holds the results of a curl command that generates a Splunk search. The data returned looks like this:
"uniqueID=<ID>","<IP>","<hostname>",1
I save the results of the Splunk report as $splunk, and then declare array1 with the results of $splunk minus the header information, since the results come back in CSV format:
array1=( $(echo $splunk | sed 's/ /\n/g' | sed 1d) )
array2 is generated from a master file that I have stored locally, which contains all the application IDs in our ecosystem. For example:
uid=<ID>
I cat the contents of the master file into array2
array2=( $(cat master.txt) )
I then want to find which IDs from array1 exist in array2 and save them as array3. This requires some massaging of the data in array1 to make it match the format of array2.
oldIFS=$IFS IFS=$'\n\t'
array3=($(comm -12 <(echo "${array1[*]}" | sed 's/ /\n/g' | awk -F "\"," {'print $1'} | sed 's/\"//g' | sed 's/|/ /g' | awk -F$'=' -v OFS=$'=' '{ $1 = "uid" }1' | grep -i "OU=People" | sed 's/OU/ou/g' | sort) <(echo "${array2[*]}" | sort)))
IFS=$oldIFS
array3 will then contain the lines that match in both arrays:
uid=<ID>
uid=<ID>
However I am looking for something more along the line of
"uid=<ID>","<IP>","<hostname>",1
"uid=<ID>","<IP>","<hostname>",1

I would do it like this:
join -t, \
<(printf '%s\n' "${Array1[@]}" | sort -t, -k1,1) \
<(printf '%s\n' "${Array2[@]}" | sort)
Use the join command with , as the field delimiter. The first "file" is the first array, one element per line, sorted on the first field (comma delimited); the second "file" is the second array, one element per line, sorted.
The output will be every line where the first element of the first file matches the element from the second file; for the example input it's
abc,123,hello
This makes only one assumption, namely that no array element contains a newline. To make it more robust (assuming GNU Coreutils), we can use NUL as the delimiter:
join -z -t, \
<(printf '%s\0' "${Array1[@]}" | sort -z -t, -k1,1) \
<(printf '%s\0' "${Array2[@]}" | sort -z)
This prints the output separated by NUL as well; to read the result into an array, we can use readarray:
readarray -d '' -t Array3 < <(
join -z -t, \
<(printf '%s\0' "${Array1[@]}" | sort -z -t, -k1,1) \
<(printf '%s\0' "${Array2[@]}" | sort -z)
)
readarray -d requires Bash 4.4 or newer. For older Bash, you can use a loop:
while IFS= read -r -d '' element; do
Array3+=("$element")
done < <(
join -z -t, \
<(printf '%s\0' "${Array1[@]}" | sort -z -t, -k1,1) \
<(printf '%s\0' "${Array2[@]}" | sort -z)
)
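
For comparison, here is a minimal sketch of a different technique that needs no external tools: a lookup table in an associative array. This is not the join approach above. It assumes Bash 4.0 or newer, non-empty IDs, and the dummy data from the question; the names wanted, id, entry and key are made up for illustration.

```bash
#!/usr/bin/env bash
# Dummy data from the question.
Array1=("abc,123,hello" "def,456,world")
Array2=("abc")

# Build a lookup table keyed by the IDs in Array2 (associative arrays need Bash 4+).
declare -A wanted=()
for id in "${Array2[@]}"; do
    wanted["$id"]=1
done

# Keep each Array1 element whose first comma-separated field is a wanted ID.
Array3=()
for entry in "${Array1[@]}"; do
    key=${entry%%,*}    # drop everything from the first comma onward
    if [[ -n ${wanted["$key"]+set} ]]; then
        Array3+=("$entry")
    fi
done

printf '%s\n' "${Array3[@]}"
```

Unlike the sort-based pipelines, this keeps the matches in their original Array1 order.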

I don't know how to do this with comm, but I do have a solution for you with sed and grep. The following commands match on the regex uid=X,, where the string/array is in the form of uid=x or (uid=x uid=y) respectively.
# Array 2 (B) is a string
$ A=("uid=1,10.10.10.1,server1,1" "uid=2,10.10.10.2,server2,1")
$ B="uid=1"
$ echo ${A[@]} | grep -oE "([^ ]*${B},[^ ]*)"
uid=1,10.10.10.1,server1,1
# Array 2 (D) is an array
$ C=(${A[@]} "uid=3,10.10.10.3,server3,1" "uid=4,10.10.10.4,server4,1")
$ D=(${B} "uid=3")
$ echo ${C[*]} | grep -oE "([^ ]*($(echo ${D[@]} | sed 's/ /,|/g'))[^ ]*)"
uid=1,10.10.10.1,server1,1
uid=3,10.10.10.3,server3,1
# Content of arrays
$ echo ${A[@]}
uid=1,10.10.10.1,server1,1 uid=2,10.10.10.2,server2,1
$ echo ${B}
uid=1
$ echo ${C[@]}
uid=1,10.10.10.1,server1,1 uid=2,10.10.10.2,server2,1 uid=3,10.10.10.3,server3,1 uid=4,10.10.10.4,server4,1
$ echo ${D[@]}
uid=1 uid=3

Related

Parse variables from string and add them to an array with Bash

In Bash, how can I get the strings between curly braces (without the '_value' suffix) from, for example,
"\\*\\* ${host_name_value}.${host_domain_value} - ${host_ip_value}\\*\\*"
and put them into an array?
The result for the above example should be something like:
var_array=("host_name" "host_domain" "host_ip")
The string could also contain other stuff such as:
"${package_updates_count_value} ${package_updates_type_value} updates"
The result for the above example should be something like:
var_array=("package_updates_count" "package_updates_type")
All variables end with _value. There could be one or more variables in the string.
Not sure what would be the most efficient way and how I'd best handle this. Regex? Sed?

input='\\*\\* ${host_name_value}.${host_domain_value} \\*\\*'
# would also work with cat input or the like.
myarray=($(echo "$input" | awk -F'$' \
'{for(i=1;i<=NF;i++) {match($i, /{([^}]*)_value}/, a); print a[1]}}'))
Split your line(s) on $. Check if a column contains { }. If it does, print what's after { and before _value}. (If not, it will print out the empty string, which bash array creation will ignore.)

If there are only two variables, this will work.
input='\\*\\* ${host_name_value}.${host_domain_value} \\*\\*'
first=$(echo $input | sed -r -e 's/[}].+//' -e 's/.+[{]//')
last=$(echo $input | sed -r -e 's/.+[{]//' -e 's/[}].+//')
output="var_array=(\"$first\" \"$last\")"
Maybe not very efficient and beautiful, but it works well.

Starting with a string variable:
$ str='\\*\\* ${host_name_value}.${host_domain_value} - ${host_ip_value}\\*\\*'
Use grep -o to print all matching words.
$ grep -o '\${\w*_value}' <<< "$str"
${host_name_value}
${host_domain_value}
${host_ip_value}
Then remove ${ and _value}.
$ grep -o '\${\w*_value}' <<< "$str" | sed 's/^\${//; s/_value}$//'
host_name
host_domain
host_ip
Finally, use readarray to safely read the results into an array.
$ readarray -t var_array < <(grep -o '\${\w*_value}' <<< "$str" | sed 's/^\${//; s/_value}$//')
$ declare -p var_array
declare -a var_array=([0]="host_name" [1]="host_domain" [2]="host_ip")
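
If you would rather not spawn grep and sed, a pure-Bash loop over regex matches is another option. This is only a sketch of that alternative: the names rest and re are hypothetical, and it assumes variable names consist of letters, digits and underscores.

```bash
#!/usr/bin/env bash
str='** ${host_name_value}.${host_domain_value} - ${host_ip_value}**'

var_array=()
rest=$str
# ERE: a literal "${", the captured name, then a literal "_value}".
re='\$\{([A-Za-z0-9_]+)_value\}'

while [[ $rest =~ $re ]]; do
    var_array+=("${BASH_REMATCH[1]}")
    # Drop everything up to and including the match just found,
    # so the next iteration finds the following placeholder.
    rest=${rest#*"${BASH_REMATCH[0]}"}
done

declare -p var_array
```

Each pass finds the leftmost remaining placeholder, so the names end up in var_array in the order they appear in the string.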

diff two arrays each containing files paths into a third array (for removal)

In the function below you will see notes on several attempts to solve this problem; each attempt has a note indicating what went wrong. Between my attempts there is a line from another question here which purports to solve some element of the matter. Again, I've added a note indicating what that is supposed to solve. My brain is mush at this point. What is the stupid simple thing I'm overlooking?
function func_removeDestinationOrphans() {
readarray -d '' A_Destination_orphans < <( find "${directory_PMPRoot_destination}" -type f -print0 )
for (( i = 0 ; i < ${#A_Destination_orphans[@]} ; i++ )) ; do
printf '%s\n' "→ ${A_Destination_orphans[${i}]}" # path to each track
done
printf '%b\n' ""
# https://stackoverflow.com/questions/2312762/compare-difference-of-two-arrays-in-bash
# echo ${Array1[@]} ${Array2[@]} | tr ' ' '\n' | sort | uniq -u ## original
# Array3=(`echo ${Array1[@]} ${Array2[@]} | tr ' ' '\n' | sort | uniq -u `) ## store in array
# A_Destination_orphans_diff=(`echo "${A_Destination_dubUnders[@]}" "${A_Destination_orphans[@]}" | tr ' ' '\n' | sort | uniq -u `) # drops file path after space
# printf "%s\0" "${Array1[@]}" "${Array2[@]}" | sort -z | uniq -zu ## newlines and white spaces
# A_Destination_orphans_diff=($( printf "%s\0" "${A_Destination_dubUnders[@]}" "${A_Destination_orphans[@]}" | sort -z | uniq -zu )) # throws warning and breaks at space but not newline
# printf '%s\n' "${Array1[@]}" "${Array2[@]}" | sort | uniq -u ## manage spaces
# A_Destination_orphans_diff=($( printf '%s\n' "${A_Destination_dubUnders[@]}" "${A_Destination_orphans[@]}" | sort | uniq -u )) # breaks at space and newline
# A_Destination_orphans_diff="($( printf '%s\n' "${A_Destination_dubUnders[@]}" "${A_Destination_orphans[@]}" | sort | uniq -u ))" # creates string surrounded by ()
# A_Destination_orphans_diff=("$( printf '%s\n' "${A_Destination_dubUnders[@]}" "${A_Destination_orphans[@]}" | sort | uniq -u )") # creates string
# A_Destination_orphans_diff=($( printf '%s\n' ${A_Destination_dubUnders[@]} ${A_Destination_orphans[@]} | sort | uniq -u )) # drops file path after space
for (( i = 0 ; i < ${#A_Destination_orphans_diff[@]} ; i++ )) ; do
printf '%s\n' "→ ${A_Destination_orphans_diff[${i}]}" # path to each track
done
printf '%b\n' ""
for (( i = 0 ; i < ${#A_Destination_orphans_diff[@]} ; i++ )) ; do
echo # rm "${A_Destination_orphans_diff[i]}"
done
func_EnterToContinue
}

This throws a warning and breaks at spaces because the array is built by direct assignment from an unquoted command substitution. A command substitution cannot hold NUL bytes, so Bash drops them (with a warning), and the resulting string is then word-split on whitespace, so any entry containing a space is split into several elements.
A_Destination_orphans_diff=($( printf "%s\0" "${A_Destination_dubUnders[@]}" "${A_Destination_orphans[@]}" | sort -z | uniq -zu ))
To avoid the issue of the method above, you can mapfile/readarray a null delimited entries stream.
mapfile -t -d '' A_Destination_orphans_diff < <(
printf "%s\0" "${A_Destination_dubUnders[@]}" "${A_Destination_orphans[@]}" |
sort -z |
uniq -zu
)
In case your shell version is too old to support mapfile you can perform the same task with IFS=$'\37' read -r -d '' -a array.
$'\37' is shell's C-Style string syntax with octal code 37, which is ASCII 31 US for Unit Separator:
IFS=$'\37' read -r -d '' -a A_Destination_orphans_diff < <(
printf "%s\0" "${A_Destination_dubUnders[@]}" "${A_Destination_orphans[@]}" |
sort -z |
uniq -zu |
xargs -0 printf '%s\37'
)
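
A small self-contained demo of the NUL-delimited approach, using made-up paths that contain spaces. The arrays keep and found are hypothetical stand-ins for A_Destination_dubUnders and A_Destination_orphans, and the sketch assumes Bash 4.4+ and GNU sort/uniq. Note that uniq -u yields the symmetric difference, so an entry that appears only in keep would also be reported. Listing keep twice, as done here, is one way to make sure only entries unique to found survive.

```bash
#!/usr/bin/env bash
# Hypothetical sample data: files to keep, and files found on the device.
keep=("/music/a b.mp3" "/music/gone.mp3")
found=("/music/a b.mp3" "/music/c d.mp3")

# keep is printed twice so none of its entries can ever be unique;
# only paths that occur solely in found pass through uniq -u.
mapfile -t -d '' orphans < <(
    printf '%s\0' "${keep[@]}" "${keep[@]}" "${found[@]}" |
    sort -z |
    uniq -zu
)

declare -p orphans
```

With this data, orphans ends up holding the single path "/music/c d.mp3", with the embedded space intact.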

To remove all files not present in A_Destination_dubUnders array you could:
func_removeDestinationOrphans() {
find "${directory_PMPRoot_destination}" -type f -print0 |
sort -z |
join -z -v1 -t '' - <(printf "%s\0" "${A_Destination_dubUnders[@]}" | sort -z) |
xargs -0 echo rm
}
Use join or comm to find elements not present in one list and present in another list. I am usually wrong about -v1, so try with -v2 if it echoes the elements from wrong list (I do not understand if you want to remove files present in A_Destination_dubUnders list or not present, you did not specify that).
Note that function name() is a mix of ksh and posix function definition. Just name() {. See bash hackers wiki obsolete

Here is the working version with modifications thanks to suggested input from the first two respondents (thanks!).
function func_removeDestinationOrphans() {
printf '%s\n' " → Purge playlist orphans: " ""
printf '%b\n' "First we will remove any files not present in your proposed playlist. "
func_EnterToContinue
bash_version="$( bash --version | head -n1 | cut -d " " -f4 | cut -d "(" -f1 )"
if printf '%s\n' "4.4.0" "${bash_version}" | sort -V -C ; then
readarray -d '' A_Destination_orphans < <( find "${directory_PMPRoot_destination}" -type f -print0 ) # readarray or mapfile -d fails before bash 4.4.0
readarray -t -d '' A_Destination_orphans_diff < <(
printf "%s\0" "${A_Destination_dubUnders[@]}" "${A_Destination_orphans[@]}" |
sort -z |
uniq -zu
)
else
while IFS= read -r -d $'\0'; do
A_Destination_orphans+=( "$REPLY" )
done < <( find "${directory_PMPRoot_destination}" -type f -print0 )
IFS=$'\37' read -r -d '' -a A_Destination_orphans_diff < <(
printf "%s\0" "${A_Destination_dubUnders[@]}" "${A_Destination_dubUnders[@]}" "${A_Destination_orphans[@]}" |
sort -z |
uniq -zu |
xargs -0 printf '%s\37'
)
fi
if [[ ! "${A_Destination_orphans_diff[*]}" = '' ]] ; then
for (( i = 0 ; i < ${#A_Destination_orphans_diff[@]} ; i++ )) ; do
rm "${A_Destination_orphans_diff[i]}"
done
fi
}
If you would like to see the entire Personal Music Player sync script, you can find that via my GitHub.

How do I rebuild an array without null elements?

I am trying to create a new array of strings without null elements from an array of strings with null elements.
Code
#!/bin/bash
inlist=(a b c d) # inlist to be processed
outlist=(a b) # outlist to be deleted from inlist
for i in "${outlist[@]}"; do
inlist=( "${inlist[@]/$i}" ) # use outlist to remove elements from inlist
done
for i in "${!inlist[@]}"; do # create new inlist without null elements
# if []; then
templist+=( "${inlist[i]}" )
# fi
done
inlist=("${templist[@]}")
unset templist
for i in "${!inlist[@]}"; do
echo "$i" "${inlist[i]}"
done
Unexpected result
0
1
2 c
3 d
Expected result
0 c
1 d
Once the array handling is working, I want to then extend the script to handle lists of files, something like
Extension
mapfile -t inlist < inlist.txt
mapfile -t outlist < outlist.txt
inlist.txt
file1.txt
file2.txt
file3.txt
file4.txt
outlist.txt
file1.txt
file2.txt
I am learning bash and working through some of the basic concepts around operators, expansion and substitution.
Appreciate any explanations or verbose code suggestions.
The problem seems to be the for loop not ignoring null elements when adding them to temporary array.
Thanks in advance

templist still has all the same null strings as inlist. You want something like
for i in "${inlist[@]}"; do
if [ -n "$i" ]; then
templist+=( "$i" )
fi
done
Now inlist=("${templist[@]}") will reset inlist as desired.
You could also use
for i in "${!inlist[@]}"; do
if [ -z "${inlist[i]}" ]; then
unset "inlist[i]"
fi
done
which leaves inlist in a slightly different state:
$ declare -p inlist
declare -a inlist=([2]="c" [3]="d")
but inlist=("${inlist[@]}") will ignore the actual indices when building the new array.
Given your two input files,
$ comm -23 inlist.txt outlist.txt
file3.txt
file4.txt
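
One more thing worth knowing: the ${inlist[@]/$i} expansion in the question removes the pattern as a substring from every element, which is why empty strings are left behind (and why an element such as ab would be shortened to b when removing a). A minimal sketch that sidesteps both problems by comparing whole elements; the names result, item, out and skip are made up for illustration.

```bash
#!/usr/bin/env bash
inlist=(a b c d)     # list to be processed
outlist=(a b)        # elements to delete from inlist

result=()
for item in "${inlist[@]}"; do
    skip=0
    for out in "${outlist[@]}"; do
        # Exact string comparison; quoting "$out" disables glob matching.
        if [[ $item == "$out" ]]; then
            skip=1
            break
        fi
    done
    (( skip )) || result+=("$item")
done

inlist=("${result[@]}")
declare -p inlist
```

The nested loop is quadratic, so for large file lists the comm approach on sorted input is likely the better fit.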

Use join or comm to extract elements that aren't in one list, but are in the other.
Below I printf the arrays as zero separated streams, sort them, then comm on them and then readarray into inlist.
inlist=(a b c d)
outlist=(a b)
IFS= readarray -d '' inlist < <(comm -z -23 <(printf "%s\0" "${inlist[@]}" | sort -z) <(printf "%s\0" "${outlist[@]}" | sort -z))
declare -p inlist
will output:
declare -a inlist=([0]="c" [1]="d")
Notes:
this will probably be very fast
-z for comm is a gnu extension
you will lose the element order, as elements are sorted before comm.
On bash version pre 4.4 that doesn't have -d option with readarray, you can read the array line by line and append to an array:
inlist=(a b c d)
outlist=(a b)
while IFS= read -d '' -r a; do
tmplist+=("$a")
done < <(comm -z -23 <(printf "%s\0" "${inlist[@]}" | sort -z) <(printf "%s\0" "${outlist[@]}" | sort -z))
declare -p tmplist

Assigning an Array Parsed With jq to Bash Script Array

I parsed a json file with jq like this :
# cat test.json | jq '.logs' | jq '.[]' | jq '._id' | jq -s
It returns an array like this : [34,235,436,546,.....]
In my bash script I declared an array:
# declare -a msgIds = ...
This array uses () instead of [] so when I pass the array given above to this array it won't work.
([324,32,45..]) this causes problem. If i remove the jq -s, an array forms with only 1 member in it.
Is there a way to solve this issue?

We can solve this problem in two ways.
Input string:
// test.json
{
"keys": ["key1","key2","key3"]
}
Approach 1:
1) Use jq -r (output raw strings, not JSON texts) .
KEYS=$(jq -r '.keys' test.json)
echo $KEYS
# Output: [ "key1", "key2", "key3" ]
2) Use @sh (Converts input string to a series of space-separated strings). It removes square brackets[], comma(,) from the string.
KEYS=$(<test.json jq -r '.keys | @sh')
echo $KEYS
# Output: 'key1' 'key2' 'key3'
3) Using tr to remove single quotes from the string output. To delete specific characters use the -d option in tr.
KEYS=$( (<test.json jq -r '.keys | @sh') | tr -d \')
echo $KEYS
# Output: key1 key2 key3
4) We can convert the space-separated string to an array by placing our string output in round brackets ().
It also called compound Assignment, where we declare the array with a bunch of values.
ARRAYNAME=(value1 value2 .... valueN)
#!/bin/bash
KEYS=($( (<test.json jq -r '.keys | @sh') | tr -d \'\"))
echo "Array size: " ${#KEYS[@]}
echo "Array elements: "${KEYS[@]}
# Output:
# Array size: 3
# Array elements: key1 key2 key3
Approach 2:
1) Use jq -r to get the string output, then use tr to delete characters like square brackets, double quotes and comma.
#!/bin/bash
KEYS=$(jq -r '.keys' test.json | tr -d '[],"')
echo $KEYS
# Output: key1 key2 key3
2) Then we can convert the whitespace-separated string to an array by placing our string output in round brackets ().
#!/bin/bash
KEYS=($(jq -r '.keys' test.json | tr -d '[]," '))
echo "Array size: " ${#KEYS[@]}
echo "Array elements: "${KEYS[@]}
# Output:
# Array size: 3
# Array elements: key1 key2 key3

To correctly parse values that have spaces, newlines (or any other arbitrary characters) just use jq's @sh filter and bash's declare -a. (No need for a while read loop or any other pre-processing)
// foo.json
{"data": ["A B", "C'D", ""]}
str=$(jq -r '.data | @sh' foo.json)
declare -a arr="($str)" # must be quoted like this
$ declare -p arr
declare -a arr=([0]="A B" [1]="C'D" [2]="")
The reason that this works correctly is that @sh will produce a space-separated list of shell-quoted words:
$ echo "$str"
'A B' 'C'\''D' ''
and this is exactly the format that declare expects for an array definition.

Use jq -r to output a string "raw", without JSON formatting, and use the @sh formatter to format your results as a string for shell consumption. Per the jq docs:
@sh:
The input is escaped suitable for use in a command-line for a POSIX shell. If the input is an array, the output will be a series of space-separated strings.
So can do e.g.
msgids=($(<test.json jq -r '.logs[]._id | @sh'))
and get the result you want.

From the jq FAQ (https://github.com/stedolan/jq/wiki/FAQ):
Q: How can a stream of JSON texts produced by jq be converted into a bash array of corresponding values?
A: One option would be to use mapfile (aka readarray), for example:
mapfile -t array <<< $(jq -c '.[]' input.json)
An alternative that might be indicative of what to do in other shells is to use read -r within a while loop. The following bash script populates an array, x, with JSON texts. The key points are the use of the -c option, and the use of the bash idiom while read -r value; do ... done < <(jq .......):
#!/bin/bash
x=()
while read -r value
do
x+=("$value")
done < <(jq -c '.[]' input.json)

++ To resolve this, we can use a very simple approach:
++ Since I am not aware of your input file, I am creating a file input.json with the following contents:
input.json:
{
"keys": ["key1","key2","key3"]
}
++ Use jq to get the value from the above file input.json:
Command: cat input.json | jq -r '.keys | @sh'
Output: 'key1' 'key2' 'key3'
Explanation: | @sh removes [ and "
++ To remove ' ' as well we use tr
command: cat input.json | jq -r '.keys | @sh' | tr -d \'
Explanation: use tr delete -d to remove '
++ To store this in a bash array we use () with `` and print it:
command:
KEYS=(`cat input.json | jq -r '.keys | @sh' | tr -d \'`)
To print all the entries of the array: echo "${KEYS[*]}"

create arrays from for loop output

I'm trying to understand what I'm doing wrong here, but can't seem to determine the cause. I would like to create a set of arrays from an output for a for loop in bash. Below is the code I have so far:
for i in `onedatastore list | grep pure02 | awk '{print $1}'`;
do
arr${i}=($(onedatastore show ${i} | sed 's/[A-Z]://' | cut -f2 -d\:)) ;
echo "Output of arr${i}: ${arr${i}[@]}" ;
done
The output for the condition is as such:
107
108
109
What I want to do is based on these unique IDs is create arrays:
arr107
arr108
arr109
The arrays will have data like such in each:
[oneadmin@opennebula/]$ arr107=($(onedatastore show 107 | sed 's/[A-Z]://' | cut -f2 -d\:))
[oneadmin@opennebula/]$ echo ${arr107[@]}
DATASTORE 107 INFORMATION 107 pure02_vm_datastore_1 oneadmin oneadmin 0 IMAGE vcenter vcenter /var/lib/one//datastores/107 FILE READY DATASTORE CAPACITY 60T 21.9T 38.1T - PERMISSIONS um- u-- --- DATASTORE TEMPLATE CLONE_TARGET="NONE" DISK_TYPE="FILE" DS_MAD="vcenter" LN_TARGET="NONE" RESTRICTED_DIRS="/" SAFE_DIRS="/var/tmp" TM_MAD="vcenter" VCENTER_CLUSTER="CLUSTER01" IMAGES
When I try this in the script section though I get output errors as such:
./test.sh: line 6: syntax error near unexpected token `$(onedatastore show ${i} | sed 's/[A-Z]://' | cut -f2 -d\:)'
I can't seem to figure out the syntax to use on this scenario.
In the end, what I want to do is compare different datastores and, based on which one has more free space, deploy VMs to it.
Hope someone can help. Thanks

You can use the eval (potentially unsafe) and declare (safer) commands:
for i in $(onedatastore list | grep pure02 | awk '{print $1}');
do
declare "arr$i=($(onedatastore show ${i} | sed 's/[A-Z]://' | cut -f2 -d\:))"
eval echo 'Output of arr$i: ${arr'"$i"'[@]}'
done

readarray or mapfile, added in bash 4.0, will read directly into an array:
while IFS= read -r i <&3; do
readarray -t "arr$i" < <(onedatastore show "$i" | sed 's/[A-Z]://' | cut -f2 -d:)
done 3< <(onedatastore list | awk '/pure02/ {print $1}')
Better, back through bash 3.x, one can use read -a to read to an array:
set -o pipefail # cause pipelines to fail if any element does
while IFS= read -r i <&3; do
IFS=$'\n' read -r -d '' -a "arr$i" \
< <(onedatastore show "$i" | sed 's/[A-Z]://' | cut -f2 -d: && printf '\0')
done 3< <(onedatastore list | awk '/pure02/ {print $1}')
Alternately, one can use namerefs (declare -n) to create an alias for an arbitrarily-named array in bash 4.3 or newer:
while IFS= read -r i <&3; do
declare -a "arr$i"
declare -n arr="arr$i"
# this is buggy: expands globs, string-splits on all characters in IFS, etc
# ...but, well, it's what the OP is asking for...
arr=( $(onedatastore show "$i" | sed 's/[A-Z]://' | cut -f2 -d:) )
done 3< <(onedatastore list | awk '/pure02/ {print $1}')
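
If the end goal is only to pick the datastore with the most free space, a single associative array keyed by datastore ID may be simpler than one dynamically named array per ID. This is a sketch of that different design, not of the approaches above. It assumes Bash 4.0+, and the here-document holds made-up numbers (ID and free terabytes as whole numbers) standing in for whatever you parse out of onedatastore show. The names free_space, best_id and best_free are hypothetical.

```bash
#!/usr/bin/env bash
declare -A free_space=()

# Hypothetical sample data: "<datastore id> <free space in TB>".
# In a real script this would come from parsing `onedatastore show` output.
while read -r id free; do
    free_space["$id"]=$free
done <<'EOF'
107 38
108 12
109 51
EOF

best_id=''
best_free=-1
for id in "${!free_space[@]}"; do
    # Integer comparison only; fractional sizes would need awk or bc.
    if (( ${free_space[$id]} > best_free )); then
        best_id=$id
        best_free=${free_space[$id]}
    fi
done

echo "Datastore with most free space: $best_id (${best_free}T)"
```

Because the values live in one array, no eval, declare trickery or namerefs are needed to look them up later.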
