Picking out elements not in a 1-D array?

I have a 1-D array
x1 = [1, 2, 3, …, 10]
which is stored in the file x1.dat as one record (all on one line), separated by commas. x1.dat reads
1,2,3,4,5,...,10
And there are two arrays
array1 = [1,3] and array2 = [4,7]
(elements in array1 and array2 are some elements of the array x1).
I want to select all the elements that are in neither array1 nor array2.
The desired output will read
2,5,6,8,9,10
I tried with awk:
$awk 'BEGIN{array1 = (1,3); array2 = (4,7)} {for (i=1; i<=NF;i++) if ((!($i in a1)) && (!($i in a2))) {print $i }}' x1.dat
This does not work. Could you please help me to correct it or give a better way to do this selection?

Note: your data file is one comma-separated record, while the solution below assumes one element per line; you can convert it first (e.g. tr ',' '\n' < x1.dat), or see the adapted one-liner after the example run.
You have a couple of problems in your code.
Array assignment: you cannot assign an awk array with array1 = (1,3); awk has no array-literal syntax. (Note also that you assign array1/array2 but then test a1/a2.)
The in operator checks an array's keys (awk arrays are really hash tables), not its values; a one-line illustration follows.
It would be easier to pass array1 and array2 as a file or an input string rather than hard-coding them, but I am keeping them inline to show how to solve the problem exactly as you described it.
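For instance, a quick demonstration of the keys-vs-values point:
kent$ awk 'BEGIN{split("1,3",a,","); if (3 in a) print "found"; else print "3 is a value, not a key"}'
3 is a value, not a key
After split(), the keys are the positions 1 and 2 and the values are 1 and 3, so in never finds 3.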
A more readable version:
awk -v arr1="<yourArray1Str>" -v arr2="<yourArray2Str>" \
'BEGIN{
split(arr1,a,",");
split(arr2,b,",");
for(x in a)k[a[x]]=1;
for(x in b)k[b[x]]=1}
!k[$0]' file
An example run:
kent$ cat f
1
2
3
4
5
kent$ awk -v arr1="2,4,3" -v arr2="1,3,4" 'BEGIN{split(arr1,a,",");split(arr2,b,",");for(x in a)k[a[x]]=1;for(x in b)k[b[x]]=1}!k[$0]' f
5
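Since your x1.dat actually holds one comma-separated record, here is a sketch of the same idea adapted to that layout (assuming no stray spaces after the commas):
kent$ awk -v arr1="1,3" -v arr2="4,7" 'BEGIN{FS=","; split(arr1,a,","); split(arr2,b,","); for(x in a)k[a[x]]=1; for(x in b)k[b[x]]=1} {s=""; for(i=1;i<=NF;i++) if(!k[$i]) s=s (s==""?"":",") $i; print s}' x1.dat
2,5,6,8,9,10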

Related

BASH: Sort the array and put sorted keys into another array [duplicate]

So I am really struggling with arrays in shell scripting, especially with sorting keys by their values. Here's what I have:
declare -A array
array[0]=0
array[1]=4
array[2]=6
array[3]=1
So the array holds (0, 4, 6, 1); sorted from largest to smallest it would be (6, 4, 1, 0). Now I wonder if I could sort the keys by their values and put them in a new array like this (sort of like ranking them):
newArray[0]=2 # 2 was the key for 6
newArray[1]=1 # 1 was the key for 4
newArray[2]=3 # 3 was the key for 1
newArray[3]=0 # 0 was the key for 0
I've tried some solutions, but they were heavily hard-coded and didn't work in some situations. Any help would be appreciated.
Create a tuple of index+value.
Sort over value.
Remove values.
Read into an array.
array=(0 4 6 1)
tmp=$(
# for every index in the array
for ((i = 0; i < ${#array[@]}; ++i)); do
# output the index, space, an array value on every line
echo "$i ${array[i]}"
done |
# sort lines using Key as second column Numeric Reverse
sort -k2nr |
# using space as Delimiter, extract first Field from each line
cut -d' ' -f1
)
# Load tmp into an array separated by newlines.
readarray -t newArray <<<"$tmp"
# output
declare -p newArray
outputs:
declare -a newArray=([0]="2" [1]="1" [2]="3" [3]="0")
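If your bash predates readarray (it was added in 4.0), a minimal loop does the same load, reusing $tmp from above:
newArray=()
while IFS= read -r key; do
    newArray+=("$key")   # append each sorted index in order
done <<<"$tmp"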

Finding elements in common between two ksh or bash arrays efficiently

I am writing a Korn shell script. I have two arrays (say, arr1 and arr2), both containing strings, and I need to check which elements from arr1 are present (as whole strings or substrings) in arr2. The most intuitive solution is having nested for loops, and checking if each element from arr1 can be found in arr2 (through grep) like this:
for arr1Element in ${arr1[*]}; do
for arr2Element in ${arr2[*]}; do
# using grep to check if arr1Element is present in arr2Element
echo $arr2Element | grep $arr1Element
done
done
The issue is that arr2 has around 3000 elements, so running a nested loop takes a long time. I am wondering if there is a better way to do this in Bash.
If I were doing this in Java, I could have calculated hashes for elements in one of the arrays, and then looked for those hashes in the other array, but I don't think Bash has any functionality for doing something like this (unless I was willing to write a hash calculating function in Bash).
Any suggestions?
Since version 4.0 Bash has associative arrays:
$ declare -A elements
$ elements[hello]=world
$ echo ${elements[hello]}
world
You can use this in the same way you would a Java Map.
declare -A map
for el in "${arr1[@]}"; do
map[$el]="x"
done
for el in "${arr2[@]}"; do
if [ -n "${map[$el]}" ] ; then
echo "${el}"
fi
done
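For example, with some hypothetical sample data, the two loops print the exact-match intersection:
arr1=(alpha beta gamma)
arr2=(beta delta gamma)
# running the loops above then prints:
# beta
# gamma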
Dealing with substrings is an altogether more weighty problem, and would be a challenge in any language, short of the brute-force algorithm you're already using. You could build a binary-tree index of character sequences, but I wouldn't try that in Bash!
BashFAQ #36 describes doing set arithmetic (unions, disjoint sets, etc) in bash with comm.
Assuming your values can't contain literal newlines, the following will emit a line per item in both arr1 and arr2:
comm -12 <(printf '%s\n' "${arr1[@]}" | sort -u) \
<(printf '%s\n' "${arr2[@]}" | sort -u)
If your arrays are pre-sorted, you can remove the sorts (which will make this extremely memory- and time-efficient with large arrays, moreso than the grep-based approach).
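For instance, with small hypothetical arrays:
arr1=(apple banana cherry)
arr2=(banana cherry date)
comm -12 <(printf '%s\n' "${arr1[@]}" | sort -u) \
         <(printf '%s\n' "${arr2[@]}" | sort -u)
# banana
# cherry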
Since you're OK with using grep, and since you want to match substrings as well as full strings, one approach is to write:
printf '%s\n' "${arr2[@]}" \
| grep -o -F "$(printf '%s\n' "${arr1[@]}")"
and let grep optimize as it sees fit.
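Note that -o prints each matched substring rather than the containing arr2 element; if you want the whole matching elements instead, a sketch without -o:
printf '%s\n' "${arr2[@]}" \
    | grep -F "$(printf '%s\n' "${arr1[@]}")"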
Here's a bash/awk idea:
# some sample arrays
$ arr1=( my first string "hello world" )
$ arr2=( my last stringbean strings "well, hello world!" )
# break array elements into separate lines
$ printf '%s\n' "${arr1[@]}"
my
first
string
hello world
$ printf '%s\n' "${arr2[@]}"
my
last
stringbean
strings
well, hello world!
# use the 'printf' command output as input to our awk command
$ awk '
NR==FNR { a[NR]=$0 ; next }
{ for (i in a)
if ($0 ~ a[i]) print "array1 string {"a[i]"} is a substring of array2 string {"$0"}" }
' <( printf '%s\n' "${arr1[@]}" ) \
<( printf '%s\n' "${arr2[@]}" )
array1 string {my} is a substring of array2 string {my}
array1 string {string} is a substring of array2 string {stringbean}
array1 string {string} is a substring of array2 string {strings}
array1 string {hello world} is a substring of array2 string {well, hello world!}
NR==FNR : for file #1 only: store elements into awk array named 'a'
next : process the next line of file #1; the rest of the awk script is skipped for file #1. Then, for each line in file #2 ...
for (i in a) : for each index 'i' in array 'a' ...
if ($0 ~ a[i] ) : see if a[i] is a substring of the current line ($0) from file #2 and if so ...
print "array1... : output info about the match
A test run using the following data:
arr1 == 3300 elements
arr2 == 500 elements
When all arr2 elements have a substring/pattern match in arr1 (ie, 500 matches), total time to run is ~27 seconds ... so the repetitive looping through the array takes a toll.
Obviously (?) need to reduce the volume of repetitive actions ...
for an exact string match the comm solution by Charles Duffy makes sense (it runs against the same 3300/500 test set in about 0.5 seconds)
for a substring/pattern match I was able to get an egrep solution to run in about 5 seconds (see my other answer/post)
An egrep solution for substring/pattern matching ...
egrep -f <(printf '.*%s.*\n' "${arr1[@]}") \
<(printf '%s\n' "${arr2[@]}")
egrep -f : take patterns to search from the file designated by the -f, which in this case is ...
<(printf '.*%s.*\n' "${arr1[@]}") : convert arr1 elements into 1 pattern per line, adding a regex wildcard (.*) as prefix and suffix
<(printf '%s\n' "${arr2[@]}") : convert arr2 elements into 1 string per line
When run against a sample data set like:
arr1 == 3300 elements
arr2 == 500 elements
... with 500 matches, total run time is ~5 seconds; there's still a good bit of repetitive processing going on inside egrep, but not as bad as with my other (bash/awk) answer ... and of course not as fast as the comm solution, which eliminates the repetitive processing.
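One caveat with the egrep approach: arr1 elements are treated as regular expressions, so metacharacters in the data (dots, brackets, etc.) can match more than intended. If literal substring matches are all you need, grep -F sidesteps this, and since grep already matches anywhere in a line the .* wrappers can be dropped (a sketch):
grep -F -f <(printf '%s\n' "${arr1[@]}") \
           <(printf '%s\n' "${arr2[@]}")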

Access a bash array in awk loop

I have a bash array like
myarray=(1 2 3 4 5 ... n)
I am also reading a file containing only one line, for example:
1 2 3 4 5 ... n
I am reading it line by line into an array and printing it with:
awk 'BEGIN{FS=OFS="\t"}
NR>=1{for (i=1;i<=NF;i++) a[i]+=$i}
END{for (i=1;i<NF;i++) print OFS a[i]}' myfile.txt
myarray has the same size as a, but myarray starts at index 0 while a starts at index 1. My main problem is how to pass the bash array into my awk expression so I can use the corresponding elements inside the print loop. So what I tried was this:
awk -v array="${myarray[*]}" \
'BEGIN{FS=OFS="\t"}
NR>=1{for (i=1;i<=NF;i++) a[i]+=$i}
END{for (i=1;i<NF;i++) print OFS a[i] OFS array[i-1]}' myfile.txt
This doesn't work though; I don't get any output for myarray. My desired output in this example would be:
1 1
2 2
3 3
4 4
5 5
...
n n
To my understanding, you just need to feed the bash array to awk in the correct way: that is, by using split():
awk -v bash_array="${myarray[*]}" \
'BEGIN{split(bash_array,array); FS=OFS="\t"}
NR>=1{for (i=1;i<=NF;i++) a[i]+=$i}
END{for (i=1;i<=NF;i++) print a[i], array[i]}' file
Since array[] now lives in awk, where split() indices start at 1, you no longer need to compensate for bash's 0-based indexing.
Note also that print a,b is the same (and cleaner) as print a OFS b, since you already defined OFS in the BEGIN block.
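A quick end-to-end check with a hypothetical three-column file:
myarray=(10 20 30)
printf '1\t2\t3\n' > myfile.txt
awk -v bash_array="${myarray[*]}" \
    'BEGIN{split(bash_array,array); FS=OFS="\t"}
    NR>=1{for (i=1;i<=NF;i++) a[i]+=$i}
    END{for (i=1;i<=NF;i++) print a[i], array[i]}' myfile.txt
# prints:
# 1    10
# 2    20
# 3    30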

How to get the array dimension in one direction in an awk multidimensional array

Is there any way to get the length of just one dimension of an awk array, like in PHP?
Look at this simple example:
awk 'BEGIN{
a[1,1]=1;
a[1,2]=2;
a[2,1]=3;
a[2,3]=2;
print length(a)
}'
Here the length of the array is 4, counting each element individually; what I want is the number of rows in the array. In my real code I have n fields, and I populate the array like this:
for(i=1;i<=NF;i++)A[FNR,i]=$i
The problem is that the number of fields in my file is not fixed; it varies from row to row, so I cannot even compute it as length(array)/NF.
Is there any solution?
Use GNU awk since it has true multi-dimensional arrays:
awk 'BEGIN{
a[1][1]=1;
a[1][2]=2;
a[1][3]=3;
a[2][1]=4;
a[2][2]=5;
print length(a)
print length(a[1])
print length(a[2])
}'
2
3
2
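Applied to the reading pattern from your question, a gawk (4.0+) sketch might look like this:
gawk '{for (i=1; i<=NF; i++) A[FNR][i] = $i}   # one subarray per input row
END{
    print length(A), "rows"
    for (r in A) print "row", r, "has", length(A[r]), "fields"
}' file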
This can be achieved by counting the unique indices in one dimension of the array; try something like this:
awk '
function _get_rowlength(Arr,fnumber, i,t,c){
for(i in Arr){
split(i,sep,SUBSEP)          # combined key "r SUBSEP c" -> sep[1]=row, sep[2]=column
if(!(sep[fnumber] in t))     # first time this index appears in dimension fnumber?
{
c++
t[sep[fnumber]]              # referencing the element records the index as seen
}
}
return c;
}
BEGIN{
a[1,1]=1;
a[1,2]=2;
a[2,1]=3;
a[2,3]=2;
print _get_rowlength(a,1)
}'
Result:
$ ./tester
2
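The same function counts distinct indices in any dimension; for example, adding this line to the BEGIN block would print 3 (the distinct column indices 1, 2, 3):
print _get_rowlength(a,2)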
If you want to try this on a Solaris/SunOS system, change awk to /usr/xpg4/bin/awk, /usr/xpg6/bin/awk, or nawk.
