awk using an index key over a range - arrays

I have an awk script that I normally run in parallel using an outside variable $a.
awk -v a=$a '$4>a-5 && $4<a+5 {print $10,$4}' INFILE
It would of course run much faster using an array, so I tried something like this to get it to do the same thing ($2 in LISTFILE being the search value for $4 in INFILE):
awk 'FNR==NR{a[$2]=($2-5);next} $4 in a{if ($4>a[$4] && $4<a[$4]+10) print}' LISTFILE INFILE
This of course did not work, because awk scanned until it reached the key and then started testing the if statement, so only the downstream range was found. Unfortunately this isn't a continuous list, so often there is no $2-5 value, otherwise I would use that as the key for the array.
Obviously I know how to do this using a combo of awk and bash, but I was wondering if there was an awk-only solution for this.

My first answer addresses the actual question asked and fixes the awk script. But perhaps I have missed the point. If you want speed, and don't mind making more use of your multi-core processor, you can use GNU parallel. Here's an implementation that will launch 4 jobs at a time:
awk_cmd='$4 > var - 5 && $4 < var + 5 { print $10, $4 }'
parallel -j 4 "awk -v var={} '$awk_cmd' INFILE" :::: LISTFILE
As you can see, this will read INFILE up to four times concurrently. After adjusting the number of jobs, this answer should provide very similar performance to the parallel implementation you describe using your shell. Therefore, you may like to split LISTFILE into smaller chunks and set awk_cmd to the command posted in my previous answer. There may be an optimal way to process your input, but that would largely depend on the size of INFILE and the number of elements in LISTFILE. HTH.
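If you do split LISTFILE, a sketch of how that chunking might look with the array-based script from my first answer (the chunk_* names and the chunk size of 4 lines are illustrative only):
split -l 4 LISTFILE chunk_
awk_cmd='FNR==NR { a[$2]; next } { for (i in a) { if ($4 > i - 5 && $4 < i + 5) { print $10, $4 } } }'
parallel -j 4 "awk '$awk_cmd' {} INFILE" ::: chunk_*
Each job then reads INFILE once for its chunk of keys, rather than once per key.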
TESTING:
Create LISTFILE:
paste - - < <(seq 16) > LISTFILE
Create INFILE:
awk 'BEGIN { for (i=1; i<=9999999; i++) { print i, i, i, int(i * rand()), i, i, i, i, i, i } }' > INFILE
RESULTS:
TEST1:
time awk 'FNR==NR { a[$2]; next } { for (i in a) { if ($4 > i - 5 && $4 < i + 5) { print $10, $4 } } }' LISTFILE INFILE >/dev/null
real 0m45.198s
user 0m45.090s
sys 0m0.160s
TEST2:
time for i in $(seq 1 2 16); do awk -v var="$i" '$4 > var - 5 && $4 < var + 5 { print $10, $4 }' INFILE; done >/dev/null
real 0m55.335s
user 0m54.433s
sys 0m0.953s
TEST3:
awk_cmd='$4 > var - 5 && $4 < var + 5 { print $10, $4 }'
time parallel --colsep "\t" -j 4 "awk -v var={2} '$awk_cmd' INFILE" :::: LISTFILE >/dev/null
real 0m28.190s
user 1m42.750s
sys 0m1.757s
My reply to THIS answer:
1:
The awk1 script does not run much faster than the awk script.
A 15% time saving is pretty significant in my opinion.
I suspect because it scans the LISTFILE for every line in the INFILE.
Yes, essentially. The awk1 script loops through INFILE just once.
So number of lines scanned using the array with for (i in a) = NR(INFILE)*NR(LISTFILE).
Close. But don't forget that by using an array, we actually remove any duplicate values in LISTFILE.
This is the same number of lines you would scan by going through the INFILE repeatedly with the bash script.
This statement is therefore only true when LISTFILE contains no duplicates. Even if LISTFILE never contains any dups, having to read a single file multiple times is best avoided.
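A quick way to see whether duplicates are actually present (a small sketch, assuming the keys are in column 2 of LISTFILE):
awk '{print $2}' LISTFILE | sort | uniq -d | wc -l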
2:
Running awk and awk2 in a different folder produced different results (that is where my 4 min result came from, versus the ~2 min result here); not sure what the difference is, because they are next door in the parent directory.
What four minute result? When benchmarking this sort of thing, you should stop writing the output to disk. If your machine has some background process going on when you're running your tests, you will only end up biasing your results with the write speed of your disk. Use /dev/null instead.
3:
Awk and Awk2 are essentially the same. Any idea why awk2 runs faster?
If you remove the pipe to sort and uniq you will get a better idea of where the time difference is. You will find that doing $4 > i - 5 && $4 < i + 5 is grossly different to doing $4 < i + 5 && $4 > i - 5. If awkout.txt is the same as awk2out.txt, you are spending time processing duplicates.
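To isolate that effect, one could time the two orderings directly, without the sort | uniq stage (a sketch using the same two scripts as in the test program above):
time awk 'FNR==NR{a[$2]; next}{for (i in a) {if ($4>i-5&&$4<i+5) print $10,$4}}' LISTFILE INFILE > /dev/null
time awk 'FNR==NR{a[$2]; next}{for (i in a) {if ($4<i+5&&$4>i-5) print $10,$4}}' LISTFILE INFILE > /dev/null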
4:
The second command you posted here avoids this test: $4 > i - 5 && $4 < i + 5. I wouldn't think that that alone would warrant a 90% improvement in runtime. Something smells wrong. Would you mind re-running your tests writing to /dev/null and posting the contents of LISTFILE and INFILE? If those two files are confidential, could you provide some example files with the amount of content equal to the originals?
Other thoughts:
To me, it looks like something along these lines would also work:
awk 'FNR==NR { for (i=$2-4;i<$2+5;i++) a[i]; next } $4 in a { b[$10,$4] } END { print length(b) }' LISTFILE INFILE

It looks like you just need to add the keys of LISTFILE to an array, then, as you process INFILE (line by line), test each key in your array with your 'if' statement. You can do this using the following construct or similar:
for (i in a) { print i, a[i] }
Here's some untested code that may help get you started. Notice how I have not assigned any values to my keys:
awk 'FNR==NR { a[$2]; next } { for (i in a) { if ($4 > i - 5 && $4 < i + 5) { print $10, $4 } } }' LISTFILE INFILE

Steve's answer above is the correct answer to the question. Below is a comparison of array and non-array ways to handle the problem.
I created a test program to look at two different scenarios and the results from each. The test program's code is here:
echo time for bash
time for line in `awk '{print $2}' $1` ; do awk -v a=$line '$4>a-5&&$4<a+5{print $4,$10}' $2 ; done | sort | uniq -c > bashout.txt
echo time for awk
time awk 'FNR==NR{a[$2]; next}{for (i in a) {if ($4>i-5&&$4<i+5) print $10,$4}}' $1 $2 |sort | uniq -c > awkout.txt
echo time for awk2
time awk 'FNR==NR{a[$2]; next}{for (i in a) {if ($4<i+5&&$4>i-5) print $10,$4}}' $1 $2 |sort | uniq -c > awk2out.txt
echo time for awk3
time awk '{a=$2;b=$1;for (i=a-4;i<a+5;i++) print b,i}' $1 > LIST2;time awk 'FNR==NR{a[$2];next}$4 in a{print $10,$4}' LIST2 $2 | sort | uniq -c > awk3out.txt
Here is the output:
time for bash
real 2m22.394s
user 2m15.938s
sys 0m6.409s
time for awk
real 2m1.719s
user 2m0.919s
sys 0m0.782s
time for awk2
real 1m49.146s
user 1m47.607s
sys 0m1.524s
time for awk3
real 0m0.006s
user 0m0.000s
sys 0m0.001s
real 0m12.788s
user 0m12.096s
sys 0m0.695s
4 observations/questions
The awk1 script does not run much faster than the awk script. I suspect because it scans the LISTFILE for every line in the INFILE. So number of lines scanned using the array with for (i in a) = NR(INFILE)*NR(LISTFILE). This is the same number of lines you would scan by going through the INFILE repeatedly with the bash script.
Running awk and awk2 in a different folder produced different results (that is where my 4 min result came from, versus the ~2 min result here); not sure what the difference is, because they are next door in the parent directory.
Awk and Awk2 are essentially the same. Any idea why awk2 runs faster?
Making an expanded LIST2 from the LISTFILE and using that as the array makes the program run significantly faster, at the cost of increasing the memory footprint. Considering how small the list I'm looking at is (only 200-300 entries long), that seems to be the way to go, even over doing this in parallel.
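If the intermediate LIST2 file is not wanted, the expansion and the lookup can also be folded into a single awk pass; this is essentially steve's "Other thoughts" one-liner adapted to print the matches instead of counting them (an untested sketch):
awk 'FNR==NR { for (i=$2-4; i<$2+5; i++) a[i]; next } $4 in a { print $10, $4 }' LISTFILE INFILE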

Related

Check if an indexed array in bash is sparse or dense

I have a dynamically generated, indexed array in bash and want to know whether it is sparse or dense.
An array is sparse iff there are unset indices before the last entry. Otherwise the array is dense.
The check should work in every case, even for empty arrays, very big arrays (exceeding ARG_MAX when expanded), and of course arrays with arbitrary entries (for instance null entries or entries containing *, \, spaces, and linebreaks). The latter should be fairly easy, as you probably don't want to expand the values of the array anyways.
Ideally, the check should be efficient and portable.
Here are some basic test cases to check your solution.
Your check can use the hard-coded global variable name a for compatibility with older bash versions. For bash 4.3 and higher you may want to use local -n isDense_array="$1" instead, so that you can specify the array to be checked.
isDense() {
# INSERT YOUR CHECK HERE
# if array `a` is dense, return 0 (success)
# if array `a` is sparse, return any of 1-255 (failure)
}
test() {
isDense && result=dense || result=sparse
[[ "$result" = "$expected" ]] ||
echo "Test in line $BASH_LINENO failed: $expected array considered $result"
}
expected=dense
a=(); test
a=(''); test
a=(x x x); test
expected=sparse
a=([1]=x); test
a=([1]=); test
a=([0]=x [2]=x); test
a=([4]=x [5]=x [6]=x); test
a=([0]=x [3]=x [4]=x [13]=x); test
To benchmark your check, you can use
a=($(seq 9999999))
time {
isDense
unset 'a[0]'; isDense
a[0]=1; unset 'a[9999998]'; isDense
a=([0]=x [9999999999]=x); isDense
}
Approach
Non-empty, dense arrays have indices from 0 to ${#a[*]}-1. Due to the pigeonhole principle, the last index of a sparse array must be greater than or equal to ${#a[*]}.
Bash Script
To get the last index, we assume that the list of indices ${!a[*]} is in ascending order. Bash's manual does not specify any order, but (at least for bash 5 and below) the implementation guarantees this order (in the source code file array.c, search for array_keys_to_word_list).
isDense() {
[[ "${#a[*]}" = 0 || " ${!a[*]}" == *" $((${#a[*]}-1))" ]]
}
For small arrays this works very well. For huge arrays the check is a bit slow because of the ${!a[*]}. The benchmark from the question took 9.8 seconds.
Loadable Bash Builtin
The approach in this answer only needs the last index. But bash only allows extracting all indices using ${!a[*]}, which is unnecessarily slow. Internally, bash knows what the last index is. So if you wanted, you could write a loadable builtin that has access to bash's internal data structures.
Of course this is not a really practical solution. If the performance really did matter that much, you shouldn't use a bash script. Nevertheless, I wrote such a builtin for the fun of it.
Loadable bash builtin
The space and time complexity of the above builtin is independent of the size and structure of the array. Checking isdense a should be as fast as something like b=1.
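For completeness, loading and calling such a builtin would look roughly like this (a sketch only; the isdense.so filename is an assumption, and the builtin's source is not shown here):
enable -f ./isdense.so isdense    # load the compiled loadable builtin (assumed filename)
a=(x x x)
isdense a && echo dense || echo sparse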
UPDATE: (re)ran tests in an Ubuntu 20 VM (much better times than previous tests in a cygwin/bash env; times are closer to those reported by Socowi)
NOTE: I populated my array with 10 million entries ({0..9999999}).
Using the same assumption as Socowi ...
To get the last index of an array, we assume that the list of indices ${!a[*]} is in ascending order. Bash's manual does not specify any order, but (at least for bash 5 and below) the implementation guarantees this order
... we can make the same assumption about the output from typeset -p a, namely, output is ordered by index.
Finding the last index:
$ a[9999999]='[abcdef]'
$ typeset -p a | tail -c 40
9999998]="9999998" [9999999]="[abcdef]")
1st attempt using awk to strip off the last index:
$ typeset -p a | awk '{delim="[][]"; split($NF,arr,delim); print arr[2]}'
9999999
This is surprisingly slow (> 3.5 minutes for cygwin/bash) while running at 100% (single) cpu utilization and eating up ~4 GB of memory.
2nd attempt using tail -c to limit the data fed to awk:
$ typeset -p a | tail -c 40 | awk '{delim="[][]"; split($NF,arr,delim); print arr[2]}'
9999999
This came in at a (relatively) blazing speed of ~1.6 seconds (ubuntu/bash).
3rd attempt saving last array value in local variable before parsing with awk
NOTES:
as Socowi pointed out (in comments), the contents of the last element could contain several characters (spaces, brackets, single/double quotes, etc.) that make parsing quite complicated
one workaround is to save the last array value in a variable, parse the typeset -p output, then put the value back
accessing the last array value can be accomplished via array[-1] (requires bash 4.3+)
This also comes in at a (relatively) blazing speed of ~1.6 seconds (ubuntu/bash):
$ lastv="${a[-1]}"
$ a[-1]=""
$ typeset -p a | tail -c 40 | awk -F'[][]' '{print $(NF-1)}'
9999999
$ a[-1]="${lastv}"
Function:
This gives us the following isDense() function:
initial function
unset -f isDense
isDense() {
local last=$(typeset -p a | tail -c 40 | awk '{delim="[][]"; split($NF,arr,delim); print arr[2]}')
[[ "${#a[#]}" = 0 || "${#a[#]}" -ne "${last}" ]]
}
latest function (variable=last array value)
unset -f isDense
isDense() {
local lastv="${a[-1]}"
a[-1]=""
local last=$(typeset -p a | tail -c 40 | awk -F'[][]' '{print $(NF-1)}')
[[ "${#a[#]}" = 0 || "${#a[#]}" -ne "${last}" ]]
rc=$?
a[-1]="${lastv}"
return "${rc}"
}
Benchmark:
from the question (minus the 4th test - I got tired of waiting to repopulate my 10-million entry array) which means ...
keep in mind the benchmark/test is calling isDense() 3 times
ran each function a few times in ubuntu/bash and these were the best times ...
Socowi's function:
real 0m11.717s
user 0m9.486s
sys 0m1.982s
oguz ismail's function:
real 0m10.450s
user 0m9.899s
sys 0m0.546s
My initial typeset|tail-c|awk function:
real 0m4.514s
user 0m3.574s
sys 0m1.442s
Latest test (variable=last array value) with Socowi's declare|tr|tail-n function:
real 0m5.306s
user 0m4.130s
sys 0m2.670s
Latest test (variable=last array value) with original typeset|tail-c|awk function:
real 0m4.305s
user 0m3.247s
sys 0m1.761s
isDense() {
local -rn array="$1"
((${#array[@]})) || return 0
local -ai indices=("${!array[@]}")
((indices[-1] + 1 == ${#array[@]}))
}
To follow the self-contained call convention:
test() {
local result
isDense "$1" && result=dense || result=sparse
[[ "$result" = "$2" ]] ||
echo "Test in line $BASH_LINENO failed: $2 array considered $result"
}
a=(); test a dense
a=(''); test a dense
a=(x x x); test a dense
a=([1]=x); test a sparse
a=([1]=); test a sparse
a=([0]=x [2]=x); test a sparse
a=([4]=x [5]=x [6]=x); test a sparse
a=([0]=x [3]=x [4]=x [13]=x); test a sparse

How to implement awk using loop variables for the row?

I have a file with n rows and 4 columns, and I want to read the content of the 2nd and 3rd columns, row by row. I made this
awk 'NR == 2 {print $2" "$3}' coords.txt
which works for the second row, for example. However, I'd like to include that code inside a loop, so I can go row by row of coords.txt, instead of NR == 2 I'd like to use something like NR == i while going over different values of i.
I'll try to be clearer. I don't want to extract the 2nd and 3rd columns of coords.txt. I want to use every element independently. For example, I'd like to be able to implement the following code
for (i=1; i<=20; i+=1)
awk 'NR == i {print $2" "$3}' coords.txt > auxfile
func(auxfile)
end
where func represents anything I want to do with the value of the 2nd and 3rd columns of each row.
I'm using SPP, which is a mix between FORTRAN and C.
How could I do this? Thank you
It is of course inefficient to invoke awk 20 times. You'd want to push the logic into awk so you only need to parse the file once.
However, one method to pass a shell variable to awk is with the -v option:
for ((i=1; i<20; i+=2)) # for example
do
awk -v line="$i" 'NR == line {print $2, $3}' file
done
Here i is the shell variable, and line is the awk variable.
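If func can be called as an ordinary command taking the two values as arguments, a single-pass sketch along the lines of the advice above might look like this (func is the OP's hypothetical routine):
awk 'NR <= 20 { print $2, $3 }' coords.txt |
while read -r col2 col3; do
    func "$col2" "$col3"    # hypothetical command standing in for the OP's routine
done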
Something like this should work; there is no shell loop needed.
awk 'BEGIN {f="aux.aux"}
NR<21 {close(f); print $2,$3 > f; system("./mycmd2 "f)}' file
will call the command with the temp filename for the first 20 lines; the file will be overwritten at each call. Of course, if your function takes arguments or input from stdin instead of a file name, there are easier solutions.
Here ./mycmd2 is an executable which takes a filename as an argument. Not sure how you call your function but this is generic enough...
Note also that there is no error handling for the external calls.
The hideous system()-only way in awk would be something like
system("printf \047%s\\n\047 \047" $2 "\047 \047" $3 "\047 | func \047/dev/stdin\047; ");
If the func() OP mentioned can be called directly by GNU parallel or xargs, and can take the values of $2 and $3 as its $1 and $2, then OP can even make it all multi-threaded like
{mawk/mawk2/gawk} 'BEGIN { OFS=ORS="\0"; } { print $2, $3; } (NR==20) { exit }' file \
\
| { parallel -0 -N 2 -j 3 func | or | xargs -0 -n 2 -P 3 func }

Saving the output of a for loop in an array in Bash

I am trying to write a custom script to monitor the disk usage space of "n" number of servers. I have two arrays, one array consists of the actual usage and the other array consists of the allowed limit. I would like to loop through the used storage array; determine the percentage, round it off to the nearest integer and output the same on the console to be later saved in an array.
I have the following piece of code that does this:
readarray -t percentage_storage_limit <<< "$(for ((j=0; j < ${#storage_usage_array[@]}; j++));
do $(awk "BEGIN {
ac=100*${storage_usage_array[$j]}/${storage_limit_array[$j]};
i=int(ac);
print (ac-i<0.5)?i:i+1
}");
done)";
The length of both storage_usage_array and storage_limit_array are the same. An index in storage_usage_array corresponds to the storage used on a server and an index on storage_limit_array corresponds to the limit on the same server.
Although the above statement runs as expected, I see a "command not found" error as follows, which is causing these outputs to not be saved in the "percentage_storage_limit" array.
8: command not found
4: command not found
Am I missing something here? Any help would be really appreciated.
I think you're getting over-complicated syntax-wise. I would just accumulate the array within the for loop:
percentage_storage_limit=()
for ((j=0; j < ${#storage_usage_array[@]}; j++)); do
percentage_storage_limit+=( $(
awk -v u="${storage_usage_array[$j]}" -v l="${storage_limit_array[$j]}" '
BEGIN {
ac = 100 * u / l
i = int(ac)
print (ac-i < 0.5) ? i : i+1
}
'
) )
done
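For example, with some made-up numbers (hypothetical values, just to show the shape of the result):
storage_usage_array=(8 4 30)
storage_limit_array=(100 50 40)
# after running the loop above:
echo "${percentage_storage_limit[@]}"    # prints: 8 8 75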
The reason it doesn't work is that when you enclose awk in $(...), you tell bash to execute its output, so bash tries to execute 8 or 4 and complains that it didn't find such a command. Just don't enclose awk in $(...); you want to capture its output, not execute it. And it would be better to use < <(...) than <<< "$(...)":
readarray -t percentage_storage_limit < <(
for ((j=0; j < ${#storage_usage_array[@]}; j++)); do
awk "BEGIN {
ac=100*${storage_usage_array[$j]}/${storage_limit_array[$j]};
i=int(ac);
print (ac-i<0.5)?i:i+1
}";
done
)
Anyway, Glenn's answer shows the 'good' way to do this, without the readarray call.

Load field 1 and print at the END{} equivalent awk in Perl

I have the following AWK script that counts occurrences of elements in field 1 and, when it finishes reading the entire file, prints each element and the number of repetitions.
awk '{a[$1]++} END{ for(i in a){print i"-->"a[i]} }' file
I'm very new to perl and I don't know what the equivalent would be. What I have so far is below, but it has incorrect syntax. Thanks in advance.
perl -lane '$a{$F[1]}++ END{foreach $a {print $a} }' file
UPDATE:
Hi, thanks both for your answers. The real input file has 34 million lines, and the execution time differs between awk and Perl by a factor of 3 or more. Is awk faster than perl?
awk '{a[$1]++}END{for(i in a){print i"-->"a[i]}}' file #--> 2:45 approx
perl -lane '$a{$F[0]}++;END{foreach my $k (keys %a){ print "$k --> $a{$k}" } }' file #--> 7 min approx
perl -lanE'$a{$F[0]}++; END { say "$_ => $a{$_}" for keys %a }' file # --> 9 min approx
Okay, Ger, one more time :-)
I upgraded my Perl to the latest version available to me and made a file like what you described (34.5 million lines each having a 16 digit integer in the 1st and only column):
schumack@linux2 52> wc -l listbig
34521909 listbig
schumack@linux2 53> head -3 listbig
1111111111111111
3333333333333333
4444444444444444
I then ran a specialized Perl line (works for this file but is not the same as the awk line). As before I timed the runs using /usr/bin/time:
schumack@linux2 54> /usr/bin/time -f '%E %P' /usr/local/bin/perl -lne 'chomp; $a{$_}++; END{foreach $i (keys %a){print "$i-->$a{$i}"}}' listbig
5555555555555555-->4547796
1111111111111111-->9715747
9999999999999999-->826872
3333333333333333-->9922465
1212121212121212-->826872
4444444444444444-->5374669
2222222222222222-->1653744
8888888888888888-->826872
7777777777777777-->826872
0:12.20 99%
schumack@linux2 55> /usr/bin/time -f '%E %P' awk '{a[$1]++} END{ for(i in a){print i"-->"a[i]} }' listbig
1111111111111111-->9715747
2222222222222222-->1653744
3333333333333333-->9922465
4444444444444444-->5374669
5555555555555555-->4547796
1212121212121212-->826872
7777777777777777-->826872
8888888888888888-->826872
9999999999999999-->826872
0:12.61 99%
Both perl and awk ran very fast on the 34.5 million line file and were within a half second of each other.
Curious as to what type of machine / OS / Perl version you are currently using. I tested on an ASUS laptop that is about 4 years old with an Intel i7. I am using Ubuntu 16.04 and Perl v5.26.1.
Anyways, thanks for the reason to play around with Perl!
Have fun,
Ken
Input file would obviously make a difference, but Perl 5.22.1 was slightly faster than Awk 4.1.3 below on my 33.5 million line test file (12.23 vs 12.52 seconds).
schumack@daddyo2 10-02T18:25:17 54> wc -l listbig
33521910 listbig
schumack@daddyo2 10-02T18:25:58 55> /usr/bin/time -f '%E %P' awk '{a[$1]++} END{ for(i in a){print i"-->"a[i]} }' listbig
1-->9434310
2-->1605840
3-->9635040
4-->5218980
5-->4416060
7-->802920
8-->802920
9-->802920
12-->802920
0:12.52 99%
schumack@daddyo2 10-02T18:26:17 56> /usr/bin/time -f '%E %P' perl -lne '$_=~s/^(\S+) .*/$1/; $a{$_}++; END{foreach $i (keys %a){print "$i-->$a{$i}"}}' listbig
1-->9434310
5-->4416060
2-->1605840
3-->9635040
12-->802920
8-->802920
9-->802920
4-->5218980
7-->802920
0:12.23 99%
Equivalent to your awk line
perl -lanE'$a{$F[0]}++; END { say "$_ => $a{$_}" for keys %a }' file
With -a the line is broken into fields in @F, so you want $F[0] as a key in a hash %a, with the value being the counter handled by ++. The hash is iterated over keys and printed in the END block.
However, the efficiency comparison comes up. One way to improve this is to not fetch all fields on the line, done with -a, since only the first one is needed. Between two ways that come to mind
perl -nE'$a{(/(\S+)/)[0]}++; END { ... }'
and
perl -nE'$a{(split " ", $_, 2)[0]}++; END { ... }'
the split is significantly faster, with its 3.63s vs 4.41s for the regex, on an 8M-line file.
This is still behind 1.99s for your awk line. So it seems that awk is faster for this task.
Summary of my timings for an 8-million line file (average of a few runs)
awk (question) 1.99s
perl (split) 3.63s
perl (regex) 4.41s
perl (like awk) 5.61s
These timings vary over runs by a few tens of milliseconds (a few 0.01s).
This destructive method is the fastest I came up with:
perl -lne '$_=~s/\s.*//; $a{$_}++; END{foreach $i (keys %a){print "$i-->$a{$i}"}}' file
However, it still is not quite as fast as awk.
You can go through a2p
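a2p translates an awk script into Perl. A minimal sketch of how that might look (file names are illustrative; note that a2p was removed from core Perl in 5.22 and now lives on CPAN as App::a2p, so this assumes it is installed):
$ cat count.awk
{a[$1]++} END{ for(i in a){print i"-->"a[i]} }
$ a2p count.awk > count.pl
The generated Perl is verbose but can be trimmed down by hand; the one-liner below does the same job directly: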
$ cat file
1
1
2
3
3
3
$ perl -lane '$a{$F[0]}++;END{foreach my $k (keys %a){ print "$k --> $a{$k}" } }' file
1 --> 2
2 --> 1
3 --> 3
$ awk '{a[$1]++} END{ for(i in a){print i" --> "a[i]} }' file
1 --> 2
2 --> 1
3 --> 3

Split a string directly into array

Suppose I want to pass a string to awk so that once I split it (on a pattern) the substrings become the indexes (not the values) of an associative array.
Like so:
$ awk -v s="A:B:F:G" 'BEGIN{ # easy, but can these steps be combined?
split(s,temp,":") # temp[1]="A",temp[2]="B"...
for (e in temp) arr[temp[e]] #arr["A"], arr["B"]...
for (e in arr) print e
}'
A
B
F
G
Is there an awkism or gawkism that would allow the string s to be directly split into its components, with those components becoming the index entries in arr?
The reason (bigger picture) is that I want something like this (pseudo-awk):
awk -v s="1,4,55" 'BEGIN{make arr["1"], arr["4"], arr["55"] from s} $3 in arr {action}'
No, there is no better way to map separated substrings to array indices than:
split(str,tmp); for (i in tmp) arr[tmp[i]]
FWIW if you don't like that approach for doing what your final pseudo-code does:
awk -v s="1,4,55" 'BEGIN{split(s,tmp,/,/); for (i in tmp) arr[tmp[i]]} $3 in arr{action}'
then another way to get the same behavior is
awk -v s=",1,4,55," 'index(s,","$3","){action}'
Probably useless and unnecessarily complex but I'll open the game with while, match and substr:
$ awk -v s="A:B:F:G" '
BEGIN {
while(match(s,/[^:]+/)) {
a[substr(s,RSTART,RLENGTH)]
s=substr(s,RSTART+RLENGTH)
}
for(i in a)
print i
}'
A
B
F
G
I'm eager to see (if there are) some useful solutions. I tried playing around with asorts and such.
Another kind of awkism. Given this file:
cat file
1 hi
2 hello
3 bonjour
4 hola
5 konichiwa
Run it,
awk 'NR==FNR{d[$1]; next}$1 in d' RS="," <(echo "1,2,4") RS="\n" file
you get,
1 hi
2 hello
4 hola
