Linux shell passing column position dynamically - loops

The question is: how do I pass a column position (e.g., $2) dynamically through a loop?
Example file temp1:
a 1 2
a 2 3
b 1 1
b 3 2
c 1 5
c 2 6
My code so far (which does not work :-)):
#!/bin/bash
twopq () {
    awk -v c1="$1" -v c2="$2" '{ if ($1==c1 && c2 == 1) {print}}' temp1 > temp2
}
twopq a $2
twopq b $3
Desired output in temp2 from the 1st call (1st col = 'a' and 2nd col = 1):
a 1 2
Desired output in temp2 from the 2nd call (1st col = 'b' and 3rd col = 1):
b 1 1
My problem is how to pass the "$" through the loop to say that I'm looking for column 2 in the first call and column 3 in the second.
Thanks for the help!

Assumptions:
1st argument is a value we're looking for in the 1st column of the input file
2nd argument is the column number we're looking for that has a value of 1
print all lines that match the search criteria
Adding a couple of lines to demonstrate multiple matches:
$ cat temp1
a 1 2
a 2 3
b 1 1
b 3 2
c 1 5
c 2 6
d 1 5 # new line
d 1 9 # new line
A few tweaks to OP's current code:
twopq () { awk -v val="$1" -v colnum="$2" '$1==val && $(colnum)==1' temp1; }
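The key move here is $(colnum): the "$" never leaves the awk script at all; the shell passes only the column number in via -v, and awk's dynamic field reference $(colnum) picks out that field at run time.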
Taking it for a test drive:
$ twopq a 2
a 1 2
$ twopq b 3
b 1 1
$ twopq d 2
d 1 5
d 1 9
NOTES:
once the output is verified, OP can update the function as needed to capture the output in temp2 (e.g., > temp2 to overwrite on each function call; >> temp2 to append on each function call)
alternatively, route the output from the function call to the output file (e.g., twopq a 2 > temp2; twopq b 3 >> temp2)
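For example, combining both notes (a minimal sketch; file names as in the question):
twopq () { awk -v val="$1" -v colnum="$2" '$1==val && $(colnum)==1' temp1; }
twopq a 2 > temp2     # overwrite temp2 with the first result set
twopq b 3 >> temp2    # append the second result set
# temp2 now contains:
#   a 1 2
#   b 1 1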

Like this:
#!/bin/bash
twopq () {
    awk -v c1="$1" '($1==c1) {
        for (i=2; i<=NF; i++)              # scan every field after the first
            if ($i == 1) { print; exit }   # print the first match, then stop
    }' temp1 | tee -a temp2
}
twopq a 2
twopq b 3
Output:
a 1 2
b 1 1
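Note that this version scans every column rather than using the second argument: tee -a appends each result to temp2 (so the lines accumulate across calls) while also echoing it to the terminal, and the exit stops awk after the first matching line of each call.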

Did you try to use $c2 inside the if condition?
awk -v c1="$1" -v c2="$2" '{ if ($1==c1 && $c2 == 1) {print}}' temp1 > temp2
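For example, dropped into the OP's function (a minimal sketch; each call overwrites temp2, as in the original):
twopq () {
    awk -v c1="$1" -v c2="$2" '{ if ($1==c1 && $c2 == 1) {print}}' temp1 > temp2
}
twopq a 2    # temp2: a 1 2
twopq b 3    # temp2: b 1 1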

Related

How to print lines with multiple associative arrays and conditions using awk

I want to print all lines from file 1 where the values of $1 and $4 are found in $1 and $4 of file 2, AND where the value in file 1 $2 is greater than or equal to the value in file 2 $2, AND where the value in file 1 $3 is less than or equal to the value in file 2 $3.
file 1
1 110201809 117658766 a
1 168095261 182305990 b
1 215456074 233436403 c
2 9465687 12905490 d
2 28765309 35235120 e
2 48958595 64702082 f
file 2
1 245371026 249210707 a
2 937388 46504962 h
2 937388 162731186 b
2 2954974 6777829 c
2 9465687 12996275 d
2 14539477 44757554 d
2 14766820 30080818 m
2 16531332 23584565 n
2 17340076 26206255 o
2 18535880 24452180 p
2 28830071 35289330 q
2 36206662 47273732 r
2 48958495 64703082 f
Desired output only prints the lines from file 1 that meet the condition.
Desired output:
2 9465687 12905490 d
2 48958595 64702082 f
I've tried the following, which gave an empty file:
awk 'NR==FNR{ a[$1,$4]= $0; b[$2] = $2 ; c[$3] = $3; next } ($1 $4 in a) && ($2 >= b[$2]) && ($3 <= c[$3])' file2 file1 > desired_output
I would do this by collecting the second and third columns in separate hashes, e.g.:
parse.awk
NR==FNR {              # first file: remember $2 and $3 keyed by ($1,$4)
    g[$1,$4] = $2
    h[$1,$4] = $3
    next
}
($1 SUBSEP $4 in g) && g[$1,$4] >= $2 && h[$1,$4] <= $3
Run it like this:
awk -f parse.awk file1 file2
Output:
2 9465687 12996275 d
2 48958495 64703082 f
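A side note on the membership test: in awk, ($1,$4) in g is shorthand for $1 SUBSEP $4 in g, because comma-separated array subscripts are joined with the built-in SUBSEP string, so the condition line could equivalently be written as:
(($1,$4) in g) && g[$1,$4] >= $2 && h[$1,$4] <= $3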

Filter column from file based on header matching a regex

I have the following file
foo_foo bar_blop baz_N toto_N lorem_blop
1 1 0 0 1
1 1 0 0 1
And I'd like to remove the columns whose header carries the _N tag (or, equivalently, select all the others).
So the output should be
foo_foo bar_blop lorem_blop
1 1 1
1 1 1
I found some answers, but none did exactly this.
I know awk can do this, but I don't understand how to write it myself (I'm not good at awk).
Thanks for the help :)
awk 'NR==1{for(i=1;i<=NF;i++)if(!($i~/_N$/)){a[i]=1;m=i}}
{for(i=1;i<=NF;i++)if(a[i])printf "%s%s",$i,(i==m?RS:FS)}' f|column -t
outputs:
foo_foo bar_blop lorem_blop
1 1 1
1 1 1
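The (i==m?RS:FS) trick works because m ends up holding the index of the last kept column, so the script prints a record separator (newline) after that field and the field separator after every other kept field.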
$ cat tst.awk
NR==1 {
for (i=1;i<=NF;i++) {
if ( (tgt == "") || ($i !~ tgt) ) {
f[++nf] = i
}
}
}
{
for (i=1; i<=nf; i++) {
printf "%s%s", $(f[i]), (i<nf?OFS:ORS)
}
}
$ awk -v tgt="_N" -f tst.awk file | column -t
foo_foo bar_blop lorem_blop
1 1 1
1 1 1
$ awk -f tst.awk file | column -t
foo_foo bar_blop baz_N toto_N lorem_blop
1 1 0 0 1
1 1 0 0 1
$ awk -v tgt="blop" -f tst.awk file | column -t
foo_foo baz_N toto_N
1 0 0
1 0 0
The main difference between this and @Kent's solution is performance, and the impact will vary based on the percentage of fields you want to print on each line.
The above, when reading the first line of the file, creates an array of the field numbers to print, and then for every line of the input file it just prints those fields in a loop. So if you wanted to print 3 out of 100 fields, this script would loop through just 3 iterations/fields on each input line.
@Kent's solution also creates an array of the field numbers to print, but then for every line of the input file it visits every field to test whether it's in that array before printing. So if you wanted to print 3 out of 100 fields, @Kent's script would loop through all 100 iterations/fields on each input line.

How to import data with markers - but excluding those markers?

When I import a matrix of data, there is a marker in the first column of the first row each time data acquisition starts, and this marker interferes with how MATLAB imports the data.
Is there a way to code this out?
for example:
>1 6 1 1 -0.00161
1 6 1 2 -0.00140
1 6 1 3 -0.00145
1 6 1 4 -0.00153
1 6 1 5 -0.00120
1 6 1 6 -0.00076
I would prefer to not manually remove the > from the data as there will be potentially thousands.
If you're on a *nix system or you have Cygwin, you can get rid of these > markers by passing the output through sed. For instance:
user@host $ cat out.txt
>0 5 3 4
0 6 4 3
>1 5 3 6
1 2 4 5
user@host $ cat out.txt | sed 's/>//g'
If you need to store the new output in a file:
user@host $ cat out.txt | sed 's/>//g' > out_without_unneeded_symbols.txt
user@host $ cat out_without_unneeded_symbols.txt
0 5 3 4
0 6 4 3
1 5 3 6
1 2 4 5
If the output comes from some program in the current dir:
user@host $ ./some_program | sed 's/>//g'
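Alternatively, if your sed supports in-place editing (GNU sed does), you can strip the markers without a second file:
user@host $ sed -i 's/>//g' out.txt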
Here is one possible implementation in MATLAB:
% read file lines as a cell array of strings
fid = fopen('file.dat', 'rt');
C = textscan(fid, '%s', 'Delimiter','');
C = C{1};
fclose(fid);
% find marker locations
markers = strncmp('>', C, 1);
% remove markers
C = regexprep(C, '^>', '');
% parse numbers into a numeric matrix
X = regexp(C, '\s+', 'split');
X = str2double(vertcat(X{:}));
The result:
% the full matrix
>> X
X =
0 5 3 4
0 6 4 3
1 5 3 6
1 2 4 5
% only the marked rows
>> X(markers,:)
ans =
0 5 3 4
1 5 3 6

Awk array with dynamic indices

Let's say I have a tab-delimited file lookup.txt:
070-031 070-291 030-031
1 2 X
2 3 1
3 4 2
4 5 3
5 6 4
6 7 5
7 8 6
8 9 7
And I have the following files with values to look up:
$ cat 030-031.txt
Line1 070-291 4
Line2 070-031 3
$ cat 070-031.txt
Line1 030-031 5
Line2 070-291 8
I would like script.awk to return:
$ script.awk 030-031.txt lookup.txt
Line1 070-291 4 2
Line2 070-031 3 2
and
$ script.awk 070-031.txt lookup.txt
Line1 030-031 5 6
Line2 070-291 8 7
The only thing I can think of is to create a separate expanded lookup file for each target, e.g.:
$ cat lookup_030-031.txt
070-031:1 X
070-031:2 1
070-031:3 2
070-031:4 3
070-031:5 4
070-031:6 5
070-031:7 6
070-031:8 7
070-291:2 X
070-291:3 1
070-291:4 2
070-291:5 3
070-291:6 4
070-291:7 5
070-291:8 6
070-291:9 7
and then
awk 'NR==FNR { a[$1]=$2;next}{print $0,a[$2":"$3]}' lookup_030-031.txt 030-031.txt
This works, but I have many more columns and approximately 10,000 rows, so I'd rather not have to generate a lookup file for each. Many thanks!
AMENDED
Glenn Jackman's answer is a perfect solution to the initial question and his second answer is more efficient. However, I forgot to stipulate that the script should handle duplicates. For instance, it should be able to handle
$ cat 030-031
070-031 3
070-031 6
and return BOTH corresponding numbers for the respective file (2 and 5, respectively). Only Glenn's first answer handles repeated lookups; his second returns the last value found.
OK, I see now. You have to read the lookup file into a big data structure; then referencing it from the individual files is easy.
$ cat script.awk
BEGIN {OFS = "\t"}
NR==1 {
    for (i=1; i<=NF; i++)
        label[i] = $i
    next
}
NR==FNR {
    for (i=1; i<=NF; i++)
        for (j=1; j<=NF; j++)
            if (i != j)
                value[label[i],$i,label[j]] = $j
    next
}
FNR==1 {
    split(FILENAME, a, /\./)
    j = a[1]
}
{
    $(NF+1) = value[$1,$2,j]
    print
}
$ awk -f script.awk lookup.txt 030-031.txt
070-291 4 2
070-031 3 2
$ awk -f script.awk lookup.txt 070-031.txt
030-031 5 6
070-291 8 7
This version is a bit more compact, and passes the filenames in your preferred order:
$ cat script.awk
BEGIN {OFS = "\t"}
NR==1 {
    split(FILENAME, a, /\./)
    dest = a[1]
}
NR==FNR {
    src[$1]=$2
    next
}
FNR==1 {
    for (i=1; i<=NF; i++)
        col[$i]=i
    next
}
{
    for (from in src)
        if ($col[from] == src[from])
            print from, src[from], $col[dest]
}
$ awk -f script.awk 030-031.txt lookup.txt
070-031 3 2
070-291 4 2
$ awk -f script.awk 070-031.txt lookup.txt
030-031 5 6
070-291 8 7
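(This compact version is the one that returns only the last value for a duplicated label, as noted in the amendment: src[$1]=$2 is a plain hash, so a repeated first column in the query file overwrites the earlier entry.)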
This works, but I have many more columns and approximately 10,000 rows, so I'd rather not have to generate a lookup file for each.
Your dataset is small enough that you have the option of keeping the lookups in memory.
In a BEGIN section, read "lookup.txt" into a two-dimensional (nested) array so that:
lookup['070-031'][4] = 3
lookup['070-291'][5] = 3
Then run through all the data files at once:
script.awk 070-031.txt 070-291.txt
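A minimal sketch of that idea (assumptions: gawk for its true nested arrays, the hard-coded "lookup.txt" name, and tab-delimited lookup data as stated in the question; a third subscript extends the two-dimensional array so one run can serve data files with different target columns):
#!/usr/bin/gawk -f
# Build lookup[src_label][src_value][dst_label] = dst_value from the table.
BEGIN {
    if ((getline hdr < "lookup.txt") <= 0)
        exit 1                                # no lookup table, nothing to do
    ncols = split(hdr, label, "\t")
    while ((getline row < "lookup.txt") > 0) {
        split(row, f, "\t")
        for (i = 1; i <= ncols; i++)
            for (j = 1; j <= ncols; j++)
                if (i != j)
                    lookup[label[i]][f[i]][label[j]] = f[j]
    }
    close("lookup.txt")
}
FNR == 1 {                                    # target column = file base name
    split(FILENAME, a, /\./)
    dest = a[1]
}
{ print $0, lookup[$2][$3][dest] }            # "Line1 070-291 4" -> appends "2"
Run it as, e.g., gawk -f script.awk 030-031.txt 070-031.txt to process both query files in one pass.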

How to output counts for list of active/inactive inputs?

I have this input file (1=active, 0=inactive)
a 1
a 0
b 1
b 1
b 0
c 0
c 0
c 0
c 0
.
.
.
And I want output like this:
X repeats active count inactive count
a 2 times 1 1
b 3 times 2 1
c 4 times 0 4
I tried:
awk -F "," '{if ($2==1) a[$1]++; } END { for (i in a); print i, a[i] }' filename
But that did not work.
How can I get the output?
Just to give you an idea, this awk should work:
awk '$2{a[$1]++; next} {b[$1]++; if (!($1 in a)) a[$1]=0} END{for (i in a) print i, a[i], b[i], (a[i]+b[i])}' file
a 1 1 2
b 2 1 3
c 0 4 4
You can format the output any way you want.
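For example, here is a sketch of the same counting logic formatted like the desired output (tab-separated; the exact header wording is an assumption):
awk '$2{a[$1]++; next} {b[$1]++; if (!($1 in a)) a[$1]=0}
     END {
         printf "%s\t%s\t%s\t%s\n", "X", "repeats", "active count", "inactive count"
         for (i in a)
             printf "%s\t%d times\t%d\t%d\n", i, a[i]+b[i], a[i], b[i]
     }' file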
You can try
awk -f r.awk input.txt
where input.txt is your data file, and r.awk is:
{
    X[$1]++
    if ($2) a[$1]++
    else ia[$1]++
}
END {
    printf "X\tRepeat\tActive\tInactive\n"
    for (i in X) {
        printf "%s\t%d\t%d\t%d\n", i, X[i], a[i], ia[i]
    }
}
This is GNU awk
awk '{a[$1]++; if ($2!=0) {b[$1]++;c[$1]+=0} else {c[$1]++;b[$1]+=0}}END {for (i in a) print i, a[i], b[i], c[i]}' file
Here is another simple way to do it with awk
awk '{a[$1]++;b[$1]+=$2} END { for (i in a) print i,a[i],b[i],a[i]-b[i]}' file
a 2 1 1
b 3 2 1
c 4 0 4
No test is needed: summing column $2 gives the number of active hits, and subtracting that sum from the total count gives the inactive count.
awk '
    { repeats[$1]++; counts[$1,$2]++ }
    END {
        for (key in repeats)
            print key, repeats[key], counts[key,1]+0, counts[key,0]+0
    }
' file
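The +0 matters here: if a key never appears with a given state, counts[key,1] (or counts[key,0]) is unset, and adding 0 makes it print as 0 instead of an empty string.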
