I have a file with n rows and 4 columns, and I want to read the content of the 2nd and 3rd columns, row by row. I made this
awk 'NR == 2 {print $2" "$3}' coords.txt
which works for the second row, for example. However, I'd like to include that code inside a loop so I can go row by row through coords.txt; instead of NR == 2, I'd like to use something like NR == i while looping over different values of i.
I'll try to be clearer. I don't want to extract the 2nd and 3rd columns of coords.txt. I want to use every element independently. For example, I'd like to be able to implement the following code
for (i=1; i<=20; i+=1)
awk 'NR == i {print $2" "$3}' coords.txt > auxfile
func(auxfile)
end
where func represents anything I want to do with the value of the 2nd and 3rd columns of each row.
I'm using SPP, which is a mix between FORTRAN and C.
How could I do this? Thank you
It is of course inefficient to invoke awk 20 times. You'd want to push the logic into awk so you only need to parse the file once.
However, one method to pass a shell variable to awk is with the -v option:
for ((i=1; i<20; i+=2)) # for example
do
awk -v line="$i" 'NR == line {print $2, $3}' file
done
Here i is the shell variable, and line is the awk variable.
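Putting that together with the loop from the question (func here is a hypothetical stand-in for whatever command consumes the two values; the exit stops awk from reading the rest of the file once the target row is printed):
for ((i=1; i<=20; i++))
do
    awk -v line="$i" 'NR == line { print $2, $3; exit }' coords.txt > auxfile
    func auxfile    # hypothetical: whatever processes the two values
done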
Something like this should work; there is no shell loop needed.
awk 'BEGIN { f = "aux.aux" }
NR < 21 { print $2, $3 > f; close(f); system("./mycmd2 " f) }' file
This will call the command with the temp filename for each of the first 20 lines; the file is overwritten at each call (closing before the system() call also guarantees the data is flushed before the command reads it). Of course, if your function takes arguments, or input from stdin instead of a file name, there are easier solutions.
Here ./mycmd2 is an executable which takes a filename as an argument. Not sure how you call your function but this is generic enough...
Note also that there is no error handling for the external calls.
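If, as mentioned above, the command can instead read the two values from stdin, a minimal sketch (still assuming the hypothetical ./mycmd2) avoids the temp file entirely:
awk 'NR < 21 { print $2, $3 | "./mycmd2"; close("./mycmd2") }' file
Closing the pipe after each print ends that invocation, so every line is handed to a fresh ./mycmd2 process.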
The hideous system()-only way in awk would be something like
system("printf \047%s\\n\047 \047" $2 "\047 \047" $3 "\047 | func \047/dev/stdin\047; ");
If the func() the OP mentioned can be called directly by GNU parallel or xargs, and can take the values of $2 and $3 as its $1 and $2, then the OP can even make it all multi-threaded, like
{mawk/mawk2/gawk} 'BEGIN { OFS = ORS = "\0" } { print $2, $3 } (NR == 20) { exit }' file \
| { parallel -0 -N 2 -j 3 func | or | xargs -0 -n 2 -P 3 func }
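Concretely, with the xargs branch of that pipeline spelled out (gawk assumed for the "\0" separators, and func still the OP's hypothetical command):
gawk 'BEGIN { OFS = ORS = "\0" } { print $2, $3 } (NR == 20) { exit }' file \
| xargs -0 -n 2 -P 3 func
Here -0 splits on the NUL separators, -n 2 hands each invocation of func one $2/$3 pair, and -P 3 runs up to three invocations in parallel.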
Sample input:
a 54 65 43
b 45 12 98
c 99 0 12
d 3 23 0
Sample output:
c,d
Basically, I want to check whether there's a value of zero in each line; if yes, print the index (a, b, c, d).
My code:
for(i=1;i<=NF;i++)if(i==0){print$1}
I got a syntax error.
Thanks.
Another approach:
$ awk '/\y0\y/{print $1}' file
c
d
\y is the word-boundary operator. Might be only in gawk.
The code needs a set of braces.
awk '{ for(i=1;i<=NF;i++)if($i==0) print $1}' filename
(The print doesn't need braces so I took those out.)
If the first field doesn't ever contain a number, maybe start the loop from 2.
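A sketch of that variant:
awk '{ for (i = 2; i <= NF; i++) if ($i == 0) print $1 }' filename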
The general form of an Awk script is a sequence of
condition { action }
pairs, where the latter needs braces around it. In the absence of a condition, an action is taken on each line, unconditionally.
To make your code work, you need to change it to:
$ awk '{for(i=1;i<=NF;i++)if($i==0) print $1}' file
c
d
You need to put the code inside a block ({} pair).
You have to use $i instead of i in the if condition; $i means the ith column.
Although it's not needed here, it's better to add a space between the command and its parameter (print $1).
And it's better to improve it a little bit:
awk '{for(i=1;i<=NF;i++)if($i==0) {print $1;next}}' file
Add next to avoid printing $1 multiple times when there is more than one 0 in the line.
Given the columns are space separated, you can do it this way too:
awk '/( |^)0( |$)/{print $1}' file
This one does not require GNU awk.
/( |^)0( |$)/ is a RegEx, and in the command it's short for $0 ~ /( |^)0( |$)/.
Here, ^ matches the beginning of the line and $ the end.
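For reference, the longhand form that the shorthand expands to:
awk '$0 ~ /( |^)0( |$)/ { print $1 }' file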
I have a big file with 1 column and 800,000 rows
Example:
123
234
...
5677
222
444
I want to transpose it into 20 numbers per line.
Example:
123,234,....
5677,
222,
444,....
I tried using while loop like this
while [ $(wc -l < list.dat) -ge 1 ]
do
cat list.dat | head -20 | awk -vORS=, '{ print $1 }'| sed 's/,$/\n/' >> sample1.dat
sed -i -e '1,20d' list.dat
done
but this is insanely slow.
Can anyone suggest a faster solution?
pr is the right tool for this, for example:
$ seq 100 | pr -20ats,
1,2,3,4,5,6,7,8,9,10,11,12,13,14,15,16,17,18,19,20
21,22,23,24,25,26,27,28,29,30,31,32,33,34,35,36,37,38,39,40
41,42,43,44,45,46,47,48,49,50,51,52,53,54,55,56,57,58,59,60
61,62,63,64,65,66,67,68,69,70,71,72,73,74,75,76,77,78,79,80
81,82,83,84,85,86,87,88,89,90,91,92,93,94,95,96,97,98,99,100
For your file, try pr -20ats, list.dat
Depending on the width of the column text, you might run into the error pr: page width too narrow. In that case, try:
$ seq 100000 100100 | pr -40ats,
pr: page width too narrow
$ seq 100000 100100 | pr -J -W79 -40ats,
100000,100001,100002,100003,100004,100005,100006,100007,100008,100009,100010,100011,100012,100013,100014,100015,100016,100017,100018,100019,100020,100021,100022,100023,100024,100025,100026,100027,100028,100029,100030,100031,100032,100033,100034,100035,100036,100037,100038,100039
100040,100041,100042,100043,100044,100045,100046,100047,100048,100049,100050,100051,100052,100053,100054,100055,100056,100057,100058,100059,100060,100061,100062,100063,100064,100065,100066,100067,100068,100069,100070,100071,100072,100073,100074,100075,100076,100077,100078,100079
100080,100081,100082,100083,100084,100085,100086,100087,100088,100089,100090,100091,100092,100093,100094,100095,100096,100097,100098,100099,100100
The formula for the -W value is (col-1)*len(delimiter) + col, where col is the number of columns required.
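For example, if the width error appeared with the original 20-column task and its 1-character "," delimiter, the formula gives (20-1)*1 + 20 = 39, so this should work:
pr -J -W39 -20ats, list.dat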
From man pr
pr - convert text files for printing
-a, --across
print columns across rather than down, used together with -COLUMN
-t, --omit-header
omit page headers and trailers; implied if PAGE_LENGTH <= 10
-s[CHAR], --separator[=CHAR]
separate columns by a single character, default for CHAR is the <TAB> character without -w and 'no char' with -w. -s[CHAR] turns off line truncation of all 3 column options (-COLUMN|-a -COLUMN|-m) except -w is set
-COLUMN, --columns=COLUMN
output COLUMN columns and print columns down, unless -a is used. Balance number of lines in the columns on each page
-J, --join-lines
merge full lines, turns off -W line truncation, no column alignment, --sep-string[=STRING] sets separators
-W, --page-width=PAGE_WIDTH
set page width to PAGE_WIDTH (72) characters always, truncate lines, except -J option is set, no interference with -S or -s
See also Why is using a shell loop to process text considered bad practice?
If you don't wish to use any other external binaries, you can refer to the SO link below, which answers a similar question in depth.
bash: combine five lines of input to each line of output
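For comparison, if awk itself is acceptable, the same transposition is a short single pass (a sketch: print a comma after each value except every 20th, and terminate a partial last line in END):
awk '{ printf "%s%s", $0, (NR % 20 ? "," : "\n") } END { if (NR % 20) print "" }' list.dat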
If you want to use sed:
sed -n '21~20 { x; s/^\n//; s/\n/, /g; p;}; 21~20! H;' list.dat
The first command
21~20 { x; s/^\n//; s/\n/, /g; p;},
is triggered at lines matching 21+(n*20), n>=0. Here, everything that was put into the hold space at the complementary lines via the second command:
21~20! H;
is processed:
x;
puts the content of the hold buffer (20 lines) in the pattern space and places the current line (21+(n*20)) in the hold buffer. In the pattern space:
s/^\n//
removes the leading newline and:
s/\n/, /g
does the desired substitution, and:
p;
prints the now 20-columned row.
After that, the following lines are again collected in the hold buffer and the process continues.
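One caveat: as written, lines collected after the last 21+(n*20) boundary (including the final group of a file whose length is an exact multiple of 20) are never printed, because no later boundary line arrives to trigger the first command. A possible fix (an untested sketch, GNU sed) adds an end-of-input branch that flushes the hold buffer the same way:
sed -n '21~20 { x; s/^\n//; s/\n/, /g; p;}; 21~20! H; $ { x; s/^\n//; s/\n/, /g; p;}' list.dat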
I have an awk script that I normally run in parallel using an outside variable $a.
awk -v a=$a '$4>a-5 && $4<a+5 {print $10,$4}' INFILE
It would of course run much faster using an array, so I tried something like this to get it to do the same thing ($2 in LISTFILE being the search value for $4 in INFILE):
awk 'FNR==NR{a[$2]=($2-5);next}$4 in a{if ($4>a[$4] && $4<a[$4]+10) print}' LISTFILE INFILE
This of course did not work, because awk scanned until it reached the key and then started testing the if statement, so only the downstream range was found. Unfortunately this isn't a continuous list, so often there is no $2-5 value; otherwise I would use that as the key for the array.
Obviously I know how to do this using a combo of awk and bash, but I was wondering if there is an awk-only solution for this.
My first answer addresses the actual question asked and fixes the awk script. But perhaps I have missed the point. If you want speed, and don't mind making more use of your multi-core processor, you can use GNU parallel. Here's an implementation that will launch 4 jobs at a time:
awk_cmd='$4 > var - 5 && $4 < var + 5 { print $10, $4 }'
parallel -j 4 "awk -v var={} '$awk_cmd' INFILE" :::: LISTFILE
As you can see, this will read INFILE up to four times concurrently. This answer, after adjustment of the number of jobs, should provide very similar performance to the parallel implementation you describe using your shell. Therefore, you may like to split up your LISTFILE into smaller chunks and set awk_cmd to the command posted in my previous answer. There may be an optimal way to process your input, but that would largely depend on the size of INFILE and the number of elements in LISTFILE. HTH.
TESTING:
Create LISTFILE:
paste - - < <(seq 16) > LISTFILE
Create INFILE:
awk 'BEGIN { for (i=1; i<=9999999; i++) { print i, i, i, int(i * rand()), i, i, i, i, i, i } }' > INFILE
RESULTS:
TEST1:
time awk 'FNR==NR { a[$2]; next } { for (i in a) { if ($4 > i - 5 && $4 < i + 5) { print $10, $4 } } }' LISTFILE INFILE >/dev/null
real 0m45.198s
user 0m45.090s
sys 0m0.160s
TEST2:
time for i in $(seq 1 2 16); do awk -v var="$i" '$4 > var - 5 && $4 < var + 5 { print $10, $4 }' INFILE; done >/dev/null
real 0m55.335s
user 0m54.433s
sys 0m0.953s
TEST3:
awk_cmd='$4 > var - 5 && $4 < var + 5 { print $10, $4 }'
time parallel --colsep "\t" -j 4 "awk -v var={2} '$awk_cmd' INFILE" :::: LISTFILE >/dev/null
real 0m28.190s
user 1m42.750s
sys 0m1.757s
My reply to THIS answer:
1:
The awk1 script does not run much faster than the awk script.
A 15% time saving is pretty significant in my opinion.
I suspect because it scans the LISTFILE for every line in the INFILE.
Yes, essentially. The awk1 script loops through INFILE just once.
So number of lines scanned using the array with for (i in a) = NR(INFILE)*NR(LISTFILE).
Close. But don't forget that by using an array, we actually remove any duplicate values in LISTFILE.
This is the same number of lines you would scan by going through the INFILE repeatedly with the bash script.
This statement is therefore only true when LISTFILE contains no duplicates. Even if LISTFILE never contains any dups, having to read a single file multiple times is best avoided.
2:
Running awk and awk2 in a different folder produced different results (where my 4 min result came from versus the ~2 min result here; not sure what the difference is because they are next door in the parent directory).
What four minute result? When benchmarking this sort of thing, you should stop writing the output to disk. If your machine has some background process going on when you're running your tests, you will only end up biasing your results with the write speed of your disk. Use /dev/null instead.
3:
Awk and Awk2 are essentially the same. Any idea why awk2 runs faster?
If you remove the pipe to sort and uniq, you will get a better idea of where the time difference is. You will find that doing $4 > i - 5 && $4 < i + 5 is grossly different from doing $4 < i + 5 && $4 > i - 5. If awkout.txt is the same as awk2out.txt, you are spending time processing duplicates.
4:
The second command you posted here avoids this test: $4 > i - 5 && $4 < i + 5. I wouldn't think that that alone would warrant a 90% improvement in runtime. Something smells wrong. Would you mind re-running your tests writing to /dev/null and posting the contents of LISTFILE and INFILE? If those two files are confidential, could you provide some example files with an amount of content equal to the originals?
Other thoughts:
To me, it looks like something along these lines would also work:
awk 'FNR==NR { for (i=$2-4;i<$2+5;i++) a[i]; next } $4 in a { b[$10,$4] } END { print length(b) }' LISTFILE INFILE
It looks like you just need to add the keys of LISTFILE to an array, then, as you process INFILE (line by line), test each key in your array with your 'if' statement. You can do this using the following construct or similar:
for (i in a) { print i, a[i] }
Here's some untested code that may help get you started. Notice how I have not assigned any values to my keys:
awk 'FNR==NR { a[$2]; next } { for (i in a) { if ($4 > i - 5 && $4 < i + 5) { print $10, $4 } } }' LISTFILE INFILE
Steve's answer above is the correct answer to the question. Below is a comparison of array and non-array ways to handle the problem.
I created a test program to look at two different scenarios and the results from each. The test program's code is here:
echo time for bash
time for line in `awk '{print $2}' $1` ; do awk -v a=$line '$4>a-5&&$4<a+5{print $4,$10}' $2 ; done | sort | uniq -c > bashout.txt
echo time for awk
time awk 'FNR==NR{a[$2]; next}{for (i in a) {if ($4>i-5&&$4<i+5) print $10,$4}}' $1 $2 |sort | uniq -c > awkout.txt
echo time for awk2
time awk 'FNR==NR{a[$2]; next}{for (i in a) {if ($4<i+5&&$4>i-5) print $10,$4}}' $1 $2 |sort | uniq -c > awk2out.txt
echo time for awk3
time awk '{a=$2;b=$1;for (i=a-4;i<a+5;i++) print b,i}' $1 > LIST2
time awk 'FNR==NR{a[$2];next}$4 in a{print $10,$4}' LIST2 $2 | sort | uniq -c > awk3out.txt
Here is the output:
time for bash
real 2m22.394s
user 2m15.938s
sys 0m6.409s
time for awk
real 2m1.719s
user 2m0.919s
sys 0m0.782s
time for awk2
real 1m49.146s
user 1m47.607s
sys 0m1.524s
time for awk3
real 0m0.006s
user 0m0.000s
sys 0m0.001s
real 0m12.788s
user 0m12.096s
sys 0m0.695s
4 observations/questions
The awk1 script does not run much faster than the awk script. I suspect because it scans the LISTFILE for every line in the INFILE. So number of lines scanned using the array with for (i in a) = NR(INFILE)*NR(LISTFILE). This is the same number of lines you would scan by going through the INFILE repeatedly with the bash script.
Running awk and awk2 in a different folder produced different results (where my 4 min result came from versus the ~2 min result here; not sure what the difference is because they are next door in the parent directory).
Awk and Awk2 are essentially the same. Any idea why awk2 runs faster?
Making an expanded LIST2 from the LISTFILE and using that as the array makes the program run significantly faster, at the cost of increasing the memory footprint. Considering how small the list I'm looking at is (only 200-300 entries long), that seems to be the way to go, even over doing this in parallel.
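If you want to avoid the intermediate LIST2 file, the expansion can be folded into a single awk invocation that builds the widened keys directly from LISTFILE on the first pass (a sketch, assuming the $4 values are integers so the expanded keys line up; essentially Steve's "Other thoughts" construct above):
awk 'FNR==NR { for (i = $2 - 4; i <= $2 + 4; i++) a[i]; next }
     $4 in a { print $10, $4 }' LISTFILE INFILE | sort | uniq -c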