I have a file with n rows and 4 columns, and I want to read the content of the 2nd and 3rd columns, row by row. I made this
awk 'NR == 2 {print $2" "$3}' coords.txt
which works for the second row, for example. However, I'd like to include that code inside a loop, so I can go row by row of coords.txt, instead of NR == 2 I'd like to use something like NR == i while going over different values of i.
I'll try to be clearer. I don't want to wxtract the 2nd and 3rd columns of coords.txt. I want to use every element idependently. For example, I'd like to be able to implement the following code
for (i=1; i<=20; i+=1)
awk 'NR == i {print $2" "$3}' coords.txt > auxfile
func(auxfile)
end
where func represents anything I want to do with the value of the 2nd and 3rd columns of each row.
I'm using SPP, which is a mix between FORTRAN and C.
How could I do this? Thank you
It is of course inefficient to invoke awk 20 times. You'd want to push the logic into awk so you only need to parse the file once.
However, one method to pass a shell variable to awk is with the -v option:
for ((i=1; i<20; i+=2)) # for example
do
awk -v line="$i" 'NR == line {print $2, $3}' file
done
Here i is the shell variable, and line is the awk variable.
something like this should work, there is no shell loop needed.
awk 'BEGIN {f="aux.aux"}
NR<21 {close(f); print $2,$3 > f; system("./mycmd2 "f)}' file
will call the command with the temp filename for the first 20 lines, the file will be overwritten at each call. Of course, if your function takes arguments or input from stdin instead of file name there are easier solution.
Here ./mycmd2 is an executable which takes a filename as an argument. Not sure how you call your function but this is generic enough...
Note also that there is no error handling for the external calls.
the hideous system( ) only way in awk would be like
system("printf \047%s\\n\047 \047" $2 "\047 \047" $3 "\047 | func \047/dev/stdin\047; ");
if the func( ) OP mentioned can be directly called by GNU parallel, or xargs, and can take in values of $2 + $3 as its $1 $2, then OP can even make it all multi-threaded like
{mawk/mawk2/gawk} 'BEGIN { OFS=ORS="\0"; } { print $2, $3; } (NR==20) { exit }' file \
\
| { parallel -0 -N 2 -j 3 func | or | xargs -0 -n 2 -P 3 func }
Related
I want to multiply all entries in a array with numbers like 3.17 * 10^-7, but Bash can't do that. I tried with awk and bc, but it doesn't work. I would be obliged if someone can help me.
Input data example (overall 4000 datafile):
TecN210500-0100.plt
TecN210500-0200.plt
TecN210500-0300.plt
TecN210500-0400.plt
......
Here is my code:
#!/bin/bash
ZS=($(find . -name "*.plt"))
i=1
Variable=$(awk "BEGIN{print 10 ** -7}")
Solutiontime=$(awk "BEGIN{print 3.17 * $Variable}")
for Dataname in ${ZS[#]}
do
Cut=${Dataname:13}
Timesteps=${Cut:0:${#Cut}-4}
Array[i]=$Timesteps
i=$((i++))
p=$((i++))
done
Amount=$p
for ((i=1;i<10;i++))
do
Array[i]=${i}00
done
for (($i=1;i<$Amount+1;i++))
do
Array[i]=$(awk "BEGIN{print ${Array[i]} * $Solutiontime}")
done
Array[0]=Solutiontime
First loop:
Extract e.i. the "0100".
Second loop:
"Delete" the leading zero -> e.i. "100"
Last loop:
Multiply with time step -> e.i. "100 * 3.17*10^-7"
Do a little parameter expansion trimming on the filename, and then let awk do the math for you.
#!/bin/bash
for f in *.plt; do
num=${f##*-} # remove the stuff before the final -
num=${num%.*} # remove the stuff before the last .
num=${num#0} # remove the left-hand zero
awk "BEGIN {print $num * 3.17 * 10**-7}"
done
Or, done entirely with awk:
#!/bin/bash
for f in *.plt; do
awk -v f="$f" 'BEGIN {gsub(/^TecN[[:digit:]]+-0?|.plt$/, "", f); print f * 3.17 * 10**-7}'
done
awk to the rescue!
awk 'BEGIN{print 3.17 * 10^-7 }'
3.17e-07
iteration 1
awk -F'[-.]' '{printf "%s %e\n",substr($1,5),$2*3.17*10^-7}' file
210500 3.170000e-05
210500 6.340000e-05
210500 9.510000e-05
210500 1.268000e-04
for the posted file names used as input.
iteration 2
If you need just the computed numbers, simple drop the first field
awk -F'[-.]' '{printf "%e\n",$2*3.17*10^-7}' file
3.170000e-05
6.340000e-05
9.510000e-05
1.268000e-04
this will be the output of the script. I strongly suggest moving whatever logic you have inside the awk script rather than working on shell level with the array.
Im trying to put the contents of a awk command in to a bash array however im having a bit of trouble.
>>test.sh
f_checkuser() {
_l="/etc/login.defs"
_p="/etc/passwd"
## get mini UID limit ##
l=$(grep "^UID_MIN" $_l)
## get max UID limit ##
l1=$(grep "^UID_MAX" $_l)
awk -F':' -v "min=${l##UID_MIN}" -v "max=${l1##UID_MAX}" '{ if ( $3 >= min && $3 <= max && $7 != "/sbin/nologin" ) print $0 }' "$_p"
}
...
Used files:
Sample File: /etc/login.defs
>>/etc/login.defs
### Min/max values for automatic uid selection in useradd
UID_MIN 1000
UID_MAX 60000
Sample File: /etc/passwd
>>/etc/passwd
root:x:0:0:root:/root:/usr/bin/zsh
daemon:x:1:1:daemon:/usr/sbin:/usr/sbin/nologin
admin:x:1000:1000:Administrator,,,:/home/admin:/bin/bash
daniel:x:1001:1001:Daniel,,,:/home/daniel:/bin/bash
The output looks like:
admin:x:1000:1000:Administrator,,,:/home/admin:/bin/bash
daniel:x:1001:1001:User,,,:/home/user:/bin/bash
respectively (awk ... print $1 }' "$_p")
admin
daniel
Now my problem is to save the awk output in an Array to use it as variable.
>>test.sh
...
f_checkuser
echo "Array items and indexes:"
for index in ${!LOKAL_USERS[*]}
do
printf "%4d: %s\n" $index ${array[$index]}
done
It could/should look like this example.
Array items and indexes:
0: admin
1: daniel
Specially i would become all Users of a System (not root,bin,sys,ssh,...) without blocked users in an array.
Perhaps someone has another idea to solve my Problem?
Are you trying to set the output of one script to an array? There is a bash has a way of doing this. For example,
a=( $(seq 1 10) ); echo ${a[1]}
will populate the array a with elements 1 to 10 and will print 2, the second line generated by seq (array index starts at zero). Simply replace the contents of $(...) with your script.
For those coming to this years later ...
bash 4 introduced readarray (aka mapfile) exactly for this purpose.
See also Bash capturing output of awk into array
One solution that works:
array=()
f_checkuser(){
...
...
tempfile="localuser.tmp"
touch ${tempfile}
awk -F':'...'{... print $1 }' "$_p" > ${HOME}/${tempfile}
getArrayfromFile "${tempfile}"
}
getArrayfromFile() {
i=0
while read line # Read a line
do
array[i]=$line # Put it into the array
i=$(($i + 1))
done < $1
}
f_checkuser
echo "Array items and indexes:"
for index in ${!array[*]}
do
printf "%4d: %s\n" $index ${array[$index]}
done
Output:
Array items and indexes:
0: daniel
1: admin
But I like more to observe without a new temp-file.
So, have someone any another idea without a temp-file?
var1=$(echo $getDate | awk '{print $1} {print $2}')
var2=$(echo $getDate | awk '{print $3} {print $4}')
var3=$(echo $getDate | awk '{print $5} {print $6}')
Instead of repeating like the code above, I need to:
loop the same command
increment the values ({print $1} {print $2})
store the value in an array
I was doing something like below but I am stuck maybe someone can help me please:
COMMAND=`find $locationA -type f | wc -l`
getDate=$(find $locationA -type f | xargs ls -lrt | awk '{print $6} {print $7}')
a=1
b=2
for i in $COMMAND
do
i=$(echo $getDate | awk '{print $a} {print $b}')
myarray+=('$i')
a=$((a+1))
b=$((b+1))
done
PS - using ksh
Problem: $COMMAND stores the number of files found in $locationA. I need to loop through the amount of files found and store their dates in an array.
I don't get the meaning of your example code (what is the 'for' loop supposed to do? What is the content of the variable COMMAND?), but in your question you ask to store something in an array, while in the code you wish to simplify, you don't use an array, but simple variables (var1, var2, ....).
If I understand your requirement correctly, your variable getDate contains a string of several words, which are separated by spaces, and you want to assign the first two words to var1, the following two words to var2, and so on. Is this correct?
Now the edited code is at least a bit clearer, though I still don't understand, why you use i as a loop variable, and overwrite it in the first statement inside the loop.
However, a few comments:
If you push '$i' into your array, you will get a literal '$' sign, followed by the letter 'i'. To add a variable i containing to numbers, you need double quotes ("$i").
I don't understand why you want to loop over the cotnent of the variable COMMAND. This variable will always hold a single number, which means that the loop will be executed exactly once.
You could use a counting loop, incrementing loop variable by 2 on each iteration. You would have to precalculate the number of iterations beforehand.
Perhaps an easier alternative, which would work in bash or in zsh (I did not try other shells) is to first turn your variable in an array,
tmparr=($(echo $getDate|fmt -w 1))
and then use a loop to collect pairs of this element:
myarray=()
for ((i=0; i<${#tmparr[*]}; i+=2))
do
myarray+=("${tmparr[$i]} ${tmparr[$((i+1))]}")
done
${myarray[0]} will hold a string consisting of the first to words from getDate, etc.
This one should work on zsh, at least with newer versions:
myarray=()
echo $g|fmt -w 1|paste -s -d " \n"|while read s; do myarray+=("$s"); done
This leaves the first pair in ${myarray[1]}, etc.
It doesn't work with bash (and old zsh versions), because these shells would execute the body of the loop in a subshell.
ADDED:
On a second thought, in zsh this one would be simpler:
myarray=("${(f)$(echo $g|fmt -w 1|paste -s -d ' \n')}")
I have an awk script that I normally run in parallel using an outside variable $a.
awk -v a=$a '$4>a-5 && $4<a+5 {print $10,$4}' INFILE
It would of course run much faster using an array so I tried something like this to get it to do the same thing ($2 in LISTFILE being the search value for $4 in INFILE
awk 'FNR==NR{a[$2]=($2-5);next}$4 in a{if ($4>a[$4] && $4<a[$4]+10 {print} LISTFILE INFILE
This of course did not work because awk scanned until it reached the key and then started the testing the if statement, so only the downstream range was found. Unfortunately this isn't a continuous list, so often there is no $2-5 value, otherwise I would use that as the key for the array.
obviously I know how to do this using a combo of awk and bash, but I was wondering if there was an awk only solution for this.
My first answer addresses the actual question asked and fixes the awk script. But perhaps I have missed the point. If you want speed, and don't mind making more use of your multi-core processor, you can use GNU parallel. Here's an implementation that will launch 4 jobs at a time:
awk_cmd='$4 > var - 5 && $4 < var + 5 { print $10, $4 }'
parallel -j 4 "awk -v var={} '$awk_cmd' INFILE" :::: LISTFILE
As you can see, this will read INFILE up to four times concurrently. This answer, after adjustment of the number of jobs, should provide very similar performance to your parallel implementation that you describe using your shell. Therefore, you may like to split up your LISTFILE into smaller chunks and set awk_cmd to the command posted in my previous answer. There may be an optimal way to process you input, but that would largely depend on the size of INFILE and the number of elements in LISTFILE. HTH.
TESTING:
Create LISTFILE:
paste - - < <(seq 16) > LISTFILE
Create INFILE:
awk 'BEGIN { for (i=1; i<=9999999; i++) { print i, i, i, int(i * rand()), i, i, i, i, i, i } }' > INFILE
RESULTS:
TEST1:
time awk 'FNR==NR { a[$2]; next } { for (i in a) { if ($4 > i - 5 && $4 < i + 5) { print $10, $4 } } }' LISTFILE INFILE >/dev/null
real 0m45.198s
user 0m45.090s
sys 0m0.160s
TEST2:
time for i in $(seq 1 2 16); do awk -v var="$i" '$4 > var - 5 && $4 < var + 5 { print $10, $4 }' INFILE; done >/dev/null
real 0m55.335s
user 0m54.433s
sys 0m0.953s
TEST3:
awk_cmd='$4 > var - 5 && $4 < var + 5 { print $10, $4 }'
time parallel --colsep "\t" -j 4 "awk -v var={2} '$awk_cmd' INFILE" :::: LISTFILE >/dev/null
real 0m28.190s
user 1m42.750s
sys 0m1.757s
My reply to THIS answer:
1:
The awk1 script does not run much faster than the awk script.
A 15% time saving is pretty significant in my opinion.
I suspect because it scans the LISTFILE for every line in the INFILE.
Yes, essentially. The awk1 script loops through INFILE just once.
So number of lines scanned using the array with for (i in a) = NR(INFILE)*NR(LISTFILE).
Close. But don't forget that by using an array, we actually remove any duplicate values in LISTFILE.
This is the same number of lines you would scan by going through the INFILE repeatedly with the bash script.
This statement is therefore only true when LISTFILE contains no duplicates. Even if LISTFILE never contains any dups, having to read a single file multiple times is best avoided.
2:
Running awk and awk2 in a different folder produced different results (where my 4 min result came from versus the ~2 min result here, not sure what the difference is because they are next door in the parent directory.
What four minute result? When benchmarking this sort of thing, you should stop writing the output to disk. If your machine has some background process going on when your running your tests, you will only end up biasing your results with the write speed of your disk. Use /dev/null instead.
3:
Awk and Awk2 are essentially the same. Any idea why awk2 runs faster?
If you remove the pipe to sort and uniq you will get a better idea of where the time difference is. You will find that doing $4 > i - 5 && $4 < i + 5 is grossly different to doing $4 < i + 5 && $4 > i - 5. If awkout.txt is the same as awkout.txt, you are spending time processing duplicates.
4:
The second command you posted here avoids this test: $4 > i - 5 && $4 < i + 5. I wouldn't think that that alone would warrant a 90% improvement in runtime. Something smells wrong. Would you mind re-running your tests writing to /dev/null and posting the contents of LISTFILE and INFILE? If those two files are confidential, could you provide some example files with the amount of content equal to the originals?
Other thoughts:
To me, it looks like something along these lines would also work:
awk 'FNR==NR { for (i=$2-4;i<$2+5;i++) a[i]; next } $4 in a { b[$10,$4] } END { print length b }' LISTFILE INFILE
It looks like you just need to add the keys of LISTFILE to an array, then, as you process INFILE (line by line), test each key in your array with your 'if' statement. You can do this using the following construct or similar:
for (i in a) { print i, a[i] }
Here's some untested code that may help get you started. Notice how I have not assigned any values to my keys:
awk 'FNR==NR { a[$2]; next } { for (i in a) { if ($4 > i - 5 && $4 < i + 5) { print $10, $4 } } }' LISTFILE INFILE
Steves answer above is the correct answer to the question. Below is a comparison of array and non array ways to handle the problem.
I created a test program to look at two different scenarios and the results from each. The test programs code is here:
echo time for bash
time for line in `awk '{print $2}' $1` ; do awk -v a=$line '$4>a-5&&$4<a+5{print $4,$10}' $2 ; done | sort | uniq -c > bashout.txt
echo time for awk
time awk 'FNR==NR{a[$2]; next}{for (i in a) {if ($4>i-5&&$4<i+5) print $10,$4}}' $1 $2 |sort | uniq -c > awkout.txt
echo time for awk2
time awk 'FNR==NR{a[$2]; next}{for (i in a) {if ($4<i+5&&$4>i-5) print $10,$4}}' $1 $2 |sort | uniq -c > awk2out.txt
echo time for awk3
time awk '{a=$2;b=$1;for (i=a-4;i<a+5;i++) print b,i}' $1 > LIST2;time awk 'FNR==NR{a[$2];next}$4 in a{print $10,$4}' LIST2 $2 | sort | uniq -c > awk3out.txt
Here is the output:
time for bash
real 2m22.394s
user 2m15.938s
sys 0m6.409s
time for awk
real 2m1.719s
user 2m0.919s
sys 0m0.782s
time for awk2
real 1m49.146s
user 1m47.607s
sys 0m1.524s
time for awk3
real 0m0.006s
user 0m0.000s
sys 0m0.001s
real 0m12.788s
user 0m12.096s
sys 0m0.695s
4 observations/questions
The awk1 script does not run much faster than the awk script. I suspect because it scans the LISTFILE for every line in the INFILE. So number of lines scanned using the array with for (i in a) = NR(INFILE)*NR(LISTFILE). This is the same number of lines you would scan by going through the INFILE repeatedly with the bash script.
Running awk and awk2 in a different folder produced different results (where my 4 min result came from versus the ~2 min result here, not sure what the difference is because they are next door in the parent directory.
Awk and Awk2 are essentially the same. Any idea why awk2 runs faster?
Making an expanded LIST2 from the LISTFILE and using that as the array makes the program run significantly faster, at the cost of increasing the memory footprint. Considering how small the list I"m looking at is (only 200-300 long) that seems to be the way to go, even over doing this in parallel.
I have a file consisting of digits. Usually, each line contains one single number. I would like to count the number of lines in the file that begin with digit '0'. If it's the case, then I would like to do some post-processing.
Although I'm able to retrieve correctly the corresponding line numbers, the total number of retrieved lines is not correct. Below, I'm posting the code that I'm using.
linesToRemove=$(awk '/^0/ { print NR; }' ${inputFile});
# linesToRemove=$(grep -n "^0" ${inputFile} | cut -d":" -f1);
linesNr=${#linesToRemove} # <- here, the error
# linesNr=${#linesToRemove[#]} # <- here, the error
if [ "${linesNr}" -gt "0" ]; then
# do something here, e.g. remove corresponding lines.
awk -v n=$linesToRemove 'NR == n {next} {print}' ${anotherFile} > ${outputFile}
fi
Also, as for the awk-based command, how could I use a shell-variable? I tried the command below, but it's not working correctly, since 'myIndex' is interpreted as a text and not as a variable.
linesToRemove=$(awk -v myIndex="$myIndex" '/^myIndex/ { print NR;}' ${inputFile});
Given the line numbers starting with 0 found in ${inputFile}, I would like to remove the corresponding lines numbers from ${anotherFile}. An example for both ${inputFile} and ${anotherFile} is given below:
// ${inputFile}
0
1
3
0
// ${anotherFile}
2.617300e+01 5.886700e+01 -1.894697e-01 1.251225e+02
5.707397e+01 2.214040e+02 8.607959e-02 1.229114e+02
1.725900e+01 1.734360e+02 -1.298053e-01 1.250318e+02
2.177940e+01 1.249531e+02 1.538853e-01 1.527150e+02
// ${outputFile}
5.707397e+01 2.214040e+02 8.607959e-02 1.229114e+02
1.725900e+01 1.734360e+02 -1.298053e-01 1.250318e+02
In the example above, I need to delete lines 0 and 3 from ${anotherFile}, given that those lines correspond to the lines starting with 0 in ${inputFile}.
If you want to count the number of lines in the file that begins with 0, then this line is wrong.
linesToRemove=$(awk '/^0/ { print NR; }' ${inputFile});
The above says to print the line number when the line start with 0, and your linesToRemove variable will contain all the line numbers, not the total number of lines. Use END{} block to capture the total. eg
linesToRemove=$(awk '/^0/ {c++}END{print c}' ${inputFile});
As for your 2nd question on using variable inside awk, use the regex operator ~. And then set your myIndex variable to include the ^ anchor
linesToRemove=$(awk -v myIndex="^$myIndex" '$0 ~ myIndex{ print NR;}' ${inputFile});
finally, if you just want to remove those lines that start with 0, then just simply remove it
awk '/^0/{next}{print $0>FILENAME}' file
If you want to remove lines from another file using what is captured in input file, here's one way
paste -d"|" inputfile anotherfile | awk '!/^0/{gsub(/^.*\|/,"");print}'
Or just one awk command
awk 'FNR==NR && /^0/{a[FNR]} NR>FNR && (!(FNR in a))' inputfile anotherfile
crude explanation: FNR==NR && /^0/ means process the first file whole line starts with 0 and put its line number into array a. NR>FNR means process the next file and if line number not in array, print the line. See the gawk documentation for what FNR,NR etc means
I think you have to do the following to assign an array:
linesToRemove=( $(awk '/^0/ { print NR; }' ${inputFile}) )
And to get the number of elements do (as you have in a commented line):
linesNr=${#linesToRemove[#]}
To remove the lines from from the file you could do something like:
sedCmd=""
for lineNr in ${linesToRemove[#]}; do
sedCmd="$sedCmd;${lineNr}d"
done
sed "$sedCmd" ${anotherFile} > ${outputFile}
In general if you do this:
linesToRemove=$(awk '/^0/ { print NR; }' ${inputFile});
instead of this:
linesToRemove=$(awk '/^0/ { print NR; }' ${inputFile});
linesNr=${#linesToRemove}
use this:
linesToRemove=$(awk '/^0/ { print NR; }' ${inputFile});
linesNr=${echo $linesToRemove|awk '{print NF}'}
POC :
cat temp.sh
#!/usr/bin/ksh
lines=$(awk '/^d/{print NR}' script.sh)
nooflines=$(echo $lines|awk '{print NF}')
echo $nooflines
torinoco!DBL:/oo_dgfqausr/test/dfqwrk12/vijay> temp.sh
8
torinoco!DBL:/oo_dgfqausr/test/dfqwrk12/vijay>
It greatly depends on the post-processing you are doing, but do you really need the actual count? Why not do something like this:
if grep ^0 $inputfile > /dev/null; then
# There is at least one line with a leading 0
:
fi
grep -v ^0 $inputfile | process-lines-without-leading-zero
grep ^0 $inputfile | process-lines-with-leading-zero
Or, even just:
if grep ^0 $inputfile | process-lines-with-leading-zero; then
# some post processing
:
fi
--EDIT--
Based on what you've said in your comment, I would recommend a different approach. If I understand you correctly, you want to read file a, looking for lines of the form ^0[0-9]*,
and then remove those line numbers from file b. Doing it one line at a time is pretty slow if the files get big. Just do:
cmd=$( grep '^0[0-9]*$' a | sed 's/$/d;/g' )
sed "$cmd" b
The assignment to cmd forms a sed command to delete the lines. Invoking sed on b will omit those lines. You'll need to redirect the sed output appropriately (perhaps to a temp file and then back to b, or just use 'sed -i' if you're using gnu sed.)
Given the large number of edits to this question, it seems easiest to start a new answer. Your problem can be solved with a simple one-liner:
$ sed "$( grep -n ^0 $inputFile | sed 's/:.*/d;/g' )" $anotherFile > $outputFile