Using awk to process a database

I have a directory on my computer which contains an entire database I found online for my research. This database contains thousands of files, so to do what I need I've been looking into file i/o stuff. A programmer friend suggested using bash/awk. I've written my code:
#!/usr/bin/env awk
ls -l|awk'
BEGIN {print "Now running"}
{if(NR == 17 / $1 >= 0.4 / $1 <= 2.5)
{print $1 > wavelengths.txt;
print $2 > reflectance.txt;
print $3 > standardDev.txt;}}END{print "done"}'
When I put this into my console, I'm already in the directory of the files I need to access. The data I need begins on line 17 of EVERY file. The data looks like this:
some number some number some number
some number some number some number
. . .
. . .
. . .
I want to access the data when the first column has a value of 0.4 (or approximately) and get the information up until the first column has a value of approximately 2.5. The first column represents wavelengths. I want to verify they are all the same for each file later, so I copy them into a file. The second column represents reflectance and I want this to be a separate file because later I'll take this information and build a data matrix from it. And the third column is the standard deviation of the reflectance.
The problem I am having now is that when I run this code, I get the following error: No such file or directory
Please, if anyone can tell me why I might be getting this error, or can guide me as to how to write the code for what I am trying to do... I will be so grateful.

The main problem is that you need to quote the output file names, as they are strings, not variables. Use:
print $1 > "wavelengths.txt"
instead of:
print $1 > wavelengths.txt

Excellent attempt, but you should never parse the output of ls. (You were probably looking for ls -1, not ls -l, anyway.) awk can accept a glob of files directly. For example, in the desired directory, you can run:
awk -f /path/to/script.awk *
Contents of script.awk:
BEGIN {
print "Now running"
}
NR == 17 && $1 >= 0.4 && $1 <= 2.5 {
print $1 > "wavelengths.txt"
print $2 > "reflectance.txt"
print $3 > "standardDev.txt"
}
END {
print "Done"
}
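Note that NR counts lines across all input files, so NR == 17 matches only the 17th line read overall. Since the data begins on line 17 of every file, a per-file variant using FNR, keeping the range test on the first column, is probably closer to what is wanted. A sketch (same output files, untested against the real data):
BEGIN {
print "Now running"
}
FNR >= 17 && $1 >= 0.4 && $1 <= 2.5 {
print $1 > "wavelengths.txt"
print $2 > "reflectance.txt"
print $3 > "standardDev.txt"
}
END {
print "Done"
}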

Related

Strange Memory Behavior handling TSV

I have a .tsv file and I need to figure out the frequencies of the values in a specific column and organize that data in descending order. I run a script in C which downloads a buffer and saves it to a .tsv file with a date stamp for a name, in the same directory as my code. I then open my Terminal and run the following command, per this awesome SO answer:
cat 2016-09-06T10:15:35Z.tsv | awk -F '\t' '{print $1}' * | LC_ALL=C sort | LC_ALL=C uniq -c | LC_ALL=C sort -nr > tst.tsv
To break this apart by pipes, what this does is:
cat the .tsv file to get its contents into the pipe
awk -F '\t' '{print $1}' * breaks the file's contents up by tab and pushes the contents of the first column into the pipe
LC_ALL=C sort takes the contents of the pipe and sorts them to have like-values next to one another, then pushes that back into the pipe
LC_ALL=C uniq -c takes the stuff in the pipe and figures out how many times each value occurs, then pushes that back into the pipe (e.g., Max 3 if the name Max shows up 3 times)
Finally, LC_ALL=C sort -nr sorts the stuff in the pipe again to be in descending order, and then prints it to stdout, which I pipe into a file.
Here is where things get interesting. If I do all of this in the same directory as the C code which downloaded my .tsv file to begin with, I get super wacky results which appear to be a mix of my actual .tsv file, some random corrupted garbage, and the contents of the C code which got it in the first place. Here is an example:
( count ) ( value )
1 fprintf(f, " %s; out meta qt; rel %s; out meta qt; way %s; out meta qt; >; out meta qt;", box_line, box_line, box_line);
1 fclose(f);
1 char* out_file = request_osm("cmd_tmp.txt", true);
1 bag_delete(lines_of_request);
1
1
1
1
1
1??g?
1??g?
1?
1?LXg$E
... etc. Now if you scroll up in that, you also find some correct values, from the .tsv I was parsing:
( count ) ( value )
1 312639
1 3065411
1 3065376
1 300459
1 2946076
... etc. And if I move my .tsv into its own folder, and then cd into that folder and run that same command again, it works perfectly.
( count ) ( value )
419362 452999
115770 136420
114149 1380953
72850 93290
51180 587015
45833 209668
31973 64756
31216 97928
30586 1812906
Obviously I have a functional answer to my problem - just put the file in its own folder before parsing it. But I think that this memory corruption suggests there may be some larger issue at hand I should fix now, and I'd rather get on top of it than kick it down the road with a temporary symptomatic patch, so to speak.
I should mention that my C code does use system(cmd) sometimes.
The second command is the problem:
awk -F '\t' '{print $1}' *
See the asterisk at the end? It tells awk to process every file in the current directory instead of its standard input; when awk is given file operands, the piped input is ignored entirely, which is why your C source and other files in that directory end up mixed into the output. Instead, you want awk to read standard input (the piped cat output).
Just remove the asterisk and it should work.
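With the asterisk removed (and cat dropped, since awk can read the file directly), the whole pipeline might look like this:
awk -F '\t' '{print $1}' 2016-09-06T10:15:35Z.tsv | LC_ALL=C sort | LC_ALL=C uniq -c | LC_ALL=C sort -nr > tst.tsv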

Bash: awk output to array

I'm trying to put the contents of an awk command into a bash array, however I'm having a bit of trouble.
>>test.sh
f_checkuser() {
_l="/etc/login.defs"
_p="/etc/passwd"
## get mini UID limit ##
l=$(grep "^UID_MIN" $_l)
## get max UID limit ##
l1=$(grep "^UID_MAX" $_l)
awk -F':' -v "min=${l##UID_MIN}" -v "max=${l1##UID_MAX}" '{ if ( $3 >= min && $3 <= max && $7 != "/sbin/nologin" ) print $0 }' "$_p"
}
...
Used files:
Sample File: /etc/login.defs
>>/etc/login.defs
### Min/max values for automatic uid selection in useradd
UID_MIN 1000
UID_MAX 60000
Sample File: /etc/passwd
>>/etc/passwd
root:x:0:0:root:/root:/usr/bin/zsh
daemon:x:1:1:daemon:/usr/sbin:/usr/sbin/nologin
admin:x:1000:1000:Administrator,,,:/home/admin:/bin/bash
daniel:x:1001:1001:Daniel,,,:/home/daniel:/bin/bash
The output looks like:
admin:x:1000:1000:Administrator,,,:/home/admin:/bin/bash
daniel:x:1001:1001:Daniel,,,:/home/daniel:/bin/bash
or, respectively, with print $1 instead of print $0 (awk ... print $1 }' "$_p"):
admin
daniel
Now my problem is saving the awk output in an array so I can use it as a variable.
>>test.sh
...
f_checkuser
echo "Array items and indexes:"
for index in ${!LOKAL_USERS[*]}
do
printf "%4d: %s\n" $index ${array[$index]}
done
It could/should look like this example.
Array items and indexes:
0: admin
1: daniel
Specifically, I would like to get all users of the system (not root, bin, sys, ssh, ...), excluding blocked users, into an array.
Perhaps someone has another idea to solve my problem?
Are you trying to set the output of one script to an array? Bash has a way of doing this. For example,
a=( $(seq 1 10) ); echo ${a[1]}
will populate the array a with elements 1 to 10 and will print 2, the second line generated by seq (array index starts at zero). Simply replace the contents of $(...) with your script.
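Applied to the question at hand, that might look like this (a sketch; f_checkuser prints whole passwd lines, so awk extracts the username field, and usernames contain no whitespace, so word splitting is safe here):
LOKAL_USERS=( $(f_checkuser | awk -F':' '{print $1}') )
echo ${LOKAL_USERS[0]}    # admin, with the sample /etc/passwd above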
For those coming to this years later ...
bash 4 introduced readarray (aka mapfile) exactly for this purpose.
See also Bash capturing output of awk into array
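For example, a minimal sketch assuming bash 4+ (this also avoids the temp file used in the solution below):
mapfile -t LOKAL_USERS < <(f_checkuser | awk -F':' '{print $1}')
echo "Array items and indexes:"
for index in "${!LOKAL_USERS[@]}"
do
printf "%4d: %s\n" $index "${LOKAL_USERS[$index]}"
done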
One solution that works:
array=()
f_checkuser(){
...
...
tempfile="localuser.tmp"
touch ${tempfile}
awk -F':'...'{... print $1 }' "$_p" > ${HOME}/${tempfile}
getArrayfromFile "${tempfile}"
}
getArrayfromFile() {
i=0
while read line # Read a line
do
array[i]=$line # Put it into the array
i=$(($i + 1))
done < $1
}
f_checkuser
echo "Array items and indexes:"
for index in ${!array[*]}
do
printf "%4d: %s\n" $index ${array[$index]}
done
Output:
Array items and indexes:
0: daniel
1: admin
But I would prefer to do this without a new temp file.
So, does anyone have an idea that avoids the temp file?

awk: filtering multiple files in a loop and only print a file if the number of records in that file exceeds a certain value

I have 100-200 text files from which I would like to filter rows based upon conditions being met in 2 columns. In addition to this I only want to print the resulting files if there are more than 20 rows of data in the file.
My script for the first part is:
for ID in {001..178}
do
cat FLD0${ID}.txt | awk '{ if($2 == "chr15" && $5>9) { print; } }' > FLD0${ID}.new.txt
done;
This works fine, but then I have some empty files (where neither of the conditions is met) and some files with only 1 or 2 lines, which I suspect contain low-quality data anyway. Now, after the above, I want only the files with 20 lines of data or more:
for ID in {001..178}
do
cat FLD0${ID}.txt | awk '{ if(FNR>19 && $2 == "chr15" && $5>9) { print; } }' > FLD0${ID}.new.txt
done;
The second script (with the FNR) right above seems ineffectual; I still get empty files.
How can I get this loop to work like the original above, with the extra condition of having 20 lines of data or more in each file?
Thanks,
The shell creates the output file as soon as it runs the command (the > redirection creates the file immediately). You will always get empty files this way. If you don't want that then have awk write directly to the file so it only gets created when necessary.
for ID in {001..178}
do
awk -v outfile=FLD0${ID}.new.txt 'FNR>19 && $2 == "chr15" && $5>9 { print > outfile }' FLD0${ID}.txt
done;
You could even run awk once on all the files instead of once-per-file if you wanted to.
awk 'FNR>19 && $2 == "chr15" && $5>9 { print > (FILENAME".new") }' FLD{001..178}.txt
(Slightly different output file name format for that one but that's just because I was being lazy. You could fix that with split()/etc.)
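If the real requirement is that an output file should only be created when a file has at least 20 matching rows, one approach (a sketch, assuming GNU awk for its ENDFILE block and whole-array delete; adjust the threshold as needed) is to buffer the matches per file and only write them out once the whole file has been read:
awk '
$2 == "chr15" && $5 > 9 { buf[++n] = $0 }
ENDFILE {
    if (n >= 20) {
        out = FILENAME
        sub(/\.txt$/, ".new.txt", out)
        for (i = 1; i <= n; i++) print buf[i] > out
        close(out)
    }
    n = 0
    delete buf
}' FLD{001..178}.txt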

awk using an index key over a range

I have an awk script that I normally run in parallel using an outside variable $a.
awk -v a=$a '$4>a-5 && $4<a+5 {print $10,$4}' INFILE
It would of course run much faster using an array, so I tried something like this to get it to do the same thing ($2 in LISTFILE being the search value for $4 in INFILE):
awk 'FNR==NR{a[$2]=($2-5);next} $4 in a {if ($4>a[$4] && $4<a[$4]+10) print}' LISTFILE INFILE
This of course did not work, because awk scanned until it reached the key and then started testing the if statement, so only the downstream range was found. Unfortunately this isn't a continuous list, so often there is no $2-5 value, otherwise I would use that as the key for the array.
Obviously I know how to do this using a combo of awk and bash, but I was wondering if there was an awk-only solution for this.
My first answer addresses the actual question asked and fixes the awk script. But perhaps I have missed the point. If you want speed, and don't mind making more use of your multi-core processor, you can use GNU parallel. Here's an implementation that will launch 4 jobs at a time:
awk_cmd='$4 > var - 5 && $4 < var + 5 { print $10, $4 }'
parallel -j 4 "awk -v var={} '$awk_cmd' INFILE" :::: LISTFILE
As you can see, this will read INFILE up to four times concurrently. This answer, after adjustment of the number of jobs, should provide very similar performance to the parallel implementation you describe using your shell. Therefore, you may like to split up your LISTFILE into smaller chunks and set awk_cmd to the command posted in my previous answer. There may be an optimal way to process your input, but that would largely depend on the size of INFILE and the number of elements in LISTFILE. HTH.
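A sketch of that chunking idea, assuming GNU parallel and split are available (the chunk_ prefix and the 4-line chunk size are arbitrary choices for illustration):
awk_cmd='FNR==NR { a[$2]; next } { for (i in a) if ($4 > i - 5 && $4 < i + 5) print $10, $4 }'
split -l 4 LISTFILE chunk_
parallel -j 4 "awk '$awk_cmd' {} INFILE" ::: chunk_*
Each job then reads INFILE once per chunk of LISTFILE rather than once per line of it.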
TESTING:
Create LISTFILE:
paste - - < <(seq 16) > LISTFILE
Create INFILE:
awk 'BEGIN { for (i=1; i<=9999999; i++) { print i, i, i, int(i * rand()), i, i, i, i, i, i } }' > INFILE
RESULTS:
TEST1:
time awk 'FNR==NR { a[$2]; next } { for (i in a) { if ($4 > i - 5 && $4 < i + 5) { print $10, $4 } } }' LISTFILE INFILE >/dev/null
real 0m45.198s
user 0m45.090s
sys 0m0.160s
TEST2:
time for i in $(seq 1 2 16); do awk -v var="$i" '$4 > var - 5 && $4 < var + 5 { print $10, $4 }' INFILE; done >/dev/null
real 0m55.335s
user 0m54.433s
sys 0m0.953s
TEST3:
awk_cmd='$4 > var - 5 && $4 < var + 5 { print $10, $4 }'
time parallel --colsep "\t" -j 4 "awk -v var={2} '$awk_cmd' INFILE" :::: LISTFILE >/dev/null
real 0m28.190s
user 1m42.750s
sys 0m1.757s
My reply to THIS answer:
1:
The awk1 script does not run much faster than the awk script.
A 15% time saving is pretty significant in my opinion.
I suspect because it scans the LISTFILE for every line in the INFILE.
Yes, essentially. The awk1 script loops through INFILE just once.
So number of lines scanned using the array with for (i in a) = NR(INFILE)*NR(LISTFILE).
Close. But don't forget that by using an array, we actually remove any duplicate values in LISTFILE.
This is the same number of lines you would scan by going through the INFILE repeatedly with the bash script.
This statement is therefore only true when LISTFILE contains no duplicates. Even if LISTFILE never contains any dups, having to read a single file multiple times is best avoided.
2:
Running awk and awk2 in a different folder produced different results (where my 4 min result came from versus the ~2 min result here); not sure what the difference is because they are next door in the parent directory.
What four minute result? When benchmarking this sort of thing, you should stop writing the output to disk. If your machine has some background process going on when you're running your tests, you will only end up biasing your results with the write speed of your disk. Use /dev/null instead.
3:
Awk and Awk2 are essentially the same. Any idea why awk2 runs faster?
If you remove the pipe to sort and uniq you will get a better idea of where the time difference is. You will find that doing $4 > i - 5 && $4 < i + 5 is grossly different to doing $4 < i + 5 && $4 > i - 5. If awkout.txt is the same as awk2out.txt, you are spending time processing duplicates.
4:
The second command you posted here avoids this test: $4 > i - 5 && $4 < i + 5. I wouldn't think that that alone would warrant a 90% improvement in runtime. Something smells wrong. Would you mind re-running your tests writing to /dev/null and posting the contents of LISTFILE and INFILE? If those two files are confidential, could you provide some example files with the amount of content equal to the originals?
Other thoughts:
To me, it looks like something along these lines would also work:
awk 'FNR==NR { for (i=$2-4;i<$2+5;i++) a[i]; next } $4 in a { b[$10,$4] } END { print length(b) }' LISTFILE INFILE
It looks like you just need to add the keys of LISTFILE to an array, then, as you process INFILE (line by line), test each key in your array with your 'if' statement. You can do this using the following construct or similar:
for (i in a) { print i, a[i] }
Here's some untested code that may help get you started. Notice how I have not assigned any values to my keys:
awk 'FNR==NR { a[$2]; next } { for (i in a) { if ($4 > i - 5 && $4 < i + 5) { print $10, $4 } } }' LISTFILE INFILE
Steve's answer above is the correct answer to the question. Below is a comparison of array and non-array ways to handle the problem.
I created a test program to look at two different scenarios and the results from each. The test program's code is here:
echo time for bash
time for line in `awk '{print $2}' $1` ; do awk -v a=$line '$4>a-5&&$4<a+5{print $4,$10}' $2 ; done | sort | uniq -c > bashout.txt
echo time for awk
time awk 'FNR==NR{a[$2]; next}{for (i in a) {if ($4>i-5&&$4<i+5) print $10,$4}}' $1 $2 |sort | uniq -c > awkout.txt
echo time for awk2
time awk 'FNR==NR{a[$2]; next}{for (i in a) {if ($4<i+5&&$4>i-5) print $10,$4}}' $1 $2 |sort | uniq -c > awk2out.txt
echo time for awk3
time awk '{a=$2;b=$1;for (i=a-4;i<a+5;i++) print b,i}' $1 > LIST2;time awk 'FNR==NR{a[$2];next}$4 in a{print $10,$4}' LIST2 $2 | sort | uniq -c > awk3out.txt
Here is the output:
time for bash
real 2m22.394s
user 2m15.938s
sys 0m6.409s
time for awk
real 2m1.719s
user 2m0.919s
sys 0m0.782s
time for awk2
real 1m49.146s
user 1m47.607s
sys 0m1.524s
time for awk3
real 0m0.006s
user 0m0.000s
sys 0m0.001s
real 0m12.788s
user 0m12.096s
sys 0m0.695s
4 observations/questions:
1. The awk1 script does not run much faster than the awk script. I suspect because it scans the LISTFILE for every line in the INFILE. So number of lines scanned using the array with for (i in a) = NR(INFILE)*NR(LISTFILE). This is the same number of lines you would scan by going through the INFILE repeatedly with the bash script.
2. Running awk and awk2 in a different folder produced different results (where my 4 min result came from versus the ~2 min result here); not sure what the difference is because they are next door in the parent directory.
3. Awk and Awk2 are essentially the same. Any idea why awk2 runs faster?
4. Making an expanded LIST2 from the LISTFILE and using that as the array makes the program run significantly faster, at the cost of increasing the memory footprint. Considering how small the list I'm looking at is (only 200-300 entries), that seems to be the way to go, even over doing this in parallel.
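For reference, the two-step awk3 approach from observation 4 could likely be collapsed into a single awk invocation that expands the keys in memory instead of writing LIST2 to disk (a sketch in the style of the test program above, assuming integer keys as in the test data; the awk4out.txt name is arbitrary):
time awk 'FNR==NR{for (i=$2-4;i<=$2+4;i++) a[i]; next}$4 in a{print $10,$4}' $1 $2 | sort | uniq -c > awk4out.txt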

What is the shell script instruction to divide a file with sorted lines into small files?

I have a large text file with the following format:
1 2327544589
1 3554547564
1 2323444333
2 3235434544
2 3534532222
2 4645644333
3 3424324322
3 5323243333
...
And the output should be text files whose names are suffixed with the number from the first column of the original file, each keeping the numbers from the second column in the corresponding output file, as follows:
file1.txt:
2327544589
3554547564
2323444333
file2.txt:
3235434544
3534532222
4645644333
file3.txt:
3424324322
5323243333
...
The script should run on Solaris, but I'm also having trouble with awk and with options of other commands, like -c with cut; the available toolset is very limited, so I am looking for commands commonly available on Solaris. I am not allowed to change or install anything on the system. Using a loop is not very efficient because the script takes too long with large files. So, aside from awk and loops, any suggestions?
Something like this perhaps:
$ awk 'NF>1{print $2 > "file"$1".txt"}' input
$ cat file1.txt
2327544589
3554547564
2323444333
or if you have bash available, try this:
#!/bin/bash
while read a b
do
[ -z "$a" ] && continue
echo "$b" >> "file$a.txt"
done < input
output:
$ paste file{1..3}.txt
2327544589 3235434544 3424324322
3554547564 3534532222 5323243333
2323444333 4645644333
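Given the Solaris constraint in the question, a variant of the awk answer that closes each output file once its block is finished may be safer with old awks that limit the number of simultaneously open files (a sketch; whether nawk or /usr/xpg4/bin/awk is present on the system is an assumption):
$ nawk 'NF>1{out="file"$1".txt"; if (prev!="" && out!=prev) close(prev); print $2 > out; prev=out}' input
Because the input is sorted on the first column, each output file is written in one contiguous block, so closing the previous file when the first column changes never reopens (and truncates) a file that was already written.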
