How to partition an app's output into several files?

So, I have a command line app that was not written by me. It outputs some strings which I am able to write to a file, like this:
anApp -input myFile.txt > myFileOutput.txt
The problem is that the output is way too large and the computer runs out of memory. Is it possible to do something like this:
anApp -input myFile.txt > i=0; for each 100000 lines; touch newFile%d $(i++); $cat 100000lines >> newFile%d $(i++); done
Because that is rather clumsy pseudocode, I am also adding an explanation:
For each 100000 lines (for instance)
Create a new file called: newFile# - where # is a number from 0 to n
Write those 100000 lines to a newly created file.
I think there may also be another option: to keep the output of anApp in a cache. However, the file is huge, it contains some results, and if it were lost... that's not something I would like to happen.

One option would be to use split:
anApp -input myFile.txt | split -l 100000 - myFileOutput
This will generate files with names like myFileOutputaa, myFileOutputab, etc.
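If your split comes from GNU coreutils, the -d option switches to numeric suffixes (myFileOutput00, myFileOutput01, ...), which some people find easier to work with; a small variation on the same command:
anApp -input myFile.txt | split -d -l 100000 - myFileOutput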
For more control over the names of the output files, you could use awk:
# every 100000 lines, close the previous output file and open the next one
NR % 100000 == 1 { close(outfile); outfile = sprintf("myFileOutput%02d", i++) }
# write every line to the current output file
{ print > outfile }
You can save that script to a file and run it like:
anApp -input myFile.txt | awk -f script.awk
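Equivalently, if you would rather not keep a separate script file, the same program can be passed inline:
anApp -input myFile.txt | awk 'NR % 100000 == 1 { close(outfile); outfile = sprintf("myFileOutput%02d", i++) } { print > outfile }'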

Related

Output results from cat into different files with names specified into an array

I would like to do cat on several files, whose names are stored in an array:
cat $input | grep -v "#" | cut -f 1,2,3
Here the content of the array:
echo $input
1.blastp 2.blastp 3.blastp 4.blastp 5.blastp 6.blastp 7.blastp 8.blastp 9.blastp 10.blastp 11.blastp 12.blastp 13.blastp 14.blastp 15.blastp 16.blastp 17.blastp 18.blastp 19.blastp 20.blastp
This works just nicely. Now I am struggling to store the results in the proper output files. I want to store the output in files whose names are stored in another array:
echo $out_in
1_pairs.tab 2_pairs.tab 3_pairs.tab 4_pairs.tab 5_pairs.tab 6_pairs.tab 7_pairs.tab 8_pairs.tab 9_pairs.tab 10_pairs.tab 11_pairs.tab 12_pairs.tab 13_pairs.tab 14_pairs.tab 15_pairs.tab 16_pairs.tab 17_pairs.tab 18_pairs.tab 19_pairs.tab 20_pairs.tab
cat $input | grep -v "#" | cut -f 1,2,3 > "$out_in"
My problem is:
When I don't use the quotes, I get an 'ambiguous redirect' error.
When I use them, a single file is created with the name:
1_pairs.tab?2_pairs.tab?3_pairs.tab?4_pairs.tab?5_pairs.tab?6_pairs.tab?7_pairs.tab?8_pairs.tab?9_pairs.tab?10_pairs.tab?11_pairs.tab?12_pairs.tab?13_pairs.tab?14_pairs.tab?15_pairs.tab?16_pairs.tab?17_pairs.tab?18_pairs.tab?19_pairs.tab?20_pairs.tab
I don't get why the input array is read with no problem, but that's not the case for the output array...
Any ideas?
Thanks a lot!
D.
You cannot redirect output that way: the output is a single stream of characters, and the redirection cannot know when to switch to the next file. You need a loop over the input files.
Assuming that the file names do not contain spaces:
for fn in $input; do
    grep -v "#" "$fn" | cut -f 1,2,3 > "${fn%%.*}_pairs.tab"
done
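If you want the exact names from $out_in instead of deriving them, here is a minimal sketch assuming input and out_in are genuine bash arrays of the same length (not just space-separated strings):
# iterate over the indices of the input array and pair each input
# file with the output name stored at the same index
for i in "${!input[@]}"; do
    grep -v "#" "${input[$i]}" | cut -f 1,2,3 > "${out_in[$i]}"
done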

bash Reading array from file

I've already read a lot of questions concerning reading and writing in ARRAY in bash. I could not find the solution to my issue.
Actually, I've got a file that contains the paths of a lot of files.
cat MyFile
> ~/toto/file1.txt
> ~/toto/file2.txt
> ~/toto/file3.txt
> ~/toto/file4.txt
> ~/toto/file5.txt
I fill an array ARRAY to contain this list:
readarray ARRAY < MyFile.txt
or
while IFS= read -r line
do
printf 'TOTO %s\n' "$line"
ARRAY+=("${line}")
done <MyFile.txt
or
for line in $(cat MyFile.txt) ;
do echo "==> $line";
ARRAY+=($line) ;
done
All those methods work well to fill the ARRAY,
echo "0: ${ARRAY[1]}"
echo "1: ${ARRAY[2]}"
> 0: ~/toto/file1.txt
> 1: ~/toto/file2.txt
This is awesome.
But my problem is that if I try to diff the files it does not work; it looks like the paths read from the file are not expanded:
diff ${ARRAY[1]} ${ARRAY[2]}
diff: ~/toto/file1.txt: No such file or directory
diff: ~/toto/file2.txt: No such file or directory
but when I print the command:
echo diff ${ARRAY[1]} ${ARRAY[2]}
diff ~/toto/file1.txt ~/toto/file2.txt
and execute that printed command by hand, I get the expected diff between the files:
diff ~/toto/file1.txt ~/toto/file2.txt
3c3
< Param = {'AAA', 'BBB'}
---
> Param = {'AAA', 'CCC'}
whereas if I fill ARRAY manually this way:
ARRAY=(~/toto/file1.txt ~/toto/file2.txt)
diff works well.
Does anyone have an idea?
Thanks a lot
Regards,
Thomas
Tilde expansion does not happen on the result of a variable substitution such as ${ARRAY[index]}, so the literal ~ gets passed to diff.
Put the full path to the files in MyFile.txt and run your code again.
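Alternatively, if the ~ entries have to stay in MyFile.txt, you can expand a leading tilde yourself before calling diff; a minimal sketch:
# replace a leading literal "~" with $HOME, because tilde expansion
# is not applied to the result of ${ARRAY[index]}
f1=${ARRAY[1]}
f2=${ARRAY[2]}
diff "${f1/#\~/$HOME}" "${f2/#\~/$HOME}"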

Strange Memory Behavior handling TSV

I have a .tsv and I need to figure out the frequencies of the values in a specific column and organize that data in descending order. I run a program written in C which downloads a buffer and saves it to a .tsv file, with a date stamp for a name, in the same directory as my code. I then open my Terminal and run the following command, per this awesome SO answer:
cat 2016-09-06T10:15:35Z.tsv | awk -F '\t' '{print $1}' * | LC_ALL=C sort | LC_ALL=C uniq -c | LC_ALL=C sort -nr > tst.tsv
To break this apart by pipes, what this does is:
cat the .tsv file to get its contents into the pipe
awk -F '\t' '{print $1}' * breaks the file's contents up by tab and pushes the contents of the first column into the pipe
LC_ALL=C sort takes the contents of the pipe and sorts them to have like-values next to one another, then pushes that back into the pipe
LC_ALL=C uniq -c takes the stuff in the pipe and figures out how many times each value occurs, then pushes that back into the pipe (e.g., Max 3, if the name Max shows up 3 times)
Finally, LC_ALL=C sort -nr sorts the stuff in the pipe again to be in descending order, and then prints it to stdout, which I pipe into a file.
Here is where things get interesting. If I do all of this in the same directory as the C code which downloaded my .tsv file to begin with, I get super wacky results which appear to be a mix of my actual .tsv file, some random corrupted garbage, and the contents of the C code that fetched it in the first place. Here is an example:
( count ) ( value )
1 fprintf(f, " %s; out meta qt; rel %s; out meta qt; way %s; out meta qt; >; out meta qt;", box_line, box_line, box_line);
1 fclose(f);
1 char* out_file = request_osm("cmd_tmp.txt", true);
1 bag_delete(lines_of_request);
1
1
1
1
1
1??g?
1??g?
1?
1?LXg$E
... etc. Now if you scroll up in that, you also find some correct values, from the .tsv I was parsing:
( count ) ( value )
1 312639
1 3065411
1 3065376
1 300459
1 2946076
... etc. And if I move my .tsv into its own folder, and then cd into that folder and run that same command again, it works perfectly.
( count ) ( value )
419362 452999
115770 136420
114149 1380953
72850 93290
51180 587015
45833 209668
31973 64756
31216 97928
30586 1812906
Obviously I have a functional answer to my problem - just put the file in its own folder before parsing it. But I think this apparent corruption suggests there may be some larger issue at hand I should fix now, and I'd rather get on top of it than kick it down the road with a temporary symptomatic patch, so to speak.
I should mention that my C code does use system(cmd) sometimes.
The second command is the problem:
awk -F '\t' '{print $1}' *
See the asterisk at the end? It tells awk to process all the files in the current directory instead of just its standard input (the output of the pipe).
Just remove the asterisk and it should work.
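With the asterisk removed (and the leading cat dropped, since awk can read the file directly), the pipeline from the question becomes:
awk -F '\t' '{print $1}' 2016-09-06T10:15:35Z.tsv | LC_ALL=C sort | LC_ALL=C uniq -c | LC_ALL=C sort -nr > tst.tsv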

awk: filtering multiple files in a loop and only printing a file if the number of records in that file exceeds a certain value

I have 100-200 text files that I would like to filter rows based upon conditions being met in 2 columns. In addition to this I only want to print the resulting files if there are more than 20 rows of data in the file.
My script for the first part is:
for ID in {001..178}
do
cat FLD0${ID}.txt | awk '{ if($2 == "chr15" && $5>9) { print; } }' > FLD0${ID}.new.txt
done;
This works fine, but then I have some empty files, as neither of those conditions is met, and some files with only 1 or 2 lines, which I suspect contain low-quality data anyway. Now, after the above, I want only the files with 20 lines of data or more:
for ID in {001..178}
do
cat FLD0${ID}.txt | awk '{ if(FNR>19 && $2 == "chr15" && $5>9) { print; } }' > FLD0${ID}.new.txt
done;
The second script (with the FNR) right above seems ineffectual; I still get empty files.
How can I get this loop to work like the original above, with the extra condition of having 20 or more lines of data in each output file?
Thanks,
The shell creates the output file as soon as it runs the command (the > redirection creates the file immediately). You will always get empty files this way. If you don't want that then have awk write directly to the file so it only gets created when necessary.
for ID in {001..178}
do
    awk -v outfile="FLD0${ID}.new.txt" 'FNR>19 && $2 == "chr15" && $5>9 { print > outfile }' "FLD0${ID}.txt"
done
You could even run awk once on all the files instead of once-per-file if you wanted to.
awk 'FNR>19 && $2 == "chr15" && $5>9 { print > (FILENAME".new") }' FLD{001..178}.txt
(Slightly different output file name format for that one but that's just because I was being lazy. You could fix that with split()/etc.)
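Note that FNR>19 only skips the first 19 lines of each input file; it does not guarantee that an output file ends up with 20 or more matching rows. If you need that condition enforced literally, one sketch (at the cost of buffering the matches in memory) is:
# collect the matching rows per input file, count them, and only
# write an output file for inputs that produced at least 20 matches
awk '$2 == "chr15" && $5 > 9 { buf[FILENAME] = buf[FILENAME] $0 ORS; n[FILENAME]++ }
     END { for (f in buf) if (n[f] >= 20) printf "%s", buf[f] > (f ".new") }' FLD{001..178}.txt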

What is the shell script instruction to divide a file with sorted lines into small files?

I have a large text file with the following format:
1 2327544589
1 3554547564
1 2323444333
2 3235434544
2 3534532222
2 4645644333
3 3424324322
3 5323243333
...
And the output should be text files whose names carry a suffix with the number from the first column of the original file, each keeping the second-column numbers that belong to it, as follows:
file1.txt:
2327544589
3554547564
2323444333
file2.txt:
3235434544
3534532222
4645644333
file3.txt:
3424324322
5323243333
...
The script should run on Solaris, but I'm also having trouble with awk and with options of other commands, like -c with cut; the environment is very limited, so I am looking for commands commonly available on Solaris. I am not allowed to change or install anything on the system. Using a loop is not very efficient because the script takes too long with large files. So, aside from awk and loops, any suggestions?
Something like this perhaps:
$ awk 'NF>1{print $2 > "file"$1".txt"}' input
$ cat file1.txt
2327544589
3554547564
2323444333
or if you have bash available, try this:
#!/bin/bash
while read a b
do
    [ -z "$a" ] && continue
    echo "$b" >> "file${a}.txt"
done < input
output:
$ paste file{1..3}.txt
2327544589 3235434544 3424324322
3554547564 3534532222 5323243333
2323444333 4645644333
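One Solaris-specific caveat, offered as an assumption to verify on your system: the default /usr/bin/awk there is the old awk and may reject the concatenated redirection target, while nawk or /usr/xpg4/bin/awk usually handle it; parenthesizing the target is also the safer spelling:
nawk 'NF>1 { print $2 > ("file" $1 ".txt") }' input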
