Strange Memory Behavior Handling TSV - C

I have a .tsv file and I need to figure out the frequencies of the values in a specific column and organize that data in descending order. I run a script in C which downloads a buffer and saves it to a .tsv file, with a date stamp for a name, in the same directory as my code. I then open my Terminal and run the following command, per this awesome SO answer:
cat 2016-09-06T10:15:35Z.tsv | awk -F '\t' '{print $1}' * | LC_ALL=C sort | LC_ALL=C uniq -c | LC_ALL=C sort -nr > tst.tsv
To break this apart by pipes, what this does is:
cat the .tsv file to get its contents into the pipe
awk -F '\t' '{print $1}' * breaks the file's contents up by tab and pushes the contents of the first column into the pipe
LC_ALL=C sort takes the contents of the pipe and sorts them to have like-values next to one another, then pushes that back into the pipe
LC_ALL=C uniq -c takes the stuff in the pipe and figures out how many times each value occurs, then pushes that back into the pipe (e.g., Max 3, if the name Max shows up 3 times)
Finally, LC_ALL=C sort -nr sorts the stuff in the pipe again, this time in descending numeric order, and then prints it to stdout, which I pipe into a file. (A tiny example of these counting stages follows this list.)
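To make that concrete, here is a minimal illustration of the sort | uniq -c | sort -nr stages with made-up input (the exact spacing of the uniq -c counts may differ):
$ printf 'Max\nAnna\nMax\nMax\n' | LC_ALL=C sort | LC_ALL=C uniq -c | LC_ALL=C sort -nr
      3 Max
      1 Anna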
Here is where things get interesting. If I do all of this in the same directory as the C code which downloaded my .tsv file to begin with, I get super wacky results which appear to be a mix of my actual .tsv file, some random corrupted garbage, and the contents of the C code which got it in the first place. Here is an example:
( count ) ( value )
1 fprintf(f, " %s; out meta qt; rel %s; out meta qt; way %s; out meta qt; >; out meta qt;", box_line, box_line, box_line);
1 fclose(f);
1 char* out_file = request_osm("cmd_tmp.txt", true);
1 bag_delete(lines_of_request);
1
1
1
1
1
1??g?
1??g?
1?
1?LXg$E
... etc. Now if you scroll up in that, you also find some correct values, from the .tsv I was parsing:
( count ) ( value )
1 312639
1 3065411
1 3065376
1 300459
1 2946076
... etc. And if I move my .tsv into its own folder, and then cd into that folder and run that same command again, it works perfectly.
( count ) ( value )
419362 452999
115770 136420
114149 1380953
72850 93290
51180 587015
45833 209668
31973 64756
31216 97928
30586 1812906
Obviously I have a functional answer to my problem - just put the file in its own folder before parsing it. But I think that this memory corruption suggests there may be some larger issue at hand I should fix now, and I'd rather get on top of it than kick it down the road with a temporary symptomatic patch, so to speak.
I should mention that my C code does use system(cmd) sometimes.

The second command is the problem:
awk -F '\t' '{print $1}' *
See the asterisk at the end? It tells awk to process every file in the current directory instead of reading the standard input coming from the pipe.
Just remove the asterisk and it should work.
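For reference, the corrected pipeline is the same command with only the awk stage changed:
cat 2016-09-06T10:15:35Z.tsv | awk -F '\t' '{print $1}' | LC_ALL=C sort | LC_ALL=C uniq -c | LC_ALL=C sort -nr > tst.tsv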

Related

Sed: Better way to address the n-th line where n are elements of an array

We know that the sed command loops over each line of a file and, for each line, loops over the given command list and does something. But when the file is extremely large, the time and resource cost of repeating that work can be terrible.
Suppose that I have an array of line numbers which I want to use as addresses to delete or print with the sed command (e.g. A=(20000 30000 50000 90000)), and there is a VERY LARGE object file.
The easiest way may be:
(Remark by #John1024: careful about the line number changes for each loop)
( for NL in ${A[@]}; do sed "$NL d" $very_large_file; done; )>.temp_file;
cp .temp_file $very_large_file; rm .temp_file
The problem with the code above is that, for each line number in the array, it has to loop over the whole file.
To avoid this, one can:
#COMM=`echo "${A[@]}" | sed 's/\s/d;/g;s/$/d/'`;
#sed -i "$COMM" $very_large_file;
#Edited: Better with direct parameter expansion:
sed -i "${A[#]/%/d;}" $very_large_file;
It first prints the array and replaces each SPACE and the END_OF_LINE with sed's d command, so that the string looks like "20000d;30000d;50000d;90000d"; on the second line, we treat this string as sed's command list. The result is that this code loops over the file only once.
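For illustration, a quick sanity check (hypothetical array values) of the command string being built:
A=(20000 30000 50000 90000)
echo "${A[@]}" | sed 's/\s/d;/g;s/$/d/'
# 20000d;30000d;50000d;90000d
echo "${A[@]/%/d;}"
# 20000d; 30000d; 50000d; 90000d; (the expansion appends "d;" to every element)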
Moreover, for the in-place operation (argument -i), one cannot quit early with q even once the greatest line number of interest has passed, because if one did, the lines after that line (e.g. 90001+) would disappear (it seems that the in-place operation just overwrites the file with stdout).
Better ideas?
(Reply to #user unknown:) I think it could be even more efficient if we managed to "quit" the loop once all indexed lines have passed. We can't, using sed -i, for the aforementioned reasons. Printing each line to a file costs more time than copying a file (e.g. cat file1 > file2 vs. cp file1 file2). We may benefit from this idea using other methods or tools. That is what I am hoping for.
PS: The points of this question are "Lines location" and "Efficiency"; the "delete lines" operation is just an example. Real tasks involve much more: appending/inserting/substituting, field splitting, case judgments followed by reads from/writes to files, calculations, etc.
In other words, it may involve all kinds of operations, with or without sub-shells, and with care about variable passing, ... so the tools used should let me do line processing, and the problem is how to get to the lines of interest and perform all kinds of operations on them.
Any comments are appreciated.
First make a copy to a testfile for checking the results.
You want to sort the line numbers, highest first.
echo "${a[@]}" | sed 's/\s/\n/g' | sort -rn
You can feed commands into ed using printf:
printf "%s\n" "command1" "command2" w q testfile | ed -s testfile
Combine these
printf "%s\n" $(echo "${a[#]}" | sed 's/\s/\n/g' | sort -rn | sed 's/$/d/') w q |
ed -s testfile
Edit (tx #Ed_Morton):
This can be written in fewer steps with
printf "%s\n" $(printf '%sd\n' "${a[#]}" | sort -rn ) w q | ed -s testfile
I cannot remove the sort, because each delete instruction counts line numbers from 1.
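To see why the descending order matters, here is the ed script that gets generated for a hypothetical a=(3 7 8): line 8 is deleted before lines 7 and 3, so earlier deletions never shift the targets still to be removed.
a=(3 7 8)
printf '%sd\n' "${a[@]}" | sort -rn
# 8d
# 7d
# 3d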
I tried to find a command for editing the file without redirecting to another, but I started with the remark that you should make a copy. I have no choice: I have to upvote the straightforward awk solution that doesn't need a sort.
sed is for doing s/old/new, that is all, and when you add a shell loop to the mix you've really gone off the rails (see https://unix.stackexchange.com/q/169716/133219). The way to delete lines whose numbers are stored in an array is (using seq to generate input since no sample input/output was provided in the question):
$ a=( 3 7 8 )
$ seq 10 |
awk -v a="${a[*]}" 'BEGIN{split(a,tmp); for (i in tmp) nrs[tmp[i]]} !(NR in nrs)'
1
2
4
5
6
9
10
and if you wanted to stop processing with awk once the last target line has been deleted and let tail finish the job, then you could figure out the max value in the array up front and then run awk on just the part up to that last target line:
max=$( printf '%s\n' "${a[@]}" | sort -rn | head -1 )
head -"$max" file | awk '...' file > out
tail +"$((max+1))" file >> out
idk if that'd really be any faster than just letting awk process the whole file since awk is very efficient, especially when you're not referencing any fields and so it doesn't do any field splitting, but you could give it a try.
You could generate an intermediate sed command file from your lines.
echo ${A[@]} | tr ' ' '\n' | sort -n > lines_to_delete
min=`head -1 lines_to_delete`
max=`tail -1 lines_to_delete`
# skip to first and from last line, delete the others
sed -i -e 1d -e '$d' -e 's#$#d#' lines_to_delete
head -${min} input > output
sed -f lines_to_delete input >> output
tail -${max} input >> output
mv output input

awk: filtering multiple files in a loop and only print a file if the number of records in that file exceeds a certain value

I have 100-200 text files that I would like to filter rows based upon conditions being met in 2 columns. In addition to this I only want to print the resulting files if there are more than 20 rows of data in the file.
My script for the first part is:
for ID in {001..178}
do
cat FLD0${ID}.txt | awk '{ if($2 == "chr15" && $5>9) { print; } }' > FLD0${ID}.new.txt
done;
This works fine, but then I have some empty files where those conditions are never met, and some files with only 1 or 2 lines, which I suspect have low-quality data anyway. Now, after the above, I want only the files with 20 lines of data or more:
for ID in {001..178}
do
cat FLD0${ID}.txt | awk '{ if(FNR>19 && $2 == "chr15" && $5>9) { print; } }' > FLD0${ID}.new.txt
done;
The second script (with the FNR) right above seems ineffectual; I still get empty files.
How can I get this loop to work like the original above, with the extra condition that each output file has 20 or more lines of data?
Thanks,
The shell creates the output file as soon as it runs the command (the > redirection creates the file immediately). You will always get empty files this way. If you don't want that then have awk write directly to the file so it only gets created when necessary.
for ID in {001..178}
do
awk -v outfile=FLD0${ID}.new.txt 'FNR>19 && $2 == "chr15" && $5>9 { print > outfile }' FLD0${ID}.txt
done;
You could even run awk once on all the files instead of once-per-file if you wanted to.
awk 'FNR>19 && $2 == "chr15" && $5>9 { print > (FILENAME".new") }' FLD{001..178}.txt
(Slightly different output file name format for that one, but that's just because I was being lazy. You could fix that with split()/etc.; a sketch follows.)
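For example, a possible sketch of that split() fix (not from the original answer, untested; assumes an awk such as gawk that tolerates this many open output files), keeping the FLD0xxx.new.txt names:
awk 'FNR>19 && $2 == "chr15" && $5>9 {
    split(FILENAME, parts, ".")          # "FLD0001.txt" -> parts[1]="FLD0001"
    print > (parts[1] ".new.txt")        # writes to e.g. FLD0001.new.txt
}' FLD{001..178}.txt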

Printing duplicate rows as many times it is duplicate in the input file using UNIX

Suppose I have a sorted file:
AARAV,12345,BANK OF AMERICA,$145
AARAV,12345,BANK OF AMERICA,$145
AARAV,12345,BANK OF AMERICA,$145
RAM,124455,DUETCHE BANK,$240
And I want output as:
AARAV,12345,BANK OF AMERICA,$145
AARAV,12345,BANK OF AMERICA,$145
With uniq -d file I am able to find duplicate records, but it prints each record only once even if it is repeated. I want to print it as many times as it is duplicated.
Thanks in advance.
The following should do what you want, assuming your file is called Input.txt.
uniq -d Input.txt | xargs -I {} grep {} Input.txt
xargs -I {} basically tells xargs to substitute the input that is being piped in whenever it sees {} in a later command.
grep {} Input.txt will be called with each line of input from the pipe, where the line of input will get substituted where {} is.
Why does this work? We are using uniq -d to find the duplicate entries, and then using them as input patterns to grep to match all the lines which contain those entries. Thus, only duplicate entries are printed, and they are printed exactly as many times as they appear in the file.
Update: printing only the duplicate occurrences, not the first occurrence, in a way that is compatible with ksh, since the OP apparently does not have bash on his system.
uniq -d Input.txt | xargs -L 1 | while read line
do
grep "$line" Input.txt | tail -n +2;
done
Note that in the above scripts, we are assuming that no line is a substring of another line.
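If that assumption is a concern, one possible tweak (a sketch, not part of the original answer) is to make grep match fixed strings and whole lines only, so regex metacharacters and substring matches cannot interfere:
uniq -d Input.txt | while IFS= read -r line
do
    grep -F -x -e "$line" Input.txt | tail -n +2
done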
This should give you the output that you want. It repeats each duplicate line N-1 times. Unfortunately the output isn't sorted, so you'd have to pipe it through sort again.
Assuming the input file is input.txt:
awk -F '\n' '{ a[$1]++ } END { for (b in a) { while(--a[b]) { print b } } }' input.txt | sort
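With the sample data from the question saved as input.txt, the expected output (worked out by hand, not an actual run) would be:
$ awk -F '\n' '{ a[$1]++ } END { for (b in a) { while(--a[b]) { print b } } }' input.txt | sort
AARAV,12345,BANK OF AMERICA,$145
AARAV,12345,BANK OF AMERICA,$145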

What is the shell script instruction to divide a file with sorted lines to small files?

I have a large text file with the next format:
1 2327544589
1 3554547564
1 2323444333
2 3235434544
2 3534532222
2 4645644333
3 3424324322
3 5323243333
...
And the output should be text files whose names end with the number from the first column of the original file, each keeping the second-column numbers of the corresponding group, as follows:
file1.txt:
2327544589
3554547564
2323444333
file2.txt:
3235434544
3534532222
4645644333
file3.txt:
3424324322
5323243333
...
The script should run on Solaris, but I'm having trouble with awk and with the options of other commands, like -c with cut; the Solaris versions are very limited, so I am looking for commands commonly available on Solaris. I am not allowed to change or install anything on the system. Using a loop is not very efficient because the script takes too long with large files. So, aside from using awk and loops, any suggestions?
Something like this perhaps:
$ awk 'NF>1{print $2 > "file"$1".txt"}' input
$ cat file1.txt
2327544589
3554547564
2323444333
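A Solaris-specific aside (not part of the original answer): /usr/bin/awk on Solaris is the old awk, so if the one-liner complains, nawk or /usr/xpg4/bin/awk usually behaves better, and parenthesizing the redirection target avoids parsing ambiguity in some awks:
$ nawk 'NF>1{print $2 > ("file" $1 ".txt")}' input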
or if you have bash available, try this:
#!/bin/bash
while read a b
do
[ -z $a ] && continue
echo $b >> "file"$a".txt"
done < input
output:
$ paste file{1..3}.txt
2327544589 3235434544 3424324322
3554547564 3534532222 5323243333
2323444333 4645644333

Shell script cut the beginning and the end of a file

So I have a file and I'd like to cut the first 33 lines and the last 6 lines of it. What I am trying to do is get the whole file with a cat command (cat file) and then use the head and tail commands to remove those parts, but I don't know how to do so.
E.g. (this is just the idea):
cat file - head -n 33 file - tail -n 6 file
How am I supposed to do this? Is it possible to do it with sed (and if so, how)? Thanks in advance.
This is probably what you want:
$ tail -n +34 file | head -n -6
See the tail
-n, --lines=K
output the last K lines, instead of the last 10; or use -n +K to output lines starting with the Kth
and head
-n, --lines=[-]K
print the first K lines instead of the first 10; with the leading '-', print all but the last K lines of each file
man pages.
Example:
$ cat file
one
two
three
four
five
six
seven
eight
$ tail -n +4 file | head -n -2
four
five
six
Notice that you don't need the cat (see UUOC).
First count the total lines, then print the middle part (reads the file twice):
l=$(wc -l < file)
awk -v l="$l" 'NR>33 && NR<=l-6' file
or load the file into an array, then print the lines you need (reads the file once):
awk '{a[NR]=$0}END{for(i=34;i<=NR-6;i++)print a[i]}' file
or awk with head, if you don't want to think too hard (reads the file twice):
awk 'NR>33' file|head -n-6
This will also work:
sed -n '1,33b; 34{N;N;N;N;N};N;P;D' file
This might work for you (GNU sed):
sed '1,33d;:a;$d;N;s/\n/&/6;Ta;P;D' file
