Faster drop-in replacement for `cut` for a specific application (C)

I have a very large tab-separated file. The underlying file is binary (BAM) and is decoded and streamed by the tool samtools (which is very fast and not the bottleneck). Now I want to output only the content up to the first tab.
In my current piped command cut is the bottleneck:
samtools view -@ 15 -F 0x100 file.bam | cut -f 1 | pigz > out.gz
I tried using awk '{print $1}'. This is not sufficiently faster. I also tried using parallel in combination with cut, but this does not increase the speed much either.
I guess it would be better to have a tool which just outputs the string up to the first tab and then skips the rest of the line entirely.
Do you have a suggestion for a tool that is faster for my purpose? Ideally one would write a small C program, I guess, but my C is a bit rusty, so it would take too long for me.

You are interested in a small C program that outputs, for each line from stdin, only the text up to the first tab.
In C you can do this easily with something like this:
#include <stdio.h>
#include <string.h>

#define MAX_LINE_LENGTH 1024

int main(void) {
    char buf[MAX_LINE_LENGTH];

    while (fgets(buf, sizeof(buf), stdin) != NULL) {
        buf[strcspn(buf, "\n\t")] = '\0';
        fputs(buf, stdout);
        fputc('\n', stdout);
    }
    return 0;
}
It simply reads lines from stdin with fgets. The string is terminated with a NUL byte at the first tab (\t). The same is done at a newline (\n), so that there are no extra line feeds in the output in case an input line contains no tab.
Whether this is much faster in your use case I cannot say, but it should at least provide a starting point for trying out your idea.

GNU Awk 5.0.1, API: 2.0 (GNU MPFR 4.0.2, GNU MP 6.2.0)
You might give other implementations of AWK a try. According to a test done in 2009¹, Don’t MAWK AWK – the fastest and most elegant big data munging language!, nawk was found to be faster than gawk, and mawk faster than nawk. You would need to run the test with your own data to find out whether another implementation gives a noticeable boost.
¹ so versions available in 2022 might give different results
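To check on your own machine, a small harness like the following times whichever implementations happen to be installed. This is a sketch; the sample file path and the list of implementation names are assumptions, and a real comparison needs your real data at full size:

```shell
# Build a small tab-separated sample (1000 lines, 3 fields each).
seq 1 1000 | sed 's/$/\tfield2\tfield3/' > /tmp/awk_sample.tsv

# Time each awk implementation that exists on this system.
for impl in mawk nawk gawk awk; do
  command -v "$impl" >/dev/null 2>&1 || continue
  echo "== $impl =="
  time "$impl" -F'\t' '{print $1}' < /tmp/awk_sample.tsv > /dev/null
done
```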

In the question OP has mentioned that awk '{print $1}' is not sufficiently faster than cut; in my testing I'm seeing awk running about twice as fast as cut, so I'm not sure how OP is using awk, or if I'm missing something (basic) in my testing ...
OP has mentioned a 'large' tab-delimited file with up to 400 characters per line; we'll simulate this with the following code that generates a ~400MB file:
$ cat sam_out.awk
awk '
BEGIN { OFS="\t"; x="1234567890"
for (i=1;i<=40;i++) filler=filler x
for (i=1;i<=1000000;i++) print x,filler
}'
$ . ./sam_out.awk | wc
1000000 2000000 412000000
Test calls:
$ cat sam_tests.sh
echo "######### pipe to cut"
time . ./sam_out.awk | cut -f1 - > /dev/null
echo "######### pipe to awk"
time . ./sam_out.awk | awk '{print $1}' > /dev/null
echo "######### process-sub to cut"
time cut -f1 <(. ./sam_out.awk) > /dev/null
echo "######### process-sub to awk"
time awk '{print $1}' <(. ./sam_out.awk) > /dev/null
NOTE: also ran all 4 tests with output written to 4 distinct output files; diff of the 4 output files showed all were the same (wc: 1000000 1000000 11000000; head -1: 1234567890)
Results of running the tests:
######### pipe to cut
real 0m1.177s
user 0m0.205s
sys 0m1.454s
######### pipe to awk
real 0m0.582s
user 0m0.166s
sys 0m0.759s
######### process-sub to cut
real 0m1.265s
user 0m0.351s
sys 0m1.746s
######### process-sub to awk
real 0m0.655s
user 0m0.097s
sys 0m0.968s
NOTES:
test system: Ubuntu 20.04, cut (GNU coreutils 8.30), awk (GNU Awk 5.0.1)
an earlier version of this answer showed awk running 14x-15x faster than cut; that system: cygwin 3.3.5, cut (GNU coreutils 8.26), awk (GNU Awk 5.1.1)

You might consider process-substitutions instead of a pipeline.
$ < <( < <(samtools view -@ 15 -F 0x100 file.bam) cut -f1 ) pigz
Note: I'm using process substitution to generate stdin and avoid using another FIFO. This seems to be much faster.
I've written a simple test script sam_test.sh that generates some output:
#!/usr/bin/env bash
echo {1..10000} | awk 'BEGIN{OFS="\t"}{$1=$1;for(i=1;i<=1000;++i) print i,$0}'
and compared the output of the following commands:
$ ./sam_test.sh | cut -f1 | awk '!(FNR%3)'
$ < <(./sam_test.sh) cut -f1 | awk '!(FNR%3)'
$ < <( < <(./sam_test.sh) cut -f1 ) awk '!(FNR%3)'
The last of the three cases is significantly faster in runtime. Using strace -c, we can see that each pipeline adds a significant number of wait4 syscalls. The pipeline-free version is then also significantly faster (factor 700 in the above case).
Output of test case (short):
$ cat ./sam_test_full_pipe.sh
#!/usr/bin/env bash
./sam_test.sh | cut -f1 - | awk '!(FNR%3)' -
$ strace -c ./sam_test_full_pipe.sh > /dev/null
% time seconds usecs/call calls errors syscall
------ ----------- ----------- --------- --------- ----------------
99.22 0.643249 160812 4 1 wait4
0.30 0.001951 5 334 294 openat
0.21 0.001331 5 266 230 stat
0.04 0.000290 20 14 12 execve
<snip>
------ ----------- ----------- --------- --------- ----------------
100.00 0.648287 728 890 549 total
$ cat ./sam_test_one_pipe.sh
#!/usr/bin/env bash
< <(./sam_test.sh) cut -f1 - | awk '!(FNR%3)' -
$ strace -c ./sam_test_one_pipe.sh > /dev/null
% time seconds usecs/call calls errors syscall
------ ----------- ----------- --------- --------- ----------------
98.72 0.256664 85554 3 1 wait4
0.45 0.001181 3 334 294 openat
0.29 0.000757 2 266 230 stat
<snip>
------ ----------- ----------- --------- --------- ----------------
100.00 0.259989 295 881 547 total
$ cat ./sam_test_no_pipe.sh
#!/usr/bin/env bash
< <(< <(./sam_test.sh) cut -f1 - ) awk '!(FNR%3)' -
$ strace -c ./sam_test_no_pipe.sh > /dev/null
% time seconds usecs/call calls errors syscall
------ ----------- ----------- --------- --------- ----------------
39.43 0.002863 1431 2 1 wait4
19.68 0.001429 4 334 294 openat
14.87 0.001080 3 285 242 stat
10.00 0.000726 51 14 12 execve
<snip>
------ ----------- ----------- --------- --------- ----------------
100.00 0.007261 7 909 557 total
Output of test case (full):
$ cat ./sam_test_full_pipe.sh
#!/usr/bin/env bash
./sam_test.sh | cut -f1 - | awk '!(FNR%3)' -
$ strace -c ./sam_test_full_pipe.sh > /dev/null
% time seconds usecs/call calls errors syscall
------ ----------- ----------- --------- --------- ----------------
99.22 0.643249 160812 4 1 wait4
0.30 0.001951 5 334 294 openat
0.21 0.001331 5 266 230 stat
0.04 0.000290 20 14 12 execve
0.04 0.000276 6 42 mmap
0.04 0.000229 76 3 clone
0.03 0.000178 3 49 4 close
0.02 0.000146 3 39 fstat
0.02 0.000109 9 12 mprotect
0.01 0.000080 5 16 read
0.01 0.000053 2 18 rt_sigprocmask
0.01 0.000052 3 16 rt_sigaction
0.01 0.000038 3 10 brk
0.01 0.000036 18 2 munmap
0.01 0.000034 5 6 2 access
0.00 0.000029 3 8 1 fcntl
0.00 0.000024 3 7 lseek
0.00 0.000019 4 4 3 ioctl
0.00 0.000019 9 2 pipe
0.00 0.000018 3 5 getuid
0.00 0.000018 3 5 getgid
0.00 0.000018 3 5 getegid
0.00 0.000017 3 5 geteuid
0.00 0.000013 4 3 dup2
0.00 0.000013 13 1 faccessat
0.00 0.000009 2 4 2 arch_prctl
0.00 0.000008 4 2 getpid
0.00 0.000008 4 2 prlimit64
0.00 0.000005 5 1 sysinfo
0.00 0.000004 4 1 write
0.00 0.000004 4 1 uname
0.00 0.000004 4 1 getppid
0.00 0.000003 3 1 getpgrp
0.00 0.000002 2 1 rt_sigreturn
------ ----------- ----------- --------- --------- ----------------
100.00 0.648287 728 890 549 total
$ cat ./sam_test_one_pipe.sh
#!/usr/bin/env bash
< <(./sam_test.sh) cut -f1 - | awk '!(FNR%3)' -
$ strace -c ./sam_test_one_pipe.sh > /dev/null
% time seconds usecs/call calls errors syscall
------ ----------- ----------- --------- --------- ----------------
98.72 0.256664 85554 3 1 wait4
0.45 0.001181 3 334 294 openat
0.29 0.000757 2 266 230 stat
0.11 0.000281 20 14 12 execve
0.08 0.000220 5 42 mmap
0.06 0.000159 79 2 clone
0.05 0.000138 3 45 2 close
0.05 0.000125 3 39 fstat
0.03 0.000083 6 12 mprotect
0.02 0.000060 3 16 read
0.02 0.000054 3 16 rt_sigaction
0.02 0.000042 2 16 rt_sigprocmask
0.01 0.000038 6 6 2 access
0.01 0.000035 17 2 munmap
0.01 0.000027 2 10 brk
0.01 0.000019 3 5 getuid
0.01 0.000018 3 5 geteuid
0.01 0.000017 3 5 getgid
0.01 0.000017 3 5 getegid
0.00 0.000010 1 7 lseek
0.00 0.000009 2 4 3 ioctl
0.00 0.000008 4 2 getpid
0.00 0.000007 1 4 2 arch_prctl
0.00 0.000005 5 1 sysinfo
0.00 0.000004 4 1 uname
0.00 0.000003 3 1 getppid
0.00 0.000003 3 1 getpgrp
0.00 0.000003 1 2 prlimit64
0.00 0.000002 2 1 rt_sigreturn
0.00 0.000000 0 1 write
0.00 0.000000 0 1 pipe
0.00 0.000000 0 3 dup2
0.00 0.000000 0 8 1 fcntl
0.00 0.000000 0 1 faccessat
------ ----------- ----------- --------- --------- ----------------
100.00 0.259989 295 881 547 total
$ cat ./sam_test_no_pipe.sh
#!/usr/bin/env bash
< <(< <(./sam_test.sh) cut -f1 - ) awk '!(FNR%3)' -
$ strace -c ./sam_test_no_pipe.sh > /dev/null
% time seconds usecs/call calls errors syscall
------ ----------- ----------- --------- --------- ----------------
39.43 0.002863 1431 2 1 wait4
19.68 0.001429 4 334 294 openat
14.87 0.001080 3 285 242 stat
10.00 0.000726 51 14 12 execve
2.67 0.000194 4 42 mmap
1.83 0.000133 3 39 fstat
1.67 0.000121 121 1 clone
1.58 0.000115 2 41 close
0.88 0.000064 6 10 2 access
0.87 0.000063 5 12 mprotect
0.73 0.000053 3 16 rt_sigaction
0.70 0.000051 4 12 rt_sigprocmask
0.66 0.000048 3 16 read
0.48 0.000035 3 10 brk
0.48 0.000035 3 9 getuid
0.44 0.000032 16 2 munmap
0.41 0.000030 3 8 1 fcntl
0.41 0.000030 3 9 geteuid
0.40 0.000029 3 9 getegid
0.34 0.000025 2 9 getgid
0.22 0.000016 5 3 dup2
0.21 0.000015 3 4 3 ioctl
0.19 0.000014 2 7 lseek
0.18 0.000013 13 1 faccessat
0.12 0.000009 2 4 2 arch_prctl
0.11 0.000008 4 2 prlimit64
0.08 0.000006 3 2 getpid
0.06 0.000004 4 1 write
0.06 0.000004 4 1 rt_sigreturn
0.06 0.000004 4 1 uname
0.06 0.000004 4 1 sysinfo
0.06 0.000004 4 1 getppid
0.06 0.000004 4 1 getpgrp
------ ----------- ----------- --------- --------- ----------------
100.00 0.007261 7 909 557 total

In the end I hacked together a small C program which directly filters the BAM file and also writes gzip-compressed output -- with a lot of help from the htslib developers (htslib is the basis for samtools).
So piping is not needed any more. This solution is about 3-4 times faster than the solution with the C code above (from Stephan).
See here:
https://github.com/samtools/samtools/issues/1672

If you just need the first field, why not just:
{m,n,g}awk NF=1 FS='\t'
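What the one-liner does: the program text is NF=1, an assignment whose value (1) acts as an always-true pattern, so the default action prints each record; setting NF to 1 rebuilds $0 as just the first field. The trailing FS='\t' operand is a command-line variable assignment, processed before stdin is read. A quick sanity check (sample data is made up; behavior assumes a POSIX-style awk):

```shell
# Each output line should be only the first tab-separated field.
printf 'read1\tchr1\t100\nread2\tchr2\t200\n' | awk NF=1 FS='\t'
```

This prints read1 and read2, one per line.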
In terms of performance, I don't have a tab-separated file handy, but I do have a 12.5-million-row, 1.85 GB .txt with plenty of multi-byte UTF-8 in it that is "="-separated:
rows = 12,494,275 | ascii+utf8 chars = 1,285,316,715 | bytes = 1,983,544,693
- 4.44s mawk 2
- 4.95s mawk 1
- 10.48s gawk 5.1.1
- 40.07s nawk
Why some enjoy pushing for the slow awks is beyond me.
=
in0: 35.8MiB 0:00:00 [ 357MiB/s] [ 357MiB/s] [> ] 1% ETA 0:00:00
out9: 119MiB 0:00:04 [27.0MiB/s] [27.0MiB/s] [ <=> ]
in0: 1.85GiB 0:00:04 [ 428MiB/s] [ 428MiB/s] [======>] 100%
( pvE 0.1 in0 < "${m3t}" | mawk2 NF=1 FS==; )
4.34s user 0.45s system 107% cpu 4.439 total
1 52888940993baac8299b49ee2f5bdee7 stdin
=
in0: 1.85GiB 0:00:04 [ 384MiB/s] [ 384MiB/s] [=====>] 100%
out9: 119MiB 0:00:04 [24.2MiB/s] [24.2MiB/s] [ <=>]
( pvE 0.1 in0 < "${m3t}" | mawk NF=1 FS==; )
4.83s user 0.47s system 107% cpu 4.936 total
1 52888940993baac8299b49ee2f5bdee7 stdin
=
in0: 1.85GiB 0:00:10 [ 180MiB/s] [ 180MiB/s] [ ==>] 100%
out9: 119MiB 0:00:10 [11.4MiB/s] [11.4MiB/s] [ <=>]
( pvE 0.1 in0 < "${m3t}" | gawk NF=1 FS==; )
10.36s user 0.56s system 104% cpu 10.476 total
1 52888940993baac8299b49ee2f5bdee7 stdin
=
in0: 4.25MiB 0:00:00 [42.2MiB/s] [42.2MiB/s] [> ] 0% ETA 0:00:00
out9: 119MiB 0:00:40 [2.98MiB/s] [2.98MiB/s] [<=> ]
in0: 1.85GiB 0:00:40 [47.2MiB/s] [47.2MiB/s] [=====>] 100%
( pvE 0.1 in0 < "${m3t}" | nawk NF=1 FS==; )
39.79s user 0.88s system 101% cpu 40.068 total
1 52888940993baac8299b49ee2f5bdee7 stdin
But these pale in comparison to using the right FS to throw everything else away:
barely 1.95 secs
( pvE 0.1 in0 < "${m3t}" | mawk2 NF-- FS='=.*$'; )
1.83s user 0.42s system 115% cpu 1.951 total
1 52888940993baac8299b49ee2f5bdee7 stdin
By comparison, even GNU cut, which is a pure C binary, is slower:
( pvE 0.1 in0 < "${m3t}" | gcut -d= -f 1; )
2.53s user 0.50s system 113% cpu 2.674 total
1 52888940993baac8299b49ee2f5bdee7 stdin
You can save a tiny bit more (1.772 secs) using a more verbose approach:
( pvE 0.1 in0 < "${m3t}" | mawk2 '{ print $1 }' FS='=.*$'; )
1.64s user 0.42s system 116% cpu 1.772 total
1 52888940993baac8299b49ee2f5bdee7 stdin
Unfortunately, complex FS really isn't gawk's forte, even after you give it a helping boost with the byte-level flag :
( pvE 0.1 in0 < "${m3t}" | gawk -F'=.+$' -be NF--; )
20.23s user 0.59s system 102% cpu 20.383 total
52888940993baac8299b49ee2f5bdee7 stdin
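The same FS trick carries over to the tab-separated case in the question: make FS a regex that swallows everything from the first tab to end-of-line, so the remainder of each record becomes a single discarded field. A sketch with made-up sample data, assuming an awk that accepts regex field separators (gawk, mawk, nawk all do):

```shell
# The regex FS eats from the first tab to end-of-line; only $1 remains.
printf 'read1\tchr1\t100\nread2\tchr2\t200\n' | awk -F'\t.*$' '{ print $1 }'
```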

Related

AWK loop to parse file

I have trouble understanding an awk command which I want to change slightly (but can't, because I don't understand the code well enough).
The result of this awk command is to put together text files having 6 columns. In the output file, the first column is a mix of all the first columns of the input files. The other columns of the output file are the other columns of the input files, with blanks added if needed, to still match the first-column values.
First, I would like to parse only some specific columns from these files and not all 6. I couldn't figure out where to specify this in the awk loop.
Secondly, the column headers are no longer the first row of the output file. It would be nice to have them as a header in the output file as well.
Thirdly, I need to know which file the data comes from. I know that the command takes the files in the order they appear when doing ls -lh *mosdepth.summary.txt, so I can deduce that the first 6 columns are from file 1, the next 6 from file 2, etc. However, I would like to have this information automatically in the output file, to reduce the potential human errors I can make by inferring the origin of the data.
Here is the awk command
awk -F"\t" -v OFS="\t" 'F!=FILENAME { FNUM++; F=FILENAME }
{ COL[$1]++; C=$1; $1=""; A[C, FNUM]=$0 }
END {
for(X in COL)
{
printf("%s", X);
for(N=1; N<=FNUM; N++) printf("%s", A[X, N]);
printf("\n");
}
}' *mosdepth.summary.txt > Se_combined.coverage.txt
the input file look like this
cat file1
chrom length bases mean min max
contig_1_pilon 223468 603256 2.70 0 59
contig_2_pilon 197061 1423255 7.22 0 102
contig_6_pilon 162902 1372153 8.42 0 80
contig_19_pilon 286502 1781926 6.22 0 243
contig_29_pilon 263348 1251842 4.75 0 305
contig_32_pilon 291449 1819758 6.24 0 85
contig_34_pilon 51310 197150 3.84 0 29
contig_37_pilon 548146 4424483 8.07 0 399
contig_41_pilon 7529 163710 21.74 0 59
cat file2
chrom length bases mean min max
contig_2_pilon 197061 2098426 10.65 0 198
contig_19_pilon 286502 1892283 6.60 0 233
contig_32_pilon 291449 2051790 7.04 0 172
contig_37_pilon 548146 6684861 12.20 0 436
contig_42_pilon 14017 306188 21.84 0 162
contig_79_pilon 17365 883750 50.89 0 1708
contig_106_pilon 513441 6917630 13.47 0 447
contig_124_pilon 187518 374354 2.00 0 371
contig_149_pilon 1004879 13603882 13.54 0 801
the wrong output looks like this
contig_149_pilon 1004879 13603882 13.54 0 801
contig_79_pilon 17365 883750 50.89 0 1708
contig_1_pilon 223468 603256 2.70 0 59
contig_106_pilon 513441 6917630 13.47 0 447
contig_2_pilon 197061 1423255 7.22 0 102 197061 2098426 10.65 0 198
chrom length bases mean min max length bases mean min max
contig_37_pilon 548146 4424483 8.07 0 399 548146 6684861 12.20 0 436
contig_41_pilon 7529 163710 21.74 0 59
contig_6_pilon 162902 1372153 8.42 0 80
contig_42_pilon 14017 306188 21.84 0 162
contig_29_pilon 263348 1251842 4.75 0 305
contig_19_pilon 286502 1781926 6.22 0 243 286502 1892283 6.60 0 233
contig_124_pilon 187518 374354 2.00 0 371
contig_34_pilon 51310 197150 3.84 0 29
contig_32_pilon 291449 1819758 6.24 0 85 291449 2051790 7.04 0 172
EDIT:
Thanks to input from several users I managed to address my points 1 and 3 like this
awk -F"\t" -v OFS="\t" 'F!=FILENAME { FNUM++; F=FILENAME }
{ B[FNUM]=F; COL[$1]; C=$1; $1=""; A[C, FNUM]=$4}
END {
printf("%s\t", "contig")
for (N=1; N<=FNUM; N++)
{ printf("%.5s\t", B[N])}
printf("\n")
for(X in COL)
{
printf("%s\t", X);
for(N=1; N<=FNUM; N++)
{ printf("%s\t", A[X, N]);
}
printf("\n");
}
}' file1.txt file2.txt > output.txt
with output
contig file1 file2
contig_149_pilon 13.54
contig_79_pilon 50.89
contig_1_pilon 2.70
contig_106_pilon 13.47
contig_2_pilon 7.22 10.65
chrom mean mean
contig_37_pilon 8.07 12.20
contig_41_pilon 21.74
contig_6_pilon 8.42
contig_42_pilon 21.84
contig_29_pilon 4.75
contig_19_pilon 6.22 6.60
contig_124_pilon 2.00
contig_34_pilon 3.84
contig_32_pilon 6.24 7.04
Awk processes files in records, where the records are separated by the record separator RS. Each record is split in fields where the field separator is defined by the variable FS that the -F flag can define.
In the case of the program presented in the OP, the record separator is the default value which is the <newline>-character and the field separator is set to be the <tab>-character.
Awk programs are generally written as a sequence of pattern-action pairs of the form pattern { action }. These pairs are executed sequentially, and they state that action is performed whenever pattern evaluates to a non-zero number or non-empty string.
In the current program there are three such pattern-action pairs:
F!=FILENAME { FNUM++; F=FILENAME }: This states that if the value of F differs from the FILENAME currently being processed, then increase the value of FNUM by one and update F with the current FILENAME.
In the end, this is the same as just checking whether we are processing a new file. The equivalent version would be:
(FNR==1) { FNUM++ }
which reads: If we are processing the first record of the current file (FNR), then increase the file count FNUM.
{ COL[$1]++; C=$1; $1=""; A[C, FNUM]=$0 }: As there is no pattern, it defaults to true. So for each record/line, increment the number of times the value in the first column has been seen and store it in an associative array COL (key-value pairs). Memorize the first field in C and store the value of the current record, minus the first field, in an array A. So if a record of the second file reads "foo A B C D", and foo has already been seen 3 times, then COL["foo"] will be equal to 4 and A["foo",2] will read " A B C D".
END{ ... }: This is a special pattern-action pair. END indicates that this action should only be executed at the end, after all files have been processed. What the END block does is straightforward: it just prints all records of each file, including empty records.
In the end, this entire script can be simplified to the following:
awk 'BEGIN{ FS="\t" }
{ file_list[FILENAME]
key_list[$1]
record_list[FILENAME,$1]=$0 }
END { for (key in key_list)
for (fname in file_list)
print ( record_list[fname,key] ? record_list[fname,key] : key )
}' file1 file2 file3 ...
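To see what the simplified script produces, here is a toy run on two hypothetical files under /tmp. For each key it prints the matching record from each file, or just the key when a file lacks it; note that for (key in ...) iterates in an unspecified order, so the line order may vary between awk implementations:

```shell
printf 'k1\tA\nk2\tB\n' > /tmp/f1.tsv
printf 'k2\tC\nk3\tD\n' > /tmp/f2.tsv

# 3 distinct keys x 2 files = 6 output lines.
awk 'BEGIN{ FS="\t" }
     { file_list[FILENAME]
       key_list[$1]
       record_list[FILENAME,$1]=$0 }
     END { for (key in key_list)
             for (fname in file_list)
               print ( record_list[fname,key] ? record_list[fname,key] : key )
     }' /tmp/f1.tsv /tmp/f2.tsv
```

Key k2, present in both files, appears as both "k2<TAB>B" and "k2<TAB>C", while k1 and k3 each get one bare-key line for the file that lacks them.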
Assuming your '*mosdepth.summary.txt' files look like the following:
$ ls *mos*txt
1mosdepth.summary.txt 2mosdepth.summary.txt 3mosdepth.summary.txt
And contents are:
$ cat 1mosdepth.summary.txt
chrom length bases mean min max
contig_1_pilon 223468 1181176 5.29 0 860
contig_2_pilon 197061 2556215 12.97 0 217
contig_6_pilon 162902 2132156 13.09 0 80
$ cat 2mosdepth.summary.txt
chrom length bases mean min max
contig_19_pilon 286502 2067244 7.22 0 345
contig_29_pilon 263348 2222566 8.44 0 765
contig_32_pilon 291449 2671881 9.17 0 128
contig_34_pilon 51310 525393 10.24 0 47
$ cat 3mosdepth.summary.txt
chrom length bases mean min max
contig_37_pilon 548146 6652322 12.14 0 558
contig_41_pilon 7529 144989 19.26 0 71
The following awk command might be appropriate:
$ awk -v target_cols="1 2 3 4 5 6" 'BEGIN{split(target_cols, cols," ")} \
NR==1{printf "%s ", "file#"; for (i=1;i<=length(cols);i++) {printf "%s ", $cols[i]} print ""} \
FNR==1{fnbr++} \
FNR>=2{printf "%s ", fnbr; for (i=1;i<=length(cols);i++) {printf "%s ", $cols[i]} print ""}' *mos*txt | column -t
Output:
file# chrom length bases mean min max
1 contig_1_pilon 223468 1181176 5.29 0 860
1 contig_2_pilon 197061 2556215 12.97 0 217
1 contig_6_pilon 162902 2132156 13.09 0 80
2 contig_19_pilon 286502 2067244 7.22 0 345
2 contig_29_pilon 263348 2222566 8.44 0 765
2 contig_32_pilon 291449 2671881 9.17 0 128
2 contig_34_pilon 51310 525393 10.24 0 47
3 contig_37_pilon 548146 6652322 12.14 0 558
3 contig_41_pilon 7529 144989 19.26 0 71
Alternatively, the following will output the filename rather than file#:
$ awk -v target_cols="1 2 3 4 5 6" 'BEGIN{split(target_cols, cols," ")} \
NR==1{printf "%s ", "fname"; for (i=1;i<=length(cols);i++) {printf "%s ", $cols[i]} print ""} \
FNR==1{fnbr=FILENAME} \
FNR>=2{printf "%s ", fnbr; fnbr="-"; for (i=1;i<=length(cols);i++) {printf "%s ", $cols[i]} print ""}' *mos*txt | column -t
Output:
fname chrom length bases mean min max
1mosdepth.summary.txt contig_1_pilon 223468 1181176 5.29 0 860
- contig_2_pilon 197061 2556215 12.97 0 217
- contig_6_pilon 162902 2132156 13.09 0 80
2mosdepth.summary.txt contig_19_pilon 286502 2067244 7.22 0 345
- contig_29_pilon 263348 2222566 8.44 0 765
- contig_32_pilon 291449 2671881 9.17 0 128
- contig_34_pilon 51310 525393 10.24 0 47
3mosdepth.summary.txt contig_37_pilon 548146 6652322 12.14 0 558
- contig_41_pilon 7529 144989 19.26 0 71
With either command, the target_cols="1 2 3 4 5 6" specifies the targeted columns to extract.
target_cols="1 2 3" for example, will produce:
fname chrom length bases
1mosdepth.summary.txt contig_1_pilon 223468 1181176
- contig_2_pilon 197061 2556215
- contig_6_pilon 162902 2132156
2mosdepth.summary.txt contig_19_pilon 286502 2067244
- contig_29_pilon 263348 2222566
- contig_32_pilon 291449 2671881
- contig_34_pilon 51310 525393
3mosdepth.summary.txt contig_37_pilon 548146 6652322
- contig_41_pilon 7529 144989
target_cols="4 5 6" will produce:
fname mean min max
1mosdepth.summary.txt 5.29 0 860
- 12.97 0 217
- 13.09 0 80
2mosdepth.summary.txt 7.22 0 345
- 8.44 0 765
- 9.17 0 128
- 10.24 0 47
3mosdepth.summary.txt 12.14 0 558
- 19.26 0 71

Memory allocation of program without any allocation syscalls

I am currently working on a C program on Debian. The program at first allocates several gigabytes of memory; the problem is that after startup it is still allocating more. I checked, and there is no malloc or calloc etc. in the main loop of the program. I have checked the memory with the RES column in htop.
Then I decided to check the memory syscalls of the program with strace. I attached strace after program startup using this command:
strace -c -f -e trace=memory -p $(pidof myprogram)
Here is the result:
% time seconds usecs/call calls errors syscall
------ ----------- ----------- --------- --------- ----------------
100.00 0.000311 0 10392 mprotect
------ ----------- ----------- --------- --------- ----------------
100.00 0.000311 10392 total
So it is clear that there are no brk or mmap syscalls that could allocate memory.
Here is the list of all syscalls:
% time seconds usecs/call calls errors syscall
------ ----------- ----------- --------- --------- ----------------
33.00 1.446748 6156 235 67 futex
32.41 1.420658 8456 168 poll
17.35 0.760549 31 24459 nanosleep
16.24 0.712000 44500 16 select
1.00 0.044000 7333 6 2 restart_syscall
0.00 0.000000 0 80 40 read
0.00 0.000000 0 40 write
0.00 0.000000 0 184 mprotect
0.00 0.000000 0 33 rt_sigprocmask
0.00 0.000000 0 21 sendto
0.00 0.000000 0 47 sendmsg
0.00 0.000000 0 138 44 recvmsg
0.00 0.000000 0 7 gettid
------ ----------- ----------- --------- --------- ----------------
100.00 4.383955 25434 153 total
Do you have any idea why is memory allocated?

Increase speed of reading gz file in C

I have written a small C program. It reads some gzipped files, does some filtering, and then writes gzipped files again.
I run gcc with -O3 -Ofast. Otherwise it is pretty standard.
If I do strace -c on my executable I get:
% time seconds usecs/call calls errors syscall
------ ----------- ----------- --------- --------- ----------------
46.01 0.077081 0 400582 read
42.73 0.071579 4771 15 munmap
9.34 0.015647 0 110415 brk
1.01 0.001688 32 52 openat
0.45 0.000746 3 228 mmap
0.20 0.000327 4 70 mprotect
0.15 0.000254 0 1128 write
0.06 0.000100 2 50 fstat
0.05 0.000087 1 52 close
0.00 0.000006 6 1 getrandom
0.00 0.000005 2 2 rt_sigaction
0.00 0.000004 2 2 1 arch_prctl
0.00 0.000003 3 1 1 stat
0.00 0.000003 1 2 lseek
0.00 0.000002 2 1 rt_sigprocmask
0.00 0.000002 2 1 prlimit64
0.00 0.000000 0 8 pread64
0.00 0.000000 0 1 1 access
0.00 0.000000 0 1 execve
0.00 0.000000 0 2 fdatasync
0.00 0.000000 0 1 set_tid_address
0.00 0.000000 0 1 set_robust_list
------ ----------- ----------- --------- --------- ----------------
100.00 0.167534 512616 3 total
So my program is quite busy reading the file. Now I am not sure whether I can make it faster. The relevant code is the following:
while (gzgets(file_pointer, line, LL) != Z_NULL) {
    linkage = strtok(line, "\t");
    linkage = strtok(NULL, "\t");
    linkage[strcspn(linkage, "\n")] = 0;
    add_linkage_entry(id_cnt, linkage);
    id_cnt++;
}
Do you see room for improvement here? Is it possible to intervene manually with gzread, or is gzgets doing a good job here of not reading char by char?
Any other advice? (Are the errors in the strace output worrisome?)
EDIT:
add_linkage_entry does add an entry to a uthash hash table (https://troydhanson.github.io/uthash/)
I don't think that gzgets (and the related read system calls) are the bottleneck here.
The number of read calls is small for data that compresses well, and it will increase for data that has more entropy (zlib then has to request compressed data from disk more frequently). E.g., for text data generated from urandom (via
base64 /dev/urandom | tr -- '+HXA' '\t' | head -n 10000000 | gzip
) I get about 70000 read calls for 10M lines, equalling about 140 lines/call. This nicely matches your experience of 100..1000 lines per call.
What is more, the CPU time for reading those lines is still negligible (about 2.5M lines/s, including the strtok calls). Highly compressible data requires about 40 times fewer read calls and can be read about 4 times as fast -- but this factor of 4 can also be seen with raw decompression via gzip -d on the command line.
It thus appears that your function add_linkage_entry is the bottleneck here. In particular, the large number of brk calls looks unusual.
The errors in strace output look harmless.
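The compressibility effect described above is easy to reproduce from the shell. This is a sketch; the file paths and sizes are illustrative, and absolute timings depend on your machine:

```shell
# Repetitive data (compresses very well): 100000 numbered lines with a constant suffix.
seq 1 100000 | sed 's/$/\tAAAAAAAAAA/' | gzip > /tmp/repetitive.gz

# High-entropy data (compresses poorly): 10000 lines of base64-encoded random bytes.
head -c 1000000 /dev/urandom | base64 | head -n 10000 | gzip > /tmp/random.gz

ls -l /tmp/repetitive.gz /tmp/random.gz   # the random file is far larger

# Raw decompression time puts an upper bound on what gzgets can achieve:
time gzip -dc /tmp/repetitive.gz > /dev/null
time gzip -dc /tmp/random.gz > /dev/null
```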

Output a for loop into a array

I need the 4th-column (%idle) outputs below to be put into an array and then take an average of them. Below is lparstat output from an AIX system.
$ lparstat 2 10
System configuration: type=Shared mode=Uncapped smt=4 lcpu=16 mem=8192MB psize=16 ent=0.20
%user %sys %wait %idle physc %entc lbusy app vcsw phint %nsp %utcyc
----- ----- ------ ------ ----- ----- ------ --- ----- ----- ----- ------
2.6 1.8 0.0 95.5 0.02 9.5 0.0 5.05 270 0 101 1.42
2.8 1.6 0.0 95.6 0.02 9.9 1.9 5.38 258 0 101 1.42
0.5 1.4 0.0 98.1 0.01 5.5 2.9 5.17 265 0 101 1.40
2.8 1.3 0.0 95.8 0.02 8.9 0.0 5.37 255 0 101 1.42
2.8 2.0 0.0 95.2 0.02 10.8 1.9 4.49 264 0 101 1.42
4.2 1.7 0.0 94.1 0.02 12.2 0.0 3.66 257 0 101 1.42
0.5 1.5 0.0 98.0 0.01 6.3 1.9 3.35 267 0 101 1.38
3.1 2.0 0.0 94.9 0.02 12.1 2.9 3.07 367 0 101 1.41
2.3 2.2 0.0 95.5 0.02 9.8 0.0 3.40 259 0 101 1.42
25.1 25.5 0.0 49.4 0.18 89.6 2.6 2.12 395 0 101 1.44
I have made a script like this, but I need to press Enter to get the output.
$ for i in ` lparstat 2 10 | tail -10 | awk '{print $4}'`
> do
> read arr[$i]
> echo arr[$i]
> done
arr[94.0]
arr[97.7]
arr[94.9]
arr[91.0]
arr[98.1]
arr[97.7]
arr[93.0]
arr[94.8]
arr[97.9]
arr[89.2]
Your script only needs a small improvement to calculate the average. You can do that inside awk right away:
lparstat 2 10 | tail -n 10 | awk '{ sum += $4 } END { print sum / NR }'
The tail -n 10 takes the last 10 lines.
{ sum += $4 } is executed for each line - it sums the values in the 4th column.
Then the END block executes after the whole input is read. The { print sum / NR } prints the average. NR is "Number of Records"; one record is one line, so it is the number of lines.
Notes:
Backticks (`) are discouraged; the modern $( ... ) syntax is much preferred.
The for i in `cmd` (or, more commonly, for i in $(...)) construct is a common antipattern in bash. Use while read -r when reading lines from a command, e.g. cmd | while read -r line; do echo "$line"; done, or in bash while read -r line; do echo "$line"; done < <(cmd).
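Putting the two notes together, here is a bash sketch that collects the 4th-column values into an array with while read and then averages them with awk. The sample function is a hypothetical stand-in; on AIX you would substitute the real lparstat 2 10 | tail -n 10 | awk '{print $4}' pipeline:

```shell
#!/usr/bin/env bash
# Hypothetical stand-in for: lparstat 2 10 | tail -n 10 | awk '{print $4}'
sample() { printf '95.5\n95.6\n98.1\n'; }

idle=()
while read -r v; do
  idle+=("$v")               # collect each %idle value into the array
done < <(sample)             # process substitution: no subshell, array survives

echo "collected ${#idle[@]} samples"
printf '%s\n' "${idle[@]}" | awk '{ sum += $1 } END { printf "%.2f\n", sum / NR }'
```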

Issue with AWK array length?

I have a tab separated matrix (say filename).
If I do:
head -1 filename | awk -F "\t" '{i=0;med=0;for(i=2;i<=NF;i++) array[i]=$i;asort(array);print length(array)}'
followed by:
head -2 filename | tail -1 | awk -F "\t" '{i=0;med=0;for(i=2;i<=NF;i++) array[i]=$i;asort(array);print length(array)}'
I get an answer of 24 (same answer) for all rows basically.
But if I do it:
cat filename | awk -F "\t" '{i=0;med=0;for(i=2;i<=NF;i++) array[i]=$i;asort(array);print length(array)}'
I get:
24
25
25
25
25 ...
Why is it so?
Following is the inputfile:
Case1 17.49 0.643 0.366 11.892 0.85 5.125 0.589 0.192 0.222 0.231 27.434 0.228 0 0.111 0.568 0.736 0.125 0.038 0.218 0.253 0.055 0.019 0 0.078
Case2 0.944 2.412 4.296 0.329 0.399 1.625 0.196 0.038 0.381 0.208 0.045 1.253 0.382 0.111 0.324 0.268 0.458 0.352 0 1.423 0.887 0.444 5.882 0.543
Case3 21.266 14.952 24.406 10.977 8.511 21.75 6.68 0.613 12.433 1.48 1.441 21.648 6.972 42.931 8.029 4.883 11.912 6.248 4.949 26.882 9.756 5.366 38.655 12.723
Case4 0.888 0 0.594 0.549 0.105 0.125 0 0 0.571 0.116 0.019 1.177 0.573 0.111 0.081 0.401 0 0.05 0.073 0 0 0 0 0.543
Well, I found the answer to my own problem.
I wonder how I missed it, but clearing the array at the end of each iteration is critical for repeated use of the same array name (no matter which language/script one uses).
The correct awk was:
cat filename | awk -F "\t" '{i=0;med=0;for(i=2;i<=NF;i++) array[i]=$i;asort(array);print length(array);delete array}'
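The effect is easy to reproduce with a two-record sample where the first record has more fields than the second: without the delete, the stale element from record one keeps the count inflated. A sketch, using an explicit in-loop count instead of gawk's length(array) (which is not portable to every awk):

```shell
printf 'a\t1\t2\t3\nb\t4\t5\n' |
awk -F'\t' '{
  for (i = 2; i <= NF; i++) array[i] = $i
  n = 0; for (k in array) n++   # portable element count
  print n
  delete array                  # reset before the next record
}'
```

This prints 3 then 2; dropping the delete array line prints 3 twice, which is exactly the symptom in the question.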
