Output a for loop into an array

I need the %idle values from the 4th column of the output below to be put into an array, and then to take an average of them. Below is lparstat output from an AIX system.
$ lparstat 2 10
System configuration: type=Shared mode=Uncapped smt=4 lcpu=16 mem=8192MB psize=16 ent=0.20
%user %sys %wait %idle physc %entc lbusy app vcsw phint %nsp %utcyc
----- ----- ------ ------ ----- ----- ------ --- ----- ----- ----- ------
2.6 1.8 0.0 95.5 0.02 9.5 0.0 5.05 270 0 101 1.42
2.8 1.6 0.0 95.6 0.02 9.9 1.9 5.38 258 0 101 1.42
0.5 1.4 0.0 98.1 0.01 5.5 2.9 5.17 265 0 101 1.40
2.8 1.3 0.0 95.8 0.02 8.9 0.0 5.37 255 0 101 1.42
2.8 2.0 0.0 95.2 0.02 10.8 1.9 4.49 264 0 101 1.42
4.2 1.7 0.0 94.1 0.02 12.2 0.0 3.66 257 0 101 1.42
0.5 1.5 0.0 98.0 0.01 6.3 1.9 3.35 267 0 101 1.38
3.1 2.0 0.0 94.9 0.02 12.1 2.9 3.07 367 0 101 1.41
2.3 2.2 0.0 95.5 0.02 9.8 0.0 3.40 259 0 101 1.42
25.1 25.5 0.0 49.4 0.18 89.6 2.6 2.12 395 0 101 1.44
I have made a script like this, but I need to press Enter to get the output.
$ for i in ` lparstat 2 10 | tail -10 | awk '{print $4}'`
> do
> read arr[$i]
> echo arr[$i]
> done
arr[94.0]
arr[97.7]
arr[94.9]
arr[91.0]
arr[98.1]
arr[97.7]
arr[93.0]
arr[94.8]
arr[97.9]
arr[89.2]

Your script only needs a small improvement to calculate the average. You can do that inside awk right away:
lparstat 2 10 | tail -n 10 | awk '{ sum += $4 } END { print sum / NR }'
The tail -n 10 takes the last 10 lines.
{ sum += $4 } is executed for each line; it sums the values in the 4th column.
The END block executes after the whole input has been read. The { print sum / NR } prints the average. NR is "Number of Records"; one record is one line, so it is the number of lines.
Notes:
Backticks (`) are discouraged; the modern $( ... ) syntax is much preferred.
The for i in `cmd` construct (or, more commonly, for i in $(...)) is a common antipattern in bash. When reading lines from a command, use while read -r line instead, e.g. cmd | while read -r line; do echo "$line"; done, or in bash: while read -r line; do echo "$line"; done < <(cmd)

Related

Faster drop-in replacement for bash `cut` for specific application

I have a very large tab separated file. The tab separated file is binary and will be streamed by the tool samtools (which is very fast and not the bottleneck). Now I want to output only the content up to the first tab.
In my current piped command cut is the bottleneck:
samtools view -# 15 -F 0x100 file.bam | cut -f 1 | pigz > out.gz
I tried using awk '{print $1}', but this is not sufficiently faster. I also tried using parallel in combination with cut, but this also does not increase the speed much.
I guess it would be better to have a tool which just outputs the string up to the first tab and then skips the rest of the line.
Do you have a suggestion for a tool which is faster for my purpose? Ideally one would write a small C program, I guess, but my C is a bit rusty, so it would take too long for me.
You are interested in a small C program that just outputs each line from stdin up to the first tab.
In C you can do this easily with something like this:
#include <stdio.h>
#include <string.h>
#define MAX_LINE_LENGTH 1024
int main(void) {
    char buf[MAX_LINE_LENGTH];

    while (fgets(buf, sizeof(buf), stdin) != NULL) {
        /* Terminate the string at the first tab or newline. */
        buf[strcspn(buf, "\n\t")] = '\0';
        fputs(buf, stdout);
        fputc('\n', stdout);
    }
    return 0;
}
It simply reads lines from stdin with fgets. The string is terminated with a NUL byte at the first tab \t. The same is done at a \n, so that there are no extra line feeds in the output in case an input line contains no tab.
Whether this is much faster in your use case I cannot say, but it should at least provide a starting point for trying out your idea.
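To try it in the original pipeline, a possible build-and-run sequence would be the following (the file name first_field.c and the binary name are just placeholders):
$ cc -O2 -o first_field first_field.c
$ samtools view -# 15 -F 0x100 file.bam | ./first_field | pigz > out.gz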
GNU Awk 5.0.1, API: 2.0 (GNU MPFR 4.0.2, GNU MP 6.2.0)
You might give another implementation of AWK a try. According to a test done in 2009¹, "Don't MAWK AWK – the fastest and most elegant big data munging language!", nawk was found to be faster than gawk, and mawk was found to be faster than nawk. You would need to run tests with your own data to find out whether another implementation gives a noticeable boost.
¹ so the versions available in 2022 might give different results
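For example, a drop-in swap of the awk implementation in the original pipeline could look like this (assuming mawk is installed; only a test on your own data will show whether it helps):
$ samtools view -# 15 -F 0x100 file.bam | mawk '{print $1}' | pigz > out.gz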
In the question OP has mentioned that awk '{print $1}' is not sufficiently faster than cut; in my testing I'm seeing awk running about twice as fast as cut, so not sure how OP is using awk ... or if I'm missing something (basic) with my testing ...
OP has mentioned a 'large' tab-delimited file with up to 400 characters per line; we'll simulate this with the following code that generates a ~400MB file:
$ cat sam_out.awk
awk '
BEGIN { OFS="\t"; x="1234567890"
for (i=1;i<=40;i++) filler=filler x
for (i=1;i<=1000000;i++) print x,filler
}'
$ . ./sam_out.awk | wc
1000000 2000000 412000000
Test calls:
$ cat sam_tests.sh
echo "######### pipe to cut"
time . ./sam_out.awk | cut -f1 - > /dev/null
echo "######### pipe to awk"
time . ./sam_out.awk | awk '{print $1}' > /dev/null
echo "######### process-sub to cut"
time cut -f1 <(. ./sam_out.awk) > /dev/null
echo "######### process-sub to awk"
time awk '{print $1}' <(. ./sam_out.awk) > /dev/null
NOTE: also ran all 4 tests with output written to 4 distinct output files; diff of the 4 output files showed all were the same (wc: 1000000 1000000 11000000; head -1: 1234567890)
Results of running the tests:
######### pipe to cut
real 0m1.177s
user 0m0.205s
sys 0m1.454s
######### pipe to awk
real 0m0.582s
user 0m0.166s
sys 0m0.759s
######### process-sub to cut
real 0m1.265s
user 0m0.351s
sys 0m1.746s
######### process-sub to awk
real 0m0.655s
user 0m0.097s
sys 0m0.968s
NOTES:
test system: Ubuntu 10.04, cut (GNU coreutils 8.30), awk (GNU Awk 5.0.1)
earlier version of this answer showed awk running 14x-15x faster than cut; that system: cygwin 3.3.5, cut (GNU coreutils 8.26), awk (GNU Awk 5.1.1)
You might consider process-substitutions instead of a pipeline.
$ < <( < <(samtools view -# 15 -F 0x100 file.bam) cut -f1 ) pigz
Note: I'm using process substitution to generate stdin and avoid using another FIFO. This seems to be much faster.
I've written a simple test script sam_test.sh that generates some output:
#!/usr/bin/env bash
echo {1..10000} | awk 'BEGIN{OFS="\t"}{$1=$1;for(i=1;i<=1000;++i) print i,$0}'
and compared the output of the following commands:
$ ./sam_test.sh | cut -f1 | awk '!(FNR%3)'
$ < <(./sam_test.sh) cut -f1 | awk '!(FNR%3)'
$ < <( < <(./sam_test.sh) cut -f1 ) awk '!(FNR%3)'
The last of the three cases is significantly faster in terms of runtime. Using strace -c, we can see that each pipe adds a significant amount of time spent in wait4 syscalls. The final version is then also significantly faster (a factor of 700 in the case above).
Output of test case (short):
$ cat ./sam_test_full_pipe.sh
#!/usr/bin/env bash
./sam_test.sh | cut -f1 - | awk '!(FNR%3)' -
$ strace -c ./sam_test_full_pipe.sh > /dev/null
% time seconds usecs/call calls errors syscall
------ ----------- ----------- --------- --------- ----------------
99.22 0.643249 160812 4 1 wait4
0.30 0.001951 5 334 294 openat
0.21 0.001331 5 266 230 stat
0.04 0.000290 20 14 12 execve
<snip>
------ ----------- ----------- --------- --------- ----------------
100.00 0.648287 728 890 549 total
$ cat ./sam_test_one_pipe.sh
#!/usr/bin/env bash
< <(./sam_test.sh) cut -f1 - | awk '!(FNR%3)' -
$ strace -c ./sam_test_one_pipe.sh > /dev/null
% time seconds usecs/call calls errors syscall
------ ----------- ----------- --------- --------- ----------------
98.72 0.256664 85554 3 1 wait4
0.45 0.001181 3 334 294 openat
0.29 0.000757 2 266 230 stat
<snip>
------ ----------- ----------- --------- --------- ----------------
100.00 0.259989 295 881 547 total
$ cat ./sam_test_no_pipe.sh
#!/usr/bin/env bash
< <(< <(./sam_test.sh) cut -f1 - ) awk '!(FNR%3)' -
$ strace -c ./sam_test_no_pipe.sh > /dev/null
% time seconds usecs/call calls errors syscall
------ ----------- ----------- --------- --------- ----------------
39.43 0.002863 1431 2 1 wait4
19.68 0.001429 4 334 294 openat
14.87 0.001080 3 285 242 stat
10.00 0.000726 51 14 12 execve
<snip>
------ ----------- ----------- --------- --------- ----------------
100.00 0.007261 7 909 557 total
Output of test case (full):
$ cat ./sam_test_full_pipe.sh
#!/usr/bin/env bash
./sam_test.sh | cut -f1 - | awk '!(FNR%3)' -
$ strace -c ./sam_test_full_pipe.sh > /dev/null
% time seconds usecs/call calls errors syscall
------ ----------- ----------- --------- --------- ----------------
99.22 0.643249 160812 4 1 wait4
0.30 0.001951 5 334 294 openat
0.21 0.001331 5 266 230 stat
0.04 0.000290 20 14 12 execve
0.04 0.000276 6 42 mmap
0.04 0.000229 76 3 clone
0.03 0.000178 3 49 4 close
0.02 0.000146 3 39 fstat
0.02 0.000109 9 12 mprotect
0.01 0.000080 5 16 read
0.01 0.000053 2 18 rt_sigprocmask
0.01 0.000052 3 16 rt_sigaction
0.01 0.000038 3 10 brk
0.01 0.000036 18 2 munmap
0.01 0.000034 5 6 2 access
0.00 0.000029 3 8 1 fcntl
0.00 0.000024 3 7 lseek
0.00 0.000019 4 4 3 ioctl
0.00 0.000019 9 2 pipe
0.00 0.000018 3 5 getuid
0.00 0.000018 3 5 getgid
0.00 0.000018 3 5 getegid
0.00 0.000017 3 5 geteuid
0.00 0.000013 4 3 dup2
0.00 0.000013 13 1 faccessat
0.00 0.000009 2 4 2 arch_prctl
0.00 0.000008 4 2 getpid
0.00 0.000008 4 2 prlimit64
0.00 0.000005 5 1 sysinfo
0.00 0.000004 4 1 write
0.00 0.000004 4 1 uname
0.00 0.000004 4 1 getppid
0.00 0.000003 3 1 getpgrp
0.00 0.000002 2 1 rt_sigreturn
------ ----------- ----------- --------- --------- ----------------
100.00 0.648287 728 890 549 total
$ cat ./sam_test_one_pipe.sh
#!/usr/bin/env bash
< <(./sam_test.sh) cut -f1 - | awk '!(FNR%3)' -
$ strace -c ./sam_test_one_pipe.sh > /dev/null
% time seconds usecs/call calls errors syscall
------ ----------- ----------- --------- --------- ----------------
98.72 0.256664 85554 3 1 wait4
0.45 0.001181 3 334 294 openat
0.29 0.000757 2 266 230 stat
0.11 0.000281 20 14 12 execve
0.08 0.000220 5 42 mmap
0.06 0.000159 79 2 clone
0.05 0.000138 3 45 2 close
0.05 0.000125 3 39 fstat
0.03 0.000083 6 12 mprotect
0.02 0.000060 3 16 read
0.02 0.000054 3 16 rt_sigaction
0.02 0.000042 2 16 rt_sigprocmask
0.01 0.000038 6 6 2 access
0.01 0.000035 17 2 munmap
0.01 0.000027 2 10 brk
0.01 0.000019 3 5 getuid
0.01 0.000018 3 5 geteuid
0.01 0.000017 3 5 getgid
0.01 0.000017 3 5 getegid
0.00 0.000010 1 7 lseek
0.00 0.000009 2 4 3 ioctl
0.00 0.000008 4 2 getpid
0.00 0.000007 1 4 2 arch_prctl
0.00 0.000005 5 1 sysinfo
0.00 0.000004 4 1 uname
0.00 0.000003 3 1 getppid
0.00 0.000003 3 1 getpgrp
0.00 0.000003 1 2 prlimit64
0.00 0.000002 2 1 rt_sigreturn
0.00 0.000000 0 1 write
0.00 0.000000 0 1 pipe
0.00 0.000000 0 3 dup2
0.00 0.000000 0 8 1 fcntl
0.00 0.000000 0 1 faccessat
------ ----------- ----------- --------- --------- ----------------
100.00 0.259989 295 881 547 total
$ cat ./sam_test_no_pipe.sh
#!/usr/bin/env bash
< <(< <(./sam_test.sh) cut -f1 - ) awk '!(FNR%3)' -
$ strace -c ./sam_test_no_pipe.sh > /dev/null
% time seconds usecs/call calls errors syscall
------ ----------- ----------- --------- --------- ----------------
39.43 0.002863 1431 2 1 wait4
19.68 0.001429 4 334 294 openat
14.87 0.001080 3 285 242 stat
10.00 0.000726 51 14 12 execve
2.67 0.000194 4 42 mmap
1.83 0.000133 3 39 fstat
1.67 0.000121 121 1 clone
1.58 0.000115 2 41 close
0.88 0.000064 6 10 2 access
0.87 0.000063 5 12 mprotect
0.73 0.000053 3 16 rt_sigaction
0.70 0.000051 4 12 rt_sigprocmask
0.66 0.000048 3 16 read
0.48 0.000035 3 10 brk
0.48 0.000035 3 9 getuid
0.44 0.000032 16 2 munmap
0.41 0.000030 3 8 1 fcntl
0.41 0.000030 3 9 geteuid
0.40 0.000029 3 9 getegid
0.34 0.000025 2 9 getgid
0.22 0.000016 5 3 dup2
0.21 0.000015 3 4 3 ioctl
0.19 0.000014 2 7 lseek
0.18 0.000013 13 1 faccessat
0.12 0.000009 2 4 2 arch_prctl
0.11 0.000008 4 2 prlimit64
0.08 0.000006 3 2 getpid
0.06 0.000004 4 1 write
0.06 0.000004 4 1 rt_sigreturn
0.06 0.000004 4 1 uname
0.06 0.000004 4 1 sysinfo
0.06 0.000004 4 1 getppid
0.06 0.000004 4 1 getpgrp
------ ----------- ----------- --------- --------- ----------------
100.00 0.007261 7 909 557 total
In the end I hacked together a small C program which directly filters the BAM file and also writes to gzip, with a lot of help from the htslib developers (htslib is the basis for samtools).
So piping is not needed any more. This solution is about 3-4 times faster than the solution with the C code above (from Stephan).
See here:
https://github.com/samtools/samtools/issues/1672
If you just need the first field, why not just
{m,n,g}awk NF=1 FS='\t'
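Applied to the pipeline from the question, that would be something along these lines (a sketch; mawk stands in for any of the implementations above, and only a benchmark on the real data can tell whether it beats cut):
$ samtools view -# 15 -F 0x100 file.bam | mawk NF=1 FS='\t' | pigz > out.gz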
In terms of performance, I don't have a tab-separated file handy, but I do have a 12.5-million-row, 1.85 GB .txt file with plenty of multi-byte UTF-8 in it that is "=" separated:
rows = 12,494,275. | ascii+utf8 chars = 1,285,316,715. | bytes = 1,983,544,693.
- 4.44s mawk 2
- 4.95s mawk 1
- 10.48s gawk 5.1.1
- 40.07s nawk
Why some enjoy pushing for the slow awks is beyond me.
=
in0: 35.8MiB 0:00:00 [ 357MiB/s] [ 357MiB/s] [> ] 1% ETA 0:00:00
out9: 119MiB 0:00:04 [27.0MiB/s] [27.0MiB/s] [ <=> ]
in0: 1.85GiB 0:00:04 [ 428MiB/s] [ 428MiB/s] [======>] 100%
( pvE 0.1 in0 < "${m3t}" | mawk2 NF=1 FS==; )
4.34s user 0.45s system 107% cpu 4.439 total
1 52888940993baac8299b49ee2f5bdee7 stdin
=
in0: 1.85GiB 0:00:04 [ 384MiB/s] [ 384MiB/s] [=====>] 100%
out9: 119MiB 0:00:04 [24.2MiB/s] [24.2MiB/s] [ <=>]
( pvE 0.1 in0 < "${m3t}" | mawk NF=1 FS==; )
4.83s user 0.47s system 107% cpu 4.936 total
1 52888940993baac8299b49ee2f5bdee7 stdin
=
in0: 1.85GiB 0:00:10 [ 180MiB/s] [ 180MiB/s] [ ==>] 100%
out9: 119MiB 0:00:10 [11.4MiB/s] [11.4MiB/s] [ <=>]
( pvE 0.1 in0 < "${m3t}" | gawk NF=1 FS==; )
10.36s user 0.56s system 104% cpu 10.476 total
1 52888940993baac8299b49ee2f5bdee7 stdin
=
in0: 4.25MiB 0:00:00 [42.2MiB/s] [42.2MiB/s] [> ] 0% ETA 0:00:00
out9: 119MiB 0:00:40 [2.98MiB/s] [2.98MiB/s] [<=> ]
in0: 1.85GiB 0:00:40 [47.2MiB/s] [47.2MiB/s] [=====>] 100%
( pvE 0.1 in0 < "${m3t}" | nawk NF=1 FS==; )
39.79s user 0.88s system 101% cpu 40.068 total
1 52888940993baac8299b49ee2f5bdee7 stdin
But these pale in comparison to using an FS that swallows everything after the first field:
barely 1.95 secs
( pvE 0.1 in0 < "${m3t}" | mawk2 NF-- FS='=.*$'; )
1.83s user 0.42s system 115% cpu 1.951 total
1 52888940993baac8299b49ee2f5bdee7 stdin
By comparison, even GNU cut, which is a pure C binary, is slower:
( pvE 0.1 in0 < "${m3t}" | gcut -d= -f 1; )
2.53s user 0.50s system 113% cpu 2.674 total
1 52888940993baac8299b49ee2f5bdee7 stdin
You can save a tiny bit more (1.772 secs) using a more verbose approach:
( pvE 0.1 in0 < "${m3t}" | mawk2 '{ print $1 }' FS='=.*$'; )
1.64s user 0.42s system 116% cpu 1.772 total
1 52888940993baac8299b49ee2f5bdee7 stdin
Unfortunately, a complex FS really isn't gawk's forte, even after you give it a helping boost with the byte-level flag:
( pvE 0.1 in0 < "${m3t}" | gawk -F'=.+$' -be NF--; )
20.23s user 0.59s system 102% cpu 20.383 total
52888940993baac8299b49ee2f5bdee7 stdin
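For the tab-separated data from the question, the analogous separator would swallow everything from the first tab onwards; a sketch of the more verbose variant (untested against BAM output, so treat any speed-up as an assumption to verify):
$ samtools view -# 15 -F 0x100 file.bam | mawk '{ print $1 }' FS='\t.*$' | pigz > out.gz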

cut/edit row lines and then transpose rows to column arrays

I've greatly simplified the values in the question
I have 500k lines of text in a space-delimited file, from which I need to extract x, y values for plotting from each row and transpose those rows to columns. I'm trying to determine the most efficient way of doing this, and I've mostly been working in a bash script using sed and awk -- but if there are better ways of doing this, I'm open to suggestions.
The following code block is a simple example of the first 3 lines of text. The first 4 values (a, b, c, num) can be ignored, but then each pair of values (separated by 0.000) needs to be cut out and placed into a column, and I need to do this for every line. The 2, 3 or 2 (right after c) tells us how many x,y pairs follow in the line, starting with 2500 and 500.
Sample input:
a b c 2 2500.0 500.0 0.000 0.0 10.0
a b c 3 2000.0 450.0 0.000 1000.0 400.0 0.000 0.0 12.0
a b c 2 1800.0 475.0 0.000 0.0 15.0
Expected output:
2500.0 500.0 2000.0 450.0 1800.0 475.0
0.0 10.0 1000.0 400.0 0.0 15.0
0.0 12.0
I had a flash of inspiration - is this what you're trying to do:
$ cat tst.awk
BEGIN { OFS="\t" }
{
    rowNr = 0
    for ( i=5; i<NF; i+=3 ) {
        val = $i OFS $(i+1)
        vals[++rowNr,NR] = val
    }
    numRows = ( rowNr > numRows ? rowNr : numRows )
}
END {
    numCols = NR
    for ( rowNr=1; rowNr<=numRows; rowNr++ ) {
        for ( colNr=1; colNr<=numCols; colNr++ ) {
            val = ( (rowNr,colNr) in vals ? vals[rowNr,colNr] : OFS )
            printf "%s%s", val, (colNr<numCols ? OFS : ORS)
        }
    }
}
$ awk -f tst.awk file
2500.0 500.0 2000.0 450.0 1800.0 475.0
0.0 10.0 1000.0 400.0 0.0 15.0
0.0 12.0
Previous answers:
Using your new example:
$ awk '{for (i=5;i<NF;i+=3) print $i, $(i+1)}' file
2500.0 500.0
0.0 10.0
2000.0 450.0
1000.0 400.0
0.0 12.0
1800.0 475.0
0.0 15.0
Original answer:
$ awk '{for (i=5;i<NF;i+=3) print $i, $(i+1)}' file
2328.948975 287.601868
2091.017578 168.763153
1949.201782 38.351677
1926.619385 43.498230
1808.323120 91.477585
1792.807861 60.784626
1650.532471 25.608397
1630.271484 54.119873
1615.212891 4.026413
1502.170288 118.276688
1176.283447 43.057251
1119.772705 56.983471
944.500000 81.461624
937.491516 107.800484
726.215515 181.033768
641.585510 82.066551
443.907104 27.303604
362.327789 90.082993
348.752777 39.833252
61.434803 44.582367
0.000000 7.181455

Spotfire Line chart with min max bars

I am trying to make a chart that has a line graph showing the change in value in the Count column for each month, and then two points showing the min and max value in that month. The table is below.
Date Min Max Count
1/1/2015 0.28 6.02 13
2/1/2015 0.2 7.72 8
3/1/2015 1 1 1
4/1/2015 0.4 6.87 7
5/1/2015 0.36 3.05 8
6/1/2015 0.17 1.26 13
7/1/2015 0.31 1.59 15
8/1/2015 0.39 3.35 13
9/1/2015 0.22 0.86 10
10/1/2015 0.3 2.48 13
11/1/2015 0.16 0.82 9
12/1/2015 0.33 2.18 5
1/1/2016 0.23 1.16 14
2/1/2016 0.38 1.74 7
3/1/2016 0.1 8.87 9
4/1/2016 0.28 0.68 3
5/1/2016 0.13 3.23 11
6/1/2016 0.33 1 5
7/1/2016 0.28 1.26 4
8/1/2016 0.08 0.41 2
9/1/2016 0.43 0.61 2
10/1/2016 0.49 1.39 4
11/1/2016 0.89 0.89 1
I tried doing a scatter plot, but when I try to Add a Line from Column value I get an error saying that the line cannot work on categorical data.
Any suggestions on how I can prepare this visualization?
Thanks!
I would do this in a combination chart.
Insert a combination chart (Line & Bar Graph)
On your X-Axis put your date as <BinByDateTime([Date],"Year.Month",1)>
On your Y-Axis put your aggregations: Sum([Count]), Max([Max]), Min([Min])
Right click > Properties > Series > set the Min and Max to Line Type
(Optional) Change the Y-Axis scale

Issue with AWK array length?

I have a tab separated matrix (say filename).
If I do:
head -1 filename | awk -F "\t" '{i=0;med=0;for(i=2;i<=NF;i++) array[i]=$i;asort(array);print length(array)}'
followed by:
head -2 filename | tail -1 | awk -F "\t" '{i=0;med=0;for(i=2;i<=NF;i++) array[i]=$i;asort(array);print length(array)}'
I get an answer of 24 (same answer) for all rows basically.
But if I do it:
cat filename | awk -F "\t" '{i=0;med=0;for(i=2;i<=NF;i++) array[i]=$i;asort(array);print length(array)}'
I get:
24
25
25
25
25 ...
Why is it so?
Following is the input file:
Case1 17.49 0.643 0.366 11.892 0.85 5.125 0.589 0.192 0.222 0.231 27.434 0.228 0 0.111 0.568 0.736 0.125 0.038 0.218 0.253 0.055 0.019 0 0.078
Case2 0.944 2.412 4.296 0.329 0.399 1.625 0.196 0.038 0.381 0.208 0.045 1.253 0.382 0.111 0.324 0.268 0.458 0.352 0 1.423 0.887 0.444 5.882 0.543
Case3 21.266 14.952 24.406 10.977 8.511 21.75 6.68 0.613 12.433 1.48 1.441 21.648 6.972 42.931 8.029 4.883 11.912 6.248 4.949 26.882 9.756 5.366 38.655 12.723
Case4 0.888 0 0.594 0.549 0.105 0.125 0 0 0.571 0.116 0.019 1.177 0.573 0.111 0.081 0.401 0 0.05 0.073 0 0 0 0 0.543
Well, I found an answer to my own problem:
I wonder how I missed it, but emptying the array at the end of each iteration is critical for repeated usage of the same array name (no matter which language/script one uses).
The correct awk was:
cat filename | awk -F "\t" '{i=0;med=0;for(i=2;i<=NF;i++) array[i]=$i;asort(array);print length(array);delete array}'
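For the record, the reason the count grows: gawk's asort() renumbers the array indices to 1..n, so after the first row the array holds indices 1-24; the next row then writes indices 2-25 and the leftover index 1 brings the count to 25. A minimal demonstration of the carry-over (gawk assumed):
$ printf 'x a b\nx c d\n' | gawk '{for (i=2;i<=NF;i++) array[i]=$i; asort(array); print NR, length(array)}'
1 2
2 3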

Why is gmtime implemented this way?

I happened across the source for Minix's gmtime function. I was interested in the bit that calculated the year number from days since epoch. Here are the guts of that bit:
http://www.raspberryginger.com/jbailey/minix/html/gmtime_8c-source.html
http://www.raspberryginger.com/jbailey/minix/html/loc__time_8h-source.html
#define EPOCH_YR 1970
#define LEAPYEAR(year) (!((year) % 4) && (((year) % 100) || !((year) % 400)))
#define YEARSIZE(year) (LEAPYEAR(year) ? 366 : 365)
int year = EPOCH_YR;
while (dayno >= YEARSIZE(year)) {
    dayno -= YEARSIZE(year);
    year++;
}
It looks like the algorithm is O(n), where n is the distance from the epoch. Additionally, it seems that LEAPYEAR must be calculated separately for each year – dozens of times for current dates and many more for dates far in the future. I had the following algorithm for doing the same thing (in this case from the ISO 8601 epoch (year 0 = 1 BC) rather than the UNIX epoch):
#define CYCLE_1   365
#define CYCLE_4   (CYCLE_1 * 4 + 1)
#define CYCLE_100 (CYCLE_4 * 25 - 1)
#define CYCLE_400 (CYCLE_100 * 4 + 1)

year += 400 * (dayno / CYCLE_400);
dayno = dayno % CYCLE_400;
year += 100 * (dayno / CYCLE_100);
dayno = dayno % CYCLE_100;
year += 4 * (dayno / CYCLE_4);
dayno = dayno % CYCLE_4;
year += 1 * (dayno / CYCLE_1);
dayno = dayno % CYCLE_1;
This runs in O(1) for any date, and looks like it should be faster even for dates reasonably close to 1970.
So, assuming that the Minix developers are Smart People who did it their way for a Reason, and probably know a bit more about C than I do, why?
Ran your code as y2 and the Minix code as y1 on a Solaris 9 V245 and got this profiler data:
%Time Seconds Cumsecs #Calls msec/call Name
79.1 0.34 0.34 36966 0.0092 _write
7.0 0.03 0.37 1125566 0.0000 .rem
7.0 0.03 0.40 36966 0.0008 _doprnt
4.7 0.02 0.42 1817938 0.0000 _mcount
2.3 0.01 0.43 36966 0.0003 y2
0.0 0.00 0.43 4 0. atexit
0.0 0.00 0.43 1 0. _exithandle
0.0 0.00 0.43 1 0. main
0.0 0.00 0.43 1 0. _fpsetsticky
0.0 0.00 0.43 1 0. _profil
0.0 0.00 0.43 36966 0.0000 printf
0.0 0.00 0.43 147864 0.0000 .div
0.0 0.00 0.43 73932 0.0000 _ferror_unlocked
0.0 0.00 0.43 36966 0.0000 memchr
0.0 0.00 0.43 1 0. _findbuf
0.0 0.00 0.43 1 0. _ioctl
0.0 0.00 0.43 1 0. _isatty
0.0 0.00 0.43 73932 0.0000 _realbufend
0.0 0.00 0.43 36966 0.0000 _xflsbuf
0.0 0.00 0.43 1 0. _setbufend
0.0 0.00 0.43 1 0. _setorientation
0.0 0.00 0.43 137864 0.0000 _memcpy
0.0 0.00 0.43 3 0. ___errno
0.0 0.00 0.43 1 0. _fstat64
0.0 0.00 0.43 1 0. exit
0.0 0.00 0.43 36966 0.0000 y1
Maybe that is an answer
This is pure speculation, but perhaps MINIX had requirements that were more important than execution speed, such as simplicity, ease of understanding, and conciseness? Some of the code was printed in a textbook, after all.
Your method seems sound, but it's a little more difficult to get it to work for EPOCH_YR = 1970 because you are now mid-cycle on several cycles.
Can you see if you have an equivalent for that case and see whether it's still better?
You're certainly right that it's debatable whether that gmtime() implementation should be used in any high-performance code. That's a lot of busy work to be doing in any tight loops.
Correct approach. You definitely want to go for an O(1) algorithm. It would work for the Mayan calendar without ado. Check the last line: dayno is limited to 0..364, although in leap years it needs to range over 0..365. The line before has a similar flaw.
