awk to evaluate and count comparisons with sum - loops

I'm trying to figure out the right way to pull this off with awk (not very familiar with awk) but I can't seem to get it.
Basically I have a text file with two columns. I want to sum up the second column and then divide each entry of the second column by the sum and increment a counter if the result is less than 0.25. In order to do this it seems like I have to loop through twice, once to get the sum and once to evaluate each entry with the sum. How can I pull this off with a one-liner?
Example Input:
0 5
1 5
2 10
3 5
Example Output:
3 (the sum is 25 and three of the entries result in a value less than 0.25 when divided by 25)
I got stuck trying to do this in bash and realizing I need to use awk to deal with decimals. I can loop through and get the sum and loop through and do a conditional check on each entry but I don't understand how to do both simultaneously.

Untested:
awk '
{ sum+=$2 ; row[NR]=$2 }
END{ for(i=1;i<=NR;i++) if (row[i]/sum < 0.25) {counter+=1}; print counter+0 }
' file

Using awk
$ awk '{sum+=$2;a[NR]=$2}END{for (i in a) if (a[i]/sum<0.25) count++;print count+0}' file
Explanation
sum+=$2 adds column 2 to the running total sum
a[NR]=$2 stores column 2 in array a, indexed by NR (the line number)
for (i in a) visits each index of array a (in no particular order, which is fine for counting)
if (a[i]/sum<0.25) count++ divides the entry by the sum and increments count when the result is below 0.25
print count+0 prints the counter; the +0 makes it print 0 rather than an empty line when nothing matched
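For comparison, here is the same two-pass idea written out in Python. It is only a sketch of the logic (first total the column, then compare each entry against the total), not a replacement for the one-liner. The function name count_below and the sample list are made up for illustration.

```python
def count_below(rows, threshold=0.25):
    """Count second-column values whose share of the column total is below threshold."""
    # First pass: collect column 2 and total it.
    values = [float(line.split()[1]) for line in rows if line.strip()]
    total = sum(values)
    if total == 0:
        return 0  # avoid dividing by zero on empty or all-zero input
    # Second pass: compare each entry against the total.
    return sum(1 for v in values if v / total < threshold)


sample = ["0 5", "1 5", "2 10", "3 5"]
print(count_below(sample))  # prints 3
```

The guard for a zero total is an addition of mine; the awk versions above would fail with a division-by-zero error in that case.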

Related

Issues with SUMPRODUCT in Excel: Trying to count the number of average subtractions above a given threshold

I have a fairly simple issue that I cannot seem to work out. It may be familiar to some of you now.
I have the following matrix (which I will refer to as two arrays):
F G H I J ... R S T U V
1 0 0 1 1
4 4 2 3 5 1 2 3 1 2
2 2 3 1 2 0 1
2 1 0 0 4 0 0 3 0 0
I would like to take the difference between the average of each row in array 1 (columns F:J) and the average of each row in array 2 (columns R:V). For example, the average of F2:J2 = 3.6, the average of R2:V2 = 1.8, and the overall difference = 1.8. I would then like to count the number of overall differences which exceed a given threshold (e.g., 1), but I want to ignore rows which have no entries (see R1:V1) and/or partially missing entries (see the 2nd entry in row F3:J3 and 4th and 5th entry in row R3:V3).
I was lucky enough to be introduced to array formulae by @Tom Sharpe, and have attempted to adapt his code for a similar issue I had, e.g.:
=SUMPRODUCT(--((SUBTOTAL(1,OFFSET(F1,ROW(F1:F4)-ROW(F1),0,1,COLUMNS(F1:J1)))-SUBTOTAL(1,OFFSET(R1,ROW(R1:R4)-ROW(R1),0,1,COLUMNS(R1:V1)))>1)*(SUBTOTAL(2,OFFSET(F1,ROW(F1:F4)-ROW(F1),0,1,COLUMNS(F1:J1)))=COLUMNS(F1:J1))*(SUBTOTAL(2,OFFSET(R1,ROW(R1:R4)-ROW(R1),0,1,COLUMNS(R1:V1)))=COLUMNS(R1:V1))>0))
From what I understand, the formula counts the rows where the difference between the two row averages exceeds 1, provided both rows have full data (i.e., the product of the two completeness tests is > 0). However, it keeps throwing the #DIV/0! error, which I believe stems from the fact that it still tries to take the averages of F1:J1 and R1:V1 (i.e., the empty row), which would produce this kind of error. The correct answer for the current example is 1 (i.e., F2:J2 [3.6] - R2:V2 [1.8] = 1.8 > 1).
Does anyone have any ideas as to how the formula can be adapted for the current purposes, and perhaps a very brief explanation of what is going awry in the current formula?
You're right, SUBTOTAL falls over when it's trying to find the average of a range containing only empty cells.
If you want to persevere and try and do it the same way as before with an array formula, you have to turn it round and put the condition for all the cells in both ranges to be non-blank in an if statement so that it doesn't try and take the average unless both ranges have no blanks:
=SUM(IF((SUBTOTAL(2,OFFSET(F1,ROW(F1:F4)-ROW(F1),0,1,COLUMNS(F1:J1)))=COLUMNS(F1:J1))*(SUBTOTAL(2,OFFSET(R1,ROW(R1:R4)-ROW(R1),0,1,COLUMNS(R1:V1)))=COLUMNS(R1:V1)),
--(SUBTOTAL(1,OFFSET(F1,ROW(F1:F4)-ROW(F1),0,1,COLUMNS(F1:J1)))-SUBTOTAL(1,OFFSET(R1,ROW(R1:R4)-ROW(R1),0,1,COLUMNS(R1:V1)))>1)))
This time, unfortunately, I found I couldn't SUMPRODUCT it (I think this is because of the presence of the IF statement), so you have to enter it as an array formula using Ctrl+Shift+Enter.
Will this work for you?
=IF(NOT(OR(IFERROR(MATCH(TRUE,ISBLANK(F1:J1),0),FALSE),IFERROR(MATCH(TRUE,ISBLANK(R1:V1),0),FALSE))), SUBTOTAL(1,F1:J1)-SUBTOTAL(1,R1:V1), "Missing Value(s)")
My approach is a little different from what you adapted from @Tom Sharpe: I first validate that the cells have data (are not blank) and only then perform the calculation, otherwise I return an error message. This is still an array formula, so when you enter it, press Ctrl+Shift+Enter.
The condition in the opening IF() checks that none of the cells in either range is blank. If MATCH(TRUE,ISBLANK(range),0) finds a match, a cell is blank (bad). If there is no match, i.e. no blank cells, MATCH returns an #N/A error (good), which IFERROR turns into FALSE. So FALSE for both ranges means no blank cells were found.
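To make the "check for blanks first, average second" logic concrete, here is a small Python sketch. It is not Excel and not anyone's actual formula; blank cells are represented by None, the function names are invented, and the exact blank positions in rows 1 and 3 follow the description in the question.

```python
from statistics import mean


def is_complete(cells):
    """True when no cell in the row is blank (None)."""
    return all(c is not None for c in cells)


def count_mean_gaps(left_rows, right_rows, threshold):
    """Count rows where mean(left) - mean(right) > threshold, skipping incomplete rows."""
    count = 0
    for left, right in zip(left_rows, right_rows):
        # Guard first, so the average is never taken over blank cells.
        if not (is_complete(left) and is_complete(right)):
            continue
        if mean(left) - mean(right) > threshold:
            count += 1
    return count


f_to_j = [[1, 0, 0, 1, 1], [4, 4, 2, 3, 5], [2, None, 2, 3, 1], [2, 1, 0, 0, 4]]
r_to_v = [[None] * 5, [1, 2, 3, 1, 2], [2, 0, 1, None, None], [0, 0, 3, 0, 0]]
print(count_mean_gaps(f_to_j, r_to_v, 1))  # prints 1
```

Only row 2 passes both the completeness check and the threshold (3.6 - 1.8 = 1.8), which matches the expected answer of 1.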

Countif the Result of Subtracting Two Arrays Exceeds a Certain Value in Excel

I am new to array formulae and am having trouble with the following scenario:
I have the following matrix:
F G H I J ... R S T U V
1 0 0 1 1
0 1 1 1 2 3 1 2
2 0 2 3 1 2 0 1 0 0
2 1 0 0 1 0 0 3 0 0
My goal is to count the number of rows within which the difference between the sum of columns F:J and the sum of columns R:V is greater than a threshold. Critically, only rows with full data should be included: row 1 (where there are only values for columns F1:J1) and row 2 (where there are only some values for columns F2:J2) should be ignored.
If the threshold = 2.5, then the solution is 1. That is, row 3 is the only row with complete data where the difference between the sum of F3:J3 (8) and the sum of R3:V3 (3) is greater than 2.5 (e.g., 5 > 2.5).
I have tried to put together the following formula, rather pathetically, based on the teachings of @Tom Sharpe and @QHarr:
=COUNT(IF(SUBTOTAL(9,OFFSET(F1,ROW(F1:F4)-ROW(F1),0,1,COLUMNS(F1:J1)))-SUBTOTAL(9,OFFSET(R1,ROW(R1:R4)-ROW(R1),0,1,COLUMNS(R1:V1)))>2.5,IF(AND(SUBTOTAL(2,OFFSET(F1,ROW(F1:F4)-ROW(F1),0,1,COLUMNS(F1:J1)))=COLUMNS(F1:J1),SUBTOTAL(2,OFFSET(R1,ROW(R1:R4)-ROW(R1),0,1,COLUMNS(R1:V1)))=COLUMNS(R1:V1)),SUBTOTAL(9,OFFSET(F1,ROW(F1:F4)-ROW(F1),0,1,COLUMNS(F1:J1)))),IF(AND(SUBTOTAL(2,OFFSET(F1,ROW(F1:F4)-ROW(F1),0,1,COLUMNS(F1:J1)))=COLUMNS(F1:J1),SUBTOTAL(2,OFFSET(R1,ROW(R1:R4)-ROW(R1),0,1,COLUMNS(R1:V1)))=COLUMNS(R1:V1)),SUBTOTAL(9,OFFSET(R1,ROW(R1:V1)-ROW(R1),0,1,COLUMNS(R1:V1))))))
But it always seems to produce a value of 1, even if I edit the matrix such that the difference between the sum of F4:J4 and R4:V4 also exceeds 2.5. Sadly I am struggling to understand why and would appreciate any guidance on the matter.
As an array formula in one cell without volatile functions:
=SUM((MMULT(--(LEN(F2:J5)*LEN(R2:V5)>0),--TRANSPOSE(COLUMN(F2:J2)>0))=5)*(MMULT(F2:J5-R2:V5,TRANSPOSE(--(COLUMN(F2:J2)>0)))>2.5))
should do the trick :D
Maybe, in say X1 (assuming you have labelled your columns):
=COUNTIF(Y:Y,TRUE)
In Y1 whatever your chosen cutoff (eg 2.5) and in Y2:
=((COUNTBLANK(F2:J2)+COUNTBLANK(R2:V2)=0)*SUM(F2:J2)-SUM(R2:V2))>Y$1
copied down to suit.
Try this:
=SUMPRODUCT((MMULT(F1:J4-R1:V4,--(ROW(INDIRECT("1:"&COLUMNS(F1:J4)))>0))>2.5)*(MMULT((LEN(F1:J4)>0)+(LEN(R1:V4)>0),--(ROW(INDIRECT("1:"&COLUMNS(F1:J4)))>0))=(COLUMNS(F1:J4)+COLUMNS(R1:V4))))
I think this will do it, replacing your ANDs with multiplication (*):
=SUMPRODUCT(--((SUBTOTAL(9,OFFSET(F1,ROW(F1:F4)-ROW(F1),0,1,COLUMNS(F1:J1)))-SUBTOTAL(9,OFFSET(R1,ROW(R1:R4)-ROW(R1),0,1,COLUMNS(R1:V1)))>2.5)*(SUBTOTAL(2,OFFSET(F1,ROW(F1:F4)-ROW(F1),0,1,COLUMNS(F1:J1)))=COLUMNS(F1:J1))*(SUBTOTAL(2,OFFSET(R1,ROW(R1:R4)-ROW(R1),0,1,COLUMNS(R1:V1)))=COLUMNS(R1:V1))>0))
It could be simplified a bit more, but I'm a bit short of time.
Just another option...
=IF(NOT(OR(IFERROR(MATCH(TRUE,ISBLANK(F1:J1),0),FALSE),IFERROR(MATCH(TRUE,ISBLANK(R1:V1),0),FALSE))), SUBTOTAL(9,F1:J1)-SUBTOTAL(9,R1:V1), "Missing Value(s)")
My approach is a little different from what you adapted from @Tom Sharpe: I first validate that the cells have data (are not blank) and only then perform the calculation, otherwise I return an error message. This is still an array formula, so when you enter it, press Ctrl+Shift+Enter.
The condition in the opening IF() checks that none of the cells in either range is blank. If MATCH(TRUE,ISBLANK(range),0) finds a match, a cell is blank (bad). If there is no match, i.e. no blank cells, MATCH returns an #N/A error (good), which IFERROR turns into FALSE. So FALSE for both ranges means no blank cells were found.
Then the threshold condition becomes:
=COUNTIF(X1:X4,">"&Threshold)
Note: this one is not an array formula.
I gave the threshold cell (W6) a named range for readability.
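Several of the answers above replace AND with multiplication of 0/1 masks, one mask for "row is complete" and one for "difference exceeds the threshold". A rough Python sketch of that idea follows. Blank cells are None, the names are invented, and the blank positions in row 2 are my assumption, since the flattened matrix in the question does not show where the gaps are.

```python
def count_sum_gaps(left_rows, right_rows, threshold):
    """Sum, over rows, of (row is complete) * (sum difference exceeds threshold)."""
    total = 0
    for left, right in zip(left_rows, right_rows):
        # Mask 1: 1 when every cell in both ranges is filled, else 0.
        complete = int(all(c is not None for c in left + right))
        # Blank cells count as 0 in the sums, as they would in SUM().
        gap = sum(c or 0 for c in left) - sum(c or 0 for c in right)
        # Mask 2: 1 when the difference exceeds the threshold, else 0.
        exceeds = int(gap > threshold)
        # Multiplying the masks acts as a per-row AND.
        total += complete * exceeds
    return total


f_to_j = [[1, 0, 0, 1, 1], [0, 1, 1, None, None], [2, 0, 2, 3, 1], [2, 1, 0, 0, 1]]
r_to_v = [[None] * 5, [1, 2, 3, 1, 2], [2, 0, 1, 0, 0], [0, 0, 3, 0, 0]]
print(count_sum_gaps(f_to_j, r_to_v, 2.5))  # prints 1
```

Because the masks are evaluated per row, editing row 4 so that its difference also exceeds 2.5 raises the result to 2, which is the behaviour the question was after.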

The possible combinations of n digits

I want to write a C function that takes one integer as input and gives me all possible combinations using that many digits.
For example:
cases(3);
Output:
123 132 213 231 312 321
It uses the first three digits to create three-digit numbers. Note that I need this to work for any number of digits n.
Notice that cases(3) has 3! = 6 results.
So cases(4) has 4! = 24 results and so on.
I actually don't even know how to approach this problem, so any help is appreciated.
Recursion for the win :-)
the combinations of 1 digit is 1
the combinations of N digits is the recursive combinations of N - 1 digits with N added at every possible place
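The two rules above translate almost directly into code. Here is a minimal sketch in Python rather than C (the question asks for C, so treat this as pseudocode for the recursion). The name cases comes from the question; everything else is my own choice, and the results come out in a different order than in the question's example.

```python
def cases(n):
    """Return all arrangements of the digits 1..n as strings (assumes 1 <= n <= 9)."""
    if n == 1:
        return ["1"]  # base case: one digit has exactly one arrangement
    result = []
    for shorter in cases(n - 1):
        # Insert the new digit n at every possible position of each shorter arrangement.
        for pos in range(len(shorter) + 1):
            result.append(shorter[:pos] + str(n) + shorter[pos:])
    return result


print(" ".join(sorted(cases(3))))  # prints 123 132 213 231 312 321
print(len(cases(4)))               # prints 24
```

Each of the (n-1)! shorter arrangements yields n new ones, which is where the n! count in the question comes from.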
Try to think of an algorithm before you actually try to write the code.
Think of how you solved the problem in your head when you wrote your desired output down. Just find a systematic way to do this: for example, you start with the lowest number and then check for the other numbers...
I have written the logic and code in Python
# n (number of digits) as input, converted into a list of digit strings
m=int(input("enter number of digits:"))
f=[]
for i in range(1,m+1):
    f.append(str(i))
# dynamic array holding the loop variables of the nested (recursive) for loops
a=[0 for k in range(len(f))]
c=[]# list to which the finished numbers are appended
ans=[]# result in which the
# recursion replacing the if checks inside the for loops:
# returns 1 when a[k] differs from every one of a[0..m-1]
# 1st argument is fixed inside the loop
# 2nd argument will be decreasing
def conditn(k,m):
    if(m==0):
        return 1
    if(m==1):
        if(a[k]!=a[0]):
            return 1
    if(a[k]!=a[m-1] and conditn(k,m-1)):
        return 1
# recursion replacing the nested for loops
# 1st argument y is the length of the number
# 2nd argument is the index of the loop variable used at this level
# 3rd argument is the list c
def loop(y, n,c):
    if n<y-1:
        # recursion until just before the last for loop
        for a[n] in range(y):
            if(conditn(n,n)):
                loop(y, n + 1,c)
    else:
        # last for loop
        if(n==y-1):
            for a[n] in range(y):
                # last recursion of condition
                if(conditn(n,n)):
                    # concatenating the individual digits
                    concat=""
                    for i in range(y):
                        concat+=f[a[i]]+""
                    c.append(concat)
    # returning the list of results for the n digit number
    return c
# printing the list of numbers returned by the recursive call
for j in (loop(len(f),0,c)):
    print(j)

Search pattern and print hits lower than threshold

Here is an example what I need:
INPUT:
a 5
a 7
a 11
b 10
b 11
b 12
.
.
.
OUTPUT:
a 2
b 0
So the output should be the number of hits lower than my threshold (in this case $2 < 10).
My code is:
awk 'OFS="\t" {v[$1]+=$2; n[$1]++} END {for (l in n) {print l, n[l]} }' input
and my output is
a 3
b 3
I am not sure where to put condition $2 < 10.
You can check the threshold condition with something like $2 < value, where value is an awk variable given with -v value=XX.
Also, you are using v[$1]+=$2: this sums, not counts the matching cases.
All together, I would use this:
awk -v t=10 '{list[$1]} $2<t {count[$1]++} END {for (i in list) print i, count[i]+0}' file
Note we need to use two arrays: one to keep track of the counters and another one to keep track of all possible values.
Explanation
-v t=10 provide threshold.
{list[$1]} keep track of all possible first fields appearing.
$2<t {count[$1]++} if the 2nd field is smaller than the threshold, increment the counter.
END {for (i in list) print i, count[i]+0} finally, loop through all the first fields and print the number of times they had a value lower than the threshold. The count[i]+0 trick makes it print 0 if the value is not set.
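If it helps to see the two-array idea outside awk, here is a rough Python equivalent. It is only an illustration of the logic; count_hits and sample are made-up names. A single dictionary plays both roles here: every key is registered with 0 as soon as it is seen, which is what list[$1] plus count[i]+0 achieve in the awk version.

```python
def count_hits(lines, threshold):
    """Per key in column 1, count how many column 2 values are below threshold."""
    counts = {}
    for line in lines:
        fields = line.split()
        if len(fields) < 2:
            continue  # skip blank or malformed lines
        key, value = fields[0], float(fields[1])
        counts.setdefault(key, 0)  # remember the key even if it never matches
        if value < threshold:
            counts[key] += 1
    return counts


sample = ["a 5", "a 7", "a 11", "b 10", "b 11", "b 12"]
for key, hits in count_hits(sample, 10).items():
    print(key, hits)
# prints:
# a 2
# b 0
```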
Test
$ awk -v t=10 '{list[$1]} $2<t {count[$1]++} END {for (i in list) print i, count[i]+0}' a
a 2
b 0

Why does awk seem to randomize the array?

If you look at the output of this awk test, you see that the array seems to be printed in some random order. The order seems to be the same for the same number of input fields. Why does it do so?
echo "one two three four five six" | awk '{for (i=1;i<=NF;i++) a[i]=$i} END {for (j in a) print j,a[j]}'
4 four
5 five
6 six
1 one
2 two
3 three
echo "P04637 1A1U 1AIE 1C26 1DT7 1GZH 1H26 1HS5 1JSP 1KZY 1MA3 1OLG 1OLH 1PES 1PET 1SAE 1SAF 1SAK 1SAL 1TSR 1TUP 1UOL 1XQH 1YC5 1YCQ" | awk '{for (i=1;i<=NF;i++) a[i]=$i} END {for (j in a) print j,a[j]}'
17 1SAF
4 1C26
18 1SAK
5 1DT7
19 1SAL
6 1GZH
7 1H26
8 1HS5
9 1JSP
10 1KZY
20 1TSR
11 1MA3
21 1TUP
12 1OLG
22 1UOL
13 1OLH
23 1XQH
14 1PES
1 P04637
24 1YC5
15 1PET
2 1A1U
25 1YCQ
16 1SAE
3 1AIE
Why does it do so? Is there a rule for this?
From 8. Arrays in awk --> 8.5 Scanning All Elements of an Array in the GNU Awk user's guide when referring to the for (value in array) syntax:
The order in which elements of the array are accessed by this
statement is determined by the internal arrangement of the array
elements within awk and cannot be controlled or changed. This can lead
to problems if new elements are added to array by statements in the
loop body; it is not predictable whether or not the for loop will
reach them. Similarly, changing var inside the loop may produce
strange results. It is best to avoid such things.
So if you want to print the array in the order you store it, then you have to use the classical for loop:
for (j=1; j<=NF; j++) print j,a[j]
Example:
$ awk '{for (i=1;i<=NF;i++) a[i]=$i} END {for (j=1; j<=NF; j++) print j,a[j]}' <<< "P04637 1A1U 1AIE 1C26 1DT7 1GZH 1H26 1HS5 1JSP 1KZY 1MA3 1OLG 1OLH 1PES 1PET 1SAE 1SAF 1SAK 1SAL 1TSR 1TUP 1UOL 1XQH 1YC5 1YCQ"
1 P04637
2 1A1U
3 1AIE
4 1C26
5 1DT7
6 1GZH
7 1H26
8 1HS5
9 1JSP
10 1KZY
11 1MA3
12 1OLG
13 1OLH
14 1PES
15 1PET
16 1SAE
17 1SAF
18 1SAK
19 1SAL
20 1TSR
21 1TUP
22 1UOL
23 1XQH
24 1YC5
25 1YCQ
Awk uses hash tables to implement associative arrays, and this ordering is just an inherent property of that data structure. The location where a particular element is stored depends on the hash of its key. Another factor is the implementation of the hash table: if it is memory efficient, it will limit the range of slots each key can be stored in, using the modulus function or some other method. You may also get clashing hash values for different keys, so chaining will occur, again affecting the order depending on which key was inserted first.
The construct for (key in array) is perfectly fine when used appropriately to loop over every key, but you cannot count on the order, and you should not update the array while in the loop, as you may end up processing array[key] multiple times by mistake.
There is a good description of hash tables in the book Think Complexity.
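To illustrate the hash-table explanation above, here is a toy chained hash table in Python. It is deliberately simplistic and is not how any real awk hashes its keys; the hash function (sum of character codes modulo the bucket count) and the bucket count of 4 are arbitrary choices for the demo. Walking the buckets in storage order gives an order that has nothing to do with insertion order.

```python
def toy_hash(key, n_buckets):
    """A deliberately simple hash: sum of character codes modulo the bucket count."""
    return sum(ord(ch) for ch in key) % n_buckets


def bucket_order(keys, n_buckets=4):
    """Insert keys into chained buckets, then list them in bucket (storage) order."""
    buckets = [[] for _ in range(n_buckets)]
    for key in keys:
        # Colliding keys are chained in the order they were inserted.
        buckets[toy_hash(key, n_buckets)].append(key)
    return [key for bucket in buckets for key in bucket]


keys = ["1", "2", "3", "4", "5", "6"]
print(bucket_order(keys))     # prints ['4', '1', '5', '2', '6', '3']
print(sorted(keys, key=int))  # prints ['1', '2', '3', '4', '5', '6']
```

The first line shows the storage order, with colliding keys such as 1 and 5 chained together. The second shows that an explicit ordering has to be imposed from outside, which is what the classical for loop does in awk.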
The issue is the operator you use to get the array indices, not the fact that the array is stored in a hash table.
The in operator provides the array indices in a random(-looking) order (which IS by default related to the hash table but that's an implementation choice and can be modified).
A for loop that explicitly provides the array indices in numerically increasing order operates on the same hash table that the in operator works on, but it produces output in a specific order regardless.
It's just 2 different ways of getting the array indices, both of which work on a hash table.
man awk and look up the in operator.
If you want to control the output order using the in operator, you can do so with GNU awk (from release 4.0 on) by populating PROCINFO["sorted_in"]. See http://www.gnu.org/software/gawk/manual/gawk.html#Controlling-Array-Traversal for details.
Some common ways to access array indices:
To print array elements in an order you don't care about:
{a[$1]=$0} END{for (i in a) print i, a[i]}
To print array elements in numeric order of indices if the indices are numeric and contiguous starting at 1:
{a[++i]=$0} END{for (i=1;i in a;i++) print i, a[i]}
To print array elements in numeric order of indices if the indices are numeric but non-contiguous:
{a[$1]=$0; min=((NR==1||$1<min)?$1:min); max=((NR==1||$1>max)?$1:max)} END{for (i=min;i<=max;i++) if (i in a) print i, a[i]}
To print array elements in the order they were seen in the input:
{a[$1]=$0; b[++max]=$1} END{for (i=1;i <= max;i++) print b[i], a[b[i]]}
To print array elements in a specific order of indices using gawk 4.0+:
BEGIN{PROCINFO["sorted_in"]=whatever} {a[$1]=$0} END{for (i in a) print i, a[i]}
For anything else, write your own code and/or see gawk asort() and asorti().
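For readers more at home in Python, two of the cases above (numeric order of sparse indices, and order of first appearance) look roughly like this. This is an analogy only: Python dictionaries keep insertion order (since Python 3.7), whereas awk arrays make no such promise, which is why the awk versions need the extra bookkeeping. The sample lines are invented.

```python
lines = ["30 thirty", "4 four", "15 fifteen"]

# Store each line under its numeric first field, like a[$1]=$0.
records = {}
for line in lines:
    records[int(line.split()[0])] = line

# Numeric order of indices: sort the keys explicitly before looping.
in_numeric_order = [records[i] for i in sorted(records)]

# Order seen in the input: the dict already remembers it.
in_input_order = list(records.values())

print(in_numeric_order)  # prints ['4 four', '15 fifteen', '30 thirty']
print(in_input_order)    # prints ['30 thirty', '4 four', '15 fifteen']
```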
If you are using gawk or mawk, you can also set the environment variable WHINY_USERS, which sorts the indices before iterating (as far as I know, gawk dropped this in release 4.0 in favor of PROCINFO["sorted_in"]).
Example:
echo "one two three four five six" | WHINY_USERS=true awk '{for (i=1;i<=NF;i++) a[i]=$i} END {for (j in a) print j,a[j]}'
1 one
2 two
3 three
4 four
5 five
6 six
From mawk's manual:
WHINY_USERS
This is an undocumented gawk feature. It tells mawk to sort array indices before it starts to iterate over the elements of an array.
