Complex formula to count uniques is counting one too many

Complex formula to count uniques is counting one too many - database

I have an issue with computing the number of unique patients and/or MRN. I crossed referenced one patient to their respective ID three times to make sure every patient only has exactly one unique MRN, even those they may appear more than once in the Excel database. My problem is using the formula:
=SUM(IF(FREQUENCY(MATCH(E4:E317,E4:E317,0),MATCH(E4:E317,E4:E317,0))>0,1))
for the names of patients, results in 94, which I'm confident is correct, but:
=SUM(IF(FREQUENCY(MATCH(F4:F317,F4:F317,0),MATCH(F4:F317,F4:F317,0))>0,1))
for the MRN of patients results in 95, which does not match. This initially prompted me that maybe a patient accidentally has two MRN. However, when I crossed-referenced multiple times to make sure one entry at a time, this is not so.
Any ideas why this occurred?

A PivotTable can be a quick and quite easy way to identify where 2-tuples whose components should be unique pairs are not, in a few cases (other solutions may work better where many are not):
The number of 2-tuples that can be checked in a single 'pass' is almost unlimited and mismatches can be identified by blank rows. In the example MRN 4.00 is associated with two names, a and d and it can be seen a is also associated with MRN 1.00 .
Alternatively, Remove Duplicates and sorting would achieve much the same result, though evidenced by one or more repeat values, rather than by blanks.

Try inserting a column next to name. Assuming new column next to name is F place the following in F4:
COUNTIF($E$4:$E$317,E4)
then cut and paste F4 into F5:F317
Sum Column F
Use the same approach for MRM.

Related

Excel calculate smallest of X columns within Y columns, ignoring zeros

I'm trying to calculate the sum of best segments in a run. For example, each Km gives a list as such:
5:40 6:00 5:45 5:55 6:21 6 :30
I'm trying to gather the best segments of 2km/3km/4km etc and would like a simple code to do it. At the moment, I'm using the formula
=Min(If(B1=0,9:9:9,sum(A1:B1),If(C1=0,9:9:9,sum(B1:C1))
but this goes all the way to 50km, meaning a very long formulae that I then have to repeat slightly differently at 3km, then 4km, then 5km etc. Surely there must me a way of
generating an array of summed columns of every n column, then iterating over that to find the min while ignoring values of 0?
I can do it manually for now, but what if I want to go over 50km? I might want to incorporate bike rides/car drives in the future just for some data analysis so I figured it best finding an ideal formulae now.
It's frustrating as I could code it and I want to avoid VBA ideally and stick to formulae in Excel.

Here is a draft of the case where there aren't any zeroes just for groups of 2Km. I decided the simplest approach initially was to add a couple of helper rows containing the running total of times (and for later use counts) and use a formula like this to subtract them in pairs:
=MIN(INDEX(A2:J2,SEQUENCE(1,9,2))-IF(SEQUENCE(1,9,0)=0,0,INDEX(A2:J2,SEQUENCE(1,9,0))))
but if you have access to recent additions to Excel 365 like Scan you can do it without helper rows.
Here is a more realistic scenario with a couple of zeroes thrown in
=LET(runningSum,Y$4:AP$4,runningCount,Y$5:AP$5,cols,COLUMNS(runningSum),leg,X7,
seqEnd,SEQUENCE(1,cols-leg+1,leg),seqStart,SEQUENCE(1,cols-leg+1,0),
times,INDEX(runningSum,seqEnd)-IF(seqStart=0,0,INDEX(runningSum,seqStart)),
counts,INDEX(runningCount,seqEnd)-IF(seqStart=0,0,INDEX(runningCount,seqStart)),
MIN(IF(counts=leg,times)))
Note that there are no runs of more than seven consecutive legs that don't contain a zero so 8, 9, 10 etc. just work out to 0.
As mentioned you could dispense with the helper rows by using Scan, but not everyone has access to this so I will add it separately:
=LET(data,Y$3:AP$3,runningSum,SCAN(0,data,LAMBDA(a,b,a+b)),
runningCount,SCAN(0,data,LAMBDA(a,b,a+(b>0))),leg,X7,cols,COLUMNS(data),
seqEnd,SEQUENCE(1,cols-leg+1,leg),seqStart,SEQUENCE(1,cols-leg+1,0),
times,INDEX(runningSum,seqEnd)-IF(seqStart=0,0,INDEX(runningSum,seqStart)),
counts,INDEX(runningCount,seqEnd)-IF(seqStart=0,0,INDEX(runningCount,seqStart)),
MIN(IF(counts=leg,times)))

Tom that worked! I learnt a few things on the way too and using the indexing method alongside sequence and columns is something I had not thought of. I'd never heard of the LET command before and I can already see that this is going to really help with some of the bigger calculations in the future.
Thank you so much, I'd like to show you how it now looks. Row 3087 is my old formula, row 3088 is a copy of the same data using the new formula, as you can see I've gotten exactly the same results so it's clear that it works perfectly and it is can be easily duplicated.

Stata / Loops to delete every 9th-15th observation out of 15,000 observations

I am trying to create a loop that deletes every 9th-15th observations out of dataset containing around 15,000 observations. What would be the syntax?

In general you should not loop over observations as there is almost definitely a more efficient and elegant solution, as Nick noted above with the drop if missing(turnout) solution. However, I gather you are a new Stata user so I will run through a hypothetical. Suppose that you in fact did want to drop observations within a certain range rather than based on a secondary variable like turnout. The general strategy could be as follows:
Group observations in groups of 15 (which I think you have here with the id column). In the solution below I will assume you do not have this column.
Within each group, drop the observations in the 9th-15th places.
// Group observations
gen group = ceil(_n/15)
// Within each group, drop in the 9-15 locations
by group: drop if inrange(_n,9,15)

drop if mod(_n, 9) == 0
will drop observations 9, 18, 27, and so forth. The same principle applies to any other positive integer.

How to multiply values within a nested array...times values in an another array (in Google Sheets)?

This is hard to explain so my title sucks, and is just my best guess at how I might be able to approach this. I have a Google Sheet of sales data for cases of various bottle sizes of kombucha. Column E is the sale date, Column G contains the item code, and column J is the quantity sold of said cases. See my (vastly simplified) sample data:
https://docs.google.com/spreadsheets/d/17-LzGrNJtBr-FwOZtdaoCws3ayeGOHu_TdtGOfXj4cA/edit?usp=sharing
See my current test code below (also present in the Formula tab of the linked spreadsheet). It successfully gives me the combined number of cases sold of half-liter bottles and Growlers. The values in E4 and E5 are cells containing my start and end dates, respectively, so I'm constraining the results only to those which fall within a certain date range.
This code works, but now I need to figure out a way to sum the total number of bottles sold instead of # of cases. The data set is already massive and pushing the limits of google sheets, so adding a column to the source data sheet with # of bottles per case is not an option. Half liter cases hold 13 bottles, and growlers hold 5. Is there any way to do this with my current approach, using another array perhaps? Or any other approach that keeps the formula as simple as possible?
FYI the current formula is a proof of concept and I will be adding many additional types of cases to the existing formula, each containing a different number of bottles per case, and using it as part of a larger dynamic formula that allows you to switch between showing # cases vs # bottles vs # of actual liters sold, so this is why I am hoping to find an array-based approach that will let me do this without needing to resort to an absurdly long and complex formula of nested IF statements.
=SUMPRODUCT(--((XeroInvoiceData!$E$3:$E>=B4)*(XeroInvoiceData!$E$3:$E<=B5)), (--(ISNUMBER(MATCH(XeroInvoiceData!$G$3:$G, {"HalfLiterCase","GrowlerCase"}, 0)))), XeroInvoiceData!$J$3:$J)
I would be eternally grateful for any assistance.

Here is my solution:
https://docs.google.com/spreadsheets/d/1ig0krumJu4Lj9-nIKJyRfPLTYbU-mzOL0JokRUDEqNc/edit?usp=sharing
My idea was to filter your table on date and sum by the type of container.
I wanted also to allow new types of containers that contain smaller units (bottles or liters).
I divided this job into 3 stages.
First we have to filter this table according to selected dates and container types.
I prepared a list that may be extended (all you need is to extend the filter range).
Then I have to vlookup values of units in each container and I try to do it inside the same formula.
General idea is
={[query results],arrayformula(ifna(vlookup([first column of query],$C$21:$D$26,2,0)*[second column of query])}
I divide it into 2 stages.
First stage referrs to query results in adjacent table:
Second stage uses indexes of query so formula is quite long:
Tell me if it solves your problem.

How many trucks came empty but bought something

I think this is the hardest to date I have had to crack - so hard I had a hard time finding a good headline.
So we have a site where trucks come and buy say Gravel, or sand or other building materials.
Sometimes they also unload demolition waste first.
I need to find out a couple of things
how many trucks (and from what companys) came empty
if they came empty what did they buy from us.
what companys are sending full trucks and what are sending empty trucks.
a tope 10 of materials they will drive to us from to buy even when coming empty to our facility.
a list of all the order numbers that they drove to us til fill and came with empty trucks. ( I have distances linked to order numbers, so now I can estimate the value of our products)
The data I have available:
I have a full data set of when what customer buys what and / or pay to deliver.
E.G.:
I can see the parts I need to split the data into I think it should be something like this
find all unique licence plates
somehow map if they bought materials within 30 minutes of
offloading demolition waste (most trucks will come between 2 and 10
times per day)
Present all this data (on a normal day we have about 800 trucks = 2000 lines since they weigh in, weigh out, and then some buy something = 2 more weigh lines)
I can easily find unique licence plates per day (either by formula or by Excel function Data/delete doublets,
but after that I have no clue where to start.
I think I need some sheets in between, where I somehow mark if a material was bought from an "empty truck" and I need a counter for that .. somehow...
Any help on how to get started is appreciated.

It seems like the best way to start is with a helper column (in the following exampes, I have chosen "Column M") to flag whether the truck arrived empty.
In the helper column, you can use something similar to the following formula.
{=IF(ISBLANK(B2),0,IF(C2="In",0,IF(B2=$B$2:$B$13,IF($C$2:$C$13="In",IF($A$2:$A$13>(A2-TIME(0,30,0)),0,1),1),1)))}
This is an array formula, which means you have to press ctrl+shift+enter after pasting it in the cell. Then you can copy that cell down the column.
Just to explain, the first if statement knows the truck is not arriving empty if Column C is 'In'. The second if statement creates an array and tests to see if other the same truck appears in other rows. The third if statement checks to see if the same truck checked 'In' in the matching rows, and the fourth if statement verifies if the time they checked in was less than thirty minutes ago. You can adjust the length by editing the TIME(0,30,0) function. The format is TIME(hours,minuites,seconds). Unless the truck matches all three of the second, third and fourth if statements, it is marked as coming empty.
Once you have this helper column, just about all of your tasks are quite simple.
1a: How many trucks came empty? Sum Column M
1b: How many trucks from what company? Create a unique list of companies. Then create a COUNTIFS formula based on Column M = 1 and Column K = Company. For example, if C32 had Company B then the formula =COUNTIFS($M$2:$M$13,1,$K$2:$K$13,C32) would return 2
1c: How many times did a truck come empty? Similar to 1b, create a unique list of License Plates, then use a COUNTIFS based on Column M = 1 and Column B = License Plate.
2: Similar to 1b, just use a unique list of products tested against Column F
3: Similar to 1b, just create a second column, next to the first that uses =COUNTIFS($M$2:$M$13,0,$K$2:$K$13,C53,$C$2:$C$13,"In") Which tests that Column M reports the truck did not come empty, that matches the company in Column K and that the truck came 'In' so you don't double count the same truck when it goes 'out'
4: Just sort list created by number 2. You can highlight the range, right-click and select "Sort" > "Custom Sort", then select the column you want to sort on and largest to smallest.
5: There are a couple of different ways, you could do this. The formula
{=TEXTJOIN(", ",TRUE,IF($M$2:$M$13=1,$J$2:$J$13,""))}
(again, entered as an array formula)
would create a comma separated list of order numbers. An alternative if you want a column of order numbers (but would only work if they are actually numbers), is to paste the formula {=MAX(IF($M$2:$M$13=1,$J$2:$J$13,))} in the first row of the column (in my example, its O2) and then {=MAX(IF($M$2:$M$13=1,IF($J$2:$J$13<O2,$J$2:$J$13,)))} in the row below (change the reference to O2 if you pasted it in a different spot)(again, note that both of these are array formulas). Then copy and paste the second formula down the column. When order numbers of trucks that came in empty are exhausted, the formula will report 0.

Return the smallest value from a list, in which only certain values are eligeble - excel

I am having some troubles formulating my problem but I hope you understand!
I have a table of firms building production plants in foreign countries in certain years. (Columns A to C).
In a seperate table i have so-called cross-national distance measures (based on the difference in gdp of the countries). (Columns G to M). Note that the distances change per year.
A simplified version of the excel would look like this:
https://new.wu.ac.at/fileadmin/wu/d/i/iib/photo/stack.JPG
What I want is a formula for the manually entered results in column D. It shall give me a result which is the following:
It shall look in which countries the specific company has previously (years before) built plants
It shall find the smallest cross-national distance from the current country to any of the countries previously entered
The value should be for the year of the current plant-construction
Let me illustrate my request with the example result i would want in cell D8:
The formula would have to find a list of countries that were previously entered in this case Turkey and Bulgaria
It would then have to into the second table and give me the minimum of the distances from Kosovo but only to Turkey and Bulgaria
This would have to be done in the rows for 2008 (current year)
I really hope you guys can help me, i figured out a way to find a minimum in a list and i can do it for certain years as well but the issue i am having that excel first needs to find the previously entered countries, memorize them in some kind of array and then use only these countries to consider the minimum distance.
Thank you very much!

Try this "array formula" for D2 copied down
=IFERROR(SMALL(IF(COUNTIFS(A$2:A$11,A2,B$2:B$11,"<"&B2,C$2:C$11,"<>"&C2,C$2:C$11,I$1:M$1)*(G$2:G$31=B2)*(H$2:H$31=C2),I$2:M$31),1),"N/A")
confirmed with CTRL+SHIFT+ENTER
That checks three conditions for your larger table - that the header row matches a qualifying country (using COUNTIFS function based on criteria in the small table), that column G matches the current year and column H matches the current country.
If all those criteria are satisfied then the relevant values in the table are returned, and SMALL finds the smallest. If there's an error (because there are no qualifying values) then N/A is returned
In Excel 2010 or later versions you can use AGGREGATE function instead of SMALL - this is useful because it doesn't require "array entry"
=IFERROR(AGGREGATE(15,6,I$2:M$31/(COUNTIFS(A$2:A$11,A2,B$2:B$11,"<"&B2,C$2:C$11,"<>"&C2,C$2:C$11,I$1:M$1)>0)/(G$2:G$31=B2)/(H$2:H$31=C2),1),"N/A")

Develop Reference

c reactjs sql-server angularjs arrays wpf database batch-file google-app-engine silverlight