Multiple IF QUARTILEs returning wrong values - arrays

I am using a nested IF statement inside a QUARTILE wrapper, and it only partly works: it returns values that are slightly off from what I would expect when I calculate the quartile of the matching range manually.
I've looked around, but most of the posts and research are about designing the formula itself; I haven't come across anything that explains the odd behaviour I'm observing.
My formula (entered with Ctrl+Shift+Enter as it's an array formula): =QUARTILE(IF(((F2:$F$10=$W$4)*($Q$2:$Q$10=$W$3))*($E$2:$E$10=W$2),IF($O$2:$O$10<>"",$O$2:$O$10)),1)
The full dataset:
0.868997877*
0.99480118
0.867040346*
0.914032128*
0.988150438
0.981207615*
0.986629288
0.984750004*
0.988983643*
*The formula has 3 AND conditions that need to be met; the starred values meet them, so the formula should return this range:
0.868997877
0.867040346
0.914032128
0.981207615
0.984750004
0.988983643
The 25th percentile is then calculated from that range.
The formula's output for the 25th percentile (QUARTILE with quart = 1) is 0.8803, but if I calculate it manually from the data points right above, it comes out to 0.8685, and I can't see why.
My feeling is that the IF statements are identifying a slightly different range, i.e. the values that meet the conditions are coming from different rows than I expect.

If you look at the table here, you can see that there is more than one way of estimating a quartile (or other percentile) from a sample, and Excel has two. The one you are doing by hand must work like QUARTILE.EXC, and the one you are using in the formula works like QUARTILE.INC.
Basically, both functions work out the rank of the quartile value. If it isn't an integer, they interpolate (e.g. a rank of 1.5 means the quartile lies halfway between the first and second numbers in ascending order). You might think there wouldn't be much difference between the two, but for small samples there is a massive difference:
QUARTILE.EXC: rank = (N+1)/4
QUARTILE.INC: rank = (N+3)/4
Here's how it would look with your data:
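For example, here is a minimal Python sketch of the interpolation applied to the six matching values (ranks are 1-based, and the fractional part of the rank is interpolated linearly):

# Check both rank conventions on the six values that meet the conditions.
data = sorted([0.868997877, 0.867040346, 0.914032128,
               0.981207615, 0.984750004, 0.988983643])

def value_at_rank(values, rank):
    # interpolate the value at a 1-based fractional rank
    lo = int(rank) - 1          # index of the lower neighbour
    frac = rank - int(rank)     # fractional part to interpolate by
    if frac == 0:
        return values[lo]
    return values[lo] + frac * (values[lo + 1] - values[lo])

n = len(data)
print(value_at_rank(data, (n + 3) / 4))   # QUARTILE.INC rank 2.25 -> ~0.8803
print(value_at_rank(data, (n + 1) / 4))   # QUARTILE.EXC rank 1.75 -> ~0.8685

So the formula's 0.8803 and the hand calculation's 0.8685 are both "correct"; they just use different rank conventions.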

Related

Is there a way to enable/disable subsets of data taken by a formula?

There are 2 sheets: Data and Stats. Data contains records (several hundred) that are divided into ~20 categories. On the Stats sheet, I want to do various calculations. Is there a way to compose formulas in a way to take in only the categories that are enabled?
Previously, this was implemented in Excel, where the SUMIFS function makes it possible (the conditions are inclusive). But in Google Sheets, SUMIFS does not evaluate further conditions if condition 1 is true. Or I'm doing something wrong; totally possible.
=SUMIF(Data!K:K, "<="&G3, Data!N:N). This version has a single condition. It says 'give me the sum of column N on the Data sheet, provided the corresponding value of K is less than or equal to G3'.
=SUMIFS(Data!N:N, Data!K:K, "<="&G3, Data!B:B, G2:G20) (the syntax is slightly different from SUMIF).
Here, the range G2:G20 contains the list of categories that I want to be able to switch on and off, and column B contains the category labels. In Excel, when you specify a range like this, it works. In Google Sheets, not so much.
Regardless of the Excel implementation, what would be the best way to achieve this on/off functionality in Google Sheets?
try:
=SUMPRODUCT(FILTER(Data!N:N, Data!K:K<=G3,
REGEXMATCH(""&Data!B:B, TEXTJOIN("|", 1, G2:G20))))
note that G3 overlaps with G2:G20, so something is not right in your referencing, though
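In case it helps to see the on/off logic outside of spreadsheet formulas, here is a hypothetical Python sketch of the same idea (the data and labels are made up):

# Sum column N over rows whose K value is <= a threshold (G3) and whose
# category label (column B) appears in the enabled list (G2:G20).
rows = [
    {"B": "fruit",  "K": 3, "N": 10},
    {"B": "dairy",  "K": 5, "N": 20},
    {"B": "bakery", "K": 9, "N": 30},
]
threshold = 6                      # plays the role of G3
enabled = {"fruit", "bakery"}      # plays the role of G2:G20

total = sum(r["N"] for r in rows
            if r["K"] <= threshold and r["B"] in enabled)
print(total)  # 10: dairy is disabled, bakery fails the K test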

Excel calculate smallest of X columns within Y columns, ignoring zeros

I'm trying to calculate the sum of best segments in a run. For example, each Km gives a list as such:
5:40 6:00 5:45 5:55 6:21 6:30
I'm trying to gather the best segments of 2km/3km/4km etc and would like a simple code to do it. At the moment, I'm using the formula
=MIN(IF(B1=0,"9:9:9",SUM(A1:B1)),IF(C1=0,"9:9:9",SUM(B1:C1)))
but this goes all the way to 50 km, meaning a very long formula that I then have to repeat slightly differently at 3 km, then 4 km, then 5 km, etc. Surely there must be a way of generating an array of summed columns for every n columns, then iterating over that to find the min while ignoring values of 0?
I can do it manually for now, but what if I want to go over 50 km? I might want to incorporate bike rides/car drives in the future just for some data analysis, so I figured it best to find an ideal formula now.
It's frustrating, as I could code this easily, but I'd ideally like to avoid VBA and stick to formulas in Excel.
Here is a draft for the case where there aren't any zeroes, just for groups of 2 km. I decided the simplest approach initially was to add a couple of helper rows containing the running totals of times (and, for later use, counts) and use a formula like this to subtract them in pairs:
=MIN(INDEX(A2:J2,SEQUENCE(1,9,2))-IF(SEQUENCE(1,9,0)=0,0,INDEX(A2:J2,SEQUENCE(1,9,0))))
but if you have access to recent additions to Excel 365 like SCAN, you can do it without helper rows.
Here is a more realistic scenario with a couple of zeroes thrown in:
=LET(runningSum,Y$4:AP$4,runningCount,Y$5:AP$5,cols,COLUMNS(runningSum),leg,X7,
seqEnd,SEQUENCE(1,cols-leg+1,leg),seqStart,SEQUENCE(1,cols-leg+1,0),
times,INDEX(runningSum,seqEnd)-IF(seqStart=0,0,INDEX(runningSum,seqStart)),
counts,INDEX(runningCount,seqEnd)-IF(seqStart=0,0,INDEX(runningCount,seqStart)),
MIN(IF(counts=leg,times)))
Note that there are no runs of more than seven consecutive legs that don't contain a zero, so 8, 9, 10 etc. just work out to 0.
As mentioned, you could dispense with the helper rows by using SCAN, but not everyone has access to it, so I will add that version separately:
=LET(data,Y$3:AP$3,runningSum,SCAN(0,data,LAMBDA(a,b,a+b)),
runningCount,SCAN(0,data,LAMBDA(a,b,a+(b>0))),leg,X7,cols,COLUMNS(data),
seqEnd,SEQUENCE(1,cols-leg+1,leg),seqStart,SEQUENCE(1,cols-leg+1,0),
times,INDEX(runningSum,seqEnd)-IF(seqStart=0,0,INDEX(runningSum,seqStart)),
counts,INDEX(runningCount,seqEnd)-IF(seqStart=0,0,INDEX(runningCount,seqStart)),
MIN(IF(counts=leg,times)))
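For reference, here is a minimal Python sketch of the same running-sum idea, assuming the times are plain numbers and a 0 marks a missing segment:

# Best (minimum) total over every window of `leg` consecutive segments,
# skipping windows that contain a zero. run_sum/run_cnt play the role of
# the helper rows (or SCAN results) in the formulas above.
def best_leg(times, leg):
    run_sum = [0]   # running total of times
    run_cnt = [0]   # running count of non-zero entries
    for t in times:
        run_sum.append(run_sum[-1] + t)
        run_cnt.append(run_cnt[-1] + (1 if t > 0 else 0))
    best = None
    for start in range(len(times) - leg + 1):
        # a window is valid only if all `leg` entries are non-zero
        if run_cnt[start + leg] - run_cnt[start] == leg:
            total = run_sum[start + leg] - run_sum[start]
            best = total if best is None else min(best, total)
    return best

print(best_leg([5.40, 6.00, 0, 5.55, 6.21, 6.30], 2))  # 11.4 (5.40 + 6.00)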
Tom, that worked! I learnt a few things on the way too, and using the indexing method alongside SEQUENCE and COLUMNS is something I had not thought of. I'd never heard of the LET function before, and I can already see that it is going to really help with some of the bigger calculations in the future.
Thank you so much. I'd like to show you how it looks now: row 3087 is my old formula, and row 3088 is a copy of the same data using the new formula. As you can see, I've got exactly the same results, so it's clear that it works perfectly and can easily be duplicated.

How to multiply values within a nested array...times values in another array (in Google Sheets)?

This is hard to explain, so my title sucks; it's just my best guess at how I might approach this. I have a Google Sheet of sales data for cases of various bottle sizes of kombucha. Column E is the sale date, column G contains the item code, and column J is the quantity of cases sold. See my (vastly simplified) sample data:
https://docs.google.com/spreadsheets/d/17-LzGrNJtBr-FwOZtdaoCws3ayeGOHu_TdtGOfXj4cA/edit?usp=sharing
See my current test code below (also present in the Formula tab of the linked spreadsheet). It successfully gives me the combined number of cases sold of half-liter bottles and Growlers. The values in E4 and E5 are cells containing my start and end dates, respectively, so I'm constraining the results only to those which fall within a certain date range.
This code works, but now I need to figure out a way to sum the total number of bottles sold instead of # of cases. The data set is already massive and pushing the limits of google sheets, so adding a column to the source data sheet with # of bottles per case is not an option. Half liter cases hold 13 bottles, and growlers hold 5. Is there any way to do this with my current approach, using another array perhaps? Or any other approach that keeps the formula as simple as possible?
FYI, the current formula is a proof of concept. I will be adding many additional types of cases to it, each holding a different number of bottles per case, and using it as part of a larger dynamic formula that lets you switch between showing # of cases vs # of bottles vs # of actual liters sold. This is why I am hoping to find an array-based approach that avoids an absurdly long and complex formula of nested IF statements.
=SUMPRODUCT(--((XeroInvoiceData!$E$3:$E>=B4)*(XeroInvoiceData!$E$3:$E<=B5)), (--(ISNUMBER(MATCH(XeroInvoiceData!$G$3:$G, {"HalfLiterCase","GrowlerCase"}, 0)))), XeroInvoiceData!$J$3:$J)
I would be eternally grateful for any assistance.
Here is my solution:
https://docs.google.com/spreadsheets/d/1ig0krumJu4Lj9-nIKJyRfPLTYbU-mzOL0JokRUDEqNc/edit?usp=sharing
My idea was to filter your table on date and sum by the type of container.
I also wanted to allow new types of containers that hold smaller units (bottles or liters).
I divided this job into 3 stages.
First we have to filter this table according to selected dates and container types.
I prepared a list that may be extended (all you need is to extend the filter range).
Then I have to vlookup the number of units in each container, and I try to do it inside the same formula.
General idea is
={[query results],arrayformula(ifna(vlookup([first column of query],$C$21:$D$26,2,0)*[second column of query]))}
I divided this step into 2 stages.
The first stage refers to the query results in an adjacent table.
The second stage uses the indexes of the query, so the formula is quite long.
Tell me if it solves your problem.
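To make the lookup-and-multiply idea concrete, here is a hypothetical Python sketch (the bottles-per-case numbers 13 and 5 come from the question; the dates and quantities are made up):

from datetime import date

# small lookup table: case type -> bottles per case
bottles_per_case = {"HalfLiterCase": 13, "GrowlerCase": 5}

# (sale date, item code, cases sold) -- stand-ins for columns E, G and J
sales = [
    (date(2023, 1, 5), "HalfLiterCase", 2),
    (date(2023, 1, 9), "GrowlerCase",   3),
    (date(2023, 2, 1), "HalfLiterCase", 1),   # outside the date range
]
start, end = date(2023, 1, 1), date(2023, 1, 31)

# filter on date range and case type, then multiply each quantity by
# its bottles-per-case factor (the vlookup step) and sum
bottles = sum(qty * bottles_per_case[code]
              for d, code, qty in sales
              if start <= d <= end and code in bottles_per_case)
print(bottles)  # 2*13 + 3*5 = 41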

Mathematica Findroot Exploring the parameter space

I am solving three non-linear equations in three variables (H0D,H0S and H1S) using FindRoot. In addition to the three variables of interest, there are four parameters in these equations that I would like to be able to vary. My parameters and the range in which I want to vary them are as follows:
CF ∈ {0, 15}, CR ∈ {0, 8}, T ∈ {0, 0.35}, H1R ∈ {40, 79}
The problem is that my non-linear system may not have any solutions for part of this parameter range. What I basically want to ask is if there is a smart way to find out exactly what part of my parameter range admits real solutions.
I could run FindRoot inside a loop, but because of the non-linearity, FindRoot is very sensitive to initial conditions, so error messages could frequently be due to bad initial conditions rather than the absence of a solution.
Is there a way for me to find out what parameter space works, short of plugging 10^4 combinations of parameter values by hand and playing around with the initial conditions and hoping that FindRoot gives me a solution?
Thanks a lot,
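A minimal sketch of the loop-with-multiple-starts idea (in Python with SciPy rather than Mathematica, and with placeholder equations, since the actual system isn't shown): try several random initial guesses at each parameter combination and record whether any solve converges.

import numpy as np
from scipy.optimize import fsolve

def F(x, CF, CR, T, H1R):
    H0D, H0S, H1S = x
    # placeholder equations -- substitute the real three-equation system
    return [H0D + H0S - CF, H0S * H1S - CR, H1S - T * H1R - 1.0]

rng = np.random.default_rng(0)
solvable = []
for CF in np.linspace(0, 15, 5):           # coarse grid over two of the
    for H1R in np.linspace(40, 79, 5):     # four parameters, for brevity
        ok = False
        for _ in range(20):                # 20 random starting points
            x0 = rng.uniform(0.1, 10.0, size=3)
            sol, info, ier, msg = fsolve(F, x0, args=(CF, 4.0, 0.2, H1R),
                                         full_output=True)
            if ier == 1:                   # ier == 1 means it converged
                ok = True
                break
        solvable.append((CF, H1R, ok))

The same pattern carries over to FindRoot: loop over random starting values and check the residual of the returned point before declaring that region of parameter space unsolvable.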

Efficient comparison of 1 million vectors containing (float, integer) tuples

I am working on a chemistry/biology project. We are building a web application for fast matching of the user's experimental data against predicted data in a reference database. The reference database will contain up to a million entries. The data for one entry is a list (vector) of tuples containing a float value between 0.0 and 20.0 and an integer value between 1 and 18. For instance: (7.2394, 2), (7.4011, 1), (9.9367, 3), etc.
The user will enter a similar list of tuples and the web-app must then return the - let's say - top 50 best matching database entries.
One thing is crucial: the search algorithm must allow for discrepancies between the query data and the reference data because both can contain small errors in the float values (NOT in the integer values). (The query data can contain errors because it is derived from a real-life experiment and the reference data because it is the result of a prediction.)
How can we get an efficient ranking of 1 query on 1 million records?
You should add a physicist to the project :-) This is a very common problem when comparing functions; e.g. look here:
http://en.wikipedia.org/wiki/Autocorrelation
http://en.wikipedia.org/wiki/Correlation_function
In the first link you can read: "The SEQUEST algorithm for analyzing mass spectra makes use of autocorrelation in conjunction with cross-correlation to score the similarity of an observed spectrum to an idealized spectrum representing a peptide."
An efficient linear scan of 1 million records of that type should take a fraction of a second on a modern machine; a compiled loop should be able to do it at about memory bandwidth, which would transfer that much data in two or three milliseconds.
But, if you really need to optimise this, you could construct a hash table keyed on the integer values, which would divide the job by the number of integer bins. And, if the data is stored sorted by the floats, that improves the locality of matching on those: you know you can stop once you're out of tolerance. Storing the offset of each of a number of bins would give you a position to start from.
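Here is a minimal sketch of that bucketing idea, assuming each reference entry is a list of (float, int) tuples:

import bisect
from collections import defaultdict

def build_buckets(entry):
    # group the floats of one entry by their integer value and keep
    # each bucket sorted, so tolerance lookups can use binary search
    buckets = defaultdict(list)
    for f, i in entry:
        buckets[i].append(f)
    for floats in buckets.values():
        floats.sort()
    return buckets

def count_matches(buckets, query, tol=0.05):
    # count query tuples that have a float within `tol` in the matching
    # integer bucket; the sort order means we can stop at the first
    # candidate past the tolerance window
    hits = 0
    for f, i in query:
        floats = buckets.get(i, [])
        lo = bisect.bisect_left(floats, f - tol)
        if lo < len(floats) and floats[lo] <= f + tol:
            hits += 1
    return hits

entry = [(7.2394, 2), (7.4011, 1), (9.9367, 3)]
query = [(7.25, 2), (9.90, 3), (5.00, 1)]
print(count_matches(build_buckets(entry), query))  # 2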
I guess I don't see the need for a fancy algorithm yet... describe the problem a bit more, perhaps (you can assume a fairly high level of chemistry and physics knowledge if you like; I'm a physicist by training)?
OK, given the extra info, I still see no need for anything better than a direct linear search, if there are only 1 million reference vectors and the algorithm is that simple. I just tried it, and even a pure Python implementation of a linear scan took only around three seconds. It took several times longer to make up some random data to test with. This does somewhat depend on the rather lunatic level of optimisation in Python's sorting library, but that's the advantage of high-level languages.
import random

# make up a million random (float, integer) reference tuples
r = [(random.uniform(0, 20), random.randint(1, 18)) for i in range(1000000)]

# this is a decorate-sort-undecorate pattern:
# look for matches to (7, 9); obviously, you can use whatever
# distance expression you want
zz = [(abs((7 - x) + (9 - y)), x, y) for x, y in r]
zz.sort()

# return the 50 best matches
[(x, y) for a, x, y in zz[:50]]
Can't you sort the tuples and perform a binary search on the sorted array?
I assume your database is built once and for all, and the positions of the entries are not important. You can sort the array so that the tuples are in a given order. When a tuple is entered by the user, you just look at the middle of the sorted array. If the query value is larger than the centre value, you repeat the work on the upper half, otherwise on the lower one.
Worst case is O(log n).
If you can "map" your reference data to x-y coordinates on a plane there is a nifty technique which allows you to select all points under a given distance/tolerance (using Hilbert curves).
Here is a detailed example.
One approach we are trying ourselves, which allows for the discrepancies between query and reference, is binning the float values. We are testing it and want to offer the user a choice of bin sizes: 0.1, 0.2, 0.3 or 0.4. Binning leaves us with between 50 and 200 bins, each with a corresponding integer value between 0 and 18, where 0 means there was no value within that bin. The reference data can be pre-binned and stored in the database. We can then take the binned query data and compare it with the reference data. One approach could be: for all bins, subtract the query integer value from the reference integer value; summing up all the differences gives the similarity score, with the most similar reference entries resulting in the lowest scores.
Another (simpler) search option we want to offer is one where the user only enters the float values. The integer values in both the query and the reference list can then be set to 1. We then use the Hamming distance to compute the difference between the binned query and reference values. I have previously asked about an efficient algorithm for that search.
This binning is only one way of achieving our goal. I am open to other suggestions. Perhaps we can use Principal Component Analysis (PCA), as described here
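A minimal sketch of the binning-and-scoring idea described above (using absolute differences so that positive and negative errors don't cancel; the bin width and value ranges follow the description):

def to_bins(entry, bin_size=0.1, max_value=20.0):
    # bucket the floats into fixed-width bins; each bin stores the
    # integer value, and 0 means nothing fell into that bin
    n_bins = int(max_value / bin_size)
    bins = [0] * n_bins
    for f, i in entry:
        bins[min(int(f / bin_size), n_bins - 1)] = i
    return bins

def score(query_bins, ref_bins):
    # summed per-bin difference: lower score = more similar
    return sum(abs(q - r) for q, r in zip(query_bins, ref_bins))

ref   = to_bins([(7.2394, 2), (7.4011, 1), (9.9367, 3)])
query = to_bins([(7.2500, 2), (7.4200, 1), (9.9100, 3)])
print(score(query, ref))  # 0 -- identical once binned at width 0.1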
