Weighted average function - arrays

I'm trying to create a function that takes the weighted average of an array. I am creating a presentation that shows rates ($) and revenue by market for a client, with a weighted average rate per product at the bottom. I could manually find each market's % share of total revenue, multiply each market's % factor by its rate, and then add all of these values up to find the weighted average rate, but I want to create a function to do it for me. I want to do the following:
For the following client data (fake):
- Asia: rate $16, revenue $200,000
- Europe: rate $9, revenue $50,000
- N. America: rate $21, revenue $100,000
- Africa: rate $25, revenue $250,000
I need to find the weighted average rate across all markets.
Function WeightedAverage(valueArray As Range, weightArray As Range) As Double
    ' valueArray holds the rates {16, 9, 21, 25}; weightArray holds the revenues {200000, 50000, 100000, 250000}
    ' Each weight is that market's share of total revenue, weightArray(i) / Sum(weightArray),
    ' so SumProduct(valueArray, weightArray) / Sum(weightArray) gives the weighted average
    WeightedAverage = Application.WorksheetFunction.SumProduct(valueArray, weightArray) _
                      / Application.WorksheetFunction.Sum(weightArray)
End Function
This should return a weighted average rate of $20.
Can someone help me accomplish this?

You don't need VBA -- a simple formula works. If your data is in cells A1:C4, enter the following formula:
=SUMPRODUCT(B1:B4,C1:C4/SUM(C1:C4))
In this version of the formula I am treating C1:C4/SUM(C1:C4) much like an array formula: it converts the array of revenues (200,000; 50,000; etc.) into the corresponding array of weights (0.33, 0.083, etc.). In many contexts this would require explicitly entering the expression as an array formula, but SUMPRODUCT is a rather powerful function which can apply whole-array calculations to its arguments. See this for a nice discussion.
In this particular case
=SUMPRODUCT(B1:B4,C1:C4)/SUM(C1:C4)
also works and corresponds to Merely Useful's answer in the comments, but I'll leave my answer as it is, since it is useful to know that SUMPRODUCT can in effect evaluate array formulas in its arguments, even if the overall SUMPRODUCT isn't entered as an array formula itself.
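If you want to sanity-check the algebra outside Excel, here is a minimal Python sketch using the question's (made-up) figures, showing the two formulations agree and both give 20:

rates = [16, 9, 21, 25]
revenues = [200000, 50000, 100000, 250000]
total = sum(revenues)

# SUMPRODUCT(B1:B4, C1:C4/SUM(C1:C4)): weight each rate by its revenue share
weights_first = sum(r * (v / total) for r, v in zip(rates, revenues))

# SUMPRODUCT(B1:B4, C1:C4)/SUM(C1:C4): divide the raw sumproduct once at the end
divide_last = sum(r * v for r, v in zip(rates, revenues)) / total

print(weights_first, divide_last)   # both print 20.0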

Well, you don't need a VBA function for this, an array formula will do:
=SUM(A1:A4*B1:B4)/SUM(B1:B4)
(Press Ctrl+Shift+Enter when entering the formula, since it is an array formula)

Related

How to multiply values within a nested array...times values in another array (in Google Sheets)?

This is hard to explain so my title sucks, and is just my best guess at how I might be able to approach this. I have a Google Sheet of sales data for cases of various bottle sizes of kombucha. Column E is the sale date, Column G contains the item code, and column J is the quantity sold of said cases. See my (vastly simplified) sample data:
https://docs.google.com/spreadsheets/d/17-LzGrNJtBr-FwOZtdaoCws3ayeGOHu_TdtGOfXj4cA/edit?usp=sharing
See my current test code below (also present in the Formula tab of the linked spreadsheet). It successfully gives me the combined number of cases sold of half-liter bottles and Growlers. The values in B4 and B5 are cells containing my start and end dates, respectively, so I'm constraining the results only to those which fall within a certain date range.
This code works, but now I need to figure out a way to sum the total number of bottles sold instead of # of cases. The data set is already massive and pushing the limits of google sheets, so adding a column to the source data sheet with # of bottles per case is not an option. Half liter cases hold 13 bottles, and growlers hold 5. Is there any way to do this with my current approach, using another array perhaps? Or any other approach that keeps the formula as simple as possible?
FYI, the current formula is a proof of concept. I will be adding many additional types of cases to it, each containing a different number of bottles per case, and using it as part of a larger dynamic formula that allows switching between showing # of cases vs # of bottles vs # of actual liters sold. This is why I am hoping to find an array-based approach that will let me do this without resorting to an absurdly long and complex formula of nested IF statements.
=SUMPRODUCT(--((XeroInvoiceData!$E$3:$E>=B4)*(XeroInvoiceData!$E$3:$E<=B5)), (--(ISNUMBER(MATCH(XeroInvoiceData!$G$3:$G, {"HalfLiterCase","GrowlerCase"}, 0)))), XeroInvoiceData!$J$3:$J)
I would be eternally grateful for any assistance.
Here is my solution:
https://docs.google.com/spreadsheets/d/1ig0krumJu4Lj9-nIKJyRfPLTYbU-mzOL0JokRUDEqNc/edit?usp=sharing
My idea was to filter your table on date and sum by the type of container.
I also wanted to allow new types of containers that contain smaller units (bottles or liters).
I divided this job into 3 stages.
First we have to filter this table according to selected dates and container types.
I prepared a list that may be extended (all you need is to extend the filter range).
Then I have to VLOOKUP the number of units in each container, and I try to do it inside the same formula.
General idea is
={[query results],arrayformula(ifna(vlookup([first column of query],$C$21:$D$26,2,0)*[second column of query]))}
I divide it into 2 stages.
The first stage refers to the query results in an adjacent table.
The second stage uses indexes of the query directly, so the formula is quite long; see the linked sheet for both versions.
Tell me if it solves your problem.
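For what it's worth, the lookup-and-multiply idea behind this answer is easy to see outside Sheets. Here is a minimal Python sketch; the case names and bottles-per-case counts come from the question, while the rows and date bounds are made up:

# bottles per case, from the question: half-liter cases hold 13, growlers 5
bottles_per_case = {"HalfLiterCase": 13, "GrowlerCase": 5}

# made-up (date, item_code, cases_sold) rows; dates simplified to day numbers
rows = [(1, "HalfLiterCase", 10), (2, "GrowlerCase", 4), (3, "HalfLiterCase", 7)]
start, end = 1, 2   # the date-range bounds (B4 and B5 in the formula)

total_bottles = sum(cases * bottles_per_case.get(code, 0)
                    for day, code, cases in rows
                    if start <= day <= end)
print(total_bottles)   # 10*13 + 4*5 = 150; the day-3 row falls outside the range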

Multiple IF QUARTILEs returning wrong values

I am using a nested IF statement within a QUARTILE wrapper, and it only kind of works: it returns values that are slightly off from what I would expect if I calculate the quartile of the matching values manually.
I've looked around, but most of the posts and research are about designing the formula; I haven't come across anything compelling in terms of this odd behaviour I'm observing.
My formula (entered with Ctrl+Shift+Enter as it's an array formula): =QUARTILE(IF(((F2:$F$10=$W$4)*($Q$2:$Q$10=$W$3))*($E$2:$E$10=W$2),IF($O$2:$O$10<>"",$O$2:$O$10)),1)
The full dataset:
0.868997877*
0.99480118
0.867040346*
0.914032128*
0.988150438
0.981207615*
0.986629288
0.984750004*
0.988983643*
*The formula has 3 AND conditions that need to be met and should return this range:
0.868997877
0.867040346
0.914032128
0.981207615
0.984750004
0.988983643
The 25th percentile is then calculated based on this range.
If I take the output from the formula, 25%-ile (QUARTILE,1) is 0.8803, but if I calculate it manually based on the data points right above, it comes out to 0.8685 and I can't see why.
I suspect it's because the IF statements identify a slightly different range, i.e. the values that meet the IF conditions come from different rows than I expect.
If you look at the table here, you can see that there is more than one way of estimating a quartile (or other percentile) from a sample, and Excel has two. The one you are doing by hand must be like QUARTILE.EXC, and the one you are using in the formula is like QUARTILE.INC (plain QUARTILE behaves the same as QUARTILE.INC).
Basically both formulas work out the rank of the quartile value. If the rank isn't an integer, the value is interpolated (e.g. a rank of 1.5 means the quartile lies half way between the first and second numbers in ascending order). You might think that there wouldn't be much difference, but for small samples there is a massive difference:
QUARTILE.EXC: rank = (N+1)/4
QUARTILE.INC: rank = (N+3)/4
Here's how it works out with your data: with N = 6, QUARTILE.INC interpolates at rank (6+3)/4 = 2.25, which gives 0.8803, while QUARTILE.EXC interpolates at rank (6+1)/4 = 1.75, which gives the 0.8685 you calculated by hand.
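If you want to reproduce both conventions outside Excel, here is a minimal Python sketch of the rank-and-interpolate logic, using the six matching values from the question:

data = sorted([0.868997877, 0.867040346, 0.914032128,
               0.981207615, 0.984750004, 0.988983643])

def quartile_at_rank(values, rank):
    # interpolate the value at a 1-based fractional rank
    i, frac = int(rank), rank - int(rank)
    lo = values[i - 1]
    hi = values[min(i, len(values) - 1)]
    return lo + frac * (hi - lo)

n = len(data)
print(quartile_at_rank(data, (n + 3) / 4))   # QUARTILE.INC, rank 2.25 -> 0.8803
print(quartile_at_rank(data, (n + 1) / 4))   # QUARTILE.EXC, rank 1.75 -> 0.8685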

Multiplying arrays resulting from multiplying arrays in Excel

I've tried looking through some of the posts and I'm having trouble finding something that will help me in this situation.
I have a spreadsheet that has Total Sales, Retail Price, and Inventory for each week in a year for a list of 100 or so projects. These three pieces of info are displayed as columns repeated for the year, with a row for each item.
I was able to add up the total annual cells (every 3rd column) using SUMPRODUCT((MOD(COLUMN(D3:L3),3)=0)*D3:L3)
The next goal is to get a formula to calculate the weighted average retail. I basically need to find a formula that will end up with the SUMPRODUCT of an array of Sales data and Retail data.
I have tried some layering of MMULT and SUMPRODUCT but keep getting #VALUE! errors, particularly with =SUMPRODUCT(TRANSPOSE(MMULT(TRANSPOSE((MOD(COLUMN(D3:L3),3)=0)),D3:L3)),MMULT(TRANSPOSE((MOD(COLUMN(D3:L3),3)=1)),D3:L3)) and with putting braces in there as well: =SUMPRODUCT({TRANSPOSE(MMULT(TRANSPOSE((MOD(COLUMN(D3:L3),3)=0)),D3:L3))},{MMULT(TRANSPOSE((MOD(COLUMN(D3:L3),3)=1)),D3:L3)})
Does anyone have any experience with this type of issue? I feel like it should be something that Excel can do without having to have separate sheets to calculate.
For your weighted average:
=SUMPRODUCT(($D$2:$FD$2="Sales")*$D3:$FD3*$E3:$FE3)/SUMIF($D$2:$FD$2,"Sales",$D3:$FD3)
And also, to add up the Sales, you could consider:
=SUMIF($D$2:$FD$2,"Sales",$D3:$FD3)
I am assuming FD is the last column of data for the year, but change it if that is not the case.
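If it helps to check the logic outside Excel, here is a minimal Python sketch of the same row calculation, assuming the repeating [Total Sales, Retail Price, Inventory] column layout from the question and made-up numbers:

# one row covering three weeks: [Sales, Retail, Inventory] repeated
row = [100, 2.5, 40,   80, 3.0, 25,   60, 2.0, 70]

sales  = row[0::3]   # every 3rd value starting at the Sales column
retail = row[1::3]   # the Retail Price column next to each Sales column

weighted_avg_retail = sum(s * r for s, r in zip(sales, retail)) / sum(sales)
print(weighted_avg_retail)   # (250 + 240 + 120) / 240 = 2.5416...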

TSQL - Do you calculate values then sum, or sum first then calculate values?

I feel stupid asking this - there is probably a math rule I am forgetting.
I am trying to calculate a gross profit based on net sales, cost, and billbacks.
I get two different values based on how I do the calculation:
(sum(netsales) - sum(cost)) + sum(billbackdollars) as CalculateOutsideSum,
sum((netsales - cost) + BillBackDollars) as CalculateWithinSum
This is coming off of a basic transaction fact table.
In this particular example, there are about 90 records being summed, and I get the following results
CalculateOutsideSum: 234.77
CalculateWithinSum: 247.70
I imagined the sum would distribute over the expression and both results would be the same, considering it's just summation.
Which method is correct?
From a mathematical point of view, you should get exactly the same value with both your formulas.
Anyway, in these cases it's better to perform the sum after any per-row calculation.
EDIT AFTER OP'S RESPONSE:
Also, treat your data with the ISNULL function, or other conversions that increase data precision, before summing.
Rounding, formatting, and casts that decrease data precision should be applied after the sums.
Just figured it out...
Problem was Net Sales was NULL for 3 rows, causing the whole per-row calculation to become NULL, so those rows were silently dropped from the sum. After adding an ISNULL, both sums come out the same.
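A minimal Python sketch of that failure mode, mimicking SQL's NULL semantics with None (the numbers are made up): SUM ignores NULL inputs, but any arithmetic expression containing a NULL yields NULL, so the per-row version silently drops whole rows:

rows = [(100.0, 40.0, 5.0), (None, 30.0, 2.0), (80.0, 20.0, 1.0)]

def sql_sum(values):
    return sum(v for v in values if v is not None)   # SUM skips NULLs

def row_expr(netsales, cost, billback):
    if None in (netsales, cost, billback):
        return None   # any NULL operand makes the whole expression NULL
    return (netsales - cost) + billback

outside = (sql_sum(r[0] for r in rows) - sql_sum(r[1] for r in rows)
           + sql_sum(r[2] for r in rows))        # (180 - 90) + 8 = 98
within = sql_sum(row_expr(*r) for r in rows)     # 65 + 61 = 126 (NULL row dropped)
print(outside, within)                           # the two totals differ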

Efficient comparison of 1 million vectors containing (float, integer) tuples

I am working on a chemistry/biology project. We are building a web application for fast matching of the user's experimental data with predicted data in a reference database. The reference database will contain up to a million entries. The data for one entry is a list (vector) of tuples containing a float value between 0.0 and 20.0 and an integer value between 1 and 18, for instance (7.2394, 2), (7.4011, 1), (9.9367, 3), etc.
The user will enter a similar list of tuples and the web-app must then return the - let's say - top 50 best matching database entries.
One thing is crucial: the search algorithm must allow for discrepancies between the query data and the reference data because both can contain small errors in the float values (NOT in the integer values). (The query data can contain errors because it is derived from a real-life experiment and the reference data because it is the result of a prediction.)
Edit - Moved text to answer -
How can we get an efficient ranking of 1 query on 1 million records?
You should add a physicist to the project :-) This is a very common problem when comparing functions; e.g. look here:
http://en.wikipedia.org/wiki/Autocorrelation
http://en.wikipedia.org/wiki/Correlation_function
In the first link you can read: "The SEQUEST algorithm for analyzing mass spectra makes use of autocorrelation in conjunction with cross-correlation to score the similarity of an observed spectrum to an idealized spectrum representing a peptide."
An efficient linear scan of 1 million records of that type should take a fraction of a second on a modern machine; a compiled loop should be able to run at about memory bandwidth, which would transfer that much data in two or three milliseconds.
But, if you really need to optimise this, you could construct a hash table of the integer values, which would divide the job by the number of integer bins. And, if the data is stored sorted by the floats, that improves the locality of matching by those; you know you can stop once you're out of tolerance. Storing the offsets of each of a number of bins would give you a position to start.
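A rough Python sketch of that bucketing idea (the tolerance value and the tiny data set are just placeholders):

import bisect
from collections import defaultdict

# toy reference data in the question's format: (float, int) tuples
reference = [(7.2394, 2), (7.4011, 1), (9.9367, 3), (7.2611, 2)]

# index once: bucket the floats by their integer companion
buckets = defaultdict(list)
for f, k in reference:
    buckets[k].append(f)
for floats in buckets.values():
    floats.sort()   # sorted floats let us bisect and stop early

def candidates(f, k, tol=0.05):
    # scan only the bucket for integer k, within +/- tol of the query float
    floats = buckets.get(k, [])
    lo = bisect.bisect_left(floats, f - tol)
    hi = bisect.bisect_right(floats, f + tol)
    return floats[lo:hi]

print(candidates(7.25, 2))   # [7.2394, 7.2611]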
I guess I don't see the need for a fancy algorithm yet... describe the problem a bit more, perhaps (you can assume a fairly high level of chemistry and physics knowledge if you like; I'm a physicist by training)?
Ok, given the extra info, I still see no need for anything better than a direct linear search, if there's only 1 million reference vectors and the algorithm is that simple. I just tried it, and even a pure Python implementation of linear scan took only around three seconds. It took several times longer to make up some random data to test with. This does somewhat depend on the rather lunatic level of optimisation in Python's sorting library, but that's the advantage of high level languages.
import random

# make up a million random (float, int) reference tuples to test with
r = [(random.uniform(0, 20), random.randint(1, 18)) for i in range(1000000)]

# this is a decorate-sort-undecorate pattern:
# look for matches to the query (7, 9);
# obviously, you can use whatever distance expression you want
zz = [(abs(7 - x) + abs(9 - y), x, y) for x, y in r]
zz.sort()

# print the 50 best matches
print([(x, y) for d, x, y in zz[:50]])
Can't you sort the tuples and perform a binary search on the sorted array?
I assume your database is built once and for all, and the positions of the entries are not important. You can sort this array so that the tuples are in a given order. When a tuple is entered by the user, you just look in the middle of the sorted array. If the query value is larger than the center value, you repeat the work on the upper half, otherwise on the lower one.
Worst case is O(log n).
If you can "map" your reference data to x-y coordinates on a plane there is a nifty technique which allows you to select all points under a given distance/tolerance (using Hilbert curves).
Here is a detailed example.
One approach we are trying ourselves, which allows for the discrepancies between query and reference, is binning the float values. We are testing and want to offer the user the choice of different bin sizes: 0.1, 0.2, 0.3, or 0.4. So binning leaves us with between 50 and 200 bins, each with a corresponding integer value between 0 and 18, where 0 means there was no value within that bin. The reference data can be pre-binned and stored in the database. We can then take the binned query data and compare it with the reference data. One approach could be, for all bins, to subtract the query integer value from the reference integer value. By summing up the absolute differences we get the similarity score, with the most similar reference entries resulting in the lowest scores.
Another (simpler) search option we want to offer is where the user only enters the float values. The integer values in both the query and the reference list can then be set to 1. We then use Hamming distance to compute the difference between the query and the reference binned values. I have previously asked about an efficient algorithm for that search.
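Here is a minimal Python sketch of the binning and scoring described above (bin size 0.1 over the 0.0-20.0 range gives the 200 bins mentioned; treating the differences as absolute values is an assumption):

# turn a list of (float, int) tuples into a fixed-length vector of bin values
def to_bins(entry, bin_size=0.1, max_value=20.0):
    bins = [0] * int(max_value / bin_size)   # 0 means no value in that bin
    for f, k in entry:
        bins[min(int(f / bin_size), len(bins) - 1)] = k
    return bins

def difference_score(query_bins, ref_bins):
    # sum of per-bin absolute differences; lower means more similar
    return sum(abs(q - r) for q, r in zip(query_bins, ref_bins))

def hamming(query_bins, ref_bins):
    # simpler variant: every integer treated as 1, count mismatched bins
    return sum((q != 0) != (r != 0) for q, r in zip(query_bins, ref_bins))

query = to_bins([(7.2394, 2), (9.9367, 3)])
ref   = to_bins([(7.2011, 2), (9.8550, 3)])
print(difference_score(query, ref), hamming(query, ref))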
This binning is only one way of achieving our goal. I am open to other suggestions. Perhaps we can use Principal Component Analysis (PCA), as described here
