I'm looking to upgrade a standard formula into an array formula that returns the number of text matches between two data sets, based on set criteria.
Please refer to this Google Sheet: https://docs.google.com/spreadsheets/d/1PqBn6PPF0L21pK9fwA8jVAWp3D2_diftg4MCIz0oBGg/edit#gid=1979770676
My overarching aim is to matchmake between a set of mentors and mentees, using a text-matching scoring system.
Baseline MVP
I first did this using match-based scoring: each text match between a mentor and a mentee gives 1 point, and the sum of the matches between each mentor and mentee gives the final score for that mentor-mentee pair. This is represented in a matrix in the "Matching MVP" tab, with the formula:
=SUMPRODUCT(COUNTIF(IF(B$1=Sandbox!$A$2:$A$20,Sandbox!$D$2:$D$20,"x"),IF($A2=Sandbox!$A$23:$A$35,Sandbox!$D$23:$D$35,"o")))
What I have done:
To further refine the scoring, I want to add an element of "intensity", where if a mentor/mentee has more than 1 tag in the same category, the match-score in that category between a mentor/mentee pair will be scaled downwards by a factor of 1/sqrt(no. of tags in that category).
For example, if mentee 1 has the "South Korea" and "Japan" tags in the "Country" category, their "Country" category score will be scaled down as follows: Match Score * 1/sqrt(2), giving an intensity-adjusted score for that category.
This intensity factor is applied for each category, and the final match-score will be a sum-product of the arrays of the intensity factors of each category (see G11:G13), and the raw match-counts in each category for a mentor-mentee pair (see K11:K13). The final result can be seen in cell G17.
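To make the target number concrete, here is a rough Python sketch of the intensity-adjusted score for a single pair. It is not a sheet formula, just the arithmetic described above. It assumes that each person's tags are unique (category, tag) pairs and that the factor is driven by the tag counts of one side only, as in G11. The "Industry"/"Finance" tags and the function name are invented for illustration.

```python
from collections import Counter
from math import sqrt

def intensity_adjusted_score(scaled_tags, other_tags):
    """Sum over categories of (raw match count) * 1/sqrt(number of tags).

    scaled_tags: (category, tag) pairs of the person whose tag counts set the factor.
    other_tags:  (category, tag) pairs of the other person in the pair.
    Assumes no duplicate tags on either side.
    """
    tags_per_category = Counter(category for category, _ in scaled_tags)
    other = set(other_tags)
    score = 0.0
    for category, n_tags in tags_per_category.items():
        # Raw match count for this category (what K11:K13 computes per row).
        matches = sum(
            1 for cat, tag in scaled_tags if cat == category and (cat, tag) in other
        )
        # Intensity factor for this category (what G11:G13 computes per row).
        score += matches / sqrt(n_tags)
    return score

# Invented sample data: two "Country" tags, one hypothetical "Industry" tag.
mentee = [("Country", "South Korea"), ("Country", "Japan"), ("Industry", "Finance")]
mentor = [("Country", "Japan"), ("Industry", "Finance")]
print(intensity_adjusted_score(mentee, mentor))
```

With this sample, "Country" contributes 1 match scaled by 1/sqrt(2) and "Industry" contributes 1 match at full weight.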
The intensity-factor computation is an array formula in G11 with the following:
=ArrayFormula(1/sqrt(COUNTIF(IF(G3=$A$2:$A$20,$B$2:$B$20,""),G4:G6)))
and the raw match-scores come from a standard drag-down formula in K11:K13, referencing cells G4:G6 for the categories:
=SUMPRODUCT(COUNTIF(FILTER(IF($G$3=$A$2:$A$20,$D$2:$D$20,""),REGEXMATCH(IF($G$3=$A$2:$A$20,$D$2:$D$20,""),G4)),FILTER(IF($G$2=$A$23:$A$35,$D$23:$D$35,""),REGEXMATCH(IF($G$2=$A$23:$A$35,$D$23:$D$35,""),G4))))
Where I need help:
I want to replace the formulas in K11:K13 with an array formula instead, that returns the exact same result. This is because I want to nest all the array formulas into one sum-product, which can then fit very nicely into my matrix in the "Matching MVP" tab, replacing the simple match-scoring MVP. The number of categories in G4:G6 will also expand, and I want the array formula to account for that.
It is entirely possible that the final result I want (in G17) can be accomplished in an entirely different way, so do feel free to try other methods of applying the intensity factors to the match-scoring.
Any help with this is greatly appreciated!
Related
Wordy title, but I was asked this question in an interview, couldn't derive the answer, and would really love to better understand array usage in Excel.
Question:
You have two arrays. One depicts the credit ratings given by various companies (i.e. Company 1, Company 2...) and then an array showing the rankings of all credit ratings. Your goal is to use a formula to find the Lowest Rating by "CUSIP" and which company gave this rating. NOTE: A rating of NR = "NOT RATED" and should be excluded.
Arrays (Red & Black Array is the desired outcome of the formula)
Figured out my own answer... Thought I would document it here for you. One of the issues I had in the interview was to do with me setting up the "Ranking" table as a vertical table rather than a longwise table. This did not allow the max-matching array function to properly work.
I had to rearrange the ranking table to be elongated: | AAA | AA | A |
Lowest Rating Formula:
=INDEX((Ranking Table Array),
2,
MAX(IF((Credit Ratings Array)<>"NR",MATCH((CUSIP row in the table),
(Credit Ratings Array),0))))
I found that this accurately finds the lowest rating given by any of the companies (a lower rating has a higher rank number than a higher-rated bond).
The company formula:
=INDEX(Table1[[#Headers],[Company 1]:[Company 3]],1,MATCH(C11,B4:D4,0))
Piggybacking off the other formula, this formula matches the lowest credit rating found by the previous formula to the company that gave that rating in each row.
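For readers who want the logic outside Excel, here is a minimal Python sketch of the same two lookups. It assumes a rank table where a higher number means a lower rating and reuses the three ratings from the ranking row above. The sample ratings per company are made up, and `lowest_rating` is a hypothetical helper name.

```python
# Rank table as in the elongated ranking row: higher number = lower rating.
RANK = {"AAA": 1, "AA": 2, "A": 3}

def lowest_rating(ratings_by_company):
    """Return (rating, company) for the lowest rating, ignoring "NR".

    Returns None if every company reports "NR".
    On ties, the first company in insertion order wins.
    """
    rated = {c: r for c, r in ratings_by_company.items() if r != "NR"}
    if not rated:
        return None
    company = max(rated, key=lambda c: RANK[rated[c]])
    return rated[company], company

# Made-up ratings for one CUSIP.
row = {"Company 1": "AA", "Company 2": "NR", "Company 3": "A"}
print(lowest_rating(row))
```

Because the function takes one CUSIP's ratings at a time, it does not depend on the order or layout of the answer table.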
Issues with answers:
They require the answer table to be in the same format as the data being read. Because the formulas assume the CUSIPs are assessed in ascending order, the references move one row down each time the formula is copied down. This is not scalable (even though this is an interview question).
I'm looking to create pricing groups based on multiple criteria. Each group could have multiple items within it. I'm struggling with automatically creating the name of each group. I estimate there should be about 6.5K pricing groups out of 14K items.
Below is the criteria -
QTY per case - is the number of bottles in a case
Size - size of the bottle
Family Brand - contains a group of like items
Code - CS1 - This is my unique code for each group that contains each of the above and lowest possible case price.
The "Thinking" column is how I want each group to look, but how do I do this with 14K items quickly?
If I understood correctly, your pricing group name consists of two parts: a simple combination of columns, and a "special" column that has to be counted.
Part 1 is simple: =C2&"-"&B2&"-"&A2&"-"
To make Part 2 easier, sort the data by Part 1 and then by Code - CS1.
After doing this you can use helper columns. If Part 1 is in column X and Code - CS1 is in column Y, you can use this formula for
Part 2 (column Z): =IF(X1=X2;IF(Y1=Y2;Z1;Z1+1);1)
That means: if Part 1 changes, the counter restarts at 1; if it doesn't but your Code - CS1 changes, the counter goes up by one; otherwise it keeps the last number. The counter is kept numeric so that Z1+1 works, and the "T" prefix is added in the final step.
The resulting code would be =X2&"T"&Z2
It is untested, and I use German Excel (hence the semicolons as argument separators), so the formulas may need some adaptation, but in general it should work.
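The sort-then-count idea can also be sketched in Python, which may be a quick way to sanity-check the numbering on a small sample before applying it to 14K rows. It assumes each row has already been reduced to a (Part 1 string, code) pair. The sample rows and the `name_groups` helper are hypothetical.

```python
def name_groups(rows):
    """rows: list of (part1, code) pairs. Returns one group name per input row.

    Within each Part 1 the counter starts at 1 and goes up by one
    every time the code changes, walking the rows in sorted order.
    """
    names = {}
    prev_part1, prev_code, counter = None, None, 0
    for part1, code in sorted(rows):
        if part1 != prev_part1:
            counter = 1          # new Part 1: restart the counter
        elif code != prev_code:
            counter += 1         # same Part 1, new code: next group
        names[(part1, code)] = f"{part1}T{counter}"
        prev_part1, prev_code = part1, code
    # Map the names back onto the rows in their original order.
    return [names[row] for row in rows]

# Hypothetical sample rows: (Part 1, lowest case price code).
rows = [
    ("12-750ml-BrandA-", 100.0),
    ("12-750ml-BrandA-", 120.0),
    ("12-750ml-BrandA-", 100.0),
    ("6-1L-BrandB-", 80.0),
]
print(name_groups(rows))
```

Unlike the helper-column version, this does not require the source data to stay sorted, because the names are mapped back to the original row order.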
This is hard to explain so my title sucks, and is just my best guess at how I might be able to approach this. I have a Google Sheet of sales data for cases of various bottle sizes of kombucha. Column E is the sale date, Column G contains the item code, and column J is the quantity sold of said cases. See my (vastly simplified) sample data:
https://docs.google.com/spreadsheets/d/17-LzGrNJtBr-FwOZtdaoCws3ayeGOHu_TdtGOfXj4cA/edit?usp=sharing
See my current test formula below (also present in the Formula tab of the linked spreadsheet). It successfully gives me the combined number of cases sold of half-liter bottles and Growlers. Cells B4 and B5, which the formula references, contain my start and end dates, respectively, so I'm constraining the results to those which fall within a certain date range.
This code works, but now I need to figure out a way to sum the total number of bottles sold instead of # of cases. The data set is already massive and pushing the limits of google sheets, so adding a column to the source data sheet with # of bottles per case is not an option. Half liter cases hold 13 bottles, and growlers hold 5. Is there any way to do this with my current approach, using another array perhaps? Or any other approach that keeps the formula as simple as possible?
FYI the current formula is a proof of concept and I will be adding many additional types of cases to the existing formula, each containing a different number of bottles per case, and using it as part of a larger dynamic formula that allows you to switch between showing # cases vs # bottles vs # of actual liters sold, so this is why I am hoping to find an array-based approach that will let me do this without needing to resort to an absurdly long and complex formula of nested IF statements.
=SUMPRODUCT(--((XeroInvoiceData!$E$3:$E>=B4)*(XeroInvoiceData!$E$3:$E<=B5)), (--(ISNUMBER(MATCH(XeroInvoiceData!$G$3:$G, {"HalfLiterCase","GrowlerCase"}, 0)))), XeroInvoiceData!$J$3:$J)
I would be eternally grateful for any assistance.
Here is my solution:
https://docs.google.com/spreadsheets/d/1ig0krumJu4Lj9-nIKJyRfPLTYbU-mzOL0JokRUDEqNc/edit?usp=sharing
My idea was to filter your table on date and sum by the type of container.
I wanted also to allow new types of containers that contain smaller units (bottles or liters).
I divided this job into 3 stages.
First we have to filter this table according to selected dates and container types.
I prepared a list that may be extended (all you need is to extend the filter range).
Then I have to vlookup values of units in each container and I try to do it inside the same formula.
General idea is
={[query results],arrayformula(ifna(vlookup([first column of query],$C$21:$D$26,2,0)*[second column of query]))}
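As a language-neutral illustration of that general idea (filter by date, look up the units per container, multiply by the quantity), here is a short Python sketch. The 13 and 5 bottles per case come from the question. The sample rows, the "OtherCase" code and the function name are made up.

```python
from datetime import date

# Units per container, playing the role of the lookup range in the sheet.
BOTTLES_PER_CASE = {"HalfLiterCase": 13, "GrowlerCase": 5}

def bottles_sold(rows, start, end, units=BOTTLES_PER_CASE):
    """rows: iterable of (sale_date, item_code, cases_sold).

    Sums cases * units-per-case for rows inside [start, end]
    whose item code appears in the lookup table.
    """
    return sum(
        cases * units[item]
        for sold_on, item, cases in rows
        if start <= sold_on <= end and item in units
    )

# Made-up sample rows.
sales = [
    (date(2020, 1, 5), "HalfLiterCase", 2),
    (date(2020, 1, 10), "GrowlerCase", 3),
    (date(2020, 2, 1), "HalfLiterCase", 10),  # outside the date range
    (date(2020, 1, 7), "OtherCase", 4),       # not in the lookup table
]
print(bottles_sold(sales, date(2020, 1, 1), date(2020, 1, 31)))
```

Adding a new container type only means adding one entry to the lookup table, which is the same reason the sheet version uses a lookup range instead of nested IFs.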
I divide it into 2 stages.
The first stage refers to the query results in the adjacent table:
Second stage uses indexes of query so formula is quite long:
Tell me if it solves your problem.
Consider the following table
I need to generate a bar chart with the Category Group = "Country". The chart should only display the top 3 groups based on the count of records for a country. I have already applied a filter for the Category Group, specifying the Top N condition as 3 for Count(Country). The generated chart applies the filter as expected based on count, but I need only 3 bars to be shown even if there are bars with duplicate values.
Below is the chart that I get.
Expected Result
Now, I know that I can create an additional column in my dataset with ranking values and then apply a filter on this column to get the expected result (I have tried this, and it works).
Is there a way to achieve the expected result without changing the underlying dataset?
Note: The dataset shown above is a highly simplified version of my dataset. In reality I have a huge dataset with a lot of columns. The same dataset has been used for various charts (with groupings on different columns).
This was an interesting question as I've always just "solved" the tiebreaker in the dataset without much thought. However, I do see a fairly easy way to use the rnd() function to dissolve the ties as long as you don't care which of the tied countries is shown:
=(Count(Fields!Country.Value) * 1000) + (Rnd() * 100)
Which essentially just weights the count per country into the thousands and then tiebreaks with a random small value:
New York: 30XX
France: 20XX
China: 10XX
Italy: 10XX
Singapore: 10XX
If you wanted to actually solve the tiebreaker with an alphabetical preference, you could do something similar but incorporate the numeric value for the first letter of the country etc...
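The weighting trick is not specific to SSRS expressions. As a rough illustration, the same count-times-1000-plus-random key looks like this in Python. The counts are read off the list above (30XX means a count of 3, and so on), and the function name is invented.

```python
import random

def top_n_with_random_tiebreak(counts, n=3, rng=random):
    """Return the n names with the highest counts, breaking ties randomly.

    Each count is scaled into the thousands and a random value below 100
    is added, so ties are dissolved without changing the order of
    groups whose counts really differ.
    """
    keyed = {name: count * 1000 + rng.random() * 100 for name, count in counts.items()}
    return sorted(keyed, key=keyed.get, reverse=True)[:n]

# Counts implied by the weighted values listed above.
counts = {"New York": 3, "France": 2, "China": 1, "Italy": 1, "Singapore": 1}
print(top_n_with_random_tiebreak(counts))
```

The first two entries are always New York and France; the third is whichever of the tied groups happens to draw the largest random value.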
I just found out that I could use an x-axis minimum of 0 and a maximum of 10.5, together with an interval setting of 1.
This way I was able to achieve a top 10 limitation, and the axis labels show the names (it may be a side effect, but when I changed the axis maximum to a whole number, the axis no longer showed the names but numbers).
I was not really happy with the other approaches. They seem like overkill to me for such a simple requirement as limiting the number of bars shown in a chart.
I am having some trouble formulating my problem, but I hope you understand!
I have a table of firms building production plants in foreign countries in certain years (columns A to C).
In a separate table I have so-called cross-national distance measures, based on the difference in GDP of the countries (columns G to M). Note that the distances change per year.
A simplified version of the Excel sheet would look like this:
https://new.wu.ac.at/fileadmin/wu/d/i/iib/photo/stack.JPG
What I want is a formula for the manually entered results in column D. It should give me a result as follows:
It should look up in which countries the specific company has previously (in earlier years) built plants
It should find the smallest cross-national distance from the current country to any of those previously entered countries
The value should be for the year of the current plant construction
Let me illustrate my request with the example result I would want in cell D8:
The formula would have to find the list of countries that were previously entered, in this case Turkey and Bulgaria
It would then have to go into the second table and give me the minimum of the distances from Kosovo, but only to Turkey and Bulgaria
This would have to be done in the rows for 2008 (the current year)
I really hope you guys can help me. I figured out a way to find a minimum in a list, and I can do it for certain years as well. The issue I am having is that Excel first needs to find the previously entered countries, memorize them in some kind of array, and then use only these countries when looking for the minimum distance.
Thank you very much!
Try this "array formula" for D2 copied down
=IFERROR(SMALL(IF(COUNTIFS(A$2:A$11,A2,B$2:B$11,"<"&B2,C$2:C$11,"<>"&C2,C$2:C$11,I$1:M$1)*(G$2:G$31=B2)*(H$2:H$31=C2),I$2:M$31),1),"N/A")
confirmed with CTRL+SHIFT+ENTER
That checks three conditions for your larger table - that the header row matches a qualifying country (using COUNTIFS function based on criteria in the small table), that column G matches the current year and column H matches the current country.
If all those criteria are satisfied then the relevant values in the table are returned, and SMALL finds the smallest. If there's an error (because there are no qualifying values) then N/A is returned
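For anyone who finds the nested conditions hard to read, the same three checks can be sketched in Python. This is only an illustration of the logic, not a replacement for the formula. The firm name "Firm A", the years for Turkey and Bulgaria, Greece and all distance values are invented; only Kosovo in 2008 with Turkey and Bulgaria as earlier countries comes from the question.

```python
def min_distance_to_previous(plants, distances, firm, year, country):
    """Smallest distance from `country` to any country where `firm`
    built a plant in an earlier year, using the distances of `year`.

    plants:    list of (firm, year, country) rows (columns A to C).
    distances: dict keyed by (year, from_country, to_country).
    Returns None when there is no qualifying value (the "N/A" case).
    """
    previous = {
        c for f, y, c in plants if f == firm and y < year and c != country
    }
    candidates = [
        distances[(year, country, p)]
        for p in previous
        if (year, country, p) in distances
    ]
    return min(candidates) if candidates else None

# Invented sample data mirroring the D8 example.
plants = [
    ("Firm A", 2005, "Turkey"),
    ("Firm A", 2006, "Bulgaria"),
    ("Firm A", 2008, "Kosovo"),
]
distances = {
    (2008, "Kosovo", "Turkey"): 5.0,
    (2008, "Kosovo", "Bulgaria"): 2.5,
    (2008, "Kosovo", "Greece"): 1.0,  # ignored: no earlier plant in Greece
}
print(min_distance_to_previous(plants, distances, "Firm A", 2008, "Kosovo"))
```

The `previous` set plays the role of the COUNTIFS condition, and the dictionary key plays the role of the year and country matches on columns G and H.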
In Excel 2010 or later versions you can use the AGGREGATE function instead of SMALL. This is useful because it doesn't require "array entry":
=IFERROR(AGGREGATE(15,6,I$2:M$31/(COUNTIFS(A$2:A$11,A2,B$2:B$11,"<"&B2,C$2:C$11,"<>"&C2,C$2:C$11,I$1:M$1)>0)/(G$2:G$31=B2)/(H$2:H$31=C2),1),"N/A")