Aggregate Values of Multiple Lookups in Excel - arrays

I'm looking for a non-VBA solution to this problem.
Say I have a graph (in the computer science sense) in a spreadsheet as follows:
     A        B           C        D
1    Vertex   Neighbors   Degree   Avg Nghbr Deg
2    A        B,C         2        2.5
3    B        A,C         2        2.5
4    C        A,B,D       3        1.666666667
5    D        C           1        3
I've entered columns C and D by hand but I want them to be calculated automatically. I've found reasonable solutions for column C that essentially count the commas and add 1. But for column D, I can't find a solution. I've found countless articles that explain how to look up one value multiple times in one column, and countless articles that explain how to look up multiple values once in multiple columns, but I can't figure out how to look up multiple values in ONE column, get back an array of values, and then take the average of that array. I'm sure this can be done in VBA but I'd prefer a native Excel solution if one exists.
Obviously I'd like to extend this so that I can do other analyses of a vertex's neighbors. Presumably once I know the method to analyze a "looked-up array" I will be able to use it in other functions as well.
Any help is greatly appreciated.

To get column C:
=LEN(B2)-LEN(SUBSTITUTE(B2,",",""))+1
To get column D use SUMPRODUCT with SEARCH:
=SUMPRODUCT((ISNUMBER(SEARCH("," & $A$2:$A$5 & ",","," & B2 & ",")))*$C$2:$C$5)/C2
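For anyone who wants to check the logic of the two formulas outside of Excel, here is a minimal Python sketch of the same idea. The `graph` dictionary is only a stand-in for columns A and B of the sample sheet, and the function names are made up for illustration.

```python
# Stand-in for columns A (vertex) and B (comma-separated neighbors).
graph = {"A": "B,C", "B": "A,C", "C": "A,B,D", "D": "C"}


def degree(neighbors):
    # Same trick as the column C formula: count the commas and add 1.
    return neighbors.count(",") + 1


# Column C for every vertex.
degrees = {vertex: degree(neighbors) for vertex, neighbors in graph.items()}


def avg_neighbor_degree(vertex):
    # Same idea as the SUMPRODUCT/SEARCH formula: wrap the neighbor list and
    # each candidate vertex in commas so that only whole names match, sum the
    # degrees of the vertices that match, then divide by the vertex's degree.
    wrapped = "," + graph[vertex] + ","
    total = sum(d for v, d in degrees.items() if "," + v + "," in wrapped)
    return total / degrees[vertex]


for vertex in graph:
    print(vertex, degrees[vertex], avg_neighbor_degree(vertex))
```

Wrapping both sides in commas is what stops a vertex named "A" from matching a neighbor named "AB", which is also why the Excel formula concatenates "," on both sides.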

Related

(excel) How to return an array from a sum of ranges?

I'm setting up a morphological table that will have to go through potentially a couple hundred items, so it's desirable for this process to not be done by hand.
Here's a small summary of the situation:
     fin  eng  op   fli
A    2    4    6    8
B    1    3    5    4
C    1    2    3    5
D    1    4    7    2
The first column holds named ranges A through D which have associated values from the 4 categories in row 1.
In a second table we create configurations based on which features are selected, something like this:
Config 1   Config 2
A          B
C          D
What I'm looking for is a formula that would read for each configuration which named range is selected, add the score for each category and return it in a simple array. Something like
Config 1 {3,6,9,13}, Config 2 {2,7,12,6}
So far I've found that the Indirect formula works exactly the way I want but I have to manually input each range. Something like:
=INDIRECT(A1)+INDIRECT(A2)
I've played around with different permutations of sum functions but instead of returning the arrays it returns the sum of the first values.
=SUM(INDIRECT(A1:A2))
Any suggestions would be welcome.
I know this would probably be much simpler with code, but this study needs to be done in Excel.
I'm not sure if this answers your question as it doesn't use named ranges, but you could try something like this:
=MMULT(SEQUENCE(1,4,1,0),$B$2:$E$5*COUNTIF(INDEX($H$2:$I$3,0,ROW()-ROW($A$7)+1),$A$2:$A$5))
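To see what the MMULT/COUNTIF combination is doing, here is a rough Python sketch of the same computation using the sample numbers from the question. The variable and function names are illustrative only; the mask plays the role of COUNTIF and the column sums play the role of multiplying by a row of ones.

```python
# Score table from the question: one row per named range, one column per category.
names = ["A", "B", "C", "D"]
scores = [
    [2, 4, 6, 8],  # A
    [1, 3, 5, 4],  # B
    [1, 2, 3, 5],  # C
    [1, 4, 7, 2],  # D
]

# Selections per configuration, as in the second table.
configs = {"Config 1": ["A", "C"], "Config 2": ["B", "D"]}


def config_totals(selected):
    # COUNTIF step: how many times each named range appears in the selection.
    mask = [selected.count(name) for name in names]
    # MMULT step: weight each row by its mask value and sum down each column.
    n_categories = len(scores[0])
    return [
        sum(m * row[j] for m, row in zip(mask, scores))
        for j in range(n_categories)
    ]


for label, selected in configs.items():
    print(label, config_totals(selected))
```

Rows that are not selected get a mask value of 0 and drop out of the sum, so no INDIRECT call per range is needed.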

How to multiply values within a nested array...times values in an another array (in Google Sheets)?

This is hard to explain so my title sucks, and is just my best guess at how I might be able to approach this. I have a Google Sheet of sales data for cases of various bottle sizes of kombucha. Column E is the sale date, Column G contains the item code, and column J is the quantity sold of said cases. See my (vastly simplified) sample data:
https://docs.google.com/spreadsheets/d/17-LzGrNJtBr-FwOZtdaoCws3ayeGOHu_TdtGOfXj4cA/edit?usp=sharing
See my current test code below (also present in the Formula tab of the linked spreadsheet). It successfully gives me the combined number of cases sold of half-liter bottles and Growlers. The values in E4 and E5 are cells containing my start and end dates, respectively, so I'm constraining the results only to those which fall within a certain date range.
This code works, but now I need to figure out a way to sum the total number of bottles sold instead of # of cases. The data set is already massive and pushing the limits of google sheets, so adding a column to the source data sheet with # of bottles per case is not an option. Half liter cases hold 13 bottles, and growlers hold 5. Is there any way to do this with my current approach, using another array perhaps? Or any other approach that keeps the formula as simple as possible?
FYI the current formula is a proof of concept and I will be adding many additional types of cases to the existing formula, each containing a different number of bottles per case, and using it as part of a larger dynamic formula that allows you to switch between showing # cases vs # bottles vs # of actual liters sold, so this is why I am hoping to find an array-based approach that will let me do this without needing to resort to an absurdly long and complex formula of nested IF statements.
=SUMPRODUCT(--((XeroInvoiceData!$E$3:$E>=B4)*(XeroInvoiceData!$E$3:$E<=B5)), (--(ISNUMBER(MATCH(XeroInvoiceData!$G$3:$G, {"HalfLiterCase","GrowlerCase"}, 0)))), XeroInvoiceData!$J$3:$J)
I would be eternally grateful for any assistance.
Here is my solution:
https://docs.google.com/spreadsheets/d/1ig0krumJu4Lj9-nIKJyRfPLTYbU-mzOL0JokRUDEqNc/edit?usp=sharing
My idea was to filter your table on date and sum by the type of container.
I wanted also to allow new types of containers that contain smaller units (bottles or liters).
I divided this job into 3 stages.
First we have to filter this table according to selected dates and container types.
I prepared a list that may be extended (all you need is to extend the filter range).
Then I have to vlookup values of units in each container and I try to do it inside the same formula.
General idea is
={[query results],arrayformula(ifna(vlookup([first column of query],$C$21:$D$26,2,0))*[second column of query])}
I divide it into 2 stages.
First stage refers to query results in adjacent table:
Second stage uses indexes of query so formula is quite long:
Tell me if it solves your problem.
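The lookup-then-multiply idea can be sketched outside of Sheets as follows. This is only an illustration: the sample rows and dates are invented, the function name is hypothetical, and only the two bottles-per-case figures (13 and 5) come from the question.

```python
from datetime import date

# Units per container, as stated in the question. More case types can be
# added here without touching the calculation.
BOTTLES_PER_CASE = {"HalfLiterCase": 13, "GrowlerCase": 5}

# Invented sample rows: (sale date, item code, cases sold).
sales = [
    (date(2021, 1, 5), "HalfLiterCase", 2),
    (date(2021, 1, 20), "GrowlerCase", 3),
    (date(2021, 2, 1), "HalfLiterCase", 4),
    (date(2021, 1, 10), "OtherItem", 7),
]


def bottles_sold(rows, start, end):
    # Keep rows inside the date range, look up the bottles per case for each
    # item code, and multiply by the number of cases. Item codes missing from
    # the lookup table contribute nothing to the total.
    return sum(
        qty * BOTTLES_PER_CASE.get(item, 0)
        for sold_on, item, qty in rows
        if start <= sold_on <= end
    )


print(bottles_sold(sales, date(2021, 1, 1), date(2021, 1, 31)))
```

Keeping the multipliers in a small lookup table is what avoids a long chain of nested IF statements as more case types are added.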

Vlookup an array of formulas in Excel

I have one table with two columns
ID Probability
A 1%
B 2%
C 3%
D 4%
I have another table, with some IDs and corresponding weights:
ID Weight
A 50%
D 25%
A 15%
B 5%
B 5%
What I'm looking for is a way, in a single formula, to find the corresponding probabilities for each of the IDs in the second table using the data from the first, multiply each by their respective weights from the second table, then sum the results.
I recognise a simple way to solve it would be to add a proxy column to the second table and list corresponding probabilities using a vlookup and multiplying by the weight, then summing the results, but I feel like there must be a more elegant solution.
I've tried entering the second table IDs as an array in both Vlookup and Index/Match formulas, but while both accept a range as a lookup value, both only execute for the first value of the range instead of cycling through the whole array.
I guess ideally the formula would
set a 1 x 5 array for the IDs,
populate a new 1 x 5 array based on the probabilities from the first table,
multiply the new array by the existing 1 x 5 array for weights,
sum the result.
[edit] So for the above example, the final result would be (50% x 1%)+(25% x 4%) + (15% x 1%) + (5% x 2%) + (5% x 2%) = 1.85%
The real tables are much, much bigger than the examples I've given so a simple Sum() function for individual vlookups is out.
Love to hear of any clever solutions?
Using the same ranges as given by Trương Ngọc Đăng Khoa:
=SUMPRODUCT(SUMIF(A1:A4,D1:D5,B1:B4),E1:E5)
Regards
You can use this formula:
{=SUM(LOOKUP(D1:D5;A1:A4;B1:B4)*E1:E5)}
With the table laid out like this:
     A    B     C    D    E
1    A    1%         A    50%
2    B    2%         D    25%
3    C    3%         A    15%
4    D    4%         B    5%
5                    B    5%
Great response, thanks guys!
XOR LX, your answer seemed to work in all cases, which is what I was looking for (and seems like it was much simpler than I'd originally thought). I think I misunderstood the way the SUMIF function works.
In case anyone is interested, I also found my own (stupidly complex) solution:
=SUM(IF(A1:A4=TRANSPOSE(D1:D5),1,0)*TRANSPOSE(E1:E5)*B1:B4)
Which basically works by transforming the thing into a 4 x 5 matrix instead. I think I still prefer the XOR LX solution for its simplicity.
Appreciate the help, everyone!
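As a sanity check on the 1.85% figure, the same lookup, multiply and sum can be written as a short Python sketch using the example tables from the question. The variable names are illustrative only.

```python
# First table: probability per ID.
probability = {"A": 0.01, "B": 0.02, "C": 0.03, "D": 0.04}

# Second table: (ID, weight) pairs; IDs may repeat.
weights = [("A", 0.50), ("D", 0.25), ("A", 0.15), ("B", 0.05), ("B", 0.05)]

# Look up each ID's probability, multiply by its weight, and sum.
# This corresponds to SUMPRODUCT(SUMIF(...), weights) in the accepted answer.
expected = sum(probability[item_id] * weight for item_id, weight in weights)

print(f"{expected:.2%}")
```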

How to find pattern groups in boolean array?

Given a 2D array of Boolean values I want to find all patterns that consist of at least 2 columns and at least 2 rows. The problem is somewhat close to finding cliques in a graph.
In the example below green cells represent "true" bits, greys are "false". Pattern 1 contains cols 1,3,4 and 5 and rows 1 and 2. Pattern 2 contains only columns 2 and 4, and rows 2,3,4.
Business idea behind this is finding similarity patterns among various groups of social network users. In real world number of rows can go up to 3E7, and the number of columns up to 300.
Can't really figure out a solution other than brute force matching.
Please advise on the proper name of the problem, so I can read more, or suggest an elegant solution.
This is (equivalent to) asking for all bicliques (complete bipartite subgraphs) larger than a certain size in a bipartite graph. Here the rows are the vertices of one part A of the graph, and the columns are the vertices of the other part B, and there is an edge between u \in A and v \in B whenever the cell at row u, column v is green.
Although you say that you want to find all patterns, you probably want to find only the maximal ones -- that is, patterns that cannot be extended to become larger patterns by adding more rows or columns. (Otherwise, for any pattern with c >= 2 columns and r >= 3 rows, you will also get back the more than 2^(c-2)*2^(r-3) non-maximal patterns that can be formed by deleting some of the rows or columns.)
But even listing just the maximal patterns can take time exponential in the number of rows and columns, assuming that P != NP. That's because the problem of finding a maximum (i.e. largest-possible) pattern, in terms of the total number of green cells, has been proven to be NP-complete: if it were possible to list all maximal patterns in polynomial time, then we could simply do so, and pick the largest, thereby solving this NP-complete problem in polynomial time.
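To make "maximal pattern" concrete, here is a small brute-force Python sketch. It is exponential in the number of columns, so it is only meant for toy inputs, not for 3E7 rows. The grid is an assumption: the original picture is not available, so this matrix was simply built to be consistent with the two patterns described in the question, using 0-based indices. The function name is made up.

```python
from itertools import combinations

# Assumed example grid (1 = green/true), consistent with the described patterns.
grid = [
    [1, 0, 1, 1, 1],
    [1, 1, 1, 1, 1],
    [0, 1, 0, 1, 0],
    [0, 1, 0, 1, 0],
]


def maximal_patterns(grid, min_rows=2, min_cols=2):
    n_cols = len(grid[0])
    # For each row, the set of columns that are true.
    row_sets = [frozenset(c for c in range(n_cols) if row[c]) for row in grid]
    found = set()
    for size in range(min_cols, n_cols + 1):
        for cols in combinations(range(n_cols), size):
            cols = frozenset(cols)
            # All rows that are true in every chosen column.
            rows = frozenset(r for r, s in enumerate(row_sets) if cols <= s)
            if len(rows) < min_rows:
                continue
            # Extend the columns to everything those rows share. The resulting
            # (rows, columns) pair cannot be grown further, so it is maximal.
            closed = frozenset.intersection(*(row_sets[r] for r in rows))
            found.add((rows, closed))
    return found


for rows, cols in sorted(maximal_patterns(grid), key=lambda p: sorted(p[0])):
    print("rows", sorted(rows), "cols", sorted(cols))
```

Storing the results in a set removes the duplicates that arise when different column subsets lead to the same maximal pattern.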

How to format data in (a) CSV file(s) so that it can easily be imported in R?

Edit:
So, this format would work:
featureID charge xcoordinate ycoordinate
1 2 5105.9217 336.125209180674
1 2 5108.7642 336.124751115092
2 0 2434.9217 145.893331325278
But what if I have two columns with multiple values that are linked? Say the quality column links a machine to a quality value, and the column looks like this:
MachineQuality
[[{1:1224}, {2:3453}], [{1:2242}, {2:4142}]]
Now if I want to split that up like I did with the coordinates of the convexhull I would need 2 rows instead of 1. But wouldn't I need 2 rows for every row that is already in (so 4, because there are already 2 extra for the coordinates) like this:
featureID charge xcoordinate ycoordinate quality1 quality2
1 2 5105.9217 336.125209180674 1224 3453
1 2 5105.9217 336.125209180674 2242 4142
1 2 5108.7642 336.124751115092 1224 3453
1 2 5108.7642 336.124751115092 2242 4142
[...]
Would it have to be like this?
I'm very new to R, my knowledge doesn't go much further than knowing how to make a vector and some simple plots. I'm going to use R for an internship project the next couple of months and during this time I will (hopefully) learn some of the ins and outs of R. However, before I start I need to produce the data that I'm going to do the statistics on. I need to know beforehand how I should format my output CSV data so that I can easily read it in once I start my R analysis.
One thing that I've been asked to do is make a CSV file out of the data so that it can be read in by R. The example CSV files for importing with R that I've seen all look like this
featureID Charge value
1 2 10
2 0 9
However, my data mostly consists of columns whose cells contain multiple values. To clarify:
As an example, my data consists of "features" that, among other information, have a "convexhull". This convexhull consists of paired x and y coordinates. So what I could have for data is (only showing two coordinates, can be many)
featureID Charge Convexhull
1 2 [[{'y': '336.125209180674'}, {'x': '5105.9217'}], [{'y': '336.124751115092'}, {'x': '5108.7642'}]]
Is it possible to get this in one CSV file and read it into R correctly (so that the paired x and y coordinates are preserved)? If so, what should the CSV file look like? For example, I've seen examples of CSV files with multiple values that look like this:
featureID  charge  xcoordinate  ycoordinate
1          2       5105.9217    336.125209180674
                   5108.7642    336.124751115092
2          0       2434.9217    145.893331325278
But I can't find if this is easily imported by R.
If this is not doable in one CSV file, are the CSV files easily imported independently, with a primary key idea, like database linking?
The only critical things are that you have a unique character separating your data columns and that each column is the same length. As long as the second row in your last example is filled in that will import fine.
You need to consider what you want to do with the data after it's in R to decide how you might want any other special formatting beforehand. But, as long as the column separator is a unique character and the columns are of equal length then it will import.
(You can violate the unique separator requirement if your entries are wrapped in quotes. And if you want to get really fancy you could "import" almost anything. But if someone's asking you to format the data then they probably want a rectangular data.frame compatible layout. They probably want unique values in each column (no columns of points). But that's between you and them.)
long vs. wide form. Your last example is known as long form (except all cells should be filled in) and your first example is roughly wide form as discussed on the ?reshape page and illustrated in the examples at the end of that page. You likely want to stick with long form. For an alternative see the reshape2 package.
save & load. Note that if you are only writing it out to read it back in to R later (as opposed to communicating it to some other software) you could use save and load which don't require any change to the object at all.
json. Another possibility given the form of your example is that you might want to look at the rjson package.
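As an illustration of the long-form layout with every cell filled in, here is a minimal Python sketch that flattens nested convex hull data into one row per coordinate pair and writes it as CSV. It assumes the data is produced by a script before it reaches R; the `features` structure and the function names are hypothetical, and only the sample values come from the question.

```python
import csv
import io

# Hypothetical in-memory form of the data: each feature carries a list of
# (x, y) pairs for its convex hull.
features = [
    {"featureID": 1, "charge": 2,
     "hull": [(5105.9217, 336.125209180674), (5108.7642, 336.124751115092)]},
    {"featureID": 2, "charge": 0,
     "hull": [(2434.9217, 145.893331325278)]},
]


def to_long_rows(features):
    # One output row per coordinate pair, repeating featureID and charge so
    # that every column has the same length and no cell is left blank.
    rows = []
    for feature in features:
        for x, y in feature["hull"]:
            rows.append([feature["featureID"], feature["charge"], x, y])
    return rows


def to_csv_text(features):
    buffer = io.StringIO()
    writer = csv.writer(buffer, lineterminator="\n")
    writer.writerow(["featureID", "charge", "xcoordinate", "ycoordinate"])
    writer.writerows(to_long_rows(features))
    return buffer.getvalue()


print(to_csv_text(features))
```

A file in this shape is rectangular, so it should load with R's read.csv into a data.frame, and the x and y values stay paired because they sit on the same row.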

Resources