How to merge data sets with the same key column in PowerAutomate - inner-join

I have two data sets coming from two separate API calls.
Each gives me a list of events together with a number value.
I want to merge these two sets into one table in Power Automate, but I am having trouble figuring out how. I've tried a nested Apply to each, but without success.
Example of my data:
Array 1:
| Event Name | Tickets |
|------------|---------|
| General Admission | 510 |
| Members | 210 |
| Special Programs | 100 |
| Groups | 20 |
Array 2:
| Event Name | Checked In |
|------------|------------|
| General Admission | 210 |
| Members | 110 |
| Special Programs | 100 |
(these data are in JSON format, arrays or strings: [{ "Event Name": "General Admission", "Purchased Tickets": "510"},{"Event Name": "Members", "Purchased Tickets": "210"}....] and [{"Event Name": "General Admission", "Checked In": "210"},.....])
And what I want to do is merge them into:
| Event Name | Tickets | Checked In |
|------------|---------|------------|
| General Admission | 510 | 210 |
| Members | 210 | 110 |
| Special Programs | 100 | 100 |
| Groups | 20 | |
(ideally the last cell would be 0, but that's something I can figure out myself). The important thing to note is that the two tables might not have identical rows; for now I just want to treat a missing row as a null. I don't have a lot of control over what key values I get, as those depend on the reports that were run. However, when they do match, the keys will be identical.
union just plops the second array after the first, so I get a table with two sets of key values and blanks in columns 2 and 3 respectively. I think intersection/join logic could help me here, but the actions I've found don't seem to give me an option to select both arrays.
I don't have access to any Power Automate premium features.

A flow such as this will do it for you. I have to say, this sort of data manipulation is not always ideal on the Logic Apps/Power Automate engine. Either way ...
To describe each step ...
Base Data
This is simply your JSON that is the base data you want to augment with the new data set.
Lookup Data
This is the other data set noting that the last record has no CheckedIn property. Obviously, that's an important part of the flow.
Initialized Combined Array
This is empty at this stage, it will combine the data in a joined state.
Process Each Base Data Item
This is obvious, we're just looping through the Base Data array.
Filter Lookup Array on Current Base Data Item
It's a bit easier to visualise this action.
To show you the code behind that step ...
"inputs": {
"from": "#body('Lookup_Data')",
"where": "#equals(item()['EventName'], items('Process_Each_Base_Data_Item')?['EventName'])"
},
Append to Combined Array
There are a couple of ways to do this but to illustrate it all a bit better, we're simply creating a new object and adding that to the Combined Array variable.
This is the code contained within the step ...
"inputs": {
"name": "Combined Array",
"value": {
"EventName": "#{items('Process_Each_Base_Data_Item')?['EventName']}",
"Tickets": "#items('Process_Each_Base_Data_Item')?['Tickets']",
"CheckedIn": "#if(greater(length(body('Filter_Lookup_Array_on_Current_Base_Data_Item')), 0), body('Filter_Lookup_Array_on_Current_Base_Data_Item')[0]['CheckedIn'], 0)"
}
}
The important expression being the CheckedIn property.
It checks whether the filter returned at least one result; if it did, it takes the first item (using index 0, though you could use the first function; I assume there should only ever be one match) and reads its CheckedIn property, otherwise it falls back to 0.
The end result is an array that looks like this ...
[
{
"EventName": "General Admission",
"Tickets": 510,
"CheckedIn": 210
},
{
"EventName": "Members",
"Tickets": 210,
"CheckedIn": 110
},
{
"EventName": "Special Programs",
"Tickets": 100,
"CheckedIn": 100
},
{
"EventName": "Groups",
"Tickets": 20,
"CheckedIn": 0
}
]
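For reference, the loop-filter-append pattern the flow implements is just a left join keyed on the event name. A minimal Python sketch of the same steps, using the sample data from the question:

```python
base = [
    {"EventName": "General Admission", "Tickets": 510},
    {"EventName": "Members", "Tickets": 210},
    {"EventName": "Special Programs", "Tickets": 100},
    {"EventName": "Groups", "Tickets": 20},
]
lookup = [
    {"EventName": "General Admission", "CheckedIn": 210},
    {"EventName": "Members", "CheckedIn": 110},
    {"EventName": "Special Programs", "CheckedIn": 100},
]

combined = []
for row in base:
    # mirrors the Filter action: lookup rows with the same key
    matches = [r for r in lookup if r["EventName"] == row["EventName"]]
    combined.append({
        "EventName": row["EventName"],
        "Tickets": row["Tickets"],
        # mirrors the if(greater(length(...)), ..., 0) expression
        "CheckedIn": matches[0]["CheckedIn"] if matches else 0,
    })
```

The flow does exactly this, with the Filter array action playing the role of the list comprehension and the Append to array variable action building `combined`.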

Related

How to validate two data sets coming from an algorithm to check its effectiveness

I have two data sets:
1st (AKA "OLD") [smaller - just a sample]:
Origin | Alg.Result | Score
Star123 | Star123 | 100
Star234 | Star200 | 90
Star421 | Star420 | 98
Star578 | Star570 | 95
... | ... | ...
2nd (AKA "NEW") [bigger - uses all real data]:
Origin | Alg.Result | Score
Star123 | Star120 | 90
Star234 | Star234 | 100
Star421 | Star423 | 98
Star578 | Star570 | 95
... | ... | ...
These DFs are the results of two different algorithms. Let's call them "OLD" and "NEW".
The logic of both algorithms is as follows:
each takes a value from one table (represented in the column 'Origin') and tries to match it against values from a different table (the outcome is the column 'Alg.Result'). It also calculates a score for the match based on some internal logic (column 'Score').
Additional important information:
The 1st DF (OLD) is a smaller sample
The 2nd DF (NEW) is a bigger sample
Values in Origin are the same for both datasets, except that the OLD dataset has fewer of them than the NEW set.
Values in Alg. Result can:
be exactly the same as in Origin
can be similar
can be completely something else
In the solution where these algorithms are used, a threshold on Score is applied. For OLD it's Score > 90; for NEW it's the same.
What I want to achieve:
Validate how accurate the new approach ("NEW") is in matching values with Origin values.
Find the discrepancies between the OLD and NEW sets:
which cases OLD has that NEW doesn't
which cases NEW has that OLD doesn't
What kind of comparison would you do to achieve those goals?
I thought about checking:
True positive => taking the NEW dataset with condition NEW.Origin == NEW.Alg.Result and NEW.Score == 100
False positive => taking the NEW dataset with condition NEW.Origin != NEW.Alg.Result and NEW.Score == 100
False negative => taking the NEW dataset with condition NEW.Origin == NEW.Alg.Result and NEW.Score != 100
I don't see the sense in counting true negatives if the algorithm always generates some match; I'm not sure what that would even look like.
What else would you suggest? What would you do to compare the OLD and NEW values? Do you have any ideas?
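As an illustration only (record layout assumed from the sample tables above), the proposed true/false-positive counts and an OLD-vs-NEW discrepancy check could be sketched like this:

```python
# Each record: (origin, alg_result, score) - values taken from the question's sample
old = [("Star123", "Star123", 100), ("Star234", "Star200", 90),
       ("Star421", "Star420", 98), ("Star578", "Star570", 95)]
new = [("Star123", "Star120", 90), ("Star234", "Star234", 100),
       ("Star421", "Star423", 98), ("Star578", "Star570", 95)]

# Classification as proposed in the question (exact match <=> Score == 100)
tp = sum(1 for o, r, s in new if o == r and s == 100)
fp = sum(1 for o, r, s in new if o != r and s == 100)
fn = sum(1 for o, r, s in new if o == r and s != 100)

# Discrepancies: join on Origin and list cases where the two algorithms disagree
old_by_origin = {o: r for o, r, _ in old}
changed = [o for o, r, _ in new
           if o in old_by_origin and old_by_origin[o] != r]
```

The same join-on-Origin idea extends to the "cases one set has that the other doesn't": compare the key sets of the two dictionaries.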

Element-wise sum of array across rows of a dataset - Spark Scala

I am trying to group the below dataset based on the column "id" and sum the arrays in the column "values" element-wise. How do I do it in Spark using Scala?
Input: (dataset of 2 columns, column1 of type String and column2 of type Array[Int])
| id | values |
---------------
| A | [12,61,23,43]
| A | [43,11,24,45]
| B | [32,12,53,21]
| C | [11,12,13,14]
| C | [43,52,12,52]
| B | [33,21,15,24]
Expected Output: (dataset or dataframe)
| id | values |
---------------
| A | [55,72,47,88]
| B | [65,33,68,45]
| C | [54,64,25,66]
Note:
The result has to be flexible and dynamic. That is, even with thousands of columns, or a file of several TBs or PBs, the solution should still hold up.
I'm a little unsure what you mean when you say it has to be flexible, but just off the top of my head, I can think of a couple of ways. The first (and in my opinion the prettiest) one uses a udf:
// Creating a small test example
import org.apache.spark.sql.functions._
import spark.implicits._

val testDF = spark.sparkContext.parallelize(Seq(("a", Seq(1,2,3)), ("a", Seq(4,5,6)), ("b", Seq(1,3,4)))).toDF("id", "arr")
val sum_arr = udf((list: Seq[Seq[Int]]) => list.transpose.map(arr => arr.sum))
testDF
  .groupBy('id)
  .agg(sum_arr(collect_list('arr)) as "summed_values")
If you have billions of identical ids, however, the collect_list will of course be a problem. In that case you could do something like this (note that collect_list does not guarantee element order after a shuffle, so in practice you may need to carry arr_index into the final aggregation and sort by it):
testDF
  .flatMap{case Row(id: String, list: Seq[Int]) => list.indices.map(index => (id, index, list(index)))}
  .toDF("id", "arr_index", "arr_element")
  .groupBy('id, 'arr_index)
  .agg(sum("arr_element") as "sum")
  .groupBy('id)
  .agg(collect_list('sum) as "summed_values")
The below single-line solution worked for me (here n is the known length of the arrays, and the column names match my own dataset rather than the question's):
ds.groupBy("Country").agg(array((0 until n).map(i => sum(col("Values").getItem(i))) :_* ) as "Values")

Use excel to summarise data from a column by identifier

I have a spreadsheet with a column called MRN (the identifier) and the drugs administered next to them. There are duplicates of the MRN in column A that correspond to different courses of drugs. What I'm hoping to do is to summarise all the drugs administered associated with one MRN in one line, removing all duplicates. It looks something like this.
| | A | B |
|---|-----|-------------|
| 1 | MRN | Item |
| 2 | 1 | cefoTAXime |
| 3 | 1 | ampicillin |
| 4 | 1 | cefoTAXime |
| 5 | 1 | vancomycin |
| 6 | 1 | cefTRIaxone |
| 7 | 2 | ampicillin |
| 8 | 2 | vancomycin |
| 9 | 2 | vancomycin |
I have 3 different formulas. The first is to produce a list of MRNs that are all unique. The second is to pull all drugs by MRN and list them in one line. The third is to remove duplicates from this list. They are below (in order).
{=IFERROR(INDEX($A$2:$A$2885, MATCH(0,COUNTIF(D$1:$D1, $A$2:$A$2885),0 )),"")}
{=INDEX($A$2:$B$2885,SMALL(IF($A$2:$A$2885=$D2,ROW($A$2:$A$2885)),COLUMN(D:D))-4,2)}
{=IFERROR(INDEX($E$2:$AE$2, MATCH(0,COUNTIF(D$3:$D3, $E$2:$AE$2),0 )),"")}
*I know that I can edit the second one by adding IF(ISERROR(...)) to suppress the #N/A errors and print blanks when a drug is not found, but I want to keep the formulas as simple as possible at this time.
My problem is that the second formula isn't pulling all the drugs by MRN. In an ideal world I would also be able to combine the second and third formulas into one, but I am not sure how. Here is a link to a test file that shows my issue and the formulas in action:
https://1drv.ms/x/s!ApoCMYBhswHzhooXnumW2iV7yx-JaA
I appreciate that there may be a better way to do this using python/R, and if that's possible then I'm more than happy to try, but I couldn't make any headway. Thanks for your help and suggestions.
If you could deal with a count of the number of courses per drug per MRN, you can do this with Power Query (aka Get & Transform in Excel 2016)
Starting with the data you provided on your worksheet, the results would look like:
M-Code
let
Source = Excel.CurrentWorkbook(){[Name="Table1"]}[Content],
#"Changed Type" = Table.TransformColumnTypes(Source,{{"MRN", Int64.Type}, {"Item", type text}}),
#"Grouped Rows" = Table.Group(#"Changed Type", {"MRN"}, {{"Count", each _, type table}}),
#"Expanded Count" = Table.ExpandTableColumn(#"Grouped Rows", "Count", {"MRN", "Item"}, {"Count.MRN", "Count.Item"}),
#"Pivoted Column" = Table.Pivot(#"Expanded Count", List.Distinct(#"Expanded Count"[Count.Item]), "Count.Item", "Count.MRN", List.NonNullCount)
in
#"Pivoted Column"

Excel Lookup IP addresses in multiple ranges

I am trying to find a formula for column A that will check an IP address in column B and find if it falls into a range (or between) 2 addresses in two other columns C and D.
E.G.
A B C D
+---------+-------------+-------------+------------+
| valid? | address | start | end |
+---------+-------------+-------------+------------+
| yes | 10.1.1.5 | 10.1.1.0 | 10.1.1.31 |
| Yes | 10.1.3.13 | 10.1.2.16 | 10.1.2.31 |
| no | 10.1.2.7 | 10.1.1.128 | 10.1.1.223 |
| no | 10.1.1.62 | 10.1.3.0 | 10.1.3.127 |
| yes | 10.1.1.9 | 10.1.4.0 | 10.1.4.255 |
| no | 10.1.1.50 | … | … |
| yes | 10.1.1.200 | | |
+---------+-------------+-------------+------------+
This is supposed to represent an Excel table with 4 columns a heading and 7 rows as an example.
I can do a lateral check with
=IF(AND((B3>C3),(B3 < D3)),"yes","no")
which only checks 1 address against the range next to it.
I need something that will check the 1 IP address against all of the ranges. i.e. rows 1 to 100.
This is checking access list rules against routes to see if I can eliminate redundant rules... but has other uses if I can get it going.
To make it extra special, I cannot use VBA macros to get it done.
I'm thinking of some kind of INDEX/MATCH to look it up in an array, but I'm not sure how to apply it, or whether it can even be done. Good luck.
Ok, so I've been tracking this problem since my initial comment, but have not taken the time to answer because just like Lana B:
I like a good puzzle, but it's not a good use of time if i have to keep guessing
+1 to Lana for her patience and effort on this question.
However, IP addressing is something I deal with regularly, so I decided to tackle this one for my own benefit. Also, no offense, but taking the MIN of the start and the MAX of the end is wrong: it does not account for gaps in the IP white-list. My approach requires 15 helper columns, and the result is simply 1 or 0, corresponding to In or Out respectively. Here is a screenshot (with formulas shown below each column):
The formulas in F2:J2 are:
=NUMBERVALUE(MID(B2,1,FIND(".",B2)-1))
=NUMBERVALUE(MID(B2,FIND(".",B2)+1,FIND(".",B2,FIND(".",B2)+1)-1-FIND(".",B2)))
=NUMBERVALUE(MID(B2,FIND(".",B2,FIND(".",B2)+1)+1,FIND(".",B2,FIND(".",B2,FIND(".",B2)+1)+1)-1-FIND(".",B2,FIND(".",B2)+1)))
=NUMBERVALUE(MID(B2,FIND(".",B2,FIND(".",B2,FIND(".",B2)+1)+1)+1,LEN(B2)))
=F2*256^3+G2*256^2+H2*256+I2
Yes, I used formulas instead of "Text to Columns" so that the conversion keeps working as rows are added to a "living" worksheet.
The formulas in L2:P2 are the same, but replace B2 with C2.
The formulas in R2:V2 are also the same, but replace B2 with D2.
The formula for X2 is
=SUMPRODUCT(--($P$2:$P$8<=J2)*--($V$2:$V$8>=J2))
I also copied your original "valid" set in column A, which you'll see matches my result.
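The helper columns boil down to two steps: convert each dotted-quad address to a single integer (weighting the octets by powers of 256, as in the F2*256^3+... formula), then count the ranges whose start and end integers bracket it. A plain Python sketch of the same idea, with sample ranges taken from the question:

```python
def ip_to_int(ip):
    # weight each dotted-quad octet by a power of 256
    a, b, c, d = (int(part) for part in ip.split("."))
    return a * 256**3 + b * 256**2 + c * 256 + d

# White-list ranges from the question's columns C and D
ranges = [("10.1.1.0", "10.1.1.31"), ("10.1.2.16", "10.1.2.31"),
          ("10.1.1.128", "10.1.1.223"), ("10.1.3.0", "10.1.3.127"),
          ("10.1.4.0", "10.1.4.255")]

def in_any_range(address, ranges):
    # equivalent of the SUMPRODUCT: any range with start <= address <= end
    n = ip_to_int(address)
    return any(ip_to_int(start) <= n <= ip_to_int(end)
               for start, end in ranges)
```

Comparing addresses as text (as the lateral IF(AND(...)) check does) fails because "10.1.1.9" sorts after "10.1.1.128"; converting to integers first is what makes the range test correct.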
You will need helper columns.
Organise your data as outlined in the picture.
Split address, start and end into columns by comma (ribbon menu Data=>Text To Columns).
Above the start/end parts, calculate MIN FOR START, and MAX FOR END for all split text parts (i.e. MIN(K5:K1000) .
FORMULAS:
VALIDITY formula - copy into cell D5, and drag down:
=IF(AND(B6>$I$1,B6<$O$1),"In",
IF(OR(B6<$I$1,B6>$O$1),"Out",
IF(B6=$I$1,
IF(C6<$J$1, "Out",
IF( C6>$J$1, "In",
IF( D6<$K$1, "Out",
IF( D6>$K$1, "In",
IF(E6>=$L$1, "In", "Out"))))),
IF(B6=$O$1,
IF(C6>$P$1, "Out",
IF( C6<$P$1, "In",
IF( D6>$Q$1, "Out",
IF( D6<$Q$1, "In",
IF(E6<=$R$1, "In", "Out") )))) )
)))

MDX Union Across Different Dimensions

I would like to combine two MDX queries from separate dimensions on columns. Example:
Number of Sales/Product Type vs. Gender and State:
      | CA  | OR  | WA  | Male | Female
------------------------------------------------
food  | 125 | 343 | 130 | 570  | 459
------------------------------------------------
drink | 123 | 465 | 135 | 678  | 343
State and Gender are their own respective dimensions, and I would like to do some aggregation (e.g. sales count) across the different product types (food, drink). Below is some idea of how this might work, although the queries cannot be joined as written because they have different hierarchies. How might I go about tacking on Male and Female, for example, as columns in this result?
SELECT
NON EMPTY
{ [Store].[Store State].Members, [Gender].[Gender].Members } ON COLUMNS,
{ [Product].[Product Family].Members } ON ROWS
FROM [Sales]
WHERE { [Measures].[Sales Count] }
Example error:
MondrianEvaluationException: Expressions must have the same hierarchy
Is there a way to do this effectively in MDX? If so, can I specify a different aggregation for each column (e.g. aggregate state data by total sales and gender data by profit)?
Thank you for your help
In MDX, all axes must have the same dimensionality. The best approach would be to run two queries, and show them beside each other in the client tool.
However, you could do something similar to what @Vhteghem_Ph proposed:
SELECT
NON EMPTY
[Store].[Store State].Members * { [Gender].[Gender].[(All Gender)] }
+
{ [Store].[Store State].[(All Store State)] } * [Gender].[Gender].Members
ON COLUMNS,
{ [Product].[Product Family].Members } ON ROWS
FROM [Sales]
WHERE { [Measures].[Sales Count] }
Note that the + used here, which takes two sets as parameters, is a short form of Union(set1, set2). Union also requires both sets to have the same dimensionality; in this case the first dimension of each set is the Store State hierarchy and the second is the Gender hierarchy.
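The trick, then, is to pad each set with the other hierarchy's All member so both sides of the Union have the same dimensionality. As a plain-Python illustration of the column axis this produces (member names are simplified for illustration):

```python
states = ["CA", "OR", "WA"]
genders = ["Male", "Female"]
ALL_GENDER, ALL_STATE = "(All Gender)", "(All Store State)"

# Every column tuple has the same shape: (state member, gender member)
columns = ([(s, ALL_GENDER) for s in states]      # state totals
           + [(ALL_STATE, g) for g in genders])   # gender totals
```

Because every tuple is (state, gender), the two sets union cleanly, and the All member in each padded position means the measure is aggregated over the hierarchy that column ignores.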
I don't know if the following will work with Mondrian:
SELECT
{[Measures].[Internet Sales Amount]} ON 0
,NON EMPTY
{
(
[Customer].[Gender].[Gender].MEMBERS
,[Customer].[Marital Status].[(All)]
)
,(
[Customer].[Gender].[(All)]
,[Customer].[Marital Status].[Marital Status].MEMBERS
)
} ON 1
FROM [Adventure Works];
Philip