Find all matching results in column and output their adjacent rows - arrays

Summary
I have a database of ~1500 entries. Some of them have more than one entry (up to 5) under the same identifier. I have a pre-set list of IDs I would like to check to see if they have multiple records, and if so, pull all of them, so I can compare them for differences.
Current solution
I currently have this working correctly, using one formula each to pull the 1st, 2nd, 3rd, 4th and 5th result. The formula for the 1st result is:
=INDEX('DATABASE'!$B:$B,SMALL(IF($A2='DATABASE'!$A:$A,ROW('DATABASE'!$A:$A)-ROW('DATABASE'!$A$1)+1),1))
and for the 2nd result:
=INDEX('DATABASE'!$B:$B,SMALL(IF($A2='DATABASE'!$A:$A,ROW('DATABASE'!$A:$A)-ROW('DATABASE'!$A$1)+1),2))
and so on. The first formula checks the value in A2 (ID2Check) and matches it against column A on the 'DATABASE' sheet. If it matches, it returns the first matching value from column B. The second formula does the same, except that it returns the second match from column B. ID2Check is populated by me, and Results 1-5 are the formula output. For example:
Database:
ID      Data Item
AAA     Lemons
111     Greenhouse
FOO     Computer
AAA     Monitor
CAT     Coffee
ORANGE  Pintglass
123     Birthday
FOO     Avengers
AAA     Plasters
FOO     NachoTaco
Code output:
ID2Check  Result 1  Result 2  Result 3   Result 4  Result 5
AAA       Lemons    Monitor
123       Birthday
FOO       Computer  Avengers  NachoTaco
Problem
This solution works correctly and pulls the expected values. The problem is that this formula needs to be entered 5 times per identifier, and with ~1500 records to check, I need to flash fill 7500 formulas. Unsurprisingly, this leads to Excel crashing and becoming unresponsive for the best part of 30 minutes. I am looking for a solution that is more efficient and stable when run en masse. Any assistance would be appreciated.
Final Notes / Info
I'm using MSO 365, version 2202 (I cannot update beyond this). This will be run in the Desktop version of Excel. I would prefer this is done exclusively using formulas, but I am open to using Visual Basic if it would be otherwise impossible or incredibly inefficient.

=UNIQUE(C5:C15)
=TRANSPOSE(FILTER($D$5:$D$15;$C$5:$C$15=F5))

If you have access to the HSTACK() function, you could try:
=HSTACK(UNIQUE(A2:A11),TEXTSPLIT(TEXTJOIN("/",FALSE,BYROW(UNIQUE(A2:A11),LAMBDA(x,TEXTJOIN("|",TRUE,FILTER(B2:B11,A2:A11=x))))),"|","/",,,""))
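
For anyone who wants to sanity-check the expected output outside Excel, the grouping that the formulas above perform can be sketched in Python. This is only an illustration using the sample data from the question, not an Excel solution; the function name `matches_per_id` is made up for the example.

```python
from collections import defaultdict

# Sample (ID, Data Item) rows from the question's Database table.
DATABASE = [
    ("AAA", "Lemons"), ("111", "Greenhouse"), ("FOO", "Computer"),
    ("AAA", "Monitor"), ("CAT", "Coffee"), ("ORANGE", "Pintglass"),
    ("123", "Birthday"), ("FOO", "Avengers"), ("AAA", "Plasters"),
    ("FOO", "NachoTaco"),
]

def matches_per_id(rows, ids_to_check):
    """Return {id: [every data item for that id, in sheet order]}."""
    grouped = defaultdict(list)
    for row_id, item in rows:  # a single pass over the database
        grouped[row_id].append(item)
    # IDs with no match get an empty list instead of an error.
    return {i: grouped.get(i, []) for i in ids_to_check}

result = matches_per_id(DATABASE, ["AAA", "123", "FOO"])
for row_id, items in result.items():
    print(row_id, *items)
```

The point of the sketch is that the database is scanned once for all IDs, whereas the INDEX/SMALL approach rescans the whole column for every one of the 7500 cells.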

Related

Long calculation times with XLOOKUP vs INDEX-MIN-COLUMN

I'm using this formula =IF(B24="","",IFERROR(INDEX(Sheet3!$C$3:$EE$3,,MIN(IF(Sheet3!$C$4:$EE$23=(Sheet2!C24&$K$18),COLUMN(Sheet3!$C:$EE)))-2),"NF")) to return a cell value in the top row of an array - a date in this case.
The search criterion is a combination of a unique project number and a 2-character alphanumeric status code for the project. The array consists of 23 rows where combinations of the unique numbers are found, each with different status codes.
So essentially, I'm building a FILTERED project status dashboard that returns dates linked to the relevant project status.
The code above is inspired by ( LINK ), which uses a very similar layout but with town suburbs linked to postal codes instead of project numbers and status codes. The formula works well (though not entered as an array formula), but I don't have a single formula in the sheet: I have 3 300 occurrences of this formula.
The problem comes in when the user changes the FILTER - Excel recalculates the entire dashboard and that takes anywhere from 2 to 5 minutes to run. You hit the escape button and cancel the calculation after setting the filter, but Excel just starts calculating again after a few seconds. After that, Excel's response is sluggish and almost unusable. Yes - our hardware is pretty weak ...
I tried XLOOKUP as well, but I can't set the "lookup_array" to an array ( Sheet3!$C$4:$EE$23 ) because its shape doesn't match the "return_array" ( Sheet3!$C$3:$EE$3 ). Concatenating the lookup arrays with & works, but then you'd have to do that for all 23 rows, and again, multiply that by 3 300.
I thought of creating a UDF, but the function will still be called every time Excel recalculates after filtering... 3 300 calls ...
Any ideas on how to make the INDEX version run faster, or make the XLOOKUP accept the lookup_array as Sheet3!$C$4:$EE$23 in the hopes that it'll run faster?
Thank you!
Not really an elegant solution, but it works.
I imported the dataset into a helper sheet, where I combined the cell value with the corresponding value in Column A for each row ( a name in this case ) and the date from row 1 for each column, using underscore as a delimiter.
This new data range was then given a unique name, EE in this case.
On a second helper sheet, I entered this formula =INDEX(Filtered,1+INT((ROW('Sheet1'!C3)-1)/COLUMNS(Filtered)),MOD(ROW('Sheet1'!C3)-1+COLUMNS(Filtered),COLUMNS(Filtered))+1) and dragged it down until it returned a #REF! error, then went back one row before the error.
This transposes all the data into a single column G. Using =UNIQUE(SORT(FILTER(B3:B3240,B3:B3240<> "",""))) then gives me a filtered list of unique values in column H that I then run
=IF(H3="","",LEFT(H3, SEARCH("_",H3,1)-1)) for the first data value in I, and
=IF(H3="","",MID(H3, SEARCH("_",H3) + 1, SEARCH("_",H3,SEARCH("_",H3)+1) - SEARCH("_",H3) - 1)) for the middle data value in J, and
=IF(H3="","",IFERROR(TEXT(RIGHT(H3,5),"yyyy-mm-dd"),"NF")) for the last data value in K.
Then just run XLOOKUP across columns I, J and K.
It runs quickly and easily, and it solves a few of the other issues I had as well.
The second data set has just over 35 000 rows - still works well and fast.
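
The reason a flattened helper list tends to be faster can be sketched in Python: the 2-D range is scanned once to build an index, and every later lookup is a single key access instead of a fresh scan of the whole range. The grid contents, dates and function names below are invented for illustration; only the "project number + status code" key and the "NF" fallback come from the question.

```python
# Hypothetical miniature of the Sheet3 layout: a header row of dates and a
# grid whose cells hold "project number + status code" strings.
HEADER_DATES = ["2023-01-05", "2023-01-12", "2023-01-19"]
GRID = [
    ["1001AA", "1001AB", ""],
    ["1002AA", "", "1002AC"],
]

def build_index(header, grid):
    """Map each non-empty cell value to the header of its left-most column."""
    index = {}
    for row in grid:
        for col, value in enumerate(row):
            # Keep the smallest column, mirroring MIN(IF(..., COLUMN(...))).
            if value and (value not in index or col < index[value][0]):
                index[value] = (col, header[col])
    return {key: date for key, (col, date) in index.items()}

INDEX = build_index(HEADER_DATES, GRID)  # built once, reused by every lookup

def lookup(project, status):
    """Return the date for project+status, or "NF" when it is absent."""
    return INDEX.get(project + status, "NF")

print(lookup("1002", "AC"))
print(lookup("1003", "AA"))
```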

Excel, counting within a formulatext on array

Imagine rows A2:A11 = name of customer, Columns B1:AE1 = days of the month.
To make it easy:
Daily, we tally each customer's purchases (quantities) and separate them with a + to get the total purchase for that day. Example, on the 2nd day of the month (C2:C5):
Abe =44+54+10
John =22+10+40
Sara =40
Mary =10+10
Also we need to count total “sales cases” of the whole day (in the above example it is 3+3+1+2)= 9 to show in the last row of the day. (B12 in this example)
The logic is something like
=SUMPRODUCT(LEN(FORMULATEXT(C2:C5))-LEN(SUBSTITUTE(FORMULATEXT(C2:C5),"+","")))
But I’m getting #N/A.
reminder: when there are no "+" signs & the value is more than zero, it should count as 1.
help?
There is a trick to do this, which might end up a bit convoluted to achieve the end result you want
First, define a new name in Name Manager (in the Formula menu bar)
Name: FormulaText
Refers to: =GET.CELL(6,OFFSET(INDIRECT("RC",FALSE),0,-1))
Now if you have a formula in cell B3 of =10+20+30 enter =FormulaText in cell C3 and you will get the text version of the formula
You can now count + symbols in that formula using =LEN(C3)-LEN(SUBSTITUTE(C3,"+",""))
In your specific circumstances, I would offset all of this to the right of your spreadsheet, say by 35 columns, in which case you will need to change the definition of FormulaText accordingly.
A fair bit of set-up work, but the result should work automagically.
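
The counting rule itself (number of + signs plus one, but only when the value is above zero) can be checked with a small Python sketch. It assumes every cell holds a formula made of plain numbers joined by +, which is an assumption on my part; `count_sales_cases` is a made-up name. As a side note, FORMULATEXT returns #N/A for any cell that does not contain a formula, which may be what triggers the error in the question.

```python
def count_sales_cases(formula_text):
    """Count the terms of a formula such as "=44+54+10".

    Assumes the formula is only plain numbers joined by "+".
    """
    body = formula_text.lstrip("=").strip()
    if not body:
        return 0  # empty cell: no sales
    terms = [t for t in body.split("+") if t.strip()]
    total = sum(float(t) for t in terms)
    # A single value with no "+" still counts as one case when it is > 0.
    return len(terms) if total > 0 else 0

# Formula text for Abe, John, Sara and Mary on the example day.
day = ["=44+54+10", "=22+10+40", "=40", "=10+10"]
total_cases = sum(count_sales_cases(f) for f in day)
print(total_cases)
```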

Remove duplicate values based on timestamp

I need your help with an SQL query that has to remove duplicate entries from a table, mostly using the datestamp column as the criterion, in two passes.
The DBMS in question is Microsoft SQL Server.
Here are a few more details:
Terminology: Module is basically a group of single machine workplaces onto which users operate.
Table:
The ModNam column is fixed: there are 15 modules from M A01 to M A15, followed by row B (M B01 ... M B15), and so on until row F.
Pos column is irrelevant at the moment.
MdCod column represents a code of the machine being added to the position in the certain module. It can be replaced by another machine at any given time.
I have one query that will be inserting data into this table by copying entries from another table, every time a new machine is added to one of the positions.
The tricky part for me is a second query that should compare records in two passes:
1) Inside the same module (first pass of the query, shown in red in the attached example pic):
if the ModNam value is the same and MdCod matches between the entries, the most recent datestamp decides which single entry stays, and the other duplicates get deleted.
2) Across other modules (second pass of the query, shown in purple in the attached example pic):
if the ModNam values are different and MdCod matches between the entries, the most recent datestamp decides which single entry stays, and the other duplicates get deleted.
Please help and advise.
Example pic (updated):
Thank you all in advance.
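
To make the two passes concrete, here is a small Python sketch of the intended logic. It is not T-SQL; in SQL Server this kind of "keep the newest row per group" clean-up is commonly written with ROW_NUMBER() OVER (PARTITION BY ... ORDER BY ... DESC) and a delete of the rows numbered above 1. The column names follow the question, while the sample values and the `keep_latest` helper are invented for illustration.

```python
from datetime import datetime

# Hypothetical rows: (ModNam, Pos, MdCod, datestamp).
ROWS = [
    ("M A01", 1, "MC-100", datetime(2023, 5, 1, 8, 0)),
    ("M A01", 2, "MC-100", datetime(2023, 5, 2, 9, 30)),
    ("M B03", 1, "MC-100", datetime(2023, 5, 3, 7, 15)),
    ("M A02", 1, "MC-200", datetime(2023, 5, 1, 10, 0)),
]

def keep_latest(rows, key):
    """Keep only the most recent row for each key, ordered by datestamp."""
    latest = {}
    for row in rows:
        k = key(row)
        if k not in latest or row[3] > latest[k][3]:
            latest[k] = row
    return sorted(latest.values(), key=lambda r: r[3])

# Pass 1: duplicates inside the same module (same ModNam and MdCod).
pass1 = keep_latest(ROWS, key=lambda r: (r[0], r[2]))
# Pass 2: duplicates across modules (same MdCod, any ModNam).
pass2 = keep_latest(pass1, key=lambda r: r[2])

for row in pass2:
    print(row)
```

If both passes always keep the most recent datestamp, the end result appears to be the same as keeping the single newest row per MdCod, so one query might be enough; whether that holds for the real data is for the asker to confirm.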

SPSS :Loop through the values of variable

I have a dataset that has patient data according to the site they visited our mobile clinic. I have now written up a series of commands such as freqs and crosstabs to produce the analyses I need, however I would like this to be done for patients at each site, rather than the dataset as whole.
If I had only one site, a mere filter command with the variable that specifies a patient's site would suffice, but alas I have 19 sites, so I would like to find a way to loop through my code to produce these outputs for each site. That is to say for i in 1 to 19:
1. Take the i th site
2. Compute a filter for this i th site
3. Run the tables using this filtered data of patients at ith site
Here is my first attempt using DO REPEAT. I also tried using LOOP earlier.
However, it does not work: I keep getting an error even though these are closed loops.
Is there a way to do this in SPSS syntax? Bear in mind I do not know Python well enough to do this using that plugin.
*LOOP #ind= 1 TO 19 BY 1.
DO REPEAT #ind= 1 TO 20.
****8888888888888888888888888888888888888888888888888888888 Select the Site here.
COMPUTE filter_site=(RCDSITE=#ind).
USE ALL.
FILTER BY filter_site.
**********************Step 3: Apply the necessary code for tables
*********Participation in the wellness screening, we actually do not care about those who did FP as we are not reporting it.
COUNT BIO= CheckB (1).
* COUNT FPS=CheckF(1).
* COUNT BnF= CheckB CheckF(1).
VAL LABEL BIO
1 ' Has the Wellness screening'
0 'Does not have the wellness screening'.
*VAL LABEL FPS
1 'Has the First patient survey'.
* VAL LABEL BnF
1 'Has either Wellness or FPS'
2 'Has both surveys done'.
FREQ BIO.
*************************Use simple math to calcuate those who only did the Wellness/First Patient survey FUB= F+B -FnB.
*******************************************************Executive Summary.
***********Blood Pressure.
FREQ BP.
*******************BMI.
FREQ BMI.
******************Waist Circumference.
FREQ OBESITY.
******************Glucose.
FREQ GLUCOSE.
*******************Cholesterol.
FREQ TC.
************************ Heamoglobin.
FREQ HAEMOGLOBIN.
*********************HIV.
FREQ HIV.
******************************************************************************I Lifestyle and General Health.
MISSING VALUES Gender GroupDep B8 to B13 ('').
******************Graphs 3.1
Is this just frequencies you are producing? Try SPLIT FILE by the variable RCDSITE. That should be enough.
SPLIT FILE allows you to partition your data by up to eight variables. Then each procedure will automatically iterate over each group.
If you need to group the results at a higher level than the procedure, that is, to run a bunch of procedures for each group before moving on to the next one so that all the output for a group will be together, you can use the SPSSINC SPLIT DATASET and SPSSINC PROCESS FILES extension commands to do this.
These commands require the Python Essentials. That and the commands can be downloaded from the SPSS Community website (www.ibm.com/developerworks/spssdevcentral) if you have at least version 18.
HTH,
Jon Peck
A simple but perhaps not very elegant way is to select from the menu: Data/Select Cases/If condition, there you enter the filter for site 1 and press Paste, not OK.
This will give the used filter as syntax code.
So with some copy/paste/replace/repeat you can get the freqs and all other results based on the different sites.
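
The copy/paste/replace/repeat step can also be scripted. Below is a rough Python sketch that only generates the repeated syntax as text, which can then be pasted into a syntax window; it does not need the SPSS Python plugin. The filter lines and FREQ commands are taken from the question, the list is shortened for brevity, and the function names are made up.

```python
# Commands to repeat for every site, taken from the question's syntax.
ANALYSES = ["FREQ BIO.", "FREQ BP.", "FREQ BMI."]

def syntax_for_site(site):
    """Build the filter-and-analyse block for one site number."""
    lines = [
        "USE ALL.",
        f"COMPUTE filter_site=(RCDSITE={site}).",
        "FILTER BY filter_site.",
        *ANALYSES,
        "FILTER OFF.",
    ]
    return "\n".join(lines)

def syntax_for_all_sites(first=1, last=19):
    """Join the per-site blocks for sites first..last inclusive."""
    return "\n\n".join(syntax_for_site(s) for s in range(first, last + 1))

script = syntax_for_all_sites()
print(syntax_for_site(1))
```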

How do I get around the Sum(First(...)) not allowed limitation in SSRS 2005

The problem that I have is SQL Server Reporting Services does not like Sum(First()) notation. It will only allow either Sum() or First().
The Context
I am creating a reconciliation report, i.e. what stock we had at the start of a period, what was ordered and what stock we had at the end.
Dataset returns something like
Type,Product,Customer,Stock at Start (SAS),Ordered Qty,Stock at End (SAE)
Export,1,1,100,5,90
Export,1,2,100,5,90
Domestic,2,1,200,10,150
Domestic,2,2,200,20,150
Domestic,2,3,200,30,150
I group by Type, then Product and list the customers that bought that product.
I want to display the total for SAS, Ordered Qty, and SAE but if I do a Sum on the SAS or SAE I get a value of 200 and 600 for Product 1 and 2 respectively when it should have been 100 and 200 respectively.
I thought that I could do a Sum(First()), but SSRS complains that I cannot have an aggregate within an aggregate.
Ideally SSRS needs a Sum(Distinct())
Solutions So Far
1. Don't show the Stock at Start and Stock At End as part of the totals.
2. Write some code directly in the report to do the calc. tried this one - didn't work as I expected.
3. Write an assembly to do the calculation. (Have not tried this one)
Edit - Problem clarification
The problem stems from the fact that this is actually two reports merged into one (as I see it). A Production Report and a sales report.
The report tried to address these criteria
the market that we sold it to (export, domestic)
how much did we have in stock,
how much was produced,
how much was sold,
who did we sell it to,
how much do we have left over.
The complicating factor is the "who did we sell it to". Without that, it would have been relatively easy. But including it means that the other top-line figures (stock at start and stock at end) have nothing to do with what is sold, other than the particular product.
I had a similar issue and ended up using ROW_NUMBER in my query to provide an integer for the row value and then using SUM(IIF(myRowNumber = 1, myValue, 0)).
I'll edit this when I get to work and provide more data, but thought this might be enough to get you started. I'm curious about Adolf's solution too.
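
In the meantime, here is a quick Python sketch of why the row-number trick gives the right totals, using the rows from the question. It assumes the row number restarts for each Type and Product combination (my assumption about how ROW_NUMBER would be partitioned); the helper names are invented.

```python
# Rows from the question: (Type, Product, Customer, SAS, Ordered Qty, SAE).
ROWS = [
    ("Export", 1, 1, 100, 5, 90),
    ("Export", 1, 2, 100, 5, 90),
    ("Domestic", 2, 1, 200, 10, 150),
    ("Domestic", 2, 2, 200, 20, 150),
    ("Domestic", 2, 3, 200, 30, 150),
]

def add_row_numbers(rows):
    """Append a row number that restarts for each (Type, Product) pair."""
    seen = {}
    numbered = []
    for row in rows:
        key = (row[0], row[1])
        seen[key] = seen.get(key, 0) + 1
        numbered.append(row + (seen[key],))
    return numbered

def totals(numbered):
    """Sum SAS and SAE only where the row number is 1; sum every Ordered Qty."""
    sas = sum(r[3] if r[6] == 1 else 0 for r in numbered)
    ordered = sum(r[4] for r in numbered)
    sae = sum(r[5] if r[6] == 1 else 0 for r in numbered)
    return sas, ordered, sae

numbered = add_row_numbers(ROWS)
product_1 = [r for r in numbered if r[1] == 1]
product_2 = [r for r in numbered if r[1] == 2]
print(totals(product_1), totals(product_2), totals(numbered))
```

Product 1 now totals 100 for SAS instead of 200, and Product 2 totals 200 instead of 600, while Ordered Qty is still summed over every customer row.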
Pooh! Where's my peg?!
Have you thought about using windowing/ranking functions in the SQL for this?
This allows you to aggregate data without losing detail
e.g. Imagine for a range of values, you want the Min and Max returning, but you also wish to return the initial data (no summary of data).
Group Value Min Max
A 3 2 9
A 7 2 9
A 9 2 9
A 2 2 9
B 5 5 7
B 7 5 7
C etc..
The syntax looks odd, but it's just
AggregateFunctionYouWant OVER (PARTITION BY WhatYouWantItGroupedBy ORDER BY WhatYouWantItOrderedBy) AS AggVal
Windowing
Ranking
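
To show what "aggregate without losing detail" means, here is a rough Python equivalent of MIN and MAX computed per group and attached to every row, using the Group/Value pairs from the example table above. It is only a sketch of the idea, not SQL, and the function name is made up.

```python
# (Group, Value) pairs from the example table.
DATA = [("A", 3), ("A", 7), ("A", 9), ("A", 2), ("B", 5), ("B", 7)]

def with_group_min_max(rows):
    """Return (group, value, group_min, group_max) for every input row."""
    bounds = {}
    for group, value in rows:
        lo, hi = bounds.get(group, (value, value))
        bounds[group] = (min(lo, value), max(hi, value))
    # Every detail row is kept; the aggregates are simply attached to it.
    return [(g, v, *bounds[g]) for g, v in rows]

result = with_group_min_max(DATA)
for row in result:
    print(*row)
```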
your dataset is a little weird but i think i understand where you're going.
try making the dataset return in this order:
Type, Product, SAS, SAE, Customer, Ordered Qty
what i would do is create a report with a table control. i would set up the type, product, and customer as three separate groups. i would put the sas and sae data on the same group as the product, and the quantity on the customer group. this should resemble what i believe you are trying to go for. your sas and sae should be in a first()
Write a subquery.
Ideally SSRS needs a Sum(Distinct())
Re-write your query to do this correctly.
I suspect your problem is that you've written a query that gets you the wrong results, or you have poorly designed tables. Without knowing more about what you're trying to do, I can't tell you how to fix it, but it has a bad "smell".

Resources