Looking up values within a range of cells - arrays

Suppose I have the following data table in Excel
Company | Amount  | Text
------------------------------------
Oracle  | $3,400  | 330 Richard ERP
Walmart | $750    | 348 Mary ERP
Amazon  | $6,880  | xxxx Loretta ERP
Rexel   | $865    | 0000 Mike ERP
Toyota  | $11,048 | 330 Richard ERP
I want to go through each item in the "Text" column, search the item against the following range of names:
Mary
Mike
Janine
Susan
Richard
Jerry
Loretta
and return the name in the "Person" column, if found. For example:
Company | Amount  | Text             | Person
---------------------------------------------
Oracle  | $3,400  | 330 Richard ERP  | Richard
Walmart | $750    | 348 Mary ERP     | Mary
Amazon  | $6,880  | xxxx Loretta ERP | Loretta
Rexel   | $865    | 0000 Mike ERP    | Mike
Toyota  | $11,048 | 330 Richard ERP  | Richard
I've tried the following in Excel which works:
=IF(N2="","",
IF(ISNUMBER(SEARCH(Sheet2!$A$1,N2)),Sheet2!$A$1,
IF(ISNUMBER(SEARCH(Sheet2!$A$2,N2)),Sheet2!$A$2,
IF(ISNUMBER(SEARCH(Sheet2!$A$3,N2)),Sheet2!$A$3,
....
Where $A$1:$A$133 is my range and N2 is the "Text" column values; however, that is a lot of nested code and apparently Excel has a limit on the number of nested IF statements you can have.
Is there a simpler solution (arrays? VBA?)
Thanks!

Use the following formula:
=IFERROR(INDEX(Sheet2!A:A,AGGREGATE(15,6,ROW(Sheet2!$A$1:INDEX(Sheet2!A:A,MATCH("zzz",Sheet2!A:A)))/(ISNUMBER(SEARCH(Sheet2!$A$1:INDEX(Sheet2!A:A,MATCH("zzz",Sheet2!A:A)),N2))),1)),"")
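The formula returns the first name in the list (lowest row number) that occurs anywhere in the text, or an empty string if none does. As a rough illustration of that logic outside Excel, here is a minimal Python sketch; the function name `find_person` is made up for this example, and unlike SEARCH it does not handle wildcards.

```python
def find_person(text, names):
    """Return the first name in `names` found in `text`, ignoring case.

    Returns '' when no name matches, like the IFERROR(..., "") fallback.
    """
    lowered = text.lower()
    for name in names:
        # skip blank entries so an empty cell never counts as a match
        if name and name.lower() in lowered:
            return name
    return ""


names = ["Mary", "Mike", "Janine", "Susan", "Richard", "Jerry", "Loretta"]
texts = ["330 Richard ERP", "348 Mary ERP", "xxxx Loretta ERP",
         "0000 Mike ERP", "no match here"]

persons = [find_person(t, names) for t in texts]
print(persons)
```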

Related

Google Sheets - finding first match in array formula

Okay, I found many solutions for many problems in Google Sheets, but this one is just hard as a rock. :)
I have a sheet where in column C are various names and in column D are professions. For example:
C / D
John Smith / plumber
Paul Anderson / carpenter
Sarah Palmer / dentist
Jonah Huston / carpenter
Laura Jones / dentist
Sid Field / carpenter
...etc
(as you can see, every name is unique, but professions repeat several times)
I'd like to see in column F the last matching name of the same profession
C / D / F
John Smith / plumber / (N/A)
Paul Anderson / carpenter / Jonah Huston
Sarah Palmer / dentist / Laura Jones
Jonah Huston / carpenter / Sid Field
Laura Jones / dentist / (N/A)
Sid Field / carpenter / (N/A)
...etc
It works fine with the INDEX and FILTER functions, but I have to copy the formula over and over again as I add extra rows. This is the formula I use:
=IFERROR(INDEX(FILTER($C4:$C,$D4:$D=D6,$A4:$A<A6),1))
I'm looking for a solution with Array Formula to autofill all cells in column F, and tried various versions (Lookup, Vlookup..etc), but couldn't find the right formula.
Any guidance would be appreciated. :)
try:
=INDEX(IF(C1:C="",,IFNA(VLOOKUP(
D1:D&COUNTIFS(D1:D, D1:D, ROW(D1:D), "<="&ROW(D1:D)), {
D1:D&COUNTIFS(D1:D, D1:D, ROW(D1:D), "<="&ROW(D1:D))-1,
C1:C}, 2, 0), "(N/A)")))
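The formula numbers each occurrence of a profession (carpenter1, carpenter2, ...) and looks up occurrence n in a helper column keyed by n-1, which yields the next name with the same profession. The same idea can be sketched in Python by walking the rows from the bottom up and remembering the most recently seen name per profession. This is only an illustration; `next_same_profession` is a hypothetical helper, not a Sheets function.

```python
def next_same_profession(rows, missing="(N/A)"):
    """For each (name, profession) row, return the name of the next row
    further down that has the same profession, or `missing` if none."""
    next_seen = {}                      # profession -> name seen below
    result = [missing] * len(rows)
    for i in range(len(rows) - 1, -1, -1):   # bottom-up pass
        name, profession = rows[i]
        result[i] = next_seen.get(profession, missing)
        next_seen[profession] = name
    return result


rows = [
    ("John Smith", "plumber"),
    ("Paul Anderson", "carpenter"),
    ("Sarah Palmer", "dentist"),
    ("Jonah Huston", "carpenter"),
    ("Laura Jones", "dentist"),
    ("Sid Field", "carpenter"),
]

column_f = next_same_profession(rows)
for (name, profession), match in zip(rows, column_f):
    print(name, "/", profession, "/", match)
```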

Matching employees from different school and hometown

I am new to coding. I have an employee table that looks like this:
Name    | Hometown   | School
-------------------------------------------------------
Jeff    | Illinois   | Loyola University Chicago
Alice   | California | New York University
William | Michigan   | University of Illinois at Chicago
Fiona   | California | Loyola University Chicago
Charles | Michigan   | New York University
Linda   | Indiana    | Loyola University Chicago
I am trying to put those employees into pairs where the two employees come from different states and different universities. Each person can only be in one pair. The expected table should look like this:
employee1 | employee2
---------------------
Jeff      | Alice
William   | Fiona
Charles   | Linda
The real table is over 3,000 rows. I am trying to do it with SQL or Python, but I don't know where to start.
A straightforward approach is to pick employees one by one and search the rows after each one for an appropriate peer; peers that have been found are flagged so that they are not paired again. Since in your case a peer should be found after a few steps, this iteration will likely be faster than operations that construct whole data sets at once.
from io import StringIO
import pandas as pd

# read example employee table (comma-separated here, since school names contain spaces)
df = pd.read_csv(StringIO("""Name,Hometown,School
Jeff,Illinois,Loyola University Chicago
Alice,California,New York University
William,Michigan,University of Illinois at Chicago
Fiona,California,Loyola University Chicago
Charles,Michigan,New York University
Linda,Indiana,Loyola University Chicago
"""))

# create expected table; its length is at most half that of the above
# (integer division, because RangeIndex does not accept a float)
ef = pd.DataFrame(index=pd.RangeIndex(len(df) // 2), columns=['employee1', 'employee2'])
k = 0  # number of found pairs, index into expected table
# array of flags for already paired employees
paired = pd.Series(False, pd.RangeIndex(len(df)))

# go through the employee table and collect pairs
for i in range(len(df)):
    if paired[i]:
        continue
    for j in range(i + 1, len(df)):
        if not paired[j] \
           and df.iloc[j]['Hometown'] != df.iloc[i]['Hometown'] \
           and df.iloc[j]['School'] != df.iloc[i]['School']:
            # we found a pair - store it, mark employee j paired
            ef.iloc[k] = df.iloc[[i, j]]['Name'].to_numpy()
            k += 1
            paired[j] = True
            break
    else:
        # the inner loop ended without a break: no peer is left for employee i
        print("no peer for", df.iloc[i]['Name'])
print(ef)
output:
employee1 employee2
0 Jeff Alice
1 William Fiona
2 Charles Linda

Is normalization always necessary and more efficient?

I got into databases and normalization. I am still trying to understand normalization and I am confused about its usage. I'll try to explain it with this example.
Every day I collect data which would look like this in a single table:
TABLE: CAR_ALL
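ID  | DATE       | CAR      | LOCATION  | FUEL | FUEL_USAGE | MILES  | BATTERY
-------------------------------------------------------------------------------
123 | 01.01.2021 | Toyota   | New York  | 40.3 | 3.6        | 79321  | 78
520 | 01.01.2021 | BMW      | Frankfurt | 34.2 | 4.3        | 123232 | 30
934 | 01.01.2021 | Mercedes | London    | 12.7 | 4.7        | 4321   | 89
123 | 05.01.2021 | Toyota   | New York  | 34.5 | 3.3        | 79515  | 77
520 | 05.01.2021 | BMW      | Frankfurt | 20.1 | 4.6        | 123489 | 29
934 | 05.01.2021 | Mercedes | London    | 43.7 | 5.0        | 4400   | 89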
In this example I get data for thousands of cars every day. ID, CAR and LOCATION never change. All the other columns can have different values every day. If I understood correctly, normalizing would make it look like this:
TABLE: CAR_CONSTANT
ID  | CAR      | LOCATION
-------------------------
123 | Toyota   | New York
520 | BMW      | Frankfurt
934 | Mercedes | London
TABLE: CAR_MEASUREMENT
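GUID | ID  | DATE       | FUEL | FUEL_USAGE | MILES  | BATTERY
--------------------------------------------------------------
1    | 123 | 01.01.2021 | 40.3 | 3.6        | 79321  | 78
2    | 520 | 01.01.2021 | 34.2 | 4.3        | 123232 | 30
3    | 934 | 01.01.2021 | 12.7 | 4.7        | 4321   | 89
4    | 123 | 05.01.2021 | 34.5 | 3.3        | 79515  | 77
5    | 520 | 05.01.2021 | 20.1 | 4.6        | 123489 | 29
6    | 934 | 05.01.2021 | 43.7 | 5.0        | 4400   | 89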
I have two questions:
1. Does it make sense to create an extra table for DATE?
2. It is possible that new cars will be included through the collected data.
For every row I insert into CAR_MEASUREMENT, I would have to check whether the ID is already in CAR_CONSTANT. If it doesn't exist, I'd have to insert it.
But that means that I would have to check through CAR_CONSTANT thousands of times every day. Wouldn't it be more efficient if I just insert the whole data as 1 row into CAR_ALL? I wouldn't have to check through CAR_CONSTANT every time.
The benefits of normalization depend on your specific use case. I can see both pros and cons to normalizing your schema, but it's impossible to say which is better without knowing more about your use case.
Pros:
With your schema, normalization could reduce the amount of data consumed by your DB since CAR_MEASUREMENT will probably be much larger than CAR_CONSTANT. This scales up if you are able to factor out additional data into CAR_CONSTANT.
Normalization could also improve data consistency if you ever begin tracking additional fixed data about a car, such as license plate number. You could simply update one row in CAR_CONSTANT instead of potentially thousands of rows in CAR_ALL.
A normalized data structure can make it easier to query data for a specific car. Using a LEFT JOIN, the DBMS can search through the CAR_MEASUREMENT table based on the integer ID column instead of having to compare two string columns.
Cons:
As you noted, the normalized form requires an additional lookup and possible insert to CAR_CONSTANT for every addition to CAR_MEASUREMENT. Depending on how fast you are collecting this data, those extra queries could be too much overhead.
To answer your questions directly:
1. I would not create an extra table for just the date. The date is a part of the CAR_MEASUREMENT data and should not be separated. The only exception that I can think of is if you will eventually collect measurements that do not contain any car data. In that case, it would make sense to split CAR_MEASUREMENT into separate MEASUREMENT and CAR_DATA tables, with MEASUREMENT containing the date and CAR_DATA containing just the car-specific data.
2. See the pros and cons above. If you have a use case to query data for a specific car, then the normalized form can be more efficient. If not, then the additional INSERT overhead may not be worth it.
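To give a feel for the insert overhead, here is a minimal sketch using Python's built-in sqlite3 module. SQLite is only an assumption for illustration, since the question does not name a DBMS, and `load` is a hypothetical helper. With ID as the primary key of CAR_CONSTANT, `INSERT OR IGNORE` lets the database skip cars it already knows, so the "check, then insert" step becomes one batched statement instead of a separate lookup per row.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE CAR_CONSTANT (
        ID       INTEGER PRIMARY KEY,
        CAR      TEXT NOT NULL,
        LOCATION TEXT NOT NULL
    );
    CREATE TABLE CAR_MEASUREMENT (
        GUID       INTEGER PRIMARY KEY AUTOINCREMENT,
        ID         INTEGER NOT NULL REFERENCES CAR_CONSTANT(ID),
        DATE       TEXT NOT NULL,
        FUEL       REAL,
        FUEL_USAGE REAL,
        MILES      INTEGER,
        BATTERY    INTEGER
    );
""")

# rows as they arrive in the flat CAR_ALL layout
daily_rows = [
    (123, "01.01.2021", "Toyota", "New York", 40.3, 3.6, 79321, 78),
    (520, "01.01.2021", "BMW", "Frankfurt", 34.2, 4.3, 123232, 30),
    (934, "01.01.2021", "Mercedes", "London", 12.7, 4.7, 4321, 89),
    (123, "05.01.2021", "Toyota", "New York", 34.5, 3.3, 79515, 77),
    (520, "05.01.2021", "BMW", "Frankfurt", 20.1, 4.6, 123489, 29),
    (934, "05.01.2021", "Mercedes", "London", 43.7, 5.0, 4400, 89),
]


def load(conn, rows):
    """Split flat rows into the two normalized tables in one transaction."""
    with conn:  # commits on success, rolls back on error
        # cars that already exist are silently skipped (primary key conflict)
        conn.executemany(
            "INSERT OR IGNORE INTO CAR_CONSTANT (ID, CAR, LOCATION) VALUES (?, ?, ?)",
            [(r[0], r[2], r[3]) for r in rows],
        )
        conn.executemany(
            "INSERT INTO CAR_MEASUREMENT (ID, DATE, FUEL, FUEL_USAGE, MILES, BATTERY) "
            "VALUES (?, ?, ?, ?, ?, ?)",
            [(r[0], r[1], r[4], r[5], r[6], r[7]) for r in rows],
        )


load(conn, daily_rows)
n_cars = conn.execute("SELECT COUNT(*) FROM CAR_CONSTANT").fetchone()[0]
n_measurements = conn.execute("SELECT COUNT(*) FROM CAR_MEASUREMENT").fetchone()[0]
print(n_cars, n_measurements)
```

Other database systems offer similar constructs under different names, so the exact statement would need to be adapted to whatever DBMS is in use.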

Track changes in a cell array with string elements in Matlab

I have a cell array with 5 columns and about 5000 rows with string elements. E.g.:
1997 Charles House Materials Chemicals
The years go from 1997 to 2013 and you find repetitions while the years pass by. What I would like to get is a new cell array with the cases in which the combination of the second and third column change. E.g. original cell array:
1997 Charles House Materials Chemicals %initial
1997 Rita Office Financial Bank %initial
1998 Rita Office Financial Bank %no change
1999 Charles House Materials Chemicals %no change
2000 Charles Office Materials Chemicals %change in the 2nd column
2001 Charles Office Materials Chemicals %no change
2003 Rita Star Financial Bank %change in the 2nd column
2005 Charles Castle Materials Chemicals %change in the 2nd column
2010 Rita Moon Financial Bank %change in the 2nd column
I would like my new array to give me the first/original row and the cases in which you observe a change. E.g. output:
1997 Charles House Materials Chemicals
1997 Rita Office Financial Bank
2000 Charles Office Materials Chemicals
2003 Rita Star Financial Bank
2005 Charles Castle Materials Chemicals
2010 Rita Moon Financial Bank
My problem is mainly related to the fact that I am dealing with strings. If someone could help me, I would appreciate it. Thanks a lot for your help.
Code
%// Assuming input_cellarr is the input cell array
[~,~,col2id] = unique(input_cellarr(:,2),'stable')
[~,~,col3id] = unique(input_cellarr(:,3),'stable')
[~,unqrows] = unique([col2id col3id],'rows','stable')
out = input_cellarr(unqrows,:) %// Desired output
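First of all, I left the first column out of the comparison because the year does not matter. Then you have to compare each line with the previous one. This is done using X(1:end-1,2:end) and X(2:end,2:end). Only if the two rows are equal in every compared column has nothing changed, so the result of the comparison must be negated. Note that this compares adjacent rows, so the rows for each person have to be contiguous (for example, sorted by name and then by year).
changes=~all(strcmpi(X(1:end-1,2:end),X(2:end,2:end)),2)
%at this point we are missing the first line, which is always kept
changes=[true;changes]
%finally use logical indexing to select the relevant data.
result=X(changes,:);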
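When the rows of different people are interleaved, as in the example data, another option is to remember the last value seen per person. The sketch below shows that idea in Python rather than MATLAB, purely as an illustration; it assumes that "change" means the third column differs from the same person's previous row, and `track_changes` is a hypothetical name.

```python
def track_changes(rows):
    """Keep a row when its third column differs from the previous row
    of the same person (second column); first rows are always kept."""
    last_place = {}     # person -> third-column value of their latest row
    kept = []
    for row in rows:
        name, place = row[1], row[2]
        if last_place.get(name) != place:
            kept.append(row)
        last_place[name] = place
    return kept


rows = [
    ("1997", "Charles", "House", "Materials", "Chemicals"),
    ("1997", "Rita", "Office", "Financial", "Bank"),
    ("1998", "Rita", "Office", "Financial", "Bank"),
    ("1999", "Charles", "House", "Materials", "Chemicals"),
    ("2000", "Charles", "Office", "Materials", "Chemicals"),
    ("2001", "Charles", "Office", "Materials", "Chemicals"),
    ("2003", "Rita", "Star", "Financial", "Bank"),
    ("2005", "Charles", "Castle", "Materials", "Chemicals"),
    ("2010", "Rita", "Moon", "Financial", "Bank"),
]

changed = track_changes(rows)
for row in changed:
    print(" ".join(row))
```

Unlike the unique-rows approach, this also reports a person returning to an earlier value (for example House, then Office, then House again).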

Human Name lookup / translation

I am working on a requirement to match people from different databases. One tricky problem is the variation in names across databases, like Bob - Robert, Jim - James, Lizzy - Elizabeth, etc.
Is there a lookup/translation available for this kind of requirement?
Take a look at my answer (as well as the others) here:
Tools for matching name/address data
You'd need to implement a lookup table with the alternate names in it:
Base | Alternate
----------------
Robert | Bob
Elizabeth | Liz
Elizabeth | Lizzy
Elizabeth | Beth
Then search the database for the base name and all alternates. You'll end up with a number of multiple matches which will then need to be checked to see if they really match based on a comparison of whatever other data you have in the two databases. Maybe the dates of the records in each database could be used - records entered close in time indicate the same person.
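A minimal Python sketch of such a lookup table, using only the pairs listed above. The helper names `build_index` and `search_terms` are made up for this illustration, and a real table would need many more entries.

```python
from collections import defaultdict

# (base, alternate) pairs, copied from the table above
ALTERNATES = [
    ("Robert", "Bob"),
    ("Elizabeth", "Liz"),
    ("Elizabeth", "Lizzy"),
    ("Elizabeth", "Beth"),
]


def build_index(pairs):
    """Return (base_of, variants): a map from any lowercased spelling to its
    base name, and a map from each base name to all of its spellings."""
    base_of = {}
    variants = defaultdict(set)
    for base, alt in pairs:
        base_of[base.lower()] = base
        base_of[alt.lower()] = base
        variants[base].update({base, alt})
    return base_of, variants


def search_terms(name, base_of, variants):
    """All spellings to search for; unknown names are returned unchanged."""
    base = base_of.get(name.lower())
    if base is None:
        return {name}
    return set(variants[base])


base_of, variants = build_index(ALTERNATES)
print(sorted(search_terms("Lizzy", base_of, variants)))
print(sorted(search_terms("Bob", base_of, variants)))
```

The candidate records returned for these spellings still have to be confirmed against the other fields, as described above.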

Resources