I am writing some code to find duplicate customer details in a database. I'll be using Levenshtein distance.
However, I am not sure how to store the relationships. I use databases all the time, but I have never come across this situation before and wondered if someone could point me in the right direction.
What confuses me is how to store the bidirectional nature of the relationship.
I've started to put some examples below, but wondered if there is a best practice for storing this type of data.
Example data
id, address
001, 5 Main Street
002, 5 Main St.
003, 5 Main Str
004, 6 High Street
005, 7 Low Street
006, 7 Low St
Suggestion 1
customer_id1, customer_id2, relationship_strength
001, 002, 0.74
001, 003, 0.77
002, 003, 0.76
005, 006, 0.77
Not happy with this approach, as it sort of implies a one-way relationship from customer_id1 to customer_id2. Unless of course I include all relationships both ways, but that would double the processing time and the size of the tables.
e.g. I would need to include: 002, 001, 0.74
Suggestion 2
customer_id, grouping_id
001, 1
002, 1
003, 1
005, 2
006, 2
The way to deal with symmetric relations in a relational system is as follows:
choose a canonical form in which the symmetric pairs are stored, e.g. customer_id1 < customer_id2.
Define a view SYMM_TBL as select id1,id2,... from ... UNION select id2 as id1,id1 as id2, ... FROM ...
A decent system ought not to punish you performance-wise when querying this view.
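As a minimal sketch of those two steps (assuming the pairs are stored in a table called customer_similarity; all names and types here are illustrative, not a prescribed schema):

-- Canonical storage: each pair appears once, with customer_id1 < customer_id2
CREATE TABLE customer_similarity (
    customer_id1          INT NOT NULL,
    customer_id2          INT NOT NULL,
    relationship_strength DECIMAL(4,2) NOT NULL,
    PRIMARY KEY (customer_id1, customer_id2),
    CHECK (customer_id1 < customer_id2)
);

-- SYMM_TBL view: every stored pair is exposed in both directions
CREATE VIEW SYMM_TBL AS
SELECT customer_id1 AS id1, customer_id2 AS id2, relationship_strength
FROM customer_similarity
UNION ALL
SELECT customer_id2 AS id1, customer_id1 AS id2, relationship_strength
FROM customer_similarity;

Queries then go against SYMM_TBL, so callers never need to know which way round a pair was stored.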
What we have here is a graph in which each node has a relationship (edit distance) with every other node. This is not in the normal range of data models. It is also not a permanent feature of your database (assuming you resolve the business processes which led to the duplicate data), so it isn't worth sweating over the solution which best fits relational theory. What we need is a practical solution.
Think of it as a matrix. If we go for the optimum processing we won't execute the duplicate scorings. So we score Address 1 against all the other Addresses, we score Address 2 against all the other Addresses except Address 1, we score Address 3 against all the other Addresses except Addresses 1 and 2, etc. And what we end up with is a bit like a football league table:
            addr
            1     2     3     4     5
addr   1    -    95    95    80    76
       2    -     -   100    75    72
       3    -     -     -    75    72
       4    -     -     -     -    83
       5    -     -     -     -     -
This data can best be stored in suggestion 1, a table of ID1, ID2, SCORE. Although we do need to pivot the data to get the output looking like that :)
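To populate only the upper half of that matrix, here is a minimal sketch (assuming PostgreSQL with the fuzzystrmatch extension, which provides a levenshtein() function; the customers table, the address_scores table and the percentage formula are all illustrative assumptions, not a prescribed method):

-- Score each unordered pair exactly once by requiring a.id < b.id
INSERT INTO address_scores (id1, id2, score)
SELECT a.id,
       b.id,
       100 - (100 * levenshtein(a.address, b.address)
                  / GREATEST(LENGTH(a.address), LENGTH(b.address)))
FROM customers a
JOIN customers b ON a.id < b.id;

The a.id < b.id condition is what gives you only the "above the diagonal" half of the league table.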
In a proper league table there are two sets of scores - Home and Away - so both halves of the matrix get filled in. That doesn't apply here, as the edit distance for 1 > 2 is the same as for 2 > 1. However, it would make querying the results more straightforward if the result set included the mirrored scores. That is, for records (1,5,76), (2,5,72), etc. we generate records (5,1,76), (5,2,72). This could be done at the end of the scoring process.
            addr
            1     2     3     4     5
addr   1    -    95    95    80    76
       2   95     -   100    75    72
       3   95   100     -    75    72
       4   80    75    75     -    83
       5   76    72    72    83     -
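A minimal sketch of that mirroring step, reusing the illustrative address_scores table from above:

-- Generate the mirror image of every stored pair
INSERT INTO address_scores (id1, id2, score)
SELECT id2, id1, score
FROM address_scores
WHERE id1 < id2;  -- only the original (canonical) half gets mirrored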
Of course, this is mainly a presentational thing, so it only needs to be done for display purposes, e.g. exporting the data to a spreadsheet. We can still get all the scores for, say, Address 5 in a readable fashion without mirroring the scores, using a simple SQL statement:
select case when id1 = 5 then id1 else id2 end as id1
, case when id1 = 5 then id2 else id1 end as id2
, score
from your_table
where id1 = 5
or id2 = 5
/
As always it depends on what you want to do with the data once you've calculated it.
Assuming it's simply to identify or locate duplicates then your suggestion 1 is what I'd use, i.e. a second table that simply stores the pairs and the strengths. My only suggestion is to make the strengths a scaled integer rather than a decimal.
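For instance, a minimal sketch of such a table with the strength held as an integer 0-100 (the table and column names are illustrative assumptions):

-- Pairs stored once, strength as a scaled integer (74 rather than 0.74)
CREATE TABLE customer_match (
    customer_id1 INT NOT NULL,
    customer_id2 INT NOT NULL,
    strength     SMALLINT NOT NULL,
    PRIMARY KEY (customer_id1, customer_id2)
);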
Related
I need to model a star schema for some business needs about liquidity stress testing.
I will try to give an analogous example.
Let's say we have deals involving financing / financial securities, etc.
In the fact table, at a given date, these deals have an actual value of X euros, but that value will vary over time, so we also have some projected values.
My concern is how to represent these projected values for these deals, and more specifically what granularity to choose.
(The example below is an oversimplification of the fact table -- and yes, it's dimension IDs that are used otherwise.)
Method 1: as many metric columns as there are projection values calculated
AsofDate     DealId           0Day   1Day   7Days   1Month
2022-01-01   financingDeal1    100     99      98       85
2022-01-01   financingDeal2    150    150     120      120
2022-01-01   financingDeal3    100     99      98       85
2022-01-01   financingDeal4    100     99      98       85
Method 2: add a granularity: a row is no longer just a deal at a given date; it's a deal at a given date together with one of its projections over the next few days/months
AsofDate     DealId           projection      value
2022-01-01   financingDeal1   0Day - actual     100
2022-01-01   financingDeal1   1Day               99
2022-01-01   financingDeal1   7Days              99
2022-01-01   financingDeal1   1Month             85
From where I see it:
For Method 1, the main inconvenience is that if, in the future, we have a new projection value for 3 months, we will need to add a '3months' column in the ETL / in the OLAP cube.
For Method 2, we will have as many rows as (deals * projections); we have 11 projections, so that's 11 rows for each deal, and we have 1 million+ deals.
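For reference, here is a minimal sketch of what the Method 2 grain could look like as tables (all names and types are illustrative assumptions, not my actual model):

-- Hypothetical projection dimension: a new horizon is just a new row, not a new column
CREATE TABLE dim_projection (
    projection_id    INT PRIMARY KEY,
    projection_label VARCHAR(20)   -- '0Day - actual', '1Day', '7Days', '1Month', ...
);

-- Fact table at the (date, deal, projection) grain
CREATE TABLE fact_deal_value (
    asof_date     DATE NOT NULL,
    deal_id       INT  NOT NULL,
    projection_id INT  NOT NULL REFERENCES dim_projection (projection_id),
    value         DECIMAL(18,2) NOT NULL,
    PRIMARY KEY (asof_date, deal_id, projection_id)
);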
What is your opinion on this topic?
Thanks for your consideration.
Good morning,
I have an issue with extracting the correct handicap value within the following table:
K L M
Handicap York Hereford
0 1287 1280
1 1285 1275
2 1280 1271
3 1275 1268
4 1270 1265
5 1268 1260
6 1265 1258
7 1260 1254
8 1255 1250
9 1253 1246
I also have these sample score/round rows (the XXX values are what I need to calculate):
G H I
Round Score Handicap
York 1269 5
York 1270 4
Hereford 1269 XXX
Hereford 1270 XXX
If, for instance, someone on a York gets a score of 1269, they should get a handicap of 5, which this formula achieves:
INDEX($K$7:$K$16,MATCH($H7,$L$7:$L$16,-1))+1
However, this formula only works on the one column, $L$7:$L$16.
Similarly, the 2nd score is calculated with the following formula:
=INDEX($K$8:$K$17,MATCH($H8,$L$8:$L$17,-1))
What I'd like to do is build that out so that if I changed the round to a Hereford, with the exact same score, the cell would automatically calculate that the handicap should be 3.
Is this possible, maybe with an array?
Regards,
Andrew.
With ms365, try:
Formula in I2:
=XLOOKUP(H2,FILTER(L$2:M$11,L$1:M$1=G2),K$2:K$11,"NB",-1,-1)
I would avoid using OFFSET because it is a volatile function.
To select the appropriate column, you can use another MATCH:
MATCH($G7,$L$6:$M$6,0)
will return the column number. This keeps things simple if you have more than just the York and Hereford columns.
Then, to return the matching line:
=MATCH($H7,INDEX($L$7:$M$16,0,MATCH($G7,$L$6:$M$6,0)),-1)
Note the use of 0 for the Row argument in the INDEX function which will return the entire column (all the rows).
Since your handicaps are sequential, as written this formula returns the same values as does yours. But I don't think it is correct since both formulas return 1 for a 1287 York.
You probably need to subtract one from the result of the formula.
=MATCH($H7,INDEX($L$7:$M$16,0,MATCH($G7,$L$6:$M$6,0)),-1)-1
Reference your lookup range with an OFFSET() function, and for the third parameter (which is column offset), use a MATCH() on the headers.
The formula on your first row would be:
=INDEX($K$7:$K$16,MATCH(H7,OFFSET($L$7:$M$16,0,MATCH(G7,$L$6:$M$6,0)-1,ROWS($L$7:$M$16),1),-1))+1
I have multiple datasets (100+) that all contain the same 3 columns (code_num, replicate, total_qty) each with a distinct code (code_num).
data code_num_1
code_num replicate total_qty
12345 376 45
12345 76 67
12345 943 300
.
.
data code_num_2
code_num replicate total_qty
12234 85 746
12234 900 35
12234 726 273
.
.
and etc.
I would like to run those datasets through a data step if possible:
data test;
set test_; <-- datasets will go here...
if _N_ in(&PercentileRow10,&PercentileRow20,&PercentileRow30,&PercentileRow40,&PercentileRow50,&PercentileRow60,&PercentileRow70, &PercentileRow80,&PercentileRow90);
run;
*Note: &PercentileRow is a macro variable that will obtain the percentiles from the datasets. The quantity column (total_qty) will determine the percentiles. I have this step beforehand:
proc sql noprint;
create table ___ as
select code_num,
replicate,
sum(qty) as total_qty
from ____
group by code_num, replicate
order by total_qty;
quit;
Ideally, I would like to obtain the percentiles of each dataset and create a new dataset that will have each percentile and the associated replicate it occurred and the total quantity. Could I use a macro and do loop to run my datasets through this data set to produce new datasets?
data code_num_1_perc
percentile replicate qty
10 87 45
20 933 65
30 34 100
.
.
90 467 837
This is my ideal output for each dataset code_num_#. If possible
If I understand the requirements correctly, the proposed methodology is flawed.
For example, the median (50th percentile) of a series such as
1, 2, 3, 4, 5, 6, 7, 8, 9, 10 is 5.5. 5.5 is not a value in the data set so how would a replicate number be selected?
My recommendation would be a different process altogether. Look into PROC RANK to see how ties are handled and how you'd like them handled. You didn't specify which variable would be used to calculate the percentiles.
Combine all data sets into one, adding in a data set identifier to uniquely identify each data set.
data combined;
length source data_set_name $50.;
set code_num_: indsname = source;
data_set_name = source;
run;
Use PROC RANK to group into deciles
proc rank data=combined out=combined_deciles groups=10;
by data_set_name;
var total_qty;
ranks PRanks;
run;
Get the first (or last, based on requirements) value for each rank
data want;
set combined_deciles;
by data_set_name PRanks;
if first.Pranks;
run;
Table copied as Text
Column1 Column2 Column3 Column4 Column5 Column6
A AA AAA 100 95 92
A AA AAA 85 83 81
A AA BBB 200 199 160
A BB AAA 65 55 49
B AA AAA 89 88 83
B AA BBB 150 149 145
B BB AAA 140 135
B BB BBB 190 185
B AA AAA 510
       AA
       AAA   BBB
A      173   160
B      593   145
And some more explanation:
Basically, I want the sum of "Column 6" for the given criteria, but the data in Column 6 can only be entered after some delay with respect to Columns 1, 2, 3 and 4.
Until the Column 6 data is entered, I want Excel to use the number available in Column 5, which is also entered after some delay with respect to Columns 1-4, but before Column 6.
And until the Column 5 data is entered, I want Excel to use the number available in Column 4.
Now, I am familiar with the two SUM/IF arrangements included below in this post.
The first is an array SUM/IF arrangement, which is convenient to write but results in a terribly long calculation time of 1.5 seconds for just one column, and I have over 100 columns in one sheet and about 9 sheets.
The second uses SUMIFS, which takes much longer to write but has a relatively better calculation time of 0.5 seconds per column, which is still quite high.
Now I need to do away with the array arrangement, but doing so will take quite some time, and I want to know if there is any better/other arrangement.
Just let me know of another arrangement which can get the required result and I will check it for calculation timing. If the other arrangement is also convenient to write, that is a plus.
This is my table:
And I want to add the right-most columns which are not empty, i.e. have a number in them, but with the criteria for the first three columns, into cell D15.
I only found the option to add an image. Please let me know how to upload an Excel file.
Can somebody please suggest an alternative to this array formula so it can calculate much faster?
{=SUM(
IF(
($B$2:$B$10=$C15)*
($C$2:$C$10=$C$13)*
($D$2:$D$10=D$14)>0,
IF(
$G$2:$G$10<>"",
$G$2:$G$10,
IF(
$F$2:$F$10<>"",
$F$2:$F$10,
$E$2:$E$10))))}
I have tried the formula below, which reduces the calculation time to a third, but it is too much typing for the large amount of data I am dealing with:
=SUMIFS(
$G$2:$G$10,
$B$2:$B$10,$C15,
$C$2:$C$10,$C$13,
$D$2:$D$10,H$14,
$G$2:$G$10,"<>"&"")
+SUMIFS(
$F$2:$F$10,
$B$2:$B$10,$C15,
$C$2:$C$10,$C$13,
$D$2:$D$10,H$14,
$G$2:$G$10,"="&"",
$F$2:$F$10,"<>"&"")
+SUMIFS(
$E$2:$E$10,
$B$2:$B$10,$C15,
$C$2:$C$10,$C$13,
$D$2:$D$10,H$14,
$G$2:$G$10,"="&"",
$F$2:$F$10,"="&"")
If you're OK with using a helper column (which you should be), you can use this formula in a helper cell and drag it down. (In my example at the bottom, this formula is in cell H2, dragged down.)
= INDEX(E2:G2,MATCH(-1E+300,E2:G2,-1))
This pulls the data from Column 4, 5 or 6 (whichever is the right-most one populated) into a single column.
Then you can use a simpler SUMIFS formula in cell D15:
= SUMIFS($H$2:$H$10, // Sum range (helper column)
$B$2:$B$10,$C15, // Criteria 1 (A or B)
$C$2:$C$10,$C$13, // Criteria 2 (AA or BB)
$D$2:$D$10,D$14) // Criteria 3 (AAA or BBB)
DISCLAIMER
This answer will simplify your formulas, but I'm not sure whether it will help with the performance problems you are experiencing. I don't see SUMIFS in itself as the likely cause of long calculation times. You are probably experiencing long calculation times because other parts of your spreadsheet use inefficient formulas and/or formulas involving volatile cells, but that is just a guess, because I have no idea what the rest of your spreadsheet looks like.
I have this building floor data selected:
6
5
4
3
2
1
UG
GM
G
LG
5B
5A
B1
B2
For this sorting I use this kind of ORDER BY:
order by
(case when ISNUMERIC(floorNo) = 1 then CAST(floorNo AS Int) end) desc ,
(case when ISNUMERIC(left(floorNo,1)) = 0 and ISNUMERIC(substring(floorNo,2,1)) = 1 then floorNo end) asc,
(case when ISNUMERIC(floorNo) = 0 and left(floorNo,1) <>'L' then floorNo end) desc
But I want it sorted like this:
6
5B
5A
5
4
3
2
1
UG
GM
G
LG
B1
B2
Can anyone help me solve it?
If you make a complicated enough (set of) CASE statement(s), you would eventually be able to handle all the possibilities, but it is likely to run very slowly if you have a lot of data.
If I had to do this, I would probably make a separate lookup table (FloorOrder) with two columns: the floor code and an order column (integer). Create a script to populate the lookup table with all the various possibilities -- pick a maximum number of floors, basements, and subfloors per floor, and generate all of the possibilities with some loops. Then add all the various floors near the ground floor. Make sure the order numbers are spread out enough that you can easily add other codes in between when somebody comes up with a new option (because they will). Something like this subset:
Code Order
2 2000
1C 1300
1B 1200
1A 1100
1 1000
UG 800
GM 500
G 0
LG -300
B1 -1000
It doesn't really matter what the order values are, as long as they sort the list in the right order, can be easily generated when creating the table, and leave space for fitting things into the gaps. Whenever somebody comes up with a new weird floor code (some I've seen near me are things like M for Mezzanine, UM for Upper Mezzanine, etc.), add new records to the FloorOrder table to fit them in. Make sure your table has an index on the floor codes.
To use it, join to the FloorOrder table, sort by the Order column.
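A minimal sketch of what that could look like (assuming the floors live in a floorNo column on a table called Buildings; all names are illustrative):

-- Hypothetical lookup table: one row per known floor code
CREATE TABLE FloorOrder (
    FloorCode VARCHAR(10) PRIMARY KEY,  -- the primary key also gives the index on floor codes
    SortOrder INT NOT NULL
);

INSERT INTO FloorOrder (FloorCode, SortOrder) VALUES
    ('2', 2000), ('1C', 1300), ('1B', 1200), ('1A', 1100), ('1', 1000),
    ('UG', 800), ('GM', 500), ('G', 0), ('LG', -300), ('B1', -1000);

-- Join and sort by the lookup value, highest floor first
SELECT b.floorNo
FROM Buildings b
JOIN FloorOrder fo ON fo.FloorCode = b.floorNo
ORDER BY fo.SortOrder DESC;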