Get matching string with the percentage - sql-server

I have the following details of the data:
Table 1: table1 is small, with only a few records.
Table 2: table2 has 50 million rows.
Requirement: I need to match any string column from table1 to table2 (for example, name to name) and get the percentage of match. Note the column can be any string column, such as address, that holds multiple words in a single cell.
Sample data:
create table table1(id int, name varchar(100), address varchar(200));
insert into table1 values(1,'Mario Speedwagon','H No 10 High Street USA');
insert into table1 values(2,'Petey Cruiser Jack','#1 Church Street UK');
insert into table1 values(3,'Anna B Sthesia','#101 No 1 B Block UAE');
insert into table1 values(4,'Paul A Molive','Main Road 12th Cross H No 2 USA');
insert into table1 values(5,'Bob Frapples','H No 20 High Street USA');
create table table2(name varchar(100), address varchar(200), email varchar(100));
insert into table2 values('Speedwagon Mario ','USA, H No 10 High Street','mario@gmail.com');
insert into table2 values('Cruiser Petey Jack','UK #1 Church Street','jack@gmail.com');
insert into table2 values('Sthesia Anna','UAE #101 No 1 B Block','Aanna@gmail.com');
insert into table2 values('Molive Paul','USA Main Road 12th Cross H No 2','APaul@gmail.com');
insert into table2 values('Frapples Bob ','USA H No 20 High Street','BobF@gmail.com');
Expected Result:
tbl1_Name            tbl2_Name            Percentage
----------------------------------------------------
Mario Speedwagon     Speedwagon Mario     100
Petey Cruiser Jack   Cruiser Petey Jack   100
Anna B Sthesia       Sthesia Anna         around 80+
Paul A Molive        Molive Paul          around 80+
Bob Frapples         Frapples Bob         100
Note: the above is just sample data for illustration; in the actual scenario I have a few records in table1 and 50 million in table2.
My Try:
Step 1: As suggested by Shnugo, normalize the data and store it back in the same tables.
For table1:
ALTER TABLE table1 ADD Name_Normal VARCHAR(1000);
GO
--00:00:00 (5 row(s) affected)
UPDATE table1
SET Name_Normal=CAST('<x>' + REPLACE((SELECT LOWER(name) AS [*] FOR XML PATH('')),' ','</x><x>') + '</x>' AS XML)
.query(N'
for $fragment in distinct-values(/x/text())
order by $fragment
return $fragment
').value('.','nvarchar(1000)');
GO
For table2:
ALTER TABLE table2 ADD Name_Normal VARCHAR(1000);
GO
--01:59:03 (50000000 row(s) affected)
UPDATE table2
SET Name_Normal=CAST('<x>' + REPLACE((SELECT LOWER(name) AS [*] FOR XML PATH('')),' ','</x><x>') + '</x>' AS XML)
.query(N'
for $fragment in distinct-values(/x/text())
order by $fragment
return $fragment
').value('.','nvarchar(1000)');
GO
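For reference, this normalization lower-cases the name, splits it on spaces via XML, sorts the distinct tokens alphabetically, and rejoins them, so word order stops mattering. A standalone illustration (my own sample value, not from the question):
-- Both 'Mario Speedwagon' and 'Speedwagon Mario' normalize to 'mario speedwagon'
SELECT CAST('<x>' + REPLACE((SELECT LOWER('Speedwagon Mario') AS [*] FOR XML PATH('')),' ','</x><x>') + '</x>' AS XML)
    .query(N'for $fragment in distinct-values(/x/text()) order by $fragment return $fragment')
    .value('.','nvarchar(1000)') AS Name_Normal;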
Step 2: Create the percentage-calculation function dbo.ufn_Levenshtein, following Levenshtein distance in Microsoft SQL Server.
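The function body isn't included in the question. As a sketch of the percentage idea only, assuming a helper dbo.ufn_LevenshteinDistance that returns the raw edit distance (the helper is hypothetical and not shown here), the wrapper could look like this:
-- Hypothetical wrapper: turns an edit distance into a 0-100 match percentage.
CREATE FUNCTION dbo.ufn_Levenshtein(@a NVARCHAR(1000), @b NVARCHAR(1000))
RETURNS INT
AS
BEGIN
    DECLARE @maxLen INT = CASE WHEN LEN(@a) > LEN(@b) THEN LEN(@a) ELSE LEN(@b) END;
    IF @maxLen = 0 RETURN 100; -- two empty strings: treat as a full match
    -- dbo.ufn_LevenshteinDistance is assumed, not defined in the question
    RETURN 100 - (dbo.ufn_LevenshteinDistance(@a, @b) * 100 / @maxLen);
END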
Step 3: Query to get the matching percentage.
--00:00:33 (23456 row(s) affected)
SELECT t1.name AS [tbl1_Name], t2.name AS [tbl2_Name],
       dbo.ufn_Levenshtein(t1.Name_Normal, t2.Name_Normal) AS percentage
INTO #TempTable
FROM table2 t2
INNER JOIN table1 t1
    ON CHARINDEX(SOUNDEX(t2.Name_Normal), SOUNDEX(t1.Name_Normal)) > 0
--00:00:00 (23456 row(s) affected)
SELECT *
FROM #TempTable
WHERE percentage >= 50
order by percentage desc;
Conclusion: I'm getting the expected result, but normalizing table2 takes around 2 hours, as noted in the comment above that query. Any suggestions for better optimization of step 1 for table2?

Have you tried looking into DQS (Data Quality Services)?
Depending on your SQL Server version, it comes with the installation media.
https://learn.microsoft.com/en-us/sql/data-quality-services/data-matching?view=sql-server-2017

Related

Best way to attack a confusing SQL issue in inserting data into a TEMP table

I'm working in SQL Server 2016 on a confusing problem. I have a TEMP table that contains unique rows, and I have to insert 5 ProductID values for each row, based on another column in this temp table, AgentNo. The ProductID values (there are 5 of them) come from another table, but there is no relationship between the tables. So my question is: how do I insert a row for each ProductID into this temp table for each unique row that is currently in it?
Here is a pic of the TEMP table that requires 5 rows for each:
Here is a pic of what I need to end up with:
Here is my SQL code for both TEMP tables:
IF OBJECT_ID('tempdb..#tempTarget') IS NOT NULL DROP TABLE #tempTarget
SELECT 0 as ProductID, 1 as [Status], a.AgentNo, u.UserID, u.[Password], 'N' as AdminID, tel.LocationSysID --, tel.OwnerID, tel.LocationName, a.OwnerSysID, a.AgentName
INTO #tempTarget
FROM dbo.TEST_EvalLocations tel
INNER JOIN dbo.AGT_Agent a
ON tel.LocationName = a.AgentName
INNER JOIN dbo.IW_User u
ON a.AgentNo = u.UserID
WHERE tel.OwnerID = 13313
AND tel.LocationSysID <> 15434;
SELECT * FROM #tempTarget WHERE LocationSysID NOT IN (15425, 15434);
GO
-- Create source table
IF OBJECT_ID('tempdb..#tempSource') IS NOT NULL DROP TABLE #tempSource
SELECT DISTINCT lpr.ProductID
INTO #tempSource
FROM dbo.Eval_LocationProductRelationship lpr
WHERE lpr.ProductID IN (16, 15, 13, 14, 12) --BETWEEN 15435 AND 15595
Sorry I could not get this into a DDL file, as these are TEMP tables. Any help/direction would be appreciated. Thanks.
CROSS JOIN is the best solution for your case.
If you only want 5 rows for each row in the first table, simply use the cross join query below.
SELECT B.ProductID,
A.[Status],
A.AgentNo,
A.UserID,
A.[Password] AS Value,
A.AdminID,
A.LocationSysID
FROM #tempTarget A
CROSS JOIN #tempSource B
If you want an additional row with ProductID 0, then insert a 0 into your second temp table and use the same query.
INSERT INTO #tempSource SELECT 0
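A minimal, self-contained sketch of the row-multiplication effect (throwaway demo tables, not the question's schema):
-- 2 target rows x 3 source products = 6 result rows
CREATE TABLE #demoTarget (AgentNo INT);
CREATE TABLE #demoSource (ProductID INT);
INSERT INTO #demoTarget VALUES (100), (200);
INSERT INTO #demoSource VALUES (12), (13), (14);

SELECT s.ProductID, t.AgentNo
FROM #demoTarget t
CROSS JOIN #demoSource s
ORDER BY t.AgentNo, s.ProductID; -- every (AgentNo, ProductID) pair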
If I understand correctly, the following is the scenario.
One temp table has all the content:
select * from #withoutProducts
and a product table:
select * from #products
Then the following is the query you are looking for:
select a.ProductID,[Status],AgentNo,UserID,[value]
from #products a cross join #withoutProducts b
order by AgentNO,a.productID

Show all and only rows in table 1 not in table 2 (using multiple columns)

I have one table (Table1) with several columns used in combination: Name, TestName, DevName, Dept. When each of these 4 columns has a value, the record is inserted into Table2. I need to confirm that all of the records with existing values in each of these fields within Table1 were correctly copied into Table2.
I have created a query for it:
SELECT DISTINCT wr.Name,wr.TestName, wr.DEVName ,wr.Dept
FROM table2 wr
where NOT EXISTS (
SELECT NULL
FROM TABLE1 ym
WHERE ym.Name = wr.Name
AND ym.TestName = wr.TestName
AND ym.DEVName = wr.DEVName
AND ym.Dept = wr.Dept
)
My counts are not adding up, so I believe that this is incorrect. Can you advise me on the best way to write this query for my needs?
You can use the EXCEPT set operator for this one if the table definitions are identical.
SELECT DISTINCT ym.Name, ym.TestName, ym.DEVName, ym.Dept
FROM table1 ym
EXCEPT
SELECT DISTINCT wr.Name, wr.TestName, wr.DEVName, wr.Dept
FROM table2 wr
This returns distinct rows from the first table where there is not a match in the second table. Read more about EXCEPT and INTERSECT here: https://learn.microsoft.com/en-us/sql/t-sql/language-elements/set-operators-except-and-intersect-transact-sql?view=sql-server-2017
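One caveat worth checking if the counts still don't add up (an assumption on my part, since the question doesn't say whether the columns are nullable): EXCEPT treats two NULLs as equal, while the = comparisons inside NOT EXISTS evaluate to UNKNOWN for NULLs. A NULL-safe variant of the NOT EXISTS check would look like this sketch:
SELECT ym.Name, ym.TestName, ym.DEVName, ym.Dept
FROM Table1 ym
WHERE NOT EXISTS (
    SELECT 1
    FROM Table2 wr
    WHERE (wr.Name     = ym.Name     OR (wr.Name     IS NULL AND ym.Name     IS NULL))
      AND (wr.TestName = ym.TestName OR (wr.TestName IS NULL AND ym.TestName IS NULL))
      AND (wr.DEVName  = ym.DEVName  OR (wr.DEVName  IS NULL AND ym.DEVName  IS NULL))
      AND (wr.Dept     = ym.Dept     OR (wr.Dept     IS NULL AND ym.Dept     IS NULL))
)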
Your query should do the job. It checks for anything that is in Table1 but not in Table2:
SELECT ym.Name, ym.TestName, ym.DEVName, ym.Dept
FROM Table1 ym
WHERE NOT EXISTS (
SELECT 1
FROM table2
WHERE ym.Name = Name AND ym.TestName = TestName AND ym.DEVName = DEVName AND ym.Dept = Dept
)
If the structure of both tables are the same, EXCEPT is probably simpler.
IF OBJECT_ID(N'tempdb..#table1') IS NOT NULL drop table #table1
IF OBJECT_ID(N'tempdb..#table2') IS NOT NULL drop table #table2
create table #table1 (id int, value varchar(10))
create table #table2 (id int)
insert into #table1(id, value) VALUES (1,'value1'), (2,'value2'), (3,'value3')
--test here. Comment next line
insert into #table2(id) VALUES (1) --Comment/Uncomment
select * from #table1
select * from #table2
select #table1.*
from #table1
left JOIN #table2 on
#table1.id = #table2.id
where #table2.id is null

Performance tuning on join two tables columns with patindex

Sample data:
Note:
The table tbl_test1 is a filtered table and may have fewer records based on earlier filtering.
The following is just a data sample for illustration. The actual table tbl_test2 has 70 columns and 100 million records.
The WHERE condition is dynamic and can come in any combination.
The display columns are also dynamic, i.e. one or more columns.
create table tbl_test1
(
col1 varchar(100)
);
insert into tbl_test1 values('John Mak'),('Omont Boy'),('Will Smith'),('Mak John');
create table tbl_test2
(
col1 varchar(100)
);
insert into tbl_test2 values('John Mak'),('Smith Will'),('Jack Don');
query 1: The following query takes more than 10 minutes and is still running against the 100 million records.
select t2.col1
from tbl_test2 t2
inner join tbl_test1 t1 on patindex('%'+t1.col1+'%',t2.col1) > 0
query 2: This also keeps running; no result even after waiting 10 minutes.
select t2.col1
from tbl_test2 t2
where exists
(
select * from tbl_test1 t1 where charindex(t1.col1,t2.col1) > 0
)
expected result:
col1
----------
John Mak
Smith Will

Subtract data from the same column

I am having difficulties subtracting data from the same column. My data structure looks like this:
TimeStamp ID state Gallons
1/01/2012 1 NY 100
1/05/2012 1 NY 90
1/10/2012 1 NY 70
1/15/2012 1 NY 40
So I would like to get the result of (100-40) = 60 gallons used for that time period. I tried using the following query:
SELECT T1.Gallons, T1.Gallons -T1.Gallons AS 'Gallons used'
FROM Table 1 AS T1
where Date = '2012-07-01'
ORDER BY Gallons
It seems like a simple way of doing this. However, I cannot find an efficient way of using the timestamp to retrieve the right values (e.g. the gallon value on 1/1/2012 minus the gallon value on 1/30/2012).
P.S. The value of Gallons will always decrease.
select max(gallons) - min(gallons)
from table1
group by Date
having Date = '2012-07-01'
try
select max(gallons) - min(gallons)
from table1 t1
where date between '2012-01-01' and '2012-01-31'
SELECT MAX(gallons) - MIN(gallons) AS 'GallonsUsed'
FROM table;
would work in this instance; if you are doing a monthly report, this could work. How does the value of gallons never increase? What happens when you run out?
Will you always be comparing the maximum and minimum amounts? If not, maybe something like the following (generic sample) will suffice. The first time you run the script you'll get an error, but you can disregard it.
drop table TableName
create table TableName (id int, value int)
insert into TableName values (3, 20)
insert into TableName values (4, 17)
insert into TableName values (5, 16)
insert into TableName values (6, 12)
insert into TableName values (7, 10)
select t1.value - t2.value
from TableName as t1 inner join TableName as t2
on t1.id = 5 and t2.id = 6
The thing you're asking for is in the last line: after inner joining the table onto itself, we simply match the two rows with certain ids, or, as in your case, with different dates.
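If the endpoints should come from the date range itself rather than from MAX/MIN, a window-function sketch for SQL Server 2012+ (assuming the table is named Table1 with the columns shown in the question; correct here only because Gallons always decreases is not even required):
SELECT TOP (1)
       FIRST_VALUE(Gallons) OVER (ORDER BY [TimeStamp] ASC)   -- first reading in range
     - FIRST_VALUE(Gallons) OVER (ORDER BY [TimeStamp] DESC)  -- last reading in range
       AS GallonsUsed
FROM Table1
WHERE ID = 1
  AND [TimeStamp] >= '2012-01-01' AND [TimeStamp] < '2012-02-01';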

How to JOIN to a table that has multiple values in the column?

I've got a person table that contains an error code field that can contain multiple error codes (001, 002, 003...). I know that's a schema problem but this is a vendor application and I have no control over the schema, so I have to work with what I've got.
There is also an Error table that contains ErrorCode (char(3)) and Descript (char(1000)). In my query, Person.ErrorCode is joined to Error.ErrorCode to get the corresponding description.
For person records where there is only one error code, I can get the corresponding Descript with no problem. What I'm trying to do is somehow concatenate the Descript values for records with multiple errors.
For example, here's some sample data from Error table:
ErrorCode Descript
001 Problem with person file
002 Problem with address file
003 Problem with grade
Here are the columns resulting from my SELECT on Person with a JOIN on Error:
Person.RecID Person.ErrorCode Error.Descript
12345 001 Problem with person file
12346 003 Problem with grade
12347 002,003
What I'm trying to get is this:
Person.RecID Person.ErrorCode Error.Descript
12345 001 Problem with person file
12346 003 Problem with grade
12347 002,003 Problem with address file, Problem with grade
Suggestions appreciated!
You should see "Arrays and Lists in SQL Server 2005 and Beyond, When Table Value Parameters Do Not Cut It" by Erland Sommarskog: there are many ways to split a string in SQL Server, and that article covers the pros and cons of just about every method. In general, you need to create a split function. This is how a split function can be used to join rows:
SELECT *
FROM dbo.yourSplitFunction(@Parameter) b
INNER JOIN YourCodesTable c ON b.ListValue = c.CodeValue
I prefer the Numbers-table approach to splitting a string in T-SQL; see the link above for the pros and cons of each method.
For the Numbers-table method to work, do this one-time setup, which creates a table Numbers containing the integers 1 to 10,000:
SELECT TOP 10000 IDENTITY(int,1,1) AS Number
INTO Numbers
FROM sys.objects s1
CROSS JOIN sys.objects s2
ALTER TABLE Numbers ADD CONSTRAINT PK_Numbers PRIMARY KEY CLUSTERED (Number)
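A quick sanity check after the setup (my addition, optional):
-- expect 1, 10000, 10000 if the CROSS JOIN produced enough rows
SELECT MIN(Number) AS MinN, MAX(Number) AS MaxN, COUNT(*) AS Cnt
FROM Numbers;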
Once the Numbers table is set up, create this split function:
CREATE FUNCTION [dbo].[FN_ListToTable]
(
    @SplitOn char(1),       --REQUIRED, the character to split the @List string on
    @List varchar(8000)     --REQUIRED, the list to split apart
)
RETURNS TABLE
AS
RETURN
(
    ----------------
    --SINGLE QUERY-- --this will not return empty rows
    ----------------
    SELECT
        ListValue
    FROM (SELECT
              LTRIM(RTRIM(SUBSTRING(List2, number + 1, CHARINDEX(@SplitOn, List2, number + 1) - number - 1))) AS ListValue
          FROM (
                SELECT @SplitOn + @List + @SplitOn AS List2
               ) AS dt
          INNER JOIN Numbers n ON n.Number < LEN(dt.List2)
          WHERE SUBSTRING(List2, number, 1) = @SplitOn
         ) dt2
    WHERE ListValue IS NOT NULL AND ListValue != ''
);
GO
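For illustration, the function can be called on its own before wiring it into a join (my example value):
SELECT ListValue FROM dbo.FN_ListToTable(',', '002,003');
-- ListValue
-- ---------
-- 002
-- 003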
You can now easily split a CSV string into a table and join on it:
DECLARE @ErrorCode table (ErrorCode varchar(20), Description varchar(30))
INSERT @ErrorCode VALUES ('001','Problem with person file')
INSERT @ErrorCode VALUES ('002','Problem with address file')
INSERT @ErrorCode VALUES ('003','Problem with grade')
DECLARE @Person table (RecID int, ErrorCode varchar(20))
INSERT @Person VALUES (12345,'001')
INSERT @Person VALUES (12346,'003')
INSERT @Person VALUES (12347,'002,003')
SELECT
    p.RecID, c.ListValue, e.Description
FROM @Person p
CROSS APPLY dbo.FN_ListToTable(',', p.ErrorCode) c
INNER JOIN @ErrorCode e ON c.ListValue = e.ErrorCode
OUTPUT:
RecID ListValue Description
----------- ------------- -------------------------
12345 001 Problem with person file
12346 003 Problem with grade
12347 002 Problem with address file
12347 003 Problem with grade
(4 row(s) affected)
You can use the XML trick to concatenate the rows back together:
SELECT
    t1.RecID, t1.ErrorCode
    ,STUFF(
        (SELECT
             ', ' + e.Description
         FROM @Person p
         CROSS APPLY dbo.FN_ListToTable(',', p.ErrorCode) c
         INNER JOIN @ErrorCode e ON c.ListValue = e.ErrorCode
         WHERE t1.RecID = p.RecID
         ORDER BY p.ErrorCode
         FOR XML PATH(''), TYPE
        ).value('.', 'varchar(max)')
        , 1, 2, ''
    ) AS ChildValues
FROM @Person t1
GROUP BY t1.RecID, t1.ErrorCode
OUTPUT:
RecID ErrorCode ChildValues
----------- -------------------- -----------------------------------------------
12345 001 Problem with person file
12346 003 Problem with grade
12347 002,003 Problem with address file, Problem with grade
(3 row(s) affected)
This returns the same result set as above, but may perform better:
SELECT
    t1.RecID, t1.ErrorCode
    ,STUFF(
        (SELECT
             ', ' + e.Description
         FROM (SELECT ListValue FROM dbo.FN_ListToTable(',', t1.ErrorCode)) c
         INNER JOIN @ErrorCode e ON c.ListValue = e.ErrorCode
         ORDER BY c.ListValue
         FOR XML PATH(''), TYPE
        ).value('.', 'varchar(max)')
        , 1, 2, ''
    ) AS ChildValues
FROM @Person t1
GROUP BY t1.RecID, t1.ErrorCode
Denormalize Person.ErrorCode before the join with Error.ErrorCode.
I don't mean denormalizing at the table level; I mean with a view or SQL code.
You can use a Common Table Expression to pretend that the Person table is normalized:
;WITH PersonPrime as (
    SELECT RecID, ErrorCode, CAST(NULL AS varchar(100)) AS Remain FROM Person WHERE ErrorCode NOT LIKE '%,%'
    UNION ALL
    SELECT RecID, SUBSTRING(ErrorCode, 1, CHARINDEX(',', ErrorCode) - 1), SUBSTRING(ErrorCode, CHARINDEX(',', ErrorCode) + 1, 100) FROM Person WHERE ErrorCode LIKE '%,%'
    UNION ALL
    SELECT RecID, Remain, NULL FROM PersonPrime WHERE Remain NOT LIKE '%,%'
    UNION ALL
    SELECT RecID, SUBSTRING(Remain, 1, CHARINDEX(',', Remain) - 1), SUBSTRING(Remain, CHARINDEX(',', Remain) + 1, 100) FROM PersonPrime WHERE Remain LIKE '%,%'
)
SELECT RecID, ErrorCode FROM PersonPrime
Now use PersonPrime wherever you'd have used Person in your original query. The CAST of NULL is there so that Remain is a varchar column as wide as ErrorCode in the Person table.
Concatenating the error descriptions might not be the way to go. It adds unnecessary complexity to the SQL statement that will be extremely problematic to debug. Any future additions or changes to the SQL will also be difficult. Your best bet is to generate a resultset that is normalized, even though your schema is not.
SELECT Person.RecID, Person.ErrorCode, Error.ErrorCode, Error.Descript
FROM Person INNER JOIN Error
    ON ',' + REPLACE(Person.ErrorCode, ' ', '') + ',' LIKE '%,' + CONVERT(VARCHAR, Error.ErrorCode) + ',%'
If a person has multiple error codes set, then this will return one row for each error specified (ignoring duplicates). Using your example, it will return this.
Person.RecID Person.ErrorCode Error.ErrorCode Error.Descript
12345 001 001 Problem with person file
12346 003 003 Problem with grade
12347 002,003 002 Problem with address file
12347 002,003 003 Problem with grade
Grouping the errors by person and concatenating the descriptions is another option. GROUP_CONCAT is MySQL syntax; the SQL Server 2017+ equivalent is STRING_AGG, so a sketch reusing the split function from above:
SELECT p.RecID, p.ErrorCode, STRING_AGG(e.Descript, ', ') AS Descript
FROM Person p
CROSS APPLY dbo.FN_ListToTable(',', p.ErrorCode) c
INNER JOIN Error e ON c.ListValue = e.ErrorCode
GROUP BY p.RecID, p.ErrorCode
