Performance tuning when joining two tables' columns with PATINDEX - sql-server

Sample data:
Note:
The table tbl_test1 is a filtered table and may have fewer records, depending on the earlier filtering.
The following is just sample data for illustration. The actual table tbl_test2 has 70 columns and 100 million records.
The WHERE condition is dynamic and can come in any combination.
The displayed columns are also dynamic, meaning one or more columns.
create table tbl_test1
(
col1 varchar(100)
);
insert into tbl_test1 values('John Mak'),('Omont Boy'),('Will Smith'),('Mak John');
create table tbl_test2
(
col1 varchar(100)
);
insert into tbl_test2 values('John Mak'),('Smith Will'),('Jack Don');
Query 1: The following query takes more than 10 minutes and is still running against 100 million records.
select t2.col1
from tbl_test2 t2
inner join tbl_test1 t1 on patindex('%'+t1.col1+'%',t2.col1) > 0
Query 2: This one also keeps running; I could not get a result after waiting 10 minutes.
select t2.col1
from tbl_test2 t2
where exists
(
select * from tbl_test1 t1 where charindex(t1.col1,t2.col1) > 0
)
expected result:
col1
----------
John Mak
Smith Will
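Note that the expected result includes 'Smith Will', which a contiguous substring search cannot return, because PATINDEX and CHARINDEX look for 'Will Smith' as one unbroken string. The following is a minimal sketch of a word-order-independent match, assuming SQL Server 2016 or later (for STRING_SPLIT) and words separated by single spaces. It only illustrates the matching rule on the sample data and is not expected to be any faster on 100 million rows.

```sql
-- Sketch: a row of tbl_test2 qualifies when at least one row of tbl_test1
-- has ALL of its words present in t2.col1, in any order.
-- Assumption: STRING_SPLIT is available (compatibility level 130 or higher).
SELECT t2.col1
FROM tbl_test2 AS t2
WHERE EXISTS
(
    SELECT 1
    FROM tbl_test1 AS t1
    WHERE NOT EXISTS
    (
        -- look for a word of t1.col1 that is missing from t2.col1;
        -- padding with spaces makes CHARINDEX match whole words only
        SELECT 1
        FROM STRING_SPLIT(t1.col1, ' ') AS w
        WHERE CHARINDEX(' ' + w.value + ' ', ' ' + t2.col1 + ' ') = 0
    )
);
```

On the sample data, 'John Mak' matches directly and 'Smith Will' matches through 'Will Smith', while 'Jack Don' has no counterpart in tbl_test1.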

Related

Improve the query performance

Initially there are 2 tables involved:
col1 - int
col2 - int
select * from table inner join table2 on table.col1=table2.col2
-- Fine, it gives the result in a short time, about 2 minutes
But after changing col2 to nvarchar(30):
select * from table inner join table2 on table.col1=convert(nvarchar(30),table2.col2 )
-- it has been running for more than an hour
Is there any way to optimize the query?
Joining 2 tables on an nvarchar(30) column will be slower than joining on an int column, as the nvarchar column is bigger. I would stick with int if possible.

Best way to attack a confusing SQL issue in inserting data into a TEMP table

I'm working in SQL Server 2016 and have a confusing SQL problem. I have a TEMP table that contains unique rows. I have to insert 5 PRODUCTID values for each row, based on another column value, AgentNo, in this temp table. The 5 PRODUCTID values come from another table, but there is no relationship between the tables. So my question is: how do I insert a row for each ProductID into this temp table for each unique row that is currently in the temp table?
Here is a pic of the TEMP table that requires 5 rows for each:
Here is a pic of what I'm needing to come away with:
Here is my SQL code for both TEMP tables:
IF OBJECT_ID('tempdb..#tempTarget') IS NOT NULL DROP TABLE #tempTarget
SELECT 0 as ProductID, 1 as [Status], a.AgentNo, u.UserID, u.[Password], 'N' as AdminID, tel.LocationSysID --, tel.OwnerID, tel.LocationName, a.OwnerSysID, a.AgentName
INTO #tempTarget
FROM dbo.TEST_EvalLocations tel
INNER JOIN dbo.AGT_Agent a
ON tel.LocationName = a.AgentName
INNER JOIN dbo.IW_User u
ON a.AgentNo = u.UserID
WHERE tel.OwnerID = 13313
AND tel.LocationSysID <> 15434;
SELECT * FROM #tempTarget WHERE LocationSysID NOT IN (15425, 15434);
GO
-- Create source table
IF OBJECT_ID('tempdb..#tempSource') IS NOT NULL DROP TABLE #tempSource
SELECT DISTINCT lpr.ProductID
INTO #tempSource
FROM dbo.Eval_LocationProductRelationship lpr
WHERE lpr.ProductID IN (16, 15, 13, 14, 12) --BETWEEN 15435 AND 15595
Sorry I could not get this into a DDL file as these are TEMP tables. Any help/direction would be appreciated. Thanks.
CROSS JOIN will be the best solution for your case.
If you only want 5 rows for each row in the first table, simply use the cross join query below.
SELECT B.ProductID,
A.[Status],
A.AgentNo,
A.UserID,
A.[Password] AS Value,
A.AdminID,
A.LocationSysID
FROM #tempTarget A
CROSS JOIN #tempSource B
If you want an additional row with 0, then you have to insert a 0 into your second temp table and use the same query.
INSERT INTO #tempSource SELECT 0
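Since the question asks for the expanded rows to end up in a temp table, here is a minimal sketch that materializes the cross join result. It assumes #tempTarget and #tempSource were created in the same session as shown in the question; the name #tempExpanded is hypothetical.

```sql
-- Sketch: store one row per (target row, ProductID) combination.
-- #tempExpanded is a hypothetical name, not part of the original code.
IF OBJECT_ID('tempdb..#tempExpanded') IS NOT NULL DROP TABLE #tempExpanded;

SELECT s.ProductID,
       t.[Status],
       t.AgentNo,
       t.UserID,
       t.[Password],
       t.AdminID,
       t.LocationSysID
INTO #tempExpanded
FROM #tempTarget AS t
CROSS JOIN #tempSource AS s;

-- Sanity check: this count should equal
-- (rows in #tempTarget) * (rows in #tempSource).
SELECT COUNT(*) AS ExpandedRows FROM #tempExpanded;
```

Writing to a separate table avoids having to delete the original placeholder rows (ProductID = 0) from #tempTarget afterwards.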
If I understand correctly, the scenario is as follows.
One temp table has all the content:
select * from #withoutProducts
The product table:
select * from #products
Then the following is the query you are looking for:
select a.ProductID,[Status],AgentNo,UserID,[value]
from #products a cross join #withoutProducts b
order by AgentNO,a.productID

Show all and only rows in table 1 not in table 2 (using multiple columns)

I have one table (Table1) that has several columns used in combination: Name, TestName, DevName, Dept. When each of these 4 columns has a value, the record is inserted into Table2. I need to confirm that all of the records with existing values in each of these fields within Table1 were correctly copied into Table2.
I have created a query for it:
SELECT DISTINCT wr.Name,wr.TestName, wr.DEVName ,wr.Dept
FROM table2 wr
where NOT EXISTS (
SELECT NULL
FROM TABLE1 ym
WHERE ym.Name = wr.Name
AND ym.TestName = wr. TestName
AND ym.DEVName = wr.DEVName
AND ym. Dept = wr. Dept
)
My counts are not adding up, so I believe that this is incorrect. Can you advise me on the best way to write this query for my needs?
You can use the EXCEPT set operator for this one if the table definitions are identical.
SELECT DISTINCT ym.Name, ym.TestName, ym.DEVName, ym.Dept
FROM table1 ym
EXCEPT
SELECT DISTINCT wr.Name, wr.TestName, wr.DEVName, wr.Dept
FROM table2 wr
This returns distinct rows from the first table where there is not a match in the second table. Read more about EXCEPT and INTERSECT here: https://learn.microsoft.com/en-us/sql/t-sql/language-elements/set-operators-except-and-intersect-transact-sql?view=sql-server-2017
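One difference that can make the counts disagree is NULL handling: EXCEPT treats two NULLs as equal, whereas an equality predicate inside NOT EXISTS never matches a NULL. The following sketch shows this with two hypothetical table variables, @t1 and @t2, that stand in for the real tables.

```sql
-- Sketch with hypothetical table variables standing in for Table1/Table2.
DECLARE @t1 TABLE (Name varchar(10), Dept varchar(10));
DECLARE @t2 TABLE (Name varchar(10), Dept varchar(10));

INSERT INTO @t1 (Name, Dept) VALUES ('A', NULL), ('B', 'X');
INSERT INTO @t2 (Name, Dept) VALUES ('A', NULL);

-- EXCEPT compares NULLs as equal: only ('B', 'X') is returned.
SELECT Name, Dept FROM @t1
EXCEPT
SELECT Name, Dept FROM @t2;

-- NOT EXISTS with "=": NULL = NULL is not true, so ('A', NULL)
-- is reported as missing in addition to ('B', 'X').
SELECT a.Name, a.Dept
FROM @t1 AS a
WHERE NOT EXISTS
(
    SELECT 1
    FROM @t2 AS b
    WHERE b.Name = a.Name
      AND b.Dept = a.Dept
);
```

If only rows where all four columns have values should be checked, adding IS NOT NULL filters on the Table1 side makes both forms return the same rows.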
Your approach should do the job once the two tables are swapped: your query selects from Table2, while the one below returns everything that is in Table1 but not in Table2.
SELECT ym.Name, ym.TestName, ym.DEVName, ym.Dept
FROM Table1 ym
WHERE NOT EXISTS (
SELECT 1
FROM table2
WHERE ym.Name = Name AND ym.TestName = TestName AND ym.DEVName = DEVName AND ym. Dept = Dept
)
If the structure of both tables is the same, EXCEPT is probably simpler.
IF OBJECT_ID(N'tempdb..#table1') IS NOT NULL drop table #table1
IF OBJECT_ID(N'tempdb..#table2') IS NOT NULL drop table #table2
create table #table1 (id int, value varchar(10))
create table #table2 (id int)
insert into #table1(id, value) VALUES (1,'value1'), (2,'value2'), (3,'value3')
--test here. Comment next line
insert into #table2(id) VALUES (1) --Comment/Uncomment
select * from #table1
select * from #table2
select #table1.*
from #table1
left JOIN #table2 on
#table1.id = #table2.id
where (#table2.id is not null or not exists (select * from #table2))

Get matching string with the percentage

I have the following details of the data:
Table 1: Table1 is small, with only a few records.
Table 2: Table2 has 50 million rows.
Requirement: I need to match any string column from table1 against table2, for example the name column against name, and get the matching percentage (note that the column can be any string column, such as address, that has multiple words in a single cell).
Sample data:
create table table1(id int, name varchar(100), address varchar(200));
insert into table1 values(1,'Mario Speedwagon','H No 10 High Street USA');
insert into table1 values(2,'Petey Cruiser Jack','#1 Church Street UK');
insert into table1 values(3,'Anna B Sthesia','#101 No 1 B Block UAE');
insert into table1 values(4,'Paul A Molive','Main Road 12th Cross H No 2 USA');
insert into table1 values(5,'Bob Frapples','H No 20 High Street USA');
create table table2(name varchar(100), address varchar(200), email varchar(100));
insert into table2 values('Speedwagon Mario ','USA, H No 10 High Street','mario@gmail.com');
insert into table2 values('Cruiser Petey Jack','UK #1 Church Street','jack@gmail.com');
insert into table2 values('Sthesia Anna','UAE #101 No 1 B Block','Aanna@gmail.com');
insert into table2 values('Molive Paul','USA Main Road 12th Cross H No 2','APaul@gmail.com');
insert into table2 values('Frapples Bob ','USA H No 20 High Street','BobF@gmail.com');
Expected Result:
tbl1_Name            tbl2_Name            Percentage
--------------------------------------------------------
Mario Speedwagon     Speedwagon Mario     100
Petey Cruiser Jack   Cruiser Petey Jack   100
Anna B Sthesia       Sthesia Anna         around 80+
Paul A Molive        Molive Paul          around 80+
Bob Frapples         Frapples Bob         100
Note: The above is just sample data for illustration; in the actual scenario I have a few records in table1 and 50 million in table2.
My Try:
Step 1: As suggested by Shnugo, I normalized the data and stored the result in the same tables.
For table1:
ALTER TABLE table1 ADD Name_Normal VARCHAR(1000);
GO
--00:00:00 (5 row(s) affected)
UPDATE table1
SET Name_Normal=CAST('<x>' + REPLACE((SELECT LOWER(name) AS [*] FOR XML PATH('')),' ','</x><x>') + '</x>' AS XML)
.query(N'
for $fragment in distinct-values(/x/text())
order by $fragment
return $fragment
').value('.','nvarchar(1000)');
GO
For table2:
ALTER TABLE table2 ADD Name_Normal VARCHAR(1000);
GO
--01:59:03 (50000000 row(s) affected)
UPDATE table2
SET Name_Normal=CAST('<x>' + REPLACE((SELECT LOWER(name) AS [*] FOR XML PATH('')),' ','</x><x>') + '</x>' AS XML)
.query(N'
for $fragment in distinct-values(/x/text())
order by $fragment
return $fragment
').value('.','nvarchar(1000)');
GO
Step 2: Create a percentage calculation function using Levenshtein distance in Microsoft SQL Server
Step 3: Query to get the matching percentage.
--00:00:33 (23456 row(s) affected)
SELECT t.name AS [tbl1_Name],t1.name AS [tbl2_Name],
dbo.ufn_Levenshtein(t.Name_Normal,t1.Name_Normal) percentage
into #TempTable
FROM table2 t
INNER JOIN table1 t1
ON CHARINDEX(SOUNDEX(t.Name_Normal),SOUNDEX(t1.Name_Normal))>0
--00:00:00 (23456 row(s) affected)
SELECT *
FROM #TempTable
WHERE percentage >= 50
order by percentage desc;
Conclusion: I am getting the expected result, but normalizing table2 takes around 2 hours, as noted in the comment in the query above. Any suggestions for optimizing step 1 for table2?
Have you tried looking into DQS (Data Quality Services)?
Depending on your SQL Server version, it comes with the installation files.
https://learn.microsoft.com/en-us/sql/data-quality-services/data-matching?view=sql-server-2017
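Another option worth measuring for step 1 is to replace the XML/XQuery normalization with STRING_SPLIT and STRING_AGG. This is only a sketch, assuming SQL Server 2017 or later and the Name_Normal column added in the question; whether it is actually faster on 50 million rows is untested and would need to be checked against the real data.

```sql
-- Sketch: lower-case the name, split it into words, remove duplicates,
-- and rebuild it with the words in sorted order.
-- Assumptions: STRING_SPLIT and STRING_AGG are available (SQL Server 2017+).
UPDATE t
SET Name_Normal =
(
    SELECT STRING_AGG(d.word, ' ') WITHIN GROUP (ORDER BY d.word)
    FROM
    (
        SELECT DISTINCT LOWER(s.value) AS word
        FROM STRING_SPLIT(t.name, ' ') AS s
        WHERE s.value <> ''   -- skip empty tokens from extra or trailing spaces
    ) AS d
)
FROM table2 AS t;
```

The sort here follows the column collation, which may order some characters differently from the XQuery version, so both tables should be normalized with the same method before comparing Name_Normal values.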

SQL Server putting data into temp table first before heavy join

Is it a good idea to put data into temp table first before joining several other tables?
For instance, let's say I have the following:
tableA, 5 million rows
tableB, 5 million rows
tableC, 5 million rows
...
tableG
The Query I want to perform may look like:
SELECT 1 FROM tableA
INNER JOIN tableB WITH (NOLOCK) ON tableA.col1= tableB.col1
LEFT JOIN tableC WITH (NOLOCK) ON ...
...
LEFT JOIN tableG WITH (NOLOCK) ON ...
WHERE tableA.someCol= conditionA AND tableB.someCol= conditionB...
Assuming that, with the filter, only a small subset of tableA will be returned, would it be a good idea to pull the data from tableA first before joining the other tables, so as to avoid blocking and maybe increase performance?
I tried googling but couldn't find any satisfactory answer. Thanks in advance.
Here are the "typicals" that I try. I usually try them out and see what happens under load and under "big data" that represents production row numbers, not dev row numbers.
Going from memory.
1. If it is "one time" use, I try to use the derived table method.
2. If the data in the "holder" table can be reused, I start with a @variableTable if the number of rows will be small.
2.b. The only time I've seen a @variableTable screw you is if you do some aggregate results...where the "summary rows" are only a few, but to generate the summary rows, you hit a large amount of rows. Think something like "Select StateAbbreviation, count(*) from dbo.LargeTableOfData group by StateAbbreviation".....there will only be 50 or so rows in the result table, BUT the aggregate data comes from a large table with lots of rows.
3. Then I go to a #TempTable. Most times without an index. Sometimes with an index.
2 or 3 times in my life the index on the #TempTable resulted in significant improvement.
It is a "try it out game". Sometimes you just don't know until you give it the ole college try.
Use Northwind
GO
/* Temp Table , No Index(es) */
IF OBJECT_ID('tempdb..#TempTableNoIndex') IS NOT NULL
begin
drop table #TempTableNoIndex
end
CREATE TABLE #TempTableNoIndex
(
OrderID int
)
Insert into #TempTableNoIndex (OrderID) select top 5 OrderID from dbo.Orders
Select * from dbo.[Order Details] od where exists (select null from #TempTableNoIndex innerHolder where innerHolder.OrderID = od.OrderID )
/* Temp Table , With Index(es) */
IF OBJECT_ID('tempdb..#TempTableWithIndex') IS NOT NULL
begin
drop table #TempTableWithIndex
end
CREATE TABLE #TempTableWithIndex
(
OrderID int
)
CREATE INDEX IX_TEMPTABLE_TempTableWithIndex_OrderID ON #TempTableWithIndex (OrderID)
Insert into #TempTableWithIndex (OrderID) select top 5 OrderID from dbo.Orders
Select * from dbo.[Order Details] od where exists (select null from #TempTableWithIndex innerHolder where innerHolder.OrderID = od.OrderID )
/* Variable Table */
Declare @HolderTable TABLE ( OrderID int )
Insert into @HolderTable (OrderID) select top 5 OrderID from dbo.Orders
Select * from dbo.[Order Details] od where exists (select null from @HolderTable innerHolder where innerHolder.OrderID = od.OrderID )
/* Derived Table */
Select * from dbo.[Order Details] od
join
( select top 5 OrderID from dbo.Orders ) as derived1
on od.OrderID = derived1.OrderID
/* Clean up */
IF OBJECT_ID('tempdb..#TempTableNoIndex') IS NOT NULL
begin
drop table #TempTableNoIndex
end
IF OBJECT_ID('tempdb..#TempTableWithIndex') IS NOT NULL
begin
drop table #TempTableWithIndex
end

Resources