Query Large (in the millions) Data Faster - sql-server

I have two tables:
Tbl1 has 2 columns: name and state
Tbl2 has name and state and additional columns about the fields
I am trying to match tbl1 name and state with tbl2 name and state. I have removed all exact matches, but I see that I could match more if I could account for misspellings and name variations by using a scalar function that compares the two names and returns an integer showing how close a match they are (the lower the number returned, the better the match).
The issue is that Tbl1 has over 2M records and Tbl2 has over 4M records; it takes about 30 seconds just to search for one record from Tbl1 in Tbl2.
Is there some way I could arrange the data or query so the search could be completed faster?
Here’s the table structure:
CREATE TABLE Tbl1
(
Id INT NOT NULL IDENTITY( 1, 1 ) PRIMARY KEY,
Name NVARCHAR(255),
[State] VARCHAR(50),
Phone VARCHAR(50),
DoB SMALLDATETIME
)
GO
CREATE INDEX tbl1_Name_indx ON dbo.Tbl1( Name )
GO
CREATE INDEX tbl1_State_indx ON dbo.Tbl1( [State] )
GO
CREATE TABLE Tbl2
(
Id INT NOT NULL IDENTITY( 1, 1 ) PRIMARY KEY,
Name NVARCHAR(255),
[State] VARCHAR(50)
)
GO
CREATE INDEX tbl2_Name_indx ON dbo.Tbl2( Name )
GO
CREATE INDEX tbl2_State_indx ON dbo.Tbl2( [State] )
GO
Here's a sample function that I tested with to try to rule out function complexity:
CREATE FUNCTION [dbo].ScoreHowCloseOfMatch
(
@SearchString VARCHAR(200) ,
@MatchString VARCHAR(200)
)
RETURNS INT
AS
BEGIN
DECLARE @Result INT;
SET @Result = 1;
RETURN @Result;
END;
Here's some sample data:
INSERT INTO Tbl1
SELECT 'Bob Jones', 'WA', '555-333-2222', 'June 10, 1971' UNION
SELECT 'Melcome T Homes', 'CA', '927-333-2222', 'June 10, 1971' UNION
SELECT 'Janet Rengal', 'WA', '555-333-2222', 'June 10, 1971' UNION
SELECT 'Matt Francis', 'TN', '234-333-2222', 'June 10, 1971' UNION
SELECT 'Same Bojen', 'WA', '555-333-2222', 'June 10, 1971' UNION
SELECT 'Frank Tonga', 'NY', '903-333-2222', 'June 10, 1971' UNION
SELECT 'Jill Rogers', 'WA', '555-333-2222', 'June 10, 1971' UNION
SELECT 'Tim Jackson', 'OR', '757-333-2222', 'June 10, 1971'
GO
INSERT INTO Tbl2
SELECT 'BobJonez', 'WA' UNION
SELECT 'Malcome X', 'CA' UNION
SELECT 'Jan Regal', 'WA'
GO
Here's the query:
WITH cte as (
SELECT t1Id = t1.Id ,
t1Name = t1.Name ,
t1State = t1.State,
t2Name = t2.Name ,
t2State = t2.State ,
t2.Phone ,
t2.DoB,
Score = dbo.ScoreHowCloseOfMatch(t1.Name, t2.Name)
FROM dbo.Tbl1 t2
JOIN dbo.Tbl2 t1
ON t1.State = t2.State
)
SELECT *
INTO CompareResult
FROM cte
ORDER BY cte.Score ASC
GO

One possibility would be to add a column with a normalized name used only for matching purposes. You would remove all the white spaces, remove accents, replace first names by abbreviated first names, replace known nicknames by real names etc.
You could even sort the first name and the last name of one person alphabetically in order to allow swapping both.
Then you can simply join the two tables by this normalized name column.
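A minimal sketch of that idea, assuming SQL Server and the tables from the question. The column name NameKey, the index names, and the cleanup rules (upper-casing, stripping spaces, periods, and hyphens) are illustrative assumptions; accent removal and nickname replacement would need extra rules or a lookup table that is not shown here.

```sql
-- Hypothetical normalized key: upper-case the name and strip spaces, periods, and hyphens.
-- The CAST keeps the computed column narrow enough to serve as an index key.
ALTER TABLE dbo.Tbl1 ADD NameKey AS
    CAST(UPPER(REPLACE(REPLACE(REPLACE(Name, ' ', ''), '.', ''), '-', '')) AS NVARCHAR(255)) PERSISTED;
ALTER TABLE dbo.Tbl2 ADD NameKey AS
    CAST(UPPER(REPLACE(REPLACE(REPLACE(Name, ' ', ''), '.', ''), '-', '')) AS NVARCHAR(255)) PERSISTED;
GO

-- Composite indexes so the join can seek on (State, NameKey) instead of scanning.
CREATE INDEX tbl1_State_NameKey_indx ON dbo.Tbl1 ([State], NameKey);
CREATE INDEX tbl2_State_NameKey_indx ON dbo.Tbl2 ([State], NameKey);
GO

-- Equality join on the normalized key; no scalar function call per candidate pair.
SELECT t1.Id AS Tbl1Id, t1.Name AS Tbl1Name,
       t2.Id AS Tbl2Id, t2.Name AS Tbl2Name
FROM dbo.Tbl1 AS t1
JOIN dbo.Tbl2 AS t2
    ON  t2.[State] = t1.[State]
    AND t2.NameKey = t1.NameKey;
```

Rows that still fail to match after this pass form a much smaller set, which could then be handed to the scoring function.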

JOIN dbo.Tbl2 t1
ON t1.State = t2.State
You are joining 2Mx4M rows on a max 50 distinct values join criteria. No wonder this is slow. You need to go back to the drawing board and redefine your problem. If you really want to figure out the 'close match' of every body with everybody else in the same state, then be prepared to pay the price...
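To put a number on that: if the rows were spread evenly over 50 states (an assumption made only for the arithmetic), each state would hold about 40,000 rows from Tbl1 and 80,000 from Tbl2, which is 3.2 billion pairs per state and roughly 160 billion calls to the scalar function overall. The actual figure for your data can be estimated before running the comparison with a sketch like this:

```sql
-- Counts how many (Tbl1, Tbl2) pairs the State-only join would produce.
-- CAST to BIGINT because the product easily overflows INT.
SELECT SUM(CAST(a.Cnt AS BIGINT) * b.Cnt) AS CandidatePairs
FROM (SELECT [State], COUNT(*) AS Cnt FROM dbo.Tbl1 GROUP BY [State]) AS a
JOIN (SELECT [State], COUNT(*) AS Cnt FROM dbo.Tbl2 GROUP BY [State]) AS b
    ON b.[State] = a.[State];
```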

Related

How to get records which has more than one entries on another table

An example scenario for my question would be:
How do I get all persons who have multiple address types?
Now here's my sample data:
CREATE TABLE #tmp_1 (
ID uniqueidentifier PRIMARY KEY
, FirstName nvarchar(max)
, LastName nvarchar(max)
)
CREATE TABLE #tmp_2 (
SeedID uniqueidentifier PRIMARY KEY
, SomeIrrelevantCol nvarchar(max)
)
CREATE TABLE #tmp_3 (
KeyID uniqueidentifier PRIMARY KEY
, ID uniqueidentifier REFERENCES #tmp_1(ID)
, SeedID uniqueidentifier REFERENCES #tmp_2(SeedID)
, SomeIrrelevantCol nvarchar(max)
)
INSERT INTO #tmp_1
VALUES
('08781F73-A06B-4316-B6A5-802ED58E54BE', 'AAAAAAA', 'aaaaaaa'),
('4EC71FCE-997C-46AA-B119-6C5A2545DDC2', 'BBBBBBB', 'bbbbbbb'),
('B0726ABF-738E-48BC-95CB-091C9D731A0E', 'CCCCCCC', 'ccccccc'),
('6C6CE284-A63C-49D2-B2CC-F25C9CBC8FB8', 'DDDDDDD', 'ddddddd')
INSERT INTO #tmp_2
VALUES
('4D10B4EC-C929-4D6B-8C94-11B680CF2221', 'Value1'),
('4C891FE9-60B6-41BE-A64B-11A9A8B58AB2', 'Value2'),
('6F6EFED6-8EA0-4F70-A63F-6A103D0A71BD', 'Value3')
INSERT INTO #tmp_3
VALUES
(NEWID(), '08781F73-A06B-4316-B6A5-802ED58E54BE', '4D10B4EC-C929-4D6B-8C94-11B680CF2221', 'sdfsdgdfbgcv'),
(NEWID(), '08781F73-A06B-4316-B6A5-802ED58E54BE', '4C891FE9-60B6-41BE-A64B-11A9A8B58AB2', 'asdfadsas'),
(NEWID(), '08781F73-A06B-4316-B6A5-802ED58E54BE', '4C891FE9-60B6-41BE-A64B-11A9A8B58AB2', 'xxxxxeeeeee'),
(NEWID(), '4EC71FCE-997C-46AA-B119-6C5A2545DDC2', '4D10B4EC-C929-4D6B-8C94-11B680CF2221', 'sdfsdfsd'),
(NEWID(), 'B0726ABF-738E-48BC-95CB-091C9D731A0E', '4D10B4EC-C929-4D6B-8C94-11B680CF2221', 'zxczxcz'),
(NEWID(), 'B0726ABF-738E-48BC-95CB-091C9D731A0E', '6F6EFED6-8EA0-4F70-A63F-6A103D0A71BD', 'eerwerwe'),
(NEWID(), '6C6CE284-A63C-49D2-B2CC-F25C9CBC8FB8', '4D10B4EC-C929-4D6B-8C94-11B680CF2221', 'vbcvbcvbcv')
Which gives you:
This is my attempt:
SELECT
t1.*
, Cnt -- not really needed. Just added for visual purposes
FROM #tmp_1 t1
LEFT JOIN (
SELECT
xt.ID
, COUNT(1) Cnt
FROM (
SELECT
#tmp_3.ID
, COUNT(1) as Cnt
FROM #tmp_3
GROUP BY ID, SeedID
) xt
GROUP BY ID
) t2
ON t1.ID = t2.ID
WHERE t2.Cnt > 1
Which gives:
ID FirstName LastName Cnt
B0726ABF-738E-48BC-95CB-091C9D731A0E CCCCCCC ccccccc 2
08781F73-A06B-4316-B6A5-802ED58E54BE AAAAAAA aaaaaaa 2
Although this gives me the correct results, I'm afraid that this query is not the right way to do this performance-wise because of the inner queries. Any input is very much appreciated.
NOTE:
A person can have multiple addresses of the same address type.
"Person-Address" is not the exact use-case. This is just an example.
The Cnt column is not really needed in the result set.
The way you have named your sample tables and data does little to help in understanding the problem.
I think you want all IDs which have 2 or more SomeIrrelevantCol values in the last table?
This can be done by:
select * from #tmp_1
where ID in
(
select ID
from #tmp_3
group by ID
having count(distinct SomeIrrelevantCol)>=2
)
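If "address type" corresponds to SeedID rather than SomeIrrelevantCol (my reading of the question's note that a person can have several addresses of the same type), the same pattern can count distinct SeedID values instead. This is a sketch under that assumption, using the temp tables from the question:

```sql
-- People linked to at least two different SeedID values in #tmp_3.
-- Duplicate rows for the same (ID, SeedID) pair are counted once.
SELECT t1.*
FROM #tmp_1 AS t1
WHERE t1.ID IN
(
    SELECT t3.ID
    FROM #tmp_3 AS t3
    GROUP BY t3.ID
    HAVING COUNT(DISTINCT t3.SeedID) >= 2
);
```

With the sample rows this returns AAAAAAA and CCCCCCC, the same two people as the original attempt, with a single level of grouping.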

SQL Server Dynamic Search Based on Stored Search Parameters

I have to create a dynamic search. The search criteria is stored in tables and there is a main table for the stored records. Here is the structure:
--Main Table. This table stores records of a user. Basically we store files in this table. Each file is associated with a single city.
DECLARE @Records TABLE(
[RecordId] INT IDENTITY(1,1),
[FileName] VARCHAR(100),
[OwnerId] INT,
[CityId] INT
)
INSERT INTO @Records
SELECT 'A', 100, 101 UNION
SELECT 'B', 100, 102 UNION
SELECT 'C', 100, 103 UNION
SELECT 'D', 100, 104
--The next table is used to associate a file with multiple friends.
DECLARE @FriendRecords TABLE
(
[Id] INT IDENTITY(1,1),
[RecordId] INT,
[FriendId] INT
)
INSERT INTO @FriendRecords
SELECT 1, 201 UNION --File '1' is associated with 'FriendId' 201
SELECT 1, 202 UNION --File '1' is associated with 'FriendId' 202
SELECT 2, 201 UNION --File '2' is associated with 'FriendId' 201
SELECT 3, 202 --File '3' is associated with 'FriendId' 202
--The following table is used to create a criteria for user.
DECLARE @Criteria TABLE
(
[CriteriaId] INT IDENTITY(1,1),
[CriteriaName] VARCHAR(50),
[OwnerId] INT
)
INSERT INTO @Criteria
SELECT 'SampleCriteria', 100 --Criteria created by user 100
--The following table is used to store cities that needs to be searched in 'Records' table for owner of criteriaId '1'.
DECLARE @CriteriaCities TABLE(
[CriteriaCityId] INT IDENTITY(1,1),
[CriteriaId] INT,
[CityId] INT
)
INSERT INTO @CriteriaCities
SELECT 1, 101 UNION
SELECT 1, 102 UNION
SELECT 1, 103
--The following table is used to store friend that needs to be searched in 'FriendsRecords' table for owner of criteriaId '1'.
DECLARE @CriteriaFriend TABLE(
[CriteriaCityId] INT IDENTITY(1,1),
[CriteriaId] INT,
[FriendId] INT
)
INSERT INTO @CriteriaFriend
SELECT 1, 202;
Basically, the user can create a criteria (@Criteria table) and store the search parameters (@CriteriaCities and @CriteriaFriend tables). The requirement is to get the files according to the stored criteria. The query I am looking for should return records from the @Records table that have CityIds 101, 102, and 103 AND FriendId '201'. The result is only 'C' from the @Records table. If I use left joins on all the tables, I get the other records for OwnerId '100' as well. If I use inner joins between the tables, I get no records when there is no entry for the criteria in the @CriteriaCities or @CriteriaFriend table. What query would search the main table based on the records that exist in the link tables (@CriteriaCities, @CriteriaFriend)? If a search parameter is not stored in these tables, the join to that table should not be applied.
I'm not sure you have your logic fully understood as your question doesn't quite make sense per my comment, but if I follow what you have stated as your rules you can use the following select statement to get your data. I am not sure how you would pass in which criteria you are searching for yet though, as you have not explained how this part of your process works:
-- Count the records in each criteria table.
declare @CriteriaCitiesCount int = (select count(1) from @CriteriaCities);
declare @CriteriaFriendCount int = (select count(1) from @CriteriaFriend);
select *
from @Records r
left join @CriteriaCities cc
on(r.CityId = cc.CityId)
left join @FriendRecords f
on(r.RecordId = f.RecordId)
left join @CriteriaFriend cf
on(f.FriendId = cf.FriendId)
where (@CriteriaCitiesCount = 0 -- Only return records where we aren't filtering,
or cc.CityId is not null -- Or we are filtering and we get a match.
)
and (@CriteriaFriendCount = 0
or cf.FriendId is not null
);
Sorry, I am not able to understand your expected output.
Why should you get only FileName C?
Try something like this:
-- @ownerid is assumed to be declared and set by the caller.
;With CTE as
(
select cc.CityId,cf.FriendId
from @Criteria C
inner join @CriteriaCities CC
on c.CriteriaId=cc.CriteriaId
inner join @CriteriaFriend CF
on c.CriteriaId=cf.CriteriaId
where c.OwnerId=@ownerid
)
select *
from @Records R
inner join @FriendRecords FR
on r.RecordId=fr.RecordId
where EXISTS(
select cityid from cte c where r.CityId=c.cityid and c.FriendId=fr.FriendId
)
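The "skip the filter when nothing is stored for it" rule from the question can also be written with EXISTS, which avoids the duplicate rows that joins to the link tables may produce. This is only a sketch: the @CriteriaId parameter is hypothetical, the tables are trimmed copies of the question's sample data, and RecordId is supplied explicitly instead of through IDENTITY.

```sql
-- Trimmed copies of the question's tables and rows (SQL Server 2008 or later syntax).
DECLARE @Records TABLE (RecordId INT, [FileName] VARCHAR(100), OwnerId INT, CityId INT);
INSERT INTO @Records VALUES (1,'A',100,101), (2,'B',100,102), (3,'C',100,103), (4,'D',100,104);
DECLARE @FriendRecords TABLE (RecordId INT, FriendId INT);
INSERT INTO @FriendRecords VALUES (1,201), (1,202), (2,201), (3,202);
DECLARE @Criteria TABLE (CriteriaId INT, OwnerId INT);
INSERT INTO @Criteria VALUES (1,100);
DECLARE @CriteriaCities TABLE (CriteriaId INT, CityId INT);
INSERT INTO @CriteriaCities VALUES (1,101), (1,102), (1,103);
DECLARE @CriteriaFriend TABLE (CriteriaId INT, FriendId INT);
INSERT INTO @CriteriaFriend VALUES (1,202);

DECLARE @CriteriaId INT = 1;  -- hypothetical: the stored criteria to apply

SELECT r.RecordId, r.[FileName]
FROM @Records AS r
JOIN @Criteria AS c
    ON c.OwnerId = r.OwnerId AND c.CriteriaId = @CriteriaId
WHERE
    -- City filter: applied only if the criteria has stored cities.
    (   NOT EXISTS (SELECT 1 FROM @CriteriaCities AS cc WHERE cc.CriteriaId = c.CriteriaId)
     OR EXISTS (SELECT 1 FROM @CriteriaCities AS cc
                WHERE cc.CriteriaId = c.CriteriaId AND cc.CityId = r.CityId))
    -- Friend filter: applied only if the criteria has stored friends.
AND (   NOT EXISTS (SELECT 1 FROM @CriteriaFriend AS cf WHERE cf.CriteriaId = c.CriteriaId)
     OR EXISTS (SELECT 1
                FROM @FriendRecords AS fr
                JOIN @CriteriaFriend AS cf ON cf.FriendId = fr.FriendId
                WHERE fr.RecordId = r.RecordId AND cf.CriteriaId = c.CriteriaId));
```

For these rows the query returns files A and C, since both are in a stored city and both are associated with the stored FriendId 202.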

Unpivot dynamic table columns into key value rows

The problem that I need to resolve is data transfer from one table with many dynamic fields into other structured key value table.
The first table comes from a data export from another system, and has the following structure ( it can have any column name and data):
[UserID],[FirstName],[LastName],[Email],[How was your day],[Would you like to receive weekly newsletter],[Confirm that you are 18+] ...
The second table is where I want to put the data, and it has the following structure:
[UserID uniqueidentifier],[QuestionText nvarchar(500)],[Question Answer nvarchar(max)]
I saw many examples showing how to unpivot a table, but my problem is that I don't know what columns Table 1 will have. Can I somehow dynamically unpivot the first table, so that no matter what columns it has, it is converted into a key-value structure and the data is imported into the second table?
I will really appreciate your help with this.
You can't pivot or unpivot in one query without knowing the columns.
What you can do, assuming you have privileges, is query sys.columns to get the field names of your source table then build an unpivot query dynamically.
--Source table
create table MyTable (
id int,
Field1 nvarchar(10),
Field2 nvarchar(10),
Field3 nvarchar(10)
);
insert into MyTable (id, Field1, Field2, Field3) values ( 1, 'aaa', 'bbb', 'ccc' );
insert into MyTable (id, Field1, Field2, Field3) values ( 2, 'eee', 'fff', 'ggg' );
insert into MyTable (id, Field1, Field2, Field3) values ( 3, 'hhh', 'iii', 'jjj' );
--key/value table
create table MyValuesTable (
id int,
[field] sysname,
[value] nvarchar(10)
);
declare @columnString nvarchar(max)
--This recursive CTE examines the source table's columns excluding
--the 'id' column explicitly and builds a string of column names
--like so: '[Field1], [Field2], [Field3]'.
;with columnNames as (
select column_id, name
from sys.columns
where object_id = object_id('MyTable','U')
and name <> 'id'
),
columnString (id, string) as (
select
2, cast('' as nvarchar(max))
union all
select
b.id + 1, b.string + case when b.string = '' then '' else ', ' end + '[' + a.name + ']'
from
columnNames a
join columnString b on b.id = a.column_id
)
select top 1 @columnString = string from columnString order by id desc
--Now I build a query around the column names which unpivots the source and inserts into the key/value table.
declare @sql nvarchar(max)
set @sql = '
insert MyValuestable
select id, field, value
from
(select * from MyTable) b
unpivot
(value for field in (' + @columnString + ')) as unpvt'
--Query's ready to run.
exec (@sql)
select * from MyValuesTable
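Note that the recursive CTE above walks column_id values starting at 2 and stops at the first gap, so it relies on 'id' being the first column and on no columns having been dropped. As a hedged alternative for building the same column list, assuming SQL Server 2017 or later (for STRING_AGG) and the MyTable example above:

```sql
-- Builds '[Field1], [Field2], [Field3]' without relying on contiguous column_id values.
-- QUOTENAME also escapes any ']' inside a column name.
DECLARE @columnString NVARCHAR(MAX);

SELECT @columnString =
       STRING_AGG(CAST(QUOTENAME(name) AS NVARCHAR(MAX)), ', ')
           WITHIN GROUP (ORDER BY column_id)
FROM sys.columns
WHERE object_id = OBJECT_ID('MyTable', 'U')
  AND name <> 'id';

SELECT @columnString AS ColumnList;
```

The resulting string can be dropped into the dynamic UNPIVOT statement in place of the CTE output. UNPIVOT still requires the listed columns to share one data type and length, so an export with mixed column types would need each column cast to a common type first.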
In case you're getting your source data from a stored procedure, you can use OPENROWSET to get the data into a table, then examine that table's column names. This link shows how to do that part.
https://stackoverflow.com/a/1228165/300242
Final note: If you use a temporary table, remember that you get the column names from tempdb.sys.columns like so:
select column_id, name
from tempdb.sys.columns
where object_id = object_id('tempdb..#MyTable','U')

How to query with unknown combination of optional parameters in SQL Server without cursors?

I have a search that has three input fields (for arguments sake, let's say LastName, Last4Ssn, and DateOfBirth). These three input fields are in a dynamic grid where the user can choose to search for one or more combinations of these three fields. For example, a user might search based on the representation below:
LastName Last4Ssn DateOfBirth
-------- -------- -----------
Smith NULL 1/1/1970
Smithers 1234 NULL
NULL 5678 2/2/1980
In the example, the first row represents a search by LastName and DateOfBirth. The second, by LastName and Last4Ssn. And, the third, by Last4Ssn and DateOfBirth. This example is a bit contrived as the real-world scenario has four fields. At least two of the fields must be filled (don't worry about how to validate) with search data and it is possible that all fields are filled out.
Without using cursors, how does one use that data to join to existing tables using the given values in each row as the filter? Currently, I have a cursor that goes through each row of the above table, performs the join based on the columns that have values, and inserts the found data into a temp table. Something like this:
CREATE TABLE #results (
Id INT,
LastName VARCHAR (26),
Last4Ssn VARCHAR (4),
DateOfBirth DATETIME
)
DECLARE @lastName VARCHAR (26)
DECLARE @last4Ssn VARCHAR (4)
DECLARE @dateOfBirth DATETIME
DECLARE search CURSOR FOR
SELECT LastName, Last4Ssn, DateOfBirth
FROM #searchData
OPEN search
FETCH NEXT FROM search
INTO @lastName, @last4Ssn, @dateOfBirth
WHILE @@FETCH_STATUS = 0
BEGIN
INSERT INTO #results
SELECT s.Id, s.LastName, s.Last4Ssn, s.DateOfBirth
FROM SomeTable s
WHERE Last4Ssn = ISNULL(@last4Ssn, Last4Ssn)
AND DateOfBirth = ISNULL(@dateOfBirth, DateOfBirth)
AND (
LastName = ISNULL(@lastName, LastName)
OR LastName LIKE @lastName + '%'
)
FETCH NEXT FROM search
INTO @lastName, @last4Ssn, @dateOfBirth
END
CLOSE search
DEALLOCATE search
I was hoping there was some way to avoid the cursor to make the code a bit more readable. Performance is not an issue as the table used to search will never have more than 5-10 records in it, but I would think that for more than a few, it'd be more efficient to query the data all at once rather than one row at a time. The SomeData table in my example can be very large.
I don't see why you can't just join the two tables together:
CREATE TABLE #results (
Id INT,
LastName VARCHAR (26),
Last4Ssn VARCHAR (4),
DateOfBirth DATETIME
)
INSERT INTO #results
select s.id, s.lastname, s.last4ssn, s.dateofbirth
from SomeTable s
join #searchData d
ON s.last4ssn = isnull(d.last4ssn, s.last4ssn)
AND s.dateofbirth = isnull(d.dateofbirth, s.dateofbirth)
AND (s.lastname = isnull(d.lastname, s.lastname)
OR s.lastname like d.lastname + '%')
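One caveat with the ISNULL(d.col, s.col) pattern: when the data row itself has NULL in that column, the comparison becomes NULL = NULL, which is not true under the default ANSI_NULLS setting, so the row is dropped even though the search did not filter on that column. The sketch below uses an explicit "IS NULL OR" test instead; the temp tables #people and #search and their rows are hypothetical stand-ins for SomeTable and #searchData.

```sql
-- Hypothetical stand-ins for SomeTable and #searchData.
CREATE TABLE #people (Id INT, LastName VARCHAR(26), Last4Ssn VARCHAR(4), DateOfBirth DATETIME);
INSERT INTO #people VALUES (1, 'Smith', NULL, '19700101'),
                           (2, 'Smithers', '1234', '19750505');

CREATE TABLE #search (LastName VARCHAR(26), Last4Ssn VARCHAR(4), DateOfBirth DATETIME);
INSERT INTO #search VALUES ('Smith', NULL, '19700101');

-- A NULL search value means "do not filter on this column".
SELECT p.Id, p.LastName, p.Last4Ssn, p.DateOfBirth
FROM #people AS p
JOIN #search AS d
    ON  (d.Last4Ssn    IS NULL OR p.Last4Ssn    = d.Last4Ssn)
    AND (d.DateOfBirth IS NULL OR p.DateOfBirth = d.DateOfBirth)
    AND (d.LastName    IS NULL OR p.LastName LIKE d.LastName + '%');

DROP TABLE #people;
DROP TABLE #search;
```

Here the row with Id 1 is returned even though its Last4Ssn is NULL; the ISNULL form would have excluded it.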
EDIT:
Since the data is large, we'll need some good indices. One index isn't good enough since you effectively have 3 clauses OR'd together. So the first step is to create those indices:
CREATE TABLE SomeData (
Id INT identity(1,1),
LastName VARCHAR (26),
Last4Ssn VARCHAR (4),
DateOfBirth DATETIME
)
create nonclustered index ssnlookup on somedata (last4ssn)
create nonclustered index lastnamelookup on somedata (lastname)
create nonclustered index doblookup on somedata (dateofbirth)
The next step involves crafting the query to use those indices. I'm not sure what the best way here is, but I think it's to have 4 queries union'd together:
with searchbyssn as (
select somedata.* from somedata join #searchData
on somedata.last4ssn = #searchData.last4ssn
), searchbyexactlastname as (
select somedata.* from somedata join #searchData
on somedata.lastname = #searchData.lastname
), searchbystartlastname as (
select somedata.* from somedata join #searchData
on somedata.lastname like #searchdata.lastname + '%'
), searchbydob as (
select somedata.* from somedata join #searchData
on somedata.dateofbirth = #searchData.dateofbirth
), s as (
select * from searchbyssn
union select * from searchbyexactlastname
union select * from searchbystartlastname
union select * from searchbydob
)
select s.id, s.lastname, s.last4ssn, s.dateofbirth
from s
join #searchData d
ON (d.last4ssn is null or s.last4ssn = d.last4ssn)
AND s.dateofbirth = isnull(d.dateofbirth, s.dateofbirth)
AND (s.lastname = isnull(d.lastname, s.lastname)
OR s.lastname like d.lastname + '%')
Here's a fiddle showing the 4 index seeks: http://sqlfiddle.com/#!6/3741d/3
It shows significant resource usage for the union, but I think that would be negligible compared to the index scans for large tables. It wouldn't let me generate more than a few hundred rows of sample data. Since the number of result rows is small, it is not expensive to join to #searchData at the end and filter all the results again.

Using COALESCE in SQL view

I need to create a view from several tables. One of the columns in the view will have to be composed out of a number of rows from one of the table as a string with comma-separated values.
Here is a simplified example of what I want to do.
Customers:
CustomerId int
CustomerName VARCHAR(100)
Orders:
CustomerId int
OrderName VARCHAR(100)
There is a one-to-many relationship between Customer and Orders. So given this data
Customers
1 'John'
2 'Marry'
Orders
1 'New Hat'
1 'New Book'
1 'New Phone'
I want a view to be like this:
Name Orders
'John' New Hat, New Book, New Phone
'Marry' NULL
So that EVERYBODY shows up in the table, regardless of whether they have orders or not.
I have a stored procedure that I need to translate into this view, but it seems that you can't declare parameters or call stored procedures within a view. Any suggestions on how to get this query into a view?
CREATE PROCEDURE getCustomerOrders(@customerId int)
AS
DECLARE @CustomerName varchar(100)
DECLARE @Orders varchar (5000)
SELECT @Orders=COALESCE(@Orders,'') + COALESCE(OrderName,'') + ','
FROM Orders WHERE CustomerId=@customerId
-- this has to be done separately in case orders returns NULL, so no customers are excluded
SELECT @CustomerName=CustomerName FROM Customers WHERE CustomerId=@customerId
SELECT @CustomerName as CustomerName, @Orders as Orders
EDIT: Modified answer to include creation of view.
/* Set up sample data */
create table Customers (
CustomerId int,
CustomerName VARCHAR(100)
)
create table Orders (
CustomerId int,
OrderName VARCHAR(100)
)
insert into Customers
(CustomerId, CustomerName)
select 1, 'John' union all
select 2, 'Marry'
insert into Orders
(CustomerId, OrderName)
select 1, 'New Hat' union all
select 1, 'New Book' union all
select 1, 'New Phone'
go
/* Create the view */
create view OrderView as
select c.CustomerName, x.OrderNames
from Customers c
cross apply (select stuff((select ',' + OrderName from Orders o where o.CustomerId = c.CustomerId for xml path('')),1,1,'') as OrderNames) x
go
/* Demo the view */
select * from OrderView
go
/* Clean up after demo */
drop view OrderView
drop table Customers
drop table Orders
go
In SQL Server 2008, you can take advantage of some of the features added for XML to do this all in one query without using a stored proc:
SELECT CustomerName,
STUFF( -- "STUFF" deletes the leading ', '
( SELECT ', ' + OrderName
FROM Orders
WHERE CustomerId = Customers.CustomerId
-- This causes the sub-select to be returned as a concatenated string
FOR XML PATH('')
),
1, 2, '' )
AS Orders
FROM Customers
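On SQL Server 2017 or later, STRING_AGG offers a shorter way to express the same view. This is a sketch under that version assumption, with a hypothetical view name and the Customers and Orders tables from the question assumed to exist:

```sql
-- LEFT JOIN keeps customers without orders; STRING_AGG skips NULLs,
-- so such customers get NULL in the Orders column.
CREATE VIEW dbo.CustomerOrderList
AS
SELECT c.CustomerName,
       STRING_AGG(o.OrderName, ', ') AS Orders
FROM dbo.Customers AS c
LEFT JOIN dbo.Orders AS o
    ON o.CustomerId = c.CustomerId
GROUP BY c.CustomerId, c.CustomerName;
GO

SELECT * FROM dbo.CustomerOrderList;
```

Unlike the FOR XML PATH('') approach, this does not entity-encode characters such as & or < in order names.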
