Fulltext search multiple columns for multiple search terms - sql-server

I have a requirement from a client for a search field where he can input any text, and every word in that text is then searched across multiple full-text indexed columns holding customer information in a customer information table.
So, for example, if he inputs FL Diana Brooks Miami 90210, he wants each of these terms (FL, Diana, Brooks, Miami, 90210) to be searched against the State, FirstName, LastName, City and Zip columns.
Now, this seems like a bad idea to begin with, and as an alternative I suggested using multiple fields in which to input this information separately. Nonetheless, the point I'm at is having to build a proof of concept showing why this won't work from a performance perspective, and that it's better to have multiple fields where you input each term you want to search for.
So, getting to my query, I'm trying to write a Full-Text query to do what the client has asked for in order to get a benchmark for performance.
What I have so far doesn't seem to work, so I guess I am asking if it's even possible to do this?
declare
@zip varchar(10) = '90210'
, @lastName varchar(50) = 'Brooks'
, @firstName varchar(50) = 'Diana'
, @city varchar(50) = 'Miami'
, @state char(2) = 'FL'
, @searchTerm varchar(250) = ''
, @s varchar(1) = ' '
set @searchTerm = @state + ' ' + @firstName + ' ' + @lastName + ' ' + @city
select *
from freetexttable(contacts, (zip, lastName, FirstName, city, state), @searchTerm) ftTbl
inner join contacts c on ftTbl.[key] = c.ContactID
The query above seems to work, but it is not restrictive enough to find only the single record I'm looking for and returns a whole lot more (which I'm guessing is because I'm using FREETEXTTABLE).
I've also tried replacing it with CONTAINSTABLE, but I get an error saying:
Msg 7630, Level 15, State 3, Line 26
Syntax error near 'Diana' in the full-text search condition 'FL Diana Brooks Miami'.
Using regular indexes I have been able to solve this, but I'm curious whether it's even possible to do the same thing with Full-Text.
Using regular indexes I have a query with an adaptable WHERE clause, like below:
WHERE C.FirstName like coalesce(@FirstName + '%' , C.FirstName)
AND C.LastName like coalesce(@LastName + '%' , C.LastName)
etc.

You can create a view WITH SCHEMABINDING with the id and the concatenated columns:
CREATE VIEW dbo.SearchView WITH SCHEMABINDING
AS
SELECT id,
[State] + ' ' +
[FirstName] + ' ' +
[LastName] + ' ' +
[City] + ' ' +
[Zip] as search_string
FROM dbo.YourTable
Create index
CREATE UNIQUE CLUSTERED INDEX UCI_SearchView ON dbo.SearchView (id ASC)
Then create a full-text index on the search_string field.
USE YourDB
GO
--Enable Full-text search on the DB
IF (SELECT DATABASEPROPERTY(DB_NAME(), N'IsFullTextEnabled')) <> 1
EXEC sp_fulltext_database N'enable'
GO
--Create a full-text catalog
IF NOT EXISTS (SELECT * FROM dbo.sysfulltextcatalogs WHERE [name] = N'CatalogName')
EXEC sp_fulltext_catalog N'CatalogName', N'create'
GO
EXEC sp_fulltext_table N'dbo.SearchView', N'create', N'CatalogName', N'IndexName'
GO
--Add a column to catalog
EXEC sp_fulltext_column N'dbo.SearchView', N'search_string', N'add', 0 /* neutral */
GO
--Activate full-text for table/view
EXEC sp_fulltext_table N'dbo.SearchView', N'activate'
GO
--Full-text index update
exec sp_fulltext_catalog 'CatalogName', 'start_full'
GO
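As a side note, the sp_fulltext_* procedures used above are deprecated on SQL Server 2008 and later. A rough equivalent using the newer DDL, assuming the view, catalog and index names from this example, would be:
--Create the catalog and a full-text index keyed on the view's unique clustered index
CREATE FULLTEXT CATALOG CatalogName;
CREATE FULLTEXT INDEX ON dbo.SearchView (search_string LANGUAGE 0)  -- 0 = neutral language
    KEY INDEX UCI_SearchView
    ON CatalogName
    WITH CHANGE_TRACKING AUTO;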
After that you need to write some code to construct a search condition. For example:
FL Diana Brooks Miami 90210
becomes:
"FL*" AND "Diana*" AND "Brooks*" AND "Miami*" AND "90210*"
And use it in FREETEXT or CONTAINS searches:
DECLARE @search nvarchar(4000) = '"FL*" AND "Diana*" AND "Brooks*" AND "Miami*" AND "90210*"'
SELECT sv.*
FROM dbo.SearchView sv
INNER JOIN CONTAINSTABLE (dbo.SearchView, search_string, @search) as c
ON c.[KEY] = sv.id

Related

Execution time for a SQL Query seems excessive

I have a data set of about 33 million rows and 20 columns. One of the columns contains raw data from which I'm extracting relevant values, including IDs and account numbers.
I extracted a column of User IDs into a temporary table to trim the User IDs of spaces. I'm now trying to add the trimmed User ID column back into the original data set using this code:
SELECT *
FROM [dbo].[DATA] AS A
INNER JOIN #TempTable AS B ON A.[RawColumn] = B.[RawColumn]
Extracting the User IDs and trimming the spaces took about a minute for each query. However, with this last query I'm at the 2-hour mark and I'm only 2% of the way through the dataset.
Is there a better way to run the query?
I'm running the query in SQL Server 2014 Management Studio
Thanks
Update:
I continued to let it run through the night. When I got back into work, only 6 million of the 33 million rows had been completed. I cancelled the execution and I'm now trying to add a smaller primary key (the only other key I could see on the table was [RawColumn], which is a very long string of text) using:
ALTER TABLE [dbo].[DATA]
ADD ID INT IDENTITY(1,1)
Right now I'm an hour into the execution.
Next, I'm planning to make it the primary key using
ALTER TABLE dbo.[DATA]
ADD CONSTRAINT PK_DATA PRIMARY KEY (ID)
I'm not familiar with using indexes. I've tried looking up on Stack Overflow how to create one, but from what I'm reading it sounds like it would take just as long to create an index as it would to run this query. Am I wrong about that?
For context on the RawColumn data, it looks something like this:
FirstName: John LastName: Smith UserID: JohnS Account#: 000-000-0000
Update #2:
I'm now learning that using "ALTER TABLE" is a bad idea. I should have done a little bit more research into how to add a primary key to a table.
Update #3
Here's the code I used to extract the "UserID" code out of the "RawColumn" data.
DROP TABLE #TempTable1
GO
SELECT [RAWColumn],
SUBSTRING([RAWColumn], CHARINDEX('USERID:', [RAWColumn])+LEN('USERID:'), CHARINDEX('Account#:', [RAWColumn])-Charindex('Username:', [RAWColumn]) - LEN('Account#:') - LEN('USERID:')) AS 'USERID_NEW'
INTO #TempTable1
FROM [dbo].[DATA]
Next I trimmed the data from the temporary tables
DROP TABLE #TempTable2
GO
SELECT [RawColumn],
LTRIM([USERID_NEW]) AS 'USERID_NEW'
INTO #TempTable2
FROM #TempTable1
So now I'm trying to get the data from #TEMPTABLE2 back into my original [DATA] table. Hopefully this is more clear now.
So I think your parsing code is a little bit wrong. Here's an approach that doesn't assume the values appear in any particular order. It does assume that the header/tag name has a space after the colon character, and that the value ends at the subsequent space character. Here's a snippet that manipulates a single value.
declare @dat varchar(128) = 'FirstName: John LastName: Smith UserID: JohnS Account#: 000-000-0000';
declare @tag varchar(16) = 'UserID: ';
/* datalength() counts the trailing space character unlike len() */
declare @idx int = charindex(@tag, @dat) + datalength(@tag);
select substring(@dat, @idx, charindex(' ', @dat + ' ', @idx + 1) - @idx) as UserID
To use it in a single query without the temporary variable, the most straightforward approach is to just replace each instance of "@idx" with the original expression:
declare @tag varchar(16) = 'UserID: ';
select RawColumn,
substring(
RawColumn,
charindex(@tag, RawColumn) + datalength(@tag),
charindex(
' ', RawColumn + ' ',
charindex(@tag, RawColumn) + datalength(@tag) + 1
) - (charindex(@tag, RawColumn) + datalength(@tag))
) as UserID
from dbo.DATA;
As an update it looks something like this:
declare @tag varchar(16) = 'UserID: ';
update dbo.DATA
set UserID =
substring(
RawColumn,
charindex(@tag, RawColumn) + datalength(@tag),
charindex(
' ', RawColumn + ' ',
charindex(@tag, RawColumn) + datalength(@tag) + 1
) - (charindex(@tag, RawColumn) + datalength(@tag))
);
You also appear to be ignoring upper/lower case in your string matches. It's not clear to me whether you need to consider that more carefully.
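If case turns out to matter (for example, because the column uses a case-sensitive collation), one option is to force a case-insensitive comparison explicitly. A minimal sketch; the collation name here is only illustrative:
declare @tag varchar(16) = 'UserID: ';
-- COLLATE on the column side makes the comparison case-insensitive regardless of the column's own collation
select charindex(@tag, RawColumn COLLATE Latin1_General_CI_AS) as TagPosition
from dbo.DATA;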

Queries with Dynamic Parameters - better ways?

I have the following stored procedure that is quite expensive because of the dynamic @Name parameter and the subquery.
Is there a better, more efficient way to do this?
CREATE PROCEDURE [dbo].[spGetClientNameList]
@Name varchar(100)
AS
BEGIN
SET NOCOUNT ON;
SELECT
*
FROM
(
SELECT
ClientID,
FirstName + ' ' + LastName as Name
FROM
Client
) a
where a.Name like '%' + @Name + '%'
END
Shamelessly stealing from two recent articles by Aaron Bertrand:
Follow-up #1 on leading wildcard seeks - Aaron Bertrand
One way to get an index seek for a leading %wildcard - Aaron Bertrand
The gist is to create something we can use that resembles a trigram (or trigraph) in PostgreSQL.
Aaron Bertrand also includes a disclaimer as follows:
"Before I start to show how my proposed solution would work, let me be absolutely clear that this solution should not be used in every single case where LIKE '%wildcard%' searches are slow. Because of the way we're going to "explode" the source data into fragments, it is likely limited in practicality to smaller strings, such as addresses or names, as opposed to larger strings, like product descriptions or session abstracts."
test setup: http://rextester.com/IIMT54026
Client table
create table dbo.Client (
ClientId int not null primary key clustered
, FirstName varchar(50) not null
, LastName varchar(50) not null
);
insert into dbo.Client (ClientId, FirstName, LastName) values
(1, 'James','')
, (2, 'Aaron','Bertrand')
go
Function used by Aaron Bertrand to explode string fragments (modified for input size):
create function dbo.CreateStringFragments(@input varchar(101))
returns table with schemabinding
as return
(
with x(x) as (
select 1 union all select x+1 from x where x < (len(@input))
)
select Fragment = substring(@input, x, len(@input)) from x
);
go
Table to store fragments for FirstName + ' ' + LastName:
create table dbo.Client_NameFragments (
ClientId int not null
, Fragment varchar(101) not null
, constraint fk_ClientsNameFragments_Client
foreign key(ClientId) references dbo.Client
on delete cascade
);
create clustered index s_cat on dbo.Client_NameFragments(Fragment, ClientId);
go
Loading the table with fragments:
insert into dbo.Client_NameFragments (ClientId, Fragment)
select c.ClientId, f.Fragment
from dbo.Client as c
cross apply dbo.CreateStringFragments(FirstName + ' ' + LastName) as f;
go
Creating trigger to maintain fragments:
create trigger dbo.Client_MaintainFragments
on dbo.Client
for insert, update as
begin
set nocount on;
delete f from dbo.Client_NameFragments as f
inner join deleted as d
on f.ClientId = d.ClientId;
insert dbo.Client_NameFragments(ClientId, Fragment)
select i.ClientId, fn.Fragment
from inserted as i
cross apply dbo.CreateStringFragments(i.FirstName + ' ' + i.LastName) as fn;
end
go
Quick trigger tests:
/* trigger tests --*/
insert into dbo.Client (ClientId, FirstName, LastName) values
(3, 'Sql', 'Zim')
update dbo.Client set LastName = 'unknown' where LastName = '';
delete dbo.Client where ClientId = 3;
--select * from dbo.Client_NameFragments order by ClientId, len(Fragment) desc
/* -- */
go
New Procedure:
create procedure [dbo].[Client_getNameList] @Name varchar(100) as
begin
set nocount on;
select
ClientId
, Name = FirstName + ' ' + LastName
from Client c
where exists (
select 1
from dbo.Client_NameFragments f
where f.ClientId = c.ClientId
and f.Fragment like @Name + '%'
)
end
go
exec [dbo].[Client_getNameList] @Name = 'On Bert'
returns:
+----------+----------------+
| ClientId | Name |
+----------+----------------+
| 2 | Aaron Bertrand |
+----------+----------------+
A search on a concatenated column sometimes cannot use indexes. I ran into a situation like the above and replaced the concatenated search with OR conditions, which gave me better performance most of the time.
Create nonclustered indexes on FirstName and LastName if they are not already present.
Check the performance after modifying the above procedure as below:
CREATE PROCEDURE [dbo].[spGetClientNameList]
@Name varchar(100)
AS
BEGIN
SET NOCOUNT ON;
SELECT
ClientID,
FirstName + ' ' + LastName as Name
FROM
Client
WHERE FirstName LIKE '%' + @Name + '%'
OR LastName LIKE '%' + @Name + '%'
END
And do check the execution plans to verify whether those indexes are used.
The problem really comes down to having to compute the column (concatenating the first name and last name); that pretty much forces SQL Server into doing a full scan of the table to determine what is a match and what isn't. If you're not allowed to add indexes or alter the table, you'll have to change the query around (supply FirstName and LastName separately). If you are, you could add a computed column and index that:
Create Table client (
ClientId INT NOT NULL PRIMARY KEY IDENTITY(1,1)
,FirstName VARCHAR(100)
,LastName VARCHAR(100)
,FullName AS FirstName + ' ' + LastName
)
Create index FullName ON Client(FullName)
This will at least speed your query up by doing index seeks instead of full table scans. Is it worth it? It's difficult to say without looking at how much data there is, etc.
where a.Name like '%' + @Name + '%'
This statement can never use an index. In this situation it's better to use Full-Text Search.
If you can restrict your LIKE to
where a.Name like @Name + '%'
it will use an index automatically. Moreover, you can use the REVERSE() function to index a search like:
where a.Name like '%' + @Name
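A minimal sketch of that REVERSE() idea, assuming a persisted computed column is acceptable (the column and index names below are only illustrative):
ALTER TABLE dbo.Client
ADD FullNameReversed AS REVERSE(FirstName + ' ' + LastName) PERSISTED;
CREATE INDEX IX_Client_FullNameReversed ON dbo.Client (FullNameReversed);
-- A suffix search becomes a prefix search on the reversed column, which can seek the index
DECLARE @Name varchar(100) = 'Bertrand';
SELECT ClientId, FirstName + ' ' + LastName AS Name
FROM dbo.Client
WHERE FullNameReversed LIKE REVERSE(@Name) + '%';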

Stored procedure with inner join using coalesce

I have a simple table tblAllUsers which stores simple values like Name, Date of Birth, etc. for a UserId.
Another table tblInterest stores the interest(s) of a UserId. A user may have any number of interests, and they are stored separately in separate rows:
Create table tblInterest
(
Id int primary key identity,
UserId varchar(10),
InterestId int,
Interest varchar(20)
)
So when I want to display the set of interests for a particular user together, I use the query below:
DECLARE @listStr VARCHAR(MAX)
SELECT @listStr = COALESCE(@listStr + ', ' ,'') + Interest FROM tblInterest where UserId=@UserId
SELECT @listStr
Now, I want to display a user's info from both these tables, with the interest(s) displayed in ONE string.
I have tried the below:
Create proc spPlayersGridview
@listStr VARCHAR(MAX)
as
begin
Select tblAllUsers.Category, tblAllUsers.DOB, tblAllUsers.FirstName, tblAllUsers.LastName, tblAllUsers.City, tblAllUsers.State,
@listStr = COALESCE(@listStr + ', ' ,'') + tblInterest.Interest
from tblAllUsers
INNER JOIN tblInterest
ON tblAllUsers.UserId=tblInterest.UserId
where Category='Player'
end
This throws the exception "A SELECT statement that assigns a value to a variable must not be combined with data-retrieval operations."
I had a similar problem a while back, and a bit of SQL STUFF magic helps - Maybe it will work for you as well.
CREATE PROC spPlayersGridview
AS
BEGIN
SELECT
tblAllUsers.Category
, tblAllUsers.DOB
, tblAllUsers.FirstName
, tblAllUsers.LastName
, tblAllUsers.City
, tblAllUsers.State
, listStr = STUFF((
SELECT ',' + tblInterest.Interest
FROM tblInterest
WHERE tblAllUsers.UserId=tblInterest.UserId
ORDER BY tblInterest.Interest
FOR XML PATH(''), TYPE).value('.', 'NVARCHAR(MAX)'), 1, 1, '')
FROM tblAllUsers
WHERE Category='Player'
END
Hope it helps - For more reading look at: https://msdn.microsoft.com/en-us/library/ms188043.aspx
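As a side note, on SQL Server 2017 and later the same comma-separated list can be built more simply with STRING_AGG; a minimal sketch under that assumption:
SELECT
tblAllUsers.Category
, tblAllUsers.DOB
, tblAllUsers.FirstName
, tblAllUsers.LastName
, tblAllUsers.City
, tblAllUsers.State
, listStr = (SELECT STRING_AGG(i.Interest, ', ') WITHIN GROUP (ORDER BY i.Interest)
             FROM tblInterest i
             WHERE i.UserId = tblAllUsers.UserId)
FROM tblAllUsers
WHERE Category = 'Player'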

T-SQL - Merge all columns from source to target table w/o listing all the columns

I'm trying to merge a very wide table from a source (linked Oracle server) to a target table (SQL Server 2012) w/o listing all the columns. Both tables are identical except for the records in them.
This is what I have been using:
TRUNCATE TABLE TargetTable
INSERT INTO TargetTable
SELECT *
FROM SourceTable
When/if I get this working I would like to make it a procedure so that I can pass into it the source, target and match key(s) needed to make the update. For now I would just love to get it to work at all.
USE ThisDatabase
GO
DECLARE
@Columns VARCHAR(4000) = (
SELECT COLUMN_NAME + ','
FROM INFORMATION_SCHEMA.COLUMNS
WHERE TABLE_NAME = 'TargetTable'
FOR XML PATH('')
)
MERGE TargetTable AS T
USING (SELECT * FROM SourceTable) AS S
ON (T.ID = S.ID AND T.ROWVERSION = S.ROWVERSION)
WHEN MATCHED THEN
UPDATE SET @Columns = S.@Columns
WHEN NOT MATCHED THEN
INSERT (@Columns)
VALUES (S.@Columns)
Please excuse my noob-ness. I feel like I'm only half way there, but I don't understand some parts of SQL well enough to put it all together. Many thanks.
As previously mentioned in the answers, if you don't want to specify the columns, then you have to write a dynamic query.
Something like this in your case should help:
DECLARE
@Columns VARCHAR(4000) = (
SELECT COLUMN_NAME + ','
FROM INFORMATION_SCHEMA.COLUMNS
WHERE TABLE_NAME = 'TargetTable'
FOR XML PATH('')
)
DECLARE @MergeQuery NVARCHAR(MAX)
DECLARE @UpdateQuery VARCHAR(MAX)
DECLARE @InsertQuery VARCHAR(MAX)
DECLARE @InsertQueryValues VARCHAR(MAX)
DECLARE @Col VARCHAR(200)
SET @UpdateQuery='Update Set '
SET @InsertQuery='Insert ('
SET @InsertQueryValues=' Values('
WHILE LEN(@Columns) > 0
BEGIN
SET @Col=left(@Columns, charindex(',', @Columns+',')-1);
IF @Col <> 'ID' AND @Col <> 'ROWVERSION'
BEGIN
SET @UpdateQuery= @UpdateQuery+ 'TargetTable.'+ @Col + ' = SourceTable.'+ @Col+ ','
SET @InsertQuery= @InsertQuery+@Col + ','
SET @InsertQueryValues=@InsertQueryValues+'SourceTable.'+ @Col+ ','
END
SET @Columns = stuff(@Columns, 1, charindex(',', @Columns+','), '')
END
SET @UpdateQuery=LEFT(@UpdateQuery, LEN(@UpdateQuery) - 1)
SET @InsertQuery=LEFT(@InsertQuery, LEN(@InsertQuery) - 1)
SET @InsertQueryValues=LEFT(@InsertQueryValues, LEN(@InsertQueryValues) - 1)
SET @InsertQuery=@InsertQuery+ ')'+ @InsertQueryValues +')'
SET @MergeQuery=
N'MERGE TargetTable
USING SourceTable
ON TargetTable.ID = SourceTable.ID AND TargetTable.ROWVERSION = SourceTable.ROWVERSION ' +
'WHEN MATCHED THEN ' + @UpdateQuery +
' WHEN NOT MATCHED THEN '+@InsertQuery +';'
Execute sp_executesql @MergeQuery
If you want more information about MERGE, you could read this excellent article.
Don't feel bad. It takes time. Merge has interesting syntax. I've actually never used it. I read Microsoft's documentation on it, which is very helpful and even has examples. I think I covered everything. I think there may be a slight amount of tweaking you might have to do, but I think it should work.
Here's the documentation for MERGE:
https://msdn.microsoft.com/en-us/library/bb510625.aspx
As for your code, I commented pretty much everything to explain it and show you how to do it.
This part is to help write your merge statement
USE ThisDatabase --This says which database context to use.
--Pretty much which database you're querying.
--Like this: database.schema.objectName
GO
DECLARE
@SetColumns VARCHAR(4000) = (
SELECT CONCAT(QUOTENAME(COLUMN_NAME),' = S.',QUOTENAME(COLUMN_NAME),',',CHAR(10)) --CONCAT just concatenates these values. It adds the strings together.
--QUOTENAME adds brackets around the column names
--CHAR(10) is a line break for formatting purposes (totally optional)
FROM INFORMATION_SCHEMA.COLUMNS
WHERE TABLE_NAME = 'TargetTable'
FOR XML PATH('')
) --This uses the FOR XML trick to get your columns concatenated into one row.
--What is really in your table is a column of your column names in different rows.
--BTW if the column names in both tables are identical, then this will work.
DECLARE @Columns VARCHAR(4000) = (
SELECT QUOTENAME(COLUMN_NAME) + ','
FROM INFORMATION_SCHEMA.COLUMNS
WHERE TABLE_NAME = 'TargetTable'
FOR XML PATH('')
)
SET @Columns = SUBSTRING(@Columns,0,LEN(@Columns)) --this gets rid of the comma at the end of your list
SET @SetColumns = SUBSTRING(@SetColumns,0,LEN(@SetColumns)) --same thing here
SELECT @SetColumns --You're going to want to copy and paste this into your WHEN MATCHED statement
SELECT @Columns --You're going to want to copy this into your WHEN NOT MATCHED statement
GO
Merge Statement
Especially look at my notes on ROWVERSION.
MERGE INTO TargetTable AS T
USING SourceTable AS S --Don't really need to write SELECT * FROM since you need the whole table anyway
ON (T.ID = S.ID AND T.[ROWVERSION] = S.[ROWVERSION]) --These are your matching parameters
--One note on this: if ROWVERSION marks different versions of the same data, you don't want to include ROWVERSION here
--Let's say you have ID 1 ROWVERSION 2 in your source but only version 1 in your target table
--If you leave T.ID = S.ID AND T.ROWVERSION = S.ROWVERSION, then it will insert the new ROWVERSION
--So you'll have two versions of ID 1
WHEN MATCHED THEN --When TargetTable ID and ROWVERSION match the matching parameters
--Update the values in the TargetTable
UPDATE SET /*Copy and paste @SetColumns here*/
--Should look like this (minus the "--"):
--Col1 = S.Col1,
--Col2 = S.Col2,
--Col3 = S.Col3,
--Etc...
WHEN NOT MATCHED THEN --This says okay, there are no rows with the existing ID, now insert a new row
INSERT (col1,col2,col3) --Copy and paste @Columns in between the parentheses. Should look like I show it. Note: this is an insert into the target table, so you're listing the target table columns
VALUES (col1,col2,col3) --Same thing here. This is the list of source table columns

TSQL - A join using full-text CONTAINS

I currently have the following select statement, but I wish to move to full text search on the Keywords column. How would I re-write this to use CONTAINS?
SELECT MediaID, 50 AS Weighting
FROM Media m JOIN @words w ON m.Keywords LIKE '%' + w.Word + '%'
@words is a table variable filled with words I wish to look for:
DECLARE @words TABLE(Word NVARCHAR(512) NOT NULL);
If you are not against using a temp table, and EXEC (and I realize that is a big if), you could do the following:
DECLARE @KeywordList VARCHAR(MAX), @KeywordQuery VARCHAR(MAX)
SELECT @KeywordList = STUFF ((
SELECT '"' + Keyword + '" OR '
FROM FTS_Keywords
FOR XML PATH('')
), 1, 0, '')
SELECT @KeywordList = SUBSTRING(@KeywordList, 0, LEN(@KeywordList) - 2)
SELECT @KeywordQuery = 'SELECT RecordID, Document FROM FTS_Demo_2 WHERE CONTAINS(Document, ''' + @KeywordList +''')'
--SELECT @KeywordList, @KeywordQuery
CREATE TABLE #Results (RecordID INT, Document NVARCHAR(MAX))
INSERT INTO #Results (RecordID, Document)
EXEC(@KeywordQuery)
SELECT * FROM #Results
DROP TABLE #Results
This would generate a query like:
SELECT RecordID
,Document
FROM FTS_Demo_2
WHERE CONTAINS(Document, '"red" OR "green" OR "blue"')
And results like this:
RecordID Document
1 one two blue
2 three red five
If CONTAINS allowed a column (a per-row value from the joined table) as the search condition, you could have used something like this:
SELECT MediaID, 50 AS Weighting
FROM Media m
JOIN @words w ON CONTAINS(m.Keywords, w.word)
However, according to Books Online for SQL Server CONTAINS, that is not supported. Therefore, no, there is no way to do it as a direct join.
Ref: (column_name appears only in the first param to CONTAINS)
CONTAINS
( { column_name | ( column_list ) | * }
,'<contains_search_condition>'
[ , LANGUAGE language_term ]
)
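That said, the whole search condition can be supplied through a single variable (it is only a per-row column reference that is disallowed), so the dynamic EXEC above can often be avoided. A minimal sketch, assuming the @words table variable from the question:
DECLARE @KeywordList NVARCHAR(4000);
SELECT @KeywordList = STUFF((
SELECT ' OR "' + Word + '"'
FROM @words
FOR XML PATH('')
), 1, 4, '');  -- strip the leading ' OR '
SELECT MediaID, 50 AS Weighting
FROM Media m
WHERE CONTAINS(m.Keywords, @KeywordList);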
