I want to search a table's varchar column for content in another table's varchar column.
Certain words are banned and I want to identify the rows that have the banned words. I want an EXACT match on the banned word.
I'm using MS SQL Server 2016.
Table 1:
CREATE TABLE [dbo].[BlogComment](
[BlogCommentId] [int] IDENTITY(1,1) NOT NULL,
[BlogCommentContent] [varchar](max) NOT NULL,
CONSTRAINT [PK_BlogComment] PRIMARY KEY CLUSTERED
(
[BlogCommentId] ASC
)WITH (PAD_INDEX = OFF, STATISTICS_NORECOMPUTE = OFF, IGNORE_DUP_KEY = OFF, ALLOW_ROW_LOCKS =
ON, ALLOW_PAGE_LOCKS = ON) ON [PRIMARY]
) ON [PRIMARY] TEXTIMAGE_ON [PRIMARY]
3 rows - and the data in BlogCommentContent:
There are many of us.
This is the man.
I hear you.
Table 2:
CREATE TABLE [dbo].[BannedWords](
[BannedWordsId] [int] IDENTITY(1,1) NOT NULL,
[Description] [varchar](250) NOT NULL
CONSTRAINT [PK_BannedWords] PRIMARY KEY CLUSTERED
(
[BannedWordsId] ASC
)WITH (PAD_INDEX = OFF, STATISTICS_NORECOMPUTE = OFF, IGNORE_DUP_KEY = OFF, ALLOW_ROW_LOCKS =
ON, ALLOW_PAGE_LOCKS = ON) ON [PRIMARY]
) ON [PRIMARY]
3 rows - and the data in Description:
though
man
hear
My Sql:
SELECT BlogCommentContent
FROM dbo.BlogComment,
dbo.BannedWords
WHERE ( CHARINDEX( [Description], BlogCommentContent, 1 ) ) > 1
It's finding 'man', 'hear' and 'man' in the word 'many'.
So it returns 3 rows.
I only WANT EXACT matches.
So only return 2 rows.
How do I accomplish this?
Please try the following solution.
It will work starting from SQL Server 2017 onwards.
SQL
-- DDL and sample data population, start
DECLARE #BlogComment TABLE (
BlogCommentId INT IDENTITY PRIMARY KEY,
BlogCommentContent VARCHAR(MAX));
INSERT INTO #BlogComment (BlogCommentContent) VALUES
('There are many of us.'),
('This is the man.'),
('I hear you.');
DECLARE #BannedWords TABLE (
BannedWordsId INT IDENTITY PRIMARY KEY,
Word varchar(250))
INSERT INTO #BannedWords (Word) VALUES
('though'),
('man'),
('hear');
-- DDL and sample data population, end
;WITH rs AS
(
SELECT word = TRIM('.,' FROM value )
FROM #BlogComment
CROSS APPLY STRING_SPLIT(BlogCommentContent, SPACE(1))
)
SELECT DISTINCT bw.Word
FROM rs
INNER JOIN #BannedWords bw ON rs.word = bw.Word;
SQL Server 2016
;WITH rs AS
(
--SELECT word = TRIM('.,' FROM [value])
SELECT word = REPLACE(REPLACE([value],'.',''),',','')
FROM #BlogComment
CROSS APPLY STRING_SPLIT(BlogCommentContent, SPACE(1))
)
SELECT DISTINCT bw.Word
FROM rs
INNER JOIN #BannedWords bw ON rs.word = bw.Word;
Output
+------+
| Word |
+------+
| man |
| hear |
+------+
If you want exact matches what you mean is that there must not be another work touching so man can match "man", "man and women", "mice and men", "a man or two". You need to check
BlogCommentContent = 'man'
left(BlogCommentContent,3) = 'man'
right(BlogCommentContent,3) = 'man'
BlogCommentContent like ' man '
the length of man can be found with len('man') to be used in right() and left().
The last value, for like, can be constructed with concat(' ','man',' ')
I have a view which comprises 4 yearly tables:
SET ANSI_NULLS ON
GO
SET QUOTED_IDENTIFIER ON
GO
CREATE VIEW [dbo].[BGT_BETWAYDETAILS]
WITH SCHEMABINDING
AS
SELECT [bwd_BetTicketNr] ,
[bwd_LineID] [int] ,
[bwd_ResultID] [bigint] NOT NULL,
[bwd_DateModified] ,
[bwd_DateModifiedTrunc] ,
[bwd_LineMaxPayout]
FROM [dbo].[BGT_BETWAYDETAILS_2020]
UNION ALL
SELECT [bwd_BetTicketNr] ,
[bwd_LineID] [int] ,
[bwd_DateModified] ,
[bwd_DateModifiedTrunc] ,
[bwd_LineMaxPayout]
FROM [dbo].[BGT_BETWAYDETAILS_2019]
UNION ALL
SELECT [bwd_BetTicketNr] ,
[bwd_LineID] [int] ,
[bwd_DateModified] ,
[bwd_DateModifiedTrunc] ,
[bwd_LineMaxPayout]
FROM [dbo].[BGT_BETWAYDETAILS_2018]
UNION ALL
SELECT [bwd_BetTicketNr] ,
[bwd_LineID] [int] ,
[bwd_DateModified] ,
[bwd_DateModifiedTrunc] ,
[bwd_LineMaxPayout]
FROM [dbo].[BGT_BETWAYDETAILS_2017];
GO
Each table has the following structure:
CREATE TABLE [dbo].[BGT_BETWAYDETAILS_2020]
(
[bwd_BetTicketNr] [bigint] NOT NULL,
[bwd_LineID] [int] NOT NULL,
[bwd_ResultID] [bigint] NOT NULL,
[bwd_DateModified] [datetime] NULL,
[bwd_DateModifiedTrunc] [date] NULL,
[bwd_LineMaxPayout] [decimal](18, 4) NULL,
CONSTRAINT [CSTR__BGT_BETWAYDETAILS_2020_CKEY]
PRIMARY KEY CLUSTERED ([bwd_BetTicketNr] ASC, [bwd_LineID] ASC, [bwd_ResultID] ASC)
WITH (PAD_INDEX = OFF, STATISTICS_NORECOMPUTE = OFF,
IGNORE_DUP_KEY = OFF, ALLOW_ROW_LOCKS = ON,
ALLOW_PAGE_LOCKS = ON) ON [PRIMARY]
) ON [PRIMARY]
GO
I have added an non-clustered index on
CREATE NONCLUSTERED INDEX [NCI__DATEMODIFIED]
ON [dbo].[BGT_BETWAYDETAILS_2020] ([bwd_DateModifiedTrunc] ASC)
WITH (PAD_INDEX = OFF, STATISTICS_NORECOMPUTE = OFF,
SORT_IN_TEMPDB = OFF, DROP_EXISTING = OFF,
ONLINE = OFF, ALLOW_ROW_LOCKS = ON, ALLOW_PAGE_LOCKS = ON) ON [PRIMARY]
I am running the following 3 queries:
SELECT COALESCE(MAX([bwd_DateModifiedTrunc]), '2019-01-01') AS next_date
FROM [dbo].[BGT_BETWAYDETAILS_2020]
SELECT COALESCE(MAX([bwd_DateModifiedTrunc]), '2019-01-01') AS next_date
FROM [dbo].[BGT_BETWAYDETAILS]
SELECT COALESCE(CAST(MAX([bwd_DateModified]) AS date), '2019-01-01') AS next_date
FROM [dbo].[BGT_BETWAYDETAILS]
The first one, when run on each yearly table, runs instantly.
The second one, seems to take forever. The query plan for this, seems very strange.
The plan shows two index scans on each yearly table.
The plan for each yearly table is what I expected to see:
Finally, the plan on the non-indexed date column is also what I expected to see (a clustered index scan). A clustered index scan on each table. This query runs in ~3mins which is expected.
What is the issue here? Some anti-pattern I am missing? Why the index scan on the non-clustered index is done 2 times according to the live plan? I expected the view to respond as fast as the individual tables.
For the record, I am running this on SQL Server 2017.
This just looks like an optimiser limitation. I have submitted a suggestion that this should be improved.
A simpler example is
CREATE TABLE T1(X INT NULL UNIQUE CLUSTERED);
CREATE TABLE T2(X INT NULL UNIQUE CLUSTERED);
INSERT INTO T1
OUTPUT INSERTED.X INTO T2
SELECT TOP 100000 NULLIF(ROW_NUMBER() OVER (ORDER BY 1/0),1)
FROM sys.all_objects o1,
sys.all_objects o2;
And then
WITH CTE AS
(
SELECT X FROM T1
UNION ALL
SELECT X FROM T2
)
SELECT MAX(X)
FROM CTE
OPTION (QUERYRULEOFF ScalarGbAggToTop)
This disables the query optimizer rule ScalarGbAggToTop and the query plan does a MAX on each individual table then computes a MAX of the MAX-es - so the same as
SELECT MAX(MaxX)
FROM
(
SELECT MAX(X) AS MaxX FROM T1
UNION ALL
SELECT MAX(X) AS MaxX FROM T1
) T
With the ScalarGbAggToTop rule enabled the plan now looks like this
It is effectively doing the following...
SELECT MAX(MaxX)
FROM (SELECT MAX(X) AS MaxX
FROM (SELECT TOP 1 X
FROM T1
WHERE X IS NULL
UNION ALL
SELECT TOP 1 X
FROM T1
WHERE X IS NOT NULL
ORDER BY X DESC) T1
UNION ALL
SELECT MAX(X) AS MaxX
FROM (SELECT TOP 1 X
FROM T2
WHERE X IS NULL
UNION ALL
SELECT TOP 1 X
FROM T2
WHERE X IS NOT NULL
ORDER BY X DESC) T2) T0
... but in a very inefficient way. Running the SQL above would give a plan with seeks and each branch only reading a single row.
The plan produced by ScalarGbAggToTop only has minimal changes to the stream aggregate plan. It looks like it takes the scan from that and applies a backwards ordering to it and then uses the backwards ordering for both the NOT NULL and NULL branches. And does not perform any additional exploration to see if there is a more efficient access path.
This means that in the pathological case that all of the rows are either NULL or NOT NULL one of the scans will end up reading all of the rows in the table (5 billion in your case if applicable to all 4 tables). Even if there is a mix of NULL and NOT NULL the fact that the IS NULL branch is doing a backwards scan is sub optimal because NULL is ordered first in SQL Server so would be at the beginning of the index.
The addition of a NOT NULL branch in the first place seems largely unnecessary as the query would return the same results without it. I imagine it is only needed so that it knows whether or not to display the message
Warning: Null value is eliminated by an aggregate or other SET
operation.
but I doubt you care about that. In which case adding an explicit WHERE ... NOT NULL resolves the issue.
WITH CTE AS
(
SELECT X FROM T1
UNION ALL
SELECT X FROM T2
)
SELECT MAX(X)
FROM CTE
WHERE X IS NOT NULL
;
It now has a seek into the NOT NULL part of the index and reads backwards (stopping after the first row is read from each table)
My INSERT statement fails while it is trying to add a new record into an empty table (Attribute) (no record yet).
I am surprised by the error raised by the system:
Violation of UNIQUE KEY constraint 'CK_Attribute_Name_IDproject'. Cannot insert duplicate key in object 'dbo.Attribute'. The duplicate key value is (dummy, 55).
The creation script for this table looks like
CREATE TABLE [dbo].[Attribute](
[ID] [int] IDENTITY(1,1) NOT NULL,
[IDproject] [int] NOT NULL,
[IDtype] [int] NOT NULL,
[IDgroup] [int] NOT NULL,
[name] [varchar](50) NOT NULL,
[color] [int] NULL,
[protected] [tinyint] NULL,
[datemodified] [datetime] NOT NULL,
PRIMARY KEY CLUSTERED
(
[ID] ASC
)WITH (PAD_INDEX = OFF, STATISTICS_NORECOMPUTE = OFF, IGNORE_DUP_KEY = OFF, ALLOW_ROW_LOCKS = ON, ALLOW_PAGE_LOCKS = ON) ON [PRIMARY],
CONSTRAINT [CK_Attribute_Name_IDproject] UNIQUE NONCLUSTERED
(
[name] ASC,
[IDproject] ASC
)WITH (PAD_INDEX = OFF, STATISTICS_NORECOMPUTE = OFF, IGNORE_DUP_KEY = OFF, ALLOW_ROW_LOCKS = ON, ALLOW_PAGE_LOCKS = ON) ON [PRIMARY]
) ON [PRIMARY]
GO
I skiped foreign keys references and default values which does not seem of interest in this context.
The UNIQUE constraint applies to [name] and [IDproject].
When running the following statement
SELECT *
FROM [dbo].[Attribute]
GO
SELECT *
FROM [dbo].[Project]
GO
I get the results
(0 row(s) affected)
(2 row(s) affected)
The first result indicats the Attribute Table is empty
The second that there are 2 Projects
then running the following INSERT in table Attribute it failed with the above mentioned UNIQUE CONSTRAINT error
INSERT INTO [dbo].[Attribute] ([IDproject], [name], [IDtype], [IDgroup], [color], [protected], [datemodified])
SELECT DISTINCT
p.[ID],'dummy',t.[ID],g.[ID],-1,0,getdate()
FROM [dbo].[Project] p
INNER JOIN [dbo].[Group] g ON g.[name]='none' AND g.[IDproject] = p.[ID]
INNER JOIN [dbo].[AttributeType] t ON t.[format]='text' AND g.[IDproject] = p.[ID]
WHERE p.[name]='TESTPROJ'
GO
How can i get such an error on an empty table ?
I have found the solution myself: the derived SELECT returns 2 records with 'dummy' due to a duplicate INTO one of table, AttributeType, with which INNER JOIN is performed.
I'm looking to retrieve values from a sub-select query on a rowset where I only want values from the current row of the main query (SQL Server versions 2005-2012 have shown this same behavior). I've written a sub-select query which is returning multiple rows (HOW, I'm matching on the primary KEY?!)
The following example code illustrates what I'm trying to accomplish:
CREATE TABLE [dbo].[TestTable]
(
[TestID] [int] IDENTITY(1,1) NOT NULL,
[TestValue] [nvarchar](255) NULL,
[TestValue2] [nvarchar](255) NULL,
[TestValue3] [nvarchar](255) NULL,
[TestValue4] [nvarchar](255) NULL,
PRIMARY KEY CLUSTERED ([TestID] ASC)
WITH (PAD_INDEX = OFF, STATISTICS_NORECOMPUTE = OFF, IGNORE_DUP_KEY = OFF,
ALLOW_ROW_LOCKS = ON, ALLOW_PAGE_LOCKS = ON) ON [PRIMARY]
) ON [PRIMARY]
GO
INSERT INTO [TESTDB].[dbo].[TestTable] ([TestValue], [TestValue2], [TestValue3],[TestValue4])
VALUES('01234', '56789', '98765', '43210')
GO
INSERT INTO [TESTDB].[dbo].[TestTable] ([TestValue], [TestValue2], [TestValue3],[TestValue4])
VALUES('01234', '98765', '56789', '43210')
GO
INSERT INTO [TESTDB].[dbo].[TestTable] ([TestValue], [TestValue2], [TestValue3],[TestValue4])
VALUES('01234', '43210', '56789' ,'98765')
GO
INSERT INTO [TESTDB].[dbo].[TestTable] ([TestValue], [TestValue2], [TestValue3],[TestValue4])
VALUES('01234', '98765', '43210', '56789')
GO
SELECT TOP 1000
[TestID]
,[TestValue]
,[TestValue2]
,[TestValue3]
,(SELECT TestValue + TestValue2 AS CompositeValue
FROM [TESTDB].[dbo].TestTable AS foo
WHERE foo.TestID = TestID)
FROM
[TESTDB].[dbo].[TestTable]
Error being returned is:
Msg 512, Level 16, State 1, Line 2
Subquery returned more than 1 value. This is not permitted when the subquery follows =, !=, <, <= , >, >= or when the subquery is used as an expression.
If you can offer an alternate way of performing this query - i.e. using ROW_NUMBER or some other method (without performing a select into a temporary table, and without declaring individual variables).
Thanks in advance!
Try to alias the base table
SELECT TOP 1000 [TestID]
,[TestValue]
,[TestValue2]
,[TestValue3]
,(SELECT foo.TestValue + foo.TestValue2 FROM [TESTDB].[dbo].TestTable AS foo WHERE foo.TestID=B.TestID) AS CompositeValue
FROM [TESTDB].[dbo].[TestTable] AS B
SCENARIO
I need to select records from test_userData based on a 1-to-1 match from test_userCheck on the columns customer or account_info. The code below will create a mock-up of the tables and will populate with random data for the purpose of my question. Based on this code, it's looking for any records where test_userData.customer = 'Guerrero, Unity' or test_userData.account_info = 'XXXXXXXXXXXXXXXX0821', and should return three rows (confirmation_id = 6836985, 5502798, and 3046441)
PROBLEM
As it stands, the query returns what I need... however, my real userData table has almost 2 million records, and the userCheck table has about 10,000. The query takes about 7 seconds as it is and I think that's way too long. I'm also worried because the userData table will start to grow quickly (by tens of thousands of unique records a day), and I envision my current method becoming unmanageable.
QUESTION
Any ideas on how I can optimize this to scale with millions of records? The data resides on a shared SQL 2008 server with limited permissions.
--setup temporary testing tables
IF EXISTS
(
SELECT * FROM dbo.sysobjects
WHERE id = object_id(N'[dbo].[test_userData]')
AND OBJECTPROPERTY(id, N'IsUserTable') = 1
)
DROP TABLE [dbo].[test_userData]
GO
IF EXISTS
(
SELECT * FROM dbo.sysobjects
WHERE id = object_id(N'[dbo].[test_userCheck]')
AND OBJECTPROPERTY(id, N'IsUserTable') = 1
)
DROP TABLE [dbo].[test_userCheck]
GO
CREATE TABLE [dbo].[test_userData](
[id] [int] IDENTITY(1,1) NOT NULL,
[merchant_id] [int] NOT NULL,
[sales_date] [datetime] NOT NULL,
[confirmation_id] [int] NOT NULL,
[customer] [nvarchar](max) NOT NULL,
[total] [smallmoney] NOT NULL,
[account_info] [nvarchar](max) NOT NULL,
[email_address] [nvarchar](max) NOT NULL
CONSTRAINT [PK_test_userData] PRIMARY KEY CLUSTERED
(
[id] ASC
)WITH (PAD_INDEX = OFF, STATISTICS_NORECOMPUTE = OFF, IGNORE_DUP_KEY = OFF, ALLOW_ROW_LOCKS = ON, ALLOW_PAGE_LOCKS = ON) ON [PRIMARY]
) ON [PRIMARY]
GO
CREATE TABLE [dbo].[test_userCheck](
[confirmation_id] [int] NOT NULL,
[customer] [nvarchar](max) NOT NULL,
[total] [smallmoney] NOT NULL,
[account_info] [nvarchar](max) NOT NULL
CONSTRAINT [PK_test_userCheck] PRIMARY KEY CLUSTERED
(
[confirmation_id] ASC
)WITH (PAD_INDEX = OFF, STATISTICS_NORECOMPUTE = OFF, IGNORE_DUP_KEY = OFF, ALLOW_ROW_LOCKS = ON, ALLOW_PAGE_LOCKS = ON) ON [PRIMARY]
) ON [PRIMARY]
GO
--insert some random user transactions
INSERT INTO [dbo].[test_userData] (merchant_id,sales_date,confirmation_id,customer,total,account_info,email_address) VALUES
('99','03/25/2010','3361424','Soto, Ahmed','936','XXXXXXXXXXXXXXXX8744','Donec.egestas#NullainterdumCurabitur.ca'),
('17','09/12/2010','6710165','Holcomb, Eden','1022','XXXXXXXXXXXXXXXX6367','Curabitur#dolortempus.org'),
('32','05/04/2010','4489509','Foster, Nasim','1463','XXXXXXXXXXXXXXXX7115','augue.eu.tellus#ullamcorperviverraMaecenas.ca'),
('95','01/02/2011','5384061','Browning, Owen','523','XXXXXXXXXXXXXXXX0576','sed.dictum.eleifend#accumsaninterdum.edu'),
('91','08/21/2010','6075234','Dawson, McKenzie','141','XXXXXXXXXXXXXXXX3580','dolor.sit.amet#etmagnis.org'),
('63','01/29/2010','1055619','Mathews, Keefe','1110','XXXXXXXXXXXXXXXX2682','ligula#Sednuncest.edu'),
('27','10/20/2010','1819662','Clarke, Briar','1474','XXXXXXXXXXXXXXXX7481','Donec.non.justo#malesuada.org'),
('82','03/05/2010','3184936','Holman, Dana','560','XXXXXXXXXXXXXXXX7080','Aenean.eget.magna#accumsan.edu'),
('24','06/11/2010','1007427','Kirk, Desiree','206','XXXXXXXXXXXXXXXX3681','parturient#at.com'),
('49','06/17/2010','6137066','Foley, Sopoline','1831','XXXXXXXXXXXXXXXX1718','ac.urna.Ut#pellentesqueafacilisis.org'),
('22','05/08/2010','3545367','Howell, Uriel','638','XXXXXXXXXXXXXXXX1945','ad.litora#arcuvelquam.ca'),
('5','10/25/2010','6836985','Little, Caryn','743','XXXXXXXXXXXXXXXX0821','Suspendisse.aliquet#auctor.org'),
('91','06/16/2010','6852582','Buckner, Chiquita','99','XXXXXXXXXXXXXXXX1533','tellus.sem#semvitaealiquam.edu'),
('63','06/12/2010','7930230','Nolan, Wyoming','1192','XXXXXXXXXXXXXXXX1291','Sed#diam.org'),
('32','02/01/2010','8407102','Cummings, Deacon','1315','XXXXXXXXXXXXXXXX4375','a.odio.semper#massaSuspendisseeleifend.ca'),
('75','06/29/2010','5502798','Guerrero, Unity','858','XXXXXXXXXXXXXXXX8000','eget#lectus.edu'),
('50','09/13/2010','8312525','Russo, Yvette','1680','XXXXXXXXXXXXXXXX2046','In.mi#eu.com'),
('11','04/13/2010','6204132','Small, Calista','426','XXXXXXXXXXXXXXXX0269','lacus#Cumsociisnatoque.org'),
('16','01/01/2011','7522507','Mosley, Thor','1459','XXXXXXXXXXXXXXXX8451','netus.et#Pellentesqueutipsum.com'),
('5','01/27/2010','1472120','Case, Kiona','1419','XXXXXXXXXXXXXXXX7097','Duis#duilectusrutrum.edu'),
('70','02/17/2010','1095935','Snyder, Tanner','1655','XXXXXXXXXXXXXXXX8556','metus.sit.amet#inconsequatenim.edu'),
('63','11/10/2010','3046441','Guerrero, Unity','629','XXXXXXXXXXXXXXXX0807','nonummy.ac.feugiat#Phasellusdapibus.org'),
('22','08/19/2010','5435100','Turner, Patrick','1133','XXXXXXXXXXXXXXXX6734','pede#Duis.edu'),
('96','10/05/2010','6381992','May, Dominic','1858','XXXXXXXXXXXXXXXX7227','hymenaeos#etcommodo.edu'),
('96','02/26/2010','8630748','Chandler, Olympia','1016','XXXXXXXXXXXXXXXX4001','sed.dui.Fusce#pellentesqueSed.com');
--insert a random fraud transaction to check against (based on customer and account_info only)
INSERT INTO [dbo].[test_userCheck] (confirmation_id, customer, total, account_info) VALUES
('2055015', 'Guerrero, Unity', '20.02', 'XXXXXXXXXXXXXXXX0821');
--get result, which is correct
SELECT a.confirmation_id, a.customer, a.total, a.account_info, a.email_address
FROM dbo.test_userData AS a RIGHT OUTER JOIN
dbo.test_userCheck AS b ON a.customer = b.customer OR a.account_info = b.account_info;
DROP TABLE [dbo].[test_userData];
DROP TABLE [dbo].[test_userCheck];
Create the appropriate index or indices. Just based on your question, I'd suggest two indices, one on test_userData.customer, and a second index on test_userData.account_info
Creating indexes would probably help, but have you considered another design that complies with normal forms. It would be better if you access the date through index on a integer column instead of string...