Search index in SQL Server ignoring special characters

I have an [nvarchar] column in a SQL Server table containing data like 123456789, 123-456789, 1234.56.789, 1.23456-789 and so on. Users just add dots, minus signs, and spaces somewhere for readability, and I don't know where.
Is there a way to create an index which ignores special characters and finds these rows when searching for a plain "123456789"?

No, there is no way to do exactly what you want in the way that you want.
The best mechanism for doing this is to use a computed column. It does not need to be persisted to be indexed.
Initial Position
CREATE TABLE YourTable
(
YourColumn NVARCHAR(50)
);
INSERT INTO YourTable
VALUES ('123456789'),
('123-456789'),
('1234.56.789'),
('1.23456-789');
Create computed column and index it.
ALTER TABLE YourTable
ADD CanonicalForm AS
CAST(REPLACE(REPLACE(REPLACE(YourColumn, '.', ''), '-', ''), ' ', '') AS NVARCHAR(50));
CREATE INDEX ix
ON YourTable(CanonicalForm)
INCLUDE (YourColumn);
Test it
SELECT *
FROM YourTable
WHERE CanonicalForm = '123456789'
The execution plan shows a seek on the index.
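If the search input itself may contain separators, the same cleanup can be applied to the parameter side so the predicate stays sargable. A small sketch (the @search variable is hypothetical, not part of the original answer):
DECLARE @search NVARCHAR(50) = N'123-456 789'; -- hypothetical user input
SELECT *
FROM YourTable
WHERE CanonicalForm = REPLACE(REPLACE(REPLACE(@search, '.', ''), '-', ''), ' ', '');
Because the REPLACE calls operate only on the variable, the index on CanonicalForm can still be used for a seek.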

Related

Can I do a bulk insert into a table from Microsoft SQL Server Management Studio with a copy and paste list?

I often get a list of names I need to update in a table from an Excel list, and I end up creating an SSIS package that reads the file into a staging table and doing it that way. But is there a way I could just copy and paste the names into a table from Management Studio directly? Something like this:
create table #temp (personID int, userName varchar(15))
Insert
Into #temp (userName)
values (
'kmcenti1',
'ladams5',
'madams3',
'haguir1',
)
Obviously this doesn't work, but I've tried different variations and nothing seems to work.
Here's an option with less string manipulation. Just paste your values between the single quotes:
Declare @List varchar(max) = '
kmcenti1
ladams5
madams3
haguir1
'
Insert into #Temp (userName)
Select username=value
From string_split(replace(@List,char(10),''),char(13))
Where Value <>''
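Note that this assumes #Temp already exists, e.g. created with the definition from the question:
create table #Temp (personID int, userName varchar(15))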
For Multiple Columns
Source:
-- This is a copy/paste from Excel --
-- This includes Headers which is optional --
-- There is a TAB between cells --
Declare @List nvarchar(max) = '
Name Age email
kmcenti1 25 kmcenti1@gmail.com
ladams5 32 ladams5@gmail.com
madams3 18 madams3@gmail.com
haguir1 36 haguir1@gmail.com
'
Select Pos1 = JSON_VALUE(JS,'$[0]')
,Pos2 = JSON_VALUE(JS,'$[1]') -- could try_convert(int)
,Pos3 = JSON_VALUE(JS,'$[2]')
From string_split(replace(replace(@List,char(10),''),char(9),'||'),char(13)) A
Cross Apply (values ('["'+replace(string_escape(Value,'json'),'||','","')+'"]') ) B(JS)
Where Value <>''
and nullif(JSON_VALUE(JS,'$[0]'),'')<>'Name'
Results:
Pos1     | Pos2 | Pos3
kmcenti1 | 25   | kmcenti1@gmail.com
ladams5  | 32   | ladams5@gmail.com
madams3  | 18   | madams3@gmail.com
haguir1  | 36   | haguir1@gmail.com
Is this along the lines you're looking for?
create table #temp (personID int identity(1,1), userName varchar(15))
insert into #temp (userName)
select n from (values
('kmcenti1'),
('ladams5'),
('madams3'),
('haguir1'))x(n);
This assumes you want the ID generated for you since it's not in your data.
That SQL statement you have won't work (that's one row), but I have a workaround: build what you need with a formula in Excel.
Assuming user IDs are in column A:
In cell B1, insert this formula:
="('"&A1&"'),"
And then drag the formula down your list.
Go to SSMS and type in:
insert into [your table](userName) values
And then paste in column B from Excel and delete the last comma.
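With the sample names above, the statement you end up executing would look like this:
insert into [your table](userName) values
('kmcenti1'),
('ladams5'),
('madams3'),
('haguir1')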

Perform exact search honouring spaces

I'm using SQL Server 2017; my collation is SQL_LATIN1_GENERAL_CP1_CI_AS and ANSI_PADDING is at its default (ON).
In my table, one of the columns is of type NVARCHAR(255), and one of the values was inserted like this (including a trailing space):
N'abc '
When I search for it without the space (N'abc'), I don't want to get N'abc ' back, but it finds it anyway.
I know I can trim spaces when inserting new records, but I can't change the records that were already inserted.
How can I prevent the query below from finding it?
CREATE TABLE #tmp (c1 nvarchar(255))
INSERT INTO #tmp
VALUES (N'abc ')
SELECT *
FROM #tmp
WHERE c1 = N'abc'
DROP TABLE #tmp
I also found this article, but I want to prevent the match at query time:
Why the SQL Server ignore the empty space at the end automatically?
I'm using LINQ to Entities with C#. With a SQL query I can search with the LIKE keyword without a percent character:
SELECT *
FROM #tmp
WHERE c1 LIKE N'abc'
But with LINQ, I don't know how to write this query:
entity.Temp.Where(p => p.c1 == "abc");
entity.Temp.Where(p => p.c1.Equals("abc"));
entity.Temp.Where(p => p.c1.Contains("abc"));
You can try:
SELECT * FROM #tmp WHERE cast(c1 as varbinary(510)) = cast(N'abc' as varbinary(510))
This would be very slow if you have a lot of rows, but it works.
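A possible alternative (a sketch, not from the original answer): keep the plain equality so an index on c1 can still seek, and add a DATALENGTH check to reject padded values, since DATALENGTH counts trailing spaces:
SELECT *
FROM #tmp
WHERE c1 = N'abc' -- equality ignores trailing spaces but allows an index seek
AND DATALENGTH(c1) = DATALENGTH(N'abc') -- N'abc ' is 8 bytes, N'abc' is 6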

Splitting multiple fields by delimiter

I have to write an SP that can perform partial updates on our databases; the changes are stored in records of the PU table. A Values field contains all values, delimited by a fixed delimiter. A Table field refers to a Schemes table, which contains the column names for each table in a similarly delimited Columns field.
Now for my SP I need to split the Values field and the Columns field into a temp table of column/value pairs; this happens for each record in the PU table.
An example:
Our PU table looks something like this:
CREATE TABLE [dbo].[PU](
[Table] [nvarchar](50) NOT NULL,
[Values] [nvarchar](max) NOT NULL
)
Insert SQL for this example:
INSERT INTO [dbo].[PU]([Table],[Values]) VALUES ('Person','John Doe;26');
INSERT INTO [dbo].[PU]([Table],[Values]) VALUES ('Person','Jane Doe;22');
INSERT INTO [dbo].[PU]([Table],[Values]) VALUES ('Person','Mike Johnson;20');
INSERT INTO [dbo].[PU]([Table],[Values]) VALUES ('Person','Mary Jane;24');
INSERT INTO [dbo].[PU]([Table],[Values]) VALUES ('Course','Mathematics');
INSERT INTO [dbo].[PU]([Table],[Values]) VALUES ('Course','English');
INSERT INTO [dbo].[PU]([Table],[Values]) VALUES ('Course','Geography');
INSERT INTO [dbo].[PU]([Table],[Values]) VALUES ('Campus','Campus A;Schools Road 1;Educationville');
INSERT INTO [dbo].[PU]([Table],[Values]) VALUES ('Campus','Campus B;Schools Road 31;Educationville');
INSERT INTO [dbo].[PU]([Table],[Values]) VALUES ('Campus','Campus C;Schools Road 22;Educationville');
And we have a Schemes table similar to this:
CREATE TABLE [dbo].[Schemes](
[Table] [nvarchar](50) NOT NULL,
[Columns] [nvarchar](max) NOT NULL
)
Insert SQL for this example:
INSERT INTO [dbo].[Schemes]([Table],[Columns]) VALUES ('Person','[Name];[Age]');
INSERT INTO [dbo].[Schemes]([Table],[Columns]) VALUES ('Course','[Name]');
INSERT INTO [dbo].[Schemes]([Table],[Columns]) VALUES ('Campus','[Name];[Address];[City]');
As a result, the first record of the PU table should result in a temp table like:
[Name]|John Doe
[Age]|26
The 5th will have:
[Name]|Mathematics
Finally, the 8th PU record should result in:
[Name]|Campus A
[Address]|Schools Road 1
[City]|Educationville
You get the idea.
I tried using the following query to create the temp tables, but alas it fails when there's more than one value in the PU record:
DECLARE @Fields TABLE
(
[Column] INT,
[Value] VARCHAR(MAX)
)
INSERT INTO @Fields
SELECT TOP 1
(SELECT Value FROM STRING_SPLIT([dbo].[Schemes].[Columns], ';')),
(SELECT Value FROM STRING_SPLIT([dbo].[PU].[Values], ';'))
FROM [dbo].[PU] INNER JOIN [dbo].[Schemes] ON [dbo].[PU].[Table] = [dbo].[Schemes].[Table]
TOP 1 correctly gets the first PU record as each PU record is removed once processed.
The error is:
Subquery returned more than 1 value. This is not permitted when the subquery follows =, !=, <, <= , >, >= or when the subquery is used as an expression.
In the case of a Person record, the splits are indeed returning 2 values/columns at a time; I just want to store the values in 2 records instead of getting an error.
Any help on rewriting the above query?
Also do note that the data is just generic nonsense. Being able to take 2 fields that both hold delimited values, always equal in number (e.g. a 'Person' in the PU table will always have 2 delimited values in the field), and break them up into several column/value rows is the point of the question.
UPDATE: Working implementation
Based on the (accepted) answer of Sean Lange, I was able to work out the following implementation to overcome the issue:
As I need to reuse it, the column/value combining is performed by a new function, declared as follows:
CREATE FUNCTION [dbo].[JoinDelimitedColumnValue]
(@splitValues VARCHAR(8000), @splitColumns VARCHAR(8000), @pDelimiter CHAR(1))
RETURNS TABLE WITH SCHEMABINDING AS
RETURN
WITH MyValues AS
(
SELECT ColumnPosition = x.ItemNumber,
ColumnValue = x.Item
FROM dbo.DelimitedSplit8K(@splitValues, @pDelimiter) x
)
, ColumnData AS
(
SELECT ColumnPosition = x.ItemNumber,
ColumnName = x.Item
FROM dbo.DelimitedSplit8K(@splitColumns, @pDelimiter) x
)
SELECT cd.ColumnName,
v.ColumnValue
FROM MyValues v
JOIN ColumnData cd ON cd.ColumnPosition = v.ColumnPosition
;
In case of the above sample data, I'd call this function with the following SQL:
DECLARE @FieldValues VARCHAR(8000), @FieldColumns VARCHAR(8000)
SELECT TOP 1 @FieldValues=[dbo].[PU].[Values], @FieldColumns=[dbo].[Schemes].[Columns] FROM [dbo].[PU] INNER JOIN [dbo].[Schemes] ON [dbo].[PU].[Table] = [dbo].[Schemes].[Table]
INSERT INTO @Fields
SELECT [Column] = x.[ColumnName],[Value] = x.[ColumnValue] FROM [dbo].[JoinDelimitedColumnValue](@FieldValues, @FieldColumns, @Delimiter) x
This data structure makes this way more complicated than it should be. You can leverage the splitter from Jeff Moden here: http://www.sqlservercentral.com/articles/Tally+Table/72993/. The main difference between that splitter and all the others is that his returns the ordinal position of each element. Why all the other splitters don't do this is beyond me; for things like this it is needed. You have two sets of delimited data, and you must ensure that they are both reassembled in the correct order.
The biggest issue I see is that you don't have anything in your main table to function as an anchor for ordering the results correctly. You need something, even an identity, to ensure the output rows stay "together". To accomplish this I just added an identity to the PU table.
alter table PU add RowOrder int identity not null
Now that we have an anchor, this is still a little cumbersome for what should be a simple query, but it is achievable.
Something like this will now work.
with MyValues as
(
select p.[Table]
, ColumnPosition = x.ItemNumber
, ColumnValue = x.Item
, RowOrder
from PU p
cross apply dbo.DelimitedSplit8K(p.[Values], ';') x
)
, ColumnData as
(
select ColumnName = replace(replace(x.Item, ']', ''), '[', '')
, ColumnPosition = x.ItemNumber
, s.[Table]
from Schemes s
cross apply dbo.DelimitedSplit8K(s.Columns, ';') x
)
select cd.[Table]
, v.ColumnValue
, cd.ColumnName
from MyValues v
join ColumnData cd on cd.[Table] = v.[Table]
and cd.ColumnPosition = v.ColumnPosition
order by v.RowOrder
, v.ColumnPosition
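With the sample data above, the output should start like this:
Table  | ColumnValue | ColumnName
Person | John Doe    | Name
Person | 26          | Age
Person | Jane Doe    | Name
Person | 22          | Age
...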
I recommend not storing values like this in the first place. I recommend having a key value in the tables, and preferably not using Table and Columns as a composite key. I also recommend avoiding reserved words. I don't know what version of SQL you are using, so I am going to assume a fairly recent version of Microsoft SQL Server that will support my provided stored procedure.
Here is an overview of the solution:
1) You need to convert both the PU and the Schemes tables into a form where each "column" value in the delimited list is isolated in its own row. If you can store the data in this format rather than the provided format, you will be a little better off.
What I mean is
Table|Columns
Person|Jane Doe;22
needs to be converted to
Table|Column|OrderInList
Person|Jane Doe|1
Person|22|2
There are multiple ways to do this, but I prefer an XML trick that I picked up. You can find multiple split-string examples online, so I will not focus on that; use whatever gives you the best performance. Unfortunately, you might not be able to get away from this table-valued function.
Update:
Thanks to Shnugo's performance enhancement comment, I have updated my XML splitter to give you the row number, which reduces some of my code. I do the exact same thing to the Schemes list.
2) Since the new Schemes table and the new PU table now contain the order in which each column appears, the two can be joined on "Table" and OrderInList.
CREATE FUNCTION [dbo].[fnSplitStrings_XML]
(
@List NVARCHAR(MAX),
@Delimiter VARCHAR(255)
)
RETURNS TABLE
AS
RETURN
(
SELECT y.i.value('(./text())[1]', 'nvarchar(4000)') AS Item,ROW_NUMBER() OVER(ORDER BY (SELECT NULL)) as RowNumber
FROM
(
SELECT CONVERT(XML, '<i>'
+ REPLACE(@List, @Delimiter, '</i><i>')
+ '</i>').query('.') AS x
) AS a CROSS APPLY x.nodes('i') AS y(i)
);
GO
CREATE Procedure uspGetColumnValues
as
Begin
--Split each value in PU
select p.[Table],p.[Values],a.[Item],CHARINDEX(a.Item,p.[Values]) as LocationInStringForSorting,a.RowNumber
into #PuWithOrder
from PU p
cross apply [fnSplitStrings_XML](p.[Values],';') a --use whatever string split function is working best for you (performance wise)
--Split each value in Schema
select s.[Table],s.[Columns],a.[Item],CHARINDEX(a.Item,s.[Columns]) as LocationInStringForSorting,a.RowNumber
into #SchemaWithOrder
from Schemes s
cross apply [fnSplitStrings_XML](s.[Columns],';') a --use whatever string split function is working best for you (performance wise)
DECLARE @Fields TABLE --If this is an ETL process, maybe make this a permanent table with an auto-incrementing Id and reference this table in all steps after this.
(
[Table] NVARCHAR(50),
[Columns] NVARCHAR(MAX),
[Column] VARCHAR(MAX),
[Value] VARCHAR(MAX),
OrderInList int
)
INSERT INTO @Fields([Table],[Columns],[Column],[Value],OrderInList)
Select pu.[Table],pu.[Values] as [Columns],s.Item as [Column],pu.Item as [Value],pu.RowNumber
from #PuWithOrder pu
join #SchemaWithOrder s on pu.[Table]=s.[Table] and pu.RowNumber=s.RowNumber
Select [Table],[Columns],[Column],[Value],OrderInList
from @Fields
order by [Table],[Columns],OrderInList
END
GO
EXEC uspGetColumnValues
GO
Update:
Since your working implementation is a table-valued function, I have another recommendation. The problem I see is that you're using a table-valued function, which ultimately handles one record at a time. You are going to have better performance with set-based operations, batching as needed. With a table-valued function, you are likely going to be looping through each row. If this is some sort of ETL process, your team will be better off if you have a stored procedure that processes the rows in bulk. It might make sense to stage the results into a better table that your team can work with downstream rather than have them use a potentially slow table-valued function.

Improve performance of query with conditional filtering

Let's say I have a table with 3 million rows; the table has neither a PK nor indexes.
The query is as follows:
SELECT SKU, Store, ColumnA, ColumnB, ColumnC
FROM myTable
WHERE (SKU IN (select * from splitString(@skus)) OR @skus IS NULL)
AND (Store IN (select * from splitString(@stores)) OR @stores IS NULL)
Please consider that @skus and @stores are NVARCHAR(MAX), containing a list of ids separated by commas.
splitString is a function which converts a string in the format '1,2,3' to a single-column table with one row per id.
This pattern allows me to send arguments from the application and filter by SKU, by store, both, or neither.
What can I do to improve the performance of this query? I know indexes are a good idea, but I don't really know much about them, so guidance there would be helpful. Any other ideas?
This type of generic search query tends to be rough on performance.
In addition to the suggestion to use temp tables to store the results of the string parsing, there are a couple other things you could do:
Add indexes
It's usually recommended that each table have a clustered index (although it seems there is still room for debate): Will adding a clustered index to an existing table improve performance?
In addition to that, you will probably also want to add indexes on the fields that you're searching on.
In this case, that might be something like:
SKU (for searches on SKU alone)
Store, SKU (for searches on Store and the combination of both Store and SKU)
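As a sketch, those might look like this (the index names are made up, and the INCLUDE columns assume the exact SELECT list of the query above):
CREATE INDEX IX_myTable_SKU ON myTable (SKU) INCLUDE (Store, ColumnA, ColumnB, ColumnC);
CREATE INDEX IX_myTable_Store_SKU ON myTable (Store, SKU) INCLUDE (ColumnA, ColumnB, ColumnC);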
Keep in mind that if the query matches too many records, these indexes might not be used.
Also keep in mind that making the indexes cover the query can improve performance:
Why use the INCLUDE clause when creating an index?
Here is a link to Microsoft's documentation on creating indexes:
https://learn.microsoft.com/en-us/sql/t-sql/statements/create-index-transact-sql
Use dynamic SQL to build the query
I need to preface this with a warning. Please be aware of SQL injection, and make sure to code appropriately!
How to cleanse dynamic SQL in SQL Server -- prevent SQL injection
Building a dynamic SQL query allows you to write more streamlined and direct SQL, and thus allows the optimizer to do a better job. This is normally something to be avoided, but I believe it fits this particular situation.
Here is an example (should be adjusted to take SQL injection into account as needed):
DECLARE @sql NVARCHAR(MAX) = '
SELECT SKU, Store, ColumnA
FROM myTable
WHERE 1 = 1
';
IF @skus IS NOT NULL BEGIN
SET @sql += ' AND SKU IN (' + @skus + ')';
END
IF @stores IS NOT NULL BEGIN
SET @sql += ' AND Store IN (' + @stores + ')';
END
EXEC sp_executesql @sql;
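One possible guard (a sketch, assuming the ids are purely numeric): reject anything that is not a comma-separated list of digits before concatenating it into the query.
IF @skus LIKE '%[^0-9,]%' OR @stores LIKE '%[^0-9,]%'
THROW 50000, 'Invalid id list', 1; -- SQL Server 2012+; validate before building @sql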
Another thing to avoid is using functions in your Where clause. That will slow a query down.
Try putting this at the beginning of your script, before the first SELECT:
SELECT skus_group INTO #skus_group
FROM (SELECT item AS skus_group FROM
splitstring(@skus, ','))A;
Then replace your WHERE clause:
WHERE SKU IN(Select skus_group FROM #skus_group)
This normally improves performance because it takes advantage of indexes instead of a table scan, but since you're not using any indexes I'm not sure how much performance gain you'll get.
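The same pattern would presumably apply to @stores (a sketch following the same approach):
SELECT stores_group INTO #stores_group
FROM (SELECT item AS stores_group FROM
splitstring(@stores, ','))A;
and then add the corresponding filter:
WHERE Store IN (Select stores_group FROM #stores_group)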
This will work faster, I believe:
SELECT SKU, Store, ColumnA, ColumnB, ColumnC FROM myTable WHERE @skus IS NULL AND @stores IS NULL
UNION ALL
SELECT SKU, Store, ColumnA, ColumnB, ColumnC
FROM myTable
INNER JOIN (select colname AS myskus from splitString(@skus))skuses ON skuses.myskus = myTable.SKU
INNER JOIN (select colname AS mystore from splitString(@stores))stores ON stores.mystore = myTable.Store

SQL Server index - ideas?

I have this query :
SELECT
c.violatorname
FROM
dbo.crimecases AS c,
dbo.people AS p
WHERE
REPLACE(c.violatorname, ' ', '') = CONCAT(CONCAT(CONCAT(p.firstname, p.secondname), p.thirdname), p.lastname);
The query is very slow. I need to create an index on the violatorname column with the REPLACE applied. Any ideas?
I would suggest adding computed columns and creating indexes on them.
ALTER TABLE crimecases
ADD violatornameProcessed AS Replace(violatorname, ' ', '') PERSISTED
ALTER TABLE people
ADD fullName AS Concat(firstname, secondname, thirdname, lastname) PERSISTED
PERSISTED stores the computed data on disk instead of computing it every time. Now create indexes on the new columns.
CREATE INDEX Nix_crimecases_violatornameProcessed
ON crimecases (violatornameProcessed)
include (violatorname)
CREATE INDEX Nix_people_fullName
ON people (fullName)
The query can then be written like:
SELECT c.violatorname
FROM dbo.crimecases AS c
INNER JOIN dbo.people AS p
ON c.violatornameProcessed = p.fullName
