Large scale word substitutions in SQL - sql-server

I'm attempting to perform data cleansing on a field in a large database. I have a reference table that contains words with their replacements, macros if you like. I'd like to apply those changes to a table that contains millions of rows, in the most efficient manner possible. With that said, let me provide some dummy data below so you can visualize the process:
Street_Addresses Table:
Street_Name | Expanded_Name
------------------+--------------
100 Main St Ste 5 | NULL
25 10th Ave Apt 2 | NULL
75 Bridge Rd | NULL
Word_Substitutions Table:
Word | Replacement
-----+------------
St | Street
Ave | Avenue
Rd | Road
Ste | Suite
Apt | Apartment
So the end result would be the following after updates:
Street_Name | Expanded_Name
------------------+--------------
100 Main St Ste 5 | 100 Main Street Suite 5
25 10th Ave Apt 2 | 25 10th Avenue Apartment 2
75 Bridge Rd | 75 Bridge Road
The challenge here is the sheer number of substitutions that need to take place, often several replacements on a single value. The initial thought that sprang to mind was to use a scalar function to encapsulate this logic. But as you can imagine, this does not perform well over millions of rows.
CREATE FUNCTION Substitute_Words (@Text varchar(MAX))
RETURNS varchar(MAX) AS
BEGIN
    SELECT @Text = REPLACE(' ' + @Text + ' ', ' ' + Word + ' ',
        ' ' + Replacement + ' ') FROM Word_Substitutions
    RETURN LTRIM(RTRIM(@Text))
END
I decided to look at a set based operation instead and came up with the following:
WHILE (1 = 1)
BEGIN
UPDATE A SET Expanded_Name = LTRIM(RTRIM(REPLACE(
' ' + ISNULL(A.Expanded_Name, A.Street_Name) + ' ',
' ' + W.Word + ' ', ' ' + W.Replacement + ' ')))
FROM Street_Addresses AS A
CROSS APPLY (SELECT TOP 1 Word, Replacement
FROM Word_Substitutions WHERE CHARINDEX(' ' + Word + ' ',
' ' + ISNULL(A.Expanded_Name, A.Street_Name) + ' ') > 0) AS W
IF (@@ROWCOUNT = 0)
BREAK
END
Right now, this takes about 2 hours based on my actual dataset and I would like to reduce that if possible - does anyone have suggestions for optimization?
UPDATE:
By just using an inner join instead, I was able to reduce the execution time to about 5 minutes. I had initially thought that an update with an inner join that returns multiple rows per source row would not work. It appears that the update still works, but each source row is updated once, not once per match. Apparently SQL Server picks an arbitrary matching row for the update (which one is not guaranteed) and discards the others, which is why the loop repeats until no rows are affected.
WHILE (1 = 1)
BEGIN
UPDATE A SET Expanded_Name = LTRIM(RTRIM(REPLACE(
' ' + ISNULL(A.Expanded_Name, A.Street_Name) + ' ',
' ' + W.Word + ' ', ' ' + W.Replacement + ' ')))
FROM Street_Addresses AS A
INNER JOIN Word_Substitutions AS W ON CHARINDEX(' ' + W.Word + ' ',
' ' + ISNULL(A.Expanded_Name, A.Street_Name) + ' ') > 0
IF (@@ROWCOUNT = 0)
BREAK
END

I think the best approach here is to store the modified data in your database. You can create a separate table with the ID and the formatted address, or add an additional column to your current table.
Then, because you already have a lot of records, you need to update them. Here I think you have two options: create an internal (T-SQL) function and use it to update the current records (it might be slow, but once it has finished the data will already be in your table), or create a CLR procedure and use the power of regular expressions.
For newly inserted records, an AFTER INSERT trigger that calls your SQL or CLR function and updates the inserted rows is a flexible option.
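As a rough illustration of the trigger idea (a sketch, not a tested implementation): it assumes Street_Addresses has a unique key column, here called Id, which is not shown in the question, and that the Substitute_Words function from the question exists in the dbo schema. The trigger name is made up.

```sql
-- Sketch only. Assumptions: a key column named Id (hypothetical) and the
-- scalar function dbo.Substitute_Words from the question.
CREATE TRIGGER trg_Street_Addresses_Expand
ON Street_Addresses
AFTER INSERT
AS
BEGIN
    SET NOCOUNT ON;

    -- Expand only the rows that were just inserted.
    UPDATE A
    SET Expanded_Name = dbo.Substitute_Words(A.Street_Name)
    FROM Street_Addresses AS A
    INNER JOIN inserted AS I ON I.Id = A.Id;
END
```

Because the trigger only touches the rows in the inserted pseudo-table, the per-row cost of the scalar function matters far less here than in the one-off bulk update.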

You could always do something ridiculous and run this as dynamic SQL with all of the replacements inline:
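declare @sql nvarchar(max)
-- pad with spaces so words at the start or end of the value also match
set @sql = ''' '' + Street_Name + '' '''
-- wrap one replace() per substitution row around the expression
select @sql = 'replace(' + @sql + ', '' ' + Word + ' '', '' ' + Replacement + ' '')'
from Word_Substitutions
set @sql = 'update Street_Addresses set Expanded_Name = ltrim(rtrim(' + @sql + '))'
exec sp_executesql @sql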
Yes, I totally expect a downvote or two, but this method can work well on occasion given how UDFs and recursive CTEs can sometimes be very slow on large datasets. And it's fun to post off-the-wall solutions from time to time.
Regardless, I would be curious to see how this would run, especially if combined with the suggestion of storing and trigger-based updating by @gotqn (which I agree with and have upvoted).
I'm currently running about 3 seconds with 275 replacement words and 100k addresses on a modest box.

Related

Delay in sql server query result with leading and trailing character removed

I have a query in SQL Server 2012 that takes 8 seconds to retrieve 1158 rows. In the query I have to join on two fields, I have to trim leading and trailing 0s while joining.
The query below takes 8 seconds to return 1158 rows:
Select value1, value2
from TableA LEFT JOIN TableB
ON
SUBSTRING(REVERSE(SUBSTRING(REVERSE(TableA.[policy#]),PATINDEX('%[^' + '0' + ' ]%',REVERSE(TableA.[policy#])),100)),PATINDEX('%[^' + '0' + ' ]%',REVERSE(SUBSTRING(REVERSE(TableA.[policy#]),PATINDEX('%[^' + '0' + ' ]%',REVERSE(TableA.[policy#])),100))),100)
= SUBSTRING(REVERSE(SUBSTRING(REVERSE(TableB.[policy#]),PATINDEX('%[^' + '0' + ' ]%',REVERSE(TableB.[policy#])),100)),PATINDEX('%[^' + '0' + ' ]%',REVERSE(SUBSTRING(REVERSE(TableB.[policy#]),PATINDEX('%[^' + '0' + ' ]%',REVERSE(TableB.[policy#])),100))),100)
WHERE {some conditions}
Instead of writing the ugly code for removing leading and trailing 0s, I created two functions to do the same:
ALTER FUNCTION [dbo].[L_TRIM](@String VARCHAR(MAX), @Char varchar(5))
RETURNS VARCHAR(MAX)
BEGIN
    RETURN SUBSTRING(@String,PATINDEX('%[^' + @Char + ' ]%',@String),100)
END
ALTER FUNCTION [dbo].[R_TRIM](@String VARCHAR(MAX), @Char varchar(5))
RETURNS VARCHAR(MAX)
BEGIN
    RETURN REVERSE(SUBSTRING(REVERSE(@String),PATINDEX('%[^' + @Char + ' ]%',REVERSE(@String)),100))
END
The query below takes 1 minute 40 seconds to return the same 1158 rows:
Select value1, value2
from TableA LEFT JOIN TableB
ON
dbo.L_trim(dbo.R_trim( TableA.[policy#], '0'),'0') = dbo.L_trim(dbo.R_trim(TableB.[policy#], '0'),'0')
WHERE {some conditions}
Can someone guide me on how to retrieve the rows in less time without writing ugly code?

SQL Server Full Text Search to match contact name to prevent duplicates

Using SQL Server Azure or 2017 with Full Text Search, I need to return possible matches on names.
Here's the simple scenario: an administrator is entering contact information for a new employee, first name, last name, address, etc. I want to be able to search the Employee table for a possible match on the name(s) to see if this employee has already been entered in the database.
This might happen as an autosuggest type of feature, or simply display some similar results, like here in Stackoverflow, while the admin is entering the data.
I need to prevent duplicates!
If the admin enters "Bob", "Johnson", I want to be able to match on:
Bob Johnson
Rob Johnson
Robert Johnson
This will give the administrator the option of seeing if this person has already been entered into the database and choose one of those choices.
Is it possible to do this type of match on words like "Bob" and include "Robert" in the results? If so, what is necessary to accomplish this?
Try this.
You need to change the @per parameter value to suit your requirement. It indicates how many letters of the first name must match for a row to be returned. I set it to 50% of the length for testing purposes.
The dynamic SQL piece inside the loop adds up the CHARINDEX result for each letter of the first name in question, against all existing first names.
Caveats:
Repeating letters will of course be misleading: Bob will count 3 matches in Rob because there are 2 Bs in Bob.
I didn't consider names with two first names, like Bob Robert Johnson, so this will fail on those. You can improve on that, but you get the idea.
The final SQL query returns the rows whose LetterMatch is greater than or equal to the value set in @per.
DECLARE @name varchar(MAX) = 'Bobby Johnson' --sample name
DECLARE @first varchar(50) = SUBSTRING(@name, 0, CHARINDEX(' ', @name)) --get the first part of the name before space
DECLARE @last varchar(50) = SUBSTRING(@name, CHARINDEX(' ', @name) + 1, LEN(@name) - LEN(@first) - 1) --get the last part of the name after space
DECLARE @walker int = 1 --for looping
DECLARE @per float = LEN(@first) * 0.50 --declare percentage of how many letters out of the length of the first name should match. I just used 50% for testing
DECLARE @char char --for looping
DECLARE @sql varchar(MAX) --for dynamic SQL use
DECLARE @matcher varchar(MAX) = '' --for dynamic SQL use
WHILE @walker <> LEN(@first) + 1 BEGIN --loop through all the letters of the first name saved in @first variable
    SET @char = SUBSTRING(@first, @walker, 1) --save the current letter in the iteration
    SET @matcher = @matcher + IIF(@matcher = '', '', ' + ') + 'IIF(CHARINDEX(''' + @char + ''', FirstName) > 0, 1, 0)' --build the additional column to be added to the dynamic SQL
    SET @walker = @walker + 1 --move the loop
END
SET @sql = 'SELECT * FROM (SELECT FirstName, LastName, ' + @matcher + ' AS LetterMatch
FROM TestName
WHERE LastName LIKE ' + '''%' + @last + '%''' + ') AS src
WHERE CAST(LetterMatch AS int) >= ROUND(' + CAST(@per AS varchar(50)) + ', 0)'
SELECT @sql
EXEC(@sql)
SELECT * FROM tbl_Names
WHERE Name LIKE '% user defined text %';
Placing the text between the % wildcards matches it at any position in the data.

Generate column name dynamically in sql server

Please look at the below query..
select name as [Employee Name] from table name.
I want to generate [Employee Name] dynamically based on other column value.
Here is the sample table
s_dt dt01 dt02 dt03
2015-10-26
I want dt01 value to display as column name 26 and dt02 column value will be 26+1=27
I'm not sure if I understood you correctly. If I'm going in the wrong direction, please add comments to your question to make it more precise.
If you really want to create columns via SQL you could try a variation of this script:
DECLARE @name NVARCHAR(MAX) = 'somename'
DECLARE @sql NVARCHAR(MAX) = 'ALTER TABLE aps.tbl_Fabrikkalender ADD '+@name+' nvarchar(10) NULL'
EXEC sys.sp_executesql @sql;
To retrieve the column name from another query, insert the following between the two declares above and fill in the placeholders as needed:
SELECT @name = <some column> FROM <some table> WHERE <some condition>
You would need to dynamically build the SQL as a string then execute it. Something like this...
DECLARE @s_dt INT
DECLARE @query NVARCHAR(MAX)
SET @s_dt = (SELECT DATEPART(dd, s_dt) FROM TableName WHERE 1 = 1)
SET @query = 'SELECT s_dt'
    + ', NULL as dt' + RIGHT('0' + CAST(@s_dt as VARCHAR), 2)
    + ', NULL as dt' + RIGHT('0' + CAST((@s_dt + 1) as VARCHAR), 2)
    + ', NULL as dt' + RIGHT('0' + CAST((@s_dt + 2) as VARCHAR), 2)
    + ', NULL as dt' + RIGHT('0' + CAST((@s_dt + 3) as VARCHAR), 2)
    + ' FROM TableName WHERE 1 = 1'
EXECUTE(@query)
You will need to replace WHERE 1 = 1 in the two places above to select your data, and change TableName to the name of your table. It currently puts NULL in as the dynamic column data; you probably want something else there.
To explain what it is doing:
SET @s_dt selects the date value from your table and returns only the day part as an INT.
SET @query dynamically builds your SELECT statement based on the day part (@s_dt).
Each line takes @s_dt, adds 0, 1, 2, 3 etc., casts the result to VARCHAR, prepends '0' (so that it is at least 2 characters long), then takes the rightmost two characters (the '0' and the RIGHT operation just ensure that anything under 10 gets a leading '0').
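The zero-padding step is easy to check on its own. A minimal sketch, using two literal day numbers chosen only for illustration (the column aliases are made up):

```sql
-- Prepend '0', then keep the rightmost two characters.
SELECT RIGHT('0' + CAST(7  AS varchar(2)), 2) AS padded_7,   -- '07'
       RIGHT('0' + CAST(26 AS varchar(2)), 2) AS padded_26;  -- '26'
```

A single-digit day gains its leading zero, while a two-digit day passes through unchanged because RIGHT drops the extra '0'.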
It is possible to do this using dynamic SQL, however I would also consider looking at the pivot operators to see if they can achieve what you are after a lot more efficiently.
https://technet.microsoft.com/en-us/library/ms177410(v=sql.105).aspx

SQL Server Row size

We are creating a crosstab report and generating the query at run time in SQL Server 2008.
In one of the selections, when the user chooses Program Name as the column, it gives the error below:
Creating or altering table 'FakeWorkTable' failed because the minimum row size would be 11852, including 189 bytes of internal overhead. This exceeds the maximum allowable table row size of 8094 bytes.
Query shall return like:
Date Program 1 ... Program 100... Program 500
It will tell some information about TV program datewise.
Is there any way to increase this row size?
Please let me know in-case any other information is needed.
Best Regards
My Code as below:
set @query
= 'SELECT ' + @PivotRowColumn + ',' + @cols + ' from
(
SELECT ' + @SelectCalculationColumn + ', ' + @PivotRowColumn + ', ' + @PivotColumn + '
FROM ##ResultCrosstab
--where programName like ''S%''
) x
pivot
(
' + @CalculateColumn + '
for ' + @PivotColumn + ' in (' + @cols + ')
) p '
@PivotRowColumn/@PivotColumn can be any of SalesHouse/Station/Day/Week/Month/Date/Product/Program/SpotLength/Timeband/Campaign.
@SelectCalculationColumn is a KPI, e.g. Spots/Budget/Impacts/Variance/TVR etc.
@cols is the list of pivot columns.
This issue occurs very rarely, for bigger campaigns. I have added a drop-down for when the user selects programs as the column, so if users get the error they can limit the programs or use the filter (which is already in place).

Replace duplicate spaces with a single space in T-SQL

I need to ensure that a given field does not have more than one space (I am not concerned about all white space, just space) between characters.
So
'single    spaces   only'
needs to be turned into
'single spaces only'
The below will not work
select replace('single    spaces   only','  ',' ')
as it would result in
'single  spaces  only'
I would really prefer to stick with native T-SQL rather than a CLR based solution.
Thoughts?
Even tidier:
select string = replace(replace(replace(' select   single       spaces',' ','<>'),'><',''),'<>',' ')
Output:
select single spaces
This would work:
declare @test varchar(100)
set @test = 'this   is  a     test'
-- loop while any double space remains
while charindex('  ',@test ) > 0
begin
    set @test = replace(@test, '  ', ' ')
end
select @test
If you know there won't be more than a certain number of spaces in a row, you could just nest the replace:
replace(replace(replace(replace(myText,'  ',' '),'  ',' '),'  ',' '),'  ',' ')
4 replaces should fix up to 16 consecutive spaces (16, then 8, then 4, then 2, then 1)
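To see the halving concretely, here is a small sketch (the variable @s and the letters a and b are only illustrative) that measures the string after each pass over a run of 16 spaces:

```sql
-- Illustrative only: 'a' + 16 spaces + 'b', shrinking one REPLACE pass at a time.
DECLARE @s varchar(100) = 'a' + REPLICATE(' ', 16) + 'b';

SELECT LEN(@s) AS start_len;       -- 18 (16 spaces)
SET @s = REPLACE(@s, '  ', ' ');
SELECT LEN(@s) AS after_pass_1;    -- 10 (8 spaces)
SET @s = REPLACE(@s, '  ', ' ');
SELECT LEN(@s) AS after_pass_2;    -- 6  (4 spaces)
SET @s = REPLACE(@s, '  ', ' ');
SELECT LEN(@s) AS after_pass_3;    -- 4  (2 spaces)
SET @s = REPLACE(@s, '  ', ' ');
SELECT LEN(@s) AS after_pass_4;    -- 3  (1 space)
```

Each pass replaces non-overlapping pairs, so the run length roughly halves every time; a fifth nested REPLACE would extend the coverage to runs of up to 32 spaces.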
If it could be significantly longer, then you'd have to do something like a scalar function:
CREATE FUNCTION strip_spaces(@str varchar(8000))
RETURNS varchar(8000) AS
BEGIN
    WHILE CHARINDEX('  ', @str) > 0
        SET @str = REPLACE(@str, '  ', ' ')
    RETURN @str
END
Then just do
SELECT dbo.strip_spaces(myText) FROM myTable
This is somewhat brute force, but will work
CREATE FUNCTION stripDoubleSpaces(@prmSource varchar(max)) Returns varchar(max)
AS
BEGIN
    WHILE (PATINDEX('%  %', @prmSource)>0)
    BEGIN
        SET @prmSource = replace(@prmSource ,'  ',' ')
    END
    RETURN @prmSource
END
GO
-- Unit test --
PRINT dbo.stripDoubleSpaces('single    spaces   only')
single spaces only
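update mytable
set myfield = replace (myfield, '  ', ' ')
where charindex('  ', myfield) > 0
REPLACE handles every double space in a value in one call, so there is no need to nest several replaces. Runs longer than two spaces only shrink by about half per pass, though, so rerun the statement until it updates no rows. This is the set-based solution.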
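One caveat worth checking: a single REPLACE pass turns each pair of spaces into one, so a run of four spaces becomes two rather than one. A minimal sketch of repeating the set-based update until nothing is left to fix, reusing the mytable and myfield names from the statement above:

```sql
-- Repeat the set-based update until no row still contains a double space.
WHILE 1 = 1
BEGIN
    UPDATE mytable
    SET myfield = REPLACE(myfield, '  ', ' ')
    WHERE CHARINDEX('  ', myfield) > 0;

    -- @@ROWCOUNT reflects the UPDATE just executed.
    IF @@ROWCOUNT = 0
        BREAK;
END
```

The number of iterations grows with the logarithm of the longest run of spaces, so even very long runs need only a handful of passes over the affected rows.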
It can be done recursively via the function:
CREATE FUNCTION dbo.RemSpaceFromStr(@str VARCHAR(MAX)) RETURNS VARCHAR(MAX) AS
BEGIN
    RETURN (CASE WHEN CHARINDEX('  ', @str) > 0 THEN
        dbo.RemSpaceFromStr(REPLACE(@str, '  ', ' ')) ELSE @str END);
END
then, for example:
SELECT dbo.RemSpaceFromStr('some   string    with         many     spaces') AS NewStr
returns:
NewStr
some string with many spaces
Or a solution based on the method described by @agdk26 or @Neil Knight (but safer).
Both examples return the output above:
SELECT REPLACE(REPLACE(REPLACE('some   string    with         many     spaces'
    , ' ', ' ' + CHAR(7)), CHAR(7) + ' ', ''), ' ' + CHAR(7), ' ') AS NewStr
--but it removes CHAR(7) (Bell) from the string if it exists...
or
SELECT REPLACE(REPLACE(REPLACE('some   string    with         many     spaces'
    , ' ', ' ' + CHAR(7) + CHAR(7)), CHAR(7) + CHAR(7) + ' ', ''), ' ' + CHAR(7) + CHAR(7), ' ') AS NewStr
--but it removes CHAR(7) + CHAR(7) from the string
How it works:
Caution:
The character or string used as the marker must not already occur in the input, whether at the beginning, at the end, or standing alone.
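The three REPLACE steps can be traced on a tiny input. This is only a sketch: it uses a visible ~ in place of CHAR(7) so the intermediate strings are readable, and the variable names are made up.

```sql
-- Trace of the marker technique on 'a' + three spaces + 'b'.
DECLARE @s varchar(50) = 'a   b';

-- Step 1: every space becomes space + marker.
DECLARE @step1 varchar(100) = REPLACE(@s, ' ', ' ~');       -- 'a ~ ~ ~b'
-- Step 2: marker followed by a space is dropped, which eats the extra spaces.
DECLARE @step2 varchar(100) = REPLACE(@step1, '~ ', '');    -- 'a ~b'
-- Step 3: the one surviving space + marker collapses back to a single space.
DECLARE @step3 varchar(100) = REPLACE(@step2, ' ~', ' ');   -- 'a b'

SELECT @step1 AS step1, @step2 AS step2, @step3 AS step3;
```

The trace also shows why the marker must not appear in the data: any ~ already present next to a space would be consumed by step 2 or step 3.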
Here is a simple function I created for cleaning any spaces before or after, and multiple spaces within, a string. It gracefully handles up to about 108 spaces in a single stretch and as many blocks as there are in the string. You can increase that by factors of 8 by adding additional lines with larger chunks of spaces if you need to. It seems to perform quickly and has not caused any problems in spite of its generalized use in a large application.
CREATE FUNCTION [dbo].[fnReplaceMultipleSpaces] (@StrVal AS VARCHAR(4000))
RETURNS VARCHAR(4000)
AS
BEGIN
    SET @StrVal = Ltrim(@StrVal)
    SET @StrVal = Rtrim(@StrVal)
    SET @StrVal = REPLACE(@StrVal, REPLICATE(' ', 16), ' ') -- 16 spaces
    SET @StrVal = REPLACE(@StrVal, REPLICATE(' ', 8), ' ') -- 8 spaces
    SET @StrVal = REPLACE(@StrVal, REPLICATE(' ', 4), ' ') -- 4 spaces
    SET @StrVal = REPLACE(@StrVal, REPLICATE(' ', 2), ' ') -- 2 spaces
    SET @StrVal = REPLACE(@StrVal, REPLICATE(' ', 2), ' ') -- 2 spaces (for odd leftovers)
    RETURN @StrVal
END
Method #1
The first method is to replace extra spaces between words with an uncommon symbol combination as a temporary marker. Then you can replace the temporary marker symbols using the replace function rather than a loop.
Here is a code example that replaces text within a String variable.
DECLARE @testString AS VARCHAR(256) = '    Test text   with  random*   spacing. Please normalize  this spacing!';
SELECT REPLACE(REPLACE(REPLACE(@testString, ' ', '*^'), '^*', ''), '*^', ' ');
Execution Time Test #1: In ten runs of this replacement method, the average wait time on server replies was 1.7 milliseconds and total execution time was 4.6 milliseconds.
Execution Time Test #2: The average wait time on server replies was 1.7 milliseconds and total execution time was 3.7 milliseconds.
Method #2
The second method is not quite as elegant as the first, but also gets the job done. This method works by nesting four (or optionally more) replace statements that replace two blank spaces with one blank space.
DECLARE @testString AS VARCHAR(256) = '    Test text   with  random*   spacing. Please normalize  this spacing!';
SELECT REPLACE(REPLACE(REPLACE(REPLACE(@testString,'  ',' '),'  ',' '),'  ',' '),'  ',' ')
Execution Time Test #1: In ten runs of this replacement method, the average wait time on server replies was 1.9 milliseconds and total execution time was 3.8 milliseconds.
Execution Time Test #2: The average wait time on server replies was 1.8 milliseconds and total execution time was 4.8 milliseconds.
Method #3
The third method of replacing extra spaces between words is to use a simple loop. You can do a check on extra spaces in a while loop and then use the replace function to reduce the extra spaces with each iteration of the loop.
DECLARE @testString AS VARCHAR(256) = '    Test text   with  random*   spacing. Please normalize  this spacing!';
WHILE CHARINDEX('  ',@testString) > 0
    SET @testString = REPLACE(@testString, '  ', ' ')
SELECT @testString
Execution Time Test #1: In ten runs of this replacement method, the average wait time on server replies was 1.8 milliseconds and total execution time was 3.4 milliseconds.
Execution Time Test #2: The average wait time on server replies was 1.9 milliseconds and total execution time was 2.8 milliseconds.
This is the solution via multiple replace, which works for any strings (does not need special characters, which are not part of the string).
declare @value varchar(max)
declare @result varchar(max)
set @value = 'alpha   beta gamma  delta       xyz'
set @result = replace(replace(replace(replace(replace(replace(replace(
    @value,'a','ac'),'x','ab'),' ',' x'),'x ',''),'x',''),'ab','x'),'ac','a')
select @result -- 'alpha beta gamma delta xyz'
You can try this (note that this is Oracle syntax, as the dual table shows, not T-SQL):
select Regexp_Replace('single    spaces   only','( ){2,}', ' ') from dual;
Just Adding Another Method-
Replacing Multiple Spaces with Single Space WITHOUT Using REPLACE in SQL Server-
DECLARE @TestTable AS TABLE(input VARCHAR(MAX));
INSERT INTO @TestTable VALUES
('HAPPY         NEWYEAR     2020'),
('WELCOME       ALL     !');
SELECT
CAST('<r><![CDATA[' + input + ']]></r>' AS XML).value('(/r/text())[1] cast as xs:token?','VARCHAR(MAX)')
AS Expected_Result
FROM #TestTable;
--OUTPUT
/*
Expected_Result
HAPPY NEWYEAR 2020
WELCOME ALL !
*/
Found this while digging for an answer:
SELECT REPLACE(
        REPLACE(
            REPLACE(
                LTRIM(RTRIM('1 2  3   4    5     6'))
            ,'  ',' '+CHAR(7))
        ,CHAR(7)+' ','')
    ,CHAR(7),'') AS CleanString
where charindex('  ', '1 2  3   4    5     6') > 0
The full answer (with explanation) was pulled from: http://techtipsbysatish.blogspot.com/2010/08/sql-server-replace-multiple-spaces-with.html
On second look, seems to be just a slightly different version of the selected answer.
Please find the code below:
select trim(string_agg(value,' ')) from STRING_SPLIT('  single    spaces   only  ',' ')
where value<>' '
This worked for me..
Hope this helps...
With recent SQL Server versions you could also use string_split and string_agg. string_split requires compatibility level 130 or higher, and string_agg requires SQL Server 2017 or later.
string_split can return an ordinal column when provided with a third argument, which is available in SQL Server 2022 and Azure SQL (https://learn.microsoft.com/en-us/sql/t-sql/functions/string-split-transact-sql?view=sql-server-ver16#enable_ordinal). So we can preserve the order of the string_split output.
Using a common table expression, with WITHIN GROUP to make the concatenation order explicit:
with cte(value, ordinal) as (select value, ordinal from string_split('  a   b c  d   e  ', ' ', 1) where value <> '')
select string_agg(value, ' ') within group (order by ordinal) from cte
'  a   b c  d   e  ' results in 'a b c d e'
I use a FOR XML PATH solution to replace multiple spaces with a single space.
The idea is to replace the spaces with XML tags,
then split the XML string into string fragments without the XML tags,
and finally concatenate those string values with a single space character between each pair.
Here is how the final UDF can be called:
select dbo.ReplaceMultipleSpaces('   Sample text  with   multiple    space    ')

Resources