I have a data set of about 33 million rows and 20 columns. One of the columns holds raw data that I'm extracting relevant values from, including IDs and account numbers.
I extracted a column of User IDs into a temporary table to trim the User IDs of spaces. I'm now trying to add the trimmed User ID column back into the original data set using this code:
SELECT *
FROM [dbo].[DATA] AS A
INNER JOIN #TempTable AS B ON A.[RawColumn] = B.[RawColumn]
Extracting the User IDs and trimming the spaces took about a minute for each query. However, running this last query I'm at the 2-hour mark and I'm only 2% of the way through the dataset.
Is there a better way to run the query?
I'm running the query in SQL Server 2014 Management Studio
Thanks
Update:
I continued to let it run through the night. When I got back into work, only 6 million of the 33 million rows had been completed. I cancelled the execution and I'm trying to add a smaller primary key (the only other key I could see on the table was [RawColumn], which is a very long string of text) using:
ALTER TABLE [dbo].[DATA]
ADD ID INT IDENTITY(1,1)
Right now I'm an hour into the execution.
Next, I'm planning to make it the primary key using
ALTER TABLE dbo.[DATA]
ADD CONSTRAINT PK_DATA PRIMARY KEY (ID)
I'm not familiar with using indexes. I've tried looking up on Stack Overflow how to create one, but from what I'm reading, it sounds like creating an index would take just as long as running this query. Am I wrong about that?
For context on the RawColumn data, it looks something like this:
FirstName: John LastName: Smith UserID: JohnS Account#: 000-000-0000
Update #2:
I'm now learning that using "ALTER TABLE" is a bad idea. I should have done a little bit more research into how to add a primary key to a table.
Update #3:
Here's the code I used to extract the "UserID" value out of the "RawColumn" data.
DROP TABLE #TempTable1
GO
SELECT [RAWColumn],
SUBSTRING([RAWColumn], CHARINDEX('USERID:', [RAWColumn])+LEN('USERID:'), CHARINDEX('Account#:', [RAWColumn])-Charindex('Username:', [RAWColumn]) - LEN('Account#:') - LEN('USERID:')) AS 'USERID_NEW'
INTO #TempTable1
FROM [dbo].[DATA]
Next, I trimmed the data in the temporary table:
DROP TABLE #TempTable2
GO
SELECT [RawColumn],
LTRIM([USERID_NEW]) AS 'USERID_NEW'
INTO #TempTable2
FROM #TempTable1
So now I'm trying to get the data from #TempTable2 back into my original [DATA] table. Hopefully this is clearer now.
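In other words, the end state I'm after is roughly this (assuming I add a UserID column to [DATA] to hold the trimmed value):
ALTER TABLE [dbo].[DATA] ADD UserID VARCHAR(50);

UPDATE A
SET A.UserID = B.[USERID_NEW]
FROM [dbo].[DATA] AS A
INNER JOIN #TempTable2 AS B ON A.[RawColumn] = B.[RawColumn];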
So I think your parsing code is a little bit wrong. Here's an approach that doesn't assume the values appear in any particular order. It does assume that the header/tag name has a space after the colon character, and it assumes that the value ends at the subsequent space character. Here's a snippet that manipulates a single value.
declare @dat varchar(128) = 'FirstName: John LastName: Smith UserID: JohnS Account#: 000-000-0000';
declare @tag varchar(16) = 'UserID: ';
/* datalength() counts the trailing space character, unlike len() */
declare @idx int = charindex(@tag, @dat) + datalength(@tag);
select substring(@dat, @idx, charindex(' ', @dat + ' ', @idx + 1) - @idx) as UserID;
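Against the sample string above, this should return JohnS.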
To use it in a single query without the temporary variable, the most straightforward approach is to replace each instance of "@idx" with the original expression:
declare @tag varchar(16) = 'UserID: ';
select RawColumn,
    substring(
        RawColumn,
        charindex(@tag, RawColumn) + datalength(@tag),
        charindex(
            ' ', RawColumn + ' ',
            charindex(@tag, RawColumn) + datalength(@tag) + 1
        ) - charindex(@tag, RawColumn) - datalength(@tag)
    ) as UserID
from dbo.DATA;
As an update it looks something like this:
declare @tag varchar(16) = 'UserID: ';
update dbo.DATA
set UserID =
    substring(
        RawColumn,
        charindex(@tag, RawColumn) + datalength(@tag),
        charindex(
            ' ', RawColumn + ' ',
            charindex(@tag, RawColumn) + datalength(@tag) + 1
        ) - charindex(@tag, RawColumn) - datalength(@tag)
    );
You also appear to be ignoring upper/lower case in your string matches. It's not clear to me whether you need to consider that more carefully.
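If you do need to match the tag regardless of case (for example, under a case-sensitive collation), one sketch is to search an upper-cased copy of the string while still extracting from the original:
declare @dat varchar(128) = 'firstname: John lastname: Smith userid: JohnS account#: 000-000-0000';
declare @tag varchar(16) = 'USERID: ';
/* upper() doesn't change lengths, so a position found in the upper-cased
   copy lines up exactly with the original string */
declare @idx int = charindex(@tag, upper(@dat)) + datalength(@tag);
select substring(@dat, @idx, charindex(' ', @dat + ' ', @idx + 1) - @idx) as UserID;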
Using SQL Server (Azure or 2017) with Full-Text Search, I need to return possible matches on names.
Here's the simple scenario: an administrator is entering contact information for a new employee, first name, last name, address, etc. I want to be able to search the Employee table for a possible match on the name(s) to see if this employee has already been entered in the database.
This might happen as an autosuggest type of feature, or by simply displaying some similar results (like here on Stack Overflow) while the admin is entering the data.
I need to prevent duplicates!
If the admin enters "Bob", "Johnson", I want to be able to match on:
Bob Johnson
Rob Johnson
Robert Johnson
This will give the administrator the option of seeing whether this person has already been entered into the database and choosing one of those matches.
Is it possible to do this type of match on words like "Bob" and include "Robert" in the results? If so, what is necessary to accomplish this?
Try this.
You need to change the @per parameter value to your requirement. It indicates how many letters out of the length of the first name should match for a result to be returned. I just set it to 50% for testing purposes.
The dynamic SQL built inside the loop adds up a CHARINDEX check (1 or 0) for each letter of the first name in question, evaluated against every existing first name.
Caveats:
Repeating letters will of course be misleading: Bob counts 3 matches against Rob because each of Bob's letters (including both Bs) finds a hit in Rob.
I didn't consider two first names (like Bob Robert Johnson), so this will fail on those. You can improve on that, but you get the idea.
The final SQL query returns rows whose LetterMatch is greater than or equal to the value set in @per.
DECLARE @name varchar(MAX) = 'Bobby Johnson' --sample name
DECLARE @first varchar(50) = SUBSTRING(@name, 0, CHARINDEX(' ', @name)) --get the first part of the name before the space
DECLARE @last varchar(50) = SUBSTRING(@name, CHARINDEX(' ', @name) + 1, LEN(@name) - LEN(@first) - 1) --get the last part of the name after the space
DECLARE @walker int = 1 --for looping
DECLARE @per float = LEN(@first) * 0.50 --percentage of how many letters out of the length of the first name should match. I just used 50% for testing
DECLARE @char char(1) --for looping
DECLARE @sql varchar(MAX) --for dynamic SQL use
DECLARE @matcher varchar(MAX) = '' --for dynamic SQL use

WHILE @walker <> LEN(@first) + 1 BEGIN --loop through all the letters of the first name saved in the @first variable
    SET @char = SUBSTRING(@first, @walker, 1) --save the current letter in the iteration
    SET @matcher = @matcher + IIF(@matcher = '', '', ' + ') + 'IIF(CHARINDEX(''' + @char + ''', FirstName) > 0, 1, 0)' --build the additional column to be added to the dynamic SQL
    SET @walker = @walker + 1 --move the loop
END

SET @sql = 'SELECT * FROM (SELECT FirstName, LastName, ' + @matcher + ' AS LetterMatch
FROM TestName
WHERE LastName LIKE ' + '''%' + @last + '%''' + ') AS src
WHERE CAST(LetterMatch AS int) >= ROUND(' + CAST(@per AS varchar(50)) + ', 0)'

SELECT @sql --inspect the generated SQL
EXEC(@sql)
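For the sample name 'Bobby Johnson', the generated query (shown by SELECT @sql) should look roughly like this:
SELECT * FROM (SELECT FirstName, LastName, IIF(CHARINDEX('B', FirstName) > 0, 1, 0) + IIF(CHARINDEX('o', FirstName) > 0, 1, 0) + IIF(CHARINDEX('b', FirstName) > 0, 1, 0) + IIF(CHARINDEX('b', FirstName) > 0, 1, 0) + IIF(CHARINDEX('y', FirstName) > 0, 1, 0) AS LetterMatch
FROM TestName
WHERE LastName LIKE '%Johnson%') AS src
WHERE CAST(LetterMatch AS int) >= ROUND(2.5, 0)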
SELECT * FROM tbl_Names
WHERE Name LIKE '% user defined text %';
Putting the search text between % wildcards matches that text at any position in the data.
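Note that neither a plain LIKE nor letter counting will map 'Bob' to 'Robert', since they share almost no text. A common way to bridge nicknames is a small lookup table; here's a sketch (the NicknameAlias table, its columns, and the Employee column names are my own assumptions):
CREATE TABLE dbo.NicknameAlias (
    Nickname varchar(50) NOT NULL,
    FullName varchar(50) NOT NULL
);
INSERT INTO dbo.NicknameAlias (Nickname, FullName)
VALUES ('Bob', 'Robert'), ('Rob', 'Robert'), ('Bobby', 'Robert');

-- match the entered first name directly, against its canonical form,
-- or against sibling nicknames that share the same canonical form
DECLARE @first varchar(50) = 'Bob';
SELECT e.FirstName, e.LastName
FROM Employee AS e
WHERE e.FirstName = @first
   OR e.FirstName IN (SELECT FullName FROM dbo.NicknameAlias WHERE Nickname = @first)
   OR e.FirstName IN (SELECT a2.Nickname
                      FROM dbo.NicknameAlias AS a1
                      JOIN dbo.NicknameAlias AS a2 ON a2.FullName = a1.FullName
                      WHERE a1.Nickname = @first);
With 'Bob' entered, this returns Bob Johnson (direct match), Robert Johnson (canonical form), and Rob Johnson (sibling nickname).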
I have the following stored procedure that is quite expensive because of the dynamic @Name parameter and the subquery.
Is there a better, more efficient way to do this?
CREATE PROCEDURE [dbo].[spGetClientNameList]
@Name varchar(100)
AS
BEGIN
SET NOCOUNT ON;
SELECT
*
FROM
(
SELECT
ClientID,
FirstName + ' ' + LastName as Name
FROM
Client
) a
    where a.Name like '%' + @Name + '%'
END
Shamelessly stealing from two recent articles by Aaron Bertrand:
Follow-up #1 on leading wildcard seeks - Aaron Bertrand
One way to get an index seek for a leading %wildcard - Aaron Bertrand
The gist is to create something we can use that resembles a trigram (or trigraph) index in PostgreSQL.
Aaron Bertrand also includes a disclaimer as follows:
"Before I start to show how my proposed solution would work, let me be absolutely clear that this solution should not be used in every single case where LIKE '%wildcard%' searches are slow. Because of the way we're going to "explode" the source data into fragments, it is likely limited in practicality to smaller strings, such as addresses or names, as opposed to larger strings, like product descriptions or session abstracts."
test setup: http://rextester.com/IIMT54026
Client table
create table dbo.Client (
ClientId int not null primary key clustered
, FirstName varchar(50) not null
, LastName varchar(50) not null
);
insert into dbo.Client (ClientId, FirstName, LastName) values
(1, 'James','')
, (2, 'Aaron','Bertrand')
go
Function used by Aaron Bertrand to explode string fragments (modified for input size):
create function dbo.CreateStringFragments(@input varchar(101))
returns table with schemabinding
as return
(
    with x(x) as (
        select 1 union all select x + 1 from x where x < len(@input)
    )
    select Fragment = substring(@input, x, len(@input)) from x
);
go
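To see what the function produces, a quick check (my own example input):
select Fragment from dbo.CreateStringFragments('Aaron');
-- returns every suffix of the input: 'Aaron', 'aron', 'ron', 'on', 'n'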
Table to store fragments for FirstName + ' ' + LastName:
create table dbo.Client_NameFragments (
ClientId int not null
, Fragment varchar(101) not null
, constraint fk_ClientsNameFragments_Client
foreign key(ClientId) references dbo.Client
on delete cascade
);
create clustered index s_cat on dbo.Client_NameFragments(Fragment, ClientId);
go
Loading the table with fragments:
insert into dbo.Client_NameFragments (ClientId, Fragment)
select c.ClientId, f.Fragment
from dbo.Client as c
cross apply dbo.CreateStringFragments(FirstName + ' ' + LastName) as f;
go
Creating trigger to maintain fragments:
create trigger dbo.Client_MaintainFragments
on dbo.Client
for insert, update as
begin
set nocount on;
delete f from dbo.Client_NameFragments as f
inner join deleted as d
on f.ClientId = d.ClientId;
insert dbo.Client_NameFragments(ClientId, Fragment)
select i.ClientId, fn.Fragment
from inserted as i
cross apply dbo.CreateStringFragments(i.FirstName + ' ' + i.LastName) as fn;
end
go
Quick trigger tests:
/* trigger tests --*/
insert into dbo.Client (ClientId, FirstName, LastName) values
(3, 'Sql', 'Zim')
update dbo.Client set LastName = 'unknown' where LastName = '';
delete dbo.Client where ClientId = 3;
--select * from dbo.Client_NameFragments order by ClientId, len(Fragment) desc
/* -- */
go
New Procedure:
create procedure [dbo].[Client_getNameList] @Name varchar(100) as
begin
set nocount on;
select
ClientId
, Name = FirstName + ' ' + LastName
from Client c
where exists (
select 1
from dbo.Client_NameFragments f
where f.ClientId = c.ClientId
and f.Fragment like @Name + '%'
)
end
go
exec [dbo].[Client_getNameList] @Name = 'On Bert'
returns:
+----------+----------------+
| ClientId | Name |
+----------+----------------+
| 2 | Aaron Bertrand |
+----------+----------------+
Searches on a concatenated column sometimes won't use indexes. I ran into a situation like the above and replaced the concatenated search with OR, which gave me better performance most of the time.
Create nonclustered indexes on FirstName and LastName if they're not already present.
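For example (index names are my own):
CREATE NONCLUSTERED INDEX IX_Client_FirstName ON dbo.Client (FirstName);
CREATE NONCLUSTERED INDEX IX_Client_LastName ON dbo.Client (LastName);
A leading-wildcard LIKE still can't seek on these, but scanning two narrow indexes can be cheaper than scanning the whole table.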
Check the performance after modifying the above Procedure like below
CREATE PROCEDURE [dbo].[spGetClientNameList]
@Name varchar(100)
AS
BEGIN
SET NOCOUNT ON;
SELECT
ClientID,
FirstName + ' ' + LastName as Name
FROM
Client
WHERE FirstName LIKE '%' + @Name + '%'
OR LastName LIKE '%' + @Name + '%'
END
And do check execution plans to verify those Indexes are used or not.
The problem really comes down to having to compute the column (concatenating the first name and last name); that pretty much forces SQL Server into a full scan of the table to determine what matches and what doesn't. If you're not allowed to add indexes or alter the table, you'll have to change the query around (supply FirstName and LastName separately). If you are, you can add a computed column and index it:
Create Table client (
ClientId INT NOT NULL PRIMARY KEY IDENTITY(1,1)
,FirstName VARCHAR(100)
,LastName VARCHAR(100)
,FullName AS FirstName + ' ' + LastName
)
Create index FullName ON Client(FullName)
This will at least speed your query up by doing index seeks instead of full table scans. Is it worth it? It's difficult to say without looking at how much data there is, etc.
where a.Name like '%' + @Name + '%'
This predicate can never use an index. In this situation it's better to use Full-Text Search.
If you can restrict your LIKE to
where a.Name like @Name + '%'
it will use an index automatically. Moreover, you can use the REVERSE() function to make a statement like the following indexable:
where a.Name like '%' + @Name
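A sketch of the REVERSE() idea (the computed column and index names are mine): store the reversed name once, index it, and turn the trailing match into a leading one.
ALTER TABLE Client ADD NameReversed AS REVERSE(FirstName + ' ' + LastName) PERSISTED;
CREATE INDEX IX_Client_NameReversed ON Client (NameReversed);

-- Name LIKE '%' + @Name (ends-with) becomes an index-friendly starts-with:
SELECT ClientID, FirstName + ' ' + LastName AS Name
FROM Client
WHERE NameReversed LIKE REVERSE(@Name) + '%';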
I have a field in my table which has multiple reason codes concatenated in one column.
e.g. 2 records
Reason_Codes
Record1: 001,002,004,009,010
Record2: 001,003,005,006
In my SSRS report the user will be searching for data using one of the above reason codes. e.g.
001 will retrieve both records.
005 will retrieve the second record
and so on.
Kindly advise how this can be achieved using SQL or a stored procedure.
Many thanks.
If you are just passing in a single Reason Code to search on, you don't even need to bother with splitting the comma-separated list: you can just use a LIKE clause as follows:
SELECT tb.field1, tb.field2
FROM SchemaName.TableName tb
WHERE ',' + tb.Reason_Codes + ',' LIKE '%,' + @ReasonCode + ',%';
Try the following to see:
DECLARE @Bob TABLE (ID INT IDENTITY(1, 1) NOT NULL, ReasonCodes VARCHAR(50));
INSERT INTO @Bob (ReasonCodes) VALUES ('001,002,004,009,010');
INSERT INTO @Bob (ReasonCodes) VALUES ('001,003,005,006');

DECLARE @ReasonCode VARCHAR(5);
SET @ReasonCode = '001';

SELECT tb.ID, tb.ReasonCodes
FROM @Bob tb
WHERE ',' + tb.ReasonCodes + ',' LIKE '%,' + @ReasonCode + ',%';
-- returns both rows

SET @ReasonCode = '005';

SELECT tb.ID, tb.ReasonCodes
FROM @Bob tb
WHERE ',' + tb.ReasonCodes + ',' LIKE '%,' + @ReasonCode + ',%';
-- returns only row #2
I blogged about something like this a long time ago. Maybe this will help: http://dotnetinternal.blogspot.com/2013/10/comma-separated-to-temp-table.html
The core solution is to convert the comma-separated values into a temporary table and then run a simple query on the temporary table to get your desired result.
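If you're on SQL Server 2016 or later, STRING_SPLIT can do the splitting for you; a minimal sketch against the @Bob example above:
DECLARE @ReasonCode VARCHAR(5) = '005';

SELECT tb.ID, tb.ReasonCodes
FROM @Bob tb
CROSS APPLY STRING_SPLIT(tb.ReasonCodes, ',') s
WHERE s.value = @ReasonCode;
-- returns only row #2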
I have SQL Express 2012 with a table to which I added an integer column for a word count; the table already has a few hundred rows. I am not sure how to update the column to hold the word count from the "Entry" column.
I created a query that shows me the data, but how do I use it to update the table so it stores the word count for each entry?
SELECT
ID,
[UserName],
[DateCreated],
LEN([Entry]) - LEN(REPLACE([Entry], ' ', '')) + 1 AS 'Word Count'
FROM [dbo].[Notes]
The verb in the SQL language to update data in a table is, not surprisingly, UPDATE. The documentation has the full syntax.
If you want to update all rows and there are no NULL values in the Entry column (that would make the calculation fail) then this query will update a column named WordCount:
UPDATE Notes SET WordCount = LEN([Entry]) - LEN(REPLACE([Entry], ' ', '')) + 1
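If NULLs are possible in Entry, a guarded variant (my addition):
UPDATE Notes SET WordCount = LEN([Entry]) - LEN(REPLACE([Entry], ' ', '')) + 1
WHERE [Entry] IS NOT NULL;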
Try this piece of code.
UPDATE Notes SET
WordCount = LEN([Entry]) - LEN(REPLACE([Entry], ' ', '')) + 1
This should then update all rows in the table with the word count for that row.
Thanks,
Here is how you could do this so that your values are always current. There are two big advantages here. First, you never have to update your table. Second, the values will ALWAYS be current, even if somebody updates your table with a query and doesn't update WordCount.
create table #test
(
Entry varchar(100)
, WordCount as LEN(Entry) - LEN(REPLACE(Entry, ' ', '')) + 1
)
insert #test
select 'two words' union all
select 'three words now'
select * from #test
drop table #test
I have data like below in one of the columns in a table.
john;144;ny;
Nelson;154;NY;
john;144;NC;
john;144;kw;
I want to retrieve the rows which have lowercase in the 3rd part of the data,
so I need to get:
john;144;kw;
john;144;ny;
Is it possible to get the data like this?
Force a case-sensitive match, then compare the forced-lowercase value to the original:
SELECT ...
FROM ..
WHERE LOWER(name) = name COLLATE Latin1_General_CS_AS
^^---case sensitive
If the name is all-lower to start with, then LOWER() won't change it, and you'll get a match. If it's something like John, then you'd be doing john = John and the case-sensitivity will fail the match.
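A quick way to see the collation at work (my own example):
-- returns 'no match': under the case-sensitive collation, 'john' <> 'John'
SELECT CASE WHEN 'john' = 'John' COLLATE Latin1_General_CS_AS
            THEN 'match' ELSE 'no match' END;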
This does not really answer your question, and it certainly adds nothing to Marc's existing answer in terms of resolving your actual problem; it is merely meant as a demonstration of how simple it is to correct your design (this whole script runs in about a second on my local Express instance of SQL Server 2012).
CREATE TABLE dbo.T
(
ThreePartData VARCHAR(60)
);
-- INSERT 20,000 ROWS
INSERT dbo.T (ThreePartData)
SELECT t.ThreePartName
FROM (VALUES ('john;144;ny;'), ('Nelson;154;NY;'), ('john;144;NC;'), ('john;144;kw;')) t (ThreePartName)
CROSS JOIN
( SELECT TOP (5000) Number = 1
FROM sys.all_objects a
CROSS APPLY sys.all_objects b
) n;
GO
-- HERE IS WHERE THE CHANGES START
/**********************************************************************/
-- ADD A COLUMN FOR EACH COMPONENT
ALTER TABLE dbo.T ADD PartOne VARCHAR(20),
PartTwo VARCHAR(20),
PartThree VARCHAR(20);
GO
-- UPDATE THE PARTS WITH THEIR CORRESPONDING COMPONENT
UPDATE dbo.T
SET PartOne = PARSENAME(REPLACE(ThreePartData, ';', '.') + 't', 4),
PartTwo = PARSENAME(REPLACE(ThreePartData, ';', '.') + 't', 3),
PartThree = PARSENAME(REPLACE(ThreePartData, ';', '.') + 't', 2);
GO
-- GET RID OF CURRENT COLUMN
ALTER TABLE dbo.T DROP COLUMN ThreePartData;
GO
-- CREATE A NEW COMPUTED COLUMN THAT REBUILDS THE CONCATENATED STRING
ALTER TABLE dbo.T ADD ThreePartData AS CONCAT(PartOne, ';', PartTwo, ';', PartThree, ';');
GO
-- OR FOR VERSIONS BEFORE 2012
--ALTER TABLE dbo.T ADD ThreePartData AS PartOne + ';' + PartTwo + ';' + PartThree + ';';
Then your query is as simple as:
SELECT *
FROM T
WHERE LOWER(PartThree) = PartThree COLLATE Latin1_General_CS_AS;
And since you have recreated a computed column with the same name, any select statements in use will not be affected, although updates and inserts will need addressing.
Using BINARY_CHECKSUM we can retrieve the rows whose text is entirely lowercase:
CREATE TABLE #test
(
NAME VARCHAR(50)
)
INSERT INTO #test
VALUES ('john;144;ny;'),
('Nelson;154;NY;'),
('john;144;NC;'),
('john;144;kw;')
SELECT *
FROM #test
WHERE Binary_checksum(NAME) = Binary_checksum(Lower(NAME))
OUTPUT
name
-----------
john;144;ny;
john;144;kw;