How to convert varchar to ASCII7 - sql-server

I want to replace any Latin / accented characters with their basic alphabet letters and strip out everything that can't be converted.
Examples:
'ë' to be replaced with 'e'
'ß' to be replaced with 's', or 'ss' if possible; if neither is possible, then strip it
I am able to do this in C# code, but I'm just not experienced enough in MSSQL to solve this without it taking many days.
UPDATE: the data in the varchar column is populated by a trigger on another table, which should contain normal Unicode text. I want to convert the text to ASCII-7 in a function, to use for further processing.
UPDATE: I would prefer a solution where this can be done in SQL only, avoiding custom character mapping. Can this be done, or is it currently just not possible?

As Aaron said, I don't think you can dispose of mapping tables entirely in SQL, but mapping characters to ASCII-7 should involve some fairly simple tables, used in conjunction with accent-insensitive (AI) collations. Here there are two tables: a numbers table used to walk the string character by character, and one holding the target letters of the alphabet (which could be expanded if necessary).
By using the AI collations, I get around a lot of explicit mapping definitions.
-----------------------------------------------
-- One time mapping table setup
CREATE TABLE t4000(i INT PRIMARY KEY);
GO
INSERT INTO t4000 --Just a simple list of integers from 1 to 4000
SELECT ROW_NUMBER()OVER(ORDER BY a.x)
FROM (VALUES(1),(2),(3),(4),(5),(6),(7),(8),(9),(10)) a(x)
CROSS APPLY (VALUES(1),(2),(3),(4),(5),(6),(7),(8),(9),(10)) b(x)
CROSS APPLY (VALUES(1),(2),(3),(4),(5),(6),(7),(8),(9),(10)) c(x)
CROSS APPLY (VALUES(1),(2),(3),(4)) d(x)
GO
CREATE TABLE TargetChars(ch NVARCHAR(2) COLLATE Latin1_General_CS_AI PRIMARY KEY);
GO
INSERT TargetChars -- A-Z, a-z, ss
SELECT TOP(128) CHAR(i)
FROM t4000
WHERE i BETWEEN 65 AND 90
OR i BETWEEN 97 AND 122
UNION ALL
SELECT 'ss'
-- plus any other special targets here
GO
-----------------------------------------------
-- function
CREATE FUNCTION dbo.TrToA7(@str NVARCHAR(4000))
RETURNS NVARCHAR(4000)
AS
BEGIN
DECLARE @mapped NVARCHAR(4000) = '';
SELECT TOP(LEN(@str))
@mapped += ISNULL(tc.ch, SUBSTRING(@str, i, 1))
FROM t4000
LEFT JOIN TargetChars tc ON tc.ch = SUBSTRING(@str, i, 1)
COLLATE Latin1_General_CS_AI
ORDER BY i; -- walk the character positions in order
RETURN @mapped;
END
GO
Usage example:
SELECT dbo.TrToA7('It was not á tötal löß.');
Result:
--------------------------
It was not a total loss.
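To see why the AI (accent-insensitive) collation does the heavy lifting here, note that under such a collation an accented character compares equal to its base letter. A quick sanity check:
SELECT CASE WHEN N'ë' = N'e' COLLATE Latin1_General_CS_AI
            THEN 'match' ELSE 'no match' END; -- returns 'match'
This equality is exactly what lets the LEFT JOIN above find 'e' in TargetChars when the input character is 'ë', without any explicit accent mapping.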

Related

not able to identify difference between same value

I have data in a table's column. I SELECT DISTINCT on that column, and I also wrapped it in LTRIM(RTRIM(col_name)) when writing the SELECT. But I still get what look like duplicate records.
How can I identify why this is happening, and how can I avoid it?
I tried the RTRIM, LTRIM and UPPER functions. Still no help.
Query:
select distinct LTRIM(RTRIM(serverstatus))
from SQLInventory
Output:
Development
Staging
Test
Pre-Production
UNKNOWN
NULL
Need to be decommissioned
Production
Pre-Produc​tion
Decommissioned
Non-Production
Unsupported Edition
Looks like there's a unicode character in there, somewhere. I copied and pasted the values out initially as a varchar, and did the following:
SELECT DISTINCT serverstatus
FROM (VALUES('Development'),
('Staging'),
('Test'),
('Pre-Production'),
('UNKNOWN'),
('NULL'),
('Need to be decommissioned'),
('Production'),
(''),
('Pre-Produc​tion'),
('Decommissioned'),
('Non-Production'),
('Unsupported Edition'))V(serverstatus);
This, interestingly, returned the values below:
Development
Staging
Test
Pre-Production
UNKNOWN
NULL
Need to be decommissioned
Production
Pre-Produc?tion
Decommissioned
Non-Production
Unsupported Edition
Note that one of the values is Pre-Produc?tion, meaning that there is a unicode character between the c and t.
So, let's find out what it is:
SELECT 'Pre-Produc​tion', N'Pre-Produc​tion',
UNICODE(SUBSTRING(N'Pre-Produc​tion',11,1));
The UNICODE function returns 8203, which is a zero-width space. I assume you want to remove these, so you can update your data by doing:
UPDATE SQLInventory
SET serverstatus = REPLACE(serverstatus, NCHAR(8203), N'');
Now your first query should work as you expect.
(I also suggest you might therefore want a lookup table for your statuses, with a foreign key, so that this can't happen again.)
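A minimal sketch of such a lookup table (the table and column names here are illustrative, not from the original schema):
CREATE TABLE ServerStatus
(
    StatusID INT IDENTITY PRIMARY KEY,
    StatusName NVARCHAR(50) NOT NULL UNIQUE -- 'Production', 'Staging', ...
);
ALTER TABLE SQLInventory
    ADD StatusID INT NULL
        CONSTRAINT FK_SQLInventory_ServerStatus
        REFERENCES ServerStatus (StatusID);
With status values constrained to the lookup table, a stray zero-width space can never creep back in.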
I deal with this type of thing all the time. For stuff like this, NGrams8K, PatReplace8K and PATINDEX are your best friends.
Putting what you posted in a table variable we can analyze the problem:
DECLARE @table TABLE (txtID INT IDENTITY, txt NVARCHAR(100));
INSERT @table (txt)
VALUES ('Development'),('Staging'),('Test'),('Pre-Production'),('UNKNOWN'),(NULL),
('Need to be decommissioned'),('Production'),(''),('Pre-Produc​tion'),('Decommissioned'),
('Non-Production'),('Unsupported Edition');
This query will identify items with characters other than A-Z, spaces and hyphens:
SELECT t.txtID, t.txt
FROM @table AS t
WHERE PATINDEX('%[^a-zA-Z -]%',t.txt) > 0;
This returns:
txtID txt
----------- -------------------------------------------
10 Pre-Produc​tion
To identify the bad character we can use NGrams8k like this:
SELECT t.txtID, t.txt, ng.position, ng.token -- ,UNICODE(ng.token)
FROM @table AS t
CROSS APPLY dbo.NGrams8K(t.txt,1) AS ng
WHERE PATINDEX('%[^a-zA-Z -]%',ng.token)>0;
Which returns:
txtID txt position token
------ ----------------- -------------------- ---------
10 Pre-Produc​tion 11 ?
PatReplace8K makes cleaning up stuff like this quick and easy. First note this query:
SELECT OldString = t.txt, p.NewString
FROM @table AS t
CROSS APPLY dbo.patReplace8K(t.txt,'%[^a-zA-Z -]%','') AS p
WHERE PATINDEX('%[^a-zA-Z -]%',t.txt) > 0;
Which returns this on my system:
OldString NewString
------------------ ----------------
Pre-Produc?tion Pre-Production
To fix the problem you can use PatReplace8K like this:
UPDATE t
SET txt = p.newString
FROM @table AS t
CROSS APPLY dbo.patReplace8K(t.txt,'%[^a-zA-Z -]%','') AS p
WHERE PATINDEX('%[^a-zA-Z -]%',t.txt) > 0;
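If you don't have PatReplace8K installed, a targeted fix for just the zero-width space identified above works with a plain REPLACE (a minimal sketch against the same table variable):
UPDATE @table
SET txt = REPLACE(txt, NCHAR(8203), N'')
WHERE txt LIKE '%' + NCHAR(8203) + '%';
This only handles the one character found here, whereas the pattern-based cleanup above removes anything outside the allowed set.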

Change wrong french characters on table

For some reason we found characters like this in the database: Ã
I can assume this character represents the character é.
Now I need to revise the whole table, checking all the other characters to make sure there are no others.
Where can I find the mapping of characters, for example between this Ã and é? Or perhaps there is a ready-made SQL function to make those replacements.
I'm using SQL Server 2014.
As mentioned by Daniel E., your dirty data might have been caused by the use of an incorrect code page (UTF-8 that was interpreted as ISO 8859-1).
One way to find entries with dirty data is to use a "not exists" ("^") LIKE expression with the list of valid characters inside it. See the example below.
declare @t table (name varchar(20))
insert into @t values ('touché')
insert into @t values ('encore touché')
insert into @t values ('reçu')
insert into @t values ('hello world')
select * from @t where name like '%[^a-zA-Z., -]%'
select * from @t where name like '%[^a-zA-Z.,èêé -]%' COLLATE Latin1_General_BIN
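If the damage really is UTF-8 read as ISO 8859-1, each accented character becomes a fixed two-character sequence (é becomes Ã©, è becomes Ã¨, ç becomes Ã§, and so on), so a chain of REPLACE calls can undo the common cases. A sketch, with an illustrative table and column name:
UPDATE MyTable
SET MyColumn =
    REPLACE(REPLACE(REPLACE(MyColumn, 'Ã©', 'é'), 'Ã¨', 'è'), 'Ã§', 'ç');
Extend the mapping as you discover more pairs; the pattern query above will show you which rows still contain suspect characters.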

Separating Numeric Values from string

I have an issue whereby I need to separate the value BPO007 into BPO and 007. In other cases, for example GFE0035, I still need to split the numeric value from the characters.
When I run select isnumeric('BPO007') the result is 0, which is correct; however, I'm not sure how to split these from each other.
I've had a look at this link, but it does not really answer my question.
I need the above split for a separate validation purpose in my trigger.
How would I develop something like this?
Thank you in advance.
As told in a comment before, about your question How would I develop something like this?:
Do not store two pieces of information within the same column (read about 1NF). You should keep separate columns in your database for BPO and for 007 (or rather the integer 7).
Then use some string methods to compute BPO007 when you need it in your output.
But, just so you're not left alone in the rain:
DECLARE @tbl TABLE(YourColumn VARCHAR(100));
INSERT INTO @tbl VALUES('BPO007'),('GFE0035');
SELECT YourColumn,pos
,LEFT(YourColumn,pos) AS CharPart
,CAST(SUBSTRING(YourColumn,pos+1,1000) AS INT) AS NumPart
FROM @tbl
CROSS APPLY(SELECT PATINDEX('%[0-9]%',YourColumn)-1) AS A(pos);
Will return
YourColumn pos CharPart NumPart
BPO007 3 BPO 7
GFE0035 3 GFE 35
Hint: I use a CROSS APPLY here to compute the position of the first numeric character and then use pos in the actual query like you'd use a variable. Otherwise the PATINDEX would have to be repeated...
Since the length of the number and text portions varies, you can use the following code:
DECLARE @NUMERIC TABLE (Col VARCHAR(50))
INSERT INTO @NUMERIC VALUES('BPO007'),('GFE0035'),('GFEGVT003509'),('GFEMTS10035')
SELECT
Col,
LEFT(Col,LEN(Col)-LEN(SUBSTRING(Col,PATINDEX('%[0-9]%',Col),DATALENGTH(Col)))) AS TEXTs,
RIGHT(Col,LEN(Col)-LEN(LEFT(Col,LEN(Col)-LEN(SUBSTRING(Col,PATINDEX('%[0-9]%',Col),DATALENGTH(Col)))))) AS NUMERICs
FROM @NUMERIC
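For comparison, here is a somewhat simpler formulation of the same idea (a sketch; it assumes every value contains at least one digit, as in the samples above):
SELECT Col,
       LEFT(Col, PATINDEX('%[0-9]%', Col) - 1)            AS TEXTs,
       SUBSTRING(Col, PATINDEX('%[0-9]%', Col), LEN(Col)) AS NUMERICs
FROM @NUMERIC;
Everything before the first digit is the text part; everything from the first digit onward is the numeric part.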

Using SQLServer contains for partial words

We run product searches over a huge catalog, matching partial barcodes.
We started with a simple LIKE query:
select * from products where barcode like '%2345%'
But that takes way too long, since it requires a full table scan.
We thought a full-text search would be able to help us here, using CONTAINS:
select * from products where contains(barcode, '2345')
But it seems CONTAINS doesn't support finding words that partially contain a text, only a full word match or a prefix. (In this example we're searching for '2345' inside the stored barcode '123456'.)
My answer is: @DenisReznik was right :)
OK, let's take a look.
I have worked with barcodes and big catalogs for many years, and I was curious about this question.
So I have made some tests on my own.
I created a table to store test data:
CREATE TABLE [like_test](
[N] [int] NOT NULL PRIMARY KEY,
[barcode] [varchar](40) NULL
)
I know that there are many types of barcodes: some contain only numbers, others also contain letters, and others can be even more complex.
Let's assume our barcode is a random string.
I filled the table with 10 million records of random alphanumeric data:
insert into like_test
select (select count(*) from like_test)+n, REPLACE(convert(varchar(40), NEWID()), '-', '') barcode
from FN_NUMBERS(10000000)
FN_NUMBERS() is just a function I use in my DBs (a sort of tally table) to get records quickly.
I got 10 million records like that:
N barcode
1 1C333262C2D74E11B688281636FAF0FB
2 3680E11436FC4CBA826E684C0E96E365
3 7763D29BD09F48C58232C7D33551E6C9
Let's declare a var to search for:
declare @s varchar(20) = 'D34F15' -- a random alphanumeric string
Let's take a baseline with LIKE, to compare results against:
select * from like_test where barcode like '%'+@s+'%'
On my workstation it takes 24.4 secs for a full clustered index scan. Very slow.
SSMS suggests adding an index on the barcode column:
CREATE NONCLUSTERED INDEX [ix_barcode] ON [like_test] ([barcode]) INCLUDE ([N])
500MB of index. I retried the select: this time 24.0 secs for the nonclustered index seek.. less than 2% better, almost the same result. Very far from the 75% improvement suggested by SSMS. It seems to me this index really isn't worth it. Maybe my Samsung 840 SSD is making the difference..
For the moment I'll leave the index active.
Let's try the CHARINDEX solution:
select * from like_test where charindex(@s, barcode) > 0
This time it took 23.5 seconds to complete, not really much better than LIKE.
Now let's check @DenisReznik's suggestion that using a binary collation should speed things up:
select * from like_test
where barcode collate Latin1_General_BIN like '%'+@s+'%' collate Latin1_General_BIN
WOW, it seems to work! Only 4.5 secs, this is impressive! 5 times better..
So, what about CHARINDEX and the collation together? Let's try it:
select * from like_test
where charindex(@s collate Latin1_General_BIN, barcode collate Latin1_General_BIN)>0
Unbelievable! 2.4 secs, 10 times better..
Ok, so far I have established that CHARINDEX is better than LIKE, and that a binary collation is better than a normal string collation, so from now on I will continue with CHARINDEX and the binary collation only.
Now, can we do anything else to get even better results? Maybe we can try to reduce our very long strings.. a scan is always a scan..
First try: a logical string cut using SUBSTRING, to virtually work on barcodes of 8 chars:
select * from like_test
where charindex(
@s collate Latin1_General_BIN,
SUBSTRING(barcode, 12, 8) collate Latin1_General_BIN
)>0
Fantastic! 1.8 seconds.. I tried both SUBSTRING(barcode, 1, 8) (the head of the string) and SUBSTRING(barcode, 12, 8) (the middle of the string), with the same results.
Then I tried to physically reduce the size of the barcode column; there was almost no difference compared with using SUBSTRING().
Finally I dropped the index on the barcode column and repeated ALL the above tests...
I was very surprised to get almost the same results, with very small differences.
The index performs 3-5% better, but at the cost of 500MB of disk space, plus maintenance cost if the catalog is updated.
Naturally, for a direct key lookup like where barcode = @s, with the index it takes 20-50 millisecs; without the index we can't get below 1.1 secs, using the collation syntax where barcode collate Latin1_General_BIN = @s collate Latin1_General_BIN
This was interesting.
I hope this helps
I often use CHARINDEX, and just as often have this very debate.
As it turns out, depending on your structure, you may actually get a substantial performance boost.
http://cc.davelozinski.com/sql/like-vs-substring-vs-leftright-vs-charindex
The good option for your case here is creating your own full-text-style index. Here is how it could be implemented:
1) Create table Terms:
CREATE TABLE Terms
(
Id int IDENTITY NOT NULL,
Term varchar(21) NOT NULL,
CONSTRAINT PK_TERMS PRIMARY KEY (Term),
CONSTRAINT UK_TERMS_ID UNIQUE (Id)
)
Note: declaring an index inline in the table definition is a feature of SQL Server 2014. If you have a lower version, just move it out of the CREATE TABLE statement and create it separately.
2) Cut the barcodes into grams (suffixes), and save each of them to the Terms table. For example, for barcode = '123456' your table should have 6 rows: '123456', '23456', '3456', '456', '56', '6'. A sketch of this step is shown below.
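A sketch of this step in T-SQL, assuming a numbers table dbo.Numbers(n) is available (the loader logic here is illustrative):
INSERT INTO Terms (Term)
SELECT DISTINCT SUBSTRING(b.barcode, n.n, LEN(b.barcode))
FROM Barcodes AS b
JOIN dbo.Numbers AS n
    ON n.n <= LEN(b.barcode)
WHERE NOT EXISTS (SELECT 1 FROM Terms AS t
                  WHERE t.Term = SUBSTRING(b.barcode, n.n, LEN(b.barcode)));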
3) Create table BarcodesIndex:
CREATE TABLE BarcodesIndex
(
TermId int NOT NULL,
BarcodeId int NOT NULL,
CONSTRAINT PK_BARCODESINDEX PRIMARY KEY (TermId, BarcodeId),
CONSTRAINT FK_BARCODESINDEX_TERMID FOREIGN KEY (TermId) REFERENCES Terms (Id),
CONSTRAINT FK_BARCODESINDEX_BARCODEID FOREIGN KEY (BarcodeId) REFERENCES Barcodes (Id)
)
4) Save a pair (TermId, BarcodeId) for each barcode into the BarcodesIndex table. TermId was generated in the second step or already exists in the Terms table. BarcodeId is the identifier of the barcode, stored in the Barcodes table (or whatever name you use for it). For the example barcode, there should be 6 rows in the BarcodesIndex table.
5) Select barcodes by their parts using the following query:
SELECT b.* FROM Terms t
INNER JOIN BarcodesIndex bi
ON t.Id = bi.TermId
INNER JOIN Barcodes b
ON bi.BarcodeId = b.Id
WHERE t.Term LIKE 'SomeBarcodePart%'
This solution forces all similar parts of barcodes to be stored near each other, so SQL Server will use an Index Range Scan strategy to fetch data from the Terms table. Terms in the Terms table should be unique, to keep the table as small as possible. This could be done in the application logic (check existence, then insert if the term doesn't exist), or by setting the IGNORE_DUP_KEY option on the clustered index of the Terms table. The BarcodesIndex table is used to relate Terms and Barcodes.
Please note that the foreign keys and constraints in this solution are points for consideration. Personally, I prefer to have foreign keys until they hurt me. A sketch of the IGNORE_DUP_KEY variant follows.
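If you go the IGNORE_DUP_KEY route mentioned above, the loader can insert terms blindly and let the engine silently skip duplicates (a sketch):
ALTER TABLE Terms DROP CONSTRAINT PK_TERMS;
ALTER TABLE Terms ADD CONSTRAINT PK_TERMS
    PRIMARY KEY (Term) WITH (IGNORE_DUP_KEY = ON);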
After further testing and reading, and after talking with @DenisReznik, I think the best option could be to add virtual (computed) columns to the barcode table to split the barcode.
We only need columns for start positions from the 2nd to the 5th, because for the 1st we will use the original barcode column, and the last one I think is not useful at all (what kind of partial match is 1 char out of 6, when 60% of records will match?):
CREATE TABLE [like_test](
[N] [int] NOT NULL PRIMARY KEY,
[barcode] [varchar](6) NOT NULL,
[BC2] AS (substring([BARCODE],(2),(5))),
[BC3] AS (substring([BARCODE],(3),(4))),
[BC4] AS (substring([BARCODE],(4),(3))),
[BC5] AS (substring([BARCODE],(5),(2)))
)
and then add indexes on these virtual columns:
CREATE NONCLUSTERED INDEX [IX_BC2] ON [like_test] ([BC2]);
CREATE NONCLUSTERED INDEX [IX_BC3] ON [like_test] ([BC3]);
CREATE NONCLUSTERED INDEX [IX_BC4] ON [like_test] ([BC4]);
CREATE NONCLUSTERED INDEX [IX_BC5] ON [like_test] ([BC5]);
CREATE NONCLUSTERED INDEX [IX_BC6] ON [like_test] ([barcode]);
now we can simply find partial matches with this query:
declare @s varchar(40)
declare @l int
set @s = '654'
set @l = LEN(@s)
select N from like_test
where 1=0
OR ((barcode = @s) and (@l=6)) -- to match the full code (remove if not needed)
OR ((barcode like @s+'%') and (@l<6)) -- to match strings of up to 5 chars from the beginning
or ((BC2 like @s+'%') and (@l<6)) -- to match strings of up to 5 chars from the 2nd position
or ((BC3 like @s+'%') and (@l<5)) -- to match strings of up to 4 chars from the 3rd position
or ((BC4 like @s+'%') and (@l<4)) -- to match strings of up to 3 chars from the 4th position
or ((BC5 like @s+'%') and (@l<3)) -- to match strings of up to 2 chars from the 5th position
this is HELL fast!
for search strings of 6 chars 15-20 milliseconds (full code)
for search strings of 5 chars 25 milliseconds (20-80)
for search strings of 4 chars 50 milliseconds (40-130)
for search strings of 3 chars 65 milliseconds (50-150)
for search strings of 2 chars 200 milliseconds (190-260)
There will be no additional space used for the table, but each index will take up to 200MB (for 1 million barcodes).
PAY ATTENTION
Tested on Microsoft SQL Server Express (64-bit) and Microsoft SQL Server Enterprise (64-bit); the optimizer of the latter is slightly better, but the main difference is this:
on Express edition you have to extract ONLY the primary key when searching your string; if you add other columns to the SELECT, the optimizer will no longer use the indexes but will go for a full clustered index scan, so you will need something like
;with
k as (-- extract only the primary key
select N from like_test
where 1=0
OR ((barcode = @s) and (@l=6))
OR ((barcode like @s+'%') and (@l<6))
or ((BC2 like @s+'%') and (@l<6))
or ((BC3 like @s+'%') and (@l<5))
or ((BC4 like @s+'%') and (@l<4))
or ((BC5 like @s+'%') and (@l<3))
)
select N
from like_test t
where exists (select 1 from k where k.n = t.n)
on Standard (Enterprise) edition you HAVE to go for
select * from like_test -- take a look at the star
where 1=0
OR ((barcode = @s) and (@l=6))
OR ((barcode like @s+'%') and (@l<6))
or ((BC2 like @s+'%') and (@l<6))
or ((BC3 like @s+'%') and (@l<5))
or ((BC4 like @s+'%') and (@l<4))
or ((BC5 like @s+'%') and (@l<3))
You do not give many constraints, which means you want to search for a string within a string; if there were a way to optimize an index for searching a string within a string, it would just be built in!
Other things that make it hard to give a specific answer:
It's not clear what "huge" and "too long" mean.
It's not clear how your application works. Are you searching in batches as you add 1,000 new products? Are you allowing a user to enter a partial barcode in a search box?
I can make some suggestions that may or may not be helpful in your case.
Speed up some of the queries
I have a database with lots of licence plates; sometimes an officer wants to search by the last 3 characters of the plate. To support this I store the license plate reversed, then use LIKE ('ZYX%') to match ABCXYZ. When doing the search, they have the option of a 'contains' search (like you have), which is slow, or a 'begins/ends with' search, which is super fast because of the index. This would solve your problem some of the time (which may be good enough), especially if this is a common need. A sketch of the trick is below.
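A minimal sketch of the reversed-column trick (table and column names are illustrative): store the reversed value in an indexed computed column, and an 'ends with' search becomes an index-friendly prefix search.
ALTER TABLE Plates ADD PlateReversed AS REVERSE(Plate);
CREATE INDEX IX_Plates_Reversed ON Plates (PlateReversed);
-- "ends with XYZ" becomes a prefix seek on the reversed column:
SELECT Plate FROM Plates
WHERE PlateReversed LIKE REVERSE('XYZ') + '%';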
Parallel Queries
An index works because it organizes data; an index cannot help with a string within a string because there is no organization. Speed seems to be your focus of optimization, so you could store/query your data in a way that searches in parallel. Example: if it takes 10 seconds to sequentially search 10 million rows, then having 10 parallel processes (each searching 1 million rows) takes you from 10 seconds to roughly 1 second (kind'a-sort'a). Think of it as scaling out. There are various options for this, within your single SQL instance (try data partitioning) or across multiple SQL Servers (if that's an option).
BONUS: if you're not on a RAID setup, consider one; it can help with reads, since it effectively reads in parallel.
Reduce a bottleneck
One reason searching "huge" datasets takes "too long" is that all that data needs to be read from disk, which is always slow. You can skip the disk and use In-Memory tables. Since "huge" isn't defined, this may or may not work for you.
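A sketch of what that could look like (SQL Server 2014+; the database needs a MEMORY_OPTIMIZED_DATA filegroup first, and the table name here is illustrative):
CREATE TABLE dbo.products_mem
(
    id      INT NOT NULL PRIMARY KEY NONCLUSTERED HASH WITH (BUCKET_COUNT = 1048576),
    barcode VARCHAR(40) NOT NULL
)
WITH (MEMORY_OPTIMIZED = ON, DURABILITY = SCHEMA_AND_DATA);
A scan of a memory-optimized table avoids disk I/O entirely, though the cost of the string-in-string comparison itself remains.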
UPDATED:
We know from MSDN (Full-Text Search) that full-text searches can be used for the following:
One or more specific words or phrases (simple term)
A word or a phrase where the words begin with specified text (prefix term)
Inflectional forms of a specific word (generation term)
A word or phrase close to another word or phrase (proximity term)
Synonymous forms of a specific word (thesaurus)
Words or phrases using weighted values (weighted term)
Are any of these fulfilled by your query requirements? If you have to search for patterns as you described, without a consistent pattern (such as '1%'), then there may be no way for SQL to use a SARG.
You could use Boolean statements
Coming from a C++ perspective, B-trees are accessed with pre-order, in-order, and post-order traversals, and are searched using Boolean comparisons. These are processed much faster than string comparisons; at the very least, Booleans offer improved performance.
We can see this in the following two options:
PATINDEX
Only if your column is not numeric, as PATINDEX is designed for strings.
Returns an integer (like CHARINDEX), which is easier to process than strings.
CHARINDEX is a solution
CHARINDEX has no problem searching INTs and, again, returns a number.
It may require some extra cases to be built in (i.e. the first number is always ignored), but you can add them like so: CHARINDEX('200', barcode) > 1.
As proof of what I am saying, let us go back to the old [AdventureWorks2012].[Production].[TransactionHistory]. We have TransactionID, which contains the numbers of the items we want; let's assume, for fun, that you want every TransactionID that ends with 200.
-- WITH LIKE
SELECT TOP 1000 [TransactionID]
,[ProductID]
,[ReferenceOrderID]
,[ReferenceOrderLineID]
,[TransactionDate]
,[TransactionType]
,[Quantity]
,[ActualCost]
,[ModifiedDate]
FROM [AdventureWorks2012].[Production].[TransactionHistory]
WHERE TransactionID LIKE '%200'
-- WITH CHARINDEX(<delimiter>, <column>) > 3
SELECT TOP 1000 [TransactionID]
,[ProductID]
,[ReferenceOrderID]
,[ReferenceOrderLineID]
,[TransactionDate]
,[TransactionType]
,[Quantity]
,[ActualCost]
,[ModifiedDate]
FROM [AdventureWorks2012].[Production].[TransactionHistory]
WHERE CHARINDEX('200', TransactionID) > 3
Note that CHARINDEX('200', TransactionID) > 3 excludes a value like 200200 from the search, so you may need to adjust your code appropriately. But look at the results (the comparison of the two execution plans):
Clearly, Booleans and numbers are faster comparisons.
LIKE uses string comparisons, which again are much slower to process.
I was a bit surprised at the size of the difference, but the fundamentals are the same. Integers and Boolean statements are always faster to process than string comparisons.
I'm late to the game, but here's another way to get a full-text-like index, in the spirit of @MtwStark's second answer.
This is a solution using a search-table join.
drop table if exists #numbers
select top 10000 row_number() over(order by t1.number) as n
into #numbers
from master..spt_values t1
cross join master..spt_values t2
drop table if exists [like_test]
create TABLE [like_test](
[N] INT IDENTITY(1,1) not null,
[barcode] [varchar](40) not null,
constraint pk_liketest primary key ([N])
)
insert into dbo.like_test (barcode)
select top (1000000) replace(convert(varchar(40), NEWID()), '-', '') barcode
from #numbers t,#numbers t2
drop table if exists barcodesearch
select distinct ps.n, trim(substring(ps.barcode,ty.n,100)) as searchstring
into barcodesearch
from like_test ps
inner join #numbers ty on ty.n < 40
where len(ps.barcode) > ty.n
create clustered index idx_barcode_search_index on barcodesearch (searchstring)
The final search should look like this:
declare @s varchar(20) = 'D34F15'
select distinct lt.* from dbo.like_test lt
inner join barcodesearch bs on bs.N = lt.N
where bs.searchstring like @s+'%'
If you have the option of full-text searching, you can speed this up even further by adding the full-text search column directly to the barcode table
drop table if exists #liketestupdates
select n, string_agg(searchstring, ' ')
within group (order by reverse(searchstring)) as searchstring
into #liketestupdates
from barcodesearch
group by n
alter table dbo.like_test add search_column varchar(559)
update lt
set search_column = searchstring
from like_test lt
inner join #liketestupdates lu on lu.n = lt.n
CREATE FULLTEXT CATALOG ftcatalog as default;
create fulltext index on dbo.like_test ( search_column )
key index pk_liketest
The final full-text search would look like this:
declare @s varchar(20) = 'D34F15'
set @s = '"*' + @s + '*"'
select n,barcode from dbo.like_test where contains(search_column, @s)
I understand that estimated costs aren't the best measure of expected performance, but the numbers aren't wildly off here.
With the search table join, the Estimated Subtree Cost is 2.13
With the full-text search, the Estimated Subtree Cost is 0.008
Full-text indexing is aimed at bigger texts, say texts with more than about 100 chars. You can use LIKE '%string%' (though it depends on how the barcode column is defined). Do you have an index on barcode? If not, then create one and it will improve your query.
First, create an index on the column you use in the WHERE clause.
Second, for the data type of the columns used in the WHERE clause, use char in place of varchar, which will save you some space in the table and in the indexes that include that column (a varchar(1) column carries extra length overhead compared with char(1)).
Also, pull only the columns you need; try to avoid *, and be specific about the columns you wish to select.
Don't write:
select * from products
Instead write:
Select Col1, Col2 from products with (Nolock)

Removing characters from an alphanumeric field SQL

I'm moving data from one table to another using INSERT INTO. In the SELECT part, I need to transfer from a column containing both characters and numbers into a column holding only the numbers. The original column is in varchar format.
Original column:
ABC100
XYZ:200
DD2000
Wanted column:
100
200
2000
I can't write a function, because (as I understood it) you can't have a function inside the SELECT statement when inserting.
Using MS SQL.
I encourage you to read this: Extracting Data.
There is an example function there that removes alpha characters from a string. This will be much faster than a bunch of REPLACE statements.
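The linked article's exact code isn't reproduced here, but a minimal function in that spirit looks like this (a sketch; the name dbo.StripAlpha is illustrative):
CREATE FUNCTION dbo.StripAlpha (@s VARCHAR(100))
RETURNS VARCHAR(100)
AS
BEGIN
    -- remove every character that is not a digit, one at a time
    WHILE PATINDEX('%[^0-9]%', @s) > 0
        SET @s = STUFF(@s, PATINDEX('%[^0-9]%', @s), 1, '');
    RETURN @s;
END
Used as SELECT dbo.StripAlpha('XYZ:200'), it returns '200'.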
You can probably do that with a regex replace. The syntax for this depends on your database software (which you haven't specified).
You should be able to do function calls in your SELECT statement, even when you're using it to INSERT INTO.
If your data is fixed-format I'd do something like
INSERT INTO SOME_TABLE(COLUMN1, COLUMN2, COLUMN3)
SELECT TO_NUMBER(SUBSTR(SOURCE_COLUMN, 4, 3)),
TO_NUMBER(SUBSTR(SOURCE_COLUMN, 12, 3)),
TO_NUMBER(SUBSTR(SOURCE_COLUMN, 18, 4))
FROM SOME_OTHER_TABLE
WHERE <conditions>;
The above code is for Oracle. Depending on the database you're using you may have to do things a bit differently.
I hope this helps.
You certainly can have a function inside a SELECT statement during an INSERT:
INSERT INTO CleanTable (CleanColumn)
SELECT dbo.udf_CleanString(DirtyColumn)
FROM DirtyTable
Your main problem is going to be getting the function right (the one that G Mastros linked to is pretty good) and getting it to perform well. If you're only talking about thousands of rows, this should be fine. If you are talking about millions of rows, you might need a different strategy.
Writing a UDF is how I've solved this problem in the past. However, I got to wondering whether there was a set-based solution. Here's what I have:
First, my table, which I populated with a bunch of random alphanumeric values using Red Gate's Data Generator:
Create Table MixedValues (
Id int not null identity(1,1) Primary Key
, AlphaValue varchar(50)
)
Next I built a Tally table on the fly using a CTE but normally I have a fixed table for this. A Tally table is just a table of sequential numbers.
;With Tally As
(
Select ROW_NUMBER() OVER ( ORDER BY object_id ) As Num
From sys.columns
)
, IndividualChars As
(
Select MX.Id, Substring(MX.AlphaValue, Num, 1) As CharValue, Num
From Tally
Cross Join MixedValues As MX
Where Num Between 1 And Len(MX.AlphaValue)
)
Select MX.Id, MX.AlphaValue
, Replace(
(
Select '' + CharValue
From IndividualChars As IC
Where IC.Id = MX.Id
And PATINDEX('[ 0-9]', CharValue) > 0
Order By Num
For Xml Path('')
)
, '&#x20;', ' ') As NewValue
From MixedValues As MX
At a top level, the idea here is to split the string into one row per individual character, test whether each character matches the pattern you want, and then reconstitute the string.
Note that my sys.columns table only contains 500-some-odd rows. If you had strings longer than 500 characters, you could simply cross join sys.columns to itself and get 500^2 rows. In addition, For Xml Path returns a string with spaces escaped as &#x20; (note the space in my pattern index [ 0-9], which tells the system to include spaces), so I use the REPLACE function to reverse the escaping.
EDIT: By the way, this will only work on SQL 2005+ because of my use of the CTE. If you wanted a SQL 2000 solution, you would need to break the CTE up into separate table creation calls (e.g. temp tables), but it could still be done.
EDIT: I added the Num column to the IndividualChars CTE and added an ORDER BY to the NewValue query at the end. Although it would probably reconstitute the string in order anyway, I wanted to ensure that it does by explicitly ordering the results.
