How to remove weird Excel character in SQL Server? - sql-server

There is a weird whitespace character I can't seem to get rid of that occasionally shows up in my data when importing from Excel. Visibly, it comes across as a whitespace character BUT SQL Server sees it as a question mark (ASCII 63).
declare @temp nvarchar(255); set @temp = 'carolg@c?am.com'
select @temp
returns:
?carolg#c?am.com
How can I get rid of the whitespace without getting rid of real question marks? If I look at the ASCII code for each of those "?" characters I get 63 when, in fact, only one of them is a real question mark.

Have a look at this answer for someone with a similar issue. Sorry if this is a bit long winded:
SQL Server seems to flatten Unicode to ASCII by mapping unrepresentable characters (for which there is no suitable substitution) to a question mark. To replicate this, try opening the Character Map Windows program (should be installed on most machines), select Arial as the font and find U+034F "Combining Grapheme Joiner". Select this character, copy it to the clipboard and paste it between the single quotes below:
declare @t nvarchar(10)
set @t = '͏'
select rtrim(ltrim(@t)) -- we can try and trim it, but by this stage it's already a '?'
You'll get a question mark out, because it doesn't know how to represent this non-ASCII character when it casts it to varchar. To force it to accept it as a double-byte character (nvarchar) you need to use N'' instead, as has already been mentioned. Add an N before the quotes above and the question mark disappears (but the original invisible character is preserved in the output - and ltrim and rtrim won't remove it as demonstrated below):
declare @t nvarchar(10),
@s varchar(10) -- note: single-byte string
set @t = rtrim(ltrim(N'͏')) -- trimming doesn't work here either
set @s = @t
select @s -- still outputs a question mark
Imported data can definitely do this, I've seen it before, and characters like the one I've shown above are particularly hard to diagnose because you can't see them! You will need to create some sort of scrubbing process to remove these unprintables (and any other junk characters, for that matter), and make sure that you use nvarchar everywhere, or you'll end up with this issue. Worse, those phantom question marks will become real question marks that you won't be able to distinguish from legitimate ones.
To see what character code you're dealing with, you can cast as varbinary as follows:
declare @t nvarchar(10)
set @t = N'͏test?'
select cast(@t as varbinary) -- returns 0x4F0374006500730074003F00
-- Returns:
-- 0x4F03   7400 6500 7300 7400 3F00
-- badchar  t    e    s    t    ?
Now to get rid of it:
declare @t nvarchar(10)
set @t = N'͏test?'
select cast(@t as varbinary) -- bad char
set @t = replace(@t COLLATE Latin1_General_100_BIN2, nchar(0x034f), N'');
select cast(@t as varbinary) -- gone!
Note I had to swap the byte order from 0x4f03 to 0x034f (same reason "t" appears in the output as 0x7400, not 0x0074). For some notes on why we're using binary collation, see this answer.
This is kind of messy, because you don't know what the dirty characters are, and they could be one of thousands of possibilities. One option is to iterate over strings using like or even the unicode() function and discard characters in strings that aren't in a list of acceptable characters, but this could be slow. It may be that most of your bad characters are either at the start or end of the string, which might speed this process up if that's an assumption you think you can make.
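As a minimal sketch of the whitelist idea described above, the loop below removes every character outside the printable ASCII range (space through tilde). The variable names and the choice of whitelist are assumptions for illustration only; a whitelist this narrow would also strip legitimate accented letters, so widen the pattern to suit your data.

```sql
-- Hypothetical sample: U+034F is placed in the middle of the string on purpose.
DECLARE @s nvarchar(50) = N'te' + NCHAR(0x034F) + N'st?';

-- Matches any single character that is NOT between space (0x20) and tilde (0x7E).
-- The binary collation makes the range compare by code point rather than by linguistic rules.
DECLARE @pattern nvarchar(20) = N'%[^ -~]%';
DECLARE @pos int = PATINDEX(@pattern, @s COLLATE Latin1_General_100_BIN2);

WHILE @pos > 0
BEGIN
    -- Delete the offending character, then look for the next one.
    SET @s = STUFF(@s, @pos, 1, N'');
    SET @pos = PATINDEX(@pattern, @s COLLATE Latin1_General_100_BIN2);
END

SELECT @s AS cleaned, CAST(@s AS varbinary(100)) AS cleaned_bytes;
```

The real question mark should survive because it falls inside the allowed range, while the invisible character does not. Running this row by row over a large import is likely to be slow, which is why the external or SSIS-based scrubbing mentioned next may be preferable.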
You may need to build additional processes either external to SQL Server or as part of a SSIS import based on what I've shown you above to strip this out quickly if you have a lot of data to import. If you aren't sure the best way to do this, that's probably best answered in a new question.
I hope that helps.

Related

SQL SUBSTRING & PATINDEX of varying lengths

SQL Server 2017.
Given the following 3 records with field of type nvarchar(250) called fileString:
_318_CA_DCA_2020_12_11-01_00_01_VM6.log
_319_CA_DCA_2020_12_12-01_VM17.log
_333_KF_DCA01_00_01_VM232.log
I would want to return:
VM6
VM17
VM232
Attempted thus far with:
SELECT
SUBSTRING(fileString, PATINDEX('%VM[0-9]%', fileString), 3)
FROM dbo.Table
But of course that only returns VM and 1 number.
How would I define the parameter for number of characters when it varies?
EDIT: to pre-emptively answer a question that may come up, yes, the VM pattern will always be followed immediately by .log and nothing else. But even if I took that approach and worked backwards, I still don't understand how to define the number of characters to take when the number varies.
Here is one way:
DECLARE @test TABLE( fileString varchar(500))
INSERT INTO @test VALUES
('_318_CA_DCA_2020_12_11-01_00_01_VM6.log')
,('_319_CA_DCA_2020_12_12-01_00_01_VM17.log')
,('_333_KF_DCA_2020_12_15-01_00_01_VM232.log')
-- 5 is the length of file extension + 1 which is always the same size '.log'
SELECT
REVERSE(SUBSTRING(REVERSE(fileString),5,CHARINDEX('_',REVERSE(fileString))-5))
FROM @test AS t
This dynamically finds the position of the last _ and drops the .log extension.
It is not the most efficient approach. If you are able to write a CLR function using C# and import it into SQL Server, that will be much more efficient. Otherwise, you can use this as a starting point and tweak it as needed.
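To see why the offsets in the query above work, it may help to look at the intermediate values for one of the sample strings. This is only an illustrative breakdown; the variable names are made up for the walkthrough.

```sql
DECLARE @f varchar(100) = '_318_CA_DCA_2020_12_11-01_00_01_VM6.log';
DECLARE @r varchar(100) = REVERSE(@f);          -- reversed string begins with 'gol.6MV_'
DECLARE @underscore int = CHARINDEX('_', @r);   -- 8: the last underscore of the original string

SELECT
    @underscore                                AS first_underscore_in_reversed, -- 8
    @underscore - 5                            AS token_length,                 -- 3
    SUBSTRING(@r, 5, @underscore - 5)          AS reversed_token,               -- '6MV'
    REVERSE(SUBSTRING(@r, 5, @underscore - 5)) AS token;                        -- 'VM6'
```

Position 5 skips the four characters of the reversed extension ('gol.'), and the length is whatever lies between that point and the first underscore, so the digit count can vary freely.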
You can remove the variable and replace it with your table like below
DECLARE @TESTVariable as varchar(500)
Set @TESTVariable = '_318_CA_DCA_2020_12_11-01_00_01_VM6adf.log'
SELECT REPLACE(SUBSTRING(@TESTVariable, PATINDEX('%VM[0-9]%', @TESTVariable), PATINDEX('%[_]%', REVERSE(@TESTVariable))), '.log', '')
select *,
part = REPLACE(SUBSTRING(filestring, PATINDEX('%VM[0-9]%', filestring), PATINDEX('%[_]%', REVERSE(filestring))), '.log', '')
from table
Your lengths are consistent at the beginning. So get away from patindex and use substring to crop out the beginning. Then just replace the '.log' with an empty string at the end.
select *,
part = replace(substring(filestring,33,255),'.log','')
from table;
Edit:
Okay, from your edit you show differing prefix portions. Then patindex is in fact correct. Here's my solution, which is not better or worse than the other answers but differs with respect to the fact that it avoids reverse and delegates the patindex computation to a cross apply section. You may find it a bit more readable.
select filestring,
part = replace(substring(filestring, ap.vmIx, 255),'.log','')
from table
cross apply (select
vmIx = patindex('%[_]vm%', filestring) + 1 -- [_] matches a literal underscore; a bare _ is a single-character wildcard
) ap

Central european characters in SQL

I have an issue. I have data stored on SQL Server with Central European characters like "č", "ř", "ž" etc. The database uses the "Czech_CI_AS" collation, which should accept these characters. But when I try to select, for example, the name of a street containing these characters like this:
SELECT *
FROM Street where Name = 'Čáslavská'
It returns me nothing
When I remove the "č" it returns me what I need.
SELECT *
FROM Street where Name like '%áslavská'
The column is of type nvarchar. But I cannot use the N prefix before my string, because external applications read from this table and their selects are generated automatically.
Is there any solution? Or have I got something wrong?
Thanks for any help
@YuriyTsarkov really deserves the credit here. To elaborate on his answer:
From MSDN:
Prefix Unicode character string constants with the letter N. Without the N prefix, the string is converted to the default code page of the database. This default code page may not recognize certain characters.
Example
-- Storing Čáslavská in two vars, with and without N prefix.
DECLARE @Test_001 NVARCHAR(255) = 'Čáslavská' COLLATE Czech_CI_AS;
DECLARE @Test_002 NVARCHAR(255) = N'Čáslavská' COLLATE Czech_CI_AS;
-- Test output.
SELECT
@Test_001 AS T1,
@Test_002 AS T2
;
Returns
T1 T2
Cáslavská Čáslavská
You need either to update all of your external applications' code to use the N prefix in their selects, or to change the collation of your column to the one used by the external applications. The latter may cause some data loss.

Query to search an alphanumeric string in a non-alphanumeric column

Here is the issue - I have a database column that holds product serial numbers that are filled in by users, but without any kind of filter. For example, the user can fill in the field as DC-538, DC 538 or DC538, depending on his own interpretation, since the serial number is usually on the metal part of the product and it can be difficult to know if there's a blank space, for example.
I can't reformat the current column values, because there are so many brands and we couldn't know for sure whether taking out a non-alphanumeric character could lead to problems, that is, whether a brand considers these characters part of the official number. For example, "DC-538-XXX" and "DC538-XXX" could refer to 2 different products. Very unlikely, but we cannot assume it doesn't happen.
Now I need to offer a search by serial number on my website... but if the user searches for "DC538" instead of "DC 538" he won't find it. What's the best approach?
I believe that the perfect solution would be a kind of select that searches for the exact string and also strips the non-alphanumeric characters from the search term and compares it to a stripped string in the database (which I don't have). But I don't know if there's a way to do that with SQL only.
Any ideas ?
Cheers
Take the function below, which was offered as an answer here, modified to also keep numeric characters:
CREATE FUNCTION [dbo].[RemoveNonAlphaCharacters] (@Temp VARCHAR(1000))
RETURNS VARCHAR(1000)
AS
BEGIN
DECLARE @KeepValues AS VARCHAR(MAX)
SET @KeepValues = '%[^a-z0-9]%'
WHILE PatIndex(@KeepValues, @Temp) > 0
SET @Temp = Stuff(@Temp, PatIndex(@KeepValues, @Temp), 1, '')
RETURN @Temp
END
You can do the following:
DECLARE @Input NVARCHAR(MAX)
SET @Input = '%' + dbo.RemoveNonAlphaCharacters('text inputted by user') + '%'
SELECT *
FROM Table
WHERE dbo.RemoveNonAlphaCharacters(ColumnCode) LIKE @Input
Here is a sample working SQLFiddle

How can I make SQL Server return FALSE for comparing varchars with and without trailing spaces?

If I deliberately store trailing spaces in a VARCHAR column, how can I force SQL Server to see the data as a mismatch?
SELECT 'foo' WHERE 'bar' = 'bar '
I have tried:
SELECT 'foo' WHERE LEN('bar') = LEN('bar ')
One method I've seen floated is to append a specific character to the end of every string then strip it back out for my presentation... but this seems pretty silly.
Is there a method I've overlooked?
I've noticed that it does not apply to leading spaces so perhaps I run a function which inverts the character order before the compare.... problem is that this makes the query unSARGable....
From the docs on LEN (Transact-SQL):
Returns the number of characters of the specified string expression, excluding trailing blanks. To return the number of bytes used to represent an expression, use the DATALENGTH function
Also, from the support page on How SQL Server Compares Strings with Trailing Spaces:
SQL Server follows the ANSI/ISO SQL-92 specification on how to compare strings with spaces. The ANSI standard requires padding for the character strings used in comparisons so that their lengths match before comparing them.
Update: I deleted my code using LIKE (which does not pad spaces during comparison) and DATALENGTH() since they are not foolproof for comparing strings
This has also been asked in a lot of other places as well for other solutions:
SQL Server 2008 Empty String vs. Space
Is it good practice to trim whitespace (leading and trailing)
Why would SqlServer select statement select rows which match and rows which match and have trailing spaces
You could try something like this:
declare @a varchar(10), @b varchar(10)
set @a='foo'
set @b='foo '
select @a, @b, DATALENGTH(@a), DATALENGTH(@b)
Sometimes the dumbest solution is the best:
SELECT 'foo' WHERE 'bar' + 'x' = 'bar ' + 'x'
So basically append any character to both strings before making the comparison.
After some searching, the simplest solution I found was on Anthony Bloesch's WebLog.
Just append some text (a single character is enough) to the end of the data:
SELECT 'foo' WHERE 'bar' + 'BOGUS_TXT' = 'bar ' + 'BOGUS_TXT'
Also works for 'WHERE IN'
SELECT <columnA>
FROM <tableA>
WHERE <columnA> + 'BOGUS_TXT' in ( SELECT <columnB> + 'BOGUS_TXT' FROM <tableB> )
The approach I’m planning to use is to use a normal comparison which should be index-keyable (“sargable”) supplemented by a DATALENGTH (because LEN ignores the whitespace). It would look like this:
DECLARE @testValue VARCHAR(MAX) = 'x';
SELECT t.Id, t.Value
FROM dbo.MyTable t
WHERE t.Value = @testValue AND DATALENGTH(t.Value) = DATALENGTH(@testValue)
It is up to the query optimizer to decide the order of filters, but it should choose to use an index for the data lookup if that makes sense for the table being tested and then further filter down the remaining result by length with the more expensive scalar operations. However, as another answer stated, it would be better to avoid these scalar operations altogether by using an indexed calculated column. The method presented here might make sense if you have no control over the schema, or if you want to avoid creating the calculated columns, or if creating and maintaining the calculated columns is considered more costly than the worse query performance.
I've only really got two suggestions. One would be to revisit the design that requires you to store trailing spaces - they're always a pain to deal with in SQL.
The second (given your SARG-able comments) would be to add a computed column to the table that stores the length, and add this column to the appropriate indexes. That way, at least, the length comparison should be SARG-able.
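A rough sketch of that computed-column idea follows. The table, column and index names are hypothetical, and the column is assumed to be short enough to serve as an index key (so not VARCHAR(MAX)). DATALENGTH is used rather than LEN because LEN ignores trailing blanks.

```sql
-- Hypothetical demo table; the byte length is stored alongside the value.
CREATE TABLE dbo.TrailingSpaceDemo (
    Id              int IDENTITY(1,1) PRIMARY KEY,
    Value           varchar(100) NOT NULL,
    ValueByteLength AS DATALENGTH(Value) PERSISTED
);

-- Index covering both the value and its stored length.
CREATE INDEX IX_TrailingSpaceDemo_Value
    ON dbo.TrailingSpaceDemo (Value, ValueByteLength);

INSERT INTO dbo.TrailingSpaceDemo (Value) VALUES ('bar'), ('bar ');

DECLARE @testValue varchar(100) = 'bar ';

-- The equality test alone matches both rows (trailing spaces are padded);
-- the length test narrows it to the row that really has the trailing space.
SELECT Id, Value, ValueByteLength
FROM dbo.TrailingSpaceDemo
WHERE Value = @testValue
  AND ValueByteLength = DATALENGTH(@testValue);
```

With the two sample rows, only the row holding 'bar ' (4 bytes) should come back. Whether the optimizer actually seeks on the index depends on the real table and its statistics, so check the execution plan before relying on it.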

How to compare two column values which are comma separated values?

I have one table with specific columns, in that there is a column which contains comma separated values like test,exam,result,other.
I will pass a string like result,sample,unknown,extras as a parameter to the stored procedure, and then I want to get the related records by checking each and every phrase in this string.
For Example:
TableA
ID Name Words
1 samson test,exam,result,other
2 john sample,no query
3 smith tester,SE
Now I want to search for result,sample,unknown,extras
Then the result should be
ID Name Words
1 samson test,exam,result,other
2 john sample,no query
because in the first record result matched and in the second record sample matched.
That's not a great design, you know. Better to split Words off into a separate table (id, word).
That said, this should do the trick:
set nocount on
declare @words varchar(max) = 'result,sample,unknown,extras'
declare @split table (word varchar(64))
declare @word varchar(64), @start int, @end int, @stop int
-- string split in 8 lines
select @words += ',', @start = 1, @stop = len(@words)+1
while @start < @stop begin
select
@end = charindex(',',@words,@start)
, @word = rtrim(ltrim(substring(@words,@start,@end-@start)))
, @start = @end+1
insert @split values (@word)
end
select * from TableA a
where exists (
select * from @split w
where charindex(','+w.word+',',','+a.words+',') > 0
)
May I burn in DBA hell for providing you this!
Edit: replaced STUFF w/ SUBSTRING slicing, an order of magnitude faster on long lists.
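For comparison, here is a hedged sketch of the normalized design recommended at the top of this answer, with the words moved into their own (id, word) table. The table variable names are invented for the example, and STRING_SPLIT is assumed to be available (SQL Server 2016 or later with database compatibility level 130 or higher); on older versions the hand-written split loop above would still be needed.

```sql
-- Hypothetical normalized layout: one row per (ID, Word) instead of a comma-separated column.
DECLARE @TableA table (ID int PRIMARY KEY, Name varchar(64));
DECLARE @TableAWords table (ID int, Word varchar(64), PRIMARY KEY (ID, Word));

INSERT INTO @TableA VALUES (1, 'samson'), (2, 'john'), (3, 'smith');
INSERT INTO @TableAWords VALUES
    (1, 'test'), (1, 'exam'), (1, 'result'), (1, 'other'),
    (2, 'sample'), (2, 'no query'),
    (3, 'tester'), (3, 'SE');

DECLARE @words varchar(max) = 'result,sample,unknown,extras';

-- A row qualifies if any of its words equals any of the search terms.
SELECT a.ID, a.Name
FROM @TableA AS a
WHERE EXISTS (
    SELECT 1
    FROM @TableAWords AS w
    JOIN STRING_SPLIT(@words, ',') AS s
        ON s.value = w.Word
    WHERE w.ID = a.ID
);
```

With the sample data from the question this should return samson and john. The match becomes a plain equality join on whole words, which can use an index on the word column and avoids the delimiter-wrapping trick.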
Personally I think you'd want to look at your application/architecture and think carefully about whether you really want to do this in the database or the application. If it isn't appropriate or not an option then you'll need to create a custom function. The code in the article here should be easy enough to modify to do what you want:
Quick T-Sql to parse a delimited string (also look at the code in the comments)
Like the others have already said -- what you have there is a bad design. Consider using proper relations to represent these things.
That being said, here's a detailed article about how to do this using SQL Server:
http://www.sommarskog.se/arrays-in-sql-2005.html
One thing no one has covered so far, because it's often a very bad idea -- but then, you are already working with a bad idea, and sometimes two wrongs make a right -- is to extract all rows that match ANY of your strings (using LIKE or some such) and doing the intersection yourself, client-side. If your strings are fairly rare and highly correlated, this may work pretty well; it will be god-awful in most other cases.
