Determining index of last uppercase letter in column value (SQL)? - sql-server

Short version: Is there a way to easily extract and ORDER BY a substring of values in a DB column based on the index (position) of the last upper case letter in that value, only using SQL?
Long version: I have a table with a username field, and the convention for usernames is the capitalized first initial of the first name, followed by the capitalized first initial of the last name, followed by the rest of the last name. As a result, ordering by the username field is 'wrong'. Ordering by a substring of the username value would theoretically work, e.g.
SUBSTRING(username,2, LEN(username))
...except that there are values with capitalized middle initials between the other two initials. I am curious to know if, using only SQL (MS SQL Server), there is a fairly straightforward / simple way to:
Test the case of a character in a DB value (and return a boolean)
Determine the index of the last upper case character in a string value
Assuming this is even remotely possible, I assume one would have to loop through the individual letters of each username to accomplish it, making it terribly inefficient, but if you have a magical shortcut, feel free to share. Note: This question is purely academic as I have since decided to go a much simpler way. I am just curious if this is even possible.

Test the case of a character in a DB value (and return a boolean)
SQL Server does not have a boolean datatype. bit is often used in its place.
DECLARE @Char CHAR(1) = 'f'
SELECT CAST(CASE
            WHEN @Char LIKE '[A-Z]' COLLATE Latin1_General_Bin
            THEN 1
            ELSE 0
            END AS BIT)
/* Returns 0 */
Note that it is important to use a binary collation rather than a case-sensitive (CS) collation with the above syntax. Under a CS collation the range [A-Z] follows dictionary sort order, so it also matches lower case characters; to avoid that, the pattern would need to be spelled out in full as '[ABCDEFGHIJKLMNOPQRSTUVWXYZ]'.
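As a rough illustration of that point, the sketch below runs the same test under three setups. The collation Latin1_General_CS_AS is only an example of a case-sensitive collation picked for this sketch, and the column aliases are made up. On a typical installation the first and third columns should come back 0 and the second 1, because in dictionary order 'f' sorts between 'A' and 'Z'.

```sql
-- Same character, three ways of testing "is it upper case?"
SELECT
    -- Binary collation: the range is compared by code point, so 'f' is outside A-Z
    CASE WHEN 'f' LIKE '[A-Z]' COLLATE Latin1_General_Bin
         THEN 1 ELSE 0 END AS bin_range,
    -- Case-sensitive dictionary collation: the range still spans lower case letters
    CASE WHEN 'f' LIKE '[A-Z]' COLLATE Latin1_General_CS_AS
         THEN 1 ELSE 0 END AS cs_range,
    -- Case-sensitive collation with every upper case letter listed explicitly
    CASE WHEN 'f' LIKE '[ABCDEFGHIJKLMNOPQRSTUVWXYZ]' COLLATE Latin1_General_CS_AS
         THEN 1 ELSE 0 END AS cs_spelled_out;
```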
Determine the index of the last upper case character in a string value
SELECT PATINDEX('%[A-Z]%' COLLATE Latin1_General_Bin, REVERSE('Your String'))
/* Returns 6: the one-based position of the last capital, counted from the end of the string */
SELECT PATINDEX('%[A-Z]%' COLLATE Latin1_General_Bin, REVERSE('no capitals'))
/* Returns 0 if the test string doesn't contain any letters in the range A-Z */
To extract the surname according to those rules you can use
SELECT RIGHT(Name,PATINDEX('%[A-Z]%' COLLATE Latin1_General_Bin ,REVERSE(Name)))
FROM YourTable
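To tie this back to the original ORDER BY question, here is a minimal sketch using a few invented usernames (the values and the alias names are assumptions for illustration, not data from the question). It extracts the surname with the expression above and sorts by it.

```sql
-- Hypothetical sample usernames: initial(s) followed by the surname
SELECT v.username,
       RIGHT(v.username,
             PATINDEX('%[A-Z]%' COLLATE Latin1_General_Bin, REVERSE(v.username))) AS surname
FROM (VALUES ('JSmith'), ('AQAdams'), ('BBrown')) AS v(username)
ORDER BY surname;
-- Rows come back in the order Adams, Brown, Smith
```

Because the expression wraps the column in functions, a sort like this cannot use an ordinary index on username.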

Related

Select only alphabetic prefix from a column in sql server

I have a regex expression I need to convert to a sql server query. The goal is to select the alphabetic prefix from a column, so AAA-1234BC would become AAA, ABC123 would become ABC and 1234A would become ''.
From my understanding, regex isn't natively supported in SQL Server, so I am wondering if there is another purely SQL way to do this.
Thanks
We can use PATINDEX here:
SELECT
col,
CASE WHEN col LIKE '[A-Z]%'
THEN LEFT(col, PATINDEX('%[^A-Za-z]%', col) - 1)
ELSE '' END AS first_alpha
FROM yourTable;
The above logic finds, in the case of strings which do begin with a letter, the index of the first non letter character, and then takes a substring from the first letter until before that first non letter. Otherwise, it returns empty string.
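One case the question's samples do not cover is a value made up only of letters, such as 'ABC'. There PATINDEX finds no non-letter and returns 0, so LEFT would be called with a length of -1 and fail. A possible guard, sketched here against made-up sample rows, is to append a non-letter sentinel before searching (the '0' sentinel and the inline VALUES list are assumptions of this sketch):

```sql
SELECT
    v.col,
    CASE WHEN v.col LIKE '[A-Za-z]%'
         -- Appending '0' guarantees that PATINDEX finds a non-letter,
         -- so an all-letter value returns the whole string instead of erroring
         THEN LEFT(v.col, PATINDEX('%[^A-Za-z]%', v.col + '0') - 1)
         ELSE ''
    END AS first_alpha
FROM (VALUES ('AAA-1234BC'), ('ABC123'), ('1234A'), ('ABC')) AS v(col);
-- first_alpha: 'AAA', 'ABC', '', 'ABC'
```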

How can I make LIKE match a number or empty string inside square brackets in T-SQL?

Is it possible to have a LIKE clause with one character number or an empty string?
I have a field in which I will write a LIKE clause (as a string). I will apply it later with an expression in the WHERE clause: ... LIKE tableX.FormatField .... It must contain a number (a single character or an empty string).
Something like [0-9 ]. Where the space bar inside square brackets means an empty string.
I have a table in which I have a configuration for parameters - TblParam with field DataFormat. I have to validate a value from another table, TblValue, with field ValueToCheck. The validation is made by a query. The part for the validation looks like:
... WHERE TblValue.ValueToCheck LIKE TblParam.DataFormat ...
For the configuration value, I need an expression for one numeric character or an empty string. Something like [0-9'']. Because of the automatic nature of the check, I need a single expression (without AND OR OR operators) which can fit the query (see the example above). The same check is valid for other types of the checks, so I have to fit my check engine.
I am almost sure that I can not use [0-9''], but is there another suitable solution?
Actually, I am having difficulty validating a version string: 1.0.1.2 or 1.0.2. It can contain 2-3 dots (.) and numbers.
I am pretty sure it is not possible, as '' is not even a character.
select ascii(''); returns null.
'' = ' '; is true
'' is null; is false
If you want exactly 0-9 or '' (and not ' '), then you need to do something like this (in a more efficient way than LIKE):
where col in ('1','2','3','4','5','6','7','8','9','0') or (col = '' and DATALENGTH(col) = 0)
That's a tricky one... As far as I can tell, there isn't a way to do it with only one like clause. You need to do like '[0-9]' OR like ''.
You could accomplish this by having a second column in your TableX that indicates either a second pattern, or whether or not to include blanks.
If I correctly understand your question, you need something that catches an empty string. Try to use the nullif() function:
create table t1 (a nvarchar(1))
insert t1(a) values('')
insert t1(a) values('1')
insert t1(a) values('2')
insert t1(a) values('a')
-- must select first three
select a from t1 where a like '[0-9]' or nullif(a,'') is null
It returns exactly three records: '', '1' and '2'.
A more convenient method with only one range clause is:
select a from t1 where isnull(nullif(a,''),0) like '[0-9]'

Microsoft SQL Server collation names

Does anybody know what the WS property of a collation does? Does it have anything to do with Asian type of scripts? The MSDN docs explain it to be "Width Sensitive", but that doesn't make any sense for say Swedish, or English...?
A good description of width sensitivity is summarized here: http://www.databasejournal.com/features/mssql/article.php/3302341/SQL-Server-and-Collation.htm
Width sensitivity
When a single-byte character (half-width) and the same character when represented as a double-byte character (full-width) are treated differently then it is width sensitive.
Perhaps from an English character perspective, I would theorize that a width-sensitive collation would mean that 'abc' <> N'abc', because one string is a Unicode string (2 bytes per character), whereas the other uses one byte per character.
From a Latin character set perspective it seems like something that wouldn't make sense to set. Perhaps in other languages this is important.
I try to set these types of collation properties to insensitive in general in order to avoid weird things like records not getting returned in search results. I usually keep accents set to insensitive, since that can cause a lot of user search headaches, depending on the audience of your applications.
Edit:
After creating a test database with the Latin1_General_CS_AS_WS collation, I found that N'a' = 'a' is actually true. Test queries were:
select case when 'a' = 'A' then 'yes' else 'no' end
select case when 'a' = 'a' then 'yes' else 'no' end
select case when N'a' = 'a' then 'yes' else 'no' end
So in practice I'm not sure where this type of rule comes into play
The accepted answer demonstrates that it does not come into play for the comparison N'a' = 'a'. This is easily explained because the char will get implicitly converted to nchar in the comparison between the two anyway so both strings in the comparison are Unicode.
I just thought of an example of a place where width sensitivity might be expected to come into play in a Latin Collation only to discover that it appeared to make no difference at all there either...
DECLARE @T TABLE (
    a VARCHAR(2) COLLATE Latin1_General_100_CS_AS_WS,
    b VARCHAR(2) COLLATE Latin1_General_100_CS_AS_WS )
INSERT INTO @T
VALUES (N'Æ',
        N'AE');
SELECT LEN(a) AS [LEN(a)],
       LEN(b) AS [LEN(b)],
       a,
       b,
       CASE
         WHEN a = b THEN 'Y'
         ELSE 'N'
       END AS [a=b]
FROM @T
LEN(a) LEN(b) a b a=b
----------- ----------- ---- ---- ----
1 2 Æ AE Y
The book "Microsoft SQL Server 2008 Internals" has this to say:
Width Sensitivity refers to East Asian languages for which there exists both half-width and full-width forms of some characters.
There is absolutely nothing stopping you storing these characters in a collation such as Latin1_General_100_CS_AS_WS as long as the column has a Unicode data type, so I guess that the WS part would only apply in that particular situation.
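A small sketch of that situation, assuming the two Latin1_General_100 collations below are available on the server: it compares the ordinary letter A with its full-width form (U+FF21, built with NCHAR so the script does not depend on file encoding). Going by the documented meaning of WS, the first column should report 'equal' and the second 'different'.

```sql
-- Half-width 'A' versus full-width 'A' (code point 65313 = U+FF21)
SELECT
    -- No _WS suffix: width differences are ignored
    CASE WHEN N'A' = NCHAR(65313) COLLATE Latin1_General_100_CI_AS
         THEN 'equal' ELSE 'different' END AS width_insensitive,
    -- _WS suffix: half-width and full-width forms are distinguished
    CASE WHEN N'A' = NCHAR(65313) COLLATE Latin1_General_100_CI_AS_WS
         THEN 'equal' ELSE 'different' END AS width_sensitive;
```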

How can I make SQL Server return FALSE for comparing varchars with and without trailing spaces?

If I deliberately store trailing spaces in a VARCHAR column, how can I force SQL Server to see the data as mismatch?
SELECT 'foo' WHERE 'bar' = 'bar '
I have tried:
SELECT 'foo' WHERE LEN('bar') = LEN('bar ')
One method I've seen floated is to append a specific character to the end of every string then strip it back out for my presentation... but this seems pretty silly.
Is there a method I've overlooked?
I've noticed that it does not apply to leading spaces so perhaps I run a function which inverts the character order before the compare.... problem is that this makes the query unSARGable....
From the docs on LEN (Transact-SQL):
Returns the number of characters of the specified string expression, excluding trailing blanks. To return the number of bytes used to represent an expression, use the DATALENGTH function
Also, from the support page on How SQL Server Compares Strings with Trailing Spaces:
SQL Server follows the ANSI/ISO SQL-92 specification on how to compare strings with spaces. The ANSI standard requires padding for the character strings used in comparisons so that their lengths match before comparing them.
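The two quoted behaviours can be seen side by side in a quick sketch (the column aliases are made up for illustration):

```sql
SELECT
    LEN('bar ')        AS len_result,         -- 3: the trailing blank is not counted
    DATALENGTH('bar ') AS datalength_result,  -- 4: bytes stored, blank included
    CASE WHEN 'bar' = 'bar ' THEN 'equal' ELSE 'different' END
                       AS equals_result;      -- 'equal': the shorter side is padded
```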
Update: I deleted my code using LIKE (which does not pad spaces during comparison) and DATALENGTH() since they are not foolproof for comparing strings
This has also been asked in a lot of other places as well for other solutions:
SQL Server 2008 Empty String vs. Space
Is it good practice to trim whitespace (leading and trailing)
Why would SqlServer select statement select rows which match and rows which match and have trailing spaces
You could try something like this:
declare @a varchar(10), @b varchar(10)
set @a='foo'
set @b='foo '
select @a, @b, DATALENGTH(@a), DATALENGTH(@b)
Sometimes the dumbest solution is the best:
SELECT 'foo' WHERE 'bar' + 'x' = 'bar ' + 'x'
So basically append any character to both strings before making the comparison.
After some searching, the simplest solution I found was on Anthony Bloesch's WebLog.
Just add some text (a char is enough) to the end of the data (append):
SELECT 'foo' WHERE 'bar' + 'BOGUS_TXT' = 'bar ' + 'BOGUS_TXT'
Also works for 'WHERE IN'
SELECT <columnA>
FROM <tableA>
WHERE <columnA> + 'BOGUS_TXT' in ( SELECT <columnB> + 'BOGUS_TXT' FROM <tableB> )
The approach I’m planning to use is to use a normal comparison which should be index-keyable (“sargable”) supplemented by a DATALENGTH (because LEN ignores the whitespace). It would look like this:
DECLARE @testValue VARCHAR(MAX) = 'x';
SELECT t.Id, t.Value
FROM dbo.MyTable t
WHERE t.Value = @testValue AND DATALENGTH(t.Value) = DATALENGTH(@testValue)
It is up to the query optimizer to decide the order of filters, but it should choose to use an index for the data lookup if that makes sense for the table being tested, and then filter the remaining rows by length with the more expensive scalar operations. However, as another answer stated, it would be better to avoid these scalar operations altogether by using an indexed calculated column. The method presented here might make sense if you have no control over the schema, or if you want to avoid creating the calculated columns, or if creating and maintaining the calculated columns is considered more costly than the worse query performance.
I've only really got two suggestions. One would be to revisit the design that requires you to store trailing spaces - they're always a pain to deal with in SQL.
The second (given your SARG-able comments) would be to add a computed column to the table that stores the length, and add this column to appropriate indexes. That way, at least, the length comparison should be SARG-able.
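A minimal sketch of that computed-column idea follows. The table, column and index names are hypothetical, and DATALENGTH is used rather than LEN because LEN ignores trailing blanks.

```sql
-- Hypothetical table: the byte length is stored alongside the value
CREATE TABLE dbo.MyTable (
    Id              INT IDENTITY(1,1) PRIMARY KEY,
    Value           VARCHAR(100) NOT NULL,
    ValueByteLength AS DATALENGTH(Value) PERSISTED
);

-- Index covering both the value and its stored length
CREATE INDEX IX_MyTable_Value_ByteLength
    ON dbo.MyTable (Value, ValueByteLength);

-- The length check now reads the indexed column instead of calling a function per row
DECLARE @testValue VARCHAR(100) = 'x';
SELECT t.Id, t.Value
FROM dbo.MyTable AS t
WHERE t.Value = @testValue
  AND t.ValueByteLength = DATALENGTH(@testValue);
```

Whether the optimizer actually picks this index depends on the data and statistics, so it is worth checking the execution plan.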

Get a number from a sql string range

I have a column of data that contains a percentage range as a string that I'd like to convert to a number so I can do easy comparisons.
Possible values in the string:
'<5%'
'5-10%'
'10-15%'
...
'95-100%'
I'd like to convert this in my select where clause to just the first number, 5, 10, 15, etc. so that I can compare that value to a passed in "at least this" value.
I've tried a bunch of variations on substring, charindex, convert, and replace, but I still can't seem to get something that works in all combinations.
Any ideas?
Try this,
SELECT substring(replace(interest , '<',''), patindex('%[0-9]%',replace(interest , '<','')), patindex('%[^0-9]%',replace(interest, '<',''))-1) FROM table1
Tested at my end and it works, it's only my first try so you might be able to optimise it.
@Martin: Your solution works.
Here is another I came up with based on inspiration from @mercutio
select cast(replace(replace(replace(interest,'<',''),'%',''),'-','.0') as numeric) test
from table1 where interest is not null
You can convert char data to other types of char (convert char(10) to varchar(10)), but you won't be able to convert character data to integer data from within SQL.
I don't know if this works in SQL Server, but within MySQL, you can use several tricks to convert character data into numbers. Examples from your sample data:
"<5%" => 0
"5-10%" => 5
"95-100%" => 95
now obviously this fails your first test, but some clever string replacements on the start of the string would be enough to get it working.
One example of converting character data into numbers:
SELECT "5-10%" + 0 AS foo ...
Might not work in SQL Server, but future searches may help the odd MySQL user :-D
You can do this in SQL Server with a cursor. If you can create a CLR function to pull out number groupings, that will help. It's possible in T-SQL, it will just be ugly.
Create the cursor to loop over the list.
Find the first number. If there is only 1 number group in there then return it. Otherwise find the second item grouping.
If only the 1st item grouping is returned and it's the first item in the list, set it to the upper bound.
If only the 1st item grouping is returned and it's the last item in the list, set it to the lower bound.
Otherwise set the 1st item grouping to the lower bound, and the 2nd item grouping to the upper bound.
Just set the resulting values back to a table.
The issue you are having is a symptom of not keeping the data atomic. In this case it looks purely unintentional (Legacy) but here is a link about it.
To design yourself out of this create a range_lookup table:
Create table rangeLookup(
rangeID int -- or rangeCD or not at all
,rangeLabel varchar(50)
,LowValue int--real or whatever
,HighValue int
)
To hack yourself out of it, here are some pseudo steps. This will be a deeply nested mess.
Normalize your input by replacing all your crazy characters.
replace(replace(rangeLabel,'%',''),'<','')
--This will entail many nested replace statements.
Add a CASE and CHARINDEX to look for a space; if there is none you have your number,
else use your substring to take everything before the first ' '.
-- These steps are wrapped around the previous step.
It's complicated, but for the test cases you provided, this works. Just replace @Test with the column you are looking in from your table.
DECLARE @Test varchar(10)
set @Test = '<5%'
--set @Test = '5-10%'
--set @Test = '10-15%'
--set @Test = '95-100%'
SELECT
    CASE WHEN SUBSTRING(@Test,1,1) = '<'
         THEN 0
         ELSE CONVERT(integer,SUBSTRING(@Test,1,CHARINDEX('-',@Test)-1))
    END AS LowerBound,
    CASE WHEN SUBSTRING(@Test,1,1) = '<'
         THEN CONVERT(integer,SUBSTRING(@Test,2,CHARINDEX('%',@Test)-2))
         ELSE CONVERT(integer,SUBSTRING(@Test,CHARINDEX('-',@Test)+1,CHARINDEX('%',@Test)-CHARINDEX('-',@Test)-1))
    END AS UpperBound
You'd probably be much better off changing <5% and 5-10% to store 2 values in 2 fields. Instead of storing <5%, you would store 0 and 5, and instead of 5-10%, you'd end up with 5 and 10. You'd end up with 2 columns, one called lowerbound and one called upperbound, and then just check value >= lowerbound AND value < upperbound.
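A short sketch of how that two-column layout could be queried, using a table variable with a few invented rows (the table variable, the rangeLabel column and the sample value 7 are assumptions for illustration):

```sql
-- Hypothetical ranges stored as explicit bounds instead of a label to parse
DECLARE @ranges TABLE (rangeLabel VARCHAR(50), lowerbound INT, upperbound INT);
INSERT INTO @ranges (rangeLabel, lowerbound, upperbound)
VALUES ('<5%', 0, 5), ('5-10%', 5, 10), ('10-15%', 10, 15);

DECLARE @value INT = 7;

-- Plain numeric comparison, no string handling needed
SELECT rangeLabel
FROM @ranges
WHERE @value >= lowerbound AND @value < upperbound;
-- Returns '5-10%'
```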
