Specify string length for substring length parameter in T-SQL - sql-server

According to this MSDN article, SUBSTRING is defined as:
SUBSTRING (value_expression ,start_expression ,length_expression )
length_expression
Is a positive integer or bigint expression that specifies how many
characters of the value_expression will be returned. If
length_expression is negative, an error is generated and the statement
is terminated. If the sum of start_expression and length_expression is
greater than the number of characters in value_expression, the whole
value expression beginning at start_expression is returned.
So if I had:
DECLARE @data varchar(max)
SELECT TOP 1 @data = Data
FROM [SomeDatabase].[dbo].[MyTable]
SELECT SUBSTRING(@data, 10, LEN(@data))
My understanding is that because the requested length is longer than what remains of the supplied string, SUBSTRING returns everything from start_expression to the end of the string.
For example, if:
SET @data = 'hey there'; -- character length of 9
SELECT SUBSTRING(@data, 4, 20)
This should return ' there' (position 4 is the space, so the result starts with it).
Is this a particularly bad thing to do?
Are there any caveats to doing this?
Should I be explicit about the length of string to return?

Related

Why does LEN( char(32) ) = 0 in T-SQL?

I wanted to write a function to count the number of delimiters or any substring (which could be a space) in a string of text, throwing a hack error if the delimiter was null or empty:
if len(@lookfor)=0 or @lookfor is null return Cast('substring must not be null or empty' as int)
But if the function is called with @lookfor = ' ' that trips the error.
I am aware of DATALENGTH(). Just curious why a single space is treated as "trailing" if there's nothing before it.
It's trailing because it's at the end of the string. It's also leading, since it's at the beginning.
But if the function is called with @lookfor = ' ' that trips the error
Something that trips up a lot of people in SQL is that '' = ' ' evaluates to true. Note this query:
DECLARE @blank VARCHAR(10) = '', @space VARCHAR(10) = CHAR(32);
SELECT CASE WHEN @blank = @space THEN 'That the...!?!?' END;
You can change @space to CHAR(32)+CHAR(32)+... and @space and @blank will still be equal.
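A quick way to see the effect on the original error check is to put LEN, DATALENGTH and the equality test side by side. This is only a minimal sketch; the column aliases are made up for illustration.

```sql
DECLARE @space VARCHAR(10) = CHAR(32);

SELECT LEN(@space)        AS len_result,      -- 0: LEN ignores trailing spaces
       DATALENGTH(@space) AS bytes_stored,    -- 1: the space is still stored
       CASE WHEN @space = '' THEN 1 ELSE 0 END AS equals_empty;  -- 1: compared as if padded
```

So a check written as len(@lookfor)=0 cannot tell a single space from an empty string, while a DATALENGTH check can.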
Complicating things a little more, note that the DATALENGTH of a blank/empty value is 0 for a VARCHAR(N) but N for a CHAR(N). In other words,
SELECT DATALENGTH(CAST('' AS CHAR(1))) returns 1 and SELECT DATALENGTH(CAST('' AS CHAR(10))) returns 10.
That means that if your delimiter variable is, say, CHAR(1), that will mess you up. Here's the function for you:
CREATE FUNCTION dbo.CountDelimiters(@string VARCHAR(8000), @delimiter VARCHAR(1))
RETURNS TABLE WITH SCHEMABINDING AS RETURN
SELECT DCount = MAX(DATALENGTH(@string)-LEN(REPLACE(@string,@delimiter,'')))
WHERE DATALENGTH(@delimiter) > 0;
Note that @delimiter is VARCHAR(1) and NOT a CHAR datatype.
The formula to count delimiters in #string is:
DATALENGTH(@string)-LEN(REPLACE(@string,@delimiter,''))
or
(DATALENGTH(@string)-LEN(REPLACE(@string,@delimiter,'')))/DATALENGTH(@delimiter) when dealing with delimiters longer than 1 character.
WHERE DATALENGTH(@delimiter) > 0 will force the function to ignore a NULL or blank value. This is known as a Startup Predicate.
Putting a MAX around DATALENGTH(@string)-LEN(REPLACE(@string,@delimiter,'')) forces the function to return a NULL value in the event you pass it a blank or NULL value.
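The multi-character form of the formula can be tried on its own. The sketch below uses made-up sample values, and the second query is an added observation rather than part of the answer: because LEN ignores trailing spaces, the subtraction can over-count when the string still ends in a space after the REPLACE, which using DATALENGTH on both sides avoids.

```sql
-- Illustrative values, not from the original answer.
DECLARE @string VARCHAR(8000) = 'a||b||c',
        @delimiter VARCHAR(10) = '||';

-- 7 bytes in total, 3 left after removing '||', divided by 2 bytes per delimiter.
SELECT (DATALENGTH(@string) - LEN(REPLACE(@string, @delimiter, '')))
       / DATALENGTH(@delimiter) AS DCount;                        -- 2

-- 'A B ' contains one 'A', but ' B ' has LEN 2 and DATALENGTH 3.
SELECT DATALENGTH('A B ') - LEN(REPLACE('A B ', 'A', ''))        AS with_len,         -- 2
       DATALENGTH('A B ') - DATALENGTH(REPLACE('A B ', 'A', '')) AS with_datalength;  -- 1
```

This assumes single-byte VARCHAR data; with NVARCHAR the byte counts double, so dividing by DATALENGTH(@delimiter) still gives the right count.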
This will return 5 for the number of spaces in the string as shown here:
SELECT f.DCount FROM dbo.CountDelimiters('one space two spaces three ', CHAR(32)) AS f;
Against a table you would use the function like this (note that I'm counting the number of times the letter "A" appears):
-- Sample Strings
DECLARE @table TABLE (SomeText VARCHAR(36));
INSERT @table VALUES('ABCABC'),('XXX'),('AAA'),(''),(NULL);
SELECT t.SomeText, f.DCount
FROM @table AS t
CROSS APPLY dbo.CountDelimiters(t.SomeText, 'A') AS f;
Which returns:
SomeText                             DCount
------------------------------------ -----------
ABCABC                               2
XXX                                  0
AAA                                  3
                                     0
NULL                                 NULL
If a string has a character at the end, it is considered trailing, even if there are no other characters before it. The same logic applies to leading characters.
So ' ' can be considered an empty string ('') having a trailing space.
When I started using SQL, I also noticed that the LEN function ignores trailing spaces. I think (but I am not sure) that it has to do with the fact that LEN should also behave "correctly" when used with CHAR/NCHAR values. Unlike VARCHAR/NVARCHAR, CHAR/NCHAR values have a fixed width and are padded with trailing spaces automatically. So when you put the value 'abc' in a field/variable of type CHAR(5), the value becomes 'abc  ' (padded to five characters), but the LEN function will still "correctly" return 3 in that case.
I consider this just to be a strange quirk of SQL.
Remark:
The DATALENGTH function does not ignore trailing spaces in VARCHAR/NVARCHAR values. But note that DATALENGTH returns the size in bytes of the field's value. So if you use Unicode data (NCHAR/NVARCHAR), DATALENGTH returns 6 for the value N'abc', because each of these characters is stored in 2 bytes.
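The padding and byte-count behaviour described above can be checked with a couple of variables. This is a small sketch; the variable names and aliases are illustrative only.

```sql
DECLARE @fixed CHAR(5)      = 'abc',     -- padded to 'abc' plus two spaces
        @uni   NVARCHAR(10) = N'abc ';   -- one explicit trailing space

SELECT LEN(@fixed)        AS fixed_len,    -- 3: padding is ignored
       DATALENGTH(@fixed) AS fixed_bytes,  -- 5: padding is stored
       LEN(@uni)          AS uni_len,      -- 3: trailing space is ignored
       DATALENGTH(@uni)   AS uni_bytes;    -- 8: 4 characters at 2 bytes each
```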

Do we need to remove some characters from SUBSTRING length if start is not from 0?

I came across this statement and wonder what the difference is between
DECLARE @MyStr nvarchar(max) = 'This is a sentence'
SELECT SUBSTRING(@MyStr, 6, LEN(@MyStr)-5)
and
DECLARE @MyStr nvarchar(max) = 'This is a sentence'
SELECT SUBSTRING(@MyStr, 6, LEN(@MyStr))
I ran both and the latter doesn't add any spaces or anything.
The third parameter to SUBSTRING is the length of substring to take. In the first example, the length covers exactly the remainder of the string. In the second example, the length exceeds the actual available length of the string, by 5 characters. Regarding what happens when the length is greater than the available length, we can turn to the documentation for SUBSTRING:
length
Is a positive integer or bigint expression that specifies how many characters of the expression will be returned. If length is negative, an error is generated and the statement is terminated. If the sum of start and length is greater than the number of characters in expression, the whole value expression beginning at start is returned.
In other words, if the length parameter plus the start passed is greater than what is available, then SUBSTRING just returns as much as is available.
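Running both forms in one statement makes the equivalence easy to confirm. This is a minimal sketch using the string from the question; the column aliases are made up.

```sql
DECLARE @MyStr NVARCHAR(MAX) = N'This is a sentence';   -- 18 characters

SELECT SUBSTRING(@MyStr, 6, LEN(@MyStr) - 5) AS exact_length,      -- 'is a sentence'
       SUBSTRING(@MyStr, 6, LEN(@MyStr))     AS oversized_length;  -- 'is a sentence'
```

One difference can show up when the string ends in spaces: LEN does not count them, so the LEN(@MyStr) - 5 form may cut them off, whereas the oversized length usually has enough slack to keep them.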

Oracle: Check if number column contains a value from a formatted string of numbers

In my local table, I am trying to check whether an Oracle NUMBER column called JOB_NUMBER has a value that exists in a string parameter. Technically I am passing the string in as a stored procedure nvarchar2 parameter, but for simplicity I hardcoded the string in the query below:
SELECT FIRST_NAME, JOB_NUMBER
FROM JOBTABLE
WHERE TO_CHAR(JOB_NUMBER) IN ('00052, 00048');
When Oracle runs the query above, it returns no rows even though 00052 is a number value in the table column JOB_NUMBER. I think it looks for the whole string ('00052, 00048') in JOB_NUMBER and can't find it, so it returns nothing. The string will contain different values each time, and there will be several numbers (as text) in that string.
Does anyone know how to do this?
The trick is to keep the leading zeroes of the number when comparing it to the string, then loop through the string to compare. Here a CTE is used to simulate a numeric job number and a string to search. The TO_CHAR function with the '00000' format preserves the leading zeroes, and the FM modifier removes the leading space that TO_CHAR leaves for the sign. CONNECT BY loops once per element (the number of delimiters + 1), keeping the current count in LEVEL. That value is passed to REGEXP_SUBSTR to pick out each element in turn, and the converted numeric value is compared to each element to see if a match is found. Note that this regular expression allows for NULL elements, should you need to know which item in the list is your match.
with tbl(job_nbr_in, job_str_in) as (
  select 00052, '00052, 00048' from dual
)
select --level element_nbr,
       to_char(job_nbr_in, 'FM00000') search_for, job_str_in in_string,
       regexp_substr(job_str_in, '(.*?)(, |$)', 1, level, NULL, 1) found
from tbl
where to_char(job_nbr_in, 'FM00000') = regexp_substr(job_str_in, '(.*?)(, |$)', 1, level, NULL, 1)
connect by level <= regexp_count(job_str_in, ',')+1;
SEARCH_FOR IN_STRING    FOUND
---------- ------------ ------------
00052      00052, 00048 00052
If you are not sure if you will always have a space after the comma, remove spaces with REPLACE and adjust the delimiter in REGEXP_SUBSTR:
with tbl(job_nbr_in, job_str_in) as (
  select 00052, '00052, 00048' from dual
)
select to_char(job_nbr_in, 'FM00000') search_for, job_str_in in_string,
       regexp_substr(replace(job_str_in, ' '), '(.*?)(,|$)', 1, level, NULL, 1) found
from tbl
where to_char(job_nbr_in, 'FM00000') = regexp_substr(replace(job_str_in, ' '), '(.*?)(,|$)', 1, level, NULL, 1)
connect by level <= regexp_count(job_str_in, ',')+1;

What exactly is the meaning of nvarchar(n)

The documentation isn't super clear: https://msdn.microsoft.com/en-us/library/ms186939.aspx
What happens if I try to store a 20 character length string in a column defined as nvarchar(10)? Is 10 the max length the field could be or is it the expected length? If I can exceed n characters in the string, what are the performance implications of doing that?
The maximum number of characters you can store in a column or variable typed as nvarchar(n) is n. If you try to store more, an assignment to a variable silently truncates the string, while an insert into a table is rejected with a truncation error:
String or binary data would be truncated. The statement has been terminated.
The variable case:
declare @n nvarchar(10)
set @n = N'more than ten chars'
select @n
Result:
----------
more than
(1 row(s) affected)
From my understanding, nvarchar will only store the provided characters, up to the amount defined. Nchar will actually fill in the unused characters with whitespace.
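The difference in storage between the two types shows up directly in DATALENGTH. A short sketch, with illustrative variable names:

```sql
DECLARE @v NVARCHAR(10) = N'abc',
        @c NCHAR(10)    = N'abc';

SELECT DATALENGTH(@v) AS nvarchar_bytes,  -- 6: only the 3 characters supplied
       DATALENGTH(@c) AS nchar_bytes;     -- 20: padded with spaces to 10 characters
```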

Sql Server's regex LIKE - behaviour clarification?

Someone asked here how to get only the values which are numbers.
So, if the table is:
DECLARE @Table TABLE(
Col nVARCHAR(50)
)
INSERT INTO @Table SELECT 'ABC'
INSERT INTO @Table SELECT '234.62'
INSERT INTO @Table SELECT '10:10:10:10'
INSERT INTO @Table SELECT 'France'
INSERT INTO @Table SELECT '2'
then the desired results are:
234.62
2
But when I tested this query:
SELECT * FROM @Table WHERE Col LIKE '%[0-9.]%' --expected to see only 234.62
it showed:
234.62
10:10:10:10
2
Question #1
How come 10:10:10:10 and 2 satisfy the condition?
Question #2
I saw this answer here, which does work:
SELECT * FROM @Table WHERE Col NOT LIKE '%[^0-9.]%'
But I don't understand why this works. AFAIU, it selects all values which are not like (not(has number) and not(has dot)), which by De Morgan becomes not like (has number or has dot).
Can someone please shed some light?
N.B. I already know that ISNUMERIC can also be used, but it's unsafe (for example, it accepts '+'). Also, the valid wildcards are %, _, [] and [^].
Any particular use of [set] within a LIKE expression is a check against one character in the target string.
So, LIKE '%[0-9.]%' says: % matches 0-to-many arbitrary characters, then [0-9.] matches one character in the set 0-9., and then % matches 0-to-many arbitrary characters. Paraphrased, it says "match any string that contains at least one character in the set 0-9.". So 10:10:10:10 can be matched as 0 arbitrary characters, then 1 matches [0-9.], and then 0:10:10:10 matches the final %.
LIKE '%[^0-9.]%' says: % matches 0-to-many arbitrary characters, then [^0-9.] matches one character not in the set 0-9., and then % matches 0-to-many arbitrary characters. Paraphrased, it says "match any string that contains at least one character outside of the set 0-9.". So when we apply the NOT to the front of that, we are saying "match any string that doesn't contain at least one character outside of the set 0-9.", or "match strings that only contain characters in the set 0-9.".
Essentially, the double-negative is a way to make an assertion about all characters in the string.
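Evaluating both patterns against the question's sample rows shows the difference row by row. This is a sketch only; the two flag column names are invented for illustration.

```sql
DECLARE @Table TABLE (Col NVARCHAR(50));
INSERT INTO @Table (Col)
VALUES (N'ABC'), (N'234.62'), (N'10:10:10:10'), (N'France'), (N'2');

SELECT Col,
       -- 1 when at least one character is a digit or a dot
       CASE WHEN Col LIKE '%[0-9.]%' THEN 1 ELSE 0 END      AS has_digit_or_dot,
       -- 1 when no character falls outside digits and dots
       CASE WHEN Col NOT LIKE '%[^0-9.]%' THEN 1 ELSE 0 END AS only_digits_and_dots
FROM @Table;
-- ABC          0  0
-- 234.62       1  1
-- 10:10:10:10  1  0
-- France       0  0
-- 2            1  1
```

Note that the second test only constrains which characters appear, so an empty string or a value such as '1.2.3' would pass it as well.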
