I read through the documentation for the SqlServer EXCEPT operator and I see no mention of explicit trimming of white space at the end of a string. However, when running:
SELECT 'Test'
EXCEPT
SELECT 'Test '
no results are returned. Can anyone explain this behavior or how to avoid it when using EXCEPT?
ANSI SQL-92 requires strings to be the same length before comparing and the pad character is a space.
See https://support.microsoft.com/en-us/help/316626/inf-how-sql-server-compares-strings-with-trailing-spaces for more information
In the ANSI standard (accessed here section 8.2 )
3) The comparison of two character strings is determined as follows:
a) If the length in characters of X is not equal to the length
in characters of Y, then the shorter string is effectively
replaced, for the purposes of comparison, with a copy of
itself that has been extended to the length of the longer
string by concatenation on the right of one or more pad char-
acters, where the pad character is chosen based on CS. If
CS has the NO PAD attribute, then the pad character is an
implementation-dependent character different from any char-
acter in the character set of X and Y that collates less
than any string under CS. Otherwise, the pad character is a
<space>.
b) The result of the comparison of X and Y is given by the col-
lating sequence CS.
c) Depending on the collating sequence, two strings may com-
pare as equal even if they are of different lengths or con-
tain different sequences of characters. When the operations
MAX, MIN, DISTINCT, references to a grouping column, and the
UNION, EXCEPT, and INTERSECT operators refer to character
strings, the specific value selected by these operations from
a set of such equal values is implementation-dependent.
If this behaviour must be avoided, you can reverse the columns as part of your EXCEPT:
SELECT 'TEST', REVERSE('TEST')
EXCEPT
SELECT 'TEST ', REVERSE('TEST ')
which gives the expected result, though is quite annoying especially if you're dealing with multiple columns.
The alternative would be to find a collating sequence with an alternate pad character or a no pad option set, though this seems to not exist in t-sql after a quick google.
Alternatively, you could terminate each column with a character and then substring it out in the end:
SELECT SUBSTRING(col,1,LEN(col) -1) FROM
(
SELECT 'TEST' + '^' as col
EXCEPT
SELECT 'TEST ' + '^'
) results
Related
I would like to check if a specific column in one of my tables meets the following conditions:
String must contain at least three characters
String must contain at least two different numbers [e.g. 123 would work but 111 would not]
Characters which are allowed in the string:
Numbers (0-9)
Uppercase letters
Lowercase letters
Underscores (_)]
Dashes (-)
I have some experience with Regex but am having issues with Snowflake's syntax. Whenever I try using the '?' regex character (to mark something as optional) I receive an error. Can someone help me understand a workaround and provide a solution?
What I have so far:
SELECT string,
LENGTH(string) AS length
FROM tbl
WHERE REGEXP_LIKE(string,'^[0-9]+{3,}[-+]?[A-Z]?[a-z]?$')
ORDER BY length;
Thanks!
Your regex looks a little confusing and invalid, and it doesn't look like it quite meets your needs either. I read this expression as a string that:
Must start with one or more digits, at least 3 or more times
The confusing part to me is the '+' is a quantifier, which is not quantifiable with {3,} but somehow doesn't produce an error for me
Optionally followed by either a dash or plus sign
Followed by an uppercase character zero or one times (giving back as needed)
Followed by and ending with a lowercase character zero or one times (giving back as needed)
Questions
You say that your string must contain 3 characters and at least 2 different numbers, numbers are characters but I'm not sure if you mean 3 letters...
Are you considering the numbers to be characters?
Does the order of the characters matter?
Can you provide an example of the error you are receiving?
Notes
Checking for a second digit that is not the same as the first involves the concept of a lookahead with a backreference. Snowflake does not support backreferences.
One thing about pattern matching with regular expressions is that order makes a difference. If order is not of importance to you, then you'll have multiple patterns to match against.
Example
Below is how you can test each part of your requirements individually. I've included a few regexp_substr functions to show how extraction can work to check if something exists again.
Uncomment the WHERE clause to see the dataset filtered. The filters are written as expressions so you can remove any/all of the regexp_* columns.
select randstr(36,random(123)) as r_string
,length(r_string) AS length
,regexp_like(r_string,'^[0-9]+{3,}[-+]?[A-Z]?[a-z]?$') as reg
,regexp_like(r_string,'.*[A-Za-z]{3,}.*') as has_3_consecutive_letters
,regexp_like(r_string,'.*\\d+.*\\d+.*') as has_2_digits
,regexp_substr(r_string,'(\\d)',1,1) as first_digit
,regexp_substr(r_string,'(\\d)',1,2) as second_digit
,first_digit <> second_digit as digits_1st_not_equal_2nd
,not(regexp_instr(r_string,regexp_substr(r_string,'(\\d)',1,1),1,2)) as first_digit_does_not_appear_again
,has_3_consecutive_letters and has_2_digits and first_digit_does_not_appear_again as test
from table(generator(rowcount => 10))
//where regexp_like(r_string,'.*[A-Za-z]{3,}.*') // has_3_consecutive_letters
// and regexp_like(r_string,'.*\\d+.*\\d+.*') // has_2_digits
// and not(regexp_instr(r_string,regexp_substr(r_string,'(\\d)',1,1),1,2)) // first_digit_does_not_appear_again
;
Assuming the digits need to be contiguous, you can use a javascript UDF to find the number in a string with with the largest number of distinct digits:
create or replace function f(S text)
returns float
language javascript
returns null on null input
as
$$
const m = S.match(/\d+/g)
if (!m) return 0
const lengths = m.map(m=> [...new Set (m.split(''))].length)
const max_length = lengths.reduce((a,b) => Math.max(a,b))
return max_length
$$
;
Combined with WHERE-clause, this does what you want, I believe:
select column1, f(column1) max_length
from t
where max_length>1 and length(column1)>2 and column1 rlike '[\\w\\d-]+';
Yielding:
COLUMN1 | MAX_LENGTH
------------------------+-----------
abc123def567ghi1111_123 | 3
123 | 3
111222 | 2
Assuming this input:
create or replace table t as
select * from values ('abc123def567ghi1111_123'), ('xyz111asdf'), ('123'), ('111222'), ('abc 111111111 abc'), ('12'), ('asdf'), ('123 456'), (null);
The function is even simpler if the digits don't have to be contiguous (i.e. count the distinct digits in a string). Then core logic changes to:
const m = S.match(/\d/g)
if (!m) return 0
const length = [...new Set (m)].length
return length
Hope that's helpful!
When I execute this
select PATINDEX('%[0 ]%', '03/SI/00807/18-19')
I am getting 1.
By using ^ like this:
select PATINDEX('%[^0 ]%', '03/SI/00807/18-19')
I am getting 2.
[^] Allows you to match on any character not in the [^] brackets (for example, [^abc] would match on any character that is not a, b, or c characters) Whereas
[ ] Allows you to match on any character in the [ ] brackets (for example, [abc] would match on a, b, or c characters)
_ Allows you to match on a single character
% Allows you to match any string of any length (including zero length)
[^abcd] means: any one character EXCEPT a,b,c or d
select PATINDEX('%[0 ]%', '03/SI/00807/18-19')
The first character in your string which is (0 or space) is the 0 in the first place, so patindex returns 1.
select PATINDEX('%[^0 ]%', '03/SI/00807/18-19')
The first character in your string which is (neither 0 nor space) is the 3 in the second place, so patindex returns 2.
I came across this sentence and thing what the difference between
DECLARE #MyStr nvarchar(max) = 'This is a sentence'
SELECT SUBSTRING(#MyStr, 6, LEN(#MyStr)**-5**)
and
DECLARE #MyStr nvarchar(max) = 'This is a sentence'
SELECT SUBSTRING(#MyStr, 6, LEN(#MyStr))
I ran both and the latter doesn't add any spaces or anything.
The third parameter to SUBSTRING is the length of substring to take. In the first example, the length covers exactly the remainder of the string. In the second example, the length exceeds the actual available length of the string, by 5 characters. Regarding what happens when the length is greater than the available length, we can turn to the documentation for SUBSTRING:
length
Is a positive integer or bigint expression that specifies how many characters of the expression will be returned. If length is negative, an error is generated and the statement is terminated. If the sum of start and length is greater than the number of characters in expression, the whole value expression beginning at start is returned.
In other words, if the length parameter plus the start passed is greater than what is available, then SUBSTRING just returns as much as is available.
I have a database that has multiple columns populated with various numeric fields. While trying to populate from a CSV, I must have mucked up assigning delimited fields. The end result is a column containing It's Correct information, but also contains the next column over's data- seperated by a comma.
So instead of Column UPC1 containing "958634", it contains "958634,95877456". The "95877456" is supposed to be in the UPC2 column, instead UPC2 is NULL.
Is there a way for me to split on the comma and send the data to UPC2 while keeping UPC1 data before the comma in tact?
Thanks.
You can do this with string functions. To query the values and verify the logic, try this:
SELECT
LEFT(UPC1, CHARINDEX(',', UPC1) - 1),
SUBSTRING(UPC1, CHARINDEX(',', UPC1) + 1, 1000)
FROM myTable;
If the result is what you want, turn it into an update:
UPDATE myTable SET
UPC1 = LEFT(UPC1, CHARINDEX(',', UPC1) - 1),
UPC2 = SUBSTRING(UPC1, CHARINDEX(',', UPC1) + 1, 1000);
The expression for UPC1 takes the left side of UPC1 up to one character before the comma.
The expression for UPC2 takes the remainder of the UPC1 string starting one character after the comma.
The third argument to SUBSTRING needs some explaining. It's the number of characters you want to include after the starting position of the string (which in this case is one character after the comma's location). If you specify a value that's longer than the string SUBSTRING will just return to the end of the string. Using 1000 here is a lot easier than calculating the exact number of characters you need to get to the end.
I have some sql function that returns character varying type. The output is something like this:
'TTFFFFNN'. I need to get this characters by index. How to convert character varying to array?
Use string_to_array() with NULL as delimiter (pg 9.1+):
SELECT string_to_array('TTFFFFNN'::text, NULL) AS arr;
Per documentation:
In string_to_array, if the delimiter parameter is NULL, each character
in the input string will become a separate element in the resulting array.
In older versions (pg 9.0-), the call with NULL returned NULL. (Fiddle.)
To get the 2nd position (example):
SELECT (string_to_array('TTFFFFNN'::text, NULL))[2] AS item2;
Alternatives
For single characters I would use substring() directly, like #a_horse commented:
SELECT substring('TTFFFFNN'::text, 2, 1) AS item2;
SQL Fiddle showing both.
For strings with actual delimiters, I suggest split_part():
Split comma separated column data into additional columns
Only use regexp_split_to_array() if you must. Regular expression processing is more expensive.
In addition to the solutions outlined by Erwin and a horse, you can use regexp_split_to_array() with an empty regexp instead:
select regexp_split_to_array('TTFFFFNN'::text, '');
With an index, that becomes:
select (regexp_split_to_array('TTFFFFNN'::text, ''))[2];