Are there limitations on Snowflake's split_part function? - snowflake-cloud-data-platform

Does Snowflake's split_part function have a limit on how large the string, or the individual delimited parts of the string, can be? For example, in SQL Server, if any part of the string exceeds 256 bytes, the parsename function returns NULL for that part.
I looked here, but couldn't find any mention of such a limitation.

To show that there's no limit anywhere near 256 bytes, I generated a string of about 3.3 million characters made of 3 comma-separated substrings. split_part() extracted a substring of 1.1 million characters without a problem:
create table LONG_STRING
as
select repeat('abcdefghijk', 100000)||','
||repeat('abcdefghijk', 100000)||','
||repeat('abcdefghijk', 100000) ls
;
select len(ls)
, len(split_part(ls, ',', 2))
from LONG_STRING;
-- 3,300,002 1,100,000
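As a quick sanity check on those two numbers, here is a small Python analogue of the test. It only confirms the arithmetic (11 characters repeated 100,000 times, three parts, two commas) and says nothing about Snowflake's own limits. The split_part helper is a simplified stand-in I wrote for illustration, not Snowflake's implementation.

```python
# Python analogue of the Snowflake test; it only checks the expected lengths.
part = "abcdefghijk" * 100_000          # 11 chars * 100,000 = 1,100,000 chars
long_string = ",".join([part] * 3)      # three parts joined by two commas


def split_part(s: str, delimiter: str, n: int) -> str:
    """Simplified stand-in: return the n-th (1-based) part, or '' if missing."""
    pieces = s.split(delimiter)
    return pieces[n - 1] if 1 <= n <= len(pieces) else ""


print(len(long_string), len(split_part(long_string, ",", 2)))
```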

Related

TDengine insertion use taos_stmt apis

After creating super table and tables, call taos_load_table_info to load the table information. Then initialize stmt by calling taos_stmt_init and taos_stmt_set_tbname to set up table name.
Create the TAOS_BIND object with the following attributes:
buffer_type = TSDB_DATA_TYPE_NCHAR
buffer_length = sizeof(str)
buffer = &str
length = sizeof(str)
Then call taos_stmt_bind_param and taos_stmt_add_batch, and finally execute with taos_stmt_execute.
The problem is that the insertion seems to fail: when I run select * in the shell to look for the data, it only shows an empty column.
I strongly recommend that you first try inserting a simple nchar value to check whether the problem is in the taos_stmt API itself. If that insertion succeeds, then check whether the nchar string you insert has the same length as the str variable. buffer_length may be greater than or equal to length. However, if the actual size of your nchar data is less than the length value in TAOS_BIND, then TDengine will still parse the bound value together with the extra empty bytes, and the insert will fail.
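To illustrate the difference between the size of the buffer and the size of the data in it, here is a small Python sketch using ctypes. It does not call TDengine at all. The 64-byte buffer and the value "hello" are made up for illustration, standing in for a C char array named str.

```python
import ctypes

# Hypothetical 64-byte buffer holding a short value, standing in for `str`.
buf = ctypes.create_string_buffer(b"hello", 64)

allocated = ctypes.sizeof(buf)   # the whole buffer, like sizeof(str) on a C array
actual = len(buf.value)          # bytes before the first NUL, i.e. the real data

print(allocated, actual)
```

If length is set from the allocated size rather than the actual size, the two values disagree, which is the situation described above.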

Regex string with 2+ different numbers and some optional characters in Snowflake syntax

I would like to check if a specific column in one of my tables meets the following conditions:
String must contain at least three characters
String must contain at least two different numbers [e.g. 123 would work but 111 would not]
Characters which are allowed in the string:
Numbers (0-9)
Uppercase letters
Lowercase letters
Underscores (_)
Dashes (-)
I have some experience with Regex but am having issues with Snowflake's syntax. Whenever I try using the '?' regex character (to mark something as optional) I receive an error. Can someone help me understand a workaround and provide a solution?
What I have so far:
SELECT string,
LENGTH(string) AS length
FROM tbl
WHERE REGEXP_LIKE(string,'^[0-9]+{3,}[-+]?[A-Z]?[a-z]?$')
ORDER BY length;
Thanks!
Your regex looks a little confusing and invalid, and it doesn't look like it quite meets your needs either. I read this expression as a string that:
Must start with one or more digits, at least 3 or more times
The confusing part to me is that '+' is itself a quantifier, which cannot be quantified again with {3,}, yet somehow this doesn't produce an error for me
Optionally followed by either a dash or plus sign
Followed by an uppercase character zero or one times (giving back as needed)
Followed by and ending with a lowercase character zero or one times (giving back as needed)
Questions
You say that your string must contain 3 characters and at least 2 different numbers. Numbers are characters too, so I'm not sure whether you mean 3 letters...
Are you considering the numbers to be characters?
Does the order of the characters matter?
Can you provide an example of the error you are receiving?
Notes
Checking for a second digit that is not the same as the first involves a lookahead with a backreference. Snowflake does not support backreferences inside a regular expression pattern.
One thing about pattern matching with regular expressions is that order makes a difference. If order is not of importance to you, then you'll have multiple patterns to match against.
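For comparison, this is roughly what the lookahead-with-backreference idea looks like in an engine that does support it. It is a Python sketch only, and the pattern will not work in Snowflake. The function names are mine, and I'm assuming the rules are: only the allowed characters, at least three of them, and at least two different digits anywhere in the string. The second function shows the same check without a backreference.

```python
import re

# Only the allowed characters, and at least three of them.
ALLOWED = re.compile(r"[0-9A-Za-z_-]{3,}")
# A digit, followed somewhere later by a digit that differs from it.
TWO_DISTINCT_DIGITS = re.compile(r"(\d)(?=.*(?!\1)\d)")


def is_valid(s: str) -> bool:
    """Check the rules using a lookahead with a backreference."""
    return bool(ALLOWED.fullmatch(s)) and bool(TWO_DISTINCT_DIGITS.search(s))


def is_valid_without_backreference(s: str) -> bool:
    """Same rules, but count the distinct digits instead."""
    return bool(ALLOWED.fullmatch(s)) and len(set(re.findall(r"\d", s))) >= 2


for sample in ["123", "111", "a1b2", "ab", "a1 2"]:
    print(sample, is_valid(sample), is_valid_without_backreference(sample))
```

The second form is the one that carries over to Snowflake more easily, since counting distinct digits needs no backreference.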
Example
Below is how you can test each part of your requirements individually. I've included a few regexp_substr functions to show how extraction can work to check if something exists again.
Uncomment the WHERE clause to see the dataset filtered. The filters are written as expressions so you can remove any/all of the regexp_* columns.
select randstr(36,random(123)) as r_string
,length(r_string) AS length
,regexp_like(r_string,'^[0-9]+{3,}[-+]?[A-Z]?[a-z]?$') as reg
,regexp_like(r_string,'.*[A-Za-z]{3,}.*') as has_3_consecutive_letters
,regexp_like(r_string,'.*\\d+.*\\d+.*') as has_2_digits
,regexp_substr(r_string,'(\\d)',1,1) as first_digit
,regexp_substr(r_string,'(\\d)',1,2) as second_digit
,first_digit <> second_digit as digits_1st_not_equal_2nd
,not(regexp_instr(r_string,regexp_substr(r_string,'(\\d)',1,1),1,2)) as first_digit_does_not_appear_again
,has_3_consecutive_letters and has_2_digits and first_digit_does_not_appear_again as test
from table(generator(rowcount => 10))
//where regexp_like(r_string,'.*[A-Za-z]{3,}.*') // has_3_consecutive_letters
// and regexp_like(r_string,'.*\\d+.*\\d+.*') // has_2_digits
// and not(regexp_instr(r_string,regexp_substr(r_string,'(\\d)',1,1),1,2)) // first_digit_does_not_appear_again
;
Assuming the digits need to be contiguous, you can use a JavaScript UDF to find the number in a string with the largest number of distinct digits:
create or replace function f(S text)
returns float
language javascript
returns null on null input
as
$$
const m = S.match(/\d+/g)
if (!m) return 0
const lengths = m.map(m=> [...new Set (m.split(''))].length)
const max_length = lengths.reduce((a,b) => Math.max(a,b))
return max_length
$$
;
Combined with a WHERE clause, this does what you want, I believe:
select column1, f(column1) max_length
from t
where max_length>1 and length(column1)>2 and column1 rlike '[\\w\\d-]+';
Yielding:
COLUMN1 | MAX_LENGTH
------------------------+-----------
abc123def567ghi1111_123 | 3
123 | 3
111222 | 2
Assuming this input:
create or replace table t as
select * from values ('abc123def567ghi1111_123'), ('xyz111asdf'), ('123'), ('111222'), ('abc 111111111 abc'), ('12'), ('asdf'), ('123 456'), (null);
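If you want to check the logic outside Snowflake, here is a Python sketch that mirrors the UDF and the WHERE clause on the same input values. The function name is mine, and Python's regex engine is not Snowflake's, so treat it as a cross-check of the idea rather than of Snowflake's behaviour.

```python
import re


def max_distinct_digits(s):
    """Largest count of distinct digits within any contiguous run of digits."""
    if s is None:
        return None  # mirrors "returns null on null input"
    runs = re.findall(r"\d+", s)
    return max((len(set(run)) for run in runs), default=0)


rows = ["abc123def567ghi1111_123", "xyz111asdf", "123", "111222",
        "abc 111111111 abc", "12", "asdf", "123 456", None]

# Same three conditions as the WHERE clause; \w already covers digits and "_".
kept = [(s, max_distinct_digits(s)) for s in rows
        if s is not None
        and max_distinct_digits(s) > 1
        and len(s) > 2
        and re.fullmatch(r"[\w-]+", s)]

for row in kept:
    print(row)
```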
The function is even simpler if the digits don't have to be contiguous (i.e. count the distinct digits in a string). Then core logic changes to:
const m = S.match(/\d/g)
if (!m) return 0
const length = [...new Set (m)].length
return length
Hope that's helpful!

Max bytes that can be sent to SQL Server using RODBC's sqlQuery()

What is the maximum character string length / number of bytes that can be sent to SQL Server using RODBC's function sqlQuery()? I've been using sqlQuery() primarily for updating and inserting records in tables by sending multiple statements in one string, i.e. as a batch. For example, this string has 3 query statements and has a string length / byte size of 146 according to nchar() when it's not broken up by newlines and spaces.
update_queries = "UPDATE tbl SET col1 = newval1 WHERE col2 = val1;
UPDATE tbl SET col1 = newval2 WHERE col2 = val2;
UPDATE tbl SET col1 = newval3 WHERE col2 = val3;"
I can send it off with sqlQuery(db_conn, update_queries), and so this goes back to the question I posted. I've run into this concept of Network Packet Size from the SQL Server (64 bit) documentation. It states that the maximum batch size is 65,536 * Network Packet Size, where the default packet size is 4 KB. I assume the result is in bytes, so it's 65,536 * 4,000 = 262,144,000 bytes. Would 262,144,000 then be the maximum length a string containing valid queries could be? Can someone please clarify if I have the right idea here or is there another SQL Server concept or ODBC concept that I need to know about? Thanks
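A small note on the arithmetic: if the 4 KB default packet size is taken as 4,096 bytes rather than 4,000, the product comes out slightly larger. The Python sketch below just does the multiplication both ways. The variable names are mine, and whether RODBC or the ODBC driver imposes a lower limit of its own is a separate question that this does not answer.

```python
PACKETS_PER_BATCH = 65_536        # multiplier quoted from the documentation
DEFAULT_PACKET_SIZE = 4_096       # assumption: 4 KB means 4,096 bytes

max_batch_bytes = PACKETS_PER_BATCH * DEFAULT_PACKET_SIZE
rough_estimate = PACKETS_PER_BATCH * 4_000   # the figure used in the question

print(max_batch_bytes)   # 268435456, i.e. 256 MiB
print(rough_estimate)    # 262144000
```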

SQL server Varchar(max) and space taken

If varchar(max) is used as the datatype and the inserted data is less than the full allocation, i.e. only 200 chars, then will SQL Server always take the full space of varchar(max) or just the 200 chars' space?
Further, what are the other data types that will take the max space even if lesser data is inserted?
Are there any documents that specify this?
From MS DOCS on char and varchar (Transact-SQL):
char [ ( n ) ]
Fixed-length, non-Unicode string data. n defines the string length and must be a value from 1 through 8,000. The storage size is n bytes. The ISO synonym for char is character.
varchar [ ( n | max ) ]
Variable-length, non-Unicode string data. n defines the string length and can be a value from 1 through 8,000. max indicates that the maximum storage size is 2^31-1 bytes (2 GB). The storage size is the actual length of the data entered + 2 bytes. The ISO synonyms for varchar are char varying or character varying.
So for varchar, including max - the storage will depend on actual data length, while char is always fixed size even when entire space is not used.
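To make the difference concrete, here is a small Python sketch of the two storage formulas quoted above. The function names are mine, and I'm assuming a single-byte encoding (one byte per character) and ignoring any row or page overhead.

```python
def char_storage_bytes(n: int) -> int:
    """char(n): always n bytes, however short the stored value is."""
    return n


def varchar_storage_bytes(value: str) -> int:
    """varchar: actual data length plus 2 bytes (single-byte encoding assumed)."""
    return len(value.encode("ascii")) + 2


value = "x" * 200
print(char_storage_bytes(8000))       # 8000
print(varchar_storage_bytes(value))   # 202
```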
Use CHAR only for strings whose length you know to be fixed. For example, if you define a domain whose values are restricted to 'T' and 'F', you should probably make that CHAR[1]. If you're storing US social security numbers, make the domain CHAR[9] (or CHAR[11] if you want punctuation).
Use VARCHAR for strings that can vary in length, like names, short descriptions, etc. Use VARCHAR when you don't want to worry about stripping trailing blanks. Use VARCHAR unless there's a good reason not to.
The size of a varchar value depends on the length of the data. So in your case, it will only take the space needed for the 200 characters.

What is the maximum characters for the NVARCHAR(MAX)? [duplicate]

I have declared a column of type NVARCHAR(MAX) in SQL Server 2008, what would be its exact maximum characters having the MAX as the length?
The max size for a column of type NVARCHAR(MAX) is 2 GByte of storage.
Since NVARCHAR uses 2 bytes per character, that's approx. 1 billion characters.
Leo Tolstoy's War and Peace is a 1,440-page book containing about 600,000 words, so that might be 6 million characters, generously rounded up. So you could stick about 166 copies of the entire War and Peace book into each NVARCHAR(MAX) column.
Is that enough space for your needs? :-)
By default, nvarchar(MAX) values are stored exactly the same as nvarchar(4000) values would be, unless the actual length exceeds 4000 characters; in that case, the in-row data is replaced by a pointer to one or more separate pages where the data is stored.
If you anticipate data possibly exceeding 4000 characters, nvarchar(MAX) is definitely the recommended choice.
Source: https://social.msdn.microsoft.com/Forums/en-US/databasedesign/thread/d5e0c6e5-8e44-4ad5-9591-20dc0ac7a870/
From MSDN Documentation
nvarchar [ ( n | max ) ]
Variable-length Unicode string data. n defines the string length and can be a value from 1 through 4,000. max indicates that the maximum storage size is 2^31-1 bytes (2 GB).
The storage size, in bytes, is two times the actual length of data entered + 2 bytes
I think nvarchar(MAX) can actually store approximately 1,070,000,000 characters.
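The figures above can be checked with a few lines of Python. This is only the arithmetic, assuming 2 bytes per character and ignoring the 2 bytes of overhead mentioned in the documentation quote. The 6 million characters per copy of War and Peace is the rough estimate from the earlier answer, not a measured value.

```python
MAX_STORAGE_BYTES = 2**31 - 1        # maximum storage size from the documentation
BYTES_PER_CHAR = 2                   # assumption: every character takes 2 bytes

max_chars = MAX_STORAGE_BYTES // BYTES_PER_CHAR
copies_of_war_and_peace = max_chars // 6_000_000   # rough estimate per copy

print(max_chars)                 # 1073741823
print(copies_of_war_and_peace)   # 178
```

With the rounder figure of 1 billion characters, the same division gives the 166 copies quoted earlier.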
