SQL Server: Handling of string concatenation

Given an nvarchar(max) variable, the input is 'aaaaa...' with a length of 16,000.
The variable holds the full value without any problem in this setup.
If I break the input down into 3 smaller strings, say of lengths (7964, 4594, 3442), the concatenation of them is truncated when stored in the variable.
On the other hand, if at least 1 variable is over 8,000 in size, the concatenation works without an issue.
Is there any documentation regarding the mentioned behavior?

Taken from the docs:
If the result of the concatenation of strings exceeds the limit of
8,000 bytes, the result is truncated. However, if at least one of the
strings concatenated is a large value type, truncation does not occur.
Concatenations of varchar and of nvarchar values are limited to 8,000 and 4,000 characters respectively, unless at least one of the data types involved is MAX. Be very careful with the order of the operations; this is a very good example from the docs:
DECLARE @x varchar(8000) = replicate('x', 8000)
DECLARE @y varchar(max) = replicate('y', 8000)
DECLARE @z varchar(8000) = replicate('z',8000)
SET @y = @x + @z + @y
-- The result of following select is 16000
SELECT len(@y) AS y
The result is 16k and not 24k because the first operation is @x + @z, which is truncated at 8000 because neither operand is MAX. That result is then concatenated to a value whose type is MAX, which lifts the 8000 limit and adds another 8000 characters from @y. In the final result, the characters from variable @z are lost at the first concatenation.
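As a minimal sketch of why operand order matters, the same three values can be concatenated with the MAX operand first, so that every intermediate result is already varchar(max). This assumes SQL Server 2008 or later for the inline DECLARE initializers, and the variable @r is introduced here only for illustration:

```sql
DECLARE @x varchar(8000) = REPLICATE('x', 8000);
DECLARE @y varchar(max)  = REPLICATE('y', 8000);
DECLARE @z varchar(8000) = REPLICATE('z', 8000);
DECLARE @r varchar(max);

-- MAX operand last: @x + @z is evaluated first and truncated to 8000
SET @r = @x + @z + @y;
SELECT LEN(@r) AS max_last;   -- 16000

-- MAX operand first: (@y + @x) is varchar(max), so nothing is truncated
SET @r = @y + @x + @z;
SELECT LEN(@r) AS max_first;  -- 24000
```

Since + is evaluated left to right, casting only the leftmost operand to MAX is enough to protect the whole chain.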

If you're using the CONCAT function:
If none of the input arguments has a supported large object (LOB)
type, then the return type truncates to 8000 characters in length,
regardless of the return type. This truncation preserves space and
supports plan generation efficiency.
Try
CONCAT(CAST('' as VARCHAR(MAX)),@var1,@var2)
or
CAST(@var1 as VARCHAR(MAX)) + @var2
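Applied to the nvarchar case from the question, the two workarounds might look like the sketch below. The variables @a, @b and @c are hypothetical stand-ins for the three pieces, and CONCAT assumes SQL Server 2012 or later:

```sql
DECLARE @a nvarchar(4000) = REPLICATE(N'a', 4000);
DECLARE @b nvarchar(4000) = REPLICATE(N'b', 4000);
DECLARE @c nvarchar(4000) = REPLICATE(N'c', 4000);

SELECT
    LEN(@a + @b + @c)                                   AS plain_plus,   -- 4000
    LEN(CAST(@a AS nvarchar(max)) + @b + @c)            AS cast_first,   -- 12000
    LEN(CONCAT(@a, @b, @c))                             AS plain_concat, -- 4000
    LEN(CONCAT(CAST(N'' AS nvarchar(max)), @a, @b, @c)) AS concat_max;   -- 12000
```

For nvarchar the non-MAX cap is 4000 characters rather than 8000, which is why the untreated expressions stop at 4000 here.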

Related

Concatenation of two varchar columns in select into

I have an insert into tableA select from someTables, and in my select I have two text columns that I concatenate, e.g. colA + colB. They have type varchar(n). Should the column in TableA simply be varchar(2n)? Is it bad for performance if, say, I have varchar(5*n)?
If the two columns are concatenated from varchar(n) is it possible that the result is more than varchar(2n) or e.g. nvarchar(3n)?
When you concatenate 2 (n)varchar values, the length of the resulting data type is the 2 length properties added together, or 8,000 bytes (whichever is lower). If you are concatenating a varchar and an nvarchar, the varchar will be implicitly cast to an nvarchar first.
Unless at least 1 of the values concatenated is of MAX length, the return datatype will not be converted to a MAX and any trailing characters will be truncated.
Take the below examples, which return the data types of their aliases:
SELECT REPLICATE('A',10) + REPLICATE('B',10) AS varchar20,
REPLICATE(N'A',10) + REPLICATE(N'B',10) AS nvarchar20,
REPLICATE(N'A',10) + REPLICATE('B',5) AS nvarchar15,
REPLICATE('A',5000) + REPLICATE('B',5000) AS varchar8000, --Truncation occurs
REPLICATE(N'A',3000) + REPLICATE('B',3000) AS nvarchar4000, --Truncation occurs
REPLICATE(CONVERT(nvarchar(MAX),N'A'),3000) + REPLICATE('B',3000) AS nvarcharMAX;
And this can be validated using dm_exec_describe_first_result_set:
SELECT [name], system_type_name
FROM sys.dm_exec_describe_first_result_set(N'SELECT REPLICATE(''A'',10) + REPLICATE(''B'',10) AS varchar20,
REPLICATE(N''A'',10) + REPLICATE(N''B'',10) AS nvarchar20,
REPLICATE(N''A'',10) + REPLICATE(''B'',5) AS nvarchar15,
REPLICATE(''A'',5000) + REPLICATE(''B'',5000) AS varchar8000, --Truncation occurs
REPLICATE(N''A'',3000) + REPLICATE(''B'',3000) AS nvarchar4000, --Truncation occurs
REPLICATE(CONVERT(nvarchar(MAX),N''A''),3000) + REPLICATE(''B'',3000) AS nvarcharMAX;',NULL, NULL);
Obviously, if you concatenate 3 (n)varchar values, then the resulting length is the sum of the 3 length values, etc.
Note that I explicitly state 8,000 bytes, not a length of 8,000 or 4,000 characters. Many take the length value for varchar and nvarchar to mean the number of characters it can hold, but this is not actually true; it's the number of bytes: for varchar it's 8,000 single bytes and for nvarchar it is 4,000 double bytes. This is far more important now that SQL Server supports UTF-8 collations.
For example, the below returns a value of 2666, as the character I chose at random (◘) uses 3 bytes per character.
SELECT LEN(REPLICATE(CONVERT(varchar(3),N'◘' COLLATE Latin1_General_100_CI_AI_SC_UTF8),8000));
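The byte-versus-character distinction can be checked on any version with LEN, which counts characters, and DATALENGTH, which counts bytes. A small sketch with plain ASCII literals:

```sql
SELECT LEN('abc')         AS varchar_chars,   -- 3
       DATALENGTH('abc')  AS varchar_bytes,   -- 3
       LEN(N'abc')        AS nvarchar_chars,  -- 3
       DATALENGTH(N'abc') AS nvarchar_bytes;  -- 6
```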

Sybase, data type

I have 2 queries:
(1)
declare @m varchar
set @m='10'
select * from test where month=@m
(2)
declare @m varchar(2)
set @m='10'
select * from test where month=@m
The number of rows in the result is different: the second variant returns more rows than the first. What could be the reason?
That's because when you don't specify how many bytes the varchar variable can hold, the engine uses the default, which is 1:
When n isn't specified in a data definition or variable declaration
statement, the default length is 1. If n isn't specified when using
the CAST and CONVERT functions, the default length is 30.
So, in the first case you have:
select * from test where month=1
and in the second:
select * from test where month=10
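The default lengths quoted above can be observed directly. This is a sketch in SQL Server syntax (2008 or later for the inline initializers); on Sybase the declaration and assignment would have to be separate statements, as in the question:

```sql
DECLARE @m1 varchar    = '10';  -- no length given: declared as varchar(1)
DECLARE @m2 varchar(2) = '10';

SELECT @m1 AS m1, @m2 AS m2;    -- m1 = '1', m2 = '10'

-- In CAST/CONVERT the default length is 30 instead of 1
SELECT LEN(CAST(REPLICATE('x', 40) AS varchar)) AS cast_default_len;  -- 30
```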

What exactly is the meaning of nvarchar(n)

The documentation isn't super clear: https://msdn.microsoft.com/en-us/library/ms186939.aspx
What happens if I try to store a 20 character length string in a column defined as nvarchar(10)? Is 10 the max length the field could be or is it the expected length? If I can exceed n characters in the string, what are the performance implications of doing that?
The maximum number of characters you can store in a column or variable typed as nvarchar(n) is n. If you try to store more in a variable, your string will be silently truncated; in the case of an insert into a table, the insert fails with an error about truncation:
String or binary data would be truncated. The statement has been
terminated.
declare @n nvarchar(10)
set @n = N'more than ten chars'
select @n
Result:
----------
more than
(1 row(s) affected)
From my understanding, nvarchar will only store the provided characters, up to the amount defined. Nchar will actually fill in the unused characters with whitespace.
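The difference in storage between the two types can be sketched with DATALENGTH. The variable names are illustrative only, and the inline initializers assume SQL Server 2008 or later:

```sql
DECLARE @v nvarchar(10) = N'abc';
DECLARE @c nchar(10)    = N'abc';

SELECT DATALENGTH(@v) AS nvarchar_bytes,  -- 6: only the 3 characters are stored
       DATALENGTH(@c) AS nchar_bytes;     -- 20: padded with spaces to 10 characters
```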

Format string in SQL Server 2005 from numeric value

How I can format string with D in start and leading zeros for digits with length of less than four. E.g:
D1000 for 1000
D0100 for 100
I have tried to work with casting and stuff function, but it didn't work as I expected.
SELECT STUFF('D0000', LEN(@OperatingEndProc) - 2, 4, CAST((CAST(SUBSTRING(@OperatingEndProc, 2, 4) AS INT) + 1) AS VARCHAR(10)));
Adding 10000 to the value gives the number enough leading digits; casting it to varchar and taking only the last 4 characters then drops the added 10000 again. This requires that all numbers are between 0 and 9999.
declare @value int
set @value = 100
select 'D' + right(cast(@value + 10000 as varchar(5)), 4)
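A common alternative, sketched here rather than taken from the answer, pads with a string of zeros instead of adding 10000. The derived table t and its column v are made up for the example, and UNION ALL is used so that it also runs on SQL Server 2005:

```sql
SELECT v AS value,
       'D' + RIGHT('0000' + CAST(v AS varchar(10)), 4) AS formatted
FROM (SELECT 1 AS v UNION ALL SELECT 100 UNION ALL SELECT 1000) AS t;
-- 1    -> D0001
-- 100  -> D0100
-- 1000 -> D1000
```

The same restriction applies: values above 9999 lose their leading digits because only the last 4 characters are kept.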
This conversion chart can come in handy when you want to get the casting right:
This shows all explicit and implicit data type conversions that are
allowed for SQL Server system-supplied data types. These include xml,
bigint, and sql_variant. There is no implicit conversion on assignment
from the sql_variant data type, but there is implicit conversion to
sql_variant
You can download it here http://www.microsoft.com/en-us/download/details.aspx?id=35834

Can I use a hash of fields instead of direct field comparison to simplify comparison of records?

I am integrating between 4 data sources:
InternalDeviceRepository
ExternalDeviceRepository
NightlyDeviceDeltas
MidDayDeviceDeltas
Changes flow into the InternalDeviceRepository from the other three sources.
All sources eventually are transformed to have the definition of
FIELDS
=============
IdentityField
Contract
ContractLevel
StartDate
EndDate
ContractStatus
Location
IdentityField is the PrimaryKey, Contract Key is a secondary Key only if a match exists, otherwise a new record needs to be created.
Currently I compare all the fields in a WHERE clause in SQL Statements and also in a number of places in SSIS packages. This creates some unclean looking SQL and SSIS packages.
I've been mulling computing a hash of ContractLevel, StartDate, EndDate, ContractStatus, and Location and adding that to each of the input tables. This would allow me to use a single value for comparison, instead of 5 separate ones each time.
I've never done this before, nor have I seen it done. Is there a reason that it should be used, or is that a cleaner way to do it?
It is a valid approach. Consider introducing a calculated field with the hash and putting an index on it.
You may use either the CHECKSUM function or write your own hash function like this:
CREATE FUNCTION dbo.GetMyLongHash(@data VARBINARY(MAX))
RETURNS VARBINARY(MAX)
WITH RETURNS NULL ON NULL INPUT
AS
BEGIN
    DECLARE @res VARBINARY(MAX) = 0x
    DECLARE @position INT = 1, @len INT = DATALENGTH(@data)
    -- Hash the input in 8000-byte chunks and append each 16-byte MD5 digest
    WHILE 1 = 1
    BEGIN
        SET @res = @res + HASHBYTES('MD5', SUBSTRING(@data, @position, 8000))
        SET @position = @position + 8000
        IF @position > @len
            BREAK
    END
    -- If more than one chunk was hashed, hash the combined digests down to 16 bytes
    WHILE DATALENGTH(@res) > 16 SET @res = dbo.GetMyLongHash(@res)
    RETURN @res
END
which will give you a 16-byte value: you may take all 16 bytes as a GUID, or only the first 8 bytes as a bigint, and compare that.
Adapt the function to your needs, e.g. to accept a string as the parameter, or even all of your fields instead of a varbinary.
BUT
be careful with string casing and datetime formats
if using CHECKSUM, also check the other fields, since CHECKSUM produces duplicates
avoid using a 4-byte hash result on a relatively big table
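To make the caveats about casing and date formats concrete, here is a minimal sketch that hashes the five compared fields with the built-in HASHBYTES directly. The table variable, column types and sample rows are assumptions made for the example, and CONCAT and SHA2_256 assume SQL Server 2012 or later:

```sql
-- Hypothetical stand-in for one of the device tables; the column types are guesses
DECLARE @Devices TABLE (
    IdentityField  int PRIMARY KEY,
    ContractLevel  varchar(20),
    StartDate      date,
    EndDate        date,
    ContractStatus varchar(20),
    Location       varchar(50)
);

INSERT INTO @Devices VALUES
    (1, 'Gold', '2020-01-01', '2021-01-01', 'Active', 'Berlin'),
    (2, 'gold', '2020-01-01', '2021-01-01', 'ACTIVE', 'berlin');

-- Normalize casing and date format, and separate fields with a delimiter
-- so that ('ab', 'c') and ('a', 'bc') do not produce the same input string
SELECT IdentityField,
       HASHBYTES('SHA2_256',
           CONCAT(UPPER(ContractLevel), '|',
                  CONVERT(char(10), StartDate, 23), '|',
                  CONVERT(char(10), EndDate, 23), '|',
                  UPPER(ContractStatus), '|',
                  UPPER(Location))) AS RowHash
FROM @Devices;
```

Both sample rows should produce the same RowHash because they differ only in casing. Note that CONCAT treats NULL as an empty string, so a NULL field and an empty field hash identically in this sketch.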
