SSIS Lookup transformation including whitespace (SQL Server)

For some reason, the SSIS Lookup transformation seems to be checking the cache for an NCHAR(128) value instead of an NVARCHAR(128) value. This pads the value being looked up with a long run of trailing whitespace and causes the lookup to fail to find a match.
On the Lookup transformation, I configured No Cache so that it always goes to the database, which let me trace with SQL Profiler and see exactly what it was looking up. This is what it captured (notice the whitespace ending at the single quote on the second-to-last line; it requires horizontal scrolling):
exec sp_executesql N'
select *
from (
SELECT SurrogateKey, NaturalKey, SomeInt
FROM Dim_SomeDimensionTable
) [refTable]
where [refTable].[NaturalKey] = @P1
and [refTable].[SomeInt] = @P2'
,N'@P1 nchar(128)
,@P2 smallint'
,N'VALUE '
,8
Here's the destination table's schema:
CREATE TABLE [dbo].[dim_SomeDimensionTable] (
    [SurrogateKey] [int] IDENTITY(1,1) NOT NULL,
    [NaturalKey] [nvarchar](128) NOT NULL,
    [SomeInt] [smallint] NOT NULL
)
What I am trying to figure out is why SSIS is checking the NaturalKey value as NCHAR(128) and how I can get it to perform the lookup as NVARCHAR(128) without the whitespace.
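For reference, this is what I'd expect the capture to look like if the parameter were typed correctly (the same query, but with @P1 declared as nvarchar(128) and no padding on the value):
exec sp_executesql N'
select *
from (
SELECT SurrogateKey, NaturalKey, SomeInt
FROM Dim_SomeDimensionTable
) [refTable]
where [refTable].[NaturalKey] = @P1
and [refTable].[SomeInt] = @P2'
,N'@P1 nvarchar(128)
,@P2 smallint'
,N'VALUE'
,8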
Things I've tried:
I have LTRIM() and RTRIM() on the SQL Server source query.
Before the Lookup, I have used a Derived Column transformation to add a new column with the original value TRIM()'d (this trimmed column is the one I'm passing to the Lookup transformation).
Before and after the Lookup, I used a Multicast to send the rows to a Unicode Flat File Destination, and there was no whitespace in either case.
Before the lookup, I looked at the metadata on the data flow path and it shows the value as having data type DT_WSTR with length 128.
Any ideas would be greatly appreciated!

The nchar(128) parameter type doesn't make any difference.
You need to look elsewhere for the source of your problem (perhaps the column has a case sensitive collation for example).
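A quick way to check the lookup column's collation (a sketch against the question's table):
SELECT c.name, c.collation_name
FROM sys.columns AS c
WHERE c.object_id = OBJECT_ID(N'dbo.Dim_SomeDimensionTable')
AND c.name = N'NaturalKey';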
Trailing whitespace is only significant to SQL Server in LIKE comparisons, not = comparisons, as documented here:
SQL Server follows the ANSI/ISO SQL-92 specification (Section 8.2, <Comparison Predicate>, General rules #3) on how to compare strings with spaces. The ANSI standard requires padding for the character strings used in comparisons so that their lengths match before comparing them. The padding directly affects the semantics of WHERE and HAVING clause predicates and other Transact-SQL string comparisons. For example, Transact-SQL considers the strings 'abc' and 'abc ' to be equivalent for most comparison operations.
The only exception to this rule is the LIKE predicate...
You can also easily see this by running the script below.
USE tempdb;
CREATE TABLE [dbo].[Dim_SomeDimensionTable] (
    [SurrogateKey] [int] IDENTITY(1,1) NOT NULL,
    [NaturalKey] [nvarchar](128) NOT NULL,
    [SomeInt] [smallint] NOT NULL
)
INSERT INTO [dbo].[Dim_SomeDimensionTable] VALUES ('VALUE',8)
exec sp_executesql N'
select *
from (
SELECT SurrogateKey, NaturalKey, SomeInt
FROM Dim_SomeDimensionTable
) [refTable]
where [refTable].[NaturalKey] = @P1
and [refTable].[SomeInt] = @P2'
,N'@P1 nchar(128)
,@P2 smallint'
,N'VALUE '
,8
Which returns the single row:
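SurrogateKey  NaturalKey  SomeInt
------------  ----------  -------
1             VALUE       8
(SurrogateKey is 1 assuming a fresh table, since the identity seed is 1.)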

Related

How to validate that UTF-8 columns actually save space?

SQL Server 2019 introduces support for the widely used UTF-8 character encoding.
I have a large table that stores sent emails. So I'd like to give this feature a try.
ALTER TABLE dbo.EmailMessages
ALTER COLUMN Body NVARCHAR(MAX) COLLATE Latin1_General_100_CI_AI_SC_UTF8;
ALTER TABLE dbo.EmailMessages REBUILD;
My concern is that I don't know how to verify size gains. It seems that popular scripts for size estimation do not properly report size in this case.
Basically, the column type must be converted to VARCHAR(MAX), and the data is then stored in a more compact manner:
To limit the amount of changes required for the above scenarios, UTF-8 is enabled in the existing data types CHAR and VARCHAR. String data is automatically encoded to UTF-8 when creating or changing an object’s collation to a collation with the “_UTF8” suffix, for example from LATIN1_GENERAL_100_CI_AS_SC to LATIN1_GENERAL_100_CI_AS_SC_UTF8.
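Applied to the question's table, the conversion would look something like this (a sketch of the question's own ALTER with the type swapped from NVARCHAR to VARCHAR):
ALTER TABLE dbo.EmailMessages
ALTER COLUMN Body VARCHAR(MAX) COLLATE Latin1_General_100_CI_AI_SC_UTF8;
ALTER TABLE dbo.EmailMessages REBUILD;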
Size can be inspected using sp_spaceused:
sp_spaceused N'EmailMessages';
If unused space is high then you might need to reorganize:
ALTER INDEX ALL ON dbo.EmailMessages REORGANIZE WITH (LOB_COMPACTION = ON);
In my case size was reduced by a factor of ~2 (mostly English text).
As others have already mentioned, you should use VARCHAR instead of NVARCHAR to store UTF-8 encoded text.
You can use a query like the following to compare string lengths. It assumes a table named #Data with an NVARCHAR column called String.
SELECT *
FROM #Data
CROSS APPLY (
    SELECT
        CONVERT(VARCHAR(MAX), String COLLATE LATIN1_GENERAL_100_CI_AS_SC_UTF8) AS Utf8String
) U
CROSS APPLY (
    SELECT
        LEN(String) AS Length,
        --LEN(Utf8String) AS Utf8Length,
        DATALENGTH(String) AS NVarcharBytes,
        DATALENGTH(Utf8String) AS Utf8Bytes
) L
CROSS APPLY (
    SELECT
        CASE WHEN Utf8Bytes < NVarcharBytes THEN 'Yes' ELSE '' END AS IsShorter,
        CASE WHEN Utf8Bytes > NVarcharBytes THEN 'Yes' ELSE '' END AS IsLonger
) C
CROSS APPLY (
    SELECT
        CONVERT(VARCHAR(MAX), CONVERT(VARBINARY(MAX), String), 1) AS NVarcharHex,
        CONVERT(VARCHAR(MAX), CONVERT(VARBINARY(MAX), Utf8String), 1) AS Utf8Hex
) H
You can replace FROM #Data with something like FROM (SELECT Email AS String FROM YourTable) D to query your specific data. Replace SELECT * with SELECT SUM(NVarcharBytes) AS NVarcharBytes, SUM(Utf8Bytes) AS Utf8Bytes to get totals.
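For example, a totals-only variant of the query over the same #Data table might look like this (the same conversion, folded into the aggregates):
SELECT
    SUM(DATALENGTH(String)) AS NVarcharBytes,
    SUM(DATALENGTH(CONVERT(VARCHAR(MAX), String COLLATE LATIN1_GENERAL_100_CI_AS_SC_UTF8))) AS Utf8Bytes
FROM #Data;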
See this db<>fiddle.
See also: Storage differences between UTF-8 and UTF-16.

Kurdish Sorani letters in SQL Server

I am trying to create a database containing Kurdish Sorani letters.
My database fields have to be varchar because the project was started that way.
First I created the database with the Arabic_CI_AS collation.
I can store all Arabic letters in the varchar fields, but when it comes to Kurdish letters, for example ڕۆ, these special letters show up as ?? in the table after entering data. I think my collation is wrong. Does anybody have an idea for the right collation?
With that collation, no, you need to use nvarchar and always prefix such strings with the N prefix:
CREATE TABLE dbo.floo
(
    UseNPrefix bit,
    a varchar(32) collate Arabic_CI_AS,
    b nvarchar(32) collate Arabic_CI_AS
);
INSERT dbo.floo(UseNPrefix,a,b) VALUES(0,'ڕۆ','ڕۆ');
INSERT dbo.floo(UseNPrefix,a,b) VALUES(1,N'ڕۆ',N'ڕۆ');
SELECT * FROM dbo.floo;
Output:
UseNPrefix   a    b
----------   --   --
False        ??   ??
True         ??   ڕۆ
Example db<>fiddle
In SQL Server 2019, you can use a different SC + UTF-8 collation with varchar, but you will still need to prefix string literals with N to prevent data from being lost:
CREATE TABLE dbo.floo
(
    UseNPrefix bit,
    a varchar(32) collate Arabic_100_CI_AS_KS_SC_UTF8,
    b nvarchar(32) collate Arabic_100_CI_AS_KS_SC_UTF8
);
INSERT dbo.floo(UseNPrefix,a,b) VALUES(0,'ڕۆ','ڕۆ');
INSERT dbo.floo(UseNPrefix,a,b) VALUES(1,N'ڕۆ',N'ڕۆ');
SELECT * FROM dbo.floo;
Output:
UseNPrefix   a    b
----------   --   --
False        ??   ??
True         ڕۆ   ڕۆ
Example db<>fiddle
Basically, even if you are on SQL Server 2019, your requirements of "I need to store Sorani" and "I can't change the table" are incompatible. You will need to either change the data type of the column or at least change the collation, and you will need to adjust any code that expects to pass this data to SQL Server without an N prefix on strings.
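A minimal sketch of the collation change, reusing the demo table above (altering the existing varchar column in place to a UTF-8 collation):
ALTER TABLE dbo.floo
ALTER COLUMN a varchar(32) COLLATE Arabic_100_CI_AS_KS_SC_UTF8;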

SQL Server returns wrong result with trailing spaces in Where clause [duplicate]

In SQL Server 2008 I have a table called Zone with a column ZoneReference varchar(50) not null as the primary key.
If I run the following query:
select '"' + ZoneReference + '"' as QuotedZoneReference
from Zone
where ZoneReference = 'WF11XU'
I get the following result:
"WF11XU "
Note the trailing space.
How is this possible? If the trailing space really were there on that row, I'd expect the query to return zero results, so I'm assuming it's something else that SQL Server Management Studio is displaying weirdly.
In C# code calling zoneReference.Trim() removes it, suggesting it is some sort of whitespace character.
Can anyone help?
That's the expected result: in SQL Server the = operator ignores trailing spaces when making the comparison.
SQL Server follows the ANSI/ISO SQL-92 specification (Section 8.2, <Comparison Predicate>, General rules #3) on how to compare strings with spaces. The ANSI standard requires padding for the character strings used in comparisons so that their lengths match before comparing them. The padding directly affects the semantics of WHERE and HAVING clause predicates and other Transact-SQL string comparisons. For example, Transact-SQL considers the strings 'abc' and 'abc ' to be equivalent for most comparison operations.
The only exception to this rule is the LIKE predicate. When the right side of a LIKE predicate expression features a value with a trailing space, SQL Server does not pad the two values to the same length before the comparison occurs. Because the purpose of the LIKE predicate, by definition, is to facilitate pattern searches rather than simple string equality tests, this does not violate the section of the ANSI SQL-92 specification mentioned earlier.
Source
Trailing spaces are not always ignored.
I experienced this issue today. My table had NCHAR columns and was being joined to VARCHAR data.
Because the data in the table was not as wide as its field, trailing spaces were automatically added by SQL Server.
I had an ITVF (inline table-valued function) that took varchar parameters.
The parameters were used in a JOIN to the table with the NCHAR fields.
The joins failed because the data passed to the function did not have trailing spaces but the data in the table did. Why was that?
I was getting tripped up on DATA TYPE PRECEDENCE. (See http://technet.microsoft.com/en-us/library/ms190309.aspx)
When comparing strings of different types, the lower precedence type is converted to the higher precedence type before the comparison. So my VARCHAR parameters were converted to NCHARs. The NCHARs were compared, and apparently the spaces were significant.
How did I fix this? I changed the function definition to use NVARCHAR parameters, which are of a higher precedence than NCHAR. Now the NCHARs were changed automatically by SQL Server into NVARCHARs and the trailing spaces were ignored.
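A hypothetical sketch of that change (the actual function isn't shown here, so the names below are made up):
-- Before: @Key was VARCHAR(50), which got promoted to NCHAR in the join.
-- After: with @Key as NVARCHAR(50), the NCHAR column is promoted to NVARCHAR instead.
CREATE FUNCTION dbo.GetMatchingNotes (@Key NVARCHAR(50))
RETURNS TABLE
AS RETURN
(
    SELECT t.SomeColumn
    FROM dbo.SomeTableWithNCharKey AS t
    WHERE t.NCharKey = @Key
);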
Why didn't I just perform an RTRIM? Testing revealed that RTRIM killed the performance, preventing the JOIN optimizations that SQL Server would have otherwise used.
Why not change the data type of the table? The tables are already installed on customer sites, and they do not want to run maintenance scripts (time + money to pay DBAs) or give us access to their machines (understandable).
Yeah, Mark is correct. Run the following SQL:
create table #temp (name varchar(15))
insert into #temp values ('james ')
select '"' + name + '"' from #temp where name ='james'
select '"' + name + '"' from #temp where name like 'james'
drop table #temp
But the assertion about the LIKE statement appears not to hold in the above example. Output:
(1 row(s) affected)
-----------------
"james "
(1 row(s) affected)
-----------------
"james "
(1 row(s) affected)
EDIT:
To get it to work, you could put at the end:
and name <> rtrim(ltrim(name))
Ugly though.
EDIT2:
Given the comments above, the following would work:
select '"' + name + '"' from #temp where 'james' like name
Try:
select Replace('"' + ZoneReference + '"', ' ', '') as QuotedZoneReference from Zone where ZoneReference = 'WF11XU'

SQL error on inserting UTF8 string into SQL Server 2008 table

I'm having problems trying to insert strings containing UTF-8 encoded Chinese characters and punctuation into a SQL Server 2008 table (default installation) from my Delphi 7 application using the ZeosDB native SQL Server library.
I remember having had problems in the past inserting UTF-8 strings into SQL Server even using PHP and other methods, so I believe this problem is not unique to ZeosDB.
It doesn't happen all the time; some UTF-8 encoded strings are inserted successfully, but some are not. I can't figure out what it is in the string that causes the failure.
Table schema:
CREATE TABLE [dbo].[incominglog](
    [number] [varchar](50) NULL,
    [keyword] [varchar](1000) NULL,
    [message] [varchar](1000) NULL,
    [messagepart1] [varchar](1000) NULL,
    [datetime] [varchar](50) NULL,
    [recipient] [varchar](50) NULL
) ON [PRIMARY]
SQL statement template:
INSERT INTO INCOMINGLOG ([Number], [Keyword], [Message], [MessagePart1], [Datetime], [Recipient])
VALUES('{N}', '{KEYWORD}', '{M}', '{M1}', '{TIMESTAMP}', '{NAME}')
The parameters {KEYWORD}, {M} and {M1} can contain UTF-8 strings.
For example, the following statement will return an error:
Incorrect syntax near 'é¢'. Unclosed quotation mark after the character string '全力克æœå››ç§å±é™©','2013-06-19 17:07:28','')'.
INSERT INTO INCOMINGLOG ([Number], [Keyword], [Message], [MessagePart1], [Datetime], [Recipient])
VALUES('+6590621005', '题', '题 [全力克æœå››ç§å±é™© åšå†³æ‰«é™¤ä½œé£Žä¹‹å¼Š]', '[全力克æœå››ç§å±é™©','2013-06-19 17:07:28', '')
Note: Please ignore the actual characters as the utf8 encoding is lost after copy and paste.
I've also tried using NVARCHAR instead of VARCHAR:
CREATE TABLE [dbo].[incominglog](
    [number] [varchar](50) NULL,
    [keyword] [nvarchar](max) NULL,
    [message] [nvarchar](max) NULL,
    [messagepart1] [nvarchar](max) NULL,
    [datetime] [varchar](50) NULL,
    [recipient] [varchar](50) NULL
) ON [PRIMARY]
I also tried amending the SQL statement to:
INSERT INTO INCOMINGLOG ([Number],[Keyword],[Message],[MessagePart1],[Datetime],[Recipient]) VALUES('{N}',N'{KEYWORD}',N'{M}',N'{M1}','{TIMESTAMP}','{NAME}')
They don't work either. I would appreciate any pointer. Thanks.
EDITED: As indicated by marc_s below, the N prefix must be outside the single quotes. It was correct in my actual test; the initial statement had a typo, which I've corrected.
The test with the N prefix also returned an error:
Incorrect syntax near '原标é¢'. Unclosed quotation mark after the
character string '全力克�四��险','2013-06-19
21:22:08','')'.
The SQL statement:
INSERT INTO INCOMINGLOG ([Number],[Keyword],[Message],[MessagePart1],[Datetime],[Recipient]) VALUES('+6590621005',N'原标题',N'原标题 [全力克�四��险 �决扫除作风之弊]',N'[全力克�四��险','2013-06-19','')
REPLY TO gbn's answer: I've tried using parameterized SQL, but I'm still encountering the "Unclosed quotation mark after the character string" error.
For the new test, I used a simplified SQL statement:
INSERT INTO INCOMINGLOG ([Keyword],[Message]) VALUES(:KEYWORD,:M)
The error returned for the above statement:
Incorrect syntax near '原标é¢'. Unclosed quotation mark after the
character string '')'.
For info, the values of KEYWORD and M are:
KEYWORD:原标题
M:原标题 [
Further tests on 20th June: parameterized SQL queries don't work either, so I tried a different approach and attempted to isolate the character causing the error. After some trial and error, I managed to identify the problematic character.
The following character produces an error: 题
SQL Statement: INSERT INTO INCOMINGLOG ([Keyword]) VALUES('题')
Interestingly, note that the string in the returned error text contains a "?" character which didn't exist in the original statement.
Error: Unclosed quotation mark after the character string '�)'. Incorrect syntax near '�)'.
If I place some Latin characters immediately after the culprit character, there is no error. For example, INSERT INTO INCOMINGLOG ([Keyword]) VALUES('题Ok') works fine. Note: it doesn't work with all characters.
There are ' characters in the UTF-8 which abnormally terminate the SQL.
Classic SQL injection.
Use proper parameterisation, not string concatenation, basically.
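If it helps to rule out the client library, the equivalent parameterised call can be issued directly in T-SQL (a sketch using the question's table and sample values):
EXEC sp_executesql
    N'INSERT INTO INCOMINGLOG ([Keyword], [Message]) VALUES (@Keyword, @M)',
    N'@Keyword nvarchar(max), @M nvarchar(max)',
    @Keyword = N'原标题',
    @M = N'原标题 [';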
Edit, after Question updates...
Without the Delphi code, I don't think we can help you.
All the SQL-side code works. For example, this works in SSMS:
DECLARE @t TABLE ([Keyword] nvarchar(100) COLLATE Chinese_PRC_CI_AS);
INSERT INTO @t ([Keyword]) VALUES('题');
INSERT INTO @t ([Keyword]) VALUES(N'题');
SELECT * FROM @t T;
Something is missing to help us fix this.
Also see
How to store UTF-8 bytes from a C# String in a SQL Server 2000 TEXT column
Having trouble with UTF-8 storing in NVarChar in SQL Server 2008
Write utf-8 to a sql server Text field using ADO.Net and maintain the UTF-8 bytes

SQL Server 6.5 deadlocks with Spanish letters ú and ü

We're running SQL Server 6.5 through ADO, and we have the oddest problem.
This statement will start generating deadlocks:
insert clinical_notes ( NOTE_ID, CLIENT, MBR_ID, EPISODE, NOTE_DATE_TIME,
NOTE_TEXT, DEI, CARE_MGR, RELATED_EVT_ID, SERIES, EAP_CASE, TRIAGE, CATEGORY,
APPOINTMENT, PROVIDER_ID, PROVIDER_NAME )
VALUES ( 'NTPR3178042', 'HUMANA/PR', '999999999_001', 'EPPR915347',
'03-28-2011 11:25', 'We use á, é, í, ó, ú and ü (this is the least one we
use, but there''s a few words with it, like the city: Mayagüez).', 'APK', 'APK',
NULL, NULL, NULL, NULL, NULL, NULL, NULL, NULL )
The trigger is the characters ú and ü appearing in the NOTE_TEXT column.
NOTE_TEXT is a text column.
There are indexes on
UNC_not_id
NT_CT_MBR_NDX
NT_REL_EVT_NDX
NT_SERIES_NDX
idx_clinical_notes_date_time
nt_ep_idx
NOTE_ID is the primary key.
What happens is that after we issue this statement, if we issue an identical one (but with a new NOTE_ID value), we receive the deadlock.
As mentioned, this only happens when ú or ü is in NOTE_TEXT.
This is a test server and there is generally only one session accessing this table when the error occurs.
I'm sure it has something to do with character sets and such, but for the life of me I can't work it out.
Is the column (var)char-based or n(var)char-based? Are the values using Unicode code points above 255, or are they ASCII 255 or below (250 and 252)?
Try changing the column to a binary collation, just to see if that helps (it may shed light on the issue). I do NOT know if this works in SQL 2000 (though I can check on Monday), but you can try this to find what collations are available on your server:
SELECT * FROM ::fn_helpcollations()
Latin General BIN should be in there somewhere.
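For example, to narrow the list down to binary collations (a sketch):
SELECT name, description
FROM ::fn_helpcollations()
WHERE name LIKE N'%BIN%';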
Assuming you find a collation to try, you change the collation like so:
ALTER TABLE TableName ALTER COLUMN ColumnName varchar(8000) NOT NULL COLLATE Collation_Name_Here
Script out your table to learn the collation it's using now so you can set it back if that doesn't work or causes problems. Or use a backup. :)
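For example, sp_help lists each column's data type and collation:
EXEC sp_help 'clinical_notes';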
One additional note is that if you're using unicode you do need an N before literal strings, for example:
SELECT N'String'
