Varchar vs nvarchar - causing distinct values that we don't consider distinct - sql-server

SQL Server 2019 - we have a column called Entity which is of type nvarchar(max). The data from this column is inserted from tables on the web as part of an automated process.
When querying for DISTINCT values in this column, we expected only one distinct value, but two were returned, and the two values looked exactly the same in SQL Server Management Studio.
So we added a CONVERT(varchar(max)) of the column to the query as a new column, and the difference became visible:
Entity          Converted
Security Law    Security Law
Security Law    Security ?Law
Does anyone know how or why this different value occurs and, more importantly, how we can instruct SQL Server to treat these as duplicate values when analyzing only the nvarchar version?

nvarchar stores Unicode characters. Since you are copying data from the web, the values can contain invisible characters (zero-width spaces, non-breaking spaces, and so on) that a plain varchar cannot represent, which is why the difference only becomes visible after conversion.
You can strip everything outside the ASCII range (SQL Server has no built-in regex, but a pattern-based approach does the job) and convert the result to varchar so the values compare as duplicates; see the sketch below.
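A minimal sketch of that approach, assuming the column lives in a table called dbo.MyTable (the table name, the helper function, and the choice to keep only printable ASCII are illustrative):

-- Hypothetical helper: removes every character outside the printable ASCII
-- range (space through tilde) by repeatedly locating and deleting offenders.
CREATE FUNCTION dbo.StripNonAscii (@input nvarchar(max))
RETURNS varchar(max)
AS
BEGIN
    DECLARE @pos int = PATINDEX(N'%[^ -~]%', @input COLLATE Latin1_General_BIN);
    WHILE @pos > 0
    BEGIN
        SET @input = STUFF(@input, @pos, 1, N'');
        SET @pos = PATINDEX(N'%[^ -~]%', @input COLLATE Latin1_General_BIN);
    END;
    RETURN CAST(@input AS varchar(max));
END;
GO

-- The hidden-character variants now collapse into a single group.
SELECT dbo.StripNonAscii(Entity) AS Entity, COUNT(*) AS Occurrences
FROM dbo.MyTable
GROUP BY dbo.StripNonAscii(Entity);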

Related

SQL Server datatype for DB2 column "Char() for BIT Data"

I googled but failed to find an answer for the following question.
I have a DB2 database column of datatype "Char(100) for BIT Data". It stores the encrypted value of an employee ID. I need to store this data in SQL Server.
What should be the datatype for this column in SQL Server?
Is there any formatting needed before inserting into SQL Server?
The column in your DB2 table is an ordinary character string, except that the for BIT Data part tells DB2 to treat that string as arbitrary binary data, rather than text. This matters mainly for sorts and comparisons. DB2 (as normally configured) would sort and compare strings alphabetically. A capital A would come before a lowercase b. But for BIT Data strings are compared by the underlying numeric value of the characters. Capital A would come after lowercase z.
In SQL Server, there is a BINARY datatype for this, so you would probably use BINARY(100). No formatting should be necessary, as you presumably want the raw binary value to stay exactly the same.
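For illustration, a sketch of how the SQL Server side might be declared (the table and column names are made up):

-- Fixed-length raw bytes, compared by their numeric value,
-- mirroring DB2's CHAR(100) FOR BIT DATA.
CREATE TABLE dbo.Employee (
    EmployeeId  int         NOT NULL PRIMARY KEY,
    EncryptedId binary(100) NOT NULL
);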

Data Type Conversion from SQL Server to Oracle

I am currently moving a product from SQL Server to Oracle. I am semi-familiar with SQL Server and know nothing about Oracle, so I apologize if the mere presence of this question offends anyone.
Inferring from this page, http://download.oracle.com/docs/cd/E12151_01/doc.150/e12156/ss_oracle_compared.htm, it would seem that the data type conversion from SQL Server to Oracle should be:
REAL = FLOAT(24) -> FLOAT(63)
FLOAT(p) -> FLOAT(p)
TIMESTAMP -> NUMBER
NVARCHAR(n) -> VARCHAR(n*2)
NCHAR(n) -> CHAR(n*2)
Here are my questions regarding them:
For FLOAT, considering that FLOAT(p) -> FLOAT(p), wouldn't it also mean that FLOAT -> FLOAT(24)?
For TIMESTAMP, since Oracle also has its own version of it, wouldn't it be better that TIMESTAMP -> TIMESTAMP?
Finally, for NVARCHAR(n) and NCHAR(n), I thought the issue would be regarding Unicode. Then, again, since Oracle provides its own version of both, wouldn't it make more sense that NVARCHAR(n) -> NVARCHAR(n) and NCHAR(n) -> NCHAR(n)?
It would be much appreciated if someone were to elaborate on the previous 3 matters.
Thanks in advance.
It appears that Oracle's CHAR and VARCHAR2 (always use VARCHAR2 instead of VARCHAR) already support Unicode - the document you've linked to advises converting to those from the SQL Server NCHAR and NVARCHAR datatypes.
The SQL Server TIMESTAMP isn't actually a timestamp at all - it's some kind of identifier based on the time that's just used to indicate that a row has changed - it can't be converted back into any kind of DATETIME (at least in a way that I know about).
For FLOAT, the Oracle default precision of 126 binary digits would be far more than you need - since the developer tools automatically map SQL Server's FLOAT to Oracle's FLOAT(53), why not use that?
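As a sketch of that choice on the Oracle side (the table and column names are invented):

-- Oracle DDL: explicit binary precision instead of the default FLOAT,
-- which would allow up to 126 binary digits.
CREATE TABLE readings (
    reading_value FLOAT(53)  -- same precision as SQL Server's FLOAT
);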
This is more FYI than an answer to your question, but you're potentially going to run into a particularly painful difference between SQL Server and Oracle. In SQL Server, you can define a string column (of whatever flavor) to not allow NULL values, and then insert zero-length (aka "blank") strings into that column, because SQL Server does not consider a blank string to be the same as a NULL.
Oracle does consider a blank string to be the same as a NULL, so Oracle will not let you insert blank values into a NOT NULL column. This obviously causes problems when copying data from a table in SQL Server into its counterpart table in Oracle. Your choices for dealing with this are:
Set the offending string column in Oracle to allow NULL values (not a good idea)
When copying the data, replace the blank strings with something else (I have no idea what you should use here; see the sketch after this list)
Skip the offending rows and pretend you never saw them
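A sketch of the second option during the copy (the table, column, and the 'N/A' sentinel are all placeholders; whatever value you pick has to be acceptable to the application):

-- Run on the SQL Server side when extracting rows for Oracle:
-- swap zero-length strings for a sentinel so Oracle's NOT NULL column accepts them.
SELECT CustomerId,
       CASE WHEN MiddleName = '' THEN 'N/A' ELSE MiddleName END AS MiddleName
FROM dbo.Customer;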
I'd love to think that Oracle's choice to consider blank strings to be NULL (where they're alone among the major DBs) was to lock customers into their platform, but this one actually works in the opposite direction. You can move a database from Oracle to something else without the blank=NULL difference causing any problems.
See this earlier question: Oracle considers empty strings to be NULL while SQL Server does not - how is this best handled?
The following is not a direct answer to your question, but it is worth taking a look at the SQLTeam blog:
sqlteam - Datatypes translation between Oracle and SQL Server part 2: number
It has detailed explanation about how to handle numbers, etc.

SQL Server Collation / ADO.NET DataTable.Locale with different languages

We have a WinForms app which stores data in SQL Server (2000; we are working on porting it to 2008) through ADO.NET (1.1, working on porting to 4.0). Everything works fine if I read data previously written in a Western European locale (e.g. "test", "test ù"), but now we have to be able to mix Western and non-Western alphabets as well (e.g. "test - ۓےۑ" - these are just random Arabic characters).
On the SQL Server side, the database has been set up with the Latin1_General collation, and the field is an nvarchar(80). If I run a SQL SELECT statement (e.g. "SELECT * FROM MyTable WHERE field = 'test - ۓےۑ'"; don't mind the "*" or the actual names) from Query Analyzer, I get no results; the same happens if I pass the SQL statement to an ADO.NET DataAdapter to fill a DataTable. My guess is that it has something to do with collation, but I don't know how to correct this: do I have to change the collation (SQL Server) to a different one? Or do I have to set the locale on the DataAdapter/DataTable (ADO.NET)?
Thanks in advance to anyone who will help
Shouldn't you use the N prefix when comparing an nvarchar column with an extended character set?
SELECT * From TestTable WHERE GreekColCaseInsensitive = N'test - ۓےۑ'
Yes, the problem is most likely the collation. The Latin1_General collation does not include the rules to sort and compare non-Latin characters.
MSDN claims:
If you must store character data that reflects multiple languages, you can minimize collation compatibility issues by always using the Unicode nchar, nvarchar, and ntext data types instead of the char, varchar, text data types. Using the Unicode data types eliminates code page conversion issues.
Since you have already complied with this, you should read further on the info about Mixed Collation Environments here.
Additionally I want to add that just changing a collation is not something done easily; check MSDN for SQL Server 2000:
When you set up SQL Server 2000, it is important to use the correct collation settings. You can change collation settings after running Setup, but you must rebuild the databases and reload the data. It is recommended that you develop a standard within your organization for these options. Many server-to-server activities can fail if the collation settings are not consistent across servers.
You can, however, specify a collation on a per-column basis:
CREATE TABLE TestTable (
id int,
GreekColCaseInsensitive nvarchar(10) collate greek_ci_as,
LatinColCaseSensitive nvarchar(10) collate latin1_general_cs_as
)
Have a look at the different binary multilingual collations here. Depending on the charset you use, you should find one that fits your purpose.
If you are not able or willing to change the collation of a column you can also just specify the collation to be used in the query like:
SELECT * From TestTable
WHERE GreekColCaseInsensitive = N'test - ۓےۑ'
COLLATE latin1_general_cs_as
As jfrobishow pointed out, the use of N in front of the string you want to compare is essential. What it does:
It denotes that the subsequent string is in Unicode (the N actually stands for National language character set). Which means that you are passing an NCHAR, NVARCHAR or NTEXT value, as opposed to CHAR, VARCHAR or TEXT. See Article #2354 for a comparison of these data types.
You can find a quick rundown here.

How can I recover Unicode data which displays in SQL Server as "?"

I have a database in SQL Server containing a column which needs to contain Unicode data (it contains users' addresses from all over the world, e.g. القاهرة‎ for Cairo).
This column is an nvarchar column with the database default collation (Latin1_General_CI_AS), but I've noticed that data inserted into it via SQL statements containing non-English characters displays as ?????.
The cause seems to be that I wasn't using the N prefix, e.g.:
INSERT INTO table (address) VALUES ('القاهرة')
Instead of:
INSERT INTO table (address) VALUES (N'القاهرة')
I was under the impression that Unicode would automatically be converted for nvarchar columns and I didn't need this prefix, but this appears to be incorrect.
The problem is I still have some data in this column which appears as ????? in SQL Server Management Studio and I don't know what it is!
Is the data still there but in an incorrect character encoding preventing it from displaying but still salvageable (and if so how can I recover it?), or is it gone for good?
Thanks,
Tom
To find out what SQL Server really stores, use
SELECT CONVERT(VARBINARY(MAX), 'some text')
I just tried this with umlauted characters and Arabic (copied from Wikipedia; I have no idea what it says), both as plain strings and as N'' Unicode strings.
The results are that Arabic non-Unicode strings really end up as question marks (0x3F) in the conversion to VARCHAR.
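A sketch of that check side by side (this assumes the database default is a Latin collation, so the plain literal is the one that gets mangled):

-- Without the N prefix the Arabic text is converted to the varchar code page first
-- and stored as 0x3F (?) bytes; with the prefix, the real UTF-16 code points survive.
SELECT CONVERT(varbinary(max), 'القاهرة') AS PlainLiteral,
       CONVERT(varbinary(max), N'القاهرة') AS UnicodeLiteral;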
SSMS sometimes won't display all characters. I just tried what you had and it worked for me; copy and paste the value into Word and it might display correctly.
Usually if SSMS can't display a character it shows boxes, not ?.
Try to write a small client that will retrieve these data to a file or web page. Check ALL your code to make sure there are no other inserts or updates that might convert the data to varchar before storing it in the tables.

Sql Server 2008 - Difference between collation types

I'm installing a new SQL Server 2008 server and am having some problems finding any usable information regarding the different collations. I have searched SQL Server BOL and Googled for an answer but can't seem to find anything usable.
What is the difference between the Windows Collation "Finnish_Swedish_100" and "Finnish_Swedish"?
I suppose that the "_100" version is an updated collation in SQL Server 2008, but what has changed from the older version, if that is the case?
Is it usually a good thing to have "Accent-sensitive" enabled? I know that it depends on the task and all that, but are there any well-known pros and cons to consider?
The "Binary" and "Binary-code point" parameters - in which cases should these be enabled?
The _100 indicates a collation sequence new in SQL Server 2008, those with _90 are for 2005 and those with no suffix are 2000. I don't know what the differences are, and can't find any documentation. Unless you are doing linked server queries to another SQL server of a different version, I'd be tempted to go with the _100 one. Sorry I can't help with the differences.
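If you want to see which variants your server offers, fn_helpcollations() lists them all, together with a short description of each; a quick sketch:

-- List every Finnish_Swedish collation the instance supports,
-- including the _100 versions added in SQL Server 2008.
SELECT name, description
FROM fn_helpcollations()
WHERE name LIKE 'Finnish_Swedish%'
ORDER BY name;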
The letters ÅÄÖ/åäö are not mixed in with A and O just by setting the collation to AI (accent-insensitive). That is, however, true for â and other "combinations" that are not part of the Swedish alphabet as individual letters; â will mix or not mix depending on the setting in question.
Since I have a lot of old databases I still need to communicate with, also using linked servers, I chose Finnish_Swedish_CI_AS now that I'm installing SQL 2008. That was the default setting for Finnish_Swedish when the Windows collations first appeared in SQL Server.
Use the query below to try it out yourself.
As you can see, å, ä, etc. do not count as accented characters, and are sorted according to the Swedish alphabet when using the Finnish/Swedish collation.
However, the accents are only considered if you use the AS collation. For the AI collation, their order is unchanged, as if there was no accent at all.
CREATE TABLE #Test (
Number int identity,
Value nvarchar(20) NOT NULL
);
GO
INSERT INTO #Test VALUES ('àá');
INSERT INTO #Test VALUES ('áa');
INSERT INTO #Test VALUES ('aa');
INSERT INTO #Test VALUES ('aà');
INSERT INTO #Test VALUES ('áb');
INSERT INTO #Test VALUES ('ab');
-- w is considered an accented version of v
INSERT INTO #Test VALUES ('wa');
INSERT INTO #Test VALUES ('va');
INSERT INTO #Test VALUES ('zz');
INSERT INTO #Test VALUES ('åä');
GO
SELECT Number, Value FROM #Test ORDER BY Value COLLATE Finnish_Swedish_CI_AS;
SELECT Number, Value FROM #Test ORDER BY Value COLLATE Finnish_Swedish_CI_AI;
GO
DROP TABLE #Test;
GO
To address question 3 (info taken off the MSDN; wording theirs, format mine):
Binary (_BIN):
Sorts and compares data in SQL Server tables based on the bit patterns defined for each character.
Binary sort order is case-sensitive and accent-sensitive.
Binary is also the fastest sorting order.
If this option is not selected, SQL Server follows sorting and comparison rules as defined in dictionaries for the associated language or alphabet.
Binary-code point (_BIN2):
For Unicode data: Sorts and compares data in SQL Server tables based on Unicode code points.
For non-Unicode data: will use comparisons identical to binary sorts.
The advantage of using a Binary-code point sort order is that no data resorting is required in applications that compare sorted SQL Server data. As a result, a Binary-code point sort order provides simpler application development and possible performance increases.
For more information, see Guidelines for Using BIN and BIN2 Collations.
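To make the quoted difference concrete, a small sketch contrasting dictionary order with binary code-point order (the values are arbitrary):

-- Dictionary (linguistic) order under the Finnish/Swedish rules...
SELECT v FROM (VALUES (N'Ab'), (N'ab'), (N'åb'), (N'zz')) AS t(v)
ORDER BY v COLLATE Finnish_Swedish_100_CI_AS;

-- ...versus raw Unicode code-point order under the _BIN2 variant.
SELECT v FROM (VALUES (N'Ab'), (N'ab'), (N'åb'), (N'zz')) AS t(v)
ORDER BY v COLLATE Finnish_Swedish_100_BIN2;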
To address your question 2: accent sensitivity is a good thing to have enabled for Finnish_Swedish. Otherwise your "å"s and "ä"s will be sorted as "a"s, and "ö"s as "o"s. (Assuming you will be using those kinds of international characters.)
More here: http://msdn.microsoft.com/en-us/library/ms143515.aspx (discusses both binary codepoint and accent sensitivity)
On Questions 2 and 3
Accent Sensitivity is something I would suggest turning OFF if you are accepting user data, and ON if you have clean, sanitized data.
Not being Finnish myself, I don't know how many words there are that are different depending on the ó ô õ or ö that they have in them. But if there are users entering data, you can be sure that they will NOT be consistent in their usage, and you want to be able to match them.
If you are gathering data from a dataset that you know the content of, and know the consistency of, then you will want to turn Accent Sensitivity ON because you know that the differences are purposeful.
The same questions apply when considering Question 3. (I'm mostly getting this from the link Tomalak provided) If the data is case and accent sensitive, then you want _BIN, because it will sort faster. If the data is irregular, and not case/accent sensitive, then you will want _BIN2, because it is designed for Unicode data.
To address question 2:
Yes, if accents are required grammar for the given language.
