SQL Server 2019 - we have a column called Entity which is of type nvarchar(max). The data from this column is inserted from tables on the web as part of an automated process.
In querying for DISTINCT values in this column, we expected only one distinct value, but two were returned. Yet the two values looked exactly the same inside SQL Server Management Studio.
So we added CONVERT(varchar(max), Entity) to the query as a new column, and we were able to see the difference, as follows:
Entity          Converted
Security Law    Security Law
Security Law    Security ?Law
Does anyone know how or why this different value is occurring, and, more importantly, how we can instruct SQL Server to treat the two as duplicates while still querying the nvarchar column?
nvarchar stores Unicode, and since you are copying data from the web, there may be invisible Unicode characters in it. You can strip out the non-ASCII characters and convert the result to varchar so you get distinct values.
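T-SQL has no built-in regex, but you can get close with REPLACE for the usual invisible suspects. A minimal sketch, assuming the table is called dbo.WebData (a hypothetical name) and that the culprits are common web stowaways like the non-breaking space and zero-width space:

-- Identify the hidden code point at the position where the '?' appears
-- (position 10 is illustrative; adjust to where the extra character shows up)
SELECT DISTINCT UNICODE(SUBSTRING(Entity, 10, 1)) AS CodePoint
FROM dbo.WebData;

-- Strip the common invisible characters, then take the DISTINCT of the cleaned value
SELECT DISTINCT
    REPLACE(
        REPLACE(
            REPLACE(Entity, NCHAR(8203), N''),   -- zero-width space
            NCHAR(65279), N''),                  -- zero-width no-break space (BOM)
        NCHAR(160), N' ')                        -- non-breaking space -> normal space
    AS CleanedEntity
FROM dbo.WebData;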
I read a flat file, used a Lookup to match a field against a database table, and added a field from the match as a new column to the flat file.
But when I directed the matched output to another database table, the matched field is NULL when I inspect it with a SELECT statement.
What did I do wrong?
I would check for any of the following on either the flat file or lookup data, which may cause a non-match:
- text data with trailing blanks
- text data with upper case vs lower case
- numeric data of varying datatypes, even just precisions
- probably other issues I haven't listed above - it's just ridiculously fussy
To avoid these issues I always explicitly use SQL CAST or Derived Column transforms to make sure the key fields are all text, all upper case and all exactly the same, byte by byte.
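For example, a sketch of normalizing the key on the lookup side (the table and column names are invented); the equivalent Derived Column expression would be something like UPPER(LTRIM(RTRIM(KeyField))):

-- Force the lookup key into one exact shape: trimmed, upper-cased, one datatype
SELECT CAST(UPPER(LTRIM(RTRIM(CustomerKey))) AS varchar(50)) AS MatchKey,
       CustomerName
FROM dbo.Customers;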
I'm trying to merge multiple text columns into one concatenated text column. Each of the fields was previously used for various descriptions, but per new reqs, I need all of those fields combined into one.
I tried converting them to varchar first and then concatenating, but some of the rows have values in these columns that are longer than varchar's 8,000-character cap in SQL Server 2000, so they are truncated in the result.
Is there a way to combine multiple text fields in SQL Server 2000?
The best advice I have for you is to either
perform the concatenation in your middle or presentation tier (or add an abstraction layer that allows this, such as routing your query through a newer version of SQL Server that performs the concatenation after pulling the data through a linked server to the 2000 instance); or,
upgrade.
You can't fool SQL Server 2000 into supporting [n]varchar(max), and the limitation you've come across is just one of many, many, many reasons the [n]text data types were deprecated.
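If you take the linked-server route, a sketch of what that might look like from the newer instance (LEGACY2000 and all table/column names are hypothetical); note the ISNULLs, since + returns NULL if any operand is NULL:

-- Run on SQL Server 2005 or later; LEGACY2000 is a linked server to the 2000 box
SELECT ISNULL(CONVERT(varchar(max), Desc1), '')
     + ISNULL(CONVERT(varchar(max), Desc2), '')
     + ISNULL(CONVERT(varchar(max), Desc3), '') AS CombinedDescription
FROM LEGACY2000.MyDatabase.dbo.Descriptions;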
I'm new to Microsoft SQL. I'm planning to store text in Microsoft SQL Server, and there will be special international characters. Is there a "Data Type" specific to Unicode, or am I better off encoding my text with a reference to the Unicode code point (i.e. \u0056)?
Use Nvarchar/Nchar (MSDN link). There used to be an Ntext datatype as well, but it's deprecated now in favour of Nvarchar.
The columns take up twice as much space as their non-Unicode counterparts (char and varchar).
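You can see this with DATALENGTH:

-- nvarchar stores two bytes per character (UCS-2/UTF-16), varchar one (for ASCII)
SELECT DATALENGTH('abc')  AS VarcharBytes,   -- 3
       DATALENGTH(N'abc') AS NvarcharBytes;  -- 6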
Then when "manually" inserting into them, use N to indicate it's unicode text:
INSERT INTO MyTable(SomeNvarcharColumn)
VALUES (N'français')
When you say special international characters, what do you mean? If special means they aren't common and just occasional, then the overhead of nvarchar might not make sense in your situation on a table with a very large number of rows or a lot of indexing.
I'm all for using Unicode where appropriate, but understanding when it is appropriate is important.
If you are mixing data with different implied code pages (say, Japanese and Chinese in the same database), or you just want to be forward-looking for internationalization and localization, then you want the column to be Unicode: use the nvarchar data type, and that's perfectly fine. Just don't expect Unicode to magically solve all sorting problems for you.
If you know that you will always be storing mainly ASCII with only occasional foreign characters, just store your UTF-8 or HTML-encoded data in varchar. If your data is all in Japanese and code page 932 (or any other single code page), you can still store double-byte characters in varchar; each one still takes up two bytes. My point is that when you are already in a DBCS collation, international characters are no longer "special". And it's not just the data storage, but any indexes as well as the working set when dealing with such a column in queries and other dataflows.
And do not make a blanket rule that all character data should be nvarchar - it's a waste for many columns which are codes or identifiers.
Any time you add a column, run through the same questions (a sketch applying them follows this list):
What is the type of data?
What is the range?
Are NULLs allowed?
What is the limit of the size?
Are there any constraints I should apply now to stop bad data getting in from the beginning?
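A sketch of how the answers might land in an actual column definition (all names, sizes, and constraints here are invented for illustration):

CREATE TABLE dbo.Customer
(
    CustomerCode varchar(10)   NOT NULL,  -- machine-generated identifier: ASCII only, varchar is enough
    CustomerName nvarchar(100) NOT NULL,  -- free text in any language: nvarchar earns its keep here
    CountryCode  char(2)       NOT NULL
        CONSTRAINT CK_Customer_CountryCode
        CHECK (CountryCode NOT LIKE '%[^A-Z]%')  -- stop bad data getting in from the beginning
);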
People have had success using the N prefix to force Unicode when inserting data:
INSERT INTO <table> (text) VALUES (N'<text here>')
The character set used for char/varchar columns is determined by the collation (of the column, which defaults to the database's collation), while nvarchar and nchar columns always store Unicode (UTF-16) regardless of collation. So to store Unicode strings, use the nvarchar or nchar data types for the columns, and prefix your string literals with N. Read this link for more information: Unicode and SQL Server
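A quick way to see why the N prefix matters: in a database whose default collation maps to a typical Latin1 code page (1252), a character outside that code page survives only with the prefix:

SELECT 'Ω'  AS WithoutN,  -- literal is varchar: 'Ω' is not in code page 1252, so you get '?'
       N'Ω' AS WithN;     -- literal is nvarchar: the character is preserved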
The DB I use has French_CI_AS collation (CI should stand for Case-Insensitive) but is case-sensitive anyway. I'm trying to understand why.
The reason I assert this is that bulk inserts with a 'GIVEN' case setup fail, but they succeed with another 'Given' case setup.
For example:
INSERT INTO SomeTable([GIVEN],[COLNAME]) VALUES ("value1", "value2") fails, but
INSERT INTO SomeTable([Given],[ColName]) VALUES ("value1", "value2") works.
EDIT
Just saw this:
http://msdn.microsoft.com/en-us/library/ms190920.aspx
so that means it should be possible to change a column's collation without emptying all the data and recreating the related table?
Given this critical piece of information (that is in a comment on the question and not in the actual question):
In fact I use Microsoft .Net's bulk insert method, so I don't really know the exact query it sends to the DB server.
it makes sense that the column names are being treated as case-sensitive, even in a case-insensitive DB, since that is how the SqlBulkCopy Class works. Please see Column mappings in SqlBulkCopy are case sensitive.
ADDITIONAL NOTES
When asking about an error, please always include the actual, and full, error message in the question. Simply saying that there was an error leads to a lot of guessing and wild-goose chases that in turn lead to off-topic answers.
When asking a question, please do not change the circumstances that you are dealing with. For example, the question states (emphasis added):
bulk inserts with a 'GIVEN' case setup fail, but they succeed with another 'Given' case setup.
Yet the example statements are single INSERTs. Also, a comment on the question states:
In fact I use Microsoft .Net's bulk insert method, so I don't really know the exact query it sends to the DB server.
Using .NET and SqlBulkCopy is waaaay different than using BULK INSERT or INSERT, making the current question misleading, making it difficult (or even impossible) to answer correctly. This new bit of info also leads to more questions because when using SqlBulkCopy, you don't write any INSERT statements: you just write a SELECT statement and specify the name of the destination Table. If you specify column names at all for the destination Table, it is in the optional column mappings. Is that where the issue is?
Regarding the "EDIT" section of the question:
No, changing the Collation of the column won't help at all, even if you weren't using SqlBulkCopy. The Collation of a column determines how data stored in the column behaves, not how the column names (i.e. the meta-data of the Table) behave. It is the Collation of the Database itself that determines how Database-level object meta-data behaves. And in this case, you claim that the DB is using a case-insensitive Collation (correct, the _CI_ portion of the Collation name does mean "Case Insensitive").
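You can check both levels yourself (dbo.SomeTable below reuses the table name from the question):

-- Database collation: governs Database-level meta-data, such as column names
SELECT DATABASEPROPERTYEX(DB_NAME(), 'Collation') AS DatabaseCollation;

-- Column collations: govern the data stored in each column
SELECT name, collation_name
FROM sys.columns
WHERE object_id = OBJECT_ID(N'dbo.SomeTable');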
Regarding the following statements made by Jonathan Leffler on the question:
that gets into a very delicate area of the interaction between delimited identifiers (normally case-sensitive) and collations (this one is case-insensitive).
No, delimited identifiers are not normally case-sensitive. The sensitivities (case, accent, kana type, width, and, starting in SQL Server 2017, variation selector) of delimited identifiers are the same as for non-delimited identifiers at that same level. "Same level" means that Instance-level names (Databases, Logins, etc.) are controlled by the Instance-level Collation, while Database-level names (Schemas; Objects such as Tables, Views, Functions, and Stored Procedures; Users; etc.) are controlled by the Database-level Collation. And these two levels can have different Collations.
you need to research whether the SQL column names in a database are case-sensitive when delimited. It may also depend on how the CREATE TABLE statement is written (were the names delimited in that?). Normally, SQL is case-insensitive on column and table names; you could write INSERT INTO SoMeTaBlE(GiVeN, cOlNaMe) VALUES("v1", "v2") and if the names were never delimited, it'd be OK.
It does not matter if the column names were delimited or not when creating the Table, at least not in terms of how their resolution is handled. Column names are Database-level meta-data, and that is controlled by the default Collation of the Database. And it is the same for all Database-level meta-data within each Database. You cannot have some column names being case-sensitive while others are case-insensitive.
Also, there is nothing special about Table and column names. They are Database-level meta-data just like User names, Schema names, Index names, etc. All of this meta-data is controlled by the Database's default Collation.
Meta-data (both Instance-level and Database-level) is only "normally" case-insensitive due to the default Collation suggested during installation being a case-insensitive Collation.
a 'delimited identifier' is a column name, table name, or something similar enclosed in double quotes, such as CREATE TABLE "table"(...)
It is more accurate to say that a delimited identifier is an identifier enclosed in whatever character(s) the DBMS in question has defined as its delimiters. And which particular characters are used for delimiters varies between the different DBMSs.
In SQL Server, delimited identifiers are enclosed in square brackets: [GIVEN]
While square brackets always work as delimiters for identifiers, it is possible to use double-quotes as delimiters IF you have the session-level property of QUOTED_IDENTIFIER set to ON (which is best to always do anyway).
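For example (again using the question's SomeTable):

SET QUOTED_IDENTIFIER ON;
SELECT "Given" FROM dbo.SomeTable;  -- double quotes delimit an identifier here
SELECT [Given] FROM dbo.SomeTable;  -- square brackets work regardless of the setting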
There are arcane parts to SQL (and delimited identifier handling is one of them)
Well, delimited identifiers are actually quite simple. The whole point of delimiting an identifier is to effectively ignore the rules of regular (i.e. non-delimited) identifiers. But, in terms of regular identifiers, yes, those rules are rather arcane (mainly due to the official documentation being incomplete and incorrect). So, in order to take the mystery out of how identifiers in SQL Server actually work, I did a bunch of research and published the results here (which includes links to the research itself):
Completely Complete List of Rules for T-SQL Identifiers
For more info on Collations / Encodings / Unicode / ASCII, especially as they relate to Microsoft SQL Server, please visit:
Collations.Info
The fact that the column names are case-sensitive means that the master database (i.e. the instance) was created with a case-sensitive collation.
In the case that led me to investigate this, someone had entered Latin1_General_CS_AI instead of Latin1_General_CI_AS when setting up SQL Server.
Check the collation of the columns in your table definition, and the collation of the tempdb database (i.e. the server collation). They may differ from your database collation.
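A quick sketch to compare them:

-- The server collation is also tempdb's collation
SELECT SERVERPROPERTY('Collation')                AS ServerCollation,
       DATABASEPROPERTYEX('tempdb', 'Collation')  AS TempdbCollation,
       DATABASEPROPERTYEX(DB_NAME(), 'Collation') AS CurrentDbCollation;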