SQL Server 2008: Collation for UTF-8 code page 65001 - sql-server

There is a need to save an XML document in UTF-8 encoding and then use it from T-SQL code to extract data.
Default database collation is SQL_Latin1_General_CP1_CI_AS.
I don't know whether it is possible to save and work with UTF-8 data in SQL Server 2008, but my idea is to use a collation with the UTF-8 code page (65001) on the XML column so that the data is stored as UTF-8.
Does anybody know if it is possible or have another idea on how to work with UTF-8 data in SQL Server?

If you're dealing with xml data, store it as the xml data type. That should take care of any concerns you have (i.e. how to store it) and you'll save yourself the work of having to convert it to xml when you do work on it (e.g. xpath expressions, xquery, etc).
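For example, a minimal sketch (the element and attribute names here are made up for illustration):

DECLARE @doc xml = N'<order id="42"><item sku="ABC" qty="3"/></order>';

-- XPath/XQuery directly against the xml value, no conversion needed
SELECT @doc.value('(/order/@id)[1]', 'int')                AS OrderId,
       @doc.value('(/order/item/@sku)[1]', 'nvarchar(20)') AS Sku;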

You can store all Unicode characters in xml or nvarchar columns, regardless of which collation you use. Characters outside the Basic Multilingual Plane (for example, a handful of rare Chinese characters from the supplementary planes) are stored as pairs of nchars (surrogate pairs), but there is no loss of data.
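A quick way to see this behaviour (U+20B9F is just an arbitrary supplementary-plane example, built here from its surrogate pair):

SELECT NCHAR(0xD842) + NCHAR(0xDF9F)             AS SupplementaryChar,
       LEN(NCHAR(0xD842) + NCHAR(0xDF9F))        AS CodeUnits,  -- 2 on SQL Server 2008 collations (counted as two nchars)
       DATALENGTH(NCHAR(0xD842) + NCHAR(0xDF9F)) AS Bytes;      -- 4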

An NVARCHAR column should do the job just fine.

Related

SQL Server 2019: consequences of keeping nvarchar column types while changing the database collation to Latin1_General_100_CI_AI_SC_UTF8

We need to store a lot of UTF-8 data in XML data type columns, and the XML files we store explicitly declare encoding='UTF-8', which results in an error when we try to store them in the column.
We are now in the process of switching the database's default collation from the prior UTF-16-based Latin1_General_100_CI_AI_SC to Latin1_General_100_CI_AI_SC_UTF8. Do we need to switch all nvarchar columns to varchar in the process? We are not afraid of losing any data, and we (probably) have little more than a few encoded characters in our data; we are all in the 'Latin' alphabet. Is it simply going to affect storage (use 2x the size)? Will there be a performance hit on joins? Any other consequences?
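For reference, the error usually appears when the document is passed as an N'...' (UTF-16) string while its declaration says encoding='UTF-8'; a sketch of the failure and a commonly used workaround (the XML content here is just a placeholder):

-- Fails with "unable to switch the encoding": the literal is UTF-16 but the declaration says UTF-8
SELECT CAST(N'<?xml version="1.0" encoding="UTF-8"?><root/>' AS xml);

-- Typically works: pass it as varchar (or strip the declaration) so the declared encoding is plausible
SELECT CAST('<?xml version="1.0" encoding="UTF-8"?><root/>' AS xml);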

What SQL Server datatype to use for mixed XML and HL7v2 data?

Consider a column in an MS SQL database which will house potentially large chunks of either XML or pipe-delimited HL7v2 data.
Currently (due to not being forward-thinking) it's typed as XML, because originally we were only ever accepting XML data. While technically this could work, it means that all the XML special characters in the HL7v2 messages are being encoded (& --> &amp;, etc.).
This is not ideal for what we are doing. If I were to convert this column to a different datatype, what would be recommended? I was thinking nvarchar(max) as it seems like it would handle it, but I'm not well-versed in SQL datatypes and the implications of using different types for such data.
There really isn't much of a choice other than nvarchar(max).
The other options are either varchar(max) or varbinary(max). You might need Unicode so you can't use varchar. It would work to store it as varbinary, but it would just be annoying to work with.
Use HAPI to transform the HL7 messages from ER7 (pipe delimited) to XML encoding. That way you can use a single SQL Server XML column for everything. And it will give you the added benefit of being able to query into HL7 message contents using XQuery.
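Once the messages are persisted as XML, a query could look roughly like this (the table, column, and element names are assumptions; the exact paths depend on the HL7 v2 XML encoding you generate):

WITH XMLNAMESPACES (DEFAULT 'urn:hl7-org:v2xml')  -- namespace used by the HL7 v2 XML encoding
SELECT MessageId,
       Hl7Xml.value('(//MSH.9/MSG.1)[1]', 'varchar(10)') AS MessageType
FROM dbo.Hl7Messages;  -- hypothetical table and column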
As Nicks says, converting the pipe-delimited messages to XML and then persisting them as XML is the best option. Trying to persist XML and pipe-delimited values in the same column makes no sense to me, since at the source they are different data types.

Unicode Data Type in SQL

I'm new to Microsoft SQL. I'm planning to store text in Microsoft SQL Server and there will be special international characters. Is there a data type specific to Unicode, or am I better off encoding my text with a reference to the Unicode number (i.e. \u0056)?
Use Nvarchar/Nchar (MSDN link). There used to be an Ntext datatype as well, but it's deprecated now in favour of Nvarchar.
The columns take up twice as much space over the non-unicode counterparts (char and varchar).
Then when "manually" inserting into them, use N to indicate it's unicode text:
INSERT INTO MyTable(SomeNvarcharColumn)
VALUES (N'français')
When you say special international characters, what do you mean? If special means they aren't common and just occasional, then the overhead of nvarchar might not make sense in your situation on a table with a very large number of rows or a lot of indexing.
I'm all for using Unicode where appropriate, but understanding when it is appropriate is important.
If you are mixing data with different implied code pages (Japanese and Chinese in the same database), or you just want to be forward-looking for internationalization and localization, then you want the column to be Unicode and should use the nvarchar data type; that's perfectly fine. Note that Unicode is not going to magically solve all sorting problems for you.
If you know that you will always be storing mainly ASCII with only occasional foreign characters, you can just store your UTF-8 data or HTML-encoded data in varchar. If your data is all in Japanese and code page 932 (or any other single code page), you can still store double-byte characters in varchar; they still take up two bytes. My point is that when you are already in a DBCS collation, international characters are no longer "special". It's not just the data storage, but any indexes as well as the working set when dealing with such a column in queries and in other dataflows.
And do not make a blanket rule that all character data should be nvarchar - it's a waste for many columns which are codes or identifiers.
Any time you have a column, go through the same questions:
What is the type of data?
What is the range?
Are NULLs allowed?
What is the limit of the size?
Are there any constraints I should apply now to stop bad data getting in from the beginning?
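As a rough illustration of working through that checklist (the table and column names are invented): codes stay in narrow non-Unicode types, free text that may be international goes in nvarchar, and constraints keep bad data out from the beginning:

CREATE TABLE dbo.Customer
(
    CustomerId  int           NOT NULL PRIMARY KEY,   -- surrogate key, known range
    CountryCode char(2)       NOT NULL                -- a code, not free text: no need for nvarchar
        CONSTRAINT CK_Customer_CountryCode CHECK (CountryCode LIKE '[A-Z][A-Z]'),
    FullName    nvarchar(200) NOT NULL,               -- user-entered, may contain any script: Unicode
    Notes       nvarchar(max) NULL                    -- NULLs allowed, effectively unbounded size
);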
People have had success with the following pattern to force Unicode when inserting data:
INSERT INTO <table> (text) VALUES (N'<text here>')
Character set behaviour for tables and the strings inside them is determined by the database, so if your database has a Unicode collation, strings inside the tables are Unicode. You also have to use the nvarchar or nchar data types for string columns so that they can store Unicode strings. But this only works if your database has a UTF-8 or Unicode character set or collation. Read this link for more information: Unicode and SQL Server

How do I save xml exactly as is to an xml database field?

At the moment, if I save <element></element> to a SQL Server 2008 database in a field of type xml, it converts it to <element/>.
How can I preserve the xml empty text as is when saving?
In case this is a gotcha, I am utilising Linq to Sql as my ORM to communicate to the database in order to save it.
What you're asking for is not possible.
SQL Server stores data in xml columns as a binary representation, so any extraneous formatting is discarded, as you found out.
To preserve the formatting, you would have to store the content in a text field of type varchar(MAX) or nvarchar(MAX). Hopefully you don't have to run XML-based queries on the data.
http://msdn.microsoft.com/en-us/library/ms189887.aspx
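A quick way to see the difference (a sketch; the table variable is just for illustration):

DECLARE @t TABLE (AsXml xml, AsText nvarchar(max));
INSERT INTO @t VALUES (N'<element></element>', N'<element></element>');

SELECT CAST(AsXml AS nvarchar(max)) AS FromXmlColumn,  -- <element/> : re-serialized from the internal form
       AsText                       AS FromTextColumn  -- <element></element> : stored verbatim
FROM @t;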

How can I recover Unicode data which displays in SQL Server as question marks?

I have a database in SQL Server containing a column which needs to hold Unicode data (it contains users' addresses from all over the world, e.g. القاهرة for Cairo).
This column is an nvarchar column with the database default collation (Latin1_General_CI_AS), but I've noticed that data inserted into it via SQL statements containing non-English characters displays as ?????.
The cause seems to be that I wasn't using the N prefix, e.g. I was writing:
INSERT INTO table (address) VALUES ('القاهرة')
Instead of:
INSERT INTO table (address) VALUES (n'القاهرة')
I was under the impression that Unicode would automatically be converted for nvarchar columns and I didn't need this prefix, but this appears to be incorrect.
The problem is I still have some data in this column which appears as ????? in SQL Server Management Studio and I don't know what it is!
Is the data still there but in an incorrect character encoding preventing it from displaying but still salvageable (and if so how can I recover it?), or is it gone for good?
Thanks,
Tom
To find out what SQL Server really stores, use
SELECT CONVERT(VARBINARY(MAX), 'some text')
I just tried this with umlauted characters and Arabic (copied from Wikipedia; I have no idea what it says), both as plain strings and as N'' Unicode strings.
The results are that Arabic non-Unicode strings really end up as question marks (0x3F) in the conversion to VARCHAR.
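Applied to the example above, a sketch of what you would typically see on a Latin1 code page (the exact byte values aside, the pattern is what matters):

SELECT CONVERT(varbinary(max), 'القاهرة')  AS WithoutN,  -- a run of 0x3F bytes: the characters were already replaced by '?'
       CONVERT(varbinary(max), N'القاهرة') AS WithN;     -- real UTF-16 code units: the data is intact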
SSMS sometimes won't display all characters. I just tried what you had and it worked for me; copy and paste it into Word and it might display correctly.
Usually, if SSMS can't display a character, it shows boxes, not ?.
Try writing a small client that retrieves this data to a file or web page. Also check ALL your code to make sure there are no other inserts or updates that might convert the data to varchar before storing it in the tables.
