I'm trying to insert into an XML column (SQL Server 2008 R2), but the server is complaining:
System.Data.SqlClient.SqlException (0x80131904):
XML parsing: line 1, character 39, unable to switch the encoding
I found out that the XML column has to be UTF-16 in order for the insert to succeed.
The code I'm using is:
XmlSerializer serializer = new XmlSerializer(typeof(MyMessage));
StringWriter str = new StringWriter();
serializer.Serialize(str, message);
string messageToLog = str.ToString();
How can I serialize the object to a UTF-8 string?
EDIT: Ok, sorry for the mixup - the string needs to be in UTF-8. You were right - it's UTF-16 by default, and if I try to insert in UTF-8 it passes. So the question is how to serialize into UTF-8.
Example
This causes errors while trying to insert into SQL Server:
<?xml version="1.0" encoding="utf-16"?>
<MyMessage>Teno</MyMessage>
This doesn't:
<?xml version="1.0" encoding="utf-8"?>
<MyMessage>Teno</MyMessage>
Update
I figured out when SQL Server 2008 needs utf-8 and when it needs utf-16 in the encoding attribute of the XML declaration you are inserting into an Xml column:
To insert XML declared as utf-8, add the parameter to the SQL command like this:
sqlcmd.Parameters.Add("ParamName", SqlDbType.VarChar).Value = xmlValueToAdd;
If you try to add an xmlValueToAdd declared with encoding=utf-16 using the previous line, the insert produces errors. Also, VarChar means that national characters aren't recognized (they turn out as question marks).
To add utf-16 to the db, either use SqlDbType.NVarChar or SqlDbType.Xml in the previous example, or just don't specify the type at all:
sqlcmd.Parameters.Add(new SqlParameter("ParamName", xmlValueToAdd));
This question is a near-duplicate of 2 others, and surprisingly - while this one is the most recent - I believe it is missing the best answer.
The duplicates, and what I believe to be their best answers, are:
Using StringWriter for XML Serialization (2009-10-14)
https://stackoverflow.com/a/1566154/751158
Trying to store XML content into SQL Server 2005 fails (encoding problem) (2008-12-21)
https://stackoverflow.com/a/1091209/751158
In the end, it doesn't matter what encoding is declared or used, as long as the XmlReader can parse it locally within the application server.
As was confirmed in Most efficient way to read XML in ADO.net from XML type column in SQL server?, SQL Server stores XML in an efficient binary format. By using the SqlXml class, ADO.net can communicate with SQL Server in this binary format, and not require the database server to do any serialization or de-serialization of XML. This should also be more efficient for transport across the network.
By using SqlXml, XML will be sent pre-parsed to the database, and then the DB doesn't need to know anything about character encodings - UTF-16 or otherwise. In particular, note that the XML declarations aren't even persisted with the data in the database, regardless of which method is used to insert it.
Please refer to the above-linked answers for methods that look very similar to this, but this example is mine:
using System.Data;
using System.Data.SqlClient;
using System.Data.SqlTypes;
using System.IO;
using System.Xml;
static class XmlDemo {
static void Main(string[] args) {
using(SqlConnection conn = new SqlConnection()) {
conn.ConnectionString = "...";
conn.Open();
using(SqlCommand cmd = new SqlCommand("Insert Into TestData(Xml) Values (@Xml)", conn)) {
cmd.Parameters.Add(new SqlParameter("@Xml", SqlDbType.Xml) {
// Works.
// Value = "<Test/>"
// Works. XML Declaration is not persisted!
// Value = "<?xml version=\"1.0\"?><Test/>"
// Works. XML Declaration is not persisted!
// Value = "<?xml version=\"1.0\" encoding=\"UTF-16\"?><Test/>"
// Error ("unable to switch the encoding" SqlException).
// Value = "<?xml version=\"1.0\" encoding=\"UTF-8\"?><Test/>"
// Works. XML Declaration is not persisted!
Value = new SqlXml(XmlReader.Create(new StringReader("<?xml version=\"1.0\" encoding=\"UTF-8\"?><Test/>")))
});
cmd.ExecuteNonQuery();
}
}
}
}
Note that I would not consider the last (non-commented) example to be "production-ready", but left it as-is to be concise and readable. If done properly, both the StringReader and the created XmlReader should be initialized within using statements to ensure that their Close() methods are called when complete.
From what I've seen, the XML declarations are never persisted when using an XML column. Even without using .NET and just using this direct SQL insert statement, for example, the XML declaration is not saved into the database with the XML:
Insert Into TestData(Xml) Values ('<?xml version="1.0" encoding="UTF-8"?><Test/>');
Now in terms of the OP's question, the object to be serialized still needs to be converted into an XML structure from the MyMessage object, and XmlSerializer is still needed for this. However, at worst, instead of serializing to a String, the message could instead be serialized to an XmlDocument - which can then be passed to SqlXml through a new XmlNodeReader - avoiding a de-serialization/serialization trip to a string. (See http://blogs.msdn.com/b/jongallant/archive/2007/01/30/how-to-convert-xmldocument-to-xmlreader-for-sqlxml-data-type.aspx for details and an example.)
Everything here was developed against and tested with .NET 4.0 and SQL Server 2008 R2.
Please don't create waste by running XML through extra conversions (deserializations and serializations to DOM, strings, or otherwise), as shown in other answers here and elsewhere.
Although a .net string is always UTF-16 you need to serialize the object using UTF-16 encoding.
That should be something like this:
public static string ToString(object source, Type type, Encoding encoding)
{
// The string to hold the object content
String content;
// Create a MemoryStream into which the data can be written and read
using (var stream = new MemoryStream())
{
// Create the xml serializer, the serializer needs to know the type
// of the object that will be serialized
var xmlSerializer = new XmlSerializer(type);
// Create a XmlTextWriter to write the xml object source, we are going
// to define the encoding in the constructor
using (var writer = new XmlTextWriter(stream, encoding))
{
// Save the state of the object into the stream
xmlSerializer.Serialize(writer, source);
// Flush the stream
writer.Flush();
// Read the stream into a string
using (var reader = new StreamReader(stream, encoding))
{
// Set the stream position to the begin
stream.Position = 0;
// Read the stream into a string
content = reader.ReadToEnd();
}
}
}
// Return the xml string with the object content
return content;
}
By setting the encoding to Encoding.Unicode not only the string will be UTF-16 but you should also get the xml string as UTF-16.
<?xml version="1.0" encoding="utf-16"?>
Isn't the easiest solution to tell the serializer not to output the XML declaration? .NET and SQL should sort the rest out between them.
XmlSerializer serializer = new XmlSerializer(typeof(MyMessage));
StringWriter str = new StringWriter();
using (XmlWriter writer = XmlWriter.Create(str, new XmlWriterSettings { OmitXmlDeclaration = true }))
{
serializer.Serialize(writer, message);
}
string messageToLog = str.ToString();
It took me forever to re-solve this problem.
I was running an UPDATE statement against SQL Server, something like:
UPDATE Customers
SET data = '<?xml version="1.0" encoding="utf-16"?><MyMessage>Teno</MyMessage>';
and this gives the error:
Msg 9402, Level 16, State 1, Line 2
XML parsing: line 1, character 39, unable to switch the encoding
And the really, very simple fix is to:
UPDATE Customers
SET data = N'<?xml version="1.0" encoding="utf-16"?><MyMessage>Teno</MyMessage>';
The difference is prefixing the Unicode string with N:
N'<?xml version="1.0" encoding="utf-16"?><MyMessage>Teno</MyMessage>'
In the former case, an unprefixed string is assumed to be varchar (e.g. Windows-1252 code-page). When it encounters the encoding="utf-16" inside the string, there is a conflict (and rightly so, since the string isn't utf-16).
The fix is to pass the string to SQL server as an nvarchar (i.e. UTF-16):
N'<?xml version="1.0" encoding="utf-16"?>'
That way the string is UTF-16, which matches the utf-16 encoding that the XML says it is. The carpet matches the curtains, so to speak.
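To make the mismatch concrete, here is a minimal sketch in Python (used only for illustration; it is not part of the original answer). It assumes the unprefixed literal is stored in the Windows-1252 code page, as in the example above, and compares those bytes with the UTF-16 LE bytes that an N-prefixed literal would carry.

```python
# Illustration only: compare the bytes behind a varchar-style literal
# (assumed here to be Windows-1252) with the bytes behind an
# nvarchar-style literal (UTF-16 little endian).
xml_text = '<?xml version="1.0" encoding="utf-16"?><MyMessage>Teno</MyMessage>'

as_varchar = xml_text.encode("cp1252")      # one byte per character for this text
as_nvarchar = xml_text.encode("utf-16-le")  # two bytes per character for this text

print(len(xml_text), len(as_varchar), len(as_nvarchar))
print(as_varchar[:5])    # b'<?xml'
print(as_nvarchar[:10])  # b'<\x00?\x00x\x00m\x00l\x00'
```

Only the second byte sequence is actually UTF-16, which is why only the N-prefixed literal agrees with a declaration that says encoding="utf-16".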
@ziesemer's answer (above) is the only fully correct answer to this question and the linked duplicates of this question. However, it could still use a little more explanation and some clarification. Consider this as an extension of @ziesemer's answer.
Even if they produce the desired result, most answers to this question (including the duplicate question) are convoluted and go through many unnecessary steps. The main issue here is the overall lack of understanding regarding how the XML datatype actually works in SQL Server (not surprising given that it isn't well documented). The XML type:
Is a highly optimized (for storage) type that converts the incoming XML into a binary format (which is documented somewhere in the msdn site). The optimizations include:
Converting numbers and dates from string (as they are in the XML) into binary representations IF the element or attribute is tagged with the type info (this might require specifying an XML Schema Collection). Meaning, the number "1234567" is stored as a 4-byte "int" instead of a 14-byte UTF-16 string of 7 digits.
Element and Attribute names are stored in a dictionary and given a numeric ID. That numeric ID is used in the XML tree structure. Meaning, "<ElementName>...</ElementName>" takes up 27 characters (i.e. 54 bytes) in string form, but only 11 characters (i.e. 22 bytes) when stored in the XML type. And that is for a single instance of it. In string form, multiple instances take up additional multiples of the 54 bytes. But in the XML type, each instance only takes up the space of that numeric ID, most likely a 4-byte int.
Stores strings as UTF-16 Little Endian, always. This is most likely why the XML declaration is not stored: it is entirely unnecessary as it is always the same since the "Encoding" attribute cannot ever change.
No XML declaration assumes the encoding to be UTF-16, not UTF-8.
Can have 8-bit / non-UTF-16 data passed in. In this case, you need to make sure that the string is not an NVARCHAR string (i.e. not prefixed with an upper-case "N" for literals, not declared as NVARCHAR when dealing with T-SQL variables, and not declared as SqlDbType.NVarChar in .NET). AND, you need to make sure that you do have the XML declaration, and that it specifies the correct encoding.
PRINT 'VARCHAR / UTF-8:';
DECLARE @XML_VC_8 XML;
SET @XML_VC_8 = '<?xml version="1.0" encoding="utf-8"?><test/>';
PRINT 'Success!'
-- Success!
GO
PRINT '';
PRINT 'NVARCHAR / UTF-8:';
DECLARE @XML_NVC_8 XML;
SET @XML_NVC_8 = N'<?xml version="1.0" encoding="utf-8"?><test/>';
PRINT 'Success!'
/*
Msg 9402, Level 16, State 1, Line XXXXX
XML parsing: line 1, character 38, unable to switch the encoding
*/
GO
PRINT '';
PRINT 'VARCHAR / UTF-16:';
DECLARE @XML_VC_16 XML;
SET @XML_VC_16 = '<?xml version="1.0" encoding="utf-16"?><test/>';
PRINT 'Success!'
/*
Msg 9402, Level 16, State 1, Line XXXXX
XML parsing: line 1, character 38, unable to switch the encoding
*/
GO
PRINT '';
PRINT 'NVARCHAR / UTF-16:';
DECLARE @XML_NVC_16 XML;
SET @XML_NVC_16 = N'<?xml version="1.0" encoding="utf-16"?><test/>';
PRINT 'Success!'
-- Success!
As you can see, when the input string is NVARCHAR, then the XML declaration can be included, but it needs to be "UTF-16".
When the input string is VARCHAR then the XML declaration can be included, but it cannot be "UTF-16". It can, however, be any valid 8-bit encoding, in which case the bytes for that encoding will be converted into UTF-16, as shown below:
DECLARE @XML XML;
SET @XML = '<?xml version="1.0" encoding="utf-8"?><test attr="'
+ CHAR(0xF0) + CHAR(0x9F) + CHAR(0x98) + CHAR(0x8E) + '"/>';
SELECT @XML;
-- <test attr="😎" />
SET @XML = '<?xml version="1.0" encoding="Windows-1255"?><test attr="'
+ CONVERT(VARCHAR(10), 0xF9ECE5ED) + '"/>';
SELECT @XML AS [XML from Windows-1255],
CONVERT(VARCHAR(10), 0xF9ECE5ED) AS [Latin1_General / Windows-1252];
/*
XML from Windows-1255 Latin1_General / Windows-1252
<test attr="שלום" /> ùìåí
*/
The first example specifies the 4-byte UTF-8 sequence for Smiling Face with Sunglasses and it gets converted correctly.
The second example uses 4 bytes to represent 4 Hebrew letters making up the word "Shalom", which is converted correctly, and displayed correctly given that the "F9" byte, which is first, is the ש character, which is on the right-side of the word (since Hebrew is a right-to-left language). Yet those same 4 bytes display as ùìåí when selected directly since the default Collation for the current DB is Latin1_General_100_CS_AS_SC.
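The same byte-level conversions can be reproduced outside SQL Server. The following is a small Python sketch (Python is an illustration choice here, not something the answer uses) that decodes the exact byte values from the two T-SQL examples with the standard codecs for UTF-8, Windows-1255 and Windows-1252.

```python
# Byte values copied from the T-SQL examples above.
smiley_bytes = bytes([0xF0, 0x9F, 0x98, 0x8E])
hebrew_bytes = bytes.fromhex("F9ECE5ED")

# Decode according to the encoding named in each XML declaration.
smiley = smiley_bytes.decode("utf-8")    # Smiling Face with Sunglasses
shalom = hebrew_bytes.decode("cp1255")   # four Hebrew letters

# The same four bytes read through a Latin1 / Windows-1252 code page.
mojibake = hebrew_bytes.decode("cp1252")

# What the character looks like once converted to UTF-16 LE:
# a surrogate pair, i.e. four bytes.
smiley_utf16 = smiley.encode("utf-16-le")

print(hex(ord(smiley)), shalom, mojibake, smiley_utf16.hex())
```

The bytes never change; only the code page used to interpret them does, which is exactly what the encoding in the XML declaration tells the parser.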
A string is always UTF-16 in .NET, so as long as you stay inside your managed app you don't have to care about which encoding it is.
The problem is more likely where you talk to the SQL server. Your question doesn't show that code so it's hard to pin point the exact error. My suggestion is you check if there's a property or attribute you can set on that code that specifies the encoding of the data sent to the server.
You are serializing to a string rather than a byte array so, at this point, any encoding hasn't happened yet.
What does the start of "messageToLog" look like? Is the XML specifying an encoding (e.g. utf-8) which subsequently turns out to be wrong?
Edit
Based on your further info it sounds like the string is automatically converted to utf-8 when it is passed to the database, but the database chokes because the XML declaration says it is utf-16.
In which case, you don't need to serialize to utf-8. You need to serialize with the "encoding=" omitted from the XML. The XmlFragmentWriter (not a standard part of .Net, Google it) lets you do this.
The default encoding for an XML serializer should be UTF-16. Just to make sure, you can try:
XmlSerializer serializer = new XmlSerializer(typeof(YourObject));
// create a MemoryStream here, we are just working
// exclusively in memory
System.IO.Stream stream = new System.IO.MemoryStream();
// The XmlTextWriter takes a stream and encoding
// as one of its constructors
// Encoding.Unicode is .NET's name for UTF-16 (little endian)
System.Xml.XmlTextWriter xtWriter = new System.Xml.XmlTextWriter(stream, Encoding.Unicode);
serializer.Serialize(xtWriter, yourObjectInstance);
xtWriter.Flush();
Related
A piece of T-SQL code doesn't behave the same in the Production and Test environments. When the code below is executed on Prod, it brings back data:
SELECT [col1xml]
FROM [DBName].[dbo].[Table1] (NOLOCK)
WHERE (cast([col1xml] as xml).value('(/Payment/****/trn1)[1]','nvarchar(20)') ='123456')
However, that same code brings back the error below when run in Test.
Msg 9402, Level 16, State 1, Line 9
XML parsing: line 1, character 38, unable to switch the encoding
I have seen the fix suggested on this site of replacing the UTF encoding in the declaration, and it works in both Prod and Test (see below). However, I need to give the developers an explanation of why this behavior is occurring and a rationale for why they should change their code (if that is the case).
WHERE CAST(
REPLACE(CAST(col1xml AS VARCHAR(MAX)), 'encoding="utf-16"', 'encoding="utf-8"')
AS XML).value('(/Payment/****/trn1)[1]','NVARCHAR(max)') ='123456'
I have compared both DBs and looked for anything obvious such as ANSI nulls and ANSI padding. Everything is the same, including the version of SQL Server (SQL Server 2012, 11.0.5388). Data between environments is different but the table schema is identical and the data type for col1xml is ntext.
In SQL Server you should store XML in a column typed XML. This native type has a lot of advantages. It is much faster and has implicit validity checks.
From your question I take it that you store your XML in NTEXT. This type has been deprecated for ages and will not be supported in future versions! You ought to change this soon!
SQL-Server knows two kinds of strings:
1 byte strings (CHAR or VARCHAR), which is extended ASCII
Important: This is not UTF-8! Native UTF-8 support will be part of a coming version.
2 byte string (NCHAR or NVARCHAR), which is UTF-16 (UCS-2)
If the XML has a leading declaration with an encoding (in most cases this is utf-8 or utf-16) you can run into trouble.
If the XML is stored as 2-byte-string (at least the NTEXT tells me this), the declaration has to be utf-16. With a 1-byte-string it should be utf-8.
The best (and easiest) way is to omit the declaration completely. You do not need it. Storing the XML in the appropriate type will remove this declaration automatically.
What you should do: Create a new column of type XML and shuffle all your XMLs to this column. Get rid of any TEXT, NTEXT and IMAGE columns you might have!
The next step is: Be happy and enjoy the fast and easy going with the native XML type :-D
UPDATE Differences in environment
You write: Data between environments is different
The error happens here:
cast([col1xml] as xml)
If your column stored the XML in the native type, you would not need a cast (which is very expensive!) at all. But in your case this cast depends on the actual XML. As this is stored in NTEXT, it is a 2-byte string. If your XML starts with a declaration stating a non-supported encoding (in most cases utf-8), the cast will fail.
Try this:
This works
DECLARE @xml2Byte_UTF16 NVARCHAR(100)='<?xml version="1.0" encoding="utf-16"?><root>test1</root>';
SELECT CAST(@xml2Byte_UTF16 AS XML);
DECLARE @xml1Byte_UTF8 VARCHAR(100)='<?xml version="1.0" encoding="utf-8"?><root>test2</root>';
SELECT CAST(@xml1Byte_UTF8 AS XML);
This fails
DECLARE @xml2Byte_UTF8 NVARCHAR(100)='<?xml version="1.0" encoding="utf-8"?><root>test3</root>';
SELECT CAST(@xml2Byte_UTF8 AS XML);
DECLARE @xml1Byte_UTF16 VARCHAR(100)='<?xml version="1.0" encoding="utf-16"?><root>test4</root>';
SELECT CAST(@xml1Byte_UTF16 AS XML);
Play around with VARCHAR and NVARCHAR and utf-8 and utf-16...
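If it helps to see the rule in one place, the four combinations above can be written down as a toy model. This is only a Python sketch of the behaviour described in this answer, not SQL Server's real parser: the helper names `declared_encoding` and `cast_should_succeed` are made up for illustration, and it assumes the only 2-byte declaration of interest is "utf-16".

```python
import re

# Hypothetical helpers: a toy model of the rule described above,
# not an API of SQL Server or of any driver.
_DECL = re.compile(
    r'<\?xml[^>]*\bencoding\s*=\s*["\']([^"\']+)["\']', re.IGNORECASE
)

def declared_encoding(xml_text):
    """Return the lower-cased encoding named in the XML declaration, or None."""
    match = _DECL.match(xml_text)
    return match.group(1).lower() if match else None

def cast_should_succeed(string_type, xml_text):
    """Model: NVARCHAR input needs utf-16, VARCHAR input must not say utf-16."""
    encoding = declared_encoding(xml_text)
    if encoding is None:
        return True  # no declaration, nothing to conflict with
    if string_type.upper() == "NVARCHAR":
        return encoding == "utf-16"
    return encoding != "utf-16"

utf16_doc = '<?xml version="1.0" encoding="utf-16"?><root>test</root>'
utf8_doc = '<?xml version="1.0" encoding="utf-8"?><root>test</root>'

for string_type in ("NVARCHAR", "VARCHAR"):
    for doc in (utf16_doc, utf8_doc):
        print(string_type, declared_encoding(doc), cast_should_succeed(string_type, doc))
```

The model also shows why dropping the declaration is the easy way out: with no declared encoding there is nothing for the string type to disagree with.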
I am trying to figure out how to pass an XML value to a stored procedure using the MSSQL node driver. From the documentation I can see that the driver does support stored procedures, and you can also define custom data types like this:
sql.map.register(MyClass, sql.Text);
but I haven't found an example how can it be done for XML so far.
I did find a similar question but for a .NET SQL driver, trying to figure out if anyone has done this for Node.
UPDATE
I was able to send an XML to a stored procedure and parse it in DB, here's the example:
var request = new sql.Request(connection);
var xml = '<root><stock><id>3</id><name>Test3</name><ask>91011</ask></stock></root>'
request.input('XStock', sql.Xml, xml);
request.execute('StockUpdateTest', function (err, recordsets, returnValue, affected) {
});
I do not know this special case, but there are some general ideas:
The input parameter of a stored procedure, which should take some XML, can be either XML, VARCHAR or NVARCHAR. Well, just to mention this, VARBINARY might work too, but why should one do this...
A string in SQL-Server is either 8-bit encoded (VARCHAR) or 16-bit (NVARCHAR), XML is - in any case - NVARCHAR internally.
Most cases will be cast implicitly. You can pass valid XML as VARCHAR or as NVARCHAR and assign this to a variable of type XML. Both will work. But you will run into trouble if your XML includes special characters...
Important: If the XML includes a declaration like <?xml ... encoding="utf-8"?> it must be handed over to the XML variable as VARCHAR, while utf-16 must be NVARCHAR. This declaration would be omitted in SQL Server in any case, so the easiest approach is to pass the XML as a string without such a declaration.
The clear advice is to pass the XML as a 16-bit Unicode string without the <?xml ...?> declaration. Doing so, there will be no implicit casting and you will not run into trouble with special characters and/or encoding issues.
Your SP can either define the parameter as XML or as NVARCHAR(MAX) and assign it to a typed variable internally.
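As an illustration of "a Unicode string without a declaration", here is a minimal Python sketch using only the standard library's `xml.etree.ElementTree`. It is not the Node driver from the question; it only shows what such a payload looks like when it is built programmatically, using element names from the question's example.

```python
import xml.etree.ElementTree as ET

# Build a small document with some of the element names from the question.
root = ET.Element("root")
stock = ET.SubElement(root, "stock")
ET.SubElement(stock, "id").text = "3"
ET.SubElement(stock, "name").text = "Test3"

# encoding="unicode" returns a text string and emits no <?xml ...?> line,
# so there is no declared encoding for the server to disagree with.
xml_text = ET.tostring(root, encoding="unicode")
print(xml_text)
```

A string like this can then be bound to an XML or NVARCHAR(MAX) parameter by whatever driver is in use.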
Hope this helps!
The default encoding for an XML type field defined in an SQL Server is UTF-16. I have no trouble inserting into that field with UTF-16 encoded XML streams.
But if I tried to insert into the field with UTF-8 encoded XML stream, the insert attempt would receive the error response
unable to switch encoding.
QUESTION: Is there a way to define a SQL Server column/field as having UTF-8 encoding?
Further info
The insertion operations are performed using Spring JDBCTemplate.
The XML Stream was produced by JAXB Marshaller set to UTF-8 or UTF-16 encoding.
private String marshall(myDAO myTao, JAXBEncoding jaxbEncoding)
throws JAXBException{
JAXBContext jc = JAXBContext.newInstance(ObjectFactory.class);
Marshaller m = jc.createMarshaller();
m.setProperty(Marshaller.JAXB_FORMATTED_OUTPUT, Boolean.TRUE);
if (jaxbEncoding!=null)
m.setProperty(Marshaller.JAXB_ENCODING, jaxbEncoding.toString());
StringWriter strw = new StringWriter();
m.marshal(myTao, strw);
return strw.toString();
}
Where ...
public enum JAXBEncoding {
UTF8("UTF-8"),
UTF16("UTF-16")
;
private String value;
private JAXBEncoding(String value){
this.value = value;
}
public String toString(){
return this.value;
}
}
Is there a way to define a SQL Server column/field as having UTF-8 encoding?
No, the only Unicode encoding in SQL Server is UTF-16 Little Endian, which is how the NCHAR, NVARCHAR, NTEXT (deprecated as of SQL Server 2005 so don't use this in new development; besides, it sucks compared to NVARCHAR(MAX) anyway), and XML datatypes are handled. You do not get a choice of Unicode encodings like some other RDBMS's allow.
You can insert UTF-8 encoded XML into SQL Server, provided you follow these three rules:
The incoming string has to be of datatype VARCHAR, not NVARCHAR (as NVARCHAR is always UTF-16 Little Endian, hence the error about not being able to switch the encoding).
The XML has an XML declaration that explicitly states that the encoding of the XML is indeed UTF-8: <?xml version="1.0" encoding="UTF-8" ?>.
The byte sequence needs to be the actual UTF-8 bytes.
For example, we can import a UTF-8 encoded XML document containing the screaming face emoji (and we can get the UTF-8 byte sequence for that Supplementary Character by following that link):
SET NOCOUNT ON;
DECLARE @XML XML = '<?xml version="1.0" encoding="utf-8"?><root><test>'
+ CHAR(0xF0) + CHAR(0x9F) + CHAR(0x98) + CHAR(0xB1)
+ '</test></root>';
SELECT @XML;
PRINT CONVERT(NVARCHAR(MAX), @XML);
Returns (in both "Results" and "Messages" tabs):
<root><test>😱</test></root>
You mentioned in a comment on @Shnugo's answer:
I've had no problems inserting utf-8 encoded streams with utf-8 header into SQL Server 2013 NVARCHAR column. Would there be a hidden problem?
No, you didn't store UTF-8 encoded anything in an NVARCHAR column (besides, there is no 2013 version of SQL Server, but that is probably just a typo). NVARCHAR is only ever UTF-16 Little Endian. Most likely your UTF-8 stream got converted into UTF-16 LE by the database driver during transit into SQL Server. This is the same encoding that an XML column would use, but the XML column would have tried to convert the stream from UTF-8 into UTF-16 but failed due to it already being UTF-16. This also means that on the way out of SQL Server, the XML document stored in the NVARCHAR column would still have the XML declaration stating that the encoding is UTF-8, but it's definitely not UTF-8.
If you absolutely need the data to be UTF-8 on the way out because you don't want to convert the UTF-16 LE coming out of SQL Server XML or NVARCHAR into UTF-8, then you have no choice but to store the data as VARBINARY(MAX).
As you found out correctly, XML will be stored as Unicode (utf-16, well, it's ucs-2 actually). There is no other format.
Within SQL Server there is VARCHAR(MAX) for extended ASCII (1-byte) and NVARCHAR(MAX) for Unicode. Both can be cast to XML directly (as long as the string is valid XML). One must be aware that VARCHAR(MAX) might not be able to deal with special characters... So, if this is an issue, you should stick with Unicode anyway.
The problem occurs, when the encoding declaration is included within <?xml ...?>:
This works:
DECLARE @xml XML =
'<?xml version="1.0" encoding="utf-8"?>
<root>test</root>';
SELECT @xml;
This produces an error:
DECLARE @xml XML =
'<?xml version="1.0" encoding="utf-16"?>
<root>test</root>';
SELECT @xml;
But this works again (see the leading N before the string literal):
DECLARE @xml XML =
N'<?xml version="1.0" encoding="utf-16"?>
<root>test</root>';
SELECT @xml;
## Conclusion
If you pass the string 1-byte encoded but declared as utf-16 (or vice versa), you'll run into trouble. The best approach is to pass your XML without the <?xml ...?> declaration.
## UPDATE
You are mixing two things
## Encoding
From your comment:
UTF-8 is flexi-length unicode, that varies from 1 byte to 4 bytes in length. Whereas, UTF-16 is fixed length 2 byte unicode. UTF-8 seems the defacto unicode std now...
Yes, it is correct that UTF-8 and UTF-16 are two flavours of Unicode. But it is not correct to call UTF-8 the new de facto standard. This depends heavily on your needs. If you live in an English-speaking country and deal with plain Latin text, UTF-8 will save some bytes. If you live somewhere in the Far East, UTF-8 will bloat your text considerably, due to the many 3- and 4-byte codes.
And, more important in terms of databases, a fixed width is enormously easier to handle. Just imagine a WHERE SUBSTRING(SomeUTF8Column,100,1)='A'. With a 2-byte encoding the engine can cut out the two bytes of the 100th character without looking; with UTF-8 the full string up to character 100 must be analysed to find out where the 100th character actually sits. I would prefer UTF-8 only in cases where bandwidth or storage space is an important factor... SQL Server's VARCHAR actually uses a fixed-width 1-byte encoding and no UTF-8: extended ASCII in combination with a collation.
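The size trade-off is easy to check. Below is a small Python sketch (an illustration only; the sample strings are arbitrary choices, not data from the question) that counts the bytes the same text needs in UTF-8 and in UTF-16 LE.

```python
def byte_lengths(text):
    """Return (UTF-8 byte count, UTF-16 LE byte count) for the given text."""
    return len(text.encode("utf-8")), len(text.encode("utf-16-le"))

# Arbitrary sample strings: plain Latin, CJK, and one emoji.
samples = {
    "latin": "hello",
    "cjk": "\u65e5\u672c\u8a9e",   # three CJK characters
    "emoji": "\U0001F60E",         # one character outside the Basic Multilingual Plane
}

for name, text in samples.items():
    utf8_len, utf16_len = byte_lengths(text)
    print(f"{name}: utf-8={utf8_len} bytes, utf-16={utf16_len} bytes")
```

Latin text is half the size in UTF-8, CJK text is smaller in UTF-16, and the emoji needs four bytes in both. That last case is a side note this answer does not make: UTF-16 uses a surrogate pair for such characters, so it is only fixed-width for characters inside the Basic Multilingual Plane.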
I've had no problems inserting utf-8 encoded streams with utf-8 header into SQL Server 2013 NVARCHAR column
And - this is even more important in terms of XML - XML is not stored as the text you see, rather as a hierarchy tree. You can store literally everything in (N)VARCHAR:
DECLARE @s VARCHAR(MAX)='Don''t store me, I''m UTF-16. Your machine will explode!';
This works with any combination. You can declare NVARCHAR and/or put an N in front of the literal. No problem due to implicit conversions.
But internally VARCHAR cannot deal with higher encodings! Try this:
DECLARE @s NVARCHAR(MAX)=N'слов в тексте';
SELECT @s
This will work with NVARCHAR and N'Your string' only!
## XML storage
As said before, XML is not stored as the text you see, but as a tree. Everything is optimized for performance, hence the fixed-width UTF-16. The XML declaration is omitted in any case...
The problem occurs when you pass in a string which is physically encoded as utf-8 but declared as something else (or vice versa). You can pass in a real UTF-16 string with a declared encoding of utf-16 (same with utf-8) without problems.
## Conclusion
If you have the slightest chance to include 3 or 4 byte UTF-8 codes you should stick to UTF-16.
A 2-step works; first encode your UTF-8 to text or varchar(MAX) and then to xml.
convert(xml, convert(text, '<your UTF-8 xml>'))
The "Type Casting String and Binary Instances" section of the MSDN document
Create Instances of XML Data
explains how incoming XML data is interpreted. Essentially,
if the SQL Server receives the XML data as nvarchar then it "assumes a two-byte unicode encoding such as UTF-16 or UCS-2",
if the SQL Server receives the XML data as varchar then by default it will use the (single-byte character set) code page defined for the SQL Server instance,
if the SQL Server receives the XML data as varbinary then it "is treated as a codepoint stream that is passed directly to the XML parser", and "an instance without BOM and without a declaration encoding will be interpreted as UTF-8".
If your marshalling code is spitting out a Java String to be sent to the SQL Server then it is very likely being sent as nvarchar since a Java String is always a Unicode string. That would explain why the SQL Server assumes UTF-16 encoding.
If you really need to send the XML data to the SQL Server with UTF-8 encoding (though I can't imagine why) then your marshalling code probably needs to produce a stream of (UTF-8 encoded) bytes that will be sent to the SQL Server as varbinary.
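The varbinary rule quoted above can be sketched as a small decision function. This is a Python model for illustration only: the name `guess_encoding` is hypothetical, and the order of the checks (byte order mark first, then the declaration, then the UTF-8 default) is an assumption, since the quoted text only states the final default.

```python
import codecs
import re

# Matches an XML declaration at the very start of a byte payload.
_DECL = re.compile(rb'<\?xml[^>]*\bencoding\s*=\s*["\']([A-Za-z0-9._-]+)["\']')

def guess_encoding(payload: bytes) -> str:
    """Toy model of how a binary XML payload might be interpreted.

    Assumed order: BOM, then declared encoding, then the UTF-8 default.
    """
    if payload.startswith(codecs.BOM_UTF8):
        return "utf-8"
    if payload.startswith((codecs.BOM_UTF16_LE, codecs.BOM_UTF16_BE)):
        return "utf-16"
    match = _DECL.match(payload)
    if match:
        return match.group(1).decode("ascii").lower()
    return "utf-8"  # no BOM and no declared encoding

# What marshalling code would hand over instead of a text string.
utf8_payload = '<?xml version="1.0" encoding="UTF-8"?><root/>'.encode("utf-8")
print(guess_encoding(utf8_payload))
print(guess_encoding(b"<root/>"))
```

The point for the marshalling code is the first line of the usage: the payload is a sequence of UTF-8 bytes, not a text string, so the driver has no reason to send it as nvarchar.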
The default encoding for an XML type field defined in an SQL Server is UTF-16. I have no trouble inserting into that field with UTF-16 encoded XML streams.
But if I tried to insert into the field with UTF-8 encoded XML stream, the insert attempt would receive the error response
unable to switch encoding.
QUESTION: Is there a way to define a SQL Server column/field as having UTF-8 encoding?
Further info
The insertion operations are performed using Spring JDBCTemplate.
The XML Stream was produced by JAXB Marshaller set to UTF-8 or UTF-16 encoding.
private String marshall(myDAO myTao, JAXBEncoding jaxbEncoding)
throws JAXBException{
JAXBContext jc = JAXBContext.newInstance(ObjectFactory.class);
m = jc.createMarshaller();
m.setProperty(Marshaller.JAXB_FORMATTED_OUTPUT, Boolean.TRUE);
if (jaxbEncoding!=null)
m.setProperty(Marshaller.JAXB_ENCODING, jaxbEncoding.toString());
StringWriter strw = new StringWriter();
m.marshal(myTao, strw);
String strw.toString();
}
Where ...
public enum JAXBEncoding {
UTF8("UTF-8"),
UTF16("UTF-16")
;
private String value;
private JAXBEncoding(String value){
this.value = value;
}
public String toString(){
return this.value;
}
}
Is there a way to define a SQL Server column/field as having UTF-8 encoding?
No, the only Unicode encoding in SQL Server is UTF-16 Little Endian, which is how the NCHAR, NVARCHAR, NTEXT (deprecated as of SQL Server 2005 so don't use this in new development; besides, it sucks compared to NVARCHAR(MAX) anyway), and XML datatypes are handled. You do not get a choice of Unicode encodings like some other RDBMS's allow.
You can insert UTF-8 encoded XML into SQL Server, provided you follow these three rules:
The incoming string has to be of datatype VARCHAR, not NVARCHAR (as NVARCHAR is always UTF-16 Little Endian, hence the error about not being able to switch the encoding).
The XML has an XML declaration that explicitly states that the encoding of the XML is indeed UTF-8: <?xml version="1.0" encoding="UTF-8" ?>.
The byte sequence needs to be the actual UTF-8 bytes.
For example, we can import a UTF-8 encoded XML document containing the screaming face emoji (and we can get the UTF-8 byte sequence for that Supplementary Character by following that link):
SET NOCOUNT ON;
DECLARE #XML XML = '<?xml version="1.0" encoding="utf-8"?><root><test>'
+ CHAR(0xF0) + CHAR(0x9F) + CHAR(0x98) + CHAR(0xB1)
+ '</test></root>';
SELECT #XML;
PRINT CONVERT(NVARCHAR(MAX), #XML);
Returns (in both "Results" and "Messages" tabs):
<root><test>😱</test></root>
You mentioned in a comment on #Shnugo's answer:
I've had no problems inserting utf-8 encoded streams with utf-8 header into SQL Server 2013 NVARCHAR column. Would there be a hidden problem?
No, you didn't store UTF-8 encoded anything in an NVARCHAR column (besides, there is no 2013 version of SQL Server, but that is probably just a typo). NVARCHAR is only ever UTF-16 Little Endian. Most likely your UTF-8 stream got converted into UTF-16 LE by the database driver during transit into SQL Server. This is the same encoding that an XML column would use, but the XML column would have tried to convert the stream from UTF-8 into UTF-16 but failed due to it already being UTF-16. This also means that on the way out of SQL Server, the XML document stored in the NVARCHAR column would still have the XML declaration stating that the encoding is UTF-8, but it's definitely not UTF-8.
If you absolutely need the data to be UTF-8 on the way out because you don't want to convert the UTF-16 LE coming out of SQL Server XML or NVARCHAR into UTF-8, then you have no choice but to store the data as VARBINARY(MAX).
As you found out correctly, XML will be stored as unicode (utf-16, well, it's ucs-2 actually). There is no other format.
Within SQL Server there is VARCHAR(MAX) for extended ASCII (1-byte) and NVARCHAR(MAX) for Unicode. Both can be cast to XML directly (as long as the string is valid XML). One must be aware that VARCHAR(MAX) might not be able to deal with special characters... So, if this is an issue, you should stick with Unicode anyway.
The problem occurs, when the encoding declaration is included within <?xml ...?>:
This works:
DECLARE @xml XML =
'<?xml version="1.0" encoding="utf-8"?>
<root>test</root>';
SELECT @xml;
This produces an error:
DECLARE @xml XML =
'<?xml version="1.0" encoding="utf-16"?>
<root>test</root>';
SELECT @xml;
But this works again (see the leading N before the string literal):
DECLARE @xml XML =
N'<?xml version="1.0" encoding="utf-16"?>
<root>test</root>';
SELECT @xml;
##Conclusion
If you pass the string 1-byte encoded but declared as utf-16 (or vice versa), you'll run into trouble. The best approach is to pass your XML without the <?xml ...?> declaration.
##UPDATE
You are mixing two things
##Encoding
From your comment:
UTF-8 is flexi-length unicode, that varies from 1 byte to 4 bytes in length. Whereas, UTF-16 is fixed length 2 byte unicode. UTF-8 seems the defacto unicode std now...
Yes, it's correct that UTF-8 and UTF-16 are two flavours of Unicode. But it is not correct to call UTF-8 the new de-facto standard. This depends heavily on your needs. If you live in an English-speaking country and deal with plain Latin text, UTF-8 will save you some bytes. If you live somewhere in the Far East, UTF-8 will bloat your text incredibly, due to the many 3 and 4 byte codes.
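The size trade-off can be measured with a small Python sketch. The helper name and the sample strings are made up for this illustration, and the exact ratios depend on the text:

```python
def encoded_sizes(text):
    """Return (UTF-8 byte count, UTF-16 byte count) for the given text."""
    return len(text.encode("utf-8")), len(text.encode("utf-16-le"))

# Latin, Cyrillic and Japanese samples
for sample in ("plain latin text", "слов в тексте", "日本語のテキスト"):
    print(sample, encoded_sizes(sample))
```

For these samples, Latin text is half the size in UTF-8, Cyrillic is roughly the same in both, and Japanese is about 1.5 times larger in UTF-8 than in UTF-16.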
And - this is more important in terms of databases - the fixed width is enormously easier to handle. Just imagine a WHERE SUBSTRING(SomeUTF8Column,100,1)='A'. With UTF-16 the engine can cut out bytes 200 and 201 without looking; with UTF-8 the full string up to character 100 must be analysed to find out where the 100th character actually sits. I would prefer UTF-8 only in cases where bandwidth or storage space is an important factor... For VARCHAR, SQL Server actually uses a fixed-width 1-byte encoding and no UTF-8: extended ASCII in combination with a collation.
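The difference in lookup cost can be sketched in Python. This is only an illustration of the idea, not how SQL Server is implemented; both function names are hypothetical, and the UTF-16 version assumes every character fits in a single 2-byte unit (no surrogate pairs):

```python
def nth_char_utf16(data, n):
    """1-based n. Direct slice; valid only while every character is one 2-byte unit."""
    return data[2 * (n - 1): 2 * n].decode("utf-16-le")

def nth_char_utf8(data, n):
    """1-based n. Has to walk the bytes, skipping continuation bytes (10xxxxxx)."""
    seen = 0
    start = None
    for i, b in enumerate(data):
        if (b & 0xC0) != 0x80:        # a lead byte starts a new character
            seen += 1
            if seen == n:
                start = i
            elif seen == n + 1:
                return data[start:i].decode("utf-8")
    if start is None:
        raise IndexError(n)
    return data[start:].decode("utf-8")

text = "é" * 99 + "A"
print(nth_char_utf16(text.encode("utf-16-le"), 100))
print(nth_char_utf8(text.encode("utf-8"), 100))
```

The UTF-16 lookup is a constant-time slice, while the UTF-8 lookup has to inspect every byte before the target character.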
I've had no problems inserting utf-8 encoded streams with utf-8 header into SQL Server 2013 NVARCHAR column
And - this is even more important in terms of XML - XML is not stored as the text you see, rather as a hierarchy tree. You can store literally everything in (N)VARCHAR:
DECLARE @s VARCHAR(MAX)='Don''t store me, I''m UTF-16. Your machine will explode!';
This works with any combination. You can declare NVARCHAR and/or put an N in front of the literal. No problem due to implicit conversions.
But internally, VARCHAR cannot deal with characters outside its code page! Try this:
DECLARE @s NVARCHAR(MAX)=N'слов в тексте';
SELECT @s
This will work with NVARCHAR and N'Your string' only!
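What happens without the N can be imitated in Python. This is a sketch that assumes the VARCHAR code page is Windows-1252 (it depends on the collation in a real database):

```python
text = "слов в тексте"

# Assumption: the database's VARCHAR code page is Windows-1252.
# Characters missing from the code page are replaced with '?'.
as_varchar = text.encode("cp1252", errors="replace")

# NVARCHAR (UTF-16 LE) can represent every character.
as_nvarchar = text.encode("utf-16-le")

print(as_varchar)
print(as_nvarchar.decode("utf-16-le"))
```

Every Cyrillic letter is lost in the single-byte version and only the spaces survive, which matches the question marks mentioned in the question.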
##XML-storage
As said before, XML is not stored as the text you see, but as a tree. Everything is optimized for performance, hence the fixed-width UTF-16. The XML declaration is omitted in any case...
The problem occurs, when you pass in a string which is physically encoded as utf-8 but declared as something else (or vice versa). You can pass in a real UTF-16 with a declared encoding of utf-16 (same with utf-8) without problems.
##Conclusion
If you have the slightest chance to include 3 or 4 byte UTF-8 codes you should stick to UTF-16.
A two-step conversion works: first convert your UTF-8 XML to text or varchar(MAX), and then to xml.
convert(xml, convert(text, '<your UTF-8 xml>'))
The "Type Casting String and Binary Instances" section of the MSDN document "Create Instances of XML Data" explains how incoming XML data is interpreted. Essentially,
if the SQL Server receives the XML data as nvarchar then it "assumes a two-byte unicode encoding such as UTF-16 or UCS-2",
if the SQL Server receives the XML data as varchar then by default it will use the (single-byte character set) code page defined for the SQL Server instance,
if the SQL Server receives the XML data as varbinary then it "is treated as a codepoint stream that is passed directly to the XML parser", and "an instance without BOM and without a declaration encoding will be interpreted as UTF-8".
If your marshalling code is spitting out a Java String to be sent to the SQL Server then it is very likely being sent as nvarchar since a Java String is always a Unicode string. That would explain why the SQL Server assumes UTF-16 encoding.
If you really need to send the XML data to the SQL Server with UTF-8 encoding (though I can't imagine why) then your marshalling code probably needs to produce a stream of (UTF-8 encoded) bytes that will be sent to the SQL Server as varbinary.
I am trying to insert the following string into an sql xml field
<?xml version="1.0" encoding="UTF-8"?>
<Response>
<Ip>x.x.x.x</Ip>
<CountryCode>CA</CountryCode>
<CountryName>Canada</CountryName>
<RegionCode>QC</RegionCode>
<RegionName>Québec</RegionName>
<City>Dorval</City>
<ZipCode>h9p1j3</ZipCode>
<Latitude>45.45000076293945</Latitude>
<Longitude>-73.75</Longitude>
<MetroCode></MetroCode>
<AreaCode></AreaCode>
</Response>
The insert code looks like:
INSERT
INTO Traffic(... , xmlGeoLocation, ...)
VALUES (
...
<!---
<cfqueryparam CFSQLType="cf_sql_varchar" value="#xmlGeoLocation#">,
--->
'#xmlGeoLocation#',
...
)
Two bad things happen:
Québec gets turned into QuÃ©bec
I get an error saying [Macromedia][SQLServer JDBC Driver][SQLServer]XML parsing: line 8, character 16, illegal xml character
UPDATE:
The incoming test stream is mostly single byte characters.
The é is a two byte character. In particular C3A9
Also I don't have control over the incoming xml stream
I'm going to strip the header...
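The garbling of the é can be reproduced in a few lines of Python. This sketch assumes the UTF-8 bytes are being read as Windows-1252 somewhere along the way, which is a guess about the setup rather than something confirmed here:

```python
original = "Québec"

utf8_bytes = original.encode("utf-8")   # é becomes the two bytes C3 A9
misread = utf8_bytes.decode("cp1252")   # each byte is read as its own character

print(utf8_bytes)
print(misread)
```

Two characters appearing where one was expected is a typical sign that UTF-8 bytes were interpreted with a single-byte code page.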
I'm having the same issue with a funny little apostrophe thing. I think the issue is that by the time the string is getting converted to XML, it's not UTF-8 anymore, but sql server is trying to use the header to decode it. If it's VARCHAR, it's in the client's encoding. If it's NVARCHAR, it's UTF-16. Here are some variations I tested:
SQL (varchar, UTF-8):
SELECT CONVERT(XML,'<?xml version="1.0" encoding="UTF-8"?><t>We’re sorry</t>')
Error:
XML parsing: line 1, character 44, illegal xml character
SQL (nvarchar, UTF-8):
SELECT CONVERT(XML,N'<?xml version="1.0" encoding="UTF-8"?><t>We’re sorry</t>')
Error:
XML parsing: line 1, character 38, unable to switch the encoding
SQL (varchar, UTF-16)
SELECT CONVERT(XML,'<?xml version="1.0" encoding="UTF-16"?><t>We’re sorry</t>')
Error:
XML parsing: line 1, character 39, unable to switch the encoding
SQL (nvarchar, UTF-16)
SELECT CONVERT(XML,N'<?xml version="1.0" encoding="UTF-16"?><t>We’re sorry</t>')
Worked!
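The first error position can be checked with a Python sketch. It assumes the VARCHAR literal is held in Windows-1252, which is an assumption about the client's code page:

```python
doc = '<?xml version="1.0" encoding="UTF-8"?><t>We’re sorry</t>'

# 1-based position of the apostrophe, for comparison with the error message
position = doc.index("’") + 1

# Assumption: the VARCHAR literal is held in Windows-1252.
varchar_bytes = doc.encode("cp1252")

# Those bytes are not valid UTF-8, although the declaration says they are.
try:
    varchar_bytes.decode("utf-8")
    decodes_as_utf8 = True
except UnicodeDecodeError:
    decodes_as_utf8 = False

print(position, hex(varchar_bytes[position - 1]), decodes_as_utf8)
```

The apostrophe sits at character 44, the position reported in the "illegal xml character" error, and under this assumption it is stored as the single byte 0x92, which is not a valid start of a UTF-8 sequence.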
Have a look at this link from w3, it tells me that:
In HTML, there is a list of some built-in character names like &eacute; for é but XML does not have this. In XML, there are only five built-in character entities: &lt;, &gt;, &amp;, &quot; and &apos; for <, >, &, " and ' respectively. You can define your own entities in a Document Type Definition, or you can use any Unicode character (see next item).
In HTML, there are also numeric character references, such as &#38; for &. You can refer to any Unicode character, but the number is decimal, whereas in the Unicode tables the number is usually in hexadecimal. XML also allows hexadecimal references: &#x26; for example.
This leads me to believe that &#xE9; might work for an é character.
Also the information at this link from Microsoft states that:
SQLXML 4.0 relies upon the limited support for DTDs provided in SQL Server. SQL Server allows for an internal DTD in xml data type data, which can be used to supply default values and to replace entity references with their expanded contents. SQLXML passes the XML data "as is" (including the internal DTD) to the server. You can convert DTDs to XML Schema (XSD) documents using third-party tools, and load the data with inline XSD schemas into the database.
But all this does not help you if you don't have control over the incoming XML stream. I doubt that it is possible to save an é (or any special character for that matter, except for the built in character entities mentioned above) inside an XML document into an SQL Server XML field, without either adding a DTD or replacing the character with its hexadecimal reference counterpart. In both cases you would need to be able to modify the XML before it goes into the database.
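If the XML can be touched before it reaches the database, the replacement with numeric character references could look roughly like this in Python. The function name is hypothetical, and the sketch only rewrites non-ASCII characters; it does not escape markup:

```python
def to_ascii_xml(text):
    """Replace every non-ASCII character with a decimal character reference."""
    return text.encode("ascii", errors="xmlcharrefreplace").decode("ascii")

print(to_ascii_xml("<RegionName>Québec</RegionName>"))
```

The result is pure ASCII, so it reads the same whether it is treated as UTF-8 or as a single-byte code page. Python emits the decimal form (&#233;), which is equivalent to the hexadecimal &#xE9;.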
Just a quick example for anyone wanting to go down the "adding a DTD" route.
Here's how to add an internal DTD to an xml file which declares an entity for an é character:
<?xml version="1.0" encoding="UTF-8"?>
<!DOCTYPE root [<!ENTITY eacute "&#233;">]>
<root>
<RegionName>Qu&eacute;bec</RegionName>
</root>
If you go here and search on the page "Ctrl+F" for "eacute", you end up in a list with examples for other characters which you could just copy and paste into your own internal DTD.
Edit
You could of course add all entities as they are specified at the link above: <!ENTITY eacute "&#233;"><!ENTITY .. // Next entity>, or just copy them all from this file. I do understand that adding an internal DTD to every single XML file you add to the database isn't such a good idea. I would be interested to know if adding it for 1 file fixes your issue though.
Try to change this:
<RegionName>Québec</RegionName>
to:
<RegionName><![CDATA[Québec
]]></RegionName>