So I decided for the fun of it to read a text file and store the contents into a NVARCHAR using TSQL and the Microsoft SQL Server Management Studio 2008 R2. I found an example for doing this at https://www.simple-talk.com/sql/t-sql-programming/the-tsql-of-text-files/
So I tried this with my ABC.txt file and its contents are:
ABCDEFGHIJKLMNOPQRSTUVWXYZ
abcdefghijklmnopqrstuvwxyz
When I first tried to store the contents of this file into @myString I used this code:
declare @myString nvarchar(max);
Select @myString = BulkColumn
from OPENROWSET(Bulk 'C:\Users\<myComputer'sNameHere>\Documents\How2\FilesForTestingStuff\ABC.txt', SINGLE_BLOB) as x
print @myString;
I got this as my output when I printed the string:
䉁䑃䙅䡇䩉䱋乍偏剑呓噕塗婙扡摣晥桧橩汫湭灯牱瑳癵硷穹
I changed nvarchar to varchar and I got the correct contents of the file.
Anyone know why this happened? I didn't think there was a conversion difference, other than that nvarchar has more space available than varchar and is able to hold Unicode characters.
Also how do you normally attempt reading from a file and inserting the contents into a nvarchar?
I suppose it depends on the encoding of the input file.
You used SINGLE_BLOB and according to MSDN it causes data to be returned as varbinary(MAX). Your file was probably saved using a non-Unicode encoding, so when the data was imported into an nvarchar variable, SQL Server interpreted each pair of bytes as one UTF-16 character. Changing the type to varchar allowed the characters to be read correctly. Please try to encode the file as UTF-16 and import the data into an nvarchar(MAX) variable.
Update
I tried to recreate the issue you described. I saved a text file with ANSI encoding, ran the import script and got output similar to what you posted in your question. Then I converted the file to UCS-2 Little Endian encoding, and after running the script again I got the correct output.
To sum it up, if you want to import with the SINGLE_BLOB option, just convert the data file to UCS-2 Little Endian encoding and it should work correctly with the nvarchar SQL type.
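The effect is easy to reproduce outside SQL Server. Here is a minimal Python sketch (Python is only used for illustration, and the variable names are made up) that takes the single-byte contents of the first line of ABC.txt and reads them as UTF-16 Little Endian, which is roughly what happens when the varbinary data lands in an nvarchar:

```python
# Sketch: read single-byte (ANSI/ASCII) file contents as if they were UTF-16LE.
alphabet = "ABCDEFGHIJKLMNOPQRSTUVWXYZ"

# What a file saved with an ANSI encoding holds: one byte per letter.
ansi_bytes = alphabet.encode("ascii")

# Interpreting those bytes as UTF-16LE fuses every two letters into one
# 16-bit code unit: 'A' (0x41) and 'B' (0x42) become the code unit 0x4241.
misread = ansi_bytes.decode("utf-16-le")

# Saving the file as UCS-2 / UTF-16 Little Endian stores two bytes per letter,
# so the same interpretation now yields the original text.
utf16_bytes = alphabet.encode("utf-16-le")
correct = utf16_bytes.decode("utf-16-le")

print(len(alphabet), "letters became", len(misread), "characters")
print("first code unit:", hex(ord(misread[0])))
print("round trip ok:", correct == alphabet)
```

So 26 letters collapse into 13 CJK-looking characters, which matches the shape of the output in the question.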
Reference links:
OPENROWSET
nchar and nvarchar
I have a table with an nvarchar column in SQL Server 2016 that I want to store in Hive. The nvarchar column can contain non-ASCII characters. The data is extracted from SQL Server into a file, with the nvarchar column converted to a base64-encoded string. I tried the following to convert the base64 back to a readable string:
select decode(unbase64(BASE64STR),'UTF-8');
It failed with the following error:
org.apache.hive.service.cli.HiveSQLException: Error while compiling
statement: FAILED: SemanticException [Error 10014]: Line 1:7 Wrong
arguments ''UTF-8'': org.apache.hadoop.hive.ql.metadata.HiveException:
java.nio.charset.MalformedInputException: Input length = 1
The following code is able to properly decode the base64-encoded string:
select decode(unbase64(BASE64STR),'UTF-16LE');
Is it safe to use UTF-16LE to decode the string from nvarchar type column? Will this work with any data stored in the column? Is there another way to achieve this ETL functionality from SQL Server to Hive for nvarchar type data?
Take a look at the functions for working with base64 that we have in SDU Tools (free). They go to/from varbinary but might work ok with strings cast to/from that. At the very least, the code should give you a good start. They are at: http://sdutools.sqldownunder.com
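On the UTF-8 versus UTF-16LE part of the question: nvarchar data is UTF-16 Little Endian, so if the export base64-encoded the raw column bytes (an assumption, since the export code isn't shown), decoding as UTF-16LE is the matching choice and UTF-8 will generally choke on non-ASCII text. A small Python sketch with hypothetical helper names that simulates this:

```python
import base64


def decode_nvarchar_base64(b64_text: str) -> str:
    """Decode base64 whose payload is assumed to be raw NVARCHAR (UTF-16LE) bytes."""
    return base64.b64decode(b64_text).decode("utf-16-le")


def fails_as_utf8(b64_text: str) -> bool:
    """Return True when the decoded bytes are not valid UTF-8."""
    try:
        base64.b64decode(b64_text).decode("utf-8")
        return False
    except UnicodeDecodeError:
        return True


# Simulate the export side: UTF-16LE bytes, then base64.
sample = "caf\u00e9 \u2018X\u2019 \u2022"
exported = base64.b64encode(sample.encode("utf-16-le")).decode("ascii")

print("UTF-8 decode fails:", fails_as_utf8(exported))
print("UTF-16LE round trip ok:", decode_nvarchar_base64(exported) == sample)
```

Note that pure ASCII text would not raise an error under UTF-8; it would silently come back with a NUL after every character, so UTF-8 is wrong for this data even when it doesn't fail.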
I have some data which I believe is Unicode, and I'm looking at what happens when I store it in my database column, which is of VARCHAR(MAX) datatype.
And here's the source, from the file which is UTF-8...
looking for that ‘X’ and • 3 large bedrooms with 2 ensuites and • Main bedroom with ensuite & surround with plantation shutters
and using the Visual Studio debugger:
=> so 2x apostrophes and 2x bullets.
I thought SQL Server can only store Unicode if the column is of type NVARCHAR?
I'm assuming my source data is not Unicode and therefore, I totally suck at all this Unicode/UTF-8 stuff :(
I thought SQL Server can only store Unicode if the column is of type NVARCHAR?
That's correct. As far as I can guess from your example, it is not storing Unicode. Probably it is storing bytes encoded in Windows code page 1252, which would be the default encoding for a Western install of SQL Server.
Code page 1252 happens to include mappings for characters ‘, ’ and •, so those characters can be safely stored. But step outside that limited repertoire and you'll start losing characters.
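You can check which characters survive a trip through code page 1252 with a few lines of Python. This is just a sketch (`fits_code_page` is a made-up helper), and it assumes your column's collation really does use code page 1252:

```python
def fits_code_page(text: str, code_page: str = "cp1252") -> bool:
    """Return True when every character in text has a mapping in the code page."""
    try:
        text.encode(code_page)
        return True
    except UnicodeEncodeError:
        return False


# Curly quotes and a bullet: all three have single-byte slots in Windows-1252.
quotes_and_bullet = "\u2018X\u2019 \u2022"
print(quotes_and_bullet.encode("cp1252").hex())

# Cyrillic text has no mapping in Windows-1252, so it cannot be stored as-is.
print(fits_code_page(quotes_and_bullet))
print(fits_code_page("\u0441\u043b\u043e\u0432"))
```

The curly quotes and the bullet map to the single bytes 0x91, 0x92 and 0x95, which is why they look fine in a VARCHAR column.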
The default encoding for an XML type field defined in an SQL Server is UTF-16. I have no trouble inserting into that field with UTF-16 encoded XML streams.
But if I tried to insert into the field with UTF-8 encoded XML stream, the insert attempt would receive the error response
unable to switch encoding.
QUESTION: Is there a way to define a SQL Server column/field as having UTF-8 encoding?
Further info
The insertion operations are performed using Spring JDBCTemplate.
The XML Stream was produced by JAXB Marshaller set to UTF-8 or UTF-16 encoding.
private String marshall(myDAO myTao, JAXBEncoding jaxbEncoding)
    throws JAXBException {
  JAXBContext jc = JAXBContext.newInstance(ObjectFactory.class);
  Marshaller m = jc.createMarshaller();
  m.setProperty(Marshaller.JAXB_FORMATTED_OUTPUT, Boolean.TRUE);
  // Only override the marshaller's default encoding when one was requested.
  if (jaxbEncoding != null)
    m.setProperty(Marshaller.JAXB_ENCODING, jaxbEncoding.toString());
  // A StringWriter collects characters, not encoded bytes.
  StringWriter strw = new StringWriter();
  m.marshal(myTao, strw);
  return strw.toString();
}
Where ...
public enum JAXBEncoding {
UTF8("UTF-8"),
UTF16("UTF-16")
;
private String value;
private JAXBEncoding(String value){
this.value = value;
}
public String toString(){
return this.value;
}
}
Is there a way to define a SQL Server column/field as having UTF-8 encoding?
No, the only Unicode encoding in SQL Server is UTF-16 Little Endian, which is how the NCHAR, NVARCHAR, NTEXT (deprecated as of SQL Server 2005 so don't use this in new development; besides, it sucks compared to NVARCHAR(MAX) anyway), and XML datatypes are handled. You do not get a choice of Unicode encodings like some other RDBMS's allow.
You can insert UTF-8 encoded XML into SQL Server, provided you follow these three rules:
The incoming string has to be of datatype VARCHAR, not NVARCHAR (as NVARCHAR is always UTF-16 Little Endian, hence the error about not being able to switch the encoding).
The XML has an XML declaration that explicitly states that the encoding of the XML is indeed UTF-8: <?xml version="1.0" encoding="UTF-8" ?>.
The byte sequence needs to be the actual UTF-8 bytes.
For example, we can import a UTF-8 encoded XML document containing the screaming face emoji (and we can get the UTF-8 byte sequence for that Supplementary Character by following that link):
SET NOCOUNT ON;
DECLARE @XML XML = '<?xml version="1.0" encoding="utf-8"?><root><test>'
+ CHAR(0xF0) + CHAR(0x9F) + CHAR(0x98) + CHAR(0xB1)
+ '</test></root>';
SELECT @XML;
PRINT CONVERT(NVARCHAR(MAX), @XML);
Returns (in both "Results" and "Messages" tabs):
<root><test>😱</test></root>
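If you want to derive the CHAR() sequence for some other character, a short Python sketch like this one will do it (Python is only used here to compute the bytes; the variable names are illustrative):

```python
# U+1F631 is the screaming face emoji, a Supplementary Character.
emoji = "\U0001F631"

# The four UTF-8 bytes that the CHAR() calls above spell out.
utf8_bytes = emoji.encode("utf-8")

# The same character as SQL Server holds it in NVARCHAR / XML:
# a UTF-16LE surrogate pair (two 16-bit code units, four bytes).
utf16le_bytes = emoji.encode("utf-16-le")

# Build the T-SQL concatenation from the UTF-8 bytes.
char_calls = " + ".join(f"CHAR(0x{b:02X})" for b in utf8_bytes)

print(char_calls)
print(utf16le_bytes.hex())
```

The first line printed is the CHAR(0xF0) + CHAR(0x9F) + CHAR(0x98) + CHAR(0xB1) expression used in the example.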
You mentioned in a comment on @Shnugo's answer:
I've had no problems inserting utf-8 encoded streams with utf-8 header into SQL Server 2013 NVARCHAR column. Would there be a hidden problem?
No, you didn't store UTF-8 encoded anything in an NVARCHAR column (besides, there is no 2013 version of SQL Server, but that is probably just a typo). NVARCHAR is only ever UTF-16 Little Endian. Most likely your UTF-8 stream got converted into UTF-16 LE by the database driver during transit into SQL Server. This is the same encoding that an XML column would use, but the XML column would have tried to convert the stream from UTF-8 into UTF-16 but failed due to it already being UTF-16. This also means that on the way out of SQL Server, the XML document stored in the NVARCHAR column would still have the XML declaration stating that the encoding is UTF-8, but it's definitely not UTF-8.
If you absolutely need the data to be UTF-8 on the way out because you don't want to convert the UTF-16 LE coming out of SQL Server XML or NVARCHAR into UTF-8, then you have no choice but to store the data as VARBINARY(MAX).
As you found out correctly, XML will be stored as unicode (utf-16, well, it's ucs-2 actually). There is no other format.
Within SQL Server there is VARCHAR(MAX) for extended ASCII (1-byte) and NVARCHAR(MAX) for Unicode. Both can be cast to XML directly (as long as the string is valid XML). One must be aware that VARCHAR(MAX) might not be able to deal with special characters... So, if this is an issue, you should stick with Unicode anyway.
The problem occurs, when the encoding declaration is included within <?xml ...?>:
This works:
DECLARE @xml XML =
'<?xml version="1.0" encoding="utf-8"?>
<root>test</root>';
SELECT @xml;
This produces an error:
DECLARE @xml XML =
'<?xml version="1.0" encoding="utf-16"?>
<root>test</root>';
SELECT @xml;
But this works again (see the leading N before the string literal):
DECLARE @xml XML =
N'<?xml version="1.0" encoding="utf-16"?>
<root>test</root>';
SELECT @xml;
##Conclusion
If you pass the string 1-byte encoded but declared as utf-16 (or vice versa), you'll get into trouble. It is best to pass your XML without the <?xml ...?> declaration.
##UPDATE
You are mixing two things
##Encoding
From your comment:
UTF-8 is flexi-length unicode, that varies from 1 byte to 4 bytes in length. Whereas, UTF-16 is fixed length 2 byte unicode. UTF-8 seems the defacto unicode std now...
Yes, it's correct that UTF-8 and UTF-16 are two flavours of Unicode. But it is not correct to call UTF-8 the new de-facto standard. This depends heavily on your needs. If you live in an English-speaking country and deal with plain Latin text, UTF-8 will save some bytes. If you live somewhere in the Far East, UTF-8 will bloat your text considerably, due to the many 3- and 4-byte codes.
And, more important in terms of databases, a fixed width is enormously easier to handle. Just imagine a WHERE SUBSTRING(SomeUTF8Column,100,1)='A'. With UTF-16 the engine can cut out bytes 200 and 201 without looking; with UTF-8 the full string up to character 100 must be analysed to find out where the 100th character actually sits. I would prefer UTF-8 only in cases where bandwidth or storage space is an important factor... SQL Server uses a fixed-width 1-byte encoding and no UTF-8 actually: extended ASCII in combination with a collation.
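To put some numbers on the size argument, here is a rough Python sketch (`byte_sizes` is a made-up helper and the sample strings are arbitrary) comparing the byte counts of the two encodings:

```python
def byte_sizes(text: str) -> dict:
    """Return the character count and the encoded size in UTF-8 and UTF-16LE."""
    return {
        "chars": len(text),
        "utf8": len(text.encode("utf-8")),
        "utf16": len(text.encode("utf-16-le")),
    }


# Plain Latin text: 1 byte per character in UTF-8, 2 in UTF-16.
latin = byte_sizes("plain latin text")

# Japanese text: 3 bytes per character in UTF-8, 2 in UTF-16.
cjk = byte_sizes("\u65e5\u672c\u8a9e\u306e\u30c6\u30ad\u30b9\u30c8")

# A Supplementary Character takes 4 bytes in both encodings
# (UTF-16 needs a surrogate pair for it).
emoji = byte_sizes("\U0001F631")

for label, sizes in (("latin", latin), ("cjk", cjk), ("emoji", emoji)):
    print(label, sizes)
```

The last case is worth keeping in mind: outside the Basic Multilingual Plane, UTF-16 needs two 16-bit code units per character, so the "byte 200 and 201" shortcut only holds as long as the text contains no Supplementary Characters.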
I've had no problems inserting utf-8 encoded streams with utf-8 header into SQL Server 2013 NVARCHAR column
And - this is even more important in terms of XML - XML is not stored as the text you see, rather as a hierarchy tree. You can store literally everything in (N)VARCHAR:
DECLARE @s VARCHAR(MAX)='Don''t store me, I''m UTF-16. Your machine will explode!';
This works with any combination. You can declare NVARCHAR and/or put an N in front of the literal. No problem due to implicit conversions.
But internally VARCHAR cannot deal with higher encodings! Try this:
DECLARE @s NVARCHAR(MAX)=N'слов в тексте';
SELECT @s
This will work with NVARCHAR and N'Your string' only!
##XML-storage
As said before, XML is not stored as the text you see, but as a tree. Everything is optimized for performance, hence the fixed-width UTF-16. The XML declaration is omitted in any case...
The problem occurs, when you pass in a string which is physically encoded as utf-8 but declared as something else (or vice versa). You can pass in a real UTF-16 with a declared encoding of utf-16 (same with utf-8) without problems.
##Conclusion
If you have the slightest chance to include 3 or 4 byte UTF-8 codes you should stick to UTF-16.
A 2-step works; first encode your UTF-8 to text or varchar(MAX) and then to xml.
convert(xml, convert(text, '<your UTF-8 xml>'))
The "Type Casting String and Binary Instances" section of the MSDN document
Create Instances of XML Data
explains how incoming XML data is interpreted. Essentially,
if the SQL Server receives the XML data as nvarchar then it "assumes a two-byte unicode encoding such as UTF-16 or UCS-2",
if the SQL Server receives the XML data as varchar then by default it will use the (single-byte character set) code page defined for the SQL Server instance,
if the SQL Server receives the XML data as varbinary then it "is treated as a codepoint stream that is passed directly to the XML parser", and "an instance without BOM and without a declaration encoding will be interpreted as UTF-8".
If your marshalling code is spitting out a Java String to be sent to the SQL Server then it is very likely being sent as nvarchar since a Java String is always a Unicode string. That would explain why the SQL Server assumes UTF-16 encoding.
If you really need to send the XML data to the SQL Server with UTF-8 encoding (though I can't imagine why) then your marshalling code probably needs to produce a stream of (UTF-8 encoded) bytes that will be sent to the SQL Server as varbinary.
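The key idea in that last point is that the payload must be bytes whose actual encoding matches the XML declaration, not a string. As a language-neutral illustration (Python here rather than Java, and only a sketch using the standard library's ElementTree; the element names are taken from the example XML elsewhere on this page), producing such a byte payload looks like this:

```python
import xml.etree.ElementTree as ET

# Build a tiny document containing a Supplementary Character.
root = ET.Element("root")
ET.SubElement(root, "test").text = "\U0001F631"

# Asking for encoding="utf-8" returns bytes (not str), and
# xml_declaration=True prepends a declaration naming that same encoding,
# so the declared and the physical encoding agree.
payload = ET.tostring(root, encoding="utf-8", xml_declaration=True)

print(type(payload).__name__)
print(payload.split(b"?>")[0] + b"?>")
```

Bytes like these would then be bound as a binary parameter (varbinary on the SQL Server side) rather than as a string; how to do that depends on your driver and is not shown here.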
I have been referring to the following page:
http://msdn.microsoft.com/en-us/library/ms178129.aspx
I simply want to bulk import some data from a file that has Unicode characters. I have tried encoding the actual data file in UCS-2, UTF-8, etc. but nothing works. I have also modified the format file to use SQLNCHAR, but it still doesn't work and gives this error:
Bulk load data conversion error (truncation) for row 1, column 1
I think it has to do with this statement from the above link:
For a format file to work with a Unicode character data file, all the
input fields must be Unicode text strings (that is, either fixed-size
or character-terminated Unicode strings).
What exactly does this mean? I thought this means every character string needs to be a fixed 2 bytes, which encoding the file in UCS-2 should handle???
This blog post was really helpful and solved my problem:
http://blogs.msdn.com/b/joaol/archive/2008/11/27/bulk-insert-using-unicode-data-files.aspx
Something else to note - a Java class was generating the data file. In order for the above solution to work, the data file needed to be encoded in UTF-16LE, which can be set in the constructor of OutputStreamWriter (for example).
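For anyone generating the data file from a script instead of Java, the same idea as a minimal Python sketch. The function name, the tab field separator and the CRLF row terminator are assumptions; they need to match whatever your format file declares, and I have not checked whether a byte order mark is needed in your setup, so none is written here:

```python
import os
import tempfile


def write_utf16le_file(path, rows, field_sep="\t", row_sep="\r\n"):
    """Write rows as UTF-16LE text; the separators are encoded as UTF-16LE too."""
    text = "".join(field_sep.join(row) + row_sep for row in rows)
    # Binary mode avoids any newline translation or platform default encoding.
    with open(path, "wb") as fh:
        fh.write(text.encode("utf-16-le"))
    return text


rows = [["1", "se\u00f1or"], ["2", "ni\u00f1o"]]
path = os.path.join(tempfile.mkdtemp(), "data.txt")
written = write_utf16le_file(path, rows)

with open(path, "rb") as fh:
    raw = fh.read()

print(len(written), "characters ->", len(raw), "bytes")
```

Every character, including the tab and the line terminator, takes two bytes in the file, which is what the "all the input fields must be Unicode text strings" sentence in the documentation is getting at.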
In SQL Server 2012 I imported a .csv file saved with Notepad++ encoded in UCS-2 with special Spanish characters