We have a process that writes XML (using SQL Server's FOR XML). When it was executed via SQLCMD in a batch file, the output was UTF-8 encoded (specifically, high-ASCII characters became 2 bytes each).
When I do the same thing through an Execute SQL Task in SSIS, it's not UTF-8 encoded.
Here's a simple example. The ® should become 2 bytes:
SELECT 'Diversity® Certified' as fldAgentLastName
FOR XML PATH('agent'), ELEMENTS, TYPE, ROOT('agents')
The output is: Diversity® Certified
it SHOULD be: Diversity® Certified (the ® encoded as the two UTF-8 bytes 0xC2 0xAE, which display as ® when the file is viewed as ANSI),
and that is what it was when using SQLCMD. I understand that internally XML is stored as UCS-2(?), but I need a way to get the output as UTF-8 encoded data (not just 8-bit).
I also cannot use the BCP trick I've seen mentioned.
I don't want to use the CDATA tag because that would entail recreating a giant ugly query.
Everything I've found on the web fails to encode the high-ASCII characters.
This is running on SQL Server 2008 R2.
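For reference, a minimal sketch (reusing the example query above, with made-up column aliases) showing that the server itself hands the XML back as nvarchar, i.e. UTF-16 bytes; any UTF-8 encoding therefore has to happen in whatever client writes the file (SQLCMD, SSIS, etc.):

SELECT CAST(t.x AS nvarchar(max))                         AS XmlText,
       CAST(CAST(t.x AS nvarchar(max)) AS varbinary(max)) AS RawUtf16Bytes -- the ® appears as the UTF-16 bytes 0xAE 0x00
FROM (SELECT (SELECT 'Diversity® Certified' AS fldAgentLastName
              FOR XML PATH('agent'), ELEMENTS, TYPE, ROOT('agents')) AS x) AS t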
I guess all I had to do was ask, and then I'd find my own answer: the problem wasn't with SQL, it was downstream in SSIS. I changed the code page both for a Data Conversion step and for the final text file to 65001, based on this answer in another thread:
...The workaround is simple albeit counterintuitive - add a Data Conversion Transformation step between the OLE DB Source and the Flat-File Destination that converts your input "Dat" column from DT_NTEXT to DT_TEXT with a codepage of 65001. Then you feed the newly transformed column directly to the output column in your flat-file dest. ... Regards, Jacob
http://www.sqlservercentral.com/Forums/Topic719421-149-1.aspx
Related
I have a requirement to export a database to a tab-delimited file in ASCII format. I am using Derived Columns to convert any Unicode strings to non-Unicode strings. For example, a former Unicode text stream is now cast like this:
(DT_TEXT,20127)incomingMessage
But SSIS is still looking for ANSI. I am still seeing an error at the Flat File Destination:
The code page on input column <column_name> is 1252 and is required to be 20127.
This happens for any column in the table, not just Unicode ones.
This is what I have been doing to ensure ASCII is used:
In the Flat File Connection Manager, used Code page "20127 (US-ASCII)"
Used a Derived Column to cast data types
In the OLE DB source, set the default code page to 20127
Any thoughts?
How about using the Data Conversion task? Connect the Flat File task to the Data Conversion and then change the metadata on the fly to suit your needs. You should be able to delete the Derived Column task if you change the metadata to handle the Unicode issues in the Data Conversion task. Then you can process the records accordingly into the OLE DB Source without issues.
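For what it's worth, the 1252 in that error comes from each source column's collation rather than from anything set on the connection manager: a typical SQL_Latin1_General_CP1_* collation maps to code page 1252, not 20127. A quick hedged check (the table name is a placeholder) shows the code page SSIS will infer per column:

SELECT c.name, c.collation_name,
       COLLATIONPROPERTY(c.collation_name, 'CodePage') AS code_page
FROM sys.columns AS c
WHERE c.object_id = OBJECT_ID('dbo.YourTable')
  AND c.collation_name IS NOT NULL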
I recently created an Oracle DB with the JA16SJIS character set.
I then tried to insert some data, including Japanese characters, using SQL*Plus running an external SQL file. The file is encoded in Shift-JIS (and I can see the Japanese characters properly in the file using Notepad++).
The insert succeeded, but when I select the data (using SQL*Plus) the Japanese characters are not displayed properly (they come out as alphabet characters mixed with question marks).
Even when I use SQL Developer to view the data, the Japanese characters are still unreadable.
I'm using Windows 7 Professional SP1 and Oracle Database 11g R2, with the system locale set to Japanese.
First, try inserting some text directly from SQL Developer's data view. That should work no matter what, so you can use it as a baseline to check your imports.
Then, before you connect with SQL*Plus, you must specify what encoding you're going to send by setting or changing the value of the NLS_LANG environment variable:
NLS_LANG=ENGLISH_FRANCE.JA16SJIS
The exact syntax for setting it will depend on your OS. The only important part is the last piece, JA16SJIS, which means Shift-JIS, as you already know.
You can then connect with SQL*Plus and import your file.
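On Windows (the asker's environment) that would look something like the following before launching SQL*Plus; a rough sketch, with a made-up connect string and script name:

set NLS_LANG=ENGLISH_FRANCE.JA16SJIS
sqlplus user/password@orcl @load_japanese_data.sql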
Note that the encoding you specify must match the encoding of your file, but not necessarily the encoding of the database, as Oracle will convert between them if necessary. So your database could be in UTF-8 and it would still work (because UTF-8 can hold Japanese characters).
In cases like this, the first thing I do is look at what byte values are actually stored in the database. You can use the dump function for that:
select dump(<column>) from <table>
If you know what byte values your characters should have you can check if the correct values are in your table.
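A hedged example of what to look for (the table and column names are made up): in JA16SJIS the hiragana あ is stored as the byte pair 130,160 (0x82 0xA0), while 63 is a plain '?', meaning the data was already mangled before it was stored.

select name, dump(name) from customers where id = 1;
-- e.g. Typ=1 Len=2: 130,160   correct Shift-JIS bytes for あ
-- e.g. Typ=1 Len=2: 63,63     question marks; the conversion went wrong on insert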
I'm reading this page describing the bcp utility. It states:
This section contains the following examples that show how to use bcp
commands to create a non-XML format file:
A. Creating a non-XML format file for native data
B. Creating a non-XML format file for character data
C. Creating a non-XML format file for Unicode native data
D. Creating a non-XML format file for Unicode character data
The examples use the HumanResources.Department table in the
AdventureWorks2012 sample database. The HumanResources.Department
table contains four columns: DepartmentID, Name, GroupName, and
ModifiedDate.
I'm not clear on what these types mean. When should each one be used?
Thanks.
There are two dimensions:
Native vs. character: native format creates a binary file; character format creates a text file. Use character when you want the result to be human-readable and usable by other apps (e.g. importing into Excel). Use native if both the source and destination are SQL Server and human readability is not needed.
Unicode vs. non-Unicode: Unicode stores strings in wide format (Unicode encoding); non-Unicode stores them in the encoding of the specified code page (the -C argument). If space is not a concern, use Unicode, unless you enjoy pain.
You have to realize that you're looking at a product with 20+ years of history behind it; there is path dependence. Nowadays I always use native Unicode (-n -w) unless I have a specific reason not to.
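Putting the two dimensions together, the four examples on that page map onto bcp switches roughly like this (a sketch, assuming a trusted connection with -T and the AdventureWorks2012 table the docs use; the .fmt file names are arbitrary):

bcp AdventureWorks2012.HumanResources.Department format nul -T -n -f Department-native.fmt      (A: native)
bcp AdventureWorks2012.HumanResources.Department format nul -T -c -f Department-char.fmt        (B: character)
bcp AdventureWorks2012.HumanResources.Department format nul -T -N -f Department-uninative.fmt   (C: Unicode native)
bcp AdventureWorks2012.HumanResources.Department format nul -T -w -f Department-unichar.fmt     (D: Unicode character)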
Currently, I am in the process of updating all of our Delphi 2007 code base to Delphi XE2. The biggest consideration is the ANSI to Unicode conversion, which we've dealt with by re-defining all base types (char/string) to ANSI types (ansichar/ansistring). This has worked in many of our programs, until I started working with the database.
The problem started when I converted a program that stores information read from a file into an SQL Server 2008 database. Suddenly simple queries that used a string to locate data would fail, such as:
SELECT id FROM table WHERE name = 'something'
The name field is a varchar. I found that I was able to complete the query successfully by prefixing the string name with an N. I was under the impression that varchar could only store ANSI characters, but it appears to be storing Unicode?
Some more information: the name field in Delphi is string[13], but I've tried dropping the [13]. The database collation is SQL_Latin1_General_CP1_CI_AS. We use ADO to interface with the database. The connection information is stored in the ODBC Administrator.
NOTE: I've solved my actual problem thanks to a bit of direction from Panagiotis. The name we read from our map file is an array[1..24] of AnsiChar. This value was being implicitly converted to string[13], which included trailing null characters. So a 5-character name was really being stored as the 5 characters plus 8 null characters in the database.
varchar fields do NOT store Unicode characters. They store non-Unicode characters in the code page specified by the field's collation. SQL Server will try to convert characters to that code page when you store Unicode data or data from a different code page. You can disable this behavior, but the best option is to avoid the whole mess by using nvarchar fields and UnicodeString in your application.
You mention that you changed all character types to ANSI types, not Unicode types, in your application. If you want to use Unicode you should be using a Unicode type like UnicodeString; otherwise your values will be converted to ANSI when they are sent to your server. That conversion is done by your own code when it creates the AnsiString that is sent to the server.
BTW, your select statement compares the field against an ANSI value. You have to prefix the value with N if you want it treated as a Unicode value, e.g.:
SELECT id FROM table WHERE name = N'something'
Even this will not guarantee that your data will reach the server in a Unicode form. If you store the statement in an AnsiString the entire statement is converted to ANSI before it is sent to the server. If your app makes a wrong conversion, you will end up with mangled data on the server.
The solution is very simple: use parameterized statements to pass Unicode values as Unicode parameters and store them in nvarchar fields. It is much faster, avoids conversion errors, and prevents SQL injection attacks.
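For the T-SQL side of that, a minimal sketch of the same lookup as a parameterized statement (the table name is the question's placeholder, bracketed so it parses; in practice ADO binds the parameter rather than it being written inline):

EXEC sp_executesql
     N'SELECT id FROM [table] WHERE name = @name',
     N'@name nvarchar(50)',
     @name = N'something'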
I need to save an XML document in UTF-8 encoding and then use T-SQL code to extract data from it.
Default database collation is SQL_Latin1_General_CP1_CI_AS.
I don't know if it is possible to save and work with UTF-8 data in SQL Server 2008, but my idea is to use a collation with the UTF-8 code page (65001) on the XML column in order to save the data in UTF-8.
Does anybody know if it is possible or have another idea on how to work with UTF-8 data in SQL Server?
If you're dealing with XML data, store it as the xml data type. That should take care of any concerns you have (i.e. how to store it), and you'll save yourself the work of having to convert it to xml whenever you operate on it (e.g. XPath expressions, XQuery, etc.).
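A minimal sketch of that approach (the table and element names are made up, loosely echoing the FOR XML example earlier in the thread):

CREATE TABLE dbo.AgentFeeds (id int IDENTITY(1,1) PRIMARY KEY, doc xml NOT NULL)

INSERT dbo.AgentFeeds (doc)
VALUES (N'<agents><agent><fldAgentLastName>Diversity® Certified</fldAgentLastName></agent></agents>')

-- query the stored document directly with XQuery; no text-to-xml conversion needed
SELECT doc.value('(/agents/agent/fldAgentLastName)[1]', 'nvarchar(100)') AS fldAgentLastName
FROM dbo.AgentFeeds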
You can store all Unicode characters in xml or nvarchar columns. It does not matter what collation you use. A handful of rare Chinese characters (from the supplementary plane) may be stored as pairs of nchars (surrogate pairs). But there is no loss of data.
An NVARCHAR column should do the job just fine.