SQL Server BULK INSERT - Escaping reserved characters

There's very little documentation available about escaping characters in SQL Server BULK INSERT files.
The documentation for BULK INSERT says the statement only has two formatting options, FIELDTERMINATOR and ROWTERMINATOR; however, it doesn't say how you're meant to escape those characters if they appear in a field's value.
For example, if I have this table:
CREATE TABLE People ( name varchar(MAX), notes varchar(MAX) )
and this single row of data:
"Foo, \Bar", "he has a\r\nvery strange name\r\nlol"
...what would its corresponding bulk insert file look like? Because this wouldn't work, for obvious reasons:
Foo,\Bar,he has a
very strange name
lol
SQL Server says it supports \r and \n, but doesn't say whether backslashes escape themselves, nor does it mention field value delimiting (e.g. with double-quotes, or escaping double-quotes), so I'm a little perplexed in this area.

I worked around this issue by using \0 as a row separator and \t as a field separator, since neither character appeared in any field value and both are supported as separators by BULK INSERT.
I am surprised MSSQL doesn't offer more flexibility when it comes to import/export. It wouldn't take too much effort to build a first-class CSV/TSV parser.

For the next person to search:
I used "\0\t" as a field separator, and "\0\n" for the end-of-line separator on the last field. Use of "\0\r\n" would also be acceptable if you wish to pretend that the files have DOS EOL conventions.
For those unfamiliar with the \x notation, \0 is CHAR(0), \t is CHAR(9), \n is CHAR(10) and \r is CHAR(13). Replace the CHAR() function with whatever your language offers to convert a number to a nominated character.
With this combination, all instances of \t and \n (and \r) become acceptable characters in the data file. After all, the weakness of the bulk upload system is that tabs and newlines are often legitimate characters in text strings, whereas other low-ASCII characters like CHAR(0), CHAR(1) and CHAR(2) are not legal text - not even appearing in UTF-8.
The only character you cannot have in your data is \0 - UNLESS you can guarantee it will never be followed by \t or \n (or \r).
If your language has problems handling \0 in strings (though depending on how you code, you may still be able to avoid that problem), AND you know that your data won't have CHAR(1) or CHAR(2) in it (i.e. no binary), then use those characters instead. Those low characters are only going to be found when you are trying to store arbitrary binary data in strings.
Note also that you will find bytes 0, 1, 2 in UTF-16, UCS-2 and UTF-32 (aka UCS-4) - BUT - the 2- or 4-byte-wide representation of CHAR(0), CHAR(1) or CHAR(2) is still acceptable and distinct from any legal Unicode text. Just make sure you select the correct codepage setting in the format file to suit your choice of a UTF or UCS variant.
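For the record, a minimal sketch of the corresponding BULK INSERT statement (the file path is an assumption, and DATAFILETYPE should be set to match how the file was actually written):

BULK INSERT People
FROM 'C:\data\people.dat'
WITH (
    FIELDTERMINATOR = '\0\t',   -- CHAR(0) followed by CHAR(9)
    ROWTERMINATOR   = '\0\n',   -- CHAR(0) followed by CHAR(10); use '\0\r\n' for DOS-style files
    DATAFILETYPE    = 'char'    -- or 'widechar' if the file is UTF-16
);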

A bulk insert needs to have corresponding fields and a consistent field count for each row. Your example is a little rough, as it's not structured data. As for the characters, it will interpret them literally, not as escape sequences (your string will be stored exactly as it appears in the file).
As for the double quotes enclosing each field, you will just have to use them as part of the field and row terminators. So now you should have:
Fieldterminator = '","',
Rowterminator = '"\n'
Does that make sense? Then after the bulk insert you'll need to strip the leading double quote from the first column with something like:
Update yourtable
set yourfirstcolumn = right(yourfirstcolumn, len(yourfirstcolumn) - 1)
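Putting that together, a sketch of the whole approach against the People table from the original question (the file path is an assumption) might look like:

BULK INSERT People
FROM 'C:\data\people.csv'
WITH (
    FIELDTERMINATOR = '","',   -- closing quote, comma, opening quote between fields
    ROWTERMINATOR   = '"\n'    -- closing quote plus newline at the end of each row
);

UPDATE People
SET name = RIGHT(name, LEN(name) - 1)   -- strip the leading double quote left on the first column
WHERE LEFT(name, 1) = '"';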

Related

Values shifting to end of line because of encoding

I'm trying to create a .CSV file from a SQL statement, in SSIS. The file includes names with special characters like Latin and German letters.
I'm using Unicode, but there is one word in Arabic that keeps skipping to the end of the line instead of staying in the place it belongs.
I tried replacing special characters with REPLACE, CHAR(10), CHAR(13) etc., but it didn't help.
I've also tried using UTF-8 encoding, but I still need to mark Unicode because of the other Latin letters.
First of all, you should use UTF-8 encoding. Then make sure you are storing the data in an nvarchar data type.

How is Unicode (UTF-16) data that is out of collation stored in varchar column?

This is a purely theoretical question to wrap my head around.
Let's say I have the Unicode cyclone symbol (🌀, U+1F300). If I try to store it in a varchar column that has the default Latin1_General_CI_AS collation, the cyclone symbol cannot fit into the single byte that is used per symbol in varchar...
The ways I can see this done:
Like JavaScript does for symbols outside the Basic Multilingual Plane (BMP), where it stores them as 2 symbols (surrogate pairs), and then additional processing is needed to put them back together...
Just truncate the symbol, store first byte and drop the second.... (data is toast - you should have read the manual....)
Data is destroyed and nothing of use is saved... (data is toast - you should have read the manual....)
Some other option that is outside of my mental capacity.....
I have done some research after inserting a couple of different Unicode symbols
INSERT INTO [Table] (Field1)
VALUES ('👽')
INSERT INTO [Table] (Field1)
VALUES ('🌀')
and then reading them back as bytes with
SELECT CAST(field1 AS varbinary(10))
In both cases I got 0x3F3F.
3F in ASCII is ? (question mark), i.e. two question marks (??), which is also what I see when doing a normal SELECT *. Does that mean the data is toast and not even the 1st byte is being stored?
How is Unicode data that is out of collation stored in varchar column?
The data is toast and is exactly what you see, 2 x 0x3F bytes. This happens during the type conversion prior to the insert and is effectively the same as CAST('👽' AS varbinary(2)), which is also 0x3F3F (as opposed to casting N'👽').
When Unicode data must be inserted into non-Unicode columns, the columns are internally converted from Unicode by using the WideCharToMultiByte API and the code page associated with the collation. If a character cannot be represented on the given code page, the character is replaced by a question mark (?). Ref.
Yes the data has gone.
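You can see the difference directly by casting both literal forms to varbinary; the byte values in the comments assume a collation whose code page cannot represent the character, such as Latin1_General_CI_AS:

SELECT CAST('👽' AS varbinary(8));    -- 0x3F3F: both bytes are already '?' after the code page conversion
SELECT CAST(N'👽' AS varbinary(8));   -- 0x3DD87DDC: the actual UTF-16LE surrogate pair for U+1F47D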
Varchar requires less space than NVarchar, but that reduction comes at a cost: there is no room in a Varchar to store Unicode characters outside the collation's code page (at 1 byte per character, the internal lookup just isn't big enough).
From Microsoft's Developer Network:
...consider using the Unicode nchar or nvarchar data types to minimize character conversion issues.
As you've spotted, unsupported characters are replaced with question marks.
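A small sketch showing the two types side by side (the temp table is just for illustration, and a non-UTF-8 collation such as Latin1_General_CI_AS is assumed):

CREATE TABLE #Demo (v varchar(10), n nvarchar(10));
INSERT INTO #Demo VALUES (N'🌀', N'🌀');
SELECT v, n FROM #Demo;   -- v comes back as '??', n comes back as the cyclone symbol
DROP TABLE #Demo;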

French Characters

I need to do an update with French characters in MS SQL Server; the problem is that I don't know where I can find a conversion list. For example, I identified that this character Ã‰ means É.
Where can I find the complete list of corresponding symbols for each French character?
Thank you.
I don't think "É" is encoded as "Ã‰"; that would be a very strange way to encode the character, using other special characters.
It seems rather the opposite:
I've noticed "É" is sometimes shown as "Ã‰". It's a character encoding error, due to the character encoding not being appropriate for the language.
If you are trying to recover the right French characters from the wrongly displayed ones, I don't know if such a list exists. You would first need to know exactly what kind of encoding error was made in order to find the list of wrong displays.
So my suggested solution is: find the proper character encoding to solve this issue. Changing the encoding will normally fix the display.
For French characters, the appropriate encodings are ISO-8859-1 and UTF-8. Once you have the right encoding, you can find a conversion list translating your source bytes into displayed characters.
For instance, this list: http://www.fileformat.info/info/charset/ISO-8859-1/list.htm
I've seen something similar when an nvarchar -> varchar conversion has gone wrong.
Especially when concatenating and doing this:
e.FirstName + ' ' + e.LastName
Instead of doing:
e.FirstName + N' ' + e.LastName
Doing a cast(nvarchar_field AS varchar(50)) - or the opposite, as required.
Bring everything to NVARCHAR, as it supports Unicode (which covers everything UTF-8 can encode).
Perhaps the actual table needs to be modified if the fields are varchar instead of nvarchar.

Migrating from sql server to Oracle varchar length issues

I'm facing a strange issue trying to move from SQL Server to Oracle.
In one of my tables I have a column defined as NVARCHAR(255).
After reading a bit I understood that SQL Server counts characters while Oracle counts bytes.
So I defined my column in Oracle as VARCHAR(510), since 255*2 = 510.
But when using sqlldr to load the data from a tab-delimited text file I get an error indicating some entries exceeded the length of this column.
After checking in SQL Server using:
SELECT MAX(DATALENGTH(column))
FROM table
I get that the max data length is 510.
I do use the Hebrew_CI_AS collation, even though I don't think it changes anything...
I also checked in SQL Server whether any of the entries contain a TAB, but none do... so I guess it's not corrupted data...
Anyone have an idea?
EDIT
After further checking, I've noticed that the issue is due to the data file (in addition to the issue solved by Justin Cave's post).
I have changed the row delimiter to '^', since none of my data contains this character, and '|^|' as the column delimiter,
creating a control file as follows:
load data
infile data.txt "str '^'"
badfile "data_BAD.txt"
discardfile "data_DSC.txt"
into table table
FIELDS TERMINATED BY '|^|' TRAILING NULLCOLS
(
col1,
col2,
col3,
col4,
col5,
col6
)
The problem is that my data contains <CR>, and sqlldr, expecting a stream file, therefore fails on the <CR>! I do not want to change the data, since it is textual data (error messages, for example).
What is your database character set?
SELECT parameter, value
FROM v$nls_parameters
WHERE parameter LIKE '%CHARACTERSET'
Assuming that your database character set is AL32UTF8, each character could require up to 4 bytes of storage (though almost every useful character can be represented with at most 3 bytes of storage). So you could declare your column as VARCHAR2(1020) to ensure that you have enough space.
You could also simply use character length semantics. If you declare your column VARCHAR2(255 CHAR), you'll allocate space for 255 characters regardless of the amount of space that requires. If you change the NLS_LENGTH_SEMANTICS initialization parameter from the default BYTE to CHAR, you'll change the default so that VARCHAR2(255) is interpreted as VARCHAR2(255 CHAR) rather than VARCHAR2(255 BYTE). Note that the 4000-byte limit on a VARCHAR2 remains even if you are using character length semantics.
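As a minimal sketch (table and column names are placeholders, not from the question), the options described above look like this:

CREATE TABLE messages_byte (msg VARCHAR2(1020 BYTE));   -- byte semantics: room for 255 characters at up to 4 bytes each
CREATE TABLE messages_char (msg VARCHAR2(255 CHAR));    -- character semantics: 255 characters regardless of byte width

ALTER SESSION SET NLS_LENGTH_SEMANTICS = CHAR;          -- make CHAR the default for this session
CREATE TABLE messages_dflt (msg VARCHAR2(255));         -- now interpreted as VARCHAR2(255 CHAR)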
If your data contains line breaks, do you need the TRAILING NULLCOLS parameter? That implies that sometimes columns may be omitted from the end of a logical row. If you combine columns that may be omitted with columns that contain line breaks and data that is not enclosed by at least an optional enclosure character, it's not obvious to me how you would begin to identify where a logical row ended and where it began. If you don't actually need the TRAILING NULLCOLS parameter, you should be able to use the CONTINUEIF parameter to combine multiple physical rows into a single logical row. If you can change the data file format, I'd strongly suggest adding an optional enclosure character.
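If you can add an enclosure character to the export, the control file from the question could be adjusted along these lines (an untested sketch that keeps the same delimiters and placeholder names):

load data
infile data.txt "str '^'"
badfile "data_BAD.txt"
discardfile "data_DSC.txt"
into table table
FIELDS TERMINATED BY '|^|' OPTIONALLY ENCLOSED BY '"' TRAILING NULLCOLS
(col1, col2, col3, col4, col5, col6)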
The bytes used by an NVARCHAR field are equal to two times the number of characters plus two (see http://msdn.microsoft.com/en-us/library/ms186939.aspx), so if you make your VARCHAR field 512 you may be OK. There's also some indication that some character sets use 4 bytes per character, but I've found no indication that Hebrew is one of those character sets.

Sql command - like with % operator

In my table I have a row where the value in the name column ('imie') is 'name123'.
The first SQL command returns this row, but the second command returns nothing. Why?
select * from Osoby where imie like '%123%'
select * from Osoby where imie like '%123'
In line with what others are suggesting, try this --
select * from Osoby where RTRIM(LTRIM(imie)) like '%123'
and verify that you are getting the row
Perhaps the field has trailing spaces.
If the imie field is a char field, it will pad whatever you put in it with spaces to reach the length of the field. If you change this to a varchar field, you can get rid of the trailing spaces.
If you change your field to varchar, then run UPDATE Osoby SET imie = RTRIM(imie) to trim off the extra spaces.
In essence, the query you posted should work; it sounds like something's wrong with your data.
Check your datatypes and have a look at:
http://msdn.microsoft.com/en-us/library/ms179859.aspx
Pattern Matching by Using LIKE
LIKE supports ASCII pattern matching and Unicode pattern matching. When all arguments (match_expression, pattern, and escape_character, if present) are ASCII character data types, ASCII pattern matching is performed. If any one of the arguments are of Unicode data type, all arguments are converted to Unicode and Unicode pattern matching is performed. When you use Unicode data (nchar or nvarchar data types) with LIKE, trailing blanks are significant; however, for non-Unicode data, trailing blanks are not significant. Unicode LIKE is compatible with the ISO standard. ASCII LIKE is compatible with earlier versions of SQL Server.
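As a quick illustration with Unicode strings (where, per the passage above, trailing blanks are significant), a single trailing space is enough to break a suffix match:

SELECT CASE WHEN N'name123 ' LIKE N'%123' THEN 'match' ELSE 'no match' END;          -- no match: the value ends with a space
SELECT CASE WHEN RTRIM(N'name123 ') LIKE N'%123' THEN 'match' ELSE 'no match' END;   -- match once the trailing space is trimmed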
To prevent problems with spaces, try this:
select * from Osoby where ltrim(rtrim(imie)) like '%123'

Resources