MS SQL Server, Bulk Insert failing insert file in UTF-16 BE - sql-server

I have a problem with Bulk Insert on MS SQL Server 2012. Input file is saved in UTF-16 BE.
BULK INSERT Positions
FROM 'C:\DEV\Test\seq.filename.csv'
WITH
(
DATAFILETYPE = 'widechar',
FORMATFILE = 'C:\DEV\Test\Format.xml'
);
Format file:
<?xml version="1.0" encoding="utf-16"?>
<BCPFORMAT xmlns="http://schemas.microsoft.com/sqlserver/2004/bulkload/format" xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance">
<RECORD>
<FIELD ID="ActionCode" xsi:type="NCharFixed" LENGTH="4" />
<FIELD ID="T1" xsi:type="NCharFixed" LENGTH="2" />
<FIELD ID="ReofferCounter" xsi:type="NCharFixed" LENGTH="6" />
<FIELD ID="T1" xsi:type="NCharFixed" LENGTH="2" />
... other fields....
</RECORD>
<ROW>
<COLUMN SOURCE="ActionCode" NAME="DT" xsi:type="SQLNCHAR" LENGTH="255" />
<COLUMN SOURCE="ReofferCounter" NAME="NO" xsi:type="SQLNCHAR" LENGTH="255" />
</ROW>
</BCPFORMAT>
Input file sample:
02|+00|... other cols....
02|+00|... other cols....
I have two problems:
1) If the input file is encoded as UTF-16 BE, I get only Chinese characters instead of numbers.
2) If I convert the input file to the UTF-16 LE, I see correct characters, but the character data are shifted one character to the left - as if BOM was parsed (and counted as 1 character), but not transformed to the output (which I do not desire).
Questions:
1) Is there a way to import a file in UTF-16 BE without converting it to LE?
2) What causes the shift and how to avoid it?
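Both symptoms can be reproduced outside SQL Server by looking at the raw bytes. The sketch below is Python rather than T-SQL, and it only models the effect. It assumes that the LENGTH values in the format file count bytes (which fits the sample, where "02" is 2 characters and LENGTH="4") and that a leading BOM is treated as data. The function names be_to_le and cut_fixed are made up for this illustration.

```python
import codecs

# Byte widths of ActionCode, T1, ReofferCounter, T1 as given in the format file.
FIELD_BYTES = [4, 2, 6, 2]


def be_to_le(data: bytes) -> bytes:
    """Re-encode UTF-16 BE bytes as UTF-16 LE and drop any leading BOM."""
    if data.startswith(codecs.BOM_UTF16_BE):
        data = data[len(codecs.BOM_UTF16_BE):]
    return data.decode("utf-16-be").encode("utf-16-le")


def cut_fixed(data: bytes, widths=FIELD_BYTES) -> list:
    """Cut one record into fixed-width fields, decoding each as UTF-16 LE."""
    fields, pos = [], 0
    for width in widths:
        fields.append(data[pos:pos + width].decode("utf-16-le"))
        pos += width
    return fields


record = "02|+00|"

# Symptom 1: big-endian bytes read as little-endian swap each byte pair,
# so ASCII characters land in unrelated (often CJK) code points.
misread = record.encode("utf-16-be").decode("utf-16-le")
print([hex(ord(ch)) for ch in misread])

# Symptom 2: a BOM that is counted as data shifts every field by one character.
with_bom = codecs.BOM_UTF16_LE + record.encode("utf-16-le")
print(cut_fixed(with_bom))

# Converting to LE and dropping the BOM gives the expected fields.
print(cut_fixed(be_to_le(codecs.BOM_UTF16_BE + record.encode("utf-16-be"))))
```

If this model matches what the server does, writing the file as UTF-16 LE without a BOM before the import would avoid both problems. That is an assumption to verify against your own data, not a documented guarantee.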

Related

How to load UTF-8 CSV files using Bulk Insert and an XML Format file in SQL Server 2017

After much trying, I have found that since SQL Server 2017 (2016?), loading UTF-8 encoded CSV files through Bulk Insert has become possible by using the options CODEPAGE = 65001 and DATAFILETYPE = 'Char', as explained in some other questions.
What doesn't seem to work is doing the same when using an XML format file. I have tried this while still using the CODEPAGE and DATAFILETYPE options, and also with these options omitted. And I have tried this with the simplest possible dataset: one row, one column, containing some text with a UTF-8 character.
This is the XML Formatfile I am using.
<?xml version="1.0"?>
<BCPFORMAT xmlns="http://schemas.microsoft.com/sqlserver/2004/bulkload/format" xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance">
<RECORD>
<FIELD ID="STREET" xsi:type="NCharTerm" TERMINATOR="\r\n" MAX_LENGTH="1000" COLLATION="Latin1_General_CS_AS_WS"/>
</RECORD>
<ROW>
<COLUMN SOURCE="STREET" NAME="STREET" xsi:type="SQLNVARCHAR"/>
</ROW>
</BCPFORMAT>
Even though the source data only contains some text with 1 special character, the end result looks like this: 慊潫ⵢ瑓晥慦⵮瑓慲鿃⁳㐱
When using xsi:type="CharTerm" instead of xsi:type="NCharTerm" the result looks like this: ...-Straßs ...
Am I doing something wrong, or has UTF-8 support not been properly implemented for XML format files?
After playing around with this, I have found the solution.
Notes
This works with or without BOM header. It is irrelevant.
The culprit was using the COLLATION parameter in the XML file. Omitting it solved the encoding problem. I have an intuitive sense as to why this is the case, but maybe someone with more insight could explain in the comments...
The DATAFILETYPE = 'char' option doesn't seem necessary.
In the XML file, the xsi:type for the field needs to be CharTerm, not NCharTerm.
This works with \r\n, \n, or \r. As long as you set the TERMINATOR correctly, this works. No \n\0 variations required (this would even break functionality, since this is not UTF-16 or UCS-2).
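The garbage produced with NCharTerm is consistent with UTF-8 bytes being read two at a time as UTF-16 code units. A small Python sketch shows the effect. It is only an illustration and not SQL Server's actual code path; the sample string and the function name read_utf8_as_utf16 are made up.

```python
def read_utf8_as_utf16(text: str) -> str:
    """Encode text as UTF-8, then (wrongly) decode the bytes as UTF-16 LE."""
    raw = text.encode("utf-8")
    if len(raw) % 2:
        # UTF-16 needs whole 2-byte units; pad with a space for the demo.
        raw += b" "
    return raw.decode("utf-16-le", errors="replace")


sample = "Straße 1"

# In UTF-8 the character ß takes two bytes (c3 9f); everything else takes one.
print(sample.encode("utf-8").hex(" "))

# Each pair of bytes becomes one unrelated UTF-16 code unit.
print([hex(ord(ch)) for ch in read_utf8_as_utf16(sample)])
```

For instance, the two bytes of "Ja" (4a 61) turn into the single code point U+614A, which is a CJK character. With CharTerm the bytes are kept as single-byte data and CODEPAGE = 65001 can decode them as UTF-8.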
Below you can find a proof-of-concept for easy reuse...
data.txt
ß
ß
ß
Table
CREATE TABLE [dbo].[TEST](
TEST [nvarchar](500) NULL
)
formatfile.xml
<?xml version="1.0"?>
<BCPFORMAT xmlns="http://schemas.microsoft.com/sqlserver/2004/bulkload/format" xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance">
<RECORD>
<FIELD ID="1" xsi:type="CharTerm" TERMINATOR="\r\n" MAX_LENGTH="20"/>
</RECORD>
<ROW>
<COLUMN SOURCE="1" NAME="TEST" xsi:type="SQLNVARCHAR"/>
</ROW>
</BCPFORMAT>
Bulk insert
bulk insert TEST..TEST
from 'data.txt'
with (formatfile = 'formatfile.xml', CODEPAGE = 65001)
Change your terminator to TERMINATOR="\r\0\n\0". You have to account for the extra bytes when using NCharTerm.
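The extra bytes mentioned here are easy to check. In UTF-16 LE every ASCII character is followed by a zero byte, so a CR LF terminator occupies four bytes instead of two. A quick Python check, for illustration only:

```python
crlf = "\r\n"

# Terminator bytes in a UTF-8 (or ANSI) file.
utf8_bytes = crlf.encode("utf-8")

# Terminator bytes in a UTF-16 LE file: each character carries a trailing zero byte.
utf16_bytes = crlf.encode("utf-16-le")

print(utf8_bytes)    # b'\r\n'
print(utf16_bytes)   # b'\r\x00\n\x00'
```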

"BULK LOAD DATA CONVERSION ERROR for csv file

I am trying to import a .csv file, but I am getting "BULK LOAD DATA CONVERSION ERROR" for the last column. The file looks like this:
"123456","123","001","0.00"
I have tried the following row terminator:
ROW TERMINATOR = "\"\r\n"
Nothing is working. Any ideas on what is causing this record to have this error? Thanks
As shown in the example below, remove the quotes from your CSV and use "\r\n" as the terminator.
Always use an XML format file when doing a bulk insert. It provides several advantages, such as validation of the data file.
The format file maps the fields of the data file to the columns of the table. You can use a non-XML or XML format file to bulk import data when using a bcp command, a BULK INSERT statement, or an INSERT ... SELECT * FROM OPENROWSET(BULK...) statement.
Given your input file, suppose you have the following table:
CREATE TABLE myTestFormatFiles (
Col1 smallint,
Col2 nvarchar(50),
Col3 nvarchar(50),
Col4 nvarchar(50)
);
Your sample Data File will be as follows :
10,Field2,Field3,Field4
15,Field2,Field3,Field4
46,Field2,Field3,Field4
58,Field2,Field3,Field4
Sample format XML file will be :
<?xml version="1.0"?>
<BCPFORMAT xmlns="http://schemas.microsoft.com/sqlserver/2004/bulkload/format"
xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance">
<RECORD>
<FIELD ID="1" xsi:type="CharTerm" TERMINATOR="," MAX_LENGTH="7"/>
<FIELD ID="2" xsi:type="CharTerm" TERMINATOR="," MAX_LENGTH="100" COLLATION="SQL_Latin1_General_CP1_CI_AS"/>
<FIELD ID="3" xsi:type="CharTerm" TERMINATOR="," MAX_LENGTH="100" COLLATION="SQL_Latin1_General_CP1_CI_AS"/>
<FIELD ID="4" xsi:type="CharTerm" TERMINATOR="\r\n" MAX_LENGTH="100" COLLATION="SQL_Latin1_General_CP1_CI_AS"/>
</RECORD>
<ROW>
<COLUMN SOURCE="1" NAME="Col1" xsi:type="SQLSMALLINT"/>
<COLUMN SOURCE="2" NAME="Col2" xsi:type="SQLNVARCHAR"/>
<COLUMN SOURCE="3" NAME="Col3" xsi:type="SQLNVARCHAR"/>
<COLUMN SOURCE="4" NAME="Col4" xsi:type="SQLNVARCHAR"/>
</ROW>
</BCPFORMAT>
If you are unfamiliar with format files, check XML Format Files (SQL Server).
Example is illustrated here
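Removing the quotes, as suggested at the start of this answer, can be done in a preprocessing step. Here is a minimal Python sketch, assuming the whole file fits in memory and that no value contains the delimiter or a line break. The function name unquote_csv is made up for this example.

```python
import csv
import io


def unquote_csv(text: str, delimiter: str = ",") -> str:
    """Parse quoted CSV text and re-emit it without quotes, rows ending in CRLF.

    Raises ValueError if a value contains the delimiter or a line break,
    because such a value cannot be written safely without quoting.
    """
    out_lines = []
    for row in csv.reader(io.StringIO(text, newline="")):
        if not row:
            continue  # skip blank lines
        for value in row:
            if delimiter in value or "\r" in value or "\n" in value:
                raise ValueError(f"value needs quoting: {value!r}")
        out_lines.append(delimiter.join(row) + "\r\n")
    return "".join(out_lines)


print(repr(unquote_csv('"123456","123","001","0.00"\r\n')))
```

The result can then be loaded with "," as the field terminator and "\r\n" as the row terminator, as in the format file above.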

Use xml format file to edit csv before bulk insert into ms sql

SQL :Bulk insert
bulk insert TESTING
from 'D:\Testing.csv'
with
( FIRSTROW=2,
DATAFILETYPE='char',
FIELDTERMINATOR=',',
ROWTERMINATOR = '\n',
FORMATFILE = 'D:\Testing.xml');
XML : Format file
<?xml version="1.0"?>
<BCPFORMAT xmlns="http://schemas.microsoft.com/sqlserver/2004/bulkload/format"
xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance">
<RECORD>
<FIELD ID="Address1" xsi:type="CharTerm" TERMINATOR='","' />
<FIELD ID="Address2" xsi:type="CharTerm" TERMINATOR='","' />
<FIELD ID="Address3" xsi:type="CharTerm" TERMINATOR='","' />
<FIELD ID="Address4" xsi:type="CharTerm" TERMINATOR='\n' />
</RECORD>
<ROW>
<COLUMN SOURCE="Address1" NAME="COLUMN1" xsi:type="SQLVARYCHAR" />
<COLUMN SOURCE="Address2" NAME="COLUMN2" xsi:type="SQLVARYCHAR" />
<COLUMN SOURCE="Address3" NAME="COLUMN3" xsi:type="SQLVARYCHAR" />
<COLUMN SOURCE="Address4" NAME="COLUMN4" xsi:type="SQLVARYCHAR" />
</ROW>
</BCPFORMAT>
The CSV file that I used contains addresses. I created a SQL table before the bulk insert; it has four columns for the address.
Testing.csv
"Address1","Address2","Address3","Address4"
"Lot 180, Street 19, "," Oakland Park, "," Kuala Lumpur, "," Selangor"
I want the output to look like the table above. When I try to use the XML format file in the bulk insert, I receive the following error message:
Bulk load: An unexpected end of file was encountered in the data file.
Cannot obtain the required interface ("IID_IColumnsInfo") from OLE DB provider "BULK"
for linked server "(null)".
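To see what the terminators in this format file do to the sample row, the splitting can be modelled in a few lines of Python. This is only a model of "consume each terminator in order". It is not SQL Server's implementation, it does not by itself explain the error message, and the second terminator list is a hypothetical variant rather than a tested fix.

```python
def split_by_terminators(record: str, terminators: list) -> list:
    """Split one record by consuming each field terminator in order."""
    fields, pos = [], 0
    for term in terminators:
        end = record.find(term, pos)
        if end == -1:
            raise ValueError(f"terminator {term!r} not found after offset {pos}")
        fields.append(record[pos:end])
        pos = end + len(term)
    return fields


line = '"Lot 180, Street 19, "," Oakland Park, "," Kuala Lumpur, "," Selangor"\n'

# Terminators as written in the format file: the outer quotes stay in the data.
as_written = split_by_terminators(line, ['","', '","', '","', '\n'])
print(as_written)

# Hypothetical variant: a leading dummy field ended by '"' and a last
# terminator of '"\n', so that no quote remains in the four address fields.
variant = split_by_terminators(line, ['"', '","', '","', '","', '"\n'])
print(variant)
```

In this model the first field keeps its opening quote and the last field keeps its closing quote. If the file actually uses \r\n line endings, a '\n' terminator would also leave a \r at the end of the last field.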

How to handle import of file with UTF-8 encoding, codepage = 65001, into SQL server

In Norway we have three highly annoying characters, æøå, that create all sorts of problems. As of SQL Server 2008, Microsoft decided not to support code page 65001. I have found a manageable way to import a UTF-8 file into SQL Server with OPENROWSET(BULK) and keep the æøå characters.
I created a PowerShell script that uses StreamReader and StreamWriter to convert the file from UTF-8 to the default encoding, ANSI.
$filename = "C:\Test\UTF8_file.txt"
$outfile = "C:\Test\ANSI_file.txt"
$reader = new-object System.IO.StreamReader($filename, [System.Text.Encoding]::GetEncoding(65001))
$stream = new-object System.IO.StreamWriter($outfile, $false, [System.Text.Encoding]::Default)
I strip the file of the first line, the header row, in the same process.
$i=1
while(($line = $reader.ReadLine()) -ne $null) {
if($i -gt 1) {
$stream.WriteLine($line)
}
$i++
}
$reader.Close()
$stream.Close()
Then I am able to use OPENROWSET to import the ANSI file into SQL Server, manipulating the data while doing so. I use code page 1252, which matches the Danish_Norwegian collation.
insert into SomeDatabase.dbo.SomeTable
SELECT [companynumber]
, case [role] when 'Styreformann' then 'Styreleder' when 'Styrets leder' then 'Styreleder' else [role] end as 'role'
, case [representant] when 'Y' then '1' else '0' end as 'representant'
, left((RIGHT('0000'+ CONVERT(VARCHAR,postnr),5)),4) as 'postnr'
, income*1000 as income
, null as person2id
FROM OPENROWSET( BULK 'C:\Test\ANSI_file.txt',
FORMATFILE = 'C:\Test\FormatBulkInsert_file.xml'
, CODEPAGE =1252
, ROWS_PER_BATCH = 50000
) as v
This method ensured that the Norwegian characters were displayed correctly. The format file looks like this:
<?xml version="1.0"?>
<BCPFORMAT xmlns="http://schemas.microsoft.com/sqlserver/2004/bulkload/format" xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance">
<RECORD>
<FIELD ID="1" xsi:type="CharTerm" TERMINATOR=';"' />
<FIELD ID="2" xsi:type="CharTerm" TERMINATOR='";"' />
<FIELD ID="3" xsi:type="CharTerm" TERMINATOR='";"' />
<FIELD ID="4" xsi:type="CharTerm" TERMINATOR='";' />
<FIELD ID="5" xsi:type="CharTerm" TERMINATOR=';' />
<FIELD ID="6" xsi:type="CharTerm" TERMINATOR='\n' />
</RECORD>
<ROW>
<COLUMN SOURCE="1" NAME="companynumber" xsi:type="SQLINT"/>
<COLUMN SOURCE="2" NAME="role" xsi:type="SQLNVARCHAR"/>
<COLUMN SOURCE="3" NAME="representant" xsi:type="SQLBIT"/>
<COLUMN SOURCE="4" NAME="postnr" xsi:type="SQLNVARCHAR"/>
<COLUMN SOURCE="5" NAME="income" xsi:type="SQLDECIMAL"/>
<COLUMN SOURCE="6" NAME="person2id" xsi:type="SQLINT"/>
</ROW>
</BCPFORMAT>
Hope this is helpful to someone else, because I spent quite a lot of time googling before I found a way to solve this issue.
Convert to UTF-16 instead. That is SQL Server's native NCHAR format, and it allows full representation of Unicode values.
To make this work you will have to specify SQLNCHAR or SQLNVARCHAR in your format file, and also be aware of this caveat:
For a format file to work with a Unicode character data file, all the input fields must be Unicode text strings (that is, either fixed-size or character-terminated Unicode strings).
http://msdn.microsoft.com/en-us/library/ms178129.aspx
An alternative is to load it as binary data and use the CONVERT function to convert it from VARBINARY to NVARCHAR (which is UTF-16) and then to the desired codepage as VARCHAR.
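The UTF-16 conversion suggested in this answer could be sketched as follows in Python, as a counterpart to the PowerShell script in the question. The function name utf8_to_utf16le, its parameters, and the sample data are made up for the illustration. Whether the target import expects a BOM is an assumption to check, so the BOM is optional here.

```python
import codecs


def utf8_to_utf16le(data: bytes, skip_header: bool = True, add_bom: bool = False) -> bytes:
    """Convert UTF-8 bytes to UTF-16 LE, optionally dropping the first line."""
    # "utf-8-sig" also strips a UTF-8 BOM if one is present.
    text = data.decode("utf-8-sig")
    if skip_header:
        # Drop everything up to and including the first line feed.
        _, _, text = text.partition("\n")
    out = text.encode("utf-16-le")
    return (codecs.BOM_UTF16_LE + out) if add_bom else out


# Made-up sample: a header row followed by one data row.
source = "name;role\nÅse;Styreleder\n".encode("utf-8")
converted = utf8_to_utf16le(source)
print(converted.decode("utf-16-le"))
```

For a real file, read the input with open(path, "rb") and write the result with open(path, "wb"), so that Python does not apply any newline or encoding translation of its own.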

Bulk insert using format file

My table named 'dictionary' has two columns named 'column1' and 'column2'. Both accept NULL values, and both are of type INT. Now I want to insert into only column2 from a text file using bcp. I made a format file, which looks like this:
<?xml version="1.0"?>
<BCPFORMAT xmlns="http://schemas.microsoft.com/sqlserver/2004/bulkload/format" xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance">
<RECORD>
<FIELD ID="1" xsi:type="CharTerm" TERMINATOR="," MAX_LENGTH="7"/>
<FIELD ID="2" xsi:type="CharTerm" TERMINATOR="\r\n" MAX_LENGTH="24"/>
</RECORD>
<ROW>
<COLUMN SOURCE="1" NAME="column2" xsi:type="SQLINT"/>
</ROW>
</BCPFORMAT>
and my BULK INSERT statement looks like this:
BULK INSERT dictionary
FROM 'C:\Users\jka\Desktop\n.txt'
WITH
(
FIELDTERMINATOR = '\n',
ROWTERMINATOR = '\n',
FORMATFILE = 'path to my format file.xml'
)
But it didn't work. How can I solve this?
N.B.:
My text file looks like this:
123
456
4101
......
One more question (edit):
I can fill one column with this technique, but how can I then fill another column from a text file, again starting from the first row?
Assuming that your format file is correct, I believe you need to drop FIELDTERMINATOR and ROWTERMINATOR from your BULK INSERT:
BULK INSERT dictionary
FROM 'C:\Users\jka\Desktop\n.txt'
WITH (FORMATFILE = 'path to my format file.xml')
Also make sure that:
The input file's encoding is correct. In your case it should most likely be ANSI, not UTF-8 or Unicode.
The row terminator (which is the second field's terminator in your format file) is actually \r\n and not \n.
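Both points can be checked on the raw bytes of the file before running the import. A minimal Python sketch, assuming a single-byte or UTF-8 file (in UTF-16 the line-ending bytes look different, so the counts would not apply). The function name inspect_bytes is made up.

```python
import codecs

# Byte order marks to look for at the start of the file.
BOMS = [
    (codecs.BOM_UTF8, "utf-8 with BOM"),
    (codecs.BOM_UTF16_LE, "utf-16 LE"),
    (codecs.BOM_UTF16_BE, "utf-16 BE"),
]


def inspect_bytes(data: bytes) -> dict:
    """Report a leading BOM (if any) and count CRLF versus bare LF line endings."""
    bom = next((name for mark, name in BOMS if data.startswith(mark)), "none")
    crlf = data.count(b"\r\n")
    bare_lf = data.count(b"\n") - crlf
    return {"bom": bom, "crlf": crlf, "bare_lf": bare_lf}


print(inspect_bytes(b"123\r\n456\r\n4101\r\n"))
print(inspect_bytes(b"123\n456\n4101\n"))
```

For a real file, pass the result of open(path, "rb").read(). A non-zero bare_lf count means the format file's "\r\n" terminator will not match those lines.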
UPDATE Since you need to skip first column:
With an XML format file, there is no way to skip a column when you are importing directly into a table by using a BULK INSERT statement. In order to achieve desired result and still use XML format file you need to use OPENROWSET(BULK...) and provide explicit list of columns in the select list and in the target table.
So to insert data only to column2 use:
INSERT INTO dictionary(column2)
SELECT column2
FROM OPENROWSET(BULK 'C:\temp\infile1.txt',
FORMATFILE='C:\temp\bulkfmt.xml') as t1;
If your data file has only one field your format file can look like this
<?xml version="1.0"?>
<BCPFORMAT xmlns="http://schemas.microsoft.com/sqlserver/2004/bulkload/format" xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance">
<RECORD>
<FIELD ID="C1" xsi:type="CharTerm" TERMINATOR="\r\n" MAX_LENGTH="24"/>
</RECORD>
<ROW>
<COLUMN SOURCE="C1" NAME="column2" xsi:type="SQLINT"/>
</ROW>
</BCPFORMAT>
Your data file contains one field, so your format file should reflect that
<RECORD>
<FIELD ID="1" xsi:type="CharTerm" TERMINATOR="\r\n"/>
</RECORD>
