Bulk Insert Includes Line Terminator - sql-server

I am bulk importing data from a pipe-separated CSV file into SQL Server. The data is formatted like
A|B|CCCCCC\r\n
I have validated both that the file is in UTF-8 format and that lines are terminated with "\r\n" by viewing the CSV file in a hex editor.
The command is
BULK INSERT MyTable FROM 'C:\Path\File.csv'
WITH (FIRSTROW=1, MAXERRORS=0, BATCHSIZE=10000, FIELDTERMINATOR = '|',
ROWTERMINATOR = '\r\n')
The third column was originally defined as CHAR(6), since this field is always a code exactly 6 (ASCII) characters wide. That resulted in a truncation error during the bulk insert.
I then widened the column to CHAR(8). The import worked, but
SELECT CAST(Col3 As VARBINARY(MAX))
indicates that the column data ends with 0x0D0A (or "\r\n", the row terminator)
Why is the row terminator being included in the imported data and how can I fix that?

Long story short, SQL Server doesn't support UTF-8 and you just need \n as the row terminator.
It's actually a bit unclear what's going on because you didn't provide the table definition or the precise error messages. Having said all that, I could load the following data:
create table dbo.BCPTest (
col1 nchar(1) not null,
col2 nchar(1) not null,
col3 nchar(6) not null
)
/* This data can be saved as ASCII, UTF-16 with BOM or UTF-8 without BOM
(see comments below)
A|B|CCCCCC
D|E|FFFFFF
*/
BULK INSERT dbo.BCPTest FROM 'c:\testfile.csv'
WITH (FIELDTERMINATOR = '|', ROWTERMINATOR = '\n')
Comments:
When I created and saved a file in Notepad as "UTF-8", it added the BOM bytes 0xEFBBBF, which is the standard UTF-8 BOM
But SQL Server doesn't support UTF-8; it supports UTF-16 (official docs here) and expects a BOM of 0xFFFE
So I saved the file again in Notepad as "Unicode", and it added the 0xFFFE BOM; this loaded fine as shown above. Out of curiosity I also saved it (using Notepad++) as "UTF-8 without BOM" and I could load that file too
Saving the file as ASCII also loads fine with the same table data types and BULK INSERT command
The row terminator should be \n, not \r\n, because \n is interpreted as a "newline": SQL Server (and/or Windows) is being 'clever' and treats \n semantically rather than literally. This is most likely a legacy of C's handling of \r and \n, which doesn't require them to be interpreted literally.
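If you do need the terminator taken literally (for example, a file whose lines really end with a lone LF), the hexadecimal notation bypasses this translation. A minimal sketch, reusing the table and file from above and assuming bare LF line endings:
-- '\n' is translated by BULK INSERT into CRLF; '0x0a' means exactly one LF byte
BULK INSERT dbo.BCPTest FROM 'c:\testfile.csv'
WITH (FIELDTERMINATOR = '|', ROWTERMINATOR = '0x0a')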

Related


What is the difference between using 0x0a and \n for ROWTERMINATOR when importing data from an external source?
I tried to query data from a JSON file into rows and columns and I got different outcomes. Here is an image of the JSON file:
[image of the JSON file]
Here is the code I used:
SELECT TOP 10 *
FROM OPENROWSET(
    BULK 'taxi/raw/payment_type_array.json',
    DATA_SOURCE = 'nyc_taxidata',
    FORMAT = 'CSV',
    PARSER_VERSION = '1.0',
    FIELDTERMINATOR = '0x0b',
    FIELDQUOTE = '0x0b',
    ROWTERMINATOR = '\n'
)
WITH (
    jsonDoc NVARCHAR(MAX)
) AS payment_type
CROSS APPLY OPENJSON(jsonDoc)
WITH (
    payment_type SMALLINT,
    payment_type_desc NVARCHAR(MAX) AS JSON
);
Here is the outcome:
[image of the query output]
When I used '0x0a' as the ROWTERMINATOR I got the following:
[image of the differing query output]
From the documentation
Specifying \n as a Row Terminator for Bulk Import
When you specify \n as a row terminator for bulk import, or implicitly use the default row terminator, bcp and the BULK INSERT statement expect a carriage return-line feed combination (CRLF) as the row terminator. If your source file uses a line feed character only (LF) as the row terminator - as is typical in files generated on Unix and Linux computers - use hexadecimal notation to specify the LF row terminator.
So if you are getting different results for \n and 0x0a then you have some lines that are just using a line feed (LF) character, and some which are using carriage return (CR) followed by a line feed (LF) character (CRLF). This is a problem with the source data, in my opinion, as using inconsistent new line characters can (and does) cause problems.
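If you cannot fix the source file, one possible workaround (a sketch only, with illustrative table and column names) is to terminate rows on the bare LF byte, which ends both LF and CRLF lines, and then strip any leftover CR from the last column:
BULK INSERT dbo.MyStaging FROM 'C:\data\mixed_endings.csv'
WITH (FIELDTERMINATOR = '|', ROWTERMINATOR = '0x0a')

-- rows that ended in CRLF leave a trailing CR (0x0D) in the final column
UPDATE dbo.MyStaging
SET LastCol = REPLACE(LastCol, CHAR(13), '')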

How to strip characters out of Temp Table after Bulk Insert

I am trying to remove some very annoying inline characters from my .csv file. I need to strip out ! CR LF because these are junking up my import. I have a proc to try to get rid of the crap but it doesn't seem to work. Here's the code:
CREATE TABLE #Cleanup
(
SimpleData nvarchar(MAX)
)
BULK INSERT #Cleanup from '**********************\myimport.csv'
SELECT * FROM #Cleanup
DECLARE @ReplVar nvarchar(MAX)
SET @ReplVar = CONCAT(char(33),char(10),char(13),char(32))
UPDATE #Cleanup SET SimpleData = replace([SimpleData], @ReplVar,'') from #Cleanup
SELECT * FROM #Cleanup
My plan is if the goofy line break gets removed, the second select should not have it in there. The text looks like
js5t,1599,This is this and that is t!
hat,asdf,15426
when that line should read
js5t,1599,This is this and that is that,asdf,15426
See my quandary? Once the sequential characters !crlfsp are removed, I will take that temp table and feed it into the working one.
Edit to show varbinary data:
0x31003700360039002C004300560045002D0032003000310035002D0030003000380035002C0028004D005300310035002D00300032003200290020004D006900630072006F0073006F006600740020004F006600660069006300650020004D0065006D006F00720079002000480061006E0064006C0069006E0067002000520065006D006F0074006500200043006F0064006500200045007800650063007500740069006F006E0020002800330030003300380039003900390029002C004D006900630072006F0073006F00660074002C00570069006E0064006F007700730020004F0053002C0053006F006600740077006100720065002000550070006700720061006400650020006F0072002000500061007400630068002C002C0042006100730065006C0069006E0065002C002C00310035002D003100320035002C0039002E0033002C0048006900670068002C0022005500730065002D00610066007400650072002D0066007200650065002000760075006C006E00650072006100620069006C00690074007900200069006E0020004D006900630072006F0073006F006600740020004F00660066006900630065002000320030003000370020005300500033002C00200045007800630065006C002000320030003000370020005300500033002C00200050006F0077006500720050006F0069006E0074002000320030003000370020005300500033002C00200057006F00720064002000320030003000370020005300500033002C0020004F00660066006900630065002000320030003100300020005300500032002C00200045007800630065006C002000320030003100300020005300500032002C00200050006F0077006500720050006F0069006E0074002000320030003100300020005300500032002C00200057006F00720064002000320030003100300020005300500032002C0020004F006600660069006300650020003200300031003300200047006F006C006400200061006E00640020005300500031002C00200057006F007200640020003200300031003300200047006F006C006400200061006E00640020005300500031002C0020004F006600660069006300650020003200300031003300200052005400200047006F006C006400200061006E00640020005300500031002C00200057006F007200640020003200300031003300200052005400200047006F006C006400200061006E00640020005300500031002C00200045007800630065006C0020005600690065007700650072002C0020004F0066006600690063006500200043006F006D007000610074006900620069006C0069007400790020005000610063006B0020005300500033002C00200057006F007200640020004100750074006F006D006100740069006F006E0020005300650072007600690063006500730020006F006E0020005300680061007200650050006F0069006E00740020005300650072007600650072002000320030003100300020005300500032002C00200045007800630065006C0020005300650072007600690063006500730020006F006E0020005300680061007200650050006F0069006E007400200053006500720076006500720020003200300031003300200047006F006C006400200061006E00640020005300500031002C00200057006F007200640020004100750074006F006D006100740069006F006E0020005300650072007600690063006500730020006F006E0020005300680061007200650050006F0069006E007400200053006500720076006500720020003200300031003300200047006F006C006400200061006E00640020005300500031002C00200057006500620020004100700070006C00690063006100740069006F006E0073002000320030003100300020005300500032002C0020004F006600660069006300650020005700650062002000410070007000730020005300650072007600650072002000320030003100300020005300500032002C00200057006500620020004100700070007300200053006500720076006500720020003200300031003300200047006F006C006400200061006E00640020005300500031002C0020005300680061007200650050006F0069006E00740020005300650072007600650072002000320030003000370020005300500033002C002000570069006E0064006F007700730020005300680061007200650050006F0069006E007400200053006500720076006900630065007300200033002E00300020005300500033002C0020005300680061007200650050006F0069006E007400200046006F0075006E0064006100740069006F006E002000320030003100300020005300500032002C0020005300680061007200650050006F0069006E007400200053006500720076006500720020003200300031003100300020005300500032002C0020005300680061007200650050006F0069006E007400200046006F0075006E0064006100740069006F006E0020003200300031003300200047006F006C006400200061006E00640020005300500031002C00200061006E00640020005300680061007200650050006F0069006E007400200053006500720076006500720020003200300031003300200047006F006C006400200061006E0064002000530050003100200061006C006C006F00770073002000720065006D006F002100
0x2000740065002000610074007400610063006B00650072007300200074006F00200065007800650063007500740065002000610072006200690074007200610072007900200063006F00640065002000760069006100200061002000630072006100660074006500640020004F0066006600690063006500200064006F00630075006D0065006E0074002C00200061006B0061002000220022004D006900630072006F0073006F006600740020004F0066006600690063006500200043006F006D0070006F006E0065006E0074002000550073006500200041006600740065007200200046007200650065002000560075006C006E00650072006100620069006C006900740079002E00220022002200
#seagulledge, in a comment on the question, is correct, or at least partially correct, in stating that the CHAR(10) and CHAR(13) are out of order. A carriage-return (CR) is CHAR(13) and a line-feed (LF) is CHAR(10).
HOWEVER, the main thing preventing this from working is not the order of those two characters: it is the simple fact that the newlines -- whether they are \r\n or just \n -- are in the incoming CSV file, and hence the BULK INSERT command is assuming that the newlines are separating input rows (which makes sense for it to do). This can be seen by looking at the VARBINARY output in the question: there are two rows of output, both starting with 0x.
This problem can only be solved by fixing the incoming CSV file prior to calling BULK INSERT. That way the erroneously embedded newlines will be removed such that each row imports as a single row into the temp table.
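For completeness, here is the question's replacement with the two characters swapped into the correct order ('!' + CR + LF + space); per the above, it only helps once each logical row actually arrives in SimpleData as a single row:
DECLARE @ReplVar nvarchar(MAX)
-- '!' (33), carriage return (13), line feed (10), space (32), in that order
SET @ReplVar = CONCAT(CHAR(33), CHAR(13), CHAR(10), CHAR(32))
UPDATE #Cleanup SET SimpleData = REPLACE(SimpleData, @ReplVar, '')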

How can I read a CSV file with UTF-8 code page in SQL bulk insert?

I have a Persian CSV file and I need to read that with SQL bulk into the SQL server:
I wrote this bulk:
BULK INSERT TEMP
FROM 'D:\t1.csv'
WITH(
FIELDTERMINATOR = ',',
ROWTERMINATOR = '\n',
CODEPAGE = '1256'
);
but it cannot read the UTF-8 encoding and reads the ی character as a ? character.
How can I write this correctly?
1. go to the BULK INSERT documentation on MSDN
2. find the section on the CODEPAGE
3. see the note that says:
SQL Server does not support code page 65001 (UTF-8 encoding).
4. Research further and find the Use Unicode Character Format to Import or Export Data (SQL Server) and see if that helps
This problem is still there in SQL Server 2017, see here and here.
If your import is just an occasional exercise, i.e. if it's OK to import not using a script at all, what worked for me is simply importing the csv using Tasks -> Import -> Flat file.
I'm adding this here because this page is high up when you Google 'SQL Server does not support code page 65001'. Hope it helps some.
In addition to the now deprecated or obsolete earlier answers by others, I want to point out that as of May 2022, with release version 15.0.2080.9 (SQL Server 2019), this works flawlessly for UTF-8.
Create a UTF-8 encoded file (I use one with a BOM)
then
BULK INSERT #tempTable1
FROM 'C:\....\file.csv' WITH (
CODEPAGE = '65001',
FIRSTROW = 2, --skip the first line
FIELDTERMINATOR = ';',
ROWTERMINATOR = '\n')
GO
It works flawlessly for me, with many French and other characters.
I went through the documentation #marc_s linked to and found the usage of DATAFILETYPE = 'widechar'.
I then went ahead and tried it with my UTF-8 csv file, but it didn't work, giving me the error:
[...] the data file does not have a Unicode signature
I then re-saved my csv file with Notepad's Unicode format, retried the import, and voila, success.
Make sure all commas and line-breaks are escaped (see here how to save a valid csv).
My full script (I'm using SQL Server 2017):
BULK INSERT [my_table]
FROM 'C:\path\to\file.csv'
WITH
(
FORMAT = 'CSV',
FIRSTROW = 2, -- if you have a title row, the first data row is 2nd
FIELDTERMINATOR = ',',
KEEPIDENTITY, -- remove it if you don't want identity to be kept
ROWTERMINATOR = '\n',
DATAFILETYPE = 'widechar',
ERRORFILE = 'C:\path\to\file_err.txt',
KEEPNULLS,
TABLOCK
)
Notes:
Make sure your date fields are in valid sql format.
Regarding KEEPNULLS, read this question (e.g., if you have NULLs in your file, replace them with an empty string).
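If preprocessing the file is not an option, an equivalent post-load cleanup is possible (a sketch only; some_col is a hypothetical column):
-- turn literal 'NULL' strings from the file into real NULLs after the load
UPDATE [my_table] SET some_col = NULL WHERE some_col = 'NULL'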

SQL*Loader - files with carriage return, line feed load into Oracle with CR/LF

I have a data extract from SQL Server in which the data in most columns contains carriage returns and line feeds. I need to load it into Oracle with the carriage returns and line feeds intact; basically I have to mirror the data from SQL Server 2012 to Oracle 11g.
Below is a sample of my extract file:
[#BOR#][#EOC#]109[#EOC#]4[#EOC#]testdata_Duplicate[#EOC#]testdata_Duplicate from chat[#EOC#]this
is
carriage return field[#EOC#]test2[#EOR#]
Here [#EOC#] is the column delimiter, [#EOR#] is the row delimiter, and [#BOR#] indicates the beginning of a row. Initially my loads failed due to blank lines in the flat file (data extract). Then I used [#BOR#] with a CONTINUEIF ... PRESERVE clause so that sqlldr would not treat blank lines (CR/LF) as physical rows.
With [#BOR#] as a filler column my load works fine, but the carriage returns and line feeds are not loaded into the Oracle tables.
My ctl file is as below:
load data
truncate
CONTINUEIF NEXT preserve (1:7) <> "[#BOR#]"
into table sch1.tbl1
fields terminated by '[#EOC#]'
trailing nullcols (
field filler,
a_id integer external,
h_id integer external,
title char(128),
descn char(4000),
risk char(4000),
comment char(4000) terminated by '[#EOR#]')
In the Oracle table sch1.tbl1 the column risk has the data as 'this is carriage return field' instead of
'this
is
carriage return field'
I tried replacing char(10) in the extract with the string [#crlf#] and using the replace function in the ctl file, as below:
load data
truncate
CONTINUEIF NEXT preserve (1:7) <> "[#BOR#]"
into table sch1.tbl1
fields terminated by '[#EOC#]'
trailing nullcols (
field filler,
a_id integer external,
h_id integer external,
title char(128),
descn char(4000),
risk char(4000) "replace(:risk,[#crlf#],chr(10))"
comment char(4000) terminated by '[#EOR#]')
SQL*Loader errors out stating SQL*Loader-309: No SQL string allowed as part of field specification; I believe that because my columns are of CLOB data type I am not able to use the replace function.
Please help me load the data from SQL Server with CR/LF into Oracle tables using SQL*Loader. Thank you in advance.
Here is the solution that works for me.
Instead of replacing the carriage return/line feed (CR/LF) in the extracted flat file with [#crlf#], I retain the CR/LF in the extracted data file.
Then I changed my ctl file to handle the CR/LF via the INFILE clause with the file name and "str '\n'". In a Unix environment we need \n, whereas on Windows we can use either \n or \r\n.
see below
load data INFILE 'filename.dat' "str '\n'"
truncate
CONTINUEIF NEXT preserve (1:7) <> "[#BOR#]"
into table sch1.tbl1
fields terminated by '[#EOC#]'
trailing nullcols (
field filler,
a_id integer external,
h_id integer external,
title char(128),
descn char(4000),
risk char(4000),
comment char(4000) terminated by '[#EOR#]')
I tested it and the data loaded with CR/LF. I need to do more detailed testing; as of now I have tested one table and I have many more.
Meanwhile, if anyone has a better solution I would be more than happy to try it out.
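Another option, sketched here on the assumption that the [#crlf#] placeholder from the earlier attempt is still in the extract: load the placeholder as plain text and convert it after the load with an ordinary UPDATE, which avoids the SQL*Loader-309 restriction on SQL strings for LOB fields:
-- run in Oracle after the SQL*Loader load completes;
-- restores the line feeds that the extract encoded as [#crlf#]
UPDATE sch1.tbl1 SET risk = REPLACE(risk, '[#crlf#]', CHR(10));
COMMIT;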

SQL Server - Bulk insert without losing CR or LF characters

I am trying to import email communication into a database table using Bulk Insert but I can't seem to be able to preserve the CR and LF characters. Let's consider the following:
CREATE TABLE myTable (
Email_Id int,
Email_subject varchar(200) NULL,
Email_Body TEXT NULL
)
The bulk insert statement has the following:
codepage = '1250',
fieldterminator = '<3P4>',
rowterminator = '<3ND>',
datafiletype = 'char'
The file contains full emails (including CR and LF characters). I would like to import the data and include the CR and LF characters. I have read that BULK INSERT treats each entry as a single row but does that mean it strips out the CR and LF characters? If so, what can I use to import this CSV file? I don't have access to SSIS and I would prefer to use SQL code to do it.
Example data:
11324<3P4>Read this email because it's urgent<3P4>Haha John,
I lied, the email was just to mess with you!
Your Nemesis,
Steve
P.S. I still hate you!
<3ND>
11355<3P4>THIS IS THE LAST STRAW<3P4>Steve,
I have had it with you stupid jokes, this email is going to the manager.
Good day,
John
<3ND>
It should import with the carriage returns and line feeds intact, even if you don't see them in some tools. We would import XSL this way and it would preserve all of the line formatting.
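A quick way to confirm the bytes survived, in the spirit of the first question above, is to cast the imported text to binary and look for 0x0D0A. A sketch, with a hypothetical file path:
BULK INSERT myTable FROM 'C:\data\emails.dat'
WITH (
    CODEPAGE = '1250',
    FIELDTERMINATOR = '<3P4>',
    ROWTERMINATOR = '<3ND>',
    DATAFILETYPE = 'char'
)

-- TEXT cannot be cast to VARBINARY directly, so go through VARCHAR(MAX)
SELECT Email_Id,
    CAST(CAST(Email_Body AS VARCHAR(MAX)) AS VARBINARY(MAX)) AS BodyBytes
FROM myTable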
