SQL Server - Bulk insert without losing CR or LF characters

I am trying to import email communication into a database table using Bulk Insert but I can't seem to be able to preserve the CR and LF characters. Let's consider the following:
CREATE TABLE myTable (
Email_Id int,
Email_subject varchar(200) NULL,
Email_Body TEXT NULL
)
The bulk insert statement has the following:
codepage = '1250',
fieldterminator = '<3P4>',
rowterminator = '<3ND>',
datafiletype = 'char'
The file contains full emails (including CR and LF characters). I would like to import the data and include the CR and LF characters. I have read that BULK INSERT treats each entry as a single row but does that mean it strips out the CR and LF characters? If so, what can I use to import this CSV file? I don't have access to SSIS and I would prefer to use SQL code to do it.
Example data:
11324<3P4>Read this email because it's urgent<3P4>Haha John,
I lied, the email was just to mess with you!
Your Nemesis,
Steve
P.S. I still hate you!
<3ND>
11355<3P4>THIS IS THE LAST STRAW<3P4>Steve,
I have had it with you stupid jokes, this email is going to the manager.
Good day,
John
<3ND>

It should import with the carriage returns and line feeds intact, even if you don't see them in some tools. We used to import XSL this way and it preserved all of the line formatting.
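For reference, a minimal sketch of the full statement described above; the file path is illustrative, while the options come from the question:
BULK INSERT myTable
FROM 'C:\import\emails.txt'
WITH (
    CODEPAGE = '1250',
    FIELDTERMINATOR = '<3P4>',
    ROWTERMINATOR = '<3ND>',
    DATAFILETYPE = 'char'
)
-- Because the row terminator is '<3ND>' rather than '\n', the CR and LF characters
-- inside each email body are ordinary data and should load into Email_Body as-is.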

Related

Workaround to BULK INSERT NULL values [SQL Server]

I have not used SQL Server much (I usually use PostgreSQL) and I find it hard to believe / accept that one simply cannot insert NULL values from a text file using BULK INSERT when the file has a value that indicates null or missing data (NULL, NA, na, null, -, ., etc.).
I know BULK INSERT can keep NULL if the field is empty (link), but that is not a nice solution in my case because I have > 50 files, all of them relatively big (> 25 GB), so I do not want to edit them. And I cannot find a way to tell SQL Server / BULK INSERT that a certain value should be interpreted as NULL.
This is, I would say, pretty standard when importing data from text files in most tools (e.g. COPY table_name FROM 'file_path' WITH (DELIMITER '\t', NULL 'NULL') in PostgreSQL, or readr::read_delim(file = "file", delim = "\t", na = "NULL") in R with the readr package, just to name a couple of examples).
Even more annoying is the fact that the file I want to import was actually exported from SQL Server. It seems that by default, instead of leaving NULLs as empty fields in the text file, it writes the literal value NULL (which makes the file bigger, but anyway). So it seems very odd that the "import" feature (BULK INSERT or the bcp utility) of one tool (SQL Server) cannot properly import the files exported by default by the very same tool.
I've been googling around (link1, link2, link3, link4) and cannot find a workaround for this (other than editing my files to change NULL to empty fields, or importing everything as varchar and later working in the database to change types and so on). So I would really appreciate any ideas.
For the sake of a reproducible example, here is a sample table where I want to import this sample data stored in a text file:
Sample table:
CREATE TABLE test
(
[item] [varchar](255) NULL,
[price] [int] NULL
)
Sample data stored in file.txt:
item1, 34
item2, NULL
item3, 55
Importing the data ...
BULK INSERT test
FROM 'file.txt'
WITH (FIELDTERMINATOR = ',', ROWTERMINATOR = '\n')
But this fails because on the second line it finds NULL for an integer field. This field, however, allows NULL values. So I want it to understand that this is just a NULL value and not a character value.
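For completeness, here is a rough sketch of the staging-table workaround the question mentions (and would rather avoid): load everything as varchar, then convert, with NULLIF mapping the literal string 'NULL' to a real NULL. Table and file names follow the example above; the staging table name is made up.
CREATE TABLE test_stage
(
[item] [varchar](255) NULL,
[price] [varchar](255) NULL
)

BULK INSERT test_stage
FROM 'file.txt'
WITH (FIELDTERMINATOR = ',', ROWTERMINATOR = '\n')

INSERT INTO test ([item], [price])
SELECT [item],
       CAST(NULLIF(LTRIM(RTRIM([price])), 'NULL') AS int)
FROM test_stage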

Bulk Import CSV file into SQL Server - remove double quotes

I am running SQL Server 2008 and using the BULK INSERT command. While inserting the data, I am trying to remove double quotes (") from the CSV file, which works partially but doesn't work for all the records. Please check my code and the screenshot of the result.
Bulk Insert tblUsersXTemp
from 'C:\FR0250Members161212_030818.csv'
WITH (FIELDTERMINATOR = '","',
ROWTERMINATOR = '"\n"',
--FormatFile =''
ERRORFILE = 'C:\bulk_insert_BadData.txt')
After you do the bulk insert, you could replace the double quotes.
UPDATE tblUsersXTemp
SET usxMembershipID = REPLACE(usxMembershipID, CHAR(34), '')
You need a format file, I believe; that's what I think is going on.
If you use the following BULK INSERT command to import the data without using a format file, then you will end up with a quotation mark prefix on the first column value and a quotation mark suffix on the last column value of each row.
Reference
Example from reference:
BULK INSERT tblPeople
FROM 'bcp.txt'
WITH (
DATAFILETYPE = 'char',
FIELDTERMINATOR = '","',
ROWTERMINATOR = '\n',
FORMATFILE = 'bcp.fmt');
You could also potentially have dirty data that uses quotes for more than just delimiters.
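For what it's worth, a hypothetical sketch of such a non-XML format file for the question's table is below (SQL Server 2008 format, version 10.0). The dummy first field exists only to consume the leading quotation mark, the quote characters are folded into the terminators, and only usxMembershipID is a column name taken from the question; the other names, lengths, and paths are placeholders.
/* bcp.fmt (hypothetical)
10.0
4
1   SQLCHAR   0   0     "\""       0   FIRST_QUOTE       ""
2   SQLCHAR   0   100   "\",\""    1   usxMembershipID   SQL_Latin1_General_CP1_CI_AS
3   SQLCHAR   0   100   "\",\""    2   usxCol2           SQL_Latin1_General_CP1_CI_AS
4   SQLCHAR   0   100   "\"\r\n"   3   usxCol3           SQL_Latin1_General_CP1_CI_AS
*/
BULK INSERT tblUsersXTemp
FROM 'C:\FR0250Members161212_030818.csv'
WITH (FORMATFILE = 'C:\bcp.fmt',
      ERRORFILE = 'C:\bulk_insert_BadData.txt')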

How to strip characters out of Temp Table after Bulk Insert

I am trying to remove some very annoying inline characters from my .csv file. I need to strip out ! CR LF because these are junking up my import. I have a proc to try to get rid of the crap but it doesn't seem to work. Here's the code:
CREATE TABLE #Cleanup
(
SimpleData nvarchar(MAX)
)
BULK INSERT #Cleanup from '**********************\myimport.csv'
SELECT * FROM #Cleanup
DECLARE @ReplVar nvarchar(MAX)
SET @ReplVar = CONCAT(char(33),char(10),char(13),char(32))
UPDATE #Cleanup SET SimpleData = REPLACE([SimpleData], @ReplVar, '') FROM #Cleanup
SELECT * FROM #Cleanup
My plan is that if the goofy line break gets removed, the second SELECT should not have it in there. The text looks like
js5t,1599,This is this and that is t!
hat,asdf,15426
when that line should read
js5t,1599,This is this and that is that,asdf,15426
See my quandary? Once the sequential characters !crlfsp are removed, I will take that temp table and feed it into the working one.
Edit to show varbinary data:
`0x31003700360039002C004300560045002D0032003000310035002D0030003000380035002C0028004D005300310035002D00300032003200290020004D006900630072006F0073006F006600740020004F006600660069006300650020004D0065006D006F00720079002000480061006E0064006C0069006E0067002000520065006D006F0074006500200043006F0064006500200045007800650063007500740069006F006E0020002800330030003300380039003900390029002C004D006900630072006F0073006F00660074002C00570069006E0064006F007700730020004F0053002C0053006F006600740077006100720065002000550070006700720061006400650020006F0072002000500061007400630068002C002C0042006100730065006C0069006E0065002C002C00310035002D003100320035002C0039002E0033002C0048006900670068002C0022005500730065002D00610066007400650072002D0066007200650065002000760075006C006E00650072006100620069006C00690074007900200069006E0020004D006900630072006F0073006F006600740020004F00660066006900630065002000320030003000370020005300500033002C00200045007800630065006C002000320030003000370020005300500033002C00200050006F0077006500720050006F0069006E0074002000320030003000370020005300500033002C00200057006F00720064002000320030003000370020005300500033002C0020004F00660066006900630065002000320030003100300020005300500032002C00200045007800630065006C002000320030003100300020005300500032002C00200050006F0077006500720050006F0069006E0074002000320030003100300020005300500032002C00200057006F00720064002000320030003100300020005300500032002C0020004F006600660069006300650020003200300031003300200047006F006C006400200061006E00640020005300500031002C00200057006F007200640020003200300031003300200047006F006C006400200061006E00640020005300500031002C0020004F006600660069006300650020003200300031003300200052005400200047006F006C006400200061006E00640020005300500031002C00200057006F007200640020003200300031003300200052005400200047006F006C006400200061006E00640020005300500031002C00200045007800630065006C0020005600690065007700650072002C0020004F0066006600690063006500200043006F006D007000610074006900620069006C0069007400790020005000610063006B0020005300500033002C00200057006F007200640020004100750074006F006D006100740069006F006E0020005300650072007600690063006500730020006F006E0020005300680061007200650050006F0069006E00740020005300650072007600650072002000320030003100300020005300500032002C00200045007800630065006C0020005300650072007600690063006500730020006F006E0020005300680061007200650050006F0069006E007400200053006500720076006500720020003200300031003300200047006F006C006400200061006E00640020005300500031002C00200057006F007200640020004100750074006F006D006100740069006F006E0020005300650072007600690063006500730020006F006E0020005300680061007200650050006F0069006E007400200053006500720076006500720020003200300031003300200047006F006C006400200061006E00640020005300500031002C00200057006500620020004100700070006C00690063006100740069006F006E0073002000320030003100300020005300500032002C0020004F006600660069006300650020005700650062002000410070007000730020005300650072007600650072002000320030003100300020005300500032002C00200057006500620020004100700070007300200053006500720076006500720020003200300031003300200047006F006C006400200061006E00640020005300500031002C0020005300680061007200650050006F0069006E00740020005300650072007600650072002000320030003000370020005300500033002C002000570069006E0064006F007700730020005300680061007200650050006F0069006E007400200053006500720076006900630065007300200033002E00300020005300500033002C0020005300680061007200650050006F0069006E007400200046006F0075006E0064006100740069006F006E002000320030003100300020005300500032002C0020005300680061007200650050006F0069006E007400200053006500720076006500720020003200
30003100300020005300500032002C0020005300680061007200650050006F0069006E007400200046006F0075006E0064006100740069006F006E0020003200300031003300200047006F006C006400200061006E00640020005300500031002C00200061006E00640020005300680061007200650050006F0069006E007400200053006500720076006500720020003200300031003300200047006F006C006400200061006E0064002000530050003100200061006C006C006F00770073002000720065006D006F002100
`0x2000740065002000610074007400610063006B00650072007300200074006F00200065007800650063007500740065002000610072006200690074007200610072007900200063006F00640065002000760069006100200061002000630072006100660074006500640020004F0066006600690063006500200064006F00630075006D0065006E0074002C00200061006B0061002000220022004D006900630072006F0073006F006600740020004F0066006600690063006500200043006F006D0070006F006E0065006E0074002000550073006500200041006600740065007200200046007200650065002000560075006C006E00650072006100620069006C006900740079002E00220022002200
#seagulledge, in a comment on the question, is correct, or at least partially correct, in stating that the CHAR(10) and CHAR(13) are out of order. A carriage-return (CR) is CHAR(13) and a line-feed (LF) is CHAR(10).
HOWEVER, the main thing preventing this from working is not the order of those two characters: it is the simple fact that the newlines -- whether they are \r\n or just \n -- are in the incoming CSV file, and hence the BULK INSERT command assumes that those newlines separate input rows (which makes sense for it to do). This can be seen by looking at the VARBINARY output in the question: there are two rows of output, both starting with 0x.
This problem can only be solved by fixing the incoming CSV file prior to calling BULK INSERT. That way the erroneously embedded newlines will be removed such that each row imports as a single row into the temp table.
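If fixing the file outside of SQL Server isn't an option, a rough alternative sketch is to pull the whole file in as a single value, scrub the stray sequence there, and only then split it into rows. This assumes SQL Server 2016+ for STRING_SPLIT, a UTF-16 encoded file to match SINGLE_NCLOB (the varbinary dump above looks like widechar data), an illustrative path, and the corrected CR/LF order noted above.
DECLARE @raw nvarchar(MAX)

SELECT @raw = BulkColumn
FROM OPENROWSET(BULK 'C:\import\myimport.csv', SINGLE_NCLOB) AS f

-- remove the bogus "!" + CR + LF + space sequence (CR = CHAR(13), LF = CHAR(10))
SET @raw = REPLACE(@raw, CHAR(33) + CHAR(13) + CHAR(10) + CHAR(32), '')

-- then split on the genuine line breaks
INSERT INTO #Cleanup (SimpleData)
SELECT value
FROM STRING_SPLIT(REPLACE(@raw, CHAR(13), ''), CHAR(10))
WHERE value <> ''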

How can I read a CSV file with UTF-8 code page in SQL bulk insert?

I have a Persian CSV file and I need to read it into SQL Server with a bulk insert:
I wrote this bulk insert:
BULK INSERT TEMP
FROM 'D:\t1.csv'
WITH(
FIELDTERMINATOR = ',',
ROWTERMINATOR = '\n',
CODEPAGE = '1256'
);
but it cannot read the UTF-8 encoding and reads the ی character as a ? character.
How can I write that?
1. go to the BULK INSERT documentation on MSDN
2. find the section on the CODEPAGE
3. see the note that says:
SQL Server does not support code page 65001 (UTF-8 encoding).
4. Research further, find Use Unicode Character Format to Import or Export Data (SQL Server), and see if that helps; a sketch of that approach follows below.
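Putting those steps together, a minimal sketch (untested against the original file) would be to re-save D:\t1.csv as UTF-16 ("Unicode" in Notepad) and load it as widechar, reusing the terminators from the question:
BULK INSERT TEMP
FROM 'D:\t1.csv'
WITH (
    DATAFILETYPE = 'widechar',  -- expects a UTF-16 file, so CODEPAGE is no longer needed
    FIELDTERMINATOR = ',',
    ROWTERMINATOR = '\n'
)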
This problem is still there in SQL Server 2017, see here and here.
If your import is just an occasional exercise, i.e. if it's OK not to use a script at all, what worked for me is simply importing the csv using Tasks -> Import -> Flat file.
I'm adding this here because this page is high up when you Google 'SQL Server does not support code page 65001'. Hope it helps some.
In addition to the now deprecated or obsolete earlier answers by others, I want to point out that as of today in May 2022, with Release Version 15.0.2080.9 (SQL Server 2019), this works flawlessly for UTF-8.
Create a UTF-8 encoded file (I use one with a BOM)
then
BULK INSERT #tempTable1
FROM 'C:\....\file.csv' WITH (
CODEPAGE = '65001',
FIRSTROW = 2, --skip the first line
FIELDTERMINATOR = ';',
ROWTERMINATOR = '\n')
GO
works flawlessly for me, with many french and other characters.
I went through the documentation #marc_s linked to, and found the usage of DATAFILETYPE = 'widechar'.
I then went ahead and tried it with my UTF-8 csv file, but it didn't work, giving me the error:
[...] the data file does not have a Unicode signature
I then re-saved my csv file with Notepad's Unicode format, retried the import, and voila, success.
Make sure all commas and line-breaks are escaped (see here how to save a valid csv).
My full script (I'm using SQL Server 2017):
BULK INSERT [my_table]
FROM 'C:\path\to\file.csv'
WITH
(
FORMAT = 'CSV',
FIRSTROW = 2, -- if you have a title row, the first data row is 2nd
FIELDTERMINATOR = ',',
KEEPIDENTITY, -- remove it if you don't want identity to be kept
ROWTERMINATOR = '\n',
DATAFILETYPE = 'widechar',
ERRORFILE = 'C:\path\to\file_err.txt',
KEEPNULLS,
TABLOCK
)
Notes:
Make sure your date fields are in valid sql format.
Regarding KEEPNULLS, read this question (e.g., if you have NULLs in your file, replace them with an empty string).

Bulk Insert Includes Line Terminator

I am bulk importing data from a pipe-separated CSV file into SQL Server. The data is formatted like
A|B|CCCCCC\r\n
I have validated both that the file is in UTF-8 format and that lines are terminated with "\r\n" by viewing the CSV file in a hex editor.
The command is
BULK INSERT MyTable FROM 'C:\Path\File.csv'
WITH (FIRSTROW=1, MAXERRORS=0, BATCHSIZE=10000, FIELDTERMINATOR = '|',
ROWTERMINATOR = '\r\n')
The third column originally was defined as CHAR(6) as this field is always a code exactly 6 (ASCII) characters wide. That resulted in a truncation error during bulk insert.
I then widened the column to CHAR(8). The import worked, but
SELECT CAST(Col3 As VARBINARY(MAX))
indicates that the column data ends with 0x0D0A (or "\r\n", the row terminator)
Why is the row terminator being included in the imported data and how can I fix that?
Long story short, SQL Server doesn't support UTF-8 and you just need \n as the row terminator.
It's actually a bit unclear what's going on because you didn't provide the table definition or the precise error messages. Having said all that, I could load the following data:
create table dbo.BCPTest (
col1 nchar(1) not null,
col2 nchar(1) not null,
col3 nchar(6) not null
)
/* This data can be saved as ASCII, UTF-16 with BOM or UTF-8 without BOM
(see comments below)
A|B|CCCCCC
D|E|FFFFFF
*/
BULK INSERT dbo.BCPTest FROM 'c:\testfile.csv'
WITH (FIELDTERMINATOR = '|', ROWTERMINATOR = '\n')
Comments:
When I created and saved a file in Notepad as "UTF-8", it added the BOM bytes 0xEFBBBF, which is the standard UTF-8 BOM
But, SQL Server doesn't support UTF-8; it supports UTF-16 (official docs here) and it expects a BOM of 0xFFFE
So I saved the file again in Notepad as "Unicode", and it added the 0xFFFE BOM; this loaded fine as shown above. Out of curiosity I also saved it (using Notepad++) as "UTF-8 without BOM" and I could load that file too
Saving the file as ASCII also loads fine with the same table data types and BULK INSERT command
The row terminator should be \n not \r\n because \n is interpreted as a "newline", i.e. SQL Server (and/or Windows) is being 'clever' by interpreting \n semantically instead of literally. This is most likely a result of the C handling of \r and \n, which doesn't require them to be interpreted literally.
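If the data has already been loaded with the terminator attached, a quick cleanup pass is possible; a sketch using the table and column names from the question:
-- strip the stray CR/LF left in Col3 by a '\r\n' terminated file
UPDATE MyTable
SET Col3 = REPLACE(REPLACE(Col3, CHAR(13), ''), CHAR(10), '')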
