How to strip characters out of Temp Table after Bulk Insert - sql-server

I am trying to remove some very annoying inline characters from my .csv file. I need to strip out ! CR LF because these are junking up my import. I have a proc to try to get rid of the crap but it doesn't seem to work. Here's the code:
CREATE TABLE #Cleanup
(
SimpleData nvarchar(MAX)
)
BULK INSERT #Cleanup from '**********************\myimport.csv'
SELECT * FROM #Cleanup
DECLARE @ReplVar nvarchar(MAX)
SET @ReplVar = CONCAT(char(33),char(10),char(13),char(32))
UPDATE #Cleanup SET SimpleData = replace([SimpleData], @ReplVar,'') from #Cleanup
SELECT * FROM #Cleanup
My plan is that if the goofy line break gets removed, the second select should no longer show it. The text looks like
js5t,1599,This is this and that is t!
hat,asdf,15426
when that line should read
js5t,1599,This is this and that is that,asdf,15426
See my quandary? Once the sequential characters !crlfsp are removed, I will take that temp table and feed it into the working one.
Edit to show varbinary data:
`0x31003700360039002C004300560045002D0032003000310035002D0030003000380035002C0028004D005300310035002D00300032003200290020004D006900630072006F0073006F006600740020004F006600660069006300650020004D0065006D006F00720079002000480061006E0064006C0069006E0067002000520065006D006F0074006500200043006F0064006500200045007800650063007500740069006F006E0020002800330030003300380039003900390029002C004D006900630072006F0073006F00660074002C00570069006E0064006F007700730020004F0053002C0053006F006600740077006100720065002000550070006700720061006400650020006F0072002000500061007400630068002C002C0042006100730065006C0069006E0065002C002C00310035002D003100320035002C0039002E0033002C0048006900670068002C0022005500730065002D00610066007400650072002D0066007200650065002000760075006C006E00650072006100620069006C00690074007900200069006E0020004D006900630072006F0073006F006600740020004F00660066006900630065002000320030003000370020005300500033002C00200045007800630065006C002000320030003000370020005300500033002C00200050006F0077006500720050006F0069006E0074002000320030003000370020005300500033002C00200057006F00720064002000320030003000370020005300500033002C0020004F00660066006900630065002000320030003100300020005300500032002C00200045007800630065006C002000320030003100300020005300500032002C00200050006F0077006500720050006F0069006E0074002000320030003100300020005300500032002C00200057006F00720064002000320030003100300020005300500032002C0020004F006600660069006300650020003200300031003300200047006F006C006400200061006E00640020005300500031002C00200057006F007200640020003200300031003300200047006F006C006400200061006E00640020005300500031002C0020004F006600660069006300650020003200300031003300200052005400200047006F006C006400200061006E00640020005300500031002C00200057006F007200640020003200300031003300200052005400200047006F006C006400200061006E00640020005300500031002C00200045007800630065006C0020005600690065007700650072002C0020004F0066006600690063006500200043006F006D007000610074006900620069006C0069007400790020005000610063006B0020005300500033002C00200057006F007200640020004100750074006F006D006100740069006F006E0020005300650072007600690063006500730020006F006E0020005300680061007200650050006F0069006E00740020005300650072007600650072002000320030003100300020005300500032002C00200045007800630065006C0020005300650072007600690063006500730020006F006E0020005300680061007200650050006F0069006E007400200053006500720076006500720020003200300031003300200047006F006C006400200061006E00640020005300500031002C00200057006F007200640020004100750074006F006D006100740069006F006E0020005300650072007600690063006500730020006F006E0020005300680061007200650050006F0069006E007400200053006500720076006500720020003200300031003300200047006F006C006400200061006E00640020005300500031002C00200057006500620020004100700070006C00690063006100740069006F006E0073002000320030003100300020005300500032002C0020004F006600660069006300650020005700650062002000410070007000730020005300650072007600650072002000320030003100300020005300500032002C00200057006500620020004100700070007300200053006500720076006500720020003200300031003300200047006F006C006400200061006E00640020005300500031002C0020005300680061007200650050006F0069006E00740020005300650072007600650072002000320030003000370020005300500033002C002000570069006E0064006F007700730020005300680061007200650050006F0069006E007400200053006500720076006900630065007300200033002E00300020005300500033002C0020005300680061007200650050006F0069006E007400200046006F0075006E0064006100740069006F006E002000320030003100300020005300500032002C0020005300680061007200650050006F0069006E007400200053006500720076006500720020003200
30003100300020005300500032002C0020005300680061007200650050006F0069006E007400200046006F0075006E0064006100740069006F006E0020003200300031003300200047006F006C006400200061006E00640020005300500031002C00200061006E00640020005300680061007200650050006F0069006E007400200053006500720076006500720020003200300031003300200047006F006C006400200061006E0064002000530050003100200061006C006C006F00770073002000720065006D006F002100
`0x2000740065002000610074007400610063006B00650072007300200074006F00200065007800650063007500740065002000610072006200690074007200610072007900200063006F00640065002000760069006100200061002000630072006100660074006500640020004F0066006600690063006500200064006F00630075006D0065006E0074002C00200061006B0061002000220022004D006900630072006F0073006F006600740020004F0066006600690063006500200043006F006D0070006F006E0065006E0074002000550073006500200041006600740065007200200046007200650065002000560075006C006E00650072006100620069006C006900740079002E00220022002200

@seagulledge, in a comment on the question, is correct (or at least partially correct) in stating that the CHAR(10) and CHAR(13) are out of order. A carriage return (CR) is CHAR(13) and a line feed (LF) is CHAR(10).
HOWEVER, the main thing preventing this from working is not the order of those two characters: it is the simple fact that the newlines -- whether they are \r\n or just \n -- are in the incoming CSV file, and hence the BULK INSERT command assumes that they separate input rows (which is a sensible assumption for it to make). This can be seen by looking at the VARBINARY output in the question: there are two rows of output, both starting with 0x.
This problem can only be solved by fixing the incoming CSV file prior to calling BULK INSERT. That way the erroneously embedded newlines will be removed such that each row imports as a single row into the temp table.
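For illustration, one way to do that pre-fix without leaving T-SQL is sketched below: load the whole file as a single value, strip the bogus sequences, and only then split it into rows. This is a sketch, not the answer's prescribed method; it assumes SQL Server 2016+ (for STRING_SPLIT), a UTF-16 source file (which the VARBINARY dump above suggests), and a hypothetical file path:
-- Load the entire file as one value so no row-splitting happens yet
-- (SINGLE_NCLOB expects a UTF-16 file; the path below is made up).
DECLARE @raw nvarchar(max);
SELECT @raw = BulkColumn
FROM OPENROWSET(BULK 'C:\import\myimport.csv', SINGLE_NCLOB) AS src;
-- Remove the erroneous ! + CR + LF sequence (note: CR before LF);
-- append NCHAR(32) as well if the trailing space is part of the pattern.
SET @raw = REPLACE(@raw, NCHAR(33) + NCHAR(13) + NCHAR(10), N'');
-- Split on the remaining, legitimate line breaks: drop CRs, split on LFs.
INSERT INTO #Cleanup (SimpleData)
SELECT value
FROM STRING_SPLIT(REPLACE(@raw, NCHAR(13), N''), NCHAR(10));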

Related

What is the difference between 0x0a and \n for the ROWTERMINATOR parameter?

What is the difference between using 0x0a and \n for ROWTERMINATOR when importing data from an external source?
I tried to query data from a JSON file into rows and columns and got different outcomes. [Screenshot of the JSON file omitted.]
Here is the code I used:
SELECT TOP 10 *
FROM OPENROWSET(
BULK 'taxi/raw/payment_type_array.json',
DATA_SOURCE='nyc_taxidata',
FORMAT='CSV',
PARSER_VERSION='1.0',
FIELDTERMINATOR='0x0b',
FIELDQUOTE='0x0b',
ROWTERMINATOR='\n'
)
WITH (
jsonDoc NVARCHAR(MAX)
) AS payment_type
CROSS APPLY OPENJSON(jsonDoc)
WITH (
payment_type SMALLINT,
payment_type_desc NVARCHAR(MAX) AS JSON
);
Here is the outcome: [screenshot of the result omitted]
When I used '0x0a' as the ROWTERMINATOR I got the following: [screenshot of the result omitted]
From the documentation
Specifying \n as a Row Terminator for Bulk Import
When you specify \n as a row terminator for bulk import, or implicitly use the default row terminator, bcp and the BULK INSERT statement expect a carriage return-line feed combination (CRLF) as the row terminator. If your source file uses a line feed character only (LF) as the row terminator - as is typical in files generated on Unix and Linux computers - use hexadecimal notation to specify the LF row terminator.
So if you are getting different results for \n and 0x0a then you have some lines that are just using a line feed (LF) character, and some which are using carriage return (CR) followed by a line feed (LF) character (CRLF). This is a problem with the source data, in my opinion, as using inconsistent new line characters can (and does) cause problems.
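Concretely, the documentation's advice for LF-only files looks like this in a plain BULK INSERT (the table and file names here are hypothetical):
-- Unix-style file: specify the bare-LF row terminator in hexadecimal notation.
BULK INSERT dbo.MyTable
FROM 'C:\data\unix_file.csv'
WITH (FIELDTERMINATOR = ',', ROWTERMINATOR = '0x0a');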

Remove lines recursively from a SQL TEXT column

I have a SQL text column which holds a block of text with multiple lines that are not relevant, and I need to remove only those lines.
Example - This is all one column value:
Header_ID askdjfhklasjdhfklajhfwoi fhweiohrognfk
ABC
SECTION_ID asdfhkwjehfi efjhewiu1382204 3904834
123
SECTION_ID deihefgjkahf dfjsdhfkl edfashldfkljh
So basically I need to remove all lines that start with Header_ID and SECTION_ID, and the output text I need is just
ABC
123
The only thing constant about these lines is the first word they start with, and based on that I need to remove the whole line.
Here is a solution. Details about how it works are below. Note this solution needs MSSQL 2017+ to work.
-- Place the raw string value as varchar data in a variable so it is convenient to work with:
declare @rawValue varchar(max) = 'Header_ID askdjfhklasjdhfklajhfwoi fhweiohrognfk
ABC
SECTION_ID asdfhkwjehfi efjhewiu1382204 3904834
123
SECTION_ID deihefgjkahf dfjsdhfkl edfashldfkljh';
-- Perform multiple operations on the raw value and save the result to another variable:
declare @convertedValue varchar(max) =
(
select string_agg(value, char(13) + char(10))
from string_split(@rawValue, char(10))
where value not like 'header_id%' and value not like 'section_id%'
);
-- Display converted value.
select @convertedValue;
The magic begins with the string_split() function which produces a table value. It detects the line feed character, char(10), and splits the multi-line string into a table with each line from the string in a separate row.
Next, we filter out the rows from the table that we don't want. These rows begin with the known substrings header_id and section_id. This is accomplished in the where clause.
Lastly, for the output, we use string_agg() and aggregate the remaining rows (the lines we do want) back into a string with the individual values delimited by a combination of the carriage return char(13) and line feed char(10) characters.
Since I am using SQL Server 2016 and not 2017, what I did to resolve the issue was: first break all the data into multiple rows (CROSS APPLY with a split function) using CHAR(13) as the delimiter, then take only the rows that did not start with Header_ID, SECTION_ID, etc., and rebuild the text block using STUFF. A sketch of that approach is below.
Thanks again @otto for the resolution.
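For reference, here is a minimal sketch of that 2016-compatible approach. It assumes STRING_SPLIT (available from SQL Server 2016) in place of a custom split function, and uses FOR XML PATH for the aggregation that STRING_AGG would otherwise handle:
-- SQL Server 2016 sketch: STRING_SPLIT exists but STRING_AGG does not,
-- so FOR XML PATH plus STUFF rebuilds the text block.
-- (Note: STRING_SPLIT does not guarantee output order.)
declare @rawValue varchar(max) = 'Header_ID askdjfhklasjdhfklajhfwoi
ABC
SECTION_ID asdfhkwjehfi
123';
select stuff((
select char(13) + char(10) + value
from string_split(replace(@rawValue, char(13), ''), char(10))
where value not like 'header_id%' and value not like 'section_id%'
for xml path(''), type
).value('.', 'varchar(max)'), 1, 2, ''); -- drop the leading CR+LF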

How to remove weird Excel character in SQL Server?

There is a weird whitespace character I can't seem to get rid of that occasionally shows up in my data when importing from Excel. Visibly, it comes across as a whitespace character BUT SQL Server sees it as a question mark (ASCII 63).
declare @temp nvarchar(255); set @temp = 'carolg@c?am.com'
select @temp
returns:
?carolg@c?am.com
How can I get rid of the whitespace without getting rid of real question marks? If I look at the ASCII code for each of those "?" characters I get 63, when in fact only one of them is a real question mark.
Have a look at this answer for someone with a similar issue. Sorry if this is a bit long-winded:
SQL Server seems to flatten Unicode to ASCII by mapping unrepresentable characters (for which there is no suitable substitution) to a question mark. To replicate this, try opening the Character Map Windows program (should be installed on most machines), select Arial as the font and find U+034f "Combining Grapheme Joiner". select this character, copy to clipboard and paste it between the single quotes below:
declare @t nvarchar(10)
set @t = '͏'
select rtrim(ltrim(@t)) -- we can try and trim it, but by this stage it's already a '?'
You'll get a question mark out, because it doesn't know how to represent this non-ASCII character when it casts it to varchar. To force it to accept it as a double-byte character (nvarchar) you need to use N'' instead, as has already been mentioned. Add an N before the quotes above and the question mark disappears (but the original invisible character is preserved in the output - and ltrim and rtrim won't remove it as demonstrated below):
declare @t nvarchar(10),
@s varchar(10) -- note: single-byte string
set @t = rtrim(ltrim(N'͏')) -- trimming doesn't work here either
set @s = @t
select @s -- still outputs a question mark
Imported data can definitely do this, I've seen it before, and characters like the one I've shown above are particularly hard to diagnose because you can't see them! You will need to create some sort of scrubbing process to remove these unprintables (and any other junk characters, for that matter), and make sure that you use nvarchar everywhere, or you'll end up with this issue. Worse, those phantom question marks will become real question marks that you won't be able to distinguish from legitimate ones.
To see what character code you're dealing with, you can cast as varbinary as follows:
declare @t nvarchar(10)
set @t = N'͏test?'
select cast(@t as varbinary) -- returns 0x4F0374006500730074003F00
-- Returns:
-- 0x4F03 7400 6500 7300 7400 3F00
-- badchar t e s t ?
Now to get rid of it:
declare @t nvarchar(10)
set @t = N'͏test?'
select cast(@t as varbinary) -- bad char
set @t = replace(@t COLLATE Latin1_General_100_BIN2, nchar(0x034f), N'');
select cast(@t as varbinary) -- gone!
Note I had to swap the byte order from 0x4f03 to 0x034f (same reason "t" appears in the output as 0x7400, not 0x0074). For some notes on why we're using binary collation, see this answer.
This is kind of messy, because you don't know what the dirty characters are, and they could be one of thousands of possibilities. One option is to iterate over strings using like or even the unicode() function and discard characters in strings that aren't in a list of acceptable characters, but this could be slow. It may be that most of your bad characters are either at the start or end of the string, which might speed this process up if that's an assumption you think you can make.
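As an illustration of that whitelist idea, here is a minimal scrubbing sketch. The function name and the printable-ASCII whitelist (32-126) are assumptions made for the example, and a character-by-character loop like this will be slow on large tables:
-- Hypothetical helper: keep only characters in an accepted range (assumed
-- here to be printable ASCII 32-126); everything else is discarded.
create function dbo.ScrubString (@input nvarchar(max))
returns nvarchar(max)
as
begin
    declare @output nvarchar(max) = N'';
    declare @i int = 1;
    while @i <= len(@input + N'x') - 1 -- len() ignores trailing spaces; pad to count them
    begin
        if unicode(substring(@input, @i, 1)) between 32 and 126
            set @output += substring(@input, @i, 1);
        set @i += 1;
    end;
    return @output;
end;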
You may need to build additional processes, either external to SQL Server or as part of an SSIS import, based on what I've shown you above to strip this out quickly if you have a lot of data to import. If you aren't sure of the best way to do this, that's probably best answered in a new question.
I hope that helps.

SQL Server not escape \ CR LF

I have a table with a column of type text. In this column I save parameters, one per line.
My problem is when a parameter ends with "\" (because it is a path name): when the data is saved, SQL Server removes the CR LF, leaving two parameters on one line.
Example
Save
Path=C:\Transfer\
Outher=Yes
Outher1=No
Recover
Path=C:\Transfer\Outher=Yes
Outher1=No
How can I stop SQL Server from clearing the "CR LF" after "\"?
Additional info
create table TEST ( Ini text);
insert into TEST values
(
'Path=C:\Transfer
Outher=Yes
Outher1=No
');
insert into TEST values
(
'Path=C:\Transfer\
Outher=Yes
Outher1=No
');
select * from TEST;
The first insert returns the three parameters on separate lines; the second returns Path=C:\Transfer\Outher=Yes on one line, with Outher1=No on the next.
Apparently this is by design to allow for better readability of long strings in SSMS (or whatever client).
Breaks a long string constant into two or more lines for readability.
You can get around it by putting a space after your \
select 'Path=C:\Transfer\
Outher=Yes
Outher1=No'
or
Explicitly concatenating the new line separately
select 'Path=C:\Transfer\'+ '
Outher=Yes
Outher1=No'
or
Doubling up on everything (thanks @TT)
select 'Path=C:\Transfer\\
Outher=Yes
Outher1=No'
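A further variant (a sketch, not from the original answer) is to build the line break explicitly with CHAR(13) + CHAR(10), so no literal newline ever follows the backslash in the source text:
-- The string literal never ends a line with \, so nothing is swallowed.
select 'Path=C:\Transfer\' + char(13) + char(10)
+ 'Outher=Yes' + char(13) + char(10)
+ 'Outher1=No';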
See this answer as well.

Bulk Insert Includes Line Terminator

I am bulk importing data from a pipe-separated CSV file into SQL Server. The data is formatted like
A|B|CCCCCC\r\n
I have validated both that the file is in UTF-8 format and that lines are terminated with "\r\n" by viewing the CSV file in a hex editor.
The command is
BULK INSERT MyTable FROM 'C:\Path\File.csv'
WITH (FIRSTROW=1, MAXERRORS=0, BATCHSIZE=10000, FIELDTERMINATOR = '|',
ROWTERMINATOR = '\r\n')
The third column originally was defined as CHAR(6) as this field is always a code exactly 6 (ASCII) characters wide. That resulted in a truncation error during bulk insert.
I then widened the column to CHAR(8). The import worked, but
SELECT CAST(Col3 As VARBINARY(MAX))
indicates that the column data ends with 0x0D0A (or "\r\n", the row terminator)
Why is the row terminator being included in the imported data and how can I fix that?
Long story short, SQL Server doesn't support UTF-8 and you just need \n as the row terminator.
It's actually a bit unclear what's going on because you didn't provide the table definition or the precise error messages. Having said all that, I could load the following data:
create table dbo.BCPTest (
col1 nchar(1) not null,
col2 nchar(1) not null,
col3 nchar(6) not null
)
/* This data can be saved as ASCII, UTF-16 with BOM, or UTF-8 without BOM
(see comments below)
A|B|CCCCCC
D|E|FFFFFF
*/
BULK INSERT dbo.BCPTest FROM 'c:\testfile.csv'
WITH (FIELDTERMINATOR = '|', ROWTERMINATOR = '\n')
Comments:
When I created and saved the file in Notepad as "UTF-8", it added the BOM bytes 0xEFBBBF, which is the standard UTF-8 BOM
But SQL Server doesn't support UTF-8; it supports UTF-16 (official docs here), and it expects a BOM of 0xFFFE
So I saved the file again in Notepad as "Unicode", and it added the 0xFFFE BOM; this loaded fine as shown above. Out of curiosity I also saved it (using Notepad++) as "UTF-8 without BOM" and I could load that file too
Saving the file as ASCII also loads fine with the same table data types and BULK INSERT command
The row terminator should be \n not \r\n because \n is interpreted as a "newline", i.e. SQL Server (and/or Windows) is being 'clever' by interpreting \n semantically instead of literally. This is most likely a result of the C handling of \r and \n, which doesn't require them to be interpreted literally.
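If rows were already imported with the terminator attached, a cleanup along these lines is possible (a sketch against the question's table; the column name is assumed from the question):
-- Strip the stray CR/LF pair that was imported into the last column.
UPDATE MyTable
SET Col3 = REPLACE(REPLACE(Col3, CHAR(13), ''), CHAR(10), '');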
