Format fields during bulk insert SQL 2008 - sql-server

I am currently working on a project that requires data from a report generated by third party software to be inserted into a local SQL database. So far I have the data stored as a tab delimited .txt file and the following bulk insert SQL statement:
BULK INSERT ExampleTable
FROM 'c:\temp\Example.txt'
WITH
(
FIRSTROW = 2,
FIELDTERMINATOR = '\t',
ROWTERMINATOR = '\n'
)
GO
The two problems I am encountering are quotation marks around any value that contains its own comma, and dollar signs in every field that holds a dollar amount.
For instance one of the columns of the table is a description field and some of the values come out looking like:
"this is an example description, some more information, I don't know why the author would use commas in the first place here"
I don't care about the description field nearly as much as the other fields that include dollar amounts. Each of these fields is already prefixed with a $ sign, so I have to set them as nvarchar instead of decimal or float, which would be A LOT more useful for reporting. Furthermore, when the dollar amount is greater than 1000, the field will also contain a comma, and thus quotation marks, e.g. "$1,084.59".
I am familiar with SSMS, but I have never made a format file or bcp file (the solutions I have found online).
Any help would be greatly appreciated.

You can use a format file, but only if your metadata remains constant, which it does not appear to be in your case. You state that the dollar amounts are enclosed in quotes only when they exceed 999 and the comma is inserted. A format file would allow you to define per column delimiters such as [,] or [","]. But if that delimiter is shifting throughout your file, you will have to pre-process the file. Text qualifiers themselves are not supported.
For reference:
CSV import in SQL Server 2008
http://jessesql.blogspot.com/2010/05/bulk-insert-csv-with-text-qualifiers.html
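One workaround that avoids editing the file itself (this is not part of the answer above, and the staging table and column names are only placeholders) is to bulk insert everything into an all-nvarchar staging table and strip the quotes, dollar signs, and thousands commas in T-SQL afterwards. A minimal sketch:
-- Hypothetical staging approach: load the raw text first, clean and convert afterwards
CREATE TABLE ExampleStaging (Description nvarchar(500), Amount nvarchar(50));

BULK INSERT ExampleStaging
FROM 'c:\temp\Example.txt'
WITH (FIRSTROW = 2, FIELDTERMINATOR = '\t', ROWTERMINATOR = '\n');

INSERT INTO ExampleTable (Description, Amount)
SELECT REPLACE(Description, '"', ''),
       CAST(REPLACE(REPLACE(REPLACE(Amount, '"', ''), '$', ''), ',', '') AS decimal(18,2))
FROM ExampleStaging;
This keeps the BULK INSERT from the question unchanged and moves all of the cleanup into a plain INSERT ... SELECT.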

I don't see why, but ThiefMaster deleted my answer :-(
Probably a mistake and he did not check the link, as this link is the full answer to your question. I will try again one last time here...
Tip: if your CSV file doesn't have a consistent format, for example IN THE SAME COLUMN some of the values are double-quoted and some are not, then this blog will help you do it in an easy way (using OPENROWSET in the last step makes it one simple query): http://ariely.info/Blog/tabid/83/EntryId/122/Using-Bulk-Insert-to-import-inconsistent-data-format-using-pure-T-SQL.aspx
There is a new wiki at http://social.technet.microsoft.com/wiki based on this blog, if you prefer to read it on a Microsoft site.
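For what it's worth, the final OPENROWSET step that the blog builds up to generally has the following shape (the data-file and format-file paths here are placeholders, and the format file itself still has to be written to match your columns):
-- Rough sketch of the one-query OPENROWSET(BULK ...) import described in the blog
SELECT t.*
FROM OPENROWSET(
    BULK 'c:\temp\Example.txt',
    FORMATFILE = 'c:\temp\Example.fmt',
    FIRSTROW = 2
) AS t;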

Related

Snowflake: Export data in multiple delimiter format

Requirement:
Need the file to be exported in the format below, where gender, age, and interest are columns and the value after : is the data for that column. Can this be achieved using Snowflake? If not, is it possible to export the data using Python?
User1234^gender:male;age:18-24;interest:fishing
User2345^gender:female
User3456^age:35-44
User4567^gender:male;interest:fishing,boating
EDIT 1: Solution as given by @demircioglu
It displays NULL values instead of the other column values.
Below is the EMPLOYEES table data.
When I ran the query below:
SELECT 'EMP_ID'||EMP_ID||'^'||'FIRST_NAME'||':'||FIRST_NAME||';'||'LAST_NAME'||':'||LAST_NAME FROM tempdw.EMPLOYEES ;
Create your SQL with the desired format and write it to a file
COPY INTO @~/stage_data
FROM
(
SELECT 'User'||User||'^'||'gender'||':'||gender||';'||'age'||':'||age||';'||'interest'||':'||interest FROM table
)
file_format = (TYPE=CSV compression='gzip')
The file format here is not important because each line will be treated as a single field, given your delimiter requirements.
Edit:
The CONCAT function (aliased with ||) returns NULL if any of its inputs is NULL.
In order to eliminate NULLs you can use the NVL2 function.
So your SQL will have a series of NVL2s.
NVL2 checks its first argument: if it is not NULL it returns the second argument, and if it is NULL it returns the third.
So for User column
'User'||User||'^' will turn into
NVL2(User,'User','')||NVL2(User,User,'')||NVL2(User,'^','')
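Following the same pattern, the next attribute might look like the sketch below (column names are taken from the sample data in the question, so treat this purely as an illustration; a trailing ';' on the last attribute would still need trimming, e.g. with RTRIM(..., ';')):
-- hypothetical continuation of the NVL2 pattern for the gender column
NVL2(gender, 'gender', '') || NVL2(gender, ':', '') || NVL2(gender, gender, '') || NVL2(gender, ';', '')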
P.S. I am leaving it up to you to create the rest of the SQL, because Stack Overflow's function is to help find the solution, not to spoon-feed it.
No, I do not believe multiple delimiters like this are supported in Snowflake at this time. Multiple byte and multiple character delimiters are supported, but they will need to be specified as the same delimiter repeated for either record or line.
Yes, it may be possible to do some post-processing or use Python scripts to achieve this. Or even SQL transformative statements. This is not really my area of expertise so if someone has an example for you, I'll let them add to the discussion.

Import CSV data into SQL Server

I have data in the csv file similar to this:
Name,Age,Location,Score
"Bob, B",34,Boston,0
"Mike, M",76,Miami,678
"Rachel, R",17,Richmond,"1,234"
While trying to BULK INSERT this data into a SQL Server table, I encountered two problems.
If I use FIELDTERMINATOR=',' then it splits the first (and sometimes the last) column
The last column is an integer column but it has quotes and comma thousand separator whenever the number is greater than 1000
Is there a way to import this data (using XML Format File or whatever) without manually parsing the csv file first?
I appreciate any help. Thanks.
You can parse the file with http://filehelpers.sourceforge.net/
And with that result, use the approach here: SQL Bulkcopy YYYYMMDD problem or straight into SqlBulkCopy
Use MySQL load data:
LOAD DATA LOCAL INFILE 'path-to-/filename.csv' INTO TABLE `sql_tablename`
CHARACTER SET 'utf8'
FIELDS TERMINATED BY ',' OPTIONALLY ENCLOSED BY '\"'
IGNORE 1 LINES;
The OPTIONALLY ENCLOSED BY '\"' part (the escape character plus the quote) is what keeps the quoted data in the first column together as a single field.
IGNORE 1 LINES leaves the header row with the field names out.
The CHARACTER SET 'utf8' line is optional but good to use if names have diacritics, like in José.

Fix CSV file with new lines

I ran a query on a MS SQL database using SQL Server Management Studio, and some of the fields contained new lines. I selected to save the result as a CSV, and apparently MS SQL isn't smart enough to give me a correctly formatted CSV file.
Some of these fields with new lines are wrapped in quotes, but some aren't, and I'm not sure why (it seems to quote fields if they contain more than one new line, but not if they only contain one new line; thanks Microsoft, that's useful).
When I try to open this CSV in Excel, some of the rows are wrong because of the new lines; it thinks that one row is two rows.
How can I fix this?
I was thinking I could use a regex. Maybe something like:
/,[^,]*\n[^,]*,/
The problem with this is that it matches the last element of one line and the first element of the next line.
Here is an example csv that demonstrates the issue:
field a,field b,field c,field d,field e
1,2,3,4,5
test,computer,I like
pie,4,8
123,456,"7
8
9",10,11
a,b,c,d,e
A simple regex replacement won't work, but here's a solution based on preg_replace_callback:
function add_quotes($matches) {
    return preg_replace('~(?<=^|,)(?>[^,"\r\n]+\r?\n[^,]*)(?=,|$)~',
                        '"$0"',
                        $matches[0]);
}
$row_regex = '~^(?:(?:(?:"[^"]*")+|[^,]*)(?:,|$)){5}$~m';
$result = preg_replace_callback($row_regex, 'add_quotes', $source);
The secret to $row_regex is knowing ahead of time how many columns there are. It starts at the beginning of a line (^ in multiline mode) and consumes the next five things that look like fields. It's not as efficient as I'd like, because it always overshoots on the last column, consuming the "real" line separator and the first field of the next row before backtracking to the end of the line. If your documents are very large, that might be a problem.
If you don't know in advance how many columns there are, you can discover that by matching just the first row and counting the matches. Of course, that assumes the row doesn't contain any of the funky fields that caused the problem. If the first row contains column headers you shouldn't have to worry about that, or about legitimate quoted fields either. Here's how I did it:
preg_match_all('~\G,?[^,\r\n]++~', $source, $cols);
$row_regex = '~^(?:(?:(?:"[^"]*")+|[^,]*)(?:,|$)){' . count($cols[0]) . '}$~m';
Your sample data contains only linefeeds (\n), but I've allowed for DOS-style \r\n as well. (Since the file is generated by a Microsoft product, I won't worry about the older-Mac style CR-only separator.)
See an online demo
If you want a Java programmatic solution, open the file using the OpenCSV library. If it is a manual operation, then open the file in a text editor such as Vim and run a replace command. If it is a batch operation, you can use a Perl command to clean up the CRLFs.

SQL Server CSV extract of tables with newline, doublequotes and commas in columns?

I extracted some 10 tables to CSV with " as the text qualifier. The problem is my extract does not look right in Excel because of special characters in a few columns. Some columns are breaking onto a new row when they should stay in the column.
I've been doing it manually using the Management Studio export feature, but what's the best way to extract the 10 tables to CSV with the double-quote qualifier using a script?
Will I have to escape commas and double quotes? What's the best way to do this?
How should I handle newline characters in my columns? We need them for migration to a new system, but the PM wants to open the files and make modifications using Excel. Can they have it both ways?
I understand that much of the problem is that Excel is interpreting the file, where a load utility into another database might not do anything special with newlines. But what about double quotes and commas in the data? If I don't care about Excel, must I escape those?
Many Thanks.
If you are using SQL Server 2005 or later, the Export Wizard will export the Excel file for you.
Right-click the database and select Tasks -> Export Data...
Set the source to be the database.
Set the destination to Excel.
At the end of the wizard, select the option to create an SSIS package. You can then create a job to execute the package on a schedule or on demand.
I'd suggest never using commas for your delimiter - they show up too frequently in other places. Use a tab, since a tab isn't too easy to include in Excel tables.
Make sure you never start a field with a space unless you want that space in the field.
Try changing your text lf's into the literal text \n. That is:
You might have:
0,1,"Line 1
Line 2", 3
I suggest you want:
0 1 "Line 1\nLine 2" 3
(assuming the spacing between lines are tabs)
Good luck
As far as I know, you cannot have new lines in CSV columns. If you know a column could contain a comma, double quotes, or a new line, then you can use this SQL statement to extract the value as valid CSV:
SELECT '"' + REPLACE(REPLACE(REPLACE(CAST([yourColumnName] AS VARCHAR(MAX)), '"', '""'), CHAR(13), ''), CHAR(10), '') + '"' FROM yourTable

A 99.99 numeric from a flat file doesn't want to go into a NUMERIC(4,2) - sql-server

I have a csv file :
1|1.25
2|23.56
3|58.99
I want to put these values into a SQL Server table with SSIS.
I have created my table :
CREATE TABLE myTable( ID int, Value numeric(4,2));
My problem is that I have to create a Derived Column Transformation to specify my cast:
(DT_NUMERIC,4,2)(REPLACE(Value,".",","))
Otherwise, SSIS doesn't seem to be able to put my Value into my column and fills the column with NULL values.
And I think it is too ugly to do it this way. I want my Derived Column Transformation to be there for a real new derived column, not for a simple cast that I think SSIS should be able to detect on its own.
So, what is the standard way to use SSIS to resolve this problem?
BULK
INSERT myTable
FROM 'c:\csvtest1.txt'
WITH
(
FIELDTERMINATOR = '|',
ROWTERMINATOR = '\n'
)
csvtest1.txt
1|1.25
2|23.56
3|58.99
You're loading this up in international format (56,99 in lieu of 56.99). You need to load this as 56.99 for SQL Server to recognize it as such. Take out the REPLACE(Value, ".", ",") and just have the code be:
(DT_NUMERIC,4,2)(Value)
Handle the formatting on the application side, not on the data side. The comma is a reserved operator in SQL Server and you can't change that fact.
Haven't used SSIS a whole lot, but can't you set the regional settings on the File Source or at least set the decimal separator?
Can you change your SSIS source column to be in the correct datatype?
If you have control over the production of your file, I'd suggest you format the values without ANY decimal or thousands separators: in this case I'd have a file with values:
1|125
2|2356
3|5899
and then apply a division by 100 when importing the data. While it has the advantage of being culture-independent, of course it has some drawbacks:
1) First of all, it may not be possible to impose this format of the file.
2) It presumes that all numeric values are formatted accordingly, in this case every value is multiplied by 100; this can be an issue if you have to mix values from countries with different decimal positions (many have two decimals, but some have zero decimals).
3) It may severely impact other routines, maybe out of your control.
Therefore, this can really only be an option if you have total control over the CSV file.
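If that route were taken, the import-side adjustment might look roughly like this (the staging table and its column names are assumptions made for the sketch):
-- Hypothetical follow-up: load the separator-free integers, then divide by 100 on the way in
CREATE TABLE myStaging (ID int, RawValue int);

BULK INSERT myStaging
FROM 'c:\csvtest1.txt'
WITH (FIELDTERMINATOR = '|', ROWTERMINATOR = '\n');

INSERT INTO myTable (ID, Value)
SELECT ID, RawValue / 100.0
FROM myStaging;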
