BULK INSERT - header and data rows with different delimiters - sql-server

I'm using the following BULK INSERT command
BULK INSERT dbo.A
FROM 'd:\AData.csv'
WITH (FIELDTERMINATOR = ',',ROWTERMINATOR = ',\n',FIRSTROW = 2)
to process the data shown. The import skips the first row but also skips the second row. I believe this is because my header and data rows have different delimiters: the data rows have a trailing comma.
DATASET 1
Trial,Timestep,Column1 - line 1
1,0,0, - line 2
1,1,0.00687237750794734, - line 3
1,2,-0.00190074803257245, - line 4
The import works with the following data (note the comma at the end of line 1):
DATASET 2
Trial,Timestep,Column1, - line 1
1,0,0, - line 2
1,1,0.00687237750794734, - line 3
1,2,-0.00190074803257245, - line 4
Is there a way to tweak the parameters of the BULK INSERT command to handle DATASET 1 without using a custom format file?

Delete the header row from your file and you should be good to go.

Your data rows have a comma at the end, but your header row doesn't. Get rid of the last commas in the data rows and try again.
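For example, once the trailing commas have been stripped from the data rows, a plain newline row terminator should be enough. This is a minimal sketch assuming the cleaned-up file, not a fix for the original DATASET 1 as-is:
BULK INSERT dbo.A
FROM 'd:\AData.csv'
WITH (
    FIELDTERMINATOR = ',',   -- comma between columns
    ROWTERMINATOR = '\n',    -- every row now ends with a plain newline
    FIRSTROW = 2             -- skip the header row
);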

Related

unable to load csv file into snowflake with the COPY INTO command

End of record reached while expected to parse column '"VEGETABLE_DETAILS_PLANT_HEIGHT"["HIGH_END_OF_RANGE":5]'
File 'veg_plant_height.csv', line 8, character 14
Row 3, column "VEGETABLE_DETAILS_PLANT_HEIGHT"["HIGH_END_OF_RANGE":5]
If you would like to continue loading when an error is encountered, use other values such as 'SKIP_FILE' or 'CONTINUE' for the ON_ERROR option. For more information on loading options, please run 'info loading_data' in a SQL client.
This is my table:
create or replace table VEGETABLE_DETAILS_PLANT_HEIGHT (
    PLANT_NAME text(7),
    VEG_HEIGHT_CODE text(1),
    UNIT_OF_MEASURE text(2),
    LOW_END_OF_RANGE number(2),
    HIGH_END_OF_RANGE number(2)
);
and the COPY INTO command I used
copy into vegetable_details_plant_height
from @like_a_window_into_an_s3_bucket
files = ( 'veg_plant_height.csv')
file_format = ( format_name=VEG_CHALLENGE_CC );
and the csv file https://uni-lab-files.s3.us-west-2.amazonaws.com/veg_plant_height.csv
The error "End of record reached while expected to parse column" means Snowflake detected there were less than expected columns when processing the current row.
Please review your CSV file and make sure each row has correct number of columns. The error said on line 8.
The table has 5 columns but source file consist values for four columns due to this copy command returns the error. In order to resolve the issue you can modified the copy command as mentioned below:
copy into vegetable_details_plant_height (PLANT_NAME, UNIT_OF_MEASURE, LOW_END_OF_RANGE, HIGH_END_OF_RANGE)
from (select $1, $2, $3, $4 from @like_a_window_into_an_s3_bucket)
files = ('veg_plant_height.csv')
file_format = (format_name = VEG_CHALLENGE_CC);
As you can see in the csv file, the data in one column is enclosed in double quotes and the names are separated by commas, so you need to use the FIELD_OPTIONALLY_ENCLOSED_BY = '"' option in your file format.
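A minimal sketch of what the VEG_CHALLENGE_CC file format could look like with that option set; the delimiter and header settings are assumptions and should be adjusted to match the actual file:
create or replace file format VEG_CHALLENGE_CC
    type = 'CSV'
    field_delimiter = ','
    skip_header = 1                        -- assumes the file has a header row
    field_optionally_enclosed_by = '"';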

Why does ERROR_ON_COLUMN_COUNT_MISMATCH = TRUE not work in snowflake?

I'm creating a table in Snowflake using the COPY INTO clause; the values are coming from another table.
Now I want to raise an error if the incoming data has more columns than the new table.
For example, the new table has 13 columns, and I want to raise an exception if the current record I'm trying to insert has 14 columns.
For that I used the parameter ERROR_ON_COLUMN_COUNT_MISMATCH and set it to TRUE within the file_format = () clause.
The data I want to insert is a pipe-delimited file, and sometimes when parsing it we get more columns than we expect (14 instead of 13, for example).
file_format = (format_name = MY_FORMAT, ERROR_ON_COLUMN_COUNT_MISMATCH = TRUE)
It's supposed to work... but it's not working.
Has anybody seen anything like this before?
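One thing worth checking (a sketch, not a confirmed fix): since ERROR_ON_COLUMN_COUNT_MISMATCH is a file format option, it can be set on the named format itself rather than passed alongside format_name. The stage and table names below are placeholders:
alter file format MY_FORMAT
    set ERROR_ON_COLUMN_COUNT_MISMATCH = TRUE;

copy into my_target_table
from @my_stage/my_pipe_delimited_file.txt
file_format = (format_name = MY_FORMAT);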

Pandas insert into SQL Server

I've read an Excel file with 5 columns into a dataframe (using Pandas) and I'm trying to write it to an existing empty SQL Server table using this code:
for index, row in df.iterrows():
    PRCcrsr.execute("Insert into table([Field1], [Field2], [Field3], [Field4], [Field5]) VALUES(?,?,?,?,?)",
                    row['dfcolumn1'], row['dfcolumn2'], row['dfcolumn3'], row['dfcolumn4'], row['dfcolumn5'])
I get the following error message:
TypeError: execute() takes from 2 to 5 positional arguments but 7 were given
df.shape says I have 5 columns, but when I print the df to the screen it includes the row number. Also, one of the columns is city_state, which includes a comma. Is this the reason it thinks I'm providing 7 arguments (5 actual columns + the row number + the comma issue)? Is there a way to deal with the comma and row index columns in the dataframe before writing to SQL Server? If shape says 5 columns, why am I getting this error?
The code above indicated 7 parameters were being passed to the cursor execute command and only between 2 and 5 are permissible. I was actually passing 7 parameters (the Insert Into and Values clauses plus row['dfcolumn1'] through row['dfcolumn5'] - 7 in all). The fix was to convert the dataframe rows to tuples using this code:
new_tuple = [tuple(r) for r in df.values.tolist()]
then I rewrote the for loop as follows:
for row_tuple in new_tuple:
    PRCcrsr.execute("Insert into table([Field1], [Field2], [Field3], [Field4], [Field5]) VALUES(?,?,?,?,?)", row_tuple)
This delivered the fields as a tuple and inserted correctly
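A more compact variant of the same idea is to hand the whole list of tuples to executemany. This is a sketch assuming a pyodbc connection and a target table dbo.MyTable with the five fields above; the connection string and table name are placeholders:
import pandas as pd
import pyodbc

# Placeholder connection details - replace with your own server, database and driver.
conn = pyodbc.connect(
    "DRIVER={ODBC Driver 17 for SQL Server};SERVER=myserver;DATABASE=mydb;Trusted_Connection=yes"
)
cursor = conn.cursor()

df = pd.read_excel("input.xlsx")  # the 5-column spreadsheet from the question

# itertuples(index=False, name=None) yields plain tuples, so the row index never sneaks in.
rows = list(df.itertuples(index=False, name=None))
cursor.executemany(
    "INSERT INTO dbo.MyTable ([Field1], [Field2], [Field3], [Field4], [Field5]) "
    "VALUES (?, ?, ?, ?, ?)",
    rows,
)
conn.commit()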

Split data from strings into columns

I have a column with a long string. The data needs to be split into columns, and the strings are of variable length and don't always have the same number of columns. I'm not exactly sure how to do this, so I was looking for some advice here.
Let's say I have this string:
VS5~MedCond1~35.4|VS4~MedCond2~16|VS1~MedCond3~155|VS2~MedCond4~70|SPO2~MedCond5~100|VS3~MedCond6~64|FiO2~MedCond7~21|MAP~MedCond8~98|
In some cases the string might not have all the medical conditions, just some of them.
I need to split it into columns where the column name is the value between the tildes (i.e. MedCond1) and the column value is the value to the right of that tilde but before the pipe, ending up like this:
MedCond1 MedCond2 MedCond3 MedCond4 MedCond5 MedCond6 MedCond7 MedCond8
======== ======== ======== ======== ======== ======== ======== ========
35.1     24       110      64       100      88       21       79
I need to do this for a lot of rows within a large table, and as I said, not all the columns are always present, but the names don't change: one set might have MedCond1-8, while another set has only MedCond3, 4 and 7.
Here is a query I created that is close to what I want, but it is not dynamic, so it picks up the values along with some extra bits of the string:
select MainCol, case when charindex('MedCond1', MainCol) > 0 then
    substring(MainCol, charindex('MedCond1', MainCol) + 9, 4) end as [MedCond1]
from MedTable
Will return
MedCond1
========
35.3
40.2
33.6
33|V <--- Problem
As you can see, the numeric value is sometimes picked up with an additional part of the string due to the hard-coded charindex offset and length. The value is sometimes 4 characters long with a decimal place, sometimes 2 long with no decimal place. I would like to make this dynamic. The pipe defines the end of the data I need, and the start is defined by the tilde at the end of the column name.
Thanks for any thoughts on making this dynamic
Andrew
This data looks like a table itself. It could have been stored in SQL Server as xml. SQL Server supports xml fields and allows querying them. In fact, one could try to convert this string to XML, then try to query it:
declare @medTable table (item nvarchar(2000));
insert into @medTable
values ('VS5~MedCond1~35.4|VS4~MedCond2~16|VS1~MedCond3~155|VS2~MedCond4~70|SPO2~MedCond5~100|VS3~MedCond6~64|FiO2~MedCond7~21|MAP~MedCond8~98|');

-- Step 1: Replace `|` with <item> tags and `~` with <tag> tags
-- This will return an xml value for each medTable row
with items as (
    select xmlField = cast('<item><tag>'
        + replace(
            replace(item, '|', '</tag></item><item><tag>'),
            '~', '</tag><tag>')
        + '</tag></item>' as xml)
    from @medTable
)
-- Step 2: Select the different tags and display them as fields
select
    y.item.value('(tag/text())[1]', 'nvarchar(20)'),
    y.item.value('(tag/text())[2]', 'nvarchar(20)'),
    y.item.value('(tag/text())[3]', 'nvarchar(20)')
from items outer apply xmlField.nodes('item') as y(item)
The result is:
-------------------- -------------------- -------
VS5                  MedCond1             35.4
VS4                  MedCond2             16
VS1                  MedCond3             155
VS2                  MedCond4             70
SPO2                 MedCond5             100
VS3                  MedCond6             64
FiO2                 MedCond7             21
MAP                  MedCond8             98
NULL                 NULL                 NULL
It would be better to perform this conversion when loading the data, though. It's easier, for example, to make the replacements in C# or SSIS and store a complete xml value in the database.
You can modify this query too, to generate the xml value and store it in the database:
declare @medTable2 table (xmlField xml);

with items as (
    select xmlField = cast('<item><tag>' + replace(replace(item, '|', '</tag></item><item><tag>'), '~', '</tag><tag>') + '</tag></item>' as xml)
    from @medTable
)
insert into @medTable2
select items.xmlField
from items

-- Query the new table from now on
select
    y.item.value('(tag/text())[1]', 'nvarchar(20)'),
    y.item.value('(tag/text())[2]', 'nvarchar(20)'),
    y.item.value('(tag/text())[3]', 'nvarchar(20)')
from @medTable2 outer apply xmlField.nodes('item') as y(item)
OK, let me take a stab at this. The solution I'm outlining is not purely SQL Server, however; it uses a round-trip via a text file.
The approach uses the following steps:
Unpivot the data delimited by the pipe symbols (to create more than one line of output for each line of input)
Round-trip the data from SQL Server to a text file and back
Separate the data into columns on the tilde ~ symbol delimiter
Pivot the data back into columns
The key benefit of this approach is the unpivot operation, which allows you to handle missing columns like MedCond2 naturally by the absence of an equivalent row. It also eliminates nearly all string manipulation, save for the one REPLACE function in step 1 below.
Given a single row's contents like the following:
VS5~MedCond1~35.4|VS4~MedCond2~16|VS1~MedCond3~155|VS2~MedCond4~70|SPO2~MedCond5~100|VS3~MedCond6~64|FiO2~MedCond7~21|MAP~MedCond8~98|
Step 1 (Unpivot): Find and replace all instances of the pipe symbol with a newline character. So, REPLACE(column, '|', CHAR(13)) will give you the following lines of text (i.e. multiple lines of text in a single database row) for a single input row:
VS5~MedCond1~35.4
VS4~MedCond2~16
VS1~MedCond3~155
VS2~MedCond4~70
SPO2~MedCond5~100
VS3~MedCond6~64
FiO2~MedCond7~21
MAP~MedCond8~98
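In query form, step 1 might look like the following sketch, reusing the MainCol column and MedTable table from the question:
-- Step 1 as a query: one output value per source row, with embedded line breaks
select replace(MainCol, '|', char(13)) as UnpivotedText
from MedTable;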
Step 2 (Round-trip): Write the above output to a text file, using your tool of choice (SSIS, SQLCMD, etc.) and ensure that the newline character defined is the same as that used in the REPLACE command in step 1.
The purpose of this step is to concatenate multiple lines within the same row with other lines in different rows.
Note that step 1 can be eliminated by defining the row delimiter for steps 2 & 3 as the pipe symbol. I've included the additional step 1 using newlines only to make it easier to understand and debug.
Step 3 (Separate columns): Import the text file back into SQL Server using the same tool, and define the column delimiter as the tilde ~ symbol, row delimiter same as in steps 1/2.
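A sketch of what that re-import could look like with a plain BULK INSERT; the staging table name and file path are hypothetical:
-- Hypothetical staging table and file path for step 3
CREATE TABLE dbo.ImportedMedData (
    ColA         varchar(20),
    MedCondTitle varchar(20),
    MedCondValue varchar(20)
);

BULK INSERT dbo.ImportedMedData
FROM 'd:\MedDataUnpivoted.txt'
WITH (FIELDTERMINATOR = '~', ROWTERMINATOR = '\n');  -- match the newline written in step 2
The re-imported rows then look like this: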
ColA   MedCondTitle  MedCondValue
------ ------------- -------------
VS5    MedCond1      35.4
VS4    MedCond2      16
VS1    MedCond3      155
VS2    MedCond4      70
SPO2   MedCond5      100
VS3    MedCond6      64
FiO2   MedCond7      21
MAP    MedCond8      98
Step 4 (Pivot): Now you'd have a trivially simple step of pivoting rows to columns, which can be achieved with a statement of the form:
SUM(CASE WHEN MedCondTitle = 'MedCond1' THEN MedCondValue ELSE 0 END) AS MedCond1
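For illustration, a sketch of the full pivot, assuming the re-imported rows live in the hypothetical ImportedMedData table and also carry a RowId identifying the original source row. MAX is used instead of SUM here so the text values from the import need no casting and missing conditions simply come back as NULL:
select
    RowId,
    max(case when MedCondTitle = 'MedCond1' then MedCondValue end) as MedCond1,
    max(case when MedCondTitle = 'MedCond2' then MedCondValue end) as MedCond2,
    max(case when MedCondTitle = 'MedCond3' then MedCondValue end) as MedCond3,
    max(case when MedCondTitle = 'MedCond4' then MedCondValue end) as MedCond4
from dbo.ImportedMedData
group by RowId;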

SQL Server Update using PYODBC

I have a CSV file which contains '\N' in some cells under a column for which the header is defined as int in SQL Server.
I am using pyodbc to update the SQL Server data each day from the CSV supplied. The problem is that whenever there is a '\N' in the CSV file, the SQL Server update results in an error, and I have to delete the rows with '\N' before they will load.
Is there any way I can handle '\N' values for an int-type column in SQL Server?
Below is the code
with open(Result_File, 'r') as h:
    reader = csv.reader(h)
    columns = next(reader)
    query = "INSERT into dbo.test({0}) values ({1})"
    query = query.format(','.join(columns), ','.join('?' * len(columns)))
    for data_line in reader:
        cur.execute(query, data_line)
Use a list comprehension to iterate through the lists returned from the csv.reader object, replacing elements whose value is '\N' with the default value you specified (0).
...
for data_line in reader:
    # substitute 0 if the column value is the literal marker '\N'
    cur.execute(query, [0 if value == r'\N' else value for value in data_line])
Note that this workaround is highly specific and would be difficult to maintain if other cases of value substitution crop up. If possible, I'd fix the process that creates the input file so your processing code can stay generalized and readable.
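If the target column is nullable and a NULL is preferable to a default of 0, a small variant of the same idea (assuming cur and query are set up as above) is to substitute None, which pyodbc sends to SQL Server as NULL:
for data_line in reader:
    # map the CSV's '\N' marker to SQL NULL rather than a numeric default
    cur.execute(query, [None if value == r'\N' else value for value in data_line])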
