SSIS - remove character X unless it's followed by character Y - sql-server

Let's say I have the following dataset imported from a text file:
Data
--------------------
1,"John Davis","Germany"
2,"Mike Johnson","Texas, USA"
3,"Bill "The man" Taylor","France"
I am looking for a way to remove every " in the data, unless it is directly preceded or followed by a comma.
So in my case, the data should become:
Data
--------------------
1,"John Davis","Germany"
2,"Mike Johnson","Texas, USA"
3,"Bill The man Taylor","France"
I tried it with the import text file component in SSIS, but that gives an error when I set the column delimiter to ". If I don't set a delimiter, it treats the comma in "Texas, USA" as a column delimiter.
Any suggestions or ideas? The text file is too large to fix every line manually, so that's not an option.

Bit of a cop-out on the last '"', but:
Create table #test ([Data] nvarchar(max))
insert into #test values ('1,"John Davis","Germany"' )
insert into #test values ('2,"Mike Johnson","Texas, USA"' )
insert into #test values ('3,"Bill "The man" Taylor","France"')
select replace(replace(replace(replace([Data],',"',',~'), '",','~,'),'"', ''),'~','"') + '"'
from #test
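
If you want to avoid appending the trailing " by hand, a variant on the same placeholder trick (a sketch, not tested against your full file) is to wrap the value in commas first, so every legitimate quote is adjacent to a comma, then strip and restore:
select substring(s, 2, len(s) - 2)
from (
    select replace(replace(replace(replace(',' + [Data] + ',', ',"', ',~'), '",', '~,'), '"', ''), '~', '"') as s
    from #test
) t
The added leading and trailing commas protect the quotes at both ends of the line, and the outer substring removes them again afterwards.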

Loading data to Snowflake target: End of record reached while expected to parse column

Observing Error:
End of record reached while expected to parse column '"DIM_EQUIPMENT_UNIT"["EQUNIT_TARE_WEIGHT_TNE":20]' File 'DIM_EQUIPMENT_UNIT_issue_nulll_end.csv', line 17, character 133 Row 16, column "DIM_EQUIPMENT_UNIT"["EQUNIT_TARE_WEIGHT_TNE":20]
If you would like to continue loading when an error is encountered, use other values such as 'SKIP_FILE' or 'CONTINUE' for the ON_ERROR option.
For more information on loading options, please run 'info loading_data' in a SQL client.
Sample Data:
3499933,00010101,99991231,"Y","TSXZ 622095",0,3,-2,-1,5,-2,"1","2017-03-24-17.25.42.000000","COMPASS ",5391,13,-2,"n/a ","n/a",+0.00000000000000E+000
3499948,00010101,99991231,"Y","EOLU 8888370",0,1,28,-1,3,-2,"1","2018-04-26-17.35.47.000000","COMPASS ",5799,-2,-2,"n/a ","n/a",+3.69000000000000E+000
3499968,00010101,99991231,"Y","NONZ 7086849",0,3,-2,-1,5,-2,"1","2017-03-24-17.25.42.000000","COMPASS ",5391,13,-2,"n/a ","n/a",+0.00000000000000E+000
3499992,00010101,99991231,"Y","SGPU 1240279",0,1,31,-1,3,-2,"1","2019-05-22-17.29.11.000000","COMPASS ",6203,-2,-2,"n/a ","n/a",+3.05000000000000E+000
109267,00010101,99991231,"Y","CTSU 425ß85 ",0,1,46,-1,3,-2,"1","2011-05-16-08.52.08.000000","COMPASS ",98,-2,-2,"n/a ","n/a",
DDL:
CREATE OR REPLACE TABLE DIM_EQUIPMENT_UNIT(
EQUNIT_ID NUMBER,
EQUNIT_VLD_FROM DATE,
EQUNIT_VLD_TO DATE,
EQUNIT_VLD_FLAG VARCHAR(1),
EQUNIT_UNIT_NUMBER VARCHAR(13),
EQUNIT_CONSTRUCTION_YEAR NUMBER,
FK_TW2130EQCAT NUMBER,
FK_TW0020EQT NUMBER,
FK_TW2160EQSERIES NUMBER,
FK_TW0050OWS NUMBER,
FK_TW059VEQLESSOR NUMBER,
EQUNIT_CLIENT VARCHAR(1),
EQUNIT_LC TIMESTAMP_NTZ,
EQUNIT_CB VARCHAR(8),
EQUNIT_LOAD_CYCLE_ID NUMBER,
FK_TW0820CHT NUMBER,
FK_TW0850GST NUMBER,
EQUNIT_SAP_ASSET_NUMBER VARCHAR(11),
EQUNIT_PRE_INTEGRATION_OWNER VARCHAR(3),
EQUNIT_TARE_WEIGHT_TNE FLOAT
);
COPY Command Used:
COPY INTO "DIM_EQUIPMENT_UNIT" FROM '#SNOWFLAKE_STAGE/' FILES=('DIM_EQUIPMENT_UNIT_issue_nulll_end.csv') on_error='abort_statement'
file_format=(type=csv SKIP_HEADER=1 FIELD_OPTIONALLY_ENCLOSED_BY='"' ESCAPE_UNENCLOSED_FIELD = None ENCODING ='WINDOWS1252'
EMPTY_FIELD_AS_NULL=true NULL_IF = ('NULL', 'null', '\N') TIMESTAMP_FORMAT='YYYY-MM-DD-HH24.MI.SS.FF')
Look at the last record in your sample data: the 18th and 19th column values are "n/a", but there is nothing in column number 20. Even if it is supposed to be null, something has to appear in the data, such as "", "\N", or NULL.
Since there is nothing at all, you get the end-of-record error on that column.
You can do either of the following two things:
a. make sure every row in your file has exactly 20 columns, or
b. if you cannot do that, and you are OK with ignoring the row, change ON_ERROR in the COPY statement to on_error='continue', which skips the offending row and moves on.
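
For option (b), only the ON_ERROR setting changes; the rest of the COPY statement you used above stays the same:
COPY INTO "DIM_EQUIPMENT_UNIT" FROM '#SNOWFLAKE_STAGE/' FILES=('DIM_EQUIPMENT_UNIT_issue_nulll_end.csv') on_error='continue'
file_format=(type=csv SKIP_HEADER=1 FIELD_OPTIONALLY_ENCLOSED_BY='"' ESCAPE_UNENCLOSED_FIELD = None ENCODING ='WINDOWS1252'
EMPTY_FIELD_AS_NULL=true NULL_IF = ('NULL', 'null', '\N') TIMESTAMP_FORMAT='YYYY-MM-DD-HH24.MI.SS.FF')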

How to insert a variable into a database with pyodbc?

highscore= score
cursor.execute("insert into tble values (hscore) hishscore.getvalue"):
Question: the score is saved into the variable highscore, and that highscore needs to be saved in the database field hscore. What is the correct code for inserting the value and getting it back?
You want to bind the parameter using the ? placeholder:
cursor.execute("INSERT INTO tble (hscore) VALUES (?)", highscore)
If you wanted to insert multiple values, here's a longer form:
cursor.execute(
    """
    INSERT INTO table_name
    (column_1, column_2, column_3)
    VALUES (?, ?, ?)
    """,
    (value_1, value_2, value_3)
)
Your order of VALUES was out of place as well. Good luck!
cursor.execute("insert into tablename(column1,column2) values (?,?);",var1,var2)
I needed the semicolon for it to work for me.
Assuming the column name is 'hscore', and the variable with the value to be inserted is 'highscore':
cursor.execute("insert into tablename([hscore]) values(?)", highscore)
You can follow the code below; it inserts column values read from a CSV file and is a good example for your use case:
import pyodbc
import io

# credential file holding server,database,username,password on one line
with io.open('cred.txt', 'r', encoding='utf-8') as f2:
    cred_data = f2.read().strip()

cred_data = cred_data.split(',')
server = cred_data[0]
database = cred_data[1]
username = cred_data[2]
pwd = cred_data[3]

con_obj = pyodbc.connect("DRIVER={SQL Server};SERVER=" + server +
                         ";DATABASE=" + database + ";UID=" + username + ";PWD=" + pwd)
data_obj = con_obj.cursor()

# data file with 5 columns; [1:] skips the header row
with io.open('data.csv', 'r', encoding='utf-8') as f1:
    data = f1.read()
data = data.split('\n')[1:]

i = 1001
for row in data:
    if not row:  # skip blank lines such as a trailing newline
        continue
    lines = row.split(',')
    emp = i
    fname = lines[0].split(' ')[0]
    sname = lines[0].split(' ')[1]
    com = lines[1]
    dep = lines[2]
    job = lines[3]
    email = lines[4]
    data_obj.execute("insert into dbo.EMP(EMPID,FNAME,SNAME,COMPANY,DEPARTMENT,JOB,EMAIL) values(?,?,?,?,?,?,?)",
                     emp, fname, sname, com, dep, job, email)
    con_obj.commit()
    i = i + 1

Using substring to strip text from text file & insert into SQL Server database & create script text file as result

Database Specification: SQL Server 2012
Problem Statement:
I need a SQL query (stored procedure, function, or set-based query … you're the expert) to process 4.1 million records in an efficient time. These records reside in a text file and therefore need to be inserted into a database table. Note that each record consists of the following fields at different offset values:
Herewith the offset values for only the first 6 columns...
,LTRIM(SUBSTRING([TABLE],1,13)) --National_id
,LTRIM(SUBSTRING([TABLE],14,43)) --Errmsg
,LTRIM(SUBSTRING([TABLE],57,8)) --DeceasedDTE
,LTRIM(SUBSTRING([TABLE],65,50)) --DeceasedReason
,LTRIM(SUBSTRING([TABLE],115,45)) --Surname
,LTRIM(SUBSTRING([TABLE],158,50)) --FirstNames
Note the FirstNames column could contain 3 string values separated by a space … FName1, FName2, FName3…
This is where I'm struggling: stripping this column when it holds more than 2 string values, because this column cannot be handled with fixed offset values alone.
I have the following 3 records as a sample ... these records need to be stored in a text file and used as input.
Copy this to a text file named: Deceased.txt
000101001118 IDENTITY NUMBER NOT NUMERIC
0001010061181PERSON DECEASED 19990101OBSTRUCTIVE AIRWAYS SYNDROME BABA NOWEZILE
0001010077097 COERTZEN AZIL CUBITT JONO
Desired Result:
1. Insert each record into a SQL Server table.
The table will have the following columns:
National_id Errmsg DeceasedDTE DeceasedReason Surname FirstNames
First_Initial Second_Initial Third_Initial FName1 FName2 FName3
Also note the FirstNames value is separated by spaces, and I need it split up into FName1, FName2, and FName3 ... each with its corresponding first letter, which makes up the initial.
I then need to create a script, written to a .txt file, that produces an insert statement for each record with the following columns: national_id, surname, First_Initial, Second_Initial, Third_Initial, First_Name, Second_Name, Third_Name.
E.G.
Set #Insert = "insert into prodmgr.t_unverified (national_id, surname, First_Initial, Second_Initial,Third_Initial, First_Name, Second_Name, Third_Name) values ('"
NOTE:
DECEASED.TXT (3 RECORDS)
000101001118 IDENTITY NUMBER NOT NUMERIC
0001010061181PERSON DECEASED 19990101OBSTRUCTIVE AIRWAYS SYNDROME BABA NOWEZILE
0001010077097 COERTZEN AZIL CUBITT JONO
TABLE:DECEASED (3 RECORDS)
National_id Errmsg DeceasedDTE DeceasedReason Surname FirstNames First_Initial Second_Initial Third_Initial FName1 FName2 FName3
RESULT IN TEXT FILE: (2 RECORDS)
insert into prodmgr.t_unverified (national_id, surname, First_Initial, Second_Initial, Third_Initial, First_Name, Second_Name, Third_Name) values ('0001010061181','BABA','N','','','NOWEZILE','','')
insert into prodmgr.t_unverified (national_id, surname, First_Initial, Second_Initial, Third_Initial, First_Name, Second_Name, Third_Name) values ('0001010077097','COERTZEN','A','C','J','AZIL','CUBITT','JONO')
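
For the name-splitting part, here is a minimal sketch of one way to do it on SQL Server 2012 (which predates STRING_SPLIT). The Deceased table and FirstNames column come from the question; everything else is illustrative:
-- Pad with two trailing spaces so CHARINDEX always finds a hit,
-- then cut up to three name parts out with SUBSTRING.
WITH padded AS (
    SELECT FirstNames, FirstNames + '  ' AS s
    FROM Deceased
), pos1 AS (
    SELECT FirstNames, s, CHARINDEX(' ', s) AS p1
    FROM padded
), pos2 AS (
    SELECT FirstNames, s, p1, CHARINDEX(' ', s, p1 + 1) AS p2
    FROM pos1
)
SELECT FirstNames,
       SUBSTRING(s, 1, p1 - 1)                       AS FName1,
       NULLIF(SUBSTRING(s, p1 + 1, p2 - p1 - 1), '') AS FName2,
       NULLIF(RTRIM(SUBSTRING(s, p2 + 1, 100)), '')  AS FName3
FROM pos2;
The initials then follow as LEFT(FName1, 1), LEFT(FName2, 1), and LEFT(FName3, 1).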

Oracle split text into multiple rows

Inside a varchar2 column I have text values like :
aaaaaa. fgdfg.
bbbbbbbbbbbbbb ccccccccc
dddddd ddd dddddddddddd,
asdasdasdll
sssss
If I do select column from table where id=..., I get the whole text in a single row, as expected.
But I would like to get the result in multiple rows: 5 for the example above.
I have to use just one select statement, and the delimiters will be the new line or carriage return characters (chr(10) and chr(13) in Oracle).
Thank you!
Like this, maybe (but it all depends on the version of Oracle you are using):
WITH yourtable AS (SELECT REPLACE('aaaaaa. fgdfg.' ||chr(10)||
'bbbbbbbbbbbbbb ccccccccc ' ||chr(13)||
'dddddd ddd dddddddddddd,' ||chr(10)||
'asdasdasdll ' ||chr(13)||
'sssss '||chr(10),chr(13),chr(10)) AS astr FROM DUAL)
SELECT REGEXP_SUBSTR ( astr, '[^' ||chr(10)||']+', 1, LEVEL) data FROM yourtable
CONNECT BY LEVEL <= LENGTH(astr) - LENGTH(REPLACE(astr, chr(10))) + 1
see: Comma Separated values in Oracle
The answer by Kevin Burton contains a bug if your data contains empty lines.
The adaptation below, based on the solution described here, works; check that post for an explanation of the issue and the fix.
WITH yourtable AS (SELECT REPLACE('aaaaaa. fgdfg.' ||chr(10)||
'bbbbbbbbbbbbbb ccccccccc ' ||chr(13)||
chr(13)||
'dddddd ddd dddddddddddd,' ||chr(10)||
'asdasdasdll ' ||chr(13)||
'sssss '||chr(10),chr(13),chr(10)) AS astr FROM DUAL)
SELECT REGEXP_SUBSTR ( astr, '([^' ||chr(10)||']*)('||chr(10)||'|$)', 1, LEVEL, null, 1) data FROM yourtable
CONNECT BY LEVEL <= LENGTH(astr) - LENGTH(REPLACE(astr, chr(10))) + 1;
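
Applied to a real table restricted to a single id, as in the question, the same pattern looks like this (mytable and col are placeholder names; if the text also contains chr(13), normalize it first with REPLACE as above):
SELECT REGEXP_SUBSTR(col, '([^'||chr(10)||']*)('||chr(10)||'|$)', 1, LEVEL, NULL, 1) AS data
FROM mytable
WHERE id = :id
CONNECT BY LEVEL <= LENGTH(col) - LENGTH(REPLACE(col, chr(10))) + 1;
Note that CONNECT BY LEVEL against a multi-row result set would cross-join the generated rows, so keep the WHERE clause limited to a single row.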

Bulk insert into SQL Server a CSV with line breaks in fields

I have a csv that looks like this:
"blah","blah, blah, blah
ect, ect","column 3"
"foo","foo, bar, baz
more stuff on another line", "another column 3"
Is it possible to import this directly into SQL server?
Every row in your file ends with a new line (\n), but the actual rows you want to extract end with a quotation mark followed by a new line. Set ROWTERMINATOR in the BULK INSERT command to:
ROWTERMINATOR = '"\n'
EDITED: I think the bigger problem will be the commas in the text. BULK INSERT does not recognize text qualifiers, so each row will be split on commas regardless of whether a comma is inside quotation marks or not.
You can do it like this:
BULK INSERT newTable
FROM 'c:\file.txt'
WITH
(
FIELDTERMINATOR ='",',
ROWTERMINATOR = '"\n'
)
This will give you the following result:
col1 | col2 | col3
----------------------------------------------------------------
"blah | "blah, blah, blah ect, ect | "column 3
"foo | "foo, bar, baz more stuff on another line | "another column 3
All you have to do then is get rid of the quotation mark at the beginning of each cell, for example:
UPDATE newTable
SET col1 = RIGHT(col1,LEN(col1)-1),
col2 = RIGHT(col2,LEN(col2)-1),
col3 = RIGHT(col3,LEN(col3)-1)
I think you can also do this using the bcp utility with a format file.
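
If you are on SQL Server 2017 or later, a simpler route (not available on 2012/2014) is BULK INSERT's native CSV mode, which does honor quoted fields and should also cope with the embedded commas and line breaks; a sketch:
BULK INSERT newTable
FROM 'c:\file.txt'
WITH
(
    FORMAT = 'CSV',
    FIELDQUOTE = '"',
    ROWTERMINATOR = '\n'
)
This also removes the need to strip the leading quotation marks afterwards.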
