SSIS - Flat File with escape characters

I have a large flat file I'm using to recover data. It was exported from a system using double quotes (") as the qualifier and a pipe (|) as the delimiter. SSIS can be configured to handle this without a problem, but where I'm running into issues is with the \ escape character.
The row causing the issue:
"125004267"|"125000316"|"125000491"|"height"|"5' 11\""|"12037"|"46403"|""|"t"|""|"2012-10-01 22:34:01"|"2012-10-01 22:34:01"|"1900-01-01 00:00:00"
The fourth column in the database should be 5' 11".
I'm getting the following error:
Error: 0xC0202055 at Data Flow Task 1, Flat File Source [2]: The column delimiter for column "posting_value" was not found.
How can I tell SSIS to handle the \ escape character?

I know this is quite old, but I just ran into a similar issue regarding escaping quotes in CSVs in SSIS. It seems odd that there isn't more flexible support for this, but it does support VB-style doubled double quotes. So in your example you could pre-parse the file to translate it into
"125004267"|"125000316"|"125000491"|"height"|"5' 11"""|"12037"|"46403"|""|"t"|""|"2012-10-01 22:34:01"|"2012-10-01 22:34:01"|"1900-01-01 00:00:00"
to get your desired output. This at least works on SQL Server 2014.
This also works for Excel (tested with 2010), though, oddly, only when inserting data from a text file, not when opening a CSV with Excel.
This does appear to be the standardized method according to RFC 4180, which states:

Fields containing line breaks (CRLF), double quotes, and commas should be enclosed in double-quotes.
...
If double-quotes are used to enclose fields, then a double-quote appearing inside a field must be escaped by preceding it with another double quote.
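
For example, a pre-processing pass over the file can do that translation before SSIS reads it. A minimal sketch in Python, assuming the file is UTF-8 and that the backslash is only ever used to escape a double quote (the file names here are hypothetical):

with open("export.txt", encoding="utf-8") as src, \
        open("export_fixed.txt", "w", encoding="utf-8") as dst:
    for line in src:
        # Translate \" (backslash-escaped quote) into "" (RFC 4180 style)
        dst.write(line.replace('\\"', '""'))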

This probably isn't the answer you are looking for, but...
I'd reach out to the technical contacts at the source of the data and explain that if they're going to send you a file that uses double quotes as text qualifiers, that implies there are never any double quotes in the text. Since that clearly isn't the case here, ask them to use another text qualifier, or none at all.
Since pipe delimiters are in use, what's the point of having text qualifiers anyway?
Seems redundant.

Related

WildFly environment variable parsing a JDBC string with semicolon

I am having a heck of a time using an environment variable with a semicolon in a properties file read by WildFly (24) in Linux. One like:
DATABASE_JDBC_URL=jdbc:sqlserver://sqlserver.c3klg5a2ws.us-east-1.rds.amazonaws.com:1433;DatabaseName=ejbca;encrypt=false
The issue is that it truncates things at the semicolon if I don't use quotes, so I end up with it trying to write to master, since it thinks no database is specified.
I have it set up so that the variable is in a file called datasource.properties that gets read from standalone.conf, where this variable sits:
JAVA_OPTS="$JAVA_OPTS -DDATABASE_JDBC_URL=${DATABASE_JDBC_URL}"
It's read in with the following in standalone.conf:
set -a
. /opt/wildfly_config/datasource.properties
set +a
That in turn gets populated in standalone.xml with:
<connection-url>${env.DATABASE_JDBC_URL}</connection-url>
I tried putting it in quotes, and oddly enough it doesn't start at all; standalone.sh is no longer able to parse it:
/opt/wildfly/bin/standalone.sh: line 338: --add-exports=java.desktop/sun.awt=ALL-UNNAMED: No such file or directory
So I then escaped the semicolons inside the quotes like this:
DATABASE_JDBC_URL="jdbc:sqlserver://sqlserver.c3klg5a2ws.us-east-1.rds.amazonaws.com:1433\;DatabaseName=ejbca\;encrypt=false"
Startup looks good in the log output this way:
-DDATABASE_JDBC_URL=jdbc:sqlserver://sqlserver.c3klg5a2ws.us-east-1.rds.amazonaws.com:1433;DatabaseName=ejbca;encrypt=false
But then Java doesn't like it; for some reason it sees the escape characters:
Caused by: com.microsoft.sqlserver.jdbc.SQLServerException: The port number 1433\ is not valid.
I can use sed to change the value in standalone.xml, but all of the other properties I'm handling this way work fine, with the exception of this one and:
<check-valid-connection-sql>${env.DATABASE_CONNECTION_CHECK}</check-valid-connection-sql>
where that value is "SELECT 1;", which it also does not like. That one worked with "'SELECT 1;'", but this one does not. I tried single quotes here as well; that also gives the parsing error above. Is there any way to read in this environment variable that keeps WildFly happy?
You can enclose the characters you want to escape in { and } braces.
From the SQL Server documentation:
For example, {;} escapes a semicolon.
Just to note: different database vendors will most likely have different ways of escaping characters in their connection URLs. The above approach works for SQL Server; to give one different example, MySQL uses URL encoding.
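Applied to the URL above, the properties line would look something like this (an untested sketch that simply substitutes the documented {;} escape; verify against the driver documentation that brace escaping is honoured where you need it):

DATABASE_JDBC_URL=jdbc:sqlserver://sqlserver.c3klg5a2ws.us-east-1.rds.amazonaws.com:1433{;}DatabaseName=ejbca{;}encrypt=false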
An alternative solution was to change how the variables were created. Part of the problem I had was that sourcing them from a properties file made them properties rather than variables. I ended up creating a /opt/wildfly/bin/start.sh script with:
#!/bin/bash
export DATABASE_JDBC_URL="jdbc:sqlserver://sqlserver.c3klg5a2ws.us-east-1.rds.amazonaws.com:1433;DatabaseName=ejbca;encrypt=false"
/opt/wildfly/bin/standalone.sh
I then pointed the WildFly service at that new start.sh script. There are no longer any parsing issues, since the variables are set directly in the environment.
No escaping was needed after that.

BCP Fixed Width Import -> Unexpected EOF encountered in BCP data-file?

I have some sensitive information that I need to import into SQL Server, and it is proving to be a challenge. I'm not sure what the original database that housed this information was, but I do know it is provided to us in a Unix fixed-length text file with an LF row terminator. I have two files: a small file that covers a month's worth of data, and a much larger file that covers five years' worth. I have created a BCP format file and command that successfully imports and maps the data to my SQL Server table.
The five-year data is supposedly in the same format, so I've used the same command and format file on that text file. It starts processing records, but somewhere along the way (after several thousand records) it throws "Unexpected EOF encountered", and I can see in the database that some of the rows are mapped correctly according to the fixed lengths, but then something goes horribly wrong and parts of the data end up in columns they most definitely do not belong in. Is there a character that would cause BCP to mess up and terminate early?
BCP Command: BCP DBTemp.dbo.svc_data_temp in C:\Test\data2.txt -f C:\test\txt2.fmt -T -r "0x0A" -S "stageag,90000" -e log.rtf
Again, the format file and command work perfectly for the smaller data set, but something in the five-year dataset is tripping up BCP.
Thanks in advance for the replies!
So I found the offending characters in my fixed-width file. Whoever pulled the data originally (I don't have access to the source) escaped, or did not correctly escape, the double quotes in some of the text, which injected extra spaces and broke the fixed-width layout we were supposed to be following. After correcting the double quotes by hex-editing the file, BCP was able to process all records using the format file without issue. I had used the -F and -L flags to examine certain rows and narrow things down to where I could visually compare the rows that were OK with the rows where the problems started, which led me to discover the double-quotes issue. Hope this helps somebody else with a similar issue!
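
For anyone hunting similar corruption: injected characters shift everything after them, so a simple record-length check often points straight at the bad rows before involving BCP at all. A minimal sketch in Python, assuming LF-terminated rows; RECORD_LEN is a hypothetical stand-in for the total record width defined in your format file:

RECORD_LEN = 250  # hypothetical: the sum of the field widths in txt2.fmt

with open(r"C:\Test\data2.txt", "rb") as f:
    for lineno, raw in enumerate(f, start=1):
        record = raw.rstrip(b"\n")  # rows are LF-terminated
        if len(record) != RECORD_LEN:
            print(f"line {lineno}: {len(record)} bytes (expected {RECORD_LEN})")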

Unable to create directory in oracle 12c

I am using Oracle 12.2. I wish to import Data Pump files; to do that, I wish to create a directory object pointing at the folder containing the files and then import. I use the following command to create the directory:
CREATE DIRECTORY dpump_dir1 AS ‘D:\dumpdir’;
I am getting this error:
SQL Error: ORA-00911: invalid character
00911. 00000 - "invalid character"
*Cause: identifiers may not start with any ASCII character other than
letters and numbers. $#_ are also allowed after the first
character. Identifiers enclosed by doublequotes may contain
any character other than a doublequote. Alternative quotes
(q'#...#') cannot use spaces, tabs, or carriage returns as
delimiters. For all other contexts, consult the SQL Language
Reference Manual.
Could anybody tell me what is going wrong?
The quotes being used in the code you provided are not simple straight single quotes; it's slightly easier to see when formatted as code:
CREATE DIRECTORY dpump_dir1 AS ‘D:\dumpdir’;
You can also use your text editor, or dump the string, to see which characters it contains:
select dump(q'[CREATE DIRECTORY dpump_dir1 AS ‘D:\dumpdir’;]', 1016) from dual;
DUMP(Q'[CREATEDIRECTORYDPUMP_DIR1AS‘D:\DUMPDIR’;]',1016)
---------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------
Typ=96 Len=49 CharacterSet=AL32UTF8: 43,52,45,41,54,45,20,44,49,52,45,43,54,4f,52,59,20,64,70,75,6d,70,5f,64,69,72,31,20,41,53,20,20,e2,80,98,44,3a,5c,64,75,6d,70,64,69,72,e2,80,99,3b
You can see that it's reported as 49 bytes despite being 45 characters long, indicating you have multibyte characters. Before the final semicolon, which is shown as 3b, you have the sequence e2,80,99, which represents the ’ right single quotation mark, and a bit earlier you have the sequence e2,80,98, which represents the ‘ left single quotation mark.
If you use plain quotes it should work:
CREATE DIRECTORY dpump_dir1 AS 'D:\dumpdir';
Presumably you copied and pasted the text from an editor which helpfully substituted curly quotes.
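
If many scripts were pasted the same way, a quick normalization pass can swap the curly quotes back before you run them. A minimal sketch in Python; the filename is hypothetical and the table only covers the common curly-quote code points:

REPLACEMENTS = {
    "\u2018": "'",  # left single quotation mark
    "\u2019": "'",  # right single quotation mark
    "\u201c": '"',  # left double quotation mark
    "\u201d": '"',  # right double quotation mark
}

with open("create_dir.sql", encoding="utf-8") as f:
    text = f.read()
for smart, plain in REPLACEMENTS.items():
    text = text.replace(smart, plain)
with open("create_dir.sql", "w", encoding="utf-8") as f:
    f.write(text)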

Which is the best character to use as a delimiter for ETL?

I recently unloaded a customer table from an Informix DB, and several rows were rejected because the customer-name column contained unescaped vertical bar (pipe) characters, which is the default DBDELIMITER in the source DB. I found out that the field on their customer form has an input mask allowing any alphanumeric character to be entered, which can include any letters, numbers, or symbols. So I persuaded the user to run a blanket update on that column to change the pipe symbols to semicolons. I also discovered other rows containing asterisks and commas in different columns. I can only imagine what would happen if this table were unloaded in CSV format, or what damage the asterisks could do!
What is the best character to define as a delimiter?
If tables are already tainted with pipes, commas, asterisks, tabs, backslashes, etc., what's the best way to clean them up?
I have to deal with large volumes of narrative data at my job. This is always a nightmare, because users are apt to put ANY character in there, including unprintable ones. You can run a cleanup operation, but you have to do it every time you load data, and it likely won't work forever. Eventually someone will put in whatever character you choose as a separator. That's not a problem if your CSV-handling libraries can handle escaping properly, but many can't. If this is a one-time load/unload, you're probably fine, but if you have to do it more often...
In the past I've changed the separator to the backtick (`), the tilde (~), or the caret (^). All failed in the current effort. The best solution I could come up with is to not use CSV format at all. I switched to XML. Even then there were still XML-illegal characters, but these can be translated out with atlassian-xml-cleaner-0.1.jar.
Unload the customer table using a delimiter that doesn't exist in the data (string-search first to confirm), e.g. "~":
unload to file delimiter "~"
select * from customer;
Clean your file (or not):
(vi replace string) :g/theoldstring/s//thenewstring/g
or
(unix prompt) sed 's/old-char/new-char/g' fileold > filenew
(Once clean, I'd personally change the "~" in the unload file back to "|" or "," as the CSV standard.)
Load to the source db.
If you can, use a multi-character delimiter. It can still fail, but it should be far less likely.
Or, escape the delimiter while writing the export file (the Informix docs say LOAD TABLE escapes delimiter characters by prefixing them with a backslash). Proper CSV has quoting and escaping, so it shouldn't matter if a comma is in the data, unless your exporter and loader cannot handle proper CSV.
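
As a concrete illustration of that last point, Python's standard csv module handles the quoting and escaping for you, so a delimiter appearing inside a field round-trips safely (a sketch with made-up data):

import csv

rows = [["ACME|Corp", 'says "hi"'], ["plain", "data"]]

# Write pipe-delimited CSV; fields containing | or " get quoted automatically.
with open("out.csv", "w", newline="") as f:
    csv.writer(f, delimiter="|", quotechar='"').writerows(rows)

# Read it back; the embedded pipe and quotes survive intact.
with open("out.csv", newline="") as f:
    assert list(csv.reader(f, delimiter="|", quotechar='"')) == rows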

Characters are converted to special symbols

I have database records in an MS Excel file. I save it as a CSV file and then create a database in Firefox's SQLiteManager by importing that CSV file.
But characters like ..., ', ", and - are converted to �.
I have also tried saving the CSV file in UTF-8 format, but that converts the characters to Õ.
Does anyone have an idea how to solve this?
Thanks.
Perhaps you might want to consider how quotes are escaped, e.g. try "" or \" in your CSV file. And pay a bit more attention to the Fields enclosed by section in the SQLiteManager add-on, making sure those fields are enclosed properly.
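
The � replacement usually means the import tool is decoding the file with the wrong character set, so another option is to re-encode the CSV explicitly before importing. A minimal sketch in Python, assuming the Excel export is Windows-1252 (common for Excel on Windows; adjust if yours differs):

# Re-encode an Excel-exported CSV so the importer sees UTF-8.
# cp1252 is an assumption; Excel "CSV" on Windows typically uses it.
with open("export.csv", encoding="cp1252") as src, \
        open("export_utf8.csv", "w", encoding="utf-8") as dst:
    dst.write(src.read())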
