Why won't Redshift accept my fixed-width text file - sql-server

I am reading a varchar(500) column from a SQL Server 2008 R2 database to import into Redshift via a fixed-width text file.
To pull the records down into a fixed-width file, I started out by using a StringBuilder to write out a block of text at a time. I used AppendFormat with the alignment specifier to align the records. Once every 400k lines, I wrote the contents of the StringBuilder to a StreamWriter to flush them to disk.
When I tried loading the files into Redshift, the upload failed due to extra columns: there were more columns than my fixed-width specification allowed for.
When I tested the StringBuilder output against a regular string, the widths matched what I intended: 500 characters.
The discrepancy appeared when I wrote my records to disk. I kept getting the same issue when I wrote the aforementioned database column to disk using the format overload of WriteLine on the StreamWriter object.
The collation on the database is SQL_Latin1_General_CP1_CI_AS. I understand that strings from the database get converted from the database collation to UTF-16. I think there is no problem there, as shown by the test I described above. I think the issue comes from taking the strings in UTF-16 form and writing them to disk using StreamWriter.
I can expect any type of character in the database field, except for a newline or carriage return. I'm pretty confident that white space is trimmed before being pushed into the database column using a combination of the T-SQL functions LTRIM and RTRIM.
Edit: The following is the code I use in PowerShell:
$dw = new-object System.Data.SqlClient.SqlConnection("<connection string details>")
$dw.open()
$reader = (new-object System.Data.SqlClient.Sqlcommand("select email from emails",$dw)).ExecuteReader()
$writer = new-object system.IO.StreamWriter("C:\Emails.txt",[System.Text.Encoding]::UTF8)
while ($reader.Read())
{
    # Pad each value to 500 characters (not bytes) and write it as one line
    $writer.WriteLine("{0,-500}", $reader["email"])
}
$writer.close()
$reader.close()
Obviously I'm not going to give you the details of my connection string or my table naming convention.
Edit: I'm including the AWS Redshift article that explains that data can only be imported into Redshift using UTF-8 encoding.
http://docs.aws.amazon.com/redshift/latest/dg/t_preparing-input-data.html
Edit: I was able to get a sample of the outputted file through
get-content -encoding utf8
The content of the file is definitely proper UTF-8, including all of the line endings. It seems like my main issue is with Redshift accepting multi-byte characters in fixed-width files.

I suspect the issue is that StreamWriter uses UTF-8 by default, so in some instances you will get double-byte characters, because UTF-8 is a variable-width encoding.
Try using Unicode, which will match your database encoding; StreamWriter has a constructor overload that accepts an encoding.
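To make the width issue concrete, here is a minimal Python sketch (not the asker's PowerShell; the sample addresses and the helper name padded_sizes are made up for illustration). It pads a value to 500 characters, the way the {0,-500} format string does, and compares the character count with the number of bytes each encoding produces.

```python
def padded_sizes(value: str, width: int = 500) -> dict:
    """Pad value to `width` characters and report its size per encoding."""
    padded = value.ljust(width)  # same idea as the "{0,-500}" format string
    return {
        "chars": len(padded),
        "utf-8 bytes": len(padded.encode("utf-8")),
        "utf-16 bytes": len(padded.encode("utf-16-le")),
    }


# Hypothetical sample values: one pure ASCII, one with accented letters.
print(padded_sizes("jose@example.com"))
print(padded_sizes("josé.núñez@example.com"))
```

Both values are 500 characters long after padding, but the second takes more than 500 bytes in UTF-8, so a loader that measures fixed-width columns in bytes would see a longer row.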

Just so that anyone seeing this understands: my problem is really with Redshift. The service seems to have processing issues with fixed-width files. This seems to be specific to Amazon, since the underlying system that runs Redshift is ParAccel. I have had issues with fixed-width files in the past. I have been able to confirm that there is an issue with Redshift accepting multi-byte characters in the fixed-width version of the S3 COPY command.

Related

_x000D_ appearing when importing into SQL

I am importing some Excel spreadsheets into a MS SQL Server. I load the spreadsheets, cleanse the data and then export it to SQL using Alteryx. Some files have text columns where the cells span multiple lines (i.e. with new line characters, like when you press ALT + ENTER in Excel). When I export the tables to SQL and then query the table, I see lots of '_x000D_' which are not in the original file.
Is it some kind of newline character encoding? How do I get rid of it?
I haven't been able to replicate the error. The original file contains some letters with accents (à á etc); I created multi-line spreadsheets with accented letters, but I managed to export these to SQL just fine, with no '_x000D_'.
If these were CSV files I would think of character encoding, but Excel spreadsheets? Any ideas? Thanks!
I know this is old, but: if you're using Alteryx, just run it through the "Data Cleansing" tool as the last thing prior to your export to SQL. For the field in question, tell the tool to remove new lines by checking the appropriate checkbox.
If that still doesn't work: 0x000D is ASCII 13 (hex D = decimal 13), the carriage return character. Try running your data through a regular Formula tool and, for the [field] in question, use the expression Replace([field],CharFromInt(13),""), which removes that character by replacing it with the empty string.
This worked for me:
REGEX_REPLACE([field],"_x000D_","")
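For context, _xHHHH_ is the escape form that the .xlsx (Office Open XML) format uses for characters it cannot store directly in XML, and _x000D_ stands for a carriage return (U+000D). If the cleanup has to happen outside Alteryx, the same idea can be sketched in Python; the function names below are illustrative and not part of any library.

```python
import re

# Matches OOXML-style escapes such as _x000D_ (four hex digits).
_ESCAPE = re.compile(r"_x([0-9A-Fa-f]{4})_")


def decode_escapes(text: str) -> str:
    """Turn each _xHHHH_ escape back into the character it encodes."""
    return _ESCAPE.sub(lambda m: chr(int(m.group(1), 16)), text)


def strip_carriage_returns(text: str) -> str:
    """Drop literal _x000D_ escapes and real carriage returns."""
    return text.replace("_x000D_", "").replace("\r", "")


sample = "line one_x000D_\nline two"
print(repr(decode_escapes(sample)))
print(repr(strip_carriage_returns(sample)))
```

The second helper mirrors the two Alteryx suggestions above: it removes both the literal escape text and any actual carriage return characters.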

Manual import into SQL Server 2000 of tab delimited text file does not format international characters

I have searched for this specific solution and while I have found similar queries, I have not found one that solves my issue. I am manually importing a tab-delimited text file of data that contains international characters in some fields.
This is one such character: Exhibit Hall C–D
It's either an em dash or an en dash between the C and D. It copies and pastes fine, but when the data is taken into SQL Server 2000, it ends up looking like this:
Exhibit Hall C–D
The field is nvarchar and like I said, I am doing the import manually through Enterprise Manager. Any ideas on how to solve this?
The problem is that the encoding between the import file and SQL Server is mismatched. The following approach worked for me in SQL Server 2000 importing into a database with the default encoding (SQL_Latin1_General_CP1_CI_AS):
1. Open the .csv/.tsv file with the free text editor Notepad++, and ensure that special characters appear normal to start with (if not, try Encoding|Encode in...)
2. Select Encoding|Convert to UCS-2 Little Endian
3. Save as a new .csv/.tsv file
4. In SQL Server Enterprise Manager, in the DTS Import/Export Wizard, choose the new file as the data source (source type: Text File)
5. If not automatically detected, choose File type: Unicode (in the preview on this page, the Unicode characters will still look like black blocks)
6. On the next page, Specify Column Delimiter, choose the correct delimiter. Once chosen, Unicode characters should appear correctly in the Preview pane
7. Complete the import wizard
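The Notepad++ conversion step can also be scripted. The following is a minimal Python sketch, assuming the source file is UTF-8 (with or without a byte order mark); the function name and file names are hypothetical. It rewrites the file as UTF-16 little endian with a BOM, which covers the same characters as UCS-2 Little Endian for text in the Basic Multilingual Plane.

```python
import codecs
from pathlib import Path


def convert_to_utf16le(src: str, dst: str, src_encoding: str = "utf-8-sig") -> None:
    """Re-encode a text file as UTF-16 LE with a byte order mark."""
    # newline="" keeps the original line endings untouched when reading;
    # "utf-8-sig" drops a leading UTF-8 BOM if one is present.
    with open(src, "r", encoding=src_encoding, newline="") as f:
        text = f.read()
    # Write the BOM explicitly so the byte order does not depend on the platform.
    Path(dst).write_bytes(codecs.BOM_UTF16_LE + text.encode("utf-16-le"))
```

A call such as convert_to_utf16le("events.tsv", "events_utf16.tsv") would produce the file to select as the data source in the import wizard.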
I would try using the bcp utility ( http://technet.microsoft.com/en-us/library/ms162802(v=sql.90).aspx ) with the -w parameter.
You may also want to check the text encoding of the input file.

How to read Arabic characters from varchar datatype?

I have an old system that uses the varchar datatype in its database to store Arabic names. The names now appear in the database like this:
"ãíÓÇÁ ÇáãÈíÖíä"
Now I am building a new system using VB.NET. How can I read these names so that they appear in Arabic characters?
I should also point out that although the old system stores the data as shown above, it displays the characters correctly.
How can I display the names properly in the new system and in SQL Server Management Studio?
Have you tried nvarchar? You may find some useful information at the link below:
When must we use NVARCHAR/NCHAR instead of VARCHAR/CHAR in SQL Server?
I faced the same problem and solved it in two steps:
1. Change the datatype of the column in the DB to nvarchar
2. Use an encoding conversion to change the data into Arabic
I used the following function:
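private string GetDataWithArabic(string srcData)
{
    // Recover the raw bytes: ISO-8859-1 maps each character 0-255 to a single byte.
    Encoding iso = Encoding.GetEncoding("iso-8859-1");
    // Encoding.Default is the machine's ANSI code page on .NET Framework, so this
    // assumes that code page is an Arabic one (e.g. windows-1256).
    Encoding ansi = Encoding.Default;
    byte[] rawBytes = iso.GetBytes(srcData);
    return ansi.GetString(rawBytes);
}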
But make sure you apply this method only once to the DB data, because it will corrupt the data if applied twice.
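The same byte-level repair can be sketched in Python for comparison. This assumes the names were originally stored as Windows-1256 (Arabic) bytes and later read back as Latin-1; that code page is an assumption based on the garbled sample in the question, and the function name is made up.

```python
def repair_arabic(garbled: str, wrong: str = "latin-1", right: str = "cp1256") -> str:
    """Undo text that was stored as Windows-1256 bytes but decoded as Latin-1."""
    # Encoding with the wrong code page recovers the original bytes;
    # decoding with the right one yields the intended characters.
    return garbled.encode(wrong).decode(right)


sample = "ãíÓÇÁ ÇáãÈíÖíä"  # the garbled name from the question
fixed = repair_arabic(sample)
print(fixed)
# A second call on `fixed` would raise UnicodeEncodeError, because Arabic
# letters have no Latin-1 encoding, so the repair must run only once.
```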
I think your answer is here: "storing and retrieving non english characters" http://aalamrangi.wordpress.com/2012/05/13/storing-and-retrieving-non-english-unicode-characters-hindi-czech-arabic-etc-in-sql-server/

SQL Server Bulk Import With Format File of UTF-8 Data

I have been referring to the following page:
http://msdn.microsoft.com/en-us/library/ms178129.aspx
I simply want to bulk import some data from a file that has Unicode characters. I have tried encoding the actual data file in UCS-2, UTF-8, etc., but nothing works. I have also modified the format file to use SQLNCHAR, but it still doesn't work and gives the error:
Bulk load data conversion error (truncation) for row 1, column 1
I think it has to do with this statement from the above link:
For a format file to work with a Unicode character data file, all the
input fields must be Unicode text strings (that is, either fixed-size
or character-terminated Unicode strings).
What exactly does this mean? I thought it meant that every character needs to be a fixed 2 bytes, which encoding the file in UCS-2 should handle?
This blog post was really helpful and solved my problem:
http://blogs.msdn.com/b/joaol/archive/2008/11/27/bulk-insert-using-unicode-data-files.aspx
Something else to note: a Java class was generating the data file. For the above solution to work, the data file needed to be encoded in UTF-16LE, which can be set in the constructor of OutputStreamWriter (for example).
In SQL Server 2012 I imported a .csv file with special Spanish characters that was saved with Notepad++ and encoded in UCS-2.

SQL 2005 CSV Import Quote Delimited with inner Quotes and Commas

I have a CSV file with quote text delimiters. Most of the 90000 rows are fine, but I have a few rows that have a text field that contains both a quote and a comma. For example, the field's value would be:
AB",AB
When delimited, this becomes
"AB"",AB"
When SQL 2005 attempts to import this I get errors such as...
Messages
Error 0xc0202055: Data Flow Task: The column delimiter for column "Column 4" was not found.
(SQL Server Import and Export Wizard)
This only seems to happen when a quote and comma are in a text value together. Values like
AB"AB which becomes "AB""AB"
or
AB,AB which becomes "AB,AB"
work fine.
Here are some example rows...
"1464885","LEVER WM","","B","MP17"
"1465075",":PLT-BC !!NOTE!!","","B",""
"1465076","BRKT-STR MTR !NOTE!","","B",""
"1465172",":BRKT-SW MTG !NOTE!","","B","MP16"
"1465388","BUSS BAR !NOTE!","","B","MP10"
"1465391","PLT-BLKHD ""NOTE""","","B","MP20"
"1465564","SPROCKET:13TEETH,74MM OD,66MM","ID W/.25"" SETSCR","B","MP6"
"S01266330002","CABLE:224"",E122/261,8 CO","","B","MP11"
The last row is an example of the problem - the "", causes the error.
I've had MAJOR problems with SSIS. Things that Access, Excel and even DTS seemed to do very well, SSIS chokes on. Variable record-length data is another problem but, yes, these embedded qualifiers are a major problem, especially if you do not have access to the import files because they're on someone else's server that you pay to gain access to and might even be 4 to 5 GB in size! You can't just do a "replace all" on that for every import.
You may want to check into this at Microsoft Downloads called "UnDouble" and here is another workaround you might try.
It seems that with SSIS in SQL Server 2008, the bug is still there. I don't know why they haven't addressed this in the parser, but it's like we went back in time with SSIS in basic import functionality.
UPDATE 11-18-2010: This bug still exists in SSIS. Amazing.
How about just:
Search/replace all "", with ''; (fix all the broken fields)
Search/replace all ;''; with ,"", (to "unfix" properly empty fields.)
Search/replace all '';''; with "","", (to "unfix" properly empty fields which follow a correct encapsulation of embedded delimiters.)
That converts your original to:
"1464885","LEVER WM","","B","MP17"
"1465075",":PLT-BC !!NOTE!!","","B",""
"1465076","BRKT-STR MTR !NOTE!","","B",""
"1465172",":BRKT-SW MTG !NOTE!","","B","MP16"
"1465388","BUSS BAR !NOTE!","","B","MP10"
"1465391","PLT-BLKHD ""NOTE""","","B","MP20"
"1465564","SPROCKET:13TEETH,74MM OD,66MM","ID W/.25"" SETSCR","B","MP6"
"S01266330002","CABLE:224'';E122/261,8 CO","","B","MP11"
Which seems to run the gauntlet fine in SSIS. You may have to apply the third replacement repeatedly to account for 3 empty fields in a row ('';'';'';, etc.), but the bottom line here is that when you have embedded text qualifiers, you have to either escape them or replace them. Let this be a lesson in your CSV creation processes going forward.
Microsoft says doubled double quotes inside double-quote-delimited fields just don't work. A fix is planned for the end of 2011...
In the meantime we will have to use workarounds like those described in the other answers.
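If preprocessing the file outside SSIS is an option, one possible workaround (a sketch, not something proposed in the answers here) is to parse the file with a CSV reader that understands doubled quotes and re-emit it with a delimiter that needs no text qualifier. Python's standard csv module treats "" inside a quoted field as one literal quote; the helper name to_tab_delimited is made up, and the sketch assumes no field contains a tab or a line break.

```python
import csv
import io


def to_tab_delimited(csv_text: str) -> str:
    """Parse quote-delimited CSV and re-emit it as unquoted tab-delimited text."""
    lines = []
    for fields in csv.reader(io.StringIO(csv_text)):
        # Without a text qualifier, a tab inside a field would shift columns.
        if any("\t" in field for field in fields):
            raise ValueError("field contains a tab; pick another delimiter")
        lines.append("\t".join(fields))
    return "\n".join(lines)


# The problem row from the question.
row = '"S01266330002","CABLE:224"",E122/261,8 CO","","B","MP11"\n'
print(next(csv.reader(io.StringIO(row))))
print(to_tab_delimited(row))
```

The resulting file could then be imported as tab-delimited with no text qualifier, which sidesteps the embedded quote-plus-comma sequence.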
I would just do a search/replace for ", and replace it with ,
Do you have access to the original file?

Resources