Format of CSV string data for XGBoost model - amazon-sagemaker

I'm brand new to Sagemaker and I'm having trouble finding examples of importing CSV-formatted String data into XGBoost.
Specifically, can it handle foreign characters (if yes, what encoding)? How does it know which column (variable) I will need it to make a prediction on?
Thanks.

In order to use SageMaker XGBoost with csv input, you'll need to prepare your dataset in format label, feature_1, feature_2, ... in each row.
XGBoost can only handle numeric values as input data. If you have foreign characters in the input data, you'll need to encoding it first before feeding it to XGBoost. Depending on your dataset, you should use the encoding method that make most sense to your data.
For csv input, SageMaker XGBoost always assume the first column to be the label/target.

Related

Uploading excel file to sql server [duplicate]

Every time that I try to import an Excel file into SQL Server I'm getting a particular error. When I try to edit the mappings the default value for all numerical fields is float. None of the fields in my table have decimals in them and they aren't a money data type. They're only 8 digit numbers. However, since I don't want my primary key stored as a float when it's an int, how can I fix this? It gives me a truncation error of some sort, I'll post a screen cap if needed. Is this a common problem?
It should be noted that I cannot import Excel 2007 files (I think I've found the remedy to this), but even when I try to import .xls files every value that contains numerals is automatically imported as a float and when I try to change it I get an error.
http://imgur.com/4204g
SSIS doesn't implicitly convert data types, so you need to do it explicitly. The Excel connection manager can only handle a few data types and it tries to make a best guess based on the first few rows of the file. This is fully documented in the SSIS documentation.
You have several options:
Change your destination data type to float
Load to a 'staging' table with data type float using the Import Wizard and then INSERT into the real destination table using CAST or CONVERT to convert the data
Create an SSIS package and use the Data Conversion transformation to convert the data
You might also want to note the comments in the Import Wizard documentation about data type mappings.
Going off of what Derloopkat said, which still can fail on conversion (no offense Derloopkat) because Excel is terrible at this:
Paste from excel into Notepad and save as normal (.txt file).
From within excel, open said .txt file.
Select next as it is obviously tab delimited.
Select "none" for text qualifier, then next again.
Select the first row, hold shift, select the last row, and select the text radial button. Click Finish
It will open, check it to make sure it's accurate and then save as an excel file.
There is a workaround.
Import excel sheet with numbers as float (default).
After importing, Goto Table-Design
Change DataType of the column from Float to Int or Bigint
Save Changes
Change DataType of the column from Bigint to any Text Type (Varchar, nvarchar, text, ntext etc)
Save Changes.
That's it.
When Excel finds mixed data types in same column it guesses what is the right format for the column (the majority of the values determines the type of the column) and dismisses all other values by inserting NULLs. But Excel does it far badly (e.g. if a column is considered text and Excel finds a number then decides that the number is a mistake and insert a NULL instead, or if some cells containing numbers are "text" formatted, one may get NULL values into an integer column of the database).
Solution:
Create a new excel sheet with the name of the columns in the first row
Format the columns as text
Paste the rows without format (use CVS format or copy/paste in Notepad to get only text)
Note that formatting the columns on an existing Excel sheet is not enough.
There seems to be a really easy solution when dealing with data type issues.
Basically, at the end of Excel connection string, add ;IMEX=1;"
Provider=Microsoft.Jet.OLEDB.4.0;Data Source=\\YOURSERVER\shared\Client Projects\FOLDER\Data\FILE.xls;Extended Properties="EXCEL 8.0;HDR=YES;IMEX=1";
This will resolve data type issues such as columns where values are mixed with text and numbers.
To get to connection property, right click on Excel connection manager below control flow and hit properties. It'll be to the right under solution explorer. Hope that helps.
To avoid float type field in a simple way:
Open your excel sheet..
Insert blank row after header row and type (any text) in all cells.
Mouse Right-Click on the head of the columns that cause a float issue and select (Format Cells), then choose the category (Text) and press OK.
And then export the excel sheet to your SQL server.
This simple way worked with me.
A workaround to consider in a pinch:
save a copy of the excel file, modify the column to format type 'text'
copy the column values and paste to a text editor, save the file (call it tmp.txt).
modify the data in the text file to start and end with a character so that the SQL Server import mechanism will recognize as text. If you have a fancy editor, use included tools. I use awk in cygwin on my windows laptop. For example, I start end end the column value with a single quote, like "$ awk '{print "\x27"$1"\x27"}' ./tmp.txt > ./tmp2.txt"
copy and paste the data from tmp2.txt over top of the necessary column in the excel file, and save the excel file
run the sql server import for your modified excel file... be sure to double check the data type chosen by the importer is not numeric... if it is, repeat the above steps with a different set of characters
The data in the database will have the quotes once the import is done... you can update the data later on to remove the quotes, or use the "replace" function in your read query, such as "replace([dbo].[MyTable].[MyColumn], '''', '')"

Flink : How to implement TypeInformation without actual number of columns in csv

I am reading a csv file through flink. csv file have a specific number of columns.
I have defined
RowCsvInputFormat format = new RowCsvInputFormat(filePath,
new TypeInformation[]{
BasicTypeInfo.STRING_TYPE_INFO,
BasicTypeInfo.STRING_TYPE_INFO,
BasicTypeInfo.STRING_TYPE_INFO,
BasicTypeInfo.STRING_TYPE_INFO
});
The code works fine if in the file all the rows have proper 4 columns.
I want to handle a scenario when few rows in the file do not have 4 columns OR there is any other issue in few rows.
How can i achieve this in flink.
If you look here at the specifications on wikipedia or the rfc4180 it seems like CSV files should only contain rows which have the same amount of columns. So it makes sense the RowCsvInputFormat would not support this.
You could read the files using readTextFile(path) and then in a flatMap() operator parse the strings into a Row object (or ignore if there are issues in a row)
env.readTextFile(params.get("input"))
.flatMap(someCsvRowParseFunction())

SAP Data Services .csv data file load from Excel with special characters

I am trying to load data from an Excel .csv file to a flat file format to use as a datasource in a Data Services job data flow which then transfers the data to an SQL-Server (2012) database table.
I consistently lose 1 in 6 records.
I have tried various parameter values in the file format definition and settled on setting Adaptable file scheme to "Yes", file type "delimited", column delimeter "comma", row delimeter {windows new line}, Text delimeter ", language eng(English) and all else as defaults.
I have also set "write errors to file" to "yes" but it just creates an empty error file (I expected the 6,000 odd unloaded rows to be in here).
If we strip out three of the columns containing special characters (visible in XL) it loads a treat so I think these characters are the problem.
The thing is, we need the data in those columns and unfortunately, this .csv file is as good a data source as we are likely to get and it is always likely to contain special characters in these three columns so we need to be able to read it in if possible.
Should I try to specifically strip the columns in the Query source component of the dataflow? Am I missing a data-cleansing trick in the query or file format definition?
OK so didn't get the answer I was looking for but did get it to work by setting the "Row within Text String" parameter to "Row delimiter".

SSIS - Converting DT_TEXT(Length 11,000 Characters) to DT_STR and trim to 1,000 characters

I want to read data from text file (.csv), truncate one of the column to 1000 characters and push into SQL table using SSIS Package.
The input (DT_TEXT) is of length 11,000 characters but my Challenge is ...
SSIS can convert to (DT_STR) only if Max length is 8,000 characters.
String operations cannot be performed on Stream (DT_TEXT data type)
Got a workaround/solution now;
I truncate the text in Flat File Source and selected the option to Ignore the Error;
Please share if you find a better solution!
FYI:
To help anyone else that finds this, I applied a similar concept more generally in a data flow when consuming a text stream [DT_TEXT] in a Derived Column Transformation task to transform it to [DT_WSTR] type to my defined length. This more easily calls out the conversion taking place.
Expression: (DT_WSTR,1000)(DT_STR,1000,1252)myLargeTextColumn
Data Type: Unicode string [DT_WSTR]
Length: 1000
*I used 1252 codepage since my DT_TEXT is UTF-8 encoded.
For this Derived Column, I also set the TruncationRowDisposition to RD_IgnmoreFailure in the Advanced Editor (or can be done in the Configure Error Output, setting Truncation to "Ignore failure")
(I'd post images but apparently I need to boost my rep)

Why does a number imported to SQL Server from Excel contain the letter e?

I have got a excel sheet which inserts data in to SQL Server, but noticed for a particular field, the data is being inserted with e, this particular field is of type varchar and size 20.
Why is e being inserted when the actual data for these respective fields is 54607677038, 77200818179 and 9920996.
Help me out
Thanks in anticipation.
You may think of '2007038971' as being just a string of numbers (some kind of article code, I guess). Excel just sees numbers and treats it as a numerical value. It probably is right aligned (default for numbers) and not left-aligned (default for strings).
When asked to store in as a string, it 'helpfully' formats that number into a string, thereby introducing that "e" notation (the value 2007038971 is about 2.00704 * 10^9).
You need to convince Excel that that code really is a string, maybe by adding a quote in front of it.
How about this. When you read value from excel, then convert ToString() and insert into DB. Need to change relevant data type based on data in your excel.
double doub = 2.00704e+009;
string val = doub.ToString();

Resources