SageMaker ClientError: Detected non integer labels in the dataset - amazon-sagemaker

I created a SageMaker training job to train on a toy, tabular, multiclass (3 classes) dataset, and the job failed with the following error:
ClientError: Detected non integer labels in the dataset. For classification tasks, the labels should be integers between 0 to (num_classes-1), exit code: 2
It sounds like they're saying that for the classes (labels) they want to see values between 0 and 2 in this case, as I have 3 classes.
I have set num_classes to 3 and have validated that I only have 3 unique values in the rightmost column of my dataset: 0, 1, and 2
I've set feature_dim to 3. I've removed the headers from my dataset. My raw data looks like 5,000 lines of this:
csv snapshot
Can anyone guess as to what might be causing this error?

I wanted to answer this because at the time of this writing, the error message I received returns 0 hits on Google.
It turns out that the issue was that SageMaker expects the class labels to appear in the first column, by default. This is different from how datasets are typically structured. So when I got this error message, SageMaker was looking at my first column which had all sorts of float values. I fixed it by moving my labels to the first column.
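For anyone who wants to do the reordering in code, here is a minimal sketch using only the Python standard library. It assumes a headerless CSV with the label in the last column, as in the question; the function names `move_label_first` and `reorder_csv` are my own and not part of any SageMaker API.

```python
import csv
import io


def move_label_first(rows, label_index=-1):
    """Return the rows with the label column moved to position 0."""
    reordered = []
    for row in rows:
        row = list(row)
        label = row.pop(label_index)  # take the label out of its old position
        reordered.append([label] + row)
    return reordered


def reorder_csv(text):
    """Reorder a headerless CSV string so the label comes first."""
    reader = csv.reader(io.StringIO(text))
    buffer = io.StringIO()
    csv.writer(buffer, lineterminator="\n").writerows(move_label_first(reader))
    return buffer.getvalue()


sample = "0.5,1.2,3.4,2\n0.1,0.2,0.3,0\n"
print(reorder_csv(sample))
```

The values are handled as strings throughout, so the labels are written back exactly as they appeared in the input.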

Related

Long calculation times with XLOOKUP vs INDEX-MIN-COLUMN

I'm using this formula =IF(B24="","",IFERROR(INDEX(Sheet3!$C$3:$EE$3,,MIN(IF(Sheet3!$C$4:$EE$23=(Sheet2!C24&$K$18),COLUMN(Sheet3!$C:$EE)))-2),"NF")) to return a cell value in the top row of an array - a date in this case.
The search criteria is a combination of a unique project number and a 2 digit status alphanumerical code for the project. The array consists of 23 rows where combinations of the unique numbers are found, each with different status codes.
So essentially, I'm building a FILTERED project status dashboard that returns dates linked to the relevant project status.
The code above is inspired by ( LINK ), which uses a very similar layout but with town suburbs linked to postal codes instead of project numbers and status codes. The formula works well (though it is not entered as an array formula), but I don't have a single formula in the sheet; I have 3 300 occurrences of this formula.
The problem comes in when the user changes the FILTER - Excel recalculates the entire dashboard and that takes anywhere from 2 to 5 minutes to run. You hit the escape button and cancel the calculation after setting the filter, but Excel just starts calculating again after a few seconds. After that, Excel's response is sluggish and almost unusable. Yes - our hardware is pretty weak ...
I tried XLOOKUP as well, but can't set the "lookup_array" to an array ( Sheet3!$C$4:$EE$23 ) because it doesn't match the "return_array" ( Sheet3!$C$3:$EE$3 ). Concatenating the lookup arrays with & works, but then you'd have to do that for all 23 rows, and again, multiply that by 3 300.
I thought of creating a UDF, but the function will still be called every time Excel recalculates after filtering... 3 300 calls ...
Any ideas on how to make the INDEX version run faster, or make the XLOOKUP accept the lookup_array as Sheet3!$C$4:$EE$23 in the hopes that it'll run faster?
Thank you!
Not really an elegant solution, but it works.
I imported the dataset into a helper sheet, where I combined the cell value with the corresponding value in Column A for each row ( a name in this case ) and the date from row 1 for each column, using underscore as a delimiter.
This new data range was then given a unique name, EE in this case.
On a second helper sheet, I entered this formula =INDEX(Filtered,1+INT((ROW('Sheet1'!C3)-1)/COLUMNS(Filtered)),MOD(ROW('Sheet1'!C3)-1+COLUMNS(Filtered),COLUMNS(Filtered))+1) and dragged it down until it returned a #REF! error, then went back one row before the error.
This transposes all the data into a single column G. Using =UNIQUE(SORT(FILTER(B3:B3240,B3:B3240<> "",""))) then gives me a filtered list of unique values in column H that I then run
=IF(H3="","",LEFT(H3, SEARCH("_",H3,1)-1)) for the first data value in I, and
=IF(H3="","",MID(H3, SEARCH("_",H3) + 1, SEARCH("_",H3,SEARCH("_",H3)+1) - SEARCH("_",H3) - 1)) for the middle data value in J, and
=IF(H3="","",IFERROR(TEXT(RIGHT(H3,5),"yyyy-mm-dd"),"NF")) for the last data value in K.
Then just run XLOOKUP across columns I, J and K.
Runs quick and easy and solves a few of the other issues I had as well.
The second data set has just over 35 000 rows - still works well and fast.
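The idea behind the helper sheets is to build the lookup table once instead of rescanning the whole grid in every one of the 3 300 cells. A rough Python sketch of the same idea follows; the sample data, the name `build_lookup`, and the choice to keep the leftmost match are all my own assumptions for illustration.

```python
def build_lookup(header_dates, rows):
    """Map each (cell value, row name) pair to the header date of its column.

    header_dates: the dates from the top row, one per data column.
    rows: lists of the form [row name, cell, cell, ...].
    """
    lookup = {}
    for name, *cells in rows:
        for date, cell in zip(header_dates, cells):
            if cell != "":
                # Keep the leftmost match for a repeated pair (an assumption).
                lookup.setdefault((cell, name), date)
    return lookup


# Hypothetical sample data: project number + status code in each cell.
dates = ["2020-01-01", "2020-02-01"]
grid = [
    ["Alice", "P1AB", ""],
    ["Bob", "P2CD", "P1AB"],
]
table = build_lookup(dates, grid)
print(table.get(("P1AB", "Bob"), "NF"))
```

Once the table exists, each lookup is a single dictionary access, which is roughly what the XLOOKUP across the split columns achieves in the workbook.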

Google Data Studio : how to obtain a SUM related to a COUNT_DISTINCT?

I have a dataset including 3 columns :
ID transac (The unique ID of the transaction - Dimension)
Source (The source of the transaction - Dimension)
Amount € (The amount of the transaction - Stat)
screenshot of my dataset
To count the number of transactions (for one or more sources), I use the COUNT_DISTINCT function.
I want to sum the transaction amounts (for one or more sources), but I don't want to add up the amounts of transactions that share the same ID!
Is there a way to do this calculation with a DataStudio function?
Thanks for your answers. :-)
EDIT : I saw that we could do this type of calculation via SQL here and I would like to do this in DataStudio (so that I don't have to pre-calculate the amounts per source.)
IMO, your dataset contains wrong data. Each value should be relative only to that line, but this is not the case: if the total is 20, each line should describe the contribution of that line to the total. With 4 sources, each line should be 5, or some other values that sum to 20.
To solve it in DataStudio, you need something like CALCULATE function in PowerBI, but currently DataStudio doesn't support this feature.
But there are some options to consider to repair your data:
If you're sure there are always 4 sources, just create a new calculated field with the expression Amount/4 and SUM it. It is not an elegant solution, but it works.
If your data source is Google Sheets, you can easily repair the data using formulas, like in this example:
Link to spreadsheet
For this spreadsheet, I used this formula in adjusted_amount column: =C2/COUNTIF(A:A,A2). With this column in DataStudio, just use the usual SUM aggregation function to summarize it correctly.
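If the data is prepared in a script rather than in Google Sheets, the same adjustment as =C2/COUNTIF(A:A,A2) can be sketched in Python. This is only an illustration; the sample rows and the name `adjusted_amounts` are made up.

```python
from collections import Counter


def adjusted_amounts(rows):
    """Divide each amount by the number of rows sharing its transaction ID.

    rows: (transaction_id, source, amount) tuples.
    """
    occurrences = Counter(transaction_id for transaction_id, _, _ in rows)
    return [
        (transaction_id, source, amount / occurrences[transaction_id])
        for transaction_id, source, amount in rows
    ]


# Hypothetical data: transaction T1 appears once per source with its full amount.
data = [("T1", "A", 20.0), ("T1", "B", 20.0), ("T2", "A", 10.0)]
adjusted = adjusted_amounts(data)
print(sum(amount for _, _, amount in adjusted))
```

Summing the adjusted column counts each transaction once overall. The per-source totals then show each source's share of the transaction, not the full amount.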

Count Unique Values with Conditions Excel

I'm trying to create a simple chart to show how many employees are on a given corrective action level within a specified date range. The issue I'm running into is this:
The log shows associate Test 1 received a verbal warning on 8/14/19 for their productivity, then a first written warning on 8/24/19, then a final written warning that was processed later but took place on 8/23/19.
The formula I wrote will show this as 1 person at each level of correction (verbal, first written, and final written). I want it to only count the highest-level warning for each person. So the chart would only count 1 entry at the final written warning level.
What am I missing to accomplish this?
Raw Data:
Summary Chart:
Summary Chart Formula (across from Verbal level):
={SUM(--(FREQUENCY(IF(('2019'!$C:$C<>"")*('2019'!$F:$F=$B$2)*('2019'!$D:$D>=$B$3)*('2019'!$D:$D<=$C$3)*('2019'!$E:$E=$B5)*('2019'!$E:$E<>$B6),MATCH('2019'!$C:$C,'2019'!$C:$C,0)),ROW('2019'!$C:$C)-ROW('2019'!$C$2)+1)>0))}
Cracked it! I added two helper columns to the raw data in between Step and Reason.
The first, Level, is a VLOOKUP that converts the Step to a numerical value (in order of severity, the lowest being a Verbal, highest being an Exit).
The second, Max, is a MAXIFS formula to flag which step is the highest severity by associate ID and Reason:
=IF(MAXIFS(F:F,C:C,C2,H:H,H2,D:D,">="&Summary!$B$3,D:D,"<="&Summary!$C$3)=F2,"X","")
The formula in the summary chart now reads as follows:
=SUM(--(FREQUENCY(IF(('2019'!$C:$C<>"")*('2019'!$I:$I=$B$2)*('2019'!$D:$D>=$B$3)*('2019'!$D:$D<=$C$3)*('2019'!$F:$F=$B5)*('2019'!$H:$H="X"),MATCH('2019'!$C:$C,'2019'!$C:$C,0)),ROW('2019'!$C:$C)-ROW('2019'!$C$1)+1)>0))
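The logic of the two helper columns (rank each step, then keep only the highest step per associate and reason) can also be sketched outside Excel. The Python below is an illustration only: the severity ranking and sample log are assumptions, and the date-range filter from the MAXIFS formula is left out for brevity.

```python
def count_highest_levels(log, severity):
    """Count associates by their most severe step, per (associate, reason).

    log: (associate, reason, step) tuples.
    severity: maps each step name to a rank; higher means more severe.
    """
    highest = {}
    for associate, reason, step in log:
        key = (associate, reason)
        if key not in highest or severity[step] > severity[highest[key]]:
            highest[key] = step

    counts = {}
    for step in highest.values():
        counts[step] = counts.get(step, 0) + 1
    return counts


# Assumed ranking and sample entries, loosely following the question.
ranks = {"Verbal": 1, "First Written": 2, "Final Written": 3}
entries = [
    ("Test 1", "Productivity", "Verbal"),
    ("Test 1", "Productivity", "First Written"),
    ("Test 1", "Productivity", "Final Written"),
    ("Test 2", "Productivity", "Verbal"),
]
print(count_highest_levels(entries, ranks))
```

Here "Test 1" is counted once, at the final written level, which is the behavior the question asked for.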

Google Data Studio piechart from column with multiple values per cell

I have an Excel spreadsheet from a questionnaire. One of the questions was in checkbox format. The results of this question are held in a single column, and where the user has selected more than one answer, the answers are separated by commas.
What devices do you own? Mobile, PC, Laptop, Tablet
So in a single cell I get 'Mobile,PC' when these two are selected.
I am using Google Data Studio to visualise the data, but stuck on how to create a graph that shows all the values individually.
At the minute I get a combination for every value. So I get a value of 1 for 'Mobile,PC' rather than a value of 1 for 'PC' and 1 for 'Mobile'.
Google Data Studio doesn't allow COUNTIF statements, so I'm a bit lost.
I have tried TRIM, COUNTIF and REGEX but none have worked.
Count(REGEXP_MATCH(Device, "PC"))
I'm a bit lost on this, tried so many combinations. If someone can put me on the right track I would be very grateful
I don't think you'll be able to achieve that with a pie chart without changing your data source first as it needs one dimension (Device) and then one count metric which your data doesn't seem to support.
You could create 4 metrics like
SUM(
  CASE
    WHEN REGEXP_MATCH(Device, ".*PC.*") THEN 1
    ELSE 0
  END
)
And put them into a Stacked bar / column chart. You might need to create a dimension that has a single value to avoid having multiple bars/columns.
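If changing the data source is an option, the reshaping that a pie chart needs (one row per device rather than one row per response) can be sketched in Python. This is a minimal example with made-up responses; `count_devices` is a hypothetical helper name.

```python
from collections import Counter


def count_devices(responses):
    """Count each device separately from comma-separated answer cells."""
    counts = Counter()
    for cell in responses:
        for device in cell.split(","):
            device = device.strip()  # tolerate "Mobile, PC" as well as "Mobile,PC"
            if device:
                counts[device] += 1
    return counts


# Hypothetical questionnaire answers, one string per respondent.
answers = ["Mobile,PC", "PC", "Mobile, Tablet", ""]
print(count_devices(answers))
```

The resulting device/count pairs give the single dimension and single metric that the pie chart expects.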

How do I calculate the percentage of a count function?

I am trying to take the percentage of a count function so to create a MS BIDS report resembling this excel file:
Excel Close Rate Summary
The unique identifier for the opportunities is the field "opportunityid", so I am using COUNT(Fields!opportunityid.Value) to determine the number of cases in each stage. I want to write an expression that will return the percentage of cases in each stage per creation month. Which can be seen in the above excel screenshot.
This is my current MS BIDS report when I preview it
To be more specific, I want to have the percentage of "Active" and "New" opportunities in January to represent 67% and 33% respectively. 67% comes from 4/6. The 4 comes from the active opportunities out of the 6 opportunities created in January. Likewise, the 33% comes from the 2 new opportunities out of the 6 that were created in January.
There are more stage names than Active and New. Other options include New, Warm, Hot, Implementation, Active, Hibernate or Canceled. This is relevant to mention because I have tried to create an expression that counts based on the number of opportunities with a specific stage name, but have been unsuccessful.
Currently the expression I am using to calculate the percentage is:
=COUNT(Fields!new_rptstage.Value)/SUM(COUNT(Fields!opportunityid.Value),"GroupbyStageName")
Based on this expression, I am only able to get 1/1 or 100% for each of the stage names. I have tried a bunch of variations of the above expression by changing the scope, but have been unsuccessful in getting the desired results. Can someone explain how to correct this?
SAMPLE DATA:
In the sample data, I want the expression to be in the percentage column. The percentage should be the # of cases in a particular stage for the total cases that month. So looking at the above picture:
Active February 54 54/168 [have 54/168 display as a percentage]
Warm February 8 8/168
etc.
EDIT:
These are the expressions that may help show the underlying data in the chart.
The creation month expression is
=Fields!MonthCreated.Value & " " & year(Fields!createdon.Value)
The percent expression is listed above.
You don't want to use the COUNT() function. COUNT(*) returns a count of the number of rows that have a value. It doesn't return the actual value.
Since you've only shown a screenshot of your report, I don't know how your underlying data columns relate to it, but what you want to do for your Percent column expression is this:
This is pseudocode because I don't know your dataset field names:
CaseCount.Value / SUM(CaseCount.Value)
EDIT: Now that I better understand how your data relates to your report, I think the only change you need to make to your existing formula is casting it to a decimal type. It's probably rounding all fractions up to 1.
Try this for the expression in your percentage column:
=CDbl(COUNT(Fields!new_rptstage.Value))/CDbl(SUM(COUNT(Fields!opportunityid.Value),"GroupbyStageName"))
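To make the intended arithmetic concrete (the count per stage divided by the total count for that month), here is a small Python sketch. It is not SSRS code; the sample data mirrors the January example from the question, and `stage_percentages` is a made-up name.

```python
from collections import Counter


def stage_percentages(opportunities):
    """Return each stage's share of the opportunities created in its month.

    opportunities: (month, stage) tuples, one per opportunity.
    """
    per_month = Counter(month for month, _ in opportunities)
    per_month_stage = Counter(opportunities)
    return {
        (month, stage): count / per_month[month]
        for (month, stage), count in per_month_stage.items()
    }


# January example from the question: 4 Active and 2 New out of 6.
january = [("January", "Active")] * 4 + [("January", "New")] * 2
shares = stage_percentages(january)
print(f"{shares[('January', 'Active')]:.0%}")
print(f"{shares[('January', 'New')]:.0%}")
```

The denominator is scoped to the month, not to the stage, which is the point to check when choosing the scope argument in the report expression.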
