When I run a SageMaker pipeline from SageMaker Studio, each pipeline step (e.g., ProcessingStep, TrainingStep, LambdaStep) has an Output tab.
I tried searching the SDK docs but couldn't find anything related. How can I display metrics from custom processing, training, or Lambda containers on that tab?
For built-in algorithms
In the case of built-in algorithms, I refer to the official "Define Metrics" guide (chapter "Using a built-in algorithm for training").
For custom algorithms
The problem is basically solved in two steps:
Within your script (e.g., the training script), you need to log the metric you want to intercept.
In the simplest case, a print/log statement:
print(f"New best val_loss score: {your_metric}")
Within the definition of your pipeline component, you should set the metric_definitions parameter, for example in Estimators:
metric_definitions (list[dict[str, str]] or list[dict[str, PipelineVariable]]) – A list of dictionaries that defines the metric(s) used to evaluate the training jobs. Each dictionary contains two keys: 'Name' for the name of the metric, and 'Regex' for the regular expression used to extract the metric from the logs. This should be defined only for jobs that don't use an Amazon algorithm.
To use it for the above example, it will then suffice to define:
metric_definitions=[
    {'Name': 'val_loss', 'Regex': 'New best val_loss score: ([0-9\.]+)'}
]
P.S.: Remember that Estimators also include derived classes such as SKLearn, PyTorch, etc.
At this point, on the step where you defined the metrics to intercept, you will find a key-value pair with the last intercepted value in the SageMaker Studio screen, and also a graph to monitor progress (even during training) in CloudWatch metrics.
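For completeness, here is a minimal sketch of how the two pieces fit together in a pipeline training step. The framework choice, versions, role, instance settings, and input channel are placeholders and not part of the original answer; only the metric_definitions part comes from above.

from sagemaker.pytorch import PyTorch
from sagemaker.workflow.steps import TrainingStep

# Estimator for the custom training script that emits the log line shown earlier.
estimator = PyTorch(
    entry_point="train.py",                 # placeholder: script containing the print() above
    framework_version="1.13",
    py_version="py39",
    role="arn:aws:iam::111122223333:role/SageMakerRole",  # placeholder execution role
    instance_count=1,
    instance_type="ml.m5.xlarge",
    metric_definitions=[
        {"Name": "val_loss", "Regex": r"New best val_loss score: ([0-9\.]+)"},
    ],
)

# The intercepted metric shows up on this step's tab in Studio and in CloudWatch.
step_train = TrainingStep(
    name="TrainModel",
    estimator=estimator,
    inputs={"training": "s3://my-bucket/train"},  # placeholder input channel
)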
Related
I am looking at building an application using gstreamer but first have some questions regarding its capabilities with respect to a desired use case.
Say I wanted to build a pipeline that processes video data in a similar way as depicted below.
Videosrc -> Facedetect -> Crop -> Videosink
What is the canonical method for taking metadata produced on each frame by a given video filter (e.g., the bounding box from a facial detection filter) and passing it to a succeeding filter to operate on (e.g., the Crop filter cropping each image to the bounding box provided by Facedetect)?
I know there are properties and dynamic properties, but as far as I can tell from the docs, both require knowing what you want to happen when you construct the pipeline.
I also know that you can attach metadata to the GstBuffer object, which could potentially be used, but that would need an agreed-upon interface, which doesn't seem very portable and may lack support across many elements with the same capabilities.
Does anyone know what the mechanism is behind the hyperparameter tuning job in AWS SageMaker?
Specifically, I am trying to do the following:
Bring my own container
Minimize cross entropy loss (this will be the objective metric of the tuner)
My question is: when we define the hyperparameters in the HyperParameterTuner class, do they get copied into /opt/ml/input/config/hyperparameters.json?
If so, should one adjust the training image so that it uses the hyperparameters from /opt/ml/input/config/hyperparameters.json?
Edit: I've looked into some sample HPO notebooks that AWS provides, and they seem to confuse me more. Sometimes they use argparse to pass in the HPs. How is that passed into the training code?
So I finally figured it out; I had it wrong all along.
The file /opt/ml/input/config/hyperparameters.json is there. It just has slightly different content compared to a regular training job: the params to be tuned as well as the static params are contained there, along with the metric name.
So here is the structure; I hope it helps:
{
    "_tuning_objective_metric": "your-metric",
    "dynamic-param1": "0.3",
    "dynamic-param2": "1",
    "static-param1": "some-value",
    "static-paramN": "another-value"
}
If you bring your own container, you should consider pip installing the SageMaker Training Toolkit. This will allow you to receive hyperparameters as command-line arguments to your training script (to be processed with argparse), saving you the need to read and parse /opt/ml/input/config/hyperparameters.json yourself.
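For reference, a minimal sketch of both routes inside a custom container, based on the structure shown above (the parameter names are just the placeholders from that example):

import argparse
import json

HYPERPARAMS_PATH = "/opt/ml/input/config/hyperparameters.json"

# Option 1: read and parse the file yourself.
with open(HYPERPARAMS_PATH) as f:
    hyperparams = json.load(f)

# Skip SageMaker's bookkeeping keys such as "_tuning_objective_metric".
hyperparams = {k: v for k, v in hyperparams.items() if not k.startswith("_")}

# All values arrive as strings, so cast them yourself.
dynamic_param1 = float(hyperparams["dynamic-param1"])
dynamic_param2 = int(hyperparams["dynamic-param2"])

# Option 2: with the SageMaker Training Toolkit installed, the same values are
# passed as command-line arguments instead, so argparse is enough.
parser = argparse.ArgumentParser()
parser.add_argument("--dynamic-param1", type=float, default=0.3)
parser.add_argument("--dynamic-param2", type=int, default=1)
args, _ = parser.parse_known_args()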
Both the Kaplan-Meier method and logistic regression have their own feature selection approaches. I want to use another method to pick the best features, for example backward stepwise feature selection. Is it possible to use this sort of method instead or not?
My data has more than 130 features and about 3000 individuals. Since it is medical [cancer] data, I don't want to use simple methods.
Further information about the project can be seen here, in the order of the steps I should do:
Preprocessing the data
Separating it into train and test sets
Data imputation on the train data
Feature selection using the train data
Training the models, which are Kaplan-Meier and logistic regression
Testing the models
Please let me know whether it is wrong to use a different feature selection method for them or not.
I would also appreciate any tips about the models I have listed.
Basically, there are four types of feature selection (fs) techniques, namely:
1.) Filter-based fs
2.) Wrapper-based fs
3.) Embedded fs techniques
4.) Hybrid fs techniques
Each has its own advantages and disadvantages. For example, filter-based fs is used when you want to determine whether "one" feature is important to the output variable. So if you have 400 features in your dataset, you would have to repeat this 400 times!
Wrapper-based methods (as you mentioned in your question), on the other hand, do this in one step. But they are prone to overfitting, whereas filter-based methods are not.
Embedded methods use tree-based models for fs purposes.
I do not have enough knowledge about hybrid methods.
I would say you could use a wrapper-based technique like RFECV, since you say you do not want to use simple filter techniques; a minimal sketch is below.
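Here is that sketch using scikit-learn's RFECV with logistic regression (one of your two models) as the base estimator. X_train, y_train, and X_test are assumed to be your already-imputed splits, and the scoring metric is only an example; pick one suited to your clinical outcome.

from sklearn.feature_selection import RFECV
from sklearn.linear_model import LogisticRegression

selector = RFECV(
    estimator=LogisticRegression(max_iter=1000),
    step=1,              # drop one feature per elimination round
    cv=5,                # 5-fold cross-validation to score each candidate subset
    scoring="roc_auc",   # example metric; choose what fits your outcome
)
selector.fit(X_train, y_train)

print("Number of selected features:", selector.n_features_)
X_train_selected = selector.transform(X_train)
X_test_selected = selector.transform(X_test)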
The current documentation doesn't fully describe the rules for how a user can build the phrases that trigger an operation, or the possible answers. Could you please provide the following:
for "action.devices.traits.OnOff" trait:
the full set of phrases that user can use to trigger turning on/off OR rules to build them;
possible response phrases from Google Assistant if turning on/off was started successfully OR rules to build them.
for "action.devices.traits.Cook" trait (for two ways of parameters combination: cookingMode + foodPreset OR cookingMode + foodPreset + quantity + unit (ounces)):
the full set of phrases that user can use to trigger cook operation OR rules to build them;
possible response phrases from Google Assistant if cook operation was started successfully OR rules to build them;
the full set of phrases that user can use to cancel cook operation OR rules to build them;
possible response phrases from Google Assistant if cancellation of cook operation was started successfully OR rules to build them.
what additional words could the user add when framing this phrase for these two traits? For example, “me”, “please”, “my new {foodPreset}”, “a cup of {foodPreset}” (“cup” is not a “unit”) and any other words and phrases. What are the rules for this?
are there any recommendations for “foodPreset” parameter (words amount, words complexity)?
There are no strict rules. Traits can be triggered through natural language processing, so you can expect any relevant phrase to work. The documentation provides examples for OnOff and Cook, but triggering isn't limited to the provided phrases.
Responses are also based around good voice design and natural language, so there aren't any strict rules about what to expect. Additionally, such requests and responses may change as the platform continues to evolve. The NLP system is able to extract meaning from larger statements, so general phrases like "turn on the light", "turn on the light please", and "please turn on the light for me" should all match.
With regard to foodPreset, the key can be whatever you want for your service. The synonyms should be fairly varied and include any possible way that an individual would refer to that food item.
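For illustration only, a foodPresets entry in your SYNC attributes might look roughly like the sketch below. The preset name, synonyms, and units are invented, and the field names are my recollection of the Cook trait's attribute schema, so double-check them against the Cook trait reference.

{
  "supportedCookingModes": ["COOK"],
  "foodPresets": [
    {
      "food_preset_name": "white_rice",
      "supported_units": ["CUPS", "OUNCES"],
      "food_synonyms": [
        {
          "synonym": ["white rice", "rice", "long grain rice"],
          "lang": "en"
        }
      ]
    }
  ]
}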
We are trying to build a histogram of session length in a given time period. Currently, we have sess:start and sess:end events which contain the session id and user id. I am wondering what's the best way to compute this data? Can this be achieved using the funnel API?
Have you checked out the recipes section in Keen IO's docs? Here is an excerpt from the section on histogram recipes for Session Length that might be really helpful.
Excerpt
To create a histogram for session lengths, like the one shown above, you can run a count analysis on an event collection for completed sessions (e.g. session_end). Along the x-axis you'll have segments of time lapsed in a session, and along the y-axis you'll have the percentage of sessions that fit into a given session length cohort.
Note: this recipe incorporates the D3 histogram recipe, which is explained further in the documentation.
histogram('chart-1', {
  segment_length: 60, // In seconds
  data_points: 10, // i.e. There will be 10 bars on our chart
  analysis_type: 'count',
  query_parameters: {
    event_collection: 'session_end',
    timeframe: timeframe,
    filters: []
  }
});
More information
Keen IO - Analytics for Developers
Keen IO - Documentation
Code excerpt: Keen IO - Recipes for Histograms
Lots of good stuff behind the link that Stephanie posted.
One extra thing I'll venture is that putting an integer sess:length property in the sess:end event would make things easier. You'd have to keep the start time for each session somewhere in your database so that you can compute the difference for the sess:end event. But then you'd have the difference as a plain old number of seconds and can do any type of numerical analysis on it.
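As a small sketch of that suggestion (the property and collection names just mirror the ones in the question, and the in-memory store is a stand-in for your database):

import time

session_store = {}  # stand-in for wherever you persist session start times

def on_session_start(session_id, user_id):
    # Record when the session began so sess:end can compute a duration later.
    session_store[session_id] = time.time()

def build_session_end_event(session_id, user_id):
    started_at = session_store.pop(session_id)
    return {
        "session_id": session_id,
        "user_id": user_id,
        "sess:length": int(time.time() - started_at),  # plain number of seconds
    }

# Send the returned dict as your sess:end event; sess:length can then feed
# count, average, or percentile analyses, or the histogram recipe above.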