I'm running a training job on SageMaker. The job doesn't fully complete and hits the MaxRuntimeInSeconds stopping condition. The documentation says that when the job is stopping, the model artifact will still be saved. I've attached the status progression of my training job below. It looks like the training job finished correctly, yet the output S3 folder is empty. Any ideas on what is going wrong here? The training data is located in the same bucket, so the job should have everything it needs.
From the status progression, it seems that the training image download completed at 15:33 UTC, and by that point the stopping condition had already been triggered based on the MaxRuntimeInSeconds parameter you specified. From then, it takes 2 minutes (15:33 to 15:35) to save any available model artifact, but in your case the training process never started at all; the only thing that completed was downloading the pre-built image containing the ML algorithm. Please refer to the following lines from the documentation, which say that saving the model is subject to the state the training process is in. You could try increasing MaxRuntimeInSeconds and running the job again. Also, if you have set MaxWaitTimeInSeconds, please check its value: it must be equal to or greater than MaxRuntimeInSeconds.
Please find the excerpts from AWS documentation :
"The training algorithms provided by Amazon SageMaker automatically
save the intermediate results of a model training job when possible.
This attempt to save artifacts is only a best effort case as model
might not be in a state from which it can be saved. For example, if
training has just started, the model might not be ready to save."
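For reference, this is roughly where that limit is set when the training job is created with boto3. The job name, image URI, role ARN, bucket, and instance settings below are placeholders for illustration only; the point is the StoppingCondition block:

import boto3

sm = boto3.client('sagemaker')
sm.create_training_job(
    TrainingJobName='example-training-job',  # placeholder name
    AlgorithmSpecification={
        'TrainingImage': '123456789012.dkr.ecr.us-east-1.amazonaws.com/example-algo:latest',
        'TrainingInputMode': 'File',
    },
    RoleArn='arn:aws:iam::123456789012:role/ExampleSageMakerRole',
    InputDataConfig=[{
        'ChannelName': 'train',
        'DataSource': {'S3DataSource': {
            'S3DataType': 'S3Prefix',
            'S3Uri': 's3://example-bucket/train/',
            'S3DataDistributionType': 'FullyReplicated',
        }},
    }],
    OutputDataConfig={'S3OutputPath': 's3://example-bucket/output/'},
    ResourceConfig={'InstanceType': 'ml.m5.xlarge', 'InstanceCount': 1, 'VolumeSizeInGB': 50},
    # Give the algorithm enough time to actually train and write to
    # /opt/ml/model before the Stopping phase kicks in.
    StoppingCondition={'MaxRuntimeInSeconds': 86400},
)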
If MaxRuntimeInSeconds is exceeded, the model upload is only best-effort and really depends on whether the algorithm saved any state to /opt/ml/model at all before being terminated.
The two-minute wait between 15:33 and 15:35 in the Stopping stage is the maximum time between the SIGTERM and the SIGKILL signals sent to your algorithm (see the SageMaker documentation for more detail). If your algorithm traps the SIGTERM, it is supposed to use that as a signal to gracefully save its work and shut down before the SageMaker platform kills it forcibly with a SIGKILL two minutes later.
Given that the wait in the Stopping step is exactly 2 minutes, and that the Uploading step started at 15:35 and completed almost immediately at 15:35, it's likely that your algorithm did not take advantage of the SIGTERM warning and that nothing was saved to /opt/ml/model. To get a definitive answer as to whether this was indeed the case, please create a SageMaker forum post and the SageMaker team can private-message you to gather details of your job.
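As an illustration only, if you control the training script/container, trapping SIGTERM could look something like the sketch below. The checkpoint file name is made up, and how you serialize partial state depends entirely on your algorithm:

import signal
import sys

MODEL_DIR = '/opt/ml/model'  # whatever is written here is what SageMaker uploads to S3

def save_partial_model():
    # Hypothetical: persist whatever state the algorithm has so far.
    with open(MODEL_DIR + '/model.ckpt', 'w') as f:
        f.write('partial model state')

def handle_sigterm(signum, frame):
    # SageMaker sends SIGTERM roughly two minutes before SIGKILL when stopping.
    save_partial_model()
    sys.exit(0)

signal.signal(signal.SIGTERM, handle_sigterm)
# ... main training loop continues here ...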
I've been trying to restart my Apache Flink from a previous checkpoint without much luck. I've uploaded the code to GitHub, here's the main class:
https://github.com/edu05/wordcount/blob/restart/src/main/java/edu/streaming/AppWithKafka.java
It's a simple word count program, only I'd like the program to continue with the counts it had already calculated after a restart.
I've read the docs and tried a few things, but I must be missing something obvious. Could someone please help?
Also: the end goal is to produce the output of the word count program into a compacted Kafka topic. How would I go about loading the state of the app by first consuming the compacted topic, which in this case serves as both the output and the checkpointing mechanism of the program?
Many thanks
Flink's checkpoints are for automatic restarts after failures. If you want to do a manual restart, then either use a savepoint, or an externalized checkpoint.
If you've already tried this and are still having trouble, please provide more details about what you tried.
I have a script that, using Remote API, iterates through all entities for a few models. Let's say two models, called FooModel with about 200 entities, and BarModel with about 1200 entities. Each has 15 StringPropertys.
for model in [FooModel, BarModel]:
    print 'Downloading {}'.format(model.__name__)
    new_items_iter = model.query().iter()
    new_items = [i.to_dict() for i in new_items_iter]
    print new_items
When I run this in my console, it hangs for a while after printing 'Downloading BarModel'. It hangs until I hit ctrl+C, at which point it prints the downloaded list of items.
When this is run in a Jenkins job, there's no one to press ctrl+C, so it just runs continuously (last night it ran for 6 hours before something, presumably Jenkins, killed it). Datastore activity logs reveal that the datastore was taking 5.5 API calls per second for the entire 6 hours, racking up a few dollars in GAE usage charges in the meantime.
Why is this happening? What's with the weird behavior of ctrl+C? Why is the iterator not finishing?
This is a known issue currently being tracked on the Google App Engine public issue tracker under Issue 12908. The issue was forwarded to the engineering team and progress on this issue will be discussed on said thread. Should this be affecting you, please star the issue to receive updates.
In short, the issue appears to be with the remote_api script. When querying entities of a given kind, it will hang when fetching 1001 + batch_size entities when the batch_size is specified. This does not happen in production outside of the remote_api.
Possible workarounds
Using the remote_api
One could limit the number of entities fetched per script execution using the limit argument for queries. This may be somewhat tedious but the script could simply be executed repeatedly from another script to essentially have the same effect.
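For example, a cursor-based variant of that idea with ndb might look like the sketch below. The page size is arbitrary, and whether it sidesteps the hang depends on exactly what triggers the bug:

def fetch_all(model, page_size=500):
    # Page through the kind in small batches instead of one long iterator.
    items, cursor, more = [], None, True
    while more:
        page, cursor, more = model.query().fetch_page(page_size, start_cursor=cursor)
        items.extend(entity.to_dict() for entity in page)
    return items

for model in [FooModel, BarModel]:
    print 'Downloading {}'.format(model.__name__)
    print fetch_all(model)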
Using admin URLs
For repeated operations, it may be worthwhile to build a web UI accessible only to admins. This can be done with the help of the users module as shown here. This is not really practical for a one-time task but far more robust for regular maintenance tasks. As this does not use the remote_api at all, one would not encounter this bug.
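A minimal sketch of such an admin-only handler (webapp2; the /admin/export route and the export format are made up, and it assumes FooModel and BarModel are importable from your models module):

import webapp2
from google.appengine.api import users
from models import FooModel, BarModel  # hypothetical import path

class ExportHandler(webapp2.RequestHandler):
    def get(self):
        # Only signed-in admins may trigger the export.
        if not users.is_current_user_admin():
            self.abort(403)
        for model in [FooModel, BarModel]:
            for entity in model.query():
                self.response.write('%s\n' % entity.to_dict())

app = webapp2.WSGIApplication([('/admin/export', ExportHandler)])

Marking the route with login: admin in app.yaml achieves the same restriction without the explicit check in code.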
I have a long-running process in a backend and I have seen that the log only stores the last 1000 logging calls per request.
While this might not be an issue for a frontend handler, I find it very inconvenient for a backend, where a process might be running indefinitely.
I have tried flushing the logs to see if it creates a new logging entry, but it didn't. This seems so wrong that I'm sure there must be a simple solution. Please help!
Thanks stackoverflowians!
Update: Someone already asked about this in the appengine google group, but there was no answer....
Edit: The 'depth' I am concerned with is not the total number of RequestLogs, which is fine, but the number of AppLogs in a RequestLog (which is limited to 1000).
Edit 2: I did the following test to try David Pope's suggestions:
def test_backends(self):
    launched = self.request.get('launched')
    if launched:
        # Do the job, we are running in the backend
        logging.info('There we go!')
        from google.appengine.api.logservice import logservice
        for i in range(1500):
            if i == 500:
                logservice.flush()
                logging.info('flushhhhh')
            logging.info('Call number %s' % i)
    else:
        # Launch the task in the backend
        from google.appengine.api import taskqueue
        tq_params = {'url': self.uri_for('backend.test_backends'),
                     'params': {'launched': True},
                     }
        if not DEBUG:
            tq_params['target'] = 'crawler'
        taskqueue.add(**tq_params)
Basically, this creates a backend task that logs 1500 lines, flushing at number 500. I would expect to see two RequestLogs, the first one with 500 lines in it and the second one with 1000 lines.
The results are the following:
I didn't get the result that the documentation suggests: manually flushing the logs doesn't create a new log entry, and I still have one single RequestLog with 1000 lines in it. I had already seen this part of the docs some time ago and got this same result, so I thought I wasn't understanding what the docs were saying. In any case, at the time I left a logservice.flush() call in my backend code, and the problem wasn't solved.
I downloaded the logs with appcfg.py, and guess what?... all the AppLogs are there! I usually browse the logs in the web UI, and I'm not sure I could get a comfortable workflow viewing the logs this way... The ideal solution for me would be the one described in the docs.
My app's autoflush settings are set to the default. I played with them at some point, but the problem persisted, so I left them unset.
I'm using python ;)
The Google docs suggest that flushing should do exactly what you want. If your flushing is working correctly, you will see "partial" request logs tagged with "flush" and the start time of the originating request.
A couple of things to check:
Can you post your code that flushes the logs? It might not be working.
Are you using the GAE web console to view the logs? It's possible that the limit is just a web UI limit, and that if you actually fetch the logs via the API then all the data will be there. (This should only be an issue if flushing isn't working correctly.)
Check your application's autoflush settings.
I assume there are corresponding links for Java, if that's what you're using; you didn't say.
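To rule out a web-UI-only limit (point 2 above), a quick check along these lines should pull the raw AppLogs through the Log Service API; the one-hour window is arbitrary:

import time
from google.appengine.api.logservice import logservice

end = time.time()
for request_log in logservice.fetch(start_time=end - 3600,
                                    end_time=end,
                                    include_app_logs=True):
    print 'Request %s has %d app log lines' % (request_log.request_id,
                                               len(request_log.app_logs))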
All I can think of that might help is to use a timed/cron script like the following to run every hour or so from your workstation/server:
appcfg.py --oauth2 request_logs appname/ output.log --append
This should give you a complete log - I haven't tested it myself
I did some more reading and it seems cron handling is already part of appcfg:
https://developers.google.com/appengine/docs/python/tools/uploadinganapp#oauth
appcfg.py [options] cron_info <app-directory>
Displays information about the scheduled task (cron) configuration, including the expected times of the next few executions. By default, it displays the times of the next 5 runs. You can modify the number of future run times displayed with the --num_runs=... option.
Based on your comment, I would try:
1) Write your own logger class
2) Use more than one version
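For what it's worth, one way to read suggestion 1 is a handler that mirrors log lines somewhere not subject to the per-request AppLog cap, e.g. the datastore. The LogLine model and handler below are entirely hypothetical:

import logging
from google.appengine.ext import ndb

class LogLine(ndb.Model):
    message = ndb.TextProperty()
    level = ndb.StringProperty()
    created = ndb.DateTimeProperty(auto_now_add=True)

class DatastoreLogHandler(logging.Handler):
    """Mirrors log records to the datastore so they outlive the
    1000-AppLog-per-request limit."""
    def emit(self, record):
        LogLine(message=self.format(record), level=record.levelname).put()

logging.getLogger().addHandler(DatastoreLogHandler())

A put() per line is obviously not cheap, so in practice you would batch the writes or only mirror lines above a certain level.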
I know how to set up a job to alert when it's running.
But I'm writing a job which is meant to run many times a day, and I don't want to be bombarded by emails, but rather I'd like a solution where I get an alert when the job hasn't been executed for X minutes.
This can be achieved by setting the job to alert on execution, and then setting up some process that checks for these alerts and warns when no such alert has been seen for X minutes.
I'm wondering if anyone's already implemented such a thing (or equivalent).
Supporting multiple jobs with different X values would be great.
The danger of this approach is this: suppose you set this up. One day you receive no emails. What does this mean?
It could mean
the supposed-to-be-running job is running successfully (and silently), and so the absence-of-running monitor job has nothing to say
or alternatively
the supposed-to-be-running job is NOT running successfully, but the absence-of-running monitor job has ALSO failed
or even
your server has caught fire, and can't send any emails even if it wants to
Don't seek to avoid receiving success messages - instead devise a strategy for coping with them. Because the only way to know that a job is running successfully is getting a message which says precisely this.
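Whichever way you frame it, the check "have I seen a success recently?" usually boils down to something like the sketch below. The job names, heartbeat paths, and thresholds are made up, and it assumes each job touches its heartbeat file on every successful run; keep the caveats above in mind:

import os
import time

# Hypothetical per-job thresholds, in minutes (the "X" values).
JOBS = {
    '/var/run/jobs/sync.heartbeat': 15,
    '/var/run/jobs/report.heartbeat': 60,
}

def overdue_jobs():
    now = time.time()
    for path, max_minutes in JOBS.items():
        try:
            age = (now - os.path.getmtime(path)) / 60.0
        except OSError:
            age = float('inf')  # heartbeat file missing: job never ran
        if age > max_minutes:
            yield path, age

for path, age in overdue_jobs():
    print 'ALERT: %s last ran %.0f minutes ago' % (path, age)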
I want to run a script every few seconds (4 or less) in google app engine to process user input and generate output. What is the best way to do this?
Run a cron job.
http://code.google.com/appengine/docs/python/config/cron.html
http://code.google.com/appengine/docs/java/config/cron.html
A cron job will invoke a URL at a given time of day. A URL invoked by cron is subject to the same limits and quotas as a normal HTTP request, including the request time limit.
Also consider the Task Queue - http://code.google.com/appengine/docs/python/taskqueue/overview.html
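A sketch of the task-queue route: a cron-triggered handler (here a made-up /tick URL, with cron.yaml scheduling it every 1 minute) can fan out tasks with staggered countdowns so the worker effectively runs every few seconds:

import webapp2
from google.appengine.api import taskqueue

class TickHandler(webapp2.RequestHandler):
    """Hit by cron once a minute; enqueues a /process task every 4 seconds."""
    def get(self):
        for delay in range(0, 60, 4):
            taskqueue.add(url='/process', countdown=delay)

app = webapp2.WSGIApplication([('/tick', TickHandler)])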
Reconsider what you're doing. As Ash Kim says, you can do it with the task queue, but first take a close look if you really need to run a process like this. Is it possible to rewrite things so the task runs only when needed, or immediately, or lazily (that is, only when the results are needed)?