At unpredictable times (user request) I need to run a memory-intensive job. For this I get a spot or on-demand instance and tag it as non_idle. When the job is done (which may take hours), I give it the tag idle. Due to the hourly billing model of AWS, I want to keep that instance alive until just before another billable hour would be incurred, in case another job comes in. If a job comes in, the instance should be reused and marked as non_idle. If no job comes in during that time, the instance should terminate.
Does AWS offer a ready solution for this? As far as I know, CloudWatch can't set alarms that fire at a specific time, never mind ones based on CPUUtilization or the instance's tags. Otherwise, perhaps I could simply set up, for every created instance, a Java timer or Scala actor that runs every hour after the instance is created and checks for the tag idle.
There is no readily available AWS solution for this fine-grained optimization, but you can use the existing building blocks to build your own based on the launch time of the current instance (see Dmitriy Samovskiy's smart solution for deducing How Long Ago Was This EC2 Instance Started?).
Playing 'Chicken'
Shlomo Swidler has explored this optimization in his article Play “Chicken” with Spot Instances, albeit with a slightly different motivation in the context of Amazon EC2 Spot Instances:
AWS Spot Instances have an interesting economic characteristic that
make it possible to game the system a little. Like all EC2 instances,
when you initiate termination of a Spot Instance then you incur a
charge for the entire hour, even if you’ve used less than a full hour.
But, when AWS terminates the instance due to the spot price exceeding
the bid price, you do not pay for the current hour.
The mechanics are the same of course, so you might be able to simply reuse the script he assembled, i.e. execute this script instead of or in addition to tagging the instance as idle:
#! /bin/bash
t=/tmp/ec2.running.seconds.$$
# wget preserves the metadata service's Last-Modified timestamp on the downloaded
# file, which reflects the instance launch time (Dmitriy Samovskiy's trick)
if wget -q -O $t http://169.254.169.254/latest/meta-data/local-ipv4 ; then
    # add 60 seconds artificially as a safety margin
    let runningSecs=$(( `date +%s` - `date -r $t +%s` ))+60
    rm -f $t
    let runningSecsThisHour=$runningSecs%3600
    let runningMinsThisHour=$runningSecsThisHour/60
    let leftMins=60-$runningMinsThisHour
    # start shutdown one minute earlier than actually required
    let shutdownDelayMins=$leftMins-1
    if [[ $shutdownDelayMins > 1 && $shutdownDelayMins < 60 ]]; then
        echo "Shutting down in $shutdownDelayMins mins."
        # TODO: Notify off-instance listener that the game of chicken has begun
        sudo shutdown -h +$shutdownDelayMins
    else
        echo "Shutting down now."
        sudo shutdown -h now
    fi
    exit 0
fi
echo "Failed to determine remaining minutes in this billable hour. Terminating now."
sudo shutdown -h now
exit 1
Once a job comes in, you could then cancel the scheduled termination instead of or in addition to tagging the instance with non_idle as follows:
sudo shutdown -c
This is also the 'red button' emergency command during testing/operation; see e.g. Shlomo's warning:
Make sure you really understand what this script does before you use
it. If you mistakenly schedule an instance to be shut down you can
cancel it with this command, run on the instance: sudo shutdown -c
Adding CloudWatch to the game
You could take Shlomo's self-contained approach even further by integrating with Amazon CloudWatch, which recently added an option to Use Amazon CloudWatch to Detect and Shut Down Unused Amazon EC2 Instances; see the introductory blog post Amazon CloudWatch - Alarm Actions for details:
Today we are giving you the ability to stop or terminate your EC2
instances when a CloudWatch alarm is triggered. You can use this as a
failsafe (detect an abnormal condition and then act) or as part of
your application's processing logic (await an expected condition and
then act). [emphasis mine]
Your use case is specifically listed in the Application Integration section:
You can also create CloudWatch alarms based on Custom Metrics that you
observe on an instance-by-instance basis. You could, for example,
measure calls to your own web service APIs, page requests, or message
postings per minute, and respond as desired.
So you could leverage this new functionality by Publishing Custom Metrics to CloudWatch to indicate whether an instance should terminate (is idle), based on Dmitriy's launch time detection, and reset the metric again once a job comes in and the instance should keep running (is non_idle). That way EC2 would take care of the termination, 2 out of 3 automation steps would have been moved from the instance into the operations environment, and management and visibility of the automation process would improve accordingly.
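A rough sketch of what that wiring could look like, using boto3 purely for illustration (the answer does not prescribe a library, and the namespace, metric name, instance ID, region and alarm thresholds below are made-up placeholders):

import boto3

# Hypothetical names/values for illustration only.
NAMESPACE = "JobRunner"
METRIC = "IsIdle"
INSTANCE_ID = "i-0123456789abcdef0"  # would normally be read from instance metadata

cloudwatch = boto3.client("cloudwatch")

def report_idle_state(is_idle):
    """Publish 1 when the instance is idle, 0 while a job is running."""
    cloudwatch.put_metric_data(
        Namespace=NAMESPACE,
        MetricData=[{
            "MetricName": METRIC,
            "Dimensions": [{"Name": "InstanceId", "Value": INSTANCE_ID}],
            "Value": 1.0 if is_idle else 0.0,
            "Unit": "Count",
        }],
    )

# One-time setup: terminate the instance once the metric has signaled
# 'idle' for two consecutive 5-minute periods.
cloudwatch.put_metric_alarm(
    AlarmName="terminate-when-idle-" + INSTANCE_ID,
    Namespace=NAMESPACE,
    MetricName=METRIC,
    Dimensions=[{"Name": "InstanceId", "Value": INSTANCE_ID}],
    Statistic="Maximum",
    Period=300,
    EvaluationPeriods=2,
    Threshold=1.0,
    ComparisonOperator="GreaterThanOrEqualToThreshold",
    AlarmActions=["arn:aws:automate:us-east-1:ec2:terminate"],
)

The idle/non_idle tagging hooks would call report_idle_state(True) and report_idle_state(False) respectively, so the alarm only ever fires while no job is running.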
I'm running a training job on SageMaker. The job doesn't fully complete and hits the MaxRuntimeInSeconds stopping condition. When the job is stopping, documentation says the artifact will still be saved. I've attached the status progression of my training job below. It looks like the training job finished correctly. However the output S3 folder is empty. Any ideas on what is going wrong here? The training data is located in the same bucket so it should have everything it needs.
From the status progression, it seems that the training image download completed at 15:33 UTC, and by that time the stopping condition had already been triggered based on the MaxRuntimeInSeconds parameter that you specified. From then, it takes 2 mins (15:33 to 15:35) to save any available model artifact, but in your case the training process did not happen at all. The only thing that was done was downloading the pre-built image (containing the ML algorithm). Please refer to the following lines from the documentation, which say that saving the model is subject to the state the training process is in. Maybe you can try increasing MaxRuntimeInSeconds and running the job again. Also, please check the MaxWaitTimeInSeconds value if you have set one; it must be equal to or greater than MaxRuntimeInSeconds.
Please find the excerpts from AWS documentation :
"The training algorithms provided by Amazon SageMaker automatically
save the intermediate results of a model training job when possible.
This attempt to save artifacts is only a best effort case as model
might not be in a state from which it can be saved. For example, if
training has just started, the model might not be ready to save."
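For reference, a hedged sketch of where these two limits are set when using the SageMaker Python SDK (the image URI, role, bucket and durations below are placeholders, not details taken from your job):

from sagemaker.estimator import Estimator

estimator = Estimator(
    image_uri="123456789012.dkr.ecr.us-east-1.amazonaws.com/my-algo:latest",  # placeholder
    role="arn:aws:iam::123456789012:role/SageMakerExecutionRole",             # placeholder
    instance_count=1,
    instance_type="ml.m5.xlarge",
    output_path="s3://my-bucket/output/",  # placeholder
    use_spot_instances=True,               # MaxWaitTimeInSeconds only applies to managed spot training
    max_run=6 * 60 * 60,                   # MaxRuntimeInSeconds
    max_wait=8 * 60 * 60,                  # MaxWaitTimeInSeconds, must be >= max_run
)
estimator.fit("s3://my-bucket/training-data/")  # placeholder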
If MaxRuntimeInSeconds is exceeded then model upload is only best-effort and really depends on whether the algorithm saved any state to /opt/ml/model at all prior to being terminated.
The two minute wait period between 15:33 and 15:35 in the Stopping stage signifies the maximum time between the SIGTERM and SIGKILL signals sent to your algorithm (see the SageMaker docs for more detail). If your algorithm traps the SIGTERM, it is supposed to use that as a signal to gracefully save its work and shut down before the SageMaker platform kills it forcibly with a SIGKILL signal 2 minutes later.
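As an illustration of that idea, a training script could install a SIGTERM handler along these lines (this is a generic sketch, not code from your job; the checkpoint logic is a placeholder):

import os
import signal
import sys

MODEL_DIR = "/opt/ml/model"  # SageMaker uploads whatever ends up in this directory

def save_checkpoint():
    # Placeholder: serialize whatever partial state the algorithm has so far.
    with open(os.path.join(MODEL_DIR, "checkpoint.txt"), "w") as f:
        f.write("partial model state")

def handle_sigterm(signum, frame):
    # SageMaker sends SIGTERM roughly 2 minutes before the forcible SIGKILL.
    save_checkpoint()
    sys.exit(0)

signal.signal(signal.SIGTERM, handle_sigterm)

# ... the training loop runs here; on SIGTERM the handler saves partial state ...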
Given that the wait period in the Stopping step is exactly 2 minutes, as well as the fact that the Uploading step started at 15:35 and completed almost immediately at 15:35, it's likely that your algo did not take advantage of the SIGTERM warning and that there was nothing saved to /opt/ml/model. To get a definitive answer as to whether this was indeed the case, please create a SageMaker forum post and the SageMaker team can private-message you to gather details of your job.
I have a script that, using Remote API, iterates through all entities for a few models. Let's say two models, called FooModel with about 200 entities, and BarModel with about 1200 entities. Each has 15 StringPropertys.
for model in [FooModel, BarModel]:
    print 'Downloading {}'.format(model.__name__)
    new_items_iter = model.query().iter()
    new_items = [i.to_dict() for i in new_items_iter]
    print new_items
When I run this in my console, it hangs for a while after printing 'Downloading BarModel'. It hangs until I hit ctrl+C, at which point it prints the downloaded list of items.
When this is run in a Jenkins job, there's no one to press ctrl+C, so it just runs continuously (last night it ran for 6 hours before something, presumably Jenkins, killed it). Datastore activity logs reveal that the datastore was taking 5.5 API calls per second for the entire 6 hours, racking up a few dollars in GAE usage charges in the meantime.
Why is this happening? What's with the weird behavior of ctrl+C? Why is the iterator not finishing?
This is a known issue currently being tracked on the Google App Engine public issue tracker under Issue 12908. The issue was forwarded to the engineering team and progress on this issue will be discussed on said thread. Should this be affecting you, please star the issue to receive updates.
In short, the issue appears to be with the remote_api script. When querying entities of a given kind, it will hang when fetching 1001 + batch_size entities when the batch_size is specified. This does not happen in production outside of the remote_api.
Possible workarounds
Using the remote_api
One could limit the number of entities fetched per script execution using the limit argument for queries. This may be somewhat tedious but the script could simply be executed repeatedly from another script to essentially have the same effect.
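As a rough sketch of that workaround (assuming ndb; the page size and function name are illustrative, and each invocation would pass in the cursor returned by the previous run):

from google.appengine.datastore.datastore_query import Cursor

PAGE_SIZE = 500  # arbitrary, kept well below the problematic ~1000-entity mark

def download_batch(model, websafe_cursor=None):
    """Fetch one bounded batch; re-run with the returned cursor to continue."""
    cursor = Cursor(urlsafe=websafe_cursor) if websafe_cursor else None
    page, next_cursor, more = model.query().fetch_page(PAGE_SIZE, start_cursor=cursor)
    items = [entity.to_dict() for entity in page]
    return items, (next_cursor.urlsafe() if next_cursor else None), more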
Using admin URLs
For repeated operations, it may be worthwhile to build a web UI accessible only to admins. This can be done with the help of the users module as shown here. This is not really practical for a one-time task but far more robust for regular maintenance tasks. As this does not use the remote_api at all, one would not encounter this bug.
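For instance, a minimal admin-only handler might look like this (webapp2 is assumed here, and the route, handler name and export logic are placeholders):

import webapp2
from google.appengine.api import users

class ExportHandler(webapp2.RequestHandler):
    """Admin-only page that runs the export inside the app, bypassing the remote_api."""
    def get(self):
        if not users.is_current_user_admin():
            self.abort(403)
        # ... run the same per-model export logic here, server-side ...
        self.response.write('export started')

app = webapp2.WSGIApplication([('/admin/export', ExportHandler)])

The same restriction can also be enforced declaratively with login: admin on the handler's entry in app.yaml.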
Currently I am monitoring my target Windows hosts for a bunch of services (CPU, memory, disks, SSL certs, HTTP, etc.). I'm using nsclient as the client that the Nagios server will talk to.
My problem is that I deploy to those hosts three times every 24 hours. The deployment process requires the hosts to reboot. Whenever my hosts reboot I get nagios alerts for each service. This means a large volume of alerts, which makes it difficult to identify real issues.
Ideally I'd like to do this:
If the host is down, don't send any alerts for the rest of the services
If the host is rebooting, this means that nsclient is not accessible. I want to only receive one alert (e.g. CPU is not accessible) and mute everything else for a few minutes, so the host can finish booting and nsclient becomes available.
Implementing this would have me getting one email per host for each deployment. This is much better than everything turning red and me getting flooded with alerts that aren't worth checking (since they're only getting sent because the nagios client -nsclient- is not available during the reboot).
Got to love using a windows stack...
There are several ways to handle this.
If your deploys happen at the same time every day:
1. you could modify your active time period to exclude those times (or)
2. schedule downtime for your host via the Nagios GUI
If your deployments happen at different/random times, things become a bit harder to work around:
1. when nrpe or nsclient is not reachable, Nagios will often throw an 'UNKNOWN' alert for the check. If you remove the 'u' option for the following entries:
host_notification_options [d,u,r,f,s,n]
service_notification_options [w,u,c,r,f,s,n]
That would prevent the 'UNKNOWN's from sending notifications. (or)
2. dynamically modify active checking of the impacted checks by 'turning them off' before you start the deployment, and then 'turning them on' after the deployment. This can be automated using the Nagios 'external command file'.
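A minimal sketch of that automation (the command file path is an assumption and must match the command_file setting in your nagios.cfg; the hostname is a placeholder):

import time

COMMAND_FILE = "/usr/local/nagios/var/rw/nagios.cmd"  # assumption, check nagios.cfg

def set_host_service_checks(hostname, enabled):
    """Enable or disable active checks for every service on the given host."""
    command = "ENABLE_HOST_SVC_CHECKS" if enabled else "DISABLE_HOST_SVC_CHECKS"
    with open(COMMAND_FILE, "w") as f:
        f.write("[%d] %s;%s\n" % (int(time.time()), command, hostname))

# Before kicking off the deployment/reboot:
#     set_host_service_checks("winhost01", False)
# Once the host is back up and nsclient is reachable again:
#     set_host_service_checks("winhost01", True)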
Jim Black's answer would work, or if you want to go even more in depth you can define dependencies with service notification escalation as described in the documentation below.
Escalating the alerts would mean that you could define: CPU/SSL etc. check fail -> check host down -> notify/don't notify.
Nagios Service Escalation (3.0)
I had a user contact me saying that her computer clock is 8 or 9 minutes faster than her cell phone clock. That concerned me because cell phone clocks are always synced. I looked at my computer's clock, and it was the same, about 8 minutes ahead of my phone. Eight minutes is a lot of time to be off. So I looked at my two DC's. The one that serves as the AD PDC Emulator is only 1 minute faster than my phone; that seems more reasonable. But workstations aren't syncing with it. So I looked at my other DC, which has none of the master roles. It is exactly the same as the workstations, about 8 minutes fast.
So there are a couple of big problems here. First, my DC's don't have the same time. Second, my workstations have the same time as the faster DC (are they syncing to it?). I looked in the error logs of both DC's and filtered for the Time-Service. The PDC Emulator DC has Warning Event ID 144: The time service has stopped advertising as a good time source. The other DC has Warning Event ID 142: The time service has stopped advertising as a time source because the local clock is not synchronized. I am getting other Event ID warnings as well. On the primary DC: Event IDs 12, 36, 144 (mentioned above), 131. On the secondary DC: Event IDs 131, 24, 142 (mentioned above), 50, 129. I will give more info on these at the bottom.
From what I'm seeing, it looks like my PDCe is not pointing to an external source. Should I use the instructions here (http://support.microsoft.com/kb/816042) under "Configuring the time service to use an external time source" to set it up? The guy in the article (http://tigermatt.wordpress.com/2009/08/01/windows-time-for-active-directory/) says to use a script to automate it (w32tm /config /manualpeerlist:"uk.pool.ntp.org,0x8 europe.pool.ntp.org,0x8" /syncfromflags:MANUAL /reliable:yes /update). But I'm not sure if they're doing the same thing. Even if they did, I'm not sure which address I use. If I look at my secondary DC, it has an NtpServer entry of time.windows.com,0x9. The PDCe had it as well, until I did the reset that the article recommended; now it does not have an NtpServer entry.
So which method is the right one to use, and what address do I use? Does it matter if I'm running Server 2008 R2?
Event ID 12: Time Provider NtpClient: This machine is configured to use the domain hierarchy to determine its time source, but it is the AD PDC emulator for the domain at the root of the forest, so there is no machine above it in the domain hierarchy to use as a time source. It is recommended that you either configure a reliable time service in the root domain, or manually configure the AD PDC to synchronize with an external time source. Otherwise, this machine will function as the authoritative time source in the domain hierarchy. If an external time source is not configured or used for this computer, you may choose to disable the NtpClient.
Event ID 36: The time service has not synchronized the system time for 86400 seconds because none of the time service providers provided a usable time stamp. The time service will not update the local system time until it is able to synchronize with a time source. If the local system is configured to act as a time server for clients, it will stop advertising as a time source to clients. The time service will continue to retry and sync time with its time sources. Check system event log for other W32time events for more details. Run 'w32tm /resync' to force an instant time synchronization.
Event ID 144: The time service has stopped advertising as a good time source.
Event ID 131: NtpClient was unable to set a domain peer to use as a time source because of DNS resolution error on ''. NtpClient will try again in 3473457 minutes and double the reattempt interval thereafter. The error was: The requested name is valid, but no data of the requested type was found. (0x80072AFC).
Event ID 24: Time Provider NtpClient: No valid response has been received from domain controller DC-DNS.domain.org [this is our primary DC] after 8 attempts to contact it. This domain controller will be discarded as a time source and NtpClient will attempt to discover a new domain controller from which to synchronize. The error was: The peer is unreachable.
Event ID 142: The time service has stopped advertising as a time source because the local clock is not synchronized.
Event ID 50: The time service detected a time difference of greater than 5000 milliseconds for 900 seconds. The time difference might be caused by synchronization with low-accuracy time sources or by suboptimal network conditions. The time service is no longer synchronized and cannot provide the time to other clients or update the system clock. When a valid time stamp is received from a time service provider, the time service will correct itself.
Event ID 129: NtpClient was unable to set a domain peer to use as a time source because of discovery error. NtpClient will try again in 3145779 minutes and double the reattempt interval thereafter. The error was: The entry is not found. (0x800706E1)
I had an issue with a small client where the only DC was running as a VM. The clock would be slow by seconds per day; over weeks or months it could be out by 20 minutes.
Following the instructions found here: http://technet.microsoft.com/en-us/library/cc794937(v=ws.10).aspx I used w32tm /stripchart /computer:time.windows.com /samples:5 /dataonly to determine how far out the clock was with the time.windows.com server (you can use any ntp server you like):
Tracking time.windows.com [64.4.10.33].
Collecting 5 samples.
The current time is 23/06/2013 8:12:34 AM (local time).
08:12:34, -53.2859637s
08:12:37, -53.4214102s
08:12:39, -53.3859342s
08:12:41, -53.2913859s
08:12:43, -53.2440682s
I then used w32tm /config /manualpeerlist:time.windows.com /syncfromflags:manual /reliable:yes /update to tell the server to use time.windows.com as its external time source:
The command completed successfully.
I then used w32tm /resync to force it to re-sync with time.windows.com now:
Sending resync command to local computer...
The command completed successfully.
I then used the first command again to confirm that the difference was near enough to 0 seconds:
Tracking time.windows.com [64.4.10.33].
Collecting 5 samples.
The current time is 23/06/2013 8:13:54 AM (local time).
08:13:54, -00.1657880s
08:13:56, +00.0059062s
08:13:59, -00.0088913s
08:14:01, +00.0030319s
08:14:03, +00.0063458s
Please note that the information was for an environment with a single DC. If you have more than 1 DC, you need to perform the above steps on the DC which holds the PDC Emulator FSMO role.
Hope this helps someone.
The Forest root PDC Emulator (ONLY!) may sync externally. http://technet.microsoft.com/en-us/library/cc794937(v=ws.10).aspx All other Clients, Servers, and DCs should use NT5DS. POOL.NTP.ORG is a good choice.
On all other DCs use:
net stop w32Time
w32tm /unregister
w32tm /register
net start w32time
to reset the time service to use NT5DS as stated in http://technet.microsoft.com/en-us/library/cc738995(v=ws.10).aspx.
If clients or other servers are still having problems, use the same technique, for example via GPO, since admin rights are required.
You also need to be very wary of VM Domain Controllers, as they may or may not keep accurate time depending on the host's CPU utilization! Differences of several minutes are common, and deadly as far as Kerberos is concerned.
One thing you need to clear up - are these DCs VMs running in Hyper-V or are they physical servers? If they're running in Hyper-V, there's a setting which passes the VM host time to the VMs. All you have to do is turn that sync off, then use the w32tm command to set your DCs to an NTP server like time.windows.com as indicated above.
I don't recall the setting off the top of my head, but I had this problem as well...5 DCs all showing different times.
Anyone use the timer feature of RichCopy? I have a job that works fine when I manually start the job. However, when I schedule the job and click run, the app appears to be waiting for the scheduled time to elapse yet never fires. Interestingly enough, when I stop the job the copy starts.
Anyone have any experience with using RichCopy timer?
IanB
Try creating a batch file with command line options. Then use Windows Task Scheduler to launch the batch.
OMBG (Bill Gates) You need to read up on security policy and the respect it has to place on a hierarchy of upstream objects and credentials. Well, that's the MS answer and attitude...
The reality is if you are working with server OSs you need to understand their security & policy frameworks, and how to debug them :). If your process loses the necessary file permissions or rights (2 different things) you should ask: "Hot damn, why didn't I fix that in the config/setup". People that blast the vendor/project (or even ####&$! MS) are just blinding themselves to the solution/s.
In most cases this kind of issue is due to Windows' AD removing the rights of a Local administrator User to run a scheduled task. It is a common security setting in corporate networks (implemented with glee by Domain Admins to upset developers) though it is really a default setting these days. It happens because the machine updates against an upstream policy (after you've scheduled a task) and decides that all of a sudden it won't trust you to run it (even though previously it let you set it up). In a perfect world it wouldn't let you set it up in the first place, but that isn't the way policy applies in Windows... (####&$! MS). LOL
Wow it only took 5 months to get an answer! (but here they are for the next person at least!)