How can one capture the length of time it takes for an EC2-based job to run in Datadog? - analytics

I have a job that runs several times a day in an EC2 instance. I have a Datadog integration with AWS and the agent on the instance. However, I'm unsure of how best to measure the average time it takes to run.
If I take a time series, it won't capture the average duration, because there's no clear way to mark the start versus the end.
Any ideas?
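One way to approach this (a sketch only, not from the thread: it assumes the `datadog` Python client, and the metric name `my_job.duration` and its tags are placeholders) is to time the job yourself and emit the duration to the Agent on the instance over DogStatsD. Datadog can then average that metric over any window.

```python
# Minimal sketch (assumption, not from the thread): emit the job's duration
# as a DogStatsD metric through the local Datadog Agent on the EC2 instance.
import time
from datadog import initialize, statsd

# The Agent listens for DogStatsD traffic on port 8125 by default.
initialize(statsd_host="127.0.0.1", statsd_port=8125)

def run_job():
    # ... the actual work the job does ...
    pass

start = time.monotonic()
status = "success"
try:
    run_job()
except Exception:
    status = "error"
    raise
finally:
    duration = time.monotonic() - start
    # "my_job.duration" and the status tag are placeholder names.
    statsd.histogram("my_job.duration", duration, tags=[f"status:{status}"])
```

With the duration recorded as a metric, graphing the average run time is just applying the avg aggregation to that metric over the desired time window.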

Related

Why do actual execution plan time and elapsed time differ in SSMS?

The actual execution plan of a query shows a total of 2.040 s (summing the time taken at every step), but the query takes 52 seconds to complete (the time shown at the bottom of SQL Server Management Studio).
Why is there such a large difference between the two times, and how can I reduce the 52 seconds?
The elapsed time in SSMS includes network round-trip time, client render time, etc.
The execution plan indicates how long it took the server to process the query, not how long it took to stream the results to you.
If you're outputting data to a messages pane or, worse, a grid, that isn't free. As SSMS draws the data in the grid, the server is sending you rows over the network, but the query engine isn't doing anything anymore. Its job is done.
The execution plan itself only knows about the time the query took on the server. It has no idea about network latency or slow client processing. SSMS will tell you how much time it spent doing that, and the execution plan doesn't have any visibility into it at all because it's generated before SSMS has done its thing.
The execution plan runs on the server. It doesn't even know what SSMS is, never mind what it's doing with your 236,833 rows. Let's think about it another way:
You buy some groceries, and the cash register receipt says it took you 4 minutes to check out. Then you take the long way home, stop for coffee, drop the groceries on the way into the house, and spend 20 minutes remembering where everything goes. Finally, you sit down on the couch. The cash register receipt doesn't then update to add your travel time and organization time; that extra time is equivalent to what SSMS is doing when it struggles to show you 236,833 rows.
And this is why we don't try to time the performance of a query by adding in unrealistic steps that won't happen in the real world: no real-world user can process 200,000 rows of anything. Don't draw conclusions about real-world performance from testing in a client GUI. Is your application going to use pagination, aggregation, or something else so an end user doesn't have to wait for 200,000 rows to render? If so, test that. If not, reconsider.
To make this faster in the meantime, try enabling the "Discard results after execution" option in SSMS.
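If you want to see this split from a client yourself, one rough sketch (not part of the answer above; it assumes pyodbc, and the connection string and query are placeholders) is to time the execute call separately from pulling every row over the network:

```python
# Rough illustration (assumption, not from the answer): separate the time the
# server spends executing the query from the time spent streaming rows back
# to the client and materializing them.
import time
import pyodbc  # assumes an ODBC driver for SQL Server is installed

conn = pyodbc.connect(
    "DRIVER={ODBC Driver 17 for SQL Server};"
    "SERVER=your-server;DATABASE=your-db;Trusted_Connection=yes"  # placeholder
)
cursor = conn.cursor()

t0 = time.perf_counter()
cursor.execute("SELECT * FROM dbo.YourLargeTable")  # placeholder query
t1 = time.perf_counter()
rows = cursor.fetchall()   # pulling hundreds of thousands of rows dominates here
t2 = time.perf_counter()

print(f"execute():  {t1 - t0:.3f} s")   # roughly the server-side work
print(f"fetchall(): {t2 - t1:.3f} s")   # network transfer + client materialization
```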

Stopping hyperparameter tuning (HPO) jobs after reaching a metric threshold in AWS SageMaker

I am running HPO jobs in SageMaker, and I am looking for a way to stop my HPO job after one of the child training jobs reaches a specific metric threshold.
PS: I tried SageMaker early stopping, but it only works at the level of epochs within each training job: it stops training jobs when it notices that their learning pattern is unlikely to produce metrics as good as the best training jobs found so far. That does not solve my problem, which is at the level of HPO combinations: regardless of what happens within the child training jobs, I want to stop the whole HPO job after one of its children reaches my desired metric threshold.
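One possible approach (a sketch under assumptions, not a built-in SageMaker feature: the tuning job name, threshold, and "higher is better" objective are placeholders) is a small watcher that polls the tuning job's completed children with boto3 and stops the whole tuning job once one of them crosses the threshold:

```python
# Sketch (assumption, not from the thread): poll the tuning job's children and
# stop the whole HPO job once one of them reaches the desired metric value.
import time
import boto3

sm = boto3.client("sagemaker")

TUNING_JOB = "my-hpo-job"   # placeholder tuning job name
THRESHOLD = 0.95            # placeholder; assumes higher objective is better

def best_child_metric(tuning_job_name):
    """Return the best final objective value among the child training jobs."""
    resp = sm.list_training_jobs_for_hyper_parameter_tuning_job(
        HyperParameterTuningJobName=tuning_job_name,
        SortBy="FinalObjectiveMetricValue",
        SortOrder="Descending",
        MaxResults=1,
    )
    summaries = resp.get("TrainingJobSummaries", [])
    if not summaries:
        return None
    metric = summaries[0].get("FinalHyperParameterTuningJobObjectiveMetric")
    return metric["Value"] if metric else None

while True:
    status = sm.describe_hyper_parameter_tuning_job(
        HyperParameterTuningJobName=TUNING_JOB
    )["HyperParameterTuningJobStatus"]
    if status in ("Completed", "Stopped", "Failed"):
        break
    best = best_child_metric(TUNING_JOB)
    if best is not None and best >= THRESHOLD:
        sm.stop_hyper_parameter_tuning_job(HyperParameterTuningJobName=TUNING_JOB)
        break
    time.sleep(60)  # poll once a minute
```

Stopping the tuning job lets already-running children finish or be stopped by SageMaker; the watcher itself could run anywhere with AWS credentials, e.g. a small script or a Lambda on a schedule.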

Google App Engine Instance Hours for every 10min job

I have a script in App Engine that gets called every 10 minutes. I am the only user.
The script pulls data from a web source, does light processing, and returns an image. It takes several minutes to run the first time. The source gets updated every 10 minutes, so the next time my script runs (10 minutes later), it returns in a few seconds.
I'm using over 30 instance hours a day which is over the 28 free hours.
I read somewhere that every time an instance starts, it uses a minimum of 15min. (so 144x15=36hrs)
Therefore, am I better off trying to keep the instance running 24 hours a day (using up 24 of the free hours) and limiting to one instance max? Perhaps setting idle_timeout to 10 minutes. Another potential way to save would be to somehow pause my script during late-night/early-morning hours.
This is where you likely found the 15-minute-minimum statement, and it refers to when the accrual of instance hours ends. If you look at that documentation you will see that it depends on what type of scaling you are using.
Here are my thoughts on the options you mentioned in your question for staying within the free tier:
Use only one instance. The problem with this approach is that putting a higher load on a single instance may increase other costs such as vCPU and memory, since all the processing will be done there; it's also likely that the script will take longer to run.
Pause the script during low-activity hours. This would be ideal if the accumulated load when turning it back on is not too big.
Set idle_timeout to 10 minutes. This won't work if every instance receives a request every 10 minutes: App Engine only stops charging after 15 minutes of idle time, which would then never be reached. If some instances are not hit every 10 minutes, though, it could be worth trying.
So, summing it up, all 3 options have their pros and cons. I would suggest that you test all of them and see which option suits your needs best.
Hope this helped.
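For illustration only (an assumption, not part of the answer above; the runtime value is a placeholder), the "one instance max" and idle_timeout ideas map onto basic scaling in app.yaml roughly like this:

```yaml
# Sketch only (assumption, not from the answer): basic scaling caps the
# instance count and controls how long an idle instance is kept around.
runtime: python39        # placeholder runtime
basic_scaling:
  max_instances: 1       # "limiting to one instance max"
  idle_timeout: 10m      # shut the instance down after 10 idle minutes
```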

App Engine Cron.yaml run multiple instances of a script

How can one run multiple instances of a script using Google App Engine's Cron system?
By default, it will run, then wait the specified interval before running again, which means that only one instance runs. What I am looking for is a way to get a script that takes 2+ minutes to run to start a new instance every 30-60 seconds, regardless of whether it is already running; this assumes the script does not interfere with itself when multiple instances are running. This would effectively allow the script to deal with several times more information in the same period of time.
Edit: completely reworded the question.
You only get resolution to the minute. To get finer-grained, you'll need instances that know whether they should handle the request from cron immediately, or if they'll have to sleep 30 seconds first. A 30-second sleep uses up half of the 60-second request deadline. Depending on the workload you expect to handle, this might require that you use Modules.
By the way, I'm not aware of any guarantee that a job scheduled for 01:00 will fire at exactly 01:00:00 (and not at, say, 01:00:03).
Since the cron service doesn't allow intervals below 1 minute, you'd need to achieve the staggered script launching in a different manner.
One possibility would be to have a cron entry handler running every 2 mins which internally sleeps for 30 seconds (or as low as your "few seconds of each-other" requirements are) between triggering the respective script instance launches.
Note: the sleeps would probably burn into your Instance Hours usage. You might be able to incorporate the staggered triggering logic into some other long-living task you may have instead of simply sleeping.
To decouple the actual script execution from the cron handler (or the other long-living task) execution you could use dedicated task queues for each script instance, with queue handlers sharing the actual script code if needed. The actual triggering would be done by enqueueing tasks in the respective script instance queue. As a bonus you may further control each script instance executions by customizing the respective queue configuration.
Note: if your script execution time exceeds the 2-minute cron period, you may need to take extra precautions in the queue configurations, as there can be extra delays (due to queueing) which could push the launching of a script instance closer to the next instance's launch.
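As a rough sketch of that cron-handler-plus-task-queue idea (assuming the legacy first-generation Python runtime with webapp2; the handler URLs, queue name, and run_script body are placeholders), the cron handler can enqueue the launches with increasing countdown values instead of sleeping between them:

```python
# Sketch (assumptions: legacy first-gen Python runtime, webapp2, and a push
# queue named "script-queue" defined in queue.yaml). The cron handler fans out
# staggered task launches instead of sleeping between them.
import webapp2
from google.appengine.api import taskqueue

def run_script():
    # placeholder for the actual script body
    pass

class CronFanOutHandler(webapp2.RequestHandler):
    """Hit by cron every 2 minutes; enqueues staggered script runs."""
    def get(self):
        for i in range(4):                       # 4 launches, 30 seconds apart
            taskqueue.add(
                url="/tasks/run-script",         # placeholder worker URL
                queue_name="script-queue",       # placeholder queue name
                countdown=i * 30,                # stagger the start times
            )

class RunScriptHandler(webapp2.RequestHandler):
    """Executes one instance of the script."""
    def post(self):
        run_script()

app = webapp2.WSGIApplication([
    ("/cron/fan-out", CronFanOutHandler),
    ("/tasks/run-script", RunScriptHandler),
])
```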
Working off Dave W. Smith's answer, the line would be:
every 1 minute from 00:00 to 23:59
This means that it would create a new instance every minute, even if the script takes longer than a minute to run. It does seem that specifying seconds is not possible.
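Wrapped into a cron.yaml entry (a sketch; the url and description are placeholders), that schedule would look something like:

```yaml
# Sketch (url and description are placeholders); uses the schedule line above.
cron:
- description: launch the script every minute
  url: /cron/fan-out
  schedule: every 1 minute from 00:00 to 23:59
```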

Time Limit for Task Queue in Google App Engine

I am using Task Queue in GAE for performing some background work for my application. I have come to know that there is a 10-minute time limit for a particular task. My concern is how to test this in my local environment. I tried a thread sleep, but it didn't throw any exception as mentioned in the Google App Engine docs. Also, is this time limit measured in CPU time or actual (wall-clock) time?
Thanks.
The time is measured in wall-clock time. The development server doesn't enforce time limits, although it's unclear why you'd want to test this: your tests are unlikely to perform the same as they will in production, so trying to guess how much you'll be able to accomplish in 10 minutes on the production servers by seeing how much you can accomplish in 10 minutes on the development server will fail horribly.
For your development server, start a timer when a task is initiated and keep checking in your code whether you have reached 10 minutes of wall-clock time. When you reach it, throw a DeadlineExceededError. It would be better to have the try/except statements in the class handlers which call a particular function of your code.
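A minimal sketch of that suggestion (assuming the legacy first-generation Python runtime, where DeadlineExceededError lives in google.appengine.runtime; process and save_progress are placeholders):

```python
# Sketch of the suggestion above (assumption: legacy first-gen Python runtime).
# The dev server enforces no limit, so the task simulates the 10-minute cap.
import time
from google.appengine.runtime import DeadlineExceededError

TASK_LIMIT_SECONDS = 10 * 60  # the production limit being simulated

def process(item):
    pass  # placeholder: one unit of work

def save_progress(items):
    pass  # placeholder: checkpoint so a retried task can resume

def run_task(work_items):
    start = time.time()
    try:
        for item in work_items:
            if time.time() - start > TASK_LIMIT_SECONDS:
                # Mimic what production would do once the limit is hit.
                raise DeadlineExceededError()
            process(item)
    except DeadlineExceededError:
        # Same handling path the production code would take.
        save_progress(work_items)
        raise
```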
