Stopping hyperparameter tuning (HPO) jobs after reaching a metric threshold in AWS SageMaker - amazon-sagemaker

I am running HPO jobs in SageMaker, and I am looking for a way to stop an HPO job after one of its child training jobs reaches a specific metric threshold.
PS: I tried SageMaker early stopping, but it only works at the level of epochs within each training job: it stops a training job if it notices that its learning pattern is unlikely to yield as good a metric as the best training jobs found so far. This does not solve my problem, which is at the level of HPO combinations; regardless of what happens within the child training jobs, I want to stop the whole HPO job once one of its children reaches my desired metric threshold.
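One workaround, sketched below, is a small external watcher that polls the tuning job with boto3 and stops it once the best child job has crossed the threshold. This is only a sketch: the tuning job name (my-hpo-job), the threshold (0.95), and the polling interval are placeholders, and it assumes a "Maximize" objective.

```python
import time

import boto3

# Illustrative values -- replace with your own tuning job name and threshold.
TUNING_JOB_NAME = "my-hpo-job"   # assumption: your tuning job's name
METRIC_THRESHOLD = 0.95          # assumption: desired objective value
POLL_SECONDS = 60

sm = boto3.client("sagemaker")

while True:
    desc = sm.describe_hyper_parameter_tuning_job(
        HyperParameterTuningJobName=TUNING_JOB_NAME
    )
    status = desc["HyperParameterTuningJobStatus"]
    if status in ("Completed", "Failed", "Stopped"):
        break

    # BestTrainingJob reflects the best child job that has reported its
    # final objective metric so far.
    best = (
        desc.get("BestTrainingJob", {})
        .get("FinalHyperParameterTuningJobObjectiveMetric", {})
        .get("Value")
    )
    # Assumes a "Maximize" objective; flip the comparison for "Minimize".
    if best is not None and best >= METRIC_THRESHOLD:
        # Stops the whole tuning job; any still-running child jobs are stopped too.
        sm.stop_hyper_parameter_tuning_job(
            HyperParameterTuningJobName=TUNING_JOB_NAME
        )
        break

    time.sleep(POLL_SECONDS)
```

It may also be worth checking whether your SageMaker version supports a completion criterion (a target objective metric value) directly on the tuning job configuration, which would make an external watcher unnecessary.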

Related

Processing job vs. training job in SageMaker

What is the difference between a processing job and a training job? I am running a training job and did not launch a processing job, so why does my account have a processing job running?
A training job focuses on the process that trains a model: for example, it will use a PyTorch or TensorFlow container or a SageMaker built-in algorithm, and it may have a hyperparameter tuning job associated with it. It may include an MPI cluster for distributed training. Finally, it outputs a model artifact.
Processing jobs focus on pre/post-processing, and have APIs and containers for ML data-processing tools like scikit-learn pipelines or Dask/Spark.
Why do you see processing jobs: when you are profiling a training job (enabled by default), a matching processing job is created to process the profiler report. You can disable profiling by adding the disable_profiler=True parameter to the estimator object.
Generally speaking, training jobs have a richer API than processing jobs and allow more customization. The right choice will depend on your specific use case.
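As a rough illustration of the disable_profiler point above, here is a minimal sketch using the SageMaker Python SDK; the image URI, role, and S3 paths are placeholders, not values from the original question.

```python
import sagemaker
from sagemaker.estimator import Estimator

session = sagemaker.Session()

# Placeholder values -- substitute your own image, role, bucket, and channels.
estimator = Estimator(
    image_uri="<your-training-image-uri>",
    role="<your-sagemaker-execution-role-arn>",
    instance_count=1,
    instance_type="ml.m5.xlarge",
    output_path="s3://<your-bucket>/output",
    sagemaker_session=session,
    disable_profiler=True,  # no profiler report => no companion processing job
)

# estimator.fit({"train": "s3://<your-bucket>/train"})
```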

How to reduce the time between Flink intra-jobs and avoid repeated tasks

I have run a bounded Flink job on a standalone cluster, and Flink breaks it down into 3 jobs.
It takes around 10 seconds to start the next job after one job finishes. How can I reduce the time between jobs? Also, when observing the details of the task flow, I notice that the 2nd job repeats the same tasks that were already done by the 1st job, plus new tasks, and so on with the 3rd job. For example, it repeatedly reads the data from files in every job and then joins it. Why does this happen? I am a new Flink user. AFAIK, we can't cache a dataset in Flink. I really need help understanding how this works. Thank you.
Here is the code

How does the run queue work in Snowflake? Is there a concept of a timeslice at all?

I am a newbie to Snowflake and the documentation is not clear.
Say I use a Large warehouse with 5 max concurrent queries.
There are 5 users who fire heavy-duty queries which may take many minutes to finish.
The 6th user has a simple query to execute.
Do the processes running those 5 queries yield at any point in time, or do they run to completion?
Will the 6th user have to wait till the timeout limit is reached and then attempt to use a different virtual warehouse?
Thanks!
The queue is a first-in-first-out queue, like in most (all?) other databases. If a query is queued because other queries are consuming all the resources of the cluster, then it'll have to wait until the other queries finish (or time out) before it can run. Snowflake won't pause a query that is running to "sneak in" a smaller query.
You could always resize the warehouse though to push through the query. Here is a good line from the documentation:
Single-cluster or multi-cluster (in Maximized mode): Statements are queued until already-allocated resources are freed or additional resources are provisioned, which can be accomplished by increasing the size of the warehouse.
This is actually a good question to ask, and understanding how this works in Snowflake will help you use it more optimally. As you already know, Snowflake uses virtual warehouses for compute, which are nothing but clusters of compute nodes. Each node has 8 cores. So, when you submit a query to a virtual warehouse, each query is processed by one or more cores (depending on whether the query can be parallelized). So, if the virtual warehouse does not have any core free to execute the 6th query, it will queue up. If you log on to the Snowflake UI and click on the warehouse tab, you will see this queueing as the yellow portion of the bars. You can also see it in the QUEUED_OVERLOAD_TIME column if you query the QUERY_HISTORY view.
Now, it is not a good thing for queries to queue up consistently, so the best practice is to have a multi-warehouse strategy: give every unique group of workloads a dedicated warehouse so that you can scale it horizontally/vertically based on the query load of the given workload.
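To make the two suggestions above concrete, here is a minimal sketch using the Snowflake Python connector that checks QUEUED_OVERLOAD_TIME per warehouse over the last day and, optionally, resizes a warehouse. The connection parameters and warehouse name are placeholders, and querying SNOWFLAKE.ACCOUNT_USAGE requires the appropriate privileges.

```python
import snowflake.connector

# Placeholder connection parameters -- fill in your own account/credentials.
conn = snowflake.connector.connect(
    account="<your_account>",
    user="<your_user>",
    password="<your_password>",
    warehouse="<your_warehouse>",
)
cur = conn.cursor()

# How much queueing has each warehouse seen over the last day?
# QUEUED_OVERLOAD_TIME is reported in milliseconds.
cur.execute("""
    SELECT warehouse_name,
           COUNT(*)                        AS queries,
           SUM(queued_overload_time) / 1000 AS total_queued_seconds
    FROM snowflake.account_usage.query_history
    WHERE start_time >= DATEADD('day', -1, CURRENT_TIMESTAMP())
    GROUP BY warehouse_name
    ORDER BY total_queued_seconds DESC
""")
for warehouse_name, queries, queued_seconds in cur:
    print(warehouse_name, queries, queued_seconds)

# If one warehouse is consistently queueing, you can resize it (or give that
# workload its own warehouse, as suggested above).
# cur.execute("ALTER WAREHOUSE <your_warehouse> SET WAREHOUSE_SIZE = 'XLARGE'")

conn.close()
```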

Oracle jobs monitoring under certain schema users with functions

I have just started rewriting our Oracle jobs monitoring.
Currently, Nagios calls two different functions to check DBMS and Scheduler job statuses.
What I'm checking now:
DBMS:
if the job is broken;
if the job ran longer than expected (actually this is not working correctly, because I can't determine an average or approximate time it takes);
if it executed late, not on time;
OK, in case none of the above is true.
All this data is collected from sys.dba_jobs and custom conf tables.
Scheduler:
count of failures in a given interval;
too few runs in a given interval;
ran longer than expected.
I'm pretty sure that all results except the count of failures are not accurate. This data is collected from SYS.DBA_SCHEDULER_JOB_RUN_DETAILS and custom conf tables.
What I'm trying to achieve:
avoid useless conf tables;
monitor jobs without a custom conf for each job, because there is always a risk of not adding a job's data to the table, or adding incorrect data;
somehow get accurate data for each job on how long its execution should take and how many times it should have run in a given period.
If anyone has built something like this, please help, or give some advice or source code that I can look at and adapt for my DB.
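One way to get baselines without conf tables (a sketch, assuming the python-oracledb driver and SELECT access to the SYS views) is to derive each job's typical duration and failure count directly from DBA_SCHEDULER_JOB_RUN_DETAILS and flag outliers. The connection details, the 30-day window, and the "2x average" rule below are arbitrary placeholders.

```python
import oracledb

# Placeholder connection details -- replace with your monitoring account.
conn = oracledb.connect(user="<monitor_user>", password="<password>",
                        dsn="<host>/<service>")
cur = conn.cursor()

# Per-job baseline over the last 30 days: run count, failure count, and
# average/max duration in seconds, derived purely from the run history.
cur.execute("""
    SELECT owner,
           job_name,
           COUNT(*) AS runs,
           SUM(CASE WHEN status <> 'SUCCEEDED' THEN 1 ELSE 0 END) AS failures,
           ROUND(AVG(EXTRACT(DAY    FROM run_duration) * 86400
                   + EXTRACT(HOUR   FROM run_duration) * 3600
                   + EXTRACT(MINUTE FROM run_duration) * 60
                   + EXTRACT(SECOND FROM run_duration)), 1) AS avg_seconds,
           ROUND(MAX(EXTRACT(DAY    FROM run_duration) * 86400
                   + EXTRACT(HOUR   FROM run_duration) * 3600
                   + EXTRACT(MINUTE FROM run_duration) * 60
                   + EXTRACT(SECOND FROM run_duration)), 1) AS max_seconds
    FROM sys.dba_scheduler_job_run_details
    WHERE log_date > SYSTIMESTAMP - INTERVAL '30' DAY
    GROUP BY owner, job_name
""")

for owner, job, runs, failures, avg_s, max_s in cur:
    # Example alert rule: flag a job whose worst run took more than twice
    # its average, or which failed at all in the window.
    if failures > 0 or (avg_s and max_s and max_s > 2 * avg_s):
        print(f"CHECK {owner}.{job}: runs={runs} failures={failures} "
              f"avg={avg_s}s max={max_s}s")

conn.close()
```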

GAE - Execute many small tasks after a fixed time

I'd like to make a Google App Engine app that sends a Facebook message to a user a fixed time (e.g. one day) after they click a button in the app. It's not scalable to use cron or the task queue for potentially millions of tiny jobs. I've also considered implementing my own queue using a background thread, but that's only available using the Backends API as far as I know, which is designed for much larger usage and is not free.
Is there a scalable way for a free Google App Engine app to execute a large number of small tasks after a fixed period of time?
For starters, if you're looking to do millions of tiny jobs, you're going to blow past the free quota very quickly, any way you look at it. The free quota's meant for testing.
It depends on the granularity of your tasks. If you're executing a lot of tasks once per day, cron hooked up to a mapreduce operation (which essentially sends out a bunch of tasks on task queues) works fine. You'll basically issue a datastore query to find the tasks that need to be run, and send them out on the mapreduce.
If you execute this task thousands of times a day (every minute), it may start getting expensive because you're issuing many queries. Note that if most of those queries return nothing, the cost is still minimal.
The other option is to store your tasks in memory rather than in the datastore; that's where you'd want to start using backends. But backends are expensive to maintain. Look into using Google Compute Engine, which gives much cheaper VMs.
EDIT:
If you go the cron/datastore route, you'd store a new entity whenever a user wants to send a deferred message. Most importantly, it'd have a queryable timestamp for when the message should be sent, probably rounded to the nearest minute or the nearest 5 minutes, whatever you decide your granularity should be.
You would then have a cron job that runs at the set interval, say every minute. On each run it would build a query for all the deferred messages it needs to send for the given minute.
If you really do have hundreds of thousands of messages to send each minute, you're not going to want to do it from the cron task. You'd want the cron task to spawn a mapreduce job that will fan out the query and spawn tasks to send your messages.
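A minimal sketch of the cron/datastore route described above, using the classic App Engine Python runtime (ndb + webapp2). The model, route, and send_facebook_message helper are placeholders, and at real volume the handler would fan work out to task queues or a mapreduce, as the answer suggests, rather than send messages inline.

```python
# cron.yaml entry (assumed):
#   - url: /tasks/send-due
#     schedule: every 1 minutes
import datetime

import webapp2
from google.appengine.ext import ndb


class DeferredMessage(ndb.Model):
    """One deferred Facebook message, created when the user clicks the button."""
    user_id = ndb.StringProperty(required=True)
    # Rounded to the minute so each cron run matches a whole bucket at once.
    send_at = ndb.DateTimeProperty(required=True)
    sent = ndb.BooleanProperty(default=False)


def send_facebook_message(user_id):
    # Placeholder: call the Facebook API / your messaging code here.
    pass


class SendDueMessages(webapp2.RequestHandler):
    """Handler wired to a cron entry that runs every minute."""

    def get(self):
        now = datetime.datetime.utcnow()
        due = DeferredMessage.query(
            DeferredMessage.sent == False,
            DeferredMessage.send_at <= now,
        ).fetch(500)
        for msg in due:
            send_facebook_message(msg.user_id)
            msg.sent = True
        ndb.put_multi(due)


app = webapp2.WSGIApplication([("/tasks/send-due", SendDueMessages)])
```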

Resources