What's the best way to create recurring tasks?
Should I invent some special syntax and parse it, similar to cron jobs on Linux, or should I instead just use a cron job that runs every hour and creates the next instances of those recurring tasks with no end?
Keep in mind that there can be both endless recurring tasks and tasks with an end date.
Quartz is an open source job scheduling system that uses cron expressions to control the periodicity of the job executions.
My approach is always "minimum effort for maximum effect" (or best bang per buck).
If it can be done with cron, why not use cron? I'd consider it wasted effort to re-implement cron just for the fun of it, so unless you really need features that cron doesn't have, stick with it.
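For illustration, here is a minimal sketch of the "hourly cron job materializes upcoming instances" idea, assuming each recurring task is stored with a standard cron expression and an optional end date. It uses the third-party croniter package; the RecurringTask fields and the create_task() call are placeholders, not anything from the question.

```python
# Minimal sketch: an hourly cron job that turns recurring definitions into
# concrete task instances. Assumes the third-party `croniter` package;
# RecurringTask fields and create_task() are hypothetical.
from datetime import datetime, timedelta
from croniter import croniter

def materialize_next_hour(recurring_tasks, now=None):
    """Create concrete task instances that fall within the next hour,
    respecting an optional end date on each recurring definition."""
    now = now or datetime.utcnow()
    horizon = now + timedelta(hours=1)
    for task in recurring_tasks:                  # e.g. rows from your DB
        it = croniter(task.cron_expression, now)  # standard cron syntax
        run_at = it.get_next(datetime)
        while run_at < horizon:
            if task.end_date and run_at > task.end_date:
                break                             # finite series is done
            create_task(task.id, run_at)          # hypothetical persistence call
            run_at = it.get_next(datetime)
```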
I have a job that applies some regex to a large number of strings.
I achieve this by creating a cross-join job that takes a list of strings and a list of regexes as input.
Normally this runs fine, but once in a while, for a certain input and regex pair, the task execution never terminates, usually because the input is too large or the regex is inefficient.
In that case I would prefer the task, or the job as a whole, to be timed out, so that I know something is wrong and can skip that input.
I went through the Flink configuration docs but wasn't able to find such an option.
As a workaround I apply the regex in an asynchronous future inside the task and cancel it after a certain time, but that seems like overkill. Hence I'm looking for a better solution.
There is no job-level timeout in Flink that does what you're looking for. Since this sounds very batch-oriented, I also don't think this is a feature that's being actively worked on.
Nevertheless, your solution is actually quite good already. Other options depend on your infrastructure. If you trigger the job through Airflow or any other workflow system, I'd imagine they can cancel tasks after some time. If you run it on K8s or YARN, you may be able to limit the total resource usage. But if you don't use any of those, then your solution is a good safeguard.
Some more ideas: do you really need the slow Java regexes, or could you use RE2 or other automaton-based libraries? Could you add some sanity checks that skip very large input strings? Could you simply stop applying the CrossFunction once your time budget has run out (graceful termination)?
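To make those ideas concrete, here is a rough, framework-agnostic Python sketch of two of the safeguards (skipping oversized inputs and bounding the matching time). It is not Flink API; the thresholds and function names are made up.

```python
# Sketch of the two safeguards discussed above: a sanity check on input
# size, plus the "future with a timeout" workaround from the question.
import re
from concurrent.futures import ThreadPoolExecutor, TimeoutError

MAX_INPUT_LEN = 100_000     # sanity check: skip suspiciously large strings
MATCH_TIMEOUT_S = 5.0       # time budget per (string, regex) pair

executor = ThreadPoolExecutor(max_workers=1)

def safe_match(pattern, text):
    """Return the match result, or None if the input is too large or
    matching does not finish within the time budget."""
    if len(text) > MAX_INPUT_LEN:
        return None
    future = executor.submit(re.search, pattern, text)
    try:
        return future.result(timeout=MATCH_TIMEOUT_S)
    except TimeoutError:
        # Note: the worker thread keeps running until the regex engine
        # finishes; the timeout only lets the caller move on, which is
        # essentially the future-based workaround from the question.
        return None
```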
Regarding best practices and efficiency within Flink, what are the recommendations on when to split analytics into multiple tasks?
For example, take a single Kafka topic as the source of the data, and suppose many simple operations are to be carried out over the stream, such as checking whether some value is greater than x, or whether both x and y hold, etc. At what point would you stop adding more rules to the same task and start running them in parallel?
Is there any official recommendation for this?
It's hard to give a general recommendation. Performance-wise, it makes sense to put as much as possible into one job.
However, it's much more important to think about maintenance. I'd put closely related things into one job, so that new features or bug fixes will likely affect only that job; at the same time, you also don't want to stop all analytics when upgrading one particular query.
Another dimension to think about is state size. It's related to restarts and update frequency (point above). If the state size becomes too big, restarting this one monster job takes a long time, which would be inefficient if you only touched a small fraction of the code.
Finally, it also depends on the relevance. If some part of your job is super important as it reflects the one KPI that drives your business, then you probably don't want to mix that with some fragile, unimportant part.
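As a rough illustration of grouping closely related rules into a single job, here is a small PyFlink-style sketch; the rules, field names and thresholds are invented, and a real job would of course read from the Kafka topic rather than a collection.

```python
# Sketch: several closely related, simple rules kept in one Flink job,
# while unrelated or business-critical queries would live in their own job.
from pyflink.datastream import StreamExecutionEnvironment

env = StreamExecutionEnvironment.get_execution_environment()

# A collection source keeps the sketch self-contained; in practice this
# would be the Kafka topic mentioned in the question.
events = env.from_collection([{"x": 3, "y": 7}, {"x": 12, "y": 1}])

# Related rules chained in the same job.
alerts_x = events.filter(lambda e: e["x"] > 10).map(lambda e: ("x_too_high", e))
alerts_xy = events.filter(lambda e: e["x"] > 5 and e["y"] > 5).map(lambda e: ("x_and_y", e))

alerts_x.print()
alerts_xy.print()

env.execute("related-rules-job")
```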
I am setting up a push task queue on my Google App Engine app with a countdown parameter so a task will execute at some point in the future.
However, my countdown parameter can be very large in seconds, for instance months or even a year in the future. I just want to make sure this will not cause any problems or overhead costs. Is there a more efficient way to do this?
It probably would work, but it seems like a bad idea. What do you do if you change your task processing code? You can't modify a task in the queue. You'd somehow have to keep track of the tasks, delete the old ones and replace them with new ones that work with your updated code.
Instead, store information about the tasks in the data store. Run a cron job once a day or once a week, process the info in the data store, and launch the tasks as needed. You can still use a countdown if you need a precise execution date and time.
The current limit in Task Queues is 30 days, and we don't have plans to raise that substantially.
Writing scheduled operations to datastore and running a daily cron job to inject that day's tasks is a good strategy. That would allow you to update the semantics as your product evolves.
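A minimal sketch of that datastore-plus-daily-cron strategy, using the legacy App Engine Python APIs; the ScheduledTask model, the URLs and the field names are hypothetical.

```python
# Sketch: store scheduled work in the datastore and let a daily cron job
# enqueue only the tasks due within the next day, well inside the 30-day
# task queue limit. Model, URLs and fields are made up for illustration.
import datetime
import webapp2
from google.appengine.ext import ndb
from google.appengine.api import taskqueue

class ScheduledTask(ndb.Model):
    run_at = ndb.DateTimeProperty(required=True)  # when it should execute
    payload = ndb.JsonProperty()

class DailyCronHandler(webapp2.RequestHandler):
    def get(self):
        now = datetime.datetime.utcnow()
        horizon = now + datetime.timedelta(days=1)
        due = ScheduledTask.query(ScheduledTask.run_at <= horizon).fetch()
        for t in due:
            # Countdown gives a precise execution time within the day.
            countdown = max(0, int((t.run_at - now).total_seconds()))
            taskqueue.add(url='/worker', countdown=countdown,
                          params={'task_id': t.key.urlsafe()})
            t.key.delete()   # or mark as enqueued instead of deleting

app = webapp2.WSGIApplication([('/cron/enqueue-daily', DailyCronHandler)])
```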
I need to run a (Python) script on App Engine many times.
One possibility is just to run a loop and use urlfetch with a link to the script.
The other is to enqueue a task with the script's URL.
What is the difference between both ways? It seems like Tasks have a quota (100,000 daily free tasks) so why should I use them?
Thanks,
Joel
Briefly:
Bulk adding tasks to the queue will probably be easier, and possibly quicker, than using URLFetch (see the sketch below), although asynchronous URL fetches might help with that.
When a task fails, it is automatically retried. With URLFetch, even assuming you check the status of your call, the request might just hang for a while before you get some kind of error.
You can control the rate at which tasks are executed. So if you quickly add 1,000 tasks, you can let them run slowly at 10 per minute (or whatever rate you want), which helps you avoid blowing through your other quotas.
If you enable billing, the quota rises to 20,000,000 tasks per day.
Depending on what you are doing, tasks can be transactionally enqueued, which gives you some really powerful abilities.
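For reference, a small sketch of bulk-adding tasks to a named, rate-limited queue with the legacy App Engine Python API; the queue name, worker URL and parameters are made up, and the actual rate would be configured in queue.yaml.

```python
# Sketch: bulk-add tasks to a named queue instead of looping over URLFetch.
# Queue name, URL and params are hypothetical; the 10/m rate would be set
# on the queue in queue.yaml.
from google.appengine.api import taskqueue

def enqueue_batch(item_ids):
    queue = taskqueue.Queue('slow-queue')   # e.g. rate: 10/m in queue.yaml
    tasks = [taskqueue.Task(url='/worker', params={'id': i})
             for i in item_ids]
    # A single add() call accepts at most 100 tasks, so add in chunks.
    for start in range(0, len(tasks), 100):
        queue.add(tasks[start:start + 100])
```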
Let's say I have thousands of jobs to perform repeatedly; how would you propose I architect my system on Google App Engine?
I need to be able to add more jobs while effectively scaling the system. Scheduled tasks are of course part of the solution, as are task queues, but I am looking for more insight as to how best to utilize these resources.
NOTE: There are no dependencies between "jobs".
Based on what little description you've provided, it's hard to say. You probably want to use the Task Queue, and maybe the deferred library if you're using Python. All that's required to use these is to use the API to enqueue a task.
If you're talking about having many repeating tasks, you have a couple of options:
Start off the first task on the task queue manually, and use 'chaining' to have each invocation queue the next one with the appropriate interval (see the sketch after these options).
Store each schedule in the datastore. Have a cron job regularly scan for any tasks that have reached their ETA; fire off a task queue task for each, updating the ETA for the next run.
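Here is a rough sketch of option 1, task chaining, with the legacy App Engine Python API; the handler URL, the job_id parameter and the do_one_job() stub are hypothetical.

```python
# Sketch: each task does one unit of work, then re-enqueues itself with a
# countdown so the chain continues indefinitely. URL, parameter and
# do_one_job() are placeholders.
import webapp2
from google.appengine.api import taskqueue

INTERVAL_SECONDS = 3600   # how often each job should repeat

def do_one_job(job_id):
    pass                  # placeholder for the real work

class ChainedJobHandler(webapp2.RequestHandler):
    def post(self):
        job_id = self.request.get('job_id')
        do_one_job(job_id)
        # Re-enqueue the same job to form the next link in the chain.
        taskqueue.add(url='/jobs/run',
                      params={'job_id': job_id},
                      countdown=INTERVAL_SECONDS)

app = webapp2.WSGIApplication([('/jobs/run', ChainedJobHandler)])

def kick_off(job_ids):
    """Start the first link of each chain manually."""
    for job_id in job_ids:
        taskqueue.add(url='/jobs/run', params={'job_id': job_id})
```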
I think you could use Cron Jobs.
Regards.