How to scale out Flink task managers

I wonder what the best practice is for scaling out Flink task managers automatically.
For example, if I'm running on AWS and add a new task manager to an Auto Scaling group,
will it be discovered by the job manager and start executing jobs?
Thanks

Related

JobID On Restart Flink Session Cluster

We have a Flink Session Cluster that runs multiple jobs within it. I'm trying to see if there is a way, when a job restarts, to keep the jobID sticky (so it does not change on a job restart), or to link the new job to the old one somehow.
Say a job with ID X is running on a session cluster and it restarts for some reason (not the job manager or task manager, the job itself). The new job gets, I believe, a jobID Y (or will it remain the same X?). Basically, I would like to know that they're the same job.
What options do I have in this regard?

Is Flink job manager stateful or stateless?

Let's say all nodes that are running Flink job manager are restarted at the same time, is there any impact to the running task managers which are untouched?
Thanks!
The new job managers will restart all of the jobs from their latest checkpoints, using the information (job graphs, checkpoint metadata) they find in the HA service provider.

Use of Flink/Kubernetes to replace ETL jobs (on SSIS): one Flink cluster per job type, or create and destroy a Flink cluster per job execution?

I am trying to assess the feasibility of replacing hundreds of feed-file ETL jobs built as SSIS packages with Apache Flink jobs (with Kubernetes as the underlying infrastructure). One recommendation I saw in an article is "to use one Flink cluster for one type of job".
Since I have only a handful of jobs per day of each job type, does this mean the best approach for me is to create a Flink cluster on the fly when executing a job and destroy it afterwards to free up resources? I am setting up flinkcluster without job manager.
Any suggestions on best practices for using Flink for batch ETL activities?
Maybe the most important question: is Flink the right solution for this problem, or should I look more into Talend and other classic ETL tools?
Flink is well suited for running ETL workloads. The two deployment modes give you the following properties:
Session cluster
A session cluster lets you run several jobs on the same set of resources (TaskExecutors). You start the session cluster before submitting any jobs.
Benefits:
No additional cluster deployment time needed when submitting jobs => Faster job submissions
Better resource utilization if individual jobs don't need many resources
One place to control all your jobs
Downsides:
No strict isolation between jobs
Failures caused by job A can cause job B to restart
Job A runs in the same JVM as job B and hence can influence it if statics are used
Per-job cluster
A per-job cluster starts a dedicated Flink cluster for every job.
Benefits:
Strict job isolation
More predictable resource consumption since only a single job runs on the TaskExecutors
Downsides:
Cluster deployment time is part of the job submission time, resulting in longer submission times
No single cluster that controls all your jobs
Recommendation
So if you have many short-lived ETL jobs that require a fast response, I would suggest using a session cluster, because you avoid the cluster start-up time for every job. If the ETL jobs have a long runtime, this additional time carries little weight, and I would choose the per-job mode, which gives you more predictable runtime behaviour because of strict job isolation.
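To make the recommendation a bit more concrete, here is a rough sketch in Python (plain arithmetic, not Flink API code). It compares the cluster start-up time with the job runtime. The function names, the 10% threshold and the example timings are assumptions chosen for illustration, not anything Flink prescribes.

```python
def startup_overhead(cluster_startup_s: float, job_runtime_s: float) -> float:
    """Fraction of the end-to-end time spent starting a dedicated cluster."""
    if cluster_startup_s < 0 or job_runtime_s <= 0:
        raise ValueError("startup must be >= 0 and runtime must be > 0")
    return cluster_startup_s / (cluster_startup_s + job_runtime_s)


def suggest_mode(cluster_startup_s: float, job_runtime_s: float,
                 max_overhead: float = 0.10) -> str:
    """Suggest 'session' when start-up time dominates, else 'per-job'.

    max_overhead is an assumed, tunable threshold (10% by default).
    """
    overhead = startup_overhead(cluster_startup_s, job_runtime_s)
    return "session" if overhead > max_overhead else "per-job"


# Assumed example timings: 60 s to bring up a cluster.
print(suggest_mode(60, 30))     # short job: start-up dominates
print(suggest_mode(60, 3600))   # long job: start-up is negligible
```

This ignores the isolation argument entirely; if one job must never be affected by another, the per-job mode may still be the better choice even for short jobs.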

What is the difference between SQL Job and Windows Task Scheduler?

What is the difference between a SQL Job and the Windows Task Scheduler?
I can add SQL queries in both, so what is the difference?
SQL Jobs operate in the context of SQL Server Agent, which is a part of SQL Server. Scheduling something related to SQL Server, such as running a query or maintenance tasks, is very easy through SQL Server jobs.
Task Scheduler, on the other hand, comes with the operating system. You can schedule tasks with it as well, but scheduling anything related to SQL Server is much harder, since you have to take care of authentication and many other factors yourself.
A job is the basic unit of a workflow, and a task is a specific step toward completing the job; you could say the job is the forest and a task is a tree. You can set up a job without a task, but you cannot set up a task without a job.

Scheduling tasks Advice? .Net, SQL Job?

I am creating a system where users can set up mailings to go out at specific times. Before I begin, I wanted to get some advice. First, is there already a .Net component that will handle scheduling jobs (either running another application or calling a URL) that will do what I am suggesting (Open Source would be cool)? If there isn’t, is it better to schedule a job in SQL and run some sort of script, create a .Net service that will look at an XML file or DB for schedules, or have an application create scheduled tasks? There could be a ton of tasks, so I am thinking that creating scheduled tasks or SQL jobs might not be a good idea.
Here may be a typical scenario; a user wants to send a newsletter to their clients. The user creates the newsletter on a Saturday, but doesn’t want it to go out until Monday. The user wants that same e-mail to go out every Monday for a month.
Thanks for looking!
Check out Quartz.NET:
"Quartz.NET is a full-featured, open source job scheduling system that can be used from smallest apps to large scale enterprise systems."
If you want to use the services readily available in Windows itself, check out the article "A New Task Scheduler Task Library" on CodeProject, which shows how to create scheduled tasks in Windows from your C# application.
You probably have more flexibility and power if you use C# and scheduled tasks in Windows, rather than limiting yourself to what can be done in SQL Server. SQL Server Agent Jobs are great - for database specific stuff, mostly - maintenance plans and so forth.
You can build your own Windows service that schedules and executes jobs. Be sure to make good abstractions. In a similar project, I used an abstraction where scheduled items are modelled as jobs composed of tasks. For example, sending a newsletter may be a job, whereas sending the newsletter to each subscriber can be considered a task. You then need to run the jobs and tasks under a defined threading model, preferably using ThreadPool threads or the Task Parallel Library. Be sure to use asynchronous APIs for IO whenever possible. Also separate your scheduling logic from the abstractions, so that the scheduling logic can execute arbitrary types of jobs and their constituent tasks.
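The job/task abstraction described above could be sketched roughly as follows. This is written in Python rather than .NET, with `concurrent.futures.ThreadPoolExecutor` standing in for ThreadPool threads or the Task Parallel Library. All class and function names here (`Task`, `Job`, `run_job`, `make_newsletter_job`) are hypothetical, and the mail call is a placeholder.

```python
from concurrent.futures import ThreadPoolExecutor
from dataclasses import dataclass, field
from typing import Callable, List


@dataclass
class Task:
    """One unit of work inside a job, e.g. sending to a single subscriber."""
    name: str
    action: Callable[[], str]


@dataclass
class Job:
    """A schedulable item composed of independent tasks."""
    name: str
    tasks: List[Task] = field(default_factory=list)


def run_job(job: Job, max_workers: int = 4) -> List[str]:
    """Run every task of a job on a thread pool; results keep task order."""
    with ThreadPoolExecutor(max_workers=max_workers) as pool:
        futures = [pool.submit(task.action) for task in job.tasks]
        return [future.result() for future in futures]


def make_newsletter_job(subscribers: List[str]) -> Job:
    """Build a newsletter job with one task per subscriber."""
    def send_to(address: str) -> Callable[[], str]:
        # Placeholder for a real mail call (assumption, no mail is sent).
        return lambda: f"sent to {address}"

    return Job("newsletter", [Task(f"send:{s}", send_to(s)) for s in subscribers])


if __name__ == "__main__":
    job = make_newsletter_job(["a@example.com", "b@example.com"])
    for line in run_job(job):
        print(line)
```

Note that `run_job` knows nothing about newsletters: it only sees a `Job` and its `Task` list, which is the separation between scheduling logic and job definitions that the answer recommends.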

Resources