We have a Flink session cluster that runs multiple jobs within it. I'm trying to see if there is a way, when a job restarts, to keep the jobID sticky (i.e., not change on a job restart), or to link the new job to the old one somehow.
Say a job with ID X is running on a session cluster and it restarted for some reason (not the JobManager or TaskManager; the job itself). The new job gets, I believe, a jobID Y (or will it remain the same X?). Basically, I would like to be able to tell that they are the same job.
What options do I have in this regard?
Related
I am trying to assess the feasibility of replacing hundreds of feed-file ETL jobs, currently built as SSIS packages, with Apache Flink jobs (with Kubernetes as the underlying infrastructure). One recommendation I saw in an article is "to use one flink cluster for one type of job".
Since I have only a handful of jobs per day of each job type, this suggests the best approach for me is to create the Flink cluster on the fly when executing a job and destroy it afterwards to free up resources. Is that the correct way to do it? I am setting up the Flink cluster without a job manager.
Any suggestions on best practices for using Flink for batch ETL activities?
Maybe the most important question: is Flink the correct solution for this problem, or should I look more at Talend and other classic ETL tools?
Flink is well suited for running ETL workloads. The two deployment modes give you the following properties:
Session cluster
A session cluster allows you to run several jobs on the same set of resources (TaskExecutors). You start the session cluster before submitting any jobs.
Benefits:
No additional cluster deployment time needed when submitting jobs => Faster job submissions
Better resource utilization if individual jobs don't need many resources
One place to control all your jobs
Downsides:
No strict isolation between jobs
Failures caused by job A can cause job B to restart
Job A runs in the same JVM as job B and hence can influence it if statics are used
Per-job cluster
A per-job cluster starts a dedicated Flink cluster for every job.
Benefits:
Strict job isolation
More predictable resource consumption since only a single job runs on the TaskExecutors
Downsides:
Cluster deployment time is part of the job submission time, resulting in longer submission times
Not a single cluster which controls all your jobs
Recommendation
So if you have many short-lived ETL jobs that require a fast response, I would suggest using a session cluster, because you avoid the cluster start-up time for every job. If the ETL jobs have a long runtime, then this additional time carries little weight, and I would choose the per-job mode, which gives you more predictable runtime behaviour because of the strict job isolation.
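For a sense of what such an ETL job looks like in code, here is a minimal batch sketch using the DataSet API; the file paths and job name are placeholders, and a real feed-file job would of course do more than drop empty lines:

import org.apache.flink.api.java.DataSet;
import org.apache.flink.api.java.ExecutionEnvironment;

public class FeedFileEtl {
    public static void main(String[] args) throws Exception {
        ExecutionEnvironment env = ExecutionEnvironment.getExecutionEnvironment();

        // read a feed file, drop empty lines, and write the cleaned result back out
        DataSet<String> lines = env.readTextFile("file:///feeds/in/customers.csv"); // placeholder path
        DataSet<String> nonEmpty = lines.filter(line -> !line.trim().isEmpty());
        nonEmpty.writeAsText("file:///feeds/out/customers-clean");                  // placeholder path

        env.execute("feed-file-etl"); // placeholder job name
    }
}

The same program can be submitted to a long-running session cluster or packaged into its own per-job cluster; the code does not change, only how the cluster is deployed.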
I have multiple Kafka topics (multi-tenancy) and I run the same job multiple times, based on the number of topics, with each job consuming messages from one topic. I have configured the filesystem as the state backend.
Assume there are 3 jobs running. How do checkpoints work here? Do all 3 jobs store their checkpoint information under the same path? If any of the jobs fails, how does the job know where to recover the checkpoint information from? We give a job name while submitting a job to the Flink cluster; does that have anything to do with it? In general, how does Flink differentiate the jobs and their checkpoint information when restoring after a failure or a manual restart of the jobs (irrespective of whether they are the same or different jobs)?
Case 1: What happens in case of a job failure?
Case 2: What happens if we manually restart the job?
Thank you
To follow on from what @ShemTov was saying:
Each job will write its checkpoints in a sub-dir named with its jobId.
If you manually cancel a job, the checkpoints are deleted (since they are no longer needed for recovery), unless they have been configured to be retained:
// retain checkpoints on cancellation so they can later be used for a manual restart or rescaling
CheckpointConfig config = env.getCheckpointConfig();
config.enableExternalizedCheckpoints(ExternalizedCheckpointCleanup.RETAIN_ON_CANCELLATION);
Retained checkpoints can be used for manually restarting, and for rescaling.
Docs on retained checkpoints.
If you have high availability configured, the job manager's metadata about checkpoints will be stored in the HA store, so that recovery does not depend on the job manager's survival.
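For context, here is a minimal sketch of how these pieces fit together in the job code; the checkpoint interval and directory are placeholders (the directory is more commonly set via state.checkpoints.dir in flink-conf.yaml):

import org.apache.flink.runtime.state.filesystem.FsStateBackend;
import org.apache.flink.streaming.api.environment.CheckpointConfig.ExternalizedCheckpointCleanup;
import org.apache.flink.streaming.api.environment.StreamExecutionEnvironment;

StreamExecutionEnvironment env = StreamExecutionEnvironment.getExecutionEnvironment();

// take a checkpoint every 60 seconds
env.enableCheckpointing(60_000);

// filesystem state backend; each job writes its checkpoints under <path>/<jobId>/
env.setStateBackend(new FsStateBackend("file:///flink/checkpoints")); // placeholder path

// keep the latest checkpoint if the job is cancelled manually
env.getCheckpointConfig()
   .enableExternalizedCheckpoints(ExternalizedCheckpointCleanup.RETAIN_ON_CANCELLATION);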
The JobManager is aware of each job's checkpoints and keeps that metadata. Checkpoints are saved to the checkpoint directory (configured via flink-conf.yaml), and under this directory a separate subdirectory (named with the job's ID, which looks like a random hash) is created for each job's checkpoints.
Case 1: The job will restart (depending on your restart strategy), and if checkpointing is enabled it will restore from the last checkpoint.
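For example, a minimal sketch of configuring a restart strategy in code (the numbers are placeholders; restart strategies can also be set globally in flink-conf.yaml):

import org.apache.flink.api.common.restartstrategy.RestartStrategies;
import org.apache.flink.streaming.api.environment.StreamExecutionEnvironment;

StreamExecutionEnvironment env = StreamExecutionEnvironment.getExecutionEnvironment();
// restart up to 3 times, waiting 10 seconds between attempts;
// with checkpointing enabled, each restart restores from the latest completed checkpoint
env.setRestartStrategy(RestartStrategies.fixedDelayRestart(3, 10_000L));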
Case 2: I'm not 100% sure, but I think if you cancel the job manually and then submit it again, it won't read the checkpoint. You'll need to use a savepoint. (You can cancel your job with a savepoint, and then submit your job again with the same savepoint.) Just be sure that every operator has a UID. You can read more about savepoints here: https://ci.apache.org/projects/flink/flink-docs-stable/ops/state/savepoints.html
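As a hypothetical sketch of what assigning UIDs looks like (the uid strings and the pipeline itself are placeholders; the point is that every operator gets a stable, explicit ID so its state can be matched when restoring from a savepoint):

import org.apache.flink.streaming.api.environment.StreamExecutionEnvironment;

StreamExecutionEnvironment env = StreamExecutionEnvironment.getExecutionEnvironment();
env.fromElements("a", "b", "c").uid("demo-source")             // stable ID for the source
   .filter(value -> !value.isEmpty()).uid("non-empty-filter")  // stable ID for this operator
   .print().uid("print-sink");                                 // stable ID for the sink
env.execute("uid-example");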
I have a situation where I have a job that runs every day (Job A), a job that runs every 2 days (Job B) and another job that runs every weekend (Job C). I need to make sure that Job A runs before Job B. If Job A does not run properly, then I don't want Job B to run. The same thing applies to Job C. Anyone have any thoughts on how to go about this?
Appreciate any help
I have used a product called SQL Sentry to do what you are trying to do. SQL Sentry has a lot of other advanced monitoring and control functionality (like killing jobs that hang, queuing low priority jobs, etc). Here is their website https://sentryone.com/platform/sql-server-performance-monitoring.
This is a quote from one of their advertising:
19. Chaining and Queuing
Did you ever wish you could find just a few more hours in your maintenance window, or need to have jobs run in a particular sequence? The advanced chaining features in SQL Sentry Event Manager can assure that interdependent jobs run in the proper order without wasting time or resources.
Chaining
SQL Sentry Event Manager allows you to chain SQL Agent Jobs, Windows Tasks, or Oracle Jobs across servers. You can enforce dependencies and automate workflow throughout your entire enterprise, even across platforms! The graphical chaining interface allows you to design the workflow using variables such as completion, success, or failure. More details are available in the User Guide, but take a few minutes to watch our two video tutorials on chaining - Graphical Chaining Interface and Advanced Chaining.
I need to make sure that Job A runs before Job B. If Job A does not run properly, then I don't want Job B to run. The same thing applies to Job C.
Create all jobs A, B, and C, and schedule only job A. At the end of job A, on the success event, call job B like below:
-- in the last (on-success) step of job A, start the next job by name
EXEC msdb.dbo.sp_start_job N'Weekly Sales Data Backup';
GO
Now the same thing applies to job C: call job C on the success event of job B.
I would go with this approach. You could also go with an approach of inserting success/failure values into a table and ensuring job B or C reads those values before starting.
I work with an environment that uses Merge Replication to publish a dozen publications to 6 a dozen subscribers every 10 minutes. When certain jobs are running simultaneously, deadlocks and blocking are encountered and the replication process is not efficient.
I want to create a SQL Server Agent Job that runs a group of Merge Replication Jobs in a particular order waiting for one to finish before the next starts.
I created an SSIS package that starts the jobs in sequence, but it uses sp_start_job, and when run it immediately starts all the jobs, so they are running together again.
A side purpose is to be able to disable replication to a particular server instead of individually disabling a dozen jobs or temporarily disabling replication completely to avoid 70+ individual disablings.
Right now, if I disable a Merge Replication job, the SSIS package will still start and run it anyway.
I have now tried creating an SSIS package for each Replication Job and then creating a SQL Server Agent job that calls these packages in sequence. That job takes 8 seconds to finish while the individual packages it is calling (starting a replication job) takes at least a minute to finish. In other words, that doesn't work either.
The SQL Server Agent knows when a Replication job finishes! Why doesn't an SSIS package or job step know? What is the point of having a control flow if it doesn't work?
Inserting waits is useless: the individual jobs can take anywhere from 1 second to an hour depending on what needs replicating.
Maybe I'm not seeing the real problem, but it is natural that you need a synchronization point, and there are many ways to create one.
For example, you could still run the jobs simultaneously but let the first job lock a resource that the second one needs, so the second job waits until the resource is unlocked. Or the second job can poll a log table in a loop (waiting a "minute" between checks and cancelling itself after "an hour")...
I am working with SQL Server 2008. Using the Agent, I have created a job and scheduled it to execute every minute.
The job executes a stored procedure that moves data from table XXX, to a temp table, and then eventually into table YYY.
The execution of the job may take more than one minute - since the data is rather large.
Will a second instance of the job be started even though the first instance is still running?
If so, should I mark records in the temp table (status = 1) to indicate that those records are being processed by a previous instance of the job?
Is there a way for me to check that an instance of the job is currently running, so that I don't initiate a second instance of the job?
Is there another solution for this that I am unaware of? (throughput is important)
Only one instance of a particular job can run at any one time.
So there is no need to take any particular precautions against another execution of the same job beginning before the first one has stopped.
Check this post:
How to Prevent Sql Server Jobs to Run simultaneously
As well as the "Running Jobs" page here:
http://technet.microsoft.com/en-us/library/aa213815(v=sql.80).aspx
If a job has started according to its schedule, you cannot start another instance of that job on the same server until the scheduled job has completed. In multiserver environments, every target server can run one instance of the same job simultaneously.