Our Flink cluster sometimes restarts, and all jobs are restarted with it. Occasionally some jobs fail to restart, and the failed count on the dashboard increases. However, the panel does not tell us which jobs failed.
As the total job count grows, it becomes harder to find the stopped jobs. Does anyone know how I can get the names of the failed jobs?
You could write a simple script that gives you the list of names of the jobs that have failed.
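For example, here is a minimal sketch that pulls the failed job names from the JobManager REST API. It assumes the REST endpoint is reachable on the default port 8081 and that curl and jq are installed; adjust the address to your setup.

#!/usr/bin/env bash
# Print the names of all jobs the Flink REST API currently reports as FAILED.
FLINK_REST="${FLINK_REST:-http://localhost:8081}"
curl -s "$FLINK_REST/jobs/overview" \
  | jq -r '.jobs[] | select(.state == "FAILED") | .name'

You could run something like this from cron or your alerting tool and trigger an alert on a non-empty result.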
I am using this command to get a list of failed jobs.
$yarn application -list -appStates KILLED
Set up an alert for when your cluster restarts; after the restart, check which jobs have not come back up, and you could have alerts for those as well.
Related
I have a job which has only one step. This step calls an SSIS package. Most of the time it runs fine; however, sometimes it fails due to some connectivity issues.
It is very hard to track this kind of failure since, when I open the job history (screenshot below), it shows the job completed successfully. Only when I click on the step I highlighted on the same screenshot can I see the error.
I have plenty of jobs like this and it is very hard to track these kinds of errors manually.
Below is the actual error that caused the job step to fail, even though the overall job status shows it as a success.
This is a weird scenario and almost impossible to track. We already have a job failure reporting mechanism, but it tracks only the overall job failure and is unable to track job step failures.
Logically speaking, the overall status of the job should be failed if one or more steps fail to complete. I have checked the Advanced options below and everything looks fine. I am not sure where to start. Please provide some insights on this.
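For anyone hitting the same problem: the step-level outcomes are recorded in msdb and can be pulled out with a query along these lines. This is only a minimal sketch; the run_status codes (0 = Failed) come from the sysjobhistory documentation, and you would still filter it down to the jobs you care about.

-- List failed job steps recorded in the Agent job history,
-- even when the overall job outcome was reported as succeeded.
SELECT j.name AS job_name,
       h.step_id,
       h.step_name,
       h.run_date,
       h.run_time,
       h.message
FROM msdb.dbo.sysjobhistory AS h
JOIN msdb.dbo.sysjobs AS j ON j.job_id = h.job_id
WHERE h.step_id > 0        -- step rows only; step_id = 0 is the overall job outcome row
  AND h.run_status = 0     -- 0 = Failed
ORDER BY h.instance_id DESC;

Ordering by instance_id is just a convenient way to see the most recent failures first.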
I am using Apache Flink 1.10 to batch-compute my stream data. Today I moved my Apache Flink Kubernetes (v1.15.2) pod from machine 1 to machine 2 and found that all of the job submission records and the task list had disappeared. What is happening? Are the submission records kept only in memory? What should I do to keep my submission records and task list when the Apache Flink Kubernetes pod restarts? I have only found that checkpoints are persisted, but nothing about the tasks.
If I lose the running task history, I have to upload my task jar and recreate every task, and there are a lot of tasks to recreate if the history is lost. Is there any way to resume the tasks automatically?
The configurations that might not be set are:
Job Manager
jobmanager.archive.fs.dir: hdfs:///completed-jobs
History Server
# Monitor the following directories for completed jobs
historyserver.archive.fs.dir: hdfs:///completed-jobs
# Refresh every 10 seconds
historyserver.archive.fs.refresh-interval: 10000
Please look at the following for more details: https://ci.apache.org/projects/flink/flink-docs-stable/monitoring/historyserver.html#configuration
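Once those entries are in flink-conf.yaml, the HistoryServer process itself also has to be running. A minimal sketch, assuming the standard Flink distribution layout and an HDFS archive directory as above:

# Create the archive directory the JobManager writes completed jobs to
hdfs dfs -mkdir -p /completed-jobs
# Start the HistoryServer; its web UI listens on port 8082 by default
./bin/historyserver.sh start

Note that this only preserves the history of completed and failed jobs for inspection; it does not resubmit them after a restart.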
What happens when an exception is thrown from the jar application to the Task Manager while processing an event?
a) Does the Flink Job Manager kill the existing Task Manager and create a new one?
b) Does the Task Manager itself recover from the failed execution and restart processing using the local state saved in RocksDB?
java.lang.IllegalArgumentException: "Application error-stack trace"
My concern is that if the same kind of erroneous event is processed by each of the available Task Managers, they all get killed and the entire Flink job goes down.
I am noticing that when some application error occurs, the entire job eventually goes down.
I have not figured out the exact reason as of now.
In general, an exception in the job should not cause the whole Task Manager to go down; we are talking about "normal" exceptions here. In such a case the job itself will fail and will be restarted or not depending on the provided restart strategy.
Obviously, if for some reason your Task Manager does die, for example due to timeouts or something else, then it will not be restarted automatically unless you use a resource manager or orchestration tool like YARN or Kubernetes. In that case the job should be restarted once slots are available again.
As for the behaviour you have described, where the job itself is "going down", I assume the job is simply going to the FAILED state. This is because different restart strategies have different thresholds for the maximum number of retries, and if the job still does not work after the specified number of restarts it will simply go to the FAILED state.
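The threshold itself is part of the restart strategy configuration; a minimal flink-conf.yaml sketch (the values are only examples, pick ones that fit your job):

restart-strategy: fixed-delay
# Give up and move the job to the FAILED state after 3 attempts
restart-strategy.fixed-delay.attempts: 3
# Wait 10 seconds between attempts
restart-strategy.fixed-delay.delay: 10 s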
Background:
SQL Compliance Manager collects trace files on an Agent Server for auditing; once the trace files accumulate on the Agent, the Compliance Manager agent service account moves them to the Collection Server folder, processes them, and deletes them.
Problem:
More than five times in the last month, the trace files have started filling up the Agent drive to the point where the traces have to be stopped by running a SQL query to change their status. This has also had a knock-on effect on the Collection Server: the folder there starts to fill up excessively and the Collection Server Agent is unable to process the audit trace files. Four out of five times the issue occurred shortly after a SQL failover; however, the last time this trace error occurred there had been no failover. The only thing noticeable in the event logs was that three SQL jobs kicked off around the time the traces started acting up.
Behaviour:
A pattern has been identified in Windows Event Viewer: an execution timeout occurs close to or at the time the trace files start becoming unwieldy.
Error: An error occurred starting traces for instance XXXXXXXXX. Error: Execution Timeout Expired.
The timeout period elapsed prior to completion of the operation or the server is not responding..
The trace start timeout value can be modified on the Trace Options tab of the Agent Properties dialog in the SQLcompliance Management Console.
However, I do not believe that just adjusting the timeout settings will stop the traces from acting this way, as these are the recommended settings and other audited servers have the same settings without behaving like this. The issue persists on only one box.
Questions:
I want to find out whether anyone else has experienced a similar issue and, if so, whether the environment it happened in was under heavy load. Did reducing the load help, or were there other remediation steps to take? Or does anyone know of a database auditing tool that is lightweight and doesn't create these issues?
Any help or advice appreciated!
There is a job on our SQL Server. It was started and then stopped after some time. However, the job still appears to be running even though it is not.
I can see it "is running" in Job Activity Monitor and also in sysjobactivity (the stop_execution_date field is not filled in), but in fact nothing happens. I do not see it running in Activity Monitor.
How can I effectively kill the job without needing to restart the whole server? Stopping the job via the SSMS GUI does not work.
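For completeness, the same stop request can be issued from T-SQL, and the activity table can be checked for the orphaned row. This is only a sketch; the job name is a placeholder.

-- Ask SQL Server Agent to stop the job.
EXEC msdb.dbo.sp_stop_job @job_name = N'YourJobName';

-- Show jobs that the current Agent session still considers running
-- (started but with no stop_execution_date recorded).
SELECT j.name, ja.start_execution_date, ja.stop_execution_date
FROM msdb.dbo.sysjobactivity AS ja
JOIN msdb.dbo.sysjobs AS j ON j.job_id = ja.job_id
WHERE ja.session_id = (SELECT MAX(session_id) FROM msdb.dbo.syssessions)
  AND ja.start_execution_date IS NOT NULL
  AND ja.stop_execution_date IS NULL;

If the row stays orphaned after sp_stop_job, restarting the SQL Server Agent service (not the SQL Server instance itself) is usually enough to clear the phantom entry.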