Flink job fails when it encounters a DB connection exception - apache-flink

I'm new to Flink. My Flink job receives messages from MQ, does some rule checks and summary calculations, then writes the results to an RDBMS. Sometimes the job encounters a NullPointerException (due to my silly code) or an MQ connection exception (due to a non-existent topic), and it just halts the current message's processing; the job keeps running, and the next messages can still trigger the exception.
But today I restarted the DB and the job failed. What's the difference?
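Presumably the per-message code swallows these exceptions itself, roughly like this (a hypothetical PyFlink sketch, not the actual job; run_rule_check and the elided source/sink are made-up names):

from pyflink.datastream import StreamExecutionEnvironment

def safe_check(message):
    # An exception caught here only skips this one message, so the job
    # keeps running; an exception thrown outside such a guard, e.g. from
    # the JDBC sink when the DB is restarted, reaches the Flink runtime
    # and fails the whole job.
    try:
        return run_rule_check(message)  # hypothetical rule-check helper
    except Exception as exc:
        print(f"skipping bad message: {exc}")
        return None

env = StreamExecutionEnvironment.get_execution_environment()
# stream = env.add_source(...)  # MQ source, elided
# stream.map(safe_check).filter(lambda r: r is not None).add_sink(...)  # JDBC sink, elided

That would explain the difference: the rule-check exceptions never leave the user function, while the DB connection failure does.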

Related

Couldn't commit processed log positions with the source database due to a concurrent connector shutdown or restart

I am continuously receiving the warning mentioned in the title from the Debezium connector for SQL Server, which I am running using connect-standalone. At one point I tried to start two concurrent connectors that connect to the same database, but that was yesterday; since then I have restarted this connector several times, and the other connector is stopped, so I don't know where that information is persisted, or why, because at the moment only one connector is running, so this shouldn't be logged. Since CDC does not work, this seems like a real problem, regardless of the fact that it is logged as only a warning, because no (other) error is logged, just lines such as:
WARN Couldn't commit processed log positions with the source database due to a concurrent connector shutdown or restart (io.debezium.connector.common.BaseSourceTask:238)

How to get the names of failed Flink jobs

Our Flink cluster sometimes restarts, and all jobs are restarted with it. Occasionally, some jobs fail to restart and the failed count on the panel increases; however, it does not tell us which jobs failed.
As the total job count grows, it becomes harder to find the stopped jobs. Does anyone know how I can get the names of the failed jobs?
You could write a simple script for that which gives you the list of job names that have failed, e.g. by querying the JobManager's REST API, as sketched below.
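For example (a sketch assuming the JobManager's REST API is reachable at localhost:8081; adjust the address for your cluster), GET /jobs/overview lists every job together with its name and state:

import json
import urllib.request

# Ask the Flink JobManager for an overview of all jobs.
with urllib.request.urlopen("http://localhost:8081/jobs/overview") as resp:
    overview = json.load(resp)

# Print the name and id of every job that ended in the FAILED state.
for job in overview["jobs"]:
    if job["state"] == "FAILED":
        print(job["name"], job["jid"])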
I am using this command to get a list of failed jobs:
$ yarn application -list -appStates KILLED
Set up an alert for when your cluster restarts, and after the restart check which jobs haven't come back up; you could have alerts for those as well.

Flink Task Manager Status When Application Crashes

What happens when an exception is thrown from the jar application to the Task Manager while processing an event?
a) Does the Flink Job Manager kill the existing Task Manager and create a new one?
b) Does the Task Manager itself recover from the failed execution and restart the process using local state saved in RocksDB?
java.lang.IllegalArgumentException: "Application error-stack trace"
My worry is that if the same kind of erroneous events get processed by every available Task Manager, they all get killed and the entire Flink job goes down.
I am noticing that if some application error occurs, the entire job eventually goes down.
I haven't figured out the exact reason yet.
In general, an exception in the job should not cause the whole Task Manager to go down; we are talking about "normal" exceptions here. In such a case the job itself will fail, and it will be restarted or not depending on the configured restart strategy.
Obviously, if for some reason your Task Manager does die, for example due to timeouts or something else, it will not be restarted automatically unless you use a resource manager or orchestration tool like YARN or Kubernetes. In that case the job should be started again once slots are available.
As for the behaviour you describe, where the job itself "goes down", I assume the job is simply going to the FAILED state. This is because different restart strategies have different thresholds for the maximum number of restarts, and if the job still does not work after the specified number of restarts, it simply goes to the FAILED state.
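For reference, that threshold is what the restart strategy configures. A minimal sketch in PyFlink (the numbers are made up; the same can be set cluster-wide in flink-conf.yaml via restart-strategy: fixed-delay):

from pyflink.common import RestartStrategies
from pyflink.datastream import StreamExecutionEnvironment

env = StreamExecutionEnvironment.get_execution_environment()
# Retry the job up to 3 times, 10 000 ms apart; after the third failed
# attempt it transitions to the FAILED state described above.
env.set_restart_strategy(RestartStrategies.fixed_delay_restart(3, 10000))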

BizTalk SQL Server Job always fails after some hours

We have BizTalk 2016 running on SQL Server 2016 AlwaysOn.
The SQL Server Agent job MessageBox_Message_ManageRefCountLog_BizTalkMsgBoxDb is doing its thing, but after some hours it fails; sometimes after 10 hours, sometimes after 90, or anything in between. I know the job is designed to run forever and, in case of an error, to restart itself within a minute. But I would like to know the actual error message for this failed job. The job history is not helpful because the log entry is truncated.
A failover is not happening. The question is: WHY is this job failing, and ultimately, how do I stop it from doing that?
I have set up extended monitoring of the failing step, and it revealed that the job failed because of a deadlock in which it was chosen as the deadlock victim. So now the question is: why is there a deadlock? Is MessageBox_Message_ManageRefCountLog_BizTalkMsgBoxDb known for deadlock issues?
Check the documentation at Description of the SQL Server Agent Jobs in BizTalk Server; it says:
Important At first, the MessageBox_Message_ManageRefCountLog_BizTalkMsgBoxDb job status icon displays a status of Success. However, there will be no corresponding success entry in the job history. If one of the jobs in the MessageBox_Message_ManageRefCountLog_BizTalkMsgBoxDb job fails, a failure entry appears in the job history and the status icon displays a status of Failure. The job will always display a status of Failure after the first failure. To verify that the other BizTalk Server SQL Server Agent jobs run correctly, check the status of the other BizTalk Server SQL Server Agent jobs.
Hope this answers your question.

Send database test E-Mail exception - lock request time out period exceeded

I am troubleshooting an error with Database Mail, and when I went to Management -> Database Mail to send a test e-mail, I got the following error:
An exception occurred while executing a Transact-SQL statement or batch.
(Microsoft.SQLServer.ConnectionInfo)
Additional information:
Lock request time out period exceeded
The statement has been terminated (Microsoft SQL Server, Error: 1222)
When I investigated this further by looking at all blocking transactions on msdb, I found one transaction named "implicit transaction". Is this the one that is blocking? What can I do?
OK, I found out what happened by looking at all transactions (right-click -> Reports -> All Transactions). There I found that a server user had a blocking transaction open. After getting him to close it, everything went fine.
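For anyone who wants to find the blocker without the SSMS report, the standard DMVs show which session a blocked request is waiting on. A sketch using pyodbc (the connection string is a placeholder for the instance hosting msdb):

import pyodbc

# Placeholder connection string; point it at your SQL Server instance.
conn = pyodbc.connect(
    "DRIVER={ODBC Driver 17 for SQL Server};SERVER=myserver;"
    "Trusted_Connection=yes;"
)

# sys.dm_exec_requests reports, for every running request, the session it
# is waiting on (blocking_session_id); nonzero means the request is blocked.
rows = conn.execute("""
    SELECT session_id, blocking_session_id, wait_type, wait_time
    FROM sys.dm_exec_requests
    WHERE database_id = DB_ID('msdb') AND blocking_session_id <> 0
""").fetchall()

for session_id, blocker, wait_type, wait_time in rows:
    print(f"session {session_id} blocked by session {blocker} "
          f"({wait_type}, waiting {wait_time} ms)")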
