Flink Task Manager Status When Application Crashes

What happens when the application jar throws an exception in the Task Manager while it is processing an event?
a) Will the Flink Job Manager kill the existing Task Manager and create a new one?
b) Will the Task Manager itself recover from the failed execution and restart processing using the local state saved in RocksDB?
java.lang.IllegalArgumentException: "Application error-stack trace"
My concern is that if the same kind of erroneous event is processed by each of the available Task Managers, they will all get killed and the entire Flink job will go down.
I am noticing that when some application error occurs, the entire job eventually goes down.
I have not figured out the exact reason yet.

In general, an exception in the job should not cause the whole Task Manager to go down; we are talking about "normal" exceptions here. In such a case the job itself will fail, and Flink will try to restart it (or not) depending on the configured restart strategy.
Obviously, if for some reason your Task Manager does die, for example due to timeouts or something else, it will not be restarted automatically unless you use a resource manager or orchestration tool like YARN or Kubernetes. In that case the job should be restarted once slots are available again.
As for the behaviour you have described, where the job itself is "going down", I assume the job is simply going to the FAILED state. This is because different restart strategies have different thresholds for the maximum number of retries, and if the job still does not work after the specified number of restarts it will simply go to the FAILED state.
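For reference, a minimal sketch of how a restart strategy can be configured programmatically (the class and job names here are only illustrative; the same limits can also be set via the restart-strategy options in flink-conf.yaml):

import org.apache.flink.api.common.restartstrategy.RestartStrategies;
import org.apache.flink.api.common.time.Time;
import org.apache.flink.streaming.api.environment.StreamExecutionEnvironment;

public class RestartStrategyExample {
    public static void main(String[] args) throws Exception {
        StreamExecutionEnvironment env = StreamExecutionEnvironment.getExecutionEnvironment();

        // Retry the job up to 3 times, waiting 10 seconds between attempts.
        // After the third failed attempt the job transitions to FAILED.
        env.setRestartStrategy(RestartStrategies.fixedDelayRestart(
                3,                  // max number of restart attempts
                Time.seconds(10))); // delay between attempts

        env.fromElements("a", "b", "c")
           .map(value -> value.toUpperCase())
           .print();

        env.execute("restart-strategy-demo");
    }
}

With this configuration, an exception in the user code fails the job, Flink restarts it up to three times, and only after the attempts are exhausted does the job go to the FAILED state described above.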

Related

Fail the SQL Server Agent job if any one of the steps fails to complete

I have a job which has only one step. The step calls an SSIS package. Most of the time it runs fine; however, sometimes it fails due to connectivity issues.
It is very hard to track this kind of failure, since when I open the job history (screenshot below) it shows that the job completed successfully. Only when I click on the step that I highlighted on the same screenshot can I see the error.
I have plenty of jobs like this and it is very hard to track these kinds of errors manually.
Below is the actual error that caused the job step to fail, yet the overall job status still shows success.
This is a weird scenario and almost impossible to track. We already have a job-failure reporting mechanism, but it only tracks overall job failures; it cannot track job-step failures.
Logically speaking, the overall job status should be failed if one or more steps fail to complete. I have checked the Advanced options below and everything looks fine. I am not sure where to start. Please provide some insights on this.
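One option, until the step configuration is sorted out, is to monitor msdb directly for failed steps regardless of the overall job outcome. A minimal JDBC sketch of such a check (the connection string and credentials are placeholders; it assumes the Microsoft SQL Server JDBC driver and read access to msdb):

import java.sql.Connection;
import java.sql.DriverManager;
import java.sql.ResultSet;
import java.sql.Statement;

public class FailedJobStepReport {
    public static void main(String[] args) throws Exception {
        // Placeholder connection string; adjust host, credentials and driver properties as needed.
        String url = "jdbc:sqlserver://localhost;databaseName=msdb;user=sa;password=***";

        // run_status = 0 means "failed"; step_id > 0 excludes the job-outcome row,
        // so step failures show up even when the overall job reports success.
        String sql =
            "SELECT j.name AS job_name, h.step_id, h.step_name, h.run_date, h.message " +
            "FROM msdb.dbo.sysjobhistory h " +
            "JOIN msdb.dbo.sysjobs j ON j.job_id = h.job_id " +
            "WHERE h.step_id > 0 AND h.run_status = 0 " +
            "ORDER BY h.run_date DESC";

        try (Connection conn = DriverManager.getConnection(url);
             Statement stmt = conn.createStatement();
             ResultSet rs = stmt.executeQuery(sql)) {
            while (rs.next()) {
                System.out.printf("%s / step %d (%s) failed on %d: %s%n",
                        rs.getString("job_name"), rs.getInt("step_id"),
                        rs.getString("step_name"), rs.getInt("run_date"),
                        rs.getString("message"));
            }
        }
    }
}

It is also worth double-checking the step's "On failure" action on the Advanced page in SQL Server Agent; it needs to be "Quit the job reporting failure" for a failed step to fail the job.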

How to resolve ERROR: could not start WAL streaming: ERROR: replication slot "xxx" is active for PID 124563

PROBLEM!!
After setting up my logical replication, with everything running smoothly, I wanted to dig into the logs just to confirm there was no error there. But when I ran tail -f postgresql.log, I found the following error kept recurring: ERROR: could not start WAL streaming: ERROR: replication slot "sub" is active for PID 124898
SOLUTION!!
This is the simple solution: I went into my postgresql.conf file and looked for wal_sender_timeout on the master and wal_receiver_timeout on the slave. The values I saw there were 120s for both, and I changed both to 300s, which is equivalent to 5 minutes. Then remember to reload both servers, as a restart is not required. Wait for about 5 to 10 minutes and the error is fixed.
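Before (or instead of) raising the timeouts, it can help to confirm on the publisher which walsender process is actually holding the slot. A minimal JDBC sketch of that check (host, credentials, and database name are placeholders; it assumes the PostgreSQL JDBC driver):

import java.sql.Connection;
import java.sql.DriverManager;
import java.sql.ResultSet;
import java.sql.Statement;

public class ReplicationSlotCheck {
    public static void main(String[] args) throws Exception {
        // Placeholder connection string for the publisher (master) instance.
        String url = "jdbc:postgresql://publisher-host:5432/postgres?user=postgres&password=***";

        // active / active_pid show whether a walsender is currently attached to each slot.
        String sql = "SELECT slot_name, active, active_pid, restart_lsn FROM pg_replication_slots";

        try (Connection conn = DriverManager.getConnection(url);
             Statement stmt = conn.createStatement();
             ResultSet rs = stmt.executeQuery(sql)) {
            while (rs.next()) {
                System.out.printf("slot=%s active=%b pid=%s restart_lsn=%s%n",
                        rs.getString("slot_name"),
                        rs.getBoolean("active"),
                        rs.getString("active_pid"),
                        rs.getString("restart_lsn"));
            }
        }
    }
}

If the slot shows as active for a stale PID, the subscriber cannot attach until that walsender exits, which is the situation the timeout changes above are meant to avoid.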
We had an identical error message in our logs and tried this fix, but unfortunately our case was much more diabolical. Putting the notes here for the next poor soul: in our case, the publishing instance was an AWS-managed RDS server, and it managed (ha ha) to create such a WAL backlog that it kept going into catchup state, processing the WAL, and running out of memory (getting killed by the OS every time) before it caught up. The experience on the client side was exactly what you see here: timeouts and failed WAL streaming. The fix was kind of nasty: we had to drop the whole replication link and rebuild it (fortunately it was a test database, so no harm done, but it's a situation you want to avoid). It was obvious after looking at the publisher side and seeing its logs, but from the subscription side it was much more mysterious.

How to get the names of failed Flink jobs

Our Flink cluster sometimes restarts, and all jobs are restarted with it. Occasionally a job fails to restart, and the failed count on the panel increases; however, that does not tell us which jobs failed.
As the total job count grows, it becomes harder to find the stopped jobs. Does anyone know how I can get the names of the failed jobs?
You could write a simple script for that which will give you the list of job names that have failed; see the sketch after the yarn command below.
I am using this command to get a list of failed jobs:
$yarn application -list -appStates KILLED
Set up an alert for when your cluster restarts, check after the restart which jobs have not come back, and you could have alerts for those as well.
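If the jobs run in a standalone or session cluster rather than per-job on YARN, a small script against the Flink JobManager REST API can list failed job names directly. A minimal Java sketch, assuming the REST endpoint is reachable on port 8081 (host and port are placeholders, and the crude regex stands in for a proper JSON parser):

import java.net.URI;
import java.net.http.HttpClient;
import java.net.http.HttpRequest;
import java.net.http.HttpResponse;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

public class FailedFlinkJobs {
    public static void main(String[] args) throws Exception {
        // Placeholder JobManager address; adjust host/port for your cluster.
        String url = "http://jobmanager-host:8081/jobs/overview";

        HttpClient client = HttpClient.newHttpClient();
        HttpResponse<String> response = client.send(
                HttpRequest.newBuilder(URI.create(url)).GET().build(),
                HttpResponse.BodyHandlers.ofString());

        // Crude extraction of "name" and "state" from the JSON payload;
        // a real script should use a JSON library instead.
        Pattern p = Pattern.compile("\"name\":\"(.*?)\".*?\"state\":\"(.*?)\"");
        Matcher m = p.matcher(response.body());
        while (m.find()) {
            if ("FAILED".equals(m.group(2))) {
                System.out.println("Failed job: " + m.group(1));
            }
        }
    }
}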

How to resolve DB connection invalidated warning in Airflow Scheduler?

I am upgrading our Airflow instance from 1.9 to 1.10.3 and whenever the scheduler runs now I get a warning that the database connection has been invalidated and it's trying to reconnect. A bunch of these errors show up in a row. The console also indicates that tasks are being scheduled but if I check the database nothing is ever being written.
The following warning shows up where it didn't before:
[2019-05-21 17:29:26,017] {sqlalchemy.py:81} WARNING - DB connection invalidated. Reconnecting...
Eventually, I'll also get this error:
FATAL: remaining connection slots are reserved for non-replication superuser connections
I've tried increasing the SQLAlchemy pool size setting in airflow.cfg, but that had no effect:
# The SqlAlchemy pool size is the maximum number of database connections in the pool.
sql_alchemy_pool_size = 10
I'm using CeleryExecutor and I'm thinking that maybe the number of workers is overloading the database connections.
I run three commands: airflow webserver, airflow scheduler, and airflow worker. So there should only be one worker, and I don't see why that would overload the database.
How do I resolve the database connection errors? Is there a setting to increase the number of database connections, and if so, where is it? Do I need to handle the workers differently?
Update:
Even with no workers running and the webserver and scheduler started fresh, the DB connection warnings start to appear as soon as the scheduler fills up the Airflow pools.
Update 2:
I found the following issue in the Airflow Jira: https://issues.apache.org/jira/browse/AIRFLOW-4567
There is some activity with others saying they see the same issue. It is unclear whether this directly causes the crashes that some people are seeing or whether this is just an annoying cosmetic log. As of yet there is no resolution to this problem.
This has been resolved in the latest version of Airflow, 1.10.4.
I believe it was fixed by AIRFLOW-4332, which updated SQLAlchemy to a newer version.
Pull request
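For the "remaining connection slots are reserved" error above, it can also help to see which component is actually holding the metadata-database connections. A minimal JDBC sketch, assuming a PostgreSQL metadata database and the PostgreSQL JDBC driver (host and credentials are placeholders):

import java.sql.Connection;
import java.sql.DriverManager;
import java.sql.ResultSet;
import java.sql.Statement;

public class AirflowDbConnectionCount {
    public static void main(String[] args) throws Exception {
        // Placeholder connection string for the Airflow metadata database.
        String url = "jdbc:postgresql://metadata-db-host:5432/airflow?user=airflow&password=***";

        // Group open connections by client application and state, so you can see whether
        // the scheduler, webserver, or workers are exhausting max_connections.
        String sql = "SELECT application_name, state, count(*) AS connections " +
                     "FROM pg_stat_activity " +
                     "GROUP BY application_name, state " +
                     "ORDER BY connections DESC";

        try (Connection conn = DriverManager.getConnection(url);
             Statement stmt = conn.createStatement();
             ResultSet rs = stmt.executeQuery(sql)) {
            while (rs.next()) {
                System.out.printf("%-30s %-12s %d%n",
                        rs.getString("application_name"),
                        rs.getString("state"),
                        rs.getLong("connections"));
            }
        }
    }
}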

SQL Server Trace Files Filling Up Agent Drive

Background:
SQL Compliance Manager collects trace files on an Agent server for auditing; once the trace files accumulate on the Agent, the Compliance Manager agent service account moves them to the Collection Server folder, processes them, and deletes them.
Problem:
More than five times in the last month, the trace files have started filling up the Agent drive to the point where the traces have to be stopped by running a SQL query that changes their status (a sketch of this is shown below). This has a knock-on effect on the Collection Server: its folder starts to fill up excessively and the Collection Server agent is unable to process the audit trace files. Four of the five times the issue occurred shortly after a SQL failover; however, the last time this trace error occurred there had been no failover. The only thing noticeable in the event logs was that three SQL jobs kicked off around the time the traces started acting up.
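For reference, a sketch of the kind of "stop the traces" query mentioned above, run over JDBC against the audited instance. This assumes the Compliance Manager traces are ordinary server-side traces visible in sys.traces (connection details are placeholders; confirm with the vendor before stopping its traces this way):

import java.sql.CallableStatement;
import java.sql.Connection;
import java.sql.DriverManager;
import java.sql.ResultSet;
import java.sql.Statement;
import java.util.ArrayList;
import java.util.List;

public class StopRunawayTraces {
    public static void main(String[] args) throws Exception {
        // Placeholder connection string for the audited SQL Server instance.
        String url = "jdbc:sqlserver://agent-host;databaseName=master;user=sa;password=***";

        try (Connection conn = DriverManager.getConnection(url)) {
            // Find running server-side traces, skipping the default trace.
            List<Integer> traceIds = new ArrayList<>();
            try (Statement stmt = conn.createStatement();
                 ResultSet rs = stmt.executeQuery(
                         "SELECT id FROM sys.traces WHERE is_default = 0 AND status = 1")) {
                while (rs.next()) {
                    traceIds.add(rs.getInt("id"));
                }
            }

            // Stop each trace; status 0 = stop (2 would also close and delete the definition).
            for (int traceId : traceIds) {
                try (CallableStatement cs = conn.prepareCall("{call sp_trace_setstatus(?, ?)}")) {
                    cs.setInt(1, traceId);
                    cs.setInt(2, 0);
                    cs.execute();
                    System.out.println("Stopped trace " + traceId);
                }
            }
        }
    }
}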
Behaviour:
A pattern has been identified in Windows Event Viewer: an execution timeout occurs at, or close to, the time the trace files start becoming unwieldy.
Error: An error occurred starting traces for instance XXXXXXXXX. Error: Execution Timeout Expired.
The timeout period elapsed prior to completion of the operation or the server is not responding..
The trace start timeout value can be modified on the Trace Options tab of the Agent Properties dialog in the SQLcompliance Management Console.
However, I do not believe that simply adjusting the timeout settings will stop the traces from behaving this way, as these are the recommended settings and other audited servers with the same settings do not act the same way. The issue persists on only one box.
Questions:
I want to find out whether anyone else has experienced a similar issue, and if so, was the environment under heavy load? Did reducing the load help, or were there other remediation steps to take? Alternatively, does anyone know of a lightweight database auditing tool that doesn't create these issues?
Any help or advice appreciated!
