How to debug Google App Engine Server Error (500)? - google-app-engine

I deployed a Django web app to GAE, no errors during deployment.
But when I try to open the website, it shows Server Error (500).
I tried to view the logs with gcloud app logs read, but it only shows:
2020-05-28 16:07:48 default[20200528t144758] [2020-05-28 16:07:48 +0000] [1] [INFO] Handling signal: term
2020-05-28 16:07:48 default[20200528t144758] [2020-05-28 16:07:48 +0000] [8] [INFO] Worker exiting (pid: 8)
2020-05-28 16:07:49 default[20200528t144758] [2020-05-28 16:07:49 +0000] [1] [INFO] Shutting down: Master
2020-05-28 16:07:49 default[20200528t144758] [2020-05-28 16:07:49 +0000] [1] [INFO] Handling signal: term
2020-05-28 16:07:49 default[20200528t144758] [2020-05-28 16:07:49 +0000] [8] [INFO] Worker exiting (pid: 8)
2020-05-28 16:07:50 default[20200528t144758] [2020-05-28 16:07:50 +0000] [1] [INFO] Shutting down: Master
2020-05-28 16:08:06 default[20200528t165550] "GET /" 500
The logs are not informative, so I wonder:
1) Can I log on to the App Engine machine and run my web application manually to see the error?
2) If not, what are the suggested ways to debug App Engine errors?

In the App Engine flexible environment, you can debug your instance by enabling debug mode and connecting to the instance over SSH.
You can also write app logs and structured logs to stdout and stderr so that you can inspect your application logs and request logs in the Logs Viewer or from the command line. You may also consider using Cloud Profiler, which is currently a free service, to capture profiling data of your application so that you get a better understanding of how it behaves as it runs.
Cloud Debugger also lets you inspect the state of your application while it is running, without adding logging statements. Note that Cloud Debugger is currently a free service as well.
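As a minimal sketch of the stdout/stderr approach (using the standard Python logging module; the handler function and the failing call are illustrative, not from the question):

import logging
import sys

# App Engine captures stdout/stderr, so a plain stream handler is enough
# for these entries to appear in the Logs Viewer alongside the request logs.
logging.basicConfig(
    stream=sys.stdout,
    level=logging.INFO,
    format="%(asctime)s %(levelname)s %(name)s: %(message)s",
)
logger = logging.getLogger(__name__)

def handle_request():
    # handle_request is a hypothetical handler used only for illustration.
    logger.info("request received")
    try:
        1 / 0  # stand-in for the failing application code
    except ZeroDivisionError:
        logger.exception("request failed")  # logs the full traceback
        raise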

By setting DEBUG = True in the Django project's settings.py, I'm now able to see the error details on GAE.
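For reference, a minimal sketch of that change (for debugging only; DEBUG must not stay enabled in production, and the ALLOWED_HOSTS value is a placeholder):

# settings.py: temporary, while debugging the 500
DEBUG = True

# Django returns an error itself if the request's Host header isn't allowed,
# so include the appspot domain while debugging (placeholder project ID).
ALLOWED_HOSTS = ['your-project-id.appspot.com']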

Related

Port XXXX is in use by another program. Either identify and stop that program, or start the server with a different port

I want to run a simple Python Flask hello world app. I deployed it to App Engine, but it fails with an error saying the port is in use, and it looks like it's running on multiple instances/threads/clones concurrently.
This is my main.py
from flask import Flask

app = Flask(__name__)

@app.route('/hello')
def helloIndex():
    print("Hello world log console")
    return 'Hello World from Python Flask!'

app.run(host='0.0.0.0', port=4444)
This is my app.yaml
runtime: python38
env: standard
instance_class: B2
handlers:
- url: /
  script: auto
- url: .*
  script: auto
manual_scaling:
  instances: 1
This is my requirements.txt
gunicorn==20.1.0
flask==2.2.2
And this is the logs that I got:
* Serving Flask app 'main'
* Debug mode: off
Address already in use
Port 4444 is in use by another program. Either identify and stop that program, or start the server with a different port.
[2022-08-10 15:57:28 +0000] [1058] [INFO] Worker exiting (pid: 1058)
[2022-08-10 15:57:29 +0000] [1059] [INFO] Booting worker with pid: 1059
[2022-08-10 15:57:29 +0000] [1060] [INFO] Booting worker with pid: 1060
[2022-08-10 15:57:29 +0000] [1061] [INFO] Booting worker with pid: 1061
It says that port 4444 is in use. Initially I tried 5000 (Flask's default port), but it said that port was in use too. I also tried removing port=4444 entirely, but then it says port 5000 is in use by another program, presumably because Flask assigns port 5000 by default. I suspect this is happening because GAE is running the app in multiple instances. If not, please help me solve this issue.
App Engine apps should listen on port 8080, not on any other port.
So you may need to set it like this:
app.run(host='0.0.0.0', port=8080)
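More generally, app.run() is only needed for local development; on App Engine gunicorn imports the app object and binds the port itself. A minimal sketch (assuming the PORT environment variable App Engine sets, defaulting to 8080):

import os
from flask import Flask

app = Flask(__name__)

@app.route('/hello')
def helloIndex():
    return 'Hello World from Python Flask!'

if __name__ == '__main__':
    # Local development only; gunicorn imports `app` directly in production.
    app.run(host='0.0.0.0', port=int(os.environ.get('PORT', 8080)))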
Close the editor and then reopen it. Next time you stop the process, use Ctrl+C if you're in the terminal.
I figured it out. Delete the old file that created the web app from your terminal or folder. In the terminal this is done with:
rm file_name
Then try again with a fresh file and it should be okay.

Diagnosing error in deploying GAE flex app

I've been using GAE flex for a while now, and all of a sudden my deploy process ends on the command line with:
ERROR: (gcloud.app.deploy) Error Response: [4] Flex operation
projects/MY-PROJECT/regions/us-central1/operations/xxx
error [DEADLINE_EXCEEDED]: An internal error occurred while processing
task
/appengine-flex-v1/insert_flex_deployment/flex_create_resources>2019-09-04T21:29:03.412Z8424.ow.0:
Gave up polling Deployment Manager operation
MY-PROJECT/operation-xxx.
My logs don't have any helpful info. These are relevant logs from the deployment:
2019-09-04T14:07:07Z [2019-09-04 14:07:07 +0000] [1] [INFO] Shutting down: Master
2019-09-04T14:07:06Z [2019-09-04 14:07:06 +0000] [16] [INFO] Worker exiting (pid: 16)
2019-09-04T14:07:06Z [2019-09-04 14:07:06 +0000] [14] [INFO] Worker exiting (pid: 14)
2019-09-04T14:07:05Z [2019-09-04 14:07:05 +0000] [13] [INFO] Worker exiting (pid: 13)
2019-09-04T14:07:05Z [2019-09-04 14:07:05 +0000] [11] [INFO] Worker exiting (pid: 11)
2019-09-04T14:07:05Z [2019-09-04 14:07:05 +0000] [10] [INFO] Worker exiting (pid: 10)
2019-09-04T14:07:05Z [2019-09-04 14:07:05 +0000] [9] [INFO] Worker exiting (pid: 9)
2019-09-04T14:07:05Z [2019-09-04 14:07:05 +0000] [8] [INFO] Worker exiting (pid: 8)
2019-09-04T14:07:05Z [2019-09-04 14:07:05 +0000] [1] [INFO] Handling signal: term
2019-09-04T14:03:04Z [2019-09-04 14:03:04 +0000] [16] [INFO] Booting worker with pid: 16
2019-09-04T14:03:03Z [2019-09-04 14:03:03 +0000] [14] [INFO] Booting worker with pid: 14
2019-09-04T14:03:03Z [2019-09-04 14:03:03 +0000] [13] [INFO] Booting worker with pid: 13
2019-09-04T14:03:03Z [2019-09-04 14:03:03 +0000] [11] [INFO] Booting worker with pid: 11
2019-09-04T14:03:03Z [2019-09-04 14:03:03 +0000] [10] [INFO] Booting worker with pid: 10
2019-09-04T14:03:03Z [2019-09-04 14:03:03 +0000] [9] [INFO] Booting worker with pid: 9
2019-09-04T14:03:03Z [2019-09-04 14:03:03 +0000] [8] [INFO] Booting worker with pid: 8
2019-09-04T14:03:03Z [2019-09-04 14:03:03 +0000] [1] [INFO] Using worker: sync
2019-09-04T14:03:03Z [2019-09-04 14:03:03 +0000] [1] [INFO] Listening at: http://0.0.0.0:8080 (1)
2019-09-04T14:03:03Z [2019-09-04 14:03:03 +0000] [1] [INFO] Starting gunicorn 19.9.0
The instance exists in the console and appears to be running, but it just returns a 404. The code runs fine locally.
Any ideas for how to diagnose what is going on?
I wonder if Google reduced a default deadline, since the current deadline appears to be 4 minutes and my build has always taken longer than that.
I figured this out, and it is kind of a crazy Google Cloud bug. TL;DR: don't use Google Cloud organization policy constraints.
Here is what happened according to my best understanding:
For my Google Cloud project, I picked the us-central region.
About 6 months ago I set a Google Cloud policy constraint for my organization so that I would use only US-based resources. This set a policy that allowed US resources that existed at that time.
My recent deploys of my flex app were being deployed to the us-central1-f zone. I believe Google picked the zone and I don't have control over that.
The us-central1-f zone was not allowed by my location policy because it did not exist at the time I set the policy.
This caused my deploy to crash with the unhelpful error message in my question.
The way I figured this out was that I deployed Google's hello world flask app, and when deploying that app, I received a more helpful error message that allowed me to understand the problem.

Gunicorn Command Line Error in Google App Engine Standard Environment

I'm getting an endless stream of the following errors after deploying a Flask/Python 3/Postgres app to the App Engine standard environment. The warning about the psycopg2 package is a concern, but it's not what's causing the app to fail to run. Rather, it's the invalid command-line arguments to gunicorn, which are supplied by GAE, not by me. Has anyone been able to successfully deploy a Python 3 Flask app that uses Postgres to the standard environment?
Here's the log file output:
2018-12-11 02:51:37 +0000] [3738] [INFO] Booting worker with pid: 3738
2018-12-11 02:51:37 default[20181210t140744] /env/lib/python3.7/site-packages/psycopg2/__init__.py:144: UserWarning: The psycopg2 wheel package will be renamed from release 2.8; in order to keep installing from binary please use "pip install psycopg2-binary" instead. For details see: <http://initd.org/psycopg/docs/install.html#binary-install-from-pypi>.
2018-12-11 02:51:37 default[20181210t140744] """)
2018-12-11 02:51:38 default[20181210t211942] usage: gunicorn [-h] [--debug] [--args]
2018-12-11 02:51:38 default[20181210t211942] gunicorn: error: unrecognized arguments: main:app --workers 1 -c /config/gunicorn.py
2018-12-11 02:51:38 default[20181210t211942] [2018-12-11 02:51:38 +0000] [882] [INFO] Worker exiting (pid: 882)
By default, if no entrypoint is defined in app.yaml, App Engine looks for a WSGI app called app in main.py. If you look at the official code sample on GitHub, it handles this in main.py by declaring:
app = Flask(__name__)
Alternatively, this can be configured by adding an entrypoint to app.yaml that points to another module. For example, if you declare app in a file called prod.py:
entrypoint: gunicorn -b :$PORT prod.app
You'll find additional details about the entrypoint configuration here.
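Putting the two conventions together, a minimal main.py that the default entrypoint can find would look something like this (a sketch; the route and response are illustrative):

# main.py: the module and variable names match the default `main:app` lookup.
from flask import Flask

app = Flask(__name__)

@app.route('/')
def index():
    return 'OK'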
From the comment and answer above, it looked like the problem had to be in my app. I confirmed that by running gunicorn locally on a bare-bones app without difficulty, while it failed when running the real app.
One culprit turned out to be the use of argparse, which I was using so I could add a "debug" argument to the command line when working locally. Moving that code to the if __name__ == '__main__': section made the app run fine when using gunicorn locally.
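A sketch of that restructuring (the --debug flag is an assumption based on the description): parsing inside the __main__ block means argparse never sees gunicorn's own command-line arguments.

import argparse
from flask import Flask

app = Flask(__name__)

@app.route('/')
def index():
    return 'OK'

if __name__ == '__main__':
    # Only runs for `python main.py --debug`; gunicorn just imports `app`,
    # so its arguments (-b, --workers, main:app) never reach argparse.
    parser = argparse.ArgumentParser()
    parser.add_argument('--debug', action='store_true')
    args = parser.parse_args()
    app.run(debug=args.debug)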
But it still fails when deployed to GAE:
2018-12-12 20:09:16 default[20181212t094625] [2018-12-12 20:09:16 +0000] [8145] [INFO] Booting worker with pid: 8145
2018-12-12 20:09:16 default[20181212t094625] /env/lib/python3.7/site-packages/psycopg2/__init__.py:144: UserWarning: The psycopg2 wheel package will be renamed from release 2.8; in order to keep installing from binary please use "pip install psycopg2-binary" instead. For details see: <http://initd.org/psycopg/docs/install.html#binary-install-from-pypi>.
2018-12-12 20:09:16 default[20181212t094625] """)
2018-12-12 20:09:16 default[20181212t094625] [2018-12-12 20:09:16 +0000] [8286] [INFO] Booting worker with pid: 8286
2018-12-12 20:09:16 default[20181212t094625] /env/lib/python3.7/site-packages/psycopg2/__init__.py:144: UserWarning: The psycopg2 wheel package will be renamed from release 2.8; in order to keep installing from binary please use "pip install psycopg2-binary" instead. For details see: <http://initd.org/psycopg/docs/install.html#binary-install-from-pypi>.
2018-12-12 20:09:16 default[20181212t094625] """)
2018-12-12 20:09:19 default[20181212t094625] usage: gunicorn [-h] [--debug] [--args]
2018-12-12 20:09:19 default[20181212t094625] gunicorn: error: unrecognized arguments: -b :8081 main:app
2018-12-12 20:09:19 default[20181212t094625] [2018-12-12 20:09:19 +0000] [8145] [INFO] Worker exiting (pid: 8145)
Here’s app.yaml (minus the environment variable for SendGrid):
runtime: python37
env: standard
#threadsafe: true
entrypoint: gunicorn --workers 2 --bind :5000 main:app
#runtime_config:
# python_version: 3
# This beta setting is necessary for the db hostname parameter to be able to handle a URI in the
# form “/cloudsql/...” where ... is the instance given here:
beta_settings:
  cloud_sql_instances: provost-access-148820:us-east1:cuny-courses
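For completeness, a sketch of how such a "/cloudsql/..." hostname is typically consumed from Python with SQLAlchemy 1.4+ and psycopg2 (the user, password, and database names are placeholders, not values from the question):

import os
import sqlalchemy

# Matches the cloud_sql_instances value in app.yaml above.
INSTANCE_CONNECTION_NAME = 'provost-access-148820:us-east1:cuny-courses'

engine = sqlalchemy.create_engine(
    sqlalchemy.engine.URL.create(
        drivername='postgresql+psycopg2',
        username=os.environ.get('DB_USER', 'postgres'),   # placeholder
        password=os.environ.get('DB_PASS', ''),           # placeholder
        database=os.environ.get('DB_NAME', 'postgres'),   # placeholder
        # On App Engine the Cloud SQL instance is exposed as a unix socket.
        query={'host': f'/cloudsql/{INSTANCE_CONNECTION_NAME}'},
    )
)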

Akka Clustering is not working

I am trying to learn Akka clustering by following the tutorial provided here.
I have created the app, and the repo is here.
As mentioned in the tutorial, I started the FrontEndApp:
> runMain TransformationFrontendApp
[info] Running TransformationFrontendApp
[INFO] [10/31/2017 17:28:05.293] [run-main-0] [akka.remote.Remoting] Starting remoting
[INFO] [10/31/2017 17:28:05.543] [run-main-0] [akka.remote.Remoting] Remoting started; listening on addresses :[akka.tcp://ClusterSystem#localhost:54746]
[INFO] [10/31/2017 17:28:05.556] [run-main-0]
[akka.cluster.Cluster(akka://ClusterSystem)] Cluster Node
[akka.tcp://ClusterSystem#localhost:54746] - Starting up...
[INFO] [10/31/2017 17:28:05.648] [run-main-0]
[akka.cluster.Cluster(akka://ClusterSystem)] Cluster Node
[akka.tcp://ClusterSystem#localhost:54746] - Registered cluster JMX MBean
[akka:type=Cluster]
[INFO] [10/31/2017 17:28:05.648] [run-main-0]
[akka.cluster.Cluster(akka://ClusterSystem)] Cluster Node
[akka.tcp://ClusterSystem#localhost:54746] - Started up successfully
[WARN] [10/31/2017 17:28:05.683] [ClusterSystem-akka.actor.default-dispatcher-2]
[WARN] [10/31/2017 17:28:05.748] [New I/O boss #3]
[NettyTransport(akka://ClusterSystem)] Remote connection to [null] failed with java.net.ConnectException: Connection refused: /127.0.0.1:2551
[WARN] [10/31/2017 17:28:05.750] [New I/O boss #3]
[NettyTransport(akka://ClusterSystem)] Remote connection to [null] failed with java.net.ConnectException: Connection refused: /127.0.0.1:2552
[WARN] [10/31/2017 17:28:05.751] [ClusterSystem-akka.remote.default-remote-dispatcher-12] [akka.tcp://ClusterSystem#localhost:54746/system/endpointManager/reliableEndpointWriter-akka.tcp%3A%2F%2FClusterSystem%40127.0.0.1%3A2551-0] Association with remote system [akka.tcp://ClusterSystem#127.0.0.1:2551] has failed, address is now gated for [5000] ms. Reason: [Association failed with [akka.tcp://ClusterSystem#127.0.0.1:2551]] Caused by: [Connection refused: /127.0.0.1:2551]
The above WARN message repeats continuously, even after I start the backend app on ports 2551 and 2552.
The terminal log of starting the backend actor on 2551:
> runMain TransformationBackendApp 2551
[info] Running TransformationBackendApp 2551
[INFO] [10/31/2017 17:28:50.867] [run-main-0] [akka.remote.Remoting] Starting remoting
[INFO] [10/31/2017 17:28:51.122] [run-main-0] [akka.remote.Remoting] Remoting started; listening on addresses :[akka.tcp://ClusterSystem#localhost:2551]
[INFO] [10/31/2017 17:28:51.134] [run-main-0] [akka.cluster.Cluster(akka://ClusterSystem)] Cluster Node [akka.tcp://ClusterSystem#localhost:2551] - Starting up...
[INFO] [10/31/2017 17:28:51.228] [run-main-0] [akka.cluster.Cluster(akka://ClusterSystem)] Cluster Node [akka.tcp://ClusterSystem#localhost:2551] - Registered cluster JMX MBean [akka:type=Cluster]
[INFO] [10/31/2017 17:28:51.228] [run-main-0] [akka.cluster.Cluster(akka://ClusterSystem)] Cluster Node [akka.tcp://ClusterSystem#localhost:2551] - Started up successfully
[WARN] [10/31/2017 17:28:51.259] [ClusterSystem-akka.actor.default-dispatcher-3] [akka.tcp://ClusterSystem#localhost:2551/system/cluster/core/daemon/downingProvider] Don't use auto-down feature of Akka Cluster in production. See 'Auto-downing (DO NOT USE)' section of Akka Cluster documentation.
[ ERROR] [10/31/2017 17:28:51.382] [ClusterSystem-akka.remote.default-remote-dispatcher-5] [akka://ClusterSystem/system/endpointManager/reliableEndpointWriter-akka.tcp%3A%2F%2FClusterSystem%40localhost%3A2551-2/endpointWriter] dropping message [class akka.actor.ActorSelectionMessage] for non-local recipient [Actor[akka.tcp://ClusterSystem#127.0.0.1:2551/]] arriving at [akka.tcp://ClusterSystem#127.0.0.1:2551] inbound addresses are [akka.tcp://ClusterSystem#localhost:2551]
The last [ERROR] log repeats continuously.
The terminal log of starting the backend actor on 2552:
> runMain TransformationBackendApp 2552
[info] Running TransformationBackendApp 2552
[INFO] [10/31/2017 17:28:25.451] [run-main-0] [akka.remote.Remoting] Starting remoting
[INFO] [10/31/2017 17:28:25.689] [run-main-0] [akka.remote.Remoting] Remoting started; listening on addresses :[akka.tcp://ClusterSystem#localhost:2552]
[INFO] [10/31/2017 17:28:25.706] [run-main-0] [akka.cluster.Cluster(akka://ClusterSystem)] Cluster Node [akka.tcp://ClusterSystem#localhost:2552] - Starting up...
[INFO] [10/31/2017 17:28:25.803] [run-main-0] [akka.cluster.Cluster(akka://ClusterSystem)] Cluster Node [akka.tcp://ClusterSystem#localhost:2552] - Registered cluster JMX MBean [akka:type=Cluster]
[INFO] [10/31/2017 17:28:25.803] [run-main-0] [akka.cluster.Cluster(akka://ClusterSystem)] Cluster Node [akka.tcp://ClusterSystem#localhost:2552] - Started up successfully
[WARN] [10/31/2017 17:28:25.836] [ClusterSystem-akka.actor.default-dispatcher-2] [akka.tcp://ClusterSystem#localhost:2552/system/cluster/core/daemon/downingProvider] Don't use auto-down feature of Akka Cluster in production. See 'Auto-downing (DO NOT USE)' section of Akka Cluster documentation.
[WARN] [10/31/2017 17:28:25.909] [New I/O boss #3] [NettyTransport(akka://ClusterSystem)] Remote connection to [null] failed with java.net.ConnectException: Connection refused: /127.0.0.1:2551
[WARN] [10/31/2017 17:28:25.910] [ClusterSystem-akka.remote.default-remote-dispatcher-13] [akka.tcp://ClusterSystem#localhost:2552/system/endpointManager/reliableEndpointWriter-akka.tcp%3A%2F%2FClusterSystem%40127.0.0.1%3A2551-0] Association with remote system [akka.tcp://ClusterSystem#127.0.0.1:2551] has failed, address is now gated for [5000] ms. Reason: [Association failed with [akka.tcp://ClusterSystem#127.0.0.1:2551]] Caused by: [Connection refused: /127.0.0.1:2551]
[INFO] [10/31/2017 17:28:25.914] [ClusterSystem-akka.actor.default-dispatcher-4] [akka://ClusterSystem/deadLetters] Message [akka.cluster.InternalClusterAction$InitJoin$] from Actor[akka://ClusterSystem/system/cluster/core/daemon/joinSeedNodeProcess-1#-937368711] to Actor[akka://ClusterSystem/deadLetters] was not delivered. [1] dead letters encountered. This logging can be turned off or adjusted with configuration settings 'akka.log-dead-letters' and 'akka.log-dead-letters-during-shutdown'.
[ERROR] [10/31/2017 17:28:25.958] [ClusterSystem-akka.remote.default-remote-dispatcher-17] [akka://ClusterSystem/system/endpointManager/reliableEndpointWriter-akka.tcp%3A%2F%2FClusterSystem%40localhost%3A2552-2/endpointWriter] dropping message [class akka.actor.ActorSelectionMessage] for non-local recipient [Actor[akka.tcp://ClusterSystem#127.0.0.1:2552/]] arriving at [akka.tcp://ClusterSystem#127.0.0.1:2552] inbound addresses are [akka.tcp://ClusterSystem#localhost:2552]
I am not sure why the backend cluster nodes are not able to detect each other, or why the frontend node cannot detect the backends.
Am I missing any settings?
The problem is in your application.conf. You have akka.remote.netty.tcp.hostname = "localhost" and akka.cluster.seed-nodes = ["akka.tcp://ClusterSystem@127.0.0.1:2551", "akka.tcp://ClusterSystem@127.0.0.1:2552"]. You have to use either localhost or 127.0.0.1, not both:
akka {
  actor {
    provider = "akka.cluster.ClusterActorRefProvider"
  }
  remote {
    log-remote-lifecycle-events = off
    netty.tcp {
      hostname = "localhost"
      port = 0
    }
  }
  cluster {
    seed-nodes = [
      "akka.tcp://ClusterSystem@localhost:2551",
      "akka.tcp://ClusterSystem@localhost:2552"
    ]
    auto-down-unreachable-after = 10s
  }
}

Which ports should I open in the firewall on nodes with Apache Flink?

When I try to run my flow on an Apache Flink standalone cluster, I see the following exception:
java.lang.IllegalStateException: Update task on instance aaa0859f6af25decf1f5fc1821ffa55d # app-2 - 4 slots - URL: akka.tcp://flink#192.168.38.98:46369/user/taskmanager failed due to:
at org.apache.flink.runtime.executiongraph.Execution$6.onFailure(Execution.java:954)
at akka.dispatch.OnFailure.internal(Future.scala:228)
at akka.dispatch.OnFailure.internal(Future.scala:227)
at akka.dispatch.japi$CallbackBridge.apply(Future.scala:174)
at akka.dispatch.japi$CallbackBridge.apply(Future.scala:171)
at scala.PartialFunction$class.applyOrElse(PartialFunction.scala:123)
at scala.runtime.AbstractPartialFunction.applyOrElse(AbstractPartialFunction.scala:28)
at scala.concurrent.Future$$anonfun$onFailure$1.apply(Future.scala:136)
at scala.concurrent.Future$$anonfun$onFailure$1.apply(Future.scala:134)
at scala.concurrent.impl.CallbackRunnable.run(Promise.scala:32)
at scala.concurrent.impl.ExecutionContextImpl$AdaptedForkJoinTask.exec(ExecutionContextImpl.scala:121)
at scala.concurrent.forkjoin.ForkJoinTask.doExec(ForkJoinTask.java:260)
at scala.concurrent.forkjoin.ForkJoinPool$WorkQueue.runTask(ForkJoinPool.java:1339)
at scala.concurrent.forkjoin.ForkJoinPool.runWorker(ForkJoinPool.java:1979)
at scala.concurrent.forkjoin.ForkJoinWorkerThread.run(ForkJoinWorkerThread.java:107)
Caused by: akka.pattern.AskTimeoutException: Ask timed out on [Actor[akka.tcp://flink#192.168.38.98:46369/user/taskmanager#1804590378]] after [10000 ms]
at akka.pattern.PromiseActorRef$$anonfun$1.apply$mcV$sp(AskSupport.scala:333)
at akka.actor.Scheduler$$anon$7.run(Scheduler.scala:117)
at scala.concurrent.Future$InternalCallbackExecutor$.unbatchedExecute(Future.scala:599)
at scala.concurrent.BatchingExecutor$class.execute(BatchingExecutor.scala:109)
at scala.concurrent.Future$InternalCallbackExecutor$.execute(Future.scala:597)
at akka.actor.LightArrayRevolverScheduler$TaskHolder.executeTask(Scheduler.scala:467)
at akka.actor.LightArrayRevolverScheduler$$anon$8.executeBucket$1(Scheduler.scala:419)
at akka.actor.LightArrayRevolverScheduler$$anon$8.nextTick(Scheduler.scala:423)
at akka.actor.LightArrayRevolverScheduler$$anon$8.run(Scheduler.scala:375)
at java.lang.Thread.run(Thread.java:745)
It seems like port 46369 is blocked by the firewall. That is likely true, because I read the configuration section and opened only these ports:
6121:
  comment: Apache Flink TaskManager (Data Exchange)
6122:
  comment: Apache Flink TaskManager (IPC)
6123:
  comment: Apache Flink JobManager
6130:
  comment: Apache Flink JobManager (BLOB Server)
8081:
  comment: Apache Flink JobManager (Web UI)
The same ports are described in flink-conf.yaml:
jobmanager.rpc.address: app-1.stag.local
jobmanager.rpc.port: 6123
jobmanager.heap.mb: 1024
taskmanager.heap.mb: 2048
taskmanager.numberOfTaskSlots: 4
taskmanager.memory.preallocate: false
blob.server.port: 6130
parallelism.default: 4
jobmanager.web.port: 8081
state.backend: jobmanager
restart-strategy: none
restart-strategy.fixed-delay.attempts: 2
restart-strategy.fixed-delay.delay: 60s
So, I have two questions:
1) Is this exception related to blocked ports?
2) Which ports should I open in the firewall for a standalone Apache Flink cluster?
UPDATE 1
I found a configuration problem in the masters and slaves files (I had omitted the newline separators between the hosts listed in these files). I fixed it, and now I see other exceptions:
flink--taskmanager-0-app-1.stag.local.log
flink--taskmanager-0-app-2.stag.local.log
I have 2 nodes:
app-1.stag.local (with running job and task managers)
app-2.stag.local (with running task manager)
As you can see from these logs, the app-1.stag.local task manager can't connect to the other task manager:
java.io.IOException: Connecting the channel failed: Connecting to remote task manager + 'app-2.stag.local/192.168.38.98:35806' has failed. This might indicate that the remote task manager has been lost.
but app-2.stag.local has the port open:
2016-03-18 16:24:14,347 INFO org.apache.flink.runtime.io.network.netty.NettyServer - Successful initialization (took 39 ms). Listening on SocketAddress /192.168.38.98:35806
So, I think the problem is related to the firewall, but I don't understand where I can configure this port (or range of ports) in Apache Flink.
I found the problem: the taskmanager.data.port parameter was set to 0 by default (which means a random port is chosen, so it was not covered by my firewall rules, while the documentation says it should be set to 6121).
So I set taskmanager.data.port: 6121 in flink-conf.yaml, and now everything works fine.
