Snowflake Kafka Connector, topic2table mapping, getting insufficient privileges error - snowflake-cloud-data-platform

"snowflake.topic2table.map": "uat.product.topic:UAT_PRODUCT_TOPIC_15DEC2021" has been configured in the connector. I am getting below error. However. below grants were already given.
GRANT READ,WRITE ON FUTURE STAGES IN SCHEMA "KAFKA_DB"."KAFKA_SCHEMA" TO ROLE "KAFKA_CONNECTOR_ROLE_1";
GRANT WRITE ON STAGE "KAFKA_DB"."KAFKA_SCHEMA"."SNOWFLAKE_KAFKA_CONNECTOR_FILE_STREAM_DEMO_DISTRIBUTED_1770328299_STAGE_UAT_PRODUCT_TOPIC_15DEC2021" TO ROLE "KAFKA_CONNECTOR_ROLE_1";
But I am still getting the error below:
**10126** [2021-12-16 11:19:20,700] INFO [Worker clientId=connect-1, groupId=connect-cluster] Finished starting connectors and tasks (org.apache.kafka.connect.runtime.distributed.DistributedHerder:1236)
**10117** [2021-12-16 11:19:20,651] INFO [Worker clientId=connect-1, groupId=connect-cluster] Attempt to heartbeat failed since group is rebalancing (org.apache.kafka.clients.consumer.internals.AbstractCoordinator:1082)
Please help.
Below is the error that has made the snowflake-distributed-connector go down:
10072 [SF_KAFKA_CONNECTOR] Detail: Failed to upload file to Snowflake Stage though JDBC
10073 [SF_KAFKA_CONNECTOR] Message: SQL access control error:
10074 [SF_KAFKA_CONNECTOR] Insufficient privileges to operate on table stage 'UAT_PRODUCT_TOPIC_15DEC2021'
10075 [SF_KAFKA_CONNECTOR] net.snowflake.client.jdbc.SnowflakeUtil.checkErrorAndThrowExceptionSub(SnowflakeUtil.java:126)
10076 [SF_KAFKA_CONNECTOR] net.snowflake.client.jdbc.SnowflakeUtil.checkErrorAndThrowException(SnowflakeUtil.java:66)
10077 [SF_KAFKA_CONNECTOR] net.snowflake.client.core.StmtUtil.pollForOutput(StmtUtil.java:434)
10078 [SF_KAFKA_CONNECTOR] net.snowflake.client.core.StmtUtil.execute(StmtUtil.java:338)
10079 [SF_KAFKA_CONNECTOR] net.snowflake.client.core.SFStatement.executeHelper(SFStatement.java:482)
10080 [SF_KAFKA_CONNECTOR] net.snowflake.client.jdbc.SnowflakeFileTransferAgent.parseCommandInGS(SnowflakeFileTransferAgent.java:1180)
10081 [SF_KAFKA_CONNECTOR] net.snowflake.client.jdbc.SnowflakeFileTransferAgent.parseCommand(SnowflakeFileTransferAgent.java:843)
10082 [SF_KAFKA_CONNECTOR] net.snowflake.client.jdbc.SnowflakeFileTransferAgent.<init>(SnowflakeFileTransferAgent.java:819)
10083 [SF_KAFKA_CONNECTOR] net.snowflake.client.jdbc.DefaultSFConnectionHandler.getFileTransferAgent(DefaultSFConnectionHandler.java:187)
10084 [SF_KAFKA_CONNECTOR] net.snowflake.client.jdbc.SnowflakeConnectionV1.uploadStreamInternal(SnowflakeConnectionV1.java:867)
10085 [SF_KAFKA_CONNECTOR] net.snowflake.client.jdbc.SnowflakeConnectionV1.uploadStream(SnowflakeConnectionV1.java:772)
10086 [SF_KAFKA_CONNECTOR] com.snowflake.kafka.connector.internal.SnowflakeConnectionServiceV1.moveToTableStage(SnowflakeConnectionServiceV1.java:509)
10087 [SF_KAFKA_CONNECTOR] com.snowflake.kafka.connector.internal.SnowflakeSinkServiceV1$ServiceContext.moveToTableStage(SnowflakeSinkServiceV1.java:884)
10088 [SF_KAFKA_CONNECTOR] com.snowflake.kafka.connector.internal.SnowflakeSinkServiceV1$ServiceContext.checkStatus(SnowflakeSinkServiceV1.java:823)
10089 [SF_KAFKA_CONNECTOR] com.snowflake.kafka.connector.internal.SnowflakeSinkServiceV1$ServiceContext.lambda$startCleaner$0(SnowflakeSinkServiceV1.java:478)
10090 [SF_KAFKA_CONNECTOR] java.util.concurrent.Executors$RunnableAdapter.call(Executors.java:511)
10091 [SF_KAFKA_CONNECTOR] java.util.concurrent.FutureTask.run(FutureTask.java:266)
10092 [SF_KAFKA_CONNECTOR] java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1149)
10093 [SF_KAFKA_CONNECTOR] java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624)
10094 [SF_KAFKA_CONNECTOR] java.lang.Thread.run(Thread.java:748)
10095 [com.snowflake.kafka.connector.internal.SnowflakeErrors.getException(SnowflakeErrors.java:284), com.snowflake.kafka.connector.internal.SnowflakeErrors.getException(SnowflakeErrors.java:266), com.snowflake.kafka.connector.internal.SnowflakeErrors.getException(SnowflakeErrors.java:256), com.snowflake.kafka.connector.internal.SnowflakeConnectionServiceV1.moveToTableStage(SnowflakeConnectionServiceV1.java:516), com.snowflake.kafka.connector.internal.SnowflakeSinkServiceV1$ServiceContext.moveToTableStage(SnowflakeSinkServiceV1.java:884), com.snowflake.kafka.connector.internal.SnowflakeSinkServiceV1$ServiceContext.checkStatus(SnowflakeSinkServiceV1.java:823), com.snowflake.kafka.connector.internal.SnowflakeSinkServiceV1$ServiceContext.lambda$startCleaner$0(SnowflakeSinkServiceV1.java:478), java.util.concurrent.Executors$RunnableAdapter.call(Executors.java:511), java.util.concurrent.FutureTask.run(FutureTask.java:266), java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1149), java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624), java.lang.Thread.run(Thread.java:748)] (com.snowflake.kafka.connector.internal.SnowflakeSinkServiceV1:81)
10096 [2021-12-16 11:19:11,319] INFO
10097 [SF_KAFKA_CONNECTOR] uploadWithoutConnection successful for stageName:SNOWFLAKE_KAFKA_CONNECTOR_file_stream_demo_distributed_1770328299_STAGE_UAT_PRODUCT_TOPIC_15DEC2021, filePath:file_stream_demo_distributed_1770328299/UAT_PRODUCT_TOPIC_15DEC2021/2/2653366_2653661_1639653551128.json.gz (com.snowflake.kafka.connector.internal.SnowflakeInternalStage:63)
10098 [2021-12-16 11:19:11,320] INFO
10099 [SF_KAFKA_CONNECTOR] pipe SNOWFLAKE_KAFKA_CONNECTOR_file_stream_demo_distributed_1770328299_PIPE_UAT_PRODUCT_TOPIC_15DEC2021_2, flush pipe: file_stream_demo_distributed_1770328299/UAT_PRODUCT_TOPIC_15DEC2021/2/2653366_2653661_1639653551128.json.gz (com.snowflake.kafka.connector.internal.SnowflakeSinkServiceV1:63)
10100 [2021-12-16 11:19:15,828] INFO
10101 [SF_KAFKA_CONNECTOR] uploadWithoutConnection successful for stageName:SNOWFLAKE_KAFKA_CONNECTOR_file_stream_demo_distributed_1770328299_STAGE_UAT_PRODUCT_TOPIC_15DEC2021, filePath:file_stream_demo_distributed_1770328299/UAT_PRODUCT_TOPIC_15DEC2021/5/2653598_2653914_1639653555629.json.gz (com.snowflake.kafka.connector.internal.SnowflakeInternalStage:63)
10102 [2021-12-16 11:19:15,828] INFO
10103 [SF_KAFKA_CONNECTOR] pipe SNOWFLAKE_KAFKA_CONNECTOR_file_stream_demo_distributed_1770328299_PIPE_UAT_PRODUCT_TOPIC_15DEC2021_5, flush pipe: file_stream_demo_distributed_1770328299/UAT_PRODUCT_TOPIC_15DEC2021/5/2653598_2653914_1639653555629.json.gz (com.snowflake.kafka.connector.internal.SnowflakeSinkServiceV1:63)
10104 [2021-12-16 11:19:17,994] INFO
10105 [SF_KAFKA_CONNECTOR] pipe SNOWFLAKE_KAFKA_CONNECTOR_file_stream_demo_distributed_1770328299_PIPE_UAT_PRODUCT_TOPIC_15DEC2021_2, ingest files: [file_stream_demo_distributed_1770328299/UAT_PRODUCT_TOPIC_15DEC2021/2/2653366_2653661_1639653551128.json.gz] (com.snowflake.kafka.connector.internal.SnowflakeSinkServiceV1:63)
10106 [2021-12-16 11:19:17,994] INFO Sending Request UUID - 3050a7a3-afd8-4ff4-a86d-427c6d3724c1 (net.snowflake.ingest.SimpleIngestManager:554)
10107 [2021-12-16 11:19:17,994] INFO Created Insert Request : https://abc0000.XX-XXX.snowflakecomputing.com:443/v1/data/pipes/KAFKA_DB.KAFKA_SCHEMA.SNOWFLAKE_KAFKA_CONNECTOR_file_stream_demo_distributed_1770328299_PIPE_UAT_PRODUCT_TOPIC_15DEC2021_2/insertFiles?requestId=3050a7a3-afd8-4ff4-a86d-427c6d3724c1&showSkippedFiles=false (net.snowflake.ingest.connection.RequestBuilder:504)
10108 [2021-12-16 11:19:18,092] INFO In retryRequest for service unavailability with statusCode:200 and uri:/v1/data/pipes/KAFKA_DB.KAFKA_SCHEMA.SNOWFLAKE_KAFKA_CONNECTOR_file_stream_demo_distributed_1770328299_PIPE_UAT_PRODUCT_TOPIC_15DEC2021_2/insertFiles?requestId=3050a7a3-afd8-4ff4-a86d-427c6d3724c1&showSkippedFiles=false (net.snowflake.ingest.utils.HttpUtil:118)
[SF_KAFKA_CONNECTOR] pipe SNOWFLAKE_KAFKA_CONNECTOR_file_stream_demo_distributed_1770328299_PIPE_UAT_PRODUCT_TOPIC_15DEC2021_5, ingest files: [file_stream_demo_distributed_1770328299/UAT_PRODUCT_TOPIC_15DEC2021/5/2653598_2653914_1639653555629.json.gz] (com.snowflake.kafka.connector.internal.SnowflakeSinkServiceV1:63)
10112 [2021-12-16 11:19:18,092] INFO Sending Request UUID - 24c5542c-75fc-4265-957f-e9c2d167ba7a (net.snowflake.ingest.SimpleIngestManager:554)
10113 [2021-12-16 11:19:18,092] INFO Created Insert Request : https://abc0000.XX-XXX-1.snowflakecomputing.com:443/v1/data/pipes/KAFKA_DB.KAFKA_SCHEMA.SNOWFLAKE_KAFKA_CONNECTOR_file_stream_demo_distributed_1770328299_PIPE_UAT_PRODUCT_TOPIC_15DEC2021_5/insertFiles?requestId=24c5542c-75fc-4265-957f-e9c2d167ba7a&showSkippedFiles=false (net.snowflake.ingest.connection.RequestBuilder:504)
10114 [2021-12-16 11:19:18,222] INFO In retryRequest for service unavailability with statusCode:200 and uri:/v1/data/pipes/KAFKA_DB.KAFKA_SCHEMA.SNOWFLAKE_KAFKA_CONNECTOR_file_stream_demo_distributed_1770328299_PIPE_UAT_PRODUCT_TOPIC_15DEC2021_5/insertFiles?requestId=24c5542c-75fc-4265-957f-e9c2d167ba7a&showSkippedFiles=false (net.snowflake.ingest.utils.HttpUtil:118)
10115 [2021-12-16 11:19:18,222] INFO Attempting to unmarshall insert response - HttpResponseProxy{HTTP/1.1 200 OK [Content-Type: application/json, Date: Thu, 16 Dec 2021 11:19:18 GMT, Expect-CT: enforce, max-age=3600, Strict-Transport-Security: max-age=31536000, Vary: Accept-Encoding, User-Agent, X-Content-Type-Options: nosniff, X-Country: United States, X-Frame-Options: deny, X-XSS-Protection: 1; mode=block, Content-Length: 88, Connection: keep-alive] ResponseEntityProxy{[Content-Type: application/json,Content-Length: 88,Chunked: false]}} (net.snowflake.ingest.SimpleIngestManager:562)
10116 [2021-12-16 11:19:18,223] INFO WorkerSinkTask{id=file-stream-demo-distributed-0} Committing offsets asynchronously using sequence number 1245: {uat.product.topic-2=OffsetAndMetadata{offset=2653662, leaderEpoch=null, metadata=''}, uat.product.topic-3=OffsetAndMetadata{offset=2653774, leaderEpoch=null, metadata=''}, uat.product.topic-0=OffsetAndMetadata{offset=2657653, leaderEpoch=null, metadata=''}, uat.product.topic-1=OffsetAndMetadata{offset=2665892, leaderEpoch=null, metadata=''}, uat.product.topic-4=OffsetAndMetadata{offset=2665734, leaderEpoch=null, metadata=''}, uat.product.topic-5=OffsetAndMetadata{offset=2653915, leaderEpoch=null, metadata=''}} (org.apache.kafka.connect.runtime.WorkerSinkTask:346)
**10117** [2021-12-16 11:19:20,651] INFO [Worker clientId=connect-1, groupId=connect-cluster] Attempt to heartbeat failed since group is rebalancing (org.apache.kafka.clients.consumer.internals.AbstractCoordinator:1082)
10118 [2021-12-16 11:19:20,652] INFO [Worker clientId=connect-1, groupId=connect-cluster] Rebalance started (org.apache.kafka.connect.runtime.distributed.WorkerCoordinator:225)
10119 [2021-12-16 11:19:20,652] INFO [Worker clientId=connect-1, groupId=connect-cluster] (Re-)joining group (org.apache.kafka.clients.consumer.internals.AbstractCoordinator:553)
10120 [2021-12-16 11:19:20,661] INFO [Worker clientId=connect-1, groupId=connect-cluster] Successfully joined group with generation 16 (org.apache.kafka.clients.consumer.internals.AbstractCoordinator:504)
10121 [2021-12-16 11:19:20,661] INFO [Worker clientId=connect-1, groupId=connect-cluster] Joined group at generation 16 with protocol version 2 and got assignment: Assignment{error=0, leader='connect-1-41cfdb70-f9b9-443b-bf18-4d6c77b5951c', leaderUrl='http://XX.XX.XX.XXX:8083/', offset=422, connectorIds=[file-stream-demo-distributed], taskIds=[file-stream-demo-distributed-0], revokedConnectorIds=[], revokedTaskIds=[], delay=0} with rebalance delay: 0 (org.apache.kafka.connect.runtime.distributed.DistributedHerder:1681)
10122 [2021-12-16 11:19:20,661] WARN [Worker clientId=connect-1, groupId=connect-cluster] Catching up to assignment's config offset. (org.apache.kafka.connect.runtime.distributed.DistributedHerder:1089)
10123 [2021-12-16 11:19:20,661] INFO [Worker clientId=connect-1, groupId=connect-cluster] Current config state offset 419 is behind group assignment 422, reading to end of config log (org.apache.kafka.connect.runtime.distributed.DistributedHerder:1150)
10124 [2021-12-16 11:19:20,700] INFO [Worker clientId=connect-1, groupId=connect-cluster] Finished reading to end of log and updated config snapshot, new config log offset: 422 (org.apache.kafka.connect.runtime.distributed.DistributedHerder:1154)
10125 [2021-12-16 11:19:20,700] INFO [Worker clientId=connect-1, groupId=connect-cluster] Starting connectors and tasks using config offset 422 (org.apache.kafka.connect.runtime.distributed.DistributedHerder:1208)
**10126** [2021-12-16 11:19:20,700] INFO [Worker clientId=connect-1, groupId=connect-cluster] Finished starting connectors and tasks (org.apache.kafka.connect.runtime.distributed.DistributedHerder:1236)
10127 [2021-12-16 11:19:23,662] INFO [Worker clientId=connect-1, groupId=connect-cluster] Attempt to heartbeat failed since group is rebalancing (org.apache.kafka.clients.consumer.internals.AbstractCoordinator:1082)
10128 [2021-12-16 11:19:23,662] INFO [Worker clientId=connect-1, groupId=connect-cluster] Rebalance started (org.apache.kafka.connect.runtime.distributed.WorkerCoordinator:225)
10129 [2021-12-16 11:19:23,662] INFO [Worker clientId=connect-1, groupId=connect-cluster] (Re-)joining group (org.apache.kafka.clients.consumer.internals.AbstractCoordinator:553)
10130 [2021-12-16 11:19:23,665] INFO [Worker clientId=connect-1, groupId=connect-cluster] Successfully joined group with generation 17 (org.apache.kafka.clients.consumer.internals.AbstractCoordinator:504)
10131 [2021-12-16 11:19:23,665] INFO [Worker clientId=connect-1, groupId=connect-cluster] Joined group at generation 17 with protocol version 2 and got assignment: Assignment{error=0, leader='connect-1-41cfdb70-f9b9-443b-bf18-4d6c77b5951c', leaderUrl='http://XX.XX.XX.XXX:8083/', offset=422, connectorIds=[file-stream-demo-distributed], taskIds=[file-stream-demo-distributed-0], revokedConnectorIds=[], revokedTaskIds=[], delay=0} with rebalance delay: 0 (org.apache.kafka.connect.runtime.distributed.DistributedHerder:1681)
10132 [2021-12-16 11:19:23,666] INFO [Worker clientId=connect-1, groupId=connect-cluster] Starting connectors and tasks using config offset 422 (org.apache.kafka.connect.runtime.distributed.DistributedHerder:1208)
10133 [2021-12-16 11:19:23,666] INFO [Worker clientId=connect-1, groupId=connect-cluster] Finished starting connectors and tasks (org.apache.kafka.connect.runtime.distributed.DistributedHerder:1236)
10134 [2021-12-16 11:19:24,376] INFO Kafka Connect stopping (org.apache.kafka.connect.runtime.Connect:67)
10135 [2021-12-16 11:19:24,377] INFO Stopping REST server (org.apache.kafka.connect.runtime.rest.RestServer:327)
10136 [2021-12-16 11:19:24,388] INFO Stopped http_XX.XX.XX.XXX8083#464ede1f{HTTP/1.1,[http/1.1]}{XX.XX.XX.XXX:8083} (org.eclipse.jetty.server.AbstractConnector:380)
10137 [2021-12-16 11:19:24,389] INFO node0 Stopped scavenging (org.eclipse.jetty.server.session:158)
10138 [2021-12-16 11:19:24,395] INFO REST server stopped (org.apache.kafka.connect.runtime.rest.RestServer:344)
10139 [2021-12-16 11:19:24,395] INFO [Worker clientId=connect-1, groupId=connect-cluster] Herder stopping (org.apache.kafka.connect.runtime.distributed.DistributedHerder:676)
10140 [2021-12-16 11:19:24,396] INFO [Worker clientId=connect-1, groupId=connect-cluster] Stopping connectors and tasks that are still assigned to this worker. (org.apache.kafka.connect.runtime.distributed.DistributedHerder:650)
10141 [2021-12-16 11:19:24,399] INFO Stopping connector file-stream-demo-distributed (org.apache.kafka.connect.runtime.Worker:387)
10142 [2021-12-16 11:19:24,399] INFO Scheduled shutdown for WorkerConnector{id=file-stream-demo-distributed} (org.apache.kafka.connect.runtime.WorkerConnector:249)
10143 [2021-12-16 11:19:24,399] INFO Stopping task file-stream-demo-distributed-0 (org.apache.kafka.connect.runtime.Worker:836)
10144 [2021-12-16 11:19:24,399] INFO
10145 [SF_KAFKA_CONNECTOR] SnowflakeSinkConnector:stop (com.snowflake.kafka.connector.SnowflakeSinkConnector:121)
10146 [2021-12-16 11:19:24,401] INFO
10147 [SF_KAFKA_CONNECTOR] SnowflakeSinkTask[ID:0]:close (com.snowflake.kafka.connector.SnowflakeSinkTask:209)
10148 [2021-12-16 11:19:24,401] INFO
10149 [SF_KAFKA_CONNECTOR] pipe SNOWFLAKE_KAFKA_CONNECTOR_file_stream_demo_distributed_1770328299_PIPE_UAT_PRODUCT_TOPIC_15DEC2021_2: cleaner terminated (com.snowflake.kafka.connector.internal.SnowflakeSinkServiceV1:63)
10150 [2021-12-16 11:19:24,401] INFO
10151 [SF_KAFKA_CONNECTOR] IngestService Closed (com.snowflake.kafka.connector.internal.SnowflakeIngestionServiceV1:32)
10152 [2021-12-16 11:19:24,402] INFO
10153 [SF_KAFKA_CONNECTOR] pipe SNOWFLAKE_KAFKA_CONNECTOR_file_stream_demo_distributed_1770328299_PIPE_UAT_PRODUCT_TOPIC_15DEC2021_2: service closed (com.snowflake.kafka.connector.internal.SnowflakeSinkServiceV1:63)
10154 [2021-12-16 11:19:24,402] INFO
10155 [SF_KAFKA_CONNECTOR] pipe SNOWFLAKE_KAFKA_CONNECTOR_file_stream_demo_distributed_1770328299_PIPE_UAT_PRODUCT_TOPIC_15DEC2021_3: cleaner terminated (com.snowflake.kafka.connector.internal.SnowflakeSinkServiceV1:63)
10156 [2021-12-16 11:19:24,402] INFO
10157 [SF_KAFKA_CONNECTOR] IngestService Closed (com.snowflake.kafka.connector.internal.SnowflakeIngestionServiceV1:32)
10158 [2021-12-16 11:19:24,402] INFO
10159 [SF_KAFKA_CONNECTOR] pipe SNOWFLAKE_KAFKA_CONNECTOR_file_stream_demo_distributed_1770328299_PIPE_UAT_PRODUCT_TOPIC_15DEC2021_3: service closed (com.snowflake.kafka.connector.internal.SnowflakeSinkServiceV1:63)
10160 [2021-12-16 11:19:24,402] INFO
10161 [SF_KAFKA_CONNECTOR] pipe SNOWFLAKE_KAFKA_CONNECTOR_file_stream_demo_distributed_1770328299_PIPE_UAT_PRODUCT_TOPIC_15DEC2021_0: cleaner terminated (com.snowflake.kafka.connector.internal.SnowflakeSinkServiceV1:63)
10162 [2021-12-16 11:19:24,402] INFO
10163 [SF_KAFKA_CONNECTOR] IngestService Closed (com.snowflake.kafka.connector.internal.SnowflakeIngestionServiceV1:32)
10164 [2021-12-16 11:19:24,402] INFO
10165 [SF_KAFKA_CONNECTOR] pipe SNOWFLAKE_KAFKA_CONNECTOR_file_stream_demo_distributed_1770328299_PIPE_UAT_PRODUCT_TOPIC_15DEC2021_0: service closed (com.snowflake.kafka.connector.internal.SnowflakeSinkServiceV1:63)
10166 [2021-12-16 11:19:24,402] INFO
10167 [SF_KAFKA_CONNECTOR] pipe SNOWFLAKE_KAFKA_CONNECTOR_file_stream_demo_distributed_1770328299_PIPE_UAT_PRODUCT_TOPIC_15DEC2021_1: cleaner terminated (com.snowflake.kafka.connector.internal.SnowflakeSinkServiceV1:63)
10168 [2021-12-16 11:19:24,402] INFO
10169 [SF_KAFKA_CONNECTOR] IngestService Closed (com.snowflake.kafka.connector.internal.SnowflakeIngestionServiceV1:32)
10170 [2021-12-16 11:19:24,403] INFO
10171 [SF_KAFKA_CONNECTOR] pipe SNOWFLAKE_KAFKA_CONNECTOR_file_stream_demo_distributed_1770328299_PIPE_UAT_PRODUCT_TOPIC_15DEC2021_1: service closed (com.snowflake.kafka.connector.internal.SnowflakeSinkServiceV1:63)
10172 [2021-12-16 11:19:24,403] INFO
10173 [SF_KAFKA_CONNECTOR] pipe SNOWFLAKE_KAFKA_CONNECTOR_file_stream_demo_distributed_1770328299_PIPE_UAT_PRODUCT_TOPIC_15DEC2021_4: cleaner terminated (com.snowflake.kafka.connector.internal.SnowflakeSinkServiceV1:63)
10174 [2021-12-16 11:19:24,403] INFO
10175 [SF_KAFKA_CONNECTOR] IngestService Closed (com.snowflake.kafka.connector.internal.SnowflakeIngestionServiceV1:32)
10176 [2021-12-16 11:19:24,403] INFO
10177 [SF_KAFKA_CONNECTOR] pipe SNOWFLAKE_KAFKA_CONNECTOR_file_stream_demo_distributed_1770328299_PIPE_UAT_PRODUCT_TOPIC_15DEC2021_4: service closed (com.snowflake.kafka.connector.internal.SnowflakeSinkServiceV1:63)
10178 [2021-12-16 11:19:24,403] INFO
10179 [SF_KAFKA_CONNECTOR] pipe SNOWFLAKE_KAFKA_CONNECTOR_file_stream_demo_distributed_1770328299_PIPE_UAT_PRODUCT_TOPIC_15DEC2021_5: cleaner terminated (com.snowflake.kafka.connector.internal.SnowflakeSinkServiceV1:63)
10180 [2021-12-16 11:19:24,403] INFO
10181 [SF_KAFKA_CONNECTOR] IngestService Closed (com.snowflake.kafka.connector.internal.SnowflakeIngestionServiceV1:32)
10182 [2021-12-16 11:19:24,403] INFO
10183 [SF_KAFKA_CONNECTOR] pipe SNOWFLAKE_KAFKA_CONNECTOR_file_stream_demo_distributed_1770328299_PIPE_UAT_PRODUCT_TOPIC_15DEC2021_5: service closed (com.snowflake.kafka.connector.internal.SnowflakeSinkServiceV1:63)
10184 [2021-12-16 11:19:24,404] INFO
10185 [SF_KAFKA_CONNECTOR] SnowflakeSinkTask[ID:0]:close. Time: 0 seconds (com.snowflake.kafka.connector.SnowflakeSinkTask:214)
10186 [2021-12-16 11:19:24,404] INFO
10187 [SF_KAFKA_CONNECTOR] SnowflakeSinkTask[ID:0]:stop (com.snowflake.kafka.connector.SnowflakeSinkTask:167)
10188 [2021-12-16 11:19:24,404] INFO
10189 [SF_KAFKA_CONNECTOR] Cleaner terminated by an interrupt:
10190 [SF_KAFKA_CONNECTOR] sleep interrupted (com.snowflake.kafka.connector.internal.SnowflakeSinkServiceV1:63)
10191 [2021-12-16 11:19:24,405] INFO
10192 [SF_KAFKA_CONNECTOR] Cleaner terminated by an interrupt:
10193 [SF_KAFKA_CONNECTOR] sleep interrupted (com.snowflake.kafka.connector.internal.SnowflakeSinkServiceV1:63)
10194 [2021-12-16 11:19:24,405] INFO
10195 [SF_KAFKA_CONNECTOR] Cleaner terminated by an interrupt:
10196 [SF_KAFKA_CONNECTOR] sleep interrupted (com.snowflake.kafka.connector.internal.SnowflakeSinkServiceV1:63)
10197 [2021-12-16 11:19:24,409] INFO
10198 [SF_KAFKA_CONNECTOR] Cleaner terminated by an interrupt:
10199 [SF_KAFKA_CONNECTOR] sleep interrupted (com.snowflake.kafka.connector.internal.SnowflakeSinkServiceV1:63)
10200 [2021-12-16 11:19:24,411] INFO Completed shutdown for WorkerConnector{id=file-stream-demo-distributed} (org.apache.kafka.connect.runtime.WorkerConnector:269)
10201 [2021-12-16 11:19:24,413] INFO
10202 [SF_KAFKA_CONNECTOR] Cleaner terminated by an interrupt:
10203 [SF_KAFKA_CONNECTOR] sleep interrupted (com.snowflake.kafka.connector.internal.SnowflakeSinkServiceV1:63)
10204 [2021-12-16 11:19:24,413] INFO
10205 [SF_KAFKA_CONNECTOR] Cleaner terminated by an interrupt:
10206 [SF_KAFKA_CONNECTOR] sleep interrupted (com.snowflake.kafka.connector.internal.SnowflakeSinkServiceV1:63)
10207 [2021-12-16 11:19:24,413] INFO [Consumer clientId=connector-consumer-file-stream-demo-distributed-0, groupId=connect-file-stream-demo-distributed] Revoke previously assigned partitions uat.product.topic-2, uat.product.topic-3, uat.product.topic-0, uat.product.topic-1, uat.product.topic-4, uat.product.topic-5 (org.apache.kafka.clients.consumer.internals.ConsumerCoordinator:307)
10208 [2021-12-16 11:19:24,414] WARN
10209 [SF_KAFKA_CONNECTOR] SnowflakeSinkTask[ID:0]: sink not initialized or closed before preCommit (com.snowflake.kafka.connector.SnowflakeSinkTask:256)
10210 [2021-12-16 11:19:24,414] INFO
10211 [SF_KAFKA_CONNECTOR] SnowflakeSinkTask[ID:0]:close (com.snowflake.kafka.connector.SnowflakeSinkTask:209)
10212 [2021-12-16 11:19:24,414] WARN Failed to close sink service for Topic: uat.product.topic, Partition: 2, sink service hasn't been initialized (com.snowflake.kafka.connector.internal.SnowflakeSinkServiceV1:81)
10213 [2021-12-16 11:19:24,414] WARN Failed to close sink service for Topic: uat.product.topic, Partition: 3, sink service hasn't been initialized (com.snowflake.kafka.connector.internal.SnowflakeSinkServiceV1:81)
10214 [2021-12-16 11:19:24,414] WARN Failed to close sink service for Topic: uat.product.topic, Partition: 0, sink service hasn't been initialized (com.snowflake.kafka.connector.internal.SnowflakeSinkServiceV1:81)
10215 [2021-12-16 11:19:24,414] WARN Failed to close sink service for Topic: uat.product.topic, Partition: 1, sink service hasn't been initialized (com.snowflake.kafka.connector.internal.SnowflakeSinkServiceV1:81)
10216 [2021-12-16 11:19:24,414] WARN Failed to close sink service for Topic: uat.product.topic, Partition: 4, sink service hasn't been initialized (com.snowflake.kafka.connector.internal.SnowflakeSinkServiceV1:81)
10217 [2021-12-16 11:19:24,414] WARN Failed to close sink service for Topic: uat.product.topic, Partition: 5, sink service hasn't been initialized (com.snowflake.kafka.connector.internal.SnowflakeSinkServiceV1:81)
10218 [2021-12-16 11:19:24,414] INFO
10219 [SF_KAFKA_CONNECTOR] SnowflakeSinkTask[ID:0]:close. Time: 0 seconds (com.snowflake.kafka.connector.SnowflakeSinkTask:214)
10220 [2021-12-16 11:19:24,414] INFO [Consumer clientId=connector-consumer-file-stream-demo-distributed-0, groupId=connect-file-stream-demo-distributed] Member connector-consumer-file-stream-demo-distributed-0-4d8276fe-e3e4-4b89-843f-b2c595f17e94 sending LeaveGroup request to coordinator 10.28.18.248:9092 (id: 2147483645 rack: null) due to the consumer is being closed (org.apache.kafka.clients.consumer.internals.AbstractCoordinator:1005)
10221 [2021-12-16 11:19:24,433] INFO [Worker clientId=connect-1, groupId=connect-cluster] Member connect-1-41cfdb70-f9b9-443b-bf18-4d6c77b5951c sending LeaveGroup request to coordinator XX.XX.XX.XXX:9092 (id: 2147483647 rack: null) due to the consumer is being closed (org.apache.kafka.clients.consumer.internals.AbstractCoordinator:1005)
10222 [2021-12-16 11:19:24,433] WARN [Worker clientId=connect-1, groupId=connect-cluster] Close timed out with 1 pending requests to coordinator, terminating client connections (org.apache.kafka.clients.consumer.internals.AbstractCoordinator:986)
10223 [2021-12-16 11:19:24,434] INFO Stopping KafkaBasedLog for topic connect-status (org.apache.kafka.connect.util.KafkaBasedLog:167)
10224 [2021-12-16 11:19:24,435] INFO [Producer clientId=producer-2] Closing the Kafka producer with timeoutMillis = 9223372036854775807 ms. (org.apache.kafka.clients.producer.KafkaProducer:1189)
10225 [2021-12-16 11:19:24,438] INFO Stopped KafkaBasedLog for topic connect-status (org.apache.kafka.connect.util.KafkaBasedLog:193)
10226 [2021-12-16 11:19:24,438] INFO Closing KafkaConfigBackingStore (org.apache.kafka.connect.storage.KafkaConfigBackingStore:285)
10227 [2021-12-16 11:19:24,438] INFO Stopping KafkaBasedLog for topic connect-configs (org.apache.kafka.connect.util.KafkaBasedLog:167)
10228 [2021-12-16 11:19:24,444] INFO [Producer clientId=producer-3] Closing the Kafka producer with timeoutMillis = 9223372036854775807 ms. (org.apache.kafka.clients.producer.KafkaProducer:1189)
10229 [2021-12-16 11:19:24,448] INFO Stopped KafkaBasedLog for topic connect-configs (org.apache.kafka.connect.util.KafkaBasedLog:193)
10230 [2021-12-16 11:19:24,448] INFO Closed KafkaConfigBackingStore (org.apache.kafka.connect.storage.KafkaConfigBackingStore:287)
10231 [2021-12-16 11:19:24,448] INFO Worker stopping (org.apache.kafka.connect.runtime.Worker:209)
10232 [2021-12-16 11:19:24,448] INFO Stopping KafkaOffsetBackingStore (org.apache.kafka.connect.storage.KafkaOffsetBackingStore:134)
10233 [2021-12-16 11:19:24,448] INFO Stopping KafkaBasedLog for topic connect-offsets (org.apache.kafka.connect.util.KafkaBasedLog:167)
10234 [2021-12-16 11:19:24,448] INFO [Producer clientId=producer-1] Closing the Kafka producer with timeoutMillis = 9223372036854775807 ms. (org.apache.kafka.clients.producer.KafkaProducer:1189)
10235 [2021-12-16 11:19:24,451] INFO Stopped KafkaBasedLog for topic connect-offsets (org.apache.kafka.connect.util.KafkaBasedLog:193)
10236 [2021-12-16 11:19:24,451] INFO Stopped KafkaOffsetBackingStore (org.apache.kafka.connect.storage.KafkaOffsetBackingStore:136)
10237 [2021-12-16 11:19:24,451] INFO Worker stopped (org.apache.kafka.connect.runtime.Worker:230)
10238 [2021-12-16 11:19:24,452] INFO [Worker clientId=connect-1, groupId=connect-cluster] Herder stopped (org.apache.kafka.connect.runtime.distributed.DistributedHerder:299)
10239 [2021-12-16 11:19:24,453] INFO [Worker clientId=connect-1, groupId=connect-cluster] Herder stopped (org.apache.kafka.connect.runtime.distributed.DistributedHerder:696)
10240 [2021-12-16 11:19:24,453] INFO Kafka Connect stopped (org.apache.kafka.connect.runtime.Connect:72)

In order to allow access to the database and schema, the following grants need to be applied:
grant usage on database kafka_db to role kafka_connector_role_1;
grant usage on schema kafka_schema to role kafka_connector_role_1;
grant create table on schema kafka_schema to role kafka_connector_role_1;
grant create stage on schema kafka_schema to role kafka_connector_role_1;
grant create pipe on schema kafka_schema to role kafka_connector_role_1;
Have all of these grants been applied?
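Note also that the error is about the table stage for UAT_PRODUCT_TOPIC_15DEC2021. A table stage has no grantable privileges of its own; access to it is governed by the privileges on the table itself, so if the target table was created ahead of time (rather than by the connector), the connector role may also need to own it. A rough sketch, assuming the table sits in KAFKA_DB.KAFKA_SCHEMA and reusing the role name from the question:
-- transfer ownership of the pre-created target table to the connector role (assumed table name from the mapping)
grant ownership on table kafka_db.kafka_schema.uat_product_topic_15dec2021 to role kafka_connector_role_1 copy current grants;
-- and double-check what the role can actually see
show grants to role kafka_connector_role_1;
show grants on table kafka_db.kafka_schema.uat_product_topic_15dec2021;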

Related

Nodetool Rebuild :- Stream Failing after some time

I have added a new node to my existing single-node Cassandra cluster. It has around 48 GB of data.
There is only one keyspace responsible for that, and it has a replication factor of '2' (I changed it after adding the new node). I am trying to run nodetool rebuild on the new node so data can be streamed to it from the seed node.
The stream ended after transferring 36 GB of data and the node went down. So I repeated the process, but the stream keeps failing after transferring some data (12-25 GB).
It ends with the following error.
error: Error while rebuilding node: Stream failed
-- StackTrace --
java.lang.RuntimeException: Error while rebuilding node: Stream failed
at org.apache.cassandra.service.StorageService.rebuild(StorageService.java:1319)
at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:62)
at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
at java.lang.reflect.Method.invoke(Method.java:498)
at sun.reflect.misc.Trampoline.invoke(MethodUtil.java:71)
at sun.reflect.GeneratedMethodAccessor3.invoke(Unknown Source)
at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
at java.lang.reflect.Method.invoke(Method.java:498)
at sun.reflect.misc.MethodUtil.invoke(MethodUtil.java:275)
at com.sun.jmx.mbeanserver.StandardMBeanIntrospector.invokeM2(StandardMBeanIntrospector.java:112)
at com.sun.jmx.mbeanserver.StandardMBeanIntrospector.invokeM2(StandardMBeanIntrospector.java:46)
at com.sun.jmx.mbeanserver.MBeanIntrospector.invokeM(MBeanIntrospector.java:237)
at com.sun.jmx.mbeanserver.PerInterface.invoke(PerInterface.java:138)
at com.sun.jmx.mbeanserver.MBeanSupport.invoke(MBeanSupport.java:252)
at com.sun.jmx.interceptor.DefaultMBeanServerInterceptor.invoke(DefaultMBeanServerInterceptor.java:819)
at com.sun.jmx.mbeanserver.JmxMBeanServer.invoke(JmxMBeanServer.java:801)
at javax.management.remote.rmi.RMIConnectionImpl.doOperation(RMIConnectionImpl.java:1468)
at javax.management.remote.rmi.RMIConnectionImpl.access$300(RMIConnectionImpl.java:76)
at javax.management.remote.rmi.RMIConnectionImpl$PrivilegedOperation.run(RMIConnectionImpl.java:1309)
at javax.management.remote.rmi.RMIConnectionImpl.doPrivilegedOperation(RMIConnectionImpl.java:1401)
at javax.management.remote.rmi.RMIConnectionImpl.invoke(RMIConnectionImpl.java:829)
at sun.reflect.GeneratedMethodAccessor6.invoke(Unknown Source)
at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
at java.lang.reflect.Method.invoke(Method.java:498)
at sun.rmi.server.UnicastServerRef.dispatch(UnicastServerRef.java:357)
at sun.rmi.transport.Transport$1.run(Transport.java:200)
at sun.rmi.transport.Transport$1.run(Transport.java:197)
at java.security.AccessController.doPrivileged(Native Method)
at sun.rmi.transport.Transport.serviceCall(Transport.java:196)
at sun.rmi.transport.tcp.TCPTransport.handleMessages(TCPTransport.java:573)
at sun.rmi.transport.tcp.TCPTransport$ConnectionHandler.run0(TCPTransport.java:834)
at sun.rmi.transport.tcp.TCPTransport$ConnectionHandler.lambda$run$0(TCPTransport.java:688)
at java.security.AccessController.doPrivileged(Native Method)
at sun.rmi.transport.tcp.TCPTransport$ConnectionHandler.run(TCPTransport.java:687)
at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1149)
at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624)
at java.lang.Thread.run(Thread.java:748)
P.S. I have made sure that the streaming_socket_timeout_in_ms is set to at least 24 hours.
Kindly help me out here guys.
Thanks.
Update :-
I ran nodetool rebuild keyspace_name instead of nodetool rebuild and it ended with this error again.
WARN [StreamReceiveTask:9] 2019-10-23 11:14:41,522 StreamResultFuture.java:214 - [Stream #b9b051b0-f580-11e9-92dd-9765711f899a] Stream failed
ERROR [RMI TCP Connection(12)-10.128.1.3] 2019-10-23 11:14:42,316 StorageService.java:1318 - Error while rebuilding node
org.apache.cassandra.streaming.StreamException: Stream failed
at org.apache.cassandra.streaming.StreamResultFuture.maybeComplete(StreamResultFuture.java:215) ~[apache-cassandra-3.11.4.jar:3.11.4]
at org.apache.cassandra.streaming.StreamResultFuture.handleSessionComplete(StreamResultFuture.java:191) ~[apache-cassandra-3.11.4.jar:3.11.4]
at org.apache.cassandra.streaming.StreamSession.closeSession(StreamSession.java:481) ~[apache-cassandra-3.11.4.jar:3.11.4]
at org.apache.cassandra.streaming.StreamSession.onError(StreamSession.java:571) ~[apache-cassandra-3.11.4.jar:3.11.4]
at org.apache.cassandra.streaming.StreamReceiveTask$OnCompletionRunnable.run(StreamReceiveTask.java:281) ~[apache-cassandra-3.11.4.jar:3.11.4]
at java.util.concurrent.Executors$RunnableAdapter.call(Executors.java:511) ~[na:1.8.0_222]
at java.util.concurrent.FutureTask.run(FutureTask.java:266) ~[na:1.8.0_222]
at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1149) ~[na:1.8.0_222]
at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624) ~[na:1.8.0_222]
at org.apache.cassandra.concurrent.NamedThreadFactory.lambda$threadLocalDeallocator$0(NamedThreadFactory.java:81) ~[apache-cassandra-3.11.4.jar:3.11.4]
at java.lang.Thread.run(Thread.java:748) ~[na:1.8.0_222]
INFO [Service Thread] 2019-10-23 11:14:43,223 GCInspector.java:284 - ConcurrentMarkSweep GC in 310ms. CMS Old Gen: 2391324840 -> 639245216; Code Cache: 38320192 -> 38627904; Compressed Class Space: 554$
ERROR [STREAM-IN-/10.128.1.1:7000] 2019-10-23 11:14:48,769 StreamSession.java:593 - [Stream #b9b051b0-f580-11e9-92dd-9765711f899a] Streaming error occurred on session with peer 10.128.1.1
java.lang.RuntimeException: Outgoing stream handler has been closed
at org.apache.cassandra.streaming.ConnectionHandler.sendMessage(ConnectionHandler.java:143) ~[apache-cassandra-3.11.4.jar:3.11.4]
at org.apache.cassandra.streaming.StreamSession.receive(StreamSession.java:655) ~[apache-cassandra-3.11.4.jar:3.11.4]
at org.apache.cassandra.streaming.StreamSession.messageReceived(StreamSession.java:523) ~[apache-cassandra-3.11.4.jar:3.11.4]
at org.apache.cassandra.streaming.ConnectionHandler$IncomingMessageHandler.run(ConnectionHandler.java:317) ~[apache-cassandra-3.11.4.jar:3.11.4]
at java.lang.Thread.run(Thread.java:748) [na:1.8.0_222]
Update 2 :-
I tried to do nodetool rebuild again on a fresh node.
The stream fails again after transferring around 95% of the data.
This is the log of the streaming node:
INFO [STREAM-INIT-/10.128.1.3:56486] 2019-10-23 11:16:03,497 StreamResultFuture.java:116 - [Stream #80136bd0-f586-11e9-92dd-9765711f899a ID#0] Creating new streaming plan for Rebuild
INFO [STREAM-INIT-/10.128.1.3:56486] 2019-10-23 11:16:03,498 StreamResultFuture.java:123 - [Stream #80136bd0-f586-11e9-92dd-9765711f899a, ID#0] Received streaming plan for Rebuild
INFO [STREAM-INIT-/10.128.1.3:56488] 2019-10-23 11:16:03,498 StreamResultFuture.java:123 - [Stream #80136bd0-f586-11e9-92dd-9765711f899a, ID#0] Received streaming plan for Rebuild
INFO [STREAM-IN-/10.128.1.3:56488] 2019-10-23 11:16:03,600 StreamResultFuture.java:173 - [Stream #80136bd0-f586-11e9-92dd-9765711f899a ID#0] Prepare completed. Receiving 0 files(0.000KiB), sending 133 f$
INFO [Service Thread] 2019-10-23 11:19:14,472 GCInspector.java:284 - ParNew GC in 517ms. CMS Old Gen: 104131728 -> 121315352; Par Eden Space: 1342177280 -> 0; Par Survivor Space: 67963984 -> 61263088
ERROR [STREAM-IN-/10.128.1.3:56488] 2019-10-23 11:56:43,902 StreamSession.java:706 - [Stream #80136bd0-f586-11e9-92dd-9765711f899a] Remote peer 10.128.1.3 failed stream session.
INFO [IndexSummaryManager:1] 2019-10-23 11:58:32,284 IndexSummaryRedistribution.java:77 - Redistributing index summaries
INFO [STREAM-IN-/10.128.1.3:56488] 2019-10-23 11:59:38,687 StreamResultFuture.java:187 - [Stream #80136bd0-f586-11e9-92dd-9765711f899a] Session with /10.128.1.3 is complete
ERROR [STREAM-OUT-/10.128.1.3:56486] 2019-10-23 11:59:38,688 StreamSession.java:593 - [Stream #80136bd0-f586-11e9-92dd-9765711f899a] Streaming error occurred on session with peer 10.128.1.3
java.lang.RuntimeException: Transfer of file /var/lib/cassandra/data/thingsboard/ts_kv_cf-53b7bf3096ec11e99154356269723c5c/md-583-big-Data.db already completed or aborted (perhaps session failed?).
at org.apache.cassandra.streaming.messages.OutgoingFileMessage.startTransfer(OutgoingFileMessage.java:119) ~[apache-cassandra-3.11.4.jar:3.11.4]
at org.apache.cassandra.streaming.messages.OutgoingFileMessage$1.serialize(OutgoingFileMessage.java:49) ~[apache-cassandra-3.11.4.jar:3.11.4]
at org.apache.cassandra.streaming.messages.OutgoingFileMessage$1.serialize(OutgoingFileMessage.java:41) ~[apache-cassandra-3.11.4.jar:3.11.4]
at org.apache.cassandra.streaming.messages.StreamMessage.serialize(StreamMessage.java:50) ~[apache-cassandra-3.11.4.jar:3.11.4]
at org.apache.cassandra.streaming.ConnectionHandler$OutgoingMessageHandler.sendMessage(ConnectionHandler.java:408) ~[apache-cassandra-3.11.4.jar:3.11.4]
at org.apache.cassandra.streaming.ConnectionHandler$OutgoingMessageHandler.run(ConnectionHandler.java:380) ~[apache-cassandra-3.11.4.jar:3.11.4]
at java.lang.Thread.run(Thread.java:748) [na:1.8.0_222]
WARN [STREAM-IN-/10.128.1.3:56488] 2019-10-23 11:59:38,688 StreamResultFuture.java:214 - [Stream #80136bd0-f586-11e9-92dd-9765711f899a] Stream failed
INFO [STREAM-INIT-/10.128.1.3:56674] 2019-10-23 12:03:24,860 StreamResultFuture.java:116 - [Stream #1da92910-f58d-11e9-92dd-9765711f899a ID#0] Creating new streaming plan for Rebuild
INFO [STREAM-INIT-/10.128.1.3:56674] 2019-10-23 12:03:24,861 StreamResultFuture.java:123 - [Stream #1da92910-f58d-11e9-92dd-9765711f899a, ID#0] Received streaming plan for Rebuild
INFO [STREAM-INIT-/10.128.1.3:56676] 2019-10-23 12:03:24,861 StreamResultFuture.java:123 - [Stream #1da92910-f58d-11e9-92dd-9765711f899a, ID#0] Received streaming plan for Rebuild
INFO [STREAM-IN-/10.128.1.3:56676] 2019-10-23 12:03:24,950 StreamResultFuture.java:173 - [Stream #1da92910-f58d-11e9-92dd-9765711f899a ID#0] Prepare completed. Receiving 0 files(0.000KiB), sending 133 f$
INFO [Service Thread] 2019-10-23 12:04:18,160 GCInspector.java:284 - ParNew GC in 307ms. CMS Old Gen: 124972984 -> 125070416; Par Eden Space: 1342177280 -> 0; Par Survivor Space: 61042328 -> 82423296
INFO [GossipStage:1] 2019-10-23 12:27:39,200 Gossiper.java:1026 - InetAddress /10.128.1.3 is now DOWN
INFO [HANDSHAKE-/10.128.1.3] 2019-10-23 12:27:39,424 OutboundTcpConnection.java:561 - Handshaking version with /10.128.1.3
ERROR [STREAM-IN-/10.128.1.3:56676] 2019-10-23 12:27:45,107 StreamSession.java:593 - [Stream #1da92910-f58d-11e9-92dd-9765711f899a] Streaming error occurred on session with peer 10.128.1.3
java.net.SocketException: End-of-stream reached
at org.apache.cassandra.streaming.messages.StreamMessage.deserialize(StreamMessage.java:71) ~[apache-cassandra-3.11.4.jar:3.11.4]
at org.apache.cassandra.streaming.ConnectionHandler$IncomingMessageHandler.run(ConnectionHandler.java:311) ~[apache-cassandra-3.11.4.jar:3.11.4]
at java.lang.Thread.run(Thread.java:748) [na:1.8.0_222]
INFO [STREAM-IN-/10.128.1.3:56676] 2019-10-23 12:27:45,108 StreamResultFuture.java:187 - [Stream #1da92910-f58d-11e9-92dd-9765711f899a] Session with /10.128.1.3 is complete
ERROR [STREAM-OUT-/10.128.1.3:56674] 2019-10-23 12:27:45,108 StreamSession.java:593 - [Stream #1da92910-f58d-11e9-92dd-9765711f899a] Streaming error occurred on session with peer 10.128.1.3
org.apache.cassandra.io.FSReadError: java.io.IOException: Broken pipe
at org.apache.cassandra.io.util.ChannelProxy.transferTo(ChannelProxy.java:145) ~[apache-cassandra-3.11.4.jar:3.11.4]
at org.apache.cassandra.streaming.compress.CompressedStreamWriter.lambda$write$0(CompressedStreamWriter.java:85) ~[apache-cassandra-3.11.4.jar:3.11.4]
at org.apache.cassandra.io.util.BufferedDataOutputStreamPlus.applyToChannel(BufferedDataOutputStreamPlus.java:350) ~[apache-cassandra-3.11.4.jar:3.11.4]
at org.apache.cassandra.streaming.compress.CompressedStreamWriter.write(CompressedStreamWriter.java:85) ~[apache-cassandra-3.11.4.jar:3.11.4]
at org.apache.cassandra.streaming.messages.OutgoingFileMessage.serialize(OutgoingFileMessage.java:101) ~[apache-cassandra-3.11.4.jar:3.11.4]
at org.apache.cassandra.streaming.messages.OutgoingFileMessage$1.serialize(OutgoingFileMessage.java:52) ~[apache-cassandra-3.11.4.jar:3.11.4]
at org.apache.cassandra.streaming.messages.OutgoingFileMessage$1.serialize(OutgoingFileMessage.java:41) ~[apache-cassandra-3.11.4.jar:3.11.4]
at org.apache.cassandra.streaming.messages.StreamMessage.serialize(StreamMessage.java:50) ~[apache-cassandra-3.11.4.jar:3.11.4]
at org.apache.cassandra.streaming.ConnectionHandler$OutgoingMessageHandler.sendMessage(ConnectionHandler.java:408) ~[apache-cassandra-3.11.4.jar:3.11.4]
at org.apache.cassandra.streaming.ConnectionHandler$OutgoingMessageHandler.run(ConnectionHandler.java:380) ~[apache-cassandra-3.11.4.jar:3.11.4]
at java.lang.Thread.run(Thread.java:748) [na:1.8.0_222]
Caused by: java.io.IOException: Broken pipe
at sun.nio.ch.FileChannelImpl.transferTo0(Native Method) ~[na:1.8.0_222]
at sun.nio.ch.FileChannelImpl.transferToDirectlyInternal(FileChannelImpl.java:428) ~[na:1.8.0_222]
at sun.nio.ch.FileChannelImpl.transferToDirectly(FileChannelImpl.java:493) ~[na:1.8.0_222]
at sun.nio.ch.FileChannelImpl.transferTo(FileChannelImpl.java:605) ~[na:1.8.0_222]
at org.apache.cassandra.io.util.ChannelProxy.transferTo(ChannelProxy.java:141) ~[apache-cassandra-3.11.4.jar:3.11.4]
... 10 common frames omitted
WARN [STREAM-IN-/10.128.1.3:56676] 2019-10-23 12:27:45,108 StreamResultFuture.java:214 - [Stream #1da92910-f58d-11e9-92dd-9765711f899a] Stream failed
INFO [RMI TCP Connection(124)-10.128.1.1] 2019-10-23 12:28:19,854 Gossiper.java:525 - Removing host: 0e8ad28d-6cc2-46df-8d3f-f346d464db40
INFO [RMI TCP Connection(124)-10.128.1.1] 2019-10-23 12:28:19,854 Gossiper.java:526 - Sleeping for 30000ms to ensure /10.128.1.3 does not change
INFO [RMI TCP Connection(124)-10.128.1.1] 2019-10-23 12:28:49,854 Gossiper.java:533 - Advertising removal for /10.128.1.3
INFO [RMI TCP Connection(124)-10.128.1.1] 2019-10-23 12:28:50,245 StreamResultFuture.java:90 - [Stream #aae08f50-f590-11e9-9934-850cf6bcace3] Executing streaming plan for Restore replica count
INFO [MiscStage:1] 2019-10-23 12:28:50,247 StorageService.java:4459 - Received unexpected REPLICATION_FINISHED message from /10.128.1.1. Was this node recently a removal coordinator?
INFO [RMI TCP Connection(124)-10.128.1.1] 2019-10-23 12:28:50,248 StorageService.java:2584 - Removing tokens [-9135980046459212380, -9100471967410923634, -9097242662756219549, -8974765285872613713, -895$
INFO [RMI TCP Connection(124)-10.128.1.1] 2019-10-23 12:28:50,317 Gossiper.java:557 - Completing removal of /10.128.1.3
INFO [HANDSHAKE-/10.128.1.3] 2019-10-23 12:31:35,019 OutboundTcpConnection.java:561 - Handshaking version with /10.128.1.3
I am totally clueless about why it's failing.
Can anyone point me in the right direction?
I have made sure that there is no firewall issue, and I am not using SSL for internode communication.
org.apache.cassandra.io.FSReadError: java.io.IOException: Broken pipe
There could be a number of reasons for this (corrupted data, networking problems, schema issues between the two nodes) but basically it means the connection is getting severed and killing off the streaming in progress.
Networking issues are most likely. If you have any networking metrics, try to use those for debugging the connection.
There are a few things you can do to be clever here; the main one is to reduce the volume of streaming you need to do. You can achieve this by:
Reducing the keyspace RF back to 1
Adding the node using auto_bootstrap: true in cassandra.yaml
Re-increasing the RF back to 2
Repairing the data
This will yield the same outcome: you end up with 2 nodes that both contain 100% of the data, but during the node stand-up process you only streamed half of that data. The repairs then run in smaller sessions (smaller units of work) and restore whatever data is still missing, getting you back to 100%.
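As a rough sketch of what those steps look like (assuming a keyspace called my_keyspace using SimpleStrategy — both names are assumptions, substitute your own keyspace and replication strategy):
-- 1. before streaming, drop the RF back to 1
ALTER KEYSPACE my_keyspace WITH replication = {'class': 'SimpleStrategy', 'replication_factor': 1};
-- 2. stand up the new node with auto_bootstrap: true in cassandra.yaml and let it join
-- 3. once it has joined, raise the RF back to 2
ALTER KEYSPACE my_keyspace WITH replication = {'class': 'SimpleStrategy', 'replication_factor': 2};
-- 4. repair so each node pulls the data it is now a replica for, e.g. nodetool repair -full my_keyspace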
On a side note, my advice would be for you to start regularly snapshotting your node, as there appear to be signs of bad health. Running a single node of Cassandra means you're not really protected from data loss; that's why C* is distributed and why a replication_factor of 3 is recommended for most setups.

Java client fails to connect to local Flink cluster

I am trying out a small program with a local Flink cluster, set up according to the instructions here. The sample wordcount program runs fine, but when I attempt to run my own program, it stalls and fails while connecting to the job manager. This is Flink 1.5 with JDK 1.8.
The relevant part of the code is
FlinkPipelineOptions options = PipelineOptionsFactory.as(FlinkPipelineOptions.class);
options.setStreaming(true);
options.setFlinkMaster("localhost:6123");
options.setRunner(FlinkRunner.class);
I start the cluster with start-cluster.sh, and I can see the two processes (job and task managers) are running. The Flink logs don't show much. On the client side, after turning on debug logging, I can see the following:
18:43:20.507 [flink-akka.actor.default-dispatcher-4] INFO akka.remote.Remoting - Remoting started; listening on addresses :[akka.tcp://flink@talonx:38183]
18:43:20.511 [main] INFO org.apache.flink.client.program.StandaloneClusterClient - Actor system started at akka.tcp://flink@talonx:38183
18:43:20.511 [main] INFO org.apache.flink.client.program.StandaloneClusterClient - Submitting job with JobID: dbf63281771465550fd3598b2b67b91f. Waiting for job completion.
Submitting job with JobID: dbf63281771465550fd3598b2b67b91f. Waiting for job completion.
18:43:20.521 [flink-akka.actor.default-dispatcher-4] INFO org.apache.flink.runtime.client.JobSubmissionClientActor - Received SubmitJobAndWait(JobGraph(jobId: dbf63281771465550fd3598b2b67b91f)) but there is no connection to a JobManager yet.
18:43:20.522 [flink-akka.actor.default-dispatcher-4] INFO org.apache.flink.runtime.client.JobSubmissionClientActor - Received job test-talonx-0618131319-b721a69a (dbf63281771465550fd3598b2b67b91f).
18:43:20.523 [flink-akka.actor.default-dispatcher-4] INFO org.apache.flink.runtime.client.JobSubmissionClientActor - Disconnect from JobManager null.
After a while, I get the following exception on the client
19:03:19.396 [main] ERROR org.apache.beam.runners.flink.FlinkRunner - Pipeline execution failed
org.apache.flink.client.program.ProgramInvocationException: The program execution failed: Couldn't retrieve the JobExecutionResult from the JobManager.
at org.apache.flink.client.program.ClusterClient.run(ClusterClient.java:492)
at org.apache.flink.client.program.StandaloneClusterClient.submitJob(StandaloneClusterClient.java:105)
at org.apache.flink.client.program.ClusterClient.run(ClusterClient.java:456)
at org.apache.flink.client.program.ClusterClient.run(ClusterClient.java:449)
at org.apache.flink.streaming.api.environment.RemoteStreamEnvironment.executeRemotely(RemoteStreamEnvironment.java:212)
at org.apache.flink.streaming.api.environment.RemoteStreamEnvironment.execute(RemoteStreamEnvironment.java:176)
at org.apache.beam.runners.flink.FlinkPipelineExecutionEnvironment.executePipeline(FlinkPipelineExecutionEnvironment.java:126)
at org.apache.beam.runners.flink.FlinkRunner.run(FlinkRunner.java:115)
at org.apache.beam.sdk.Pipeline.run(Pipeline.java:311)
at org.apache.beam.sdk.Pipeline.run(Pipeline.java:297)
at Test.main(Test.java:106)
Caused by: org.apache.flink.runtime.client.JobExecutionException: Couldn't retrieve the JobExecutionResult from the JobManager.
at org.apache.flink.runtime.client.JobClient.awaitJobResult(JobClient.java:300)
at org.apache.flink.runtime.client.JobClient.submitJobAndWait(JobClient.java:387)
at org.apache.flink.client.program.ClusterClient.run(ClusterClient.java:481)
... 10 common frames omitted
Caused by: org.apache.flink.runtime.client.JobClientActorConnectionTimeoutException: Lost connection to the JobManager.
at org.apache.flink.runtime.client.JobClientActor.handleMessage(JobClientActor.java:219)
at org.apache.flink.runtime.akka.FlinkUntypedActor.handleLeaderSessionID(FlinkUntypedActor.java:104)
at org.apache.flink.runtime.akka.FlinkUntypedActor.onReceive(FlinkUntypedActor.java:71)
at akka.actor.UntypedActor$$anonfun$receive$1.applyOrElse(UntypedActor.scala:165)
at akka.actor.Actor$class.aroundReceive(Actor.scala:502)
at akka.actor.UntypedActor.aroundReceive(UntypedActor.scala:95)
at akka.actor.ActorCell.receiveMessage(ActorCell.scala:526)
at akka.actor.ActorCell.invoke(ActorCell.scala:495)
at akka.dispatch.Mailbox.processMailbox(Mailbox.scala:257)
at akka.dispatch.Mailbox.run(Mailbox.scala:224)
at akka.dispatch.Mailbox.exec(Mailbox.scala:234)
at scala.concurrent.forkjoin.ForkJoinTask.doExec(ForkJoinTask.java:260)
at scala.concurrent.forkjoin.ForkJoinPool$WorkQueue.runTask(ForkJoinPool.java:1339)
at scala.concurrent.forkjoin.ForkJoinPool.runWorker(ForkJoinPool.java:1979)
at scala.concurrent.forkjoin.ForkJoinWorkerThread.run(ForkJoinWorkerThread.java:107)
What might be missing here?

For SQL Server, sqoop import doesn't work when trying to pass the query from a file

This is the script, sqoop_sql.sh:
query=$(cat ${SQL_SCRIPT})
where_clause=" where dateadded >= '2016-05-01' and dateadded < '2016-06-01' and \$CONDITIONS"
sqoop import -D mapreduce.job.queuename=s_sourcedata \
--connect 'jdbc:sqlserver://connection' \
--compression-codec org.apache.hadoop.io.compress.SnappyCodec \
--username name \
--password pas \
--query "${query}${where_clause}" \
--as-parquetfile \
--split-by dateadded \
--delete-target-dir \
--target-dir prioritypass_history \
-m 1
It doesn't work this way, but if I change the first line to
query="select * FROM smth.[dbo].[tablename]"
it works.
My action looks like this
<action name="history" cred="hv_cred">
<shell xmlns="uri:oozie:shell-action:0.1">
<job-tracker>${JOB_TRACKER}</job-tracker>
<name-node>${NAME_NODE}</name-node>
<exec>sqoop_sql.sh</exec>
<env-var>SQL_SCRIPT=${SQL_SCRIPT_HISTORY}</env-var>
...
<file>${WORKFLOW_APPLICATION_PATH}/bash/sqoop_sql.sh#sqoop_sql.sh</file>
<file>${WORKFLOW_APPLICATION_PATH}/oracle/${SQL_SCRIPT_HISTORY}#${SQL_SCRIPT_HISTORY}</file>
</shell>
<ok to="end"/>
<error to="kill"/>
</action>
The thing is, I used this same code to import data from Oracle, changing only the connection details. My only guess is that Oozie doesn't like the fact that the script is in the oracle folder, but I'm not sure, and I don't know what to change it to if that's the case.
PS
I don't use the sqoop action because there are some libs missing on the cluster and it doesn't work.
Edit
I was wrong. The problem isn't the folder name.
I left a bare shell action and shell script. After running it, it worked; then I killed the coordinator and restarted it, and it didn't work. There were no changes to the code in between. Is it because of some settings, or some error on the cluster? I don't know. Yet.
Log Type: stderr
Log Upload Time: Tue Mar 20 18:52:54 +0300 2018
Log Length: 240
log4j:WARN No appenders could be found for logger (org.apache.hadoop.mapreduce.v2.app.MRAppMaster).
log4j:WARN Please initialize the log4j system properly.
log4j:WARN See http://logging.apache.org/log4j/1.2/faq.html#noconfig for more info.
Log Type: stdout
Log Upload Time: Tue Mar 20 18:52:54 +0300 2018
Log Length: 0
Log Type: syslog
Log Upload Time: Tue Mar 20 18:52:54 +0300 2018
Log Length: 28581
2018-03-20 18:52:33,683 INFO [main] org.apache.hadoop.mapreduce.v2.app.MRAppMaster: Created MRAppMaster for application appattempt_1518685780617_47486_000001
2018-03-20 18:52:33,898 INFO [main] org.apache.hadoop.mapreduce.v2.app.MRAppMaster: Executing with tokens:
2018-03-20 18:52:33,921 WARN [main] org.apache.hadoop.security.token.Token: Cannot find class for token kind HIVE_DELEGATION_TOKEN
2018-03-20 18:52:33,921 WARN [main] org.apache.hadoop.security.token.Token: Cannot find class for token kind HIVE_DELEGATION_TOKEN
Kind: HIVE_DELEGATION_TOKEN, Service: , Ident: 00 0a 6d 6d 69 6b 68 61 79 6c 6f 76 05 6f 6f 7a 69 65 3f 6f 6f 7a 69 65 2f 62 64 61 31 31 6e 6f 64 65 30 34 2e 6d 6f 73 63 6f 77 2e 61 6c 66 61 69 6e 74 72 61 2e 6e 65 74 40 42 44 41 2e 4d 4f 53 43 4f 57 2e 41 4c 46 41 49 4e 54 52 41 2e 4e 45 54 8a 01 62 44 1c a4 42 8a 01 62 68 29 28 42 8e 03 82 07
2018-03-20 18:52:33,921 INFO [main] org.apache.hadoop.mapreduce.v2.app.MRAppMaster: Kind: YARN_AM_RM_TOKEN, Service: , Ident: (org.apache.hadoop.yarn.security.AMRMTokenIdentifier#6bf0219d)
2018-03-20 18:52:33,923 INFO [main] org.apache.hadoop.mapreduce.v2.app.MRAppMaster: Kind: RM_DELEGATION_TOKEN, Service: 172.25.55.232:8032,172.25.55.233:8032, Ident: (owner=user, renewer=yarn, realUser=oozie/host#host, issueDate=1521561150672, maxDate=1522165950672, sequenceNumber=82589, masterKeyId=554)
2018-03-20 18:52:33,924 INFO [main] org.apache.hadoop.mapreduce.v2.app.MRAppMaster: Kind: HDFS_DELEGATION_TOKEN, Service: ha-hdfs:bda11, Ident: (HDFS_DELEGATION_TOKEN token 270591 for user)
2018-03-20 18:52:33,924 INFO [main] org.apache.hadoop.mapreduce.v2.app.MRAppMaster: Kind: MR_DELEGATION_TOKEN, Service: 172.25.55.232:10020, Ident: (owner=user, renewer=yarn, realUser=oozie/host#host, issueDate=1521561150796, maxDate=1522165950796, sequenceNumber=1524, masterKeyId=35)
2018-03-20 18:52:34,414 WARN [main] org.apache.hadoop.util.NativeCodeLoader: Unable to load native-hadoop library for your platform... using builtin-java classes where applicable
2018-03-20 18:52:34,430 WARN [main] org.apache.hadoop.hdfs.shortcircuit.DomainSocketFactory: The short-circuit local reads feature cannot be used because libhadoop cannot be loaded.
2018-03-20 18:52:34,489 WARN [main] org.apache.hadoop.security.UserGroupInformation: PriviledgedActionException as:user (auth:SIMPLE) cause:org.apache.hadoop.ipc.RemoteException(org.apache.hadoop.ipc.StandbyException): Operation category READ is not supported in state standby. Visit https://s.apache.org/sbnn-error
2018-03-20 18:52:34,490 WARN [main] org.apache.hadoop.ipc.Client: Exception encountered while connecting to the server : org.apache.hadoop.ipc.RemoteException(org.apache.hadoop.ipc.StandbyException): Operation category READ is not supported in state standby. Visit https://s.apache.org/sbnn-error
2018-03-20 18:52:34,490 WARN [main] org.apache.hadoop.security.UserGroupInformation: PriviledgedActionException as:user (auth:SIMPLE) cause:org.apache.hadoop.ipc.RemoteException(org.apache.hadoop.ipc.StandbyException): Operation category READ is not supported in state standby. Visit https://s.apache.org/sbnn-error
2018-03-20 18:52:34,542 INFO [main] org.apache.hadoop.mapreduce.v2.app.MRAppMaster: OutputCommitter set in config null
2018-03-20 18:52:34,543 INFO [main] org.apache.hadoop.mapreduce.v2.app.MRAppMaster: OutputCommitter is org.apache.hadoop.mapred.FileOutputCommitter
2018-03-20 18:52:34,642 INFO [main] org.apache.hadoop.yarn.event.AsyncDispatcher: Registering class org.apache.hadoop.mapreduce.jobhistory.EventType for class org.apache.hadoop.mapreduce.jobhistory.JobHistoryEventHandler
2018-03-20 18:52:34,643 INFO [main] org.apache.hadoop.yarn.event.AsyncDispatcher: Registering class org.apache.hadoop.mapreduce.v2.app.job.event.JobEventType for class org.apache.hadoop.mapreduce.v2.app.MRAppMaster$JobEventDispatcher
2018-03-20 18:52:34,644 INFO [main] org.apache.hadoop.yarn.event.AsyncDispatcher: Registering class org.apache.hadoop.mapreduce.v2.app.job.event.TaskEventType for class org.apache.hadoop.mapreduce.v2.app.MRAppMaster$TaskEventDispatcher
2018-03-20 18:52:34,644 INFO [main] org.apache.hadoop.yarn.event.AsyncDispatcher: Registering class org.apache.hadoop.mapreduce.v2.app.job.event.TaskAttemptEventType for class org.apache.hadoop.mapreduce.v2.app.MRAppMaster$TaskAttemptEventDispatcher
2018-03-20 18:52:34,645 INFO [main] org.apache.hadoop.yarn.event.AsyncDispatcher: Registering class org.apache.hadoop.mapreduce.v2.app.commit.CommitterEventType for class org.apache.hadoop.mapreduce.v2.app.commit.CommitterEventHandler
2018-03-20 18:52:34,646 INFO [main] org.apache.hadoop.yarn.event.AsyncDispatcher: Registering class org.apache.hadoop.mapreduce.v2.app.speculate.Speculator$EventType for class org.apache.hadoop.mapreduce.v2.app.MRAppMaster$SpeculatorEventDispatcher
2018-03-20 18:52:34,646 INFO [main] org.apache.hadoop.yarn.event.AsyncDispatcher: Registering class org.apache.hadoop.mapreduce.v2.app.rm.ContainerAllocator$EventType for class org.apache.hadoop.mapreduce.v2.app.MRAppMaster$ContainerAllocatorRouter
2018-03-20 18:52:34,647 INFO [main] org.apache.hadoop.yarn.event.AsyncDispatcher: Registering class org.apache.hadoop.mapreduce.v2.app.launcher.ContainerLauncher$EventType for class org.apache.hadoop.mapreduce.v2.app.MRAppMaster$ContainerLauncherRouter
2018-03-20 18:52:34,697 INFO [main] org.apache.hadoop.mapreduce.v2.jobhistory.JobHistoryUtils: Default file system [hdfs://bda11:8020]
2018-03-20 18:52:34,723 INFO [main] org.apache.hadoop.mapreduce.v2.jobhistory.JobHistoryUtils: Default file system [hdfs://bda11:8020]
2018-03-20 18:52:34,747 INFO [main] org.apache.hadoop.mapreduce.v2.jobhistory.JobHistoryUtils: Default file system [hdfs://bda11:8020]
2018-03-20 18:52:34,755 WARN [main] org.apache.hadoop.security.UserGroupInformation: PriviledgedActionException as:user (auth:SIMPLE) cause:org.apache.hadoop.ipc.RemoteException(org.apache.hadoop.ipc.StandbyException): Operation category READ is not supported in state standby. Visit https://s.apache.org/sbnn-error
2018-03-20 18:52:34,755 WARN [main] org.apache.hadoop.ipc.Client: Exception encountered while connecting to the server : org.apache.hadoop.ipc.RemoteException(org.apache.hadoop.ipc.StandbyException): Operation category READ is not supported in state standby. Visit https://s.apache.org/sbnn-error
2018-03-20 18:52:34,755 WARN [main] org.apache.hadoop.security.UserGroupInformation: PriviledgedActionException as:user (auth:SIMPLE) cause:org.apache.hadoop.ipc.RemoteException(org.apache.hadoop.ipc.StandbyException): Operation category READ is not supported in state standby. Visit https://s.apache.org/sbnn-error
2018-03-20 18:52:34,763 INFO [main] org.apache.hadoop.mapreduce.jobhistory.JobHistoryEventHandler: Emitting job history data to the timeline server is not enabled
2018-03-20 18:52:34,805 INFO [main] org.apache.hadoop.yarn.event.AsyncDispatcher: Registering class org.apache.hadoop.mapreduce.v2.app.job.event.JobFinishEvent$Type for class org.apache.hadoop.mapreduce.v2.app.MRAppMaster$JobFinishEventHandler
2018-03-20 18:52:34,983 WARN [main] org.apache.hadoop.metrics2.impl.MetricsSystemImpl: Metrics system not started: org.apache.commons.configuration.ConfigurationException: Unable to load the configuration from the URL file:/var/run/cloudera-scm-agent/process/22786-yarn-NODEMANAGER/hadoop-metrics2.properties
2018-03-20 18:52:35,035 INFO [main] org.apache.hadoop.mapreduce.v2.app.job.impl.JobImpl: Adding job token for job_1518685780617_47486 to jobTokenSecretManager
2018-03-20 18:52:35,133 INFO [main] org.apache.hadoop.mapreduce.v2.app.job.impl.JobImpl: Not uberizing job_1518685780617_47486 because: not enabled;
2018-03-20 18:52:35,156 INFO [main] org.apache.hadoop.mapreduce.v2.app.job.impl.JobImpl: Input size for job job_1518685780617_47486 = 0. Number of splits = 1
2018-03-20 18:52:35,156 INFO [main] org.apache.hadoop.mapreduce.v2.app.job.impl.JobImpl: Number of reduces for job job_1518685780617_47486 = 0
2018-03-20 18:52:35,156 INFO [main] org.apache.hadoop.mapreduce.v2.app.job.impl.JobImpl: job_1518685780617_47486Job Transitioned from NEW to INITED
2018-03-20 18:52:35,157 INFO [main] org.apache.hadoop.mapreduce.v2.app.MRAppMaster: MRAppMaster launching normal, non-uberized, multi-container job job_1518685780617_47486.
2018-03-20 18:52:35,183 INFO [main] org.apache.hadoop.ipc.CallQueueManager: Using callQueue class java.util.concurrent.LinkedBlockingQueue
2018-03-20 18:52:35,190 INFO [Socket Reader #1 for port 38335] org.apache.hadoop.ipc.Server: Starting Socket Reader #1 for port 38335
2018-03-20 18:52:35,204 INFO [main] org.apache.hadoop.yarn.factories.impl.pb.RpcServerFactoryPBImpl: Adding protocol org.apache.hadoop.mapreduce.v2.api.MRClientProtocolPB to the server
2018-03-20 18:52:35,229 INFO [IPC Server Responder] org.apache.hadoop.ipc.Server: IPC Server Responder: starting
2018-03-20 18:52:35,229 INFO [IPC Server listener on 38335] org.apache.hadoop.ipc.Server: IPC Server listener on 38335: starting
2018-03-20 18:52:35,230 INFO [main] org.apache.hadoop.mapreduce.v2.app.client.MRClientService: Instantiated MRClientService at host/host:38335
2018-03-20 18:52:35,288 INFO [main] org.mortbay.log: Logging to org.slf4j.impl.Log4jLoggerAdapter(org.mortbay.log) via org.mortbay.log.Slf4jLog
2018-03-20 18:52:35,292 INFO [main] org.apache.hadoop.security.authentication.server.AuthenticationFilter: Unable to initialize FileSignerSecretProvider, falling back to use random secrets.
2018-03-20 18:52:35,297 INFO [main] org.apache.hadoop.http.HttpRequestLog: Http request log for http.requests.mapreduce is not defined
2018-03-20 18:52:35,306 INFO [main] org.apache.hadoop.http.HttpServer2: Added global filter 'safety' (class=org.apache.hadoop.http.HttpServer2$QuotingInputFilter)
2018-03-20 18:52:35,331 INFO [main] org.apache.hadoop.http.HttpServer2: Added filter AM_PROXY_FILTER (class=org.apache.hadoop.yarn.server.webproxy.amfilter.AmIpFilter) to context mapreduce
2018-03-20 18:52:35,331 INFO [main] org.apache.hadoop.http.HttpServer2: Added filter AM_PROXY_FILTER (class=org.apache.hadoop.yarn.server.webproxy.amfilter.AmIpFilter) to context static
2018-03-20 18:52:35,334 INFO [main] org.apache.hadoop.http.HttpServer2: adding path spec: /mapreduce/*
2018-03-20 18:52:35,334 INFO [main] org.apache.hadoop.http.HttpServer2: adding path spec: /ws/*
2018-03-20 18:52:35,342 INFO [main] org.apache.hadoop.http.HttpServer2: Jetty bound to port 24744
2018-03-20 18:52:35,342 INFO [main] org.mortbay.log: jetty-6.1.26.cloudera.4
2018-03-20 18:52:35,370 INFO [main] org.mortbay.log: Extract jar:file:/opt/cloudera/parcels/CDH-5.7.0-1.cdh5.7.0.p1464.1349/jars/hadoop-yarn-common-2.6.0-cdh5.7.0.jar!/webapps/mapreduce to ./tmp/Jetty_0_0_0_0_24744_mapreduce____z2u350/webapp
2018-03-20 18:52:35,578 INFO [main] org.mortbay.log: Started HttpServer2$SelectChannelConnectorWithSafeStartup#0.0.0.0:24744
2018-03-20 18:52:35,578 INFO [main] org.apache.hadoop.yarn.webapp.WebApps: Web app /mapreduce started at 24744
2018-03-20 18:52:35,859 INFO [main] org.apache.hadoop.yarn.webapp.WebApps: Registered webapp guice modules
2018-03-20 18:52:35,864 INFO [main] org.apache.hadoop.ipc.CallQueueManager: Using callQueue class java.util.concurrent.LinkedBlockingQueue
2018-03-20 18:52:35,865 INFO [Socket Reader #1 for port 27207] org.apache.hadoop.ipc.Server: Starting Socket Reader #1 for port 27207
2018-03-20 18:52:35,897 INFO [IPC Server Responder] org.apache.hadoop.ipc.Server: IPC Server Responder: starting
2018-03-20 18:52:35,897 INFO [IPC Server listener on 27207] org.apache.hadoop.ipc.Server: IPC Server listener on 27207: starting
2018-03-20 18:52:35,918 INFO [main] org.apache.hadoop.mapreduce.v2.app.rm.RMContainerRequestor: nodeBlacklistingEnabled:true
2018-03-20 18:52:35,918 INFO [main] org.apache.hadoop.mapreduce.v2.app.rm.RMContainerRequestor: maxTaskFailuresPerNode is 3
2018-03-20 18:52:35,918 INFO [main] org.apache.hadoop.mapreduce.v2.app.rm.RMContainerRequestor: blacklistDisablePercent is 33
2018-03-20 18:52:35,986 INFO [main] org.apache.hadoop.yarn.client.ConfiguredRMFailoverProxyProvider: Failing over to rm340
2018-03-20 18:52:36,010 INFO [main] org.apache.hadoop.mapreduce.v2.app.rm.RMContainerAllocator: maxContainerCapability: <memory:29304, vCores:19>
2018-03-20 18:52:36,010 INFO [main] org.apache.hadoop.mapreduce.v2.app.rm.RMContainerAllocator: queue: root.s_sourcedata
2018-03-20 18:52:36,014 INFO [main] org.apache.hadoop.mapreduce.v2.app.launcher.ContainerLauncherImpl: Upper limit on the thread pool size is 500
2018-03-20 18:52:36,014 INFO [main] org.apache.hadoop.mapreduce.v2.app.launcher.ContainerLauncherImpl: The thread pool initial size is 10
2018-03-20 18:52:36,016 INFO [main] org.apache.hadoop.yarn.client.api.impl.ContainerManagementProtocolProxy: yarn.client.max-cached-nodemanagers-proxies : 0
2018-03-20 18:52:36,023 INFO [AsyncDispatcher event handler] org.apache.hadoop.mapreduce.v2.app.job.impl.JobImpl: job_1518685780617_47486Job Transitioned from INITED to SETUP
2018-03-20 18:52:36,024 INFO [CommitterEvent Processor #0] org.apache.hadoop.mapreduce.v2.app.commit.CommitterEventHandler: Processing the event EventType: JOB_SETUP
2018-03-20 18:52:36,027 INFO [CommitterEvent Processor #0] org.apache.hadoop.mapreduce.lib.output.FileOutputCommitter: File Output Committer Algorithm version is 1
2018-03-20 18:52:36,034 INFO [AsyncDispatcher event handler] org.apache.hadoop.mapreduce.v2.app.job.impl.JobImpl: job_1518685780617_47486Job Transitioned from SETUP to RUNNING
2018-03-20 18:52:36,064 INFO [AsyncDispatcher event handler] org.apache.hadoop.mapreduce.v2.app.job.impl.TaskImpl: task_1518685780617_47486_m_000000 Task Transitioned from NEW to SCHEDULED
2018-03-20 18:52:36,066 INFO [AsyncDispatcher event handler] org.apache.hadoop.mapreduce.v2.app.job.impl.TaskAttemptImpl: attempt_1518685780617_47486_m_000000_0 TaskAttempt Transitioned from NEW to UNASSIGNED
2018-03-20 18:52:36,067 INFO [Thread-54] org.apache.hadoop.mapreduce.v2.app.rm.RMContainerAllocator: mapResourceRequest:<memory:2048, vCores:1>
2018-03-20 18:52:36,090 INFO [eventHandlingThread] org.apache.hadoop.mapreduce.jobhistory.JobHistoryEventHandler: Event Writer setup for JobId: job_1518685780617_47486, File: hdfs://bda11:8020/user/user/.staging/job_1518685780617_47486/job_1518685780617_47486_1.jhist
2018-03-20 18:52:36,298 INFO [eventHandlingThread] org.apache.hadoop.mapreduce.v2.jobhistory.JobHistoryUtils: Default file system [hdfs://bda11:8020]
2018-03-20 18:52:37,013 INFO [RMCommunicator Allocator] org.apache.hadoop.mapreduce.v2.app.rm.RMContainerAllocator: Before Scheduling: PendingReds:0 ScheduledMaps:1 ScheduledReds:0 AssignedMaps:0 AssignedReds:0 CompletedMaps:0 CompletedReds:0 ContAlloc:0 ContRel:0 HostLocal:0 RackLocal:0
2018-03-20 18:52:37,047 INFO [RMCommunicator Allocator] org.apache.hadoop.mapreduce.v2.app.rm.RMContainerRequestor: getResources() for application_1518685780617_47486: ask=1 release= 0 newContainers=0 finishedContainers=0 resourcelimit=<memory:91136, vCores:111> knownNMs=6
2018-03-20 18:52:38,055 INFO [RMCommunicator Allocator] org.apache.hadoop.mapreduce.v2.app.rm.RMContainerAllocator: Got allocated containers 1
2018-03-20 18:52:38,089 INFO [RMCommunicator Allocator] org.apache.hadoop.mapreduce.v2.app.rm.RMContainerAllocator: Assigned container container_e16_1518685780617_47486_01_000002 to attempt_1518685780617_47486_m_000000_0
2018-03-20 18:52:38,090 INFO [RMCommunicator Allocator] org.apache.hadoop.mapreduce.v2.app.rm.RMContainerAllocator: After Scheduling: PendingReds:0 ScheduledMaps:0 ScheduledReds:0 AssignedMaps:1 AssignedReds:0 CompletedMaps:0 CompletedReds:0 ContAlloc:1 ContRel:0 HostLocal:0 RackLocal:0
2018-03-20 18:52:38,136 INFO [AsyncDispatcher event handler] org.apache.hadoop.mapreduce.v2.app.job.impl.TaskAttemptImpl: Job jar is not present. Not adding any jar to the list of resources.
2018-03-20 18:52:38,204 INFO [AsyncDispatcher event handler] org.apache.hadoop.mapreduce.v2.app.job.impl.TaskAttemptImpl: The job-conf file on the remote FS is /user/user/.staging/job_1518685780617_47486/job.xml
2018-03-20 18:52:38,212 INFO [AsyncDispatcher event handler] org.apache.hadoop.mapreduce.v2.app.job.impl.TaskAttemptImpl: Adding #4 tokens and #1 secret keys for NM use for launching container
2018-03-20 18:52:38,212 INFO [AsyncDispatcher event handler] org.apache.hadoop.mapreduce.v2.app.job.impl.TaskAttemptImpl: Size of containertokens_dob is 5
2018-03-20 18:52:38,212 INFO [AsyncDispatcher event handler] org.apache.hadoop.mapreduce.v2.app.job.impl.TaskAttemptImpl: Putting shuffle token in serviceData
2018-03-20 18:52:38,244 INFO [AsyncDispatcher event handler] org.apache.hadoop.mapreduce.v2.app.job.impl.TaskAttemptImpl: attempt_1518685780617_47486_m_000000_0 TaskAttempt Transitioned from UNASSIGNED to ASSIGNED
2018-03-20 18:52:38,248 INFO [ContainerLauncher #0] org.apache.hadoop.mapreduce.v2.app.launcher.ContainerLauncherImpl: Processing the event EventType: CONTAINER_REMOTE_LAUNCH for container container_e16_1518685780617_47486_01_000002 taskAttempt attempt_1518685780617_47486_m_000000_0
2018-03-20 18:52:38,249 INFO [ContainerLauncher #0] org.apache.hadoop.mapreduce.v2.app.launcher.ContainerLauncherImpl: Launching attempt_1518685780617_47486_m_000000_0
2018-03-20 18:52:38,250 INFO [ContainerLauncher #0] org.apache.hadoop.yarn.client.api.impl.ContainerManagementProtocolProxy: Opening proxy : host:8041
2018-03-20 18:52:38,297 INFO [ContainerLauncher #0] org.apache.hadoop.mapreduce.v2.app.launcher.ContainerLauncherImpl: Shuffle port returned by ContainerManager for attempt_1518685780617_47486_m_000000_0 : 13562
2018-03-20 18:52:38,299 INFO [AsyncDispatcher event handler] org.apache.hadoop.mapreduce.v2.app.job.impl.TaskAttemptImpl: TaskAttempt: [attempt_1518685780617_47486_m_000000_0] using containerId: [container_e16_1518685780617_47486_01_000002 on NM: [host:8041]
2018-03-20 18:52:38,302 INFO [AsyncDispatcher event handler] org.apache.hadoop.mapreduce.v2.app.job.impl.TaskAttemptImpl: attempt_1518685780617_47486_m_000000_0 TaskAttempt Transitioned from ASSIGNED to RUNNING
2018-03-20 18:52:38,302 INFO [AsyncDispatcher event handler] org.apache.hadoop.mapreduce.v2.app.job.impl.TaskImpl: task_1518685780617_47486_m_000000 Task Transitioned from SCHEDULED to RUNNING
2018-03-20 18:52:39,092 INFO [RMCommunicator Allocator] org.apache.hadoop.mapreduce.v2.app.rm.RMContainerRequestor: getResources() for application_1518685780617_47486: ask=1 release= 0 newContainers=0 finishedContainers=0 resourcelimit=<memory:89088, vCores:110> knownNMs=6
2018-03-20 18:52:41,461 INFO [Socket Reader #1 for port 27207] SecurityLogger.org.apache.hadoop.ipc.Server: Auth successful for job_1518685780617_47486 (auth:SIMPLE)
2018-03-20 18:52:41,476 INFO [Socket Reader #1 for port 27207] SecurityLogger.org.apache.hadoop.security.authorize.ServiceAuthorizationManager: Authorization successful for job_1518685780617_47486 (auth:TOKEN) for protocol=interface org.apache.hadoop.mapred.TaskUmbilicalProtocol
2018-03-20 18:52:41,483 INFO [IPC Server handler 0 on 27207] org.apache.hadoop.mapred.TaskAttemptListenerImpl: JVM with ID : jvm_1518685780617_47486_m_17592186044418 asked for a task
2018-03-20 18:52:41,483 INFO [IPC Server handler 0 on 27207] org.apache.hadoop.mapred.TaskAttemptListenerImpl: JVM with ID: jvm_1518685780617_47486_m_17592186044418 given task: attempt_1518685780617_47486_m_000000_0
2018-03-20 18:52:46,454 INFO [IPC Server handler 5 on 27207] org.apache.hadoop.mapred.TaskAttemptListenerImpl: Progress of TaskAttempt attempt_1518685780617_47486_m_000000_0 is : 0.0
2018-03-20 18:52:46,529 INFO [IPC Server handler 4 on 27207] org.apache.hadoop.mapred.TaskAttemptListenerImpl: Commit-pending state update from attempt_1518685780617_47486_m_000000_0
2018-03-20 18:52:46,529 INFO [AsyncDispatcher event handler] org.apache.hadoop.mapreduce.v2.app.job.impl.TaskAttemptImpl: attempt_1518685780617_47486_m_000000_0 TaskAttempt Transitioned from RUNNING to COMMIT_PENDING
2018-03-20 18:52:46,530 INFO [AsyncDispatcher event handler] org.apache.hadoop.mapreduce.v2.app.job.impl.TaskImpl: attempt_1518685780617_47486_m_000000_0 given a go for committing the task output.
2018-03-20 18:52:46,530 INFO [IPC Server handler 1 on 27207] org.apache.hadoop.mapred.TaskAttemptListenerImpl: Commit go/no-go request from attempt_1518685780617_47486_m_000000_0
2018-03-20 18:52:46,531 INFO [IPC Server handler 1 on 27207] org.apache.hadoop.mapreduce.v2.app.job.impl.TaskImpl: Result of canCommit for attempt_1518685780617_47486_m_000000_0:true
2018-03-20 18:52:46,593 INFO [IPC Server handler 3 on 27207] org.apache.hadoop.mapred.TaskAttemptListenerImpl: Progress of TaskAttempt attempt_1518685780617_47486_m_000000_0 is : 1.0
2018-03-20 18:52:46,599 INFO [IPC Server handler 2 on 27207] org.apache.hadoop.mapred.TaskAttemptListenerImpl: Done acknowledgement from attempt_1518685780617_47486_m_000000_0
2018-03-20 18:52:46,601 INFO [AsyncDispatcher event handler] org.apache.hadoop.mapreduce.v2.app.job.impl.TaskAttemptImpl: attempt_1518685780617_47486_m_000000_0 TaskAttempt Transitioned from COMMIT_PENDING to SUCCESS_FINISHING_CONTAINER
2018-03-20 18:52:46,609 INFO [AsyncDispatcher event handler] org.apache.hadoop.mapreduce.v2.app.job.impl.TaskImpl: Task succeeded with attempt attempt_1518685780617_47486_m_000000_0
2018-03-20 18:52:46,610 INFO [AsyncDispatcher event handler] org.apache.hadoop.mapreduce.v2.app.job.impl.TaskImpl: task_1518685780617_47486_m_000000 Task Transitioned from RUNNING to SUCCEEDED
2018-03-20 18:52:46,611 INFO [AsyncDispatcher event handler] org.apache.hadoop.mapreduce.v2.app.job.impl.JobImpl: Num completed Tasks: 1
2018-03-20 18:52:46,612 INFO [AsyncDispatcher event handler] org.apache.hadoop.mapreduce.v2.app.job.impl.JobImpl: job_1518685780617_47486Job Transitioned from RUNNING to COMMITTING
2018-03-20 18:52:46,612 INFO [CommitterEvent Processor #1] org.apache.hadoop.mapreduce.v2.app.commit.CommitterEventHandler: Processing the event EventType: JOB_COMMIT
2018-03-20 18:52:46,641 INFO [AsyncDispatcher event handler] org.apache.hadoop.mapreduce.v2.app.job.impl.JobImpl: Calling handler for JobFinishedEvent
2018-03-20 18:52:46,642 INFO [AsyncDispatcher event handler] org.apache.hadoop.mapreduce.v2.app.job.impl.JobImpl: job_1518685780617_47486Job Transitioned from COMMITTING to SUCCEEDED
2018-03-20 18:52:46,643 INFO [Thread-73] org.apache.hadoop.mapreduce.v2.app.MRAppMaster: We are finishing cleanly so this is the last retry
2018-03-20 18:52:46,643 INFO [Thread-73] org.apache.hadoop.mapreduce.v2.app.MRAppMaster: Notify RMCommunicator isAMLastRetry: true
2018-03-20 18:52:46,643 INFO [Thread-73] org.apache.hadoop.mapreduce.v2.app.rm.RMContainerAllocator: RMCommunicator notified that shouldUnregistered is: true
2018-03-20 18:52:46,643 INFO [Thread-73] org.apache.hadoop.mapreduce.v2.app.MRAppMaster: Notify JHEH isAMLastRetry: true
2018-03-20 18:52:46,643 INFO [Thread-73] org.apache.hadoop.mapreduce.jobhistory.JobHistoryEventHandler: JobHistoryEventHandler notified that forceJobCompletion is true
2018-03-20 18:52:46,643 INFO [Thread-73] org.apache.hadoop.mapreduce.v2.app.MRAppMaster: Calling stop for all the services
2018-03-20 18:52:46,643 INFO [Thread-73] org.apache.hadoop.mapreduce.jobhistory.JobHistoryEventHandler: Stopping JobHistoryEventHandler. Size of the outstanding queue size is 0
2018-03-20 18:52:46,668 INFO [eventHandlingThread] org.apache.hadoop.mapreduce.jobhistory.JobHistoryEventHandler: Copying hdfs://bda11:8020/user/user/.staging/job_1518685780617_47486/job_1518685780617_47486_1.jhist to hdfs://bda11:8020/user/history/done_intermediate/user/job_1518685780617_47486-1521561150808-user-oozie%3Alauncher%3AT%3Dshell%3AW%3Dserver_import%3AA%3Dhistory%3AI-1521561166640-1-0-SUCCEEDED-root.s_sourcedata-1521561156018.jhist_tmp
2018-03-20 18:52:46,705 INFO [eventHandlingThread] org.apache.hadoop.mapreduce.jobhistory.JobHistoryEventHandler: Moved tmp to done: hdfs://bda11:8020/user/history/done_intermediate/user/job_1518685780617_47486-1521561150808-user-oozie%3Alauncher%3AT%3Dshell%3AW%3Dserver_import%3AA%3Dhistory%3AI-1521561166640-1-0-SUCCEEDED-root.s_sourcedata-1521561156018.jhist_tmp to hdfs://bda11:8020/user/history/done_intermediate/user/job_1518685780617_47486-1521561150808-user-oozie%3Alauncher%3AT%3Dshell%3AW%3Dserver_import%3AA%3Dhistory%3AI-1521561166640-1-0-SUCCEEDED-root.s_sourcedata-1521561156018.jhist
2018-03-20 18:52:46,706 INFO [Thread-73] org.apache.hadoop.mapreduce.jobhistory.JobHistoryEventHandler: Stopped JobHistoryEventHandler. super.stop()
2018-03-20 18:52:46,706 INFO [Thread-73] org.apache.hadoop.mapreduce.v2.app.launcher.ContainerLauncherImpl: KILLING attempt_1518685780617_47486_m_000000_0
2018-03-20 18:52:46,706 INFO [Thread-73] org.apache.hadoop.yarn.client.api.impl.ContainerManagementProtocolProxy: Opening proxy : host:8041
2018-03-20 18:52:46,716 INFO [AsyncDispatcher event handler] org.apache.hadoop.mapreduce.v2.app.job.impl.TaskAttemptImpl: attempt_1518685780617_47486_m_000000_0 TaskAttempt Transitioned from SUCCESS_FINISHING_CONTAINER to SUCCEEDED
2018-03-20 18:52:46,717 INFO [Thread-73] org.apache.hadoop.mapreduce.v2.app.rm.RMContainerAllocator: Setting job diagnostics to
2018-03-20 18:52:46,718 INFO [Thread-73] org.apache.hadoop.mapreduce.v2.app.rm.RMContainerAllocator: History url is http://host:19888/jobhistory/job/job_1518685780617_47486
2018-03-20 18:52:46,722 INFO [Thread-73] org.apache.hadoop.mapreduce.v2.app.rm.RMContainerAllocator: Waiting for application to be successfully unregistered.
2018-03-20 18:52:47,723 INFO [Thread-73] org.apache.hadoop.mapreduce.v2.app.rm.RMContainerAllocator: Final Stats: PendingReds:0 ScheduledMaps:0 ScheduledReds:0 AssignedMaps:1 AssignedReds:0 CompletedMaps:0 CompletedReds:0 ContAlloc:1 ContRel:0 HostLocal:0 RackLocal:0
2018-03-20 18:52:47,724 INFO [Thread-73] org.apache.hadoop.mapreduce.v2.app.MRAppMaster: Deleting staging directory hdfs://bda11 /user/user/.staging/job_1518685780617_47486
2018-03-20 18:52:47,727 INFO [Thread-73] org.apache.hadoop.ipc.Server: Stopping server on 27207
2018-03-20 18:52:47,727 INFO [IPC Server listener on 27207] org.apache.hadoop.ipc.Server: Stopping IPC Server listener on 27207
2018-03-20 18:52:47,728 INFO [IPC Server Responder] org.apache.hadoop.ipc.Server: Stopping IPC Server Responder
2018-03-20 18:52:47,728 INFO [TaskHeartbeatHandler PingChecker] org.apache.hadoop.mapreduce.v2.app.TaskHeartbeatHandler: TaskHeartbeatHandler thread interrupted
2018-03-20 18:52:47,729 INFO [Ping Checker] org.apache.hadoop.yarn.util.AbstractLivelinessMonitor: TaskAttemptFinishingMonitor thread interrupted

Yokozuna shutting down and taking Riak with it - can't seem to find why

We're currently experiencing an issue on a 10-node cluster, whereby after approximately a day of running, 3 nodes drop out (always a random 3).
Riak version: 2.1.4
10 VMs with 10 GB RAM each, running Oracle Linux 7.3
Java version:
[riak#pp2xria01trd001 riak$] java -version
openjdk version "1.8.0_121"
OpenJDK Runtime Environment (build 1.8.0_121-b13)
OpenJDK 64-Bit Server VM (build 25.121-b13, mixed mode)
Our usual Riak guy is on holiday at the moment, so we don't have much resource to look into this. Any help or guidance on where to start looking would be greatly appreciated.
Crash dump details:
Slogan: Kernel pid terminated (application_controller) ({application_terminated,yokozuna,shutdown})
System version: Erlang R16B02_basho10 (erts-5.10.3) [source] [64-bit] [smp:2:2] [async-threads:64] [hipe] [kernel-poll:true] [frame-pointer]
There is not much in solr.log to explain why:
2017-04-06 21:04:13,958 [INFO] <qtp1924582348-828>#LogUpdateProcessorFactory.java:198 [marketblueprints_index] webapp=/internal_solr path=/update params={} {} 0 0
2017-04-06 21:04:18,567 [INFO] <qtp1924582348-855>#SolrDispatchFilter.java:732 [admin] webapp=null path=/admin/cores params={action=STATUS&wt=json} status=0 QTime=2
2017-04-06 21:04:23,573 [INFO] <qtp1924582348-1161>#SolrDispatchFilter.java:732 [admin] webapp=null path=/admin/cores params={action=STATUS&wt=json} status=0 QTime=2
2017-04-06 21:04:28,578 [INFO] <qtp1924582348-865>#SolrDispatchFilter.java:732 [admin] webapp=null path=/admin/cores params={action=STATUS&wt=json} status=0 QTime=2
2017-04-06 21:04:33,584 [INFO] <qtp1924582348-848>#SolrDispatchFilter.java:732 [admin] webapp=null path=/admin/cores params={action=STATUS&wt=json} status=0 QTime=2
2017-04-06 21:04:38,589 [INFO] <qtp1924582348-641>#SolrDispatchFilter.java:732 [admin] webapp=null path=/admin/cores params={action=STATUS&wt=json} status=0 QTime=2
2017-04-06 21:04:54,242 [INFO] <Thread-1>#Monitor.java:41 Yokozuna has exited - shutting down Solr
2017-04-06 21:04:55,219 [INFO] <Thread-2>#Server.java:320 Graceful shutdown SocketConnector#0.0.0.0:8093
2017-04-06 21:04:56,027 [INFO] <Thread-2>#Server.java:329 Graceful shutdown o.e.j.w.WebAppContext{/internal_solr,file:/var/lib/riak/yz_temp/solr-webapp/webapp/},/usr/lib64/riak/lib/yokozuna-2.1.7-0-g6cf80ad/priv/solr/webapps/solr.war
2017-04-06 21:04:59,288 [INFO] <Thread-2>#CoreContainer.java:314 Shutting down CoreContainer instance=1916575798
2017-04-06 21:04:59,710 [INFO] <Thread-2>#SolrCore.java:1040 [feed_mapping_index] CLOSING SolrCore org.apache.solr.core.SolrCore#78acc5b
However, after some of the merge processes in solr.log, we see the following (which I suspect is preventing the supervisor from restarting it a second time, and hence stopping Riak):
2017-04-06 21:05:13,546 [INFO] <Thread-2>#CachingDirectoryFactory.java:305 Closing directory: /var/lib/riak/yz/endpoint_mappings_index/data
2017-04-06 21:05:13,547 [INFO] <Thread-2>#CachingDirectoryFactory.java:236 looking to close /var/lib/riak/yz/endpoint_mappings_index/data/index [CachedDir<<refCount=0;path=/var/lib/riak/yz/endpoint_mappings_index/data/index;done=false>>]
2017-04-06 21:05:13,547 [INFO] <Thread-2>#CachingDirectoryFactory.java:305 Closing directory: /var/lib/riak/yz/endpoint_mappings_index/data/index
2017-04-06 21:05:14,657 [INFO] <Thread-2>#ContextHandler.java:832 stopped o.e.j.w.WebAppContext{/internal_solr,file:/var/lib/riak/yz_temp/solr-webapp/webapp/},/usr/lib64/riak/lib/yokozuna-2.1.7-0-g6cf80ad/priv/solr/webapps/solr.war
2017-04-06 21:05:15,298 [WARN] <Thread-2>#QueuedThreadPool.java:145 79 threads could not be stopped
Erlang.log contains:
2017-04-06 21:04:54.193 [error] <0.5934.108> gen_server yz_solr_proc terminated with reason: {timeout,{gen_server,call,[<0.1306.0>,{spawn_connection,{url,"http://localhost:8093/internal_solr/admin/cores?action=STATUS&wt=json","localhost",8093,undefined,undefined,"/internal_solr/admin/cores?action=STATUS&wt=json",http,hostname},100,1,{[],false},[]}]}}
2017-04-06 21:04:54.198 [error] <0.5934.108> CRASH REPORT Process yz_solr_proc with 0 neighbours exited with reason: {timeout,{gen_server,call,[<0.1306.0>,{spawn_connection,{url,"http://localhost:8093/internal_solr/admin/cores?action=STATUS&wt=json","localhost",8093,undefined,undefined,"/internal_solr/admin/cores?action=STATUS&wt=json",http,hostname},100,1,{[],false},[]}]}} in gen_server:terminate/6 line 744
2017-04-06 21:04:54.201 [error] <0.1150.0> Supervisor yz_solr_sup had child yz_solr_proc started with yz_solr_proc:start_link("/var/lib/riak/yz", "/var/lib/riak/yz_temp", 8093, 8985) at <0.5934.108> exit with reason {timeout,{gen_server,call,[<0.1306.0>,{spawn_connection,{url,"http://localhost:8093/internal_solr/admin/cores?action=STATUS&wt=json","localhost",8093,undefined,undefined,"/internal_solr/admin/cores?action=STATUS&wt=json",http,hostname},100,1,{[],false},[]}]}} in context child_terminated
2017-04-06 21:04:57.422 [info] <0.1102.0>#riak_ensemble_peer:leading:631 {{kv,1141798154164767904846628775559596109106197299200,3,1141798154164767904846628775559596109106197299200},'riak#pp2xria01trd001.pp2.williamhill.plc'}: Leading
2017-04-06 21:04:57.422 [info] <0.1090.0>#riak_ensemble_peer:leading:631 {{kv,685078892498860742907977265335757665463718379520,3,685078892498860742907977265335757665463718379520},'riak#pp2xria01trd001.pp2.williamhill.plc'}: Leading
2017-04-06 21:04:57.780 [info] <0.1072.0>#riak_ensemble_peer:leading:631 {{kv,0,3,0},'riak#pp2xria01trd001.pp2.williamhill.plc'}: Leading
2017-04-06 21:05:01.432 [info] <0.8030.232>#yz_solr_proc:init:119 Starting solr: "/usr/bin/riak/java" ["-Djava.awt.headless=true","-Djetty.home=/usr/lib64/riak/lib/yokozuna-2.1.7-0-g6cf80ad/priv/solr","-Djetty.temp=/var/lib/riak/yz_temp","-Djetty.port=8093","-Dsolr.solr.home=/var/lib/riak/yz","-DhostContext=/internal_solr","-cp","/usr/lib64/riak/lib/yokozuna-2.1.7-0-g6cf80ad/priv/solr/start.jar","-Dlog4j.configuration=file:///etc/riak/solr-log4j.properties","-Dyz.lib.dir=/usr/lib64/riak/lib/yokozuna-2.1.7-0-g6cf80ad/priv/java_lib","-d64","-Xms4g","-Xmx4g","-XX:+UseStringCache","-XX:+UseCompressedOops","-Dcom.sun.management.jmxremote.port=8985","-Dcom.sun.management.jmxremote.authenticate=false","-Dcom.sun.management.jmxremote.ssl=false","org.eclipse.jetty.start.Main"]
2017-04-06 21:05:01.483 [info] <0.1108.0>#riak_ensemble_peer:leading:631 {{kv,1370157784997721485815954530671515330927436759040,3,1370157784997721485815954530671515330927436759040},'riak#pp2xria01trd001.pp2.williamhill.plc'}: Leading
2017-04-06 21:05:02.032 [info] <0.8030.232>#yz_solr_proc:handle_info:184 solr stdout/err: OpenJDK 64-Bit Server VM warning: ignoring option UseSplitVerifier; support was removed in 8.0
OpenJDK 64-Bit Server VM warning: ignoring option UseStringCache; support was removed in 8.0
2017-04-06 21:05:04.212 [info] <0.1110.0>#riak_ensemble_peer:leading:631 {{kv,1415829711164312202009819681693899175291684651008,3,0},'riak#pp2xria01trd001.pp2.williamhill.plc'}: Leading
2017-04-06 21:05:10.798 [info] <0.1096.0>#riak_ensemble_peer:leading:631 {{kv,913438523331814323877303020447676887284957839360,3,913438523331814323877303020447676887284957839360},'riak#pp2xria01trd001.pp2.williamhill.plc'}: Leading
2017-04-06 21:05:17.001 [info] <0.8030.232>#yz_solr_proc:handle_info:184 solr stdout/err: Error: Exception thrown by the agent : java.rmi.server.ExportException: Port already in use: 8985; nested exception is:
java.net.BindException: Address already in use (Bind failed)
2017-04-06 21:05:17.964 [error] <0.8030.232> gen_server yz_solr_proc terminated with reason: {"solr OS process exited",1}
2017-04-06 21:05:17.964 [error] <0.8030.232> CRASH REPORT Process yz_solr_proc with 0 neighbours exited with reason: {"solr OS process exited",1} in gen_server:terminate/6 line 744
2017-04-06 21:05:17.964 [error] <0.1150.0> Supervisor yz_solr_sup had child yz_solr_proc started with yz_solr_proc:start_link("/var/lib/riak/yz", "/var/lib/riak/yz_temp", 8093, 8985) at <0.8030.232> exit with reason {"solr OS process exited",1} in context child_terminated
2017-04-06 21:05:17.964 [error] <0.1150.0> Supervisor yz_solr_sup had child yz_solr_proc started with yz_solr_proc:start_link("/var/lib/riak/yz", "/var/lib/riak/yz_temp", 8093, 8985) at <0.8030.232> exit with reason reached_max_restart_intensity in context shutdown
2017-04-06 21:05:17.964 [error] <0.1119.0> Supervisor yz_sup had child yz_solr_sup started with yz_solr_sup:start_link() at <0.1150.0> exit with reason shutdown in context child_terminated
2017-04-06 21:05:17.964 [error] <0.1119.0> Supervisor yz_sup had child yz_solr_sup started with yz_solr_sup:start_link() at <0.1150.0> exit with reason reached_max_restart_intensity in context shutdown
2017-04-06 21:05:23.072 [error] <0.1551.0> Supervisor yz_index_hashtree_sup had child ignored started with yz_index_hashtree:start_link() at undefined exit with reason killed in context shutdown_error
2017-04-06 21:05:24.353 [info] <0.745.0>#yz_app:prep_stop:74 Stopping application yokozuna.
2017-04-06 21:05:27.582 [error] <0.745.0>#yz_app:prep_stop:82 Stopping application yokozuna - exit:{noproc,{gen_server,call,[yz_solrq_drain_mgr,{drain,[]},infinity]}}.
2017-04-06 21:05:27.582 [info] <0.745.0>#yz_app:stop:88 Stopped application yokozuna.
2017-04-06 21:05:27.940 [info] <0.7.0> Application yokozuna exited with reason: shutdown
2017-04-06 21:05:28.165 [info] <0.431.0>#riak_kv_app:prep_stop:228 Stopping application riak_kv - marked service down.
2017-04-06 21:05:28.252 [info] <0.431.0>#riak_kv_app:prep_stop:232 Unregistered pb services
2017-04-06 21:05:28.408 [info] <0.431.0>#riak_kv_app:prep_stop:237 unregistered webmachine routes
2017-04-06 21:05:28.459 [info] <0.431.0>#riak_kv_app:prep_stop:239 all active put FSMs completed
2017-04-06 21:05:29.665 [info] <0.540.0>#riak_kv_js_vm:terminate:237 Spidermonkey VM (pool: riak_kv_js_hook) host stopping (<0.540.0>)
2017-04-06 21:05:29.665 [info] <0.539.0>#riak_kv_js_vm:terminate:237 Spidermonkey VM (pool: riak_kv_js_hook) host stopping (<0.539.0>)
2017-04-06 21:05:30.379 [info] <0.532.0>#riak_kv_js_vm:terminate:237 Spidermonkey VM (pool: riak_kv_js_reduce) host stopping (<0.532.0>)
2017-04-06 21:05:31.116 [info] <0.534.0>#riak_kv_js_vm:terminate:237 Spidermonkey VM (pool: riak_kv_js_reduce) host stopping (<0.534.0>)
2017-04-06 21:05:31.362 [info] <0.533.0>#riak_kv_js_vm:terminate:237 Spidermonkey VM (pool: riak_kv_js_reduce) host stopping (<0.533.0>)
2017-04-06 21:05:32.153 [info] <0.536.0>#riak_kv_js_vm:terminate:237 Spidermonkey VM (pool: riak_kv_js_reduce) host stopping (<0.536.0>)
2017-04-06 21:05:32.245 [info] <0.537.0>#riak_kv_js_vm:terminate:237 Spidermonkey VM (pool: riak_kv_js_reduce) host stopping (<0.537.0>)
2017-04-06 21:05:32.676 [info] <0.535.0>#riak_kv_js_vm:terminate:237 Spidermonkey VM (pool: riak_kv_js_reduce) host stopping (<0.535.0>)
2017-04-06 21:05:33.450 [info] <0.431.0>#riak_kv_app:stop:250 Stopped application riak_kv.
2017-04-06 21:05:41.701 [info] <0.195.0>#riak_core_app:stop:116 Stopped application riak_core.
2017-04-06 21:05:43.061 [info] <0.93.0> alarm_handler: {clear,system_memory_high_watermark}
We have these extra options added to riak.conf:
search = on
search.solr.jmx_port = 8985
search.solr.jvm_options = -d64 -Xms4g -Xmx4g -XX:+UseStringCache -XX:+UseCompressedOops
search.solr.port = 8093
search.solr.start_timeout = 180s
There is no sign of any OOM errors, or of processes being killed by the oom_killer.
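Side note: the solr stdout/err line above warns that UseStringCache support was removed in Java 8, so presumably the jvm_options line could be trimmed to drop that flag, e.g. something like
search.solr.jvm_options = -d64 -Xms4g -Xmx4g -XX:+UseCompressedOops
though I don't know whether that warning is related to the crash - the restart failure itself looks like the JMX port 8985 still being in use when Solr comes back up.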

Flink HA Cluster JobManager issues

I have a Flink 1.2 cluster made up of 3 JobManagers and 2 TaskManagers. I start the ZooKeeper quorum from JobManager1, get confirmation that ZooKeeper starts on the other 2 JobManagers, and then start a Flink job on JobManager1.
The flink-conf.yaml is the same on all 5 VMs, which means jobmanager.rpc.address points to JobManager1 everywhere.
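For context, the HA-related part of that flink-conf.yaml looks roughly like the sketch below; the hostnames, ZooKeeper quorum and storage path are placeholders rather than my literal values (note the storageDir pointing at a local path, which turns out to matter - see the answer at the end):
jobmanager.rpc.address: jobmanager1
high-availability: zookeeper
high-availability.zookeeper.quorum: jobmanager1:2181,jobmanager2:2181,jobmanager3:2181
high-availability.zookeeper.storageDir: file:///var/flink/ha-recovery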
If I turn off the VM running JobManager1, I would expect ZooKeeper to elect one of the remaining JobManagers as leader and the TaskManagers to reconnect to it. Instead I get a lot of these messages in the TaskManagers' logs:
2017-03-14 14:13:21,827 INFO org.apache.flink.runtime.taskmanager.TaskManager - Trying to register at JobManager akka.tcp://flink#1.2.3.4:43660/user/jobmanager (attempt 11, timeout: 30 seconds)
2017-03-14 14:13:21,836 WARN akka.remote.ReliableDeliverySupervisor - Association with remote system [akka.tcp://flink#1.2.3.4:43660] has failed, address is now gated for [5000] ms. Reason: [Association failed with [akka.tcp://flink#1.2.3.4:43660]] Caused by: [Connection refused: /1.2.3.4:43660]
I modified the original IP to 1.2.3.4 for confidentiality and because it's always the same IP (of JobManager1).
More logs:
2017-03-15 10:28:28,655 INFO org.apache.flink.core.fs.FileSystem - Ensuring all FileSystem streams are closed for Async calls on Source: Custom Source -> Flat Map (1/1)
2017-03-15 10:28:38,534 WARN akka.remote.ReliableDeliverySupervisor - Association with remote system [akka.tcp://flink#1.2.3.4:44779] has failed, address is now gated for [5000] ms. Reason: [Disassociated]
2017-03-15 10:28:46,606 WARN akka.remote.ReliableDeliverySupervisor - Association with remote system [akka.tcp://flink#1.2.3.4:44779] has failed, address is now gated for [5000] ms. Reason: [Association failed with [akka.tcp://flink#1.2.3.4:44779]] Caused by: [Connection refused: /1.2.3.4:44779]
2017-03-15 10:28:52,431 WARN akka.remote.ReliableDeliverySupervisor - Association with remote system [akka.tcp://flink#1.2.3.4:44779] has failed, address is now gated for [5000] ms. Reason: [Association failed with [akka.tcp://flink#1.2.3.4:44779]] Caused by: [Connection refused: /1.2.3.4:44779]
2017-03-15 10:29:02,435 WARN akka.remote.ReliableDeliverySupervisor - Association with remote system [akka.tcp://flink#1.2.3.4:44779] has failed, address is now gated for [5000] ms. Reason: [Association failed with [akka.tcp://flink#1.2.3.4:44779]] Caused by: [Connection refused: /1.2.3.4:44779]
2017-03-15 10:29:10,489 INFO org.apache.flink.runtime.taskmanager.TaskManager - TaskManager akka://flink/user/taskmanager disconnects from JobManager akka.tcp://flink#1.2.3.4:44779/user/jobmanager: Old JobManager lost its leadership.
2017-03-15 10:29:10,490 INFO org.apache.flink.runtime.taskmanager.TaskManager - Cancelling all computations and discarding all cached data.
2017-03-15 10:29:10,491 INFO org.apache.flink.runtime.taskmanager.Task - Attempting to fail task externally Source: Custom Source -> Flat Map (1/1) (75fd495cc6acfd72fbe957e60e513223).
2017-03-15 10:29:10,491 INFO org.apache.flink.runtime.taskmanager.Task - Source: Custom Source -> Flat Map (1/1) (75fd495cc6acfd72fbe957e60e513223) switched from RUNNING to FAILED.
java.lang.Exception: TaskManager akka://flink/user/taskmanager disconnects from JobManager akka.tcp://flink#1.2.3.4:44779/user/jobmanager: Old JobManager lost its leadership.
at org.apache.flink.runtime.taskmanager.TaskManager.handleJobManagerDisconnect(TaskManager.scala:1074)
at org.apache.flink.runtime.taskmanager.TaskManager.org$apache$flink$runtime$taskmanager$TaskManager$$handleJobManagerLeaderAddress(TaskManager.scala:1426)
at org.apache.flink.runtime.taskmanager.TaskManager$$anonfun$handleMessage$1.applyOrElse(TaskManager.scala:286)
at scala.runtime.AbstractPartialFunction.apply(AbstractPartialFunction.scala:36)
at org.apache.flink.runtime.LeaderSessionMessageFilter$$anonfun$receive$1.applyOrElse(LeaderSessionMessageFilter.scala:44)
at scala.runtime.AbstractPartialFunction.apply(AbstractPartialFunction.scala:36)
at org.apache.flink.runtime.LogMessages$$anon$1.apply(LogMessages.scala:33)
at org.apache.flink.runtime.LogMessages$$anon$1.apply(LogMessages.scala:28)
at scala.PartialFunction$class.applyOrElse(PartialFunction.scala:123)
at org.apache.flink.runtime.LogMessages$$anon$1.applyOrElse(LogMessages.scala:28)
at akka.actor.Actor$class.aroundReceive(Actor.scala:467)
at org.apache.flink.runtime.taskmanager.TaskManager.aroundReceive(TaskManager.scala:122)
at akka.actor.ActorCell.receiveMessage(ActorCell.scala:516)
at akka.actor.ActorCell.invoke(ActorCell.scala:487)
at akka.dispatch.Mailbox.processMailbox(Mailbox.scala:238)
at akka.dispatch.Mailbox.run(Mailbox.scala:220)
at akka.dispatch.ForkJoinExecutorConfigurator$AkkaForkJoinTask.exec(AbstractDispatcher.scala:397)
at scala.concurrent.forkjoin.ForkJoinTask.doExec(ForkJoinTask.java:260)
at scala.concurrent.forkjoin.ForkJoinPool$WorkQueue.runTask(ForkJoinPool.java:1339)
at scala.concurrent.forkjoin.ForkJoinPool.runWorker(ForkJoinPool.java:1979)
at scala.concurrent.forkjoin.ForkJoinWorkerThread.run(ForkJoinWorkerThread.java:107)
2017-03-15 10:29:10,512 INFO org.apache.flink.runtime.taskmanager.Task - Triggering cancellation of task code Source: Custom Source -> Flat Map (1/1) (75fd495cc6acfd72fbe957e60e513223).
2017-03-15 10:29:10,515 INFO org.apache.flink.runtime.taskmanager.Task - Attempting to fail task externally Flat Map (1/1) (dd555e0437867c3180a1ecaf0a9f4d04).
2017-03-15 10:29:10,515 INFO org.apache.flink.runtime.taskmanager.Task - Flat Map (1/1) (dd555e0437867c3180a1ecaf0a9f4d04) switched from RUNNING to FAILED.
java.lang.Exception: TaskManager akka://flink/user/taskmanager disconnects from JobManager akka.tcp://flink#1.2.3.4:44779/user/jobmanager: Old JobManager lost its leadership.
at org.apache.flink.runtime.taskmanager.TaskManager.handleJobManagerDisconnect(TaskManager.scala:1074)
at org.apache.flink.runtime.taskmanager.TaskManager.org$apache$flink$runtime$taskmanager$TaskManager$$handleJobManagerLeaderAddress(TaskManager.scala:1426)
at org.apache.flink.runtime.taskmanager.TaskManager$$anonfun$handleMessage$1.applyOrElse(TaskManager.scala:286)
at scala.runtime.AbstractPartialFunction.apply(AbstractPartialFunction.scala:36)
at org.apache.flink.runtime.LeaderSessionMessageFilter$$anonfun$receive$1.applyOrElse(LeaderSessionMessageFilter.scala:44)
at scala.runtime.AbstractPartialFunction.apply(AbstractPartialFunction.scala:36)
at org.apache.flink.runtime.LogMessages$$anon$1.apply(LogMessages.scala:33)
at org.apache.flink.runtime.LogMessages$$anon$1.apply(LogMessages.scala:28)
at scala.PartialFunction$class.applyOrElse(PartialFunction.scala:123)
at org.apache.flink.runtime.LogMessages$$anon$1.applyOrElse(LogMessages.scala:28)
at akka.actor.Actor$class.aroundReceive(Actor.scala:467)
at org.apache.flink.runtime.taskmanager.TaskManager.aroundReceive(TaskManager.scala:122)
at akka.actor.ActorCell.receiveMessage(ActorCell.scala:516)
at akka.actor.ActorCell.invoke(ActorCell.scala:487)
at akka.dispatch.Mailbox.processMailbox(Mailbox.scala:238)
at akka.dispatch.Mailbox.run(Mailbox.scala:220)
at akka.dispatch.ForkJoinExecutorConfigurator$AkkaForkJoinTask.exec(AbstractDispatcher.scala:397)
at scala.concurrent.forkjoin.ForkJoinTask.doExec(ForkJoinTask.java:260)
at scala.concurrent.forkjoin.ForkJoinPool$WorkQueue.runTask(ForkJoinPool.java:1339)
at scala.concurrent.forkjoin.ForkJoinPool.runWorker(ForkJoinPool.java:1979)
at scala.concurrent.forkjoin.ForkJoinWorkerThread.run(ForkJoinWorkerThread.java:107)
2017-03-15 10:29:10,516 INFO org.apache.flink.runtime.taskmanager.Task - Triggering cancellation of task code Flat Map (1/1) (dd555e0437867c3180a1ecaf0a9f4d04).
2017-03-15 10:29:10,516 INFO org.apache.flink.runtime.taskmanager.TaskManager - Disassociating from JobManager
2017-03-15 10:29:10,525 INFO org.apache.flink.runtime.blob.BlobCache - Shutting down BlobCache
2017-03-15 10:29:10,542 WARN akka.remote.ReliableDeliverySupervisor - Association with remote system [akka.tcp://flink#1.2.3.4:44779] has failed, address is now gated for [5000] ms. Reason: [Association failed with [akka.tcp://flink#1.2.3.4:44779]] Caused by: [Connection refused: /1.2.3.4:44779]
2017-03-15 10:29:10,546 INFO org.apache.flink.runtime.taskmanager.Task - Freeing task resources for Source: Custom Source -> Flat Map (1/1) (75fd495cc6acfd72fbe957e60e513223).
2017-03-15 10:29:10,548 INFO org.apache.flink.runtime.taskmanager.Task - Freeing task resources for Flat Map (1/1) (dd555e0437867c3180a1ecaf0a9f4d04).
2017-03-15 10:29:10,551 INFO org.apache.flink.core.fs.FileSystem - Ensuring all FileSystem streams are closed for Flat Map (1/1)
2017-03-15 10:29:10,552 INFO org.apache.flink.runtime.taskmanager.TaskManager - Trying to register at JobManager akka.tcp://flink#1.2.3.5:43893/user/jobmanager (attempt 1, timeout: 500 milliseconds)
2017-03-15 10:29:10,567 INFO org.apache.flink.core.fs.FileSystem - Ensuring all FileSystem streams are closed for Source: Custom Source -> Flat Map (1/1)
2017-03-15 10:29:10,632 INFO org.apache.flink.runtime.taskmanager.TaskManager - Successful registration at JobManager (akka.tcp://flink#1.2.3.5:43893/user/jobmanager), starting network stack and library cache.
2017-03-15 10:29:10,633 INFO org.apache.flink.runtime.taskmanager.TaskManager - Determined BLOB server address to be /1.2.3.5:42830. Starting BLOB cache.
2017-03-15 10:29:10,633 INFO org.apache.flink.runtime.blob.BlobCache - Created BLOB cache storage directory /tmp/blobStore-d97e08db-d2f1-4f00-a7d1-30c2f5823934
2017-03-15 10:29:15,551 WARN akka.remote.ReliableDeliverySupervisor - Association with remote system [akka.tcp://flink#1.2.3.4:44779] has failed, address is now gated for [5000] ms. Reason: [Association failed with [akka.tcp://flink#1.2.3.4:44779]] Caused by: [Connection refused: /1.2.3.4:44779]
2017-03-15 10:29:20,571 WARN akka.remote.ReliableDeliverySupervisor - Association with remote system [akka.tcp://flink#1.2.3.4:44779] has failed, address is now gated for [5000] ms. Reason: [Association failed with [akka.tcp://flink#1.2.3.4:44779]] Caused by: [Connection refused: /1.2.3.4:44779]
2017-03-15 10:29:25,582 WARN akka.remote.ReliableDeliverySupervisor - Association with remote system [akka.tcp://flink#1.2.3.4:44779] has failed, address is now gated for [5000] ms. Reason: [Association failed with [akka.tcp://flink#1.2.3.4:44779]] Caused by: [Connection refused: /1.2.3.4:44779]
2017-03-15 10:29:30,592 WARN akka.remote.ReliableDeliverySupervisor - Association with remote system [akka.tcp://flink#1.2.3.4:44779] has failed, address is now gated for [5000] ms. Reason: [Association failed with [akka.tcp://flink#1.2.3.4:44779]] Caused by: [Connection refused: /1.2.3.4:44779]
Does anyone know why the TaskManagers are not trying to reconnect to one of the remaining JobManagers (like 1.2.3.5 above)?
Thanks!
For everyone facing the same issue: HA requires you to provide a DFS location accessible from all nodes. I had the backend state checkpoint directory and the ZooKeeper storage directory pointing to a local filesystem location on each VM, so when one of the JobManagers went down the new leader couldn't resume the running jobs because that information was not accessible to it.
Edit: Since this was asked, the file I modified (in the case of Apache Flink 1.2, see https://ci.apache.org/projects/flink/flink-docs-release-1.2/setup/config.html) was
conf/flink-conf.yaml
I set
state.backend.fs.checkpointdir
high-availability.zookeeper.storageDir
to AWS S3 paths accessible from both TaskManagers and JobManagers.
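For illustration, the relevant lines ended up looking roughly like this (the bucket name below is a placeholder, and state.backend is shown just for completeness):
state.backend: filesystem
state.backend.fs.checkpointdir: s3://my-flink-bucket/checkpoints
high-availability.zookeeper.storageDir: s3://my-flink-bucket/ha-recovery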

Resources