Cannot use the Pulsar Shared subscription in Apache Flink

A Flink data pipeline reads from an Apache Pulsar partitioned topic. I have set the PulsarSource subscription to SubscriptionType.Exclusive. When this is changed to SubscriptionType.Shared, it expects transaction policies to be enabled for the namespace. I then enabled the transaction manager in the broker and started it, but I still get this exception:
Caused by: java.io.IOException: org.apache.flink.util.FlinkRuntimeException: java.util.concurrent.ExecutionException: org.apache.pulsar.client.api.transaction.TransactionCoordinatorClientException$CoordinatorNotFoundException: Transaction manager is not started or not enabled
at org.apache.flink.connector.pulsar.source.reader.split.PulsarPartitionSplitReaderBase.fetch(PulsarPartitionSplitReaderBase.java:140)
at org.apache.flink.connector.pulsar.source.reader.split.PulsarUnorderedPartitionSplitReader.fetch(PulsarUnorderedPartitionSplitReader.java:55)
at org.apache.flink.connector.base.source.reader.fetcher.FetchTask.run(FetchTask.java:58)
at org.apache.flink.connector.base.source.reader.fetcher.SplitFetcher.runOnce(SplitFetcher.java:142)
... 6 more
Caused by: org.apache.flink.util.FlinkRuntimeException: java.util.concurrent.ExecutionException: org.apache.pulsar.client.api.transaction.TransactionCoordinatorClientException$CoordinatorNotFoundException: Transaction manager is not started or not enabled
at org.apache.flink.connector.pulsar.common.utils.PulsarTransactionUtils.createTransaction(PulsarTransactionUtils.java:56)
at org.apache.flink.connector.pulsar.source.reader.split.PulsarUnorderedPartitionSplitReader.newTransaction(PulsarUnorderedPartitionSplitReader.java:165)
at org.apache.flink.connector.pulsar.source.reader.split.PulsarUnorderedPartitionSplitReader.pollMessage(PulsarUnorderedPartitionSplitReader.java:91)
at org.apache.flink.connector.pulsar.source.reader.split.PulsarPartitionSplitReaderBase.fetch(PulsarPartitionSplitReaderBase.java:115)
... 9 more
Caused by: java.util.concurrent.ExecutionException: org.apache.pulsar.client.api.transaction.TransactionCoordinatorClientException$CoordinatorNotFoundException: Transaction manager is not started or not enabled
at java.base/java.util.concurrent.CompletableFuture.reportGet(CompletableFuture.java:395)
at java.base/java.util.concurrent.CompletableFuture.get(CompletableFuture.java:1999)
at org.apache.flink.connector.pulsar.common.utils.PulsarTransactionUtils.createTransaction(PulsarTransactionUtils.java:51)
... 12 more
Caused by: org.apache.pulsar.client.api.transaction.TransactionCoordinatorClientException$CoordinatorNotFoundException: Transaction manager is not started or not enabled
at org.apache.pulsar.client.impl.TransactionMetaStoreHandler.getExceptionByServerError(TransactionMetaStoreHandler.java:419)
at org.apache.pulsar.client.impl.TransactionMetaStoreHandler.handleTransactionFailOp(TransactionMetaStoreHandler.java:352)
at org.apache.pulsar.client.impl.TransactionMetaStoreHandler.handleNewTxnResponse(TransactionMetaStoreHandler.java:210)
at org.apache.pulsar.client.impl.ClientCnx.handleNewTxnResponse(ClientCnx.java:945)
at org.apache.pulsar.common.protocol.PulsarDecoder.channelRead(PulsarDecoder.java:382)
at org.apache.pulsar.shade.io.netty.channel.AbstractChannelHandlerContext.invokeChannelRead(AbstractChannelHandlerContext.java:379)
at org.apache.pulsar.shade.io.netty.channel.AbstractChannelHandlerContext.invokeChannelRead(AbstractChannelHandlerContext.java:365)
at org.apache.pulsar.shade.io.netty.channel.AbstractChannelHandlerContext.fireChannelRead(AbstractChannelHandlerContext.java:357)
at org.apache.pulsar.shade.io.netty.handler.codec.ByteToMessageDecoder.fireChannelRead(ByteToMessageDecoder.java:324)
at org.apache.pulsar.shade.io.netty.handler.codec.ByteToMessageDecoder.channelRead(ByteToMessageDecoder.java:296)
at org.apache.pulsar.shade.io.netty.channel.AbstractChannelHandlerContext.invokeChannelRead(AbstractChannelHandlerContext.java:379)
at org.apache.pulsar.shade.io.netty.channel.AbstractChannelHandlerContext.invokeChannelRead(AbstractChannelHandlerContext.java:365)
at org.apache.pulsar.shade.io.netty.channel.AbstractChannelHandlerContext.fireChannelRead(AbstractChannelHandlerContext.java:357)
at org.apache.pulsar.shade.io.netty.channel.DefaultChannelPipeline$HeadContext.channelRead(DefaultChannelPipeline.java:1410)
at org.apache.pulsar.shade.io.netty.channel.AbstractChannelHandlerContext.invokeChannelRead(AbstractChannelHandlerContext.java:379)
at org.apache.pulsar.shade.io.netty.channel.AbstractChannelHandlerContext.invokeChannelRead(AbstractChannelHandlerContext.java:365)
at org.apache.pulsar.shade.io.netty.channel.DefaultChannelPipeline.fireChannelRead(DefaultChannelPipeline.java:919)
at org.apache.pulsar.shade.io.netty.channel.nio.AbstractNioByteChannel$NioByteUnsafe.read(AbstractNioByteChannel.java:166)
at org.apache.pulsar.shade.io.netty.channel.nio.NioEventLoop.processSelectedKey(NioEventLoop.java:722)
at org.apache.pulsar.shade.io.netty.channel.nio.NioEventLoop.processSelectedKeysOptimized(NioEventLoop.java:658)
at org.apache.pulsar.shade.io.netty.channel.nio.NioEventLoop.processSelectedKeys(NioEventLoop.java:584)
at org.apache.pulsar.shade.io.netty.channel.nio.NioEventLoop.run(NioEventLoop.java:496)
at org.apache.pulsar.shade.io.netty.util.concurrent.SingleThreadEventExecutor$4.run(SingleThreadEventExecutor.java:986)
at org.apache.pulsar.shade.io.netty.util.internal.ThreadExecutorMap$2.run(ThreadExecutorMap.java:74)
at org.apache.pulsar.shade.io.netty.util.concurrent.FastThreadLocalRunnable.run(FastThreadLocalRunnable.java:30)
... 1 more
I do not see a way (a method) to enable transactions inside Flink's PulsarSource.
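For reference, a minimal sketch of the kind of source setup in question; with SubscriptionType.Shared the connector creates Pulsar transactions internally (see PulsarUnorderedPartitionSplitReader.newTransaction in the trace above). The topic, URLs, and subscription name are placeholders, and the builder methods follow the Flink Pulsar connector around Flink 1.14/1.15:

import org.apache.flink.api.common.eventtime.WatermarkStrategy;
import org.apache.flink.api.common.serialization.SimpleStringSchema;
import org.apache.flink.connector.pulsar.source.PulsarSource;
import org.apache.flink.connector.pulsar.source.enumerator.cursor.StartCursor;
import org.apache.flink.connector.pulsar.source.reader.deserializer.PulsarDeserializationSchema;
import org.apache.pulsar.client.api.SubscriptionType;

PulsarSource<String> source = PulsarSource.builder()
        .setServiceUrl("pulsar://localhost:6650")            // placeholder
        .setAdminUrl("http://localhost:8080")                // placeholder
        .setTopics("persistent://public/default/my-topic")  // placeholder
        .setStartCursor(StartCursor.earliest())
        .setDeserializationSchema(PulsarDeserializationSchema.flinkSchema(new SimpleStringSchema()))
        .setSubscriptionName("my-subscription")
        .setSubscriptionType(SubscriptionType.Shared)        // requires broker-side transaction support
        .build();

env.fromSource(source, WatermarkStrategy.noWatermarks(), "pulsar-source");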

Standalone Pulsar uses standalone.conf, not broker.conf. I had wrongly enabled the transaction coordinator in broker.conf.
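For a standalone broker the fix is the same setting, but in conf/standalone.conf (a sketch, assuming Pulsar 2.8+; restart Pulsar afterwards):

# conf/standalone.conf
transactionCoordinatorEnabled=true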

Related

Flink try to recover checkpoint from deleted directories

After cleaning old files (files not accessed for more than a month) out of the S3 bucket used to store checkpoints, some of the job's processes fail to start when restarting or when restoring from recent checkpoints, due to missing old files.
The job runs well and saves checkpoints regularly (save path s3://flink-checkpoints/check/af8b0712ae0c1f20d2226b86e6bddb60/chk-100274):
2022-04-24 03:58:32.892 Triggering checkpoint 100273 @ 1653353912890 for job af8b0712ae0c1f20d2226b86e6bddb60.
2022-04-24 03:58:55.317 Completed checkpoint 100273 for job af8b0712ae0c1f20d2226b86e6bddb60 (679053131 bytes in 22090 ms).
2022-04-24 04:03:32.892 Triggering checkpoint 100274 @ 1653354212890 for job af8b0712ae0c1f20d2226b86e6bddb60.
2022-04-24 04:03:35.844 Completed checkpoint 100274 for job af8b0712ae0c1f20d2226b86e6bddb60 (9606712 bytes in 2494 ms).
After one TaskManager was switched off, the job restarted:
2022-04-24 04:04:40.936 Job test-job (af8b0712ae0c1f20d2226b86e6bddb60) switched from state RUNNING to RESTARTING.
2022-04-24 04:05:14.150 Job test-job (af8b0712ae0c1f20d2226b86e6bddb60) switched from state RESTARTING to RUNNING.
2022-04-24 04:05:14.198 Restoring job af8b0712ae0c1f20d2226b86e6bddb60 from latest valid checkpoint: Checkpoint 100274 @ 1653354212890 for af8b0712ae0c1f20d2226b86e6bddb60.
After some time the job failed because some processes couldn't restore their state:
2022-04-24 04:05:17.095 Job test-job (af8b0712ae0c1f20d2226b86e6bddb60) switched from state RUNNING to RESTARTING.
2022-04-24 04:05:17.093 Process first events -> Sink: Sink to test-job (5/10) (4f9089b1015540eb6e13afe4c07fa97b) switched from RUNNING to FAILED.
java.lang.Exception: Exception while creating StreamOperatorStateContext.
Caused by: org.apache.flink.util.FlinkException: Could not restore keyed state backend for KeyedProcessOperator_f1d5710fb330fd579d15b292e305802c_(5/10) from any of the 1 provided restore options.
Caused by: org.apache.flink.runtime.state.BackendBuildingException: Caught unexpected exception.
Caused by: org.apache.flink.util.FlinkRuntimeException: Failed to download data for state handles.
Caused by: com.facebook.presto.hive.s3.PrestoS3FileSystem$UnrecoverableS3OperationException: com.amazonaws.services.s3.model.AmazonS3Exception: null (Service: Amazon S3; Status Code: 404; Error Code: NoSuchKey; Request ID: tx0000000000000f0652d11-00628c2f4a-51f03da-default; S3 Extended Request ID: 51f03da-default-default), S3 Extended Request ID: 51f03da-default-default (Path: s3://flink-checkpoints/check/e3d82336005fc40be9af536938716199/shared/64452a30-c8a0-454f-8164-34d9e70142e0)
Caused by: com.amazonaws.services.s3.model.AmazonS3Exception: null (Service: Amazon S3; Status Code: 404; Error Code: NoSuchKey; Request ID: tx0000000000000f0652d11-00628c2f4a-51f03da-default; S3 Extended Request ID: 51f03da-default-default)
If I completely cancel the job and start a new one with the savepoint set to the path of the last checkpoint, I get the same errors.
Why, when working with a checkpoint from the af8b0712ae0c1f20d2226b86e6bddb60 folder, does the job try to get files from the e3d82336005fc40be9af536938716199 folder, and what are the rules for clearing old checkpoints from storage?
UPDATE
I found that Flink saves the S3 paths of all TaskManagers' RocksDB files in the chk-*/_metadata file.
This is something that was quite ambiguous for a long time and has recently been addressed in Flink 1.15. I would recommend reading the section 'Clarification of checkpoint and savepoint semantics' in https://flink.apache.org/news/2022/05/05/1.15-announcement.html, including the comparison between checkpoints and savepoints.
The behaviour you've experienced depends on your checkpointing setup (aligned vs unaligned). Note also that the failing path is under a shared/ directory: with incremental RocksDB checkpoints, newer checkpoints reference unchanged files from older ones, so deleting checkpoint files by age can break restores.
By default, cancelling a job removes its checkpoints. There is a configuration option to control this: execution.checkpointing.externalized-checkpoint-retention. As mentioned by Martijn, normally you would resort to savepoints for controlled job upgrades/restarts.
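For illustration, a minimal sketch of retaining externalized checkpoints on cancellation, assuming the Flink 1.15 CheckpointConfig API (the configuration key above can be set in flink-conf.yaml instead; the 5-minute interval matches the trigger cadence in the logs):

import org.apache.flink.streaming.api.environment.CheckpointConfig.ExternalizedCheckpointCleanup;
import org.apache.flink.streaming.api.environment.StreamExecutionEnvironment;

StreamExecutionEnvironment env = StreamExecutionEnvironment.getExecutionEnvironment();
env.enableCheckpointing(300_000); // checkpoint every 5 minutes
env.getCheckpointConfig().setExternalizedCheckpointCleanup(
        ExternalizedCheckpointCleanup.RETAIN_ON_CANCELLATION); // keep checkpoints when the job is cancelled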

Flink job failed when it encountered a DB connection exception

I'm new to Flink. My Flink job receives messages from MQ, does some rule checking and summary calculation, then writes the results to an RDBMS. Sometimes the job encounters a NullPointerException (due to my silly code) or an MQ connection exception (due to a non-existent topic); that just halts the processing of the current message while the job keeps running, and subsequent messages trigger the exception again.
But today I restarted the DB and the whole job failed. What's the difference?

Apache beam job failing sometimes on Apache Flink cluster with OptimizerPlanEnvironment$ProgramAbortException

I have been running an apache beam based data ingestion job which parses an input CSV file and writes data on a sink. This job works fine when one job is submitted at a time (under normal load circumstances).
But recently when I started load testing, I began scheduling multiple jobs sequentially in a loop, where I observed some exceptions and job failures. Currently, I am using a script to schedule 9 jobs through the Flink REST API. When these jobs are scheduled, sometimes all of them execute without any issue, but the majority of the time 2 or 3 out of the 9 fail with the following exceptions. I have tried the job with multiple input CSV files, but it shows similar behavior.
Exception 1:
2020-06-02 11:22:45,686 ERROR org.apache.flink.runtime.webmonitor.handlers.JarRunHandler - Unhandled exception.
org.apache.flink.client.program.ProgramInvocationException: The program plan could not be fetched - the program aborted pre-maturely.
System.err: (none)
System.out: (none)
at org.apache.flink.client.program.OptimizerPlanEnvironment.getOptimizedPlan(OptimizerPlanEnvironment.java:108)
at org.apache.flink.client.program.PackagedProgramUtils.createJobGraph(PackagedProgramUtils.java:80)
at org.apache.flink.runtime.webmonitor.handlers.utils.JarHandlerUtils$JarHandlerContext.toJobGraph(JarHandlerUtils.java:128)
at org.apache.flink.runtime.webmonitor.handlers.JarRunHandler.lambda$getJobGraphAsync$6(JarRunHandler.java:142)
at java.util.concurrent.CompletableFuture$AsyncSupply.run(CompletableFuture.java:1604)
at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1149)
at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624)
at java.lang.Thread.run(Thread.java:748)
Exception 2:
2020-06-02 11:22:46,035 ERROR org.apache.flink.runtime.webmonitor.handlers.JarRunHandler - Unhandled exception.
org.apache.flink.client.program.ProgramInvocationException: The program caused an error:
at org.apache.flink.client.program.OptimizerPlanEnvironment.getOptimizedPlan(OptimizerPlanEnvironment.java:93)
at org.apache.flink.client.program.PackagedProgramUtils.createJobGraph(PackagedProgramUtils.java:80)
at org.apache.flink.runtime.webmonitor.handlers.utils.JarHandlerUtils$JarHandlerContext.toJobGraph(JarHandlerUtils.java:128)
at org.apache.flink.runtime.webmonitor.handlers.JarRunHandler.lambda$getJobGraphAsync$6(JarRunHandler.java:142)
at java.util.concurrent.CompletableFuture$AsyncSupply.run(CompletableFuture.java:1604)
at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1149)
at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624)
at java.lang.Thread.run(Thread.java:748)
Caused by: org.apache.flink.client.program.OptimizerPlanEnvironment$ProgramAbortException
at org.apache.flink.client.program.OptimizerPlanEnvironment.execute(OptimizerPlanEnvironment.java:54)
at org.apache.beam.runners.flink.FlinkPipelineExecutionEnvironment.executePipeline(FlinkPipelineExecutionEnvironment.java:139)
at org.apache.beam.runners.flink.FlinkRunner.run(FlinkRunner.java:113)
at org.apache.beam.sdk.Pipeline.run(Pipeline.java:315)
at org.apache.beam.sdk.Pipeline.run(Pipeline.java:301)
at com.spotify.scio.ScioContext.execute(ScioContext.scala:598)
at com.spotify.scio.ScioContext$$anonfun$run$1.apply(ScioContext.scala:586)
at com.spotify.scio.ScioContext$$anonfun$run$1.apply(ScioContext.scala:574)
at com.spotify.scio.ScioContext.requireNotClosed(ScioContext.scala:694)
at com.spotify.scio.ScioContext.run(ScioContext.scala:574)
at com.sparkcognition.foundation.ingest.jobs.delimitedingest.DelimitedIngest$.main(DelimitedIngest.scala:70)
at com.sparkcognition.foundation.ingest.jobs.delimitedingest.DelimitedIngest.main(DelimitedIngest.scala)
at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:62)
at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
at java.lang.reflect.Method.invoke(Method.java:498)
at org.apache.flink.client.program.PackagedProgram.callMainMethod(PackagedProgram.java:557)
at org.apache.flink.client.program.PackagedProgram.invokeInteractiveModeForExecution(PackagedProgram.java:449)
at org.apache.flink.client.program.OptimizerPlanEnvironment.getOptimizedPlan(OptimizerPlanEnvironment.java:83)
... 7 more
The job jar file is developed in Scala using Spotify's Scio library (0.8.0). The Flink cluster has these specs:
Flink version 1.8.3
1 master and 2 worker nodes
32 task slots with 2 task executors
job manager heap size = 48 GB
task executor heap size = 24 GB
Submitting from the REST API or the UI fails if your call to env.execute() is enclosed in a try/catch (Exception) block.
I see the stack trace is similar, but your sometimes-succeeds, sometimes-fails behavior is different; with my code, REST submission used to fail consistently. Check the Stack Overflow reference below and see if it helps.
Issue with job submission from Flink Job UI (Exception:org.apache.flink.client.program.OptimizerPlanEnvironment$ProgramAbortException)
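To make the failure mode concrete, here is a minimal sketch (not the poster's actual code): when a job is submitted through the REST API or web UI, Flink runs main() only to extract the job plan and aborts it by throwing ProgramAbortException out of execute(), so a broad catch block swallows that exception and the submission aborts "pre-maturely".

import org.apache.flink.streaming.api.environment.StreamExecutionEnvironment;

public class SubmitViaRestExample {
    public static void main(String[] args) throws Exception {
        StreamExecutionEnvironment env = StreamExecutionEnvironment.getExecutionEnvironment();
        env.fromElements(1, 2, 3).print(); // placeholder pipeline

        // Problematic: catch (Exception) also traps the internal
        // ProgramAbortException thrown during plan extraction.
        // try {
        //     env.execute("example");
        // } catch (Exception e) {
        //     e.printStackTrace();
        // }

        // Safer: let execute() propagate, or rethrow anything you do not handle.
        env.execute("example");
    }
}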

Running a Flink built-in program sometimes raises Exception: java.io.IOException: Connecting the channel failed

I have set up a Flink standalone cluster with one master and three slaves, all SUSE Linux machines. In the master's dashboard at http://flink-master:8081/ I can see 3 Task Managers and 3 task slots, as I have set taskmanager.numberOfTaskSlots: 1 in flink-conf.yaml on all of the slaves.
When I run a Flink built-in program, like examples/streaming/Iteration.jar, I often get this exception:
java.io.IOException: Connecting the channel failed: Connecting to remote task manager + 'ccr202/127.0.0.2:49651' has failed. This might indicate that the remote task manager has been lost.
at org.apache.flink.runtime.io.network.netty.PartitionRequestClientFactory$ConnectingChannel.waitForChannel(PartitionRequestClientFactory.java:197)
at org.apache.flink.runtime.io.network.netty.PartitionRequestClientFactory$ConnectingChannel.access$000(PartitionRequestClientFactory.java:132)
at org.apache.flink.runtime.io.network.netty.PartitionRequestClientFactory.createPartitionRequestClient(PartitionRequestClientFactory.java:84)
at org.apache.flink.runtime.io.network.netty.NettyConnectionManager.createPartitionRequestClient(NettyConnectionManager.java:59)
at org.apache.flink.runtime.io.network.partition.consumer.RemoteInputChannel.requestSubpartition(RemoteInputChannel.java:156)
at org.apache.flink.runtime.io.network.partition.consumer.SingleInputGate.requestPartitions(SingleInputGate.java:480)
at org.apache.flink.runtime.io.network.partition.consumer.SingleInputGate.getNextBufferOrEvent(SingleInputGate.java:502)
at org.apache.flink.streaming.runtime.io.BarrierTracker.getNextNonBlocked(BarrierTracker.java:93)
at org.apache.flink.streaming.runtime.io.StreamInputProcessor.processInput(StreamInputProcessor.java:214)
at org.apache.flink.streaming.runtime.tasks.OneInputStreamTask.run(OneInputStreamTask.java:69)
at org.apache.flink.streaming.runtime.tasks.StreamTask.invoke(StreamTask.java:264)
at org.apache.flink.runtime.taskmanager.Task.run(Task.java:718)
at java.lang.Thread.run(Thread.java:748)
Caused by: org.apache.flink.runtime.io.network.netty.exception.RemoteTransportException: Connecting to remote task manager + 'ccr202/127.0.0.2:49651' has failed. This might indicate that the remote task manager has been lost.
at org.apache.flink.runtime.io.network.netty.PartitionRequestClientFactory$ConnectingChannel.operationComplete(PartitionRequestClientFactory.java:220)
at org.apache.flink.runtime.io.network.netty.PartitionRequestClientFactory$ConnectingChannel.operationComplete(PartitionRequestClientFactory.java:132)
at org.apache.flink.shaded.netty4.io.netty.util.concurrent.DefaultPromise.notifyListener0(DefaultPromise.java:680)
at org.apache.flink.shaded.netty4.io.netty.util.concurrent.DefaultPromise.notifyListeners0(DefaultPromise.java:603)
at org.apache.flink.shaded.netty4.io.netty.util.concurrent.DefaultPromise.notifyListeners(DefaultPromise.java:563)
at org.apache.flink.shaded.netty4.io.netty.util.concurrent.DefaultPromise.tryFailure(DefaultPromise.java:424)
at org.apache.flink.shaded.netty4.io.netty.channel.nio.AbstractNioChannel$AbstractNioUnsafe.fulfillConnectPromise(AbstractNioChannel.java:268)
at org.apache.flink.shaded.netty4.io.netty.channel.nio.AbstractNioChannel$AbstractNioUnsafe.finishConnect(AbstractNioChannel.java:284)
at org.apache.flink.shaded.netty4.io.netty.channel.nio.NioEventLoop.processSelectedKey(NioEventLoop.java:528)
at org.apache.flink.shaded.netty4.io.netty.channel.nio.NioEventLoop.processSelectedKeysOptimized(NioEventLoop.java:468)
at org.apache.flink.shaded.netty4.io.netty.channel.nio.NioEventLoop.processSelectedKeys(NioEventLoop.java:382)
at org.apache.flink.shaded.netty4.io.netty.channel.nio.NioEventLoop.run(NioEventLoop.java:354)
at org.apache.flink.shaded.netty4.io.netty.util.concurrent.SingleThreadEventExecutor$2.run(SingleThreadEventExecutor.java:111)
... 1 more
Caused by: java.net.ConnectException: Connection refused: ccr202/127.0.0.2:49651
at sun.nio.ch.SocketChannelImpl.checkConnect(Native Method)
at sun.nio.ch.SocketChannelImpl.finishConnect(SocketChannelImpl.java:717)
at org.apache.flink.shaded.netty4.io.netty.channel.socket.nio.NioSocketChannel.doFinishConnect(NioSocketChannel.java:224)
at org.apache.flink.shaded.netty4.io.netty.channel.nio.AbstractNioChannel$AbstractNioUnsafe.finishConnect(AbstractNioChannel.java:281)
... 6 more
It seems that the network causes the problem, but sometimes the Flink program finishes successfully. So what is the reason?
I also encounter this issue very frequently, especially when there are many TaskManagers. It happens when a TaskManager reads a remote partition through a Netty connection and the connection request times out. I tried a few configuration changes to solve it; increasing taskmanager.network.netty.server.numThreads resolved the issue for me.
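A sketch of that tuning in flink-conf.yaml; the option names are standard Flink settings, but the values below are illustrative only, and the TaskManagers must be restarted for them to take effect:

taskmanager.network.netty.server.numThreads: 8
taskmanager.network.netty.client.numThreads: 8
taskmanager.network.netty.client.connectTimeoutSec: 240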

Intermittent Communications link failure with Cloud SQL

I'm using JMeter to stress-test a GAE web service that uses Cloud SQL, and I'm getting intermittent communications link failure exceptions.
I've tried using direct connections and a connection pool, and I see exceptions in either scenario. The exceptions increase as the number of requests per second increases.
Note that we are using the highest tier of Cloud SQL, D32, and the tests are well under the max of 3200 connections.
Here's a stack trace for reference:
The last packet sent successfully to the server was 0 milliseconds ago. The driver has not received any packets from the server.
at sun.reflect.GeneratedConstructorAccessor48.newInstance(Unknown Source)
at sun.reflect.DelegatingConstructorAccessorImpl.newInstance(DelegatingConstructorAccessorImpl.java:45)
at java.lang.reflect.Constructor.newInstance(Constructor.java:33)
at com.mysql.jdbc.Util.handleNewInstance(Util.java:411)
at com.mysql.jdbc.SQLError.createCommunicationsException(SQLError.java:1117)
at com.mysql.jdbc.MysqlIO.<init>(MysqlIO.java:350)
at com.mysql.jdbc.ConnectionImpl.coreConnect(ConnectionImpl.java:2413)
at com.mysql.jdbc.ConnectionImpl.connectOneTryOnly(ConnectionImpl.java:2450)
at com.mysql.jdbc.ConnectionImpl.createNewIO(ConnectionImpl.java:2235)
at com.mysql.jdbc.ConnectionImpl.<init>(ConnectionImpl.java:818)
at com.mysql.jdbc.JDBC4Connection.<init>(JDBC4Connection.java:46)
at sun.reflect.GeneratedConstructorAccessor46.newInstance(Unknown Source)
at sun.reflect.DelegatingConstructorAccessorImpl.newInstance(DelegatingConstructorAccessorImpl.java:45)
at java.lang.reflect.Constructor.newInstance(Constructor.java:33)
at com.mysql.jdbc.Util.handleNewInstance(Util.java:411)
at com.mysql.jdbc.ConnectionImpl.getInstance(ConnectionImpl.java:404)
at com.mysql.jdbc.GoogleNonRegisteringDriver$JdbcWrapper.getInstance(GoogleNonRegisteringDriver.java:276)
at com.mysql.jdbc.GoogleNonRegisteringDriver.connect(GoogleNonRegisteringDriver.java:246)
at java.sql.DriverManager.getConnection(DriverManager.java:571)
at java.sql.DriverManager.getConnection(DriverManager.java:215)
Update: I changed the connection pool settings to maxActive = 5 and maxIdle = 5, and the intermittent communications link exceptions went away. Note that I've tried both Commons DBCP and Tomcat DBCP. I'm now seeing the following exceptions in the logs:
Caused by: java.sql.SQLException: java.lang.SecurityException: Unable to access gatherPerformanceMetrics
Caused by: java.sql.SQLException: java.lang.SecurityException: Unable to access includeThreadDumpInDeadlockExceptions
Caused by: java.sql.SQLException: java.lang.SecurityException: Unable to access nullNamePatternMatchesAll
From https://cloud.google.com/appengine/docs/java/cloud-sql/#Java_Size_and_access_limits
"Each App Engine instance cannot have more than 12 concurrent connections to a Google Cloud SQL instance."
Can you tell us more about the test set-up? How many requests is JMeter sending to App Engine, and how many connections does the app instance open for each of those requests?
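As a sketch of the pooling approach from the update above, capped well below the quoted 12-connection limit (the URL, project, and instance names are placeholders, and this assumes the legacy App Engine Cloud SQL JDBC driver):

import org.apache.tomcat.jdbc.pool.DataSource;
import org.apache.tomcat.jdbc.pool.PoolProperties;

PoolProperties props = new PoolProperties();
props.setUrl("jdbc:google:mysql://my-project:my-instance/mydb"); // placeholder
props.setDriverClassName("com.mysql.jdbc.GoogleDriver");
props.setMaxActive(5); // stay well under 12 connections per App Engine instance
props.setMaxIdle(5);
DataSource ds = new DataSource(props);
// use: try (java.sql.Connection conn = ds.getConnection()) { ... }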
To everyone who is looking for why you might be getting "com.mysql.jdbc.exceptions.jdbc4.CommunicationsException: Communications link failure" on a connection:
Make sure your IP is allowed if you are calling from a test server!
I was testing at a friend's house, and this unhelpful error kept showing up.
