Call to RocksDB.get times out when FlinkSQL job restarts

We are running a streaming FlinkSQL job. If the job restarts (say due to a checkpoint failure), task managers log this error (see full stack trace below):
did not react to cancelling signal - interrupting; it is stuck for 30 seconds in method:
app//org.rocksdb.RocksDB.get(Native Method)
The logs show that RocksDB was closed exactly 30 seconds earlier:
Closed RocksDB State Backend. Cleaning up RocksDB working directory /flink-tmp/rocksdb/job_90331a967b94c5abd7b5377a55cc67ac_op_SlicingWindowOperator_4808b9a6cd8a2889e00c15fe1a792329__17_50__uuid_eb1b282c-8952-4b9f-b6d6-ee7be011d59f.
Could closing RocksDB be what prevents the get call from returning?
We are using Flink 1.15.0 and running a query like this:
INSERT INTO BigtableTable
SELECT CONCAT_WS('#', user_id, bucket) AS rowkey, cell_timestamp, ROW(hllAttributeCount)
FROM (
  SELECT
    user_id,
    window_end AS cell_timestamp,
    DATE_FORMAT(window_end, 'yyyy-MM-dd:HH') AS bucket,
    STRING_HLL(attribute_to_count) AS hllAttributeCount
  FROM TABLE(TUMBLE(TABLE inputTable, DESCRIPTOR(event_time), INTERVAL '5' MINUTES))
  GROUP BY user_id, window_start, window_end
)
Full stack trace:
Task 'GlobalWindowAggregate[5] -> Calc[6] -> Sink: table[7] (4/50)#0' did not react to cancelling signal - interrupting; it is stuck for 30 seconds in method:
app//org.rocksdb.RocksDB.get(Native Method)
app//org.rocksdb.RocksDB.get(RocksDB.java:2084)
app//org.apache.flink.contrib.streaming.state.RocksDBValueState.value(RocksDBValueState.java:83)
app//org.apache.flink.table.runtime.operators.window.state.WindowValueState.value(WindowValueState.java:44)
app//org.apache.flink.table.runtime.operators.aggregate.window.combines.GlobalAggCombiner.combineAccumulator(GlobalAggCombiner.java:94)
app//org.apache.flink.table.runtime.operators.aggregate.window.combines.GlobalAggCombiner.combine(GlobalAggCombiner.java:85)
app//org.apache.flink.table.runtime.operators.aggregate.window.buffers.RecordsWindowBuffer.flush(RecordsWindowBuffer.java:112)
app//org.apache.flink.table.runtime.operators.aggregate.window.processors.AbstractWindowAggProcessor.prepareCheckpoint(AbstractWindowAggProcessor.java:203)
app//org.apache.flink.table.runtime.operators.window.slicing.SlicingWindowOperator.prepareSnapshotPreBarrier(SlicingWindowOperator.java:267)
app//org.apache.flink.streaming.runtime.tasks.RegularOperatorChain.prepareSnapshotPreBarrier(RegularOperatorChain.java:89)
app//org.apache.flink.streaming.runtime.tasks.SubtaskCheckpointCoordinatorImpl.checkpointState(SubtaskCheckpointCoordinatorImpl.java:300)
app//org.apache.flink.streaming.runtime.tasks.StreamTask.lambda$performCheckpoint$12(StreamTask.java:1253)
app//org.apache.flink.streaming.runtime.tasks.StreamTask$$Lambda$1997/0x0000000840efa440.run(Unknown Source)
app//org.apache.flink.streaming.runtime.tasks.StreamTaskActionExecutor$1.runThrowing(StreamTaskActionExecutor.java:50)
app//org.apache.flink.streaming.runtime.tasks.StreamTask.performCheckpoint(StreamTask.java:1241)
app//org.apache.flink.streaming.runtime.tasks.StreamTask.triggerCheckpointOnBarrier(StreamTask.java:1198)
app//org.apache.flink.streaming.runtime.io.checkpointing.CheckpointBarrierHandler.notifyCheckpoint(CheckpointBarrierHandler.java:147)
app//org.apache.flink.streaming.runtime.io.checkpointing.SingleCheckpointBarrierHandler.triggerCheckpoint(SingleCheckpointBarrierHandler.java:287)
app//org.apache.flink.streaming.runtime.io.checkpointing.SingleCheckpointBarrierHandler.access$100(SingleCheckpointBarrierHandler.java:64)
app//org.apache.flink.streaming.runtime.io.checkpointing.SingleCheckpointBarrierHandler$ControllerImpl.triggerGlobalCheckpoint(SingleCheckpointBarrierHandler.java:493)
app//org.apache.flink.streaming.runtime.io.checkpointing.AbstractAlignedBarrierHandlerState.triggerGlobalCheckpoint(AbstractAlignedBarrierHandlerState.java:74)
app//org.apache.flink.streaming.runtime.io.checkpointing.AbstractAlignedBarrierHandlerState.barrierReceived(AbstractAlignedBarrierHandlerState.java:66)
app//org.apache.flink.streaming.runtime.io.checkpointing.SingleCheckpointBarrierHandler.lambda$processBarrier$2(SingleCheckpointBarrierHandler.java:234)
app//org.apache.flink.streaming.runtime.io.checkpointing.SingleCheckpointBarrierHandler$$Lambda$2010/0x0000000840efd040.apply(Unknown Source)
app//org.apache.flink.streaming.runtime.io.checkpointing.SingleCheckpointBarrierHandler.markCheckpointAlignedAndTransformState(SingleCheckpointBarrierHandler.java:262)
app//org.apache.flink.streaming.runtime.io.checkpointing.SingleCheckpointBarrierHandler.processBarrier(SingleCheckpointBarrierHandler.java:231)
app//org.apache.flink.streaming.runtime.io.checkpointing.CheckpointedInputGate.handleEvent(CheckpointedInputGate.java:181)
app//org.apache.flink.streaming.runtime.io.checkpointing.CheckpointedInputGate.pollNext(CheckpointedInputGate.java:159)
app//org.apache.flink.streaming.runtime.io.AbstractStreamTaskNetworkInput.emitNext(AbstractStreamTaskNetworkInput.java:110)
app//org.apache.flink.streaming.runtime.io.StreamOneInputProcessor.processInput(StreamOneInputProcessor.java:65)
app//org.apache.flink.streaming.runtime.tasks.StreamTask.processInput(StreamTask.java:519)
app//org.apache.flink.streaming.runtime.tasks.StreamTask$$Lambda$1068/0x00000008409dfc40.runDefaultAction(Unknown Source)
app//org.apache.flink.streaming.runtime.tasks.mailbox.MailboxProcessor.runMailboxLoop(MailboxProcessor.java:203)
app//org.apache.flink.streaming.runtime.tasks.StreamTask.runMailboxLoop(StreamTask.java:804)
app//org.apache.flink.streaming.runtime.tasks.StreamTask.invoke(StreamTask.java:753)
app//org.apache.flink.runtime.taskmanager.Task$$Lambda$1951/0x0000000840e47840.run(Unknown Source)
app//org.apache.flink.runtime.taskmanager.Task.runWithSystemExitMonitoring(Task.java:948)
app//org.apache.flink.runtime.taskmanager.Task.restoreAndInvoke(Task.java:927)
app//org.apache.flink.runtime.taskmanager.Task.doRun(Task.java:741)
app//org.apache.flink.runtime.taskmanager.Task.run(Task.java:563)
java.base#11.0.15/java.lang.Thread.run(Unknown Source)

Related

Flink Task Manager Suddenly Crashed

Our Flink TaskManager suddenly crashed after running for 3 months, with the following stack trace.
2021-12-05 07:22:05,369 WARN org.apache.flink.runtime.taskmanager.Task [] - Task 'GlobalWindowAggregate(groupBy=[org, $f4], window=[HOP(slice_end=[$slice_end], size=[15 min], slide=[1 min])], select=[org, $f4, COUNT(distinct$0 count$0) AS $f2, COUNT(count1$1) AS window_start, start('w$) AS window_end]) -> Calc(select=[window_start, window_end, org, $f4, $f2 AS $f4_0]) (1/24)#6' did not react to cancelling signal for 30 seconds, but is stuck in method:
org.apache.flink.runtime.io.network.partition.consumer.BufferManager.notifyBufferAvailable(BufferManager.java:296)
org.apache.flink.runtime.io.network.buffer.LocalBufferPool.fireBufferAvailableNotification(LocalBufferPool.java:507)
org.apache.flink.runtime.io.network.buffer.LocalBufferPool.recycle(LocalBufferPool.java:494)
org.apache.flink.runtime.io.network.buffer.LocalBufferPool.recycle(LocalBufferPool.java:460)
org.apache.flink.runtime.io.network.buffer.NetworkBuffer.deallocate(NetworkBuffer.java:182)
org.apache.flink.shaded.netty4.io.netty.buffer.AbstractReferenceCountedByteBuf.handleRelease(AbstractReferenceCountedByteBuf.java:110)
org.apache.flink.shaded.netty4.io.netty.buffer.AbstractReferenceCountedByteBuf.release(AbstractReferenceCountedByteBuf.java:100)
org.apache.flink.runtime.io.network.buffer.NetworkBuffer.recycleBuffer(NetworkBuffer.java:156)
org.apache.flink.runtime.io.network.partition.consumer.BufferManager$AvailableBufferQueue.addExclusiveBuffer(BufferManager.java:399)
org.apache.flink.runtime.io.network.partition.consumer.BufferManager.recycle(BufferManager.java:200)
org.apache.flink.runtime.io.network.buffer.NetworkBuffer.deallocate(NetworkBuffer.java:182)
org.apache.flink.shaded.netty4.io.netty.buffer.AbstractReferenceCountedByteBuf.handleRelease(AbstractReferenceCountedByteBuf.java:110)
org.apache.flink.shaded.netty4.io.netty.buffer.AbstractReferenceCountedByteBuf.release(AbstractReferenceCountedByteBuf.java:100)
org.apache.flink.runtime.io.network.buffer.NetworkBuffer.recycleBuffer(NetworkBuffer.java:156)
org.apache.flink.runtime.io.network.api.serialization.SpillingAdaptiveSpanningRecordDeserializer.getNextRecord(SpillingAdaptiveSpanningRecordDeserializer.java:95)
org.apache.flink.streaming.runtime.io.AbstractStreamTaskNetworkInput.emitNext(AbstractStreamTaskNetworkInput.java:95)
org.apache.flink.streaming.runtime.io.StreamOneInputProcessor.processInput(StreamOneInputProcessor.java:66)
org.apache.flink.streaming.runtime.tasks.StreamTask.processInput(StreamTask.java:423)
org.apache.flink.streaming.runtime.tasks.StreamTask$$Lambda$615/1465249724.runDefaultAction(Unknown Source)
org.apache.flink.streaming.runtime.tasks.mailbox.MailboxProcessor.runMailboxLoop(MailboxProcessor.java:204)
org.apache.flink.streaming.runtime.tasks.StreamTask.runMailboxLoop(StreamTask.java:681)
org.apache.flink.streaming.runtime.tasks.StreamTask.executeInvoke(StreamTask.java:636)
org.apache.flink.streaming.runtime.tasks.StreamTask$$Lambda$1480/994476387.run(Unknown Source)
org.apache.flink.streaming.runtime.tasks.StreamTask.runWithCleanUpOnFail(StreamTask.java:647)
org.apache.flink.streaming.runtime.tasks.StreamTask.invoke(StreamTask.java:620)
org.apache.flink.runtime.taskmanager.Task.doRun(Task.java:779)
org.apache.flink.runtime.taskmanager.Task.run(Task.java:566)
java.lang.Thread.run(Thread.java:748)
2021-12-05 07:22:05,370 INFO org.apache.flink.runtime.taskexecutor.slot.TaskSlotTableImpl [] - Free slot TaskSlot(index:7, state:ALLOCATED, resource profile: ResourceProfile{cpuCores=2.0000000000000000, taskHeapMemory=2.656gb (2852126690 bytes), taskOffHeapMemory=0 bytes, managedMemory=1.875gb (2013265950 bytes), networkMemory=128.000mb (134217728 bytes)}, allocationId: 2b2d5beb481130d88a1eaaa0d3be2f7d, jobId: a5ed6a11efac85d315195eb9e7534316).
2021-12-05 07:22:05,370 INFO org.apache.flink.runtime.taskmanager.Task [] - Attempting to fail task externally GlobalWindowAggregate(groupBy=[org, $f4], window=[HOP(slice_end=[$slice_end], size=[15 min], slide=[1 min])], select=[org, $f4, COUNT(distinct$0 count$0) AS $f2, COUNT(count1$1) AS window_start, start('w$) AS window_end]) -> Calc(select=[window_start, window_end, org, $f4, $f2 AS $f4_0]) (1/24)#6 (5e34a8de7bcff882f37c073f250c2594).
2021-12-05 07:22:05,370 INFO org.apache.flink.runtime.taskmanager.Task [] - Task GlobalWindowAggregate(groupBy=[org, $f4], window=[HOP(slice_end=[$slice_end], size=[15 min], slide=[1 min])], select=[org, $f4, COUNT(distinct$0 count$0) AS $f2, COUNT(count1$1) AS window_start, start('w$) AS window_end]) -> Calc(select=[window_start, window_end, org, $f4, $f2 AS $f4_0]) (1/24)#6 is already in state CANCELING
2021-12-05 07:22:05,372 INFO org.apache.flink.runtime.taskexecutor.slot.TaskSlotTableImpl [] - Free slot TaskSlot(index:7, state:RELEASING, resource profile: ResourceProfile{cpuCores=2.0000000000000000, taskHeapMemory=2.656gb (2852126690 bytes), taskOffHeapMemory=0 bytes, managedMemory=1.875gb (2013265950 bytes), networkMemory=128.000mb (134217728 bytes)}, allocationId: 2b2d5beb481130d88a1eaaa0d3be2f7d, jobId: a5ed6a11efac85d315195eb9e7534316).
2021-12-05 07:22:15,362 ERROR org.apache.flink.runtime.taskexecutor.TaskManagerRunner [] - Terminating TaskManagerRunner with exit code 1.
org.apache.flink.util.FlinkException: Unexpected failure during runtime of TaskManagerRunner.
at org.apache.flink.runtime.taskexecutor.TaskManagerRunner.runTaskManager(TaskManagerRunner.java:382) ~[flink-dist_2.12-1.13.1.jar:1.13.1]
at org.apache.flink.runtime.taskexecutor.TaskManagerRunner.lambda$runTaskManagerProcessSecurely$3(TaskManagerRunner.java:413) ~[flink-dist_2.12-1.13.1.jar:1.13.1]
at org.apache.flink.runtime.security.contexts.NoOpSecurityContext.runSecured(NoOpSecurityContext.java:28) ~[flink-dist_2.12-1.13.1.jar:1.13.1]
at org.apache.flink.runtime.taskexecutor.TaskManagerRunner.runTaskManagerProcessSecurely(TaskManagerRunner.java:413) [flink-dist_2.12-1.13.1.jar:1.13.1]
at org.apache.flink.runtime.taskexecutor.TaskManagerRunner.runTaskManagerProcessSecurely(TaskManagerRunner.java:396) [flink-dist_2.12-1.13.1.jar:1.13.1]
at org.apache.flink.runtime.taskexecutor.TaskManagerRunner.main(TaskManagerRunner.java:354) [flink-dist_2.12-1.13.1.jar:1.13.1]
Caused by: java.util.concurrent.TimeoutException
at org.apache.flink.runtime.concurrent.FutureUtils$Timeout.run(FutureUtils.java:1255) ~[flink-dist_2.12-1.13.1.jar:1.13.1]
at org.apache.flink.runtime.concurrent.DirectExecutorService.execute(DirectExecutorService.java:217) ~[flink-dist_2.12-1.13.1.jar:1.13.1]
at org.apache.flink.runtime.concurrent.FutureUtils.lambda$orTimeout$15(FutureUtils.java:582) ~[flink-dist_2.12-1.13.1.jar:1.13.1]
at java.util.concurrent.Executors$RunnableAdapter.call(Executors.java:511) ~[?:1.8.0_232]
at java.util.concurrent.FutureTask.run(FutureTask.java:266) ~[?:1.8.0_232]
at java.util.concurrent.ScheduledThreadPoolExecutor$ScheduledFutureTask.access$201(ScheduledThreadPoolExecutor.java:180) ~[?:1.8.0_232]
at java.util.concurrent.ScheduledThreadPoolExecutor$ScheduledFutureTask.run(ScheduledThreadPoolExecutor.java:293) ~[?:1.8.0_232]
at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1149) ~[?:1.8.0_232]
at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624) ~[?:1.8.0_232]
at java.lang.Thread.run(Thread.java:748) ~[?:1.8.0_232]
2021-12-05 07:22:15,365 INFO org.apache.flink.runtime.blob.TransientBlobCache [] - Shutting down BLOB cache
2021-12-05 07:22:15,365 INFO org.apache.flink.runtime.blob.PermanentBlobCache [] - Shutting down BLOB cache
2021-12-05 07:22:15,365 INFO org.apache.flink.runtime.state.TaskExecutorLocalStateStoresManager [] - Shutting down TaskExecutorLocalStateStoresManager.
2021-12-05 07:22:15,365 INFO org.apache.flink.runtime.filecache.FileCache [] - removed file cache directory /tmp/flink-dist-cache-9fad861a-b657-4625-a184-db126c423c2f
While debugging, I found on the Datadog dashboard that input and output buffer usage had reached 100%.
I also found that the last 2 checkpoints failed with the message "Checkpoint expired before completing". The checkpoint timeout is 2 minutes.
How can I fix this issue?
Checkpoint timeouts are generally caused by either
1. backpressure causing the checkpoint barriers to progress too slowly across the execution graph, or
2. some sort of bottleneck preventing Flink from writing fast enough to the checkpoint storage (e.g., network starvation, insufficient IOPS quota).
It looks like you are using unaligned checkpointing. This should help with point 1 above, but it could be making point 2 a problem, since unaligned checkpoints increase the amount of data being checkpointed (by up to about 1 GB in your case, it looks like).
You might just want to increase the checkpoint timeout. Having checkpoints time out is almost never helpful.
But it also appears that you have significant backpressure. Figuring out what's causing that and doing something about it should help. (If you can upgrade to Flink 1.13 (or later) the improved backpressure monitoring will make this easier.) Perhaps you have data skew, or perhaps you need to scale up the cluster.

Problem during state restore; when Flink job is submitted

We are getting the exception copied at the end of this post. It is thrown when a new Flink job is submitted and Flink tries to restore the previous state.
Environment:
Flink version: 1.10.1
State persistence: Hadoop 3.3
Zookeeper 3.5.8
Parallelism: 4
The code implements DataStream Transformation functions: ProcessFunction -> KeySelector -> ProcessFunction. Inbound messages are partitioned by key "sourceId" which is a part of the exception stack trace. SourceId is String type and is unique.
Caused by: com.esotericsoftware.kryo.KryoException: java.lang.IndexOutOfBoundsException: Index: 109, Size: 10
Serialization trace:
sourceId (com.contineo.ext.flink.core.ThingState)
We have overridden the "org.apache.flink.streaming.api.functions.ProcessFunction.open()" method.
Any help is appreciated.
Exception stack trace:
2021-01-19 19:59:56,934 INFO org.apache.flink.runtime.checkpoint.CheckpointCoordinator - Checkpoint triggering task Source: Custom Source -> Process -> Process (3/4) of job c957f40043721b5cab3161991999a7ed is not in state RUNNING but DEPLOYING instead. Aborting checkpoint.
2021-01-19 19:59:57,358 INFO org.apache.flink.runtime.executiongraph.ExecutionGraph - Process -> Sink: Unnamed (4/4) (b2605627c2fffc83dd412b3e7565244d) switched from RUNNING to FAILED.
java.lang.Exception: Exception while creating StreamOperatorStateContext.
at org.apache.flink.streaming.api.operators.StreamTaskStateInitializerImpl.streamOperatorStateContext(StreamTaskStateInitializerImpl.java:191)
at org.apache.flink.streaming.api.operators.AbstractStreamOperator.initializeState(AbstractStreamOperator.java:255)
at org.apache.flink.streaming.runtime.tasks.StreamTask.initializeStateAndOpen(StreamTask.java:989)
at org.apache.flink.streaming.runtime.tasks.StreamTask.lambda$beforeInvoke$0(StreamTask.java:453)
at org.apache.flink.streaming.runtime.tasks.StreamTaskActionExecutor$SynchronizedStreamTaskActionExecutor.runThrowing(StreamTaskActionExecutor.java:94)
at org.apache.flink.streaming.runtime.tasks.StreamTask.beforeInvoke(StreamTask.java:448)
at org.apache.flink.streaming.runtime.tasks.StreamTask.invoke(StreamTask.java:460)
at org.apache.flink.runtime.taskmanager.Task.doRun(Task.java:708)
at org.apache.flink.runtime.taskmanager.Task.run(Task.java:533)
at java.lang.Thread.run(Thread.java:748)
Caused by: org.apache.flink.util.FlinkException: Could not restore keyed state backend for LegacyKeyedProcessOperator_c27dcf7b54ef6bfd6cff02ca8870b681_(4/4) from any of the 1 provided restore options.
at org.apache.flink.streaming.api.operators.BackendRestorerProcedure.createAndRestore(BackendRestorerProcedure.java:135)
at org.apache.flink.streaming.api.operators.StreamTaskStateInitializerImpl.keyedStatedBackend(StreamTaskStateInitializerImpl.java:304)
at org.apache.flink.streaming.api.operators.StreamTaskStateInitializerImpl.streamOperatorStateContext(StreamTaskStateInitializerImpl.java:131)
... 9 more
Caused by: org.apache.flink.runtime.state.BackendBuildingException: Failed when trying to restore heap backend
at org.apache.flink.runtime.state.heap.HeapKeyedStateBackendBuilder.build(HeapKeyedStateBackendBuilder.java:116)
at org.apache.flink.runtime.state.filesystem.FsStateBackend.createKeyedStateBackend(FsStateBackend.java:529)
at org.apache.flink.streaming.api.operators.StreamTaskStateInitializerImpl.lambda$keyedStatedBackend$1(StreamTaskStateInitializerImpl.java:288)
at org.apache.flink.streaming.api.operators.BackendRestorerProcedure.attemptCreateAndRestore(BackendRestorerProcedure.java:142)
at org.apache.flink.streaming.api.operators.BackendRestorerProcedure.createAndRestore(BackendRestorerProcedure.java:121)
... 11 more
Caused by: com.esotericsoftware.kryo.KryoException: java.lang.IndexOutOfBoundsException: Index: 109, Size: 10
Serialization trace:
sourceId (com.contineo.ext.flink.core.ThingState)
at com.esotericsoftware.kryo.serializers.ObjectField.read(ObjectField.java:125)
at com.esotericsoftware.kryo.serializers.FieldSerializer.read(FieldSerializer.java:528)
at com.esotericsoftware.kryo.Kryo.readClassAndObject(Kryo.java:761)
at org.apache.flink.api.java.typeutils.runtime.kryo.KryoSerializer.deserialize(KryoSerializer.java:346)
at org.apache.flink.runtime.state.heap.StateTableByKeyGroupReaders.lambda$createV2PlusReader$0(StateTableByKeyGroupReaders.java:77)
at org.apache.flink.runtime.state.KeyGroupPartitioner$PartitioningResultKeyGroupReader.readMappingsInKeyGroup(KeyGroupPartitioner.java:297)
at org.apache.flink.runtime.state.heap.HeapRestoreOperation.readKeyGroupStateData(HeapRestoreOperation.java:293)
at org.apache.flink.runtime.state.heap.HeapRestoreOperation.readStateHandleStateData(HeapRestoreOperation.java:254)
at org.apache.flink.runtime.state.heap.HeapRestoreOperation.restore(HeapRestoreOperation.java:154)
at org.apache.flink.runtime.state.heap.HeapKeyedStateBackendBuilder.build(HeapKeyedStateBackendBuilder.java:114)
... 15 more
Caused by: java.lang.IndexOutOfBoundsException: Index: 109, Size: 10
at java.util.ArrayList.rangeCheck(ArrayList.java:659)
at java.util.ArrayList.get(ArrayList.java:435)
at com.esotericsoftware.kryo.util.MapReferenceResolver.getReadObject(MapReferenceResolver.java:42)
at com.esotericsoftware.kryo.Kryo.readReferenceOrNull(Kryo.java:805)
at com.esotericsoftware.kryo.Kryo.readObjectOrNull(Kryo.java:728)
at com.esotericsoftware.kryo.serializers.ObjectField.read(ObjectField.java:113)
... 24 more

Degree of parallelism in Apache Flink

Can I set a different degree of parallelism for different parts of a program in Flink?
For instance, how does Flink interpret the following sample code?
The two custom partitioners, MyPartitioner1 and MyPartitioner2, partition the input data into 4 and 2 partitions, respectively.
partitionedData1 = inputData1
    .partitionCustom(new MyPartitioner1(), 1);
env.setParallelism(4);
DataSet<Tuple2<Integer, Integer>> output1 = partitionedData1
    .mapPartition(new calculateFun());
partitionedData2 = inputData2
    .partitionCustom(new MyPartitioner2(), 2);
env.setParallelism(2);
DataSet<Tuple2<Integer, Integer>> output2 = partitionedData2
    .mapPartition(new calculateFun());
I get the following error for this code:
Exception in thread "main" org.apache.flink.runtime.client.JobExecutionException: Job execution failed.
at org.apache.flink.runtime.jobmanager.JobManager$$anonfun$receiveWithLogMessages$1.applyOrElse(JobManager.scala:314)
at scala.runtime.AbstractPartialFunction$mcVL$sp.apply$mcVL$sp(AbstractPartialFunction.scala:33)
at scala.runtime.AbstractPartialFunction$mcVL$sp.apply(AbstractPartialFunction.scala:33)
at scala.runtime.AbstractPartialFunction$mcVL$sp.apply(AbstractPartialFunction.scala:25)
at org.apache.flink.runtime.ActorLogMessages$$anon$1.apply(ActorLogMessages.scala:36)
at org.apache.flink.runtime.ActorLogMessages$$anon$1.apply(ActorLogMessages.scala:29)
at scala.PartialFunction$class.applyOrElse(PartialFunction.scala:118)
at org.apache.flink.runtime.ActorLogMessages$$anon$1.applyOrElse(ActorLogMessages.scala:29)
at akka.actor.Actor$class.aroundReceive(Actor.scala:465)
at org.apache.flink.runtime.jobmanager.JobManager.aroundReceive(JobManager.scala:92)
at akka.actor.ActorCell.receiveMessage(ActorCell.scala:516)
at akka.actor.ActorCell.invoke(ActorCell.scala:487)
at akka.dispatch.Mailbox.processMailbox(Mailbox.scala:254)
at akka.dispatch.Mailbox.run(Mailbox.scala:221)
at akka.dispatch.Mailbox.exec(Mailbox.scala:231)
at scala.concurrent.forkjoin.ForkJoinTask.doExec(ForkJoinTask.java:260)
at scala.concurrent.forkjoin.ForkJoinPool$WorkQueue.runTask(ForkJoinPool.java:1339)
at scala.concurrent.forkjoin.ForkJoinPool.runWorker(ForkJoinPool.java:1979)
at scala.concurrent.forkjoin.ForkJoinWorkerThread.run(ForkJoinWorkerThread.java:107)
Caused by: java.lang.ArrayIndexOutOfBoundsException: 2
at org.apache.flink.runtime.io.network.api.writer.RecordWriter.emit(RecordWriter.java:80)
at org.apache.flink.runtime.operators.shipping.OutputCollector.collect(OutputCollector.java:65)
at org.apache.flink.runtime.operators.NoOpDriver.run(NoOpDriver.java:92)
at org.apache.flink.runtime.operators.RegularPactTask.run(RegularPactTask.java:496)
at org.apache.flink.runtime.operators.RegularPactTask.invoke(RegularPactTask.java:362)
at org.apache.flink.runtime.taskmanager.Task.run(Task.java:559)
at java.lang.Thread.run(Unknown Source)
ExecutionEnvironment.setParallelism() sets the parallelism for the whole program, i.e., all operators of the program.
You can specify the parallelism for each individual operator by calling the setParallelism() method on the operator.
The ArrayIndexOutOfBoundsException is thrown because your custom partitioner returns an invalid partition number, probably because the degree of parallelism is not what it expects. The custom partitioner receives the actual parallelism of the receiver as a parameter in its partition(K key, int numPartitions) method.
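To illustrate the last point, here is a minimal sketch of partition logic that derives its result from the numPartitions argument instead of a hard-coded partition count. The class name SafePartitioning and the modulo scheme are assumptions made for this example, not the asker's MyPartitioner1 or MyPartitioner2; in a real job the method body would live inside an implementation of Flink's Partitioner interface.

```java
/**
 * Sketch of partition-index logic that always stays within [0, numPartitions).
 * SafePartitioning is a hypothetical helper; in a Flink job this logic would be
 * the body of Partitioner<Integer>.partition(key, numPartitions).
 */
final class SafePartitioning {

    private SafePartitioning() {}

    /**
     * Maps a key to a partition index using the parallelism passed in by the
     * caller, rather than a partition count fixed inside the partitioner.
     */
    static int partition(int key, int numPartitions) {
        if (numPartitions <= 0) {
            throw new IllegalArgumentException("numPartitions must be positive");
        }
        // floorMod keeps the result non-negative even for negative keys.
        return Math.floorMod(key, numPartitions);
    }
}
```

With logic like this, a partitioner written with 4 partitions in mind cannot return index 2 or 3 when the receiving operator runs with parallelism 2. To give the two branches of the program different parallelism, the per-operator call would be placed on the operator itself, for example partitionedData1.mapPartition(new calculateFun()).setParallelism(4), instead of calling env.setParallelism() between operators.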

Solr indexing not working when I try to insert 1000000 rows, but works fine when I index 400000 rows or fewer

I am using Solr 4.7.1 and trying to do a full import. My data source is a MySQL table with 10000000 rows and 20 columns.
Whenever I try to do a full import, Solr stops responding. But when I import 400000 rows or fewer, it works fine.
If I try to import more than that, Solr does not index the result: it either stops responding or shows "indexing failed". The error log says "Unable to execute query". I don't understand why the query runs fine for a smaller number of records but fails for a larger one.
My system configuration is as follows:
CPU: i7
RAM: 6 GB
OS: 64-bit Windows 7
I have not been able to figure out what the problem is. I have tried increasing max_allowed_packet to 1000M and also increasing the Java heap size.
Please help. Thanks in advance.
This is the error:
Exception while processing: playername document : SolrInputDocument(fields: []):org.apache.solr.handler.dataimport.DataImportHandlerException: Unable to execute query: SELECT player_id,firstname,lastname,value1,value2,value3,value4,value5,value6, value7,value8,value9,value10, value11,value18,value19,value20, country_id, playername_modtime,player_flag from playername WHERE 'true' != 'false' OR playername.playername_modtime > '2014-05-23 10:38:56' Processing Document # 1
at org.apache.solr.handler.dataimport.DataImportHandlerException.wrapAndThrow(DataImportHandlerException.java:71)
at org.apache.solr.handler.dataimport.JdbcDataSource$ResultSetIterator.<init>(JdbcDataSource.java:281)
at org.apache.solr.handler.dataimport.JdbcDataSource.getData(JdbcDataSource.java:238)
at org.apache.solr.handler.dataimport.JdbcDataSource.getData(JdbcDataSource.java:42)
at org.apache.solr.handler.dataimport.SqlEntityProcessor.initQuery(SqlEntityProcessor.java:59)
at org.apache.solr.handler.dataimport.SqlEntityProcessor.nextRow(SqlEntityProcessor.java:73)
at org.apache.solr.handler.dataimport.EntityProcessorWrapper.nextRow(EntityProcessorWrapper.java:243)
at org.apache.solr.handler.dataimport.DocBuilder.buildDocument(DocBuilder.java:477)
at org.apache.solr.handler.dataimport.DocBuilder.buildDocument(DocBuilder.java:416)
at org.apache.solr.handler.dataimport.DocBuilder.doFullDump(DocBuilder.java:331)
at org.apache.solr.handler.dataimport.DocBuilder.execute(DocBuilder.java:239)
at org.apache.solr.handler.dataimport.DataImporter.doFullImport(DataImporter.java:411)
at org.apache.solr.handler.dataimport.DataImporter.runCmd(DataImporter.java:483)
at org.apache.solr.handler.dataimport.DataImporter$1.run(DataImporter.java:464)
Caused by: com.mysql.jdbc.exceptions.jdbc4.CommunicationsException: Communications link failure
The last packet successfully received from the server was 130,037 milliseconds ago. The last packet sent successfully to the server was 130,038 milliseconds ago.
at sun.reflect.NativeConstructorAccessorImpl.newInstance0(Native Method)
at sun.reflect.NativeConstructorAccessorImpl.newInstance(Unknown Source)
at sun.reflect.DelegatingConstructorAccessorImpl.newInstance(Unknown Source)
at java.lang.reflect.Constructor.newInstance(Unknown Source)
at com.mysql.jdbc.Util.handleNewInstance(Util.java:409)
at com.mysql.jdbc.SQLError.createCommunicationsException(SQLError.java:1127)
at com.mysql.jdbc.MysqlIO.nextRowFast(MysqlIO.java:2288)
at com.mysql.jdbc.MysqlIO.nextRow(MysqlIO.java:2044)
at com.mysql.jdbc.MysqlIO.readSingleRowSet(MysqlIO.java:3549)
at com.mysql.jdbc.MysqlIO.getResultSet(MysqlIO.java:489)
at com.mysql.jdbc.MysqlIO.readResultsForQueryOrUpdate(MysqlIO.java:3240)
at com.mysql.jdbc.MysqlIO.readAllResults(MysqlIO.java:2411)
at com.mysql.jdbc.MysqlIO.sqlQueryDirect(MysqlIO.java:2834)
at com.mysql.jdbc.ConnectionImpl.execSQL(ConnectionImpl.java:2832)
at com.mysql.jdbc.ConnectionImpl.execSQL(ConnectionImpl.java:2781)
at com.mysql.jdbc.StatementImpl.execute(StatementImpl.java:908)
at com.mysql.jdbc.StatementImpl.execute(StatementImpl.java:788)
at org.apache.solr.handler.dataimport.JdbcDataSource$ResultSetIterator.<init>(JdbcDataSource.java:274)
... 12 more
Caused by: java.io.EOFException: Can not read response from server. Expected to read 6 bytes, read 4 bytes before connection was unexpectedly lost.
at com.mysql.jdbc.MysqlIO.readFully(MysqlIO.java:3161)
at com.mysql.jdbc.MysqlIO.nextRowFast(MysqlIO.java:2269)
... 23 more
5/23/2014 8:32:18 PM ERROR DataImporter Full Import failed:java.lang.RuntimeException: java.lang.RuntimeException: org.apache.solr.handler.dataimport.DataImportHandlerException: Unable to execute query: SELECT player_id,firstname,lastname,value1,value2,value3,value4,value5,value6, value7,value8,value9,value10, value11,value18,value19,value20, country_id, playername_modtime,player_flag from playername WHERE 'true' != 'false' OR playername.playername_modtime > '2014-05-23 10:38:56' Processing Document # 1
Last Check: 5/23/2014 8:36:34 PM
I added batchSize="-1" to data-config.xml and it worked. See the DataImportHandler FAQ:
http://wiki.apache.org/solr/DataImportHandlerFaq
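For context on why this setting helps: according to the FAQ linked above, batchSize="-1" makes the DataImportHandler pass Integer.MIN_VALUE to Statement#setFetchSize, which the MySQL driver takes as a request to stream rows one at a time instead of buffering the whole result set. The plain-JDBC sketch below shows the equivalent calls. StreamingQuery and countRows are hypothetical names introduced for illustration, and the code assumes a MySQL Connector/J connection supplied by the caller.

```java
import java.sql.Connection;
import java.sql.ResultSet;
import java.sql.SQLException;
import java.sql.Statement;

/** Hypothetical helper showing a streaming read over JDBC. */
final class StreamingQuery {

    /** Fetch size that MySQL Connector/J interprets as "stream row by row". */
    static final int STREAMING_FETCH_SIZE = Integer.MIN_VALUE;

    private StreamingQuery() {}

    /** Counts the rows of a query while streaming the result set. */
    static long countRows(Connection connection, String sql) throws SQLException {
        // Streaming requires a forward-only, read-only statement.
        try (Statement stmt = connection.createStatement(
                ResultSet.TYPE_FORWARD_ONLY, ResultSet.CONCUR_READ_ONLY)) {
            stmt.setFetchSize(STREAMING_FETCH_SIZE);
            try (ResultSet rs = stmt.executeQuery(sql)) {
                long rows = 0;
                while (rs.next()) {
                    rows++; // each row is handled as it arrives
                }
                return rows;
            }
        }
    }
}
```

Other JDBC drivers may treat this fetch size differently, so the sketch should be read as specific to MySQL.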

Heroku POSTGRESQL - "Too many connections for role" error

I've set up a connection from localhost to the Dev database on Heroku (as described in: Errors in evolutions on Heroku), and I am receiving the following error after trying to apply evolutions a couple of times:
SQLException: Unable to open a test connection to the given database. JDBC url = [URL], username = null. Terminating connection pool.
Original Exception: org.postgresql.util.PSQLException: FATAL: too many connections for role "ntnkypawxazhwo"
at org.postgresql.core.v3.ConnectionFactoryImpl.readStartupMessages(ConnectionFactoryImpl.java:469)
at org.postgresql.core.v3.ConnectionFactoryImpl.openConnectionImpl(ConnectionFactoryImpl.java:110)
at org.postgresql.core.ConnectionFactory.openConnection(ConnectionFactory.java:64)
at org.postgresql.jdbc2.AbstractJdbc2Connection.<init>(AbstractJdbc2Connection.java:123)
at org.postgresql.jdbc3.AbstractJdbc3Connection.<init>(AbstractJdbc3Connection.java:28)
at org.postgresql.jdbc3g.AbstractJdbc3gConnection.<init>(AbstractJdbc3gConnection.java:20)
at org.postgresql.jdbc4.AbstractJdbc4Connection.<init>(AbstractJdbc4Connection.java:30)
at org.postgresql.jdbc4.Jdbc4Connection.<init>(Jdbc4Connection.java:22)
at org.postgresql.Driver.makeConnection(Driver.java:391)
at org.postgresql.Driver.connect(Driver.java:265)
at play.utils.ProxyDriver.connect(ProxyDriver.scala:9)
at java.sql.DriverManager.getConnection(Unknown Source)
at java.sql.DriverManager.getConnection(Unknown Source)
at com.jolbox.bonecp.BoneCP.obtainRawInternalConnection(BoneCP.java:256)
at com.jolbox.bonecp.BoneCP.<init>(BoneCP.java:305)
at com.jolbox.bonecp.BoneCPDataSource.maybeInit(BoneCPDataSource.java:150)
at com.jolbox.bonecp.BoneCPDataSource.getConnection(BoneCPDataSource.java:112)
at play.api.db.DBApi$class.getConnection(DB.scala:64)
at play.api.db.BoneCPApi.getConnection(DB.scala:273)
at play.api.db.evolutions.Evolutions$.databaseEvolutions(Evolutions.scala:306)
at play.api.db.evolutions.Evolutions$.evolutionScript(Evolutions.scala:284)
at play.api.db.evolutions.OfflineEvolutions$.applyScript(Evolutions.scala:452)
at play.core.ReloadableApplication.handleWebCommand(ApplicationProvider.scala:175)
at play.core.server.Server$$anonfun$getHandlerFor$1.apply(Server.scala:86)
at play.core.server.Server$$anonfun$getHandlerFor$1.apply(Server.scala:86)
at scala.util.control.Exception$Catch$$anonfun$either$1.apply(Exception.scala:110)
at scala.util.control.Exception$Catch$$anonfun$either$1.apply(Exception.scala:110)
at scala.util.control.Exception$Catch.apply(Exception.scala:88)
at scala.util.control.Exception$Catch.either(Exception.scala:110)
at play.core.server.Server$class.getHandlerFor(Server.scala:86)
at play.core.server.NettyServer.getHandlerFor(NettyServer.scala:38)
at play.core.server.netty.PlayDefaultUpstreamHandler.messageReceived(PlayDefaultUpstreamHandler.scala:226)
at org.jboss.netty.channel.SimpleChannelUpstreamHandler.handleUpstream(SimpleChannelUpstreamHandler.java:75)
at org.jboss.netty.channel.DefaultChannelPipeline.sendUpstream(DefaultChannelPipeline.java:558)
at org.jboss.netty.channel.DefaultChannelPipeline$DefaultChannelHandlerContext.sendUpstream(DefaultChannelPipeline.java:777)
at org.jboss.netty.channel.Channels.fireMessageReceived(Channels.java:296)
at org.jboss.netty.handler.codec.replay.ReplayingDecoder.unfoldAndFireMessageReceived(ReplayingDecoder.java:522)
at org.jboss.netty.handler.codec.replay.ReplayingDecoder.callDecode(ReplayingDecoder.java:501)
at org.jboss.netty.handler.codec.replay.ReplayingDecoder.messageReceived(ReplayingDecoder.java:438)
at org.jboss.netty.channel.SimpleChannelUpstreamHandler.handleUpstream(SimpleChannelUpstreamHandler.java:75)
at org.jboss.netty.channel.DefaultChannelPipeline.sendUpstream(DefaultChannelPipeline.java:558)
at org.jboss.netty.channel.DefaultChannelPipeline.sendUpstream(DefaultChannelPipeline.java:553)
at org.jboss.netty.channel.Channels.fireMessageReceived(Channels.java:268)
at org.jboss.netty.channel.Channels.fireMessageReceived(Channels.java:255)
at org.jboss.netty.channel.socket.nio.NioWorker.read(NioWorker.java:343)
at org.jboss.netty.channel.socket.nio.NioWorker.processSelectedKeys(NioWorker.java:274)
at org.jboss.netty.channel.socket.nio.NioWorker.run(NioWorker.java:194)
at org.jboss.netty.util.ThreadRenamingRunnable.run(ThreadRenamingRunnable.java:102)
at org.jboss.netty.util.internal.DeadLockProofWorker$1.run(DeadLockProofWorker.java:42)
at java.util.concurrent.ThreadPoolExecutor.runWorker(Unknown Source)
at java.util.concurrent.ThreadPoolExecutor$Worker.run(Unknown Source)
at java.lang.Thread.run(Unknown Source)
Dev databases have a fixed number of available connections (20 or so). How can I make sure I am properly closing my connections?
You can use the JDBC settings of Play to reduce the number of connections. Try setting only 1 partition to start:
db.default.partitionCount=1
and keep tweaking to limit time and number of connections per partition.
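On the question of closing connections properly, a common JDBC pattern is to scope every connection with try-with-resources so that close() runs even when the work throws. The sketch below is a generic illustration, not Play's own API: ConnectionScope, ConnectionSource and SqlWork are hypothetical names. With a pooled data source such as BoneCP, close() would be expected to hand the connection back to the pool rather than tear it down.

```java
import java.sql.Connection;
import java.sql.SQLException;

/** Hypothetical helper that guarantees a connection is closed after use. */
final class ConnectionScope {

    /** Supplies a connection, for example from a connection pool. */
    @FunctionalInterface
    interface ConnectionSource {
        Connection open() throws SQLException;
    }

    /** Work to run while the connection is open. */
    @FunctionalInterface
    interface SqlWork<T> {
        T apply(Connection connection) throws SQLException;
    }

    private ConnectionScope() {}

    /** Opens a connection, runs the work, and always closes the connection. */
    static <T> T withConnection(ConnectionSource source, SqlWork<T> work)
            throws SQLException {
        // try-with-resources calls close() on both normal and exceptional exit.
        try (Connection connection = source.open()) {
            return work.apply(connection);
        }
    }
}
```

Routing all database access through one helper of this kind makes it harder to leak connections, which matters when the database only allows a small, fixed number of them.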
