I am using Flink session cluster(Kubernetes Cluster (Session Mode)) to deploy batch jobs with HA. Inside the recovery/default/blob/ directory, directories starting with job_ is getting piled up.
drwxr-xr-x 1 flink flink 1 Nov 16 09:03 job_747a694a765d1b580a703e2785a9e3fa
Job get submitted every 1 min. But in the above above ls -ltr of /recovery/default/blob/, blobs of one of the job is not getting cleared. This job has neither completed nor failed. Also it is not listed in the list on the web ui
The log file when this happens is
2021-11-22 09:03:11,537 INFO org.apache.flink.kubernetes.highavailability.KubernetesHaServices [] - Finished cleaning up the high availability data for job 6a71a36a3c82d8a9438c9aa9ed6b8993.
2021-11-22 09:03:14,904 ERROR org.apache.flink.runtime.blob.BlobServerConnection [] - PUT operation failed
java.io.IOException: Cannot allocate memory
at java.io.FileOutputStream.writeBytes(Native Method) ~[?:1.8.0_312]
at java.io.FileOutputStream.write(FileOutputStream.java:326) ~[?:1.8.0_312]
at org.apache.flink.core.fs.local.LocalDataOutputStream.write(LocalDataOutputStream.java:55) ~[flink-dist_2.11-1.14.0.jar:1.14.0]
at org.apache.flink.shaded.guava30.com.google.common.io.ByteStreams.copy(ByteStreams.java:113) ~[flink-dist_2.11-1.14.0.jar:1.14.0]
at org.apache.flink.shaded.guava30.com.google.common.io.ByteSource.copyTo(ByteSource.java:243) ~[flink-dist_2.11-1.14.0.jar:1.14.0]
at org.apache.flink.shaded.guava30.com.google.common.io.Files.copy(Files.java:301) ~[flink-dist_2.11-1.14.0.jar:1.14.0]
at org.apache.flink.runtime.blob.FileSystemBlobStore.put(FileSystemBlobStore.java:79) ~[flink-dist_2.11-1.14.0.jar:1.14.0]
at org.apache.flink.runtime.blob.FileSystemBlobStore.put(FileSystemBlobStore.java:72) ~[flink-dist_2.11-1.14.0.jar:1.14.0]
at org.apache.flink.runtime.blob.BlobUtils.moveTempFileToStore(BlobUtils.java:385) ~[flink-dist_2.11-1.14.0.jar:1.14.0]
at org.apache.flink.runtime.blob.BlobServer.moveTempFileToStore(BlobServer.java:680) ~[flink-dist_2.11-1.14.0.jar:1.14.0]
at org.apache.flink.runtime.blob.BlobServerConnection.put(BlobServerConnection.java:350) [flink-dist_2.11-1.14.0.jar:1.14.0]
at org.apache.flink.runtime.blob.BlobServerConnection.run(BlobServerConnection.java:110) [flink-dist_2.11-1.14.0.jar:1.14.0]
It seems that Flink is not retrying this job. Is there a config which can retry this job?
Related
I'm using flink mysql connector with a single executor of 32Gb RAM, 16vCPU with 32 slots. If I run a job with parallelism 32 (job parallelism 224) that is doing temporal lookup joins with 10 MySQL tables, it starts to fail after 2-3 successful runs with below error.
org.apache.flink.runtime.JobException: Recovery is suppressed by NoRestartBackoffTimeStrategy
at org.apache.flink.runtime.executiongraph.failover.flip1.ExecutionFailureHandler.handleFailure(ExecutionFailureHandler.java:138)
at org.apache.flink.runtime.executiongraph.failover.flip1.ExecutionFailureHandler.getFailureHandlingResult(ExecutionFailureHandler.java:82)
at org.apache.flink.runtime.scheduler.DefaultScheduler.handleTaskFailure(DefaultScheduler.java:228)
at org.apache.flink.runtime.scheduler.DefaultScheduler.maybeHandleTaskFailure(DefaultScheduler.java:218)
at org.apache.flink.runtime.scheduler.DefaultScheduler.updateTaskExecutionStateInternal(DefaultScheduler.java:209)
at org.apache.flink.runtime.scheduler.SchedulerBase.updateTaskExecutionState(SchedulerBase.java:679)
at org.apache.flink.runtime.scheduler.SchedulerNG.updateTaskExecutionState(SchedulerNG.java:79)
at org.apache.flink.runtime.jobmaster.JobMaster.updateTaskExecutionState(JobMaster.java:444)
at sun.reflect.GeneratedMethodAccessor62.invoke(Unknown Source)
at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
at java.lang.reflect.Method.invoke(Method.java:498)
at org.apache.flink.runtime.rpc.akka.AkkaRpcActor.lambda$handleRpcInvocation$1(AkkaRpcActor.java:316)
at org.apache.flink.runtime.concurrent.akka.ClassLoadingUtils.runWithContextClassLoader(ClassLoadingUtils.java:83)
at org.apache.flink.runtime.rpc.akka.AkkaRpcActor.handleRpcInvocation(AkkaRpcActor.java:314)
at org.apache.flink.runtime.rpc.akka.AkkaRpcActor.handleRpcMessage(AkkaRpcActor.java:217)
at org.apache.flink.runtime.rpc.akka.FencedAkkaRpcActor.handleRpcMessage(FencedAkkaRpcActor.java:78)
at org.apache.flink.runtime.rpc.akka.AkkaRpcActor.handleMessage(AkkaRpcActor.java:163)
at akka.japi.pf.UnitCaseStatement.apply(CaseStatements.scala:24)
at akka.japi.pf.UnitCaseStatement.apply(CaseStatements.scala:20)
at scala.PartialFunction.applyOrElse(PartialFunction.scala:123)
at scala.PartialFunction.applyOrElse$(PartialFunction.scala:122)
at akka.japi.pf.UnitCaseStatement.applyOrElse(CaseStatements.scala:20)
at scala.PartialFunction$OrElse.applyOrElse(PartialFunction.scala:171)
at scala.PartialFunction$OrElse.applyOrElse(PartialFunction.scala:172)
at scala.PartialFunction$OrElse.applyOrElse(PartialFunction.scala:172)
at akka.actor.Actor.aroundReceive(Actor.scala:537)
at akka.actor.Actor.aroundReceive$(Actor.scala:535)
at akka.actor.AbstractActor.aroundReceive(AbstractActor.scala:220)
at akka.actor.ActorCell.receiveMessage(ActorCell.scala:580)
at akka.actor.ActorCell.invoke(ActorCell.scala:548)
at akka.dispatch.Mailbox.processMailbox(Mailbox.scala:270)
at akka.dispatch.Mailbox.run(Mailbox.scala:231)
at akka.dispatch.Mailbox.exec(Mailbox.scala:243)
at java.util.concurrent.ForkJoinTask.doExec(ForkJoinTask.java:289)
at java.util.concurrent.ForkJoinPool$WorkQueue.runTask(ForkJoinPool.java:1056)
at java.util.concurrent.ForkJoinPool.runWorker(ForkJoinPool.java:1692)
at java.util.concurrent.ForkJoinWorkerThread.run(ForkJoinWorkerThread.java:157)
Caused by: java.lang.IllegalArgumentException: open() failed.
at org.apache.flink.connector.jdbc.table.JdbcRowDataLookupFunction.open(JdbcRowDataLookupFunction.java:138)
at LookupFunction$55178.open(Unknown Source)
at org.apache.flink.api.common.functions.util.FunctionUtils.openFunction(FunctionUtils.java:34)
at org.apache.flink.table.runtime.operators.join.lookup.LookupJoinRunner.open(LookupJoinRunner.java:67)
at org.apache.flink.table.runtime.operators.join.lookup.LookupJoinWithCalcRunner.open(LookupJoinWithCalcRunner.java:51)
at org.apache.flink.api.common.functions.util.FunctionUtils.openFunction(FunctionUtils.java:34)
at org.apache.flink.streaming.api.operators.AbstractUdfStreamOperator.open(AbstractUdfStreamOperator.java:100)
at org.apache.flink.streaming.api.operators.ProcessOperator.open(ProcessOperator.java:56)
at org.apache.flink.streaming.runtime.tasks.RegularOperatorChain.initializeStateAndOpenOperators(RegularOperatorChain.java:110)
at org.apache.flink.streaming.runtime.tasks.StreamTask.restoreGates(StreamTask.java:711)
at org.apache.flink.streaming.runtime.tasks.StreamTaskActionExecutor$1.call(StreamTaskActionExecutor.java:55)
at org.apache.flink.streaming.runtime.tasks.StreamTask.restoreInternal(StreamTask.java:687)
at org.apache.flink.streaming.runtime.tasks.StreamTask.restore(StreamTask.java:654)
at org.apache.flink.runtime.taskmanager.Task.runWithSystemExitMonitoring(Task.java:958)
at org.apache.flink.runtime.taskmanager.Task.restoreAndInvoke(Task.java:927)
at org.apache.flink.runtime.taskmanager.Task.doRun(Task.java:766)
at org.apache.flink.runtime.taskmanager.Task.run(Task.java:575)
at java.lang.Thread.run(Thread.java:748)
Caused by: com.mysql.jdbc.exceptions.jdbc4.CommunicationsException: Communications link failure
The last packet sent successfully to the server was 0 milliseconds ago. The driver has not received any packets from the server.
at sun.reflect.NativeConstructorAccessorImpl.newInstance0(Native Method)
at sun.reflect.NativeConstructorAccessorImpl.newInstance(NativeConstructorAccessorImpl.java:62)
at sun.reflect.DelegatingConstructorAccessorImpl.newInstance(DelegatingConstructorAccessorImpl.java:45)
at java.lang.reflect.Constructor.newInstance(Constructor.java:423)
at com.mysql.jdbc.Util.handleNewInstance(Util.java:403)
at com.mysql.jdbc.SQLError.createCommunicationsException(SQLError.java:990)
at com.mysql.jdbc.MysqlIO.<init>(MysqlIO.java:335)
at com.mysql.jdbc.ConnectionImpl.coreConnect(ConnectionImpl.java:2187)
at com.mysql.jdbc.ConnectionImpl.connectOneTryOnly(ConnectionImpl.java:2220)
at com.mysql.jdbc.ConnectionImpl.createNewIO(ConnectionImpl.java:2015)
at com.mysql.jdbc.ConnectionImpl.<init>(ConnectionImpl.java:768)
at com.mysql.jdbc.JDBC4Connection.<init>(JDBC4Connection.java:47)
at sun.reflect.NativeConstructorAccessorImpl.newInstance0(Native Method)
at sun.reflect.NativeConstructorAccessorImpl.newInstance(NativeConstructorAccessorImpl.java:62)
at sun.reflect.DelegatingConstructorAccessorImpl.newInstance(DelegatingConstructorAccessorImpl.java:45)
at java.lang.reflect.Constructor.newInstance(Constructor.java:423)
at com.mysql.jdbc.Util.handleNewInstance(Util.java:403)
at com.mysql.jdbc.ConnectionImpl.getInstance(ConnectionImpl.java:385)
at com.mysql.jdbc.NonRegisteringDriver.connect(NonRegisteringDriver.java:323)
at org.apache.flink.connector.jdbc.internal.connection.SimpleJdbcConnectionProvider.getOrEstablishConnection(SimpleJdbcConnectionProvider.java:121)
at org.apache.flink.connector.jdbc.table.JdbcRowDataLookupFunction.establishConnectionAndStatement(JdbcRowDataLookupFunction.java:211)
at org.apache.flink.connector.jdbc.table.JdbcRowDataLookupFunction.open(JdbcRowDataLookupFunction.java:129)
... 17 more
Caused by: java.net.SocketException: Too many open files
at java.net.Socket.createImpl(Socket.java:478)
at java.net.Socket.getImpl(Socket.java:538)
at java.net.Socket.setTcpNoDelay(Socket.java:998)
at com.mysql.jdbc.StandardSocketFactory.configureSocket(StandardSocketFactory.java:132)
at com.mysql.jdbc.StandardSocketFactory.connect(StandardSocketFactory.java:203)
at com.mysql.jdbc.MysqlIO.<init>(MysqlIO.java:299)
... 32 more
Did Some debugging, the process list on MySQL shows ~ 2* (total job parallelism) connections, i.e. 448 connections from Task Manager IP. The output of lsof | grep mysql-cj- | wc -l on task manager also reached to 12k from 3k. But after cancelling job, sometime this number doesn't go down. Am I missing something ?
The error is mainly because there are too many connections requesting mysql at the same time. Provide several optimization ideas for reference
Consider reducing the total concurrency of tasks
By default, lookup cache is not enabled. You can enable it by setting both lookup.cache.max-rows and lookup.cache.ttl, refer to https://nightlies.apache.org/flink/flink-docs-release-1.14/docs/connectors/table/jdbc/
Flink job failed,The error information is as follows
2020-12-02 09:37:27
java.util.concurrent.CompletionException: java.lang.reflect.UndeclaredThrowableException
at java.util.concurrent.CompletableFuture.encodeThrowable(CompletableFuture.java:273)
at java.util.concurrent.CompletableFuture.completeThrowable(CompletableFuture.java:280)
at java.util.concurrent.CompletableFuture$AsyncSupply.run(CompletableFuture.java:1592)
at java.util.concurrent.Executors$RunnableAdapter.call(Executors.java:511)
at java.util.concurrent.FutureTask.run(FutureTask.java:266)
at java.util.concurrent.ScheduledThreadPoolExecutor$ScheduledFutureTask.access$201(ScheduledThreadPoolExecutor.java:180)
at java.util.concurrent.ScheduledThreadPoolExecutor$ScheduledFutureTask.run(ScheduledThreadPoolExecutor.java:293)
at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1149)
at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624)
at java.lang.Thread.run(Thread.java:748)
Caused by: java.lang.reflect.UndeclaredThrowableException
at com.sun.proxy.$Proxy41.submitTask(Unknown Source)
at org.apache.flink.runtime.jobmaster.RpcTaskManagerGateway.submitTask(RpcTaskManagerGateway.java:77)
at org.apache.flink.runtime.executiongraph.Execution.lambda$deploy$9(Execution.java:735)
at java.util.concurrent.CompletableFuture$AsyncSupply.run(CompletableFuture.java:1590)
... 7 more
Caused by: java.io.IOException: The rpc invocation size exceeds the maximum akka framesize.
at org.apache.flink.runtime.rpc.akka.AkkaInvocationHandler.createRpcInvocationMessage(AkkaInvocationHandler.java:270)
at org.apache.flink.runtime.rpc.akka.AkkaInvocationHandler.invokeRpc(AkkaInvocationHandler.java:200)
at org.apache.flink.runtime.rpc.akka.AkkaInvocationHandler.invoke(AkkaInvocationHandler.java:129)
... 11 more
The logic of this job is simple,Consumption data of Kafka is saved to Clickhouse.
Start command
flink run -m yarn-cluster -p 2 -ys 2 -yjm 2048 -ytm 2048 -ynm xx --class xx /data/flink/lib/xx.jar -name --input --groupId xx --bootstrapServers xx:9092 --CheckpointInterval 60000 --CheckpointTimeout 600000 --clientId xx
Why is that? thanks
The exception means the payload of message(JM submits task to TM) exceeds max size. Try to increase the max size by adding akka.framesize to flink-conf.yaml.
The default for this is: 10485760b. Try to set a bigger number for this. Probably needing to restart the JM/TM or Flink cluster.
Doc: https://ci.apache.org/projects/flink/flink-docs-release-1.12/deployment/config.html#akka-framesize
i just configure my flink cluster (1.0.3 version) and i whanted to run tersort with the flink framework, but its me returned the next erreor, i used this comamnde to run my programm:
/bin/flink run -c --class es.udc.gac.flinkbench.ScalaTeraSort flinkbench-assembly-1.0.jar hdfs://192.168.3.89:8020/teragen/10mo hdfs://192.168.3.89:8020/teragen/rstflink
and its me return this :
Cluster configuration: Standalone cluster with JobManager at localhost/127.0.0.1:6123
Using address localhost:6123 to connect to JobManager.
JobManager web interface address http://localhost:8081
Starting execution of program
2017-06-23 10:28:43,692 INFO org.apache.hadoop.mapreduce.lib.input.FileInputFormat - Total input paths to process : 2
Spent 2278ms computing base-splits.
Spent 2ms computing TeraScheduler splits.
Computing input splits took 2281ms
Sampling 2 splits of 2
Making -1 from 100000 sampled records
------------------------------------------------------------
The program finished with the following exception:
org.apache.flink.client.program.ProgramInvocationException: The main method caused an error.
at org.apache.flink.client.program.PackagedProgram.callMainMethod(PackagedProgram.java:545)
at org.apache.flink.client.program.PackagedProgram.invokeInteractiveModeForExecution(PackagedProgram.java:419)
at org.apache.flink.client.program.ClusterClient.run(ClusterClient.java:339)
at org.apache.flink.client.CliFrontend.executeProgram(CliFrontend.java:831)
at org.apache.flink.client.CliFrontend.run(CliFrontend.java:256)
at org.apache.flink.client.CliFrontend.parseParameters(CliFrontend.java:1073)
at org.apache.flink.client.CliFrontend$2.call(CliFrontend.java:1120)
at org.apache.flink.client.CliFrontend$2.call(CliFrontend.java:1117)
at org.apache.flink.runtime.security.HadoopSecurityContext$1.run(HadoopSecurityContext.java:43)
at java.security.AccessController.doPrivileged(Native Method)
at javax.security.auth.Subject.doAs(Subject.java:422)
at org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.java:1656)
at org.apache.flink.runtime.security.HadoopSecurityContext.runSecured(HadoopSecurityContext.java:40)
at org.apache.flink.client.CliFrontend.main(CliFrontend.java:1116)
Caused by: java.lang.NegativeArraySizeException
at es.udc.gac.flinkbench.terasort.TeraInputFormat$TextSampler.createPartitions(TeraInputFormat.java:103)
at es.udc.gac.flinkbench.terasort.TeraInputFormat.writePartitionFile(TeraInputFormat.java:183)
at es.udc.gac.flinkbench.ScalaTeraSort$.main(ScalaTeraSort.scala:49)
at es.udc.gac.flinkbench.ScalaTeraSort.main(ScalaTeraSort.scala)
at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:62)
at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
at java.lang.reflect.Method.invoke(Method.java:498)
at org.apache.flink.client.program.PackagedProgram.callMainMethod(PackagedProgram.java:528)
... 13 more
I have 4 SOLR indexes running on a single node Hadoop environment.
The VM went out of space so i freed some space on the OS but after i restarted the VM , the indexes were not starting.
The VM where I am running SOLR has 64 GB of RAM
I am getting solr.DirectUpdateHandler2 failed to instantiate error
I need help on this as my system is not working anymore due to this issue and the backup i have is old.
This is the error:
null:org.apache.solr.common.SolrException: SolrCore 'EMAIL_DOMAINS_shard1_replica1' is not available due to init failure: Error Instantiating Update Handler, solr.DirectUpdateHandler2 failed to instantiate org.apache.solr.update.UpdateHandler
at org.apache.solr.core.CoreContainer.getCore(CoreContainer.java:752)
at org.apache.solr.servlet.SolrDispatchFilter.checkProps(SolrDispatchFilter.java:768)
at org.apache.solr.servlet.SolrDispatchFilter.getCoreByCollection(SolrDispatchFilter.java:742)
at org.apache.solr.servlet.SolrDispatchFilter.doFilter(SolrDispatchFilter.java:325)
at org.apache.solr.servlet.SolrDispatchFilter.doFilter(SolrDispatchFilter.java:211)
at org.apache.catalina.core.ApplicationFilterChain.internalDoFilter(ApplicationFilterChain.java:235)
at org.apache.catalina.core.ApplicationFilterChain.doFilter(ApplicationFilterChain.java:206)
at org.apache.solr.servlet.SolrHadoopAuthenticationFilter$2.doFilter(SolrHadoopAuthenticationFilter.java:394)
at org.apache.hadoop.security.authentication.server.AuthenticationFilter.doFilter(AuthenticationFilter.java:589)
at org.apache.hadoop.security.token.delegation.web.DelegationTokenAuthenticationFilter.doFilter(DelegationTokenAuthenticationFilter.java:291)
at org.apache.hadoop.security.authentication.server.AuthenticationFilter.doFilter(AuthenticationFilter.java:552)
at org.apache.solr.servlet.SolrHadoopAuthenticationFilter.doFilter(SolrHadoopAuthenticationFilter.java:399)
at org.apache.catalina.core.ApplicationFilterChain.internalDoFilter(ApplicationFilterChain.java:235)
at org.apache.catalina.core.ApplicationFilterChain.doFilter(ApplicationFilterChain.java:206)
at org.apache.solr.servlet.HostnameFilter.doFilter(HostnameFilter.java:86)
at org.apache.catalina.core.ApplicationFilterChain.internalDoFilter(ApplicationFilterChain.java:235)
at org.apache.catalina.core.ApplicationFilterChain.doFilter(ApplicationFilterChain.java:206)
at org.apache.catalina.core.StandardWrapperValve.invoke(StandardWrapperValve.java:233)
at org.apache.catalina.core.StandardContextValve.invoke(StandardContextValve.java:191)
at org.apache.catalina.core.StandardHostValve.invoke(StandardHostValve.java:127)
at org.apache.catalina.valves.ErrorReportValve.invoke(ErrorReportValve.java:103)
at org.apache.catalina.core.StandardEngineValve.invoke(StandardEngineValve.java:109)
at org.apache.catalina.connector.CoyoteAdapter.service(CoyoteAdapter.java:293)
at org.apache.coyote.http11.Http11Processor.process(Http11Processor.java:861)
at org.apache.coyote.http11.Http11Protocol$Http11ConnectionHandler.process(Http11Protocol.java:620)
at org.apache.tomcat.util.net.JIoEndpoint$Worker.run(JIoEndpoint.java:489)
at java.lang.Thread.run(Thread.java:745)
Caused by: org.apache.solr.common.SolrException: Error Instantiating Update Handler, solr.DirectUpdateHandler2 failed to instantiate org.apache.solr.update.UpdateHandler
at org.apache.solr.core.SolrCore.<init>(SolrCore.java:893)
at org.apache.solr.core.SolrCore.<init>(SolrCore.java:663)
at org.apache.solr.core.CoreContainer.create(CoreContainer.java:498)
at org.apache.solr.core.CoreContainer$1.call(CoreContainer.java:262)
at org.apache.solr.core.CoreContainer$1.call(CoreContainer.java:256)
at java.util.concurrent.FutureTask.run(FutureTask.java:262)
at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1145)
at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:615)
... 1 more
Caused by: org.apache.solr.common.SolrException: Error Instantiating Update Handler, solr.DirectUpdateHandler2 failed to instantiate org.apache.solr.update.UpdateHandler
at org.apache.solr.core.SolrCore.createInstance(SolrCore.java:581)
at org.apache.solr.core.SolrCore.createUpdateHandler(SolrCore.java:637)
at org.apache.solr.core.SolrCore.<init>(SolrCore.java:855)
... 8 more
Caused by: java.lang.reflect.InvocationTargetException
at sun.reflect.NativeConstructorAccessorImpl.newInstance0(Native Method)
at sun.reflect.NativeConstructorAccessorImpl.newInstance(NativeConstructorAccessorImpl.java:57)
at sun.reflect.DelegatingConstructorAccessorImpl.newInstance(DelegatingConstructorAccessorImpl.java:45)
at java.lang.reflect.Constructor.newInstance(Constructor.java:526)
at org.apache.solr.core.SolrCore.createInstance(SolrCore.java:567)
... 10 more
Caused by: java.lang.OutOfMemoryError: unable to create new native thread
at java.lang.Thread.start0(Native Method)
at java.lang.Thread.start(Thread.java:714)
at org.apache.hadoop.hdfs.DFSOutputStream.start(DFSOutputStream.java:2359)
at org.apache.hadoop.hdfs.DFSOutputStream.newStreamForAppend(DFSOutputStream.java:1934)
at org.apache.hadoop.hdfs.DFSClient.callAppend(DFSClient.java:1853)
at org.apache.hadoop.hdfs.DFSClient.append(DFSClient.java:1878)
at org.apache.hadoop.hdfs.DFSClient.append(DFSClient.java:1871)
at org.apache.hadoop.hdfs.DistributedFileSystem$4.doCall(DistributedFileSystem.java:329)
at org.apache.hadoop.hdfs.DistributedFileSystem$4.doCall(DistributedFileSystem.java:325)
at org.apache.hadoop.fs.FileSystemLinkResolver.resolve(FileSystemLinkResolver.java:81)
at org.apache.hadoop.hdfs.DistributedFileSystem.append(DistributedFileSystem.java:325)
at org.apache.hadoop.fs.FileSystem.append(FileSystem.java:1172)
at org.apache.solr.update.HdfsTransactionLog.<init>(HdfsTransactionLog.java:93)
at org.apache.solr.update.HdfsUpdateLog.init(HdfsUpdateLog.java:200)
at org.apache.solr.update.UpdateHandler.<init>(UpdateHandler.java:136)
at org.apache.solr.update.UpdateHandler.<init>(UpdateHandler.java:94)
at org.apache.solr.update.DirectUpdateHandler2.<init>(DirectUpdateHandler2.java:100)
... 15 more
In the exception is clearly written that you have an Out Of Memory (OOM) problem.
Caused by: java.lang.OutOfMemoryError: unable to create new native thread
Given that you already had an out of space, please double check if you have already freed enough disk space.
After this, even if there are many reasons for an OOM, the common reason is that your JVM hasn't enough memory so double check your heap size configuration (i.e. –Xms and –Xmx).
If freeing more disk space and allocate more memory heap doesn't resolve your problem, I suggest also to:
check and if you need to increase user limits (ulimit –a), this means user max thread limits, max open processes and max open files limits.
check threads system wide settings and, even in this case, raise the limit if you need it.
On the other hand, rarely happens that you have allocated so much heap space or so much resource that the operating system has no enough room to run your application.
In conclusion just try to find the right balance between allocated resources and application needs.
We upgraded Cassandra (5+5 nodes) 2.0.9 to 2.1.2 (binaries) and ran nodetool upgradesstables one-by-one (bash script), after this we observe some problems:
on every node we observe about 50 "Pending Tasks" on one of them more than 500, it has persist for 5 days - when we started nodetool upgradesstables, even if concurrent_compactors is set to 8 cassandra never run more than 3-4 at the same time. One node with more than 500 tasks pending has about 11k files in column family directory...
we have 2 ssd disks but during compacting there is up to 10MB/s reads and maximum 5MB/s writes - even if compaction_throughput_mb_per_sec is set to 32 or 64 or 256
during upgradesstables on some tables got :
WARN [RMI TCP Connection(100)-10.64.72.34] 2014-12-21 23:53:18,953 ColumnFamilyStore.java:2492 - Unable to cancel in-progress compactions for reco_active_items_v1. Perhaps there is an unusually large row in progress somewhere, or the system is simply overloaded.
INFO [RMI TCP Connection(100)-10.64.72.34] 2014-12-21 23:53:18,953 CompactionManager.java:247 - Aborting operation on reco_prod.reco_active_items_v1 after failing to interrupt other compaction operations
nodetool is failing with:
Aborted upgrading sstables for atleast one column family in keyspace reco_prod, check server logs for more information.
on some nodes nodetool upgradesstables finished succefully but still can see jb files in column family directory.
nodetool upgradesstables on some nodes returns:
error: null
-- StackTrace --
java.lang.NullPointerException
at org.apache.cassandra.io.sstable.SSTableReader.cloneWithNewStart(SSTableReader.java:952)
at org.apache.cassandra.io.sstable.SSTableRewriter.moveStarts(SSTableRewriter.java:250)
at org.apache.cassandra.io.sstable.SSTableRewriter.switchWriter(SSTableRewriter.java:300)
at org.apache.cassandra.io.sstable.SSTableRewriter.abort(SSTableRewriter.java:186)
at org.apache.cassandra.db.compaction.CompactionTask.runWith(CompactionTask.java:204)
at org.apache.cassandra.io.util.DiskAwareRunnable.runMayThrow(DiskAwareRunnable.java:48)
at org.apache.cassandra.utils.WrappedRunnable.run(WrappedRunnable.java:28)
at org.apache.cassandra.db.compaction.CompactionTask.executeInternal(CompactionTask.java:75)
at org.apache.cassandra.db.compaction.AbstractCompactionTask.execute(AbstractCompactionTask.java:59)
at org.apache.cassandra.db.compaction.CompactionManager$4.execute(CompactionManager.java:340)
at org.apache.cassandra.db.compaction.CompactionManager$2.call(CompactionManager.java:267)
at java.util.concurrent.FutureTask.run(FutureTask.java:262)
at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1145)
at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:615)
at java.lang.Thread.run(Thread.java:745)
This is our production env (24h) and we observe higher load on nodes , higher read latency even more than 1 sec.
Any advise...?