Unable to Mount /opt/flink/conf to flink job manager - apache-flink

I tried to run Apache Flink application as a read-only filesystem, As Flink needs to update the conf files I mounted the /opt/flink/conf, but it gives the below error:
Failure executing: POST at:
/apis/apps/v1/namespaces/flink/deployments. Message: Deployment.apps
"" is invalid:
spec.template.spec.containers[0].volumeMounts[3].mountPath: Invalid
value: "/opt/flink/conf": must be unique. Received status:
Status(apiVersion=v1, code=422,
details=StatusDetails(causes=[StatusCause(field=spec.template.spec.containers[0].volumeMounts[3].mountPath,
message=Invalid value: "/opt/flink/conf": must be unique,
reason=FieldValueInvalid, additionalProperties={})], group=apps,
kind=Deployment, name=, retryAfterSeconds=null, uid=null,
additionalProperties={}), kind=Status, message=Deployment.apps "" is
invalid: spec.template.spec.containers[0].volumeMounts[3].mountPath:
Invalid value: "/opt/flink/conf": must be unique,
metadata=ListMeta(_continue=null, remainingItemCount=null,
resourceVersion=null, selfLink=null, additionalProperties={}),
reason=Invalid, status=Failure, additionalProperties={}).
How do I run Apache Flink as a read-only file system?

Related

Mongodump not working well with Atlas cluster

When I run mongodump on one of my clusters in Atlas it throws the following error:
root#anuj-exportify-EXP02:/home/anuj/Exportify/DataMigration# mongodump --uri="mongodb+srv://<USERNAME>:<PASSWORD>#<CLUSTER-NAME>.mongodb.net/exportifydb"
2023-01-17T13:51:44.172+0530 writing exportifydb.schedules to
2023-01-17T13:51:44.173+0530 writing exportifydb.automated_logs to
2023-01-17T13:51:44.173+0530 writing exportifydb.automated_logs_bkp to
2023-01-17T13:51:44.173+0530 writing exportifydb.logs to
2023-01-17T13:51:44.190+0530 Failed: error writing data for collection `exportifydb.automated_logs` to disk: error reading collection: Failed to parse: { find: "automated_logs", skip: 0, snapshot: true, $db: "exportifydb" }. Unrecognized field 'snapshot'.
Can anyone guide me that where am I going wrong?

Molecule fails with "censored": "the output has been hidden due to the fact that 'no_log: true' was specified for this result"

I am using Ansible with Molecule. I just ran into the situation that converging my role failed with:
fatal: [instance]: FAILED! => {"censored": "the output has been hidden due to the fact that 'no_log: true' was specified for this result"}
How to mitigate? Hold on a sec, I will answer ...
What worked for me is to set logging to "true" for molecule:
Go to the "molecule.yml" file, this is where you do your configuration for molecule.
You should find it in the molecule/default/ directory
Look for the provisioner: section
Add log: true to it.
VoilĂ !
That's how it looks:
provisioner:
name: ansible
log: true
Note that there may by other settings in this very section for the provisioner.

[flink]Task manager initialization failed

I am new to flink. I am trying to run the flink example on my local PC(windows).
However, after I run the start-cluster.bat, I login to the dashboard, it shows the task manager is 0.
I checked the log and seems it fails to initialize:
2020-02-21 23:03:14,202 ERROR org.apache.flink.runtime.taskexecutor.TaskManagerRunner - TaskManager initialization failed.
org.apache.flink.configuration.IllegalConfigurationException: Failed to create TaskExecutorResourceSpec
at org.apache.flink.runtime.taskexecutor.TaskExecutorResourceUtils.resourceSpec.FromConfig(TaskExecutorResourceUtils.java:72)
at org.apache.flink.runtime.taskexecutor.TaskManagerRunner.startTaskManager(TaskManagerRunner.java:356)
at org.apache.flink.runtime.taskexecutor.TaskManagerRunner.<init>(TaskManagerRunner.java:152)
at org.apache.flink.runtime.taskexecutor.TaskManagerRunner.runTaskManager(TaskManagerRunner.java:308)
at org.apache.flink.runtime.taskexecutor.TaskManagerRunner.lambda$runTaskManagerSecurely$2(TaskManagerRunner.java:322)
at org.apache.flink.runtime.security.NoOpSecurityContext.runSecured(NoOpSecurityContext.java:30)
at org.apache.flink.runtime.taskexecutor.TaskManagerRunner.runTaskManagerSecurely(TaskManagerRunner.java:321)
at org.apache.flink.runtime.taskexecutor.TaskManagerRunner.main(TaskManagerRunner.java:287)
Caused by: org.apache.flink.configuration.IllegalConfigurationException: The required configuration option Key: 'taskmanager.cpu.cores' , default: null (fallback keys: []) is not set
at org.apache.flink.runtime.taskexecutor.TaskExecutorResourceUtils.checkConfigOptionIsSet(TaskExecutorResourceUtils.java:90)
at org.apache.flink.runtime.taskexecutor.TaskExecutorResourceUtils.lambda$checkTaskExecutorResourceConfigSet$0(TaskExecutorResourceUtils.java:84)
at java.util.Arrays$ArrayList.forEach(Arrays.java:3880)
at org.apache.flink.runtime.taskexecutor.TaskExecutorResourceUtils.checkTaskExecutorResourceConfigSet(TaskExecutorResourceUtils.java:84)
at org.apache.flink.runtime.taskexecutor.TaskExecutorResourceUtils.resourceSpecFromConfig(TaskExecutorResourceUtils.java:70)
... 7 more
2020-02-21 23:03:14,217 INFO org.apache.flink.runtime.blob.TransientBlobCache - Shutting down BLOB cache
Basically, it looks like a required option 'taskmanager.cpu.cores' is not set. However, I can't find this property in flink-conf.yaml and in the document(https://ci.apache.org/projects/flink/flink-docs-release-1.10/ops/config.html) either.
I am using flink 1.10.0. Any help would be highly appreciated!
That configuration option is intended for internal use only -- it shouldn't be user configured, which is why it isn't documented.
The windows start-cluster.bat is failing because of a bug introduced in Flink 1.10. See https://jira.apache.org/jira/browse/FLINK-15925.
One workaround is to use the bash script, start-cluster.sh, instead.
See also this mailing list thread: https://lists.apache.org/thread.html/r7693d0c06ac5ced9a34597c662bcf37b34ef8e799c32cc0edee373b2%40%3Cdev.flink.apache.org%3E

Flink job create RocksDB instance failure

I run many jobs on Flink,and backend use rocksDB,
one of my job got error and restart all night,
error message like :
java.lang.IllegalStateException: Could not initialize keyed state backend.
at org.apache.flink.streaming.api.operators.AbstractStreamOperator.initKeyedState(AbstractStreamOperator.java:330)
at org.apache.flink.streaming.api.operators.AbstractStreamOperator.initializeState(AbstractStreamOperator.java:221)
at org.apache.flink.streaming.runtime.tasks.StreamTask.initializeOperators(StreamTask.java:679)
at org.apache.flink.streaming.runtime.tasks.StreamTask.initializeState(StreamTask.java:666)
at org.apache.flink.streaming.runtime.tasks.StreamTask.invoke(StreamTask.java:253)
at org.apache.flink.runtime.taskmanager.Task.run(Task.java:708)
at java.lang.Thread.run(Thread.java:745)
Caused by: java.io.IOException: Error while opening RocksDB instance.
at org.apache.flink.contrib.streaming.state.RocksDBKeyedStateBackend.openDB(RocksDBKeyedStateBackend.java:1063)
at org.apache.flink.contrib.streaming.state.RocksDBKeyedStateBackend.access$3300(RocksDBKeyedStateBackend.java:128)
at org.apache.flink.contrib.streaming.state.RocksDBKeyedStateBackend$RocksDBIncrementalRestoreOperation.restoreInstance(RocksDBKeyedStateBackend.java:1472)
at org.apache.flink.contrib.streaming.state.RocksDBKeyedStateBackend$RocksDBIncrementalRestoreOperation.restore(RocksDBKeyedStateBackend.java:1569)
at org.apache.flink.contrib.streaming.state.RocksDBKeyedStateBackend.restore(RocksDBKeyedStateBackend.java:996)
at org.apache.flink.streaming.runtime.tasks.StreamTask.createKeyedStateBackend(StreamTask.java:775)
at org.apache.flink.streaming.api.operators.AbstractStreamOperator.initKeyedState(AbstractStreamOperator.java:319)
Caused by: org.rocksdb.RocksDBException: Corruption: Sst file size mismatch: /mnt/dfs/3/hadoop/yarn/local/usercache/sloth/appcache/application_1526888270443_0002/flink-io-84ec9962-f37f-4fbc-8262-a215984d8d70/job-1a72a5f09ac8a80914256306363505aa_op-CoStreamFlatMap_1361_4_uuid-0b019d7f-2d28-44dc-baf2-12774ed3518f/db/008919.sst. Size recorded in manifest 132174005, actual size 2674688
Sst file size mismatch: /mnt/dfs/3/hadoop/yarn/local/usercache/sloth/appcache/application_1526888270443_0002/flink-io-84ec9962-f37f-4fbc-8262-a215984d8d70/job-1a72a5f09ac8a80914256306363505aa_op-CoStreamFlatMap_1361_4_uuid-0b019d7f-2d28-44dc-baf2-12774ed3518f/db/008626.sst. Size recorded in manifest 111956797, actual size 14286848
Sst file size mismatch: /mnt/dfs/3/hadoop/yarn/local/usercache/sloth/appcache/application_1526888270443_0002/flink-io-84ec9962-f37f-4fbc-8262-a215984d8d70/job-1a72a5f09ac8a80914256306363505aa_op-CoStreamFlatMap_1361_4_uuid-0b019d7f-2d28-44dc-baf2-12774ed3518f/db/008959.sst. Size recorded in manifest 43157714, actual size 933888
at org.rocksdb.TtlDB.openCF(Native Method)
at org.rocksdb.TtlDB.open(TtlDB.java:132)
at org.apache.flink.contrib.streaming.state.RocksDBKeyedStateBackend.openDB(RocksDBKeyedStateBackend.java:1054)
... 12 more
when I found this, it kill it manually, And start it again.Then it work well.
How this error happens,I can't find any message from google or somewhere
I found the error before this exception.
No space left on device
I think this problem cost this question

Quartz clustering in camel spring DSL

I am trying to achieve "requests recovery" in fail-over scenario in two different machine with their clock also sync.
My configuration as below:
step 1: camel-context.xml
I have defined the below route in camel-context.xml file.
<route id="quartz" trace="true">
<from uri="quartz2://cluster/quartz?cron=0+0/2+++*+?&durableJob=true&stateful=true&recoverableJob=true">
<route>
step 2: quartz.properties:
I have enabled
org.quartz.jobStore.isClustered = true
org.quartz.scheduler.instanceId = AUTO
org.quartz.scheduler.instanceName =ClusteredScheduler
Currently I am running same camel application in two different instances in my local and clustering is working fine . But when I try to test the "requests recovery" I am getting below exception.
Exception :
[QuartzScheduler_ClusteredScheduler-camelContext-16308243724_ClusterManager] INFO org.quartz.impl.jdbcjobstore.JobStoreTX - ClusterManager: detected 1 failed or restarted instances.
[QuartzScheduler_ClusteredScheduler-camelContext-16308243724_ClusterManager] INFO org.quartz.impl.jdbcjobstore.JobStoreTX - ClusterManager: Scanning for instance "6308270818"'s failed in-progress jobs.
[QuartzScheduler_ClusteredScheduler-camelContext-16308243724_ClusterManager] INFO org.quartz.impl.jdbcjobstore.JobStoreTX - ClusterManager: ......Scheduled 1 recoverable job(s) for recovery.
[ClusteredScheduler-camelContext_Worker-1] WARN org.apache.camel.component.quartz2.CamelJob - Cannot find existing QuartzEndpoint with uri: quartz2://cluster/quartz?cron=0+0%2F2+*+*+*+%3F&durableJob=true&recoverableJob=true&stateful=true. Creating new endpoint instance.
[ClusteredScheduler-camelContext_Worker-1] ERROR org.apache.camel.component.quartz2.CamelJob - Failed to execute CamelJob.
**org.apache.camel.ResolveEndpointFailedException: Failed to resolve endpoint: quartz2://cluster/quartz?cron=0+0%2F2+*+*+*+%3F&durableJob=true&recoverableJob=true&stateful=true due to: Trigger key cluster.quartz is already in used by Endpoint[quartz2://cluster/quartz?cron=0+0%2F2+*+*+*+%3F&durableJob=true&recoverableJob=true&stateful=true]**
at org.apache.camel.impl.DefaultCamelContext.getEndpoint(DefaultCamelContext.java:545)
at org.apache.camel.impl.DefaultCamelContext.getEndpoint(DefaultCamelContext.java:558)
at org.apache.camel.component.quartz2.CamelJob.lookupQuartzEndpoint(CamelJob.java:123)
at org.apache.camel.component.quartz2.CamelJob.execute(CamelJob.java:49)
at org.quartz.core.JobRunShell.run(JobRunShell.java:202)
at org.quartz.simpl.SimpleThreadPool$WorkerThread.run(SimpleThreadPool.java:573)
Caused by: java.lang.IllegalArgumentException: Trigger key cluster.quartz is already in used by Endpoint[quartz2://cluster/quartz?cron=0+0%2F2+*+*+*+%3F&durableJob=true&recoverableJob=true&stateful=true]
at org.apache.camel.component.quartz2.QuartzEndpoint.ensureNoDupTriggerKey(QuartzEndpoint.java:272)
at org.apache.camel.component.quartz2.QuartzEndpoint.addJobInScheduler(QuartzEndpoint.java:254)
at org.apache.camel.component.quartz2.QuartzEndpoint.doStart(QuartzEndpoint.java:202)
at org.apache.camel.support.ServiceSupport.start(ServiceSupport.java:61)
at org.apache.camel.impl.DefaultCamelContext.startService(DefaultCamelContext.java:2158)
at org.apache.camel.impl.DefaultCamelContext.doAddService(DefaultCamelContext.java:1016)
at org.apache.camel.impl.DefaultCamelContext.addService(DefaultCamelContext.java:977)
at org.apache.camel.impl.DefaultCamelContext.addService(DefaultCamelContext.java:973)
at org.apache.camel.impl.DefaultCamelContext.getEndpoint(DefaultCamelContext.java:541)
... 5 more
After shutting down the instance1 which is currently excuting the job , instance 2 is trying to recover the job immediately but its failing to execute the job .It is picking the same job in next interval (which is fine).
My requirement is active node immediately recover the failed job.
Thanks in advance.
I think we can avoid the checking of ensureNoDupTriggerKey, if the recoverableJob is true. I just created a JIRA CAMEL-8076 for it.

Resources