Flink - unable to recover after yarn node termination - apache-flink

We are running flink on yarn. We were performing Disaster recovery Testing and as part of that, we manually terminated one of the nodes that had a flink application running. Once the instance was brought back up, the application went in for multiple attempts and each attempt had the following error :
AM Container for appattempt_1602902099413_0006_000027 exited with exitCode: -1000
Failing this attempt.Diagnostics: Could not obtain block: BP-986419965-xx.xx.xx.xx-1602902058651:blk_1073743332_2508
file=/user/hadoop/.flink/application_1602902099413_0006/application_1602902099413_0006-flink-conf.yaml1528536851005494481.tmp
org.apache.hadoop.hdfs.BlockMissingException:
Could not obtain block: BP-986419965-10.61.71.85-1602902058651:blk_1073743332_2508 file=/user/hadoop/.flink/application_1602902099413_0006/application_1602902099413_0006-flink-conf.yaml1528536851005494481.tmp
at org.apache.hadoop.hdfs.DFSInputStream.refetchLocations(DFSInputStream.java:1053)at
org.apache.hadoop.hdfs.DFSInputStream.chooseDataNode(DFSInputStream.java:1036)at
org.apache.hadoop.hdfs.DFSInputStream.chooseDataNode(DFSInputStream.java:1015)at
org.apache.hadoop.hdfs.DFSInputStream.blockSeekTo(DFSInputStream.java:647)at
org.apache.hadoop.hdfs.DFSInputStream.readWithStrategy(DFSInputStream.java:926)at
org.apache.hadoop.hdfs.DFSInputStream.read(DFSInputStream.java:982)at
java.io.DataInputStream.read(DataInputStream.java:100)at
org.apache.hadoop.io.IOUtils.copyBytes(IOUtils.java:90)at
org.apache.hadoop.io.IOUtils.copyBytes(IOUtils.java:64)at
org.apache.hadoop.io.IOUtils.copyBytes(IOUtils.java:125)at
org.apache.hadoop.fs.FileUtil.copy(FileUtil.java:369)at
org.apache.hadoop.yarn.util.FSDownload.copy(FSDownload.java:267)at
org.apache.hadoop.yarn.util.FSDownload.access$000(FSDownload.java:63)at
org.apache.hadoop.yarn.util.FSDownload$2.run(FSDownload.java:361)at
org.apache.hadoop.yarn.util.FSDownload$2.run(FSDownload.java:359)at
java.security.AccessController.doPrivileged(Native Method)at
javax.security.auth.Subject.doAs(Subject.java:422)at
org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.java:1844)at
org.apache.hadoop.yarn.util.FSDownload.call(FSDownload.java:359)at
org.apache.hadoop.yarn.util.FSDownload.call(FSDownload.java:62)at
java.util.concurrent.FutureTask.run(FutureTask.java:266)at
java.util.concurrent.Executors$RunnableAdapter.call(Executors.java:511)at
java.util.concurrent.FutureTask.run(FutureTask.java:266)at
java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1149)at
java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624)at
java.lang.Thread.run(Thread.java:748)For
more detailed output, check the application tracking page: http://<>.compute.internal:8088/cluster/app/application_1602902099413_0006 Then click on links to logs of each attempt.
Could someone let us know what content is being stored in HDFS and if this could be redirected to S3?
Adding checkpoint related settings :
StateBackend rocksDbStateBackend = new RocksDBStateBackend("s3://Path", true);
streamExecutionEnvironment.setStateBackend(rocksDbStateBackend)
streamExecutionEnvironment.enableCheckpointing(10000);
streamExecutionEnvironment.getCheckpointConfig().setCheckpointingMode(CheckpointingMode.EXACTLY_ONCE);
streamExecutionEnvironment.getCheckpointConfig().setMinPauseBetweenCheckpoints(5000);
streamExecutionEnvironment.getCheckpointConfig().setCheckpointTimeout(60000);
streamExecutionEnvironment.getCheckpointConfig().setMaxConcurrentCheckpoints(60000);
streamExecutionEnvironment.getCheckpointConfig().enableExternalizedCheckpoints(ExternalizedCheckpointCleanup.RETAIN_ON_CANCELLATION);
streamExecutionEnvironment.getCheckpointConfig().setPreferCheckpointForRecovery(true);

I had the same issue,
I fixed with setting for hdfs-site
{
"Classification": "hdfs-site",
"Properties": {
"dfs.client.use.datanode.hostname": "true",
"dfs.replication": "2",
"dfs.namenode.replication.min": "2",
"dfs.namenode.maintenance.replication.min": "2"
}
}
I think hdfs lost data on node terminated, so i replication data on multi node.

Related

Race Condition in Sagemaker Batch Transform Job

We are facing this bug in our production environment. I have been looking for a solution for a while and I cannot seem to solve it. Any help would be appreciated.
We are using Sagemaker Batch Transform to perform inference on our machine learning models. Each job is supposed to create one instance using a docker image from our ECR container. This job then consumes a payload and starts processing it using a pytorch script. When the job is done, the script calls an API to store the results.
The issue is that when we check the cloud watch logs for a SINGLE job, we see that it is repeated. After the job is repeated multiple times, the individual instances of the same job may or may not finish and the whole operation returns with an error.
Basically, we see the following issue in our cloud watch logs and cannot seem to figure out what is causing this:
2022-04-24 19:41:47,865 [INFO ] W-model-1-stdout com.amazonaws.ml.mms.wlm.WorkerLifeCycle - Starting to Process Task: 12345678-abcd-1234-efgh-123456ab12c3
...
[The job is running and printing logs]
...
[There is no error but the job doesn't seem to run anymore, the same job seems to roll back again]
...
2022-04-24 19:52:09,522 [INFO ] W-model-1-stdout com.amazonaws.ml.mms.wlm.WorkerLifeCycle - Starting to Process Task: 12345678-abcd-1234-efgh-123456ab12c3
...
[The job is running and printing logs]
...
[There is no error but the job doesn't seem to run anymore, the same job seems to roll back again]
...
2022-04-24 20:12:11,834 [INFO ] W-model-1-stdout com.amazonaws.ml.mms.wlm.WorkerLifeCycle - Starting to Process Task: 12345678-abcd-1234-efgh-123456ab12c3
...
[The job is running and printing logs]
...
[There are no errors but the cloud watch logs stop here. Sagemaker returns an error to the client.]
The following sample code is what we are using to run the jobs:
def inference_batch(self):
batch_input = f"s3://{self.cnf.SAGEMAKER_BUCKET}/batch-input/batch.csv"
batch_output = f"s3://{self.cnf.SAGEMAKER_BUCKET}/batch-output/"
job_name = f"{self.cnf.SAGEMAKER_MODEL}-{str(datetime.datetime.now().strftime('%Y-%m-%d-%H-%m-%S'))}"
transform_input = {
'DataSource': {
'S3DataSource': {
'S3DataType': 'S3Prefix',
'S3Uri': batch_input
}
},
'ContentType': 'text/csv',
'SplitType': 'Line',
}
transform_output = {
'S3OutputPath': batch_output
}
transform_resources = {
'InstanceType': self.cnf.SAGEMAKER_BATCH_INSTANCE,
'InstanceCount': 1
}
# self.sm_boto_client is an instance of boto3.Session(region_name="some-region).client("sagemaker")
self.sm_boto_client.create_transform_job(
TransformJobName=job_name,
ModelName=self.cnf.SAGEMAKER_MODEL,
TransformInput=transform_input,
TransformOutput=transform_output,
TransformResources=transform_resources
)
status = self.sm_boto_client.describe_transform_job(TransformJobName=job_name)
print(f'Executing transform job {job_name}...')
while status['TransformJobStatus'] == 'InProgress':
time.sleep(5)
status = self.sm_boto_client.describe_transform_job(TransformJobName=job_name)
if status['TransformJobStatus'] == 'Completed':
print(f'Batch transform job {job_name} successfully completed.')
else:
raise Exception(f'Batch transform job {job_name} failed.')

org.apache.flink.runtime.checkpoint.CheckpointException: Some tasks of the job have already finished

I want to stop a flink task by rest api, and I send request: http://192.168.215.165:8081/jobs/c952ba860604a2c32a7abb9eb5b42b0d/stop ,then I got resoponse :
{
"request-id": "29c559399243c817055ebbaf7431a8d2"
}
And then I send request: http://192.168.215.165:8081/jobs/c952ba860604a2c32a7abb9eb5b42b0d/savepoints/29c559399243c817055ebbaf7431a8d2,
I got the response:(part of)
{
"status": {
"id": "COMPLETED"
},
"operation": {
"failure-cause": {
"class": "java.util.concurrent.CompletionException",
"stack-trace": "java.util.concurrent.CompletionException: org.apache.flink.runtime.checkpoint.CheckpointException: Some tasks of the job have already finished and checkpointing with finished tasks is not enabled. Failure reason: Not all required tasks are currently running.\n\tat java.util.concurrent.CompletableFuture.encodeRelay(CompletableFuture.java:326)\n\tat java.util.concurrent.CompletableFuture.completeRelay(CompletableFuture.java:338)\n\tat java.util.concurrent.CompletableFuture.uniRelay(CompletableFuture.java:925)\n\tat java.util.concurrent.CompletableFuture$UniRelay.tryFire(CompletableFuture.java:913)\n\tat java.util.concurrent.CompletableFuture.postComplete(CompletableFuture.java:488)\n\tat java.util.concurrent.CompletableFuture.completeExceptionally(CompletableFuture.java:1990)\n\tat org.apache.flink.runtime.rpc.akka.AkkaInvocationHandler.lambda$invokeRpc$0(AkkaInvocationHandler.java:246)\n\tat java.util.concurrent.CompletableFuture.uniWhenComplete(CompletableFuture.java:774)\n\tat java.util.concurrent.
How can I stop a flink job by rest API please ?
A few alternatives:
set execution.checkpointing.checkpoints-after-tasks-finish.enabled: true in your configuration (this is a somewhat experimental feature that was added in 1.14, but it should work)
modify your job so that all of the tasks are still running at the time when the job is ready to be stopped
terminate the job without taking a savepoint

Docker, Debezium not streaming data from mssql to elasticsearch

I followed this examples to stream data from mysql to elasticsearch
https://github.com/debezium/debezium-examples/tree/master/unwrap-smt#elasticsearch-sink
The example itself works great on my local machine.
But in my case I want to stream data from mssql (which is on another server, not docker) to elasticsearch.
So in the "docker-compose-es.yaml" file i removed "mysql" part and removed the mysql links.
And created my own connectors/sink for elastic and mssql:
{
"name": "Test-connector",
"config": {
"connector.class": "io.debezium.connector.sqlserver.SqlServerConnector",
"database.hostname": "192.168.1.234",
"database.port": "1433",
"database.user": "user",
"database.password": "pass",
"database.dbname": "Test",
"database.server.name": "MyServer",
"table.include.list": "dbo.TEST_A",
"database.history.kafka.bootstrap.servers": "kafka:9092",
"database.history.kafka.topic": "dbhistory.testA"
}
}
{
"name": "elastic-sink-test",
"config": {
"connector.class": "io.confluent.connect.elasticsearch.ElasticsearchSinkConnector",
"tasks.max": "1",
"topics": "TEST_A",
"connection.url": "http://localhost:9200/",
"transforms": "unwrap,key",
"transforms.unwrap.type": "io.debezium.transforms.UnwrapFromEnvelope",
"transforms.unwrap.drop.tombstones": "false",
"transforms.key.type": "org.apache.kafka.connect.transforms.ExtractField$Key",
"transforms.key.field": "SQ",
"key.ignore": "false",
"type.name": "TEST_A",
"behavior.on.null.values": "delete"
}
}
When adding these the kafka connect I/O is working hard and has over 40GB input see image below:
In the kafka logs it looks like its going through all the tables. Here is one of the table logs:
2021-06-17 10:20:10,414 - INFO [data-plane-kafka-request-handler-5:Logging#66] - [Partition MyServer.dbo.TemplateGroup-0 broker=1] Log loaded for partition MyServer.dbo.TemplateGroup-0 with initial high watermark 0
2021-06-17 10:20:10,509 - INFO [data-plane-kafka-request-handler-3:Logging#66] - Creating topic MyServer.dbo.TemplateMeter with configuration {} and initial partition assignment Map(0 -> ArrayBuffer(1))
2021-06-17 10:20:10,516 - INFO [data-plane-kafka-request-handler-3:Logging#66] - [KafkaApi-1] Auto creation of topic MyServer.dbo.TemplateMeter with 1 partitions and replication factor 1 is successful
2021-06-17 10:20:10,526 - INFO [data-plane-kafka-request-handler-7:Logging#66] - [ReplicaFetcherManager on broker 1] Removed fetcher for partitions Set(MyServer.dbo.TemplateMeter-0)
2021-06-17 10:20:10,528 - INFO [data-plane-kafka-request-handler-7:Logging#66] - [Log partition=MyServer.dbo.TemplateMeter-0, dir=/kafka/data/1] Loading producer state till offset 0 with message format version 2
The database is only 2GB. I'm not sure why it has so high input.
No test_a index was created in elasticsearch when running this command:
curl http://localhost:9200/_aliases?pretty=true
Does anyone know how I troubleshoot from here or point me to the right direction?
Thanks in advance!
how I troubleshoot from here
docker compose logs?
Modify the log4j.properties of Kafka Connect and/or Elasitcsearch processes to get more logs?
Use a regular Kafka consumer to see if data is actually read into the TEST_A topic?
in the "docker-compose-es.yaml" ....
If Debezium is running in a container, then Elasticsearch is not available at localhost:9200
Change that value to http://elastic:9200, like shown in the es-sink.json

[flink]Task manager initialization failed

I am new to flink. I am trying to run the flink example on my local PC(windows).
However, after I run the start-cluster.bat, I login to the dashboard, it shows the task manager is 0.
I checked the log and seems it fails to initialize:
2020-02-21 23:03:14,202 ERROR org.apache.flink.runtime.taskexecutor.TaskManagerRunner - TaskManager initialization failed.
org.apache.flink.configuration.IllegalConfigurationException: Failed to create TaskExecutorResourceSpec
at org.apache.flink.runtime.taskexecutor.TaskExecutorResourceUtils.resourceSpec.FromConfig(TaskExecutorResourceUtils.java:72)
at org.apache.flink.runtime.taskexecutor.TaskManagerRunner.startTaskManager(TaskManagerRunner.java:356)
at org.apache.flink.runtime.taskexecutor.TaskManagerRunner.<init>(TaskManagerRunner.java:152)
at org.apache.flink.runtime.taskexecutor.TaskManagerRunner.runTaskManager(TaskManagerRunner.java:308)
at org.apache.flink.runtime.taskexecutor.TaskManagerRunner.lambda$runTaskManagerSecurely$2(TaskManagerRunner.java:322)
at org.apache.flink.runtime.security.NoOpSecurityContext.runSecured(NoOpSecurityContext.java:30)
at org.apache.flink.runtime.taskexecutor.TaskManagerRunner.runTaskManagerSecurely(TaskManagerRunner.java:321)
at org.apache.flink.runtime.taskexecutor.TaskManagerRunner.main(TaskManagerRunner.java:287)
Caused by: org.apache.flink.configuration.IllegalConfigurationException: The required configuration option Key: 'taskmanager.cpu.cores' , default: null (fallback keys: []) is not set
at org.apache.flink.runtime.taskexecutor.TaskExecutorResourceUtils.checkConfigOptionIsSet(TaskExecutorResourceUtils.java:90)
at org.apache.flink.runtime.taskexecutor.TaskExecutorResourceUtils.lambda$checkTaskExecutorResourceConfigSet$0(TaskExecutorResourceUtils.java:84)
at java.util.Arrays$ArrayList.forEach(Arrays.java:3880)
at org.apache.flink.runtime.taskexecutor.TaskExecutorResourceUtils.checkTaskExecutorResourceConfigSet(TaskExecutorResourceUtils.java:84)
at org.apache.flink.runtime.taskexecutor.TaskExecutorResourceUtils.resourceSpecFromConfig(TaskExecutorResourceUtils.java:70)
... 7 more
2020-02-21 23:03:14,217 INFO org.apache.flink.runtime.blob.TransientBlobCache - Shutting down BLOB cache
Basically, it looks like a required option 'taskmanager.cpu.cores' is not set. However, I can't find this property in flink-conf.yaml and in the document(https://ci.apache.org/projects/flink/flink-docs-release-1.10/ops/config.html) either.
I am using flink 1.10.0. Any help would be highly appreciated!
That configuration option is intended for internal use only -- it shouldn't be user configured, which is why it isn't documented.
The windows start-cluster.bat is failing because of a bug introduced in Flink 1.10. See https://jira.apache.org/jira/browse/FLINK-15925.
One workaround is to use the bash script, start-cluster.sh, instead.
See also this mailing list thread: https://lists.apache.org/thread.html/r7693d0c06ac5ced9a34597c662bcf37b34ef8e799c32cc0edee373b2%40%3Cdev.flink.apache.org%3E

How can I read the blcokfile_xxxx

I tried to read the fabric produced blockfile_00000 which is at the directory /var/hyperledger/production/ledgersData/chains/chains/mychannel of the peer node.
But I can't read it through the method like:
configtxgen -profile TwoOrgsChannel -inspectBlock ./channel-artifacts/blockfile_000000
the error is
[common/tools/configtxgen] main -> CRIT 004 Error on inspectBlock: Could not read block ./channel-artifacts/blockfile_000000
using confitxlator
configtxlator proto_decode --input ./channel-artifacts/blockfile_000000 --type common.Block
the error is
configtxlator: error: Error decoding: error unmarshaling: proto: can't skip unknown wire type 6 for common.Block
I know the blockfile actually is chunk which is the collection of blocks, how to handle it?
configtxlator version
configtxlator:
Version: 1.2.0
Commit SHA: f6e72eb
Go version: go1.10
OS/Arch: linux/amd64
Any help would be greatly appreciated.
I use docker exec command into peer node and get block through peer channel fetch . Then, read the block by configtxlator. But how to read the transaction information.
the part log is(block 6):
"header": {
"data_hash": "kVFRQLFjY7+6l6QsL+jOgt5ICoCUlRG4VedgmBXv/mE=",
"number": "6",
"previous_hash": "GQ4w7x7MQB+Jvsa3neJcTNdU7aXdKVHySA7Va3SktOs="
},
There are APIs which can be used to query blocks for any given channel:
GetChainInfo returns the current block height for a given channel
GetBlockByNumber returns individual blocks by number (you get the latest block from the GetChainInfo API work backwards from there)
All of the SDKs have methods to invoke these APIs
configtxgen and configtxlator are meant for generating configuration transactions and translating configuration transactions into readable format respectively. Configuration Transactions include creation of channel, update anchor peers in the channel, setting the reader/writer of a channel etc. They are not meant for normal transactions which gets stored in blockfile_xxx.
You can use Hyperledger Explorer to view your block data.

Resources