org.apache.flink.runtime.checkpoint.CheckpointException: Some tasks of the job have already finished - apache-flink

I want to stop a Flink job via the REST API. I sent the request http://192.168.215.165:8081/jobs/c952ba860604a2c32a7abb9eb5b42b0d/stop and got this response:
{
  "request-id": "29c559399243c817055ebbaf7431a8d2"
}
Then I sent the request http://192.168.215.165:8081/jobs/c952ba860604a2c32a7abb9eb5b42b0d/savepoints/29c559399243c817055ebbaf7431a8d2 and got this response (in part):
{
  "status": {
    "id": "COMPLETED"
  },
  "operation": {
    "failure-cause": {
      "class": "java.util.concurrent.CompletionException",
      "stack-trace": "java.util.concurrent.CompletionException: org.apache.flink.runtime.checkpoint.CheckpointException: Some tasks of the job have already finished and checkpointing with finished tasks is not enabled. Failure reason: Not all required tasks are currently running.\n\tat java.util.concurrent.CompletableFuture.encodeRelay(CompletableFuture.java:326)\n\tat java.util.concurrent.CompletableFuture.completeRelay(CompletableFuture.java:338)\n\tat java.util.concurrent.CompletableFuture.uniRelay(CompletableFuture.java:925)\n\tat java.util.concurrent.CompletableFuture$UniRelay.tryFire(CompletableFuture.java:913)\n\tat java.util.concurrent.CompletableFuture.postComplete(CompletableFuture.java:488)\n\tat java.util.concurrent.CompletableFuture.completeExceptionally(CompletableFuture.java:1990)\n\tat org.apache.flink.runtime.rpc.akka.AkkaInvocationHandler.lambda$invokeRpc$0(AkkaInvocationHandler.java:246)\n\tat java.util.concurrent.CompletableFuture.uniWhenComplete(CompletableFuture.java:774)\n\tat java.util.concurrent.
How can I stop a Flink job via the REST API?

A few alternatives (a sketch of the REST calls follows after this list):
Set execution.checkpointing.checkpoints-after-tasks-finish.enabled: true in your configuration (this is a somewhat experimental feature that was added in 1.14, but it should work).
Modify your job so that all of its tasks are still running at the time the job is ready to be stopped.
Terminate the job without taking a savepoint.
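Below is a minimal sketch of the stop-with-savepoint flow over REST, assuming the option above is enabled (or that all tasks are still running). The host, job ID, and savepoint directory are placeholders; the endpoints are the same /stop and /savepoints/<request-id> ones you are already calling.

import time
import requests

# Placeholders: substitute your JobManager address, job ID, and savepoint directory.
BASE = "http://192.168.215.165:8081"
JOB_ID = "c952ba860604a2c32a7abb9eb5b42b0d"

# Trigger stop-with-savepoint; "drain": False stops without emitting MAX_WATERMARK.
resp = requests.post(
    f"{BASE}/jobs/{JOB_ID}/stop",
    json={"targetDirectory": "s3://my-bucket/savepoints", "drain": False},
)
request_id = resp.json()["request-id"]

# Poll the savepoint operation until it completes, then inspect the result.
while True:
    op = requests.get(f"{BASE}/jobs/{JOB_ID}/savepoints/{request_id}").json()
    if op["status"]["id"] == "COMPLETED":
        print(op["operation"])  # contains either the savepoint location or a failure-cause
        break
    time.sleep(2)

# Alternative: cancel the job without taking a savepoint.
# requests.patch(f"{BASE}/jobs/{JOB_ID}?mode=cancel")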

Related

Race Condition in Sagemaker Batch Transform Job

We are facing this bug in our production environment. I have been looking for a solution for a while and I cannot seem to solve it. Any help would be appreciated.
We are using SageMaker Batch Transform to perform inference on our machine learning models. Each job is supposed to create one instance using a Docker image from our ECR repository. The job then consumes a payload and starts processing it using a PyTorch script. When the job is done, the script calls an API to store the results.
The issue is that when we check the CloudWatch logs for a SINGLE job, we see that it is repeated. After the job is repeated multiple times, the individual instances of the same job may or may not finish and the whole operation returns with an error.
Basically, we see the following issue in our CloudWatch logs and cannot figure out what is causing it:
2022-04-24 19:41:47,865 [INFO ] W-model-1-stdout com.amazonaws.ml.mms.wlm.WorkerLifeCycle - Starting to Process Task: 12345678-abcd-1234-efgh-123456ab12c3
...
[The job is running and printing logs]
...
[There is no error but the job doesn't seem to run anymore, the same job seems to roll back again]
...
2022-04-24 19:52:09,522 [INFO ] W-model-1-stdout com.amazonaws.ml.mms.wlm.WorkerLifeCycle - Starting to Process Task: 12345678-abcd-1234-efgh-123456ab12c3
...
[The job is running and printing logs]
...
[There is no error but the job doesn't seem to run anymore, the same job seems to roll back again]
...
2022-04-24 20:12:11,834 [INFO ] W-model-1-stdout com.amazonaws.ml.mms.wlm.WorkerLifeCycle - Starting to Process Task: 12345678-abcd-1234-efgh-123456ab12c3
...
[The job is running and printing logs]
...
[There are no errors but the cloud watch logs stop here. Sagemaker returns an error to the client.]
The following sample code is what we are using to run the jobs:
import datetime
import time

def inference_batch(self):
    batch_input = f"s3://{self.cnf.SAGEMAKER_BUCKET}/batch-input/batch.csv"
    batch_output = f"s3://{self.cnf.SAGEMAKER_BUCKET}/batch-output/"
    # Note: %M (minutes), not %m (month), in the time portion of the job name
    job_name = f"{self.cnf.SAGEMAKER_MODEL}-{datetime.datetime.now().strftime('%Y-%m-%d-%H-%M-%S')}"
    transform_input = {
        'DataSource': {
            'S3DataSource': {
                'S3DataType': 'S3Prefix',
                'S3Uri': batch_input
            }
        },
        'ContentType': 'text/csv',
        'SplitType': 'Line',
    }
    transform_output = {
        'S3OutputPath': batch_output
    }
    transform_resources = {
        'InstanceType': self.cnf.SAGEMAKER_BATCH_INSTANCE,
        'InstanceCount': 1
    }
    # self.sm_boto_client is an instance of boto3.Session(region_name="some-region").client("sagemaker")
    self.sm_boto_client.create_transform_job(
        TransformJobName=job_name,
        ModelName=self.cnf.SAGEMAKER_MODEL,
        TransformInput=transform_input,
        TransformOutput=transform_output,
        TransformResources=transform_resources
    )
    status = self.sm_boto_client.describe_transform_job(TransformJobName=job_name)
    print(f'Executing transform job {job_name}...')
    while status['TransformJobStatus'] == 'InProgress':
        time.sleep(5)
        status = self.sm_boto_client.describe_transform_job(TransformJobName=job_name)
    if status['TransformJobStatus'] == 'Completed':
        print(f'Batch transform job {job_name} successfully completed.')
    else:
        raise Exception(f'Batch transform job {job_name} failed.')
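As a side note on the polling loop itself (not a fix for the repeated-job behavior): boto3 ships a built-in waiter for transform jobs, which avoids the hand-rolled sleep loop. A minimal sketch, assuming the same self.sm_boto_client and job_name as above:

# The waiter returns when the job reaches Completed or Stopped and raises a
# botocore WaiterError if the job transitions to Failed.
waiter = self.sm_boto_client.get_waiter('transform_job_completed_or_stopped')
waiter.wait(
    TransformJobName=job_name,
    WaiterConfig={'Delay': 30, 'MaxAttempts': 120}  # poll every 30s, up to ~1 hour
)
status = self.sm_boto_client.describe_transform_job(TransformJobName=job_name)
if status['TransformJobStatus'] != 'Completed':
    raise Exception(f"Batch transform job {job_name} did not complete: {status.get('FailureReason')}")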

Flink - unable to recover after yarn node termination

We are running Flink on YARN. We were performing disaster recovery testing and, as part of that, we manually terminated one of the nodes that had a Flink application running. Once the instance was brought back up, the application went through multiple attempts, and each attempt failed with the following error:
AM Container for appattempt_1602902099413_0006_000027 exited with exitCode: -1000
Failing this attempt. Diagnostics: Could not obtain block: BP-986419965-xx.xx.xx.xx-1602902058651:blk_1073743332_2508 file=/user/hadoop/.flink/application_1602902099413_0006/application_1602902099413_0006-flink-conf.yaml1528536851005494481.tmp
org.apache.hadoop.hdfs.BlockMissingException: Could not obtain block: BP-986419965-10.61.71.85-1602902058651:blk_1073743332_2508 file=/user/hadoop/.flink/application_1602902099413_0006/application_1602902099413_0006-flink-conf.yaml1528536851005494481.tmp
	at org.apache.hadoop.hdfs.DFSInputStream.refetchLocations(DFSInputStream.java:1053)
	at org.apache.hadoop.hdfs.DFSInputStream.chooseDataNode(DFSInputStream.java:1036)
	at org.apache.hadoop.hdfs.DFSInputStream.chooseDataNode(DFSInputStream.java:1015)
	at org.apache.hadoop.hdfs.DFSInputStream.blockSeekTo(DFSInputStream.java:647)
	at org.apache.hadoop.hdfs.DFSInputStream.readWithStrategy(DFSInputStream.java:926)
	at org.apache.hadoop.hdfs.DFSInputStream.read(DFSInputStream.java:982)
	at java.io.DataInputStream.read(DataInputStream.java:100)
	at org.apache.hadoop.io.IOUtils.copyBytes(IOUtils.java:90)
	at org.apache.hadoop.io.IOUtils.copyBytes(IOUtils.java:64)
	at org.apache.hadoop.io.IOUtils.copyBytes(IOUtils.java:125)
	at org.apache.hadoop.fs.FileUtil.copy(FileUtil.java:369)
	at org.apache.hadoop.yarn.util.FSDownload.copy(FSDownload.java:267)
	at org.apache.hadoop.yarn.util.FSDownload.access$000(FSDownload.java:63)
	at org.apache.hadoop.yarn.util.FSDownload$2.run(FSDownload.java:361)
	at org.apache.hadoop.yarn.util.FSDownload$2.run(FSDownload.java:359)
	at java.security.AccessController.doPrivileged(Native Method)
	at javax.security.auth.Subject.doAs(Subject.java:422)
	at org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.java:1844)
	at org.apache.hadoop.yarn.util.FSDownload.call(FSDownload.java:359)
	at org.apache.hadoop.yarn.util.FSDownload.call(FSDownload.java:62)
	at java.util.concurrent.FutureTask.run(FutureTask.java:266)
	at java.util.concurrent.Executors$RunnableAdapter.call(Executors.java:511)
	at java.util.concurrent.FutureTask.run(FutureTask.java:266)
	at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1149)
	at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624)
	at java.lang.Thread.run(Thread.java:748)
For more detailed output, check the application tracking page: http://<>.compute.internal:8088/cluster/app/application_1602902099413_0006 Then click on links to logs of each attempt.
Could someone let us know what content is being stored in HDFS and if this could be redirected to S3?
Adding the checkpoint-related settings:
StateBackend rocksDbStateBackend = new RocksDBStateBackend("s3://Path", true);
streamExecutionEnvironment.setStateBackend(rocksDbStateBackend);
streamExecutionEnvironment.enableCheckpointing(10000);
streamExecutionEnvironment.getCheckpointConfig().setCheckpointingMode(CheckpointingMode.EXACTLY_ONCE);
streamExecutionEnvironment.getCheckpointConfig().setMinPauseBetweenCheckpoints(5000);
streamExecutionEnvironment.getCheckpointConfig().setCheckpointTimeout(60000);
streamExecutionEnvironment.getCheckpointConfig().setMaxConcurrentCheckpoints(60000);
streamExecutionEnvironment.getCheckpointConfig().enableExternalizedCheckpoints(ExternalizedCheckpointCleanup.RETAIN_ON_CANCELLATION);
streamExecutionEnvironment.getCheckpointConfig().setPreferCheckpointForRecovery(true);
I had the same issue, and I fixed it with the following hdfs-site settings:
{
  "Classification": "hdfs-site",
  "Properties": {
    "dfs.client.use.datanode.hostname": "true",
    "dfs.replication": "2",
    "dfs.namenode.replication.min": "2",
    "dfs.namenode.maintenance.replication.min": "2"
  }
}
I think HDFS lost data when the node was terminated, so I replicated the data across multiple nodes.

How to check Google Cloud Scheduler status?

I have two jobs and I want to execute the second one only when the first one has completed. Both are scheduled with Cloud Scheduler.
I am calling the get API to check the status of the first job, but there is no data under the status field. Please note that I tried to fetch this data while my first job was running.
{
  "name": "projects/<project name>/locations/us-central1/jobs/<job name>",
  "description": "Sample",
  "appEngineHttpTarget": {
    "httpMethod": "GET",
    "appEngineRouting": {
      "version": "test-v1",
      "host": "test-v1.test.googleplex.com"
    },
    "relativeUri": "/api/v1/test",
    "headers": {
      "User-Agent": "AppEngine-Google; (+http://code.google.com/appengine)"
    }
  },
  "userUpdateTime": "2020-07-17T11:44:16Z",
  "state": "ENABLED",
  "status": {},
  "scheduleTime": "2020-07-18T11:00:00.834928Z",
  "lastAttemptTime": "2020-07-17T11:44:30.439092Z",
  "retryConfig": {
    "maxRetryDuration": "0s",
    "minBackoffDuration": "5s",
    "maxBackoffDuration": "3600s",
    "maxDoublings": 16
  },
  "schedule": "0 04 * * *",
  "timeZone": "America/Los_Angeles",
  "attemptDeadline": "18000s"
}
Where am I going wrong?
I was checking the documentation on this, and it looks like you need to use state to find out whether a job is running, since it returns the state of the job; however, I could not find a description in the documentation of when the job is done.
It looks like once the job has finished, the status field is populated, and status gives you the HTTP status returned by your job (so it tells you whether it failed or succeeded, along with specific information about the failure or success).
So, in this case you would need to check that state is ENABLED (which tells you the job is not paused or in some other state) and that status is populated (which means the job finished, and at the same time you will know whether it succeeded or failed).
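Here is a minimal sketch of that check using the google-cloud-scheduler Python client (the project, location, and job names are placeholders taken from the response above; this is just one way to read the fields):

# pip install google-cloud-scheduler
from google.cloud import scheduler_v1

client = scheduler_v1.CloudSchedulerClient()

# Placeholder resource name; substitute your own project/location/job.
name = "projects/<project name>/locations/us-central1/jobs/<job name>"
job = client.get_job(name=name)

print(job.state)              # e.g. State.ENABLED -- whether the job is enabled or paused
print(job.status)             # HTTP result of the last finished attempt (empty until an attempt finishes)
print(job.last_attempt_time)  # when the job last ran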

Why does the Gmail connector trigger skip new incoming emails?

In one of my Logic Apps I'm using the Gmail connector trigger "when a new email arrives", but it doesn't seem to be working.
I'm sending an email that it should detect when the trigger runs, but the trigger history simply shows the trigger as skipped and therefore not firing the rest of the workflow.
How can I debug this issue?
The following code is the trigger section of the logic app:
"triggers": {
"When_a_new_email_arrives": {
"description": "",
"inputs": {
"host": {
"connection": {
"name": "#parameters('$connections')['gmail']['connectionId']"
}
},
"method": "get",
"path": "/Mail/OnNewEmail",
"queries": {
"fetchOnlyWithAttachments": false,
"from": "secret#secret.com",
"importance": "All",
"includeAttachments": false,
"label": "INBOX",
"starred": "All",
"subject": "something"
}
},
"recurrence": {
"frequency": "Day",
"interval": 1,
"startTime": "2019-08-17T08:40:00Z"
},
"type": "ApiConnection"
}
}
It is set up to run once every day, although for testing purposes I'm running the trigger manually.
Update 1
I've tried sending the mail from my own mail address after configuring the from parameter in the trigger, and that works as intended. So I think the issue might be due to something about the sender's original mail message. I did some digging and pulled the original raw mail out of Gmail. It contains some logging information from the Gmail servers. Apparently something called DMARC authentication has failed. I wonder if this has anything to do with the problem; maybe the Gmail connector will not accept the sender's identity.
Here's the part about DMARC in the raw mail message:
Authentication-Results: mx.google.com;
spf=pass (google.com: domain of source-company#company-product.com designates 85.236.67.1 as permitted sender) smtp.mailfrom=source-company#company-product.com;
dmarc=fail (p=NONE sp=NONE dis=NONE) header.from=source-company.com
Could this be the reason the connector does not detect these mails?
For this issue I did some tests but could not reproduce it. In my logic app, I set the "How often do you want to check for items?" box to 10 minutes and did not run the trigger manually (I didn't click the "Run" button). Then I sent an email to my Gmail account, and after about 10 minutes, when the trigger checked my Gmail, the logic app ran the actions under the trigger successfully. Apart from this, if I sent two emails to my Gmail within those ten minutes, the trigger was not triggered twice; it was triggered just once.
I saw you mentioned setting it to once every day in your description. So, for example, if you completed the configuration of the logic app at 1:00 pm and your Gmail received an email at 2:00 pm, it will not run the actions under the trigger right away. The trigger will check your Gmail at 1:00 pm tomorrow, so the actions under the trigger will also run at 1:00 pm tomorrow. But when you test this logic app by running it manually, it is triggered at once when your Gmail receives an email.
I wonder if this explanation is helpful for your issue?

Elasticsearch version 1.4.2 polling configuration is not working. Why?

Hi, I am trying to poll table data every 5 seconds, but it does not work:
POST /_river/mytest_river/_meta
{
  "type": "jdbc",
  "jdbc": {
    "driver": "com.microsoft.sqlserver.jdbc.SQLServerDriver",
    "url": "jdbc:sqlserver://[my_ip];databaseName=mega",
    "user": "sa",
    "password": "******",
    "sql": "SELECT [OrderID],[CustomerName],[UserFullName],[Status] FROM [Orders_Table]",
    "poll": "5s",
    "index": "mega",
    "type": "orders_table"
  }
}
What is wrong with my configuration?
You need to use the "schedule" parameter, as described here:
Time scheduled execution of JDBC river
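A minimal sketch of the same river definition with "schedule" instead of "poll", posted via Python. The host and the cron expression are illustrative; check the JDBC river plugin documentation for the exact schedule syntax it accepts.

import json
import requests

river_meta = {
    "type": "jdbc",
    "jdbc": {
        "driver": "com.microsoft.sqlserver.jdbc.SQLServerDriver",
        "url": "jdbc:sqlserver://[my_ip];databaseName=mega",
        "user": "sa",
        "password": "******",
        "sql": "SELECT [OrderID],[CustomerName],[UserFullName],[Status] FROM [Orders_Table]",
        "schedule": "0/5 * * * * ?",  # intended as "every 5 seconds" in a Quartz-like cron; verify against the plugin docs
        "index": "mega",
        "type": "orders_table"
    }
}

resp = requests.post(
    "http://localhost:9200/_river/mytest_river/_meta",  # placeholder Elasticsearch host
    data=json.dumps(river_meta),
    headers={"Content-Type": "application/json"},
)
print(resp.status_code, resp.text)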

Resources