App Engine cron job restarts on failures - google-app-engine

I have a cron job that restarts automatically if it fails with a 503 error.
This is not the expected behavior, and it consumes all my quota.
Is there any way to stop the retries when my crawler returns an internal server error like 50x?
The handler throws a NullPointerException and the cron job restarts automatically.
This is my cron.yaml file:
cron:
- description: "Daily crawler job"
  url: /_ah/api/followFunAdmin/v1/cron/crawler
  schedule: every 24 hours
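One way to limit the retries is a retry_parameters block in cron.yaml. A minimal sketch, assuming the documented cron retry fields (job_retry_limit, job_age_limit); the values shown are placeholders:
cron:
- description: "Daily crawler job"
  url: /_ah/api/followFunAdmin/v1/cron/crawler
  schedule: every 24 hours
  retry_parameters:
    # Cap how many times a failed run is retried.
    job_retry_limit: 1
    # Give up retrying once the failed run is older than this.
    job_age_limit: 1h
Alternatively, catching the NullPointerException in the handler and returning a 200 keeps App Engine from treating the run as failed in the first place.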

Related

Flink application TaskManager timeout exception Flink 1.11.2 running on EMR 6.2.1

We are currently running a Flink application on EMR 6.2.1, which runs on YARN. The Flink version is 1.11.2.
We're running at a parallelism of 180 with 65 task nodes in EMR. When we start up the YARN application we run the following:
flink-yarn-session --name NameOfJob --jobManagerMemory 2048 --taskManagerMemory 32000 --slots 4 --detached
When the Flink job is deployed, it takes about 15 minutes to start up and for all the subtasks to begin running. We see several restarts and the following exception:
java.util.concurrent.TimeoutException: The heartbeat of TaskManager with id X timed out
The job then starts up, but after about 20 hours of running the checkpoints start failing and we see the exception above again.
I have seen similar questions suggesting that the JobManager memory may need to be increased, but I'm struggling to find guidance on how much to increase it by or what the recommended values are. The other issue is that by the time we can look at the logs of the TaskManager that failed, it has already been killed and the logs are gone.
Any guidance on whether increasing the JobManager memory will help, and in what increments we should do it, would be appreciated.
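Two settings that are often involved in this kind of timeout are the heartbeat timeout (so slow or GC-pausing TaskManagers are not declared dead prematurely) and YARN log aggregation (so logs of killed containers can still be retrieved). A sketch, assuming flink-conf.yaml on the EMR master node; the values are assumptions, not recommendations:
# flink-conf.yaml (sketch): give TaskManagers more headroom before the
# JobManager declares their heartbeat lost (defaults are 50000 / 10000 ms).
heartbeat.timeout: 120000
heartbeat.interval: 10000
# Logs of an already-killed TaskManager container can still be pulled from
# YARN if log aggregation is enabled on the cluster:
#   yarn logs -applicationId <application-id> > all-containers.log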

Flink try to recover checkpoint from deleted directories

After cleaning old files out of the S3 bucket that stores checkpoints (files not accessed for more than a month), some of the job's processes do not start when the job is restarted or restored from a recent checkpoint, because they reference old files that are now missing.
The job runs fine and writes checkpoints normally (save path s3://flink-checkpoints/check/af8b0712ae0c1f20d2226b86e6bddb60/chk-100274).
2022-04-24 03:58:32.892 Triggering checkpoint 100273 # 1653353912890 for job af8b0712ae0c1f20d2226b86e6bddb60.
2022-04-24 03:58:55.317 Completed checkpoint 100273 for job af8b0712ae0c1f20d2226b86e6bddb60 (679053131 bytes in 22090 ms).
2022-04-24 04:03:32.892 Triggering checkpoint 100274 # 1653354212890 for job af8b0712ae0c1f20d2226b86e6bddb60.
2022-04-24 04:03:35.844 Completed checkpoint 100274 for job af8b0712ae0c1f20d2226b86e6bddb60 (9606712 bytes in 2494 ms).
After one TaskManager went down and the job restarted:
2022-04-24 04:04:40.936 Job test-job (af8b0712ae0c1f20d2226b86e6bddb60) switched from state RUNNING to RESTARTING.
2022-04-24 04:05:14.150 Job test-job (af8b0712ae0c1f20d2226b86e6bddb60) switched from state RESTARTING to RUNNING.
2022-04-24 04:05:14.198 Restoring job af8b0712ae0c1f20d2226b86e6bddb60 from latest valid checkpoint: Checkpoint 100274 # 1653354212890 for af8b0712ae0c1f20d2226b86e6bddb60.
After some time the job failed because some processes could not restore their state:
2022-04-24 04:05:17.095 Job test-job (af8b0712ae0c1f20d2226b86e6bddb60) switched from state RUNNING to RESTARTING.
2022-04-24 04:05:17.093 Process first events -> Sink: Sink to test-job (5/10) (4f9089b1015540eb6e13afe4c07fa97b) switched from RUNNING to FAILED.
java.lang.Exception: Exception while creating StreamOperatorStateContext.
Caused by: org.apache.flink.util.FlinkException: Could not restore keyed state backend for KeyedProcessOperator_f1d5710fb330fd579d15b292e305802c_(5/10) from any of the 1 provided restore options.
Caused by: org.apache.flink.runtime.state.BackendBuildingException: Caught unexpected exception.
Caused by: org.apache.flink.util.FlinkRuntimeException: Failed to download data for state handles.
Caused by: com.facebook.presto.hive.s3.PrestoS3FileSystem$UnrecoverableS3OperationException: com.amazonaws.services.s3.model.AmazonS3Exception: null (Service: Amazon S3; Status Code: 404; Error Code: NoSuchKey; Request ID: tx0000000000000f0652d11-00628c2f4a-51f03da-default; S3 Extended Request ID: 51f03da-default-default), S3 Extended Request ID: 51f03da-default-default (Path: s3://flink-checkpoints/check/e3d82336005fc40be9af536938716199/shared/64452a30-c8a0-454f-8164-34d9e70142e0)
Caused by: com.amazonaws.services.s3.model.AmazonS3Exception: null (Service: Amazon S3; Status Code: 404; Error Code: NoSuchKey; Request ID: tx0000000000000f0652d11-00628c2f4a-51f03da-default; S3 Extended Request ID: 51f03da-default-default)
If I completely cancel the job and start a new one with the savepoint path set to the last checkpoint, I get the same errors.
Why does the job, while restoring from a checkpoint in the af8b0712ae0c1f20d2226b86e6bddb60 folder, try to fetch files from the e3d82336005fc40be9af536938716199 folder, and what are the rules for clearing old checkpoints from storage?
UPDATE
I found that Flink stores the S3 paths of all the TaskManagers' RocksDB files in the chk-*/_metadata file.
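A quick way to see which paths a given checkpoint references (a sketch, assuming the AWS CLI is available and using the checkpoint path from the logs above):
# Download the checkpoint metadata and list the S3 paths it points at.
aws s3 cp s3://flink-checkpoints/check/af8b0712ae0c1f20d2226b86e6bddb60/chk-100274/_metadata .
strings _metadata | grep 's3://'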
This is something that was quite ambiguous for a long time and has recently been addressed in Flink 1.15. I would recommend reading the section 'Clarification of checkpoint and savepoint semantics' in https://flink.apache.org/news/2022/05/05/1.15-announcement.html, including the comparison it makes between checkpoints and savepoints.
The behaviour you've experienced depends on your checkpointing setup (aligned vs unaligned).
By default, cancelling a job removes its checkpoints. There is a configuration flag to control this: execution.checkpointing.externalized-checkpoint-retention. As mentioned by Martijn, normally you would resort to savepoints for controlled job upgrades / restarts.
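A minimal sketch of the retention setting (key names as in recent Flink versions; check the docs for the exact spelling in yours):
# flink-conf.yaml (sketch)
state.checkpoints.dir: s3://flink-checkpoints/check
# Keep externalized checkpoints when the job is cancelled instead of deleting them.
execution.checkpointing.externalized-checkpoint-retention: RETAIN_ON_CANCELLATION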

how to keep apache flink task and submit record when restart jobmanager

I am using Apache Flink 1.10 to batch-process my stream data. Today I moved my Apache Flink Kubernetes (v1.15.2) pod from machine 1 to machine 2 and found that all the submitted job records and the job list had disappeared. What happened? Is the submission record only kept in memory? What should I do to keep my submission records and job list when the Flink Kubernetes pod restarts? I have only found how to persist checkpoints, nothing about the jobs themselves.
If I lose the running job history, I have to upload my job jar and recreate every job, and there are a lot of jobs to recreate. Is there any way to resume the jobs automatically?
The configurations that might not be set are:
Job Manager
jobmanager.archive.fs.dir: hdfs:///completed-jobs
History Server
# Monitor the following directories for completed jobs
historyserver.archive.fs.dir: hdfs:///completed-jobs
# Refresh every 10 seconds
historyserver.archive.fs.refresh-interval: 10000
For more details, please see: https://ci.apache.org/projects/flink/flink-docs-stable/monitoring/historyserver.html#configuration
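Once those options are set, the HistoryServer runs as a separate process; a sketch, assuming a standard Flink distribution layout:
# Start the standalone HistoryServer that serves the archived jobs.
./bin/historyserver.sh start
# Its web UI listens on historyserver.web.port (8082 by default).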

How to get the name of failed flink jobs

Our Flink cluster sometimes restarts and all jobs are restarted with it. Occasionally, some jobs fail to restart and the failed count increases on the dashboard. However, the dashboard does not tell us which jobs failed.
As the total job count grows, it becomes harder to find the stopped jobs. Does anyone know how I can get the names of the failed jobs?
You could write a simple script for that which will give you the list of job names that have failed.
I am using this command to get a list of failed jobs.
$yarn application -list -appStates KILLED
Set up an alert for when your cluster restarts; after the restart, check which jobs have not come back up, and you could have alerts for those as well.
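If all jobs run under one Flink REST endpoint, the jobs overview endpoint also exposes each job's name and state; a sketch, assuming the JobManager REST API is reachable on localhost:8081 and jq is installed:
# List the names of jobs the JobManager currently reports as FAILED.
curl -s http://localhost:8081/jobs/overview | jq -r '.jobs[] | select(.state == "FAILED") | .name'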

What are the possible causes?

In Google Cloud Platform Cloud SQL, the following error log is generated every second. What are the possible causes?
Google App Engine has stopped, so I don't think it is the cause.
Aborted connection 1000 to db: 'unconnected' user: 'user-name' host: 'localhost' (Got an error reading communication packets)
