I have a simple Apache Flink job:
**DataSource (Apache Kafka) - Filter - KeyBy - CEP Pattern (with timer) - PatternProcessFunction - KeyedProcessFunction (*here I have a ValueState(Boolean) and register a timer for 5 minutes. If the ValueState is not null, I update it (emitting nothing to the collector) and reset the timer. If the ValueState is null, I store TRUE in the state, emit the input event to the collector, and set the timer. When the onTimer method fires, I clear my ValueState*) - Sink (Apache Kafka)**.
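To make the per-key logic concrete, here is a framework-free Python simulation of that KeyedProcessFunction step (the class name, the `out` list, and the explicit `now` parameter are illustrative stand-ins, not Flink API):

```python
import time

class DedupSimulator:
    """Sketch of the KeyedProcessFunction described above:
    per-key Boolean state plus a 5-minute cleanup timer."""

    TTL_SECONDS = 5 * 60  # mirrors the 5-minute timer

    def __init__(self):
        self.state = {}   # key -> True      (stands in for ValueState(Boolean))
        self.timers = {}  # key -> deadline  (stands in for the registered timer)

    def process_element(self, key, event, out, now=None):
        now = time.time() if now is None else now
        if self.state.get(key) is None:
            # State is null: store TRUE and emit the event downstream.
            self.state[key] = True
            out.append(event)
        # State already set: only refresh it and the timer, emit nothing.
        self.timers[key] = now + self.TTL_SECONDS

    def on_timer(self, key, now):
        # Timer fired: clear the per-key state, as in onTimer().
        if key in self.timers and now >= self.timers[key]:
            del self.state[key]
            del self.timers[key]
```

The first event per key is emitted; subsequent events within the 5-minute window only refresh the timer, and the state is dropped once the timer fires.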
Job settings:
**Checkpointing interval: 5000ms**
**Incremental checkpointing: true**
**Semantic: Exactly Once**
**State Backend: RocksDB**
**Parallelism: 4**
Logically the job works perfectly, but I have some problems.
I ran two tests on my cluster (2 JobManagers and 3 TaskManagers):
**First test:**
I started my job against an empty Apache Kafka topic and saw in the Flink Web UI under **Checkpointing Statistics:**
1) Latest Acknowledgement - Trigger Time = 5000ms (matching my checkpoint interval)
2) State size = 340 KB at each 5-second interval
3) All statuses were Completed (blue).
**Second test:**
I started sending JSON messages with distinct keys (from "1" to Integer.MAX_VALUE) to the Apache Kafka topic at 1000 messages/sec, and then saw in the Flink Web UI **Checkpointing Statistics:**
1) Latest Acknowledgement - Trigger Time = 1-6 minutes
**My question #1: Why is this time growing? Is it bad or OK?**
2) State size grew constantly. I sent messages to Kafka for about 10 minutes (1000 × 60 × 10 = 600,000 messages); afterwards the state size was 100-150 MB.
3) After I stopped sending, I waited about an hour and saw:
Latest Acknowledgement - Trigger Time = 5000ms (matching my checkpoint interval)
State size stayed at 100-150 MB at each 5-second interval.
**My question #2: Why doesn't the state size decrease? I checked my job logs and saw 600,000 records saying that the ValueState for each *key* was cleared (the onTimer method completed successfully), and the job logic (see the description of my KeyedProcessFunction) worked fine.**
What have I tried?
1) setting a pause between checkpoints
2) disabling incremental checkpoints
3) enabling async checkpoints (in flink-conf.yaml)
None of these made any difference!
**My question #3: What should I do? On the production server the rate is *10 million messages/hour*, and the checkpoint size grows immediately.**
Related
I have a flink job with a global window and custom process.
The process fails after ~10 minutes with the following error:
java.io.InterruptedIOException
This is my job:
SingleOutputStreamOperator<CustomEntry> result = stream
        .keyBy(r -> r.getId())
        .window(GlobalWindows.create())
        .trigger(new CustomTriggeringFunction())
        .process(new CustomProcessingFunction());
The CustomProcessingFunction runs for a long time (more than 10 minutes), but after 10 minutes the process stops and fails with an InterruptedIOException.
Is it possible to increase the timeout of a Flink job?
From Flink's point of view, that's an unreasonably long period of time for a user function to run. What is this window process function doing that takes more than 10 minutes? Perhaps you can restructure this to use the async i/o operator instead, so you aren't completely blocking the pipeline.
That said, 10 minutes is the default checkpoint timeout interval, and you're preventing checkpoints from being able to complete while this function is running. So you could experiment with increasing execution.checkpointing.timeout.
If the job is failing because checkpoints are timing out, that will help. Alternatively, you can increase execution.checkpointing.tolerable-failed-checkpoints from its default (0).
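As a sketch, those two settings might look like this in flink-conf.yaml (the 30-minute timeout and the value of 3 are illustrative, not recommendations):

```yaml
# flink-conf.yaml -- illustrative values only
execution.checkpointing.timeout: 30 min
execution.checkpointing.tolerable-failed-checkpoints: 3
```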
I have a cluster with:
1 TaskManager
1 StandaloneJob / JobManager
Config: taskmanager.numberOfTaskSlots: 1
If I set default.parallelism: 4 on a job with the Flink PubSub source, I keep getting this error when starting my "job cluster"/taskmanager:
[analytics-job-cluster-7bd4586ccb-s5hmp job] 2019-05-01 16:22:30,888 INFO org.apache.flink.runtime.checkpoint.CheckpointCoordinator - Checkpoint triggering task Source: Custom Source -> Process -> Timestamps/Watermarks -> app_events (1/4) of job 00000000000000000000000000000000 is not in state RUNNING but SCHEDULED instead. Aborting checkpoint.
However, if I point the same job at a bunch of files, it works perfectly. What does this mean?
So, the issue is that you basically need numberOfTaskSlots equal to your parallelism. In this case, if you have only 1 TaskManager with only 1 task slot, Flink cannot start the job properly because there are simply not enough slots for it. If you set numberOfTaskSlots for the given TaskManager equal to the parallelism, it should work fine.
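For this scenario (a single TaskManager and parallelism 4), the config change would look roughly like this in flink-conf.yaml; the values just mirror the numbers above:

```yaml
# flink-conf.yaml
taskmanager.numberOfTaskSlots: 4
parallelism.default: 4
```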
I send emails with cron job and task queue usage. The job is executed every 15 minutes and the queue used has the following setup:
- name: send-emails
  rate: 1/m
  max_concurrent_requests: 1
  retry_parameters:
    task_retry_limit: 0
But quite often an apiproxy_errors.OverQuotaError exception occurs. I checked Quota Details and see that I am still within the daily quotas (Recipients Emailed, Attachment Data Sent, etc.), and I believe I couldn't be over the per-minute maximum, since the rate I use is just 1 task per minute (i.e., send no more than 1 mail per minute).
Where am I wrong and what should I check?
How many emails are you sending? You have not set a bucket-size, so it defaults to 5. Your rate sets how often the bucket is replenished. So, with your current configuration, you can send 5 emails every minute. That means if you are sending more than 75 emails to the queue every 15 minutes, the queue will fill up, and eventually go over quota.
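A quick back-of-the-envelope check of those numbers, following the reading above in which the full bucket is replenished at the configured rate, so the effective throughput is bucket_size per minute:

```python
# Token-bucket arithmetic for the queue above, under the reading that
# the bucket (default size 5) is topped up once per minute.
bucket_size = 5        # default when bucket_size is not set in queue.yaml
window_minutes = 15    # the cron job runs every 15 minutes
emails_per_minute = bucket_size
capacity_per_window = emails_per_minute * window_minutes
print(capacity_per_window)  # enqueue more than this per 15-minute window
                            # and the queue falls behind
```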
I have not tried this myself, but when you catch the apiproxy_errors.OverQuotaError exception, does the message contain any detail as to why it is over quota/which quota has been exceeded?
try:
    send_mail_here()  # placeholder for the actual send_mail call
except apiproxy_errors.OverQuotaError as message:
    logging.error(message)
My program fetches ~100 entries in a loop, all using get_by_key_name(). Appstats shows that some get_by_key_name() calls take as much as 750ms (other large values are 355ms, 260ms, and 230ms), while the average for other fetches ranges from 30ms to 100ms. These times are real time and hence contribute towards 'ms', not 'cpu_ms'.
Because of this, the total time taken to return the web page is very high: ms=5754, whereas cpu_ms=1472. (The above times recur for back-to-back requests.)
Environment: Python 2.7, webapp2, jinja2, High Replication, no other concurrent requests to the server, frontend instance class is F1, no memcache set yet, max idle instances is automatic, min pending latency is automatic, using db (NOT NDB).
Any help will be greatly appreciated, as I based my whole database design on fetching entries from the datastore using only get_by_key_name()!
Update:
I tried profiling using time.clock() before and immediately after every get_by_key_name() call. The difference I get from time.clock() for every single call is 10ms! (To clarify: get_by_key_name() is called on different Kinds.)
According to time.clock(), the total execution time (in wall-clock time) is 660ms. But the real time is 5754 (=ms), and cpu_ms is 1472 per the GAE logs.
Summary of Questions:
*[Update: this was addressed by passing a list of keys] Why is get_by_key_name() taking that long?*
Why is ms (5754) so much more than cpu_ms (1472)? Is task execution halted/waiting for 75% (1 − 1472/5754) of the time, which is why the real (wall-clock) time is so long as far as the end user is concerned?
If the above is true, why does time.clock() show that only 660ms (wall-clock time) elapsed between the start of the first get_by_key_name() request and the last (~100th), although GAE reports this time as 5754ms?
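One detail worth noting here: on Unix, Python 2's time.clock() measured CPU time rather than wall-clock time, so it would not count time the process spends blocked (for example, waiting on datastore RPCs), which may contribute to the discrepancy. A minimal illustration of that distinction in modern Python, where the CPU-time reading is spelled time.process_time():

```python
import time

# Blocked time (sleeping here; RPC waits in the question) shows up in
# wall-clock time but accrues almost no process CPU time.
wall_start = time.time()
cpu_start = time.process_time()
time.sleep(0.2)  # stand-in for an I/O or RPC wait
wall_elapsed = time.time() - wall_start
cpu_elapsed = time.process_time() - cpu_start
```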
In Task Queues, code is executed to connect to the server side through URL Fetch.
My queue.yaml file:
queue:
- name: default
  rate: 10/m
  bucket_size: 1
With these settings, the tasks are all performed at once, simultaneously.
The requirement is that requests be separated by at least 5 seconds: tasks must run in stages more than 5 seconds apart (not in parallel).
What values should be set in queue.yaml?
You can't currently specify minimum delays between tasks in queue.yaml; you have to do it (at least partly) in your own code. For example, if you specify a bucket size of 1 (so that more than one task should never be executing at once) and make sure each task runs for at least 5 seconds (record start = time.time() at the beginning and call time.sleep(max(0, start + 5 - time.time())) at the end), this should work. If it doesn't, have each task record in the datastore the timestamp at which it finished; when a task starts, check whether the last task ended less than 5 seconds ago, and if so, terminate immediately.
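The run-at-least-5-seconds padding can be packaged as a small helper; the name pad_to_min_runtime and the injectable now/sleep parameters are just for illustration and testing:

```python
import time

def pad_to_min_runtime(start, min_seconds, now=time.time, sleep=time.sleep):
    """Sleep until at least min_seconds have elapsed since start.

    The remaining wait is (start + min_seconds) - now(), clamped at
    zero, so a task that already ran long enough is not delayed further.
    Returns the number of seconds actually slept.
    """
    remaining = (start + min_seconds) - now()
    if remaining > 0:
        sleep(remaining)
    return max(remaining, 0.0)
```

A task would call start = time.time() on entry and pad_to_min_runtime(start, 5) just before returning.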
Another way would be to store the task data in a table. In your task queue, add an id parameter. Fetch the first task from the table and pass its id to the task-queue processing servlet. In the servlet, delay for 5 seconds at the end, fetch the next task, pass its id, and so on.