Flink job is interrupted after 10 minutes

I have a Flink job with a global window and a custom process function.
The process fails after about 10 minutes with the following error:
java.io.InterruptedIOException
This is my job:
SingleOutputStreamOperator<CustomEntry> result = stream
.keyBy(r -> r.getId())
.window(GlobalWindows.create())
.trigger(new CustomTriggeringFunction())
.process(new CustomProcessingFunction());
The CustomProcessingFunction runs for a long time (more than 10 minutes), but after about 10 minutes the process is stopped and fails with InterruptedIOException.
Is it possible to increase the timeout of a Flink job?

From Flink's point of view, that's an unreasonably long time for a user function to run. What is this window process function doing that takes more than 10 minutes? Perhaps you can restructure it to use the async I/O operator instead, so you aren't completely blocking the pipeline.
That said, 10 minutes is the default checkpoint timeout, and you're preventing checkpoints from completing while this function is running. So you could experiment with increasing execution.checkpointing.timeout.
If the job is failing because checkpoints are timing out, that will help. You could also increase execution.checkpointing.tolerable-failed-checkpoints from its default (0). An illustrative configuration sketch follows.
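For example, both settings can go in flink-conf.yaml; the values below are illustrative, not recommendations:

```yaml
# flink-conf.yaml (illustrative values)
execution.checkpointing.timeout: 30 min
execution.checkpointing.tolerable-failed-checkpoints: 3
```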

Related

Why is apache flink checkpoint size very large?

I have a simple Apache Flink job:
**DataSource (Apache Kafka) - Filter - KeyBy - CEP Pattern (with timer) - PatternProcessFunction - KeyedProcessFunction (*here I have a ValueState&lt;Boolean&gt; and register a 5-minute timer. If the ValueState is not null, I update the ValueState (emitting nothing to the collector) and update the timer. If the ValueState is null, I store TRUE in the state, emit the input event to the collector, and set the timer. When the onTimer method fires, I clear my ValueState.*) - Sink (Apache Kafka)**.
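A minimal PyFlink sketch of the logic described above (the class, state, and constant names are my assumptions, not the asker's actual code):

```python
from pyflink.common import Types
from pyflink.datastream import KeyedProcessFunction, RuntimeContext
from pyflink.datastream.state import ValueStateDescriptor

FIVE_MINUTES_MS = 5 * 60 * 1000

class MarkAndClearFunction(KeyedProcessFunction):
    """Emits the first event per key, suppresses the rest, and
    forgets the key when the 5-minute timer fires."""

    def open(self, runtime_context: RuntimeContext):
        self.seen = runtime_context.get_state(
            ValueStateDescriptor("seen", Types.BOOLEAN()))

    def process_element(self, value, ctx):
        now = ctx.timer_service().current_processing_time()
        if self.seen.value() is None:
            # First event for this key: remember it and emit downstream.
            self.seen.update(True)
            yield value
        else:
            # Key already seen: refresh the state, emit nothing.
            self.seen.update(True)
        # Register a timer 5 minutes from now. Note: this does not cancel
        # earlier timers; truly "updating" a timer requires deleting the
        # previously registered one first.
        ctx.timer_service().register_processing_time_timer(now + FIVE_MINUTES_MS)

    def on_timer(self, timestamp, ctx):
        self.seen.clear()  # forget the key once the timer fires
```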
Job settings:
**Checkpointing interval: 5000ms**
**Incremental checkpointing: true**
**Semantic: Exactly Once**
**State Backend: RocksDB**
**Parallelism: 4**
Logically my job works perfectly, but I have some problems.
I ran two tests on my cluster (2 JobManagers and 3 TaskManagers):
**First test:**
I started my job, connected to an empty Apache Kafka topic, and saw these **Checkpointing Statistics** in the Flink Web UI:
1) Latest Acknowledgement - Trigger Time = 5000 ms (matching my checkpoint interval)
2) State size = 340 KB at each 5-second interval
3) All checkpoints completed (blue).
**Second test:**
I started sending JSON messages with distinct keys (from "1" up to Integer.MAX_VALUE) to the Apache Kafka topic at a rate of 1000 messages/sec. Then I saw these **Checkpointing Statistics** in the Flink Web UI:
1) Latest Acknowledgement - Trigger Time = 1-6 minutes
**My question #1: Why is this time growing? Is it bad or OK?**
2) State size was constantly growing. I sent messages to Kafka for about 10 minutes (1000 x 60 x 10 = 600,000 messages). After sending, the state size was 100-150 MB.
3) After sending, I waited about an hour and saw that:
Latest Acknowledgement - Trigger Time = 5000 ms (matching my checkpoint interval)
State size was still 100-150 MB at each 5-second interval.
**My question #2: Why doesn't it decrease? After all, I checked my job logs and saw 600,000 records saying the ValueState for a *key* was cleared (the onTimer method completed successfully), and the job logic (see the description of my KeyedProcessFunction) worked fine.**
What have I tried?
1) Setting a pause between checkpoints
2) Disabling incremental checkpoints
3) Enabling async checkpoints (in flink-conf.yaml)
None of these made any difference!
**My question #3: What should I do? On the production server the rate is *10 million messages/hour*, and the checkpoint size grows immediately.**

"Connection closed" occurs when executing a agent

"Connection closed" occurs when executing a function for data pre-processing.
The data pre-processing is as follows.
Import data points of about 30 topics from the database.( Data for 9 days every 1 minute,
60 * 24 * 9 * 30 = 388,800 values)
Convert data to a pandas dataframe for pre-processing such as missing value or resampling (this process takes the longest time)
Data processing
During the data pre-processing above, the following error occurs:
volttron.platform.vip.rmq_connection ERROR: Connection closed unexpectedly, reopening in 30 seconds.
This error is probably caused by how the VOLTTRON platform manages the agent.
Since step 2 takes more than 30 seconds, the error occurs and the VOLTTRON platform automatically restarts the agent.
Because of this, the agent cannot complete the data processing normally.
Does anyone know how to avoid this?
If this is happening during agent instantiation, I would suggest moving the pre-processing out of the init or configuration steps into a method with the @Core.receiver("onstart") decorator. This will stop the agent instantiation and configuration steps from timing out. The listener agent's onstart method can be used as an example.
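A minimal sketch of that pattern (the agent class and method names are assumptions, not the asker's actual code):

```python
from volttron.platform.vip.agent import Agent, Core

class PreprocessAgent(Agent):
    def __init__(self, **kwargs):
        super(PreprocessAgent, self).__init__(**kwargs)
        # Keep __init__ lightweight so agent instantiation and
        # configuration do not time out.

    @Core.receiver("onstart")
    def onstart(self, sender, **kwargs):
        # Heavy pre-processing runs here, after the platform has
        # finished setting the agent up.
        self.preprocess_data()

    def preprocess_data(self):
        # Hypothetical: query the ~388,800 values, build the pandas
        # DataFrame, handle missing values, resample, etc.
        pass
```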

get_by_key_name() in GAE taking as long as 750ms. Is this expected?

My program fetches ~100 entries in a loop, all using get_by_key_name(). Appstats shows that some get_by_key_name() requests take as much as 750 ms! (Other large values are 355 ms, 260 ms, and 230 ms.) The average for the other fetches ranges from 30 ms to 100 ms. These times are real time and hence contribute towards 'ms', not 'cpu_ms'.
Because of this, the total time taken to return the web page is very high: ms=5754, while cpu_ms=1472. (The times above are seen repeatedly for back-to-back requests.)
Environment: Python 2.7, webapp2, jinja2, High Replication, no other concurrent requests to the server, frontend instance class is F1, no memcache set yet, max idle instances is automatic, min pending latency is automatic, using db (NOT NDB).
Any help will be greatly appreciated, as I based my whole database design on fetching entries from the datastore using only get_by_key_name()!
Update:
I tried profiling with time.clock() before and immediately after every get_by_key_name() call. The difference I get from time.clock() for every single call is ~10 ms. (Just to clarify, get_by_key_name() is called on different kinds.)
According to time.clock(), the total execution time (in wall-clock terms) is 660 ms. But the real time is 5754 (=ms) and cpu_ms is 1472 per the GAE logs.
Summary of Questions:
*[Update: this was addressed by passing a list of key names; see the sketch after these questions] Why is get_by_key_name() taking that long?*
Why is ms (5754) so much greater than cpu_ms (1472)? Is task execution halted/waiting for 75% (1 - 1472/5754) of the time, so that the real (wall-clock) time is this long as far as the end user is concerned?
If the above is true, why does time.clock() show that only 660 ms (wall-clock time) elapsed between the start of the first get_by_key_name() request and the last (~100th), while GAE reports this time as 5754 ms?
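For reference, the batch form that the update refers to might look like this (the model and key names here are hypothetical):

```python
from google.appengine.ext import db

class Entry(db.Model):  # hypothetical model
    title = db.StringProperty()

# One batch RPC for all entries instead of ~100 sequential fetches:
key_names = ['entry-%d' % i for i in range(100)]
entries = Entry.get_by_key_name(key_names)  # list result; None for misses
```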

GAE Task Queues: how to add a delay?

In Task Queues, code is executed that connects to the server side through URL Fetch.
This is my queue.yaml file:
queue:
- name: default
  rate: 10/m
  bucket_size: 1
With these settings, the tasks all execute at once, simultaneously. The requirement is that there should be a delay of at least 5 seconds between requests: tasks must run staged, more than 5 seconds apart (not in parallel).
What values should be set in queue.yaml?
You can't currently specify minimum delays between tasks in queue.yaml; you should do it (partly) in your own code. For example, if you specify a bucket size of 1 (so that more than one task should never be executing at once) and make sure each task runs for at least 5 seconds (record start = time.time() at the beginning and call time.sleep(max(0, start + 5 - time.time())) at the end, as sketched below), this should work. If it doesn't, have each task record in the datastore the timestamp at which it finished; when a task starts, check whether the last task ended less than 5 seconds ago, and in that case terminate immediately.
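A minimal sketch of that padding idea (do_work is a hypothetical stand-in for the actual task body):

```python
import time

MIN_SPACING_SECONDS = 5

def run_task(payload):
    start = time.time()
    do_work(payload)  # hypothetical: the real work of the task
    # Pad the runtime so that, with bucket_size: 1, the next task
    # cannot start until at least 5 seconds after this one began.
    time.sleep(max(0.0, start + MIN_SPACING_SECONDS - time.time()))
```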
Another way is to store the task data in a table. In your task queue, add an id parameter. Fetch the first task from the table and pass its id to the task-queue processing servlet. In the servlet, delay for 5 seconds at the end, then fetch the next task, pass its id, and so on.

Task Queue API: ETA and Countdown

I love the new TaskQueue API.
I have a question about the ETA/countdown: if I set a new task to execute 10 minutes in the future and it is the only item in the queue, will it execute in roughly 10 minutes or will it execute straight away?
It will execute no sooner than 10 minutes from now (it may execute later if the queue is full, naturally).
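For example, with the Python taskqueue API the countdown is given in seconds (the URL here is a hypothetical handler):

```python
from google.appengine.api import taskqueue

# Enqueue a task that will not run before ~10 minutes (600 s) from now.
taskqueue.add(url='/worker/process', countdown=600)
```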
