Flink 1.8, parallelism > 1, source never outputs values

I have a cluster with:
1 TaskManager
1 StandaloneJob / JobManager
Config: taskmanager.numberOfTaskSlots: 1
If I set default.parallelism: 4 on a job that uses the Flink PubSub source, I keep getting this message when starting my "job cluster"/TaskManager:
[analytics-job-cluster-7bd4586ccb-s5hmp job] 2019-05-01 16:22:30,888 INFO org.apache.flink.runtime.checkpoint.CheckpointCoordinator - Checkpoint triggering task Source: Custom Source -> Process -> Timestamps/Watermarks -> app_events (1/4) of job 00000000000000000000000000000000 is not in state RUNNING but SCHEDULED instead. Aborting checkpoint.
However, if I point the same job at a bunch of files, it works perfectly. What does this mean?

The issue is that, basically, you need the total number of task slots to be at least equal to your parallelism. So in this case, if you have only 1 TaskManager with only 1 task slot, Flink will not be able to start the job properly, as there are simply not enough slots for it. If you set numberOfTaskSlots for the given TaskManager equal to the parallelism, it should work well.
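A minimal sketch of the two possible fixes, assuming a standalone setup with a single TaskManager (values are illustrative): either raise the slot count in flink-conf.yaml,

taskmanager.numberOfTaskSlots: 4

or cap the job's parallelism so it fits the slots that exist:

import org.apache.flink.streaming.api.environment.StreamExecutionEnvironment;

StreamExecutionEnvironment env = StreamExecutionEnvironment.getExecutionEnvironment();
env.setParallelism(1); // must not exceed taskManagers * numberOfTaskSlots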

Related

Flink UI Subtask vs BackPressure Tabs Incorrect Info

Task - an operator or a chain of operators.
Subtask - a parallel instance of a task. Every subtask runs in a separate thread.
Slot - a unit of resource slicing; it can execute one or more tasks.
I have 5 TMs with 16 slots each, so 80 slots in total, and 135 tasks to be scheduled.
Task X has 40 subtasks, which is what I see when I open the Subtasks tab, but when I open the BackPressure tab it shows 80 subtasks. Why?
If the above understanding is correct, both tabs should show 40 subtasks.

Why is apache flink checkpoint size very large?

I have a simple Apache Flink job:
**DataSource (Apache Kafka) - Filter - KeyBy - CEP Pattern (with timer) - PatternProcessFunction - KeyedProcessFunction (*here I have a ValueState(Boolean) and register a timer for 5 minutes. If the ValueState is not null, I update the ValueState (sending nothing to the collector) and update the timer. If the ValueState is null, I store TRUE in the state, send the input event to the collector, and set the timer. When the onTimer method fires, I clear my ValueState*) - Sink (Apache Kafka)**.
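For illustration, a minimal sketch of the KeyedProcessFunction logic just described; the event/key types (String) and the extra state tracking the pending timer timestamp are assumptions, not the OP's code (in Flink, "updating" a processing-time timer means deleting the old one and registering a fresh one):

import org.apache.flink.api.common.state.ValueState;
import org.apache.flink.api.common.state.ValueStateDescriptor;
import org.apache.flink.configuration.Configuration;
import org.apache.flink.streaming.api.functions.KeyedProcessFunction;
import org.apache.flink.util.Collector;

public class DedupWithTimer extends KeyedProcessFunction<String, String, String> {
    private transient ValueState<Boolean> seen;      // TRUE once a key has been emitted
    private transient ValueState<Long> pendingTimer; // timestamp of the registered timer

    @Override
    public void open(Configuration parameters) {
        seen = getRuntimeContext().getState(new ValueStateDescriptor<>("seen", Boolean.class));
        pendingTimer = getRuntimeContext().getState(new ValueStateDescriptor<>("timer", Long.class));
    }

    @Override
    public void processElement(String event, Context ctx, Collector<String> out) throws Exception {
        if (seen.value() == null) {
            seen.update(true);  // first event for this key: remember it
            out.collect(event); // and forward it downstream
        }
        // "update the timer": delete the previous timer, then register a new one
        Long previous = pendingTimer.value();
        if (previous != null) {
            ctx.timerService().deleteProcessingTimeTimer(previous);
        }
        long fireAt = ctx.timerService().currentProcessingTime() + 5 * 60 * 1000;
        ctx.timerService().registerProcessingTimeTimer(fireAt);
        pendingTimer.update(fireAt);
    }

    @Override
    public void onTimer(long timestamp, OnTimerContext ctx, Collector<String> out) throws Exception {
        seen.clear();           // 5 quiet minutes for this key: forget it
        pendingTimer.clear();
    }
}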
Job settings:
**Checkpointing interval: 5000ms**
**Incremental checkpointing: true**
**Semantic: Exactly Once**
**State Backend: RocksDB**
**Parallelism: 4**
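A hedged sketch of how these settings look in the DataStream API (the checkpoint URI, class name, and job name are placeholders, not from the question):

import org.apache.flink.contrib.streaming.state.RocksDBStateBackend;
import org.apache.flink.streaming.api.CheckpointingMode;
import org.apache.flink.streaming.api.environment.StreamExecutionEnvironment;

public class JobSettings {
    public static void main(String[] args) throws Exception {
        StreamExecutionEnvironment env = StreamExecutionEnvironment.getExecutionEnvironment();
        env.enableCheckpointing(5000, CheckpointingMode.EXACTLY_ONCE); // 5000 ms, exactly-once
        env.setStateBackend(new RocksDBStateBackend("hdfs:///flink/checkpoints", true)); // incremental
        env.setParallelism(4);
        // ... Kafka source -> filter -> keyBy -> CEP -> process -> Kafka sink ...
        env.execute("cep-job");
    }
}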
Logically my job works perfectly, but I have some problems.
I ran two tests on my cluster (2 JobManagers and 3 TaskManagers):
**First test:**
I started my job against an empty Apache Kafka topic and saw the following in the Flink Web UI **Checkpointing Statistics:**
1) Latest Acknowledgement - Trigger Time = 5000 ms (matching my checkpoint interval)
2) State size = 340 KB at each 5-second interval
3) All statuses were completed (blue).
**Second test:**
I started sending JSON messages with distinct keys (from "1" to Integer.MAX_VALUE) into the Apache Kafka topic at 1000 messages/sec. Then I saw the following in the Flink Web UI **Checkpointing Statistics:**
1) Latest Acknowledgement - Trigger Time = 1 - 6 minutes
**My question #1: Why is this time growing? Is that bad, or OK?**
2) State size was constantly growing. I sent messages to Kafka for about 10 minutes (1000 x 60 x 10 = 600000 messages). After sending, the state size was 100 - 150 MB.
3) After sending, I waited about an hour and saw that:
Latest Acknowledgement - Trigger Time = 5000 ms (matching my checkpoint interval)
State size was still 100 - 150 MB at each 5-second interval.
**My question #2: Why doesn't it decrease? I checked my job logs and saw 600000 records saying the ValueState for a key was cleared (the onTimer method completed successfully), and the job logic (see the description of my KeyedProcessFunction) was working correctly.**
What have I tried?
1) Setting a pause between checkpoints
2) Disabling incremental checkpoints
3) Enabling async checkpoints (in flink-conf.yaml)
None of it changed anything! (See the sketch below for where each of these knobs lives.)
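For reference, a hedged sketch of the three knobs above in Flink 1.8-era APIs (values illustrative, URI a placeholder):

import org.apache.flink.contrib.streaming.state.RocksDBStateBackend;
import org.apache.flink.streaming.api.environment.StreamExecutionEnvironment;

public class TriedKnobs {
    public static void main(String[] args) throws Exception {
        StreamExecutionEnvironment env = StreamExecutionEnvironment.getExecutionEnvironment();
        // 1) pause between checkpoints
        env.getCheckpointConfig().setMinPauseBetweenCheckpoints(2000); // ms
        // 2) incremental checkpoints off: second constructor argument = false
        env.setStateBackend(new RocksDBStateBackend("hdfs:///flink/checkpoints", false));
        // 3) async snapshots are a flink-conf.yaml switch (state.backend.async: true)
        //    for the heap backends; RocksDB snapshots are always asynchronous.
    }
}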
**My question #3: What should I do? On the production server the rate is *10 million messages/hour*, and the checkpoint size grows immediately.**

Why does the log always say "No Data Available" while the cube is building?

In the sample case on the Kylin official website, when I was building the cube, in the first step (Create Intermediate Flat Hive Table) the log always says No Data Available and the status stays at running.
The cube build has been executing for more than three hours.
I checked the Hive database table kylin_sales, and there is data in the table.
And I found that the intermediate flat Hive table kylin_intermediate_kylin_sales_cube_402e3eaa_dfb2_7e3e_04f3_07248c04c10c has been created successfully in Hive, but there is no data in it.
hive> show tables;
OK
...
kylin_intermediate_kylin_sales_cube_402e3eaa_dfb2_7e3e_04f3_07248c04c10c
kylin_sales
...
Time taken: 9.816 seconds, Fetched: 10000 row(s)
hive> select * from kylin_sales;
OK
...
8992 2012-04-17 ABIN 15687 0 13 95.5336 17 10000975 10000507 ADMIN Shanghai
8993 2013-02-02 FP-non GTC 67698 0 13 85.7528 6 10000856 10004882 MODELER Hongkong
...
Time taken: 3.759 seconds, Fetched: 10000 row(s)
The deployment environment is as follows:
zookeeper-3.4.14
hadoop-3.2.0
hbase-1.4.9
apache-hive-2.3.4-bin
apache-kylin-2.6.1-bin-hbase1x
openssh5.3
jdk1.8.0_144
I deployed the cluster through Docker and created 3 containers: one master, two slaves.
The Create Intermediate Flat Hive Table step stays in the running state.
"No Data Available" means this step's log has not yet been captured by Kylin. Usually the log is recorded only when the step exits (success or failure); then you will see the data.
In this case it usually indicates the job is pending in Hive, which can happen for many reasons. The simplest way to debug is to watch Kylin's log: you will see the Hive command that Kylin executes, and you can then run it manually in a console to reproduce the problem. Also check whether your Hive/Hadoop cluster has enough resources (CPU, memory) to execute such a query.
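If it helps, a hedged way to check the intermediate table programmatically over HiveServer2 JDBC; the host/port and the hive-jdbc dependency are assumptions, not from the question:

import java.sql.Connection;
import java.sql.DriverManager;
import java.sql.ResultSet;
import java.sql.Statement;

public class CheckFlatTable {
    public static void main(String[] args) throws Exception {
        String table = "kylin_intermediate_kylin_sales_cube_402e3eaa_dfb2_7e3e_04f3_07248c04c10c";
        try (Connection conn = DriverManager.getConnection("jdbc:hive2://master:10000/default");
             Statement stmt = conn.createStatement();
             ResultSet rs = stmt.executeQuery("select count(*) from " + table)) {
            rs.next();
            // a count of 0 here matches the empty-table symptom described above
            System.out.println("rows in intermediate table: " + rs.getLong(1));
        }
    }
}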

Performance Tuning Mystery

So I have an SSIS package with a performance problem. I have done 4 runs so far.
Run 1 - Run the whole package. It takes 58 seconds. Performance problem replicated.
Run 2 - Run the whole package with logging enabled. 66 seconds.
Task            Duration (s)   Start      End
01-18-Package   66             10:32:26   10:33:32
Task1           2              10:32:26   10:32:28
Task2           1              10:32:28   10:32:29
Task3           2              10:32:29   10:32:31
Task4           1              10:32:31   10:32:32
Task5           1              10:32:31   10:32:32
Data Flow       59             10:32:32   10:33:31
Task 7          1              10:33:31   10:33:32
The bottleneck appears to be the Data Flow.
Run 3 - Execute the Data Flow on its own via right-click and Execute Task. It takes 8 seconds. What? Running the package with only the Data Flow task, using the play button, gives me 9.6 seconds.
Run 4 - Strip everything out of the package apart from the Data Flow and run with logging. 52 seconds.
Is the problem the Data Flow, or is this a memory issue? What should be my logical next step in this investigation? Logging is not the issue, and the Data Flow on its own is not the problem. There is a lookup in the Data Flow that may use some memory, if that matters.
[Find FaultID [33]] Information: The Find FaultID processed 540345 rows in the cache. The processing time was 1.623 seconds. The cache used 19452420 bytes of memory.

GAE Task Queues: how do I add a delay?

In Task Queues, code is executed to connect to the server side through URL Fetch.
My queue.yaml file:
queue:
- name: default
  rate: 10/m
  bucket_size: 1
With these settings, the tasks all execute at once, simultaneously.
The requirement is that there be a delay of at least 5 seconds between requests: tasks must execute in stages, more than 5 seconds apart, not in parallel.
What values should be set in queue.yaml?
You can't currently specify minimum delays between tasks in queue.yaml; you have to do it (partly) in your own code. For example, if you specify a bucket size of 1 (so that more than one task should never be executing at once) and make sure each task runs for at least 5 seconds (take start = time.time() at the start and time.sleep(max(0, start + 5 - time.time())) at the end), this should work. If it doesn't, have each task record in the datastore the timestamp at which it finished; when a task starts, check whether the last task ended less than 5 seconds ago, and in that case terminate immediately.
Another way could be to store the task data in a table. In your task queue, add an id parameter. Fetch the first task from the table and pass its id to the task-queue processing servlet. In the servlet, delay for 5 seconds at the end, then fetch the next task, pass its id, and so on. A sketch of this chaining idea follows.
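A minimal sketch of that chaining idea using the GAE Java task queue API; the worker URL, parameter name, and class name are hypothetical, and fetching the next id from the table is left to the caller:

import com.google.appengine.api.taskqueue.Queue;
import com.google.appengine.api.taskqueue.QueueFactory;
import com.google.appengine.api.taskqueue.TaskOptions;

public class TaskChain {
    /** Enqueue processing of the next stored task, at least 5 seconds from now. */
    public static void enqueueNext(String nextId) {
        Queue queue = QueueFactory.getDefaultQueue();
        queue.add(TaskOptions.Builder
                .withUrl("/process-task")   // hypothetical worker servlet
                .param("id", nextId)
                .countdownMillis(5000));    // enforce the >= 5 s gap before it runs
    }
}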
