Flink consumer lag after unioning streams updated at different frequencies - apache-flink

We are using Flink 1.2.1, and we are consuming from 2 Kafka streams by unioning one stream with the other and processing the unioned stream.
e.g.
stream1.union(stream2)
However, stream2 has more than 100 times the volume of stream1, and what we are experiencing is a huge consuming lag (more than 3 days of data) for stream2, but very little lag for stream1.
We already have 9 partitions but a parallelism of 1. Would increasing the parallelism solve the consuming lag for stream2, or should we not union the streams in this case at all?

The .union() shouldn't be contributing to the time lag, AFAIK.
And yes, increasing parallelism should help, if in fact the lag in processing is due to your consuming operators (or sink) being CPU constrained.
If the problem is with something at the sink end which can't be helped by higher parallelism (e.g. you are writing to a DB, and it's at its maximum ingest rate), then increasing the sink parallelism won't help, of course.

Yes, try increasing the parallelism for the stream2 source - it should help:
env.addSource(kafkaStream2Consumer).setParallelism(9)
At the moment you have a bottleneck of 1 core, which needs to keep up with consuming all the stream2 data. In order to fully utilise Kafka's parallelism, the FlinkKafkaConsumer parallelism should be >= the number of topic partitions it is consuming from.
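For illustration, a minimal sketch of setting the source parallelism per stream before the union might look like this (the consumer objects, the Event type and the process function are placeholders, not taken from the original job):

DataStream<Event> stream1 = env
        .addSource(kafkaStream1Consumer)
        .setParallelism(1);          // low-volume topic

DataStream<Event> stream2 = env
        .addSource(kafkaStream2Consumer)
        .setParallelism(9);          // one subtask per Kafka partition

stream1.union(stream2)
        .process(myProcessFunction)  // downstream parallelism can be tuned separately
        .setParallelism(9);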

Related

How to handle the case for watermarks when num of kafka partitions is larger than Flink parallelism

I am trying to figure out a solution to the problem of watermark progress when the number of Kafka partitions is larger than the Flink parallelism employed.
Consider for example that I have a Flink app with a parallelism of 3 and that it needs to read data from 5 Kafka partitions. My issue is that when starting the Flink app, it has to consume historical data from these partitions. As I understand it, each Flink task starts consuming events from a corresponding partition (probably buffering a significant amount of events) and advances event time (and therefore watermarks) before the same task transitions to another partition whose data will now be stale according to the watermarks already issued.
I tried considering a watermark strategy using watermark alignment of a few seconds, but that does not solve the problem since the historical data are consumed immediately from one partition and therefore event time/watermarks have already progressed. Below is a snippet of code that showcases the watermark strategy implemented.
WatermarkStrategy.forGenerator(ws)
    .withTimestampAssigner(
        (event, timestamp) -> (long) event.get("event_time"))
    .withIdleness(IDLENESS_PERIOD)
    .withWatermarkAlignment(
        GROUP,
        Duration.ofMillis(DEFAULT_MAX_WATERMARK_DRIFT_BETWEEN_PARTITIONS),
        Duration.ofMillis(DEFAULT_UPDATE_FOR_WATERMARK_DRIFT_BETWEEN_PARTITIONS));
I also tried using a downstream operator to sort events, as described in Sorting union of streams to identify user sessions in Apache Flink, but this also cannot effectively tackle my issue since event record times can deviate significantly.
How can I tackle this issue? Do I need the same number of Flink tasks as Kafka partitions, or am I missing something regarding the way data are read from Kafka partitions?
The easiest solution to this problem is to pass the WatermarkStrategy to fromSource directly instead of assigning it with assignTimestampsAndWatermarks.
When you use the WatermarkStrategy directly in fromSource with the Kafka connector, the watermarks will be partition-aware, so the watermark generated by a given operator will be the minimum over all partitions assigned to that operator.
Assigning watermarks directly in the source will solve the problem you are facing, but it has one main drawback: since the generated watermark is the minimum over all partitions processed by the given operator, if some partition is idle, the watermark for this operator will not progress either.
The docs describe kafka connector watermarking here.
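A minimal sketch of this approach with a KafkaSource (the Event type, deserialization schema, topic and broker address are placeholders, and a simple bounded-out-of-orderness generator stands in for the custom ws):

KafkaSource<Event> source = KafkaSource.<Event>builder()
        .setBootstrapServers("broker:9092")
        .setTopics("events")
        .setValueOnlyDeserializer(new EventDeserializationSchema())
        .build();

DataStream<Event> stream = env.fromSource(
        source,
        WatermarkStrategy
                .<Event>forBoundedOutOfOrderness(Duration.ofSeconds(5))
                .withTimestampAssigner((event, ts) -> event.getEventTime())
                .withIdleness(IDLENESS_PERIOD),      // mitigates the idle-partition drawback
        "kafka-source");

With the strategy passed to fromSource, watermarks are generated per Kafka partition inside the source, so historical data from one partition is not considered late just because another partition handled by the same subtask has already advanced event time.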

Why does the Kinesis shard iterator fall behind when using BoundedOutOfOrdernessTimestampExtractor

I'm using KDA with a Flink job which should analyse messages emitted by different IoT device sources. There is a Kinesis stream with 4 shards, each of which contains more or less the same amount of data (there are no hot shards). The Kinesis stream gets filled by AWS Greengrass Streammanager, which uses an increasing sequence number as the partition key. Each message contains a single value (something like temperature = 5).
With this setup, the stream read by the Kinesis consumer in Flink is unordered, but I need to preserve the order of the messages. To do so I have written a small buffer function, which is more or less the logic from CepOperator, to buffer messages and restore the order. Therefore the stream is keyed by the id of a message. Let's say a temperature message always has a unique id, and the stream is keyed by this id.
To create the respective watermarks I'm using the FlinkKinesisConsumer and registering a BoundedOutOfOrdernessTimestampExtractor on it. If I use an out-of-orderness time of 10 seconds, everything works fine except that I have almost 50% late arrivals, which is not the desired behaviour. But if I increase the time to 60 seconds, the iterator of the Kinesis stream falls significantly behind (growing linearly over time). The documentation of the Kinesis consumer says little about the settings here. I have also tried to register a JobManagerWatermarkTracker, but it seems that it does not change the behaviour.
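For reference, the registration looks roughly like this (the record type, deserialization schema, stream name and region are simplified placeholders):

Properties consumerConfig = new Properties();
consumerConfig.put(ConsumerConfigConstants.AWS_REGION, "eu-west-1");

FlinkKinesisConsumer<TemperatureReading> consumer = new FlinkKinesisConsumer<>(
        "temperature-stream", new TemperatureDeserializationSchema(), consumerConfig);

// per-shard watermarks with the out-of-orderness bound under discussion
consumer.setPeriodicWatermarkAssigner(
        new BoundedOutOfOrdernessTimestampExtractor<TemperatureReading>(Time.seconds(60)) {
            @Override
            public long extractTimestamp(TemperatureReading reading) {
                return reading.getTimestamp();
            }
        });

// optional global watermark alignment across subtasks:
// consumer.setWatermarkTracker(new JobManagerWatermarkTracker("temperature-watermarks"));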
I do not understand why a higher out-of-orderness causes the iterator to fall increasingly behind, while a smaller setting drops a significant number of messages. What measures do I need to take to find the proper settings, or is my implementation wrong?
UPDATE:
While investigating the issue I found that if the JobManagerWatermarkTracker isn't properly configured (I still don't understand how to configure it), the alignment to the global watermark stops subtasks from reading from the Kinesis stream, which causes the iterator to fall back. I calculated a delta of how much "latency" a dropped event has and set this as the out-of-orderness (in this case 60 secs). With the JobManagerWatermarkTracker deactivated, everything works as expected.
Furthermore, it seems that the AWS Greengrass Streammanager isn't optimal for such use cases, as it distributes the load evenly across shards; with an increasing number of shards this isn't optimal, since one temperature datapoint might be spread across all shards of a stream. That introduces a lot of unnecessary latency. I appreciate any input on how to configure the JobManagerWatermarkTracker.

Partition the whole DataStream in Flink at the source and maintain the partitioning till the sink

I am consuming trail logs from a queue (Apache Pulsar). I use 5 KeyedProcessFunctions and finally sink the payload to a Postgres DB. I need ordering per customerId for each of the KeyedProcessFunctions. Right now I achieve this by:
Datasource
    .keyBy(fooKeyFunction).process(processA)
    .keyBy(fooKeyFunction).process(processB)
    .keyBy(fooKeyFunction).process(processC)
    .keyBy(fooKeyFunction).process(processE)
    .keyBy(fooKeyFunction).sink(fooSink);
processFunctionC is very time consuming and takes 30 secs in the worst case to finish. This leads to backpressure. I tried assigning more slots to processFunctionC, but my throughput never remains constant; it mostly stays below 4 messages per second.
The current slots per processFunction are:
processFunctionA: 3
processFunctionB: 30
processFunctionc: 80
processFunctionD: 10
processFunctionC: 10
In the Flink UI it shows backpressure starting from processB, meaning C is very slow.
Is there a way to apply partitioning logic at the source itself and assign the same slots per task to each processFunction? For example:
dataSource.magicKeyBy(fooKeyFunction).setParallelism(80)
    .process(processA)
    .process(processB)
    .process(processC)
    .process(processE)
    .sink(fooSink);
This would cause backpressure to happen for only a few of the tasks and would not skew the backpressure the way the multiple keyBy steps currently do.
Another approach I can think of is to combine all my processFunctions and the sink into a single processFunction and apply all that logic in the sink itself.
I don't think there exists anything quite like this. The closest thing is DataStreamUtils.reinterpretAsKeyedStream, which recreates the KeyedStream without actually sending any data between the operators, since it uses a partitioner that only forwards data locally. This is more or less what you wanted; it still adds a partitioning operator and under the hood recreates the KeyedStream, but it should be simpler and faster, and perhaps it will solve the issue you are facing.
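A minimal sketch of that idea, reusing the placeholder names from the question (it assumes the records leaving processA are still partitioned exactly as keyBy(fooKeyFunction) would partition them):

DataStream<Event> afterA = Datasource
        .keyBy(fooKeyFunction)
        .process(processA);

// re-declare the stream as keyed without another network shuffle
KeyedStream<Event, String> keyedAgain =
        DataStreamUtils.reinterpretAsKeyedStream(afterA, fooKeyFunction);

keyedAgain.process(processB);  // and similarly for processC, processE and the sink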
If this does not solve the issue, then I think the best solution would be to group operators so that the backpressure is minimized, as you suggested, i.e. merge all operators into one bigger operator; this should minimize backpressure.

Why do my Flink SQL queries have very different checkpoint sizes?

When using Flink Table SQL in my project, I found that if there is any GROUP BY clause in my SQL, the size of the checkpoint increases vastly.
For example,
INSERT INTO COMPANY_POST_DAY
SELECT
sta_date,
company_id,
company_name
FROM
FCBOX_POST_COUNT_VIEW
The checkpoint size would be less than 500KB.
But when used like this,
INSERT INTO COMPANY_POST_DAY
SELECT
sta_date,
company_id,
company_name,
sum(ed_post_count)
FROM
FCBOX_POST_COUNT_VIEW
GROUP BY
sta_date, company_id, company_name, TUMBLE(procTime, INTERVAL '1' SECOND)
The checkpoint size would be more than 70MB, even when no messages have been processed at all.
But when using the DataStream API and keyBy instead of Table SQL GROUP BY, the checkpoint size would be normal, less than 1MB.
Why?
-------updated at 2019-03-25--------
After doing some tests and reading source code, we found that the reason for this was RocksDB.
When using RocksDB as the state backend, the size of the checkpoint is more than about 5MB per key, and when using the filesystem state backend, the size of the checkpoint falls to less than 100KB per key.
Why does RocksDB need so much space to hold the state? When should we choose RocksDB?
First of all, I would not consider 70 MB as huge state. There are many Flink jobs with multiple TBs of state. Regarding the question why the state sizes of both queries differ:
The first query is a simple projection query, which means that every record can be independently processed. Hence, the query does not need to "remember" any records but only the stream offsets for recovery.
The second query performs a window aggregation and needs to remember an intermediate result (the partial sum) for every window until time progressed enough such that the result is final and can be emitted.
Since Flink SQL queries are translated into DataStream operators, there is not much difference between a SQL query and implementing the aggregation with keyBy().window(). Both run pretty much the same code.
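For illustration, a rough DataStream equivalent of the second query (the record type, field accessors and sink are placeholders): a keyed 1-second processing-time tumbling window whose partial sums are exactly the state that has to be checkpointed.

// "input" stands for the stream of records behind FCBOX_POST_COUNT_VIEW
input
    .keyBy(r -> r.getStaDate() + "|" + r.getCompanyId() + "|" + r.getCompanyName())
    .window(TumblingProcessingTimeWindows.of(Time.seconds(1)))
    .sum("edPostCount")           // per-window partial sums live in keyed state
    .addSink(companyPostDaySink);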
Update: The cause of the increased state has been identified as overhead of the RocksDBStateBackend. This overhead is not per key but per stateful operator. Since the RocksDBStateBackend is meant to hold state sizes of multiple GBs to TBs, an overhead of a few MB is negligible.

How to decide Kafka Cluster size

I am planning to decide how many nodes should be present in a Kafka cluster. I am not sure about the parameters to take into consideration. I am sure it has to be >= 3 (with a replication factor of 2 and failure tolerance of 1 node).
Can someone tell me what parameters should be kept in mind while deciding the cluster size, and how they affect the size?
I know of the following factors, but I don't know how they quantitatively affect the cluster size (I know how they qualitatively affect it). Is there any other parameter which affects cluster size?
1. Replication factor (cluster size >= replication factor)
2. Node failure tolerance. (cluster size >= node-failure + 1)
What should the cluster size be for the following scenario, taking all the parameters into consideration?
1. There are 3 topics.
2. Each topic has messages of different sizes. The message size ranges from 10 to 500 KB, with an average of 50 KB.
3. Each topic has a different number of partitions: 10, 100, and 500.
4. Retention period is 7 days
5. There are 100 million messages posted every day for each topic.
Can someone please point me to relevant documentation or a blog which discusses this? I have searched Google but to no avail.
As I understand it, getting good throughput from Kafka doesn't depend only on the cluster size; there are other configurations which need to be considered as well. I will try to share as much as I can.
Kafka's throughput is supposed to scale linearly with the number of disks you have. The multiple data directories feature introduced in Kafka 0.8 allows Kafka's topics to have different partitions on different machines. As the partition number increases greatly, so do the chances that the leader election process will be slower, also affecting consumer rebalancing. This is something to consider, and it could be a bottleneck.
Another key thing could be the disk flush rate. As Kafka always immediately writes all data to the filesystem, the more often data is flushed to disk, the more "seek-bound" Kafka will be, and the lower the throughput. Again, a very low flush rate might lead to different problems, as in that case the amount of data to be flushed will be large. So providing an exact figure is not very practical, and I think that is the reason you couldn't find such a direct answer in the Kafka documentation.
There will be other factors too: for example, the consumer's fetch size, compression, batch size for asynchronous producers, socket buffer sizes, etc.
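Several of these knobs are plain client properties; a hedged example with arbitrary values (not recommendations) might look like this:

Properties producerProps = new Properties();
producerProps.put("batch.size", "65536");           // batching for asynchronous producers
producerProps.put("linger.ms", "50");
producerProps.put("compression.type", "lz4");
producerProps.put("send.buffer.bytes", "1048576");  // socket send buffer

Properties consumerProps = new Properties();
consumerProps.put("fetch.min.bytes", "65536");      // consumer fetch size
consumerProps.put("receive.buffer.bytes", "1048576");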
Hardware & OS will also play a key role in this, as using Kafka in a Linux-based environment is advisable due to its page cache mechanism for writing data to disk. Read more on this here.
You might also want to take a look at how OS flush behavior plays a key role before you actually tune it to fit your needs. I believe it is key to understand the design philosophy, which makes Kafka so effective in terms of throughput and fault tolerance.
Some more resources I find useful to dig into:
https://engineering.linkedin.com/kafka/benchmarking-apache-kafka-2-million-writes-second-three-cheap-machines
http://blog.liveramp.com/2013/04/08/kafka-0-8-producer-performance-2/
https://grey-boundary.io/load-testing-apache-kafka-on-aws/
https://cwiki.apache.org/confluence/display/KAFKA/Performance+testing
I have recently worked with Kafka, and these are my observations.
Each topic is divided into partitions, and all the partitions of a topic are distributed across Kafka brokers. First of all, this helps to store topics whose size is larger than the capacity of a single Kafka broker, and it also increases consumer parallelism.
To increase reliability and fault tolerance, replicas of the partitions are made; they do not increase consumer parallelism. The rule of thumb is that a single broker can host only a single replica per partition. Hence the number of brokers must be >= the number of replicas.
All partitions are spread across all the available brokers. The number of partitions can be independent of the number of brokers, but the number of partitions must be equal to the number of consumer threads in a consumer group (to get the best throughput).
The cluster size should be decided keeping in mind the throughput you want to achieve at the consumer.
The total MB/s per broker would be:
Data/Day = (100 × 10^6 messages/day) × 0.05 MB = 5 TB/day per topic
That gives us ~58 MB/s per topic. Assuming the messages are split equally between partitions (and brokers), for the total cluster we get: 58 MB/s × 3 topics = ~174 MB/s of original data, i.e. ~58 MB/s per broker on a 3-broker cluster.
Now, for the replication, you have 1 extra replica per topic. Therefore this becomes 58 MB/s per broker of INCOMING original data + 58 MB/s per broker of OUTGOING replication data + 58 MB/s per broker of INCOMING replication data.
That comes to about ~116 MB/s per broker ingress and 58 MB/s per broker egress.
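A quick back-of-the-envelope sketch of that arithmetic (all inputs are the question's assumptions; the 3-broker count is the minimum cluster size discussed above):

double messagesPerDay   = 100e6;   // per topic
double avgMessageMb     = 0.05;    // 50 KB average message size
int topics              = 3;
int brokers             = 3;       // assumed minimum cluster size
int replicationFactor   = 2;

double mbPerDayPerTopic  = messagesPerDay * avgMessageMb;               // ~5e6 MB ≈ 5 TB/day
double mbPerSecPerTopic  = mbPerDayPerTopic / 86_400;                   // ~58 MB/s
double clusterOriginal   = mbPerSecPerTopic * topics;                   // ~174 MB/s
double perBrokerOriginal = clusterOriginal / brokers;                   // ~58 MB/s
double perBrokerIngress  = perBrokerOriginal * replicationFactor;       // ~116 MB/s
double perBrokerEgress   = perBrokerOriginal * (replicationFactor - 1); // ~58 MB/s (replication only)
System.out.printf("ingress %.0f MB/s, egress %.0f MB/s per broker%n",
        perBrokerIngress, perBrokerEgress);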
The system load will get very high, and this is without taking any stream processing into consideration.
The system load could be handled by increasing the number of brokers and splitting your topics into more partitions.
If your data are very important, then you may want a different (higher) replication factor. Fault tolerance is also an important factor in deciding the replication.
For example, if you had very, very important data, then apart from the N active brokers (with the replicas) that are managing your partitions, you may require standby followers in different areas.
If you require very low latency, then you may want to further increase your partitions (by adding additional keys). The more keys you have, the fewer messages you will have on each partition.
For low latency, you may want a dedicated cluster (with the replicas) that manages only that special topic, so that no additional computation for other topics is done on it.
If a topic is not very important, then you may want to lower the replication factor of that particular topic and be more tolerant of some data loss.
When building a Kafka cluster, the machines supporting your infrastructure should be equally capable. That is, since the partitioning is done in round-robin style, you expect each broker to be capable of handling the same load; therefore the size of your messages does not matter.
The load from stream processing will also have a direct impact. Good software to monitor your Kafka cluster and manage your streams is Lenses, which I personally favor since it does amazing work with processing real-time streams.
