I am using Flink CEP SQL with the Blink planner.
Here is my SQL:
SELECT * FROM test_table MATCH_RECOGNIZE (
  PARTITION BY agent_id, room_id, call_type
  ORDER BY row_time  -- processing time
  MEASURES
    LAST(BF.create_time) AS create_time,
    LAST(AF.connect_time) AS connect_time
  ONE ROW PER MATCH
  AFTER MATCH SKIP PAST LAST ROW
  PATTERN (BF+ AF) WITHIN INTERVAL '10' SECOND
  DEFINE
    BF AS BF.connect_time = 0,
    AF AS AF.connect_time > 0
) AS T;
The test_table is a Kafka table.
I set table.exec.state.ttl=10000, ran my program, and then kept sending messages.
Since both the state TTL and the CEP interval are set to 10 seconds, I expected the state size to become roughly constant about 10 seconds after the job started.
But in fact the state kept growing for at least 15 minutes, and the JVM triggered two full GCs.
Is there any configuration I have missed?
You cannot use checkpoint sizes to estimate state size -- they are not related in any straightforward way. Checkpoints can include unpredictable amounts of in-flight, expired, or uncompacted data -- none of which would be counted as active state.
I'm afraid there isn't any good tooling available for measuring exactly how much state you actually have. But if you are using RocksDB, then you can enable these metrics:
state.backend.rocksdb.metrics.estimate-live-data-size
state.backend.rocksdb.metrics.estimate-num-keys
which will give you a reasonably accurate estimate (but you may pay a performance penalty for turning them on).
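For reference, here is one way those options could be enabled in code rather than in flink-conf.yaml (a minimal sketch, assuming Flink 1.12+ where the execution environment accepts a Configuration; the keys are the two listed above):

import org.apache.flink.configuration.Configuration;
import org.apache.flink.streaming.api.environment.StreamExecutionEnvironment;

public class RocksDbMetricsExample {
    public static void main(String[] args) {
        // Enable the RocksDB native metrics mentioned above. On a real cluster
        // these would usually go into flink-conf.yaml on the task managers.
        Configuration config = new Configuration();
        config.setString("state.backend.rocksdb.metrics.estimate-live-data-size", "true");
        config.setString("state.backend.rocksdb.metrics.estimate-num-keys", "true");

        StreamExecutionEnvironment env =
                StreamExecutionEnvironment.getExecutionEnvironment(config);

        // ... build and execute the CEP job as in the question ...
    }
}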
As for your concern about CEP state -- you should be fine. Anytime you have a pattern that uses WITHIN, CEP should be able to clean the state automatically.
We are using Flink SQL to define a stream processing pipeline that computes a SUM aggregation over a fixed window of 5 minutes. The query looks something like this:
INSERT INTO BigTableTable
SELECT CONCAT_WS('#', user, attribute, bucket) as rowkey, ROW(sum_attribute), cell_timestamp
FROM (
SELECT
user,
attribute,
DATE_FORMAT(window_end, 'yyyy-MM-dd:HH') AS bucket,
SUM(attribute_to_sum) AS sum_attribute,
window_end as cell_timestamp
FROM TABLE(TUMBLE(TABLE inputTable, DESCRIPTOR(event_time), INTERVAL '5' MINUTES))
GROUP BY user, attribute, window_start, window_end)
We are using EmbeddedRocksDBStateBackend with incremental checkpointing and also unaligned checkpoints.
We enabled some native RocksDB metrics to monitor the live state size, and we also monitor the checkpoint size; both keep growing over time. This is a bit counterintuitive since our understanding is that the window state will be wiped out once the window closes and fires.
We have also enabled the state ttl in table api with: tableEnvironment.getConfig.setIdleStateRetention(Duration.ofMinutes(10)), but this doesn't seem to change anything.
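In Java, that setting looks like the following (a minimal sketch, assuming Flink 1.13+; the Scala call above is equivalent and maps to the table.exec.state.ttl option):

import java.time.Duration;

import org.apache.flink.table.api.EnvironmentSettings;
import org.apache.flink.table.api.TableEnvironment;

public class IdleStateRetentionExample {
    public static void main(String[] args) {
        TableEnvironment tEnv =
                TableEnvironment.create(EnvironmentSettings.inStreamingMode());

        // Keep idle (non-windowed) state for at most 10 minutes.
        tEnv.getConfig().setIdleStateRetention(Duration.ofMinutes(10));

        // ... register inputTable / BigTableTable and run the windowed INSERT ...
    }
}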
Here are the two graphs for state size and checkpoint size:
[Graphs: State Live Size and Checkpoint Size]
Why does the state keep growing instead of being bounded by the number of keys in the input stream?
Update:
Here are the same two metrics with aligned checkpoints:
[Graphs: Checkpoint Size and State Live Size]
We currently have a running Flink job with keyed state whose max parallelism is set to 128. As our data grows, we are concerned that 128 will no longer be enough in the future. Is there a way to change the max parallelism by modifying the savepoint, or any other way to do that?
You can use the State Processor API to accomplish this. You will read the state from a savepoint taken from the current job, and write that state into a new savepoint with increased max parallelism. https://nightlies.apache.org/flink/flink-docs-stable/docs/libs/state_processor_api/
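For illustration, a rough sketch of that read-and-rewrite flow could look like this (hedged: it assumes the DataSet-based State Processor API of Flink 1.13/1.14; the operator uid "my-keyed-operator", the state descriptor "my-state", and the KeyedEntry POJO are hypothetical placeholders for whatever the real job uses):

import org.apache.flink.api.common.state.ValueState;
import org.apache.flink.api.common.state.ValueStateDescriptor;
import org.apache.flink.api.java.DataSet;
import org.apache.flink.api.java.ExecutionEnvironment;
import org.apache.flink.configuration.Configuration;
import org.apache.flink.runtime.state.hashmap.HashMapStateBackend;
import org.apache.flink.state.api.BootstrapTransformation;
import org.apache.flink.state.api.ExistingSavepoint;
import org.apache.flink.state.api.OperatorTransformation;
import org.apache.flink.state.api.Savepoint;
import org.apache.flink.state.api.functions.KeyedStateBootstrapFunction;
import org.apache.flink.state.api.functions.KeyedStateReaderFunction;
import org.apache.flink.util.Collector;

public class RescaleMaxParallelism {

    // Hypothetical POJO representing one key's state.
    public static class KeyedEntry {
        public String key;
        public Long value;
    }

    // Reads the operator's ValueState, emitting one record per key.
    public static class StateReader extends KeyedStateReaderFunction<String, KeyedEntry> {
        private transient ValueState<Long> state;

        @Override
        public void open(Configuration parameters) {
            state = getRuntimeContext().getState(
                    new ValueStateDescriptor<>("my-state", Long.class));
        }

        @Override
        public void readKey(String key, Context ctx, Collector<KeyedEntry> out) throws Exception {
            KeyedEntry entry = new KeyedEntry();
            entry.key = key;
            entry.value = state.value();
            out.collect(entry);
        }
    }

    // Writes the same state back under the same descriptor.
    public static class StateWriter extends KeyedStateBootstrapFunction<String, KeyedEntry> {
        private transient ValueState<Long> state;

        @Override
        public void open(Configuration parameters) {
            state = getRuntimeContext().getState(
                    new ValueStateDescriptor<>("my-state", Long.class));
        }

        @Override
        public void processElement(KeyedEntry entry, Context ctx) throws Exception {
            state.update(entry.value);
        }
    }

    public static void main(String[] args) throws Exception {
        ExecutionEnvironment env = ExecutionEnvironment.getExecutionEnvironment();

        // 1. Read the keyed state out of the existing savepoint (max parallelism 128).
        ExistingSavepoint oldSavepoint =
                Savepoint.load(env, "hdfs://path/to/old-savepoint", new HashMapStateBackend());
        DataSet<KeyedEntry> entries =
                oldSavepoint.readKeyedState("my-keyed-operator", new StateReader());

        // 2. Bootstrap a new operator state set from that data.
        BootstrapTransformation<KeyedEntry> transformation = OperatorTransformation
                .bootstrapWith(entries)
                .keyBy(e -> e.key)
                .transform(new StateWriter());

        // 3. Write a new savepoint with a larger max parallelism (e.g. 1024).
        Savepoint
                .create(new HashMapStateBackend(), 1024)
                .withOperator("my-keyed-operator", transformation)
                .write("hdfs://path/to/new-savepoint");

        // Trigger the batch job that produces the new savepoint.
        env.execute("rescale-max-parallelism");
    }
}

The job can then be restarted from the new savepoint with the higher parallelism ceiling. Note that Flink 1.15+ moved this API onto the DataStream API (SavepointReader/SavepointWriter), so the exact class names differ there.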
Your job should perform well if the maximum parallelism is (roughly) 4-5 times the actual parallelism. When the max parallelism is only somewhat higher than the actual parallelism, then you have some slots processing data from just one key group, and others handling two key groups, and that imbalance wastes resources.
But going unnecessarily high will exact a performance penalty if you are using the heap-based state backend. That's why the default is only 128, and why you don't want to set it to an extremely large value.
I am consuming trail logs from a queue (Apache Pulsar). I use 5 KeyedProcessFunctions and finally sink the payload to a Postgres DB. I need ordering per customerId for each of the KeyedProcessFunctions. Right now I achieve this with:
Datasource
    .keyBy(fooKeyFunction).process(processA)
    .keyBy(fooKeyFunction).process(processB)
    .keyBy(fooKeyFunction).process(processC)
    .keyBy(fooKeyFunction).process(processE)
    .keyBy(fooKeyFunction).sink(fooSink);
processFunctionC is very time consuming and takes 30 seconds in the worst case to finish. This leads to backpressure. I tried assigning more slots to processFunctionC, but my throughput never remains constant; it mostly stays below 4 messages per second.
The current slots per processFunction are:
processFunctionA: 3
processFunctionB: 30
processFunctionC: 80
processFunctionD: 10
processFunctionE: 10
In the Flink UI, backpressure is shown starting from processB, meaning C is very slow.
Is there a way to apply the partitioning logic at the source itself and assign the same number of slots per task to each processFunction? For example:
dataSource.magicKeyBy(fooKeyFunction).setParallelism(80).process(processA).process(processB).process(processC).process(processE).sink(fooSink).
This would confine backpressure to only a few of the tasks, instead of the skewed backpressure caused by the multiple keyBy steps.
Another approach I can think of is to combine all my processFunctions and the sink into a single processFunction and apply all of that logic in the sink itself.
I don't think there exists anything quite like this. The closest thing is DataStreamUtils.reinterpretAsKeyedStream, which recreates the KeyedStream without actually sending any data between operators, since it uses a partitioner that only forwards data locally. This is more or less what you wanted; it still adds a partitioning operator and under the hood recreates the KeyedStream, but it should be simpler and faster, and perhaps it will solve the issue you are facing.
If this does not solve the issue, then I think the best solution would be to group operators so that backpressure is minimized, as you suggested, i.e. merge all operators into one bigger operator.
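For illustration, a minimal sketch of the reinterpretAsKeyedStream approach could look like this (assuming the DataStream API; the identity key selector and pass-through functions stand in for fooKeyFunction and the process functions in the question):

import org.apache.flink.api.java.functions.KeySelector;
import org.apache.flink.streaming.api.datastream.DataStream;
import org.apache.flink.streaming.api.datastream.DataStreamUtils;
import org.apache.flink.streaming.api.datastream.KeyedStream;
import org.apache.flink.streaming.api.environment.StreamExecutionEnvironment;
import org.apache.flink.streaming.api.functions.KeyedProcessFunction;
import org.apache.flink.util.Collector;

public class ReinterpretExample {

    // Stand-in for fooKeyFunction: here the record itself is the key.
    static class IdentityKey implements KeySelector<String, String> {
        @Override
        public String getKey(String value) {
            return value;
        }
    }

    // Stand-in for processA / processB: just forwards the record.
    static class PassThrough extends KeyedProcessFunction<String, String, String> {
        @Override
        public void processElement(String value, Context ctx, Collector<String> out) {
            out.collect(value);
        }
    }

    public static void main(String[] args) throws Exception {
        StreamExecutionEnvironment env = StreamExecutionEnvironment.getExecutionEnvironment();

        // One real keyBy to partition by key (customerId in the question).
        DataStream<String> afterA = env
                .fromElements("customer-1", "customer-2", "customer-1")
                .keyBy(new IdentityKey())
                .process(new PassThrough());

        // Instead of another network shuffle, declare that the stream is still
        // keyed the same way; data is only forwarded locally between operators.
        KeyedStream<String, String> rekeyed =
                DataStreamUtils.reinterpretAsKeyedStream(afterA, new IdentityKey());

        rekeyed.process(new PassThrough())
               .print();

        env.execute("reinterpret-as-keyed-stream");
    }
}

Note that reinterpretAsKeyedStream is only correct when the data really is partitioned exactly as keyBy would partition it, which holds here because the stream comes straight out of an operator keyed by the same key selector at the same parallelism.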
Suppose I have a flink pipeline as such:
kafka_source -> maps/filters/keyBy/timewindow(1 minute) -> sinkCassandra
By the time the grouped messages hit the sinkCassandra operation, am I guaranteed that no other slots are concurrently running the maps/filters/keyBy/timewindow(1 minute) part of the pipeline?
Or is it possible to have some other slot run the middle pipeline while another set is running the sinkCassandra operation?
EDIT (added more requirements based on comment conversation):
What I'm trying to do is effectively look up the record from the datastore based on the Flink data key, do an update, and flush the updated data back.
The reason I'm avoiding kafka_source -> maps/filters -> keyBy/TimeWindow/statefulReduce -> sinkCassandra is that the state can potentially get huge (1 day to 7 days, where I can set 7 days as the maximum time bound) and I don't necessarily know the time window for each key. This would mean a huge state even with RocksDB.
Another potential option I'm looking at is kafka_source -> maps/filters -> keyBy/sinkCass, where within the custom sink operation I would first check some sort of in-memory buffer for the key I want to update. If it's not there, I fetch it from Cassandra. Every 5 seconds (or every N seconds), I would grab whatever is in the buffer and flush it into Cassandra. To limit memory, I could use an in-memory least-recently-used hashmap (I don't necessarily want to flush, because multiple keys will show up again!).
Unless you have explicitly configured something unusual, each slot will contain one parallel slice of the complete pipeline -- each slot will have a kafka source instance connected to a disjoint subset of the kafka partitions, as well as the maps/filters/keyBy/window, and the cassandra sink.
All of those parallel sub-pipelines (slots) will be running concurrently. Furthermore, within each slot, each of the operators will also be running concurrently. The sink and the middle part of your pipeline are already running concurrently, but they are competing for the resources of the slot that contains them both. You can configure your task managers to have more cores per slot if you are concerned about starvation.
EDIT (responding to additional info about requirements):
You can safely assume that for any given flink data key, after a keyBy, only one instance of each operator will process events for that key. That principle is fundamental to Flink's design. If I understand correctly what you are contemplating, that's the only guarantee you need.
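Building on that guarantee, one way to sketch the buffer-and-flush idea from the question is a KeyedProcessFunction with keyed state and a processing-time timer, rather than a hand-rolled LRU map inside the sink (a hedged sketch; the five-second interval comes from the question, everything else, including where the Cassandra read/write would happen, is a placeholder):

import org.apache.flink.api.common.state.ValueState;
import org.apache.flink.api.common.state.ValueStateDescriptor;
import org.apache.flink.configuration.Configuration;
import org.apache.flink.streaming.api.functions.KeyedProcessFunction;
import org.apache.flink.util.Collector;

// Buffers the latest value per key and emits it at most every N milliseconds,
// relying on the keyBy guarantee that all events for a key hit the same instance.
public class BufferedFlushFunction extends KeyedProcessFunction<String, String, String> {

    private static final long FLUSH_INTERVAL_MS = 5_000L; // flush every 5 seconds

    private transient ValueState<String> buffered;     // latest value for this key
    private transient ValueState<Long> pendingTimer;   // timestamp of the scheduled flush, if any

    @Override
    public void open(Configuration parameters) {
        buffered = getRuntimeContext().getState(
                new ValueStateDescriptor<>("buffered", String.class));
        pendingTimer = getRuntimeContext().getState(
                new ValueStateDescriptor<>("pendingTimer", Long.class));
    }

    @Override
    public void processElement(String value, Context ctx, Collector<String> out) throws Exception {
        // The lookup/update logic would go here (e.g. read-modify-write against
        // a value fetched from Cassandra on a buffer miss).
        buffered.update(value);

        // Schedule a flush if one is not already pending for this key.
        if (pendingTimer.value() == null) {
            long flushAt = ctx.timerService().currentProcessingTime() + FLUSH_INTERVAL_MS;
            ctx.timerService().registerProcessingTimeTimer(flushAt);
            pendingTimer.update(flushAt);
        }
    }

    @Override
    public void onTimer(long timestamp, OnTimerContext ctx, Collector<String> out) throws Exception {
        // Emit the buffered value downstream (e.g. to a Cassandra sink) and clear it.
        String value = buffered.value();
        if (value != null) {
            out.collect(value);
        }
        buffered.clear();
        pendingTimer.clear();
    }
}

Keeping the buffer in Flink's managed state means it is checkpointed and bounded per key, while a regular Cassandra sink placed after this operator handles the actual writes.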
When using Flink Table SQL in my project, I found that if there is any GROUP BY clause in my SQL, the size of the checkpoint increases vastly.
For example,
INSERT INTO COMPANY_POST_DAY
SELECT
sta_date,
company_id,
company_name
FROM
FCBOX_POST_COUNT_VIEW
The checkpoint size would be less than 500KB.
But when using a query like this,
INSERT INTO COMPANY_POST_DAY
SELECT
sta_date,
company_id,
company_name,
sum(ed_post_count)
FROM
FCBOX_POST_COUNT_VIEW
GROUP BY
sta_date, company_id, company_name, TUMBLE(procTime, INTERVAL '1' SECOND)
The checkpoint size would be more than 70 MB, even when no messages have been processed.
But when using the DataStream API with keyBy instead of Table SQL GROUP BY, the checkpoint size is normal, less than 1 MB.
Why?
-------updated at 2019-03-25--------
After doing some tests and reading the source code, we found that the reason for this was RocksDB.
When using RocksDB as the state backend, the checkpoint size is more than about 5 MB per key, while with the filesystem state backend it drops to less than 100 KB per key.
Why does RocksDB need so much space to hold the state? When should we choose RocksDB?
First of all, I would not consider 70 MB as huge state. There are many Flink jobs with multiple TBs of state. Regarding the question why the state sizes of both queries differ:
The first query is a simple projection query, which means that every record can be independently processed. Hence, the query does not need to "remember" any records but only the stream offsets for recovery.
The second query performs a window aggregation and needs to remember an intermediate result (the partial sum) for every window until time progressed enough such that the result is final and can be emitted.
Since Flink SQL queries are translated into DataStream operators, there is not much difference between a SQL query and implementing the aggregation with keyBy().window(). Both run pretty much the same code.
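For comparison, a rough DataStream equivalent of the second query might look like this (a minimal sketch; the PostCount POJO, the toy input, and the composite string key are assumptions standing in for the real FCBOX_POST_COUNT_VIEW source):

import org.apache.flink.api.java.functions.KeySelector;
import org.apache.flink.streaming.api.datastream.DataStream;
import org.apache.flink.streaming.api.environment.StreamExecutionEnvironment;
import org.apache.flink.streaming.api.windowing.assigners.TumblingProcessingTimeWindows;
import org.apache.flink.streaming.api.windowing.time.Time;

public class CompanyPostDayJob {

    // Hypothetical POJO mirroring the columns used in the SQL query.
    public static class PostCount {
        public String staDate;
        public String companyId;
        public String companyName;
        public long edPostCount;

        public PostCount() {}

        public PostCount(String staDate, String companyId, String companyName, long edPostCount) {
            this.staDate = staDate;
            this.companyId = companyId;
            this.companyName = companyName;
            this.edPostCount = edPostCount;
        }
    }

    // Composite grouping key: sta_date + company_id + company_name.
    public static class GroupKey implements KeySelector<PostCount, String> {
        @Override
        public String getKey(PostCount p) {
            return p.staDate + "#" + p.companyId + "#" + p.companyName;
        }
    }

    public static void main(String[] args) throws Exception {
        StreamExecutionEnvironment env = StreamExecutionEnvironment.getExecutionEnvironment();

        // Toy bounded input; with a real unbounded source the windows fire every second.
        DataStream<PostCount> input = env.fromElements(
                new PostCount("2019-03-25", "c1", "ACME", 3L),
                new PostCount("2019-03-25", "c1", "ACME", 2L));

        // Same shape as the SQL query: group by the three columns, tumble over
        // 1 second of processing time, and sum ed_post_count per window.
        input.keyBy(new GroupKey())
             .window(TumblingProcessingTimeWindows.of(Time.seconds(1)))
             .reduce((a, b) -> new PostCount(
                     a.staDate, a.companyId, a.companyName, a.edPostCount + b.edPostCount))
             .print();

        env.execute("company-post-day");
    }
}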
Update: The increased state size has been identified as overhead of the RocksDBStateBackend. This is not per-key overhead but overhead per stateful operator. Since the RocksDBStateBackend is meant to hold state sizes of multiple GBs to TBs, an overhead of a few MB is negligible.