State cleanup behavior with Flink interval join

I am reading
https://ci.apache.org/projects/flink/flink-docs-release-1.13/docs/dev/table/sql/queries/joins/#interval-joins,
which has the following example:
SELECT *
FROM Orders o, Shipments s
WHERE o.id = s.order_id
AND o.order_time BETWEEN s.ship_time - INTERVAL '4' HOUR AND s.ship_time
I have the following two questions:
If o.order_time and s.ship_time are ordinary time columns rather than event-time attributes, will all of the state be kept in Flink, as with a regular inner join, so that the state may grow large?
If o.order_time and s.ship_time are event-time attributes, will Flink rely on watermarks to clean up state, so that only a small amount of state is kept?

Yes, that's correct. The reason Flink SQL has the notion of time attributes is so that suitable streaming queries can have their state automatically cleaned up, and an interval join is an example of such a query. Time windows and temporal joins on versioned tables also work in a similar way.

Related

How to join two streams without time window in flink?

I am getting data from two streams. I want to join these two streams based on a key. For example, consider two streams. Data in stream A can come first. Sometimes data in stream B can come first. The joining data in the streams can come at any time. Because of this nature, I can't use a windowed join. Is it possible to join two unbounded streams in flink?
I believe a non-windowed Join will behave as you want: https://nightlies.apache.org/flink/flink-docs-stable/docs/dev/table/sql/queries/joins/#regular-joins
If you are using the DataStream API instead of the SQL API, a CoFlatMap operator implementing a shared state that keeps the elements from both sides and joins them when there is an update, would allow you to implement this behavior as well.
Keep in mind that this requires keeping both sides in state forever, which can make your state grow without bound.
The note in the Flink SQL documentation advises setting a state TTL: https://nightlies.apache.org/flink/flink-docs-stable/docs/dev/table/config/#table-exec-state-ttl. The DataStream equivalent is: https://nightlies.apache.org/flink/flink-docs-stable/docs/dev/datastream/fault-tolerance/state/#state-time-to-live-ttl
The problem is that if some records in state have expired and a later update needs to be joined with an expired element, the result will be incorrect.

Inconsistent results when joining multiple tables in Flink

We have four CDC sources defined, and we need to combine their data into one result table. We're creating a table for each source using the SQL API, e.g.:
"CREATE TABLE IF NOT EXISTS PAA31 (\n" +
" WRK_SDL_DEF_NO STRING,\n" +
" HTR_FROM_DT BIGINT,\n" +
...
" update_time TIMESTAMP(3) METADATA FROM 'value.source.timestamp' VIRTUAL,\n" +
" PRIMARY KEY (WRK_SDL_DEF_NO) NOT ENFORCED,\n" +
" WATERMARK FOR update_time AS update_time\n" +
") WITH ('value.format' = 'debezium-json' ... )";
After we've defined each table, we create a new table by running the following query:
"SELECT PAA30.WRK_SDL_DEF_NO as id,\n" +
" PAA33.DSB_TX as description,\n" +
...
"FROM PAA30\n" +
"INNER JOIN PAA33 ON PAA30.WRK_SDL_DEF_NO = PAA33.WRK_SDL_DEF_NO AND PAA33.LGG_CD = 'NL' \n" +
"INNER JOIN PAA31 ON PAA30.WRK_SDL_DEF_NO = PAA31.WRK_SDL_DEF_NO \n" +
"INNER JOIN PAA32 ON PAA30.WRK_SDL_DEF_NO = PAA32.WRK_SDL_DEF_NO";
Note some rows have been left out for formatting reasons.
The issue we're running into is that executing this exact job produces inconsistent outcomes: sometimes we get 1750 resulting rows (correct), but most of the time the number of resulting rows is lower and varies from run to run.
This is the plan overview for the job in Flink. The number of records sent from each source is correct, but the number of records sent by the first join is not:
[Image: Flink job execution plan and record counts]
What could be the cause and how can we have consistent joining of all data sources?
I see that your pipeline includes an event time window with a processing time trigger, and does watermarking with zero tolerance for out-of-order events. These could be causing problems.
Flink can only produce completely correct, deterministic results for streaming workloads that involve event time logic if there are no late events. Late events can occur whenever processing time logic interferes with the watermarking, e.g.,
if the watermark generator is incorrectly configured, and doesn't account for the actual out-of-orderness
if idleness detection is used, and an inactive stream becomes re-activated
after a restart (or recovery, or rescaling) occurs
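To illustrate the first case in the list above, here is a minimal plain-Java sketch of how a watermark with too little out-of-orderness tolerance turns ordinary out-of-order events into late events. It is a simplified model, not Flink's API: the class and method names are hypothetical, and the rule "watermark = max timestamp seen − tolerance − 1, event is late if its timestamp is at or below the watermark" is an assumption modeled on bounded-out-of-orderness watermarking.

```java
import java.util.ArrayList;
import java.util.List;

/** Simplified model of bounded-out-of-orderness watermarking (hypothetical, not Flink's API). */
class LateEventSketch {

    /** Returns the timestamps that arrive at or behind the watermark in effect when they arrive. */
    static List<Long> lateEvents(long[] timestamps, long maxOutOfOrderness) {
        List<Long> late = new ArrayList<>();
        boolean seenAny = false;
        long maxSeen = 0L;
        for (long ts : timestamps) {
            if (seenAny) {
                // Assumed rule: the watermark trails the largest timestamp seen so far.
                long watermark = maxSeen - maxOutOfOrderness - 1;
                if (ts <= watermark) {
                    late.add(ts);
                }
            }
            maxSeen = seenAny ? Math.max(maxSeen, ts) : ts;
            seenAny = true;
        }
        return late;
    }

    public static void main(String[] args) {
        long[] arrivals = {1000, 3000, 2000, 6000, 2500};
        // Zero tolerance: the out-of-order events 2000 and 2500 are late.
        System.out.println(lateEvents(arrivals, 0));
        // A tolerance of 5000 covers the actual disorder, so nothing is late.
        System.out.println(lateEvents(arrivals, 5000));
    }
}
```

Which events end up late depends on arrival order, and arrival order depends on processing-time effects such as source interleaving, which is why results can differ between runs.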
Just guessing, however. Would need to see more details to give a more informed answer. A minimal, reproducible example would be ideal.
Update:
It's also the case that streaming jobs won't emit their last set of results unless something is done to provoke them to do so. In this case you could, for example, use
./bin/flink stop $JOB_ID --drain --savepointPath /tmp/flink-savepoints
to force a large watermark to be emitted that will close the last window.
Update 2:
Regular joins don't produce results with time attributes or watermarks. This is because it's impossible to guarantee that the results will be emitted in any particular order, so meaningful watermarking isn't possible. Normally it's not possible to apply event time windowing after such a join.
Update 3:
Having now studied the latest code, this obviously doesn't have anything to do with Watermarks.
If I understand correctly, the issue is that while the results always include what should be produced, there are varying numbers of additional output records. I can suggest two possible causes:
(1) When Flink is used with Debezium server there's the possibility of duplicate events. I don't think this is the explanation, but it is something to be aware of.
(2) The result of the join is non-deterministic (it varies from run to run). This is happening because the various input streams are racing against each other, and the exact order in which related events from different streams are ingested is affecting how the results are produced.
The result of the join is a changelog stream. I suspect that when the results are perfect, no retractions occurred, while in the other cases some preliminary results are produced that are later updated.
If you examine the ROW_KIND information in the output stream you should be able to confirm if this guess is correct.
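As a rough illustration of why the raw record count can differ between runs while the final result stays the same, the following plain-Java sketch applies a changelog to a keyed view. It is a simplified model under stated assumptions, not Flink code: the class name, the `{rowKind, key, value}` array layout, and the sample data are hypothetical; the row-kind strings follow the short forms `+I`, `-U`, `+U`, `-D`.

```java
import java.util.LinkedHashMap;
import java.util.Map;

/** Hypothetical sketch: materializes a changelog stream into a keyed view. */
class ChangelogSketch {

    /** Applies entries of the form {rowKind, key, value} in order and returns the resulting view. */
    static Map<String, String> materialize(String[][] changelog) {
        Map<String, String> view = new LinkedHashMap<>();
        for (String[] change : changelog) {
            String kind = change[0];
            String key = change[1];
            String value = change[2];
            switch (kind) {
                case "+I": // insert
                case "+U": // update-after
                    view.put(key, value);
                    break;
                case "-U": // update-before (retraction)
                case "-D": // delete
                    view.remove(key);
                    break;
                default:
                    throw new IllegalArgumentException("unknown row kind: " + kind);
            }
        }
        return view;
    }

    public static void main(String[] args) {
        // Run 1: the inputs happened to arrive in an order that needed no retraction.
        String[][] clean = {{"+I", "1", "final"}};
        // Run 2: a preliminary result was emitted first and later retracted and replaced.
        String[][] noisy = {{"+I", "1", "preliminary"}, {"-U", "1", "preliminary"}, {"+U", "1", "final"}};
        System.out.println(clean.length + " records -> " + materialize(clean));
        System.out.println(noisy.length + " records -> " + materialize(noisy));
    }
}
```

Both runs materialize to the same view even though the second emits three records instead of one. A sink that only appends rows, or drops retractions, would therefore show different results from run to run.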
I'm not very familiar with the pulsar connector, but I'm guessing you should be using the upsert_pulsar sink.
We've been able to get consistent results, even for bigger datasets, by enabling MiniBatch Aggregation:
configuration.setString("table.exec.mini-batch.enabled", "true");
configuration.setString("table.exec.mini-batch.allow-latency", "500 ms");
configuration.setString("table.exec.mini-batch.size", "5000");
This seems to fix the consistency issue for both the local filesystem connector as well as for the Flink Pulsar connector.
From these findings, it seems Flink was having issues with the overhead of state management for our throughput. We'll still need to assess realistic CDC initial load processing, but so far enabling MiniBatch Aggregation seems promising
Thanks @david-anderson for thinking with us and trying to figure this out.

When will flink clean up idle state in flink cep sql?

I am using flink cep sql with blink planner.
Here is my sql
select * from test_table match_recognize (
partition by agent_id,room_id,call_type
order by row_time -- processing time
measures
last(BF.create_time) as create_time,
last(AF.connect_time) as connect_time
one row per match after match SKIP PAST LAST ROW
pattern (BF+ AF) WITHIN INTERVAL '10' SECOND
define
BF as BF.connect_time = 0,
AF as AF.connect_time > 0
) as T ;
The test_table is a Kafka table.
I set table.exec.state.ttl=10000, ran my program, and then kept sending messages.
Since I set both the state TTL and the CEP interval to 10 seconds, I expected the state size to level off about 10 seconds after startup.
In fact, the state kept growing for at least 15 minutes. In addition, the JVM triggered two full GCs.
Is there any configuration I have missed?
You cannot use checkpoint sizes to estimate state size -- they are not related in any straightforward way. Checkpoints can include unpredictable amounts of in-flight, expired, or uncompacted data -- none of which would be counted as active state.
I'm afraid there isn't any good tooling available for measuring exactly how much state you actually have. But if you are using RocksDB, then you can enable these metrics
state.backend.rocksdb.metrics.estimate-live-data-size
state.backend.rocksdb.metrics.estimate-num-keys
which will give you a reasonably accurate estimate (but you may pay a performance penalty for turning them on).
As for your concern about CEP state -- you should be fine. Anytime you have a pattern that uses WITHIN, CEP should be able to clean the state automatically.

Stream Joins for Large Time Windows with Flink

I need to join two event sources based on a key. The gap between the events can be up to 1 year (i.e., event1 with id1 may arrive today and the corresponding event2 with id1 from the second event source may arrive a year later). Assume I just want to stream out the joined event output.
I am exploring the option of using Flink with the RocksDB backend (I came across the Table API, which appears to suit my use case). I have not been able to find reference architectures that do this kind of long-window join. I expect the system to process about 200M events a day.
Questions:
Are there any obvious limitations/pitfalls of using Flink for this kind of Long Window joins?
Any recommendations on handling this kind of long window joins
Related: I am also exploring using Lambda with DynamoDB as the state store to do stream joins (see the related question). I will be using managed AWS services, if this info is relevant.
The obvious challenges of this use case are the large join window size of one year and the high ingestion rate, which together can result in a huge state size.
The main question here is whether this is a 1:1 join, i.e., whether a record from stream A joins exactly (or at most) once with a record from stream B. This is important, because if you have a 1:1 join, you can remove a record from the state as soon as it has been joined with another record, and you don't need to keep it around for the full year. Hence, your state only stores records that have not been joined yet. Assuming that the majority of records are joined quickly, your state might remain reasonably small.
If you have a 1:1 join, the time-window joins of Flink's Table API (and SQL) and the Interval join of the DataStream API are not what you want. They are implemented as m:n joins because every record might join with more than one record of the other input. Hence they keep all records for the full window interval, i.e., for one year in your use case. If you have a 1:1 join, you should implement the join yourself as a KeyedCoProcessFunction.
If every record can join multiple times within one year, there's no way around buffering these records. In this case, you can use the time-window joins of Flink's Table API (and SQL) and the Interval join of the DataStream API.
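The 1:1 case can be sketched in plain Java to show why state stays small: each side is buffered only until its partner arrives, and the buffered record is removed at the moment of the join. This is a simplified, single-threaded simulation with hypothetical names, not a Flink operator. In an actual KeyedCoProcessFunction the two maps would presumably be keyed state, and a timer would be needed to drop records whose partner never arrives within the year.

```java
import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

/** Hypothetical simulation of a 1:1 join that frees state as soon as a pair is matched. */
class OneToOneJoinSketch {
    private final Map<String, String> pendingA = new HashMap<>();
    private final Map<String, String> pendingB = new HashMap<>();
    final List<String> output = new ArrayList<>();

    /** Handles a record from stream A: join with a waiting B record, or buffer it. */
    void onA(String key, String value) {
        String b = pendingB.remove(key); // removing on match is what keeps state small
        if (b != null) {
            output.add(key + ":" + value + "+" + b);
        } else {
            pendingA.put(key, value);
        }
    }

    /** Handles a record from stream B: join with a waiting A record, or buffer it. */
    void onB(String key, String value) {
        String a = pendingA.remove(key);
        if (a != null) {
            output.add(key + ":" + a + "+" + value);
        } else {
            pendingB.put(key, value);
        }
    }

    /** Number of records still waiting for their partner. */
    int bufferedRecords() {
        return pendingA.size() + pendingB.size();
    }

    public static void main(String[] args) {
        OneToOneJoinSketch join = new OneToOneJoinSketch();
        join.onA("id1", "a1"); // buffered, partner not seen yet
        join.onB("id2", "b2"); // buffered, partner not seen yet
        join.onB("id1", "b1"); // joins with a1 and frees id1's state
        System.out.println(join.output);            // [id1:a1+b1]
        System.out.println(join.bufferedRecords()); // 1
    }
}
```

An m:n join cannot take this shortcut, because a matched record may still join with later arrivals and so has to stay buffered for the whole interval.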

Is there a way to clear Postgres stats xact_commit and xact_rollback only?

I am working with a postgres database that is being monitored by icinga2, and one of our monitors is looking at the commit ratio of a database:
select
round(100.*sd.xact_commit/(sd.xact_commit+sd.xact_rollback), 2) AS dcommitratio,
d.datname,
r.rolname AS rolname
FROM pg_stat_database sd
JOIN pg_database d ON (d.oid=sd.datid)
JOIN pg_roles r ON (r.oid=d.datdba)
WHERE sd.xact_commit+sd.xact_rollback<>0;
The problem is that an application recently had a bug (now fixed!) that increased the count of rollbacks considerably, so that the commit ratio is now only 78%, and it is triggering alarms every day.
I could run pg_stat_reset(), but is there a way to clear only these two counters? I don't want to inadvertently clear any other necessary stats, such as those used by autovacuum or the query optimizer. Or is pg_stat_reset() considered safe to run?
Unfortunately it is all-or-nothing with resetting PostgreSQL statistics.
But I'd say that your monitoring system is monitoring the wrong thing anyway. Rather than monitoring the absolute values of xact_commit and xact_rollback, you should monitor the changes in the values since the last check.
Otherwise you will not detect a potential problem in a timely fashion: if there have been many months of normal operation, it will take a long time of misbehavior to change the ratio perceptibly.
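A minimal sketch of the delta-based check, assuming the monitor can store the previous sample of xact_commit and xact_rollback between runs. The class name, method name, and sample numbers are hypothetical, and the arithmetic is plain Java rather than anything provided by icinga2 or PostgreSQL.

```java
/** Hypothetical helper: commit ratio over the interval between two counter samples. */
class CommitRatioSketch {

    /**
     * Returns the commit ratio in percent for transactions that happened between
     * the previous and the current sample. Returns NaN when there was no activity
     * or when the counters went backwards (for example after a statistics reset).
     */
    static double intervalCommitRatio(long prevCommit, long prevRollback,
                                      long currCommit, long currRollback) {
        long commits = currCommit - prevCommit;
        long rollbacks = currRollback - prevRollback;
        if (commits < 0 || rollbacks < 0 || commits + rollbacks == 0) {
            return Double.NaN;
        }
        return 100.0 * commits / (commits + rollbacks);
    }

    public static void main(String[] args) {
        // Cumulative counters still carry the old bug: about 78% overall.
        long prevCommit = 780_000, prevRollback = 220_000;
        long currCommit = 781_000, currRollback = 220_010;
        double cumulative = 100.0 * currCommit / (currCommit + currRollback);
        double recent = intervalCommitRatio(prevCommit, prevRollback, currCommit, currRollback);
        System.out.println("cumulative ratio: " + cumulative);
        System.out.println("ratio since last check: " + recent);
    }
}
```

With these sample numbers the cumulative ratio stays near 78% while the ratio since the last check is about 99%, so the alarm would clear without resetting any statistics.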
