Flink: sharing state between functions - apache-flink

As we know, there is no state-sharing mechanism in Flink at the moment, but I suppose we can approximate one. Suppose we have a Flink job (with a single input source) and we want to know what happened at the end of it, in order to adjust the job's processing steps.
I have thought of:
Sinking the state into a broadcast source, then consuming it to update the state of the functions
Using external services to store and retrieve it:
sink the state to a DB, and use an async function to retrieve it amid the job flow
use a stateful function to update/read it from an external service amid the job flow
store the state in a Redis table and retrieve it amid the job flow
I think the first should be the most suitable, as the others require extra setup and extend the complexity to other systems.
What's your opinion on those options?
Are there other ways?
Thanks

If you use Stateful Functions then it's easy to send a message from the final processing step back to the upstream operator(s).
If you're OK with potentially losing this state if it's in-flight when your job restarts (i.e., it's a hint for adjusting job processing rather than a requirement), then you can use an IterativeStream to send it back upstream. That would remove the need for Kafka or some other external feedback system. See also: How does Flink treat checkpoints and state within IterativeStream?
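For reference, a minimal sketch of what the IterativeStream approach could look like (Event, its isHint() flag, EventSource, AdjustingProcessor, and ResultSink are all made up for illustration):

import org.apache.flink.streaming.api.datastream.DataStream;
import org.apache.flink.streaming.api.datastream.IterativeStream;
import org.apache.flink.streaming.api.environment.StreamExecutionEnvironment;

StreamExecutionEnvironment env = StreamExecutionEnvironment.getExecutionEnvironment();
DataStream<Event> input = env.addSource(new EventSource());              // hypothetical source

// Open an iteration; whatever is passed to closeWith() is re-injected at this point.
IterativeStream<Event> iteration = input.iterate();

// Main processing; assume it consults the hints carried by feedback events.
DataStream<Event> processed = iteration.map(new AdjustingProcessor());   // hypothetical MapFunction<Event, Event>

// Events flagged as hints flow back upstream; everything else is a final result.
DataStream<Event> feedback = processed.filter(e -> e.isHint());
DataStream<Event> results = processed.filter(e -> !e.isHint());

iteration.closeWith(feedback);
results.addSink(new ResultSink());                                       // hypothetical sink

As noted above, feedback records that are in-flight when the job fails are not covered by checkpoints, so treat them as best-effort hints.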

I used Kafka. Whenever the state changed, I sent it to a Kafka sink as a side output, and the other tasks subscribed to the same topic were notified.
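Roughly, the shape of that approach looks like this (Event, Result, computeNewState(), the topic name, and the Kafka properties are placeholders; adjust to your own types):

import org.apache.flink.api.common.serialization.SimpleStringSchema;
import org.apache.flink.api.common.state.ValueState;
import org.apache.flink.api.common.state.ValueStateDescriptor;
import org.apache.flink.configuration.Configuration;
import org.apache.flink.streaming.api.datastream.SingleOutputStreamOperator;
import org.apache.flink.streaming.api.functions.KeyedProcessFunction;
import org.apache.flink.streaming.connectors.kafka.FlinkKafkaProducer;
import org.apache.flink.util.Collector;
import org.apache.flink.util.OutputTag;

final OutputTag<String> stateChanges = new OutputTag<String>("state-changes") {};

SingleOutputStreamOperator<Result> main = events
        .keyBy(e -> e.getKey())
        .process(new KeyedProcessFunction<String, Event, Result>() {
            private transient ValueState<String> state;

            @Override
            public void open(Configuration parameters) {
                state = getRuntimeContext().getState(
                        new ValueStateDescriptor<>("state", String.class));
            }

            @Override
            public void processElement(Event e, Context ctx, Collector<Result> out) throws Exception {
                String newState = computeNewState(e, state.value());   // hypothetical
                if (!newState.equals(state.value())) {
                    state.update(newState);
                    ctx.output(stateChanges, newState);                // emit the change as a side output
                }
                // ... normal processing and out.collect(...) ...
            }
        });

// Publish the state changes so that other jobs/tasks subscribed to the topic are notified.
main.getSideOutput(stateChanges)
    .addSink(new FlinkKafkaProducer<>("state-changes-topic", new SimpleStringSchema(), kafkaProps));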

Related

Need advice on migrating from Flink DataStream Job to Flink Stateful Functions 3.1

I have a working Flink job built on the Flink DataStream API. I want to rewrite the entire job based on Flink Stateful Functions 3.1.
The functions of my current Flink Job are:
Read message from Kafka
Each message is a slice of a data packet, e.g. (s = slice):
s-0, s-1 are for packet 0
s-4, s-5, s-6 are for packet 1
The job merges the slices into data packets and then sinks the packets to HBase
Window functions are applied to deal with out-of-order slice arrival
My Objectives
I already have a Flink Stateful Functions demo running on my k8s cluster, and I want to rewrite my entire job on top of Stateful Functions.
Save data into MinIO instead of HBase
My current plan
I have read the docs and got some ideas. My plans are:
There's no need to deal with Kafka directly anymore; the Kafka ingress (https://nightlies.apache.org/flink/flink-statefun-docs-release-3.0/docs/io-module/apache-kafka/) handles it
Rewrite my job based on the Java SDK. Merging is straightforward, but what about the window functions?
Maybe I should use persisted state with TTL to mimic the window behavior
An egress for MinIO is not in the list of default Flink I/O connectors, so I need to write a custom Flink I/O connector for MinIO myself, following https://nightlies.apache.org/flink/flink-statefun-docs-release-3.0/docs/io-module/flink-connectors/
I want to avoid the embedded module because it prevents scaling. Auto-scaling is the key reason why I want to migrate to Flink Stateful Functions.
My Questions
I don't feel confident about my plan. Is there anything wrong with my understanding/plan?
Are there any best practices I should refer to?
Update:
windows were used to assemble results:
get a slice, inspect its metadata, and learn that it is the last one of the packet
the metadata also says that the packet should contain 10 slices
if there are already 10 slices, merge them
if there are not enough slices yet, wait for some time (e.g. 10 minutes) and then either merge what has arrived or record a packet error
I want to get rid of windows during the rewrite, but I don't know how.
Background: Use KeyedProcessFunctions Rather than Windows to Assemble Related Events
With the DataStream API, windows are not a good building block for assembling together related events. The problem is that windows begin and end at times that are aligned to the clock, rather than being aligned to the events. So even if two related events are only a few milliseconds apart they might be assigned to different windows.
In general, it's more straightforward to implement this sort of use case with keyed process functions, and use timers as needed to deal with missing or late events.
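A rough sketch of that pattern with a KeyedProcessFunction (Slice, Packet, expectedSliceCount(), and the merge helpers are placeholders for the types in your job):

import org.apache.flink.api.common.state.ListState;
import org.apache.flink.api.common.state.ListStateDescriptor;
import org.apache.flink.api.common.state.ValueState;
import org.apache.flink.api.common.state.ValueStateDescriptor;
import org.apache.flink.configuration.Configuration;
import org.apache.flink.streaming.api.functions.KeyedProcessFunction;
import org.apache.flink.util.Collector;
import java.util.ArrayList;
import java.util.List;

// Key = packet id. Collect slices in state; register a timer as a timeout for incomplete packets.
public class PacketAssembler extends KeyedProcessFunction<String, Slice, Packet> {

    private transient ListState<Slice> slices;
    private transient ValueState<Long> timeoutTimer;

    @Override
    public void open(Configuration parameters) {
        slices = getRuntimeContext().getListState(new ListStateDescriptor<>("slices", Slice.class));
        timeoutTimer = getRuntimeContext().getState(new ValueStateDescriptor<>("timeout", Long.class));
    }

    @Override
    public void processElement(Slice slice, Context ctx, Collector<Packet> out) throws Exception {
        slices.add(slice);

        // Register a timeout when the first slice of a packet arrives (10 minutes, as in the question).
        if (timeoutTimer.value() == null) {
            long timeout = ctx.timerService().currentProcessingTime() + 10 * 60 * 1000L;
            ctx.timerService().registerProcessingTimeTimer(timeout);
            timeoutTimer.update(timeout);
        }

        List<Slice> collected = new ArrayList<>();
        slices.get().forEach(collected::add);
        if (collected.size() == slice.expectedSliceCount()) {   // e.g. 10, taken from the metadata
            out.collect(Packet.merge(collected));               // hypothetical merge helper
            cleanup(ctx);
        }
    }

    @Override
    public void onTimer(long timestamp, OnTimerContext ctx, Collector<Packet> out) throws Exception {
        // Timeout fired first: emit the partial packet or record an error, whatever is appropriate.
        List<Slice> collected = new ArrayList<>();
        slices.get().forEach(collected::add);
        out.collect(Packet.partial(collected));                 // hypothetical
        cleanup(ctx);
    }

    private void cleanup(Context ctx) throws Exception {
        Long timer = timeoutTimer.value();
        if (timer != null) {
            ctx.timerService().deleteProcessingTimeTimer(timer);
        }
        slices.clear();
        timeoutTimer.clear();
    }
}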
Doing this with the Statefun API
You can use the same pattern mentioned above. The function id will play the same role as the key, and you can use a delayed message instead of a timer:
as each slice arrives, add it to the packet that's being assembled
if it is the first slice, send a delayed message that will act as a timeout
when all the slices have arrived, merge them and send the packet
if the delayed message arrives before the packet is complete, do whatever is appropriate (e.g., go ahead and send the partial packet)
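A very rough sketch of that pattern with the StateFun Java SDK (function/spec registration, the egress call, and the actual slice payload handling are omitted; this version only counts slices, and all names are illustrative):

import org.apache.flink.statefun.sdk.java.Context;
import org.apache.flink.statefun.sdk.java.StatefulFunction;
import org.apache.flink.statefun.sdk.java.ValueSpec;
import org.apache.flink.statefun.sdk.java.message.Message;
import org.apache.flink.statefun.sdk.java.message.MessageBuilder;
import java.time.Duration;
import java.util.concurrent.CompletableFuture;

// One function instance per packet id. Only a slice count is kept here; a real job would also
// keep the slice payloads in another ValueSpec with a custom type.
public class PacketAssemblerFn implements StatefulFunction {

  static final ValueSpec<Integer> SEEN = ValueSpec.named("seen").withIntType();
  private static final String TIMEOUT = "packet-timeout";
  private static final int EXPECTED_SLICES = 10;   // e.g. taken from the slice metadata

  @Override
  public CompletableFuture<Void> apply(Context context, Message message) {
    if (message.isUtf8String() && TIMEOUT.equals(message.asUtf8String())) {
      // The delayed message fired before the packet completed: handle the partial packet, clean up.
      context.storage().remove(SEEN);
      return context.done();
    }

    int seen = context.storage().get(SEEN).orElse(0);
    if (seen == 0) {
      // First slice: schedule a timeout by sending a delayed message to ourselves.
      context.sendAfter(
          Duration.ofMinutes(10),
          MessageBuilder.forAddress(context.self().type(), context.self().id())
              .withValue(TIMEOUT)
              .build());
    }

    seen += 1;
    if (seen == EXPECTED_SLICES) {
      // All slices arrived: merge them and send the packet to the egress (omitted here).
      context.storage().remove(SEEN);
    } else {
      context.storage().set(SEEN, seen);
    }
    return context.done();
  }
}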

Initialize Flink Job

We are deploying a new Flink stream processing job, and its state (stores) needs to be initialized with historical data; this data should be available in the state store before the job starts processing any new application events. We don't want to significantly modify the Flink job to also load the historical data.
We considered writing another, separate Flink job to process the historical data, update its state store, and create a savepoint, then use this savepoint to initialize the state in the main Flink job. It looks like the State Processor API only works with the DataSet API, so I'm wondering about alternative solutions. Thanks.
The State Processor API is a good solution. It provides a sort of savepoint connector that you use in a DataSet job to read/modify/update the savepoints that you use in your DataStream jobs.
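For example, here is a rough sketch of bootstrapping keyed state with the DataSet-based State Processor API (HistoryRecord, the state name, the operator uid, and the paths are placeholders; the state descriptor and uid must match the ones used by the streaming job):

import org.apache.flink.api.common.state.ValueState;
import org.apache.flink.api.common.state.ValueStateDescriptor;
import org.apache.flink.api.java.DataSet;
import org.apache.flink.api.java.ExecutionEnvironment;
import org.apache.flink.configuration.Configuration;
import org.apache.flink.runtime.state.memory.MemoryStateBackend;
import org.apache.flink.state.api.BootstrapTransformation;
import org.apache.flink.state.api.OperatorTransformation;
import org.apache.flink.state.api.Savepoint;
import org.apache.flink.state.api.functions.KeyedStateBootstrapFunction;

public class BootstrapHistoricalState {

    // Placeholder for whatever your historical records look like.
    public static class HistoryRecord {
        public String key;
        public HistoryRecord() {}
        public HistoryRecord(String key) { this.key = key; }
    }

    // Writes the same keyed state (same name/type) that the streaming job will read after restore.
    public static class HistoryBootstrapper extends KeyedStateBootstrapFunction<String, HistoryRecord> {
        private transient ValueState<HistoryRecord> latest;

        @Override
        public void open(Configuration parameters) {
            latest = getRuntimeContext().getState(
                    new ValueStateDescriptor<>("latest", HistoryRecord.class));
        }

        @Override
        public void processElement(HistoryRecord record, Context ctx) throws Exception {
            latest.update(record);
        }
    }

    public static void main(String[] args) throws Exception {
        ExecutionEnvironment env = ExecutionEnvironment.getExecutionEnvironment();

        // Replace with however you read the historical data (file, JDBC, etc.).
        DataSet<HistoryRecord> history = env.fromElements(new HistoryRecord("key-1"));

        BootstrapTransformation<HistoryRecord> transformation = OperatorTransformation
                .bootstrapWith(history)
                .keyBy(r -> r.key)
                .transform(new HistoryBootstrapper());

        Savepoint
                .create(new MemoryStateBackend(), 128)                  // 128 = max parallelism
                .withOperator("stateful-operator-uid", transformation)  // must match uid() in the streaming job
                .write("file:///tmp/bootstrap-savepoint");

        env.execute("bootstrap historical state");
    }
}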
It's a pretty simple change (definitely not "significant") to support a -preload mode for your job, where the non-historical data sources get replaced by empty/non-terminating sources. I typically use counters to decide when state has been fully populated, then stop with a savepoint, and restart without the -preload option.
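The source swap itself can be tiny; something like this sketch (Event and MyKafkaSource are placeholders for your real types and sources):

import org.apache.flink.api.java.utils.ParameterTool;
import org.apache.flink.streaming.api.datastream.DataStream;
import org.apache.flink.streaming.api.environment.StreamExecutionEnvironment;
import org.apache.flink.streaming.api.functions.source.SourceFunction;

ParameterTool params = ParameterTool.fromArgs(args);
StreamExecutionEnvironment env = StreamExecutionEnvironment.getExecutionEnvironment();

// In -preload mode, replace the live source with one that emits nothing and never terminates,
// so the job only processes the historical data and can then be stopped with a savepoint.
DataStream<Event> liveEvents = params.has("preload")
        ? env.addSource(new SourceFunction<Event>() {
              private volatile boolean running = true;
              @Override
              public void run(SourceContext<Event> ctx) throws Exception {
                  while (running) {
                      Thread.sleep(1000);   // emit nothing, just stay alive
                  }
              }
              @Override
              public void cancel() {
                  running = false;
              }
          })
        : env.addSource(new MyKafkaSource());   // hypothetical: the normal live source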

What is the best way to have a cache of an external database in Flink?

The external database consists of a set of rules for each key; these rules should be applied to each stream element in the Flink job. Because it is very expensive to make a DB call for every element to retrieve the rules, I want to fetch the rules from the database at initialization and store them in a local cache.
When the rules are updated in the external database, a status-change event is published to the Flink job, which should trigger fetching the rules again and refreshing the cache.
What is the best way to achieve what I've described? I looked into keyed state, but initializing all keys and refreshing them on update doesn't seem possible.
I think you can make use of BroadcastProcessFunction or KeyedBroadcastProcessFunction to achieve your use case. A detailed blog is available here.
In short: you can define a source, such as Kafka or any other, and publish to it the rules that the actual stream should consume. Connect the actual data stream and the rules stream. Then processBroadcastElement will receive the rules, and there you can update the state. Finally, the updated state (rules) can be read in the actual event-processing method, processElement.
Points to consider: broadcast state is always kept on the heap, not in the state backend (RocksDB), so it has to be small enough to fit in memory. Each slot copies all of the broadcast state into its checkpoints, so all checkpoints and savepoints will contain n (parallelism) copies of the broadcast state.
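A condensed sketch of that wiring (Rule, Event, Output, and their accessors are placeholders):

import org.apache.flink.api.common.state.MapStateDescriptor;
import org.apache.flink.api.common.typeinfo.Types;
import org.apache.flink.streaming.api.datastream.BroadcastStream;
import org.apache.flink.streaming.api.datastream.DataStream;
import org.apache.flink.streaming.api.functions.co.KeyedBroadcastProcessFunction;
import org.apache.flink.util.Collector;

MapStateDescriptor<String, Rule> rulesDescriptor =
        new MapStateDescriptor<>("rules", Types.STRING, Types.POJO(Rule.class));

BroadcastStream<Rule> rulesBroadcast = rulesStream.broadcast(rulesDescriptor);

DataStream<Output> result = events
        .keyBy(e -> e.getKey())
        .connect(rulesBroadcast)
        .process(new KeyedBroadcastProcessFunction<String, Event, Rule, Output>() {

            @Override
            public void processElement(Event event, ReadOnlyContext ctx, Collector<Output> out) throws Exception {
                // Read side: look up the current rule for this event's key and apply it.
                Rule rule = ctx.getBroadcastState(rulesDescriptor).get(event.getKey());
                if (rule != null) {
                    out.collect(rule.apply(event));   // hypothetical
                }
            }

            @Override
            public void processBroadcastElement(Rule rule, Context ctx, Collector<Output> out) throws Exception {
                // Write side: every parallel instance receives each rule update and stores it.
                ctx.getBroadcastState(rulesDescriptor).put(rule.getKey(), rule);
            }
        });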
A few different mechanisms in Flink may be relevant to this use case, depending on your detailed requirements.
Broadcast State
Jaya Ananthram has already covered the idea of using broadcast state in his answer. This makes sense if the rules should be applied globally, for every key, and if you can find a way to collect and broadcast the updates.
Note that the Context passed to processBroadcastElement() of a KeyedBroadcastProcessFunction contains the method applyToKeyedState(StateDescriptor<S, VS> stateDescriptor, KeyedStateFunction<KS, S> function). This means you can register a KeyedStateFunction that will be applied to the state of every key associated with the provided stateDescriptor.
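A small sketch of that (Event, Rule, Output, and appliesTo() are placeholders), where each broadcast update is pushed into the per-key state of every key seen so far:

import org.apache.flink.api.common.state.ValueState;
import org.apache.flink.api.common.state.ValueStateDescriptor;
import org.apache.flink.streaming.api.functions.co.KeyedBroadcastProcessFunction;
import org.apache.flink.util.Collector;

public class RuleUpdatingFunction extends KeyedBroadcastProcessFunction<String, Event, Rule, Output> {

    private final ValueStateDescriptor<Rule> perKeyRule =
            new ValueStateDescriptor<>("per-key-rule", Rule.class);

    @Override
    public void processElement(Event event, ReadOnlyContext ctx, Collector<Output> out) throws Exception {
        Rule rule = getRuntimeContext().getState(perKeyRule).value();
        // ... apply the rule to the event and emit results ...
    }

    @Override
    public void processBroadcastElement(Rule rule, Context ctx, Collector<Output> out) throws Exception {
        // Applies the given function to the keyed state of every key registered under this descriptor.
        ctx.applyToKeyedState(perKeyRule, (key, state) -> {
            if (rule.appliesTo(key)) {     // hypothetical predicate
                state.update(rule);
            }
        });
    }
}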
State Processor API
If you want to bootstrap state in a Flink savepoint from a database dump, you can do that with this library. You'll find a simple example of using the State Processor API to bootstrap state in this gist.
Change Data Capture
The Table/SQL API supports Debezium, Canal, and Maxwell CDC streams, and Kafka upsert streams. This may be a solution. There's also flink-cdc-connectors.
Lookup Joins
Flink SQL can do temporal lookup joins against a JDBC database, with a configurable cache. Not sure this is relevant.
In essence, David's answer summarizes it well. If you are looking for more detail: not long ago I gave a webinar [1] on this topic, including running code examples [2].
[1] https://www.youtube.com/watch?v=cJS18iKLUIY
[2] https://github.com/knaufk/enrichments-with-flink

Is there a way to programmatically check if a Flink streaming job started from a savepoint before executing the stream?

Before calling execute on the StreamExecutionEnvironment and starting the streaming job, is there a way to programmatically find out whether or not the job was restored from a savepoint? I need to know this so that I can set the offsets of a Kafka source accordingly while building the job graph.
It seems that the FlinkKafkaConsumerBase class, whose initializeState method has access to this information (code), could help. However, there is no way to intercept the FunctionInitializationContext and retrieve the isRestored() value, since initializeState is a final method. Also, initializeState gets called only after the job graph has been submitted for execution, so I don't think a feasible solution lies there.
Another attempt I made was to find a Flink job parameter that indicates whether or not the job was started from a savepoint; however, I don't think such a parameter exists.
You can get the effect you are looking for by simply doing this:
FlinkKafkaConsumer<String> myConsumer = new FlinkKafkaConsumer<>(...);
myConsumer.setStartFromEarliest();
If you use setStartFromEarliest, then Flink will ignore the offsets stored in Kafka and instead begin reading from the earliest record. However, even with setStartFromEarliest, if Flink is resuming from a checkpoint or savepoint, it will use the offsets stored in that snapshot instead.
Note that Flink does its own Kafka offset management, and when recovering from a checkpoint it ignores the offsets stored in Kafka. Flink does this as part of providing exactly-once guarantees, which requires knowing exactly how much of the input was consumed to produce the results present in the rest of the state captured in a checkpoint or savepoint. For this reason, Flink always stores the offsets as part of every state snapshot (checkpoint or savepoint).
This is documented here and here.
As for your original question about initializeState, this is available if you implement the CheckpointedFunction interface, but it's quite rare to actually need this.
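If you do end up needing the isRestored() flag at runtime (as opposed to while building the job graph), a minimal sketch looks like this:

import org.apache.flink.api.common.functions.RichMapFunction;
import org.apache.flink.runtime.state.FunctionInitializationContext;
import org.apache.flink.runtime.state.FunctionSnapshotContext;
import org.apache.flink.streaming.api.checkpoint.CheckpointedFunction;

// Any rich function can observe isRestored() by also implementing CheckpointedFunction.
public class RestoreAwareMapper extends RichMapFunction<String, String> implements CheckpointedFunction {

    private transient boolean restored;

    @Override
    public void initializeState(FunctionInitializationContext context) {
        // true if this task was restored from a checkpoint or savepoint
        restored = context.isRestored();
    }

    @Override
    public void snapshotState(FunctionSnapshotContext context) {
        // nothing to snapshot in this sketch
    }

    @Override
    public String map(String value) {
        return restored ? "restored: " + value : value;
    }
}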

Translating change log into actual state in Apache Camel

I have two systems, call them A and B.
When some significant object changes in A, A sends it to B through Apache Camel.
However, I have encountered one case where A actually has a change log of an object, while B must reflect only the actual state of the object.
Moreover, change log in A can contain "future" records.
That means an object's state change can be scheduled for some moment in the future.
A user of system A can edit this change log: remove change records, add new change records with any timestamp (in the past or the future), and even update existing changes.
Of course, A sends these change records to B, but B needs only actual state of the object.
Note that I can query objects from A, but A is a performance-critical system, and therefore I am not going to query it, as that would cause additional load.
Also, the API for querying data from A is overcomplicated, and I would like to avoid it whenever possible.
I can see two problems here.
The first is recognizing whether a particular change log record may change the actual state.
I am going to store change log in an intermediate database.
As a change log record comes in, I am going to add/remove/update it in the intermediate database, then calculate the actual state of the object and send this state to B.
The second is tracking the change schedule.
I couldn't come up with anything other than running a periodic job at a constant interval (say, every 15 minutes).
This job would scan all records that fall in the time interval from the last invocation until the current invocation.
What I like Apache Camel for is its component-based approach, where you only need to connect endpoints to get everything working, with only a small amount of coding.
Are there any pre-existing primitives for this problem, either in Apache Camel or in EIP?
I am actually working on a very similar use case, where system A sends a snapshot and updates which require translation before being sent to system B.
First, you need a mechanism that triggers loading of the initial state (the "snapshot") from system A; the timer: component can run this one-time startup logic.
Next, you will receive the snapshot data (you didn't specify how; perhaps it is an FTP file or a JMS endpoint). Validate the data, split it into items, and store each item in a local in-memory cache:, keyed uniquely, as Sergey suggests in his comment. Use an expiry policy that makes sense (e.g. 48 hours).
From there, continually process the "update" data from the ftp: endpoint. For each update, you will need to match it with the data in the cache: and determine what (and when) needs to be sent to system B.
Data that needs to be sent to system B later will have to be persisted in memory or in a database.
Finally, you need a scheduling mechanism that decides every 15 minutes whether new data should be sent; you can easily use timer: or quartz: for this.
In summary, you can build this integration from the following components: timer, cache, ftp, and quartz, plus some custom beans/processors to perform the custom logic.
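As a rough illustration of how those pieces could hang together in the Camel Java DSL (every endpoint URI, option, and bean name below is made up for the sketch):

import org.apache.camel.builder.RouteBuilder;

public class ChangeLogRoutes extends RouteBuilder {
    @Override
    public void configure() {
        // One-time startup: trigger loading of the initial snapshot from system A.
        from("timer:snapshot?repeatCount=1")
            .to("bean:snapshotLoader?method=loadInitialState");

        // Continuous feed of change-log records from system A.
        from("ftp://user@host/changes?delete=true")
            .split(body().tokenize("\n"))
            .to("bean:changeLogStore?method=upsert");       // keep the change log in an intermediate store

        // Every 15 minutes, find scheduled changes that are now due and push the resulting state to B.
        // (On Camel 2.x this component is called quartz2: instead of quartz:.)
        from("quartz://dueChanges?cron=0+0/15+*+*+*+?")
            .to("bean:changeLogStore?method=findDueChanges")
            .split(body())
            .to("bean:stateCalculator?method=toActualState")
            .to("jms:queue:systemB");
    }
}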
The main challenges are handling data that is cached and then updated, and working out the control mechanisms for what should happen on the initial connect, on a disconnect, or if your Camel application is restarted.
Good luck ;)
