Flink job dynamic input parameters - apache-flink

One parameter for my Flink job is dynamic, and I have an API to fetch the dynamic value. Can I call the API in the source every time to fetch data based on that parameter? Is that the correct way? Will it cause any trouble in the Flink job?

So, if I understand correctly, the idea is that you first get some key from DynamoDB and then use it to query an external service from the source.
I think that should be possible in general, but there are a few things to keep in mind when doing it.
I'm not sure about the performance of such a solution. Are you going to query the database constantly, or somehow receive only the changes? There are several things to consider here to get good performance out of the source.
It may be hard to provide any strong guarantees for such a setup, but that depends on the characteristics of the setup itself. For instance: how are you going to handle failures? How often will the key in the database change? Will the data still be accessible via the URL after the key in the DB changes? You can probably keep the last read key in state, so that when the job fails and the key in the DB changes, you can try to read the data for the previous key (the one for which the job failed), but that depends on the answers to the questions above.
Finally, depending on the characteristics of the setup, it may be possible to use existing Flink operators to achieve this. For example, you could stream changes from the database (using one of the existing connectors, depending on the DB) and then use that data in AsyncIO to query the external URL, so that you end up with a stream of data from the URL without writing your own source.
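To make the AsyncIO idea a bit more concrete, here is a rough sketch (not a drop-in implementation): it assumes a hypothetical stream of keys coming from the database and a placeholder URL scheme, and wires Flink's AsyncDataStream together with Java's HttpClient.

```java
// Sketch only: FETCH_URL, the key stream, and the timeout/capacity values are placeholders.
import java.net.URI;
import java.net.http.HttpClient;
import java.net.http.HttpRequest;
import java.net.http.HttpResponse;
import java.util.Collections;
import java.util.concurrent.TimeUnit;

import org.apache.flink.configuration.Configuration;
import org.apache.flink.streaming.api.datastream.AsyncDataStream;
import org.apache.flink.streaming.api.datastream.DataStream;
import org.apache.flink.streaming.api.functions.async.ResultFuture;
import org.apache.flink.streaming.api.functions.async.RichAsyncFunction;

public class UrlFetcher extends RichAsyncFunction<String, String> {

    private static final String FETCH_URL = "https://example.com/data/"; // placeholder

    private transient HttpClient httpClient;

    @Override
    public void open(Configuration parameters) {
        httpClient = HttpClient.newHttpClient();
    }

    @Override
    public void asyncInvoke(String key, ResultFuture<String> resultFuture) {
        HttpRequest request = HttpRequest.newBuilder(URI.create(FETCH_URL + key)).build();
        httpClient.sendAsync(request, HttpResponse.BodyHandlers.ofString())
                .thenAccept(response -> resultFuture.complete(Collections.singleton(response.body())))
                .exceptionally(t -> {
                    resultFuture.completeExceptionally(t);
                    return null;
                });
    }

    // Usage: keyChanges is the stream of keys coming from the database
    // (e.g. via a CDC connector); each key is resolved to its payload asynchronously.
    public static DataStream<String> fetchPayloads(DataStream<String> keyChanges) {
        return AsyncDataStream.unorderedWait(keyChanges, new UrlFetcher(), 30, TimeUnit.SECONDS, 100);
    }
}
```

The unorderedWait() timeout and capacity are illustrative; tune them to the latency and rate limits of the external API.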

Related

Idiomatic way to do many dynamic filtered views of a Flink table?

I would like to create a per-user view of the data tables stored in Flink, constantly updated as changes happen to the source data, so that I can have a constantly updating UI based on a toChangelogStream() of the user's view of the data. To do that, I was thinking I could create an ad-hoc SQL query like SELECT * FROM foo WHERE userid=X and convert it to a changelog stream, which would have a bunch of inserts at the beginning of the stream to give me the initial state, followed by live updates after that point. I would leave that query running as long as the user is using the UI, and then delete the table when the user's session ends. I think this is effectively how the Flink SQL client must work, so it seems like this should be possible.
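To make the idea concrete, this is roughly what I have in mind per user session (just a sketch; foo and userId are placeholders):

```java
import org.apache.flink.streaming.api.datastream.DataStream;
import org.apache.flink.streaming.api.environment.StreamExecutionEnvironment;
import org.apache.flink.table.api.Table;
import org.apache.flink.table.api.bridge.java.StreamTableEnvironment;
import org.apache.flink.types.Row;

StreamExecutionEnvironment env = StreamExecutionEnvironment.getExecutionEnvironment();
StreamTableEnvironment tEnv = StreamTableEnvironment.create(env);

// one ad-hoc query per user session; "foo" and userId are placeholders
Table userView = tEnv.sqlQuery("SELECT * FROM foo WHERE userid = " + userId);

// initial inserts for the current state, followed by live changes (+I/-U/+U/-D)
DataStream<Row> changelog = tEnv.toChangelogStream(userView);
```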
However, I anticipate that there may be some large overheads associated with each ad hoc query if I do it this way. When I write a SQL query, based on the answer in Apache Flink Table 1.4: External SQL execution on Table possible?, it sounds like internally this is going to compile a new JAR file and create new pipeline stages, I assume using more JVM metaspace for each user. I can have tens of thousands of users using the UI at once, so I'm not sure that's really feasible.
What's the idiomatic way to do this? The other ways I'm looking at are:
I could maybe use queryable state, grouping the current rows behind the userid as the key, but as far as I can tell it does not provide a way to get a changelog stream, so I would have to re-query the state periodically, which is not ideal for my use case (the per-user state can sometimes be large but doesn't change quickly).
Another alternative is to output the table to both a changelog stream sink and an external RDBMS sink, but if I do that, what's the best pattern for how to join those together in the client?

Using a sink in Apache Flink for read purposes?

I am new to Apache Flink (and Stack Overflow), and I wanted to know the best practice for dealing with the following scenario:
I am currently consuming real-time messages using a KafkaSource from someone else's application. Some of these messages need to undergo a transformation if their keys exist in a local database that I have created and have access to. Each transformed message then needs to be sent to a KafkaSink.
In order to check if a message needs to be transformed, I need to see if the key for that specific message exists in my local database (I have to query my local database for each message to check for its key).
What is an efficient way to do this?
I have 2 ideas:
Open a connection to the local database and perform a query to see if the record exists in my local database for that message. Repeat this for each message in the stream.
Extend Flink's RichSinkFunction, open a connection in it, and use the invoke method to perform the query. Use this RichSink for each message in the stream.
Performance Concern: I only want to open a connection to the local database once. I think Method #1 would open and close a connection per message while Method #2 would open and close a connection only once.
More generally, is it appropriate to create a RichSink to just run some queries in your local database for read purposes? I am not going to be using this RichSink to actually write any data to the local database.
Thanks!
The preferred approach to access external systems from Flink is to use an AsyncFunction: https://nightlies.apache.org/flink/flink-docs-stable/docs/dev/datastream/operators/asyncio/
That is, if your database can handle the load and is fast enough to keep up with the stream throughput. If not, you'll want to implement some kind of CDC stream from your database and store its contents locally as Flink state, then use a ConnectedStream so both streams can share state in a CoMap or CoFlatMap operator.
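As a sketch of that second option (CDC stream mirrored into Flink state, then a connected stream for enrichment), something along these lines could work; Message, DbRow and EnrichedMessage are hypothetical types standing in for your own records:

```java
import org.apache.flink.api.common.state.ValueState;
import org.apache.flink.api.common.state.ValueStateDescriptor;
import org.apache.flink.configuration.Configuration;
import org.apache.flink.streaming.api.functions.co.RichCoFlatMapFunction;
import org.apache.flink.util.Collector;

public class EnrichmentFunction
        extends RichCoFlatMapFunction<Message, DbRow, EnrichedMessage> {

    // latest database row for the current key, mirrored into Flink state via CDC
    private transient ValueState<DbRow> latestRow;

    @Override
    public void open(Configuration parameters) {
        latestRow = getRuntimeContext().getState(
                new ValueStateDescriptor<>("latest-db-row", DbRow.class));
    }

    @Override
    public void flatMap1(Message msg, Collector<EnrichedMessage> out) throws Exception {
        DbRow row = latestRow.value();
        if (row != null) {
            out.collect(EnrichedMessage.transform(msg, row)); // key exists: transform
        } else {
            out.collect(EnrichedMessage.passThrough(msg));    // key unknown: forward as-is
        }
    }

    @Override
    public void flatMap2(DbRow row, Collector<EnrichedMessage> out) throws Exception {
        latestRow.update(row); // keep the local copy of the database row up to date
    }
}

// Usage: both streams are keyed by the message/database key before connecting.
// DataStream<EnrichedMessage> enriched = messages
//         .connect(dbChanges)
//         .keyBy(Message::getKey, DbRow::getKey)
//         .flatMap(new EnrichmentFunction());
```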
ConnectedStream and AsyncFunction are the preferred ways of approaching this kind of problem.
In case you don't have access to all Flink abstractions (e.g. if you have some existing framework on top of Flink) but you can instantiate a FlatMapFunction, you can resort to a RichFlatMapFunction - you'd maintain only a few connections to the database this way if you use the open() method to create the connection.
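A minimal sketch of that RichFlatMapFunction variant, assuming a placeholder JDBC URL and lookup table, and a hypothetical Message type with a transformed() method; the connection is created once per parallel instance in open() and reused for every record:

```java
import java.sql.Connection;
import java.sql.DriverManager;
import java.sql.PreparedStatement;
import java.sql.ResultSet;

import org.apache.flink.api.common.functions.RichFlatMapFunction;
import org.apache.flink.configuration.Configuration;
import org.apache.flink.util.Collector;

public class DbLookupFlatMap extends RichFlatMapFunction<Message, Message> {

    private transient Connection connection;
    private transient PreparedStatement lookup;

    @Override
    public void open(Configuration parameters) throws Exception {
        // one connection per parallel subtask, not per message
        connection = DriverManager.getConnection(
                "jdbc:postgresql://localhost:5432/mydb", "user", "password");
        lookup = connection.prepareStatement("SELECT 1 FROM keys_table WHERE msg_key = ?");
    }

    @Override
    public void flatMap(Message msg, Collector<Message> out) throws Exception {
        lookup.setString(1, msg.getKey());
        try (ResultSet rs = lookup.executeQuery()) {
            out.collect(rs.next() ? msg.transformed() : msg); // transform only if key exists
        }
    }

    @Override
    public void close() throws Exception {
        if (lookup != null) lookup.close();
        if (connection != null) connection.close();
    }
}
```

Note that this still does one blocking query per record; it just avoids re-opening the connection each time, which is the point when AsyncIO isn't an option.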

A copy of the database on the front-end side (Store) to avoid fetching data multiple times

I have a large database, so I decided to create a copy of it on the front-end side in my Store (I am using Vuex with Vue.js), i.e. each table in the database has a corresponding array in the Store. Every time I have an update, I persist it to the real database and to my copy (which lives in the Store) at the same time, to avoid fetching the data again after each update.
Is this a good idea or not (in terms of performance)?
I think it's totally fine to create a cache of your data on the client side, and it is even recommended. But you do need to be aware of some things:
Make sure you update the client side only when the data has been successfully saved in the database, so the client never sees false information.
If you have multiple users who can change the data, and you need to use the data in real time - make sure that you send the updated data to other users as well.
Check which tables from the database you actually need at any given time - maybe you need different tables in different views of the application, so you do not have to keep the whole database in the Store all the time. This helps reduce memory usage.
Consider using lazy loading, which means that you load a table only when you need it and then save it in the cache. The next time you need that table you won't load it from the server, but use the cached data instead.
When you put data in the Vuex store, Vue will make it reactive, which can cause performance issues - especially if you have a lot of data. If you have data that you know will not change, or that changes rarely, consider using Object.freeze(), which basically tells Vue not to put any watchers on the object. This can improve performance significantly.
EDIT:
If you are concerned about performance, I would implement the cache using lazy loading together with Object.freeze(), which means you will not be able to change the data on the client side - so for every change you should send the update to the server, receive the full updated table in response, and assign the new value to your cache with Object.freeze() again. That way you don't have to request the table from the server for every usage, only for updates. This will help keep performance good.

What is the best way to have a cache of an external database in Flink?

The external database consists of a set of rules for each key, these rules should be applied on each stream element in the Flink job. Because it is very expensive to make a DB call for each element and retrieve the rules, I want to fetch the rules from the database at initialization and store it in a local cache.
When rules are updated in the external database, a status change event is published to the Flink job which should be used to fetch the rules and refresh this cache.
What is the best way to achieve what I've described? I looked into keyed state but initializing all keys and refreshing the keys on update doesn't seem possible.
I think you can make use of BroadcastProcessFunction or KeyedBroadcastProcessFunction to achieve your use case. A detailed blog post is available here.
In short: you define a source (Kafka or any other) for the rules that the actual stream should consume, and connect the actual data stream with the broadcast rules stream. processBroadcastElement() then receives the rules and lets you update the broadcast state, and the updated state (rules) can finally be read in processElement(), the method that handles the actual events.
Points to consider: broadcast state is always kept on the heap, not in the state backend (RocksDB), so it has to be small enough to fit in memory. Also, each slot copies all of the broadcast state into its checkpoints, so every checkpoint and savepoint will have n (parallelism) copies of the broadcast state.
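A rough sketch of the pattern described above (Rule, Event and their methods are hypothetical; the rules stream could come from Kafka or any other source):

```java
import org.apache.flink.api.common.state.MapStateDescriptor;
import org.apache.flink.api.common.typeinfo.Types;
import org.apache.flink.streaming.api.datastream.BroadcastStream;
import org.apache.flink.streaming.api.datastream.DataStream;
import org.apache.flink.streaming.api.functions.co.KeyedBroadcastProcessFunction;
import org.apache.flink.util.Collector;

MapStateDescriptor<String, Rule> ruleStateDescriptor =
        new MapStateDescriptor<>("rules", Types.STRING, Types.POJO(Rule.class));

// broadcast the rules stream to every parallel instance
BroadcastStream<Rule> broadcastRules = ruleStream.broadcast(ruleStateDescriptor);

DataStream<Event> processed = events
        .keyBy(Event::getKey)
        .connect(broadcastRules)
        .process(new KeyedBroadcastProcessFunction<String, Event, Rule, Event>() {

            @Override
            public void processElement(Event event, ReadOnlyContext ctx, Collector<Event> out)
                    throws Exception {
                // read the current rule for this key from broadcast state
                Rule rule = ctx.getBroadcastState(ruleStateDescriptor).get(event.getKey());
                out.collect(rule != null ? rule.apply(event) : event);
            }

            @Override
            public void processBroadcastElement(Rule rule, Context ctx, Collector<Event> out)
                    throws Exception {
                // refresh the cached rule whenever an update arrives on the rules stream
                ctx.getBroadcastState(ruleStateDescriptor).put(rule.getKey(), rule);
            }
        });
```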
A few different mechanisms in Flink may be relevant to this use case, depending on your detailed requirements.
Broadcast State
Jaya Ananthram has already covered the idea of using broadcast state in his answer. This makes sense if the rules should be applied globally, for every key, and if you can find a way to collect and broadcast the updates.
Note that the Context passed to processBroadcastElement() in a KeyedBroadcastProcessFunction provides the method applyToKeyedState(StateDescriptor<S, VS> stateDescriptor, KeyedStateFunction<KS, S> function). This means you can register a KeyedStateFunction that will be applied to the state of every key that has state associated with the provided stateDescriptor.
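For illustration, inside such a KeyedBroadcastProcessFunction the broadcast side could touch the keyed state of every known key roughly like this (the ValueStateDescriptor, the Rule type and its appliesTo() method are hypothetical):

```java
// fragment of a KeyedBroadcastProcessFunction<String, Event, Rule, Event>

private final ValueStateDescriptor<Rule> perKeyRuleDescriptor =
        new ValueStateDescriptor<>("per-key-rule", Rule.class);

@Override
public void processBroadcastElement(Rule rule, Context ctx, Collector<Event> out)
        throws Exception {
    // runs once for every key that already has state registered under this descriptor
    ctx.applyToKeyedState(perKeyRuleDescriptor,
            new KeyedStateFunction<String, ValueState<Rule>>() {
                @Override
                public void process(String key, ValueState<Rule> state) throws Exception {
                    if (rule.appliesTo(key)) {   // appliesTo() is a placeholder
                        state.update(rule);
                    }
                }
            });
}
```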
State Processor API
If you want to bootstrap state in a Flink savepoint from a database dump, you can do that with this library. You'll find a simple example of using the State Processor API to bootstrap state in this gist.
Change Data Capture
The Table/SQL API supports Debezium, Canal, and Maxwell CDC streams, and Kafka upsert streams. This may be a solution. There's also flink-cdc-connectors.
Lookup Joins
Flink SQL can do temporal lookup joins against a JDBC database, with a configurable cache. Not sure this is relevant.
In essence, David's answer summarizes it well. If you are looking for more detail: not long ago, I gave a webinar [1] on this topic, including running code examples [2].
[1] https://www.youtube.com/watch?v=cJS18iKLUIY
[2] https://github.com/knaufk/enrichments-with-flink

How to update Redis after updating the database?

I cache some data in Redis: I read data from Redis if it exists there; otherwise I read it from the database and write it to Redis.
I've found several ways to update Redis after updating the database. For example:
set keys in Redis to expire
update Redis immediately after updating the database
put the data in a message queue and have a consumer update Redis
I'm a little confused and don't know how to choose.
Could you tell me the advantages and disadvantages of each way? It would also help if you could suggest other ways to update Redis, or recommend some blog posts about this problem.
The actual data store and the cache should be synchronized using the third approach you've already described in your question.
As you add data to your definitive store (i.e. your SQL database), you enqueue this data into some service bus or message queue, and let an asynchronous service do the whole synchronization in some kind of background process.
You don't want to get into these situations (which can happen when not using a service bus and an asynchronous service):
Your requests or processes become slower because the user needs to wait until the data is stored both in the database and in the cache.
You risk a failure during the caching process without being able to apply a retry policy (which is usually a built-in feature of a service bus or message queue). Also, such a failure can end up in partial or complete cache corruption, and you won't be able to automatically and easily schedule a task to fix it.
About using Redis key expiration: it's a good idea. Since Redis can expire keys using its built-in mechanism, you shouldn't implement key expiration in the background process yourself. If a key exists, it's because it's still valid.
BTW, you won't always be in that situation (i.e. that an unexpired key should never be overwritten); it may depend on your actual domain.
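For illustration, the consumer side of that queue typically boils down to something like the following sketch (the Jedis client setup, key naming and TTL are illustrative, not prescriptive):

```java
import redis.clients.jedis.Jedis;

public class CacheUpdater {

    private final Jedis jedis = new Jedis("localhost", 6379);

    // called by your MQ consumer for every "row updated" event it receives
    public void onRowUpdated(String table, String id, String rowAsJson) {
        String cacheKey = table + ":" + id;
        // write the fresh value and let Redis expire it on its own
        // (combines key expiration with the queue-based update)
        jedis.setex(cacheKey, 3600, rowAsJson);
    }
}
```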
You can create an API to interact with your Redis server, then use SQL CLR to call that API.
