Control stream in Flink SQL

With the DataStream API, I can write a RichCoFlatMapFunction that accepts a control stream and a data stream. The control stream contains elements that start or stop the calculation or change its parameters, and I know I can store the current control settings in state and check them when processing the data stream.
But what is the way to do something similar with Flink SQL?
I cannot use a join, as the data stream and the control stream cannot be joined together.
The solution we came up with is to have the application store the control settings itself.
The idea is:
Broadcast the control stream to a map operator and store the control settings in a Java singleton object in its map() method. Since the map operator runs with the default parallelism, we assume it will run on every JVM for that job, so every JVM will initialize and keep updating the control settings in the singleton object.
With SQL, every UDAF or UDF can then access the control settings by reading the Java singleton object.
But I am not sure whether my assumption is correct and whether this is a feasible solution.

I don't think that is a good idea. SQL was not designed for such use cases. Instead, a SQL query is optimized and executed as specified; changing the behavior of a running query is not intended. Besides the design perspective, it would also not perform well, because you would need to do remote look-ups against distributed queryable state for each record that you process, which of course adds latency.
To me, your use case sounds more like an application than a SQL query. For that, the DataStream API would be the right choice. What you can do is embed SQL (or Table API) queries into an application, i.e., do the pre- and post-processing with SQL and have an operator with the control/data stream pattern in the middle.
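A minimal sketch of that layout, assuming a registered table named sensor_readings and a plain numeric control stream (both hypothetical); the latest setting is kept in a simple field rather than proper operator state, just to show the wiring:

```java
import org.apache.flink.streaming.api.datastream.DataStream;
import org.apache.flink.streaming.api.environment.StreamExecutionEnvironment;
import org.apache.flink.streaming.api.functions.co.RichCoFlatMapFunction;
import org.apache.flink.table.api.Table;
import org.apache.flink.table.api.bridge.java.StreamTableEnvironment;
import org.apache.flink.types.Row;
import org.apache.flink.util.Collector;

public class SqlWithControlStream {
    public static void main(String[] args) throws Exception {
        StreamExecutionEnvironment env = StreamExecutionEnvironment.getExecutionEnvironment();
        StreamTableEnvironment tEnv = StreamTableEnvironment.create(env);

        // Pre-processing with SQL (assumes a table "sensor_readings" is registered).
        Table pre = tEnv.sqlQuery("SELECT sensorId, reading FROM sensor_readings");
        DataStream<Row> data = tEnv.toDataStream(pre);

        // Control stream; in practice this would come from Kafka or similar.
        DataStream<Double> control = env.fromElements(0.5, 0.8);

        // The control/data pattern in the middle: broadcast the control elements
        // and keep the latest threshold in the operator.
        DataStream<Row> filtered = data
            .connect(control.broadcast())
            .flatMap(new RichCoFlatMapFunction<Row, Double, Row>() {
                private double threshold = Double.MAX_VALUE;

                @Override
                public void flatMap1(Row row, Collector<Row> out) {
                    double reading = ((Number) row.getField("reading")).doubleValue();
                    if (reading >= threshold) {
                        out.collect(row);
                    }
                }

                @Override
                public void flatMap2(Double newThreshold, Collector<Row> out) {
                    threshold = newThreshold; // update the control setting
                }
            });

        // The result could be converted back to a table for SQL post-processing;
        // here it is simply printed.
        filtered.print();
        env.execute("sql-with-control-stream");
    }
}
```

The SQL part then does what it is good at, while the control logic stays in a plain DataStream operator, as suggested above.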

Related

Idiomatic way to do many dynamic filtered views of a Flink table?

I would like to create a per-user view of the data tables stored in Flink, constantly updated as changes happen to the source data, so that I can have a constantly updating UI based on a toChangelogStream() of the user's view of the data. To do that, I was thinking that I could create an ad-hoc SQL query like SELECT * FROM foo WHERE userid=X and convert it to a changelog stream, which would have a bunch of inserts at the beginning of the stream to give me the initial state, followed by live updates after that point. I would leave that query running as long as the user is using the UI, and then delete the table when the user's session ends. I think this is effectively how the Flink SQL client must work, so it seems like this should be possible.
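A rough sketch of that per-user query idea, assuming a table named foo with a userid column is already registered in the table environment (all names here are illustrative):

```java
import org.apache.flink.streaming.api.datastream.DataStream;
import org.apache.flink.streaming.api.environment.StreamExecutionEnvironment;
import org.apache.flink.table.api.Table;
import org.apache.flink.table.api.bridge.java.StreamTableEnvironment;
import org.apache.flink.types.Row;

public class PerUserView {
    public static void main(String[] args) throws Exception {
        StreamExecutionEnvironment env = StreamExecutionEnvironment.getExecutionEnvironment();
        StreamTableEnvironment tEnv = StreamTableEnvironment.create(env);

        long userId = 42L; // hypothetical user
        Table userView = tEnv.sqlQuery("SELECT * FROM foo WHERE userid = " + userId);

        // The changelog stream starts with +I rows for the current state of the view,
        // followed by -U/+U/-D rows as the source data changes.
        DataStream<Row> changelog = tEnv.toChangelogStream(userView);
        changelog.print(); // each Row carries a RowKind describing the change

        env.execute("per-user-view-" + userId);
    }
}
```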
However, I anticipate that there may be some large overheads associated with each ad hoc query if I do it this way. When I write a SQL query, based on the answer in Apache Flink Table 1.4: External SQL execution on Table possible?, it sounds like internally this is going to compile a new JAR file and create new pipeline stages, I assume using more JVM metaspace for each user. I can have tens of thousands of users using the UI at once, so I'm not sure that's really feasible.
What's the idiomatic way to do this? The other ways I'm looking at are:
I could maybe use queryable state, grouping the current rows under the userid as the key, but as far as I can tell it does not provide a way to get a changelog stream, so I would have to re-query the state on a periodic basis, which is not ideal for my use case (the per-user state can sometimes be large but doesn't change quickly).
Another alternative is to output the table to both a changelog stream sink and an external RDBMS sink, but if I do that, what's the best pattern for how to join those together in the client?

Using sink in Apache Flink for read purposes?

I am new to Apache Flink (and Stack Overflow), and I wanted to know the best practice for dealing with the following scenario:
I am currently consuming real-time messages using a KafkaSource from someone else's application. Some of these messages need to undergo a transformation if their keys exist in a local database that I have created and have access to. The transformed messages then need to be sent to a KafkaSink one by one.
In order to check if a message needs to be transformed, I need to see if the key for that specific message exists in my local database (I have to query my local database for each message to check for its key).
What is an efficient way to do this?
I have 2 ideas:
Open a connection to the local database and perform a query to see if the record exists in my local database for that message. Repeat this for each message in the stream.
Extend Flink's RichSinkFunction, open a connection through that, and use the invoke method to perform the query. Use this RichSink to repeat this for each message in the stream.
Performance Concern: I only want to open a connection to the local database once. I think Method #1 would open and close a connection per message while Method #2 would open and close a connection only once.
More generally, is it appropriate to create a RichSink to just run some queries in your local database for read purposes? I am not going to be using this RichSink to actually write any data to the local database.
Thanks!
The preferred approach to access external systems from Flink is to use an AsyncFunction: https://nightlies.apache.org/flink/flink-docs-stable/docs/dev/datastream/operators/asyncio/
That is, if your database can handle the load and is fast enough to keep up with the stream throughput. If not, you'll want to implement some kind of CDC stream from your database and store its contents locally as Flink state. Then, use a ConnectedStream so both streams can share state in a CoMap or CoFlatMap operator.
ConnectedStream and AsyncFunction are the preferred ways of approaching this kind of problem.
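A hedged sketch of the AsyncFunction approach, where DbClient is a hypothetical non-blocking client returning CompletableFutures and Message stands in for the message type from the question; the real client and transformation would of course differ:

```java
import java.util.Collections;

import org.apache.flink.configuration.Configuration;
import org.apache.flink.streaming.api.functions.async.ResultFuture;
import org.apache.flink.streaming.api.functions.async.RichAsyncFunction;

public class KeyLookupFunction extends RichAsyncFunction<Message, Message> {

    private transient DbClient client; // hypothetical asynchronous DB client

    @Override
    public void open(Configuration parameters) {
        // One client/connection per parallel instance, not per record.
        client = DbClient.connect("jdbc:postgresql://localhost/mydb");
    }

    @Override
    public void asyncInvoke(Message msg, ResultFuture<Message> resultFuture) {
        // keyExists() is assumed to return a CompletableFuture<Boolean>.
        client.keyExists(msg.getKey()).thenAccept(exists ->
            resultFuture.complete(Collections.singleton(exists ? transform(msg) : msg)));
    }

    private Message transform(Message msg) {
        return msg; // placeholder for the actual transformation
    }
}

// Wiring it in, with a timeout and a cap on concurrent in-flight requests:
// DataStream<Message> out = AsyncDataStream.unorderedWait(
//     kafkaStream, new KeyLookupFunction(), 1000, TimeUnit.MILLISECONDS, 100);
```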
In case you don't have access to all Flink abstractions (for example, if you have some existing framework on top of Flink) but can still instantiate a FlatMapFunction, you can resort to a RichFlatMapFunction - you'd maintain just a few connections to the database this way if you use the open() method to instantiate them.
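And a minimal sketch of that RichFlatMapFunction fallback, with a plain JDBC connection opened once per parallel instance in open(); the JDBC URL, lookup table, and Message type are placeholders:

```java
import java.sql.Connection;
import java.sql.DriverManager;
import java.sql.PreparedStatement;
import java.sql.ResultSet;

import org.apache.flink.api.common.functions.RichFlatMapFunction;
import org.apache.flink.configuration.Configuration;
import org.apache.flink.util.Collector;

public class KeyLookupFlatMap extends RichFlatMapFunction<Message, Message> {

    private transient Connection connection;
    private transient PreparedStatement lookup;

    @Override
    public void open(Configuration parameters) throws Exception {
        // Opened once per parallel instance, not once per message.
        connection = DriverManager.getConnection("jdbc:postgresql://localhost/mydb");
        lookup = connection.prepareStatement("SELECT 1 FROM message_keys WHERE key = ?");
    }

    @Override
    public void flatMap(Message msg, Collector<Message> out) throws Exception {
        lookup.setString(1, msg.getKey());
        try (ResultSet rs = lookup.executeQuery()) {
            // Transform only if the key exists in the local database.
            out.collect(rs.next() ? transform(msg) : msg);
        }
    }

    @Override
    public void close() throws Exception {
        if (lookup != null) lookup.close();
        if (connection != null) connection.close();
    }

    private Message transform(Message msg) {
        return msg; // placeholder for the actual transformation
    }
}
```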

Flink Sorting A Global Window On A Bounded Stream

I've built a Flink application to consume data directly from Kafka, but in the event of a system failure or a need to re-process this data, I need to instead consume the data from a series of files in S3. The order in which messages are processed is very important, so I'm trying to figure out how I can sort this bounded stream before pushing these messages through my existing application.
I've tried inserting the stream into a temporary table using the Table API, but the sort operator always uses a maximum parallelism of 1 despite sorting on two keys. Can I leverage these keys somehow to increase this parallelism?
I've been thinking of using a keyed global window, but I'm not sure how to trigger on a bounded stream and sort the window. Is Flink a good choice for this kind of batch processing, and would it be a good idea to write this using the old DataSet API?
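For reference, roughly what the Table API attempt described above could look like; S3Message and its fields are assumptions, and the global ORDER BY here is the sort that was observed to run with parallelism 1:

```java
import org.apache.flink.streaming.api.datastream.DataStream;
import org.apache.flink.table.api.Table;
import org.apache.flink.table.api.bridge.java.StreamTableEnvironment;

public class SortBoundedStream {

    /** Registers the bounded S3 stream as a view and sorts it globally on the two keys. */
    public static Table sortByChannel(StreamTableEnvironment tEnv,
                                      DataStream<S3Message> s3Messages) {
        // S3Message is a hypothetical POJO with channelId and eventTime fields.
        tEnv.createTemporaryView("messages", s3Messages);
        // A global sort over both keys, producing one totally ordered result.
        return tEnv.sqlQuery("SELECT * FROM messages ORDER BY channelId, eventTime");
    }
}
```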
Edit
After some experimentation, I've decided that Flink isn't the correct solution and Spark is just more feature-rich for this particular use case. I'm trying to consume and sort over 1.5 TB of data in each job. Unfortunately, some of these partitions contain maybe 100 GB or more, and everything must be in order before I can break those groups up further, which makes sorting this data in the operators difficult.
My requirements are simple: ingest the data from S3 and sort by channel ID before flushing it to disk. Having to think about windows and timestamp assigners just complicates a relatively simple task that can be achieved in 4 lines of Spark code.
Have you considered using the HybridSource for your use case, since this is exactly what it was designed for? https://nightlies.apache.org/flink/flink-docs-release-1.14/docs/connectors/datastream/hybridsource/
The DataSet API is deprecated and I would not recommend using it.
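A sketch of what the HybridSource wiring could look like for the S3-then-Kafka case; the bucket, topic, and switch-over timestamp are placeholders, and the text-format class name can differ between Flink versions:

```java
import org.apache.flink.api.common.eventtime.WatermarkStrategy;
import org.apache.flink.api.common.serialization.SimpleStringSchema;
import org.apache.flink.connector.base.source.hybrid.HybridSource;
import org.apache.flink.connector.file.src.FileSource;
import org.apache.flink.connector.file.src.reader.TextLineInputFormat;
import org.apache.flink.connector.kafka.source.KafkaSource;
import org.apache.flink.connector.kafka.source.enumerator.initializer.OffsetsInitializer;
import org.apache.flink.core.fs.Path;
import org.apache.flink.streaming.api.datastream.DataStream;
import org.apache.flink.streaming.api.environment.StreamExecutionEnvironment;

public class BackfillThenLive {
    public static void main(String[] args) throws Exception {
        StreamExecutionEnvironment env = StreamExecutionEnvironment.getExecutionEnvironment();

        long switchTimestamp = 1_700_000_000_000L; // where the S3 backfill ends (placeholder)

        // Bounded part: historical files in S3.
        FileSource<String> fileSource = FileSource
            .forRecordStreamFormat(new TextLineInputFormat(), new Path("s3://my-bucket/backfill/"))
            .build();

        // Unbounded part: the live Kafka topic, starting where the backfill stops.
        KafkaSource<String> kafkaSource = KafkaSource.<String>builder()
            .setBootstrapServers("broker:9092")
            .setTopics("events")
            .setStartingOffsets(OffsetsInitializer.timestamp(switchTimestamp))
            .setValueOnlyDeserializer(new SimpleStringSchema())
            .build();

        // Read the files first, then switch over to Kafka.
        HybridSource<String> hybridSource = HybridSource.builder(fileSource)
            .addSource(kafkaSource)
            .build();

        DataStream<String> stream = env.fromSource(
            hybridSource, WatermarkStrategy.noWatermarks(), "s3-then-kafka");
        stream.print();

        env.execute("backfill-then-live");
    }
}
```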

Flink job dynamic input parameters

One parameter for my Flink job is dynamic, and I have an API to fetch the dynamic value. Can I call the API in the source every time so as to fetch data based on the parameter? Is that the correct way? Will it cause any trouble in the Flink job?
So, if I understand correctly, the idea is that you first get some key from DynamoDB and then use it to query an external service from the source.
I think that should be possible in general, but there are a few things to keep in mind when doing that.
I'm not sure about the performance of such a solution. Are you going to query the database constantly, or somehow just get the changes? There are several things to consider here to get good performance out of the source.
It may be hard to provide any strong guarantees for such a setup, but that depends on the characteristics of the setup itself, i.e., how are you going to handle failures? How often will the key in the database change? Will the data still be accessible via the URL after the key in the DB changes? You can probably keep the last read key in state, so that when the job fails and the key in the DB changes you can try to read the data for the previous key (for which the job failed), but that depends on the questions above.
Finally, depending on the characteristics of the setup, it may be possible to use existing Flink operators to achieve this. For example, you can technically stream changes from the database (using one of the existing connectors, depending on the DB) and then use that data in AsyncIO to query the external URL, so that in the end you have a stream of data from the URL without creating your own source.

Querying Data from Apache Flink

I am looking to migrate from a homegrown streaming server to Apache Flink. One thing that we have is an Apache Storm-like DRPC interface to run queries against the state held in the processing topology.
So, for example: I have a bunch of sensors on which I am computing a moving average. I want to run a query against the topology and return all the sensors where that average is above a fixed value.
Is there an equivalent in Flink, or if not, what is the best way to achieve equivalent functionality?
Out of the box, Flink does not come with a solution for querying the internal state of operators right now. You're lucky, however, because there are two solutions: We built an example of a stateful word count that allows querying the state. It is available here: https://github.com/dataArtisans/query-window-example
For one of the upcoming versions of Flink we are also working on a generic solution for the queryable state use case. This will allow querying the state of any internal operator.
Also, it could suffice in your case to just periodically output the values to something like Elasticsearch using a window operation. The results could then simply be queried from Elasticsearch.
An out-of-the-box solution called Queryable State is coming in the next release.
Here is an example
https://github.com/apache/flink/blob/master/flink-tests/src/test/java/org/apache/flink/test/query/QueryableStateITCase.java
But I suggest you read more about it first and then look at the example.
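For context, a rough sketch of how Queryable State ended up looking (the feature has since been deprecated again in recent Flink releases); SensorAverage, the proxy host, and the JobID are placeholders:

```java
import java.util.concurrent.CompletableFuture;

import org.apache.flink.api.common.JobID;
import org.apache.flink.api.common.state.ValueState;
import org.apache.flink.api.common.state.ValueStateDescriptor;
import org.apache.flink.api.common.typeinfo.Types;
import org.apache.flink.queryablestate.client.QueryableStateClient;
import org.apache.flink.streaming.api.datastream.DataStream;

public class QueryableAverages {

    // The same descriptor is used when exposing the state and when querying it.
    static final ValueStateDescriptor<SensorAverage> DESCRIPTOR =
        new ValueStateDescriptor<>("latest-average", SensorAverage.class);

    // Job side: expose the latest moving average per sensor as queryable state.
    static void expose(DataStream<SensorAverage> averages) {
        averages
            .keyBy(SensorAverage::getSensorId) // SensorAverage is a hypothetical POJO
            .asQueryableState("sensor-averages", DESCRIPTOR);
    }

    // Client side: look up one sensor's current value by key via the state proxy.
    static double query(JobID jobId) throws Exception {
        QueryableStateClient client = new QueryableStateClient("proxy-host", 9069);
        CompletableFuture<ValueState<SensorAverage>> result = client.getKvState(
            jobId, "sensor-averages", "sensor-17", Types.STRING, DESCRIPTOR);
        return result.get().value().getAverage();
    }
}
```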
