How to implement `allowed lateness` in Flink SQL? - apache-flink

As far as I know, allowed lateness is only implemented in the DataStream API, not in the Flink SQL / Table API.
So, my question is: is there a workaround or some other way to get
allowed-lateness behavior in the Flink SQL / Table API?

You could consider using the CURRENT_WATERMARK function in Flink SQL as an intermediate step.
For example, if you want to filter out late data you can use:
WHERE
CURRENT_WATERMARK(ts) IS NULL
OR ts > CURRENT_WATERMARK(ts)
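For a fuller picture, here is a minimal sketch of how that filter could be wired up from the Table API. CURRENT_WATERMARK requires a reasonably recent Flink version (1.14+, if I recall correctly), and the table name, columns, and datagen connector below are assumptions for illustration only:

import org.apache.flink.table.api.EnvironmentSettings;
import org.apache.flink.table.api.TableEnvironment;

TableEnvironment tEnv =
        TableEnvironment.create(EnvironmentSettings.newInstance().inStreamingMode().build());

// Hypothetical source table with an event-time attribute and a watermark.
tEnv.executeSql(
        "CREATE TABLE events (" +
        "  id STRING," +
        "  ts TIMESTAMP(3)," +
        "  WATERMARK FOR ts AS ts - INTERVAL '5' SECOND" +
        ") WITH ('connector' = 'datagen')");

// Drop rows that are already behind the current watermark
// (CURRENT_WATERMARK is NULL until the first watermark arrives).
tEnv.executeSql(
        "SELECT * FROM events " +
        "WHERE CURRENT_WATERMARK(ts) IS NULL OR ts > CURRENT_WATERMARK(ts)").print();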
For other aspects of handling late events, you can follow https://issues.apache.org/jira/browse/FLINK-10031

Related

How can I deal with idleness in a Kafka source for watermarks using the Table or SQL API?

We can handle idleness in the DataStream API via this code block:
WatermarkStrategy
.<Tuple2<Long,String>>forBoundedOutOfOrderness(Duration.ofSeconds(20))
.withIdleness(Duration.ofMinutes(1));
But is something similar possible in the Flink Table/SQL API?
You can set table.exec.source.idle-timeout. See https://nightlies.apache.org/flink/flink-docs-stable/docs/dev/table/config/#table-exec-source-idle-timeout for more info.
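For example, a minimal sketch of setting that option programmatically; the 1-minute value mirrors the DataStream snippet above and is just an assumption:

import org.apache.flink.table.api.EnvironmentSettings;
import org.apache.flink.table.api.TableEnvironment;

TableEnvironment tEnv =
        TableEnvironment.create(EnvironmentSettings.newInstance().inStreamingMode().build());

// Mark a source subtask as idle after 1 minute without records,
// so its missing watermarks no longer hold back downstream operators.
tEnv.getConfig().getConfiguration()
        .setString("table.exec.source.idle-timeout", "1 min");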

API source support in Flink

Does Flink provide a way to poll an API periodically and create a DataStream out of it for further processing?
We currently push the messages to Kafka and read them from Kafka. Is there any way to poll the API directly through Flink?
I'm not aware of such a source connector for Flink, but it would be relatively straightforward to implement one. There are examples out there that do just this but with a database query; one of those might serve as a template for getting started.
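For instance, here is a rough sketch of such a polling source; the endpoint URL and polling interval are assumptions, and newer Flink versions would use the unified Source API rather than the legacy SourceFunction shown here:

import org.apache.flink.streaming.api.functions.source.RichSourceFunction;

import java.io.BufferedReader;
import java.io.InputStreamReader;
import java.net.HttpURLConnection;
import java.net.URL;
import java.nio.charset.StandardCharsets;
import java.util.stream.Collectors;

public class PollingHttpSource extends RichSourceFunction<String> {
    private volatile boolean running = true;

    @Override
    public void run(SourceContext<String> ctx) throws Exception {
        while (running) {
            // Poll the (hypothetical) API endpoint and emit the raw response body.
            HttpURLConnection conn =
                    (HttpURLConnection) new URL("http://example.com/api/data").openConnection();
            try (BufferedReader reader = new BufferedReader(
                    new InputStreamReader(conn.getInputStream(), StandardCharsets.UTF_8))) {
                String body = reader.lines().collect(Collectors.joining("\n"));
                synchronized (ctx.getCheckpointLock()) {
                    ctx.collect(body);
                }
            }
            Thread.sleep(60_000); // poll once a minute
        }
    }

    @Override
    public void cancel() {
        running = false;
    }
}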

How can I access state computed by an external Flink job? (without knowing its id)

I'm new to Flink and I'm currently testing the framework for a use case consisting of enriching transactions coming from Kafka with a lot of historical features (e.g. the number of past transactions between the same source and the same target), then scoring the transaction with a machine learning model.
For now, the features are all kept in Flink state and the same job scores the enriched transaction. But I'd like to separate the feature-computation job from the scoring job and I'm not sure how to do this.
Queryable state doesn't seem to fit for this, as the job id is needed, but tell me if I'm wrong!
I've thought about querying RocksDB directly, but maybe there's a simpler way?
Is splitting this task into two jobs a bad idea with Flink? We do it this way for the same test with Kafka Streams, in order to avoid complex jobs (and to check if it has any positive impact on latency).
Some extra information: I'm using Flink 1.3 (but willing to upgrade if needed) and the code is written in Scala.
Thanks in advance for your help!
Something like Kafka works well for this kind of decoupling. In that way you could have one job that computes the features and streams them out to a Kafka topic that is consumed by the job doing the scoring. (Aside: this would make it easy to do things like run several different models and compare their results.)
Another approach that is sometimes used is to call out to an external API to do the scoring. Async I/O could be helpful here. At least a couple of groups are using stream SQL to compute features, and wrapping external model scoring services as UDFs.
And if you do want to use queryable state, you could use Flink's REST api to determine the job id.
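As a rough sketch, listing the jobs over the REST API could look like this; the JobManager host/port and the exact JSON layout depend on your setup and Flink version, so treat the details as assumptions:

import java.io.BufferedReader;
import java.io.InputStreamReader;
import java.net.URL;
import java.nio.charset.StandardCharsets;
import java.util.stream.Collectors;

public class ListJobs {
    public static void main(String[] args) throws Exception {
        // Ask the JobManager's REST endpoint which jobs it knows about; the
        // returned JSON contains the job ids a QueryableStateClient would need.
        try (BufferedReader reader = new BufferedReader(new InputStreamReader(
                new URL("http://localhost:8081/jobs").openStream(), StandardCharsets.UTF_8))) {
            System.out.println(reader.lines().collect(Collectors.joining()));
        }
    }
}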
There have been several talks at Flink Forward conferences about using machine learning models with Flink. One example: Fast Data at ING – Building a Streaming Data Platform with Flink and Kafka.
There's an ongoing community effort to make all this easier. See FLIP-23 - Model Serving for details.

Is it OK to access a database inside a FlatMapFunction in a Flink app?

I am consuming a Kafka topic as a DataStream and using a FlatMapFunction to process the data. The processing consists of enriching the instances that come from the stream with more data that I get from a database by executing a query in order to collect the output, but it feels like this is not the best approach.
Reading the docs I know that I can create a DataSet from a database query, but I only saw examples for batch processing.
Can I perform a merge/reduce (or another operation) with a DataStream and a DataSet to accomplish that?
Can I get any performance improvement using a DataSet instead of accessing the database directly?
There are various approaches one can take for accomplishing this kind of enrichment with Flink's DataStream API.
(1) If you just want to fetch all the data on a one-time basis, you can use a stateful RichFlatMapFunction that does the query in its open() method (see the sketch after this list).
(2) If you want to do a query for every stream element, then you could do that synchronously in a FlatMapFunction, or look at Async I/O for a more performant approach.
(3) For best performance while also getting up-to-date values from the external database, look at streaming in the database change stream and doing a streaming join with a CoProcessFunction. Something like http://debezium.io/ could be useful here.
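To illustrate option (1) above, here is a rough sketch of a one-time lookup in open(); the JDBC URL, reference table, and string-to-string types are placeholders, not a recommendation:

import org.apache.flink.api.common.functions.RichFlatMapFunction;
import org.apache.flink.configuration.Configuration;
import org.apache.flink.util.Collector;

import java.sql.Connection;
import java.sql.DriverManager;
import java.sql.ResultSet;
import java.sql.Statement;
import java.util.HashMap;
import java.util.Map;

public class EnrichingFlatMap extends RichFlatMapFunction<String, String> {
    // Reference data loaded once per parallel instance when the job starts.
    private transient Map<String, String> referenceData;

    @Override
    public void open(Configuration parameters) throws Exception {
        referenceData = new HashMap<>();
        // One-time query against the (placeholder) database.
        try (Connection conn = DriverManager.getConnection("jdbc:postgresql://localhost/db");
             Statement stmt = conn.createStatement();
             ResultSet rs = stmt.executeQuery("SELECT k, v FROM reference_table")) {
            while (rs.next()) {
                referenceData.put(rs.getString("k"), rs.getString("v"));
            }
        }
    }

    @Override
    public void flatMap(String event, Collector<String> out) {
        // Enrich the incoming record with the pre-loaded reference value.
        out.collect(event + "," + referenceData.getOrDefault(event, "unknown"));
    }
}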

Querying Data from Apache Flink

I am looking to migrate from a homegrown streaming server to Apache Flink. One thing that we have is an Apache Storm-like DRPC interface to run queries against the state held in the processing topology.
So for example: I have a bunch of sensors that I am running a moving average on. I want to run a query against the topology and return all the sensors where that average is above a fixed value.
Is there an equivalent in Flink, or if not, what is the best way to achieve equivalent functionality?
Out of the box, Flink does not come with a solution for querying the internal state of operators right now. You're in luck, however, because there are two options: we did an example of a stateful word count that allows querying the state. It is available here: https://github.com/dataArtisans/query-window-example
For one of the upcoming versions of Flink we are also working on a generic solution to the queryable state use case. This will allow querying the state of any internal operation.
Also, would it suffice in your case to just periodically output the values to something like Elasticsearch using a window operation? The results could then simply be queried from Elasticsearch.
An out-of-the-box solution called Queryable State is coming in the next release.
Here is an example:
https://github.com/apache/flink/blob/master/flink-tests/src/test/java/org/apache/flink/test/query/QueryableStateITCase.java
But I suggest you read up on it first and then look at the example.
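For reference, here is a minimal sketch of the producer side of queryable state as it looks in later releases; the sensor values and state name are assumptions:

import org.apache.flink.api.java.tuple.Tuple2;
import org.apache.flink.streaming.api.datastream.DataStream;
import org.apache.flink.streaming.api.environment.StreamExecutionEnvironment;

StreamExecutionEnvironment env = StreamExecutionEnvironment.getExecutionEnvironment();

// Hypothetical stream of (sensorId, movingAverage) pairs.
DataStream<Tuple2<String, Double>> averages =
        env.fromElements(Tuple2.of("sensor-1", 21.5), Tuple2.of("sensor-2", 17.0));

// Expose the latest value per sensor under the name "moving-average";
// an external QueryableStateClient can then look it up by job id and sensor id.
averages
        .keyBy(value -> value.f0)
        .asQueryableState("moving-average");

env.execute("queryable-state-sketch");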

Resources