Flink Statefun connections to Flink Table API - apache-flink

We are interested in connecting to a regular Flink streaming application from the new Stateful Functions 🎉, ideally using the Table API. The idea is to consult tables registered in Flink from Statefun. Is this possible, and what is the right way to do it?
My idea so far has been to initialize my table stream in some main function and register a stateful function provider to connect to the table:
@AutoService(StatefulFunctionModule.class)
public class Module implements StatefulFunctionModule {
    @Override
    public void configure(Map<String, String> globalConfiguration, Binder binder) {
        StreamExecutionEnvironment env = StreamExecutionEnvironment.getExecutionEnvironment();
        StreamTableEnvironment tableEnv = StreamTableEnvironment.create(env);
        // ingest a DataStream from an external source
        DataStream<Tuple3<Long, String, Integer>> ds = env.addSource(...);
        // SQL query with an inlined (unregistered) table
        Table myTable = tableEnv.fromDataStream(ds, "user, product, amount");
        tableEnv.createTemporaryView("my_table", myTable);
        TableFunctionProvider tableProvider = new TableFunctionProvider();
        binder.bindFunctionProvider(FnEnrichmentCallback.TYPE, tableProvider);
        // continue registering my other messages
        // ...
    }
}
The stateful function provider would return a FnTableQuery which simply queries the table whenever it receives a message:
public class TableFunctionProvider implements StatefulFunctionProvider {
    @Override
    public StatefulFunction functionOfType(FunctionType type) {
        return new FnTableQuery();
    }
}
The query function object would then operate as an actor for every established process, and simply query the table when invoked:
public class FnTableQuery extends StatefulMatchFunction {
    static final FunctionType TYPE = new FunctionType(Identifiers.NAMESPACE, "my-table");
    private Table myTable;

    @Override
    public void configure(MatchBinder binder) {
        StreamExecutionEnvironment env = StreamExecutionEnvironment.getExecutionEnvironment();
        StreamTableEnvironment tableEnv = StreamTableEnvironment.create(env);
        myTable = tableEnv.from("my_table");
        binder.otherwise(this::catchAll);
    }

    private void catchAll(Context context, Object message) {
        context.send(FnEnrichmentCallback.TYPE, myTable.select("max(amount)").toString(), message);
    }
}
I apologize in advance if this approach doesn't make sense, because I don't know if:
Flink and Statefun applications can work together outside the realm of sources/sinks, especially since this particular function is stateless and the table is stateful
We can query Flink tables like this; I have only queried them as an intermediate object to send to a sink or DataStream
It makes sense to initialize things in Module.configure, and if both the stateful function provider and its match function are called once per parallel worker

The Apache Flink community does plan to support Flink DataStreams as StateFun ingresses / egresses in the future.
This would mean that you could take the result streams produced with the Flink Table API / Flink CEP / DataStream API etc. and invoke functions using the events in those streams.
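For reference, a rough sketch of what such a bridge could look like, modeled on the statefun-flink-datastream module that later shipped with StateFun 2.2; the builder names, the "table-bridge" pipeline name, and the RESULTS egress identifier are assumptions to check against the version you actually use:

StreamExecutionEnvironment env = StreamExecutionEnvironment.getExecutionEnvironment();

// wrap each stream element as a message addressed to a function instance
DataStream<RoutableMessage> ingress = env.fromElements("foo", "bar")
    .map(name -> RoutableMessageBuilder.builder()
        .withTargetAddress(FnEnrichmentCallback.TYPE, name)
        .withMessageBody(name)
        .build());

// an egress identifier of our own choosing; the function writes its results to it
EgressIdentifier<String> RESULTS =
    new EgressIdentifier<>("example", "results", String.class);

StatefulFunctionEgressStreams egresses = StatefulFunctionDataStreamBuilder
    .builder("table-bridge")
    .withDataStreamAsIngress(ingress)
    .withFunctionProvider(FnEnrichmentCallback.TYPE, unused -> new FnTableQuery())
    .withEgressId(RESULTS)
    .build(env);

// the egress comes back as a plain DataStream that the Table API can consume again
DataStream<String> results = egresses.getDataStreamForEgressId(RESULTS);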

Related

Update external Database in RichCoFlatMapFunction

I have a RichCoFlatMapFunction
DataStream<Metadata> metadataKeyedStream =
    env.addSource(metadataStream)
       .keyBy(Metadata::getId);

SingleOutputStreamOperator<Output> outputStream =
    env.addSource(recordStream)
       .assignTimestampsAndWatermarks(new RecordTimeExtractor())
       .keyBy(Record::getId)
       .connect(metadataKeyedStream)
       .flatMap(new CustomCoFlatMap(metadataTable.listAllAsMap()));
public class CustomCoFlatMap extends RichCoFlatMapFunction<Record, Metadata, Output> {
    // not transient: the map is shipped with the function when the job is deployed
    private final Map<String, Metadata> datasource;
    private transient ValueState<Metadata> metadataState;

    public CustomCoFlatMap(Map<String, Metadata> datasource) {
        this.datasource = datasource;
    }

    @Override
    public void open(Configuration parameters) throws Exception {
        // register ValueState (it holds a single value, so one type parameter)
        metadataState = getRuntimeContext().getState(
                new ValueStateDescriptor<>("metadataState", Metadata.class));
    }

    @Override
    public void flatMap2(Metadata metadata, Collector<Output> collector) throws Exception {
        // if the metadata record was removed from the table, remove it from local state too
        if (metadata.getEventName().equals("REMOVE")) {
            metadataState.clear();
            return;
        }
        // update metadata in ValueState
        this.metadataState.update(metadata);
    }

    @Override
    public void flatMap1(Record record, Collector<Output> collector) throws Exception {
        Metadata metadata = this.metadataState.value();
        // if metadata is not present in ValueState
        if (metadata == null) {
            // get metadata from the datasource
            metadata = datasource.get(record.getId());
            // if metadata was found in the datasource, add it to ValueState
            if (metadata != null) {
                metadataState.update(metadata);
                Output output = new Output(record.getId(), metadata.getName(),
                        metadata.getVersion(), metadata.getType());
                if (metadata.getId() == 123) {
                    // here I want to update metadata in another database
                    // can I do it here directly?
                }
                collector.collect(output);
            }
        }
    }
}
Here, in the flatMap1 method, I want to update a database. Can I do that operation in flatMap1? I am asking because it involves some wait time to query the DB and then update it.
While in principle it is possible to do this, it's not a good idea. Doing synchronous I/O in a Flink user function causes two problems:
You are tying up considerable resources that are spending most of their time idle, waiting for a response.
While waiting, that operator is creating backpressure that prevents checkpoint barriers from making progress. This can easily cause occasional checkpoint timeouts and job failures.
It would be better to use a KeyedCoProcessFunction instead, and emit the intended database update as a side output. This can then be handled downstream either by a database sink or by using a RichAsyncFunction.
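A minimal sketch of that pattern, reusing the types from the question (recordStream and metadataStream stand for the already-built DataStreams); the DbUpdate type, the "db-updates" tag name, and MyAsyncDbWriter are illustrative assumptions:

public class EnrichFunction extends KeyedCoProcessFunction<String, Record, Metadata, Output> {

    // side output channel for updates that should be written to the database
    public static final OutputTag<DbUpdate> DB_UPDATES = new OutputTag<DbUpdate>("db-updates") {};

    private transient ValueState<Metadata> metadataState;

    @Override
    public void open(Configuration parameters) {
        metadataState = getRuntimeContext().getState(
                new ValueStateDescriptor<>("metadataState", Metadata.class));
    }

    @Override
    public void processElement1(Record record, Context ctx, Collector<Output> out) throws Exception {
        Metadata metadata = metadataState.value();
        if (metadata != null) {
            out.collect(new Output(record.getId(), metadata.getName(),
                    metadata.getVersion(), metadata.getType()));
            // no blocking call here: emit the intended update as a side output instead
            ctx.output(DB_UPDATES, new DbUpdate(metadata));
        }
    }

    @Override
    public void processElement2(Metadata metadata, Context ctx, Collector<Output> out) throws Exception {
        metadataState.update(metadata);
    }
}

The side output can then be routed to a database sink or to an async writer, for example:

SingleOutputStreamOperator<Output> outputs = recordStream
    .keyBy(Record::getId)
    .connect(metadataStream.keyBy(Metadata::getId))
    .process(new EnrichFunction());

DataStream<DbUpdate> updates = outputs.getSideOutput(EnrichFunction.DB_UPDATES);
AsyncDataStream.unorderedWait(updates, new MyAsyncDbWriter(), 1, TimeUnit.SECONDS, 100);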

Does Flink DataStream have api like mapPartition?

I want to use a non-serializable object in stream.map() like this:
stream.map { i =>
  val obj = new SomeUnserializableClass()
  obj.doSomething(i)
}
This is very inefficient, because I create a new SomeUnserializableClass instance for every element. Actually, it needs to be created only once in each worker.
In Spark, I can use mapPartition to do this, but I don't know how in the Flink stream API.
If you are dealing with a non-serializable class, what I recommend is to create a RichFunction, in your case a RichMapFunction.
A rich operator in Flink has an open method that is executed on the TaskManager just once, as an initializer.
So the trick is to make your field transient and instantiate it in your open method.
Check below example:
public class NonSerializableFieldMapFunction extends RichMapFunction<Object, Object> {
    transient SomeUnserializableClass someUnserializableClass;

    @Override
    public void open(Configuration parameters) throws Exception {
        super.open(parameters);
        this.someUnserializableClass = new SomeUnserializableClass();
    }

    @Override
    public Object map(Object o) throws Exception {
        return someUnserializableClass.doSomething(o);
    }
}
Then your code will look like:
stream.map(new NonSerializableFieldMapFunction())
P.S.: I'm using Java syntax; please adapt it to Scala.

how to get the operation cost in Flink using cost estimator class provided in Flink

I want to do performance analysis of Flink CEP engine and I came across these classes
org.apache.flink.optimizer.costs.CostEstimator;
org.apache.flink.optimizer.costs.Costs;
org.apache.flink.optimizer.costs.DefaultCostEstimator;
But the issue is that I don't know how to use either of these classes. Can someone provide code or guidance on how to find the cost estimate for operators (join, for example) in Flink?
Below is the code for a join that I am performing in Flink
DataStream<JoinedEvent> joinedEventDataStream = stream1
    .join(stream2)
    .where(new KeySelector<RRIntervalStreamEvent, Long>() {
        @Override
        public Long getKey(RRIntervalStreamEvent rrIntervalStreamEvent) throws Exception {
            return rrIntervalStreamEvent.getTime();
        }
    })
    .equalTo(new KeySelector<qrsIntervalStreamEvent, Long>() {
        @Override
        public Long getKey(qrsIntervalStreamEvent qrsIntervalStreamEvent) throws Exception {
            return qrsIntervalStreamEvent.getTime();
        }
    })
    .window(TumblingEventTimeWindows.of(Time.milliseconds(1000)))
    .apply(new JoinFunction<RRIntervalStreamEvent, qrsIntervalStreamEvent, JoinedEvent>() {
        @Override
        public JoinedEvent join(RRIntervalStreamEvent rr, qrsIntervalStreamEvent qrs) throws Exception {
            // getting the cost -- just checking
            // costs.getCpuCost();
            return new JoinedEvent(rr.getTime(), rr.getSensor_id(), qrs.getSensor_id(),
                    rr.getRRInterval(), qrs.getQrsInterval());
        }
    });
How can I compute the cost of this join?
The cost classes belong to the optimizer of the DataSet API (Flink's batch processing API) while the CEP library is built on the DataStream API. The DataStream API does not leverage the DataSet API.
The CEP library and the DataSet optimizer are completely unrelated. Hence, it is not possible to use this code to estimate the cost of a CEP pattern. I'm also not aware of another built-in method to estimate the cost of a CEP pattern (or any other DataStream program).

getting numOfRecordsIn using counters in Flink

I want to show numRecordsIn for an operator in Flink, and for doing this I have been following a presentation by data Artisans here. The code for the counter is given below:
public static class mapper extends RichMapFunction<String, String> {
    public Counter counter;

    @Override
    public void open(Configuration parameters) throws Exception {
        super.open(parameters);
        this.counter = getRuntimeContext()
                .getMetricGroup()
                .counter("numRecordsIn");
    }

    @Override
    public String map(String s) throws Exception {
        counter.inc();
        System.out.println("counter val " + counter.getCount());
        return s;
    }
}
The problem is: how do I specify which operator I want to show numRecordsIn for?
Metric counters are exposed via Flink's metric system. In order to take a look at them, you have to configure a metric reporter. A description of how to register a metric reporter can be found here.
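For example, a minimal reporter setup in flink-conf.yaml could look like this (the Slf4j reporter ships with Flink; the 30-second interval is an arbitrary choice):

metrics.reporter.slf4j.class: org.apache.flink.metrics.slf4j.Slf4jReporter
metrics.reporter.slf4j.interval: 30 SECONDS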
Flink includes a number of built-in metrics, including numRecordsIn. So if that's what you want to measure, there's no need to write any code to implement that particular measurement. Similarly for numRecordsInPerSecond, and a host of others.
The code you asked about causes the numRecordsIn counter to be incremented for the operator in which the metric is being used.
A good way to better understand the metrics system is to bring up a simple streaming job and look at the metrics in Flink's web UI. I also found it really helpful to query the monitoring REST API while a job was running.
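For instance, the REST API exposes operator metrics per job vertex; the IDs below are placeholders you'd take from the /jobs listing and the job's vertex list:

curl http://localhost:8081/jobs/<job-id>/vertices/<vertex-id>/metrics
curl "http://localhost:8081/jobs/<job-id>/vertices/<vertex-id>/metrics?get=numRecordsIn"

The first call lists the available metric names; the second fetches the current value of numRecordsIn.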

Is there any way to get the taskManager Id within a map in Apache Flink?

Using custom partitioning in Apache Flink, we specify a key for each record to be assigned to a particular TaskManager.
Consider that we broadcast a dataset to all of the nodes (TaskManagers). Is there any way, within a map or flatMap, to get the TaskManager ID or not?
A custom partitioner does not assign records to a TaskManager but to a specific parallel task instance of the subsequent operator (a TM can execute multiple parallel task instances of the same operator).
You can access the ID of a parallel task instance by extending a RichFunction, e.g., extend a RichMapFunction instead of implementing a MapFunction. Rich functions are available for all transformations. A RichFunction gives access to the RuntimeContext, which tells you the ID of the parallel task instance:
public static class MyMapper extends RichMapFunction<Long, Long> {
    private int subtaskIndex;

    @Override
    public void open(Configuration config) {
        subtaskIndex = getRuntimeContext().getIndexOfThisSubtask();
    }

    @Override
    public Long map(Long value) throws Exception {
        // ... use subtaskIndex as needed
        return value;
    }
}
