Flink Table-API and DataStream ProcessFunction

I want to join a big table, too large to fit in TaskManager memory, with a stream (Kafka). In my tests I successfully joined the two by mixing the Table API with the DataStream API. I did the following:
val stream: DataStream[MyEvent] = env.addSource(...)

stream
  .timeWindowAll(...)
  .trigger(...)
  .process(new ProcessAllWindowFunction[MyEvent, MyEvent, TimeWindow] {

    var tableEnv: StreamTableEnvironment = _

    override def open(parameters: Configuration): Unit = {
      // init table env
    }

    override def process(context: Context, elements: Iterable[MyEvent], out: Collector[MyEvent]): Unit = {
      val table = tableEnv.sqlQuery(...)
      elements.map(e => {
        // do process
        out.collect(...)
      })
    }
  })
It is working, but I have never seen this type of implementation anywhere. Is it OK? What would be the drawbacks?

One should not use StreamExecutionEnvironment or TableEnvironment within a Flink function. An environment is used to construct a pipeline that is submitted to the cluster.
Your example submits a job to the cluster within a cluster's job.
This might work for certain use cases but is generally discouraged. Imagine your outer stream contains thousands of events and your function creates a job for each one; it could potentially DDoS your cluster.
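One way to avoid nesting environments is to build the join into the single pipeline that the main program submits, for example by registering the Kafka stream as a view and expressing the join in SQL. The following is only a rough sketch under assumed names (MyEvent, kafkaSource, big_table, and the join key are placeholders, not from the question):

import org.apache.flink.streaming.api.scala._
import org.apache.flink.table.api.bridge.scala._
import org.apache.flink.types.Row

val env = StreamExecutionEnvironment.getExecutionEnvironment
val tableEnv = StreamTableEnvironment.create(env)

// stream side: the Kafka source (placeholder)
val stream: DataStream[MyEvent] = env.addSource(kafkaSource)
tableEnv.createTemporaryView("events", stream)

// "big_table" is assumed to be registered via a catalog or connector,
// so it never has to fit into TaskManager memory all at once
val joined = tableEnv.sqlQuery(
  "SELECT e.*, b.extra FROM events e JOIN big_table b ON e.id = b.id")

// a regular join may produce updates, so convert to a retract stream
tableEnv.toRetractStream[Row](joined).print()

env.execute("single-job stream/table join")

This way the planner sees the whole join and only one job is submitted to the cluster.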

Related

Savepoint - Operators could not matched in Apache Flink

I'm trying to stop my job with a savepoint, then start it again from the same savepoint. In my case, I update my job and create a new version of it with a new jar. Here is my code example:
class Reader(bla bla) {
  def read() = {
    val ds = readFromKafka()
    transform(ds)
  }

  def transform(ds: DataStream[_]) = {
    ds.map()
  }
}

object MyJob {
  def run() = {
    val data = new Reader().read()
    data.keyBy(id).process(new MyStateFunc).uid("my-uid") // then write to kafka
  }
}
In this case, I stopped the job with a savepoint, then started it from the same savepoint with the same jar. Then I added a filter to my Reader like this:
class Reader(bla bla) {
  def read() = {
    val ds = readFromKafka()
    transform(ds)
  }

  def transform(ds: DataStream[_]) = {
    ds.map().filter() // FILTER ADDED HERE
  }
}
I stopped my job with a savepoint, and that works. Then I tried to deploy the new version of the job (with the new filter method) from the same savepoint, but it cannot match the operators and the job does not deploy. Why?
Unless you explicitly provide UIDs for all of your stateful operators before taking a savepoint, then after changing the topology of your job, Flink will no longer be able to figure out which state in the savepoint belongs to which operator.
I see that you have a UID on your keyed process function ("my-uid"). But you also need to have UIDs on the Kafka source and the sink, and anything else that's stateful. These UIDs need to be attached to the stateful operators themselves and need to be unique within the job (but not across all jobs). (Furthermore, each state descriptor needs to assign a name to each piece of state, using a name that is unique within the operator.)
Typically one does something like this:

env
  .addSource(...)
  .name("KafkaSource")
  .uid("KafkaSource")

results
  .addSink(...)
  .name("KafkaSink")
  .uid("KafkaSink")

where the name() method is used to supply the text that appears in the web UI.
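Applied to the job in the question, a sketch might look like the following (MyStateFunc and keyBy(id) come from the question; kafkaSink is a placeholder for the actual Kafka sink), with a uid on every stateful operator:

val data = new Reader().read()   // the Kafka source inside read() should also get .uid("KafkaSource")

data
  .keyBy(id)
  .process(new MyStateFunc)
  .uid("my-uid")
  .name("MyStateFunc")
  .addSink(kafkaSink)
  .uid("KafkaSink")
  .name("KafkaSink")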

Is there a way to asynchronously modify state in Flink KeyedProcessFunction?

I have two sources, Kafka and HBase. Kafka holds a data stream covering only the last 24 hours, while HBase holds data aggregated from the beginning. My goal is to merge the two during stream processing whenever stream input (Kafka) arrives for some session. I tried a couple of methods, but none were satisfactory because of performance.
After some searching, I came up with an idea based on state in a keyed process function (caching using the keyed process function's state). The idea is below:
1. make input to the keyed process function using session information
2. check the keyed process function's state
3. if the state is not initialized -> query HBase and initialize the state -> go to 5
4. else (state is initialized) -> go to 5
5. do business logic using the state
While implementing this idea, I hit a performance problem: querying HBase synchronously is slow. So I tried an async version, but it gets complicated.
I have faced two issues. One is thread safety between processElement and the HBase async worker thread; the other is that the Context of the process function has expired by the time the HBase async worker finishes (i.e. after processElement has already returned).
val sourceStream = env.addSource(kafkaConsumer.setStartFromGroupOffsets())

sourceStream.keyBy(new KeySelector[InputMessage, KeyInfo]() {
    override def getKey(v: InputMessage): KeyInfo = v.toKeyInfo()
  })
  .process(new KeyedProcessFunction[KeyInfo, InputMessage, OUTPUTTYPE]() {
    var table: AsyncTable[AdvancedScanResultConsumer] = _
    var state: MapState[String, (String, Long)] = _

    override def open(parameters: Configuration): Unit = {
      val conn = ConnectionFactory.createAsyncConnection(hbaseConfInstance).join
      table = conn.getTable(TableName.valueOf("tablename"))
      state = getRuntimeContext.getMapState(stateDescripter)
    }

    def request(action: Consumer[CacheResult]): Unit = {
      if (!state.isEmpty) {
        action.accept(new CacheResult(state))
      }
      else { // state is empty, so load from hbase
        table.get(new Get(key)).thenAccept((hbaseResult: Result) => {
          // this is called by the worker thread
          hbaseResult.toState(state) // convert from hbase result into state
          action.accept(new CacheResult(state))
        })
      }
    }

    override def processElement(value: InputMessage,
                                ctx: KeyedProcessFunction[KeyInfo, InputMessage, OUTPUTTYPE]#Context,
                                out: Collector[OUTPUTTYPE]): Unit = {
      val businessAction = new Consumer[CacheResult]() {
        override def accept(t: CacheResult): Unit = {
          // .. do business logic here.
          out.collect( /* final result */ )
        }
      }
      request(businessAction)
    }
  }).addSink()
Is there any suggestion for making a KeyedProcessFunction work with an async call to a third-party system?
Or any other idea for approaching this mix of Kafka and HBase in Flink?
I think your general assumptions are wrong. I faced a similar issue, though with quite a different problem, and haven't resolved it yet. Keeping state in the program is contradictory with async functions, and Flink prevents using state in async code by design (which is a good thing). If you want to make your function async, then you must get rid of the state. To achieve your goal, you probably need to redesign your solution. I don't know all the details of your problem, but you can think about splitting your process into more pipelines. E.g. you can create a pipeline that consumes data from HBase and passes it into a Kafka topic. Another pipeline can then consume the data sent by the pipeline gathering data from HBase. With such an approach you don't have to care about the state, because each pipeline is doing its own thing: just consuming data and passing it further.
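As a rough sketch of that idea, assuming the HBase aggregates have been copied into their own Kafka topic by a separate pipeline, the main pipeline could key both streams by session and keep the aggregate in keyed state, so processElement never blocks on HBase (all class, field, and stream names below are made up for illustration):

import org.apache.flink.api.common.state.{ValueState, ValueStateDescriptor}
import org.apache.flink.configuration.Configuration
import org.apache.flink.streaming.api.functions.co.KeyedCoProcessFunction
import org.apache.flink.streaming.api.scala._
import org.apache.flink.util.Collector

case class Aggregate(sessionId: String, value: Long)                    // from the HBase-derived topic
case class InputMessage(sessionId: String, payload: String)             // from the 24h Kafka topic
case class Output(sessionId: String, payload: String, aggregate: Long)

class MergeFunction extends KeyedCoProcessFunction[String, InputMessage, Aggregate, Output] {

  private var agg: ValueState[Aggregate] = _

  override def open(parameters: Configuration): Unit =
    agg = getRuntimeContext.getState(new ValueStateDescriptor("agg", classOf[Aggregate]))

  // stream events: enrich with whatever aggregate has arrived so far
  override def processElement1(msg: InputMessage,
                               ctx: KeyedCoProcessFunction[String, InputMessage, Aggregate, Output]#Context,
                               out: Collector[Output]): Unit = {
    val current = Option(agg.value()).map(_.value).getOrElse(0L)
    out.collect(Output(msg.sessionId, msg.payload, current))
  }

  // HBase-derived events: initialize or refresh the keyed state, no blocking call needed
  override def processElement2(a: Aggregate,
                               ctx: KeyedCoProcessFunction[String, InputMessage, Aggregate, Output]#Context,
                               out: Collector[Output]): Unit =
    agg.update(a)
}

// wiring (both sources are assumed to be Kafka consumers):
// kafkaStream.keyBy(_.sessionId)
//   .connect(hbaseAggregateStream.keyBy(_.sessionId))
//   .process(new MergeFunction)
//   .addSink(...)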

multiple FlinkKinesisProducer as sink for a datastream

I have multi-level KDA (Kinesis Data Analytics) Flink applications in different accounts. My use case requires looking at record contents to determine which AWS account to push the data to (the Kinesis stream in that account).
The linked example shows it's possible to select the stream name based on record contents, but I need to support multiple Kinesis producers for pushing to different AWS accounts.
Any help?
As an alternative, you can use Side Outputs to configure a dedicated sink (and therefore, FlinkKinesisProducer) for a different AWS account.
You can do it as follows:
val stream: DataStream[T] = ...

val account1OutputTag = OutputTag[T]("aws-account-1-output")
...
val accountNOutputTag = OutputTag[T]("aws-account-N-output")

val mainDataStream = stream
  .process(new ProcessFunction[T, T] {
    override def processElement(
        value: T,
        ctx: ProcessFunction[T, T]#Context,
        out: Collector[T]): Unit = {
      // emit data to the regular output
      out.collect(value)
      // emit data to the corresponding side output
      ctx.output(accountKOutputTag, value)
    }
  })

...

val account1SideOutputStream: DataStream[T] = mainDataStream.getSideOutput(account1OutputTag)
account1SideOutputStream.addSink(account1KinesisProducer)

...

val accountNSideOutputStream: DataStream[T] = mainDataStream.getSideOutput(accountNOutputTag)
accountNSideOutputStream.addSink(accountNKinesisProducer)
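For completeness, a sketch of how the per-account producers referenced above might be built, shown here for a String stream and assuming static credentials per account (region, keys, and stream names are placeholders):

import java.util.Properties
import org.apache.flink.api.common.serialization.SimpleStringSchema
import org.apache.flink.streaming.connectors.kinesis.FlinkKinesisProducer
import org.apache.flink.streaming.connectors.kinesis.config.AWSConfigConstants

def producerFor(region: String, accessKey: String, secretKey: String, streamName: String): FlinkKinesisProducer[String] = {
  val props = new Properties()
  props.setProperty(AWSConfigConstants.AWS_REGION, region)
  props.setProperty(AWSConfigConstants.AWS_ACCESS_KEY_ID, accessKey)
  props.setProperty(AWSConfigConstants.AWS_SECRET_ACCESS_KEY, secretKey)

  val producer = new FlinkKinesisProducer[String](new SimpleStringSchema(), props)
  producer.setDefaultStream(streamName)   // the Kinesis stream in that account
  producer.setDefaultPartition("0")
  producer
}

val account1KinesisProducer = producerFor("us-east-1", "<access-key-1>", "<secret-key-1>", "account-1-stream")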

Flink Statefun connections to Flink Table API

We are interested in connecting to a regular Flink Streaming application from the new Stateful Functions 🎉, ideally using the Table API. The idea is to consult tables registered in Flink from Statefun. Is this possible, and what is the right way to do it?
My idea so far has been to initialize my table stream in some main function and register a stateful function provider to connect to the table:
@AutoService(StatefulFunctionModule.class)
public class Module implements StatefulFunctionModule {

  @Override
  public void configure(Map<String, String> globalConfiguration, Binder binder) {
    StreamExecutionEnvironment env = StreamExecutionEnvironment.getExecutionEnvironment();
    StreamTableEnvironment tableEnv = StreamTableEnvironment.create(env);

    // ingest a DataStream from an external source
    DataStream<Tuple3<Long, String, Integer>> ds = env.addSource(...);

    // SQL query with an inlined (unregistered) table
    Table myTable = tableEnv.fromDataStream(ds, "user, product, amount");
    tableEnv.createTemporaryView("my_table", myTable);

    TableFunctionProvider tableProvider = new TableFunctionProvider();
    binder.bindFunctionProvider(FnEnrichmentCallback.TYPE, tableProvider);

    // continue registering my other messages
    // ...
  }
}
The stateful function provider would return a FnTableQuery which simply queries the table whenever it receives a message:
public class TableFunctionProvider implements StatefulFunctionProvider {

  @Override
  public StatefulFunction functionOfType(FunctionType type) {
    return new FnTableQuery();
  }
}
The query function object would then operate as an actor for every established process, and simply query the table when invoked:
public class FnTableQuery extends StatefulMatchFunction {

  static final FunctionType TYPE = new FunctionType(Identifiers.NAMESPACE, "my-table");

  private Table myTable;

  @Override
  public void configure(MatchBinder binder) {
    StreamExecutionEnvironment env = StreamExecutionEnvironment.getExecutionEnvironment();
    StreamTableEnvironment tableEnv = StreamTableEnvironment.create(env);
    myTable = tableEnv.from("my_table");
    binder
        .otherwise(this::catchAll);
  }

  private void catchAll(Context context, Object message) {
    context.send(FnEnrichmentCallback.TYPE, myTable.select("max(amount)").toString(), message);
  }
}
I apologize in advance if this approach doesn't make sense, because I don't know if:
Flink and Statefun applications can work together outside the realm of sources/sinks, especially since this particular function is stateless and the table is stateful
We can query Flink tables like this; I have only ever queried them as an intermediate object to send to a sink or DataStream
It makes sense to initialize things in Module.configure, and whether both the stateful function provider and its match function are called once per parallel worker
The Apache Flink community does have plans to support Flink DataStreams as StateFun ingresses / egresses in the future.
What this would mean is that you could take the result streams produced with the Flink Table API / Flink CEP / DataStream API etc., and invoke functions using the events in those streams.

Apache Flink DataStream API doesn't have a mapPartition transformation

Spark DStream has a mapPartition API, while the Flink DataStream API doesn't. Could anyone help explain the reason? What I want to do is implement an API similar to Spark's reduceByKey on Flink.
Flink's stream processing model is quite different from Spark Streaming which is centered around mini batches. In Spark Streaming each mini batch is executed like a regular batch program on a finite set of data, whereas Flink DataStream programs continuously process records.
In Flink's DataSet API, a MapPartitionFunction has two parameters. An iterator for the input and a collector for the result of the function. A MapPartitionFunction in a Flink DataStream program would never return from the first function call, because the iterator would iterate over an endless stream of records. However, Flink's internal stream processing model requires that user functions return in order to checkpoint function state. Therefore, the DataStream API does not offer a mapPartition transformation.
In order to implement functionality similar to Spark Streaming's reduceByKey, you need to define a keyed window over the stream. Windows discretize streams, which is somewhat similar to mini batches, but windows offer much more flexibility. Since a window is of finite size, you can call reduce on the window.
This could look like:
yourStream.keyBy("myKey") // organize stream by key "myKey"
.timeWindow(Time.seconds(5)) // build 5 sec tumbling windows
.reduce(new YourReduceFunction); // apply a reduce function on each window
The DataStream documentation shows how to define various window types and explains all available functions.
Note: The DataStream API has been reworked recently. The example assumes the latest version (0.10-SNAPSHOT), which will be released as 0.10.0 in the next few days.
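For completeness, a sketch of what YourReduceFunction might look like, assuming a hypothetical element type Event with a myKey field and a count to sum per key and window:

import org.apache.flink.api.common.functions.ReduceFunction

case class Event(myKey: String, count: Int)   // assumed element type

class YourReduceFunction extends ReduceFunction[Event] {
  // combine two elements of the same key and window into one
  override def reduce(a: Event, b: Event): Event =
    Event(a.myKey, a.count + b.count)
}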
Assuming your input stream is single-partition data (say String):

val new_number_of_partitions = 4

// the line below repartitions your data; you could also broadcast the data to all partitions
val step1stream = yourStream.rescale.setParallelism(new_number_of_partitions)

// flexibility for mapping
val step2stream = step1stream.map(new RichMapFunction[String, (String, Int)] {
  // var local_val_to_different_part: Type = null
  var myTaskId: Int = _

  // the function below is executed once per mapper instance (one mapper per partition)
  override def open(config: Configuration): Unit = {
    myTaskId = getRuntimeContext.getIndexOfThisSubtask
    // do whatever initialization you want to do, e.g. read from data sources
  }

  override def map(value: String): (String, Int) = {
    (value, myTaskId)
  }
})

val step3stream = step2stream.keyBy(0).countWindow(new_number_of_partitions).sum(1).print

// Instead of sum(1), you can use .reduce((x, y) => (x._1, x._2 + y._2))
// .countWindow will first wait for a certain number of records for a particular key
// and then apply the function
Flink streaming is pure streaming (not micro-batching). Take a look at the Iterate API.
