I have a multi-level KDA (Kinesis Data Analytics) setup with Flink applications in different accounts. My use case is that I need to look at the record contents to determine which AWS account to push the data to (a Kinesis stream in that account).
The link shows it's possible to select the stream name based on record contents; I need to support multiple Kinesis producers for pushing to different AWS accounts.
Any help?
As an alternative, you can use Side Outputs to configure a dedicated sink (and therefore a separate FlinkKinesisProducer) for each AWS account.
You can do it as follows:
val stream: DataStream[T] = ...

val account1OutputTag = OutputTag[T]("aws-account-1-output")
...
val accountNOutputTag = OutputTag[T]("aws-account-N-output")

val mainDataStream = stream
  .process(new ProcessFunction[T, T] {
    override def processElement(
        value: T,
        ctx: ProcessFunction[T, T]#Context,
        out: Collector[T]): Unit = {
      // emit data to the regular output
      out.collect(value)
      // emit data to the side output of the matching account,
      // where accountKOutputTag is chosen based on the record contents
      ctx.output(accountKOutputTag, value)
    }
  })

...

val account1SideOutputStream: DataStream[T] = mainDataStream.getSideOutput(account1OutputTag)
account1SideOutputStream.addSink(account1KinesisProducer)

...

val accountNSideOutputStream: DataStream[T] = mainDataStream.getSideOutput(accountNOutputTag)
accountNSideOutputStream.addSink(accountNKinesisProducer)
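For the per-record routing itself, the OutputTag can be looked up from the record contents. A minimal sketch, assuming a concrete record type MyRecord with a hypothetical accountId field and the OutputTags defined above:

// sketch only: MyRecord and its accountId field are assumptions
val tagsByAccount: Map[String, OutputTag[MyRecord]] = Map(
  "account-1" -> account1OutputTag,
  "account-N" -> accountNOutputTag)

val mainDataStream = stream
  .process(new ProcessFunction[MyRecord, MyRecord] {
    override def processElement(
        value: MyRecord,
        ctx: ProcessFunction[MyRecord, MyRecord]#Context,
        out: Collector[MyRecord]): Unit = {
      out.collect(value)
      // route to the side output of the account named in the record, if any
      tagsByAccount.get(value.accountId).foreach(tag => ctx.output(tag, value))
    }
  })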
I'm trying to stop my job with a savepoint, then start it again from the same savepoint. In my case, I update my job and create a new version of it with a new jar. Here is my code example:
class Reader(bla bla) {
  def read() = {
    val ds = readFromKafka()
    transform(ds)
  }

  def transform(ds: DataStream[...]) = {
    ds.map(...)
  }
}
object MyJob {
  def run() = {
    val data = new Reader().read()
    data.keyBy(id).process(new MyStateFunc).uid("my-uid") // then write to kafka
  }
}
In this case, I stopped the job with a savepoint, then started it from the same savepoint with the same jar, and it worked. Then I added a filter to my Reader like this:
class Reader(bla bla) {
  def read() = {
    val ds = readFromKafka()
    transform(ds)
  }

  def transform(ds: DataStream[...]) = {
    ds.map(...).filter(...) // FILTER ADDED HERE
  }
}
I stopped the job with a savepoint, and that works. Then I tried to deploy the new version of the job (with the added filter) from the same savepoint, but it cannot match the operators and the job does not deploy. Why?
Unless you explicitly provide UIDs for all of your stateful operators before taking a savepoint, Flink will no longer be able to figure out which state in the savepoint belongs to which operator once you change the topology of your job.
I see that you have a UID on your keyed process function ("my-uid"), but you also need UIDs on the Kafka source and the sink, and on anything else that's stateful. These UIDs need to be attached to the stateful operators themselves and must be unique within the job (but not across all jobs). (Furthermore, each state descriptor needs to assign a name to each piece of state, using a name that is unique within the operator.)
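To illustrate that last point, inside something like MyStateFunc a piece of state might be declared with an explicit name (a sketch; the state name and type are assumptions):

// "event-count" only needs to be unique within this operator
lazy val countState: ValueState[Long] = getRuntimeContext.getState(
  new ValueStateDescriptor[Long]("event-count", createTypeInformation[Long]))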
Typically one does something like this:
env
  .addSource(...)
  .name("KafkaSource")
  .uid("KafkaSource")

results
  .addSink(...)
  .name("KafkaSink")
  .uid("KafkaSink")
where the name() method is used to supply the text that appears in the web UI.
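Applied to the job in the question, that could look roughly like this (a sketch; kafkaSource and kafkaSink are stand-ins for the actual source and sink):

val ds = env
  .addSource(kafkaSource)
  .name("KafkaSource")
  .uid("KafkaSource")

ds.keyBy(id)
  .process(new MyStateFunc)
  .name("MyStateFunc")
  .uid("my-uid")
  .addSink(kafkaSink)
  .name("KafkaSink")
  .uid("KafkaSink")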
I want to join a big table, too big to fit in TaskManager memory, with a stream (Kafka). In my tests I successfully joined the two by mixing the Table API with the DataStream API. I did the following:
val stream: DataStream[MyEvent] = env.addSource(...)

stream
  .timeWindowAll(...)
  .trigger(...)
  .process(new ProcessAllWindowFunction[MyEvent, MyEvent, TimeWindow] {
    var tableEnv: StreamTableEnvironment = _

    override def open(parameters: Configuration): Unit = {
      // init table env
    }

    override def process(context: Context, elements: Iterable[MyEvent], out: Collector[MyEvent]): Unit = {
      val table = tableEnv.sqlQuery(...)
      elements.map(e => {
        // do process
        out.collect(...)
      })
    }
  })
It is working, but I have never seen this type of implementation anywhere. Is it OK? What would be the drawbacks?
One should not use a StreamExecutionEnvironment or TableEnvironment within a Flink function. An environment is used to construct a pipeline that is submitted to the cluster.
Your example submits a job to the cluster from within a job that is already running on the cluster.
This might work for certain use cases but is generally discouraged. Imagine your outer stream contains thousands of events and your function creates a job for every event; that could potentially DDoS your cluster.
I have two sources, Kafka and HBase. In Kafka, there is only the most recent 24 hours of the data stream. In HBase, there is the aggregated data from the beginning. My goal is to merge the two during stream processing whenever a stream input (from Kafka) for some session occurs. I tried a couple of methods, but they were not satisfactory because of performance.
After some searching, I came up with the idea below (caching using the state of a keyed process function):
1. key the input to the keyed process function by the session information
2. check the keyed process function's state
3. if the state is not initialized, query HBase and initialize the state, then go to 5
4. else (the state is initialized), go to 5
5. do the business logic using the state
While implementing this idea, I ran into a performance issue: querying HBase synchronously is slow. So I tried an async version, but it is complicated.
I have faced two issues. One is a thread-safety issue between processElement and the HBase async worker thread; the other is that the Context of the process function has expired by the time the HBase async worker finishes (it is only valid until processElement returns).
val sourceStream = env.addSource(kafkaConsumer.setStartFromGroupOffsets())

sourceStream
  .keyBy(new KeySelector[InputMessage, KeyInfo]() {
    override def getKey(v: InputMessage): KeyInfo = v.toKeyInfo()
  })
  .process(new KeyedProcessFunction[KeyInfo, InputMessage, OUTPUTTYPE]() {
    var state: MapState[String, (String, Long)] = _
    var table: AsyncTable[AdvancedScanResultConsumer] = _

    override def open(parameters: Configuration): Unit = {
      val conn = ConnectionFactory.createAsyncConnection(hbaseConfInstance).join
      table = conn.getTable(TableName.valueOf("tablename"))
      state = getRuntimeContext.getMapState(stateDescriptor)
    }

    def request(action: Consumer[CacheResult]): Unit = {
      if (!state.isEmpty) {
        action.accept(new CacheResult(state))
      } else { // state is empty, so load from HBase
        table.get(new Get(key)).thenAccept((hbaseResult: Result) => {
          // this is called by an HBase worker thread
          hbaseResult.toState(state) // convert the HBase result into state
          action.accept(new CacheResult(state))
        })
      }
    }

    override def processElement(value: InputMessage,
                                ctx: KeyedProcessFunction[KeyInfo, InputMessage, OUTPUTTYPE]#Context,
                                out: Collector[OUTPUTTYPE]): Unit = {
      val businessAction = new Consumer[CacheResult]() {
        override def accept(t: CacheResult): Unit = {
          // .. do the business logic here
          out.collect( /* final result */ )
        }
      }
      request(businessAction)
    }
  })
  .addSink(...)
Is there any suggestion for making a KeyedProcessFunction work with an async call to a third-party system?
Or any other idea for combining Kafka and HBase in Flink?
I think your general assumptions are wrong. I faced a similar issue, although with quite a different problem, and haven't resolved it yet. Keeping state in the function contradicts async operation, and Flink prevents using state in async code by design (which is a good thing). If you want to make your function async, you must get rid of the state. To achieve your goal, you probably need to redesign your solution. I don't know all the details of your problem, but you could think of splitting your process into more pipelines. E.g., you could create a pipeline that consumes data from HBase and writes it to a Kafka topic; another pipeline can then consume the data produced by the pipeline reading from HBase. In such an approach you don't have to care about state, because each pipeline does its own thing: it just consumes data and passes it on.
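If the goal is simply to make the HBase lookup non-blocking without any keyed state, one option (not covered above) is Flink's Async I/O API. A rough sketch, reusing the types from the question; toRowKey and toOutput are hypothetical helpers, and there is no caching here:

class HBaseLookupFunction extends AsyncFunction[InputMessage, OUTPUTTYPE] {
  // created lazily on the task manager, once per parallel instance
  @transient private lazy val connection: AsyncConnection =
    ConnectionFactory.createAsyncConnection(hbaseConfInstance).join

  override def asyncInvoke(input: InputMessage, resultFuture: ResultFuture[OUTPUTTYPE]): Unit = {
    val table = connection.getTable(TableName.valueOf("tablename"))
    // no Flink state is touched here; the HBase callback only completes the future
    table.get(new Get(toRowKey(input))).thenAccept((hbaseResult: Result) => {
      resultFuture.complete(Iterable(toOutput(input, hbaseResult)))
    })
  }
}

// enrich the Kafka stream without any keyed state
val enriched = AsyncDataStream.unorderedWait(
  sourceStream, new HBaseLookupFunction, 5000L, TimeUnit.MILLISECONDS, 100)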
I want to first manipulate static data using the DataSet API and then use the DataStream API to run a streaming job. If I run the code from the IDE, it works perfectly. But when I try running it on a local Flink jobmanager (everything at parallelism 1), the streaming code never executes!
For example, the following code is not working:
val parallelism = 1

val env = StreamExecutionEnvironment.getExecutionEnvironment
env.setParallelism(parallelism)

val envStatic = ExecutionEnvironment.getExecutionEnvironment
envStatic.setParallelism(parallelism)

val myStaticData = envStatic.fromCollection(1 to 10)
val myVal: Int = myStaticData.reduce(_ + _).collect().head

val theStream = env.fromElements(1).iterate( iteration => {
  val result = iteration.map(x => x + myVal)
  (result, result)
})

theStream.print()

env.execute("static and streaming together")
What should I try to get this thing working?
Logs: execution logs for the above program
Execution plan: plan
It seems to be acyclic.
If you have a Flink program which consists of multiple sub-jobs, e.g. triggered by count, collect, or print, then you cannot submit it via the web interface. The web interface only supports a single Flink job.
Spark DStream has a mapPartition API, while the Flink DataStream API doesn't. Could anyone explain the reason? What I want to do is implement an API similar to Spark's reduceByKey on Flink.
Flink's stream processing model is quite different from Spark Streaming, which is centered around mini-batches. In Spark Streaming, each mini-batch is executed like a regular batch program on a finite set of data, whereas Flink DataStream programs continuously process records.
In Flink's DataSet API, a MapPartitionFunction has two parameters. An iterator for the input and a collector for the result of the function. A MapPartitionFunction in a Flink DataStream program would never return from the first function call, because the iterator would iterate over an endless stream of records. However, Flink's internal stream processing model requires that user functions return in order to checkpoint function state. Therefore, the DataStream API does not offer a mapPartition transformation.
In order to implement functionality similar to Spark Streaming's reduceByKey, you need to define a keyed window over the stream. Windows discretize streams, which is somewhat similar to mini-batches, but windows offer much more flexibility. Since a window is of finite size, you can call reduce on the window.
This could look like:
yourStream.keyBy("myKey") // organize stream by key "myKey"
.timeWindow(Time.seconds(5)) // build 5 sec tumbling windows
.reduce(new YourReduceFunction); // apply a reduce function on each window
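YourReduceFunction is not spelled out here; a minimal sketch of what it could look like, assuming a hypothetical KeyedCount record with a myKey field and a count to be summed:

case class KeyedCount(myKey: String, count: Int)

class YourReduceFunction extends ReduceFunction[KeyedCount] {
  // combine two records for the same key by summing their counts
  override def reduce(a: KeyedCount, b: KeyedCount): KeyedCount =
    a.copy(count = a.count + b.count)
}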
The DataStream documentation shows how to define various window types and explains all available functions.
Note: The DataStream API has been reworked recently. The example assumes the latest version (0.10-SNAPSHOT), which will be released as 0.10.0 in the coming days.
Assuming your input stream is single-partition data (say, String):
val new_number_of_partitions = 4

// the line below repartitions your data; you could also broadcast data to all partitions
val step1stream = yourStream.rescale.setParallelism(new_number_of_partitions)

// flexibility for mapping
val step2stream = step1stream.map(new RichMapFunction[String, (String, Int)] {
  // var local_val_to_different_part: Type = null
  var myTaskId: Int = _

  // the function below is executed once per mapper instance (one mapper per partition)
  override def open(config: Configuration): Unit = {
    myTaskId = getRuntimeContext.getIndexOfThisSubtask
    // do whatever initialization you want to do, read from data sources, ...
  }

  override def map(value: String): (String, Int) = {
    (value, myTaskId)
  }
})

val step3stream = step2stream.keyBy(0).countWindow(new_number_of_partitions).sum(1).print
// Instead of sum(1), you can use .reduce((x, y) => (x._1, x._2 + y._2))
// .countWindow will first wait for a certain number of records for a particular key
// and then apply the function
Flink streaming is pure streaming (not batched). Take a look at the Iterate API.
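For reference, a minimal sketch of the DataStream iterate API; the increment-until-10 logic is only an illustration:

val iterated = env.fromElements(0).iterate { input: DataStream[Int] =>
  val incremented = input.map(_ + 1)
  val feedback = incremented.filter(_ < 10)  // fed back into the loop
  val output = incremented.filter(_ >= 10)   // leaves the loop
  (feedback, output)
}
iterated.print()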