Flink: DataSet and DataStream API in one program. Is it possible? - apache-flink

I want to first manipulate static data using the DataSet API and then use the DataStream API to run a streaming job. The code works perfectly when I run it from my IDE, but when I run it on a local Flink JobManager (all parallelism 1), the streaming code never executes!
For example, the following code is not working:
val parallelism = 1

val env = StreamExecutionEnvironment.getExecutionEnvironment
env.setParallelism(parallelism)

val envStatic = ExecutionEnvironment.getExecutionEnvironment
envStatic.setParallelism(parallelism)

val myStaticData = envStatic.fromCollection(1 to 10)
val myVal: Int = myStaticData.reduce(_ + _).collect().head

val theStream = env.fromElements(1).iterate(iteration => {
  val result = iteration.map(x => x + myVal)
  (result, result)
})

theStream.print()
env.execute("static and streaming together")
What should I try to get this thing working?
Logs: execution logs for the above program
Execution plan: plan
It seems to be acyclic.

If you have a Flink job which consists of multiple sub-jobs, e.g. triggered by count, collect, or print, then you cannot submit the job via the web interface. The web interface only supports a single Flink job.
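If the goal is just to get this particular example running through the web interface, one hedged workaround (only valid because the static data here is a small local collection) is to compute the static value on the client instead of in a separate, eagerly executed DataSet job, so the program consists of a single streaming job:
import org.apache.flink.streaming.api.scala._

object StaticPlusStreaming {
  def main(args: Array[String]): Unit = {
    val env = StreamExecutionEnvironment.getExecutionEnvironment
    env.setParallelism(1)

    // The reduction over the small static collection happens on the client,
    // not as a separate Flink batch job triggered by collect().
    val myVal: Int = (1 to 10).sum

    val theStream = env.fromElements(1).iterate(iteration => {
      val result = iteration.map(x => x + myVal)
      (result, result)
    })

    theStream.print()
    env.execute("static and streaming together")
  }
}
If the static part really has to run as a DataSet job on the cluster, submitting the program via the command-line client instead of the web interface should avoid the single-job limitation.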

Related

How to print the total number of lines in files using Flink

I am reading lines from Parquet files using a source function similar to this one. However, when I try counting the number of lines being processed, nothing is printed even though the job completes:
val env: StreamExecutionEnvironment = StreamExecutionEnvironment.getExecutionEnvironment
env.setStreamTimeCharacteristic(TimeCharacteristic.ProcessingTime)

lazy val stream: DataStream[Group] = env.addSource(new ParquetSourceFunction)

stream.map(_ => 1)
  .timeWindowAll(Time.seconds(180))
  .reduce(_ + _)
  .print()
The problem is that you are using ProcessingTime. With EventTime, when the file source finishes, Flink emits a watermark with the value Long.MaxValue so that all windows are closed. This does not happen when working with ProcessingTime, so, simply speaking, Flink doesn't wait for your window to close, and that's why you are not getting any results.
You may want to switch to the DataSet API, which should be more appropriate for the task you want to achieve.
Alternatively, you may try to work with EventTime and assign a static watermark, since Flink will still emit the final Long.MaxValue watermark at the end.
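A hedged sketch of that EventTime variant, reusing the ParquetSourceFunction and its Group output type from the question (assumed to exist) and assigning rough, ascending processing-time timestamps; once the bounded source finishes, the final Long.MaxValue watermark closes the 180-second window and the count is printed:
import org.apache.flink.streaming.api.TimeCharacteristic
import org.apache.flink.streaming.api.scala._
import org.apache.flink.streaming.api.windowing.time.Time

val env = StreamExecutionEnvironment.getExecutionEnvironment
env.setStreamTimeCharacteristic(TimeCharacteristic.EventTime)

env.addSource(new ParquetSourceFunction)
  .assignAscendingTimestamps(_ => System.currentTimeMillis()) // crude event-time assignment
  .map(_ => 1)
  .timeWindowAll(Time.seconds(180))
  .reduce(_ + _)
  .print()

env.execute("count parquet lines")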

Flink Table-API and DataStream ProcessFunction

I want to join a big table, which is impossible to fit in TaskManager memory, with a stream (Kafka). In my tests I successfully joined both by mixing the Table API with the DataStream API. I did the following:
val stream: DataStream[MyEvent] = env.addSource(...)

stream
  .timeWindowAll(...)
  .trigger(...)
  .process(new ProcessAllWindowFunction[MyEvent, MyEvent, TimeWindow] {

    var tableEnv: StreamTableEnvironment = _

    override def open(parameters: Configuration): Unit = {
      // init table env
    }

    override def process(context: Context, elements: Iterable[MyEvent], out: Collector[MyEvent]): Unit = {
      val table = tableEnv.sqlQuery(...)
      elements.map(e => {
        // do process
        out.collect(...)
      })
    }
  })
It is working, but I have never seen this type of implementation anywhere. Is it OK? What would be the drawbacks?
One should not use StreamExecutionEnvironment or TableEnvironment within a Flink function. An environment is used to construct a pipeline that is submitted to the cluster.
Your example submits a job to the cluster from within another job running on the cluster.
This might work for certain use cases but is generally discouraged. Imagine your outer stream contains thousands of events and your function creates a job for every event; it could potentially DDoS your cluster.
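A minimal sketch of the intended structure, assuming Flink 1.11+ with the Scala Table API bridge; the Event and Lookup case classes and the in-memory sources are stand-ins for the Kafka stream and the big table, and the point is only that both environments are created once in main() and a single job is submitted:
import org.apache.flink.streaming.api.scala._
import org.apache.flink.table.api.bridge.scala._
import org.apache.flink.types.Row

case class Event(id: Long, payload: String)
case class Lookup(id: Long, attr: String)

object SingleJobTableJoin {
  def main(args: Array[String]): Unit = {
    // Both environments are created exactly once, in main(), to build one pipeline.
    val env = StreamExecutionEnvironment.getExecutionEnvironment
    val tableEnv = StreamTableEnvironment.create(env)

    // Stand-ins for the Kafka stream and the big table from the question.
    val events: DataStream[Event] = env.fromElements(Event(1L, "a"), Event(2L, "b"))
    val lookupData: DataStream[Lookup] = env.fromElements(Lookup(1L, "x"), Lookup(2L, "y"))

    tableEnv.createTemporaryView("events", events)
    tableEnv.createTemporaryView("lookup_data", lookupData)

    // The join is declared as part of the dataflow; no job is submitted per element.
    val joined = tableEnv.sqlQuery(
      "SELECT e.id, e.payload, l.attr FROM events e JOIN lookup_data l ON e.id = l.id")

    tableEnv.toAppendStream[Row](joined).print()
    env.execute("single-job table/stream join")
  }
}
Whether a regular join like this is viable for a table that does not fit in TaskManager memory is a separate concern, since regular joins keep both inputs in state; a lookup or temporal join against the external system is the usual alternative in that situation.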

Multiple FlinkKinesisProducers as sinks for a DataStream

I have a multi-level KDA setup with Flink applications in different AWS accounts. My use case requires looking at the record contents to determine which AWS account to push the data to (a Kinesis stream in that account).
This link shows it is possible to select the stream name based on record contents; I need to support multiple Kinesis producers for pushing to different AWS accounts.
Any help?
As an alternative, you can use Side Outputs to configure a dedicated sink (and therefore a dedicated FlinkKinesisProducer) for each AWS account.
You can do it as follows:
val stream: DataStream[T] = ...

val account1OutputTag = OutputTag[T]("aws-account-1-output")
...
val accountNOutputTag = OutputTag[T]("aws-account-N-output")

val mainDataStream = stream
  .process(new ProcessFunction[T, T] {
    override def processElement(
        value: T,
        ctx: ProcessFunction[T, T]#Context,
        out: Collector[T]): Unit = {
      // emit data to the regular output
      out.collect(value)
      // emit data to the corresponding side output (accountKOutputTag is the tag chosen per record)
      ctx.output(accountKOutputTag, value)
    }
  })

...

val account1SideOutputStream: DataStream[T] = mainDataStream.getSideOutput(account1OutputTag)
account1SideOutputStream.addSink(account1KinesisProducer)

...

val accountNSideOutputStream: DataStream[T] = mainDataStream.getSideOutput(accountNOutputTag)
accountNSideOutputStream.addSink(accountNKinesisProducer)
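For completeness, a hedged sketch of how the per-account producers referenced above (account1KinesisProducer, accountNKinesisProducer) might be built, assuming String records and static credentials per account; in a real cross-account setup you might instead assume an IAM role. The property keys come from the Kinesis connector's AWSConfigConstants:
import java.util.Properties

import org.apache.flink.api.common.serialization.SimpleStringSchema
import org.apache.flink.streaming.connectors.kinesis.FlinkKinesisProducer
import org.apache.flink.streaming.connectors.kinesis.config.AWSConfigConstants

object AccountSinks {

  // Builds a producer bound to one account's credentials, region, and stream.
  def kinesisProducerFor(region: String,
                         accessKey: String,
                         secretKey: String,
                         streamName: String): FlinkKinesisProducer[String] = {
    val props = new Properties()
    props.setProperty(AWSConfigConstants.AWS_REGION, region)
    props.setProperty(AWSConfigConstants.AWS_ACCESS_KEY_ID, accessKey)
    props.setProperty(AWSConfigConstants.AWS_SECRET_ACCESS_KEY, secretKey)

    val producer = new FlinkKinesisProducer[String](new SimpleStringSchema(), props)
    producer.setDefaultStream(streamName) // the stream living in that account
    producer.setDefaultPartition("0")
    producer
  }
}

// e.g. val account1KinesisProducer = AccountSinks.kinesisProducerFor("us-east-1", key1, secret1, "stream-1")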

Where to find the output for the standalone cluster

I have the following Flink word count program. When I run it in my IDE, it prints the word counts correctly, as follows:
(hi,2)
(are,1)
(you,1)
(how,1)
But when I run it in the cluster, I can't find the output.
1. Start the cluster using start-cluster.sh
2. Open the web UI at http://localhost:8081
3. On the Submit New Job page, submit the jar, input the entry class, and click the Submit button to submit the job
4. The job completes successfully, but I can't find the output in the TaskManager or JobManager logs in the UI.
Where can I find the output?
The word count application is:
import org.apache.flink.api.scala.ExecutionEnvironment
import org.apache.flink.api.scala._

/**
 * Wordcount example
 */
object WordCount {

  def main(args: Array[String]) {
    val env = ExecutionEnvironment.getExecutionEnvironment
    val data = List("hi", "how are you", "hi")
    val dataSet = env.fromCollection(data)
    val words = dataSet.flatMap(value => value.split("\\s+"))
    val mappedWords = words.map(value => (value, 1))
    val grouped = mappedWords.groupBy(0)
    val sum = grouped.sum(1)
    sum.collect().foreach(println)
  }
}
In the log directory of each taskmanager machine you should find both *.log and *.out files. Whatever your job has printed will go to the .out files. This is what is displayed in the "stdout" tab for each taskmanager in the web UI -- though if this file is very large, the browser may struggle to fetch and display it.
Update: Apparently Flink's batch environment handles printing differently from the streaming one. When I use the CLI to submit this batch job, the output appears in the terminal, not in the .out files as it would for a streaming job.
I suggest you change your example to do something like this at the end to collect the results in a file:
...
sum.writeAsText("/tmp/test")
env.execute()
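For reference, a sketch of the complete program with that change; the WriteMode.OVERWRITE and setParallelism(1) calls are optional additions (not part of the suggestion above) that allow re-running the job and make Flink write a single /tmp/test file instead of a directory of part files:
import org.apache.flink.api.scala._
import org.apache.flink.core.fs.FileSystem.WriteMode

object WordCount {
  def main(args: Array[String]): Unit = {
    val env = ExecutionEnvironment.getExecutionEnvironment

    val data = List("hi", "how are you", "hi")
    val sum = env.fromCollection(data)
      .flatMap(value => value.split("\\s+"))
      .map(value => (value, 1))
      .groupBy(0)
      .sum(1)

    // writeAsText is a lazy sink, so env.execute() is required to run the job.
    sum.writeAsText("/tmp/test", WriteMode.OVERWRITE).setParallelism(1)
    env.execute("WordCount")
  }
}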

Apache Flink DataStream API doesn't have a mapPartition transformation

Spark's DStream has a mapPartition API, while Flink's DataStream API doesn't. Could anyone help explain the reason? What I want to do is implement an API similar to Spark's reduceByKey on Flink.
Flink's stream processing model is quite different from Spark Streaming which is centered around mini batches. In Spark Streaming each mini batch is executed like a regular batch program on a finite set of data, whereas Flink DataStream programs continuously process records.
In Flink's DataSet API, a MapPartitionFunction has two parameters. An iterator for the input and a collector for the result of the function. A MapPartitionFunction in a Flink DataStream program would never return from the first function call, because the iterator would iterate over an endless stream of records. However, Flink's internal stream processing model requires that user functions return in order to checkpoint function state. Therefore, the DataStream API does not offer a mapPartition transformation.
In order to implement functionality similar to Spark Streaming's reduceByKey, you need to define a keyed window over the stream. Windows discretize streams, which is somewhat similar to mini batches, but windows offer much more flexibility. Since a window is of finite size, you can apply a reduce function to the window.
This could look like:
yourStream.keyBy("myKey")           // organize the stream by the key "myKey"
  .timeWindow(Time.seconds(5))      // build 5 second tumbling windows
  .reduce(new YourReduceFunction)   // apply a reduce function to each window
The DataStream documentation shows how to define various window types and explains all available functions.
Note: The DataStream API has been reworked recently. The example assumes the latest version (0.10-SNAPSHOT), which will be released as 0.10.0 in the coming days.
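As a self-contained variant of the keyed-window snippet above, written against a somewhat newer DataStream API than that 0.10 snapshot (the socket source and the 5-second window are just illustrative):
import org.apache.flink.streaming.api.scala._
import org.apache.flink.streaming.api.windowing.time.Time

object WindowedReduceByKey {
  def main(args: Array[String]): Unit = {
    val env = StreamExecutionEnvironment.getExecutionEnvironment

    // Unbounded text input, e.g. started with `nc -lk 9999`.
    val counts = env.socketTextStream("localhost", 9999)
      .flatMap(line => line.split("\\s+"))
      .map(word => (word, 1))
      .keyBy(_._1)                           // organize the stream by the word
      .timeWindow(Time.seconds(5))           // build 5 second tumbling windows
      .reduce((a, b) => (a._1, a._2 + b._2)) // sum the counts within each window

    counts.print()
    env.execute("windowed reduceByKey")
  }
}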
Assuming your input stream is single-partition data (say String):
val new_number_of_partitions = 4

// The line below repartitions your data; you can also broadcast the data to all partitions.
val step1stream = yourStream.rescale.setParallelism(new_number_of_partitions)

// Flexibility for mapping
val step2stream = step1stream.map(new RichMapFunction[String, (String, Int)] {
  // var local_val_to_different_part : Type = null
  var myTaskId: Int = -1

  // The function below is executed once per mapper (one mapper per partition).
  override def open(config: Configuration): Unit = {
    myTaskId = getRuntimeContext.getIndexOfThisSubtask
    // Do whatever initialization you want to do, e.g. read from data sources.
  }

  def map(value: String): (String, Int) = {
    (value, myTaskId)
  }
})

val step3stream = step2stream.keyBy(0).countWindow(new_number_of_partitions).sum(1).print()
// Instead of sum(1), you can use .reduce((x, y) => (x._1, x._2 + y._2)).
// .countWindow first waits for a certain number of records for a particular key
// and then applies the function.
Flink streaming is pure streaming (not micro-batched). Take a look at the Iterate API.
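For reference, a small hedged sketch of that Iterate API (the decrement-until-non-positive logic is purely illustrative): the step function returns a (feedback, output) pair, feedback elements re-enter the loop, output elements continue downstream.
import org.apache.flink.streaming.api.scala._

object IterateSketch {
  def main(args: Array[String]): Unit = {
    val env = StreamExecutionEnvironment.getExecutionEnvironment
    env.setParallelism(1) // keep feedback and iteration-head parallelism aligned in this toy example

    val results: DataStream[Long] = env.fromElements(2L, 5L, 10L)
      .iterate((iteration: DataStream[Long]) => {
        val minusOne = iteration.map(_ - 1)
        val feedback = minusOne.filter(_ > 0)  // fed back into the iteration head
        val output = minusOne.filter(_ <= 0)   // forwarded downstream
        (feedback, output)
      }, 5000) // stop waiting for feedback after 5 seconds so the job can finish

    results.print()
    env.execute("iterate sketch")
  }
}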
