Unbounded Collection based stream in Flink - apache-flink

Is it possible to create an unbounded collection streams in flink. Like in a map if we add a element flink should process as in the socket stream. It should not exit once the initial elements are read.

You can create a custom SourceFunction that never terminates (until cancel() is called, and emits elements as they appear. You'd want to have a class that looks something like:
class MyUnboundedSource extends RichParallelSourceFunction<MyType> {
...
private transient volatile boolean running;
...
#Override
public void run(SourceContext<MyType> ctx) throws Exception {
while (running) {
// Call some method that returns the next record, if available.
MyType record = getNextRecordOrNull();
if (record != null) {
ctx.collect(record);
} else {
Thread.sleep(NO_DATA_SLEEP_TIME());
}
}
}
#Override
public void cancel() {
running = false;
}
}
Note that you'd need to worry about saving state for this to support at least once or exactly once generation of records.

Related

Pre-shuffle aggregation in Flink

We are migrating a spark job to flink. We have used pre-shuffle aggregation in spark. Is there a way to execute similar operation in spark. We are consuming data from apache kafka. We are using keyed tumbling window to aggregate the data. We want to aggregate the data in flink before performing shuffle.
https://databricks.gitbooks.io/databricks-spark-knowledge-base/content/best_practices/prefer_reducebykey_over_groupbykey.html
yes, it is possible and I will describe three ways. First the already built-in for Flink Table API. The second way you have to build your own pre-aggregate operator. The third is a dynamic pre-aggregate operator which adjusts the number of events to pre-aggregate before the shuffle phase.
Flink Table API
As it is shown here you can do MiniBatch Aggregation or Local-Global Aggregation. The second option is better. You basically tell to Flink to create mini-batches of every 5000 events and pre-aggregate them before the shuffle phase.
// instantiate table environment
TableEnvironment tEnv = ...
// access flink configuration
Configuration configuration = tEnv.getConfig().getConfiguration();
// set low-level key-value options
configuration.setString("table.exec.mini-batch.enabled", "true");
configuration.setString("table.exec.mini-batch.allow-latency", "5 s");
configuration.setString("table.exec.mini-batch.size", "5000");
configuration.setString("table.optimizer.agg-phase-strategy", "TWO_PHASE");
Flink Stream API
This way is more cumbersome because you have to create your own operator using OneInputStreamOperator and call it using the doTransform(). Here is the example of the BundleOperator.
public abstract class AbstractMapStreamBundleOperator<K, V, IN, OUT>
extends AbstractUdfStreamOperator<OUT, MapBundleFunction<K, V, IN, OUT>>
implements OneInputStreamOperator<IN, OUT>, BundleTriggerCallback {
#Override
public void processElement(StreamRecord<IN> element) throws Exception {
// get the key and value for the map bundle
final IN input = element.getValue();
final K bundleKey = getKey(input);
final V bundleValue = this.bundle.get(bundleKey);
// get a new value after adding this element to bundle
final V newBundleValue = userFunction.addInput(bundleValue, input);
// update to map bundle
bundle.put(bundleKey, newBundleValue);
numOfElements++;
bundleTrigger.onElement(input);
}
#Override
public void finishBundle() throws Exception {
if (!bundle.isEmpty()) {
numOfElements = 0;
userFunction.finishBundle(bundle, collector);
bundle.clear();
}
bundleTrigger.reset();
}
}
The call-back interface defines when you are going to trigger the pre-aggregate. Every time that the stream reaches the bundle limit at if (count >= maxCount) your pre-aggregate operator will emit events to the shuffle phase.
public class CountBundleTrigger<T> implements BundleTrigger<T> {
private final long maxCount;
private transient BundleTriggerCallback callback;
private transient long count = 0;
public CountBundleTrigger(long maxCount) {
Preconditions.checkArgument(maxCount > 0, "maxCount must be greater than 0");
this.maxCount = maxCount;
}
#Override
public void registerCallback(BundleTriggerCallback callback) {
this.callback = Preconditions.checkNotNull(callback, "callback is null");
}
#Override
public void onElement(T element) throws Exception {
count++;
if (count >= maxCount) {
callback.finishBundle();
reset();
}
}
#Override
public void reset() {
count = 0;
}
}
Then you call your operator using the doTransform:
myStream.map(....)
.doTransform(metricCombiner, info, new RichMapStreamBundleOperator<>(myMapBundleFunction, bundleTrigger, keyBundleSelector))
.map(...)
.keyBy(...)
.window(TumblingProcessingTimeWindows.of(Time.seconds(20)))
A dynamic pre-aggregation
In case you wish to have a dynamic pre-aggregate operator check the AdCom - Adaptive Combiner for stream aggregation. It basically adjusts the pre-aggregation based on backpressure signals. It results in using the maximum possible of the shuffle phase.

Flink - how to aggregate in state

I have a keyd stream of data that looks like:
{
summary:Integer
uid:String
key:String
.....
}
I need to aggregate the summary values in some time range, and once I achieved a specifc number , to flush the summary and all the of the UID'S that influenced the summary to database/log file.
after the first flush , I want to discare all the uid's from the memory , and just flush every new item immediatelly.
So I tried this aggregate function.
public class AggFunc implements AggregateFunction<Item, Acc, Tuple2<Integer,List<String>>>{
private static final long serialVersionUID = 1L;
#Override
public Acc createAccumulator() {
return new Acc());
}
#Override
public Acc add(Item value, Acc accumulator) {
accumulator.inc(value.getSummary());
accumulator.addUid(value.getUid);
return accumulator;
}
#Override
public Tuple2<Integer,List<String>> getResult(Acc accumulator) {
List<String> newL = Lists.newArrayList(accumulator.getUids());
accumulator.setUids(Lists.newArrayList());
return Tuple2.of(accumulator.getSum(), newL);
}
#Override
public Acc merge(Acc a, Acc b) {
.....
}
}
and in the aggregate process function , I flush the list to state, and if I need to save to dataBase I'm clearing the state and save flag in the state to indicate it.
But it seems crooked to me. And I'm not sure if that would work well for me.
Is there a better solution to this situation?
Work with a state inside a rich function. Keep adding the uid in your state and when the window triggers to flush the values. This page from the official documentation has an example.
https://ci.apache.org/projects/flink/flink-docs-release-1.11/dev/stream/state/state.html#using-keyed-state
For your case a ListState will work well.
EDIT:
The solution above is for non-window case. for window case simply use the aggrgation with apply function that can have a rich window function

APACHE FLINK AggregateFunction with tumblingWindow to count events but also send 0 if no event occurred

I need to count events within a tumbling window. But I also want to send events with 0 Value if there were no events within the window.
Something like.
windowCount: 5
windowCount: 0
windowCount: 0
windowCount: 3
windowCount: 0
...
import com.google.protobuf.Message;
import org.apache.flink.api.common.functions.AggregateFunction;
import org.skydivin4ng3l.cepmodemon.models.events.aggregate.AggregateOuterClass;
public class BasicCounter<T extends Message> implements AggregateFunction<T, Long, AggregateOuterClass.Aggregate> {
#Override
public Long createAccumulator() {
return 0L;
}
#Override
public Long add(T event, Long accumulator) {
return accumulator + 1L;
}
#Override
public AggregateOuterClass.Aggregate getResult(Long accumulator) {
return AggregateOuterClass.Aggregate.newBuilder().setVolume(accumulator).build();
}
#Override
public Long merge(Long accumulator1, Long accumulator2) {
return accumulator1 + accumulator2;
}
}
and used here
DataStream<AggregateOuterClass.Aggregate> aggregatedStream = someEntryStream
.windowAll(TumblingEventTimeWindows.of(Time.seconds(5)))
.aggregate(new BasicCounter<MonitorOuterClass.Monitor>());
TimeCharacteristics are ingestionTime
I read about a TiggerFunction which might detect if the aggregated Stream has received an event after x time but i am not sure if that is the right way to do it.
I expected the aggregation to happen even is there would be no events at all within the window. Maybe there is a setting i am not aware of?
Thx for any hints.
I chose Option 1 as suggested by #David-Anderson:
Here is my Event Generator:
public class EmptyEventSource implements SourceFunction<MonitorOuterClass.Monitor> {
private volatile boolean isRunning = true;
private final long delayPerRecordMillis;
public EmptyEventSource(long delayPerRecordMillis){
this.delayPerRecordMillis = delayPerRecordMillis;
}
#Override
public void run(SourceContext<MonitorOuterClass.Monitor> sourceContext) throws Exception {
while (isRunning) {
sourceContext.collect(MonitorOuterClass.Monitor.newBuilder().build());
if (delayPerRecordMillis > 0) {
Thread.sleep(delayPerRecordMillis);
}
}
}
#Override
public void cancel() {
isRunning = false;
}
}
and my adjusted AggregateFunction:
public class BasicCounter<T extends Message> implements AggregateFunction<T, Long, AggregateOuterClass.Aggregate> {
#Override
public Long createAccumulator() {
return 0L;
}
#Override
public Long add(T event, Long accumulator) {
if(((MonitorOuterClass.Monitor)event).equals(MonitorOuterClass.Monitor.newBuilder().build())) {
return accumulator;
}
return accumulator + 1L;
}
#Override
public AggregateOuterClass.Aggregate getResult(Long accumulator) {
AggregateOuterClass.Aggregate newAggregate = AggregateOuterClass.Aggregate.newBuilder().setVolume(accumulator).build();
return newAggregate;
}
#Override
public Long merge(Long accumulator1, Long accumulator2) {
return accumulator1 + accumulator2;
}
}
Used them Like this:
DataStream<MonitorOuterClass.Monitor> someEntryStream = env.addSource(currentConsumer);
DataStream<MonitorOuterClass.Monitor> triggerStream = env.addSource(new EmptyEventSource(delayPerRecordMillis));
DataStream<AggregateOuterClass.Aggregate> aggregatedStream = someEntryStream
.union(triggerStream)
.windowAll(TumblingProcessingTimeWindows.of(Time.seconds(5)))
.aggregate(new BasicCounter<MonitorOuterClass.Monitor>());
Flink's windows are created lazily, when the first event is assigned to a window. Thus empty windows do not exist, and can't produce results.
In general there are three ways to workaround this issue:
Put something in front of the window that adds events to the stream, ensuring that every window has something in it, and then modify your window processing to ignore these special events when computing their results.
Use a GlobalWindow along with a custom Trigger that uses processing time timers to trigger the window (with no events flowing, the watermark won't advance, and event time timers won't fire until more events arrive).
Don't use the window API, and implement your own windowing with a ProcessFunction instead. But here you'll still face the issue of needing to use processing time timers.
Update:
Having now made an effort to implement an example of option 2, I cannot recommend it. The issue is that even with a custom Trigger, the ProcessAllWindowFunction will not be called if the window is empty, so it is necessary to always keep at least one element in the GlobalWindow. This appears then to require implementing a rather hacky Evictor and ProcessAllWindowFunction that collaborate to retain and ignore a special element in the window -- and you also have to somehow get that element into the window in the first place.
If you're going to do something hacky, option 1 appears to be much simpler.

Flink streaming example that generates its own data

Earlier I asked about a simple hello world example for Flink. This gave me some good examples!
However I would like to ask for a more ‘streaming’ example where we generate an input value every second. This would ideally be random, but even just the same value each time would be fine.
The objective is to get a stream that ‘moves’ with no/minimal external touch.
Hence my question:
How to show Flink actually streaming data without external dependencies?
I found how to show this with generating data externally and writing to Kafka, or listening to a public source, however I am trying to solve it with minimal dependence (like starting with GenerateFlowFile in Nifi).
Here's an example. This was constructed as an example of how to make your sources and sinks pluggable. The idea being that in development you might use a random source and print the results, for tests you might use a hardwired list of input events and collect the results in a list, and in production you'd use the real sources and sinks.
Here's the job:
/*
* Example showing how to make sources and sinks pluggable in your application code so
* you can inject special test sources and test sinks in your tests.
*/
public class TestableStreamingJob {
private SourceFunction<Long> source;
private SinkFunction<Long> sink;
public TestableStreamingJob(SourceFunction<Long> source, SinkFunction<Long> sink) {
this.source = source;
this.sink = sink;
}
public void execute() throws Exception {
StreamExecutionEnvironment env = StreamExecutionEnvironment.getExecutionEnvironment();
DataStream<Long> LongStream =
env.addSource(source)
.returns(TypeInformation.of(Long.class));
LongStream
.map(new IncrementMapFunction())
.addSink(sink);
env.execute();
}
public static void main(String[] args) throws Exception {
TestableStreamingJob job = new TestableStreamingJob(new RandomLongSource(), new PrintSinkFunction<>());
job.execute();
}
// While it's tempting for something this simple, avoid using anonymous classes or lambdas
// for any business logic you might want to unit test.
public class IncrementMapFunction implements MapFunction<Long, Long> {
#Override
public Long map(Long record) throws Exception {
return record + 1 ;
}
}
}
Here's the RandomLongSource:
public class RandomLongSource extends RichParallelSourceFunction<Long> {
private volatile boolean cancelled = false;
private Random random;
#Override
public void open(Configuration parameters) throws Exception {
super.open(parameters);
random = new Random();
}
#Override
public void run(SourceContext<Long> ctx) throws Exception {
while (!cancelled) {
Long nextLong = random.nextLong();
synchronized (ctx.getCheckpointLock()) {
ctx.collect(nextLong);
}
}
}
#Override
public void cancel() {
cancelled = true;
}
}

How to properly canalize multithreaded message flow in a single threaded service?

In a WPF application, I have a 3rd party library that is publishing messages.
The messages are like :
public class DialectMessage
{
public string PathAndQuery { get; private set; }
public byte[] Body { get; private set; }
public DialectMessage(string pathAndQuery, byte[] body)
{
this.PathAndQuery = pathAndQuery;
this.Body = body;
}
}
And I setup the external message source from my app.cs file :
public partial class App : Application
{
static App()
{
MyComponent.MessageReceived += MessageReceived;
MyComponent.Start();
}
private static void MessageReceived(Message message)
{
//handle message
}
}
These messages can be publishing from multiple thread at a time, making possible to call the event handler multiple times at once.
I have a service object that have to parse the incoming messages. This service implements the following interface :
internal interface IDialectService
{
void Parse(Message message);
}
And I have a default static instance in my app.cs file :
private readonly static IDialectService g_DialectService = new DialectService();
In order to simplify the code of the parser, I would like to ensure only one message at a time is parsed.
I also want to avoid locking in my event handler, as I don't want to block the 3rd party object.
Because of this requirements, I cannot directly call g_DialectService.Parse from my message event handler
What is the correct way to ensure this single threaded execution?
My first though is to wrap my parsing operations in a Produce/Consumer pattern. In order to reach this goal, I've try the following :
Declare a BlockingCollection in my app.cs :
private readonly static BlockingCollection<Message> g_ParseOperations = new BlockingCollection<Message>();
Change the body of my event handler to add an operation :
private static void MessageReceived(Message message)
{
g_ParseOperations.Add(message);
}
Create a new thread that pump the collection from my app constructor :
static App()
{
MyComponent.MessageReceived += MessageReceived;
MyComponent.Start();
Task.Factory.StartNew(() =>
{
Message message;
while (g_ParseOperations.TryTake(out message))
{
g_DialectService.Parse(message);
}
});
}
However, this code does not seems to work. The service Parse method is never called.
Moreover, I'm not sure if this pattern will allow me to properly shutdown the application.
What have I to change in my code to ensure everything is working?
PS: I'm targeting .Net 4.5
[Edit] After some search, and the answer of ken2k, i can see that I was wrongly calling trytake in place of take.
My updated code is now :
private readonly static CancellationTokenSource g_ShutdownToken = new CancellationTokenSource();
private static void MessageReceived(Message message)
{
g_ParseOperations.Add(message, g_ShutdownToken.Token);
}
static App()
{
MyComponent.MessageReceived += MessageReceived;
MyComponent.Start();
Task.Factory.StartNew(() =>
{
while (!g_ShutdownToken.IsCancellationRequested)
{
var message = g_ParseOperations.Take(g_ShutdownToken.Token);
g_DialectService.Parse(message);
}
});
}
protected override void OnExit(ExitEventArgs e)
{
g_ShutdownToken.Cancel();
base.OnExit(e);
}
This code acts as expected. Messages are processed in the correct order. However, as soon I exit the application, I get a "CancelledException" on the Take method, even if I just test the IsCancellationRequested right before.
The documentation says about BlockingCollection.TryTake(out T item):
If the collection is empty, this method immediately returns false.
So basically your loop exits immediately. What you may want is to call the TryTake method with a timeout parameter instead, and exit your loop when a mustStop variable becomes true:
bool mustStop = false; // Must be set to true on somewhere else when you exit your program
...
while (!mustStop)
{
Message yourMessage;
// Waits 500ms if there's nothing in the collection. Avoid to consume 100% CPU
// for nothing in the while loop when the collection is empty.
if (yourCollection.TryTake(out yourMessage, 500))
{
// Parses yourMessage here
}
}
For your edited question: if you mean you received a OperationCanceledException, that's OK, it's exactly how methods that take a CancellationToken object as parameter must behave :) Just catch the exception and exit gracefully.

Resources