Flink Streaming: Data stream that gets controlled by control stream

I have a question which is a variation of this question: Flink: how to store state and use in another stream?
I have two streams:
val ipStream: DataStream[IPAddress] = ???
val routeStream: DataStream[RoutingTable] = ???
I want to find out which route each packet uses. Usually this can be done with:
val ip = IPAddress("10.10.10.10")
val table = RoutingTable(Seq("10.10.10.0/24", "5.5.5.0/24"))
val route = table.lookup(ip) // == "10.10.10.0/24"
The problem here is that I cannot really key the stream, since that requires both the complete table and the IP address (and keys must be computed in isolation).
For every element from the ipStream, I need the latest routeStream element. Right now I'm using a hack where everything is processed without parallelism:
ipStream
.connect(routeStream)
.keyBy(_ => 0, _ => 0)
.flatMap(new MyRichCoFlatMapFunction) // with ValueState[RoutingTable]
This sounds like the use case for a broadcast strategy. However, the routeStream will be updated and is not fixed in a file. The question remains: Is there a way to have two streams, one of which contains changing control data for the other stream?

Since I solved the issue, I might as well write an answer here :)
I keyed the two streams like this:
The RoutingTable stream was keyed using the first byte of the network route
The IPAddress was also keyed by the first byte of the address
This works under the condition that IP packets are generally routed within the same /8 prefix, which can be assumed for most traffic.
Then, with a stateful RichCoFlatMap, one can build up the routing table state per key. When receiving a new IP packet, do a lookup in the routing table. There are two possible scenarios:
No matching route has been found. We could store the packet for later here, but discarding it works as well.
If a route has been found, output the tuple of [IPAddress, RoutingTableEntry].
This way, we have two streams where one of them has changing control data for the other stream.
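To illustrate, here is a rough, untested Java sketch of that keyed CoFlatMap. It assumes each RoutingTable element carries the routes for one /8 prefix, and the firstByte() and lookup() helpers are hypothetical stand-ins for however you extract the prefix and query the table (the matched route is represented as a String, as in the lookup example above):

ipStream
    .connect(routeStream)
    .keyBy(ip -> ip.firstByte(), table -> table.firstByte())
    .flatMap(new RichCoFlatMapFunction<IPAddress, RoutingTable, Tuple2<IPAddress, String>>() {
        private transient ValueState<RoutingTable> tableState;

        @Override
        public void open(Configuration parameters) {
            tableState = getRuntimeContext().getState(
                new ValueStateDescriptor<>("routing-table", RoutingTable.class));
        }

        @Override
        public void flatMap1(IPAddress ip, Collector<Tuple2<IPAddress, String>> out) throws Exception {
            RoutingTable table = tableState.value();
            if (table == null) {
                return; // no routing data for this /8 yet; drop (or buffer) the packet
            }
            String route = table.lookup(ip);
            if (route != null) {
                out.collect(Tuple2.of(ip, route)); // matching route found
            }
        }

        @Override
        public void flatMap2(RoutingTable table, Collector<Tuple2<IPAddress, String>> out) throws Exception {
            tableState.update(table); // keep only the latest control data per key
        }
    });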

Related

Flink - processing consecutive events within time constraint

I have a use case and I think I need some help on how to approach it.
Because I am new to streaming and Flink, I will try to be very descriptive in what I am trying to achieve. Sorry if I am not using formal and correct language.
My code will be in Java, but I am happy to get code in Python, pseudocode, or just a description of the approach.
TL;DR
Group events of the same key that are within some time limit.
Out of those events, create a result event only from the two closest (in the time domain) events.
This requires (I think) opening a window for each and every event that arrives.
If you look ahead at the batch solution below, it will best illustrate my problem.
Background:
I have data coming from sensors as a stream from Kafka.
I need to use event time because the data arrives out of order. A lateness of about 1 minute will cover 90% of the events.
I am grouping those events by some key.
What I want to do:
Depending on some of an event's fields, I would like to "join/mix" two events into a new event ("result event").
The first condition is that those consecutive events are WITHIN 30 seconds of each other.
The next conditions simply check some field values and then decide.
My pseudo-solution:
Open a new window for EACH event. That window should be 1 minute long.
For every event that arrives within that minute, I want to check its event time and see if it is within 30 seconds of the initial window event. If yes, check the other conditions and emit a new result event.
The problem - when a new event arrives, it needs to:
create a new window for itself;
join only ONE window out of SEVERAL possible windows that are within 30 seconds of it.
The question:
Is that possible?
In other words my connection is between two "consecutive" events only.
Thank you very much.
Maybe showing the solution for the BATCH case will best show what I am trying to do:
for i in range(len(grouped_events) - 1):
    event_A = grouped_events[i]
    event_B = grouped_events[i + 1]
    if event_B.get("time") - event_A.get("time") < 30:
        if event_B.get("color") == event_A.get("color"):
            if event_B.get("size") > event_A.get("size"):
                create_result_event(event_A, event_B)
My (naive) attempts so far with Flink in Java
The sum function is just a placeholder for my function that creates a new result object...
The first solution just uses a simple time window and sums by some field.
The second tries to apply a process function to the window, where maybe I can iterate through all events and check my conditions?
DataStream
    .keyBy(threeEvent -> threeEvent.getUserId())
    .window(TumblingEventTimeWindows.of(Time.seconds(60)))
    .sum("size")
    .print();

DataStream
    .keyBy(threeEvent -> threeEvent.getUserId())
    .window(TumblingEventTimeWindows.of(Time.seconds(5)))
    .process(new processFunction());
public static class processFunction extends ProcessWindowFunction<ThreeEvent, Tuple3<Long, Long, Float>, Long, TimeWindow> {
    @Override
    public void process(Long key, Context context, Iterable<ThreeEvent> threeEvents, Collector<Tuple3<Long, Long, Float>> out) throws Exception {
        Float sumOfSize = 0F;
        for (ThreeEvent f : threeEvents) {
            sumOfSize += f.getSize();
        }
        out.collect(new Tuple3<>(context.window().getEnd(), key, sumOfSize));
    }
}
You can, of course, use windows to create mini-batches that you sort and analyze, but it will be difficult to handle the window boundaries correctly (what if the events that should be paired land in different windows?).
This looks like it would be much more easily done with a keyed stream and a stateful flatmap. Just use a RichFlatMapFunction and use one piece of keyed state (a ValueState) that remembers the previous event for each key. Then as each event is processed, compare it to the saved event, produce a result if that should happen, and update the state.
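For illustration, a minimal, untested sketch of that stateful flatmap, assuming a ThreeEvent class with the getUserId(), getColor(), getSize(), and getTime() accessors that appear in the question's snippets and pseudocode (adjust the names and the 30-unit time check to your actual event class):

public static class PairingFunction extends RichFlatMapFunction<ThreeEvent, Tuple2<ThreeEvent, ThreeEvent>> {
    private transient ValueState<ThreeEvent> previous; // last event seen for this key

    @Override
    public void open(Configuration parameters) {
        previous = getRuntimeContext().getState(
            new ValueStateDescriptor<>("previous-event", ThreeEvent.class));
    }

    @Override
    public void flatMap(ThreeEvent current, Collector<Tuple2<ThreeEvent, ThreeEvent>> out) throws Exception {
        ThreeEvent prev = previous.value();
        if (prev != null
                && current.getTime() - prev.getTime() < 30   // same time units as the batch pseudocode
                && current.getColor().equals(prev.getColor())
                && current.getSize() > prev.getSize()) {
            out.collect(Tuple2.of(prev, current)); // emit the "result event" pair
        }
        previous.update(current); // remember this event for the next one
    }
}

// usage: stream.keyBy(event -> event.getUserId()).flatMap(new PairingFunction())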
You can read about working with Flink's keyed state in the Flink training and in the Flink documentation.
The one thing that concerns me about your use case is whether or not your events may arrive out-of-order. Is it the case that to get correct results you would need to first sort the events by timestamp? That isn't trivial. If this is a concern, then I would suggest that you use Flink SQL with MATCH_RECOGNIZE, or the CEP library, both of which are designed for doing pattern recognition on event streams, and will take care of sorting the stream for you (you just have to provide timestamps and watermarks).
This query may not be exactly right, but hopefully conveys the flavor of how to do something like this with match recognize:
SELECT * FROM Events
MATCH_RECOGNIZE (
  PARTITION BY userId
  ORDER BY eventTime
  MEASURES
    A.userId AS userId,
    A.color AS color,
    A.size AS aSize,
    B.size AS bSize
  AFTER MATCH SKIP PAST LAST ROW
  PATTERN (A B)
  DEFINE
    A AS true,
    B AS timestampDiff(SECOND, A.eventTime, B.eventTime) < 30
      AND A.color = B.color
      AND A.size < B.size
);
This can also be done quite naturally with CEP, where the basis for comparing consecutive events is to use an iterative condition, and you can use a within clause to handle the time constraint.
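For example, here is a rough, untested CEP sketch of that pattern, using the same hypothetical ThreeEvent accessors as in the flatmap sketch above:

Pattern<ThreeEvent, ?> pattern = Pattern.<ThreeEvent>begin("A")
    .next("B")
    .where(new IterativeCondition<ThreeEvent>() {
        @Override
        public boolean filter(ThreeEvent b, Context<ThreeEvent> ctx) throws Exception {
            // compare the candidate B against the already-matched A event
            ThreeEvent a = ctx.getEventsForPattern("A").iterator().next();
            return b.getColor().equals(a.getColor()) && b.getSize() > a.getSize();
        }
    })
    .within(Time.seconds(30)); // the 30-second constraint between consecutive events

PatternStream<ThreeEvent> matches = CEP.pattern(events.keyBy(e -> e.getUserId()), pattern);
// then use matches.select(...) to build the result event from the matched A and B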

FFmpeg filter for selecting video/audio streams

I am trying to create a node (a collection of nodes is fine too) that takes in many streams and an index, and outputs the one stream specified by the index. Basically, I want to create a mux node, something like:
Node : Stream ... Number -> Stream
FFmpeg's filter graph API seems to have two filters for doing that: streamselect (for video) and astreamselect (for audio). And for the most part, they seem to do what I want:
[in0][in1][in2]streamselect=inputs=3:map=1[out]
This filter will take in three video streams and output the second one, in1.
You can use a similar filter for audio streams:
[in0][in1]astreamselect=inputs=2:map=0[out]
Which will take in two audio streams and output the first one, in0.
The question is, can I create a filter that takes in a list of both audio and video streams and outputs the stream based only on the stream index? So something like:
[v0][v1][a0][a1][a2]avstreamselect=inputs=5:map=3[out]
Which maps a1 to out?
If it helps I am using the libavfilter C API rather than the command line interface.
While it may not be possible with a single filter [1], it is certainly possible by combining multiple filters: one select filter for either audio or video (depending on which one you are selecting), and a bunch of nullsink or anullsink filters for the rest of the streams.
For example, the would-be filter:
[v0][v1][a0][a1]avstreamselect=inputs=4:map=2[out]
which takes in two video streams and two audio streams, and returns the third stream (the first audio stream), can be written as:
[a0][a1]astreamselect=inputs=2:map=0[out];
[v0]nullsink;[v1]nullsink
Here, we run the select on the audio streams, and all of the remaining streams are mapped to sinks. This idea could potentially be generalized to use only nullsink, anullsink, copy, and acopy. For example, we could have also written it with four nodes:
[a0]acopy[out];
[a1]anullsink;
[v0]nullsink;
[v1]nullsink
[1] I still don't know whether it is or not. Feel free to remove this note if it actually is possible.

Flink trigger on a custom window

I'm trying to evaluate Apache Flink for the use case we're currently running in production using custom code.
So let's say there's a stream of events, each containing a specific attribute X which is a continuously increasing integer. That is, a bunch of contiguous events have this attribute set to N, then the next batch has it set to N+1, and so on.
I want to break the stream into windows of events with the same value of X and then do some computations on each separately.
So I define a GlobalWindow and a custom Trigger where, in the onElement method, I check the attribute of the given element against the saved value of the current X (from a state variable); if they differ, I conclude that we've accumulated all the events with X=CURRENT, so it's time to do the computation and increase the X value in the state.
The problem with this approach is that the element from the next logical batch (with X=CURRENT+1) has already been consumed, but it's not part of the previous batch.
Is there a way to put it back into the stream somehow so that it is properly accounted for in the next batch?
Or maybe my approach is entirely wrong and there's an easier way to achieve what I need?
Thank you.
I think you are on the right track.
A Trigger specifies when a window can be processed and when the results for a window can be emitted.
The WindowAssigner is the part that determines which window an element will be assigned to. So I would say you also need to provide a custom implementation of WindowAssigner that assigns the same window to all elements with an equal value of X.
A more idiomatic way to do this with Flink would be to use stream.keyBy(X).window(...). The keyBy(X) takes care of grouping elements by their particular value for X. You then apply any sort of window you like. In your case a SessionWindow may be a good choice. It will fire for each key after that key hasn't been seen for some configurable period of time.
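A minimal sketch of that, assuming a hypothetical getX() accessor on your events, hypothetical Event/Result types and computeResult() logic, and a session gap of, say, 10 seconds (use ProcessingTimeSessionWindows instead if you are not working with event time):

events
    .keyBy(e -> e.getX())
    .window(EventTimeSessionWindows.withGap(Time.seconds(10)))
    .process(new ProcessWindowFunction<Event, Result, Integer, TimeWindow>() {
        @Override
        public void process(Integer x, Context ctx, Iterable<Event> batch, Collector<Result> out) {
            // all events with the same value of X end up here once the session closes
            out.collect(computeResult(x, batch)); // computeResult stands in for your per-batch computation
        }
    });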
This approach will be much more robust with regard to out-of-order data, which you must always assume in a stream processing system.

LabVIEW: How to exchange lots of variables between loops?

I have two loops:
One loop gets data from a device and processes it: it scales the received variables and calculates extra data.
Second loop visualizes the data and stores it.
There are lots of different variables that need to be passed between those two loops - about 50 of them. I need the second loop to have access only to the newest values of the data. It needs to be able to read those variables any time they are needed for visualization.
What is the best way to share such a vector of values between the two loops?
There are various ways of sharing data.
The fastest and simplest is a local variable; however, that is rather uncontrolled, and you need to make sure you write them in only one place (plus you need an indicator).
One of the most advanced options is creating a class for your data, and use an instance (if you create a by-ref class, otherwise it won't matter), and create a public 'GET' method.
In between you have several other options:
queues
semaphores
property nodes
global variables
shared variables
notifiers
events
TCP-IP
In short, there is no best way; it all depends on your skills and the application.
As long as you're considering loops within the SAME application, there ARE good and bad ideas, though:
queues (OK, has most features)
notifiers (OK)
events (OK)
FGVs (OK, but keep an eye on massively parallel access hindering exec)
semaphores (that's not data comms)
property nodes (very inefficient, prone to race cond.)
global variables (prone to race cond.)
shared variables (badly implemented by NI, prone to race cond.)
TCP-IP (slow, awkward, affected by firewall config)
The quick and dirty way to do this is to write each value to an indicator in the producer loop - these indicators can be hidden offscreen, or in a page of a tab control, if you don't want to see them - and read a local variable of each one in the consumer loop. However if you have 50 different values it may become hard to maintain this code if you need to change or extend it.
As Ton says there are many different options but my suggestion would be:
Create a cluster control, with named elements, containing all your data
Save this cluster as a typedef
Create a notifier using this cluster as the data type
Bundle the data into the cluster (by name) and write this to the notifier in the producer loop
Read the cluster from the notifier in the consumer loop, unbundle it by name and do what you want with each element.
Using a cluster means you can easily pass it to different subVIs to process different elements if you like, and saving as a typedef means you can add, rename or alter the elements and your code will update to match. In your consumer loop you can use the timeout setting of the notifier read to control the loop timing, if you want. You can also use the notifier to tell the loops when to exit, by force-destroying it and trapping the error.
Two ways:
Use a display loop with a SEQ (Single Element Queue).
Use an event structure with a User Event. (Do not put two event structures in the same loop!! Use another one.)
Use an enum with a case structure and a variant to cast the data to the expected type.
(A notifier isn't reliable for streaming data, because it is a lossy scheme. Leave it only for triggering small actions.)
If all of your variables can be bundled together in a single cluster to send at once, then you should use a single element queue. If your requirements change later such that the transmission cannot be lossy, then it's a matter of changing the input to the Obtain Queue VI (with a notifier you'd have to swap out all of the VIs). Setting up individual indicators and local variables would be pretty darn tedious. Also, not good style.
If the loops are inside of the same VI then:
The simplest solution would be local variables.
A little bit better is to use shared variables.
Better still is to use functional global variables (FGVs).
The best solution would be using a SEQ (Single Element Queue).
Anyway, for a better understanding, please go through this paper.

Only one write() call sends data over socket connection

First stackoverflow question! I've searched...I promise. I haven't found any answers to my predicament. I have...a severely aggravating problem to say the least. To make a very long story short, I am developing the infrastructure for a game where mobile applications (an Android app and an iOS app) communicate with a server using sockets to send data to a database. The back end server script (which I call BES, or Back End Server), is several thousand lines of code long. Essentially, it has a main method that accepts incoming connections to a socket and forks them off, and a method that reads the input from the socket and determines what to do with it. Most of the code lies in the methods that send and receive data from the database and sends it back to the mobile apps. All of them work fine, except for the newest method I have added. This method grabs a large amount of data from the database, encodes it as a JSON object, and sends it back to the mobile app, which also decodes it from the JSON object and does what it needs to do. My problem is that this data is very large, and most of the time does not make it across the socket in one data write. Thus, I added one additional data write into the socket that informs the app of the size of the JSON object it is about to receive. However, after this write happens, the next write sends empty data to the mobile app.
The odd thing is, when I remove this first write that sends the size of the JSON object, the actual sending of the JSON object works fine. It's just very unreliable and I have to hope that it sends it all in one read. To add more oddity to the situation, when I make the size of the data that the second write sends a huge number, the iOS app will read it properly, but it will have the data in the middle of an otherwise empty array.
What in the world is going on? Any insight is greatly appreciated! Below is just a basic snippet of my two write commands on the server side.
Keep in mind that EVERYWHERE else in this script the reads and writes work fine, but this is the only place where I do two write operations back to back.
The server script runs on an Ubuntu server in native C using Berkeley sockets, and the iOS app uses a wrapper class called AsyncSocket.
int n;
//outputMessage contains a string that tells the mobile app how long the next message
//(returnData) will be
n = write(sock, outputMessage, sizeof(outputMessage));
if(n < 0)
//error handling is here
//returnData is a JSON encoded string (well, char[] to be exact, this is native-C)
n = write(sock, returnData, sizeof(returnData));
if(n < 0)
//error handling is here
The mobile app makes two read calls, and gets outputMessage just fine, but returnData is always just a bunch of empty data, unless I overwrite sizeof(returnData) to some hugely large number, in which case, the iOS will receive the data in the middle of an otherwise empty data object (NSData object, to be exact). It may also be important to note that the method I use on the iOS side in my AsyncSocket class reads data up to the length that it receives from the first write call. So if I tell it to read, say 10000 bytes, it will create an NSData object of that size and use it as the buffer when reading from the socket.
Any help is greatly, GREATLY appreciated. Thanks in advance everyone!
It's just very unreliable and I have to hope that it sends it all in one read.
The key to successful programming with TCP is that there is no concept of a TCP "packet" or "block" of data at the application level. The application only sees a stream of bytes, with no boundaries. When you call write() on the sending end with some data, the TCP layer may choose to slice and dice your data in any way it sees fit, including coalescing multiple blocks together.
You might write 10 bytes two times and read 5 then 15 bytes. Or maybe your receiver will see 20 bytes all at once. What you cannot do is just "hope" that some chunks of bytes you send will arrive at the other end in the same chunks.
What might be happening in your specific situation is that the two back-to-back writes are being coalesced into one, and your reading logic simply can't handle that.
Thanks for all of the feedback! I incorporated everyone's answers into the solution. I created a method that writes an iovec struct to the socket using writev instead of write. The wrapper class I'm using on the iOS side, AsyncSocket (which is fantastic, by the way... check it out here: AsyncSocket Google Code Repo), handles receiving an iovec just fine, apparently behind the scenes, as it does not require any additional effort on my part to read all of the data correctly. The AsyncSocket class now does not call my delegate method didReadData until it receives all of the data specified in the iovec struct.
Again, thank you all! This helped greatly. Literally overnight I got responses for an issue I've been up against for a week now. I look forward to becoming more involved in the stackoverflow community!
Sample code for solution:
//returnData is the JSON encoded string I am returning
//sock is my predefined socket descriptor
struct iovec iov[1];
int iovcnt = 0;
iov[0].iov_base = returnData;
iov[0].iov_len = strlen(returnData);
iovcnt = sizeof(iov) / sizeof(struct iovec);

n = writev(sock, iov, iovcnt);
if(n < 0)
//error handling here

while(n < iov[0].iov_len)
//rebuild the iovec struct with the remaining data from returnData
//(from byte offset n to the end of the string) and call writev again
You should really define a function write_complete that completely writes a buffer to a socket. Check the return value of write; it might also be a positive number that is smaller than the size of the buffer. In that case you need to write the remaining part of the buffer.
Oh, and using sizeof is error-prone, too. In the above write_complete function you should therefore print the given size and compare it to what you expect.
Ideally, on the server you want to write the header (the size) and the data atomically; I'd do that using the scatter/gather call writev(). Also, if there is any chance that multiple threads can write to the same socket concurrently, you may want to guard the write call with a mutex.
writev() will also write all the data before returning (if you are using blocking I/O).
On the client you may need a state machine that reads the length of the buffer and then sits in a loop reading until all the data has been received, as large buffers will be fragmented and arrive in blocks of various sizes.
