How to manipulate 3 DataStreams in one Flink job? - apache-flink

We have 3 Java POJOs:
class Foo {
    int id;
    String name;
    List<Bar1> list1;
    List<Bar2> list2;
}
class Bar1 {
    int id;
    String field_x;
    String field_y;
}
class Bar2 {
    int id;
    String field_a;
    String field_b;
}
And we have 3 DataStreams in our Flink job:
class Test {
    public static void main(...) {
        DataStream<Foo> ds1 = ...;
        DataStream<Bar1> ds2 = ...;
        DataStream<Bar2> ds3 = ...;
    }
}
For each id, there will be only one Foo object, while there can be multiple Bar1 and Bar2 objects.
What we want to do is: for each Foo in ds1, find all Bar1 with the same id in ds2 and put them into list1, and find all Bar2 with the same id in ds3 and put them into list2.
What is the best way to go?

Flink's DataStream operators support up to two input streams.
There are two common ways to implement operations on three streams:
with two binary operations. This is very simple in your case since Bar1 and Bar2 are not related to each other. This would look roughly as follows:
DataStream<Foo> withList1 = ds1
    .connect(ds2).keyBy("id", "id")
    .process(
        // your processing logic
        new CoProcessFunction<Foo, Bar1, Foo>(){...});

DataStream<Foo> withList1AndList2 = withList1
    .connect(ds3).keyBy("id", "id")
    .process(
        // your processing logic
        new CoProcessFunction<Foo, Bar2, Foo>(){...});
by unioning all three streams into a single stream with a common data type (for example, a POJO with three fields foo, bar1, and bar2, of which only one is set) and using an operator with a single input to process the unioned stream.
// map Foo to CommonType
DataStream<CommonType> common1 = ds1.map(new MapFunction<Foo, CommonType>(){...});
// map Bar1 to CommonType
DataStream<CommonType> common2 = ds2.map(new MapFunction<Bar1, CommonType>(){...});
// map Bar2 to CommonType
DataStream<CommonType> common3 = ds3.map(new MapFunction<Bar2, CommonType>(){...});
DataStream<Foo> withList1AndList2 = common1.union(common2, common3)
    .keyBy("id")
    .process(
        // your processing logic
        new KeyedProcessFunction<Tuple, CommonType, Foo>(){...});
You can also just union ds2 and ds3 (mapped to a common type) and use a single binary operator; the wiring is sketched below.
The bigger problem might be to identify when all Bar1 and Bar2 events have been received so that you can emit a result. Again, there are a few options (depending on your use case).
if Foo knows how many Bar1 and Bar2 events it needs to wait for, the solution is obvious: count the received events in state and emit once both counts are reached.
if Foo does not know how many events to wait for, you can try to send a notification event that signals that the last Bar1 or Bar2 was sent.
you can also work with a timeout if you know that all Bar1 or Bar2 events should arrive within x seconds/minutes/etc.; a sketch of this variant follows below.
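For the timeout variant, here is a rough sketch of the first of the two binary operations (joining Foo with its Bar1 events); the class name, the 30-second wait, and the state names are only placeholders:
import java.util.ArrayList;

import org.apache.flink.api.common.state.ListState;
import org.apache.flink.api.common.state.ListStateDescriptor;
import org.apache.flink.api.common.state.ValueState;
import org.apache.flink.api.common.state.ValueStateDescriptor;
import org.apache.flink.configuration.Configuration;
import org.apache.flink.streaming.api.functions.co.CoProcessFunction;
import org.apache.flink.util.Collector;

// illustrative only: buffers a Foo and its Bar1 events in keyed state and emits the
// enriched Foo 30 seconds (processing time) after the Foo arrived
public class Bar1Enricher extends CoProcessFunction<Foo, Bar1, Foo> {

    private transient ValueState<Foo> fooState;
    private transient ListState<Bar1> bar1State;

    @Override
    public void open(Configuration parameters) {
        fooState = getRuntimeContext().getState(new ValueStateDescriptor<>("foo", Foo.class));
        bar1State = getRuntimeContext().getListState(new ListStateDescriptor<>("bar1s", Bar1.class));
    }

    @Override
    public void processElement1(Foo foo, Context ctx, Collector<Foo> out) throws Exception {
        fooState.update(foo);
        // give the Bar1 events some time to arrive, then emit whatever has been collected
        ctx.timerService().registerProcessingTimeTimer(
                ctx.timerService().currentProcessingTime() + 30_000L);
    }

    @Override
    public void processElement2(Bar1 bar, Context ctx, Collector<Foo> out) throws Exception {
        bar1State.add(bar);
    }

    @Override
    public void onTimer(long timestamp, OnTimerContext ctx, Collector<Foo> out) throws Exception {
        Foo foo = fooState.value();
        if (foo != null) {
            foo.list1 = new ArrayList<>();
            for (Bar1 bar : bar1State.get()) {
                foo.list1.add(bar);
            }
            out.collect(foo);
        }
        fooState.clear();
        bar1State.clear();
    }
}
It would be used as ds1.connect(ds2).keyBy("id", "id").process(new Bar1Enricher()), and an analogous function would fill list2 from ds3.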

Related

Unexported function aliases in module

Decades of research & software have led to many synonyms for functions that users could expect to "just work" in my module. There are roughly 80 aliases I want to support, yet not export. For example, imagine :bar, :qux, "bar", "qux" to be aliases for Base.sqrt as an input to struct Bar.
Question: What is the recommended way to create unexported aliases for functions in a module?
I have read the module documentation and this SO question, and searched codebases on GitHub. Below are my attempts; Bar2 is faster, but goes against "avoid globals in the namespace".
# approach 1: add to namespace, but don't export
module Foo1
struct Bar1 fun::Function end
Bar1(fun::Union{Symbol, String}) = Bar1(getfield(@__MODULE__, Symbol(fun)))
const baz = √; const qux = √ # ...
export Bar1
end
# approach 2: package internal dictionary, not in namespace
module Foo2
struct Bar2 fun::Function end
Bar2(fun::Union{Symbol, String}) = Bar2(alias[Symbol(fun)])
const alias = Base.ImmutableDict(:baz => √, :qux => √) #...
export Bar2
end
using .Foo1, .Foo2
Bar1(:baz).fun == Bar1("qux").fun == Bar2(:baz).fun == Bar2("qux").fun # true

Swift Group Sequential Objects in Array

I have the following array of objects:
[Car, Train, Train, Motorbike, Train, Car, Car]
I would like to group any object of the same type that follow one another resulting into the following:
[Car, [Train, Train], Motorbike, Train, [Car, Car]]
or even better:
[Car, GroupOfVehicles(type: .train, vehicles: [Train, Train]), Motorbike, Train, GroupOfVehicles(type: .car, vehicles: [Car, Car])]
I have found some SO posts about grouping by property but I need to keep the sequential nature of the array intact.
For those interested in my objective: I would like to present objects in a table one below the other, but if there is a group of similar sequential objects, they should show all on the same 'row' and thus the user will scroll through them horizontally.
Below is a solution based on an array of Any, but if the types all conform to a common protocol or have a common superclass then this can hopefully be solved in a more effective way.
var grouped: [[Any]] = []
var previous: String = ""
motors.forEach {
    let vehicleType = "\(type(of: $0))"
    if vehicleType != previous {
        grouped.append([$0])
        previous = vehicleType
    } else {
        grouped[grouped.count - 1].append($0)
    }
}
You can use the chunked(by:) function from the open-source Swift package Swift Algorithms:
import Algorithms
protocol X {} // Assuming that you have protocol or superclass
struct A: X {}; struct B: X {}; struct C: X {}
let arr: [X] = [A(), A(), B(), C(), C(), C(), B(), A()]
let chunks = arr.chunked { type(of: $0) == type(of: $1) }
// You should have something like this [ArraySlice([A(), A()]), ArraySlice([B()]), ArraySlice([C(), C(), C()]), ArraySlice([B()]), ArraySlice([A()])]
And you can transform this collection further as you need.

Fetching a struct and other data in one query

I have the following table schema:
user
-----
id uuid
name string
user_model
------
id uuid
user_id uuid
model_id uuid
role int
model
-----
id uuid
name string
model_no string
I have the following code which fetches the data from the "model" table.
underlyingModel := &model{}
var model IModel
model = underlyingModel
role := 0
db.Table("model").
    Joins("INNER JOIN user_model ON user.id = user_model.uuid").
    Joins("INNER JOIN model ON user.id = model_id").
    Find(&model)
In my actual code, the model can be one of many different struct types with different fields; they're all behind the IModel interface.
What I want to do is fetch that extra role field from user_model in the same query. Something like .Find(&model, &role).
Is it possible using Gorm?
One possible solution is to create an anonymous struct to put the results in, with a combination of the Select() method.
var selectModel struct {
    ID      string // assuming the uuid maps to a string
    Name    string
    ModelNo string
    Role    int
}

db.Table("model").
    Joins("INNER JOIN user_model ON user.id = user_model.uuid").
    Joins("INNER JOIN model ON user.id = model_id").
    Select("model.id, model.name, model.model_no, user_model.role").
    Find(&selectModel)
Basically, you create an anonymous struct in the selectModel variable, containing all the fields you want returned. Then you need the Select statement because you need some fields that are not part of the model table.
Here you can find more info on Smart Select Fields in Gorm.
EDIT:
Based on additional info from the comments, there is a solution that might work.
Your IModel interface could have two methods in its signature: one that returns the string for the SELECT part of the SQL query, and one that returns a pointer to the selectModel that you would use in the Find method.
type IModel interface {
    SelectFields() string
    GetSelectModel() interface{}
}
The implementation would go something like this:
func (m *model) SelectFields() string {
    return "model.id, model.name, model.model_no, user_model.role"
}

func (m *model) GetSelectModel() interface{} {
    return &m.selectModel
}

type model struct {
    selectModel
    ID  uint64
    Age int
}

type selectModel struct {
    Name  string
    Email string
}
Then, your query could look something like this:
var m IModel
m = &model{}

db.Table("model").
    Joins("INNER JOIN user_model ON user.id = user_model.uuid").
    Joins("INNER JOIN model ON user.id = model_id").
    Select(m.SelectFields()).
    Find(m.GetSelectModel())

Why does Flink emit duplicate records on a DataStream join + Global window?

I'm learning/experimenting with Flink, and I'm observing some unexpected behavior with the DataStream join, and would like to understand what is happening...
Let's say I have two streams with 10 records each, which I want to join on an id field. Let's assume that each record in one stream has a matching one in the other, and that the IDs are unique within each stream. Let's also say I have to use a global window (requirement).
Join using DataStream API (my simplified code in Scala):
val stream1 = ... // from a Kafka topic on my local machine (I tried with and without .keyBy)
val stream2 = ...
stream1
  .join(stream2)
  .where(_.id).equalTo(_.id)
  .window(GlobalWindows.create()) // assume this is a requirement
  .trigger(CountTrigger.of(1))
  .apply {
    (row1, row2) => // ...
  }
  .print()
Result:
Everything is printed as expected, each record from the first stream joined with a record from the second one.
However:
If I re-send one of the records (say, with an updated field) from one of the streams to that stream, two duplicate join events get emitted 😞
If I repeat that operation (with or without an updated field), I will get 3 emitted events, then 4, 5, etc... 😞
Could someone in the Flink community explain why this is happening? I would have expected only 1 event emitted each time. Is it possible to achieve this with a global window?
In comparison, the Flink Table API behaves as expected in that same scenario, but for my project I'm more interested in the DataStream API.
Example with Table API, which worked as expected:
tableEnv
  .sqlQuery(
    """
      |SELECT *
      | FROM stream1
      | JOIN stream2
      | ON stream1.id = stream2.id
    """.stripMargin)
  .toRetractStream[Row]
  .filter(_._1) // just keep the inserts
  .map(...)
  .print() // works as expected, after re-sending updated records
Thank you,
Nicolas
The issue is that records are never removed from your global window. So the join operation on the global window is triggered whenever a new record arrives, but the old records are still present.
Thus, to get it running in your case, you'd need to implement a custom evictor. I expanded your example into a minimal working example and added the evictor, which I will explain after the snippet.
import java.lang

import org.apache.flink.streaming.api.datastream.CoGroupedStreams
import org.apache.flink.streaming.api.scala._
import org.apache.flink.streaming.api.windowing.assigners.GlobalWindows
import org.apache.flink.streaming.api.windowing.evictors.Evictor
import org.apache.flink.streaming.api.windowing.triggers.CountTrigger
import org.apache.flink.streaming.api.windowing.windows.GlobalWindow
import org.apache.flink.streaming.runtime.operators.windowing.TimestampedValue

val env = StreamExecutionEnvironment.getExecutionEnvironment

val data1 = List(
  (1L, "myId-1"),
  (2L, "myId-2"),
  (5L, "myId-1"),
  (9L, "myId-1"))
val data2 = List(
  (3L, "myId-1", "myValue-A"))

val stream1 = env.fromCollection(data1)
val stream2 = env.fromCollection(data2)
stream1.join(stream2)
  .where(_._2).equalTo(_._2)
  .window(GlobalWindows.create()) // assume this is a requirement
  .trigger(CountTrigger.of(1))
  .evictor(new Evictor[CoGroupedStreams.TaggedUnion[(Long, String), (Long, String, String)], GlobalWindow]() {
    override def evictBefore(elements: lang.Iterable[TimestampedValue[CoGroupedStreams.TaggedUnion[(Long, String), (Long, String, String)]]], size: Int, window: GlobalWindow, evictorContext: Evictor.EvictorContext): Unit = {}

    override def evictAfter(elements: lang.Iterable[TimestampedValue[CoGroupedStreams.TaggedUnion[(Long, String), (Long, String, String)]]], size: Int, window: GlobalWindow, evictorContext: Evictor.EvictorContext): Unit = {
      import scala.collection.JavaConverters._
      val lastInputTwoIndex = elements.asScala.zipWithIndex.filter(e => e._1.getValue.isTwo).lastOption.map(_._2).getOrElse(-1)
      if (lastInputTwoIndex == -1) {
        println("Waiting for the lookup value before evicting")
        return
      }
      val iterator = elements.iterator()
      for (index <- 0 until size) {
        val cur = iterator.next()
        if (index != lastInputTwoIndex) {
          println(s"evicting ${cur.getValue.getOne}/${cur.getValue.getTwo}")
          iterator.remove()
        }
      }
    }
  })
  .apply((r, l) => (r, l))
  .print()

env.execute()
The evictor is applied after the window function (the join in this case) has been applied. It's not entirely clear how your use case should behave when there are multiple entries in the second input, so for now the evictor only handles a single entry.
Whenever a new element comes into the window, the window function is immediately triggered (count = 1). Then the join is evaluated with all elements having the same key. Afterwards, to avoid duplicate outputs, we remove all entries of the first input from the current window. Since the second input may arrive after the first inputs, no eviction is performed while the second input is still empty. Note that my Scala is quite rusty; you will be able to write it in a much nicer way. The output of a run is:
Waiting for the lookup value before evicting
Waiting for the lookup value before evicting
Waiting for the lookup value before evicting
Waiting for the lookup value before evicting
4> ((1,myId-1),(3,myId-1,myValue-A))
4> ((5,myId-1),(3,myId-1,myValue-A))
4> ((9,myId-1),(3,myId-1,myValue-A))
evicting (1,myId-1)/null
evicting (5,myId-1)/null
evicting (9,myId-1)/null
A final remark: if the Table API already offers a concise way of doing what you want, I'd stick to it and then convert the result to a DataStream when needed.

Array, Object, Memory. Actionscript

I have a question related to memory. I will give an example to make it clear how everything works now.
I have 2 arrays:
var ArrayNew:Array = new Array();
var ArrayOld:Array = new Array();
Also I have a class to store my objects (3 properties). For example:
public Id {get; set;}
public Name {get; set;}
public Type {get; set;}
The thing is that I'm filling ArrayNew with new objects periodically (for example, every 12 hours):
ArrayNew.push(x, x, x)
.....
ArrayNew.push(x, x, x)
It may be about ~200 records or even more.
After that I do this:
ArrayOld = ArrayNew;
ArrayNew = null;
So the question is: how does memory work in this situation, and what happens to the objects? Does ArrayOld = ArrayNew make a copy of the objects (because it works now)? Does ArrayNew = null delete the created objects?
I hope you understand the situation. :)
The objects stored in arrayOld get garbage collected if there are no other references to them. The ones from arrayNew are not copied - they are referenced by arrayOld after the assignment.
That is to say that after:
arrayNew[0].name = 'a random silly text';
arrayOld = arrayNew;
arrayOld[0].name = 'another silly string';
trace(arrayNew[0]);
You'd get:
another silly string
Style note: normally you don't start variable/object names with capitals; that convention is reserved for classes.
If I understand you correctly, you want to know what happens to ArrayOld.
My code:
var arr_1:Array = ["Hello world!"];
var arr_2:Array = ["My name is Stas!"];
arr_2 = arr_1;
arr_1 = null;
trace(arr_2);// Hello world!
If I have misunderstood the issue, please explain it further.
ArrayOld = ArrayNew simply makes ArrayOld reference the same thing as ArrayNew at that point. The actual data in memory is not copied.
ArrayNew = null simply assigns a null value to the ArrayNew reference. It doesn't delete the data ArrayNew previously referenced, nor does it affect other references to that data (such as ArrayOld).
At this point, the original data that ArrayNew used to reference has not changed in any way; you've just changed which variable refers to it.
If you then did ArrayOld = null, the original data in memory would no longer have any reference to it, and it would eventually be purged by garbage collection, but not right away. It will happen "later", at a time the runtime decides is convenient.
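For comparison, the same reference semantics can be illustrated in Java (a purely illustrative analogy, not ActionScript):
import java.util.ArrayList;
import java.util.List;

public class ReferenceDemo {
    public static void main(String[] args) {
        List<String> arrayNew = new ArrayList<>();
        arrayNew.add("record");

        List<String> arrayOld = arrayNew; // copies the reference, not the data
        arrayNew = null;                  // only the variable is cleared; the list is untouched

        System.out.println(arrayOld.size()); // prints 1: arrayOld still reaches the same data
    }
}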
