Why does Flink emit duplicate records on a DataStream join + Global window? - apache-flink

I'm learning/experimenting with Flink and I'm observing some unexpected behavior with the DataStream join; I would like to understand what is happening...
Let's say I have two streams with 10 records each, which I want to join on an id field. Let's assume that each record in one stream has a matching one in the other, and that the IDs are unique within each stream. Let's also say I have to use a global window (requirement).
Join using DataStream API (my simplified code in Scala):
val stream1 = ... // from a Kafka topic on my local machine (I tried with and without .keyBy)
val stream2 = ...
stream1
  .join(stream2)
  .where(_.id).equalTo(_.id)
  .window(GlobalWindows.create()) // assume this is a requirement
  .trigger(CountTrigger.of(1))
  .apply {
    (row1, row2) => // ...
  }
  .print()
Result:
Everything is printed as expected: each record from the first stream is joined with the matching record from the second one.
However:
If I re-send one of the records (say, with an updated field) to one of the streams, two duplicate join events get emitted 😞
If I repeat that operation (with or without an updated field), I get 3 emitted events, then 4, 5, etc... 😞
Could someone in the Flink community explain why this is happening? I would have expected only 1 event emitted each time. Is it possible to achieve this with a global window?
In comparison, the Flink Table API behaves as expected in that same scenario, but for my project I'm more interested in the DataStream API.
Example with Table API, which worked as expected:
tableEnv
  .sqlQuery(
    """
      |SELECT *
      | FROM stream1
      | JOIN stream2
      | ON stream1.id = stream2.id
    """.stripMargin)
  .toRetractStream[Row]
  .filter(_._1) // just keep the inserts
  .map(...)
  .print() // works as expected, after re-sending updated records
Thank you,
Nicolas

The issue is that records are never removed from your global window. So the join is triggered on the global window whenever a new record arrives, but the old records are still present.
Thus, to get it working in your case, you'd need to implement a custom evictor. I expanded your example into a minimal working example and added the evictor, which I will explain after the snippet.
import java.lang

import org.apache.flink.streaming.api.datastream.CoGroupedStreams
import org.apache.flink.streaming.api.scala._
import org.apache.flink.streaming.api.windowing.assigners.GlobalWindows
import org.apache.flink.streaming.api.windowing.evictors.Evictor
import org.apache.flink.streaming.api.windowing.triggers.CountTrigger
import org.apache.flink.streaming.api.windowing.windows.GlobalWindow
import org.apache.flink.streaming.runtime.operators.windowing.TimestampedValue

val env = StreamExecutionEnvironment.getExecutionEnvironment

val data1 = List(
  (1L, "myId-1"),
  (2L, "myId-2"),
  (5L, "myId-1"),
  (9L, "myId-1"))
val data2 = List(
  (3L, "myId-1", "myValue-A"))

val stream1 = env.fromCollection(data1)
val stream2 = env.fromCollection(data2)

stream1.join(stream2)
  .where(_._2).equalTo(_._2)
  .window(GlobalWindows.create()) // assume this is a requirement
  .trigger(CountTrigger.of(1))
  .evictor(new Evictor[CoGroupedStreams.TaggedUnion[(Long, String), (Long, String, String)], GlobalWindow]() {
    override def evictBefore(elements: lang.Iterable[TimestampedValue[CoGroupedStreams.TaggedUnion[(Long, String), (Long, String, String)]]], size: Int, window: GlobalWindow, evictorContext: Evictor.EvictorContext): Unit = {}

    override def evictAfter(elements: lang.Iterable[TimestampedValue[CoGroupedStreams.TaggedUnion[(Long, String), (Long, String, String)]]], size: Int, window: GlobalWindow, evictorContext: Evictor.EvictorContext): Unit = {
      import scala.collection.JavaConverters._
      // find the last element that came from the second (lookup) input
      val lastInputTwoIndex = elements.asScala.zipWithIndex.filter(e => e._1.getValue.isTwo).lastOption.map(_._2).getOrElse(-1)
      if (lastInputTwoIndex == -1) {
        println("Waiting for the lookup value before evicting")
        return
      }
      // evict everything except that last second-input element
      val iterator = elements.iterator()
      for (index <- 0 until size) {
        val cur = iterator.next()
        if (index != lastInputTwoIndex) {
          println(s"evicting ${cur.getValue.getOne}/${cur.getValue.getTwo}")
          iterator.remove()
        }
      }
    }
  })
  .apply((r, l) => (r, l))
  .print()

env.execute()
The evictor is applied after the window function (the join in this case) has run. It's not entirely clear how exactly your use case should work if there can be multiple entries in the second input, but for now the evictor only handles a single entry there.
Whenever a new element comes into the window, the window function is immediately triggered (count = 1). The join is then evaluated over all elements with the same key. Afterwards, to avoid duplicate outputs, we evict all entries from the first input in the current window. Since the second input may arrive after the first inputs, no eviction is performed while the second input is still empty. Note that my Scala is quite rusty; you will be able to write this in a much nicer way. The output of a run is:
Waiting for the lookup value before evicting
Waiting for the lookup value before evicting
Waiting for the lookup value before evicting
Waiting for the lookup value before evicting
4> ((1,myId-1),(3,myId-1,myValue-A))
4> ((5,myId-1),(3,myId-1,myValue-A))
4> ((9,myId-1),(3,myId-1,myValue-A))
evicting (1,myId-1)/null
evicting (5,myId-1)/null
evicting (9,myId-1)/null
A final remark: if the Table API already offers a concise way of doing what you want, I'd stick with it and convert the result to a DataStream when needed.
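For reference, a minimal sketch of that route, mirroring the Table API snippet from the question (it assumes stream1 and stream2 are already registered as tables on tableEnv, and uses the same retract-stream conversion to get back to a DataStream):
val joinedTable = tableEnv.sqlQuery(
  """
    |SELECT *
    | FROM stream1
    | JOIN stream2
    | ON stream1.id = stream2.id
  """.stripMargin)

// toRetractStream emits (isInsert, row): keep the inserts and drop the retraction flag
val joinedStream: DataStream[(Boolean, Row)] = joinedTable.toRetractStream[Row]
joinedStream
  .filter(_._1)
  .map(_._2)
  .print()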

Related

Non-blocking array reduce in NodeJS?

I have a function that takes in two very large arrays. Essentially, I am matching up orders with warehouse items that are available to fulfill them. An order is an object that contains a sub-array of order item objects.
Currently I am using a reduce function to loop through the orders, then another reduce function to loop through the items in each order. Inside this nested reduce, I filter out items the customer returned, so as not to give the customer a replacement with the very item they just sent back. I then filter the large array of available items to match them to the order. The large array of items is mutable, since I need to mark an item as used so it is not assigned to another order.
Here's some pseudocode of what I am doing.
orders.reduce((accum, currentOrder) => {
  currentOrder.items.reduce((internalAccum, currentItem) => {
    const prevItems = prevOrders.filter(po => po.customerId === currentOrder.customerId);
    const availItems = staticItems.filter(si => si.itemId === currentItem.itemId && !prevItems.includes(currentItem.labelId));
    // Logic to assign the item to the order
  });
});
All of this is running in a MESOS cluster on my server. The issue I am having is that my MESOS system does a health check every 10 seconds. While this code is running, the server stops responding for a short period of time (up to 45 seconds or so), and the health check kills the container after 3 failed attempts.
I need to find some way to do this complex looping without blocking the response to the health check. I have tried moving everything to an eachSeries using the async library, but it still locks up. I have to do the work in order, or I would have done something like async.each or async.eachLimit; if the orders are not processed in order, the same item might be assigned to two orders simultaneously.
You can do batch processing here with a promisified setImmediate so that incoming events can have a chance to execute between batches. This solution requires async/await support.
async function batchReduce(list, limit, reduceFn, initial) {
  let result = initial;
  let offset = 0;
  while (offset < list.length) {
    const batchSize = Math.min(limit, list.length - offset);
    for (let i = 0; i < batchSize; i++) {
      result = reduceFn(result, list[offset + i]);
    }
    offset += batchSize;
    await new Promise(setImmediate);
  }
  return result;
}

Flink - behaviour of timesOrMore

I want to find a pattern of events as follows.
The inner pattern is:
Have the same value for the key "sensorArea".
Have different values for the key "customerId".
Are within 5 seconds of each other.
And this pattern needs to:
Emit an "alert" only if the above happens 3 or more times.
I wrote something but I know for sure it is not complete.
Two questions:
I need to access the previous event's fields when I'm in the "next" pattern. How can I do that without using ctx, which is expensive?
My code produces a weird result - this is my input
and my output is
3> {first=[Customer[timestamp=50,customerId=111,toAdd=2,sensorData=33]], second=[Customer[timestamp=100,customerId=222,toAdd=2,sensorData=33], Customer[timestamp=600,customerId=333,toAdd=2,sensorData=33]]}
even though my desired output should be all of the first six events (customers 111/222 with sensor area 33, then 44, then 55).
Pattern<Customer, ?> sameUserDifferentSensor = Pattern.<Customer>begin("first", skipStrategy)
        .followedBy("second").where(new IterativeCondition<Customer>() {
            @Override
            public boolean filter(Customer currCustomerEvent, Context<Customer> ctx) throws Exception {
                List<Customer> firstPatternEvents = Lists.newArrayList(ctx.getEventsForPattern("first"));
                int i = firstPatternEvents.size();
                int currSensorData = currCustomerEvent.getSensorData();
                int prevSensorData = firstPatternEvents.get(i - 1).getSensorData();
                int currCustomerId = currCustomerEvent.getCustomerId();
                int prevCustomerId = firstPatternEvents.get(i - 1).getCustomerId();
                return currSensorData == prevSensorData && currCustomerId != prevCustomerId;
            }
        })
        .within(Time.seconds(5))
        .timesOrMore(3);

PatternStream<Customer> sameUserDifferentSensorPatternStream = CEP.pattern(customerStream, sameUserDifferentSensor);
DataStream<String> alerts1 = sameUserDifferentSensorPatternStream.select((PatternSelectFunction<Customer, String>) Object::toString);
You will have an easier time if you first key the stream by the sensorArea. Then you will be pattern matching on streams where all of the events are for a single sensorArea, which will make the pattern easier to express and the matching more efficient.
You can't avoid using an iterative condition and the ctx, but it should be less expensive after keying the stream.
Also, your code example doesn't match the text description. The text says "within 5 seconds" and "3 or more times", while the code has within(Time.seconds(2)) and timesOrMore(2).
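For illustration, a minimal sketch in Scala of the keying step (with a hypothetical Customer case class standing in for the Java POJO above; its sensor field is called sensorData, as in your code). The CEP pattern would then be applied per key:
import org.apache.flink.streaming.api.scala._

// hypothetical case class mirroring the Java Customer POJO used in the question
case class Customer(timestamp: Long, customerId: Int, toAdd: Int, sensorData: Int)

val env = StreamExecutionEnvironment.getExecutionEnvironment

val customerStream: DataStream[Customer] = env.fromElements(
  Customer(50, 111, 2, 33),
  Customer(100, 222, 2, 33),
  Customer(600, 333, 2, 33))

// key by the sensor area so each partition only contains events for a single
// sensor area; the iterative condition then only needs to compare customerIds
val keyedCustomers: KeyedStream[Customer, Int] = customerStream.keyBy(_.sensorData)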

How to buffer and drop a chunked bytestring with a delimiter?

Let's say you have a publisher using broadcast with some fast and some slow subscribers, and you would like to be able to drop sets of messages for a slow subscriber without having to keep them in memory. The data consists of chunked ByteStrings, so dropping a single ByteString is not an option.
Each set of ByteStrings is followed by a terminator ByteString("\n"), so I would need to drop a set of ByteStrings ending with that.
Is that something you can do with a custom graph stage? Can it be done without aggregating and keeping the whole set in memory?
Avoid Custom Stages
Whenever possible, try to avoid custom stages: they are very tricky to get right, as well as pretty verbose. Usually some combination of the standard akka-stream stages and plain old functions will do the trick.
Group Dropping
Presumably you have some criteria that you will use to decide which group of messages will be dropped:
type ShouldDropTester = () => Boolean
For demonstration purposes I will use a simple switch that drops every other group:
val dropEveryOther : ShouldDropTester =
  Iterator.from(1)
          .map(_ % 2 == 0)
          .next
We will also need a function that will take in a ShouldDropTester and use it to determine whether an individual ByteString should be dropped:
val endOfFile = ByteString("\n")
val dropGroupPredicate : ShouldDropTester => ByteString => Boolean =
  (shouldDropTester) => {
    var dropGroup = shouldDropTester()

    (byteString) =>
      if (byteString equals endOfFile) {
        val returnValue = dropGroup
        dropGroup = shouldDropTester()
        returnValue
      }
      else {
        dropGroup
      }
  }
Combining the above two functions will drop every other group of ByteStrings. This functionality can then be converted into a Flow:
val filterPredicateFunction : ByteString => Boolean =
  dropGroupPredicate(dropEveryOther)

val dropGroups : Flow[ByteString, ByteString, _] =
  Flow[ByteString] filter filterPredicateFunction
As required: the groups of messages do not need to be buffered; the predicate works on individual ByteStrings and therefore consumes a constant amount of memory regardless of file size.
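For example, a minimal sketch of wiring the flow into a stream (assuming Akka Streams 2.6, where an implicit ActorSystem provides the materializer; the source here just stands in for the slow subscriber's branch of the broadcast):
import akka.actor.ActorSystem
import akka.stream.scaladsl.{Sink, Source}
import akka.util.ByteString

implicit val system: ActorSystem = ActorSystem("drop-groups-demo")

// two groups, each terminated by ByteString("\n"); with the dropEveryOther tester
// above, alternating groups are filtered out
val chunks = Source(List(
  ByteString("group-1 part-a"), ByteString("group-1 part-b"), ByteString("\n"),
  ByteString("group-2 part-a"), ByteString("\n")))

chunks
  .via(dropGroups)
  .runWith(Sink.foreach(bs => println(bs.utf8String)))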

How to increment a variable in a Gatling loop

I am trying to write a Gatling script where I read a starting number from a CSV file and loop, say, 10 times. In each iteration, I want to increment the value of the parameter.
It looks like some Scala or Java math is needed, but I could not find information on how to do it, or on how and where to combine Gatling EL with Scala or Java.
Appreciate any help or direction.
var numloop = new java.util.concurrent.atomic.AtomicInteger(0)

val scn = scenario("Scenario Name")
  .asLongAs(_ => numloop.getAndIncrement() < 3, exitASAP = false) {
    feed(csv("ids.csv")) // read ${ID} from the file
      .exec(http("request")
        .get("""http://finance.yahoo.com/q?s=${ID}""")
        .headers(headers_1))
      .pause(284 milliseconds)
    // How to increment ID for the next iteration and pass it to the .get method?
  }
You copy-pasted this code from Gatling's Google Group but this use case was very specific.
Did you first properly read the documentation regarding loops? What's your use case and how doesn't it fit with basic loops?
Edit: So the question is: how do I get a unique id per loop iteration and per virtual user?
You can compute one from the loop index and a numeric virtual user id. The Session already has a unique ID, but it's a String UUID, so it's not very convenient for what you want to do.
// first, let's build a Feeder that sets a numeric id:
val userIdFeeder = Iterator.from(0).map(i => Map("userId" -> i))

val iterations = 1000

// set this userId on every virtual user
feed(userIdFeeder)
  // loop and define the loop index
  .repeat(iterations, "index") {
    // set a new attribute named "id"
    exec { session =>
      val userId = session("userId").as[Int]
      val index = session("index").as[Int]
      val id = iterations * userId + index
      session.set("id", id)
    }
    // use the id attribute, for example with EL ${id}
  }
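For instance (a hypothetical continuation, chained after the exec above inside the repeat block), the computed attribute can then be referenced with Gatling EL:
// hypothetical request using the computed "id" session attribute via Gatling EL
.exec(
  http("request with unique id")
    .get("https://example.com/items/${id}"))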
Here is my answer to this:
Problem statement: I had to repeat the Gatling execution a configured number of times, and my step name has to be dynamic.
object UrlVerifier {

  val count = new java.util.concurrent.atomic.AtomicInteger(0)
  val baseUrl = Params.applicationBaseUrl

  val accessUrl = repeat(Params.noOfPagesToBeVisited, "index") {
    exec(session => {
      val randomUrls: List[String] = UrlFeeder.getUrlsToBeTested()
      session.set("index", count.getAndIncrement).set("pageToTest", randomUrls(session("index").as[Int]))
    }).
      exec(http("Accessing Page ${pageToTest}")
        .get(baseUrl + "${pageToTest}")
        .check(status.is(200))).pause(Params.timeToPauseInSeconds)
  }
}
So basically UrlFeeder gives me a list of Strings (the URLs to be tested), and in the exec we use count (an AtomicInteger) to populate a session variable named 'index', whose value starts at 0 and is incremented via getAndIncrement on each iteration. This 'index' variable is the one used within the repeat() loop, since we specify 'index' as the name of the counter variable.
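For context, a hypothetical sketch of what such a UrlFeeder might look like, assuming it simply loads the relative URLs to test from a file:
// hypothetical UrlFeeder: exposes the list of relative URLs to be tested
object UrlFeeder {
  def getUrlsToBeTested(): List[String] =
    scala.io.Source.fromFile("urls-to-test.csv").getLines().toList
}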
Hope it helps others as well.

get_by_id() not returning values

I am writing an application that shows the user a number of elements, where he has to select a few of them to process. When he does so, the application queries the DB for the rest of the data on these elements, and stacks them with their full data on the next page.
I made an HTML form loop with a checkbox next to each element, and then in Python I check for this checkbox's value to get the data.
Even when I'm just trying to query the data, ndb doesn't return anything.
pitemkeys are the ids for the elements to be queried. inpochecks is the checkbox variable.
preqitems is the dict to save the items after getting the data.
The next page queries nothing and is blank.
The commented-out lines are my originally intended code, which produced lots of errors because the queries returned nothing.
request_code = self.request.get_all('rcode')
pitemkeys = self.request.get_all('pitemkey')
inpochecks = self.request.get_all('inpo')
preqitems = {}

# idx = 0
# for ix, pitemkey in enumerate(pitemkeys):
#     if inpochecks[ix] == 'on':
#         preqitems[idx] = Preqitems.get_by_id(pitemkey)
#         preqitems[idx].rcode = request_code[ix]
#         idx += 1

for ix, pitemkey in enumerate(pitemkeys):
    preqitems[ix] = Preqitems.get_by_id(pitemkey)
    # preqitems[ix].rcode = request_code[ix]
Update: When trying
preqitems = ndb.get_multi([ndb.Key(Preqitems, k) for k in pitemkeys])
preqitems returns a list full of None values, as if the DB couldn't find data for these keys. I checked the keys and for some reason they are in unicode format; could that be the reason? They look like this:
[u'T-SQ-00301-0002-0001', u'U-T-MT-00334-0007-0002', u'U-T-MT-00334-0008-0001']
Probably you need to do int(pitemkey) or str(pitemkey), depending on whether you are using an integer or a string id.
