Starvation of one of 2 streams in ConnectedStreams - apache-flink

Background
We have 2 streams, let's call them A and B.
They produce elements a and b respectively.
Stream A produces elements at a slow rate (one every minute).
Stream B receives a single element once every 2 weeks. It uses a flatMap function which receives this element and generates ~2 million b elements in a loop:
(Java)
for (BElement value : valuesList) {
    out.collect(value);
}
The valuesList here contains ~2 million b elements.
We connect those streams (A and B) using connect, key by some key and perform another flatMap on the connected stream:
streamA.connect(streamB)
    .keyBy(AClass::someKey, BClass::someKey)
    .flatMap(processConnectedStreams);
Each of the b elements has a different key, meaning there are ~2 million keys coming from the B stream.
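For context, processConnectedStreams in this kind of setup is a CoFlatMapFunction with one callback per input side. A minimal sketch of its shape (type and key names are assumptions, not the actual job):
(Java)
import org.apache.flink.streaming.api.functions.co.CoFlatMapFunction;
import org.apache.flink.util.Collector;

public class ProcessConnectedStreams implements CoFlatMapFunction<AClass, BClass, Result> {
    @Override
    public void flatMap1(AClass a, Collector<Result> out) {
        // called for each a element from stream A
    }

    @Override
    public void flatMap2(BClass b, Collector<Result> out) {
        // called for each b element from stream B (~2 million arrive in one burst)
    }
}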
The Problem
What we see is starvation: even though there are a elements ready to be processed, they are not processed by processConnectedStreams.
Our attempts to solve the issue
We tried to throttle stream B to 10 elements per second by performing a Thread.sleep() every 10 elements:
long totalSent = 0;
for (BElement value : valuesList) {
    totalSent++;
    out.collect(value);
    if (totalSent % 10 == 0) {
        Thread.sleep(1000);
    }
}
The processConnectedStreams is simulated to take 1 second with another Thread.sleep(), and we have tried it with:
* Setting a parallelism of 10 for the whole pipeline - didn't work
* Setting a parallelism of 15 for the whole pipeline - did work
The question
We don't want to use all these resources, since stream B is activated very rarely and high parallelism is overkill for the stream A elements.
Is it possible to solve it without setting the parallelism to more than the number of b elements we send every second?

It would be useful if you shared the complete workflow topology. For example, you don't mention doing any keying or random partitioning of the data. If that's really the case, then Flink is going to pipeline multiple operations in one task, which can (depending on the topology) lead to the problem you're seeing.
If that's the case, then forcing partitioning prior to the processConnectedStreams can help, as then that operation will be reading from network buffers.
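For illustration, one way to force a partitioning step ahead of the connected flatMap, so that it reads from network buffers rather than being chained to the B-side expansion (a sketch only; the key selectors are assumptions):
(Java)
streamA.keyBy(AClass::someKey)
    .connect(streamB.keyBy(BClass::someKey))   // keying each side forces a network shuffle before the flatMap
    .flatMap(processConnectedStreams);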

Related

Does Flink's windowing operation process elements at the end of window or does it do a rolling processing?

I am having some trouble understanding the way windowing is implemented internally in Flink and could not find any article which explains this in depth. In my mind, there are two ways this can be done. Consider a simple windowed word count as below:
env.socketTextStream("localhost", 9999)
    .flatMap(new Splitter())
    .groupBy(0)
    .window(Time.of(500, TimeUnit.SECONDS))
    .sum(1);
Method 1: Store all events for 500 seconds and at the end of the window, process all of them by applying the sum operation on the stored events.
Method 2: We use a counter to store a rolling sum for every window. As each event in a window comes, we do not store the individual events but keep adding 1 to previously stored counter and output the result at the end of the window.
Could someone kindly help me understand which of the above methods (or maybe a different approach) is used by Flink in reality? The reason is that there are pros and cons to both approaches, and it is important to understand this in order to configure the resources for the cluster correctly.
e.g. Method 1 seems very close to batch processing and might have issues with processing spikes at every 500-second interval while sitting idle otherwise, while Method 2 would need to maintain a common counter across all task managers.
sum is a reducing function, as mentioned here (https://nightlies.apache.org/flink/flink-docs-master/docs/dev/datastream/operators/windows/#reducefunction). Internally, Flink applies the reduce function to each input element as it arrives and simply saves the reduced result in a ReducingState.
For other window functions, like window.apply(WindowFunction), there is no pre-aggregation, so all input elements are saved in a ListState.
This document (https://nightlies.apache.org/flink/flink-docs-master/docs/dev/datastream/operators/windows/#window-functions) about window functions describes how the elements are handled internally in Flink.
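To make the contrast concrete, a sketch of both variants with the DataStream API, assuming words is a DataStream<Tuple2<String, Integer>> of (word, 1) pairs (type names are assumptions, not the asker's code):
(Java)
import org.apache.flink.api.java.tuple.Tuple2;
import org.apache.flink.streaming.api.functions.windowing.WindowFunction;
import org.apache.flink.streaming.api.windowing.assigners.TumblingProcessingTimeWindows;
import org.apache.flink.streaming.api.windowing.time.Time;
import org.apache.flink.streaming.api.windowing.windows.TimeWindow;
import org.apache.flink.util.Collector;

// Incremental aggregation: only the running sum per key and window is kept in state.
words.keyBy(t -> t.f0)
    .window(TumblingProcessingTimeWindows.of(Time.seconds(500)))
    .reduce((a, b) -> new Tuple2<>(a.f0, a.f1 + b.f1));

// Full window function: every element is buffered in ListState until the window fires.
words.keyBy(t -> t.f0)
    .window(TumblingProcessingTimeWindows.of(Time.seconds(500)))
    .apply(new WindowFunction<Tuple2<String, Integer>, Tuple2<String, Integer>, String, TimeWindow>() {
        @Override
        public void apply(String key, TimeWindow window,
                          Iterable<Tuple2<String, Integer>> input,
                          Collector<Tuple2<String, Integer>> out) {
            int sum = 0;
            for (Tuple2<String, Integer> t : input) {
                sum += t.f1;
            }
            out.collect(new Tuple2<>(key, sum));
        }
    });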

How can you have a Flink-process evaluate multiple independent conditions for the same time window

So, we have events a, b, c, x, y, z arriving from a Kafka source within a 10 minute window, and I have Flink code with a 10 minute window (keyed, tumbling window).
I have conditions to fire for (in the 10 minute window):
a && b && c
a && z
b && x && y
a && e
a && f
It is clear that only the first three conditions will evaluate to true in the 10 min window.
I want each of the three to issue a trigger to a Kafka sink for those events (so three (3) triggers for this example, sent to the Kafka sink).
I checked the examples and I have a full test harness with Kafka producer, consumer and Flink processor. But it is not clear from any of the examples how to evaluate multiple independent conditions on data within the same time window.
This is just a conceptual example. In reality, there will be 200,000 - 400,000 conditions to evaluate against the incoming stream (millions of events). Someone mentioned that normally you'd use something like Drools (or a rules engine), but in this specific case Rete won't scale at all, and recommended I check out one of the complex event stream processors that are more geared to large data.
Sorry, this is already a long question, but I wanted to give good context so Flink experts can judge better. Thanks.
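Conceptually, once a keyed tumbling window fires, each condition is just a subset check against the set of event types seen in that window. A minimal sketch of that idea (Event, its accessors, and the condition sets are assumptions; with hundreds of thousands of conditions you would index conditions by event type rather than scan them all):
(Java)
import java.util.Arrays;
import java.util.HashSet;
import java.util.List;
import java.util.Set;
import org.apache.flink.streaming.api.functions.windowing.ProcessWindowFunction;
import org.apache.flink.streaming.api.windowing.assigners.TumblingEventTimeWindows;
import org.apache.flink.streaming.api.windowing.time.Time;
import org.apache.flink.streaming.api.windowing.windows.TimeWindow;
import org.apache.flink.util.Collector;

// Each condition is the set of event types that must all appear within the window.
List<Set<String>> conditions = Arrays.asList(
    new HashSet<>(Arrays.asList("a", "b", "c")),
    new HashSet<>(Arrays.asList("a", "z")),
    new HashSet<>(Arrays.asList("b", "x", "y")));

events.keyBy(Event::getKey)
    .window(TumblingEventTimeWindows.of(Time.minutes(10)))
    .process(new ProcessWindowFunction<Event, String, String, TimeWindow>() {
        @Override
        public void process(String key, Context ctx,
                            Iterable<Event> elements, Collector<String> out) {
            Set<String> seen = new HashSet<>();
            for (Event e : elements) {
                seen.add(e.getType());
            }
            for (Set<String> condition : conditions) {
                if (seen.containsAll(condition)) {
                    out.collect(key + " matched " + condition);   // one record per satisfied condition, e.g. to a Kafka sink
                }
            }
        }
    });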

How to handle db concurrency?

Let's say I've got 3 tables (LiveDataTable, ReducedDataTable, ScheduleTable).
Basically I've got a stream of events -> whenever I receive an event I write the data extracted from it to LiveDataTable.
The problem is that there's a huge number of events, so LiveDataTable may become really large. That's why I've got another table, ReducedDataTable, where I combine rows from LiveDataTable (think of selecting 100 rows from LiveDataTable, reducing them to 1 row, writing it to ReducedDataTable, and then deleting those 100 rows from LiveDataTable).
In order to determine the right time to perform these reducing operations there's ScheduleTable. You can think of 1 row in ScheduleTable as corresponding to 1 reducing operation.
I want to be able to support a List<Data> getData() method from an interface. There are 2 cases: I should either read from ReducedDataTable only, or merge the results from ReducedDataTable and LiveDataTable.
Here's how my caching works step-by-step:
Read 1 row from ScheduleTable
Read from LiveDataTable
Write to ReducedDataTable (at least 4 rows)
Remove (<= INT_MAX) rows from LiveDataTable
Remove 1 row from ScheduleTable
The problem is that I want to determine programmatically, when a getData() request comes in, whether I should read from LiveDataTable, ReducedDataTable, or both. For every step before #3 I want to read from LiveDataTable and then from ReducedDataTable. How do I determine what step I'm currently at when receiving a getData() request?
The reason I asked this question is that I believe this is a common problem in databases when handling concurrency.
(Assuming that your compaction process is fast enough)
You can first optimistically read from the small table and, if the data is missing, read from the non-compacted one.
In most cases there will be only one request, not two.
Otherwise you can maintain a timestamp of the data that has already been compacted.
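A minimal sketch of that optimistic read path (the table accessor names are assumptions):
(Java)
// Read the compacted table first; only fall back to the live table on a miss.
List<Data> getData(Key key) {
    List<Data> reduced = reducedDataTable.read(key);   // small, compacted table
    if (!reduced.isEmpty()) {
        return reduced;                                // common case: a single query
    }
    return liveDataTable.read(key);                    // fall back to the non-compacted data
}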

Generated unique id with 6 characters - handling when too many ids are already used

In my program you can book an item. This item has an id with 6 characters from 32 possible characters.
So my possibilities are 32^6. Every id must be unique.
func tryToAddItem() {
    let id = generateId()
    if !db.contains(id) {
        addItem(id)
    } else {
        tryToAddItem()   // collision: try again with a new id
    }
}
For example 90% of my ids are used. So the probability that I call tryToAddItem 5 times is 0.9^5 * 100 = 59%, isn't it?
That is quite high, and it means 5 database queries over a lot of data.
When the probability gets that high I want to introduce a prefix „A-xxxxxx“.
What is a good condition for that? At what point will I need a prefix?
In my example 90% of the ids were used. What about the rest? Do I throw them away?
What about database performance when I call tryToAddItem 5 times? I imagine that this is not best practice.
For example 90% of my ids are used. So the probability that I call tryToAddItem 5 times is 0.9^5 * 100 = 59%, isn't it?
Not quite. Let's represent the number of calls you make with the random variable X, and let's call the probability of an id collision p. You want the probability that you make the call at most five times, or in general at most k times:
P(X≤k) = P(X=1) + P(X=2) + ... + P(X=k)
= (1-p) + (1-p)*p + (1-p)*p^2 + ... + (1-p)*p^(k-1)
= (1-p)*(1 + p + p^2 + ... + p^(k-1))
If we expand this out, all but two terms cancel and we get:
= 1 - p^k
Which we want to be greater than some probability, x:
1 - p^k > x
Or with p in terms of k and x:
p < (1-x)^(1/k)
where you can adjust x and k for your specific needs.
If you want less than a 50% probability of needing more than 5 calls, then no more than (1-0.5)^(1/5) ≈ 87% of your ids should be taken.
First of all, make sure there is an index on the id column you are looking up. Then I would recommend thinking more in terms of setting a very low probability of a very bad event occurring. For example, maybe making 20 calls slows down the database for too long, so we'd like to set the probability of this occurring to <0.1%. Using the formula above we find that no more than about 70% of ids should be taken.
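A quick numeric check of that bound (a sketch; the two inputs are simply the examples above):
(Java)
// Maximum fraction of ids that may already be taken so that the probability of
// needing more than k lookups stays below (1 - x).
static double maxLoadFactor(double x, int k) {
    return Math.pow(1.0 - x, 1.0 / k);
}

// maxLoadFactor(0.5, 5)    ≈ 0.87  ->  at most ~87% of ids taken
// maxLoadFactor(0.999, 20) ≈ 0.71  ->  at most ~70% of ids taken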
But you should also consider alternative solutions. Is remapping all ids to a larger space one time only a possibility?
Or if adding ids with prefixes is not a big deal then you could generate longer ids with prefixes for all new items going forward and not have to worry about collisions.
Thanks for the response. I searched for alternatives and want to show three possibilities.
First possibility: Create an UpcomingItemIdTable with 200 (more or less) valid itemIds. A background task can top it up every minute (or however often you need), so the tryToAddItem action will always get a valid itemId (see the sketch after the three possibilities).
Second possibility
Is remapping all ids to a larger space one time only a possibility?
In my case yes. I think for other problems the answer will be: it depends.
Third possibility: Try to generate an itemId and when there is a collision try it again.
Possible collision handling: do some tests beforehand. Measure the time to generate itemIds when there are already 1,000 / 10,000 / 100,000 / 1,000,000 etc. entries in the table. When the tryToAddItem method needs more than 100 ms (or whatever you prefer), increase your length from 6 to 7, 8, 9 characters.
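A sketch of the first possibility (the pool size, interval, and the db/generateId helpers are assumptions; a real version would persist the pool in UpcomingItemIdTable and handle races on take):
(Java)
import java.util.concurrent.*;

BlockingQueue<String> upcomingItemIds = new LinkedBlockingQueue<>();
ScheduledExecutorService scheduler = Executors.newSingleThreadScheduledExecutor();

// Background task: keep ~200 pre-validated ids ready at all times.
scheduler.scheduleAtFixedRate(() -> {
    while (upcomingItemIds.size() < 200) {
        String id = generateId();
        if (!db.contains(id) && !upcomingItemIds.contains(id)) {
            upcomingItemIds.add(id);
        }
    }
}, 0, 1, TimeUnit.MINUTES);

// Booking path: no probing loop against the database needed.
String takeItemId() throws InterruptedException {
    return upcomingItemIds.take();
}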
Some thoughts
every request must be atomic
create an index on itemId
Disadvantages for long UUIDs in API: See https://zalando.github.io/restful-api-guidelines/#144
less usable, because...
-cannot be memorized and easily communicated by humans
-harder to use in debugging and logging analysis
-less convenient for consumer facing usage
-quite long: readable representation requires 36 characters and comes with higher memory and bandwidth consumption
-not ordered along their creation history and no indication of used id volume
-may be in conflict with additional backward compatibility support of legacy ids
[...]
TL;DR: For my case every possibility works. As so often, it depends on the problem. Thanks for the input.

Efficient data structure and strategy for synchronizing several item collections

I want one primary collection of items of a single type that modifications are made to over time. Periodically, several slave collections are going to synchronize with the primary collection. The primary collection should send a delta of items to the slave collections.
Primary Collection: A, C, D
Slave Collection 1: A, C (add D)
Slave Collection 2: A, B (add C, D; remove B)
The slave collections cannot add or remove items on their own, and they may exist in a different process, so I'm probably going to use pipes to push the data.
I don't want to push more data than necessary since the collection may become quite large.
What kind of data structures and strategies would be ideal for this?
For that I use differential execution.
(BTW, the word "slave" is uncomfortable for some people, with reason.)
For each remote site, there is a sequential file at the primary site representing what exists on the remote site.
There is a procedure at the primary site that walks through the primary collection, and as it walks it reads the corresponding file, detecting differences between what currently exists on the remote site and what should exist.
Those differences produce deltas, which are transmitted to the remote site.
At the same time, the procedure writes a new file representing what will exist at the remote site after the deltas are processed.
The advantage of this is it does not depend on detecting change events in the primary collection, because often those change events are unreliable or can be self-cancelling or made irrelevant by other changes, so you cut way down on needless transmissions to the remote site.
In the case that the collections are simple lists of things, this boils down to having local copies of the remote collections and running a diff algorithm to get the delta.
Here are a couple such algorithms:
If the collections can be sorted (like your A,B,C example), just run a merge loop:
while (ix < nx && iy < ny) {
    if (X[ix] < Y[iy]) {
        // X[ix] was inserted in X
        ix++;
    } else if (Y[iy] < X[ix]) {
        // Y[iy] was deleted from X
        iy++;
    } else {
        // the two elements are equal; skip them both
        ix++; iy++;
    }
}
while (ix < nx) {
    // X[ix] was inserted in X
    ix++;
}
while (iy < ny) {
    // Y[iy] was deleted from X
    iy++;
}
If the collections cannot be sorted (note relationship to Levenshtein distance),
Until we have read through both collections X and Y,
See if the current items are equal
else see if a single item was inserted in X
else see if a single item was deleted from X
else see if 2 items were inserted in X
else see if a single item was replaced in X
else see if 2 items were deleted from X
else see if 3 items were inserted in X
else see if 2 items in X replaced 1 item in Y
else see if 1 item in X replaced 2 items in Y
else see if 3 items were deleted from X
etc. etc. up to some limit
Performance is generally not an issue, because the procedure does not have to be run at high frequency.
There's a crude video demonstrating this concept, and source code where it is used for dynamically changing user interfaces.
If one doesn't push all the data, some sort of log is required, which uses main memory instead of pipe bandwidth. The parameter to tune for a good balance between CPU and memory usage would be the 'push' frequency.
From your question I assume you have more than one slave process. In that case, a shared-memory or CMA (Linux) approach with double buffering in the master process should outperform multiple pipes by far, as it doesn't even require multithreaded pushing, which would otherwise be needed to optimize overall pipe throughput during synchronization.
The slave processes could be notified via a global synchronization barrier to read from masterCollectionA without copying, while the master modifies masterCollectionB (which is initialized with a copy of masterCollectionA), and vice versa. Access to a collection should be interlocked between slaves and master. The slaves could copy that collection (a snapshot) if they would otherwise block it past the master's next update attempt, thus allowing the master to continue. Modifications in slave processes could be implemented with a copy-on-write strategy for single elements. This cooperative approach is rather simple to implement, and as long as the slave processes don't copy whole snapshots every time, the overall memory consumption is low.
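A rough sketch of that double-buffering idea in plain Java, with ordinary locking standing in for shared memory / CMA (all names are assumptions):
(Java)
import java.util.ArrayList;
import java.util.List;

class DoubleBufferedCollection<T> {
    private List<T> front = new ArrayList<>();   // slaves read (or snapshot) this buffer
    private List<T> back  = new ArrayList<>();   // the master modifies this buffer

    // Master thread only: mutate the back buffer between swaps.
    List<T> backBuffer() {
        return back;
    }

    // Master: publish the back buffer and re-initialize the new back buffer
    // with a copy of what was just published.
    synchronized void swap() {
        List<T> published = back;
        back = new ArrayList<>(published);
        front = published;
    }

    // Slave: copy the front buffer so the master is never blocked past the next swap.
    synchronized List<T> snapshot() {
        return new ArrayList<>(front);
    }
}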
