Issue managing state of an operator for messages with the same timestamp using Flink - apache-flink

I seem to be encountering an issue while managing state during parallel processing of messages with the same key and the same timestamp.
For a (simplified) example, let's suppose a simple DoFn<Key, Long> with a ValueState<Long>:
@ProcessElement
public void processElement(
    final ProcessContext c,
    @StateId(STATE_ID) final ValueState<Long> someState) {
  final long val = c.element().value();
  final long currentSum = Optional.ofNullable(someState.read()).orElse(0L);
  final long newSum = currentSum + val;
  someState.write(newSum);
}
This state is output intermittently on a timer.
My question concerns the case where two elements A and B have the same key and the same timestamp. Say the value in state is 5, the value of A is 3, and the value of B is 4; one would expect the value in someState after processing both elements to be 12. Is this guaranteed? That is, will some ordering be applied to A and B, or could there be a race condition whereby the value in state intermittently ends up as either 8 or 9, depending on which of A and B reads the state first? (Note: this is a simplified version of what I am dealing with, but I believe this nondeterministic behavior is happening in our pipeline.) If this assumption is correct, what approaches could I take to resolve the issue?
Thanks everyone in advance!

There is no race condition caused by identical timestamps: the runner serializes state access per key, so processElement is simply called twice, once for each element, and the second call sees the first call's write. The sum ends up as 12 regardless of which of A and B is processed first; only the order of the two calls is unspecified.
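For reference, here is a minimal, hedged sketch of a stateful DoFn with the timer-based flush the question mentions. The class name, the input/output types (KV<String, Long> in, Long out) and the one-minute processing-time interval are assumptions made for illustration, not taken from the question:
import java.util.Optional;

import org.apache.beam.sdk.coders.VarLongCoder;
import org.apache.beam.sdk.state.StateSpec;
import org.apache.beam.sdk.state.StateSpecs;
import org.apache.beam.sdk.state.TimeDomain;
import org.apache.beam.sdk.state.Timer;
import org.apache.beam.sdk.state.TimerSpec;
import org.apache.beam.sdk.state.TimerSpecs;
import org.apache.beam.sdk.state.ValueState;
import org.apache.beam.sdk.transforms.DoFn;
import org.apache.beam.sdk.values.KV;
import org.joda.time.Duration;

public class SummingFn extends DoFn<KV<String, Long>, Long> {
  private static final String STATE_ID = "sum";
  private static final String TIMER_ID = "flush";

  @StateId(STATE_ID)
  private final StateSpec<ValueState<Long>> sumSpec = StateSpecs.value(VarLongCoder.of());

  @TimerId(TIMER_ID)
  private final TimerSpec flushSpec = TimerSpecs.timer(TimeDomain.PROCESSING_TIME);

  @ProcessElement
  public void processElement(
      ProcessContext c,
      @StateId(STATE_ID) ValueState<Long> someState,
      @TimerId(TIMER_ID) Timer flushTimer) {
    long val = c.element().getValue();
    long currentSum = Optional.ofNullable(someState.read()).orElse(0L);
    // State access for a given key is serialized by the runner, so two
    // elements with the same key and timestamp update the sum one after
    // the other; no update is lost.
    someState.write(currentSum + val);
    // (Re)arm the flush timer; the interval is an assumption of this sketch.
    flushTimer.offset(Duration.standardMinutes(1)).setRelative();
  }

  @OnTimer(TIMER_ID)
  public void onFlush(OnTimerContext c, @StateId(STATE_ID) ValueState<Long> someState) {
    Long sum = someState.read();
    if (sum != null) {
      c.output(sum); // emit the running sum when the timer fires
    }
  }
}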

Related

Skip a part of the stream when it's stalled

I have a situation where messages are passed through a component that could possibly delay them.
Under stress I'd like to skip this component, so that no more than X messages can be delayed simultaneously. Overflowing messages will skip this stage and move on to the next stage of the stream.
Messages are stalled within this stage till their future is done, or up to one minute, whichever comes first.
I can probably implement a custom GraphStage similar to this buffer example, or use divertTo with some counter to make messages skip the stalled component, but it feels like there might be an easier approach in Akka Streams.
I've been playing around with your use case and came up with a solution based on an Akka actor representing a counter and an asynchronous map stage.
The idea is that up to 3 elements can be processed at any given time, and that a counter with a maximum capacity of 2 allows at most 2 of those elements to be inside the slow component simultaneously.
This way, one processing slot is always reserved for upstream elements that branch around the slow component and reach downstream directly.
Let's first define a basic Counter with a maximum capacity as an Akka Actor:
// Protocol for the counter actor
sealed trait CounterAction
case object TryAndLock extends CounterAction
case object Release extends CounterAction

// A counter with a maximum capacity; being an actor, it serializes
// concurrent lock/release requests through its mailbox.
class Counter(max: Int) extends Actor {
  private var count: Int = 0

  override def receive: Receive = {
    case TryAndLock if count < max =>
      count += 1
      sender() ! true
    case TryAndLock =>
      sender() ! false
    case Release =>
      count -= 1
  }
}

val counter = system.actorOf(Props(new Counter(max = 2)))
It holds a mutable count variable that can be incremented via a TryAndLock request, but only if the count hasn't yet reached the maximum capacity, and can be decremented via a Release request.
We're using an Actor so that concurrent lock and release operations from within the following mapAsync stage are correctly handled without race conditions.
Then it's just a matter of using a mapAsyncUnordered stage with a parallelism one unit above the counter's max capacity.
Any element passing through the asynchronous stage queries the Counter to try to lock a resource. If a resource was locked, the element enters the slow component; if not, it skips it. Elements enter the slow component until the counter's max capacity is reached, at which point any new element is skipped until an element exits the slow component and releases a resource.
We can't simply use mapAsync, because elements would keep their upstream order when exiting the stage, making skipped elements wait for the elements inside the slow component before being produced downstream. Hence the need for mapAsyncUnordered.
Let's define an example with at most 2 elements processed at the same time by the slow component and an asynchronous map whose parallelism is 3:
Source(0 to 15)
  .throttle(1, 50.milliseconds)
  .mapAsyncUnordered(parallelism = 3) { i =>
    // The ask (?) requires `import akka.pattern.ask` plus an implicit
    // akka.util.Timeout and an ExecutionContext in scope.
    (counter ? TryAndLock).mapTo[Boolean].map {
      case true =>
        // Resource locked: run the slow component, then release it.
        val result = slowTask(i)
        counter ! Release
        result
      case false =>
        // Counter at capacity: skip the slow component.
        skipTask(i)
    }
  }
  .runForeach(println)
with, for instance, these two functions simulating the slow component (slowTask) and what to do when skipping it (skipTask):
def slowTask(value: Int): String = {
  val start = Instant.now()
  Thread.sleep(250) // simulate a slow, blocking component
  s"$value - processed - $start - ${Instant.now()}"
}

def skipTask(value: Int): String =
  s"$value - skipped - ${Instant.now()}"
which results in something like:
2 - skipped - 2020-06-03T19:07:19.410Z
3 - skipped - 2020-06-03T19:07:19.468Z
4 - skipped - 2020-06-03T19:07:19.518Z
5 - skipped - 2020-06-03T19:07:19.569Z
1 - processed - 2020-06-03T19:07:19.356Z - 2020-06-03T19:07:19.611Z
0 - processed - 2020-06-03T19:07:19.356Z - 2020-06-03T19:07:19.611Z
8 - skipped - 2020-06-03T19:07:19.719Z
9 - skipped - 2020-06-03T19:07:19.769Z
10 - skipped - 2020-06-03T19:07:19.819Z
6 - processed - 2020-06-03T19:07:19.618Z - 2020-06-03T19:07:19.869Z
12 - skipped - 2020-06-03T19:07:19.919Z
7 - processed - 2020-06-03T19:07:19.669Z - 2020-06-03T19:07:19.921Z
14 - skipped - 2020-06-03T19:07:20.019Z
15 - skipped - 2020-06-03T19:07:20.070Z
11 - processed - 2020-06-03T19:07:19.869Z - 2020-06-03T19:07:20.122Z
13 - processed - 2020-06-03T19:07:19.968Z - 2020-06-03T19:07:20.219Z
where the first part is the index of the upstream element, the second part is the transformation applied to it (processed when entering the slow component, skipped otherwise), and the last part is a timestamp so that we can visualise when things happen.
The first 2 elements entering the stage (0 and 1) are processed by the slow component, and the following elements (2, 3, 4 and 5) skip the slow stage until those first 2 complete and additional elements can enter it. And so on.

Apache Flink - Assign unique id to input

I am loading a CSV file and transforming every line into a POJO using a custom map function. For my program logic I need a unique id for every POJO, from 0 to n (where n is the total number of lines). My question is: can I assign a unique id (for example, the initial line number) to every POJO using a transformation function? The ideal way would be to get an Iterable in a UDF, increment a variable while iterating through the input tuples, and finally output the corresponding POJO. My code currently looks like this:
DataSet<MyType> input = env.readCsvFile("/path/file.csv")
    .includeFields("1111")
    .types(String.class, Double.class, Double.class, Double.class)
    .map(new ParseData());
where ParseData transforms Tuples to the MyType POJOs.
Are there any best practices for achieving this task?
The tricky part is that you run the code in a distributed system, so the parallel instances of your ParseData function run independently of each other.
You can still assign unique IDs by using a local ID counter in ParseData. The trick to avoiding duplicates is initializing and incrementing the counter correctly. Assume a parallelism of four: you get four ParseData instances (call them PD1 ... PD4), and the ID assignment would look like this:
PD1: 0, 4, 8, 12, ...
PD2: 1, 5, 9, 13, ...
PD3: 2, 6, 10, 14, ...
PD4: 3, 7, 11, 15, ...
You accomplish this by initializing the parallel instances with different values (details below) and incrementing the counter in each instance by the parallelism (i.e., ID += parallelism).
In Flink, every instance of a parallel function automatically gets a unique number assigned, the so-called task index. You can use this number to initialize your ID counter. You get the task index via RuntimeContext.getIndexOfThisSubtask() and the operator/function parallelism via RuntimeContext.getNumberOfParallelSubtasks().
https://ci.apache.org/projects/flink/flink-docs-master/api/java/org/apache/flink/api/common/functions/RuntimeContext.html
To get the RuntimeContext, implement ParseData as a RichMapFunction and call getRuntimeContext() in open().
https://ci.apache.org/projects/flink/flink-docs-master/api/java/org/apache/flink/api/common/functions/RichFunction.html
Something like this (showing only the relevant methods):
class ParseData extends RichMapFunction<InputDataType, OutputDataType> {
    private long parallelism;
    private long idCounter;

    @Override
    public void open(Configuration parameters) {
        RuntimeContext ctx = getRuntimeContext();
        parallelism = ctx.getNumberOfParallelSubtasks();
        idCounter = ctx.getIndexOfThisSubtask(); // first ID of this instance
    }

    @Override
    public OutputDataType map(InputDataType value) {
        OutputDataType output = new OutputDataType();
        output.setID(idCounter);
        idCounter += parallelism; // skip the IDs used by the other instances
        // further processing
        return output;
    }
}

No. of paths in integer array

There is an integer array, e.g.:
{3,1,2,7,5,6}
One can move forward through the array either one element at a time or by jumping ahead by the value at the current index. For example, from 3 (index 0) one can go to 1 or jump to 7; from 1 one can only go to 2 (the jump of 1 lands on the same element as the single step); from 2 one can go to 7 or jump to 5; from 7 one can only go to 5, because the index of 7 is 3 and 3 + 7 = 10, and there is no tenth element.
I have to count the number of possible paths from the start index to the end of the array.
I could only come up with a naive recursive solution that runs in exponential time.
Can somebody please help?
My recommendation: use dynamic programming.
If this keyword is sufficient and you want the challenge of finding a solution on your own, don't read any further!
Here is a possible DP algorithm for the example input {3,1,2,7,5,6}; it will be your job to adapt it to the general problem.
Create an array sol of length 6; sol[i] will hold the number of ways to reach the end from index i.
int[] sol = new int[6]; // Java arrays start zero-initialized
sol[5] = 1; // from the last index there is exactly one path: we are done
for (int i = 4; i >= 0; i--) {
    sol[i] = sol[i + 1]; // single step to the next element
    // jump only if it stays in bounds and differs from the single step
    if (i + input[i] < 6 && input[i] != 1) {
        sol[i] += sol[i + input[i]];
    }
}
return sol[0];
Runtime: O(n). For the sample input it returns 3 (the index paths 0→1→2→3→4→5, 0→1→2→4→5 and 0→3→4→5).
As for the directed-graph solution hinted at in the comments: each cell in the array represents a node; add a directed edge from each node to every node reachable from it. Since all edges point forward, there is no directed cycle, and you can count the number of paths by accumulating, for each node in reverse order, the path counts along its outgoing edges. It is, however, a fair amount of boilerplate to actually program; a compact sketch is shown below.
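For illustration, a compact Java sketch of that graph formulation (the class name and the adjacency-list representation are my own choices, not from the original answer):
import java.util.ArrayList;
import java.util.List;

public class GraphPaths {
    public static void main(String[] args) {
        int[] input = {3, 1, 2, 7, 5, 6};
        int n = input.length;

        // One node per cell; edges point to the cells reachable from it.
        List<List<Integer>> adj = new ArrayList<>();
        for (int i = 0; i < n; i++) {
            List<Integer> edges = new ArrayList<>();
            if (i + 1 < n) {
                edges.add(i + 1); // single step
            }
            if (input[i] != 1 && i + input[i] < n) {
                edges.add(i + input[i]); // jump
            }
            adj.add(edges);
        }

        // All edges point forward, so reverse index order is a topological
        // order: every edge target is processed before its sources.
        long[] ways = new long[n];
        ways[n - 1] = 1;
        for (int i = n - 2; i >= 0; i--) {
            for (int target : adj.get(i)) {
                ways[i] += ways[target];
            }
        }
        System.out.println(ways[0]); // prints 3 for this input
    }
}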
Adjusting the recursive solution
Another option is to prune the recursive solution, which is essentially equivalent to the DP algorithm. The exponential time comes from calculating the same values several times. Say the function is recFunc(index): the initial call recFunc(0) calls recFunc(1) and recFunc(3), and so on, but recFunc(3) is bound to be called again later, which leads to repeated recursive calculation. To prune this, add a map holding all already-calculated values. On a call recFunc(x), look up whether x was already calculated: if yes, return the stored value; if not, calculate it, store it and return it. This way you also get O(n). A sketch of this memoized version follows.
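A minimal Java sketch of this memoized recursion, under the same movement rules as the loop above (the class name and constructor are my own, not from the original answer):
import java.util.HashMap;
import java.util.Map;

public class PathCounter {
    private final int[] input;
    private final Map<Integer, Long> memo = new HashMap<>();

    PathCounter(int[] input) {
        this.input = input;
    }

    long recFunc(int index) {
        if (index == input.length - 1) {
            return 1; // reached the end: exactly one completed path
        }
        Long cached = memo.get(index);
        if (cached != null) {
            return cached; // already calculated: return the stored value
        }
        long ways = recFunc(index + 1); // single step
        if (input[index] != 1 && index + input[index] < input.length) {
            ways += recFunc(index + input[index]); // jump
        }
        memo.put(index, ways);
        return ways;
    }

    public static void main(String[] args) {
        // Prints 3 for the example input
        System.out.println(new PathCounter(new int[]{3, 1, 2, 7, 5, 6}).recFunc(0));
    }
}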

How do I prevent a Datalog rule from pruning nulls?

I have the following facts and rules:
% frequents(D,P) % D=drinker, P=pub
% serves(P,B) % B=beer
% likes(D,B)
frequents(janus, godthaab).
frequents(janus, goldenekrone).
frequents(yanai, goldenekrone).
frequents(dimi, schlosskeller).
serves(godthaab, tuborg).
serves(godthaab, carlsberg).
serves(goldenekrone, pfungstaedter).
serves(schlosskeller, fix).
likes(janus, tuborg).
likes(janus, carlsberg).
count_good_beers_for_at(D,P,F) :- group_by((frequents(D,P), serves(P,B), likes(D,B)),[D,P],(F = count)).
possible_beers_served_for_at(D,P,B) :- lj(serves(P,B), frequents(D,R), P=R).
Now I would like to construct a rule that should work like a predicate returning "true" when the number of available "liked" beers at each pub that a "drinker" "frequents" is bigger than 0.
I would consider the predicate true when the rule returns no tuples. If the predicate is false, I was planning to make it return the bars not having a single "liked" beer.
As you can see, I already have a rule counting the good beers for a given drinker at a given pub. I also have a rule giving me the number of servable beers.
DES> count_good_beers_for_at(A,B,C)
{
count_good_beers_for_at(janus,godthaab,2)
}
Info: 1 tuple computed.
As you can see, the counter doesn't return the pubs frequented but having 0 liked beers. I was planning to work around this by using a left outer join.
DES> is_happy_at(D,P,Z) :- lj(serves(P,B), count_good_beers_for_at(D,Y,Z), (Y=P))
Info: Processing:
is_happy_at(D,P,Z) :-
lj(serves(P,B),count_good_beers_for_at(D,Y,Z),Y = P).
{
is_happy_at(janus,godthaab,2),
is_happy_at(null,goldenekrone,null),
is_happy_at(null,schlosskeller,null)
}
Info: 3 tuples computed.
This is almost right, except it is also giving me the pubs not frequented. I try adding an extra condition:
DES> is_happy_at(D,P,Z) :- lj(serves(P,B), count_good_beers_for_at(D,Y,Z), (Y=P)), frequents(D,P)
Info: Processing:
is_happy_at(D,P,Z) :-
lj(serves(P,B),count_good_beers_for_at(D,Y,Z),Y = P),
frequents(D,P).
{
is_happy_at(janus,godthaab,2)
}
Info: 1 tuple computed.
Now I have somehow filtered away everything containing nulls! I suspect this is due to the null-value logic in DES.
I recognize that I might be approaching this whole problem in a wrong way. Any help is appreciated.
EDIT: Assignment is "very_happy(D) ist wahr, genau dann wenn jede Bar, die Trinker D besucht, wenigstens ein Bier ausschenkt, das er mag." which translates to "very_happy(D) is true, iff each bar drinker D visits, serves at least 1 beer, that he likes". Since this assignment is about Datalog, I would think it is definitely possible to solve without using Prolog.
I think that for your assignment you should use basic Datalog, without abusing aggregates. The point of the question is how to express universally quantified conditions. I googled for 'universal quantification datalog', and at the first position I found deductnotes.pdf, which asserts:
A universally quantified condition can only be expressed by an equivalent condition with existential quantification and negation.
In that PDF you will also find a useful example (pages 9 and 10).
Thus we must rephrase our question. I ended up with this code:
not_happy(D) :-
    frequents(D, P),
    likes(D, B),
    not(serves(P, B)).

very_happy(D) :-
    likes(D, _),
    not(not_happy(D)).
which seems to be what's required:
DES> very_happy(D)
{
}
Info: 0 tuple computed.
Note the likes(D, _): it is required to avoid yanai and dimi being listed as very_happy, since there is no explicit assertion of what they like.
EDIT: I'm sorry, but the above solution doesn't work. I've rewritten it this way:
likes_pub(D, P) :-
    likes(D, B),
    serves(P, B).

unhappy(D) :-
    frequents(D, P),
    not(likes_pub(D, P)).

very_happy(D) :-
    likes(D, _),
    not(unhappy(D)).
test:
DES> unhappy(D)
{
unhappy(dimi),
unhappy(janus),
unhappy(yanai)
}
Info: 3 tuples computed.
DES> very_happy(D)
{
}
Info: 0 tuples computed.
Now we add a fact:
serves(goldenekrone, tuborg).
and we can see the corrected code's outcome:
DES> unhappy(D)
{
unhappy(dimi),
unhappy(yanai)
}
Info: 2 tuples computed.
DES> very_happy(D)
{
very_happy(janus)
}
Info: 1 tuple computed.
Maybe not the answer you are expecting, but you can use ordinary Prolog and easily do group-by queries with the bagof/3 or setof/3 builtin predicates.
?- bagof(B,(frequents(D,P), serves(P,B), likes(D,B)),L), length(L,N).
D = janus,
P = godthaab,
L = [tuborg,carlsberg],
N = 2
The semantics of bagof/3 is such that it does not compute an outer join for the given query. The query is executed normally by Prolog; the results are accumulated and key-sorted first, and then returned on backtracking. If your Datalog cannot do without nulls, then yes, you have to filter.
But you don't need aggregates when you only want to know whether a liked beer exists. You can do it directly via a query:
is_happy_at(D,P) :- frequents(D,P), once((serves(P,B), likes(D,B))).
?- is_happy_at(D,P).
D = janus,
P = godthaab ;
Nein
The once/1 prevents unnecessary backtracking. A Datalog system might avoid that backtracking automatically when it sees the projection in is_happy_at/2 (B is projected away); otherwise you might need the equivalent of SQL DISTINCT, or your Datalog may provide something corresponding to SQL EXISTS, which matches once/1 most closely.

Finding whether a value is equal to the value of any array element in MATLAB

Can anyone tell me if there is a way (in MATLAB) to check whether a certain value is equal to any of the values stored within another array?
The way I intend to use it is to check whether an element's index in one matrix is equal to any of the values stored in another array (where the stored values are the indices of the elements that meet a certain criterion).
So, if the indices of the elements which meet the criteria are stored in the matrix below:
criteriacheck = [3 5 6 8 20];
Going through the main array (called array) and checking if the index matches:
for i = 1:numel(array)
    if i == 'Any value stored in criteriacheck'
        %# "Do this"
    end
end
Does anyone have an idea of how I might go about this?
The excellent answer previously given by @woodchips applies here as well:
Many ways to do this. ismember is the first that comes to mind, since it is a set membership action you wish to take. Thus
X = primes(20);
ismember([15 17],X)
ans =
0 1
Since 15 is not prime, but 17 is, ismember has done its job well here.
Of course, find (or any) will also work. But these are not vectorized in the sense that ismember is. We can test whether 15 is in the set represented by X, but testing both numbers takes a loop or successive tests.
~isempty(find(X == 15))
~isempty(find(X == 17))
or,
any(X == 15)
any(X == 17)
Finally, I would point out that tests for exact values are dangerous if the numbers may be true floats. Tests against integer values as I have shown are easy. But tests against floating point numbers should usually employ a tolerance.
tol = 10*eps;
any(abs(X - 3.1415926535897932384) <= tol)
You could use the find command:
if ~isempty(find(criteriacheck == i))
    % do something
end
Note: Although this answer doesn't address the question in the title, it does address a more fundamental issue with how you are designing your for loop (the solution of which negates having to do what you are asking in the title). ;)
Based on the for loop you've written, your array criteriacheck appears to be a set of indices into array, and for each of these indexed elements you want to do some computation. If this is so, here's an alternative way for you to design your for loop:
for i = criteriacheck
    %# Do something with array(i)
end
This will loop over all the values in criteriacheck, setting i to each subsequent value (i.e. 3, 5, 6, 8, and 20 in your example). This is more compact and efficient than looping over each element of array and checking if the index is in criteriacheck.
NOTE: As Jonas points out, you want to make sure criteriacheck is a row vector for the for loop to function properly. You can form any matrix into a row vector by following it with the (:)' syntax, which reshapes it into a column vector and then transposes it into a row vector:
for i = criteriacheck(:)'
...
The original question "Can anyone tell me if there is a way (in MATLAB) to check whether a certain value is equal to any of the values stored within another array?" can be solved without any loop.
Just use the setdiff function.
I think the INTERSECT function is what you are looking for.
C = intersect(A,B) returns the values common to both A and B. The
values of C are in sorted order.
http://www.mathworks.de/de/help/matlab/ref/intersect.html
The question if i == 'Any value stored in criteriacheck' can also be answered this way if you consider i a trivial matrix. However, you are probably better off with any(i == criteriacheck).
