Skip a part of the stream when it's stalled - akka-stream

I have a situation where messages are passed through a component that could possibly delay them.
Under stress I'd like to skip this component, so that no more than X messages can be delayed simultaneously. Overflowing messages will skip this stage and move to the next stage of the stream.
Messages are stalled within this stage till their future is done, or up to one minute, whichever comes first.
I can probably implement a custom GraphStage similar to this buffer example, or use divertTo with some counter to make messages skip the stalled component, but it feels like there might be an easier approach in Akka Streams.

I've been playing around with your use case and came up with a solution based on an Akka Actor representing a counter and an asynchronous map stage:
The idea is that up to 3 elements can be in the stage at any given time and that, based on a counter whose max capacity is 2, at most 2 of these elements can simultaneously be processed by the slow component.
This way, there's always one thread of processing reserved to upstream elements that would be branched out from the slow component and directly reach downstream.
Let's first define a basic Counter with a maximum capacity as an Akka Actor:
class Counter(max: Int) extends Actor {
  private var count: Int = 0

  override def receive: Receive = {
    case TryAndLock if count < max =>
      count += 1
      sender ! true
    case TryAndLock =>
      sender ! false
    case Release =>
      count -= 1
  }
}
sealed trait CounterAction
case object TryAndLock extends CounterAction
case object Release extends CounterAction
val counter = system.actorOf(Props(new Counter(max = 2)))
It holds a mutable count variable that can be incremented via a TryAndLock request, but only if the count hasn't yet reached the maximum capacity, and can be decremented via a Release request.
We're using an Actor so that concurrent lock and release operations from within the following mapAsync stage are correctly handled without race conditions.
Then, it's just a matter of using a mapAsyncUnordered stage with a parallelism just 1 unit above the counter's max capacity.
Any element passing through the asynchronous stage will query the Counter to try and lock a resource. If a resource's been locked, then the element will be thrown into the slow component. If not, it will skip it. Elements are passed into the slow component until we reach the counter's max capacity, at which point any new element will be skipped until an element exits the slow component and releases a resource from the counter.
We can't simply use a mapAsync because elements would keep their upstream order when exiting the stage, making skipped elements wait for elements processed by the slow component to finish before being produced downstream. Hence the need to use mapAsyncUnordered instead.
Let's define an example with at most 2 elements processed at the same time by the slow component and an asynchronous map whose parallelism is 3:
Source(0 to 15)
  .throttle(1, 50.milliseconds)
  .mapAsyncUnordered(parallelism = 3) { i =>
    (counter ? TryAndLock).map {
      case locked: Boolean if locked =>
        val result = slowTask(i)
        counter ! Release
        result
      case _ =>
        skipTask(i)
    }
  }
  .runForeach(println)
with, for instance, these two functions that simulate the slow component (slowTask) and what to do when skipping it (skipTask):
def slowTask(value: Int): String = {
  val start = Instant.now()
  Thread.sleep(250)
  s"$value - processed - $start - ${Instant.now()}"
}

def skipTask(value: Int): String =
  s"$value - skipped - ${Instant.now()}"
which results in something like:
2 - skipped - 2020-06-03T19:07:19.410Z
3 - skipped - 2020-06-03T19:07:19.468Z
4 - skipped - 2020-06-03T19:07:19.518Z
5 - skipped - 2020-06-03T19:07:19.569Z
1 - processed - 2020-06-03T19:07:19.356Z - 2020-06-03T19:07:19.611Z
0 - processed - 2020-06-03T19:07:19.356Z - 2020-06-03T19:07:19.611Z
8 - skipped - 2020-06-03T19:07:19.719Z
9 - skipped - 2020-06-03T19:07:19.769Z
10 - skipped - 2020-06-03T19:07:19.819Z
6 - processed - 2020-06-03T19:07:19.618Z - 2020-06-03T19:07:19.869Z
12 - skipped - 2020-06-03T19:07:19.919Z
7 - processed - 2020-06-03T19:07:19.669Z - 2020-06-03T19:07:19.921Z
14 - skipped - 2020-06-03T19:07:20.019Z
15 - skipped - 2020-06-03T19:07:20.070Z
11 - processed - 2020-06-03T19:07:19.869Z - 2020-06-03T19:07:20.122Z
13 - processed - 2020-06-03T19:07:19.968Z - 2020-06-03T19:07:20.219Z
where the first part is the index of the upstream element, the second part is the transformation applied to the element (either processed, when entering the slow component, or skipped) and the last part is a timestamp so that we can visualise when things are happening.
The first 2 elements entering the stage (0 and 1) are processed by the slow component, and a bunch of following elements (2, 3, 4 and 5) skip the slow stage until those first 2 elements complete and additional elements can enter the slow stage. And so on.

Related

Issue managing state of an operator for messages with the same timestamp using Flink

I seem to be encountering an issue while managing state during parallel processing of messages with the same key and the same timestamp.
For a (simplified) example, let's suppose a simple DoFn<Key, Long> with a ValueState<Long>.
@ProcessElement
public void processElement(
    final ProcessContext c,
    @StateId(STATE_ID) final ValueState<Long> someState) {
  final long val = c.element().value();
  final long currentSum = Optional.ofNullable(someState.read()).orElse(0L);
  final long newSum = currentSum + val;
  someState.write(newSum);
}
This state is outputted intermittently on some timer.
My question concerns the case where there are 2 elements A and B which both have the same key and the same timestamp. The value in state is 5, the value of A is 3, and the value of B is 4. One would expect the value in someState after processing both elements to be 12. Is this guaranteed? That is, will some ordering be applied to elements A and B, or does there possibly exist a race condition whereby the value in state can end up being either 8 or 9 depending on whether A or B is processed first? (Note: this is a simplified version of what I am dealing with, but I believe this nondeterministic behavior is happening in our pipeline.) If this assumption is correct, what approaches could I take to resolve this issue?
Thanks everyone in advance!
There is no race condition caused by identical timestamps. You'll get called twice, once for each element.

Delete and sort elements in a object [duplicate]

I have a list of strings and I want to keep only the most unique strings. Here is how I have implemented this (maybe there's an issue with the loop),
def filter_descriptions(descriptions):
    MAX_SIMILAR_ALLOWED = 0.6  # 40% unique and 60% similar
    i = 0
    while i < len(descriptions):
        print("Processing {}/{}...".format(i + 1, len(descriptions)))
        desc_to_evaluate = descriptions[i]
        j = i + 1
        while j < len(descriptions):
            similarity_ratio = SequenceMatcher(None, desc_to_evaluate, descriptions[j]).ratio()
            if similarity_ratio > MAX_SIMILAR_ALLOWED:
                del descriptions[j]
            j += 1
        i += 1
    return descriptions
Please note that the list might have around 110K items which is why I am shortening the list every iteration.
Can anyone please identify what is wrong with this current implementation?
Edit 1:
The current results are "too similar". The filter_descriptions function returned 16 items (from a list of ~110K items). When I tried the following,
SequenceMatcher(None, descriptions[0], descriptions[1]).ratio()
The ratio was 0.99, and with SequenceMatcher(None, descriptions[1], descriptions[2]).ratio() it was around 0.98. But with SequenceMatcher(None, descriptions[0], descriptions[15]).ratio() it was around 0.65 (which is better)
I hope this helps.
If you invert your logic, you can escape having to modify the list in place and still reduce the number of comparisons needed. That is, start with an empty output/unique list and iterate over your descriptions, seeing if you can add each one. So the first description can be added immediately as it cannot be similar to anything in an empty list. The second description only needs to be compared to the first, as opposed to all other descriptions. Later iterations can short circuit as soon as they find a previous description with which they are similar (the candidate description is then discarded). I.e.:
import operator

def unique(items, compare=operator.eq):
    # compare is a function that returns True if its two arguments are deemed similar
    # to each other and False otherwise.
    unique_items = []
    for item in items:
        if not any(compare(item, uniq) for uniq in unique_items):
            # any will stop as soon as compare(item, uniq) returns True
            # you could also use `if all(not compare(item, uniq) ...)` if you prefer
            unique_items.append(item)
    return unique_items
Examples:
assert unique([2,3,4,5,1,2,3,3,2,1]) == [2, 3, 4, 5, 1]
# note that order is preserved
assert unique([1, 2, 0, 3, 4, 5], compare=(lambda x, y: abs(x - y) <= 1)) == [1, 3, 5]
# using a custom comparison function we can exclude items that are too similar to previous
# items. Here 2 and 0 are excluded because they are too close to 1 which was accepted
# as unique first. Change the order of 3 and 4, and then 5 would also be excluded.
With your code your comparison function would look like:
MAX_SIMILAR_ALLOWED = 0.6  # 40% unique and 60% similar

def description_cmp(candidate_desc, unique_desc):
    # use unique_desc as first arg as this keeps the argument order the same as with your filter
    # function where the first description is the one that is retained if the two descriptions
    # are deemed to be too similar
    similarity_ratio = SequenceMatcher(None, unique_desc, candidate_desc).ratio()
    return similarity_ratio > MAX_SIMILAR_ALLOWED
def filter_descriptions(descriptions):
    # This would be the new definition of your filter_descriptions function
    return unique(descriptions, compare=description_cmp)
The number of comparisons should be exactly the same. That is, in your implementation the first element is compared to all the others, and the second element is only compared to elements that were deemed not similar to the first element and so on. In this implementation the first item is not compared to anything initially, but all other items must be compared to it to be allowed to be added to the unique list. Only items deemed not similar to the first item will be compared to the second unique item, and so on.
The unique implementation will do less copying as it only has to copy the unique list when the backing array runs out of space. Whereas, with the del statement parts of the list must be copied each time it is used (to move all subsequent items into their new correct position). This will likely have a negligible impact on performance though, as the bottleneck is probably the ratio calculation in the sequence matcher.
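As a quick sanity check (the sample strings below are made up purely for illustration), the unique-based filter_descriptions behaves like this:
from difflib import SequenceMatcher  # needed by description_cmp above

# Made-up sample data, purely illustrative.
sample = ["the quick brown fox", "the quick brown fox!", "completely unrelated 12345"]

print(filter_descriptions(sample))
# The second string is almost identical to the first (ratio well above 0.6),
# so only the first and third strings should remain:
# ['the quick brown fox', 'completely unrelated 12345']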
The problem with your logic is that each time you delete an item from the array, the indices get re-arranged and you skip a string in between. E.g.:
Assume that this is the array:
Description : ["A","A","A","B","C"]
Iteration 1:
i = 0
description[i] = "A"
j = i + 1 = 1
description[j] = "A"
similarity_ratio > 0.6
del description[j]
Now the array is re-indexed like:
Description: ["A","A","B","C"]. The next step is:
j = j + 1 = 2
Description[2] = "B"
You have skipped Description[1] = "A"
To fix this: replace j += 1 with j = i + 1 if an item was deleted, else do the normal j = j + 1 iteration.
The value of j should not change when an item from the list is deleted (since a different list item will be present on that spot in the next iteration). Doing j=i+1 restarts the iteration every time an item is deleted (which is not what is desired). The updated code now only increments j in the else condition.
def filter_descriptions(descriptions):
    MAX_SIMILAR_ALLOWED = 0.6  # 40% unique and 60% similar
    i = 0
    while i < len(descriptions):
        print("Processing {}/{}...".format(i + 1, len(descriptions)))
        desc_to_evaluate = descriptions[i]
        j = i + 1
        while j < len(descriptions):
            similarity_ratio = SequenceMatcher(None, desc_to_evaluate, descriptions[j]).ratio()
            if similarity_ratio > MAX_SIMILAR_ALLOWED:
                del descriptions[j]
            else:
                j += 1
        i += 1
    return descriptions

Number of events in one array within w minutes after any event in a second array

I have two sorted arrays of unix timestamps (so integers representing times at which some events happen). Let's call the arrays ts1 and ts2. I want to find the number of events in ts1 that lie within w minutes after any event in ts2. Let's say the method signature is (take the first and second arrays and a window size, then return the number of events in ts1 that are within w minutes after any event in ts2):
critical_events(ts1,ts2,w)->int
Here are some test cases:
## Test cases.
ev = critical_events([.5,1.5,2.5],[1,2,3],.5)
print(ev==0)
ev = critical_events([1.4,1.4,2.7],[1,2,3],.5)
print(ev==2)
ev = critical_events([1.4,2.4,3.4],[1,2,3],.5)
print(ev==3)
I expect the length of the first array, n, to be much larger than the length of the second one, m. I'm looking for algorithms that are efficient in terms of time and space and, if possible, their average and worst case complexities in terms of n and m.
My attempt: instead of explaining my attempts, I'll just link to the code which should be self-explanatory (or at least better than what I can do in words): https://gist.github.com/ryu577/fdc22af4ed17d122a6aa25684597745b
You are showing them as sorted, so my assumption is they are (they need to be for this to work).
Because your first array is much larger than your second, you iterate over the second one in a for loop.
I am using example test case 2: ev = critical_events([1.4,1.4,2.7],[1,2,3],.5)
Next you do a binary search for the first element of ts2 + interval (1 + 0.5) = 1.5.
Your startIndex is 0 and endIndex is 2, so in the first compare you consider all elements.
Doing the binary search will result in index 2 in ts1. Note: because you can have equal elements in your array, you need to go right until you get a higher number. What you can tell now is that 2.7 (and all elements after it, if there were any) lies after 1.5. Count is ts1.length - foundIndex.
Now you can set your start index to 2, because you know all elements to the left of this index are smaller and will not lie after 1.5.
You take element 2 and do a binary search; you will find index 2 (2.5 < 2.7). Again:
Count = Count + ts1.length - foundIndex.
To my knowledge, this is the fastest method. I believe the running time is O(m·log n).
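Here is a rough Python sketch of the approach described above; bisect performs the binary search, and the names other than critical_events are mine. The boundary handling follows the walkthrough (count elements strictly greater than ts2[j] + w, advancing the start index each time):
from bisect import bisect_right

def critical_events(ts1, ts2, w):
    count = 0
    start = 0
    for t in ts2:
        # First index in ts1 whose value is strictly greater than t + w
        # (bisect_right also covers the "equal elements, go right" note above).
        idx = bisect_right(ts1, t + w, lo=start)
        count += len(ts1) - idx
        start = idx  # everything left of idx cannot lie after t + w
    return count

print(critical_events([1.4, 1.4, 2.7], [1, 2, 3], .5))  # 2, as in the walkthrough above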

No. of paths in integer array

There is an integer array, for eg.
{3,1,2,7,5,6}
One can move forward through the array either one element at a time or by jumping ahead by the value at the current index. For example, one can go from 3 to 1 or from 3 to 7; then one can go from 1 to 2 only (jumping by 1 lands on the same element as stepping); then one can go from 2 to 7 or from 2 to 5; then one can go from 7 to 5 only, because the index of 7 is 3 and adding 7 to 3 gives 10, and there is no tenth element.
I have to only count the number of possible paths to reach the end of the array from start index.
I could only do it recursively and naively which runs in exponential time.
Can somebody please help?
My recommendation: use dynamic programming.
If this keyword is sufficient and you want the challenge of finding a possible solution on your own, don't read any further!
Here is a possible DP algorithm for the example input {3,1,2,7,5,6}. It will be your job to adjust it to the general problem.
Create an array sol of length 6 with just zeros in it; sol[i] will hold the number of ways to reach the end from index i.
sol[5] = 1;                                // already at the last index
for (i = 4; i >= 0; i--) {
    sol[i] = sol[i + 1];                   // move forward one element
    if (i + input[i] < 6 && input[i] != 1) // jump by input[i]; != 1 avoids counting the same move twice
        sol[i] += sol[i + input[i]];
}
return sol[0];
runtime: O(n)
As for the directed graph solution hinted at in the comments:
Each cell in the array represents a node. Make a directed edge from each node to the nodes accessible from it. You can then count the number of ways more easily by just looking at the out-degrees of the nodes (since there is no directed cycle); however it is a lot of boilerplate to actually program it.
Adjusting the recursive solution
Another solution would be pruning via memoization. This is basically equivalent to the DP algorithm. The exponential time comes from the fact that you calculate the same values several times. E.g. the function is recFunc(index). The initial call recFunc(0) calls recFunc(1) and recFunc(3) and so on. However, recFunc(3) is bound to be called again at some point, which leads to repeated recursive calculation. To prune this you add a map that holds all already-calculated values. When you make a call recFunc(x), you look up in the map whether x was already calculated. If yes, return the stored value. If not, calculate it, store it and return it. This way you get O(n) too.
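A minimal Python sketch of that memoized recursion (function and variable names are mine; the input is the example from the question):
from functools import lru_cache

def count_paths(arr):
    n = len(arr)

    @lru_cache(maxsize=None)           # the map of already-calculated values
    def rec(i):
        if i == n - 1:                 # reached the last index
            return 1
        ways = rec(i + 1)              # move forward one element
        jump = i + arr[i]
        if arr[i] != 1 and jump < n:   # jump, avoiding double-counting a jump of 1
            ways += rec(jump)
        return ways

    return rec(0)

print(count_paths([3, 1, 2, 7, 5, 6]))  # prints 3 for the question's example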

find and delete lines in file python 3

I use Python 3.
Okay, I got a file that looks like this:
id:1
1
34
22
52
id:2
1
23
22
31
id:3
2
12
3
31
id:4
1
21
22
11
how can I find and delete only this part of the file?
id:2
1
23
22
31
I have been trying a lot to do this but can't get it to work.
Is the id used for the decision to delete the sequence, or is the list of values used for the decision?
You can build a dictionary where the id number is the key (converted to int because of the later sorting) and the following lines are converted to the list of strings that is the value for the key. Then you can delete the item with the key 2, traverse the items sorted by the key, and output the id:key line plus the formatted list of strings.
Or you can build a list of lists where the order is preserved. If the sequence of the ids is to be preserved (i.e. not renumbered), you can also remember the id:n in the inner list.
This can be done for a reasonably sized file. If the file is huge, you should copy the source to the destination and skip the unwanted sequence on the fly. That last approach is fairly easy even for a small file.
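A minimal sketch of the dictionary variant (the function name and file names are mine, purely illustrative):
def remove_section(del_id, fname_in, fname_out):
    # Parse the file into {id number: [value lines]}.
    sections = {}
    current = None
    with open(fname_in) as fin:
        for line in fin:
            line = line.rstrip('\n')
            if line.startswith('id:'):
                current = int(line[3:])      # int key, used for sorting later
                sections[current] = []
            elif current is not None:
                sections[current].append(line)

    # Drop the unwanted id and write the rest back in key order.
    sections.pop(del_id, None)
    with open(fname_out, 'w') as fout:
        for num in sorted(sections):
            fout.write('id:{}\n'.format(num))
            for value in sections[num]:
                fout.write(value + '\n')

remove_section(2, 'data.txt', 'output.txt')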
[added after the clarification]
I recommend learning the following approach that is useful in many such cases. It uses a so-called finite automaton that implements actions bound to transitions from one state to another (see Mealy machine).
The text line is the input element here. The nodes that represent the context status are numbered here. (My experience is that it is not worth giving them names -- keep them just stupid numbers.) Here only two states are used and the status could easily be replaced by a boolean variable. However, if the case becomes more complicated, it leads to the introduction of another boolean variable, and the code becomes more error prone.
The code may look very complicated at first, but it is fairly easy to understand when you know that you can think about each "if status == number" separately. This is the mentioned context that captures the previous processing. Do not try to optimize; leave the code that way. It can actually be human-decoded later, and you can draw a picture similar to the Mealy machine example. If you do, it is much more understandable.
The wanted functionality is a bit generalized -- a set of ignored sections can be passed as the first argument:
import re

def filterSections(del_set, fname_in, fname_out):
    '''Filtering out the del_set sections from fname_in. Result in fname_out.'''

    # The regular expression was chosen for detecting and parsing the id-line.
    # It can be done differently, but I consider it just fine and efficient.
    rex_id = re.compile(r'^id:(\d+)\s*$')

    # Let's open the input and output file. The files will be closed
    # automatically.
    with open(fname_in) as fin, open(fname_out, 'w') as fout:
        status = 1                   # initial status -- expecting the id line
        for line in fin:
            m = rex_id.match(line)   # get the match object if it is the id-line

            if status == 1:          # skipping the non-id lines
                if m:                # you can also write "if m is not None:"
                    num_id = int(m.group(1))   # get the numeric value of the id
                    if num_id in del_set:      # if this id should be deleted
                        status = 1             # or pass (to stay in this status)
                    else:
                        fout.write(line)       # copy this id-line
                        status = 2             # to copy the following non-id lines
                # else ignore this line (no code needed to ignore it :)

            elif status == 2:        # copy the non-id lines
                if m:                # the id-line found
                    num_id = int(m.group(1))   # get the numeric value of the id
                    if num_id in del_set:      # if this id should be deleted
                        status = 1             # or pass (to stay in this status)
                    else:
                        fout.write(line)       # copy this id-line
                        status = 2             # to copy the following non-id lines
                else:
                    fout.write(line)           # copy this non-id line


if __name__ == '__main__':
    filterSections({1, 3}, 'data.txt', 'output.txt')
    # or you can write the older set([1, 3]) for the first argument.
Here the output id-lines were given their original numbers. If you want to renumber the sections, it can be done via a simple modification. Try the code and ask for details.
Beware, finite automata have limited power. They cannot be used for parsing the usual programming languages as they are not able to capture nested paired structures (like parentheses).
P.S. The 7000 lines is actually a tiny file from a computer perspective ;)
Read each line into an array of strings (the index number is the line number - 1). Check whether the line equals "id:2" before you add it; if it does, stop adding lines until the line equals "id:3". After reading the whole file, clear the file and write the array back to it. This may not be the most efficient way but should work.
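Something like this minimal sketch, assuming the same data.txt layout as above (the variable names are mine):
kept = []
skipping = False
with open('data.txt') as f:
    for line in f:
        if line.strip() == 'id:2':    # start skipping at the unwanted section
            skipping = True
        elif line.strip() == 'id:3':  # next section starts, keep lines again
            skipping = False
        if not skipping:
            kept.append(line)

# "Clear the file" and write the kept lines back.
with open('data.txt', 'w') as f:
    f.writelines(kept)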
If there aren't any values in between that would interfere, this would work:
import fileinput
...

def deleteIdGroup(number):
    deleted = False
    for line in fileinput.input("testid.txt", inplace=1):
        line = line.strip('\n')
        if line.count("id:" + number):  # > 0
            deleted = True
        elif line.count("id:"):         # > 0
            deleted = False
        if not deleted:
            print(line)
EDIT:
sorry, this deletes id:2 and id:20 ... you could modify it so that the first if checks line == "id:" + number instead.
