Equivalent of DataSet groupBy/withPartitioner for DataStream - flink-streaming

Previously with a DataSet I could do a .groupBy(...) followed by a .withPartitioner(...) to create groups such that one group (known to be much, much bigger than all the others) would be assigned to its own slot, and the other groups would be distributed among the remaining slots.
In switching to a DataStream, I don't see any straightforward way to do the same thing. If I dig into .keyBy(...), I see it using a PartitionTransformation with a KeyGroupStreamPartitioner, which is promising - but PartitionTransformation is an internal-use only class (or so the annotation says).
What's the recommended approach with a DataStream for achieving the same result?

With DataStream it's not as straightforward. You can implement a custom Partitioner and use it with partitionCustom, but then you don't have a KeyedStream, and so cannot use keyed state or timers.
Another solution is to do a two-step, local/global aggregation, e.g.,
.keyBy(randomizedKey).process(local).keyBy(key).process(global)
And in some cases, the first level of random keying isn't necessary (if the keys are already well distributed among the source partitions).
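Concretely, the two-step pattern looks roughly like the sketch below. SaltedEvent, LocalPreAggregate, and GlobalAggregate are placeholder names for your own classes, not Flink API; the salt is attached once in a map() so the key selectors themselves stay deterministic.
int numSalts = 16; // assumption: spread each hot key across 16 sub-keys
DataStream<SaltedEvent> salted = events
    .map(e -> new SaltedEvent(e, ThreadLocalRandom.current().nextInt(numSalts)));

salted
    // Step 1: pre-aggregate per (original key, salt), so a hot key is split across sub-tasks.
    .keyBy(se -> se.getKey() + "#" + se.getSalt())
    .process(new LocalPreAggregate())
    // Step 2: combine the partial results on the original key.
    .keyBy(partial -> partial.getKey())
    .process(new GlobalAggregate());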
In principle, given a priori knowledge of the hot keys, you should be able to somehow implement a KeySelector that does a good job of balancing the load among the task slots. I believe one or two people have actually done this (by brute force searching for a suitable mapping from original keys to actual keys), but I don't have a reference implementation at hand.

As David noted, you can sometimes use the double-keyBy trick (initially keying on a random value) to reduce the impact of key skew. In my case that wasn't viable: I process the records in each group with a large deep learning network that has significant memory requirements, and the first (randomized) grouping would require every sub-task to have all of the models loaded at the same time.
I re-used a technique I'd gotten to work with an older version of Flink, where you decide which sub-task (operator index) should get each record, and then calculate a key that Flink will assign to the target sub-task. The code, which calculates an Integer key, looks something like:
public static Integer makeKeyForOperatorIndex(int maxParallelism, int parallelism,
        int operatorIndex) {
    // Brute-force search for an Integer key that Flink's key-group assignment
    // maps to the requested operator (sub-task) index.
    for (int i = 0; i < maxParallelism * 2; i++) {
        Integer key = Integer.valueOf(i);
        int index = KeyGroupRangeAssignment.assignKeyToParallelOperator(
                key, maxParallelism, parallelism);
        if (index == operatorIndex) {
            return key;
        }
    }

    throw new RuntimeException(String.format(
        "Unable to find key for target operator index %d (max parallelism = %d, parallelism = %d)",
        operatorIndex, maxParallelism, parallelism));
}
But note this is very fragile, as it depends on internal Flink implementation details.
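For completeness, usage looks roughly like this; chooseSubTaskFor() and ModelScoringFunction are placeholders for your own routing logic and process function.
int maxParallelism = 128;
int parallelism = 4;

// Pre-compute one surrogate key per sub-task index.
Integer[] keyForSubTask = new Integer[parallelism];
for (int i = 0; i < parallelism; i++) {
    keyForSubTask[i] = makeKeyForOperatorIndex(maxParallelism, parallelism, i);
}

// Decide which sub-task each record should land on (e.g. the huge group gets its
// own slot), then key on that sub-task's surrogate key.
stream
    .keyBy(record -> keyForSubTask[chooseSubTaskFor(record)])
    .process(new ModelScoringFunction());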

Related

Flink: handle skew by partitioning by a field of the key

I have skew when I keyBy on my data. Let's say the key is:
case class MyKey(x: X, y:Y)
To solve this I am thinking of adding an extra field that would make distribution even among the workers by using this field only for partitioning:
case class MyKey(z: evenlyDistributedField, x: X, y: Y) extends MyKey(x, y) {
    override def hashCode(): Int = z.hashCode
}
Because of this overridden hashCode, my records will be distributed evenly across the workers, while later stateful operators can still use the original equals method (which considers only the X and Y fields) to find the proper keyed state.
I know that the same (X, Y) pairs will end up on different workers, but I can handle that later, after doing the necessary processing with my new key to avoid the skew.
My question is: where else is the hashCode method of the key used?
I suspect it is certainly used when getting keyed state (what is a namespace, by the way?), as I saw extending classes use the key in a hashMap to get the state for that key. I know that retrieving the KeyedState from the map will be slower, since the hashCode does not consider the X and Y fields. But is there any other place in the Flink code that uses the hashCode method of the key?
Is there any other way to solve this? I thought of physical partitioning, but then I cannot use keyBy afterwards, as far as I know.
SUMMING UP, I WANT TO:
partition my data in each worker randomly to produce an even distribution
[EDITED] do a .window().aggregate() in each partition independently of the others (as if they don't exist). The data in each window aggregate should be keyed on the (X, Y)s of that partition, ignoring the same (X, Y) keys in other partitions.
merge the conflicts caused by the same (X, Y) pairs appearing in different partitions later (I don't need guidance on this; I'll just do a new keyBy on (X, Y))
In this situation I usually create a transient Tuple2<MyKey, Integer>, where I fill in the Tuple.f1 field with whatever I want to use to partition by. The map or flatMap operation following the .keyBy() can emit MyKey. That avoids mucking with MyKey.hashCode().
And note that having a different set of fields for the hashCode() vs. equals() methods leads to pain and suffering. Java has a contract that says "equals consistency: objects that are equal to each other must return the same hashCode".
[updated]
If you can't offload a significant amount of unkeyed work, then what I would do is...
Set the Integer in the Tuple2<MyKey, Integer> to be hashCode(MyKey) % <operator parallelism * factor>. Assuming your parallelism * factor is high enough, you'll only get a few cases of 2 (or more) of the groups going to the same sub-task.
In the operator, use MapState<MyKey, value> to store state. You'll need this since you'll get multiple unique MyKey values going to the same keyed group.
Do your processing and emit a MyKey from this operator.
By using hashCode(MyKey) % some value, you should get a pretty good mix of unique MyKey values going to each sub-task, which should mitigate skew. Of course if one value dominates, then you'll need another approach, but since you haven't mentioned this I'm assuming it's not the case.
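A rough sketch of those steps; MyValue, MyGroupedAggregator and the surrounding record types are placeholders for your own classes, not anything from the question or Flink itself.
int parallelism = 8;   // operator parallelism
int factor = 4;        // over-partitioning factor

records
    // Wrap the real key in a transient Tuple2; f1 = hashCode(MyKey) % (parallelism * factor).
    // Math.floorMod keeps the result non-negative for negative hash codes.
    .map(r -> Tuple2.of(r.getKey(),
            Math.floorMod(r.getKey().hashCode(), parallelism * factor)))
    .keyBy(t -> t.f1)
    // Inside the operator, keep per-MyKey state in MapState<MyKey, MyValue>, since several
    // distinct MyKey values can share the same keyed group; emit plain MyKey results.
    .process(new MyGroupedAggregator());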

Access main table's values (DISTANCE and SIZE)

I am storing a vast amount of mathematical formulas as Content MathML in BaseX databases. To speed up lookups with different search algorithms implemented as XQuery expressions, I want to access the main table's values, especially PRE, DISTANCE and SIZE. The plan is to get rid of all subtrees whose total node count (SIZE) doesn't match what I'm looking for.
The PRE value is available via the function db:node-pre and working just fine. How can I access the DISTANCE and SIZE values? I could not find a way in the documentation.
The short answer is: you don't; stay with the offered APIs.
If you really want those values, use the parent::node() and following-sibling::node()[1] axes and query their PRE values. The following equations hold:
PRE(.) = PRE(parent) + SIZE(parent)
PRE(following-sibling[1]) = PRE(.) + SIZE(.)
so you could read those values in constant time by reordering those equations.
The long answer: dig deep into BaseX internals
You'll be touching the core (and probably shouldn't; kittens might die!). Implement a BaseX Java binding to get access to the queryContext variable, which holds the database context in its context field; from that you can get a Data reference:
Data data = queryContext.context.data();
Once you have the Data reference, you get access to several functions to query values of the internal data structure:
int Data.dist(int pre, int kind)
int Data.size(int pre, int kind)
where kind is always 1 for element nodes.
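Putting those pieces together, a lookup would look roughly like the sketch below (again: internal API, version-dependent, no guarantees); the pre value of 42 is just an example.
Data data = queryContext.context.data();

int pre  = 42;                      // a PRE value, e.g. obtained via db:node-pre
int kind = 1;                       // 1 = element node
int dist = data.dist(pre, kind);    // DISTANCE column
int size = data.size(pre, kind);    // SIZE column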
Be brave, and watch your step, you're leaving the safe grounds now!

runtime optimization of a matching algorithm

I made the following matching algorithm, but of course the runtimes get very long...
Does anybody have an idea how to make this matching faster (by changing the code or changing the algorithm)?
for (i = 0; i < AnzEntity; i++) {
    for (j = 0; j < 8; j++) {
        if (Entity[i].GID[j] > 0) {
            for (k = 0; k < AnzGrid; k++) {
                if (Entity[i].Module == Grid[k].Module && Entity[i].GID[j] == Grid[k].ID) {
                    Entity[i].GIDCoord[j][0] = Grid[k].XYZ[0];
                    Entity[i].GIDCoord[j][1] = Grid[k].XYZ[1];
                    Entity[i].GIDCoord[j][2] = Grid[k].XYZ[2];
                    continue;
                }
            }
        }
    }
}
A very general question... for which one can give only a very general answer.
All faster search algorithms come down to divide and conquer. There's a whole family of searches which start by sorting the data to be searched, so that you can progressively halve (or better) the number of things you are searching through (e.g. binary search of lists, trees of all kinds, ...). There's a whole family of searches where you use some property of each value to cut the search to some (small) subset of the data (hashing). There are searches which cache recent results, which can work in some cases (e.g. move-to-front lists). Which of these may be suitable depends on the data.
The big thing to look at, however, is whether the data being searched changes, and if so how often. If the data does not change, then you can hit it with a big hammer and crunch out a simple structure to search. If the data changes all the time, then you need a more complicated structure so that changes are not prohibitively expensive while search speed is maintained. Depending on the circumstances, the trade-off will vary.
You are exhaustively comparing all Entity[i] (with a positive GID[j]) to all Grid[k]. This implies a total of AnzEntity * AnzGrid comparisons.
Instead, you can sort the Entity and Grid elements in increasing lexicographical order (by ID value, then by Module value in case of a tie). Each Entity should be duplicated once per nonzero Entity.GID[j] entry so it can be placed in the sorted order.
Exploiting the sorted order, both sequences can be walked together in a single merge-like pass, so the number of comparisons drops to 8·AnzEntity + AnzGrid.
Taking the sorts into account, O(N·M) is turned into O(N log N + M log M).
ALTERNATIVE:
Another option is to enter one of Entity or Grid items in a hash table, using pairs ID/Module for the key, and use the hash table for fast lookups. This should result in a behavior close to linear O(N + M).
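For illustration, here is the hash-table variant sketched in Java (the same idea carries over to C with any hash map library). It assumes Module and ID/GID fit in 32-bit integers, so a (Module, ID) pair can be packed into one long key; java.util.HashMap provides the table.
// Build the index once: (Module, ID) -> position in Grid.
Map<Long, Integer> gridIndex = new HashMap<>();
for (int k = 0; k < AnzGrid; k++) {
    long key = ((long) Grid[k].Module << 32) | (Grid[k].ID & 0xFFFFFFFFL);
    gridIndex.put(key, k);
}

// Each lookup is now O(1) expected instead of a scan over all of Grid.
for (int i = 0; i < AnzEntity; i++) {
    for (int j = 0; j < 8; j++) {
        if (Entity[i].GID[j] > 0) {
            long key = ((long) Entity[i].Module << 32) | (Entity[i].GID[j] & 0xFFFFFFFFL);
            Integer k = gridIndex.get(key);
            if (k != null) {
                Entity[i].GIDCoord[j][0] = Grid[k].XYZ[0];
                Entity[i].GIDCoord[j][1] = Grid[k].XYZ[1];
                Entity[i].GIDCoord[j][2] = Grid[k].XYZ[2];
            }
        }
    }
}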

C - How can this simple transaction() function be completely free of deadlocks?

So I have this basic transaction() function written in C:
void transaction (Account from, Account to, double amount) {
    mutex lock1, lock2;
    lock1 = get_lock(from);
    lock2 = get_lock(to);
    acquire(lock1);
    acquire(lock2);
    withdraw(from, amount);
    deposit(to, amount);
    release(lock2);
    release(lock1);
}
It's my understanding that the function is mostly deadlock-free, since it locks one account and then the other (instead of locking one, making changes, and then locking the other). However, if this function were called simultaneously by these two calls:
transaction (savings_account, checking_account, 500);
transaction (checking_account, savings_account, 300);
I am told that this would result in a deadlock. How can I edit this function so that it's completely free of deadlocks?
You need to create a total ordering of objects (Account objects, in this case) and then always lock them in the same order, according to that total ordering. You can decide what order to lock them in, but the simple thing would be to first lock the one that comes first in the total ordering, then the other.
For example, let's say each account has an account number, which is a unique* integer. (* meaning no two accounts have the same number) Then you could always lock the one with the smaller account number first. Using your example:
void transaction (Account from, Account to, double amount)
{
    mutex first_lock, second_lock;

    if (acct_no(from) < acct_no(to))
    {
        first_lock = get_lock(from);
        second_lock = get_lock(to);
    }
    else
    {
        assert(acct_no(to) < acct_no(from)); // total ordering, so == is not possible!
        assert(acct_no(to) != acct_no(from)); // this assert is essentially equivalent
        first_lock = get_lock(to);
        second_lock = get_lock(from);
    }

    acquire(first_lock);
    acquire(second_lock);
    withdraw(from, amount);
    deposit(to, amount);
    release(second_lock);
    release(first_lock);
}
So following this example, if checking_account has account no. 1 and savings_account has account no. 2, transaction (savings_account, checking_account, 500); will lock checking_account first and then savings_account, and transaction (checking_account, savings_account, 300); will also lock checking_account first and then savings_account.
If you don't have account numbers (say you're working with class Foo instead of class Account), then you need to find something else to establish a total ordering. If each object has a name, as a string, you can do an alphabetic comparison to determine which string is "less". Or you can use any other type that is comparable with > and <.
However, it is very important that the values be unique for each and every object! If two objects have the same value in whichever field you're comparing, then they occupy the same spot in the ordering. If that can happen, you have a "partial ordering", not a "total ordering", and a total ordering is essential for this locking scheme.
If necessary, you can make up a "key value" that is an arbitrary number that doesn't mean anything, but is guaranteed unique for each object of that type. Assign a new, unique value to each object when it is created.
Another alternative is to keep all the objects of that type in some kind of list. Then their list position serves to put them in a total ordering. (Frankly, the "key value" approach is better, but some applications may be keeping the objects in a list already for application logic purposes so you can leverage the existing list in that case.) However, take care that you don't end up taking O(n) time (instead of O(1) like the other approaches*) to determine which one comes first in the total ordering when you use this approach.
(* If you're using a string to determine the total ordering, the comparison isn't really O(1); it's linear in the length of the strings, but constant w.r.t. the number of objects that hold those strings. Depending on your application, the string length may be much more reasonably bounded than the number of objects.)
The problem you are trying to solve is called the dining philosophers problem; it is a well-known concurrency problem.
In your case the naive solution would be to change acquire to take two parameters (to and from) and only return when it can get both locks at the same time, taking neither lock if it can't have both, because the deadlock arises exactly when a thread holds one lock and waits for the other. Read about the dining philosophers problem and you'll understand why.
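To make the "both locks or neither" idea concrete, here is a sketch using Java's ReentrantLock; the question's pseudocode isn't tied to a particular lock API, and getLock, withdraw and deposit here are assumed helpers mirroring the question's get_lock/withdraw/deposit.
void transaction(Account from, Account to, double amount) throws InterruptedException {
    ReentrantLock lock1 = getLock(from);
    ReentrantLock lock2 = getLock(to);
    while (true) {
        lock1.lock();
        if (lock2.tryLock()) {        // take the second lock only if it is free right now
            try {
                withdraw(from, amount);
                deposit(to, amount);
            } finally {
                lock2.unlock();
                lock1.unlock();
            }
            return;
        }
        lock1.unlock();               // couldn't get both: release the first and retry
        Thread.sleep(1);              // brief back-off to avoid livelock
    }
}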
Hope it helps!

Querying for N random records on Appengine datastore

I'm trying to write a GQL query that returns N random records of a specific kind. My current implementation works but requires N calls to the datastore. I'd like to make it 1 call to the datastore if possible.
I currently assign a random number to every entity of that kind that I put into the datastore. When I query for a random record, I generate another random number and query for records > rand ORDER BY asc LIMIT 1.
This works, however, it only returns 1 record so I need to do N queries. Any ideas on how to make this one query? Thanks.
"Under the hood" a single search query call can only return a set of consecutive rows from some index. This is why some GQL queries, including any use of !=, expand to multiple datastore calls.
N independent uniform random selections are not (in general) consecutive in any index.
QED.
You could probably use memcache to store the entities, and reduce the cost of grabbing N of them. Or if you don't mind the "random" selections being close together in the index, select a randomly-chosen block of (say) 100 in one query, then pick N at random from those. Since you have a field that's already randomised, it won't be immediately obvious to an outsider that the N items are related. At least, not until they look at a lot of samples and notice that items A and Z never appear in the same group, because they're more than 100 apart in the randomised index. And if performance permits, you can re-randomise your entities from time to time.
What kind of tradeoffs are you looking for? If you are willing to put up with a small performance hit on inserting these entities, you can create a solution to get N of them very quickly.
Here's what you need to do:
When you insert your Entities, specify the key. You want to give keys to your entities in order, starting with 1 and going up from there. (This will require some effort, as app engine doesn't have autoincrement() so you'll need to keep track of the last id you used in some other entity, let's call it an IdGenerator)
Now when you need N random entities, generate N random numbers between 1 and whatever the last id you generated was (your IdGenerator will know this). You can then do a batch get by key using the N keys, which will only require one trip to the datastore, and will be faster than a query as well, since key gets are generally faster than queries, AFAIK.
This method does require dealing with a few annoying details:
Your IdGenerator might become a bottleneck if you are inserting lots of these items on the fly (more than a few a second), which would require some kind of sharded IdGenerator implementation. If all this data is preloaded, or is not high volume, you have it easy.
You might find that some Id doesn't actually have an entity associated with it anymore, because you deleted it or because a put() failed somewhere. If this happened you'd have to grab another random entity. (If you wanted to get fancy and reduce the odds of this you could make this Id available to the IdGenerator to reuse to "fill in the holes")
So the question comes down to how fast you need these N items vs how often you will be adding and deleting them, and whether a little extra complexity is worth that performance boost.
Looks like the only method is to store a random integer value in a special property on each entity and query on that. This can be done quite automatically if you just add an automatically initialized property.
Unfortunately this will require processing all entities once if your datastore is already populated.
It's weird, I know.
I agree with the answer from Steve: there is no way to retrieve N random rows in one query.
However, even the method of retrieving one single entity does not usually return results with an evenly distributed probability. The probability of returning a given entity depends on the gap between its randomly assigned number and the next higher random number. E.g. if the random numbers 1, 2, and 10 have been assigned (and none of the numbers 3-9), the algorithm will return "2" 8 times more often than "1".
I have fixed this in a slightly more expensive way. If someone is interested, I am happy to share.
I just had the same problem. I decided not to assign IDs to my already existing entries in datastore and did this, as I already had the totalcount from a sharded counter.
This selects "count" entries from "totalcount" entries, sorted by key.
# select $count random offsets from the complete set of totalcount entries
numberlist = random.sample(range(0, totalcount), count)
numberlist.sort()

pagesize = 1000

# init buckets: one bucket per page of `pagesize` keys,
# each holding the offsets that fall into that page
buckets = [[] for i in xrange(int(max(numberlist) / pagesize) + 1)]
for k in numberlist:
    thisb = int(k / pagesize)
    buckets[thisb].append(k - (thisb * pagesize))
logging.debug("Numbers: %s. Buckets %s", numberlist, buckets)

# page through the results, one bucket at a time
result = []
baseq = db.Query(MyEntries, keys_only=True).order("__key__")
wq = baseq  # current page query; advanced past each page below
for b, l in enumerate(buckets):
    if len(l) > 0:
        result += [wq.fetch(limit=1, offset=e)[0] for e in l]
    if b < len(buckets) - 1:  # not the last bucket
        lastkey = wq.fetch(1, pagesize - 1)[0]
        wq = baseq.filter("__key__ >", lastkey)
Beware that this is somewhat complex (to me, at least), and I'm still not convinced that I don't have off-by-one or off-by-x errors.
And beware that if count is close to totalcount this can be very expensive.
And beware that on millions of rows it might not be possible to do within appengine time boundaries.
If I understand correctly, you need to retrieve N random instances.
It's easy. Just do a keys-only query, call random.choice on the resulting list of keys N times, and then fetch the entities by key.
import random

keys = MyModel.all(keys_only=True)   # keys-only query over the whole kind
n = 5                                # number of random instances

all_keys = list(keys)
result_keys = []
for _ in range(0, n):
    key = random.choice(all_keys)
    all_keys.remove(key)             # avoid picking the same key twice
    result_keys.append(key)

# result_keys now contains 5 random keys; fetch the entities with db.get(result_keys).

Resources