Flink: handle skew by partitioning by a field of the key - apache-flink

I have skew when I keyBy on my data. Let's say the key is:
case class MyKey(x: X, y:Y)
To solve this I am thinking of adding an extra field that would make distribution even among the workers by using this field only for partitioning:
case class MyKey(z: evenlyDistributedField, x: X, y:Y) extends MyKey(x, y) {
  override def hashCode(): Int = z.hashCode
}
due to this line my records will use the overridden hashCode and be distributed evenly to each worker and use the original equals method (that takes into consideration only the X and Y fields) to find the proper keyed state in later stateful operators.
I know that same (X, Y) pairs will end in different workers, but I can handle that later. (after making the necessary processing with my new key to avoid skew).
My question is: where else is the hashCode method of the key used?
I suspect it is used when getting keyed state (what is a namespace, by the way?), as I saw extending classes use the key in a HashMap to get the state for this key. I know that retrieving the keyed state from the map will be slower, as the hashCode will not consider the X and Y fields. But is there any other place in the Flink code that uses the hashCode method of the key?
Is there any other way to solve this? I thought of physical partitioning, but then I cannot use keyBy as well, afaik. The steps would be:
1. Partition my data among the workers randomly to produce an even distribution.
2. [EDITED] Do a .window().aggregate() in each partition independently of the others (as if the others don't exist). The data in each window aggregate should be keyed on the (X, Y) pairs of this partition, ignoring the same (X, Y) keys in other partitions.
3. Merge the conflicts caused by the same (X, Y) pairs appearing in different partitions later. (I don't need guidance on this; I just do a new keyBy on (X, Y).)

In this situation I usually create a transient Tuple2<MyKey, Integer>, where I fill in the Tuple.f1 field with whatever I want to use to partition by. The map or flatMap operation following the .keyBy() can emit MyKey. That avoids mucking with MyKey.hashCode().
And note that having a different set of fields for the hashCode() vs. equals() methods leads to pain and suffering. Java has a contract that says "equals consistency: objects that are equal to each other must return the same hashCode".
If you can't offload a significant amount of unkeyed work, then what I would do is...
Set the Integer in the Tuple2<MyKey, Integer> to be hashCode(MyKey) % <operator parallelism * factor>. Assuming your parallelism * factor is high enough, you'll only get a few cases of 2 (or more) of the groups going to the same sub-task.
In the operator, use MapState<MyKey, value> to store state. You'll need this since you'll get multiple unique MyKey values going to the same keyed group.
Do your processing and emit a MyKey from this operator.
By using hashCode(MyKey) % some value, you should get a pretty good mix of unique MyKey values going to each sub-task, which should mitigate skew. Of course if one value dominates, then you'll need another approach, but since you haven't mentioned this I'm assuming it's not the case.
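The bucketing idea above can be sketched outside Flink as a small plain-Python simulation. This is only an illustration under stated assumptions: `bucket_for`, `process`, and the `parallelism`/`factor` values are hypothetical names chosen here, and `zlib.crc32` stands in for `hashCode()` because it is stable and never negative (in Java, `hashCode()` can be negative, so something like `Math.floorMod` would be needed before using the result as a bucket).

```python
import zlib
from collections import Counter, defaultdict


def bucket_for(key, parallelism, factor):
    """Map a key to one of parallelism * factor buckets using a stable hash."""
    digest = zlib.crc32(repr(key).encode("utf-8"))  # stable and non-negative
    return digest % (parallelism * factor)


def process(records, parallelism=4, factor=8):
    """Group records by bucket; each bucket keeps a per-key map (like MapState)."""
    state = defaultdict(Counter)  # bucket -> {original key -> count}
    for key in records:
        state[bucket_for(key, parallelism, factor)][key] += 1
    return state


records = [("x1", "y1"), ("x2", "y1"), ("x1", "y1"), ("x3", "y7")]
state = process(records)
for bucket, per_key in sorted(state.items()):
    print(bucket, dict(per_key))
```

Several distinct keys can share a bucket, which is why the per-bucket state is a map from the original key to its value rather than a single value.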


Equivalent of DataSet groupBy/withPartitioner for DataStream

Previously with a DataSet I could do a .groupBy(...) followed by a .withPartitioner(...) to create groups such that one group (known to be much, much bigger than all the others) would be assigned to its own slot, and the other groups would be distributed among the remaining slots.
In switching to a DataStream, I don't see any straightforward way to do the same thing. If I dig into .keyBy(...), I see it using a PartitionTransformation with a KeyGroupStreamPartitioner, which is promising - but PartitionTransformation is an internal-use only class (or so the annotation says).
What's the recommended approach with a DataStream for achieving the same result?
With DataStream it's not as straightforward. You can implement a custom Partitioner that you use with partitionCustom, but then you do not have a KeyedStream, and so can not use keyed state or timers.
Another solution is to do a two-step, local/global aggregation, e.g.,
And in some cases, the first level of random keying isn't necessary (if the keys are already well distributed among the source partitions).
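As a rough illustration of the two-step idea, here is a plain-Python simulation (not the Flink API, and not the code from the original answer). It assumes a simple sum as the aggregate; `local_then_global` and `num_salts` are hypothetical names. The first pass keys by a random salt plus the real key, the second pass re-keys by the real key and merges the partial results.

```python
import random
from collections import Counter


def local_then_global(records, num_salts=4, seed=0):
    """Two-step aggregation: salted partial sums, then a merge per real key."""
    rng = random.Random(seed)

    # Step 1: key by (salt, key) so a hot key is spread over num_salts partial sums.
    partial = Counter()
    for key, value in records:
        salt = rng.randrange(num_salts)
        partial[(salt, key)] += value

    # Step 2: re-key by the real key and merge the partial sums.
    final = Counter()
    for (_salt, key), subtotal in partial.items():
        final[key] += subtotal
    return partial, final


records = [("hot", 1)] * 1000 + [("cold", 1)] * 10
partial, final = local_then_global(records)
print(dict(final))
```

The second step only sees at most `num_salts` partial results per key, so the hot key no longer dominates a single task. This only works when the aggregate can be merged from partial results.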
In principle, given a priori knowledge of the hot keys, you should be able to somehow implement a KeySelector that does a good job of balancing the load among the task slots. I believe one or two people have actually done this (by brute force searching for a suitable mapping from original keys to actual keys), but I don't have a reference implementation at hand.
As David noted, you can sometimes do the double-keyBy trick (initially using a random key) to reduce the impact of key skew. In my case that wasn't viable, as I'm processing records in each group using a large deep learning network with significant memory requirements, which means having all models loaded at the same time for the first grouping.
I re-used a technique I'd gotten to work with an older version of Flink, where you decide which sub-task (operator index) should get each record, and then calculate a key that Flink will assign to the target sub-task. The code, which calculates an Integer key, looks something like:
import org.apache.flink.runtime.state.KeyGroupRangeAssignment;

public static Integer makeKeyForOperatorIndex(int maxParallelism, int parallelism,
        int operatorIndex) {
    // Try candidate keys until Flink routes one to the target sub-task.
    for (int i = 0; i < maxParallelism * 2; i++) {
        Integer key = Integer.valueOf(i);
        int index = KeyGroupRangeAssignment.assignKeyToParallelOperator(
                key, maxParallelism, parallelism);
        if (index == operatorIndex) {
            return key;
        }
    }

    throw new RuntimeException(String.format(
            "Unable to find key for target operator index %d (max parallelism = %d, parallelism = %d)",
            operatorIndex, maxParallelism, parallelism));
}
But note this is very fragile, as it depends on internal Flink implementation details.

Sort by constant number

I need to randomize Solr (6.6.2) search results, but the order needs to be consistent given a specific seed. This is for a paginated search that returns a limited result set from a much larger one, so I must do the ordering at the query level and not at the application level once the data has been fetched.
Initially I tried this:
Where 999 is a constant that is fed in when constructing the query prior to sending it to Solr. The constant value changes for each new search.
This solution works. However, when I run the query a few times, or run it on different Solr instances, the ordering is different.
After doing some reading, I found that random_ generates a number via:
fieldName.hashCode() + context.docBase + (int)top.getVersion()
This means that when the random number is generated, it takes the index version into account. This becomes problematic when using a distributed architecture or when indexes are updated, as is well explained here.
There are various recommended solutions online, but I am trying to avoid writing a custom random override. Is there some type of trick where I can feed in some type of function or equation to the sort param?
For example:
However, this always results in the same order, even when either of the values changes.
This question is somewhat similar to this other question, but not quite.
I searched for answers on SO containing solr.RandomSortField, and while they point out what the issue is, none of them have a solution. It seems the best way would be to override the solr.RandomSortField logic, but it's not clear how.
Prior Research
Solr: Random sort order after index version change
Solr - Return random results (Sort by Random)
Even after implementing a custom random sort field, the results still differed across instances of Solr.
I ended up adding a new field that is populated at index time which is a 32 bit hash of an ID field that already existed in the document.
I then built a "stateless" linear congruential generator to produce a set of acceptably random numbers to use for sorting:
?sort=mod(product(hash_int_id,{seedConstant},982451653), 104395301) asc
Since this function technically passes a new seed for each row, and because it does not store state (like rand.Next() would), this solution is admittedly inferior and it is not a true PRNG; however, it does seem to get me most of the way there. Note that you will have to tune your values depending on the size of your data set and the size of the values in your hash_int_id equivalent field.
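To make the sort expression concrete, here is a small Python sketch of the same arithmetic. The two constants come from the sort parameter above; `sort_key` and `seeded_order` are hypothetical helper names. Note that Python uses exact integers, whereas Solr's function queries may evaluate in floating point, so the actual order produced by Solr can differ from this sketch.

```python
MULTIPLIER = 982451653  # constant from the sort expression above
MODULUS = 104395301     # constant from the sort expression above


def sort_key(hash_int_id, seed):
    """Stateless per-document sort value: (hash * seed * MULTIPLIER) mod MODULUS."""
    return (hash_int_id * seed * MULTIPLIER) % MODULUS


def seeded_order(doc_hashes, seed):
    """Order documents by their sort value for a given seed."""
    return sorted(doc_hashes, key=lambda h: sort_key(h, seed))


doc_hashes = [101, 202, 303, 404, 505]
print(seeded_order(doc_hashes, 999))
print(seeded_order(doc_hashes, 999))  # same seed, same order
```

Because the value depends only on the stored hash and the seed, the order does not change with the index version or the Solr instance.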

Determine percentage of unused keys in large redis DB

I have a Redis database with many millions of keys in it. Over time, the keys that I have written to and read from have changed, and so there are many keys that I am simply not using any more. Most don't have any kind of TTL either.
I want to get a sense for what percentage of the keys in the Redis database is not in use any more. I was thinking I could use hyperloglog to estimate the cardinality of the number of keys that are being written to, but it seems like a lot of work to do a PFADD for every key that gets written to and read from.
To be clear, I don't want to delete anything yet, I just want to do some analysis on the number of used keys in the database.
I'd start with the scan command to iterate through the keys, and use the object idletime command on each to collect the number of seconds since the key was last used. From there you can generate metrics however you like.
One way, using Redis, would be to use a sorted set with the idletime of the key as its score. The advantage of this over HLL is that you can then say "give me keys idle between x and y seconds ago" by using zrangebyscore and/or zrevrangebyscore. You could then use the results for operations such as deletion, archival, or setting a TTL. With HLL you can't do this.
Another advantage is that, unless you store the result in Redis, there is only a Redis cost when you run it. You don't have to modify your code to do additional operations when accessing keys, for example.
The accuracy of the object's idle time is around ten seconds or so if I recall. But for getting an idea of how many and which keys haven't been accessed in a given time frame it should work fine.
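A minimal sketch of the scan-plus-idletime approach, assuming a redis-py style client that exposes `scan_iter()` and `object("idletime", key)`. The `idle_fraction` function and the `FakeClient` class are hypothetical; `FakeClient` is only an in-memory stand-in so the example runs without a server.

```python
def idle_fraction(client, threshold_seconds):
    """Return (idle_count, total_count, fraction) for keys idle >= threshold."""
    total = 0
    idle = 0
    for key in client.scan_iter(count=1000):
        idle_seconds = client.object("idletime", key)
        if idle_seconds is None:  # key disappeared between SCAN and OBJECT
            continue
        total += 1
        if idle_seconds >= threshold_seconds:
            idle += 1
    fraction = idle / total if total else 0.0
    return idle, total, fraction


class FakeClient:
    """Hypothetical in-memory stand-in for a Redis client (demo only)."""

    def __init__(self, idle_times):
        self._idle = idle_times  # key -> seconds since last access

    def scan_iter(self, count=None):
        return iter(list(self._idle))

    def object(self, infotype, key):
        assert infotype == "idletime"
        return self._idle.get(key)


client = FakeClient({"a": 5, "b": 100000, "c": 90000, "d": 1})
print(idle_fraction(client, threshold_seconds=86400))
```

With a real client, the same function would be called with the connection object instead of `FakeClient`; nothing is written back to Redis.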
You can analyze the data by time window, and use a hyperloglog to estimate the cardinality for each time window.
For example, you can use a hyperloglog for each day's analysis:
// for each key that has been read or written in day1
// add it to the corresponding hyperloglog
pfadd key-count-day1 a b
pfadd key-count-day1 c d e
// for each key that has been read or written in day2
// add it to the corresponding hyperloglog
pfadd key-count-day2 a
pfadd key-count-day2 c
In this case, you can get the estimated number of keys that are active in dayN with the hyperloglog whose key is key-count-dayN.
With pfcount, you can get the number of active keys for each day or several days.
// number of active keys in day2: count2
pfcount key-count-day2
// number of active keys in day1 and day2: count-total
pfcount key-count-day1 key-count-day2
With these 2 counts, you can calculate the percentage of keys that are unused since day2: (count-total - count2) / count-total
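For the sample data above, the arithmetic works out as follows. This sketch uses exact Python sets in place of hyperloglogs, so the counts are exact here, whereas pfcount returns estimates.

```python
day1 = {"a", "b", "c", "d", "e"}  # keys added with pfadd key-count-day1
day2 = {"a", "c"}                 # keys added with pfadd key-count-day2

count2 = len(day2)                # stands in for: pfcount key-count-day2
count_total = len(day1 | day2)    # stands in for: pfcount key-count-day1 key-count-day2

unused_since_day2 = (count_total - count2) / count_total
print(count2, count_total, unused_since_day2)
```

Here 3 of the 5 keys seen over both days (b, d, e) were not touched on day2, so the ratio is 3/5.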

Access main table's values (DISTANCE and SIZE)

I am storing a vast amount of mathematical formulas as Content MathML in BaseX databases. To speed up lookup with different search algorithms implemented as XQuery expressions I want to access the main table's values especially PRE, DISTANCE and SIZE. The plan is to get rid of all subtrees which provide the wrong amount of the subtree's total nodes (SIZE).
The PRE value is available via the function db:node-pre and working just fine. How can I access the DISTANCE and SIZE values? I could not find a way in the documentation.
The short answer is: you don't, stay with the offered APIs
If you really want those values, use the parent::node() and following-sibling::node()[1] axes and query their pre values. The following equations hold:
PRE(.) = PRE(parent) + DISTANCE(.)
PRE(following-sibling[1]) = PRE(.) + SIZE(.)
so you could read those values in constant time by rearranging those equations.
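To see how the values relate, here is a simplified Python model of such a table. It is only a sketch under assumptions: it models element nodes only (no document node, attributes, or text), treats DISTANCE as the offset from a node's PRE to its parent's PRE, treats SIZE as the node plus all its descendants, and sets the root's distance to 0. `build_table` is a hypothetical helper, not part of BaseX.

```python
def build_table(tree):
    """Return rows with pre, name, dist and size, in document order."""
    rows = []

    def visit(node, parent_pre):
        name, children = node
        pre = len(rows)
        row = {
            "pre": pre,
            "name": name,
            "dist": 0 if parent_pre is None else pre - parent_pre,
            "size": 1,
        }
        rows.append(row)
        for child in children:
            visit(child, pre)
        row["size"] = len(rows) - pre  # the node itself plus all descendants

    visit(tree, None)
    return rows


# <apply><plus/><ci/><apply><times/><cn/></apply></apply>
tree = ("apply", [("plus", []), ("ci", []), ("apply", [("times", []), ("cn", [])])])
rows = build_table(tree)
for row in rows:
    print(row)
```

In this model, DISTANCE(.) equals PRE(.) minus PRE(parent), and SIZE(.) equals PRE(following-sibling[1]) minus PRE(.) whenever a following sibling exists.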
The long answer: dig deep into BaseX internals
You'll be touching the core (and probably shouldn't, kittens might die!). Implement a BaseX Java binding to get access to the queryContext variable, which holds the database context, which you can query to get a data() reference:
Data data = queryContext.context.data();
Once you have the Data reference, you get access to several functions to query values of the internal data structure:
int Data.dist(int pre, int kind)
int Data.size(int pre, int kind)
where kind is always 1 for element nodes.
Be brave, and watch your step, you're leaving the safe grounds now!

Querying for N random records on Appengine datastore

I'm trying to write a GQL query that returns N random records of a specific kind. My current implementation works but requires N calls to the datastore. I'd like to make it 1 call to the datastore if possible.
I currently assign a random number to every entity that I put into the datastore. When I query for a random record, I generate another random number and query for records > rand ORDER BY asc LIMIT 1.
This works, however, it only returns 1 record so I need to do N queries. Any ideas on how to make this one query? Thanks.
"Under the hood" a single search query call can only return a set of consecutive rows from some index. This is why some GQL queries, including any use of !=, expand to multiple datastore calls.
N independent uniform random selections are not (in general) consecutive in any index.
You could probably use memcache to store the entities, and reduce the cost of grabbing N of them. Or if you don't mind the "random" selections being close together in the index, select a randomly-chosen block of (say) 100 in one query, then pick N at random from those. Since you have a field that's already randomised, it won't be immediately obvious to an outsider that the N items are related. At least, not until they look at a lot of samples and notice that items A and Z never appear in the same group, because they're more than 100 apart in the randomised index. And if performance permits, you can re-randomise your entities from time to time.
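The "random block, then pick N" idea can be sketched with an in-memory list standing in for the index on the randomised field. This is a simulation under assumptions, not datastore code: `sample_from_random_block` is a hypothetical name, and the sorted list plays the role of the query `rand > start ORDER BY rand LIMIT block_size`.

```python
import bisect
import random


def sample_from_random_block(entities, n, block_size=100, rng=random):
    """entities: list of (rand_value, entity) sorted by rand_value, like an index."""
    start_value = rng.random()
    rand_values = [r for r, _ in entities]
    start = bisect.bisect_right(rand_values, start_value)  # rand > start_value
    block = entities[start:start + block_size]             # one consecutive range
    if len(block) < n:                                      # ran off the end of the index
        block = entities[:block_size]
    return [e for _, e in rng.sample(block, min(n, len(block)))]


setup_rng = random.Random(42)
values = sorted(setup_rng.random() for _ in range(1000))
entities = [(v, i) for i, v in enumerate(values)]
picked = sample_from_random_block(entities, 5, rng=random.Random(1))
print(picked)
```

All picks come from one consecutive range of the index, which is what makes a single query possible and also why items far apart in the index never appear together.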
What kind of tradeoffs are you looking for? If you are willing to put up with a small performance hit on inserting these entities, you can create a solution to get N of them very quickly.
Here's what you need to do:
When you insert your Entities, specify the key. You want to give keys to your entities in order, starting with 1 and going up from there. (This will require some effort, as app engine doesn't have autoincrement() so you'll need to keep track of the last id you used in some other entity, let's call it an IdGenerator)
Now when you need N random entities, generate N random numbers between 1 and whatever the last id you generated was (your IdGenerator will know this). You can then do a batch get by key using the N keys, which will only require one trip to the datastore, and will be faster than a query as well, since key gets are generally faster than queries, AFAIK.
This method does require dealing with a few annoying details:
Your IdGenerator might become a bottleneck if you are inserting lots of these items on the fly (more than a few a second), which would require some kind of sharded IdGenerator implementation. If all this data is preloaded, or is not high volume, you have it easy.
You might find that some Id doesn't actually have an entity associated with it anymore, because you deleted it or because a put() failed somewhere. If this happened you'd have to grab another random entity. (If you wanted to get fancy and reduce the odds of this you could make this Id available to the IdGenerator to reuse to "fill in the holes")
So the question comes down to how fast you need these N items vs how often you will be adding and deleting them, and whether a little extra complexity is worth that performance boost.
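A minimal sketch of the sequential-id approach, using a dict as a stand-in for the datastore and a batch get by key. `get_random_entities`, `last_id`, and `max_rounds` are hypothetical names; ids without an entity (deleted, or a failed put) are simply skipped and more ids are drawn.

```python
import random


def get_random_entities(store, last_id, n, rng=random, max_rounds=10):
    """store: dict of id -> entity, standing in for a batch get by key."""
    found = {}
    for _ in range(max_rounds):
        missing = n - len(found)
        if missing <= 0:
            break
        # Draw a few extra ids per round to compensate for holes.
        wanted = rng.sample(range(1, last_id + 1), min(last_id, missing * 2))
        for entity_id in wanted:  # one simulated batch get per round
            if entity_id in store and entity_id not in found and len(found) < n:
                found[entity_id] = store[entity_id]
    return list(found.values())


full_store = {i: "entity-%d" % i for i in range(1, 51)}
print(get_random_entities(full_store, last_id=50, n=5, rng=random.Random(0)))

holey_store = {i: "entity-%d" % i for i in range(1, 101) if i % 10 != 0}
print(get_random_entities(holey_store, last_id=100, n=5, rng=random.Random(0)))
```

With no holes, a single round is enough; with holes, a later round fills in whatever the first one missed.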
Looks like the only method is by storing the random integer value in each entity's special property and querying on that. This can be done quite automatically if you just add an automatically initialized property.
Unfortunately this will require processing of all entities once if your datastore is already filled in.
It's weird, I know.
I agree with Steve's answer: there is no way to retrieve N random rows in one query.
However, even the method of retrieving a single entity does not usually work in such a way that the probability of the returned results is evenly distributed. The probability of returning a given entity depends on the gap between its randomly assigned number and the next higher random number. E.g. if random numbers 1, 2, and 10 have been assigned (and none of the numbers 3-9), the algorithm will return "2" 8 times more often than "1".
I have fixed this in a slightly more expensive way. If someone is interested, I am happy to share.
I just had the same problem. I decided not to assign IDs to my already existing entries in datastore and did this, as I already had the totalcount from a sharded counter.
This selects "count" entries from "totalcount" entries, sorted by key.
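import logging
import random
from google.appengine.ext import db

pagesize = 1000  # assumed page size; the value is not given in the original post

# select $count from the complete set
numberlist = random.sample(range(0, totalcount), count)
# group the sampled positions by page, stored as offsets within each page
buckets = [[] for i in xrange(int(max(numberlist) / pagesize) + 1)]
for k in numberlist:
    thisb = int(k / pagesize)
    buckets[thisb].append(k - thisb * pagesize)
logging.debug("Numbers: %s. Buckets %s", numberlist, buckets)

# page through results.
result = []
baseq = db.Query(MyEntries, keys_only=True).order("__key__")
wq = baseq  # the first page is read from the unfiltered query
for b, l in enumerate(buckets):
    if len(l) > 0:
        result += [wq.fetch(limit=1, offset=e)[0] for e in l]
    if b < len(buckets) - 1:  # not the last bucket
        lastkey = wq.fetch(1, pagesize - 1)[0]
        wq = baseq.filter("__key__ >", lastkey)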
Beware that this is somewhat complex to me, and I'm still not convinced that I don't have off-by-one or off-by-x errors.
And beware that if count is close to totalcount this can be very expensive.
And beware that on millions of rows it might not be possible to do within appengine time boundaries.
If I understand correctly, you need to retrieve N random instances.
It's easy. Just do a keys-only query, and call random.choice N times on the resulting list of keys. Then get the results by fetching on those keys.
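import random

keys = MyModel.all(keys_only=True)
n = 5  # 5 random instances
all_keys = list(keys)
result_keys = []
for _ in range(0, n):
    key = random.choice(all_keys)
    result_keys.append(key)
# result_keys now contains 5 random keys (repeats are possible, since
# random.choice samples with replacement).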
