I've read in a book that
Flink maintains one state instance per key value and partitions all records with the same key to the operator task that maintains the state for this key.
My question is:
Let's say I have 4 task managers with 2 slots each,
and there's a key that accounts for 95% of the data.
Does that mean that 95% of the data is routed to the same machine?
Yes, it does mean that. If you have a hot key, then partitioning by key doesn't scale well.
In some cases, there are ways to work around this limitation. For example, if you are computing analytics (e.g., you want to count page views per page per minute, and one page gets 95% of the page views), you can do pre-aggregation -- split the work for the hot key across several parallel instances, and then do one final, non-parallel reduction of the partial results. (This is just standard map/reduce logic.)
This is called "data skew" and it is the bane of scalable applications everywhere.
It's also possible that the entire (100%) load goes to the same machine. There's no guarantee that the data is spread as evenly as possible by key, only that each key gets processed on a single machine. Technically, each key gets mapped to a key group (the number of key groups is the max parallelism for the topology) and each key group gets handled by a specific instance of an operator.
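For illustration, here's a rough, simplified Java sketch of that key-to-subtask mapping (Flink's real implementation applies an extra murmur hash to key.hashCode(); the parallelism numbers and the key name below are made up):

public class KeyGroupSketch {
    // Simplified stand-in for Flink's key-group assignment; the real code hashes
    // key.hashCode() again before taking the modulo.
    static int keyGroupFor(Object key, int maxParallelism) {
        return Math.abs(key.hashCode() % maxParallelism);
    }

    // Each parallel subtask owns a contiguous range of key groups.
    static int subtaskFor(int keyGroup, int maxParallelism, int parallelism) {
        return keyGroup * parallelism / maxParallelism;
    }

    public static void main(String[] args) {
        int maxParallelism = 128;
        int parallelism = 8;              // e.g. 4 task managers x 2 slots
        String hotKey = "the-hot-key";    // hypothetical key carrying 95% of the traffic
        int kg = keyGroupFor(hotKey, maxParallelism);
        // Every record with hotKey maps to the same key group, hence the same subtask,
        // no matter how many records arrive.
        System.out.println("key group " + kg + " -> subtask " + subtaskFor(kg, maxParallelism, parallelism));
    }
}

The mapping is a pure function of the key, so a key that dominates the input always lands on the same subtask, regardless of how many slots you add.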
One way to handle this situation involves adding a second field to the key, resulting in a greater number of possible keys and possibly reducing the data skew across the keys. Then aggregate the results in a subsequent operator using just the one original key.
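As a rough sketch of that two-stage idea in the Flink DataStream API (assuming a stream of (pageId, 1) tuples, an arbitrary salt count of 8, and one-minute tumbling windows; the class name and source are made up, and windowing details will vary with your Flink version):

import java.util.concurrent.ThreadLocalRandom;
import org.apache.flink.api.common.typeinfo.Types;
import org.apache.flink.api.java.tuple.Tuple2;
import org.apache.flink.api.java.tuple.Tuple3;
import org.apache.flink.streaming.api.datastream.DataStream;
import org.apache.flink.streaming.api.environment.StreamExecutionEnvironment;
import org.apache.flink.streaming.api.windowing.assigners.TumblingProcessingTimeWindows;
import org.apache.flink.streaming.api.windowing.time.Time;

public class SaltedPreAggregation {
    public static void main(String[] args) throws Exception {
        StreamExecutionEnvironment env = StreamExecutionEnvironment.getExecutionEnvironment();
        final int salts = 8; // how many subtasks should share the load of a hot key

        // (pageId, 1L) per page view; replace with your real source
        DataStream<Tuple2<String, Long>> views = env.fromElements(
            Tuple2.of("hot-page", 1L), Tuple2.of("hot-page", 1L), Tuple2.of("other-page", 1L));

        // Stage 1: append a random salt so a single hot pageId is spread over `salts` keys.
        DataStream<Tuple3<String, Integer, Long>> partial = views
            .map(v -> Tuple3.of(v.f0, ThreadLocalRandom.current().nextInt(salts), v.f1))
            .returns(Types.TUPLE(Types.STRING, Types.INT, Types.LONG))
            .keyBy(t -> t.f0 + "#" + t.f1)
            .window(TumblingProcessingTimeWindows.of(Time.minutes(1)))
            .reduce((a, b) -> Tuple3.of(a.f0, a.f1, a.f2 + b.f2));

        // Stage 2: drop the salt and combine the (at most `salts`) partial counts per page.
        partial
            .map(t -> Tuple2.of(t.f0, t.f2))
            .returns(Types.TUPLE(Types.STRING, Types.LONG))
            .keyBy(t -> t.f0)
            .window(TumblingProcessingTimeWindows.of(Time.minutes(1)))
            .reduce((a, b) -> Tuple2.of(a.f0, a.f1 + b.f1))
            .print();

        env.execute("salted pre-aggregation sketch");
    }
}

The first keyBy spreads the hot key's records over several subtasks; the second aggregation only ever sees a handful of pre-aggregated records per page per window, so it no longer matters that it runs on a single subtask per page.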
Related
I have a stream of data, containing a key, that I need to mix and match with data associated with that key. Each key belongs to a partition, and each partition can be loaded from a database.
Data is quite big, and only a few hundred out of hundreds of thousands of partitions can fit in a single task manager.
My current approach is to use partitionCustom based on the key.partition and cache the partition data inside a RichMapFunction to mix and match without reloading the data of the partitions multiple times.
When the message rate on a single partition gets too high, I hit a hot-spot/performance bottleneck.
What tools do I have in Flink to improve the throughput in this case?
Are there ways to customize the scheduling and to optimize the job placements based on setup time on the machines, and maximum processing time history?
It sounds like (a) your DB-based data is also partitioned, and (b) you have skew in your keys, where one partition gets a lot more keys than other partitions.
Assuming the above is correct, and you've done code profiling on your "mix and match" code to make that reasonably efficient, then you're left with manual optimizations. For example, if you know that keys in partition X are much more common, you can put all of those keys in one partition, and then distribute the remaining keys amongst the other partitions.
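Since you're already using partitionCustom, here's a hedged sketch of that manual assignment (HOT_PARTITION_ID, the Record type, and its getPartitionId() accessor are hypothetical placeholders):

import org.apache.flink.api.common.functions.Partitioner;

// Pins the known-hot DB partition to one dedicated channel and spreads the
// remaining partitions over the rest; assumes the downstream parallelism is > 1.
public class HotAwarePartitioner implements Partitioner<Integer> {
    private static final int HOT_PARTITION_ID = 42; // the partition you know is hot

    @Override
    public int partition(Integer dbPartitionId, int numPartitions) {
        if (dbPartitionId == HOT_PARTITION_ID) {
            return 0; // channel 0 handles only the hot partition
        }
        return 1 + Math.abs(dbPartitionId.hashCode() % (numPartitions - 1));
    }
}

// usage: stream.partitionCustom(new HotAwarePartitioner(), Record::getPartitionId);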
Another approach is to add a "batcher" operator, which puts up to N keys for the same partition into a group (typically this also needs a timeout to flush, so data doesn't get stuck). If you can batch enough keys, then it might not be so bad to load the DB data on demand for the partition associated with each batch of keys.
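A hedged sketch of such a batcher as a Flink KeyedProcessFunction, modeling each record as a (partitionId, payload) tuple; the batch size and flush timeout are arbitrary:

import java.util.ArrayList;
import java.util.List;
import org.apache.flink.api.common.state.ListState;
import org.apache.flink.api.common.state.ListStateDescriptor;
import org.apache.flink.api.java.tuple.Tuple2;
import org.apache.flink.configuration.Configuration;
import org.apache.flink.streaming.api.functions.KeyedProcessFunction;
import org.apache.flink.util.Collector;

// Emits (partitionId, batch of payloads) when N records have accumulated for a
// partition, or when the flush timer fires, whichever comes first.
public class PartitionBatcher
        extends KeyedProcessFunction<Integer, Tuple2<Integer, String>, Tuple2<Integer, List<String>>> {

    private static final int N = 10;
    private static final long FLUSH_TIMEOUT_MS = 5_000;

    private transient ListState<String> buffer;

    @Override
    public void open(Configuration parameters) {
        buffer = getRuntimeContext().getListState(new ListStateDescriptor<>("buffer", String.class));
    }

    @Override
    public void processElement(Tuple2<Integer, String> record, Context ctx,
                               Collector<Tuple2<Integer, List<String>>> out) throws Exception {
        buffer.add(record.f1);
        List<String> current = buffered();
        if (current.size() == 1) {
            // first element of a fresh batch: arm a flush timer so data can't get stuck
            ctx.timerService().registerProcessingTimeTimer(
                ctx.timerService().currentProcessingTime() + FLUSH_TIMEOUT_MS);
        }
        if (current.size() >= N) {
            out.collect(Tuple2.of(record.f0, current));
            buffer.clear();
        }
    }

    @Override
    public void onTimer(long timestamp, OnTimerContext ctx,
                        Collector<Tuple2<Integer, List<String>>> out) throws Exception {
        List<String> current = buffered();
        if (!current.isEmpty()) {
            out.collect(Tuple2.of(ctx.getCurrentKey(), current));
            buffer.clear();
        }
    }

    private List<String> buffered() throws Exception {
        List<String> current = new ArrayList<>();
        Iterable<String> it = buffer.get();
        if (it != null) {               // some state backends return null for empty state
            it.forEach(current::add);
        }
        return current;
    }
}

You'd wire it in with something like stream.keyBy(r -> r.f0).process(new PartitionBatcher()); a production version would also delete or re-arm the timer when a batch flushes on the count condition.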
I've read throughout the Internet that the Datastore has a limit of 1 write per second for an Entity Group. Most of what I read indicates a "write to an entity", which I would understand as an update. Does the 1 write per second also apply to adding entities into the group?
A simple case would be a Thread where multiple posts can be added by different users. The way I see it, it's logical to have the Thread be the ancestor of the Posts, thus forming a wide entity group. If the answer to my question above is yes, a "trending" thread would be devastated by the write limit.
That said, would it make sense to get rid of the ancestry altogether or should I switch to the user as the ancestor? What I'd like to avoid is having the user be confused when they don't see the post due to eventual consistency.
A quick clarification to start with
1 write per second doesn't mean 1 entity per second. You can batch writes together, up to a maximum of 500 entities (transactions also have a 10 MiB limit). So if you can batch posts, you can improve your write rate.
Note: you can technically go higher than 1 write per second, but the longer you exceed that limit, the greater your risk of contention errors and the larger the eventual-consistency lag of the system.
You can read more on the limits here.
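For example, a hedged sketch with the Java App Engine Datastore API, where a single call writes a whole batch of Post entities into one Thread entity group (kind and property names are made up):

import com.google.appengine.api.datastore.DatastoreService;
import com.google.appengine.api.datastore.DatastoreServiceFactory;
import com.google.appengine.api.datastore.Entity;
import com.google.appengine.api.datastore.Key;
import com.google.appengine.api.datastore.KeyFactory;
import java.util.ArrayList;
import java.util.List;

public class BatchedPostWriter {
    public void writePosts(String threadId, List<String> bodies) {
        DatastoreService ds = DatastoreServiceFactory.getDatastoreService();
        Key threadKey = KeyFactory.createKey("Thread", threadId);

        List<Entity> batch = new ArrayList<>();
        for (String body : bodies) {            // keep each batch under the 500-entity limit
            Entity post = new Entity("Post", threadKey);
            post.setProperty("body", body);
            post.setProperty("created", System.currentTimeMillis());
            batch.add(post);
        }
        ds.put(batch);                          // one batched write against the entity group
    }
}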
Client-side sharding
If you need to use ancestor queries for strong consistency AND 1 write per second is not enough, you could implement client-side sharding. This essentially means that you write the posts to up to N different entity groups using a known key scheme. For example:
Primary parent: "AncestorA"
Optional shard 1: "AncestorA-1"
Optional shard N: "AncestorA-(N-1)"
To query for your posts, issue N ancestor queries. Naturally, you'll need to merge these results on the client side to display them in the correct order.
This will allow you to do N writes per second.
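Roughly, with the Java App Engine Datastore API, the write and read sides of that scheme could look like the sketch below (NUM_SHARDS, kind, and property names are hypothetical):

import com.google.appengine.api.datastore.*;
import java.util.ArrayList;
import java.util.List;
import java.util.concurrent.ThreadLocalRandom;

public class ShardedThread {
    private static final int NUM_SHARDS = 4;
    private final DatastoreService ds = DatastoreServiceFactory.getDatastoreService();

    // Writes pick one of N entity groups: "AncestorA", "AncestorA-1", ..., "AncestorA-(N-1)".
    public Key shardKeyFor(String threadId) {
        int shard = ThreadLocalRandom.current().nextInt(NUM_SHARDS);
        String name = (shard == 0) ? threadId : threadId + "-" + shard;
        return KeyFactory.createKey("Thread", name);
    }

    // Reads issue one strongly consistent ancestor query per shard and merge client-side.
    public List<Entity> readPosts(String threadId) {
        List<Entity> merged = new ArrayList<>();
        for (int shard = 0; shard < NUM_SHARDS; shard++) {
            String name = (shard == 0) ? threadId : threadId + "-" + shard;
            Query q = new Query("Post")
                .setAncestor(KeyFactory.createKey("Thread", name))
                .addSort("created", Query.SortDirection.DESCENDING);
            merged.addAll(ds.prepare(q).asList(FetchOptions.Builder.withLimit(100)));
        }
        merged.sort((a, b) ->
            ((Long) b.getProperty("created")).compareTo((Long) a.getProperty("created")));
        return merged;
    }
}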
Let's say I had a table with N rows, but no existing columns that could act as a Primary Key.
I'd like to generate one (for my convenience and completeness).
I have a few options for doing this.
I could use a GUID
I could use a sequence and generate an integer for each one (e.g., populated 1 to N)
I could generate a random integer
(and many more)
I get that GUIDs have their advantages and disadvantages.
Is there some advantage to using a randomly generated integer over a sequential integer?
Any CRUD operations on an indexed column shouldn't be affected. And if you were doing a bulk load, I would temporarily turn off the index and then restore it afterwards.
I can't see a reason, but I've come across a situation (in this case Oracle) where someone has done just that, and I'm hoping it's more than "What's a sequence?".
Since you're seeing a specific implementation that has chosen this approach, we can only speculate at what the original developer might have been thinking. That's always subject to error.
My guess is that the original developer was trying to avoid the issue where the right-most block in the index on the sequence-generated key becomes the resource that blocks many different sessions trying to do an insert. The "hot block" problem occurs because every session doing an insert needs to modify the data in the right-most block (assuming sequential keys) so Oracle needs to serialize access. In most systems, this isn't a big deal-- the amount of serialization needed is minimal and most systems don't have enough simultaneous insert operations for this to be a meaningful issue. But if you have a very high-volume system, particularly if you're running on a RAC cluster, those wait events can be meaningful. If you had this sort of issue, generating a random key would eliminate it by causing the various sessions to (generally) write to different blocks in the index.
Of course, generating random keys would not be the recommended approach even if you found yourself waiting on the right-most block of an index frequently. Oracle provides reverse-key indexes to take care of the hot block issue by indexing the data in reverse which distributes I/O across the blocks in the index. If you're licensed to use the partitioning option, a hash-partitioned index would be even better. For a more detailed discussion on reverse-key indexes, RAC, and mitigating hot block issues, here's a link to a related SO question.
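For concreteness, a hedged sketch of those two mitigations as Oracle DDL issued over JDBC (the connection details, table, and index names are made up):

import java.sql.Connection;
import java.sql.DriverManager;
import java.sql.Statement;

public class HotBlockMitigation {
    public static void main(String[] args) throws Exception {
        try (Connection conn = DriverManager.getConnection(
                 "jdbc:oracle:thin:@//dbhost:1521/ORCL", "scott", "tiger"); // hypothetical
             Statement stmt = conn.createStatement()) {

            // Option 1: a reverse-key index stores the key bytes reversed, so sequential
            // values land in different leaf blocks (index range scans are no longer usable).
            stmt.execute("CREATE INDEX emp_id_rev_ix ON employees (employee_id) REVERSE");

            // Option 2 (requires the partitioning option): a hash-partitioned index spreads
            // inserts across partitions while keeping the key bytes intact.
            // stmt.execute("CREATE INDEX emp_id_hash_ix ON employees (employee_id) "
            //     + "GLOBAL PARTITION BY HASH (employee_id) PARTITIONS 16");
        }
    }
}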
I am writing a program that converts an RDBMS into HBase. I selected a sequential entity as the row key, like Employee ID (1, 2, 3, ...), but I read somewhere that the row key shouldn't be a sequential entity. My question is: why is selecting a sequential row key not recommended? What are the design considerations behind this?
Although sequential row keys allow faster scans, they become a problem after a certain point because they cause undesirable RegionServer hotspotting at read/write time. By default, HBase stores rows with similar keys in the same region, which allows faster range scans. So if row keys are sequential, all of your data will go to the same machine, causing uneven load on that machine. This is called RegionServer hotspotting, and it is the main motivation for not using sequential keys. I'll use writes to explain the problem here.
When records with sequential keys are being written to HBase, all writes hit one Region. This would not be a problem if a Region were served by multiple RegionServers, but that is not the case: each Region lives on just one RegionServer. Each Region has a pre-defined maximum size, so after a Region reaches that size it is split into two smaller Regions. Following that, one of these new Regions takes all new records, and then that Region and the RegionServer that serves it become the new hotspot victim. Obviously, this uneven write load distribution is highly undesirable because it limits the write throughput to the capacity of a single server instead of making use of multiple/all nodes in the HBase cluster.
You can find a very good explanation of the problem along with its solution here.
You might also find this page helpful, which shows us how to design rowkeys efficiently.
Hope this answers your question.
Mostly because sequentially increasing row keys will be written to the same region, and not evenly distributed in terms of writes. If you have a write-intensive application, it makes sense to have some randomness in your row-key.
This is a great explanation (with graphics) on why a sequentially increasing row-key is a bad idea for HBase.
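To make that concrete, here's a hedged sketch of row-key salting with the HBase Java client: a short, hash-derived bucket prefix spreads sequential Employee IDs over many regions (the table name, column family, and bucket count are made up, and you give up simple full-range scans in ID order, so you'd scan per bucket instead):

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.hbase.HBaseConfiguration;
import org.apache.hadoop.hbase.TableName;
import org.apache.hadoop.hbase.client.Connection;
import org.apache.hadoop.hbase.client.ConnectionFactory;
import org.apache.hadoop.hbase.client.Put;
import org.apache.hadoop.hbase.client.Table;
import org.apache.hadoop.hbase.util.Bytes;

public class SaltedRowKeyWriter {
    private static final int NUM_SALT_BUCKETS = 16;

    static byte[] saltedKey(long employeeId) {
        int bucket = (Long.hashCode(employeeId) & 0x7fffffff) % NUM_SALT_BUCKETS;
        // e.g. "07_0000012345" (fixed width, so keys within a bucket still sort by ID)
        return Bytes.toBytes(String.format("%02d_%010d", bucket, employeeId));
    }

    public static void main(String[] args) throws Exception {
        Configuration conf = HBaseConfiguration.create();
        try (Connection conn = ConnectionFactory.createConnection(conf);
             Table table = conn.getTable(TableName.valueOf("employee"))) {
            Put put = new Put(saltedKey(12345L));
            put.addColumn(Bytes.toBytes("d"), Bytes.toBytes("name"), Bytes.toBytes("Alice"));
            table.put(put);
        }
    }
}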
I'm trying to understand the data pipelines talk presented at google i/o:
http://www.youtube.com/watch?v=zSDC_TU7rtc
I don't see why fan-in work indexes are necessary if I'm just going to batch through input-sequence markers.
Can't the optimistically-enqueued task grab all unapplied markers, churn through as many of them as possible (repeatedly fetching a batch of say 10, then transactionally update the materialized view entity), and re-enqueue itself if the task times out before working through all markers?
Do the work indexes have something to do with the efficiency of querying for all unapplied markers? I.e., is it better to query for "markers with work_index = " than for "markers with applied = False"? If so, why is that?
For reference, the question+answer which led me to the data pipelines talk is here:
app engine datastore: model for progressively updated terrain height map
A few things:
My approach assumes multiple workers (see ShardedForkJoinQueue here: http://code.google.com/p/pubsubhubbub/source/browse/trunk/hub/fork_join_queue.py), where the inbound rate of tasks exceeds the amount of work a single thread can do. With that in mind, how would you use a simple "applied = False" to split work across N threads? Probably assign another field on your model to a worker's shard_number at random; then your query would be on "shard_number=N AND applied=False" (requiring another composite index). Okay that should work.
But then how do you know how many worker shards/threads you need? With the approach above you need to statically configure them so your shard_number parameter is between 1 and N. You can only have one thread querying for each shard_number at a time or else you have contention. I want the system to figure out the shard/thread count at runtime. My approach batches work together into reasonably sized chunks (like the 10 items) and then enqueues a continuation task to take care of the rest. Using query cursors I know that each continuation will not overlap the last thread's, so there's no contention. This gives me a dynamic number of threads working in parallel on the same shard's work items.
Now say your queue backs up. How do you ensure the oldest work items are processed first? Put another way: How do you prevent starvation? You could assign another field on your model to the time of insertion-- call it add_time. Now your query would be "shard_number=N AND applied=False ORDER BY add_time DESC". This works fine for low throughput queues.
What if your work item write-rate goes up a ton? You're going to be writing many, many rows with roughly the same add_time. This requires a Bigtable row prefix for your entities as something like "shard_number=1|applied=False|add_time=2010-06-24T9:15:22". That means every work item insert is hitting the same Bigtable tablet server, the server that's currently owner of the lexical head of the descending index. So fundamentally you're limited to the throughput of a single machine for each work shard's Datastore writes.
With my approach, your only Bigtable index row is prefixed by the hash of the incrementing work sequence number. This work_index value is scattered across the lexical rowspace of Bigtable each time the sequence number is incremented. Thus, each sequential work item enqueue will likely go to a different tablet server (given enough data), spreading the load of my queue beyond a single machine. With this approach the write-rate should effectively be bound only by the number of physical Bigtable machines in a cluster.
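To illustrate the scattering (this is not the exact scheme from fork_join_queue.py, just a hedged stand-in), here's a small sketch that derives a work_index prefix by hashing the incrementing sequence number:

import java.nio.charset.StandardCharsets;
import java.security.MessageDigest;

public class WorkIndexSketch {
    static String workIndexFor(long sequenceNumber) throws Exception {
        byte[] digest = MessageDigest.getInstance("SHA-1")
            .digest(Long.toString(sequenceNumber).getBytes(StandardCharsets.UTF_8));
        // take the first 4 digest bytes as a hex prefix: adjacent sequence numbers
        // get unrelated prefixes, so their index rows land on different tablets
        return String.format("%02x%02x%02x%02x-%d",
            digest[0], digest[1], digest[2], digest[3], sequenceNumber);
    }

    public static void main(String[] args) throws Exception {
        for (long seq = 1; seq <= 5; seq++) {
            System.out.println(workIndexFor(seq));
        }
    }
}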
One disadvantage of this approach is that it requires an extra write: you have to flip the flag on the original marker entity when you've completed the update, which is something Brett's original approach doesn't require.
You still need some sort of work index, too, or you encounter the race conditions Brett talked about, where the task that should apply an update runs before the update transaction has committed. In your system, the update would still get applied - but it could be an arbitrary amount of time before the next update runs and applies it.
Still, I'm not the expert on this (yet ;). I've forwarded your question to Brett, and I'll let you know what he says - I'm curious as to his answer, too!