How Row Key is designed in Hbase - database

I am writing a program that converts an RDBMS into HBase. I selected a sequential entity as a row key like Employee ID (1,2,3....)but i read it somewhere that row key shouldn't be a sequential entity. My question is why selecting a sequential row key is not recommended. what are the design prospects associated for doing the same?

Although sequential rowkeys allow faster scans, it becomes a problem after a certain point as it causes undesirable RegionServer hotspotting during read/write time. By its default behavior Hbase stores rows with similar keys to the same region. It allows faster range scans. So if rowkeys are sequential all of your data will start going to the same machine causing uneven load on that machine. This is called as RegionServer Hotspotting and is the main motivation behind not using sequential keys. I'll take "writes" to explain the problem here.
When records with sequential keys are being written to HBase all writes hit one Region. This would not be a problem if a Region was served by multiple RegionServers, but that is not the case – each Region lives on just one RegionServer. Each Region has a pre-defined maximal size, so after a Region reaches that size it is split in two smaller Regions. Following that, one of these new Regions takes all new records and then this Region and the RegionServer that serves it becomes the new hotspot victim. Obviously, this uneven write load distribution is highly undesirable because it limits the write throughput to the capacity of a single server instead of making use of multiple/all nodes in the HBase cluster.
You can find a very good explanation of the problem along with its solution here.
You might also find this page helpful, which shows us how to design rowkeys efficiently.
Hope this answers your question.

Mostly because sequentially increasing row keys will be written to the same region, and not evenly distributed in terms of writes. If you have a write-intensive application, it makes sense to have some randomness in your row-key.
This is a great explanation (with graphics) on why a sequentially increasing row-key is a bad idea for HBase.

Related

When adding a unique IDs to your table is there any advantage to making them random?

Let's say I had a table with N rows, but no existing columns that could act as a Primary Key.
I'd like to generate one (for my convenience and completeness).
I have a few options for doing this.
I could use a GUID
I could use a sequence and generate an integer for each one (e.g., populated 1 to N)
I could generate a random integer
(and many more)
I get that GUIDs have their advantages and disadvantages.
Is there some advantage to using a randomly generated integer over a sequential integer?
Any CRUD operations on an indexed column shouldn't be affected. And if you were doing a bulk load, I would temporarily turn off the index and then restore it afterwards
I can't see a reason, but I've come across a situation (in this case Oracle) where someone has done just that and I'm hoping its more than "What's a sequence?".
Since you're seeing a specific implementation that has chosen this approach, we can only speculate at what the original developer might have been thinking. That's always subject to error.
My guess is that the original developer was trying to avoid the issue where the right-most block in the index on the sequence-generated key becomes the resource that blocks many different sessions trying to do an insert. The "hot block" problem occurs because every session doing an insert needs to modify the data in the right-most block (assuming sequential keys) so Oracle needs to serialize access. In most systems, this isn't a big deal-- the amount of serialization needed is minimal and most systems don't have enough simultaneous insert operations for this to be a meaningful issue. But if you have a very high-volume system, particularly if you're running on a RAC cluster, those wait events can be meaningful. If you had this sort of issue, generating a random key would eliminate it by causing the various sessions to (generally) write to different blocks in the index.
Of course, generating random keys would not be the recommended approach even if you found yourself waiting on the right-most block of an index frequently. Oracle provides reverse-key indexes to take care of the hot block issue by indexing the data in reverse which distributes I/O across the blocks in the index. If you're licensed to use the partitioning option, a hash-partitioned index would be even better. For a more detailed discussion on reverse-key indexes, RAC, and mitigating hot block issues, here's a link to a related SO question.

Time to retrieve a single record via a SQL Server index in a large table

Short version of the question:
If you have a table with a large number of small rows and you want to retrieve a single record from this table via an index probably consisting of two columns is this likely to be something that wil be low cost and fast or high cost and slow
Longer version of question and background:
I am a consultant working with a software development company and I have an argument with them about the performance implications of a piece of functionality that I want to add to the application they are building (and I am designing).
At the moment, we write out a log record every time somebody retrieves a client record. I want to put the name and time of the last person prevously to access that record onto the client page each time that record is retrieved.
They are saying that the performance implications of this will be high but based on my reasonable but not expert knowledge of how B trees work, this doesn't seem right even if the table is very large.
If you create an index on the GUID of the client record and the date/time of access (descending), then you ought to be able to retrieve the required record via an index scan which would just need to find the first entry for that GUID and then stop? And that with a b-tree index, most of the index would be cached so the number of physical disc accesses needed would be very small and the query time therefore significantly less than 1s.
Or have I got this completely wrong
You will have problems with GUID index fragmentation but because your rows do not increase in size (as you said in the comments) you will not have page-splitting problems. The random insert issue is fixable by doing reorganizing and rebuilding.
Besides that, there is nothing wrong with your approach. If the table is larger than RAM you will likely have a single disk IO per access (the intermediate index levels will be cached). If your data fits in RAM you will pay about 0.2 to 0.5ms per query. If your data is on a magnetic disk a seek will likely require 8-12ms. On an SSD you are back to 0.2ms to 0.5ms (maybe 0.05ms more).
Why don't you just create some test data (by selecting a cross product from sys.object of 1M rows) and measure it. It takes little time and you will find out for sure.
should be low cost and fast since the columns are indexed and that would be O(n) I think
You say last person to access? You mean that for every read you will have a write?
And that write is going to change an indexed date time column?
Then I would be worried too.
Writing on each record read will cause you lots of extra disk writes. This will block reads and it might be bad to your caching too. You also need to update your index a lot, and since you change the indexed data your index will be very fragmented.
It depends.
A single retrieval will be low cost and fast
on a decent indexed table
running on decent hardware
over a decent network
On the other hand, it takes time nonetheless.
If we are talking about one retrieval per hour, don't sweat over it. If we are talking about thousands of retrievals per second (as opposed to currently none) it will start to add up to the point it would be noticable.
Some questions you need to adress
Is my hardware up to spec
Does adding two fields result in a page split (unlikely)
How many extra pages need to be read for your regular result sets
How many retrievals/sec will be made
How many inserts/sec (triggering an index update) will be made
After you've adressed these questions, you should be able to make the determination yourself. As far as my gut feelings go, I would be surprised you would notice the performance difference.

app engine data pipelines talk - for fan-in materialized view, why are work indexes necessary?

I'm trying to understand the data pipelines talk presented at google i/o:
http://www.youtube.com/watch?v=zSDC_TU7rtc
I don't see why fan-in work indexes are necessary if i'm just going to batch through input-sequence markers.
Can't the optimistically-enqueued task grab all unapplied markers, churn through as many of them as possible (repeatedly fetching a batch of say 10, then transactionally update the materialized view entity), and re-enqueue itself if the task times out before working through all markers?
Does the work indexes have something to do with the efficiency querying for all unapplied markers? i.e., it's better to query for "markers with work_index = " than for "markers with applied = False"? If so, why is that?
For reference, the question+answer which led me to the data pipelines talk is here:
app engine datastore: model for progressively updated terrain height map
A few things:
My approach assumes multiple workers (see ShardedForkJoinQueue here: http://code.google.com/p/pubsubhubbub/source/browse/trunk/hub/fork_join_queue.py), where the inbound rate of tasks exceeds the amount of work a single thread can do. With that in mind, how would you use a simple "applied = False" to split work across N threads? Probably assign another field on your model to a worker's shard_number at random; then your query would be on "shard_number=N AND applied=False" (requiring another composite index). Okay that should work.
But then how do you know how many worker shards/threads you need? With the approach above you need to statically configure them so your shard_number parameter is between 1 and N. You can only have one thread querying for each shard_number at a time or else you have contention. I want the system to figure out the shard/thread count at runtime. My approach batches work together into reasonably sized chunks (like the 10 items) and then enqueues a continuation task to take care of the rest. Using query cursors I know that each continuation will not overlap the last thread's, so there's no contention. This gives me a dynamic number of threads working in parallel on the same shard's work items.
Now say your queue backs up. How do you ensure the oldest work items are processed first? Put another way: How do you prevent starvation? You could assign another field on your model to the time of insertion-- call it add_time. Now your query would be "shard_number=N AND applied=False ORDER BY add_time DESC". This works fine for low throughput queues.
What if your work item write-rate goes up a ton? You're going to be writing many, many rows with roughly the same add_time. This requires a Bigtable row prefix for your entities as something like "shard_number=1|applied=False|add_time=2010-06-24T9:15:22". That means every work item insert is hitting the same Bigtable tablet server, the server that's currently owner of the lexical head of the descending index. So fundamentally you're limited to the throughput of a single machine for each work shard's Datastore writes.
With my approach, your only Bigtable index row is prefixed by the hash of the incrementing work sequence number. This work_index value is scattered across the lexical rowspace of Bigtable each time the sequence number is incremented. Thus, each sequential work item enqueue will likely go to a different tablet server (given enough data), spreading the load of my queue beyond a single machine. With this approach the write-rate should effectively be bound only by the number of physical Bigtable machines in a cluster.
One disadvantage of this approach is that it requires an extra write: you have to flip the flag on the original marker entity when you've completed the update, which is something Brett's original approach doesn't require.
You still need some sort of work index, too, or you encounter the race conditions Brett talked about, where the task that should apply an update runs before the update transaction has committed. In your system, the update would still get applied - but it could be an arbitrary amount of time before the next update runs and applies it.
Still, I'm not the expert on this (yet ;). I've forwarded your question to Brett, and I'll let you know what he says - I'm curious as to his answer, too!

Advice on building a fast, distributed database

I'm currently working on a problem that involves querying a tremendous amount of data (billions of rows) and, being somewhat inexperienced with this type of thing, would love some clever advice.
The data/problem looks like this:
Each table has 2-5 key columns and 1 value column.
Every row has a unique combination of keys.
I need to be able to query by any subset of keys (i.e. key1='blah' and key4='bloo').
It would be nice to able to quickly insert new rows (updating the value if the row already exists) but I'd be satisfied if I could do this slowly.
Currently I have this implemented in MySQL running on a single machine with separate indexes defined on each key, one index across all keys (unique) and one index combining the first and last keys (which is currently the most common query I'm making, but that could easily change). Unfortunately, this is quite slow (and the indexes end up taking ~10x the disk space, which is not a huge problem).
I happen to have a bevy of fast computers at my disposal (~40), which makes the incredible slowness of this single-machine database all the more annoying. I want to take advantage of all this power to make this database fast. I've considered building a distributed hash table, but that would make it hard to query for only a subset of the keys. It seems that something like BigTable / HBase would be a decent solution but I'm not yet convinced that a simpler solution doesn't exist.
Thanks very much, any help would be greatly appreciated!
I'd suggest you listen to this podcast for some excellent information on distributed databases.
episode-109-ebays-architecture-principles-with-randy-shoup
To point out the obvious: you're probably disk bound.
At some point if you're doing randomish queries and your working set is sufficiently larger than RAM then you'll be limited by the small number of random IOPS a disk can do. You aren't going to be able to do better than a few tens of sub-queries per second per attached disk.
If you're up against that bottleneck, you might gain more by switching to an SSD, a larger RAID, or lots-of-RAM than you would by distributing the database among many computers (which would mostly just get you more of the last two resources)

Database scalability - performance vs. database size

I'm creating an app that will have to put at max 32 GB of data into my database. I am using B-tree indexing because the reads will have range queries (like from 0 < time < 1hr).
At the beginning (database size = 0GB), I will get 60 and 70 writes per millisecond. After say 5GB, the three databases I've tested (H2, berkeley DB, Sybase SQL Anywhere) have REALLY slowed down to like under 5 writes per millisecond.
Questions:
Is this typical?
Would I still see this scalability issue if I REMOVED indexing?
What are the causes of this problem?
Notes:
Each record consists of a few ints
Yes; indexing improves fetch times at the cost of insert times. Your numbers sound reasonable - without knowing more.
You can benchmark it. You'll need to have a reasonable amount of data stored. Consider whether or not to index based upon the queries - heavy fetch and light insert? index everywhere a where clause might use it. Light fetch, heavy inserts? Probably avoid indexes. Mixed workload; benchmark it!
When benchmarking, you want as real or realistic data as possible, both in volume and on data domain (distribution of data, not just all "henry smith" but all manner of names, for example).
It is typical for indexes to sacrifice insert speed for access speed. You can find that out from a database table (and I've seen these in the wild) that indexes every single column. There's nothing inherently wrong with that if the number of updates is small compared to the number of queries.
However, given that:
1/ You seem to be concerned that your writes slow down to 5/ms (that's still 5000/second),
2/ You're only writing a few integers per record; and
3/ You're queries are only based on time queries,
you may want to consider bypassing a regular database and rolling your own sort-of-database (my thoughts are that you're collecting real-time data such as device readings).
If you're only ever writing sequentially-timed data, you can just use a flat file and periodically write the 'index' information separately (say at the start of every minute).
This will greatly speed up your writes but still allow a relatively efficient read process - worst case is you'll have to find the start of the relevant period and do a scan from there.
This of course depends on my assumption of your storage being correct:
1/ You're writing records sequentially based on time.
2/ You only need to query on time ranges.
Yes, indexes will generally slow inserts down, while significantly speeding up selects (queries).
Do keep in mind that not all inserts into a B-tree are equal. It's a tree; if all you do is insert into it, it has to keep growing. The data structure allows for some padding, but if you keep inserting into it numbers that are growing sequentially, it has to keep adding new pages and/or shuffle things around to stay balanced. Make sure that your tests are inserting numbers that are well distributed (assuming that's how they will come in real life), and see if you can do anything to tell the B-tree how many items to expect from the beginning.
Totally agree with #Richard-t - it is quite common in offline/batch scenarios to remove indexes completely before bulk updates to a corpus, only to reapply them when update is complete.
The type of indices applied also influence insertion performance - for example with SQL Server clustered index update I/O is used for data distribution as well as index update, where as nonclustered indexes are updated in seperate (and therefore more expensive) I/O operations.
As with any engineering project - best advice is to measure with real datasets (skews page distribution, tearing etc.)
I think somewhere in the BDB docs they mention that page size greatly affects this behavior in btree's. Assuming you arent doing much in the way of concurrency and you have fixed record sizes, you should try increasing your page size

Resources