I'm using Google BigTable to store event log data according to the following constraints:
Each key should contain a username and timestamp, allowing contiguous reads for time-series data on a per-user basis, like this: USERNAME_TIMESTAMP.
I will be storing 10,000,000 event logs or more per day, so naturally I need to avoid hotspotting and ensure that records are evenly distributed across nodes.
There is a massive security component to this database, and as such, I'd like to encrypt the username before using it as a key in BigTable.
Obviously, I'd like to avoid extra steps on every read or write, so I was thinking of hashing usernames with SHA-1 (a one-way hash, strictly speaking, rather than encryption) before using them as keys in BigTable. As a result, all keys in BigTable will be formatted like this:
cf23df2207d99a74fbe169e3eba035e633b65d94_2018_01_30_15090001
We know that SHA-1 output is uniformly distributed, so given that, is it safe to assume that all of my records will be evenly distributed across nodes, while all records for a given username still reside together? Will this in effect prevent hotspotting? Are there any edge cases in this approach that I've missed?
Assuming that User Id is well distributed (i.e. there isn't a user that will have more than 10K operations per second), this approach should be fine.
FYI, Cloud Bigtable measures operations in rows per second, and you want to consider your peak throughput in determining the number of nodes. Each node can support 10,000 simple reads or writes per second. Our smallest production configuration is 3 nodes, which can support up to 30,000 rows per second (2.6 Billion rows per day if used continuously at the maximum).
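A minimal sketch of the row-key scheme discussed above, as an illustration only (the timestamp format is the asker's; note that SHA-1 is a one-way hash, not encryption — you cannot recover the username from the key, but you can always recompute the key from a username, which is all that point reads and prefix scans need):

```python
# Build row keys of the form SHA1(username)_TIMESTAMP.
import hashlib

def row_key(username: str, timestamp: str) -> str:
    """Hash the username and prepend it to the timestamp."""
    digest = hashlib.sha1(username.encode("utf-8")).hexdigest()
    return f"{digest}_{timestamp}"

# All events for one user share the same 40-hex-char prefix, so they sort
# contiguously; different users land at pseudo-random points in the key
# space, which spreads write load across tablets.
key = row_key("alice", "2018_01_30_15090001")
```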
Related
I'm a beginner with DynamoDB, my online instructor doesn't answer questions, and I've been confused about this.
I know that the partition key decides the partition in which the item will be placed.
I also know that the number of partitions is calculated based on throughput or storage using the famous formulas
So let's say a table has user_id as its partition key, with 200 user_ids. Does that automatically mean we have 200 partitions? If so, why didn't we calculate the number of partitions using the famous formulas?
Thanks
Let's establish two things.
A DynamoDB partition can support 3,000 read operations and 1,000 write operations per second. It keeps a divider between read and write ops so they do not interfere with each other. If you had a table configured to support 18,000 reads and 6,000 writes per second, you'd have at least 12 partitions, but probably a few more for some headroom.
A provisioned-capacity table has 1 partition by default, but an on-demand table has 4 partitions by default.
So, to answer your question directly: just because you have 200 distinct partition key values does not mean you have 200 partitions. It is very possible for those 200 items to sit in just one partition if your table is in provisioned capacity mode. If the configuration of the table changes or it takes on more traffic, those items might move around to new partitions.
There are a few distinct times where DynamoDB will add partitions.
When partitions grow in storage size larger than 10GB. DynamoDB might see that you are taking on data and try to do this proactively, but 10GB is the cut off.
When your table needs to support more operations per second than it currently does. This can happen manually, because you reconfigured your table to support 20,000 reads/sec where before it only supported 2,000; DynamoDB would have to add partitions and move data to be able to handle those 20,000 reads/sec. Or it can happen automatically: you configured floor and ceiling values in DynamoDB auto-scaling, DynamoDB senses your ops/sec climbing, and it adjusts the number of partitions in response to capacity exceptions.
Your table is in on-demand capacity mode and DynamoDB attempts to automatically keep 2x your previous high water mark of capacity. For example, say your table just reached 10,000 RCU for the first time. DynamoDB would see that is past your previous high water mark and start adding more partitions as it tries to keep 2x the capacity at the ready in case you peak up again like you just did.
DynamoDB actively monitors your table, and if it sees that one or more items are being hit particularly hard (hot keys) and sit in the same partition, that can create a hot partition. If that is happening, DynamoDB might split the partition to help isolate those items and prevent or fix a hot-partition situation.
There are one or two other more rare edge cases, but you'd likely be talking to AWS Support if you encountered this.
Note: Once DynamoDB creates partitions, the number of partitions never shrinks and this is ok. Throughput dilution is no longer a thing in DynamoDB.
The partition key value is hashed to determine the actual partition to place the data item into.
Thus the number of distinct partition key values has zero effect on the number of physical partitions.
The only things that affect the physical number of partitions are RCUs/WCUs (throughput) and the amount of data stored.
Nbr Partitions (throughput): Pt = RCU/3000 + WCU/1000
Nbr Partitions (storage): Ps = GB/10
Unless one of the above is more than 1.0, there will likely be only a single partition. I'm sure the split happens as you approach the limits, but exactly when is something only AWS knows.
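The two formulas above can be sketched in a few lines. This is a rough estimate only — combining the throughput and storage figures with `max()` is an assumption for illustration; the real allocator's exact behavior is not public:

```python
# Estimate DynamoDB partition count from provisioned throughput and
# table size, per the Pt and Ps formulas above.
import math

def estimated_partitions(rcu: int, wcu: int, size_gb: float) -> int:
    by_throughput = rcu / 3000 + wcu / 1000   # Pt
    by_storage = size_gb / 10                 # Ps
    # Assumption: take whichever constraint demands more partitions.
    return max(1, math.ceil(max(by_throughput, by_storage)))

# 18,000 reads/sec and 6,000 writes/sec -> at least 12 partitions,
# matching the earlier example in this answer.
```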
A new business need has emerged in our firm, where a relatively "big" data set needs to be accessed by online processes (with typical latency of up to 1 second). There is only one key, with high granularity: a row count measured in the tens of millions, and the expected number of columns / fields / value columns will likely exceed hundreds of thousands.
The key column is shared among all value columns, so key-value storage, while scalable, seems rather wasteful here. Is there any hope for using Cassandra / ScyllaDB (to which we gradually narrowed down our search) for such a wide data set, while ideally reducing also data storage needs by half (by storing the common key only once)?
If I understand your use case correctly, you will have tens of millions of partitions (what you called rows), and each will have hundreds of thousands of different values (each of those would be a clustering row in modern CQL — CQL no longer supports schema-less wide rows). This is a fairly reasonable data set for Scylla and Cassandra.
But I want to add that I'm not sure the storage saving you are hoping for will really be there. Yes, Scylla/Cassandra will not need to store the partition key multiple times, but unless the partition key is very long, this saving will often be negligible compared to the other overheads of storing the data on disk.
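A back-of-envelope calculation makes this point concrete. All sizes below are assumed for illustration, not Scylla/Cassandra internals:

```python
# Compare storing the key once per partition vs once per cell.
KEY_BYTES = 16          # assumed partition-key size
VALUE_BYTES = 100       # assumed payload per column
CELL_OVERHEAD = 20      # assumed per-cell metadata (timestamp, flags, ...)
COLUMNS = 100_000

# Wide partition: key stored once, then COLUMNS cells.
wide_partition = KEY_BYTES + COLUMNS * (VALUE_BYTES + CELL_OVERHEAD)
# Plain key-value: key repeated in every record.
key_value_pairs = COLUMNS * (KEY_BYTES + VALUE_BYTES + CELL_OVERHEAD)
saving = 1 - wide_partition / key_value_pairs
# With these numbers the saving is roughly 12%, well short of the
# hoped-for 50%; larger values or overheads shrink it further.
```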
Another thing you should consider is your expected queries. How will you read from this database? If you'll want to read all 100,000 columns of a particular key, or a contiguous range of them, then the data model you described is perfect. However, if the expected use case is that you always plan to read a single column from a specific key, then this data model will be inefficient - a random-access read from the middle of a long partition is slower than reading the value from a short partition.
I have an AWS application where DynamoDB is used for most data storage and it works well for most cases. I would like to ask you about one particular case where I feel DynamoDB might not be the best option.
There is a simple table with customers. Each customer can collect virtual coins, so each customer has a balance attribute. The balance is managed by a 3rd party service that keeps the up-to-date value, and the balance attribute in my table is just a cached version of it. The 3rd party service requires its own id for the customer as input, so the customers table also contains this externalId attribute, which is used to query the balance.
I need to run the following process once per day:
Update the balance attribute for all customers in a database.
Find all customers with a balance greater than some specified constant value. They need to be sorted by balance.
Perform some processing for all of those customers. The processing must be performed in order, starting from the customer with the greatest balance and descending by balance.
Question: which database is the most suitable one for this use case?
My analysis:
In terms of cost it looks quite similar, i.e. paying for capacity units in the case of DynamoDB vs paying for hours of micro instances in the case of RDS. I'm not sure, though, whether a micro RDS instance is enough for this purpose — I'm going to check it, but I guess it should be.
In terms of performance - I'm not sure here. It's something I will need to check but wanted to ask you here beforehand. Some analysis from my side:
It involves two scan operations in the case of DynamoDB, which is something I really don't want to have. The first scan can be limited to the externalId attribute; then balances are queried from the 3rd party service and updated in the table. The second scan requires a range key defined on the balance attribute to return customers sorted by balance.
I'm not convinced that any kind of index can help here. Basically, there won't be many read operations on the balance — sometimes it will need to be queried for a single customer using its primary key. The number of reads won't be much greater than the number of writes, so indexes may slow the process down.
Additional assumptions in case they matter:
There are ca. 500 000 customers in the database, the average size of a single customer is 200 bytes. So the total size of the customers in the database is 100 MB.
I need to repeat step 1 from the above procedure (update the balance of all customers) several times during the day (ca. 20-30 times per day), but I only need to retrieve the sorted data once per day.
There is only one application (and one instance of the application) performing the above procedure. Besides that, I need to handle simple CRUD which can read/update other attributes of the customers.
I think people are overly afraid of DynamoDB scan operations. They're bad if used for regular queries but for once-in-a-while bulk operations they're not so bad.
How much does it cost to scan a 100 MB table? That's 25,000 4 KB blocks. Eventually consistent reads cost half a unit per block, so that's 12,500 read units. If we assume the cost is $0.25 per million (On-Demand mode), that's 12,500 / 1,000,000 × $0.25 ≈ $0.003 per full table scan. Want to do it 30 times per day? It costs you less than a dime a day.
The thing to consider is the cost of updating every item in the database. That's 500,000 write units, which if in On Demand at $1.25 per million will be about $0.63 per full table update.
If you can go Provisioned for that duration it'll be cheaper.
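The cost arithmetic above can be reproduced in a few lines (On-Demand prices as assumed in the answer: $0.25 per million read units, $1.25 per million write units, and treating 100 MB as 100,000 KB as the answer does):

```python
# Cost of a full table scan and a full table update in On-Demand mode.
TABLE_MB = 100
BLOCK_KB = 4
ITEMS = 500_000

blocks = TABLE_MB * 1000 // BLOCK_KB          # 25,000 4 KB blocks
read_units = blocks / 2                       # eventually consistent = half cost
scan_cost = read_units / 1_000_000 * 0.25     # dollars per full scan
daily_scan_cost = 30 * scan_cost              # 30 scans per day

write_units = ITEMS                           # one write unit per <=1 KB item
update_cost = write_units / 1_000_000 * 1.25  # dollars per full-table update
```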
Regarding performance, DynamoDB can scan a full table faster than any server-oriented database, because it's supported by potentially thousands of back-end servers operating in parallel. For example, you can do a parallel scan with up to a million segments, each with a client thread reading data in 1 MB chunks. If you write a single-threaded client doing a scan it won't be as fast. It's definitely possible to scan slowly, but it's also possible to scan at speeds that seem ludicrous.
If your table is 100 MB, was created in On Demand mode, has never hit a high water mark to auto-increase capacity (just the starter capacity), and you use a multi-threaded pull with 4+ segments, I predict you'll be done in low single digit seconds.
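A hedged sketch of the segmented scan described above. A real client would call DynamoDB's Scan API with `Segment=i, TotalSegments=n` (e.g. via boto3); here `scan_segment` is a local stand-in so the sharding pattern itself can be seen and run:

```python
# Parallel scan: split the key space into segments and read them
# concurrently, then merge the results.
from concurrent.futures import ThreadPoolExecutor

ITEMS = [{"id": i, "balance": i % 97} for i in range(1000)]  # fake table
TOTAL_SEGMENTS = 4

def scan_segment(segment: int) -> list:
    # Stand-in for table.scan(Segment=segment, TotalSegments=TOTAL_SEGMENTS):
    # each segment returns a disjoint slice of the table.
    return [item for item in ITEMS if item["id"] % TOTAL_SEGMENTS == segment]

with ThreadPoolExecutor(max_workers=TOTAL_SEGMENTS) as pool:
    results = [item
               for chunk in pool.map(scan_segment, range(TOTAL_SEGMENTS))
               for item in chunk]
```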
Here's my schema
Here's some example data
Rows with the key structure $PipelineId--$PipelineRunTime will be written less often but with much larger data (though nowhere close to the per-row size limit). Rows with the structure $ContentID--$ContentType--$PipelineName will be created much more often but with much less data.
This is how I plan to query Bigtable:
READ all labels for $PipelineName and $PipelineRunTime
IS $ContentID in labels for $PipelineName at any PipelineRunTime?
READ $ContentID return all labels for any $PipelineName
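The three reads above can be checked against the proposed key layouts with a pure-Python emulation of prefix scans over lexicographically sorted row keys (which is how Bigtable serves range reads). The key values are made up for illustration:

```python
# Emulate Bigtable prefix scans over sorted row keys.
def prefix_scan(keys, prefix):
    return [k for k in sorted(keys) if k.startswith(prefix)]

keys = [
    "pipelineA--20240101",         # $PipelineId--$PipelineRunTime
    "pipelineA--20240102",
    "content1--video--pipelineA",  # $ContentID--$ContentType--$PipelineName
    "content1--video--pipelineB",
    "content2--image--pipelineA",
]

# Query 1: all labels for a pipeline and run time -> a single prefix read.
run = prefix_scan(keys, "pipelineA--20240101")
# Query 3: all labels for a content ID across pipelines -> prefix on $ContentID.
content = prefix_scan(keys, "content1--")
# Query 2 ("is $ContentID in labels for $PipelineName at any run time?") does
# NOT map to a single prefix, because $PipelineName is the key suffix — it
# needs a filter over the $ContentID prefix, or a different key design.
```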
The hot-spotting situation in the context of Bigtable is related to key distribution and request rate. There are two problems:
How the keys are distributed across the backends, and
Whether the hot keys are spread apart in that distribution.
For example, if you have 1 million keys but requests frequently hit only two of them, capacity would be limited to 1 or 2 backends.
If those two keys were sequential, one backend would likely serve both (and will hotspot at high request rates).
If the keys were not sequential, there is a good chance that two different backends would serve them.
As you try to use time as part of the key, you should look into:
Schema design for time series data
[YT Video] Visualizing Cloud Bigtable Access Patterns
[YT Video] Designing Row keys for Bigtable
To understand the performance characteristics and access patterns, and to find out whether there will be hot-spotting, you should run performance tests, use Key Visualizer, and then apply optimizations if needed.
Understanding BigTable performance
Overview of Key Visualizer
I'm planning to use Riak for storing some sensor data, but sensors are connected to different users. My plan is to make a structure like this:
Bucket = user id
key = time, new key each minute (or two minutes maybe)
When I say a new key each minute: the readings are not always continuous and are not real-time; they are uploaded later. They are recorded at certain periods of the day. The metering frequency is quite high: 250 samples per second. If I make a new key for each measurement, I will get an explosion of keys very quickly, and I don't think that will be good for performance. Besides, I don't really need to know the precise value at each given moment; I will use them sequentially over a period (values from minute N to minute M).
So I'm thinking of "grouping" the results for each minute, and storing them like that as some JSON.
Does this strategy look feasible?
Also, I'm thinking about using LevelDB as the storage engine, just to be on the safe side as far as RAM usage goes.
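The per-minute grouping the question proposes can be sketched as follows. This is an illustration only: the minute-key format is an assumption, and actually writing the JSON objects to Riak (bucket = user id, key = minute) would go through a Riak client, which is not shown here:

```python
# Buffer 250 Hz samples and produce one JSON object per minute.
import json
from collections import defaultdict
from datetime import datetime, timezone

def minute_key(ts: float) -> str:
    # Key = the sample's minute, e.g. "2024-01-30T14:09" (assumed format).
    return datetime.fromtimestamp(ts, tz=timezone.utc).strftime("%Y-%m-%dT%H:%M")

def group_samples(samples):
    """samples: iterable of (unix_ts, value) -> {minute_key: json_list}"""
    buckets = defaultdict(list)
    for ts, value in samples:
        buckets[minute_key(ts)].append(value)
    return {k: json.dumps(v) for k, v in buckets.items()}

# 250 samples/sec for 2 minutes -> 2 stored objects instead of 30,000 keys.
samples = [(1706626140 + i / 250, i) for i in range(2 * 60 * 250)]
objects = group_samples(samples)
```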
A lower key count seems better to me than a key per event. How would you use this data later?
If the data is intended for further analysis, LevelDB and secondary indexes allow you to pick out the data for a certain period (if your keys are somehow ordered, by datetime for instance) in a MapReduce job (with additional effort this could be done in the background).
Also, LevelDB does not store all keys in memory, which is good for a continuously growing dataset if you plan to store all the data forever.
If your application depends on predictable latency and needs a fixed amount of data per query, it is better to group the data the way the application wants it (for example, all keys for a 10-minute window in one object).
One more concern is total object size: as the Riak docs say, it is better not to exceed 10 MB for a single object.