Database Shard count recommendation (power of 2)

I was reading this post and the author advises having the shard count be a power of two.
What benefit do we get from that? Why can't it be a simple number like 500, 150 or 1000?

A typical growth pattern for distributed data systems is to double the cluster size when needed. This allows for more even rebalancing of data and minimizes the effect of any hotspots.
Here is an in-depth discussion on database sharding that you may find useful. (Disclosure: dbShards is one of my company's products)
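To make the doubling argument concrete, here is a small Python sketch (the shard counts and keys are made up for illustration, and modulo hashing stands in for whatever scheme a real system uses): going from N to 2N shards means every key either stays put or moves to exactly one "partner" shard, while growing to an arbitrary count such as 8 → 10 scatters each old shard's keys across several new shards.

```python
# Illustrative sketch: why doubling the shard count rebalances cleanly
# with modulo hashing. Shard counts and key range are made up for the demo.
from collections import defaultdict

def shard(key, num_shards):
    # hash() is fine for a demo; a real system would use a stable hash
    return hash(key) % num_shards

def movement(old_n, new_n, keys):
    """Map each old shard to the set of new shards its keys end up on."""
    moves = defaultdict(set)
    for k in keys:
        moves[shard(k, old_n)].add(shard(k, new_n))
    return moves

keys = [f"user:{i}" for i in range(100_000)]

# Doubling (8 -> 16): every old shard's keys land on exactly 2 new shards
# (roughly half stay, half move), so each shard splits cleanly in two.
print(max(len(s) for s in movement(8, 16, keys).values()))   # -> 2

# Arbitrary growth (8 -> 10): each old shard's keys scatter across
# 5 different new shards, so far more data has to move around.
print(max(len(s) for s in movement(8, 10, keys).values()))   # -> 5
```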

Related

How to decide Kafka Cluster size

I am trying to decide how many nodes should be present in a Kafka cluster. I am not sure about the parameters to take into consideration. I am sure it has to be >= 3 (with a replication factor of 2 and failure tolerance of 1 node).
Can someone tell me what parameters should be kept in mind while deciding the cluster size, and how they affect the size?
I know of the following factors, but I don't know how they quantitatively affect the cluster size (I do know how they qualitatively affect it). Is there any other parameter that affects cluster size?
1. Replication factor (cluster size >= replication factor)
2. Node failure tolerance. (cluster size >= node-failure + 1)
What should the cluster size be for the following scenario, taking all the parameters into consideration?
1. There are 3 topics.
2. Each topic has messages of different sizes. Message sizes range from 10 KB to 500 KB, with an average of 50 KB.
3. Each topic has a different number of partitions: 10, 100, and 500.
4. The retention period is 7 days.
5. 100 million messages get posted every day for each topic.
Can someone please point me to relevant documentation, or any other blog that discusses this? I have searched Google for it, but to no avail.
As I understand it, getting good throughput from Kafka doesn't depend only on the cluster size; there are other configurations which need to be considered as well. I will try to share as much as I can.
Kafka's throughput is supposed to scale linearly with the number of disks you have. The multiple data directories feature introduced in Kafka 0.8 allows a topic's partitions to be spread across different disks and machines. But as the partition count grows, so do the chances that leader election will be slower, which also affects consumer rebalancing. This is something to consider, and could be a bottleneck.
Another key thing could be the disk flush rate. Since Kafka always immediately writes all data to the filesystem, the more often data is flushed to disk, the more "seek-bound" Kafka will be, and the lower the throughput. Then again, a very low flush rate might lead to different problems, since in that case the amount of data per flush will be large. So providing an exact figure is not very practical, and I think that is why you couldn't find such a direct answer in the Kafka documentation.
There will be other factors too: for example the consumers' fetch size, compression, the batch size for asynchronous producers, socket buffer sizes, etc.
Hardware and the OS will also play a key role, as running Kafka in a Linux-based environment is advisable due to its page cache mechanism for writing data to disk. Read more on this here.
You might also want to take a look at how the OS flush behavior plays a key role before you actually tune it to fit your needs. I believe it is key to understand the design philosophy, which is what makes Kafka so effective in terms of throughput and fault tolerance.
Some more resources I found useful to dig into:
https://engineering.linkedin.com/kafka/benchmarking-apache-kafka-2-million-writes-second-three-cheap-machines
http://blog.liveramp.com/2013/04/08/kafka-0-8-producer-performance-2/
https://grey-boundary.io/load-testing-apache-kafka-on-aws/
https://cwiki.apache.org/confluence/display/KAFKA/Performance+testing
I recently worked with Kafka and these are my observations.
Each topic is divided into partitions, and all the partitions of a topic are distributed across the Kafka brokers. First of all, this makes it possible to store topics whose size is larger than the capacity of a single Kafka broker, and it also increases consumer parallelism.
To increase reliability and fault tolerance, replicas of the partitions are made; they do not increase consumer parallelism. The rule of thumb is that a single broker can host only a single replica of a given partition. Hence the number of brokers must be >= the number of replicas.
All partitions are spread across all the available brokers. The number of partitions is independent of the number of brokers, but to get the best throughput the number of partitions should equal the number of consumer threads in a consumer group.
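To illustrate the partitions-vs-consumer-threads point, here is a hedged sketch using the kafka-python client; the topic name, group id and broker address are placeholders, not anything from the question.

```python
# Sketch only: assumes a local broker and an existing topic named "events"
# with, say, 10 partitions. Run one copy of this script per consumer
# "thread"; Kafka assigns each instance in the group a subset of the
# partitions, so running more instances than partitions leaves some idle.
from kafka import KafkaConsumer  # pip install kafka-python

consumer = KafkaConsumer(
    "events",                           # hypothetical topic name
    group_id="events-readers",          # all instances share this group
    bootstrap_servers="localhost:9092",
    auto_offset_reset="earliest",
)

for record in consumer:
    # record.partition shows which partition this instance was assigned
    print(record.partition, record.offset, len(record.value))
```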
The cluster size should be decided keeping in mind the throughput you want to achieve at the consumer.
The data rate per broker works out as follows:
Data/day = (100 × 10^6 messages/day) × 0.05 MB (50 KB average) ≈ 5 TB/day per topic
That gives us ~58 MB/s per topic. Assuming the messages are split evenly between partitions and brokers, the whole cluster takes in 58 MB/s × 3 topics ≈ 174 MB/s, or ~58 MB/s of original data per broker on a 3-broker cluster.
Now add replication: with 1 extra replica per partition, each broker sees ~58 MB/s of incoming original data + ~58 MB/s of outgoing replication data + ~58 MB/s of incoming replication data.
That comes to roughly ~116 MB/s of ingress and ~58 MB/s of egress per broker, before counting any consumer reads.
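For transparency, here is the same back-of-the-envelope arithmetic as a small script; the 3-broker cluster and replication factor of 2 are assumptions carried over from the question, and consumer egress is not counted.

```python
# Back-of-the-envelope Kafka sizing, mirroring the numbers above.
# Assumptions (from the question / the discussion above): 3 topics,
# 100M messages/day/topic, 50 KB average message, 3 brokers, RF = 2.
messages_per_day = 100_000_000
avg_msg_mb = 0.05                    # 50 KB
topics = 3
brokers = 3
replication_factor = 2
retention_days = 7

per_topic_mb_day = messages_per_day * avg_msg_mb        # 5,000,000 MB ~= 5 TB/day
per_topic_mb_s = per_topic_mb_day / 86_400              # ~58 MB/s
cluster_ingest_mb_s = per_topic_mb_s * topics           # ~174 MB/s of original data

per_broker_original = cluster_ingest_mb_s / brokers                    # ~58 MB/s in
per_broker_repl_in = per_broker_original * (replication_factor - 1)    # ~58 MB/s in
per_broker_repl_out = per_broker_original * (replication_factor - 1)   # ~58 MB/s out

print(f"ingress per broker ~= {per_broker_original + per_broker_repl_in:.0f} MB/s")  # ~116
print(f"egress  per broker ~= {per_broker_repl_out:.0f} MB/s (replication only)")    # ~58

# Disk needed for the 7-day retention, all topics, all replicas:
storage_mb = per_topic_mb_day * topics * replication_factor * retention_days
print(f"retained data ~= {storage_mb / 1e6:.0f} TB across the cluster")               # ~210
```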
The system load will get very high, and that is without taking any stream processing into consideration.
The system load can be handled by increasing the number of brokers and splitting your topics across more partitions.
If your data are very important, then you may want a higher replication factor. Fault tolerance is also an important factor for deciding the replication.
For example, for very important data, apart from the N active brokers (with the replicas) that manage your partitions, you may need to add stand-by followers in other locations.
If you require very low latency, then you may want to further increase your partitions (by adding additional keys). The more keys you have, the fewer messages you will have on each partition.
For low latency, you may also want a separate cluster (with its replicas) that manages only that special topic, so that no computation for other topics gets in the way.
If a topic is not very important, then you may want to lower the replication factor for that particular topic and be more tolerant of some data loss.
When building a Kafka cluster, the machines supporting your infrastructure should be equally capable. Since partitioning is done round-robin style, you expect each broker to be capable of handling the same load, so the size of your messages does not matter much.
The load from stream processing will also have a direct impact. Good software to monitor your Kafka cluster and manage your streams is Lenses, which I personally favor a lot since it does amazing work with processing real-time streams.

How do I know how many partitions a DynamoDB table is spread over?

Amazon's DynamoDB is designed for guaranteed performance. A customer must provision throughput for each of its tables.
To achieve this performance, tables are transparently spread over multiple "servers", AKA "partitions".
Amazon provides us with a "best practice" guide for dimensioning and optimizing the throughput. In this guide, we are told that the provisioned throughput is evenly divided over the partitions. In other words, if requests are not evenly distributed over the partitions, only a fraction of the reserved (and paid for) throughput will be available to the application.
In the worst case scenario, it will be:
worst_throughput = provisioned_and_paid_throughput / partitions
To estimate this "worst_throughput", I need to know the total number of partitions. Where can I find it, or how do I estimate it?
It says, "When storing data, Amazon DynamoDB divides a table's items into multiple partitions, and distributes the data primarily based on the hash key element."
What you really want to know is the throughput of a single partition. It seems like you can test that by hammering a single key.
See this page: http://docs.aws.amazon.com/amazondynamodb/latest/developerguide/GuidelinesForTables.html#GuidelinesForTables.Partitions
It has some simple calculations you can carry out based on the amount of read and write capacity you provision. Note that this is only for the initial capacity; as your usage of DynamoDB continues, these calculations will have less and less relevance.
A single partition can hold approximately 10 GB of data, and can support a maximum of 3,000 read capacity units or 1,000 write capacity units.
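If you want a rough number rather than the exact (hidden) count, the figures from that guidelines page can be turned into an estimate. The table size and provisioned capacity below are made-up example inputs, and as noted above the real partition count will drift as the table grows and splits.

```python
# Rough partition estimate from the (older) DynamoDB guidelines figures:
# a partition holds ~10 GB and supports up to 3,000 RCU or 1,000 WCU.
# AWS does not expose the real count, so treat this as an initial estimate.
import math

def estimate_partitions(table_size_gb, rcu, wcu):
    by_size = math.ceil(table_size_gb / 10)
    by_capacity = math.ceil(rcu / 3000 + wcu / 1000)
    return max(by_size, by_capacity, 1)

# Hypothetical example: a 55 GB table provisioned at 6,000 RCU / 2,000 WCU
partitions = estimate_partitions(55, 6000, 2000)
print(partitions)                              # 6 by size, 4 by capacity -> 6

# Worst case from the question: provisioned throughput is split evenly,
# but all requests hammer a single partition.
print(6000 / partitions, 2000 / partitions)    # ~1000 RCU / ~333 WCU usable
```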

How does database query time scale with database size?

I was recently on the OEIS (Online Encyclopedia of Integer Sequences), trying to look up a particular sequence I had on hand.
Now, this database is fairly large. The website states that if the 2006 edition (already 5 years old!) were printed, it would occupy 750 volumes of text.
I'm sure this is the same sort of issue Google has to handle as well. But, they also have a distributed system where they take advantage of load balancing.
Neglecting load balancing however, how much time does it take to do a query compared to database size?
Or in other words, what is the time complexity of a query with respect to DB size?
Edit: To make things more specific, assume the input query is simply looking up a string of numbers such as:
1, 4, 9, 16, 25, 36, 49
It strongly depends on the query, the structure of the database, contention, and so on. But in general most databases will find a way to use an index, and that index will either be some kind of tree structure (see http://en.wikipedia.org/wiki/B-tree for one option), in which case access time is proportional to log(n), or else a hash, in which case average access time is O(1) (see http://en.wikipedia.org/wiki/Hash_function#Hash_tables for an explanation of how they work).
So the answer is typically O(1) or O(log(n)) depending on which type of data structure is used.
This may cause you to wonder why we don't always use hash functions. There are multiple reasons. Hash functions make it hard to retrieve ranges of values. If the hash function fails to distribute data well, it is possible for access time to become O(n). Hashes need resizing occasionally, which is potentially very expensive. And log(n) grows slowly enough that you can treat it as being reasonably close to constant across all practical data sets. (From 1000 to 1 petabyte it varies by a factor of 5.) And frequently the actively requested data shows some sort of locality, which trees do a better job of keeping in RAM. As a result trees are somewhat more commonly seen in practice. (Though hashes are by no means rare.)
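A quick way to feel the O(1)-versus-O(log n) difference is to compare a hash lookup with a binary search over a sorted array, which stands in here for a B-tree; this is a toy illustration, not a database benchmark.

```python
# Toy comparison of the two index shapes discussed above:
# a hash table (dict) vs. a sorted array searched with bisect,
# the latter standing in for a B-tree's O(log n) behaviour.
import bisect
import timeit

n = 1_000_000
keys = list(range(n))
hash_index = {k: k for k in keys}     # O(1) average lookup
sorted_index = keys                   # already sorted; O(log n) lookup

def hash_lookup():
    return hash_index[123_456]

def tree_like_lookup():
    i = bisect.bisect_left(sorted_index, 123_456)
    return sorted_index[i]

print(timeit.timeit(hash_lookup, number=100_000))
print(timeit.timeit(tree_like_lookup, number=100_000))

# Unlike the hash, the sorted structure also answers range queries cheaply,
# which is one reason trees remain so common for database indexes.
print(sorted_index[bisect.bisect_left(sorted_index, 100):
                   bisect.bisect_right(sorted_index, 110)])
```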
That depends on a number of factors including the database engine implementation, indexing strategy, specifics of the query, available hardware, database configuration, etc.
There is no way to answer such a general question.
A properly designed and implemented database with terabytes of data may actually outperform a badly designed little database (particularly one with no indexing, badly performing non-sargable queries, and things such as correlated subqueries). This is why anyone expecting to have large amounts of data needs to hire an expert on database design for large databases to do the initial design, not later when the database is already large. You may also need to invest in the kind of equipment needed to handle the size.

How does Voldemort compare to Cassandra?

How does Voldemort compare to Cassandra?
I'm not talking about size of community and only want to hear from people who have actually used both.
Especially I'm interested in:
How they dynamically scale when adding and removing nodes
Query performance
Do they scale linearly when adding nodes?
Write speed
Voldemort's support for adding nodes was only added recently (this month), so I would expect Cassandra's to be more robust given the longer time to mature and a larger community testing it.
Both are fast (> 10k ops/s per machine). Because of their storage designs, I would expect Cassandra to be faster at writes, and Voldemort to be faster at reads. I would also expect Cassandra's performance to degrade less as the amount of data per node increases. And of course if you need more than just a key/value data model Cassandra's ColumnFamily model wins.
I don't know of any head-to-head benchmarks since the one done for NoSQL SF last June, which found Cassandra to be somewhat faster at whatever workload mix he was using. (The "vpork" talk from http://blog.oskarsson.nu/2009/06/nosql-debrief.html) 8 months is an eternity with projects under this much development, though.
Some additional comments:
Regarding write speed, Cassandra should be faster -- it is designed to be faster to write to than to read from (writes can avoid an immediate random disk hit because of the specialized way storage is done).
But the main difference, I think, is actually not performance but feature set: Voldemort is strictly a key/value store (currently, anyway), whereas Cassandra offers range queries (with an order-preserving partitioner) and a bit more structure around the data (column families etc.). The former is an important design consideration; the latter less so IMO, since you can always structure BLOB data on the client side.
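To illustrate the write-path point above, here is a toy log-structured store (not Cassandra's code, just the general idea): a write is a sequential append to a commit log plus an in-memory memtable update, and the memtable is flushed later to a sorted, immutable file in one pass, so there is no random seek or read-before-write on the hot path.

```python
# Toy illustration (not Cassandra's implementation) of a log-structured
# write path: writes append sequentially to a commit log and update an
# in-memory memtable; the memtable is flushed to a sorted, immutable
# SSTable-like file later, in a single sequential pass.
import json, os, tempfile

class ToyLSMStore:
    def __init__(self, data_dir, memtable_limit=1000):
        self.data_dir = data_dir
        self.commit_log = open(os.path.join(data_dir, "commitlog"), "a")
        self.memtable = {}
        self.memtable_limit = memtable_limit
        self.sstable_count = 0

    def put(self, key, value):
        # 1) durability: sequential append to the commit log
        self.commit_log.write(json.dumps([key, value]) + "\n")
        self.commit_log.flush()
        # 2) speed: update the in-memory memtable, no read-before-write
        self.memtable[key] = value
        if len(self.memtable) >= self.memtable_limit:
            self._flush()

    def _flush(self):
        # Write the memtable out as one sorted, immutable file.
        path = os.path.join(self.data_dir, f"sstable-{self.sstable_count}.json")
        with open(path, "w") as f:
            json.dump(dict(sorted(self.memtable.items())), f)
        self.sstable_count += 1
        self.memtable.clear()

store = ToyLSMStore(tempfile.mkdtemp())
for i in range(2500):
    store.put(f"k{i}", i)   # fast writes; reads would check memtable then SSTables
```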

Using SQL server partitioning when storing 1000 records a second

I read your article (SQL Server partitioning: not the answer to everything)
and I am wondering whether to use partitioning for my case or not.
I must store about 1000 records per second. This data is about the locations of mobile nodes, and it makes my database very large.
Do you think I should partition my database or not (I will have a lot of reporting in the future)?
1000 a second isn't that much.
Is it every second of 24/7?
In a defined window?
Is it a peak of 1000 per second but usually less?
We have a recent system growing at 20 million rows/month (after tidy ups of say another 50-80 million) and we're not thinking of anything like partitioning.
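For a sense of scale, assuming the 1000 rows/second really is sustained 24/7 (which is exactly what the questions above are probing), the arithmetic looks roughly like this; the 100 bytes per row is a made-up figure.

```python
# Rough arithmetic, assuming the 1000 rows/second is sustained 24/7.
rows_per_second = 1000
per_day = rows_per_second * 86_400          # 86,400,000 rows/day
per_month = per_day * 30                    # ~2.6 billion rows/month

# With, say, 100 bytes per row (hypothetical), raw data alone:
bytes_per_row = 100
per_month_gb = per_month * bytes_per_row / 1e9   # ~259 GB/month before indexes

print(per_day, per_month, round(per_month_gb))
```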
That's a lot of data.
What is the lifecycle of the data, i.e. do you only need to store the records for a finite amount of time? For example, after a month, perhaps certain data can be archived off or moved to a data warehouse.
Given the volume of data that you intend to work with, you are probably going to want an architecture that scales easily. For this reason you may want to look at using cloud services such as Amazon EC2, or SQL Data Services on the Azure platform.
http://aws.amazon.com/ec2/
http://www.microsoft.com/azure/data.mspx
Perhaps if you provide more specific details about what it is you are actually looking to do i.e. what business process you are looking to support, we may be able to provide more specific assistance.
Without such details it is not possible to ascertain whether or not SQL Server Partitioning would be an appropriate design approach for you.
You might need to look at a different RDBMS. I would take a look at Vertica.
Presuming the table in question is indexed, one of two options is certainly warranted when any of the indexes outgrows the available RAM. Not surprisingly, one of them is to increase RAM. The other, of course, is vertical partitioning.
gbn's answer provides some good things to consider which you have not mentioned, such as how many records per month (or week, or day) are being added. Richard's comment about how big the (average) record is is also significant, particularly in terms of how big the average index records are, presuming the indexes do not include all the fields from the table.
gbn's answer, however, also seems a bit reckless to me: growing at 20 million rows per month and not even "thinking of anything like partitioning". Without sufficient metrics, as alluded to above, this is a possible recipe for disaster. You should at least be thinking about it, even if just to determine how long you can sustain your current and/or expected rate of growth before needing to consider more RAM or partitioning.
