Cassandra - What is the reasonable maximum number of tables? - database

I am new to Cassandra. As I understand the maximum number of tables that can be stored per keyspace is Integer.Max_Value. However, what are the implications from the performance perspective (speed, storage, etc) of such a big number of tables? Is there any recommendation regarding that?

While there are legitimate use cases for having lots of tables in Cassandra, they are rare. Your use case might be one of them, but make sure that it is. Without knowing more about the problem you're trying to solve, it's hard to give guidance. Many tables will require more resources, obviously. How much? That depends on the settings and the usage.
For example, if you have a thousand tables and write to all of them at the same time there will be contention for RAM since there will be memtables for each of them, and there is a certain overhead for each memtable (how much depends on which version of Cassandra, your settings, etc.).
However, if you have a thousand tables but don't write to all of them at the same time, there will be less contention. There's still a per table overhead, but there will be more RAM to keep the active table's memtables around.
The same goes for disk IO. If you read and write to a lot of different tables at the same time the disk is going to do much more random IO.
Just having lots of tables isn't a big problem, even though there is a practical limit to how many you can have: you can have as many as you want, provided you have enough RAM for the structures that keep track of them. Having lots of tables and reading and writing to them all at the same time will be a problem, though. It will require more resources than doing the same number of reads and writes to fewer tables.
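The memtable argument above can be put into rough numbers. This is only a back-of-envelope sketch: the 1 MB per-table overhead and the 2048 MB memtable budget are assumptions chosen for illustration, not figures from the answer, and the real overhead depends on your Cassandra version and settings.

```python
def memtable_overhead_mb(num_tables, per_table_overhead_mb=1.0):
    """Memory consumed just by the tables existing, before any data is written.

    per_table_overhead_mb is an assumed placeholder, not a measured value.
    """
    return num_tables * per_table_overhead_mb


def share_per_active_table_mb(memtable_space_mb, num_tables, active_tables,
                              per_table_overhead_mb=1.0):
    """Memtable space left for each table that is currently being written to."""
    if active_tables <= 0:
        raise ValueError("active_tables must be positive")
    remaining = memtable_space_mb - memtable_overhead_mb(
        num_tables, per_table_overhead_mb)
    return max(remaining, 0.0) / active_tables


if __name__ == "__main__":
    # Hypothetical node: 2048 MB set aside for memtables, 1000 tables defined.
    for active in (10, 1000):
        share = share_per_active_table_mb(2048, 1000, active)
        print(f"{active} tables written concurrently: {share:.2f} MB each")
```

With the same number of tables, writing to only a few of them at a time leaves each active memtable far more room before it has to be flushed.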

In my opinion, splitting the data into multiple tables, even thousands, is beneficial if you can do it.
Pros:
Suppose you want to scale in the future to 10+ nodes: with an RF of 2, this will result in the data being evenly distributed across nodes, thus not scalable.
Another point is random IO which will be big if you will read from many tables at the same time but I don't see why there is a difference when having just one table. Also you will seek for another partition key, so no difference in IO.
When compaction takes place, it will have to do less work if there is only one table. The values from SSTables must be loaded into memory, merged and saved back.
Cons:
Having multiple tables will result in having multiple memtables. I think the difference added by this to the RAM is insignificant.
Also, check out the links, they helped me A LOT http://manuel.kiessling.net/2016/07/11/how-cassandras-inner-workings-relate-to-performance/
https://www.infoq.com/presentations/Apache-Cassandra-Anti-Patterns
Please feel free to edit my post, I am kinda new to Big Data

Related

What operations are O(n) on the number of tables in PostgreSQL?

Let's say, theoretically, I have a database with an absurd number of tables (100,000+). Would that lead to any sort of performance issues, provided most queries (99%+) will only run on 2-3 tables at a time?
Therefore, my question is this:
What operations are O(n) on the number of tables in PostgreSQL?
Please note, no answers about how this is bad design, or how I need to plan out more about what I am designing. Just assume that for my situation, having a huge number of tables is the best design.
pg_dump, pg_restore and pg_upgrade are actually worse than that, being O(N^2). That used to be a huge problem, although in recent versions the constant on that N^2 has been reduced so much that for 100,000 tables it is probably not enough to be your biggest problem. However, there are worse cases: dumping tables can be O(M^2) (maybe M^3, I don't recall the exact details anymore) for each table, where M is the number of columns in the table. This only applies when the columns have check constraints or defaults or other additional info beyond a name and type. All of these problems are particularly nasty because you have no operational problems to warn you, and then suddenly discover you can't upgrade within a reasonable time frame.
Some physical backup methods, like barman using rsync, are also O(N^2) in the number of files, which is at least as great as the number of tables.
During normal operations, the stats collector can be a big bottleneck. Every time someone requests updated stats on some table, it has to write out a file covering all tables in that database. Writing this out is O(N) for the tables in that database. (It used to be worse, writing out one file for the whole instance, not just the database.) This can be made even worse on some filesystems, which, when renaming one file over the top of an existing one, implicitly fsync the file, so putting it on a RAM disk can at least ameliorate that.
The autovacuum workers loop over every table (roughly once per autovacuum_naptime) to decide if they need to be vacuumed, so a huge number of tables can slow this down. This can also be worse than O(N), because for each table there is some possibility it will request updated stats on it. Worse, it could block all concurrent autovacuum workers while doing so (this last part fixed in a backpatch for all supported versions).
Another problem you might run into is that each database backend maintains a cache of metadata on each table (or other object) it has accessed during its lifetime. There is no mechanism for expiring this cache, so if each connection touches a huge number of tables it will start consuming a lot of memory, with one copy for each backend as it is not shared. If you have a connection pooler which holds connections open indefinitely, this can really add up as each connection lives long enough to touch many tables.
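To see how the per-backend cache adds up, here is a rough estimate. The bytes-per-cached-table figure is a made-up placeholder for illustration (it is not a PostgreSQL constant); measure it on your own system before relying on the result.

```python
def backend_cache_mb(connections, tables_touched_per_connection,
                     bytes_per_cached_table=10_000):
    """Estimate total metadata-cache memory across all backends, in MiB.

    Each backend keeps its own unshared copy, so the total grows with
    connections * tables touched. bytes_per_cached_table is an assumed value.
    """
    total_bytes = (connections
                   * tables_touched_per_connection
                   * bytes_per_cached_table)
    return total_bytes / (1024 * 1024)


if __name__ == "__main__":
    # Hypothetical pooler: 200 long-lived connections on a 100,000-table database.
    for touched in (3, 1_000, 100_000):
        mb = backend_cache_mb(200, touched)
        print(f"{touched} tables touched per connection: about {mb:,.0f} MiB")
```

The estimate is negligible when each connection only ever sees 2-3 tables, and grows linearly as pooled connections gradually touch more of them.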
pg_dump with some options, probably -s. Some other options make it depend more on size of data.

How large can a hbase table actually grow?

Would there be any reason to split a hbase table into smaller entities, or can it grow forever (assuming available disk space)?
Background:
We have realtime data (measurements), up to let's say 500,000/s, which consists essentially of timestamp, value, flags. If we distribute the values to different tables, it would also mean inserting each of the entries individually, which is a performance killer. If we insert in bulk it is much faster. The question is, are there any downsides to having an HBase table with an extreme size?
There could be a strong reason behind splitting a table, which is avoiding RegionServer hotspotting by distributing the load across multiple RegionServers. HBase, by virtue of its nature, stores rows sequentially in one place. Rows with similar keys go to the same server (timeseries data, for example). This is to facilitate better range queries. However, this starts becoming a bottleneck once your data grows too big (even if your disk still has space).
In cases like above data will continue to go to the same RegionServer, leading to hotspotting. So, we split tables manually to distribute the data uniformly across the cluster.
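The hotspotting effect can be illustrated with a small simulation. This is a toy model, not HBase code: it assumes regions are contiguous, sorted key ranges delimited by split keys, and the split keys and row keys below are hypothetical.

```python
from bisect import bisect_right
from collections import Counter


def region_for(row_key, split_keys):
    """Index of the region whose key range contains row_key.

    split_keys must be sorted; region i covers [split_keys[i-1], split_keys[i]).
    """
    return bisect_right(split_keys, row_key)


def load_per_region(row_keys, split_keys):
    """Count how many of the given writes land in each region."""
    return Counter(region_for(key, split_keys) for key in row_keys)


if __name__ == "__main__":
    # Hypothetical table split into 4 regions by zero-padded timestamp.
    splits = ["2000", "4000", "6000"]
    # New measurements always carry the latest (largest) timestamps.
    new_rows = [f"{ts:04d}" for ts in range(7000, 7100)]
    print(load_per_region(new_rows, splits))
```

Because the keys increase monotonically, every new write falls into the last region, so a single RegionServer takes the whole write load until that region is split again.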
I don't see the point in manually splitting an HBase table; HBase does this on its own, and extremely well (the resulting pieces are called regions).
HBase has been made to handle extremely large data, so I like to believe that the limit depends on your hardware only (of course, some configurations might impact performance, such as automatic major compaction, etc.)

Processing performance hit in SSAS with 2000+ partitions in 2008 R2

I am looking into the performance hits in processing time when increasing the number of partitions in a cube. I realise from http://technet.microsoft.com/en-us/library/ms365363.aspx that in theory it can be 2+ billion; however, I expect there is still a hit with any increase. Is there a way I can estimate this (I realise it's subjective, I guess I'm looking for a formula) or would I have to prove it out?
Many thanks,
Sara
Partitions are generally used to increase the performance, not to decrease performance, but you're right that if you have too many, then you will take a performance hit. It looks like you want to know how to find out how many partitions is too many.
I'm going to assume that the processing time you are talking about is the time to process the cube, not the time to query the cube.
The general idea of partitions is that you only have to process a small subset of the partitions when you are reprocessing the cube. This makes it a huge performance enhancement. If you are processing a large number of partitions, then the overhead of processing an individual partition becomes non-negligible. The point at which this happens can depend on a number of factors. The factors that scale with partitions include:
Additional queries to your data source. This cost varies greatly with your data source arrangements.
Additional files to store the partitions.
Additional links to the partitions.
I think the biggest factor here is how you get the data from the data source. If the partitioning is not supported well by your source, then your performance will be horrendous. If it's supported well, e.g. it has all the necessary indices in a relational database, then you only incur the overhead of individual queries.
So I think a more fitting way to ask this question is not how many partitions is too many, but how small of a partition is too small? I would say if the number of facts in a partition is in the low hundreds, then you probably have too many partitions. It's highly unlikely you will want to make that many partitions. I think the 2 billion quoted is just to assure you that you'll never get there.
Regarding whether you should have this many partitions, I don't think you should. I think you should partition carefully, making maybe a few hundred partitions, partitioning the data based on whether the data changes often or not.

Advice on building a fast, distributed database

I'm currently working on a problem that involves querying a tremendous amount of data (billions of rows) and, being somewhat inexperienced with this type of thing, would love some clever advice.
The data/problem looks like this:
Each table has 2-5 key columns and 1 value column.
Every row has a unique combination of keys.
I need to be able to query by any subset of keys (i.e. key1='blah' and key4='bloo').
It would be nice to be able to quickly insert new rows (updating the value if the row already exists) but I'd be satisfied if I could do this slowly.
Currently I have this implemented in MySQL running on a single machine with separate indexes defined on each key, one index across all keys (unique) and one index combining the first and last keys (which is currently the most common query I'm making, but that could easily change). Unfortunately, this is quite slow (and the indexes end up taking ~10x the disk space, which is not a huge problem).
I happen to have a bevy of fast computers at my disposal (~40), which makes the incredible slowness of this single-machine database all the more annoying. I want to take advantage of all this power to make this database fast. I've considered building a distributed hash table, but that would make it hard to query for only a subset of the keys. It seems that something like BigTable / HBase would be a decent solution but I'm not yet convinced that a simpler solution doesn't exist.
Thanks very much, any help would be greatly appreciated!
I'd suggest you listen to this podcast for some excellent information on distributed databases.
episode-109-ebays-architecture-principles-with-randy-shoup
To point out the obvious: you're probably disk bound.
At some point if you're doing randomish queries and your working set is sufficiently larger than RAM then you'll be limited by the small number of random IOPS a disk can do. You aren't going to be able to do better than a few tens of sub-queries per second per attached disk.
If you're up against that bottleneck, you might gain more by switching to an SSD, a larger RAID, or lots-of-RAM than you would by distributing the database among many computers (which would mostly just get you more of the last two resources)
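The disk-bound ceiling is simple arithmetic. The sketch below assumes about 100 random IOPS for a spinning disk and 4 random reads per query; both numbers are illustrative guesses, not measurements from the question.

```python
def max_queries_per_second(disks, random_iops_per_disk, seeks_per_query):
    """Upper bound on query rate when every query needs uncached random reads."""
    if seeks_per_query <= 0:
        raise ValueError("seeks_per_query must be positive")
    return disks * random_iops_per_disk / seeks_per_query


if __name__ == "__main__":
    # Assumed figures: ~100 random IOPS per spinning disk, 4 seeks per query.
    for disks in (1, 8, 40):
        qps = max_queries_per_second(disks, 100, 4)
        print(f"{disks} disk(s): at most {qps:.0f} queries/s")
```

Under these assumptions, adding machines helps mainly because it adds spindles and RAM, which is the same gain a larger RAID or more memory on one machine would give.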

Database scalability - performance vs. database size

I'm creating an app that will have to put at max 32 GB of data into my database. I am using B-tree indexing because the reads will have range queries (like from 0 < time < 1hr).
At the beginning (database size = 0GB), I get between 60 and 70 writes per millisecond. After, say, 5GB, the three databases I've tested (H2, Berkeley DB, Sybase SQL Anywhere) have REALLY slowed down, to under 5 writes per millisecond.
Questions:
Is this typical?
Would I still see this scalability issue if I REMOVED indexing?
What are the causes of this problem?
Notes:
Each record consists of a few ints
Yes; indexing improves fetch times at the cost of insert times. Your numbers sound reasonable - without knowing more.
You can benchmark it. You'll need to have a reasonable amount of data stored. Consider whether or not to index based upon the queries - heavy fetch and light insert? index everywhere a where clause might use it. Light fetch, heavy inserts? Probably avoid indexes. Mixed workload; benchmark it!
When benchmarking, you want as real or realistic data as possible, both in volume and on data domain (distribution of data, not just all "henry smith" but all manner of names, for example).
It is typical for indexes to sacrifice insert speed for access speed. You can find that out from a database table (and I've seen these in the wild) that indexes every single column. There's nothing inherently wrong with that if the number of updates is small compared to the number of queries.
However, given that:
1/ You seem to be concerned that your writes slow down to 5/ms (that's still 5000/second),
2/ You're only writing a few integers per record; and
3/ Your queries are only based on time ranges,
you may want to consider bypassing a regular database and rolling your own sort-of-database (my thoughts are that you're collecting real-time data such as device readings).
If you're only ever writing sequentially-timed data, you can just use a flat file and periodically write the 'index' information separately (say at the start of every minute).
This will greatly speed up your writes but still allow a relatively efficient read process - worst case is you'll have to find the start of the relevant period and do a scan from there.
This of course depends on my assumption of your storage being correct:
1/ You're writing records sequentially based on time.
2/ You only need to query on time ranges.
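A minimal sketch of the flat-file idea, under the two assumptions just listed. The record layout (64-bit timestamp plus two 32-bit ints), the class name and the in-memory index are all choices made for this illustration; the answer suggests writing the index out separately, which is left out here to keep the example short.

```python
import os
import struct
import tempfile
from bisect import bisect_right

RECORD = struct.Struct("<qii")  # assumed layout: timestamp, value, flags


class TimeSeriesLog:
    """Append-only flat file plus a sparse index of (timestamp, byte offset)."""

    def __init__(self, path, index_interval=60):
        self.path = path
        self.index_interval = index_interval  # time units between index entries
        self.index_ts = []    # timestamp of each indexed record
        self.index_off = []   # byte offset of each indexed record
        self._file = open(path, "wb")  # creates or truncates the file
        self._offset = 0

    def append(self, ts, value, flags=0):
        # Records are assumed to arrive in non-decreasing timestamp order.
        if not self.index_ts or ts - self.index_ts[-1] >= self.index_interval:
            self.index_ts.append(ts)
            self.index_off.append(self._offset)
        self._file.write(RECORD.pack(ts, value, flags))
        self._offset += RECORD.size

    def query(self, start, end):
        """Return all records with start <= timestamp < end."""
        if not self.index_off:
            return []
        self._file.flush()
        # Jump to the last index entry at or before start, then scan forward.
        i = max(bisect_right(self.index_ts, start) - 1, 0)
        results = []
        with open(self.path, "rb") as f:
            f.seek(self.index_off[i])
            while chunk := f.read(RECORD.size):
                ts, value, flags = RECORD.unpack(chunk)
                if ts >= end:
                    break
                if ts >= start:
                    results.append((ts, value, flags))
        return results

    def close(self):
        self._file.close()


if __name__ == "__main__":
    path = os.path.join(tempfile.mkdtemp(), "readings.bin")
    log = TimeSeriesLog(path, index_interval=60)
    for ts in range(3600):          # one reading per second for an hour
        log.append(ts, ts % 100)
    print(len(log.query(600, 660)))  # readings in one minute
    log.close()
```

Writes are plain sequential appends, and a range read costs one index lookup plus a scan of at most one index interval before the first matching record.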
Yes, indexes will generally slow inserts down, while significantly speeding up selects (queries).
Do keep in mind that not all inserts into a B-tree are equal. It's a tree; if all you do is insert into it, it has to keep growing. The data structure allows for some padding, but if you keep inserting into it numbers that are growing sequentially, it has to keep adding new pages and/or shuffle things around to stay balanced. Make sure that your tests are inserting numbers that are well distributed (assuming that's how they will come in real life), and see if you can do anything to tell the B-tree how many items to expect from the beginning.
Totally agree with @Richard-t - it is quite common in offline/batch scenarios to remove indexes completely before bulk updates to a corpus, only to reapply them when the update is complete.
The type of indexes applied also influences insertion performance - for example, with a SQL Server clustered index, update I/O is used for data distribution as well as index update, whereas nonclustered indexes are updated in separate (and therefore more expensive) I/O operations.
As with any engineering project - best advice is to measure with real datasets (skews page distribution, tearing etc.)
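The drop-and-rebuild approach mentioned above might look like this. SQLite (via Python's standard sqlite3 module) is used purely as a stand-in so the example runs anywhere; the table and index names are hypothetical, and the databases from the question have their own syntax and tooling for this.

```python
import sqlite3


def bulk_load(conn, rows):
    """Drop the index, insert all rows, then rebuild the index once."""
    with conn:  # commits when the block exits without an error
        conn.execute("DROP INDEX IF EXISTS idx_readings_ts")
        conn.executemany(
            "INSERT INTO readings (ts, value) VALUES (?, ?)", rows)
        # One index build over the final data instead of one update per insert.
        conn.execute("CREATE INDEX idx_readings_ts ON readings (ts)")


if __name__ == "__main__":
    conn = sqlite3.connect(":memory:")
    conn.execute("CREATE TABLE readings (ts INTEGER, value INTEGER)")
    conn.execute("CREATE INDEX idx_readings_ts ON readings (ts)")
    bulk_load(conn, ((t, t % 7) for t in range(100_000)))
    print(conn.execute("SELECT COUNT(*) FROM readings").fetchone()[0])
    conn.close()
```

Whether this actually beats inserting with the index in place depends on the engine and the data, so time both variants on a realistic dataset.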
I think somewhere in the BDB docs they mention that page size greatly affects this behavior in B-trees. Assuming you aren't doing much in the way of concurrency and you have fixed record sizes, you should try increasing your page size.

Resources