In the open source version, Scylla recommends keeping up to 50% of disk space free for “compactions”. At the same time, the documentation states that each table is compacted independently of the others. Logically, this suggests that in applications with multiple (or even dozens of) tables there is only a small chance that many compactions will coincide.
Is there a mathematical model for calculating how multiple compactions might overlap in an application with several tables? Based on a cursory analysis, it seems that the likelihood of multiple overlapping compactions is small, especially when we are dealing with dozens of independent tables.
You're absolutely right:
With the size-tiered compaction strategy a compaction may temporarily double the disk requirements. However, it doesn't double the entire disk usage, only that of the sstables involved in this compaction (see also my blog post on size-tiered compaction and its space amplification). There is indeed a difference between "the entire disk usage" and just "the sstables involved in this compaction", for two reasons:
As you noted in your question, if you have 10 tables of similar size, compacting just one of them will work on just 10% of the data, so the temporary disk usage during compaction might be 10% of the disk usage, not 100%.
Additionally, Scylla is sharded, meaning that different CPUs handle their sstables, and compactions, completely independently. If you have 8 CPUs on your machines, each CPU only handles 1/8th of the data, so when it does compaction, the maximum temporary overhead will be 1/8th of the table's size - not the full table size.
The second reason cannot be counted on - since shards choose when to compact independently, if you're unlucky all shards may decide to compact the same table at exactly the same time, and worse - may happen to do the biggest compactions all at the same time. This "unluckiness" can also happen at 100% probability if you start a "major compaction" (nodetool compact).
The first reason, the one which you asked about, is indeed more useful and reliable: Beyond it being unlikely that all shards will choose to compact all sstables at exactly the same time, there is an important detail in Scylla's compaction algorithm which helps here: Each shard only does one compaction of a (roughly) given size at a time. So if you have many roughly-equal-sized tables, no shard can be doing a full compaction of more than one of those tables at a time. This is guaranteed - it's not a matter of probability.
Of course, this "trick" only helps if you really have many roughly-equal-sized tables. If one table is much bigger than the rest, or tables have very different sizes, it won't help you too much to control the maximum temporary disk use.
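There is no official formula for this, but to address the "mathematical model" part of the question, here is a rough probabilistic sketch in Python. Everything in it is an assumption for illustration: T equal-sized tables, each independently in a "big" compaction with probability p at any instant, and each such compaction temporarily needing up to its table's full size in extra space. As explained above, the per-shard scheduling of same-sized compactions makes the independence assumption likely pessimistic, so treat the output as a conservative estimate rather than a guarantee.

```python
# A rough, hedged model of temporary disk overhead from overlapping compactions.
# Assumptions (not Scylla guarantees): T equal-sized tables, each independently in a
# "big" compaction with probability p at any instant, and each such compaction
# temporarily needing up to its table's full size as extra space.
from math import comb

T = 20        # number of equal-sized tables (assumed)
p = 0.05      # chance a given table is in a big compaction right now (assumed)

def prob_at_least(k, T, p):
    """P(at least k of T independent tables are compacting simultaneously)."""
    return sum(comb(T, j) * p**j * (1 - p)**(T - j) for j in range(k, T + 1))

for k in range(1, 6):
    overhead = k / T              # extra space as a fraction of total data
    print(f">= {k} simultaneous compactions (>= {overhead:.0%} extra space): "
          f"P = {prob_at_least(k, T, p):.4f}")
```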
In issue https://github.com/scylladb/scylla/issues/2871 I proposed an idea for how Scylla could guarantee that when disk space is low, sharding (the second point above) is also used to reduce temporary disk space usage. We haven't implemented this idea, but instead implemented a better one - "incremental compaction strategy", which does huge compactions in pieces ("incrementally") to avoid most of the temporary disk usage. See this blog post for how this new compaction strategy works, and graphs demonstrating how it lowers the temporary disk usage. Note that Incremental Compaction Strategy is currently part of the Scylla Enterprise version (it's not in the open-source version).
Related
Let's say, theoretically, I have a database with an absurd number of tables (100,000+). Would that lead to any sort of performance issues, given that most queries (99%+) will only run on 2-3 tables at a time?
Therefore, my question is this:
What operations are O(n) on the number of tables in PostgreSQL?
Please note, no answers about how this is bad design, or how I need to plan out more about what I am designing. Just assume that for my situation, having a huge number of tables is the best design.
pg_dump and pg_restore and pg_upgrade are actually worse than that, being O(N^2). That used to be a huge problem, although in recent versions the constant on that N^2 has been reduced so low that for 100,000 tables it is probably not enough to be your biggest problem. However, there are worse cases, like dumping a table, which can be O(M^2) (maybe M^3, I don't recall the exact details anymore) per table, where M is the number of columns in the table. This only applies when the columns have check constraints or defaults or other additional info beyond a name and type. All of these problems are particularly nasty when you have no operational problems to warn you, but then suddenly discover you can't upgrade within a reasonable time frame.
Some physical backup methods, like barman using rsync, are also O(N^2) in the number of files, which is at least as great as the number of tables.
During normal operations, the stats collector can be a big bottleneck. Every time someone requests updated stats on some table, it has to write out a file covering all tables in that database. Writing this out is O(N) in the number of tables in that database. (It used to be worse, writing out one file for the whole instance, not one per database.) This can be made even worse on some filesystems which, when renaming one file over the top of an existing one, implicitly fsync the file; putting the stats directory on a RAM disc can at least ameliorate that.
The autovacuum workers loop over every table (roughly once per autovacuum_naptime) to decide if they need to be vacuumed, so a huge number of tables can slow this down. This can also be worse than O(N), because for each table there is some possibility it will request updated stats on it. Worse, it could block all concurrent autovacuum workers while doing so (this last part fixed in a backpatch for all supported versions).
Another problem you might run into is that each database backend maintains a cache of metadata on each table (or other object) it has accessed during its lifetime. There is no mechanism for expiring this cache, so if each connection touches a huge number of tables it will start consuming a lot of memory, and one copy for each backend as it is not shared. If you have a connection pooler which holds connections open indefinitely, this can really add up as each connection lives long enough to touch many tables.
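If you want to see this effect for yourself, a rough sketch like the following could help. It assumes a local PostgreSQL reachable via the hypothetical DSN below, psycopg2 installed, and Linux (it reads the backend's resident memory from /proc); it simply touches every table in the public schema from a single connection and prints the backend's memory before and after.

```python
# Sketch: observe per-backend metadata cache growth as one connection touches many tables.
import psycopg2

DSN = "dbname=test"   # hypothetical connection string -- adjust for your setup

def backend_rss_kb(pid):
    """Read the server backend's resident set size from /proc (Linux, same host only)."""
    with open(f"/proc/{pid}/status") as f:
        for line in f:
            if line.startswith("VmRSS:"):
                return int(line.split()[1])

conn = psycopg2.connect(DSN)
cur = conn.cursor()
cur.execute("SELECT pg_backend_pid()")
pid = cur.fetchone()[0]

cur.execute("SELECT tablename FROM pg_tables WHERE schemaname = 'public'")
tables = [row[0] for row in cur.fetchall()]

print("backend RSS before:", backend_rss_kb(pid), "kB")
for t in tables:
    # touching each table makes this backend cache its metadata (relcache/catcache)
    cur.execute(f'SELECT 1 FROM "{t}" LIMIT 0')
print(f"backend RSS after touching {len(tables)} tables:", backend_rss_kb(pid), "kB")
conn.close()
```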
pg_dump with some options, probably -s (schema only). Other options make it depend more on the size of the data.
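If you would rather measure this on your own hardware than reason about big-O, something like the sketch below could do it: it creates N empty tables (with a DEFAULT, so they carry the extra per-column info mentioned above) in a throwaway database and times pg_dump -s. The database name "scratch" and the table counts are assumptions, and the larger runs may take a while.

```python
# Rough sketch, not a benchmark: time `pg_dump -s` as the number of tables grows.
import subprocess
import time
import psycopg2

DBNAME = "scratch"   # hypothetical throwaway database -- create it first

def time_schema_dump(n_tables):
    conn = psycopg2.connect(dbname=DBNAME)
    conn.autocommit = True
    cur = conn.cursor()
    for i in range(n_tables):
        # DEFAULT '' adds per-column info beyond name and type, as discussed above
        cur.execute(f"CREATE TABLE IF NOT EXISTS t_{i} (id int PRIMARY KEY, note text DEFAULT '')")
    conn.close()
    start = time.monotonic()
    subprocess.run(["pg_dump", "-s", "-f", "/dev/null", DBNAME], check=True)
    return time.monotonic() - start

for n in (1_000, 5_000, 10_000):
    print(f"{n:>6} tables: pg_dump -s took {time_schema_dump(n):.1f}s")
```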
I am new to Cassandra. As I understand it, the maximum number of tables that can be stored per keyspace is Integer.MAX_VALUE. However, what are the implications from a performance perspective (speed, storage, etc.) of such a big number of tables? Is there any recommendation regarding that?
While there are legitimate use cases for having lots of tables in Cassandra, they are rare. Your use case might be one of them, but make sure that it is. Without knowing more about the problem you're trying to solve, it's obviously hard to give guidance. Many tables will require more resources, obviously. How much? That depends on the settings, and the usage.
For example, if you have a thousand tables and write to all of them at the same time there will be contention for RAM since there will be memtables for each of them, and there is a certain overhead for each memtable (how much depends on which version of Cassandra, your settings, etc.).
However, if you have a thousand tables but don't write to all of them at the same time, there will be less contention. There's still a per table overhead, but there will be more RAM to keep the active table's memtables around.
The same goes for disk IO. If you read and write to a lot of different tables at the same time the disk is going to do much more random IO.
Just having lots of tables isn't a big problem, even though there is a limit to how many you can have – you can have as many as you want provided you have enough RAM to keep the structures that keep track of them. Having lots of tables and reading and writing to them all at the same time will be a problem, though. It will require more resources than doing the same number of reads and writes to fewer tables.
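To put the "per-table overhead" point in rough numbers, here is a trivial back-of-the-envelope sketch. The per-table figure is an assumed placeholder, not a measured constant; the real cost varies a lot with the Cassandra version, memtable settings, and whether the tables are actively written.

```python
# Back-of-the-envelope estimate of fixed per-table memory overhead in Cassandra.
# PER_TABLE_OVERHEAD_MB is an assumed placeholder -- the real figure depends on the
# Cassandra version, memtable settings, and whether the tables are actively written.
PER_TABLE_OVERHEAD_MB = 1.0      # assumed fixed cost per table (metadata, memtable, etc.)
HEAP_MB = 8 * 1024               # assumed 8 GB heap

def tables_overhead_mb(n_tables, overhead_mb=PER_TABLE_OVERHEAD_MB):
    return n_tables * overhead_mb

for n in (100, 1_000, 10_000):
    used = tables_overhead_mb(n)
    print(f"{n:>6} tables -> ~{used:,.0f} MB fixed overhead "
          f"({used / HEAP_MB:.0%} of an {HEAP_MB // 1024} GB heap)")
```

Even a small per-table cost adds up quickly once the table count reaches five or six figures.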
In my opinion, splitting the data into multiple tables, even thousands, can be beneficial.
Pros:
Suppose you want to scale in the future to 10+ nodes: with an RF of 2, the data will end up evenly distributed across the nodes, so this setup scales well.
Another point is random IO, which will be high if you read from many tables at the same time, but I don't see why this differs from having just one table: you would still be seeking a different partition key, so there is no real difference in IO.
When compaction takes place it has less work to do per table, since each table holds only part of the data: the values from the SSTables must be loaded into memory, merged, and saved back.
Cons:
Having multiple tables will result in having multiple memtables. I think the extra RAM this adds is insignificant.
Also, check out these links, they helped me A LOT: http://manuel.kiessling.net/2016/07/11/how-cassandras-inner-workings-relate-to-performance/
https://www.infoq.com/presentations/Apache-Cassandra-Anti-Patterns
Please feel free to edit my post, I am kinda new to Big Data.
If my index is say 80% fragmented and is used in joins can the overall performance be worse than if that index didn't exist? And if so, why?
Your question is too vague to answer consistently, or even to know what you're actually after, but consider this:
A fragmented index means you'll do a lot more actual disk activity than a given query would otherwise need.
Take a look at DBCC SHOWCONTIG
Among other useful information, it shows you a figure for Scan Density. A very low "hit rate" on this can imply that you're doing heaps more IO than you'd need to with a properly maintained index. This could even exceed the amount of IO you'd need to perform a table scan, but it all depends on the size of your objects and your data access pattern.
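If you prefer to script the check, a sketch along these lines could pull similar figures from sys.dm_db_index_physical_stats (the modern counterpart of DBCC SHOWCONTIG on recent SQL Server versions). The connection string is a placeholder, and the 30% / 1000-page cutoffs are common rules of thumb rather than hard limits.

```python
# Sketch: list fragmented indexes via sys.dm_db_index_physical_stats.
# The ODBC connection string is a placeholder; adjust driver/server/database to your setup.
import pyodbc

CONN_STR = "DRIVER={ODBC Driver 17 for SQL Server};SERVER=localhost;DATABASE=MyDb;Trusted_Connection=yes"

QUERY = """
SELECT OBJECT_NAME(ips.object_id) AS table_name,
       i.name                     AS index_name,
       ips.avg_fragmentation_in_percent,
       ips.page_count
FROM sys.dm_db_index_physical_stats(DB_ID(), NULL, NULL, NULL, 'LIMITED') AS ips
JOIN sys.indexes AS i
  ON i.object_id = ips.object_id AND i.index_id = ips.index_id
WHERE ips.avg_fragmentation_in_percent > 30   -- common rule of thumb, not a hard law
  AND ips.page_count > 1000                   -- ignore tiny indexes where fragmentation is noise
ORDER BY ips.avg_fragmentation_in_percent DESC
"""

with pyodbc.connect(CONN_STR) as conn:
    for table, index, frag, pages in conn.cursor().execute(QUERY):
        print(f"{table}.{index}: {frag:.1f}% fragmented over {pages} pages")
```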
One area where a poorly maintained (= highly fragmented) index will hurt you double, is that it hurts performance in inserts, updates AND selects.
With this in mind, it's a pretty common practice for ETL processes to drop indexes before processing large batches of information and recreate them afterwards. In the meantime, the indexes would only hurt write performance and be too fragmented to help lookups.
Besides that: index maintenance is easy to automate. I'd recommend deploying Ola Hallengren's index maintenance solution and not worrying about it any longer.
I am looking into the performance hit in processing time when increasing the number of partitions in a cube. I realise from http://technet.microsoft.com/en-us/library/ms365363.aspx that in theory it can be 2+ billion, however I expect there is still a hit with any increase. Is there a way I can estimate this (I realise it's subjective; I guess I'm looking for a formula), or would I have to prove it out?
Many thanks,
Sara
Partitions are generally used to increase the performance, not to decrease performance, but you're right that if you have too many, then you will take a performance hit. It looks like you want to know how to find out how many partitions is too many.
I'm going to assume that the processing time you are talking about is the time to process the cube, not the time to query the cube.
The general idea of partitions is that you only have to process a small subset of the partitions when you are reprocessing the cube. This makes them a huge performance enhancement. If you are processing a large number of partitions, then the overhead of processing an individual partition becomes non-negligible. The point at which this happens depends on a number of factors. The factors that scale with the number of partitions include:
Additional queries to your data source. This cost varies greatly with your data source arrangements.
Additional files to store the partitions.
Additional links to the partitions.
I think the biggest factor here is how you get the data from the data source. If the partitioning is not supported well by your source, then your performance will be horrendous. If it's supported well, e.g. it has all the necessary indices in a relational database, then you only incur the overhead of individual queries.
So I think a more fitting way to ask this question is not "how many partitions is too many?" but "how small a partition is too small?" I would say that if the number of facts in a partition is in the low hundreds, then you probably have too many partitions. It's highly unlikely you will want to make that many partitions; I think the 2 billion limit quoted is just to assure you that you'll never hit it.
Regarding whether you should have this many partitions: I don't think you should. Partition carefully, creating maybe a few hundred partitions, and split the data based on whether it changes often or not.
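To make the "how small is too small" heuristic concrete, a toy check like the one below could flag under-sized partitions. The fact counts are made up, and the threshold just echoes the "low hundreds of facts" guidance above, not any official limit.

```python
# Toy check of a partitioning scheme: flag partitions that end up with very few facts.
# MIN_FACTS_PER_PARTITION follows the rough "low hundreds is too small" guidance above;
# the partition names and fact counts are hypothetical examples.
MIN_FACTS_PER_PARTITION = 500   # assumed threshold, tune to taste

partitions = {                  # hypothetical partition -> fact count
    "2023-Q1": 4_200_000,
    "2023-Q2": 3_900_000,
    "archive-2001-03": 180,     # probably not worth its own partition
}

for name, facts in partitions.items():
    verdict = "ok" if facts >= MIN_FACTS_PER_PARTITION else "too small -- consider merging"
    print(f"{name:>16}: {facts:>9,} facts -> {verdict}")
```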
I've been studying indexes and there are some questions that bother me and which I think are important.
If you can help or refer to sources, please feel free to do it.
Q1: B-tree indexes can favor a fast access to specific rows on a table. Considering an OLTP system, with many accesses, both Read and Write, simultaneously, do you think it can be a disadvantage having many B-tree indexes on this system? Why?
Q2: Why are B-Tree indexes not fully occupied (typically only 75% occupied, if I'm not mistaken)?
Q1: I've no administration experience with large indexing systems in practice, but the typical multiprocessing environment drawbacks apply to having multiple B-tree indexes on a system: the cost of context switching, cache invalidation and flushing, poor IO scheduling, and the list goes on. On the other hand, IO is something that inherently ought to be non-blocking for maximal use of resources, and it's hard to do that without some sort of concurrency, even if done in a cooperative manner. (For example, some people recommend event-based systems.) Also, you're going to need multiple index structures for many practical applications, especially if you're looking at OLTP. The biggest things here are good IO scheduling, access patterns, and data caching that matches those access patterns.
Q2: Because splitting and re-balancing nodes is expensive. The naive methodology for speed is "only split when they're full." Given this, there are two extremes: a node was just split and is half full, or a node is full and will be split next time. The 'average' between the two cases (50% and 100%) is 75%. Yes, it's somewhat loose logic from a mathematics perspective, but it exposes the underlying reason why the 75% figure appears.
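To see how the "split only when full" rule plays out, here is a toy leaf-level simulation (a sketch, not a real B-tree: it models only leaf splits under uniformly random inserts, and the capacity and insert count are arbitrary). The measured fill lands somewhere between the 50% and 100% extremes, which is the intuition behind the ~75% figure.

```python
# Toy simulation of the leaf level of a B-tree under the naive "split only when full" rule.
# CAPACITY and N_INSERTS are arbitrary; this ignores internal nodes, deletions, and
# fill-factor tuning, so it only illustrates the splitting behaviour, nothing more.
import bisect
import random

CAPACITY = 8        # max keys a leaf can hold before it must split
N_INSERTS = 10_000  # number of uniformly random keys to insert

def insert(leaves, key):
    """Place key in the leaf covering its range; split the leaf in half if it overflows."""
    firsts = [leaf[0] for leaf in leaves]           # lowest key of each leaf
    i = max(bisect.bisect_right(firsts, key) - 1, 0)
    leaf = leaves[i]
    bisect.insort(leaf, key)
    if len(leaf) > CAPACITY:                        # naive rule: split only when full
        mid = len(leaf) // 2
        leaves[i:i + 1] = [leaf[:mid], leaf[mid:]]  # two half-full leaves replace the full one

random.seed(0)
leaves = [[0.0]]                                    # seed leaf so every key has a home
for _ in range(N_INSERTS):
    insert(leaves, random.random())

fill = sum(len(leaf) for leaf in leaves) / (CAPACITY * len(leaves))
print(f"{len(leaves)} leaves, average fill = {fill:.0%}")
```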