I am new to Cassandra. As I understand the maximum number of tables that can be stored per keyspace is Integer.Max_Value. However, what are the implications from the performance perspective (speed, storage, etc) of such a big number of tables? Is there any recommendation regarding that?
While there are legitimate use cases for having lots of tables in Cassandra, they are rare. Your use case might be one of them, but make sure that it is. Without knowning more about the problem you're trying to solve, it's obviously hard to give guidance. Many tables will require more resources, obviously. How much? That depends on the settings, and the usage.
For example, if you have a thousand tables and write to all of them at the same time there will be contention for RAM since there will be memtables for each of them, and there is a certain overhead for each memtable (how much depends on which version of Cassandra, your settings, etc.).
However, if you have a thousand tables but don't write to all of them at the same time, there will be less contention. There's still a per table overhead, but there will be more RAM to keep the active table's memtables around.
The same goes for disk IO. If you read and write to a lot of different tables at the same time the disk is going to do much more random IO.
Just having lots of tables isn't a big problem, even though there is a limit to how many you can have – you can have as many as you want provided you have enough RAM to keep the structures that keep track of them. Having lots of tables and reading and writing to them all at the same time will be a problem, though. It will require more resources than doing the same number of reads and writes to fewer tables.
In my opinion if you can split the data into multiple tables, even thousands, is beneficial.
Pros:
Suppose you want to scale in future to 10+ nodes and with a RF of 2 will result in having the data evenly distributed across nodes, thus not salable.
Another point is random IO which will be big if you will read from many tables at the same time but I don't see why there is a difference when having just one table. Also you will seek for another partition key, so no difference in IO.
When the compactation takes place it will have to do less work if there is only one table. The values from SSTables must be loaded into memory, merged and saved back.
Cons:
Having multiple tables will result in having multiple memtables. I think the difference added by this to the RAM is insignificant.
Also, check out the links, they helped me A LOT http://manuel.kiessling.net/2016/07/11/how-cassandras-inner-workings-relate-to-performance/
https://www.infoq.com/presentations/Apache-Cassandra-Anti-Patterns
Please fell free to edit my post, I am kinda new to Big Data
I'm working on a web application where the user provides parameters, and these are used to produce a list of the top 1000 items from a database of up to 20 million rows. I need all top 1000 items at once, and I need this ranking to happen more or less instantaneously from the perspective of the user.
Currently, I'm using a MySQL with a user-defined function to score and rank the data, then PHP takes it from there. Tested on a database of 1M rows, this takes about 8 seconds, but I need performance around 2 seconds, even for a database of up to 20M rows. Preferably, this number should be lower still, so that decent throughput is guaranteed for up to 50 simultaneous users.
I am open to any process with any software that can process this data as efficiently as possible, whether it is MySQL or not. Here are the features and constraints of the process:
The data for each row that is relevant to the scoring process is about 50 bytes per item.
Inserts and updates to the DB are negligible.
Each score is independent of the others, so scores can be computed in parallel.
Due to the large number of parameters and parameter values, the scores cannot be pre-computed.
The method should scale well for multiple simultaneous users
The fewer computing resources this requires, in terms of number of servers, the better.
Thanks
A feasible approach seems to be to load (and later update) all data into about 1GB RAM and perform the scoring and ranking outside MySQL in a language like C++. That should be faster than MySQL.
The scoring must be relatively simple for this approache because your requirements only leave a tenth of a microsecond per row for scoring and ranking without parallelization or optimization.
If you could post query you are having issue with can help.
Although here are some things.
Make sure you have indexes created on database.
Make sure to use optimized queries and using joins instead of inner queries.
Based on your criteria, the possibility of improving performance would depend on whether or not you can use the input criteria to pre-filter the number of rows for which you need to calculate scores. I.e. if one of the user-provided parameters automatically disqualifies a large fraction of the rows, then applying that filtering first would improve performance. If none of the parameters have that characteristic, then you may need either much more hardware or a database with higher performance.
I'd say for this sort of problem, if you've done all the obvious software optimizations (and we can't know that, since you haven't mentioned anything about your software approaches), you should try for some serious hardware optimization. Max out the memory on your SQL servers, and try to fit your tables into memory where possible. Use an SSD for your table / index storage, for speedy deserialization. If you're clustered, crank up the networking to the highest feasible network speeds.
I'm currently working on a problem that involves querying a tremendous amount of data (billions of rows) and, being somewhat inexperienced with this type of thing, would love some clever advice.
The data/problem looks like this:
Each table has 2-5 key columns and 1 value column.
Every row has a unique combination of keys.
I need to be able to query by any subset of keys (i.e. key1='blah' and key4='bloo').
It would be nice to able to quickly insert new rows (updating the value if the row already exists) but I'd be satisfied if I could do this slowly.
Currently I have this implemented in MySQL running on a single machine with separate indexes defined on each key, one index across all keys (unique) and one index combining the first and last keys (which is currently the most common query I'm making, but that could easily change). Unfortunately, this is quite slow (and the indexes end up taking ~10x the disk space, which is not a huge problem).
I happen to have a bevy of fast computers at my disposal (~40), which makes the incredible slowness of this single-machine database all the more annoying. I want to take advantage of all this power to make this database fast. I've considered building a distributed hash table, but that would make it hard to query for only a subset of the keys. It seems that something like BigTable / HBase would be a decent solution but I'm not yet convinced that a simpler solution doesn't exist.
Thanks very much, any help would be greatly appreciated!
I'd suggest you listen to this podcast for some excellent information on distributed databases.
episode-109-ebays-architecture-principles-with-randy-shoup
To point out the obvious: you're probably disk bound.
At some point if you're doing randomish queries and your working set is sufficiently larger than RAM then you'll be limited by the small number of random IOPS a disk can do. You aren't going to be able to do better than a few tens of sub-queries per second per attached disk.
If you're up against that bottleneck, you might gain more by switching to an SSD, a larger RAID, or lots-of-RAM than you would by distributing the database among many computers (which would mostly just get you more of the last two resources)
I'm creating an app that will have to put at max 32 GB of data into my database. I am using B-tree indexing because the reads will have range queries (like from 0 < time < 1hr).
At the beginning (database size = 0GB), I will get 60 and 70 writes per millisecond. After say 5GB, the three databases I've tested (H2, berkeley DB, Sybase SQL Anywhere) have REALLY slowed down to like under 5 writes per millisecond.
Questions:
Is this typical?
Would I still see this scalability issue if I REMOVED indexing?
What are the causes of this problem?
Notes:
Each record consists of a few ints
Yes; indexing improves fetch times at the cost of insert times. Your numbers sound reasonable - without knowing more.
You can benchmark it. You'll need to have a reasonable amount of data stored. Consider whether or not to index based upon the queries - heavy fetch and light insert? index everywhere a where clause might use it. Light fetch, heavy inserts? Probably avoid indexes. Mixed workload; benchmark it!
When benchmarking, you want as real or realistic data as possible, both in volume and on data domain (distribution of data, not just all "henry smith" but all manner of names, for example).
It is typical for indexes to sacrifice insert speed for access speed. You can find that out from a database table (and I've seen these in the wild) that indexes every single column. There's nothing inherently wrong with that if the number of updates is small compared to the number of queries.
However, given that:
1/ You seem to be concerned that your writes slow down to 5/ms (that's still 5000/second),
2/ You're only writing a few integers per record; and
3/ You're queries are only based on time queries,
you may want to consider bypassing a regular database and rolling your own sort-of-database (my thoughts are that you're collecting real-time data such as device readings).
If you're only ever writing sequentially-timed data, you can just use a flat file and periodically write the 'index' information separately (say at the start of every minute).
This will greatly speed up your writes but still allow a relatively efficient read process - worst case is you'll have to find the start of the relevant period and do a scan from there.
This of course depends on my assumption of your storage being correct:
1/ You're writing records sequentially based on time.
2/ You only need to query on time ranges.
Yes, indexes will generally slow inserts down, while significantly speeding up selects (queries).
Do keep in mind that not all inserts into a B-tree are equal. It's a tree; if all you do is insert into it, it has to keep growing. The data structure allows for some padding, but if you keep inserting into it numbers that are growing sequentially, it has to keep adding new pages and/or shuffle things around to stay balanced. Make sure that your tests are inserting numbers that are well distributed (assuming that's how they will come in real life), and see if you can do anything to tell the B-tree how many items to expect from the beginning.
Totally agree with #Richard-t - it is quite common in offline/batch scenarios to remove indexes completely before bulk updates to a corpus, only to reapply them when update is complete.
The type of indices applied also influence insertion performance - for example with SQL Server clustered index update I/O is used for data distribution as well as index update, where as nonclustered indexes are updated in seperate (and therefore more expensive) I/O operations.
As with any engineering project - best advice is to measure with real datasets (skews page distribution, tearing etc.)
I think somewhere in the BDB docs they mention that page size greatly affects this behavior in btree's. Assuming you arent doing much in the way of concurrency and you have fixed record sizes, you should try increasing your page size
I'm doing some research into databases and I'm looking at some limitations of relational DBs.
I'm getting that joins of large tables is very expensive, but I'm not completely sure why. What does the DBMS need to do to execute a join operation, where is the bottleneck?
How can denormalization help to overcome this expense? How do other optimization techniques (indexing, for example) help?
Personal experiences are welcome! If you're going to post links to resources, please avoid Wikipedia. I know where to find that already.
In relation to this, I'm wondering about the denormalized approaches used by cloud service databases like BigTable and SimpleDB. See this question.
Denormalising to improve performance? It sounds convincing, but it doesn't hold water.
Chris Date, who in company with Dr Ted Codd was the original proponent of the relational data model, ran out of patience with misinformed arguments against normalisation and systematically demolished them using scientific method: he got large databases and tested these assertions.
I think he wrote it up in Relational Database Writings 1988-1991 but this book was later rolled into edition six of Introduction to Database Systems, which is the definitive text on database theory and design, in its eighth edition as I write and likely to remain in print for decades to come. Chris Date was an expert in this field when most of us were still running around barefoot.
He found that:
Some of them hold for special cases
All of them fail to pay off for general use
All of them are significantly worse for other special cases
It all comes back to mitigating the size of the working set. Joins involving properly selected keys with correctly set up indexes are cheap, not expensive, because they allow significant pruning of the result before the rows are materialised.
Materialising the result involves bulk disk reads which are the most expensive aspect of the exercise by an order of magnitude. Performing a join, by contrast, logically requires retrieval of only the keys. In practice, not even the key values are fetched: the key hash values are used for join comparisons, mitigating the cost of multi-column joins and radically reducing the cost of joins involving string comparisons. Not only will vastly more fit in cache, there's a lot less disk reading to do.
Moreover, a good optimiser will choose the most restrictive condition and apply it before it performs a join, very effectively leveraging the high selectivity of joins on indexes with high cardinality.
Admittedly this type of optimisation can also be applied to denormalised databases, but the sort of people who want to denormalise a schema typically don't think about cardinality when (if) they set up indexes.
It is important to understand that table scans (examination of every row in a table in the course of producing a join) are rare in practice. A query optimiser will choose a table scan only when one or more of the following holds.
There are fewer than 200 rows in the relation (in this case a scan will be cheaper)
There are no suitable indexes on the join columns (if it's meaningful to join on these columns then why aren't they indexed? fix it)
A type coercion is required before the columns can be compared (WTF?! fix it or go home) SEE END NOTES FOR ADO.NET ISSUE
One of the arguments of the comparison is an expression (no index)
Performing an operation is more expensive than not performing it. However, performing the wrong operation, being forced into pointless disk I/O and then discarding the dross prior to performing the join you really need, is much more expensive. Even when the "wrong" operation is precomputed and indexes have been sensibly applied, there remains significant penalty. Denormalising to precompute a join - notwithstanding the update anomalies entailed - is a commitment to a particular join. If you need a different join, that commitment is going to cost you big.
If anyone wants to remind me that it's a changing world, I think you'll find that bigger datasets on gruntier hardware just exaggerates the spread of Date's findings.
For all of you who work on billing systems or junk mail generators (shame on you) and are indignantly setting hand to keyboard to tell me that you know for a fact that denormalisation is faster, sorry but you're living in one of the special cases - specifically, the case where you process all of the data, in-order. It's not a general case, and you are justified in your strategy.
You are not justified in falsely generalising it. See the end of the notes section for more information on appropriate use of denormalisation in data warehousing scenarios.
I'd also like to respond to
Joins are just cartesian products with some lipgloss
What a load of bollocks. Restrictions are applied as early as possible, most restrictive first. You've read the theory, but you haven't understood it. Joins are treated as "cartesian products to which predicates apply" only by the query optimiser. This is a symbolic representation (a normalisation, in fact) to facilitate symbolic decomposition so the optimiser can produce all the equivalent transformations and rank them by cost and selectivity so that it can select the best query plan.
The only way you will ever get the optimiser to produce a cartesian product is to fail to supply a predicate: SELECT * FROM A,B
Notes
David Aldridge provides some important additional information.
There is indeed a variety of other strategies besides indexes and table scans, and a modern optimiser will cost them all before producing an execution plan.
A practical piece of advice: if it can be used as a foreign key then index it, so that an index strategy is available to the optimiser.
I used to be smarter than the MSSQL optimiser. That changed two versions ago. Now it generally teaches me. It is, in a very real sense, an expert system, codifying all the wisdom of many very clever people in a domain sufficiently closed that a rule-based system is effective.
"Bollocks" may have been tactless. I am asked to be less haughty and reminded that math doesn't lie. This is true, but not all of the implications of mathematical models should necessarily be taken literally. Square roots of negative numbers are very handy if you carefully avoid examining their absurdity (pun there) and make damn sure you cancel them all out before you try to interpret your equation.
The reason that I responded so savagely was that the statement as worded says that
Joins are cartesian products...
This may not be what was meant but it is what was written, and it's categorically untrue. A cartesian product is a relation. A join is a function. More specifically, a join is a relation-valued function. With an empty predicate it will produce a cartesian product, and checking that it does so is one correctness check for a database query engine, but nobody writes unconstrained joins in practice because they have no practical value outside a classroom.
I called this out because I don't want readers falling into the ancient trap of confusing the model with the thing modelled. A model is an approximation, deliberately simplified for convenient manipulation.
The cut-off for selection of a table-scan join strategy may vary between database engines. It is affected by a number of implementation decisions such as tree-node fill-factor, key-value size and subtleties of algorithm, but broadly speaking high-performance indexing has an execution time of k log n + c. The C term is a fixed overhead mostly made of setup time, and the shape of the curve means you don't get a payoff (compared to a linear search) until n is in the hundreds.
Sometimes denormalisation is a good idea
Denormalisation is a commitment to a particular join strategy. As mentioned earlier, this interferes with other join strategies. But if you have buckets of disk space, predictable patterns of access, and a tendency to process much or all of it, then precomputing a join can be very worthwhile.
You can also figure out the access paths your operation typically uses and precompute all the joins for those access paths. This is the premise behind data warehouses, or at least it is when they're built by people who know why they're doing what they're doing, and not just for the sake of buzzword compliance.
A properly designed data warehouse is produced periodically by a bulk transformation out of a normalised transaction processing system. This separation of the operations and reporting databases has the very desirable effect of eliminating the clash between OLTP and OLAP (online transaction processing ie data entry, and online analytical processing ie reporting).
An important point here is that apart from the periodic updates, the data warehouse is read only. This renders moot the question of update anomalies.
Don't make the mistake of denormalising your OLTP database (the database on which data entry happens). It might be faster for billing runs but if you do that you will get update anomalies. Ever tried to get Reader's Digest to stop sending you stuff?
Disk space is cheap these days, so knock yourself out. But denormalising is only part of the story for data warehouses. Much bigger performance gains are derived from precomputed rolled-up values: monthly totals, that sort of thing. It's always about reducing the working set.
ADO.NET problem with type mismatches
Suppose you have a SQL Server table containing an indexed column of type varchar, and you use AddWithValue to pass a parameter constraining a query on this column. C# strings are Unicode, so the inferred parameter type will be NVARCHAR, which doesn't match VARCHAR.
VARCHAR to NVARCHAR is a widening conversion so it happens implicitly - but say goodbye to indexing, and good luck working out why.
"Count the disk hits" (Rick James)
If everything is cached in RAM, JOINs are rather cheap. That is, normalization does not have much performance penalty.
If a "normalized" schema causes JOINs to hit the disk a lot, but the equivalent "denormalized" schema would not have to hit the disk, then denormalization wins a performance competition.
Comment from original author: Modern database engines are very good at organising access sequencing to minimise cache misses during join operations. The above, while true, might be miscontrued as implying that joins are necessarily problematically expensive on large data. This would lead to cause poor decision-making on the part of inexperienced developers.
What most commenters fail to note is the wide range of join methodologies available in a complex RDBMS, and the denormalisers invariably gloss over the higher cost of maintaining denormalised data. Not every join is based on indexes, and databases have a lot of optimised algotithms and methodologies for joining that are intended to reduce join costs.
In any case, the cost of a join depends on its type and a few other factors. It needn't be expensive at all - some examples.
A hash join, in which bulk data is equijoined, is very cheap indeed, and the cost only become significant if the hash table cannot be cached in memory. No index required. Equi-partitioning between the joined data sets can be a great help.
The cost of a sort-merge join is driven by the cost of the sort rather than the merge -- an index-based access method can virtually eliminate the cost of the sort.
The cost of a nested loop join on an index is driven by the height of the b-tree index and the access of the table block itself. It's fast, but not suitable for bulk joins.
A nested loop join based on a cluster is much cheaper, with fewer logicAL IO'S required per join row -- if the joined tables are both in the same cluster then the join becomes very cheap through the colocation of joined rows.
Databases are designed to join, and they're very flexible in how they do it and generally very performant unless they get the join mechanism wrong.
I think the whole question is based on a false premise. Joins on large tables are not necessarily expensive. In fact, doing joins efficiently is one of the main reasons relational databases exist at all. Joins on large sets often are expensive, but very rarely do you want to join the entire contents of large table A with the entire contents of large table B. Instead, you write the query such that only the important rows of each table are used and the actual set kept by the join remains smaller.
Additionally, you have the efficiencies mentioned by Peter Wone, such that only the important parts of each record need be in memory until the final result set is materialized. Also, in large queries with many joins you typically want to start with the smaller table sets and work your way up to the large ones, so that the set kept in memory remains as small as possible as long as possible.
When done properly, joins are generally the best way to compare, combine, or filter on large amounts of data.
The bottleneck is pretty much always disk I/O, and even more specifically - random disk I/O (by comparison, sequential reads are fairly fast and can be cached with read ahead strategies).
Joins can increase random seeks - if you're jumping around reading small parts of a large table. But, query optimizers look for that and will turn it into a sequential table scan (discarding the unneeded rows) if it thinks that'd be better.
A single denormalized table has a similar problem - the rows are large, and so less fit on a single data page. If you need rows that are located far from another (and the large row size makes them further apart) then you'll have more random I/O. Again, a table scan may be forced to avoid this. But, this time, your table scan has to read more data because of the large row size. Add to that the fact that you're copying data from a single location to multiple locations, and the RDBMS has that much more to read (and cache).
With 2 tables, you also get 2 clustered indexes - and can generally index more (because of less insert/update overhead) which can get you drastically increased performance (mainly, again, because indexes are (relatively) small, quick to read off disk (or cheap to cache), and lessen the amount of table rows you need to read from disk).
About the only overhead with a join comes from figuring out the matching rows. Sql Server uses 3 different types of joins, mainly based on dataset sizes, to find matching rows. If the optimizer picks the wrong join type (due to inaccurate statistics, inadequate indexes, or just an optimizer bug or edge case) it can drastically affect query times.
A loop join is farily cheap for (at least 1) small dataset.
A merge join requires a sort of both datasets first. If you join on an indexed column, though, then the index is already sorted and no further work needs to be done. Otherwise, there is some CPU and memory overhead in sorting.
The hash join requires both memory (to store the hashtable) and CPU (to build the hash). Again, this is fairly quick in relation to the disk I/O. However, if there's not enough RAM to store the hashtable, Sql Server will use tempdb to store parts of the hashtable and the found rows, and then process only parts of the hashtable at a time. As with all things disk, this is fairly slow.
In the optimal case, these cause no disk I/O - and so are negligible from a performance perspective.
All in all, at worst - it should actually be faster to read the same amount of logical data from x joined tables, as it is from a single denormalized table because of the smaller disk reads. To read the same amount of physical data, there could be some slight overhead.
Since query time is usually dominated by I/O costs, and the size of your data does not change (minus some very miniscule row overhead) with denormalization, there's not a tremendous amount of benefit to be had by just merging tables together. The type of denormalization that tends to increase performance, IME, is caching calculated values instead of reading the 10,000 rows required to calculate them.
The order in which you're joining the tables is extremely important. If you have two sets of data try to build the query in a way so the smallest will be used first to reduce the amount of data the query has to work on.
For some databases it does not matter, for example MS SQL does know the proper join order most of the time.
For some (like IBM Informix) the order makes all the difference.
Deciding on whether to denormalize or normalize is fairly a straightforward process when you consider the complexity class of the join. For instance, I tend to design my databases with normalization when the queries are O(k log n) where k is relative to the desired output magnitude.
An easy way to denormalize and optimize performance is to think about how changes to your normalize structure affect your denormalized structure. It can be problematic however as it may require transactional logic to work on a denormalized structured.
The debate for normalization and denormalization isn't going to end since the problems are vast. There are many problems where the natural solution requires both approaches.
As a general rule, I've always stored a normalized structure and denormalized caches that can be reconstructed. Eventually, these caches save my ass to solve the future normalization problems.
Elaborating what others have said,
Joins are just cartesian products with some lipgloss. {1,2,3,4}X{1,2,3} would give us 12 combinations (nXn=n^2). This computed set acts as a reference on which conditions are applied. The DBMS applies the conditions (like where both left and right are 2 or 3) to give us the matching condition(s). Actually it is more optimised but the problem is the same. The changes to size of the sets would increase the result size exponentially. The amount of memory and cpu cycles consumed all are effected in exponential terms.
When we denormalise, we avoid this computation altogether, think of having a colored sticky, attached to every page of your book. You can infer the information with out using a reference. The penalty we pay is that we are compromising the essence of DBMS (optimal organisation of data)