Does scalability with databases have a limit? - database

Assuming that the larger a database gets, the longer it will take to SELECT rows, won't a database eventually take too long (i.e. annoying to users) to traverse regardless of how optimized it is?
Is it simply a matter of the increasing time being so negligible that there is only a theoretical limit, but no realistic one?

Well, yes, in a manner of speaking. Generally, the more data you have, the longer it will take to find what you're looking for.
There are ways to dramatically reduce that time (indexing, sharding, etc), and you can always add more hardware. Indexing especially saves you from scanning the whole table to find your result. If you've got a simple B-tree index, the worst case should be O(log n).
Apart from theoretical limits, there are also practical ones, for example maximum number of rows per table, but these days those limits are so high that you can almost ignore them.
I wouldn't worry about it. If you're using a decent DBMS and decent hardware... with realistic amounts of data, you can always find a way to return a result in an acceptable amount of time. If you do reach the limits, chances are that you're making money from what you've got stored, and then you can always hire a pro to help you out ;)

Related

Processing performance hit in SSAS with 2000+ partitions in 2008 R2

I am looking into the performance hits in processing time when increasing the number of partitions in a cube. I realise from http://technet.microsoft.com/en-us/library/ms365363.aspx that in theory it can be 2+ billion however I expect there is still a hit with any increase. Is there a way I can estimate this (I realise it's subject, I guess I'm looking for a formula) or would I have to proof it out?
Many thanks,
Sara
Partitions are generally used to increase the performance, not to decrease performance, but you're right that if you have too many, then you will take a performance hit. It looks like you want to know how to find out how many partitions is too many.
I'm going to assume that the processing time you are talking about is the time to process the cube, not the time to query the cube.
The general idea of partitions is that you only have to process only a small subset of the partitions when you are reprocessing the cube. This makes it a huge performance enhancement. If you are processing a large number of partitions, then the overhead of processing an individual partition becomes non-negligible. The point this happens can depend on a number of factors. The factors that scale with partitions include:
Additional queries to your data source. This cost varies greatly with your data source arrangements.
Additional files to store the partitions.
Additional links to the partitions.
I think the biggest factor here is how you get the data from the data source. If the partitioning is not supported well by your source, then your performance will be horrendous. If it's supported well, e.g. it has all the necessary indices in a relational database, then you only incur the overhead of individual queries.
So I think a more fitting way to ask this question is not how many partitions is too many, but how small of a partition is too small? I would say if the number of facts in a partition is in the low hundreds, then you probably have too many partitions. It's highly unlikely you will want to make that many partitions. I think the 2 billion quoted is just to assure you that you'll never get there.
Regarding whether you should have this many partitions, I don't think you should. I think you should partition carefully, making maybe a few hundred partitions, partitioning the data based on whether the data changes often or not.

How does database query time scale with database size?

I was recently on the OEIS (Online Encyclopedia of Integer Sequences) recently, trying to look up a particular sequence I had on had.
Now, this database is fairly large. The website states that if the 2006 (! 5 years old) edition were printed, it would occupy 750 volumes of text.
I'm sure this is the same sort of issue Google has to handle as well. But, they also have a distributed system where they take advantage of load balancing.
Neglecting load balancing however, how much time does it take to do a query compared to database size?
Or in other words, what is the time complexity of a query with respect to DB size?
Edit: To make things more specific, assume the input query is simply looking up a string of numbers such as:
1, 4, 9, 16, 25, 36, 49
It strongly depends on the query, structure of the database, contention, and so on. But in general most databases will find a way to use an index, and that index will either be some kind of tree structure (see http://en.wikipedia.org/wiki/B-tree for one option) in which case access time is proportional to log(n), or else a hash in which case access time is proportional to O(1) on average (see http://en.wikipedia.org/wiki/Hash_function#Hash_tables for an explanation of how they work).
So the answer is typically O(1) or O(log(n)) depending on which type of data structure is used.
This may cause you to wonder why we don't always use hash functions. There are multiple reasons. Hash functions make it hard to retrieve ranges of values. If the hash function fails to distribute data well, it is possible for access time to become O(n). Hashes need resizing occasionally, which is potentially very expensive. And log(n) grows slowly enough that you can treat it as being reasonably close to constant across all practical data sets. (From 1000 to 1 petabyte it varies by a factor of 5.) And frequently the actively requested data shows some sort of locality, which trees do a better job of keeping in RAM. As a result trees are somewhat more commonly seen in practice. (Though hashes are by no means rare.)
That depends on a number of factors including the database engine implementation, indexing strategy, specifics of the query, available hardware, database configuration, etc.
There is no way to answer such a general question.
A properly designed and implemented database with terabytes of data may actually outperform a badly designed little database (particulaly one with no indexing and one that uses badly performing non-sargable queries and things such as correlated subqueries). This is why anyone expecting to have large amounts of data needs to hire an expert on databse design for large databases to do the intial design not later when the database is large. You may also need to invest in the type of equipment you need to handle the size as well.

Database scalability - performance vs. database size

I'm creating an app that will have to put at max 32 GB of data into my database. I am using B-tree indexing because the reads will have range queries (like from 0 < time < 1hr).
At the beginning (database size = 0GB), I will get 60 and 70 writes per millisecond. After say 5GB, the three databases I've tested (H2, berkeley DB, Sybase SQL Anywhere) have REALLY slowed down to like under 5 writes per millisecond.
Questions:
Is this typical?
Would I still see this scalability issue if I REMOVED indexing?
What are the causes of this problem?
Notes:
Each record consists of a few ints
Yes; indexing improves fetch times at the cost of insert times. Your numbers sound reasonable - without knowing more.
You can benchmark it. You'll need to have a reasonable amount of data stored. Consider whether or not to index based upon the queries - heavy fetch and light insert? index everywhere a where clause might use it. Light fetch, heavy inserts? Probably avoid indexes. Mixed workload; benchmark it!
When benchmarking, you want as real or realistic data as possible, both in volume and on data domain (distribution of data, not just all "henry smith" but all manner of names, for example).
It is typical for indexes to sacrifice insert speed for access speed. You can find that out from a database table (and I've seen these in the wild) that indexes every single column. There's nothing inherently wrong with that if the number of updates is small compared to the number of queries.
However, given that:
1/ You seem to be concerned that your writes slow down to 5/ms (that's still 5000/second),
2/ You're only writing a few integers per record; and
3/ You're queries are only based on time queries,
you may want to consider bypassing a regular database and rolling your own sort-of-database (my thoughts are that you're collecting real-time data such as device readings).
If you're only ever writing sequentially-timed data, you can just use a flat file and periodically write the 'index' information separately (say at the start of every minute).
This will greatly speed up your writes but still allow a relatively efficient read process - worst case is you'll have to find the start of the relevant period and do a scan from there.
This of course depends on my assumption of your storage being correct:
1/ You're writing records sequentially based on time.
2/ You only need to query on time ranges.
Yes, indexes will generally slow inserts down, while significantly speeding up selects (queries).
Do keep in mind that not all inserts into a B-tree are equal. It's a tree; if all you do is insert into it, it has to keep growing. The data structure allows for some padding, but if you keep inserting into it numbers that are growing sequentially, it has to keep adding new pages and/or shuffle things around to stay balanced. Make sure that your tests are inserting numbers that are well distributed (assuming that's how they will come in real life), and see if you can do anything to tell the B-tree how many items to expect from the beginning.
Totally agree with #Richard-t - it is quite common in offline/batch scenarios to remove indexes completely before bulk updates to a corpus, only to reapply them when update is complete.
The type of indices applied also influence insertion performance - for example with SQL Server clustered index update I/O is used for data distribution as well as index update, where as nonclustered indexes are updated in seperate (and therefore more expensive) I/O operations.
As with any engineering project - best advice is to measure with real datasets (skews page distribution, tearing etc.)
I think somewhere in the BDB docs they mention that page size greatly affects this behavior in btree's. Assuming you arent doing much in the way of concurrency and you have fixed record sizes, you should try increasing your page size

Costs vs Consistant gets

What does it indicate to see a query that has a low cost in the explain plan but a high consistent gets count in autotrace? In this case the cost was in the 100's and the CR's were in the millions.
The cost can represent two different things depending on version and whether you are running in cpu-based costing mode or not.
Briefly, the cost represents that amount of time that the optimizer expects the query to execute for, but it is expressed in units of the amount of time that a single block read takes. For example if Oracle expects a single block read to take 1ms and the query to take 20ms, then the cost equals 20.
Consistent gets do not match exactly with this for a number of reasons: the cost includes non-consistent (current) gets (eg reading and writing temp data), the cost includes CPU time, and a consistent get can be a multiblock read instead of a single block read and hence have a different duration. Oracle can also get the estimate of the cost completely wrong and it could end up requiring a lot more or less consistent gets than the estimate suggested.
A useful method that can helo explain disconnects between predicted execution plan and actual performance is "cardinality feedback". See this presentation: http://www.centrexcc.com/Tuning%20by%20Cardinality%20Feedback.ppt.pdf
At best, the cost is the optimizer's estimate of the number of I/O's that a query would perform. So, at best, the cost is only likely to be accurate if the optimizer has found a very good plan-- if the optimizer's estimate of the cost is correct and the plan is ideal, that generally means that you're never going to bother looking at the plan because that query is going to perform reasonably well.
Consistent gets, however, is an actual measure of the number of gets that a query actually performed. So that is a much more accurate benchmark to use.
Although there are many, many things that can influence the cost, and a few things that can influence the number of consistent gets, it is probably reasonable to expect that if you have a very low cost and a very high number of consistent gets that the optimizer is probably working with poor estimates of the cardinality of the various steps (the ROWS column in the PLAN_TABLE tells you the expected number of rows returned in each step). That may indicate that you have missing or outdated statistics, that you are missing some histograms, that your initialization parameters or system statistics are wrong in some way, or that the CBO has problems for some other reason estimating the cardinality of your results.
What version of Oracle are you using?

How do you measure SQL Fill Factor value

Usually when I'm creating indexes on tables, I generally guess what the Fill Factor should be based on an educated guess of how the table will be used (many reads or many writes).
Is there a more scientific way to determine a more accurate Fill Factor value?
You could try running a big list of realistic operations and looking at IO queues for the different actions.
There are a lot of variables that govern it, such as the size of each row and the number of writes vs reads.
Basically: high fill factor = quicker read, low = quicker write.
However it's not quite that simple, as almost all writes will be to a subset of rows that need to be looked up first.
For instance: set a fill factor to 10% and each single-row update will take 10 times as long to find the row it's changing, even though a page split would then be very unlikely.
Generally you see fill factors 70% (very high write) to 95% (very high read).
It's a bit of an art form.
I find that a good way of thinking of fill factors is as pages in an address book - the more tightly you pack the addresses the harder it is to change them, but the slimmer the book. I think I explained it better on my blog.
I would tend to be of the opinion that if you're after performance improvements, your time is much better spent elsewhere, tweaking your schema, optimising your queries and ensuring good index coverage. Fill factor is one of those things that you only need to worry about when you know that everything else in your system is optimal. I don't know anyone that can say that.

Resources