When I'm creating indexes on tables, I generally set the Fill Factor based on an educated guess about how the table will be used (many reads or many writes).
Is there a more scientific way to determine a more accurate Fill Factor value?
You could try running a big list of realistic operations and looking at IO queues for the different actions.
There are a lot of variables that govern it, such as the size of each row and the number of writes vs reads.
Basically: high fill factor = quicker read, low = quicker write.
However it's not quite that simple, as almost all writes will be to a subset of rows that need to be looked up first.
For instance: set a fill factor to 10% and each single-row update will take 10 times as long to find the row it's changing, even though a page split would then be very unlikely.
Generally you see fill factors from 70% (very write-heavy) to 95% (very read-heavy).
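To make that concrete, here's a minimal sketch in SQL Server syntax (the table, column and index names are made up purely for illustration):

    -- read-heavy index: pack the pages tightly, so reads touch fewer pages
    CREATE NONCLUSTERED INDEX ix_customers_lastname
        ON dbo.Customers (last_name)
        WITH (FILLFACTOR = 95);

    -- write-heavy index: leave ~30% free space per leaf page to absorb
    -- inserts/updates before a page split is needed
    CREATE NONCLUSTERED INDEX ix_events_time
        ON dbo.Events (event_time)
        WITH (FILLFACTOR = 70);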
It's a bit of an art form.
I find that a good way of thinking of fill factors is as pages in an address book - the more tightly you pack the addresses the harder it is to change them, but the slimmer the book. I think I explained it better on my blog.
I would tend to be of the opinion that if you're after performance improvements, your time is much better spent elsewhere, tweaking your schema, optimising your queries and ensuring good index coverage. Fill factor is one of those things that you only need to worry about when you know that everything else in your system is optimal. I don't know anyone that can say that.
Assuming that the larger a database gets, the longer it will take to SELECT rows, won't a database eventually take too long (i.e. annoying to users) to traverse regardless of how optimized it is?
Is it simply a matter of the increasing time being so negligible that there is only a theoretical limit, but no realistic one?
Well, yes, in a manner of speaking. Generally, the more data you have, the longer it will take to find what you're looking for.
There are ways to dramatically reduce that time (indexing, sharding, etc), and you can always add more hardware. Indexing especially saves you from scanning the whole table to find your result. If you've got a simple B-tree index, the worst case should be O(log n).
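As a rough sketch (hypothetical table and column names), the difference is between scanning every row and seeking through an index:

    -- without an index on email, this has to scan the whole table
    SELECT * FROM users WHERE email = 'someone@example.com';

    -- with a B-tree index, the same lookup walks the tree: roughly O(log n)
    CREATE INDEX ix_users_email ON users (email);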
Apart from theoretical limits, there are also practical ones, for example maximum number of rows per table, but these days those limits are so high that you can almost ignore them.
I wouldn't worry about it. If you're using a decent DBMS and decent hardware... with realistic amounts of data, you can always find a way to return a result in an acceptable amount of time. If you do reach the limits, chances are that you're making money from what you've got stored, and then you can always hire a pro to help you out ;)
I was recently on the OEIS (Online Encyclopedia of Integer Sequences), trying to look up a particular sequence I had on hand.
Now, this database is fairly large. The website states that if the 2006 (! 5 years old) edition were printed, it would occupy 750 volumes of text.
I'm sure this is the same sort of issue Google has to handle as well. But, they also have a distributed system where they take advantage of load balancing.
Neglecting load balancing, however, how much time does a query take relative to database size?
Or in other words, what is the time complexity of a query with respect to DB size?
Edit: To make things more specific, assume the input query is simply looking up a string of numbers such as:
1, 4, 9, 16, 25, 36, 49
It strongly depends on the query, the structure of the database, contention, and so on. But in general most databases will find a way to use an index, and that index will either be some kind of tree structure (see http://en.wikipedia.org/wiki/B-tree for one option), in which case access time is proportional to log(n), or else a hash, in which case average access time is O(1) (see http://en.wikipedia.org/wiki/Hash_function#Hash_tables for an explanation of how they work).
So the answer is typically O(1) or O(log(n)) depending on which type of data structure is used.
This may cause you to wonder why we don't always use hash functions. There are multiple reasons. Hash functions make it hard to retrieve ranges of values. If the hash function fails to distribute data well, it is possible for access time to become O(n). Hashes need resizing occasionally, which is potentially very expensive. And log(n) grows slowly enough that you can treat it as being reasonably close to constant across all practical data sets. (From 1000 to 1 petabyte it varies by a factor of 5.) And frequently the actively requested data shows some sort of locality, which trees do a better job of keeping in RAM. As a result trees are somewhat more commonly seen in practice. (Though hashes are by no means rare.)
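To put a rough number on how slowly log(n) grows, assuming a plain binary tree just for the arithmetic:

    n = 10^3  rows  ->  log2(n) ≈ 10
    n = 10^9  rows  ->  log2(n) ≈ 30
    n = 10^15 rows  ->  log2(n) ≈ 50

So a trillion-fold increase in data raises the lookup cost by only about a factor of 5.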
That depends on a number of factors including the database engine implementation, indexing strategy, specifics of the query, available hardware, database configuration, etc.
There is no way to answer such a general question.
A properly designed and implemented database with terabytes of data may actually outperform a badly designed little database (particularly one with no indexing, badly performing non-sargable queries, and things such as correlated subqueries). This is why anyone expecting to have large amounts of data needs to hire an expert in designing databases for large data volumes to do the initial design, not later when the database is already large. You may also need to invest in the kind of equipment needed to handle that size.
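As a hedged illustration of the sargable point (hypothetical orders table): both queries return the same rows, but only the second one lets the optimizer seek an index on order_date:

    -- non-sargable: wrapping the column in a function defeats the index
    SELECT * FROM orders WHERE YEAR(order_date) = 2010;

    -- sargable: a plain range predicate can use an index seek
    SELECT * FROM orders
    WHERE order_date >= '2010-01-01' AND order_date < '2011-01-01';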
I'm not sure whether there is a DBS that already does this, or whether it is indeed a useful feature, but:
There are a lot of suggestions on how to speed up DB operations by tuning buffer sizes. One example is importing Open Street Map data (the planet file) into a Postgres instance. There is a tool called osm2pgsql (http://wiki.openstreetmap.org/wiki/Osm2pgsql) for this purpose, and also a guide that suggests adapting specific buffer parameters for the import.
In the final step of the import, the database creates indexes and (according to my understanding of the docs) would benefit from a huge maintenance_work_mem, whereas during normal operation this wouldn't be too useful.
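The kind of hand-tuning I mean looks roughly like this (the value and names are placeholders, not a recommendation):

    -- raise the limit for this session only, build the index, then reset
    SET maintenance_work_mem = '1GB';
    CREATE INDEX idx_big_table_some_column ON big_table (some_column);
    RESET maintenance_work_mem;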
This thread (http://www.mail-archive.com/pgsql-general@postgresql.org/msg119245.html), on the contrary, suggests that a large maintenance_work_mem would not make too much sense during final index creation.
Ideally (imo), the DBS should know best which combination of buffer sizes it would profit from most, given a limited amount of total buffer memory.
So, are there some good reasons why there isn't a built-in heuristic that is able to adapt the buffer sizes automatically according to the current task?
The problem is the same as with any forecasting software. Just because something happened historically doesn't mean it will happen again. Also, you need to complete a task in order to fully analyze how you should have done it more efficiently. The problem is that the next task is not necessarily anything like the previously completed one. So if your import routine needed 8 GB of memory to complete, would it make sense to assign each read-only user 8 GB of memory? The other way around wouldn't work well either.
By leaving this decision to humans, the database will exhibit performance characteristics that aren't optimal for all cases, but in return it lets us (the humans) optimize each case individually (if we want to).
Another important aspect is that most people/companies value reliable and stable performance levels over varying but potentially better ones. Having a high cost isn't as big a deal as having large variations in cost. This is of course not true all the time, as entire companies are built around the fact that they once in a while hit that 1%.
Modern databases already put some effort into adapting themselves to the tasks presented, such as increasingly sophisticated query optimizers. At least Oracle has the option to keep track of some of the measures that influence the optimizer's decisions (e.g. the cost of a single-block read, which varies with the current load).
My guess would be that it is awfully hard to get the knobs right by adaptive means. First you would have to query the machine for a lot of unknowns, like how much RAM it has available - but there is also the unknown of what else you expect to run on the machine.
Barring that, even with only a max_mem_usage parameter to set, the problem is how to make a system that
adapts well to most typical loads,
doesn't have odd pathological problems with some loads, and
is reasonably comprehensible code without errors.
For PostgreSQL, however, the answer could also be:
Nobody wrote it yet because other stuff is seen as more important.
You didn't write it yet.
Should I be careful about adding too many included columns to a non-clustered index?
I understand that this will prevent bookmark lookups for fully covered queries, but the counterargument, I assume, is the additional cost of maintaining the index when the columns aren't static, plus the larger overall size of the index causing extra physical reads.
You said it in the question: the risk with having many indexes and/or many columns in indexes is that the cost of maintaining the indexes may become significant in databases which receive a lot of CUD (Create/Update/Delete) operations.
Selecting the right indexes is an art of sorts, which involves balancing the most common use cases against storage concerns (typically a low-priority issue, but important in some contexts) and the performance impact on CUD operations.
I agree with mjv - there's no real easy and quick answer to this - it's a balancing act.
In general, fewer but wider indices are preferable over lots of narrower ones, and covering indices (with include fields) are preferable over having to do a bookmark lookup - but that's just generalizations, and those are generally speaking wrong :-)
You really can't do much more than test and measure:
measure your performance in the areas of interest
then add your wide and covering index
measure again and see whether a) you get a speedup on certain operations, and b) the remaining operations don't suffer too much
All the guessing and trying to figure out really doesn't help - measure, do it, measure again, compare the results. That's really all you can do.
I agree with both answers so far, just want to add 2 things:
For covering indexes, SQL Server 2005 introduced the INCLUDE clause, which made storage and usage more efficient. In earlier versions, covering columns had to be key columns, so they were part of the B-tree, counted toward the 900-byte key-width limit, and made the index larger.
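A minimal sketch of that clause (table and column names invented): the key column drives the seek, and the INCLUDEd columns are stored only at the leaf level, so the query below is fully covered and needs no bookmark lookup:

    CREATE NONCLUSTERED INDEX ix_orders_customer
        ON dbo.Orders (customer_id)
        INCLUDE (order_date, total_amount);

    -- covered: every referenced column is available in the index itself
    SELECT customer_id, order_date, total_amount
    FROM dbo.Orders
    WHERE customer_id = 42;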
It's also common for your indexes, taken together, to be larger than the table itself when you check with sp_spaceused. Databases are mostly reads (I saw "85% reads" quoted somewhere), even when the workload is write-heavy (e.g. an INSERT looks for duplicates, a DELETE checks FKs, an UPDATE has a WHERE clause, etc.).
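If you want to check this on your own tables, sp_spaceused reports the data size and index size side by side (the table name here is just an example):

    EXEC sp_spaceused 'dbo.Orders';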
I'm creating an app that will have to put at max 32 GB of data into my database. I am using B-tree indexing because the reads will have range queries (like from 0 < time < 1hr).
At the beginning (database size = 0 GB), I get 60 to 70 writes per millisecond. After, say, 5 GB, the three databases I've tested (H2, Berkeley DB, Sybase SQL Anywhere) have REALLY slowed down, to under 5 writes per millisecond.
Questions:
Is this typical?
Would I still see this scalability issue if I REMOVED indexing?
What are the causes of this problem?
Notes:
Each record consists of a few ints
Yes; indexing improves fetch times at the cost of insert times. Your numbers sound reasonable - without knowing more.
You can benchmark it. You'll need a reasonable amount of data stored. Consider whether or not to index based on the queries - heavy fetches and light inserts? Index every column a WHERE clause might use. Light fetches, heavy inserts? Probably avoid indexes. Mixed workload? Benchmark it!
When benchmarking, you want data that is as real or realistic as possible, both in volume and in data domain (i.e. the distribution of the data - not just "Henry Smith" over and over, but all manner of names, for example).
It is typical for indexes to sacrifice insert speed for access speed. You can see this taken to the extreme with a table (and I've seen these in the wild) that indexes every single column. There's nothing inherently wrong with that if the number of updates is small compared to the number of queries.
However, given that:
1/ You seem to be concerned that your writes slow down to 5/ms (that's still 5000/second),
2/ You're only writing a few integers per record; and
3/ Your queries are based only on time ranges,
you may want to consider bypassing a regular database and rolling your own sort-of-database (my thoughts are that you're collecting real-time data such as device readings).
If you're only ever writing sequentially-timed data, you can just use a flat file and periodically write the 'index' information separately (say at the start of every minute).
This will greatly speed up your writes but still allow a relatively efficient read process - worst case is you'll have to find the start of the relevant period and do a scan from there.
This of course depends on my assumptions about your usage being correct:
1/ You're writing records sequentially based on time.
2/ You only need to query on time ranges.
Yes, indexes will generally slow inserts down, while significantly speeding up selects (queries).
Do keep in mind that not all inserts into a B-tree are equal. It's a tree; if all you do is insert into it, it has to keep growing. The data structure allows for some padding, but if you keep inserting into it numbers that are growing sequentially, it has to keep adding new pages and/or shuffle things around to stay balanced. Make sure that your tests are inserting numbers that are well distributed (assuming that's how they will come in real life), and see if you can do anything to tell the B-tree how many items to expect from the beginning.
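A tiny sketch of that test in generic SQL (table names invented) - the only thing that differs between the two runs is the key distribution:

    CREATE TABLE t_seq  (k BIGINT PRIMARY KEY, v INT);
    CREATE TABLE t_rand (k BIGINT PRIMARY KEY, v INT);

    -- monotonically increasing keys: every insert lands on the right-most leaf page
    INSERT INTO t_seq (k, v) VALUES (1, 0);
    INSERT INTO t_seq (k, v) VALUES (2, 0);
    INSERT INTO t_seq (k, v) VALUES (3, 0);

    -- well-distributed keys: inserts (and any page splits) spread across the whole tree
    INSERT INTO t_rand (k, v) VALUES (8812041, 0);
    INSERT INTO t_rand (k, v) VALUES (95, 0);
    INSERT INTO t_rand (k, v) VALUES (4100337, 0);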
Totally agree with #Richard-t - it is quite common in offline/batch scenarios to remove indexes completely before bulk updates to a corpus, only to reapply them once the update is complete.
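In SQL Server terms that pattern is simply drop, load, recreate (the names below are hypothetical):

    -- drop the index before the bulk load...
    DROP INDEX ix_readings_time ON dbo.Readings;

    -- ...run the bulk insert here...

    -- ...then rebuild the index once, in a single pass
    CREATE INDEX ix_readings_time ON dbo.Readings (reading_time);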
The type of indexes applied also influences insertion performance - for example, with a SQL Server clustered index, update I/O is used for data distribution as well as index update, whereas nonclustered indexes are updated in separate (and therefore more expensive) I/O operations.
As with any engineering project, the best advice is to measure with real datasets (skewed page distribution, tearing, etc.).
I think somewhere in the BDB docs they mention that page size greatly affects this behavior in B-trees. Assuming you aren't doing much in the way of concurrency and you have fixed record sizes, you should try increasing your page size.