Database scalability - performance vs. database size

I'm creating an app that will have to put at max 32 GB of data into my database. I am using B-tree indexing because the reads will have range queries (like from 0 < time < 1hr).
At the beginning (database size = 0 GB), I get 60 to 70 writes per millisecond. After, say, 5 GB, the three databases I've tested (H2, Berkeley DB, Sybase SQL Anywhere) have REALLY slowed down, to under 5 writes per millisecond.
Questions:
Is this typical?
Would I still see this scalability issue if I REMOVED indexing?
What are the causes of this problem?
Notes:
Each record consists of a few ints

Yes; indexing improves fetch times at the cost of insert times. Your numbers sound reasonable - without knowing more.
You can benchmark it. You'll need to have a reasonable amount of data stored. Consider whether or not to index based upon the queries: heavy fetch and light insert? Index everywhere a WHERE clause might use it. Light fetch, heavy inserts? Probably avoid indexes. Mixed workload? Benchmark it!
When benchmarking, you want data that is as real or as realistic as possible, both in volume and in data domain (distribution of values, not just all "Henry Smith" but all manner of names, for example).
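For example, a minimal sketch of that kind of benchmark (PostgreSQL syntax here; the table, index and column names are made up, and other engines have their own row-generation idioms):

CREATE TABLE readings (ts BIGINT, sensor_id INT, value INT);
CREATE INDEX idx_readings_ts ON readings (ts);

-- Load a few million rows with a realistic spread of values, then time the load
-- (and your range queries) both with and without idx_readings_ts in place.
INSERT INTO readings (ts, sensor_id, value)
SELECT g, (random() * 1000)::int, (random() * 100)::int
FROM generate_series(1, 5000000) AS g;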

It is typical for indexes to sacrifice insert speed for access speed. You can see this taken to its extreme in tables (and I've seen these in the wild) that index every single column. There's nothing inherently wrong with that if the number of updates is small compared to the number of queries.
However, given that:
1/ You seem to be concerned that your writes slow down to 5/ms (that's still 5000/second),
2/ You're only writing a few integers per record; and
3/ Your queries are only based on time ranges,
you may want to consider bypassing a regular database and rolling your own sort-of-database (my thoughts are that you're collecting real-time data such as device readings).
If you're only ever writing sequentially-timed data, you can just use a flat file and periodically write the 'index' information separately (say at the start of every minute).
This will greatly speed up your writes but still allow a relatively efficient read process - worst case is you'll have to find the start of the relevant period and do a scan from there.
This of course depends on my assumptions about your usage being correct:
1/ You're writing records sequentially based on time.
2/ You only need to query on time ranges.

Yes, indexes will generally slow inserts down, while significantly speeding up selects (queries).
Do keep in mind that not all inserts into a B-tree are equal. It's a tree; if all you do is insert into it, it has to keep growing. The data structure allows for some padding, but if you keep inserting into it numbers that are growing sequentially, it has to keep adding new pages and/or shuffle things around to stay balanced. Make sure that your tests are inserting numbers that are well distributed (assuming that's how they will come in real life), and see if you can do anything to tell the B-tree how many items to expect from the beginning.

Totally agree with @Richard-t - it is quite common in offline/batch scenarios to remove indexes completely before bulk updates to a corpus, and reapply them when the update is complete.
The type of indexes applied also influences insertion performance - for example, with a SQL Server clustered index, update I/O is used for data placement as well as index maintenance, whereas nonclustered indexes are updated in separate (and therefore more expensive) I/O operations.
As with any engineering project, the best advice is to measure with real datasets (which expose skew, page distribution, tearing, etc.).
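As a rough sketch of the drop-then-reapply pattern mentioned above (SQL Server syntax; the table and index names are invented):

-- Drop the nonclustered index before the bulk load...
DROP INDEX ix_orders_customer ON orders;

-- ...run the bulk insert / batch update here...

-- ...then rebuild the index once, after the load is complete.
CREATE NONCLUSTERED INDEX ix_orders_customer ON orders (customer_id);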

I think somewhere in the BDB docs they mention that page size greatly affects this behavior in B-trees. Assuming you aren't doing much in the way of concurrency and you have fixed record sizes, you should try increasing your page size.


What is the best moment to create SQL indexes?

When starting a project, should SQL indexes be created at the beginning?
I have a project where I haven't created any indexes in production yet. The table that will grow most has 30,000 rows, and I have measured the query times against this table with an index (creating it and deleting it afterwards) and without. The times are very similar.
I have decided to postpone the creation of the indexes in production until I notice a reduction of the response time in queries when creating them.
Is my approach correct? Or should I create them now?
I'm pretty deep into the topic of database indexing (it's actually my full-time job; I also wrote a book about it, SQL Performance Explained, which is available for free here).
In my opinion, indexes should be created at the time you write the query, because that is when you have all the information needed to decide which indexes to create fresh in your head. In other words, if you do it at that time, it doesn't take any extra effort. Another reason is that indexing sometimes affects how you have to write the query so that it can actually benefit from that index.
However, the above statement assumes that you know how indexes work, so you can decide which indexes to create. If you don't, I'd really suggest learning about proper indexing first. Again, the book I've written is available for free on the web (Table of Contents). According to a recent survey, it takes about 4-5 hours to read through it. Well-spent time, I'd say.
However, given the ludicrous speed of modern hardware and the vast amounts of memory available even on cheap commodity machines, it is absolutely possible that you cannot measure any difference with tables this small (30k rows is small in the DB world) yet. Nevertheless, just because you cannot measure the difference with a timer resolution of maybe 10ms doesn't mean the difference isn't there. Further: did you verify that the index was actually used? Are you sure the index you created was a good index for the given query?
Nevertheless, if the overall system is fast enough for you at the moment, sure, you can go on without indexes. The risk remains, however, that it won't be fast enough on the day a major news outlet covers your app. What is supposed to be your best day might turn out to be your worst day :(
You didn't tell us a lot about your app, so I have to do some guesswork. I guess it is more of an OLTP app like an online website (as opposed to BI/OLAP). Although indexes add some overhead to write operations (insert, update, delete and merge), this is typically small compared to the benefit they bring to selects (still assuming OLTP). Sure, you can misuse indexes (e.g., creating hundreds on a single table) so that the overhead becomes a major problem too. But adding "a few" indexes to an OLTP table will most certainly not cause any problems due to the maintenance overhead.
Coming to an end: if you already know which indexes are good for your queries (verify it using explain), add them now before it is too late. If you are not sure, I'd still suggest putting some effort into that now. If you are not afraid of load peaks taking your app down, go on without indexes.
If you need more help, create a new question containing your query, table and index definitions as well as the explain output, and people will be happy to help you figure out whether that index is fine or not.
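As a minimal sketch of the explain-based verification mentioned above (generic syntax with invented table and column names; EXPLAIN output and date-literal syntax differ between databases):

CREATE INDEX idx_orders_cust_date ON orders (customer_id, order_date);

EXPLAIN
SELECT *
FROM orders
WHERE customer_id = 42
  AND order_date >= DATE '2024-01-01';
-- The plan should show an index range scan / index seek on idx_orders_cust_date,
-- not a full table scan.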
Just create them now based on sensible choices: start with primary and foreign keys - that'll keep your joins fast - then add indexes on the single columns you'll be searching on (name, phone, etc.).
Avoid creating multiple column indexes until you have a demonstrated performance problem and you can prove that an index helps. Often, reworking the query will fix the problem better than some complicated index.
The only time I delay creating indexes is when I'm about to load a heap of data: building indexes before loading means a much slower load, as the index is updated for every row added. Some databases allow the index rebuild to be deferred until after the load, though, so even then there may be no point in waiting.

Time to retrieve a single record via a SQL Server index in a large table

Short version of the question:
If you have a table with a large number of small rows and you want to retrieve a single record from this table via an index (probably consisting of two columns), is this likely to be low cost and fast, or high cost and slow?
Longer version of question and background:
I am a consultant working with a software development company and I have an argument with them about the performance implications of a piece of functionality that I want to add to the application they are building (and I am designing).
At the moment, we write out a log record every time somebody retrieves a client record. I want to put the name and time of the last person who previously accessed that record onto the client page each time the record is retrieved.
They are saying that the performance implications of this will be high, but based on my reasonable but not expert knowledge of how B-trees work, this doesn't seem right even if the table is very large.
If you create an index on the GUID of the client record and the date/time of access (descending), then you ought to be able to retrieve the required record via an index seek which would just need to find the first entry for that GUID and then stop? And with a B-tree index, most of the index would be cached, so the number of physical disk accesses needed would be very small and the query time therefore significantly less than 1s.
Or have I got this completely wrong?
You will have problems with GUID index fragmentation, but because your rows do not increase in size (as you said in the comments) you will not have page-splitting problems. The random insert issue is fixable by reorganizing and rebuilding.
Besides that, there is nothing wrong with your approach. If the table is larger than RAM you will likely have a single disk IO per access (the intermediate index levels will be cached). If your data fits in RAM you will pay about 0.2 to 0.5ms per query. If your data is on a magnetic disk a seek will likely require 8-12ms. On an SSD you are back to 0.2ms to 0.5ms (maybe 0.05ms more).
Why don't you just create some test data (by selecting a 1M-row cross product from sys.objects) and measure it? It takes little time and you will find out for sure.
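For illustration, a sketch of what that test might look like in SQL Server (the table and column names here are invented):

-- Index on (GUID, access time descending), matching the lookup described above.
CREATE INDEX ix_access_log_client_time
    ON access_log (client_guid, accessed_at DESC);

-- Most recent access for one client: a single index seek, no sort required.
DECLARE @client_guid uniqueidentifier = 'A0A0A0A0-0000-0000-0000-000000000001';
SELECT TOP (1) accessed_by, accessed_at
FROM access_log
WHERE client_guid = @client_guid
ORDER BY accessed_at DESC;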
Should be low cost and fast, since the columns are indexed and a B-tree lookup is O(log n), I think.
You say last person to access? You mean that for every read you will have a write?
And that write is going to change an indexed date time column?
Then I would be worried too.
Writing on each record read will cause you lots of extra disk writes. This will block reads and it might be bad for your caching too. You also need to update your index a lot, and since you change the indexed data your index will be very fragmented.
It depends.
A single retrieval will be low cost and fast
on a decent indexed table
running on decent hardware
over a decent network
On the other hand, it takes time nonetheless.
If we are talking about one retrieval per hour, don't sweat over it. If we are talking about thousands of retrievals per second (as opposed to currently none) it will start to add up to the point where it would be noticeable.
Some questions you need to address:
Is my hardware up to spec
Does adding two fields result in a page split (unlikely)
How many extra pages need to be read for your regular result sets
How many retrievals/sec will be made
How many inserts/sec (triggering an index update) will be made
After you've addressed these questions, you should be able to make the determination yourself. As far as my gut feeling goes, I would be surprised if you noticed the performance difference.

Scaling a MS SQL Server 2008 database

I'm trying to work out the best way to scale my site, and I have a question about how MS SQL Server will scale.
The way the table currently is:
cache_id - int - identifier
cache_name - nvarchar(256) - Used for lookup along with event_id
cache_event_id - int - Basically a way of grouping
cache_creation_date - datetime
cache_data - varbinary(MAX) - Data size will be from 2k to 5k
The data stored is a byte array; that's basically a cached (compressed) instance of a page on my site.
The different ways I see of storing it are:
1) One large table; it would contain tens of millions of records and easily become several gigabytes in size.
2) Multiple tables containing the data above, meaning each table would hold 200k to a million records.
The data from this table will be used to render web pages, so anything over 200ms to get a record is bad in my eyes (I know some people think a 1-2 second page load is OK, but I think that's slow and want to do my best to keep it lower).
So it boils down to: what is it that slows down SQL Server?
Is it the size of the table (disk space)?
Is it the number of rows?
At what point does it stop becoming cost effective to use multiple database servers?
If it's close to impossible to predict these things, I'll accept that as a reply too. I'm not a DBA, and I'm basically trying to design my DB so I don't have to redesign it later when it contains a huge amount of data.
So it boils down to: what is it that slows down SQL Server?
Is it the size of the table (disk space)?
Is it the number of rows?
At what point does it stop becoming cost effective to use multiple database servers?
This is all a 'rule of thumb' view:
The load (and therefore, to a considerable extent, the performance) of a DB is largely a function of two issues: data volume and transaction load, with the second, IMHO, generally being the more relevant.
With regard to data volume, one can hold many gigabytes of data and get acceptable access times by way of normalisation, indexing, partitioning, fast I/O systems, appropriate buffer cache sizes, etc. Many of these, e.g. normalisation, are issues one considers at DB design time; others, e.g. additional/fewer indexes or buffer cache size, during system tuning.
The transactional load is largely a factor of code design and the total number of users. Code design includes factors like getting transaction size right (small and fast is the general goal, but like most things it is possible to take it too far and have transactions that are too small to retain integrity, or so small that they themselves add load).
When scaling I advise first scaling up (a bigger, faster server), then out (multiple servers). The admin issues of a multiple-server setup are significant, and I suggest they are only worth taking on for a site with the OS, network and DBA skills and processes to match.
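Going back to the table described in the question, a hedged sketch of what the indexing side might look like (the table name and exact types are guesses based on the column list above):

CREATE TABLE page_cache (
    cache_id            INT IDENTITY(1,1) PRIMARY KEY,
    cache_name          NVARCHAR(256)  NOT NULL,
    cache_event_id      INT            NOT NULL,
    cache_creation_date DATETIME       NOT NULL,
    cache_data          VARBINARY(MAX) NOT NULL
);

-- Composite index covering the lookup path the question mentions
-- (cache_name together with cache_event_id).
CREATE INDEX ix_page_cache_lookup
    ON page_cache (cache_name, cache_event_id);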
Normalize and index.
How, we can't tell you, because you haven't told us what your table is trying to model or how you're trying to use it.
1 million rows is not at all uncommon. Again, we can't tell you much in the absence of context that only you can, but don't, provide.
The only possible answer is to set it up, and be prepared for a long iterative process of learning things only you will know because only you will live in your domain. Any technical advice you see here will be naive and insufficiently informed until you have some practical experience to share.
Test every single one of your guesses, compare the results, and see what works. And keep looking for more testable ideas. (And don't be afraid to back out changes that end up not helping. It's a basic requirement to have any hope of sustained simplicity.)
And embrace the fact that your database design will evolve. It's not as fearsome as your comment suggests you think it is. It's much easier to change a database than the software that goes around it.

Advice on building a fast, distributed database

I'm currently working on a problem that involves querying a tremendous amount of data (billions of rows) and, being somewhat inexperienced with this type of thing, would love some clever advice.
The data/problem looks like this:
Each table has 2-5 key columns and 1 value column.
Every row has a unique combination of keys.
I need to be able to query by any subset of keys (i.e. key1='blah' and key4='bloo').
It would be nice to be able to quickly insert new rows (updating the value if the row already exists), but I'd be satisfied if I could do this slowly.
Currently I have this implemented in MySQL running on a single machine with separate indexes defined on each key, one index across all keys (unique) and one index combining the first and last keys (which is currently the most common query I'm making, but that could easily change). Unfortunately, this is quite slow (and the indexes end up taking ~10x the disk space, which is not a huge problem).
I happen to have a bevy of fast computers at my disposal (~40), which makes the incredible slowness of this single-machine database all the more annoying. I want to take advantage of all this power to make this database fast. I've considered building a distributed hash table, but that would make it hard to query for only a subset of the keys. It seems that something like BigTable / HBase would be a decent solution but I'm not yet convinced that a simpler solution doesn't exist.
Thanks very much, any help would be greatly appreciated!
I'd suggest you listen to this podcast for some excellent information on distributed databases.
episode-109-ebays-architecture-principles-with-randy-shoup
To point out the obvious: you're probably disk bound.
At some point if you're doing randomish queries and your working set is sufficiently larger than RAM then you'll be limited by the small number of random IOPS a disk can do. You aren't going to be able to do better than a few tens of sub-queries per second per attached disk.
If you're up against that bottleneck, you might gain more by switching to an SSD, a larger RAID, or lots of RAM than you would by distributing the database among many computers (which would mostly just get you more of the last two resources).

How many database indexes is too many?

I'm working on a project with a rather large Oracle database (although my question applies equally well to other databases). We have a web interface which allows users to search on almost any possible combination of fields.
To make these searches go fast, we're adding indexes to the fields and combinations of fields on which we believe users will commonly search. However, since we don't really know how our customers will use this software, it's hard to tell which indexes to create.
Space isn't a concern; we have a 4 terabyte RAID drive of which we are using only a small fraction. However, I'm worried about the possible performance penalties of having too many indexes. Because those indexes need to be updated every time a row is added, deleted, or modified, I imagine it'd be a bad idea to have dozens of indexes on a single table.
So how many indexes is considered too many? 10? 25? 50? Or should I just cover the really, really common and obvious cases and ignore everything else?
It depends on the operations that occur on the table.
If there are lots of SELECTs and very few changes, index all you like... these will (potentially) speed the SELECT statements up.
If the table is heavily hit by UPDATEs, INSERTs and DELETEs... these will be very slow with lots of indexes, since they all need to be modified each time one of these operations takes place.
Having said that, you can clearly add a lot of pointless indexes to a table that won't do anything. Adding B-Tree indexes to a column with 2 distinct values will be pointless since it doesn't add anything in terms of looking the data up. The more unique the values in a column, the more it will benefit from an index.
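A quick, hedged way to sanity-check that before indexing a column (invented names; a ratio near 1 means the column is highly selective, while a ratio near 0 means an index is unlikely to help):

SELECT COUNT(DISTINCT status) * 1.0 / COUNT(*) AS selectivity
FROM orders;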
I usually proceed like this.
Get a log of the real queries run on the data on a typical day.
Add indexes so the most important queries hit the indexes in their execution plan (see the plan-check sketch after this list).
Try to avoid indexing fields that have a lot of updates or inserts
After a few indexes, get a new log and repeat.
As with any optimization, I stop when the requested performance is reached (this obviously implies that step 0 would be getting specific performance requirements).
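A minimal Oracle sketch of the plan check referenced in the list above, with invented table, column and bind names:

EXPLAIN PLAN FOR
SELECT * FROM orders WHERE customer_id = :cust AND status = 'OPEN';

-- Inspect the plan and confirm the intended index actually appears in it.
SELECT * FROM TABLE(DBMS_XPLAN.DISPLAY);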
Everyone else has been giving you great advice. I have an added suggestion for you as you move forward. At some point you have to make a decision as to your best indexing strategy. In the end though, the best PLANNED indexing strategy can still end up creating indexes that don't get used. One strategy that lets you find unused indexes is to monitor index usage. You do this as follows:
alter index my_index_name monitoring usage;
You can then monitor whether the index is used or not from that point forward by querying v$object_usage. Information on this can be found in the Oracle® Database Administrator's Guide.
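For example (a hedged sketch; MY_INDEX_NAME is a placeholder):

SELECT index_name, table_name, monitoring, used, start_monitoring
FROM   v$object_usage
WHERE  index_name = 'MY_INDEX_NAME';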
Just remember that if you have a warehousing strategy of dropping indexes before updating a table, then recreating them, you will have to set the index up for monitoring again, and you'll lose any monitoring history for that index.
In data warehousing it is very common to have a high number of indexes. I have worked with fact tables having two hundred columns and 190 of them indexed.
Although there is an overhead to this it must be understood in the context that in a data warehouse we generally only insert a row once, we never update it, but it can then participate in thousands of SELECT queries which might benefit from indexing on any of the columns.
For maximum flexibility a data warehouse generally uses single-column bitmap indexes except on high-cardinality columns, where (compressed) B-tree indexes can be used.
The overhead on index maintenance is mostly associated with the expense of writing to a great many blocks and the block splits as new rows are added with values that are "in the middle" of existing value ranges for that column. This can be mitigated by partitioning and having the new data loads aligned with the partitioning scheme, and by using direct path inserts.
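For instance, a hedged Oracle sketch of a direct-path load (invented table names); the APPEND hint loads above the high-water mark and reduces per-row overhead during bulk loads:

INSERT /*+ APPEND */ INTO sales_fact
SELECT * FROM sales_staging;
COMMIT;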
To address your question more directly, I think it is probably fine to index the obvious at first, but do not be afraid of adding more indexes later if the queries against the table would benefit.
To paraphrase Einstein on simplicity: add as many indexes as you need, and no more.
Seriously, however, every index you add requires maintenance whenever data is added to the table. On tables that are primarily read only, lots of indexes are a good thing. On tables that are highly dynamic, fewer is better.
My advice is to cover the common and obvious cases and then, as you encounter issues where you need more speed in getting data from specific tables, evaluate and add indices at that point.
Also, it's a good idea to re-evaluate your indexing schemes every few months, just to see if there is anything new that needs indexing or any indices that you've created that aren't being used for anything and should be gotten rid of.
In addition to the points everyone else has raised, the Cost Based Optimizer incurs a cost when creating a plan for an SQL statement if there are more indexes because there are more combinations for it to consider. You can reduce this by correctly using bind variables so that SQL statements stay in the SQL cache. Oracle can then do a soft parse and re-use the plan it found last time.
As always, nothing is simple. If there are skewed columns and histograms involved then this can be a bad idea.
In our web applications we tend to limit the combinations of searches that we allow. Otherwise you would have to test literally every combination for performance to ensure you did not have a lurking problem that someone will find one day. We have also implemented resource limits to stop this causing issues elsewhere in the application should something go wrong.
I made some simple tests on my real project and real MySql database. I already answered in this topic: What is the cost of indexing multiple db columns?
But I think it will be better if I quote it here:
I made some simple tests using my real project and real MySQL database. My results are: adding an average index (1-3 columns in an index) to a table makes inserts slower by 2.1%. So, if you add 20 indexes, your inserts will be slower by 40-50%. But your selects will be 10-100 times faster.
So is it OK to add many indexes? It depends :) I gave you my results - you decide!
Ultimately how many indexes you need depend on the behavior of your applications that ride on top of your database server.
In general, the more inserting you do, the more painful your indexes become. Each time you do an insert, all the indexes on that table have to be updated.
Now if your application has a decent amount of reading, or even more so if it's almost all reading, then indexes are the way to go as there will be major performance improvements for very little cost.
There's no static answer in my opinion, this sort of thing falls under 'performance tuning'.
It could be that everything your app does is looked up by a primary key, or it could be the opposite, in that queries are done over unrestricted combinations of fields and any one in particular could be used at any given time.
Beyond just indexing, there's reorganizing your DB to include calculated search fields, splitting tables, etc. - it's really dependent on your load shapes and query parameters, and how much/what data 'really' needs to be returned by a query.
If your entire DB is fronted by stored-procedure facades, tuning becomes a bit easier, as you don't have to worry about every ad-hoc query. Or you may have a deep understanding of the kind of queries that will hit your DB, and can limit the tuning to those.
For SQL Server I've found the Database Engine Tuning Advisor useful - you set up 'typical' workloads and it can make recommendations about adding/removing indexes and statistics. I'm sure other DBs have similar tools, either 'official' or third party.
This really is more a theoretical question than a practical one. The impact of indexes on your performance depends on the hardware you have, the version of Oracle, index types, etc. Yesterday I heard Oracle announced dedicated storage hardware, made by HP, which is supposed to perform 10 times faster with an 11g database.
As for your case, there can be several solutions:
1. Have a large amount of indexes (>20) and rebuild them daily (nightly). This would be especially useful if the table gets thousands of updates/deletes daily.
2. Partition your table (if that fits your data model).
3. Use a separate table for new/updated data, and run a nightly process which combines the data together. This would require a change in your application logic.
4. Switch to an IOT (index-organized table), if your data supports this.
Of course there might be many more solutions for such a case. My first suggestion would be to clone the DB to a development environment and run some stress testing against it.
An index imposes a cost when the underlying table is updated. An index provides a benefit when it is used to speed up a query. For each index, you need to balance the cost against the benefit. How much slower does the query run without the index? How much of a benefit is running faster? Can you or your users tolerate the slow speed when the index is missing?
Can you tolerate the additional time it takes to complete an update?
You need to compare costs and benefits. That's particular to your situation. There's no magic number of indexes that passes the threshold of "too many".
There's also the cost of the space needed to store the index, but you've said that in your situation that's not an issue. The same is true in most situations, given how cheap disk space has become.
If you do mostly reads (and few updates) then there's really no reason not to index everything you'll need to index. If you update often, then you may need to be cautious on how many indexes you have. There's no hard number, but you'll notice when things start to slow down. Make sure your clustered index is the one that makes the most sense based on the data.
One thing you may consider is building indexes to target a standard combination of searches. If column1 is commonly searched, and column2 is often used with it, and column3 is sometimes used with column2 and column1, then an index on column1, column2, and column3 in that order can be used for any of those three circumstances, though it is only one index that has to be maintained.
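Sketched out (using the column names from the paragraph above as placeholders), that single index looks like this, and it serves any query that filters on a leading prefix of its key:

CREATE INDEX ix_search_combo ON my_table (column1, column2, column3);
-- Usable for: WHERE column1 = ?
--             WHERE column1 = ? AND column2 = ?
--             WHERE column1 = ? AND column2 = ? AND column3 = ?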
How many columns are there?
I have always been told to make single-column indexes, not multi-column indexes. So no more indexes than the number of columns, IMHO.
What it really comes down to is, don't add an index unless you know (and this often means gathering usage statistics) that it will be used far more often than it's updated.
Any index that doesn't meet that criteria will cost you more to rebuild than the performance penalty of not having it in the odd case it got used.
SQL Server gives you some good tools that let you see which indexes are actually being used.
This article, http://www.mssqltips.com/tip.asp?tip=1239, gives you some queries that let you get a better insight into how much an index is used, as opposed to how much it is updated.
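One such query (a hedged sketch against the sys.dm_db_index_usage_stats DMV; adjust for your version) compares reads against writes per index:

SELECT OBJECT_NAME(s.object_id) AS table_name,
       i.name                   AS index_name,
       s.user_seeks + s.user_scans + s.user_lookups AS reads,
       s.user_updates                               AS writes
FROM   sys.dm_db_index_usage_stats AS s
JOIN   sys.indexes AS i
       ON i.object_id = s.object_id
      AND i.index_id  = s.index_id
WHERE  s.database_id = DB_ID();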
It depends entirely on the columns being used in your WHERE clauses.
As a rule of thumb, we must have indexes on foreign key columns to help avoid deadlocks.
AWR reports should be analyzed periodically to understand the need for indexes.
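For example, a hedged sketch of the kind of foreign-key index just mentioned (invented names); indexing the child table's FK column means parent-side deletes and updates don't have to scan, and lock, the whole child table:

CREATE INDEX ix_order_items_order_id ON order_items (order_id);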
