Why don't databases intelligently create the indexes they need?

Why don't databases intelligently create the indexes they need? - database

I just heard that you should create an index on any column you're joining or querying on. If the criterion is this simple, why can't databases automatically create the indexes they need?

Well, they do; to some extent at least...
See SQL Server Database Engine Tuning Advisor, for instance.
However, creating optimal indexes is not as simple as you mentioned. An even simpler rule could be to create indexes on every column (which is far from optimal)!
Indexes are not free. You create indexes at the cost of storage and update performance among other things. They should be carefully thought about to be optimal.

Every index you add may increase the speed of your queries. It will decrease the speed of your updates, inserts and deletes and it will increase disk space usage.
I, for one, would rather keep the control to myself, using tools such as DB Visualizer and explain statements to provide the information I need to evaluate what should be done. I do not want a DBMS unilaterally deciding what's best.
It's far better, in my opinion, that a truly intelligent entity be making decisions re database tuning. The DBMS can suggest all it wants but the final decision should be left up to the DBAs.
What happens when the database usage patterns change for one week? Do you really want the DBMS creating indexes and destroying them a week later? That sounds like a management nightmare scenario right up alongside Skynet :-)

This is a good question. Databases could create the indexes they need based on data usage patterns, but this means that the database would be slow the first time certain queries were executed and then get faster as time goes on. For example if there is a table like this:
ID USERNAME
-- --------
: then the username would be used to look up the users very often. After some time the database could see that say 50% of queries did this, in which case it could add an index on the username.
However the reason that this hasn't been implemented in great detail is simply because it is not a killer feature. Adding indexes is performed relatively few times by the DBA, and by automating this (which is a very big task) is probably just not worth it for the database vendors. Remember that every query will have to be analyzed to enable auto indexes, and also the query response time, and result set size as well, so it is non-trivial to implement.

Because databases simply store and retrieve data - the database engine has no clue how you intend to retrieve that data until you actually do it, in which case it is too late to create an index. And the column you are joining on may not be suitable for an efficient index.

It's a non-trivial problem to solve, and in many cases a sub-optimal automatic solution might actually make things worse. Imagine a database whose read operations were sped up by automatic index creation but whose inserts and updates got hosed as a result of the overhead of managing the index? Whether that's good or bad depends on the nature of your database and the application it's serving.
If there were a one-size-fits-all solution, databases would certainly do this already (and there are tools to suggest exactly this sort of optimization). But tuning database performance is largely an app-specific function and is best accomplished manually, at least for now.

An RDBMS could easily self-tune and create indices as it saw fit but this would only work for simple cases with queries that do not have demanding execution plans. Most indices are created to optimize for specific purposes and these kinds of optimizations are better handled manually.

Related

non clustered index for 100 millions records in a table or pratitions

I have a table in IBM DB2 which contains more than 100 million records . Database was made 13 years ago and is not partitioned . Searching data and creating joins with this table takes huge amount of time .What should be proper approach to optimize searching and joins .
1. Using Non Clustered Index and searching via indexes .
2. Partitioning Table
3. or any other efficient approach.
I would like thanks in advance for your valuable time and efforts.

A "proper" approach is, of course, subjective. It's usually a trade-off, and the things most people trade off are the cost of implementing the change, the cost of maintaining the change, and the performance of the solution.
In all cases, I recommend gathering metrics, and agreeing your target - otherwise, you risk continuously optimizing beyond the point the business really needs. Typically, this means creating a representative test environment, with representative data. You then run the queries as they are today, and measure their performance. Finally, you agree (with whoever is paying the bills) what the minimum and optimum targets are. Once you reach that target - stop!
By far the cheapest solution is to optimize your queries, which often means creating indices. Depending on your queries, this can sometimes take just a few hours, and doesn't require any ongoing maintenance.
The next thing to do is to look at server configuration - tuning the memory allocation and disk strategy can do wonders, and making sure the database statistics are up to date. These tasks usually require 2 or 3 people to work together, and you may need to set up regular maintenance tasks.
If that doesn't do the job, consider improving the hardware. If your database server is as old as the database (13 years), it's quite possible that your mobile phone has better performance characteristics than your server. It's much cheaper to improve the hardware than it is to go to the next steps.
If hardware doesn't solve the problem, consider de-normalizing your data. For instance, if you are running lots of queries joining your large table to other large tables, consider creating a de-normalized table with all the data you need to fulfill that query. This is expensive, both from a development point of view (you have to work out how to maintain the denormalized data, how to make sure all the queries still work), and from a maintenance point of view - the additional complexity will make all enhancements and bug fixes harder.
If denormalizing doesn't work, partitioning is the next most expensive solution. This is a fairly drastic solution, because as far as I know, there's no "out of the box" solution to glue your front-end applications into the partitioning logic. So, pretty much every piece of code that needs to interact with the database needs to understand the partitioning logic, and a bug in any one place will break every other component that interacts with that data.

What are the first issues to check while optimizing an existing database?

What are the top issues and in which order of importance to look into while optimizing (performance tuning, troubleshooting) an existing (but unknown to you) database?
Which actions/measures in your previous optimizations gave the most effect (with possibly the minimum of work) ?
I'd like to partition this question into following categories (in order of interest to me):
one needs to show the performance boost (improvements) in the shortest time. i.e. most cost-effective methods/actions;
non-intrusive or least-troublesome most effective methods (without changing existing schemas, etc.)
intrusive methods
Update:
Suppose I have a copy of a database on dev machine without access to production environment to observe stats, most used queries, performance counters, etc. in real use.
This is development-related but not DBA-related question.
Update2:
Suppose the database was developed by others and was given to me for optimization (review) before it was delivered to production.
It is quite usual to have outsourced development detached from end-users.
Besides, there is a database design paradigm that a database, in contrast to application data storage, should be a value in itself independently on specific applications that use it or on context of its use.
Update3: Thanks to all answerers! You all pushed me to open subquestion
How do you stress load dev database (server) locally?

Create a performance Baseline (non-intrusive, use performance counters)
Identify the most expensive queries (non-intrusive, use SQL Profiler)
Identify the most frequently run queries (non-intrusive, use SQL Profiler)
Identify any overly complex queries, or those using slowly performing constructs or patterns. (non-intrusive to identify, use SQL Profiler and/or code inspections; possibly intrusive if changed, may require substantial re-testing)
Assess your hardware
Identify Indexes that would benefit the measured workload (non-intrusive, use SQL Profiler)
Measure and compare to your baseline.
If you have very large databases, or extreme operating conditions (such as 24/7 or ultra high query loads), look at the high end features offered by your RDBMS, such as table/index partitioning.
This may be of interest: How Can I Log and Find the Most Expensive Queries?

If the database is unknown to you, and you're under pressure, then you may not have time for Mitch's checklist which is good best practice to monitor server health.
You also need access to production to gather real info from assorted queries you can run. Without this, you're doomed. The server load pattern is important: you can't reproduce many issue yourself on a development server because you won't use the system like an end user.
Also, focus on "biggest bang for the buck". An expensive query running once daily at 3am can be ignored. A not-so-expensive one running every second is well worth optimising. However, you may not know this without knowing server load pattern.
So, basic steps..
Assuming you're firefighting:
server logs
SQL Server logs
sys.sysprocesses eg ASYNC_NETWORK_IO waits
Slow response:
profiler, with a duration filter. What runs often and is lengthy
most expensive query, weighted for how often used
open transaction with plan
weighted missing index
Things you should have:
Backups
Tested restore of aforementioned backups
Regular index and statistic maintenance
Regular DBCC and integrity checks
Edit: After your update
Static analysis is best practices only: you can't optimise for usage. This is all you can do. This is marc_s' answer.
You can guess what the most common query may be, but you can't guess how much data will be written or how badly a query scales with more data
In many shops developers provide some support, either directly or as *3rd line"
If you've been given a DB for review by another team that you hand over to another team to deploy: that's odd.

If you're not interested in the runtime behavior of the database, e.g. what are the most frequently executed queries and those that consume the most time, you can only do a "static" analysis of the database structure itself. That has a lot less value, really, since you can only check for a number of key indicators of bad design - but you cannot really tell much about the "dynamics" of the system being used.
Things I would check for in a database that I get as a .bak file - without the ability to collect live and actual runtime performance statistics - would be:
normalization - is the table structure normalized to third normal form? (at least most of the time - there might be some exceptions)
do all tables have a primary key? ("if it doesn't have a primary key, it's not a table", after all)
For SQL Server: do all the tables have a good clustering index? A unique, narrow, static, and preferably ever-increasing clustered key - ideally an INT IDENTITY, and most definitely not a large compound index of many fields, no GUID's and no large VARCHAR fields (see Kimberly Tripp's excellent blog posts on the topics for details)
are there any check and default constraints on the database tables?
are all the foreign key fields backed up by a non-clustered index to speed up JOIN queries?
are there any other, obvious "deadly sins" in the database, e.g. overly complicated views, or really badly designed tables etc.
But again: without actual runtime statistics, you're quite limited in what you can do from a "static analysis" point of view. The real optimization can only really happen when you have a workload from a regular day of operation, to see what queries are used frequently and put the most stress on your database --> use Mitch's checklist to check those points.

The most important thing to do is collect up-to-date statistics. Performance of a database depends on:
the schema;
the data in the database; and
the queries being executed.
Looking at any of those in isolation is far less useful than the whole.
Once you have collected the statistics, then you start identifying operations that are sub-par.
For what it's worth, the vast majority of performance problems we've fixed have been by either adding indexes, adding extra columns and triggers to move the cost of calculations away from the select to the insert/update, and tactfully informing the users that their queries are, shall we say, less than optimal :-)
They're usually pleased that we can just give them an equivalent query that runs much faster.

Optimize database for web usage (lots more reading than writing)

I am trying to layout the tables for use in new public-facing website. Seeing how there will lots more reading than writing data (guessing >85% reading) I would like to optimize the database for reading.
Whenever we list members we are planning on showing summary information about the members. Something akin to the reputation points and badges that stackoverflow uses. Instead of doing a subquery to find the information each time we do a search, I wanted to have a "calculated" field in the member table.
Whenever an action is initiated that would affect this field, say the member gets more points, we simply update this field by running a query to calculate the new values.
Obviously, there would be the need to keep this field up to date, but even if the field gets out of sync, we can always rerun the query to update this field.
My question: Is this an appropriate approach to optimizing the database? Or are the subqueries fast enough where the performance would not suffer.

There are two parts:
Caching
Tuned Query
Indexed Views (AKA Materialized views)
Tuned table
The best solution requires querying the database as little as possible, which would require caching. But you still need a query to fill that cache, and the cache needs to be refreshed when it is stale...
Indexed views are the next consideration. Because they are indexed, querying against is faster than an ordinary view (which is equivalent to a subquery). Nonclustered indexes can be applied to indexed views as well. The problem is that indexed views (materialized views in general) are very constrained to what they support - they can't have non-deterministic functions (IE: GETDATE()), extremely limited aggregate support, etc.
If what you need can't be handled by an indexed view, a table where the data is dumped & refreshed via a SQL Server Job is the next alternative. Like the indexed view, indexes would be applied to make fetching data faster. But data change means cleaning up the indexes to ensure the query is running as best it can, and this maintenance can take time.

The least expensive database query is the one that you don't have to run against the database at all.
In the scenario you describe, using a high-performance cache technology (example: memcached) to store query results in your application can be a lot better strategy than trying to trick out the database to be highly scalable.

The First Rule of Program Optimization: Don't do it.
The Second Rule of Program Optimization (for experts only!): Don't do it yet.
Michael A. Jackson
If you are just designing the tables, I'd say, it's definitely premature to optimize.
You might want to redesign your database a few days later, you might find out that things work pretty fast without any clever hacks, you might find out they work slow, but in a different way than you expected. In either case you would waste your time, if you start optimizing now.
The approach you describe is generally fine; you could get some pre-computed values, either using triggers/SPs to preserve data consistency, or running a job to update these values time-to-time.

All databases are more than 85% read only! Usually high nineties too.
Tune it when you need to and not before.

30 million records a day, SQL Server can't keep up, other kind of database system needed?

Some time ago I thought an new statistics system over, for our multi-million user website, to log and report user-actions for our customers.
The database-design is quite simple, containing one table, with a foreignId (200,000 different id's), a datetime field, an actionId (30 different id's), and two more fields containing some meta-information (just smallints). There are no constraints to other tables. Furthermore we have two indexes each containing 4 fields, which cannot be dropped, as users are getting timeouts when we are having smaller indexes. The foreignId is the most important field, as each and every query contains this field.
We chose to use SQL server, but after implementation doesn't a relational database seem like a perfect fit, as we cannot insert 30 million records a day (it's insert only, we don't do any updates) when also doing alot of random reads on the database; because the indexes cannot be updated fast enough. Ergo: we have a massive problem :-) We have temporarily solved the problem, yet
a relational database doesn't seem to be suited for this problem!
Would a database like BigTable be a better choice, and why? Or are there other, better choices when dealing with this kind of problems?
NB. At this point we use a single 8-core Xeon system with 4 GB memory and Win 2003 32-bit. RAID10 SCSI as far as I know. The index size is about 1.5x the table size.

You say that your system is capable of inserting 3000 records per second without indexes, but only about 100 with two additional non-clustered indexes. If 3k/s is the maximum throughput your I/O permits, adding two indexes should in theory reduces the throughput at about 1000-1500/sec. Instead you see a degradation 10 times worse. The proper solution and answer is 'It Dependts' and some serious troubleshooting and bottleneck identification would have to be carried out. With that in mind, if I was to venture a guess, I'd give two possible culprits:
A. Th additional non-clustered indexes distribute the writes of dirty pages into more allocation areas. The solution would be to place the the clustered index and each non-clustered index into its own filegroup and place the three filegroups each onto separate LUNs on the RAID.
B. The low selectivity of the non-clustered indexes creates high contention between reads and writes (key conflicts as well as %lockres% conflicts) resulting in long lock wait times for both inserts and selects. Possible solutions would be using SNAPSHOTs with read committed snapshot mode, but I must warn about the danger of adding lot of IO in the version store (ie. in tempdb) on system that may already be under high IO stress. A second solution is using database snapshots for reporting, they cause lower IO stress and they can be better controlled (no tempdb version store involved), but the reporting is no longer on real-time data.
I tend to believe B) as the likely cause, but I must again stress the need to proper investigation and proper root case analysis.
'RAID10' is not a very precise description.
How many spindles in the RAID 0 part? Are they short-striped?
How many LUNs?
Where is the database log located?
Where is the database located?
How many partitions?
Where is tempdb located?
As on the question whether relational databases are appropriate for something like this, yes, absolutely. There are many more factors to consider, recoverability, availability, toolset ecosystem, know-how expertise, ease of development, ease of deployment, ease of management and so on and so forth. Relational databases can easily handle your workload, they just need the proper tuning. 30 million inserts a day, 350 per second, is small change for a database server. But a 32bit 4GB RAM system hardly a database server, regardless the number of CPUs.

It sounds like you may be suffering from two particular problems. The first issue that you are hitting is that your indexes require rebuilding everytime you perform an insert - are you really trying to run live reports of a transactional server (this is usually considered a no-no)? Secondly, you may also be hitting issues with the server having to resize the database - check to ensure that you have allocated enough space and aren't relying on the database to do this for you.
Have you considered looking into something like indexed views in SQL Server? They are a good way to remove the indexing from the main table, and move it into a materialised view.

You could try making the table a partitioned one. This way the index updates will affect smaller sets of rows. Probably daily partitioning will be sufficient. If not, try partitioning by the hour!

You aren't providing enough information; I'm not certain why you say that a relational database seems like a bad fit, other than the fact that you're experiencing performance problems now. What sort of machine is the RDBMS running on? Given that you have foreign ID's, it seems that a relational database is exactly what's called for here. SQL Server should be able to handle 30 million inserts per day, assuming that it's running on sufficient hardware.

Replicating the database for reporting seems like the best route, given heavy traffic. However, a couple of things to try first...
Go with a single index, not two indexes. A clustered index is probably going to be a better choice than non-clustered. Fewer, wider indexes will generally perform better than more, narrower, indexes. And, as you say, it's the indexing that's killing your app.
You don't say what you're using for IDs, but if you're using GUIDs, you might want to change your keys over to bigints. Because GUIDs are random, they put a heavy burden on indexes, both in building indexes and in using them. Using a bigint identity column will keep the index running pretty much chronological, and if you're really interested in real-time access for queries on your recent data, your access pattern is much better suited for monotonically increasing keys.

Sybase IQ seems pretty good for the goal as our architects/DBAs indicated (as in, they explicitly move all our stats onto IQ stating that capability as the reason). I can not substantiate myself though - merely nod at the people in our company who generally know what they are talking about from past experience.
However, I'm wondering whether you MUST store all 30mm records? Would it not be better to store some pre-aggregated data?

Not sure about SQL server but in another database system I have used long ago, the ideal method for this type activity was to store the updates and then as a batch turn off the indexes, add the new records and then reindex. We did this once per night. I'm not sure if your reporting needs would be a fit for this type solution or even if it can be done in MS SQL, but I'd think it could.

You don't say how the inserts are managed. Are they batched or is each statistic written separately? Because inserting one thousand rows in a single operation would probably be way more efficient than inserting a single row in one thousand separate operations. You could still insert frequently enough to offer more-or-less real time reporting ;)

Oracle recommendations for high volume writes and low volume read

Is there some general guidelines online on how to tweak oracle for doing a high number of inserts and low number of reads?
All the answers below are pretty good recommendations. I have to clarify the following things. I am using 10g and this is an absolute requirement that we use Oracle. I am also more interested in oracle instance parameters for tuning (perhaps some different locking policies).

Let me assume you want to do an excessive high number of inserts, so that you simply want to just ignore all other kinds of operations just to get those inserts to complete, without problems.
First, have you completely ruled out other types of databases? There are systems like industry databases that cope very well with massive amounts of inserts, typically used to receive and store data from equipment that is measuring something in a factory environment. Oracle is a relational database, it might not be the right type of software for your needs.
Having said that, let's assume you can, or will, or should, use Oracle. The very first thing you need to do is to consider all the various types of data you need to make this assumption about. If they're all about the same kind of data, you need 1 table, and it need to be lean and mean regarding inserts.
The optimal way do that is to do the following:
do not add any indexes on this table at all, if you need a primary key, that's the only index you want
if you need to do reads against this table, consider having a shadow table with indexes that you do reads, lookups, and aggregates against. If this doesn't have to be up-to-the-millisecond updated, consider a periodic batch job to update it with data from the master table. This will disturb the master table with read-locks as little as possible
Make sure your server has fast disks. Transactional write operations will typically involve the disk at some point, so make sure that's a small bottleneck as you can get.
If your application is gathering data from many incoming sources, consider adding a layer in front of the database that will keep the number of concurrent connections and thus transactions to that table to a minimum. If you get a high number of write-locks on the same page for an oracle database, ultimately your performance will suffer.
If you can split up the data, consider splitting it in such a way that it is stored on different physical disks. That way, disk I/O problems won't be cross-data-type, and only affect one type of data.
In the other end of the spectrum you have a denormalized table with lots of indexes optimized for a balance between lookups and updates, and you need to find some middle-way that will get you the performance you want.

In terms of database design put as few constraints, indexes and triggers on the table(s) you're inserting into as possible as these will all slow down the insert.
The lack of indexes will obviously hurt your SELECT performance, but it doesn't sound like this is your primary concern.

What sort of application are we talking about? What version of Oracle?
If you are designing a data warehouse load process, for example, you would generally want to do direct-path inserts into staging table(s), then build any necessary indexes, then do a partition exhange to load the data into the partitioned destination table. This doesn't work as well, of course, if you are doing single-row inserts.
Depending on the Oracle version and the type of application, you may also want to enable compression on the table. Inserts are generally cheap from a CPU standpoint, so there is probably plenty of CPU available to do the compression which can substantially decrease the amount of I/O required, which is generally going to be your bottleneck.

I'm going to suggest that you take your question to Tom Kyte's site, http://asktom.oracle.com. You can generally find an answer there. Otherwise, try Oracle's forums.
Also try looking up any of Tom Kyte's books. Suggest checking the library or your local bookstore to find the right one, to ensure that the book contains the right topics for you. Also, his blog has links to his books and some articles/discussions on each book.
I did a quick google, site:oracle.com tuning write, and found this
OracleAS TopLink Writing Optimization Features. I realize that you might not be using TopLink but it may have some good tips. Keywords you'll want to try using: tuning, performance, insert(s), improve. Also through in the technology you are using like java/c++/etc.
Other tips you can try:
using stored procedures or using them in more efficient ways.
tweaking your server's hardware. Faster hard drives or a specific RAID array, possibly more cpu's.
Ask Tom thread - some nice comments here, also links to Fowler's site
You will probably have to start running some performance analytics on your queries/implementations to find the sweet spot for each one. I wish I had an easy answer for you. Good Luck!

A couple of suggestions for you to look into further:-
direct path load
block compression

Develop Reference

c reactjs sql-server angularjs arrays wpf database batch-file google-app-engine silverlight