Related
I was asked in an interview to enlighten the ways one can use to optimize the query Select * from TableA if it is taking a lot of time to execute. (TableA) can be any table with large amount of data. The interviewer didn't leave me any option like to select few columns or to use "WHERE" clause rather he wanted me to give solution for the subject query.
It's really hard to know what the interviewer was looking for.
They might be relatively inexperienced and expected answers like:
"list all the columns instead of * since that's way faster!"; or,
"add an ORDER BY because that will always speed it up!"
The kinds of things an experienced person might be looking for are:
inspect the query plan, are there computed columns or other similar things taking additional resources?
revisit the requirements - do the users really need the whole table in arbitrary order?
is there a clustered index on the table; if not, is the heap full of forwarding pointers?
is there excessive fragmentation on the underlying table (and/or the index being used to satisfy the query)?
is the query being blocked?
what is the query waiting on?
is the query waiting on an external resource (e.g. crappy I/O subsystem, a memory grant, a tempdb autogrow)?
is the query parallel and suffering packet waits because the stats are out of date?
There are a lot of underlying things that may be making that query slow or that may make that query a bad choice.
Actually some databases will have optimize commands that will rebuild the database tables to reduce fragmentation - and this way actually improve the performance for such queries.
PostgreSQL and SQLite have the command
VACUUM;
MySQL and ORACLE have a command
OPTIMIZE TABLE table;
It is expensive, as it will move around a lot of data. But in doing so, it will make the pages more balanced and this way usually shrink the total database size (some databases may however decide to add an index at this point, so it may also grow).
Since the data is stored in pages, reducing the number of pages by rebuilding the database can improve the performance even for a SELECT * FROM table; statement.
We have a mid-size SQL Server based application that has no indexes defined. Not even on the the identity columns. I suggested to our moderately expensive application consultant that perhaps we might get better performance (particularly as our database grows) by creating some indexes on appropriate fields, and he said:
"Indexes will significantly impact other areas of the application and customers should not create them under any circumstances."
Anybody ever heard of anything like this? Are there ever circumstances where one should not create any indexes? I can see nothing special about this app - it's got int identity columns, then lots of string columns, bunch of relational tables but nothing special or weird that I can see.
Thanks!
[EDIT: the identity columns are not using "identity specification", they seem to be set by the program, looking at the database with Management Studio, I can find NO indexes...]
FOLLOWUP: At a conference I asked the CEO (and chief architect) of the company producing this product about this, his response was that they felt for small to midsize deployments, the overhead associated with maintaining indexes would have more of a negative to overall user experience (the application does a lot of writes) than the benefits of the indexes would offset, but for large databases, they do create indexes. The tech support guy was just overzealous and very unhelpful with his answer. Mystery solved.
Hire me and I'll create the indexes for you. 14 years' Sybase/SQL Server experience tells me to create those !darn! indexes. Unless your table has less than 500 records each.
My idea is that an index hash node is roughly sized to 1000.
The other thing you need to look out for is whether your consultant has normalized the tables. Perhaps, the table has 500 fields/columns, containing more than one conceptual entity or a whole dozen of conceptual entities. And that could be why he is nervous about creating indexes, because if there are 12 conceptual entities in the table there would be at least 12 set of indexes - in which case, he is absolutely true - under no circumstances ... blah blah.
However, if he indeed does have 500 columns or detectably multiple conceptual entities per table - he is a very very lousy data design engineer. In all my years working with more experienced data engineers, our tables rarely exceed 20 columns. 5 on the low side, 10 on the average. Sometimes for performance' sake we do allow mixing two entities in a table, or horizontalizing row occurrences into columns of a table.
When you look at the table design you can with an untrained eye see Product, Project, BuildSheet, FloorPlan, Equipment, etc records all rolled into one long row. You cannot mix all these entities together into one table.
That is the only reason I know why he could advise you against having indexes. If he is doing that, you should know that he is fraudulently representing his data design skills to your company and you should immediately drop him from your weekly contractual expenses.
OK, after reading larry's post - I agree with him too.
There is such a thing as over-indexing, especially in INSERT and UPDATE heavy applications with very large tables. So the answer to the question in your title is yes, it can sometimes be a bad idea to add indexes.
That's quite a different question from the one you ask in the body of your question, which is "Is it ever normal to have NO indexes in a SQL Server database". The answer is that unless you're using the database as a "write-only" system, in which data is added but only read after being bulk extracted and transformed into a another data store, it's exceedingly unusual not to have some indexes in the database.
Your consultant's statement is odd enough to make me believe that you may have left some important information out of your description. If not, I'd say he's nuts.
Do you have the disk space to spare? I've seen cases where the indexes weighed more than the table.
However, No indexes exist whatsoever! There can't be a case for that except for when all read operations need the entire table.
Columns with key constraints will have an implicit index on them anyway. So if you're always selecting by the primary key, then there's no point adding more indexes. If you're selecting by other criteria, then it makes sense to add indexes on those columns that you're querying on.
It also depends on how insert-heavy your data is. If you're inserting more often than you're querying, then the overhead of keeping the indexes up to date can make your inserts slower.
But to say you "should not create [indexes] under any circumstances" is a bit much.
What I would recommend is that you run the SQL Server Profiler tool with some your queries. This tool will recommend which indexes to add that will have the biggest effect on performance.
In most run-of-the-mill applications, the impact of indexes on insertion performance is a bit of non-issue. You're usually better off creating the index and if insertion performance drops dramatically (which it probably won't) you can try something else. Obviously there are some exceptions, where you should be more careful, like tables that are used for logging for instance.
As mentioned, disk space can be an issue.
Creating irrelevant indexes (e.g. duplicates) will also waste microseconds and occasionally result in a bad query execution plan.
The other problem I've seen is with strangely code third-party applications that generate parts of the database at runtime, and can delete or choke on indexes that they don't know about.
In the vast majority of cases though, a carefully chosen index will only be a benefit.
Not having indexes on id columns sounds really unusual and I would find any justification for not including them to smell very fishy.
You should be aware that if you are doing a high volume of commits to the database, adding more indexes will affect the speed of insertion, but no index on id? Wow.
It would be good to get better justification of exactly how adding extra indexes might cause problems though.
the more indexes you have the slower data inserts and modifications will be. Make sure that you add indexes when appropriate and write queries that can take advantage of those indexes, also if the selectivity leve of your index is low, it will not be used effectively
I would say that if your server is having troubles with CPU time, indexes could be a solution. If you are querying tables without indexes, the server will need a lot more resources and if tables are having millions of records, it can become a serious problem. I recently cooled down a CPU from 80-90% all the time to 10-20% just by putting the right indexes.
If using MS SQL, you could check the activity monitor to see what queries are expensive and create indexes based on the where clauses or joins.
Then at the recent expensive queries:
You can then right click and check the complete query!
Im trying to work out the best way scale my site, and i have a question on how mssql will scale.
The way the table currently is:
cache_id - int - identifier
cache_name - nvchar 256 - Used for lookup along with event_id
cache_event_id - int - Basicly a way of grouping
cache_creation_date - datetime
cache_data - varbinary(MAX) - Data size will be from 2k to 5k
The data stored is a byte array, thats basically a cached instance (compressed) of a page on my site.
The different ways i see storing i see are:
1) 1 large table, it would contain tens millions of records and easily become several gigabytes in size.
2) Multiple tables to contain the data above, meaning each table would 200k to a million records.
The data will be used from this table to show web pages, so anything over 200ms to get a record is bad in my eyes ( I know some ppl think 1-2 seconds page load is ok, but i think thats slow and want to do my best to keep it lower).
So it boils down to, what is it that slows down the SQL server?
Is it the size of the table ( disk space )
Is the the number of rows
At what point does it stop becoming cost effective to use multiple database servers?
If its close to impossible to predict these things, il accept that as a reply to. Im not a DBA, and im basically trying to design my DB so i dont have to redesign it later when its it contains huge amount of data.
So it boils down to, what is it that slows down the SQL server?
Is it the size of the table ( disk space )
Is the the number of rows
At what point does it stop becoming cost effective to use multiple
database servers?
This is all a 'rule of thumb' view;
Load (and therefore to a considerable extent performance) of a DB is largely a factor of 2 issues data volumes and transaction load, with IMHO the second generally being more relevant.
With regards the data volume one can hold many gigabytes of data and get acceptable access times by way of Normalising, Indexing, Partitioning, Fast IO systems, appropriate buffer cache sizes, etc. Many of these, e.g. Normalisation are the issues that one considers at DB design time, others during system tuning, e.g. additional/less indexes, buffer cache size.
The transactional load is largely a factor of code design and total number of users. Code design includes factors like getting transaction size right (small and fast is the general goal, but like most things it is possible to take it to far and have transactions that are too small to retain integrity or so small as to in itself add load).
When scaling I advise first scale up (bigger, faster server) then out (multiple servers). The admin issues of a multiple server instance are significant and I suggest only worth considering for a site with OS, Network and DBA skills and processes to match.
Normalize and index.
How, we can't tell you, because you haven't told use what your table is trying to model or how you're trying to use it.
1 million rows is not at all uncommon. Again, we can't tell you much in the absence of context only you can, but don't, provide.
The only possible answer is to set it up, and be prepared for a long iterative process of learning things only you will know because only you will live in your domain. Any technical advice you see here will be naive and insufficiently informed until you have some practical experience to share.
Test every single one of your guesses, compare the results, and see what works. And keep looking for more testable ideas. (And don't be afraid to back out changes that end up not helping. It's a basic requirement to have any hope of sustained simplicity.)
And embrace the fact that your database design will evolve. It's not as fearsome as your comment suggests you think it is. It's much easier to change a database than the software that goes around it.
I have a table with 158 columns in SQL Server 2005.
any disdvantage of keeping so many columns.
Also I have to keep those many columns,
how can i improve performance - like using SP's, Indexes?
Wide tables can be quite performant when you usually want all the fields for a particular row. Have you traced your user's usage patterns? If they're usually pulling just one or two fields from multiple rows then your performance will suffer. The main issue is when your total row size hits the 8k page mark. That means SQL has to hit the disk twice for every row (first page + overflow page), and thats not counting any index hits.
The guys at Universal Data Models will have some good ideas for refactoring your table. And Red Gate's SQL Refactor makes splitting a table heaps easier.
There is nothing inherently wrong with wide tables. The main case for normalization is database size, where lots of null columns take up a lot of space.
The more columns you have, the slower your queries will be.
That's just a fact. That isn't to say you aren't justified in having many columns. The above does not give one carte blanche to split one entity's worth of table with many columns into multiple tables with fewer columns. The administrative overhead of such a solution would most probably outweigh any perceived performance gains.
My number one recommendation to you, based off of my experience with abnormally wide tables (denormalized schemas of bulk imported data) is to keep the columns as thin as possible. I had to work with a lot of crazy data and left most of the columns as VARCHAR(255). I recommend against this. Although convenient for development purposes, performance would spiral out of control, especially when working with Perl. Shrinking the columns to their bare minimum (VARCHAR(18) for instance) helped immensely.
Stored procedures are just batches of SQL commands; they don't have any direct on speed other than that regular use of certain types of stored procedures will end up using cached query plans (which is a performance boost).
You can use indexes to speed up certain queries, but there's no hard and fast rule here. Good index design depends entirely on the type of queries you're running. Indexing will, by definition, make your writes slower; they exist only to make your reads faster.
The problem with having that many columns in a table is that finding rows using the clustered primary key can be expensive. If it were possible to change the schema, breaking this up into many normalized tables will be the best way to improve efficiency. I would strongly recommend this course.
If not, then you may be able to use indices to make some SELECT queries faster. If you have queries that only use a small number of the columns, adding indices on those columns could mean that the clustered index will not need to be scanned. Of course, there is always a price to pay with indices, in terms of storage space and INSERT, UPDATE and DELTETE time, so this may not be a good idea for you.
I'm working on a project with a rather large Oracle database (although my question applies equally well to other databases). We have a web interface which allows users to search on almost any possible combination of fields.
To make these searches go fast, we're adding indexes to the fields and combinations of fields on which we believe users will commonly search. However, since we don't really know how our customers will use this software, it's hard to tell which indexes to create.
Space isn't a concern; we have a 4 terabyte RAID drive of which we are using only a small fraction. However, I'm worried about the possible performance penalties of having too many indexes. Because those indexes need to be updated every time a row is added, deleted, or modified, I imagine it'd be a bad idea to have dozens of indexes on a single table.
So how many indexes is considered too many? 10? 25? 50? Or should I just cover the really, really common and obvious cases and ignore everything else?
It depends on the operations that occur on the table.
If there's lots of SELECTs and very few changes, index all you like.... these will (potentially) speed the SELECT statements up.
If the table is heavily hit by UPDATEs, INSERTs + DELETEs ... these will be very slow with lots of indexes since they all need to be modified each time one of these operations takes place
Having said that, you can clearly add a lot of pointless indexes to a table that won't do anything. Adding B-Tree indexes to a column with 2 distinct values will be pointless since it doesn't add anything in terms of looking the data up. The more unique the values in a column, the more it will benefit from an index.
I usually proceed like this.
Get a log of the real queries run on the data on a typical day.
Add indexes so the most important queries hit the indexes in their execution plan.
Try to avoid indexing fields that have a lot of updates or inserts
After a few indexes, get a new log and repeat.
As with all any optimization, I stop when the requested performance is reached (this obviously implies that point 0. would be getting specific performance requirements).
Everyone else has been giving you great advice. I have an added suggestion for you as you move forward. At some point you have to make a decision as to your best indexing strategy. In the end though, the best PLANNED indexing strategy can still end up creating indexes that don't end up getting used. One strategy that lets you find indexes that aren't used is to monitor index usage. You do this as follows:-
alter index my_index_name monitoring usage;
You can then monitor whether the index is used or not from that point forward by querying v$object_usage. Information on this can be found in the Oracle® Database Administrator's Guide.
Just remember that if you have a warehousing strategy of dropping indexes before updating a table, then recreating them, you will have to set the index up for monitoring again, and you'll lose any monitoring history for that index.
In data warehousing it is very common to have a high number of indexes. I have worked with fact tables having two hundred columns and 190 of them indexed.
Although there is an overhead to this it must be understood in the context that in a data warehouse we generally only insert a row once, we never update it, but it can then participate in thousands of SELECT queries which might benefit from indexing on any of the columns.
For maximum flexibility a data warehouse generally uses single column bitmap indexes except on high cardinality columns, where (compressed) btree indexes can be used.
The overhead on index maintenance is mostly associated with the expense of writing to a great many blocks and the block splits as new rows are added with values that are "in the middle" of existing value ranges for that column. This can be mitigated by partitioning and having the new data loads aligned with the partitioning scheme, and by using direct path inserts.
To address your question more directly, I think it is probably fine to index the obvious at first, but do not be afraid of adding more indexes on if the queries against the table would benefit.
In a paraphrase of Einstein about simplicity, add as many indexes as you need and no more.
Seriously, however, every index you add requires maintenance whenever data is added to the table. On tables that are primarily read only, lots of indexes are a good thing. On tables that are highly dynamic, fewer is better.
My advice is to cover the common and obvious cases and then, as you encounter issues where you need more speed in getting data from specific tables, evaluate and add indices at that point.
Also, it's a good idea to re-evaluate your indexing schemes every few months, just to see if there is anything new that needs indexing or any indices that you've created that aren't being used for anything and should be gotten rid of.
In addition to the points everyone else has raised, the Cost Based Optimizer incurs a cost when creating a plan for an SQL statement if there are more indexes because there are more combinations for it to consider. You can reduce this by correctly using bind variables so that SQL statements stay in the SQL cache. Oracle can then do a soft parse and re-use the plan it found last time.
As always, nothing is simple. If there are skewed columns and histograms involved then this can be a bad idea.
In our web applications we tend to limit the combinations of searches that we allow. Otherwise you would have to test literally every combination for performance to ensure you did not have a lurking problem that someone will find one day. We have also implemented resource limits to stop this causing issues elsewhere in the application should something go wrong.
I made some simple tests on my real project and real MySql database. I already answered in this topic: What is the cost of indexing multiple db columns?
But I think it will be better if I quote it here:
I made some simple tests using my real
project and real MySql database.
My results are: adding average index
(1-3 columns in an index) to a table -
makes inserts slower by 2.1%. So, if
you add 20 indexes, your inserts will
be slower by 40-50%. But your selects
will be 10-100 times faster.
So is it ok to add many indexes? - It
depends :) I gave you my results - You
decide!
Ultimately how many indexes you need depend on the behavior of your applications that ride on top of your database server.
In general the more inserting you do the more painful your indexes become. Each time you do an insert, all the indexes that include that table have to be updated.
Now if your application has a decent amount of reading, or even more so if it's almost all reading, then indexes are the way to go as there will be major performance improvements for very little cost.
There's no static answer in my opinion, this sort of thing falls under 'performance tuning'.
It could be that everything your app does is looked up by a primary key, or it could be the oposite in that queries are done over unristricted combinations of fields and any one in particular could be used at any given time.
Beyond just indexing, there's reogranizing your DB to include calculated search fields, splitting tables, etc - it's really dependant on your load shapes and query parameters, how much/what data 'really' needs to be retruend by a query.
If your entire DB is fronted by stored-procedure facades turning becomes a bit easier, as you don't have to wory about every ad-hoc query. Or you may have a deep understanding of the kind of queries that will hit your DB, and can limit the tuning to those.
For SQL Server I've found the Database Engine Tuning advisor usefull - you set up 'typical' workloads and it can make recommendations about adding/removing indexes and statistics. I'm sure other DBs have similar tools, either 'offical' or third party.
This really is a more theoretical questions than practical. Indexes impact on your performance depends on the hardware you have, the version of Oracle, index types, etc. Yesterday I heard Oracle announced a dedicated storage, made by HP, which is supposed to perform 10 times faster with 11g database.
As for your case, there can be several solutions:
1. Have a large amount of indexes (>20) and rebuild them daily (nightly). This would be especially useful if the table gets thousands of updates/deletes daily.
2. Partition your table (if that applies your data model).
3. Use a separate table for new/updated data, and run a nightly process which combines the data together. This would require a change in your application logic.
4. Switch to IOT (index organized table), if your data support this.
Of course there might be many more solutions for such case. My first suggestion to you, would be to clone the DB to a development environment, and run some stress testing against it.
An index imposes a cost when the underlying table is updated. An index provides a benefit when it is used to spped up a query. For each index, you need to balance the cost against the benefit. How much slower does the query run without the index? How much of a benefit is running faster? Can you or your users tolerate the slow speed when the index is missing?
Can you tolerate the additional time it takes to complete an update?
You need to compare costs and benefits. That's particular to your situation. There's no magic number of indexes that passes the threshold of "too many".
There's also the cost of the space needed to store the index, but you've said that in your situation that's not an issue. The same is true in most situations, given how cheap disk space has become.
If you do mostly reads (and few updates) then there's really no reason not to index everything you'll need to index. If you update often, then you may need to be cautious on how many indexes you have. There's no hard number, but you'll notice when things start to slow down. Make sure your clustered index is the one that makes the most sense based on the data.
One thing you may consider is building indexes to target a standard combination of searches. If column1 is commonly searched, and column2 is often used with it, and column3 is sometimes used with column2 and column1, then an index on column1, column2, and column3 in that order can be used for any of those three circumstances, though it is only one index that has to be maintained.
How many columns are there?
I have always been told to make single-column indexes, not multi-column indexes. So no more indexes than the amount of columns, IMHO.
What it really comes down to is, don't add an index unless you know (and this often means gathering usage statistics) that it will be used far more often than it's updated.
Any index that doesn't meet that criteria will cost you more to rebuild than the performance penalty of not having it in the odd case it got used.
Sql server gives you some good tools that let you see which indexes are actually being used.
This article, http://www.mssqltips.com/tip.asp?tip=1239, gives you some queries that let you get a better insight into how much an index is used, as opposed to how much it is updated.
It is totally based on the columns which are being used in Where Clause.
And as the Thumb of Rule, we must have indexes on Foreign Key Columns to avoid DEADLOCKS.
AWR report should analyze periodically to understand the need of indexes.