Optimize the query "Select * from tableA" - sql-server

I was asked in an interview to explain the ways one can optimize the query SELECT * FROM TableA if it is taking a long time to execute. TableA can be any table with a large amount of data. The interviewer didn't leave me any options such as selecting only a few columns or adding a WHERE clause; he wanted a solution for the query exactly as stated.

It's really hard to know what the interviewer was looking for.
They might be relatively inexperienced and expected answers like:
"list all the columns instead of * since that's way faster!"; or,
"add an ORDER BY because that will always speed it up!"
The kinds of things an experienced person might be looking for are:
inspect the query plan: are there computed columns or other similar things consuming additional resources?
revisit the requirements - do the users really need the whole table in arbitrary order?
is there a clustered index on the table; if not, is the heap full of forwarding pointers?
is there excessive fragmentation on the underlying table (and/or the index being used to satisfy the query)?
is the query being blocked?
what is the query waiting on?
is the query waiting on an external resource (e.g. crappy I/O subsystem, a memory grant, a tempdb autogrow)?
is the query going parallel and suffering CXPACKET waits because the statistics are out of date?
There are a lot of underlying things that may be making that query slow or that may make that query a bad choice.
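If you have to show the diagnostic side concretely, a minimal sketch on SQL Server is to look at the live request and see what it is waiting on and whether it is blocked (the session id 53 below is just a placeholder for the spid running the slow SELECT):
-- Inspect the running request: wait type, blocking, and I/O so far.
-- sys.dm_exec_requests / sys.dm_exec_sql_text are standard DMVs; 53 is a hypothetical session id.
SELECT r.session_id,
       r.status,
       r.wait_type,             -- e.g. PAGEIOLATCH_SH (I/O), LCK_M_S (blocking), CXPACKET (parallelism)
       r.wait_time,
       r.blocking_session_id,   -- non-zero means the query is being blocked
       r.logical_reads,
       t.text AS running_sql
FROM sys.dm_exec_requests AS r
CROSS APPLY sys.dm_exec_sql_text(r.sql_handle) AS t
WHERE r.session_id = 53;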

Actually, some databases have optimize commands that rebuild tables to reduce fragmentation, and that can genuinely improve the performance of such queries.
PostgreSQL and SQLite have the command
VACUUM;
MySQL has the command
OPTIMIZE TABLE table;
(Oracle and SQL Server have no such single command; there the equivalent is rebuilding the table or its indexes.)
It is expensive, as it moves a lot of data around. But in doing so it makes the pages more compact, which usually shrinks the total database size (some databases may, however, also rebuild index structures at this point, so the files can grow as well).
Since the data is stored in pages, reducing the number of pages by rebuilding the database can improve the performance even for a SELECT * FROM table; statement.
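In SQL Server (the tag on this question), the closest equivalent is checking fragmentation with a DMV and rebuilding; a hedged sketch, assuming the table is dbo.TableA:
-- How fragmented are the heap/indexes on TableA?
SELECT i.name AS index_name,
       ips.index_type_desc,
       ips.avg_fragmentation_in_percent,
       ips.page_count
FROM sys.dm_db_index_physical_stats(DB_ID(), OBJECT_ID('dbo.TableA'), NULL, NULL, 'LIMITED') AS ips
JOIN sys.indexes AS i
  ON i.object_id = ips.object_id AND i.index_id = ips.index_id;

-- Rebuild all indexes on the table (expensive, just like VACUUM / OPTIMIZE TABLE above).
-- For a bare heap with no indexes, ALTER TABLE dbo.TableA REBUILD; does the equivalent.
ALTER INDEX ALL ON dbo.TableA REBUILD;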

Related

What do you do to make sure a new index does not slow down queries?

When we add or remove an index to speed something up, we may end up slowing something else down.
To protect against such cases, after creating a new index I am doing the following steps:
start the Profiler,
run a SQL script which contains lots of queries I do not want to slow down
load the trace from a file into a table,
analyze CPU, reads, and writes from the trace against the results from the previous runs, before I added (or removed) an index.
This is kind of automated and kind of does what I want. However, I am not sure if there is a better way to do it. Is there some tool that does what I want?
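For the load-and-compare steps, what I run looks roughly like this (the trace file path and table name are placeholders of my own, not anything standard):
-- Load the Profiler trace file into a table for analysis.
SELECT * INTO dbo.TraceRun_AfterIndex
FROM fn_trace_gettable(N'C:\traces\index_test.trc', DEFAULT);

-- Aggregate cost per statement so this run can be compared with the previous one.
SELECT CAST(TextData AS NVARCHAR(MAX)) AS query_text,
       COUNT(*)      AS executions,
       SUM(CPU)      AS total_cpu,
       SUM(Reads)    AS total_reads,
       SUM(Writes)   AS total_writes,
       SUM(Duration) AS total_duration
FROM dbo.TraceRun_AfterIndex
WHERE TextData IS NOT NULL
GROUP BY CAST(TextData AS NVARCHAR(MAX))
ORDER BY total_cpu DESC;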
Edit 1 To the person who voted to close my question: could you explain your reasons?
Edit 2 I googled but did not find anything that explains how adding an index can slow down selects. However, this is a well-known fact, so there should be something out there. If nothing comes up, I can write up a few examples later on.
Edit 3 One such example: two columns are highly correlated, like height and weight. We have an index on height, which is not selective enough for our query. We add an index on weight and run a query with two conditions, a range on height and a range on weight. Because the optimizer is not aware of the correlation, it grossly underestimates the cardinality of our query.
Another example: adding an index on an ever-increasing column, such as OrderDate, can seriously slow down a query with a condition like OrderDate > SomeDateAfterCreatingTheIndex, because the statistics on the new index end at the date it was built and the optimizer therefore estimates almost no matching rows.
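A minimal sketch of that second (ascending key) scenario, using a hypothetical dbo.Orders table and an illustrative date:
-- Hypothetical repro: the statistics built with the new index stop at its creation date.
CREATE INDEX IX_Orders_OrderDate ON dbo.Orders (OrderDate);

-- ...time passes, new orders are inserted, statistics have not been updated yet...

SELECT *
FROM dbo.Orders
WHERE OrderDate > '20240101';   -- a date later than anything covered by the index's statistics
-- The optimizer may estimate ~1 row and choose a lookup-heavy plan for what is really a large range.
-- UPDATE STATISTICS dbo.Orders IX_Orders_OrderDate; brings the estimate back in line.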
Ultimately what you're asking can be rephrased as 'How can I ensure that the queries that already use an optimal, fast, plan do not get 'optimized' into a worse execution plan?'.
Whether the plan changes due to parameter sniffing, a statistics update or metadata changes (like adding a new index), the best answer I know of for keeping the plan stable is plan guides. Deploying plan guides for critical queries that already have good execution plans is probably the best way to force the optimizer to keep using the good, validated plan. See Applying a Fixed Query Plan to a Plan Guide:
You can apply a fixed query plan to a plan guide of type OBJECT or
SQL. Plan guides that apply a fixed query plan are useful when you
know about an existing execution plan that performs better than the
one selected by the optimizer for a particular query.
The usual warnings apply, as with any abuse of a feature that prevents the optimizer from using a plan which may actually be better than the one in the plan guide.
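As a hedged sketch of what deploying such a guide looks like (the guide name and statement text here are made up; the USE PLAN XML would be the known-good plan you captured earlier, e.g. from sys.dm_exec_query_plan):
-- Pin a validated plan to a specific ad hoc statement.
EXEC sp_create_plan_guide
    @name            = N'Guide_CriticalReport',        -- hypothetical name
    @stmt            = N'SELECT * FROM dbo.TableA',     -- exact statement text to match
    @type            = N'SQL',
    @module_or_batch = NULL,
    @params          = NULL,
    @hints           = N'OPTION (USE PLAN N''<ShowPlanXML ...the captured good plan goes here... >'')';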
How about the following approach:
Save the execution plans of all typical queries.
After applying new indexes, check which execution plans have changed.
Test the performance of the queries with modified plans.
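One way to automate the first two steps on SQL Server is to snapshot the currently cached plans before the change and diff the plan hashes afterwards; a rough sketch (the snapshot table name is mine):
-- Snapshot cached statements and their plans (run once before and once after the index change).
SELECT qs.query_hash,
       qs.query_plan_hash,
       st.text       AS query_text,
       qp.query_plan
INTO dbo.PlanSnapshot_Before           -- hypothetical snapshot table
FROM sys.dm_exec_query_stats AS qs
CROSS APPLY sys.dm_exec_sql_text(qs.sql_handle) AS st
CROSS APPLY sys.dm_exec_query_plan(qs.plan_handle) AS qp;

-- After adding the index, take a second snapshot and join on query_hash:
-- rows whose query_plan_hash differs are the queries whose execution plans changed.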
From the page "Query Performance Tuning"
Improve Indexes
This page has many helpful step-by-step hints on how to tune your indexes for best performance, and what to watch for (profiling).
As with most performance optimization techniques, there are tradeoffs. For example, with more indexes, SELECT queries will potentially run faster. However, DML (INSERT, UPDATE, and DELETE) operations will slow down significantly because more indexes must be maintained with each operation. Therefore, if your queries are mostly SELECT statements, more indexes can be helpful. If your application performs many DML operations, you should be conservative with the number of indexes you create.
Other resources:
http://databases.about.com/od/sqlserver/a/indextuning.htm
However, it’s important to keep in mind that non-clustered indexes slow down the data modification and insertion process, so indexes should be kept to a minimum
http://searchsqlserver.techtarget.com/tip/Stored-procedure-to-find-fragmented-indexes-in-SQL-Server
Fragmented indexes and tables in SQL Server can slow down application performance. Here's a stored procedure that finds fragmented indexes in SQL servers and databases.
OK. First off, indexes slow down (at least) two things:
-> insert/update/delete: the index has to be maintained on every write
-> query planning: "shall I use that index or not?"
Someone mentioned the query planner might take a less efficient route - this is not supposed to happen.
If your optimizer is even half-decent, and your statistics and parameters are correct, there is no way it should pick the wrong plan.
Either way, in your case (MSSQL), you can hardly trust the optimizer and will still have to check every time.
What you're currently doing looks quite sound; you should just make sure the data you're looking at is relevant, i.e. real use-case queries in the right proportion (this can make a world of difference).
To do that, I always advise writing a benchmarking script based on real use, by logging production queries, a bit like I said here:
Complete db schema transformation - how to test rewritten queries?

Alternatives to SQL COUNT(*) practices?

I was looking for improvements to the PostgreSQL/InnoDB MVCC COUNT(*) problem when I found an article about implementing a workaround in PostgreSQL. However, the author made a statement that caught my attention:
MySQL zealots tend to point to
PostgreSQL’s slow count() as a
weakness, however, in the real world,
count() isn’t used very often, and if
it’s really needed, most good database
systems provide a framework for you to
build a workaround.
Are there ways to skip using COUNT(*) in the way you design your applications?
Is it true that most applications are designed so they don't need it? I use COUNT() on most of my pages since they all need pagination. What is this guy talking about? Is that why some sites only have a "next/previous" link?
Carrying this over into the NoSQL world, is this also something that has to be done there since you can't COUNT() records very easily?
I think when the author said
however, in the real world, count() isn’t used very often
they specifically meant an unqualified count(*) isn't used very often, which is the specific case that MyISAM optimises.
My own experience backs this up- apart from some dubious Munin plugins, I can't think of the last time I did a select count(*) from sometable.
For example, anywhere I'm doing pagination, it's usually the output of some search. Which implies there will be a WHERE clause to limit the results anyway- so I might be doing something like select count(*) from sometable where conditions followed by select ... from sometable limit n offset m. Neither of which can use the direct how-many-rows-in-this-table shortcut.
Now it's true that if the conditions are purely index conditions, then some databases can merge together the output of covering indices to avoid looking at the table data too. Which certainly decreases the number of blocks looked at... if it works. It may be that, for example, this is only a win if the query can be satisfied with a single index- depends on the db implementation.
Still, this is by no means always the case- a lot of our tables have an active flag which isn't indexed, but often is filtered on, so would require a heap check anyway.
If you just need an idea of whether a table has data in it or not, PostgreSQL and many other systems retain estimated statistics for each table: you can examine the reltuples and relpages columns in the catalogue for an estimate of how many rows the table has and how much space it is taking. Which is fine so long as roughly six significant figures is accurate enough for you, and some lag in the statistics being updated is tolerable. In the use case I can remember (plotting the number of items in a collection) it would have been fine for me...
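For reference, that catalogue lookup is just this, with 'sometable' standing in for whichever table you care about:
-- PostgreSQL: estimated row count and page count from the catalogue; cheap but approximate.
SELECT reltuples::bigint AS estimated_rows,
       relpages          AS pages
FROM pg_class
WHERE relname = 'sometable';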
Trying to maintain an accurate row counter is tricky. The article you cited caches the row count in an auxiliary table, which introduces two problems:
a race condition between SELECT and INSERT populating the auxiliary table (minor, you could seed this administratively)
as soon as you add a row to the main table, you have an update lock on the row in the auxiliary table; now any other process trying to add to the main table has to wait.
The upshot is that concurrent transactions get serialised instead of being able to run in parallel, and you've lost the writers-don't-have-to-block benefits of MVCC - you should reasonably expect to be able to insert two independent rows into the same table at the same time.
MyISAM can cache the row count per table because it takes an exclusive lock on the table when someone writes to it (IIRC). InnoDB allows finer-grained locking, but it doesn't try to cache the row count for the table. Of course, if you don't care about concurrency and/or transactions, you can take shortcuts... but then you're moving away from PostgreSQL's main focus, where data integrity and ACID transactions are primary goals.
I hope this sheds some light. I must admit, I've never really felt the need for a faster "count(*)", so to some extent this is simply a "but it works for me" testament rather than a real answer.
While you're really asking more of an application design question than a database question, there is more detail about how counting executes in PostgreSQL, and the alternatives to doing so, at Slow Counting. If you must have a fast count of something, you can maintain one with a trigger, with examples in the references there. That costs you a bit on the insert/update/delete side in return for speeding the count up. You have to know in advance what you will eventually want a count of for that to work, though.
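A minimal sketch of that trigger approach in PostgreSQL (the table, function and trigger names are invented; see the references on the Slow Counting page for fuller versions). Note it re-introduces exactly the contention discussed above: every writer now updates the same counter row.
-- Auxiliary counter table, kept in sync by a trigger on the main table.
CREATE TABLE sometable_rowcount (n bigint NOT NULL);
INSERT INTO sometable_rowcount VALUES ((SELECT count(*) FROM sometable));  -- seed once

CREATE FUNCTION sometable_count_trig() RETURNS trigger AS $$
BEGIN
  IF TG_OP = 'INSERT' THEN
    UPDATE sometable_rowcount SET n = n + 1;
  ELSIF TG_OP = 'DELETE' THEN
    UPDATE sometable_rowcount SET n = n - 1;
  END IF;
  RETURN NULL;  -- AFTER trigger: return value is ignored
END;
$$ LANGUAGE plpgsql;

CREATE TRIGGER sometable_count
AFTER INSERT OR DELETE ON sometable
FOR EACH ROW EXECUTE PROCEDURE sometable_count_trig();

-- SELECT n FROM sometable_rowcount;   -- instant, transactionally consistent count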

SQL Server file size effect on performance

I have two tables in my database, A and B.
A: a summary view of my data
B: a detail view (text columns containing the detailed story)
I have 1 million records in my database; 70% of the size is Table B and 30% is Table A.
I want to know whether the size of the database affects query performance and response time.
Would it be beneficial to remove Table B and store its contents on disk instead, to reduce the file size of the database and optimize its performance?
Absolutely, size can be a factor in DB performance! However, its impact can be mitigated through the proper use of indexes, relationships and integrity constraints. There are a number of other things that can cause performance loss, like triggers that fire in unwanted situations.
I would say removing the table is an option, but I don't think it should be your first choice. Make sure your database properly uses the things I mentioned above.
There are many databases far larger than yours that perform very well. Also, use explain plans on your queries to make sure you are using sound syntax, like the proper use of joins.
I would personally start with explain plans. They will tell you whether you are missing indexes, joining inefficiently, and so on. Then make the changes one at a time until you are happy with the performance.
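If it helps, the cheapest way to get that information in SQL Server is to run the query with statistics and the actual plan turned on (the query below is only a placeholder for whatever summary/detail join you actually run; the column names are invented):
-- Show I/O and timing per table alongside the plan (in SSMS, also enable "Include Actual Execution Plan").
SET STATISTICS IO ON;
SET STATISTICS TIME ON;

SELECT a.*                        -- hypothetical query joining summary table A to detail table B
FROM dbo.A AS a
JOIN dbo.B AS b ON b.A_Id = a.Id;

SET STATISTICS IO OFF;
SET STATISTICS TIME OFF;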
A million records is a tiny table in database terms. Any modern database should be able to handle that without breaking a sweat; even Access can easily handle it. If you are currently experiencing performance problems, you likely have database design problems, poorly written queries, and/or inadequate hardware.

What are the types and inner workings of a query optimizer?

As I understand it, most query optimizers are "cost-based". Others are "rule-based", or I believe they call it "Syntax Based". So, what's the best way to optimize the syntax of SQL statements to help an optimizer produce better results?
Some cost-based optimizers can be influenced by "hints" like FIRST_ROWS(). Others are tailored for OLAP. Is it possible to know more detailed logic about how Informix IDS and SE's optimizers decide what's the best route for processing a query, other than SET EXPLAIN? Is there any documentation which illustrates the ranking of SELECT statements as to what's the fastest way to access rows, assuming it's indexed?
I would imagine that "SELECT col FROM table WHERE ROWID = n" is the fastest (rank 1).
If I'm not mistaken, Informix SE's ROWID is a SERIAL (INT), which allows a maximum of roughly 2 billion rows - or maybe it uses INT8 to allow for far more? SE's optimizer is cost-based when it has enough data, but it does not use distributions like the IDS optimizer does.
IDS's ROWID isn't an INT; it is the logical address of the row's page, left-shifted 8 bits, plus the slot number on the page that contains the row's data.
IDS's optimizer is a cost-based optimizer that uses data about the index depth and width, the number of rows, the number of pages, and the data distributions created by UPDATE STATISTICS MEDIUM and HIGH to decide which query path is the least expensive - but there's no ranking of statements?
I think Oracle uses hex values for ROWID. Too bad ROWID can't be used more often, since a row's ROWID can change. So maybe ROWID could be used by the optimizer as a counter to report query progress - an idea I mentioned in my "Begin viewing query results before query completes" question. I feel it wouldn't be that difficult to report a query's progress while it is being processed, perhaps at the expense of some slight overhead, and it would be nice to know ahead of time: a "Google-like" estimate of how many rows meet the query's criteria, progress displayed every 100, 200, 500 or 1,000 rows, the ability for users to cancel at any time, and the qualifying rows being displayed as they are added to the result list while the search continues. This is just one example; perhaps we could think of other neat/useful features - the ingredients are more or less there.
Perhaps we could fine-tune each query with more granularity than is currently available? OLTP queries tend to be mostly static and pre-defined; the "what-ifs" are more OLAP, so let's try to add more control and intelligence there. Being able to precisely control the optimizer, not just "hint at" or influence it, is what's needed. We could then have more dynamic SELECT statements for specific situations - maybe even tell IDS to read blocks of index nodes at a time instead of one by one, and so on.
I'm not really sure what you are after, but here is some info on the SQL Server query optimizer which I've recently read:
13 Things You Should Know About Statistics and the Query Optimizer
SQL Server Query Execution Plan Analysis
and one for Informix that I just found using Google:
Part 1: Tuning Informix SQL
For Oracle, your best resource would be Cost-Based Oracle Fundamentals. It's about 500 pages (and billed as Volume 1, but there haven't been any follow-ups yet).
For a (very) simple full-table scan, progress can sometimes be monitored through v$session_longops. Oracle knows how many blocks it has to scan, how many blocks it has scanned, how many it has to go, and reports on progress.
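For what it's worth, a hedged example of watching that progress (these are the standard v$session_longops columns; the filter just hides finished operations):
-- Oracle: progress of long-running operations such as full table scans.
SELECT sid,
       opname,           -- e.g. 'Table Scan'
       target,           -- the object being scanned
       sofar,
       totalwork,
       ROUND(sofar / totalwork * 100, 1) AS pct_done,
       time_remaining    -- Oracle's estimate, in seconds
FROM   v$session_longops
WHERE  totalwork > 0
AND    sofar <> totalwork;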
Indexes are a different matter. If I search for records for a client 'Frank', and use the index, the database will make a guess at how many 'Frank' entries are in the table, but that guess can be massively off. It may be that you have 1000 'Frankenstein' and just 1 'Frank' or vice versa.
It gets even more complicated as you add in other filter and access predicates (e.g. where multiple indexes could be chosen), and it makes another leap as you include table joins. And that's without getting into the complex stuff about remote databases and domain indexes like Oracle Text and Locator.
In short, it is very complicated. It is stuff that can be useful to know if you are responsible for tuning a large application. Even for basic development you need some grounding in how the database can physically retrieve the data you are interested in.
But I'd say you are going down the wrong path here. The point of an RDBMS is to abstract away the details so that, for the most part, they just happen. Oracle employs smart people to build query transformations into the optimizer so that we developers can move away from 'syntax fiddling' to get the best plans (not entirely, but it is getting better).

Any suggestions for identifying what indexes need to be created?

I'm in a situation where I have to improve the performance of about 75 stored procedures (created by someone else) used for reporting. The first part of my solution was creating about 6 denormalized tables that will be used for the bulk of the reporting. Now that I've created the tables I have the somewhat daunting task of determining what Indexes I should create to best improve the performance of these stored procs.
I'm curious to see if anyone has any suggestions for finding which columns would make sense to include in the indexes. I've contemplated using Profiler/DTA, or possibly fashioning some sort of query like the one below to figure out the popular columns.
SELECT so.name, COUNT(so.name) AS hits, so.xtype
FROM syscomments AS sc
INNER JOIN sysobjects AS so ON sc.id = so.id
WHERE sc.text LIKE '%ColumnName%'
  AND so.xtype = 'P'
GROUP BY so.name, so.xtype
ORDER BY hits DESC
Let me know if you have any ideas that would help me not have to dig through these 75 procs by hand.
Also, inserts are only performed on this DB once per day so insert performance is not a huge concern for me.
Any suggestions for identifying what indexes need to be created?
Yes! Ask SQL Server to tell you.
SQL Server automatically keeps track of the indexes it could have used to improve performance (the "missing index" statistics). This is already going on in the background for you. See this link:
http://msdn.microsoft.com/en-us/library/ms345417.aspx
Try running a query like this (taken right from MSDN):
SELECT mig.*, statement AS table_name,
column_id, column_name, column_usage
FROM sys.dm_db_missing_index_details AS mid
CROSS APPLY sys.dm_db_missing_index_columns (mid.index_handle)
INNER JOIN sys.dm_db_missing_index_groups AS mig ON mig.index_handle = mid.index_handle
ORDER BY mig.index_group_handle, mig.index_handle, column_id;
Just be careful. I've seen people take the missing index views as gospel and use them to push out a bunch of indexes they don't really need. Indexes have costs, in terms of upkeep at insert, update, and delete time, as well as disk space and memory use. To make real, accurate use of this information, you want to profile actual execution times of your key procedures both before and after any changes, to make sure the benefits of an index (singly or cumulatively) aren't outweighed by the costs.
If you know all of the activity is coming from the 75 stored procedures, then I would use Profiler to track which stored procedures take the longest and are called the most. Once you know which ones those are, look at them and see which columns are used most often in the WHERE clauses and JOIN ... ON sections. Most likely, those are the columns you will want to put non-clustered indexes on. If a set of columns is often used together, there is a good chance you will want to make one non-clustered index for the group.
You can have many non-clustered indexes on a table (hundreds, in fact), but you probably don't want to put more than a handful on it. I think you will find the data is being searched and joined on the same columns over and over. Remember the 80/20 rule: you will probably get 80% of your speed increase from the first 20% of the work you do. There will come a point where each added index buys very little extra speed; that is when you want to stop.
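Once you've identified a recurring set of WHERE/JOIN columns, the index itself is the easy part. A hedged example (the table and column names are invented): put the filtered/joined columns in the key and the merely-returned columns in INCLUDE, so the index covers the reporting query:
-- Key columns come from WHERE / JOIN predicates; INCLUDE columns only need to be returned.
CREATE NONCLUSTERED INDEX IX_ReportSales_RegionID_SaleDate
ON dbo.ReportSales (RegionID, SaleDate)
INCLUDE (Quantity, Amount);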
I concur with bechbd - use a good sample of your database traffic (by running a server trace on a production system during real office hours, to get the best snapshot), and let the Database Tuning Advisor analyze that sampling.
I agree with you - don't blindly rely on everything the Database Tuning Advisor tells you to do. It's just a recommendation, and the DTA can't take everything into account. Sure, by adding indices you can speed up querying, but you'll slow down inserts and updates at the same time.
Also - to really find out if something helps, you need to implement it, measure again, and compare - that's really the only reliable way. There are just too many variables and unknowns involved.
And of course, you can use the DTA to fine-tune a single query so it performs outrageously well - but that might neglect the fact that the query is only ever called once per week, or that by tuning this one query and adding an index, you hurt other queries.
Index tuning is always a balance, a tradeoff, and a trial-and-error kind of game - it's not an exact science with a formula and a recipe book to strictly determine what you need.
You can use SQL Server Profiler in SSMS to see what is being called and how your tables are being used, and then feed that trace into the Database Engine Tuning Advisor to at least start you down the correct path. I know most DBAs will probably scream at me for recommending this, but for us non-DBA types it at least gives us a starting point.
If this is strictly a reporting database and you need performance, consider moving to a data warehouse design. A star or snowflake schema will outperform even a denormalized relational design when it comes to reporting.

Resources