How to avoid table scan and index scan for huge tables - sql-server

I am using MSSQL 2008 R2. I have a table with huge number of rows (test table)
I have the following SQL code, please suggest where I can use index hints, force seek or any other means to improve performance.
Indexes
1. non-clustered - idx_id (id)
2. non-clustered - idx_name (name)
SELECT DISTINCT
p.id,
p.name,
FROM
test p
LEFT OUTER JOIN
(
SELECT
e.id
FROM
test e
INNER JOIN
(
SELECT
c.id
FROM
test c
GROUP BY
c.id
HAVING
COUNT(1) > 1
) f
ON e.id = f.id
WHERE
e.name = 'test_name'
) m
ON p.id = m.id
WHERE
m.id is null

Prerequise: have a primary key
select distinct
p.id
, p.name
from test p
where not exists (
SELECT TOP(1)
1
FROM test e
WHERE e.PrimaryKey <> p.PrimaryKey
AND e.id = p.id
AND 'test_name' IN (e.name, p.name)
)

How many columns your table contains? If there's only these two columns, it makes no sense to add nonclustered index. You should create CLUSTERED index on ID column, and that's it - you'll see performance increase.
If you have many colums, consider two options:
Create clustered index on NAME column and nonclustered index on ID column.
Create nonclustered index on ID column, and INCLUDE NAME column (you'll create covering index that way)

Generally speaking, relational databases (being relational) are written in such a way to optimize join statements. When using a "JOIN" clause with "ON" criteria, the database engine can create an optimized execution plan that takes the table structure, indexes, etc. into account. When joining on a sub-select, sometimes the same optimizing factors are not available, or are not taken into account the same way. It depends on your schema, but it is a good rule of thumb to assume that a standard join with an "on" clause is going to be more efficient than a join on a sub-select.
Your schema is pretty vague, so I am not even sure that you need the joins, but if you do, you should try performing the joins directly with "on" criteria.

Related

Merge Join over sorted columns instead of Hash Join

I have two tables
Table A
(
id int,
name varchar(39),
lname varchar (49),
...
)
Table B
(
id int,
city varchar(39),
...
)
Both tables are sorted on column ID. IDs are simply identities and are populated by auto incremented integers 1 to n.
However, if I input a query e.g.,
SELECT *
FROM A, B
WHERE A.id = B.id;
I get a hash join instead of the efficient merge join. How can I enforce the merge join in SQL Server instead? I don't want to use an index, thus no index-based plans.
Note that I don't want a merge-join with a sort-enforcer either, I know that one can hint the planner by rewriting the query to
SELECT *
FROM A
INNER MERGE JOIN B ON A.ID = B.ID;
By the way I'm using SQL Server Express edition. But I can change to any open source DB if the latter supports the query plan that I'm aiming.
Thanks in advance
If you believe you are smarted then the SQL Engine :-) you can use hits like this:
SELECT *
FROM A
INNER HASH JOIN B
ON A.id = B.id
OR
SELECT *
FROM A
INNER MERGE JOIN B
ON A.id = B.id
At least, you can test if the MERGE will be really better. And even it is better in this case, does not mean that it will be the best choice always. It can reduce the performance in other cases, so generally it will be better to leave this work to the engine.

Too many parameter values slowing down query

I have a query that runs fairly fast under normal circumstances. But it is running very slow (at least 20 minutes in SSMS) due to how many values are in the filter.
Here's the generic version of it, and you can see that one part is filtering by over 8,000 values, making it run slow.
SELECT DISTINCT
column
FROM
table_a a
JOIN
table_b b ON (a.KEY = b.KEY)
WHERE
a.date BETWEEN #Start and #End
AND b.ID IN (... over 8,000 values)
AND b.place IN ( ... 20 values)
ORDER BY
a.column ASC
It's to the point where it's too slow to use in the production application.
Does anyone know how to fix this, or optimize the query?
To make a query fast, you need indexes.
You need a separate index for the following columns: a.KEY, b.KEY, a.date, b.ID, b.place.
As gotqn wrote before, if you put your 8000 items to a temp table, and inner join it, it will make the query even faster too, but without the index on the other part of the join it will be slow even then.
What you need is to put the filtering values in temporary table. Then use the table to apply filtering using INNER JOIN instead of WHERE IN. For example:
IF OBJECT_ID('tempdb..#FilterDataSource') IS NOT NULL
BEGIN;
DROP TABLE #FilterDataSource;
END;
CREATE TABLE #FilterDataSource
(
[ID] INT PRIMARY KEY
);
INSERT INTO #FilterDataSource ([ID])
-- you need to split values
SELECT DISTINCT column
FROM table_a a
INNER JOIN table_b b
ON (a.KEY = b.KEY)
INNER JOIN #FilterDataSource FS
ON b.id = FS.ID
WHERE a.date BETWEEN #Start and #End
AND b.place IN ( ... 20 values)
ORDER BY .column ASC;
Few important notes:
we are using temporary table in order to allow parallel execution plans to be used
if you have fast (for example CLR function) for spiting, you can join the function itself
it is not good to use IN with many values, the SQL Server is not able to build always the execution plan which may lead to time outs/internal error - you can find more information here

Query optimization for INNER JOIN OR condition

I have the following query that runs really slowly and I've used the estimated execution plan to narrow down the problem to the final INNER JOIN's OR condition.
SELECT
TableE.id
FROM
TableA WITH (NOLOCK)
INNER JOIN TableB WITH (NOLOCK)
ON TableA.[bid] = TableB.[id]
LEFT JOIN TableC WITH (NOLOCK)
ON TableB.[cid] = TableC.[id]
LEFT JOIN TableD WITH (NOLOCK)
ON TableA.[did] = TableD.[id]
INNER JOIN TableE WITH (NOLOCK)
ON TableD.[eid] = TableE.[id]
OR TableE.[numericCol] = TableB.[numericCol] -- commenting out this OR statement leads to large performance increase
WHERE
TableA.[id] = #Id
I have the following index on TableB:
CREATE UNIQUE NONCLUSTERED INDEX [IX_TableB_numericCol_id] ON [dbo].[TableB]
(
[numericCol] ASC,
[id] ASC
)
and on TableE:
CREATE NONCLUSTERED INDEX [IX_TableE_numericCol] ON [dbo].[TableE]
(
[numericCol] ASC
)
Any ideas?
See my answer here:
What goes wrong when I add the Where clause?
You say that you tried adding a covering index on TableE, covering both numericCol and id. However, I suspect that the query didn't use it. That is why you see a Table Spool.
You want to force the query to use the covering index, either by making it the only index on the table, or by including a query hint. That should eliminate the Table Spool and speed up the Nested Loop.
Same for TableB. If the index being used is only on Id, then it is not helping the Nested Loop which is on NumericCol. Force the issue either by getting rid of the index on Id, or with a query hint.

What's the difference between these T-SQL queries using OR?

I use Microsoft SQL Server 2008 (SP1, x64). I have two queries that do the same, or so I think, but they are have completely different query plans and performance.
Query 1:
SELECT c_pk
FROM table_c
WHERE c_b_id IN (SELECT b_id FROM table_b WHERE b_z = 1)
OR c_a_id IN (SELECT a_id FROM table_a WHERE a_z = 1)
Query 2:
SELECT c_pk
FROM table_c
LEFT JOIN (SELECT b_id FROM table_b WHERE b_z = 1) AS b ON c_b_id = b_id
LEFT JOIN (SELECT a_id FROM table_a WHERE a_z = 1) AS a ON c_a_id = a_id
WHERE b_id IS NOT NULL
OR a_id IS NOT NULL
Query 1 is fast as I would expect, whereas query 2 is very slow. The query plans look quite different.
I would like query 2 to be as fast as query 1. I have software that uses query 2, and I cannot change that into query 1. I can change the database.
Some questions:
why are the query plans different?
can I "teach" SQL Server somehow that query 2 is equal to query 1?
All tables have (clustered) primary keys and proper indexes on all columns:
CREATE TABLE table_a (
a_pk int NOT NULL PRIMARY KEY,
a_id int NOT NULL UNIQUE,
a_z int
)
GO
CREATE INDEX IX_table_a_z ON table_a (a_z)
GO
CREATE TABLE table_b (
b_pk int NOT NULL PRIMARY KEY,
b_id int NOT NULL UNIQUE,
b_z int
)
GO
CREATE INDEX IX_table_b_z ON table_b (b_z)
GO
CREATE TABLE table_c (
c_pk int NOT NULL PRIMARY KEY,
c_a_id int,
c_b_id int
)
GO
CREATE INDEX IX_table_c_a_id ON table_c (c_a_id)
GO
CREATE INDEX IX_table_c_b_id ON table_c (c_b_id)
GO
The tables are not modified after filling initially. I'm the only one querying them. They contains millions of records (table_a: 5M, table_b: 4M, table_c: 12M), but using only 1% gives similar results.
Edit: I tried adding FOREIGN KEYs for c_a_id and c_b_id, but that only made query 1 slower...
I hope someone can have a look at the query plans and explain the difference.
Join are slower, let me say by design. First query uses a sub-query (cacheable) to filter records so it'll produce less data (and less accesses to each table).
Did you read these:
http://www.sql-server-performance.com/2006/tuning-joins/
http://blogs.msdn.com/b/craigfr/archive/2006/12/04/semi-join-transformation.aspx
What I mean is that with IN the DB can do better optimizations like removing duplicates, stop at first match and similar (and these are from school memories so I'm sure it'll do much better). So I guess the question isn't why QP is different but how smart how deep optimizations can go.
You are comparing non equivalent queries also you are using left join in quite unusual way.
Generally if yours intention was to select all entries in table_c which has linked records either in table_a or table_b you should use exists statement:
SELECT c_pk
FROM table_c
WHERE Exists(
SELECT 1
FROM table_b
WHERE b_z = 1 and c_b_id = b_id
) OR Exists(
SELECT 1
FROM table_a
WHERE a_z = 1 and c_a_id = a_id
)
Since you can't change the query, at least you can improve the query's environment.
Highlight your query, right-click on it in SSMS and select "Analyze
Query in Database Engine Tuning Advisor."
Run the analysis to find out if you need any additional indexes or
statistics built.
Heed SQL Server's advice.

Pivot Table vs Parent_ID for 1-Many Relation

In a general one-to-many (parent-to-child) relationship, is there a significant efficiency difference between (a) putting parent_id in the child table and (b) using a pivot table of only parent_id, child_id?
NOTE: Assume Oracle if necessary, otherwise answer using a general RDBMS.
If by PIVOT table you mean a many-to-many link table, then no, it will only hamper performance.
You should keep parent_id in the child table.
The many-to-many link table takes an extra JOIN and therefore is less efficient.
Compare the following queries:
SELECT *
FROM child_table c
JOIN child_to_parent cp
ON cp.child = c.id
JOIN parent p
ON p.id = cp.parent
WHERE c.property = 'some_property'
and this one:
SELECT *
FROM child_table c
JOIN parent p
ON p.id = c.parent
WHERE c.property = 'some_property'
The latter one is one JOIN shorter and more efficient.
The only possible exception to that rule is that you run these queries often:
SELECT *
FROM child_table c
JOIN parent_table p
ON p.id = c.parent
WHERE c.id IN (id1, id2, ...)
, i. e. you know the id's of the child rows beforehand.
This may be useful if you use natural keys for your child_table.
In this case yes, the child_to_parent link table will be more efficient, since you can just replace it with the following query:
SELECT *
FROM child_to_parent cp
JOIN parent_table p
ON p.id = cp.parent
WHERE cp.child IN (id1, id2, ...)
and child_to_parent will be always less or equal in size to child_table, and hence more efficient.
However, in Oracle you can achieve just the same result creating a composite index on child_table (id, parent_id).
Since Oracle does not index NULL's, this index will be just like your child_to_parent table, but without the table itself and implied maintenance overhead.
In other systems (which index NULL's), the index may be less efficient than a dedicated table, especially if you have lots of NULL parents.

Resources