If I have a query:
Select *
From tableA
Inner Join tableB ON tableA.bId = tableB.id
Inner Join tableC ON tableA.cId = tableC.id
where
tableA.someColumn = ?
Do I get any performance benefit from creating a composite index (bId, cId, someColumn)?
I'm using DB2 Database for this activity.
Indexing joins depends on the join algorithm used by the database. You'll see that in the execution plan.
You will probably need an index on tableA that starts with someColumn for the where clause. Everything else depends on the join algorithm and join order.
You will probably get a more specific answer if you post the execution plan. You can also read the chapter "The Join Operation" on my site about sql indexing and try yourself.
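As a hedged sketch of what that could look like on DB2 (the index name is an assumption, the parameter marker is replaced with a sample literal just for the explain, and the explain tables must already exist, e.g. created from EXPLAIN.DDL):
-- Index that leads with the WHERE-clause column, then the join columns
-- (index name invented for illustration):
CREATE INDEX idx_tablea_somecolumn ON tableA (someColumn, bId, cId);
-- Capture the access plan to see whether the index is actually picked up;
-- the '?' from the original query is replaced with a sample literal here:
EXPLAIN PLAN FOR
SELECT *
FROM tableA
INNER JOIN tableB ON tableA.bId = tableB.id
INNER JOIN tableC ON tableA.cId = tableC.id
WHERE tableA.someColumn = 'some-value';
-- Then format the captured plan from the command line, for example:
--   db2exfmt -d <dbname> -1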
If there are no indexes now, I'd guess that the composite index might be used in one or both inner joins. I doubt that it would be used in the WHERE clause.
But I've been doing this stuff for a long time. Guessing, like hoping, doesn't scale well.
Instead of guessing, you're better off learning how to use DB2's explain and design advisor utilities. Expect to test things like indexing first on a development computer. Building a three-column index on a 500 million row table that's in production will not make you popular.
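For example, a minimal sketch of asking the design advisor about this one statement from the command line (database name and the literal are placeholders):
db2advis -d MYDB -s "SELECT * FROM tableA INNER JOIN tableB ON tableA.bId = tableB.id INNER JOIN tableC ON tableA.cId = tableC.id WHERE tableA.someColumn = 'some-value'"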
Possible Duplicate:
INNER JOIN versus WHERE clause — any difference?
What is the difference between an INNER JOIN query and an implicit join query (i.e. listing multiple tables after the FROM keyword)?
For example, given the following two tables:
CREATE TABLE Statuses(
id INT PRIMARY KEY,
description VARCHAR(50)
);
INSERT INTO Statuses VALUES (1, 'status');
CREATE TABLE Documents(
id INT PRIMARY KEY,
statusId INT REFERENCES Statuses(id)
);
INSERT INTO Documents VALUES (9, 1);
What is the difference between the below two SQL queries?
From the testing I've done, they return the same result. Do they do the same thing? Are there situations where they will return different result sets?
-- Using implicit join (listing multiple tables)
SELECT s.description
FROM Documents d, Statuses s
WHERE d.statusId = s.id
AND d.id = 9;
-- Using INNER JOIN
SELECT s.description
FROM Documents d
INNER JOIN Statuses s ON d.statusId = s.id
WHERE d.id = 9;
There is no reason to ever use an implicit join (the one with the commas). Yes, for inner joins it will return the same results. However, it is subject to inadvertent cross joins, especially in complex queries, and it is harder to maintain because the old-style left/right outer join syntax (deprecated in SQL Server, where it doesn't work correctly anyway) differs from vendor to vendor. Since you shouldn't mix implicit and explicit joins in the same query (you can get wrong results), needing to change something to a left join means rewriting the entire query.
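As a sketch of that cross-join risk (the Users table is invented for illustration): with the comma syntax, forgetting one join predicate silently turns the query into a cross join, whereas with explicit JOIN ... ON syntax the missing condition is much easier to spot.
-- One join predicate is missing, so every matching Documents/Statuses row
-- is paired with every row of the (hypothetical) Users table:
SELECT *
FROM Documents d, Statuses s, Users u
WHERE d.statusId = s.id;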
If you do it the first way, people under the age of 30 will probably chuckle at you, but as long as you're doing an inner join, they produce the same result and the optimizer will generate the same execution plan (at least as far as I've ever been able to tell).
This does of course presume that the where clause in the first query is how you would be joining in the second query.
This will probably get closed as a duplicate, btw.
The nice part of the second method is that it helps separate the join condition (on ...) from the filter condition (where ...). This can help make the intent of the query more readable.
The join condition will typically be more descriptive of the structure of the database and the relation between the tables. e.g., the salary table is related to the employee table by the EmployeeID column, and queries involving those two tables will probably always join on that column.
The filter condition is more descriptive of the specific task being performed by the query. If the query is FindRichPeople, the where clause might be "where salaries.Salary > 1000000"... that's describing the task at hand, not the database structure.
Note that the SQL compiler doesn't see it that way... if it decides that it will be faster to cross join and then filter the results, it will cross join and filter the results. It doesn't care what is in the ON clause and what's in the WHERE clause. But that typically won't happen if the ON clause matches a foreign key or joins to a primary key or indexed column. As far as operating correctly, they are identical; as far as writing readable, maintainable code, the second way is probably a little better.
There is no difference as far as I know; the second one, with the inner join, is just the newer way to write such statements, and the first one is the old method.
Conceptually, the first one does a Cartesian product on all records within those two tables and then filters by the where clause.
The second only joins on records that meet the requirements of your ON clause.
EDIT: As others have indicated, the optimization engine will avoid actually building the Cartesian product and will arrive at more or less the same query plan.
These are somewhat similar and may help you out:
Left join vs multiple tables in SQL (a)
Left join vs multiple tables in SQL (b)
In the example you've given, the queries are equivalent; if you're using SQL Server, run the query and display the actual execution plan to see what the server's doing internally.
I'm not clear about the difference in behavior between the queries mentioned below.
Specifically, I'm unclear about the concept of OPTION (LOOP JOIN).
1st approach: a traditional join, which is more expensive than all of the ones below.
SELECT *
FROM [Item Detail] a
LEFT JOIN [Order Detail] b ON a.[ItemId] = b.[fkItemId] OPTION (FORCE ORDER);
2nd approach: it puts the LOOP hint on the join itself and seems to be somewhat more optimized.
SELECT *
FROM [Item Detail] a
LEFT LOOP JOIN [Order Detail] b ON a.[ItemId] = b.[fkItemId] OPTION (FORCE ORDER);
3rd approach: here I am not clear how the query works when it includes OPTION (LOOP JOIN) as well.
SELECT *
FROM [Item Detail] a
LEFT LOOP JOIN [Order Detail] b ON a.[ItemId] = b.[fkItemId] OPTION (LOOP JOIN);
Can anybody explain the difference, how each one works, and the advantages of each one over the others?
Note: These are not Nested OR Hash loops!
From Query Hints (Transact-SQL)
FORCE ORDER Specifies that the join order indicated by the query
syntax is preserved during query optimization. Using FORCE ORDER does
not affect possible role reversal behavior of the query optimizer.
also
{ LOOP | MERGE | HASH } JOIN Specifies that all join operations are
performed by LOOP JOIN, MERGE JOIN, or HASH JOIN in the whole query.
If more than one join hint is specified, the optimizer selects the
least expensive join strategy from the allowed ones.
Advanced Query Tuning Concepts
If one join input is small (fewer than 10 rows) and the other join
input is fairly large and indexed on its join columns, an index nested
loops join is the fastest join operation because they require the
least I/O and the fewest comparisons.
If the two join inputs are not small but are sorted on their join
column (for example, if they were obtained by scanning sorted
indexes), a merge join is the fastest join operation.
Hash joins can efficiently process large, unsorted, nonindexed inputs.
And Join Hints (Transact-SQL)
Join hints specify that the query optimizer enforce a join strategy
between two tables
Your option 1 tells the optimizer to keep the join order as is. So the JOIN type can be decided by the optimizer, so might be MERGE JOIN.
Your option 2 is telling the optimizer to use a LOOP JOIN for this specific join. If there were any other joins in the FROM section, the optimizer would still be able to decide for them. Also, you are specifying the order of joins for the optimizer to take.
Your last option OPTION (LOOP JOIN) would enforce LOOP JOIN across all joins in the query.
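As a sketch of the difference in scope (the [Customer Detail] table and its columns are invented just to give the query a second join):
-- Per-join hint: only the a/b join is forced to nested loops; the b/c join
-- is left to the optimizer. Note that specifying a join hint in the FROM
-- clause also enforces the join order.
SELECT *
FROM [Item Detail] a
LEFT LOOP JOIN [Order Detail] b ON a.[ItemId] = b.[fkItemId]
LEFT JOIN [Customer Detail] c ON b.[CustomerId] = c.[Id];
-- Query-wide hint: every join in the statement must use nested loops.
SELECT *
FROM [Item Detail] a
LEFT JOIN [Order Detail] b ON a.[ItemId] = b.[fkItemId]
LEFT JOIN [Customer Detail] c ON b.[CustomerId] = c.[Id]
OPTION (LOOP JOIN);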
This all said, it is very seldom that the optimizer would choose an incorrect plan, and this should probably indicate bigger underlying issues, such as outdated statistics or fragmented indexes.
I heard that setting up indexes is closely related to the sequence in which tables are joined. Could you provide some examples or an article about this point?
Thanks.
Not so much the join sequence as written by you, but the join sequence that the Query Optimizer elects to use.
So... if you're joining on two fields called CustomerId, then indexing that field in both tables (using an index which also incorporates fields that are needed by the query), then you can get a Merge Join happening.
Bear in mind that if your query filters a table using somefield = somevalue, then the index on that table should have somefield first.
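A minimal sketch of that idea, assuming SQL Server syntax and invented table and column names:
-- Both sides indexed on the join column; the Orders index also carries the
-- columns the query needs, so the join can be answered from the indexes:
CREATE INDEX IX_Customers_CustomerId ON Customers (CustomerId);
CREATE INDEX IX_Orders_CustomerId ON Orders (CustomerId) INCLUDE (OrderDate, Total);
SELECT c.CustomerId, o.OrderDate, o.Total
FROM Customers c
JOIN Orders o ON o.CustomerId = c.CustomerId;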
As per the subject, I am looking for a fast way to count records in a table, with a WHERE condition, without a table scan.
There are different methods, the most reliable one is
Select count(*) from table_name
But other than that you can also use one of the followings
select sum(1) from table_name
select count(1) from table_name
select rows from sysindexes where object_name(id)='table_name' and indid<2
exec sp_spaceused 'table_name'
DBCC CHECKTABLE('table_name')
The last 2 need sysindexes to be updated; run the following to achieve this. If you don't update them, it is highly likely you'll get wrong results, but for an approximation they might actually work.
DBCC UPDATEUSAGE ('database_name','table_name') WITH COUNT_ROWS
EDIT: Sorry, I did not read the part about counting with a certain clause. I agree with Cruachan: the solution for your problem is proper indexes.
The following page lists 4 methods of getting the number of rows in a table, with commentary on accuracy and speed.
http://blogs.msdn.com/b/martijnh/archive/2010/07/15/sql-server-how-to-quickly-retrieve-accurate-row-count-for-table.aspx
This is the one Management Studio uses:
SELECT CAST(p.rows AS float)
FROM sys.tables AS tbl
INNER JOIN sys.indexes AS idx ON idx.object_id = tbl.object_id and idx.index_id < 2
INNER JOIN sys.partitions AS p ON p.object_id=CAST(tbl.object_id AS int)
AND p.index_id=idx.index_id
WHERE ((tbl.name=N'Transactions'
AND SCHEMA_NAME(tbl.schema_id)='dbo'))
Simply, ensure that your table is correctly indexed for the where condition.
If you're concerned about this sort of performance, the approach is to create indexes which incorporate the field in question. For example, if your table contains a primary key of foo, plus fields bar, parrot and shrubbery, and you know that you're going to need to pull back records regularly using a condition based on shrubbery that just needs data from this field, you should set up a compound index of [shrubbery, foo]. This way the rdbms only has to query the index and not the table. Indexes, being tree structures, are far faster to query against than the table itself.
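A minimal sketch of that compound index, using the hypothetical table and column names from the paragraph above:
-- The count can then be answered from the index alone, without touching
-- the table's data pages:
CREATE INDEX ix_mytable_shrubbery_foo ON mytable (shrubbery, foo);
SELECT COUNT(*) FROM mytable WHERE shrubbery = 'hedge';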
How much actual activity the rdbms needs depends on the rdbms itself and precisely what information it puts into the index. For example, a SELECT COUNT(*) on an unindexed table not using a where condition will on most rdbms's return instantly, as the record count is held at the table level and a table scan is not required. Analogous considerations may hold for index access.
Be aware that indexes do carry a maintenance overhead in that if you update a field the rdbms has to update all indexes containing that field too. This may or may not be a critical consideration, but it's not uncommon to see tables where most activity is read and insert/update/delete activity is of lesser importance which are heavily indexed on various combinations of table fields such that most queries will just use the indexes and not touch the actual table data itself.
ADDED: If you are using indexed access on a table that does have significant IUD activity then just make sure you are scheduling regular maintenance. Tree structures, i.e. indexes, are most efficient when balanced, and with significant IUD activity periodic maintenance is needed to keep them this way.
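A sketch of that periodic maintenance (SQL Server syntax; the index name is the hypothetical one from above):
-- Light-weight defragmentation, or a full rebuild for heavier fragmentation:
ALTER INDEX ix_mytable_shrubbery_foo ON mytable REORGANIZE;
ALTER INDEX ix_mytable_shrubbery_foo ON mytable REBUILD;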
I am no good at SQL.
I am looking for a way to speed up a simple join like this:
SELECT
E.expressionID,
A.attributeName,
A.attributeValue
FROM
attributes A
JOIN
expressions E
ON
E.attributeId = A.attributeId
I am doing this tens of thousands of times, and it's taking longer and longer as the tables get bigger.
I am thinking indexes - If I was to speed up selects on the single tables I'd probably put nonclustered indexes on expressionID for the expressions table and another on (attributeName, attributeValue) for the attributes table - but I don't know how this could apply to the join.
EDIT: I already have a clustered index on expressionId (PK), attributeId (PK, FK) on the expressions table and another clustered index on attributeId (PK) on the attributes table
I've seen this question but I am asking for something more general and probably far simpler.
Any help appreciated!
You definitely want to have indexes on attributeID on both the attributes and expressions table. If you don't currently have those indexes in place, I think you'll see a big speedup.
In fact, because there are so few columns being returned, I would consider a covered index for this query, i.e. an index that includes all the fields in the query.
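For instance, a sketch of such covering indexes, assuming SQL Server syntax (the index names are free to choose):
-- Each index leads with the join column and carries the selected columns,
-- so the join can be answered from the indexes alone:
CREATE NONCLUSTERED INDEX IX_attributes_attributeId
    ON attributes (attributeId) INCLUDE (attributeName, attributeValue);
CREATE NONCLUSTERED INDEX IX_expressions_attributeId
    ON expressions (attributeId) INCLUDE (expressionID);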
Some things you need to care about are indexes, the query plan and statistics.
Put indexes on attributeId. Or, make sure indexes exist where attributeId is the first column in the key (SQL Server can still use indexes if it's not the 1st column, but it's not as fast).
Highlight the query in Query Analyzer and hit ^L to see the plan. You can see how tables are joined together. Almost always, using indexes is better than not (there are fringe cases where if a table is small enough, indexes can slow you down -- but for now, just be aware that 99% of the time indexes are good).
Pay attention to the order in which tables are joined. SQL Server maintains statistics on table sizes and will determine which one is better to join first. Do some investigation on internal SQL Server procedures to update statistics -- it's been too long so I don't have that info handy.
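For what it's worth, a minimal sketch of that statistics refresh (SQL Server syntax; table names taken from the question):
-- Refresh statistics for the two tables, or for the whole database:
UPDATE STATISTICS attributes;
UPDATE STATISTICS expressions;
EXEC sp_updatestats;  -- database-wide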
That should get you started. Really, an entire chapter can be written on how a database can optimize even such a simple query.
I bet your problem is the huge number of rows that are being inserted into that temp table. Is there any way you can add a WHERE clause before you SELECT every row in the database?
Another thing to do is add some indexes like this:
attributes.{attributeId, attributeName, attributeValue}
expressions.{attributeId, expressionID}
This is hacky! But useful if it's a last resort.
What this does is create a query plan that can be "entirely answered" by indexes. Usually, an index actually causes a double-I/O in your above query: one to hit the index (i.e. probe into the table), another to fetch the actual row referred to by the index (to pull attributeName, etc).
This is especially helpful if "attributes" or "expressions" is a wide table. That is, a table that's expensive to fetch the rows from.
Finally, the best way to speed your query is to add a WHERE clause!
If I'm understanding your schema correctly, you're stating that your tables kinda look like this:
Expressions: PK - ExpressionID, AttributeID
Attributes: PK - AttributeID
Assuming that each PK is a clustered index, that still means that an Index Scan is required on the Expressions table. You might want to consider creating an Index on the Expressions table such as: AttributeID, ExpressionID. This would help to stop the Index Scanning that currently occurs.
Tips,
If you want to speed up a query that uses joins:
For an "inner join"/"join":
Don't put the condition in the WHERE clause; instead, put it in the "ON" condition.
Eg:
select id,name from table1 a
join table2 b on a.name=b.name
where id='123'
Try,
select id,name from table1 a
join table2 b on a.name=b.name and a.id='123'
For "Left/Right Join",
Don't use in "ON" condition, Because if you use left/right join it will get all rows for any one table.So, No use of using it in "On". So, Try to use "Where" condition