How to Speed Up Simple Join - sql-server

I am no good at SQL.
I am looking for a way to speed up a simple join like this:
SELECT
E.expressionID,
A.attributeName,
A.attributeValue
FROM
attributes A
JOIN
expressions E
ON
E.attributeId = A.attributeId
I am running this tens of thousands of times, and it takes longer and longer as the tables grow.
I am thinking indexes: if I were speeding up selects on the single tables, I'd probably put a nonclustered index on expressionID for the expressions table and another on (attributeName, attributeValue) for the attributes table, but I don't know how this applies to the join.
EDIT: I already have a clustered index on expressionId (PK), attributeId (PK, FK) on the expressions table and another clustered index on attributeId (PK) on the attributes table
I've seen this question but I am asking for something more general and probably far simpler.
Any help appreciated!

You definitely want to have indexes on attributeID on both the attributes and expressions tables. If you don't currently have those indexes in place, I think you'll see a big speedup.

In fact, because so few columns are being returned, I would consider a covering index for this query, i.e. an index that includes all the fields in the query.
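For example, something like this might do it (just a sketch: the index names are invented, and INCLUDE requires SQL Server 2005 or later):

CREATE NONCLUSTERED INDEX IX_attributes_attributeId
ON attributes (attributeId)
INCLUDE (attributeName, attributeValue);

CREATE NONCLUSTERED INDEX IX_expressions_attributeId
ON expressions (attributeId)
INCLUDE (expressionID);

With these in place, the join above can be answered entirely from the two indexes without touching the base tables.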

Some things you need to care about are indexes, the query plan and statistics.
Put indexes on attributeId. Or, make sure indexes exist where attributeId is the first column in the key (SQL Server can still use an index if it's not the first column, but it's not as fast).
Highlight the query in Query Analyzer and hit ^L to see the plan. You can see how tables are joined together. Almost always, using indexes is better than not (there are fringe cases where if a table is small enough, indexes can slow you down -- but for now, just be aware that 99% of the time indexes are good).
Pay attention to the order in which tables are joined. SQL Server maintains statistics on table sizes and will determine which one is better to join first. Look into SQL Server's built-in procedures for updating statistics; it's been too long, so I don't have that info handy.
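For reference, the standard commands for refreshing statistics are UPDATE STATISTICS for a single table and sp_updatestats for the whole database, e.g.:

UPDATE STATISTICS attributes;
UPDATE STATISTICS expressions;
-- or refresh statistics across the whole database:
EXEC sp_updatestats;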
That should get you started. Really, an entire chapter can be written on how a database can optimize even such a simple query.

I bet your problem is the huge number of rows that are being inserted into that temp table. Is there any way you can add a WHERE clause before you SELECT every row in the database?

Another thing to do is add some indexes like this:
attributes.{attributeId, attributeName, attributeValue}
expressions.{attributeId, expressionID}
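In CREATE INDEX form, that would be something like this (a sketch; the index names are invented, and if attributeValue is a large column you may need to move it into an INCLUDE clause instead of the key):

CREATE INDEX IX_attributes_all ON attributes (attributeId, attributeName, attributeValue);
CREATE INDEX IX_expressions_all ON expressions (attributeId, expressionID);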
This is hacky! But useful if it's a last resort.
What this does is create a query plan that can be "entirely answered" by the indexes. Usually, an index costs a double I/O in a query like the one above: one to probe the index, another to fetch the actual row the index entry refers to (to pull attributeName, etc.).
This is especially helpful if "attributes" or "expressions" is a wide table, that is, a table that's expensive to fetch rows from.
Finally, the best way to speed your query is to add a WHERE clause!

If I'm understanding your schema correctly, you're stating that your tables kinda look like this:
Expressions: PK - ExpressionID, AttributeID
Attributes: PK - AttributeID
Assuming that each PK is a clustered index, that still means an Index Scan is required on the Expressions table. You might want to consider creating an index on the Expressions table such as (AttributeID, ExpressionID). This would help stop the index scanning that currently occurs.
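For example (a sketch; the index name is invented):

CREATE NONCLUSTERED INDEX IX_Expressions_AttributeID
ON Expressions (AttributeID, ExpressionID);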

Tips, if you want to speed up your query using joins:
For an "inner join"/"join":
Don't put the condition in WHERE; use it in the ON condition instead.
E.g. instead of:
select a.id, a.name from table1 a
join table2 b on a.name = b.name
where a.id = '123'
try:
select a.id, a.name from table1 a
join table2 b on a.name = b.name and a.id = '123'
For "Left/Right Join",
Don't use in "ON" condition, Because if you use left/right join it will get all rows for any one table.So, No use of using it in "On". So, Try to use "Where" condition
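To illustrate with the same hypothetical tables (a sketch):

-- With a LEFT JOIN, a filter on the outer table inside ON does not remove rows:
select a.id, a.name from table1 a
left join table2 b on a.name = b.name and a.id = '123'; -- every table1 row still comes back

-- Put the filter in WHERE to actually restrict the result:
select a.id, a.name from table1 a
left join table2 b on a.name = b.name
where a.id = '123';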

Related

which execution plan has better performance?

(My English is not good enough, so bear with me.)
I am working on optimizing this query:
SELECT
(SELECT Title FROM Groups WHERE GroupID = WWPMD.GroupID) AS GroupName
FROM dbo.Detail AS WWPMD INNER JOIN
dbo.News AS WWPMN ON WWPMD.DetailID = WWPMN.NewsID
WHERE
WWPMD.IsDeleted = 0 AND WWPMD.IsEnabled = 1 AND WWPMD.GroupID IN (
SELECT ModuleID FROM Page_Content_Item WHERE ContentID = 'a6780e80-4ead4e62-9b22-1190bb88ed90')
In this case the tables have clustered indexes on their primary keys, which are GUIDs. In the execution plan the arrows were a little thick, the cost of the clustered index seek for the "Detail" table was 87%, and there was no cost for a key lookup.
Then I changed the indexes on the "Detail" table: I put the clustered index on a datetime column and three nonclustered indexes on the PK and FKs. Now, in the execution plan, the index seek cost for "Detail" is 4% and the key lookup is 80%, with thin arrows.
I want to know which execution plan is better, and what I can do to improve this query.
UPDATE:
Thank you all for your guidance. One more question: which is better, 80% cost on a clustered index seek, or 80% total cost across a non-clustered index seek plus a key lookup?
The IN operator is fine when you are selecting a little data; if you are selecting more data, an INNER JOIN gives better performance than IN on large data sets.
IN is faster than a JOIN with DISTINCT.
EXISTS is more efficient than IN, because EXISTS returns as soon as one row is found.
A clustered index on a GUID column is not a good idea when your GUIDs are not sequential, since this will lead to a performance loss on inserts. The records in your table are physically ordered by the clustered index. The clustered index should be put on a column which has sequential values and doesn't change (often).
If you have a nonclustered index on GroupID (table Groups), then you could make Title an included column on that index. See MSDN for included columns.
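For example (a sketch; the index name is invented, and included columns require SQL Server 2005 or later):

CREATE NONCLUSTERED INDEX IX_Groups_GroupID
ON Groups (GroupID)
INCLUDE (Title);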
I suggest using the following:
INNER JOIN (SELECT DISTINCT ModuleID FROM Page_Content_Item WHERE ContentID = 'a6780e80-4ead4e62-9b22-1190bb88ed90') z ON WWPMD.GroupID = z.ModuleID
Instead of:
AND WWPMD.GroupID IN (
Select ModuleID FROM Page_Content_Item WHERE ContentID='a6780e80-4ead4e62-9b22-1190bb88ed90')
Also, review the execution plan of this query. It seems an index on Detail.DetailID with the filter (IsDeleted = 0 AND IsEnabled = 1) would be very useful.
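Such a filtered index might look like this (a sketch; the index name is invented, and filtered indexes require SQL Server 2008 or later):

CREATE NONCLUSTERED INDEX IX_Detail_DetailID_Active
ON dbo.Detail (DetailID)
WHERE IsDeleted = 0 AND IsEnabled = 1;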
Indexes will not be particularly useful unless the leftmost indexed columns are specified in JOIN and/or WHERE clauses. I suggest these indexes (unique when possible): dbo.Detail(GroupID), dbo.News(DetailID), dbo.Page_Content_Item(ContentID)
You can fine-tune these indexes with included columns and/or filters, but I suspect performance may be good enough without those measures.
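As DDL, those suggestions might look like this (a sketch; the index names are invented, and whether each can be UNIQUE depends on your data):

CREATE INDEX IX_Detail_GroupID ON dbo.Detail (GroupID);
CREATE INDEX IX_News_DetailID ON dbo.News (DetailID);
CREATE INDEX IX_PageContentItem_ContentID ON dbo.Page_Content_Item (ContentID);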
Note that primary keys are important. Not only is that a design best practice, a unique index will automatically be created on the key column(s), which can improve performance of joins to the table on related columns. Consider reviewing your model to ensure you have proper primary keys and relationships.

minimize time when updating table with left join

I have 2 tables like:
-table1: id_1, id_2, id_3, ref_id (id_1, id_2 is pk)
-table2: ref_id, id_4
I want the id_3 field to equal table2's id_4 (ref_id is the primary key of table2).
table1 has about 6 million records and table2 has about 2700 records.
I wrote a sql like:
update table1
set id_3 = b.id_4
from table1
left join table2 b on id_1 = b.ref_id
I am using SQL Server; the query has been running for about 16 hours with no response so far. How can I reduce the query time?
Sounds like it is indeed taking absurdly long, and the lack of indexes could be the cause. Without indexes, the database basically has to walk through all 2700 records in table2 for every single record in your 6M-row table.
So start by adding an index on ref_id (assuming the primary key doesn't already provide one) and also add an index on id_1.
To make things easier to monitor (in terms of progress), simply loop through the 2700 records in table2 and do an update per record (or per 10, 100, etc.) so you can update in parts and see how far it gets; see the sketch below.
Also, to make sure you don't do anything useless, I would recommend adding AND table1.id_3 <> table2.id_4 so you don't rewrite rows that already have the right value.
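A sketch combining the batching and the skip condition (assumes ref_id is an integer key and SQL Server 2008+ syntax; the batch size is arbitrary):

DECLARE @from INT = 0, @step INT = 100, @max INT;
SELECT @max = MAX(ref_id) FROM table2;
WHILE @from < @max
BEGIN
    UPDATE t1
    SET    id_3 = t2.id_4
    FROM   table1 t1
    JOIN   table2 t2 ON t1.id_1 = t2.ref_id
    WHERE  t2.ref_id > @from
      AND  t2.ref_id <= @from + @step
      AND  (t1.id_3 <> t2.id_4 OR t1.id_3 IS NULL); -- skip rows that are already correct
    SET @from = @from + @step;
END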
Updating every row in a 6-million row table is likely to be slow regardless.
One way to get a benchmark for the maximum speed of updating every row would be to just time the query:
update table1
set id_3 = 100
Also, do you need to update rows in table1 that have no matching row in table2? If not, switching the left outer join to an inner join would greatly improve performance.
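The inner-join form would be something like this (a sketch; it only touches rows that have a match in table2):

UPDATE t1
SET    id_3 = t2.id_4
FROM   table1 t1
INNER JOIN table2 t2 ON t1.id_1 = t2.ref_id;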
To answer this question we really need to know what the clustered indexes on the two tables are. I can make a suggestion for the clustered indexes to make this particular query fast, however, other factors should really be considered when choosing clustered indexes.
So with that in mind, see if these indexes help:
table1: UNIQUE CLUSTERED INDEX on (id_1, id_2)
table2: UNIQUE CLUSTERED INDEX on (ref_id)
Basically make your PKs clustered if they are not already.
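A sketch of that DDL, assuming the primary keys are not already declared (constraint names are invented):

ALTER TABLE table1 ADD CONSTRAINT PK_table1 PRIMARY KEY CLUSTERED (id_1, id_2);
ALTER TABLE table2 ADD CONSTRAINT PK_table2 PRIMARY KEY CLUSTERED (ref_id);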
The other important thing is whether the tables are seeing other traffic while you are running this update. If so, the long runtime may be due to blocking. In this case you should consider batching, i.e. updating only a small portion at a time instead of all in single statement.

Why does Oracle still do a full table scan when the table is indexed?

I have a table MSATTRIBUTE with 3000K rows. I used the following query to retrieve data. The query gets different execution plans with the same DB data in different environments: in one environment it does a full table scan, so the query is very slow, but in another environment it uses index scans and performs quite well. Why does it do a full table scan in one environment even though I built indexes for the table? How do I get the index scan I saw in the other environment, and how can I improve this query?
Without understanding way more than I care to know about your data model and your business, it's hard to give concrete positive advice. But here are some notes about your indexing strategy and why I would guess the optimizer is not using the indexes you have.
In the sub-query the access path to REDLINE_MSATTRIBUTE drives from three columns:
CLASS
OBJECT_ID
CHANGE_RELEASE_DATE.
CLASS is not indexed, but that is presumably not very selective anyway. OBJECT_ID is the leading column of a compound index, but the other columns are irrelevant to the sub-query.
But the biggest problem is CHANGE_RELEASE_DATE. This is not indexed at all, which is bad news: your one primary-key lookup produces a date which is then compared with CHANGE_RELEASE_DATE, and if a column is not indexed the database has to read the table to get its values.
The main query drives off
ATTID
CHANGE_ID
OBJECT_ID (again)
CHANGE_RELEASE_DATE (again)
CLASS (again)
OLD_VALUE
ATTID is indexed, but how selective is that index? The optimizer probably doesn't think it's very selective. ATTID is also in a compound index with CHANGE_ID and OLD_VALUE, but none of them are the leading columns, so that's not very useful. And we've discussed CLASS, CHANGE_RELEASE_DATE and OBJECT_ID already.
The optimizer will only choose to use an index if it is cheaper (fewer reads) than a table scan. This usually means WHERE clause criteria need to map to the leading (i.e. leftmost) columns of an index. This could be the case with OBJECT_ID and ATTID in the sub-query except that
The execution plan would have to do an INDEX SKIP SCAN because REDLINE_MSATTRIBUTE_INDEX1 has CHANGE_ID between the two columns
The database has to go to the table anyway to get the CLASS and the CHANGE_RELEASE_DATE.
So, you might get some improvement by building an index on (CHANGE_RELEASE_DATE, CLASS, OBJECT_ID, ATTID). But as I said upfront, without knowing more about your situation these are just ill-informed guesses.
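In DDL that would be (a sketch; the index name is invented):

CREATE INDEX redline_msattr_ix ON REDLINE_MSATTRIBUTE (CHANGE_RELEASE_DATE, CLASS, OBJECT_ID, ATTID);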
If the rows are in a different order in the two tables then the indexes in the two systems can have different clustering factors, and hence difference estimated costs for index access. Check the table and index statistics, including the clustering factor, to see if there are significant differences.
Also, do either of the systems' explain plans mention dynamic sampling?
When Oracle has an index and decides to use or not use it, that might be because:
1) you may have a different setting for OPTIMIZER_MODE - make sure it's not on RBO.
2) the data is different - in this case Oracle might evaluate the query stats differently.
3) the data is the same but the stats are not up to date - in this case, gather stats:
exec dbms_stats.gather_table_stats(ownname => user, tabname => 'YOUR_3000K_TABLE', cascade => true);
4) there are a lot more reasons why Oracle will not use the index in one environment; I'd suggest comparing parameters (such as OPTIMIZER_INDEX_COST_ADJ, etc.)
One immediate issue is this piece: SELECT RELEASE_DATE FROM CHANGE WHERE ID = 136972355. This code runs for every row coming back, and it doesn't need to; a better way of doing this is using a single Cartesian-joined table so it runs only once and returns a static value to compare against.
Example 1:
Select * From Table1, (Select Sysdate As Compare From Dual) Table2 Where Table1.Date > Table2.Compare.
is always faster than
Select * From Table1 Where Date > Sysdate
because Sysdate is a dynamic function-based value that gets called for each row. The earlier example resolves once to a literal and is drastically faster, and I believe this is definitely one piece hurting your query and forcing a table scan.
I also believe this is a more efficient way to execute the query.
Select
REDLINE_MSATTRIBUTE.ATTID
,REDLINE_MSATTRIBUTE.VALUE
From
REDLINE_MSATTRIBUTE
,(
SELECT ATTID
,CHANGE_ID
,MIN(CHANGE_RELEASE_DATE) RELEASE_DATE
FROM REDLINE_MSATTRIBUTE
,(SELECT RELEASE_DATE FROM CHANGE WHERE ID = 136972355) T_COMPARE
WHERE CLASS = 9000
And OBJECT_ID = 32718015
And CHANGE_RELEASE_DATE > T_COMPARE.RELEASE_DATE
And ATTID IN (1564, 1565)
GROUP
BY ATTID,
CHANGE_ID
) T_DYNAMIC
Where
REDLINE_MSATTRIBUTE.ATTID = T_DYNAMIC.ATTID
And REDLINE_MSATTRIBUTE.CHANGE_ID = T_DYNAMIC.CHANGE_ID
And REDLINE_MSATTRIBUTE.CHANGE_RELEASE_DATE = T_DYNAMIC.RELEASE_DATE
And CLASS = 9000
And OBJECT_ID = 32718015
And OLD_VALUE ='Y'
Order
By REDLINE_MSATTRIBUTE.ATTID,
REDLINE_MSATTRIBUTE.VALUE;

Why does SQL Server sometimes choose an index scan over a bookmark lookup?

We have a straightforward table such as this:
OrderID primary key / clustered index
CustomerID foreign key / single-column non-clustered index
[a bunch more columns]
Then, we have a query such as this:
SELECT [a bunch of columns]
FROM Orders
WHERE CustomerID = 1234
We're finding that sometimes SQL Server 2008 R2 does a seek on the non-clustered index, and then a bookmark lookup on the clustered index (we like this - it's plenty fast).
But on other seemingly random occasions, SQL Server instead does a scan on the clustered index (very slow - brings our app to a crawl - and it seems to do this at the busiest hours of our day).
I know that we could (a) use an index hint, or (b) enhance our non-clustered index so that it covers our large set of selected columns. But (a) ties the logical to the physical, and regarding (b), I've read that an index shouldn't cover too many columns.
I would first love to hear any ideas why SQL Server is doing what it's doing. Also, any recommendations would be most appreciated. Thanks!
The selectivity of CustomerID will play some part in the query optimiser's decision. If, on one hand, it was unique, then an equality operation will yield at most one result, so a SEEK/LOOKUP operation is almost guaranteed. If, on the other hand, potentially hundreds or thousands of records will match a value for CustomerID, then a clustered-index scan may seem more attractive.
You'd be surprised how selective a filter has to be to preclude a scan. I can't find the article I originally pulled this figure from, but if CustomerID 1234 will match as little as 4% of the records in the table, a scan on the clustered index may be more efficient, or at least appear that way to the optimiser (which doesn't get it right 100% of the time).
It sounds at least plausible that the statistics kept on the non-clustered index on CustomerID are causing the optimiser to toggle between seek/scan based on the selectivity criteria.
You may be able to coax the optimiser towards use of the index by introducing a JOIN or EXISTS operation:
-- Be aware: this approach is untested
select o.*
from Orders o
inner join Customers c on o.CustomerID = c.CustomerID
where c.CustomerID = 1234;
Or:
-- Be aware: this approach is untested
select o.*
from Orders o
where exists (select 1
              from Customers c
              where c.CustomerID = 1234 and
                    o.CustomerID = c.CustomerID);
Also be aware that with this EXISTS approach, if you don't have an index on the "join" predicate (in this case, the CustomerID field) in both tables then you will end up with a nested loop which is painfully slow. Using inner joins seems much safer, but the EXISTS approach has its place from time to time when it can exploit indexes.
These are just suggestions; I can't say if they will be effective or not. Just something to try, or for a resident expert to confirm or deny.
You should make your index a covered index so that the bookmark lookup is not required. This is the potentially expensive operation which may be causing the query optimiser to ignore your index.
If you are using SQL Server 2005 or above, you can add them as included columns, otherwise you would have to add them as additional key columns.
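For example (a sketch; the index name is invented and the INCLUDE list is a stand-in for whichever columns your SELECT actually needs):

CREATE NONCLUSTERED INDEX IX_Orders_CustomerID_Covering
ON Orders (CustomerID)
INCLUDE (OrderDate, ShipCity); -- replace with your selected columns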
A covered index always performs better than a noncovered index, particularly for nonselective queries.

what is the fastest way of getting table record count with condition on SQL Server

As per the subject, I am looking for a fast way to count records in a table with a WHERE condition, without a table scan.
There are different methods; the most reliable one is
Select count(*) from table_name
But other than that, you can also use one of the following:
select sum(1) from table_name
select count(1) from table_name
select rows from sysindexes where object_name(id)='table_name' and indid<2
exec sp_spaceused 'table_name'
DBCC CHECKTABLE('table_name')
The last two need sysindexes to be up to date; run the following to achieve this. If you don't update them, it's highly likely they'll give you wrong results, but as an approximation they might still work.
DBCC UPDATEUSAGE ('database_name','table_name') WITH COUNT_ROWS
EDIT: Sorry, I did not read the part about counting with a WHERE clause. I agree with Cruachan; the solution for your problem is proper indexes.
The following page lists four methods of getting the number of rows in a table, with commentary on accuracy and speed:
http://blogs.msdn.com/b/martijnh/archive/2010/07/15/sql-server-how-to-quickly-retrieve-accurate-row-count-for-table.aspx
This is the one Management Studio uses:
SELECT CAST(p.rows AS float)
FROM sys.tables AS tbl
INNER JOIN sys.indexes AS idx ON idx.object_id = tbl.object_id and idx.index_id < 2
INNER JOIN sys.partitions AS p ON p.object_id=CAST(tbl.object_id AS int)
AND p.index_id=idx.index_id
WHERE ((tbl.name=N'Transactions'
AND SCHEMA_NAME(tbl.schema_id)='dbo'))
Simply, ensure that your table is correctly indexed for the where condition.
If you're concerned about this sort of performance, the approach is to create indexes that incorporate the field in question. For example, if your table has a primary key of foo and fields bar, parrot and shrubbery, and you know you're going to regularly pull back records using a condition on shrubbery that only needs data from that field, you should set up a compound index of [shrubbery, foo]. This way the RDBMS only has to query the index and not the table. Indexes, being tree structures, are far faster to query than the table itself.
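For example (a sketch; the table name is invented to match the example fields):

CREATE INDEX IX_mytable_shrubbery ON mytable (shrubbery, foo);

-- this count can now be answered from the index alone, without touching the table:
SELECT COUNT(*) FROM mytable WHERE shrubbery = 'knee-high';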
How much actual activity the RDBMS needs depends on the RDBMS itself and precisely what information it puts into the index. For example, a SELECT COUNT(*) on an unindexed table without a WHERE condition will, on most RDBMSs, return instantly, because the record count is held at the table level and a table scan is not required. Analogous considerations may hold for index access.
Be aware that indexes do carry a maintenance overhead in that if you update a field the rdbms has to update all indexes containing that field too. This may or may not be a critical consideration, but it's not uncommon to see tables where most activity is read and insert/update/delete activity is of lesser importance which are heavily indexed on various combinations of table fields such that most queries will just use the indexes and not touch the actual table data itself.
ADDED: If you are using indexed access on a table that does have significant IUD activity, then make sure you schedule regular maintenance. Tree structures, i.e. indexes, are most efficient when balanced, and with significant IUD activity periodic maintenance is needed to keep them that way.
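On SQL Server, that maintenance might look like this (a sketch; the table name is invented):

ALTER INDEX ALL ON mytable REORGANIZE; -- lightweight defragmentation
-- or, when fragmentation is heavy:
ALTER INDEX ALL ON mytable REBUILD;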
