Minimize time when updating a table with LEFT JOIN - SQL Server

I have two tables:
- table1: id_1, id_2, id_3, ref_id (id_1, id_2 is the composite PK)
- table2: ref_id, id_4 (ref_id is the PK)
I want table1's id_3 field to be set equal to table2's id_4.
table1 has about 6 million records and table2 has about 2,700 records.
I wrote a SQL statement like this:
update a
set a.id_3 = b.id_4
from table1 a
left join table2 b on a.id_1 = b.ref_id
On SQL Server this query has been running for about 16 hours with no result. How can I reduce the query time?

It does sound like it's taking absurdly long, and missing indexes could well be the cause. Without indexes, the database essentially has to walk through all 2,700 records of table2 for every single record in your 6-million-row table.
So start by adding an index on table2.ref_id (assuming the primary key isn't already backed by an index), and also add an index on table1.id_1.
To make progress easier to monitor, simply loop through the 2,700 records in table2 and do one update per record (or per 10, 100, etc.), so the work happens in parts and you can see how far it gets; see the sketch below.
Also, to make sure you don't do useless work, I would recommend adding and table1.id_3 <> table2.id_4 to the condition so rows that already have the right value are skipped.
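A minimal sketch of that per-record loop using a cursor (untested; the column types are assumptions):
-- Loop over table2 and update the matching table1 rows one ref_id at a time.
DECLARE @ref_id int, @id_4 int;  -- types are assumptions
DECLARE c CURSOR FOR SELECT ref_id, id_4 FROM table2;
OPEN c;
FETCH NEXT FROM c INTO @ref_id, @id_4;
WHILE @@FETCH_STATUS = 0
BEGIN
    UPDATE table1
    SET id_3 = @id_4
    WHERE id_1 = @ref_id
      AND (id_3 <> @id_4 OR id_3 IS NULL);  -- skip rows that are already correct
    PRINT CONCAT('done ref_id ', @ref_id);  -- crude progress indicator
    FETCH NEXT FROM c INTO @ref_id, @id_4;
END;
CLOSE c;
DEALLOCATE c;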

Updating every row in a 6-million row table is likely to be slow regardless.
One way to get a benchmark for the maximum speed of updating every row would be to just time the query:
update table1
set id_3 = 100
Also, do you need to update rows in table1 that have no matching row in table2? If not, switching the left outer join to an inner join could greatly improve performance.
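A sketch of the inner-join version (it assumes rows without a match in table2 should simply keep their current id_3):
update a
set a.id_3 = b.id_4
from table1 a
inner join table2 b on a.id_1 = b.ref_id;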

To answer this question we really need to know what the clustered indexes on the two tables are. I can make a suggestion for the clustered indexes to make this particular query fast, however, other factors should really be considered when choosing clustered indexes.
So with that in mind, see if these indexes help:
table1: UNIQUE CLUSTERED INDEX on (id_1, id_2)
table2: UNIQUE CLUSTERED INDEX on (ref_id)
Basically make your PKs clustered if they are not already.
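A sketch of the DDL, assuming the constraints don't exist yet (an existing nonclustered PK would first have to be dropped):
ALTER TABLE table1 ADD CONSTRAINT PK_table1 PRIMARY KEY CLUSTERED (id_1, id_2);
ALTER TABLE table2 ADD CONSTRAINT PK_table2 PRIMARY KEY CLUSTERED (ref_id);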
The other important thing is whether the tables are seeing other traffic while you are running this update. If so, the long runtime may be due to blocking. In that case you should consider batching, i.e. updating only a small portion at a time instead of everything in a single statement.
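A minimal batching sketch (the 50,000 batch size is an arbitrary assumption to tune, and it assumes table2.id_4 is never NULL):
WHILE 1 = 1
BEGIN
    UPDATE TOP (50000) a
    SET a.id_3 = b.id_4
    FROM table1 a
    INNER JOIN table2 b ON a.id_1 = b.ref_id
    WHERE a.id_3 <> b.id_4 OR a.id_3 IS NULL;  -- only touch rows that still need the change

    IF @@ROWCOUNT = 0 BREAK;  -- stop once a pass updates nothing
END;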


Postgres lookup tables over clustered data

Background
This is a simplified version of the Postgres database I am managing:
TableA: id,name
TableB: id,id_a,prop1,prop2
This database has a peculiarity: when I select data, I only consider rows of TableB that have the same id_a. So I am never interested in selecting data from TableB with mixed values of id_a. Therefore, queries are always of this kind:
SELECT something FROM TableB INNER JOIN TableA ON TableA.id=id_a
Some time ago, the number of rows in TableA grew to 20,000 and TableB grew to 10^7 rows.
As a first step to speed up queries, I added a B-tree index on one of TableB's properties, something like the following:
"my_index" btree (prop1)
The problem
Now I have to insert new data, and the database will grow to more than double its current size. Inserting data into TableB has become too slow.
I understand that the slowness comes from updating my_index:
when I add a new row to TableB, the database has to reorder the my_index structure.
I feel this could be sped up if my_index did not cover all elements.
After all, I do not need a new row with a given id_a value to be sorted together with rows that have a different id_a value.
The question
How can I create an index on a table where the elements are ordered only relative to rows that share a common property (e.g. a column called id_a)?
You can't.
The question that I would immediately ask you if you want such an index is: Yes, but for what values of id_a do you want the index? And your answer would be “for all of them”.
If you actually would want an index only for some values, you could use a partial index:
CREATE INDEX partidx ON tableb(prop1) WHERE id_a = 42;
But really you want an index for the whole table.
Besides, an INSERT would be just as slow, except when the inserted row doesn't satisfy the WHERE condition of your index.
There are three things you can do to speed up INSERT:
1. Run as many INSERT statements as possible in a single transaction, ideally all of them.
Then you don't have to pay the price of COMMIT after every single INSERT, and COMMITs are quite expensive: they require data to be written to the disk hardware (not just the cache), and that is incredibly slow (1 ms is a decent time).
You can speed this up even more by using prepared statements, so that the INSERT doesn't have to be parsed and planned every time.
2. Use the SQL command COPY to insert many rows. COPY is specifically designed for bulk data import and will be faster than individual INSERTs.
3. If COPY is too slow, usually because you need to insert a lot of data, the fastest way is to drop all indexes, insert the data with COPY, and then recreate the indexes. That can speed up the process by an order of magnitude, but of course the table is not fully usable while the indexes are dropped; see the sketch below.
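A minimal sketch of that drop-load-recreate approach, assuming the new rows live in a CSV file (the file path, column list, and format are assumptions):
BEGIN;
DROP INDEX my_index;                            -- avoid per-row index maintenance
COPY tableb (id, id_a, prop1, prop2)
    FROM '/tmp/new_tableb_rows.csv' WITH (FORMAT csv);
CREATE INDEX my_index ON tableb (prop1);        -- rebuild once, after the bulk load
COMMIT;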

Speeding up a SQL query with indexes

I have a table called Products.
This table contains over 3 million entries. Every day approximately 5,000 new entries are added, all during a two-minute window at night.
But the table gets queried every night, maybe over 20,000 times, with this query:
SELECT Price
FROM Products
WHERE Code = @code
AND Company = @company
AND CreatedDate = @createdDate
Table structure:
Code nvarchar(50)
Company nvarchar(10)
CreatedDate datetime
I can see that this query takes about a second to return a result from the Products table.
There is no productId column in the table as it is not needed. So there is no primary key in the table.
I would like to somehow improve this query to return the result faster.
I have never used indexes before. What would be the best way to use indexes on this table?
If I add a primary key, do you think it would speed up the query? Keep in mind that I will still have to query the table with the same three parameters:
WHERE Code = @code
AND Company = @company
AND CreatedDate = @createdDate
This is mandatory.
As I mentioned, the table gets its new entries during a two-minute window every night. How would inserts affect the indexes?
If I use indexes, which columns would be best to use, and should the index be clustered or non-clustered?
The best thing to do would depend on what other fields the table has and what other queries run against that table.
Without more details, a non-clustered index on (code, company, createddate) that includes the price column will almost certainly improve performance:
CREATE NONCLUSTERED INDEX IX_code_company_createddate
ON Products(code, company, createddate)
INCLUDE (price);
That's because with that index in place, SQL Server will not need to touch the actual table at all when running the query: it can find all rows with a given (code, company, createddate) directly in the index, the index allows precisely this kind of fast access on its key fields, and it also carries the price value for each row.
Regarding the inserts: for each row added, SQL Server will have to maintain the index as well, so insert performance will be impacted. I think the gains in SELECT performance will outweigh the impact on the inserts, but you should test that.
Also, you will be using more space, as the index stores all those fields for each row in addition to the space used by the original table.
As others have noted in the comments, adding a PK to your table (even if that means adding a ProductId column you don't actually need) might be a good idea as well.
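If you go that route, a minimal sketch (the constraint name is an assumption; ProductId follows the column name mentioned in the question):
ALTER TABLE Products ADD ProductId int IDENTITY(1,1) NOT NULL;
ALTER TABLE Products ADD CONSTRAINT PK_Products PRIMARY KEY CLUSTERED (ProductId);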

Which execution plan has better performance?

(My English is not good enough, so bear with me.)
I am working on optimizing this query:
SELECT (SELECT Title FROM Groups WHERE GroupID = WWPMD.GroupID) AS GroupName
FROM dbo.Detail AS WWPMD
INNER JOIN dbo.News AS WWPMN ON WWPMD.DetailID = WWPMN.NewsID
WHERE WWPMD.IsDeleted = 0
  AND WWPMD.IsEnabled = 1
  AND WWPMD.GroupID IN (
      SELECT ModuleID
      FROM Page_Content_Item
      WHERE ContentID = 'a6780e80-4ead4e62-9b22-1190bb88ed90')
In this case, the tables have clustered indexes on their primary keys, which are GUIDs. In the execution plan the arrows were a little thick, the cost of the clustered index seek on the Detail table was 87%, and there was no cost for a key lookup.
Then I changed the indexes on the Detail table: I put the clustered index on a datetime column and created three nonclustered indexes on the PK and the FKs. Now in the execution plan the index seek cost for Detail is 4 percent and the key lookup is 80 percent, with thin arrows.
I want to know which execution plan is better, and what I can do to improve this query.
UPDATE:
Thank you all for your guidance. One more question: is 80% cost on a clustered index seek better, or 80% total cost split across a nonclustered index seek plus a key lookup?
IN is fine when selecting a small amount of data; if you are selecting more data you should use INNER JOIN, which performs better than IN on large data sets.
IN is faster than JOIN on DISTINCT.
EXISTS is more efficient than IN, because EXISTS only has to find one matching row.
A clustered index on a GUID column is not a good idea when your GUIDs are not sequential, since this leads to performance loss on inserts. The records in your table are physically ordered based on the clustered index, so the clustered index should be put on a column which has sequential values and doesn't change (often).
If you have a nonclustered index on GroupID (table Groups), then you could make Title an included column on that index; see MSDN for included columns.
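For example (a sketch; the index name is an assumption):
CREATE NONCLUSTERED INDEX IX_Groups_GroupID
ON Groups (GroupID)
INCLUDE (Title);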
I suggest using the following:
INNER JOIN (SELECT DISTINCT ModuleID
            FROM Page_Content_Item
            WHERE ContentID = 'a6780e80-4ead4e62-9b22-1190bb88ed90') z
    ON WWPMD.GroupID = z.ModuleID
instead of:
AND WWPMD.GroupID IN (
    SELECT ModuleID
    FROM Page_Content_Item
    WHERE ContentID = 'a6780e80-4ead4e62-9b22-1190bb88ed90')
Also, review the execution plan of this query; it seems that a filtered index on Detail.DetailID with the filter (IsDeleted = 0 AND IsEnabled = 1) would be very useful.
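Something along these lines (a sketch; the index name is an assumption):
CREATE NONCLUSTERED INDEX IX_Detail_Active
ON dbo.Detail (DetailID)
WHERE IsDeleted = 0 AND IsEnabled = 1;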
Indexes will not be particularly useful unless the leftmost indexed columns are specified in JOIN and/or WHERE clauses. I suggest these indexes (unique when possible): dbo.Detail(GroupID), dbo.News(DetailID), dbo.Page_Content_Item(ContentID).
You can fine-tune these indexes with included columns and/or filters, but I suspect performance may be good enough without those measures. A sketch of the basic versions follows.
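A sketch (the index names are assumptions; the key columns follow the suggestion above):
CREATE INDEX IX_Detail_GroupID ON dbo.Detail (GroupID);
CREATE INDEX IX_News_DetailID ON dbo.News (DetailID);
CREATE INDEX IX_PageContentItem_ContentID ON dbo.Page_Content_Item (ContentID);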
Note that primary keys are important. Not only is a primary key a design best practice, but a unique index is automatically created on the key column(s), which can improve the performance of joins to the table on related columns. Consider reviewing your model to ensure you have proper primary keys and relationships.

Why does SQL Server sometimes choose an index scan over a bookmark lookup?

We have a straightforward table such as this:
OrderID primary key / clustered index
CustomerID foreign key / single-column non-clustered index
[a bunch more columns]
Then, we have a query such as this:
SELECT [a bunch of columns]
FROM Orders
WHERE CustomerID = 1234
We're finding that sometimes SQL Server 2008 R2 does a seek on the non-clustered index, and then a bookmark lookup on the clustered index (we like this - it's plenty fast).
But on other seemingly random occasions, SQL Server instead does a scan on the clustered index (very slow - brings our app to a crawl - and it seems to do this at the busiest hours of our day).
I know that we could (a) use an index hint, or (b) enhance our non-clustered index so that it covers our large set of selected columns. But (a) ties the logical to the physical, and regarding (b), I've read that an index shouldn't cover too many columns.
I would first love to hear any ideas why SQL Server is doing what it's doing. Also, any recommendations would be most appreciated. Thanks!
The selectivity of CustomerID will play some part in the query optimiser's decision. If, on one hand, it was unique, then an equality operation will yield at most one result, so a SEEK/LOOKUP operation is almost guaranteed. If, on the other hand, potentially hundreds or thousands of records will match a value for CustomerID, then a clustered-index scan may seem more attractive.
You'd be surprised how selective a filter has to be to preclude a scan. I can't find the article I originally pulled this figure from, but if CustomerID 1234 will match as little as 4% of the records in the table, a scan on the clustered index may be more efficient, or at least appear that way to the optimiser (which doesn't get it right 100% of the time).
It sounds at least plausible that the statistics kept on the non-clustered index on CustomerID are causing the optimiser to toggle between seek and scan based on the selectivity criteria.
You may be able to coax the optimiser towards use of the index by introducing a JOIN or EXISTS operation:
-- Be aware: this approach is untested
select o.*
from Orders o
inner join Customers c on o.CustomerID = c.CustomerID
where c.CustomerID = 1234;
Or:
-- Be aware: this approach is untested
select o.*
from Orders o
where exists (select 1
              from Customers c
              where c.CustomerID = 1234
                and o.CustomerID = c.CustomerID);
Also be aware that with this EXISTS approach, if you don't have an index on the "join" predicate (in this case, the CustomerID field) in both tables then you will end up with a nested loop which is painfully slow. Using inner joins seems much safer, but the EXISTS approach has its place from time to time when it can exploit indexes.
These are just suggestions; I can't say if they will be effective or not. Just something to try, or for a resident expert to confirm or deny.
You should make your index a covering index so that the bookmark lookup is not required. The lookup is the potentially expensive operation that may be causing the query optimiser to ignore your index.
If you are using SQL Server 2005 or above, you can add them as included columns, otherwise you would have to add them as additional key columns.
A covering index generally performs better than a non-covering one, particularly for nonselective queries.
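For example, a sketch in which OrderDate and TotalAmount are hypothetical stand-ins for the columns your query actually selects:
CREATE NONCLUSTERED INDEX IX_Orders_CustomerID_Covering
ON Orders (CustomerID)
INCLUDE (OrderDate, TotalAmount);  -- replace with the columns your SELECT returns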

How to Speed Up Simple Join

I am no good at SQL.
I am looking for a way to speed up a simple join like this:
SELECT
    E.expressionID,
    A.attributeName,
    A.attributeValue
FROM attributes A
JOIN expressions E ON E.attributeId = A.attributeId
I am running this tens of thousands of times, and it takes longer and longer as the tables get bigger.
I am thinking about indexes. If I were speeding up selects on the single tables, I'd probably put a nonclustered index on expressionID for the expressions table and another on (attributeName, attributeValue) for the attributes table, but I don't know how that applies to the join.
EDIT: I already have a clustered index on (expressionId, attributeId) (the PK, with attributeId also an FK) on the expressions table, and another clustered index on attributeId (PK) on the attributes table.
I've seen this question but I am asking for something more general and probably far simpler.
Any help appreciated!
You definitely want to have indexes on attributeID on both the attributes and expressions table. If you don't currently have those indexes in place, I think you'll see a big speedup.
In fact, because there are so few columns being returned, I would consider a covering index for this query, i.e. an index that includes all the fields in the query.
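For example (a sketch; the index names are assumptions):
CREATE NONCLUSTERED INDEX IX_attributes_attributeId
ON attributes (attributeId)
INCLUDE (attributeName, attributeValue);

CREATE NONCLUSTERED INDEX IX_expressions_attributeId
ON expressions (attributeId)
INCLUDE (expressionID);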
Some things you need to care about are indexes, the query plan and statistics.
Put indexes on attributeId. Or, make sure indexes exist where attributeId is the first column in the key (SQL Server can still use an index if it's not the first column, but it's not as fast).
Highlight the query in Query Analyzer and hit Ctrl+L to see the plan, which shows how the tables are joined together. Almost always, using indexes is better than not (there are fringe cases where, if a table is small enough, indexes can slow you down, but for now just be aware that 99% of the time indexes are good).
Pay attention to the order in which tables are joined. SQL Server maintains statistics on table sizes and will determine which one is better to join first. Do some investigation on the SQL Server procedures for updating statistics -- it's been too long since I used them for me to have that info handy.
That should get you started. Really, an entire chapter can be written on how a database can optimize even such a simple query.
I bet your problem is the huge number of rows being inserted into that temp table. Is there any way you can add a WHERE clause so you don't SELECT every row in the database?
Another thing to do is add some indexes like this:
attributes.{attributeId, attributeName, attributeValue}
expressions.{attributeId, expressionID}
This is hacky! But useful as a last resort.
What this does is create a query plan that can be answered entirely from the indexes. Usually an index costs a double I/O in a query like yours: one to hit the index (i.e. probe into it), another to fetch the actual row the index entry refers to (to pull attributeName, etc.).
This is especially helpful if attributes or expressions is a wide table, i.e. a table whose rows are expensive to fetch.
Finally, the best way to speed your query is to add a WHERE clause!
If I'm understanding your schema correctly, you're stating that your tables kinda look like this:
Expressions: PK - ExpressionID, AttributeID
Attributes: PK - AttributeID
Assuming that each PK is a clustered index, that still means an index scan is required on the Expressions table. You might want to consider creating an index on the Expressions table such as (AttributeID, ExpressionID); this would help stop the index scanning that currently occurs.
Tips, if you want to speed up your query using joins:
For an inner join,
don't put the filter in the WHERE condition; use it in the ON condition instead.
E.g. instead of:
Eg:
select a.id, a.name
from table1 a
join table2 b on a.name = b.name
where a.id = '123'
try:
select a.id, a.name
from table1 a
join table2 b on a.name = b.name and a.id = '123'
For "Left/Right Join",
Don't use in "ON" condition, Because if you use left/right join it will get all rows for any one table.So, No use of using it in "On". So, Try to use "Where" condition
