What is the difference between Lookup, Scan and Seek? - sql-server

So I found this query
SELECT MAX(us.[last_user_lookup]) AS [last_user_lookup],
       MAX(us.[last_user_scan]) AS [last_user_scan],
       MAX(us.[last_user_seek]) AS [last_user_seek]
FROM sys.dm_db_index_usage_stats AS us
WHERE us.[database_id] = DB_ID() AND us.[object_id] = OBJECT_ID('tblName')
GROUP BY us.[database_id], us.[object_id];
When I look up the documentation on sys.dm_db_index_usage_stats, all it says is
last_user_seek datetime Time of last user seek.
last_user_scan datetime Time of last user scan.
last_user_lookup datetime Time of last user lookup.
...
Every individual seek, scan, lookup, or update on the specified index by one query execution is counted as a use of that index and increments the corresponding counter in this view. Information is reported both for operations caused by user-submitted queries, and for operations caused by internally generated queries, such as scans for gathering statistics.
Now I understand that when I run the query it takes the highest time from each of those three fields, since sys.dm_db_index_usage_stats can contain duplicate database_id and object_id combinations where one or more of the fields may also be NULL (so you can't just do a SELECT TOP 1 ... ORDER BY last_user_seek, last_user_scan, last_user_lookup DESC, otherwise you potentially miss data). But when I run it I get values like
NULL | 2017-05-15 08:56:29.260 | 2017-05-15 08:54:02.510
but I don't understand what the user has done with the table which is represented by these values.
So what is the difference between Lookup, Scan and Seek?

Basic difference between these operations:
Let's consider that you have two tables, TableA and TableB. Both contain more than 1,000,000 rows, and both have a clustered index on the Id column. TableB also has a nonclustered index on the code column. (Remember that a nonclustered index always points at the pages of the clustered one...)
seek:
Let's consider that you want only one record from TableA, whose clustered index is on column Id.
The query would be:
SELECT Name
FROM TableA
WHERE Id = 1
Your result contains fewer than 15% of the full data set (the threshold is somewhere between 10 and 20 percent, depending on the situation)... SQL Server performs an index seek in this scenario (the optimizer has found a useful index to retrieve the data).
scan:
If, for example, your query needs more than 15% of the data from TableA, then it is necessary to scan the whole index to satisfy the query.
Let's consider that TableB has TableA's Id column as a foreign key (TableAId), and that TableB contains all Ids from TableA. The query could be:
SELECT a.Id
FROM TableA a
JOIN TableB b ON a.Id = b.TableAId
Or just
SELECT *
FROM TableA
SQL Server performs an index scan on TableA's index, because all data pages are needed to satisfy the query...
lookup:
Let's consider that TableB has a column dim in addition to the column code, with the nonclustered index on code (as mentioned above).
SQL Server uses a lookup when it needs to retrieve non-key data from the data pages and a nonclustered index is used to resolve the query.
For example, a key lookup could be used in a query like:
SELECT id, dim
FROM TableB
WHERE code = 'codeX'
You can eliminate the lookup with a covering index (INCLUDE dim in the nonclustered index).
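A minimal sketch of such a covering index (the index name here is made up):
CREATE NONCLUSTERED INDEX IX_TableB_code_covering -- hypothetical name
ON TableB (code)
INCLUDE (dim); -- dim is stored at the index leaf level, so no lookup into the clustered index is needed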

Related

Which is the fastest way to run this SQL query?

I have a table (let's call it A) in SQL Server 2016 that I want to query. I need to select only those rows that have a definitive status, so I need to exclude some rows. There's another table (B), containing the record id from table A and two columns, col1 and col2. If these columns are non-empty, the corresponding record can be considered final. There is a one-to-one relationship between tables A and B. Because these tables are rather large, I want to use the most efficient query. Which should I choose?
Query 1:
SELECT *
FROM TableA
WHERE record_id IN
(SELECT record_id FROM TableB WHERE col1 IS NOT NULL AND col2 IS NOT NULL)
Query 2:
SELECT a.*
FROM TableA a
INNER JOIN TableB b ON a.record_id = b.record_id
WHERE b.col1 IS NOT NULL AND b.col2 IS NOT NULL
Query 3:
SELECT a.*
FROM TableA a
INNER JOIN TableB b
ON a.record_id = b.record_id
AND b.col1 IS NOT NULL
AND b.col2 IS NOT NULL
Of course, if there's an even faster way that I hadn't thought of, please share. I'd also be very curious to know why one query is faster than the others.
WITH cte AS
(SELECT b.record_id, b.col1, b.col2
FROM TableB b
WHERE col1 IS NOT NULL
AND col2 IS NOT NULL) --if the fields are never NULL, it might be quicker to test <> ''
SELECT a.record_id, a.identifyColumnsNeededExplicitely
FROM cte
JOIN TableA a ON a.record_id = cte.record_id
ORDER BY a.record_id
In practice the execution plan will do whatever it likes, depending on your current indexes / clustered index / foreign keys / constraints / table statistics (i.e. number of rows, general content of your rows, ...). Any analysis should be done case by case, and what's true for 2 tables may not be true for 2 other tables.
Theoretically:
Without any index, the first one should be the best, since the optimizer will turn it into 1 table scan on TableB, 2 constant scans on TableB, and 1 table scan on TableA.
With a foreign key on TableA.record_id referencing TableB.record_id, OR an index on both columns, the second should be faster, since it will do an index scan and 2 constant scans.
In rare cases it could be the 3rd one, depending on TableB's stats, but not far from number 2, since number 3 will scan all of TableB.
In even rarer cases, none of the three.
What I'm trying to say is: since we have neither your tables nor your rows, open SQL Server Management Studio, turn the stats on, and try it yourself.
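For example, to compare the candidates yourself (a minimal sketch using standard T-SQL session settings):
SET STATISTICS IO ON;   -- report logical/physical reads per table
SET STATISTICS TIME ON; -- report parse/compile and execution times
-- run each candidate query here and compare the figures in the Messages tab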

How to optimize SQL Server Merge statement running with millions of records

I use SQL Server 2014 and need to populate a newly added datetime column in one table. There are two related tables (both have > 30 million records):
TableA:
CategoryID, itemID, dataCreated, deleted, some other string properties.
This table contains multiple records for each item, with different dateCreated values.
TableB:
CategoryID, itemID, LatestUpdatedDate (this is the newly added column)
Both CategoryID and itemID are part of an index on this table.
To update tableB's LatestUpdatedDate from table A on matched CategoryID and ItemID, I used the following merge statement:
merge [dbo].[TableB] with(HOLDLOCK) as t
using
(
select CategoryID,itemID, max(DateCreated) as LatestUpdatedDate
from dbo.TableA
where TableA.Deleted = 0
group by CategoryID,itemID
) as s on t.CategoryID = s.CategoryID and t.itemID = s.itemID
when matched then
update
set t.LatestUpdatedDate = s.LatestUpdatedDate
when not matched then
insert (CategoryID, itemID, LatestUpdatedDate)
values (s.CategoryID, s.itemID, s.LatestUpdatedDate);
Given that there are millions of records in both tables, how can I optimize this script? Or is there another way to update the table with better performance?
Note: this is a one-off script and the DB is live; in the future a trigger will be added to TableA on insert to keep the date in TableB up to date.
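A minimal sketch of what such a trigger might look like (the trigger name is made up, and multi-row inserts are handled with a GROUP BY over the inserted pseudo-table):
CREATE TRIGGER trg_TableA_AfterInsert ON dbo.TableA
AFTER INSERT
AS
BEGIN
    SET NOCOUNT ON;
    UPDATE b
    SET b.LatestUpdatedDate = i.MaxDateCreated
    FROM dbo.TableB AS b
    JOIN (SELECT CategoryID, itemID, MAX(DateCreated) AS MaxDateCreated
          FROM inserted
          WHERE Deleted = 0
          GROUP BY CategoryID, itemID) AS i
        ON b.CategoryID = i.CategoryID AND b.itemID = i.itemID
    WHERE b.LatestUpdatedDate IS NULL OR i.MaxDateCreated > b.LatestUpdatedDate;
END;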
As per Optimizing MERGE Statement Performance, the best you can do is:
Create an index on the join columns in the source table that is unique and covering.
Create a unique clustered index on the join columns in the target table.
You may get a performance improvement during the MERGE by creating an index on TableA on (Deleted, CategoryID, itemID) INCLUDE (DateCreated). However, since this is a one-off operation, the performance gains probably won't offset the resources (time, CPU, space) required to create that index, versus running the query as-is with your existing index.
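A sketch of that suggested index (the index name is made up):
CREATE NONCLUSTERED INDEX IX_TableA_Deleted_CategoryID_itemID -- hypothetical name
ON dbo.TableA (Deleted, CategoryID, itemID)
INCLUDE (DateCreated); -- covers the MAX(DateCreated) aggregate with no lookups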

Database Index when SQL statement includes "IN" clause

I have a SQL statement which takes a really long time to execute, and I really need to improve it somehow.
select * from table where ID = 1 and GROUP in
(select group from groupteam where department = 'marketing')
My question is: if I create an index on the columns ID and GROUP, would it help?
Or should I instead create an index on the second table's DEPARTMENT column?
Or should I create indexes on both tables?
The first table has 249,003 rows.
The second table has 900 rows in total, while the subquery against it returns only 2 rows.
That is why I am surprised that the response is so slow.
Thank you
You can also use EXISTS, depending on your database, like so:
select * from table t
where id = 1
and exists (
select 1 from groupteam
where department = 'marketing'
and group = t.group
)
Create a composite index or individual indexes on groupteam's department and group columns.
Create a composite index or individual indexes on table's ID and GROUP columns.
Do an EXPLAIN/ANALYZE (depending on your database) to review how the indexes are used by your database engine.
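A sketch of the composite indexes in T-SQL (the index names are made up; group needs brackets because it is a reserved word):
CREATE INDEX IX_groupteam_department_group ON groupteam (department, [group]);
CREATE INDEX IX_table_ID_GROUP ON [table] (ID, [GROUP]);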
Try a join instead:
select * from table t
JOIN groupteam gt
ON t.group = gt.group
where t.ID = 1 AND gt.department = 'marketing'
An index on table's ID and GROUP columns, and on groupteam's group column, would help too.

Does MS SQL Server automatically create a temp table if the query contains a lot of ids in the IN clause?

I have a big query to get multiple rows by id, like
SELECT *
FROM TABLE
WHERE Id in (1001..10000)
This query runs very slowly and ends up with a timeout exception.
A temporary fix is to limit the query by breaking it into 10 parts of 1,000 ids each.
I heard that using temp tables may help in this case, but it also looks like MS SQL Server does something like that automatically underneath.
What is the best way to handle problems like this?
You could write the query as follows using a temporary table:
CREATE TABLE #ids(Id INT NOT NULL PRIMARY KEY);
INSERT INTO #ids(Id) VALUES (1001),(1002),/*add your individual Ids here*/,(10000);
SELECT
t.*
FROM
[Table] AS t
INNER JOIN #ids AS ids ON
ids.Id=t.Id;
DROP TABLE #ids;
My guess is that it will probably run faster than your original query. The lookup can be done directly using an index (if one exists on the [Table].Id column).
Your original query translates to
SELECT *
FROM [TABLE]
WHERE Id=1001 OR Id=1002 OR /*...*/ OR Id=10000;
This would require evaluation of the expression Id=1001 OR Id=1002 OR /*...*/ OR Id=10000 for every row in [Table], which probably takes longer than with a temporary table. The example with a temporary table takes each Id in #ids and looks for a corresponding Id in [Table] using an index.
This all assumes that there are gaps in the Ids between 1001 and 10000. Otherwise it would be easier to write
SELECT *
FROM [TABLE]
WHERE Id BETWEEN 1001 AND 10000;
This would also require an index on [Table].Id to speed it up.
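If that index doesn't exist yet, a minimal sketch (the index name is made up):
CREATE INDEX IX_Table_Id ON [Table] (Id); -- supports both the join against #ids and the BETWEEN range scan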

Query optimization to avoid matching on a column with only a few distinct values

I have two tables in SQL Server:
System {
AUTO_ID -- identity, auto-increment, primary key
SYSTEM_GUID -- unique key, index created
}
File {
AUTO_ID -- identity, auto-increment, primary key
File_path
System_Guid -- foreign key referencing System.System_guid; an index is created
-- on this column
}
System table has 100,000 rows.
File table has 200,000 rows.
File table has only 10 distinct values for System_guid.
My Query is as follows:
Select * from
File left outer join System
on file.system_guid = system.system_guid
SQL Server is using a hash match join to produce the result, which is taking a long time.
I want to optimize this query to make it faster. The fact that there are only 10 distinct system_guid values probably means the hash match wastes effort. How can I use this knowledge to speed up the query?
When an indexed column has very few distinct values, the index loses its purpose. If all you want is to extract the records from System whose system_guid appears in File, then you may be better off (in your case) with a query like:
select * from System
where system_guid in (select distinct system_guid from File)
Is the LEFT join really necessary? How does the query perform as an INNER join? Do you get a different join operator?
I doubt hash join is much of a problem with this amount of I/O.
You could do a UNION like this... maybe coax a different plan out of it (note that both halves must return the same columns, so the unmatched File rows pad System's two columns with NULLs):
Select f.*, NULL as AUTO_ID, NULL as SYSTEM_GUID
from File f
WHERE f.System_Guid NOT IN (SELECT system_guid from System)
union all
Select f.*, s.*
from File f
inner join System s on f.system_guid = s.system_guid
