Exporting de-aggregated data - sql-server

I'm currently working on a data export feature for a survey application. We are using SQL Server 2008. We store data in a normalized format: QuestionId, RespondentId, Answer. We have a couple of other tables that define the question text for each QuestionId and demographics for each RespondentId...
Currently I'm using some dynamic SQL to generate a pivot that joins the question table to the answer table and creates an export; it's working... The problem is that it seems slow, and we don't have that much data (fewer than 50k respondents).
Right now I'm thinking "why am I 'paying' to de-aggregate the data for each query? Why don't I cache that?" The data being exported is based on dynamic criteria. It could be "give me respondents that completed on x date (or range)" or "people that like blue", etc. Because of that, I think I have to cache at the respondent level, find out which respondents are being exported, and then select their combined cached de-aggregated data.
To me the quick and dirty fix is a totally flat table: RespondentId, Question1, Question2, etc. The problem is that we have multiple clients, so that doesn't scale, AND I don't want to have to maintain the flattened table as the survey changes.
So I'm thinking about putting an XML column on the respondent table and caching the results of a SELECT * FROM Data WHERE RespondentId = x FOR XML AUTO. With that in place, I would then be able to get my export with filtering and XML calls into the XML column.
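Roughly what I have in mind, as a sketch only (I'm using FOR XML PATH here just for a predictable shape; the Respondents/Data names and the CompletedDate filter are illustrative, not our real schema):
-- Cache each respondent's answers as XML once...
ALTER TABLE Respondents ADD CachedAnswers XML NULL;
UPDATE r SET CachedAnswers =
    (SELECT d.QuestionId, d.Answer
     FROM Data AS d
     WHERE d.RespondentId = r.RespondentId
     FOR XML PATH('question'), ROOT('data'), TYPE)
FROM Respondents AS r;
-- ...then export by filtering on relational columns and shredding the XML per row.
SELECT r.RespondentId,
    r.CachedAnswers.value('(/data/question[QuestionId=1]/Answer)[1]', 'nvarchar(50)') AS [Why is there air?]
FROM Respondents AS r
WHERE r.CompletedDate >= '20100101'; -- example filter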
What are you doing to export aggregated data in a flattened format (CSV, Excel, etc.)? Does this approach seem OK? I worry about the cost of XML functions on larger result sets (think SELECT RespondentId, XmlCol.value('(//data/question_1)[1]', 'nvarchar(50)') AS [Why is there air?], XmlCol.RinseAndRepeat)...
Is there a better technology/approach for this?
Thanks!
EDIT: SQL Block for testing.
Run steps 1 & 2 to prime the data, test with step 3, clean up with step 4...
At a thousand respondents by one hundred questions, it already seems slower than I'd like.
SET NOCOUNT ON;
-- step 1 - create seed data
CREATE TABLE #Questions (QuestionId INT PRIMARY KEY IDENTITY (1,1), QuestionText VARCHAR(50));
CREATE TABLE #Respondents (RespondentId INT PRIMARY KEY IDENTITY (1,1), Name VARCHAR(50));
CREATE TABLE #Data (QuestionId INT NOT NULL, RespondentId INT NOT NULL, Answer INT);
DECLARE @QuestionTarget INT = 100
    ,@QuestionCount INT = 0
    ,@RespondentTarget INT = 1000
    ,@RespondentCount INT = 0
    ,@RespondentId INT;
WHILE @QuestionCount < @QuestionTarget BEGIN
    INSERT INTO #Questions(QuestionText) VALUES(CAST(NEWID() AS CHAR(36)));
    SET @QuestionCount = @QuestionCount + 1;
END;
WHILE @RespondentCount < @RespondentTarget BEGIN
    INSERT INTO #Respondents(Name) VALUES(CAST(NEWID() AS CHAR(36)));
    SET @RespondentId = SCOPE_IDENTITY();
    SET @QuestionCount = 1;
    WHILE @QuestionCount <= @QuestionTarget BEGIN
        INSERT INTO #Data(QuestionId, RespondentId, Answer)
        VALUES(@QuestionCount, @RespondentId, ROUND(((10 - 1 - 1) * RAND() + 1), 0));
        SET @QuestionCount = @QuestionCount + 1;
    END;
    SET @RespondentCount = @RespondentCount + 1;
END;
-- step 2 - index seed data
ALTER TABLE #Data ADD CONSTRAINT [PK_Data] PRIMARY KEY CLUSTERED (QuestionId ASC, RespondentId ASC);
CREATE INDEX DataRespondentQuestion ON #Data (RespondentId ASC, QuestionId ASC);
-- step 3 - query data
DECLARE @Columns NVARCHAR(MAX)
    ,@TemplateSql NVARCHAR(MAX)
    ,@RunSql NVARCHAR(MAX);
SELECT @Columns = STUFF(
(
    SELECT DISTINCT '],[' + q.QuestionText
    FROM #Questions AS q
    ORDER BY '],[' + q.QuestionText
    FOR XML PATH('')
), 1, 2, '') + ']';
SET @TemplateSql =
'SELECT *
FROM
(
    SELECT r.Name, q.QuestionText, d.Answer
    FROM #Respondents AS r
    INNER JOIN #Data AS d ON d.RespondentId = r.RespondentId
    INNER JOIN #Questions AS q ON q.QuestionId = d.QuestionId
) AS d
PIVOT
(
    MAX(d.Answer)
    FOR d.QuestionText
    IN (xxCOLUMNSxx)
) AS p;';
SET @RunSql = REPLACE(@TemplateSql, 'xxCOLUMNSxx', @Columns);
EXECUTE sys.sp_executesql @RunSql;
-- step 4 - clean up
DROP INDEX DataRespondentQuestion ON #Data;
DROP TABLE #Data;
DROP TABLE #Questions;
DROP TABLE #Respondents;

No, your approach does not seem OK. Keep your normalized data. If you have proper keys, the "cost" to de-aggregate will be minimal. To further optimize performance, stop using dynamic SQL. Write well-crafted queries and encapsulate them in stored procedures; this will allow SQL Server to cache the query plans instead of rebuilding them every time.
Before you do any of this, however, check the query plan. It is also possible that you are missing an index on at least one of the fields you are searching on, which will result in a full table scan. You may be able to drastically improve your performance with a few well-placed indexes.
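As a concrete (untested) example against the test script above, a covering index that also carries the Answer column would let the pivot's per-respondent reads come straight from one index, avoiding lookups into the clustered index:
-- Assumption: extends the existing DataRespondentQuestion index idea by covering Answer.
CREATE INDEX IX_Data_Respondent_Question_Covering
    ON #Data (RespondentId, QuestionId)
    INCLUDE (Answer);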

Related

How to Add a Set of Keys (UniqueIDs) to a Temp table to later INSERT into Production Table

I have the data ready to insert into my production table; however, the ID column is NULL and needs to be pre-populated with IDs prior to the insert. I have these IDs in another temp table... all I want is to simply apply these IDs to the records in my temp table.
For example... Say I have 10 records all simply needing IDs. I have in another temp table exactly 10 IDs... they simply need to be applied to my 10 records in my 'Ready to INSERT' Temp Table.
I worked in Oracle for about 9 years, and I would have done this simply by looping over my 'collection' using a FORALL loop... basically I would loop over my 'ready to INSERT' temp table and, for each row, apply the ID from my other 'collection'. In SQL Server I'm working with temp tables, NOT collections, and well... there's no FORALL loop or really any fancy loop in SQL Server other than WHILE.
My goal is to learn the appropriate method to accomplish this in SQL Server. I have learned that in the SQL Server world most DML operations are set-based, whereas when I worked in Oracle we handled data via arrays/collections and simply iterated through the data using CURSORs or loops. I've seen that in the SQL Server world, using CURSORs and/or iterating through data record by record is frowned upon.
Help me get my head out of the 'Oracle' space I was in for so long and into the 'SQL Server' space I need to be in. This has been a slight struggle.
The code below is how I've currently implemented this however it just seems convoluted.
SET NOCOUNT ON;
DECLARE @KeyValueNewMAX INT,
    @KeyValueINuse INT,
    @ClientID INT,
    @Count INT;
DROP TABLE IF EXISTS #InterOtherSourceData;
DROP TABLE IF EXISTS #InterOtherActual;
DROP TABLE IF EXISTS #InterOtherIDs;
CREATE TABLE #InterOtherSourceData -- Data stored here for DML until data is ready for INSERT
(
UniqueID INT IDENTITY( 1, 1 ),
NewIntOtherID INT,
ClientID INT
);
CREATE TABLE #InterOtherActual -- Prod Table where the data will be INSERTED Into
(
IntOtherID INT,
ClientID INT
);
CREATE TABLE #InterOtherIDs -- Store IDs needing to be applied to Data
(
UniqueID INT IDENTITY( 1, 1 ),
NewIntOtherID INT
);
BEGIN
/* TEST Create Fake Data and store it in temp table */
WITH fakeIntOtherRecs AS
(
SELECT 1001 AS ClientID, 'Jake' AS fName, 'Jilly' AS lName UNION ALL
SELECT 2002 AS ClientID, 'Jason' AS fName, 'Bateman' AS lName UNION ALL
SELECT 3003 AS ClientID, 'Brain' AS fName, 'Man' AS lName
)
INSERT INTO #InterOtherSourceData (ClientID)
SELECT fc.ClientID--, fc.fName, fc.lName
FROM fakeIntOtherRecs fc
;
/* END TEST Prep Fake Data */
/* Obtain count so we know how many IDs we need to create */
SELECT @Count = COUNT(*) FROM #InterOtherSourceData;
PRINT 'Count: ' + CAST(@Count AS VARCHAR);
/* For testing, set @KeyValueINuse to the max key currently in use by the table */
SELECT @KeyValueINuse = 13;
/* Using @Count, compute the new MAX ID... basically Existing_Key + SourceRecordCount = New_MaxKey */
SELECT @KeyValueNewMAX = @KeyValueINuse + @Count; /* STORE new MAX ID in variable */
/* Print both keys for testing purposes to review */
PRINT 'KeyValue Current: ' + CAST(@KeyValueINuse AS VARCHAR) + ' KeyValue Max: ' + CAST(@KeyValueNewMAX AS VARCHAR);
/* Using recursive CTE generate a fake table containing all of the IDs we want to INSERT into Prod Table */
WITH CTE AS
(
    SELECT (@KeyValueNewMAX - @Count) + 1 AS STARTMINID, @KeyValueNewMAX AS ENDMAXID UNION ALL
    /* SELECT FROM CTE to create recursion */
    SELECT STARTMINID + 1 AS STARTMINID, ENDMAXID FROM CTE
    WHERE (STARTMINID + 1) < (@KeyValueNewMAX + 1)
)
INSERT INTO #InterOtherIDs (NewIntOtherID)
SELECT c.STARTMINID AS NewIntOtherID
FROM CTE c
;
/* Apply New IDs : Using the IDENTITY fields on both Temp Tables I can JOIN the tables by the IDENTITY columns
| Is there a BETTER Way to do this?... like LOOP over each record rather than having to build up common IDs in both tables using IDENTITY columns?
*/
UPDATE o SET o.NewIntOtherID = oi.NewIntOtherID -- update the alias so the target is the same instance being joined
FROM #InterOtherSourceData o
JOIN #InterOtherIDs oi ON oi.UniqueID = o.UniqueID
;
/* View data that is ready for insert */
--SELECT *
--FROM #InterOtherSourceData
--;
/* INSERT DATA INTO PRODUCTION TABLE */
INSERT INTO #InterOtherActual (IntOtherID, ClientId)
SELECT NewIntOtherID, ClientID
FROM #InterOtherSourceData
;
SELECT * FROM #InterOtherActual;
END
To pre-generate key values in SQL Server use a sequence rather than an IDENTITY column.
eg
drop table if exists t
drop table if exists #t_stg
drop sequence if exists t_seq
go
create sequence t_seq start with 1 increment by 1
create table t(id int primary key default (next value for t_seq),a int, b int)
create table #t_stg(id int, a int, b int)
insert into #t_stg(a,b) values (1,2),(3,3),(4,5)
update #t_stg set id = next value for t_seq
--select * from #t_stg
insert into t(id,a,b)
select * from #t_stg
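Applied to the temp tables in the question, the whole ID-assignment dance collapses to something like this (a sketch; IntOther_seq is a hypothetical sequence backing the production key):
-- Draw one sequence value per row, then insert; no counting, no recursive CTE,
-- and no second IDENTITY table to line up.
UPDATE #InterOtherSourceData
SET NewIntOtherID = NEXT VALUE FOR IntOther_seq;
INSERT INTO #InterOtherActual (IntOtherID, ClientID)
SELECT NewIntOtherID, ClientID
FROM #InterOtherSourceData;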

Advanced T-SQL: How Should I Update Multiple Rows While Updating Multiple Columns?

I am looking for that performance 'sweet spot' when trying to update multiple columns for multiple rows...
Background.
I work in an MDM/abstract/hierarchical classification/functional SQL Server environment. This question pertains to a post data driven calculation process where I need to save the results. I pass JSON to a SQL function that will automatically create SQL for inserts/updates (and skip updates if the values match).
The tblDestination looks like
create table tblDestination
(
    sysPrimaryKey bigint identity(1,1)
    , sysHeaderId bigint -- udt_ForeignKey
    , sysLevel1Id bigint -- udt_ForeignKey (classification level 1)
    , strText nvarchar(100) -- from here down: the values that need to be updated
    , dtmDate datetime2(7)
    , numNumeric float
    , flgFlag bit
    , intInteger bigint
    , sysRefKey bigint -- ForeignKey
    , primary key nonclustered (sysPrimaryKey)
)
/* note that the clustered index on this table exists, contains more than the columns listed above, and is physically modeled correctly. you may use any clustered/IX/UI indexes that you need to if you are testing */
@JSON looks like... (the number of "ARBITRARY NAME" entries ranges between 2 and 100)
declare @JSON nvarchar(max) = '{"ARBITRARY NAME 1":"3/1/2017","ARBITRARY NAME 2": "Value", "ARBITRARY NAME 3": 45.3}'
The function cursors through the incoming @JSON and builds insert or update statements (abridged pseudocode):
declare cur cursor local static forward_only read_only
for
select [key] as JSON_Key, [value] as JSON_Value from openjson(@json);
-- loop body, repeated while @@FETCH_STATUS = 0:
    -- get the id for the ARBITRARY NAME plus a classification id
    select @level1Id = level1Id, @level2Id = level2Id
    from tblLevel1 where ProgrammingName = @JSON_Key;
    -- get the ProgrammingName field for the previously retrieved level2Id
    select @ProgrammingName = ProgrammingName -- INTEGER/FLAG/NUMERIC/TEXT/DATE/REFKEY
    from tblLevel2 where level2Id = @level2Id;
    -- clear variables
    set @numeric = null; set @integer = null; set @text = null; -- etc.
    -- check to see whether an insert or an update is required
    select @DestinationId = sysPrimaryKey from tblDestination
    where sysHeaderId = @header and sysLevel1Id = @Level1Id;
    if @DestinationId is null
    begin
        if @ProgrammingName = 'Numeric' set @Numeric = @JSON_Value;
        else if @ProgrammingName = 'Integer' set @Integer = @JSON_Value;
        -- etc.
    end
    -- dynamically build the updates here, e.g.:
    /*
    'update tblDestination
    set numNumeric = ' + @numeric
    + ', flgFlag = ' + @flag
    + ', dtmDate = ' + @date
    -- .. etc.
    + ' where sysHeaderId = ' + @header + ' and sysLevel1Id = ' + @Level1Id
    producing, for example:
    Update tblDestination
    Set numNumeric = NULL
    , flgFlag = NULL
    , dtmDate = '3/1/2017'
    Where sysPrimaryKey = 33676224
    */
Finally... to the point of this post: has anyone here had experience with multiple row updates on multiple columns?
Something like:
Update tblDestination
Set TableNumeric = CASE WHEN Level3Id = 33676224 THEN NULL
                        WHEN Level3Id = 33676225 THEN 3.2
                        WHEN Level3Id = 33676226 THEN NULL
                   END
  , TableDate = CASE WHEN Level3Id = 33676224 THEN '3/1/2017'
                     WHEN Level3Id = 33676225 THEN NULL
                     WHEN Level3Id = 33676226 THEN NULL
                END
Where headerId = 23897
  and Level3Id in (33676224, 33676225, 33676226)
I know that the speed varies for Insert statements (Number of Columns inserted vs Number of records), and have that part dialed in.
I am curious to know if anyone has found that 'sweet spot' for updates.
The sweet spot meaning:
How many CASES before I should make a new update block?
Is 'Update tbl set (Column = Case (When Id = ## then ColumnValue)^n END)^n' the proper approach to reduce the number of actual Updates being fired?
Is wrapping Updates in a transaction a faster option (and how many per COMMIT)?
Legibility of the update statement is irrelevant. Nobody will actually see the code.
I have isolated the single update statement chain to approximately 70%+ of the query cost in question (compared to all of the inserts, across 20/80, 50/50, and 80/20 update/insert mixes).
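For what it's worth, one common alternative to long CASE chains (sketched here against the table above; not benchmarked) is to join the target to an inline VALUES list, so each row picks up its own new values in a single UPDATE:
-- Each tuple carries (key, new numeric, new date, new flag). The explicit CASTs on the
-- first tuple pin the column types, since some columns are NULL in every other tuple.
UPDATE d
SET d.numNumeric = v.numNumeric
  , d.dtmDate = v.dtmDate
  , d.flgFlag = v.flgFlag
FROM tblDestination AS d
JOIN (VALUES
        (CAST(33676224 AS bigint), CAST(NULL AS float), CAST('2017-03-01' AS datetime2(7)), CAST(NULL AS bit)),
        (33676225, 3.2, NULL, NULL),
        (33676226, NULL, NULL, NULL)
     ) AS v (sysPrimaryKey, numNumeric, dtmDate, flgFlag)
  ON v.sysPrimaryKey = d.sysPrimaryKey;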

Speed up OFFSET in SQL Server 2014

I have a table with about 70,000,000 rows of phone numbers. I use OFFSET to read those numbers 50 by 50.
But it takes a long time (about 1 min).
There is a full-text index used for searching, but it does not help with the OFFSET.
How can I speed up my query?
SELECT *
FROM tblPhoneNumber
WHERE CountryID = @CountryID
ORDER BY ID
OFFSET ((@NumberCount - 1) * @PackageSize) ROWS
FETCH NEXT @PackageSize ROWS ONLY
Throw a sequence on that table, index it and fetch ranges by sequence. You could alternatively just use the ID column.
select *
FROM tblPhoneNumber
WHERE
    CountryID = @CountryID
    and Sequence between @NumberCount and (@NumberCount + @PackageSize - 1) -- inclusive, so subtract 1 to return exactly @PackageSize rows
If you're inserting/deleting frequently, this can leave gaps, so depending on the code that utilizes these batches of numbers, this might be a problem, but in general a few gaps here and there may not be a problem for you.
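Using the existing ID column, that alternative is usually called keyset (or seek-based) paging: each batch remembers the last ID it returned and seeks past it, so the cost stays flat no matter how deep you page. A sketch (assuming @LastID is carried between calls, 0 for the first batch):
SELECT TOP (@PackageSize) *
FROM tblPhoneNumber
WHERE CountryID = @CountryID
  AND ID > @LastID -- @LastID = highest ID from the previous batch
ORDER BY ID;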
Try using CROSS APPLY instead of OFFSET FETCH and do it all in one go. I grab TOP 2 to show you that you can grab any number of rows.
IF OBJECT_ID('tempdb..#tblPhoneNumber') IS NOT NULL
DROP TABLE #tblPhoneNumber;
IF OBJECT_ID('tempdb..#Country') IS NOT NULL
DROP TABLE #Country;
CREATE TABLE #tblPhoneNumber (ID INT, Country VARCHAR(100), PhoneNumber INT);
CREATE TABLE #Country (Country VARCHAR(100));
INSERT INTO #Country
VALUES ('USA'),('UK');
INSERT INTO #tblPhoneNumber
VALUES (1,'USA',11111),
(2,'USA',22222),
(3,'USA',33333),
(4,'UK',44444),
(5,'UK',55555),
(6,'UK',66666);
SELECT *
FROM #Country
CROSS APPLY(
    SELECT TOP (2) ID,Country,PhoneNumber --Just change to TOP(50) for your code
    FROM #tblPhoneNumber
    WHERE #Country.Country = #tblPhoneNumber.Country
    ORDER BY ID -- give TOP a deterministic order
) CA

SQL Update Query with from clause Optimization

I have two tables which are heavily queried by multiple users. On average, 100+ (update/select) requests per second are made against these tables.
Parent
Child
*GrandParent is not involved in the join, which is why I mention only two tables.
I need to reorder all children for each parent. There can be 3,000-4,000 parents, and each parent may have around the same number of children.
Column Types:
ParentID GUID
ChildIndex int
FileID Varchar
IsDeleted bit
The tables have a clustered index on the PK and non-clustered indexes on the columns used in the WHERE clause.
UPDATE C SET C.ChildIndex = T.ReOrderedChildIndex
FROM [Child] C
INNER JOIN
(
    SELECT ROW_NUMBER() OVER (PARTITION BY dbo.Child.[ParentID] ORDER BY [ChildIndex] ASC) AS ReOrderedChildIndex,
        dbo.Child.ChildIndex,
        dbo.Child.FileID,
        dbo.Child.ParentID
    FROM dbo.Child WITH (NOLOCK)
    INNER JOIN dbo.Parent WITH (NOLOCK) ON dbo.Child.ParentID = dbo.Parent.ParentID
    WHERE (dbo.Parent.GrandParentID = 1) AND (dbo.Child.IsDeleted = 0)
) T
ON C.FileID = T.FileID AND (C.ParentID = T.ParentID) AND (C.IsDeleted = 0)
It looks like the above query takes a long time and puts SELECT queries on wait, even though I have used WITH (NOLOCK) in all data-selection stored procedures.
There is another query that reorders parents in the same way as is done for children in the query above.
In Activity Monitor, the locks are shown for the SELECT stored procedures.
What is the best way to perform the reordering?
I am having the following issues and believe they stem from these queries:
1- Deadlocks occur randomly.
2- Connection pool timeouts often occur.
*The database is accessed by a Windows application using Entlib 4.0 with connection pooling enabled, pool max size 200.
SQL Server 2008 R2
I'd recommend restructuring your data into a more flexible schema. The schema below allows multiple levels, so you can merge GrandParent, Parent, and Child into one logical relationship table and one logical details table. You'll also be able to take advantage of indexes to reduce locks and improve performance.
You'll have to rebuild your hierarchy after any relationship changes. The script below is written to minimize the impact of that on your system: you will no longer be updating the entire table, just the pieces that have changed.
Schema:
CREATE TABLE dbo.EntityName
(
ID INT IDENTITY(1,1),
ParentID INT -- Todo: Add foreign key back to dbo.EntityName
-- Todo: Add primary key
);
GO
CREATE TABLE dbo.Hierarchy
(
ParentID INT, -- Todo: Add foreign key back to dbo.EntityName
ChildID INT, -- Todo: Add foreign key back to dbo.EntityName
ChildLevel INT
);
GO
Populate script (slightly rough around the edges):
CREATE PROCEDURE [dbo].[uspBuildHierarchy]
AS
BEGIN
SET NOCOUNT ON;
CREATE TABLE #Hierarchy
(
ParentID INT,
ChildID INT,
ChildLevel INT
);
-- Add the root of your hierarchy
INSERT INTO #Hierarchy VALUES (1, 1, 0);
DECLARE @ChildLevel INT = 1,
    @LastCount INT = 1;
WHILE (@LastCount > 0)
BEGIN
    INSERT INTO #Hierarchy
    SELECT
        E.ParentID,
        E.ID,
        @ChildLevel -- children of level (@ChildLevel - 1) parents land at @ChildLevel
    FROM dbo.EntityName E
    INNER JOIN #Hierarchy H ON H.ChildID = E.ParentID
        AND H.ChildLevel = (@ChildLevel - 1)
    LEFT JOIN #Hierarchy EH ON EH.ParentID = E.ParentID
        AND EH.ChildID = E.ID
    WHERE EH.ChildLevel IS NULL;
    SET @LastCount = @@ROWCOUNT;
    SET @ChildLevel = @ChildLevel + 1;
END
END
MERGE INTO dbo.Hierarchy OH
USING
(
SELECT
ParentID,
ChildID,
ChildLevel
FROM #Hierarchy
) NH
ON OH.ParentID = NH.ParentID
AND OH.ChildID = NH.ChildID
WHEN MATCHED AND OH.ChildLevel <> NH.ChildLevel THEN
UPDATE
SET ChildLevel = NH.ChildLevel
WHEN NOT MATCHED THEN
INSERT
VALUES
(
NH.ParentID,
NH.ChildID,
NH.ChildLevel
)
WHEN NOT MATCHED BY SOURCE
THEN DELETE;
END
GO
Query for all of an entity's children:
SELECT *
FROM dbo.EntityName E
INNER JOIN dbo.Hierarchy H ON H.ChildID = E.ID
AND H.ParentID = @EntityNameID;
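To actually get the index benefits mentioned above, dbo.Hierarchy needs supporting indexes. A plausible pair for these access patterns (my assumption, not part of the original script):
-- Clustered on (ParentID, ChildID): serves the subtree query and the MERGE matching.
CREATE CLUSTERED INDEX IX_Hierarchy_Parent_Child
    ON dbo.Hierarchy (ParentID, ChildID);
-- Nonclustered on ChildID: serves "walk up from a child" lookups.
CREATE NONCLUSTERED INDEX IX_Hierarchy_Child
    ON dbo.Hierarchy (ChildID) INCLUDE (ChildLevel);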

Why is SQL Server using index scan instead of index seek when WHERE clause contains parameterized values

We have found that SQL Server uses an index scan instead of an index seek if the WHERE clause contains parameterized values instead of string literals.
Following is an example:
SQL Server performs an index scan in the following case (parameters in the WHERE clause):
declare @val1 nvarchar(40), @val2 nvarchar(40);
set @val1 = 'val1';
set @val2 = 'val2';
select
    min(id)
from
    scor_inv_binaries
where
    col1 in (@val1, @val2)
group by
    col1
On the other hand, the following query performs an index seek:
select
min(id)
from
scor_inv_binaries
where
col1 in ('val1', 'val2')
group by
col1
Has anyone observed similar behavior, and how did you fix it to ensure that the query performs an index seek instead of an index scan?
We are not able to use the FORCESEEK table hint, because FORCESEEK is only supported from SQL Server 2008 onwards.
I have updated the statistics as well.
Thank you very much for your help.
Well, to answer your question of why SQL Server is doing this: the query is not compiled in a logical order; each statement is compiled on its own merit,
so when the query plan for your select statement is being generated, the optimiser does not know that @val1 and @val2 will become 'val1' and 'val2' respectively.
When SQL Server does not know the value, it has to make a best guess about how many times that variable will appear in the table, which can sometimes lead to sub-optimal plans. My main point is that the same query with different values can generate different plans. Imagine this simple example:
IF OBJECT_ID(N'tempdb..#T', 'U') IS NOT NULL
DROP TABLE #T;
CREATE TABLE #T (ID INT IDENTITY PRIMARY KEY, Val INT NOT NULL, Filler CHAR(1000) NULL);
INSERT #T (Val)
SELECT TOP 991 1
FROM sys.all_objects a
UNION ALL
SELECT TOP 9 ROW_NUMBER() OVER(ORDER BY a.object_id) + 1
FROM sys.all_objects a;
CREATE NONCLUSTERED INDEX IX_T__Val ON #T (Val);
All I have done here is create a simple table and add 1000 rows with values 1-10 for the column Val; however, 1 appears 991 times and the other 9 values appear only once each. The premise is this query:
SELECT COUNT(Filler)
FROM #T
WHERE Val = 1;
would be more efficient as a scan of the entire table than as an index seek followed by 991 bookmark lookups to get the value of Filler. With only 1 matching row, however, the following query:
SELECT COUNT(Filler)
FROM #T
WHERE Val = 2;
will be more efficient as an index seek and a single bookmark lookup to get the value of Filler (running these two queries will confirm this).
I am pretty certain the cut-off between a seek-plus-bookmark-lookup and a scan varies depending on the situation, but it is fairly low. Using the example table, with a bit of trial and error, I found that the Val column needed 38 rows with the value 2 before the optimiser went for a full table scan over an index seek and bookmark lookup:
IF OBJECT_ID(N'tempdb..#T', 'U') IS NOT NULL
DROP TABLE #T;
DECLARE @i INT = 38;
CREATE TABLE #T (ID INT IDENTITY PRIMARY KEY, Val INT NOT NULL, Filler CHAR(1000) NULL);
INSERT #T (Val)
SELECT TOP (991 - @i) 1
FROM sys.all_objects a
UNION ALL
SELECT TOP (@i) 2
FROM sys.all_objects a
UNION ALL
SELECT TOP 8 ROW_NUMBER() OVER(ORDER BY a.object_id) + 2
FROM sys.all_objects a;
CREATE NONCLUSTERED INDEX IX_T__Val ON #T (Val);
SELECT COUNT(Filler), COUNT(*)
FROM #T
WHERE Val = 2;
So for this example the limit is 3.7% of matching rows.
Since the query does not know how many rows will match when you use a variable, it has to guess; the simplest way is to take the total number of rows and divide it by the number of distinct values in the column, so in this example the estimated number of rows for WHERE Val = @Val is 1000 / 10 = 100. The actual algorithm is more complex than this, but for example's sake this will do. So when we look at the execution plan for:
DECLARE @i INT = 2;
SELECT COUNT(Filler)
FROM #T
WHERE Val = @i;
We can see here (with the original data) that the estimated number of rows is 100, but the actual rows is 1. From the previous steps we know that with more than 38 rows the optimiser will opt for a clustered index scan over an index seek, so since the best guess for the number of rows is higher than this, the plan for an unknown variable is a clustered index scan.
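If you want to see those two numbers for yourself, capturing the actual execution plan shows both; for example (a generic technique, nothing specific to this table):
-- Returns showplan XML alongside the results; the plan records the estimated
-- row count (100 here) and the actual row count (1 here) for the scan operator.
SET STATISTICS XML ON;
DECLARE @i INT = 2;
SELECT COUNT(Filler) FROM #T WHERE Val = @i;
SET STATISTICS XML OFF;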
Just to further prove the theory, if we create the table with 1000 rows of the numbers 1-27 evenly distributed (so the estimated row count will be approximately 1000 / 27 = 37.037):
IF OBJECT_ID(N'tempdb..#T', 'U') IS NOT NULL
DROP TABLE #T;
CREATE TABLE #T (ID INT IDENTITY PRIMARY KEY, Val INT NOT NULL, Filler CHAR(1000) NULL);
INSERT #T (Val)
SELECT TOP 27 ROW_NUMBER() OVER(ORDER BY a.object_id)
FROM sys.all_objects a;
INSERT #T (val)
SELECT TOP 973 t1.Val
FROM #T AS t1
CROSS JOIN #T AS t2
CROSS JOIN #T AS t3
ORDER BY t2.Val, t3.Val;
CREATE NONCLUSTERED INDEX IX_T__Val ON #T (Val);
Then, running the query again, we get a plan with an index seek:
DECLARE @i INT = 2;
SELECT COUNT(Filler)
FROM #T
WHERE Val = @i;
So hopefully that pretty comprehensively covers why you get that plan. Now I suppose the next question is how you force a different plan, and the answer is to use the query hint OPTION (RECOMPILE), which forces the query to compile at execution time, when the value of the parameter is known. Reverting to the original data, where the best plan for Val = 2 is a seek with a lookup, but using a variable yields a plan with an index scan, we can run:
DECLARE @i INT = 2;
SELECT COUNT(Filler)
FROM #T
WHERE Val = @i;
GO
DECLARE @i INT = 2;
SELECT COUNT(Filler)
FROM #T
WHERE Val = @i
OPTION (RECOMPILE);
We can see that the latter uses the index seek and key lookup because it has checked the value of the variable at execution time, and the most appropriate plan for that specific value is chosen. The trouble with OPTION (RECOMPILE) is that it means you can't take advantage of cached query plans, so there is the additional cost of compiling the query each time.
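A middle ground worth knowing about (my addition, not from the answer above) is OPTION (OPTIMIZE FOR ...), which keeps a single cached plan but compiles it for a value you specify rather than for the blended guess:
-- Compiles the plan as though @i were 2, without recompiling on every execution.
DECLARE @i INT = 2;
SELECT COUNT(Filler)
FROM #T
WHERE Val = @i
OPTION (OPTIMIZE FOR (@i = 2));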
I had this exact problem, and none of the query-option solutions seemed to have any effect.
It turned out I was declaring an nvarchar(8) parameter while the table had a column of varchar(8).
Upon changing the parameter type, the query did an index seek and ran instantaneously. It must be that the optimizer was getting tripped up by the conversion.
This may not be the answer in your case, but it's worth checking.
Try
declare @val1 nvarchar(40), @val2 nvarchar(40);
set @val1 = 'val1';
set @val2 = 'val2';
select
    min(id)
from
    scor_inv_binaries
where
    col1 in (@val1, @val2)
group by
    col1
OPTION (RECOMPILE)
What datatype is col1?
Your variables are nvarchar whereas your literals are varchar/char; if col1 is varchar/char, it may be doing the index scan to implicitly cast each value in col1 to nvarchar for the comparison.
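If that's the cause, declaring the variables with the column's own type (a sketch, assuming col1 is varchar(40)) removes the implicit conversion and should restore the seek:
-- Hypothetical: the variable types now match col1 exactly, so the IN comparison stays sargable.
declare @val1 varchar(40), @val2 varchar(40);
set @val1 = 'val1';
set @val2 = 'val2';
select min(id)
from scor_inv_binaries
where col1 in (@val1, @val2)
group by col1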
I guess the first query is using a Predicate and the second a Seek Predicate.
The Seek Predicate is the operation that describes the b-tree portion of the seek; the Predicate describes an additional filter using non-key columns. Based on those descriptions, a Seek Predicate is better than a Predicate, as it searches the index, whereas with a Predicate the search runs against the data pages themselves.
For more details, please visit:
https://social.msdn.microsoft.com/Forums/sqlserver/en-US/36a176c8-005e-4a7d-afc2-68071f33987a/predicate-and-seek-predicate
