MS SQL query performance - in vs table variable - sql-server

I have a list of strings (the number of strings varies from 10 to 100) and I need to select values from a large table (100K-5M records) as efficiently as I can.
It seems to me that I basically have 3 options: an 'in' clause, a table variable, or a temp table.
Something like this:
select col1, col2, col3, name from my_large_table
where index_field1 = 'xxx'
and index_field2 = 'yyy'
and name in ('name1', 'name2', 'name3', ... 'nameX')
or
declare @tbl table (name nvarchar(50))
insert @tbl(name) values ('name1'), ('name2'), ('name3'), ... ('nameX')
select col1, col2, col3, my_large_table.name
from my_large_table inner join @tbl as tbl on (tbl.name = my_large_table.name)
where index_field1 = 'xxx'
and index_field2 = 'yyy'
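or, as the third option, a temp table (a rough sketch only; #names is a hypothetical name):
create table #names (name nvarchar(50) primary key)
insert #names(name) values ('name1'), ('name2'), ('name3'), ... ('nameX')
select col1, col2, col3, my_large_table.name
from my_large_table inner join #names as tbl on (tbl.name = my_large_table.name)
where index_field1 = 'xxx'
and index_field2 = 'yyy'
drop table #names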
The large table has clustered index on (index_field1, index_field2, name, index_field3).
Actually for each set of names I have 4-5 queries from the large table:
select, then update and/or insert and/or delete according to some logic - each time constraining the query on this set of names.
The name set and the queries are built dynamically in the .NET client, so there are no concerns about readability, code simplicity or the like. The only goal is the best performance, since this batch will be executed many times.
So the question is - should I use 'in' clause, table variable or something else to write my condition?

As already mentioned, you should avoid table variables for fairly large data sets, as the optimizer keeps no statistics on them and they offer very limited indexing options.
If I got it correctly, you have multiple queries using the same set of names, so I would suggest the following approach:
1) create a persistent table (BufferTable) to hold the word list: PkId, SessionId, Word.
2) for each session using some set of words, bulk insert the words into BufferTable (SessionId is unique to each batch of queries). This should be very fast for tens to hundreds of words.
3) write your queries like the one below:
select col1, col2, col3, name
from my_large_table LT
join BufferTable B ON B.SessionId = @SessionId AND LT.name = B.Word
where LT.index_field1 = 'xxx'
and LT.index_field2 = 'yyy'
4) An index on (SessionId, Word) is required for best performance.
This way, you do not have to push the words for every query.
BufferTable is best emptied periodically, as deletes are expensive (truncating it when nobody is using it is an option).
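A minimal sketch of this setup, assuming a uniqueidentifier SessionId supplied by the .NET client (table and index names follow the steps above):
create table BufferTable (
    PkId int identity(1,1) primary key,
    SessionId uniqueidentifier not null,
    Word nvarchar(50) not null
);
create index IX_BufferTable_SessionId_Word on BufferTable (SessionId, Word);
-- per batch: insert the names once with a fresh @SessionId passed from the client
insert BufferTable (SessionId, Word) values (@SessionId, N'name1'), (@SessionId, N'name2');
-- ... run the 4-5 select/update/insert/delete statements joining on BufferTable
-- as shown above, then clean up that session's rows:
delete from BufferTable where SessionId = @SessionId;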

Related

Does MS SQL Server automatically create a temp table if the query contains a lot of id's in the 'IN' clause?

I have a big query to get multiple rows by id's like
SELECT *
FROM TABLE
WHERE Id in (1001..10000)
This query runs very slowly and ends up with a timeout exception.
A temporary fix is to limit the query by breaking it into 10 parts of 1000 id's each.
I heard that using temp tables may help in this case, but it also looks like MS SQL Server does this automatically underneath.
What is the best way to handle problems like this?
You could write the query as follows using a temporary table:
CREATE TABLE #ids(Id INT NOT NULL PRIMARY KEY);
INSERT INTO #ids(Id) VALUES (1001),(1002),/*add your individual Ids here*/,(10000);
SELECT
t.*
FROM
[Table] AS t
INNER JOIN #ids AS ids ON
ids.Id=t.Id;
DROP TABLE #ids;
My guess is that it will probably run faster than your original query. Lookup can be done directly using an index (if it exists on the [Table].Id column).
Your original query translates to
SELECT *
FROM [TABLE]
WHERE Id=1001 OR Id=1002 OR /*...*/ OR Id=10000;
This would require evaluation of the expression Id=1001 OR Id=1002 OR /*...*/ OR Id=10000 for every row in [Table], which probably takes longer than with a temporary table. The example with a temporary table takes each Id in #ids and looks for a corresponding Id in [Table] using an index.
This all assumes that there are gaps in the Ids between 1001 and 10000. Otherwise it would be easier to write
SELECT *
FROM [TABLE]
WHERE Id BETWEEN 1001 AND 10000;
This would also require an index on [Table].Id to speed it up.
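If [Table].Id is not already indexed (for example as the clustered primary key), a plain nonclustered index would do; the index name below is just a placeholder:
CREATE INDEX IX_Table_Id ON [Table](Id);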

How can I use a table.column value for a join using dynamic sql?

I'm creating a data validation procedure for a database, using various models, etc. in the database.
I've created a temp table with a model, a sequence, and 3 columns.
In each of these columns I have the qualified column name (table.column) to use in my query, or a null value. Here's the temp_table structure:
create table #temp_table(model nvarchar(50), seq nvarchar(50), col nvarchar(100), col2 nvarchar(100) , col3 nvarchar(100))
In my dynamic SQL I have a join something like this (extremely simplified):
select *
from
original_table
inner join
...
#temp_table
on
original_table.models = #temp_table.model
inner join
set_model
on
original_table.models = set_model.models
and #temp_table.col = set_model.val
and #temp_table.col2 = set_model.val2
and #temp_table.col3 = set_model.val3
What I'm working on has many more tables (hence the ... in the middle of the query), so, we'll just assume that all the tables are present and all the columns are valid.
Because #temp_table.col stores a value, when being join to set_model.val the comparison will look something like 'Buildings.year_id' = 2014.
Is there a way to force my dynamic query to use the value in #temp_table.col as part of the join condition?
For example:
If in the query above #temp_table.col = 'Buildings.year_id'
how do I make the join evaluate Buildings.year_id = set_model.val
rather than 'Buildings.year_id' = 2014?
I was trying to create a query which had a different query plan based upon the row queried.
I found a workaround (creating a cursor and looping through n different tables and appending each dynamic query with a ' union '), but I came back and thought about the problem I ran into for a little while.
I realized that I was trying to dynamically create a query based upon data from the query I was using. As a result, no effective query plan could be created, as it would need to run a unique query based upon each row.
If the query had been able to work, it would've been extremely inefficient (as each row would make its own 'select' statement).
Regardless, the question itself was based on bad/incomplete logic.
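For reference, the cursor-plus-UNION workaround mentioned above might look roughly like this (a sketch only; it assumes the col/col2/col3 values are never null and are trusted, since they are concatenated straight into the dynamic SQL):
declare @sql nvarchar(max) = N'';
declare @col nvarchar(100), @col2 nvarchar(100), @col3 nvarchar(100);
declare model_cursor cursor local fast_forward for
    select col, col2, col3 from #temp_table;
open model_cursor;
fetch next from model_cursor into @col, @col2, @col3;
while @@fetch_status = 0
begin
    -- build one SELECT per row, substituting the stored column names,
    -- and glue the pieces together with ' union '
    set @sql = @sql
        + case when @sql = N'' then N'' else N' union ' end
        + N'select * from original_table inner join set_model'
        + N' on original_table.models = set_model.models'
        + N' and ' + @col  + N' = set_model.val'
        + N' and ' + @col2 + N' = set_model.val2'
        + N' and ' + @col3 + N' = set_model.val3';
    fetch next from model_cursor into @col, @col2, @col3;
end
close model_cursor;
deallocate model_cursor;
exec sp_executesql @sql;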

SQL Script add records with identity FK

I am trying to create an SQL script to insert a new row and use that row's identity column as an FK when inserting into another table.
This is what I use for a one-to-one relationship:
INSERT INTO userTable(name) VALUES(N'admin')
INSERT INTO adminsTable(userId,permissions) SELECT userId,255 FROM userTable WHERE name=N'admin'
But now I also have a one-to-many relationship, and I asked myself whether I can use fewer SELECT queries than this:
INSERT INTO bonusCodeTypes(name) VALUES(N'1500 pages')
INSERT INTO bonusCodeInstances(codeType,codeNo,isRedeemed) SELECT name,N'123456',0 FROM bonusCodeTypes WHERE name=N'1500 pages'
INSERT INTO bonusCodeInstances(codeType,codeNo,isRedeemed) SELECT name,N'012345',0 FROM bonusCodeTypes WHERE name=N'1500 pages'
I could also use something like this:
INSERT INTO bonusCodeInstances(codeType,codeNo,isRedeemed)
SELECT name,bonusCode,0 FROM bonusCodeTypes CROSS JOIN
(SELECT N'123456' AS bonusCode UNION SELECT N'012345' AS bonusCode) AS codes
WHERE name=N'1500 pages'
but this is also a very complicated way of inserting all the codes, I don't know whether it is even faster.
So, is there a possibility to use a variable inside SQL statements? Like
var lastinsertID = INSERT INTO bonusCodeTypes(name) OUTPUT inserted.id VALUES(N'300 pages')
INSERT INTO bonusCodeInstances(codeType,codeNo,isRedeemed) VALUES(lastinsertID,N'123456',0)
OUTPUT can only insert into a table. If you're only inserting a single record, it's much more convenient to use SCOPE_IDENTITY(), which returns the most recently inserted identity value in the current scope. If you need a range of values, one technique is to OUTPUT all the identity values into a temp table or table variable along with the business keys, and join on that -- but provided the table you are inserting into has an index on those keys (and why shouldn't it), this buys you nothing over simply joining the base table in a transaction, other than lots more I/O.
So, in your example:
INSERT INTO bonusCodeTypes(name) VALUES(N'300 pages');
DECLARE #lastInsertID INT = SCOPE_IDENTITY();
INSERT INTO bonusCodeInstances(codeType,codeNo,isRedeemed) VALUES (#lastInsertID, N'123456',0);
SELECT #lastInsertID AS id; -- if you want to return the value to the client, as OUTPUT implies
Instead of VALUES, you can of course join on a table instead, provided you need the same #lastInsertID value everywhere.
As to your original question, yes, you can also assign variables from statements -- but not with OUTPUT. However, SELECT TOP(1) @x = something FROM [table] is perfectly OK.
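For the multi-row case described above, a sketch might look like this (it assumes the identity column is named id, as in the question's OUTPUT example, that codeType stores that id as in the code above, and the code numbers are made up):
DECLARE @newTypes TABLE (id INT, name NVARCHAR(50));
INSERT INTO bonusCodeTypes(name)
OUTPUT inserted.id, inserted.name INTO @newTypes(id, name)
VALUES (N'300 pages'), (N'1500 pages');
-- join the captured identity values back to their business keys
INSERT INTO bonusCodeInstances(codeType, codeNo, isRedeemed)
SELECT nt.id, c.codeNo, 0
FROM @newTypes AS nt
JOIN (VALUES (N'300 pages', N'123456'),
             (N'1500 pages', N'654321')) AS c(typeName, codeNo)
    ON c.typeName = nt.name;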

efficiently move data between schemas of the same database in postgresql

How can I most efficiently move data between similar tables (same number of columns and data types; if they are not the same, I hope it can be achieved with a view) in different schemas of the same PostgreSQL database?
EDIT
Sorry for the vagueness. I intend to use the additional schemas as archives for data that is not often needed (to improve performance). To be more precise, data older than 2 years is to be archived. It is okay to take the server offline, but for no more than a day, at most 2. It is accounting software for a medium-sized company. By liberal estimates the number of records in a year won't go near a million.
insert into target_schema.table_one (col1, col2, col3)
select col1, col2, col3
from source_schema.other_table
where <some condition to select the data to be moved>;
If you really want to "move" the data (i.e. delete the rows from the source table), you also need to remove the copied rows from the source.
If the table is the target of a foreign key you cannot use truncate, so in that case you need to use
delete from source_schema.other_table
where <some condition to select the data to be moved>;
You can combine both steps into a single statement, if you want to:
with deleted_data as (
delete from source_schema.other_table
where <some condition to select the data to be moved>
returning *
)
insert into target_schema.table_one (col1, col2, col3)
select col1, col2, col3
from deleted_data;
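Applied to the archiving scenario in the question, that combined statement might look roughly like this (journal_entries, entry_date and the archive schema are hypothetical names; both tables are assumed to have identical column layouts):
with deleted_data as (
    delete from public.journal_entries
    where entry_date < now() - interval '2 years'
    returning *
)
insert into archive.journal_entries  -- same column layout as the source table
select * from deleted_data;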

Transact SQL parallel query execution

Suppose I have
INSERT INTO #tmp1 (ID) SELECT ID FROM Table1 WHERE Name = 'A'
INSERT INTO #tmp2 (ID) SELECT ID FROM Table2 WHERE Name = 'B'
SELECT ID FROM #tmp1 UNION ALL SELECT ID FROM #tmp2
I would like to run queries 1 & 2 in parallel, and then combine results after they are finished.
Is there a way to do this in pure T-SQL, or a way to check if it will do this automatically?
A background for those who want it: I am investigating a complex search with multiple conditions which are later combined (term OR (term2 AND term3) OR term4 AND item5=term5), and thus whether it would be useful to execute those - largely unrelated - conditions in parallel, later combining the resulting tables (and calculating ranks, weights, and so on).
E.g., there should be several result sets:
SELECT COUNT(*) FROM (#tmp1 union #tmp2)
SELECT ID from (#tmp1 union #tmp2) WHERE ...
SELECT * from TABLE3 where ID IN (SELECT ID FROM #tmp1 union #tmp2)
SELECT * from TABLE4 where ID IN (SELECT ID FROM #tmp1 union #tmp2)
You don't. SQL doesn't work like that: it isn't procedural, and coordinating this yourself leads to race conditions and data issues because of other connections.
Table variables are also scoped to the batch and connection, so you can't share results over 2 connections, in case you're wondering.
In any case, all you need is this, unless you gave us a bad example:
SELECT ID FROM Table1 WHERE Name = 'A'
UNION
SELECT ID FROM Table2 WHERE Name = 'B'
I suspect you're thinking of "run in parallel" because of this procedural thinking. What is the actual problem and goal you are trying to address?
Note: table variables do not allow parallel operations: Can queries that read table variables generate parallel exection plans in SQL Server 2008?
You don't decide what to parallelise - SQL Server's optimizer does. And the largest unit of work that the optimizer will work with is a single statement - so, you find a way to express your query as a single statement, and then rely on SQL Server to do its job, which it will usually do quite well.
If, having constructed your query, the performance isn't acceptable, then you can look at applying hints or forcing certain plans to be used. A lot of people break their queries into multiple statements, either believing that they can do a better job than SQL Server, or because it's how they "naturally" think of the task at hand. Both are "wrong" (for certain values of wrong), but if there's a natural breakdown, you may be able to replicate it using Common Table Expressions - these would allow you to name each sub-part of the problem, and then combine them together, all as part of a single statement.
E.g.:
;WITH TabA AS (
SELECT ID FROM Table1 WHERE Name = 'A'
), TabB AS (
SELECT ID FROM Table2 WHERE Name = 'B'
)
SELECT ID FROM TabA UNION ALL SELECT ID FROM TabB
And this will allow the server to decide how best to resolve this query (e.g. deciding whether to store intermediate results in "temp" tables)
Seeing in one of your other comments that you discuss having to "work with" the intermediate results - this can still be done with CTEs (if it's not just a case of failing to express the "final" result as a single query), e.g.:
;WITH TabA AS (
SELECT ID FROM Table1 WHERE Name = 'A'
), TabAWithCalcs AS (
SELECT ID,(ID*5+6) as ModID from TabA
)
SELECT * FROM TabAWithCalcs
Why not just:
SELECT ID FROM Table1 WHERE Name = 'A'
UNION ALL
SELECT ID FROM Table2 WHERE Name = 'B'
then if SQL Server wants to run the two selects in parallel, it will do so of its own volition.
Otherwise we need more context for what you're trying to achieve if this isn't practical.
