insert with order by is faster

I have the following query:
INSERT INTO table(field, field1)
SELECT value, value1 FROM table2 e
ORDER BY value
which takes less time than this one:
INSERT INTO table(field, field1)
SELECT value, value1 FROM table2 e
Does anyone know why?
The execution plan of the second one shows that SQL Server performs a sort anyway, yet it is slower than the query with my explicit ORDER BY.

Insert performance depends on how many indexes the target table has and on which columns. If there is a clustered index on table.field, inserting values that are not sorted by field is quite expensive.

Do you have a nonclustered index on the value column of table2? Do you have a clustered index on table whose key is the column you order by? I can see two possible reasons for this.
1. There is a nonclustered index on column value, so the optimizer picks this index and avoids a sort (it could even be a covering index, in which case it will be very fast). The query without ORDER BY did not pick that index because it is a simple query for which little optimization happened: it did a clustered index or table scan and then sorted the data, which made it slower than the ORDER BY version. This is the most likely reason.
2. The other reason could be that rows are inserted in the order they arrive. If the data is ordered and the clustered index is on the ORDER BY column, there are no page splits and the insert proceeds smoothly. If there is no sort, values are inserted in random order and can cause page splits, which slightly degrades performance. However, the OP mentioned that the optimizer did a sort before inserting, so this scenario does not apply here.
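One way to check which explanation applies is to run both inserts with I/O and timing statistics switched on and compare the output alongside the two execution plans. The following is only a sketch: dbo.Target and dbo.Source are hypothetical stand-ins for the question's table and table2, and it assumes dbo.Target has a clustered index on field.

```sql
-- Hypothetical names: dbo.Target (clustered index on field) and dbo.Source.
SET STATISTICS IO ON;
SET STATISTICS TIME ON;

-- Variant 1: explicit ORDER BY on the column feeding the clustered key.
TRUNCATE TABLE dbo.Target;
INSERT INTO dbo.Target (field, field1)
SELECT value, value1
FROM dbo.Source
ORDER BY value;

-- Variant 2: no ORDER BY; the optimizer decides whether and where to sort.
TRUNCATE TABLE dbo.Target;
INSERT INTO dbo.Target (field, field1)
SELECT value, value1
FROM dbo.Source;

SET STATISTICS IO OFF;
SET STATISTICS TIME OFF;
```

Truncating before each run keeps the comparison fair, since both inserts then start from an empty table. The logical reads and elapsed times in the Messages output show where the extra work goes.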

Related

Query plan ignores index on large join

I have the following fairly simple query that returns about 1 million rows (I've left out columns as they are just for output), but the query plan doesn't seem to want to use the index and wants me to create one:
SELECT [SAU]
,nr.[Headend]
,[Source]
,[Destination]
,[FibreHop]
,[CableRef]
,[CableSectionRef]
,[nNGAFibres]
,[nEthFibres]
,[FromID]
,[ToID]
,[FromIDTerm]
,[ToIDTerm],Reversed
,@Now
FROM [NodeRouting] nr
join [TargetHeadends] tex ON nr.Headend=tex.Headend
The index is:
CREATE NONCLUSTERED INDEX [NodeRouting_Headend] ON [NodeRouting]
(
[Headend] ASC
)
In the other table, Headend is the primary key.
The query plan is this:
If I give it a hint to use the index already created (non-unique, non-clustered) on the id field:
join [TargetHeadends] tex ON nr.id=tex.id (index=NodeRouting_Headend)
It changes to this:
Incidentally, the estimated number of rows in the first plan, 966,000, matches the real row count. The RID lookup estimate of 761,000 in the second plan is a few hundred thousand short, and the operator cost seems a lot higher.
One thing that strikes me as a little odd is that in the first example, where it chose not to use the index, it says this:
Missing Index (impact 99): CREATE NONCLUSTERED INDEX <NAME> ON NodeRouting(id) include (....)
CREATE NONCLUSTERED INDEX [<Name>]
ON [NodeRouting] ([Headend])
INCLUDE ([SAU],[Source],[Destination],[FibreHop],[CableRef],[CableSectionRef],[nNGAFibres],[nEthFibres],[FromID],[ToID],[FromIDTerm],[ToIDTerm],[Reversed])
I appreciate I'm returning more columns than are in the index, but I would have thought the index would still have been used without the INCLUDE?

Indexes don't always help, and they should not need to be forced into use. For example, for small tables a scan will be used because the overhead of the index makes it more work than scanning. Don't force the use of the index.
For a large table, an index helps when it is "selective" and the query is selective. It will get a few records quickly. It does not get a lot of records quickly. As a rough guide, if the query touches no more than about 5% of the rows, the index might be used. Otherwise, a scan might be faster than using the non-selective index.
If you are returning all the records, then there is no selectivity. A scan is going to be more efficient. For the join, other methods are more efficient than a lookup per row when there are a lot of records.
Using a phonebook analogy: just start at the front of the phone book and read it to the end. Don't start at the start of the index and look up each name one at a time until you get to the end of the index.
A covering index can help because it can be scanned in place of scanning the original table (clustered index). For example, if you have two phone books where one has address information and the other does not, then reading the one without address information will be faster if you are not interested in addresses.
FYI: Don't trust the order of the columns in the index suggestions. Also, the index suggested in this case might be a covering index to avoid reading unneeded columns, not one chosen for selectivity.
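To see the cost difference rather than take it on faith, both access paths can be run with I/O statistics enabled. This is a sketch only: it reuses the table and index names from the question, trims the column list for brevity, and assumes the selected columns live in NodeRouting. The WITH (INDEX(...)) table hint is the standard T-SQL way to force an index and is used here purely for comparison.

```sql
SET STATISTICS IO ON;

-- Let the optimizer choose (the question reports a scan here).
SELECT nr.[SAU], nr.[Headend], nr.[Source], nr.[Destination]
FROM [NodeRouting] nr
JOIN [TargetHeadends] tex ON nr.Headend = tex.Headend;

-- Force the narrow nonclustered index. Every matching row then needs an
-- extra lookup into the base table to fetch the columns the index lacks.
SELECT nr.[SAU], nr.[Headend], nr.[Source], nr.[Destination]
FROM [NodeRouting] nr WITH (INDEX(NodeRouting_Headend))
JOIN [TargetHeadends] tex ON nr.Headend = tex.Headend;

SET STATISTICS IO OFF;
```

If the forced version reports far more logical reads on NodeRouting, that is the per-row lookup overhead the optimizer was avoiding by scanning.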

What's the difference between table scanning a clustered table and index scanning?

THE SITUATION
I have a table with only one index, a clustered index (two columns).
I do a 'SELECT * FROM TABLE' and the optimizer decides on a table scan.
I get the rows roughly sorted by the clustered index. I say roughly because the order doesn't look random, but it has a lot of glitches.
If I force use of the clustered index with SELECT * FROM TABLE (index 1 MRU), I get exactly the clustered table order.
QUESTIONS
How can the table scan result differ in order from the clustered index scan if the data in a clustered table is sorted by its index?
Is the table scan on a clustered table a scan of the leaf level of the table? Aren't those pages sorted?
Is the clustered index scan a scan of all the possible paths of the b-tree in an ordered manner?
Excuse my possible lack of knowledge; I'm trying my best to understand the underlying concepts.
HOW I TESTED THIS
I got these inconsistent ordering results by testing two different clustered indexes (one with two columns and the other with one column), creating and dropping the constraint and checking the select statement each time.
After truncating the table and creating the index, the data is correctly sorted, but after dropping the index and creating a different one, the data is not perfectly sorted with a table scan. I need to force index use.
WHY IS THIS IMPORTANT
Because I want to guarantee order without using an ORDER BY clause on a clustered table.

On 15.0 and upwards, ALWAYS specify an ORDER BY if you want a specific order, because the structure of the data and index varies between allpages and data-only locked (DOL) tables.
The optimizer may, for example, choose to do parts of the query retrieval in parallel under the covers, depending on your parallelism settings, which is why the ORDER BY is important. A plain select * hasn't requested any specific order.
Just add the ORDER BY and you'll be fine: the select * is going to table scan anyway because you're asking for the whole table, so there is no need for index hints.
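A minimal sketch of that advice, assuming a hypothetical table my_table whose clustered index is on the hypothetical columns (key_col1, key_col2):

```sql
-- Hypothetical table and column names; list the clustered key columns
-- explicitly so the order is guaranteed rather than incidental.
SELECT *
FROM my_table
ORDER BY key_col1, key_col2;
```

With the ORDER BY in place, the result order no longer depends on which access path or locking scheme happens to be in use.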

THE EXPLANATION
Clustered indexes are logically ordered but not physically ordered.
This means that a table scan, if it's done in physical order, will return rows in a different order than a clustered index scan, which follows the logical order.
This logical-physical mapping is controlled by the OAM (Object Allocation Map).

Can including columns into the SELECT from the same table slow down the query?

Imagine Foo table has non-clustered indexes on ColA and ColB
and NO Indexes on ColC, ColD
SELECT colA, colB
FROM Foo
takes about 30 seconds.
SELECT colA, colB, colC, colD
FROM Foo
takes about 2 minutes.
Foo table has more than 5 million rows.
Question:
Is it possible that including columns that are not part of the indexes can slow down the query?
If yes, why? Aren't they part of the pages that were already read?

If you write a query that uses a covering index, then the full data pages in the heap/clustered index are not accessed.
If you subsequently add more columns to the query, such that the index is no longer covering, then either additional lookups will occur (if the index is still used), or you force a different data access path entirely (such as using a table scan instead of using an index)
Since 2005, SQL Server has supported the concept of Included Columns in an index. This includes non-key columns in the leaf of an index - so they're of no use during the data-lookup phase of index usage, but still help to avoid performing an additional lookup back in the heap/clustered index, if they're sufficient to make the index a covering index.
Also, in future, if you want to get a better understanding on why one query is fast and another is slow, look into generating Execution Plans, which you can then compare.
Even if you don't understand the terms used, you should at least be able to play "spot the difference" between them and then search on the terms (such as table scan, index seek, or lookup)
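As an illustration of included columns, the sketch below builds a covering index for the slower of the two queries and measures it. The index name IX_Foo_ColA_Covering is made up for this example, and whether such an index is worth its extra storage and write cost depends on the workload.

```sql
-- Hypothetical index name. ColA is the key; ColB, ColC and ColD are carried
-- only at the leaf level so the query can be answered from the index alone.
CREATE NONCLUSTERED INDEX IX_Foo_ColA_Covering
ON Foo (ColA)
INCLUDE (ColB, ColC, ColD);

SET STATISTICS IO ON;

SELECT ColA, ColB, ColC, ColD
FROM Foo;

SET STATISTICS IO OFF;
```

Comparing the logical reads before and after creating the index, together with the two execution plans, shows whether the lookups or the wider scan were the cause of the slowdown.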

The simple answer is: a non-clustered index is not stored in the same pages as the data, so SQL Server has to look up the actual data pages to pick up the remaining columns.
Non-clustered indexes are stored in separate data structures, while a clustered index is stored in the same place as the actual data. That's why you can have only one clustered index.

SQL Server Index cost

I have read that one of the tradeoffs for adding table indexes in SQL Server is the increased cost of insert/update/delete queries to benefit the performance of select queries.
I can conceptually understand what happens in the case of an insert because SQL Server has to write entries into each index matching the new rows, but update and delete are a little more murky to me because I can't quite wrap my head around what the database engine has to do.
Let's take DELETE as an example and assume I have the following schema (pardon the pseudo-SQL)
CREATE TABLE Foo (
    col1 int NOT NULL,
    col2 int NOT NULL,
    col3 int,
    col4 int,
    PRIMARY KEY (col1, col2)
);
CREATE INDEX IX_1 ON Foo (col3) INCLUDE (col4);
Now, if I issue the statement
DELETE FROM Foo WHERE col1=12 AND col2 > 34
I understand what the engine must do to update the table (or clustered index if you prefer). The index is set up to make it easy to find the range of rows to be removed and do so.
However, at this point it also needs to update IX_1 and the query that I gave it gives no obvious efficient way for the database engine to find the rows to update. Is it forced to do a full index scan at this point? Does the engine read the rows from the clustered index first and generate a smarter internal delete against the index?
It might help me to wrap my head around this if I understood better what is going on under the hood, but I guess my real question is this. I have a database that is spending a significant amount of time in delete and I'm trying to figure out what I can do about it.
When I display the execution plan for the deletion, it just shows an entry for "Clustered Index Delete" on table Foo which lists in the details section the other indices that need to be updated but I don't get any indication of the relative cost of these other indices.
Are they all equal in this case? Is there some way that I can estimate the impact of removing one or more of these indices without having to actually try it?

Nonclustered indexes also store the clustered keys.
It does not have to do a full scan, since:
your query will use the clustered index to locate rows
rows contain the other index value (c3)
using the other index value (c3) and the clustered index values (c1,c2), it can locate matching entries in the other index.
(Note: I had trouble interpreting the docs, but I would imagine that IX_1 in your case could be defined as if it was also sorted on c1,c2. Since these are already stored in the index, it would make perfect sense to use them to more efficiently locate records for e.g. updates and deletes.)
All this, however has a cost. For each matching row:
it has to read the row, to find out the value for c3
it has to find the entry for (c3,c1,c2) in the nonclustered index
it has to delete the entry from there as well.
Furthermore, while the range query can be efficient on the clustered index in your case (linear access, after finding a match), maintenance of the other indexes will most likely result in random access to them for every matching row. Random access has a much higher cost than just enumerating B+ tree leaf nodes starting from a given match.
Given the above query, more time is spent on the non-clustered index maintenance - the amount depends heavily on the number of records selected by the col1 = 12 AND col2 > 34 predicate.
My guess is that the cost is conceptually the same as if you did not have a secondary index but had e.g. a separate table, holding (c3,c1,c2) as the only columns in a clustered key and you did a DELETE for each matching row using (c3,c1,c2). Obviously, index maintenance is internal to SQL Server and is faster, but conceptually, I guess the above is close.
The above would mean that maintenance costs of indexes would stay pretty close to each other, since the number of entries in each secondary index is the same (the number of records) and deletion can proceed only one-by-one on each index.
If you need the indexes, performance-wise, depending on the number of deleted records, you might be better off scheduling the deletes, dropping the indexes - that are not used during the delete - before the delete and adding them back after. Depending on the number of records affected, rebuilding the indexes might be faster.
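The drop-and-recreate idea can be approximated in T-SQL by disabling the nonclustered index and rebuilding it afterwards, which keeps its definition in place. This is a sketch under the question's schema (Foo, IX_1). Whether it beats a plain DELETE depends on how many rows are removed relative to the table size, and the index is unavailable to other queries while it is disabled.

```sql
-- Disable only the nonclustered index; never the clustered index / primary key,
-- which the DELETE itself needs in order to find the rows.
ALTER INDEX IX_1 ON Foo DISABLE;

DELETE FROM Foo
WHERE col1 = 12 AND col2 > 34;

-- Rebuilding re-enables the index and recreates it from the remaining rows.
ALTER INDEX IX_1 ON Foo REBUILD;
```

Timing this sequence against the plain DELETE on a test copy of the data gives a direct estimate of how much of the delete cost comes from maintaining IX_1.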

Please explain the query plan sql server chooses

In this blog post, I need clarification on why SQL Server would choose a particular type of scan:
Let's assume for simplicity's sake that col1 is unique and is ever increasing in value, col2 has 1000 distinct values and there are 10,000,000 rows in the table, and that the clustered index consists of col1, and a nonclustered index exists on col2.
Imagine the query execution plan created for the following initially passed parameters: @P1 = 1, @P2 = 99
These values would result in an optimal queryplan for the following statement using the substituted parameters:
Select * from t where col1 > 1 or col2 > 99 order by col1;
Now, imagine the query execution plan if the initial parameter values were: @P1 = 6,000,000 and @P2 = 550.
As before, an optimal queryplan would be created after substituting the passed parameters:
Select * from t where col1 > 6000000 or col2 > 550 order by col1;
These two identical parameterized SQL Statements would potentially create and cache very different execution plans due to the difference of the initially passed parameter values. However, since SQL Server only caches one execution plan per query, chances are very high that in the first case the query execution plan will utilize a clustered index scan because of the 'col1 > 1' parameter substitution. Whereas, in the second case a query execution plan using index seek would most likely be created.
from: http://blogs.msdn.com/sqlprogrammability/archive/2008/11/26/optimize-for-unknown-a-little-known-sql-server-2008-feature.aspx
Why would the first query use a clustered index scan, and the second query an index seek?

Assuming that the columns contain only positive integers:
SQL Server would look at the statistics for the table and see that, for the first query, all rows in the table meet the criteria of col1>1, so it chooses to scan the clustered index.
For the second query, a relatively small proportion of rows would meet the criteria of col1> 6000000, so using an index seek would improve performance.

Notice that in both cases the clustered index will be used. In the first example it is a clustered index SCAN, whereas in the second example it will be a clustered index SEEK, which in most cases will be the faster of the two, as the author of the blog states.
SQL Server knows that the clustered index is increasing. Therefore it will do a clustered index scan in the first case.

In cases where the optimizer sees that the majority of the table will be returned by the query, as in the first query, it's more efficient to perform a scan than a seek.
Where only a small portion of the table will be returned, such as in the second query, then an index seek is more efficient.
A scan will touch every row in the table whether it qualifies or not. The cost is proportional to the total number of rows in the table. A scan is an efficient strategy if the table is small or if most of the rows qualify for the predicate.
A seek will touch rows that qualify and pages that contain these qualifying rows, the cost is proportional to the number of qualifying rows and pages rather than to the total number of rows in the table.
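The quoted blog post is about the OPTIMIZE FOR UNKNOWN hint (SQL Server 2008 and later), which tells the optimizer to build the plan from average column statistics rather than from the first parameter values it sees. A sketch, assuming the blog's table t with columns col1 and col2:

```sql
-- Parameterized statement from the blog post, with the hint appended.
-- The plan no longer depends on whether the first call passes 1/99
-- or 6000000/550.
EXEC sp_executesql
    N'SELECT * FROM t
      WHERE col1 > @P1 OR col2 > @P2
      ORDER BY col1
      OPTION (OPTIMIZE FOR UNKNOWN);',
    N'@P1 int, @P2 int',
    @P1 = 6000000,
    @P2 = 550;
```

The trade-off is that the cached plan is a compromise: it avoids the worst case for atypical parameters but may not be the best plan for any particular pair of values.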
