Optimizing merge join

Read this article https://bertwagner.com/posts/visualizing-merge-join-internals-and-understanding-their-implications/
It contains this phrase:
"If the optimizer added a sort to the upstream merge join though, it may be worth investigating whether it's possible to presort that data so SQL Server doesn't need to sort it on its own. Often times this can be as simple as redefining an included index column to a key column - if you are adding it as the last key column in the index then regression impact is usually minor but you may be able to allow SQL Server to use the merge join without any additional sorting required."
I don't understand. Is the author suggesting adding an extra column (the one SQL Server would otherwise sort itself) to an already existing index as the last key column? As I understand it, an index is sorted from the first column to the last.
E.g. a table with columns "number" (int) and "letter" (varchar) will have an index ("number", "letter") like
1 A
1 D
3 A
3 D
So how can the presence of the "letter" column in the index save the server the trouble of sorting it?

A merge join can only merge two data streams that are already sorted according to the join predicate (forward or backward).
If the join predicate includes both columns (number & letter), but there's an index on number only, the engine won't be able to use the index as a source of a "presorted" data stream. If the engine decides on a merge join in this case, you'll notice the plan includes an extra Sort operator feeding the merge operator. This may not be efficient if the sort is expensive.
The author is indicating that if you see a case like this one, you could explore changing the existing index by adding the column letter to it as a key column. In this new scenario the engine will be able to use the index directly as a presorted data stream, without needing the extra Sort operator.
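For example, a minimal sketch assuming hypothetical table and index names; the point is only that the sort column moves from an included column to the last key column:

-- Existing index: "letter" is only an included column, so rows are not ordered by it
CREATE NONCLUSTERED INDEX IX_MyTable_number
    ON dbo.MyTable (number)
    INCLUDE (letter);

-- Redefined index: "letter" becomes the last key column, so rows come back ordered
-- by (number, letter) and the merge join no longer needs an extra Sort operator
CREATE NONCLUSTERED INDEX IX_MyTable_number
    ON dbo.MyTable (number, letter)
    WITH (DROP_EXISTING = ON);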
However, changing an index can be tricky. Improving the performance of this query might degrade another, more important one. Make sure you understand the implications.

Related

Is there a benefit in eliminating the unique-ness of a redundant unique index on SQL Server?

Whilst analyzing the database structure of a legacy application, I discovered that several tables have two unique indexes which contain exactly the same columns, just in a different order.
Having two unique indexes covering the same columns is clearly redundant, so my first instinct was to completely drop one of them. But then I thought some of the queries emitted by the application might be making use of the index I would delete, so I thought to convert it instead into a regular index.
To the best of my knowledge, whenever a row is inserted/updated in a table having a unique index, SQL Server spends some milliseconds validating that each unique index/constraint still holds true - so by converting one of these indexes into a non-unique one I hope processing of this table might be sped up a bit; please confirm or dispel.
On the other hand, I don't understand what the benefit is of having two unique indexes covering the same columns on a table. Any ideas what this could have been done for? Could something get lost if I convert one of them into a regular one?
Check the index usage stats (sys.dm_db_index_usage_stats) to see if both indexes are actually being used.
If not, delete the unused index.
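A rough sketch of such a check (the table name is a placeholder; note that sys.dm_db_index_usage_stats is cleared on every restart, so make sure the server has been up long enough to be representative):

-- Seeks/scans/lookups are reads that benefit from the index; updates are maintenance cost
SELECT  i.name AS index_name,
        s.user_seeks, s.user_scans, s.user_lookups, s.user_updates,
        s.last_user_seek, s.last_user_scan
FROM    sys.indexes AS i
LEFT JOIN sys.dm_db_index_usage_stats AS s
       ON s.object_id = i.object_id
      AND s.index_id = i.index_id
      AND s.database_id = DB_ID()
WHERE   i.object_id = OBJECT_ID(N'dbo.MyTable');  -- placeholder table name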
Generally speaking, indexes are used for filtering, then ordering. It is possible that you have queries that need to filter on the leading columns of both indexes. If that is the case, you'll reduce how deeply the query can be optimized by getting rid of one. That may not be a big deal, as it may still be able to satisfactorily use the remaining index.
For example, if I have 2 indexes with four columns:
1: Columns A, B, C, D
2: Columns A, B, D, C
Any query that currently prefers #2 could still gain benefits by using #1 if #2 is not available. It would just limit the selectivity to column B rather than all the way down to column D.
If you're not sure, try disabling (not deleting) the less used index and see if you notice any problems. If something slows down, it is simple enough to enable it again.
As always, try it in a non-production environment first.
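For instance (index and table names are placeholders):

-- Disable the index: it is no longer used or maintained, but its definition is kept
ALTER INDEX IX_MyTable_A_B_D_C ON dbo.MyTable DISABLE;

-- If something regresses, bring it back by rebuilding it
ALTER INDEX IX_MyTable_A_B_D_C ON dbo.MyTable REBUILD;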
UPDATE
Yes, you can safely remove the uniqueness of one of the indexes. It only needs to be enforced by one of them. The only concern would be if the vendor decided to do the same and chose the other index.
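Assuming the redundant index is a plain unique index (not backing a PRIMARY KEY or UNIQUE constraint), dropping and recreating it without the UNIQUE property might look like this (names are placeholders):

DROP INDEX UX_MyTable_A_B_D_C ON dbo.MyTable;

-- Recreate with the same key columns, but without UNIQUE
CREATE NONCLUSTERED INDEX IX_MyTable_A_B_D_C
    ON dbo.MyTable (A, B, D, C);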
However, since this is from a vendor, I'd recommend you contact them if there are performance concerns. If you're not running into a performance issue worth a support request to them, then just leave it alone.

SQL Server - what kind of index should I create?

I need to make queries such as
SELECT
Url, COUNT(*) AS requests, AVG(TS) AS avg_timeSpent
FROM
myTable
WHERE
Url LIKE '%/myController/%'
GROUP BY
Url
run as fast as possible.
The columns selected and grouped by are almost always the same; the only difference is an extra column in the SELECT and GROUP BY (the column tenantId).
What kind of index should I create to help me run this scenario?
Edit 1:
If I change my base query to '/myController/%' (note there's no % at the beginning), would it be better?
This is a query that cannot be sped up with an index. The DBMS cannot know beforehand how many records will match the condition. It may be 100% or 0.001%. There is no clue for the DBMS to guess this. And access via an index only makes sense when a small percentage of rows gets selected.
Moreover, how can such an index be structured and useful? Think of a telephone book and you want to find all names that contain 'a' or 'rs' or 'ems' or whatever. How would you order the names in the book to find all these and all other thinkable letter combinations quickly? It simply cannot be done.
So the DBMS will read the whole table record for record, no matter whether you provide an index or not.
There may be one exception: With an index on URL and TS, you'd have both columns in the index. So the DBMS might decide to read the whole index rather than the whole table then. This may make sense for instance when the table has hundreds of columns or when the table is very fragmented or whatever. I don't know. A table is usually much easier to read sequentially than an index. You can still just try, of course. It doesn't really hurt to create an index. Either the DBMS uses it or not for a query.
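If you want to try that, the index hinted at above would look something like this (a sketch; with the leading-wildcard LIKE it can only turn a table scan into a smaller index scan, but with the edited predicate '/myController/%' it could even support a range seek on Url):

-- Url as the key, TS included: the whole query can be answered from the index alone
CREATE NONCLUSTERED INDEX IX_myTable_Url
    ON dbo.myTable (Url)
    INCLUDE (TS);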
Columnstore indexes can be quite fast at such tasks (aggregates on global scans). But even they will have trouble handling a LIKE '%/myController/%' predicate. I recommend you parse the URL once into an additional computed column that extracts the controller from your URL. But the truth is that looking at the global time spent on a URL reveals very little information. It will contain data since the beginning of time, long since made obsolete by newer deployments, and will not be able to capture recent trends. A filter based on time, say per hour or per day, now that is a very useful analysis. And such a filter can be served excellently by a columnstore, because of the natural time order and segment elimination.
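A sketch of both ideas; the CHARINDEX expression is a hypothetical way of extracting the controller segment and depends entirely on your URL format (here it assumes the URL starts with '/controller/...'), and the columnstore assumes Url is not a MAX type:

-- Hypothetical computed column: the first path segment of the URL, e.g. 'myController'
ALTER TABLE dbo.myTable
    ADD Controller AS
        SUBSTRING(Url,
                  CHARINDEX('/', Url) + 1,
                  CHARINDEX('/', Url + '/', CHARINDEX('/', Url) + 1) - CHARINDEX('/', Url) - 1);

-- Equality search instead of LIKE '%...%'
CREATE NONCLUSTERED INDEX IX_myTable_Controller ON dbo.myTable (Controller);

-- Or, for aggregates over large scans, a columnstore index on the relevant columns
CREATE NONCLUSTERED COLUMNSTORE INDEX NCCI_myTable ON dbo.myTable (Url, TS, tenantId);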
Based on your posted query you should have an index on the Url column. In general, columns involved in WHERE, HAVING, ORDER BY and JOIN ON conditions should be indexed.
You should get the generated query plan for the said query and see where it's spending the most time. Also, based on the datatype of the Url column, you may consider a FULLTEXT index on that column.

SQL Server detecting slow vs fast columns

I have an ASP.Net MVC application & I use PetaPoco and SQL Server.
My use case is that I want to allow a search on a table with many fields, but hide fields that are "slow", i.e. unindexed. I'm going to modify the PetaPoco T4 template to decorate the columns with this information.
I found this answer that gives you a list of tables vs indexes. My concern is that it shows a lot of columns for a particular table. Is the query given in the answer reliable for my use case? I.e. can the columns shown be included in the WHERE clause without being slow? I have some tables that have 40M rows. I don't want to include slow columns in the WHERE condition.
Or is there a better way to solve this problem ?
There are no slow columns in the sense of your question. You have to distinguish between two uses of a column.
Searching. When the column appears in the WHERE or JOIN clause, it slows down your query if there is no index for it.
Returning in the recordset. If the column appears in the SELECT clause, its content must be returned with each row, whether you need it or not. So for queries returning many rows, each additional column to be returned means a performance penalty.
Conclusion: As you can see, the performance impact of the SELECTed columns does NOT depend on indexes, but on the number of returned rows.
Advice: Create indexes for the columns used to search, and do not return unnecessary columns. Let your queries be as specific as possible in terms of both selected columns and returned rows.
I think it will not be that simple. You can check indexed columns using the suggested approach (or similar), but the fact that a column is present in an index does not mean your query will necessarily utilize it efficiently. For example if an index is created on columns A, B and C (in that order) and you only have a 'WHERE' clause on B or C (but not on A) you will probably end up with index scan rather than index seek and your query is likely to be slower than expected.
So your check should take into account the sequence of the columns in the indices - instantly fast columns (in your situation) might probably be considered the first columns of the indices (where ic.index_column_id = 1 in the post you mentioned). Columns that are not first in the indices (i.e. ic.index_column_id > 1) will be fast as long as the first columns are also included in the filter. There are other things you might also need to take into account (e.g. cardinality), but this is important to make sure you drive index seeks rather than scans.
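A sketch of how the key position can be read from the metadata (database-wide; key_ordinal = 1 marks a leading key column, higher values only help when the preceding key columns are also filtered, and is_included_column = 1 marks INCLUDE columns that never support a seek):

SELECT  OBJECT_NAME(ic.object_id) AS table_name,
        i.name  AS index_name,
        c.name  AS column_name,
        ic.key_ordinal,         -- 1 = leading key column; 0 = included column
        ic.is_included_column
FROM    sys.index_columns AS ic
JOIN    sys.indexes AS i ON i.object_id = ic.object_id AND i.index_id = ic.index_id
JOIN    sys.columns AS c ON c.object_id = ic.object_id AND c.column_id = ic.column_id
WHERE   i.type_desc IN ('CLUSTERED', 'NONCLUSTERED')
ORDER BY table_name, index_name, ic.key_ordinal;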

SQL Server not using proper index for query

I have a table on SQL Server with about 10 million rows. It has a nonclustered index ClearingInfo_idx which looks like:
I am running a query which isn't using the ClearingInfo_idx index, and the execution plan looks like this:
Can anyone explain why query optimizer chooses to scan clustered index ?
I think it suggests this index because you use an exact search on the two columns immediate and clearingOrder_clearingOrderId. Those values are numbers, which are good to search on. The column status is nvarchar, which isn't the best for a search, and due to your search with IN, SQL Server needs to look for two of those values.
SQL Server would use the two number columns to get a faster result, and search the status column in a second round, after the number of possible results has been reduced by the exact search on the two number columns.
Hopefully you get my opinion. :-) Otherwise, just ask again. :-)
As Luaan already pointed out, the likely reason the system prefers to scan the clustered index is because
you're asking for all fields to be returned (SELECT *). Change this to the fields that are present in the index ( = index fields + clustered-index fields) and you'll probably see it using just the index. If you need a couple of extra fields you can consider INCLUDEing those in the index (see the sketch after this list).
the order of the index fields isn't optimal. Additionally, it might well be that the 'content' of the fields isn't very helpful either. How many distinct values are present in the index columns and how are they spread around? If your WHERE covers 90% of the records, there is very little reason to first create a (huge) list of keys and then go fetch those from the clustered index later on. Scanning the latter directly then makes much more sense.
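As a purely hypothetical illustration (the table name is assumed from the index name, and the key order and extra column are made up since the actual index definition isn't reproduced here), a covering variant could look like:

-- Key columns drive the seek; INCLUDE carries the extra columns the SELECT needs,
-- so no key lookup into the clustered index is required
CREATE NONCLUSTERED INDEX IX_ClearingInfo_covering
    ON dbo.ClearingInfo (clearingOrder_clearingOrderId, immediate, status)
    INCLUDE (SomeOtherColumn);  -- hypothetical extra column returned by the SELECT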
Did you try the suggested index? Not sure what other queries run on the table, but for this particular query it seems like a valid replacement to me. Whether the replacement will satisfy the other queries is another question, of course. Adding extra indexes might negatively impact your IUD operations and it will require more disk space; there is no such thing as a free lunch =)
That said, if performance is an issue, have you considered a filtered index? (again, no such thing as a free lunch; it's all about priorities)
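A filtered index sketch with placeholder status values (only rows matching the filter are stored, which keeps the index small and cheap to maintain, but the query must use matching literals for the optimizer to pick it):

CREATE NONCLUSTERED INDEX IX_ClearingInfo_filtered
    ON dbo.ClearingInfo (clearingOrder_clearingOrderId, immediate)
    WHERE status IN (N'StatusA', N'StatusB');  -- placeholder status values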

What are the methods for identifying unnecessary columns within a covering index?

What methods are there for identifying superfluous columns in covering indices: columns which are never searched against, and therefore may be extracted into Includes, or even removed completely without affecting the applicability of the index?
To clarify things
The idea of a covering index is that it also includes columns which may not be searched by (used in the WHERE clause and such) but may be selected (part of the SELECT columns list).
There doesn't seem to be any easy way to assert the existence of unused columns in a covering index. I can only think of the painstaking process below:
For a representative period of time, record all queries being run on the server (or on the table desired)
Filter out (through regular expression) queries not involving the underlying table
For the remaining queries, obtain the query plan; discard queries not involving the index in question
For the remaining queries, or rather for each "template" of query (many queries are the same but for their search criteria values), make the list of the columns from the index that are in either the SELECT or WHERE clause (or in a JOIN...)
The columns from the index not found in that list are positively good to go.
Now, there may be a few more [columns to remove], because the process above doesn't check in which context the covering index is used (it is possible that it is used for resolving the WHERE, but that the underlying table is still accessed as well, for example to get to columns not in the covering index).
The above clinical approach is rather unattractive. An analytical approach may be preferable:
Find all query "templates" that may be used in all the applications using the server. For each of these patterns, find the ones which may be using the covering index. These are (again with a few holes...) queries that:
include a reference to the underlying table
do not cite in any way a column from the underlying table that is not a column in the index
do not use a search criterion from the underlying table that is more selective than the columns of the index (in their very order...)
Or... without even going to the applications: think of all the use cases, and whether queries serving those cases would benefit or not from all the columns in the index. Doing so would imply that you have a relatively good idea of the selectivity of the index, regarding its first few columns.
If you do audits of your use cases and data points, obviously anything that isn't used or caught in the audit is a candidate for deletion. If the database lacks such a thorough audit, you can save a time-window's worth of queries that hit the database by running a trace and saving it. You can analyze the trace and see what type of queries are hitting the database and from there intuit which columns can be dropped.
Trace analysis is typically used to find candidates for missing indices, but I'm guessing that it could be also used to analyze usage trends.
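On recent versions, that "trace" can be an Extended Events session rather than SQL Trace/Profiler. A minimal sketch (session name, file target and database name are placeholders):

CREATE EVENT SESSION CaptureQueries ON SERVER
ADD EVENT sqlserver.sql_batch_completed
    (ACTION (sqlserver.sql_text)
     WHERE sqlserver.database_name = N'MyDatabase')   -- placeholder database name
ADD TARGET package0.event_file (SET filename = N'CaptureQueries.xel')
WITH (MAX_DISPATCH_LATENCY = 5 SECONDS);

ALTER EVENT SESSION CaptureQueries ON SERVER STATE = START;
-- ...let it run for a representative window, then stop it and analyze the .xel file
ALTER EVENT SESSION CaptureQueries ON SERVER STATE = STOP;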
