How to speed up this t-sql query - sql-server

Table of People (name, dob, ssn etc)
Table of NewRecords (name, dob, ssn)
I want to write a query that determines which NewRecords do not in any way match any of the People (it's an update query that sets a flag in the NewRecords table).
Specifically, I want to find the NewRecords for which the Levenshtein distance between the first name, last name and ssn is greater than 2 for all records in People. (i.e. the person has a first, last and ssn that are all different from those in People and so is likely not a match).
I added a user-defined Levenshtein function (Levenshtein distance in T-SQL) and have already added an optimization with an extra parameter for the maximal allowed distance. (If the calculated Levenshtein distance climbs above the max allowed, the function exits early.) But the query is still taking an unacceptably long time because the tables are large.
What can I do to speed things up? How do I start thinking about optimization and performance? At what point do I need to start digging into the internals of sql server?
update NewRecords
set notmatchflag = 1
from NewRecords elr
inner join People pro
    on  dbo.Levenshtein_withThreshold(elr.last,  pro.last,  2) > 2
    and dbo.Levenshtein_withThreshold(elr.first, pro.first, 2) > 2
    and elr.ssn is null
    and elr.dob <> pro.dob

Since I don't know the exact table structure and the types of data involved, I'm not 100% sure this will work, but give it a go anyway!
I would first check the SQL execution plan when you test it, usually there will be some sections that will be taking the most amount of time. From there you should be able to gauge where/if any indexes would help.
My gut feeling, though, is that your function is being called a lot by the looks of things; hopefully the execution plan will confirm whether that is the case. If it is, then a CLR implementation of the function might be the way to go.
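If you do go the CLR route, the T-SQL side of wiring it up looks roughly like this - the assembly name, file path, class and method names below are just placeholders for whatever you actually build:
-- Register the compiled assembly (name and path are placeholders)
CREATE ASSEMBLY LevenshteinClr
FROM 'C:\clr\LevenshteinClr.dll'
WITH PERMISSION_SET = SAFE;
GO
-- Expose the CLR method as a scalar function callable from queries
CREATE FUNCTION dbo.Levenshtein_CLR (@s NVARCHAR(4000), @t NVARCHAR(4000), @maxDistance INT)
RETURNS INT
AS EXTERNAL NAME LevenshteinClr.[StringMetrics].Levenshtein;
GO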

There seems to be nothing wrong with your query (except maybe for the fact that you want to find all possible combinations of differing values, which in most scenarios will give a lot of results :)).
Anyway, the problem is your Levenshtein functions - I assume they're written in T-SQL. Even if you've optimized them, they're still slow. You really should compile them to CLR (the link that you posted already contains an example) - this will be an order of magnitude faster.
Another idea I'd try with what you've got is to somehow decrease the number of Levenshtein comparisons. Maybe find other conditions, or reverse the query: find all MATCHING records, and then flag what's left (that may also let you introduce additional conditions).
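For instance, the reversed approach might be sketched like this (untested; the "match" condition here is only illustrative and would need to mirror your real matching rules):
-- Flag a NewRecord only when no People row is a plausible match
UPDATE NewRecords
SET notmatchflag = 1
WHERE NOT EXISTS (
    SELECT 1
    FROM People pro
    WHERE pro.ssn = NewRecords.ssn
       OR ( dbo.Levenshtein_withThreshold(NewRecords.last,  pro.last,  2) <= 2
        AND dbo.Levenshtein_withThreshold(NewRecords.first, pro.first, 2) <= 2 )
)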
But Levenshtein compiled to CLR is your best option.

For one, skip the rows already flagged as true, so that if you run it again it will not process those.
That distance is expensive - try to eliminate those that don't have a chance first.
If the length differs by more than 2 then I don't think you can have a distance <= 2.
update NewRecords
set notmatchflag = 1
from NewRecords elr
inner join People pro
    on  elr.notmatchflag = 0
    and elr.ssn is null
    and elr.dob <> pro.dob
    and dbo.Levenshtein_withThreshold(elr.last,  pro.last,  2) > 2
    and dbo.Levenshtein_withThreshold(elr.first, pro.first, 2) > 2
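Building on the length observation above, here is a rough (untested) sketch that adds a cheap length check so the expensive UDF only has to run when a distance of 2 or less is even possible. The CASE is there to encourage SQL Server to evaluate the cheap test first, since plain AND/OR is not guaranteed to short-circuit:
update NewRecords
set notmatchflag = 1
from NewRecords elr
inner join People pro
    on  elr.notmatchflag = 0
    and elr.ssn is null
    and elr.dob <> pro.dob
    and case
            when abs(len(elr.last) - len(pro.last)) > 2 then 1   -- lengths too different: distance must exceed 2
            when dbo.Levenshtein_withThreshold(elr.last, pro.last, 2) > 2 then 1
            else 0
        end = 1
    and case
            when abs(len(elr.first) - len(pro.first)) > 2 then 1
            when dbo.Levenshtein_withThreshold(elr.first, pro.first, 2) > 2 then 1
            else 0
        end = 1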

Related

Bind Aware Cursor Matching Explanation

Hi, I am having a little trouble finding a simple explanation for bind aware cursor matching in Oracle. Is bind aware cursor matching basically Oracle monitoring a query with a bind variable over time and seeing if there's an increase in CPU when using some variables? Then from doing this it generates a more suitable execution plan (say a full table scan), marks the query as bind aware, and the next time the query is executed there is a choice of two execution plans? Any help will be greatly appreciated! Cheers!
In the simplest case, imagine that you have an ORDERS table. In that table is a status column. There are only a handful of status values and some are very, very popular while others are very rare. Imagine that the table has 10 million rows. For our purposes, say that 93% are "COMPLETE", 5% are "CANCELLED", and the remaining 2% are spread between 8 different statuses that track the order flow (INCOMPLETE, IN FULFILLMENT, IN TRANSIT, etc.).
If you have the most basic statistics on your table, the optimizer knows that there are 10 million rows and 10 distinct statuses. It doesn't know that some status values are more popular than others so it guesses that each status corresponds to 1 million rows. So when it sees a query like
SELECT *
FROM orders
WHERE status = :1
it guesses that it needs to fetch 1 million rows from the table regardless of the bind variable value so it decides to use a full table scan.
Now, a human comes along wondering why Oracle is being silly and doing a full table scan when he asks for the handful of orders that are in an IN TRANSIT status-- clearly an index scan would be preferable there. That human realizes that the optimizer needs more information in order to learn that some status values are more popular than others so that human decides to gather a histogram (there are options that cause Oracle to gather histograms on certain columns automatically as well but I'm ignoring those options to try to keep the story simple).
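For reference, gathering a histogram like that typically looks something like this with DBMS_STATS (the schema name is a placeholder, and 254 is just the traditional maximum bucket count):
-- Gather stats on ORDERS and request a histogram on the STATUS column
BEGIN
  DBMS_STATS.GATHER_TABLE_STATS(
    ownname    => 'APP_OWNER',                     -- placeholder schema
    tabname    => 'ORDERS',
    method_opt => 'FOR COLUMNS status SIZE 254');
END;
/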
Once the histogram is gathered, the optimizer knows that the status value is highly skewed-- there are lots of COMPLETED orders but very few IN TRANSIT orders. If it sees a query that is using literals rather than bind variables, i.e.
SELECT *
FROM orders
WHERE status = 'IN TRANSIT'
vs
SELECT *
FROM orders
WHERE status = 'COMPLETED'
then it is very easy for the optimizer to decide to use an index in the first case and table scan in the second. When you have a bind variable, though, the optimizer's job is more difficult-- how is it supposed to determine whether to use the index or to do a table scan...
Oracle's first solution was known as "bind variable peeking". In this approach, when the optimizer sees something like
SELECT *
FROM orders
WHERE status = :1
where it knows (because of the histogram on status) that the query plan should depend on the value passed in for the bind variable, Oracle "peeks" at the first value that is passed in to determine how to optimize the statement. If the first bind variable value is 'IN TRANSIT', an index scan will be used. If the first bind variable value is 'COMPLETE', a table scan will be used.
For a lot of cases, this works pretty well. Lots of queries really only make sense for either very popular or very rare values. In our example, it's pretty unlikely that anyone would ever really want a list of all 9 million COMPLETE orders but someone might want a list of the couple thousand orders in one of the various transitory states.
But bind variable peeking doesn't work well in other cases. If you have a system where the application sometimes binds very popular values and sometimes binds very rare values, you end up with a situation where application performance depends heavily on who happens to run a query first. If the first person to run the query uses a very rare value, the index scan plan will be generated and cached. If the second person to run the query uses the very common value, the cached plan will be used and you'll get an index scan that takes forever. If the roles are reversed, the second person uses the rare value, gets the cached plan that does a full table scan, and has to scan the entire table to get the couple hundred rows they're interested in. This sort of non-deterministic behavior tends to drive DBAs and developers mad because it can be maddeningly hard to diagnose and can lead to rather odd explanations-- Tom Kyte has an excellent example of a customer that concluded they needed to reboot the database in the afternoon if it rained Monday morning.
Bind aware cursor matching is the solution to the bind variable peeking problem. Now, when Oracle sees the query
SELECT *
FROM orders
WHERE status = :1
and sees that there is a histogram on status that indicates that some values are more common than others, it is smart enough to make that cursor "bind aware". That means that when you bind a value of IN FULFILLMENT, the optimizer is smart enough to conclude that this is one of the rare values and give you the index plan. When you bind a value of COMPLETE, the optimizer is smart enough to conclude that this is one of the common values and give you the plan with the table scan. So the optimizer now knows of two different plans for the same query and when it sees a new bind value like IN TRANSIT, it checks to see whether that value is similar to others that it has seen before and either gives you one of the existing plans or creates another new plan. In this case, it would decide that IN TRANSIT is roughly as common as IN FULFILLMENT so it re-uses the plan with the index scan rather than generating a third query plan. This, hopefully, leads to everyone getting their preferred plan without having to generate and cache query plans every time a bind variable value changes.
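If you want to see this happening on your own system (and you have access to the V$ views), the IS_BIND_SENSITIVE and IS_BIND_AWARE flags in V$SQL show which child cursors the optimizer has marked, something like:
-- One row per child cursor; bind-aware children are the extra plans described above
SELECT sql_id,
       child_number,
       is_bind_sensitive,
       is_bind_aware,
       executions
FROM   v$sql
WHERE  sql_text LIKE 'SELECT * FROM orders WHERE status%';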
Of course, in reality, there are lots of additional caveats, corner cases, considerations, and complications that I'm intentionally (and unintentionally) glossing over here. But that's the basic idea of what the optimizer is trying to accomplish.

The order of a SQL Select statement without Order By clause

As I know from relational database theory, a select statement without an order by clause should be considered to have no particular order. But actually in SQL Server and Oracle (I've tested on those 2 platforms), if I query a table without an order by clause multiple times, I always get the results in the same order. Can this behavior be relied on? Can anyone help explain a little?
No, that behavior cannot be relied on. The order is determined by the way the query planner has decided to build up the result set. Simple queries like select * from foo_table are likely to be returned in the order they are stored on disk, which may be in primary key order or the order they were created, or some other random order. More complex queries, such as select * from foo where bar < 10, may instead be returned in order of a different column, based on an index read, or by the table order, for a table scan. Even more elaborate queries, with multiple where conditions, group by clauses, unions, will be in whatever order the planner decides is most efficient to generate.
The order could even change between two identical queries just because of data that has changed between those queries. A "where" clause may be satisfied with an index scan in one query, but later inserts could make that condition less selective, and the planner could decide to perform a subsequent query using a table scan.
To put a finer point on it: RDBMS systems have the mandate to give you exactly what you asked for, as efficiently as possible. That efficiency can take many forms, including minimizing IO (both to disk and over the network to send data to you), minimizing CPU, and keeping the size of the working set small (using methods that require minimal temporary storage).
Without an ORDER BY clause, you have not asked for a particular order, and so the RDBMS will give you the rows in some order that (maybe) corresponds with some coincidental aspect of the query, based on whichever algorithm the RDBMS expects to produce the data the fastest.
If you care about efficiency, but not order, skip the ORDER BY clause. If you care about the order but not efficiency, use the ORDER BY clause.
Since you actually care about BOTH, use ORDER BY and then carefully tune your query and database so that it is efficient.
No, you can't rely on getting the results back in the same order every time. I discovered that when working on a web page with a paged grid. When I went to the next page, and then back to the previous page, the previous page contained different records! I was totally mystified.
For predictable results, then, you should include an ORDER BY. Even then, if there are identical values in the specified columns, you can get different results. You may have to ORDER BY fields that you didn't really think you needed, just to get a predictable result.
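As a small illustration (table and column names are made up), paging over a non-unique column usually needs a unique tiebreaker to be deterministic:
-- LastName alone is not unique, so add the primary key as a tiebreaker
SELECT PersonId, LastName, FirstName
FROM People
ORDER BY LastName, PersonId;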
Tom Kyte has a pet peeve about this topic. For whatever reason, people are fascinated by this, and keep trying to come up with cases where you can rely upon a specific order without specifying ORDER BY. As others have stated, you can't. Here's another amusing thread on the topic on the AskTom website.
The Right Answer
This is a new answer added to correct the old one. I got an answer from Tom Kyte and I post it here:
If you want rows sorted YOU HAVE TO USE AN ORDER. No if, and, or buts about it. period. http://tkyte.blogspot.ru/2005/08/order-in-court.html You need order by on that IOT. Rows are sorted in leaf blocks, but leaf blocks are not stored sorted. fast full scan=unsorted rows.
https://twitter.com/oracleasktom/status/625318150590980097
https://twitter.com/oracleasktom/status/625316875338149888
The Wrong Answer
(Attention! The original answer to the question is kept below only for the sake of history. It's the wrong answer. The right answer is above.)
As Tom Kyte wrote in the article mentioned before:
You should think of a heap organized table as a big unordered collection of rows. These rows will come out in a seemingly random order, and depending on other options being used (parallel query, different optimizer modes and so on), they may come out in a different order with the same query. Do not ever count on the order of rows from a query unless you have an ORDER BY statement on your query!
But note he only talks about heap-organized tables. There are also index-organized tables. In that case you can rely on the order of a select without ORDER BY, because the order is implicitly defined by the primary key. This is true for Oracle.
In SQL Server, clustered indexes (index-organized tables) are created by default. PostgreSQL can also store data aligned by an index. More information can be found here
UPDATE:
I see that my answer is being voted down, so I will try to explain my point a little.
In the section Overview of Index-Organized Tables there is a phrase:
In an index-organized table, rows are stored in an index defined on the primary key for the table... Index-organized tables are useful when related pieces of data must be stored together or data must be physically stored in a specific order.
http://docs.oracle.com/cd/E25054_01/server.1111/e25789/indexiot.htm#CBBJEBIH
Because of the index, all data is stored in a specific order; I believe the same is true for Pg.
http://www.postgresql.org/docs/9.2/static/sql-cluster.html
If you don't agree with me, please give me a link to the documentation. I'll be happy to know that there is something for me to learn.

Small table has very high cost in query plan

I am having an issue with a query where the query plan says that 15% of the execution cost is for one table. However, this table is very small (only 9 rows).
Clearly there is a problem if the smallest table involved in the query has the highest cost.
My guess is that the query keeps on looping over the same table again and again, rather than caching the results.
What can I do about this?
Sorry, I can't paste the exact code (which is quite complex), but here is something similar:
SELECT Foo.Id
FROM Foo
-- Various other joins have been removed for the example
LEFT OUTER JOIN SmallTable as st_1 ON st_1.Id = Foo.SmallTableId1
LEFT OUTER JOIN SmallTable as st_2 ON st_2.Id = Foo.SmallTableId2
WHERE (
-- various where clauses removed for the example
)
AND (st_1.Id is null OR st_1.Code = 7)
AND (st_2.Id is null OR st_2.Code = 4)
Take these execution-plan statistics with a wee grain of salt. If this table is "disproportionately small," relative to all the others, then those cost-statistics probably don't actually mean a hill o' beans.
I mean... think about it ... :-) ... if it's a tiny table, what actually is it? Probably, "it's one lousy 4K storage-page in a file somewhere." We read it in once, and we've got it, period. End of story. Nothing (actually...) there to index; no (actual...) need to index it; and, at the end of the day, the DBMS will understand this just as well as we do. Don't worry about it.
Now, having said that ... one more thing: make sure that the "cost" which seems to be attributed to "the tiny table" is not actually being incurred by very-expensive access to the tables to which it is joined. If those tables don't have decent indexes, or if the query as-written isn't able to make effective use of them, then there's your actual problem; that's what the query optimizer is actually trying to tell you. ("It's just a computer ... backwards things says it sometimes.")
Without the query plan it's difficult to solve your problem here, but there is one glaring clue in your example:
AND (st_1.Id is null OR st_1.Code = 7)
AND (st_2.Id is null OR st_2.Code = 4)
This is going to be incredibly difficult for SQL Server to optimize because it's nearly impossible to accurately estimate the cardinality. Hover over the elements of your query plan and look at EstimatedRows vs. ActualRows and EstimatedExecutions vs. ActualExecutions. My guess is these are way off.
Not sure what the whole query looks like, but you might want to see if you can rewrite it as two queries with a UNION operator rather than using the OR logic.
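Against the simplified example, that rewrite might look roughly like this (a sketch only, splitting just the first OR; the second OR could be handled the same way, and UNION ALL is safe here because the two branches are mutually exclusive):
-- Branch 1: rows with no st_1 match at all
SELECT Foo.Id
FROM Foo
LEFT OUTER JOIN SmallTable AS st_1 ON st_1.Id = Foo.SmallTableId1
LEFT OUTER JOIN SmallTable AS st_2 ON st_2.Id = Foo.SmallTableId2
WHERE st_1.Id IS NULL
  AND (st_2.Id IS NULL OR st_2.Code = 4)
UNION ALL
-- Branch 2: rows where st_1 matched and has the wanted code
SELECT Foo.Id
FROM Foo
LEFT OUTER JOIN SmallTable AS st_1 ON st_1.Id = Foo.SmallTableId1
LEFT OUTER JOIN SmallTable AS st_2 ON st_2.Id = Foo.SmallTableId2
WHERE st_1.Code = 7
  AND (st_2.Id IS NULL OR st_2.Code = 4)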
Well, with the limited information available, all I can suggest is that you ensure all columns being used for comparisons are properly indexed.
In addition, you haven't stated if you have an actual performance problem. Even if those table accesses took up 90% of the query time, it's most likely not a problem if the query only takes (for example) a tenth of a second.

How does SQL server evaluate the cost of an execution plan which contains a user defined function?

I have a stored procedure which filters based on the result of the DATEADD function - my understanding is that this is similar to using user-defined functions, in that because SQL Server cannot store statistics based on the output of that function, it has trouble evaluating the cost of an execution plan.
The query looks a little like this:
SELECT /* Columns */
FROM TableA
JOIN TableB ON TableA.id = TableB.join_id
WHERE DATEADD(hour, TableB.HoursDifferent, TableA.StartDate) <= @Now
(So it's not possible to pre-calculate the outcome of the DATEADD.)
What I'm seeing is a terrible, terrible execution plan which I believe is due to SQL Server incorrectly estimating the number of rows returned from a part of the tree as 1, when in fact it's ~65,000. I have however seen the same stored procedure execute in a fraction of the time when different (not necessarily less) data is present in the database.
My question is - in cases like these how does the query optimiser estimate the outcome of the function?
UPDATE: FYI, I'm more interested in understanding why some of the time I get a good execution plan and why the rest of the time I don't - I already have a pretty good idea of how I'm going to fix this in the long term.
It's not the costing of the plan that's the problem here. The function on the columns prevents SQL from doing index seeks. You're going to get an index scan or a table scan.
What I'd suggest is to see if you can get one of the columns out of the function, basically see if you can move the function to the other side of the equality. It's not perfect, but it means that at least one column can be used for an index seek.
Something like this (rough idea, not tested), with an index on TableB.HoursDifferent and an index on the join column in TableA:
DATEDIFF(hour, TableA.StartDate, @Now) >= TableB.HoursDifferent
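Those supporting indexes would be something along these lines (index names made up, and the join column may well already be covered by a primary key):
-- Index to seek on the bare HoursDifferent column
CREATE INDEX IX_TableB_HoursDifferent ON TableB (HoursDifferent);
-- Index on the join column in TableA
CREATE INDEX IX_TableA_Id ON TableA (id);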
On the costing side, I suspect that the optimiser will use the 30% of the table 'thumb-suck' because it can't use statistics to get an accurate estimate and because it's an inequality. Meaning it's going to guess that 30% of the table will be returned by that predicate.
It's really hard to say anything for sure without seeing the execution plans. You mention an estimate of 1 row and an actual of 65000. In some cases, that's not a problem at all.
http://sqlinthewild.co.za/index.php/2009/09/22/estimated-rows-actual-rows-and-execution-count/
It would help to see the function, but one thing I have seen is that burying functions like that in queries can result in poor performance. If you can evaluate some of it beforehand you might be in better shape. For example, instead of
WHERE MyDate < GETDATE()
Try
DECLARE @Today DATETIME
SET @Today = GETDATE()
...
WHERE MyDate < @Today
This seems to perform better.
@Kragen,
Short answer: If you are doing queries with ten tables, get used to it. You need to learn all about query hints, and a lot more tricks besides.
Long answer:
SQL Server generally generates excellent query plans for up to about three to five tables only. Once you go beyond that, in my experience, you are basically going to have to write the query plan yourself, using all the index and join hints. (In addition, scalar functions seem to get estimated at Cost = Zero, which is just mad.)
The reason is it is just too damn complicated after that. The query optimiser has to decide what to do algorithmically, and there are too many possible combinations for even the brightest geniuses on the SQL Server team to create an algorithm which works truly universally.
They say the optimiser is smarter than you. That may be true. But you have one advantage. That advantage is if it doesn't work, you can throw it out and try again! By about the sixth attempt you should have something acceptable, even for a ten-table join, if you know the data. The query optimiser cannot do that, it has to come up with some sort of plan instantly, and it gets no second chances.
My favourite trick is to force the order of the where clause by converting it to a case statement. Instead of:
WHERE
    predicate1
    AND predicate2
    AND ...
Use this:
WHERE
    case
        when not predicate1 then 0
        when not predicate2 then 0
        when not ...        then 0
        else 1
    end = 1
Order your predicates cheapest to most expensive, and you get an outcome which is logically the same but which SQL server doesn't get to mess around with - it has to do them in the order you say.
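As a concrete (made-up) illustration of the pattern, with a cheap flag check placed ahead of an expensive scalar UDF - the table, column, and function names here are hypothetical:
-- Cheap predicate first, expensive UDF last; the CASE fixes the evaluation order
SELECT o.OrderId
FROM dbo.Orders o
WHERE case
          when o.IsActive = 0 then 0                       -- cheap column check
          when dbo.ExpensiveCheck(o.OrderId) = 0 then 0    -- costly scalar UDF
          else 1
      end = 1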

Can Multiple Indexes Work Together?

Suppose I have a database table with two fields, "foo" and "bar". Neither of them are unique, but each of them are indexed. However, rather than being indexed together, they each have a separate index.
Now suppose I perform a query such as SELECT * FROM sometable WHERE foo='hello' AND bar='world'; My table has a huge number of rows for which foo is 'hello' and a small number of rows for which bar is 'world'.
So the most efficient thing for the database server to do under the hood is use the bar index to find all fields where bar is 'world', then return only those rows for which foo is 'hello'. This is O(n) where n is the number of rows where bar is 'world'.
However, I imagine it's possible that the process would happen in reverse, where the foo index was used and the results searched. This would be O(m) where m is the number of rows where foo is 'hello'.
So is Oracle smart enough to search efficiently here? What about other databases? Or is there some way I can tell it in my query to search in the proper order? Perhaps by putting bar='world' first in the WHERE clause?
Oracle will almost certainly use the most selective index to drive the query, and you can check that with the explain plan.
Furthermore, Oracle can combine the use of both indexes in a couple of ways -- it can convert the btree indexes to bitmaps and perform a bitmap AND operation on them, or it can perform a hash join on the rowids returned by the two indexes.
One important consideration here might be any correlation between the values being queried. If foo='hello' accounts for 80% of values in the table and bar='world' accounts for 10%, then Oracle is going to estimate that the query will return 0.8*0.1 = 8% of the table rows. However this may not be correct - the query may actually return 10% of the rows or even 0% of the rows depending on how correlated the values are. Now, depending on the distribution of those rows throughout the table it may not be efficient to use an index to find them. You may still need to access (say) 70% of the table blocks to retrieve the required rows (google for "clustering factor"), in which case Oracle is going to perform a full table scan if it gets the estimation correct.
In 11g you can collect multicolumn statistics to help with this situation I believe. In 9i and 10g you can use dynamic sampling to get a very good estimation of the number of rows to be retrieved.
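For the 11g multicolumn statistics, the column group is created with DBMS_STATS.CREATE_EXTENDED_STATS and then picked up by a normal stats gather - roughly like this (the schema defaults to the current user here):
-- Create a column group on (FOO, BAR) so the optimizer learns their correlation
DECLARE
  l_ext_name VARCHAR2(30);
BEGIN
  l_ext_name := DBMS_STATS.CREATE_EXTENDED_STATS(
                  ownname   => USER,
                  tabname   => 'SOMETABLE',
                  extension => '(FOO,BAR)');
  DBMS_STATS.GATHER_TABLE_STATS(USER, 'SOMETABLE');
END;
/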
To get the execution plan do this:
explain plan for
SELECT *
FROM sometable
WHERE foo='hello' AND bar='world'
/
select * from table(dbms_xplan.display)
/
Contrast that with:
explain plan for
SELECT /*+ dynamic_sampling(4) */
*
FROM sometable
WHERE foo='hello' AND bar='world'
/
select * from table(dbms_xplan.display)
/
Eli,
In a comment you wrote:
Unfortunately, I have a table with lots of columns each with their own index. Users can query any combination of fields, so I can't efficiently create indexes on each field combination. But if I did only have two fields needing indexes, I'd completely agree with your suggestion to use two indexes. – Eli Courtwright (Sep 29 at 15:51)
This is actually rather crucial information. Sometimes programmers outsmart themselves when asking questions. They try to distill the question down to the seminal points but quite often oversimplify and miss getting the best answer.
This scenario is precisely why bitmap indexes were invented -- to handle the times when unknown groups of columns would be used in a where clause.
Just in case someone says that BMIs are for low-cardinality columns only and may not apply to your case: low is probably not as small as you think. The only real issue is concurrency of DML against the table; it must be single-threaded or rare for this to work.
Yes, you can give "hints" with the query to Oracle. These hints are disguised as comments ("/* HINT */") to the database and are mainly vendor specific. So one hint for one database will not work on an other database.
I would use index hints here, the first hint for the small table. See here.
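An Oracle index hint for this query would look something like the following (the index name is hypothetical):
SELECT /*+ INDEX(t ix_sometable_bar) */ *
FROM   sometable t
WHERE  foo = 'hello'
  AND  bar = 'world';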
On the other hand, if you often search over these two fields, why not create an index on these two? I do not have the right syntax, but it would be something like
CREATE INDEX IX_BAR_AND_FOO on sometable(bar,foo);
This way data retrieval should be pretty fast. And in case the concatenation is unique then you simply create a unique index, which should be lightning fast.
First off, I'll assume that you are talking about nice, normal, standard b*-tree indexes. The answer for bitmap indexes is radically different. And there are lots of options for various types of indexes in Oracle that may or may not change the answer.
At a minimum, if the optimizer is able to determine the selectivity of a particular condition, it will use the more selective index (i.e. the index on bar). But if you have skewed data (there are N values in the column bar but the selectivity of any particular value is substantially more or less than 1/N of the data), you would need to have a histogram on the column in order to tell the optimizer which values are more or less likely. And if you are using bind variables (as all good OLTP developers should), depending on the Oracle version, you may have issues with bind variable peeking.
Potentially, Oracle could even do an on the fly conversion of the two b*-tree indexes to bitmaps and combine the bitmaps in order to use both indexes to find the rows it needs to retrieve. But this is a rather unusual query plan, particularly if there are only two columns where one column is highly selective.
"So is Oracle smart enough to search efficiently here?"
The simple answer is "probably". There are lots'o' very bright people at each of the database vendors working on optimizing the query optimizer, so it's probably doing things that you haven't even thought of. And if you update the statistics, it'll probably do even more.
I'm sure you can also have Oracle display a query plan so you can see exactly which index is used first.
The best approach would be to add foo to bar's index, or add bar to foo's index (or both). If foo's index also contains an index on bar, that additional indexing level will not affect the utility of the foo index in any current uses of that index, nor will it appreciably affect the performance of maintaining that index, but it will give the database additional information to work with in optimizing queries such as in the example.
It's better than that.
Index Seeks are always quicker than full table scans. So behind the scenes Oracle (and SQL server for that matter) will first locate the range of rows on both indices. It will then look at which range is shorter (seeing that it's an inner join), and it will iterate the shorter range to find the matches with the larger of the two.
You can provide hints as to which index to use. I'm not familiar with Oracle, but in MySQL you can use USE INDEX | IGNORE INDEX | FORCE INDEX (see here for more details). For best performance though you should use a combined index.
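In MySQL that would look roughly like this (again, the index name is hypothetical):
SELECT *
FROM sometable FORCE INDEX (ix_bar)
WHERE foo = 'hello'
  AND bar = 'world';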
