Hi, I am having a little trouble finding a simple explanation for bind aware cursor matching in Oracle. Is bind aware cursor matching basically Oracle monitoring a query with a bind variable over time and seeing whether CPU use increases for some values? And from doing this, does it then generate a more suitable execution plan (say, a full table scan), mark the query as bind aware, and give itself a choice of two execution plans the next time the query is executed? Any help will be greatly appreciated! Cheers!
In the simplest case, imagine that you have an ORDERS table. In that table is a status column. There are only a handful of status values, and some are very, very popular while others are very rare. Imagine that the table has 10 million rows. For our purposes, say that 93% are "COMPLETE", 5% are "CANCELLED", and the remaining 2% are spread across 8 different statuses that track the order flow (INCOMPLETE, IN FULFILLMENT, IN TRANSIT, etc.).
If you have the most basic statistics on your table, the optimizer knows that there are 10 million rows and 10 distinct statuses. It doesn't know that some status values are more popular than others so it guesses that each status corresponds to 1 million rows. So when it sees a query like
SELECT *
FROM orders
WHERE status = :1
it guesses that it needs to fetch 1 million rows from the table regardless of the bind variable value so it decides to use a full table scan.
Now, a human comes along wondering why Oracle is being silly and doing a full table scan when he asks for the handful of orders that are in an IN TRANSIT status-- clearly an index scan would be preferable there. That human realizes that the optimizer needs more information to learn that some status values are more popular than others, so he decides to gather a histogram (there are options that cause Oracle to gather histograms on certain columns automatically as well, but I'm ignoring those to keep the story simple).
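For reference, a minimal sketch of how that human might gather such a histogram with DBMS_STATS (assuming the ORDERS table is in the current schema; SIZE 254 asks Oracle for up to 254 histogram buckets):

BEGIN
  DBMS_STATS.GATHER_TABLE_STATS(
    ownname    => USER,                           -- current schema (an assumption)
    tabname    => 'ORDERS',
    method_opt => 'FOR COLUMNS STATUS SIZE 254'   -- request a histogram on STATUS
  );
END;
/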
Once the histogram is gathered, the optimizer knows that the status column is highly skewed-- there are lots of COMPLETE orders but very few IN TRANSIT orders. If it sees a query that is using literals rather than bind variables, i.e.
SELECT *
FROM orders
WHERE status = 'IN TRANSIT'
vs
SELECT *
FROM orders
WHERE status = 'COMPLETE'
then it is very easy for the optimizer to decide to use an index in the first case and a table scan in the second. When you have a bind variable, though, the optimizer's job is more difficult-- how is it supposed to determine whether to use the index or to do a table scan?
Oracle's first solution was known as "bind variable peeking". In this approach, when the optimizer sees something like
SELECT *
FROM orders
WHERE status = :1
where it knows (because of the histogram on status) that the query plan should depend on the value passed in for the bind variable, Oracle "peeks" at the first value that is passed in to determine how to optimize the statement. If the first bind variable value is 'IN TRANSIT', an index scan will be used. If the first bind variable value is 'COMPLETE', a table scan will be used.
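As an aside, if you want to see which value Oracle actually peeked for a cursor, DBMS_XPLAN can show it. A sketch, where passing NULLs means "the last statement executed in this session":

SELECT *
FROM TABLE(DBMS_XPLAN.DISPLAY_CURSOR(NULL, NULL, 'TYPICAL +PEEKED_BINDS'));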
For a lot of cases, this works pretty well. Lots of queries really only make sense for either very popular or very rare values. In our example, it's pretty unlikely that anyone would ever really want a list of all 9 million COMPLETE orders but someone might want a list of the couple thousand orders in one of the various transitory states.
But bind variable peeking doesn't work well in other cases. If you have a system where the application sometimes binds very popular values and sometimes binds very rare values, you end up with a situation where application performance depends heavily on who happens to run a query first. If the first person to run the query uses a very rare value, the index scan plan will be generated and cached. If the second person to run the query uses the very common value, the cached plan will be used and you'll get an index scan that takes forever. If the roles are reversed, the second person uses the rare value, gets the cached plan that does a full table scan, and has to scan the entire table to get the couple hundred rows they're interested in. This sort of non-deterministic behavior tends to drive DBAs and developers mad because it can be maddeningly hard to diagnose and can lead to rather odd explanations-- Tom Kyte has an excellent example of a customer that concluded they needed to reboot the database in the afternoon if it rained Monday morning.
Bind aware cursor matching is the solution to the bind variable peeking problem. Now, when Oracle sees the query
SELECT *
FROM orders
WHERE status = :1
and sees that there is a histogram on status indicating that some values are more common than others, it is smart enough to make that cursor "bind aware". That means that when you bind a value of IN FULFILLMENT, the optimizer is smart enough to conclude that this is one of the rare values and give you the index plan. When you bind a value of COMPLETE, the optimizer is smart enough to conclude that this is one of the common values and give you the plan with the table scan.

So the optimizer now knows of two different plans for the same query, and when it sees a new bind value like IN TRANSIT, it checks whether that value is similar to others it has seen before and either gives you one of the existing plans or creates another new plan. In this case, it would decide that IN TRANSIT is roughly as common as IN FULFILLMENT, so it re-uses the plan with the index scan rather than generating a third query plan. This, hopefully, leads to everyone getting their preferred plan without having to generate and cache query plans every time a bind variable value changes.
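One way to watch this happen (assuming you can query v$sql) is to look at the flags Oracle sets on the cursor once it notices the skew:

SELECT sql_id,
       child_number,
       is_bind_sensitive,   -- plan may depend on bind values
       is_bind_aware,       -- optimizer is actively matching plans to bind values
       plan_hash_value
FROM   v$sql
WHERE  sql_text LIKE 'SELECT * FROM orders WHERE status%';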
Of course, in reality, there are lots of additional caveats, corner cases, considerations, and complications that I'm intentionally (and unintentionally) glossing over here. But that's the basic idea of what the optimizer is trying to accomplish.
Related
I have a query in SSMS that gives me the same number of rows but in a different order each time I hit the F5 key. A similar problem is described in this post:
Query returns a different result every time it is run
The response given is to include an ORDER BY clause because, as the response in that post explains, SQL Server guesses the order if you don't give it one.
OK, that does fix it, but I'm confused about what SQL Server is doing. Tables have a physical order, whether they are heaps or have clustered indexes. The physical order of each table does not change with every execution of the query, and neither does the query itself. We should see the same results each time! What is it doing-- accessing the tables in their physical order and then, instead of displaying the results in that unchanging physical order, randomly sorting them? Why? What am I missing? Thanks!
Simple - if you want records in a certain order, then ask for them in a certain order.
If you don't ask for an order, it does not guess. SQL just does what is convenient.
One way that you can get different ordering is if parallelism is at play. Imagine a simple select (i.e. select * from yourTable). Let's say that the optimizer produces a parallel plan for that query and that the degree of parallelism is 4. Each thread will process (roughly) 1/4 of the table. But if yours isn't the only workload on the server, each thread will move between the running and runnable states (just by the nature of how the SQLOS schedules threads, they will go into runnable from time to time even if yours is the only workload on the server, but it's exacerbated if you have to share). Since you can't control which threads are running at any given time, and since each thread is going to return its results as soon as it's retrieved them (since it doesn't have to do any joins, aggregates, etc.), the order in which the rows come back is non-deterministic.
To test this theory, try to force a serial plan with the maxdop = 1 query hint.
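For example, with the hypothetical table from above:

SELECT *
FROM yourTable
OPTION (MAXDOP 1);  -- force a serial plan; if the ordering stabilizes, parallelism was the culprit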
SQL Server uses a set of statistics for each table to assist with speed, joins, etc. If the stats give an ambiguous choice for the fastest route, SQL's choice can be arbitrary - and could require slightly different indexing to achieve... hence a different output order. The physical order is only a small factor in predicting order. Any indexes, joins, or where clauses can affect the order, as SQL will also create and use its own temporary structures (such as spools) to help satisfy the query if appropriate indexes do not already exist. Try recalculating the statistics on each table involved and see if there is any change or consistency after that.
You are probably not getting random order each time, but rather an arbitrary choice between a handful of similarly weighted pathways to get the same result from the query.
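A sketch of the statistics refresh suggested above (the table name is hypothetical; WITH FULLSCAN reads every row instead of sampling):

UPDATE STATISTICS dbo.yourTable WITH FULLSCAN;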
Table of People (name, dob, ssn etc)
Table of NewRecords (name, dob, ssn)
I want to write a query that determines which NewRecords do not in any way match any of the People (it's an update query that sets a flag in the NewRecords table).
Specifically, I want to find the NewRecords for which the Levenshtein distance between the first name, last name and ssn is greater than 2 for all records in People. (i.e. the person has a first, last and ssn that are all different from those in People and so is likely not a match).
I added a user-defined Levenshtein function (Levenshtein distance in T-SQL) and have already added an optimization that takes an extra parameter for the maximal allowed distance. (If the calculated Levenshtein distance climbs above the max allowed, the function exits early.) But the query is still taking an unacceptably long time because the tables are large.
What can I do to speed things up? How do I start thinking about optimization and performance? At what point do I need to start digging into the internals of sql server?
update elr
set notmatchflag = 1   -- 1, not true: T-SQL bit columns have no boolean literal
from
newRecords elr
inner join People pro
on
dbo.Levenshtein_withThreshold(elr.last, pro.last, 2) > 2 and
dbo.Levenshtein_withThreshold(elr.first, pro.first, 2) > 2 and
elr.ssn is null and
elr.dob <> pro.dob
Since I don't know exactly the table structure and the types of data, I'm not 100% sure this would work, but give it a go anyway!
I would first check the SQL execution plan when you test it, usually there will be some sections that will be taking the most amount of time. From there you should be able to gauge where/if any indexes would help.
My gut feeling, though, is that it's your function being called A LOT by the looks of things, but hopefully the execution plan will determine if that is the case. If it is, then a CLR stored procedure might be the way to go.
There seems to be nothing wrong with your query (except maybe for the fact that you want to find all possible combinations of differing values, which in most scenarios will give a lot of results :)).
Anyway, the problem is your Levenshtein functions - I assume they're written in T-SQL. Even if you've optimized them, they're still slow. You really should compile them to CLR (the link that you posted already contains an example) - this will be an order of magnitude faster.
Another idea I'd try with what you've got is to somehow decrease the number of Levenshtein comparisons. Maybe find other conditions, or reverse the query: find all MATCHING records, and then select what's left (it may enable you to introduce those additional conditions).
But Levenshtein compiled to CLR is your best option.
For one, skip the rows that are already flagged, so if you run it again it will not process those.
That distance function is expensive - try to eliminate the pairs that don't have a chance first.
If the lengths differ by more than 2, then I don't think you can have a distance <= 2 (see the sketch after the query below).
update elr
set notmatchflag = 1
from newRecords elr
inner join People pro
  on elr.notmatchflag = 0
 and elr.ssn is null
 and elr.dob <> pro.dob
 and dbo.Levenshtein_withThreshold(elr.last, pro.last, 2) > 2
 and dbo.Levenshtein_withThreshold(elr.first, pro.first, 2) > 2
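Building on that, a sketch of the length pre-filter (assuming last and first are string columns; the CASE expression returns 3, which already satisfies > 2, whenever the cheap length test proves the distance must exceed 2, so the expensive function is skipped for those rows):

update elr
set notmatchflag = 1
from newRecords elr
inner join People pro
  on elr.notmatchflag = 0
 and elr.ssn is null
 and elr.dob <> pro.dob
 -- lengths differing by more than 2 guarantee a distance above 2: no function call needed
 and case when abs(len(elr.last) - len(pro.last)) > 2 then 3
          else dbo.Levenshtein_withThreshold(elr.last, pro.last, 2) end > 2
 and case when abs(len(elr.first) - len(pro.first)) > 2 then 3
          else dbo.Levenshtein_withThreshold(elr.first, pro.first, 2) end > 2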
As I understand it from relational database theory, a select statement without an order by clause should be considered to have no particular order. But in practice, in SQL Server and Oracle (I've tested on those two platforms), if I query a table without an order by clause multiple times, I always get the results in the same order. Can this behavior be relied on? Can anyone help explain a little?
No, that behavior cannot be relied on. The order is determined by the way the query planner has decided to build up the result set. Simple queries like select * from foo_table are likely to be returned in the order they are stored on disk, which may be in primary key order or the order they were created, or some other random order. More complex queries, such as select * from foo where bar < 10, may instead be returned in order of a different column, based on an index read, or by the table order, for a table scan. Even more elaborate queries, with multiple where conditions, group by clauses, or unions, will be in whatever order the planner decides is most efficient to generate.
The order could even change between two otherwise identical queries just because of data that has changed between them. A "where" clause may be satisfied with an index scan in one query, but later inserts could make that condition less selective, and the planner could decide to perform a subsequent query using a table scan.
To put a finer point on it: RDBMS systems have the mandate to give you exactly what you asked for, as efficiently as possible. That efficiency can take many forms, including minimizing IO (both to disk and over the network to send data to you), minimizing CPU, and keeping the size of the working set small (using methods that require minimal temporary storage).
Without an ORDER BY clause, you have not asked for a particular order, and so the RDBMS gives you the rows in some order that (maybe) corresponds to some coincidental aspect of the query, based on whichever algorithm the RDBMS expects to produce the data the fastest.
If you care about efficiency, but not order, skip the ORDER BY clause. If you care about the order but not efficiency, use the ORDER BY clause.
Since you actually care about BOTH, use ORDER BY, and then carefully tune your query and database so that it is efficient.
No, you can't rely on getting the results back in the same order every time. I discovered that when working on a web page with a paged grid. When I went to the next page, and then back to the previous page, the previous page contained different records! I was totally mystified.
For predictable results, then, you should include an ORDER BY. Even then, if there are identical values in the specified columns, you can get different results; you may have to ORDER BY fields that you didn't really think you needed, just to get a predictable result.
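For example (hypothetical table and columns), adding the primary key as a tie-breaker makes paging deterministic even when the column you actually care about has duplicate values:

SELECT Id, CustomerName, CreatedDate
FROM Orders
ORDER BY CreatedDate, Id;  -- Id breaks ties between rows sharing a CreatedDate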
Tom Kyte has a pet peeve about this topic. For whatever reason, people are fascinated by this, and keep trying to come up with cases where you can rely upon a specific order without specifying ORDER BY. As others have stated, you can't. Here's another amusing thread on the topic on the AskTom website.
The Right Answer
This is a new answer, added to correct the old one. I got an answer from Tom Kyte and I post it here:
If you want rows sorted YOU HAVE TO USE AN ORDER. No if, and, or buts about it. period. http://tkyte.blogspot.ru/2005/08/order-in-court.html
You need order by on that IOT. Rows are sorted in leaf blocks, but leaf blocks are not stored sorted. fast full scan=unsorted rows.
https://twitter.com/oracleasktom/status/625318150590980097
https://twitter.com/oracleasktom/status/625316875338149888
The Wrong Answer
(Attention! The original answer to the question is kept below only for the sake of history. It's the wrong answer. The right answer is above.)
As Tom Kyte wrote in the article mentioned before:
You should think of a heap organized table as a big unordered collection of rows. These rows will come out in a seemingly random order, and depending on other options being used (parallel query, different optimizer modes and so on), they may come out in a different order with the same query. Do not ever count on the order of rows from a query unless you have an ORDER BY statement on your query!
But note he only talks about heap-organized tables. There are also index-organized tables. In that case you can rely on the order of a select without ORDER BY, because the order is implicitly defined by the primary key. That is true for Oracle.
For SQL Server, clustered indexes (index-organized tables) are created by default. There is also the possibility for PostgreSQL to store information aligned by an index. More information can be found here
UPDATE:
I see that my answer is being voted down. So I will try to explain my point a little bit.
In the section Overview of Index-Organized Tables there is a phrase:
In an index-organized table, rows are stored in an index defined on the primary key for the table... Index-organized tables are useful when related pieces of data must be stored together or data must be physically stored in a specific order.
http://docs.oracle.com/cd/E25054_01/server.1111/e25789/indexiot.htm#CBBJEBIH
Because of the index, all data is stored in a specific order; I believe the same is true for Pg.
http://www.postgresql.org/docs/9.2/static/sql-cluster.html
If you don't agree with me, please give me a link to the documentation. I'll be happy to know that there is something for me to learn.
I have a private messaging system for my browser game. When I look at the most CPU-consuming queries, I see that the query against this table uses the most CPU. I am not good with indexes or query-time optimization, so I would like to get your optimization tips for this table.
Alright, I am going to show you the table structure first:
structure image
Alright, the following query reads how many unread messages a user has; it is the most CPU-consuming one since it runs on every page load:
SELECT COUNT([Id]) [Number]
FROM [MyTable]
WHERE [ReceiverUserId] = #1
AND [ReceiverReaded] = #2
AND [ReceiverDeleted] = #3
So what kind of indexes etc might improve my performance?
Why allow NULLs on those columns at all? Either it's read or it's not - just default to 0. Then index on ReceiverReaded/ReceiverDeleted/ReceiverUserId (in that order they will be "partitioned" if you need a lot of ALL-READ access; alternatively, if most reads are just for a single user, put an index on ReceiverUserId).
What you want is for your index to be covering. In your case, you could put an index on ReceiverUserId and INCLUDE the columns ReceiverReaded and ReceiverDeleted, and it would be covering (for that query). In the execution plan you should then just see an index seek, since you are looking up a single user.
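A sketch of that covering index (table and column names taken from the question; note that if Id is the clustered primary key, it travels with every nonclustered index automatically, so COUNT([Id]) remains covered):

CREATE NONCLUSTERED INDEX IX_MyTable_ReceiverUserId
ON [MyTable] ([ReceiverUserId])
INCLUDE ([ReceiverReaded], [ReceiverDeleted]);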
You could capture the workload and then run it through the index tuning wizard in SQL Server and it would probably make pretty good suggestions. You need to interpret what it's telling you, of course.
You always want indexes on the fields you are searching for, so you would probably improve the query performance by adding indexes on [ReceiverUserId], [ReceiverReaded] and [ReceiverDeleted].
Of course the more columns you index, the slower your UPDATES and INSERTS will be.
A fairly simple rule of thumb in db optimization is to index any column that appears as part of a predicate in a WHERE clause or a JOIN. From your example these would include:
ReceiverUserId
ReceiverReaded
ReceiverDeleted
There are also a number of optimizer tools available that will "observe" your db and tell you what columns to index for best performance.
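A sketch of an index on the three columns listed above, putting all of them in the key (an alternative to the INCLUDE variant shown earlier; names are from the question):

CREATE NONCLUSTERED INDEX IX_MyTable_Receiver
ON [MyTable] ([ReceiverUserId], [ReceiverReaded], [ReceiverDeleted]);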
A different approach that may be viable for your application: don't query the messages table at all when the user isn't explicitly requesting any content, e.g. when he's not in the "messages" section of your game.
Try extending your user table with integer-valued columns indicating how many messages there are and how many have already been read. Every time you modify the messages table, you also modify the corresponding value in the user table.
This way you won't need to look through the whole table on every refresh. Note, though, that the downside of this method is some extra synchronization work on the programmer's part. If you've encapsulated the modification of the messages table (add message, read message, delete message) properly, this shouldn't be a problem.
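A minimal sketch of that bookkeeping (the Users table and column names are hypothetical; the same pattern applies when a message is read or deleted):

-- when a new message arrives for a user:
UPDATE Users
SET UnreadMessageCount = UnreadMessageCount + 1
WHERE Id = @ReceiverUserId;

-- the per-page check then becomes a cheap single-row read:
SELECT UnreadMessageCount
FROM Users
WHERE Id = @UserId;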
Index your search fields (per @PaulStock's answer).
Change your tinyint fields to bit fields (default value = 0)
Does your body really need to be nvarchar(4000)? That's HUGE! Consider much shorter messages (such as nvarchar(300) or smaller -- for reference, Twitter is just 140 characters).
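If you do shrink the column, the change itself is a one-liner (hypothetical column name; verify first that no existing rows exceed the new length, and re-state NOT NULL if the column had it):

ALTER TABLE [MyTable] ALTER COLUMN [Body] nvarchar(300);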
Say I have a query that returns 10,000 records. When the first record has returned what can I assume about the state of my query?
Has it finished and is just returning records from the server to my instance of SSMS?
Is the query itself still being executed on the server?
What is it that causes the 10,000 records to be slowly returned for one query and nearly instantly for another?
There is potentially some mix of progressive processing on the server side, network transfer of the data, and rendering by the client.
If one query returns 10,000 rows quickly, and another one slowly -- and they are of similar row size, data types, etc., and are both destined for results to grid or results to text -- there is little we can do to analyze the differences unless you show us execution plans and/or client statistics for each one. These are options you can set in SSMS when running a query.
As an aside, switching between results to grid and results to text you might notice slightly different runtimes. This is because in one case Management Studio has to work harder to align the columns etc.
You cannot make a generic assumption: a query's plan is composed of a number of different types of operations, or iterators. Some of these are navigational and work like a pipeline, whilst others are set-based operations, such as a sort.
If a query contains a set-based operation, it requires all the records before it can output any results (e.g. an order by clause within your statement). But if you have no set-based iterators, you can expect the rows to be streamed to you as they become available.
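To illustrate with a hypothetical table: the first statement below can stream rows back as they are found, while the second must finish the sort before the first row appears (unless an index already provides the order):

SELECT * FROM big_table WHERE amount > 10;                    -- streams rows as they are found
SELECT * FROM big_table WHERE amount > 10 ORDER BY other_col; -- the sort blocks until all rows are read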
The answer to each of your individual questions is "it depends."
For example, consider if you include an order by clause, and there isn't an index for the column(s) you're ordering by. In this case, the server has to find all the records that satisfy your query, then sort them, before it can return the first record. This causes a long pause before you get your first record, but you (should normally) get them quite quickly once you start getting any.
Without the order by clause, the server will normally send each record as its found, so the first record will often show up sooner, but you may see a long pause between one record and the next.
As far as simply "why is one query faster than another" goes, a lot depends on what indexes are available and whether they can be used for a particular query. For example, something like some_column like '%something' will almost always be quite slow. The leading '%' means this won't be able to use an index, even if some_column has one. A search for something% instead of %something% might easily be 100 or 1000 times faster. If you really need the former, you really want to use full-text searching instead (create a full-text index, and use contains() instead of like), as sketched below.
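A sketch of the contrast (hypothetical table; CONTAINS requires a full-text index on the column):

SELECT * FROM articles WHERE title LIKE '%something%';     -- leading wildcard: the index can't be seeked
SELECT * FROM articles WHERE title LIKE 'something%';      -- can seek an index on title
SELECT * FROM articles WHERE CONTAINS(title, 'something'); -- uses the full-text index instead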
Of course, a lot can also depend simply on whether the database has an index for a particular column (or group of columns). With a suitable index, the query will usually be quite a lot faster.