I have a table sorted by id, and I want to retrieve n rows after skipping the first m rows. When I run the query without ORDER BY it produces unpredictable results, and with ORDER BY id it takes too much time. Since my table is already ordered, I just want to retrieve n rows, skipping the first m.
The documentation says:
The query planner takes LIMIT into account when generating a query plan, so you are very likely to get different plans (yielding different row orders) depending on what you use for LIMIT and OFFSET. Thus, using different LIMIT/OFFSET values to select different subsets of a query result will give inconsistent results unless you enforce a predictable result ordering with ORDER BY. This is not a bug; it is an inherent consequence of the fact that SQL does not promise to deliver the results of a query in any particular order unless ORDER BY is used to constrain the order.
So what you are trying cannot really be done.
I suppose you could play with the planner options or rearrange the query in a clever way to trick the planner into using a plan that suits you, but without actually showing the query, it's hard to say.
SELECT * FROM mytable LIMIT 100 OFFSET 0
You really shouldn't rely on implicit ordering, though, because you can't predict the physical order in which data ends up in the database.
As pointed out above, SQL does not guarantee anything about order unless you have an ORDER BY clause. LIMIT can still be useful in such a situation, but I can't think of any use for OFFSET. It sounds like you don't have an index on id, because if you do, the query should be extremely fast, clustered or not. Take another look at that. (Also check CLUSTER, which may improve your performance at the margin.)
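For example, assuming the table is called mytable with an integer id column (hypothetical names), a minimal sketch of the indexed, ordered pagination in PostgreSQL might look like this:

-- hypothetical names: mytable, id
CREATE INDEX mytable_id_idx ON mytable (id);

-- fetch 100 rows, skipping the first 500, in a deterministic order
SELECT *
FROM mytable
ORDER BY id
LIMIT 100 OFFSET 500;

With that index in place, the planner can walk the index in id order and stop after reading offset + limit rows, so the ORDER BY should not force a separate sort step.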
To repeat: this is not specific to PostgreSQL. Its behavior here conforms to the SQL standard.
Related
I have a query in SSMS that gives me the same number of rows but in a different order each time I hit the F5 key. A similar problem is described in this post:
Query returns a different result every time it is run
The response given is to include an ORDER BY clause because, as the answer in that post explains, SQL Server guesses the order if you don't give it one.
OK, that does fix it, but I'm confused about what SQL Server is actually doing. Tables have a physical order whether they are heaps or have clustered indexes. The physical order of each table does not change between executions, and neither does the query, so we should see the same results each time. What is it doing? Accessing tables in their physical order and then, instead of displaying the results in that unchanging physical order, randomly sorting them? Why? What am I missing? Thanks!
Simple: if you want records in a certain order, then ask for them in a certain order.
If you don't ask for an order, it does not guess. SQL just does whatever is convenient.
One way you can get different ordering is if parallelism is at play. Imagine a simple select (i.e. select * from yourTable). Say the optimizer produces a parallel plan for that query and the degree of parallelism is 4. Each thread will process (roughly) a quarter of the table. But if yours isn't the only workload on the server, each thread will alternate between running and runnable status (by the nature of how the SQLOS schedules threads, they will go into runnable from time to time even if yours is the only workload on the server, but this is exacerbated if you have to share). Since you can't control which threads are running at any given time, and since each thread returns its results as soon as it has retrieved them (there are no joins, aggregates, etc. to wait for), the order in which the rows come back is non-deterministic.
To test this theory, try to force a serial plan with the maxdop = 1 query hint.
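For instance, a minimal sketch of that test, assuming a hypothetical table named yourTable; OPTION (MAXDOP 1) restricts the query to a serial plan:

-- force a serial plan to see whether parallelism explains the varying order
SELECT *
FROM yourTable
OPTION (MAXDOP 1);

If the order becomes stable with the serial plan, parallel streaming was the likely cause; this is only a diagnostic, not a guarantee of order.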
SQL Server keeps a set of statistics for each table to help it choose join strategies, access paths, and so on. If the statistics leave an ambiguous choice about the fastest route, SQL Server's pick can be arbitrary, and different routes read the data in different orders, hence a different output order. The physical order is only a small factor in predicting the output order. Indexes, joins, and the WHERE clause can all affect it, and SQL Server will also create and use its own temporary indexes to help satisfy the query if appropriate indexes do not already exist. Try recalculating the statistics on each table involved and see if there is any change or consistency after that.
You are probably not getting random order each time, but rather an arbitrary choice between a handful of similarly weighted pathways to get the same result from the query.
While working, I just noticed that SQL Server seems to order results by the first and second columns of the select list when the query has DISTINCT, and the order is neither random nor the order in which the records were created.
Can anyone confirm this from their own experience?
If the Distinct Sort operator is the last one in the query execution plan, then it seems a reasonable assumption that the final order will be by the columns in order. However, if you truly want them to be ordered a certain way, you should not rely on this assumption. You ought to explicitly add an ORDER BY clause. It just isn't worth the potential small performance gain to risk things not coming out right. There are other operators that can ensure DISTINCTness such as an aggregate. Are you sure it will come out sorted?
Hey, what if the engine becomes smart enough to determine that the rows being returned are already distinct and nothing further needs to be done (such as ordering or any kind of aggregate or other stream to discard duplicate rows)? This could be fairly easy, depending on whether primary key columns or unique columns from underlying tables are (all) included. For example, if you have a clustered index on OrderID, any query that contains OrderID where there is no possibility of the value being repeated, is provably distinct already!
It is a bad habit to rely on side-effects of certain operations. Think about views: for a long time, you were allowed to put ORDER BY in a view, and it worked, even though it should never have been relied on. Then, people upgraded their SQL Server databases and found--whoops--that the old behavior was no longer supported. I worked with a SQL 2000 database where, after upgrading to SQL 2005 and changing the compatibility mode, several data tables in the front end were no longer sorted--whoops.
And think about parallelism--this breaks a task up into smaller pieces. Parallelism already has been shown to break Scope_Identity() in some cases in some versions of SQL Server. And if one worker thread completes before another, its results may be streamed to the client before other threads are complete--thus suddenly changing order.
Do it right. Add an ORDER BY. Don't speculate any longer about what would or could or should happen. Or what wouldn't or couldn't or shouldn't.
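As a sketch, using a hypothetical Orders table: instead of hoping the Distinct Sort operator leaves the rows sorted, state the order you want:

-- DISTINCT removes duplicates; the ORDER BY is what actually guarantees order
SELECT DISTINCT CustomerId, OrderDate
FROM Orders
ORDER BY CustomerId, OrderDate;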
Checking the execution plan of a query with DISTINCT confirms your observation.
As I understand it from relational database theory, a select statement without an ORDER BY clause should be considered to have no particular order. But in practice, in SQL Server and Oracle (the two platforms I've tested), if I query a table without an ORDER BY clause multiple times, I always get the results in the same order. Can this behavior be relied on? Can anyone help explain a little?
No, that behavior cannot be relied on. The order is determined by the way the query planner has decided to build up the result set. Simple queries like select * from foo_table are likely to be returned in the order they are stored on disk, which may be primary key order, the order they were created, or some other arbitrary order. More complex queries, such as select * from foo where bar < 10, may instead be returned in the order of a different column, based on an index read, or in table order for a table scan. Even more elaborate queries, with multiple WHERE conditions, GROUP BY clauses, or unions, will come back in whatever order the planner decides is most efficient to generate.
The order could even change between two identical queries just because of data that has changed between those queries. A WHERE clause may be satisfied with an index scan in one query, but later inserts could make that condition less selective, and the planner could decide to execute a subsequent query using a table scan.
To put a finer point on it: an RDBMS has the mandate to give you exactly what you asked for, as efficiently as possible. That efficiency can take many forms, including minimizing I/O (both to disk and over the network to send data to you), minimizing CPU, and keeping its working set small (using methods that require minimal temporary storage).
Without an ORDER BY clause you have not asked for any particular order, so the RDBMS gives you the rows in whatever order falls out of the algorithm it expects to produce the data fastest; that order may coincidentally correspond to some aspect of the query, but nothing is promised.
If you care about efficiency, but not order, skip the ORDER BY clause. If you care about the order but not efficiency, use the ORDER BY clause.
Since you actually care about BOTH, use ORDER BY, and then carefully tune your query and database so that it is efficient.
No, you can't rely on getting the results back in the same order every time. I discovered that when working on a web page with a paged grid. When I went to the next page, and then back to the previous page, the previous page contained different records! I was totally mystified.
For predictable results, then, you should include an ORDER BY. Even then, if there are identical values in the specified columns, you can get different results between runs. You may have to ORDER BY fields that you didn't really think you needed, just to get a predictable result.
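As an illustration, here is a sketch of a stable paging query on SQL Server 2012 or later, assuming a hypothetical Orders table whose natural sort column, OrderDate, is not unique; the primary key OrderId is included purely as a tiebreaker:

-- page 3 with 10 rows per page; OrderId breaks ties between equal dates
SELECT OrderId, OrderDate, CustomerId
FROM Orders
ORDER BY OrderDate, OrderId
OFFSET 20 ROWS FETCH NEXT 10 ROWS ONLY;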
Tom Kyte has a pet peeve about this topic. For whatever reason, people are fascinated by this, and keep trying to come up with cases where you can rely upon a specific order without specifying ORDER BY. As others have stated, you can't. Here's another amusing thread on the topic on the AskTom website.
The Right Answer
This is a new answer added to correct the old one. I got an answer from Tom Kyte and I post it here:
If you want rows sorted YOU HAVE TO USE AN ORDER. No if, and, or buts about it. period. http://tkyte.blogspot.ru/2005/08/order-in-court.html You need order by on that IOT. Rows are sorted in leaf blocks, but leaf blocks are not stored sorted. fast full scan=unsorted rows.
https://twitter.com/oracleasktom/status/625318150590980097
https://twitter.com/oracleasktom/status/625316875338149888
The Wrong Answer
(Attention! The original answer to the question is kept below only for the sake of history. It's the wrong answer. The right answer is above.)
As Tom Kyte wrote in the article mentioned before:
You should think of a heap organized table as a big unordered collection of rows. These rows will come out in a seemingly random order, and depending on other options being used (parallel query, different optimizer modes and so on), they may come out in a different order with the same query. Do not ever count on the order of rows from a query unless you have an ORDER BY statement on your query!
But note that he only talks about heap-organized tables. There are also index-organized tables. In that case you can rely on the order of the select without ORDER BY, because the order is implicitly defined by the primary key. That is true for Oracle.
In SQL Server, clustered indexes (index-organized tables) are created by default. There is also the possibility for PostgreSQL to store information aligned with an index. More information can be found here
UPDATE:
I see that my answer is being voted down, so I will try to explain my point a little bit.
In the section Overview of Index-Organized Tables there is a phrase:
In an index-organized table, rows are stored in an index defined on the primary key for the table... Index-organized tables are useful when related pieces of data must be stored together or data must be physically stored in a specific order.
http://docs.oracle.com/cd/E25054_01/server.1111/e25789/indexiot.htm#CBBJEBIH
Because of the index, all data is stored in a specific order; I believe the same is true for Pg.
http://www.postgresql.org/docs/9.2/static/sql-cluster.html
If you don't agree with me, please give me a link to the documentation. I'll be happy to know that there is something for me to learn.
What can cause SQL Server 2005 to return the results of a SELECT in a different order?
Note:
I don't know if that matters but the query has an Order by clause on a non-unique column (Date column to be specific) and the table is a temp table.
Edit:
I know that the order is not guaranteed if I don't specify it, but it's usually consistent until something happens.
I want to know what is this "something" that can happen.
Thanks.
If there are duplicate values in the ORDER BY column, then the order of those values is undefined. The order is whatever the database finds most convenient, and it can change at any time for any reason.
If you order by a non-unique column, then after the guaranteed ordering on that column, the order of the rows within a given group is not guaranteed; as awm said, it's down to the database. An insert can change the order, joins can change the order, and an update can potentially change the order.
If you want guaranteed order, then specify a set of columns in your ORDER BY that will guarantee the order.
To be blunt, it really doesn't matter why it should change, if it ever does.
The query engine specifically documents that the order is unspecified if you don't actually specify it.
As always, the general advice on relying on undocumented behavior is don't do it!
Reasons why it might change (there are undoubtedly others, but why press your luck?):
A recompile of the query
Stale indexes get updated
Maintenance has run on the server, rebuilding indexes
You said
I know that the order is not guaranteed if I don't specify it, but it's usually consistent until something happens.
I think you meant to say
I know that the order is not guaranteed if I don't specify it, so it's always unreliable, no matter how consistent it looks.
Anything that affects the query optimizer's decisions about the execution plan can change the order of unordered rows.
Suppose I have a database table with two fields, "foo" and "bar". Neither of them are unique, but each of them are indexed. However, rather than being indexed together, they each have a separate index.
Now suppose I perform a query such as SELECT * FROM sometable WHERE foo='hello' AND bar='world'; My table has a huge number of rows for which foo is 'hello' and a small number of rows for which bar is 'world'.
So the most efficient thing for the database server to do under the hood is use the bar index to find all fields where bar is 'world', then return only those rows for which foo is 'hello'. This is O(n) where n is the number of rows where bar is 'world'.
However, I imagine it's possible that the process would happen in reverse, where the foo index was used and those results searched. This would be O(m), where m is the number of rows where foo is 'hello'.
So is Oracle smart enough to search efficiently here? What about other databases? Or is there some way I can tell it in my query to search in the proper order? Perhaps by putting bar='world' first in the WHERE clause?
Oracle will almost certainly use the most selective index to drive the query, and you can check that with the explain plan.
Furthermore, Oracle can combine the use of both indexes in a couple of ways: it can convert the b-tree indexes to bitmaps and perform a bitmap AND operation on them, or it can perform a hash join on the rowids returned by the two indexes.
One important consideration here might be any correlation between the values being queried. If foo='hello' accounts for 80% of values in the table and bar='world' accounts for 10%, then Oracle is going to estimate that the query will return 0.8*0.1 = 8% of the table rows. However this may not be correct: the query may actually return 10% of the rows or even 0% of the rows depending on how correlated the values are. Now, depending on the distribution of those rows throughout the table, it may not be efficient to use an index to find them. You may still need to access (say) 70% of the table blocks to retrieve the required rows (google for "clustering factor"), in which case Oracle is going to perform a full table scan if it gets the estimate right.
In 11g you can collect multicolumn statistics to help with this situation I believe. In 9i and 10g you can use dynamic sampling to get a very good estimation of the number of rows to be retrieved.
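For instance, a sketch of collecting those multicolumn (extended) statistics in 11g, assuming the hypothetical sometable is owned by the current user:

-- create a column group so the optimizer can see how foo and bar correlate
SELECT dbms_stats.create_extended_stats(user, 'SOMETABLE', '(FOO,BAR)') FROM dual;

-- regather table statistics so the new column group is populated
EXEC dbms_stats.gather_table_stats(user, 'SOMETABLE');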
To get the execution plan do this:
explain plan for
SELECT *
FROM sometable
WHERE foo='hello' AND bar='world'
/
select * from table(dbms_xplan.display)
/
Contrast that with:
explain plan for
SELECT /*+ dynamic_sampling(4) */
*
FROM sometable
WHERE foo='hello' AND bar='world'
/
select * from table(dbms_xplan.display)
/
Eli,
In a comment you wrote:
Unfortunately, I have a table with lots of columns each with their own index. Users can query any combination of fields, so I can't efficiently create indexes on each field combination. But if I did only have two fields needing indexes, I'd completely agree with your suggestion to use two indexes. – Eli Courtwright (Sep 29 at 15:51)
This is actually rather crucial information. Sometimes programmers outsmart themselves when asking questions. They try to distill the question down to the essential points but quite often oversimplify and miss getting the best answer.
This scenario is precisely why bitmap indexes were invented -- to handle the times when unknown groups of columns would be used in a where clause.
Just in case someone says that bitmap indexes are for low-cardinality columns only and may not apply to your case: "low" is probably not as small as you think. The only real issue is concurrency of DML against the table; it must be single-threaded or rare for this to work.
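For illustration, a sketch against the hypothetical sometable; each column gets its own bitmap index and Oracle can combine them at query time:

-- single-column bitmap indexes that the optimizer can AND together
CREATE BITMAP INDEX ix_sometable_foo ON sometable (foo);
CREATE BITMAP INDEX ix_sometable_bar ON sometable (bar);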
Yes, you can give "hints" with the query to Oracle. These hints are disguised as comments ("/*+ HINT */") to the database and are mainly vendor specific, so one hint for one database will not work on another database.
I would use index hints here, the first hint for the small table. See here.
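A sketch of what such a hint might look like in Oracle, assuming an index named ix_sometable_bar exists (the index name is hypothetical):

-- ask the optimizer to drive the query from the index on bar
SELECT /*+ INDEX(t ix_sometable_bar) */ *
FROM sometable t
WHERE foo = 'hello' AND bar = 'world';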
On the other hand, if you often search over these two fields, why not create an index on these two? I do not have the right syntax, but it would be something like
CREATE INDEX IX_BAR_AND_FOO on sometable(bar,foo);
This way data retrieval should be pretty fast. And in case the combination is unique, then you simply create a unique index, which should be lightning fast.
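A sketch of that variant, with the same placeholder names:

CREATE UNIQUE INDEX UX_BAR_AND_FOO ON sometable(bar, foo);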
First off, I'll assume that you are talking about nice, normal, standard b*-tree indexes. The answer for bitmap indexes is radically different. And there are lots of options for various types of indexes in Oracle that may or may not change the answer.
At a minimum, if the optimizer is able to determine the selectivity of a particular condition, it will use the more selective index (i.e. the index on bar). But if you have skewed data (there are N values in the column bar but the selectivity of any particular value is substantially more or less than 1/N of the data), you would need to have a histogram on the column in order to tell the optimizer which values are more or less likely. And if you are using bind variables (as all good OLTP developers should), depending on the Oracle version, you may have issues with bind variable peeking.
Potentially, Oracle could even do an on the fly conversion of the two b*-tree indexes to bitmaps and combine the bitmaps in order to use both indexes to find the rows it needs to retrieve. But this is a rather unusual query plan, particularly if there are only two columns where one column is highly selective.
So is Oracle smart enough to search efficiently here?
The simple answer is "probably". There are lots'o' very bright people at each of the database vendors working on optimizing the query optimizer, so it's probably doing things that you haven't even thought of. And if you update the statistics, it'll probably do even more.
I'm sure you can also have Oracle display a query plan so you can see exactly which index is used first.
The best approach would be to add foo to bar's index, or add bar to foo's index (or both). If foo's index also contains an index on bar, that additional indexing level will not affect the utility of the foo index in any current uses of that index, nor will it appreciably affect the performance of maintaining that index, but it will give the database additional information to work with in optimizing queries such as in the example.
It's better than that.
Index Seeks are always quicker than full table scans. So behind the scenes Oracle (and SQL server for that matter) will first locate the range of rows on both indices. It will then look at which range is shorter (seeing that it's an inner join), and it will iterate the shorter range to find the matches with the larger of the two.
You can provide hints as to which index to use. I'm not familiar with Oracle, but in MySQL you can use USE INDEX, IGNORE INDEX, or FORCE INDEX (see here for more details). For best performance, though, you should use a combined index.
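For reference, a sketch of the MySQL hint syntax mentioned above, with a hypothetical index name:

-- MySQL: prefer the index on bar for this query
SELECT *
FROM sometable USE INDEX (ix_bar)
WHERE foo = 'hello' AND bar = 'world';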