Does Snowflake TOP/LIMIT/FETCH stop processing once it finds enough rows?

When you use the Snowflake TOP clause in a query, does the engine stop searching for rows once it has enough to satisfy the TOP X it needs to return?

I think it depends on the rest of your query. For example, if you use TOP 10 but don't supply an ORDER BY, then yes, it will stop as soon as the 10 records are returned, but your results are non-deterministic.
If you do use an ORDER BY, then the entire query has to be executed before the top 10 results can be returned, but your results will be deterministic.
Here is a real example. If I run a SELECT on the SAMPLE_DATA.TPCH_SF10000.CUSTOMER table with LIMIT 10, it returns in 1.8 seconds (no caching). This table has 1,500,000,000 rows in it. If I then check the query plan, it has only scanned a tiny portion of the table: 1 out of 6,971 partitions.
You can see that it returns as soon as 10 records have been streamed back from the initial table scan, since there is nothing more it has to do.
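
A minimal sketch of that test, assuming the sample share is mounted as SAMPLE_DATA (on many accounts it is named SNOWFLAKE_SAMPLE_DATA):

    -- Rule out result caching for the timing test
    ALTER SESSION SET USE_CACHED_RESULT = FALSE;

    -- No ORDER BY: the scan can short-circuit once 10 rows have streamed back
    SELECT *
    FROM SAMPLE_DATA.TPCH_SF10000.CUSTOMER
    LIMIT 10;

    -- With ORDER BY: the full input must be processed before the first row
    SELECT *
    FROM SAMPLE_DATA.TPCH_SF10000.CUSTOMER
    ORDER BY C_CUSTKEY
    LIMIT 10;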

From my testing and understanding, it does not stop. Looking at the execution plan, you will typically see the LIMIT (or whatever) as the last step, applied after full processing. Also, if you take a query that runs for, say, 20 seconds without a LIMIT (or similar) and add the LIMIT, you will typically not see any difference in the execution time (but be aware of fetch time). I typically run query performance testing in the UI to avoid issues with client-side tools that can mislead you due to limits on fetching and/or use of cursors.
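
A sketch of that timing test, reusing the sample table from the answer above; here the LIMIT sits on top of a blocking aggregate, so it cannot shorten the work:

    -- Full aggregation over the table
    SELECT C_NATIONKEY, COUNT(*) AS cnt
    FROM SAMPLE_DATA.TPCH_SF10000.CUSTOMER
    GROUP BY C_NATIONKEY;

    -- Same aggregation with a LIMIT: the GROUP BY must still consume
    -- every row, so expect a similar runtime (fetch time aside)
    SELECT C_NATIONKEY, COUNT(*) AS cnt
    FROM SAMPLE_DATA.TPCH_SF10000.CUSTOMER
    GROUP BY C_NATIONKEY
    LIMIT 10;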

Related

What do these different properties in SQL Server's (Azure Synapse's) Estimated Execution Plan mean?

I'm trying to work on the statistics, and as a part of it, I'm trying to look at the execution plan of certain SELECT * commands with a WHERE condition on a particular column.
What I keep getting are fields such as Estimated Number of Executions, Estimated Number of Rows, Estimated Row Size, and Estimated I/O Cost.
I don't really know what these properties mean. I'm trying to perform a before-stats-update and after-stats-update comparison of these results, and I don't see much change.
Can someone please shed some light here? It would be very helpful to understand this information.
The optimizer uses statistics and row counts to estimate the number of rows that will be consumed and produced by each operator in the query tree. For a simple leaf Get like this, it is estimated to be executed 1 time and to return 13.2M rows. The row width is estimated to be 2544B. If your Get were on the inside of a nested loops join (presumably with another Get on the outer), then you could get multiple scans of the inner table and the Estimated Number of Executions would potentially be > 1. That would then also be shown in the Estimated Number of Rows for All Executions as a multiple of the 13.2M number. The I/O costs are zero in this case, but they would represent a cost for the scan that helps the optimizer compare this path against other paths during its search of the plan space.
For a normal user, the way to examine whether updated/better statistics would help your query is to run it with SET STATISTICS PROFILE ON (note: this has a bit of overhead, so don't run like this unless you need to do validations manually) before and after updating stats. You can then look at the per-operator actual vs. estimated row counts to see if things got better. Also, the Query Store records runtime information (though not per-operator information), which can give you a summary over your whole workload of how it is performing.
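
A hedged sketch of that workflow (table, column, and value are placeholders):

    -- Capture per-operator actual vs. estimated row counts
    SET STATISTICS PROFILE ON;
    SELECT * FROM dbo.MyTable WHERE SomeColumn = 42;
    SET STATISTICS PROFILE OFF;

    -- Refresh statistics, then rerun the query above and compare the
    -- Rows vs. EstimateRows columns in the profile output
    UPDATE STATISTICS dbo.MyTable;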

Postgres Query Plan keeps changing - query sometimes takes a minute to finish and sometimes never finishes

I have a huge SQL query, with probably 15-20 tables involved.
There are 6 or 7 subqueries which are joined again.
Most of the time, this query takes a minute to run and returns 5 million records.
So even if this query is badly written, it does have a query plan that makes it finish in a minute. I have ensured that the query actually ran and didn't use cached results.
Sometimes, the query plan gets jacked up and then it never finishes. I run VACUUM ANALYZE every night on the tables involved in the query. work_mem is currently set at 200 MB. I have tried increasing this to 2 GB as well. I haven't experienced the query getting messed up when work_mem was 2 GB. But when I reduced it and ran the query, it got messed up. Now, when I increased it back to 2 GB, the query is still messed up. Has it got something to do with the query plan not being refreshed with the new setting? I tried DISCARD PLANS on my session.
I can only think of work_mem and VACUUM ANALYZE at this point. Are there any other factors that can cause a smoothly running query that returns results in a minute to suddenly not return anything?
Let me know if you need more details on any settings, or the query itself. I can paste the plan too, but the query and the plan are too big to be pasting here.
If there are more than geqo_threshold (default 12) entries in the range table, the genetic optimiser will kick in, often resulting in random behaviour, as described in the question. You can solve this by:
increasing geqo_threshold
moving some of your table references into a CTE. If you already have some subqueries, promote one (or more) of these to a CTE. It is a kind of black art to identify clusters of tables in your query that will fit into a compact CTE (with relatively few result tuples, and not too many key references to the outer query).
Setting geqo_threshold too high (20 is probably too high ...) will cause the planner to need a lot of time to evaluate all the plans (the number of plans increases essentially exponentially with the number of RTEs). If you expect your query to need a few minutes to run, a few seconds of planning time will probably do no harm.
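
A sketch of both workarounds (table and column names are placeholders; MATERIALIZED requires PostgreSQL 12+, and on older versions every CTE is materialized anyway):

    -- Raise the threshold so exhaustive planning covers this query...
    SET geqo_threshold = 16;
    -- ...or disable the genetic optimizer for the session entirely
    SET geqo = off;

    -- Alternatively, hoist a cluster of tables into a CTE so the outer
    -- query sees fewer range-table entries
    WITH small_cluster AS MATERIALIZED (
        SELECT a.id, b.val
        FROM table_a a
        JOIN table_b b ON b.a_id = a.id
    )
    SELECT sc.val, t.payload
    FROM small_cluster sc
    JOIN big_table t ON t.id = sc.id;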

Does SQL Server randomly sort results when no ORDER BY is used? Why?

I have a query in SSMS that gives me the same number of rows but in a different order each time I hit the F5 key. A similar problem is described in this post:
Query returns a different result every time it is run
The response given is to include an ORDER BY clause because, as the response in that post explains, SQL Server guesses the order if you don't give it one.
OK, that does fix it, but I'm confused about what SQL Server is doing. Tables have a physical order, whether they are heaps or have clustered indexes. The physical order of each table does not change with every execution of the query, which also does not change. We should see the same results each time! Is it accessing tables in their physical order and then, instead of returning the results in that unchanging physical order, randomly sorting them? Why? What am I missing? Thanks!
Simple: if you want records in a certain order, then ask for them in a certain order.
If you don't ask for an order, it does not guess; SQL just does what is convenient.
One way that you can get different ordering is if parallelism is at play. Imagine a simple select (i.e. SELECT * FROM yourTable). Let's say that the optimizer produces a parallel plan for that query and that the degree of parallelism is 4. Each thread will process (roughly) 1/4 of the table. But if yours isn't the only workload on the server, each thread will move between running and runnable status (just by the nature of how the SQLOS schedules threads, they will go into runnable from time to time even if yours is the only workload on the server, but this is exacerbated if you have to share). Since you can't control which threads are running at any given time, and since each thread returns its results as soon as it has retrieved them (it doesn't have to do any joins, aggregates, etc.), the order in which the rows come back is non-deterministic.
To test this theory, try forcing a serial plan with the MAXDOP 1 query hint, as in the sketch below.
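
A minimal version of that test (yourTable is a placeholder):

    -- Force a serial plan; if the ordering stabilizes, parallelism was
    -- the source of the variation
    SELECT *
    FROM yourTable
    OPTION (MAXDOP 1);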
SQL Server uses a set of statistics for each table to assist with speed, joins, etc. If the stats give an ambiguous choice for the fastest route, the choice SQL makes can be arbitrary, and could require slightly different indexing to achieve, hence a different output order. The physical order is only a small factor in predicting order. Any indexes, joins, or WHERE clauses can affect the order, as SQL will also create and use its own temporary structures (such as index spools) to help satisfy the query if appropriate indexes do not already exist. Try recalculating the statistics on each table involved and see if there is any change or consistency after that.
You are probably not getting a random order each time, but rather an arbitrary choice between a handful of similarly weighted pathways to the same result.

How do I differentiate execution time from the total amount of time it took to run and return rows?

When I run a query which returns millions of rows from within SQL management tools, it looks like the query executes instantly. There is virtually no execution time as far as I can tell. What makes the query take time to complete is returning all the rows.
This got me thinking I've done a good job! But not so fast... As I look at the query in the profiler tool, it states that the query used 7600 CPU. Duration was 15000.
I'm not sure I know how to interpret these stats.
On one hand, the query seems to run fast, but the profiler report makes me think otherwise. How come the query executes instantly in Mgmt Tools? As far as I can tell, there should be some kind of delay: at least 7600 ms. When I run the query in Mgmt Tools, I have to wait longer than both the CPU and duration stats for it to complete.
"it looks like the query executes instantly"
It might be that the query plan allows rows to start being returned quickly.
For example, if you do SELECT * FROM a_large_table, you will see some rows immediately, but retrieval of the whole result set will take some time. What is the actual execution time reported by Mgmt Studio (shown in the status bar after the query completes)?
If you want to test the query performance without retrieving data to the client, you can do a SELECT ... INTO #temp_table, as sketched below. This requires some additional I/O, but still gives you a rather good estimate of the execution time.
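
A minimal sketch of that approach (table, column, and filter value are placeholders):

    -- Materialize the result server-side so client fetch time is excluded
    SELECT *
    INTO #results
    FROM dbo.MyLargeTable
    WHERE SomeColumn = 42;

    DROP TABLE #results;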
Update:
You could also run something like SELECT COUNT(*) FROM (<your query here>) AS q or SELECT SUM(<some field>) FROM (<your query here>) AS q. With some luck, it will make the server execute the query and aggregate the result, basically doing the same work plus a little extra. But it is very easy to skew the results this way: the query optimizer is smart, and you need to be very careful to be sure you are measuring what you want to measure (because measuring a query with a different execution plan makes no sense at all).
I suggest you think again about what you want to measure and why. In any real-life scenario you are not interested in "pure" query duration, because you never want to discard the query result (the result is why you need the query in the first place, right?). So you either need to return the result to the client, store it somewhere, or join it with another table, and so on; usually you want to measure query execution including the time used to process its result.
One final note: if you hope you can somehow force the server to execute this query in 1 second because you think the server does nothing for the other 13 seconds, you are wrong. As they say, SELECT ain't broken.
What might help is query optimization, and for a single query a profiler won't help you much with that. Analyze the query plan, tune your table structure, try to rewrite the query, and post another question on SO if in trouble.

What is SQL Server doing between the time my first record is returned and when my last record is returned?

Say I have a query that returns 10,000 records. When the first record has been returned, what can I assume about the state of my query?
Has it finished and is just returning records from the server to my instance of SSMS?
Is the query itself still being executed on the server?
What is it that causes the 10,000 records to be returned slowly for one query and nearly instantly for another?
There is potentially some mix of progressive processing on the server side, network transfer of the data, and rendering by the client.
If one query returns 10,000 rows quickly, and another one slowly -- and they are of similar row size, data types, etc., and are both destined for results to grid or results to text -- there is little we can do to analyze the differences unless you show us execution plans and/or client statistics for each one. These are options you can set in SSMS when running a query.
As an aside, when switching between results to grid and results to text, you might notice slightly different runtimes. This is because in one case Management Studio has to work harder to align the columns, etc.
You cannot make a generic assumption: a query's plan is composed of a number of different types of operations, or iterators. Some of these are navigational and work like a pipeline, while others are set-based operations, such as a sort.
If a query contains a set-based operation, it requires all the records before it can output the results (e.g. an ORDER BY clause within your statement). But if you have no set-based iterators, you can expect the rows to be streamed to you as they become available.
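
As an illustration of streaming versus blocking plans (table and column names are hypothetical):

    -- Streaming: rows can be returned as soon as the scan produces them
    SELECT * FROM dbo.Orders;

    -- Blocking: the sort must consume every row before the first is output
    SELECT * FROM dbo.Orders ORDER BY OrderDate;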
The answer to each of your individual questions is "it depends."
For example, consider what happens if you include an ORDER BY clause and there isn't an index for the column(s) you're ordering by. In this case, the server has to find all the records that satisfy your query, then sort them, before it can return the first record. That causes a long pause before you get your first record, but you should normally get the rest quite quickly once they start arriving.
Without the ORDER BY clause, the server will normally send each record as it's found, so the first record will often show up sooner, but you may see a long pause between one record and the next.
As far as simply "why is one query faster than another" goes, a lot depends on what indexes are available and whether they can be used for a particular query. For example, something like some_column LIKE '%something' will almost always be quite slow. The leading '%' means this won't be able to use an index, even if some_column has one. A search for 'something%' instead of '%something' might easily be 100 or 1000 times faster. If you really need the former, you really want to use full-text searching instead (create a full-text index, and use CONTAINS() instead of LIKE).
Of course, a lot can also depend simply on whether the database has an index for a particular column (or group of columns). With a suitable index, the query will usually be quite a lot faster.
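
A hedged sketch of the difference (table, column, and index names are hypothetical):

    -- A plain index on the searched column
    CREATE INDEX IX_Customer_Name ON dbo.Customer (Name);

    -- Can seek on the index: the prefix is fixed
    SELECT * FROM dbo.Customer WHERE Name LIKE 'something%';

    -- Cannot seek: the leading wildcard forces a scan
    SELECT * FROM dbo.Customer WHERE Name LIKE '%something';

    -- Full-text alternative (requires a full-text index on Name):
    -- SELECT * FROM dbo.Customer WHERE CONTAINS(Name, 'something');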
