What is the difference between the execution of Sort and SortWithLimit?
How does ORDER BY work with LIMIT?
The combination of ORDER BY and LIMIT is implemented as a special operator, shown as "SortWithLimit" in the query profile.
This operator does not spill to disk, so when it processes more than a certain amount of data it can run out of memory.
Details: https://community.snowflake.com/s/article/Out-of-memory-error-caused-by-LIMIT-and-or-OFFSET-clause
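For illustration, a minimal query shape (hypothetical table and column names) that shows up as SortWithLimit in the profile:

-- Hypothetical names; ORDER BY plus LIMIT together is what the profile
-- displays as a single SortWithLimit operator.
SELECT *
FROM my_schema.events
ORDER BY event_ts DESC
LIMIT 100;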
I'm working on statistics, and as part of that I'm looking at the execution plans of certain SELECT * queries with a WHERE condition on a particular column.
What I keep getting is the below fields (example output):
I don't really know what these properties mean. I'm trying to compare these results before and after a statistics update, and I don't see much change.
Can someone please shed some light here? It would be very helpful to understand this information.
The optimizer uses statistics and row counts to estimate the number of rows that will be consumed and produced by each operator in the query tree. For a simple leaf Get like this, it is estimated to be executed 1 time and to return 13.2M rows. The row width is estimated to be 2544B. If your Get were on the inside of a nested loops join (presumably with another Get on the outer), then you could get multiple scans of the inner table and the Estimated Number of Executions would potentially be > 1. That would then also be shown in the Estimated Number of Rows for All Executions as a multiple of the 13.2M number. The I/O costs are zero for this case, but they would represent a cost for the scan that helps the optimizer compare this path against other paths during its search of the plan space.
For a normal user, the way to examine whether updated/better statistics would help your query is to run it with "set statistics profile on" (note: this has a bit of overhead, so don't run like this unless you need to do validations manually) before and after updating stats. You can then look at the per-operator actual vs. estimated row counts to see if things got better. Also, the Query Store records runtime information (though not per-operator information), which can give you a summary of how your whole workload is performing.
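As a sketch of that before/after comparison (the table and column names here are hypothetical):

-- Hypothetical table/column names. Capture per-operator actual vs. estimated
-- row counts before and after the statistics update, then compare.
SET STATISTICS PROFILE ON;

SELECT * FROM dbo.MyTable WHERE SomeColumn = 42;   -- baseline run

UPDATE STATISTICS dbo.MyTable WITH FULLSCAN;       -- refresh statistics

SELECT * FROM dbo.MyTable WHERE SomeColumn = 42;   -- run again and compare estimates

SET STATISTICS PROFILE OFF;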
When you use the TOP clause in a Snowflake query, does the engine stop searching for rows once it has enough to satisfy the TOP N it needs to return?
I think it depends on the rest of your query. For example, if you use TOP 10 but don't supply an ORDER BY, then yes, it will stop as soon as 10 records have been returned, but your results are non-deterministic.
If you do use an ORDER BY, then the entire query has to be executed before the top 10 results can be returned, but your results will be deterministic.
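A minimal sketch of the two cases (hypothetical table and column names):

-- Without an ORDER BY, the engine can stop as soon as 10 rows are found,
-- but which 10 rows come back is non-deterministic.
SELECT TOP 10 * FROM dbo.Orders;

-- With an ORDER BY, the qualifying rows must be ordered before the top 10
-- can be returned, so the result is deterministic.
SELECT TOP 10 * FROM dbo.Orders ORDER BY OrderDate DESC;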
Here is a real example. If I run a SELECT on the SAMPLE_DATA.TPCH_SF10000.CUSTOMER table with LIMIT 10, it returns in 1.8 seconds (no caching). This table has 1,500,000,000 rows in it. If I then check the query plan, it has only scanned a tiny portion of the table: 1 out of 6,971 partitions.
You can see that it will return when 10 records have been streamed back from the initial table scan since there is nothing more it has to do.
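The query in that example is essentially this (database path as given above); with no ORDER BY, it can be answered after scanning only 1 of the 6,971 partitions:

SELECT *
FROM SAMPLE_DATA.TPCH_SF10000.CUSTOMER
LIMIT 10;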
From my testing and understanding, it does not stop. You can typically see in the execution plan that the last step is the "limit" step, i.e. the LIMIT (or whatever) is applied after full processing. Also, if you take a query that runs for, say, 20 seconds without a LIMIT (or similar) and add the LIMIT, you will typically not see any difference in the execution time (but be aware of fetch time). I typically run query performance testing in the UI to avoid issues with client-side tools that can mislead you due to limits on fetching and/or the use of cursors.
I'm evaluating Flink, specifically the streaming window support, for possible alert generation. My concern is memory usage, so if someone could help with this it would be appreciated.
For example, this application will potentially be consuming a significant amount of data from the stream within a given tumbling window of, say, 5 minutes. At the point of evaluation, if there were, say, a million documents that matched the criteria, would they all be loaded into memory?
The general flow would be:
producer -> kafka -> flinkkafkaconsumer -> table.window(Tumble.over("5.minutes").select("...").where("...").writeToSink(someKafkaSink)
Additionally, if there is some clear documentation that describes how memory is dealt with in these cases that I may have overlooked, it would be helpful if someone could point it out.
Thanks
The amount of data that is stored for a group window aggregation depends on the type of the aggregation. Many aggregation functions such as COUNT, SUM, and MIN/MAX can be preaggregated, i.e., they only need to store a single value per window. Other aggregation functions, such as MEDIAN or certain user-defined aggregation functions, need to store all values before they can compute their result.
The data that needs to be stored for an aggregation is stored in a state backend. Depending on the choice of the state backend, the data might be stored in-memory on the JVM heap or on disk in a RocksDB instance.
Table API queries are also optimized by a relational optimizer (based on Apache Calcite) such that filters are pushed as far towards the sources as possible. Depending on the predicate, the filter might be applied before the aggregation.
Finally, you need to add a groupBy() between window() and select() in your example query (see the examples in the docs).
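For illustration, here is a roughly equivalent query written in Flink SQL rather than the Table API; the table 'events', its event-time attribute 'rowtime', and the column names are assumptions:

-- Hypothetical source table. COUNT can be pre-aggregated, so only one value
-- per key and window is kept in state; the WHERE filter can be pushed
-- towards the source by the optimizer.
SELECT
  device_id,
  TUMBLE_END(rowtime, INTERVAL '5' MINUTE) AS window_end,
  COUNT(*) AS alert_count
FROM events
WHERE severity = 'CRITICAL'
GROUP BY
  device_id,
  TUMBLE(rowtime, INTERVAL '5' MINUTE);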
Is it really bad to get a 'Table Spool' in a SQL Server execution plan? If not, how is it advantageous? Should we really be looking to get rid of the Table Spool?
According to MSDN:
The Lazy Spool logical operator stores each row from its input in a hidden temporary object stored in the tempdb database. If the operator is rewound (for example, by a Nested Loops operator) but no rebinding is needed, the spooled data is used instead of rescanning the input. If rebinding is needed, the spooled data is discarded and the spool object is rebuilt by rescanning the (rebound) input.
It's always better to have no operator than to have one. The advantages are described above (no rescanning). The disadvantage is that rows must be stored in tempdb (which usually fits in memory for faster access).
Usually it's not bad to have this operator unless everything fits in memory. You would have to share the execution plan/query for a more detailed explanation and possible tweaks.
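As a rough illustration (the schema is made up, and whether the optimizer actually introduces a spool depends on the plan it chooses), a correlated query of this shape is a common place to see a Lazy Spool:

-- Made-up table/columns; the per-customer aggregate on the inner side may be
-- spooled and replayed by the Nested Loops operator instead of being recomputed.
SELECT o.OrderID, o.CustomerID, o.OrderTotal
FROM dbo.Orders AS o
WHERE o.OrderTotal > (SELECT AVG(i.OrderTotal)
                      FROM dbo.Orders AS i
                      WHERE i.CustomerID = o.CustomerID);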
I have a query in SSMS that gives me the same number of rows but in a different order each time I hit the F5 key. A similar problem is described in this post:
Query returns a different result every time it is run
The response given is to include an ORDER BY clause because, as the response in that post explains, SQL Server guesses the order if you don't give it one.
OK, that does fix it, but I'm confused about what SQL Server is doing. Tables have a physical order, whether they are heaps or have clustered indexes. The physical order of each table does not change with every execution of the query, which also does not change. We should see the same results each time! What is it doing: accessing tables in their physical order and then, instead of returning the results in that unchanging physical order, randomly sorting them? Why? What am I missing? Thanks!
Simple - if you want records in a certain order, then ask for them in a certain order.
If you don't ask for an order, it does not guess. SQL just does whatever is convenient.
One way that you can get different ordering is if parallelism is at play. Imagine a simple select (i.e. select * from yourTable). Let's say that the optimizer produces a parallel plan for that query and that the degree of parallelism is 4. Each thread will process (roughly) 1/4 of the table. But if yours isn't the only workload on the server, each thread will move between the running and runnable states (just by the nature of how the SQLOS schedules threads, they will go into runnable from time to time even if yours is the only workload on the server, but it is exacerbated if you have to share). Since you can't control which threads are running at any given time, and since each thread is going to return its results as soon as it has retrieved them (since it doesn't have to do any joins, aggregates, etc.), the order in which the rows come back is non-deterministic.
To test this theory, try to force a serial plan with the maxdop = 1 query hint.
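For example (hypothetical table name):

-- Force a serial plan to check whether parallelism explains the varying order.
SELECT *
FROM dbo.YourTable
OPTION (MAXDOP 1);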
SQL Server uses a set of statistics for each table to assist with speed and joins etc. If the stats give an ambiguous choice for the fastest route, the choice SQL makes can be arbitrary - and could require slightly different indexing to achieve... hence a different output order. The physical order is only a small factor in predicting order. Any indexes, joins, or WHERE clause can affect the order, as SQL will also create and use its own temporary indexes to help satisfy the query if the appropriate indexes do not already exist. Try recalculating the statistics on each table involved and see if there is any change or consistency after that.
You are probably not getting random order each time, but rather an arbitrary choice between a handful of similarly weighted pathways to get the same result from the query.