Does SQL Server optimize what data needs to be read? - sql-server

I've been using BigQuery / Spark for a few years, but I'm not very familiar with SQL server.
If I have a query like
with x AS (
SELECT * FROM bigTable1
),
y AS (SELECT * FROM bigTable2)
SELECT COUNT(1) FROM x
Will SQL Server be "clever" enough to ignore the pointless data fetching?
Note: Due to the configuration of my environment I don't have access to the query planner for troubleshooting.

Like most of the leading professional DBMSs, SQL Server has a statistics-based optimizer that will indeed eliminate data sources that are never used and cannot affect the results.
Note, however, that this does not apply to certain kinds of errors, so if your bigTable1 or bigTable2 do not exist (or you cannot access them), the query will throw a compile error, even though it would never actually use those tables.
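Since you mention you cannot see the query plan, one way to verify this yourself (assuming you can at least set session options) is SET STATISTICS IO ON, which reports reads per table; a table eliminated by the optimizer never shows up in the output. A minimal sketch:
SET STATISTICS IO ON;
WITH x AS (SELECT * FROM bigTable1),
     y AS (SELECT * FROM bigTable2)
SELECT COUNT(1) FROM x;
-- The messages output should report logical reads for bigTable1 only;
-- bigTable2 never appears, because the unused CTE y was eliminated.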

SQL Server arguably has the most advanced optimizer of all the professional RDBMSs (IBM DB2, Oracle...).
Before optimizing, the algebrizer transforms the query, which is just a "demand" (not executable code), into a mathematical form known as an "algebraic tree". This object is a formula of the relational algebra that underpins relational DBMSs (a mathematical theory developed by Edgar Frank Codd around 1970).
The very first optimization step happens here, by simplifying the formula, much as you would simplify a polynomial expression in x and y (for example: 2x - 3y = 3x² - 5x + 7 <=> y = (7x - 3x² - 7) / 3).
Query example (from Chris Date's "A Cure for Madness"):
With the table:
CREATE TABLE T_MAD (TYP CHAR(4), VAL VARCHAR(16));
INSERT INTO T_MAD VALUES
('ALFA', 'ted'),('ALFA', 'chris'),('ALFA', 'michael'),
('NUM', '123'),('NUM', '4567'),('NUM', '89');
This query will fail:
SELECT * FROM T_MAD
WHERE VAL > 1000;
Because VAL is a string datatype, which is incompatible with a numeric comparison in the WHERE clause (values like 'ted' or 'chris' cannot be converted to numbers).
But our table distinguishes ALFA values from NUM values. By adding a restriction on the TYP column like this:
SELECT * FROM T_MAD
WHERE TYP = 'NUM' AND VAL > 1000;
the query gives the right result...
But so do all of these queries:
SELECT * FROM T_MAD
WHERE VAL > 1000 AND TYP = 'NUM';
SELECT * FROM
(
SELECT * FROM T_MAD WHERE TYP = 'NUM'
) AS T
WHERE VAL > 1000;
SELECT * FROM
(
SELECT * FROM T_MAD WHERE VAL > 1000
) AS T
WHERE TYP = 'NUM';
The last one is very important, because its subquery is exactly the first query, the one that fails...
So why does this subquery suddenly not fail?
In fact the algebrizer rewrites all of these queries into a simpler form that leads to a similar (not to say identical) formula...
Just have a look at the query execution plans, which are strictly identical!
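Incidentally, if you need a version of this query that cannot fail no matter how the optimizer reorders the predicates, TRY_CAST (available since SQL Server 2012) returns NULL instead of raising a conversion error:
SELECT * FROM T_MAD
WHERE TYP = 'NUM' AND TRY_CAST(VAL AS INT) > 1000;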
NOTE that non-professional DBMSs like MySQL, MariaDB or PostgreSQL will fail on the last one... Building such an optimizer takes a huge team of developers and researchers, which open/free projects cannot match!
Second, the optimizer has heuristic rules that apply essentially at the semantic level. The execution plan is simplified when contradictory conditions appear in the query text...
Just have a look at these two queries:
SELECT * FROM T_MAD WHERE 1 = 2;
SELECT * FROM T_MAD WHERE 1 = 1;
The first one returns no rows, while the second returns all the rows of the table... What does the optimizer do? The query execution plan gives the answer:
The term "Analyse de constante" ("Constant Scan" in the English version of SSMS) in the query execution plan means that the optimizer will not access the table at all... This is similar to what happens with your last subquery...
Note that every constraint (PK, FK, UNIQUE, CHECK) can help the optimizer simplify the query execution plan and improve performance!
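A minimal illustration with the T_MAD table from above (the constraint name is mine):
ALTER TABLE T_MAD ADD CONSTRAINT CK_MAD_TYP CHECK (TYP IN ('ALFA', 'NUM'));
SELECT * FROM T_MAD WHERE TYP = 'XYZ';
With the trusted CHECK constraint in place, the optimizer can prove that no row can satisfy TYP = 'XYZ', so the plan collapses to a constant scan and T_MAD is never read.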
Third, the optimizer uses statistics, i.e. histograms computed on the data distribution, to predict how many rows will be manipulated at every step of the query execution plan...
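You can inspect such a histogram yourself; a quick sketch (the statistics name is mine):
CREATE STATISTICS ST_MAD_VAL ON T_MAD (VAL);
DBCC SHOW_STATISTICS ('T_MAD', 'ST_MAD_VAL') WITH HISTOGRAM;
The output lists the histogram steps the optimizer uses for its row-count estimates.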
There is much more to say about the SQL Server query optimizer, like the fact that it works in reverse compared to all the other optimizers, a technique that has allowed it to suggest missing indexes for 18 years now, which no other RDBMS can do!
PS: sorry for using the French version of SSMS... I work in France and help professionals optimize their databases!

Related

Oracle Query with Unknown WHERE clause

We have a table with 100 columns. The end user is allowed to write any query trying to search on practically any column.
Essentially, they construct the query dynamically from a screen and the WHERE condition can have any number or combination of columns.
Example
select * from my_tab where col1=x
select * from my_tab where col1=x and col2=y and ....col10 =q
select * from my_tab where col10=a and col20=4 and col30=r
While this is possible syntactically, the biggest problem is performance, because you cannot have every possible combination of indexes.
I know this seems to be a "query from hell", but still:
What other approaches (both technical and non-technical) could address this problem?

Snowflake Performance issues when querying large tables

I am trying to query a table which has 1 TB of data clustered by Date and Company. A simple query is taking a long time.
Posting the query and query profile
SELECT
sl.customer_code,
qt_product_category_l3_sid,
qt_product_brand_sid,
sl.partner_code,
sl.transaction_id,
dollars_spent,
units,
user_pii_sid,
promo_flag,
media_flag
FROM
cdw_dwh.public.qi_sg_promo_media_sales_lines_fact sl
WHERE
transaction_date_id >= (to_char(current_date - (52*7) , 'yyyymmdd') )
AND sl.partner_code IN ('All Retailers')
AND qt_product_category_l3_sid IN (SELECT DISTINCT qt_product_category_l3_sid
FROM cdw_dwh.PUBLIC.qi_sg_prompt_category_major_brand
WHERE qt_product_category_l1_sid IN (246))
AND qt_product_brand_sid IN (SELECT qt_product_brand_sid
FROM cdw_dwh.PUBLIC.qi_sg_prompt_category_major_brand
WHERE qt_product_major_brand_sid IN (246903, 430138))
"simple query" I am not sure there is such a thing. A naive query, sure.
select * from really_large_table where column1 = value;
will perform really badly if you only care about 1 or 2 of the columns, as Snowflake has to load all the data. You will get a column-data-to-row-data ratio improvement by using
select column1, column2 from really_large_table where column1 = value;
Now only two columns of data need to be read from the data store.
Maybe you are looking for data where the value is > 100 because you think that should not happen. Then
select column1, column2 from really_large_table where column1 > 100 limit 1;
will perform much better than
select column1, column2 from really_large_table order by column1 desc limit 50;
But if you are already doing the minimum work needed for a correct answer, your next option is to increase the warehouse size, which for IO-bound work scales roughly linearly, though some aggregation steps don't scale as well.
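Resizing is a one-line change; a sketch (the warehouse name is a placeholder):
ALTER WAREHOUSE my_wh SET WAREHOUSE_SIZE = LARGE;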
Another thing to look for is that sometimes a calculation produces too much intermediate state and "spills externally" (not the exact wording), which is much like running out of RAM and going to swap disk.
We have also seen memory pressure when doing too much work in a JavaScript UDF, which slowed things down.
But most of these can be spotted by looking at the query profile and looking at the hotspots.
99% of the time was spent scanning the table. The filters in the query do not match your clustering keys, so the clustering won't help much. Depending on how much historical data you have in this table, and whether you will continue to read a year's worth of data, you might be better off clustering by qt_product_brand_sid or qt_product_category_l3_sid (or creating a materialized view), depending on which one filters the data down faster.
A bigger change would be to convert the transaction date to a true date field instead of a varchar.
Second, you have an IN clause with a single value. Use = instead.
But for the other IN clauses, I would suggest rewriting the query to separate out those sub-queries as CTEs and then just join to those CTEs (see the sketch after the rewritten query below).
Use this query:
SELECT
sl.customer_code,
sl.qt_product_category_l3_sid,
sl.qt_product_brand_sid,
sl.partner_code,
sl.transaction_id,
sl.dollars_spent,
sl.units,
sl.user_pii_sid,
sl.promo_flag,
sl.media_flag
FROM
cdw_dwh.public.qi_sg_promo_media_sales_lines_fact sl,
cdw_dwh.PUBLIC.qi_sg_prompt_category_major_brand prod_cat,
cdw_dwh.PUBLIC.qi_sg_prompt_category_major_brand prod_brand
WHERE
sl.transaction_date_id >= to_char(current_date - (52*7), 'yyyymmdd')
AND sl.partner_code = 'All Retailers'
AND sl.qt_product_category_l3_sid = prod_cat.qt_product_category_l3_sid
AND prod_cat.qt_product_category_l1_sid = 246
AND sl.qt_product_brand_sid = prod_brand.qt_product_brand_sid
AND prod_brand.qt_product_major_brand_sid IN (246903, 430138)
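And, as suggested above, a hedged sketch of the same rewrite using CTEs (untested against the real schema; the DISTINCT guards against duplicate rows if the dimension table matches more than once):
WITH prod_cat AS (
    SELECT DISTINCT qt_product_category_l3_sid
    FROM cdw_dwh.PUBLIC.qi_sg_prompt_category_major_brand
    WHERE qt_product_category_l1_sid = 246
),
prod_brand AS (
    SELECT DISTINCT qt_product_brand_sid
    FROM cdw_dwh.PUBLIC.qi_sg_prompt_category_major_brand
    WHERE qt_product_major_brand_sid IN (246903, 430138)
)
SELECT
    sl.customer_code,
    sl.qt_product_category_l3_sid,
    sl.qt_product_brand_sid,
    sl.partner_code,
    sl.transaction_id,
    sl.dollars_spent,
    sl.units,
    sl.user_pii_sid,
    sl.promo_flag,
    sl.media_flag
FROM cdw_dwh.public.qi_sg_promo_media_sales_lines_fact sl
JOIN prod_cat
    ON sl.qt_product_category_l3_sid = prod_cat.qt_product_category_l3_sid
JOIN prod_brand
    ON sl.qt_product_brand_sid = prod_brand.qt_product_brand_sid
WHERE sl.transaction_date_id >= to_char(current_date - (52*7), 'yyyymmdd')
    AND sl.partner_code = 'All Retailers';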
Apparently performance is an area of focus for Snowflake R&D. After struggling to make complex queries perform on big data, we got 100x improvements with Exasol, with no tuning whatsoever.

How to speed up query that use postgis extension?

I have the following query that checks whether the point (T.latitude, T.longitude) is inside a POLYGON:
query = """
SELECT id
FROM T
WHERE ST_Intersects(ST_Point(T.latitude, T.longitude), 'POLYGON(({points}))')
"""
But it is slow. How can I speed it up, given that I have the following index:
(latitude, longitude)?
The query is slow because it must evaluate the intersection for every row. That makes the Postgres server do a lot of math, and it forces a scan through your whole location table. How can we optimize this? Maybe we can first eliminate the points that are too far north, south, east, or west?
1) Add a geometry column of type Geometry(Point) and fill it:
ALTER TABLE T add COLUMN geom geometry(Point);
UPDATE T SET geom = ST_Point(T.latitude, T.longitude);
2) Create a spatial index:
CREATE INDEX t_gix ON t USING GIST (geom);
3) Use ST_DWithin instead of ST_Intersects:
WHERE ST_DWithin('POLYGON(({points}))', geom, 0)
You actually want to find the points which are within a polygon, so ST_DWithin() is what you need. From the documentation:
This function call will automatically include a bounding box
comparison that will make use of any indexes that are available
PS:
If for some reason you cannot apply steps 1 and 2, at least use ST_DWithin instead of ST_Intersects:
WHERE ST_DWithin('POLYGON(({points}))', ST_Point(T.latitude, T.longitude), 0)
The last parameter is the tolerance.
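Putting steps 1-3 together, the original query would become something like this sketch ({points} is the placeholder from the question):
SELECT id
FROM T
WHERE ST_DWithin('POLYGON(({points}))', T.geom, 0);
The spatial index on geom is what turns the full-table math into an index lookup.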
You can easily speed up your spatial queries by adding a t1.geom && t2.geom condition to your scripts.
This condition:
requires a spatial index, so your spatial columns must have spatial indexes;
returns an approximate (bounding-box) result, but combined with the ST_ operators the overall result stays exact.
Here is an example from my database, with query timings:
select p.id,k.id, p.poly&&k.poly as intersects
from parcel p , enterance k
where st_contains(p.poly,k.poly) and p.poly&&k.poly
--without && 10.4 sec
--with && 1.6 sec
select count(*) from parcel --34797
select count(*) from enterance --70715
https://postgis.net/docs/overlaps_geometry_box2df.html

What is the meaning of the "Missing Index Impact %" in a SQL Server 2008 execution plan?

I was just examining an estimated execution plan in SSMS. I noticed that a query had a query cost of 99% (relative to the batch). I then examined the plan displayed below. That cost was almost entirely coming from a "Clustered Index Delete" on table A. However, the Missing Index recommendation is for Table B, and the Missing Index Impact is said to be 95%.
The query is a DELETE statement (obviously) which relies on a nested loops INNER JOIN with Table B. If nearly all the cost according to the plan is coming from the DELETE operation, why would the index suggestion be on Table B, which, even though it was a scan, had a cost of only 0%? Is the impact of 95% measured against the negligible cost of the scan (listed at 0%) rather than the overall cost of the query (said to be nearly ALL of the batch)?
Please explain IMPACT if possible. Here is the plan:
This is query 27 in the batch.
Probably the impact it is showing you actually belongs to an entirely different statement (1-26).
This seems to be a problem with the way that the impacts are displayed for estimated plans in SSMS.
The two batches below contain the same two statements with the order reversed. Notice that in the first case it claims both statements would be helped equally with an impact of 99.938, and in the second, 49.9818.
So it is showing the estimated impact for the first instance encountered of that missing index, not the one that actually relates to the statement.
I don't see this issue in the actual execution plans and the correct impact is actually shown in the plan XML next to each statement even in the estimated plan.
I've added a Connect item report about this issue here. (Though possibly you have encountered another issue as 10% impact seems to be the cut off point for the missing index details being included in the plan and it is difficult to see how that would be possible for the same reasons as described in the question)
Example Data
CREATE TABLE T1
(
X INT,
Y CHAR(8000)
)
INSERT INTO T1
(X)
SELECT TOP 10000 ROW_NUMBER() OVER (ORDER BY @@SPID)
FROM sys.all_objects o1,
sys.all_objects o2
Batch 1
SELECT *
FROM T1
WHERE X = -1
SELECT *
FROM T1
WHERE X = -1
UNION ALL
SELECT *
FROM T1
Batch 2
SELECT *
FROM T1
WHERE X = -1
UNION ALL
SELECT *
FROM T1
SELECT *
FROM T1
WHERE X = -1
The XML for the first plan (heavily truncated) is below, showing that the correct information is in the plan itself.
<?xml version="1.0" encoding="utf-16"?>
<ShowPlanXML>
<BatchSequence>
<Batch>
<Statements>
<StmtSimple StatementCompId="1">
<QueryPlan>
<MissingIndexes>
<MissingIndexGroup Impact="99.938">
<MissingIndex Database="[tempdb]" Schema="[dbo]" Table="[T1]">
<ColumnGroup Usage="EQUALITY">
<Column Name="[X]" ColumnId="1" />
</ColumnGroup>
</MissingIndex>
</MissingIndexGroup>
</MissingIndexes>
</QueryPlan>
</StmtSimple>
</Statements>
<Statements>
<StmtSimple StatementCompId="2">
<QueryPlan>
<MissingIndexes>
<MissingIndexGroup Impact="49.9818">
<MissingIndex Database="[tempdb]" Schema="[dbo]" Table="[T1]">
<ColumnGroup Usage="EQUALITY">
<Column Name="[X]" ColumnId="1" />
</ColumnGroup>
</MissingIndex>
</MissingIndexGroup>
</MissingIndexes>
</QueryPlan>
</StmtSimple>
</Statements>
</Batch>
</BatchSequence>
</ShowPlanXML>
Assuming that the interpretation of the missing index impact % is identical or similar to that of the avg_user_impact column from the sys.dm_db_missing_index_group_stats system view, then the missing index impact % represents (more or less):
Average percentage benefit that user queries could experience if this
missing index group was implemented. The value means that the query
cost would on average drop by this percentage if this missing index
group was implemented.
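If you want to compare against the DMV figures directly, a sketch along these lines should work (the standard missing-index DMV join):
SELECT mid.statement AS table_name,
       migs.avg_user_impact,
       migs.user_seeks,
       mid.equality_columns,
       mid.inequality_columns
FROM sys.dm_db_missing_index_groups AS mig
JOIN sys.dm_db_missing_index_group_stats AS migs
    ON migs.group_handle = mig.index_group_handle
JOIN sys.dm_db_missing_index_details AS mid
    ON mid.index_handle = mig.index_handle
ORDER BY migs.avg_user_impact DESC;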
Thanks for the information, everyone. I believe Martin Smith did find a bug as a result of this, though I am not sure if it is the same bug as what I am seeing. In fact, I am not sure if my issue is a bug or by design. Let me elaborate on some new observations:
In looking through this rather large execution plan (62 queries), I noticed that the Missing Index recommendation (and respective Impact %) that I mentioned in the original question is listed on nearly every query in the 62-query batch. Oddly, many of these queries do not even touch the table the index is recommended for! After observing this, I opened the XML and searched for the element 'MissingIndexes', which showed about 10 different missing indexes, all with varying Impact %'s, naturally. Why the execution plan does not show this visually and instead shows just one Missing Index, I do not know. I presume either 1) it is a bug, or 2) it only shows the missing index with the HIGHEST impact %, which is the one I see riddled throughout my entire plan.
A suggestion if you are experiencing this too: get comfortable with the XML over the visual execution plan. Search for the XML element 'MissingIndexes' and match it up with the statements to get proper results.
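For instance, here is a hedged sketch for pulling every plan containing a MissingIndexes element out of the plan cache, so you can match them to their statements (uses the standard showplan XML namespace):
WITH XMLNAMESPACES (DEFAULT 'http://schemas.microsoft.com/sqlserver/2004/07/showplan')
SELECT st.text, qp.query_plan
FROM sys.dm_exec_cached_plans AS cp
CROSS APPLY sys.dm_exec_query_plan(cp.plan_handle) AS qp
CROSS APPLY sys.dm_exec_sql_text(cp.plan_handle) AS st
WHERE qp.query_plan.exist('//MissingIndexes') = 1;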
I also read from Microsoft (http://technet.microsoft.com/en-us/library/ms345524(v=sql.105).aspx)
that the missing index stats come from a group of DMVs. If the Impact % is in fact from these DMVs, then I would also presume that the Impact % is based on MUCH more than just the query/statement in the execution plan where the index is recommended. So take it with a grain of salt, and use these figures wisely based on your own knowledge of your database.
I am going to leave this opened-ended and not mark anything as an "answer" just yet. Feel free to chime in folks!
Thanks again.
Okay, so let me see if I can clarify here.
There will still be costs to those other operations; the 0% is because the DELETE in a loop is taking the vast majority of your processor and IO time. That doesn't, however, mean those other operations don't have processor/memory/IO costs that can be improved on this query by adding that index. Especially if you are doing a loop: essentially you're matching to Table B for one record, then deleting out of Table A, over and over. Therefore having an index that makes it easier to match those rows will speed up your delete.
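Purely as a hypothetical sketch (the real table and column names are not given in the question), the kind of index that helps here covers the join key used by the nested loops:
-- JoinKeyCol is a placeholder for whatever column Table B is joined on:
CREATE NONCLUSTERED INDEX IX_TableB_JoinKey
    ON dbo.TableB (JoinKeyCol);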

Two radically different queries against 4 mil records execute in the same time - one uses brute force

I'm using SQL Server 2008. I have a table with over 3 million records, which is related to another table with a million records.
I have spent a few days experimenting with different ways of querying these tables. I have it down to two radically different queries, both of which take 6s to execute on my laptop.
The first query uses a brute force method of evaluating possibly likely matches, and removes incorrect matches via aggregate summation calculations.
The second gets all possibly likely matches, then removes incorrect matches via an EXCEPT query that uses two dedicated indexes to find the low and high mismatches.
Logically, one would expect the brute-force query to be slow and the index-based one to be fast. Not so. And I experimented heavily with indexes until I got the best speed.
Further, the brute force query doesn't require as many indexes, which means that technically it would yield better overall system performance.
Below are the two execution plans. If you can't see them, please let me know and I'll re-post them in landscape orientation / mail them to you.
Brute-force query:
SELECT ProductID, [Rank]
FROM (
SELECT p.ProductID, ptr.[Rank], SUM(CASE
WHEN p.ParamLo < si.LowMin OR
p.ParamHi > si.HiMax THEN 1
ELSE 0
END) AS Fail
FROM dbo.SearchItemsGet(@SearchID, NULL) AS si
JOIN dbo.ProductDefs AS pd
ON pd.ParamTypeID = si.ParamTypeID
JOIN dbo.Params AS p
ON p.ProductDefID = pd.ProductDefID
JOIN dbo.ProductTypesResultsGet(@SearchID) AS ptr
ON ptr.ProductTypeID = pd.ProductTypeID
WHERE si.Mode IN (1, 2)
GROUP BY p.ProductID, ptr.[Rank]
) AS t
WHERE t.Fail = 0
Index-based exception query:
with si AS (
SELECT DISTINCT pd.ProductDefID, si.LowMin, si.HiMax
FROM dbo.SearchItemsGet(@SearchID, NULL) AS si
JOIN dbo.ProductDefs AS pd
ON pd.ParamTypeID = si.ParamTypeID
JOIN dbo.ProductTypesResultsGet(@SearchID) AS ptr
ON ptr.ProductTypeID = pd.ProductTypeID
WHERE si.Mode IN (1, 2)
)
SELECT p.ProductID
FROM dbo.Params AS p
JOIN si
ON si.ProductDefID = p.ProductDefID
EXCEPT
SELECT p.ProductID
FROM dbo.Params AS p
JOIN si
ON si.ProductDefID = p.ProductDefID
WHERE p.ParamLo < si.LowMin OR p.ParamHi > si.HiMax
My question is: based on the execution plans, which one looks more efficient? I realize that things may change as my data grows.
EDIT:
I have updated the indexes, and now have the following execution plan for the second query:
Trust the optimizer.
Write the query that most simply expresses what you're trying to achieve. If you're having performance problems with that query, then you should look at whether there are any missing indexes. But you still shouldn't have to explicitly work with those indexes.
Don't concern yourself with considerations of how you might implement such a search.
In very rare circumstances, you may need to further force the query to use particular indexes (via hints), but this is probably < 0.1% of queries.
In your posted plans, your "optimized" version is causing scans against 2 indexes of your (I presume) Params table (PK_Params_1, IX_Params_1). Without seeing the queries, it's difficult to know why this is happening, but if you're comparing against having a single scan against a table ("Brute force") and two, it's easy to see why the second isn't more efficient.
I think I'd try:
SELECT p.ProductID, ptr.[Rank]
FROM dbo.SearchItemsGet(@SearchID, NULL) AS si
JOIN dbo.ProductDefs AS pd
ON pd.ParamTypeID = si.ParamTypeID
JOIN dbo.Params AS p
ON p.ProductDefID = pd.ProductDefID
JOIN dbo.ProductTypesResultsGet(@SearchID) AS ptr
ON ptr.ProductTypeID = pd.ProductTypeID
LEFT JOIN Params p_anti
on p_anti.ProductDefId = pd.ProductDefID and
(p_anti.ParamLo < si.LowMin or p_anti.ParamHi > si.HiMax)
WHERE si.Mode IN (1, 2)
AND p_anti.ProductID is null
GROUP BY p.ProductID, ptr.[Rank]
I.e. introduce an anti-join that eliminates the results you don't want.
In SQL Server Management Studio, put both queries in the same query window and get the query plan for both at once. It should determine the query plans for both and give you a 'percent of total batch' for each one. The query with the lower percent of the total batch will be the better performing one.
Does 6 seconds on a laptop = .006 seconds on production hardware? The part of your queries which worries me is the clustered index scans shown in the query plan. In my experience, any time a query plan includes a CI scan, it means the query will only get slower as data is added to the table.
What do the two functions yield? It appears they are the cause of the table scans. Is it possible to persist the data in the db and update the LowMin and HiMax as rows are added?
Looking at the two execution plans, neither is very good. Look how far to the left the wide lines reach. Wide lines mean many rows. We need to reduce the number of rows earlier in the process, so we do not work with such large hash tables, large sorts, and nested loops.
BTW how many rows does your source have and how many rows are included in the result set?
Thank you all for your input and help.
From reading what you wrote, experimenting, and digging into the execution plan, I discovered the answer is the tipping point.
There were too many records being returned to warrant use of the index.
See here (Kimberly Tripp).
