I have a very old and SLOW query that I am trying to optimize, but I am not sure there is anything I can do to it other than add more indexes on the columns involved in WHERE, JOIN and ORDER BY.
Query:
SELECT TOP 400 jobticket.jobnumber, jobticket.typeform, jobticket.filename, jobticket.req_number, jobticket.reqd_del_date, jobticket.point_of_contact, jobticket.status, jobticket.DapsDate, jobticket.elpod, job_info.IDOrderMaskedStatus, job_info.job_status, job_info.SalesID, job_info.location, job_info.TOMetadataID
FROM jobticket WITH (NOLOCK)
INNER JOIN job_info WITH (NOLOCK) ON job_info.jobnumber = jobticket.jobnumber
WHERE
(
NOT(
(jobticket.status = 'Complete' OR jobticket.status = 'Completed')
and (job_info.job_status = 'Actualized' OR job_info.job_status = ''
OR job_info.job_status = 'Actualized Credit Billed'
OR job_info.job_status = 'DWAS Actualized' OR job_info.job_status = 'DWAS Actualized Credit Billed'
)
)
or
((SELECT COUNT(job_status) AS Expr1 FROM tblConsolidatedBilling AS tblConsolidatedBilling_1 WITH (NOLOCK)
WHERE (job_status <> 'Actualized'
AND job_status <> 'Actualized Credit Billed')
AND (master_jobnumber = jobticket.jobnumber)) > 0)
)
and (jobticket.status != 'Waiting Approval' or (jobticket.status = 'Waiting Approval' and jobticket.DPGType is null))
and jobticket.typeform <> 'todpg'
and ((job_info.isHidden <> 1 or job_info.isHidden is null) and job_info.isInConcurrentRelease is null)
and job_info.deleted != '1'
and jobticket.status != 'New Job'
and jobticket.status != 'PRFYCLSFD'
ORDER BY
job_info.expediencyLevel DESC,
jobticket.jobnumber DESC
Execution Plan:
In all honesty I don't know what to do with this query.
Should I add individual nonclustered indexes on all columns involved in the WHERE, JOIN and ORDER BY clauses?
There are many indexes on these tables, but I am not sure whether they are helpful in this query:
Looking at this SQL, I don't really see any clear criteria being used to fetch the rows; it mostly just excludes rows by a number of different rules. My guess is that most tickets end up in a state that these rules exclude, so only a small remainder is returned?
The problem is that there is no single clear criterion for that, just a lot of different rules, which is why the query ends up doing a clustered index scan plus key lookups for all the rows. The scan starts from job_info, but I'm not sure it would make any difference if it started from jobticket.
Removing most of the indexes is probably a good place to start, but that alone won't speed up the SELECT at all.
The query looks quite complex, so my guess is that you can't create an indexed view that would contain this data. An indexed view might help if this query is executed often and the data doesn't change much (and it would remove the overhead of maintaining a huge number of indexes), but it may simply not be possible here.
Another idea would be to investigate the rules for when rows can be excluded and whether those rules can be made clearer, so that they become indexable, for example by adding a persisted computed column to the table.
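For example, a hypothetical sketch (the column and index names are made up, and it only captures the single-table status rules from jobticket; the job_info rules would need their own treatment):

-- Hypothetical: fold the jobticket-side status exclusions into one persisted flag
ALTER TABLE jobticket ADD IsListable AS
    CASE WHEN status IN ('Complete', 'Completed', 'New Job', 'PRFYCLSFD')
         THEN 0 ELSE 1 END PERSISTED;

-- Index the flag so the plan can seek to the small "listable" slice instead of scanning
CREATE INDEX IX_jobticket_IsListable ON jobticket (IsListable, jobnumber);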
You haven't mentioned how long this actually takes or how many rows there are in the tables, so everything here is basically a guess. Including more data and STATISTICS IO output in the question might help.
PS: I don't personally recommend using NOLOCK except in really special cases, because it can cause problems that are really hard to troubleshoot, like reading the same data twice or skipping rows entirely.
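If the NOLOCK hints are only there to stop readers blocking writers, a safer alternative worth testing is row versioning (a sketch; the database name is a placeholder, and this adds version-store overhead in tempdb):

-- Readers get a consistent snapshot instead of dirty reads (SQL Server 2005+)
ALTER DATABASE YourDatabase SET READ_COMMITTED_SNAPSHOT ON;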
A simple fix would be to make the indexes on job_info and tblConsolidatedBilling covering, because a ton of time is spent in key lookups there. That should give an integer-factor speedup. If that's not enough, we need to investigate further.
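A hypothetical sketch of what "covering" could look like here (the key and INCLUDE columns are guesses from the query text; match them to the output columns of the key lookups in the actual plan):

-- Cover the join plus the selected/filtered columns so no key lookup is needed
CREATE INDEX IX_job_info_jobnumber_cov ON job_info (jobnumber)
    INCLUDE (job_status, IDOrderMaskedStatus, SalesID, location, TOMetadataID,
             isHidden, isInConcurrentRelease, deleted, expediencyLevel);

-- Let the correlated COUNT subquery seek by master_jobnumber and read job_status from the index
CREATE INDEX IX_tblConsolidatedBilling_master ON tblConsolidatedBilling (master_jobnumber, job_status);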
Related
In a SQL query I have to join many tables, and it's very expensive for the DB.
In the DB a hostgroup has many hosts; there are about 20 hostgroups, and there are 4 hostgroups that I don't use...
I was wondering: if I add a "not in" operator to my query, excluding those 4 hostgroups, will the query be less expensive? Or will it just make things worse by using more resources on the DB?
this is my query, just in case...
select history.clock, hstgrp.name as hostgroup, hstgrp.groupid as hgid , hosts.name as hostname ,
items.name as item, hosts.hostid, history.value as porcentaje, items.key_ as key ,items.itemid,
applications.name as appname, applications.applicationid as appid
FROM history
join items_applications on history.itemid = items_applications.itemid
join applications on items_applications.applicationid = applications.applicationid
join items on items.itemid = history.itemid
join hosts on items.hostid = hosts.hostid
join hosts_groups on hosts.hostid = hosts_groups.hostid
join hstgrp on hosts_groups.groupid = hstgrp.groupid
where lower(items.name) SIMILAR TO lower('Used disk space%|Used disk space on%')
and hstgrp.name not in ('Discovered', 'Discover VMs') <==============
The additional filter certainly cannot harm, but unless it is very selective, it will probably not reduce the execution time significantly.
I am reduced to guessing, since you didn't add EXPLAIN (ANALYZE, BUFFERS) output to the question, but I'd assume that the query returns a lot of rows and is bound to be slow.
You could change the SIMILAR TO condition to
WHERE lower(items.name) LIKE lower('Used disk space%')
and support it with an index:
CREATE INDEX ON items (lower(name) text_pattern_ops);
Perhaps that will speed up the execution somewhat.
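To verify that the index is actually picked up, one could check the plan with the same filter (a sketch using the item columns from the question's query):

-- PostgreSQL: confirm an Index Scan / Bitmap Index Scan on the new index
EXPLAIN (ANALYZE, BUFFERS)
SELECT itemid, name FROM items WHERE lower(name) LIKE lower('Used disk space%');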
I am trying to query a table which has 1 TB of data, clustered by Date and Company. A simple query is taking a long time.
Posting the query and query profile:
SELECT
sl.customer_code,
qt_product_category_l3_sid,
qt_product_brand_sid,
sl.partner_code,
sl.transaction_id,
dollars_spent,
units,
user_pii_sid,
promo_flag,
media_flag
FROM
cdw_dwh.public.qi_sg_promo_media_sales_lines_fact sl
WHERE
transaction_date_id >= (to_char(current_date - (52*7) , 'yyyymmdd') )
AND sl.partner_code IN ('All Retailers')
AND qt_product_category_l3_sid IN (SELECT DISTINCT qt_product_category_l3_sid
FROM cdw_dwh.PUBLIC.qi_sg_prompt_category_major_brand
WHERE qt_product_category_l1_sid IN (246))
AND qt_product_brand_sid IN (SELECT qt_product_brand_sid
FROM cdw_dwh.PUBLIC.qi_sg_prompt_category_major_brand
WHERE qt_product_major_brand_sid IN (246903, 430138))
"simple query" I am not sure there is such a thing. A naive query, sure.
select * from really_large_table where column1 = value;
will perform really badly if you only care about 1 or 2 of the columns, because Snowflake has to load all the data. You will get a column-data-to-row-data ratio improvement by using
select column1, column2 from really_large_table where column1 = value;
now only two columns of data need to be read from the data store.
Maybe you are looking for data where the value is > 100 because you think that should not happen. Then
select column1, column2 from really_large_table where column1 > 100 limit 1;
will perform much better than
select column1, column2 from really_large_table order by column1 desc limit 50;
But if what you are doing is already the minimum work needed to get a correct answer, your next option is to increase the warehouse size, which for IO-bound work gives a scalar improvement; some aggregation steps don't scale as linearly, though.
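Resizing is a one-line change (a sketch; the warehouse name is hypothetical, and credits scale with size, so size back down afterwards):

-- Scale up the compute for IO-bound scans; roughly linear for scan-heavy work
ALTER WAREHOUSE my_wh SET WAREHOUSE_SIZE = 'LARGE';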
Another thing to look for is that sometimes a calculation produces too much intermediate state and it "spills to external storage" (exact wording may differ), which is much like running out of RAM and going to swap disk.
We have also seen memory pressure slow things down when doing too much work in a JavaScript UDF.
But most of these issues can be spotted by looking at the query profile and finding the hotspots.
99% of the time was spent scanning the table. The filters in the query do not match your clustering keys, so the clustering won't help much. Depending on how much historical data you have in this table, and whether you will keep reading a year's worth of data, you might be better off re-clustering the table (or creating a materialized view) by qt_product_brand_sid or qt_product_category_l3_sid, depending on which one filters the data down faster.
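As a sketch, re-clustering might look like this (weigh the one-off and ongoing automatic re-clustering cost before committing to it):

-- Snowflake: recluster the fact table on the brand key from the question
ALTER TABLE cdw_dwh.public.qi_sg_promo_media_sales_lines_fact
    CLUSTER BY (qt_product_brand_sid);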
A big change would be converting the transaction date to a true date column instead of a varchar.
Second, you have an IN clause with a single value; use = instead.
But for the other IN clauses, I would suggest rewriting the query to separate those sub-queries out as CTEs and then simply joining to the CTEs, as in the sketch below.
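A hedged sketch of that rewrite, also folding in the single-value = and date arithmetic via DATEADD (table and column names are taken from the question; untested against the real schema):

-- Pre-filter each dimension once, then join to the pre-filtered sets
WITH cat AS (
    SELECT DISTINCT qt_product_category_l3_sid
    FROM cdw_dwh.public.qi_sg_prompt_category_major_brand
    WHERE qt_product_category_l1_sid = 246
), brand AS (
    SELECT DISTINCT qt_product_brand_sid
    FROM cdw_dwh.public.qi_sg_prompt_category_major_brand
    WHERE qt_product_major_brand_sid IN (246903, 430138)
)
SELECT sl.customer_code, sl.qt_product_category_l3_sid, sl.qt_product_brand_sid,
       sl.partner_code, sl.transaction_id, sl.dollars_spent, sl.units,
       sl.user_pii_sid, sl.promo_flag, sl.media_flag
FROM cdw_dwh.public.qi_sg_promo_media_sales_lines_fact sl
JOIN cat ON cat.qt_product_category_l3_sid = sl.qt_product_category_l3_sid
JOIN brand ON brand.qt_product_brand_sid = sl.qt_product_brand_sid
WHERE sl.transaction_date_id >= TO_CHAR(DATEADD(week, -52, CURRENT_DATE), 'yyyymmdd')
  AND sl.partner_code = 'All Retailers';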
Use this query:
SELECT
sl.customer_code,
sl.qt_product_category_l3_sid,
sl.qt_product_brand_sid,
sl.partner_code,
sl.transaction_id,
sl.dollars_spent,
sl.units,
sl.user_pii_sid,
sl.promo_flag,
sl.media_flag
FROM
cdw_dwh.public.qi_sg_promo_media_sales_lines_fact sl,
cdw_dwh.PUBLIC.qi_sg_prompt_category_major_brand prod_cat,
cdw_dwh.PUBLIC.qi_sg_prompt_category_major_brand prod_brand
WHERE
sl.transaction_date_id >= (to_char(current_date - (52*7), 'yyyymmdd'))
AND sl.partner_code = 'All Retailers'
AND sl.qt_product_category_l3_sid = prod_cat.qt_product_category_l3_sid
AND prod_cat.qt_product_category_l1_sid = 246
AND sl.qt_product_brand_sid = prod_brand.qt_product_brand_sid
AND prod_brand.qt_product_major_brand_sid IN (246903, 430138)
Apparently performance is an area of focus for Snowflake R&D. After struggling to make complex queries perform on big data we got 100x improvements with Exasol, no tuning whatsoever.
I have something of a search app. There are 7 fields (first name, last name, phone, street, city, shop number, credit card number) where the user can enter parameters, and it will find clients in the database. Everything works with AND conditions, so when first name is 'Andy' and last name is 'Larkin' it will only find Andy Larkins, etc. The user can leave a field empty, which means that when only first name is 'Andy' it should find all the Andys, etc. The database looks like this:
The 'Relation' table connects a person to a shop. A person must have at least one address and one shop, can have multiple addresses and multiple shops, and can have zero or more credit cards. Now, I have to handle all the filtering in a single query; I can't check some conditions first and then construct the query another way, I just don't have that option.
When I search by first name or last name it's fast (both are in the Person table), but when I search by phone number or credit card number it takes a lot of time. There is a lot of data in the database, but still, my query is bad; I'm not really good at writing queries, especially in Oracle. Here's the query:
SELECT
PERSON.personId,
PERSON.firstName,
PERSON.lastName,
ADDRESS.street,
ADDRESS.city,
ADDRESS.phoneNumber
FROM
PERSON
LEFT JOIN ADDRESS ON PERSON.personId = ADDRESS.personId
LEFT JOIN RELATION ON PERSON.personId = RELATION.personId
LEFT JOIN SHOPS ON RELATION.shopId = SHOPS.shopId
LEFT JOIN CREDITCARDS ON PERSON.personId = CREDITCARDS.personId
WHERE
PERSON.firstName = NVL(?, PERSON.firstName)
AND PERSON.lastName = NVL(?, PERSON.lastName)
AND ADDRESS.phoneNumber = NVL(?, ADDRESS.phoneNumber)
AND ADDRESS.street = NVL(?, ADDRESS.street)
AND ADDRESS.city = NVL(?, ADDRESS.city)
AND SHOPS.shopNumber = NVL(?, SHOPS.shopNumber)
AND CREDITCARDS.creditCardNumber = NVL(?, CREDITCARDS.creditCardNumber);
The parameters the user left empty are passed as NULLs; that's why I use NVL. When I delete all conditions and leave, let's say, a credit card number, then it's fast, so I guess all the unnecessary condition checking is slowing the query down, and I don't really need that checking in most cases; it's just there in case a user passes something.
If I had the option to check the conditions and only then construct the query, I would just add the conditions that are needed, but I don't have that option. I was thinking about adding some IFs to the query, but I'm not sure that's even possible; all I could find was IF/CASE WHEN, but no examples that apply to my case. I also tried this:
...WHERE (? IS NULL OR (PERSON.firstName = NVL(?, PERSON.firstName))) AND...
That didn't help, and I got tons of duplicated results (differing only in address or something similar, since a person can have multiple addresses), even with DISTINCT.
It's not homework; the real database is huge, with a lot of other fields and a lot of data, but I simplified it here. Thanks for the help.
A few things to think about here.
Be careful about queries that might not make sense, such as those that query a credit card number and an address; queries of that nature fall into a fan trap.
Creating referential integrity constraints in the database will allow the optimizer to do join elimination.
It would be much better for the optimizer if you could build the WHERE clause dynamically, rather than using NVL functions.
A nested select on the shops might improve performance, especially considering it's outer joined. The query below should be enough to give you the idea.
Regarding de-duplication: it's hard because you are selecting IDs, so DISTINCT won't help. You'd probably have to use GROUP BY, and that might slow the query down even more.
If sorting can be done on the client, it might help with performance. If the amount of data being returned is significant due to fan-out of the relational data, and GROUP BY isn't a good option, then a stored procedure might be the best option, so that most of the work is done in the database and minimal data goes over the wire.
SELECT
p.personId,
p.firstName,
a.city,
a.phoneNumber,
shop.shopNumber
FROM
PERSON p,
ADDRESS a,
CREDITCARDS c,
(select r.personId, s.shopId, s.shopNumber from SHOPS s, RELATION r
where s.shopId = r.shopId) shop
WHERE
p.personId = a.personId AND
p.personId = c.personId AND
p.personId = shop.personId (+)
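Another option (a sketch only, untested against your schema; the bind-variable names are hypothetical): keep the NULL-tolerant predicates, but move the optional shop and credit-card criteria into EXISTS subqueries so they filter without multiplying rows, which also addresses the duplicates you saw:

-- Optional criteria as EXISTS: a person matches once regardless of how many shops/cards match
SELECT p.personId, p.firstName, p.lastName, a.street, a.city, a.phoneNumber
FROM PERSON p
JOIN ADDRESS a ON a.personId = p.personId
WHERE (:firstName IS NULL OR p.firstName = :firstName)
  AND (:lastName IS NULL OR p.lastName = :lastName)
  AND (:phone IS NULL OR a.phoneNumber = :phone)
  AND (:street IS NULL OR a.street = :street)
  AND (:city IS NULL OR a.city = :city)
  AND (:shopNumber IS NULL OR EXISTS (
        SELECT 1 FROM RELATION r JOIN SHOPS s ON s.shopId = r.shopId
        WHERE r.personId = p.personId AND s.shopNumber = :shopNumber))
  AND (:cardNumber IS NULL OR EXISTS (
        SELECT 1 FROM CREDITCARDS c
        WHERE c.personId = p.personId AND c.creditCardNumber = :cardNumber));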
This is the SQL with a common table expression. Note that USERS_PROJECTS_CTE is used twice.
WITH USERS_PROJECTS_CTE (PRO_ID, SHOW_IAS, USERNAME)
AS
(
SELECT up.PRO_ID, up.SHOW_IAS, ISNULL(u.FIRST_NAME, '') + ' ' + ISNULL(u.SECOND_NAME, '')
FROM SFMIS07_PRO.USERS_PROJECTS up
INNER JOIN SFMIS07_ADM.USERS AS u
ON up.USER_ID = u.ID
WHERE up.IS_RESP_PERSON = 1 AND up.valid_to is null
)
SELECT up.PRO_ID,
up1.USERNAME as RESP_USER1,
up2.USERNAME as RESP_USER2,
up.COUNT_
FROM SFMIS07_PRO.PRO_RESP_USERS_KERNEL_MV AS up
LEFT JOIN USERS_PROJECTS_CTE AS up1 ON up.PRO_ID = up1.PRO_ID AND up1.SHOW_IAS=1
LEFT JOIN USERS_PROJECTS_CTE AS up2 ON up.PRO_ID = up2.PRO_ID AND up2.SHOW_IAS=0
The execution plan. Note that the CTE is displayed twice:
Questions:
am I right that the CTE is not only displayed twice but also processed twice?
is it possible to inform the QO to reuse the CTE?
is it possible for the QO in principle to detect "the same SQL fragment" and reuse its results (I imagine this being realized by copying the already prepared data)?
how to optimize the query (without using temporal tables :) ?
Am I right that the CTE is not only displayed twice but also processed twice?
Yes
Is it possible to inform the QO to reuse the CTE?
Not directly but there are some hacks to encourage this.
Is it possible for the QO in principle to detect "the same SQL fragment" and reuse its results (I imagine this being realized by copying the already prepared data)?
In principle, yes. See the Microsoft Research paper Efficient Exploitation of Similar Subexpressions for Query Processing for examples.
how to optimize the query (without using temporal tables :) ?
The most reliable way would be to use a temporary (not temporal) table. See Provide a hint to force intermediate materialization of CTEs or derived tables for a more hacky workaround.
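A minimal sketch of the temp-table variant, using the schema from the question (#UsersProjects is a hypothetical name):

-- Materialize the CTE's result once...
SELECT up.PRO_ID, up.SHOW_IAS,
       ISNULL(u.FIRST_NAME, '') + ' ' + ISNULL(u.SECOND_NAME, '') AS USERNAME
INTO #UsersProjects
FROM SFMIS07_PRO.USERS_PROJECTS up
INNER JOIN SFMIS07_ADM.USERS u ON up.USER_ID = u.ID
WHERE up.IS_RESP_PERSON = 1 AND up.valid_to IS NULL;

-- ...then read it twice instead of re-running the underlying joins
SELECT mv.PRO_ID, up1.USERNAME AS RESP_USER1, up2.USERNAME AS RESP_USER2, mv.COUNT_
FROM SFMIS07_PRO.PRO_RESP_USERS_KERNEL_MV mv
LEFT JOIN #UsersProjects up1 ON mv.PRO_ID = up1.PRO_ID AND up1.SHOW_IAS = 1
LEFT JOIN #UsersProjects up2 ON mv.PRO_ID = up2.PRO_ID AND up2.SHOW_IAS = 0;

DROP TABLE #UsersProjects;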
I'm using SQL Server 2008. I have a table with over 3 million records, which is related to another table with a million records.
I have spent a few days experimenting with different ways of querying these tables. I have it down to two radically different queries, both of which take 6s to execute on my laptop.
The first query uses a brute force method of evaluating possibly likely matches, and removes incorrect matches via aggregate summation calculations.
The second gets all possibly likely matches, then removes incorrect matches via an EXCEPT query that uses two dedicated indexes to find the low and high mismatches.
Logically, one would expect the brute force to be slow and the indexed one to be fast. Not so. And I have experimented heavily with indexes until I got the best speed.
Further, the brute force query doesn't require as many indexes, which means that technically it would yield better overall system performance.
Below are the two execution plans. If you can't see them, please let me know and I'll re-post them in landscape orientation / mail them to you.
Brute-force query:
SELECT ProductID, [Rank]
FROM (
SELECT p.ProductID, ptr.[Rank], SUM(CASE
WHEN p.ParamLo < si.LowMin OR
p.ParamHi > si.HiMax THEN 1
ELSE 0
END) AS Fail
FROM dbo.SearchItemsGet(@SearchID, NULL) AS si
JOIN dbo.ProductDefs AS pd
ON pd.ParamTypeID = si.ParamTypeID
JOIN dbo.Params AS p
ON p.ProductDefID = pd.ProductDefID
JOIN dbo.ProductTypesResultsGet(@SearchID) AS ptr
ON ptr.ProductTypeID = pd.ProductTypeID
WHERE si.Mode IN (1, 2)
GROUP BY p.ProductID, ptr.[Rank]
) AS t
WHERE t.Fail = 0
Index-based exception query:
with si AS (
SELECT DISTINCT pd.ProductDefID, si.LowMin, si.HiMax
FROM dbo.SearchItemsGet(@SearchID, NULL) AS si
JOIN dbo.ProductDefs AS pd
ON pd.ParamTypeID = si.ParamTypeID
JOIN dbo.ProductTypesResultsGet(@SearchID) AS ptr
ON ptr.ProductTypeID = pd.ProductTypeID
WHERE si.Mode IN (1, 2)
)
SELECT p.ProductID
FROM dbo.Params AS p
JOIN si
ON si.ProductDefID = p.ProductDefID
EXCEPT
SELECT p.ProductID
FROM dbo.Params AS p
JOIN si
ON si.ProductDefID = p.ProductDefID
WHERE p.ParamLo < si.LowMin OR p.ParamHi > si.HiMax
My question is: based on the execution plans, which one looks more efficient? I realize that things may change as my data grows.
EDIT:
I have updated the indexes, and now have the following execution plan for the second query:
Trust the optimizer.
Write the query that most simply expresses what you're trying to achieve. If you're having performance problems with that query, then you should look at whether there are any missing indexes. But you still shouldn't have to explicitly work with these indexes.
Don't concern yourself with considerations of how you might implement such a search.
In very rare circumstances, you may need to further force the query to use particular indexes (via hints), but this is probably < 0.1% of queries.
In your posted plans, your "optimized" version is causing scans against 2 indexes of your (I presume) Params table (PK_Params_1, IX_Params_1). Without seeing the queries, it's difficult to know why this is happening, but if you're comparing a single scan against a table ("brute force") with two, it's easy to see why the second isn't more efficient.
I think I'd try:
SELECT p.ProductID, ptr.[Rank]
FROM dbo.SearchItemsGet(@SearchID, NULL) AS si
JOIN dbo.ProductDefs AS pd
ON pd.ParamTypeID = si.ParamTypeID
JOIN dbo.Params AS p
ON p.ProductDefID = pd.ProductDefID
JOIN dbo.ProductTypesResultsGet(@SearchID) AS ptr
ON ptr.ProductTypeID = pd.ProductTypeID
LEFT JOIN Params p_anti
on p_anti.ProductDefId = pd.ProductDefID and
(p_anti.ParamLo < si.LowMin or p_anti.ParamHi > si.HiMax)
WHERE si.Mode IN (1, 2)
AND p_anti.ProductID is null
GROUP BY p.ProductID, ptr.[Rank]
I.e. introduce an anti-join that eliminates the results you don't want.
In SQL Server Management Studio, put both queries in the same query window and get the query plan for both at once. It will determine the query plans for both and give you a 'percent of total batch' for each one. The query with the lower percent of the total batch will be the better-performing one.
Does 6 seconds on a laptop = 0.006 seconds on production hardware? The part of your queries that worries me is the clustered index scans shown in the query plan. In my experience, any time a query plan includes a CI scan it means the query will only get slower as data is added to the table.
What do the two functions yield, as it appears they are the cause of the table scans? Is it possible to persist the data in the DB and update the LowMin and HiMax as rows are added?
Looking at the two execution plans, neither is very good. Look how far to the left the wide lines are. Wide lines mean many rows; we need to reduce the number of rows earlier in the process so we don't work with such large hash tables, large sorts and nested loops.
BTW, how many rows does your source have, and how many rows are included in the result set?
Thank you all for your input and help.
From reading what you wrote, experimenting, and digging into the execution plan, I discovered the answer is the tipping point.
There were too many records being returned to warrant use of the index.
See here (Kimberly Tripp).