Does join change the sorting of original table? - sql-server

Does join change the sorting of original table?
For example
SELECT * FROM #empScheduled
and
SELECT es.job_code, es.shift_id, es.unit_id, jd.job_description, s.shift_description, u.unit_code,
       [group] = (SELECT job_description FROM dbo.job_desc WHERE job_code = jd.job_group_with),
       es.new_group
FROM #empScheduled es
JOIN job_desc jd ON es.job_code = jd.job_code
JOIN dbo.shifts s ON es.shift_id = s.shift_id
JOIN dbo.units u ON es.unit_id = u.unit_id
Will produce the same records but in different sort orders! Why does that happen, and what is the way to stop it and produce the result in the same sort order as the original table WITHOUT having to use ORDER BY? Thanks.

If you don't specify an ORDER BY, there is no defined order on the result tuples. A simple SELECT * FROM x will usually return the tuples in the order they are stored, since only a table scan is done. However, if a join or similar is done, there is no way of knowing how the intermediate results are handled, and the order of the results is undefined.
It is never a good idea to want the result "in the same order as the original table", since this is not a definition of an order. If you want to preserve the insert order, use an ascending primary key and simply order by that. This way, you can always restore the "original" order of your tuples.
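One way to make that workable with the temp table from the question, sketched here with made-up column types and an added identity column named rn:
CREATE TABLE #empScheduled
(
    rn        int IDENTITY(1,1) PRIMARY KEY,   -- ever-increasing key that records insert order
    job_code  varchar(20),
    shift_id  int,
    unit_id   int,
    new_group varchar(50)
);

-- ... populate #empScheduled in the order you care about ...

SELECT es.job_code, es.shift_id, es.unit_id, es.new_group
FROM #empScheduled es
JOIN dbo.job_desc jd ON es.job_code = jd.job_code
ORDER BY es.rn;   -- explicitly restores the insert order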

There is no inherent sort order in SQL Server.
In your case, it is likely due to indexes on your joined table and their sorted order.
If you want consistently sorted results, you must use an ORDER BY clause.

You might try the FORCE ORDER hint to explicitly force the join order. Otherwise the optimizer can order the joins any way it thinks is best. Not sure why you cannot use ORDER BY though.
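If you do try that, the hint is attached at the end of the query, roughly like this (a sketch only; note that FORCE ORDER fixes the join order the optimizer uses, it still does not guarantee the order of the returned rows without an ORDER BY):
SELECT es.job_code, es.shift_id, es.unit_id, jd.job_description
FROM #empScheduled es
JOIN dbo.job_desc jd ON es.job_code = jd.job_code
OPTION (FORCE ORDER);   -- joins are evaluated in the order written; row order is still not guaranteed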

Related

MSSQL select query with prioritized OR

I need to build one MSSQL query that selects one row that is the best match.
Ideally, we have a match on street, zip code and house number.
Only if that does not deliver any results is a match on just street and zip code sufficient.
I have this query so far:
SELECT TOP 1 * FROM realestates
WHERE
(Address_Street = '[Street]'
AND Address_ZipCode = '1200'
AND Address_Number = '160')
OR
(Address_Street = '[Street]'
AND Address_ZipCode = '1200')
MSSQL currently gives me the result where the Address_Number is NOT 160, so it seems like the 2nd clause (where only street and zipcode have to match) is taking precedence over the 1st. If I switch around the two OR clauses, same result :)
How could I prioritize the first OR clause, so that MSSQL stops looking for other results if we found a match where the three fields are present?
The problem here isn't the WHERE (though it is a "problem"), it's the lack of an ORDER BY. You have a TOP (1), but you have nothing that tells the data engine which row is the "top" row, so an arbitrary row is returned. You need to provide logic in the ORDER BY to tell the data engine which is the "first" row. With the rudimentary logic you have in your question, this would likely be:
SELECT TOP (1)
       {Explicit Column List}
FROM realestates
WHERE Address_Street = '[Street]'
  AND Address_ZipCode = '1200'
ORDER BY CASE Address_Number WHEN '160' THEN 1 ELSE 2 END;
You can't prioritize anything in the WHERE clause. It always results in ALL the matching rows. What you can do is use TOP or FETCH to limit how many results you will see.
However, in order for this to be effective, you MUST have an ORDER BY clause. SQL tables are unordered sets by definition. This means without an ORDER BY clause the database is free to return rows in any order it finds convenient. Mostly this will be the order of the primary key, but there are plenty of things that can change this.
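For completeness, a sketch of the same query written with FETCH instead of TOP (available from SQL Server 2012 onwards), using the same ORDER BY logic as above:
SELECT *
FROM realestates
WHERE Address_Street = '[Street]'
  AND Address_ZipCode = '1200'
ORDER BY CASE Address_Number WHEN '160' THEN 1 ELSE 2 END
OFFSET 0 ROWS FETCH FIRST 1 ROW ONLY;   -- same effect as TOP (1)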

Optimizing queries vs. adding indexes

I have this very old and SLOW query that I am trying to optimize, but I am not sure I can do anything to it other than add more indexes on the columns involved in WHERE, JOIN and ORDER BY.
Query:
SELECT TOP 400
    jobticket.jobnumber, jobticket.typeform, jobticket.filename, jobticket.req_number,
    jobticket.reqd_del_date, jobticket.point_of_contact, jobticket.status, jobticket.DapsDate, jobticket.elpod,
    job_info.IDOrderMaskedStatus, job_info.job_status, job_info.SalesID, job_info.location, job_info.TOMetadataID
FROM jobticket WITH (NOLOCK)
INNER JOIN job_info WITH (NOLOCK) ON job_info.jobnumber = jobticket.jobnumber
WHERE
(
NOT(
(jobticket.status = 'Complete' OR jobticket.status = 'Completed')
and (job_info.job_status = 'Actualized' OR job_info.job_status = ''
OR job_info.job_status = 'Actualized Credit Billed'
OR job_info.job_status = 'DWAS Actualized' OR job_info.job_status = 'DWAS Actualized Credit Billed'
)
)
or
((SELECT COUNT(job_status) AS Expr1 FROM tblConsolidatedBilling AS tblConsolidatedBilling_1 WITH (NOLOCK)
WHERE (job_status <> 'Actualized'
AND job_status <> 'Actualized Credit Billed')
AND (master_jobnumber = jobticket.jobnumber)) > 0)
)
and (jobticket.status != 'Waiting Approval' or (jobticket.status = 'Waiting Approval' and jobticket.DPGType is null))
and jobticket.typeform <> 'todpg'
and ((job_info.isHidden <> 1 or job_info.isHidden is null) and job_info.isInConcurrentRelease is null)
and job_info.deleted != '1'
and jobticket.status != 'New Job'
and jobticket.status != 'PRFYCLSFD'
ORDER BY
job_info.expediencyLevel DESC,
jobticket.jobnumber DESC
Execution Plan:
In all honesty I don't know what to do with this query.
Should I add individual nonclustered indexes on all columns involved in the WHERE, JOIN and ORDER BY?
There are many indexes on these tables, but I am not sure whether they are helpful in this query:
Looking at this SQL, I don't really see any clear criteria being used to fetch the rows. It looks like it is mostly excluding rows using a lot of different criteria. My guess is that most tickets eventually end up in one of the statuses the query excludes, so only a small fraction of the rows is returned?
The problem is that there is no single clear, indexable criterion; there are many different rules, which is why the plan ends up doing a clustered index scan plus key lookups for all the rows. The scan starts from job_info, but I'm not sure whether it would make any difference if it started from jobticket.
Removing most of the indexes is probably a good place to start, but it won't speed up the select at all.
The query looks quite complex, so my guess is that you can't create an indexed view that would contain this data. That might help assuming this query is executed often and the data doesn't change that much (and it would remove the overhead of maintaining a huge number of indexes), but it might not be possible.
Another idea would be to investigate the rules for when rows can be excluded and see whether they can be made clear enough to index, for example by adding a persisted computed column to the table, as sketched below.
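A rough sketch of that computed-column idea; the column name is_excluded and the exact status list are illustrative guesses, not derived from the full business rules in the query:
ALTER TABLE dbo.jobticket
ADD is_excluded AS CASE
        WHEN status IN ('Complete', 'Completed', 'New Job', 'PRFYCLSFD') THEN 1
        ELSE 0
    END PERSISTED;

CREATE NONCLUSTERED INDEX IX_jobticket_is_excluded
    ON dbo.jobticket (is_excluded)
    INCLUDE (jobnumber, typeform, status, DPGType);   -- lets the filter be satisfied from the index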
You haven't mentioned how long this actually takes, and how many rows there are in the tables, so everything is basically just a guess. Including more data + statistics io output into the question might help.
ps. I don't personally recommend using NOLOCK except in really special cases, because it can cause problems that are really hard to solve, like reading the same data more than once or skipping rows totally.
A simple fix would be to make the indexes on job_info and tblConsolidatedBilling covering because a ton of time is spent in key lookups there. That should give an integer factor speedup. If that's not enough we need to investigate further.
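For illustration, covering indexes for this query could look roughly like the sketch below; the key/INCLUDE choices are a guess based on the columns the query touches, not a measured recommendation:
CREATE NONCLUSTERED INDEX IX_job_info_jobnumber_covering
    ON dbo.job_info (jobnumber)
    INCLUDE (job_status, IDOrderMaskedStatus, SalesID, location, TOMetadataID,
             isHidden, isInConcurrentRelease, deleted, expediencyLevel);   -- avoids key lookups on job_info

CREATE NONCLUSTERED INDEX IX_tblConsolidatedBilling_master_jobnumber
    ON dbo.tblConsolidatedBilling (master_jobnumber)
    INCLUDE (job_status);   -- the correlated COUNT only needs these two columns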

Non standard join

I ran into a problem and was trying to find a general solution to it as a join.
I have 2 tables:
http://pastebin.com/q5yws5Ym (not sure how to enforce the styling)
And I want to generate something like
http://pastebin.com/GscBUrYS
(while there are more parameters, I'm interested in how I would do it for something like this)
While I was able to reach a similar effect with self-joins and equi-joins, it would generate a lot of unneeded rows, which I'm not sure how to delete automatically.
Try something along the lines of:
SELECT u.user_id, j1.user_param, j1.user_value, j2.user_param, j2.user_value
FROM [user] u
JOIN users_info j1 ON u.user_id = j1.user_id
JOIN users_info j2 ON u.user_id = j2.user_id
WHERE j1.user_param != j2.user_param
It may be possible that you will need some more "exclusion" clauses in the WHERE to make sure that every row is only selected once, but the general idea should work (for a known and limited number of different user_param values).
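One such exclusion clause, sketched below, is to replace the inequality with a strict ordering so each pair of parameters is emitted only once (this assumes user_param values are comparable, e.g. names or IDs):
SELECT u.user_id, j1.user_param, j1.user_value, j2.user_param, j2.user_value
FROM [user] u
JOIN users_info j1 ON u.user_id = j1.user_id
JOIN users_info j2 ON u.user_id = j2.user_id
WHERE j1.user_param < j2.user_param;   -- each unordered pair of params appears only once per user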

Two radically different queries against 4 mil records execute in the same time - one uses brute force

I'm using SQL Server 2008. I have a table with over 3 million records, which is related to another table with a million records.
I have spent a few days experimenting with different ways of querying these tables. I have it down to two radically different queries, both of which take 6s to execute on my laptop.
The first query uses a brute force method of evaluating possibly likely matches, and removes incorrect matches via aggregate summation calculations.
The second gets all possibly likely matches, then removes incorrect matches via an EXCEPT query that uses two dedicated indexes to find the low and high mismatches.
Logically, one would expect the brute force to be slow and the indexes one to be fast. Not so. And I have experimented heavily with indexes until I got the best speed.
Further, the brute force query doesn't require as many indexes, which means that technically it would yield better overall system performance.
Below are the two execution plans. If you can't see them, please let me know and I'll re-post them in landscape orientation / mail them to you.
Brute-force query:
SELECT ProductID, [Rank]
FROM (
SELECT p.ProductID, ptr.[Rank], SUM(CASE
WHEN p.ParamLo < si.LowMin OR
p.ParamHi > si.HiMax THEN 1
ELSE 0
END) AS Fail
FROM dbo.SearchItemsGet(@SearchID, NULL) AS si
JOIN dbo.ProductDefs AS pd
ON pd.ParamTypeID = si.ParamTypeID
JOIN dbo.Params AS p
ON p.ProductDefID = pd.ProductDefID
JOIN dbo.ProductTypesResultsGet(@SearchID) AS ptr
ON ptr.ProductTypeID = pd.ProductTypeID
WHERE si.Mode IN (1, 2)
GROUP BY p.ProductID, ptr.[Rank]
) AS t
WHERE t.Fail = 0
Index-based exception query:
with si AS (
SELECT DISTINCT pd.ProductDefID, si.LowMin, si.HiMax
FROM dbo.SearchItemsGet(@SearchID, NULL) AS si
JOIN dbo.ProductDefs AS pd
ON pd.ParamTypeID = si.ParamTypeID
JOIN dbo.ProductTypesResultsGet(@SearchID) AS ptr
ON ptr.ProductTypeID = pd.ProductTypeID
WHERE si.Mode IN (1, 2)
)
SELECT p.ProductID
FROM dbo.Params AS p
JOIN si
ON si.ProductDefID = p.ProductDefID
EXCEPT
SELECT p.ProductID
FROM dbo.Params AS p
JOIN si
ON si.ProductDefID = p.ProductDefID
WHERE p.ParamLo < si.LowMin OR p.ParamHi > si.HiMax
My question is, based on the execution plans, which one looks more efficient? I realize that things may change as my data grows.
EDIT:
I have updated the indexes, and now have the following execution plan for the second query:
Trust the optimizer.
Write the query that most simply expresses what you're trying to achieve. If you're having performance problems with that query, then you should look at whether there are any missing indexes. But you still shouldn't have to explicitly work with these indexes.
Don't concern yourself with considerations of how you might implement such a search.
In very rare circumstances, you may need to further force the query to use particular indexes (via hints), but this is probably < 0.1% of queries.
In your posted plans, your "optimized" version is causing scans against 2 indexes of your (I presume) Params table (PK_Params_1, IX_Params_1). Without seeing the queries, it's difficult to know why this is happening, but if the "brute force" version does a single scan against that table and the "optimized" one does two, it's easy to see why the second isn't more efficient.
I think I'd try:
SELECT p.ProductID, ptr.[Rank]
FROM dbo.SearchItemsGet(@SearchID, NULL) AS si
JOIN dbo.ProductDefs AS pd
ON pd.ParamTypeID = si.ParamTypeID
JOIN dbo.Params AS p
ON p.ProductDefID = pd.ProductDefID
JOIN dbo.ProductTypesResultsGet(@SearchID) AS ptr
ON ptr.ProductTypeID = pd.ProductTypeID
LEFT JOIN Params p_anti
on p_anti.ProductDefId = pd.ProductDefID and
(p_anti.ParamLo < si.LowMin or p_anti.ParamHi > si.HiMax)
WHERE si.Mode IN (1, 2)
AND p_anti.ProductID is null
GROUP BY p.ProductID, ptr.[Rank]
I.e. introduce an anti-join that eliminates the results you don't want.
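The same anti-join intent can also be expressed with NOT EXISTS, which some find easier to read; this is only a sketch that mirrors the LEFT JOIN version above:
SELECT p.ProductID, ptr.[Rank]
FROM dbo.SearchItemsGet(@SearchID, NULL) AS si
JOIN dbo.ProductDefs AS pd ON pd.ParamTypeID = si.ParamTypeID
JOIN dbo.Params AS p ON p.ProductDefID = pd.ProductDefID
JOIN dbo.ProductTypesResultsGet(@SearchID) AS ptr ON ptr.ProductTypeID = pd.ProductTypeID
WHERE si.Mode IN (1, 2)
  AND NOT EXISTS (SELECT 1
                  FROM dbo.Params AS p_anti
                  WHERE p_anti.ProductDefID = pd.ProductDefID
                    AND (p_anti.ParamLo < si.LowMin OR p_anti.ParamHi > si.HiMax))
GROUP BY p.ProductID, ptr.[Rank];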
In SQL Server Management Studio, put both queries in the same query window and get the query plan for both at once. It should determine the query plans for both and give you a 'percent of total batch' for each one. The query with the lower percent of the total batch will be the better performing one.
Does 6 seconds on a laptop = 0.006 seconds on production hardware? The part of your queries that worries me is the clustered index scans shown in the query plan. In my experience, any time a query plan includes a CI scan it means the query will only get slower as data is added to the table.
What do the two functions yield, as it appears they are the cause of the table scans? Is it possible to persist the data in the db and update the LowMin and HiMax as rows are added?
Looking at the two execution plans, neither is very good. Look how far to the left the wide lines are. The wide lines mean there are many rows. We need to reduce the number of rows earlier in the process so we do not work with such large hash tables, large sorts and nested loops.
BTW how many rows does your source have and how many rows are included in the result set?
Thank you all for your input and help.
From reading what you wrote, experimenting, and digging into the execution plan, I discovered the answer is the tipping point.
There were too many records being returned to warrant use of the index.
See Kimberly Tripp's write-up on the tipping point.

Why Is My Inline Table UDF so much slower when I use variable parameters rather than constant parameters?

I have a table-valued, inline UDF. I want to filter the results of that UDF to get one particular value. When I specify the filter using a constant parameter, everything is great and performance is almost instantaneous. When I specify the filter using a variable parameter, it takes a significantly larger chunk of time, on the order of 500x more logical reads and 20x greater duration.
The execution plan shows that in the variable parameter case the filter is not applied until very late in the process, causing multiple index scans rather than the seeks that are performed in the constant case.
I guess my questions are: Why, since I'm specifying a single filter parameter that is going to be highly selective against an indexed field, does my performance go into the weeds when that parameter is in a variable? Is there anything I can do about this?
Does it have something to do with the analytic function in the query?
Here are my queries:
CREATE FUNCTION fn_test()
RETURNS TABLE
WITH SCHEMABINDING
AS
RETURN
SELECT DISTINCT GCN_SEQNO, Drug_package_version_ID
FROM
(
SELECT COALESCE(ndctbla.GCN_SEQNO, ndctblb.GCN_SEQNO) AS GCN_SEQNO,
       dpv.Drug_package_version_ID,
       ROW_NUMBER() OVER (PARTITION BY dpv.Drug_package_version_id
                          ORDER BY ndctbla.GCN_SEQNO DESC) AS Predicate
FROM dbo.Drug_Package_Version dpv
LEFT JOIN dbo.NDC ndctbla ON ndctbla.NDC = dpv.Sp_package_code
LEFT JOIN dbo.NDC ndctblb ON ndctblb.SPC_NDC = dpv.Sp_package_code
) iq
WHERE Predicate = 1
GO
GRANT SELECT ON fn_test TO public
GO
-- very fast
SELECT GCN_SEQNO
FROM dbo.fn_test()
WHERE Drug_package_version_id = 10000
GO
-- comparatively slow
DECLARE @dpvid int
SET @dpvid = 10000
SELECT GCN_SEQNO
FROM dbo.fn_test()
WHERE Drug_package_version_id = @dpvid
Once you create a new projection through a UDF, you can't expect the indexes on the original table to still apply to the columns included in that projection. When you filter on the projection (rather than inside the UDF, against the original table and its indexes), the indexes no longer apply.
What you want to do is parameterize the function to take in the parameter.
If you find that you have too many fields that you want to set parameters on, then you might want to take a look at indexed views, as you can create your projection and index it as well and then run queries against that.
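A minimal sketch of that parameterization, assuming you are free to change the function's signature (the function name fn_test_by_dpv and the parameter @dpvid are made up here):
CREATE FUNCTION dbo.fn_test_by_dpv (@dpvid int)
RETURNS TABLE
WITH SCHEMABINDING
AS
RETURN
    SELECT DISTINCT GCN_SEQNO, Drug_package_version_ID
    FROM (
        SELECT COALESCE(ndctbla.GCN_SEQNO, ndctblb.GCN_SEQNO) AS GCN_SEQNO,
               dpv.Drug_package_version_ID,
               ROW_NUMBER() OVER (PARTITION BY dpv.Drug_package_version_id
                                  ORDER BY ndctbla.GCN_SEQNO DESC) AS Predicate
        FROM dbo.Drug_Package_Version dpv
        LEFT JOIN dbo.NDC ndctbla ON ndctbla.NDC = dpv.Sp_package_code
        LEFT JOIN dbo.NDC ndctblb ON ndctblb.SPC_NDC = dpv.Sp_package_code
        WHERE dpv.Drug_package_version_ID = @dpvid   -- filter applied inside the function
    ) iq
    WHERE Predicate = 1
GO

-- usage
DECLARE @dpvid int
SET @dpvid = 10000
SELECT GCN_SEQNO FROM dbo.fn_test_by_dpv(@dpvid)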
Simply put, the constant is easy for the optimizer to evaluate when building the plan; the local variable is not, especially with the ranking function and the Predicate = 1 filter.
Paraphrasing casparOne, you need to push the filter as far inwards as possible so that you filter on dpv.Drug_package_version_id inside the iq derived table.
If you do that, then you also have no need for the PARTITION BY because you have only a single dpv.Drug_package_version_id. Then you can do a cleaner ...TOP 1 ... ORDER BY ndctbla.GCN_SEQNO DESC.
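For illustration, that reformulation might look roughly like this (a sketch only; it assumes @dpvid is declared as in the question and reuses the tables from the function above):
SELECT TOP (1) COALESCE(ndctbla.GCN_SEQNO, ndctblb.GCN_SEQNO) AS GCN_SEQNO
FROM dbo.Drug_Package_Version dpv
LEFT JOIN dbo.NDC ndctbla ON ndctbla.NDC = dpv.Sp_package_code
LEFT JOIN dbo.NDC ndctblb ON ndctblb.SPC_NDC = dpv.Sp_package_code
WHERE dpv.Drug_package_version_ID = @dpvid
ORDER BY ndctbla.GCN_SEQNO DESC;   -- highest GCN_SEQNO for that package version, no PARTITION BY needed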
The responses I got were good, and I learned from them, but I think I've found an answer that satisfies me.
I do think it's the use of the PARTITION BY clause that is causing the problem here. I reformulated the UDF using a variant of the self-join idiom:
SELECT t1.A, t1.B, t1.C
FROM T t1
INNER JOIN
(
SELECT A, MAX(C) AS C
FROM T
GROUP BY A
) t2 ON t1.A = t2.A AND t1.C = t2.C
Ironically, this is more performant than using the SQL 2008-specific query, and the optimizer also has no problem joining this version of the query using variables rather than constants. At this point, I'm concluding that the optimizer just doesn't handle the more recent SQL extensions as well as the older constructs. As a bonus, I can now use the UDF on my not-yet-upgraded SQL 2000 platforms.
Thanks for your help, everyone!
