Need help optimizing a database query. Very inexperienced with indices

I need to optimize this query. The professor recommends using indices, but I'm very confused about how. If I could get just one example of a good index, why it is good, and the actual code needed, I could definitely do the rest by myself. Any help would be awesome. (PSQL btw)
SELECT
x.enteredBy
, x.id
, count(DISTINCT xr.id)
, count(DISTINCT c.id)
, 'l'
FROM
((locationsV x left outer join locationReviews xr on x.id = xr.lid)
left outer join reviews r on r.id = xr.id)
left outer join comments c on xr.id = c.reviewId
WHERE
x.vNo = 0
AND (r.enteredBy IS NULL OR
(r.enteredBy <> x.enteredBy
AND c.enteredBy <> x.enteredBy
AND r.enteredBY NOT IN
(SELECT requested FROM friends WHERE requester = x.enteredBY)
AND r.enteredBY NOT IN
(SELECT requester FROM friends WHERE requested = x.enteredBY)))
AND (c.enteredBy IS NULL OR
(c.enteredBY NOT IN
(SELECT requested FROM friends WHERE requester = x.enteredBY)
AND c.enteredBY NOT IN
(SELECT requester FROM friends WHERE requested = x.enteredBY)))
GROUP BY
x.enteredBy
, x.id
I tried adding something like this to the beginning, but the overall time it took didn't change.
CREATE INDEX friends1_idx ON friends(requested);
CREATE INDEX friends2_idx ON friends(requester);

I think the SQL itself could be optimized to improve performance, in addition to looking at indexes. Having those IN clauses in the WHERE clause may cause the optimizer to do full table scans, so if you could move them to be tables in the FROM section you would likely get better performance. Also, having the COUNT(DISTINCT ...) clauses in the SELECT statement seems problematic. You would likely be better off if you could change the query so that the DISTINCT clauses were not necessary there and simply use the COUNT aggregate function.
Consider using a SQL statement in the FROM clause before you do the left join--a structure something like this:
SELECT ...
FROM Table1 LEFT JOIN
(SELECT ... FROM Table2 INNER JOIN Table3 ON ...) AS Table4 ON
Table1.somecolumn = Table4.somecolumn
...
I know this isn't giving you the solution, but hopefully it will help you to think about other aspects of the problem and to explore other ways to address performance.
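Since the question asks for one concrete index, here is a minimal sketch of one. It assumes the friends table has integer requester and requested columns, and it uses SQLite through Python's sqlite3 module only so that the example runs on its own; the CREATE INDEX statement itself is written the same way in PostgreSQL, where EXPLAIN ANALYZE is the tool for checking whether the planner uses it. The index name is made up. The idea is that the subquery filters on requester and returns requested, so an index with requester first and requested second lets the lookup be answered from the index alone.

```python
import sqlite3


def friends_lookup_plan(with_index):
    """Return SQLite's query plan text for the friends lookup used in the subqueries."""
    conn = sqlite3.connect(":memory:")
    # Assumed column types; the question does not show the table definition.
    conn.execute("CREATE TABLE friends (requester INTEGER, requested INTEGER)")
    if with_index:
        # Equality column first, selected column second.
        conn.execute(
            "CREATE INDEX friends_requester_requested_idx "
            "ON friends (requester, requested)"
        )
    rows = conn.execute(
        "EXPLAIN QUERY PLAN SELECT requested FROM friends WHERE requester = ?",
        (1,),
    ).fetchall()
    conn.close()
    # The last column of each plan row is the human-readable detail.
    return " | ".join(row[-1] for row in rows)


if __name__ == "__main__":
    print("without index:", friends_lookup_plan(False))
    print("with index:   ", friends_lookup_plan(True))
```

The mirrored subquery (WHERE requested = ... returning requester) would want the same index with the columns swapped. On small tables a planner may still prefer a full scan, which could be one reason the timing did not change.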

Related

SQL Server conditional join techniques

I have come across this situation multiple times wherein I need to grab data from one or another table based on some parameter to the stored procedure. Let me clarify with an example. Suppose we need to grab some data from either an archived table or an online table and a bunch of other tables. I can think of 3 ways to accomplish this:
1) Use an if condition and store result in a temp table and then join temp table to other tables
2) Use an if condition and grab data either from archive table or online table and join other tables. The entire query will be duplicated except for the part of archive table or online table.
3) Use a union subquery
Query for Approach 1
create table #archivedOrOnline (Id int);
declare @archivedData as bit = 1;
if (@archivedData = 1)
begin
insert into #archivedOrOnline
select
at.Id
from
dbo.ArchivedTable at
end
else
begin
insert into #archivedOrOnline
select
ot.Id
from
dbo.OnlineTable ot
end
select
*
from
#archivedOrOnline ao
inner join dbo.AnotherTable at on ao.Id = at.Id;
-- Lots more joins and subqueries irrespective of @archivedData
Query for Approach 2
declare @archivedData as bit = 1;
if (@archivedData = 1)
begin
select
*
from
dbo.ArchivedTable at
inner join dbo.AnotherTable another on at.Id = another.Id
-- Lots more joins and subqueries irrespective of @archivedData
end
else
begin
select
*
from
dbo.OnlineTable ot
inner join dbo.AnotherTable at on ot.Id = at.Id
-- Lots more joins and subqueries irrespective of @archivedData
end
Query for Approach 3
declare @archivedData as bit = 1;
select
*
from
(
select
ot.Id
from
dbo.OnlineTable ot
where
@archivedData = 0
union
select
at.Id
from
dbo.ArchivedTable at
where
@archivedData = 1
) archiveOrOnline
inner join dbo.AnotherTable at on at.Id = archiveOrOnline.Id;
-- Lots more joins and subqueries irrespective of @archivedData
Basically I am asking which approach to choose or if there is a better approach. Approach 2 will have a lot of duplicate code. The other 2 approaches remove code duplication. I even have the query plans but my knowledge of making decisions based on the query plan is limited. I always go with the approach which removes code duplication. If there is a performance issue, I may choose another approach.
Your approach 3 can work fine. You should definitely use UNION ALL not UNION though so SQL Server does not add operations to remove duplicates from the tables.
For best chances of success with approach 3 you would need to add an OPTION (RECOMPILE) hint so that SQL Server simplifies out the unneeded table reference at compile time at the expense of recompiling it on each execution.
If the query is executed too frequently to make that approach attractive then you may get an OK plan without it and filters with startup predicates to only access the relevant table at run time - but you may have problems with cardinality estimates with this more generic approach and it might limit the optimisations available and give you a worse plan than option 2.
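As a rough illustration of approach 3 with UNION ALL, here is a minimal sketch run against SQLite through Python's sqlite3 module. SQLite is an assumption made only so the example is runnable; the question targets SQL Server, where the parameter would be a T-SQL variable and OPTION (RECOMPILE) could be appended. The table contents and the aliases src, ar and other are made up.

```python
import sqlite3


def fetch_ids(archived):
    """Return ids from the archived or the online table, chosen by the parameter."""
    conn = sqlite3.connect(":memory:")
    conn.executescript("""
        CREATE TABLE OnlineTable (Id INTEGER);
        CREATE TABLE ArchivedTable (Id INTEGER);
        CREATE TABLE AnotherTable (Id INTEGER);
        INSERT INTO OnlineTable VALUES (1), (2);
        INSERT INTO ArchivedTable VALUES (2), (3);
        INSERT INTO AnotherTable VALUES (1), (2), (3);
    """)
    rows = conn.execute("""
        SELECT src.Id
        FROM (
            -- Only one branch can return rows for a given parameter value,
            -- so UNION ALL is enough and no duplicate removal is needed.
            SELECT ot.Id FROM OnlineTable ot WHERE :archived = 0
            UNION ALL
            SELECT ar.Id FROM ArchivedTable ar WHERE :archived = 1
        ) AS src
        INNER JOIN AnotherTable other ON other.Id = src.Id
        ORDER BY src.Id
    """, {"archived": archived}).fetchall()
    conn.close()
    return [row[0] for row in rows]


if __name__ == "__main__":
    print(fetch_ids(0))
    print(fetch_ids(1))
```

The sketch only shows that the two guarded branches behave like the if/else version; whether the engine skips the unused table at run time is a separate question that the execution plan has to answer.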
If you don't mind extra unused columns in your results, you can represent such "IF"s with additional join conditions.
SELECT stuff
FROM MainTable AS m
LEFT JOIN ArchiveTable AS a ON @archivedData = 1 AND m.id = a.id
LEFT JOIN OnlineTable AS o ON @archivedData <> 1 AND m.id = o.id
;
If the Archive and Online tables have the same fields, you can even avoid extra result fields with select expressions like COALESCE(a.field1, o.field1) AS field1.
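The conditional-join idea with COALESCE can be sketched as follows. This is a minimal example on SQLite through Python's sqlite3 module, chosen only to make it runnable; the rows and the single field1 column are invented for illustration.

```python
import sqlite3


def fetch_fields(archived):
    """Pick field1 from the archive or the online table via conditional joins."""
    conn = sqlite3.connect(":memory:")
    conn.executescript("""
        CREATE TABLE MainTable (id INTEGER);
        CREATE TABLE ArchiveTable (id INTEGER, field1 TEXT);
        CREATE TABLE OnlineTable (id INTEGER, field1 TEXT);
        INSERT INTO MainTable VALUES (1), (2);
        INSERT INTO ArchiveTable VALUES (1, 'old-1'), (2, 'old-2');
        INSERT INTO OnlineTable VALUES (1, 'new-1'), (2, 'new-2');
    """)
    rows = conn.execute("""
        SELECT m.id, COALESCE(a.field1, o.field1) AS field1
        FROM MainTable AS m
        -- Only one of the two joins can match for a given parameter value;
        -- the other side contributes NULLs that COALESCE skips.
        LEFT JOIN ArchiveTable AS a ON :archived = 1 AND m.id = a.id
        LEFT JOIN OnlineTable AS o ON :archived <> 1 AND m.id = o.id
        ORDER BY m.id
    """, {"archived": archived}).fetchall()
    conn.close()
    return rows


if __name__ == "__main__":
    print(fetch_fields(1))
    print(fetch_fields(0))
```

One caveat of this form: if the selected table has a genuine NULL in field1, COALESCE cannot tell it apart from the unmatched side, which is harmless here because the unmatched side is always NULL too.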
If there are subsequent joins that depend on values from ArchiveTable or OnlineTable, this can be simplified by performing these core joins in a subquery (at least some coalesces will be necessary though):
SELECT stuff
FROM (
SELECT m.stuff, a.stuff, o.stuff
, COALESCE(a.field1, o.field1) AS xValue
, COALESCE(a.field2, o.field2) AS yValue
, COALESCE(a.field3, o.field3) AS zValue
FROM MainTable AS m
LEFT JOIN ArchiveTable AS a ON @archivedData = 1 AND m.id = a.id
LEFT JOIN OnlineTable AS o ON @archivedData <> 1 AND m.id = o.id
) AS coreQuery
INNER JOIN xTable AS x ON x.something = coreQuery.xValue
INNER JOIN yTable AS y ON y.something = coreQuery.yValue
INNER JOIN zTable AS z ON z.something = coreQuery.zValue
;
If there are criteria narrowing down the MainTable rows to be used, the WHERE for them should be included in the subquery to minimize the amount of Archive/Online data carried out of the subquery.
If the Archive/Online table is actually the "main" table, the question's option 3 should work, but I would suggest putting any filtering criteria relevant to those tables in their UNIONed subqueries.
If there are no filtering criteria on whatever table is "main", I would consider just maintaining two queries (or building one dynamically) so that the subqueries these approaches necessitate are not needed and will not interfere with index use.

Join with Or Condition

Is there a more efficient way to write this? I'm not sure this is the best way to implement this.
select *
from stat.UniqueTeams uTeam
Left Join stat.Matches match
on match.AwayTeam = uTeam.Id or match.HomeTeam = uTeam.id
Using OR in a JOIN condition is bad practice, because MSSQL cannot use indexes efficiently for it.
A better way is to use two selects with UNION:
SELECT *
FROM stat.UniqueTeams uTeam
LEFT JOIN stat.Matches match
ON match.AwayTeam = uTeam.Id
UNION
SELECT *
FROM stat.UniqueTeams uTeam
LEFT JOIN stat.Matches match
ON match.HomeTeam = uTeam.id
Things to be noted while using LEFT JOIN in query:
1) First of all, a left join can introduce NULLs, which can be a performance issue because the server engine treats NULLs separately.
2) The table on the nullable side of the join should not be bulky, otherwise the query will be costly to execute (performance + resources).
3) Try to join on columns that are already indexed. Otherwise, if you need to join on such columns, build indexes for them first.
In your case you have two columns from the same table to be left joined to another table. A good approach here is to build a single derived table that exposes the required data under one key column, as shown below:
; WITH Match AS
(
-- Select all required columns and alias the key column(s) as shown below
SELECT match1.*, match1.AwayTeam AS TeamId FROM stat.Matches match1
UNION
SELECT match2.*, match2.HomeTeam AS TeamId FROM stat.Matches match2
)
SELECT
*
FROM
stat.UniqueTeams uTeam
OUTER APPLY (SELECT * FROM Match WHERE Match.TeamId = uTeam.Id) AS m
I have used OUTER APPLY, which is similar to LEFT OUTER JOIN but is evaluated differently: the applied expression is evaluated per outer row, much like a table-valued function, and that can perform better in your case.
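Before swapping the OR join for one of the rewrites above, it is worth checking that the rows stay the same. The following is a minimal sketch on SQLite through Python's sqlite3 module (an assumption for runnability; SQLite has no OUTER APPLY, so a LEFT JOIN on the unpivoted CTE stands in for it). The three teams and the single match are made-up data, and TeamMatch is a hypothetical CTE name.

```python
import sqlite3


def compare_join_forms():
    """Return result sets for the OR join, the UNION of LEFT JOINs, and the unpivoted join."""
    conn = sqlite3.connect(":memory:")
    conn.executescript("""
        CREATE TABLE UniqueTeams (Id INTEGER PRIMARY KEY);
        CREATE TABLE Matches (Id INTEGER PRIMARY KEY, HomeTeam INTEGER, AwayTeam INTEGER);
        INSERT INTO UniqueTeams VALUES (1), (2), (3);
        INSERT INTO Matches VALUES (10, 1, 2);
    """)
    or_join = set(conn.execute("""
        SELECT t.Id, m.Id FROM UniqueTeams t
        LEFT JOIN Matches m ON m.AwayTeam = t.Id OR m.HomeTeam = t.Id
    """).fetchall())
    union_of_left_joins = set(conn.execute("""
        SELECT t.Id, m.Id FROM UniqueTeams t
        LEFT JOIN Matches m ON m.AwayTeam = t.Id
        UNION
        SELECT t.Id, m.Id FROM UniqueTeams t
        LEFT JOIN Matches m ON m.HomeTeam = t.Id
    """).fetchall())
    unpivoted = set(conn.execute("""
        WITH TeamMatch AS (
            -- One row per (match, participating team).
            SELECT Id AS MatchId, AwayTeam AS TeamId FROM Matches
            UNION ALL
            SELECT Id AS MatchId, HomeTeam AS TeamId FROM Matches
        )
        SELECT t.Id, tm.MatchId FROM UniqueTeams t
        LEFT JOIN TeamMatch tm ON tm.TeamId = t.Id
    """).fetchall())
    conn.close()
    return or_join, union_of_left_joins, unpivoted


if __name__ == "__main__":
    for name, rows in zip(("or join", "union", "unpivoted"), compare_join_forms()):
        print(name, rows)
```

With this data the unpivoted form returns the same rows as the OR join, while the UNION of two LEFT JOINs also returns a NULL-match row for each team that has a match in only one role. If that difference matters, the halves would need INNER JOINs plus separate handling of teams without any match.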
My answer is not quite to the point, but I found this question while looking for an "or" condition for an inner join, so it may be useful for the next seeker.
We can use the legacy syntax for the inner join case:
select *
from stat.UniqueTeams uTeam, stat.Matches match
where match.AwayTeam = uTeam.Id or match.HomeTeam = uTeam.id
Note: this query performs badly (it cross joins first, then filters), but it works with many conditions and is suitable for exploring dirty data (for example t1.id=t2.id or t1.name=t2.name).

Having trouble converting SQL query to Access

I have this query which works fine in SQL Server, but not in Access, and I'm having trouble converting it. I've always heard that JET is missing some T-SQL features, and I suppose complex joins are one of them.
SELECT C.[Position], TT.[Description] as TrainingType, T.ProgramTitle, T.ProgramSubTitle, T.ProgramCode, ET.CompletedDate
from HR_Curriculum C
LEFT JOIN HR_Trainings T ON C.TrainingID = T.TrainingID
LEFT JOIN HR_TrainingTypes TT ON T.TrainingTypeID = TT.TrainingTypeID
LEFT JOIN HR_EmployeeTrainings ET ON C.TrainingID = ET.TrainingID
AND (ET.AvantiRecID IS NULL OR ET.AvantiRecID = '637')
where ( c.[Position] = 'Customer Service Representative'
OR C.[Position] = 'All Employees')
order by [Position], Description, ProgramTitle
I tried putting the extra join clause down in the WHERE clause, but for some reason this does not yield the proper count of records.
When you have more than one JOIN in MS Access you need to wrap them in parentheses like this:
SELECT C.[Position], TT.[Description] as TrainingType, T.ProgramTitle, T.ProgramSubTitle, T.ProgramCode, ET.CompletedDate
from (((HR_Curriculum C
LEFT JOIN HR_Trainings T ON C.TrainingID = T.TrainingID)
LEFT JOIN HR_TrainingTypes TT ON T.TrainingTypeID = TT.TrainingTypeID)
LEFT JOIN HR_EmployeeTrainings ET ON C.TrainingID = ET.TrainingID
AND (ET.AvantiRecID IS NULL OR ET.AvantiRecID = '637'))
where ( c.[Position] = 'Customer Service Representative'
OR C.[Position] = 'All Employees')
order by [Position], Description, ProgramTitle
or you will have the Missing Operator error
Check that your table aliases are declared with 'as'. Access doesn't like [tablename] [alias]; instead try [tablename] as [alias]. I know the complex left joins shouldn't be a problem, but Access might be choking on the alias declarations if it's returning some join error. I would also try querying out the restriction on the ET table in a separate query, and then joining that to the larger query. I've noticed that trying to put restrictions on records involved in left or right joins will often not produce the correct records, as Access will limit the set after the join.
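The question's observation that moving the extra join condition into the WHERE clause changes the row count can be reproduced in any engine. Here is a minimal sketch on SQLite through Python's sqlite3 module (used only so it runs standalone; it says nothing about Access syntax). The tables are cut down to the two relevant columns and the rows are invented.

```python
import sqlite3


def left_join_filter_placement():
    """Return rows with the employee filter in ON versus in WHERE."""
    conn = sqlite3.connect(":memory:")
    conn.executescript("""
        CREATE TABLE HR_Curriculum (TrainingID INTEGER);
        CREATE TABLE HR_EmployeeTrainings (TrainingID INTEGER, AvantiRecID TEXT);
        INSERT INTO HR_Curriculum VALUES (1), (2), (3);
        -- Training 1 was taken by employee 637, training 2 only by someone else.
        INSERT INTO HR_EmployeeTrainings VALUES (1, '637'), (2, '999');
    """)
    filter_in_on = conn.execute("""
        SELECT C.TrainingID, ET.AvantiRecID
        FROM HR_Curriculum C
        LEFT JOIN HR_EmployeeTrainings ET
            ON C.TrainingID = ET.TrainingID
            AND (ET.AvantiRecID IS NULL OR ET.AvantiRecID = '637')
        ORDER BY C.TrainingID
    """).fetchall()
    filter_in_where = conn.execute("""
        SELECT C.TrainingID, ET.AvantiRecID
        FROM HR_Curriculum C
        LEFT JOIN HR_EmployeeTrainings ET ON C.TrainingID = ET.TrainingID
        WHERE ET.AvantiRecID IS NULL OR ET.AvantiRecID = '637'
        ORDER BY C.TrainingID
    """).fetchall()
    conn.close()
    return filter_in_on, filter_in_where


if __name__ == "__main__":
    on_rows, where_rows = left_join_filter_placement()
    print("filter in ON:   ", on_rows)
    print("filter in WHERE:", where_rows)
```

With the filter in ON, training 2 survives with a NULL employee; with the filter in WHERE, the join first matches it to employee 999 and the filter then drops the row entirely.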

NOT IN Subquery Optimization

I have a dynamic query that identifies DVDs that members have not rented yet. I am using a NOT IN subquery, but when the member table is large it becomes really slow. Any suggestions on how to optimize the query?
SELECT DVDTitle AS "DVD Title"
FROM DVD
WHERE DVDId NOT IN
(SELECT DISTINCT DVDId FROM Rental WHERE MemberId = 'AL240');
thanks
Using NOT EXISTS will have slightly better performance because it can "short circuit" rather than evaluating the entire set for each match. At the very least, it will be "no worse" than NOT IN or an OUTER JOIN, though there are exceptions to every rule. Here is how I would write this query:
SELECT DVDTitle AS [DVD Title]
FROM dbo.DVD AS d
WHERE NOT EXISTS
(
SELECT 1 FROM dbo.Rental
WHERE MemberId = 'AL240'
AND DVDId = d.DVDId
);
I would guess you will optimize performance better by investigating the execution plan and ensuring that your indexes are best suited for this query (without causing negative impact to other parts of your workload).
Also see Should I use NOT IN, OUTER APPLY, LEFT OUTER JOIN, EXCEPT, or NOT EXISTS?
SELECT DVDTitle AS "DVD Title"
FROM DVD d
left outer join Rental r on d.DVDId = r.DVDId
and r.MemberId = 'AL240'
WHERE r.DVDId is null
Make sure you have indexes on:
d.DVDId
r.DVDId
r.MemberId
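The three ways of writing this anti-join can be checked against each other with a small script. This is a minimal sketch on SQLite through Python's sqlite3 module (an assumption for runnability; the thread is about SQL Server), with made-up rows and a hypothetical index name. Note that the LEFT JOIN form needs the MemberId filter in the ON clause: in the WHERE clause it could never be true together with r.DVDId IS NULL.

```python
import sqlite3


def unrented_titles(member_id):
    """Return DVDs the member has not rented, via NOT IN, NOT EXISTS and LEFT JOIN."""
    conn = sqlite3.connect(":memory:")
    conn.executescript("""
        CREATE TABLE DVD (DVDId INTEGER PRIMARY KEY, DVDTitle TEXT);
        CREATE TABLE Rental (MemberId TEXT, DVDId INTEGER);
        -- Hypothetical index covering the lookup by member and DVD.
        CREATE INDEX rental_member_dvd_idx ON Rental (MemberId, DVDId);
        INSERT INTO DVD VALUES (1, 'Alien'), (2, 'Brazil'), (3, 'Casablanca');
        INSERT INTO Rental VALUES ('AL240', 1), ('ZZ999', 2);
    """)
    queries = {
        "not_in": """
            SELECT DVDTitle FROM DVD
            WHERE DVDId NOT IN (SELECT DVDId FROM Rental WHERE MemberId = ?)
            ORDER BY DVDTitle
        """,
        "not_exists": """
            SELECT d.DVDTitle FROM DVD d
            WHERE NOT EXISTS (
                SELECT 1 FROM Rental r
                WHERE r.MemberId = ? AND r.DVDId = d.DVDId
            )
            ORDER BY d.DVDTitle
        """,
        "left_join": """
            SELECT d.DVDTitle FROM DVD d
            LEFT JOIN Rental r ON r.DVDId = d.DVDId AND r.MemberId = ?
            WHERE r.DVDId IS NULL
            ORDER BY d.DVDTitle
        """,
    }
    results = {
        name: [row[0] for row in conn.execute(sql, (member_id,))]
        for name, sql in queries.items()
    }
    conn.close()
    return results


if __name__ == "__main__":
    print(unrented_titles("AL240"))
```

All three return the same titles here. They stop being interchangeable if Rental.DVDId can be NULL, because NOT IN then returns no rows at all, which is one more reason to prefer NOT EXISTS.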

SQL Query execution shortcut OR logic?

I have three tables:
SmallTable
(id int, flag1 bit, flag2 bit)
JoinTable
(SmallTableID int, BigTableID int)
BigTable
(id int, text1 nvarchar(100), otherstuff...)
SmallTable has, at most, a few dozen records. BigTable has a few million, and is actually a view that UNIONS a table in this database with a table in another database on the same server.
Here's the join logic:
SELECT * FROM
SmallTable s
INNER JOIN JoinTable j ON j.SmallTableID = s.ID
INNER JOIN BigTable b ON b.ID = j.BigTableID
WHERE
(s.flag1=1 OR b.text1 NOT LIKE 'pattern1%')
AND (s.flag2=1 OR b.text1 <> 'value1')
Average joined size is a few thousand results. Everything shown is indexed.
For most SmallTable records, flag1 and flag2 are set to 1, so there's really no need to even access the index on BigTable.text1, but SQL Server does anyway, leading to a costly Index Scan and Nested Loop.
Is there a better way to hint to SQL Server that, if flag1 and flag2 are both set to 1, it shouldn't even bother looking at text1?
Actually, if I can avoid the join to BigTable completely in these cases (JoinTable is managed, so this wouldn't create an issue), that would make this key query even faster.
SQL Boolean evaluation does NOT guarantee operator short-circuit. See On SQL Server boolean operator short-circuit for a clear example showing how assuming operator short circuit can lead to correctness issues and run-time errors.
On the other hand, the very example in my link shows what does work for SQL Server: providing an access path that SQL can use. So, as with all SQL performance problems and questions, the real problem is not in the way the SQL text is expressed, but in the design of your storage, i.e. what indexes does the query optimizer have at its disposal to satisfy your query?
I don't believe SQL Server will short-circuit conditions like that unfortunately.
So I'd suggest doing 2 queries and UNIONing them together: the first query with the s.flag1=1 and s.flag2=1 WHERE conditions, and the second query doing the join on to BigTable with the s.flag1<>1 or s.flag2<>1 conditions.
This article on the matter is worth a read, and includes the bottom line:
...SQL Server does not do short-circuiting like it is done in other programming languages and there's nothing you can do to force it to.
Update:
This article is also an interesting read and contains some good links on this topic, including a technet chat with the development manager for the SQL Server Query Processor team which briefly mentions that the optimizer does allow short-circuit evaluation. The overall impression I get from various articles is "yes, the optimizer can spot the opportunity to short circuit but you shouldn't rely on it and you can't force it". Hence, I think the UNION approach may be your best bet. If it's not coming up with a plan that takes advantage of an opportunity to short cut, that would be down to the cost-based optimizer thinking it's found a reasonable plan that does not do it (this would be down to indexes, statistics etc).
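The two-query split can be checked for equivalence on a toy data set. What follows is a minimal sketch on SQLite through Python's sqlite3 module (an assumption for runnability, not SQL Server), with invented rows. It assumes the flags are never NULL and that every JoinTable row has a matching BigTable row, which is what lets the first branch skip BigTable altogether; only the two id columns are selected for that reason.

```python
import sqlite3

FILTER = """
    (s.flag1 = 1 OR b.text1 NOT LIKE 'pattern1%')
    AND (s.flag2 = 1 OR b.text1 <> 'value1')
"""


def original_and_split():
    """Return rows from the original query and from the UNION ALL split."""
    conn = sqlite3.connect(":memory:")
    conn.executescript("""
        CREATE TABLE SmallTable (id INTEGER, flag1 INTEGER, flag2 INTEGER);
        CREATE TABLE BigTable (id INTEGER, text1 TEXT);
        CREATE TABLE JoinTable (SmallTableID INTEGER, BigTableID INTEGER);
        INSERT INTO SmallTable VALUES (1, 1, 1), (2, 0, 1), (3, 1, 0), (4, 0, 0);
        INSERT INTO BigTable VALUES (100, 'pattern1abc'), (101, 'value1'), (102, 'other');
        -- Link every small row to every big row to cover all flag combinations.
        INSERT INTO JoinTable SELECT s.id, b.id FROM SmallTable s CROSS JOIN BigTable b;
    """)
    original = conn.execute(f"""
        SELECT s.id, j.BigTableID
        FROM SmallTable s
        INNER JOIN JoinTable j ON j.SmallTableID = s.id
        INNER JOIN BigTable b ON b.id = j.BigTableID
        WHERE {FILTER}
        ORDER BY 1, 2
    """).fetchall()
    split = conn.execute(f"""
        -- Branch 1: both flags set, so text1 is irrelevant and BigTable is skipped.
        SELECT s.id, j.BigTableID
        FROM SmallTable s
        INNER JOIN JoinTable j ON j.SmallTableID = s.id
        WHERE s.flag1 = 1 AND s.flag2 = 1
        UNION ALL
        -- Branch 2: at least one flag unset, so the text1 checks still apply.
        SELECT s.id, j.BigTableID
        FROM SmallTable s
        INNER JOIN JoinTable j ON j.SmallTableID = s.id
        INNER JOIN BigTable b ON b.id = j.BigTableID
        WHERE (s.flag1 <> 1 OR s.flag2 <> 1) AND {FILTER}
        ORDER BY 1, 2
    """).fetchall()
    conn.close()
    return original, split


if __name__ == "__main__":
    original, split = original_and_split()
    print(original == split, len(original))
```

The branches are mutually exclusive, so UNION ALL cannot introduce duplicates. Whether this is actually faster on the real view still has to be measured.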
It's not elegant, but it should work...
SELECT * FROM
SmallTable s
INNER JOIN JoinTable j ON j.SmallTableID = s.ID
INNER JOIN BigTable b ON b.ID = j.BigTableID
WHERE
(s.flag1 = 1 and s.flag2 = 1) OR
(
(s.flag1=1 OR b.text1 NOT LIKE 'pattern1%')
AND (s.flag2=1 OR b.text1 <> 'value1')
)
SQL Server usually grabs the subquery hint (though it's free to discard it):
SELECT *
FROM (
SELECT * FROM SmallTable where flag1 <> 1 or flag2 <> 1
) s
INNER JOIN JoinTable j ON j.SmallTableID = s.ID
...
No idea if this will be faster without test data... but it sounds like it might
SELECT * FROM
SmallTable s
INNER JOIN JoinTable j ON j.SmallTableID = s.ID
INNER JOIN BigTable b ON b.ID = j.BigTableID
WHERE
(s.flag1=1) AND (s.flag2=1)
UNION ALL
SELECT * FROM
SmallTable s
INNER JOIN JoinTable j ON j.SmallTableID = s.ID
INNER JOIN BigTable b ON b.ID = j.BigTableID
WHERE
(s.flag1=0 OR s.flag2=0)
AND (s.flag1=1 OR b.text1 NOT LIKE 'pattern1%')
AND (s.flag2=1 OR b.text1 <> 'value1')
Please let me know what happens
Also, you might be able to speed this up by returning just a unique id from this query and then using the result of that to get all the rest of the data.
edit
something like this?
SELECT * FROM
SmallTable s
INNER JOIN JoinTable j ON j.SmallTableID = s.ID
INNER JOIN BigTable b ON b.ID = j.BigTableID
WHERE
(s.flag1=1) AND (s.flag2=1)
UNION ALL
SELECT * FROM
SmallTable s
INNER JOIN JoinTable j ON j.SmallTableID = s.ID
INNER JOIN BigTable b ON b.ID = j.BigTableID
WHERE EXISTS
(SELECT 1 from BigTable b
WHERE
(s.flag1=0 AND b.text1 NOT LIKE 'pattern1%')
AND (s.flag2=0 AND b.text1 <> 'value1')
)
Hope this works - careful of shortcut logic in case statements around aggregates but...
SELECT * FROM
SmallTable s
INNER JOIN JoinTable j ON j.SmallTableID = s.ID
INNER JOIN BigTable b ON b.ID = j.BigTableID
WHERE 1=case when (s.flag1 = 1 and s.flag2 = 1) then 1
when (
(s.flag1=1 OR b.text1 NOT LIKE 'pattern1%')
AND (s.flag2=1 OR b.text1 <> 'value1')
) then 1
else 0 end

Resources