Database Join Performance Comparison [duplicate] - sql-server

I am on a project where many of the queries are written by listing multiple tables in the FROM clause. I know this is legal, but I have always used explicit JOINs instead.
For example, take two tables (using SQL Server DDL):
CREATE TABLE Manufacturers (
    ManufacturerID INT IDENTITY(1,1) PRIMARY KEY NOT NULL,
    Name VARCHAR(100));
CREATE TABLE Cars (
    ModelID INT IDENTITY(1,1) PRIMARY KEY NOT NULL,
    ManufacturerID INT CONSTRAINT FK_Manufacturer FOREIGN KEY REFERENCES Manufacturers(ManufacturerID),
    ModelName VARCHAR(100));
If I want to find the models for GM, I could do either:
SELECT ModelName FROM Cars c, Manufacturers m WHERE c.ManufacturerID=m.ManufacturerID AND m.Name='General Motors'
or
SELECT ModelName FROM Cars c INNER JOIN Manufacturers m ON c.ManufacturerID=m.ManufacturerID WHERE m.Name='General Motors'
My question is this: does one form perform better than the other? Aside from how the tables are defined in Oracle vs SQL Server, does one form of join work better than the other in Oracle or SQL Server? What if you include more tables, say 3 or 4? Does that change the performance characteristics, assuming the queries are constructed to return an equivalent record set?

My question is this: does one form perform better than the other?
They should not. You can check your execution plans to be certain, but every RDBMS I've worked with generates the same plans for comma (ANSI-89) joins as it does for ANSI-92 explicit joins. (Note that comma joins didn't stop being ANSI SQL; it's just that ANSI-92 is where the explicit syntax first appeared.)
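For example, in SQL Server you can capture the text plans for both forms from the question and compare them (a quick sketch; with SHOWPLAN_TEXT on, the statements are compiled but not executed, and both plans should come back identical):
SET SHOWPLAN_TEXT ON;
GO
SELECT ModelName
FROM Cars c, Manufacturers m
WHERE c.ManufacturerID = m.ManufacturerID AND m.Name = 'General Motors';
GO
SELECT ModelName
FROM Cars c
INNER JOIN Manufacturers m ON c.ManufacturerID = m.ManufacturerID
WHERE m.Name = 'General Motors';
GO
SET SHOWPLAN_TEXT OFF;
GO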
Aside from how the tables are defined in Oracle vs SQL Server, does one form of join work better than the other in Oracle or SQL Server?
As far as the server is concerned, no.
What if you include more tables, say 3 or 4? Does that change the performance characteristics, assuming the queries are constructed to return an equivalent record set?
It's possible. With comma joins, I'm not sure it's possible to control the JOIN order with parentheses like you can with explicit joins:
SELECT *
FROM Table1 t1
INNER JOIN (
    Table2 t2
    INNER JOIN Table3 t3
        ON t2.id = t3.id)
    ON t1.id = t2.id
This can affect overall query performance (for better or worse). I'm not sure how to accomplish the same level of control with comma joins, but I've never fully learned comma join syntax. I don't know if you can say Table1 t1, (Table2 t2, Table3 t3), but I don't believe you can. I think you'd have to use subqueries to do that.
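If you did want to try it with comma joins, something like this derived-table workaround is the closest equivalent I can think of (just a sketch; I haven't verified that the optimizer treats it the same way, and the derived table only carries the join key out):
SELECT *
FROM Table1 t1,
     (SELECT t2.id
      FROM Table2 t2, Table3 t3
      WHERE t2.id = t3.id) sub
WHERE t1.id = sub.id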
The primary advantages of explicit joins are:
Easier to read. It makes it very clear which conditions are used with which join. You won't ever see Table1 t1, Table2 t2, Table3 t3 and then have to dig into the WHERE clause to figure out if one of those joins is an outer join. It also means the WHERE clause isn't stuffed full of all these join conditions you know you don't care about changing when you modify a query.
Easier to use outer joins. In the case of outer joins, you can even specify literal filter values on the outer table in the ON clause without having to handle NULLs from the outer join (see the sketch after this list).
Easier to reuse existing joins. If you just want to query from the same relations, you can just grab the FROM clause. You don't have to worry about what bits from the WHERE clause you want and what bits you don't.
Identical syntax across RDBMSs. When you spend all day switching between Oracle and SQL Server, the last thing you want to worry about is confusing Oracle's (+) with the old SQL Server *= to get your outer joins right.
All of the above make the explicit join syntax more maintainable, which is a very important factor for software.
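To illustrate the outer-join point with the Cars/Manufacturers tables from the question (a sketch; the 'Corvette' filter value is just for illustration): because the literal filter sits in the ON clause, manufacturers with no matching model still come back with a NULL ModelName, instead of being discarded the way a WHERE condition on the outer table would discard them:
SELECT m.Name, c.ModelName
FROM Manufacturers m
LEFT JOIN Cars c
    ON c.ManufacturerID = m.ManufacturerID
   AND c.ModelName = 'Corvette'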

Related

Select from one table where not in another in SQL Server

I have a SQL Server database on my computer, and there are two tables in it.
This is the first one:
SELECT
[ParticipantID]
,[ParticipantName]
,[ParticipantNumber]
,[PhoneNumber]
,[Mobile]
,[Email]
,[Address]
,[Notes]
,[IsDeleted]
,[Gender]
,[DOB]
FROM
[Gym].[dbo].[Participant]
and this is the second one:
SELECT
[ParticipationID]
,[ParticipationNumber]
,[ParticpationTypeID]
,[AddedByEmployeeID]
,[AddDate]
,[ParticipantID]
,[TrainerID]
,[ParticipationDate]
,[EndDate]
,[Fees]
,[PaidFees]
,[RemainingFees]
,[IsPeriodParticipation]
,[NoOfVisits]
,[Notes]
,[IsDeleted]
FROM
[Gym].[dbo].[Participation]
Now I need to write a T-SQL query that can return
SELECT
Participant.ParticipantNumber,
Participation.ParticipationDate,
Participation.EndDate
FROM
Participation
WHERE
Participant.ParticipantID = Participation.ParticipantID;
and I'd be thankful for any help.
SQL Server performs sort, intersect, union, and difference operations using in-memory sorting and hash join technology. Using this type of query plan, SQL Server supports vertical table partitioning, sometimes called columnar storage.
SQL Server employs three types of join operations:
Nested Loops joins
Merge joins
Hash joins
Join Fundamentals
By using joins, you can retrieve data from two or more tables based on logical relationships between the tables. Joins indicate how Microsoft SQL Server should use data from one table to select the rows in another table.
A join condition defines the way two tables are related in a query by:
Specifying the column from each table to be used for the join. A typical join condition specifies a foreign key from one table and its associated key in the other table.
Specifying a logical operator (for example, = or <>) to be used in comparing values from the columns.
Inner joins can be specified in either the FROM or WHERE clauses. Outer joins can be specified in the FROM clause only. The join conditions combine with the WHERE and HAVING search conditions to control the rows that are selected from the base tables referenced in the FROM clause.
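For the tables in the question, a minimal sketch of the query being asked for, written with an explicit INNER JOIN:
SELECT
    Participant.ParticipantNumber,
    Participation.ParticipationDate,
    Participation.EndDate
FROM [Gym].[dbo].[Participant] AS Participant
INNER JOIN [Gym].[dbo].[Participation] AS Participation
    ON Participant.ParticipantID = Participation.ParticipantID;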
Follow this link to help you understand joins in SQL Server better:
link to joins

Query Optimization on SQL server 2008

I have a small sql query that runs on SQL Server 2008. It uses the following tables and their row counts:
dbo.date_master - 245424
dbo.ers_hh_forecast_consumption - 436061472
dbo.ers_hh_forecast_file - 15105
dbo.ers_ed_supply_point - 8485
I am quite new to the world of SQL Server and am learning. Please guide me on how to optimize this query to run much faster.
I'll be quite happy to learn if anyone can point out my mistakes and explain what I am doing that makes it take so long to query the resulting table.
WITH CTE_CONS AS
(
SELECT T2.CONVERTED_DATE
,T1.FORECAST_FILE_ID
,SUM(T1.FORECAST_CONSUMPTION) AS TOTAL
FROM dbo.ers_hh_forecast_consumption AS T1
LEFT JOIN dbo.date_master AS T2 ON T1.UTC_DATETIME=T2.STRDATETIME
WHERE T2.CONVERTED_DATE>='2015-01-01' AND T2.CONVERTED_DATE<='2015-06-01'
GROUP BY T2.CONVERTED_DATE, T1.FORECAST_FILE_ID, T1.FORECAST_CONSUMPTION
),
CTE_MPAN AS
(
SELECT T2.FORECAST_FILE_ID
,T2.MPAN_CORE
FROM CTE_CONS AS T1
LEFT JOIN dbo.ers_hh_forecast_file AS T2 ON T1.FORECAST_FILE_ID=T2.FORECAST_FILE_ID
),
CTE_GSP AS
(
SELECT T2.MPAN_CORE
,T2.GSP_GROUP_ID
FROM CTE_MPAN AS T1
LEFT JOIN dbo.ers_ed_supply_point AS T2 ON T1.MPAN_CORE=T2.MPAN_CORE
)
SELECT T1.CONVERTED_DATE
,T1.TOTAL
,T2.MPAN_CORE
,T1.TOTAL
FROM CTE_CONS AS T1
LEFT JOIN CTE_MPAN AS T2 ON T1.FORECAST_FILE_ID=T2.FORECAST_FILE_ID
LEFT JOIN CTE_GSP AS T3 ON T2.MPAN_CORE=T3.MPAN_CORE
Basically, without looking at the actual table design and indexes, it is difficult to tell exactly what you would need to change. But for starters, you could definitely consider two things:
In your CTE_CONS query, you are doing a left join on a DATETIME field. This is definitely not a good idea unless you have some kind of index on that field. I would strongly urge you to create an index if there isn't one already.
CREATE NONCLUSTERED INDEX IX_UTC_DATETIME
ON dbo.ers_hh_forecast_consumption (UTC_DATETIME ASC)
INCLUDE (FORECAST_FILE_ID, FORECAST_CONSUMPTION);
The other thing you could consider doing would be partitioning your table dbo.ers_hh_forecast_consumption. That way, each read touches much less of the table and records are retrieved a lot more quickly as well. Here is a quick guide on How To Decide if You Should Use Table Partitioning.
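A minimal sketch of what monthly partitioning could look like here (the function and scheme names are made up, the boundary values are only illustrative, and the table's clustered index would then have to be rebuilt on the partition scheme):
-- Hypothetical names and boundaries; adjust to your actual data distribution.
CREATE PARTITION FUNCTION pf_ForecastByMonth (DATETIME)
AS RANGE RIGHT FOR VALUES ('2015-01-01', '2015-02-01', '2015-03-01',
                           '2015-04-01', '2015-05-01', '2015-06-01');
CREATE PARTITION SCHEME ps_ForecastByMonth
AS PARTITION pf_ForecastByMonth ALL TO ([PRIMARY]);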
Hope this helps!
Apart from the fact that you'll need to offer quite a bit more info for us to get a good idea on what's going on, I think I spotted a bit of an issue with your query here:
WITH CTE_CONS AS
(
SELECT T2.CONVERTED_DATE
,T1.FORECAST_FILE_ID
,SUM(T1.FORECAST_CONSUMPTION) AS TOTAL
FROM dbo.ers_hh_forecast_consumption AS T1
LEFT JOIN dbo.date_master AS T2 ON T1.UTC_DATETIME=T2.STRDATETIME
WHERE T2.CONVERTED_DATE>='2015-01-01' AND T2.CONVERTED_DATE<='2015-06-01'
GROUP BY T2.CONVERTED_DATE, T1.FORECAST_FILE_ID, T1.FORECAST_CONSUMPTION
)
At first sight, you're trying to SUM() the values of T1.FORECAST_CONSUMPTION per T2.CONVERTED_DATE, T1.FORECAST_FILE_ID combination. However, in the GROUP BY you also add T1.FORECAST_CONSUMPTION itself. This has the exact same effect as doing a DISTINCT over the three fields. Either remove the field you're SUM()ing over from the GROUP BY, or use a DISTINCT and get rid of the SUM() and GROUP BY, depending on what effect you're after.
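Assuming per-date, per-file totals are what you're after, the corrected CTE would be:
WITH CTE_CONS AS
(
SELECT T2.CONVERTED_DATE
,T1.FORECAST_FILE_ID
,SUM(T1.FORECAST_CONSUMPTION) AS TOTAL
FROM dbo.ers_hh_forecast_consumption AS T1
LEFT JOIN dbo.date_master AS T2 ON T1.UTC_DATETIME=T2.STRDATETIME
WHERE T2.CONVERTED_DATE>='2015-01-01' AND T2.CONVERTED_DATE<='2015-06-01'
GROUP BY T2.CONVERTED_DATE, T1.FORECAST_FILE_ID -- the SUM()ed column no longer appears here
)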
Anyway, could you add the following things to your question:
EXEC sp_helpindex <table_name> for all tables involved.
if possible, a screenshot of the Execution Plan (either from SSMS, or from SQL Sentry Plan Explorer).

How can I speed up this SQL view?

I'm a beginner at this, so I hope you can help. I'm working in SQL Server 2008 R2 and have a view that is composed of four tables all joined together:
SELECT DISTINCT ad.award_id,
bl.funding_id,
bl.budget_line,
dd4.monthnumberofyear AS month,
dd4.yearcalendar AS year,
CASE
WHEN frb.full_value IS NULL THEN '0'
ELSE frb.full_value
END AS Expenditure_value,
bl.budget_id,
frb.accode,
'Actual' AS Type
FROM dw.dbo.dimdate5 AS dd4
LEFT OUTER JOIN dbo.award_data AS ad
ON dd4.fulldate BETWEEN ad.usethisstartdate AND
ad.usethisenddate
LEFT OUTER JOIN dbo.budget_line AS bl
ON bl.award_id = ad.award_id
LEFT OUTER JOIN dw.dbo.fctresearchbalances AS frb
ON frb.el3 = bl.award_id
AND frb.element4groupidnew = bl.budget_line
AND dd4.yearfiscal = frb.yr
AND dd4.monthnumberfiscal = frb.period
The view has 9 columns and 1.5 million rows and growing. A select * from this view was taking 20 minutes for all the rows. I added indexes on the fields in the tables that are joined on and that improved it to 10 minutes. My question is what else could I do to get the select to run faster?
Many thanks, Violet.
Try getting rid of the case statement.
If you have 1.5 million rows and you're interested in an aggregation of those rows rather than the whole set, you might want to sum the rows in fctResearchBalances first and then do the joins (see the sketch below).
(It's a bit difficult to determine what else you might benefit from, without seeing the access plan.)
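For example, a sketch of that pre-aggregation (the grouping columns are lifted from the view's join conditions; whether summing full_value this way is valid for your data is an assumption):
SELECT frb.el3,
       frb.element4groupidnew,
       frb.yr,
       frb.period,
       SUM(frb.full_value) AS full_value
INTO #frb_summed
FROM dw.dbo.fctresearchbalances AS frb
GROUP BY frb.el3, frb.element4groupidnew, frb.yr, frb.period;
-- ...then join #frb_summed in place of dw.dbo.fctresearchbalances.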
1- You can use a stored procedure, which benefits from plan caching.
2- You can use an indexed view; this means creating an index on a schema-bound view (see the sketch below).
3- You can use join hints to direct the query optimizer to use a specific kind of join.
4- You can use table partitioning.
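On point 2, note that indexed views have heavy restrictions (SCHEMABINDING, no outer joins, no DISTINCT), so the posted view could not be indexed as-is; this is only a sketch of the pattern, with an illustrative view name and column list:
CREATE VIEW dbo.v_BudgetLineCounts
WITH SCHEMABINDING
AS
SELECT bl.award_id,
       bl.budget_line,
       COUNT_BIG(*) AS row_count -- COUNT_BIG(*) is required in an indexed view with GROUP BY
FROM dbo.budget_line AS bl
GROUP BY bl.award_id, bl.budget_line;
GO
CREATE UNIQUE CLUSTERED INDEX IX_v_BudgetLineCounts
ON dbo.v_BudgetLineCounts (award_id, budget_line);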
SELECT DISTINCT --#1 - potential bottleneck
ad.award_id
, bl.funding_id
, bl.budget_line
, [month] = dd4.monthnumberofyear
, [year] = dd4.yearcalendar
, Expenditure_value = ISNULL(frb.full_value, '0')
, bl.budget_id
, frb.accode
, [type] = 'Actual'
FROM dbo.dimdate5 dd4
LEFT JOIN dbo.award_data ad ON dd4.fulldate BETWEEN ad.usethisstartdate AND ad.usethisenddate
LEFT JOIN dbo.budget_line bl ON bl.award_id = ad.award_id
LEFT JOIN dbo.fctresearchbalances frb ON frb.el3 = bl.award_id --#2 - join by multiple columns
AND frb.element4groupidnew = bl.budget_line
AND dd4.yearfiscal = frb.yr
AND dd4.monthnumberfiscal = frb.period
The CASE statement can be replaced with
COALESCE(frb.full_value,'0') AS Expenditure_value
Without more info it's not possible to tell exactly what is wrong, but just to give you some pointers:
When you have so many LEFT JOINS the order of the joins can make a difference.
Do you have standard indexes or covering indexes with included columns?
If you don't have covering indexes, then primary keys matter in the joins. Including all the primary key columns in the join condition will speed up the query.
Then look at your data - do you need all those LEFT JOINs based on the foreign keys between those tables? Depending on your keys, a LEFT JOIN may be equivalent to an INNER JOIN.
And with all those LEFT JOINS is having a DISTINCT really useful?
How much RAM do you have? If you have 8GB+ then 1.5m rows is nothing for SQL Server. You need to optimise those joins.

Inner join vs select statements on multiple tables

The two queries below perform the same operation, but I'm wondering which would be the fastest and most preferable?
NUM is the primary key on table1 & table2...
select *
from table1 tb1,
table2 tb2
where tb1.num = tb2.num
select *
from table1 tb1
inner join
table2 tb2
on tb1.num = tb2.num
They are the same query. The first is an older alternate syntax, but they both mean do an inner join.
You should avoid using the older syntax. It's not just readability: as you build more complex queries, there are things that you simply can't do with the old syntax. Additionally, the old syntax is going through a slow process of being phased out, with the equivalent outer join syntax marked as deprecated in most products and, if I recall correctly, already dropped in at least one.
The 2 SQL statements are equivalent. You can look at the execution plan to confirm. As a rule, given 2 SQL statements which affect/return the same rows in the same way, the server is free to execute them the same way.
They're equivalent queries - both are inner joins, but the first uses an older, implicit join syntax. Your database should execute them in exactly the same way.
If you're unsure, you could always use SQL Server Management Studio to view and compare the execution plans of both queries.
The first example is what I have seen referred to as an Oracle Join. As mentioned already there appears to be little performance difference. I prefer the second example from a readability standpoint because it separates join conditions from filter conditions.

SQL Server CTE referred in self joins slow

I have written a table-valued UDF that starts by a CTE to return a subset of the rows from a large table.
There are several joins in the CTE. A couple of inner and one left join to other tables, which don't contain a lot of rows.
The CTE has a where clause that returns the rows within a date range, in order to return only the rows needed.
I'm then referencing this CTE in 4 self left joins, in order to build subtotals using different criteria.
The query is quite complex but here is a simplified pseudo-version of it
WITH DataCTE as
(
SELECT [columns] FROM table
INNER JOIN table2
ON [...]
INNER JOIN table3
ON [...]
LEFT JOIN table3
ON [...]
)
SELECT [aggregates_columns of each subset] FROM DataCTE Main
LEFT JOIN DataCTE BananasSubset
ON [...]
AND Product = 'Bananas'
AND Quality = 100
LEFT JOIN DataCTE DamagedBananasSubset
ON [...]
AND Product = 'Bananas'
AND Quality < 20
LEFT JOIN DataCTE MangosSubset
ON [...]
GROUP BY [...]
I have the feeling that SQL Server gets confused and re-evaluates the CTE for each self join, which seems to be confirmed by the execution plan, although I confess I'm not an expert at reading those.
I would have assumed SQL Server was smart enough to perform the data retrieval from the CTE only once, rather than do it several times.
I have tried the same approach but rather than using a CTE to get the subset of the data, I used the same select query as in the CTE, but made it output to a temp table instead.
The version referring the CTE version takes 40 seconds. The version referring the temp table takes between 1 and 2 seconds.
Why isn't SQL Server smart enough to keep the CTE results in memory?
I like CTEs, especially in this case as my UDF is a table-valued one, so it allowed me to keep everything in a single statement.
To use a temp table, I would need to write a multi-statement table valued UDF, which I find a slightly less elegant solution.
Have any of you had this kind of performance issue with CTEs, and if so, how did you get it sorted?
Thanks,
Kharlos
I believe that CTE results are retrieved every time. With a temp table the results are stored until it is dropped. This would seem to explain the performance gains you saw when you switched to a temp table.
Another benefit is that you can create indexes on a temporary table, which you can't do with a CTE. Not sure if there would be a benefit in your situation, but it's good to know.
Related reading:
Which are more performant, CTE or temporary tables?
SQL 2005 CTE vs TEMP table Performance when used in joins of other tables
http://msdn.microsoft.com/en-us/magazine/cc163346.aspx#S3
Quote from the last link:
The CTE's underlying query will be called each time it is referenced in the immediately following query.
I'd say go with the temp table. Unfortunately, elegant isn't always the best solution.
UPDATE:
Hmmm, that makes things more difficult. It's hard for me to say without looking at your whole environment.
Some thoughts:
can you use a stored procedure instead of a UDF (instead, not from within)? See the sketch after this list.
This may not be possible, but if you can remove the left join from your CTE you could move that into an indexed view. If you are able to do this you may see performance gains over even the temp table.
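On the stored procedure idea, a sketch of how the temp-table version could look (temp tables can't be created inside functions, which is part of why moving away from the UDF matters; the procedure, table, and column names here are placeholders based on the pseudo-query above):
CREATE PROCEDURE dbo.GetFruitSubtotals
AS
BEGIN
    -- Materialize the CTE's result set once...
    SELECT SaleDate, Product, Quality, Amount
    INTO #Data
    FROM dbo.SourceTable; -- stands in for the joins inside the original CTE

    -- ...and index it to support the self joins.
    CREATE NONCLUSTERED INDEX IX_Data ON #Data (SaleDate, Product, Quality);

    SELECT Main.SaleDate,
           SUM(Main.Amount)    AS Total,
           SUM(Bananas.Amount) AS BananasTotal
    FROM #Data AS Main
    LEFT JOIN #Data AS Bananas
           ON Bananas.SaleDate = Main.SaleDate
          AND Bananas.Product = 'Bananas'
          AND Bananas.Quality = 100
    GROUP BY Main.SaleDate;
END;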
