The relationship between join sequence and index setting - sql-server

I heard that setting index is closely related to the sequence of joining tables. Could you provide some examples or article about this point?
Thanks.

Not so much the join sequence as written by you, but the join sequence that the Query Optimizer elects to use.
So... if you're joining on two fields called CustomerId, then indexing that field in both tables (using an index which also incorporates fields that are needed by the query), then you can get a Merge Join happening.
Bear in mind that if your query filters a table using somefield = somevalue, then your filter should be on somefield first.

Related

Is Cross join with on equivalent to inner join [duplicate]

This question already has answers here:
Closed 11 years ago.
Possible Duplicate:
INNER JOIN versus WHERE clause — any difference?
What is the difference between an INNER JOIN query and an implicit join query (i.e. listing multiple tables after the FROM keyword)?
For example, given the following two tables:
CREATE TABLE Statuses(
id INT PRIMARY KEY,
description VARCHAR(50)
);
INSERT INTO Statuses VALUES (1, 'status');
CREATE TABLE Documents(
id INT PRIMARY KEY,
statusId INT REFERENCES Statuses(id)
);
INSERT INTO Documents VALUES (9, 1);
What is the difference between the below two SQL queries?
From the testing I've done, they return the same result. Do they do the same thing? Are there situations where they will return different result sets?
-- Using implicit join (listing multiple tables)
SELECT s.description
FROM Documents d, Statuses s
WHERE d.statusId = s.id
AND d.id = 9;
-- Using INNER JOIN
SELECT s.description
FROM Documents d
INNER JOIN Statuses s ON d.statusId = s.id
WHERE d.id = 9;
There is no reason to ever use an implicit join (the one with the commas). Yes for inner joins it will return the same results. However, it is subject to inadvertent cross joins especially in complex queries and it is harder for maintenance because the left/right outer join syntax (deprecated in SQL Server, where it doesn't work correctly right now anyway) differs from vendor to vendor. Since you shouldn't mix implicit and explict joins in the same query (you can get wrong results), needing to change something to a left join means rewriting the entire query.
If you do it the first way, people under the age of 30 will probably chuckle at you, but as long as you're doing an inner join, they produce the same result and the optimizer will generate the same execution plan (at least as far as I've ever been able to tell).
This does of course presume that the where clause in the first query is how you would be joining in the second query.
This will probably get closed as a duplicate, btw.
The nice part of the second method is that it helps separates the join condition (on ...) from the filter condition (where ...). This can help make the intent of the query more readable.
The join condition will typically be more descriptive of the structure of the database and the relation between the tables. e.g., the salary table is related to the employee table by the EmployeeID column, and queries involving those two tables will probably always join on that column.
The filter condition is more descriptive of the specific task being performed by the query. If the query is FindRichPeople, the where clause might be "where salaries.Salary > 1000000"... thats describing the task at hand, not the database structure.
Note that the SQL compiler doesn't see it that way... if it decides that it will be faster to cross join and then filter the results, it will cross join and filter the results. It doesn't care what is in the ON clause and whats in the WHERE clause. But, that typically wont happen if the on clause matches a foreign key or joins to a primary key or indexed column. As far as operating correctly, they are identical; as far as writing readable, maintainable code, the second way is probably a little better.
there is no difference as far as I know is the second one with the inner join the new way to write such statements and the first one the old method.
The first one does a Cartesian product on all record within those two tables then filters by the where clause.
The second only joins on records that meet the requirements of your ON clause.
EDIT: As others have indicated, the optimization engine will take care of an attempt on a Cartesian product and will result in the same query more or less.
A bit same. Can help you out.
Left join vs multiple tables in SQL (a)
Left join vs multiple tables in SQL (b)
In the example you've given, the queries are equivalent; if you're using SQL Server, run the query and display the actual exection plan to see what the server's doing internally.

DW acceptable left join?

In a report I have the next join from a FACT table:
Join…
LEFT JOIN DimState AS s
ON s.StateCode = l.Province AND l.Locale LIKE (s.CountryCode + '%')
More information:
Fact table has 59,567,773 rows
L.Province can match a StateCode in DimState: 42,346,471 rows 71%
L.Province can’t match a StateCode in DimState: 13,742,966 rows 23% (most of them are a blank value in L.Province).
L.Province is NULL in 3,500,000 rows (6%)
4 questions:
-The correct thing to do, would be to replace L.Province Nulls and blanks for “other”… And have an entry in DimState, with StateCode “other”, right?
-Is it acceptable to LEFT JOIN to a dimension? Or it should always be INNER JOIN?
-Is it correct to join to a dimension on 2 columns?
-To do a l.Locale = s.CountryCode… Should I modify the values in l.Locale or in s.CountryCode?
In order of your four questions:
Yes, you should not have blanks for dimension keys in your fact tables. If the value in the source data is in fact null or empty, there should be members in your dimension tables which are set aside to reflect this.
Therefore, building off 1, you should GENERALLY not do left joins when joining facts to dimensions. I say generally because there might be a situation where this is necessary, but I can't think of anything of the top of my head. You should not have to with properly designed fact and dimension tables.
Generally, no. I would recommend using a surrogate key in this case since your business key is spread across two columns.
Not sure what you are asking here. If you keep this design, you would need to change both. If you switch to using a surrogate key for DimState, you would only have to update the dimension table whenever anything changes.
To build on what mallan1121 said:
1:There are generally three different meanings for null/blank in data warehousing.
A. I don't know the value
B. The value is known and it is blank
C. The value does not apply.
Make sure you consider the relevance for each option as you design your warehouse. The fact should ALWAYS reference a dimension key or you will end up with data quality issues.
2: It can be useful to use left joins if you are abstracting your tables from your cube using views (a good idea) and if you may use those views for non-cube reporting. The reason is that an inner join is a filtering join and the result set is filtered by all inner joined tables even if only a single column is returned.
SELECT DimA.COLUMN, Fact.COLUMN
FROM Fact
JOIN DimA
JOIN DimB --filters result
JOIN DimC --filters result
If you use a left join and you only want columns from the some of the tables, the other joins are ignored and those tables are never accessed.
SELECT DimA.COLUMN, Fact.COLUMN
FROM Fact
LEFT JOIN DimA
LEFT JOIN DimB --ignored
LEFT JOIN DimC --ignored
This can speed up reporting querys run directly against the SQL database. However, you must make sure your ETL process enforces the integrity and that the results returned are identical whether inner or left joins are used.
4: Requiring multiple columns in the join is not a problem, but I'd be very concerned about a multiple column join using a wildcard. I expect you have a granularity issue in your dimension. I don't know your data, but using a wildcard risks getting multiple values back from that dimension.
Do not do this from one simple reason. You will get 13M records with the key L.Province = 'Other' in you dimension table - each record from the fact table with s.StateCode = 'Other' will be joined with those 13M dimension records, leading to massive duplication of the measures.
The proper answer is enforce the primary key on your dimension. Typically a dimnsion have one record with the key other (meaning the key is not known) and possible one other recrod NA (the dimension has no meaning in for this fact record).
The problem is not in the OUTER join- what should be enforced by design is that all foreign key in the fact table are defined in the dimension table.
One step to achieve this is the definition of NA and Other as decribed in 1.
The rationale behind this approach is to enforce that INNER and OUTER joins lead to the same result, i.e. do not cause confusion with different results.
Again each dimension should have defined a PRIMARY KEY - if the PK consist of two columns - the join on those columns is fine. (Typical scenario in DWh though is a single column numeric PK).
What should be avioded is join on LIKEor SUBSTR - this points that the dimension PK is not well defined.
If your dimension has a two column PK Locale + province the fact table must be updated to contain this two column as a FK.

Just curious about SQL Joins

I'm just curious here. If I have two tables, let's say Clients and Orders.
Clients have a unique and primary key ID_Client. Orders have an ID_Client field also and a relation to maintain integrity to Client's table by ID_Client field.
So when I want to join both tables i do:
SELECT
Orders.*, Clients.Name
FROM
Orders
INNER JOIN
Clients ON Clients.ID_Client = Orders.ID_Client
So if I took the job to create the primary key, and the relation between the tables,
Is there a reason why I need to explicitly include the joined columns in on clause?
Why can't I do something like:
SELECT
Orders.*, Clients.Name
FROM
Orders
INNER JOIN
Clients
So SQL should know which columns relate both tables...
I had this same question once and I found a great explanation for it on Database Administrator Stack Exchange, the answer below was the one that I found to be the best, but you can refer to the link for additional explanations as well.
A foreign key is meant to constrain the data. ie enforce
referential integrity. That's it. Nothing else.
You can have multiple foreign keys to the same table. Consider the following where a shipment has a starting point, and an ending point.
table: USA_States
StateID
StateName
table: Shipment
ShipmentID
PickupStateID Foreign key
DeliveryStateID Foreign key
You may want to join based on the pickup state. Maybe you want to join on the delivery state. Maybe you want to perform 2 joins for
both! The sql engine has no way of knowing what you want.
You'll often cross join scalar values. Although scalars are usually the result of intermediate calculations, sometimes you'll have a
special purpose table with exactly 1 record. If the engine tried to
detect a foriegn key for the join.... it wouldn't make sense because
cross joins never match up a column.
In some special cases you'll join on columns where neither is unique. Therefore the presence of a PK/FK on those columns is
impossible.
You may think points 2 and 3 above are not relevant since your questions is about when there IS a single PK/FK relationship
between tables. However the presence of single PK/FK between the
tables does not mean you can't have other fields to join on in
addition to the PK/FK. The sql engine would not know which fields you
want to join on.
Lets say you have a table "USA_States", and 5 other tables with a FK to the states. The "five" tables also have a few foreign keys to
each other. Should the sql engine automatically join the "five" tables
with "USA_States"? Or should it join the "five" to each other? Both?
You could set up the relationships so that the sql engine enters an
infinite loop trying to join stuff together. In this situation it's
impossible fore the sql engine to guess what you want.
In summary: PK/FK has nothing to do with table joins. They are separate unrelated things. It's just an accident of nature that you
often join on the PK/FK columns.
Would you want the sql engine to guess if it's a full, left, right, or
inner join? I don't think so. Although that would arguably be a lesser
sin than guessing the columns to join on.
If you don't explicitly give the field names in the query, SQL doesn't know which fields to use. You won't always have fields that are named the same and you won't always be joining on the primary key. For example, a relationship could be between two foreign key fields named "Client_Address" and "Delivery_Address". In that case, you can easily see how you would need to give the field name.
As an example:
SELECT o.*, c.Name
FROM Clients c
INNER JOIN Orders o
ON o.Delivery_Address = c.Client_Address
Is there a reason why do i need to explicit include then joinned fields in on clause?
Yes, because you still need to tell the database server what you want. "Do what I mean" is not within the capabilities of any software system so far.
Foreign keys are tools for enforcing data integrity. They do not dictate how you can join tables. You can join on any condition that is expressible through an SQL expression.
In other words, a join clause relates two tables to each other by a freely definable condition that needs to evaluate to true given the two rows from left hand side and the right hand side of the join. It does not have to be the foreign key, it can be any condition.
Want to find people that have last names equal to products you sell?
SELECT
Products.Name,
Clients.LastName
FROM
Products
INNER JOIN Clients ON Products.Name = Clients.LastName
There isn't even a foreign key between Products and Clients, still the whole thing works.
It's like that. :)
The sql standard says that you have to say on which columns to join. The constraints are just for referential integrity. With mysql the join support "join table using (column1, column2)" but then those columns have to be present in both tables
Reasons why this behaviour is not default
Because one Table can have multiple columns referencing back to one column in another table.
In a lot of legacy databases there are no Foreign key constraints but yet the columns are “Supposed to be” referencing some column in some other table.
The join conditions are not always as simple as A.Column = B.Column . and the list goes on…….
Microsoft developers were intelligent enough to let us make this decision rather than them guessing that it will always be A.Column = B.Column

Composite indexes for columns in joins and where

If I have a query:
Select *
From tableA
Inner Join tableB ON tableA.bId = tableB.id
Inner Join tableC ON tableA.cId = tableC.id
where
tableA.someColumn = ?
Do I get any performance benefit from creating a composite index(bId,cId,someColumn)?
I'm using DB2 Database for this activity.
Indexing joins depends on the join algorithm used by the database. You'll see that in the execution plan.
You will probably need an index on tableA that starts with someColumn for the where clause. Everything else depends on the join algorithm and join order.
You will probably get a more specific answer if you post the execution plan. You can also read the chapter "The Join Operation" on my site about sql indexing and try yourself.
If there are no indexes now, I'd guess that the composite index might be used in one or both inner joins. I doubt that it would be used in the WHERE clause.
But I've been doing this stuff for a long time. Guessing, like hoping, doesn't scale well.
Instead of guessing, you're better off learning how to use DB2's explain and design advisor utilities. Expect to test things like indexing first on a development computer. Building a three-column index on a 500 million row table that's in production will not make you popular.

How to Speed Up Simple Join

I am no good at SQL.
I am looking for a way to speed up a simple join like this:
SELECT
E.expressionID,
A.attributeName,
A.attributeValue
FROM
attributes A
JOIN
expressions E
ON
E.attributeId = A.attributeId
I am doing this dozens of thousands times and it's taking more and more as the table gets bigger.
I am thinking indexes - If I was to speed up selects on the single tables I'd probably put nonclustered indexes on expressionID for the expressions table and another on (attributeName, attributeValue) for the attributes table - but I don't know how this could apply to the join.
EDIT: I already have a clustered index on expressionId (PK), attributeId (PK, FK) on the expressions table and another clustered index on attributeId (PK) on the attributes table
I've seen this question but I am asking for something more general and probably far simpler.
Any help appreciated!
You definitely want to have indexes on attributeID on both the attributes and expressions table. If you don't currently have those indexes in place, I think you'll see a big speedup.
In fact, because there are so few columns being returned, I would consider a covered index for this query
i.e. an index that includes all the fields in the query.
Some things you need to care about are indexes, the query plan and statistics.
Put indexes on attributeId. Or, make sure indexes exist where attributeId is the first column in the key (SQL Server can still use indexes if it's not the 1st column, but it's not as fast).
Highlight the query in Query Analyzer and hit ^L to see the plan. You can see how tables are joined together. Almost always, using indexes is better than not (there are fringe cases where if a table is small enough, indexes can slow you down -- but for now, just be aware that 99% of the time indexes are good).
Pay attention to the order in which tables are joined. SQL Server maintains statistics on table sizes and will determine which one is better to join first. Do some investigation on internal SQL Server procedures to update statistics -- it's been too long so I don't have that info handy.
That should get you started. Really, an entire chapter can be written on how a database can optimize even such a simple query.
I bet your problem is the huge number of rows that are being inserted into that temp table. Is there any way you can add a WHERE clause before you SELECT every row in the database?
Another thing to do is add some indexes like this:
attributes.{attributeId, attributeName, attributeValue}
expressions.{attributeId, expressionID}
This is hacky! But useful if it's a last resort.
What this does is create a query plan that can be "entirely answered" by indexes. Usually, an index actually causes a double-I/O in your above query: one to hit the index (i.e. probe into the table), another to fetch the actual row referred to by the index (to pull attributeName, etc).
This is especially helpful if "attributes" or "expresssions" is a wide table. That is, a table that's expensive to fetch the rows from.
Finally, the best way to speed your query is to add a WHERE clause!
If I'm understanding your schema correctly, you're stating that your tables kinda look like this:
Expressions: PK - ExpressionID, AttributeID
Attributes: PK - AttributeID
Assuming that each PK is a clustered index, that still means that an Index Scan is required on the Expressions table. You might want to consider creating an Index on the Expressions table such as: AttributeID, ExpressionID. This would help to stop the Index Scanning that currently occurs.
Tips,
If you want to speed up your query using join:
For "inner join/join",
Don't use where condition instead use it in "ON" condition.
Eg:
select id,name from table1 a
join table2 b on a.name=b.name
where id='123'
Try,
select id,name from table1 a
join table2 b on a.name=b.name and a.id='123'
For "Left/Right Join",
Don't use in "ON" condition, Because if you use left/right join it will get all rows for any one table.So, No use of using it in "On". So, Try to use "Where" condition

Resources