I'm just curious here. Say I have two tables, Clients and Orders.
Clients has a unique primary key, ID_Client. Orders also has an ID_Client field, with a foreign key relation to the Clients table on that field to maintain integrity.
So when I want to join both tables I do:
SELECT
Orders.*, Clients.Name
FROM
Orders
INNER JOIN
Clients ON Clients.ID_Client = Orders.ID_Client
So if I went to the trouble of creating the primary key and the relation between the tables,
is there a reason why I need to explicitly include the joined columns in the ON clause?
Why can't I do something like:
SELECT
Orders.*, Clients.Name
FROM
Orders
INNER JOIN
Clients
So SQL should know which columns relate both tables...
I had this same question once and found a great explanation for it on Database Administrators Stack Exchange. The answer below is the one I found best, but you can refer to the link for additional explanations as well.
A foreign key is meant to constrain the data, i.e. to enforce referential integrity. That's it. Nothing else.
You can have multiple foreign keys to the same table. Consider the following where a shipment has a starting point, and an ending point.
table: USA_States
StateID
StateName
table: Shipment
ShipmentID
PickupStateID Foreign key
DeliveryStateID Foreign key
You may want to join based on the pickup state. Maybe you want to join on the delivery state. Maybe you want to perform 2 joins for both! The SQL engine has no way of knowing what you want.
You'll often cross join scalar values. Although scalars are usually the result of intermediate calculations, sometimes you'll have a special-purpose table with exactly 1 record. If the engine tried to detect a foreign key for the join, it wouldn't make sense, because cross joins never match up a column.
In some special cases you'll join on columns where neither is unique. Therefore the presence of a PK/FK on those columns is
impossible.
You may think the previous two points (cross joins and joins on non-unique columns) are not relevant, since your question is about when there IS a single PK/FK relationship between tables. However, the presence of a single PK/FK between the tables does not mean you can't have other fields to join on in addition to the PK/FK. The SQL engine would not know which fields you want to join on.
Let's say you have a table "USA_States", and 5 other tables with a FK to the states. The "five" tables also have a few foreign keys to each other. Should the SQL engine automatically join the "five" tables with "USA_States"? Or should it join the "five" to each other? Both? You could set up the relationships so that the SQL engine enters an infinite loop trying to join stuff together. In this situation it's impossible for the SQL engine to guess what you want.
In summary: PK/FK has nothing to do with table joins. They are separate unrelated things. It's just an accident of nature that you
often join on the PK/FK columns.
Would you want the sql engine to guess if it's a full, left, right, or
inner join? I don't think so. Although that would arguably be a lesser
sin than guessing the columns to join on.
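The shipment example above can be sketched as follows. This is only an illustration run on SQLite through Python's sqlite3 module; the sample rows are made up, and the column types are assumptions that the original answer does not state. USA_States is joined twice, and only the ON clauses tell the engine which foreign key each join should follow.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE USA_States (
        StateID   INTEGER PRIMARY KEY,
        StateName TEXT NOT NULL
    );
    -- Two foreign keys pointing at the same parent table.
    CREATE TABLE Shipment (
        ShipmentID      INTEGER PRIMARY KEY,
        PickupStateID   INTEGER NOT NULL REFERENCES USA_States (StateID),
        DeliveryStateID INTEGER NOT NULL REFERENCES USA_States (StateID)
    );
    -- Made-up sample data.
    INSERT INTO USA_States VALUES (1, 'Ohio'), (2, 'Texas');
    INSERT INTO Shipment VALUES (100, 1, 2);
""")

# The same table is joined twice under different aliases; the ON clauses
# are the only thing that says which foreign key each join uses.
shipment_rows = conn.execute("""
    SELECT sh.ShipmentID, pickup.StateName, delivery.StateName
    FROM Shipment AS sh
    INNER JOIN USA_States AS pickup   ON pickup.StateID   = sh.PickupStateID
    INNER JOIN USA_States AS delivery ON delivery.StateID = sh.DeliveryStateID
""").fetchall()

print(shipment_rows)  # [(100, 'Ohio', 'Texas')]
```

With the ON clauses removed, nothing in the schema alone would say whether the pickup state, the delivery state, or both were wanted.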
If you don't explicitly give the field names in the query, SQL doesn't know which fields to use. You won't always have fields that are named the same and you won't always be joining on the primary key. For example, a relationship could be between two foreign key fields named "Client_Address" and "Delivery_Address". In that case, you can easily see how you would need to give the field name.
As an example:
SELECT o.*, c.Name
FROM Clients c
INNER JOIN Orders o
ON o.Delivery_Address = c.Client_Address
Is there a reason why I need to explicitly include the joined fields in the ON clause?
Yes, because you still need to tell the database server what you want. "Do what I mean" is not within the capabilities of any software system so far.
Foreign keys are tools for enforcing data integrity. They do not dictate how you can join tables. You can join on any condition that is expressible through an SQL expression.
In other words, a join clause relates two tables to each other by a freely definable condition that needs to evaluate to true for a given pair of rows, one from the left-hand side and one from the right-hand side of the join. It does not have to be the foreign key, it can be any condition.
Want to find people that have last names equal to products you sell?
SELECT
Products.Name,
Clients.LastName
FROM
Products
INNER JOIN Clients ON Products.Name = Clients.LastName
There isn't even a foreign key between Products and Clients, still the whole thing works.
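To push the "any condition" point a little further, here is a small sketch of a join that is not even an equality. It runs on SQLite through Python's sqlite3 module, and the Bands table, its columns and the sample amounts are all hypothetical, invented just for the illustration.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    -- Hypothetical tables: no foreign key links them in any way.
    CREATE TABLE Orders (ID_Order INTEGER PRIMARY KEY, Amount INTEGER);
    CREATE TABLE Bands (Label TEXT, MinAmount INTEGER, MaxAmount INTEGER);
    INSERT INTO Orders VALUES (1, 50), (2, 500);
    INSERT INTO Bands VALUES ('small', 0, 99), ('large', 100, 999);
""")

# The join condition is a range test, which no key declaration could express.
banded_orders = conn.execute("""
    SELECT o.ID_Order, b.Label
    FROM Orders AS o
    INNER JOIN Bands AS b ON o.Amount BETWEEN b.MinAmount AND b.MaxAmount
    ORDER BY o.ID_Order
""").fetchall()

print(banded_orders)  # [(1, 'small'), (2, 'large')]
```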
It's like that. :)
The SQL standard says that you have to state which columns to join on. The constraints are just for referential integrity. MySQL also supports "JOIN table USING (column1, column2)", but then those columns have to be present, under the same names, in both tables.
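As a rough sketch of the USING form, here it is next to the equivalent ON form. The example runs on SQLite through Python's sqlite3 module rather than on MySQL (SQLite also accepts USING), and the sample rows are made up. Note that USING matches on the column name only; it does not look at the foreign key.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE Clients (ID_Client INTEGER PRIMARY KEY, Name TEXT);
    CREATE TABLE Orders (
        ID_Order  INTEGER PRIMARY KEY,
        ID_Client INTEGER REFERENCES Clients (ID_Client)
    );
    -- Made-up sample data.
    INSERT INTO Clients VALUES (1, 'Ann'), (2, 'Bob');
    INSERT INTO Orders VALUES (10, 1), (11, 1), (12, 2);
""")

# Explicit ON clause, as in the question.
on_rows = conn.execute("""
    SELECT Orders.ID_Order, Clients.Name
    FROM Orders
    INNER JOIN Clients ON Clients.ID_Client = Orders.ID_Client
    ORDER BY Orders.ID_Order
""").fetchall()

# USING shorthand: works only because both tables have a column
# with exactly the same name, ID_Client.
using_rows = conn.execute("""
    SELECT Orders.ID_Order, Clients.Name
    FROM Orders
    INNER JOIN Clients USING (ID_Client)
    ORDER BY Orders.ID_Order
""").fetchall()

print(on_rows == using_rows)  # True
```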
Reasons why this behaviour is not the default:
Because one table can have multiple columns referencing the same column in another table.
In a lot of legacy databases there are no foreign key constraints, yet the columns are "supposed to" reference some column in some other table.
The join conditions are not always as simple as A.Column = B.Column.
And the list goes on.
Microsoft developers were intelligent enough to let us make this decision rather than guessing that it will always be A.Column = B.Column.
In a report I have the following join from a FACT table:
Join…
LEFT JOIN DimState AS s
ON s.StateCode = l.Province AND l.Locale LIKE (s.CountryCode + '%')
More information:
Fact table has 59,567,773 rows
L.Province can match a StateCode in DimState: 42,346,471 rows 71%
L.Province can’t match a StateCode in DimState: 13,742,966 rows 23% (most of them are a blank value in L.Province).
L.Province is NULL in 3,500,000 rows (6%)
4 questions:
- Would the correct thing be to replace the NULLs and blanks in L.Province with "other", and have an entry in DimState with StateCode "other"?
- Is it acceptable to LEFT JOIN to a dimension, or should it always be an INNER JOIN?
- Is it correct to join to a dimension on 2 columns?
- To be able to do l.Locale = s.CountryCode, should I modify the values in l.Locale or in s.CountryCode?
In order of your four questions:
Yes, you should not have blanks for dimension keys in your fact tables. If the value in the source data is in fact null or empty, there should be members in your dimension tables which are set aside to reflect this.
Therefore, building on the first answer, you should GENERALLY not do left joins when joining facts to dimensions. I say generally because there might be a situation where this is necessary, but I can't think of anything off the top of my head. You should not have to with properly designed fact and dimension tables.
Generally, no. I would recommend using a surrogate key in this case since your business key is spread across two columns.
Not sure what you are asking here. If you keep this design, you would need to change both. If you switch to using a surrogate key for DimState, you would only have to update the dimension table whenever anything changes.
To build on what mallan1121 said:
1: There are generally three different meanings for null/blank in data warehousing.
A. I don't know the value
B. The value is known and it is blank
C. The value does not apply.
Make sure you consider the relevance for each option as you design your warehouse. The fact should ALWAYS reference a dimension key or you will end up with data quality issues.
2: It can be useful to use left joins if you are abstracting your tables from your cube using views (a good idea) and if you may use those views for non-cube reporting. The reason is that an inner join is a filtering join and the result set is filtered by all inner joined tables even if only a single column is returned.
SELECT DimA.COLUMN, Fact.COLUMN
FROM Fact
JOIN DimA
JOIN DimB --filters result
JOIN DimC --filters result
If you use left joins and you only want columns from some of the tables, the optimizer can ignore the other joins, so those tables are never accessed (this assumes each dimension is joined on a unique key, so the join cannot multiply rows).
SELECT DimA.COLUMN, Fact.COLUMN
FROM Fact
LEFT JOIN DimA
LEFT JOIN DimB --ignored
LEFT JOIN DimC --ignored
This can speed up reporting queries run directly against the SQL database. However, you must make sure your ETL process enforces the integrity and that the results returned are identical whether inner or left joins are used.
4: Requiring multiple columns in the join is not a problem, but I'd be very concerned about a multiple column join using a wildcard. I expect you have a granularity issue in your dimension. I don't know your data, but using a wildcard risks getting multiple values back from that dimension.
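The filtering effect of inner joins described in point 2 can be sketched like this. It is a toy illustration on SQLite through Python's sqlite3 module; the key columns (KeyA, KeyB), the Amount measure and the data are hypothetical, since the snippets above leave the join conditions out. It shows only the difference in results, not the optimizer behaviour.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    -- Hypothetical star schema with two dimensions.
    CREATE TABLE DimA (KeyA INTEGER PRIMARY KEY, NameA TEXT);
    CREATE TABLE DimB (KeyB INTEGER PRIMARY KEY, NameB TEXT);
    CREATE TABLE Fact (KeyA INTEGER, KeyB INTEGER, Amount INTEGER);
    INSERT INTO DimA VALUES (1, 'a1');
    INSERT INTO DimB VALUES (1, 'b1');
    -- The second fact row carries a KeyB that has no row in DimB.
    INSERT INTO Fact VALUES (1, 1, 10), (1, 99, 20);
""")

# No DimB column is selected, yet the inner join to DimB still drops a row.
inner_amounts = conn.execute("""
    SELECT DimA.NameA, Fact.Amount
    FROM Fact
    JOIN DimA ON DimA.KeyA = Fact.KeyA
    JOIN DimB ON DimB.KeyB = Fact.KeyB
    ORDER BY Fact.Amount
""").fetchall()

# With left joins every fact row survives.
left_amounts = conn.execute("""
    SELECT DimA.NameA, Fact.Amount
    FROM Fact
    LEFT JOIN DimA ON DimA.KeyA = Fact.KeyA
    LEFT JOIN DimB ON DimB.KeyB = Fact.KeyB
    ORDER BY Fact.Amount
""").fetchall()

print(inner_amounts)  # [('a1', 10)]
print(left_amounts)   # [('a1', 10), ('a1', 20)]
```

If the ETL process guarantees that every fact key exists in its dimension, the two result sets are identical, which is the condition asked for above.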
Do not do this, for one simple reason. You will get 13M records with the key L.Province = 'Other' in your dimension table, and each record from the fact table with s.StateCode = 'Other' will be joined with those 13M dimension records, leading to massive duplication of the measures.
The proper answer is to enforce the primary key on your dimension. Typically a dimension has one record with the key Other (meaning the key is not known) and possibly one other record NA (the dimension has no meaning for this fact record).
The problem is not in the OUTER join. What should be enforced by design is that all foreign keys in the fact table are defined in the dimension table.
One step to achieve this is the definition of NA and Other as described in 1.
The rationale behind this approach is to enforce that INNER and OUTER joins lead to the same result, i.e. do not cause confusion with different results.
Again, each dimension should have a PRIMARY KEY defined. If the PK consists of two columns, the join on those columns is fine. (The typical scenario in a DWH, though, is a single-column numeric PK.)
What should be avoided is a join on LIKE or SUBSTR; this indicates that the dimension PK is not well defined.
If your dimension has a two-column PK, Locale + Province, the fact table must be updated to contain these two columns as a FK.
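A small sketch of why the dimension key has to be unique, run on SQLite through Python's sqlite3 module. The table DimStateNoPk, the Amount column and all rows are hypothetical, scaled down from the numbers in the question. A duplicated 'Other' member inflates the measure; a primary key makes the duplicate impossible in the first place.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    -- Hypothetical dimension WITHOUT a primary key: 'Other' appears twice.
    CREATE TABLE DimStateNoPk (StateCode TEXT, StateName TEXT);
    CREATE TABLE Fact (Province TEXT, Amount INTEGER);
    INSERT INTO DimStateNoPk VALUES
        ('CA', 'California'), ('Other', 'Unknown'), ('Other', 'Unknown again');
    INSERT INTO Fact VALUES ('CA', 10), ('Other', 5);
""")

true_total = conn.execute("SELECT SUM(Amount) FROM Fact").fetchone()[0]

# The 'Other' fact row matches two dimension rows and is counted twice.
joined_total = conn.execute("""
    SELECT SUM(l.Amount)
    FROM Fact AS l
    INNER JOIN DimStateNoPk AS s ON s.StateCode = l.Province
""").fetchone()[0]

# With a primary key the duplicate member is rejected on insert.
conn.execute("CREATE TABLE DimState (StateCode TEXT PRIMARY KEY, StateName TEXT)")
conn.execute("INSERT INTO DimState VALUES ('Other', 'Unknown')")
try:
    conn.execute("INSERT INTO DimState VALUES ('Other', 'Unknown again')")
    duplicate_rejected = False
except sqlite3.IntegrityError:
    duplicate_rejected = True

print(true_total, joined_total, duplicate_rejected)  # 15 20 True
```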
Is it possible to link one table to another table even if it's not clear in which column the foreign key appears?
Example:
Table 'server' has (among others) two fields -> 'internal ip' and 'external ip'
Another table 'server_details' has only a field 'ip'.
'server_details' and 'server' should be joined at the ip.
The problem is, we don't know whether the ip in server_details is the external or the internal ip, so it could appear in either column. But every ip is (or should be) unique across the whole database and will definitely match exactly one record in one of the two possible fields.
Can somebody tell me how to realise this? Or isn't it possible at all?
In the end I have to map this behaviour to Doctrine entities ...
I think you are going at it the opposite way.
server.internal_ip and server.external_ip can have a foreign key relationship to server_details.ip
The idea of linking tables is a relic of the network data model. In relational databases, we can define foreign key constraints (integrity constraints which ensure one column's values exist in another, but which doesn't create or limit access paths) and join tables (on any condition, regardless of FK constraints). However, your use of Doctrine may limit you in this regard since object-relational mappers try to reinvent network model databases.
You can easily join server to server_details:
SELECT *
FROM server_details sd
INNER JOIN server s ON sd.ip = s.internal_ip OR sd.ip = s.external_ip
or possibly
INNER JOIN server s ON sd.ip = COALESCE(s.external_ip, s.internal_ip)
An FK constraint is a separate matter. If none of your columns uniquely represent the domain of IPs, an FK constraint may not be appropriate. It may be possible to refactor your design to make it more amenable to integrity constraints. If you post your schema or list functional dependencies, I could amend my answer.
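The two join variants above can be tried out with a quick sketch like this one, run on SQLite through Python's sqlite3 module. The column types, the note column and the addresses (taken from documentation ranges) are made up. It also suggests that the COALESCE form is not equivalent to the OR form when both columns are filled in, because COALESCE then only ever compares against external_ip.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE server (
        id INTEGER PRIMARY KEY, internal_ip TEXT, external_ip TEXT
    );
    CREATE TABLE server_details (ip TEXT, note TEXT);
    -- Made-up sample data.
    INSERT INTO server VALUES
        (1, '10.0.0.1', '203.0.113.1'),
        (2, '10.0.0.2', '203.0.113.2');
    INSERT INTO server_details VALUES
        ('10.0.0.1', 'matched by internal ip'),
        ('203.0.113.2', 'matched by external ip');
""")

# OR variant: a details row matches whichever column holds its ip.
ip_matches = conn.execute("""
    SELECT s.id, sd.note
    FROM server_details AS sd
    INNER JOIN server AS s ON sd.ip = s.internal_ip OR sd.ip = s.external_ip
    ORDER BY s.id
""").fetchall()

# COALESCE variant: internal_ip is only considered when external_ip is NULL.
coalesce_matches = conn.execute("""
    SELECT s.id, sd.note
    FROM server_details AS sd
    INNER JOIN server AS s ON sd.ip = COALESCE(s.external_ip, s.internal_ip)
    ORDER BY s.id
""").fetchall()

print(ip_matches)        # [(1, 'matched by internal ip'), (2, 'matched by external ip')]
print(coalesce_matches)  # [(2, 'matched by external ip')]
```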
You should consider adding an id field to server that can be declared as a PK, unless server already has a PK. Then, put a field called server_id in server_details and declare that as a FK that references server.id. Now, the join is easy:
SELECT *
FROM server_details sd
INNER JOIN server s ON sd.server_id = s.id
The id field serves no other purpose than to uniquely identify a server. You can use the autonumber feature of your DBMS to assign new values.
I'm not sure how this is affected by Doctrine.
The problem went away because in the end we redefined our data model anyway. Thanks for your help!
First of all a little bit of context:
TableA as ta
TableB as tb
One 'ta' has many 'tb', but one 'tb' can only be owned by one 'ta'.
I'd usually just add a FK to 'tb' pointing to 'ta' (approach a) and be done. Now I'd like to model it differently (to improve its readability): I want to use a join table, named 'ta_tb', and set a PK on 'tb_id' to enforce the 'one-to-many' rule (approach b).
Are there any performance issues when using approach b instead of approach a?
If this is a clear 1:n-relation (and always will be!) there is no need for (and no advantage in) a new table in between.
Such a joining table you would use to build a m:n-relation.
There could be one single reason to use a joining table with a 1:n-relation: If you want to manage additional data specifying details of this relation.
HTH
Whenever you normalize your database, there is always a performance hit. If you use a join table (sometimes referred to as a cross-reference table) the DBMS will need to do work to join the right records.
DBMS's these days do pretty well with creating indexes and reducing these performance hits. It just depends on your situation.
Is it more important to have readability and normalization? Then use a join/xref table.
Is this for a small application that you want to perform well? Just make Table B have a FK to its parent.
If you index correctly, there should be very little performance impact although there will be a very slight extra cost that is likely not noticeable unless your database is maxed out already.
However, if you do this and want to maintain the idea that each id from column b must have one and only 1 a, then you need to make sure to put a unique index on the id for table b in the join table. Later if you need to change that to a many to many relationship, you can remove the index which is certainly easier than changing the database structure. Otherwise this design puts your data integrity at risk.
So the join table might be the structure I would recommend if I knew that eventually a many to many relationship was possible. Otherwise, I would probably stick with the simpler structure of the FK in table b.
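A minimal sketch of the join-table design with the uniqueness rule on tb_id, run on SQLite through Python's sqlite3 module. The table names follow the question; the id columns and the rows are assumptions. Here the primary key on tb_id plays the role of the unique index mentioned above, and every read from ta to tb goes through one extra join.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE ta (id INTEGER PRIMARY KEY);
    CREATE TABLE tb (id INTEGER PRIMARY KEY);
    -- Join table: the key on tb_id means each 'tb' can have only one 'ta'.
    CREATE TABLE ta_tb (
        tb_id INTEGER PRIMARY KEY REFERENCES tb (id),
        ta_id INTEGER NOT NULL REFERENCES ta (id)
    );
    INSERT INTO ta VALUES (1), (2);
    INSERT INTO tb VALUES (10), (11);
    INSERT INTO ta_tb (tb_id, ta_id) VALUES (10, 1), (11, 1);
""")

# Giving tb 10 a second owner violates the key on tb_id.
try:
    conn.execute("INSERT INTO ta_tb (tb_id, ta_id) VALUES (10, 2)")
    second_owner_rejected = False
except sqlite3.IntegrityError:
    second_owner_rejected = True

# Reading the children of ta 1 needs one join more than a plain FK on tb would.
owned_by_ta1 = [row[0] for row in conn.execute("""
    SELECT tb.id
    FROM ta
    JOIN ta_tb ON ta_tb.ta_id = ta.id
    JOIN tb    ON tb.id = ta_tb.tb_id
    WHERE ta.id = 1
    ORDER BY tb.id
""")]

print(second_owner_rejected, owned_by_ta1)  # True [10, 11]
```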
Yes, at the very least you will need one more join to access TableB's fields, and this will impact performance. (There is a question about this here: When and why are database joins expensive?)
Also, a relation that uses a join table is normally understood as many to many. In your case you have a PK on the middle table "making" this relation one to many, and this is less readable than approach A.
Keep it as simple as possible; approach A is perfect for one to many relationships.
There are 2 tables, A & B. Each has, say, 10 columns.
Table A has 8 columns that are FKs to other tables. Table B uses enums and standard columns without any FK.
So which table is faster / better to use?
If I do any action with table A, I assume I only have to touch the columns the action relates to, and do not have to join all the 10 FK tables even if I only need 1 FK column?
If I do need to perform any action on a FK, like write, update or delete a value, do I need to join to the parent table?
If I understand correctly, the EAV model is better than an expanded column table, because if I need to display two texts from the table then I need to use an inner join for the column table, but for an EAV table I can use a regular select with no join?
For only a few values and if the amount of values doesn't change, ENUM can be faster and takes up less space. However, to later add possible values, you'll need to alter the entire table, which is not good design. Table A is in most cases the better option.
Of course, you only join table A with the tables you need.
No, you can just modify the table containing the value, unless you change the PK. You should however design your tables in such way that changing the PK is not often needed - use artificial PK's (autoincrements are perfect). Even countries cease to exist or change names...
No, for your EAV you'll need the join. However, joining on keys is extremely fast... this is what relational databases are all about, it's their strong point.
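The point about modifying values without joining to the parent can be sketched as below, on SQLite through Python's sqlite3 module. The tables Country and TableA and their columns are hypothetical. One SQLite-specific assumption: foreign keys are only enforced after PRAGMA foreign_keys = ON; other engines enforce them by default.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("PRAGMA foreign_keys = ON")  # SQLite-specific: enable FK enforcement
conn.executescript("""
    -- Hypothetical parent table with an artificial (autoincrement-style) PK.
    CREATE TABLE Country (id INTEGER PRIMARY KEY, name TEXT);
    CREATE TABLE TableA (
        id         INTEGER PRIMARY KEY,
        country_id INTEGER REFERENCES Country (id),
        label      TEXT
    );
    INSERT INTO Country VALUES (1, 'Czechoslovakia'), (2, 'France');
    INSERT INTO TableA VALUES (1, 1, 'first');
""")

# Changing the FK value is a plain UPDATE on the child table, no join needed.
conn.execute("UPDATE TableA SET country_id = 2 WHERE id = 1")

# Renaming a country touches only the parent; its artificial PK stays the same.
conn.execute("UPDATE Country SET name = 'Czech Republic' WHERE id = 1")

# The constraint still guards integrity: a non-existent parent is rejected.
try:
    conn.execute("UPDATE TableA SET country_id = 999 WHERE id = 1")
    bad_fk_rejected = False
except sqlite3.IntegrityError:
    bad_fk_rejected = True

# A join is only needed when the parent's columns have to be displayed.
current_row = conn.execute("""
    SELECT a.label, c.name
    FROM TableA AS a
    JOIN Country AS c ON c.id = a.country_id
""").fetchone()

print(bad_fk_rejected, current_row)  # True ('first', 'France')
```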
I have a database that has 50 tables and all the tables have primary key on a field named ID. For example, Employee.ID, Customer.ID, order.ID, every single table has ID as its primary key. Should it not be Employee.Employee_ID, Customer.Customer_ID and so on?
Is there any drawback of using ID as name of every ID field in each table? If so please explain or give link to explanation.
I'd use Employee.Id and Customer.Id with the qualifying table names. Employee.EmployeeId and Customer.CustomerId seem a bit redundant to me.
There are different schools of thought on this. Personally, I prefer to use identical column names only for columns that have primary key / foreign key relationships -- it helps make it easier to write complex joins.
Also, the apparent redundancy disappears when you use table aliases. For example:
SELECT o.OrderDate, c.CustomerName
FROM Customers c
INNER JOIN Orders o ON o.CustomerID = c.CustomerID
It's largely a matter of personal style.
However, if your queries always use appropriate aliases for your tables, and you always use aliases when referencing columns, then it's not too bad.
Please don't fall into the habit (as many do) of saying "There's only one Customer_EmployeeID field in my query, so I'll leave off the alias". It's really poor practice, and drives SQL people crazy when they look at code that's done this way (you end up wanting to query sys.columns to see which table contains a column called latest_status).
So if you're writing your queries nicely, it shouldn't really matter how you name your columns, so long as you're consistent. Using a plain old "ID" for any integer identity field is just fine, so long as that's what you always do.
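Here is a small sketch of what happens when every table calls its key ID and a column is left unqualified. It runs on SQLite through Python's sqlite3 module; the tables, the Customer_ID column and the rows are made up, and other engines word the error differently.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    -- Both tables use a plain "ID" as their primary key.
    CREATE TABLE Customer (ID INTEGER PRIMARY KEY, Name TEXT);
    CREATE TABLE Orders (
        ID          INTEGER PRIMARY KEY,
        Customer_ID INTEGER REFERENCES Customer (ID)
    );
    INSERT INTO Customer VALUES (1, 'Ann');
    INSERT INTO Orders VALUES (7, 1);
""")

# An unqualified ID is ambiguous once both tables are in the query.
try:
    conn.execute(
        "SELECT ID FROM Orders AS o JOIN Customer AS c ON c.ID = o.Customer_ID"
    )
    unqualified_ok = True
except sqlite3.OperationalError:
    unqualified_ok = False

# With aliases on every column the query is unambiguous and easy to read.
qualified_rows = conn.execute("""
    SELECT o.ID, c.ID, c.Name
    FROM Orders AS o
    JOIN Customer AS c ON c.ID = o.Customer_ID
""").fetchall()

print(unqualified_ok, qualified_rows)  # False [(7, 1, 'Ann')]
```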
I typically follow the same naming conventions (Entity.Id rather than Entity.EntityId).
If you think of your tables as entities, adding the table name to the ID field is redundant, forces you to write more code, and makes your queries harder to read.