PostgreSQL: Arrays instead of an extra table in a many-to-many relationship?

I was wondering whether it is cleaner or faster to use arrays of integers instead of creating an extra table for a many-to-many relationship and using foreign keys to ensure data integrity.
What do you think? Would you say using arrays instead is a no-go?

Arrays to implement an m-to-n relationship are a no-go.
Write down the join between the two tables. You will notice that instead of two joins with = as the join condition, you now have a single join with @> as the join condition (or a LATERAL unnest of the array, which is worse). That means that you are reduced to nested loop joins, which will be slow if the tables are big (even if you create a GIN index to support the @>).
You cannot have referential integrity (foreign key constraints) that way.
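As a minimal sketch of the difference (the table and column names here are made up for illustration, not taken from the question):
-- m:n stored as an array column: the join needs an array operator,
-- and nothing stops author_ids from containing ids that don't exist
CREATE TABLE author (author_id integer PRIMARY KEY, name text);
CREATE TABLE book   (book_id   integer PRIMARY KEY, title text, author_ids integer[]);

SELECT b.title, a.name
FROM book b
JOIN author a ON b.author_ids @> ARRAY[a.author_id];

-- m:n stored as a junction table: plain equality joins, and both
-- columns are protected by foreign key constraints
CREATE TABLE book_author (
    book_id   integer REFERENCES book,
    author_id integer REFERENCES author,
    PRIMARY KEY (book_id, author_id)
);

SELECT b.title, a.name
FROM book b
JOIN book_author ba ON ba.book_id = b.book_id
JOIN author a       ON a.author_id = ba.author_id;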

Related

DW acceptable left join?

In a report I have the following join from a FACT table:
Join…
LEFT JOIN DimState AS s
ON s.StateCode = l.Province AND l.Locale LIKE (s.CountryCode + '%')
More information:
Fact table has 59,567,773 rows
L.Province can match a StateCode in DimState: 42,346,471 rows 71%
L.Province can’t match a StateCode in DimState: 13,742,966 rows 23% (most of them are a blank value in L.Province).
L.Province is NULL in 3,500,000 rows (6%)
4 questions:
-The correct thing to do would be to replace L.Province NULLs and blanks with "Other"… and have an entry in DimState with StateCode "Other", right?
-Is it acceptable to LEFT JOIN to a dimension? Or should it always be an INNER JOIN?
-Is it correct to join to a dimension on 2 columns?
-To do a l.Locale = s.CountryCode… Should I modify the values in l.Locale or in s.CountryCode?
In order of your four questions:
Yes, you should not have blanks for dimension keys in your fact tables. If the value in the source data is in fact null or empty, there should be members in your dimension tables which are set aside to reflect this.
Therefore, building off 1, you should GENERALLY not do left joins when joining facts to dimensions. I say generally because there might be a situation where this is necessary, but I can't think of anything off the top of my head. You should not have to with properly designed fact and dimension tables.
Generally, no. I would recommend using a surrogate key in this case since your business key is spread across two columns.
Not sure what you are asking here. If you keep this design, you would need to change both. If you switch to using a surrogate key for DimState, you would only have to update the dimension table whenever anything changes.
To build on what mallan1121 said:
1: There are generally three different meanings for null/blank in data warehousing:
A. I don't know the value
B. The value is known and it is blank
C. The value does not apply.
Make sure you consider the relevance for each option as you design your warehouse. The fact should ALWAYS reference a dimension key or you will end up with data quality issues.
2: It can be useful to use left joins if you are abstracting your tables from your cube using views (a good idea) and if you may use those views for non-cube reporting. The reason is that an inner join is a filtering join and the result set is filtered by all inner joined tables even if only a single column is returned.
SELECT DimA.COLUMN, Fact.COLUMN
FROM Fact
JOIN DimA ON DimA.Key = Fact.DimAKey   -- key column names illustrative
JOIN DimB ON DimB.Key = Fact.DimBKey   --filters result
JOIN DimC ON DimC.Key = Fact.DimCKey   --filters result
If you use a left join and you only want columns from some of the tables, the other joins are ignored and those tables are never accessed.
SELECT DimA.COLUMN, Fact.COLUMN
FROM Fact
LEFT JOIN DimA ON DimA.Key = Fact.DimAKey
LEFT JOIN DimB ON DimB.Key = Fact.DimBKey   --ignored
LEFT JOIN DimC ON DimC.Key = Fact.DimCKey   --ignored
This can speed up reporting queries run directly against the SQL database. However, you must make sure your ETL process enforces the integrity and that the results returned are identical whether inner or left joins are used.
4: Requiring multiple columns in the join is not a problem, but I'd be very concerned about a multiple column join using a wildcard. I expect you have a granularity issue in your dimension. I don't know your data, but using a wildcard risks getting multiple values back from that dimension.
Do not do this, for one simple reason: you would get 13M fact records keyed 'Other', and if the dimension contains more than one record with StateCode = 'Other', each of those fact records will be joined to all of them, leading to massive duplication of the measures.
The proper answer is to enforce the primary key on your dimension. Typically a dimension has one record with the key 'Other' (meaning the key is not known) and possibly one more record 'NA' (the dimension has no meaning for this fact record).
The problem is not in the OUTER join - what should be enforced by design is that all foreign keys in the fact table are defined in the dimension table.
One step to achieve this is the definition of 'NA' and 'Other' as described in 1.
The rationale behind this approach is to enforce that INNER and OUTER joins lead to the same result, i.e. do not cause confusion with different results.
Again, each dimension should have a PRIMARY KEY defined - if the PK consists of two columns, the join on those columns is fine. (The typical scenario in a DWH, though, is a single-column numeric PK.)
What should be avoided is a join on LIKE or SUBSTR - this indicates that the dimension PK is not well defined.
If your dimension has a two-column PK (Locale + Province), the fact table must be updated to contain these two columns as a FK.
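To make point 1 and the primary-key discussion concrete, a rough sketch (the member values, the StateName column and the FactTable name are assumptions, not taken from the question):
-- dedicated dimension members for the "missing" cases, kept unique by the PK
INSERT INTO DimState (StateCode, CountryCode, StateName) VALUES ('Other', 'Other', 'Unknown');
INSERT INTO DimState (StateCode, CountryCode, StateName) VALUES ('N/A',   'N/A',   'Not applicable');

-- during the ETL, map blank/NULL provinces in the fact load to the 'Other' member
UPDATE FactTable
SET Province = 'Other'
WHERE Province IS NULL OR Province = '';

-- once every fact key is guaranteed to exist in the dimension,
-- INNER JOIN and LEFT JOIN return the same number of rows
SELECT s.StateName, COUNT(*) AS RecordCount
FROM FactTable l
JOIN DimState s ON s.StateCode = l.Province
GROUP BY s.StateName;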

To use a FK in a one-to-many relationship versus using a join table

First of all a little bit of context:
TableA as ta
TableB as tb
One 'ta' has many 'tb', but one 'tb' can only be owned by one 'ta'.
I'd often just add a FK to 'tb' pointing to 'ta' and be done with it. Now I'm willing to model it differently (to improve readability); I want to use a join table, named 'ta_tb', and put a PK on 'tb_id' to enforce the one-to-many rule.
Are there any performance issues with approach B compared to approach A?
If this is a clear 1:n relation (and always will be!), there is no need for (and no advantage to) a new table in between.
You would use such a join table to build an m:n relation.
There could be one single reason to use a join table with a 1:n relation: if you want to manage additional data specifying details of this relation.
HTH
Whenever you normalize your database, there is always a performance hit. If you use a join table (sometimes referred to as a cross-reference table), the DBMS will need to do work to join the right records.
DBMSs these days do pretty well with creating indexes and reducing these performance hits. It just depends on your situation.
Is it more important to have readability and normalization? Then use a join/xref table.
Is this for a small application that you want to perform well? Just make Table B have a FK to its parent.
If you index correctly, there should be very little performance impact although there will be a very slight extra cost that is likely not noticeable unless your database is maxed out already.
However, if you do this and want to maintain the rule that each id in table B must have one and only one A, then you need to make sure to put a unique index on table B's id in the join table. Later, if you need to change that to a many-to-many relationship, you can remove the index, which is certainly easier than changing the database structure. Otherwise this design puts your data integrity at risk.
So the join table might be the structure I would recommend if I knew that eventually a many-to-many relationship was possible. Otherwise, I would probably stick with the simpler structure of the FK in table B.
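A minimal sketch of both options (only 'ta', 'tb' and 'ta_tb' come from the question; the column names are assumed, and tb is shown twice only so both variants fit side by side):
-- approach A: FK directly on tb
CREATE TABLE ta (ta_id integer PRIMARY KEY);
CREATE TABLE tb (
    tb_id integer PRIMARY KEY,
    ta_id integer NOT NULL REFERENCES ta
);

-- approach B: join table; the PK on tb_id keeps the relation 1:n
-- (widen it to (ta_id, tb_id) later if m:n is ever needed)
CREATE TABLE tb2 (tb_id integer PRIMARY KEY);   -- tb without the FK column
CREATE TABLE ta_tb (
    tb_id integer PRIMARY KEY REFERENCES tb2,
    ta_id integer NOT NULL REFERENCES ta
);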
Yes - at the very least you will need one more join to access TableB's fields, and this will impact performance (there is a question about this here: When and why are database joins expensive?).
Also, a relation that uses a join table is normally a many-to-many; in your case, the PK on the middle table "makes" this relation one-to-many, and that is less readable than approach A.
Keep it as simple as possible; approach A is perfect for one-to-many relationships.

Just curious about SQL Joins

I'm just curious here. If I have two tables, let's say Clients and Orders.
Clients have a unique and primary key ID_Client. Orders also have an ID_Client field, with a relation to the Clients table on ID_Client to maintain integrity.
So when I want to join both tables I do:
SELECT
Orders.*, Clients.Name
FROM
Orders
INNER JOIN
Clients ON Clients.ID_Client = Orders.ID_Client
So if I already took the trouble to create the primary key and the relation between the tables,
is there a reason why I need to explicitly include the joined columns in the ON clause?
Why can't I do something like:
SELECT
Orders.*, Clients.Name
FROM
Orders
INNER JOIN
Clients
So SQL should know which columns relate both tables...
I had this same question once and I found a great explanation for it on Database Administrator Stack Exchange, the answer below was the one that I found to be the best, but you can refer to the link for additional explanations as well.
A foreign key is meant to constrain the data. ie enforce
referential integrity. That's it. Nothing else.
You can have multiple foreign keys to the same table. Consider the following where a shipment has a starting point, and an ending point.
table: USA_States
StateID
StateName
table: Shipment
ShipmentID
PickupStateID Foreign key
DeliveryStateID Foreign key
You may want to join based on the pickup state. Maybe you want to join on the delivery state. Maybe you want to perform 2 joins for
both! The sql engine has no way of knowing what you want.
You'll often cross join scalar values. Although scalars are usually the result of intermediate calculations, sometimes you'll have a
special purpose table with exactly 1 record. If the engine tried to
detect a foreign key for the join... it wouldn't make sense because
cross joins never match up a column.
In some special cases you'll join on columns where neither is unique. Therefore the presence of a PK/FK on those columns is
impossible.
You may think points 2 and 3 above are not relevant since your question is about when there IS a single PK/FK relationship
between tables. However the presence of single PK/FK between the
tables does not mean you can't have other fields to join on in
addition to the PK/FK. The sql engine would not know which fields you
want to join on.
Let's say you have a table "USA_States", and 5 other tables with a FK to the states. The "five" tables also have a few foreign keys to
each other. Should the sql engine automatically join the "five" tables
with "USA_States"? Or should it join the "five" to each other? Both?
You could set up the relationships so that the sql engine enters an
infinite loop trying to join stuff together. In this situation it's
impossible for the sql engine to guess what you want.
In summary: PK/FK has nothing to do with table joins. They are separate unrelated things. It's just an accident of nature that you
often join on the PK/FK columns.
Would you want the sql engine to guess if it's a full, left, right, or
inner join? I don't think so. Although that would arguably be a lesser
sin than guessing the columns to join on.
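To spell out the shipment example above as an actual query (a sketch; the aliases and the StateName output follow the pseudo-tables given earlier):
SELECT sh.ShipmentID,
       pickup.StateName   AS PickupState,
       delivery.StateName AS DeliveryState
FROM Shipment sh
JOIN USA_States AS pickup   ON pickup.StateID   = sh.PickupStateID
JOIN USA_States AS delivery ON delivery.StateID = sh.DeliveryStateID;
-- the engine could never know, on its own, which of the two FKs
-- (or both) you meant to join on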
If you don't explicitly give the field names in the query, SQL doesn't know which fields to use. You won't always have fields that are named the same and you won't always be joining on the primary key. For example, a relationship could be between two foreign key fields named "Client_Address" and "Delivery_Address". In that case, you can easily see how you would need to give the field name.
As an example:
SELECT o.*, c.Name
FROM Clients c
INNER JOIN Orders o
ON o.Delivery_Address = c.Client_Address
Is there a reason why I need to explicitly include the joined fields in the ON clause?
Yes, because you still need to tell the database server what you want. "Do what I mean" is not within the capabilities of any software system so far.
Foreign keys are tools for enforcing data integrity. They do not dictate how you can join tables. You can join on any condition that is expressible through an SQL expression.
In other words, a join clause relates two tables to each other by a freely definable condition that needs to evaluate to true given the two rows from left hand side and the right hand side of the join. It does not have to be the foreign key, it can be any condition.
Want to find people that have last names equal to products you sell?
SELECT
Products.Name,
Clients.LastName
FROM
Products
INNER JOIN Clients ON Products.Name = Clients.LastName
There isn't even a foreign key between Products and Clients, still the whole thing works.
It's like that. :)
The SQL standard says that you have to state which columns to join on. The constraints are just for referential integrity. MySQL supports JOIN table USING (column1, column2), but then those columns have to be present in both tables.
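For reference, a sketch of that USING shorthand applied to the tables from the question (other databases, PostgreSQL included, accept it too - but you still name the columns yourself):
SELECT o.*, c.Name
FROM Orders o
INNER JOIN Clients c USING (ID_Client);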
Reasons why this behaviour is not the default:
Because one table can have multiple columns referencing back to one column in another table.
In a lot of legacy databases there are no foreign key constraints, yet the columns are "supposed to be" referencing some column in some other table.
The join conditions are not always as simple as A.Column = B.Column, and the list goes on...
Microsoft's developers were intelligent enough to let us make this decision rather than guessing that it will always be A.Column = B.Column.

Comparison of lookup times: foreign key is or is not present

A colleague recently described to me a plan to re-architect a database. The new database will conform to a simple star schema: the parent table will consist of a key and some contextual information, and that key will serve as a foreign key field in other tables. The foreign key field may appear in the same child table multiple times.
Pseudocode:
TABLE Parent
INT key PRIMARY_KEY
INT foo
...
TABLE Child1
INT key FOREIGN_KEY REFERENCES Parent.key
BLOB bar
...
TABLE Child2
INT key FOREIGN_KEY REFERENCES Parent.key
VARCHAR tar
...
The motivation behind the design is to simplify JOINs between Parent and Child<n>, which was complicated with the previous schema.
In an effort to further speed up JOINs, my colleague wishes to minimize the use of OUTER JOINs. Specifically, she wants to emulate OUTER JOINs by using inner JOINs and by maintaining the data in the child tables in a particular way: populating all of them such that for each key in Parent there is at least one row in Child<n> with that key value, even if the row is otherwise full of NULLs. This way, any JOIN performed between Parent and Child<n> on key would return at least one result for every key in Parent, much like an OUTER JOIN.
Putting aside the question of whether or not maintaining the data in this way is worth the effort, is this approach more performant than doing OUTER JOINs, assuming all key fields are properly indexed and about half the children's rows are nulled out?
The question seems to boil down to: "is it faster to do an index lookup for a value that is present in the index than for a value that is not present?" Assuming the index operates like a B-tree or a hash, the answer strikes me as "no," but I don't know enough to be certain.
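The two alternatives under comparison, written out as queries (a sketch using the names from the pseudocode above):
-- plain outer join: Parent keys with no Child1 row come back with NULLs
SELECT p.key, c.bar
FROM Parent p
LEFT JOIN Child1 c ON c.key = p.key;

-- the proposed emulation: an inner join that relies on a placeholder
-- (otherwise-NULL) Child1 row existing for every Parent key
SELECT p.key, c.bar
FROM Parent p
INNER JOIN Child1 c ON c.key = p.key;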
Personally, I have not noticed major performance differences between outer joins and inner joins. Why does your colleague believe that they are slower?
Adding additional records has two effects on performance. The original data gets larger, requiring more pages to store the data. This can have a large effect on performance, particularly if the additional pages (with no useful data) are competing for space with more useful structures (say the indexes).
The second effect is on the index. It will need to be larger, which can result in a deeper index and more index pages. Both of these can have an effect on performance.
There is another issue as well, not related to performance. Users/developers writing queries would need to fully understand that these empty records exist. It is quite easy to do a COUNT(*) or COUNT(column) and expect the result to accurately reflect the number of records with data. If this isn't the case, you might cause coding problems down the road.
I don't think this method will improve performance.
Inner joins are usually faster than outer joins. This is because inner joins are more restrictive, giving the optimizer more opportunities to reduce the result set earlier in the plan.
But if you artificially add data, your inner joins aren't more restrictive anymore.

Database joins / FK - basic questions

There are 2 tables, A & B. Each has, say, 10 columns.
Table A has 8 columns that are FKs to other tables. Table B uses enums and standard columns without any FKs.
So which table is faster / better to use?
If I do any action with table A, I assume I only have to touch the columns the action relates to, and do not have to join all the 10 FK tables even if I only need 1 FK column?
If I need to perform any action on a FK, like writing, updating or deleting a value, do I need to join to the parent table?
If I understand correctly, the EAV model is better than an expanded-column table, because if I need to display two texts from the table then I need an inner join for the column table, whereas for an EAV table I can use a regular select with no join?
For only a few values, and if the set of values doesn't change, ENUM can be faster and takes up less space. However, to later add possible values you'll need to alter the entire table, which is not good design. Table A is in most cases the better option.
Of course, you only join table A with the tables you need.
No, you can just modify the table containing the value, unless you change the PK. You should however design your tables in such a way that changing the PK is not often needed - use artificial PKs (autoincrements are perfect). Even countries cease to exist or change names...
No, for your EAV you'll need the join. However, joining on keys is extremely fast... this is what relational databases are all about, it's their strong point.
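As a sketch of that last point (the item/item_attr tables here are hypothetical, just to illustrate the shape of the two queries):
-- expanded columns: the attributes are ordinary columns, one plain select
SELECT item_id, color, size
FROM item;

-- EAV: fetching two attributes means hitting the attribute table twice
SELECT c.item_id,
       c.attr_value AS color,
       s.attr_value AS size
FROM item_attr c
JOIN item_attr s ON s.item_id = c.item_id AND s.attr_name = 'size'
WHERE c.attr_name = 'color';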
