Snapshot Tables with Foreign Keys vs. Snapshot Tables with Real Values - sql-server

In one of our client's databases there are a few snapshot tables that summarize useful information from many other tables (e.g. what the state of each customer was in each period).
The snapshot tables, however, contain mostly foreign keys to their original tables. Therefore, to obtain useful information from a snapshot, we have to join it multiple times to the corresponding tables, and these joins often take very long. Adding indexes to all FK columns in the database (or at least to the columns in the WHERE clauses of our queries), on the other hand, slows down the database significantly.
So my question is, wouldn't it be better to have snapshot tables with real values instead of foreign keys? And if the answer is negative, doesn't it defeat the purpose of snapshot tables if the original tables are updated? (E.g. if an item was called 'Candle' and is now called 'Lamp', our snapshot of course remains consistent, but is it really a snapshot in this case?)

I'd lean towards storing the actual data rather than FK values for the reason you mentioned. That said, a better solution might be to relocate this historical data along with its relevant attributes (i.e., dimensions) and restructure it for analysis. Data warehousing is certainly a solution for this, although these can be very large-scale projects, so you'd need to understand the value and scope it appropriately. However, even a lightweight star schema that targets the specific events they're trying to capture could be a better solution than a large historical table with relationships to transaction-based tables (especially if the query logic against the related tables is complex).
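To make the 'Candle' / 'Lamp' concern concrete, here is a minimal sketch using Python's built-in sqlite3 module as a stand-in for SQL Server. The table and column names (item, snapshot_fk, snapshot_value) are hypothetical and not taken from the client's database; the point is only to show how the two snapshot styles behave after the source row is renamed.

```python
import sqlite3

# Hypothetical schema: "item" is the live table, the two snapshot tables are illustrative.
conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE item (item_id INTEGER PRIMARY KEY, name TEXT NOT NULL);
    -- Snapshot that stores only a foreign key to the live table.
    CREATE TABLE snapshot_fk (period TEXT, item_id INTEGER REFERENCES item(item_id));
    -- Snapshot that copies the value as it was when the snapshot was taken.
    CREATE TABLE snapshot_value (period TEXT, item_id INTEGER, item_name TEXT);

    INSERT INTO item VALUES (1, 'Candle');
    INSERT INTO snapshot_fk VALUES ('2020-01', 1);
    INSERT INTO snapshot_value SELECT '2020-01', item_id, name FROM item;

    -- The live row is renamed after the snapshot was taken.
    UPDATE item SET name = 'Lamp' WHERE item_id = 1;
""")

# Joining back to the live table reports the current name, not the historical one.
name_via_fk = conn.execute(
    "SELECT i.name FROM snapshot_fk s JOIN item i ON i.item_id = s.item_id"
).fetchone()[0]

# The value-based snapshot still holds the name from the snapshot period.
name_via_value = conn.execute(
    "SELECT item_name FROM snapshot_value"
).fetchone()[0]

print(name_via_fk, name_via_value)
```

The FK-based snapshot needs a join and silently follows the rename, while the value-based snapshot answers without a join and keeps the historical name.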

Related

Star and snowflake schema in OLAP systems

I was under the impression that in OLAP, we try to store data in a denormalized fashion to reduce the number of joins and make query processing faster. Normalization, which avoids data redundancy, was more for OLTP systems.
But then again, 2 of the common modelling approaches (star and snowflake schema) are essentially normalized schemas.
Can you help me connect the dots?
Actually, that's very perceptive and the vast majority of people accept it. The truth is that a star is partially denormalized - the dimension tables are highly denormalized; they typically come from joining together a lot of related tables into one. A well designed fact table, however, is normalized - Each record is a bunch of values identified by a single, unique, primary key which is composed of the intersection of a set of foreign keys.
Snowflake schemas are, as you surmised, even more normalized. They effectively take the dimension tables and break them into small values that are all joined together when needed. While there are constant arguments over whether this is better or worse than a star, many folks believe that these are inexpensive joins and, depending on your thinking, may be worth it.
Initially, snowflakes were sold as a way of saving disk space because they do take up less room than dimension tables but disk space is rarely an issue nowadays.
I personally prefer a hybrid approach that allows me to build a few levels of dimension table that ultimately can provide referential integrity to both my atomic level data but also to my aggregate fact tables.
When people use the term "normalised" they normally mean something that is in, or at least close to, 3rd normal form (3NF).
Unless you mean something significantly different by the term normalised, neither Star nor Snowflake schemas are normalised. Why do you think they are normalised?

Why are there "relations" on databases instead of just using SQL's join?

I always see in database articles or tutorials or... just everywhere where they use databases, they use a thing called relations. It comes to my mind instantaneously those little boxes with lists of field names and one field connected to another field in another box with a line.
I'm not an expert on databases (as you can probably tell), but in the little bit I've used them, I never needed relations. They always seemed redundant, as I can always use JOIN to achieve what it seems to me they are made for. Are they redundant, or is there anything you can do with relations that you cannot do with JOIN? Or am I just talking nonsense?
Relations are not just about joins for SQL queries. Relations provide many benefits:
Data integrity
Query convenience
Third party tool integration benefits
"Self-describing" data model to future DBAs/developers working with the database
Etc
Data integrity:
Relations help to ensure that your "order records" can't exist without a "customer record", for example. Simply by defining a relationship between customer and order, the database will ensure that this cannot happen. This helps to make sure that your database doesn't become a big pile of junk data.
Query convenience:
Relations can make it easier to do certain types of queries. Deleting a customer record can automatically have the customer's orders deleted at the same time, thanks to the relationship between customer and order.
Third party tool integration benefits:
Many third party tools (O/R tools come to mind) rely on relations in order to work properly.
Really, the list could go on and on...you should use them, they're very beneficial. Even if you don't perceive the value today, if you're working on a database project that will continue to grow over a long period of time, it would be to your benefit to set relationships up from the beginning.
I think that they're not that critical for small projects/one-off data models...but for anything of substance, you're better off using them.
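As a rough illustration of the data integrity and cascading delete points above, here is a small sketch using Python's built-in sqlite3 module. The customer and orders tables are hypothetical. Note that SQLite only enforces foreign keys after PRAGMA foreign_keys = ON; other DBMSs enforce them by default and use slightly different DDL.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("PRAGMA foreign_keys = ON")  # SQLite enforces FKs only when this is on
conn.executescript("""
    CREATE TABLE customer (customer_id INTEGER PRIMARY KEY, name TEXT);
    CREATE TABLE orders (
        order_id    INTEGER PRIMARY KEY,
        customer_id INTEGER NOT NULL
            REFERENCES customer(customer_id) ON DELETE CASCADE
    );
    INSERT INTO customer VALUES (1, 'Alice');
    INSERT INTO orders VALUES (10, 1), (11, 1);
""")

# Data integrity: an order for a customer that does not exist is rejected.
try:
    conn.execute("INSERT INTO orders VALUES (12, 99)")
    orphan_rejected = False
except sqlite3.IntegrityError:
    orphan_rejected = True

# Query convenience: deleting the customer also deletes that customer's orders.
conn.execute("DELETE FROM customer WHERE customer_id = 1")
remaining_orders = conn.execute("SELECT COUNT(*) FROM orders").fetchone()[0]

print(orphan_rejected, remaining_orders)
```

A JOIN alone would happily run against the orphaned row; it is the declared constraint that stops the row from being stored in the first place.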
A RELATION is a subset of the cartesian product of a set of domains (http://mathworld.wolfram.com/Relation.html). In everyday terms a relation (or more specifically a relation variable) is the data structure that most people refer to as a table (although tables in SQL do not necessarily qualify as relations).
Relations are the basis of the relational database model.
Relationships are something different. A relationship is a semantic "association among things".
I think you are actually asking about referential integrity constraints (foreign keys). A foreign key is a data integrity rule that ensures the database is consistent by preventing inconsistent data from being added to it. Don't confuse foreign keys with relations because they are very different things.
I'm assuming when you are reading about relations it is probably referring to foreign keys. If that's true, relations and joins are not different solutions for the same problem. They are 2 tools that accomplish different things, and they are usually used together.
A join, as it sounds like you know, is part of a SELECT query that lets you get rows from more than one table.
A relation is part of the database structure itself that defines a rule. For example, if you had a city table and a country table, you should have a relation pointing each row in the city table to a row in the country table. This would ensure the integrity of the data and not allow a city row to point to a country row that doesn't exist.
Asking "Why use relations when you can use joins?" sounds to me like asking "Why do variables have types when I could read them anyway?".
The theory behind databases is based on something called Relational Algebra. Relation is not a database specific term, it is derived from Relational Algebra.
JOIN is a kind of relation; there can be different kinds of relations. Refer to this wiki page to know more about what a Relation exactly is.
The relationships established in a RELATIONAL database are the very core of the relational database model. In a database, we model entities. We use relationships between entities to maintain data integrity and to ensure the records are organized properly. In some DBMSs (MySQL's InnoDB, for example), declaring a foreign key also creates an index on the referencing columns; in others, such as SQL Server, you have to add that index yourself.
If you are not using the relationships, and/or modelling your table structure based upon the relationships between discrete entities, then you are not harnessing the true power of your relational database. Yes, you can make queries work, and yes, you can get the Db to do some useful work. But can you ensure that, say, every Employee record is properly RELATED to the proper company? Can you ensure that there is only one record for that company, and that all the employees of that company are related to that record?
Without designing your database structure around entities and the relationships between them, you might as well use a spreadsheet, or one big, flat table. RELATIONSHIPS and NORMALIZATION form the basis of the modern relational database.
An SQL table is an approximation of a relational model relation. Tables/relations (bases, views & query results) represent relations/relationships/associations. These are boxes & diamonds on ER (Entity-Relationship) & pseudo-ER diagrams. Most lines on such diagrams correspond to FK (foreign key) constraints. They are frequently but wrongly called "relations" or "relationships" but they are not. They are facts. An SQL FK says that a table's subrows appear elsewhere where they are a PK (primary key) or UNIQUE. Equivalently, it says that an entity participating in a relation/relationship/association also participates once in another one. Table meanings are necessary & sufficient to query. Constraints--including PKs, UNIQUEs & FKs--are not needed to query. They are consequences of the table relation/relationship/association choices & what situations/states can arise. They are for integrity to be enforced by the DBMS.
When Ed Codd developed the relational model of data for use with large scale databases, he based his design on the mathematics of relational calculus and algebra. The results of this kind of mathematics are predictable with mathematical precision, and Ed Codd was able to forecast with near mathematical precision how relational databases would behave before the first one was ever built.
In mathematics, a relation is a mathematical abstraction. It's a subset of the cartesian product of two or more domains, as another responder said. If that's as clear as mud to you, maybe you're not a mathematician.
No matter. A good computer scientist can understand SQL tables fairly easily, and recognize and exploit the power of an SQL JOIN. This understanding will do in place of a mathematical understanding of relations for many purposes. An SQL table materializes a mathematical relation, approximately. If you are careful with table design, you can turn "approximately" into "exactly".
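For readers who find "a subset of the cartesian product of two or more domains" as clear as mud, here is a tiny sketch in plain Python. The city and country values are made up for illustration; the sets stand in for domains, and the relation is just the set of pairs that are actually true.

```python
from itertools import product

# Two small domains (illustrative values).
cities = {"Paris", "Lyon", "Rome"}
countries = {"France", "Italy"}

# The Cartesian product contains every possible (city, country) pair.
all_pairs = set(product(cities, countries))

# A relation is any subset of that product; here, the pairs that are actually true.
located_in = {("Paris", "France"), ("Lyon", "France"), ("Rome", "Italy")}
is_relation = located_in <= all_pairs

# A relation is a set, so it cannot hold the same tuple twice. An SQL table
# without a key can, which is one reason a table only approximates a relation.
table_rows = [("Paris", "France"), ("Paris", "France")]
distinct_rows = set(table_rows)

print(len(all_pairs), is_relation, len(table_rows), len(distinct_rows))
```

Declaring a primary key or UNIQUE constraint on a table is one way to close that gap, since it rules out duplicate rows.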

database - flattened out vs. normalized

What do they mean when a database is “flattened out” vs. normalized?
"Flattened out" typically refers to a database where you have a single (or few) very large tables.
"Normalized" refers to whether the data has been organized into well structured, related tables. This typically reduces duplication of values across rows in a table by pulling the values into a separate table, and relating to it by ID.
For details, see Database Normalization.
A normalized database is one that is organized to minimize redundancy of data and to produce small and well structured relationships, normally via related tables. An example might be a customer and all his/her orders. In a normalized database, you would have at least two (and probably more) tables. A customer table and an orders table, joined together in some fashion. In a flattened structure, customer and order data might be in a single table.
Reporting databases tend to be denormalized to allow quicker retrieval of data (where many joins might otherwise be required), whereas production or transactional databases (OLTP) tend to be (or should be) more normalized, with foreign keys established between tables.
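The customer and orders example can be sketched as follows, using Python's built-in sqlite3 module. The table names, column names, and rows are invented for illustration. The same information is stored once flattened and once normalized, and a join over the normalized tables reproduces the flattened rows.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    -- Flattened: customer details are repeated on every order row.
    CREATE TABLE orders_flat (
        order_id INTEGER PRIMARY KEY,
        customer_name TEXT, customer_city TEXT, amount REAL
    );
    INSERT INTO orders_flat VALUES
        (1, 'Alice', 'Paris', 10.0),
        (2, 'Alice', 'Paris', 25.0),
        (3, 'Bob',   'Rome',   7.5);

    -- Normalized: each customer is stored once and orders refer to it by ID.
    CREATE TABLE customer (customer_id INTEGER PRIMARY KEY, name TEXT, city TEXT);
    CREATE TABLE orders (
        order_id INTEGER PRIMARY KEY,
        customer_id INTEGER REFERENCES customer(customer_id),
        amount REAL
    );
    INSERT INTO customer VALUES (1, 'Alice', 'Paris'), (2, 'Bob', 'Rome');
    INSERT INTO orders VALUES (1, 1, 10.0), (2, 1, 25.0), (3, 2, 7.5);
""")

flat_rows = conn.execute(
    "SELECT order_id, customer_name, customer_city, amount "
    "FROM orders_flat ORDER BY order_id"
).fetchall()

# The join rebuilds the flattened shape from the normalized tables.
joined_rows = conn.execute("""
    SELECT o.order_id, c.name, c.city, o.amount
    FROM orders o JOIN customer c ON c.customer_id = o.customer_id
    ORDER BY o.order_id
""").fetchall()

print(flat_rows == joined_rows)
```

The flattened table answers the report without a join but stores 'Alice' and 'Paris' twice; the normalized pair stores them once and pays for it with the join.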

Is it bad to use redundant relationships?

Suppose I have the following tables in my database:
Now all my queries depend on the Company table. Is it bad practice to give every other table a (redundant) relationship to the Company table to simplify my SQL queries?
Edit 1: Background is a usage problem with a framework. See Django: limiting model data.
Edit 2: No tuple would change its company.
Edit 3: I don't write the MySQL queries. I use an abstraction layer (Django).
It is bad practice because your redundant data has to be updated independently and therefore redundantly, a process that is fraught with potential for error. (Even automatic cascading has to be assigned and maintained separately.)
By introducing this relation you effectively denormalize your database. Denormalization is sometimes necessary for the sake of performance, but from your question it sounds like you're just simplifying your SQL.
Use other mechanisms to abstract the complexity of your database: views, stored procedures, UDFs.
What you are asking is whether to violate Third Normal Form in your design. Doing so is not something to be done without good reason because by creating redundancy you create the possibility for errors and inconsistencies in your data. Also, "simplifying" the model with redundant data to support some operations is likely to complicate other operations. Also, constraints and other data access logic will likely need to be duplicated unnecessarily.
Is it a bad practice to give every other table a (redundant) relation to the Company table to simplify my sql queries?
Yes, absolutely, as it would mean updating every redundant relation when you update the relations customer to company or section to company -- and if you miss any such update, you now have a database full of redundant data. It's a bad denormalization.
If your point is to just simplify your SQL, consider using views to "bring along" parent data. Here's a view that pulls company_id into contract, by join through customer:
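create view contract_customer as
select
    a.*,           -- every contract column, including contract_id and customer_id
    b.company_id   -- the parent's company, brought along from customer
from
    contract a
    join customer b on (a.customer_id = b.customer_id);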
This join is simple, but why repeat it over and over? Write it once, and then use the view in other queries.
Many (but not all) RDBMSes can even optimize out the join if you don't put any columns from customer in the select list or where clause of the query based on the view, as long as you make contract.customer_id have a foreign key referential integrity constraint on customer.customer_id. (In the absence of such a constraint, the join can't be omitted, because it would then be possible for a contract.customer_id to exist which did not exist in customer. Since you'll never want that, you'll add the foreign key constraint.)
Using the view achieves what you want, without the time overhead of having to update the child tables, without the space overhead of making child rows wider by adding the redundant column (and this really begins to matter when you have many rows, as the wider the row, the fewer rows can fit into memory at once), and most importantly, without the possibility of inconsistent data when the parent is updated but the children are not.
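Here is a small runnable sketch of that idea using Python's built-in sqlite3 module. The company, customer, and contract tables and their rows are assumptions, since the question's table diagram is not reproduced here, and the view in this sketch selects company_id from customer only. It shows that queries can filter on company_id through the view, and that changing the parent row is a single update with nothing to keep in sync on the child rows.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("PRAGMA foreign_keys = ON")  # SQLite enforces FKs only when this is on
conn.executescript("""
    -- Hypothetical tables; only customer carries company_id.
    CREATE TABLE company (company_id INTEGER PRIMARY KEY, name TEXT);
    CREATE TABLE customer (
        customer_id INTEGER PRIMARY KEY,
        company_id  INTEGER NOT NULL REFERENCES company(company_id)
    );
    CREATE TABLE contract (
        contract_id INTEGER PRIMARY KEY,
        customer_id INTEGER NOT NULL REFERENCES customer(customer_id)
    );
    -- The view "brings along" company_id without storing it on contract.
    CREATE VIEW contract_customer AS
        SELECT a.*, b.company_id
        FROM contract a
        JOIN customer b ON a.customer_id = b.customer_id;

    INSERT INTO company VALUES (1, 'Acme'), (2, 'Globex');
    INSERT INTO customer VALUES (100, 1), (101, 2);
    INSERT INTO contract VALUES (1000, 100), (1001, 100), (1002, 101);
""")

def contracts_for(company_id):
    """Return the contract IDs that the view reports for one company."""
    rows = conn.execute(
        "SELECT contract_id FROM contract_customer "
        "WHERE company_id = ? ORDER BY contract_id",
        (company_id,),
    )
    return [row[0] for row in rows]

before = contracts_for(1)

# Re-parent customer 100: one update on the parent, no child rows to touch.
conn.execute("UPDATE customer SET company_id = 2 WHERE customer_id = 100")
after_old = contracts_for(1)
after_new = contracts_for(2)

print(before, after_old, after_new)
```

With a redundant company_id column on contract, the same change would need a second update on every affected contract row, and forgetting it would leave the two columns disagreeing.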
If you really need to simplify things, this is where a View (or multiple views) would come in handy.
Having a column for the company in your employee view would not be poorly normalized providing it is derived from a join on section.
If you mean add a Company column to every table, it's a bad idea. It'll increase the potential for data integrity issues (i.e. it gets changed in one table but not the other 6 where it should).
I'd say not in the OP's case, but sometimes it's useful (just like goto ;).
An anecdote:
I'm working with a database where most tables have a foreign key pointing to a root table for the accounts. The account numbers are external to the database and aren't allowed to be changed once issued. So there is no danger of changing the account numbers and failing to update all references in the DB. I also find that it is also considerably easier to grab data from tables keyed by account number instead of having to do complex and costly joins up the hierarchy to get to the root account table. But in my case, we don't have so much a foreign key as an external (i.e., real world) identifier, so it's not quite the same as the OP's situation and seems suitable for an exception.
That depends on your functional requirements for 'Performance'. Is your application going to handle heavy demand? Simplifying JOINs boosts performance. Besides, hardware is cheap and turn-around time is important.
The deeper you go into the database normal forms, the more space you save, but the heavier the computation becomes.

Is it good practice to have foreign keys in a datawarehouse (relationships)?

I think the question is clear enough. Some of the columns in my datawarehouse table could have a relationship to a primary key. But is it good practice? It is denormalized, so it should never be deleted again (data in datawarehouse). Hope question is somewhat clear enough.
I presume that you refer to FKs in fact tables. During DW loading, indexes and any foreign keys are dropped to speed up the loading -- the ETL process takes care of keys.
A foreign key constraint "activates" during inserts and updates (this is when it needs to check that the key value exists in the parent table) and during deletes of primary keys in parent tables. It plays no part during reads. Deleting records in a DW is (or should be) a controlled process which scans for any existing relationships before deleting from dimension tables.
So, most DWs do not have foreign keys implemented as constraints.
FK constraints work well in Kimball dimensional models on SQL Server.
Typically, your ETL will need to lookup into the dimension table (usually on the business key to handle slowly changing dimensions) to determine dimension surrogate IDs, and the dimension surrogate id is usually an identity, and the PK on the dimension is usually the dimension surrogate id, which is already an index (probably clustered).
Having RI at this point is not a huge overhead on writes, and it can also help catch ETL defects during development. Also, having the PK of the fact table be a combination of all the FKs can help trap potential data modeling problems and double-loading.
It can actually reduce overhead on selects if you like to make general-use flattened views or table-valued functions of your star models. Because the extra inner joins to dimensions are guaranteed to produce one and only one row, the optimizer can use these constraints very effectively to eliminate the need to look up into the table. Without FK constraints, these lookups may have to be done to eliminate facts where the dimension does not exist.
Using FK-constraints in a DW is like wearing a bicycle helmet. If the ETL is designed correctly, you technically don't need them. That said, if I had a million dollars for every time I've seen bug-free ETL, I'd have zero dollars.
Until you're at a point where FK-constraints are causing performance issues, I say leave'em. Cleaning up referential integrity problems can be much harder than adding them from the get-go ;-)
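As a sketch of how constraints can catch ETL defects, here is a toy star schema in SQLite, driven from Python's built-in sqlite3 module. The table names (dim_date, dim_product, fact_sales), the keys, and the try_load helper are all hypothetical, and the DDL differs in detail from SQL Server's. The fact table's primary key is the combination of its foreign keys, so a double load and a fact pointing at a missing dimension row are both rejected.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("PRAGMA foreign_keys = ON")  # SQLite enforces FKs only when this is on
conn.executescript("""
    CREATE TABLE dim_date (date_key INTEGER PRIMARY KEY, calendar_date TEXT);
    CREATE TABLE dim_product (product_key INTEGER PRIMARY KEY, product_name TEXT);
    CREATE TABLE fact_sales (
        date_key    INTEGER NOT NULL REFERENCES dim_date(date_key),
        product_key INTEGER NOT NULL REFERENCES dim_product(product_key),
        quantity    INTEGER NOT NULL,
        PRIMARY KEY (date_key, product_key)  -- PK is the combination of the FKs
    );
    INSERT INTO dim_date VALUES (20240101, '2024-01-01');
    INSERT INTO dim_product VALUES (1, 'Candle');
""")

def try_load(row):
    """Hypothetical ETL step: insert one fact row, return True if it was accepted."""
    try:
        conn.execute("INSERT INTO fact_sales VALUES (?, ?, ?)", row)
        return True
    except sqlite3.IntegrityError:
        return False

results = [
    try_load((20240101, 1, 5)),  # valid fact row
    try_load((20240101, 1, 5)),  # same row loaded twice: blocked by the PK
    try_load((20240101, 2, 3)),  # product_key 2 has no dimension row: blocked by the FK
]
loaded_rows = conn.execute("SELECT COUNT(*) FROM fact_sales").fetchone()[0]

print(results, loaded_rows)
```

Without the constraints, all three inserts would succeed and the defects would only surface later as inflated totals or facts that vanish in inner joins.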
The question is clear, but "good practice" seems the wrong question.
"Could have FK's" ?
Foreign keys are a mechanism to preserve integrity constraints during database modifications.
If your DW is read-only (accumulating data sources without writing back), there is no need for FK's.
If your DW supports writes, integrity constraints typically need to be coordinated across the participating data sources by the ETL (or rather, its Store equivalent). This process may or may not rely on FK's in the database.
So the right question would be: do you need them.
(The only other reason I can think of would be documentation of relationship - however, this can be done on paper / in a separate document, too.)
I have no idea. But nobody is answering, so I googled and found a best practices paper which seems to give the very helpful answer "it depends" :-)
While foreign key constraints help data integrity, they have an associated cost on all insert, update and delete statements. Give careful attention to the use of constraints in your warehouse or ODS when you wish to ensure data integrity and validation.
The reason for using a foreign key constraint in a data warehouse is the same as for any other database: to ensure data integrity.
It is also possible that query performance will benefit because foreign keys permit certain types of query rewrite that are not normally possible without them. Data integrity is still the main reason to use foreign keys however.
Yes, as a best practice, implement the FK constraints on your fact tables. In SQL Server, use NOCHECK. In ORACLE always use RELY DISABLE NOVALIDATE. This allows the warehouse or mart to know about the relationship, but not check it on INSERT, UPDATE, or DELETE operations. Star transformations, optimizations, etc. may not rely on the FK constraints to improve queries like they used to, but one never knows what BI or OLAP tools will be used on the front side of your warehouse or mart. Some of these tools can make use of knowing the relationships are defined. Plus, how many ugly looking warehouses have you seen with little or no external documentation and had to try to reverse engineer them? Defining the FKs always helps with that.
As designers we NEVER seem to make our data warehouses or marts as self-documenting as we should. Defining FKs certainly helps with that. Now, having said this, if star schemas are properly designed without FKs being defined, it is easy to read and understand them anyway.
And for ORACLE fact tables, always define a LOCAL BITMAP index on every FK to a dimension. Just do it. The indexing is actually more important than the FK being defined.
There is a very good reason to create FK constraints in even read-only DW/DM.
Yes, they are not really required from the point of view of the read-only DW itself, if your ETL is bullet-proof, etc., etc. But guess what: life doesn't stop at loading data into the DW. Most BI analytical/reporting tools use information about your DW relationships to automatically build their model (for example, the SSAS Tabular model).
In my humble opinion this alone outweighs the little overhead on dropping and recreating FK constraints during ETL process.

Resources