Is this example a violation of star schema? - database

I'm building a simple star schema in a data warehouse with two dimensions based on business entities: dim_loan and dim_borrower. There are also some fact tables, such as fact_loan_status, which has one row per month for each loan showing the balance at that time and has an FK back to dim_loan.
So here's my question: if dim_loan has a FK for borrower_id back to dim_borrower, does that violate star schema? Nearly all discussion of the star schema revolves around individual dim tables that only have FK relations with fact tables, not fellow dims. Making a fact_loan_borrower doesn't make sense to me for this simple one-to-one relationship.
Any advice would be welcomed!

If dim_borrower and dim_loan have the same cardinality, then keeping both IDs (loan_id, borrower_id) in fact_loan_borrower would improve performance: you need only one join to bring in borrower or loan information from the respective dimension. If you keep borrower_id as an FK in dim_loan, you need two joins whenever you want borrower information.
If the two dimensions have different cardinalities, it is wise to attach the dimension with the lower cardinality to the fact table; that helps keep the fact table small.
The choice between a star and a snowflake schema is entirely up to you.
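To make the one-join versus two-join trade-off concrete, here is a minimal sketch using Python's built-in sqlite3 module. The table names dim_loan, dim_borrower and fact_loan_status come from the question; the columns (borrower_name, loan_type, month, balance) and the sample rows are assumptions made up for illustration. For the demo, the fact table carries borrower_id directly (the star variant) while dim_loan also keeps its own borrower_id (the snowflake variant), so both query shapes can be compared on the same data.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE dim_borrower (
    borrower_id   INTEGER PRIMARY KEY,
    borrower_name TEXT
);
-- Snowflake variant: dim_loan points at dim_borrower.
CREATE TABLE dim_loan (
    loan_id     INTEGER PRIMARY KEY,
    loan_type   TEXT,
    borrower_id INTEGER REFERENCES dim_borrower(borrower_id)
);
-- Star variant: the fact table carries both dimension keys.
CREATE TABLE fact_loan_status (
    loan_id     INTEGER REFERENCES dim_loan(loan_id),
    borrower_id INTEGER REFERENCES dim_borrower(borrower_id),
    month       TEXT,
    balance     REAL
);
INSERT INTO dim_borrower VALUES (1, 'Alice');
INSERT INTO dim_loan VALUES (10, 'mortgage', 1);
INSERT INTO fact_loan_status VALUES
    (10, 1, '2024-01', 1000.0),
    (10, 1, '2024-02', 900.0);
""")

# Snowflake path: fact -> dim_loan -> dim_borrower (two joins).
snowflake_rows = conn.execute("""
    SELECT b.borrower_name, f.month, f.balance
    FROM fact_loan_status f
    JOIN dim_loan l     ON l.loan_id = f.loan_id
    JOIN dim_borrower b ON b.borrower_id = l.borrower_id
    ORDER BY f.month
""").fetchall()

# Star path: fact -> dim_borrower (one join).
star_rows = conn.execute("""
    SELECT b.borrower_name, f.month, f.balance
    FROM fact_loan_status f
    JOIN dim_borrower b ON b.borrower_id = f.borrower_id
    ORDER BY f.month
""").fetchall()

print(snowflake_rows == star_rows)
```

Both queries return the same rows; the difference is only in how many joins the engine has to perform and in where the borrower key is stored.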

Related

What is the number of columns that make table really big?

I have two tables in my database, one for logins and a second for user details (the database has more than these two tables). The logins table has 12 columns (Id, Email, Password, PhoneNumber ...) and the user details table has 23 columns (Job, City, Gender, ContactInfo ...). The two tables have a one-to-one relationship.
I am thinking of creating one table that contains the columns of both tables, but I am not sure, because this may make the table too big.
So this leads to my question: how many columns make a table big? Is there a certain or approximate number at which we should stop adding columns to a table and create another one? Or is it up to the programmer to decide?
The number of columns isn't realistically a problem. Any performance issues you seem to be worried about can be attributed to the size of the data in the table, i.e. if the table has billions of rows, or if one of the columns contains 200 MB of XML data in each row, etc.
Normally, the only issue arising from a multitude of columns is how it pertains to indexing, as it can get troublesome trying to create 100 different indexes covering each variation of each query.
Point here is, we can't really give you any advice since just the number of tables and columns and relations isn't enough information to go on. It could be perfectly fine, or not. The nature of the data, and how you account for that data with proper normalization, indexing and statistics, is what really matters.
The constraint that makes us stop adding columns to an existing table in SQL is the maximum number of columns that the database engine supports for a single table. For SQL Server, that is 1,024 columns for a non-wide table, or 30,000 columns for a wide table.
35 columns is not a particularly large number of columns for a table.
There are a number of reasons why decomposing a table (splitting up by columns) might be advisable. One of the first reasons a beginner should learn is data normalization. Data normalization is not directly concerned with performance, although a normalized database will sometimes outperform a poorly built one, especially under load.
The first three steps in normalization result in 1st, 2nd, and 3rd normal forms. These forms have to do with the relationship that non-key values have to the key. A simple summary is that a table in 3rd normal form is one where all the non-key values are determined by the key, the whole key, and nothing but the key.
There is a whole body of literature out there that will teach you how to normalize, what the benefits of normalization are, and what the drawbacks sometimes are. Once you become proficient in normalization, you may wish to learn when to depart from the normalization rules, and follow a design pattern like Star Schema, which results in a well structured, but not normalized design.
Some people treat normalization like a religion, but that's overselling the idea. It's definitely a good thing to learn, but it's only a set of guidelines that can often (but not always) lead you in the direction of a satisfactory design.
A normalized database tends to outperform a non normalized one at update time, but a denormalized database can be built that is extraordinarily speedy for certain kinds of retrieval.
And, of course, all this depends on how many databases you are going to build, and their size and scope.
I take it that the login table contains data that is only used when the user logs into your system. For all other purposes, the details table is used.
Separating these sets of data into separate tables is not a bad idea and could work perfectly well for your application. However, another option is having the data in one table and separating them using covering indexes.
One aspect of an index no one seems to consider is that an index can be thought of as a sub-table within a table. When a SQL statement accesses only the fields within an index, the I/O required to perform the operation can be limited to the index rather than the entire row. So creating a "login" index and a "details" index would achieve the same benefits as separate tables, with the added benefit that any operations that do need all the data would not have to join two tables.
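As a rough illustration of the covering-index idea, the sketch below uses Python's sqlite3 module and asks SQLite for its query plan. The merged users table, its column names, and the index name ix_users_login are hypothetical; other engines report covering-index use differently, so treat this as an example under those assumptions rather than a general recipe.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
-- Hypothetical merged table holding both login and detail columns.
CREATE TABLE users (
    id            INTEGER PRIMARY KEY,
    email         TEXT,
    password_hash TEXT,
    job           TEXT,
    city          TEXT,
    gender        TEXT
);
-- "Login" index: contains every column the login query touches.
CREATE INDEX ix_users_login ON users (email, password_hash);
INSERT INTO users VALUES (1, 'a@example.com', 'hash1', 'engineer', 'Oslo', 'F');
""")

# Ask SQLite how it would run a login lookup that only needs indexed columns.
plan = conn.execute(
    "EXPLAIN QUERY PLAN SELECT password_hash FROM users WHERE email = ?",
    ("a@example.com",),
).fetchall()

# The last column of each plan row is the human-readable detail text.
plan_text = " ".join(row[-1] for row in plan)
print(plan_text)
```

When the plan mentions a covering index, the lookup can be answered from the index alone without reading the wider table row, which is the effect the answer describes.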

Difference between Fact table and Dimension table?

When reading a book on business objects, I came across the terms fact table and dimension table.
I am trying to understand the difference between a dimension table and a fact table.
I read a couple of articles on the internet, but I was not able to understand it clearly.
Would any simple example help me understand better?
In data warehouse modeling, a star schema and a snowflake schema consist of fact and dimension tables.
Fact Table:
It contains the primary keys of all the dimensions, along with the associated facts or measures (properties on which calculations can be made), such as quantity sold, amount sold and average sales.
Dimension Tables:
Dimension tables provide descriptive information for all the measurements recorded in the fact table.
Dimensions are relatively small compared to the fact table.
Commonly used dimensions are people, products, place and time.
This appears to be a very simple answer on how to differentiate between fact and dimension tables!
It may help to think of dimensions as things or objects. A thing such
as a product can exist without ever being involved in a business
event. A dimension is your noun. It is something that can exist
independent of a business event, such as a sale. Products, employees,
equipment, are all things that exist. A dimension either does
something, or has something done to it.
Employees sell, customers buy. Employees and customers are examples of
dimensions, they do.
Products are sold, they are also dimensions as they have something
done to them.
Facts, are the verb. An entry in a fact table marks a discrete event
that happens to something from the dimension table. A product sale
would be recorded in a fact table. The event of the sale would be
noted by what product was sold, which employee sold it, and which
customer bought it. Product, Employee, and Customer are all dimensions
that describe the event, the sale.
In addition fact tables also typically have some kind of quantitative
data. The quantity sold, the price per item, total price, and so on.
Source:
http://arcanecode.com/2007/07/23/dimensions-versus-facts-in-data-warehousing/
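The noun/verb description above can be sketched with a tiny schema in Python's sqlite3 module. All table names, column names and rows here (dim_product, dim_employee, dim_customer, fact_sale and their contents) are invented for illustration. Each fact row records one sale event and points at the dimensions that describe it, plus the quantitative data.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
-- Dimensions: the "nouns". They exist whether or not a sale ever happens.
CREATE TABLE dim_product  (product_id  INTEGER PRIMARY KEY, product_name  TEXT);
CREATE TABLE dim_employee (employee_id INTEGER PRIMARY KEY, employee_name TEXT);
CREATE TABLE dim_customer (customer_id INTEGER PRIMARY KEY, customer_name TEXT);

-- Fact: the "verb". One row per sale event, with keys and measures.
CREATE TABLE fact_sale (
    product_id  INTEGER REFERENCES dim_product(product_id),
    employee_id INTEGER REFERENCES dim_employee(employee_id),
    customer_id INTEGER REFERENCES dim_customer(customer_id),
    quantity    INTEGER,
    unit_price  REAL
);

INSERT INTO dim_product  VALUES (1, 'widget'), (2, 'gadget'), (3, 'gizmo');
INSERT INTO dim_employee VALUES (1, 'Erin');
INSERT INTO dim_customer VALUES (1, 'Carl'), (2, 'Dana');
INSERT INTO fact_sale VALUES
    (1, 1, 1, 2, 5.0),
    (1, 1, 2, 1, 5.0),
    (2, 1, 1, 3, 10.0);
""")

# Measures are aggregated; the dimension supplies the grouping label.
totals = conn.execute("""
    SELECT p.product_name,
           SUM(f.quantity)                AS units,
           SUM(f.quantity * f.unit_price) AS revenue
    FROM fact_sale f
    JOIN dim_product p ON p.product_id = f.product_id
    GROUP BY p.product_name
    ORDER BY p.product_name
""").fetchall()

# A product can exist in the dimension without any sale in the fact table.
unsold = conn.execute("""
    SELECT product_name FROM dim_product
    WHERE product_id NOT IN (SELECT product_id FROM fact_sale)
""").fetchall()

print(totals, unsold)
```

The 'gizmo' row shows the point made in the quote: a dimension member can exist independently of any business event.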
This is to answer the part:
I was trying to understand whether dimension tables can be fact table
as well or not?
The short answer (IMO) is no. That is because the two types of tables are created for different reasons. However, from a database design perspective, a dimension table could have a parent table, just as a fact table always has one or more dimension tables as parents. Also, fact tables may be aggregated, whereas dimension tables are not. Another reason is that fact tables are not supposed to be updated in place, whereas dimension tables could be updated in place in some cases.
More details:
Fact and dimension tables appear in what is commonly known as a star schema. A primary purpose of a star schema is to simplify a complex normalized set of tables and consolidate data (possibly from different systems) into one database structure that can be queried very efficiently.
In its simplest form, it contains a fact table (example: StoreSales) and one or more dimension tables (examples: Geography, Item, Supplier, Customer, Time, etc.). Each dimension entry has 0, 1 or more fact rows associated with it. It would also be valid for a dimension to have a parent, in which case the model is of type "snowflake". However, designers try to avoid this kind of design since it requires more joins, which slows performance. In the StoreSales example, the Geography dimension could be composed of the columns (GeoID, ContinentName, CountryName, StateProvName, CityName, StartDate, EndDate).
In a snowflake model, you could have two normalized tables for the geography information, namely a Continent table and a Country table.
You can find plenty of examples of star schemas. Also look at the Inmon vs. Kimball debate for an alternative view on the star schema model. Kimball also has a good forum you may want to check out.
Edit: To answer comment about examples for 4NF:
Example for a fact table violating 4NF:
Sales Fact (ID, BranchID, SalesPersonID, ItemID, Amount, TimeID)
Example for a fact table not violating 4NF:
AggregatedSales (BranchID, TotalAmount)
Here the relation is in 4NF
The last example is rather uncommon.
Super simple explanation:
Fact table: a data table that maps lookup IDs together. Is usually one of the main tables central to your application.
Dimension table: a lookup table used to store values (such as city names or states) that are repeated frequently in the fact table.
Dimension table
A dimension table is a table which contains the attributes of the measurements stored in fact tables. It consists of hierarchies, categories and logic that can be used to traverse between nodes.
Fact table contains the measurement of business processes, and it contains foreign keys for the dimension tables.
Example – If the business process is manufacturing of bricks
Average number of bricks produced by one person/machine – measure of the business process
a Fact = an action: a sale, a transaction, an access
a Dimension = an object: a seller, a customer, a date, a price
Then...
Facts reference dimensions for: when, where, what, who, how
The really interesting thing is deciding whether an attribute should be a dimension or a fact. For example, the price of each item in an order, or the maximum insured amount recorded in a contract. There is no generally correct way to approach these, only ones that make sense in the context.
PS: If I were to name these things, I would prefer Log table and Object table.
In the simplest form, I think a dimension table is something like a 'Master' table - that keeps a list of all 'items', so to say.
A fact table is a transaction table which describes all the transactions. In addition, aggregated (grouped) data like total sales by sales person, total sales by branch - such kinds of tables also might exist as independent fact tables.
From my point of view,
Dimension table : Master Data
Fact table : Transactional Data
The fact table mainly consists of business facts and foreign keys that refer to primary keys in the dimension tables. A dimension table consists mainly of descriptive attributes that are textual fields.
A dimension table contains a surrogate key, a natural key, and a set of attributes. In contrast, a fact table contains foreign keys, measurements, and degenerate dimensions.
Dimension tables provide descriptive or contextual information for the measurement of a fact table. On the other hand, fact tables provide the measurements of an enterprise.
When comparing the size of the two tables, a fact table is bigger than a dimension table. A schema typically contains more dimension tables than fact tables, and a fact table holds relatively few facts.
The dimension table has to be loaded first. While loading the fact tables, one has to look up the dimension tables. This is because the fact table has measures, facts, and foreign keys that are the primary keys in the dimension tables.
Read more: Dimension Table and Fact Table | Difference Between | Dimension Table vs Fact Table http://www.differencebetween.net/technology/hardware-technology/dimension-table-and-fact-table/
For relational database users, a dimension is equivalent to a master table.
A fact is equivalent to a transaction table.
Dimension table: a table that maintains information about the characterizing (descriptive) data.
Example: Time Dimension, Product Dimension.
Fact table: a table that maintains information about the metrics or precalculated data.
Example: Sales Fact, Order Fact.
Star schema: one fact table linked with dimension tables forms a star schema.

What is the purpose of data modeling cardinality?

I understand what cardinality is, so please don't explain that ;-)
I would like to know what the purpose of specifying cardinality is in data modeling, and why I should care.
Example: In an ER model you draw relations and add the cardinality to the relations.
When am I going to use the cardinality further in the development process? Why should I care about the cardinality?
How, when and where do I use the cardinalities after I finish an ER model, for example?
Thanks :-)
Cardinalities tell you something important about table design. A 1:m relationship requires a foreign key column in the child table pointing back to the parent primary key column. A many-to-many relationship means a JOIN table with foreign keys pointing back to the two participants.
How, when and where do I use the cardinalities after I finish an ER model, for example?
When physically creating the database, the direction, NULL-ability and number of FKs depends on the cardinalities on both endpoints of the relationship in the ER diagram. It may even "add" or "remove" some tables and keys.
For example:
A "1:N" relationship is represented as a NOT NULL FK from the "N" table to the "1" table. You cannot do it in the opposite direction and retain the same meaning.
A "0..1:N" relationship is represented as a NULL-able FK from the "N" table to the "0..1" table.
A "1:1" relationship is represented by two NOT NULL FKs (that are also keys) forming a circular reference [1], or by merging the two entities into a single physical table.
A "0..1:1" relationship is represented by two FKs, one of which is NULL-able (also under keys).
A "0..1:0..1" relationship is represented by two FKs, both NULL-able and under keys, or by a junction table with specially crafted keys.
An "M:N" relationship requires an additional (so-called "junction" or "link") table. A key of that table is a combination of the keys migrated from the two related tables.
Not all cardinalities can be (easily) represented declaratively in the physical database, but fortunately those that can tend to be the most useful...
[1] This presents a chicken-and-egg problem when inserting new data, which is typically resolved by deferring constraint checking to the end of the transaction.
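A few of the mappings above can be sketched with Python's sqlite3 module. The department, employee, project and employee_project tables are hypothetical examples, not taken from the answer. Note that SQLite only enforces foreign keys after PRAGMA foreign_keys = ON; other engines enforce them by default.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("PRAGMA foreign_keys = ON")  # SQLite enforces FKs only when enabled
conn.executescript("""
CREATE TABLE department (
    dept_id INTEGER PRIMARY KEY,
    name    TEXT NOT NULL
);
CREATE TABLE employee (
    emp_id    INTEGER PRIMARY KEY,
    -- 1:N  -> NOT NULL FK on the "N" side
    dept_id   INTEGER NOT NULL REFERENCES department(dept_id),
    -- 0..1:N -> NULL-able FK on the "N" side
    mentor_id INTEGER REFERENCES employee(emp_id)
);
CREATE TABLE project (
    project_id INTEGER PRIMARY KEY,
    title      TEXT NOT NULL
);
-- M:N -> junction table whose key combines both migrated keys
CREATE TABLE employee_project (
    emp_id     INTEGER NOT NULL REFERENCES employee(emp_id),
    project_id INTEGER NOT NULL REFERENCES project(project_id),
    PRIMARY KEY (emp_id, project_id)
);
INSERT INTO department VALUES (1, 'R&D');
INSERT INTO employee VALUES (100, 1, NULL);
""")


def try_insert(sql):
    """Return True if the statement succeeds, False on a constraint violation."""
    try:
        conn.execute(sql)
        return True
    except sqlite3.IntegrityError:
        return False


# Violates NOT NULL: every employee must belong to a department (1:N).
missing_dept = try_insert("INSERT INTO employee VALUES (101, NULL, NULL)")
# Violates the FK: department 99 does not exist.
unknown_dept = try_insert("INSERT INTO employee VALUES (102, 99, NULL)")
# Allowed: a mentor is optional (0..1:N).
no_mentor = try_insert("INSERT INTO employee VALUES (103, 1, NULL)")

print(missing_dept, unknown_dept, no_mentor)
```

The cardinalities on the diagram decide which of these inserts the database accepts, which is where the ER model pays off after it is finished.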
Cardinality is a vital piece of information about a relation between two entities. You need it for later models, when the actual table architecture is being modelled. Without knowing the relationship cardinality, one cannot model the tables and the key constraints between them.
For example, a car must have exactly 4 wheels and those wheels must be attached to exactly one car. Without cardinality, you could have a car with 3, 1, 0, 12, etc... wheels, which moreover could be shared among other cars. Of course, depending on the context, this can make sense, but it usually doesn't.
A data model is a set of constraints; without constraints, anything would be possible. Cardinality is a (special kind of) constraint. In most cultures, a marriage is a relation between exactly two persons. (In some cultures these persons must have different gender.)
The problem with data modelling is that you have to specify the constraints you wish to impose on the data. Some constraints (unique, foreign key) are more important, and less dependent on the problem domain, than others ("salary < 100000"). In most cases cardinality will be somewhere in between crucial and bogus.
Suppose you are creating the data layer of an application and you decide to use an ORM, such as Entity Framework.
There's a point when you need to create your models and your model maps. At that point you would pull out your ERD, review the cardinality you put on your diagram, and create the correct relationships so that your data layer shape matches your database shape.

Database - fact table and dimension table

When reading a book for business objects, I came across the term- fact table and dimension table. Is this the standard thing for all the database that they all have fact table and dimension table or is it just for business object design? I am looking for an explanation which differentiates between two and how they are related.
Edited:
Why can't a query just get the required data from the fact table? What happens if all the information is stored in one fact table alone? What advantages do we get by creating separate fact and dimension tables and joining them?
Sorry for too many questions at a time but I would like to know about the inter-relations and whys.
Dimension and Fact are key terms in OLAP database design.
A fact table contains data that can be aggregated.
Measures are aggregated data expressions (e.g. sum of costs, count of calls, ...).
A dimension contains data that is used to generate groups and filters.
A fact table without dimension data is useless. For example, "the sum of orders is 1M" is not information, but "the sum of orders from 2005 to 2009 is 1M" is.
There are a lot of BI tools (e.g. Microsoft SSAS, Tableau Software) and languages (e.g. MDX) that work with these concepts.
Sometimes it is not easy to know whether a piece of data is a measure or a dimension. For example, if we are analyzing revenue, both of these scenarios are possible:
3 measures: net profit, overheads, interest
1 measure: profit and 1 dimension: profit type (with 3 elements: net, overhead, interest)
The BI analyst determines which design is best for each solution.
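The two scenarios can be laid out side by side in a small sketch using Python's sqlite3 module. The table names (fact_revenue_wide, fact_revenue_long, dim_profit_type), the year column and the amounts are all made up for illustration; only the measure and dimension names follow the answer.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
-- Scenario 1: three measures stored as three columns.
CREATE TABLE fact_revenue_wide (
    year       INTEGER,
    net_profit REAL,
    overheads  REAL,
    interest   REAL
);
-- Scenario 2: one measure plus a "profit type" dimension.
CREATE TABLE dim_profit_type (
    profit_type_id INTEGER PRIMARY KEY,
    name           TEXT
);
CREATE TABLE fact_revenue_long (
    year           INTEGER,
    profit_type_id INTEGER REFERENCES dim_profit_type(profit_type_id),
    amount         REAL
);
INSERT INTO fact_revenue_wide VALUES (2009, 70.0, 20.0, 10.0);
INSERT INTO dim_profit_type VALUES (1, 'net'), (2, 'overhead'), (3, 'interest');
INSERT INTO fact_revenue_long VALUES (2009, 1, 70.0), (2009, 2, 20.0), (2009, 3, 10.0);
""")

# Scenario 1: the total is an expression over fixed columns.
wide_total = conn.execute(
    "SELECT net_profit + overheads + interest FROM fact_revenue_wide WHERE year = 2009"
).fetchone()[0]

# Scenario 2: the total is an aggregate; the breakdown is a GROUP BY on the dimension.
long_total = conn.execute(
    "SELECT SUM(amount) FROM fact_revenue_long WHERE year = 2009"
).fetchone()[0]
by_type = dict(conn.execute("""
    SELECT t.name, SUM(f.amount)
    FROM fact_revenue_long f
    JOIN dim_profit_type t ON t.profit_type_id = f.profit_type_id
    GROUP BY t.name
""").fetchall())

print(wide_total, long_total, by_type)
```

Both layouts hold the same information. The second lets a new profit type be added as a row rather than a column, at the cost of a join when the label is needed.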
EDITED due to the question also being edited:
An OLAP solution usually has a semantic layer. This layer tells the OLAP tool which elements are fact data, which elements are dimension data, and how the tables are related. Unlike in OLTP systems, an OLAP database is not required to be properly normalized. For this reason, you can take dimension data from several tables, including fact tables. A dimension that takes its data from a fact table is called a fact dimension or degenerate dimension.
There are a lot of concepts that you should keep in mind when designing OLAP databases: "STAR Schema", "SNOWFLAKE Schema", "Surrogate keys", "parent-child hierarchies", ...
It is standard in a data warehouse to have fact tables and dimension tables. A fact table contains the data that you are measuring, for instance what you are summing. A dimension table is a table containing data that you don't want to constantly repeat in the fact table, for example product data, statuses, customers etc. They are related by keys: in a star schema, each row in the fact table contains the key of a row in the dimension table.

Why are there "relations" on databases instead of just using SQL's join?

I always see in database articles or tutorials or... just everywhere where they use databases, they use a thing called relations. It comes to my mind instantaneously those little boxes with lists of field names and one field connected to another field in another box with a line.
I'm not an expert on databases (as you can probably tell), but in the little bit I've used them, I never needed relations. They always seemed redundant, as I can always use JOIN to achieve what it seemed to me they are made for. Are they redundant, or is there anything you can do with relations that you cannot do with JOIN? Or am I just talking nonsense?
Relations are not just about joins for SQL queries. Relations provide many benefits:
Data integrity
Query convenience
Third party tool integration benefits
"Self-describing" data model to future DBAs/developers working with the database
Etc
Data integrity:
Relations help to ensure that your "order records" can't exist without a "customer record", for example. Simply by defining a relationship between customer and order, the database will ensure that this cannot happen. This helps to make sure that your database doesn't become a big pile of junk data.
Query convenience:
Relations can make certain types of queries easier. Deleting a customer record can automatically delete the customer's orders at the same time, thanks to the relationship between customer and order.
Third party tool integration benefits:
Many third party tools (O/R tools come to mind) rely on relations in order to work properly
Really, the list could go on and on...you should use them, they're very beneficial. Even if you don't perceive the value today, if you're working on a database project that will continue to grow over a long period of time, it would be to your benefit to set relationships up from the beginning.
I think that they're not that critical for small projects/one-off data models...but for anything of substance, you're better off using them.
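The data integrity and query convenience points can be demonstrated with a short sketch in Python's sqlite3 module. The customer and orders tables and their rows are hypothetical. SQLite needs PRAGMA foreign_keys = ON for the constraint to be enforced, and the automatic delete relies on declaring the key with ON DELETE CASCADE, which is a choice made here for the example rather than the default behaviour.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("PRAGMA foreign_keys = ON")  # SQLite enforces FKs only when enabled
conn.executescript("""
CREATE TABLE customer (
    customer_id INTEGER PRIMARY KEY,
    name        TEXT NOT NULL
);
CREATE TABLE orders (
    order_id    INTEGER PRIMARY KEY,
    customer_id INTEGER NOT NULL
        REFERENCES customer(customer_id) ON DELETE CASCADE
);
INSERT INTO customer VALUES (1, 'Alice');
INSERT INTO orders VALUES (10, 1), (11, 1);
""")

# Data integrity: an order for a non-existent customer is rejected.
try:
    conn.execute("INSERT INTO orders VALUES (12, 999)")
    orphan_rejected = False
except sqlite3.IntegrityError:
    orphan_rejected = True

# Query convenience: deleting the customer also removes that customer's orders.
conn.execute("DELETE FROM customer WHERE customer_id = 1")
remaining_orders = conn.execute("SELECT COUNT(*) FROM orders").fetchone()[0]

print(orphan_rejected, remaining_orders)
```

A JOIN alone would happily run against orphaned rows; it is the declared constraint that stops them from being created in the first place.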
A RELATION is a subset of the cartesian product of a set of domains (http://mathworld.wolfram.com/Relation.html). In everyday terms a relation (or more specifically a relation variable) is the data structure that most people refer to as a table (although tables in SQL do not necessarily qualify as relations).
Relations are the basis of the relational database model.
Relationships are something different. A relationship is a semantic "association among things".
I think you are actually asking about referential integrity constraints (foreign keys). A foreign key is a data integrity rule that ensures the database is consistent by preventing inconsistent data from being added to it. Don't confuse foreign keys with relations because they are very different things.
I'm assuming that when you read about relations, the text is probably referring to foreign keys. If that's true, relations and joins are not different solutions to the same problem. They are two tools that accomplish different things, and they are usually used together.
A join, as it sounds like you know, is the part of a select query that lets you get rows from more than one table.
A relation is part of the database structure itself and defines a rule. For example, if you had a city table and a country table, you should have a relation pointing each row in the city table to a row in the country table. This would ensure the integrity of the data and not allow a city row to point to a country row that doesn't exist.
Asking "Why use relations when you can use joins?" sounds to me like asking "Why do variables have types when I could read them anyway?".
The theory behind databases is based on something called Relational Algebra. Relation is not a database specific term, it is derived from Relational Algebra.
A JOIN is a kind of relation; there can be different kinds of relations. Refer to this wiki page to learn more about what exactly a relation is.
The relationships established in a RELATIONAL database are the very core of the relational database model. In a database, we model entities. We use relationships between entities to maintain data integrity and ensure the records are organized properly. In some database engines, defining a relationship also creates an index on the related columns.
If you are not using the relationships, and/or modelling your table structure based upon the relationships between discrete entities, then you are not harnessing the true power of your relational database. Yes, you can make queries work, and yes, you can get the DB to do some useful work. But can you ensure that, say, every Employee record is properly RELATED to the proper company? Can you ensure that there is only one record for that company, and that all the employees of that company are related to that record?
Without designing your database structure around entities and the relationships between them, you might as well use a spreadsheet, or one big, flat table. RELATIONSHIPS and NORMALIZATION form the basis of the modern relational database.
An SQL table is an approximation of a relational model relation. Tables/relations (bases, views & query results) represent relations/relationships/associations. These are boxes & diamonds on ER (Entity-Relationship) & pseudo-ER diagrams. Most lines on such diagrams correspond to FK (foreign key) constraints. They are frequently but wrongly called "relations" or "relationships" but they are not. They are facts. An SQL FK says that a table's subrows appear elsewhere where they are a PK (primary key) or UNIQUE. Equivalently, it says that an entity participating in a relation/relationship/association also participates once in another one. Table meanings are necessary & sufficient to query. Constraints--including PKs, UNIQUEs & FKs--are not needed to query. They are consequences of the table relation/relationship/association choices & what situations/states can arise. They are for integrity to be enforced by the DBMS.
When Ed Codd developed the relational model of data for use with large scale databases, he based his design on the mathematics of relational calculus and algebra. The results of this kind of mathematics are predictable with mathematical precision, and Ed Codd was able to forecast with near mathematical precision how relational databases would behave before the first one was ever built.
In mathematics, a relation is a mathematical abstraction. It's a subset of the cartesian product of two or more domains, as another responder said. If that's as clear as mud to you, maybe you're not a mathematician.
No matter. A good computer scientist can understand SQL tables fairly easily, and recognize and exploit the power of an SQL JOIN. This understanding will do in place of a mathematical understanding of relations for many purposes. An SQL table materializes a mathematical relation, approximately. If you are careful with table design, you can turn "approximately" into "exactly".

Resources