Designing a star / snowflake schema database [closed] - database

Closed. This question is opinion-based. It is not currently accepting answers.
Closed 2 years ago.
I have to design and build a star / snowflake schema database that will keep data about employees in a company - especially the rates that are paid to the employees. This is the first time I am experimenting with this schema type and I'm not sure which parts of the fact tables should be separate dimension tables.
I don't exactly understand the practical upsides of having this schema, is it actually that much easier to perform queries on this type of database? Or is it only about the performance?
Below I am attaching the draft schema of my database. I would like to know what I should modify to make it the best possible version for this database. I also have two specific questions:
1. Should the rate column be just a value in the fact table? Or should it be a foreign key to a dim_rate table?
2. What about date dimensions? Should they just be values in specific tables? Or should they always be foreign keys? If they should be foreign keys, should there be one dim_date table or a table for each type of date?
As an example for question 2, let's take the dim_employee table and its employment_date and end_of_employment columns. I have these dates as values in the dim_employee table, but I can think of two other ways to handle this data: either foreign keys to a dim_date table, or separate fact tables fact_start_of_employment and fact_end_of_deployment. I know I will need different kinds of reports, for example reports showing how many people started work and left the company in different date intervals (e.g. in December 2020). Honestly, at this point I have no idea which option would be best and easiest to work with in the future.
Also as I said - I would love any constructive criticism of this schema, even if it means completely redesigning it.

I would merge both fact tables because I think there is a strong relation between rate and position. But that's how I look at this data without knowing all the details.
I would also create a date dimension and a form_of_employment dimension.
That would result in 4 dimensions:
dim_employee
dim_date
dim_position
dim_form_of_employment
And a single fact table with these columns:
fact_assignment
employee_id
date_id
position_id
form_of_employment_id
rate
student
This setup results in a proper star and very simple SQL for your reports.
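To make the layout above concrete, here is a minimal sketch using Python's built-in sqlite3 module. The key columns follow the lists above; every other column (name, title, year, month, day), the date_id format, and all sample rows are assumptions made up for illustration, not part of the original design.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
-- Dimension tables; non-key columns are hypothetical.
CREATE TABLE dim_employee (employee_id INTEGER PRIMARY KEY, name TEXT);
CREATE TABLE dim_date (date_id INTEGER PRIMARY KEY, year INTEGER, month INTEGER, day INTEGER);
CREATE TABLE dim_position (position_id INTEGER PRIMARY KEY, title TEXT);
CREATE TABLE dim_form_of_employment (form_of_employment_id INTEGER PRIMARY KEY, name TEXT);

-- Single fact table: one row per employee, date, position and form of employment.
CREATE TABLE fact_assignment (
    employee_id INTEGER REFERENCES dim_employee(employee_id),
    date_id INTEGER REFERENCES dim_date(date_id),
    position_id INTEGER REFERENCES dim_position(position_id),
    form_of_employment_id INTEGER REFERENCES dim_form_of_employment(form_of_employment_id),
    rate REAL,
    student INTEGER
);

-- Made-up sample data.
INSERT INTO dim_employee VALUES (1, 'Ann'), (2, 'Bob');
INSERT INTO dim_date VALUES (20201201, 2020, 12, 1), (20210101, 2021, 1, 1);
INSERT INTO dim_position VALUES (1, 'Developer'), (2, 'Analyst');
INSERT INTO dim_form_of_employment VALUES (1, 'Full-time'), (2, 'Contract');
INSERT INTO fact_assignment VALUES
    (1, 20201201, 1, 1, 50.0, 0),
    (2, 20201201, 2, 2, 30.0, 1),
    (1, 20210101, 1, 1, 55.0, 0);
""")

# Typical report: average rate per position for December 2020.
avg_rate_by_position = conn.execute("""
    SELECT p.title, AVG(f.rate)
    FROM fact_assignment AS f
    JOIN dim_position AS p ON p.position_id = f.position_id
    JOIN dim_date AS d ON d.date_id = f.date_id
    WHERE d.year = 2020 AND d.month = 12
    GROUP BY p.title
    ORDER BY p.title
""").fetchall()
```

Every report follows the same pattern: start from the fact table, join only the dimensions you need, filter and group on dimension attributes.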

For every BI or reporting system, you have a process of designing your tables and building them based on that design. This process is called dimensional modeling. Some call it data warehouse design, which is the same thing. Dimensional modeling is the process of thinking through and designing the data model, including the tables and their relationships. As you can see, no technology is involved in dimensional modeling: it all happens in your head and ends up as diagrams sketched on paper. Dimensional modeling is not the diagram in which tables are connected to each other; it is the process of producing it.
Star schema is the best way of designing a data model for reporting; you will get the best performance and also flexibility using such a model.
In this case the Employee Dimension will be a Historical Dimension or Slowly Changing Dimension.
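One common way to keep history in such a dimension (often called a Type 2 slowly changing dimension) is to add a new row for each change and mark validity with dates. The answer does not say which variant it has in mind, so treat this as a sketch under that assumption; the column names employee_key, valid_from, valid_to and the helper apply_change are hypothetical.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("""
    CREATE TABLE dim_employee (
        employee_key INTEGER PRIMARY KEY AUTOINCREMENT,  -- surrogate key, one per version
        employee_id  INTEGER NOT NULL,                   -- business key from the source system
        surname      TEXT NOT NULL,
        valid_from   TEXT NOT NULL,
        valid_to     TEXT                                -- NULL marks the current version
    )
""")

def apply_change(conn, employee_id, surname, change_date):
    """Close the current version of the employee (if any) and insert a new one."""
    conn.execute(
        "UPDATE dim_employee SET valid_to = ? "
        "WHERE employee_id = ? AND valid_to IS NULL",
        (change_date, employee_id),
    )
    conn.execute(
        "INSERT INTO dim_employee (employee_id, surname, valid_from) VALUES (?, ?, ?)",
        (employee_id, surname, change_date),
    )

# Made-up example: the same employee changes surname once.
apply_change(conn, 1, "Smith", "2019-01-01")
apply_change(conn, 1, "Jones", "2020-06-15")

history = conn.execute(
    "SELECT surname, valid_from, valid_to FROM dim_employee "
    "WHERE employee_id = 1 ORDER BY employee_key"
).fetchall()
```

Fact rows would then reference employee_key rather than employee_id, so each fact points at the version of the employee that was valid at the time.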
You can use a bridge table.
In a classic dimensional schema, each dimension attached to a fact table has a single value consistent with the fact table’s grain. But there are a number of situations in which a dimension is legitimately multivalued.
As in your example, an employee can have many positions.
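A minimal sketch of such a bridge table, assuming the simplest possible form (just the two keys). The table name bridge_employee_position and the sample rows are made up for illustration; real bridge tables often carry extra columns such as a weighting factor.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE dim_employee (employee_id INTEGER PRIMARY KEY, name TEXT);
CREATE TABLE dim_position (position_id INTEGER PRIMARY KEY, title TEXT);

-- Bridge table: one row per (employee, position) pair.
CREATE TABLE bridge_employee_position (
    employee_id INTEGER NOT NULL REFERENCES dim_employee(employee_id),
    position_id INTEGER NOT NULL REFERENCES dim_position(position_id),
    PRIMARY KEY (employee_id, position_id)
);

-- Made-up sample data: Ann holds two positions, Bob one.
INSERT INTO dim_employee VALUES (1, 'Ann'), (2, 'Bob');
INSERT INTO dim_position VALUES (1, 'Developer'), (2, 'Analyst');
INSERT INTO bridge_employee_position VALUES (1, 1), (1, 2), (2, 1);
""")

employee_positions = conn.execute("""
    SELECT e.name, p.title
    FROM dim_employee AS e
    JOIN bridge_employee_position AS b ON b.employee_id = e.employee_id
    JOIN dim_position AS p ON p.position_id = b.position_id
    ORDER BY e.name, p.title
""").fetchall()
```

Keep in mind that joining a fact table through a bridge can double count measures, because one fact row may match several bridge rows.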

Related

Database tables without relationships [closed]

Closed. This question is opinion-based. It is not currently accepting answers.
Closed 2 years ago.
Is it a good thing to create a database without relationships between the tables?
Is there any problem doing this? I have to design a database with historical events, sports events, environment data, etc. but can I put them in only one database?
In your case (as you said in a comment, it's for a history table), having no explicit relation between the parent table and the child table isn't a problem, as:
you won't need the unique constraints
you don't need to delete the orphans (if it's a history table, you want to keep all the data, don't you?)
And if the requests to this history table are made independently of the parent (e.g. any ORM used), make sure to have an index on the parent id column so that you can easily retrieve all the data linked to the parent.
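As a rough illustration of that advice, here is a history table with no declared foreign key but with an index on the parent id column. The table name item_history, its columns and the sample rows are all hypothetical.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE item_history (
    history_id INTEGER PRIMARY KEY,
    parent_id  INTEGER NOT NULL,   -- deliberately no FOREIGN KEY: history outlives the parent
    changed_at TEXT NOT NULL,
    payload    TEXT
);
-- The index that makes "all history rows for one parent" cheap to fetch.
CREATE INDEX idx_item_history_parent ON item_history (parent_id);
""")

conn.executemany(
    "INSERT INTO item_history (parent_id, changed_at, payload) VALUES (?, ?, ?)",
    [(1, "2020-01-01", "created"), (2, "2020-01-02", "created"), (1, "2020-02-01", "renamed")],
)

rows_for_parent = conn.execute(
    "SELECT payload FROM item_history WHERE parent_id = ? ORDER BY changed_at", (1,)
).fetchall()

# Ask SQLite how it would run the lookup; the last column of each plan row is a description.
plan = conn.execute(
    "EXPLAIN QUERY PLAN SELECT payload FROM item_history WHERE parent_id = ?", (1,)
).fetchall()
uses_index = any("idx_item_history_parent" in row[-1] for row in plan)
```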
Is it a good thing to create a database whose tables have no relationships?
Sure, if you don't have or need any relations (for example, a Users table and a StarsInTheSky table).
I have to design a database with some historical events, sports events, environment data and other stuff, but can I put them in only one database?
You are probably talking about putting the data in only one table. In my opinion you should think about normalization:
Start by writing your single table and its first rows on paper (use your imagination).
Ask yourself: "Am I repeating some data in the rows I have written?"
EX:
Name - Surname - BirthDate - Address
Paul - Allen - 01/11/1957 - 21 Baker Street NY
Paul - Allen - 01/11/1957 - 66 Mullholland Drive LosAngeles
As you can see here, you can put personal data and addresses in two distinct tables and relate them.
Ask yourself: "Am I using irresponsible columns (fields)?"
EX:
Name - Surname - BirthDate - Phone1 - Phone2
Paul - Allen - 01/11/1957 - 25412255 - null
What if another user has 3 or 4 phone numbers?
Relate User data with Phone table.
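The phone example could be sketched like this with Python's sqlite3. The table and column names are assumptions, and the second and third phone numbers are invented to show a user with more than two numbers.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
-- Personal data is stored once per person.
CREATE TABLE person (
    person_id  INTEGER PRIMARY KEY,
    name       TEXT,
    surname    TEXT,
    birth_date TEXT
);
-- Any number of phone numbers per person, instead of Phone1, Phone2, ...
CREATE TABLE phone (
    phone_id  INTEGER PRIMARY KEY,
    person_id INTEGER NOT NULL REFERENCES person(person_id),
    number    TEXT NOT NULL
);

INSERT INTO person VALUES (1, 'Paul', 'Allen', '01/11/1957');
-- The last two numbers are made up for the example.
INSERT INTO phone (person_id, number) VALUES
    (1, '25412255'), (1, '25412256'), (1, '25412257');
""")

phone_count = conn.execute(
    "SELECT COUNT(*) FROM phone WHERE person_id = ?", (1,)
).fetchone()[0]
```

The same split works for addresses: one address table with a person_id column instead of repeating the person on every row.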
EDIT: Use a single database or not? AFAIK programs evolve and get extended over time, and maybe one day you will need to add some relations, so it's better to use a single database per program, no matter how many tables you have and whether they are related or not. Keep the future work as simple as you can :)

Star Schema design from 3NF

I'm a newbie to data warehousing and I've been reading articles and watching videos on the principles but I'm a bit confused as to how I would take the design below and convert it into a star schema.
In all the examples I've seen the fact table references the dim tables, so I'm assuming the questionId and responseId would be part of the fact table? Any advice would be much appreciated.
I can't see the image at the moment (blocked by my firewall at the office), but I'll try to give you some ideas.
The general idea is to organize your measurable 'facts' into what are called fact tables. There are 3 main types of facts, but that is a topic for a different day (I'd be happy to go into this if needed). Each of these facts is what you'd see in the center of a typical 'star schema'. The other attributes within the fact tables are typically FK references to the dimension tables.
Regarding dimensions, these are groups of attributes that share commonality (the most notable being a calendar dimension). This is important because when you're doing analysis across multiple facts the dimensions are what you use to connect them.
If you consider this simple example: A product is ordered and then shipped. We could have 2 transaction facts (one that contains the qty ordered - measure, type of product ordered - dimension, and transaction date - dimension). We'd also have a transaction fact for the product shipping ( qty shipped - measure, product type - dimension, and ship date - dimension). This simple schema could be used to answer questions like 'how many products by product type last quarter were ordered but not shipped'.
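Here is a rough sketch of that example in Python with sqlite3. The table and column names are made up, and since the answer does not describe how shipments relate to individual orders, the query simply compares quantities ordered and shipped within the same quarter, per product type.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
-- Shared (conformed) dimensions.
CREATE TABLE dim_product (product_id INTEGER PRIMARY KEY, product_type TEXT);
CREATE TABLE dim_date (date_id INTEGER PRIMARY KEY, year INTEGER, quarter INTEGER);

-- Two transaction facts that share the same dimensions.
CREATE TABLE fact_order (
    product_id INTEGER REFERENCES dim_product(product_id),
    date_id INTEGER REFERENCES dim_date(date_id),
    qty_ordered INTEGER
);
CREATE TABLE fact_shipment (
    product_id INTEGER REFERENCES dim_product(product_id),
    date_id INTEGER REFERENCES dim_date(date_id),
    qty_shipped INTEGER
);

-- Made-up sample data.
INSERT INTO dim_product VALUES (1, 'Widget'), (2, 'Gadget');
INSERT INTO dim_date VALUES (1, 2020, 4);
INSERT INTO fact_order VALUES (1, 1, 10), (2, 1, 5);
INSERT INTO fact_shipment VALUES (1, 1, 7);
""")

# Aggregate each fact separately, then combine the results on the shared dimension.
not_shipped = conn.execute("""
    WITH ordered AS (
        SELECT p.product_type, SUM(o.qty_ordered) AS qty
        FROM fact_order AS o
        JOIN dim_product AS p ON p.product_id = o.product_id
        JOIN dim_date AS d ON d.date_id = o.date_id
        WHERE d.year = 2020 AND d.quarter = 4
        GROUP BY p.product_type
    ),
    shipped AS (
        SELECT p.product_type, SUM(s.qty_shipped) AS qty
        FROM fact_shipment AS s
        JOIN dim_product AS p ON p.product_id = s.product_id
        JOIN dim_date AS d ON d.date_id = s.date_id
        WHERE d.year = 2020 AND d.quarter = 4
        GROUP BY p.product_type
    )
    SELECT o.product_type, o.qty - COALESCE(s.qty, 0)
    FROM ordered AS o
    LEFT JOIN shipped AS s ON s.product_type = o.product_type
    ORDER BY o.product_type
""").fetchall()
```

The two facts are never joined row to row; they only meet through the dimensions they share.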
Hopefully this helps you get started.
Usually a fact table is used to aggregate measures - which are always numeric. Examples would be: sales dollars, distances, weights, number of items sold.
The type of data you drew here doesn't have any cut-and-dried "measure", so you need to decide what you want to measure. Is it the number of answers per question? Is it how many responses per sample?
This is often called an Event Fact table (if you want to search for other examples). And you need some sort of reporting requirements before you can turn it into a star schema. So it isn't an easy answer...
It's easy :) Responses is the fact table; everything else is a dimension. Your schema is already a star, because the fact table connects directly to every dimension. For example, if you later redesign the structure so that addresses are stored in a separate table related to sample, you must add the address table's id to the responses table to keep it a star schema.

One-To-Many join table to avoid nullable columns [closed]

Closed. This question is opinion-based. It is not currently accepting answers.
Closed 10 months ago.
I'm wondering whether I am the first programmer to struggle with this problem, but I can't find anything on SO about it.
The point of my question: is it a good idea to make a one-to-many join table in order to prevent NULL references?
Let me explain. In our business requirements, we have some activities that cause a payment, e.g. sales, loans, rents, services, etc. Each activity can have zero, one, or more payments.
When designing the DB, we have a table for each activity (Sales, Loans, Rents, Services, etc.) and a Payment table. The relation between the activities and the payments is one-to-many: each loan can have many payments, and each rent can have many payments.
But there is a problem: each payment can belong to a loan or a sale or any other activity, and we need to relate it to its corresponding activity. I can think of two options:
1) Add foreign keys to the Payments table for each kind of activity (LoanID, RentID, ServiceID, etc.) and make them nullable, because a loan is neither a service nor a rent.
I personally don't like this solution; it is very error-prone. One can very easily forget to add the matching FK because it is nullable, and then we don't know what the payment is about and we lose referential integrity. It is possible to overcome this problem by creating a constraint to ensure that there is exactly one FK, but it is not so easy to create the right constraint and take into account all possible options, and it is hard to recreate the constraint when adding new FK columns.
Not to mention the ugliness of such a table, let alone the main issue of leaving unnecessary nullable columns in a table.
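For reference, the "exactly one FK" constraint from option 1 could look roughly like the sketch below, written against SQLite through Python's sqlite3. The snake_case column names and sample values are made up. Adding up the IS NOT NULL tests works in SQLite because they evaluate to 0 or 1; other engines may need CASE expressions instead.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("""
    CREATE TABLE payment (
        payment_id INTEGER PRIMARY KEY,
        amount     REAL NOT NULL,
        loan_id    INTEGER,
        rent_id    INTEGER,
        service_id INTEGER,
        -- Exactly one of the nullable references must be filled in.
        CHECK ((loan_id IS NOT NULL) + (rent_id IS NOT NULL) + (service_id IS NOT NULL) = 1)
    )
""")

def try_insert(sql):
    """Return True if the insert is accepted, False if the CHECK constraint rejects it."""
    try:
        conn.execute(sql)
        return True
    except sqlite3.IntegrityError:
        return False

one_fk_ok = try_insert("INSERT INTO payment (amount, loan_id) VALUES (100.0, 1)")
no_fk_ok = try_insert("INSERT INTO payment (amount) VALUES (50.0)")
two_fk_ok = try_insert("INSERT INTO payment (amount, loan_id, rent_id) VALUES (75.0, 1, 2)")
```

The maintenance cost described above is visible here: every new activity type means a new column and a rewritten CHECK.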
2) A second solution is to create a join table in between for each kind of activity, called ActivityPayments (i.e. LoanPayments etc.), that holds the activity ID and the payment ID, like a many-to-many table.
This avoids the problems described above: each payment is related to its corresponding activity, there is no loss of referential integrity, and there are no nullable columns.
The problem, however, is that it enlarges the database, adds another layer between the tables, and needs more work when joining in queries.
Does anyone have any ideas?
Another option is to create a supertype table, say Activity, with all of the common attributes:
This should keep the number of tables small, and still allow you to identify the activity type for a payment. Note that this assumes that common attributes exist between the different activities. If that is not the case, the second option you listed is probably the way to go.
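A minimal sketch of the supertype idea, assuming a layout along these lines; the column names, the activity_type values and the sample rows are all made up for illustration.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("PRAGMA foreign_keys = ON")  # SQLite does not enforce foreign keys by default
conn.executescript("""
-- Supertype: one row per activity of any kind, holding the common attributes.
CREATE TABLE activity (
    activity_id   INTEGER PRIMARY KEY,
    activity_type TEXT NOT NULL CHECK (activity_type IN ('sale', 'loan', 'rent', 'service')),
    started_on    TEXT
);
-- Subtype table with type-specific attributes, sharing the supertype's key.
CREATE TABLE loan (
    loan_id       INTEGER PRIMARY KEY REFERENCES activity(activity_id),
    interest_rate REAL
);
-- Payments reference the supertype through a single NOT NULL foreign key.
CREATE TABLE payment (
    payment_id  INTEGER PRIMARY KEY,
    activity_id INTEGER NOT NULL REFERENCES activity(activity_id),
    amount      REAL NOT NULL
);

INSERT INTO activity VALUES (1, 'loan', '2020-01-01'), (2, 'sale', '2020-02-01');
INSERT INTO loan VALUES (1, 0.05);
INSERT INTO payment (activity_id, amount) VALUES (1, 100.0), (1, 150.0), (2, 40.0);
""")

# A payment pointing at a non-existent activity is rejected.
try:
    conn.execute("INSERT INTO payment (activity_id, amount) VALUES (99, 10.0)")
    orphan_rejected = False
except sqlite3.IntegrityError:
    orphan_rejected = True

paid_per_type = conn.execute("""
    SELECT a.activity_type, SUM(p.amount)
    FROM payment AS p
    JOIN activity AS a ON a.activity_id = p.activity_id
    GROUP BY a.activity_type
    ORDER BY a.activity_type
""").fetchall()
```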
Look up the following tags in SO.
single-table-inheritance
class-table-inheritance
shared-primary-key
The info tab on these tags gives you a brief explanation, and the questions grouped under the tag will give you some examples.
Single table inheritance is similar to the solution you presented, and that you are unhappy with. Yes, it does involve NULLs. Generally, user errors here are prevented by the application.
Class-table-inheritance is like the solution offered by AMS. Note that SalesID and LoanID are listed as both a PK and an FK. This hints at the technique of shared primary key. With this, SalesID and LoanID are copies of a value in ActivityID. Again, it's the application layer that does the necessary work to make sure the copies are right.
In this specific case (not necessarily applicable in similar situations), we usually calculate dynamically, in a view/function, which activity each payment was for (in chronological order).
In other instances we had one sale table where each product can be a physical product, a service, or any other for-pay offer, so that limits all debit transactions to one table.
HTA

Basic questions regarding Data Warehousing

I'm wanting to use OLAP cubes and have to first design a data warehouse. I am going for the star-schema. I'm a little confused about how to convert from a normal database to a data warehouse, especially with regards to foreign keys between dimension tables. I know a fact table has foreign keys to dimensions, but do dimensions have foreign keys between them? For example, what do I need to do with the following 2 examples:
TABLE: Airports
COLUMNS: Id, Name, Code, CityId
When I make the Airports dimension, do I remove CityId and put the City Name instead? Or what?
TABLE: Regions
COLUMNS: Id, Name, RegionType, ParentId
The question for this one is mostly the same, but a bit more complex, because here ParentId refers to the same table (Regions). For example, a City can refer to a parent Country record. How do I translate these over to a data warehouse star schema?
Lastly, regarding measures, those go on the fact table, right? I think I will likely need multiple fact tables. Is that normal? Does one fact table translate to one OLAP cube? Or what?
You want to include the city within your airport dimension. You are intentionally flattening out your normalised schema to aid the speed of the dimensional model, which can seem counterintuitive if you are coming from transactional development.
With regard to the parent-child relationship, you want the ParentId to be translated into the surrogate key of the parent region record. SSAS will provide the functionality to relate parent and child records when you are designing your cube.
Multiple facts are not unusual, but unless the fact data is completely unrelated, there is no need to separate them into different cubes. The requirement for multiple facts will be driven by having data at a different grain. Keep all of your metrics (i.e. flights) together, but you would separate out flight metrics from food sale metrics.
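A small sketch of that flattening step with Python's sqlite3. The Cities source table is an assumption (the question only shows a CityId column), and the names DimAirport, AirportKey and CityName are hypothetical.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
-- Normalised source tables (Cities is assumed from the CityId column).
CREATE TABLE Cities (Id INTEGER PRIMARY KEY, Name TEXT);
CREATE TABLE Airports (
    Id INTEGER PRIMARY KEY, Name TEXT, Code TEXT,
    CityId INTEGER REFERENCES Cities(Id)
);
INSERT INTO Cities VALUES (1, 'London'), (2, 'New York');
INSERT INTO Airports VALUES
    (10, 'Heathrow', 'LHR', 1),
    (11, 'John F. Kennedy', 'JFK', 2);

-- Flattened dimension: the city name is copied in, the CityId lookup disappears.
CREATE TABLE DimAirport (
    AirportKey INTEGER PRIMARY KEY AUTOINCREMENT,  -- warehouse surrogate key
    AirportId  INTEGER,                            -- key from the source system
    Name       TEXT,
    Code       TEXT,
    CityName   TEXT
);
INSERT INTO DimAirport (AirportId, Name, Code, CityName)
SELECT a.Id, a.Name, a.Code, c.Name
FROM Airports AS a
JOIN Cities AS c ON c.Id = a.CityId
ORDER BY a.Id;
""")

dim_rows = conn.execute(
    "SELECT Code, CityName FROM DimAirport ORDER BY AirportId"
).fetchall()
```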
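To show what a self-referencing region dimension can look like outside of SSAS, here is a sketch in Python with sqlite3. The names DimRegion, RegionKey and ParentKey and the sample rows are assumptions. The recursive query only demonstrates that the hierarchy can be walked in plain SQL; per the answer, the cube tooling would normally handle this for you.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE DimRegion (
    RegionKey  INTEGER PRIMARY KEY,                    -- surrogate key
    RegionId   INTEGER,                                -- Id from the source Regions table
    Name       TEXT,
    RegionType TEXT,
    ParentKey  INTEGER REFERENCES DimRegion(RegionKey) -- surrogate key of the parent row
);
-- Made-up sample hierarchy: Europe > France > Paris.
INSERT INTO DimRegion VALUES
    (100, 1, 'Europe', 'Continent', NULL),
    (101, 2, 'France', 'Country', 100),
    (102, 3, 'Paris', 'City', 101);
""")

# Walk from a city up to the root of its hierarchy.
ancestors = conn.execute("""
    WITH RECURSIVE chain(RegionKey, Name, ParentKey, depth) AS (
        SELECT RegionKey, Name, ParentKey, 0
        FROM DimRegion
        WHERE Name = 'Paris'
        UNION ALL
        SELECT r.RegionKey, r.Name, r.ParentKey, c.depth + 1
        FROM DimRegion AS r
        JOIN chain AS c ON r.RegionKey = c.ParentKey
    )
    SELECT Name FROM chain ORDER BY depth
""").fetchall()
```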
You are not converting to a data warehouse; you are creating a new data warehouse with a few dimensions and (at least) one fact table. Dimension tables are loaded first, and you DO NOT want to replace the id with the name.
You need an additional key for each dimension table. Once the dimensions are loaded, I usually use an SSIS package to load the fact table (either as an incremental load, or you can truncate the fact table each time before loading the new data, depending on what you need)...

Relational database design: standard row values in one table vs. separate tables

Note: I've seen a few related question about similar issues; however, none of them would fully answer my question.
I have exam data for schools. There are around 500 schools, and around 12 subject exams in my dataset (each school has data for each exam). Each exam has 6 attributes (columns). After the initial data is loaded to the database, no modifications are expected. With respect to SELECT queries, I imagine that separate exam data is used as often as queries over a number of exams. However, the database would be used by a website visualizing the data, thus those SELECT queries might have to be run rather often. With that in mind, I can think of three ways of organizing that data, with each way producing (apparently) BCNF tables.
First schema:
school
exam1_attr1
exam1_attr2
...
exam12_attr6
This schema feels wrong, though I do not have strong arguments against it. As I said, my data would not change, thus having exams carved into attribute names is not that much of an issue. However, such a setup would pose some aggregation difficulties over the entire dataset (i.e., resulting queries would possibly be unnecessarily complicated).
Second schema:
school
examID
attr1
attr2
...
attr6
While this schema looks attractive, I find it hard to convince myself that it is a good idea to represent exams as values rather than columns or separate tables. That is, the set of exams is known, finite and final, and each exam has exact same properties - sounds like a primary candidate for a separate table. On the other hand, under such an arrangement, both aggregation and single-exam queries are very clean and straight-forward.
Third schema would be identical for 12 separate exam tables:
school
attr1
attr2
...
attr6
Conceptually, I would feel that this schema represents my data best: each exam is logically separated into its own table. However, any queries requiring aggregate data over all exams would then include 12 tables, and that makes me feel rather uneasy.
Thus, my question: which database design would be best in my case? While I am looking for an answer, I am also very interested in reasons for choosing one schema over the other. Specifically, I wonder:
how efficiency of running queries changes with each database design,
how important in real life is the ease of writing queries (given that the data would be primarily used by a website - I would seldom write queries over the data after the website has been finished),
which design is better if potential future changes to the data of the website are taken into account,
whether your answer would be different if the number of schools was not 500, but 50,000.
In short, I am interested in any opinions that would help me understand why one design is better than the other. Any database design theories are welcome as well. Thanks!
In an operational relational database, the speed of changes is more important than speed of selects. In a data warehouse, the speed of selects is more important than the speed of changes.
You have a data warehouse.
Operational relational databases are normalized.
Data warehouses use some variation of a star schema.
Your second schema is a good schema for the reason you stated. Both aggregation and single-exam queries are very clean and straight-forward. However, you should put the school information in a separate school table, and reference the school table ID (primary key field, auto-increment integer) as a foreign key in the exam table. This allows you to scale from 500 to 50,000 schools more easily.
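A minimal sketch of the second schema with a separate school table, using Python's sqlite3. Table names and sample rows are made up, and only two of the six attributes are shown to keep it short.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE school (
    school_id INTEGER PRIMARY KEY AUTOINCREMENT,
    name      TEXT NOT NULL
);
-- One row per (school, exam); the exam is a value, not a column or a table.
CREATE TABLE exam_result (
    school_id INTEGER NOT NULL REFERENCES school(school_id),
    exam_id   INTEGER NOT NULL,
    attr1     REAL,
    attr2     REAL,
    PRIMARY KEY (school_id, exam_id)
);

-- Made-up sample data.
INSERT INTO school (name) VALUES ('North'), ('South');
INSERT INTO exam_result VALUES
    (1, 1, 70, 1), (1, 2, 80, 1),
    (2, 1, 60, 1), (2, 2, 90, 1);
""")

# Single-exam query: attr1 of every school for exam 1.
exam1 = conn.execute("""
    SELECT s.name, r.attr1
    FROM exam_result AS r
    JOIN school AS s ON s.school_id = r.school_id
    WHERE r.exam_id = 1
    ORDER BY s.name
""").fetchall()

# Aggregate over all exams: average attr1 per exam, no 12-table join needed.
avg_per_exam = conn.execute("""
    SELECT exam_id, AVG(attr1)
    FROM exam_result
    GROUP BY exam_id
    ORDER BY exam_id
""").fetchall()
```

Both query shapes stay the same whether there are 500 or 50,000 schools; only the row count changes.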

Resources