Basic questions regarding Data Warehousing - sql-server

I'm wanting to use OLAP cubes and have to first design a data warehouse. I am going for the star-schema. I'm a little confused about how to convert from a normal database to a data warehouse, especially with regards to foreign keys between dimension tables. I know a fact table has foreign keys to dimensions, but do dimensions have foreign keys between them? For example, what do I need to do with the following 2 examples:
TABLE: Airports
COLUMNS: Id, Name, Code, CityId
When I make the Airports dimension, do I remove CityId and put the City Name instead? Or what?
TABLE: Regions
COLUMNS: Id, Name, RegionType, ParentId
The question for this one is mostly the same, but a bit more complex, because here ParentId refers to the same table (Regions). For example, a City can refer to a parent Country record. How do I translate these over to a data warehouse star schema?
Lastly, regarding measures, those go on the fact table, right? I think I will likely need multiple fact tables. Is that normal? Does one fact table translate to one OLAP cube? Or what?

You want to include the city within your airport dimension. You are intentionally flattening out your normalised schema to speed up the dimensional model, which can seem counterintuitive if you are coming from transactional development.
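As a rough illustration of that flattening, here is a minimal Python sketch using the standard-library sqlite3 module as a stand-in for SQL Server. The Cities table, the dim_airport table, its column names, and the sample rows are all assumptions made up for the example; only the Airports columns come from the question.

```python
import sqlite3

def build_airport_dimension(conn):
    """Flatten Airports + Cities into one star-schema dimension table."""
    conn.executescript("""
        -- Hypothetical dimension: its own surrogate key, the source Id kept
        -- as a business key, and the city name copied in place of CityId.
        CREATE TABLE dim_airport (
            airport_key INTEGER PRIMARY KEY AUTOINCREMENT,
            airport_id  INTEGER NOT NULL,
            name        TEXT NOT NULL,
            code        TEXT NOT NULL,
            city_name   TEXT NOT NULL
        );
        INSERT INTO dim_airport (airport_id, name, code, city_name)
        SELECT a.Id, a.Name, a.Code, c.Name
        FROM Airports AS a
        JOIN Cities AS c ON c.Id = a.CityId
        ORDER BY a.Id;
    """)

# Assumed source tables and sample data, for illustration only.
conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE Cities (Id INTEGER PRIMARY KEY, Name TEXT);
    CREATE TABLE Airports (Id INTEGER PRIMARY KEY, Name TEXT, Code TEXT, CityId INTEGER);
    INSERT INTO Cities VALUES (10, 'London'), (20, 'Paris');
    INSERT INTO Airports VALUES (1, 'Heathrow', 'LHR', 10),
                                (2, 'Charles de Gaulle', 'CDG', 20);
""")
build_airport_dimension(conn)
rows = conn.execute(
    "SELECT airport_key, code, city_name FROM dim_airport ORDER BY airport_key"
).fetchall()
```

The dimension no longer has any foreign key to another dimension; the fact table would reference only airport_key.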
With regards to the parent-child relationship, you want the ParentId to be translated into the surrogate key of the parent region record. SSAS will provide the functionality to relate parent-child records when you are designing your cube.
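One way to picture the ParentId translation is the two-pass sketch below, in plain Python. The function name, the output column names (region_key, parent_key), and the sample rows are hypothetical; in practice this would more likely be done in SQL or in the ETL tool.

```python
def build_region_dimension(source_rows):
    """Assign surrogate keys, then rewrite ParentId as the parent's surrogate key."""
    # First pass: give every source region Id a surrogate key (1, 2, 3, ...).
    key_by_id = {row["Id"]: key for key, row in enumerate(source_rows, start=1)}

    # Second pass: replace the source ParentId with the parent's surrogate key.
    dimension = []
    for row in source_rows:
        parent_id = row["ParentId"]
        dimension.append({
            "region_key": key_by_id[row["Id"]],
            "name": row["Name"],
            "region_type": row["RegionType"],
            "parent_key": key_by_id[parent_id] if parent_id is not None else None,
        })
    return dimension

# Assumed sample data: a City row pointing at its parent Country row.
regions = [
    {"Id": 500, "Name": "France", "RegionType": "Country", "ParentId": None},
    {"Id": 742, "Name": "Paris", "RegionType": "City", "ParentId": 500},
]
dim_region = build_region_dimension(regions)
```

The self-reference stays inside the single Regions dimension, so it does not turn the star into a snowflake.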
Multiple facts are not unusual, but unless the fact data is completely unrelated, there is no need to separate them into different cubes. The requirement for multiple facts will be driven by having data at a different grain. Keep all of your related metrics (e.g. flights) together, but separate flight metrics from food sale metrics.

You are not converting to a data warehouse; you are creating a new data warehouse with a few dimensions and at least one fact table. Dimension tables are loaded first, and you DO NOT want to replace the id with the name.
You need an additional key for each dimension table. Once the dimensions are loaded, I usually use an SSIS package to load the fact table (either an incremental load, or truncating the fact table each time before loading the new data, depending on what you need)...

Related

Star Schema from multiple source tables

I am struggling in figuring out how to create a star schema from multiple source tables. I work at a trading firm so the data is related to user trading activity. The issue I am having is that our datasets do not have primary ids for every field that could be a dimension. Instead, we usually relate our data together using the combination of date and account number. Here is an example of 3 source tables...
I would like to turn this into a star schema, something that looks like ...
Is my only option to denormalize my source tables into one wide table (joining trades to position on account number and date, and joining the users table on account number), create keys for each dimension, then renormalize it into the star schema? Are star schemas ever built from multiple source tables?
Star schemas are almost always created from multiple source tables.
The normal process is:
1. Populate your dimension tables
2. Create a temporary/virtual fact record using your source data
3. Using this fact record, look up the relevant dimension keys
4. Write the actual fact record to your target fact table
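The steps above can be sketched roughly as follows, using Python's standard-library sqlite3 module in place of SQL Server. The table names (dim_account, dim_date, fact_trade), the columns, and the sample trades are assumptions; the question's real source tables are not shown.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE dim_account (account_key INTEGER PRIMARY KEY, account_number TEXT UNIQUE);
    CREATE TABLE dim_date    (date_key INTEGER PRIMARY KEY, calendar_date TEXT UNIQUE);
    CREATE TABLE fact_trade  (account_key INTEGER, date_key INTEGER, quantity INTEGER);
""")

# Hypothetical source rows, related only by account number and date.
source_trades = [
    {"account_number": "A-1", "trade_date": "2020-01-02", "quantity": 100},
    {"account_number": "A-1", "trade_date": "2020-01-03", "quantity": -40},
    {"account_number": "B-7", "trade_date": "2020-01-02", "quantity": 25},
]

def load_star_schema(conn, trades):
    # Step 1: populate the dimensions (duplicates are skipped via UNIQUE).
    for t in trades:
        conn.execute("INSERT OR IGNORE INTO dim_account (account_number) VALUES (?)",
                     (t["account_number"],))
        conn.execute("INSERT OR IGNORE INTO dim_date (calendar_date) VALUES (?)",
                     (t["trade_date"],))
    for t in trades:
        # Step 2: the source row `t` plays the role of the temporary fact record.
        # Step 3: look up the dimension keys from the natural keys.
        (account_key,) = conn.execute(
            "SELECT account_key FROM dim_account WHERE account_number = ?",
            (t["account_number"],)).fetchone()
        (date_key,) = conn.execute(
            "SELECT date_key FROM dim_date WHERE calendar_date = ?",
            (t["trade_date"],)).fetchone()
        # Step 4: write the real fact record with keys plus measures.
        conn.execute("INSERT INTO fact_trade VALUES (?, ?, ?)",
                     (account_key, date_key, t["quantity"]))

load_star_schema(conn, source_trades)
facts = conn.execute("SELECT account_key, date_key, quantity FROM fact_trade").fetchall()
```

Nothing here requires one wide denormalized table first; the natural keys (account number, date) are enough to find the dimension rows.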
Data-warehousing is about query speed. The data-warehouse should not be concerned with data integrity. IT SHOULD NOT CLEAN OR CORRECT BAD DATA. It only needs to gather all the data together into a single record to present to the model for analysis. Denormalizing the data is how this is done.
In a star schema, dimensions do not know about each other and have no relationships with other dimensions. In a snowflake, dimensions are related to other dimensions. That is the primary difference between star and snowflake.
All the metadata options for events are rolled up into dimensions and used for slicing/filtering. All the measurable/calculation data for an event are in the event fact, along with a reference to the dimension(s) containing the relevant metadata. The Metadata/Dimension is reused across multiple fact records.
Based on the limited example you've provided, I'd suggest you research degenerate dimensions and junk dimensions. Your Trade and Position data may need to be turned into a fact and a dimension (degenerate), and some of your flag attributes may be best placed into a junk dimension.
You should also make sure your dimension keys are clear. You should not have multiple paths to a dimension (accountnumber: trade -> position -> user & trade -> user ) as that will cause inconsistent results when querying depending on which relationship you traverse.

A Master Category Table Where Records Have Various Categories OR There Should Be A Table For Each Category Type

Recently I encountered an application where a master table is maintained that contains the data of more than 20 categories. For example, it has categories named Country, State and City.
So my question is: is it better to move each category out into a separate table and fetch the data through joins, or should everything stay inside a single table?
P.S. In future the category count might increase to 50 or more.
P.S. The application is based on EF6 + SQL Server.
Edited Version
I just want to know what the best approach is in the above scenario: go with a single table with proper indexing, or follow the DB normalization approach, putting each category into a separate table and maintaining relationships through FKs.
Normally, categories are put into separate tables. This conforms more closely with normalized database structures and the definition of entities. In particular, it allows for proper foreign key relationships to be defined. That is a big win for data integrity.
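To make the integrity point concrete, here is a small sketch with Python's sqlite3 module standing in for SQL Server. The Country and State tables, their columns, and the helper try_insert_state are assumptions for illustration only.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("PRAGMA foreign_keys = ON")  # SQLite enforces FKs only when enabled
conn.executescript("""
    CREATE TABLE Country (Id INTEGER PRIMARY KEY, Name TEXT NOT NULL);
    CREATE TABLE State (
        Id        INTEGER PRIMARY KEY,
        Name      TEXT NOT NULL,
        CountryId INTEGER NOT NULL REFERENCES Country (Id)
    );
    INSERT INTO Country VALUES (1, 'United States');
    INSERT INTO State VALUES (1, 'Florida', 1);
""")

def try_insert_state(conn, state_id, name, country_id):
    """Return True if the row is accepted, False if the FK rejects it."""
    try:
        conn.execute("INSERT INTO State VALUES (?, ?, ?)", (state_id, name, country_id))
        return True
    except sqlite3.IntegrityError:
        return False

# Country 99 does not exist, so the database itself refuses the row.
orphan_accepted = try_insert_state(conn, 2, "Nowhere", 99)
valid_accepted = try_insert_state(conn, 3, "Iowa", 1)
```

With a single master table, a State row could point at any other row in that table, including a City or another State, and the foreign key alone could not prevent it.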
Sometimes categories are put into a single table. This can, of course, be confusing; consider, for instance, "Florida, Massachusetts" or "Washington, Iowa" (these are real places).
Putting categories in one table has one major advantage: all the text is in a single location. That can be very handy for internationalization efforts. To be honest, that is the situation where I have seen this used.

Combine two fact tables from two different marts and model them in a Tabular model

I have a Fact Table Service1 (open or latest data) from Mart1 and another Fact Table Service2 (historic data) from Mart2. These tables share a few common measures and dimensions, but the underlying datasets are mutually exclusive.
Now the business wants to merge these two facts into one table in a Tabular model to do a year-over-year comparison.
Is it possible to combine these two facts? If so, what should the approach be?
Alternatively, is there another way to achieve this?
Things to note:
Records in Fact table Service2 will never change
The Dimension keys in Mart1 and Mart2 are not guaranteed to be the same
Are these Data Marts different databases? If so, you can create a calculated table that brings the two tables together. To do this, in 2016, on the bottom of the designer, there is a little plus sign on the far right next to the last table tab defined. When you hover over it, it will say "Create new table from DAX formula". Create the DAX that selects from the first table union the second table.
If the marts are in the same database you can create a partition for each on the table properties. In order to do this you would create a tabular model, open a connection to data source and bring in the changing data. Then click on table, partitions, click New, then grab the archived data. You would have to make sure the column definitions are in line in order to do this.
As for the other issues that you describe, it sounds as though you are using the Data Marts as your warehouses. Do you have access to the data before it was transformed into the Data Mart where the surrogate keys were applied? I generally keep a Persisted Staging Area around for these cases. If you employ a Data Vault (http://learndatavault.com/) prior to your Data Mart creation, you could simply create a new Data Mart with both sets of data and all of the Dimension keys will be intact.

Best approach to avoid Too many columns and complexity in database design

Inventory Items :
Paper Size
-----
A0
A1
A2
etc
Paper Weight
------------
80gsm
150gsm etc
Paper mode
----------
Colour
Bw
Paper type
-----------
glass
silk
normal
Tabdividers and tabdivider Type
--------
Binding and Binding Types
--
Laminate and laminate Types
--
All of these inventory items need to be stored in the invoice table.
How do you store them in a database using proper RDBMS design?
In my opinion, each list should get a master table, with retrieval through JOINs. However, adding this many tables to the database may be a little complex.
This normalisation is a bit of a problem when storing all this information against an invoice, because it causes too many columns in the invoice table.
The other way is to put all of them into one table with more columns, where each row is a combination of them (hacking algorithm: 4 lists with 4 items gives over 24 records, each of which will have a reference ID).
Which one do you think is best, and why?
Your initial idea is correct. And anyone claiming that four tables is "a little bit complex" and/or "too many tables" shouldn't be doing database work. This is what RDBMS's are designed (and tuned) to do.
Each of these 4 items is an individual property of something so they can't simply be put, as is, into a table that merges them. As you had thought, you start with:
PaperSize
PaperWeight
PaperMode
PaperType
These are lookup tables and hence should have non-auto-incrementing ID fields.
These will be used as Foreign Key fields for the main paper-based entities.
Or if they can only exist in certain combinations, then there would need to be a relationship table to capture/manage what those valid combinations are. But those four paper "properties" would still be separate tables that Foreign Key to the relationship table. Some people would put a separate ID field on that relationship table to uniquely identify the combination via a single value. Personally, I wouldn't do that unless there was a technical requirement such as Replication (or some other process/feature) that required that each table had a single-field key. Instead, I would just make the PK out of the four ID fields that point to those paper "property" lookup tables.
Then those four fields would still go into any paper-based entities. At that point the main paper entity tables would look about the same as they would if there wasn't the relationship table, the difference being that instead of having 4 FKs of a single ID field each, one to each of the paper "property" tables, there would be a single FK of 4 ID fields pointing back to the PK of the relationship table.
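A minimal sketch of that relationship-table layout follows, using Python's sqlite3 module rather than SQL Server. The names PaperCombination, Product, and the ...ID column names are assumptions, as are the sample rows and the helper add_product; only the four lookup table names and the sample values come from the question and the list above.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("PRAGMA foreign_keys = ON")  # SQLite enforces FKs only when enabled
conn.executescript("""
    CREATE TABLE PaperSize   (PaperSizeID   INTEGER PRIMARY KEY, Name TEXT NOT NULL);
    CREATE TABLE PaperWeight (PaperWeightID INTEGER PRIMARY KEY, Name TEXT NOT NULL);
    CREATE TABLE PaperMode   (PaperModeID   INTEGER PRIMARY KEY, Name TEXT NOT NULL);
    CREATE TABLE PaperType   (PaperTypeID   INTEGER PRIMARY KEY, Name TEXT NOT NULL);

    -- Relationship table: the PK is the four lookup IDs, no extra ID column.
    CREATE TABLE PaperCombination (
        PaperSizeID   INTEGER NOT NULL REFERENCES PaperSize (PaperSizeID),
        PaperWeightID INTEGER NOT NULL REFERENCES PaperWeight (PaperWeightID),
        PaperModeID   INTEGER NOT NULL REFERENCES PaperMode (PaperModeID),
        PaperTypeID   INTEGER NOT NULL REFERENCES PaperType (PaperTypeID),
        PRIMARY KEY (PaperSizeID, PaperWeightID, PaperModeID, PaperTypeID)
    );

    -- Main paper entity: a single four-column FK back to the relationship table.
    CREATE TABLE Product (
        ProductID     INTEGER PRIMARY KEY,
        Name          TEXT NOT NULL,
        PaperSizeID   INTEGER NOT NULL,
        PaperWeightID INTEGER NOT NULL,
        PaperModeID   INTEGER NOT NULL,
        PaperTypeID   INTEGER NOT NULL,
        FOREIGN KEY (PaperSizeID, PaperWeightID, PaperModeID, PaperTypeID)
            REFERENCES PaperCombination (PaperSizeID, PaperWeightID, PaperModeID, PaperTypeID)
    );

    INSERT INTO PaperSize   VALUES (1, 'A0'), (2, 'A1');
    INSERT INTO PaperWeight VALUES (1, '80gsm'), (2, '150gsm');
    INSERT INTO PaperMode   VALUES (1, 'Colour'), (2, 'Bw');
    INSERT INTO PaperType   VALUES (1, 'silk'), (2, 'normal');
    INSERT INTO PaperCombination VALUES (1, 1, 1, 2);  -- A0, 80gsm, Colour, normal
""")

def add_product(conn, product_id, name, size, weight, mode, paper_type):
    """Return True if the product uses a valid combination, False otherwise."""
    try:
        conn.execute("INSERT INTO Product VALUES (?, ?, ?, ?, ?, ?)",
                     (product_id, name, size, weight, mode, paper_type))
        return True
    except sqlite3.IntegrityError:
        return False

valid = add_product(conn, 1, "A0 colour poster", 1, 1, 1, 2)
invalid = add_product(conn, 2, "A1 silk flyer", 2, 2, 1, 1)  # combination not listed
```

The second insert uses IDs that all exist in the lookup tables, yet it is rejected because that particular combination is not in PaperCombination.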
Why not jam everything into a single table? Because:
It defeats the purpose of using a Relational Database Management System to flatten out the data into a non-relational structure.
It is harder to grow that structure over time
It makes finding all paper entities of a particular property clunkier
It makes finding all paper entities of a particular property slower / less efficient
maybe other reasons?
EDIT:
Regarding the new info (e.g. Invoice Table, etc) that wasn't in the question when I was writing the above, that should be abstracted via a Product/Inventory table that would capture these combinations. That is what I was referring to as the main paper entities. The Invoice table would simply refer to a ProductID/InventoryID (just as an example) and the Product/Inventory table would have these paper property IDs. I don't see why these properties would be in an Invoice table.
EDIT2:
Regarding the IDs of the "property" lookup tables, one reason that they should not be auto-incrementing is that their values should be taken from Enums in the app layer. These lookup tables are just a means of providing a "data dictionary" so that the database layer can have insight into what these values mean.
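As a sketch of what "taken from Enums in the app layer" could look like, the snippet below uses a Python IntEnum and sqlite3. The answer does not name a language or a sync mechanism, so the PaperMode enum, its values, and the sync_lookup helper are purely hypothetical.

```python
import sqlite3
from enum import IntEnum

class PaperMode(IntEnum):
    """App-layer enum: the source of truth for the lookup IDs (assumed values)."""
    COLOUR = 1
    BW = 2

def sync_lookup(conn, table, enum_cls):
    """Write the enum members into a lookup table, using the enum values as IDs."""
    # `table` comes from code, not user input, so interpolating it is acceptable here.
    # The ID is always supplied explicitly; the database never generates it.
    conn.execute(
        f"CREATE TABLE IF NOT EXISTS {table} "
        "(ID INTEGER PRIMARY KEY, Name TEXT NOT NULL UNIQUE)"
    )
    conn.executemany(
        f"INSERT OR REPLACE INTO {table} (ID, Name) VALUES (?, ?)",
        [(member.value, member.name) for member in enum_cls],
    )

conn = sqlite3.connect(":memory:")
sync_lookup(conn, "PaperMode", PaperMode)
lookup_rows = conn.execute("SELECT ID, Name FROM PaperMode ORDER BY ID").fetchall()
```

In SQL Server the equivalent would be an ID column declared without IDENTITY, so that the values always match the enum.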

How can I create a hierarchy in SSAS?

I have the table order with following fields:
ID
Serial
Visitor
Branch
Company
Assume there are relations between Visitor, Branch and Company in the database, but every visitor can be in more than one Branch. How can I create a hierarchy between these three fields for my order table?
How can I do that?
You would need to create a denormalised dimension table, with the distinct result of the denormalisation process of the table order. In this case, you would have many rows for the same visitor. One for each branch.
In your fact table, the activity record which would have BranchKey in the primary key, would reference this dimension. This obviously would be together with the VisitorKey...
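A rough plain-Python picture of that denormalised dimension and the fact rows that reference it is shown below. The sample orders, the function names, and the visitor_branch_key column are made up for the example; the real work would normally happen in SQL or the ETL tool.

```python
# Assumed sample rows from the order table described in the question.
orders = [
    {"ID": 1, "Serial": "S-001", "Visitor": "Ann", "Branch": "North", "Company": "Acme"},
    {"ID": 2, "Serial": "S-002", "Visitor": "Ann", "Branch": "South", "Company": "Acme"},
    {"ID": 3, "Serial": "S-003", "Visitor": "Bob", "Branch": "North", "Company": "Acme"},
    {"ID": 4, "Serial": "S-004", "Visitor": "Ann", "Branch": "North", "Company": "Acme"},
]

def build_dimension(orders):
    """One dimension row per distinct (Company, Branch, Visitor) combination."""
    combos = sorted({(o["Company"], o["Branch"], o["Visitor"]) for o in orders})
    # Map each combination to a surrogate key: 1, 2, 3, ...
    return {combo: key for key, combo in enumerate(combos, start=1)}

def build_fact(orders, dimension):
    """Each fact row references the dimension row for its visitor and branch."""
    return [
        {
            "order_id": o["ID"],
            "visitor_branch_key": dimension[(o["Company"], o["Branch"], o["Visitor"])],
        }
        for o in orders
    ]

dimension = build_dimension(orders)
fact = build_fact(orders, dimension)
```

Ann appears twice in the dimension, once per branch, which is what allows a Company > Branch > Visitor hierarchy to roll up cleanly.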
Then in SSAS you would need to build the hierarchy, and set the relationships between the keys... When displaying this data in a client, such as Excel, you would drag the hierarchy into the rows, and when expanding, data from your fact would fit in according to the visitor's branch...
With regards to dimensions, it's important to set relationships between the attributes, as this will give you a massive performance gain when processing the dimension and the cube. Take a look at this article for help on that: http://www.bidn.com/blogs/DevinKnight/ssis/1099/ssas-defining-attribute-relationships-in-2005-and-2008. The same approach also applies to 2012.
