Star Schema from multiple source tables - data-modeling

I am struggling to figure out how to create a star schema from multiple source tables. I work at a trading firm, so the data is related to user trading activity. The issue I am having is that our datasets do not have primary ids for every field that could be a dimension. Instead, we usually relate our data together using the combination of date and account number. Here is an example of 3 source tables...
I would like to turn this into a star schema, something that looks like ...
Is my only option to denormalize my source tables into one wide table (joining trades to positions on account number and date, and joining the users table on account number), create keys for each dimension, and then re-normalize it into the star schema? Are star schemas ever built from multiple source tables?

Star schemas are almost always created from multiple source tables.
The normal process is:
Populate your dimension tables
Create a temporary/virtual fact record using your source data
Using this fact record, look up the relevant dimension keys
Write the actual fact record to your target fact table
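As a minimal sketch of those last two steps in plain SQL (all table and column names here, such as stg_Trades, DimDate, DimAccount and FactTrade, are hypothetical, not anything from your actual schema), the fact load might look roughly like this:

INSERT INTO FactTrade (DateKey, AccountKey, Quantity, Price)
SELECT d.DateKey,        -- surrogate key looked up from the date dimension
       a.AccountKey,     -- surrogate key looked up from the account dimension
       s.Quantity,       -- measures carried through from the source row
       s.Price
FROM   stg_Trades AS s
JOIN   DimDate    AS d ON d.FullDate      = s.TradeDate
JOIN   DimAccount AS a ON a.AccountNumber = s.AccountNumber;

In practice this step is usually driven by an ETL tool rather than hand-written SQL, but the key-lookup idea is the same.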

Data warehousing is about query speed. The data warehouse should not be concerned with data integrity. IT SHOULD NOT CLEAN OR CORRECT BAD DATA. It only needs to gather all the data together into a single record to present to the model for analysis. Denormalizing the data is how this is done.
In a star schema, dimensions do not know about each other and have no relationships with other dimensions. In a snowflake, dimensions are related to other dimensions. That is the primary difference between star and snowflake.
All the metadata options for events are rolled up into dimensions and used for slicing/filtering. All the measurable/calculation data for an event are in the event fact, along with a reference to the dimension(s) containing the relevant metadata. The Metadata/Dimension is reused across multiple fact records.
Based on the limited example you've provided, I'd suggest you research degenerate dimensions and junk dimensions. Your Trade and Position data may need to be turned into a fact and a dimension (degenerate), and some of your flag attributes may be best placed into a junk dimension.
You should also make sure your dimension keys are clear. You should not have multiple paths to a dimension (accountnumber: trade -> position -> user & trade -> user ) as that will cause inconsistent results when querying depending on which relationship you traverse.
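As a rough sketch of those two ideas (all names below, such as DimTradeFlags and FactTrade, are made up for illustration): the low-cardinality trade flags are collapsed into a junk dimension holding every combination of their values, while the trade number stays on the fact row as a degenerate dimension, i.e. a bare attribute with no dimension table behind it.

CREATE TABLE DimTradeFlags (           -- hypothetical junk dimension
    TradeFlagsKey  INT IDENTITY PRIMARY KEY,
    IsShortSale    BIT NOT NULL,
    IsOpeningTrade BIT NOT NULL,
    IsManualEntry  BIT NOT NULL
);

CREATE TABLE FactTrade (
    DateKey        INT    NOT NULL,    -- references DimDate
    AccountKey     INT    NOT NULL,    -- references DimAccount
    TradeFlagsKey  INT    NOT NULL,    -- references DimTradeFlags (junk dimension)
    TradeNumber    BIGINT NOT NULL,    -- degenerate dimension: no table behind it
    Quantity       DECIMAL(18,4) NULL,
    Price          DECIMAL(18,6) NULL
);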

Related

A Master Category Table Where Records Have Various Categories OR There Should Be A Table For Each Category Type

Recently I encountered an application where a master table is maintained that contains the data of more than 20 categories. For example, it has categories named Country, State and City.
So my question is: is it better to move each category out into a separate table and fetch the data through joins, or should everything stay inside a single table?
P.S. In the future the category count might increase to 50+ or more.
P.S. The application is based on EF6 + SQL Server.
Edited Version
I just want to know what the best approach is in the above scenario: should one go with a single table with proper indexing, or follow the DB normalization approach, putting each category into a separate table and maintaining relationships through FKs?
Normally, categories are put into separate tables. This conforms more closely with normalized database structures and the definition of entities. In particular, it allows for proper foreign key relationships to be defined. That is a big win for data integrity.
Sometimes categories are put into a single table. This can, of course, be confusing; consider, for instance, "Florida, Massachusetts" or "Washington, Iowa" (these are real places).
Putting categories in one table has one major advantage: all the text is in a single location. That can be very handy for internationalization efforts. To be honest, that is the situation where I have seen this used.
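As a short sketch of the two layouts being compared (the table names here are hypothetical), the second option is what makes real foreign keys possible:

-- Option 1: one master category table holding every category type.
CREATE TABLE Category (
    CategoryId   INT IDENTITY PRIMARY KEY,
    CategoryType VARCHAR(50)  NOT NULL,   -- 'Country', 'State', 'City', ...
    Name         VARCHAR(100) NOT NULL
);

-- Option 2: one table per category, so child rows can carry a proper FK.
CREATE TABLE Country (
    CountryId INT IDENTITY PRIMARY KEY,
    Name      VARCHAR(100) NOT NULL
);

CREATE TABLE State (
    StateId   INT IDENTITY PRIMARY KEY,
    CountryId INT NOT NULL REFERENCES Country(CountryId),
    Name      VARCHAR(100) NOT NULL
);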

Basic questions regarding Data Warehousing

I want to use OLAP cubes and have to first design a data warehouse. I am going for the star schema. I'm a little confused about how to convert from a normal database to a data warehouse, especially with regard to foreign keys between dimension tables. I know a fact table has foreign keys to dimensions, but do dimensions have foreign keys between them? For example, what do I need to do with the following 2 examples:
TABLE: Airports
COLUMNS: Id, Name, Code, CityId
When I make the Airports dimension, do I remove CityId and put the City Name instead? Or what?
TABLE: Regions
COLUMNS: Id, Name, RegionType, ParentId
The question for this one is mostly the same, but a bit more complex, because here ParentId refers to the same table (Regions). For example, a City can refer to a parent Country record. How do I translate these over to a data warehouse star schema?
Lastly, regarding measures, those go on the fact table, right? I think I will likely need multiple fact tables. Is that normal? Does one fact table translate to one OLAP cube? Or what?
You want to include city within your airport dimension. You are intentionally flattening out your normalised schema to aid the speed of the dimensional model, which can seem counter-intuitive if you are coming from transactional development.
With regard to the parent-child relationship, you want the ParentId to be translated into the surrogate key of the region record. SSAS provides the functionality to relate parent-child records when you are designing your cube.
Multiple facts are not unusual, but unless the fact data is completely unrelated, there is no need to separate them into different cubes. The requirement for multiple facts will be driven by having data at a different grain. Keep all of your metrics (i.e. flights) together, but you would separate flight metrics from food sale metrics.
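As a rough sketch of both points (the names DimAirport and DimRegion and their columns are illustrative only, not a prescribed design):

-- Flattened airport dimension: the city attributes are denormalized into
-- the dimension instead of being joined at query time.
CREATE TABLE DimAirport (
    AirportKey  INT IDENTITY PRIMARY KEY,
    AirportCode CHAR(3)      NOT NULL,
    AirportName VARCHAR(100) NOT NULL,
    CityName    VARCHAR(100) NOT NULL,
    CountryName VARCHAR(100) NOT NULL
);

-- Parent-child region dimension: ParentRegionKey points at the surrogate
-- key of the parent row in the same table (SSAS can build a parent-child
-- hierarchy from this).
CREATE TABLE DimRegion (
    RegionKey       INT IDENTITY PRIMARY KEY,
    RegionName      VARCHAR(100) NOT NULL,
    RegionType      VARCHAR(50)  NOT NULL,
    ParentRegionKey INT NULL REFERENCES DimRegion(RegionKey)
);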
You are not converting to a data warehouse; you are creating a new data warehouse with a few dimensions and (at least) one fact table. Dimension tables are loaded first, and you DO NOT want to replace the id with the name.
You need an additional surrogate key for each dimension table. Once you have loaded the dimensions, I usually use an SSIS package to load the fact table (either an incremental load, or truncating the fact table before each load of new data, depending on what you need)...

Table Design - Wide Table vs. Columns as Properties

I'm part of a team architecting an Operational Data Store (ODS) database, using SQL Server 2012, that will be used by some of our analysts to do predictive modeling. The ODS will contain manufacturing production data for a single product we make.
We will have hundreds of tables in the ODS. However, we will have a single core table that will contain critical information (lifecycle info) about each item manufactured (tens of millions each year). Our product is manufactured in a manufacturing plant and spends roughly 2.5 hours moving through various processes along a production line. We want to store various individual pieces of manufacturing and post-manufacturing information in this core table. An example piece of data might be the time the product entered a particular oven.
We have a decision to make on how to architect this table. We can create a wide table (many columns) or a narrow table where most columns are stored as rows (as property values). I have never designed or worked with a table structure that is very narrow, where columns are treated as rows in the table.
I'd like some feedback on the pros and cons of a wide table vs. a narrow table. The following might be useful in helping with this discussion:
Number of products produced each year: Several million (each of these product instances will be a row in the core table)
Will this table be queried often: Yes, very often. It will be the parent to many child tables.
Potential number of columns (or row properties): 75 to 150+
If more information would be useful, I'd be glad to provide it.
Wide tables, static properties
You are tracking a single product through a well-defined manufacturing process. This data model sounds very static, and would lend itself to a wide table with many columns that are consistently populated with data.
Narrow tables, dynamic properties
If you had many, many products with lots of variation in the manufacturing process, it would be better suited for a narrow table, where you could easily add new properties for tracking.
Difficult to query a narrow table
However, even simple querying of a narrow table can be extremely difficult. For example, what if you needed to sort the data by a certain property when that property is shuffled amongst 100+ other property rows? How would you get all the rows together to form a single "record" and then sort the record groups within your result set?
Flat tables simpler to query
Depending on how you need to view and analyze the data, you may find yourself constantly using pivot or crosstab queries. If that's the case, then why not flatten out the storage table to begin with?
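For instance, with a hypothetical narrow table ItemProperty(ItemId, PropertyName, PropertyValue), even sorting items by one property already forces a crosstab-style query, something roughly like:

SELECT ItemId,
       MAX(CASE WHEN PropertyName = 'OvenEntryTime' THEN PropertyValue END) AS OvenEntryTime,
       MAX(CASE WHEN PropertyName = 'OvenExitTime'  THEN PropertyValue END) AS OvenExitTime
FROM   ItemProperty
GROUP BY ItemId
ORDER BY OvenEntryTime;   -- sorting by a "column" that had to be pivoted first

With a wide table, the same question is just an ORDER BY on the base table.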
Or do both
Another option is to do both: Store the data narrowly, and use a transformation process to flatten it out for ease of reporting. That way you can quickly begin tracking new properties (just by adding rows), and then you can work on getting your reporting tables and transformation process updated to utilize the new data.
How wide is too wide? Well, there can be several problems with wide tables.
One problem is that wide tables tend to deviate from the rules for normalizing data. This in turn can result in tricky update problems where you have to be careful to prevent the database from entering a self-contradictory state. There's no particular answer to how wide is too wide here. Just apply the normalization rules, and you'll end up decomposing the table.
However, some databases are not built with normalization as the guiding principle. In particular, consider fact tables in star schemas. There are times when some of the columns are determined by some subset of the FKs, and this can violate 3NF or even 2NF. Keeping fact tables skinny is still important in star schemas, but it's for a different reason, namely speed. Sometimes a fact table can be made skinnier by pushing data out to one of the dimension tables. Sometimes you can decompose a star into two or more related stars.
Your case sounds like the second reason given above, even though your design probably isn't a star schema. Still, star schema design principles might help you improve your design.

Insert into a star-schema

I've read a lot about star schemas, about fact/dimension tables, and about SELECT statements to quickly report data; however, the matter of data entry into a star schema eludes me. How does one "theoretically" enter data into a star-schema db while maintaining the fact table? Is a series of INSERT INTO statements within a giant stored proc with 20 params my only option (and how do I populate the fact table)?
Many thanks.
Start with dimensions first -- one by one. Use ECCD (Extract, Clean, Conform, Deliver) approach.
Make sure that each dimension has a BusinessKey that uniquely identifies the "object" that a dimension row describes -- like email for a person.
With dimensions loaded, prepare a key-lookup pipeline. In general, for each dimension table you can prepare a key-lookup table (BusinessKey, PrimaryKey). Some designers choose to look up the dimension table directly, but the key lookup can often be cached in memory, which results in faster fact loading.
Use ECCD for the fact data too. The ECC part happens in the staging area; you can choose (helper) tables or flat files for each step of the ECC, as you prefer.
While delivering fact tables, replace each BusinessKey in the fact row with the matching PrimaryKey that you get from a key-lookup table. Once all BusinessKeys are replaced with their matching PrimaryKeys, insert the row into the fact table.
Do not waste your time; use an ETL tool. You can download Pentaho Kettle (community edition) for free -- it has everything one needs to achieve this.
You typically do not insert data into a star schema in the same way you might into a normal-form schema - i.e. with a stored procedure which inserts/updates all the appropriate tables within a single transaction. Remember that the star schema is typically a read-only denormalized model of data - it is (rarely) treated transactionally, and is typically loaded from data that is already denormalized flat - usually one flat file per star.
As Damir points out, typically, you load all the dimensions (handle the slowly changing etc), then load the facts, joining to the appropriate current dimensions to find the dimension IDs (using the business keys).
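A minimal sketch of that current-row lookup, assuming a hypothetical Type 2 DimAccount with an IsCurrent flag (all names below are made up for illustration):

INSERT INTO FactPosition (DateKey, AccountKey, Quantity, MarketValue)
SELECT d.DateKey,
       a.AccountKey,
       s.Quantity,
       s.MarketValue
FROM   stg_Positions AS s
JOIN   DimDate    AS d ON d.FullDate      = s.PositionDate
JOIN   DimAccount AS a ON a.AccountNumber = s.AccountNumber  -- business key
                      AND a.IsCurrent     = 1;               -- current SCD row only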

Do 1 to 1 relations on db tables smell?

I have a table that has a bunch of fields. The fields can be broken into logical groups - like a job's project manager info. The groupings themselves aren't really entity candidates as they don't and shouldn't have their own PKs.
For now, to group them, the fields have prefixes (PmFirstName for example) but I'm considering breaking them out into multiple tables with 1:1 relations on the main table.
Is there anything I should watch out for when I do this? Is this just a poor choice?
I can see that maybe my queries will get more complicated with all the extra joins but that can be mitigated with views right? If we're talking about a table with less than 100k records is this going to have a noticeable effect on performance?
Edit: I'll justify the non-entity-candidate thoughts a little further. This information is entered by our user base. They don't know/care about each other. So it's possible that the same user will submit the same "projectManager name" or whatever, which, at this point, wouldn't be violating any constraint. It's for us to determine later on down the pipeline if we want to correlate entries from separate users. If I were to give these things their own key, they would grow at the same rate the main table grows - since they are essentially part of the same entity. At no point is a user picking from a list of available "project managers".
So, given the above, I don't think they are entities. But maybe not - if you have further thoughts please post.
I don't usually use 1 to 1 relations unless there is a specific performance reason for it. For example storing an infrequently used large text or BLOB type field in a separate table.
I would suspect that there is something else going on here though. In the example you give - PmFirstName - it seems like maybe there should be a single pm_id relating to a "ProjectManagers" or "Employees" table. Are you sure none of those groupings are really entity candidates?
To me, they smell unless for some rows or queries you won't be interested in the extra columns. e.g. if for a large portion of your queries you are not selecting the PmFirstName columns, or if for a large subset of rows those columns are NULL.
I like the smells tag.
I use 1 to 1 relationships for inheritance-like constructs.
For example, all bonds have some basic information like CUSIP, Coupon, DatedDate, and MaturityDate. This all goes in the main table.
Now each type of bond (Treasury, Corporate, Muni, Agency, etc.) also has its own set of columns unique to it.
In the past we would just have one incredibly wide table with all that information. Now we break out the type-specific info into separate tables, which gives us much better performance.
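A rough sketch of that kind of split (the table and column names here are illustrative, not the actual schema): the common columns live in one table, and each subtype table shares the same primary key in a 1:1 relationship.

CREATE TABLE Bond (                    -- common columns for every bond
    BondId       INT PRIMARY KEY,
    CUSIP        CHAR(9)      NOT NULL,
    Coupon       DECIMAL(9,6) NOT NULL,
    DatedDate    DATE         NOT NULL,
    MaturityDate DATE         NOT NULL
);

CREATE TABLE CorporateBond (           -- 1:1 subtype table, shares the key
    BondId       INT PRIMARY KEY REFERENCES Bond(BondId),
    Issuer       VARCHAR(100) NOT NULL,
    CallSchedule VARCHAR(200) NULL
);

Queries that need the corporate-specific columns then join only the rows that actually have them.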
For now, to group them, the fields have prefixes (PmFirstName for example) but I'm considering breaking them out into multiple tables with 1:1 relations on the main table.
Create a person table; every database needs this. Then, in your project table, have a column called PMKey which points to the person table.
Why do you feel that the groups of fields are not entity candidates? If they are not, then why try to identify them with a prefix?
Either drop the prefixes or extract them into their own table.
It is valuable splitting them up into separate tables if they are separate logical entities that could be used elsewhere.
So a "Project Manager" could be 1:1 with all the projects currently, but it makes sense that later you might want to be able to have a Project Manager have more than one project.
So having the extra table is good.
If you have PrimaryFirstName, PrimaryLastName, PrimaryPhone, SecondaryFirstName, SecondaryLastName, SecondaryPhone:
You could just have a "Person" table with FirstName, LastName, Phone
Then your original Table only needs "PrimaryId" and "SecondaryId" columns to replace the 6 columns you previously had.
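A quick sketch of that refactor (the names Person and Job are hypothetical stand-ins for the actual tables):

CREATE TABLE Person (
    PersonId  INT IDENTITY PRIMARY KEY,
    FirstName VARCHAR(50) NOT NULL,
    LastName  VARCHAR(50) NOT NULL,
    Phone     VARCHAR(25) NULL
);

CREATE TABLE Job (
    JobId       INT IDENTITY PRIMARY KEY,
    PrimaryId   INT NOT NULL REFERENCES Person(PersonId),   -- replaces the Primary* columns
    SecondaryId INT NOT NULL REFERENCES Person(PersonId)    -- replaces the Secondary* columns
);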
Also, using SQL Server you can split up filegroups and tables across physical locations.
So you could have a POST table and a COMMENT table that have a 1:1 relationship, but the COMMENT table is located on a different filegroup, on a different physical drive with more space.
1:1 does not always smell. Unless it has no purpose.
