I've read a lot about star schemas, about fact/dimension tables, and about SELECT statements for quickly reporting data; however, the matter of data entry into a star schema still eludes me. How does one "theoretically" enter data into a star-schema database while maintaining the fact table? Is a series of INSERT INTO statements inside a giant stored proc with 20 parameters my only option (and how do I populate the fact table)?
Many thanks.
Start with the dimensions first -- one by one. Use the ECCD (Extract, Clean, Conform, Deliver) approach.
Make sure that each dimension has a BusinessKey that uniquely identifies the "object" that a dimension row describes -- like email for a person.
With the dimensions loaded, prepare the key-lookup pipeline. In general, for each dimension table you can prepare a key-lookup table (BusinessKey, PrimaryKey). Some designers choose to look up the dimension table directly, but the key-lookup table can often be cached in memory, which results in faster fact loading.
Use ECCD for fact data too. The ECC part happens in the staging area, you can choose (helper) tables or flat files for each step of the ECC, as you prefer.
While delivering fact tables, replace each BusinessKey in the fact row with the matching PrimaryKey that you get from a key-lookup table. Once all BusinessKeys are replaced with their matching PrimaryKeys, insert the row into the fact table.
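For illustration, here is a minimal T-SQL sketch of that delivery step. All table and column names (StageSales, KeyLookup_Customer, FactSales and so on) are invented for the example; the point is only that each BusinessKey is swapped for its PrimaryKey via a lookup and the resulting row is inserted into the fact:

```sql
-- Hypothetical staging table holding business keys plus measures (ECC already done).
-- Each KeyLookup_* table maps a BusinessKey to the dimension's PrimaryKey.
INSERT INTO FactSales (CustomerKey, ProductKey, DateKey, Quantity, Amount)
SELECT
    kc.PrimaryKey AS CustomerKey,   -- surrogate key from the customer lookup
    kp.PrimaryKey AS ProductKey,    -- surrogate key from the product lookup
    kd.PrimaryKey AS DateKey,       -- surrogate key from the date lookup
    s.Quantity,
    s.Amount
FROM StageSales AS s
JOIN KeyLookup_Customer AS kc ON kc.BusinessKey = s.CustomerEmail
JOIN KeyLookup_Product  AS kp ON kp.BusinessKey = s.ProductCode
JOIN KeyLookup_Date     AS kd ON kd.BusinessKey = s.SaleDate;
```

Note that inner joins silently drop rows whose business key has no match in a lookup; in practice you would use LEFT JOINs and route unmatched rows to an error table or to an "unknown" member in the dimension.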
Do not waste your time; use an ETL tool. You can download Pentaho Kettle (community edition) for free -- it has everything you need to achieve this.
You typically do not insert data into a star schema the same way you would into a normalized model - i.e. with a stored procedure which inserts/updates all the appropriate tables within a single transaction. Remember that the star schema is typically a read-only, denormalized model of the data - it is rarely treated transactionally, and is typically loaded from data that is already denormalized and flat - usually one flat file per star.
As Damir points out, typically, you load all the dimensions (handle the slowly changing etc), then load the facts, joining to the appropriate current dimensions to find the dimension IDs (using the business keys).
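For the "slowly changing" part, here is a rough Type 2 sketch in T-SQL. DimCustomer, StageCustomer and the tracked columns are all hypothetical; the pattern is simply: expire the changed current rows, then insert new current versions.

```sql
-- Step 1: expire the current row of any customer whose tracked attributes changed.
UPDATE d
SET    d.IsCurrent = 0,
       d.ValidTo   = GETDATE()
FROM   DimCustomer   AS d
JOIN   StageCustomer AS s ON s.Email = d.Email   -- Email is the business key
WHERE  d.IsCurrent = 1
  AND (s.City <> d.City OR s.Segment <> d.Segment);

-- Step 2: insert a new current row for every staged customer that has no current
-- version left (brand-new customers and the ones expired in step 1).
INSERT INTO DimCustomer (Email, City, Segment, IsCurrent, ValidFrom, ValidTo)
SELECT s.Email, s.City, s.Segment, 1, GETDATE(), NULL
FROM   StageCustomer AS s
LEFT JOIN DimCustomer AS d ON d.Email = s.Email AND d.IsCurrent = 1
WHERE  d.Email IS NULL;
```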
I am struggling in figuring out how to create a star schema from multiple source tables. I work at a trading firm so the data is related to user trading activity. The issue I am having is that our datasets do not have primary ids for every field that could be a dimension. Instead, we usually relate our data together using the combination of date and account number. Here is an example of 3 source tables...
I would like to turn this into a star schema, something that looks like ...
Is my only option to denormalize my source tables into one wide table (joining trades to positions on account number and date, and joining the users table on account number), create keys for each dimension, and then re-normalize it into the star schema? Are star schemas ever built from multiple source tables?
Star schemas are almost always created from multiple source tables.
The normal process is:
1. Populate your dimension tables
2. Create a temporary/virtual fact record using your source data
3. Using this fact record, look up the relevant dimension keys
4. Write the actual fact record to your target fact table
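Here is a minimal T-SQL sketch of steps 2-4 for the trading example in the question. The table and column names (stg.Trades, DimDate, DimUser, DimInstrument, FactTrade) are assumptions; the staged source row acts as the "virtual" fact record, and the joins resolve its natural keys (date, account number, symbol) into dimension keys:

```sql
-- Assumes the dimensions were populated in step 1 and the source rows are staged.
INSERT INTO FactTrade (DateKey, UserKey, InstrumentKey, TradeId, Quantity, Price)
SELECT
    dd.DateKey,         -- looked up on the trade date
    du.UserKey,         -- looked up on the account number
    di.InstrumentKey,   -- looked up on the instrument/symbol
    t.TradeId,          -- kept on the fact itself (degenerate dimension)
    t.Quantity,
    t.Price
FROM stg.Trades AS t
JOIN DimDate       AS dd ON dd.FullDate      = t.TradeDate
JOIN DimUser       AS du ON du.AccountNumber = t.AccountNumber
JOIN DimInstrument AS di ON di.Symbol        = t.Symbol;
```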
Data-warehousing is about query speed. The data-warehouse should not be concerned with data integrity. IT SHOULD NOT CLEAN OR CORRECT BAD DATA. It only needs to gather all the data together into a single record to present to the model for analysis. Denormalizing the data is how this is done.
In a star schema, dimensions do not know about each other and have no relationships with other dimensions. In a snowflake, dimensions are related to other dimensions. That is the primary difference between star and snowflake.
All the metadata options for events are rolled up into dimensions and used for slicing/filtering. All the measurable/calculation data for an event are in the event fact, along with a reference to the dimension(s) containing the relevant metadata. The Metadata/Dimension is reused across multiple fact records.
Based on the limited example you've provided, I'd suggest you research degenerate dimensions and junk dimensions. Your Trade and Position data may need to be turned into a fact and a dimension (degenerate), and some of your flag attributes may be best placed into a junk dimension.
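As a hedged illustration of the junk-dimension idea (the flag names here are invented; use whatever low-cardinality flags your trade rows actually carry):

```sql
-- A junk dimension collapses low-cardinality flags into one small table.
CREATE TABLE DimTradeFlags (
    TradeFlagsKey int IDENTITY(1,1) PRIMARY KEY,
    IsShortSale   bit NOT NULL,
    IsAlgoOrder   bit NOT NULL,
    IsAfterHours  bit NOT NULL,
    CONSTRAINT UQ_DimTradeFlags UNIQUE (IsShortSale, IsAlgoOrder, IsAfterHours)
);

-- Pre-populate every combination once (2^3 = 8 rows); the fact then stores a single
-- TradeFlagsKey column instead of one column per flag.
INSERT INTO DimTradeFlags (IsShortSale, IsAlgoOrder, IsAfterHours)
SELECT s.v, a.v, h.v
FROM (VALUES (0), (1)) AS s(v)
CROSS JOIN (VALUES (0), (1)) AS a(v)
CROSS JOIN (VALUES (0), (1)) AS h(v);
```

A degenerate dimension, by contrast, is an identifier such as the trade number that stays on the fact row itself, with no dimension table behind it.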
You should also make sure your dimension keys are clear. You should not have multiple paths to a dimension (accountnumber: trade -> position -> user and trade -> user), as that will cause inconsistent results when querying, depending on which relationship you traverse.
I have a Fact Table Service1 (open or latest data) from Mart1 and another Fact Table Service2 (historic data) from Mart2. These tables share a few common measures and dimensions, but the underlying datasets are mutually exclusive.
Now the business wants to merge these two facts into one table in Tabular model to do Year over Year comparison.
Is it possible to combine these two facts, and if so, what should the approach be?
If not, how else can we achieve this?
Things to note:
Records in Fact table Service2 will never change
The dimension keys between Mart1 and Mart2 are not guaranteed to be the same
Are these Data Marts different databases? If so, you can create a calculated table that brings the two tables together. To do this in the 2016 tooling, at the bottom of the designer there is a little plus sign on the far right, next to the last table tab defined. When you hover over it, it says "Create new table from DAX formula". Write the DAX that selects from the first table unioned with the second table.
If the marts are in the same database, you can create a partition for each in the table properties. To do this you would create a tabular model, open a connection to the data source, and bring in the changing data. Then click on the table, then Partitions, click New, and grab the archived data. You would have to make sure the column definitions line up in order to do this.
As for the other issues you describe, it sounds as though you are using the Data Marts as your warehouses. Do you have access to the data before it was transformed into the Data Mart, where the surrogate keys were applied? I generally keep a Persisted Staging Area around for these cases. If you employ a Data Vault (http://learndatavault.com/) prior to your Data Mart creation, you could simply create a new Data Mart with both sets of data, and all of the dimension keys would be intact.
I recently built a simple data warehouse with two dimension tables and one fact table.
The first dimension holds the user input (queryId, DNA sequence, DNA database name, other parameters).
The second dimension holds the database description (databaseId, other parameters).
The fact table will hold the result of the search (queryId, databaseId, hits found, and other parameters describing the hit).
Now, where should I upload the data (the result)? Into the fact table, or into the dimension tables?
Where should I upload queryId and databaseId, since they are in the dimensions and in the fact? Sorry for this question, but I am new to DW.
Thanks a lot,
You have to create an ETL process that loads like this (this assumes we rebuild the DW on each import; the steps are different for incremental loading):
Truncate fact table
Truncate dimensions
Populate the dimensions (your keys should be in the dimensions)
Populate the fact with your dimension keys and measures
Then, when querying, you'll join your dimensions onto your fact via the keys.
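Putting those steps together for your two dimensions and the fact, a rough T-SQL sketch might look like this. The staging table stg.BlastResults and the measure names HitsFound/HitScore are assumptions standing in for "the result of the search":

```sql
TRUNCATE TABLE FactSearchResult;   -- fact first, so no rows reference the dimensions
DELETE FROM DimQuery;              -- DELETE rather than TRUNCATE in case foreign keys
DELETE FROM DimDatabase;           -- from the fact reference the dimension tables

INSERT INTO DimQuery (QueryId, DnaSequence, OtherParams)
SELECT DISTINCT QueryId, DnaSequence, OtherParams FROM stg.BlastResults;

INSERT INTO DimDatabase (DatabaseId, Description)
SELECT DISTINCT DatabaseId, Description FROM stg.BlastResults;

-- The fact stores only the keys plus the measures describing each hit.
INSERT INTO FactSearchResult (QueryId, DatabaseId, HitsFound, HitScore)
SELECT QueryId, DatabaseId, HitsFound, HitScore FROM stg.BlastResults;

-- Reporting: join the dimensions back onto the fact via the keys.
SELECT q.DnaSequence, d.Description, f.HitsFound
FROM FactSearchResult AS f
JOIN DimQuery    AS q ON q.QueryId    = f.QueryId
JOIN DimDatabase AS d ON d.DatabaseId = f.DatabaseId;
```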
Neither one.
You UPLOAD data into staging tables. Those are created for optimal upload speed. Staging tables may be flat, may be incomplete, and may require joining with other tables.
Then you use a loading process to load them from staging into the data warehouse.
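For example, the staging table assumed in the sketch above could be a deliberately flat, constraint-free table that you bulk-load first and transform afterwards. All names, the stg schema and the file path are illustrative:

```sql
-- Assumes a staging schema exists (CREATE SCHEMA stg).
CREATE TABLE stg.BlastResults (
    QueryId     nvarchar(64)  NULL,
    DnaSequence nvarchar(max) NULL,
    DatabaseId  nvarchar(64)  NULL,
    Description nvarchar(256) NULL,
    HitsFound   nvarchar(64)  NULL,   -- still text here; cast/validate during the load
    HitScore    nvarchar(64)  NULL
);

-- Typical pattern: bulk-load the raw file, then run the warehouse load from it.
BULK INSERT stg.BlastResults
FROM 'C:\loads\blast_results.csv'
WITH (FIELDTERMINATOR = ',', ROWTERMINATOR = '\n', FIRSTROW = 2);
```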
I am using SQL Server 2005 Express and Visual Studio 2008.
I have a database which has a table with 400 columns. Things were (just about) manageable until I had to perform bi-directional sync between several databases.
I am wondering what the arguments are for and against a 400-column table versus a 40-table database.
The table is not normalised and consists mainly of nvarchar(64) columns and some TEXT columns (there are no specific datatypes, as it was converted from text files).
There is one other table that links to this table in a 1-1 relationship (i.e. one entry relates to one entry in the 400-column table).
The table is a list of files containing parameters that are "plugged" into an application.
I look forward to your replies.
Thank you
Based on your process description, I would start with something like this. The model is simplified, does not capture history, etc. -- but it is a good starting point. Note: parameter = property.
- Setup is a collection of properties. One setup can have many properties, one property belongs to one setup only.
- Machine can have many setups, one setup belongs to one machine only.
- Property is of a specific type (temperature, run time, spindle speed), there can be many properties of a certain type.
- Measurement and trait are types of properties. Measurement is a numeric property, like speed. Trait is a descriptive property, like color or some text.
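A rough T-SQL sketch of that model, with illustrative names and types; the measurement/trait split is collapsed into two nullable columns here for brevity, though separate subtype tables would also work:

```sql
CREATE TABLE Machine (
    MachineId   int IDENTITY(1,1) PRIMARY KEY,
    MachineName nvarchar(64) NOT NULL
);

CREATE TABLE Setup (
    SetupId   int IDENTITY(1,1) PRIMARY KEY,
    MachineId int NOT NULL REFERENCES Machine(MachineId),  -- one setup belongs to one machine
    SetupDate datetime NOT NULL
);

CREATE TABLE PropertyType (
    PropertyTypeId int IDENTITY(1,1) PRIMARY KEY,
    TypeName       nvarchar(64) NOT NULL   -- e.g. temperature, run time, spindle speed
);

CREATE TABLE Property (
    PropertyId       int IDENTITY(1,1) PRIMARY KEY,
    SetupId          int NOT NULL REFERENCES Setup(SetupId),               -- one property belongs to one setup
    PropertyTypeId   int NOT NULL REFERENCES PropertyType(PropertyTypeId),
    MeasurementValue decimal(18,4) NULL,   -- filled for measurements (numeric properties)
    TraitValue       nvarchar(256) NULL    -- filled for traits (descriptive properties)
);
```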
For having a wide table:
Quick to report on as it's presumably denormalized and so no joins are needed.
Easy to understand for end-consumers as they don't need to hold a data model in their heads.
Against having a wide table:
Probably need to have multiple composite indexes to get good query performance
More difficult to maintain data consistency, i.e. you need to update multiple rows when data changes if that data appears on multiple rows
As you're having to update multiple rows and maintain multiple indexes, concurrent performance for updates may become an issue as locks escalate.
You might end up with records with loads of nulls in columns where the attribute isn't relevant to the entity on that row, which can make handling results awkward.
If lazy developers do a SELECT * from the table you end up dragging loads of data across the network, so you generally have to maintain suitable subset views.
So it all really depends on what you're doing. If the main purpose of the table is OLAP reporting and updates are infrequent and affect few rows then perhaps a wide, denormalized table is the right thing to have. In an OLTP environment then it's probably not and you should prefer narrower tables. (I generally design in 3NF and then denormalize for query performance as I go along.)
You could always take the approach of normalizing and providing a wide-view for readers if that's what they want to see.
Without knowing more about the situation it's not really possible to say more about the pros and cons in your particular circumstance.
Edit:
Given what you've said in your comments, have you considered just having a long & skinny name=value pair table, so you'd just have UserId, PropertyName, PropertyValue columns? You might want to add some other meta-attributes to it too: a timestamp, a version, or whatever. SQL Server is quite efficient at handling these sorts of tables, so don't discount a simple solution like this out of hand.
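A minimal sketch of such a table, using the column names above plus one optional meta-attribute; the table name PropertyBag is just a placeholder:

```sql
CREATE TABLE dbo.PropertyBag (
    UserId        int           NOT NULL,
    PropertyName  nvarchar(64)  NOT NULL,
    PropertyValue nvarchar(max) NULL,
    UpdatedAt     datetime      NOT NULL DEFAULT GETDATE(),  -- optional meta-attribute
    CONSTRAINT PK_PropertyBag PRIMARY KEY (UserId, PropertyName)
);

-- Reading one "record" back is just selecting its name/value pairs.
SELECT PropertyName, PropertyValue
FROM   dbo.PropertyBag
WHERE  UserId = 42;
```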
I have many tables in my database which are interrelated. I have a table (table one) into which data has been inserted and whose ID auto-increments. Once that row has an ID, I want to insert it into another table (table three) together with another set of IDs which come from a form (this data will also be going into a table, so it could come from that table), the same form the data that went into the first table came from.
The two ID's together make the primary key of the third table.
How can I do this? It's to show that more than one ID is joined to a single ID for something else.
Thanks.
You can't do that through a trigger, as the trigger only has available to it the data that you already inserted, not data that is currently residing only in your user interface.
Normally the way you handle this situation is that you write a stored proc that inserts the meeting and returns the id value (using scope_identity() in SQL Server, but I'm sure other databases have a method to return the auto-generated id as well). Then you would use that value to insert into the other table along with the other values you need for that table. You would of course want to wrap the whole thing in a transaction.
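A hedged sketch of that pattern, assuming SQL Server; the table, column and parameter names (Meeting, MeetingAttendee, @Title, @PersonId) are placeholders for your own:

```sql
CREATE PROCEDURE dbo.AddMeetingWithAttendee
    @Title    nvarchar(128),
    @PersonId int              -- the second id, collected on the same form
AS
BEGIN
    SET NOCOUNT ON;
    SET XACT_ABORT ON;         -- any runtime error rolls back the whole transaction

    DECLARE @MeetingId int;

    BEGIN TRANSACTION;

    -- Insert into "table one"; its IDENTITY column generates the new id.
    INSERT INTO dbo.Meeting (Title) VALUES (@Title);
    SET @MeetingId = SCOPE_IDENTITY();

    -- Insert into "table three" using both ids, which together form its primary key.
    INSERT INTO dbo.MeetingAttendee (MeetingId, PersonId)
    VALUES (@MeetingId, @PersonId);

    COMMIT TRANSACTION;

    SELECT @MeetingId AS MeetingId;   -- hand the new id back to the caller
END;
```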
I think you can probably do what you're describing (just write the INSERTs to table 3 in the table 1 trigger), but you'll have to put the additional info for the table 3 rows into your table 1 row, which isn't very smart.
I can't see why you would do that instead of writing the INSERTs in your code, where someone reading it can see what's happening.
The trouble with triggers is that they make it easy to hide business logic in the database. I think (and I believe I'm in the majority here) that it's easier to understand, manage, maintain and generally all-round deal with an application where all the business rules exist in the same general area.
There are reasons to use triggers (for propagating denormalised values, for example), just as there are reasons for using stored procedures. I'm going to assert that they are largely related to performance-critical areas. Or should be.