Upload data in star data warehouse - sql-server

I recently built a simple data warehouse with two dimension tables and one fact table.
The first dimension holds the user input: "queryId, dna sequence, dna database name, other parameters".
The second dimension holds the database description: "databaseId, other parameters".
The fact table will hold the results of the search: "queryId, databaseId, hits found, other parameters describing the hit".
Now, where should I load the data (the results)? Into the fact table, or into the dimension tables?
And where should I load "queryId" and "databaseId", given that they appear both in the dimensions and in the fact table? Sorry for this question, but I am new to DW.
Thanks a lot,

You have to create an ETL process that loads like this (this assumes we rebuild the DW on each import; the steps differ for incremental loading):
Truncate fact table
Truncate dimensions
Populate the dimensions (your keys should be in the dimensions)
Populate the fact table with your dimension keys and measures
Then, when querying, you'll join your dimensions onto your fact via the keys.
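As a minimal T-SQL sketch of that full-rebuild load -- all table, column, and staging names below are assumptions based on the question, not a fixed schema:

    -- 1. Empty the fact first, then the dimensions.
    TRUNCATE TABLE FactSearchResult;
    DELETE FROM DimQuery;        -- TRUNCATE is not allowed on tables referenced by a foreign key
    DELETE FROM DimDnaDatabase;

    -- 2. Populate the dimensions from staging.
    INSERT INTO DimQuery (QueryId, DnaSequence, DnaDatabaseName)
    SELECT DISTINCT QueryId, DnaSequence, DnaDatabaseName
    FROM stg_SearchInput;

    INSERT INTO DimDnaDatabase (DatabaseId, Description)
    SELECT DISTINCT DatabaseId, Description
    FROM stg_DatabaseInfo;

    -- 3. Populate the fact with the dimension keys and the measures.
    INSERT INTO FactSearchResult (QueryId, DatabaseId, HitsFound)
    SELECT s.QueryId, s.DatabaseId, s.HitsFound
    FROM stg_SearchResults AS s
    JOIN DimQuery AS q ON q.QueryId = s.QueryId
    JOIN DimDnaDatabase AS d ON d.DatabaseId = s.DatabaseId;

So queryId and databaseId live in the dimensions as keys, and the fact table simply repeats them as foreign keys alongside the measures.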

Neither.
You UPLOAD data into staging tables. Those are created for optimal upload speed. Staging tables may be flat, may be incomplete, and may require joining with other tables.
Then you use a loading process to load them from staging into the data warehouse.
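For example, a staging table is often a plain heap with no keys or indexes so that bulk loads run as fast as possible; the table name and file path here are placeholders:

    -- Hypothetical flat staging table: no constraints, no indexes.
    CREATE TABLE stg_SearchResults (
        QueryId    INT,
        DatabaseId INT,
        HitsFound  INT
    );

    -- Minimally logged bulk load into staging.
    BULK INSERT stg_SearchResults
    FROM 'C:\loads\search_results.csv'
    WITH (FIELDTERMINATOR = ',', ROWTERMINATOR = '\n', TABLOCK);

The separate loading process then moves the staged rows into the dimension and fact tables.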

Related

Star Schema from multiple source tables

I am struggling to figure out how to create a star schema from multiple source tables. I work at a trading firm, so the data is related to user trading activity. The issue I am having is that our datasets do not have primary ids for every field that could be a dimension. Instead, we usually relate our data together using the combination of date and account number. Here is an example of 3 source tables...
I would like to turn this into a star schema, something that looks like ...
Is my only option to denormalize my source tables into one wide table (joining trades to positions on account number and date, and joining the users table on account number), create keys for each dimension, and then re-normalize it into the star schema? Are star schemas ever built from multiple source tables?
Star schemas are almost always created from multiple source tables.
The normal process is:
Populate your dimension tables
Create a temporary/virtual fact record using your source data
Using this fact record, look up the relevant dimension keys
Write the actual fact record to your target fact table
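A minimal T-SQL sketch of steps 2-4 for your trading example, assuming hypothetical DimAccount and DimDate tables whose surrogate keys are looked up via the natural keys (account number and date):

    -- All names here are assumptions. The joins translate natural keys
    -- into the dimensions' surrogate keys before the fact row is written.
    INSERT INTO FactTrade (AccountKey, DateKey, Quantity, Price)
    SELECT
        a.AccountKey,   -- surrogate key from DimAccount
        d.DateKey,      -- surrogate key from DimDate
        t.Quantity,
        t.Price
    FROM src_Trades AS t
    JOIN DimAccount AS a ON a.AccountNumber = t.AccountNumber
    JOIN DimDate    AS d ON d.FullDate      = t.TradeDate;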
Data-warehousing is about query speed. The data-warehouse should not be concerned with data integrity. IT SHOULD NOT CLEAN OR CORRECT BAD DATA. It only needs to gather all the data together into a single record to present to the model for analysis. Denormalizing the data is how this is done.
In a star schema, dimensions do not know about each other and have no relationships with other dimensions. In a snowflake, dimensions are related to other dimensions. That is the primary difference between star and snowflake.
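To illustrate the difference with a hypothetical product dimension (names assumed): in a star, related attributes are flattened into the dimension itself; in a snowflake, they are normalised into their own dimension:

    -- Star: category attributes denormalised into the dimension itself.
    CREATE TABLE DimProduct (
        ProductKey   INT IDENTITY PRIMARY KEY,
        ProductName  VARCHAR(100),
        CategoryName VARCHAR(100)  -- flattened copy; no link to another dimension
    );

    -- Snowflake: the same attribute moved to a related dimension.
    CREATE TABLE DimCategory (
        CategoryKey  INT IDENTITY PRIMARY KEY,
        CategoryName VARCHAR(100)
    );
    CREATE TABLE DimProductSnowflaked (
        ProductKey  INT IDENTITY PRIMARY KEY,
        ProductName VARCHAR(100),
        CategoryKey INT REFERENCES DimCategory (CategoryKey)
    );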
All the metadata options for events are rolled up into dimensions and used for slicing/filtering. All the measurable/calculation data for an event are in the event fact, along with a reference to the dimension(s) containing the relevant metadata. The Metadata/Dimension is reused across multiple fact records.
Based on the limited example you've provided, I'd suggest you research degenerate dimensions and junk dimensions. Your Trade and Position data may need to be turned into a fact and a dimension (degenerate), and some of your flag attributes may be best placed into a junk dimension.
You should also make sure your dimension keys are clear. You should not have multiple paths to a dimension (accountnumber: trade -> position -> user & trade -> user), as that will cause inconsistent results when querying, depending on which relationship you traverse.

SSIS Move Data Between Databases - Maintain Referential Integrity

I need to move data between two databases and wanted to see if SSIS would be a good tool. I've pieced together the following solution, but it is much more complex than I was hoping it would be - any insight on a better approach to tackling this problem would be greatly appreciated!
So here is what makes my situation unique: we have a large volume of data, so to keep the system performant we have split our customers across multiple database servers. These servers have databases with the same schema, but each is populated with unique data. Occasionally we need to move a customer's data from one server to another. Because of this, simply recreating the tables and moving the data as-is won't work: the database on server A could have 20 records in a table while the database on server B has 30 records in the same table, so when moving record 20 from A to B, it will need to be assigned ID 31. Getting past this wasn't difficult, but the trouble comes when needing to move the tables which have a foreign key reference to what is now record 31....
An example:
Here's a sample schema for a simple example:
There is a table to track manufacturers, and a table to track products which each reference a manufacturer.
Example of data in the source database:
To handle moving this data while maintaining relational integrity, I've taken the approach of gathering the manufacturer records, looping through them, and for each manufacturer moving the associated products. Here's a high level look at the Control Flow in SSDT:
The first Data Flow grabs the records from the source database and pulls them into a Recordset Destination:
The OLE DB Source pulls all columns from the source database's manufacturer table and places them into a recordset:
Back in the control flow, I then loop through the records in the Manufacturer recordset:
For each record in the manufacturer recordset, I then execute a SQL task which determines the next available auto-incrementing ID in the destination database, inserts the record, and returns the result of a SELECT MAX(ManufacturerID) in the Execute SQL Task result set, so that the newly created ManufacturerID can be used when inserting the related products into the destination database:
The above works; however, once you get more than a few layers deep in tables that reference one another, this is no longer very tenable. Is there a better way to do this?
You could always try this:
Populate your manufacturers table.
Get your products data (ensure you have a reference to the manufacturer, such as its name).
Use a lookup to get the ID where the name, or whatever you choose, matches.
Insert into the destination database.
This will keep your FK constraints intact and not require you to do all that max-key selection.
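In plain T-SQL terms, that lookup approach amounts to something like the following (hypothetical column names, and assuming both databases are visible from one connection; across servers you would use a linked server and four-part names):

    -- 1. Copy the manufacturers; the destination assigns fresh identity values.
    INSERT INTO Destination.dbo.Manufacturers (Name)
    SELECT Name
    FROM Source.dbo.Manufacturers;

    -- 2. Copy the products, looking up each NEW ManufacturerID by name
    --    instead of carrying the old ID (or a MAX(ID) guess) across.
    INSERT INTO Destination.dbo.Products (Name, ManufacturerID)
    SELECT p.Name, dm.ManufacturerID
    FROM Source.dbo.Products AS p
    JOIN Source.dbo.Manufacturers AS sm ON sm.ManufacturerID = p.ManufacturerID
    JOIN Destination.dbo.Manufacturers AS dm ON dm.Name = sm.Name;

This relies on the lookup column (Name here) being unique per manufacturer, and the set-based join also avoids the row-by-row SELECT MAX approach, which is unsafe if anything else writes to the table concurrently.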

Combine two fact tables from two different marts into one Tabular model

I have a fact table Service1 (open or latest data) from Mart1 and another fact table Service2 (historic data) from Mart2. These tables share a few common measures and dimensions, but the underlying datasets are mutually exclusive.
Now the business wants to merge these two facts into one table in a Tabular model to do year-over-year comparisons.
Is it possible to combine these two facts, and if so, what should the approach be?
If not, how else can we achieve this?
Things to note:
Records in Fact table Service2 will never change
The dimension keys between Mart1 and Mart2 are not guaranteed to be the same
Are these Data Marts different databases? If so, you can create a calculated table that brings the two tables together. To do this, in 2016, on the bottom of the designer, there is a little plus sign on the far right next to the last table tab defined. When you hover over it, it will say "Create new table from DAX formula". Create the DAX that selects from the first table union the second table.
If the marts are in the same database, you can create a partition for each on the table's properties. To do this you would create a tabular model, open a connection to the data source, and bring in the changing data. Then click on the table, then Partitions, click New, and grab the archived data. You would have to make sure the column definitions line up for this to work.
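Each partition is backed by its own source query; as a sketch with assumed table and column names, the two queries might be (the column lists must match exactly):

    -- Partition "Current": open/latest data.
    SELECT ServiceKey, DateKey, Amount
    FROM dbo.FactService1;

    -- Partition "History": records that never change.
    SELECT ServiceKey, DateKey, Amount
    FROM dbo.FactService2;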
As far the other issues that you describe, it sounds as though you are using the Data Marts as your warehouses. Do you have access to the data before it was transformed into the Data Mart where the surrogate keys were applied? I generally keep a Persisted Staging Area around for these cases. If you employ a Data Vault (http://learndatavault.com/) prior to your Data Mart creation, you could simply create a new Data Mart with both sets of data and all of the Dimension keys will be intact.

Basic questions regarding Data Warehousing

I want to use OLAP cubes and first have to design a data warehouse. I am going for the star schema. I'm a little confused about how to convert a normal database into a data warehouse, especially with regard to foreign keys between dimension tables. I know a fact table has foreign keys to dimensions, but do dimensions have foreign keys between them? For example, what do I need to do with the following 2 examples:
TABLE: Airports
COLUMNS: Id, Name, Code, CityId
When I make the Airports dimension, do I remove CityId and put the City Name instead? Or what?
TABLE: Regions
COLUMNS: Id, Name, RegionType, ParentId
The question for this one is mostly the same, but a bit more complex, because here ParentId refers to the same table (Regions). For example, a City can refer to a parent Country record. How do I translate these over to a data warehouse star schema?
Lastly, regarding measures, those go on the fact table, right? I think I will likely need multiple fact tables. Is that normal? Does one fact table translate to one OLAP cube? Or what?
You want to include city within your airport dimension. You are intentionally flattening out your normalised schema to aid the speed of the dimensional model, which can seem counter-intuitive if you are coming from transactional development.
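So the airport dimension might look like this, with the city attribute copied in rather than referenced (column names assumed):

    CREATE TABLE DimAirport (
        AirportKey INT IDENTITY PRIMARY KEY,  -- surrogate key
        Code       CHAR(3),
        Name       VARCHAR(100),
        CityName   VARCHAR(100)  -- flattened in; no CityId, no FK to a city table
    );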
With regard to the parent-child relationship, you want the ParentId to be translated into the surrogate key of the parent region record. SSAS provides the functionality to relate parent-child records when you are designing your cube.
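For the self-referencing Regions table, that means the dimension carries a parent surrogate key, which SSAS can consume as a parent-child hierarchy; a sketch with assumed names:

    CREATE TABLE DimRegion (
        RegionKey       INT IDENTITY PRIMARY KEY,  -- surrogate key
        RegionId        INT,                       -- business key from the source table
        Name            VARCHAR(100),
        RegionType      VARCHAR(50),
        ParentRegionKey INT NULL REFERENCES DimRegion (RegionKey)  -- surrogate of the parent row
    );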
Multiple facts are not unusual, but unless the fact data is completely unrelated, there is no need to separate them into different cubes. The requirement for multiple fact tables is usually driven by having data at different grains. Keep all of your related metrics (i.e. flights) together, but separate out flight metrics from, say, food sale metrics.
You are not converting to a data warehouse; you are creating a new data warehouse with a few dimensions and (at least) one fact table. Dimension tables are loaded first, and you do NOT want to replace the id with the name.
You need an additional surrogate key for each dimension table. Once the dimensions are loaded, I usually use an SSIS package to load the fact table (either as an incremental load, or truncating the fact table before each load of new data, depending on what you need)...

Insert into a star-schema

I've read a lot about star schemas, about fact/dimension tables, and about SELECT statements to quickly report data; however, the matter of data entry into a star schema remains unclear to me. How does one "theoretically" enter data into a star-schema DB while maintaining the fact table? Is a series of INSERT INTO statements within a giant stored proc with 20 params my only option (and how do I populate the fact table)?
Many thanks.
Start with dimensions first -- one by one. Use the ECCD (Extract, Clean, Conform, Deliver) approach.
Make sure that each dimension has a BusinessKey that uniquely identifies the "object" that a dimension row describes -- like email for a person.
With dimensions loaded, prepare the key-lookup pipeline. In general, for each dimension table you can prepare a key-lookup table (BusinessKey, PrimaryKey). Some designers choose to look up the dimension table directly, but the key-lookup table can often be easily cached in memory, which results in faster fact loading.
Use ECCD for the fact data too. The ECC part happens in the staging area; you can choose (helper) tables or flat files for each step of the ECC, as you prefer.
While delivering fact tables, replace each BusinessKey in the fact row with the matching PrimaryKey that you get from a key-lookup table. Once all BusinessKeys are replaced with their matching PrimaryKeys, insert the row into the fact table.
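As a sketch with assumed names, using email as the person dimension's BusinessKey, the key-lookup table and the delivery step might look like this:

    -- Build the key-lookup table (BusinessKey -> PrimaryKey) after the dimension is loaded.
    SELECT Email AS BusinessKey, PersonKey AS PrimaryKey
    INTO lkp_Person
    FROM DimPerson;

    -- Deliver the facts: swap each BusinessKey for its surrogate PrimaryKey.
    INSERT INTO FactOrder (PersonKey, OrderAmount)
    SELECT l.PrimaryKey, s.OrderAmount
    FROM stg_Orders AS s
    JOIN lkp_Person AS l ON l.BusinessKey = s.CustomerEmail;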
Do not waste your time; use an ETL tool. You can download Pentaho Kettle (Community Edition) for free -- it has everything one needs to achieve this.
You typically do not insert data into a star schema the way you might into a normal form -- i.e. with a stored procedure which inserts/updates all the appropriate tables within a single transaction. Remember that the star schema is typically a read-only, denormalized model of the data: it is (rarely) treated transactionally, and is typically loaded from data that is already denormalized flat -- usually one flat file per star.
As Damir points out, you typically load all the dimensions first (handling slowly changing dimensions, etc.), then load the facts, joining to the appropriate current dimension rows to find the dimension IDs (using the business keys).
