Loading Fact tables from SCD1 and SCD2 Dimension in SSIS - sql-server

I am finding it difficult to understand how you get the history data from a fact table joined to a dimension that has Type 2 and Type 1 attributes for historic records that have changed. Currently I have a Surrogate Key and a Business Key in the dimension. The fact table holds the Surrogate Key, and I am currently using the SSIS Lookup component to bring back the row that has the CurrentFlag set to Yes.
However, I am joining on the Business Key in the Lookup and returning the Surrogate Key, which I know is the main reason I can't get history. Even if I join on the Business Key as I am currently doing and also return the Business Key, the SSIS Lookup component will only bring back one row, regardless of how many versions of history exist against that Business Key.
I have been told to use lookups to populate fact tables, but this doesn't seem to give me the history, as the lookup only ever returns one row. So I just want to know how to return historic data between a fact and a dimension in SSIS.
Thank you

There are a few caveats when it comes to historical dimensions. Your end users will need to know what it is you are presenting and understand the differences.
For example, consider the following scenario:
Customer A is located in Las Vegas in January 2017. They place an order for Product 123, which at that time costs $125.
Now, it's August. In the meantime, the Customer moved to Washington D.C. in May, and Product 123 was updated in July to cost $145.
Your end users will need to tell you what they want to see. If you are not tracking history at all, and simply truncate and load everything on a daily basis, your order report would show the following:
Customer A, located in Washington D.C., placed an order for $145 in January.
If you implement proper history tracking, with logic to identify the start and end date of each row in the dimension, you would join the fact table to the dimension using the natural key as well as the matching date interval. This should return a single dimension row for every fact row. If it returns more than one, you have overlapping dates.
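To make the range join concrete, here is a minimal T-SQL sketch with hypothetical table and column names (FactStage, DimCustomer, CustomerBK, ValidFrom, ValidTo). In SSIS you could run something like this in the data-flow source query or a Merge Join, or in a Lookup set to partial/no cache with a custom parameterised query, since the default full-cache Lookup only supports equality matches:

SELECT f.OrderID,
       f.OrderDate,
       d.CustomerSK          -- surrogate key of the version valid on the order date
FROM   staging.FactStage AS f
JOIN   dbo.DimCustomer   AS d
  ON   d.CustomerBK = f.CustomerBK
 AND   f.OrderDate >= d.ValidFrom
 AND   f.OrderDate <  COALESCE(d.ValidTo, '9999-12-31');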
Can you show us the logic where you receive only a single value from the lookup, even though you have more records?

Related

Slowly Changing Dimension Vs Fact

I am a rookie at DW. I have a Customer table with basic columns that rarely change, like Name, JoinedOn etc., and another set of columns that can change over time, like "Status","CustomerType","PublishStatus","BusinessStatus","CurrentOwner" etc. At the moment there is no history. In the DW I would like to track when the following columns change: "Status","CustomerType","PublishStatus","BusinessStatus","CurrentOwner". I feel it would be better if I create another table to track these; the table will have the following columns:
"CustomerId", "Status","CustomerType","PublishStatus","BusinessStatus","CurrentOwner", "ExpiredOn","IsCurrent"
Is this the right approach? And if yes then is this new table a fact or a slowly changing dimension? I would like to run queries like when did the CustomerType change from A to B? When was it Published? When the BusinessStatus changed who was the owner?
Which way of modeling the customers is better really depends on how you are going to use the corresponding dimension. If you want, for example, to summarize some sales associated with the "Customer" dimension by "CustomerType" at the time of the sale, you can do that only if you keep the historical details as part of a slowly changing dimension.
You can probably run a lot of customer reports right on that table that represents the slowly changing "Customer" dimension. But if the number of your customers gets into the millions, you'd be better off creating a separate fact table (or tables) for customer status changes.
So, to summarize: start off with a slowly changing dimension. If the number of customers grows too large and reports on customer status changes become too slow, add a fact table for them and don't worry about duplicating data.
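As a rough sketch, such a Type 2 customer dimension could look like this (column names taken from the question; the table name, data types and surrogate key are assumptions, SQL Server-style syntax):

CREATE TABLE DimCustomer (
    CustomerKey    INT IDENTITY(1,1) PRIMARY KEY,  -- surrogate key, one per version
    CustomerId     INT NOT NULL,                   -- business key from the source table
    Status         VARCHAR(50),
    CustomerType   VARCHAR(50),
    PublishStatus  VARCHAR(50),
    BusinessStatus VARCHAR(50),
    CurrentOwner   VARCHAR(100),
    EffectiveFrom  DATETIME NOT NULL,
    ExpiredOn      DATETIME NULL,                  -- NULL while the row is current
    IsCurrent      BIT NOT NULL DEFAULT 1
);

Questions like "when did CustomerType change from A to B?" then become a query over consecutive versions of the same CustomerId ordered by EffectiveFrom.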

data synchronization from unreliable data source to SQL table

I am looking for a pattern, framework, or best practice to handle the generic problem of application-level data synchronisation.
Let's take an example with only 1 table to make it easier.
I have an unreliable data source for a product catalog. Data can occasionally be unavailable, incomplete, or inconsistent (issues might come from manual data entry errors, ETL failures, ...).
I have a live copy in a MySQL table in use by a live system, let's say a website.
I need to implement a safety mechanism when updating the MySQL table to "synchronize" with the original data source. Here are the safety criteria and the solutions I am suggesting:
avoid deleting records when they temporarily disappear from the data source => use a "deleted" boolean/date column or an archive/history table.
check for inconsistent changes => configure rules per column, such as: should never change, should only increment.
check for integrity issue => (standard problem, no point discussing approach)
ability to roll back the last sync => restore from a history table? use a version increment/date column?
What I am looking for are best practices and patterns/tools to handle such a problem. Even if you are not pointing to THE solution, I would be grateful for any keyword suggestions that would help me narrow down which fields of expertise to explore.
We have the same problem importing data from web analytics providers - they suffer the same problems as your catalog. This is what we did:
Every import/sync is assigned a unique id (auto_increment int64)
Every table has a history table that is identical to the original but has an additional column "superseded_id", which gets the import id of the import that changed the row (deletion is a change); the primary key is (row_id, superseded_id)
Every UPDATE copies the row to the history table before changing it
Every DELETE moves the row to the history table
This makes rollback very easy:
Find out the import_id of the bad import
REPLACE INTO main_table SELECT <everything but superseded_id> FROM history_table WHERE superseded_id=<bad import id>
DELETE FROM history_table WHERE superseded_id>=<bad import id>
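A minimal MySQL-style sketch of this scheme, with hypothetical column names (row_id, name, price) standing in for the real catalog columns:

-- History table mirrors the main table, plus the id of the import that superseded the row.
CREATE TABLE main_table_history (
    row_id        BIGINT NOT NULL,
    name          VARCHAR(200),
    price         DECIMAL(10,2),
    superseded_id BIGINT NOT NULL,       -- id of the import that changed/deleted this version
    PRIMARY KEY (row_id, superseded_id)
);

-- Before import 42 updates or deletes row 123, copy the old version into history:
INSERT INTO main_table_history (row_id, name, price, superseded_id)
SELECT row_id, name, price, 42
FROM   main_table
WHERE  row_id = 123;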
For databases where performance is a problem, we do this in a secondary database on a different server, then copy the found-to-be-good main table to the production database into a new table main_table_$id, with $id being the highest import id, and have main_table be a trivial view doing SELECT * FROM main_table_$someid. By redefining the view to SELECT * FROM main_table_$newid we can atomically switch the table.
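The atomic switch at the end could look like this in MySQL (the table suffix is hypothetical):

-- Point the view at the newly verified copy; readers see the change atomically.
CREATE OR REPLACE VIEW main_table AS
SELECT * FROM main_table_57;    -- 57 = highest good import id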
I'm not aware of a single solution to all this - probably because each project is so different. However, here are two techniques I've used in the past:
Embed the concept of version and validity into your data model
This is a way to deal with change over time without having to resort to history tables; it does complicate your queries, so you should use it sparingly.
For instance, instead of having a product table as follows
PRODUCTS
Product_ID primary key
Price
Description
AvailableFlag
In this model, if you want to delete a product, you execute "delete from products where product_id = ..."; modifying the price would be "update products set price = 1 where product_id = ...."
With the versioned model, you have:
PRODUCTS
product_ID primary key
valid_from datetime
valid_until datetime
deleted_flag
Price
Description
AvailableFlag
In this model, deleting a product requires you to update products set valid_until = getdate() where product_id = xxx and valid_until is null, and then insert a new row with the "deleted_flag = true".
Changing price works the same way.
This means that you can run queries against your "dirty" data and insert it into this table without worrying about deleting items that were accidentally missed off the import. It also allows you to see the evolution of the record over time, and roll-back easily.
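A minimal sketch of the versioned model and the soft delete it implies (SQL Server-style syntax; the data types are assumptions):

CREATE TABLE products (
    product_id     INT NOT NULL,
    valid_from     DATETIME NOT NULL,
    valid_until    DATETIME NULL,            -- NULL = current version
    deleted_flag   BIT NOT NULL DEFAULT 0,
    price          DECIMAL(10,2),
    description    VARCHAR(200),
    available_flag BIT,
    PRIMARY KEY (product_id, valid_from)
);

-- "Deleting" product 123: close the current version, then add a deleted marker row.
UPDATE products
SET    valid_until = GETDATE()
WHERE  product_id = 123 AND valid_until IS NULL;

INSERT INTO products (product_id, valid_from, deleted_flag, price, description, available_flag)
SELECT product_id, GETDATE(), 1, price, description, available_flag
FROM   products
WHERE  product_id = 123
  AND  valid_from = (SELECT MAX(valid_from) FROM products WHERE product_id = 123);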
Use a ledger-like mechanism for cumulative values
Where you have things like "number of products in stock", it helps to create transactions to modify the amount, rather than take the current amount from your data feed.
For instance, instead of having a amount_in_stock column on your products table, have a "product_stock_transaction" table:
product_stock_transactions

product_id (FK)   transaction_date   transaction_quantity   transaction_source
1                 1 Jan 2012         100                    product_feed
1                 2 Jan 2012          -3                    stock_adjust_feed
1                 3 Jan 2012          10                    product_feed
On 2 Jan, the quantity in stock was 97; on 3 Jan, 107.
This design allows you to keep track of adjustments and their source, and is easier to manage when moving data from multiple sources.
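With this design, the stock level at any date is just a sum over the ledger (column names as in the sketch above):

-- Quantity in stock for product 1 as of 2 Jan 2012: 100 - 3 = 97
SELECT SUM(transaction_quantity) AS quantity_in_stock
FROM   product_stock_transactions
WHERE  product_id = 1
  AND  transaction_date <= '2012-01-02';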
Both approaches can create large amounts of data - depending on the number of imports and the amount of data - and can lead to complex queries to retrieve relatively simple data sets.
It's hard to plan for performance concerns up front - I've seen both "history" and "ledger" work with large amounts of data. However, as Eugen says in his comment below, if you get to an excessively large ledger, it may be necessary to clean up the ledger table by summarizing the current levels and deleting (or archiving) old records.

Athletics Ranking Database - Number of Tables

I'm fairly new to this so you may have to bear with me. I'm developing a database for a website with athletics rankings on them and I was curious as to how many tables would be the most efficient way of achieving this.
I currently have 2 tables: a table called 'athletes', which holds the details of all my runners (potentially around 600 people/records) and contains the following fields:
mid (member id - primary key)
firstname
lastname
gender
birthday
nationality
And a second table, 'results', which holds all of their performances and has the following fields:
mid
eid (event id - primary key)
eventdate
eventcategory (road, track, field etc)
eventdescription (100m, 200m, 400m etc)
hours
minutes
seconds
distance
points
location
The second table has around 2000 records in it already and potentially this will quadruple over time, mainly because there are around 30 track events, 10 field, 10 road, cross country, relays, multi-events etc and if there are 600 athletes in my first table, that equates to a large amount of records in my second table.
So what I was wondering is would it be cleaner/more efficient to have multiple tables to separate track, field, cross country etc?
I want to use the database to order people's results based on their performance. If you would like to understand better what I am trying to emulate, take a look at this website: http://thepowerof10.info
Changing the schema won't change the number of results. Even if you split the venue into a separate table, you'll still have one result per participant at each event.
The potential benefit of having a separate venue table would be better normalization. A runner can have many results, and a given venue can have many results on a given date. You won't have to repeat the venue information in every result record.
You'll want to pay attention to indexes. Every table must have a primary key. Add additional indexes for columns you use in WHERE clauses when you select.
Here's a discussion about normalization and what it can mean for you.
PS - Thousands of records won't be an issue. Large databases are on the order of giga- or tera-bytes.
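For example, if rankings are usually pulled per event and ordered by date, indexes along these lines would support those WHERE clauses (the exact column choice is an assumption about your typical queries):

CREATE INDEX ix_results_event   ON results (eventcategory, eventdescription, eventdate);
CREATE INDEX ix_results_athlete ON results (mid);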
My thought --
Don't break your events table into separate tables for each type (track, field, etc.). You'll have a much easier time querying the data back out if it's all there in the same table.
Otherwise, your two tables look fine -- it's a good start.

What is the best way to represent a constrained many-to-many relationship within a relational database?

I would like to establish a many-to-many relationship with a constraint that only one or no entity from each side of the relationship can be linked at any one time.
A good analogy to the problem is cars and parking garage spaces. There are many cars and many spaces. A space can contain one car or be empty; a car can only be in one space at a time, or no space (not parked).
We have a Cars table and a Spaces table (and possibly a linking table). Each row in the cars table represents a unique instance of a car (with license, owner, model, etc.) and each row in the Spaces table represents a unique parking space (with address of garage floor level, row and number). What is the best way to link these tables in the database and enforce the constraint describe above?
(I am using C#, NHibernate and Oracle.)
If you're not worried about history - ie only worried about "now", do this:
create table parking (
  car_id   integer not null references car,
  space_id integer not null references space,
  unique (car_id),
  unique (space_id)
);
By making both the car and space references unique, you restrict each side to a maximum of one link - i.e. a car can be parked in at most one space, and a space can have at most one car parked in it.
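For example (hypothetical ids), the second insert below is rejected by the unique constraint because space 7 is already occupied:

insert into parking (car_id, space_id) values (1, 7);  -- car 1 parks in space 7
insert into parking (car_id, space_id) values (2, 7);  -- fails: space 7 already has a car
delete from parking where car_id = 1;                  -- car 1 leaves (no longer parked)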
In any relational database, many-to-many relationships must have a join table to represent the combinations. As the other answer shows (but without much of the theoretical background), you cannot represent a many-to-many relationship without having a table in the middle to store all the combinations.
It was also mentioned in that solution that it only solves your problem if you don't need history. Trust me when I tell you that real-world applications almost always need to represent historical data. There are many ways to do this, but a simple method is to create what's called a ternary relationship with an additional table. You could, in theory, create a "time" table that links its primary key (say, a distinct timestamp) with the inherited keys of the other two source tables. This would enable you to prevent errors where two cars are located in the same parking spot at the same time. Using a time table also lets you re-use the same time data for multiple parking spots via a simple integer id.
So, your data tables might look like so
table car
car_id (integers/numbers are fastest to index)
...
table parking-space
space_id
location
table timeslot
time_id integer
begin_datetime (don't use seconds unless you must!)
end_datetime (don't use seconds unless you must!)
Now, here's where it gets fun. You add a middle table with a composite primary key made up of car.car_id + parking_space.space_id + time_id. There are other things you could add to optimize here, but you get the idea, I hope.
table reservation
car_id PK
parking_space_id PK
time_id PK (it's an integer - just keep the granularity coarse - 30-minute increments or something - if you allow this to include seconds/milliseconds/etc., the advantages are cancelled out because you can't re-use the same value from the time table)
(this would also be the place to store variable rates, discounts, etc distinct to this particular account, reservation, etc).
Now you can reduce the amount of data, because you aren't replicating the timestamp in the join table (reservation). By using an integer, you can re-use that timeslot for multiple parking spaces, but you could also apply a constraint preventing two cars from renting a given spot for the same "timeslot" on a given day/timeframe. This would also make it easier to store some history about the customers - who knows, you might want to see reports on customers who rent more often and offer them discounts or something.
By using the ternary relationship model, you are making each spot unique to a given timeslot (perhaps with some added validation rules), so the system can only store one car in one parking spot for one given time period.
By using integers as keys instead of timestamps, you are assured that the database won't need to do any heavy lifting to index the keys and sort / query against. This is a common practice in data warehousing / OLAP reporting when you have massive datasets and you need efficiency. I think it applies here as well.
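A rough sketch of this ternary model (names follow the outline above; the data types and the extra unique constraint are assumptions):

create table car (
  car_id integer primary key
  -- ...
);

create table parking_space (
  space_id integer primary key,
  location varchar(100)
);

create table timeslot (
  time_id        integer primary key,
  begin_datetime date,
  end_datetime   date
);

create table reservation (
  car_id   integer references car,
  space_id integer references parking_space,
  time_id  integer references timeslot,
  primary key (car_id, space_id, time_id),
  unique (space_id, time_id)  -- no two cars in the same spot for the same timeslot
);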
Create a third table.
parking
--------
car_id
space_id
start_dt
end_dt
For the constraint, I guess the problem with your situation is that you need to check a complex rule against the intersection table itself. If you try this in a trigger, it will report a mutation.
One way to avoid this would be to replicate the table, and query against that copy for the constraint.

How to handle an immutable table referencing mutable tables?

In making a pretty standard online store in .NET, I've run in to a bit of an architectural conundrum regarding my database. I have a table "Orders", referenced by a table "OrderItems". The latter references a table "Products".
Now, the orders and orderitems tables are in most aspects immutable, that is, an order created and its orderitems should look the same no matter when you're looking at the tables (for instance, printing a receipt for an order for bookkeeping each year should yield the same receipt the customer got at the time of the order).
I can think of two ways of achieving this behavior, one of which is in use today:
1. Denormalization, where values such as price of a product are copied to the orderitem table.
2. Making referenced tables immutable. The code that handles products could create a new product whenever a value such as the price is changed. Mutable tables referencing the products table would have their references updated, whereas the immutable ones would be fine and dandy with their old references.
What is your preferred way of doing this? Is there a better, more clever way of doing this?
It depends. I'm working on a quite complex piece of enterprise software that includes a kind of document management and auditing and is used in pharmacy.
Normally, primitive values are denormalized. For instance, if you just need the state of the customer at the time the order was created, I would store it in the order.
There is always more complex data that needs to be available at almost every point in time. There are two approaches: you create a history of it, or you implement a revision control system, which is almost the same thing.
The history means that every state that ever existed is stored as a separate record, in the same or another table.
I implemented a revision control system, where I split records into two tables: one for the actual item, let's say a product, and the other one for its versions. This way I can reference the product as a whole, or any specific version of it, because each has its own primary key.
This system is used for many entities. I can safely reference an object under revision control from an audit trail, for instance, or from other immutable records. At the beginning such a system seems more complex, but in the end it is very straightforward and solves many problems at once.
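A minimal sketch of that two-table split (table and column names are hypothetical, generic SQL):

create table product (
  product_id         integer primary key,
  current_version_id integer             -- points at the latest row in product_version
);

create table product_version (
  version_id  integer primary key,
  product_id  integer not null references product,
  price       decimal(10,2),
  description varchar(200),
  valid_from  timestamp not null
);

-- An immutable record (an order item, an audit entry) can reference either the product
-- as a whole (product_id) or one frozen state of it (version_id).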
Storing the price in both the Product table and the OrderItem table is NOT denormalizing if the price can change over time. Normalization rules say that every "fact" should be recorded only once in the database. But in this case, just because both numbers are called "price" doesn't make them the same thing. One is the current price, the other is the price as of the date of the sale. These are very different things. Just like "customer zip code" and "store zip code" are completely different fields; the fact that both might be called "zip code" for short does not make them the same thing. Personally, I have a strong aversion to giving fields that hold different data the same name because it creates confusion. I would not call them both "Price": I would call one "Current_Price" and the other "Sale_Price" or something like that.
Not keeping the price at the time of the sale is clearly wrong. If we need to know this -- which we almost surely do -- then we need to save it.
Duplicating the entire product record for every sale or every time the price changes is also wrong. You almost surely have constant data about a product, like description and supplier, that does not change every time the price changes. If you duplicate the product record, you will be duplicating all this data, which definitely IS denormalization. This creates many potential problems. Like, if someone fixes a spelling error in the product description, we might now have the new record saying "4-slice toaster" while the old record says "4-slice taster". If we produce a report and sort on the description, they'll get separated and look like different products. Etc.
If the only data that changes about the product and that you care about is the price, then I'd just post the price into the OrderItem record.
If there's lots of data that changes, then you want to break the Product table into two tables: One for the data that is constant or whose history you don't care about, and another for data where you need to track the history. Like, have a ProductBase table with description, vendor, stock number, shipping weight, etc.; and a ProductMutable table with our cost, sale price, and anything else that routinely changes. You probably also want an as-of date, or at least an indication of which is current. The primary key of ProductMutable could then be Product_id plus As_of_date, or if you prefer simple sequential keys for all tables, fine, it at least has a reference to product_id. The OrderItem table references ProductMutable, NOT ProductBase. We find ProductBase via ProductMutable.
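A sketch of that split (table and key names follow the answer; the column types and the OrderItem columns are assumptions):

CREATE TABLE ProductBase (
    Product_Id      INT PRIMARY KEY,
    Description     VARCHAR(200),
    Vendor          VARCHAR(100),
    Stock_Number    VARCHAR(50),
    Shipping_Weight DECIMAL(8,2)
);

CREATE TABLE ProductMutable (
    ProductMutable_Id INT PRIMARY KEY,                        -- simple sequential key
    Product_Id        INT NOT NULL REFERENCES ProductBase,
    Our_Cost          DECIMAL(10,2),
    Sale_Price        DECIMAL(10,2),
    As_Of_Date        DATE NOT NULL
);

CREATE TABLE OrderItem (
    OrderItem_Id      INT PRIMARY KEY,
    Order_Id          INT NOT NULL,
    ProductMutable_Id INT NOT NULL REFERENCES ProductMutable, -- not ProductBase
    Quantity          INT NOT NULL
);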
I think Denormalization is the way to go.
Also, Product should not hold the price when it changes from time to time and when "price" means a different value to different people (retailers, customers, bulk sellers, etc.).
You could also have a price history table containing ProductID, FromDate, ToDate, Price, IsActive to maintain the price history for a product.
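For instance, a price history table along these lines (column names from the answer; the types are assumptions, SQL Server-style syntax):

CREATE TABLE ProductPriceHistory (
    ProductID INT           NOT NULL,
    FromDate  DATE          NOT NULL,
    ToDate    DATE          NULL,        -- NULL while the price is still in effect
    Price     DECIMAL(10,2) NOT NULL,
    IsActive  BIT           NOT NULL DEFAULT 1,
    PRIMARY KEY (ProductID, FromDate)
);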
