Matching fact data with de-duplicated dimension records - sql-server

I'm starting work on a data warehousing project for a customer that has multiple physical locations with separate instances of the same LOB databases at each location. There's a good bit of "common" data between the sites, but the systems are siloed, so data that conceptually refers to the same thing has a different representation in the source.
Consider, for example, a product category. The list of product categories would be identical for each location, but the auto-generated key would differ. When the data is extracted, staged, and loaded into the corresponding product category dimension table in the warehouse, the categories are effectively duplicated because they have different source system, or "natural" keys.
Clearly, the data needs to be de-duplicated, but what then would become the surrogate key that's persisted on the de-duplicated dimension record? Keep in mind that data referencing the product category will use the surrogate key from its location of origination. So, if I have three distinct locations, I'm going to have three different natural keys for the same product category and sales data corresponding to that product category which also references those three natural keys, but ultimately refer to the same conceptual category. There's a couple of ways I could handle this:
If I have three locations, write three distinct surrogate keys to the single dimension record. This would make matching in the ETL process straightforward, but it's not very scalable because additional locations can and likely will be added. For every new location that came online, I would then need to add an additional natural key field to every dimension table with such de-duplicated records.
Create a lookup table that recorded a mapping between every natural key and its corresponding surrogate key in the corresponding dimension table. I'm not sure if this approach is very standard nor am I sure about its maintainability.
Any input on how the above-referenced scenario could be handled would be greatly appreciated.

We use approach 2. Imagine one day having hundreds of locations, and you'll see that approach 1 is simply out of the question.
Approach 2 is scalable, and very easy to maintain, since your lookup table will only grow vertically.

Related

How does columnfamily from BigTable in GCP relate to columns in a relational database

I am trying to migrate a table that is currently in a relational database to BigTable.
Let's assume that the table currently has the following structure:
Table: Messages
Columns:
Message_id
Message_text
Message_timestamp
How can I create a similar table in BigTable?
From what I can see in the documentation, BigTable uses ColumnFamily. Is ColumnFamily the equivalent of a column in a relational database?
BigTable is different from a relational database system in many ways.
Regarding database structures, BigTable should be considered a wide-column, NoSQL database.
Basically, every record is represented by a row and for this row you have the ability to provide an arbitrary number of name-value pairs.
This row has the following characteristics.
Row keys
Every row is identified univocally by a row key. It is similar to a primary key in a relational database. This field is stored in lexicographic order by the system, and is the only information that will be indexed in a table.
In the construction of this key you can choose a single field or combine several ones, separated by # or any other delimiter.
The construction of this key is the most important aspect to take into account when constructing your tables. You must thing about how will you query the information. Among others, keep in mind several things (always remember the lexicographic order):
Define prefixes by concatenating fields that allows you to fetch information efficiently. BigTable allows and you to scan information that starts with a certain prefix.
Related, model your key in a way that allows you to store common information (think, for example, in all the messages that come from a certain origin) together, so it can be fetched in a more efficient way.
At the same time, define keys in a way that maximize dispersion and load balance between the different nodes in your BigTable cluster.
Column families
The information associated with a row is organized in column families. It has no correspondence with any concept in a relational database.
A column family allows you to agglutinate several related fields, columns.
You need to define the column families before-hand.
Columns
A column will store the actual values. It is similar in a certain sense to a column in a relational database.
You can have different columns for different rows. BigTable will sparsely store the information, if you do not provide a value for a row, it will consume no space.
BigTable is a third dimensional database: for every record, in addition to the actual value, a timestamp is stored as well.
In your use case, you can model your table like this (consider, for example, that you are able to identify the origin of the message as well, and that it is a value information):
Row key = message_origin#message_timestamp (truncated to half hour, hour...)1#message_id
Column family = message_details
Columns = message_text, message_timestamp
This will generate row keys like, consider for example that the message was sent from a device with id MT43:
MT43#1330516800#1242635
Please, as #norbjd suggested, see the relevant documentation for an in-deep explanation of these concepts.
One important difference with a relational database to note: BigTable only offers atomic single-row transactions and if using single cluster routing.
1 See, for instance: How to round unix timestamp up and down to nearest half hour?

Where is the limit to what would be considered data duplication in a database?

What is the line that you should draw when normalising data, in terms of data duplication? i.e would you say that 2 employees who share the same birthday or have the same timestamp for a shift is data duplication? and therefore should be placed into another data table?
Birth date has full and non-transitive dependency to a person which means that it should be stored within the same table where you keep your employees and it would comply with third normal form (3NF).
Work shifts are not an attribute of an employee which means that they are a different entity and stay in relation with employee entity.
There is no a particular 'limit' when following the normalisation to data, since the main restriction that is given for every relational database table is to have an unique parimary key. Hence, if all other columns contain the same data, but the primary key is still different, it is a different row of a table.
The actual restrictions can come in two form. One is either the programming or systhematic approach, where the restriction on what kind of data is inputed is given from a program which interacts with the database or already defined script handed down physically for the admin of the database.
Other, more database-orriented approach would be to create primary keys composed of multiple columns. That way a row is unique only if for both columns the data is unique. It should be noted that a primary key is not necessary the same as an unique key, which should be different for every instance.
You have misunderstood what normalization does.
Two attributes having the same value (i.e. two employees having the same birthday) is not redundancy.
Rather having the same attribute in the two tables (i.e. two tables having birthday column, therefore repeating every employee's birthday information) is.
Normalization is a quality decision and denormalization is a performance decision. For my school projects, my teachers recommended me to normalize at least till 3NF. So that may be a good guideline.

What are the points considered when a database table is created?

I know it is a big and general question. Let me describe what I am looking for.
In big projects, we have some entities with many properties. (Many is over 100 properties for just a specific entity.) These properties have one to one relation. By the time goes, these tables with many columns are really big problems for maintenance and further development.
As you think, these 90 columns is created in a time with many projects. Not a single project. Therefore, requirements affect the table design in a wide time duration.
i.e. : There is a table to store information of payments between banks in global.
Some columns are foreign keys of others.(Customer, TransferType etc.)
Some columns are parameters of current payment. (IsActive, IsLoaded, IsOurCustomer etc.)
Some columns are fields of payment. (Information Bank, Receiver Bank etc.)
and so on.
These fields are always counting and now we have about 90 columns with one to one relation.
What are the concerns to divide a table to smaller tables. I know normalization rules and I am not interested it. (Already duplicated columns are normalized)
I try to find some patterns or some rules to divide a table which has one to one relation among columns.
If all of the columns are only dependent on the primary table key and are not repeating (phone1, phone2) they should be part of the same table. If you split a table you will have to do joins when you need all the columns of the table. If many of the values are null you may investigate the use of sparse columns (which don't take up any room if they have a null value).

Difference between Fact table and Dimension table?

When reading a book for business objects, I came across the term- fact table and dimension table.
I am trying to understand what is the different between Dimension table and Fact table?
I read couple of articles on the internet but I was not able to understand clearly..
Any simple example will help me to understand better?
In Data Warehouse Modeling, a star schema and a snowflake schema consists of Fact and Dimension tables.
Fact Table:
It contains all the primary keys of the dimension and associated
facts or measures(is a property on which calculations can be made) like quantity sold, amount sold and average sales.
Dimension Tables:
Dimension tables provides descriptive information for all the measurements recorded in fact table.
Dimensions are relatively very small as comparison of fact table.
Commonly used dimensions are people, products, place and time.
image source
This appears to be a very simple answer on how to differentiate between fact and dimension tables!
It may help to think of dimensions as things or objects. A thing such
as a product can exist without ever being involved in a business
event. A dimension is your noun. It is something that can exist
independent of a business event, such as a sale. Products, employees,
equipment, are all things that exist. A dimension either does
something, or has something done to it.
Employees sell, customers buy. Employees and customers are examples of
dimensions, they do.
Products are sold, they are also dimensions as they have something
done to them.
Facts, are the verb. An entry in a fact table marks a discrete event
that happens to something from the dimension table. A product sale
would be recorded in a fact table. The event of the sale would be
noted by what product was sold, which employee sold it, and which
customer bought it. Product, Employee, and Customer are all dimensions
that describe the event, the sale.
In addition fact tables also typically have some kind of quantitative
data. The quantity sold, the price per item, total price, and so on.
Source:
http://arcanecode.com/2007/07/23/dimensions-versus-facts-in-data-warehousing/
This is to answer the part:
I was trying to understand whether dimension tables can be fact table
as well or not?
The short answer (INMO) is No.That is because the 2 types of tables are created for different reasons. However, from a database design perspective, a dimension table could have a parent table as the case with the fact table which always has a dimension table (or more) as a parent. Also, fact tables may be aggregated, whereas Dimension tables are not aggregated. Another reason is that fact tables are not supposed to be updated in place whereas Dimension tables could be updated in place in some cases.
More details:
Fact and dimension tables appear in a what is commonly known as a Star Schema. A primary purpose of star schema is to simplify a complex normalized set of tables and consolidate data (possibly from different systems) into one database structure that can be queried in a very efficient way.
On its simplest form, it contains a fact table (Example: StoreSales) and a one or more dimension tables. Each Dimension entry has 0,1 or more fact tables associated with it (Example of dimension tables: Geography, Item, Supplier, Customer, Time, etc.). It would be valid also for the dimension to have a parent, in which case the model is of type "Snow Flake". However, designers attempt to avoid this kind of design since it causes more joins that slow performance. In the example of StoreSales, The Geography dimension could be composed of the columns (GeoID, ContenentName, CountryName, StateProvName, CityName, StartDate, EndDate)
In a Snow Flakes model, you could have 2 normalized tables for Geo information, namely: Content Table, Country Table.
You can find plenty of examples on Star Schema. Also, check this out to see an alternative view on the star schema model Inmon vs. Kimball. Kimbal has a good forum you may also want to check out here: Kimball Forum.
Edit: To answer comment about examples for 4NF:
Example for a fact table violating 4NF:
Sales Fact (ID, BranchID, SalesPersonID, ItemID, Amount, TimeID)
Example for a fact table not violating 4NF:
AggregatedSales (BranchID, TotalAmount)
Here the relation is in 4NF
The last example is rather uncommon.
Super simple explanation:
Fact table: a data table that maps lookup IDs together. Is usually one of the main tables central to your application.
Dimension table: a lookup table used to store values (such as city names or states) that are repeated frequently in the fact table.
Dimension table
Dimension table is a table which contain attributes of measurements stored in fact tables. This table consists of hierarchies, categories and logic that can be used to traverse in nodes.
Fact table contains the measurement of business processes, and it contains foreign keys for the dimension tables.
Example – If the business process is manufacturing of bricks
Average number of bricks produced by one person/machine – measure of the business process
a Fact = an action: a sale, a transaction, an access
a Dimension = an object: a seller, a customer, a date, a price
Then...
Facts references dimensions for: when, where, what, who, how
The real interesting thing is deciding whether an attribute should be a dimension or a fact. For example, the price of each item in an order, or, the maximum amount of a insurance recorded in a contract. There are no generally correct way to approach these, only ones that make sense in the context.
PS: If I were to create those jargons I would prefer Log table and Object table.
In the simplest form, I think a dimension table is something like a 'Master' table - that keeps a list of all 'items', so to say.
A fact table is a transaction table which describes all the transactions. In addition, aggregated (grouped) data like total sales by sales person, total sales by branch - such kinds of tables also might exist as independent fact tables.
From my point of view,
Dimension table : Master Data
Fact table : Transactional Data
The fact table mainly consists of business facts and foreign keys that refer to primary keys in the dimension tables. A dimension table consists mainly of descriptive attributes that are textual fields.
A dimension table contains a surrogate key, natural key, and a set of attributes. On the contrary, a fact table contains a foreign key, measurements, and degenerated dimensions.
Dimension tables provide descriptive or contextual information for the measurement of a fact table. On the other hand, fact tables provide the measurements of an enterprise.
When comparing the size of the two tables, a fact table is bigger than a dimensional table. In a comparison table, more dimensions are presented than the fact tables. In a fact table, less numbers of facts are observed.
The dimension table has to be loaded first. While loading the fact tables, one should have to look at the dimension table. This is because the fact table has measures, facts, and foreign keys that are the primary keys in the dimension table.
Read more: Dimension Table and Fact Table | Difference Between | Dimension Table vs Fact Table http://www.differencebetween.net/technology/hardware-technology/dimension-table-and-fact-table/#ixzz3SBp8kPzo
For Relation database users, Dimension is equivalent to Master Table.
Fact is equivalent to Transaction table.
Dimension table : It is nothing but we can maintains information about the characterized date called as Dimension table.
Example : Time Dimension , Product Dimension.
Fact Table : It is nothing but we can maintains information about the metrics or precalculation data.
Example : Sales Fact, Order Fact.
Star schema : one fact table link with dimension table form as a Start Schema.
enter image description here

When should I use Oracle's Index Organized Table? Or, when shouldn't I?

Index Organized Tables (IOTs) are tables stored in an index structure. Whereas a table stored
in a heap is unorganized, data in an IOT is stored and sorted by primary key (the data is the index). IOTs behave just like “regular” tables, and you use the same SQL to access them.
Every table in a proper relational database is supposed to have a primary key... If every table in my database has a primary key, should I always use an index organized table?
I'm guessing the answer is no, so when is an index organized table not the best choice?
Basically an index-organized table is an index without a table. There is a table object which we can find in USER_TABLES but it is just a reference to the underlying index. The index structure matches the table's projection. So if you have a table whose columns consist of the primary key and at most one other column then you have a possible candidate for INDEX ORGANIZED.
The main use case for index organized table is a table which is almost always accessed by its primary key and we always want to retrieve all its columns. In practice, index organized tables are most likely to be reference data, code look-up affairs. Application tables are almost always heap organized.
The syntax allows an IOT to have more than one non-key column. Sometimes this is correct. But it is also an indication that maybe we need to reconsider our design decisions. Certainly if we find ourselves contemplating the need for additional indexes on the non-primary key columns then we're probably better off with a regular heap table. So, as most tables probably need additional indexes most tables are not suitable for IOTs.
Coming back to this answer I see a couple of other responses in this thread propose intersection tables as suitable candidates for IOTs. This seems reasonable, because it is common for intersection tables to have a projection which matches the candidate key: STUDENTS_CLASSES could have a projection of just (STUDENT_ID, CLASS_ID).
I don't think this is cast-iron. Intersection tables often have a technical key (i.e. STUDENT_CLASS_ID). They may also have non-key columns (metadata columns like START_DATE, END_DATE are common). Also there is no prevailing access path - we want to find all the students who take a class as often as we want to find all the classes a student is taking - so we need an indexing strategy which supports both equally well. Not saying intersection tables are not a use case for IOTs. just that they are not automatically so.
I'd consider them for very narrow tables (such as the join tables used to resolve many-to-many tables). If (virtually) all the columns in the table are going to be in an index anyway, then why shouldn't you used an IOT.
Small tables can be good candidates for IOTs as discussed by Richard Foote here
I consider the following kinds of tables excellent candidates for IOTs:
"small" "lookup" type tables (e.g. queried frequently, updated infrequently, fits in a relatively small number of blocks)
any table that you already are going to have an index that covers all the columns anyway (i.e. may as well save the space used by the table if the index duplicates 100% of the data)
From the Oracle Concepts guide:
Index-organized tables are useful when
related pieces of data must be stored
together or data must be physically
stored in a specific order. This type
of table is often used for information
retrieval, spatial (see "Overview of
Oracle Spatial"), and OLAP
applications (see "OLAP").
This question from AskTom may also be of some interest especially where someone gives a scenario and then asks would an IOT perform better than an heap organised table, Tom's response is:
we can hypothesize all day long, but
until you measure it, you'll never
know for sure.
An index-organized table is generally a good choice if you only access data from that table by the key, the whole key, and nothing but the key.
Further, there are many limitations about what other database features can and cannot be used with index-organized tables -- I recall that in at least one version one could not use logical standby databases with index-organized tables. An index-organized table is not a good choice if it prevents you from using other functionality.
All an IOT really saves is the logical read(s) on the table segment, and as you might have spent two or three or more on the IOT/index this is not always a great saving except for small data sets.
Another feature to consider for speeding up lookups, particularly on larger tables, is a single table hash cluster. When correctly created they are more efficient for large data sets than an IOT because they require only one logical read to find the data, whereas an IOT is still an index that needs multiple logical i/o's to locate the leaf node.
I can't per se comment on IOTs, however if I'm reading this right then they're the same as a 'clustered index' in SQL Server. Typically you should think about not using such an index if your primary key (or the value(s) you're indexing if it's not a primary key) are likely to be distributed fairly randomly - as these inserts can result in many page splits (expensive).
Indexes such as identity columns (sequences in Oracle?) and dates 'around the current date' tend to make for good candidates for such indexes.
An Index-Organized Table--in contrast to an ordinary table--has its own way of structuring, storing, and indexing data.
Index organized tables (IOT) are indexes which actually hold the data which is being indexed, unlike the indexes which are stored somewhere else and have links to actual data.

Resources