lookup codes in data warehouse dimension - data-modeling

Many core entities in the upstream OLTP system have a lot of domain-specific lookup codes that users are familiar with and wish to keep using in data warehouse reports: things like product_category = "SRB6", incentive_scheme = "APP3", etc. These codes do have long-form descriptions, but those are not what users know or want.
There is little correlation between the codes, and their cardinality is generally not that low, so a junk dimension doesn't seem right. The core dimensions are generally SCD type II, and the lookup codes are unlikely to change.
How can I best model these lookup codes without using a snowflake of 3NF lookup tables around the dimension?
Options I can see include:
Place the code and long-form description straight in the dimension table
Place the source system, code, and description in a single global "lookups" dimension with a surrogate key, and use that surrogate key in the entity dimension
A combination of both: keep the lookups dimension's surrogate key, code, and description in the entity dimension, and make the lookups dimension SCD type II
Something else?

The typical dimensional modelling approach is just to place the codes and long-form descriptions straight in the dimension table they relate to. E.g. DimProduct would have columns describing the product category, both codes and descriptions if needed.
Other systems do prioritise generic management of lookups, normalisation, etc., and would use the other options you've suggested, but the result wouldn't be a dimensional model, and it wouldn't benefit from the readability of a dimensional model or the performance that comes from fewer joins.
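As a minimal sketch of that first option (generic SQL; table and column names are illustrative), the codes and their descriptions simply become attributes of the dimension:

    CREATE TABLE DimProduct (
        ProductKey          INT PRIMARY KEY,  -- surrogate key
        ProductCode         VARCHAR(30),      -- natural/business key from the OLTP system
        ProductName         VARCHAR(100),
        ProductCategoryCode VARCHAR(10),      -- e.g. 'SRB6', the code users actually ask for
        ProductCategoryDesc VARCHAR(100),     -- long-form description, available when wanted
        IncentiveSchemeCode VARCHAR(10),      -- e.g. 'APP3'
        IncentiveSchemeDesc VARCHAR(100),
        ValidFrom           DATE,             -- SCD type II housekeeping
        ValidTo             DATE
    );

Reports can filter and group on the code columns directly, and the descriptions are there for anyone who wants them, with no extra joins.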

Related

Are hashcodes good for joins

I am new to Snowflake and want to know: can we use hash codes for joining tables, finding unique records, or deleting duplicate records in Snowflake (or in any other database in general)?
I am designing an ETL flow; what are the advantages or disadvantages of using hash codes, and why are they generally not used often in most data warehousing designs?
If you mean hashing with something like md5_binary or sha1_binary, then yes, absolutely.
Binary values are half the byte length of the equivalent hex varchar, so you should use the binary variants. The benefit of using hash keys is that you only need a single join column even when the natural key of a table is a composite key. You could instead use a numeric/integer sequence key, but that imposes a load order: for example, you could only build a fact table after its related dimension tables have loaded, if that is how you generate the keys.
Data Vault prefers durable hash keys precisely because they do not impose any load ordering; tables can be loaded independently, in any order.
Anyway, I digress: yes, hash keys have great advantages, just make sure they are stored as binary data types when loaded.
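A minimal sketch of that idea in Snowflake SQL (table, column names, and the delimiter are illustrative assumptions): a single BINARY hash key derived from a composite natural key.

    CREATE TABLE dim_customer (
        customer_hk   BINARY(16),  -- MD5 output: 16 bytes, vs. 32 characters as a hex varchar
        source_system VARCHAR,
        customer_code VARCHAR,
        customer_name VARCHAR
    );

    INSERT INTO dim_customer
    SELECT
        MD5_BINARY(source_system || '||' || customer_code) AS customer_hk,  -- one join column instead of two
        source_system,
        customer_code,
        customer_name
    FROM stg_customer;

The fact (or link/satellite) tables compute the same hash over the same natural-key columns, so they can be loaded in any order without first looking up a sequence-generated key.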

Using surrogate keys in SAP Hana

I am creating a dimensional data model for implementation in SAP HANA. In dimensional modeling, having surrogate keys for dimension tables is mandatory; however, I am told that in SAP HANA we cannot define surrogate keys and have to depend on the natural keys of the dimensions. I have never come across this before, especially since using natural keys for SCD dimensions is not possible.
Any suggestion on implementing surrogate keys in Hana will be great.
SAP HANA supports, just like most other RDBMSs, the automatic generation of surrogate (synthetic) keys. The feature for this is the IDENTITY column. There are also key-generating functions such as SYSUUID that produce guaranteed globally unique values.
This covers the feature for current databases, i.e. databases that represent only the most current state of information.
For the example you mentioned (slowly changing dimensions, SCD type 2), you need to bring in the concept of the timeframe during which any given dimension entry is considered current. In other words, you need a temporal database. One way to do that is to add validFrom/validTo fields to your dimension tables and fill them accordingly during data loading.
SAP HANA supports this type of modelling with a feature called temporal join that allows an easy match of fact data to a temporal dimension table.
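A minimal sketch in SAP HANA SQL (table and column names are illustrative) combining both points, an IDENTITY column as the surrogate key plus validity columns for SCD type 2:

    CREATE COLUMN TABLE dim_customer (
        customer_sk   BIGINT GENERATED BY DEFAULT AS IDENTITY,  -- surrogate key generated by HANA
        customer_id   NVARCHAR(20),                             -- natural/business key
        customer_name NVARCHAR(100),
        valid_from    DATE,                                     -- start of the period this version is current
        valid_to      DATE,                                     -- end of the period, e.g. '9999-12-31' for the current row
        PRIMARY KEY (customer_sk)
    );

The valid_from/valid_to pair is what a temporal join can then match fact dates against.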
Considering these features, and the fact that SAP's own data warehouse solution SAP BW/4HANA manages slowly changing dimensions on SAP HANA, I'd say that the claim you heard is incorrect.

What is the number of columns that makes a table really big?

I have two tables in my database, one for logins and a second for user details (the database has more than just these two tables). The logins table has 12 columns (Id, Email, Password, PhoneNumber, ...) and the user details table has 23 columns (Job, City, Gender, ContactInfo, ...). The two tables have a one-to-one relationship.
I am thinking of creating one table that contains the columns of both tables, but I am not sure, because this may make the table big.
This leads to my question: what number of columns makes a table big? Is there a definite or approximate number at which a table becomes too big, so that we should stop adding columns and create another table, or is it up to the programmer to decide?
The number of columns isn't realistically a problem. Any performance issues you seem to be worried about can be attributed to the size of the data in the table, i.e. if the table has billions of rows, or if one of the columns contains 200 MB of XML data in each row, etc.
Normally, the only issue arising from a multitude of columns is how it pertains to indexing, as it can get troublesome trying to create 100 different indexes covering each variation of each query.
The point here is, we can't really give you any advice, since the number of tables, columns, and relations alone isn't enough information to go on. It could be perfectly fine, or not. The nature of the data, and how you account for that data with proper normalization, indexing, and statistics, is what really matters.
The constraint that makes us stop adding columns to an existing table in SQL is exceeding the maximum number of columns that the database engine can support for a single table. For SQL Server, that is 1,024 columns for a non-wide table, or 30,000 columns for a wide table.
35 columns is not a particularly large number of columns for a table.
There are a number of reasons why decomposing a table (splitting up by columns) might be advisable. One of the first reasons a beginner should learn is data normalization. Data normalization is not directly concerned with performance, although a normalized database will sometimes outperform a poorly built one, especially under load.
The first three steps in normalization result in 1st, 2nd, and 3rd normal forms. These forms have to do with the relationship that non-key values have to the key. A simple summary is that a table in 3rd normal form is one where all the non-key values are determined by the key, the whole key, and nothing but the key.
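A minimal sketch of that rule (generic SQL; table and column names are illustrative): DepartmentName is determined by DepartmentId rather than by the Employee key, so it moves to its own table to reach 3rd normal form.

    CREATE TABLE Department (
        DepartmentId   INT PRIMARY KEY,
        DepartmentName VARCHAR(100)   -- depends only on the Department key
    );

    CREATE TABLE Employee (
        EmployeeId   INT PRIMARY KEY,
        EmployeeName VARCHAR(100),
        DepartmentId INT REFERENCES Department (DepartmentId)
        -- no DepartmentName here: every non-key column depends on the key,
        -- the whole key, and nothing but the key
    );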
There is a whole body of literature out there that will teach you how to normalize, what the benefits of normalization are, and what the drawbacks sometimes are. Once you become proficient in normalization, you may wish to learn when to depart from the normalization rules, and follow a design pattern like Star Schema, which results in a well structured, but not normalized design.
Some people treat normalization like a religion, but that's overselling the idea. It's definitely a good thing to learn, but it's only a set of guidelines that can often (but not always) lead you in the direction of a satisfactory design.
A normalized database tends to outperform a non-normalized one at update time, but a denormalized database can be built that is extraordinarily speedy for certain kinds of retrieval.
And, of course, all this depends on how many databases you are going to build, and on their size and scope.
I take it that the logins table contains data that is only used when the user logs into your system. For all other purposes, the details table is used.
Separating these sets of data into separate tables is not a bad idea and could work perfectly well for your application. However, another option is having the data in one table and separating them using covering indexes.
One aspect of an index that no one seems to consider is that an index can be thought of as a sub-table within a table. When a SQL statement accesses only the fields within an index, the I/O required to perform the operation can be limited to the index alone rather than the entire row. So creating a "login" index and a "details" index would achieve the same benefits as separate tables, with the added benefit that any operations that do need all the data would not have to join two tables.
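A minimal sketch of that approach in SQL Server syntax (table, column, and index names are illustrative): one table, with two covering indexes acting as the "login" and "details" sub-tables.

    CREATE TABLE dbo.Users (
        UserId       INT IDENTITY PRIMARY KEY,
        Email        NVARCHAR(256) NOT NULL,
        PasswordHash VARBINARY(64) NOT NULL,
        PhoneNumber  NVARCHAR(32),
        Job          NVARCHAR(100),
        City         NVARCHAR(100),
        Gender       CHAR(1)
        -- ... the remaining login and detail columns
    );

    -- A login query touching only these columns is served from this index alone.
    CREATE INDEX IX_Users_Login ON dbo.Users (Email) INCLUDE (PasswordHash, PhoneNumber);

    -- Profile reads keyed by UserId are served from this index, without the login columns.
    CREATE INDEX IX_Users_Details ON dbo.Users (UserId) INCLUDE (Job, City, Gender);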

Data warehouse design, multiple dimensions or one dimension with attributes?

Working on a data warehouse and looking for suggestions on having numerous dimensions versus one large dimension with attributes.
We currently have DimEntity, DimStation, DimZone, DimGroup, DimCompany and have multiple fact tables that contain the keys from each of the dimensions. Is this the best way or would it be better to have just one dimension, DimEntity and include station, zone, group and company as attributes of the entity?
We have already gone the route of separate dimensions with our ETL, so the work to populate and build out the star schema isn't an issue. Performance and maintainability are important. These dimensions do not change often, so I am looking for guidance on the best way to handle them.
Fact tables have over 100 million records. The entity dimension has around 1000 records and the others listed have under 200 each.
Without knowing your star schema table definitions, data cardinality, etc, it's tough to give a yes or no. It's going to be a balancing act.
For read performance, the fact table should be as skinny as possible and the dimension should be as short (low row count) as possible. Consolidating dimensions typically means that the fact table gets skinnier while the dimension record count increases.
If you can consolidate dimensions without adding a significant number of rows to the consolidated dimension, it may be worth looking into. It may be that you can combine the low cardinality dimensions into a junk dimension and achieve a nice balance. Dimensions with high cardinality attributes shouldn't be consolidated.
There's a good Kimball University article on dimensional modeling; look specifically at where it addresses centipede fact tables and recommends using junk dimensions.
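For instance, here is a minimal sketch (generic SQL; names are illustrative) of folding the small Station/Zone/Group/Company dimensions into a single junk-style dimension so the fact table carries one key instead of four:

    CREATE TABLE DimEntityProfile (
        EntityProfileKey INT PRIMARY KEY,  -- surrogate key referenced by the fact table
        StationCode      VARCHAR(20),
        ZoneCode         VARCHAR(20),
        GroupCode        VARCHAR(20),
        CompanyCode      VARCHAR(20)
    );

    -- Populate with only the combinations that actually occur in the source data,
    -- which keeps the row count far below the full cross product.
    INSERT INTO DimEntityProfile (EntityProfileKey, StationCode, ZoneCode, GroupCode, CompanyCode)
    SELECT ROW_NUMBER() OVER (ORDER BY StationCode, ZoneCode, GroupCode, CompanyCode),
           StationCode, ZoneCode, GroupCode, CompanyCode
    FROM (SELECT DISTINCT StationCode, ZoneCode, GroupCode, CompanyCode
          FROM StagingEntity) AS combos;

Whether this pays off depends on how many distinct combinations actually exist; if the count stays in the hundreds or low thousands, the fact table gets skinnier at little cost.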

Database - fact table and dimension table

When reading a book on business objects, I came across the terms fact table and dimension table. Is it standard for all databases to have fact tables and dimension tables, or is it just for business object design? I am looking for an explanation that differentiates between the two and shows how they are related.
Edited:
Why can't a query just get the required data from the fact table? What happens if all the information is stored in one fact table alone? What advantages do we get by creating separate fact and dimension tables and joining them?
Sorry for asking so many questions at once, but I would like to understand the inter-relations and the whys.
Dimension and Fact are key terms in OLAP database design.
A fact table contains data that can be aggregated.
Measures are aggregated data expressions (e.g. sum of costs, count of calls, ...).
A dimension contains data that is used to generate groups and filters.
A fact table without dimension data is useless. For example, "the sum of orders is 1M" is not information, but "the sum of orders from 2005 to 2009" is.
There are a lot of BI tools that work with these concepts (e.g. Microsoft SSAS, Tableau Software) and languages (e.g. MDX).
Sometimes it is not easy to know whether a piece of data is a measure or a dimension. For example, when analyzing revenue, both scenarios are possible:
3 measures: net profit, overheads, interest
1 measure: profit, and 1 dimension: profit type (with 3 elements: net, overhead, interest)
The BI analyst is the one who determines the best design for each solution.
EDITED due to the question also being edited:
An OLAP solution usually has a semantic layer. This layer provides the OLAP tool with information about which elements are fact data, which elements are dimension data, and how the tables are related. Unlike OLTP systems, an OLAP database is not required to be properly normalized. For this reason, you can take dimension data from several tables, including fact tables. A dimension that takes its data from a fact table is called a fact dimension or degenerate dimension.
There are a lot of concepts that you should keep in mind when designing OLAP databases: star schema, snowflake schema, surrogate keys, parent-child hierarchies, ...
It is standard in a data warehouse to have fact tables and dimension tables. A fact table contains the data that you are measuring, for instance what you are summing. A dimension table contains data that you don't want to constantly repeat in the fact table, for example product data, statuses, customers, etc. They are related by keys: in a star schema, each row in the fact table contains the key of a row in the dimension table.
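A minimal sketch (generic SQL; table and column names are illustrative) of how the two relate and how a typical question is answered by joining them:

    CREATE TABLE DimDate (
        DateKey  INT PRIMARY KEY,  -- surrogate key
        FullDate DATE,
        Year     INT
    );

    CREATE TABLE FactOrders (
        DateKey     INT REFERENCES DimDate (DateKey),
        CustomerKey INT,
        OrderAmount DECIMAL(18,2)  -- the additive measure
    );

    -- "The sum of orders from 2005 to 2009": the dimension supplies the filter,
    -- the fact table supplies the measure.
    SELECT SUM(f.OrderAmount) AS TotalOrders
    FROM FactOrders f
    JOIN DimDate d ON d.DateKey = f.DateKey
    WHERE d.Year BETWEEN 2005 AND 2009;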
