Using surrogate keys in SAP HANA - data-modeling

I am creating a dimensional data model for implementation in SAP HANA. In dimensional modeling, having surrogate keys for dimension tables is mandatory; however, I am told that in SAP HANA we cannot define surrogate keys and have to depend on the natural keys for the dimensions. I have never come across this before, especially since using natural keys for SCD dimensions is not possible.
Any suggestion on implementing surrogate keys in Hana will be great.

SAP HANA supports, just like most other RDBMS, the automatic generation of surrogate (synthetic) keys. The feature name for this is IDENTITY column. There are also key-generating functions like SYSUUID available that produce guaranteed globally unique values.
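As a minimal sketch of the IDENTITY approach (table and column names are hypothetical), the surrogate key is generated by the database on insert:

    -- Hypothetical HANA column table with an auto-generated surrogate key
    CREATE COLUMN TABLE dim_product (
        product_sk   BIGINT GENERATED ALWAYS AS IDENTITY,
        product_code NVARCHAR(40) NOT NULL,   -- natural key from the source system
        product_name NVARCHAR(100),
        PRIMARY KEY (product_sk)
    );

    -- The surrogate key column is filled in automatically
    INSERT INTO dim_product (product_code, product_name)
    VALUES ('P-1001', 'Sample product');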
This covers the feature for current databases, i.e. databases that represent only the most current state of information.
For the example you mentioned (slowly changing dimensions, SCD type 2), you need to bring in the concept of the timeframe during which any dimension entry is considered current. That means creating a temporal database. One way to do that is to add validFrom/validTo fields to your dimension tables and fill them accordingly during data loading.
SAP HANA supports this type of modelling with a feature called temporal join, which allows an easy match of fact data to a temporal dimension table.
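In plain SQL, that temporal match boils down to a range predicate on the validity columns; a hypothetical sketch (fact and dimension names invented) of the equivalent join:

    -- Match each fact row to the dimension version that was valid on the order date
    SELECT f.order_id,
           f.order_date,
           d.customer_name
    FROM   fact_orders  f
    JOIN   dim_customer d
      ON   d.customer_nk = f.customer_nk
     AND   f.order_date >= d.valid_from
     AND   f.order_date <  d.valid_to;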
Considering these features and the fact that SAP's own data warehouse solution, SAP BW/4HANA, manages slowly changing dimensions on SAP HANA, I'd say that the claim you heard is incorrect.

Related

Are hashcodes good for joins

I am new to Snowflake and want to know: can we use hash codes for joining tables, finding unique records, or deleting duplicate records in Snowflake (or in any other database in general)?
I am designing an ETL flow. What are the advantages or disadvantages of using hash codes, and why are they generally not used often in most data warehousing designs?
If you mean hashing with something like md5_binary or sha1_binary, then yes, absolutely.
Binary values are half the byte length of the equivalent hex varchar, so you should use the binary form. The benefit of using hash keys is that you only need a single join column even when, for instance, the natural key of a table is a composite key. You could instead use a numeric/integer sequence key, but that imposes a load order: for example, only after the related dimension tables have loaded can you build the related fact table --- if you are doing that.
Data Vault prefers durable hash keys because they do not impose any load ordering; tables can be loaded in any order, independently.
Anyway, I digress: yes, hash keys have great advantages, just make sure they're binary data types when loaded.
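As an illustration only (the staging table and column names are made up), a binary hash key in Snowflake could be derived like this:

    -- Derive a 16-byte binary hash key from a composite natural key.
    -- The delimiter guards against collisions such as ('AB','C') vs ('A','BC').
    SELECT
        md5_binary(
            COALESCE(source_system, '') || '||' ||
            COALESCE(customer_code,  '')
        ) AS customer_hk,
        source_system,
        customer_code,
        customer_name
    FROM staging_customers;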

lookup codes in data warehouse dimension

Many core entities in the upstream OLTP system have a lot of domain-specific lookup codes that users are familiar with and wish to keep using in data warehouse reports. Things like product_category = "SRB6", incentive_scheme = "APP3", etc. These codes do have long-form descriptions, but those are not what users are familiar with or want.
There is low correlation between the codes and cardinality is generally not that low, so a junk dimension doesn't seem right. The core dimensions are generally SCD type II and the lookup codes are unlikely to change.
How can I best model these lookup codes without using a snowflake of 3NF lookup tables around the dimension?
Options I can see include:
place the code and long form description straight in the dimension table
place the source system, code and description in a single global "lookups" dimension with a surrogate key and use that surrogate key in the entity dimension
combination of both: lookups dim surrogate key, code, and description in the dimension, and SCD type II on the lookups dim
other?
The typical dimensional modelling approach is just to place the codes and long form descriptions straight in the dimension table they relate to. E.g. DimProduct would have columns describing the product category, both codes and descriptions if needed.
Other systems do prioritise generic management of lookups, normalisation, etc. and would use other options as you've suggested, but they wouldn't be a dimensional model, nor would they benefit from the easy readability of the model and the performance gained from a reduced number of joins.
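A hypothetical sketch of that first option, with both code and description denormalised straight into the dimension:

    CREATE TABLE dim_product (
        product_sk            INTEGER      NOT NULL PRIMARY KEY,  -- surrogate key
        product_nk            VARCHAR(40)  NOT NULL,              -- natural key
        product_category_code VARCHAR(10),                        -- e.g. 'SRB6'
        product_category_desc VARCHAR(100),
        incentive_scheme_code VARCHAR(10),                        -- e.g. 'APP3'
        incentive_scheme_desc VARCHAR(100),
        valid_from            DATE,                               -- SCD type II validity
        valid_to              DATE
    );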

Database - fact table and dimension table

When reading a book about business objects, I came across the terms fact table and dimension table. Is it standard for all databases to have fact tables and dimension tables, or is it just for business object design? I am looking for an explanation that differentiates between the two and explains how they are related.
Edited:
Why can't a query just get the required data from the fact table? What happens if all the information is stored in one fact table alone? What advantages do we get by creating separate fact and dimension tables and joining them?
Sorry for asking too many questions at a time, but I would like to know about the inter-relations and the whys.
Dimension and Fact are key terms in OLAP database design.
A fact table contains data that can be aggregated.
Measures are aggregated data expressions (e.g. sum of costs, count of calls, ...).
A dimension contains data that is used to generate groups and filters.
A fact table without dimension data is useless. For example: "the sum of orders is 1M" is not information, but "the sum of orders from 2005 to 2009" is.
There are a lot of BI tools that work with these concepts (e.g. Microsoft SSAS, Tableau Software) and languages (e.g. MDX).
Sometimes it is not easy to know whether a piece of data is a measure or a dimension. For example, if we are analyzing revenue, both scenarios are possible:
3 measures: net profit, overheads, interest
1 measure: profit, and 1 dimension: profit type (with 3 elements: net, overhead, interest)
The BI analyst determines the best design for each solution.
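As a purely illustrative sketch (table and column names invented), the two scenarios could be modelled like this:

    -- Scenario 1: three separate measures in the fact table
    CREATE TABLE fact_revenue_a (
        date_key   INTEGER,
        net_profit DECIMAL(18,2),
        overheads  DECIMAL(18,2),
        interest   DECIMAL(18,2)
    );

    -- Scenario 2: a single measure plus a "profit type" dimension key
    CREATE TABLE fact_revenue_b (
        date_key        INTEGER,
        profit_type_key INTEGER,       -- points to dim_profit_type (net / overhead / interest)
        profit          DECIMAL(18,2)
    );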
EDITED due to the question also being edited:
An OLAP solution usually has a semantic layer. This layer provides the OLAP tool with information about which elements are fact data, which elements are dimension data, and the table relationships. Unlike OLTP systems, an OLAP database is not required to be properly normalized. For this reason, you can take dimension data from several tables, including fact tables. A dimension that takes its data from a fact table is called a fact dimension or degenerate dimension.
There are a lot of concepts that you should keep in mind when designing OLAP databases: "STAR Schema", "SNOWFLAKE Schema", "Surrogate keys", "parent-child hierarchies", ...
It is standard in a data warehouse to have fact tables and dimension tables. A fact table contains the data that you are measuring, for instance what you are summing. A dimension table contains data that you don't want to constantly repeat in the fact table, for example product data, statuses, customers, etc. They are related by keys: in a star schema, each row in the fact table contains the key of a row in the dimension table.
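A minimal, hypothetical star-schema query shows the relationship in practice: the fact table supplies the measure, and the dimension supplies the attributes used for filtering and grouping.

    SELECT d.product_name,
           SUM(f.sales_amount) AS total_sales
    FROM   fact_sales  f
    JOIN   dim_product d ON d.product_key = f.product_key
    WHERE  d.product_category = 'Bikes'
    GROUP  BY d.product_name;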

Why are there "relations" on databases instead of just using SQL's join?

I always see, in database articles and tutorials and just about everywhere databases are used, a thing called relations. What instantly comes to mind are those little boxes with lists of field names, with one field connected to another field in another box by a line.
I'm not an expert on databases (as you can probably tell), but from the little I've used them, I never needed relations. They always seemed redundant, as I could always use JOIN to achieve what it seemed to me they are made for. Are they redundant, or is there anything you can do with relations that you cannot do with JOIN? Or am I just talking nonsense?
Relations are not just about joins for SQL queries. Relations provide many benefits:
Data integrity
Query convenience
Third party tool integration benefits
"Self-describing" data model to future DBAs/developers working with the database
Etc
Data integrity:
Relations help to ensure that your "order records" can't exist without a "customer record", for example. Simply by defining a relationship between customer and order, the database will ensure that this cannot happen. This helps to make sure that your database doesn't become a big pile of junk data.
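For example, with hypothetical customer and order tables, the rule is declared as a foreign key and enforced by the DBMS:

    CREATE TABLE customers (
        customer_id INTEGER PRIMARY KEY,
        name        VARCHAR(100) NOT NULL
    );

    CREATE TABLE orders (
        order_id    INTEGER PRIMARY KEY,
        customer_id INTEGER NOT NULL,
        -- an order row cannot exist without a matching customer row
        FOREIGN KEY (customer_id) REFERENCES customers (customer_id)
    );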
Query convenience:
Relations can make it easier to do certain types of operations. Deleting a customer record can automatically delete the customer's orders at the same time, thanks to the relationship between customer and order.
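A variant of the hypothetical sketch above, with the same foreign key declared with a cascading delete rule:

    CREATE TABLE orders (
        order_id    INTEGER PRIMARY KEY,
        customer_id INTEGER NOT NULL,
        FOREIGN KEY (customer_id) REFERENCES customers (customer_id)
            ON DELETE CASCADE
    );

    -- Removing a customer now removes that customer's orders as well
    DELETE FROM customers WHERE customer_id = 42;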
Third party tool integration benefits
Many third party tools (O/R tools come to mind) rely on relations in order to work properly
Really, the list could go on and on...you should use them, they're very beneficial. Even if you don't perceive the value today, if you're working on a database project that will continue to grow over a long period of time, it would be to your benefit to set relationships up from the beginning.
I think that they're not that critical for small projects/one-off data models...but for anything of substance, you're better off using them.
A RELATION is a subset of the cartesian product of a set of domains (http://mathworld.wolfram.com/Relation.html). In everyday terms a relation (or more specifically a relation variable) is the data structure that most people refer to as a table (although tables in SQL do not necessarily qualify as relations).
Relations are the basis of the relational database model.
Relationships are something different. A relationship is a semantic "association among things".
I think you are actually asking about referential integrity constraints (foreign keys). A foreign key is a data integrity rule that ensures the database is consistent by preventing inconsistent data from being added to it. Don't confuse foreign keys with relations because they are very different things.
I'm assuming when you are reading about relations it is probably referring to foreign keys. If that's true, relations and joins are not different solutions for the same problem. They are 2 tools that accomplish different things, and they are usually used together.
A join, as it sounds like you know, is part of a SELECT query that lets you get rows from more than one table.
A relation is part of the database structure itself that defines a rule. For example, if you had a city table and a country table, you should have a relation pointing each row in the city table to a row in the country table. This ensures the integrity of the data and does not allow a city row to point to a country row that doesn't exist.
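A hypothetical sketch of that city/country rule as a foreign key:

    CREATE TABLE country (
        country_id INTEGER PRIMARY KEY,
        name       VARCHAR(100) NOT NULL
    );

    CREATE TABLE city (
        city_id    INTEGER PRIMARY KEY,
        name       VARCHAR(100) NOT NULL,
        country_id INTEGER NOT NULL,
        -- every city row must point to an existing country row
        FOREIGN KEY (country_id) REFERENCES country (country_id)
    );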
Asking "Why use relations when you can use joins?" to me sounds like asking 'Why do variables have types when I could read them anyway?".
The theory behind databases is based on something called Relational Algebra. Relation is not a database specific term, it is derived from Relational Algebra.
A JOIN is a kind of relation; there can be different kinds of relations. Refer to this wiki page to learn more about what a relation exactly is.
The relationships established in a RELATIONAL database are the very core of the relational database model. In a database, we model entities. We use relationships between entities to maintain data integrity, and ensure the records are organized properly. Relationships also create indexes between related tables.
If you are not using the relationships, and/or modelling your table structure based upon the relationships between discrete entities, then you are not harnessing the true power of your relational database. Yes, you can make queries work, and yes, you can get the DB to do some useful work. But can you ensure that, say, every Employee record is properly RELATED to the proper company? Can you ensure that there is only one record for that company, and that all the employees of that company are related to that record?
Without designing your database structure around entities and the relationships between them, you might as well use a spreadsheet, or one big, flat table. RELATIONSHIPS and NORMALIZATION form the basis of the modern relational database.
An SQL table is an approximation of a relational model relation. Tables/relations (bases, views & query results) represent relations/relationships/associations. These are boxes & diamonds on ER (Entity-Relationship) & pseudo-ER diagrams. Most lines on such diagrams correspond to FK (foreign key) constraints. They are frequently but wrongly called "relations" or "relationships" but they are not. They are facts. An SQL FK says that a table's subrows appear elsewhere where they are a PK (primary key) or UNIQUE. Equivalently, it says that an entity participating in a relation/relationship/association also participates once in another one. Table meanings are necessary & sufficient to query. Constraints--including PKs, UNIQUEs & FKs--are not needed to query. They are consequences of the table relation/relationship/association choices & what situations/states can arise. They are for integrity to be enforced by the DBMS.
When Ed Codd developed the relational model of data for use with large-scale databases, he based his design on the mathematics of relational calculus and algebra. The results of this kind of mathematics are predictable with mathematical precision, and Ed Codd was able to forecast with near mathematical precision how relational databases would behave before the first one was ever built.
In mathematics, a relation is a mathematical abstraction. It's a subset of the cartesian product of two or more domains, as another responder said. If that's as clear as mud to you, maybe you're not a mathematician.
No matter. A good computer scientist can understand SQL tables fairly easily, and recognize and exploit the power of an SQL JOIN. This understanding will do in place of a mathematical understanding of relations for many purposes. An SQL table materializes a mathematical relation, approximately. If you are careful with table design, you can turn "approximately" into "exactly".

When should I use Oracle's Index Organized Table? Or, when shouldn't I?

Index Organized Tables (IOTs) are tables stored in an index structure. Whereas a table stored in a heap is unorganized, data in an IOT is stored and sorted by primary key (the data is the index). IOTs behave just like “regular” tables, and you use the same SQL to access them.
Every table in a proper relational database is supposed to have a primary key... If every table in my database has a primary key, should I always use an index organized table?
I'm guessing the answer is no, so when is an index organized table not the best choice?
Basically, an index-organized table is an index without a table. There is a table object, which we can find in USER_TABLES, but it is just a reference to the underlying index. The index structure matches the table's projection. So if you have a table whose columns consist of the primary key and at most one other column, then you have a possible candidate for INDEX ORGANIZED.
The main use case for an index-organized table is a table which is almost always accessed by its primary key and from which we always want to retrieve all the columns. In practice, index-organized tables are most likely to be reference data, code look-up affairs. Application tables are almost always heap-organized.
The syntax allows an IOT to have more than one non-key column. Sometimes this is correct. But it is also an indication that maybe we need to reconsider our design decisions. Certainly if we find ourselves contemplating the need for additional indexes on the non-primary key columns then we're probably better off with a regular heap table. So, as most tables probably need additional indexes most tables are not suitable for IOTs.
Coming back to this answer I see a couple of other responses in this thread propose intersection tables as suitable candidates for IOTs. This seems reasonable, because it is common for intersection tables to have a projection which matches the candidate key: STUDENTS_CLASSES could have a projection of just (STUDENT_ID, CLASS_ID).
I don't think this is cast-iron. Intersection tables often have a technical key (e.g. STUDENT_CLASS_ID). They may also have non-key columns (metadata columns like START_DATE, END_DATE are common). Also, there is no prevailing access path - we want to find all the students who take a class as often as we want to find all the classes a student is taking - so we need an indexing strategy which supports both equally well. I'm not saying intersection tables are not a use case for IOTs, just that they are not automatically so.
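For illustration (Oracle syntax, hypothetical names), such an intersection table could be declared as an IOT, with a secondary index covering the reverse access path:

    -- Intersection table stored as an index-organized table
    CREATE TABLE students_classes (
        student_id NUMBER NOT NULL,
        class_id   NUMBER NOT NULL,
        CONSTRAINT pk_students_classes PRIMARY KEY (student_id, class_id)
    )
    ORGANIZATION INDEX;

    -- Secondary index to support "all students in a class" lookups
    CREATE INDEX ix_classes_students ON students_classes (class_id, student_id);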
I'd consider them for very narrow tables (such as the join tables used to resolve many-to-many relationships). If (virtually) all the columns in the table are going to be in an index anyway, then why shouldn't you use an IOT?
Small tables can be good candidates for IOTs as discussed by Richard Foote here
I consider the following kinds of tables excellent candidates for IOTs:
"small" "lookup" type tables (e.g. queried frequently, updated infrequently, fits in a relatively small number of blocks)
any table for which you are already going to have an index covering all the columns anyway (i.e. you may as well save the space used by the table if the index duplicates 100% of the data)
From the Oracle Concepts guide:
Index-organized tables are useful when related pieces of data must be stored together or data must be physically stored in a specific order. This type of table is often used for information retrieval, spatial (see "Overview of Oracle Spatial"), and OLAP applications (see "OLAP").
This question from AskTom may also be of some interest, especially where someone gives a scenario and then asks whether an IOT would perform better than a heap-organised table. Tom's response is:
we can hypothesize all day long, but until you measure it, you'll never know for sure.
An index-organized table is generally a good choice if you only access data from that table by the key, the whole key, and nothing but the key.
Further, there are many limitations about what other database features can and cannot be used with index-organized tables -- I recall that in at least one version one could not use logical standby databases with index-organized tables. An index-organized table is not a good choice if it prevents you from using other functionality.
All an IOT really saves is the logical read(s) on the table segment, and since you might have spent two or three or more on the IOT/index, this is not always a great saving except for small data sets.
Another feature to consider for speeding up lookups, particularly on larger tables, is a single table hash cluster. When correctly created, they are more efficient for large data sets than an IOT because they require only one logical read to find the data, whereas an IOT is still an index that needs multiple logical I/Os to locate the leaf node.
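A hypothetical Oracle sketch of such a single table hash cluster for a key-based lookup:

    -- Rows are placed by hashing the cluster key, so an equality lookup
    -- on CODE typically needs a single logical read
    CREATE CLUSTER lookup_cluster (code NUMBER)
        SIZE 256
        SINGLE TABLE
        HASHKEYS 1000;

    CREATE TABLE lookup_codes (
        code        NUMBER PRIMARY KEY,
        description VARCHAR2(100)
    )
    CLUSTER lookup_cluster (code);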
I can't comment on IOTs per se, but if I'm reading this right, they're the same as a 'clustered index' in SQL Server. Typically you should think about not using such an index if your primary key (or the value(s) you're indexing, if it's not a primary key) is likely to be distributed fairly randomly, as those inserts can result in many page splits (expensive).
Keys such as identity columns (sequences in Oracle?) and dates 'around the current date' tend to make good candidates for such indexes.
An Index-Organized Table--in contrast to an ordinary table--has its own way of structuring, storing, and indexing data.
Index organized tables (IOTs) are indexes which actually hold the data being indexed, unlike ordinary indexes, which are stored somewhere else and hold links to the actual data.
