Data warehouse modeling - consistency between two fact tables - data-modeling

I'm having some trouble designing my data warehouse. Here's the context:
Financial people register our deals and report a financial snapshot every month. When they register new deals, they also indicate some information such as which equipment is sold, to which customer, etc. (our dimensions).
Project managers add additional data to these deals with milestone information (project start date, customer acceptance date, etc.), also on a monthly basis.
Finance will only use the financial information; project managers could use both types of information.
Based on this, I see several possible scenarios. Which is the best?
1st scenario: star schema
In this scenario, I have two separate fact tables for Finance and Project management. The problem is that I would have to duplicate the references to the dimensions (equipment, customer, etc.), since it is Finance that declares the deals and that information has to stay consistent for a given deal.
First Scenario Schema
2nd scenario: one common table
As we have the same granularity (both are monthly snapshots), we could merge the Finance and Project management information into a single table and expose two views to the users. But I fear it will become a mess (different business functions in a single table...).
3rd scenario: snowflake schema
We could also add a "Deal" table containing all the references to the other dimensions (customer, equipment, etc.).
Third Scenario Schema
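Roughly, what I have in mind for the third scenario (table and column names are just placeholders): a shared "Deal" table carrying the dimension references, and two monthly snapshot fact tables that only reference the deal:

[deal] table
============
deal_id
customer_id      (FK to customer dimension)
equipment_id     (FK to equipment dimension)

[finance_snapshot] table (monthly)
==================================
deal_id
snapshot_month
financial measures...

[project_snapshot] table (monthly)
==================================
deal_id
snapshot_month
milestone dates (project start, customer acceptance, ...)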
Thanks in advance for any useful advice!

Related

Identify the data model grain

I'm currently working on designing and implementing a banking data warehouse. I want to define the data model for the accounting data mart, define the grain, and use a star schema to model it. I have been told that we are interested in the transactions of a customer that's registered in a branch for an account .... (some other dimensions) ..... at a certain date. But they're asking for the DAILY transactions! My opinion is that it's pointless to have daily transactions in the data warehouse, because it would be an exact replica of the transactional database! This data warehouse will be used to build dashboards, and my guess is that decision makers aren't interested in such detailed data. What do you think?
Thank you.
Use the day grain for your time dimension and consider the following:
The warehouse is not a replica of the transactional database, even though the same information may be available in both. The warehouse is optimized for analysis: it contains all history, it's non-volatile, and it aggregates data along the dimensions.
In your example, the warehouse may have a single row representing many transactions that occurred within a single day, so it doesn't duplicate the grain. It may contain information from five years ago that's been purged from the transactional system. It will be lightning fast to aggregate amounts in a query. Its use will not put a load on your transactional system. Some day it may contain information from another transactional database when your company merges with another company. Or the customer information may be enhanced with data imported from one or more social networks.
The point is, don't balk at having fine-grained data in the warehouse that seems to be redundant to the transactional system. It's useful, and common.
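For example, a day-grain fact row might look something like this (names are illustrative) while still summarising any number of individual transactions made that day:

[fact_daily_account_activity] table
===================================
date_id          (FK to a day-grain date dimension)
customer_id
branch_id
account_id
transaction_count
total_amount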
A principle of dimensional modelling is to always model at the finest grain possible. I'd never think of modelling transactions at anything coarser than the day grain, and I'd even try for time of day (although that may be a separate dimension).

Which database to use for highly-connected data model with frequent schema changes?

I am currently working on a project where we use web mining (web crawlers) to build up a company database. At the moment we employ PostgreSQL as our main database; however, I feel it will cause a lot of problems in the future, because as our crawlers develop and extract more data we'll see many schema changes/additions.
Some examples:
At the moment we store one address per company, but at some point we might want to store multiple addresses. (A 1-1 relationship turns into 1-n, or even n-n.)
Companies in different industries have very different attributes, so we have a lot of NULL fields in our relational schema at the moment.
Different degrees of information are available. For some companies we only know the CEO's name, which could be stored in a single attribute of the company; for other companies we might want a relationship to a Person relation, because we have a photo, birthdate, CV, etc. (the schema is not fixed).
What kind of database would be suited for such a task? I've looked into MongoDB, Neo4J, OrientDB. Some requirements which are important for us:
No license fee for non-open-source commercial projects
Should scale to storing 100 GB - 1000 TB, while executing OLAP queries for displaying a web interface (company query interface) in the millisecond range

How does data denormalization work with the Microservice Pattern?

I just read an article on Microservices and PaaS Architecture. In that article, about a third of the way down, the author states (under Denormalize like Crazy):
Refactor database schemas, and de-normalize everything, to allow complete separation and partitioning of data. That is, do not use underlying tables that serve multiple microservices. There should be no sharing of underlying tables that span multiple microservices, and no sharing of data. Instead, if several services need access to the same data, it should be shared via a service API (such as a published REST or a message service interface).
While this sounds great in theory, in practicality it has some serious hurdles to overcome. The biggest of which is that, often, databases are tightly coupled and every table has some foreign key relationship with at least one other table. Because of this it could be impossible to partition a database into n sub-databases controlled by n microservices.
So I ask: Given a database that consists entirely of related tables, how does one denormalize this into smaller fragments (groups of tables) so that the fragments can be controlled by separate microservices?
For instance, given the following (rather small, but exemplar) database:
[users] table
=============
user_id
user_first_name
user_last_name
user_email
[products] table
================
product_id
product_name
product_description
product_unit_price
[orders] table
==============
order_id
order_datetime
user_id
[products_x_orders] table (for line items in the order)
=======================================================
products_x_orders_id
product_id
order_id
quantity_ordered
Don't spend too much time critiquing my design, I did this on the fly. The point is that, to me, it makes logical sense to split this database into 3 microservices:
UserService - for CRUDding users in the system; should ultimately manage the [users] table; and
ProductService - for CRUDding products in the system; should ultimately manage the [products] table; and
OrderService - for CRUDding orders in the system; should ultimately manage the [orders] and [products_x_orders] tables
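As rough interfaces (method names are just illustrative, not from the article), that split might look like:

interface UserService {
    // CRUD for the [users] table
    User createUser(User user);
    User getUserById(Long userId);
}

interface ProductService {
    // CRUD for the [products] table
    Product createProduct(Product product);
    Product getProductById(Long productId);
}

interface OrderService {
    // CRUD for the [orders] and [products_x_orders] tables
    Order createOrder(Order order);
    Order getOrderById(Long orderId);
}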
However, all of these tables have foreign key relationships with each other. If we denormalize them into standalone groups with no cross-service references, they lose all their semantic meaning:
[users] table
=============
user_id
user_first_name
user_last_name
user_email
[products] table
================
product_id
product_name
product_description
product_unit_price
[orders] table
==============
order_id
order_datetime
[products_x_orders] table (for line items in the order)
=======================================================
products_x_orders_id
quantity_ordered
Now there's no way to know who ordered what, in which quantity, or when.
So is this article typical academic hullabaloo, or is there a real world practicality to this denormalization approach, and if so, what does it look like (bonus points for using my example in the answer)?
This is subjective but the following solution worked for me, my team, and our DB team.
At the application layer, microservices are decomposed by semantic function.
e.g. a Contact service might CRUD contacts (metadata about contacts: names, phone numbers, contact info, etc.)
e.g. a User service might CRUD users with login credentials, authorization roles, etc.
e.g. a Payment service might CRUD payments and work under the hood with a 3rd party PCI compliant service like Stripe, etc.
At the DB layer, the tables can be organized however the devs/DBAs/devops people want them organized.
The problem is with cascading and service boundaries: Payments might need a User to know who is making a payment. Instead of modeling your services like this:
interface PaymentService {
    // takes a full User object, which couples the Payment service to the User service's entities
    PaymentInfo makePayment(User user, Payment payment);
}
Model it like so:
interface PaymentService {
    // references the user by id only; the caller resolves the User via the User service if it needs it
    PaymentInfo makePayment(Long userId, Payment payment);
}
This way, entities that belong to other microservices are only referenced inside a particular service by ID, not by object reference. This allows DB tables to have foreign keys all over the place, but at the app layer "foreign" entities (that is, entities living in other services) are available via ID. This stops object cascading from growing out of control and cleanly delineates service boundaries.
The problem it does incur is that it requires more network calls. For instance, if I gave each Payment entity a User reference, I could get the user for a particular payment with a single call:
User user = paymentService.getUserForPayment(payment);
But using what I'm suggesting here, you'll need two calls:
Long userId = paymentService.getPayment(payment).getUserId();
User user = userService.getUserById(userId);
This may be a deal breaker. But if you implement caching and well-engineered microservices that respond in 50-100 ms per call, I have no doubt that these extra network calls can be crafted so they add no noticeable latency to the application.
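As a rough illustration of the caching point (the class is mine, not from the original answer), a thin in-memory cache in front of the user lookup could look like this; a real implementation would add expiry and invalidation:

import java.util.concurrent.ConcurrentHashMap;
import java.util.concurrent.ConcurrentMap;

class CachingUserService {
    private final UserService delegate;   // client for the real User microservice
    private final ConcurrentMap<Long, User> cache = new ConcurrentHashMap<>();

    CachingUserService(UserService delegate) {
        this.delegate = delegate;
    }

    User getUserById(Long userId) {
        // only the first lookup for a given id pays the network round trip
        return cache.computeIfAbsent(userId, delegate::getUserById);
    }
}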
This is indeed one of the key problems in microservices, and it is quite conveniently omitted in most articles. Fortunately there are solutions. As a basis for discussion, let's use the tables you provided in the question.
The tables above show how things look in the monolith: just a few tables with joins.
To refactor this into microservices we can use a few strategies:
API join
In this strategy, foreign keys between microservices are broken and each microservice exposes an endpoint that mimics the key. For example, the Product microservice exposes a findProductById endpoint, and the Order microservice can call this endpoint instead of performing a join.
It has an obvious downside: it is slower.
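A rough sketch of the API join on the order side (client and view types are illustrative):

import java.util.ArrayList;
import java.util.List;

class OrderDetailsAssembler {
    private final ProductClient productClient;   // HTTP client wrapping the Product microservice

    OrderDetailsAssembler(ProductClient productClient) {
        this.productClient = productClient;
    }

    List<OrderLineView> linesFor(Order order) {
        List<OrderLineView> views = new ArrayList<>();
        for (OrderLine line : order.getLines()) {
            // instead of a SQL join to [products], call the Product microservice per line item
            Product product = productClient.findProductById(line.getProductId());
            views.add(new OrderLineView(product.getName(), line.getQuantityOrdered()));
        }
        return views;
    }
}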
Read-only views
In the second solution you create a copy of the table in the second microservice's database. The copy is read-only. Each microservice can use mutating operations on its own read/write tables; on the read-only tables copied from other databases it can (obviously) only read.
High-performance read
It is possible to achieve high-performance reads by introducing solutions such as Redis or memcached on top of the read-only view approach. Both sides of the join should be copied into a flat structure optimized for reading. You can introduce a completely new stateless microservice that serves reads from this storage. While it seems like a lot of hassle, it is worth noting that it can outperform a monolithic solution on top of a relational database.
There are a few possible solutions. The ones that are simplest to implement have the lowest performance; the high-performance solutions will take a few weeks to implement.
I realise this is possibly not a good answer but what the heck. Your question was:
Given a database that consists entirely of related tables, how does
one denormalize this into smaller fragments (groups of tables)
WRT the database design I'd say "you can't without removing foreign keys".
That is, people pushing Microservices with the strict no shared DB rule are asking database designers to give up foreign keys (and they are doing that implicitly or explicitly). When they don't explicitly state the loss of FK's it makes you wonder if they actually know and recognise the value of foreign keys (because it is frequently not mentioned at all).
I have seen big systems broken into groups of tables. In these cases there can be either A) no FK's allowed between the groups or B) one special group that holds "core" tables that can be referenced by FK's to tables in other groups.
... but in these systems "groups of tables" is often 50+ tables so not small enough for strict compliance with microservices.
To me, the other related issue to consider with the microservice approach to splitting the DB is the impact this has on reporting - the question of how all the data is brought together for reporting and/or loaded into a data warehouse.
Somewhat related is the tendency to ignore built-in DB replication features in favor of messaging (and how DB-based replication of the core tables / DDD shared kernel impacts the design).
EDIT: (the cost of JOIN via REST calls)
When we split up the DB as suggested by microservices and remove FK's we not only lose the enforced declarative business rule (of the FK) but we also lose the ability for the DB to perform the join(s) across those boundaries.
In OLTP FK values are generally not "UX Friendly" and we often want to join on them.
In the example, if we fetch the last 100 orders we probably don't want to show the customer id values in the UX. Instead we need to make a second call to the customer service to get their names. However, if we also want the order lines, we need to make further calls to the products service to show the product name, SKU, etc. rather than the product id.
In general we can find that when we break up the DB design in this way we need to do a lot of "JOIN via REST" calls. So what is the relative cost of doing this?
Actual Story: Example costs for 'JOIN via REST' vs DB Joins
There are 4 microservices and they involve a lot of "JOIN via REST". A benchmark load for these 4 services comes to ~15 minutes. Those 4 microservices converted into 1 service with 4 modules against a shared DB (that allows joins) executes the same load in ~20 seconds.
This unfortunately is not a direct apples to apples comparison for DB joins vs "JOIN via REST" as in this case we also changed from a NoSQL DB to Postgres.
Is it any surprise that "JOIN via REST" performs relatively poorly compared to a DB that has a cost-based optimiser, etc.?
To some extent, when we break up the DB like this we are also walking away from the cost-based optimiser and all that it does with query execution planning for us, in favor of writing our own join logic (we are, in effect, writing our own relatively unsophisticated query execution plan).
I would see each microservice as an object: just as with any ORM, you use those objects to pull the data and then create joins within your code and query the collections; microservices should be handled in a similar manner. The only difference here is that each microservice represents one object at a time rather than a complete object tree. An API layer should consume these services and model the data in whatever way it has to be presented or stored.
Making several calls back to services for each transaction will not have an impact, as each service runs in a separate container and all these calls can be executed in parallel.
#ccit-spence, I liked the approach of intersection services, but how can it be designed and consumed by other services? I believe it will create a kind of dependency for the other services.
Any comments please?

How to visualize relation between objects in a database?

I have a database which I want to visualize in some kind of tool. Let me explain the basics:
Company A does business with Transport Company A and Transport Company B.
Transport Company A does business with Company A, Company B and Company C.
Company C does business with Transport Company A and Transport Company B.
As you can see every Company does business with different Transport Companies and vice versa. These relationships can be implemented in a database, and when drawing a visual model on paper this is also very easy.
Of course the model will contain hundreds of Companies and Transport Companies, so I want a visualization tool where an overview of these relations can be displayed.
My question is which tools can be used for realizing this?
I think you want to look at Microsoft Visio (get the 2010 version. 2013 is almost unusable from a database standpoint).
But if I am assuming correctly, you want to create a table per company. Don't do this! It can cause redundancy and data integrity problems. You want to create just one table and what is called a unary many-to-many relationship - a relationship in which many rows can relate to many other rows in the same table. I won't go into more detail unless you want me to, as I spent a week or two of my Database Design course last month just on many-to-many relationships, and it gets kinda complicated.
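For illustration (names are mine), that means one table for all companies plus a relationship table that pairs company ids:

[companies] table
=================
company_id
company_name
company_type     (e.g. 'customer' or 'transport')

[company_relationships] table
=============================
company_id            (FK to companies)
partner_company_id    (FK to companies)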

Data historic in a business application

I have worked with a few databases up to now and their philosophies were very different. It got me wondering:
Is it a good idea to duplicate tables for historical purposes in a business application?
By business application I mean:
software used by an enterprise to manage all of its data (e.g. invoices, clients, stock [if applicable], etc.)
By 'duplicating tables' I mean:
when, let's say, your invoices go out of date (after one year, after being invoiced and paid, whatever), you store them in 'historic' tables, which keeps them available for consultation but means they shouldn't be modified. Same thing for clients inactive for years.
Pros:
Using historic tables can speed up searches through the data actually in use, since it makes the active tables smaller.
Better separation of historic and current data
Easier to remove data from the database and store it on external media without affecting the live database (more predictable, because the data had no chance of being used since it was in a historic table). This often happens after about 10 years, when you have unused data.
Cons:
Makes your database have up to twice as many tables.
Makes your database more complex
Makes your program more complex for reports, since you sometimes have to query twice as many tables.
Archiving is a key aspect of enterprise applications, but in general, I'd recommend against it unless you really, really need it.
Archiving means you either accept you can't get at historical data before a specific date, or that you create some scheme for managing "current" and "historical" data; your solution (archive tables) is one solution to this problem.
Neither solution is all that nice - archive tables mean lots of duplicated code/data, complex archival procedures (esp. with foreign key relationships), lots of opportunity for errors.
I do believe the concept of "time" should be baked into the domain and data model for most business applications, along with mutability - you shouldn't be able to change an order once it's been confirmed, but you should be able to add products to a new order.
As for your pros:
In general, I don't think you'd notice the performance impact unless you're talking about very, very large-scale businesses. I don't think - on modern SQL server solutions - you'd notice the speed difference between querying 10,000 customer records or 1,000,000 customer records.
The definition of "historic" is actually rather tricky - most businesses have to keep historical data around for regulatory and tax purposes, often for many years; they'll probably want to be able to analyse trends over several years, etc. If the business wants to see "how many widgets did we sell per month over the last 5 years", that means you have to keep 5 years of data around somehow (either "raw" or pre-aggregated).
Yes, separating out data would be easier. Building a feature today - which you have to maintain every time you change the application - for pay-off in 10 years seems a poor investment to me...
I would only have a "duplicate" type table to store historic VERSIONS of each record, like a change log. Even a change log is not really a duplicate, as it would have to carry information about when the record was changed, etc. As a general practice, I would not recommend migrating rows from an active table to a historical table - you'd have to manage different versions of queries to find the data in two places! Use a status to control whether the data can be changed. I could see it being done under certain circumstances for a particular application. Once you start adding foreign keys, it becomes difficult to remove data: if you had a truly enterprise business application and you attempted to remove invoices, you'd have all sorts of issues with FKs to other tables - accounts payable/receivable, costs of raw materials, profits from sales, shipping info, etc.
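For illustration (names are hypothetical), that looks like a status column on the live table plus a change-log table, rather than a parallel archive copy:

[invoices] table
================
invoice_id
status            (e.g. 'open', 'invoiced', 'paid' - controls whether edits are allowed)
...other invoice columns...

[invoice_change_log] table
==========================
change_id
invoice_id        (FK to invoices)
changed_at
changed_by
old_values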
