Should we plan the purchase module or inventory first for data warehouse design?

I have been developing a data warehouse for analytical needs. As I am a new learner, I started by working on my own area of manufacturing-related tables and built reports for these. I have been following the Kimball design approach, adding new processes step by step.
Now I would like to start working on inventory or purchase transactions.
I would like to know the recommended order of processes for a good data warehouse design: for example, should I work on purchase transactions or inventory transactions first, and so on for all the other processes? For now, I am considering this for a manufacturing organization. Is such a practice documented anywhere?
Thank you in advance for the help.

Related

How to handle Manual Input within your Data Warehouse

I recently joined an organisation where we have just introduced a Data Warehouse solution (Snowflake) that incorporates a large number of external systems (CRM, etc.). There are use cases that require manual data input on a weekly basis (e.g. sales targets). This is one area that I am having trouble with.
In an ideal world, all systems would perfectly integrate and form the core data within the DW.
But the reality is that we will likely need to keep the manual data input to create a complete picture (at least until we can find a way around it long term).
So far I have thought of using Excel/Google Sheets as the manual entry point, feeding a backend service that populates DB tables on the staging server.
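To make this concrete, here is a minimal sketch of the kind of load step I have in mind, assuming Python with the Snowflake connector and a staging table that already exists (file name, connection values and table name are all placeholders):

```python
import pandas as pd
import snowflake.connector
from snowflake.connector.pandas_tools import write_pandas

# Read the manually maintained spreadsheet (file and sheet names are placeholders)
targets = pd.read_excel("sales_targets.xlsx", sheet_name="targets")

# Connect to Snowflake (all connection values are placeholders)
conn = snowflake.connector.connect(
    account="my_account",
    user="etl_user",
    password="...",
    warehouse="LOAD_WH",
    database="STAGING",
    schema="MANUAL_INPUT",
)

# Land the rows in a staging table; assumes STG_SALES_TARGETS was created beforehand
success, _, nrows, _ = write_pandas(conn, targets, "STG_SALES_TARGETS")
print(f"loaded {nrows} rows, success={success}")
```

From there, downstream models could treat the staging table like any other source.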
Does anyone here have experience with this scenario? How do users of a data platform typically handle it, and what is good practice for handling manual data entry into a Data Warehouse solution?
Any help you can provide here would be greatly appreciated.

Understanding the data as a new hire

I have just started a new job where I have to maintain the data warehouse. The problem is that the company data warehouse gets data from eAutomate and some other software, and when I look at the tables in the data warehouse, I don't understand anything. What would be the best way to understand all the data in the warehouse for analytical purposes? It looks like the eAutomate company built all the tables for the company, and no one is here to explain how the data got in there. I feel bad for not being very productive, since it's my second week on the job.
Honestly, there has to be a contact for eAutomate who set up the DB. Without experience with eAutomate, you're handicapped.
I'd recommend getting an idea of which DBs/tables exist, then getting permission to add records or modify some test data in eAutomate. Start looking in the tables that seem to correspond with the area of eAutomate you modified data in.
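For that first step, a quick inventory script can help. This is only a sketch: it assumes the warehouse sits on SQL Server and that you can connect with pyodbc, so adjust the driver and connection details for whatever platform eAutomate actually loads into.

```python
import pyodbc

# Connection string is a placeholder; adjust driver/server/database for your setup
conn = pyodbc.connect(
    "DRIVER={ODBC Driver 17 for SQL Server};"
    "SERVER=dw-server;DATABASE=DataWarehouse;Trusted_Connection=yes;"
)

# List every schema/table with its column count, to see what exists at a glance
rows = conn.execute("""
    SELECT TABLE_SCHEMA, TABLE_NAME, COUNT(*) AS column_count
    FROM INFORMATION_SCHEMA.COLUMNS
    GROUP BY TABLE_SCHEMA, TABLE_NAME
    ORDER BY TABLE_SCHEMA, TABLE_NAME
""").fetchall()

for schema, table, column_count in rows:
    print(f"{schema}.{table}: {column_count} columns")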
Your best bet is to get some domain knowledge before you dive into the DB.
Just my .02. Hope it helps!

Identify the data model grain

I'm currently working on a project to design and implement a banking data warehouse. I want to define the data model for the accounting data mart, define the grain, and use the star schema to model it. I have been told that we are interested in the transactions of a customer that is registered at a branch for an account .... (some other dimensions) ..... at a certain date. But they're asking for the DAILY transactions! My opinion is that it's pointless to have daily transactions in the data warehouse because it would be an exact replica of the transactional database! This data warehouse will be used to build dashboards, and my guess is that decision makers aren't interested in such detailed data. What do you think?
Thank you.
Use the day grain for your time dimension and consider the following:
The warehouse is not a replica of the transactional database, even though the same information may be available in both. The warehouse is optimized for analysis: it contains all history, it's non-volatile, and it aggregates data along the dimensions.
In your example, the warehouse may have a single row representing many transactions that occurred within a single day, so it doesn't duplicate the grain. It may contain information from five years ago that's been purged from the transactional system. It will be lightning fast to aggregate amounts in a query. Its use will not put a load on your transaction system. Someday it may contain information from another transactional database when your company merges with another company. Or the customer information may be enhanced with data imported from one or more social networks.
The point is, don't balk at having fine-grained data in the warehouse that seems to be redundant to the transactional system. It's useful, and common.
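As a small illustration of that first point, here is a sketch (with made-up data) of how individual transactions roll up into a day-grain fact, one row per account per day:

```python
import pandas as pd

# Hypothetical raw transactions, as the OLTP system would hold them
tx = pd.DataFrame({
    "account_id": [101, 101, 101, 202],
    "txn_date":   ["2024-03-01", "2024-03-01", "2024-03-02", "2024-03-01"],
    "amount":     [50.0, -20.0, 10.0, 300.0],
})

# Day-grain fact: one row per account per day, not one row per transaction
daily_fact = (
    tx.groupby(["account_id", "txn_date"], as_index=False)
      .agg(txn_count=("amount", "size"), net_amount=("amount", "sum"))
)
print(daily_fact)
```

The same idea applies whether the rollup happens in an ETL tool, in SQL, or in a dataframe library; the point is that the day grain is already a summary, not a copy.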
A principle of dimensional modelling is to always model at the finest grain possible. I'd never think of modelling transactions at anything coarser than day, and I'd even try for time of day (although that may be a separate dimension).

Data warehouse modeling - consistency between two fact tables

I am having some trouble designing my data warehouse. Here's the context:
Finance people register our deals and report a financial snapshot every month. When they register new deals, they also indicate some information such as which equipment is sold, to which customer, etc. (our dimensions).
Project managers add additional data to these deals with milestone information (project startup date, customer acceptance date, etc.), also on a monthly basis.
Finance will only use finance information; project managers could use both types of information.
Based on this, I see several possible scenarios. Which is best?
1st scenario: star schema
In this scenario, I have two separate fact tables for finance and project management. The problem is that I will have to duplicate the references to the dimensions (equipment, customer, etc.), since it is Finance that declares the deals and that information has to stay consistent for a given deal.
First Scenario Schema
2nd scenario: one common table
Since we have the same granularity (both are monthly snapshots), we could merge the finance and project management information into a single table and propose two views to the users. But I fear that it will become a mess (different enterprise functions in a single table...).
3rd scenario: snowflake schema
We could also add a "Deal" table, containing all the references to the other dimensions (customer, equipment, etc.).
Third Scenario Schema
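To illustrate what I mean by the third scenario, here is a rough sketch of the tables (all names and columns are only placeholders), with the "Deal" table holding the dimension references so the two fact tables don't duplicate them:

```python
import sqlite3

con = sqlite3.connect(":memory:")
con.executescript("""
-- Deal table carries the shared dimension references declared by Finance
CREATE TABLE dim_deal (
    deal_key      INTEGER PRIMARY KEY,
    customer_key  INTEGER,   -- FK to dim_customer
    equipment_key INTEGER    -- FK to dim_equipment
);

-- Monthly finance snapshot: grain = deal x month
CREATE TABLE fact_finance_snapshot (
    deal_key  INTEGER REFERENCES dim_deal(deal_key),
    month_key INTEGER,        -- FK to dim_date at month grain
    revenue   REAL,
    cost      REAL
);

-- Monthly project-management snapshot: grain = deal x month
CREATE TABLE fact_project_snapshot (
    deal_key        INTEGER REFERENCES dim_deal(deal_key),
    month_key       INTEGER,
    startup_date    TEXT,
    acceptance_date TEXT
);
""")
print("scenario 3 schema created")
```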
Thanks in advance for any useful advice!

Architecture for database analytics

We have an architecture where we provide each customer (internet merchant) Business Intelligence-like services for their website. Now I need to analyze that data internally (for algorithmic improvement, performance tracking, etc.), and it is potentially quite heavy: we have up to millions of rows per customer per day, and I may want to know how many queries we had in the last month, compared week by week, etc. That is on the order of billions of entries, if not more.
The way it is currently done is quite standard: daily scripts scan the databases and generate big CSV files. I don't like this solution for several reasons:
as is typical with these kinds of scripts, they fall into the write-once and never-touched-again category
tracking things in "real time" is necessary (we have a separate toolset to query the last few hours at the moment).
this is slow and non-"agile"
Although I have some experience dealing with huge datasets for scientific use, I am a complete beginner as far as traditional RDBMSs go. It seems that using a column-oriented database for analytics could be a solution (the analytics don't need most of the data we have in the app database), but I would like to know what other options are available for this kind of issue.
You will want to google Star Schema. The basic idea is to model a special data warehouse / OLAP instance of your existing OLTP system in a way that is optimized to provide the type of aggregations you describe. This instance is composed of facts and dimensions.
For example, sales 'facts' can be modeled to provide analytics based on customer, store, product, time and other 'dimensions'.
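A minimal sketch of such a star schema, using placeholder names and a throwaway SQLite database just to show the shape:

```python
import sqlite3

con = sqlite3.connect(":memory:")
con.executescript("""
CREATE TABLE dim_date     (date_key INTEGER PRIMARY KEY, full_date TEXT, year INTEGER, month INTEGER);
CREATE TABLE dim_customer (customer_key INTEGER PRIMARY KEY, customer_name TEXT);
CREATE TABLE dim_product  (product_key INTEGER PRIMARY KEY, product_name TEXT);
CREATE TABLE dim_store    (store_key INTEGER PRIMARY KEY, store_name TEXT);

-- Fact table: one row per sale, keyed to every dimension
CREATE TABLE fact_sales (
    date_key     INTEGER REFERENCES dim_date(date_key),
    customer_key INTEGER REFERENCES dim_customer(customer_key),
    product_key  INTEGER REFERENCES dim_product(product_key),
    store_key    INTEGER REFERENCES dim_store(store_key),
    quantity     INTEGER,
    amount       REAL
);
""")

# Typical analytical query: monthly sales by store (empty here, but shows the pattern)
monthly = con.execute("""
    SELECT d.year, d.month, s.store_name, SUM(f.amount) AS total_sales
    FROM fact_sales f
    JOIN dim_date d  ON d.date_key  = f.date_key
    JOIN dim_store s ON s.store_key = f.store_key
    GROUP BY d.year, d.month, s.store_name
""").fetchall()
print(monthly)
```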
You will find Microsoft's Adventure Works sample databases instructive, in that they provide both the OLTP and OLAP schemas along with representative data.
There are specialized databases for analytics such as Greenplum, Aster Data, Vertica, Netezza, Infobright and others. You can read about those databases on this site: http://www.dbms2.com/
The canonical handbook on star-schema-style data warehouses is Ralph Kimball's "The Data Warehouse Toolkit" (there's also "Clickstream Data Warehousing" in the same series, but that is from 2002, I think, and somewhat dated; if there's a newer edition of the Kimball book, it might serve you better). If you google "web analytics data warehouse" there are a bunch of sample schemas available to download and study.
On the other hand, a lot of the NoSQL work that happens in real life is based around mining clickstream data, so it might be worth seeing what the Hadoop/Cassandra/[latest-cool-thing] community has in the way of case studies, to see if your use case matches well with what they can do.