When a data model becomes a data nightmare - data-modeling

I am revising a data model for an educational company. So not your straightforward retail or finance model.
Problem is the company want to use the datawarehouse to produce listing reports, the kind of reports that should normally be produced by the Education Management System (EMS).
Reports like class lists, detailed learner information reports, guardian contact, work, and financial information, academic reports.
My argument thus far has been that a datawarehouse houses an analytical data model used to data analytics. Not a reporting model for an education management system with a for more complex relational database.
The current model has (forgive the crude diagram) snow flaked completely out of control in such a way that reporting and analytical tools are struggling to interpret it. The warehouse is starting to resemble the RDMS model more and more. There are so many relationships between Dimensions in order to keep data together.
Some of the tables contains so much unnecessary attributes that have no analytical value and purely exist for a listing report.
I need some opinions/criticism (possibly as good reference) regarding this approach so I can better explain the problem to a business who is oblivious to the concept of data modelling. I need to make them understand that butchering the DW to handle detailed reporting is going to end badly for them.

Related

How to Organize a Messy Database

I know there is no easy answer to this question, but how do I cleanup a database with no relationships, foreign keys, and not a whole lot of structure?
I'm an amateur to SQL, and I've inherited a database that is complete mess. We have no sort of referential integrity, and there's not a whole lot of logic to how tables are working.
My database is all data that comes from a warehouse that builds servers.
To give you an idea of the type of data I'm working with:
EDI from customers
Raw output from server projects
Sales information
Site information
Parts lists
I have been prioritizing Raw output and EDI information, and generating reports with that information using SSRS. I have learned a lot about SQL Server and the BI Microsoft tools (SSIS and SSRS) in my short time doing this. However, I'm still an amateur and I want to build a solid database that flows well and can stand on its own.
It seems like a data warehouse model is the type of structure I should adapt.
My question how do I take my mess of a database and make something more organized before I drown in data?
Since your end goal appears to be business reporting, and you're dealing with data from multiple sources made up from "isolated" tables, I would advise you to start by aggregating all that into a data model.
Personally, I would design a dimensional model to structure and store all that data, with the goal of being easy to understand (for reporting or adhoc querying). The model should be focused on business entities and their transactions. In a dimensional model, the business entities will (almost always) be the dimensions and the transactions (the metrics) will be the facts. For example, without knowing your model I'm guessing that the immediate entities would include Customer, Site, Part and transactions would include ServerSale, SiteVisit, PartPurchase, PartRepair, PartOrder, etc...
More information about dimensional modelling here and here, but I suggest going straight to the source: https://www.kimballgroup.com/data-warehouse-business-intelligence-resources/books/data-warehouse-dw-toolkit/
When your model is designed (and implemented in a database like SQL Server) you'll then be loading data into the model, by extracting it from its different source systems/databases and transforming it from the current structure into the structure defined by the model, namely by using an ETL tool like MS Integration Services. For example, your Customer data may be scattered across the "sales", "customer" and "site", so you want to aggregate all that data and load it into a single Customer dimension table. It's when doing this ETL that you should check your data for the problems you already mentioned, loading correct rows into you data model and discarding incorrect rows into a file/log where they can later be checked and corrected. (multiple ways to address this).
A straightforward tutorial to get started on doing ETL using SSIS can be found at https://technet.microsoft.com/en-us/library/jj720568(v=sql.110).aspx
So, to sum up, you should build a data mart:
design a dimensional model that represents the business facts and
context on the data you have. This will strongly facilitate both data understanding and reporting, because a dimensional model is closely matches business users terminology and mental models.
use an ETL tool to extract the data from its current source, process it (e.g. check for data quality problems, join data from different sources) and load it into the dimensional model and check it for problems. This will get you close to having an automated data integration job/pipeline with quality checks you deem fit for the data.

How to design a database that can handle unknown reports?

I am working on a project which stores large amounts of data on multiple industries.
I have been tasked with designing the database schema.
I need to make the database schema flexible so it can handle complex reporting on the data.
For example,
what products are trending in industry x
what other companies have a similar product to my company
how is my company website different to x company website
There could be all sorts of reports. Right now everything is vague. But I know for sure the reports need to be fast.
Am I right in thinking my best path is to try to make as many association tables as I can? The idea being (for example) if the product table is linked to the industry table, it'll be relatively easy to get all products for a certain industry without having to go through joins on other tables to try to make a connection to the data.
This seems insane though. The schema will be so large and complex.
Please tell me if what I'm doing is correct or if there is some other known solution for this problem. Perhaps the solution is to hire a data scientist or DBA whose job is to do this sort of thing, rather than getting the programmer to do it.
Thank you.
I think getting these kinds of answers from a relational/operational database will be very difficult and the queries will be really slow.
The best approach I think will be to create multidimensional data structures (in other words a data warehouse) where you will have flattened data which will be easier to query than a relational database. It will also have historical data for trend analysis.
If there is a need for complex statistical or predictive analysis, then the data scientists can use the data warehouse as their source.
Adding to Amit's answer above, the problem is that what you need from your transactional database is a heavily normalized association of facts for operational purposes. For an analytic side you want what are effectively tagged facts.
In other words what you want is a series of star schemas where you can add whatever associations you want.

How are OLAP, OLTP, data warehouses, analytics, analysis and data mining related?

I'm trying to understand what OLAP, OLTP, data mining, analytics etc. are about, and I feel like my understanding about some of these concepts is still a bit vague. Information about these subjects tend to be explained in a very complex manner on the internet.
I feel like a question like this is likely to be closed since it's a very broad one, so I'll try to narrow it down into two questions:
Question 1:
After doing research I understand the following about these concepts, is it correct?
Analysis is decomposing something complex, to understand the inner workings better.
Analytics is predictive analysis on information that requires alot of math and statistics.
There's many type of databases, but they are either OLTP (transactional) or OLAP (analytical).
OLTP databases use ER diagrams, and are therefore easier to update because they are in normalized form.
In contrast, OLAP uses the denormalized star schema's and is therefore easier to query
OLAP is used for predictive analysis and OLTP is usually used in more practical situations since theres no redundancy.
Data warehouses is a type of OLAP database, and usually consists out of multiple other databases.
Data mining is a tool used in analytics, where u use computer software to find out relationships between data so you can predict things (e.g. customer behavior).
Question 2:
I'm especially confused about the difference between analytics and analysis. They say analytics is multidimensional analysis, but what is that supposed to mean?
I will try to explain you from the top of the pyramid:
Business Intelligence (what you didn't mentioned) is term in IT which stands for a complex system and gives useful informations about company from data.
So, BI systems has target: Clean, accurate and meaningful informations.
Clean means there is no tech problems (missing keys, incomplete data ect). Accurate means accurate - BI systems are also used as fault checker of production database (logical faults - i.e invoice bill is too high, or inactive partner is used ect). It has been accomplished with rules. Meaningful is hard to explain, but in simple english, it's all your data (even excel table from the last meeting), in way you want.
So, BI system has back-end: It's data warehouse.
DWH is nothing else than a database (instance, not software). It can be stored in RDBMS, analytical db (columnar or document store types), or NoSQL databases.
Data warehouse is term used usually for whole database that I explained above. There could be number of data-marts (if Kimball model is used) - more often, or relational system in 3rd normalized form (Inmon model) called enterprise data warehouse.
Data marts are tables inside DWH that are related (star schema, snowflake schema). Fact table (business process in denormalized form ) and dimension tables.
Each data mart represents one business process. Example: DWH has 3 data marts. One is retail sales, second is export, and third is import. In retail you can see total sales, qty sold, import price, profit (measures) by SKU, date, store, city ect (dimensions).
Loading data in DWH is called ETL(extract, transform, load).
Extract data from multiple sources (ERP db, CRM db, excel files, web service...)
Transform data (clean data, connect data from diff sources, match keys, mine data)
Load data (Load transformed data in specific data marts)
edit because of comment: ETL process is usually created with ETL tool, or manually with some programming language (python, c# ect) and APIs.
ETL process is group of SQLs, procedures, scripts and rules related and separated in 3 parts (look above), controlled by meta data.
It's either scheduled (every night, every few hours) or live (change data capture, triggers, transactions).
OLTP and OLAP are types of data processing. OLTP is used in transaction purpose, between database and software (usually only one way of input/output data).
OLAP is for analitical purpose, and this means there is multiple sources, historical data, high select query performance, mined data.
edit because of comment: Data Processing is way how data is stored and accessed from database. So, based on of your needs, database is set in different way.
Image from http://datawarehouse4u.info/:
Data mining is the computational process of discovering patterns in large data sets. Mined data can give you more insight view of business process or even forecast.
Analysis is a verb, which in BI world means simplicity of getting asked information from data. Multidimensional analysis actually says how system is slicing your data (with dimensions inside cube). Wikipedia said that analysis of data is a process of inspecting data with the goal of discovering useful information.
Analytics is a noun and it represent a result of analysis process.
Don't get so much fuss about those two words.
I can tell you about Data mining as i had project on Data mining. Data mining is not a tool ,Its a method of mining data and different tools used for data mining is WEKA ,RAPID MINER etc. Data mining follows many algorithms which are inbuilt in tools like Weka ,Rapid miner. Algorithms like Clustering algorithm , assosiation algorithm etc.
A simple example i can give you of data mining . Teacher is teaching science subject in a class by using different methods of teaching like using chalkboard,presentation,Practical. So now our aim is to find which method is suitable for students. Then we do survey and take students opinion 40 students like chalk board ,30 likes presentation and 20 likes practical method. So with help of this data we can make the rules for example Science subject should be taught by chalk board method.
To knw different algorithms you can use google :D.

DataWarehouse - What is a good definition?

Could someone give me a good, practical definition of what a data warehouse is?
I'm surprised no one has posted Inmon's definition:
A warehouse is a subject-oriented,
integrated, time-variant and
non-volatile collection of data in
support of management's decision
making process
From the same page you can pick up Kimball's definition:
A copy of transaction data
specifically structured for query and
analysis
I think that, unfortunately, data warehousing is a wide-ranging field. There is a lot of variety with very few standard paradigms, specifically I'm thinking of Kimball's dimensional modelling. Inmon does not have as a specific a methodology as Kimball's and thus some 3NF models may or may not conform to his principles.
Because Inmon has broadened his scope for what warehousing is meant to accomplish, it can encompass unstructured data. However, analysis of unstructured data is very different than traditional analysis.
As applied to SQL Server, typically the largest Data Warehouses on SQL Server are modelled dimensionally, because this lends itself well to the non-distributed, non-massively parallel model. Massively parallel systems like Teradata generally perform a lot better with 3NF models. These are still table-based systems with the various tables connected with foreign key constraints (perhaps not enforced, but at least logical).
Of course, we are also seeing NoSQL data processing systems like Map/Reduce which are not really databases at all in the sense of normalized, denormalized or non/poorly-normalized relational databases which we have had for 40 years now.
i just started with Datawarehousing and Buisness Intelligence and looking around the web you can find some interesting links :
Get Start With Datawarehousing
I think this two links could help you to understand the concepts of datawarehousing.
sorry, im new i can post only one link ^^
we're sorry, but as a spam prevention mechanism, new users can only post a maximum of one hyperlink. Earn 10 reputation to post more hyperlinks.
A database optimized for retrieval, in general denormalized data, usually a star schema(but could be snowflake) and uses dimensional modeling (fact and dimension tables)
While this is not an academic definition, it might serve as a practical one. A data warehouse is a collection of datamarts and will combine datasets across the breadth of an organization.
A datamart will contain datasets specific to certain portions of the business. In the datamart you will find fact tables, measurable pieces of information, along with dimensions, attributes of your measurable pieces.
A true data warehouse will have conformed dimension tables that can be shared across datamarts.
An example...
Your company may build a datamart around sales. And another datamart around human resources. If the customer dimension table is shared across both these datamarts, it would be considered a conformed dimension. All three of these entities together would make up a data warehouse.
As someone else stated you can find more detailed information by searching for Ralph Kimball's Data Strategies.
Definition : Datawarehouse is a database used for analysis purpose rather than for transaction processing
Check the below link for more informaion on datawarehouse
http://www.idatastage.com/datawarehouse/

Architecture for database analytics

We have an architecture where we provide each customer Business Intelligence-like services for their website (internet merchant). Now, I need to analyze those data internally (for algorithmic improvement, performance tracking, etc...) and those are potentially quite heavy: we have up to millions of rows / customer / day, and I may want to know how many queries we had in the last month, weekly compared, etc... that is the order of billions entries if not more.
The way it is currently done is quite standard: daily scripts which scan the databases, and generate big CSV files. I don't like this solutions for several reasons:
as typical with those kinds of scripts, they fall into the write-once and never-touched-again category
tracking things in "real-time" is necessary (we have separate toolset to query the last few hours ATM).
this is slow and non-"agile"
Although I have some experience in dealing with huge datasets for scientific usage, I am a complete beginner as far as traditional RDBM go. It seems that using column-oriented database for analytics could be a solution (the analytics don't need most of the data we have in the app database), but I would like to know what other options are available for this kind of issues.
You will want to google Star Schema. The basic idea is to model a special data warehouse / OLAP instance of your existing OLTP system in a way that is optimized to provided the type of aggregations you describe. This instance will be comprised of facts and dimensions.
In the example below, sales 'facts' are modeled to provide analytics based on customer, store, product, time and other 'dimensions'.
You will find Microsoft's Adventure Works sample databases instructive, in that they provide both the OLTP and OLAP schemas along with representative data.
There are special db's for analytics like Greenplum, Aster data, Vertica, Netezza, Infobright and others. You can read about those db's on this site: http://www.dbms2.com/
The canonical handbook on Star-Schema style data warehouses is Raplh Kimball's "The Data Warehouse Toolkit" (there's also the "Clickstream Data Warehousing" in the same series, but this is from 2002 I think, and somewhat dated, I think that if there's a new version of the Kimball book it might serve you better. If you google for "web analytics data warehouse" there are a bunch of sample schema available to download & study.
On the other hand, a lot of the no-sql that happens in real life is based around mining clickstream data, so it might be worth see what the Hadoop/Cassandra/[latest-cool-thing] community has in the way of case studies to see if your use case matches well with what they can do.

Resources