Business Intelligence is defined by Gartner as follows:
Business intelligence (BI) platforms enable enterprises to build BI
applications by providing capabilities in three categories: analysis,
such as online analytical processing (OLAP); information delivery,
such as reports and dashboards; and platform integration, such as BI
metadata management and a development environment. (Gartner IT)
While BI supports the company in decision making, a Data Warehouse handles the purely technical supply of the data.
A data warehouse is a storage architecture designed to hold data
extracted from transaction systems, operational data stores and
external sources. The warehouse then combines that data in an
aggregate, summary form suitable for enterprise-wide data analysis and
reporting for predefined business needs. (Gartner IT)
But what exactly is the difference between an Analytic Platform and Business Intelligence?
Is "Analytic Platform" the name for the combination of Data Warehouse and Business Intelligence?
I would use an analogy like this:
Business intelligence is the destination, the objective.
The analytic platform is the ship that helps you get there.
The data warehouse is the fuel that feeds the ship and lets it take you to your destination.
Sometimes the terms are used as synonyms and sometimes they appear as different things; in the end it depends very much on the author, but they are certainly related. The analytic platform is not the same as a data warehouse, but the data warehouse is a substantial part of the analytic platform (for example, the analytic platform could be a data warehouse plus a data visualization tool, or a data warehouse, a semantic layer and a data exploration tool, etc.).
I need to test my data warehouse using TPC-DS. How can I generate queries for my data warehouse using TPC-DS?
I tried to generate them, but the tool generates queries for one specific data warehouse.
Thanks.
I'm not sure what you mean by "testing your data warehouse" using TPC-DS.
TPC-DS is a benchmark for database engines, focused on typical decision support access patterns; a data warehouse is an information systems concept that is usually built using a variety of database management systems (and other tools).
This being clarified, you can use TPC-DS to benchmark the database engine that you plan to use as a data store for your data warehouse. If that's your goal, you need to:
either generate the data using the official TPC-DS tool, or download the dataset if you can find it online (alternatively, maybe your database vendor provides it already).
load the data into the benchmark's model on the database you're testing.
run the benchmark (the queries) over the data model you created. You can find an example of the queries here (for Impala, in this case), but you may have to translate them into the SQL idiom used by whatever DBMS you're using.
The TPC-DS specification doc not only provides this information but it can also help you understand some essential concepts on this topic: http://www.tpc.org/tpc_documents_current_versions/pdf/tpc-ds_v2.11.0.pdf
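As a rough illustration of the "run the benchmark" step, here is a minimal timing harness in Python. It is only a sketch: it uses an in-memory SQLite database as a stand-in for the DBMS under test, a toy `store_sales` table with made-up columns instead of the generated TPC-DS data, and two hypothetical queries rather than the official TPC-DS ones. The helper name `run_benchmark` is also an assumption.

```python
import sqlite3
import time


def run_benchmark(conn, queries, repeats=3):
    """Run each named query `repeats` times and keep the best wall-clock time."""
    timings = {}
    for name, sql in queries.items():
        best = float("inf")
        for _ in range(repeats):
            start = time.perf_counter()
            conn.execute(sql).fetchall()  # fetch all rows so the query fully executes
            best = min(best, time.perf_counter() - start)
        timings[name] = best
    return timings


# Stand-in for "generate and load the data": a toy table, not the real TPC-DS schema.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE store_sales (item_id INTEGER, quantity INTEGER, price REAL)")
conn.executemany(
    "INSERT INTO store_sales VALUES (?, ?, ?)",
    [(i % 100, i % 7 + 1, 9.99) for i in range(10_000)],
)

# Stand-in for "run the queries": hypothetical decision-support style queries.
queries = {
    "q_total_revenue": "SELECT SUM(quantity * price) FROM store_sales",
    "q_top_items": (
        "SELECT item_id, SUM(quantity) AS qty FROM store_sales "
        "GROUP BY item_id ORDER BY qty DESC LIMIT 10"
    ),
}

timings = run_benchmark(conn, queries)
for name, seconds in timings.items():
    print(f"{name}: {seconds * 1000:.2f} ms")
```

With a real engine you would swap the connection for that DBMS's driver and the query dictionary for the translated TPC-DS queries; the timing loop itself stays the same.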
I ask as the words are used pretty much interchangeably in some documentation I have had to review.
In the real world what are the differences?
A "Data Warehouse" is mostly an information systems concept that describes a centralized and trusted source of (e.g. company/business) data.
From Wikipedia: "DWs are central repositories of integrated data from one or more disparate sources. They store current and historical data in one single place that are used for creating analytical reports for workers throughout the enterprise."
That being said, I think it's a bit redundant to say "unified data warehouse"; a data warehouse is a "unified" source of data by definition.
This definition implies that the data model in a data warehouse must/should be a unified, canonical model of all relevant data. You can also look at a Data Warehouse as a collection of data marts, which in turn are smaller unified/canonical models focused on specific business/functional areas; so the "unified data model" can be thought of as the sum of the various smaller/specific models (the data marts).
A Data Warehouse, as an information system, is usually surrounded by a lot of technology tools (databases, ETL software, analytics and reporting tools, etc); but regardless of how you handle, model and explore data, the primary purpose of a DW is to serve as a curated, single source of truth for (business) questions that (should) rely on data.
I am new to databases and would like to know the differences between these terms. Please explain with examples: what are a database, data mining, a data warehouse and Big Data?
I highly recommend using http://bigdatauniversity.com/
It has free relevant and updated course materials on the topics you seek. Topics such as Hadoop and Data Mining are covered and this gives you access to tools to practise.
Big data is a buzzword, or catch-phrase, used to describe a massive volume of both structured and unstructured data that is so
large that it's difficult to process using traditional database and
software techniques.
A database is an organized collection of data. The data is
typically organized to model aspects of reality in a way that
supports processes requiring information.
Data Mining is an analytic process designed to explore data (usually large amounts of data, typically business or market related,
also known as "big data") in search of consistent patterns and/or systematic relationships between variables, and then to validate the
findings by applying the detected patterns to new subsets of data.
The ultimate goal of data mining is prediction - and predictive data
mining is the most common type of data mining and one that has the
most direct business applications.
StatSoft defines data warehousing as a process of organizing the
storage of large, multivariate data sets in a way that facilitates
the retrieval of information for analytic purposes.
I'm trying to understand what OLAP, OLTP, data mining, analytics etc. are about, and I feel like my understanding of some of these concepts is still a bit vague. Information about these subjects tends to be explained in a very complex manner on the internet.
I feel like a question like this is likely to be closed since it's a very broad one, so I'll try to narrow it down into two questions:
Question 1:
After doing research I understand the following about these concepts. Is it correct?
Analysis is decomposing something complex to better understand its inner workings.
Analytics is predictive analysis of information, which requires a lot of math and statistics.
There are many types of databases, but they are either OLTP (transactional) or OLAP (analytical).
OLTP databases use ER diagrams, and are therefore easier to update because they are in normalized form.
In contrast, OLAP uses denormalized star schemas and is therefore easier to query.
OLAP is used for predictive analysis, and OLTP is usually used in more practical situations since there is no redundancy.
A data warehouse is a type of OLAP database, and usually consists of multiple other databases.
Data mining is a tool used in analytics, where you use computer software to find relationships between data so you can predict things (e.g. customer behavior).
Question 2:
I'm especially confused about the difference between analytics and analysis. They say analytics is multidimensional analysis, but what is that supposed to mean?
I will try to explain this starting from the top of the pyramid:
Business Intelligence (which you didn't mention) is an IT term for a complex system that turns data into useful information about a company.
So, a BI system has a target: clean, accurate and meaningful information.
Clean means there are no technical problems (missing keys, incomplete data, etc.). Accurate means just that: BI systems are also used as a fault checker for the production database (logical faults, e.g. an invoice amount that is too high, or an inactive partner being used). This is accomplished with rules. Meaningful is hard to explain, but in plain English it means all your data (even the Excel table from the last meeting), presented the way you want.
So, a BI system has a back end: the data warehouse.
A DWH is nothing more than a database (an instance, not software). It can be stored in an RDBMS, an analytical database (columnar or document store types), or a NoSQL database.
Data warehouse is the term usually used for the whole database explained above. It can consist of a number of data marts (if the Kimball model is used), which is more common, or of a relational system in third normal form (the Inmon model), called an enterprise data warehouse.
Data marts are related tables inside the DWH (star schema, snowflake schema): a fact table (a business process in denormalized form) and dimension tables.
Each data mart represents one business process. Example: a DWH has 3 data marts. One is retail sales, the second is export, and the third is import. In retail you can see total sales, quantity sold, import price and profit (measures) by SKU, date, store, city, etc. (dimensions).
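To make the retail data mart more concrete, here is a small sketch of a star schema in Python using an in-memory SQLite database. All table names, column names and sample rows are invented for illustration; a real DWH would have more dimensions (for example a date dimension) and far more data.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
-- Dimension tables: descriptive attributes used to group and filter.
CREATE TABLE dim_store (store_id INTEGER PRIMARY KEY, store_name TEXT, city TEXT);
CREATE TABLE dim_product (sku TEXT PRIMARY KEY, product_name TEXT);
-- Fact table: one row per sale line, holding the measures.
CREATE TABLE fact_retail_sales (
    sale_date TEXT,
    store_id INTEGER REFERENCES dim_store(store_id),
    sku TEXT REFERENCES dim_product(sku),
    qty_sold INTEGER,
    total_sales REAL,
    import_price REAL  -- total import price of the units sold on this row
);
""")
conn.executemany("INSERT INTO dim_store VALUES (?, ?, ?)", [
    (1, "Downtown", "Zagreb"), (2, "Harbour", "Split"), (3, "Mall", "Zagreb"),
])
conn.executemany("INSERT INTO dim_product VALUES (?, ?)", [
    ("A1", "Widget"), ("B2", "Gadget"),
])
conn.executemany("INSERT INTO fact_retail_sales VALUES (?, ?, ?, ?, ?, ?)", [
    ("2024-01-05", 1, "A1", 10, 100.0, 60.0),
    ("2024-01-05", 2, "A1", 5, 50.0, 30.0),
    ("2024-01-06", 3, "B2", 2, 80.0, 50.0),
    ("2024-01-06", 1, "B2", 1, 40.0, 25.0),
])

# Measures (quantity, sales, profit) aggregated by the "city" dimension.
rows = conn.execute("""
SELECT s.city,
       SUM(f.qty_sold),
       SUM(f.total_sales),
       SUM(f.total_sales - f.import_price) AS profit
FROM fact_retail_sales AS f
JOIN dim_store AS s ON s.store_id = f.store_id
GROUP BY s.city
ORDER BY s.city
""").fetchall()

for city, qty, sales, profit in rows:
    print(city, qty, sales, profit)
```

Grouping by SKU or date instead of city only changes the join and the GROUP BY clause; the fact table stays the same, which is the point of the star layout.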
Loading data into the DWH is called ETL (extract, transform, load).
Extract data from multiple sources (ERP db, CRM db, Excel files, web service...)
Transform data (clean data, connect data from different sources, match keys, mine data)
Load data (load transformed data into specific data marts)
Edit because of a comment: an ETL process is usually created with an ETL tool, or manually with some programming language (Python, C#, etc.) and APIs.
An ETL process is a group of SQL statements, procedures, scripts and rules that are related and separated into 3 parts (see above), controlled by metadata.
It's either scheduled (every night, every few hours) or live (change data capture, triggers, transactions).
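A hand-written ETL process of the kind described above could be sketched in Python as follows. Everything here is hypothetical: the CSV text stands in for an ERP export, the list of dictionaries stands in for rows returned by a CRM API, and the function, table and column names are invented for this example.

```python
import csv
import io
import sqlite3

# Two hypothetical sources: an ERP CSV export and rows from a CRM API.
ERP_CSV = """customer_id,amount
 c001 ,100.50
C002,20
c003,not-a-number
"""
CRM_ROWS = [{"id": "C001", "name": "Acme"}, {"id": "C002", "name": "Globex"}]


def extract():
    """Extract: read raw rows from every source without changing them."""
    erp_rows = list(csv.DictReader(io.StringIO(ERP_CSV)))
    return erp_rows, CRM_ROWS


def transform(erp_rows, crm_rows):
    """Transform: clean values, match keys and connect the two sources."""
    names = {row["id"]: row["name"] for row in crm_rows}
    clean = []
    for row in erp_rows:
        key = row["customer_id"].strip().upper()  # normalize keys so sources match
        try:
            amount = float(row["amount"])
        except ValueError:
            continue  # cleaning rule: drop rows with an unparsable amount
        if key in names:
            clean.append((key, names[key], amount))
    return clean


def load(conn, rows):
    """Load: write the transformed rows into the target table."""
    conn.execute(
        "CREATE TABLE IF NOT EXISTS fact_sales "
        "(customer_id TEXT, customer_name TEXT, amount REAL)"
    )
    conn.executemany("INSERT INTO fact_sales VALUES (?, ?, ?)", rows)
    conn.commit()


conn = sqlite3.connect(":memory:")
load(conn, transform(*extract()))
loaded = conn.execute("SELECT * FROM fact_sales ORDER BY customer_id").fetchall()
print(loaded)
```

A scheduler (a nightly cron job, for instance) would call the three steps in order; a live setup would run the same transform logic on each captured change instead.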
OLTP and OLAP are types of data processing. OLTP is used for transactional purposes, between a database and software (usually only one way of inputting/outputting data).
OLAP is for analytical purposes, which means multiple sources, historical data, high SELECT query performance and mined data.
Edit because of a comment: data processing is the way data is stored in and accessed from a database. So, based on your needs, the database is set up differently.
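The difference in access pattern can be sketched with a few lines of Python against an in-memory SQLite database. This is only an illustration with an invented `orders` table; in practice the OLTP and OLAP workloads would usually run on separate, differently tuned databases.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE orders "
    "(order_id INTEGER PRIMARY KEY, customer TEXT, order_month TEXT, amount REAL)"
)

# OLTP-style work: many small writes, each touching single rows in a transaction.
with conn:
    conn.execute("INSERT INTO orders VALUES (1, 'alice', '2024-01', 30.0)")
    conn.execute("INSERT INTO orders VALUES (2, 'bob', '2024-01', 20.0)")
    conn.execute("INSERT INTO orders VALUES (3, 'alice', '2024-02', 50.0)")
with conn:
    conn.execute("UPDATE orders SET amount = 25.0 WHERE order_id = 2")

# OLAP-style work: one read-only query that scans the history and aggregates it.
monthly = conn.execute(
    "SELECT order_month, COUNT(*), SUM(amount) "
    "FROM orders GROUP BY order_month ORDER BY order_month"
).fetchall()
print(monthly)
```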
Image from http://datawarehouse4u.info/:
Data mining is the computational process of discovering patterns in large data sets. Mined data can give you more insight into a business process, or even a forecast.
Analysis names the activity, which in the BI world means how easily you can get the requested information from data. Multidimensional analysis describes how the system slices your data (with dimensions inside a cube). Wikipedia says that analysis of data is a process of inspecting data with the goal of discovering useful information.
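To show what "slicing data with dimensions" can mean in practice, here is a minimal sketch in plain Python. The sales records and the helper names `roll_up` and `slice_cube` are invented for this example; real OLAP engines do the same kind of grouping, only precomputed and at much larger scale.

```python
from collections import defaultdict

# Each fact has three dimensions (year, city, product) and one measure (amount).
sales = [
    {"year": 2023, "city": "Zagreb", "product": "Widget", "amount": 100},
    {"year": 2023, "city": "Split", "product": "Widget", "amount": 40},
    {"year": 2024, "city": "Zagreb", "product": "Widget", "amount": 120},
    {"year": 2024, "city": "Zagreb", "product": "Gadget", "amount": 60},
    {"year": 2024, "city": "Split", "product": "Gadget", "amount": 30},
]


def roll_up(facts, dimensions, measure="amount"):
    """Sum the measure for every combination of the chosen dimensions."""
    totals = defaultdict(int)
    for fact in facts:
        key = tuple(fact[d] for d in dimensions)
        totals[key] += fact[measure]
    return dict(totals)


def slice_cube(facts, **fixed):
    """Keep only the facts where the given dimensions have the given values."""
    return [f for f in facts if all(f[k] == v for k, v in fixed.items())]


by_year = roll_up(sales, ["year"])
by_year_city = roll_up(sales, ["year", "city"])
zagreb_by_product = roll_up(slice_cube(sales, city="Zagreb"), ["product"])

print(by_year)
print(by_year_city)
print(zagreb_by_product)
```

The same facts answer different questions depending on which dimensions you group by or fix, which is what makes the analysis multidimensional.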
Analytics names the result of the analysis process.
Don't fuss too much over those two words.
I can tell you about data mining, as I had a project on it. Data mining is not a tool; it is a method of mining data, and there are different tools for it, such as WEKA, RapidMiner, etc. Data mining relies on many algorithms that are built into tools like WEKA and RapidMiner, for example clustering algorithms, association algorithms, etc.
Here is a simple example of data mining. A teacher teaches science to a class using different methods: chalkboard, presentation and practical work. Our aim is to find which method suits the students. So we run a survey and collect the students' opinions: 40 students like the chalkboard, 30 like the presentation and 20 like the practical method. With the help of this data we can derive rules, for example: science should be taught using the chalkboard method.
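The survey example can be reproduced in a few lines of Python. This is a deliberately tiny sketch that just counts votes and turns the majority into a rule; it is not one of the clustering or association algorithms that tools like WEKA implement, and the variable names are made up.

```python
from collections import Counter

# Survey answers from the example: 40 chalkboard, 30 presentation, 20 practical.
responses = ["chalkboard"] * 40 + ["presentation"] * 30 + ["practical"] * 20

counts = Counter(responses)
method, votes = counts.most_common(1)[0]  # the most frequently preferred method
support = votes / len(responses)          # share of students backing the rule

rule = f"Science should be taught by the {method} method"
print(rule, f"(support: {support:.0%})")
```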
To learn about the different algorithms, you can use Google :D.
We have an architecture where we provide each customer Business Intelligence-like services for their website (internet merchant). Now, I need to analyze that data internally (for algorithmic improvement, performance tracking, etc.), and it is potentially quite heavy: we have up to millions of rows / customer / day, and I may want to know how many queries we had in the last month, compared week by week, etc. That is on the order of billions of entries, if not more.
The way it is currently done is quite standard: daily scripts which scan the databases and generate big CSV files. I don't like this solution for several reasons:
as typical with those kinds of scripts, they fall into the write-once and never-touched-again category
tracking things in "real-time" is necessary (we have a separate toolset to query the last few hours ATM).
this is slow and non-"agile"
Although I have some experience in dealing with huge datasets for scientific usage, I am a complete beginner as far as traditional RDBMSs go. It seems that using a column-oriented database for analytics could be a solution (the analytics don't need most of the data we have in the app database), but I would like to know what other options are available for this kind of issue.
You will want to google Star Schema. The basic idea is to model a special data warehouse / OLAP instance of your existing OLTP system in a way that is optimized to provide the type of aggregations you describe. This instance will be composed of facts and dimensions.
In a typical example, sales 'facts' are modeled to provide analytics based on customer, store, product, time and other 'dimensions'.
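Applied to the use case in the question (query counts per customer, compared week by week), a star schema could be sketched like this in Python with an in-memory SQLite database. The table names, column names and numbers are all invented; the fact table here is assumed to hold one pre-aggregated row per customer per day rather than one row per query.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE dim_customer (customer_id INTEGER PRIMARY KEY, name TEXT);
CREATE TABLE dim_date (date_id INTEGER PRIMARY KEY, day TEXT, iso_week TEXT);
-- One fact row per customer per day, already aggregated from the raw logs.
CREATE TABLE fact_queries (
    customer_id INTEGER REFERENCES dim_customer(customer_id),
    date_id INTEGER REFERENCES dim_date(date_id),
    query_count INTEGER
);
""")
conn.executemany("INSERT INTO dim_customer VALUES (?, ?)", [
    (1, "shop_a"), (2, "shop_b"),
])
conn.executemany("INSERT INTO dim_date VALUES (?, ?, ?)", [
    (1, "2024-01-01", "2024-W01"),
    (2, "2024-01-02", "2024-W01"),
    (3, "2024-01-08", "2024-W02"),
])
conn.executemany("INSERT INTO fact_queries VALUES (?, ?, ?)", [
    (1, 1, 1000), (1, 2, 1500), (1, 3, 900),
    (2, 1, 200), (2, 3, 700),
])

# Weekly query totals per customer, ready for a week-over-week comparison.
weekly = conn.execute("""
SELECT c.name, d.iso_week, SUM(f.query_count)
FROM fact_queries AS f
JOIN dim_customer AS c ON c.customer_id = f.customer_id
JOIN dim_date AS d ON d.date_id = f.date_id
GROUP BY c.name, d.iso_week
ORDER BY c.name, d.iso_week
""").fetchall()

for name, week, total in weekly:
    print(name, week, total)
```

Storing daily aggregates instead of raw rows is one way to keep the fact table far smaller than billions of entries, at the cost of losing per-query detail.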
You will find Microsoft's Adventure Works sample databases instructive, in that they provide both the OLTP and OLAP schemas along with representative data.
There are special databases for analytics, like Greenplum, Aster Data, Vertica, Netezza, Infobright and others. You can read about them on this site: http://www.dbms2.com/
The canonical handbook on star-schema style data warehouses is Ralph Kimball's "The Data Warehouse Toolkit". There's also "Clickstream Data Warehousing" in the same series, but that is from 2002, I think, and somewhat dated; if there's a new version of the Kimball book, it might serve you better. If you google for "web analytics data warehouse" there are a bunch of sample schemas available to download and study.
On the other hand, a lot of the NoSQL work that happens in real life is based around mining clickstream data, so it might be worth seeing what the Hadoop/Cassandra/[latest-cool-thing] community has in the way of case studies, to see if your use case matches well with what they can do.