Data warehouse and/or database?

Can an enterprise use both a data warehouse and a database in one head office? Is it OK to use only one of these, or is it necessary to use both in the same place?

Yes, an enterprise can use both a data warehouse and a database in one office, and they need not be in the same physical data center. It all depends on the needs of the organization. Generally, databases are used to support transactions as they happen, while data warehouses are used to support business intelligence and similar analytical work.
Database
Transactions in an enterprise most likely happen in a relational database management system (aka database, aka RDBMS). Reporting can happen against the same database, but it is also possible that reporting is done off a mirror of the RDBMS. An enterprise may have more than one RDBMS: one running SQL Server, another running Oracle, another running MySQL, and so on. All of this is great for recording activities and for reporting.
Warehouse
Additionally, enterprises seek to do data analysis on a regular basis. Business intelligence, data science, big data: regardless of the term, we are talking about data analysis. Number crunching on large amounts of data stored in an RDBMS can be hard on the RDBMS, so organizations decide to denormalize the data to some extent and store it in a warehouse. Once data is extracted, transformed, and loaded (ETL) from one or more RDBMSs (and other data sources) into a data warehouse, it is available for analysis.
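As a rough sketch of the "T" and "L" steps (the schema and table names here are hypothetical), several normalized operational tables are often joined and flattened into one denormalized warehouse table:

    -- Hypothetical ETL step: join normalized OLTP tables and load a
    -- denormalized sales table in the warehouse.
    INSERT INTO warehouse.sales_flat
           (order_id, order_date, customer_name, customer_city,
            product_name, product_line, quantity, amount)
    SELECT o.order_id,
           o.order_date,
           c.name,
           c.city,
           p.name,
           p.product_line,
           oi.quantity,
           oi.quantity * oi.unit_price
    FROM   oltp.orders      o
    JOIN   oltp.order_items oi ON oi.order_id   = o.order_id
    JOIN   oltp.customers   c  ON c.customer_id = o.customer_id
    JOIN   oltp.products    p  ON p.product_id  = oi.product_id;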
Organizations may choose to move the warehouse to a different office location, or may have multiple warehouses. For example, a headquarters with 5 satellite facilities may choose to bring data from all those facilities into the warehouse at headquarters every night, or it may choose to host the warehouse in a different data center. In contrast, a company with hundreds of satellite facilities may choose to keep a warehouse with high-level summarized data at its headquarters and regionalize its warehouses, with one warehouse per continent, so that target markets are better served by the satellite units in that continent.
Database (or databases) to Warehouse journey
Business Intelligence
Cognos, QlikView, Tableau, and MicroStrategy are some of the many business intelligence / data analytics tools that reach out to the data warehouse and present data for analytics. They are great for presentation and reporting, and for data visualization in general. These tools can also get data from an RDBMS, but it is more convenient to get it from the data warehouse, since warehouses are architected in a way that makes it easier to showcase the data on a business intelligence dashboard.
Example of a dashboard:
Big data
The buzz around big data is interesting. Many of us take a subset of data from a large pool, do our analysis, and assume that the results from the subset apply to the whole pool. What if all the data were used for the analysis? Even better, what if we took related data from elsewhere (outside of our dataset) and included it in the analysis? You would have a giant pile of data, and if you had the means to analyze it all, you would be doing big data. We are talking several hundred GB or even PB of data. Although Hadoop and the like are used in big data analysis, they can derive that data from the warehouse.

Related

Can we use Snowflake as a database for a data-driven web application?

I am an ASP.NET MVC / SQL Server developer, and I am very new to all of this, so I may be on a completely wrong path.
I learned by googling that Snowflake can put/get data from AWS S3, Google Storage, and Azure, and that Snowflake has its own databases and tables as well.
I have the following questions:
Why should one use Snowflake when you can compute your data with cloud storage (S3, etc.) and Talend or any other ETL tool?
Can we use Snowflake as a database for a data-driven web application? If yes, could you provide a link or something to get started?
Once again, I am very new to all of this and hoping you can share ideas on the best way to work around this.
Thank you in advance.
Why should one use Snowflake when you can compute your data with cloud storage (S3, etc.) and Talend or any other ETL tool?
You're talking about three different classes of technology product there, which are not equivalent:
Snowflake is a database platform. Like other database technologies, it provides data storage, metadata, and a SQL interface for data manipulation and management.
AWS S3 (and similar products) provides scalable cloud storage for files of any kind. You generally need to implement an additional technology such as Spark, Presto, or Amazon Athena to query data stored as files in cloud storage. Snowflake can also make use of data files in cloud storage, either querying the files directly as an "external table" or using a COPY statement to load the data into Snowflake itself.
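As a hedged sketch of those two options (the stage, bucket, and table names are hypothetical), you can either query the files in place as an external table or load them with COPY:

    -- Point Snowflake at files in cloud storage via an external stage.
    CREATE STAGE sales_stage
      URL = 's3://my-example-bucket/sales/'   -- hypothetical bucket
      FILE_FORMAT = (TYPE = CSV SKIP_HEADER = 1);

    -- Option 1: query the files in place as an external table.
    CREATE EXTERNAL TABLE sales_ext (
      sale_date DATE   AS (TO_DATE(VALUE:c1::STRING)),
      amount    NUMBER AS (VALUE:c2::NUMBER)
    )
    LOCATION = @sales_stage
    FILE_FORMAT = (TYPE = CSV);

    -- Option 2: load the files into a native Snowflake table.
    CREATE TABLE sales (sale_date DATE, amount NUMBER);
    COPY INTO sales FROM @sales_stage;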
Talend and other ETL or data integration tools are used to move data between source and target platforms. Usually this will be from a line of business application, such as an ERP system, to a data warehouse or data lake.
So you need to think about three things when considering Snowflake:
Where is your analytical data going to be stored? Is it going to be files in cloud storage, loaded into a database or a mix of both? There are advantages and disadvantages to each scenario.
How do you want to query the data? It's fairly likely you'll want something that supports SQL queries; as mentioned above, there are numerous technologies that support SQL on files in cloud storage. Query performance will generally be significantly better if the data is loaded into a dedicated analytical database, though.
How will the data get from the data sources to the analytical data repository, whatever that may be? Typically this will involve either a third party ETL tool, or rolling your own solution (which can be a cheaper option initially but can become a significant management and support overhead).
Can we use Snowflake as a database for a data-driven web application?
The answer to that is yes, in theory. It very much depends on what your web application does, because Snowflake is a database designed for analytics, i.e. crunching through large amounts of data to find answers to questions. It's not designed as a transactional database for a system that involves lots of updates and inserts of small amounts of data. For example, Snowflake doesn't enforce referential integrity constraints.
However, if your web application is an analytical one (for example, it has embedded reports that query a large amount of data, and users will typically be reading data rather than adding it), then you could use Snowflake as a backend for the analytical part, although you would probably still want a traditional database to manage data for things like users and sessions.
You can connect your web application to Snowflake with one of the connectors, like https://docs.snowflake.com/en/user-guide/odbc.html
Snowflake excels for large analytic workloads that are difficult to scale and tune. If, for example, you have many (millions/billions) of events that you want to aggregate into dashboards, then Snowflake might be a good fit.
I agree with much of what Nathan said. To add to that, in my experience, every time I've created a database for an application it has been with an OLTP database like PostgreSQL, Azure SQL Database, or SQL Server.
One big problem with MPP/distributed databases is that they don't enforce referential integrity, so if that's important to you, then you don't want to use them.
Snowflake and other MPP/distributed databases are NOT meant for OLTP workloads but for OLAP workloads. No matter what snake oil companies like Databricks and Snowflake try to sell you, MPP/distributed databases are NOT meant for OLTP. The costs alone would be tremendous, even with auto-scaling.
If you think about it, Databricks, Snowflake, etc. have a limit to how much they want to optimize their platforms, because the longer a query runs, the more money they make. To make money they have to optimize performance, but not too much, otherwise it will affect their income.
This can be an in-depth topic so I would recommend doing more research into OLTP Vs. OLAP.
Enforcing referential integrity is a double-edged sword. The downside is that as data volume grows, the referential-violation check significantly slows down inserts and deletes. This results in the developer putting the RI check in the application (with a dirty read) and turning off RI enforcement in the database, ending up in a Snowflake-like situation anyway.
The bottom line is that Snowflake not enforcing RI should not be a limitation for OLTP applications.
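As a minimal sketch of the pattern described above (the tables are hypothetical), Snowflake will accept a foreign key declaration but will not enforce it, so the parent-existence check moves into the application's insert statement:

    -- Hypothetical tables: the FOREIGN KEY is accepted but not enforced,
    -- so a violating insert would still succeed.
    CREATE TABLE customers (customer_id NUMBER PRIMARY KEY, name STRING);
    CREATE TABLE orders (
        order_id    NUMBER,
        customer_id NUMBER REFERENCES customers (customer_id),
        amount      NUMBER
    );

    -- Application-side RI check: insert the order only if the parent exists.
    INSERT INTO orders (order_id, customer_id, amount)
    SELECT 1001, c.customer_id, 99.95
    FROM   customers c
    WHERE  c.customer_id = 42;   -- inserts nothing if the customer is missing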

Why are operational databases not meeting business challenges the way a data warehouse does?

I have a question: why are operational databases not meeting business challenges the way a data warehouse does?
In an operational database I can create detailed reports about any product or anything else, and I can issue statistical reports with charts and diagrams, so why can't the operational database be used as a data warehouse?
Best Regards
Usually an operational database only keeps track of the current state of each record.
The purpose of a data warehouse is two-fold:
- Keep track of historic events without overwhelming the operational database;
- Isolate OLAP queries so that they don't impact the load on the operational datastore.
If you try to query your operational data store for sales per product line per month for the past year, the number of joins required, as well as the amount of information you need to read from storage, may cause performance degradation on your operational database.
A data warehouse avoids this by 1) keeping things separated and 2) denormalising the data model (the Kimball approach) so that query plans are simpler.
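As a hedged illustration (the table and column names are hypothetical, and the DATEADD date arithmetic is SQL Server/Snowflake style), that same "sales per product line per month" question against a Kimball-style star schema needs only a couple of fact-to-dimension joins:

    -- Hypothetical star schema: one fact table, two dimension tables.
    SELECT d.year,
           d.month,
           p.product_line,
           SUM(f.sales_amount) AS total_sales
    FROM   fact_sales  f
    JOIN   dim_date    d ON f.date_key    = d.date_key
    JOIN   dim_product p ON f.product_key = p.product_key
    WHERE  d.full_date >= DATEADD(year, -1, CURRENT_DATE)
    GROUP BY d.year, d.month, p.product_line
    ORDER BY d.year, d.month, p.product_line;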
I suggest reading The Data Warehouse Toolkit, by Ralph Kimball. The first chapter deals precisely with this question: why do we need a data warehouse if we already have an operational data store?
I can create detailed reports about any product or anything else
and I can issue statistical reports with charts and diagrams
Yes, you can, but a business user cannot, as they don't know SQL. And it's very difficult to put a BI tool (for business users to use) over the top of an operational database, for many reasons:
The data model is not built for an end user to understand. A data warehouse data model is (i.e., there is ONE table for customers that has everything about a customer in it, rather than being split across addresses, accounts, etc.).
The operational data store is probably missing important reporting reference data, such as grouping levels and hierarchies.
A slowly changing dimension is a method of 'transparently' modelling changes to, for example, customers (see the sketch after this list). An operational data model generally doesn't do this very well; you need to understand all the tables and join them correctly, if this information is even stored.
There are many other reasons, but these serve to address your points.
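For illustration, here is a minimal Type 2 slowly changing dimension sketch (the customer dimension and its columns are hypothetical): each change closes the current row and opens a new one, so history stays queryable.

    -- Hypothetical Type 2 customer dimension: one row per version of a customer.
    CREATE TABLE dim_customer (
        customer_key  INT,           -- surrogate key
        customer_id   INT,           -- business key from the source system
        customer_name VARCHAR(100),
        city          VARCHAR(100),
        valid_from    DATE,
        valid_to      DATE,          -- NULL for the current row
        is_current    CHAR(1)
    );

    -- When a customer moves city: close the current row...
    UPDATE dim_customer
    SET    valid_to = CURRENT_DATE, is_current = 'N'
    WHERE  customer_id = 42 AND is_current = 'Y';

    -- ...and insert a new current row carrying the new attribute value.
    INSERT INTO dim_customer
    VALUES (90210, 42, 'Jane Smith', 'Leeds', CURRENT_DATE, NULL, 'Y');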
When you get to the point that you are too busy to service business users' requests, and you are issuing reports that don't match from one day to the next, you'll start to see the value of a data warehouse.

Move from Azure Table Storage to other database

I have been asked to move an Azure Table Storage service from the Microsoft Dublin Data Centre to a data centre in the UK (any secure data centre so long as it is UK based). My problem is that Azure table storage is only supported in Microsoft data centres (Dublin and Amsterdam in Europe).
There are about 50 tables, plus queues and blobs. The data requirements are for highly scalable storage. I do not want to rewrite the storage layer in SQL Server, because that would require schema management and direct management of indexes for performance. I would prefer a performant NoSQL database that operates to the standards of Azure Table Storage.
Does anyone have any experience in this area?
As far as migrating your data, there's no automated way to do it. Check out the AzCopy utility for blobs.
As far as which database to choose, that's really going to be app-dependent. You'll need to think about search, storage, indexing, map/reduce/aggregation, etc. Then there's document, column, graph, key/value, SQL: you may choose one or a combination.
Just remember that Table Storage is storage-as-a-service, with triple-replicated storage providing durability, and upwards of 2000 tps per partition, 20K tps per storage account. You'll need to build this out yourself, whatever you choose (maybe Redis?).
Anyway: This is pretty wide-open as far as solving your architecture. But hopefully I gave you something to think about.
One more thing: You should really look into the reasons for moving your data. Many people talk about data sovereignty, but sometimes it turns out that the data location doesn't violate any local data laws (or that some data can actually remain where it is, with only specific data needing to be hosted within a country's boundaries).

What is the best database technology for storing OHLC historical prices?

Just for end-of-day data there will be billions of rows. What is the best way to store all that data? Is SQL Server 2008 good enough for that, or should I look towards a NoSQL solution like MongoDB? Any suggestions?
It would be nice to have one master database with read/write permissions and one or more replicas of it for read-only operations. Only the master database would be used for adding new prices to the storage. It would also be nice to be able to replicate the OHLC prices for the most popular securities individually, in order to optimize read access.
This data will then be streamed to a trading platform on clients' machines.
You should consider Oracle Berkeley DB which is in production doing this within the infrastructure of a few well known stock exchanges. Berkeley DB will allow you to record information at a master as simple key/value pairs, in your case I'd imagine a timestamp for the key and an encoded OHLC set for the value. Berkeley DB supports single master multi-replica replication (called "HA" for High Availability) to support exactly what you've outlined - read scalability. Berkeley DB HA will automatically fail-over to a new master if/when necessary. Using some simple compression and other basic features of Berkeley DB you'll be able to meet your scalability and data volume targets (billions of rows, tens of thousands of transactions per second - depending on your hardware, OS, and configuration of BDB - see the 3n+1 benchmark with BDB for help) without issue.
When you start working on accessing that OHLC data, consider Berkeley DB's support for bulk get, and make sure you use the B-Tree access method (because your data has order, and locality will provide much faster access). Also consider the Berkeley DB partitioning API to split your data (perhaps based on symbol, or even based on time). Finally, because you'll be replicating the data, you can relax the durability constraints to DB_TXN_WRITE_NOSYNC, as long as your replication acknowledgement policy requires that a quorum of replicas ACK a write before it is considered durable. You'll find that a fast network beats a fast disk in this case. Also, to offload some work from your master, enable peer-to-peer log replica distribution.
But first, read the Replication Manager getting started guide and review the rep quote example, which already implements some of what you're trying to do (handy, eh?).
Just for the record, full disclosure I work as a product manager at Oracle on Berkeley DB products. I have for the past nine years, so I'm a tad biased. I'd guess that the other solutions - SQL based or not - might eventually give you a working system, but I'm confident that Berkeley DB can without too much effort.
If you're really talking billions of new rows a day (Federal Express' data warehouse isn't that large), then you need an SQL database that can partition across multiple computers, like Oracle or IBM's DB2.
Another alternative would be a heavy-duty system managed storage like IBM's DFSMS.

What is the difference between a database and a data warehouse?

What is the difference between a database and a data warehouse?
Aren't they the same thing, or at least written in the same thing (i.e. an Oracle RDBMS)?
Check out this for more information.
From a previous link:
Database
Used for Online Transaction Processing (OLTP), but can be used for other purposes such as data warehousing. It records data from users as transactions happen.
The tables and joins are complex since they are normalized (for an RDBMS). This is done to reduce redundant data and to save storage space.
Entity-relationship modeling techniques are used for RDBMS database design.
Optimized for write operations.
Performance is low for analysis queries.
Data Warehouse
Used for Online Analytical Processing (OLAP). This reads the historical data for the Users for business decisions.
The tables and joins are simple since they are denormalized. This is done to reduce the response time for analytical queries.
Dimensional modeling techniques are used for data warehouse design.
Optimized for read operations.
High performance for analytical queries.
Is usually a Database.
It's important to note as well that Data Warehouses could be sourced from zero to many databases.
From a Non-Technical View:
A database is constrained to a particular application or set of applications.
A data warehouse is an enterprise level data repository. It's going to contain data from all/many segments of the business. It's going to share this information to provide a global picture of the business. It is also critical to integration between the different segments of the business.
From a Technical view:
The word "Data Warehouse" has been given no recognized definition. Personally, I define a data warehouse as a collection of data-marts. Where each data-mart consists of one or more databases where the database is specific to a specific problem set (application, data-set or process).
Simply put a database is a component of a data-warehouse. There are many places to explore this concept, but because there is no "definition", you will find challenges with any answer you give.
A data warehouse is a TYPE of database.
In addition to what folks have already said, data warehouses tend to be OLAP, with indexes and the like tuned for reading rather than writing (see the index sketch at the end of this answer), and the data is denormalized/transformed into forms that are easier to read and analyze.
Some folks have said "databases" are the same as OLTP -- this isn't true. OLTP, again, is a TYPE of database.
Other types of "databases": Text files, XML, Excel, CSV..., Flat Files :-)
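To make "tuned for reading, not writing" concrete, here is a hedged sketch in SQL Server syntax (the table names are hypothetical): an OLTP table gets a narrow rowstore index for point lookups and frequent writes, while a warehouse fact table gets a clustered columnstore index for large scans and aggregations.

    -- OLTP side: narrow rowstore index for point lookups and frequent writes.
    CREATE NONCLUSTERED INDEX ix_orders_customer
        ON orders (customer_id)
        INCLUDE (order_date, amount);

    -- Warehouse side: clustered columnstore index, optimized for scanning
    -- and aggregating many rows rather than updating single rows.
    CREATE CLUSTERED COLUMNSTORE INDEX cci_fact_sales
        ON fact_sales;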
The simplest way to explain it would be to say that a data warehouse consists of more than just a database. A database is a collection of data organized in some way, but a data warehouse is organized specifically to "facilitate reporting and analysis". That is not the entire story, though, as a data warehousing system also includes "the means to retrieve and analyze data, to extract, transform and load data, and to manage the data dictionary", which are considered essential components.
Data Warehouse
Data Warehouse vs Database: A data warehouse is specially designed for data analytics, which involves reading large amounts of data to understand relationships and trends across the data. A database is used to capture and store data, such as recording details of a transaction.
Data Warehouse:
Suitable workloads - Analytics, reporting, big data.
Data source - Data collected and normalized from many sources.
Data capture - Bulk write operations typically on a predetermined batch schedule.
Data normalization - Denormalized schemas, such as the Star schema or Snowflake schema.
Data storage - Optimized for simplicity of access and high-speed query performance using columnar storage.
Data access - Optimized to minimize I/O and maximize data throughput.
Transactional Database:
Suitable workloads - Transaction processing.
Data source - Data captured as-is from a single source, such as a transactional system.
Data capture - Optimized for continuous write operations as new data is available to maximize transaction throughput.
Data normalization - Highly normalized, static schemas.
Data storage - Optimized for high-throughput write operations to a single row-oriented physical block.
Data access - High volumes of small read operations.
Database:
OLTP (online transaction processing).
It holds current, up-to-date, detailed, flat relational, isolated data.
Entity-relationship modeling is used to design the database.
Size is typically 100 MB to GB, with simple transactions and queries.
Data Warehouse:
OLAP (online analytical processing).
It holds historical data; star, snowflake, and galaxy schemas are used to design the data warehouse.
Size is typically 100 GB to TB, with improved query performance, and it is a foundation for data mining and data visualization.
It enables users to gain a deeper understanding and knowledge about various aspects of their corporate data through fast, consistent, interactive access to a wide variety of possible views of the data.
Any application that stores data generally uses a database. It could be a relational database or one of the NoSQL databases that are currently trending.
A data warehouse is also a database. We can describe a data warehouse database as specialized data storage for a company's analytical reporting purposes.
This data is used for key business decisions.
The organized data helps in reporting and in making business decisions effectively.
Database:
Used for Online Transactional Processing (OLTP).
Transaction-oriented.
Application oriented.
Current data.
Detailed data.
Scalable data.
Many users: administrators / operational staff.
Execution time: short.
Data Warehouse:
Used for Online Analytical Processing (OLAP).
Analysis-oriented.
Subject oriented.
Historical data.
Aggregated data.
Static data.
Few users, mainly managers.
Execution time: long.
Data warehousing (DW) is a process for collecting and managing data from varied sources to provide meaningful business insights. A data warehouse is typically used to connect and analyze business data from heterogeneous sources. The data warehouse is the core of the BI system, which is built for data analysis and reporting.
The source for a data warehouse can be a cluster of databases, because databases are used for online transaction processing, i.e. keeping current records, whereas a data warehouse stores historical data for online analytical processing.
A data warehouse is a type of data structure usually housed on a database. The term refers to the data model and the type of data stored there: data that is modeled to serve an analytical purpose.
A database can be classified as any structure that houses data. Traditionally that would be an RDBMS like Oracle, SQL Server, or MySQL. However, a database can also be a NoSQL database like Apache Cassandra, or a columnar MPP database like AWS Redshift.
You see, a database is simply a place to store data; a data warehouse is a specific way to store data, and it serves a specific purpose, which is to serve analytical queries.
OLTP vs. OLAP does not tell you the difference between a DW and a database; both OLTP and OLAP workloads reside on databases. They just store data in a different fashion (different data modeling methodologies) and serve different purposes (OLTP: record transactions, optimized for updates; OLAP: analyze information, optimized for reads).
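As a rough sketch with a hypothetical orders table, the same data looks very different to the two workloads:

    -- OLTP: record a transaction and update a small amount of data in place.
    INSERT INTO orders (order_id, customer_id, order_date, amount)
    VALUES (1001, 42, CURRENT_DATE, 99.95);

    UPDATE orders SET amount = 89.95 WHERE order_id = 1001;

    -- OLAP: read and aggregate a large amount of data to answer a question.
    SELECT customer_id,
           COUNT(*)    AS order_count,
           SUM(amount) AS revenue
    FROM   orders
    GROUP BY customer_id
    ORDER BY revenue DESC;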
In simple words:
Data warehouse --> huge data used for analytics, storage/copying, and analysis.
Database --> CRUD operations on frequently used data.
A data warehouse is the kind of storage you are not using on a daily basis, while a database is something you deal with frequently.
E.g., if you ask a bank for a statement, it covers the last 3/4/6 or so months because that data is in the database. If you want more than that, it is stored in the data warehouse.
Example: A house is worth $100,000, and it appreciates by $1,000 per year.
To keep track of the current house value, you would use a database as the value would change every year.
Three years later, you would be able to see the value of the house, which is $103,000.
To keep track of the historical house value, you would use a data warehouse as the value of the house should be
$100,000 on year 0,
$101,000 on year 1,
$102,000 on year 2,
$103,000 on year 3.
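A minimal sketch of that example (the tables are hypothetical): the operational database keeps only the current value and overwrites it, while the warehouse keeps one row per year.

    -- Operational database: one row per house, updated in place each year.
    CREATE TABLE house_value (house_id INT PRIMARY KEY, current_value INT);
    INSERT INTO house_value VALUES (1, 100000);
    UPDATE house_value SET current_value = current_value + 1000 WHERE house_id = 1;

    -- Data warehouse: one row per house per year, so the history is preserved.
    CREATE TABLE house_value_history (house_id INT, year_no INT, value INT);
    INSERT INTO house_value_history VALUES (1, 0, 100000);
    INSERT INTO house_value_history VALUES (1, 1, 101000);
    INSERT INTO house_value_history VALUES (1, 2, 102000);
    INSERT INTO house_value_history VALUES (1, 3, 103000);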

Resources