I'm trying to get to grips with Big Data, and mainly with how Big Data is managed.
I'm familiar with the traditional form of data management and data life cycle; e.g.:
Structured data collected (e.g. web form)
Data stored in tables in an RDBMS on a database server
Data cleaned and then ETL'd into a Data Warehouse
Data is analysed using OLAP cubes and various other BI tools/techniques
However, in the case of Big Data, I'm confused about the equivalent version of points 2 and 3, mainly because I'm unsure whether every Big Data "solution" necessarily involves a NoSQL database to handle and store unstructured data, and also what the Big Data equivalent of a Data Warehouse is.
From what I've seen, in some cases NoSQL isn't always used and can be totally omitted - is this true?
To me, the Big Data life cycle goes something along these lines:
Data collected (structured/unstructured/semi)
Data stored in a NoSQL database on a Big Data platform; e.g. HBase on a MapR Hadoop distribution.
Big Data analytic/data mining tools clean and analyse data
But I have a feeling that this isn't always the case, and point 3 may be totally wrong altogether. Can anyone shed some light on this?
When we talk about Big Data, we usually mean huge amounts of data that are, in many cases, written constantly. The data can have a lot of variety as well. Think of a typical Big Data source as a machine on a production line that continuously produces sensor readings for temperature, humidity, etc. Not the typical kind of data you would find in your DWH.
What would happen if you transformed all this data to fit into a relational database? If you have worked with ETL a lot, you know that extracting from the source, transforming the data to fit a schema and then storing it takes time and becomes a bottleneck. Defining a schema up front is too slow. This solution is also usually too costly, as you need expensive appliances to run your DWH; you would not want to fill it with sensor data.
You need fast writes on cheap hardware. With Big Data, you first store the data schemaless (often referred to as unstructured data) on a distributed file system. This file system splits the huge data set into blocks (typically around 128 MB) and distributes them across the cluster nodes. Because the blocks are replicated, individual nodes can also go down without data loss.
If you are coming from the traditional DWH world, you are used to technologies that work well with data that is well prepared and structured. Hadoop and co. are good at hunting for insights, like searching for the needle in the haystack. You gain the power to generate insights by parallelising data processing, and you can process huge amounts of data.
Imagine you have collected terabytes of data and want to run some analytical workload on it (e.g. a clustering). If you had to run it on a single machine, it would take hours. The key idea of Big Data systems is to parallelise execution in a shared-nothing architecture. If you want to increase performance, you add hardware and scale out horizontally. That way you speed up your analysis even on huge amounts of data.
Looking at a modern Big Data stack, at the bottom you have data storage. This can be Hadoop with a distributed file system such as HDFS, or a similar file system. On top of that sits a resource manager that manages access to the file system. On top of that again, you have a data processing engine such as Apache Spark that orchestrates execution on the storage layer.
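To make the layering concrete, here is a minimal PySpark sketch under assumed names (the HDFS path, the column names and the app name are all hypothetical); it is an illustration of the stack, not a reference setup:

    # Spark as the processing engine, sitting on a resource manager (e.g. YARN)
    # and a distributed file system (HDFS). Paths and columns are made up.
    from pyspark.sql import SparkSession

    spark = (SparkSession.builder
             .appName("sensor-aggregation")
             .getOrCreate())          # cluster/resource-manager settings come from the Spark config

    # Read schemaless JSON sensor records straight from the distributed file system
    readings = spark.read.json("hdfs:///data/sensors/2024/*.json")

    # A parallel aggregation executed across the cluster nodes (shared-nothing)
    readings.groupBy("machine_id").avg("temperature").show()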
On top of the core data processing engine, you have applications and frameworks such as machine learning APIs that allow you to find patterns in your data. You can run either unsupervised learning algorithms to detect structure (such as a clustering algorithm) or supervised machine learning algorithms to give meaning to patterns in the data and to predict outcomes (e.g. linear regression or random forests).
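For example, a clustering of the sensor data could look roughly like this with Spark's MLlib; the column names and the data frame follow on from the previous sketch and are assumptions, not a prescribed pipeline:

    # Unsupervised learning (k-means) on top of the processing engine.
    from pyspark.ml.clustering import KMeans
    from pyspark.ml.feature import VectorAssembler

    # MLlib expects a single vector column of features
    assembler = VectorAssembler(inputCols=["temperature", "humidity"],
                                outputCol="features")
    feature_df = assembler.transform(readings)    # 'readings' from the sketch above

    model = KMeans(k=3, featuresCol="features").fit(feature_df)   # training runs in parallel on the cluster
    clustered = model.transform(feature_df)       # adds a 'prediction' column with the cluster id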
This is my Big Data in a nutshell for people who are experienced with traditional database systems.
Big data, simply put, is an umbrella term used to describe large quantities of structured and unstructured data that are collected by large organizations. Typically, the amounts of data are too large to be processed through traditional means, so state-of-the-art solutions utilizing embedded AI, machine learning, or real-time analytics engines must be deployed to handle it. Sometimes, the phrase "big data" is also used to describe tech fields that deal with data that has a large volume or velocity.
Big data can go into all sorts of systems and be stored in numerous ways, but it's often stored without structure first and then turned into structured data sets during the extract, transform, load (ETL) stage. This is the process of copying data from multiple sources into a single source, or into a different context than the one it was stored in originally. Most organizations that need to store and use big data sets will have an advanced data analytics solution. These platforms give you the ability to combine data from otherwise disparate systems into a single source of truth, where you can use all of your data to make the most informed decisions possible. Advanced solutions can even provide data visualizations for at-a-glance understanding of the information that was pulled, without the need to worry about the underlying data architecture.
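As a loose illustration of that ETL stage, here is a hedged PySpark sketch (the paths, field names and the choice of engine are assumptions for the example, not part of the answer above): raw, schemaless records land in a staging area and are copied into a structured, columnar table.

    from pyspark.sql import SparkSession
    from pyspark.sql import functions as F

    spark = SparkSession.builder.appName("raw-to-structured").getOrCreate()

    raw = spark.read.json("s3://landing-zone/events/")            # extract: schemaless JSON
    cleaned = (raw
               .filter(F.col("event_type").isNotNull())           # transform: drop junk records
               .withColumn("event_time", F.to_timestamp("event_time"))
               .select("event_id", "event_type", "event_time", "payload"))
    cleaned.write.mode("overwrite").parquet("s3://warehouse/events/")   # load into the structured store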
I've been reading up on what a data cube is, and there are lots of resources explaining what it is and why you would use one (OLAP / business intelligence / aggregations on specific columns), but never how.
Most of the resources seem to be referencing relational data stores but it doesn't seem like you have to use an RDBMS.
But nothing seems to show how you structure the schema and how to query efficiently to avoid the slow run time of aggregating on all of this data. The best I could find was this edx class that is "not currently available": Developing a Multidimensional Data Model.
You probably already know that there are two different OLAP approaches:
MOLAP, which requires a data load step to pre-process the possible aggregations (previously defined as 'cubes'). Internally, a MOLAP-based solution pre-calculates measures for the possible aggregations, and as a result it can execute OLAP queries very fast. The most important drawbacks of this approach come from the fact that MOLAP acts as a cache: you need to re-load the input data to refresh a cube (which can take a lot of time, say hours), and a full reprocessing is needed if you decide to add new dimensions/measures to your cubes. There is also a natural limit on dataset size and cube configuration.
ROLAP, which doesn't try to pre-process the input data; instead, it translates an OLAP query into a database aggregate query to calculate values on the fly. The 'R' means relational, but the approach can be used even with NoSQL databases that support aggregate queries (say, MongoDB). Since there is no data cache, users always get current data (in contrast with MOLAP), but the DB should be able to execute aggregate queries rather fast. For relatively small datasets, usual OLTP databases work fine (SQL Server, PostgreSQL, MySQL etc.), but for large datasets specialized DB engines (like Amazon Redshift) are used; they support efficient distributed usage scenarios and are able to process many TB in seconds.
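To make the ROLAP translation step concrete, here is a toy sketch using Python's standard-library sqlite3 module and a hypothetical 'sales' table (the function name and columns are invented for illustration): a cube-style request of dimensions plus measures is rewritten as an on-the-fly GROUP BY.

    import sqlite3

    def rolap_query(conn, table, dimensions, measures):
        # Translate (dimensions, measures) into an aggregate SQL query
        select_list = ", ".join(dimensions + [f"SUM({m}) AS total_{m}" for m in measures])
        sql = f"SELECT {select_list} FROM {table} GROUP BY {', '.join(dimensions)}"
        return conn.execute(sql).fetchall()

    conn = sqlite3.connect("sales.db")
    rows = rolap_query(conn, "sales", ["region", "year"], ["revenue", "units"])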
Nowadays there is little sense in developing a MOLAP solution; that approach was relevant more than 10 years ago, when servers were limited by small amounts of RAM and an SQL database on HDDs was not able to process GROUP BY queries fast enough, so MOLAP was the only way to get truly 'online' analytical processing. Currently we have very fast NVMe SSDs, and servers can have hundreds of gigabytes of RAM and tens of CPU cores, so for a relatively small database (up to a TB or a bit more) usual OLTP databases can work fast enough as a ROLAP backend (executing queries in seconds); for really big data, MOLAP is almost unusable anyway, and a specialized distributed database should be used instead.
The general wisdom is that cubes work best when they are based on a 'dimensional model' AKA a star schema that is often (but not always) implemented in an RDBMS. This would make sense as these models are designed to be fast for querying and aggregating.
Most cubes do the aggregations themselves in advance of the user interacting with them, so from the user perspective the aggregation/query time of the cube itself is more interesting than the structure of the source tables. However, some cube technologies are nothing more than a 'semantic layer' that passes through queries to the underlying database, and these are known as ROLAP. In those cases, the underlying data structure becomes more important.
The data interface presented to the user of the cube should be simple from their perspective, which would often rule out non-dimensional models such as basing a cube directly on an OLTP system's database structure.
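For reference, a star-schema aggregate of the kind such cubes are built on might look like this; the table and column names are hypothetical, and the query is just plain SQL run from Python to keep the example self-contained.

    import sqlite3

    conn = sqlite3.connect("warehouse.db")
    # Fact table joined to two dimension tables, then aggregated: the shape
    # of query a dimensional model is designed to make fast.
    cube_slice = conn.execute("""
        SELECT d.calendar_year, p.category, SUM(f.sales_amount) AS total_sales
        FROM   fact_sales f
        JOIN   dim_date    d ON f.date_key    = d.date_key
        JOIN   dim_product p ON f.product_key = p.product_key
        GROUP  BY d.calendar_year, p.category
    """).fetchall()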
I am reading the book Big Data For Dummies.
Welcome to Big Data For Dummies. Big data is becoming one of the most important technology trends that has the potential for dramatically changing the way organizations use information to enhance the customer experience and transform their business models.
Big data enables organizations to store, manage, and manipulate vast amounts of data at the right speed and at the right time to gain the right insights. The key to understanding big data is that data has to be managed so that it can meet the business requirement a given solution is designed to support. Most companies are at an early stage with their big data journey.
I can understand that "store" means we have to store the data in a DBMS.
My questions on the above text:
What does the author mean by "manage vast amounts of data" in the above context? An example would be helpful.
What does the author mean by "organizations transform their business models" with big data? Again, an example would be helpful.
What does the author mean by "manipulate vast amounts of data" in the above context?
Following are the answers to your questions:
1. What does the author mean by "manage vast amounts of data" in the above context? An example would be helpful.
Ans. When we talk about Big Data, it is data at scale that we mean. "Vast amounts of data" in the above context hints at the volume of data that big data platforms can process: somewhere in the range of terabytes to petabytes, or even more. This volume of data is unmanageable for the age-old relational systems.
Example: Twitter, Facebook, Google etc. handle petabytes of data on a daily basis.
2. What does the author mean by "organizations transform their business models" with big data? Again, an example would be helpful.
Ans. With the use of big data technologies, organizations can gain deep insights into their business and accordingly shape future strategies that help them win a larger share of the market.
Example: the online retail giant Amazon thrives on user data that tells it about users' online shopping patterns; based on that, it creates more products and services that are likely to boost the business and keep it ahead of its competitors.
3. What does the author mean by "manipulate vast amounts of data" in the above context? An example would be helpful.
Ans. We can manage humongous amounts of data with big data platforms, but managing alone is not enough. So we use sophisticated tools that help us manipulate the data in such a way that it turns into business insights and, ultimately, into money.
Example: clickstream data. This data consists of users' clicks on websites, how much time they spent on a particular site, on a particular item, and so on. All of this, when manipulated properly, yields greater business insight about the users and hence bigger profits.
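As a toy illustration of that kind of manipulation (the file name and columns below are invented for the example), a few lines of pandas can already turn raw clickstream rows into a simple insight:

    import pandas as pd

    clicks = pd.read_csv("clickstream.csv")            # columns: user_id, page, seconds_on_page
    time_per_page = (clicks
                     .groupby("page")["seconds_on_page"]
                     .sum()
                     .sort_values(ascending=False))
    print(time_per_page.head(10))                      # the pages users spend the most time on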
A vast amount of data means very large files: not MBs or GBs, but possibly terabytes. For example, some social networking sites generate roughly 6 TB of data every day.
Organizations have been using traditional RDBMSs to handle their data, but they are now implementing Hadoop and Spark to manage big data more easily. So, day by day, they are changing their business tactics with the help of new technology, and they can easily get a view of their customers by analysing the resulting insights.
Your assumption/understanding,
"I can understand store means we have to store in DBMS"
was true a long time ago. I am addressing that aspect in this detailed answer, so that you get the Big Data concept clear up front. (I will answer your listed questions in a subsequent post.)
It's not just a DBMS/RDBMS any more. It's data storage in general, ranging from file systems to data stores.
In the Big Data context, it refers to:
a) big data (the data itself), and
b) a storage system: a distributed file system (highly available, scalable and fault-tolerant being the salient features, with high throughput and low latency as targets) that handles far larger volumes (and not necessarily homogeneous or single-type data) than a traditional DBMS in terms of I/O and durable/consistent storage, and
c) by extension, the Big Data ecosystem of systems, frameworks and projects that handle, interact with, or are based on the above two. Example: Apache Spark.
Such a store can hold just about any file, including raw files as-is. The DBMS-equivalent data storage systems for Big Data additionally allow you to give structure to data, or to store already structured data.
Just as you store data on any normal user device (a computer, a hard disk or an external hard disk), you can think of a Big Data store as a cluster (a defined/configurable networked collection of nodes) of commodity hardware and storage components, which together provide a single, aggregated, distributed data/file view of the storage system. Each node needs at least a configurable network IP, which is why a storage device or disk is usually mounted/attached to a computer system or server.
So the data can be: structured (traditional DBMS equivalent), relationally structured (RDBMS equivalent), unstructured (e.g. text files and more) and semi-structured files/data (CSV, JSON, XML etc.).
With respect to Big Data, it can be flat files, text files, log files, image files, video files or binary files.
There is also row-oriented and/or column-oriented data, when structured or semi-structured data is stored and treated as database/data-warehouse data. Example: Hive is a data warehouse on Hadoop that allows storing structured relational data, CSV files etc. in their as-is file format or in a specific one such as Parquet, Avro or ORC.
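As a small, hedged sketch of that Hive idea (the table name, columns and the use of Spark's Hive-compatible SQL interface are all assumptions for illustration), a warehouse table over Parquet files can be created and queried with plain SQL:

    from pyspark.sql import SparkSession

    spark = (SparkSession.builder
             .appName("hive-example")
             .enableHiveSupport()          # use the Hive metastore for table definitions
             .getOrCreate())

    spark.sql("""
        CREATE TABLE IF NOT EXISTS sensor_readings (
            machine_id   STRING,
            temperature  DOUBLE,
            reading_time TIMESTAMP
        )
        STORED AS PARQUET
    """)
    spark.sql("SELECT machine_id, AVG(temperature) FROM sensor_readings GROUP BY machine_id").show()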
In terms of volume/size, individual files can be MBs, GBs or sometimes TBs (KB-sized files are not recommended), aggregating to TBs and PBs (or more; there is no official limit as such) across the store/system at any point in time.
It can be batch data, discrete stream data, or real-time streaming data and feeds.
(Wide Data goes beyond Big Data in terms of nature, size and volume etc.)
Book for Beginners:
In terms of a book for beginners, "Big Data For Dummies" is not a bad option (I have not personally read it, but I know the series' style from my software engineering degree studies way back).
However, I suggest you go for the book "Hadoop: The Definitive Guide". You should get the latest edition, which is the 4th Edition (2015). It is based on Hadoop 2.x; though it has not been updated with the latest 2.x changes, you will find it a really good book to read.
Beyond:
Though Hadoop 3 is in its alpha phase, you need not worry about that just now.
Do follow the Apache Hadoop site and documentation, though (ref: http://hadoop.apache.org/).
Know and learn the Hadoop Ecosystem as well.
(Big Data and Hadoop have become almost synonymous nowadays, though Hadoop is based on the Big Data concept. Hadoop is an open-source Apache project, used in production.)
The file system I mentioned is HDFS (the Hadoop Distributed File System), or similar ones.
Otherwise, the data lives in other cloud storage systems, including AWS S3, Google Cloud Storage and Azure Blob Storage (object storage).
Big data can also be stored in NoSQL databases, which function as non-relational, flexible-schema data stores, though they are not optimised for strictly relational data: if you store relational data in them, relational constraints are removed/broken by default, and they are not inherently SQL-oriented even though SQL-like interfaces are provided. Examples are NoSQL DBs like HBase (on top of HDFS, based on Bigtable), Cassandra and MongoDB, chosen depending on the type of data (or direct files) to be stored and the CAP-theorem attributes to be handled.
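A minimal sketch of that flexible-schema point, using the pymongo driver against a local MongoDB instance (the database, collection and document fields are made up for the example): documents with different shapes land in the same collection without any schema change.

    from pymongo import MongoClient

    client = MongoClient("mongodb://localhost:27017")
    events = client["bigdata_demo"]["events"]

    # Two differently shaped documents in the same collection
    events.insert_one({"type": "sensor", "machine_id": "m-42", "temperature": 71.3})
    events.insert_one({"type": "clickstream", "user": "u-1", "page": "/home", "ms": 5400})

    print(events.find_one({"type": "sensor"}))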
I ask this question apprehensively because it is not a pure programming question, and because I am seeking a (well informed) suggestion.
I have an analytic front end, written in JavaScript, with lots of aggregations and charting happening in the browser (dimple.js, even stats.js, ...)
I want to feed this application with JSON or delimited data from some high-performance data structure server. No writes except for loading. The data will be maybe 1-5 GB in size, and there could be dozens, if not hundreds, of concurrent readers, but only during peak hours. This data is collected from and fed by Apache Hive.
Now my question is about the selection of a database/datastore server choices for this.
(I have pretty good command of SQL/NoSQL choices, so I am really seeking advice for the very specific requirements)
Requirements and specifications for this datastore are:
Mostly if not all queries will be reads, initiated by the web, JS-based front end.
Data can be served as JSON or flat tabular csv, psv, tsv.
Total data size on this store will be 1-5 GB, with possible future growth, but nothing imminent (6-12 months)
Data on this datastore will be refreshed/loaded into this store daily. Probably never in real time.
Data will/can be accessed via some RESTful web services, Socket IO, etc.
The faster the read access, the better. Speed matters.
There has to be a security/authentication method for sensitive data protection.
It needs to be reasonably stable, not a patching-requiring bleeding edge.
Liberal, open source license.
So far, my initial candidates for examination were Postgres (optimized for large cache) and Mongo. Just because I know them pretty well.
I am also familiar with Redis and Couch.
I did not run benchmarks myself, but I have seen benchmarks where Postgres was faster than Mongo (while offering a JSON format). Mongo is web-friendlier.
I am considering in-memory stores with persistence such as Redis, Aerospike, Memcached. Redis 3.0 is my favorite so far.
So, I ask you here if you have any recommendations for the production quality datastore that would fit well what I need.
Any civil and informed suggestions are welcome.
What exactly does your data look like? Since you mentioned CSV-like exports, I'm assuming this is tabular, structured data of the kind that would usually be found in a relational database?
Some options:
1. Don't use a database
Given the small dataset, just serve it out of memory. You can probably spend a few hours to write a quick app with any decent web framework that just loads up the data into memory (for example, from a flat file) and then searches and returns this data in whatever format and way you need.
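As a rough sketch of that approach (Flask, pandas, the file name and the column are all assumptions chosen just for the illustration), loading the export once at startup and serving filtered JSON is only a handful of lines:

    from flask import Flask, jsonify, request
    import pandas as pd

    app = Flask(__name__)
    data = pd.read_csv("hive_export.csv")          # loaded into memory once at startup

    @app.route("/metrics")
    def metrics():
        region = request.args.get("region")        # optional ?region= filter
        subset = data[data["region"] == region] if region else data
        return jsonify(subset.to_dict(orient="records"))

    if __name__ == "__main__":
        app.run()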
2. Use an embedded database
You can also try an embedded database like SQLite which gives you in-memory performance but with a reliable SQL interface. Since it's just a single-file database, you can have another process generate a new DB file, then swap it out when you update the data for the app.
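A hedged sketch of that swap pattern with Python's built-in sqlite3 (the table, columns and file names are illustrative): the daily load job builds a fresh database file and an atomic rename makes it live.

    import os
    import sqlite3

    def build_new_db(rows, staging_path="analytics_new.db", live_path="analytics.db"):
        conn = sqlite3.connect(staging_path)
        conn.execute("CREATE TABLE metrics (region TEXT, revenue REAL)")
        conn.executemany("INSERT INTO metrics VALUES (?, ?)", rows)
        conn.commit()
        conn.close()
        os.replace(staging_path, live_path)        # atomic swap; readers simply reopen the new file

    build_new_db([("EMEA", 1200.5), ("APAC", 980.0)])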
3. Use a full database system
Use a regular relational database. MySQL, PostgreSQL and SQL Server (Express Edition) are all free, can handle that dataset easily, and will simply cache it all in RAM. If it's mostly read queries, I don't see any issues with a few hundred concurrent users. You can also use the MemSQL community edition if you need more performance. They all support security, are very reliable, and you can't beat SQL for data access.
Use a key/value system if your data isn't relational or tabular and fits better as simple values or documents. However, remember that KV stores aren't great at scans or aggregations and don't have joins. Memcached is just a distributed cache; don't use it for real data. Redis and Aerospike are both great key/value systems, with Redis giving you lots of nice data structures to use. Mongo is good for data flexibility. Elasticsearch is a good option for advanced search-like queries.
If you're going to these database systems though, you will still need a thin app layer somewhere to interface with the database and then return the data in the proper format for your frontend.
If you want to skip that part, then just use CouchDB or Riak instead. Both are document-oriented and have a native HTTP interface with JSON responses, so you can consume them directly from your frontend, although this might cause security issues since anyone can see the JavaScript calls.
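To show what "consume it directly" means in practice (the host, database and document id below are hypothetical, and the requests library stands in for the browser's fetch call), a CouchDB document comes back as plain JSON from a single GET:

    import requests

    resp = requests.get("http://localhost:5984/analytics/daily_summary_2024_01_01")
    doc = resp.json()                  # already JSON, ready for the dimple.js charts
    print(doc)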
I have an ESB that processes a lot of transactions per second (5000). It receives all types of requests in different formats (XML, JSON, CSV, and some are format-less). As you can imagine, that is a lot of requests being processed.
The problem is that, due to requirements, I have to log every single piece of this data for auditing/issue resolution. This data has to be searchable using any part of the request data that comes to the user's mind. The major problems are:
The data (XML) is heavy and causes insert locks on our RDBMS (SQL Server 2008).
Querying this large data (XML and other unstructured data) takes a lot of time, especially when it is not optimized. (Free Text Search didn't solve my problem; it is still too slow.)
The data grows very fast (expectedly; I am hoping there are databases that can optimize stored data to conserve space). A few months of data eats up hundreds of gigabytes.
The question is: what database, or even design principle, can best solve my problems - NoSQL, RDBMS, something else? I want something that can log very fast and search very fast using any part of the stored data.
I would consider Elastic Search: http://www.elasticsearch.org/
The benefits for your use case:
Can scale very large. You just add nodes to the cluster as the data grows.
Based on Lucene, so you know it's a time tested search engine.
It is schemaless, so you don't have to do any ETL to store data. Just store it as is.
It is well supported by a good community and has many enterprise companies using it (including Stack Overflow).
It's free!
It's easy to search against and provides lots of control over how to boost certain results so you can tune it for your domain.
I would consider putting a queue in front of it in case you are trying to write faster than it can handle.
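For instance, a minimal sketch of the queue-plus-indexing idea might look like the following; the host, index name and record shape are assumptions, and indexing is done through Elasticsearch's plain REST API (the exact endpoint can differ between Elasticsearch versions):

    import json
    import queue
    import requests

    audit_queue = queue.Queue()

    def index_worker():
        # Drain the queue and index each audit record into Elasticsearch
        while True:
            record = audit_queue.get()
            requests.post("http://localhost:9200/audit/_doc",
                          data=json.dumps(record),
                          headers={"Content-Type": "application/json"})
            audit_queue.task_done()

    # The ESB would push each request it processes onto the queue, e.g.:
    audit_queue.put({"raw": "<order id='1'/>", "format": "xml", "ts": "2015-06-01T12:00:00Z"})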
I've just come across RRD lately while trying out the Ganglia monitoring system. Ganglia stores its monitoring data in RRD. I am just wondering, from a scalability perspective, how does RRD work? What if I have a potentially huge amount of data to store? As in the Ganglia case, if I want to store all the historical monitoring statistics instead of only keeping recent data with a specific TTL, will RRD be good enough to cope with that?
Can someone who has used RRD share some experience on how RRD scales, and how it compares to an RDBMS or even Bigtable?
The built-in consolidation feature of rrdtool is configurable, so depending on your disk space there is no limit to the amount of high-precision data you can store with rrdtool. Also, due to its design, rrdtool databases never have to be vacuumed or otherwise maintained, so you can grow the setup to staggering sizes. Obviously you need enough memory and fast disks for rrdtool to work with big data, but this is the same with any large data setup.
Some people get confused about rrdtool's abilities because you can also run it on a tiny embedded system; when those people start logging gigabytes' worth of data on an old PC from the attic and find that it does not cope, they wonder why ...
RRD is designed to automatically blur (average out) your data over time, such that the total size of the database stays roughly the same, even as new data continuously arrives.
So, it is only good if you want some historical data and are willing to lose precision over time.
In other words, you cannot really compare RRD to standard SQL databases or to Bigtable, because standard SQL and NoSQL databases all store data precisely - you will read exactly what was written.
With RRDtool, however, there is no such guarantee. But its speed makes it an attractive solution for all kinds of monitoring setups where only the most recent data matters.
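To make the trade-off concrete, here is a pure-Python illustration of the consolidation ("blurring") idea, mimicking rrdtool's AVERAGE consolidation function rather than its actual API (the function and sample values are invented for the example):

    def consolidate(samples, bucket_size):
        """Average every `bucket_size` raw samples into one stored value."""
        return [sum(samples[i:i + bucket_size]) / bucket_size
                for i in range(0, len(samples) - bucket_size + 1, bucket_size)]

    raw = [20.1, 20.3, 20.2, 20.8, 21.0, 21.4]     # e.g. per-minute temperature readings
    print(consolidate(raw, 3))                      # -> roughly [20.2, 21.07], kept as 3-minute averages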