I have seen that combining Spark with Cassandra is relatively popular.
I know that Cassandra is a big-data solution that provides reliability over consistency and therefore fits real-time systems. It also provides an SQL-like query syntax (CQL), but under the hood it manages its data very differently from a conventional database.
Hadoop, on the other hand, provides consistency over reliability and therefore fits analytics systems. Its interface is MapReduce, which is quite slow and too low-level for today's needs. This is where Spark comes in. Spark uses Hadoop's HDFS and replaces the old MapReduce with a better architecture that relies on memory rather than disk, and exposes better interfaces such as RDDs and DataFrames.
So my question is:
Why would I want to use Spark combined with Cassandra? What are the advantages? Why not just use one of them?
As far as I understand, Cassandra would just replace HDFS, so I'd get reliability over consistency, and I'd also have to use RDDs/DataFrames instead of CQL, with Spark generating CQL under the hood, which gives me less control.
Spark is a data processing framework. You are going to process your data with Spark.
Cassandra is a DBMS. You are going to store your data in Cassandra.
It is true that you can process data in Cassandra with CQL, and if you can get away with CQL, you probably don't need Spark. However, in general Spark is a far more powerful tool. In practice, many people use Spark to ingest data from an external source, process it, and store the processed results in Cassandra.
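A minimal sketch of that pattern, assuming the DataStax spark-cassandra-connector is on the classpath and Spark can reach the cluster; the source path, keyspace, table, and column names are made up for illustration:

```scala
import org.apache.spark.sql.SparkSession
import org.apache.spark.sql.functions.{col, to_date}

object IngestToCassandra {
  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder()
      .appName("ingest-to-cassandra")
      .config("spark.cassandra.connection.host", "cassandra-host") // assumption
      .getOrCreate()

    // 1. Receive raw data from an external source (here, a CSV drop on HDFS).
    val raw = spark.read.option("header", "true").csv("hdfs:///landing/clicks/*.csv")

    // 2. Process it: filter bad rows and derive the columns the target table needs.
    val cleaned = raw
      .filter(col("user_id").isNotNull)
      .withColumn("click_date", to_date(col("clicked_at")))

    // 3. Store the processed result in a Cassandra table (keyspace/table are hypothetical).
    cleaned.write
      .format("org.apache.spark.sql.cassandra")
      .options(Map("keyspace" -> "analytics", "table" -> "clicks_by_user"))
      .mode("append")
      .save()

    spark.stop()
  }
}
```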
HDFS is a distributed file system, with Hadoop sitting on top of it.
There are also many database engines that run on top of Hadoop and HDFS, like HBase, Hive, etc., which take advantage of its distributed architecture.
You don't have to run Spark on Hadoop; you can run it independently.
Cassandra's CQL is very, very basic. Basic aggregation functions were added in recent versions, but Cassandra wasn't designed for analytical workloads; you will likely both struggle to express analytical queries and "kill" your cluster's performance.
You can't compare HDFS and Cassandra, just as you can't compare NTFS and MySQL. Cassandra is designed for heavy workloads and easy scalability, based on concepts from Amazon's Dynamo and Google's BigTable, and can handle a very high number of requests per second. There are alternatives that run on Hadoop, like HBase, and Cassandra wins in every benchmark I've seen (but don't trust benchmarks; always test with your own data and use case).
So what Spark solves here is executing analytical queries on top of data sitting in Cassandra. Using Spark, you can take data from many sources (RDBMSs, files, Hadoop, etc.) and run analytical queries against that data.
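For example, here is a hedged sketch of the kind of analytical query Spark enables on top of Cassandra data: joining a Cassandra table with a flat file and aggregating across the result, again via the spark-cassandra-connector. Every keyspace, table, column, and path name below is an assumption:

```scala
import org.apache.spark.sql.SparkSession
import org.apache.spark.sql.functions.{col, sum}

object CrossSourceReport {
  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder()
      .appName("cross-source-report")
      .config("spark.cassandra.connection.host", "cassandra-host") // assumption
      .getOrCreate()

    // Orders live in Cassandra (hypothetical keyspace/table).
    val orders = spark.read
      .format("org.apache.spark.sql.cassandra")
      .options(Map("keyspace" -> "shop", "table" -> "orders"))
      .load()

    // Product reference data comes from a file (could be an RDBMS via JDBC instead).
    val products = spark.read.option("header", "true").csv("hdfs:///ref/products.csv")

    // An analytical query CQL alone could not express: a join plus a global aggregation.
    val revenueByCategory = orders
      .join(products, "product_id")
      .groupBy("category")
      .agg(sum(col("amount")).as("revenue"))
      .orderBy(col("revenue").desc)

    revenueByCategory.show(20)
    spark.stop()
  }
}
```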
Also, this
reliability over consistency and therefore fits real-time systems
is simply wrong. There are many real-time systems that need strong (not eventual) consistency, serializability, transactions, etc., which Cassandra can't provide...
Cassandra is a NoSQL database and is very limited in its analytics functionality.
For instance, CQL supports aggregation only within a single partition, and there are no table joins.
Spark is a distributed processing engine that can read data from HDFS or from a database. So if you want to do deep analysis across the whole dataset, you have to use Spark for it.
You can read more about Cassandra and Big Data here
I want to use NoSQL for my application. The purpose is to store user log data, analyze it, and provide customized data to users. While searching for an algorithm and method to process large amounts of log data quickly, we came across MapReduce.
I have a few questions:
Is MapReduce an algorithm?
Is MapReduce suitable for fast processing of large amounts of data?
How can I use NoSQL together with MapReduce for faster processing?
I know that MongoDB supports map-reduce; is that correct?
I do not fully understand the relationship between NoSQL and MapReduce.
Thanks.
NoSQL ("Not only SQL") databases are databases that can hold structured, semi-structured (XML, JSON), or unstructured data (e.g., free text).
Yes, they can help with processing large data sets.
MapReduce, on the other hand, is a programming model. Please read this article to understand how MapReduce works in NoSQL and big data applications.
EDIT
Here are some good resources for learning MapReduce and big data technologies. BTW, these tutorials are in Hindi.
Is MapReduce an algorithm?
MapReduce is not exactly an algorithm; rather, it is a programming model that many algorithms can be expressed in, provided they are a good "fit". MapReduce leverages Hadoop's distributed data storage and processing. As you may have noticed, not every algorithm can be implemented efficiently with MapReduce, so the design decision should be based on factors like data volume, processing constraints, etc.
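To make the "model, not algorithm" point concrete, here is a toy word count written with plain Scala collections that mimics the map, shuffle, and reduce phases; it only illustrates the programming model and is not Hadoop code:

```scala
object WordCountSketch {
  def main(args: Array[String]): Unit = {
    val lines = Seq("to be or not to be", "to be is to do")

    // "Map" phase: emit a (word, 1) pair for every word in every line.
    val mapped: Seq[(String, Int)] =
      lines.flatMap(_.split("\\s+")).map(word => (word, 1))

    // "Shuffle" phase: group all pairs that share the same key.
    val shuffled: Map[String, Seq[Int]] =
      mapped.groupBy(_._1).map { case (word, pairs) => (word, pairs.map(_._2)) }

    // "Reduce" phase: combine the values for each key.
    val counts: Map[String, Int] =
      shuffled.map { case (word, ones) => (word, ones.sum) }

    counts.toSeq.sortBy(-_._2).foreach { case (w, n) => println(s"$w -> $n") }
  }
}
```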
Is MapReduce suitable for fast processing of large amounts of data?
MapReduce does a lot of disk I/O during processing and hence is not suitable for cases where execution time is a tight constraint. You may want to switch to Spark for faster processing; using the Tez engine instead of MapReduce is another option. However, do not compare MapReduce performance with a NoSQL database like HBase; MapReduce and NoSQL belong to two entirely different technology stacks.
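For comparison, here is the same word count expressed with Spark's RDD API, which keeps intermediate results in memory rather than writing every stage to disk (the main reason Spark tends to be faster here). This is a sketch assuming a local Spark installation; the input path is hypothetical:

```scala
import org.apache.spark.sql.SparkSession

object SparkWordCount {
  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder()
      .appName("spark-word-count-sketch")
      .master("local[*]") // local mode for illustration only
      .getOrCreate()

    // Hypothetical input path; this could equally be an HDFS URI.
    val lines = spark.sparkContext.textFile("/tmp/input.txt")

    val counts = lines
      .flatMap(_.split("\\s+"))
      .map(word => (word, 1))
      .reduceByKey(_ + _) // the "reduce" step stays in memory where possible

    counts.take(20).foreach(println)
    spark.stop()
  }
}
```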
How can I use NoSQL together with MapReduce for faster processing?
It depends on your use case. It is very common to feed HBase data into a MapReduce program to produce analytical results.
I know that MongoDB supports map-reduce; is that correct?
Let me rephrase it: MapReduce is a processing model for which MongoDB can be a data source. (MongoDB also ships its own built-in mapReduce command.)
I ask this question apprehensively because it is not a pure programming question, and because I am seeking a (well informed) suggestion.
I have an analytic front end, written in JavaScript, with lots of aggregations and charting happening in the browser (dimple.js, even stats.js, ...)
I want to feed this application with JSON or delimited data from some high-performance data structure server. No writes except for loading. The data will be maybe 1-5 GB in size, and there could be dozens, if not hundreds, of concurrent readers, but only in peak hours. This data is collected from and fed by Apache Hive.
Now my question is about selecting a database/datastore server for this.
(I have pretty good command of SQL/NoSQL choices, so I am really seeking advice for the very specific requirements)
Requirements and specifications for this datastore are:
Mostly if not all queries will be reads, initiated by the web, JS-based front end.
Data can be served as JSON or flat tabular csv, psv, tsv.
Total data size on this store will be 1-5 GB, with possible future growth, but nothing imminent (6-12 months)
Data on this datastore will be refreshed/loaded daily. Probably never in real time.
Data will/can be accessed via some RESTful web services, Socket IO, etc.
The faster the read access, the better. Speed matters.
There has to be a security/authentication method for sensitive data protection.
It needs to be reasonably stable, not a patching-requiring bleeding edge.
Liberal, open source license.
So far, my initial candidates for examination were Postgres (optimized for large cache) and Mongo. Just because I know them pretty well.
I am also familiar with Redis, Couch.
I did not benchmark them myself, but I have seen benchmarks where Postgres was faster than Mongo (while also offering a JSON format). Mongo is web-friendlier.
I am considering in-memory stores with persistence such as Redis, Aerospike, Memcached. Redis 3.0 is my favorite so far.
So, I ask you here if you have any recommendations for the production quality datastore that would fit well what I need.
Any civil and informed suggestions are welcome.
What exactly does your data look like? Since you mentioned CSV-like exports, I'm assuming this is tabular, structured data that would usually be found in a relational database?
Some options:
1. Don't use a database
Given the small dataset, just serve it out of memory. You can probably spend a few hours writing a quick app with any decent web framework that loads the data into memory (for example, from a flat file) and then searches and returns it in whatever format and shape you need.
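As a rough sketch of that idea, the following loads a pipe-separated export into memory and serves filtered rows as JSON using only the JDK's built-in HTTP server; the file path, column layout, and port are all assumptions:

```scala
import com.sun.net.httpserver.{HttpExchange, HttpHandler, HttpServer}
import java.net.InetSocketAddress
import scala.io.Source

object InMemoryDataServer {
  def main(args: Array[String]): Unit = {
    // Assume a pipe-separated export from Hive with three columns: id|region|value
    val rows: Vector[Array[String]] =
      Source.fromFile("/data/export.psv").getLines().map(_.split('|')).toVector

    val server = HttpServer.create(new InetSocketAddress(8080), 0)
    server.createContext("/rows", new HttpHandler {
      override def handle(exchange: HttpExchange): Unit = {
        // Very naive filter: /rows?region=EU returns rows whose region column matches.
        val query  = Option(exchange.getRequestURI.getQuery).getOrElse("")
        val region = query.stripPrefix("region=")
        val hits   = rows.filter(r => region.isEmpty || r(1) == region)
        val json = hits
          .map(r => s"""{"id":"${r(0)}","region":"${r(1)}","value":"${r(2)}"}""")
          .mkString("[", ",", "]")
        val bytes = json.getBytes("UTF-8")
        exchange.getResponseHeaders.add("Content-Type", "application/json")
        exchange.sendResponseHeaders(200, bytes.length)
        exchange.getResponseBody.write(bytes)
        exchange.close()
      }
    })
    server.start()
  }
}
```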
2. Use an embedded database
You can also try an embedded database like SQLite, which gives you in-memory-like performance with a reliable SQL interface. Since it's just a single-file database, you can have another process generate a new DB file and then swap it in when you refresh the data for the app.
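A hedged sketch of that approach with SQLite over JDBC (it requires the org.xerial sqlite-jdbc driver on the classpath; the paths, table, and columns are hypothetical): query the live file for reads, and atomically swap in a freshly built file after the daily load.

```scala
import java.nio.file.{Files, Paths, StandardCopyOption}
import java.sql.DriverManager

object SqliteSwapSketch {
  private val livePath  = "/srv/analytics/live.db"
  private val stagePath = "/srv/analytics/staging.db"

  // Read-only query against the live single-file database.
  def query(region: String): Seq[(String, Double)] = {
    val conn = DriverManager.getConnection(s"jdbc:sqlite:$livePath")
    try {
      val stmt = conn.prepareStatement(
        "SELECT metric, value FROM daily_metrics WHERE region = ?")
      stmt.setString(1, region)
      val rs  = stmt.executeQuery()
      val buf = scala.collection.mutable.ListBuffer.empty[(String, Double)]
      while (rs.next()) buf += ((rs.getString("metric"), rs.getDouble("value")))
      buf.toList
    } finally conn.close()
  }

  // After the daily Hive export has rebuilt staging.db, replace the live file atomically.
  def swapInNewData(): Unit =
    Files.move(Paths.get(stagePath), Paths.get(livePath),
      StandardCopyOption.REPLACE_EXISTING, StandardCopyOption.ATOMIC_MOVE)
}
```

One caveat: the app should reopen its connections after the swap so new queries hit the new file.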
3. Use a full database system
Use a regular relational database. MySQL, PostgreSQL, and SQL Server (Express Edition) are all free, can handle that dataset easily, and will simply cache it all in RAM. If it's mostly read queries, I don't see any issues with a few hundred concurrent users. You can also use MemSQL Community Edition if you need more performance. They all support security, are very reliable, and you can't beat SQL for data access.
4. Use a key/value store

Use a key/value system if your data isn't relational or tabular and fits better as simple values or documents. However, remember that KV stores aren't great at scans or aggregations and don't have joins. Memcached is just a distributed cache; don't use it for real data. Redis and Aerospike are both great key/value systems, with Redis giving you lots of nice data structures to use. Mongo is good for data flexibility. Elasticsearch is a good option for advanced search-like queries.
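As a tiny illustration of the key/value style (assuming a local Redis instance and the Jedis client on the classpath; the key name and payload are made up), you can cache a pre-rendered JSON blob under a key your app layer fetches on request:

```scala
import redis.clients.jedis.Jedis

object RedisSketch {
  def main(args: Array[String]): Unit = {
    val jedis = new Jedis("localhost", 6379)
    try {
      // Cache a pre-rendered JSON payload under a key the frontend's app layer can request.
      jedis.set("report:daily:2015-06-01", """{"visits": 1024, "signups": 37}""")
      jedis.expire("report:daily:2015-06-01", 24 * 60 * 60) // refreshed by the daily load

      println(jedis.get("report:daily:2015-06-01"))
    } finally jedis.close()
  }
}
```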
If you're going to these database systems though, you will still need a thin app layer somewhere to interface with the database and then return the data in the proper format for your frontend.
If you want to skip that part, then just use CouchDB or Riak instead. Both are document-oriented and have a native HTTP interface with JSON responses, so you can consume them directly from your frontend, although this might raise security issues since anyone can see the JavaScript calls.
I am currently working on a long term project that will need to support:
Lots of fast Read/Write operations via RESTful Services
An Analytics Engine continually reading and making sense of data
It is vital that the performance of the Analytics Engine not be affected by the volume of Reads/Writes coming from the API calls.
Because of that, I'm thinking that I may have to use a "front-end" database and some sort of "back-end" data warehouse. I would also need to have something like Elasticsearch or Solr indexing the data stored in the data warehouse.
The Questions:
Is this a Recommended Setup? What would the alternative be?
If so...
I'm considering either Hive or Pig for the data warehousing, and Elasticsearch or Solr as the search engine. Which combination is known to work well together?
And finally...
I'm seriously considering Cassandra as the "front-end" database. What is the relationship between Cassandra and Hadoop, and when/why should they be put to work together instead of using just Cassandra?
Please note, my intention is NOT to start a debate about which of these is better, but to understand how they can be put to work together more efficiently. If it makes any difference, the main code is being written in Scala and Java.
I truly appreciate your help. I'm basically learning as I go and all comments will be very helpful.
Thank you.
First let's talk about Cassandra
This is a NoSQL database with eventual consistency, which basically means that different nodes in a Cassandra cluster may hold different 'snapshots' of the data when there is an intra-cluster communication/availability problem. The data will eventually become consistent, however.
Since you are considering it as a 'front-end' database, what you need to understand is how you will model your data. Cassandra can take advantage of indexes; however, you still need to define your access patterns upfront.
Normally there is no relation between Cassandra and Hadoop (except that both are written in Java); however, the DataStax Enterprise distribution ships Hadoop support directly alongside Cassandra.
As a general workflow, you would read/write the most current data (say, the last 24 hours) from your 'small' database, which has enough performance for that (Cassandra handles it very well), and you would move anything older than X (older than 24 hours) to 'long-term storage' such as Hadoop, where you can run all sorts of MapReduce jobs, etc.
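A hedged sketch of that workflow using Spark with the spark-cassandra-connector: read rows older than 24 hours from a hypothetical "hot" Cassandra table and archive them to HDFS as Parquet, where Hadoop-side jobs can process them later. All keyspace, table, column, and path names are assumptions:

```scala
import org.apache.spark.sql.SparkSession
import org.apache.spark.sql.functions.{col, expr}

object ArchiveOldEvents {
  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder()
      .appName("archive-old-events")
      .config("spark.cassandra.connection.host", "cassandra-host") // assumption
      .getOrCreate()

    // Hypothetical "hot" table holding recent events.
    val events = spark.read
      .format("org.apache.spark.sql.cassandra")
      .options(Map("keyspace" -> "app", "table" -> "events"))
      .load()

    // Everything older than 24 hours goes to long-term storage on HDFS.
    val old = events.filter(col("event_time") < expr("current_timestamp() - INTERVAL 24 HOURS"))

    old.write
      .mode("append")
      .parquet("hdfs:///archive/events")

    spark.stop()
  }
}
```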
Regarding text search, it really depends on what you need; Elasticsearch and Solr are direct competitors. You can see for yourself how they compare here: http://solr-vs-elasticsearch.com/
As for your third question: I think of Cassandra as a database for storing data, while Hadoop provides a computation model that lets you analyze the large amounts of data held in Cassandra. So it can be very helpful to combine Cassandra with Hadoop.
There are other combinations you can consider, such as MongoDB with Hadoop, since Mongo provides a connector between Hadoop and its data. And if you have search requirements, you can also use Solr and build the index directly from Mongo.
Are either HBase or Hive suitable replacements for a traditional (non-)relational database? Will they be able to serve web requests from clients and respond in a timely manner? Or are HBase/Hive only suitable for large-dataset analysis? Sorry, I'm a noob on this subject. Thanks in advance!
Hive is not at all suitable for any real-time need such as timely web responses. You can use HBase, though. But don't think of either HBase or Hive as a replacement for a traditional RDBMS; both were meant to serve different needs. If your data is not huge, you are better off with an RDBMS; RDBMSs are still the best choice if they fit your requirements. Technically speaking, HBase is really more a data store than a database, because it lacks many of the features you find in an RDBMS, such as typed columns, secondary indexes, triggers, and an advanced query language.
And the thing most likely to trip up a newbie is HBase's lack of SQL support, since it belongs to the NoSQL family of stores.
And HBase/Hive are not the only options for handling large datasets. There are several others, like Cassandra, Hypertable, MongoDB, Accumulo, etc., but each one is meant to solve a specific problem. For example, MongoDB is used for handling document data. So you need to analyze your use case first and, based on that, choose the datastore that suits your requirements.
You might find this list useful which compares different NoSQL datastores.
HTH
Hive is a data warehouse tool, and it is mainly used for batch processing.
HBase is a NoSQL database that allows random access by row key (the primary key). It is used for transactional-style access. It doesn't have secondary-index support, which could be a limitation for your needs.
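For illustration, random access by row key with the standard HBase Java client looks roughly like this (it assumes the hbase-client dependency and an hbase-site.xml on the classpath; the table and column-family names are made up):

```scala
import org.apache.hadoop.hbase.{HBaseConfiguration, TableName}
import org.apache.hadoop.hbase.client.{ConnectionFactory, Get, Put}
import org.apache.hadoop.hbase.util.Bytes

object HBaseRowKeySketch {
  def main(args: Array[String]): Unit = {
    // Reads hbase-site.xml from the classpath; cluster details are assumed.
    val conf = HBaseConfiguration.create()
    val connection = ConnectionFactory.createConnection(conf)
    try {
      val table = connection.getTable(TableName.valueOf("users")) // hypothetical table

      // Write one row, addressed by its row key.
      val put = new Put(Bytes.toBytes("user#42"))
      put.addColumn(Bytes.toBytes("profile"), Bytes.toBytes("name"), Bytes.toBytes("Ada"))
      table.put(put)

      // Random read of the same row by key -- the access pattern HBase is built for.
      val result = table.get(new Get(Bytes.toBytes("user#42")))
      val name = Bytes.toString(result.getValue(Bytes.toBytes("profile"), Bytes.toBytes("name")))
      println(s"name = $name")

      table.close()
    } finally connection.close()
  }
}
```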
Thanks,
Dino
I am trying to decide whether to use Voldemort or CouchDB for an upcoming healthcare project. I want a storage system that has high availability and fault tolerance, and that can scale to the massive amounts of data being thrown at it.
What are the pros/cons of each?
Thanks
Project Voldemort looks nice, but I haven't looked deeply into it so far.
In its current state, CouchDB might not be the right thing for "massive amounts of data". Distributing data between nodes and routing queries accordingly is on the roadmap but not implemented yet. The biggest known production setups of CouchDB use "tables" ("databases" in Couch-speak) of about 200 GB.
HA is not natively supported by CouchDB but can be built easily: all CouchDB nodes replicate the databases between each other in a multi-master setup. We put two Varnish proxies in front of the CouchDB machines, and the Varnish boxes are made redundant with CARP. CouchDB's "built of the Web" design makes such things very easy.
The most pressing issue in our setup is that there are still problems with replicating large (multi-MB) attachments to CouchDB documents.
I suggest you also consider the traditional RDBMS route. Finding talent outside the RDBMS world is a real problem, and there are very capable offerings available from Oracle & Co.
Without knowing more from your question, I would nevertheless say that Project Voldemort, or distributed non-relational stores like CouchDB in general, are a solution to your HA problem.
These stores are very good for high availability, but they are harder to write code for than traditional relational databases (RDBMSs) when it comes to consistency.
They are quite good at storing document-type information, which may fit nicely with your healthcare project, but they can make working with the data harder during development.
The biggest limitation of most such stores is that they are not transactionally safe (see Scalaris for a transactionally safe store), and you need to ensure data consistency yourself (most use read-time consistency, merging conflicting data). RDBMSs are much easier to use when you need data consistency (ACID).
Joining data is much harder too. In an RDBMS you can easily query across several tables, whereas in CouchDB you need to write map/reduce views to aggregate data. For other stores, Hadoop may be a good choice for aggregating information.
Read about BASE and the CAP theorem on consistency vs. availability.
See
http://www.metabrew.com/article/anti-rdbms-a-list-of-distributed-key-value-stores/
http://queue.acm.org/detail.cfm?id=1394128
Is MemcacheDB an option? I've heard that's how Digg handled its HA issues.