Elasticsearch vs RDBMSs for Aggregations/Reporting Data

Does anyone have experience switching between Elasticsearch and a relational DB like MySQL/Postgres? What are the pros/cons of both?
Background: I'm looking to build a dashboard UI that shows store/item-related metrics, and I need the right backend tool, one that gives flexibility in queries (imagine the UI has selectors for date ranges and then shows top items sold, total sales, etc. in different time-based charts). One other note: we would only be using aggregations/nested aggregations around stores or items (we wouldn't be taking advantage of text search).
I know you could use either, but which one is preferable in terms of:
performance? I imagine that they would be largely similar
durability? I imagine Elasticsearch, since it automatically replicates data
maintenance? I imagine Elasticsearch would be worse (maintaining a cluster vs maintaining a single node)
cost? I imagine an Elasticsearch cluster storing the same amount of data would cost more because of replication
development work? I imagine Elasticsearch would make development take longer, given its custom query DSL vs writing APIs around SQL queries
Are these assumptions correct?
Are there other dbs/data stores that I should consider over these 2 options?
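For concreteness, here is a rough sketch of what one of those dashboard queries might look like on the SQL side; the table, column, and parameter names are made up purely for illustration.

import datetime
import psycopg2  # assuming Postgres; any DB-API driver would look similar

conn = psycopg2.connect("dbname=shop")
store_id = 42
date_from = datetime.date(2023, 1, 1)
date_to = datetime.date(2023, 2, 1)

with conn, conn.cursor() as cur:
    # Top items sold and total sales for one store over a date range.
    cur.execute(
        """
        SELECT item_id,
               SUM(quantity) AS units_sold,
               SUM(amount)   AS total_sales
        FROM sales
        WHERE store_id = %s
          AND sold_at >= %s AND sold_at < %s
        GROUP BY item_id
        ORDER BY units_sold DESC
        LIMIT 10
        """,
        (store_id, date_from, date_to),
    )
    top_items = cur.fetchall()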

Based on my experience, Elasticsearch is a superb tool for:
Search
Real-time data aggregation
Real-time reporting with extensive filtering support
We are also using Elasticsearch to power our real-time reports, which have extensive filter options (date range, status, etc.).
We compared the aggregation performance of Elasticsearch and MongoDB on a similar set of machines: aggregating 5 million records took MongoDB around 12 seconds, while Elasticsearch came in under 1 second.
performance? I imagine that they would be largely similar
If you have a pure aggregation use case over loads of data requiring extensive filtering, searching, etc., then the performance of ES would be unmatched.
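As a hedged sketch of what those aggregations look like (the index and field names are assumptions, not from the question), a filtered date-histogram plus top-items query with the Python client would be roughly:

from elasticsearch import Elasticsearch

es = Elasticsearch("http://localhost:9200")  # assumed local cluster

# Total sales per day plus top items sold, for one store over a date range.
body = {
    "size": 0,  # aggregations only, no raw hits
    "query": {
        "bool": {
            "filter": [
                {"term": {"store_id": 42}},
                {"range": {"sold_at": {"gte": "2023-01-01", "lt": "2023-02-01"}}},
            ]
        }
    },
    "aggs": {
        "sales_over_time": {
            "date_histogram": {"field": "sold_at", "calendar_interval": "day"},
            "aggs": {"total_sales": {"sum": {"field": "amount"}}},
        },
        "top_items": {
            "terms": {"field": "item_id", "size": 10, "order": {"units_sold": "desc"}},
            "aggs": {"units_sold": {"sum": {"field": "quantity"}}},
        },
    },
}

resp = es.search(index="sales", body=body)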
durability? I imagine Elasticsearch, since it automatically replicates data
Yes, ES does have inherent replication support, as it is a distributed system.
maintenance? I imagine Elasticsearch would be worse (maintaining a cluster vs maintaining a single node)
Distributed systems definitely demand more maintenance, but you can use a hosted version of ES (e.g. Elastic Cloud or Amazon's managed Elasticsearch/OpenSearch service) as well.
cost? I imagine an Elasticsearch cluster storing the same amount of data would cost more because of replication
Considering that a cluster with replication support is required, the infrastructure cost will be larger.
development work? I imagine Elasticsearch would make development take longer, given its custom query DSL vs writing APIs around SQL queries
It depends on your experience with ES. Since MySQL has been around for a long time, most developers are already skilled with it; any new technology has its learning curve.
Keep in mind:
ES is not an ACID-compliant datastore.
There is no transaction support. If your system is transactional, you may need a relational DB as the read/write store and ES for powering the aggregation use cases.
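A minimal sketch of that split, assuming Postgres as the transactional source of truth and an ES index kept in sync for the reporting/aggregation side (all names are made up; in practice you might sync via a queue or change-data-capture instead of dual-writing):

import psycopg2
from elasticsearch import Elasticsearch

pg = psycopg2.connect("dbname=shop")
es = Elasticsearch("http://localhost:9200")

def record_sale(sale):
    # The transactional write goes to the relational store, which stays the source of truth.
    with pg, pg.cursor() as cur:
        cur.execute(
            "INSERT INTO sales (store_id, item_id, quantity, amount, sold_at) "
            "VALUES (%s, %s, %s, %s, %s) RETURNING id",
            (sale["store_id"], sale["item_id"], sale["quantity"],
             sale["amount"], sale["sold_at"]),
        )
        sale_id = cur.fetchone()[0]
    # Best-effort copy into ES for aggregations; a failure here can be repaired
    # later by re-indexing from Postgres.
    es.index(index="sales", id=sale_id, document=sale)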

Related

Recommended Setup for BigData Application

I am currently working on a long term project that will need to support:
Lots of fast Read/Write operations via RESTful Services
An Analytics Engine continually reading and making sense of data
It is vital that the performance of the Analytics Engine not be affected by the volume of Reads/Writes coming from the API calls.
Because of that, I'm thinking that I may have to use a "front-end" database and some sort of "back-end" data warehouse. I would also need to have something like Elasticsearch or Solr indexing the data stored in the data warehouse.
The Questions:
Is this a Recommended Setup? What would the alternative be?
If so...
I'm considering either Hive or Pig for the data warehousing, and Elasticsearch or Solr as a search engine. Which combination is known to work better together?
And finally...
I'm seriously considering Cassandra as the "front-end" database. What is the relation between Cassandra and Hadoop, and when/why should they be put to work together instead of having just Cassandra?
Please note, my intention is NOT to start a debate about which of these is better, but to understand how they can be put to work together more efficiently. If it makes any difference, the main code is being written in Scala and Java.
I truly appreciate your help. I'm basically learning as I go and all comments will be very helpful.
Thank you.
First, let's talk about Cassandra.
This is a NoSQL database with eventual consistency, which basically means that different nodes in a Cassandra cluster may have different 'snapshots' of the data if there is an inter-cluster communication/availability problem. The data will eventually become consistent, however.
Since you are considering it as a 'front-end' database, what you need to understand is how you will model your data. Cassandra can take advantage of indexes; however, you still need to define your access patterns upfront.
Normally there is no relation between Cassandra and Hadoop (except that both are written in Java); however, the DataStax distribution (enterprise version) has Hadoop support directly from Cassandra.
As a general workflow, you would read/write the most current data (say, the last 24 hours) from your 'small' database, which has enough performance for that (Cassandra has excellent support for it), and move anything older than X (older than 24 hours) to 'long-term storage' such as Hadoop, where you can run all sorts of MapReduce jobs, etc.
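A hedged sketch of that query-first modeling with the Python driver (keyspace, table, and bucketing scheme are all assumptions): the partition key encodes the access pattern, so recent 'hot' data stays cheap to read, and whole day-partitions older than X can be exported to long-term storage.

import datetime
from cassandra.cluster import Cluster

session = Cluster(["127.0.0.1"]).connect("analytics")  # assumed keyspace

# Model around the read path: "give me the latest events for an entity".
# Partitioning by (entity_id, day) keeps partitions bounded and makes whole
# days easy to archive off to Hadoop/HDFS once they fall outside the window.
session.execute("""
    CREATE TABLE IF NOT EXISTS events_by_day (
        entity_id text,
        day date,
        ts timestamp,
        payload text,
        PRIMARY KEY ((entity_id, day), ts)
    ) WITH CLUSTERING ORDER BY (ts DESC)
""")

# Reads hit exactly one partition -- the access pattern defined upfront.
rows = session.execute(
    "SELECT ts, payload FROM events_by_day WHERE entity_id = %s AND day = %s",
    ("sensor-17", datetime.date(2023, 1, 15)),
)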
With regard to text search, it really depends what you need - Elasticsearch is more or less in competition with Solr and vice versa. You can see for yourself how they compare here: http://solr-vs-elasticsearch.com/
As for your third question,
I think of Cassandra as more of a database for storing the data.
Hadoop provides a computation model that lets you analyze the large volume of data held in Cassandra.
So it is very helpful to combine Cassandra with Hadoop.
There are other combinations you can consider as well, such as MongoDB with Hadoop, since MongoDB has the mongo-connector to bridge Hadoop and its data.
And if you have search requirements, you can also use Solr, generating the index directly from MongoDB.

Indefinite Search Cluster (Solr vs ES vs Datastax EE)

PREFACE:
This question is not asking for an open-ended comparison of Elasticsearch vs. Solr vs. DataStax Solr (DataStax EE). (Though links to such comparisons in the comments section are welcome.)
PROJECT:
I have been building a domain-name-type web service for a while. In doing so, I am realizing the exponential growth of such a service.
BACKGROUND:
I would like to know which specific search platform allows me to save and expand indefinitely. Yes, I realize you can split a Solr shard these days, so if I have a 20-shard SolrCloud I can later split it into 40 (I think? Again, that's not indefinite). I'm not sure about the Elasticsearch side of things. DataStax (EE) seems to be the answer because of Cassandra's architecture, but (A) since they give no transparency on license price, and I have to disclose my earnings to them, I'm quickly reminded of Oracle's bleed-you-slowly fee strategy, and as a start-up that is a huge deterrent. Also, (B) when they say they integrate full MapReduce with Hive, Sqoop, Mahout, Solr, and Pig, I'm thinking I don't want to spend a lifetime learning bells and whistles that aren't applicable to my project. I want a search platform to which I can add 2 billion documents a month (or whatever number) indefinitely and not have to worry that I started the cluster with too few shards upfront.
QUESTION:
Admittedly my background section is riddled with ignorance that I would like to correct. My intention is not to offend or dilute these amazing technologies. I am simply wondering which of them can scale without my having to worry about outgrowing shards [I took out the word "forever" here -- thank you per comment below]. Or can any? Not hardware-wise, but shards. Which platform can I use and not have to worry about future growth, whether it's 20TB or 2PB? Assume the hardware budget for servers, switches, etc. is indefinite.
DataStax Enterprise (DSE) is not a "search platform" per se. One of the features DSE provides is the ability to search data stored in Cassandra. Cassandra is being used to store and access enterprise operational data. The idea is that once you have decided that Cassandra is your preferred data store for your enterprise operational data, the DSE/Solr integration then allows you to perform rich search on that data.
Large enterprises are looking to migrate off of traditional relational databases to more modern platforms such as NoSQL databases like Cassandra, where scalability and distributed computing (including multi-data-center support, tunable consistency, and robust operations tools such as the OpsCenter GUI dashboard) are the norm. The Solr integration of DSE facilitates that migration.
With regards to your revenue, that link points to a startup program. That makes the software 100% free if you qualify.

What is the difference between Membase and Couchbase?

With the two merging under the same roof recently, it has become difficult to determine what the major differences between Membase and Couchbase are. Why would one be used over the other?
I want to elaborate on the answer given by James.
At the moment Couchbase server is CouchDB with GeoCouch integration out of the box. What is great about CouchDB is that you have the ability to create structured documents and do map-reduce queries on those documents.
Membase Server is memcached with persistence and a very simple cluster management interface. Its strengths are the ability to serve very low-latency queries as well as the ability to easily add and remove servers from a cluster.
Late this summer, however, Membase and CouchDB will be merged together to form the next version of Couchbase. So what will the new version of Couchbase look like?
Right now in Membase the persistence layer for memcached is implemented with SQLite. After the merger of these two products CouchDB will be the new persistence layer. This means that you will get the low latency requests and great cluster management that was provided by Membase and you will also get the great document oriented model that CouchDB is known for.
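For illustration, this is roughly what that document-oriented, map-reduce model looks like against CouchDB's HTTP API (the database name, view, and fields are made up):

import requests

db = "http://localhost:5984/orders"  # assumed local CouchDB database

# A design document with a map/reduce view: count orders per status.
requests.put(f"{db}/_design/stats", json={
    "views": {
        "by_status": {
            "map": "function (doc) { if (doc.status) { emit(doc.status, 1); } }",
            "reduce": "_count",
        }
    }
})

# Structured documents go in as plain JSON...
requests.post(db, json={"type": "order", "status": "shipped", "total": 42.5})

# ...and the view gives a grouped map-reduce query over them.
counts = requests.get(
    f"{db}/_design/stats/_view/by_status", params={"group": "true"}
).json()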
From the Couchbase Product Comparison Table:
Couchbase Server is a fit if:
A single-server solution is enough to support your users and data
Advanced querying and indexing is important
You demand peer-to-peer sync
Membase Server is a fit if:
You have large number of users
Multiple servers are necessary to support growing user population and data set
Low latency, high throughput are needed for snappy interactive experience

Database selection for a web-scale analytics application

I want to build a web-application similar to Google-Analytics, in which I collect statistics on my customers' end-users, and show my customers analysis based on that data.
Characteristics:
High scalability, handle very large volume
Compartmentalized - Queries always run on a single customer's data
Support analytical queries (drill-down, slices, etc.)
Due to the analytical needs, I'm considering using an OLAP/BI suite, but I'm not sure it's meant for this scale. Would a NoSQL database, or even a simple RDBMS, do?
This is what I am using at work in a production environment, and it works like a charm.
I coupled three things:
PostgreSQL + LucidDB + Mondrian (more generally, the whole Pentaho BI suite of components)
PostgreSQL: I am not going to describe PostgreSQL; it's a really strong open-source RDBMS that will certainly let you do everything you need. I use it to store my operational data.
LucidDB: LucidDB is an open-source column-store database. It is highly scalable and provides a real gain in processing time compared to PostgreSQL when retrieving large amounts of data. It is not optimized for transaction processing but for intensive reads. This is my data warehouse database.
Mondrian: Mondrian is an open-source ROLAP cube engine. LucidDB makes it easy to connect these two programs together.
I would recommend that you look at the whole Pentaho BI Suite; it's worth it, and you might want to use some of its other components.
Hope I could help.
There are two main architectures you could opt for at true web scale:
1. "BI" architecture
Event journaller (e.g. LWES Journaller) or immutable event store (e.g. HDFS) feeds
Analytics/column-store database (e.g. Greenplum, InfiniDB, LucidDB, Infobright) feeds
Business intelligence reporting tool (e.g. Microstrategy, Pentaho Business Analytics)
2. "NoSQL" architecture
(Optional) Event journaller or immutable event store feeds
NoSQL database (e.g. Cassandra, Riak, HBase) feeds
A custom analytics UI (e.g. using D3.js)
The immutable event store or journaller is there because in most cases you want to be batching your analytics events and doing bulk updates to your database (even with something like HDFS) - rather than doing an atomic write for every single page view etc.
For SnowPlow, our open-source analytics platform built on Hadoop and Hive, the event logs are all collected on S3 first before being batch loaded into Hive.
Note that the "NoSQL architecture" will involve a fair bit more development work. Remember that with either architecture, you can always shard by customer if the volumes grow truly epic (billions of rows per customer) - because there's no need (I'm guessing) for cross-customer analytics.
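To make the batching point concrete, here is a rough sketch (event shape and output path are made up) of buffering events per customer and flushing them in bulk, rather than issuing an atomic write per page view:

import json
import os
import uuid
from collections import defaultdict

class EventBuffer:
    # Accumulate analytics events in memory and flush them in bulk, keyed by
    # customer so each batch lands in that customer's partition/shard.
    def __init__(self, flush_size=10_000, out_dir="events"):
        self.flush_size = flush_size
        self.out_dir = out_dir
        self.buffers = defaultdict(list)

    def add(self, customer_id, event):
        buf = self.buffers[customer_id]
        buf.append(event)
        if len(buf) >= self.flush_size:
            self.flush(customer_id)

    def flush(self, customer_id):
        events = self.buffers.pop(customer_id, [])
        if not events:
            return
        # One bulk write per batch (here a local file; in practice an S3 object,
        # an HDFS file, or a bulk insert into the database).
        path = os.path.join(self.out_dir, f"customer={customer_id}")
        os.makedirs(path, exist_ok=True)
        with open(os.path.join(path, f"{uuid.uuid4()}.json"), "w") as f:
            for e in events:
                f.write(json.dumps(e) + "\n")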
I'd say that having OLAP analysis in place is always nice, and it has great potential for sophisticated data analysis using MDX.
What do you mean by large volume?
Where is your customers' user information stored?
What kind of front-end and reporting are you going to use?
Cheers.
Disclaimer: I'm going to plug my own solution here. Have a look at www.icCube.com and contact me for more details.

Enterprise grade databases that can handle large RDF datasets?

Are there any enterprise-grade database engines (Oracle, MS SQL, etc.) that can handle large RDF datasets (320 million triples) and SPARQL queries? I guess my question is also: is SPARQL/RDF/OWL ready for serving large real-world data warehouses for an enterprise? If not, are there efficient mechanisms for adapting SPARQL/RDF to a typical data warehouse star schema?
Thanks!
Virtuoso is the datastore used by Bio2RDF and DBpedia.
Following on from Kaarel's suggestion, one of the entries presented at ISWC this year used 4store, which does scale that far, though the competitor set it up in some weird configuration that the CTO of Garlik (who develop 4store) described to me and colleagues as 'crazy'. Still, 4store would be capable of that scale - http://4store.org
Virtuoso also supports stores at this scale; they have a live application that you can use to run SPARQL queries over the majority of the major LOD (Linked Open Data) sources, which total around 9 billion triples.
Virtuoso - http://virtuoso.openlinksw.com
LOD Application - http://lod.openlinksw.com/sparql
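For example, a small sketch of querying that public endpoint with SPARQLWrapper (the endpoint URL is the one above; the query itself is just a trivial illustration):

from SPARQLWrapper import SPARQLWrapper, JSON

sparql = SPARQLWrapper("http://lod.openlinksw.com/sparql")
sparql.setQuery("""
    SELECT ?s ?p ?o
    WHERE { ?s ?p ?o }
    LIMIT 10
""")
sparql.setReturnFormat(JSON)

results = sparql.query().convert()
for row in results["results"]["bindings"]:
    print(row["s"]["value"], row["p"]["value"], row["o"]["value"])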
I maintain this list of large triplestores on the W3C wiki:
http://esw.w3.org/topic/LargeTripleStores
There are seven triplestores that are known to be able to hold over a billion triples. Four of them are open source. Please update the above-mentioned wiki page if you have more information.
Obviously, performance depends on what you use it for. I used Virtuoso in a large-scale industrial project, and it is quite fast.
Neo4j handles around 1+ billion triples out of the box via its SAIL API, while you still have the whole graph available for advanced operations with things like Gremlin or SPARQL.
Disclaimer: I am part of the Neo4j team.
Intellidimension provides a solution called Semantic Server that is developed on top of Microsoft's SQL Server 2005 or 2008. It easily scales to the hundreds of millions of triples and I know they have at least one customer happily running an enterprise deployment with over a billion statements.
I am one of their customers working with datasets > 100 million. Our plans are to move towards the 10s of billions of statements.
4store looks to be a good solution; however, the documentation is pretty sparse at this time, and when I last looked at it there was no ability to delete an individual triple from the graph.
I would also take a look at Bigdata.
Here is a quote from their main page summarizing their offering.
Bigdata(R) is an open-source scale-out storage and computing fabric supporting optional transactions, very high concurrency, and very high aggregate IO rates. Bigdata was designed from the ground up as a distributed database architecture optimized for very high aggregate IO rates running over clusters of 100s to 1000s of machines, but can also run in a single-server mode. Bigdata offers a distributed file system, similar to the Google File System but also useful for workflow queues, a data extensible sparse row store, similar to Googles widely recognized bigtable project, and map/reduce processing for parallelizing data intensive workflows over a cluster.
Bigdata(R) comes packaged with a very high-performance RDF store supporting RDF(S) and OWL Lite inference. The Bigdata RDF Store is currently the only RDF database capable of operating distributed on a cluster with dynamic key-range partitioning of indices. The Bigdata RDF Store was designed specifically to meet requirements for very large scale semantic alignment and federation. RDF is a Semantic Web technology particularly well-suited to modeling graph-shaped data and metadata, such as an associative entity-link model, whereby actors are linked to one another in an ad-hoc fashion within the context of an evolving ontology of concepts for entity types and link types related to a particular problem domain. The Bigdata RDF Store is used operationally in data harvesting systems to create mash-ups of structured, semi-structured, and unstructured data from myriad sources in a schema-flexible manner.
