With the two merging under the same roof recently, it has become difficult to determine what the major differences between Membase and Couchbase are. Why would one be used over the other?
I want to elaborate on the answer given by James.
At the moment, Couchbase Server is CouchDB with GeoCouch integration out of the box. What is great about CouchDB is that you have the ability to create structured documents and do map-reduce queries on those documents.
Membase Server is memcached with persistence and a very simple cluster-management interface. Its strengths are the ability to do very low-latency queries as well as the ability to easily add and remove servers from a cluster.
Late this summer, however, Membase and CouchDB will be merged to form the next version of Couchbase. So what will the new version of Couchbase look like?
Right now the persistence layer for memcached in Membase is implemented with SQLite. After the merger of these two products, CouchDB will be the new persistence layer. This means that you will get the low-latency requests and great cluster management that Membase provided, and you will also get the great document-oriented model that CouchDB is known for.
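Since Membase speaks the standard memcached wire protocol, existing memcached client code works against it unchanged. Here is a minimal sketch (not from the original answer) using the python-memcached library; the host/port and key/value names are made-up assumptions:

```python
# Talking to a Membase/Couchbase node over the plain memcached protocol.
# Host, port, and key names below are illustrative assumptions.
import memcache

mc = memcache.Client(["127.0.0.1:11211"])  # default memcached port

# Writes use memcached semantics; Membase adds persistence behind them.
mc.set("session:42", {"user": "alice", "cart": ["sku-1", "sku-2"]})

# Reads stay low-latency because hot items are served from memory.
print(mc.get("session:42"))
```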
From the Couchbase Product Comparison Table:
Couchbase Server is a fit if:
A single-server solution is enough to support your users and data
Advanced querying and indexing is important
You demand peer-to-peer sync
Membase Server is a fit if:
You have a large number of users
Multiple servers are necessary to support growing user population and data set
Low latency, high throughput are needed for snappy interactive experience
Has anyone had experience switching between Elasticsearch and a relational DB like MySQL/Postgres? What are the pros/cons of both?
Background: I'm looking to build a dashboard UI to show store/item-related metrics and need the right tool on the backend to provide flexibility in queries (imagine that the UI has selectors for date ranges and then shows top items sold, total sales, etc. in different time-based charts). Another note is that we would only be using aggregations/nested aggregations around stores or items and wouldn't be taking advantage of text search.
I know you could use both, but which one is preferable in terms of:
performance? I imagine that they would be largely similar
durability? I imagine Elasticsearch is better since it automatically replicates data
maintenance? I imagine Elasticsearch would be worse (maintaining a cluster vs maintaining a single node)
cost? I imagine an Elasticsearch cluster storing the same amount of data would cost more because of replication
development work? I imagine Elasticsearch would cause development to take longer, using Elasticsearch's custom queries vs writing APIs around SQL queries
Are these assumptions correct?
Are there other dbs/data stores that I should consider over these 2 options?
Based on my experience, Elasticsearch is a superb tool for:
Search
Real-time data aggregation
Real-time reporting with extensive filtering support
We also use Elasticsearch to power our real-time reports, which have extensive filter options (date range, status, etc.).
We compared the aggregation performance of ES and MongoDB on a similar set of machines: for aggregating 5 million records, MongoDB took around 12 seconds while ES took under 1 second.
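To make the "extensive filtering plus aggregation" point concrete, here is a rough sketch of the kind of dashboard query discussed in the question: filter documents by a date range, then bucket by item and sum a value. The index name (sales) and field names (sale_date, item_id, amount) are made up for illustration:

```python
# Date-range filter plus a terms aggregation with a sum sub-aggregation,
# sent to Elasticsearch over its REST API. Index and field names are
# illustrative assumptions.
import requests

query = {
    "size": 0,  # only aggregation buckets are needed, not raw hits
    "query": {
        "bool": {
            "filter": [
                {"range": {"sale_date": {"gte": "2016-01-01", "lt": "2016-02-01"}}}
            ]
        }
    },
    "aggs": {
        "top_items": {
            "terms": {"field": "item_id", "size": 10},
            "aggs": {"total_sales": {"sum": {"field": "amount"}}},
        }
    },
}

resp = requests.post("http://localhost:9200/sales/_search", json=query)
for bucket in resp.json()["aggregations"]["top_items"]["buckets"]:
    print(bucket["key"], bucket["total_sales"]["value"])
```

The equivalent SQL would be a GROUP BY with a WHERE clause on the date column; the difference is mostly in how the query is expressed, not in what it can express.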
performance? I imagine that they would be largely similar
If you have a pure aggregation use case on loads of data requiring extensive filtering, searching, etc., then the performance of ES would be unmatched.
durability? I imagine Elasticsearch is better since it automatically replicates data
Yes, ES does have inherent replication support, as it is a distributed system.
maintenance? I imagine Elasticsearch would be worse (maintaining a cluster vs maintaining a single node)
Definitely, distributed systems demand more maintenance, but you can use a hosted version of ES (e.g. Amazon Elasticsearch Service) as well.
cost? I imagine an Elasticsearch cluster storing the same amount of data would cost more because of replication
Considering that a cluster with replication support is required, the infrastructure cost will be larger.
development work? I imagine Elasticsearch would cause development to take longer, using Elasticsearch's custom queries vs writing APIs around SQL queries
It depends on your experience with ES. Since MySQL has been around for a long time, most developers are skilled with it. Any new technology has its learning curve.
Keep in mind:
ES is not an ACID-compliant datastore.
There is no transaction support. If your system is purely transactional, then you may need a relational DB as the read/write store and ES for powering the aggregation use cases.
PREFACE:
This question is not asking for an open-ended comparison of Elasticsearch vs. Solr vs. DataStax Solr (DataStax EE). (Though links in the comments section for this are welcome.)
PROJECT:
I have been building a domain-name-type web service for a while. In doing so, I am realizing the exponential growth of such a service.
BACKGROUND:
I would like to know which specific search platform allows me to store and expand indefinitely. Yes, I realize you can split a Solr shard these days, so if I have a 20-shard SolrCloud I can later split it into 40 (I think? Again, that's not indefinite). I'm not sure on the Elasticsearch side of things. DataStax (EE) seems to be the answer because of Cassandra's architecture, but (A) since they give no transparency on license price, and I have to disclose my earnings to them, I'm quickly reminded of Oracle's bleed-you-slowly fee strategy, and as a start-up that is a huge deterrent. Also, (B) when they say they integrate full MapReduce with Hive, Sqoop, Mahout, Solr, and Pig, I'm thinking I don't want to spend a lifetime learning bells and whistles that aren't applicable to my project. I want a search platform that I can add 2 billion documents a month (or whatever number) to indefinitely and not have to worry that I started a cluster with too few shards upfront.
QUESTION:
Admittedly, my background section is riddled with ignorance that I would like to correct. My intention is not to offend or dilute these amazing technologies. I am simply wondering which of them can scale without having to worry about outgrowing shards [I took out the word forever here -- thank you per comment below]. Or can any? Not hardware-wise, but shards. Which platform can I use and not have to worry about future growth, whether it's 20 TB or 2 PB? Assume the hardware budget for servers, switches, etc. is indefinite.
DataStax Enterprise (DSE) is not a "search platform" per se. One of the features DSE provides is the ability to search data stored in Cassandra. Cassandra is being used to store and access enterprise operational data. The idea is that once you have decided that Cassandra is your preferred data store for your enterprise operational data, the DSE/Solr integration then allows you to perform rich search on that data.
Large enterprises are looking to migrate off of traditional relational databases to more modern platforms such as NoSQL databases like Cassandra, where scalability and distributed computing (multi-data-center support, tunable consistency, and robust operations tools such as the OpsCenter GUI dashboard) are the norm. The Solr integration in DSE facilitates that migration.
With regard to your revenue, that link points to a startup program, which makes the software 100% free if you qualify.
Note: I have investigated CouchDB for some time and need some actual experiences.
I have an Oracle database for a fleet-tracking service, and some stats are:
100 GB db
Huge insertions/sec (our received messages)
Reliable replication (via Oracle streams on 4 servers)
Heavy complex queries.
Now the question: Can CouchDB be used in this case?
Note: Why I thought of CouchDB?
I have read about its ability to scale horizontally very well. That's very important in our case.
Since it's schema-free, we can handle changes more easily, since we have a lot of changes in different tables and stored procedures.
Thanks
Edit I:
I need transactions too, but I can tolerate other solutions. And if there is a little delay in replication, that would be no problem IF it is guaranteed.
You are enjoying the following features with your database:
Using it in production
The data is naturally relational (related to itself)
Huge insertion rate (no MVCC concerns)
Complex queries
Transactions
These are all reasons not to switch to CouchDB.
Of course, the story is not so simple. I think you have discovered what many people never learn: complex problems require complex solutions. We cannot simply replace our database and take the rest of the month off. Sure, CouchDB (and BigCouch) supports excellent horizontal scaling (and cross-datacenter replication too!) but the cost will be rewriting a production application. That is not right.
So, where can CouchDB benefit you?
I suggest that you begin augmenting your application with CouchDB applications. Deploy CouchDB, import your data into it, and build non-mission-critical applications. See where it fits best.
For your project, these are the key CouchDB strengths:
It is a small, simple tool—easy for you to set up on a workstation or server
It is a web server. It integrates very well with your infrastructure and security policies.
For example, if you have a flexible policy, just set it up on your LAN
If you have a strict network and firewall policy, you can set it up behind a VPN, or with your SSL certificates
With that step done, it is very easy to access. Just make HTTP or HTTPS requests. Whether you are importing data from Oracle with a custom tool or using your web browser, it's all the same.
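For example, a one-off import tool can be nothing more than a script issuing PUT/POST requests. A small sketch with Python's requests library, assuming a local CouchDB on the default port; the database name and document fields are invented for the fleet-tracking scenario:

```python
# "It's all just HTTP": create a database and load documents into CouchDB.
# Database name and document fields are illustrative assumptions.
import requests

BASE = "http://localhost:5984"

# Create the database (CouchDB answers 412 if it already exists).
requests.put(f"{BASE}/fleet_positions")

# Insert a single document with a chosen _id ...
requests.put(
    f"{BASE}/fleet_positions/truck-17-2011-06-01T12:00:00Z",
    json={"truck_id": "truck-17", "lat": 35.7, "lon": 51.4, "speed_kmh": 62},
)

# ... or import a batch in one round trip via _bulk_docs.
rows = [{"truck_id": f"truck-{i}", "lat": 35.7, "lon": 51.4} for i in range(100)]
requests.post(f"{BASE}/fleet_positions/_bulk_docs", json={"docs": rows})
```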
Yes, CouchDB is an app server too! It has a built-in administrative app to explore data, change the config, etc. (like a built-in phpMyAdmin). But for you, the value will be in building admin applications and reports as simple, traditional HTML/JavaScript/CSS applications. You can get as fancy or as simple as you like.
As your project grows and becomes valuable, you are in a great position to grow, using replication
Either expand the core with larger CouchDB clusters
Or, replicate your data and applications into different data centers, or onto individual workstations, or mobile phones, etc. (The strategy will be more obvious when the time comes.)
CouchDB gives you a simple web server and web site. It gives you a built-in web services API to your data. It makes it easy to build web apps. Therefore, CouchDB seems ideal for extending your core application, not replacing it.
I don't agree with this answer.
I think CouchDB suits the fleet-tracking use case especially well, due to its distributed nature. Moreover, the unreliable nature of the GPRS connections used for transmitting position data makes the offline-first paradigm of CouchApps a perfect partner for your application.
For uploading data from trucks, the insertion rate can benefit hugely from CouchDB replication and bulk inserts, especially if performed on SSD-based CouchDB hosting.
For downloading data to trucks, CouchDB provides filtered replication, allowing each truck to download only the data it really needs instead of the whole database.
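To sketch what that looks like in practice: a JavaScript filter function lives in a design document on the central database, and each truck's local CouchDB pulls only the documents that pass it. The design-document, filter, field, and host names below are assumptions for illustration:

```python
# Filtered replication sketch: each truck pulls only its own documents.
# Design-doc, filter, field, and host names are illustrative assumptions.
import requests

SERVER = "http://central-server:5984"

# 1) Publish the filter function on the central database.
requests.put(
    f"{SERVER}/fleet/_design/sync",
    json={
        "filters": {
            "by_truck": (
                "function(doc, req) {"
                " return doc.truck_id === req.query.truck_id; }"
            )
        }
    },
)

# 2) On the truck, ask its local CouchDB to continuously pull its own data.
requests.post(
    "http://localhost:5984/_replicate",
    json={
        "source": f"{SERVER}/fleet",
        "target": "fleet_local",
        "create_target": True,
        "filter": "sync/by_truck",
        "query_params": {"truck_id": "truck-17"},
        "continuous": True,
    },
)
```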
Regarding complex queries, NoSQL databases are more flexible and can perform much faster than relational databases. It's only a matter of structuring and querying your data reasonably.
Are there any enterprise-grade database engines (Oracle, MS SQL, etc.) that can handle large RDF datasets (320 million triples) and SPARQL queries? I guess my question is also: is SPARQL/RDF/OWL ready to serve large real-world data warehouses for an enterprise? If not, are there efficient mechanisms for adapting SPARQL/RDF to a typical data warehouse star schema?
Thanks!
Virtuoso is the datastore used by Bio2RDF and DBpedia.
Following on from Kaarel's suggestion, one of the entries presented at ISWC this year used 4store, which does scale that far, though the competitor set it up in some weird configuration that the CTO of Garlik (who develop 4store) described to me and colleagues as 'crazy'. Still, 4store would be capable of that scale - http://4store.org
Also, Virtuoso supports stores at this scale; they have a live application that you can use to run SPARQL queries over the majority of the major LOD (Linked Open Data) data sources, which total around 9 billion triples (see the query sketch after the links below):
Virtuoso - http://virtuoso.openlinksw.com
LOD Application - http://lod.openlinksw.com/sparql
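The LOD endpoint above speaks the standard SPARQL protocol, so any HTTP client can query it. A tiny sketch follows (the query itself is just a placeholder, and the public endpoint may be rate-limited or have moved since this was written):

```python
# Minimal SPARQL-over-HTTP query against the public Virtuoso LOD endpoint.
import requests

query = """
SELECT ?s ?p ?o
WHERE { ?s ?p ?o }
LIMIT 5
"""

resp = requests.get(
    "http://lod.openlinksw.com/sparql",
    params={"query": query, "format": "application/sparql-results+json"},
)
for row in resp.json()["results"]["bindings"]:
    print(row["s"]["value"], row["p"]["value"], row["o"]["value"])
```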
I maintain this list of large triplestores on the W3C wiki:
http://esw.w3.org/topic/LargeTripleStores
There are seven triplestores that are known to be able to hold over a billion triples. Four of them are open source. Please update the above-mentioned wiki page if you have more information.
Obviously, performance depends on what you use it for. I used Virtuoso in a large-scale industrial project, and it is quite fast.
Neo4j handles around 1+ billion triples out of the box via its SAIL API, while you still have the whole graph available to do advanced things with tools like Gremlin or SPARQL.
Disclaimer: I am part of the Neo4j team.
Intellidimension provides a solution called Semantic Server that is developed on top of Microsoft's SQL Server 2005 or 2008. It easily scales to the hundreds of millions of triples and I know they have at least one customer happily running an enterprise deployment with over a billion statements.
I am one of their customers, working with datasets of more than 100 million statements. Our plan is to move toward tens of billions of statements.
4store looks to be a good solution; however, the documentation is pretty sparse at this time, and when I last looked at it there was no ability to delete an individual triple from the graph.
I would also take a look at Bigdata.
Here is a quote from their main page summarizing their offering.
Bigdata(R) is an open-source scale-out storage and computing fabric supporting optional transactions, very high concurrency, and very high aggregate IO rates. Bigdata was designed from the ground up as a distributed database architecture optimized for very high aggregate IO rates running over clusters of 100s to 1000s of machines, but can also run in a single-server mode. Bigdata offers a distributed file system, similar to the Google File System but also useful for workflow queues, a data extensible sparse row store, similar to Googles widely recognized bigtable project, and map/reduce processing for parallelizing data intensive workflows over a cluster.
Bigdata(R) comes packaged with a very high-performance RDF store supporting RDF(S) and OWL Lite inference. The Bigdata RDF Store is currently the only RDF database capable of operating distributed on a cluster with dynamic key-range partitioning of indices. The Bigdata RDF Store was designed specifically to meet requirements for very large scale semantic alignment and federation. RDF is a Semantic Web technology particularly well-suited to modeling graph-shaped data and metadata, such as an associative entity-link model, whereby actors are linked to one another in an ad-hoc fashion within the context of an evolving ontology of concepts for entity types and link types related to a particular problem domain. The Bigdata RDF Store is used operationally in data harvesting systems to create mash-ups of structured, semi-structured, and unstructured data from myriad sources in a schema-flexible manner.
I am trying to decide whether to use Voldemort or CouchDB for an upcoming healthcare project. I want a storage system that has high availability and fault tolerance and can scale for the massive amounts of data being thrown at it.
What are the pros/cons of each?
Thanks
Project Voldemort looks nice, but I haven't looked deeply into it so far.
In its current state, CouchDB might not be the right thing for "massive amounts of data". Distributing data between nodes and routing queries accordingly is on the roadmap but not implemented so far. The biggest known production setups of CouchDB use "tables" ("databases" in couch-speak) of about 200 GB.
HA is not natively supported by CouchDB but can be built easily: all CouchDB nodes replicate the databases between each other in a multi-master setup. We put two Varnish proxies in front of the CouchDB machines, and the Varnish boxes are made redundant with CARP. CouchDB's "built from the Web" design makes such things very easy.
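A sketch of the multi-master part, in case it helps: each node continuously pulls the database from the other, so writes to either side converge. Host names and the database name are invented for illustration:

```python
# Two-node multi-master replication sketch: A pulls from B and B pulls from A.
# Host names and database name are illustrative assumptions.
import requests

DB = "patients"
NODE_A = "http://couch-a:5984"
NODE_B = "http://couch-b:5984"

def pull(local, remote):
    """Ask `local` to continuously pull DB from `remote`."""
    requests.post(
        f"{local}/_replicate",
        json={"source": f"{remote}/{DB}", "target": DB,
              "create_target": True, "continuous": True},
    )

pull(NODE_A, NODE_B)  # A pulls from B
pull(NODE_B, NODE_A)  # B pulls from A
```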
The most pressing issue in our setup is the fact that there are still issues with the replication of large (multi MB) attachments to CouchDB documents.
I suggest you also check the traditional RDBMS route. There are huge issues with available talent outside the RDBMS approach and there are very capable offerings available from Oracle & Co.
Not knowing enough from your question, I would nevertheless say that Project Voldemort, or distributed hash tables (DHTs) like CouchDB in general, are a solution to your problem of HA.
Those DHTs are very nice for high availability, but, where consistency is concerned, they are harder to write code for than traditional relational databases (RDBMS).
They are quite good for storing document-type information, which may fit nicely with your healthcare project, but they make development around the data harder.
The biggest limitation of most stores is that they are not transactionally safe (see Scalaris for a transactionally safe store) and you need to ensure data consistency yourself (most use read-time consistency, merging conflicting data). An RDBMS is much easier to use for data consistency (ACID).
Joining data is much harder too. In an RDBMS you can easily query data over several tables; in CouchDB you need to write code to aggregate data. For other stores, Hadoop may be a good choice for aggregating information.
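To show what that code looks like in CouchDB's case: aggregation is done with a map/reduce view stored in a design document and queried with group=true. The database, design-document, view, and field names are assumptions for illustration:

```python
# Map/reduce view sketch: count documents per patient_id.
# Database, design-doc, view, and field names are illustrative assumptions.
import requests

BASE = "http://localhost:5984/records"

# Define the view: map each document to (patient_id, 1), reduce with _sum.
requests.put(
    f"{BASE}/_design/stats",
    json={
        "views": {
            "visits_per_patient": {
                "map": "function(doc) { if (doc.patient_id) emit(doc.patient_id, 1); }",
                "reduce": "_sum",
            }
        }
    },
)

# Query the reduced view, grouped by key.
resp = requests.get(
    f"{BASE}/_design/stats/_view/visits_per_patient",
    params={"group": "true"},
)
for row in resp.json()["rows"]:
    print(row["key"], row["value"])
```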
Read about BASE and the CAP theorem on consistency vs. availability.
See
http://www.metabrew.com/article/anti-rdbms-a-list-of-distributed-key-value-stores/
http://queue.acm.org/detail.cfm?id=1394128
Is memcacheDB an option? I've heard that's how Digg handled HA issues.