Indefinite Search Cluster (Solr vs ES vs Datastax EE)

PREFACE:
This question is not asking for an open ended comparison of Elastic Search vs. Solr vs. Datastax Solr (Datastax EE). (Though links in comments section for this are welcome).
PROJECT:
I have been building a domain name type web service for a while. In doing so, I am realizing how exponentially such a service grows.
BACKGROUND:
I would like to know which specific search platform allows me to save and expand indefinitely. Yes, I realize you can split a Solr shard these days, so if I have a 20-shard SolrCloud I can later split them into 40 (I think? Again... that's not indefinite). Not sure on the Elasticsearch side of things. Datastax (EE) seems to be the answer because of Cassandra’s architecture, but (A) since they give no transparency on license price, and I have to disclose my earnings to them, I'm quickly reminded of Oracle's bleed-you-slowly fee strategy, and as a start-up that is a huge deterrent. Also, (B) when they say they integrate full MapReduce with Hive, Sqoop, Mahout, Solr, and Pig, I’m thinking I don’t want to spend a lifetime learning bells and whistles that aren’t applicable to my project. I want a search platform that I can add 2 billion documents a month to (or whatever number) indefinitely and not have to worry that I started a cluster with too few shards upfront.
QUESTION:
Admittedly my background section is riddled with ignorance that I would like to correct. My intention is not to offend or diminish these amazing technologies. I am simply wondering which of them can scale without having to worry about outgrowing shards [I took out the word forever here -- thank you per comment below]. Or can any? Not hardware-wise, but shards. Which platform can I use and not have to worry about future growth, whether it's 20TB or 2PB? Assume the hardware budget for servers, switches, etc. is unlimited.
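A rough back-of-the-envelope sketch of the concern above: how quickly a fixed starting shard count is outgrown if shards can only be doubled by splitting. The 2 billion documents per month and the 20 starting shards come from the question; the per-shard budget of 500 million documents is purely an assumed figure for illustration, not a limit published by Solr, Elasticsearch or DataStax.

```python
def shards_needed(total_docs: int, docs_per_shard: int) -> int:
    """Smallest shard count that keeps every shard at or under its budget."""
    # Integer ceiling division avoids floating-point rounding on large counts.
    return max(1, -(-total_docs // docs_per_shard))


def splits_needed(initial_shards: int, required_shards: int) -> int:
    """How many times every shard must be split in two to reach the target."""
    splits = 0
    shards = initial_shards
    while shards < required_shards:
        shards *= 2
        splits += 1
    return splits


DOCS_PER_MONTH = 2_000_000_000   # figure from the question
INITIAL_SHARDS = 20              # figure from the question
DOCS_PER_SHARD = 500_000_000     # ASSUMED budget, for illustration only

for months in (6, 12, 24):
    required = shards_needed(months * DOCS_PER_MONTH, DOCS_PER_SHARD)
    rounds = splits_needed(INITIAL_SHARDS, required)
    print(f"{months} months: {required} shards needed, {rounds} split round(s)")
```

Under these assumed numbers, the 20 starting shards need one round of splitting within six months and three rounds within two years, which is why the starting shard count alone does not settle the question.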

DataStax Enterprise (DSE) is not a "search platform" per se. One of the features DSE provides is the ability to search data stored in Cassandra. Cassandra is being used to store and access enterprise operational data. The idea is that once you have decided that Cassandra is your preferred data store for your enterprise operational data, the DSE/Solr integration then allows you to perform rich search on that data.
Large enterprises are looking to migrate off traditional relational databases to more modern platforms, namely NoSQL databases such as Cassandra, where scalability and distributed computing (including multi-data center support, tunable consistency, and robust operations tools such as the OpsCenter GUI dashboard) are the norm. The Solr integration of DSE facilitates that migration.

With regard to your revenue, that link points to a startup program, which makes the software 100% free if you qualify.

Related

Elasticsearch vs RDBMSs for Aggregations/Reporting Data

Does anyone have experience switching between Elasticsearch and a relational DB like MySQL/Postgres? What are the pros/cons of both?
Background: looking to build a dashboard UI to show store/item related metrics and need the correct tool on the backend side that provides flexibility in queries (Imagine that the UI has selectors for date ranges and then the UI shows top items sold, total sales, etc.) in different time based charts. Some other notes are that we are just going to be using aggregations/nested aggregations (wouldn't be taking advantage of text search) around stores or items.
I know you could use both but which one is preferable in terms of
performance? I imagine that they would be largely similar
durability? I imagine elasticsearch, since it automatically replicates data
maintenance? I imagine elasticsearch would be worse (maintaining a cluster vs maintaining a single node)
cost? I imagine an elasticsearch cluster storing the same amount of data would cost more because of replication
development work? I imagine elasticsearch would cause development to take longer using elasticsearch's custom queries vs writing APIs around sql queries
Are these assumptions correct?
Are there other dbs/data stores that I should consider over these 2 options?

Based on my experience, Elasticsearch is a superb tool for:
Search
Real-time data aggregation
Real-time reporting with extensive filtering support
We also use Elasticsearch to power our real-time reports, which have extensive filter options (like date range, status, etc.).
We compared the aggregation performance of E.S. and MongoDB on similar sets of machines: for aggregating 5 million records, MongoDB took around 12 seconds while E.S. took under 1 second.
performance? I imagine that they would be largely similar
If you have a pure aggregation use case on loads of data that requires extensive filtering, searching, etc., then the performance of ES would be unmatched.
durability? I imagine elastic search and it automatically replicates
data
Yes, E.S. does have inherent replication support, as it is a distributed system.
maintenance? I imagine elasticsearch would be worse (maintaining a
cluster vs maintaining a single node)
Distributed systems definitely demand more maintenance, but you can also use a hosted version of ES (e.g. Amazon Elasticsearch Service).
cost? I imagine an elasticsearch cluster storing the same amount of
data would cost more because of replication
Since a cluster with replication support is required as well, the infra cost will be larger.
development work? I imagine elasticsearch would cause development to
take longer using elasticsearch's custom queries vs writing APIs
around sql queries
It depends on your experience with E.S. Since MySQL has been around for a long time, most dev folks are skilled with it. Any new technology has its learning curve.
Keep in mind:
E.S. is not an ACID-compliant datastore.
There is no transaction support. If your system is purely transactional, you may need a relational DB as the read/write store and E.S. for powering the aggregation use cases.
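For the dashboard described in the question (a date-range selector driving "top items sold" and "total sales"), a filtered aggregation request body might look roughly like the sketch below. It only builds the JSON body that would be sent to an index's `_search` endpoint; the field names `sold_at`, `item_id` and `amount` are hypothetical and depend entirely on your own mapping.

```python
import json


def top_items_query(start_date: str, end_date: str, top_n: int = 10) -> dict:
    """Build a search body: filter by date range, then rank items by summed sales."""
    return {
        "size": 0,  # return only aggregation results, no individual hits
        "query": {
            "bool": {
                "filter": [
                    # hypothetical date field on each sale document
                    {"range": {"sold_at": {"gte": start_date, "lte": end_date}}}
                ]
            }
        },
        "aggs": {
            "top_items": {
                # bucket by item, ordered by the nested sum below
                "terms": {
                    "field": "item_id",
                    "size": top_n,
                    "order": {"total_sales": "desc"},
                },
                "aggs": {"total_sales": {"sum": {"field": "amount"}}},
            }
        },
    }


body = top_items_query("2024-01-01", "2024-01-31", top_n=5)
print(json.dumps(body, indent=2))
```

The nested `aggs` block is what the question calls a nested aggregation: each item bucket carries its own sum, so the UI gets the ranking and the totals in one round trip.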

Which community edition graph database supports high-available cluster and has good online query performance?

I am currently building a knowledge graph for an e-commerce company, and it mainly consists of the product category hierarchies, properties, and relations among them. Besides the common relational queries, we care about the following points very much:
Master-slave cluster support. This graph database will be used for online search query processing, so high availability is crucial to us. The data volume won't reach millions of nodes, so we don't need a distributed cluster that spreads data across multiple machines. Rather, we need multiple machines that can be read simultaneously, so that the service won't go down even if one of the machines is offline.
Fast online query performance. Reasoning about relations can be done offline, so its performance is not that important. But we need to do a lot of online queries like "find the nodes whose property P equals value V", so we need good performance for online query processing. This database will be read-intensive and won't change very much after its initialization.
Community and documentation. Since our team is new to the field of graph databases, we expect user-friendly documentation for deployment and development, and an active community for solving problems.
Based on the requirements above, I investigated some candidates:
Neo4j. We first tried Neo4j since it's the most popular one in the field. Actually, I liked it, especially the Cypher query language. But we are about to abandon it because the community edition does not support any cluster, and currently, we don't have the budget to pay for the enterprise edition.
OrientDB. OrientDB is probably the second most popular one on the market, and it seems to support clustering in its community edition. I use the word "seems" because it is not clearly stated on its website. Can anyone clear this up? Besides, I found a negative article about OrientDB which makes me hesitate: http://orientdbleaks.blogspot.jp/2015/06/the-orientdb-issues-that-made-us-give-up.html
Titan. Titan is also great, but since its original company has been acquired and its original developers are developing a different product, its future development and maintenance are in doubt.
ArangoDB. This one seems to be very fast, according to the performance report (https://www.arangodb.com/2015/10/benchmark-postgresql-mongodb-arangodb/), but I don't know if its online query processing ability is good enough, and its support for clustering is also unknown to me.
As for documentation and community, I really have no idea since these are the kind of things that you only get to know after you start doing it.
To sum up, based on my requirements, I think OrientDB and ArangoDB may be my candidates, but I don't know which one to choose because of the points I stated above. Or is there perhaps another suitable candidate that I'm missing?
Thanks.

Max working for ArangoDB here. ArangoDB does not only do online queries for graphs, but due to its multi-model nature you can mix graph queries with document queries (using secondary indexes), key lookups and joins. It has a sophisticated query engine with an optimizer that is fully aware of the ArangoDB cluster structure and can optimize and distribute query executions across all instances.
In a cluster, sharding, synchronous replication and self-healing are all fully automatic with configurable parameters. Deployment of an ArangoDB cluster is particularly simple (literally two clicks) on Apache Mesos or DC/OS, but is also relatively straightforward with other orchestration frameworks. ArangoDB on DC/OS additionally allows you to scale up and down via the graphical user interface or REST API calls, and failed tasks are automatically replaced.
As for performance, all our benchmarks show very good results. The just-released version 3.1 even has vertex-centric indexes, which are particularly important for graph queries.
We do our best to provide extensive documentation, which you find at https://www.arangodb.com/documentation/ . We have a user manual, a manual for our query language AQL as well as one for the HTTP/REST API. Furthermore, we have tutorials, frequently asked questions, a "Cookbook" for standard tasks, and we try to answer questions on StackOverflow and github issues in a timely manner.
All of this is included in the Community Edition, which is available with the Apache 2.0 open source license.
If you have more questions, do not hesitate to reach out to our team or to me personally.

OrientDB Community Edition is free open source software, built by a community of developers, and is constantly improving. Features such as horizontal scaling, fault tolerance, clustering, sharding and replication aren’t disabled in OrientDB Community.
For more information about cluster, take a look at the official OrientDB guide: http://orientdb.com/docs/last/Tutorial-Clusters.html
Hope it helps.
Regards

Neo4j Enterprise edition can be used under the AGPL license. I am surprised a lot of people aren't aware of this. If you are using Neo4j Enterprise as a server and communicating with it via REST or the Bolt protocol (Apache licensed), then you don't have to worry about releasing the code of the system connecting to it under AGPL.
If you are using it embedded, then you have to adhere to AGPL, but then why would you need Neo4j Enterprise in that situation?
Remember to clone and compile Neo4j Enterprise from GitHub if you plan on using it under AGPL; don't download the trial.
Neo Technology gives great support, and that is essentially what you are paying for with the enterprise subscription.

Recommended Setup for BigData Application

I am currently working on a long term project that will need to support:
Lots of fast Read/Write operations via RESTful Services
An Analytics Engine continually reading and making sense of data
It is vital that the performance of the Analytics Engine not be affected by the volume of Reads/Writes coming from the API calls.
Because of that, I'm thinking that I may have to use a "front-end" database and some sort of "back-end" data warehouse. I would also need to have something like Elastic Search or Solr indexing the data stored in the data warehouse.
The Questions:
Is this a Recommended Setup? What would the alternative be?
If so...
I'm considering either Hive or Pig for the data-warehousing, and Elastic Search or Solr as a Search Engine. Which combination is known to work better together?
And finally...
I'm seriously considering Cassandra as the "front-end" database. What is the relation between Cassandra and Hadoop, and when/why should they be put to work together instead of having just Cassandra?
Please note, my intention is NOT to start a debate about which of these is better, but to understand how they can be put to work together more efficiently. If it makes any difference, the main code is being written in Scala and Java.
I truly appreciate your help. I'm basically learning as I go and all comments will be very helpful.
Thank you.

First let's talk about Cassandra
This is a NoSQL database with eventual consistency, which basically means that different nodes in a Cassandra cluster may hold different 'snapshots' of the data when there is an inter-cluster communication/availability problem. The data will eventually become consistent, however.
Since you are considering it as a 'frontend' database, what you need to understand is how you will model your data. Cassandra can take advantage of indexes, but you still need to define your access patterns upfront.
Normally there is no relation between Cassandra and Hadoop (except that both are written in Java); however, the Datastax distribution (enterprise version) has Hadoop support directly from Cassandra.
As a general workflow, you would read/write the most current data (let's say the last 24 hours) from your 'small' database, which needs to perform well enough (Cassandra has excellent support for this), and you would move anything older than X (older than 24 hours) to 'long term storage' such as Hadoop, where you can run all sorts of MapReduce jobs etc.
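The hot/cold workflow above can be sketched, very roughly, as a partition by record age. This is only an in-memory illustration of the routing decision; the `written_at` field name and the record shape are hypothetical, and a real job would read from Cassandra and write to Hadoop rather than to Python lists.

```python
from datetime import datetime, timedelta, timezone


def split_by_age(records, now, max_age=timedelta(hours=24)):
    """Partition records into (hot, cold) around a cutoff of now - max_age."""
    cutoff = now - max_age
    hot, cold = [], []
    for record in records:
        # Recent records stay in the front-end store; older ones are archived.
        if record["written_at"] >= cutoff:
            hot.append(record)
        else:
            cold.append(record)
    return hot, cold


now = datetime(2024, 1, 2, 12, 0, tzinfo=timezone.utc)
records = [
    {"id": 1, "written_at": now - timedelta(hours=1)},
    {"id": 2, "written_at": now - timedelta(hours=30)},
    {"id": 3, "written_at": now - timedelta(hours=23)},
]
hot, cold = split_by_age(records, now)
print([r["id"] for r in hot], [r["id"] for r in cold])
```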
With regard to text search, it really depends on what you need. Elasticsearch is more or less a competitor to Solr, and vice versa. You can see for yourself how they compare here: http://solr-vs-elasticsearch.com/

As for your third question,
I think Cassandra is more like a database for storing data.
Hadoop is responsible for providing a computation model that lets you analyze the large data sets held in Cassandra.
So it is very helpful to combine Cassandra with Hadoop.
There are also other approaches you can consider, such as combining MongoDB with Hadoop, since MongoDB has mongo-connector support between Hadoop and its data.
Also, if you have search requirements, you can use Solr as well, generating the index directly from MongoDB.

hadoop vs teradata what is the difference

I've worked with Teradata. I've never touched Hadoop, but since yesterday I have been doing some research on it. By their descriptions, the two seem quite interchangeable, but some papers say that they serve different purposes. Everything I have found is vague, and I am confused.
Has anybody experience with both of them? What is the serious difference between them?
Simple example: I want to build an ETL process that will transform billions of rows of raw data and organize them into a DWH, then do some resource-intensive analysis on them. Why use TD? Why Hadoop? Or why not?

I think this article titled 'MapReduce and Parallel DBMSs: Friends or Foes' does quite a good job of describing the situations where each technology works best. In a nutshell, Hadoop is excellent for storing unstructured data and running parallel transformations to 'sanitize' incoming data, whereas DBMSs excel at executing complex queries quickly.

Hadoop, Hadoop with Extensions, RDBMS Feature/Property Comparison
I am not an expert in this area, but in the coursera.com course, Introduction to Data Science, there is a lecture titled: Comparing MapReduce and Databases as well as a lecture on Parallel databases within the map reduce section of the course.
Here is a summary from these lectures on the comparison of MapReduce vs. RDBMS (not necessarily parallel RDBMS).
One point to remember is that the comparison is different if you include extensions to Hadoop like Pig, Hive, etc. I will put in () the MapReduce extensions that add some of this functionality or these properties.
Some functionality/properties that RDBMS have but not native MapReduce:
Declarative query languages (Pig, Hive)
Schemas (Hive, Pig, DryadLINQ, Hadapt)
Logical Data Independence
Indexing (Hbase)
Algebraic Optimization (Pig, Dryad, HIVE)
Caching/Materialized Views
ACID/Transactions
Some properties that MapReduce has (relative to a regular RDBMS, not necessarily a parallel RDBMS):
High Scalability
Fault-tolerance
“One-person deployment”
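To make the "declarative query languages" row concrete, here is a small sketch that computes the same per-store totals twice: once with hand-written map, shuffle and reduce steps, and once with a single declarative SQL statement. It runs in one process, with SQLite standing in for an RDBMS, so it illustrates the programming models only and says nothing about scalability or fault tolerance. The sample data and function names are made up for the example.

```python
import sqlite3
from collections import defaultdict

# Made-up sample data: (store, sale amount)
SALES = [("north", 10), ("south", 5), ("north", 7), ("east", 3), ("south", 1)]


def map_phase(rows):
    """Emit one (key, value) pair per input row."""
    for store, amount in rows:
        yield store, amount


def shuffle(pairs):
    """Group all values by key, as the framework would between map and reduce."""
    groups = defaultdict(list)
    for key, value in pairs:
        groups[key].append(value)
    return groups


def reduce_phase(groups):
    """Collapse each key's values into a single total."""
    return {key: sum(values) for key, values in groups.items()}


def totals_mapreduce(rows):
    return reduce_phase(shuffle(map_phase(rows)))


def totals_sql(rows):
    """Same result, but the engine decides how to execute the GROUP BY."""
    conn = sqlite3.connect(":memory:")
    conn.execute("CREATE TABLE sales (store TEXT, amount INTEGER)")
    conn.executemany("INSERT INTO sales VALUES (?, ?)", rows)
    result = dict(conn.execute("SELECT store, SUM(amount) FROM sales GROUP BY store"))
    conn.close()
    return result


print(totals_mapreduce(SALES))
print(totals_sql(SALES))
```

With native MapReduce you write the three phases yourself; extensions such as Hive or Pig exist to let you state the query and leave the phases to the system.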

I've been asked this question several times, the answer that I usually give is a car analogy (which is pretty silly because I'm not a car person - but it seems to work)
Teradata is the car/dbms for the masses - it is reliable, mature, works well and is there when you need it. It is difficult (compared to Hadoop) to customise and add functionality to the base product.
Hadoop is the car/dbms for the enthusiast - it isn't as reliable or mature, it works well so long as you attend to it. It is easy (compared to Teradata) to customise and add functionality to the base product.
Put another way, Teradata is the reliable workhorse where you put your mission critical process (operational reporting, enterprise reporting, decision support etc).
Hadoop is the place where you can do a lot of this stuff, but don't be surprised if you come in one morning and find that your regulatory reports can't be produced because someone applied a patch or you've suddenly got a "too many small files" problem.
To loop back into the analogy, if you don't want to be too techy and the manufacturer's product (dbms and/or car) works for you out of the box, Teradata is a good option.
On the other hand, if you like to tinker under the hood, swap out the carburettor (or whatever), adjust the gear ratios, tweak the fuel air mixture depending on whether you are country or city driving, bolt on a Turbo charger and/or your family complain about how long you spend in the garage on weekends - Hadoop is the place for you.
IMHO, most, if not all, organisations need both.
I hope this helps :-)

To begin with, vanilla Apache Hadoop is 100% open source. But if you need commercial support along with consultancy, there are companies like Cloudera, MapR, Hortonworks, etc.
Hadoop is backed by a growing community that fixes bugs and makes improvements on a consistent basis. Hadoop's storage model, HDFS, is based on Google's GFS architecture, which is proven to handle large quantities of data. Furthermore, Hadoop's analysis model, MapReduce, is based on Google's MapReduce model.
Hadoop is used by tech giants like Facebook, Yahoo, Twitter, eBay, etc. to store and analyse their high volumes of data, in real time as well as passively.
For your question about ETL systems, see these slides.
Ok now Why Hadoop?
Open Source
Proven Storage and Analysis model for Large Quantities of data
Minimal hardware requirements to set up and run.
Ok now Why TD?
Commercial Support

What is the difference between Membase and Couchbase?

With the two merging under the same roof recently, it has become difficult to determine what the major differences between Membase and Couchbase are. Why would one be used over the other?

I want to elaborate on the answer given by James.
At the moment Couchbase server is CouchDB with GeoCouch integration out of the box. What is great about CouchDB is that you have the ability to create structured documents and do map-reduce queries on those documents.
Membase server is memcached with persistence and a very simple cluster management interface. Its strengths are the ability to do very low latency queries as well as the ability to easily add and remove servers from a cluster.
Late this summer however Membase and CouchDB will be merged together to form the next version of Couchbase. So what will the new version of Couchbase look like?
Right now in Membase the persistence layer for memcached is implemented with SQLite. After the merger of these two products CouchDB will be the new persistence layer. This means that you will get the low latency requests and great cluster management that was provided by Membase and you will also get the great document oriented model that CouchDB is known for.

From the Couchbase Product Comparison Table:
Couchbase Server is a fit if:
A single-server solution is enough to support your users and data
Advanced querying and indexing is important
You demand peer-to-peer sync
Membase Server is a fit if:
You have large number of users
Multiple servers are necessary to support growing user population and data set
Low latency, high throughput are needed for snappy interactive experience