How to test code that uses DSE Cassandra/Solr?

I am working on an application that interfaces with DataStax Enterprise (we are auto-syncing with Solr).
I was wondering how this application can be tested efficiently.
I was considering embedded Cassandra for testing, but the caveat is that we use solr_query to query Cassandra.
The alternative is to set up a test keyspace on the real node and run the tests against that keyspace.
But I would like to write functional test cases that have no dependency on the real Cassandra database.
I would like to know the best practices people follow to handle such scenarios.
Cheers,
Utsav

The DataStax Java driver does this type of thing using CCM. CCM is a tool to stand up / simulate a small cluster (both DSE and OSS C* are supported) on a single machine.
Check out their code here: https://github.com/datastax/java-driver/tree/3.x/testing
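For the tests that only need to check which query the application sends, one option that needs no cluster at all is to stub the driver session. Below is a minimal Python sketch of that idea, not something taken from the driver's test code: FakeSession, UserSearch, and the app.users table are hypothetical names, and the stub only assumes the session exposes an execute(query, parameters) method. It cannot tell you whether Solr would actually match anything, so searches that matter still need a real DSE node, such as one started by CCM.

```python
class FakeSession:
    """Hypothetical stand-in for a driver session: records calls, returns canned rows."""

    def __init__(self, rows):
        self.rows = rows
        self.calls = []

    def execute(self, query, parameters=None):
        # Remember what the application asked for so the test can inspect it.
        self.calls.append((query, parameters))
        return list(self.rows)


class UserSearch:
    """Hypothetical data-access class; keyspace and table names are made up."""

    QUERY = "SELECT id, name FROM app.users WHERE solr_query = %s"

    def __init__(self, session):
        self.session = session

    def by_name(self, name):
        # Escape backslashes and double quotes so the name stays one Solr phrase.
        phrase = name.replace("\\", "\\\\").replace('"', '\\"')
        rows = self.session.execute(self.QUERY, ('name:"%s"' % phrase,))
        return [row["name"] for row in rows]


# The "test": run the search against the stub and look at what was sent.
session = FakeSession(rows=[{"id": 1, "name": "utsav"}])
names = UserSearch(session).by_name("utsav")
sent_query, sent_params = session.calls[0]
```

Because the session is passed in through the constructor, the same UserSearch class can be pointed at a real session in integration tests without changes.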

Related

What is the difference between Databricks and Spark?

I am trying to get a clear picture of how they are interconnected and whether the use of one always requires the use of the other. If you could give a non-technical definition or explanation of each of them, I would appreciate it.
Please do not paste a technical definition of the two. I am not a software engineer or data analyst or data engineer.
These two paragraphs summarize the difference quite well (from this source):
Spark is a general-purpose cluster computing system that can be used for numerous purposes. Spark provides an interface similar to MapReduce, but allows for more complex operations like queries and iterative algorithms. Databricks is a tool that is built on top of Spark. It allows users to develop, run and share Spark-based applications.
Spark is a powerful tool that can be used to analyze and manipulate data. It is an open-source cluster computing framework that is used to process data in a much faster and efficient way. Databricks is a company that uses Apache Spark as a platform to help corporations and businesses accelerate their work. Databricks can be used to create a cluster, to run jobs and to create notebooks. It can be used to share datasets and it can be integrated with other tools and technologies. Databricks is a useful tool that can be used to get things done quickly and efficiently.
In simple words, Databricks has a 'tool' that is built on top of Apache Spark, but it wraps and manipulates it in an intuitive way which is easier for people to use.
This, in principle, is the same as the difference between Hadoop and AWS EMR.

Cassandra and solr on same node

I am working on architecting a POC DataStax Enterprise Cassandra cluster environment. We are going to use Solr in combination with Cassandra. Would it be a valid configuration to host both Solr and Cassandra on the same physical server?
If you're evaluating DSE, Solr is built into the packages you're using. It's an extremely tight integration that would be tough to replicate on your own. Here's the documentation: https://docs.datastax.com/en/datastax_enterprise/4.8/datastax_enterprise/srch/srchIntro.html
It's also worth noting that Solr, in this case, does run co-located with Cassandra for data locality, and to take advantage of C* replication, availability, and some other C* specific benefits.
But most importantly, I suggest checking out this hands-on training: https://academy.datastax.com/courses/ds310-datastax-enterprise-search-apache-solr
If you have any specific questions about the integration, update your question and I'd be happy to help.

Indefinite Search Cluster (Solr vs ES vs Datastax EE)

PREFACE:
This question is not asking for an open-ended comparison of Elasticsearch vs. Solr vs. DataStax Solr (DataStax EE). (Though links for this in the comments section are welcome.)
PROJECT:
I have been building a domain name type web service for a while. In doing so, I am realizing the exponential growth of such a service.
BACKGROUND:
I would like to know which specific search platform allows me to save and expand indefinitely. Yes, I realize you can split a Solr shard these days, so if I have a 20-shard SolrCloud I can later split it into 40 (I think? Again, that's not indefinite). I'm not sure about the Elasticsearch side of things. DataStax (EE) seems to be the answer because of Cassandra's architecture, but (A) since they give no transparency on license price, and I have to disclose my earnings to them, I'm quickly reminded of Oracle's bleed-you-slowly fee strategy, and as a start-up that is a huge deterrent. Also, (B) when they say they integrate full MapReduce with Hive, Sqoop, Mahout, Solr, and Pig, I'm thinking I don't want to spend a lifetime learning bells and whistles that aren't applicable to my project. I want a search platform to which I can add 2 billion documents a month (or whatever number) indefinitely without having to worry that I started a cluster with too few shards up front.
QUESTION:
Admittedly, my background section is riddled with ignorance that I would like to correct. My intention is not to offend or belittle these amazing technologies. I am simply wondering which of them can scale without my having to worry about outgrowing shards [I took out the word forever here -- thank you per comment below]. Or can any? Not hardware-wise, but shard-wise. Which platform can I use and not have to worry about future growth, whether it's 20TB or 2PB? Assume the hardware budget for servers, switches, etc. is unlimited.
DataStax Enterprise (DSE) is not a "search platform" per se. One of the features DSE provides is the ability to search data stored in Cassandra. Cassandra is being used to store and access enterprise operational data. The idea is that once you have decided that Cassandra is your preferred data store for your enterprise operational data, the DSE/Solr integration then allows you to perform rich search on that data.
Large enterprises are looking to migrate off of traditional relational databases to more modern platforms, namely NoSQL databases such as Cassandra, where scalability and distributed computing (including multi-data center support, tunable consistency, and robust operations tools, including the OpsCenter GUI dashboard) are the norm. The Solr integration of DSE facilitates that migration.
With regard to your revenue, that link points to a startup program. That makes the software 100% free if you qualify.

Information on Nutch, Hadoop, Solr, MapReduce and Mahout

PS: Correct me if I am wrong in any line
I am building a search engine with Nutch and Solr.
I know that by using Solr I can make searching more efficient and let Nutch handle only the crawling of the web.
I also know that Hadoop is used to handle petabytes of data by forming clusters and running MapReduce.
Now, what I want to know is:
1) Since I'll be running this open source software on only one machine, i.e. my laptop on localhost, how would Hadoop be beneficial in my case, given that it forms clusters? How would clusters be formed on only one machine?
2) What would be the importance of MapReduce in my case?
3) How would Mahout, Cassandra and HBase affect my engine?
Any help on this is very much appreciated. Apologies if this is a noob question!
Thanks
Regards
1) Since I'll be running this open source software on only one machine, i.e. my laptop on localhost, how would Hadoop be beneficial in my case, given that it forms clusters?
Hadoop was created to process large-scale data. It is a distributed application, so it is not going to provide you benefits on a single machine.
How would clusters be formed on only one machine?
Install Hadoop in pseudo-distributed mode.
What would be the importance of MapReduce in my case?
Again, MapReduce is useful if you want to process pages fetched by a crawler on the scale of thousands of gigabytes; it is designed for processing data that large.
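To make the idea concrete, here is a toy, single-process Python sketch of the map, shuffle, and reduce steps applied to a word count over crawled pages. It is only an illustration of the programming model under made-up data; the function names and the two sample pages are hypothetical, and real Hadoop runs the same three phases spread across many machines.

```python
from collections import defaultdict


def map_phase(page_text):
    # Emit a (word, 1) pair for every word on one crawled page.
    return [(word.lower(), 1) for word in page_text.split()]


def shuffle(pairs):
    # Group all values by key, as the framework does between map and reduce.
    grouped = defaultdict(list)
    for key, value in pairs:
        grouped[key].append(value)
    return grouped


def reduce_phase(key, values):
    # Collapse the values for one key into a single count.
    return key, sum(values)


pages = ["solr indexes pages", "nutch crawls pages"]
mapped = [pair for page in pages for pair in map_phase(page)]
counts = dict(reduce_phase(key, values) for key, values in shuffle(mapped).items())
```

On a laptop this is no faster than a plain loop, which is the point of the answer above: the model pays off only when the map and reduce calls can be spread over a cluster.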
How would Mahout, Cassandra and HBase affect my engine?
They are different tools for different needs.
Mahout is a set of machine learning algorithms adapted to run as MapReduce tasks on Hadoop or on local files. If you want to do something like learning languages the way Google Translate does, you can use it.
HBase is a NoSQL database that provides more real-time data processing, as opposed to the ad hoc analysis for which MapReduce is more useful.
I would suggest that you go back to your problem statement, design with as few tools as required, and when you hit obstacles you will understand where some of these tools could be useful.

What is the production ready NonSQL database?

With the rise of non-SQL database usage on high-traffic websites, I'm interested in using one for my project. I've heard several names, like Voldemort, MongoDB and CouchDB. But which of these non-SQL databases are production ready? I've seen the download pages and it seems that none of them is production ready because none is at version 1.0 yet. Are there any names other than these three that are recommended for use in production?
What do you mean by production ready? As far as I know, all of them are being used on live systems.
You should make your choice based on how the features they provide fit your needs.
You can also add Tokyo Cabinet to the list as well as the mnesia database provided by the Erlang VM.
I think you need to start out from your project requirements to see what kind of database you really need. There are many non-relational DBMSs out there and they differ a lot in what kinds of problems they are good at solving. I think the article Should you go Beyond Relational Databases? by Martin Kleppmann is a good starting point for finding out what you need. There are also a lot of Stack Overflow threads on similar topics; these are my favorites:
The Next-gen Databases
Non-Relational Database Design
When shouldn’t you use a relational database?
Good reasons NOT to use a relational database?
When you have narrowed down what you actually need you can take a deeper look into the alternatives to see which DBMS are production ready for your use case. Production readiness isn't a yes/no thing: people may successfully deploy some solution that for example lacks in tool support - in another project this could be a no-go.
As for version numbers, different projects have a different take on this, so you can't just compare the version numbers. I'm involved in the graph database project Neo4j, and even though it has been in production use for 5+ years by now, we still haven't released a version 1.0 final yet.
I'm tempted to answer "use SIRA_PRISE".
It's definitely non-SQL.
And its current version is 1.2, meaning that someone like you must definitely assume it's "production-ready".
But perhaps I shouldn't be answering at all.
Nice article comparing rdbms with 'next gen' and listing some providers:
Is the Relational Database Doomed?
http://readwrite.com/2009/02/12/is-the-relational-database-doomed
I would suggest using ArangoDB.
ArangoDB is a multi-model mostly-memory database with a flexible data model for documents and graphs. It is designed as a “general purpose database”, offering all the features you typically need for modern web applications.
ArangoDB is supposed to grow with the application—the project may start as a simple single-server prototype, nothing you couldn’t do with a relational database equally well. After some time, some geo-location features are needed and a shopping cart requires transactions. ArangoDB’s graph data model is useful for the recommendation system. The smartphone app needs a lean API to the back-end—this is where Foxx, ArangoDB’s integrated Javascript application framework, comes into play.
Another unique feature is ArangoDB’s query language AQL — it makes querying powerful and convenient. AQL enables you to describe complex filter conditions and joins in a readable format, much in the same way as SQL.
You can model your data in several ways:
in key/value pairs
as collections of documents
as graphs with nodes, edges, and properties for both
You can access data in ArangoDB:
using the general HTTP REST API via curl/wget, or your browser
via the ArangoDB shell (“arangosh”)
using a programming language specific client library
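As a rough illustration of the HTTP route, the Python sketch below builds (but does not send) a request that would submit an AQL query to the server's cursor endpoint. The users collection, the age attribute, and the build_aql_request helper are hypothetical, and it assumes a server on the default local port with authentication left out; treat it as a sketch rather than a reference client.

```python
import json
import urllib.request


def build_aql_request(base_url, query, bind_vars):
    # Hypothetical helper: package an AQL query and its bind parameters
    # as a JSON POST to the cursor endpoint of the HTTP API.
    body = json.dumps({"query": query, "bindVars": bind_vars}).encode("utf-8")
    return urllib.request.Request(
        url=base_url + "/_api/cursor",
        data=body,
        headers={"Content-Type": "application/json"},
        method="POST",
    )


request = build_aql_request(
    "http://localhost:8529",  # assumed local server address
    "FOR u IN users FILTER u.age >= @min_age RETURN u.name",
    {"min_age": 18},
)
# urllib.request.urlopen(request) would send it; that needs a running server.
```

Passing values through bindVars instead of formatting them into the query string keeps user input out of the AQL text itself.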
Server requirements for ArangoDB:
ArangoDB runs on Linux, OS X and Microsoft Windows.
It runs on 32bit and 64bit systems, though using a 32bit system will limit you to using only approximately 2 to 3 GB of data with ArangoDB.
