Clustering several instances of Apache Zeppelin? - apache-zeppelin

I have been testing Apache Zeppelin to query several sources hosted on Apache Drill and then create charts to analyze our data.
Since the product seems robust enough, I am planning to roll this solution out to our analyst team for monitoring and exploring business data.
The problem I face now is that a single Zeppelin instance will not be enough to handle our concurrent users (and, thinking about HA, relying exclusively on one host is not a good idea). I have already built an Apache Drill cluster to handle the traffic volumes, but I couldn't find anything in the documentation on how to build a cluster of several Zeppelin instances that share their notebooks and user sessions behind a load balancer.
Can you advise whether what I'm trying to do is possible? If so, can you point me in the right direction?
Thanks
EDIT: I have been playing with Zeppelin 0.8.0-snapshot and its MongoDB integration to store notebooks. Although it seems able to write new notebooks and update existing ones, other connected Zeppelin instances only refresh their internal notebooks after a restart.

Related

Database Clustering

My application is built on a monolithic architecture using the Laravel framework and a MySQL database.
The application is targeted to serve more than 1 million users, and at its peak hour it will have to handle more than 50k requests per second.
I know about load balancing, but I want help with database clustering.
I want to implement a master-slave topology, but I have no clue how to start.
I found some resources about ClusterControl, https://severalnines.com/product/clustercontrol, but I don't understand the proper guidelines.
Question 1: What are the guidelines for implementing database clustering?
You can watch a video on how to set up DB clustering. It is trivial with ClusterControl: just set up the hosts with Debian/Ubuntu/RHEL/CentOS and provide the IPs/hostnames to ClusterControl, and it will deploy an Active/Standby database cluster (MySQL/MariaDB, PostgreSQL) or a Multi-Master one (Galera).
https://www.youtube.com/watch?v=umgvVHHaBog
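Whatever tool deploys the cluster, the application still has to decide which host receives each statement in a master-slave topology. Here is a minimal sketch of that read/write split, assuming one primary and a list of replicas. The host names and the `ReadWriteRouter` class are made up for illustration; this is not part of Laravel, MySQL, or ClusterControl.

```python
from itertools import cycle


class ReadWriteRouter:
    """Send writes to the primary and spread reads across replicas."""

    # Statement verbs treated as writes in this sketch (not exhaustive).
    WRITE_VERBS = {"INSERT", "UPDATE", "DELETE", "REPLACE", "CREATE", "ALTER", "DROP"}

    def __init__(self, primary, replicas):
        self.primary = primary
        # Fall back to the primary when no replica is configured.
        self._replicas = cycle(replicas or [primary])

    def route(self, sql):
        """Return the host that should execute this (non-empty) statement."""
        verb = sql.lstrip().split(None, 1)[0].upper()
        if verb in self.WRITE_VERBS:
            return self.primary
        # Round-robin over the replicas for everything else.
        return next(self._replicas)


# Hypothetical host names.
router = ReadWriteRouter("db-primary", ["db-replica-1", "db-replica-2"])
print(router.route("INSERT INTO users (name) VALUES ('a')"))  # db-primary
print(router.route("SELECT * FROM users"))                    # db-replica-1
print(router.route("SELECT * FROM orders"))                   # db-replica-2
```

Keep in mind that with asynchronous replication a read sent to a replica right after a write may not see that write yet, so reads that must be fresh are usually sent to the primary as well.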

Kubernetes and Cloud Databases

Could someone explain the benefits/issues with hosting a database in Kubernetes via a persistent volume claim combined with a storage volume over using an actual cloud database resource?
It's essentially a trade-off: convenience vs control. Take a concrete example: let's say you pay Amazon money to use Athena, which is really just a nicely packaged version of Facebook Presto which AWS kindly operates for you in exchange for $$$. You could run Presto on EKS yourself, but why would you?
Now, let's say you want to or need to use Apache Drill or Apache Impala. Amazon doesn't offer it. Nor do any of the other big public cloud providers at the time of writing, as far as I know.
Another thought: what if you want to migrate off of AWS? Your data has gravity as well.
Could someone explain the benefits/issues with hosting a database in Kubernetes ... over using an actual cloud database resource?
As the previous excellent answer noted:
It's essentially a trade-off: convenience vs control
In addition to the previous example (Athena), take a look at RDS as well and see what you would need to handle yourself (why would you, as said already):
Automatic backups
Multizone deployments
Snapshots
Engine upgrades
Read replicas
and other bells and whistles that come with a managed service as opposed to a self-hosted/self-managed one.
But there is more to it than just convenience/control, which this post is trying to shed light on:
Kubernetes adds another layer of abstraction (pods, services...), and depending on how you handle storage (persistent volumes) you have two additional considerations:
Access speed (depending on your use case this can be negligible or a show stopper).
The storage you have at hand might not be optimized for relational-database-style I/O (or might restrict your ability to schedule pods efficiently). These are the very same reasons you are advised not to run a database on NFS, for example.
Several recent conference talks on Kubernetes point out that databases are a big no-no for Kubernetes (although this is highly opinionated; we do run average-load MySQL and PostgreSQL databases in k8s), and heavy load/fast I/O is somewhat of a challenge to get right on k8s, as opposed to a managed cloud solution where somebody has already fine-tuned everything for you.
In conclusion:
It is all about convenience, controls and capabilities.

Would Apache Gora fit when you have to build an application which writes/reads from a set of databases?

Would Apache Gora fit when you have to build an application which writes/reads from a set of databases including SQLServer, MongoDB, HBase & Cassandra?
The idea is to develop an application which is capable of performing CRUD operations across databases? Request 1 goes to SQLServer, Request 2 goes to MongoDB and Request 3 goes to HBase and so on. The Request will have the information as to which database the application should hit and there is a finite list of databases.
Are there any alternatives?
Any pointers?
Let me know if any other information is required.
From your description I would say "yes", except for accessing SQL Server (not supported).
Two things I can tell you as BIG tips to begin:
Create your datastores with the DataStoreFactory#createDataStore() method that lets you configure a different "gora.properties" content, and Configuration.
Remember that each gora-xxx-mapping.xml is shared between all the connections to the same backend.
Alternatives:
Kundera, maybe?
-- Edit from comments:
There is a gora-sql module, but it had to be disabled years ago because of some license issues. If you look at the modules in the pom, you will see that gora-sql is not being compiled. No one took on the task of rebuilding it :(
About point 2: there can be an Application1MongoDB and an Application2MongoDB. If they are different applications, each can have a different gora-xxx-mapping.xml in its own classpath.
If they are datastore instances created by calls to #createDataStore() (in the same application), then all the mappings have to go in the classpath's gora-xxx-mapping.xml. It is just a tip about something I found tricky.
More alternatives:
Hibernate OGM, as mentioned in the comments.
EclipseLink (although it does not support many backends)
DataNucleus
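Independently of which library you pick, the "request says which database to hit" part of the question boils down to a lookup from a backend name to a datastore object. Below is a rough sketch in plain Python. It is not Gora code: `InMemoryStore` and `StoreRouter` are made-up names, and the in-memory store merely stands in for real MongoDB/HBase/Cassandra clients.

```python
from typing import Any, Dict, Optional


class InMemoryStore:
    """Hypothetical stand-in for a real datastore client."""

    def __init__(self, name: str) -> None:
        self.name = name
        self._rows: Dict[str, Any] = {}

    def put(self, key: str, value: Any) -> None:
        self._rows[key] = value

    def get(self, key: str) -> Optional[Any]:
        return self._rows.get(key)

    def delete(self, key: str) -> bool:
        # True if the key existed and was removed.
        return self._rows.pop(key, None) is not None


class StoreRouter:
    """Dispatch each request to the backend it names."""

    def __init__(self, stores: Dict[str, InMemoryStore]) -> None:
        self._stores = dict(stores)  # finite, known list of backends

    def handle(self, request: Dict[str, Any]) -> Any:
        backend = request["backend"]
        if backend not in self._stores:
            raise ValueError(f"unknown backend: {backend!r}")
        store = self._stores[backend]
        op = request["op"]
        if op == "put":
            return store.put(request["key"], request["value"])
        if op == "get":
            return store.get(request["key"])
        if op == "delete":
            return store.delete(request["key"])
        raise ValueError(f"unknown operation: {op!r}")


router = StoreRouter({"mongodb": InMemoryStore("mongodb"),
                      "hbase": InMemoryStore("hbase")})
router.handle({"backend": "mongodb", "op": "put", "key": "u1", "value": {"name": "Ann"}})
print(router.handle({"backend": "mongodb", "op": "get", "key": "u1"}))  # {'name': 'Ann'}
print(router.handle({"backend": "hbase", "op": "get", "key": "u1"}))    # None
```

With a library such as Gora, the values in the dictionary would be the datastore instances you created per backend, and the routing layer itself would stay about this small.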

Create App Engine project via API

I would need to automate the creation of new App Engine projects. Is this possible? I see there is a Google Cloud SQL Admin API which can create new Cloud SQL instances, but what about App Engine? Is there anything similar?
Update:
We have developed an application that runs on GAE and uses Cloud SQL and plenty of API integration with most of Google Apps. We foresee dozens, if not hundreds, of customers in the near future. All of them will be using their own Google domain and Google Apps.
While we could actually just deploy the application in our App Engine and modify the Cloud SQL tables to include the id of the customer who owns the record, we thought it would be better if we deploy an app instance and Cloud SQL for every one of them (on our own account). The main reasons coming to mind are that we can track how much every customer spends in terms of billing, and speed up the database since Cloud SQL is just a MySQL instance.
Steps for the creation would require editing a properties file in the packaged .war file, adding the certificate used to log in as a service account, and probably something that I am missing at this moment :-P
This question is somewhat related: Create an App Engine Project ID command line
As far as I know this is not possible (and is unlikely to be possible anytime soon).
Update:
I can see why splitting into separate projects for billing purposes would be really nice (multi-tenancy is great, but getting one bill per customer from Google sounds easier), but unfortunately I don't think that it's going to be your best option.
For AppEngine, you may want to look into the multi-tenancy features (or in Python) and how to get stats for billing.
Keep in mind, however, that CloudSQL is not simply a MySQL instance. It happens to speak MySQL but is not the same as running MySQL on Compute Engine, for example. I would recommend running some benchmarks before concluding that the "adding the customer ID to the table" idea you had won't work.
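For reference, the "adding the customer ID to the table" approach could look roughly like the sketch below. It uses Python's built-in sqlite3 purely as a stand-in for Cloud SQL, and the table, column, and function names are made up for illustration. Counting rows per customer is only a crude proxy for usage, not how App Engine or Cloud SQL actually bill.

```python
import sqlite3


def create_schema(conn):
    """One shared table; every row is tagged with the customer that owns it."""
    conn.execute(
        "CREATE TABLE records ("
        " id INTEGER PRIMARY KEY,"
        " customer_id TEXT NOT NULL,"
        " payload TEXT NOT NULL)"
    )
    # Index the tenant column so per-customer queries do not scan every row.
    conn.execute("CREATE INDEX idx_records_customer ON records (customer_id)")


def add_record(conn, customer_id, payload):
    cur = conn.execute(
        "INSERT INTO records (customer_id, payload) VALUES (?, ?)",
        (customer_id, payload),
    )
    conn.commit()
    return cur.lastrowid


def records_for(conn, customer_id):
    """Always filter by customer_id so tenants never see each other's rows."""
    rows = conn.execute(
        "SELECT payload FROM records WHERE customer_id = ? ORDER BY id",
        (customer_id,),
    )
    return [row[0] for row in rows]


def usage_by_customer(conn):
    """Row count per customer, as a rough usage figure."""
    rows = conn.execute(
        "SELECT customer_id, COUNT(*) FROM records GROUP BY customer_id"
    )
    return dict(rows.fetchall())


conn = sqlite3.connect(":memory:")
create_schema(conn)
add_record(conn, "acme.example", "first")
add_record(conn, "acme.example", "second")
add_record(conn, "globex.example", "hello")
print(records_for(conn, "acme.example"))  # ['first', 'second']
print(usage_by_customer(conn))
```

A benchmark along these lines, run against your real Cloud SQL instance with realistic row counts, should tell you whether the shared-table design is fast enough.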
Lastly, a possibly relevant read: http://signalvnoise.com/posts/1509-mr-moore-gets-to-punt-on-sharding
I guess the conclusion is that there’s no use in preempting the technological progress of tomorrow. Machines will get faster and cheaper all the time, but you’ll still only have the same limited programming resources that you had yesterday.
If you can spend them on adding stuff that users care about instead of prematurely optimizing for the future, you stand a better chance of being in business when that tomorrow finally rolls around.

Running Solr on Azure

Can Solr be run on Azure?
I know this thread is old, but I wanted to share our two cents. We are running SOLR on Azure with no big problems. We've created a startup task to install Java and create the deployment, and we have a SOLR instance on each web role. From there on, it's a bit of magic figuring out the master/slave configuration, but we've solved that too.
So yes, it can be done with a little bit of work. Most importantly, the startup task is key. The index doesn't have to be stored anywhere but on local disk (Local Resource), because indexing is part of the role startup. If you have to speed it up and a few minutes' difference is acceptable, you can have the master sync the index with a blob storage copy every once in a while. But in this case you need to implement a voting algorithm so that the SOLR instances don't overwrite each other.
We'll be posting info on our blog, but I don't want to post links in answers for old threads because I'll look like a spammer :o)
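The answer above does not say which voting algorithm was used, so the following is only one possible sketch, under the assumption that each instance knows which other instances it can currently see. Every instance votes for the lowest instance id in its view, and only an instance that collects a strict majority may write the index copy to blob storage. The function and instance names are made up for illustration.

```python
from collections import Counter


def vote(visible_instances):
    """An instance votes for the lowest instance id it can currently see."""
    return min(visible_instances)


def elect_uploader(views):
    """Pick the instance allowed to upload the index, or None.

    `views` maps each instance id to the set of instance ids it can see
    (including itself). A winner needs a strict majority of all votes,
    so two disconnected groups can never both elect an uploader.
    """
    if not views:
        return None
    votes = Counter(vote(seen) for seen in views.values())
    candidate, count = votes.most_common(1)[0]
    return candidate if count > len(views) / 2 else None


# Hypothetical role instance names.
healthy = {
    "role_0": {"role_0", "role_1", "role_2"},
    "role_1": {"role_0", "role_1", "role_2"},
    "role_2": {"role_0", "role_1", "role_2"},
}
print(elect_uploader(healthy))  # role_0

isolated = {"role_0": {"role_0"}, "role_1": {"role_1"}, "role_2": {"role_2"}}
print(elect_uploader(isolated))  # None
```

In a real deployment the views would come from some health-check or membership mechanism, and the upload itself would still want a lease or conditional write on the blob as a second line of defence.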
A bit of a dated question, but I wanted to provide an updated answer. You can run Apache Solr on Azure. Azure offers IaaS (Infrastructure as a Service), which is raw virtual machines running Linux/Windows. You can choose to set up your entire Solr cluster on a set of VMs and configure SolrCloud and ZooKeeper on them.
If you are interested, you could also check out Solr-as-a-Service or Hosted Solr solutions as they remove the headache of setting up SolrCloud on Azure. There's a lot that goes into running, managing and scaling a search infrastructure and companies like Measured Search help reduce time and effort spent on doing that. You get back that time in developing features and functionality that your applications or products need.
More specifically, if you are doing it yourself, it can take many days to weeks to give the proper love and care it needs. Here's a paper that goes into the details of the comparison between doing it yourself and utilizing a Solr-as-a-Service solution.
https://www.measuredsearch.com/white-papers/why-solr-service-is-better-than-diy-solr-infrastructure/
Full disclosure: I run product for Measured Search, which offers cloud-agnostic Solr-as-a-Service. Measured Search enables you to stand up a Solr cluster on Azure within minutes.
For new visitors: there are now two Solr instances available via . We tested them and they are good, but we ended up using the Azure Search service, which so far looks very solid.
I haven't actually tried, but Azure can run Java, so theoretically it should be able to run Solr.
This article ("Run Java with Jetty in Windows Azure") should be useful.
The coordinator for "Lucene.Net on Azure" also claims it should run.
EDIT : The Microsoft Interop team has written a great guide and config tips for running Solr on Azure!
Azure IaaS allows you to create Linux-based VMs, with flavors including Ubuntu, SUSE and CentOS. Such a VM comes with local root storage that lasts only until the VM is rebooted.
However, you can add additional volumes on which data will persist even through reboots. Your solr data can be stored here.

Resources