Running Solr on Azure

Can Solr be run on Azure?

I know this thread is old, but I wanted to share our two cents. We are running Solr on Azure with no big problems. We've created a startup task to install Java and create the deployment, and we have a Solr instance on each web role. From there, figuring out the right master/slave configuration takes a bit of magic, but we've solved that too.
So yes, it can be done with a little bit of work. Most importantly, the startup task is key. The index doesn't have to be stored anywhere but on local disk (a Local Resource), because indexing is part of the role startup. If you need to speed that up and a few minutes of staleness are acceptable, you can have the master sync the index to a blob storage copy every once in a while. But in that case you need to implement a voting algorithm so that the Solr instances don't overwrite each other's copy.
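For illustration, here is a rough sketch of that kind of blob-storage sync, using a time-limited blob lease as a simple lock so two instances never push at the same time. This is not the exact voting scheme described above; the container name, lock blob, and snapshot argument are hypothetical, and it assumes the v12 azure-storage-blob SDK:

```python
# Rough sketch: a time-limited blob lease acts as a simple distributed lock,
# so only one master pushes its index snapshot at a time.
from azure.core.exceptions import HttpResponseError
from azure.storage.blob import BlobServiceClient

CONN_STR = "DefaultEndpointsProtocol=..."  # hypothetical storage connection string

def push_index_snapshot(snapshot_bytes: bytes) -> bool:
    service = BlobServiceClient.from_connection_string(CONN_STR)
    container = service.get_container_client("solr-index")   # hypothetical container
    lock = container.get_blob_client("sync.lock")            # pre-created marker blob
    try:
        lease = lock.acquire_lease(lease_duration=60)        # losers get a 409 here
    except HttpResponseError:
        return False  # another instance is already syncing
    try:
        snapshot = container.get_blob_client("index-snapshot.zip")
        snapshot.upload_blob(snapshot_bytes, overwrite=True)
        return True
    finally:
        lease.release()
```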
We'll be posting info on our blog, but I don't want to post links in answers for old threads because I'll look like a spammer :o)

A bit of a dated question, but I wanted to provide an updated answer. You can run Apache Solr on Azure. Azure offers IaaS (Infrastructure as a Service): raw virtual machines running Linux or Windows. You can set up your entire Solr cluster on a set of VMs and configure SolrCloud and ZooKeeper on them.
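Once SolrCloud and ZooKeeper are up on those VMs, collections are created over HTTP through the Collections API. A minimal sketch; the host, collection name, and shard/replica counts are hypothetical:

```python
import requests

SOLR_URL = "http://10.0.0.4:8983/solr"  # hypothetical: any node of the cluster

# Create a collection spread across the cluster; ZooKeeper coordinates the rest.
resp = requests.get(f"{SOLR_URL}/admin/collections", params={
    "action": "CREATE",
    "name": "products",        # hypothetical collection name
    "numShards": 2,
    "replicationFactor": 2,
    "wt": "json",
})
resp.raise_for_status()
print(resp.json()["responseHeader"])
```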
If you are interested, you could also check out Solr-as-a-Service or Hosted Solr solutions as they remove the headache of setting up SolrCloud on Azure. There's a lot that goes into running, managing and scaling a search infrastructure and companies like Measured Search help reduce time and effort spent on doing that. You get back that time in developing features and functionality that your applications or products need.
More specifically, if you are doing it yourself, it can take many days to weeks to give the cluster the proper love and care it needs. Here's a paper that goes into the details of the comparison between doing it yourself and using a Solr-as-a-Service solution.
https://www.measuredsearch.com/white-papers/why-solr-service-is-better-than-diy-solr-infrastructure/
Full disclosure: I run product for Measured Search, which offers cloud-agnostic Solr-as-a-Service. Measured Search enables you to stand up a Solr cluster on Azure within minutes.

For the new visitor, there are now two Solr instances available via . We tested them and they are good, but we ended up using the Azure Search service, which so far looks very solid.

I haven't actually tried, but Azure can run Java, so theoretically it should be able to run Solr.
This article ("Run Java with Jetty in Windows Azure") should be useful.
The coordinator for "Lucene.Net on Azure" also claims it should run.
EDIT: The Microsoft Interop team has written a great guide and config tips for running Solr on Azure!

Azure IaaS allows you to create Linux-based VMs, with flavors including Ubuntu, SUSE, and CentOS. These VMs come with local root storage that persists only until the VM is rebooted.
However, you can add additional volumes whose data persists across reboots. Your Solr data can be stored there.

Related

MongoDB vs MongoDB Atlas

I am a new web developer and have some questions regarding MongoDB.
The site I am working on uses references that save data locally with MongoDB. But after doing some research, I saw something called MongoDB Atlas, which saves data to the cloud. I guess my question is: if I were to host a website, would it matter which one I chose? Or would I be restricted to Atlas? And why would someone pick one over the other?
MongoDB Atlas is MongoDB server hosting provided by the same people who make MongoDB (which means they typically know what they're doing). It's handy because everything is configured for you automatically, and you get dashboards, monitoring, backups, upgrades, etc. They also have a free tier (aka M0; it has some important restrictions though, read more on their site). As usual with cloud offerings, the pricing is good for starters, but it can skyrocket if you're operating at significant scale.
If you choose to install MongoDB server "locally", you need to configure the cluster yourself (although there are plenty of pre-configured MongoDB Docker images out there), set up backups, arrange monitoring, etc. A lot of work, if you want to do it properly.
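Either way, the application code barely changes; only the connection string differs. A minimal pymongo sketch (both URIs are hypothetical placeholders):

```python
from pymongo import MongoClient

# Atlas: the mongodb+srv scheme discovers the cluster members via DNS
# (requires the dnspython package).
atlas = MongoClient(
    "mongodb+srv://appuser:secret@cluster0.abc123.mongodb.net/mydb"  # hypothetical URI
)

# Self-hosted: point straight at your own server (or replica set).
local = MongoClient("mongodb://localhost:27017/")

# From here on the code is identical for both.
print(atlas.mydb.list_collection_names())
```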
Considering the above, here is my advice...
Choose MongoDb Atlas when:
You have a small personal project
You're a startup and you believe you will have tens of thousands of users soon; Atlas allows you to bootstrap things fast and for a relatively small cost
You're a medium sized company, you're fine with MongoDb pricing, and you don't expect to grow too much
Choose manual installation of MongoDb when:
You have a small single-server project that is not likely to grow into a multi-server deployment. You can run MongoDB in Docker on the same server; this is usually bad practice in general, but it works fine for small workloads. I've used this setup (as part of a Meteor Up deployment) and it worked fine with thousands of regular users (though it depends on your application's usage patterns).
You're a Unicorn-level startup or bigger
You're building something for internal use and have restrictions on using cloud deployments
Your main servers are not located in the cloud. MongoDB cannot batch requests, so it is very important that your MongoDB server is located in the same datacenter as the backend servers; otherwise latency will kill your performance (the sketch below illustrates why)
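As a rough illustration of that last point: N sequential queries cost N network round trips, so a loop that is near-instant against a local server can take seconds against a remote one. A hedged sketch (host, database, and collection names are hypothetical):

```python
import time
from pymongo import MongoClient

client = MongoClient("mongodb://db.example.com:27017/")  # hypothetical remote host
users = client.mydb.users

start = time.perf_counter()
for i in range(100):
    users.find_one({"_id": i})  # each call is one full network round trip
elapsed = time.perf_counter() - start

# At ~1 ms latency this loop finishes almost instantly; at ~50 ms
# cross-datacenter latency it spends ~5 s just waiting on the wire.
print(f"100 sequential queries took {elapsed:.2f}s")
```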

Kubernetes and Cloud Databases

Could someone explain the benefits/issues with hosting a database in Kubernetes via a persistent volume claim combined with a storage volume over using an actual cloud database resource?
It's essentially a trade-off: convenience vs. control. Take a concrete example: let's say you pay Amazon money to use Athena, which is really just a nicely packaged version of Facebook's Presto that AWS kindly operates for you in exchange for $$$. You could run Presto on EKS yourself, but why would you?
Now, let's say you want or need to use Apache Drill or Apache Impala. Amazon doesn't offer them. Nor do any of the other big public cloud providers at the time of writing, as far as I know.
Another thought: what if you want to migrate off of AWS? Your data has gravity as well.
Could someone explain the benefits/issues with hosting a database in Kubernetes ... over using an actual cloud database resource?
As previous excellent answer noted:
It's essentially a trade-off: convenience vs control
In addition to the previous example (Athena), take a look at RDS as well and see what you would need to handle yourself (though why would you, as said already):
Automatic backups
Multizone deployments
Snapshots
Engine upgrades
Read replicas
and other bells and whistles that come with a managed service as opposed to a self-hosted/managed one.
But there is more to it than just convenience/control, which is what this post is trying to shed light on:
Kubernetes adds another layer of abstraction (pods, services...), and depending on how storage is handled (persistent volumes; see the sketch after this list), you have two additional considerations:
Access speed (depending on your use case this can be negligible or a showstopper).
The storage you have at hand might not be optimized for relational-database I/O (or it might restrict your ability to schedule pods efficiently). These are the same reasons you are advised not to run a database on NFS, for example.
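For concreteness, here's a minimal sketch of claiming storage for a database pod with the official Kubernetes Python client. The claim name and storage class are hypothetical, and the storage class is precisely where the I/O caveats above come into play:

```python
from kubernetes import client, config

config.load_kube_config()  # or config.load_incluster_config() inside a pod
core = client.CoreV1Api()

pvc = client.V1PersistentVolumeClaim(
    metadata=client.V1ObjectMeta(name="mysql-data"),  # hypothetical claim name
    spec=client.V1PersistentVolumeClaimSpec(
        access_modes=["ReadWriteOnce"],
        # The storage class decides what actually backs the volume; a slow or
        # network-attached class is exactly where database I/O suffers.
        storage_class_name="managed-premium",  # hypothetical class
        resources=client.V1ResourceRequirements(requests={"storage": "50Gi"}),
    ),
)
core.create_namespaced_persistent_volume_claim(namespace="default", body=pvc)
```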
There have been several recent conference talks on Kubernetes pointing out that databases are a big no-no for Kubernetes (although this is highly opinionated; we do run average-load MySQL and PostgreSQL databases in k8s), and large-load/fast-I/O workloads are somewhat of a challenge to get right on k8s, as opposed to a managed cloud solution where somebody has already fine-tuned everything for you.
In conclusion:
It is all about convenience, control, and capabilities.

Clustering several instances of Apache Zeppelin?

I have been testing Apache Zeppelin to query several sources hosted on Apache Drill and then creating charts to analyze our data.
Since the product seems robust enough, I have been planning to roll this solution out to our analyst team for monitoring and discovering business data.
The problem I face now is that a single Zeppelin instance will not be enough to handle our users concurrently (and, thinking about HA, relying exclusively on one host is not a good idea). I have already built an Apache Drill cluster able to handle the traffic volumes, but I couldn't find anything in the documentation on how to build a cluster of several Zeppelin instances that share their notebooks and user sessions behind a load balancer.
Can you advise whether what I'm trying to do is possible? If so, can you point me in the right direction?
Thanks
EDIT: I've been playing with a Zeppelin 0.8.0 snapshot and the MongoDB integration for storing notebooks. Although it can write new notebooks and update existing ones, other connected Zeppelin instances will only pick up the changes after a restart.

Create App Engine project via API

I would need to automate the creation of new App Engine projects. Is this possible? I see there is a Google Cloud SQL Admin API which can create new Cloud SQL instances, but what about App Engine? Is there anything similar?
Update:
We have developed an application that runs on GAE and uses Cloud SQL and plenty of API integration with most of Google Apps. We foresee dozens, if not hundreds, of customers in the near future. All of them will be using their own Google domain and Google Apps.
While we could just deploy the application in our App Engine and modify the Cloud SQL tables to include the ID of the customer who owns each record, we thought it would be better to deploy an app instance and a Cloud SQL database for each of them (under our own account). The main reasons that come to mind are that we can track how much every customer spends in terms of billing, and that we speed up the database, since Cloud SQL is just a MySQL instance.
Steps for the creation would require editing a properties file in the packaged .war file, adding the certificate used to log in as a service account, and probably something that I am missing at this moment :-P
This question is somewhat related: Create an App Engine Project ID command line
As far as I know this is not possible (and is unlikely to be possible anytime soon).
Update:
I can see why splitting into separate projects for billing purposes would be really nice (multi-tenancy is great, but getting one bill per customer from Google sounds easier), but unfortunately I don't think that it's going to be your best option.
For App Engine, you may want to look into the multi-tenancy features (for Java or Python) and how to get stats for billing.
Keep in mind, however, that Cloud SQL is not simply a MySQL instance. It happens to speak MySQL, but it is not the same as running MySQL on Compute Engine, for example. I would recommend that you run some benchmarks to be sure before deciding that the "adding the customer ID to the table" idea you had won't work.
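Such a benchmark can be as simple as timing the tenant-filtered queries your app would actually run. A hedged sketch using mysql-connector-python; the connection details, table, and column are hypothetical, and you'd want an index on customer_id:

```python
import time
import mysql.connector  # pip install mysql-connector-python

# Hypothetical connection details for a Cloud SQL instance.
conn = mysql.connector.connect(
    host="203.0.113.10", user="app", password="...", database="appdb"
)
cur = conn.cursor()

start = time.perf_counter()
for _ in range(1000):
    # The shared-table design adds this WHERE clause to every query.
    cur.execute("SELECT * FROM orders WHERE customer_id = %s LIMIT 50", (42,))
    cur.fetchall()
print(f"1000 tenant-filtered reads took {time.perf_counter() - start:.2f}s")
```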
Lastly, a possibly relevant read: http://signalvnoise.com/posts/1509-mr-moore-gets-to-punt-on-sharding
I guess the conclusion is that there’s no use in preempting the technological progress of tomorrow. Machines will get faster and cheaper all the time, but you’ll still only have the same limited programming resources that you had yesterday.
If you can spend them on adding stuff that users care about instead of prematurely optimizing for the future, you stand a better chance of being in business when that tomorrow finally rolls around.

Solr in a multi-tenant environment

I am considering using Solr in a multi-tenant application and I am wondering if there are any best practices or things I should watch out for?
One question in particular is would it make sense to have a Solr Core per tenant. Are there any issues with have a large number of Solr Cores?
I am considering use a core per tenant because I could secure each core separately.
Thanks
Solr cores are an excellent idea for multi-tenancy, particularly as they can be managed at runtime (so no server restart is required). You shouldn't run into too many performance problems from having multiple Solr cores, but be aware that the performance of one core will be affected by the work on the others: they are probably going to share the same disk.
I can see why you might want to give direct API access, for example if each "user" is a Drupal site or similar in a shared-hosting type environment. The best approach would be to secure the different URLs: e.g. if you had /solr/admin/cores, /solr/client1 for one client core, and /solr/client2 for another, you would have three different authentications, one for your admin and one for each tenant. This is done in the container (Jetty, Tomcat, etc.); take a look at the general Solr security page: http://wiki.apache.org/solr/SolrSecurity - you'll want to set up basic access login for each path in the same way.
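For reference, creating a per-tenant core at runtime goes through the CoreAdmin API. A minimal sketch, assuming a pre-built instance directory with the core's config already on disk (the host and tenant name are hypothetical):

```python
import requests

SOLR_URL = "http://localhost:8983/solr"  # hypothetical host

def create_tenant_core(tenant: str) -> None:
    # CoreAdmin API: creates a new core at runtime, no restart required.
    resp = requests.get(f"{SOLR_URL}/admin/cores", params={
        "action": "CREATE",
        "name": tenant,
        "instanceDir": tenant,  # assumes conf/ already laid out in this directory
        "wt": "json",
    })
    resp.raise_for_status()

create_tenant_core("client1")  # hypothetical tenant name
```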
You would no more use a separate table in a database for each tenant than you would a solr core for each tenant.
If you think of a core like a database table and organize your project in such a way that each core represents an object in your problem space, then you can better leverage Solr.
Where Solr shines is when you need to index text and then search it quickly. If you are not doing that, you might as well use a relational database.
Also, regarding your question about securing Solr for each tenant: I hope you're not suggesting that your logged-in users access the Solr output directly? Your users should not be able to directly access your Solr instance.
Good luck.
That's OK, but you won't be able to use the built-in caches properly for your requirements. You can add a permission bit to each document and change the query component so that it filters results according to that permission; bitwise operations are also available for this. Make use of this for your needs.
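Reading the suggestion above as "store a tenant/permission field on each document and filter every query by it", a hedged sketch from the client side might look like this. The core name and permission field are hypothetical; the fq filter is cached in Solr's filterCache independently of the main query:

```python
import requests

SOLR_URL = "http://localhost:8983/solr/shared-core"  # hypothetical shared core

def search_as_tenant(q: str, permission_bit: int) -> dict:
    resp = requests.get(f"{SOLR_URL}/select", params={
        "q": q,
        # Restrict results to the tenant's permission; fq filters are cached
        # separately from the main query in Solr's filterCache.
        "fq": f"permission:{permission_bit}",  # hypothetical field
        "wt": "json",
    })
    resp.raise_for_status()
    return resp.json()
```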
