Azure Search Service Geo-replication support - azure-cognitive-search

We are using Azure Search Service in our application. From the documentation we know that Azure Search Service does not support geo-replication out of the box. We want to avoid a single point of failure around a single instance of the Azure Search service. We have a couple of questions around this; please clarify them below.
Is an out-of-the-box geo-replication feature planned for Azure Search Service in the future? If yes, can you please share the ETA?
If we maintain at least three replicas, does that mean each instance will be deployed in a different zone of the same region automatically? In that case, if one zone goes down, will requests be served by the other zones without any manual intervention?
We are trying to find the best way to avoid a single point of failure.

Unfortunately we do not currently support out of the box Geo Replication and do not have an ETA for when this will be available.
[Edited]: Sorry about my earlier response. I see that your question was about replicas spanning availability zones, not regions. Replicas are automatically spread across zones in regions where Azure supports availability zones.
We have more information on this topic here: https://learn.microsoft.com/en-us/azure/search/search-performance-optimization
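The multi-region pattern that article describes boils down to running a separate search service per region and keeping them in sync yourself (either by re-running indexers in each region or by pushing every write to all services). Below is a minimal sketch of the dual-write variant, assuming the azure-search-documents Python package; the endpoints, keys, and index name are placeholders.

```python
# Dual-write sketch: push each document to a search service in two regions,
# so reads can fail over if one region is unavailable.
# Assumes the azure-search-documents package; endpoints/keys are placeholders.
from azure.core.credentials import AzureKeyCredential
from azure.search.documents import SearchClient

REGIONS = {
    "eastus": ("https://mysearch-eastus.search.windows.net", "<admin-key-east>"),
    "westus": ("https://mysearch-westus.search.windows.net", "<admin-key-west>"),
}

clients = [
    SearchClient(endpoint=endpoint,
                 index_name="articles",
                 credential=AzureKeyCredential(key))
    for endpoint, key in REGIONS.values()
]

def index_document(doc: dict) -> None:
    """Upload the same document to every regional service."""
    for client in clients:
        client.merge_or_upload_documents(documents=[doc])

index_document({"id": "1", "title": "hello", "body": "geo-redundant indexing"})
```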
Liam

Related

Cloud solutions, e.g. Snowflake, for server data filling up fast

I'm new to development, and currently I have a problem with server data filling up rapidly. I'm looking at solutions such as watcher programs to help me detect when the server data is reaching its limit, but I wanted to know whether cloud solutions could help in this regard. I also wanted to know whether companies such as Snowflake can help handle fast-growing data, how a developer would use them, and whether this approach would be too costly from an enterprise point of view.
I have tried to look up Snowflake's documentation, but I am unable to reach any conclusion as to whether it can help me. I could only find articles about storage saying that data is stored compressed, so I wanted more clarity on this solution.
Snowflake stores data using cloud storage services (AWS S3, Google Cloud Storage, or Microsoft Azure), so you can't fill up the server storage under normal conditions (I've never heard of S3 being full in any region).
Check the pricing page to see if it will be costly for you (or not):
https://www.snowflake.com/pricing/
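To give a concrete idea of how a developer actually uses it: you load data into Snowflake and query it over a connection, while the storage itself lives in the cloud object store and grows as needed. A minimal sketch, assuming the snowflake-connector-python package (the account and credentials are placeholders):

```python
# Query Snowflake from Python and list the largest tables by stored bytes.
# Assumes the snowflake-connector-python package; credentials are placeholders.
import snowflake.connector

conn = snowflake.connector.connect(
    account="<account_identifier>",
    user="<user>",
    password="<password>",
    warehouse="COMPUTE_WH",
    database="ANALYTICS",
    schema="PUBLIC",
)
try:
    cur = conn.cursor()
    cur.execute(
        "SELECT table_name, bytes "
        "FROM information_schema.tables "
        "ORDER BY bytes DESC NULLS LAST LIMIT 10"
    )
    for name, size in cur.fetchall():
        print(name, size)
finally:
    conn.close()
```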

Azure Search - size maxed - any options?

The Azure Search service maxes out at 300 GB of data. As of today, we've exceeded that. Our database table consists mainly of unstructured text from website news articles around the world.
Do we have any options at all here? We like Azure Search and have built our entire back-end infrastructure around it, but now we're dead in the water with being able to add any more documents to it. Does Azure Search allow compression on the documents?
Azure Search offers a variety of SKUs. The biggest one allows you to index up to 2.4 TB per service. You can find more details here.
Note that changing SKUs requires re-indexing the data.
We don't provide data compression. If you'd like to talk to Azure Search program managers about your capacity requirements, feel free to reach out to #liamca.
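Before deciding on a SKU change, it may help to see how close the service already is to its storage quota. The Service Statistics endpoint returns usage-versus-quota counters (documents, indexes, storage, and so on); a rough sketch with Python's requests, where the service name, admin key, and api-version are placeholders:

```python
# Print the usage/quota counters for an Azure Search service.
# The service URL, admin key, and api-version are placeholders.
import requests

SERVICE = "https://<service>.search.windows.net"
resp = requests.get(
    f"{SERVICE}/servicestats",
    params={"api-version": "2023-11-01"},
    headers={"api-key": "<admin-key>"},
)
resp.raise_for_status()
for name, counter in resp.json()["counters"].items():
    print(name, counter)  # each counter carries usage and quota
```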

Is it possible to get data from other companies' databases?

I was wondering how so many job sites have so many job offers/information regarding other companies' offers. For instance, if I were to start my own job searching engine, how would I be able to get the information that sites like indeed.com have in my own databases? One site (jobmaps.us) says that it's "powered by indeed" and seems to be following the same format as indeed.com (as do all other job searching websites). Is there some universal job searching template that I can use?
Thanks in advance.
Some services offer an API which allows you to "federate" searches (relay them to multiple data sources, then gather all the results together for display in one place). Alternatively, some offer a mechanism that would allow you to download/retrieve data, so you can load it into your own search index.
The latter approach is usually faster and gives you total control but requires you to maintain a search index and track when data items are updated/added/deleted on remote systems. That's not always trivial.
In either case, some APIs will be open/free and some will require registration and/or a license. Most will have rate limits. It's all down to whoever owns the data.
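As a rough illustration of the federated approach described above, here is a sketch that queries two entirely hypothetical job APIs in parallel and merges the results; the endpoints, the "results" response field, and the ranking step are made up:

```python
# Federated search sketch: relay one query to several (hypothetical) job APIs,
# then gather all results together for display in one place.
import concurrent.futures

import requests

SOURCES = [
    "https://api.example-jobs-a.test/search",   # hypothetical endpoint
    "https://api.example-jobs-b.test/search",   # hypothetical endpoint
]

def query_source(url: str, term: str) -> list:
    resp = requests.get(url, params={"q": term}, timeout=5)
    resp.raise_for_status()
    return resp.json().get("results", [])

def federated_search(term: str) -> list:
    with concurrent.futures.ThreadPoolExecutor() as pool:
        batches = pool.map(lambda url: query_source(url, term), SOURCES)
    # A real site would de-duplicate, normalise fields, and rank here.
    return [job for batch in batches for job in batch]

print(federated_search("python developer"))
```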
It's possible to emulate a user browsing a website, sending HTTP requests and analysing the response from a web server. By knowing the structure of the HTML, it's possible to extract ("scrape") the information you need.
This approach is often against site policies and is likely to get you blocked. If you do go for this approach, ensure that you respect any robots.txt policies to avoid being blacklisted.
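As a small illustration of checking robots.txt before fetching anything (the URL and user agent are placeholders, and real scraping should also respect the site's terms of use):

```python
# Check robots.txt before scraping a page; skip the page if it is disallowed.
import urllib.robotparser
from urllib.parse import urlparse, urljoin

import requests

USER_AGENT = "my-crawler"                 # placeholder agent name
page_url = "https://example.com/jobs"     # placeholder page

parts = urlparse(page_url)
robots_url = urljoin(f"{parts.scheme}://{parts.netloc}", "/robots.txt")

rp = urllib.robotparser.RobotFileParser()
rp.set_url(robots_url)
rp.read()

if rp.can_fetch(USER_AGENT, page_url):
    html = requests.get(page_url, headers={"User-Agent": USER_AGENT}, timeout=10).text
    # Parse the HTML here (e.g. with BeautifulSoup) to extract the fields you need.
else:
    print("Disallowed by robots.txt; skipping", page_url)
```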

Resolving Overloaded Webserver Issues

I am new to the area of web development and am currently interviewing with companies. The favorite questions people ask are:
How do you scale your web server if it starts hitting a million queries?
What would you do if you have just one database instance running at that time? How do you manage that?
These questions are really interesting and I would like to learn about them.
Please share your suggestions and the practices you follow for such scenarios.
Thank you
How to scale:
Identify your bottlenecks.
Identify the correct solution for the problem.
Check to see if you can implement the correct solution.
Identify an alternate solution and check again.
Typical Scaling Options:
Vertical Scaling (bigger, faster server hardware)
Load balancing
Split tiers/components out onto more/other hardware
Offload work through caching/cdn
Database Scaling Options:
Vertical Scaling (bigger, faster server hardware)
Replication (active or passive)
Clustering (if DBMS supports it)
Sharding
At the most basic level, scaling web servers consists of writing your app in such a way that it can run on > 1 machine, and throwing more machines at the problem. No matter how much you tune them, the eventual scaling will involve a farm of web servers.
The database issue is way more sticky to deal with. What is your read / write percentage? What kind of application is this? OLTP? OLAP? Social Media? What is the database? How do we add more servers to handle the load? Do we partition our data across multiple dbs? Or replicate all changes to loads of slaves?
Your questions raise more questions; in an interview, if someone just "has the answer" to a generic question like the one you've posted, then they only know one way of doing things, and that way may or may not be the best one.
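To make the "partition our data across multiple dbs" option above concrete: the usual approach is to route each record to a database by hashing a stable key. A toy sketch (the shard hosts are placeholders, and a real system also has to handle resharding and cross-shard queries):

```python
# Toy hash-based sharding: route each user to one of N database hosts.
import hashlib

SHARDS = ["db0.internal", "db1.internal", "db2.internal", "db3.internal"]

def shard_for(user_id: str) -> str:
    """Pick a shard deterministically from the user id."""
    digest = hashlib.sha1(user_id.encode("utf-8")).hexdigest()
    return SHARDS[int(digest, 16) % len(SHARDS)]

# All reads and writes for this user go to the same database host.
print(shard_for("user-42"))
```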
There are a few approaches I'd take to the first question:
Are there hardware upgrades that may get things up enough to handle the million queries in a short time? If so, this is likely an initial point to investigate.
Are there software changes that could be made to optimize the performance of the server? I know IIS has a ton of different settings that could be used to improve performance to some extent.
Consider going to a web farm rather than using a single server. I actually had a situation at a previous job where we had millions of hits a minute, and it was thrashing our web servers rather badly and taking down a number of sites. Our solution was to change the load balancer so that a few of the servers served up the site that was thrashing them, so the other servers could keep the other sites up; this was in the fall, and in retail that is your big quarter. While some would start here, I'd likely come to this last, as it can open a bigger can of worms than the other two options.
As for the database instance, it would be a similar set of options to my mind, though I might do the multi-server option first, since redundancy may be an important side benefit here, and I'm not sure it is as easy to get with a web server. I may be way off, but that is how I'd initially tackle this.
Use a caching proxy
If you serve identical pages to all visitors (say, a news site) you can reduce load by an order of magnitude by caching generated content with a caching proxy such as Varnish or Apache Traffic Server.
The proxy will sit between your server and your visitors. If you get 10,000 hits to your front page, it will only have to be generated once; the proxy will send the same response to the other 9,999 visitors without asking your app server again.
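To show the idea in miniature (a production setup would use Varnish or Apache Traffic Server rather than anything hand-rolled): a tiny caching proxy, using only the Python standard library, that asks the app server for a page once and replays the cached body to every later visitor. The backend address is a placeholder, and real proxies also handle cache expiry and headers:

```python
# Minimal caching-proxy sketch: generate each page once, replay it afterwards.
from http.server import BaseHTTPRequestHandler, HTTPServer
from urllib.request import urlopen

BACKEND = "http://localhost:8000"   # placeholder app-server address
cache = {}                          # path -> cached response body

class CachingProxy(BaseHTTPRequestHandler):
    def do_GET(self):
        if self.path not in cache:
            # First request for this path: ask the app server and remember the body.
            with urlopen(BACKEND + self.path) as resp:
                cache[self.path] = resp.read()
        body = cache[self.path]
        self.send_response(200)
        self.send_header("Content-Length", str(len(body)))
        self.end_headers()
        self.wfile.write(body)

if __name__ == "__main__":
    HTTPServer(("", 8080), CachingProxy).serve_forever()
```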
Probably, before developers start building the system, they will consider the specification of the server.
You may also be able to reduce load by limiting SEO and blocking search engines from crawling the site (crawling can consume a lot of resources), and by keeping everything well indexed so that searches stay cheap.
Deploy it in the cloud, and make sure your web server and web app are cloud-ready and can scale across different nodes. I recommend the Cherokee web server (it is very easy to load-balance across different servers, and benchmarks show it to be faster than Apache). For example, Google's cloud (appspot) requires your web app to be in Python or Java.
Use a caching proxy, e.g. Nginx.
For the database, use memcached for queries that are likely to be repeated (see the sketch below).
If the company wants the data to be private, build a private cloud. Ubuntu is doing a very good job at this, fully free and open source: http://www.ubuntu.com/cloud/private
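A rough sketch of the memcached suggestion above, assuming the pymemcache package, a memcached daemon on localhost, and a placeholder run_query standing in for the real database call:

```python
# Cache repeated database queries in memcached so they skip the DB round trip.
import hashlib
import json

from pymemcache.client.base import Client

cache = Client(("localhost", 11211))

def run_query(sql: str) -> list:
    return []  # placeholder for the real (expensive) database query

def cached_query(sql: str, ttl: int = 300) -> list:
    key = "q:" + hashlib.md5(sql.encode("utf-8")).hexdigest()
    hit = cache.get(key)
    if hit is not None:
        return json.loads(hit)              # served from memcached
    rows = run_query(sql)
    cache.set(key, json.dumps(rows), expire=ttl)
    return rows
```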

Running Solr on Azure

Can Solr be run on Azure?
I know this thread is old, but I wanted to share our two cents. We are running SOLR on Azure with no big problems. We've created a startup task to install Java and create the deployment, and we have a SOLR instance on each web role. From there on, it's a bit of magic figuring out the master/slave configuration, but we've solved that too.
So yes, it can be done with a little bit of work. Most importantly, the startup task is key. The index doesn't have to be stored anywhere but on local disk (a Local Resource), because indexing is part of the role startup. If you need to speed that up and a difference of a few minutes is acceptable, you can have the master sync the index to a blob storage copy every once in a while (see the sketch below). But in that case you need to implement a voting algorithm so that the SOLR instances don't override each other.
We'll be posting info on our blog, but I don't want to post links in answers for old threads because I'll look like a spammer :o)
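A rough sketch of the blob-sync idea mentioned above, assuming the azure-storage-blob Python package; the connection string, container name, and local index path are placeholders:

```python
# Upload the local Solr index files to blob storage, overwriting the last snapshot.
# Assumes the azure-storage-blob package; all names/paths are placeholders.
import os

from azure.storage.blob import BlobServiceClient

service = BlobServiceClient.from_connection_string("<storage-connection-string>")
container = service.get_container_client("solr-index-backup")

INDEX_DIR = r"C:\solr\data\index"   # placeholder local index path

def push_index_snapshot() -> None:
    """Copy every index segment file to the backup container."""
    for name in os.listdir(INDEX_DIR):
        path = os.path.join(INDEX_DIR, name)
        if os.path.isfile(path):
            with open(path, "rb") as fh:
                container.upload_blob(name=name, data=fh, overwrite=True)

push_index_snapshot()
```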
A bit of a dated question, but I wanted to provide an updated answer. You can run Apache Solr on Azure. Azure offers IaaS (Infrastructure as a Service), which is raw virtual machines running Linux/Windows. You can choose to set up your entire Solr cluster on a set of VMs and configure SolrCloud and ZooKeeper on them.
If you are interested, you could also check out Solr-as-a-Service or Hosted Solr solutions as they remove the headache of setting up SolrCloud on Azure. There's a lot that goes into running, managing and scaling a search infrastructure and companies like Measured Search help reduce time and effort spent on doing that. You get back that time in developing features and functionality that your applications or products need.
More specifically, if you are doing it yourself, it can take many days to weeks to give the proper love and care it needs. Here's a paper that goes into the details of the comparison between doing it yourself and utilizing a Solr-as-a-Service solution.
https://www.measuredsearch.com/white-papers/why-solr-service-is-better-than-diy-solr-infrastructure/
Full disclosure, I run product for Measured Search that offers Cloud Agnostic Solr-as-a-Service. Measured Search enables you to standup a Solr Cluster on Azure within minutes.
For new visitors: there are now two Solr offerings available via . We tested them and they are good, but we ended up using the Azure Search service, which so far looks very solid.
I haven't actually tried, but Azure can run Java, so theoretically it should be able to run Solr.
This article ("Run Java with Jetty in Windows Azure") should be useful.
The coordinator for "Lucene.Net on Azure" also claims it should run.
EDIT : The Microsoft Interop team has written a great guide and config tips for running Solr on Azure!
Azure IaaS allows you to create Linux-based VMs, with flavors including Ubuntu, SUSE and CentOS. These VMs come with local root storage that persists only until the VM is rebooted.
However, you can add additional volumes on which data will persist even through reboots. Your solr data can be stored here.
