How to design multiple concurrent imports using DIH in SOLR? - solr

There is a case where an external application needs to send an unknown number of different indexing requests to SOLR. Those requests should be processed by SOLR Data Import Handlers according to the config submitted inside each request.
There is a SOLR constraint: a particular DIH can process only one indexing request at a time.
Because the number of requests can be quite large and they arrive in parallel, it is impractical to define multiple DIH specifications in solrconfig.xml.
How can that problem be overcome?
Does SOLR perhaps provide an admin API to create DIH specifications dynamically from a client?

The best way to do this is to create a layer outside of Solr that handles your import tasks. Using DIH will limit what you can do (as you've discovered), and will be hard to make work properly in parallel across multiple nodes and indexing services (it's designed for a far simpler scenario).
Using a simple queue (Redis, Celery, ActiveMQ, whatever fits your selection of languages and technology) that the external application can put requests into, and that your indexing workers pick up tasks from, will be scalable and customizable. It'll allow you to build out onto multiple index nodes as the number of tasks grows, and it'll allow you to pull data from multiple sources as necessary (and apply caching if required).
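As a rough illustration of the queue-plus-workers idea, here is a minimal in-process sketch in Python. It uses the standard library's queue and threads in place of a real broker such as Redis, and run_import is a hypothetical stand-in for whatever actually reads the source data and posts documents to Solr; the task fields (source, collection) are likewise made up for the example.

```python
import queue
import threading


def run_import(task):
    """Hypothetical stand-in for the real work: fetch data, post it to Solr."""
    return f"indexed {task['source']} into {task['collection']}"


def worker(tasks, results):
    # Each worker keeps pulling tasks until it receives the None sentinel.
    while True:
        task = tasks.get()
        try:
            if task is None:
                return
            results.put(run_import(task))
        finally:
            tasks.task_done()


def process(requests, num_workers=4):
    """Run all import requests on a fixed pool of worker threads."""
    tasks = queue.Queue()
    results = queue.Queue()
    threads = [
        threading.Thread(target=worker, args=(tasks, results))
        for _ in range(num_workers)
    ]
    for thread in threads:
        thread.start()
    for request in requests:
        tasks.put(request)
    for _ in threads:
        tasks.put(None)  # one sentinel per worker so every thread exits
    for thread in threads:
        thread.join()
    # All workers have finished, so qsize() is stable here.
    return [results.get() for _ in range(results.qsize())]


if __name__ == "__main__":
    demo = [{"source": f"db{i}", "collection": "products"} for i in range(5)]
    for line in sorted(process(demo, num_workers=2)):
        print(line)
```

The number of workers caps how many imports run at once, no matter how many requests the external application submits. With a real broker the same shape applies, except that the workers can run on separate machines.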

Related

Load balancing and indexing in SolrCloud

I have some questions regarding SolrCloud:
1. If I send a request directly to a solr node, which belongs to a solr cluster, does it delegate the query to the zookeeper ensemble to handle it?
2. I want to have a single url to send requests to SolrCloud. Is there a better way of achieving this than setting up an external load balancer, which balances directly between individual solr nodes? If 1 isn't true, this approach seems like a bad idea. On top of that, I feel like it would somewhat defeat the purpose of the zookeeper ensemble.
3. There is an option to break up a collection into shards. If I do so, how exactly does SolrCloud decide which document goes to which shard? Is there a need and/or an option to configure this process?
4. What happens if I send a collection of documents directly to one of the solr nodes? Would the data set somehow distribute itself across the shards evenly? If so, how does it happen?
Thanks a lot!
1. Zookeeper "just" keeps configuration data available for all nodes - i.e. the state of the cluster, etc. It does not get any queries "delegated" to it; it's just a way for Solr nodes and clients to know which collections are handled by which nodes in the cluster, and to have that information stored in a resilient and available manner (i.e. delegate the hard part of managing a cluster to Zookeeper).
2. The best option is to use a cloud-aware Solr client - it will connect to any of the available Zookeeper nodes given in its configuration, retrieve the cluster state and connect directly to one of the nodes that has the information it needs (i.e. the collection it needs to query). If you can't do that, you can either load balance with an external load balancer across all nodes in your cluster or let the client load balance if the client you use supports round robin, etc. - but having an external load balancer gives you other gains (such as being able to remove a node from load balancing for all clients at the same time, having dedicated http caching in front of the nodes, etc.) for a bit more administration.
3. It will use the unique id field to decide which node a given document should be routed to. You don't have to configure anything, but you can tell Solr to use a specific field or a specific prefix of a field, etc. as the route key. See Document Routing for specific information. It allows you to make sure that all documents that belong to a specific client/application are placed on the same node (which is important for some calculations and possible operations).
4. It gets routed to the correct node. Whether the distribution is even depends on your routing key, but by default, it'll be about as even as you can get it.
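To make the routing idea concrete, here is a simplified Python sketch of hash-based routing with an optional route-key prefix. It is not Solr's implementation: crc32 and the modulo step are stand-ins chosen for the example (Solr uses its own hash function and assigns hash ranges to shards), and the "prefix!id" convention is only modelled loosely.

```python
import zlib


def route_key(doc_id):
    # For ids of the form "prefix!rest", only the prefix decides the shard
    # in this sketch; otherwise the whole id is used.
    return doc_id.split("!", 1)[0] if "!" in doc_id else doc_id


def shard_for(doc_id, num_shards):
    # crc32 is an assumption for illustration, not the hash Solr uses.
    digest = zlib.crc32(route_key(doc_id).encode("utf-8"))
    return digest % num_shards


if __name__ == "__main__":
    for doc_id in ["clientA!1", "clientA!2", "clientB!1", "doc42"]:
        print(doc_id, "-> shard", shard_for(doc_id, 4))
```

Documents sharing a prefix always map to the same shard here, which is the property the answer describes. How evenly the documents spread depends entirely on how varied the route keys are.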

Which SOLR server should a distributed request be sent to when specifying shards in the URL?

I am setting up a distributed search with shards in SOLR.
Which server should I send this request to? or does it not matter?
host1:8983/solr/core?q=*:*&shards=host1:8983/solr/core,host2:8983/solr/core
vs
host2:8983/solr/core?q=*:*&shards=host1:8983/solr/core,host2:8983/solr/core
Similarly, would it be a better idea to have a separate empty solr server to direct these searches to instead of using one of the shards?
Unless you're seeing performance issues I wouldn't be too concerned about the difference between those two. The query will run on both servers anyway; the only difference is which server is responsible for merging the results and returning them to the client. If you want to spread this load across both servers, that's fine - in that case I'd alternate between the two in a round robin manner (for example by placing an HTTP load balancer in front or letting your Solr library load balance between the available servers).
Once you add replicas to the mix it becomes harder, and that is where a load balancer will be useful. In that case it might be a good idea to look into Solr in cloud mode instead, where Solr will handle all this for you transparently (both load balancing and replica balancing, as long as your library is Zookeeper aware).
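If you'd rather alternate on the client side than set up a load balancer, a small sketch of the round robin idea could look like the following. The host names come from the question; the /select handler path and the helper name make_url_builder are assumptions for the example.

```python
from itertools import cycle
from urllib.parse import urlencode

SHARDS = ["host1:8983/solr/core", "host2:8983/solr/core"]


def make_url_builder(shards):
    """Return a function that builds query URLs, rotating the coordinating server."""
    coordinators = cycle(shards)
    shards_param = ",".join(shards)

    def build(query):
        coordinator = next(coordinators)  # the server that merges the results
        params = urlencode({"q": query, "shards": shards_param})
        # "/select" is assumed to be the search handler path.
        return f"http://{coordinator}/select?{params}"

    return build


if __name__ == "__main__":
    build = make_url_builder(SHARDS)
    for _ in range(3):
        print(build("*:*"))
```

Every request still lists both shards; only the server doing the merge changes from one call to the next.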

Load balance solr search

I am trying to implement search in datastax cassandra using solr. I have two nodes running both cassandra and solr. I am able to perform solr search using solrj. However, I have hardcoded the solr url of one of the nodes. I would like to know what configuration/code change I need to make so that solr nodes can be chosen directly.
At this stage, I am reading solrUrl from an external file and passing it as an argument to HttpSolrServer.
HttpSolrServer solrServer = new HttpSolrServer(solrUrl);
External file contains solrUrl
Solr.URL=http://192.168.100.12:8983/solr/
Also, what improvements can I make to the existing approach?
You can use the LBHttpSolrServer (remember: only use it for querying), which allows you to provide several servers that SolrJ will use to distribute its queries.
If you have a SolrCloud cluster, you can use the ZooKeeper-aware server in SolrJ (CloudSolrServer) to get your queries automagically distributed.
Third, you can set up a regular HTTP load balancer (such as haproxy, varnish, etc.) to distribute the requests for you and handle new servers coming online and servers disappearing.
You could also read a random line from the file instead of one specific server, or use a separator in the configuration line, split on that separator and pick a server at random. It won't allow you to dynamically adjust the weights depending on query times (which an HTTP load balancer could do), but it would probably work Good Enough.
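The separator-and-random-pick idea is language independent, so here is a small Python sketch of it. The Solr.URL key comes from the question; the comma separator and the second address (192.168.100.13) are assumptions made up for the example.

```python
import random


def parse_servers(config_line, separator=","):
    """Split 'Solr.URL=url1,url2' into a list of server URLs."""
    _, _, value = config_line.partition("=")
    return [url.strip() for url in value.split(separator) if url.strip()]


def pick_server(servers, rng=random):
    # Uniform random choice: simple, but with no weighting by response time.
    return rng.choice(servers)


if __name__ == "__main__":
    line = "Solr.URL=http://192.168.100.12:8983/solr/,http://192.168.100.13:8983/solr/"
    servers = parse_servers(line)
    print(pick_server(servers))
```

In the SolrJ code from the question, the picked URL would simply replace the single hardcoded solrUrl value.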

What's the simplest way to process lots of updates to Solr in batches?

I have a Rails app that uses Sunspot, and it is generating a high volume of individual updates which are generating unnecessary load on Solr. What is the best way to send these updates to Solr in batches?
Assuming the changes from the Rails app also update a persistence store, you can look at the Data Import Handler (DIH), which can be run periodically to update the Solr index.
So instead of triggering an update and a commit on Solr for every change, you can decide how often Solr is updated in batches.
However, expect some latency before changes show up in the search results.
Also, are you updating individual records and committing after each one? If you are using Solr 4.0, you can look at soft and hard commits as well.
Sunspot makes indexing a batch of documents pretty straightforward:
Sunspot.index(array_of_docs)
That will send off just the kind of batch update to Solr that you're looking for here.
The trick for your Rails app is finding the right scope for those batches of documents. Are they being created as the result of a bunch of user requests, and scattered all around your different application processes? Or do you have some batch process of your own that you control?
The sunspot_index_queue project on GitHub looks like a reasonable approach to this.
Alternatively, you can always turn off Sunspot's "auto-index" option, which fires off updates whenever your documents are updated. In your model, you can pass in auto_index: false to the searchable method.
searchable auto_index: false do
  # sunspot setup
end
Then you have a bit more freedom to control indexing in batches. You might write a standalone Rake task which iterates through all objects created and updated in the last N minutes and indexes them in batches of 1,000 docs or so. An infinite loop of that should stand up to a pretty solid stream of updates.
At a really large scale, you really want all your updates going through some kind of queue. Inserting your document data into a queue like Kafka or AWS Kinesis for later processing in batches by another standalone indexing process would be ideal for this at scale.
I used a slightly different approach here:
I was already using auto_index: false and processing solr updates in the background using sidekiq. So instead of building an additional queue, I used the sidekiq-grouping gem to combine Solr update jobs into batches. Then I use Sunspot.index in the job to index the grouped objects in a single request.
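Whichever queueing mechanism you pick, the batching step itself is small. Here is a language-neutral sketch in Python of splitting a stream of pending documents into batches of 1,000 or so; send_batch is a hypothetical callback standing in for a single bulk request such as Sunspot.index(array_of_docs).

```python
from itertools import islice


def batches(items, size=1000):
    """Yield lists of at most `size` items from any iterable."""
    iterator = iter(items)
    while True:
        batch = list(islice(iterator, size))
        if not batch:
            return
        yield batch


def index_in_batches(docs, send_batch, size=1000):
    """Call send_batch once per batch and return how many requests were made."""
    sent = 0
    for batch in batches(docs, size):
        send_batch(batch)  # hypothetical: one bulk update request to Solr
        sent += 1
    return sent


if __name__ == "__main__":
    requests_made = index_in_batches(range(2500), lambda b: print(len(b), "docs"))
    print(requests_made, "requests")
```

Because batches works on any iterable, the pending documents can come from a database cursor or a queue consumer without being loaded into memory all at once.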

What are some strategies for updating volatile data in Solr?

What are some strategies for updating volatile data in Solr? Imagine if you needed to model YouTube video data in a Solr index: how would you keep the "views" data fresh without swamping Solr in updates?
I would imagine that storing the "views" data in a different data store (something like MongoDB or Redis) that is better at handling rapid updates would be the best idea.
But what is the best way to update the index periodically with that data? Would a delta-import make sense in this context? What does a delta-import do to Solr in terms of performance for running queries?
First you need to define "fresh".
Is "fresh" 1ms? If so, by the time the value (the rendered html) gets to the browser, it's not fresh anymore, due to network latency. Does that really matter? For the vast majority of cases, no, true real-time results are not needed.
A more common limit is 1s. In that case, Solr can deal with that with RankingAlgorithm (a plugin) or soft commits (currently available in Solr 4.0 trunk only).
"Delta-import" is a term from DataImportHandler that doesn't have much intrinsic meaning. From the point of view of a Solr server, there are only document additions; it doesn't matter where they come from or whether a set of documents represents the "whole" dataset or not.
If you want to have an item indexed within 1s of its creation/modification, then do just that: add it to Solr just after it's created/modified (for example with a hook in your DAL). This should be done asynchronously, using RankingAlgorithm or soft commits.
You might be interested in so-called "near-realtime search", or NRT, now available on Solr's trunk, which is designed to deal with exactly this problem. See http://wiki.apache.org/solr/NearRealtimeSearch for more info and links.
How about using the external file field?
This helps you to maintain data outside of your index in a separate file, which you can refresh periodically without any changes to the index.
For fast-changing data such as downloads, views or rank, this can be a good option.
More info at http://lucene.apache.org/solr/api/org/apache/solr/schema/ExternalFileField.html
This has some limitations, so you would need to check depending upon your needs.
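As a sketch of what the periodic refresh could look like, the Python snippet below rewrites a key=value file from an in-memory map of view counts. The external_views file name and the one "id=value" pair per line layout are my understanding of what ExternalFileField expects, so check them against the linked documentation; the document ids are made up.

```python
import os
import tempfile


def write_external_file(path, values):
    """Write 'doc_id=value' lines, replacing the old file in one step."""
    directory = os.path.dirname(os.path.abspath(path))
    # Write to a temp file in the same directory, then rename it over the
    # target so a reader never sees a half-written file.
    fd, tmp_path = tempfile.mkstemp(dir=directory)
    with os.fdopen(fd, "w", encoding="utf-8") as handle:
        for doc_id in sorted(values):
            handle.write(f"{doc_id}={float(values[doc_id])}\n")
    os.replace(tmp_path, path)


if __name__ == "__main__":
    with tempfile.TemporaryDirectory() as data_dir:
        target = os.path.join(data_dir, "external_views")  # assumed file name
        write_external_file(target, {"video2": 10, "video1": 3})
        with open(target, encoding="utf-8") as handle:
            print(handle.read())
```

The view counts themselves would come from wherever the fast-changing data lives (Redis, MongoDB, etc.), and the file would be regenerated on whatever schedule counts as "fresh" enough.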

Resources