Solr full refresh without deleting index

I want to do a full refresh of a Solr index without deleting it, so that the data remains accessible until the refresh is done. Once the full refresh has completed, the old index has to be removed. How can I do this? Please help.

I would suggest using multiple cores in your Solr implementation: a "live" core and an "ondeck" core, where "live" holds the current index and "ondeck" is the one you refresh into. (Note: you can name the cores anything that is meaningful to you.) Once the refresh has completed, you can issue a SWAP command that switches the two cores in real time without any impact on users (i.e., Solr manages the searches being executed against the cores behind the scenes for you).
We have implemented this exact scenario on a couple of other indexes at my current company with very good success.
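As a rough illustration of the SWAP step, the sketch below only builds the CoreAdmin request URL; it does not contact a server. The base URL and the core names "live" and "ondeck" are assumptions taken from the naming used in this answer, so adjust them to your own setup.

```python
from urllib.parse import urlencode


def build_swap_url(base_url, live_core, ondeck_core):
    """Build the CoreAdmin URL that swaps two cores (no request is sent)."""
    params = urlencode({"action": "SWAP", "core": live_core, "other": ondeck_core})
    return f"{base_url.rstrip('/')}/admin/cores?{params}"


# Assumed base URL of a local Solr instance; replace with your own.
print(build_swap_url("http://localhost:8983/solr", "live", "ondeck"))
```

You would request this URL (for example with curl or urllib.request.urlopen) only after the refresh into "ondeck" has been committed.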

Related

full build solr index with large amount of data

I have a text file containing over 10 million records of web pages.
I want to build a Solr index from this file every day (because the file is updated daily).
Is there an effective way to do a full build of the Solr index in one go, such as using a MapReduce model to speed up the build?
I think using solr api to add document is a little bit slow.
It is not clear how much content is in those 10 million records, but it may actually be simple enough to index them in bulk. Check the commit settings in your solrconfig.xml; you may, for example, have autoCommit configured with a low maxDocs setting. In your case, you may want to disable autoCommit completely and issue a single commit manually at the end.
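A minimal sketch of the "no autoCommit, one commit at the end" idea: it splits the records into batches and plans one JSON update request per batch, followed by a single explicit commit. The batch size, the "/update" path and the document shape are assumptions for illustration; the code only builds request bodies and sends nothing.

```python
import json
from itertools import islice


def batches(records, size):
    """Yield successive lists of at most `size` records."""
    it = iter(records)
    while True:
        chunk = list(islice(it, size))
        if not chunk:
            return
        yield chunk


def plan_bulk_load(records, batch_size=1000):
    """Return (path, body) pairs: uncommitted adds, then one final commit."""
    requests = [("/update", json.dumps(chunk)) for chunk in batches(records, batch_size)]
    requests.append(("/update", json.dumps({"commit": {}})))
    return requests


# Hypothetical documents; a real loader would stream them from the text file.
docs = ({"id": str(i), "url": f"http://example.com/{i}"} for i in range(2500))
plan = plan_bulk_load(docs, batch_size=1000)
print(len(plan))  # three add requests plus one commit
```

Each body would be POSTed with Content-Type application/json; because only the last request commits, searchers keep seeing the previous index state until the whole load is in.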
However, if it is still a bit slow, before going to map-reduce, you could think about building a separate index and then swapping it with the current index.
This way, you keep the previous collection to roll back to and/or to compare against if needed. The new collection can even be built on a different machine and/or closer to the data.

Running a weekly update on a live Solr environment

I have a server which has a Solr Environment hosted on it. I want to run a weekly update of the data that our Solr database contains.
I have a couple of solutions in mind, but I am not sure whether the second one is possible and, if it is, which of the two would be better:
My first solution is to have two servers, each with a Solr environment; while one is updating, you switch the URL used to connect to Solr and connect to the other one.
My other solution is the one I am not sure how to do. Is there a way to switch the data source that a Solr environment looks at without restarting it or cutting off any current searches?
If anyone has any ideas it would be much appreciated.
Depending on the size of the data, you can probably just keep the Solr core running while doing the update. First issue a delete, then index the data and finally commit the changes. The new index state won't be seen before the commit is issued, which allows you to serve the old data while waiting for the indexing to complete.
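The delete, index, commit sequence described above can be sketched as three JSON update bodies sent in order. This is only an illustration under assumptions: the documents are made up, the bodies would be POSTed to the core's update handler, and nothing here talks to a server.

```python
import json


def full_refresh_requests(docs):
    """Return the JSON bodies for a delete-all, re-add, commit cycle, in order."""
    return [
        json.dumps({"delete": {"query": "*:*"}}),  # remove every document (not yet visible)
        json.dumps(docs),                          # add the fresh data
        json.dumps({"commit": {}}),                # make the new state visible in one step
    ]


for body in full_refresh_requests([{"id": "1", "title": "example"}]):
    print(body)
```

Note that this only keeps the old data visible if nothing else commits in between: an autoCommit that opens a new searcher, or a commitWithin on another update, would expose the half-built index.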
Another option is to use the core admin to switch cores, as you mentioned. The procedure is similar to the one for copying data into other cores, except that you skip the mergeindexes command.
If you're also talking about updating and upgrading the actual Solr version or application server while still serving content, having a second server that replicates the index from the master is an easy way to get more redundancy. That way you can keep serving queries from the second server while the first one is being maintained and then do it the other way around. Point your clients to an HTTP load balancer, and take the maintained server out of the list of servers serving requests while it's down. This will also make you resistant against single hardware failures, etc.
There's also the option of setting up SolrCloud, but that might require a bit more restructuring.

Handling large number of ids in Solr

I need to perform an "online" search in Solr, i.e., a user needs to find the list of users who are online and match particular criteria.
How I am handling this: we store the IDs of online users in a table, and I send all online user IDs in the Solr request like
&fq=-id:(id1 id2 id3 ............id5000)
The problem with this approach is that when the list of IDs becomes large, Solr takes too much time to resolve it, and we need to transfer a large request over the network.
One solution could be to use a join in Solr, but the online data changes regularly and I can't reindex that often (say every 5-10 minutes; the interval should be at least an hour).
The other solution I can think of is firing this query internally from Solr based on a certain parameter in the URL. I don't know much about Solr internals, so I don't know how to proceed.
With Solr 4's soft commits, committing has become cheap enough that it might be feasible to store the "online" flag directly in the user record and just add &fq=online:true to your query. That removes the overhead of sending 5000 IDs over the wire and parsing them, and lets Solr optimize the query a bit. Whenever someone logs in or out, set their status and set commitWithin on the update. It's worth a shot, anyway.
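One way to flip the flag without resending the whole user record is an atomic update combined with commitWithin. The sketch below only builds the request path and JSON body. It assumes the unique key field is called "id", that the flag field is called "online" as in the filter query above, and that the schema and update log are set up for atomic updates; the 5000 ms default is an arbitrary example value.

```python
import json
from urllib.parse import urlencode


def online_status_update(user_id, online, commit_within_ms=5000):
    """Return (path, body) for an atomic update of one user's online flag."""
    path = "/update?" + urlencode({"commitWithin": commit_within_ms})
    # {"set": value} asks Solr to replace just this field of the stored document.
    body = json.dumps([{"id": user_id, "online": {"set": online}}])
    return path, body


path, body = online_status_update("u42", True, commit_within_ms=2000)
print(path)
print(body)
```

Calling this from the login and logout handlers keeps the flag fresh, and commitWithin lets Solr group many small updates into fewer commits.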
We worked around this issue by implementing Sharding of the data.
Basically, without going heavily into code detail:
1. Write your own indexing code:
   use consistent hashing to decide which ID goes to which Solr server;
   index each user's data to the relevant shard (a shard can be several machines);
   make sure you have redundancy.
2. Query the Solr shards:
   do sharded queries in Solr using the shards parameter;
   start an EmbeddedSolr and use it to do a sharded query.
Solr will query all the shards and merge the results; it also provides timeouts if you need to limit the query time for each shard.
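The routing half of this scheme can be sketched with a small consistent-hash ring. This is not the code from the project described here, just a minimal illustration: the shard addresses, the use of MD5, and the number of virtual points per shard are all assumptions.

```python
import bisect
import hashlib


class HashRing:
    """Minimal consistent-hash ring mapping document IDs to shard addresses."""

    def __init__(self, shards, replicas=100):
        # Each shard gets `replicas` virtual points so load spreads evenly.
        self._ring = sorted(
            (self._hash(f"{shard}#{i}"), shard)
            for shard in shards
            for i in range(replicas)
        )
        self._keys = [point for point, _ in self._ring]

    @staticmethod
    def _hash(value):
        return int(hashlib.md5(value.encode("utf-8")).hexdigest(), 16)

    def shard_for(self, doc_id):
        """Return the shard owning the first ring point after the ID's hash."""
        idx = bisect.bisect(self._keys, self._hash(str(doc_id))) % len(self._ring)
        return self._ring[idx][1]


def shards_param(shards):
    """Format shard addresses as the value of Solr's `shards` query parameter."""
    return ",".join(shards)


# Hypothetical shard addresses.
SHARDS = ["solr1:8983/solr/users", "solr2:8983/solr/users", "solr3:8983/solr/users"]
ring = HashRing(SHARDS)
print(ring.shard_for("user-123"))
print("shards=" + shards_param(SHARDS))
```

The point of hashing onto a ring rather than using `hash(id) % n` is that adding or removing a shard only moves the IDs that belonged to that shard's points; everything else stays where it was indexed.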
Even with all of what I said above, I do not believe Solr is a good fit for this. Solr is not really well suited to searching indexes that are constantly changing, and if you mainly search by ID, then a search engine is not needed.
For our project we basically implemented all the index building, load balancing and query engine ourselves and use Solr mostly as storage. But we started using Solr back when sharding was flaky and not performant; I am not sure what the state of it is today.
Last note: if I were building this system today from scratch, without all the work we did over the past 4 years, I would advise using a cache (say memcached or Redis) to store all the users that are currently online, and at request time I would simply iterate over all of them and filter according to the criteria. The filtering by criteria can be cached independently and updated incrementally. Also, iterating over 5000 records is not necessarily very time consuming if the matching logic is very simple.
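A toy version of that last idea, with a plain Python set and dict standing in for the cache (in practice this would be a Redis or memcached lookup). The profile fields and the matching rule are invented for the example.

```python
def online_matches(online_ids, profiles, matches):
    """Return the online user IDs whose profile satisfies `matches`."""
    return [
        uid
        for uid in sorted(online_ids)
        if uid in profiles and matches(profiles[uid])
    ]


# Stand-ins for the cache contents (hypothetical data).
online = {"u1", "u3", "u4"}
profiles = {
    "u1": {"city": "Oslo"},
    "u2": {"city": "Oslo"},
    "u3": {"city": "Rome"},
}
print(online_matches(online, profiles, lambda p: p["city"] == "Oslo"))
```

With a few thousand online users this is a linear scan over a small in-memory set, which is why it can be cheaper than shipping the same IDs to the search engine on every request.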
Any robust solution will involve bringing your data close to Solr (in batch) and using it internally, NOT sending a very large request at search time, which needs to be low latency.
You should develop your own filter. The filter would cache the online users data once in a while (say, every minute). If the data changes VERY frequently, consider implementing a PostFilter.
You can find a good example of filter implementation here:
http://searchhub.org/2012/02/22/custom-security-filtering-in-solr/
"One solution can be use of join in Solr but online data change regularly and I can't index data every time (say 5-10 min, it should be at-least an hour)."
I think you could very well use Solr joins, with a little bit of improvisation.
The solution I propose is as follows:
You can have 2 Indexes (Solr Cores)
1. Primary Index (The one you have now)
2. Secondary Index with only two fields, "ID" and "IS_ONLINE"
You could now update the Secondary Index frequently (on the order of seconds) and keep it in sync with the table you have for storing online users.
NOTE: Even if it is updated frequently, this secondary index should not degrade performance, provided we make the necessary tweaks, such as using appropriate queries during delta-import.
You could now perform a Solr join on the ID field across these two indexes to achieve what you want. Here is the link on how to perform Solr joins between indexes (Solr cores).
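For orientation, a cross-core join is expressed with the join query parser and its fromIndex parameter. The helper below only formats such a filter query string. The secondary core name "online_users" and the primary key field "id" are assumptions; "ID" and "IS_ONLINE" are the secondary index fields proposed above.

```python
def online_join_fq(from_core="online_users", from_field="ID", to_field="id",
                   query="IS_ONLINE:true"):
    """Format a join filter query that selects primary docs with an online match."""
    # Doubled braces produce the literal { and } of Solr's local-params syntax.
    return f"{{!join from={from_field} to={to_field} fromIndex={from_core}}}{query}"


print("fq=" + online_join_fq())
```

This filter is sent to the primary core; both cores need to live in the same Solr instance for fromIndex to resolve.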

Sunspot with Solr 3.5. Manually updating indexes for real time search

I'm working with Rails 3 and Sunspot Solr 3.5. My application uses Solr to index user-generated content and make it searchable for other users. The goal is to let users search this data as soon as possible after it has been uploaded. I don't know if this qualifies as real-time search.
My application has two models
Posts
PostItems
I index posts by including data from post items, so that when a user searches on a description provided in a post_item record, the corresponding post object is returned in the search.
Users frequently update post_items, so every time a new post_item is added I need to reindex the corresponding post object so that the new post_item is available during search.
So at the moment whenever I receive a new post_item object I run
post_item.post.solr_index!
which, according to this documentation, instantly updates the index and commits. This works, but is this the right way to handle indexing in this scenario? I read here that calling index while searching may break Solr. Also, frequent manual index calls are not the way to go.
Any suggestions on the right way to do this? Are there alternatives other than switching to ElasticSearch?
Try this gem: https://github.com/bdurand/sunspot_index_queue
You will then be able to batch reindexing, say, every minute, and it definitely will not break the index.
If you are just starting out and have the luxury to choose between Solr and ElasticSearch, go with ElasticSearch.
We use Solr in production and have run into many weird issues as the index and search volume grew. Our conclusion was that Solr was built/optimized for indexing huge documents (Word/PDF content) in large numbers (billions?), with the index updated once a day or every couple of days when nobody is searching.
It was the wrong choice for a consumer Rails application where documents are small and relatively few (in the millions), updates are random and continuous, and search needs to be somewhat real time (a delay of 5-10 seconds is fine).
Some of the tricks we applied to tune the server:
removed all commits (i.e., the ! method variants) from the Rails code,
use Solr auto-commit every 5/20 seconds,
have a master/slave configuration,
run index optimization (on the master) every hour,
and more.
We still see high CPU usage on the slaves when a commit triggers. As a result, some searches take a long time (> 60 seconds at times).
Also, I doubt that batching index updates with the sunspot_index_queue gem can remedy the high CPU issue.

What are some strategies for updating volatile data in Solr?

What are some strategies for updating volatile data in Solr? Imagine if you needed to model YouTube video data in a Solr index: how would you keep the "views" data fresh without swamping Solr in updates?
I would imagine that storing the "views" data in a different data store (something like MongoDB or Redis) that is better at handling rapid updates would be the best idea.
But what is the best way to update the index periodically with that data? Would a delta-import make sense in this context? What does a delta-import do to Solr in terms of performance for running queries?
First you need to define "fresh".
Is "fresh" 1ms? If so, by the time the value (the rendered html) gets to the browser, it's not fresh anymore, due to network latency. Does that really matter? For the vast majority of cases, no, true real-time results are not needed.
A more common limit is 1s. In that case, Solr can handle it with RankingAlgorithm (a plugin) or soft commits (currently available in Solr 4.0 trunk only).
"Delta-import" is a term from DataImportHandler that doesn't have much intrinsic meaning. From the point of view of a Solr server, there are only document additions; it doesn't matter where they come from or whether a set of documents represents the "whole" dataset or not.
If you want to have an item indexed within 1s of its creation/modification, then do just that: add it to Solr just after it's created/modified (for example with a hook in your DAL). This should be done asynchronously, using RankingAlgorithm or soft commits.
You might be interested in so-called "near-realtime search", or NRT, now available on Solr's trunk, which is designed to deal with exactly this problem. See http://wiki.apache.org/solr/NearRealtimeSearch for more info and links.
How about using an ExternalFileField?
This lets you maintain data outside of your index in a separate file, which you can refresh periodically without any changes to the index.
For fast-changing data such as downloads, views or rank, this can be a good option.
More info: http://lucene.apache.org/solr/api/org/apache/solr/schema/ExternalFileField.html
This has some limitations, so you would need to check depending upon your needs.
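To make the idea concrete: the external file is a plain text file of key=value lines, named after the field with an "external_" prefix and placed in the index data directory. The sketch below only renders the file name and contents. The field name "views" and the counts are assumptions, and the counts would come from wherever you track them (for example Redis or MongoDB, as suggested in the question).

```python
def external_file_name(field_name):
    """Name of the file Solr looks for in the index data directory."""
    return f"external_{field_name}"


def render_external_file(values):
    """Render {doc_id: number} as sorted key=value lines with float values."""
    return "".join(
        f"{doc_id}={float(value)}\n" for doc_id, value in sorted(values.items())
    )


# Hypothetical view counts keyed by document ID.
view_counts = {"video-2": 17, "video-1": 1500}
print(external_file_name("views"))
print(render_external_file(view_counts), end="")
```

A periodic job can rewrite this file from the counter store; the values are typically used in function queries (for boosting or sorting) rather than for regular search.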
