Nutch Crawl - Deleting segments on each crawl implications - solr

I noticed that during each Nutch crawl, the indexes sent to Solr were not consistent. Sometimes the latest changes to the webpages were shown, sometimes older changes were shown instead.
Cause
Noticed that Nutch was giving indexes from an older segment to Solr.
Current Solution
Deleting all old segments before fetching and seemed to solve the problem.
Question
Would like to know if there are any implications of such an approach or my understanding to this is incorrect. Would also like to know why does Nutch not automatically remove older segments during a crawl.
Thanks.

If multiple segments are indexed (again) and the same is contained in two or more segments, there is no guarantee that the most recent version is indexed. It's a known problem (NUTCH-1416). The easiest solution is to send only the recently fetched segments to the indexer. The script bin/crawl does this, the index step is done at the end of each cycle for the segment fetched in this cycle.

Related

full build solr index with large amount of data

I have a text file containing over 10 million records of web pages.
I want to build solr index with this file every day(because this file is updated daily).
Is there any effective solutions to full build solr index at once? Such as using map reduce model to accelerate building process.
I think using solr api to add document is a little bit slow.
It is not clear how much content is in those 10 million records, but it may actually be simple enough to index those in bulk. Just check your solrconfig.xml for your commit settings, you may, for example, have autoCommit configured with low maxDocs settings. In your case, you may want to disable autoCommit completely and just do it manually at the end.
However, if it is still a bit slow, before going to map-reduce, you could think about building a separate index and then swapping it with the current index.
This way, you actually have the previous collection to roll-back to and/or to compare if needed. The new collection can even be built on a different machine and/or more close to the data.

When to definitely use SOLR over Lucene in a Sitecore 7 build?

My client does not have the budget to setup and maintain a SOLR server to use in their production environment. If I understand the Sitecore 7 Content Search API correctly, it is not a big deal to configure things to use Lucene instead. For the most part the configuration will be similar and the code will be the same, and a SOLR server can be swapped in later.
The site build has
faceted search page
listing components on landing and on other pages that will leverage the Content Search API
buckets with custom facets
The site has around 5,000 pages and components not including media library items. Are there any concerns about simply using Lucene?
The main question is, when, during your architecture or design phase do you know that you should definitely choose SOLR over Lucene? What are the major signs that lead you recommend that?
I think if you are dealing with a customer on a limited budget then Lucene will work perfectly well and perform excellently for the scale of things you are doing. All the things you mention are fully supported by the implementation in Lucene.
In a Sitecore scenario I would begin to consider Solr if:
You need to index a large number of items - id say 50 thousand upwards - Lucene is happy with these sorts of number but Solr has improved query caching and is designed for these large numbers of items.
The resilience of the search tier is of maximum business importance (ie the site is purely driven by search) - Solr provides a more robust replication/sharding and failover system with SolrCloud.
Re-purposing of the search tier in other application is important (non Sitecore) - Solr is a search application so can be accessed over HTTP with XML/JSON etc which makes integration with external systems easier.
You need some specific additional feature of Solr that Lucene doesn't have.
.. but as you say if you want swap out Lucene for Solr at a later phase, we have worked hard to make sure that the process as simple as possible. Worth noting a few points here:
While your LINQ queries will stay the same your configuration will be slightly different and will need attention to port across.
The understanding of how Solr works as an application and how the schema works is important to know but there are some great books and a wealth of knowledge out there.
Solr has slightly different (newer) analyzers and scoring mechanisms so your search results may be slightly different (sometimes customers can get alarmed by this :P)
.. but I think these are things you can build up to over time and assess with the customer. Im sure there are more points here and others can chime in if they think of them. Hope this helps :)
Stephen pretty much covered the question - but I just wanted to add another scenario. You need to take into account the server setup in your production environment. If you are going to be using multiple content delivery servers behind a load balancer I would consider Solr from the start, as trying to make sure that the Lucene index on each delivery server is synchronized 100% of the time can be painful.
I would recommend planning an escape plan from Lucene as early as you start thinking about multiple CDs and here is why:
A) Each server has to maintain its own index copy:
Any unexpected restart might cause a few documents not to be added to the index on the one box, making indexes different from server to server.
That would lead to same page showing differently by CDs
Each server must perform index updates - use CPU & disk space; response rate drops after publish operation is over =/
According to security guide, CDs should have Sitecore Shell UI removed, so index cannot be easily rebuilt from Control Panel =\
B) Lucene is not designed for large volumes of content. Each search operation does roughly following:
Create an array with size equal to total number of documents in the index
If document matches search, set flag in the array
While this works like a charm for low sized indexes (~10K elements), huge performance degradation is produced once the volume of content grows.
The allocated array ends in Large Object Heap that is not compacted by default, thereby gets fragmented fast.
Scenario:
Perform search for 100K documents -> huge array created in memory
Perform one more search in another thread -> one more huge array created
Update index -> now 100K + 10 documents
The first operation was completed; LOH has space for 100K array
Seach triggered again -> 100K+10 array is to be created; freed memory 'hole' is not large enough, so more RAM is requested.
w3wp.exe process keeps on consuming more and more RAM
This is the common case for Analytics Aggregation as an index is being populated by multiple threads at once.
You'll see a lot of RAM used after a while on the processing instance.
C) Last Lucene.NET release was done 5 years ago.
Whereas SOLR is actively being developed.
The sooner you'll make the switch to SOLR, the easier it would be.

Solr full refresh without deleting index

I want to do solr full refresh with out deleting the index so that the data can be accessed until full refresh is done. Once the full refresh is completed the old index has to be removed. How can i do this, please help.
I would suggest the use of multiple cores in your Solr implementation. A "live" core and an "ondeck" core, where "live" is the current index and "ondeck" is the one that you will refresh into. (Note: you can name the cores anything that is meaningful to you) Once the refresh has been completed, you can issue a SWAP command that will switch the two cores out in real time without any impact to the users (eg. Solr will manage the searches being executed against the cores behind the scenes for you).
We have implemented this exact scenario on a couple of other indexes at my current company with very good success.

Sunspot with Solr 3.5. Manually updating indexes for real time search

Im working with Rails 3 and Sunspot solr 3.5. My application uses Solr to index user generated content and makes it searchable for other users. The goal is to allow users to search this data as soon as possible from the time the user uploaded it. I don't know if this qualifies as Real time search.
My application has two models
Posts
PostItems
I index posts by including data from post items so that a when a user searches based on certain description provided in a post_item record the corresponding post object is made available in the search.
Users frequently update post_items so every time a new post_item is added I need to reindex the corresponding post object so that the new post_item will be available during search.
So at the moment whenever I receive a new post_item object I run
post_item.post.solr_index! #
which according to this documentation instantly updates the index and commits. This works but is this the right way to handle indexing in this scenario? I read here that calling index while searching may break solr. Also frequent manual index calls are not the way to go.
Any suggestions on the right way to do this. Are there alternatives other than switching to ElasticSearch
try to use this gem https://github.com/bdurand/sunspot_index_queue
you will than be able to batch reindex, let's say, every minute, and it definitely will not brake an index
If you are just starting out and have the luxury to choose between Solr and ElasticSearch, go with ElasticSearch.
We use Solr in production and have run into many weird issues as the index and search volume grew. The conclusion was Solr was built/optimzed for indexing huge documents(word/pdf content) and in large numbers(billions?) but updating the index once a day or a couple of days when nobody is searching.
It was a wrong choice for consumer Rails application where documents are small, small in numbers( in millions) updates are random and continuous and the search needs to be somewhat real time( a delay of 5-10 sec is fine).
Some of the tricks we applied to tune the server.
removed all commits (i.e., !) from rails code,
use Solr auto-commit every 5/20 seconds,
have master/slave configuration,
run index optimization(on Master) every 1 hour
and more.
and we still see high CPU usage on slaves when the commit triggers. As a result some searches take a long time(> 60 seconds at times).
Also I doubt if the batching indexing sunspot_index_queue gem can remedy the high CPU issue.

What are some strategies for updating volatile data in Solr?

What are some strategies for updating volatile data in Solr? Imagine if you needed to model YouTube video data in a Solr index: how would you keep the "views" data fresh without swamping Solr in updates?
I would imagine that storing the "views" data in a different data store (something like MongoDB or Redis) that is better at handling rapid updates would be the best idea.
But what is the best way to update the index periodically with that data? Would a delta-import make sense in this context? What does a delta-import do to Solr in terms of performance for running queries?
First you need to define "fresh".
Is "fresh" 1ms? If so, by the time the value (the rendered html) gets to the browser, it's not fresh anymore, due to network latency. Does that really matter? For the vast majority of cases, no, true real-time results are not needed.
A more common limit is 1s. In that case, Solr can deal with that with RankingAlgorithm (a plugin) or soft commits (currently available in Solr 4.0 trunk only).
"Delta-import" is a term from DataImportHandler that doesn't have much intrinsic meaning. From the point of view of a Solr server, there's only document additions, it doesn't matter where they come from or if a set of documents represent the "whole" dataset or not.
If you want to have an item indexed within 1s of its creation/modification, then do just that, add it to Solr just after it's created/modified (for example with a hook in your DAL). This should be done asynchronously, and use RA or soft commits.
You might be interested in so-called "near-realtime search", or NRT, now available on Solr's trunk, which is designed to deal with exactly this problem. See http://wiki.apache.org/solr/NearRealtimeSearch for more info and links.
How about using the external file field ?
This helps you to maintain data outside of your index in a separate file, which you can refresh periodically without any changes to the index.
For data such as downloads, views, rank which is fast changing data this can be an good option.
More info # http://lucene.apache.org/solr/api/org/apache/solr/schema/ExternalFileField.html
This has some limitations, so you would need to check depending upon your needs.

Resources