Run incremental crawl using NCrawler - solr

When we use NCrawler with SOLR, is there any way to run incremental crawling and indexing? I dont want my crawler to fetch the complete data every time it crawls. Is there any way to make the crawl incrtemental ?
Thanks in advance.

There is not anything built into NCrawler for this. You will need to create your own processing to handle this. However, the extensible IPipelineStep mechanism will allow you to create any process around your crawling that you want. For example you could store each visited url in a database with along with a hash of the page content to determine when pages change and only process the changed pages to the index.

Related

Posting large directory of files to SOLR using post tool, how to commit after every file

I am using the java post tool for solr to upload and index a directory of documents. There are several thousand documents. Solr only does a commit at the very end of the process and sometimes things stop before it completes so I lose all the work.
Has anyone a technique to fetch the name of each doc and call post on that so you get the commit for each document? Rather than the large commit of all the docs at the end?
From the help page for the post tool:
Other options:
..
-params "<key>=<value>[&<key>=<value>...]" (values must be URL-encoded; these pass through to Solr update request)
This should allow you to use -params "commitWithin=1000" to make sure each document shows up within one second of being added to the index.
Committing after each document is an overkill for the performance, in any case it's quite strange that you had to resubmit anything from start if something goes wrong. I suggest to seriously to change the indexing strategy you're using instead of investigating in a different way to commit.
Given that, if you not have any other way that change the commit configuration, I suggest to configure autocommit in your Solr collection/index or use the parameter commitWithin, as suggested by #MatsLindh. Just be aware if the tool you're using has the chance to add this parameter.
autoCommit
These settings control how often pending updates will be automatically pushed to the index. An alternative to autoCommit is
to use commitWithin, which can be defined when making the update
request to Solr (i.e., when pushing documents), or in an update
RequestHandler.

Get information from publishItem to publish end in sitecore

I got a setup with sitecore and solr.
Im looking to gather information (the different TemplatesIds) in publishItem, and then when the publish has ended, call solr with the names which needs to be reindex.
Ive managed to get all the template IDs both using PublishItemProcessor and as a publish:itemProcessed event, where i store the template ids in the PublishContext.CustomData as a Hashset.
But how can i, when the publishing is done get this information i've gathered during publishing? I want to call solr, once, and only once, after everything is published, with information gathered during the publishing.
Hope this makes sense guys, please help out.
You don't need to make a hack to reindex indexes after a publishing.
Sitecore has out of the box this functionality.
You use index update strategies to maintain indexes. You can configure each index with a unique set of index update strategies. You should not specify more than three update strategies per index for performance reasons.
Sitecore provides a varied set of index update strategies, and you can extend this set with more strategies.
All the strategies that are delivered with Sitecore are defined under the following node in the Sitecore.ContentSearch.Solr.Index.IndexName configuration files:
<configuration ref="contentSearch/indexConfigurations/defaultSolrIndexConfiguration" />
<strategies hint="list:AddStrategy">
You need to use of these default strategies:
RebuildAfterFullPublish
OnPublishEndAsync
More information about search, indexing and crawling you can find here:
https://doc.sitecore.net/sitecore_experience_platform/setting_up__maintaining/search_and_indexing

How to programmatically create indexing from the list of urls in java

I want to create search engine. So I had used nutch and solr for the developing it.
But it is not able to crawl each and every url of the website and search results are not as
good as Google.So I started using jcrawler to get list of url.
Now I have list of urls.But I have to index them.
So is there any way where I can index list of urls stored line by line in a file.
and show results vis lucene or solr or any other Java API
How you programmatically do something really depends on which language you plan on writing your code in - fetching content from a URL and making sense of that content before indexing will be largely dependent on the libraries available for your programming language of choice.
You can still use nutch with the Solr backend - give it the list of urls as input and set --depth to 1 (so that it doesn't spider anything further).
There are also other "ready" options, such as Crawl Anywhere (which has a Solr backend) and Scrapy.
"Not as good as Google" is not a good description of what you want to accomplish and how to approach that (keep in mind that Search is a core product for Google and they have a very, very large set of custom technologies for handling search). If you have specific issues with your own data and how to display that (usually you can do more useful results as you have domain knowledge of the task you're trying to solve), ask concrete, specific questions.
You can use Data Import Handler to load the list of URLs from file and then read and index them.
You'd need to use nested entity with outside entity having rootEntity flag set to false.
You'd need to practice a little bit with DIH. So, I recommend that you first learn how to import just the URLs into individual Solr documents and then enhance it with actually parsing of URL content.

When did Google index a page?

How can I find out (any language but better if Python) when Google indexed a specific html page?
Ideally I would have a list of URLs to check for.
I have already tried the WayBack machine but it doesn't have the majority of the pages I need. Also if anyone can suggest an API to extract dates in multiple language from text.
You can use this pattern to access the cached version of your webpage.
http://webcache.googleusercontent.com/search?q=cache:<URL>
Say for example, you can see the cache version of my blog datafireball.com like this, as you can see, it is indexed 20141020, 23:33:30, strip= will avoid loading javascript, css..etc. To get the time when the indexed was index, you can use some browser automation tool like Selenium or Phantomjs... etc. to get the page.

How to implement Solr into Sitecore

I have to implement Solr index into Sitecore and I would like to know what is the best approach?
I looked at following approaches:
Capture publish end event (or other events) and then push item to solr index
Implement custom database crawler and get all changes from history table. Then using custom index push data to solr.
Second approach sounds like a way to go (in my opinion). In this case do I need to create a new search index, or search manager?
If anyone's done it before, can you point me into the right direction? Also if you could post some links to articles about sitecore-solr implementation.
UPDATE
Ok, after reading sitecore documentation this is what I came up with :
Create your custom SolrConfiguration class where you can set properties like solrserviceurl, add indexes and its definition (custom solr indexes)
Create SolrIndex and add it (in the config file) to your SolrConfiguration. Which instantiating, solrindex should subscribe to AddEntry event of Sitecore History Manager, and communicate with solr crawlers.
Create custom processor and hook into sitecore initialisation pipeline. Processor should initialize SolrConfiguration (from step 1)
Since everything in your config file in will be build using refrection, you can get instance of your cofiguration based on your config file
How does that sound like. Can I have any comments please?
We've done this on a few sites and tend to have a new "published" solr index and "unpublished" index
We interrupt:
OnItemSaving
Event to push things into the unpublished index (you may not need this, it depends if you want things in preview mode)
OnPublishItemProcessed
We process additions and updates to the published index here, I'm not sure what we do about deletions here without digging right into the code but certainly deal with deletions on the OnItemDelete (mentioned below)
OnItemDelete
We interrupt here to remove things from the published and non-published index (I think we remove from the published index here because Sitecore makes you publish the parent node in order to publish out deletions to the web database)
I hope that helps, I'd post the code if I could (but I'd be scowled at).
In addition to the already posted answer (which I think is a good way to do things) I'll share how we do it.
We basically just took a look at the Sitecore database crawler and decided to do things kind of like how it was doing it.
We utilize a significantly modified version of the Custom Item Generator to facilitate mapping between strongly typed objects and an object that has properties that correspond to our Solr schema. For actual communication with Solr we use SolrNet.
The general idea is that we loop through all the items (starting with the site root) recursively and map them to the appropriate type based on its template. Then we go through an indexing process for that item (some items need to index multiple documents to Solr in our implementation).
This approach is working very well for us except I will note that because we are indexing everything at once, it tends to introduce a slight bit of lag time between publish and the site reflecting any changes made to the index. One oversight we made in the beginning but will be working to fix soon is that we don't have an "unpublished" index (meaning we need to publish the site to see updates). It doesn't impact our solution that much really, but I can definitely see where it would others, so keep that in mind.
We didn't particularly want to get into the deletion of items from the index so we do the indexing as a publish:end event.
I hope this additional insight helps you. As far as I know there's not a whole lot of information out there about this specific combination of products, but I can tell you it's definitely possible and quite useful.

Resources