Autosuggestion with Solr

I'm implementing auto-suggestion in a web page (ASP.NET MVC) with solr and have understood that there are a number of ways to do this, including:
jQuery Autocomplete, Suggester, facets or NGramFilterFactory.
Which one is the fastest to use for auto-suggestion?
Is there any good information about this?

You should take a good look at 'AJAX Solr' at https://github.com/evolvingweb/ajax-solr .
AJAX Solr has an autocomplete widget, among other things. Demo site: http://evolvingweb.github.io/ajax-solr/examples/reuters/index.html

Here's my shot at addressing your need, with this disclaimer:
'Fastest' is a very vague term that covers a broader spectrum, e.g. the browser used, page weight, network latency, etc. These need to be optimized outside the search implementation, if need be.
I would go for the straightforward implementation first and then optimize it based on performance stats.
Ok, now to the implementation, at a high level:
1) Create a Solr index with a field using an NGramTokenizerFactory tokenizer.
- to reduce chatter, keep the minimum NGram size (minGramSize) at 2, and only fire autosuggest once the user has typed at least 2 characters
2) Depending on the technology used, you can either route search requests through your application or hit Solr directly. Hitting Solr directly could be faster (see AJAX Solr, as mentioned already).
3) Use something like jQuery UI for the autosuggest implementation, backed by AJAX requests to Solr (a query sketch follows below).
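For illustration, here is a rough SolrJ sketch of the kind of query the autocomplete box would fire against the NGram field; the core name, the field names (suggest_ngram, name) and the URL are assumptions, and the same query can be sent as a plain HTTP/JSON request from jQuery instead:
import org.apache.solr.client.solrj.SolrClient;
import org.apache.solr.client.solrj.SolrQuery;
import org.apache.solr.client.solrj.impl.HttpSolrClient;
import org.apache.solr.client.solrj.response.QueryResponse;
import org.apache.solr.client.solrj.util.ClientUtils;
import org.apache.solr.common.SolrDocument;

public class NGramSuggest {
    public static void main(String[] args) throws Exception {
        // Hypothetical core and field names; adjust to your schema.
        SolrClient solr = new HttpSolrClient.Builder("http://localhost:8983/solr/products").build();
        String userInput = "sams"; // only fire once the user has typed at least 2 characters
        SolrQuery query = new SolrQuery("suggest_ngram:" + ClientUtils.escapeQueryChars(userInput));
        query.setFields("name");   // return just the display field
        query.setRows(10);         // keep the payload small for the dropdown
        QueryResponse response = solr.query(query);
        for (SolrDocument doc : response.getResults()) {
            System.out.println(doc.getFieldValue("name"));
        }
        solr.close();
    }
}
Wired into jQuery UI, the autocomplete source callback would issue the same /select request with wt=json and map the returned name values into the dropdown.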
Here are a couple of reference implementations:
http://isotope11.com/blog/autocomplete-with-solr-and-jquery-ui
https://gist.github.com/pduey/2648514
Note that similar implementations work well on live sites, so I would be tempted to try this out and see whether there is actually a bottleneck, rather than doing any premature optimization.

'AJAX Solr' has limitations with respect to autosuggestions, as it provides only word-level suggestions; internally it uses faceting to generate them.
But Solr provides different suggesters that you can leverage to generate intelligent autosuggestions (words or phrases).
Check out this blog post to learn more:
http://lucidworks.com/blog/solr-suggester/
For the implementation, you can use a combination of suggesters (FST and AnalyzingInfix) with jQuery autocomplete.
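To illustrate how such a suggester can be queried from code, here is a minimal SolrJ sketch; the /suggest handler name, the dictionary name mySuggester and the core name are assumptions that depend on how the suggester is configured in solrconfig.xml:
import java.util.List;
import java.util.Map;
import org.apache.solr.client.solrj.SolrClient;
import org.apache.solr.client.solrj.SolrQuery;
import org.apache.solr.client.solrj.impl.HttpSolrClient;
import org.apache.solr.client.solrj.response.QueryResponse;
import org.apache.solr.client.solrj.response.SuggesterResponse;

public class SuggesterDemo {
    public static void main(String[] args) throws Exception {
        SolrClient solr = new HttpSolrClient.Builder("http://localhost:8983/solr/products").build();
        SolrQuery query = new SolrQuery();
        query.setRequestHandler("/suggest");            // handler defined in solrconfig.xml
        query.set("suggest", true);
        query.set("suggest.dictionary", "mySuggester"); // assumed dictionary name
        query.set("suggest.q", "obam");                 // the user's partial input
        QueryResponse response = solr.query(query);
        SuggesterResponse suggestions = response.getSuggesterResponse();
        Map<String, List<String>> terms = suggestions.getSuggestedTerms();
        for (String term : terms.get("mySuggester")) {
            System.out.println(term);
        }
        solr.close();
    }
}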

Does Cloudant support rewrites as functions?

I have a Cloudant database, and I want to make pretty URLs for my slash-containing documents. So I define a rewrite function like so:
{
  "_id": "_design/myRewrites",
  "rewrites": "function (req2) {\n return {path: \"../../../\" + req2.path.slice(4).join(\"%2F\")};\n}"
}
Rewrite function formatted more nicely:
function (req2) {
  return {path: "../../../" + req2.path.slice(4).join("%2F")};
}
According to the CouchDB docs, CouchDB has supported this kind of rewriting (as stringified functions) since CouchDB 1.7, but Cloudant's documentation doesn't speak about this particular functionality (only rewrites from arrays).
This is reflected in my experience: when I try it out at https://myAccount.cloudant.com/myDb/_design/myRewrites/_rewrite/hello/world/, I get the following response:
{"error":"unknown_command","reason":"unknown ddoc command 'rewrites'"}
However, I read somewhere that Cloudant and CouchDB have shared the same source code since 2.0, so I would expect Cloudant to support all CouchDB features. What's the deal?
Also see the following tweet about this, in which IBM asks me to post a question on Stack Overflow and suggests I might be on an outdated cluster: https://twitter.com/digitalheir/status/845910843934085120
My data location says "Porter, London". Could it help if I changed this?
tl;dr: Sorry, but no. Cloudant doesn't support rewrites as functions :(
We tried your example and got the same results. Digging deeper, I can now confirm that Cloudant does not support URL rewrites via stringified functions. The service only supports rewrites using the array approach.
I can't say for sure, but I suspect that the team overlooked this feature. That said, it's unlikely that Cloudant will support rewrites as JS functions anytime soon because the current approach does not scale well, as it can bog down the database if views are frequently updated. It's similar to the reason that Cloudant recommends people use the built-in reduce functions (which are implemented in Erlang), rather than writing their own custom JavaScript reduces.
Rewrites as arrays, however, do scale. But this approach obviously won't work if you're dynamically generating URLs. In this case, we suggest moving the URL rewrite functionality to an app server. Unfortunately, this all might be a moot point if you're building a CouchApp :/
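As a rough illustration of handling the rewrite in the app server instead, the Java sketch below (account, database and path layout are assumptions) turns a pretty path such as /hello/world into the slash-escaped Cloudant document URL that the original function was trying to produce:
import java.io.UnsupportedEncodingException;
import java.net.URLEncoder;

public class CloudantUrlRewriter {
    // Turns a pretty path like "/hello/world" into
    // "https://myAccount.cloudant.com/myDb/hello%2Fworld".
    static String toDocUrl(String account, String db, String prettyPath)
            throws UnsupportedEncodingException {
        String docId = prettyPath.startsWith("/") ? prettyPath.substring(1) : prettyPath;
        // URLEncoder escapes "/" as "%2F", which is what Cloudant expects for
        // document ids containing slashes (note: it is form-encoding, so spaces become '+').
        String escapedId = URLEncoder.encode(docId, "UTF-8");
        return "https://" + account + ".cloudant.com/" + db + "/" + escapedId;
    }

    public static void main(String[] args) throws Exception {
        System.out.println(toDocUrl("myAccount", "myDb", "/hello/world"));
    }
}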
This was confusing, so thank you for pointing it out. I'm going to ask the Cloudant team to note this difference in the documentation. Hope this at least helps provide some closure. You weren't wrong for expecting it to work.

How to programmatically create an index from a list of URLs in Java

I want to create a search engine, so I used Nutch and Solr to develop it.
But it is not able to crawl each and every URL of the website, and the search results are not as
good as Google's. So I started using JCrawler to get a list of URLs.
Now I have a list of URLs, but I still have to index them.
Is there any way I can index a list of URLs stored line by line in a file,
and show the results via Lucene, Solr, or any other Java API?
How you programmatically do something really depends on which language you plan on writing your code in - fetching content from a URL and making sense of that content before indexing will be largely dependent on the libraries available for your programming language of choice.
You can still use Nutch with the Solr backend - give it the list of URLs as input and set --depth to 1 (so that it doesn't spider anything further).
There are also other ready-made options, such as Crawl Anywhere (which has a Solr backend) and Scrapy.
"Not as good as Google" is not a useful description of what you want to accomplish or of how to approach it (keep in mind that search is a core product for Google and that they have a very, very large set of custom technologies for handling it). If you have specific issues with your own data and how to display it (usually you can produce more useful results, since you have domain knowledge of the task you're trying to solve), ask concrete, specific questions.
You can use the Data Import Handler (DIH) to load the list of URLs from a file and then fetch and index them.
You'd need to use a nested entity, with the outer entity having the rootEntity flag set to false.
You'd need to practice a little bit with DIH, so I recommend that you first learn how to import just the URLs into individual Solr documents and then enhance it with actual parsing of the URL content.
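If you would rather skip DIH and stay in plain Java, here is a minimal sketch using SolrJ plus jsoup to read the file, fetch each page and index it; the core name and the title/content fields are assumptions and must exist in your schema:
import java.nio.file.Files;
import java.nio.file.Paths;
import java.util.List;
import org.apache.solr.client.solrj.SolrClient;
import org.apache.solr.client.solrj.impl.HttpSolrClient;
import org.apache.solr.common.SolrInputDocument;
import org.jsoup.Jsoup;
import org.jsoup.nodes.Document;

public class UrlListIndexer {
    public static void main(String[] args) throws Exception {
        SolrClient solr = new HttpSolrClient.Builder("http://localhost:8983/solr/pages").build();
        // urls.txt contains one URL per line.
        List<String> urls = Files.readAllLines(Paths.get("urls.txt"));
        for (String url : urls) {
            Document page = Jsoup.connect(url).get();   // fetch and parse the HTML
            SolrInputDocument doc = new SolrInputDocument();
            doc.addField("id", url);                    // use the URL itself as the unique key
            doc.addField("title", page.title());
            doc.addField("content", page.body().text());
            solr.add(doc);
        }
        solr.commit();
        solr.close();
    }
}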

When did Google index a page?

How can I find out (any language but better if Python) when Google indexed a specific html page?
Ideally I would have a list of URLs to check for.
I have already tried the Wayback Machine, but it doesn't have the majority of the pages I need. Also, it would help if anyone could suggest an API for extracting dates in multiple languages from text.
You can use this pattern to access the cached version of your webpage.
http://webcache.googleusercontent.com/search?q=cache:<URL>
For example, you can see the cached version of my blog datafireball.com like this; as you can see, it was indexed on 2014-10-20 at 23:33:30. The strip= parameter will avoid loading JavaScript, CSS, etc. To get the time when the page was indexed, you can use a browser automation tool like Selenium or PhantomJS to fetch the page.
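The question prefers Python, but the same idea works in any language: load the cache URL in a real browser and read the timestamp out of Google's banner. A rough Selenium sketch in Java (the banner wording and date format are assumptions and may change on Google's side):
import java.util.regex.Matcher;
import java.util.regex.Pattern;
import org.openqa.selenium.WebDriver;
import org.openqa.selenium.chrome.ChromeDriver;

public class GoogleCacheDate {
    public static void main(String[] args) {
        WebDriver driver = new ChromeDriver();  // requires chromedriver on the PATH
        try {
            driver.get("http://webcache.googleusercontent.com/search?q=cache:datafireball.com");
            String html = driver.getPageSource();
            // The cache banner contains text like "as it appeared on 20 Oct 2014 23:33:30 GMT".
            Matcher m = Pattern.compile("as it appeared on ([^.<]+)").matcher(html);
            System.out.println(m.find() ? m.group(1).trim() : "No cache banner found");
        } finally {
            driver.quit();
        }
    }
}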

Pulling out Popular terms from a Solr core

I have an Apache Solr core from which I need to pull out the popular terms. I am already aware of Luke, facets, and the Apache Solr stopwords, but I am not getting what I want. For example, when I use Luke to get the popular terms and apply the stopwords to the result set, I get a bunch of words like:
http, img, que ...etc
while what I really want is:
Obama, Metallica, Samsung ...etc
Is there a better way to implement this in Solr? Am I missing something that should be used to do this?
Thank you
Finding relevant words in a text is not easy. The first thing I would take a deeper look at is Natural Language Processing (NLP) with Solr. The article in Solr's wiki is a starting point for this. Reading that page you will stumble over the Full Example, which extracts nouns and verbs; that alone may already help you.
During the process of getting this running you will need to install additional software (Apache's OpenNLP project), so after the Solr wiki, that project's home page may be the next step.
To get a feeling for what is possible with it, you should have a look at the Searchbox guy's demonstration. There you can paste a sample text and get relevant words and terms extracted from it.
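To give a flavour of the OpenNLP side, here is a minimal standalone Java sketch that keeps only the nouns from a piece of text; it assumes the pre-trained en-pos-maxent.bin model from the OpenNLP download page, and inside Solr you would plug this logic into an update processor or analysis chain instead:
import java.io.FileInputStream;
import java.io.InputStream;
import opennlp.tools.postag.POSModel;
import opennlp.tools.postag.POSTaggerME;
import opennlp.tools.tokenize.WhitespaceTokenizer;

public class NounExtractor {
    public static void main(String[] args) throws Exception {
        // Pre-trained English POS model from the OpenNLP download page.
        try (InputStream modelIn = new FileInputStream("en-pos-maxent.bin")) {
            POSTaggerME tagger = new POSTaggerME(new POSModel(modelIn));
            String[] tokens = WhitespaceTokenizer.INSTANCE.tokenize(
                    "Obama met Metallica while holding a Samsung phone");
            String[] tags = tagger.tag(tokens);
            for (int i = 0; i < tokens.length; i++) {
                // NN, NNS, NNP, NNPS are the Penn Treebank noun tags.
                if (tags[i].startsWith("NN")) {
                    System.out.println(tokens[i]);
                }
            }
        }
    }
}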
There are several tutorials out there you may have a look at for further reading.
If you go down that path and the results are not as expected or not as good as required, you may go down that road even further and start thinking about text mining with Apache Mahout. There are again several tutorials out there on combining it with Solr.
In any case you should then search Stack Overflow or the web for the tutorials and how-tos you will certainly need.
Update about Arabic
If you are going to use OpenNLP for an unsupported language, which Arabic unfortunately is out of the box as of version 1.5, you will need to train OpenNLP for that language. The reference for this is found in the OpenNLP developer docs. Probably there is already something out there from the Arabic community, but my Arabic google-fu is not that good.
Should you decide to do the work and train it for Arabic, why not share your training with the project?
Update about integration in Solr/Lucene
There is work going on to integrate it as a module. In my humble opinion this is as far as it will and should get. If you compare this problem space to stemming, stemming appears rather easy. But even stemming got complex when supporting different languages. Analysing a language to the level that you can extract nouns, verbs and so forth is so complex that a whole project has evolved around it.
Having a module/contrib at hand, which you could simply copy to solr_home/lib, would already be very handy, so there would be no need to run a separate installer and so forth.
Well, this is a bit open-ended.
First you will need to facet and find the "popular terms" in your index, then add all the non-useful items such as http, img, time, what, when, etc. to your stopword list and re-index to get the cream of the data you care about. I do not think there is an easier way of finding popular names unless you can bounce your data against a custom dictionary of nouns during indexing (that is an option, by the way): you can choose to index only names by writing a custom token filter (look at how the stopword filter works) backed by your own nouns.txt file, so that only words in your dictionary make it into the index. This approach is only possible if you have a finite, known list of nouns.
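As an illustration of the faceting step, here is a minimal SolrJ sketch; the core name and the content field are assumptions:
import org.apache.solr.client.solrj.SolrClient;
import org.apache.solr.client.solrj.SolrQuery;
import org.apache.solr.client.solrj.impl.HttpSolrClient;
import org.apache.solr.client.solrj.response.FacetField;
import org.apache.solr.client.solrj.response.QueryResponse;

public class PopularTerms {
    public static void main(String[] args) throws Exception {
        SolrClient solr = new HttpSolrClient.Builder("http://localhost:8983/solr/articles").build();
        SolrQuery query = new SolrQuery("*:*");
        query.setRows(0);               // we only want facet counts, not documents
        query.setFacet(true);
        query.addFacetField("content"); // assumed tokenized text field
        query.setFacetLimit(20);        // top 20 terms
        query.setFacetMinCount(5);      // ignore rare terms
        QueryResponse response = solr.query(query);
        FacetField terms = response.getFacetField("content");
        for (FacetField.Count count : terms.getValues()) {
            System.out.println(count.getName() + " -> " + count.getCount());
        }
        solr.close();
    }
}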

How to implement Solr into Sitecore

I have to implement a Solr index in Sitecore and I would like to know what the best approach is.
I looked at following approaches:
Capture the publish:end event (or other events) and then push the item to the Solr index
Implement a custom database crawler and get all changes from the history table, then push the data to Solr using a custom index
The second approach sounds like the way to go (in my opinion). In this case, do I need to create a new search index, or a search manager?
If anyone has done this before, can you point me in the right direction? Also, could you post some links to articles about Sitecore-Solr implementations?
UPDATE
OK, after reading the Sitecore documentation, this is what I came up with:
Create a custom SolrConfiguration class where you can set properties like solrserviceurl, and add indexes and their definitions (custom Solr indexes)
Create a SolrIndex and add it (in the config file) to your SolrConfiguration. When instantiated, the SolrIndex should subscribe to the AddEntry event of the Sitecore History Manager and communicate with the Solr crawlers.
Create a custom processor and hook it into the Sitecore initialization pipeline. The processor should initialize the SolrConfiguration (from step 1)
Since everything in your config file will be built using reflection, you can get an instance of your configuration based on your config file
How does that sound? Any comments would be appreciated.
We've done this on a few sites and tend to have a separate "published" Solr index and an "unpublished" index.
We hook into the following events:
OnItemSaving
We use this event to push things into the unpublished index (you may not need this; it depends on whether you want things in preview mode)
OnPublishItemProcessed
We process additions and updates to the published index here. I'm not sure what we do about deletions here without digging right into the code, but we certainly deal with deletions in OnItemDelete (mentioned below)
OnItemDelete
We use this event to remove things from the published and unpublished indexes (I think we remove from the published index here because Sitecore makes you publish the parent node in order to publish out deletions to the web database)
I hope that helps. I'd post the code if I could (but I'd be scowled at).
In addition to the already posted answer (which I think is a good way to do things) I'll share how we do it.
We basically just took a look at the Sitecore database crawler and decided to do things kind of like how it was doing it.
We utilize a significantly modified version of the Custom Item Generator to facilitate mapping between strongly typed objects and an object that has properties that correspond to our Solr schema. For actual communication with Solr we use SolrNet.
The general idea is that we loop through all the items (starting with the site root) recursively and map them to the appropriate type based on its template. Then we go through an indexing process for that item (some items need to index multiple documents to Solr in our implementation).
This approach is working very well for us except I will note that because we are indexing everything at once, it tends to introduce a slight bit of lag time between publish and the site reflecting any changes made to the index. One oversight we made in the beginning but will be working to fix soon is that we don't have an "unpublished" index (meaning we need to publish the site to see updates). It doesn't impact our solution that much really, but I can definitely see where it would others, so keep that in mind.
We didn't particularly want to get into the deletion of items from the index, so we do the indexing as a publish:end event.
I hope this additional insight helps you. As far as I know there's not a whole lot of information out there about this specific combination of products, but I can tell you it's definitely possible and quite useful.
