I need to know, in plain language, what the difference between them is, what exactly they are used for, and how. I have a Solr project that suggests results based on queries as a personalization approach.
Which one can be used?
They're very different features. Carrot2 is a clusterer - i.e. it finds clusters of similar documents that belong together. That means that it attempts to determine which documents describe the same thing, and group them together based on these characteristics.
The Suggester component is mainly used for autocomplete-like features, where you're giving the user suggestions on what to search for (i.e. trying to guess what the user wants to accomplish before they have typed their whole query).
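As a rough sketch of such a lookup (the core name "mycore" and suggester name "mySuggester" are placeholders, and the suggester must already be configured in solrconfig.xml):

```python
import requests

# Hypothetical setup: a Solr core named "mycore" with a SuggestComponent
# called "mySuggester" already configured in solrconfig.xml.
SOLR_SUGGEST_URL = "http://localhost:8983/solr/mycore/suggest"

def autocomplete(prefix):
    """Return suggestion terms for a partially typed query."""
    params = {
        "suggest": "true",
        "suggest.dictionary": "mySuggester",
        "suggest.q": prefix,
        "wt": "json",
    }
    resp = requests.get(SOLR_SUGGEST_URL, params=params)
    resp.raise_for_status()
    suggestions = resp.json()["suggest"]["mySuggester"][prefix]["suggestions"]
    return [s["term"] for s in suggestions]

print(autocomplete("ipho"))  # e.g. ["iphone", "iphone case", ...]
```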
Neither is intended for personalization. You might want to look at Learning to Rank to apply certain models based on what you know about the input from the user. You'll have to find out which features you have that describe your users and apply those as external feature information.
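As a rough sketch of how that might look with Solr's LTR module (assuming the ltr contrib is enabled and a model named "userModel" has already been trained and uploaded; the efi.* names are just placeholders for whatever describes your users):

```python
import requests

# Hypothetical setup: the Solr LTR contrib is enabled and a ranking model
# named "userModel" has been uploaded to the model store.
SOLR_SELECT_URL = "http://localhost:8983/solr/mycore/select"

params = {
    "q": "running shoes",
    # Re-rank the top 100 results with the LTR model, passing what we know
    # about the current user as external feature information (efi.*).
    "rq": "{!ltr model=userModel reRankDocs=100 "
          "efi.user_segment=premium efi.user_country=NO}",
    "fl": "id,title,score",
    "wt": "json",
}
resp = requests.get(SOLR_SELECT_URL, params=params)
resp.raise_for_status()
print(resp.json()["response"]["docs"])
```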
There's also a feature for exploring semantic knowledge graphs (i.e. "this concept is positively related to this other concept"), but that's probably beside what you're looking for.
I am trying to move my website from SQL server to Azure Search (or at least the core searching functionality). I believe I understand how to get most of the functionality rebuilt but I'm stuck on one feature that is key to my site.
I would like to be able to sort the search results based on the weight of any of a fairly large number of tags. By weight, I mean that I maintain a count of the number of users that have tagged a document with a particular tag.
It looks like you can do this in Elasticsearch (Elastic search - tagging strength (nested/child document boosting)), but that uses features of Elasticsearch that aren't exposed in Azure Search.
I don't see a way to use scoring profiles (https://msdn.microsoft.com/en-us/library/azure/dn798928.aspx) to do this either.
The only thing I can see that might work in a limited sense is to add a field for each tag that I want to sort on. This might work for my particular case for now, but in the long run I'd like to make this work for user-defined tags.
Is this possible in the broad sense outlined in the Elasticsearch case?
I agree that for right now, the best way to do this would be to have a separate field that is periodically updated with the count of users that have tagged a document. Note that you can be pretty efficient with this update by just posting the numeric value using the merge or mergeOrUpload action. If you would like to see this feature added to Azure Search, it would be great if you could cast your vote.
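Something along these lines should work (a sketch only; the service name, index name, key field "id", numeric field "tagCount" and api-version are placeholders for your own setup):

```python
import requests

# Hypothetical setup: an Azure Search index named "documents" with a key
# field "id" and a numeric field "tagCount" used for sorting or in a
# scoring profile. Replace service name, admin key and api-version.
SERVICE = "my-search-service"
INDEX = "documents"
API_KEY = "<admin-key>"
URL = (f"https://{SERVICE}.search.windows.net/indexes/{INDEX}"
       f"/docs/index?api-version=2020-06-30")

def update_tag_count(doc_id, new_count):
    """Merge only the tag-count field into an existing document."""
    payload = {
        "value": [
            {
                "@search.action": "mergeOrUpload",
                "id": doc_id,
                "tagCount": new_count,
            }
        ]
    }
    resp = requests.post(URL, json=payload, headers={"api-key": API_KEY})
    resp.raise_for_status()
    return resp.json()

update_tag_count("doc-42", 17)
```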
Solr provides an easy way to search documents based on keywords, but I was wondering if it had the ability to return the keywords themselves?
For example, I may want to search for all documents created by Joe Blogs last week and then get a feel for the contents of those documents from the keywords inside them. Or do I have to work out the keywords myself and save them in a field?
Assuming by keywords you mean the tokens that Solr generates when parsing a particular field, you may want to review the documentation and examples for the Term Vector Component.
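As a rough illustration, assuming the TermVectorComponent is wired into a /tvrh request handler (as in the Solr example configs), the relevant field was indexed with termVectors="true", and the core and field names below are placeholders:

```python
import requests

# Hypothetical setup: a core "mycore" whose "content" field was indexed
# with termVectors="true", and a /tvrh handler that includes the
# TermVectorComponent (as in the Solr example configuration).
SOLR_TVRH_URL = "http://localhost:8983/solr/mycore/tvrh"

params = {
    "q": 'author:"Joe Blogs" AND created:[NOW-7DAYS TO NOW]',
    "fl": "id",
    "tv": "true",
    "tv.fl": "content",
    "tv.tf": "true",   # include term frequencies per document
    "wt": "json",
}
resp = requests.get(SOLR_TVRH_URL, params=params)
resp.raise_for_status()
# The per-document terms come back under the "termVectors" section.
print(resp.json()["termVectors"])
```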
Before implementing it, though, just check the Analysis screen of the Solr (4+) Admin UI, as it has a section that shows the terms/tokens a particular field actually generates.
If these are not quite the keywords that you are trying to produce, you may need to have a separate field that generates those keywords, possibly by using UpdateRequestProcessor in the indexing pipeline.
Finally, if you are trying to get a feel for some sort of clustering, you may want to look at Carrot2, which already does this and integrates with Solr.
What you are asking for is known as a "topic model". Solr does not have out-of-the-box support for this, but there are other tools that you can integrate to achieve it.
Apache Mahout supports the LDA algorithm, which can be used to model topics. There are several examples of integrating Solr with Mahout; here is one such example.
Apache UIMA (Unstructured Information Management Architecture). I won't bother typing about it; instead, here is a brilliant presentation.
I have an MVC application which I need to be able to search. The application is modular so it needs to be easy for modules to register data to index with the search module.
At present there's just a quick interim solution in place, which is fine for flexibility, but speed was always going to be a problem. Modules register models (and relationships and columns) which they'd like to be searchable. Upon search, the search functionality queries data using those relationships and applies Levenshtein distance, removes stop words, does character replacements, etc. Clearly this will slow down as the volume of data increases, so it's not viable to keep: it is effectively a select * from x, y, z followed by mining through the data.
The benefit of the above is that there is a direct relation to the model which found the data. For example, if Model_Product finds something, I know that in my code I can use Model_Product::url() to link the result to the relevant location, or Model_Product::find(other data) to show, say, the image or description if the keyword was found in the title.
Another benefit of the above is it's already database specific, and therefore can just be thrown up onto a virtualhost and it works.
I have read about the various options, and they all seem very similar so it's unlikely that people are going to be able to suggest the 'right' one without inciting discussion or debate, but for the record; from the following options, Solr seems to be the one I'm leaning toward. I'm not set in stone so if anyone has any advice they'd like to share or other options I could look at, that'd be great.
Sphinx
Lucene
Solr - appears to just run Lucene as a service?
Xapian
ElasticSearch
Looking through various tutorials and guides they all seem relatively easy to set up and configure. In the case above I can have modules register the path of config files/search index models and have the searcher run them all through search program x. This will build my indexes, and provide the means by which to query data. Fine.
What I don't understand is how any of these indexes relate to my other code. If I index data, search, and in turn find a result with, say, Solr, how do I know how to get all of the other information related to the bit it found?
Also is someone able to confirm whether or not I will need to have an instance of any of the above per virtualhost? This is something which I can't seem to find much information on. I would assume that I can just connect to a single instance and tell it what data is relevant? Much like connecting to a single DBMS server, with credentials x to database y.
Granted I haven't done as extensive reading on this as I would have typically because I'm a bit stuck in terms of direction at the moment and I'd rather not read everything about everything in favour of seeking some advice from those who know before I take a particular route.
Edit: This question seems to have swayed me more towards Solr. There's also a similar thread here with a fair amount of insight into Sphinx.
DISCLAIMER: I can only speak about Lucene/Solr and, I believe, ElasticSearch as I know it is based on Lucene. Others might or might not work in the same way.
If I index data, search and in turn find a result with say Solr, how do I know how to get all of the other information related to the bit it found?
You can store any extra data you want, e.g. a database key pointing to a particular row in the database. Lucene/Solr can also help you find related information, e.g. if you run a DVD rental shop and a user has misspelled a movie name, Lucene will figure this out for you and (unlike a DB) still list the closest alternatives. You can also provide hints by boosting certain fields during indexing or querying. There are special extensions for geospatial search, etc., and obviously you can provide your own if you need to.
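A minimal sketch of that pattern (core name and field names are illustrative): store the database key as an extra field at index time, then ask for it back in the field list when querying:

```python
import requests

# Hypothetical setup: a core "products" whose schema has a stored "db_id"
# field holding the primary key of the corresponding database row.
SOLR = "http://localhost:8983/solr/products"

# Index a document that carries the database key alongside the searchable text.
doc = {"id": "prod-123", "db_id": 123, "title": "Blue widget",
       "description": "A widget, in blue."}
requests.post(f"{SOLR}/update?commit=true", json=[doc]).raise_for_status()

# Search, asking only for the fields needed to join back to the application.
params = {"q": "title:widget~", "fl": "id,db_id,score", "wt": "json"}
hits = requests.get(f"{SOLR}/select", params=params).json()["response"]["docs"]
db_ids = [h["db_id"] for h in hits]  # e.g. feed these to Model_Product::find()
```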
Also is someone able to confirm whether or not I will need to have an instance of any of the above per virtualhost?
Lucene is a low level library and will have to be present in every JVM you run. Solr (built on top of Lucene) is an HTTP server. You can call it from as many clients as you want. More scaling options explained here.
I am working on providing auto-suggest functionality in our search application (which uses Solr) based on terms used in previous successful searches. In the Solr suggest documentation (http://wiki.apache.org/solr/Suggester), I see how to do this using a dictionary file. My question is: Does Solr have any utilities for populating a dictionary file, or do I need to write my own?
I haven't found utilities for doing this in the Solr documentation. But before I started to write my own job to build the dictionary file, I figured it's worth asking this question.
We had a similar requirement for providing auto-suggest based on popular searches.
Solr does not provide an out-of-the-box solution for suggestions based on previous searches; its suggester works from the index or from a dictionary file.
You can use the Solr Suggester and build a new index from the successful searches and base the suggestions on that.
Otherwise you would need to build a custom solution for suggestions weighted by popularity.
What we tried was:
Building an index of the searches and their counts and providing auto-suggest based on it.
We didn't want to maintain the count, or the overhead of maintaining and incrementing it, so we indexed each successful search term again every time it was searched for.
We used the Terms Component, as it provides a terms.prefix option for the auto-suggest feature, and since terms are returned by frequency, that took care of popularity.
As the field used for the terms was only marked indexed and not stored, the index size did not grow considerably even with the same terms being pushed multiple times.
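A minimal sketch of that Terms Component lookup (core and field names are placeholders):

```python
import requests

# Hypothetical setup: every successful search phrase is indexed (not stored)
# into a "search_terms" field of a core called "querylog", once per time it
# was searched, so term frequency reflects popularity.
TERMS_URL = "http://localhost:8983/solr/querylog/terms"

def suggest(prefix, limit=10):
    """Return the most frequent previously-searched terms for a prefix."""
    params = {
        "terms.fl": "search_terms",
        "terms.prefix": prefix,
        "terms.limit": limit,
        "terms.sort": "count",   # most popular first
        "wt": "json",
    }
    resp = requests.get(TERMS_URL, params=params)
    resp.raise_for_status()
    # With the default json.nl setting, terms come back as a flat
    # [term, count, term, count, ...] list.
    flat = resp.json()["terms"]["search_terms"]
    return list(zip(flat[::2], flat[1::2]))

print(suggest("ipho"))  # e.g. [("iphone", 120), ("iphone case", 45)]
```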
I am looking for a project (application) that makes use of an ontology (for an academic course). Everybody is talking about health-care applications; I want to work on a different project. Please, any suggestion could help.
I think that (almost) everything can be represented through an ontology. The idea behind it is to embed semantic meaning into the data you are putting into it.
Take, for example, Swoogle: it's a search engine that looks into several ontologies to retrieve information.
In the same way you can use it for any purpose:
Tourism: travel information, retrieving meaningful suggestions for your clients
Documents: search for topics that are related, not just by the keywords but by the meaning of those keywords
Shopping store
FAQs
etc
The list goes on: if you can use it for a search engine, you can use it everywhere.