Implement k-means clustering in Solr - solr

How can I implement k-means clustering in Solr 6.5?
Requirements:
1) I want to cluster the docs at query time on the basis of their score.
2) I have written my own handler, and I want to add the clustering function to that handler only, such that it does not change the ordering of the docs.
I tried to write the clustering search component as below:
<searchComponent name="clustering" enable="${solr.clustering.enabled:true}" class="solr.clustering.ClusteringComponent">
<lst name="engine">
<str name="name">kmeans</str>
<str name="carrot.algorithm">org.carrot2.clustering.kmeans.BisectingKMeansClusteringAlgorithm</str>
<str name="BisectingKMeansClusteringAlgorithm.clusterCount">4</str>
<str name="documents">100</str>
<str name="BisectingKMeansClusteringAlgorithm.maxIterations">4</str>
</lst>
</searchComponent>
My request handler is as follows:
<requestHandler name="abc" class="solr.SearchHandler">
<lst name="invariants">
<str name="defType">synonym_edismax</str>
<str name="synonyms">true</str>
<str name="indent">on</str>
</lst>
<lst name="appends">
<str name="fq">search_term</str>
</lst>
<lst name="defaults">
<str name="echoParams">none</str>
<str name="wt">json</str>
<str name="timeAllowed">15000</str>
<str name="qf">Field1</str>
<str name="qf">Field2^0.5</str>
<str name="pf">Field3</str>
<float name="tie">0.2</float>
<str name="fl">Field5,Field6</str>
<str name="facet">false</str>
<str name="mm">2&lt;-1 4&lt;70%</str>
<!-- spellcheck -->
<str name="spellcheck.dictionary">default</str>
<str name="spellcheck">on</str>
<str name="spellcheck.extendedResults">true</str>
<str name="spellcheck.count">1</str>
<str name="spellcheck.onlyMorePopular">true</str>
<str name="spellcheck.collate">true</str>
</lst>
<arr name="last-components">
<str>spellcheck</str>
</arr>
</requestHandler>
How can I add clustering to this request handler so that the number of clusters is 4 and the number of iterations is also 4?
Also, what is the difference between
carrot.url
carrot.snippet
carrot.title
I read the definitions in the docs, but I am unable to understand them.

To add the clustering component to a request handler, append it to the handler's last-components list:
<arr name="last-components">
<str>spellcheck</str>
<str>clustering</str>
</arr>
Then map your own fields to the logical document fields used for clustering:
<str name="carrot.url">id</str> -> the unique key of your document
This is the unique identifier for your document.
<str name="carrot.title">doctitle</str> -> the title(s)/label(s) of your document
This is the field, or list of fields, whose values are short and tend to matter most when grouping your documents together.
<str name="carrot.snippet">content</str> -> the content/text/body of your document
From the wiki:
carrot.title
The field (alternatively comma- or space-separated list of fields) that should be mapped to the logical document’s title. The clustering algorithms typically give more weight to the content of the title field compared to the content (snippet). For best results, the field should contain concise, noise-free content. If there is no clear title in your data, you can leave this parameter blank.
carrot.snippet
The field (alternatively comma- or space-separated list of fields) that should be mapped to the logical document’s main content. If this mapping points to very large content fields the performance of clustering may drop significantly. An alternative then is to use query-context snippets for clustering instead of full field content. See the description of the carrot.produceSummary parameter for details.
carrot.url
The field that should be mapped to the logical document’s content URL. Leave blank if not required.
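Putting the pieces above together, a minimal sketch of the asker's handler with clustering switched on might look like the following. It assumes the engine named kmeans from the question's searchComponent (which already sets clusterCount and maxIterations to 4) and reuses the placeholder field names id, doctitle and content from the answer; substitute the fields of your own schema. The other defaults of the original handler are left out for brevity.

```xml
<requestHandler name="abc" class="solr.SearchHandler">
  <lst name="defaults">
    <!-- Enable the clustering component and ask for clusters of the search results -->
    <bool name="clustering">true</bool>
    <bool name="clustering.results">true</bool>
    <!-- Use the engine named "kmeans" in the clustering searchComponent -->
    <str name="clustering.engine">kmeans</str>
    <!-- Logical-to-physical field mapping; these field names are placeholders -->
    <str name="carrot.url">id</str>
    <str name="carrot.title">doctitle</str>
    <str name="carrot.snippet">content</str>
  </lst>
  <arr name="last-components">
    <str>spellcheck</str>
    <str>clustering</str>
  </arr>
</requestHandler>
```

Because the component is listed under last-components, it runs after the query component, so the clusters are reported in addition to the normal result list.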

Related

Getting SolrException : Boosting query defined twice for query

I have created two query documents, named 'makeup' and 'make up', in elevate.xml.
When I execute the elevate Solr query, I get the exception "Boosting query defined twice for query".
However, when I save two documents named 'ChildCare' and 'Child Care', Solr returns the results.
Below is my Solr query:
http://localhost:8983/solr/oneweb-collection/elevate?
q=*:*&defType=edismax&fl=id&fl=title&fl=subtitle&fl=course_code&
fl=cricos_code&fl=course_introduction&fl=outcome&fl=page_url&
fl=score&fl=%5Btafe_elevated%5D&rows=3&wt=json
When I save the document nodes, the system internally replaces the spaces and stores the documents under the same name.
What is the resolution for this issue?
Config for elevator:
<searchComponent name="elevator" class="solr.QueryElevationComponent" >
<str name="queryFieldType">text_general</str>
<str name="config-file">elevate.xml</str>
<str name="forceElevation">true</str>
<str name="exclusive">true</str>
<str name="editorialMarkerFieldName">test_elevated</str>
</searchComponent>
<requestHandler name="/elevate" class="solr.SearchHandler" startup="lazy">
<lst name="defaults">
<str name="echoParams">explicit</str>
<str name="defType">edismax</str>
<int name="rows">3</int>
<str name="fl">id,title,subtitle,course_code,cricos_code,course_introduction,outcome,page_url,[test_elevated],score</str>
<str name="q.alt">*:*</str>
</lst>
<arr name="last-components">
<str>elevator</str>
</arr>
</requestHandler>

Solr Suggester - dynamic or passed at runtime field

Is it possible to use a dynamic field, or to pass the field for suggestions at runtime (in the query, for example), with SuggestComponent?
Depending on the user's language, I would like to suggest different things. I have a dynamic field name_* with concrete fields name_pl, name_de and name_en (there can be more; I want flexibility here), and I would like to search for suggestions depending on the language: for pl I want suggestions from name_pl, for en from name_en, and so on.
So far I have a standard suggester with the field specified:
<searchComponent name="suggest" class="solr.SuggestComponent">
<lst name="suggester">
<str name="name">mySuggester</str>
<str name="lookupImpl">FuzzyLookupFactory</str>
<str name="dictionaryImpl">HighFrequencyDictionaryFactory</str>
<str name="field">name_pl</str>
<str name="suggestAnalyzerFieldType">string</str>
<str name="buildOnStartup">false</str>
</lst>
</searchComponent>
<requestHandler name="/suggest" class="solr.SearchHandler"
startup="lazy" >
<lst name="defaults">
<str name="suggest">true</str>
<str name="suggest.count">10</str>
</lst>
<arr name="components">
<str>suggest</str>
</arr>
</requestHandler>
But I actually need either to use name_* or, preferably, to pass the field name at runtime, for example: http://localhost:8983/solr/services/suggest?suggest=true&suggest.build=true&suggest.dictionary=mySuggester&suggest.q=name&suggest.field=name_pl
How would you implement such a mechanism?
This may not be the answer you expect, but I started writing a comment and ended up with this.
With a dynamic field here you would have to rebuild the suggester at each query. I suggest ;) that you request a specific SuggestComponent dictionary at query time instead.
The value for field should remain static because it is parsed once to build a dictionary index from that field. Otherwise you would have to delete and rebuild that index each time a suggest query requires a dictionary other than the one previously built.
So replicate the suggester definition for each language you have, so that Solr can build one dictionary index per field/language (just name the suggesters according to the target field's language):
<searchComponent name="suggest" class="solr.SuggestComponent">
<lst name="suggester">
<str name="name">suggest_pl</str>
<str name="lookupImpl">FuzzyLookupFactory</str>
<str name="dictionaryImpl">HighFrequencyDictionaryFactory</str>
<str name="field">name_pl</str>
<str name="suggestAnalyzerFieldType">string</str>
<str name="buildOnStartup">false</str>
</lst>
<lst name="suggester">
<str name="name">suggest_en</str>
<str name="lookupImpl">FuzzyLookupFactory</str>
<str name="dictionaryImpl">HighFrequencyDictionaryFactory</str>
<str name="field">name_en</str>
<str name="suggestAnalyzerFieldType">string</str>
<str name="buildOnStartup">false</str>
</lst>
<!-- etc. -->
</searchComponent>
Now you can query the target dictionary dynamically:
.../suggest?suggest=true&suggest.q=name&suggest.dictionary=suggest_pl
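If most requests target one language, the request handler can also carry a fallback dictionary. The fragment below is a sketch based on the asker's /suggest handler; choosing suggest_en as the fallback is only an assumption for illustration. Parameters in defaults apply only when the request does not supply its own value, so a request that passes suggest.dictionary explicitly still selects its own language.

```xml
<requestHandler name="/suggest" class="solr.SearchHandler" startup="lazy">
  <lst name="defaults">
    <str name="suggest">true</str>
    <str name="suggest.count">10</str>
    <!-- Fallback dictionary, used only when the request omits suggest.dictionary -->
    <str name="suggest.dictionary">suggest_en</str>
  </lst>
  <arr name="components">
    <str>suggest</str>
  </arr>
</requestHandler>
```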
There is an easy way to do this, in case you are not aware of it:
Create one dictionary per language (suggester_pl, suggester_en, ...), each using the right field. They are all defined inside a single SuggestComponent.
When calling, select which one to hit with &suggest.dictionary=suggester_en.
check the docs here

Solr suggester returning terms from deleted documents

I have a SolrCloud setup and I'm testing the suggestion component. I have several hundred documents in the index. I did not want some of the documents in the index because they contain gibberish (they were binary files that got improperly converted to text). I've removed them from the index, but the gibberish words from them are still showing up in the suggestions.
My suggest configuration looks like this:
<searchComponent name="suggest" class="solr.SuggestComponent">
<lst name="suggester">
<str name="name">fuzzySuggester</str>
<str name="lookupImpl">FuzzyLookupFactory</str>
<str name="dictionaryImpl">HighFrequencyDictionaryFactory</str>
<str name="storeDir">suggester_fuzzy_dir</str>
<str name="field">dictionary_text</str>
<str name="suggestAnalyzerFieldType">phrase_suggest</str>
<str name="exactMatchFirst">true</str>
<float name="threshold">0.001</float>
<str name="buildOnStartup">false</str>
<str name="buildOnCommit">true</str>
</lst>
</searchComponent>
<requestHandler name="/suggest" class="solr.SearchHandler" startup="lazy">
<lst name="defaults">
<str name="suggest">true</str>
<str name="suggest.dictionary">fuzzySuggester</str>
<str name="suggest.onlyMorePopular">true</str>
<str name="suggest.count">5</str>
<str name="suggest.collate">true</str>
</lst>
<arr name="components">
<str>suggest</str>
</arr>
</requestHandler>
Note that buildOnCommit is set to true. I also tried to remove them using a /suggest query with the suggest.build=true parameter, but that had no effect.
Is there something else required to remove terms from the dictionary?
Despite using expungeDeletes=true in the update, the deleted documents were still hanging around. Optimizing removed them and appears to have removed all the gibberish terms from suggestions.
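For reference, an optimize can be requested with an XML update message. The sketch below assumes the message is posted to the collection's /update handler with an XML content type; the waitSearcher attribute is optional.

```xml
<!-- Merge the index segments, which physically drops deleted documents and their terms -->
<optimize waitSearcher="true"/>
```

If the gibberish terms still show up afterwards, rebuilding the dictionary with suggest.build=true, as tried in the question, is a reasonable next step.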

Removing unwanted items from solr autosuggester

I am trying to implement auto-suggest over a large set of indexed paragraphs, but I want to filter out certain unwanted words that appear in the suggestions. For example, words like "and", "how", "when", etc. need to be avoided. How do I go about it?
This is the configuration I have for autosuggest in solrconfig.xml:
<requestHandler class="org.apache.solr.handler.component.SearchHandler" name="/suggest">
<lst name="defaults">
<str name="spellcheck">true</str>
<str name="spellcheck.dictionary">suggest</str>
<str name="spellcheck.onlyMorePopular">true</str>
<str name="spellcheck.count">5</str>
<str name="spellcheck.collate">true</str>
</lst>
<arr name="components">
<str>suggest</str>
</arr>
</requestHandler>
<searchComponent class="solr.SpellCheckComponent" name="suggest">
<lst name="spellchecker">
<str name="name">suggest</str>
<str name="classname">org.apache.solr.spelling.suggest.Suggester</str>
<str name="lookupImpl">org.apache.solr.spelling.suggest.tst.TSTLookup</str>
<str name="field">keywords</str>
<float name="threshold">0.005</float>
<str name="buildOnCommit">true</str>
</lst>
</searchComponent>
I would recommend adding the StopFilterFactory to the backing fieldType definition for your keywords field in your schema.xml file. If you need those words ("and", "how", "when") in your keywords field for other searching requirements, I would suggest creating a custom field in your schema.xml just for the suggester and you can use the copyField directive to populate this new field.
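A minimal schema.xml sketch of that second approach follows. The names text_suggest and keywords_suggest are hypothetical, and stopwords.txt is assumed to be a stop-word file in the core's conf directory containing words such as "and", "how" and "when".

```xml
<!-- Field type that drops stop words at index time (hypothetical name) -->
<fieldType name="text_suggest" class="solr.TextField" positionIncrementGap="100">
  <analyzer>
    <tokenizer class="solr.StandardTokenizerFactory"/>
    <filter class="solr.LowerCaseFilterFactory"/>
    <filter class="solr.StopFilterFactory" ignoreCase="true" words="stopwords.txt"/>
  </analyzer>
</fieldType>

<!-- Separate field used only by the suggester (hypothetical name) -->
<field name="keywords_suggest" type="text_suggest" indexed="true" stored="false" multiValued="true"/>

<!-- Populate it from the existing keywords field -->
<copyField source="keywords" dest="keywords_suggest"/>
```

The suggester's field setting in solrconfig.xml would then point at keywords_suggest instead of keywords, and the documents need to be reindexed before the new field has any content.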

Solr and spellcheck component: spellcheck.q is not taken into consideration

I use the spellcheck component, and when I query Solr I get results. But if I use spellcheck.q, I get no results.
Does anyone have an idea?
Thanks
<!-- The spell check component can return a list of alternative spelling
suggestions. -->
<searchComponent name="spellcheck" class="solr.SpellCheckComponent">
<str name="queryAnalyzerFieldType">textSpell</str>
<lst name="spellchecker">
<str name="name">default</str>
<str name="field">spellCheck</str>
<str name="spellcheckIndexDir">./spellchecker</str>
<str name="buildOnCommit">true</str>
<str name="accuracy">0.4</str>
<float name="thresholdTokenFrequency">.0004</float>
</lst>
</searchComponent>
<!--<queryConverter name="queryConverter" class="solr.SpellingQueryConverter"/>-->
<!-- Default handler -->
<requestHandler name="default" class="solr.SearchHandler" lazy="true" default="true">
<lst name="defaults">
<str name="spellcheck.onlyMorePopular">false</str>
<str name="spellcheck.extendedResults">false</str>
<str name="spellcheck.count">10</str>
<str name="hl.usePhraseHighLighter">true</str>
<str name="hl.highlightMultiTerm">true</str>
<str name="hl.mergeContiguous">true</str>
</lst>
<arr name="last-components">
<str>highlight</str>
<str>spellcheck</str>
</arr>
</requestHandler>
Have you added your spellcheck component to the corresponding request handler (in the Solr config), set the spellcheck parameter to true (or on), and configured the correct dictionary to use (if its name is different from "default")?
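As a sketch of that checklist applied to the handler from the question (other defaults omitted for brevity), the relevant settings could look like this. Sending spellcheck=true on each request instead of setting it in defaults would work as well.

```xml
<requestHandler name="default" class="solr.SearchHandler" default="true">
  <lst name="defaults">
    <!-- Turn spellchecking on for every request to this handler -->
    <str name="spellcheck">true</str>
    <!-- Must match the name of a spellchecker configured in the spellcheck component -->
    <str name="spellcheck.dictionary">default</str>
    <str name="spellcheck.count">10</str>
  </lst>
  <arr name="last-components">
    <str>spellcheck</str>
  </arr>
</requestHandler>
```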
If you don't use the spellcheck.q parameter, then the default is to use the q parameter (from http://wiki.apache.org/solr/SpellCheckComponent#q_OR_spellcheck.q). From that wiki:
Essentially, if you have a spelling "ready" version in your application, then it is probably better to send spellcheck.q, otherwise, if you just want Solr to do the job, use the q parameter
The reason that it works if you change the definition of the field type is probably due to the new field type being "spelling ready". It would help if you posted the query you are using and the relevant lines in the schema.xml.
